### Objective: Predict the quality of wine on a scale of 1 (bad) to 10 (good)

In [None]:
# Importing the basic libraries
    # For Data Analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
df= pd.read_csv(r"Y:\Data\Projects\Machine Learning Project\notebooks\data\QualityPrediction.csv")

In [None]:
df.head(5)

In [None]:
df.info()

#### Observations-
All the variables are numerical in nature

In [None]:
df.describe()

The mean and median values of the features volatile acidity, density and pH are quite close to each other. This suggests that their distributions are roughly symmetric and possibly close to normal.
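
The mean/median comparison can be made explicit by expressing the gap in units of standard deviation and printing the skewness next to it. The sketch below uses synthetic stand-in data, not the wine dataset; the helper `mean_median_gap` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

def mean_median_gap(frame: pd.DataFrame) -> pd.DataFrame:
    """Return mean, median, skewness and the mean-median gap (in std units) per column."""
    summary = pd.DataFrame({
        "mean": frame.mean(),
        "median": frame.median(),
        "skew": frame.skew(),
    })
    # A gap near 0 is consistent with a symmetric distribution.
    summary["gap_in_std"] = (summary["mean"] - summary["median"]) / frame.std()
    return summary

# Synthetic stand-in data: one symmetric column and one right-skewed column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=3.3, scale=0.15, size=2000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=2000),
})

gap_summary = mean_median_gap(demo)
print(gap_summary.round(3))
```

On the real data, `mean_median_gap(df)` would give the same table for every feature. A small gap is only a hint of symmetry; the histograms remain the more reliable check.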

Checking for duplicate values

- Duplicate observations do not provide any additional information to the model and hence should be dropped
- Further, when the dataset is split into training and test sets, identical observations may end up in both. This leaks training data into the test set and can make the test score look better than it really is, hiding overfitting
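
The second point can be illustrated on a toy frame. This is a minimal sketch with made-up values and a simple positional split, used only to keep the result deterministic; a real workflow would normally use a random split such as scikit-learn's `train_test_split`. The helpers `positional_split` and `count_leaked_rows` are hypothetical.

```python
import pandas as pd

# Made-up toy data: two distinct rows are repeated several times.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4, 9.8, 10.5, 9.4, 11.2, 9.8],
    "pH":      [3.51, 3.20, 3.51, 3.20, 3.30, 3.51, 3.16, 3.20],
    "quality": [5, 5, 5, 5, 6, 5, 7, 5],
})

def positional_split(frame: pd.DataFrame):
    """Split into first half (train) and second half (test)."""
    half = len(frame) // 2
    return frame.iloc[:half], frame.iloc[half:]

def count_leaked_rows(train: pd.DataFrame, test: pd.DataFrame) -> int:
    """Count test rows whose values also occur in the training split."""
    merged = test.merge(train.drop_duplicates(), how="left", indicator=True)
    return int((merged["_merge"] == "both").sum())

# Split with duplicates still present.
train_raw, test_raw = positional_split(toy)
leaked_raw = count_leaked_rows(train_raw, test_raw)

# Split after dropping duplicates first.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
train_clean, test_clean = positional_split(deduped)
leaked_clean = count_leaked_rows(train_clean, test_clean)

print(f"leaked rows with duplicates: {leaked_raw}, after dropping: {leaked_clean}")
```

Dropping duplicates before splitting guarantees that no test row is an exact copy of a training row.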

In [None]:
print(df.duplicated().sum())

There are a total of 240 duplicate entries

In [None]:
df[df.duplicated()]

Removing the duplicate entries

In [None]:
df.drop_duplicates(keep='first', inplace = True)

In [None]:
df.shape

Now there are a total of 1359 observations and 12 columns (11 features and 1 target variable)

In [None]:
df.reset_index(drop=True, inplace=True ) # drop = True to avoid creating a new column

In [None]:
df.head()

In [None]:
df.columns

### Univariate Analysis

#### Plotting histograms and checking for skewness values

In [None]:
plt.subplots(3,4,figsize=(25,18))
for i,feature in enumerate(df.columns):
    if i== len(df.columns)-1: # To avoid plotting the target variable
        pass
    else:
        plt.subplot(3,4,i+1)
        sns.histplot(data=df, x=feature, kde=True)
        plt.title(feature)
        #print(i,feature)
        print(f'skewness of feature "{feature}" is {df[feature].skew()}')

Many of the features look skewed

- density and pH look fairly normally distributed
- The rest are right-skewed
- volatile acidity and citric acid seem bimodal

#### Plotting boxplot to visualize outliers

In [None]:
plt.subplots(3,4,figsize=(25,18))
for i,feature in enumerate(df.columns):
    if i==len(df.columns)-1:
        pass
    else:
        plt.subplot(3,4,i+1)
        sns.boxplot(data=df, x=feature)
        plt.title(feature)


#### Observations-
Almost all the features contain a lot of outliers
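
The points drawn beyond the whiskers of a boxplot can also be counted. The sketch below assumes the usual 1.5 × IQR rule (the default whisker length in seaborn and matplotlib) and runs on made-up numbers; the helper `iqr_outlier_counts` is hypothetical.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, whisker: float = 1.5) -> pd.Series:
    """Count values per column lying outside [Q1 - w*IQR, Q3 + w*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons between a DataFrame and a Series align on column names.
    outside = (frame < q1 - whisker * iqr) | (frame > q3 + whisker * iqr)
    return outside.sum()

# Made-up data: column "a" has one extreme value, column "b" has none.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
})

outlier_counts = iqr_outlier_counts(demo)
print(outlier_counts)
```

Calling `iqr_outlier_counts(df.drop(columns='quality'))` on the real data would give one count per feature, which makes the "before and after transformation" comparison quantitative.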

Log transformation is one of the methods to reduce right skewness (bringing a distribution closer to normal) and to dampen the effect of outliers.

It is especially helpful for models that are sensitive to skewed features and extreme values, such as linear regression.

The skewed features of this dataset are transformed here too (for logistic regression) to handle skewness and outliers.
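
As a small illustration of the idea, the sketch below applies `np.log1p` to a synthetic right-skewed sample (not the wine data) and compares the skewness before and after. `log1p(x)` computes `log(1 + x)`, so it stays defined for features that contain zeros, and `np.expm1` inverts it.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for a skewed feature.
rng = np.random.default_rng(42)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000), name="synthetic_feature")

transformed = np.log1p(skewed)          # log(1 + x), elementwise
skew_before = skewed.skew()
skew_after = transformed.skew()
print(f"skewness before: {skew_before:.3f}, after: {skew_after:.3f}")

# log1p maps 0 to 0, whereas a plain log would give -inf.
zero_result = np.log1p(0.0)

# The transformation is reversible with expm1.
restored = np.expm1(transformed)
```

The transformation compresses large values more than small ones, which is why it pulls in a long right tail.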

In [None]:
transformed_df = np.log1p(df.drop(['quality', 'pH', 'density'], axis=1)) # The transformation is only needed for skewed numerical features.
                                                                # Thus excluding the target variable "quality" (the only categorical variable) and the roughly symmetric "pH" and "density"

In [None]:
transformed_df = pd.concat([transformed_df,df[['pH', 'density','quality']]],axis=1)

In [None]:
transformed_df

In [None]:
transformed_df.columns

#### Plotting histograms and checking for skewness values for this transformed data


In [None]:
plt.subplots(3,4,figsize=(25,18))
for i,feature in enumerate(transformed_df.columns):
    plt.subplot(3,4,i+1)
    sns.histplot(data=transformed_df, x=feature, kde=True)
    plt.title(feature)
    #print(i,feature)
    print(f'skewness of feature "{feature}" is {transformed_df[feature].skew()}')


#### Observations-
- The distributions of the features are much closer to normal than in the original data
- This can also be verified by the skewness values of the features, which have reduced significantly. For example:
    - skewness of feature "residual sugar" has reduced from 4.548153403940447 to 1.7652376788280852
    - skewness of feature "chlorides" has reduced from 5.502487294623722 to 1.8876423282330907

#### Plotting box plot for the transformed data-

In [None]:
plt.subplots(3,4,figsize=(25,18))
for i,feature in enumerate(transformed_df.columns):
    plt.subplot(3,4,i+1)
    sns.boxplot(data=transformed_df, x=feature)
    plt.title(feature)

#### Observations-
- The number of outliers has reduced too (a few features still contain a fair number though)

### Bivariate Analysis


Checking for collinearity

- Winemakers use pH as a way to measure ripeness in relation to acidity.
- TA, or "total acidity", is another way of looking at similar things, this time measuring the amount of acid by volume. (The total acidity of a wine is measured assuming all the acid is tartaric.)
- How do they relate? The higher the pH, the lower the acidity. Thus a negative correlation between the "pH" and "fixed acidity" features is likely.
- Total sulfur dioxide is the sum of free and bound sulfur dioxide.
- Thus the features "total sulfur dioxide" and "free sulfur dioxide" are also likely to be correlated.


In [None]:
plt.subplots(1,2,figsize=(25,6))
plt.subplot(121)
plt.title('pH v/s fixed acidity')
sns.scatterplot(data=df, x='fixed acidity', y='pH')
sns.regplot(data=df, x='fixed acidity', y='pH',color='orange',line_kws={'color':'green'})
plt.subplot(122)
plt.title('total sulfur dioxide v/s free sulfur dioxide')
#sns.scatterplot(data=df, x='total sulfur dioxide', y='free sulfur dioxide')
sns.regplot(data=df, x='total sulfur dioxide', y='free sulfur dioxide',color='orange',line_kws={'color':'green'})
plt.show()

#### Observations-
- As the plots show, a roughly linear relationship exists between "pH" and "fixed acidity", and between "free sulfur dioxide" and "total sulfur dioxide"

Checking the correlation matrix to verify

In [None]:
#sns.heatmap(df.corr(), annot=True,cmap='viridis')
df.corr()

In [None]:
transformed_df.corr()

#### Observations
- "fixed acidity" has a moderate correlation with several features: "citric acid", "density" and "pH"
- "free sulfur dioxide" and "total sulfur dioxide" also have a moderate correlation
- These correlations become stronger for the transformed data (except between "citric acid" and "fixed acidity")
- We can check the VIF values too (after scaling)
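
A VIF check could look like the sketch below. It relies on the identity that the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix, so only NumPy and pandas are needed. The data is synthetic and the helper `vif_from_correlation` is hypothetical. Because correlations do not depend on scale, this formulation gives the same values with or without scaling.

```python
import numpy as np
import pandas as pd

def vif_from_correlation(features: pd.DataFrame) -> pd.Series:
    """Variance inflation factors as the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="VIF")

# Synthetic data: "total" is built from "free" plus noise, so the two are collinear.
rng = np.random.default_rng(1)
n = 1000
free = rng.normal(size=n)
bound = rng.normal(size=n)
demo = pd.DataFrame({
    "free": free,
    "total": free + 0.5 * bound,
    "independent": rng.normal(size=n),
})

vif = vif_from_correlation(demo)
print(vif.round(2))
```

On the real data the call would be `vif_from_correlation(df.drop(columns='quality'))`. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.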

Dropping the features "fixed acidity" and "free sulfur dioxide" owing to multicollinearity

In [None]:
df.drop(['fixed acidity', 'free sulfur dioxide'],axis=1,inplace=True)

In [None]:
df

In [None]:
transformed_df.drop(['fixed acidity', 'free sulfur dioxide'],axis=1,inplace=True)
transformed_df

Plotting each feature against the target variable

In [None]:
plt.subplots(3,4,figsize=(25,18))
for i,feature in enumerate(df.columns.drop('quality')): # skip the target variable, which is already on the x-axis
    plt.subplot(3,4,i+1)
    sns.barplot(data=df, x='quality',y=feature)
    plt.title(feature)
    #print(i,feature)


#### Observations
- Only 6 categories of the target variable are present in the given dataset. The mean density is nearly identical across all 6 of them, and the mean pH is almost equal too

In [None]:
df.quality.unique()

- There are a total of 10 categories in the target variable, but the given dataset contains only 6 of them
- A classifier such as Random Forest can only predict the categories it has seen during training, so the 4 missing categories cannot be predicted from this dataset alone
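
The gap between the nominal scale and the observed categories can be listed explicitly. The sketch below uses a short made-up `quality` series rather than the real column, and assumes the 1-10 scale stated in the objective.

```python
import pandas as pd

# Made-up sample of quality scores standing in for df['quality'].
quality = pd.Series([5, 5, 6, 7, 3, 4, 8, 6, 5, 6], name="quality")

# Count every score on the nominal 1-10 scale, filling absent scores with 0.
full_scale = list(range(1, 11))
counts = quality.value_counts().reindex(full_scale, fill_value=0)

missing_scores = [score for score, n in counts.items() if n == 0]
print(counts)
print("scores with no observations:", missing_scores)
```

Applied to `df['quality']`, the same counts also show how unevenly the observed categories are represented, which matters when choosing evaluation metrics.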

#### Conclusions
- The given dataset contains a fair number of outliers, which have been reduced via log transformation.
- While wine quality accepts values from 1-10, the dataset contains only 6 such values (3-8).
- "fixed acidity" has a moderate correlation with several features: "citric acid", "density" and "pH"
- "free sulfur dioxide" and "total sulfur dioxide" also have a moderate correlation
- These correlations become stronger for the transformed data (except between "citric acid" and "fixed acidity")
- The features "fixed acidity" and "free sulfur dioxide" are dropped owing to multicollinearity