### This notebook uses the [Wine Quality Dataset](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009), which is taken from the UCI Machine Learning Repository and can be found [here](https://archive.ics.uci.edu/ml/datasets/wine+quality). The aim is to explore the general relationships among the properties that determine a wine's quality rating.

The dataset contains the following properties:

1. **Fixed acidity** - Most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2. **Volatile acidity** - The amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste

3. **Citric acid** - Found in small quantities, citric acid can add 'freshness' and flavor to wines

4. **Residual sugar** - The amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5. **Chlorides** - The amount of salt in the wine

6. **Free sulfur dioxide** - The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7. **Total sulfur dioxide** - Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8. **Density** - The density of wine is close to that of water, depending on the percent alcohol and sugar content

9. **pH** - Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10. **Sulphates** - A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11. **Alcohol** - The percent alcohol content of the wine

The output variable is the quality, with a value between 0 and 10

### Importing the required libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

### Importing the dataset and looking at basic information by summarizing it

In [None]:
df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

There are no null values and all columns are numeric (float64 for the eleven properties, int64 for quality). No categorical data to deal with!
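
The same check can be done programmatically instead of reading it off `df.info()`. The sketch below uses a small made-up frame (`toy`, not the wine data) with one deliberately missing value; the column names are borrowed from the dataset purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wine data (values are made up for illustration)
toy = pd.DataFrame({
    'alcohol': [9.4, 9.8, np.nan, 10.5],
    'pH': [3.51, 3.20, 3.26, 3.16],
    'quality': [5, 5, 6, 7],
})

# Number of missing values per column
null_counts = toy.isnull().sum()
print(null_counts)

# Columns whose dtype is not numeric (an empty list means nothing categorical)
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```

Running the same two calls on `df` should report zero missing values in every column and an empty list of non-numeric columns, if the observation above holds.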

In [None]:
df.describe()

### Exploratory analysis through visualisations

In [None]:
def draw_histograms(df, variables, n_rows, n_cols):
    fig=plt.figure(figsize=(12,10))
    for i, var_name in enumerate(variables):
        ax=fig.add_subplot(n_rows,n_cols,i+1)
        ax.hist(df[var_name],edgecolor='black')
        ax.set_title(var_name.upper())
    fig.tight_layout()
    plt.show()

draw_histograms(df, df.columns, 4, 3)

In [None]:
plt.figure(figsize=(5,5))
sns.countplot(x='quality',data=df, color='blue')
plt.xlabel('QUALITY')
plt.ylabel('COUNT')
plt.show()

It is clear that the classes are imbalanced: ratings of 5 and 6 are far more common than all the other ratings combined.
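
The imbalance can also be expressed as a number rather than read off the bar chart. A minimal sketch, assuming a hypothetical `ratings` series whose values are invented only to mimic an imbalanced target:

```python
import pandas as pd

# Hypothetical ratings, chosen only to mimic an imbalanced target
ratings = pd.Series([5, 5, 5, 6, 6, 6, 5, 6, 7, 4], name='quality')

# Share of each rating, sorted by rating value
shares = ratings.value_counts(normalize=True).sort_index()
print(shares)

# Combined share of the two dominant classes
share_5_or_6 = shares.loc[[5, 6]].sum()
print(f'Share of ratings that are 5 or 6: {share_5_or_6:.0%}')
```

Replacing `ratings` with `df['quality']` would give the corresponding shares for the wine data.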

In [None]:
plt.figure(figsize=(12,12))

for i,var_name in enumerate(list(df.columns)):
    plt.subplot(4,3,i+1)
    # fill=True replaces the deprecated shade=True (seaborn >= 0.11)
    sns.kdeplot(data=df[var_name], fill=True)
    plt.title(var_name.upper())
    plt.xlabel(None)
    plt.ylabel(None)
    
plt.tight_layout()

In [None]:
df.skew()[:'alcohol'].plot(kind='bar')
plt.show()

Residual sugar and chlorides are the most positively skewed features.
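
For reference, here is a sketch of what `skew()` computes: the bias-adjusted sample skewness, built from the second and third central moments. The sample `values` is made up, with one large value to create a long right tail.

```python
import numpy as np
import pandas as pd

# Made-up sample with a long right tail
values = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 10.0])
n = len(values)

# Biased central moments
dev = values - values.mean()
m2 = (dev ** 2).mean()
m3 = (dev ** 3).mean()

# Bias-adjusted sample skewness, the form pandas uses for Series.skew()
skew_manual = np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

print(skew_manual, values.skew())
```

A positive value means the tail extends to the right, which is what the bar chart shows for most of the features.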

In [None]:
plt.figure(figsize=(15,10))

for i,var_name in enumerate(list(df.columns)):
    plt.subplot(4,3,i+1)
    sns.boxplot(x=var_name, data=df)
    plt.title(var_name.upper())
    plt.xlabel(None)
    plt.ylabel(None)
    
plt.tight_layout()
plt.show()
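
The individual points drawn beyond the whiskers in the box plots follow the usual 1.5 × IQR rule, which is seaborn's default (`whis=1.5`). A minimal sketch of that rule on made-up measurements (`x` is hypothetical, not a column of the wine data):

```python
import pandas as pd

# Made-up measurements with one obvious high value
x = pd.Series([1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 15.5])

# Interquartile range and the whisker limits derived from it
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the limits are the ones drawn as separate points
outliers = x[(x < lower) | (x > upper)]
print(outliers.tolist())
```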

In [None]:
plt.figure(figsize=(12,12))

for i,var_name in enumerate(list(df.columns)):
    plt.subplot(4,3,i+1)
    sns.regplot(x=df[var_name], y=df['quality'])
    plt.title(var_name.upper())
    plt.xlabel(None)
    plt.ylabel(None)
    
plt.tight_layout()

In [None]:
df.corr()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(data=df.corr(),annot=True,mask=np.triu(np.ones_like(df.corr())))
plt.show()
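
A note on the `mask` used in the heatmap: seaborn hides every cell where the mask is truthy, so an upper-triangular mask leaves only the lower triangle visible. The sketch below, on a made-up 3×3 correlation matrix, shows what the mask looks like; the `k=1` variant is an optional alternative (not used above) that keeps the diagonal visible.

```python
import numpy as np

# Made-up symmetric correlation matrix
corr = np.array([[1.0, 0.2, -0.4],
                 [0.2, 1.0, 0.1],
                 [-0.4, 0.1, 1.0]])

# Same construction as in the heatmap cell: upper triangle, diagonal included
mask = np.triu(np.ones_like(corr))
print(mask)

# Alternative: start one step above the diagonal so the diagonal stays visible
mask_keep_diag = np.triu(np.ones_like(corr, dtype=bool), k=1)
print(mask_keep_diag)
```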

In [None]:
plt.figure(figsize=(12,8))
df.corr()['quality'][0:-1].plot(kind='bar')
plt.axhline(y=0,color='black')
plt.yticks([-0.5,-0.4,-0.3,-0.2,-0.1,0,0.1,0.2,0.3,0.4,0.5])
plt.show()

Two features stand out: **alcohol**, with a moderately strong positive correlation, and **volatile acidity**, with a moderately strong negative correlation. Both trends can also be seen in the scatter plots with regression lines.

**Volatile acidity** having a relatively high negative correlation makes sense since too much of it causes an unpleasant taste, as is clear from the description. Unpleasant taste means lower quality, hence the negative correlation.

Higher **alcohol** content is often associated with a richer texture, which may explain the higher quality ratings and the relatively high positive correlation.
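
As a reminder of what these correlation values measure, here is a sketch that computes the Pearson coefficient by hand and compares it with pandas. The frame `toy` is made up; its column names only echo the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values; the column names only echo the dataset
toy = pd.DataFrame({
    'alcohol': [9.0, 9.5, 10.0, 11.0, 12.5],
    'quality': [5, 5, 6, 6, 7],
})

# Deviations from the mean
x = toy['alcohol'] - toy['alcohol'].mean()
y = toy['quality'] - toy['quality'].mean()

# Pearson r = covariance / (std_x * std_y); the 1/(n-1) factors cancel
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = toy['alcohol'].corr(toy['quality'])
print(r_manual, r_pandas)
```

The coefficient only captures linear association, so it says nothing on its own about whether one property causes the other.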

In [None]:
r_squared_list = list(df.corr()['quality'][:'alcohol']**2*100)

from scipy import stats
pvalue_list=[]
for i in list(df.columns)[:-1]:
    temp = stats.linregress(df[i],df['quality'])
    pvalue_list.append(temp.pvalue)

rp_df=pd.DataFrame(data={'R squared':r_squared_list,'p value':pvalue_list})
rp_df.index = list(df.columns)[:-1]
rp_df

Taken on its own, **volatile acidity** explains only about **15.25%** of the variation in quality, and **alcohol** only about **22.67%**. Both have p-values low enough to be statistically significant.
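
Squaring the correlation gives R squared here because each value comes from a regression with a single predictor, where the two quantities coincide. A small sketch on made-up data (`x` and `y` are hypothetical) that checks this numerically:

```python
import numpy as np

# Made-up data for a single predictor x and a response y
x = np.array([0.2, 0.4, 0.5, 0.7, 0.9, 1.1])
y = np.array([7.0, 6.0, 6.0, 5.0, 5.0, 4.0])

# Least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# R^2 = 1 - SS_res / SS_tot
r_squared = 1 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Squared Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)
```

Because the features are correlated with each other, these single-feature percentages cannot simply be added up to get the variation explained by several features together.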

### Next, using regression models to predict the quality rating given the other factors

First, defining a function that returns the mean absolute error between predictions and actual values, to compare the performance of different models.

In [None]:
from sklearn.metrics import mean_absolute_error

def get_mae(actual,predicted):
    return mean_absolute_error(actual,predicted)
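
For intuition, the mean absolute error is just the average of the absolute differences between actual and predicted values. A sketch with made-up numbers:

```python
import numpy as np

actual = np.array([5, 6, 5, 7])             # made-up quality ratings
predicted = np.array([5.5, 5.0, 5.0, 6.5])  # made-up model outputs

# MAE = mean of |actual - predicted|
mae = np.mean(np.abs(actual - predicted))
print(mae)  # 0.5
```

An MAE of 0.5 would mean the predictions are off by half a quality point on average.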

Next, splitting the dataframe into training and validation sets

In [None]:
from sklearn.model_selection import train_test_split

y = df['quality']
X=df.drop('quality',axis=1)

X_train,X_valid,y_train,y_valid = train_test_split(X,y,random_state=42)
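
Since no `test_size` is passed, `train_test_split` falls back to its default and holds out 25% of the rows for validation. A quick sketch on 100 made-up rows (`X_toy` and `y_toy` are hypothetical) to confirm the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows with two features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# No test_size given, so scikit-learn holds out 25% by default
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, random_state=42)
print(len(X_tr), len(X_va))  # 75 25
```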

### First up, linear regression

In [None]:
from sklearn.linear_model import LinearRegression

model_1 = LinearRegression()
model_1.fit(X_train,y_train)
print('MAE from Linear Regression is',get_mae(y_valid,model_1.predict(X_valid)))

### Next, Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

mae_dict = {}

for i in list(range(100,1001,50)):
    model_2 = RandomForestRegressor(n_estimators=i,random_state=42)
    model_2.fit(X_train,y_train)
    mae_dict[i] = get_mae(y_valid,model_2.predict(X_valid))
    print('MAE for',i,'estimators is',mae_dict[i])

print()
print()

# Key (number of trees) with the smallest MAE
best_trees = min(mae_dict, key=mae_dict.get)
print('MAE is minimum for',best_trees,'trees and is',mae_dict[best_trees])

Plotting the mean absolute errors

In [None]:
temp = pd.DataFrame(list(mae_dict.values()))
temp.index = list(mae_dict.keys())
temp.plot(legend=None,figsize=(12,5))
plt.show()

These deviations are very small, and a perfect example of how important scale is in drawing conclusions. Here is how the graph looks when the y-axis is limited to the range 0 to 1.

In [None]:
temp.plot(legend=None,figsize=(12,5))
plt.ylim(0,1)
plt.show()

### Third, extreme gradient boosting

In [None]:
from xgboost import XGBRegressor

mae_dict = {}

for i in list(range(100,1651,50)):
    model_3 = XGBRegressor(n_estimators=i,learning_rate=0.05)
    model_3.fit(X_train, y_train)
    mae_dict[i] = get_mae(y_valid,model_3.predict(X_valid))
    print('MAE for',i,'estimators is',mae_dict[i])

print()
print()

# Key (number of estimators) with the smallest MAE
best_estms = min(mae_dict, key=mae_dict.get)
print('MAE is minimum for',best_estms,'estimators and is',mae_dict[best_estms])

After 1500 estimators, the MAE does not improve anymore.

Again plotting the mean absolute errors

In [None]:
temp = pd.DataFrame(list(mae_dict.values()))
temp.index = list(mae_dict.keys())
temp.plot(legend=None,figsize=(12,5))
plt.show()

Once again seeing it on a 0 to 1 scale

In [None]:
temp.plot(legend=None,figsize=(12,5))
plt.ylim(0,1)
plt.show()

### Since the dataset isn't very large, using cross validation is a better way to evaluate models
### Using the number of estimators that gave the least mean absolute error earlier 

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

pipeline_1 = Pipeline(steps=[('model', RandomForestRegressor(n_estimators=250, random_state=42))])

pipeline_2 = Pipeline(steps=[('model', XGBRegressor(n_estimators=1500,learning_rate=0.05))])

scores_1 = -1 * cross_val_score(pipeline_1, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

scores_2 = -1 * cross_val_score(pipeline_2, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("Average MAE scores from Random Forest Regressor", scores_1.mean())
print("Average MAE scores from XGB Regressor", scores_2.mean())
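
To make clear what `cross_val_score` does with `cv=5`, the sketch below repeats the evaluation by hand with `KFold` and compares the results. It uses synthetic data and `LinearRegression` only to keep it fast; `X_toy`, `y_toy` and the coefficients are made up. For a regressor, an integer `cv` corresponds to an unshuffled `KFold`, so the two sets of scores should match.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, not the wine data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Manual 5-fold loop: fit on four folds, score on the held-out fold
manual_scores = []
for train_idx, valid_idx in KFold(n_splits=5).split(X_toy):
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    preds = model.predict(X_toy[valid_idx])
    manual_scores.append(mean_absolute_error(y_toy[valid_idx], preds))

# The same evaluation through cross_val_score
cv_scores = -1 * cross_val_score(LinearRegression(), X_toy, y_toy, cv=5,
                                 scoring='neg_mean_absolute_error')

print(np.mean(manual_scores), cv_scores.mean(), cv_scores.std())
```

Printing the standard deviation of the fold scores alongside the mean helps judge whether a gap between two models is larger than the fold-to-fold variation.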

#### On the initial train/validation split, XGBRegressor does better, but under cross validation, Random Forest Regressor performs better overall.

### That's all for now folks!