# Red Wine quality prediction
## Structure
### 1. Data Ingestion
### 2. Data Validation
### 3. Data Preprocessing
### 4. Model Training
### 5. Model Validation
### 6. Iterations
### 7. Conclusion

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### 1. Data Ingestion

In [None]:
df = pd.read_csv("/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv") 

df.head()

### 2. Data Validation

#### 2.1 Overview 

In [None]:
df.info()

A quick overview shows that the ingested data contains 11 features that lead to a classification of the red wine quality (the 12th column). We have 1599 instances on which we can train and test the model. 

#### 2.2 Missing data detection

In [None]:
df.isnull().sum()

The null check shows that the data contains no null values, so no further missing-data treatment is required.

### 3. Data Preprocessing

#### 3.1 Data Visualization


In [None]:
#Visualization
from matplotlib import pyplot as plt
import seaborn as sns

print(df.quality.value_counts()) #shows that classifications as quality 3 and 8 might be unclassifiable since only a few samples for training exist
df.quality.value_counts().plot(kind='bar')

Counting the different classifications shows an imbalance in the data: wines classified as 3, 4 and 8 are underrepresented compared to wines with qualities 5, 6 and 7. 
From the information about the data on Kaggle, we know that the rating ranges from 0 to 10. Quality values 0, 1, 2, 9 and 10 do not appear in the data set, which means that whichever algorithm we choose will never assign these values to a newly introduced wine. Whether and how we treat the imbalanced data set will be decided after a first iteration, once the model evaluation shows how the imbalance influences the predictions. 

#### 3.2 Data distribution

In [None]:
#Boxplots for every feature and the according quality values from 3 to 8

features = df.drop(columns=['quality'])
    
f, ax = plt.subplots(4, 3, figsize=(22, 16))
for i, var in enumerate(features):
    sns.boxplot(x='quality', y=var, data=df, ax=ax.flatten()[i])

As a first visualization, the boxplot is a handy way to look at the distribution of the data. Since we know that the data lacks instances with qualities 3, 4 and 8, we look at the distribution of each feature per quality (3 to 8) so that we do not mislabel data points as outliers merely because they belong to an underrepresented class. 

Secondly, we get a first idea of which properties might contribute to outstanding quality (8), which needs to be confirmed later:

- relatively small volatile acidity
- high amount of citric acid
- low density
- high amount of sulphates
- tend to have a higher alcohol content

For the red wines classified with poor quality it is the other way around.

In [None]:
#Boxplot values for different features broken down to the different quality classes
for feature in features:
    print(feature)

    for i in range(3,9):     
        df_i = df.loc[df['quality'] == i]

        features_perqual = df_i.drop(columns=['quality'])
      
  
        upper_quartile = np.percentile(features_perqual[feature], 75)
        lower_quartile = np.percentile(features_perqual[feature], 25)
        #print(upper_quartile)
        
        iqr = upper_quartile - lower_quartile
        upper_whisker = features_perqual[feature][features_perqual[feature]<=upper_quartile+1.5*iqr].max()
        lower_whisker = features_perqual[feature][features_perqual[feature]>=lower_quartile-1.5*iqr].min()
        #print('Quality:', i)
        print('Quality:', i, 'Upper whisker/Lower whisker:', upper_whisker,'/', lower_whisker)

In [None]:
#Boxplot values for all different features across all different quality classes
for feature in features:
    
    upper_quartile = np.percentile(features[feature], 75)
    lower_quartile = np.percentile(features[feature], 25)
        #print(upper_quartile)
        
    iqr = upper_quartile - lower_quartile
    upper_whisker = features[feature][features[feature]<=upper_quartile+1.5*iqr].max()
    lower_whisker = features[feature][features[feature]>=lower_quartile-1.5*iqr].min()
    
    print(feature)
    print('Upper whisker/Lower whisker:', upper_whisker,'/', lower_whisker)

Comparing the overall whiskers per feature with the whiskers per feature and quality shows no systematic pattern: the overall whisker ranges lie more or less arbitrarily within the per-quality ranges. 
Since we have multiple features, treating outliers by simply deleting data points that fall outside the whiskers of a single feature seems wrong, because it would imply that the features do not interact with each other. That is an assumption we cannot make.
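
To make the concern about a purely univariate rule more tangible, here is a minimal sketch that counts how many rows would be dropped if every feature were checked against its own 1.5×IQR whiskers. The helper name `rows_outside_whiskers` and the synthetic, independent Gaussian data are assumptions for illustration; they are not part of the notebook and not the wine data.

```python
import numpy as np
import pandas as pd


def rows_outside_whiskers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Flag rows where at least one column lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # The comparison aligns the per-column bounds with the frame's columns
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.any(axis=1)


# Synthetic stand-in with 11 independent features (assumption, not the wine data)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 11)),
                    columns=[f"feature_{i}" for i in range(11)])

flagged = rows_outside_whiskers(demo)
print(f"Rows flagged by at least one feature: {flagged.mean():.1%}")
```

Because each feature is judged in isolation, the share of flagged rows grows with the number of features, which is one reason to prefer a multivariate method.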

In [None]:
#Create the pairplot

sns.set_palette("bright")
pp = sns.pairplot(df, hue='quality', height=1.8, aspect=1.8, diag_kind = 'kde', palette="bright")
    
fig = pp.fig 
fig.subplots_adjust(top=0.93, wspace=0.3)
t = fig.suptitle('Wine Attributes Pairwise Plots', fontsize=14)

From the pairplot we can see that there are few outliers relative to the number of instances: the point clouds are dense for the majority of features, and only a few points lie outside them. 
Additionally, the quality does not seem to affect whether a point lies within the cloud or outside it. All features seem to follow a more or less Gaussian distribution. These findings will help us choose an appropriate outlier detection technique later on.

#### 3.3 Correlations

In [None]:
#Compute the Pearson correlation to identify linear correlations between the features
lin_corr = features.corr()

f, ax = plt.subplots(figsize=(10, 5))
sns.heatmap(lin_corr, annot=True, linewidths=.5, ax = ax)

The heatmap gives us a first impression of which features could be strongly correlated. It shows an increased linear correlation for the following pairs:
1. "free sulfur dioxide" and "total sulfur dioxide"
2. "fixed acidity" and "citric acid" 

We consider features highly correlated if the correlation is > 0.7 (see https://www.andrews.edu/~calkins/math/edrm611/edrm05.htm).
Looking at the specific linear correlation values, none of the identified pairs has a correlation above 0.7. 
To check for relationships that the Pearson correlation misses, we also compute the Spearman correlation, which captures monotonic relationships even when they are non-linear. 

In [None]:
#Checking the Spearman correlation whether there are any correlations

spear_corr = features.corr(method='spearman')

f, ax = plt.subplots(figsize=(10, 5))
sns.heatmap(spear_corr, annot=True, linewidths=.5, ax = ax)

As we can see, free and total sulfur dioxide have a Spearman correlation of 0.79, which points to a monotonic relationship that is not purely linear. 

Nevertheless, this dataset contains only 11 features. If the number of features were higher, we could consider eliminating some of the correlated ones to reduce computational cost. With only 11 features in total, a correlation below 0.95 is not enough to consider elimination here. 
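
Reading pairs off a heatmap by eye gets tedious, so the threshold check can also be scripted. The following is a sketch under assumptions: the helper `correlated_pairs` is hypothetical, and the three-column demo frame is synthetic, built so that `y` is a monotonic but strongly non-linear function of `x`.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.7, method="pearson"):
    """Return (col_a, col_b, corr) for pairs whose absolute correlation exceeds threshold."""
    corr = frame.corr(method=method)
    pairs = [
        (a, b, corr.loc[a, b])
        for a, b in combinations(corr.columns, 2)
        if abs(corr.loc[a, b]) > threshold
    ]
    # Strongest relationships first
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Synthetic demo data (assumption): y is monotonic in x but far from linear
x = np.linspace(0, 1, 200)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": x, "y": x ** 20, "z": rng.normal(size=200)})

print("Pearson :", correlated_pairs(demo, method="pearson"))
print("Spearman:", correlated_pairs(demo, method="spearman"))
```

In this toy setup only the Spearman call reports the `x`/`y` pair, which mirrors the reason for checking both coefficients.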

#### 3.4 Outlier treatment

For multivariate outlier detection we choose Isolation Forest. To learn more about **Isolation Forest**, read https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e.
Above all, we use Isolation Forest because we do not know the exact distribution of the data and because the feature space is multidimensional. 
On top of that, Isolation Forests are easy to tune and fairly robust. We do not use DBSCAN instead as a matter of effort: DBSCAN requires preprocessing steps, whereas Isolation Forest does not. 

In [None]:
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV

#all 11 feature columns (everything except 'quality') must be modelled
features=df.columns[:11]

#hyperparameters to be decided
clf=IsolationForest(n_estimators=50, max_samples='auto', contamination = 0.05, \
                        max_features=1.0, bootstrap=False, n_jobs=-1, verbose=0)
clf.fit(df[features])

pred = clf.predict(df[features])
df['anomaly']=pred
outliers=df.loc[df['anomaly']==-1]
outlier_index=list(outliers.index)

#Find the number of anomalies and normal points here points classified -1 are anomalous
print(df['anomaly'].value_counts())

For the parameters, we use trial and error, checking the pairwise plot each time we change them. Since we have no ground-truth labels for outliers, we have to rely on viewing the pairwise plot and judging whether the anomaly detection has coloured the outlying points in blue. With this method we need to weigh whether it is worth raising the **contamination rate** to catch more outliers at the price of false positives, i.e. inliers marked as outliers. We tried different contamination rates and found that the predictions got worse when fewer than 5% of the instances were excluded. Another way to overcome the problem of not knowing the contamination rate is an **Extended Isolation Forest**. To learn more about Extended Isolation Forest, read https://towardsdatascience.com/outlier-detection-with-extended-isolation-forest-1e248a3fe97b
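
The trial-and-error loop over contamination rates can be sketched as follows. This is an illustration on synthetic Gaussian data (an assumption, not the wine data), and the chosen rates and `random_state` are arbitrary. It mainly shows that the contamination rate fixes the share of points that get flagged, so it has to come from judgment rather than from the data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the feature matrix (assumption, not the wine data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))

counts = {}
for contamination in (0.01, 0.05, 0.10):
    forest = IsolationForest(n_estimators=50, contamination=contamination,
                             random_state=0)
    labels = forest.fit_predict(X_demo)  # -1 marks anomalies, 1 marks inliers
    counts[contamination] = int((labels == -1).sum())
    print(f"contamination={contamination:.2f}: {counts[contamination]} points flagged")
```

Each run flags roughly `contamination * n_samples` points, regardless of whether the data actually contains that many anomalies.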

In [None]:
#Pairplot marking the outliers
pp = sns.pairplot(df, hue='anomaly', height=1.8, aspect=1.8, diag_kind = 'kde', palette="bright")
           
fig = pp.fig 
fig.subplots_adjust(top=0.93, wspace=0.3)
t = fig.suptitle('Wine Attributes Pairwise Plots', fontsize=14)

Comparing the outliers identified by Isolation Forest with the boxplots shows that the feature values of the anomalous instances do not always lie in the first or fourth quartile of the boxplots. 
This shows that eliminating outliers only by looking at the boxplots could lead to wrong conclusions. 

We chose to remove the outliers from the dataset. This proved sufficient here, but it can be seen as cracking a nut with a sledgehammer. An alternative treatment would be to mark the outliers as missing values and impute them using the **MissForest** algorithm. 

In [None]:
#Remove all identified outliers from the dataset 

df_preprocessed = df[df.anomaly != -1]

df_preprocessed = df_preprocessed.drop(columns=['anomaly'])

print(df_preprocessed.quality.value_counts())

As a result of the preprocessing steps above we obtain a dataframe that 
- contains fewer outliers 
- has been checked for drastic correlations across the features, with none found

Although it is very unlikely that we will get good predictions for the qualities 3, 4 and 8, or any predictions at all for the qualities 0, 1, 2, 9 and 10, which do not occur in the data, the data is now in a suitable form for training.


#### 3.6 Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

#X Dataset, y dataset

X = df_preprocessed.drop(columns=['quality'])
y = df_preprocessed['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### 3.7 Standardization

The last preprocessing step is a standardization of the X data. Why standardization can improve your model is explained here: https://towardsdatascience.com/how-and-why-to-standardize-your-data-996926c2c832. In our case the need is obvious when you look at the value ranges of the different physicochemical measurements: "total sulfur dioxide" varies from 0 to 300, whereas chlorides only vary from 0 to 0.6. This could lead to a bias in our model. 
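
A common convention is to learn the scaling parameters from the training split only and to reuse them for the test split, so that no information from the test data leaks into preprocessing. Here is a minimal sketch of that convention; the two synthetic columns only imitate the value ranges mentioned above and are an assumption, not the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Two synthetic columns with very different ranges (assumption, not the wine data)
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(0, 300, 400),   # range like "total sulfur dioxide"
                          rng.uniform(0, 0.6, 400)])  # range like "chlorides"

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)   # mean and std come from the training split only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test split reuses the training statistics

print("train mean:", X_tr_scaled.mean(axis=0).round(3))
print("test mean: ", X_te_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and unit variance, while the test columns are only approximately centred, which is expected.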

In [None]:
#Scaling the X Data

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
#Fit the scaler on the training data only, then apply the same transformation to the test data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### 4. Model Training

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# 4. Model training

#Perform Grid Search for hyperparameter tuning
param_grid = {'n_estimators': [1,5,10,50,100,200,500],
              'max_depth': [2, 4, 6, 8]}

RFC_qual = RandomForestClassifier()

grid_clf = GridSearchCV(RFC_qual, param_grid)
grid_clf.fit(X_train, y_train)

print(grid_clf.best_estimator_.n_estimators)
print(grid_clf.best_estimator_.max_depth)

In [None]:
y_pred = grid_clf.predict(X_test)

### 5. Model testing & Evaluation

In [None]:
from sklearn import metrics

print('RFC Accuracy =', metrics.accuracy_score(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix

#Confusion matrix creation
confusion_matrix(y_test,y_pred)
pd.crosstab(y_test, y_pred, rownames = ['Actual'], colnames =['Predicted'])

As expected, the predictions for qualities 3, 4 and 8 are poor: these classes barely appear among the predictions. An accuracy of only a little over 60% is not a result we can settle for.
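
Overall accuracy can hide exactly this kind of failure on rare classes. The toy example below uses hand-made labels (an assumption, not the notebook's predictions) in which every sample of the rare classes 3, 4 and 8 is misclassified, and compares accuracy with per-class recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hand-made toy labels (assumption): the rare classes 3, 4 and 8 are never predicted
y_true = np.array([5] * 40 + [6] * 40 + [7] * 15 + [3] * 2 + [4] * 2 + [8] * 1)
y_hat = y_true.copy()
y_hat[np.isin(y_true, [3, 4])] = 5
y_hat[y_true == 8] = 7

labels = [3, 4, 5, 6, 7, 8]
recalls = recall_score(y_true, y_hat, labels=labels, average=None)

print("accuracy          :", accuracy_score(y_true, y_hat))           # 0.95
print("balanced accuracy :", balanced_accuracy_score(y_true, y_hat))  # 0.5
for label, recall in zip(labels, recalls):
    print(f"recall for quality {label}: {recall:.2f}")
```

Balanced accuracy is the mean of the per-class recalls, so it drops sharply when whole classes are missed even though plain accuracy stays high.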

In [None]:
#Feature importances of the best estimator found by the grid search

(pd.Series(grid_clf.best_estimator_.feature_importances_, index=X.columns).nlargest(11).plot(kind='barh'))

### 6. Iterations
#### 6.1 Bin creation

We anticipated the poor performance for the qualities 3, 4 and 8, which is due to the few training samples.
To work around this, we split the data into three groups: bad, medium and good quality. 
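
As an alternative way to express such a grouping, `pd.cut` can map the scores to bins in one step. This is a sketch under assumptions: the bin edges follow the grouping 0-3, 4-6 and 7-10, the `quality_bin` name is hypothetical, and the six-row series is toy data.

```python
import pandas as pd

# Toy quality scores (assumption): one example per class present in the data
quality = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Right-closed intervals: (-1, 3] -> 0 (bad), (3, 6] -> 1 (medium), (6, 10] -> 2 (good)
quality_bin = pd.cut(quality, bins=[-1, 3, 6, 10], labels=[0, 1, 2]).astype(int)

print(quality_bin.tolist())  # [0, 1, 1, 1, 2, 2]
```

Writing the result to a separate column keeps the original scores intact, so the binning step can be re-run without binning already binned values.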

In [None]:
#Overwrite the quality column of the preprocessed dataframe with three quality bins
#The train/test data created from it below replaces the split on single quality levels

# Categories: poor quality as 0 (0, 1, 2, 3), medium quality as 1 (4, 5, 6) and high quality as 2 (7, 8, 9, 10)
# Note: run this cell only once, because the binned values 0, 1 and 2 would all fall into bin 0 on a second run

df_preprocessed.loc[(df_preprocessed['quality']<= 3) , 'quality'] = 0
df_preprocessed.loc[(df_preprocessed['quality']>= 4) & (df_preprocessed['quality']< 7), 'quality'] = 1
df_preprocessed.loc[(df_preprocessed['quality']>= 7) , 'quality'] = 2

In [None]:
print(df_preprocessed.quality.value_counts())

In [None]:
#X Dataset, y dataset

X_cat = df_preprocessed.drop(columns=['quality'])
y_cat = df_preprocessed['quality']

X_train, X_test, y_train, y_test = train_test_split(X_cat, y_cat, test_size=0.2, random_state=42)

In [None]:
#Scaling the X data: fit on the training data only, then transform the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

To learn more about scaling, read https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02

In [None]:
# 4. Model training

RFC = RandomForestClassifier(n_jobs=-1, max_features= 'sqrt' ,n_estimators=50)

#Perform Grid Search for hyperparameter tuning
param_grid = {'n_estimators': [1,5,10,50,100,200,500],
              'max_depth': [2, 4, 6, 8],
              #'auto' is omitted: for classifiers it was an alias of 'sqrt' and newer scikit-learn versions reject it
              'max_features': ['sqrt', 'log2']}

grid_clf = GridSearchCV(RFC, param_grid)
grid_clf.fit(X_train, y_train)

print(grid_clf.best_estimator_.n_estimators)
print(grid_clf.best_estimator_.max_depth)

In [None]:
from sklearn.metrics import classification_report

# 5. Model testing
y_pred = grid_clf.predict(X_test)

print(classification_report(y_test, y_pred))
print('RFC Accuracy =', metrics.accuracy_score(y_test, y_pred))

In [None]:
#Confusion Matrix creation
confusion_matrix(y_test,y_pred)
pd.crosstab(y_test, y_pred, rownames = ['Actual'], colnames =['Predicted'])

In [None]:
#Feature importances of the best estimator found by the grid search

(pd.Series(grid_clf.best_estimator_.feature_importances_, index=X_cat.columns).nlargest(11).plot(kind='barh'))

### 7. Conclusion

We loaded the data and had a quick look. As a result we can state:
- The data contains red wine instances with 11 chemical features and a quality label with the classes 3, 4, 5, 6, 7 and 8.
- There is no missing data.
- Some classes are underrepresented.

We visualized the data and found out:
- No features are significantly correlated, so we kept all features.
- The boxplots for every class show that points outside the whiskers are not related to the imbalance in the data. 

We deleted outliers by trial and error, comparing the number of instances flagged as outliers with the outliers visible in the pairplot.
We standardized our training and test data and trained a random forest model. We used a random forest because we are dealing with a classification problem and, in our experience, it performs well on classification. 
To optimize the hyperparameters, we used a grid search over commonly used parameter values. 

The evaluation of the predictions showed poor performance on the underrepresented quality classes. 
Therefore, we grouped the classes into bad, medium and good quality and retrained the model on the modified categories. 
As a different approach, we oversampled the underrepresented classes with SMOTE and retrained the model on the initial classes. 
In the end we combined binning with oversampling. When oversampling, we always made sure that only the training data was oversampled and the test data was kept in its initial state.
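
The rule of oversampling only the training data can be sketched as below. Note the swapped-in technique: this uses plain random oversampling with `sklearn.utils.resample` rather than SMOTE, and the two-class demo frame is synthetic (both are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (assumption): 180 samples of class 0, 20 of class 1
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(size=200), "label": [0] * 180 + [1] * 20})

# Split first, so the test set never contains duplicated training rows
train, test = train_test_split(demo, test_size=0.2, stratify=demo["label"],
                               random_state=42)

majority_size = train["label"].value_counts().max()
parts = []
for _, group in train.groupby("label"):
    if len(group) < majority_size:
        # Random oversampling with replacement (not SMOTE)
        group = resample(group, replace=True, n_samples=majority_size,
                         random_state=42)
    parts.append(group)
train_balanced = pd.concat(parts)

print(train_balanced["label"].value_counts())
print(test["label"].value_counts())
```

The test split keeps its original class distribution, so the evaluation still reflects the imbalance a model would meet on new data.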

Binning the data without any oversampling worked best, with an accuracy score of ~88-90%. But these results need to be viewed critically, since the model will never classify a wine as poor quality, and its performance on excellent quality is lacking as well. 
None of the oversampling approaches improved either the single-class or the binned classification. 

The feature importance confirmed what we could already guess from the boxplots. Both models (the binned and the single-class one) are highly influenced by:

- alcohol 
- volatile acidity and 
- sulphates 
