#### I will be using a Random Forest Classifier to predict wine quality, which is the target variable

### Acknowledgement

Before I begin: my work here builds on an existing approach, so credit goes to https://www.kaggle.com/taha07/wine-quality-prediction-data-analysis, which reached an accuracy of 93%.

#### Now I will be taking you through the steps

## Importing necessary packages

In [None]:
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

## Reading the data

In [None]:
df = pd.read_csv('/kaggle/input/wine-quality/winequalityN.csv')

In [None]:
df.sample(5)

## Data Cleaning and Preprocessing

Before performing any analysis on the data, it's important to deal with null values, since they can cause errors and inconsistencies later on.

First we check the total number of null values in each column.

In [None]:
df.isna().sum()

There are null values present in the following columns:
'fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'pH', 'sulphates'

For each column, instead of dropping rows with null values, I will replace them with either the median or the mean of that column. Simply dropping the rows would shrink the dataset and might degrade the performance of the models.
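
The per-column replacement described above can also be written as a loop. This is only a minimal sketch on a made-up toy frame; the `strategy` mapping and the `impute` helper are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the wine data (values are made up for illustration)
toy = pd.DataFrame({
    "fixed acidity": [7.0, np.nan, 8.0, 9.0],
    "pH": [3.0, 3.2, np.nan, 3.4],
})

# Hypothetical mapping: column name -> statistic used to fill its nulls
strategy = {"fixed acidity": "median", "pH": "mean"}

def impute(frame, strategy):
    """Return a copy of frame with nulls filled by each column's median or mean."""
    out = frame.copy()
    for col, how in strategy.items():
        fill = out[col].median() if how == "median" else out[col].mean()
        out[col] = out[col].fillna(fill)
    return out

imputed = impute(toy, strategy)
print(imputed.isna().sum())
```

Assigning the result back to the column avoids relying on `inplace=True` on a column selection, which newer pandas versions warn about.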

In [None]:
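# Replacing null values in fixed acidity with median
# (assign back instead of inplace=True on a column selection, which newer pandas warns about)
df['fixed acidity'] = df['fixed acidity'].fillna(df['fixed acidity'].median())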
df['fixed acidity'].isna().sum()

In [None]:
# Replacing null values in volatile acidity with mean
df['volatile acidity'] = df['volatile acidity'].fillna(df['volatile acidity'].mean())
df['volatile acidity'].isna().sum()

In [None]:
# Replacing null values in citric acid with mean
df['citric acid'] = df['citric acid'].fillna(df['citric acid'].mean())
df['citric acid'].isna().sum()

In [None]:
# Replacing null values in residual sugar with mean
df['residual sugar'] = df['residual sugar'].fillna(df['residual sugar'].mean())
df['residual sugar'].isna().sum()

In [None]:
# Replacing null values in chlorides with median
df['chlorides'] = df['chlorides'].fillna(df['chlorides'].median())
df['chlorides'].isna().sum()

In [None]:
# Replacing null values in pH with mean
df['pH'] = df['pH'].fillna(df['pH'].mean())
df['pH'].isna().sum()

In [None]:
# Replacing null values in sulphates with median
df['sulphates'] = df['sulphates'].fillna(df['sulphates'].median())
df['sulphates'].isna().sum()

In [None]:
df.isna().sum()

No more null values.

Now, since we're predicting the target variable quality, we'll categorize its scores into low, medium and high, and then encode those categories as 0, 1 and 2 for classification.

In [None]:
# A notebook cell only displays its last expression, so print the minimum explicitly
print(df['quality'].min())
df['quality'].value_counts()

In [None]:
#Mapping values of target variable quality to 'low', 'medium' and 'high' categories for classification
df['quality']=df['quality'].map({3:'low', 4:'low', 5:'medium', 6:'medium', 7:'medium', 8:'high', 9:'high'})

In [None]:
df['quality']=df['quality'].map({'low':0,'medium':1,'high':2})
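
As an alternative to the two dictionary mappings above, the same binning can be sketched with `pd.cut`. The bin edges below are my assumption, chosen to reproduce the mapping used in this notebook (3-4 low, 5-7 medium, 8-9 high), and the `scores` series is toy data.

```python
import pandas as pd

# Toy quality scores covering the range seen in the mapping above
scores = pd.Series([3, 4, 5, 6, 7, 8, 9])

# Right-inclusive bins: (2, 4] -> 0 (low), (4, 7] -> 1 (medium), (7, 9] -> 2 (high)
encoded = pd.cut(scores, bins=[2, 4, 7, 9], labels=[0, 1, 2]).astype(int)
print(encoded.tolist())
```

Unlike a dictionary `map`, a score outside the bin range would become a null value here, so the final `astype(int)` would raise instead of silently passing it through.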

In [None]:
df.sample(5)

### Removal of Outliers

Outliers are extreme values that may severely affect the prediction capabilities of machine learning models. Therefore, it's important that we deal with them, and here I will remove them.

I will now be plotting a boxplot to view the general distribution of data across all features to check for outliers.

In [None]:
sn.set()
plt.figure(figsize=(30,15))
sn.boxplot(data=df)
plt.show()

In [None]:
fig, ax =plt.subplots(1,3)
plt.subplots_adjust(right=2.5, top=1.5)
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sn.boxplot(x=df['residual sugar'], y=df['type'], ax=ax[0])
sn.boxplot(x=df['free sulfur dioxide'], y=df['type'], ax=ax[1])
sn.boxplot(x=df['total sulfur dioxide'], y=df['type'], ax=ax[2])
plt.show()

In these three columns we can notice significant outliers. Therefore, they must be removed from the respective columns.

In [None]:
#Removing outliers in residual sugar
lower = df['residual sugar'].mean()-3*df['residual sugar'].std()
upper = df['residual sugar'].mean()+3*df['residual sugar'].std()
df = df[(df['residual sugar']>lower) & (df['residual sugar']<upper)]

#Removing outliers in free sulfur dioxide
lower = df['free sulfur dioxide'].mean()-3*df['free sulfur dioxide'].std()
upper = df['free sulfur dioxide'].mean()+3*df['free sulfur dioxide'].std()
df = df[(df['free sulfur dioxide']>lower) & (df['free sulfur dioxide']<upper)]

#Removing outliers in total sulfur dioxide
lower = df['total sulfur dioxide'].mean()-3*df['total sulfur dioxide'].std()
upper = df['total sulfur dioxide'].mean()+3*df['total sulfur dioxide'].std()
df = df[(df['total sulfur dioxide']>lower) & (df['total sulfur dioxide']<upper)]
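
The three filters above all apply the same rule: keep only rows within three standard deviations of the column mean. A minimal sketch of that rule as a reusable function follows; `drop_3sigma_outliers` is a hypothetical helper name and the toy frame is made up for illustration.

```python
import pandas as pd

def drop_3sigma_outliers(frame, column, k=3.0):
    """Keep rows whose value in `column` lies strictly within mean +/- k * std."""
    col = frame[column]
    lower = col.mean() - k * col.std()
    upper = col.mean() + k * col.std()
    return frame[(col > lower) & (col < upper)]

# Toy data: twenty ordinary values and one extreme value
toy = pd.DataFrame({"residual sugar": [1.0] * 20 + [100.0]})
filtered = drop_3sigma_outliers(toy, "residual sugar")
print(len(toy), "->", len(filtered))
```

Applied column by column, as in the cell above, each call recomputes the mean and standard deviation on the already filtered frame.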

### One-hot encoding

The 'type' column must be one-hot encoded for classification. One-hot encoding creates a binary column for each category. Here we call pd.get_dummies() with drop_first=True, which drops the first category (red) and leaves a single column of 1's and 0's, where 1 denotes white wine and 0 denotes red wine.

In [None]:
# dtype=int keeps the column as 1/0; newer pandas versions return True/False by default
dummies = pd.get_dummies(df['type'], drop_first=True, dtype=int)
df = pd.concat([df, dummies], axis=1)
df.drop('type', axis=1, inplace=True)
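
To see what this encoding produces, here is a small sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({"type": ["white", "red", "white"], "pH": [3.1, 3.3, 3.0]})

# Categories are sorted alphabetically, so drop_first=True drops 'red'
dummies = pd.get_dummies(toy["type"], drop_first=True, dtype=int)
encoded = pd.concat([toy.drop(columns="type"), dummies], axis=1)
print(encoded)
```

The remaining column is named `white`, which is why 1 stands for white wine and 0 for red.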

## Correlation between features

In [None]:
#Checking relationship between features
cor=df.corr()
plt.figure(figsize=(20,10))
sn.heatmap(cor,xticklabels=cor.columns,yticklabels=cor.columns,annot=True)
cor

## Train-Test split

I will be splitting the dataset into training and testing sets in an 80:20 ratio.

In [None]:
X = df.loc[:,df.columns!='quality']
y = df['quality']

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20, random_state=0)
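
If the three quality classes turn out to be imbalanced (the earlier `value_counts()` call shows the distribution), one optional variation is a stratified split, which keeps the class proportions the same in both sets. This is a sketch on synthetic labels and is not what the cell above does:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 10 low (0), 80 medium (1), 10 high (2)
y_toy = np.array([0] * 10 + [1] * 80 + [2] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0, stratify=y_toy
)
# Count of each class in the 20-row test set
print(np.bincount(y_te))
```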

## Model Fitting

I will be fitting sklearn's <b>RandomForestClassifier</b> model.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
print(rfc.get_params())

The output above lists the default parameters used by the classifier.

In [None]:
# Fit the model
rfc.fit(X_train,y_train)

In [None]:
y_pred=rfc.predict(X_test)
accuracy_score(y_test,y_pred)

We have achieved an accuracy score of 0.946 (94.6%). That's great! But the performance may improve further if we tune the hyperparameters.

## Hyperparameter Tuning

I will now use RandomizedSearchCV, which samples parameter combinations from a grid of candidate values for the Random Forest model and scores each sampled combination with 3-fold cross-validation.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start=90, stop=200, num=12)]
# 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
max_features = ['sqrt']
max_depth = [int(x) for x in np.linspace(start=10, stop=110, num=11)]
max_depth.append(None)
min_samples_split=[2, 5, 10]
min_samples_leaf=[1, 2, 4]
bootstrap=[True, False]

In [None]:
random_search_grid = {'n_estimators': n_estimators,
                      'max_features': max_features,
                      'max_depth': max_depth,
                      'min_samples_split': min_samples_split,
                      'min_samples_leaf': min_samples_leaf,
                      'bootstrap': bootstrap}
print(random_search_grid)
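
To get a feel for how large such a grid is, the number of possible combinations is the product of the lengths of the candidate lists. The helper `grid_size` below is a hypothetical name, shown here on a small toy grid:

```python
from math import prod

def grid_size(grid):
    """Number of distinct parameter combinations in a dict of candidate lists."""
    return prod(len(values) for values in grid.values())

toy_grid = {"n_estimators": [90, 100, 110], "bootstrap": [True, False]}
print(grid_size(toy_grid))  # 3 * 2 = 6
```

Calling `grid_size(random_search_grid)` gives the size of the full search space, from which the randomized search samples only `n_iter` combinations.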

In [None]:
rfc=RandomForestClassifier()
rf_random = RandomizedSearchCV(estimator=rfc, param_distributions = random_search_grid, n_iter=100, 
                          cv=3, verbose=2, random_state=0, n_jobs=-1)

In [None]:
rf_random.fit(X_train, y_train)

We can now get the best set of parameters found by the search.

In [None]:
rf_random.best_params_
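
Rather than copying the best parameters by hand, they can be reused directly from the fitted search object. The sketch below uses synthetic data and a deliberately tiny grid (both are my assumptions for illustration) to show two options: `best_estimator_`, which is already refit on the whole training set, and unpacking `best_params_` into a new classifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the wine data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

small_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=small_grid,
    n_iter=3, cv=3, random_state=0,
)
search.fit(X_train, y_train)

# Option 1: the search refits the best model on all training data (refit=True by default)
best_rfc = search.best_estimator_

# Option 2: build a fresh classifier from the best parameters
tuned = RandomForestClassifier(**search.best_params_, random_state=42)
tuned.fit(X_train, y_train)

print(search.best_params_)
print(best_rfc.score(X_test, y_test), tuned.score(X_test, y_test))
```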

Let's fit the model once again with the updated parameters.

In [None]:
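# Parameters copied from rf_random.best_params_ in my run.
# 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3.
rfc = RandomForestClassifier(n_estimators=90, min_samples_split=2, min_samples_leaf=1, 
                             max_features='sqrt', max_depth=50, bootstrap=True,
                             random_state = 42)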

In [None]:
rfc.fit(X_train,y_train)

In [None]:
y_pred=rfc.predict(X_test)
accuracy_score(y_test,y_pred)

#### 94.71%

<b>There seems to be a slight improvement. Nevertheless, we get around 95% accuracy, which is awesome!</b>

Now I will be showing the confusion matrix and the classification report.

In [None]:
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print('\nClassification Report:\n', classification_report(y_test, y_pred))
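
The recall column of the classification report can be read straight off the confusion matrix: each diagonal entry divided by its row total. Here is a minimal sketch on made-up labels (not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for the three quality classes
y_true = [0, 0, 1, 1, 1, 1, 2, 2]
y_hat = [1, 0, 1, 1, 1, 1, 1, 2]

cm = confusion_matrix(y_true, y_hat)

# Rows are true classes, columns are predicted classes
recall = cm.diagonal() / cm.sum(axis=1)
accuracy = cm.diagonal().sum() / cm.sum()
print(cm)
print(recall, accuracy)
```

In this toy case the overall accuracy is 0.75 while the recall of the low and high classes is only 0.5, so the per-class rows are worth checking alongside the headline accuracy.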

## I hope you found this useful :)