# White Wine Quality: Classification

Welcome and thanks for opening this notebook! It is aimed at beginners: I use classification machine learning techniques to predict the quality of white wine. Quality is reduced to two classes, good and bad, so this is a binary classification analysis. I build 4 models with the following techniques: random forest classifier, logistic regression, decision tree, and support vector machine (SVM). I start with some data exploration. Next, I prepare the data for machine learning. Lastly, I create the 4 classification models to predict white wine quality.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#import other packages
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
path = '/kaggle/input/white-wine-quality/winequality-white.csv'

df = pd.read_csv(path,sep=';')

In [None]:
df.head(5)

In [None]:
df.describe()

In [None]:
#Count the missing values in the dataset
df.isnull().sum()

There are no missing values in the dataset.

In [None]:
#unique values for quality
df.quality.unique()

In [None]:
len(df)

The white wine file of the UCI Wine Quality dataset has 4,898 rows, so this is a reasonably large dataset for a tabular problem.

In [None]:
#Divide the quality of wine into good and bad wine
#Binary classification
winequality_names = ['bad', 'good']
df['quality'] = pd.cut(df['quality'], bins = (2, 5.5, 9), labels = winequality_names)
df.head(3)

In [None]:
df.quality.value_counts()

The wine has been classified into *'good'* wine and *'bad'* wine. Quality scores are integers, so the cut at 5.5 means a wine is good when its score is 6 or higher and bad when its score is 5 or lower.
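
To make the binning concrete, here is a minimal sketch that applies the same `pd.cut` call to a handful of made-up scores (the `scores` series is hypothetical, it is not read from the dataset). Intervals are closed on the right by default, so the bins are (2, 5.5] and (5.5, 9].

```python
import pandas as pd

# Hypothetical scores covering the range used by the bins (3 to 9)
scores = pd.Series([3, 4, 5, 6, 7, 8, 9])

# Same bins and labels as in the notebook: (2, 5.5] -> 'bad', (5.5, 9] -> 'good'
labels = pd.cut(scores, bins=(2, 5.5, 9), labels=['bad', 'good'])

# Pair every score with the label it received
mapping = dict(zip(scores.tolist(), labels.astype(str).tolist()))
print(mapping)
```

Any score outside the outer bin edges (2 or lower, or above 9) would become `NaN`, so it is worth checking `labels.isna().sum()` after cutting.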

In [None]:
#Boxplot
fig, axes = plt.subplots(4, 3, figsize=(20,20))

fig.suptitle("White Wine Quality Distribution")
sns.boxplot(ax=axes[0, 0], data=df, x='quality', y='fixed acidity')
sns.boxplot(ax=axes[0, 1], data=df, x='quality', y='volatile acidity')
sns.boxplot(ax=axes[0, 2], data=df, x='quality', y='citric acid')
sns.boxplot(ax=axes[1, 0], data=df, x='quality', y='residual sugar')
sns.boxplot(ax=axes[1, 1], data=df, x='quality', y='chlorides')
sns.boxplot(ax=axes[1, 2], data=df, x='quality', y='free sulfur dioxide')
sns.boxplot(ax=axes[2, 0], data=df, x='quality', y='total sulfur dioxide')
sns.boxplot(ax=axes[2, 1], data=df, x='quality', y='density')
sns.boxplot(ax=axes[2, 2], data=df, x='quality', y='pH')
sns.boxplot(ax=axes[3, 0], data=df, x='quality', y='sulphates')
sns.boxplot(ax=axes[3, 1], data=df, x='quality', y='alcohol')

In [None]:
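#Make a correlation diagram
#'quality' is now a categorical column, so only the numeric columns are correlated
corr = df.select_dtypes(include='number').corr()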

ax = sns.heatmap(corr,vmin=-1, vmax=1, center=0,cmap=sns.diverging_palette(20, 220, n=200),square=True)

ax.set_xticklabels(ax.get_xticklabels(),rotation=45,horizontalalignment='right')
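
Because `quality` is now a categorical column, it cannot take part in a Pearson correlation directly. One option, sketched here on a tiny made-up frame (the `toy` values are invented for illustration and are not taken from the wine data), is to encode the labels as 0/1 first and then look at how each feature correlates with that code.

```python
import pandas as pd

# Made-up toy data, not taken from the wine dataset
toy = pd.DataFrame({
    'alcohol': [9.0, 9.5, 10.0, 11.5, 12.0, 12.5],
    'density': [0.999, 0.998, 0.997, 0.993, 0.992, 0.991],
    'quality': pd.Categorical(['bad', 'bad', 'bad', 'good', 'good', 'good'],
                              categories=['bad', 'good']),
})

# .cat.codes maps the categories to integers in category order: bad -> 0, good -> 1
toy['quality_code'] = toy['quality'].cat.codes

# Correlation of each numeric feature with the encoded label
corr_with_quality = (
    toy.drop(columns='quality').corr()['quality_code'].drop('quality_code')
)
print(corr_with_quality)
```

With a two-class label this is the point-biserial correlation, which is just Pearson correlation with a 0/1 variable.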

### Data Preprocessing for Machine Learning

*Data Standardization*

In [None]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

In [None]:
df.columns

Let's select the independent variables; these are the physicochemical measurements we will use to predict wine quality.

In [None]:
#Define the independent variables
Features = df[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol']]
X = Features

Now, let's define 'quality' as the dependent variable.

In [None]:
#Define the dependent variable
y = df['quality'].values
y[0:5]

In [None]:
X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

In [None]:
#split the data into a train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print('The shape of the train set is:', X_train.shape, y_train.shape)
print('The shape of the test set is:', X_test.shape, y_test.shape)

The training dataset is 80% of the total data; the test set is 20% of the total data.
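
One thing to be aware of: the scaler above is fitted on all rows before the split, so the test rows influence the means and standard deviations used for scaling. A common alternative, sketched here on synthetic data (`make_classification` stands in for the wine features, and the logistic regression is only a placeholder model), is to split first and let a pipeline fit the scaler on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 11 features; this is not the wine data
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=0)

# Split first; stratify keeps the class ratio the same in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# The scaler inside the pipeline is fitted on the training rows only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

scaler_mean = pipe.named_steps['standardscaler'].mean_
demo_accuracy = pipe.score(X_te, y_te)
print(X_tr.shape, X_te.shape, round(demo_accuracy, 3))
```

Setting `random_state` in the split also makes the reported scores reproducible from run to run.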

The data is ready for a classification machine learning analysis. With classification machine learning techniques, we can predict the quality of a specific white wine. 

### 1. Random Forest Classifier

In [None]:
#import sklearn package for the random forest classifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
#import score and metric packages
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
import itertools

In [None]:
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train,y_train)
y_pred = rfc.predict(X_test)

In [None]:
#What is the accuracy of the above model
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
#Other performance metrics: how did the model perform?
print(classification_report(y_test,y_pred))

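The random forest model gives an overall accuracy of about 85%. The exact figure changes slightly from run to run because neither the train/test split nor the forest uses a fixed `random_state`.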

In [None]:
#Confusion Matrix
cm = confusion_matrix(y_test,y_pred)
print(cm)

sns.heatmap(cm,cbar=False,annot=True,cmap='Blues',fmt="d")
plt.xlabel("y_pred")
plt.ylabel("y_test")
plt.title("Confusion Matrix: Random Forest Classifier")
plt.show()
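
To show how the numbers in the classification report relate to the confusion matrix, here is a small sketch with a made-up 2x2 matrix (the counts in `cm_demo` are invented, they are not the output of the model above). It follows scikit-learn's layout: rows are the true classes and columns are the predicted classes.

```python
import numpy as np

# Made-up confusion matrix; rows = true class, columns = predicted class,
# class order ['bad', 'good']
cm_demo = np.array([[230, 100],
                    [50, 600]])

# Accuracy: correct predictions (the diagonal) divided by all predictions
accuracy = np.trace(cm_demo) / cm_demo.sum()

# Recall per class: correct predictions divided by the row total (true count)
recall = np.diag(cm_demo) / cm_demo.sum(axis=1)

# Precision per class: correct predictions divided by the column total (predicted count)
precision = np.diag(cm_demo) / cm_demo.sum(axis=0)

print(accuracy, recall, precision)
```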

### 2. Logistic Regression

Let's find the optimal logistic regression model first. Which solver gives the lowest log loss?

In [None]:
#import the logistic regression packages
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

In [None]:
#logistic regression with the liblinear solver
LR_a = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
yhat_prob_a = LR_a.predict_proba(X_test)
log_loss(y_test, yhat_prob_a)

In [None]:
LR_b = LogisticRegression(C=0.01, solver='saga').fit(X_train,y_train)
yhat_prob_b = LR_b.predict_proba(X_test)
log_loss(y_test, yhat_prob_b)

In [None]:
LR_c = LogisticRegression(C=0.01, solver='newton-cg').fit(X_train,y_train)
yhat_prob_c = LR_c.predict_proba(X_test)
log_loss(y_test, yhat_prob_c)

In [None]:
LR_d = LogisticRegression(C=0.01, solver='lbfgs').fit(X_train,y_train)
yhat_prob_d = LR_d.predict_proba(X_test)
log_loss(y_test, yhat_prob_d)

In [None]:
LR_e = LogisticRegression(C=0.01, solver='sag').fit(X_train,y_train)
yhat_prob_e = LR_e.predict_proba(X_test)
log_loss(y_test, yhat_prob_e)

The differences in log loss between the solvers are negligible; they only show up as rounding differences. So, let's pick the model with the liblinear solver.

In [None]:
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR

In [None]:
#log loss score
lr_ypred = LR.predict_proba(X_test)
log_loss(y_test,lr_ypred)

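The logistic regression model has a log loss of about 0.52. Log loss is not a percentage: it is the average negative log-likelihood of the true labels, so the closer it is to zero, the better the predicted probabilities match the actual classes.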
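
To give a feel for the scale of log loss, here is a minimal hand-written version for the binary case (the labels and probabilities are made up, and `binary_log_loss` is a helper written for this illustration, not a scikit-learn function). A model that always predicts a probability of 0.5 scores ln(2), roughly 0.693, which is a useful reference point.

```python
import math

# Made-up true labels (1 = 'good', 0 = 'bad') and predicted probabilities of 'good'
y_true = [1, 0, 1, 1]
p_good = [0.9, 0.2, 0.6, 0.5]

def binary_log_loss(y_true, p_pos):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        # Only the probability assigned to the true class contributes
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

loss = binary_log_loss(y_true, p_good)

# Reference point: always predicting 0.5 gives ln(2)
baseline = binary_log_loss(y_true, [0.5] * len(y_true))
print(round(loss, 4), round(baseline, 4))
```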

### 3. Decision Tree

In [None]:
#import the decision tree packages 
from sklearn.tree import DecisionTreeClassifier

**Find the Optimal Decision Tree Depth**

*Approach: Grid Search*

In [None]:
decision_tree = DecisionTreeClassifier(criterion="entropy",random_state=42)
decision_tree = decision_tree.fit(X_train,y_train)

In [None]:
param_grid = {'max_depth':range(1, decision_tree.tree_.max_depth+1, 2),'max_features': range(1, len(decision_tree.feature_importances_)+1)}

wine_gr = GridSearchCV(DecisionTreeClassifier(criterion="entropy",random_state=42),param_grid=param_grid,scoring='accuracy',n_jobs=-1)

wine_gr = wine_gr.fit(X_train, y_train)

In [None]:
wine_gr.best_estimator_.tree_.node_count, wine_gr.best_estimator_.tree_.max_depth

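The best estimator found by the grid search has a maximum depth of 24 (the second value above). The first value is the total number of nodes in the tree, not the number of leaves.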
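
The settings picked by a grid search can also be read directly from the fitted object. The sketch below runs a much smaller grid on synthetic data (the data and the grid values are assumptions made for illustration) and prints the chosen parameters next to the depth, leaf count, and node count of the best tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 11 features; this is not the wine data
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(criterion='entropy', random_state=42),
    param_grid={'max_depth': [1, 3, 5], 'max_features': [2, 6, 11]},
    scoring='accuracy',
    cv=5,
)
grid.fit(X_demo, y_demo)

best_tree = grid.best_estimator_
print(grid.best_params_, round(grid.best_score_, 3))
print('depth:', best_tree.get_depth(),
      'leaves:', best_tree.get_n_leaves(),
      'nodes:', best_tree.tree_.node_count)
```

Note that `best_params_` contains both `max_depth` and `max_features`, so reusing only the depth in a new classifier does not reproduce the model the search selected; `best_estimator_` does.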

**Best Model: Decision Tree**

In [None]:
# Create Decision Tree classifier object with the optimal depth
clf = DecisionTreeClassifier(criterion="entropy", random_state=42, max_depth=24)

# Train Decision Tree Classifier
DT_Model = clf.fit(X_train,y_train)

#Predict the response for test dataset
ypred_tree = DT_Model.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, ypred_tree))

In [None]:
#Calculate the f1-score
fscore_tree = f1_score(y_test,ypred_tree, average='weighted') 
print("Weighted F1-score:",fscore_tree)

In [None]:
#Other performance metrics
accuracy_dt = classification_report(y_test,ypred_tree)
print(accuracy_dt)

The overall accuracy score of the decision tree classifier equals 79%.
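
Accuracy and the weighted F1-score are different metrics, and they can disagree when the classes are imbalanced. Here is a small sketch with made-up labels (8 'good' wines, 2 'bad' wines, and a lazy classifier that always predicts 'good') to show the gap.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up, imbalanced labels: 8 'good' and 2 'bad'
y_true_demo = ['good'] * 8 + ['bad'] * 2

# A lazy classifier that predicts 'good' for every wine
y_pred_demo = ['good'] * 10

acc = accuracy_score(y_true_demo, y_pred_demo)

# 'weighted' averages the per-class F1-scores, weighted by true class counts;
# zero_division=0 scores the never-predicted 'bad' class as 0 without a warning
f1_w = f1_score(y_true_demo, y_pred_demo, average='weighted', zero_division=0)

print(round(acc, 3), round(f1_w, 3))
```

The accuracy looks fine because most wines are 'good', while the weighted F1-score is pulled down by the 'bad' class that is never predicted.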

In [None]:
#decision tree visualization
import graphviz
from sklearn.tree import export_graphviz 
from IPython.display import Image  
from sklearn import tree


In [None]:
featureNames = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol']
#class names must follow the order of the fitted classes, not the order of appearance in df
class_names = DT_Model.classes_.tolist()

dot_data = tree.export_graphviz(DT_Model,feature_names = featureNames,class_names=class_names,filled=True, rounded=True)

graph = graphviz.Source(dot_data)  
graph
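
A tree of this depth gives a very large picture. If graphviz is not available, or a quick look at the top splits is enough, scikit-learn's `export_text` prints the rules as plain text. This is a sketch on synthetic data with a shallow tree; the feature names `f0` to `f3` are placeholders, not wine features.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in with 4 features; this is not the wine data
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Placeholder feature names for the illustration
names = ['f0', 'f1', 'f2', 'f3']

# A shallow tree keeps the printed rules short
small_tree = DecisionTreeClassifier(
    criterion='entropy', max_depth=2, random_state=42).fit(X_demo, y_demo)

rules = export_text(small_tree, feature_names=names)
print(rules)
```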

### 4. Support Vector Machine Learning

Let's select the best SVM (support vector machine) model. SVMs are typically used for smaller datasets because training time grows quickly with the number of samples. This dataset is quite large, but let's give it a try.

In [None]:
#import support vector machine learning packages from sklearn
from sklearn import svm
from sklearn.svm import SVC

In [None]:
kernel = ['rbf','linear','poly','sigmoid']   
acc_score_list = []

for k in kernel:
    clf = svm.SVC(kernel=k)
    clf.fit(X_train, y_train)
    ypred = clf.predict(X_test)
    acc_score_list.append(f1_score(y_test, ypred, average='weighted')) 
    
acc_score_list

The *'rbf'* kernel has the best weighted F1-score of the four kernels for the support vector machine. The score is 77%.
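
The loop above compares the kernels on the test set, which is then also used for the final evaluation. A common alternative is to pick the kernel with cross-validation on the training data only and keep the test set for the last step. Here is a minimal sketch on synthetic data (the data is an assumption made for illustration, not the wine data).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the training data; this is not the wine data
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)

cv_scores = {}
for k in ['rbf', 'linear', 'poly', 'sigmoid']:
    # 5-fold cross-validated weighted F1-score for this kernel
    scores = cross_val_score(SVC(kernel=k), X_demo, y_demo,
                             cv=5, scoring='f1_weighted')
    cv_scores[k] = scores.mean()

# Kernel with the highest mean cross-validated score
best_kernel = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_kernel)
```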

In [None]:
#Best model 
svmmodel = svm.SVC(kernel='rbf')
svmmodel  = svmmodel.fit(X_train, y_train) 


In [None]:
svm_ypred = svmmodel.predict(X_test)
svm_score = f1_score(y_test, svm_ypred, average='weighted')
print("Weighted F1-score: ",svm_score)

In [None]:
#Other performance metrics: how well did the support vector model perform?
print(classification_report(y_test,svm_ypred))

The weighted average F1-score of the support vector machine model equals 78%. The random forest classifier, with an accuracy of 85%, performs better than the support vector machine model.

In [None]:
#Confusion Matrix
cm_svm = confusion_matrix(y_test,svm_ypred)
print(cm_svm)

sns.heatmap(cm_svm,cbar=False,annot=True,cmap='Blues',fmt="d")
plt.xlabel("y_pred")
plt.ylabel("y_test")
plt.title("Confusion Matrix: Support Vector Machine Learning")
plt.show()

**Thanks for going through this notebook! This is the end of this analysis. Hope you enjoyed it! If you find this notebook useful or if you enjoyed it please upvote :)**