# Introduction

**The Red Wine Quality dataset is publicly shared in the [UCI repository](https://archive.ics.uci.edu/ml/datasets/wine+quality). I have used the version available on Kaggle.**

**The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: [Web Link](http://www3.dsi.uminho.pt/pcortez/wine/) or the reference [Cortez et al., 2009](http://www3.dsi.uminho.pt/pcortez/Home.html). Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).**

**These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.**

**For simplicity, I approach this as a classification problem.**


**I have kept this notebook beginner friendly for anyone who is completely new to Machine Learning, and I aim to make the approach as easy as possible to follow. If you find it useful, do read and upvote it so that it can reach more people.**

![red_wine](https://archive.ics.uci.edu/ml/assets/MLimages/Large186.jpg)

# Contents

**Input variables (based on physicochemical tests):**

> fixed acidity

> volatile acidity

> citric acid

> residual sugar

> chlorides

> free sulfur dioxide

> total sulfur dioxide

> density

> pH

> sulphates

> alcohol


**Output variable (based on sensory data):**


> quality (score between 0 and 10)

**HOPE YOU ENJOY MY NOTEBOOK!!**

# Loading Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from pandas_profiling import ProfileReport
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder



# Loading Dataset

In [None]:
df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
print('The Dataset contains {} rows and {} columns '.format(df.shape[0], df.shape[1]))

In [None]:
df.head()

# Let's explore the Data

In [None]:
df.describe()

# Pandas Profiling

In [None]:
ProfileReport(df)

In [None]:
df['quality'].value_counts().index

**So the ratings are 3, 4, 5, 6, 7 and 8, giving only 6 distinct values in the quality column**

# Correlation

In [None]:
plt.figure(figsize=(18,10))
sns.heatmap(df.corr(), annot=True, cmap=plt.cm.plasma)
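
**A full heatmap can be hard to read at a glance. A minimal sketch, assuming a small synthetic frame (the column values below are made up, not the wine data), shows how the correlations with the target can be ranked by absolute value:**

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the wine frame: one informative column, one pure-noise column
demo = pd.DataFrame({'alcohol': rng.normal(size=200), 'noise': rng.normal(size=200)})
demo['quality'] = 2 * demo['alcohol'] + rng.normal(scale=0.5, size=200)

# Correlation of every feature with the target, strongest first
corr_with_target = (
    demo.corr()['quality']
    .drop('quality')
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

**On the real data the same expression would be applied to `df` instead of `demo`.**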

# Missing Values?

In [None]:
df.isnull().sum().sum()

**So our dataset has no missing values**

In [None]:
df.info()

**The dataset primarily contains values of float data types**

# Histogram and Density Plots of Columns

**Creating certain visualizations to make understanding of the columns easier**

In [None]:
df.hist(bins=40, figsize=(10,15))
plt.show()

In [None]:
df.plot(kind='density', subplots=True, layout=(4,3), sharex=False)
plt.show()

**What do we Understand?**

**Data distribution for attribute “alcohol” is positively skewed, while “density” is close to normally distributed. Pay attention to the wine quality distribution: most scores fall on the two middle values (5 and 6), so there are many more wines of average quality than wines of ‘good’ or ‘bad’ quality.**

# Citric Acid, Fixed Acidity and Density

In [None]:
data = df.groupby(by="fixed acidity")[["fixed acidity", "density", "citric acid"]].first().reset_index(drop=True)

# Figure
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize = (16, 6))

# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curve
a = sns.kdeplot(x=data["fixed acidity"], ax=ax1, lw=6, ls="--")
b = sns.kdeplot(x=data["density"], ax=ax2, lw=6, ls="--")
c = sns.kdeplot(x=data["citric acid"], ax=ax3, lw=6, ls="--")

a.set_title("Fixed Acidity Distribution", fontsize=16)
b.set_title("Density Distribution", fontsize=16)
c.set_title("Citric Acid distribution", fontsize=16)

# Scatterplot Analysis

**Since the correlation plot showed that certain columns are well correlated, let's draw a scatter plot matrix to inspect those pairs visually**

In [None]:
from pandas.plotting import scatter_matrix

sm = scatter_matrix(df, figsize=(16, 10), diagonal='kde')

for ax in sm.reshape(-1):
    ax.xaxis.label.set_rotation(40)
    ax.yaxis.label.set_rotation(0)

    # May need to offset label when rotating to prevent overlap of figure
    ax.get_yaxis().set_label_coords(-0.6, 0.5)

    # Hide all ticks
    ax.set_xticks(())
    ax.set_yticks(())

plt.show()

**Analysis**

**Here we can observe a positive linear relationship between the highly correlated columns. For instance, `fixed acidity` and `density` had a correlation value of 0.67, and the scatter plot reflects that.**

**Human wine preference scores range from 3 to 8, so it is straightforward to split them into ‘bad’ and ‘good’ quality wines. This lets us practice hyperparameter tuning on, e.g., decision tree algorithms. Plotting the number of wines in each category shows that there are far more bad wines than good ones. Machine learning algorithms operate on numeric values, so we encode the two categories as the discrete values 0 and 1.**

In [None]:
# Dividing wine as good and bad by giving the limit for the quality

bins = (2, 6, 8)
group_names = ['bad', 'good']
df['quality'] = pd.cut(df['quality'], bins = bins, labels = group_names)
# Now let's assign numeric labels to our quality variable

label_quality = LabelEncoder()

# Bad becomes 0 and good becomes 1
df['quality'] = label_quality.fit_transform(df['quality'])
print(df['quality'].value_counts())
sns.countplot(x=df['quality'])
plt.show()
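
**To see exactly where the cut falls, here is a minimal sketch on a hand-made series (the values are illustrative, not read from the dataset). It shows that bins `(2, 6, 8)` send scores 3 to 6 to 'bad' and scores 7 to 8 to 'good'. The `np.where` line is an assumed equivalent shortcut, not part of the original notebook.**

```python
import numpy as np
import pandas as pd

# Illustrative scores covering every rating seen in the dataset (3..8)
scores = pd.Series([3, 4, 5, 6, 7, 8])

# pd.cut uses right-closed intervals: (2, 6] -> 'bad', (6, 8] -> 'good'
labels = pd.cut(scores, bins=(2, 6, 8), labels=['bad', 'good'])
print(labels.tolist())

# Equivalent shortcut: 1 for scores of 7 or more, 0 otherwise
binary = np.where(scores >= 7, 1, 0)
print(binary.tolist())
```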

# Model Development

In [None]:
x = df.drop(['quality'], axis=1)
y = df['quality']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 50)
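
**Because good wines are a small minority, a plain random split can leave the test set with a different class ratio than the training set. One option (an addition, not used in the original split) is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels with an assumed 86/14 imbalance to show that both parts keep roughly the same share of the positive class.**

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 86 'bad' (0) and 14 'good' (1); illustrative only
y_demo = np.array([0] * 86 + [1] * 14)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the class proportions similar in both splits
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=50, stratify=y_demo
)
print("share of good wines in train:", y_tr.mean())
print("share of good wines in test: ", y_te.mean())
```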

In [None]:
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
# Reuse the training-set mean and std on the test set; refitting here would leak test information
x_test = sc.transform(x_test)


cols = ['fixed acidity',
'volatile acidity',
'citric acid',
'residual sugar',
'chlorides',
'free sulfur dioxide',
'total sulfur dioxide',
'density',
'pH',
'sulphates',
'alcohol'
       ]
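
**The scaler should learn its mean and standard deviation from the training split only and then apply those same statistics to the test split. A minimal sketch on synthetic arrays (assumed shapes and values, not the wine data) makes the difference visible: the scaled training columns have mean 0 by construction, while the scaled test columns are only close to 0.**

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for x_train / x_test
demo_train = rng.normal(loc=10.0, scale=2.0, size=(200, 3))
demo_test = rng.normal(loc=10.0, scale=2.0, size=(50, 3))

scaler = StandardScaler()
demo_train_scaled = scaler.fit_transform(demo_train)  # learn mean/std from training rows only
demo_test_scaled = scaler.transform(demo_test)        # reuse those statistics on the test rows

print(demo_train_scaled.mean(axis=0).round(3))
print(demo_test_scaled.mean(axis=0).round(3))
```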

# Hyperparameter Optimization

**For DecisionTreeClassifier, `max_depth` can be adjusted to improve accuracy; similarly, `n_estimators` can be tuned for RandomForestClassifier. These steps are optional and you may choose to skip them.**

# Decision Tree

In [None]:
dtc = DecisionTreeClassifier(max_depth=200)
dtc.fit(x_train, y_train)
preds = dtc.predict(x_test)
score = dtc.score(x_test, y_test)
score

In [None]:
preds[:5]

In [None]:
y_test[:5]

**Let's look for best depth values**

In [None]:
Ks = 100
mean_acc = np.zeros((Ks-1))
for n in range(1,Ks):
    
    #Train Model and Predict  
    dtc = DecisionTreeClassifier(max_depth = n).fit(x_train,y_train)
    yhat=dtc.predict(x_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)

mean_acc

In [None]:
print( "The best accuracy was with", mean_acc.max(), "with depth =", mean_acc.argmax()+1)
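
**Note that the loop above picks the depth by looking at test accuracy, so the reported best score is somewhat optimistic. A hedged alternative is to pick the depth with cross-validation on the training data only. The sketch below assumes a synthetic dataset from `make_classification` (not the wine data) and a smaller depth range to keep it quick.**

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training split
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)

depths = range(1, 11)
# Mean 5-fold accuracy for each candidate depth
cv_acc = np.array([
    cross_val_score(
        DecisionTreeClassifier(max_depth=d, random_state=0), X_demo, y_demo, cv=5
    ).mean()
    for d in depths
])

best_depth = depths[int(cv_acc.argmax())]
print("best depth:", best_depth, "cv accuracy:", round(float(cv_acc.max()), 3))
```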

# Decision Tree Classification Report

In [None]:
# classification_report expects the true labels first, then the predictions
cf = metrics.classification_report(y_test, preds)
print(cf)
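
**The numbers in a classification report come from the confusion matrix. A small sketch with hand-made labels (illustrative only, 0 = bad and 1 = good) shows how precision and recall for the 'good' class are derived, and why the argument order (true labels first) matters.**

```python
from sklearn import metrics

# Hand-made labels for illustration
y_true_demo = [0, 0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Precision: of the wines predicted good, how many are good (2 of 4)
precision = metrics.precision_score(y_true_demo, y_pred_demo)
# Recall: of the wines that are good, how many were found (2 of 3)
recall = metrics.recall_score(y_true_demo, y_pred_demo)
print(precision, recall)
```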

# Random Forest Classifier

In [None]:
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
preds = rfc.predict(x_test)
score = rfc.score(x_test,y_test)
score

In [None]:
preds[:5]

In [None]:
y_test[:5]

In [None]:
Ks = 100
mean_acc = np.zeros((Ks-1))
for n in range(1,Ks):
    
    #Train Model and Predict  
    rfc = RandomForestClassifier(n_estimators = n).fit(x_train,y_train)
    yhat = rfc.predict(x_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)

mean_acc

In [None]:
print( "The best accuracy was with", mean_acc.max(), "with n_estimator =", mean_acc.argmax()+1)

# Classification Report

In [None]:
# classification_report expects the true labels first, then the predictions
cf = metrics.classification_report(y_test, preds)
print(cf)

# ROC curve plot

In [None]:
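# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
rfc_plot = metrics.RocCurveDisplay.from_estimator(rfc, x_test, y_test)

In [None]:
dtc_plot = metrics.RocCurveDisplay.from_estimator(dtc, x_test, y_test)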
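
**A ROC plot can be summarised by a single number, the area under the curve (AUC). The sketch below assumes a synthetic dataset and a small forest (neither is the notebook's actual data or model) and computes the curve points and the AUC from predicted probabilities.**

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine features and binary labels
X_demo, y_demo = make_classification(n_samples=400, n_features=11, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=50)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr)

# ROC analysis needs scores, so use the predicted probability of class 1
proba_good = clf.predict_proba(Xte)[:, 1]

fpr, tpr, thresholds = roc_curve(yte, proba_good)
auc = roc_auc_score(yte, proba_good)
print("AUC:", round(float(auc), 3))
```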

# Cross Validation Score Approach
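**Let's check how our metrics hold up under cross-validation, which averages accuracy over several folds instead of relying on a single split**

In [None]:
# Cross-validate on the training data so the test set stays untouched
dtc_eval = cross_val_score(dtc, x_train, y_train, cv=10)
print('Cross Val Score accuracy is {:.2f}'.format(dtc_eval.mean()))

In [None]:
# Cross-validate on the training data so the test set stays untouched
rfc_eval = cross_val_score(rfc, x_train, y_train, cv=10)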
print('Cross Val Score accuracy is {:.2f}'.format(rfc_eval.mean()))
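
**When scaling is part of the workflow, one way to keep each fold honest is to wrap the scaler and the model in a pipeline, so the scaler is refit on the training part of every fold. This is a sketch under assumptions: the data is synthetic with an assumed 85/15 class imbalance, and `StratifiedKFold` is an addition that the original notebook does not use.**

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the wine data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, weights=[0.85, 0.15], random_state=0
)

# Scaler + model in one estimator: scaling is learned inside each fold
pipe = make_pipeline(
    StandardScaler(), RandomForestClassifier(n_estimators=50, random_state=0)
)

# Stratified folds keep the class ratio similar across folds
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_demo, y_demo, cv=folds)
print(scores.round(3), round(float(scores.mean()), 3))
```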

# GridSearchCV

**For DecisionTree**

**Instead of trying every hyperparameter combination by hand, GridSearchCV evaluates each combination in the grid with cross-validation and selects the one with the best score**

In [None]:
tree_para = {'criterion':['gini','entropy'],'max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50,70,90,120,150]}
dtc_cv = GridSearchCV(DecisionTreeClassifier(), tree_para, cv=10)
# Tune on the training data; searching on the test set would leak it into model selection
dtc_cv.fit(x_train, y_train)

**To check the best hyperparameter**

In [None]:
dtc_cv.best_params_

In [None]:
dtc_new = DecisionTreeClassifier(criterion='entropy', max_depth = 8)
dtc_new.fit(x_train,y_train)
new_score  = dtc_new.score(x_test, y_test)
new_score

**So this actually works!!**
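
**GridSearchCV also keeps the winning model: with the default `refit=True`, `best_estimator_` is already retrained on all the data passed to `fit`, and `best_score_` holds its mean cross-validated score. A minimal sketch, assuming synthetic data and a reduced grid (both chosen only for illustration):**

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine features and binary labels
X_demo, y_demo = make_classification(n_samples=400, n_features=11, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=50)

grid = {'criterion': ['gini', 'entropy'], 'max_depth': [2, 4, 6, 8]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(Xtr, ytr)  # tune on the training split only

print(search.best_params_, round(float(search.best_score_), 3))

# best_estimator_ is already refit on all of Xtr with the winning parameters
test_acc = search.best_estimator_.score(Xte, yte)
print("held-out accuracy:", round(float(test_acc), 3))
```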

**Let's do the same for RandomForest as well**

In [None]:
param_grid = { 
    'n_estimators': [200, 500],
    'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3 (it meant 'sqrt' for classifiers)
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy']
}

rfc_cv = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
# Tune on the training data; searching on the test set would leak it into model selection
rfc_cv.fit(x_train, y_train)

In [None]:
rfc_cv.best_params_

In [None]:
rfc_new = RandomForestClassifier(criterion='gini', max_depth=5, max_features='sqrt', n_estimators=500)
# Fit and score the new random forest (not the decision tree from the previous section)
rfc_new.fit(x_train, y_train)
new_score = rfc_new.score(x_test, y_test)
new_score

# Conclusion

**Using `cross_val_score` and GridSearchCV can go a long way towards getting the best possible scores from your model, so feel free to use them at your convenience**