***Overview***

This project explores the relationship between red wine quality and the wine's physicochemical input variables (1 - fixed acidity; 2 - volatile acidity; 3 - citric acid; 4 - residual sugar; 5 - chlorides; 6 - free sulfur dioxide; 7 - total sulfur dioxide; 8 - density; 9 - pH; 10 - sulphates; 11 - alcohol). The output variable is quality, a sensory score. EDA (seaborn and ggplot) and several machine learning algorithms are used to determine which physicochemical properties affect a wine's quality.


In [None]:
!pip install -q plotnine   
from plotnine import *
%matplotlib inline

import pandas as pd
import numpy as np

In [None]:
df_wine = pd.read_csv('../input/winequality-red.csv') 

In [None]:
df_wine.head()

# Exploratory Data Analysis

### Seaborn: Correlation Heatmap

In [None]:
import seaborn as sns
color = sns.color_palette()

import matplotlib.pyplot as plt
sns.set(style="white")

In [None]:
# Calculate the correlation
corr= df_wine.corr()
corr

In [None]:
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # the np.bool alias was removed from recent NumPy releases
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

Based on the heatmap above, every variable except "residual sugar", "free sulfur dioxide" and "pH" shows some correlation with "quality".
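
Reading weak correlations off a heatmap is easier to double-check with a ranked list. The sketch below uses a small synthetic frame (`toy`, with made-up numbers, not the wine data) and a hypothetical cutoff `threshold`. It only illustrates ranking features by their absolute correlation with the target.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_wine (made-up numbers, NOT the real dataset).
rng = np.random.default_rng(0)
n = 200
alcohol = rng.normal(10.5, 1.0, n)
toy = pd.DataFrame({
    "alcohol": alcohol,
    "pH": rng.normal(3.3, 0.15, n),                    # unrelated to quality by construction
    "quality": 0.8 * alcohol + rng.normal(0, 0.5, n),  # driven by alcohol by construction
})

# Correlation of every feature with the target, ranked by absolute value.
target_corr = (
    toy.corr()["quality"]
    .drop("quality")
    .sort_values(key=np.abs, ascending=False)
)
print(target_corr)

# Hypothetical cutoff: features below it are candidates for dropping.
threshold = 0.1
weak_features = target_corr[target_corr.abs() < threshold].index.tolist()
print(weak_features)
```

On the real data, the same two steps applied to `df_wine.corr()["quality"]` would give a numeric basis for the choice of columns to drop.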

In [None]:
df_wine.drop(["residual sugar",'free sulfur dioxide','pH'],axis = 1,inplace = True)

In [None]:
df_wine.head()

### Seaborn: pairplot

Arbitrary cutoffs are set for the dependent variable (wine quality) and independent variable (alcohol) based on their distributions in order to facilitate the further analysis.  

In [None]:
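# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest equivalent
sns.histplot(df_wine['quality'], kde=True)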
plt.show()

In [None]:
# Bin "quality" variable into three levels: poor, normal and excellent

bins = [0, 4, 6, 10]
labels = ["poor","normal","excellent"]
df_wine['binned_quality'] = pd.cut(df_wine['quality'], bins=bins, labels=labels)
df_wine.drop('quality', axis=1, inplace=True)
# head() must be the last expression in the cell to be displayed
df_wine.head()
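
With its default `right=True`, `pd.cut` builds right-closed intervals, so the edges `[0, 4, 6, 10]` map scores 1 to 4 to "poor", 5 and 6 to "normal", and 7 to 10 to "excellent". The toy series `scores` below is made up purely to show where the boundary values land.

```python
import pandas as pd

# Made-up scores chosen to sit on and around the bin edges.
scores = pd.Series([3, 4, 5, 6, 7, 8])
bins = [0, 4, 6, 10]
labels = ["poor", "normal", "excellent"]

# Default right=True: intervals are (0, 4], (4, 6], (6, 10].
binned = pd.cut(scores, bins=bins, labels=labels)
print(binned.tolist())
# ['poor', 'poor', 'normal', 'normal', 'excellent', 'excellent']

# Class counts after binning, useful for spotting imbalance early.
print(binned.value_counts())
```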

In [None]:
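# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest equivalent
sns.histplot(df_wine['alcohol'], kde=True)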
plt.show()

In [None]:
# Bin "alcohol" variable into three levels: low, median and high

bins = [0, 10, 12, 15]
labels = ["low alcohol","median alcohol","high alcohol"]
df_wine['binned_alcohol'] = pd.cut(df_wine['alcohol'], bins=bins, labels=labels)
df_wine.drop('alcohol',axis =1, inplace = True)

In [None]:
df_wine.head()

In [None]:
sns_plot = sns.pairplot(df_wine, hue="binned_quality", palette="husl",
             diag_kind="kde")
sns_plot.savefig("pairplot.png")

According to the pairplot above, "volatile acidity" and "citric acid" are the two variables whose distributions differ noticeably across the three quality levels.

### ggplot: Faceted plot, Violin boxplot and Generic boxplot

In [None]:
(ggplot(df_wine, aes('citric acid', 'volatile acidity', color = 'binned_alcohol',
                          size = 'binned_alcohol',
                          shape = 'binned_alcohol'))
 + geom_point(alpha=0.3)
 + facet_wrap("binned_quality",ncol =1)
 + theme_xkcd())

In [None]:
(
    ggplot(df_wine) +
    geom_violin(
        aes(x = 'binned_quality',
            y = 'volatile acidity')) +
    labs(
        title ='Distribution of volatile acidity by quality',
        x = 'wine quality',
        y = 'volatile acidity',
    ))

In [None]:
(
    ggplot(df_wine) +
    geom_boxplot(
        aes(x = 'binned_quality',
            y = 'citric acid')
    ) +
    labs(
        title ='Distribution of citric acid by quality',
        x = 'wine quality',
        y = 'citric acid',
    ) 
)

Based on the three plots above, we can conclude that the excellent quality level has a higher proportion of high-alcohol wines than the poor quality level, and that, on average, the higher the wine quality, the lower the volatile acidity and the higher the citric acid.
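
A proportion read off a scatter plot can be checked numerically with a row-normalised crosstab. The sketch below uses a handful of made-up rows (`toy`) that reuse the notebook's column and label names. It shows the mechanics only and says nothing about the real wine data.

```python
import pandas as pd

# Made-up rows for illustration; column and label names mirror the notebook.
toy = pd.DataFrame({
    "binned_quality": ["poor", "poor", "normal", "normal",
                       "excellent", "excellent", "excellent", "normal"],
    "binned_alcohol": ["low alcohol", "low alcohol", "low alcohol", "median alcohol",
                       "high alcohol", "high alcohol", "median alcohol", "median alcohol"],
})

# normalize="index" turns counts into shares within each quality level,
# so every row sums to 1.
shares = pd.crosstab(toy["binned_quality"], toy["binned_alcohol"], normalize="index")
print(shares)
```

Applied to `df_wine`, the "high alcohol" column of such a table would give the share of high-alcohol wines per quality level directly.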

# ML algorithms and model comparison

In [None]:
# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

In [None]:
df_wine_ml = df_wine.copy()
df_wine_ml.info()

In [None]:
#get dummies
df_wine_ml = pd.get_dummies(df_wine_ml, columns=["binned_alcohol"], drop_first=True)
df_wine_ml.head()

### sklearn StandardScaler

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(df_wine_ml.drop('binned_quality',axis=1))
scaled_features = scaler.transform(df_wine_ml.drop('binned_quality',axis=1))
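# Reuse the feature columns in their original order: Index.difference() sorts its result,
# which would attach the wrong names to the scaled columns.
df_wine_ml_sc = pd.DataFrame(scaled_features, columns=df_wine_ml.drop('binned_quality',axis=1).columns)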
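
The cell above fits the scaler on all rows before the train/test split, so test-set statistics influence the scaling. A common alternative, which is not what this notebook does, is to wrap the scaler and the classifier in a pipeline so the scaler is fitted on the training rows only. The sketch below uses synthetic data from `make_classification` as a stand-in for the wine features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 8 numeric features, 3 classes).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101, stratify=y)

# fit() learns the scaling parameters from X_train only;
# score() applies those same parameters to X_test before predicting.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=20))
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(round(test_accuracy, 3))
```

A pipeline also removes the need to keep separate scaled and unscaled copies of the splits in sync.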

### train_test_split

In [None]:
# use 70% of the data for training and 30% for testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_wine_ml.drop( "binned_quality",axis=1), df_wine_ml["binned_quality"], test_size=0.30, random_state=101)
X_train_sc, X_test_sc, y_train_sc, y_test_sc = train_test_split(df_wine_ml_sc, df_wine_ml["binned_quality"], test_size=0.30, random_state=101)

In [None]:
# unscaled
X_train_all = df_wine_ml.drop("binned_quality",axis=1)
y_train_all = df_wine_ml["binned_quality"]


# scaled
X_train_all_sc = df_wine_ml_sc
y_train_all_sc = df_wine_ml["binned_quality"]


### 1. Logistic Regression

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
pred_logreg = logreg.predict(X_test)
print(accuracy_score(y_test, pred_logreg))

In [None]:
logreg.coef_

### 2. Gaussian Naive Bayes

In [None]:
gnb=GaussianNB()
gnb.fit(X_train,y_train)
pred_gnb = gnb.predict(X_test)
print(accuracy_score(y_test, pred_gnb))

### 3. kNN

In [None]:
knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X_train_sc,y_train_sc)
# The model was trained on scaled features, so it must predict on the scaled test set
pred_knn = knn.predict(X_test_sc)
print(accuracy_score(y_test_sc, pred_knn))

### 4. Decision Tree

In [None]:
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
pred_dtree = dtree.predict(X_test)
print(accuracy_score(y_test, pred_dtree))

In [None]:
dtree_2 = DecisionTreeClassifier(max_features=7 , max_depth=6,  min_samples_split=8)
dtree_2.fit(X_train,y_train)
pred_dtree_2 = dtree_2.predict(X_test)
print(accuracy_score(y_test, pred_dtree_2))

### 5. Random Forest

In [None]:
rfc = RandomForestClassifier(max_depth=6, max_features=7)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)
print(accuracy_score(y_test, pred_rfc))

In [None]:
# feature importance
importances = pd.DataFrame({'feature':X_train.columns,
                            'importance':np.round(rfc.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False).set_index('feature')
importances.head(15)

### 6. SVM

In [None]:
svc = SVC(gamma = 0.01, C = 100, probability=True)
svc.fit(X_train_sc, y_train_sc)
pred_svc = svc.predict(X_test_sc)
print(accuracy_score(y_test_sc, pred_svc))

## K fold cross-validation

Among the six algorithms above, logistic regression, kNN, SVM and random forest have the highest accuracy. K-fold cross-validation is therefore used to obtain a more reliable estimate of their accuracy.
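
As a rough illustration of what the 10-fold runs do: for a classifier, `cross_val_score` with an integer `cv` uses stratified folds, so each fold keeps roughly the same class proportions. The sketch below makes that explicit with `StratifiedKFold` on synthetic, imbalanced data. The data, the `weights` values and the shuffling are assumptions for the example, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced stand-in for the three quality classes.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           n_classes=3, weights=[0.1, 0.8, 0.1], random_state=0)

# Explicit stratified 10-fold splitter with shuffling and a fixed seed.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Scaling inside the pipeline is re-fitted on the training part of every fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean helps judge whether differences between models exceed the fold-to-fold noise.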

**For logistic regression:**

In [None]:
scores_logreg = cross_val_score(logreg, X_train_all_sc, y_train_all_sc, cv=10, scoring='accuracy')
print(scores_logreg)
print(scores_logreg.mean())

**For kNN:**

In [None]:
scores_knn = cross_val_score(knn, X_train_all_sc, y_train_all_sc, cv=10, scoring='accuracy')
print(scores_knn)
print(scores_knn.mean())

**For SVM:**

In [None]:
scores_svc = cross_val_score(svc, X_train_all_sc, y_train_all_sc, cv=10, scoring='accuracy')
print(scores_svc)
print(scores_svc.mean())

**For random forest:**

In [None]:
scores_rfc = cross_val_score(rfc, X_train_all_sc, y_train_all_sc, cv=10, scoring='accuracy')
print(scores_rfc)
print(scores_rfc.mean())

Based on k-fold cross-validation, SVM (support vector machine) performs best.

### Confusion matrix for SVM, without normalization

In [None]:
df= pd.DataFrame(y_test_sc)
df['binned_quality'].value_counts()

In [None]:
from sklearn.metrics import confusion_matrix
# creating a confusion matrix 
cm = confusion_matrix(y_test_sc, pred_svc) 
cm

In [None]:
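# confusion_matrix orders string labels alphabetically; rows are true classes, columns are predictions
names = ["excellent","normal","poor"]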
df = pd.DataFrame(cm, index=names, columns=names)
df

However, we should not ignore that the wine quality classes are ordered and **not balanced** (e.g. there are many more normal wines than excellent or poor ones). Consequently, as the confusion matrix above shows, the model predicts the normal class (per-class accuracy, i.e. recall = 0.966) much better than the poor class (recall = 0) and the excellent class (recall = 0.265).
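
Assuming the per-class rates quoted above are the diagonal of the confusion matrix divided by the row totals, they can be computed directly rather than by hand. The labels in `y_true` and `y_pred` below are made up for illustration and are not the SVM's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

labels = ["excellent", "normal", "poor"]

# Made-up true and predicted labels (10 samples) for illustration only.
y_true = ["normal"] * 6 + ["excellent"] * 3 + ["poor"]
y_pred = (["normal"] * 5 + ["excellent"]           # 6 true "normal"
          + ["excellent", "normal", "normal"]      # 3 true "excellent"
          + ["normal"])                            # 1 true "poor"

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class recall: correct predictions divided by the true count of each class.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(labels, per_class_recall.round(3))))

# recall_score with average=None returns the same per-class values.
print(recall_score(y_true, y_pred, labels=labels, average=None))
```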

As a next step, we should look for methods that improve model performance on the imbalanced dataset.
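
One possible starting point, offered only as a sketch, is class weighting: `SVC(class_weight="balanced")` scales the penalty for each class inversely to its frequency. The example below compares weighted and unweighted SVMs on synthetic imbalanced data using balanced accuracy (the mean of per-class recall). The data and the class proportions are assumptions, and whether weighting helps on the wine data would have to be tested.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced stand-in: one dominant class, two rare ones.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_classes=3, weights=[0.05, 0.85, 0.10], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=101)

results = {}
for name, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    # Same pipeline twice; only the class weighting differs.
    model = make_pipeline(StandardScaler(), SVC(class_weight=class_weight))
    model.fit(X_train, y_train)
    # Balanced accuracy averages recall over classes, so rare classes count equally.
    results[name] = balanced_accuracy_score(y_test, model.predict(X_test))

print(results)
```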