# Intro

Data from UCI ML database: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

Goal: To model wine quality based on physicochemical tests

## Contents:

* Data setup
* EDA
* ML model setup
* ML Models
    * Logistic Regression Model
    * KNN Classifier
    * Decision Tree
    * Random Forests
    * Support Vector Machine
    * Gridsearch - can we increase accuracy?
* Results
* Feature Importance

# Setup data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
sns.set_style("darkgrid")

In [None]:
dataRed = pd.read_csv('../input/winequality-red.csv')

In [None]:
dataRed.head()

In [None]:
dataRed.describe()

#### Bucket the quality into fewer groups:

Six groups is a lot given the data - we may need to narrow the scope a bit to get a more accurate model.

In [None]:
wineDict = {3:"bad",4:"bad",5:"passable",6:"passable",7:"good",8:"good"}

In [None]:
dataRed["wineQuality"] = dataRed.quality.map(wineDict)
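A quick sanity check on the bucketing step can be sketched as follows. The small `quality` series here is a hypothetical stand-in for the real column, and the names `wine_dict`, `n_unmapped` and `bucket_counts` are only illustrative. `Series.map` returns `NaN` for any score that is missing from the dictionary, so counting missing values catches unmapped scores.

```python
import pandas as pd

# Hypothetical stand-in for the `quality` column; the real data comes from the CSV.
quality = pd.Series([3, 4, 5, 5, 6, 6, 7, 8], name="quality")

wine_dict = {3: "bad", 4: "bad", 5: "passable", 6: "passable", 7: "good", 8: "good"}
buckets = quality.map(wine_dict)

# Series.map yields NaN for keys absent from the dictionary,
# so a non-zero count here would mean an unmapped quality score.
n_unmapped = int(buckets.isna().sum())

# Class sizes after bucketing.
bucket_counts = buckets.value_counts()

print(n_unmapped)
print(bucket_counts)
```

On the real data, the same two checks would be run on `dataRed["wineQuality"]`.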

In [None]:
dataRed.head()

# EDA

In [None]:
sns.pairplot(dataRed,hue="wineQuality")

### Let's compare the size of the full quality buckets vs. the narrower, qualitative descriptions we gave 

In [None]:
plt.figure(figsize=(16,6))

plt.subplot(1,2,1)
sns.countplot(data=dataRed,x="quality")

plt.subplot(1,2,2)
sns.countplot(data=dataRed,x="wineQuality")
plt.tight_layout()

In [None]:
plt.figure(figsize=(12,12))
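# Correlate numeric columns only; the string-valued wineQuality column has no correlation.
sns.heatmap(data=dataRed.select_dtypes("number").corr(),annot=True,cmap="Blues",linewidths=1)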

## Visualizing some key factors

In [None]:
plt.figure(figsize = (16,20))
nrows = 4
ncols = 2

plt.subplot(nrows,ncols,1)
sns.violinplot(data=dataRed,x="quality",y="alcohol")
plt.subplot(nrows,ncols,2)
sns.violinplot(data=dataRed,x="wineQuality",y="alcohol")

plt.subplot(nrows,ncols,3)
sns.violinplot(data=dataRed,x="quality",y="volatile acidity")
plt.subplot(nrows,ncols,4)
sns.violinplot(data=dataRed,x="wineQuality",y="volatile acidity")

plt.subplot(nrows,ncols,5)
sns.violinplot(data=dataRed,x="quality",y="sulphates")
plt.subplot(nrows,ncols,6)
sns.violinplot(data=dataRed,x="wineQuality",y="sulphates")

plt.subplot(nrows,ncols,7)
sns.violinplot(data=dataRed,x="quality",y="citric acid",split=True)
plt.subplot(nrows,ncols,8)
sns.violinplot(data=dataRed,x="wineQuality",y="citric acid",split=True)

# Predicting Quality through Machine Learning Models

Let's run a few different types of ML models to see if we can predict how good a wine is (the bucketed "wineQuality" label derived from the "quality" score) based on its physicochemical measurements.

First, standardize the variables.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(dataRed.drop(["quality","wineQuality"],axis=1))

In [None]:
scaled_features = scaler.transform(dataRed.drop(["quality","wineQuality"],axis=1))

In [None]:
df_feat = pd.DataFrame(scaled_features,columns=dataRed.columns[:-2])
df_feat.head()
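To make the standardization step concrete, here is a minimal sketch on synthetic data (the array `X_demo` and its two columns are made up for illustration). It checks that `StandardScaler` matches the manual formula of subtracting each column's mean and dividing by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic two-column data on very different scales (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[10.0, 0.5], scale=[2.0, 0.1], size=(200, 2))

scaler_demo = StandardScaler().fit(X_demo)
X_scaled = scaler_demo.transform(X_demo)

# Manual z-score; NumPy's default std uses ddof=0, the same as StandardScaler.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_scaled, manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so distance-based models such as KNN and SVM are not dominated by the columns with the largest raw values.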

#### X and y arrays

In [None]:
X = scaled_features
y = dataRed["wineQuality"]

#### sklearn imports

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

## Logistic Regression model

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)
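Because the buckets are imbalanced, a plain random split can leave very few "bad" or "good" wines in the test set, and without a seed the split changes on every run. One possible variation, sketched here on hypothetical labels, passes `stratify` and `random_state` to `train_test_split`. The value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels, loosely mimicking the three buckets.
y_demo = np.array(["bad"] * 10 + ["passable"] * 80 + ["good"] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify keeps class proportions equal in both splits;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=42
)

test_counts = dict(zip(*np.unique(y_te, return_counts=True)))
print(test_counts)
```

In this notebook the equivalent call would add `stratify=y` and a fixed `random_state` to the existing split.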

In [None]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

In [None]:
predictions = logmodel.predict(X_test)

In [None]:
print(classification_report(y_test,predictions))

In [None]:
print(confusion_matrix(y_test,predictions))
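The raw confusion matrix prints without row or column names, and its class order is alphabetical by default. A small sketch of a labeled version, using made-up predictions (`y_true` and `y_pred` are hypothetical) and an explicit `labels` order:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["bad", "passable", "good"]

# Hypothetical true and predicted labels, for illustration only.
y_true = ["bad", "bad", "passable", "passable", "passable", "good", "good"]
y_pred = ["passable", "bad", "passable", "passable", "good", "good", "passable"]

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f"true_{name}" for name in labels],
    columns=[f"pred_{name}" for name in labels],
)
print(cm_df)
```

Reading along a row shows where wines of one true class ended up, which makes it easy to see whether "bad" wines are being absorbed into "passable".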

## KNN Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

In [None]:
knn.fit(X_train,y_train)

In [None]:
knn_pred = knn.predict(X_test)

In [None]:
print(classification_report(y_test,knn_pred))

In [None]:
print(confusion_matrix(y_test,knn_pred))
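The model above uses a single neighbor. One way to check whether another value of `n_neighbors` does better is to compare a few candidates with cross-validation. This is only a sketch: the data comes from `make_classification`, and the list of candidate values is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data with 11 features, standing in for the wine features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=0
)

k_values = [1, 3, 5, 9, 15]  # arbitrary candidates
mean_scores = {}
for k in k_values:
    # Mean accuracy over 5 cross-validation folds for this k.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(best_k)
```

On the wine data the same loop would run on `X_train` and `y_train`, keeping the test set for the final evaluation.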

## Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dtree = DecisionTreeClassifier()

In [None]:
dtree.fit(X_train,y_train)

In [None]:
dtree_pred = dtree.predict(X_test)

In [None]:
print(classification_report(y_test,dtree_pred))

In [None]:
print(confusion_matrix(y_test,dtree_pred))

## Random Forests

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train,y_train)

In [None]:
rfc_pred = rfc.predict(X_test)

In [None]:
print(classification_report(y_test,rfc_pred))

In [None]:
print(confusion_matrix(y_test,rfc_pred))

## Support Vector Machines

In [None]:
from sklearn.svm import SVC

In [None]:
svc_model = SVC()

In [None]:
svc_model.fit(X_train,y_train)

In [None]:
svc_pred = svc_model.predict(X_test)

In [None]:
print(classification_report(y_test,svc_pred))

In [None]:
print(confusion_matrix(y_test,svc_pred))

## SVC model with Gridsearch

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']} 

In [None]:
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=3)

In [None]:
grid.fit(X_train,y_train)
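To get a feel for how much work this grid search does, the candidate count can be computed with `ParameterGrid`. The fold count of 5 used here is an assumption: it is the default `cv` in recent scikit-learn releases, but older versions used 3.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "C": [0.1, 1, 10, 100, 1000],
    "gamma": [1, 0.1, 0.01, 0.001, 0.0001],
    "kernel": ["rbf"],
}

# Every combination of C, gamma and kernel is one candidate model.
n_candidates = len(ParameterGrid(param_grid))

n_folds = 5  # assumed default for GridSearchCV's cv parameter
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

With `refit=True`, one more fit is run at the end to train the best candidate on the whole training set.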

In [None]:
grid.best_params_

In [None]:
grid.best_estimator_

In [None]:
grid_predictions = grid.predict(X_test)

In [None]:
print(classification_report(y_test,grid_predictions))

In [None]:
print(confusion_matrix(y_test,grid_predictions))

# Results

Most models landed at roughly 80% precision and recall (give or take a few points). The KNN and random forest models had the best F1 scores, though all models were in the same ballpark. Not bad. Predictions were most accurate for the passable category and least accurate for the bad category, which proved very difficult to identify.

I was also very pleased that the grid search improved on the basic SVC model, and that it beat the random forest model, which had been the most accurate.
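Since the comparison above rests on a single train/test split, one way to put the models on a more even footing is to score each of them with the same cross-validation folds and the same metric. A minimal sketch, using synthetic data and only two of the models; the dictionary name `candidates` and the choice of macro-averaged F1 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 3-class data standing in for the scaled wine features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

cv_f1 = {}
for name, model in candidates.items():
    # Macro F1 weights every class equally, so small classes count as much as large ones.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1_macro")
    cv_f1[name] = scores.mean()

for name, score in cv_f1.items():
    print(f"{name}: macro F1 = {score:.3f}")
```

Macro averaging is worth considering here because the "bad" bucket is small and would otherwise be drowned out by the "passable" bucket.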

# Feature Importance

Let's figure out which features matter most to the model, using a random forest classifier (RFC).

In [None]:
# Feature columns only: drop the two target columns (quality, wineQuality) at the end.
features = list(dataRed.columns[:-2])
features

In [None]:
importance = RandomForestClassifier(random_state=0,n_jobs=-1)
imp_model = importance.fit(X,y)

In [None]:
model_importances = imp_model.feature_importances_

In [None]:
indices = np.argsort(model_importances)[::-1]
names = [features[i] for i in indices]

In [None]:
plt.figure(figsize=(10,6))
plt.title("Feature Importance")
plt.bar(range(X.shape[1]),model_importances[indices])
plt.xticks(range(X.shape[1]), names,rotation=60)
plt.show()
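As a complement to the bar chart, the importances can be paired with their names in a sorted pandas Series, which is easier to read as numbers. This is a sketch on synthetic data; `feature_names` and the other variable names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names for a small synthetic dataset.
feature_names = ["f0", "f1", "f2", "f3"]
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features.
ranked = pd.Series(forest.feature_importances_, index=feature_names)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

Because these importances are relative shares, a feature's value is only meaningful in comparison with the other features in the same model.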

#### References
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
