# Wine Tasting Data (UCI, 2015)
This dataset was downloaded from the UCI Machine Learning Repository.

**INTRODUCTION**  
Three wine-tasting experts (herein referred to as TASTERS) tasted ~6500 samples of Portuguese "Vinho Verde" wine, of which ~25% are red and ~75% are white. Each taster rated the QUALITY of each sample on a scale of 0 - 10 (0 = bad, 10 = great), and the final recorded quality rating is the median of the three tasters' ratings. Each wine sample has data detailing its chemical and physical properties, such as pH (a measure of acidity), the amounts of certain chemical compounds, the density and the alcohol content. 

Some immediate questions concerning this dataset:

    - Can we CLASSIFY samples as red or white wine with high accuracy given the sample data?
    - Can we CLASSIFY samples as high or low quality with high accuracy given the sample data?
    - Can we PREDICT the numeric quality of a sample with high accuracy (ie. replace tasters with computers)?
    - How consistent and/or accurate are wine-tasting experts at identifying an objectively "good" sample?
    
Before we start, it is worth asking: is there an objective set of chemical and physical properties that makes a wine good or bad? There are different _types_ of wine of the same colour, each of which is differentiated by things like the type of grapes used, added fruits or flavourings, aging times and temperatures, manner of storage and varying nutrient densities in the soils of the vineyard. Rephrasing the question: _for a given type of wine_, is there an objective set of properties that makes it good or bad? Without much knowledge of how wine-tasters taste, I'd assume that tasters are indeed looking for particular qualities that they identify with their tastebuds, without any knowledge of the actual chemical composition of the wine. For example, a chef knows what a filet mignon _should_ taste like given the way it has historically been cooked, and thus knows what to look for in the taste of a "good" filet mignon.  

Interestingly, there exist many studies that show wine-tasting experts often "incorrectly" rate the quality of wine in international competitions. By incorrectly, we mean rating wines that have won awards for quality in other competitions as poor, or vice-versa (Hodgson 2012, Journal of Wine Economics). These competitions are often blind taste tests with no information about the tasted wine (no bottle shape, no label or mention of accrued awards, no suggested flavours or time of aging) and thus a rating depends exclusively on _taste_. This does not necessarily imply that there are no objective qualities of good or bad wines, but that the sensory experience of taste itself is heavily influenced by other factors like the label design and colour, knowledge of awards, bottle shape, wine texture, etc. Those involved in the food industry are well aware that presentation is key - nobody wants to eat something visually unappetizing even if it tastes the same as a regular steak.

Tasting experts can even be fooled into believing a _white wine_ is a red one if it is dyed red and _presented_ as a red wine (Morrot 2001, Brain and Language). This implies that, even if we find clear chemical or physical differences between red and white wine, an expert wine-taster may not be able to properly _interpret_ these differences by the tongue alone. Fascinating!
    
**VARIABLE DESCRIPTION:**
    - QUALITY (integer in [0 - 10])
    - FIXED ACIDITY (numeric, g/dm$^3$)
    - VOLATILE ACIDITY (numeric, g/dm$^3$)
    - CITRIC ACID (numeric, g/dm$^3$)
    - RESIDUAL SUGAR (numeric, g/dm$^3$)
    - CHLORIDES (numeric, g/dm$^3$)
    - FREE SULFUR DIOXIDE (numeric, mg/dm$^3$)
    - TOTAL SULFUR DIOXIDE (numeric, mg/dm$^3$)
    - DENSITY (numeric, g/cm$^3$)
    - pH (numeric, <= 14)
    - SULPHATES (numeric, g/dm$^3$)
    - ALCOHOL (numeric, % by volume)
    
Units for most quantities are in MASS DENSITY, with grams (g) or milligrams (mg) per cubic decimeter (dm$^3$).

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
sns.set_context('notebook')
sns.set_style('whitegrid')
%matplotlib inline

def PlottingDefault():
    sns.set(rc={'figure.figsize':(11.7,8.27),"font.size":25,"axes.titlesize":25,"axes.labelsize":25},style="white")
PlottingDefault()

## Exploratory Data Analysis (EDA)

In [None]:
# Read in data and describe it.
df = pd.read_csv('/Users/reubengazer/Downloads/winequalityN.csv')
df.describe()

Next, deal with observations that have missing values. We can either impute (interpolate) the missing values or remove those observations altogether.


In [None]:
# Flag rows that contain at least one missing value.
missing_rows = df.isnull().any(axis=1)
n = missing_rows.sum()
print("There are {} rows ({:.2f}% of data) with at least one missing cell.".format(n,100.0*n/len(df)))
# With such a small percentage, we can simply remove all incomplete rows without worry.
print("Removing rows with missing values...")
df.dropna(inplace=True)
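
The other option mentioned above, imputing instead of dropping, could be sketched as follows. This is a minimal illustration on a small hypothetical frame (`toy`, with made-up values), not on the wine data itself, and it assumes that filling with the per-type median is an acceptable imputation rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few wine rows; values are illustrative only.
toy = pd.DataFrame({
    "type": ["red", "red", "white", "white", "white"],
    "pH": [3.5, np.nan, 3.1, 3.2, 3.3],
    "alcohol": [9.4, 9.8, np.nan, 10.1, 11.0],
})

# Option 1: drop every row that has at least one missing cell.
dropped = toy.dropna()

# Option 2: fill each missing numeric cell with the median of its own wine type.
numeric_cols = ["pH", "alcohol"]
imputed = toy.copy()
imputed[numeric_cols] = toy.groupby("type")[numeric_cols].transform(
    lambda col: col.fillna(col.median())
)

print(len(dropped), "rows kept after dropping")
print(imputed)
```

Dropping is simpler and loses little when very few rows are affected; imputation keeps every row but introduces values that were never measured.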

There are almost exactly 3 times as many white wines as red in the data. 

### Plot 1a: Pairplot of All Variables

Red wine is displayed as RED and white as GREEN (white markers would be difficult!)

In [None]:
sns.pairplot(df,hue='type',hue_order=['red','white'],palette=['red','green'],)

There appear to be obvious visual differences in physical and chemical properties between the red and white wine populations, which suggests that an accurate classifier of wine type should be feasible. Let's also make a heatmap of the correlation coefficients to begin feature selection.

### Plot 1b: Heatmap of All Variables

In [None]:
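plt.figure(figsize=[10,10],dpi=200) # a 10x10 inch figure at dpi=2000 would be 20000x20000 pixels
# Correlate the numeric columns only; 'type' is a string column.
sns.heatmap(df.drop('type',axis=1).corr(),cmap='coolwarm',linewidth=2,annot=True,annot_kws={'fontsize':10})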

### Caption 1a, 1b

From the above two correlation plots, some noteworthy correlations are:

    - Density vs. Alcohol (negative), Residual Sugar and Fixed Acidity (positive) 
        - alcohol is less dense than water, therefore higher alcohol = lower density, holding all else constant
        - residual sugar is denser than water. Higher sugar = higher density, holding all else constant
    - pH vs. any type of acid 
        - the pH depends directly on the relative amount and strength of each acid present
    - Free and Total Sulfur Dioxide
        - the total amount includes the free amount
    - Quality vs. Alcohol (positive)
    - Quality vs. Density (negative) 
        - but recall ALCOHOL and DENSITY are themselves correlated physically (negatively)
    - Quality vs. Volatile Acidity (negative)
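
Reading the strongest pairs off an annotated heatmap by eye is error-prone, so a ranked list can help. The sketch below uses synthetic data with hypothetical coefficients that only mimic the alcohol/density relationship described above; it is not the wine dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data with hypothetical coefficients; it only mimics the
# alcohol/sugar/density relationship and is not the wine dataset.
rng = np.random.default_rng(0)
alcohol = rng.normal(10.5, 1.2, 500)
sugar = rng.normal(5.0, 2.0, 500)
density = 1.0 - 0.002 * alcohol + 0.0004 * sugar + rng.normal(0, 0.0005, 500)
toy = pd.DataFrame({"alcohol": alcohol, "residual sugar": sugar, "density": density})

corr = toy.corr()
cols = list(corr.columns)

# Collect each unordered pair once (upper triangle, diagonal excluded).
pairs = {
    (a, b): corr.loc[a, b]
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
}

# Rank by absolute strength, strongest first.
ranked = sorted(pairs.items(), key=lambda kv: abs(kv[1]), reverse=True)
for (a, b), r in ranked:
    print("{:>15} vs. {:<15} r = {:+.2f}".format(a, b, r))
```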

### Comparing Red and White Wine Samples
Let's leave all predictors in for the moment and build a logistic regression model with __sklearn__ to classify wine type.  
First, build red and white copy data sets for syntax ease.

In [None]:
# Make copies for red and white datasets.
red,white = df[df['type']=='red'],df[df['type']=='white']

print("Dataset comprised of:\n{:3.0f}% are RED\n{:3.0f}% are WHITE\n".format(100*float(len(red))/len(df),100*float(len(white))/len(df)))

# How many wines of each type (red and white) are there?
for wine in ['red','white']:
    print("Number of {} wines = {}".format(wine,len(df[df['type']==wine])))

### Plot 2: Boxplot Distributions of All Predictors (Split by Red, White)
How different are the predictor distributions for each wine type?

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=6,figsize=[20,10])
columngrid = np.array(df.columns[df.columns!='type']).reshape(2,6)

for i,column in enumerate(df.columns[df.columns!='type']):
    ind1,ind2 = np.where(columngrid==column)
    ind1,ind2=ind1[0],ind2[0]
    sns.boxplot(x='type',y=column,data=df,ax=axes[ind1,ind2],palette=['yellow','red'])
    #axes[ind1,ind2].set_title(column+'\n',fontsize=25)
plt.tight_layout()

This grid displays the "fingerprint" of both red and white wine samples. Visually, red and white wines appear to differ in most categories and are similar only in quality rating and alcohol content. 

Red wines are distinguished from white wines by:
    - Larger FIXED ACIDITY, VOLATILE ACIDITY, CHLORIDES, pH, SULPHATES, DENSITY
    - Lower CITRIC ACID, RESIDUAL SUGAR, FREE and TOTAL SULFUR DIOXIDE

Notably, _the quality of the wine appears largely independent of the type of wine_.

**Welch's t-test for Significant Differences of Mean Predictor Values in Red, White Wine**  
Group by colour/type and show the mean of each variable.

In [None]:
df.groupby('type').mean()

In [None]:
# Initialize dataframe for t, p statistics outputs with index as each attribute save type.
tdf=pd.DataFrame(index=[colname for colname in df.columns[df.columns!='type']])
t,p,sig=[],[],[]
for colname in df.columns[df.columns!='type']: # type is a string and we've grouped by it.
    alpha = 0.05 # significance tolerance.
    t2, p2 = stats.ttest_ind(red[colname],white[colname],equal_var=False) # Welch's 2-sided t-test.
    t.append(t2)
    p.append(p2)
    if p2<alpha:
        sig.append(True)
    else:
        sig.append(False)
# Assign columns.
tdf['t'],tdf['p'],tdf['Significant?'] = t,p,sig
tdf

Although all differences in means between red and white are statistically significant, this does not mean that the differences are IMPORTANT to us. For example, the fractional difference between the mean QUALITY of the two types is very small (2%), which is not a large enough difference for us to care about. However, the mean values of predictors like total sulfur dioxide are not only statistically different by the t-test but also practically different, since the fractional difference is quite large (~50%).
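
The gap between statistical and practical significance can be illustrated with a small sketch. The two samples below are hypothetical (drawn from normal distributions whose means differ by about 2%), and Cohen's d is used here as one common effect-size measure; it is an addition for illustration and is not computed elsewhere in this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical samples: two large groups whose true means differ by only ~2%.
group_a = rng.normal(5.80, 0.85, 5000)
group_b = rng.normal(5.92, 0.85, 5000)

# Welch's two-sided t-test (unequal variances allowed).
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Fractional difference between the means, relative to the first group.
frac_diff = (group_b.mean() - group_a.mean()) / group_a.mean()

# Cohen's d: difference in means in units of the pooled standard deviation
# (this simple average of variances assumes equal group sizes).
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2.0)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd

print("p = {:.2e}, fractional difference = {:.1%}, Cohen's d = {:.2f}".format(
    p_value, frac_diff, cohens_d))
```

With thousands of samples per group, even a shift this small yields a tiny p-value, while the effect size stays well below the value of 0.5 conventionally labelled "medium".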

### Model I: Logistic Classifier of Wine Colour

How accurately can we classify the type of wine given its predictors? Let's build a simple model that classifies the colour of the wine. Given the obvious visual differences between the predictor distributions of each colour in Plot 2, the classes appear clearly separable and a highly accurate classifier seems plausible.

Let's start with a logistic regression model including all predictors to get an initial look at classification accuracy.

In [None]:
# Initialize train and test sets of 80% and 20% of the data respectively.
# (sklearn.cross_validation was removed from scikit-learn; model_selection replaces it.)
from sklearn.model_selection import train_test_split, KFold, cross_val_predict, cross_val_score
X,y = df.drop('type',axis=1),df['type']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

# Train a logistic model on the training data.
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

# Print the coefficients of the above model.
print("\nLogistic Regression Coefficients:\n")
print(pd.DataFrame.from_records(list(zip(X.columns,logmodel.coef_[0])),columns=['Predictor','Coef']))

# Estimate the test accuracy of the model using 5-Fold Cross Validation on the training set.
k = 5
cv = KFold(n_splits=k,shuffle=True)
log_accuracy = cross_val_score(logmodel, X_train, y_train, cv=cv).mean()
print("\nModel Accuracy With {}-Fold CV = {:.2f}%".format(k,100*log_accuracy))

# Make predictions on the held-out test set with the trained model, and produce some quality metrics.
# (cross_val_predict would refit the model on the test data instead of using the trained one.)
predictions = logmodel.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print("\nClassification Report:\n")
print(classification_report(y_test,predictions))
print("Confusion Matrix:\n")
fig,ax=plt.subplots(1)
# Rows are true labels and columns are predicted labels, both in sorted order (red, white).
sns.heatmap(confusion_matrix(y_test,predictions),annot=True,fmt='g',cmap='coolwarm',ax=ax)
ax.set_xlabel('Predicted labels');ax.set_ylabel('True labels'); 
ax.xaxis.set_ticklabels(['red', 'white'],fontsize=20); ax.yaxis.set_ticklabels(['red', 'white'],fontsize=20)

From the above, it was fairly easy to distinguish red from white wine given the predictors available. Without much thought as to feature selection, and including all variables, we created a classifier with only a ~2% error rate. The simplest baseline is a constant classifier that always guesses white wine, which produces an error rate of ~25% (red comprises ~25% of the dataset). Our fairly naive classifier already improves on that baseline by ~23 percentage points of accuracy.
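
The constant-classifier baseline can be checked explicitly. The sketch below uses scikit-learn's `DummyClassifier` on hypothetical labels, assuming roughly three white wines for every red as noted earlier; the exact counts are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with a 25/75 red/white split (counts are illustrative).
y = np.array(["red"] * 250 + ["white"] * 750)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = baseline.score(X, y)

print("Baseline accuracy = {:.0%}, error rate = {:.0%}".format(
    baseline_accuracy, 1 - baseline_accuracy))
```

Any real classifier should be judged against this number rather than against 50%, because the classes are imbalanced.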

### Model II: Classification of Wine Quality 
Let's classify the data by separating out wines with quality ratings of:

    - [8-10] : 'good' 
    - [4-7]  : 'mid'
    - [0-3]  : 'bad'

How many are good or bad, and are they predominantly red or white?

In [None]:
df['Quality Class'] = 'mid'
df.loc[df['quality']>=8,'Quality Class'] = 'good'
df.loc[df['quality']<=3,'Quality Class'] = 'bad'

good, bad = df[df['quality']>=8],df[df['quality']<=3]

print("Dataset comprised of:\n{:3.0f}% are RED\n{:3.0f}% are WHITE\n".format(100*float(len(red))/len(df),100*float(len(white))/len(df)))
print("Of the {} GOOD wines:\n{:4.0f}% are RED\n{:4.0f}% are WHITE\n".format(len(good),100*float(len(good[good['type']=='red']))/len(good),100*float(len(good[good['type']!='red']))/len(good)))
print("Of the {} BAD wines:\n{:4.0f}% are RED\n{:4.0f}% are WHITE".format(len(bad),100*float(len(bad[bad['type']=='red']))/len(bad),100*float(len(bad[bad['type']!='red']))/len(bad)))
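
The same three-way binning can also be written with `pd.cut`, which keeps the bin edges in one place. This is a sketch on a few hypothetical ratings; the lower edge of -1 is an assumption chosen so that a rating of 0 falls into the 'bad' bin.

```python
import pandas as pd

# Hypothetical ratings, one per sample.
scores = pd.Series([3, 4, 5, 7, 8, 9], name="quality")

# Bins are right-inclusive: (-1, 3] -> bad, (3, 7] -> mid, (7, 10] -> good.
quality_class = pd.cut(scores, bins=[-1, 3, 7, 10], labels=["bad", "mid", "good"])
print(quality_class.tolist())
```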

The fraction of GOOD wines that are WHITE is larger than the overall fraction of whites (75%), and the fraction of BAD wines that are RED is larger than the overall fraction of reds (25%). 

### Plot 3a: Quality Class vs. All Predictors  
How different are the predictor distributions for each quality class? Each of these plots will show any trends that exist between bad and good wines in general, independent of the wine colour.

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=6,figsize=[20,10])
columngrid = np.array(df.columns.drop(['type','Quality Class'])).reshape(2,6)

for i,column in enumerate(df.columns.drop(['type','Quality Class','quality'])):
    ind1,ind2 = np.where(columngrid==column)
    ind1,ind2=ind1[0],ind2[0]
    sns.boxplot(x='Quality Class',y=column,data=df,ax=axes[ind1,ind2],order=['bad','mid','good'])
plt.tight_layout()

The above plots show Quality Class vs. Predictor trends for both types combined, but there are ~3x as many white wine samples as red. These trends are therefore heavily weighted towards those of white wine, and we should stratify this plot by colour for a better interpretation.

### Plot 3b: Quality Class vs. All Predictors, Stratified by Colour/Type
Repeat the same as above, except stratify by the colour.

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=6,figsize=[20,10])
columngrid = np.array(df.columns.drop(['type','Quality Class'])).reshape(2,6)

for i,column in enumerate(df.columns.drop(['type','Quality Class','quality'])):
    ind1,ind2 = np.where(columngrid==column)
    ind1,ind2=ind1[0],ind2[0]
    sns.boxplot(x='Quality Class',y=column,data=df,ax=axes[ind1,ind2],order=['bad','mid','good'],hue='type',palette=['yellow','red'])
plt.tight_layout()

Here we see that colour plays a key role in trend interpretation. Sulphates vs. quality class in 3a shows no trend when the colours are pooled. Plot 3b shows that quality in the red samples trends positively with the amount of sulphates, but this is masked in 3a because reds make up only 25% of the data. This motivates us to split the data by colour permanently and describe wine sample properties specific to their colour.
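
How a minority group's trend gets masked when groups are pooled can be reproduced on synthetic data. Everything below is hypothetical: the 25/75 split, the slope for red and the flat relationship for white are constructed to illustrate the effect and are not estimates from the wine data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_red, n_white = 250, 750  # hypothetical 25/75 split

# Hypothetical predictor and score: a positive trend for red, none for white.
red_x = rng.normal(0.65, 0.15, n_red)
red_q = 5.0 + 3.0 * (red_x - 0.65) + rng.normal(0, 0.5, n_red)
white_x = rng.normal(0.50, 0.10, n_white)
white_q = 5.5 + rng.normal(0, 0.5, n_white)

toy = pd.DataFrame({
    "type": ["red"] * n_red + ["white"] * n_white,
    "sulphates": np.concatenate([red_x, white_x]),
    "quality": np.concatenate([red_q, white_q]),
})

# Correlation with the colours pooled, then within each colour separately.
pooled_r = toy["sulphates"].corr(toy["quality"])
by_type_r = {name: g["sulphates"].corr(g["quality"]) for name, g in toy.groupby("type")}

print("pooled r = {:+.2f}".format(pooled_r))
for name, r in by_type_r.items():
    print("{:>5} r = {:+.2f}".format(name, r))
```

The pooled correlation is close to zero even though the red group alone shows a clear positive trend, which is the same pattern seen when comparing 3a with 3b.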

Keeping in mind that each boxplot shows a marginal trend (the other variables are _not_ held constant), the takeaways from each of the above plots, in order:

    - FIXED ACIDITY is not a strong predictor of quality
    - Red wine quality is higher with less VOLATILE ACIDITY, CHLORIDES, more CITRIC ACID, SULPHATES
    - White wine is not significantly affected by VOLATILE ACIDITY, CITRIC ACID, CHLORIDES, SULPHATES
    - RESIDUAL SUGAR has no impact on quality rating across both colours
    - Red wine is better when more acidic (lower pH), white wine quality is independent of pH
    - The BEST RED AND/OR WHITE WINES HAVE THE HIGHEST ALCOHOL CONTENT (~2% higher than wines rated below 8)
    
And remember our conclusions about properties of red and white wines in general:

    - REDS have larger FIXED ACIDITY, VOLATILE ACIDITY, CHLORIDES, pH, SULPHATES, DENSITY than WHITES
    - REDS have lower CITRIC ACID, RESIDUAL SUGAR, FREE and TOTAL SULFUR DIOXIDE than WHITES

**Logistic Regression for Classification of Wine Quality**


In [None]:
# (sklearn.cross_validation was removed from scikit-learn; model_selection replaces it.)
from sklearn.model_selection import train_test_split, KFold, cross_val_predict, cross_val_score

# Convert wine type to dummies.
dums = pd.get_dummies(df['type'],drop_first=True)
df = pd.concat([df,dums],axis=1)
good_or_bad = df[df['Quality Class']!='mid'].copy()

# Split into training and testing sets.
X,y = good_or_bad.drop(['type','quality','Quality Class'],axis=1),good_or_bad['Quality Class']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

# Train logistic model on training set.
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

# Print the coefficients of the model fitted on the training set.
print("\nLogistic Regression Coefficients:\n")
print(pd.DataFrame.from_records(list(zip(X.columns,logmodel.coef_[0])),columns=['Predictor','Coef']))

# Estimate the test accuracy using K-Fold Cross Validation on the training set.
cv = KFold(n_splits=5,shuffle=True)
log_accuracy = cross_val_score(logmodel, X_train, y_train, cv=cv).mean()
print("\nEstimated Test Accuracy (via 5-Fold CV) = {:.2f}%".format(100*log_accuracy))

# Classify the held-out test data with the trained model and see the accuracy report.
# (cross_val_predict would refit the model on the test data instead of using the trained one.)
predictions = logmodel.predict(X_test)
from sklearn.metrics import confusion_matrix,classification_report
print("\nClassification Report:\n")
print(classification_report(y_test,predictions))
print("Confusion Matrix:\n")
fig,ax=plt.subplots(1)
# confusion_matrix orders the labels alphabetically, i.e. bad first, then good.
sns.heatmap(confusion_matrix(y_test,predictions),annot=True,fmt='g',ax=ax)
ax.set_xlabel('Predicted labels');ax.set_ylabel('True labels'); 
ax.xaxis.set_ticklabels(['bad', 'good'],fontsize=20); ax.yaxis.set_ticklabels(['bad', 'good'],fontsize=20)

**KNN Classifier**

In [None]:
# Split into training and test sets first, then scale the predictors.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
X,y = good_or_bad.drop(['type','quality','Quality Class'],axis=1),good_or_bad['Quality Class']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

# Fit the scaler on the training predictors only, so the test set does not leak into the scaling.
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train),columns=X.columns) # scaled training predictors.
X_test = pd.DataFrame(scaler.transform(X_test),columns=X.columns) # scaled test predictors.

# Track accuracy of classifier (using 10-fold CV) for each value of k in the list cv_scores.
cv_scores = []
for k in range(1,20):
    knn = KNeighborsClassifier(n_neighbors=k)
    # cross_val_score fits a fresh copy of knn on each fold, so no separate fit is needed here.
    scores = cross_val_score(knn,X_train,y_train,cv=10,scoring='accuracy')
    cv_scores.append(scores.mean())
    
# Compute optimal k, plot misclassification rate vs. k.
error_rate = [1-x for x in cv_scores] # misclassification rate (not a mean squared error)
optimal_k = range(1,20)[error_rate.index(min(error_rate))]
print("The optimal value of k is {}.".format(optimal_k))
    
# Plot error rate vs. choice of k-neighbours.
fig, ax = plt.subplots(1,figsize=[6,6])
ax.plot(range(1,20), error_rate, color='blue', linestyle='--', marker='o', markerfacecolor='red', markersize=10)
ax.set_xlabel('K-Neighbours')
ax.set_ylabel('Error Rate')

# With optimal k chosen, fit on the training data and evaluate on the held-out test data.
knn = KNeighborsClassifier(n_neighbors=optimal_k)
knn.fit(X_train,y_train)
knn_accuracy = knn.score(X_test,y_test)
print("\nModel Accuracy on Held-Out Test Set = {:.2f}%".format(100*knn_accuracy))
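
The scale, tune and evaluate steps above can also be bundled so that scaling is re-fitted inside every cross-validation fold. The sketch below is one way to do this with scikit-learn's `Pipeline` and `GridSearchCV`; it runs on synthetic data from `make_classification` (a stand-in with 11 features, not the wine data), and the step names `scale` and `knn` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 400 samples, 11 features, two classes.
X, y = make_classification(n_samples=400, n_features=11, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The scaler is part of the model, so each CV fold fits it on that fold's training part only.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Search k = 1..19 with 10-fold CV on the training set.
search = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 20))}, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
test_accuracy = search.score(X_test, y_test)  # best pipeline, refitted on all training data
print("Best k = {}, held-out accuracy = {:.2%}".format(best_k, test_accuracy))
```

Keeping the test set out of both the scaling and the choice of k means the final accuracy is an estimate on data the model has never seen.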