Final Project
=============


Data Source: http://mlr.cs.umass.edu/ml/datasets/Wine+Quality 
Dataset Description: http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality.names 


Project Questions
================

**What is the question you hope to answer?**

- Wine making is considered an art. But is there a formula for a quality wine?
- Which basic properties make up the formula for a good wine?
- Do white wine and red wine share the same formula?


**What data are you planning to use to answer that question?**

- I'll be using both datasets, conducting data analysis and applying ML models to each of them.
- Keep all 12 columns (11 physicochemical inputs plus the quality score).



**What do you know about the data so far?**

- The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

- The two original datasets are of different sizes (1599 red vs 4898 white samples).

- In both original datasets, all columns consist of numerical values.

- Neither original dataset has missing values.

- The combined master dataset has a sample size of 6497, which should be sufficient for the ML models applied here.

**Why did you choose this topic?**

- Personal interest in wine. I'm trying to explore wine tasting as a hobby. 

**STEP 1. Clean and analyze the dataset**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import seaborn as sb

%matplotlib inline

Take a peek at the data

In [None]:
!head -10 'winequality-white.csv'

In [None]:
!head -10 'winequality-red.csv'

In [None]:
#Load Data

#names=['fx_acd','vol_acd','cir_acd','res_sgr','chlor','fr_SO2','tl_SO2','den','pH','sulph','alchl','quality']

data_white = pd.read_csv('winequality-white.csv',sep=';',header=0)
data_red = pd.read_csv('winequality-red.csv',sep=';',header=0)

In [None]:
data_white.head()

In [None]:
data_white.mean(axis=0)

In [None]:
data_red.head()

In [None]:
data_red.mean(axis=0)

In [None]:
data_red.shape

In [None]:
data_white.shape

In [None]:
data_red.info()

In [None]:
data_white.info()

In [None]:
data_red.isnull().any()

In [None]:
data_white.isnull().any()

I notice the gap in sample size between the two datasets: red wine has 1599 observations and white wine has 4898. Since 1599 is already a fairly large sample and each dataset is modelled separately, I do not expect the roughly 3x difference in size to have a material impact on the modelling results.
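
As a quick sanity check on the sizes quoted above, the arithmetic can be sketched as follows (the variable names are only for illustration):

```python
# Row counts quoted in the text above.
n_red, n_white = 1599, 4898

ratio = n_white / float(n_red)   # how many times larger the white wine set is
combined = n_red + n_white       # size of the merged dataset

print(round(ratio, 2), combined)
```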

In [None]:
data_white.describe()

In [None]:
data_red.describe()

**Looking into the features:**

There are no missing values. What about outliers? I did some research into the attributes and the way they contribute to the complexity of wines and have the following findings:

[References: http://waterhouse.ucdavis.edu/whats-in-wine;
             http://winefolly.com/; http://winobrothers.com/2011/10/11/sulfur-dioxide-so2-in-wine/]



The expected levels for some of the physicochemical attributes are:

- 1,500 - 14,500 mg/L tartaric acid
- 0 - 500 mg/L citric acid
- 0 - 3 g/L volatile acid
- 10 - 350 mg/L sulphates

However, I find these ranges still too vague for determining outliers, for two reasons:

1) "The predominant fixed acids found in wines are tartaric, malic, citric, and succinic." So fixed acidity and citric acid overlap to some extent. The same kind of overlap exists between free sulfur dioxide and total sulfur dioxide.

2) Wine making is an art. There is no hard line on physicochemical levels other than country-specific winemaking laws, whose effect is eliminated here since all the wine samples come from Portugal. That is to say, each component's range really depends on the enologists' preferences.
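
The overlap described in point 1 can be illustrated with a small sketch. It uses synthetic numbers rather than the wine data, and it assumes total sulfur dioxide is the sum of a free part and a bound part; the value ranges are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins, NOT the wine data.
free_so2 = rng.uniform(5, 60, size=500)      # hypothetical free SO2 levels
bound_so2 = rng.uniform(20, 120, size=500)   # hypothetical bound SO2 levels
total_so2 = free_so2 + bound_so2             # total contains the free part

# Pearson correlation between the overlapping measurements.
r = np.corrcoef(free_so2, total_so2)[0, 1]
print(round(r, 2))
```

On the real data, the same check would be a `.corr()` call on the two sulfur dioxide columns before deciding which one to drop.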




What to do next:

1) Drop overlapping features, specifically citric acid and free sulfur dioxide.

2) Detect noise in the datasets visually.

3) Reassign the values of the 'quality' column to make it a binary label: good quality (score > 5) vs bad quality (score <= 5).


In [None]:
#Create a function that labels good-quality wine (score > 5) as 1 and bad-quality wine as 0

def numeric_to_binary(x):
    if x >5:
        return 1
    else:
        return 0
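
As an alternative to applying a Python function row by row, the same labelling can be sketched as a vectorized comparison. The `quality` values below are made up for illustration.

```python
import pandas as pd

# Made-up quality scores, not taken from the wine data.
quality = pd.Series([3, 5, 6, 8, 5, 7])

# True where the score is above 5, then cast booleans to 0/1.
bin_quality = (quality > 5).astype(int)

print(bin_quality.tolist())
```

On a large frame this avoids one Python function call per row, though for a few thousand rows either approach is fast enough.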


In [None]:
data_white = data_white.drop(['citric acid','free sulfur dioxide'],axis=1)

In [None]:
data_white['bin_quality'] = data_white['quality'].apply(numeric_to_binary)
#data_white.drop(['quality'],axis=1)

data_white.head()

In [None]:
data_red = data_red.drop(['citric acid','free sulfur dioxide'],axis=1)

In [None]:
data_red['bin_quality'] = data_red['quality'].apply(numeric_to_binary)
#data_red.drop(['quality'],axis=1)

data_red.head()

In [None]:
white = pd.DataFrame(data_white.mean(axis=0))

white.columns=['White']

white

In [None]:
red=pd.DataFrame(data_red.mean(axis=0))

red.columns=['Red']

red

In [None]:
wr=pd.concat([white, red], axis=1)

wr

In [None]:
white_red = pd.DataFrame(wr, columns=['White','Red'])

white_red.plot(kind='bar',figsize=(12,8));

In [None]:
#The features are measured on different scales (e.g. g/L vs mg/L), so they need to be standardized first.

from sklearn.preprocessing import StandardScaler

red_q = data_red['bin_quality']
red_X = data_red.drop(['quality','bin_quality'], axis=1)

white_q = data_white['bin_quality']
white_X = data_white.drop(['quality','bin_quality'], axis=1)

stdsc = StandardScaler()

red_X_std = stdsc.fit_transform(red_X)
white_X_std = stdsc.fit_transform(white_X)

#red_q_std = stdsc.fit_transform(red_X)
#white_q_std = stdsc.fit_transform(white_X)

In [None]:
white_X_std = pd.DataFrame(white_X_std,columns = white_X.columns)
red_X_std = pd.DataFrame(red_X_std,columns = red_X.columns)
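
For reference, here is a minimal sketch of what `StandardScaler` computes: each column has its mean subtracted and is divided by its population standard deviation. The toy array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data with two columns on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_std = StandardScaler().fit_transform(X)

# Same result by hand: subtract the column mean, divide by the column std (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std, manual))
```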

In [None]:
a= white_X_std.plot(kind='box',showmeans=True,figsize=(20,10));

plt.title('White Wine Attributes Analysis')

plt.show()


In [None]:
red_X_std.plot(kind='box',showmeans=True,figsize=(20,10));

plt.title('Red Wine Attributes Analysis')

plt.show()

Based on the two box plots above, there are some noticeable outlying values among the features. Again, the winemaking process depends heavily on what enologists choose to do, so the hard line for 'outliers' is not that 'hard'. I'm not going to drop these 'outliers' from the dataset, but they will surely be a source of error when applying regression models.
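
To put a number on what the box plots show, one option is to count the points that fall beyond the whiskers. The sketch below assumes the common 1.5 x IQR whisker rule (matplotlib's default); `count_iqr_outliers` is a hypothetical helper and the toy frame is not the wine data.

```python
import pandas as pd

def count_iqr_outliers(df, k=1.5):
    """Count values per column lying more than k * IQR outside the quartiles."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Comparisons against a Series align on column names.
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.sum()

# Toy frame: column 'a' has one extreme value, column 'b' has none.
toy = pd.DataFrame({'a': [1, 2, 3, 4, 100], 'b': [5, 6, 7, 8, 9]})
counts = count_iqr_outliers(toy)
print(counts)
```

Applied to `white_X_std` or `red_X_std`, this would give one count per feature, which makes it easier to compare features than reading the plots by eye.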

In [None]:
#Take a look at the feature correlations

red_X_std.corr()

In [None]:
white_X_std.corr()

In [None]:
plt.subplot(121)
plt.title('Red Wine Dataset Correlation')
heatmap = plt.pcolor(red_X_std.corr())

plt.subplot(122)
plt.title('White Wine Dataset Correlation')
heatmap = plt.pcolor(white_X_std.corr())


Looking at the correlations among features:

 - In the red wine dataset, fixed acidity is correlated to a certain degree with density.
 - In the white wine dataset, there is a strong correlation between residual sugar and density.
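
Reading the strongest pairs off a heatmap is error-prone, so they can also be listed directly. This is a minimal sketch on synthetic data; `top_correlated_pairs` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Synthetic example: y is almost a linear function of x, z is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
toy = pd.DataFrame({'x': x,
                    'y': 2 * x + rng.normal(scale=0.1, size=300),
                    'z': rng.normal(size=300)})

top = top_correlated_pairs(toy, n=1)
print(top)
```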

**STEP 2. Apply Supervised Learning Models**

In [None]:
#Split the data into training and test sets

#sklearn.cross_validation was removed in later scikit-learn releases; model_selection replaces it
from sklearn.model_selection import train_test_split


X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(red_X_std, red_q, 
                                                    test_size=0.2, random_state=7)
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(white_X_std, white_q, 
                                                    test_size=0.2, random_state=7)
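
In the cells above, the scaler is fitted on the full dataset before splitting. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence preprocessing. The `stratify=y` argument is an extra assumption, added to keep class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 4))   # synthetic features
y = (X[:, 0] > 10.0).astype(int)                      # synthetic binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)

# Mean and std are estimated from the training rows only...
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
# ...and then reused unchanged on the held-out rows.
X_test_std = scaler.transform(X_test)

print(X_train_std.shape, X_test_std.shape)
```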

In [None]:
from sklearn.dummy import DummyClassifier

dc = DummyClassifier(strategy='uniform', random_state=None, constant=None)

In [None]:
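from sklearn.model_selection import cross_val_score

#Note: each call to fit() refits the same estimator, so after this cell dc holds the white wine fit.
#With strategy='uniform' the predictions depend only on the set of class labels seen, not on the features.
dc.fit(X_train_r, y_train_r)
dc.fit(X_train_w, y_train_w)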

In [None]:
from sklearn.metrics import classification_report

print('Classification Report for Red Wine')
print(' ')

#classification_report expects (y_true, y_pred)
print(classification_report(y_test_r, dc.predict(X_test_r)))


In [None]:
print(' ')

#classification_report expects (y_true, y_pred)
print(classification_report(y_test_w, dc.predict(X_test_w)))

In [None]:
#from sklearn.metrics import precision_recall_fscore_support
#from sklearn.metrics import accuracy_score

#metrics_r = list(precision_recall_fscore_support(dc.predict(X_test_r), y_test_r, average='binary'))[:3]
#metrics_r.append(accuracy_score(dc.predict(X_test_r), y_test_r))

In [None]:
#metrics_w = list(precision_recall_fscore_support(dc.predict(X_test_w), y_test_w, average='binary'))[:3]
#metrics_w.append(accuracy_score(dc.predict(X_test_w), y_test_w))

In [None]:
#Red Wine Dataset

t_r = %timeit -o dc.fit(X_train_r, y_train_r)

In [None]:
#White Wine Dataset

t_w = %timeit -o dc.fit(X_train_w, y_train_w)

In [None]:
#metrics_r.append(t_r.best)
#metrics_w.append(t_w.best)

In [None]:
#pd.set_option('display.float_format', lambda x: '%.6f' % x)

#model_stats_r = pd.DataFrame(metrics_r,columns=['dummy_r'],index=['precision','recall','fscore','accuracy','time'])
#model_stats_r

In [None]:
#model_stats_w = pd.DataFrame(metrics_w,columns=['dummy_w'],index=['precision','recall','fscore','accuracy','time'])
#model_stats_w

In [None]:
#pd.concat([model_stats_r, model_stats_w], axis=1)

In [None]:
benchmark1=cross_val_score(dc,X_train_r, y_train_r, cv=20)

benchmark1

In [None]:
benchmark1.mean()

In [None]:
benchmark2=cross_val_score(dc,X_train_w, y_train_w, cv=20)

benchmark2

In [None]:
benchmark2.mean()

**Benchmark**

Red Wine: 50.8% accuracy, 58% F1 score.

White Wine: 50.2% accuracy, 47% F1 score.
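
The roughly 50% accuracy is what uniform random guessing gives on a binary target, whatever the class balance. The simulation below is a sketch with an assumed 65% share of positives (an illustrative figure, not measured from the wine data). It also computes the score of always predicting the majority class, which `DummyClassifier(strategy='most_frequent')` would provide as a stricter benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed class balance for illustration only: about 65% positives.
y_true = (rng.random(100000) < 0.65).astype(int)

# Uniform random guesses between the two classes.
y_guess = rng.integers(0, 2, size=y_true.size)

accuracy = (y_true == y_guess).mean()
majority = max(y_true.mean(), 1 - y_true.mean())

print(round(accuracy, 3), round(majority, 3))
```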

Apply Logistic Regression with L2 (Ridge) Regularization

In [None]:
from sklearn.model_selection import ShuffleSplit

#10 random 80/20 splits of the training data
cv1 = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
cv2 = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

In [None]:
from sklearn.linear_model import LogisticRegression

lr1 = LogisticRegression()
lr2 = LogisticRegression()

parameters_lr = {'penalty': ['l2'], 'C':np.linspace(0.1,2.0,50)}



In [None]:
from sklearn.model_selection import GridSearchCV

clf1 = GridSearchCV(lr1, parameters_lr, cv=cv1)
clf2 = GridSearchCV(lr2, parameters_lr, cv=cv2)

clf1.fit(X_train_r, y_train_r)
clf2.fit(X_train_w, y_train_w)

In [None]:
clf1.best_params_

In [None]:
clf2.best_params_

In [None]:
best_lr1 = clf1.best_estimator_
best_lr2 = clf2.best_estimator_

In [None]:
print(classification_report(y_test_r, best_lr1.predict(X_test_r)))
score1 = cross_val_score(best_lr1, X_train_r, y_train_r, cv=cv1)
print('CV score {}, Average score {}'.format(score1, score1.mean()))
best_lr1.score(X_test_r, y_test_r)

The Logistic Regression model for Red Wine achieves 75.3% accuracy and a 75% F1 score, beating the benchmark.

In [None]:
print(classification_report(y_test_w, best_lr2.predict(X_test_w)))
score2 = cross_val_score(best_lr2, X_train_w, y_train_w, cv=cv2)
print('CV score {}, Average score {}'.format(score2, score2.mean()))
best_lr2.score(X_test_w, y_test_w)

The Logistic Regression model for White Wine achieves 76.8% accuracy and a 78% F1 score, beating the benchmark.

In [None]:
from sk_modelcurves.learning_curve import draw_learning_curve

print('Logistic Regression Learning Curve for Red Wine')

draw_learning_curve(best_lr1,X_train_r, y_train_r, cv=cv1);

The Logistic Regression model shows little variance; accuracy improves with more training data but still does not reach a satisfactory level.

In [None]:
print('Logistic Regression Learning Curve for White Wine')

draw_learning_curve(best_lr2,X_train_w, y_train_w, cv=cv2);

The Logistic Regression model shows some variance and high bias, and does not score high on accuracy.
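
`draw_learning_curve` comes from the third-party `sk_modelcurves` package. If that package is not available, roughly the same numbers can be obtained with scikit-learn's own `learning_curve`. The sketch below uses synthetic data from `make_classification`, not the wine data, and prints the scores instead of plotting them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic stand-in for a standardized feature matrix and binary target.
X, y = make_classification(n_samples=400, n_features=9, random_state=0)
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Scores are returned with one row per training size and one column per CV split.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(), X, y, cv=cv, train_sizes=np.linspace(0.2, 1.0, 5))

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(n, round(tr, 3), round(va, 3))
```

A small gap between the training and validation columns suggests low variance; if both plateau at a modest score, that points to bias instead.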

In [None]:
#Coefficients for Quality Red Wine Formula

pd.DataFrame({'features': red_X.columns, 'coefs': best_lr1.coef_[0]}).sort_values(by='coefs',ascending=False)

For red wine, alcohol seems to be the biggest contributor to quality, with sulphates, total sulfur dioxide and volatile acidity as co-factors; the effects of sulphates and total sulfur dioxide might offset each other.

In [None]:
#Coefficients for Quality White Wine Formula

pd.DataFrame({'features': white_X.columns, 'coefs': best_lr2.coef_[0]}).sort_values(by='coefs',ascending=False)

For white wine, alcohol and residual sugar play a huge part in wine quality, while density and volatile acidity contribute negatively to the overall quality.
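
Because the features were standardized, each coefficient is the change in the log-odds of a "good" label per one standard deviation of that feature, and exponentiating it gives an odds ratio. A minimal sketch with hypothetical coefficients (not the fitted values from the tables above):

```python
import numpy as np

# Hypothetical coefficients for three features, for illustration only.
coef = np.array([0.9, -0.6, 0.0])

# exp(coef) is the multiplicative change in the odds per one standard deviation.
odds_ratio = np.exp(coef)

print(odds_ratio.round(2))
```

A ratio above 1 raises the odds of a good rating, a ratio below 1 lowers them, and a ratio of exactly 1 means no effect.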

Apply Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf1 = RandomForestClassifier(random_state=1)
rf2 = RandomForestClassifier(random_state=1)

In [None]:
rf1.fit(X_train_r, y_train_r)
rf2.fit(X_train_w, y_train_w)


In [None]:
features = red_X_std.columns
feature_importances = rf1.feature_importances_

features_red = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
features_red.sort_values('Importance Score', inplace=True, ascending=False)

features_red

The top features contributing to red wine quality are alcohol, volatile acidity, sulphates and total sulfur dioxide.

In [None]:
features = white_X_std.columns
feature_importances = rf2.feature_importances_

features_white = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
features_white.sort_values('Importance Score', inplace=True, ascending=False)

features_white

The top features contributing to white wine quality are alcohol, volatile acidity, density, residual sugar and total sulfur dioxide.
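
The scores above are impurity-based importances, which are normalized to sum to 1. One way to cross-check a ranking is permutation importance, which measures how much the score drops when a column is shuffled. The sketch below runs both on synthetic data where only the first column carries signal; it is an illustration, not a rerun of the wine models.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))      # synthetic features
y = (X[:, 0] > 0).astype(int)      # only column 0 determines the label

rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)
perm = permutation_importance(rf, X, y, n_repeats=5, random_state=1)

print(rf.feature_importances_.round(3))   # impurity-based scores
print(perm.importances_mean.round(3))     # mean accuracy drop after shuffling a column
```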

In [None]:
parameters_rf = {'n_estimators':range(1,10,1), 'min_samples_split':range(2,30),'min_samples_leaf':range(1,20)}

clf1 = GridSearchCV(rf1, parameters_rf)
clf1.fit(X_train_r, y_train_r)


In [None]:
clf2 = GridSearchCV(rf2, parameters_rf)
clf2.fit(X_train_w, y_train_w)

In [None]:
clf1.best_params_

In [None]:
clf2.best_params_

In [None]:
best_rf1 = clf1.best_estimator_
best_rf2 = clf2.best_estimator_

In [None]:
print(classification_report(y_test_r, best_rf1.predict(X_test_r)))
#Cross-validate on the training labels; X_train_r and y_test_r have different lengths
score1 = cross_val_score(best_rf1, X_train_r, y_train_r, cv=cv1)
print('CV score {}, Average score {}'.format(score1, score1.mean()))
best_rf1.score(X_test_r, y_test_r)

In [None]:
print(classification_report(y_test_w, best_rf2.predict(X_test_w)))
#Cross-validate on the training labels; X_train_w and y_test_w have different lengths
score2 = cross_val_score(best_rf2, X_train_w, y_train_w, cv=cv2)
print('CV score {}, Average score {}'.format(score2, score2.mean()))
best_rf2.score(X_test_w, y_test_w)

In [None]:
t1 = %timeit -o best_rf1.fit(X_train_r, y_train_r)

In [None]:
t2 = %timeit -o best_rf2.fit(X_train_w, y_train_w)

In [None]:
print('Random Forest Learning Curve for Red Wine')

draw_learning_curve(best_rf1,X_train_r, y_train_r, cv=cv1);

In [None]:
print('Random Forest Learning Curve for White Wine')

draw_learning_curve(best_rf2,X_train_w, y_train_w, cv=cv2);

**Unsupervised Learning**

Apply k-Means Clustering

In [None]:
#Add one column to each dataset so they can be identified after the merge.

#Type 1 is White Wine; Type 0 is Red Wine.

data_white['type'] =1

data_red['type'] =0

In [None]:
#Concatenate two original datasets

wine = pd.concat([data_white,data_red])

#Take a look at the new dataset

#wine.drop(['bin_quality'],axis=1)

wine.head()

In [None]:
wine.info()

In [None]:
wine.describe()

In [None]:
from sklearn.cluster import KMeans

In [None]:
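#Note: this scales every column of wine, including 'quality', 'bin_quality' and 'type'
wine_scale = stdsc.fit_transform(wine)

#Keep the column names so the scaled frame stays readable
wine_scale = pd.DataFrame(wine_scale, columns=wine.columns)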

In [None]:
wine_scale.head()

In [None]:
#plt.figure(figsize=(20,10))
#pd.tools.plotting.parallel_coordinates(wine_scale, 'type')

In [None]:
%%time

km = KMeans(n_clusters=2, n_init=20)
cluster_labels = km.fit_predict(wine_scale)

In [None]:
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters

# The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. 
# Negative values generally indicate that a sample has been assigned to the wrong cluster, 
# as a different cluster is more similar.
from sklearn.metrics import silhouette_score
print(silhouette_score(wine_scale, cluster_labels))
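
Besides the silhouette score, the two clusters can be compared with a known grouping such as the `type` column. The sketch below does this with the adjusted Rand index on synthetic, well-separated blobs. It assumes the known label is kept out of the clustering input, and the blob centers are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Two synthetic groups standing in for the two wine types.
X, true_type = make_blobs(n_samples=300, centers=[[0, 0], [8, 8]],
                          cluster_std=1.0, random_state=0)

labels = KMeans(n_clusters=2, n_init=20, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)            # cohesion vs separation, in [-1, 1]
ari = adjusted_rand_score(true_type, labels) # agreement with the known grouping

print(round(sil, 3), round(ari, 3))
```

The adjusted Rand index ignores how the clusters are numbered, so it does not matter whether k-means calls the white wine cluster 0 or 1.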

In [None]:
wine_scale.head()