# DATA201 - Assignment 4

Please use this page http://apps.ecs.vuw.ac.nz/submit/DATA201 for submission, and submit only this single Jupyter notebook with your code added at the appropriate places.

The due date is **Saturday 30th May, before midnight**.

The dataset for this assignment is file **whitewine.csv** which is provided with this notebook.

Please choose menu items *Kernel => Restart & Run All* then *File => Save and Checkpoint* in Jupyter before submission.

## Dataset

The dataset was adapted from the Wine Quality Dataset (https://archive.ics.uci.edu/ml/datasets/Wine+Quality)

### Attribute Information:

For more information, read [Cortez et al., 2009: http://dx.doi.org/10.1016/j.dss.2009.05.016].

Input variables (based on physicochemical tests):

    1 - fixed acidity 
    2 - volatile acidity 
    3 - citric acid 
    4 - residual sugar 
    5 - chlorides 
    6 - free sulfur dioxide 
    7 - total sulfur dioxide 
    8 - density 
    9 - pH 
    10 - sulphates 
    11 - alcohol 
Output variable (based on sensory data):

    12 - quality (0: normal wine, 1: good wine)
    
## Problem statement
Predict the quality of a wine given its input variables. Use AUC (area under the receiver operating characteristic curve) as the evaluation metric.
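
As a quick illustration of the metric, the sketch below computes AUC with `roc_auc_score` on four made-up labels and scores (hypothetical values, not taken from `whitewine.csv`). It also checks the ranking interpretation: AUC is the fraction of (good, normal) pairs in which the good wine receives the higher score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores, for illustration only
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)

# Ranking view: share of (positive, negative) pairs ranked correctly
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(auc, pairwise)  # both are 0.75 for these four points
```

Because AUC depends only on how the scores rank the samples, it should be computed from probabilities or decision scores rather than from hard 0/1 predictions.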

First, let's load and explore the dataset.

In [1]:
import numpy as np
import pandas as pd

# Call the function; assigning to it would replace it without seeding anything
np.random.seed(42)

In [2]:
data = pd.read_csv("whitewine.csv")
data.head()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 0 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 0 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 0 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 0 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 0 |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4715 entries, 0 to 4714
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         4715 non-null   float64
 1   volatile_acidity      4715 non-null   float64
 2   citric_acid           4715 non-null   float64
 3   residual_sugar        4715 non-null   float64
 4   chlorides             4715 non-null   float64
 5   free_sulfur_dioxide   4715 non-null   float64
 6   total_sulfur_dioxide  4715 non-null   float64
 7   density               4715 non-null   float64
 8   pH                    4715 non-null   float64
 9   sulphates             4715 non-null   float64
 10  alcohol               4715 non-null   float64
 11  quality               4715 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 442.2 KB


In [4]:
data["quality"].value_counts()

0    3655
1    1060
Name: quality, dtype: int64

Please note that this dataset is unbalanced: only 1060 of the 4715 wines (about 22%) are labelled good.
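
To see why the imbalance matters for the choice of metric, the sketch below rebuilds a label vector from the class counts printed above and scores a trivial baseline that calls every wine "normal". The baseline is an illustrative assumption, not part of the assignment.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Class counts taken from the value_counts() output above
n_normal, n_good = 3655, 1060
y = np.array([0] * n_normal + [1] * n_good)

# Trivial baseline: predict "normal" (0) for every wine
constant_pred = np.zeros_like(y)

baseline_accuracy = accuracy_score(y, constant_pred)
baseline_auc = roc_auc_score(y, constant_pred.astype(float))

print(f"accuracy: {baseline_accuracy:.3f}, AUC: {baseline_auc:.3f}")
```

The baseline reaches roughly 78% accuracy without learning anything, while its AUC stays at 0.5, which is one reason AUC is a more informative metric here.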

## Questions and Code

**[1]. Split the given data using stratified sampling into 2 subsets: training (80%) and test (20%) sets. Use random_state = 42. [1 point]**

In [5]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size = 0.2, stratify=data["quality"], random_state = 42)

In [6]:
X_train = train.drop("quality", axis=1)
y_train = train["quality"].copy()

X_test = test.drop("quality", axis=1)
y_test = test["quality"].copy()
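
A simple sanity check for a stratified split is to compare the share of good wines in each subset. The sketch below does this on a stand-in frame (`demo`, a hypothetical name) that only mimics the class counts of the real data, so it runs without `whitewine.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the same class counts as the real dataset (illustration only)
demo = pd.DataFrame({
    "feature": np.arange(4715),
    "quality": [0] * 3655 + [1] * 1060,
})

demo_train, demo_test = train_test_split(
    demo, test_size=0.2, stratify=demo["quality"], random_state=42
)

# With stratification, the positive rate should be nearly identical everywhere
rates = {
    "full": demo["quality"].mean(),
    "train": demo_train["quality"].mean(),
    "test": demo_test["quality"].mean(),
}
print(rates)
```

The same check can be applied to the real split with `y_train.mean()` and `y_test.mean()`.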

**[2]. Use ``GridSearchCV`` and ``Pipeline`` to tune hyper-parameters for 3 different classifiers including ``KNeighborsClassifier``, ``LogisticRegression`` and ``svm.SVC`` and report the corresponding AUC values on the training and test sets. Note that a scaler may need to be inserted into each pipeline. [6 points]**

Hint: You may want to use `kernel='rbf'` and tune `C` and `gamma` for `svm.SVC`. Find out how to enable probability estimates (for Question 3).

Document: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
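
As a minimal sketch of the hint, passing `probability=True` to `SVC` makes `predict_proba` available after fitting. The data below is synthetic (generated with `make_classification`) and the variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# probability=True fits an extra calibration step so predict_proba is available
svc = SVC(kernel="rbf", probability=True, random_state=42).fit(X_tr, y_tr)

proba = svc.predict_proba(X_te)  # one column per class
auc_from_proba = roc_auc_score(y_te, proba[:, 1])

# decision_function also yields a ranking score and needs no probability estimates
auc_from_margin = roc_auc_score(y_te, svc.decision_function(X_te))

print(proba.shape, auc_from_proba, auc_from_margin)
```

Enabling probability estimates slows down fitting because of the internal cross-validation, but a soft `VotingClassifier` needs `predict_proba`, so it is required for Question 3.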

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, StratifiedKFold
from sklearn.metrics import roc_auc_score

In [8]:
from sklearn.neighbors import KNeighborsClassifier

pipeline_kNN = Pipeline(
                [("scaler", MinMaxScaler()),
                 ("classifier", KNeighborsClassifier())])

parameters_grid_kNN = {'classifier__n_neighbors': range(1,6)}

grid_cv_kNN = GridSearchCV(pipeline_kNN, parameters_grid_kNN, scoring = 'roc_auc', n_jobs=-1, return_train_score=True,
                                                  cv = StratifiedKFold(n_splits=5, shuffle=True, random_state = 42))

#grid_cv_kNN = GridSearchCV(pipeline_kNN, parameters_grid_kNN, scoring = 'roc_auc', n_jobs=-1, return_train_score=True, cv = 5)
    
grid_cv_kNN.fit(X_train, y_train)

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=True),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('scaler',
                                        MinMaxScaler(copy=True,
                                                     feature_range=(0, 1))),
                                       ('classifier',
                                        KNeighborsClassifier(algorithm='auto',
                                                             leaf_size=30,
                                                             metric='minkowski',
                                                             metric_params=None,
                                                             n_jobs=None,
                                                             n_neighbors=5, p=2,
                                                             weights='uniform'))],
                                verbose=False),
  

In [9]:
print('k-NN optimal parameters: ', grid_cv_kNN.best_params_)

k-NN optimal parameters:  {'classifier__n_neighbors': 5}


In [10]:
print('k-NN train AUC: ', roc_auc_score(y_train, grid_cv_kNN.predict_proba(X_train)[:,1]))
print('k-NN test AUC: ', roc_auc_score(y_test, grid_cv_kNN.predict_proba(X_test)[:,1]))

k-NN train AUC:  0.9353891348114498
k-NN test AUC:  0.845749554758279


In [11]:
from sklearn.linear_model import LogisticRegression

pipeline_LR = Pipeline(
                [("scaler", MinMaxScaler()),
                 ("classifier", LogisticRegression())])

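# Note: the default 'lbfgs' solver does not support the 'l1' penalty, so those
# candidates fail to fit and score NaN. Pass solver='liblinear' or 'saga' to
# LogisticRegression if the 'l1' candidates should actually be evaluated.
parameters_grid_LR = {'classifier__penalty' : ['l1', 'l2'],
                      'classifier__C' : np.logspace(-4, 4, 20)}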

grid_cv_LR = GridSearchCV(pipeline_LR, parameters_grid_LR, scoring = 'roc_auc', n_jobs=-1, return_train_score=True,
                                                  cv = StratifiedKFold(n_splits=5, shuffle=True, random_state = 42))

grid_cv_LR.fit(X_train, y_train)

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=True),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('scaler',
                                        MinMaxScaler(copy=True,
                                                     feature_range=(0, 1))),
                                       ('classifier',
                                        LogisticRegression(C=1.0,
                                                           class_weight=None,
                                                           dual=False,
                                                           fit_intercept=True,
                                                           intercept_scaling=1,
                                                           l1_ratio=None,
                                                           max_iter=100,
                                                           multi_class='auto',
    

In [12]:
print('LR optimal parameters: ', grid_cv_LR.best_params_)

LR optimal parameters:  {'classifier__C': 545.5594781168514, 'classifier__penalty': 'l2'}


In [13]:
print('LR train AUC: ', roc_auc_score(y_train, grid_cv_LR.predict_proba(X_train)[:,1]))
print('LR test AUC: ', roc_auc_score(y_test, grid_cv_LR.predict_proba(X_test)[:,1]))

LR train AUC:  0.7867796279327878
LR test AUC:  0.7986991198410036


In [14]:
from sklearn import svm

pipeline_SVM = Pipeline(
                [("scaler", MinMaxScaler()),
                 ("classifier", svm.SVC(kernel='rbf', probability=True))])

parameters_grid_SVM = {'classifier__gamma': [ 0.1, 1, 10, 100, 1000],
                       'classifier__C': [0.01, 0.1, 1, 10, 100]}

grid_cv_SVM = GridSearchCV(pipeline_SVM, parameters_grid_SVM, scoring = 'roc_auc', n_jobs=-1, return_train_score=True,
                                                    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state = 42))

grid_cv_SVM.fit(X_train, y_train)

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=True),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('scaler',
                                        MinMaxScaler(copy=True,
                                                     feature_range=(0, 1))),
                                       ('classifier',
                                        SVC(C=1.0, break_ties=False,
                                            cache_size=200, class_weight=None,
                                            coef0=0.0,
                                            decision_function_shape='ovr',
                                            degree=3, gamma='scale',
                                            kernel='rbf', max_iter=-1,
                                            probability=True, random_state=None,
                                            shrinking=True, tol=0.001,
                                  

In [15]:
print('SVM optimal parameters: ', grid_cv_SVM.best_params_)

SVM optimal parameters:  {'classifier__C': 1, 'classifier__gamma': 100}


In [16]:
grid_cv_SVM.best_score_

0.8733639819607664

In [17]:
print('SVM train AUC: ', roc_auc_score(y_train, grid_cv_SVM.predict_proba(X_train)[:,1]))
print('SVM test AUC: ', roc_auc_score(y_test, grid_cv_SVM.predict_proba(X_test)[:,1]))

SVM train AUC:  0.9991603321890405
SVM test AUC:  0.9088512763596005


**[3]. Train a soft ``VotingClassifier`` whose estimators are the three tuned pipelines obtained from [2]. Report the AUC values on the training and test sets. Comment on the performance of the ensemble model. [1 point]**

Hint: consider the voting method.

Document: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
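
To make the voting method concrete: with `voting='soft'`, the ensemble averages the class probabilities of its fitted members. The sketch below verifies this on synthetic data with two simple estimators (chosen only to keep the example fast).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

voter = VotingClassifier(
    estimators=[("knn", KNeighborsClassifier()), ("lr", LogisticRegression())],
    voting="soft",
).fit(X_demo, y_demo)

# Soft voting: unweighted mean of the members' predicted probabilities
manual = np.mean([est.predict_proba(X_demo) for est in voter.estimators_], axis=0)
ensemble = voter.predict_proba(X_demo)

print(np.allclose(manual, ensemble))
```

Hard voting, by contrast, counts predicted labels and does not expose `predict_proba`, so it cannot be scored with AUC in the same way.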

In [18]:
from sklearn.ensemble import VotingClassifier

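# Note: each GridSearchCV object passed here is cloned and its search is re-run
# inside VotingClassifier.fit; grid_cv_*.best_estimator_ would reuse the tuned pipelines.
clf1 = VotingClassifier(estimators=[('knn', grid_cv_kNN), ('lr', grid_cv_LR), ('svm', grid_cv_SVM)], voting='soft')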
clf1 = clf1.fit(X_train, y_train)

In [19]:
print('VotingClassifier train AUC: ', roc_auc_score(y_train, clf1.predict_proba(X_train)[:,1]))
print('VotingClassifier test AUC: ', roc_auc_score(y_test, clf1.predict_proba(X_test)[:,1]))

VotingClassifier train AUC:  0.9946962193170379
VotingClassifier test AUC:  0.9170688898639754


**[4]. Redo [3] with a sensible set of ``weights`` for the estimators. Comment on the performance of the ensemble model in this case. [1 point]**
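
The effect of `weights` is a weighted average of the members' probabilities. The arithmetic can be sketched with made-up probabilities for a single wine (hypothetical numbers, not model output):

```python
import numpy as np

# Hypothetical P(good) from k-NN, LR and SVM for one wine
p_good = np.array([0.4, 0.3, 0.9])

# Equal weights: plain soft voting
unweighted = np.average(p_good)

# Doubling the weight of the third model pulls the average towards its opinion
weighted = np.average(p_good, weights=[1, 1, 2])

print(round(unweighted, 4), round(weighted, 4))  # 0.5333 0.625
```

A sensible choice is therefore to give more weight to the estimator that performed best on its own, ideally judged by cross-validation rather than by the test set.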

In [20]:
clf2 = VotingClassifier(estimators=[('knn', grid_cv_kNN), ('lr', grid_cv_LR), ('svm', grid_cv_SVM)], voting='soft',
                        weights=[1,1,2])
clf2 = clf2.fit(X_train, y_train)

In [21]:
print('VotingClassifier train AUC: ', roc_auc_score(y_train, clf2.predict_proba(X_train)[:,1]))
print('VotingClassifier test AUC: ', roc_auc_score(y_test, clf2.predict_proba(X_test)[:,1]))

VotingClassifier train AUC:  0.9974975318121984
VotingClassifier test AUC:  0.9217665126603515


**[5]. Use the ``VotingClassifier`` with ``GridSearchCV`` to tune the hyper-parameters of the individual estimators. The parameter grid should be a combination of those in [2]. Report the AUC values on the training and test sets. Comment on the performance of the ensemble model. [1 point]**

Document: https://scikit-learn.org/stable/modules/ensemble.html#using-the-votingclassifier-with-gridsearchcv
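
Before launching the combined search it helps to count the candidates, because the grids multiply rather than add. The sketch below uses `ParameterGrid` on the same value lists as in [2]; the parameter names follow the `<estimator name>__<pipeline step>__<parameter>` pattern and assume the estimator names `knn`, `lr` and `svm`.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Combined grid: same value lists as the three separate searches in [2]
params = {
    "knn__classifier__n_neighbors": list(range(1, 6)),
    "lr__classifier__penalty": ["l1", "l2"],
    "lr__classifier__C": np.logspace(-4, 4, 20),
    "svm__classifier__gamma": [0.1, 1, 10, 100, 1000],
    "svm__classifier__C": [0.01, 0.1, 1, 10, 100],
}

n_candidates = len(ParameterGrid(params))  # 5 * 2 * 20 * 5 * 5
n_folds = 5
n_fits = n_candidates * n_folds            # excludes the final refit

print(n_candidates, n_fits)  # 5000 25000
```

The separate searches in [2] evaluate only 5 + 40 + 25 = 70 candidates in total, so the joint search is far more expensive. `RandomizedSearchCV` or a coarser grid would be a reasonable alternative if the run time is a concern.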

In [32]:
from datetime import datetime
start = datetime.now()

clf3 = VotingClassifier(estimators=[('knn', pipeline_kNN), ('lr', pipeline_LR), ('svm', pipeline_SVM)], voting='soft')

params = {'knn__classifier__n_neighbors': range(1,6),
          'lr__classifier__penalty' : ['l1', 'l2'], 'lr__classifier__C' : np.logspace(-4, 4, 20),
          'svm__classifier__gamma': [ 0.1, 1, 10, 100, 1000], 'svm__classifier__C': [0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(estimator=clf3, param_grid=params, scoring = 'roc_auc', n_jobs=-1, return_train_score=True,
                                                    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state = 42))
grid = grid.fit(X_train, y_train)

print(datetime.now()-start)

0:34:41.395953


In [33]:
print('VotingClassifier optimal parameters: ', grid.best_params_)
print('VotingClassifier train AUC: ', roc_auc_score(y_train, grid.predict_proba(X_train)[:,1]))
print('VotingClassifier test AUC: ', roc_auc_score(y_test, grid.predict_proba(X_test)[:,1]))

VotingClassifier optimal parameters:  {'knn__classifier__n_neighbors': 1, 'lr__classifier__C': 0.23357214690901212, 'lr__classifier__penalty': 'l2', 'svm__classifier__C': 0.01, 'svm__classifier__gamma': 1000}
VotingClassifier train AUC:  1.0
VotingClassifier test AUC:  0.9310004387889426
