# DATA201 - Assignment 4

Please use this page http://apps.ecs.vuw.ac.nz/submit/DATA201 for submission, and submit only this single Jupyter notebook with your code added at the appropriate places.

The due date is **Saturday 30th May, before midnight**.

The dataset for this assignment is the file **whitewine.csv**, which is provided with this notebook.

Please choose menu items *Kernel => Restart & Run All* then *File => Save and Checkpoint* in Jupyter before submission.

## Dataset

The dataset was adapted from the Wine Quality Dataset (https://archive.ics.uci.edu/ml/datasets/Wine+Quality).

### Attribute Information:

For more information, read [Cortez et al., 2009: http://dx.doi.org/10.1016/j.dss.2009.05.016].

Input variables (based on physicochemical tests):

    1 - fixed acidity 
    2 - volatile acidity 
    3 - citric acid 
    4 - residual sugar 
    5 - chlorides 
    6 - free sulfur dioxide 
    7 - total sulfur dioxide 
    8 - density 
    9 - pH 
    10 - sulphates 
    11 - alcohol 
Output variable (based on sensory data):

    12 - quality (0: normal wine, 1: good wine)
    
## Problem statement
Predict the quality of a wine given its input variables. Use AUC (area under the receiver operating characteristic curve) as the evaluation metric.
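
As a rough illustration of the metric, AUC can be read as the fraction of (good wine, normal wine) pairs in which the good wine receives the higher score. The sketch below uses made-up labels and scores (they are not taken from whitewine.csv) and compares `roc_auc_score` with that pairwise count.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores (illustrative values, not taken from whitewine.csv).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC as reported by scikit-learn.
auc = roc_auc_score(y_true, y_score)

# The same quantity computed by hand: the fraction of (positive, negative)
# pairs in which the positive example receives the higher score
# (a tie counts as half a pair).
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
auc_by_hand = sum(pairs) / len(pairs)

print(auc, auc_by_hand)
```

Here three of the four pairs are ordered correctly, so both values are 0.75. Because only the ordering of the scores matters, AUC does not depend on a particular decision threshold.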

First, let's load and explore the dataset.

In [1]:
import numpy as np
import pandas as pd

# Seed NumPy's global random number generator (call the function rather than overwrite it).
np.random.seed(42)

In [2]:
data = pd.read_csv("whitewine.csv")
data.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,0
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,0
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,0
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,0
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4715 entries, 0 to 4714
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         4715 non-null   float64
 1   volatile_acidity      4715 non-null   float64
 2   citric_acid           4715 non-null   float64
 3   residual_sugar        4715 non-null   float64
 4   chlorides             4715 non-null   float64
 5   free_sulfur_dioxide   4715 non-null   float64
 6   total_sulfur_dioxide  4715 non-null   float64
 7   density               4715 non-null   float64
 8   pH                    4715 non-null   float64
 9   sulphates             4715 non-null   float64
 10  alcohol               4715 non-null   float64
 11  quality               4715 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 442.1 KB


In [4]:
data["quality"].value_counts()

0    3655
1    1060
Name: quality, dtype: int64

Please note that this dataset is imbalanced: 3655 normal wines (about 78%) versus 1060 good wines (about 22%).
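
To see why the imbalance matters, the following sketch rebuilds a label vector from the class counts printed by `value_counts()` and scores a trivial "always normal" predictor. The names `y_all` and `always_normal` are illustrative only and are not part of the assignment code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Labels with the class counts reported by value_counts():
# 3655 normal wines (0) and 1060 good wines (1).
y_all = np.array([0] * 3655 + [1] * 1060)

# A trivial "model" that always predicts the majority class.
always_normal = np.zeros_like(y_all)

baseline_accuracy = accuracy_score(y_all, always_normal)
baseline_auc = roc_auc_score(y_all, always_normal)
print(round(baseline_accuracy, 4), baseline_auc)
```

The constant predictor reaches roughly 77.5% accuracy without learning anything, while its AUC is 0.5. This is one reason to prefer AUC over accuracy for this dataset.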

## Questions and Code

**[1]. Split the given data using stratified sampling into 2 subsets: training (80%) and test (20%) sets. Use random_state = 42. [1 point]**

In [5]:
from sklearn.model_selection import train_test_split

x = data.drop(['quality'], axis=1)
y = data['quality']

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=42, stratify=data['quality'])
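
A quick way to check that stratification worked is to compare the share of good wines in each subset. The sketch below does this on a synthetic stand-in with the same class counts as the `quality` column; `X_demo` and `y_demo` are placeholder names, and the single feature carries no meaning.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the target column.
y_demo = np.array([0] * 3655 + [1] * 1060)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# With stratification, the share of positives is nearly identical
# in the full data, the training set and the test set.
print(len(y_tr), len(y_te))
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

The same check can be applied to the real split by comparing `y_train.mean()` and `y_test.mean()`.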

**[2]. Use ``GridSearchCV`` and ``Pipeline`` to tune hyper-parameters for 3 different classifiers including ``KNeighborsClassifier``, ``LogisticRegression`` and ``svm.SVC`` and report the corresponding AUC values on the training and test sets. Note that a scaler may need to be inserted into each pipeline. [6 points]**

Hint: You may want to use `kernel='rbf'` and tune `C` and `gamma` for `svm.SVC`. Find out how to enable probability estimates (for Question 3).

Document: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

In [6]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, x_train.columns)])

In [7]:
clf1 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

clf1.fit(x_train, y_train)
print(clf1.score(x_test, y_test))

param_grid = {
    'classifier__C': [0.1, 1.0, 10, 100],
}

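# No `scoring` argument is passed, so candidates are ranked by accuracy (the default for classifiers).
grid_search = GridSearchCV(clf1, param_grid, cv=10)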
grid_search.fit(x_train, y_train)

print('best params')
print(grid_search.best_params_)

print('score')
print(grid_search.score(x_test, y_test))

print('AUC:')
print('train')
print(roc_auc_score(y_train, grid_search.predict_proba(x_train)[:,1]))

print('test')
print(roc_auc_score(y_test, grid_search.predict_proba(x_test)[:,1]))

0.7889713679745494
best params
{'classifier__C': 100}
score
0.7889713679745494
AUC:
train
0.7868054390470537
test
0.7986539503910385


In [8]:
clf2 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', KNeighborsClassifier())])
clf2.fit(x_train, y_train)
print(clf2.score(x_test, y_test))

param_grid = {
    'classifier__n_neighbors': [3, 5, 7, 10],
}

grid_search = GridSearchCV(clf2, param_grid, cv=10)
grid_search.fit(x_train, y_train)

print('best params')
print(grid_search.best_params_)

print('score')
print(grid_search.score(x_test, y_test))

print('AUC:')
print('train')
print(roc_auc_score(y_train, grid_search.predict_proba(x_train)[:,1]))

print('test')
print(roc_auc_score(y_test, grid_search.predict_proba(x_test)[:,1]))

0.8165429480381761
best params
{'classifier__n_neighbors': 3}
score
0.8250265111346765
AUC:
train
0.965423995947655
test
0.8494695816018379


In [9]:
clf3 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', SVC(kernel='rbf', probability=True))])
clf3.fit(x_train, y_train)
print(clf3.score(x_test, y_test))

param_grid = {
    'classifier__C': [0.1, 1.0, 10],
    'classifier__gamma': ['scale', 'auto']
}

grid_search = GridSearchCV(clf3, param_grid, cv=10)
grid_search.fit(x_train, y_train)

print('best params')
print(grid_search.best_params_)

print('score')
print(grid_search.score(x_test, y_test))

print('AUC:')
print('train')
print(roc_auc_score(y_train, grid_search.predict_proba(x_train)[:,1]))

print('test')
print(roc_auc_score(y_test, grid_search.predict_proba(x_test)[:,1]))

0.8197242841993637
best params
{'classifier__C': 10, 'classifier__gamma': 'scale'}
score
0.8335100742311771
AUC:
train
0.9378488533412488
test
0.8485468342668353
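
The searches in this section call `GridSearchCV` without a `scoring` argument, so hyper-parameters are selected by accuracy. Since the problem statement names AUC as the evaluation metric, one option is to pass `scoring='roc_auc'`. The sketch below shows the idea on synthetic, imbalanced data from `make_classification` (an assumption made to keep the example self-contained; it is not whitewine.csv), with hypothetical names `pipe`, `search` and `tuned_lr`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not whitewine.csv).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=11, weights=[0.78, 0.22], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn)

pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('classifier', LogisticRegression(max_iter=1000))])

# scoring='roc_auc' makes the search rank candidates by cross-validated AUC
# instead of the default estimator score (accuracy for classifiers).
search = GridSearchCV(pipe, {'classifier__C': [0.1, 1.0, 10, 100]},
                      scoring='roc_auc', cv=5)
search.fit(X_tr, y_tr)

tuned_lr = search.best_estimator_     # refit on all of X_tr, kept for reuse
cv_auc = search.best_score_           # mean cross-validated AUC of the best candidate
test_auc = search.score(X_te, y_te)   # AUC on the held-out set
print(search.best_params_, cv_auc, test_auc)
```

Storing `best_estimator_` under its own name also keeps the tuned pipeline available after the `search` variable is reused for the next classifier.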


**[3]. Train a soft ``VotingClassifier`` whose estimators are the three tuned pipelines obtained from [2]. Report the AUC values on the training and test sets. Comment on the performance of the ensemble model. [1 point]**

Hint: consider the voting method.

Document: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier

In [10]:
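from sklearn.ensemble import VotingClassifier
# Note: clf1, clf2 and clf3 are the pipelines as defined above, with default hyper-parameters;
# the tuned settings found by the grid searches in [2] are not carried over here.
vclf = VotingClassifier(estimators=[('lr', clf1), ('knn', clf2), ('svc', clf3)], voting='soft')
vclf = vclf.fit(x_train,y_train)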

print('score')
print(vclf.score(x_test, y_test))

print('AUC:')
print('train')
print(roc_auc_score(y_train, vclf.predict_proba(x_train)[:,1]))

print('test')
print(roc_auc_score(y_test, vclf.predict_proba(x_test)[:,1]))

# The accuracy and training AUC are close to the averages of the three individual classifiers,
# but the test AUC (0.880) is higher than that of any individual classifier.

score
0.8260869565217391
AUC:
train
0.9282265506026894
test
0.880068657563947
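
Question [3] asks for the tuned pipelines as ensemble members. One possible way to carry the tuned settings into the ensemble is to keep the `best_estimator_` of each grid search and pass those to `VotingClassifier`. The sketch below is a minimal version on synthetic data (not whitewine.csv); the names `candidates`, `tuned` and `ensemble` are hypothetical, and the small grids are chosen only to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (not whitewine.csv).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=11, weights=[0.78, 0.22], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn)

# One (model, grid) pair per ensemble member.
candidates = {
    'lr': (LogisticRegression(max_iter=1000), {'classifier__C': [0.1, 1.0, 10]}),
    'knn': (KNeighborsClassifier(), {'classifier__n_neighbors': [3, 5, 7]}),
    'svc': (SVC(kernel='rbf', probability=True, random_state=0),
            {'classifier__C': [0.1, 1.0, 10]}),
}

# Tune each pipeline separately and keep the refitted best pipeline.
tuned = []
for name, (model, grid) in candidates.items():
    pipe = Pipeline(steps=[('scaler', StandardScaler()), ('classifier', model)])
    search = GridSearchCV(pipe, grid, scoring='roc_auc', cv=3)
    search.fit(X_tr, y_tr)
    tuned.append((name, search.best_estimator_))

# Soft voting averages the members' predicted class probabilities.
ensemble = VotingClassifier(estimators=tuned, voting='soft')
ensemble.fit(X_tr, y_tr)
ensemble_auc = roc_auc_score(y_te, ensemble.predict_proba(X_te)[:, 1])
print(ensemble_auc)
```

`VotingClassifier` clones and refits its members, so the tuned hyper-parameters are preserved while the fitted state is rebuilt on the training data.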


**[4]. Redo [3] with a sensible set of ``weights`` for the estimators. Comment on the performance of the ensemble model in this case. [1 point]**

In [11]:
from sklearn.ensemble import VotingClassifier
vclf = VotingClassifier(estimators=[('lr', clf1), ('knn', clf2), ('svc', clf3)], voting='soft', weights=[1, 2, 1])
vclf = vclf.fit(x_train,y_train)

print('score')
print(vclf.score(x_test, y_test))

print('AUC:')
print('train')
print(roc_auc_score(y_train, vclf.predict_proba(x_train)[:,1]))

print('test')
print(roc_auc_score(y_test, vclf.predict_proba(x_test)[:,1]))

# With weights [1, 2, 1], the accuracy and training AUC increase slightly compared with the previous
# question, while the test AUC decreases slightly (0.877 vs 0.880).

score
0.8335100742311771
AUC:
train
0.9387623248070619
test
0.8770423044162817
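
To make the effect of `weights` concrete: with soft voting, the ensemble probability is the weighted average of the members' predicted probabilities. The sketch below checks this on synthetic data (not whitewine.csv). Scaling is omitted for brevity, and `member_probas` and `manual` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (not whitewine.csv); scaling omitted for brevity.
X_syn, y_syn = make_classification(n_samples=300, n_features=11, random_state=0)

weights = [1, 2, 1]
ensemble = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('knn', KNeighborsClassifier(n_neighbors=5)),
                ('svc', SVC(kernel='rbf', probability=True, random_state=0))],
    voting='soft', weights=weights)
ensemble.fit(X_syn, y_syn)

# Probabilities from each fitted member, shape (3, n_samples, n_classes).
member_probas = np.array(
    [est.predict_proba(X_syn) for est in ensemble.estimators_])

# Weighted average over the members, using the same weights as the ensemble.
manual = np.average(member_probas, axis=0, weights=weights)
matches = np.allclose(manual, ensemble.predict_proba(X_syn))
print(matches)
```

A weight of 2 therefore means that member's probabilities count twice in the average, which is why weighting a strongly overfitting member more heavily can raise the training AUC without helping the test AUC.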


**[5]. Use the ``VotingClassifier`` with ``GridSearchCV`` to tune the hyper-parameters of the individual estimators. The parameter grid should be a combination of those in [2]. Report the AUC values on the training and test sets. Comment on the performance of the ensemble model. [1 point]**

Note that it may take a long time to run your code for this question.

Document: https://scikit-learn.org/stable/modules/ensemble.html#using-the-votingclassifier-with-gridsearchcv

In [12]:
param_grid = {
    'lr__classifier__C': [1.0, 100.0],
    'knn__classifier__n_neighbors': [3, 5, 7, 10],
    'svc__classifier__C': [0.1, 1.0, 10],
    'svc__classifier__gamma': ['scale', 'auto']
}

grid_search = GridSearchCV(estimator=vclf, param_grid=param_grid, cv=5)
grid_search.fit(x_train, y_train)

print('best params')
print(grid_search.best_params_)

print('score')
print(grid_search.score(x_test, y_test))

print('AUC:')
print('train')
print(roc_auc_score(y_train, grid_search.predict_proba(x_train)[:,1]))

print('test')
print(roc_auc_score(y_test, grid_search.predict_proba(x_test)[:,1]))

# Tuning the hyper-parameters inside the ensemble improves the accuracy and both AUC values
# compared with the two previous ensembles.

best params
{'knn__classifier__n_neighbors': 3, 'lr__classifier__C': 100.0, 'svc__classifier__C': 10, 'svc__classifier__gamma': 'auto'}
score
0.8419936373276776
AUC:
train
0.9721836847946725
test
0.892393464625868
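
The keys of the parameter grid follow the pattern `<estimator name>__<pipeline step>__<parameter>`. When in doubt, the valid names can be listed with `get_params()`. The sketch below builds an unfitted ensemble with the same member and step names as above (the helper `make_member` is hypothetical) and filters for the parameters tuned in this notebook.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_member(model):
    """Hypothetical helper: wrap a classifier in a scaler + classifier pipeline."""
    return Pipeline(steps=[('scaler', StandardScaler()), ('classifier', model)])


ensemble = VotingClassifier(
    estimators=[('lr', make_member(LogisticRegression())),
                ('knn', make_member(KNeighborsClassifier())),
                ('svc', make_member(SVC(kernel='rbf', probability=True)))],
    voting='soft')

# get_params() exposes every nested parameter under its full grid-search name.
names = sorted(k for k in ensemble.get_params()
               if k.endswith(('__C', '__gamma', '__n_neighbors')))
print(names)
```

Any of the printed names can be used as a key in the `param_grid` passed to `GridSearchCV` for the ensemble.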
