<h1 align="center"> 
DATS 6202, Fall 2018, Exercise_8 (solution)
</h1>

<h4 align="center"> 
Yuxiao Huang ([yuxiaohuang@gwu.edu](mailto:yuxiaohuang@gwu.edu))
</h4>

## Note
- Complete the missing parts indicated by # Implement me
- We expect you to follow a reasonable programming style. While we do not mandate a specific style, we require that your code be neat, clear, **documented/commented**, and above all consistent. **Marks will be deducted if these requirements are not met.**

## Objective
Students are expected to understand:
- how to use the combination of Pipeline and GridSearchCV for hyperparameter tuning and model selection

## Overview
The only difference between this exercise and exercise 7 is as follows:
- in exercise 7, we manually compared the model under each setting of the hyperparameters
- here, this is taken care of by GridSearchCV
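
The difference can be sketched side by side. The snippet below is a minimal illustration, not the code from exercise 7: it uses synthetic data from `make_classification` as a stand-in for the hepatitis data, and the names `X_demo`, `y_demo`, `pipe` and `gs_demo` are made up for this example. With the same folds and the same grid, the manual loop and GridSearchCV should arrive at the same best mean accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the hepatitis data)
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)

# Scaling and classifier in one pipeline, evaluated on fixed stratified folds
pipe = Pipeline([('scaler', StandardScaler()), ('clf', SVC(random_state=0))])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

Cs = [0.01, 0.1, 1]
kernels = ['linear', 'rbf']

# Manual approach: loop over every setting and keep the best mean CV accuracy
manual_best_score, manual_best_params = -1.0, None
for C in Cs:
    for kernel in kernels:
        pipe.set_params(clf__C=C, clf__kernel=kernel)
        score = cross_val_score(pipe, X_demo, y_demo, scoring='accuracy', cv=cv).mean()
        if score > manual_best_score:
            manual_best_score = score
            manual_best_params = {'clf__C': C, 'clf__kernel': kernel}

# GridSearchCV approach: the same search expressed declaratively
gs_demo = GridSearchCV(estimator=pipe,
                       param_grid={'clf__C': Cs, 'clf__kernel': kernels},
                       scoring='accuracy',
                       cv=cv)
gs_demo.fit(X_demo, y_demo)

print(manual_best_params, manual_best_score)
print(gs_demo.best_params_, gs_demo.best_score_)
```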

## Load the Hepatitis Data

In [1]:
import warnings
warnings.filterwarnings('ignore')
    
import pandas as pd

# Load the data
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data', header=None)

# Specify the name of the columns
df.columns = ['Target', 'AGE', 'SEX', 'STEROID', 'ANTIVIRALS', 'FATIGUE', 'MALAISE', 'ANOREXIA', 'LIVER BIG', 'LIVER FIRM', 'SPLEEN PALPABLE', 'SPIDERS', 'ASCITES', 'VARICES', 'BILIRUBIN', 'ALK PHOSPHATE', 'SGOT', 'ALBUMIN', 'PROTIME', 'HISTOLOGY']

# Show the header and the first five rows
df.head()

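| | Target | AGE | SEX | STEROID | ANTIVIRALS | FATIGUE | MALAISE | ANOREXIA | LIVER BIG | LIVER FIRM | SPLEEN PALPABLE | SPIDERS | ASCITES | VARICES | BILIRUBIN | ALK PHOSPHATE | SGOT | ALBUMIN | PROTIME | HISTOLOGY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 30 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 1.0 | 85 | 18 | 4.0 | ? | 1 |
| 1 | 2 | 50 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 0.9 | 135 | 42 | 3.5 | ? | 1 |
| 2 | 2 | 78 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0.7 | 96 | 32 | 4.0 | ? | 1 |
| 3 | 2 | 31 | 1 | ? | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0.7 | 46 | 52 | 4.0 | 80 | 1 |
| 4 | 2 | 34 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1.0 | ? | 200 | 4.0 | ? | 1 |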


## Remove rows with missing values

In [2]:
import numpy as np

print('Number of rows before removing rows with missing values: ' + str(df.shape[0]))

# Replace ? with np.nan
df = df.replace('?', np.nan)

# Remove rows with np.nan
df = df.dropna(how='any')

print('Number of rows after removing rows with missing values: ' + str(df.shape[0]))

Number of rows before removing rows with missing values: 155
Number of rows after removing rows with missing values: 80
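
As an alternative sketch of the same step, the `?` markers can be turned into missing values while the file is parsed, using the `na_values` argument of `pd.read_csv`. The three-row sample and the name `df_demo` below are made up for illustration; they are not taken from the hepatitis file.

```python
import io

import pandas as pd

# Tiny made-up sample in the same style as the data file ('?' marks a missing value)
raw = io.StringIO("2,30,1.0,?\n1,50,?,80\n2,34,0.7,62\n")

# na_values='?' converts the markers to NaN during parsing
df_demo = pd.read_csv(raw, header=None, na_values='?')
n_before = df_demo.shape[0]

# Remove rows with at least one missing value
df_demo = df_demo.dropna(how='any')
n_after = df_demo.shape[0]

print('Rows before: ' + str(n_before))
print('Rows after: ' + str(n_after))
```

A side effect of this approach is that the affected columns are parsed as numbers rather than as strings.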


## Get the feature and target vector

In [3]:
# Specify the name of the target
target = 'Target'

# Get the target vector
y = df[target].values

# Specify the name of the features
features = list(df.drop(target, axis=1).columns)

# Get the feature vector
X = df[features].values
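
One detail worth keeping in mind: columns that contained `?` when the file was read hold their numbers as strings, and replacing `?` and dropping rows does not change that. StandardScaler converts such values when it fits, but an explicit conversion makes the intent clearer. The sketch below assumes a small made-up frame (`df_demo`) in that state; it is not part of the original solution.

```python
import pandas as pd

# Made-up frame mimicking the state after removing rows with missing values:
# some numeric columns are still stored as strings
df_demo = pd.DataFrame({'Target': [2, 1, 2],
                        'AGE': [30, 50, 34],
                        'PROTIME': ['80', '62', '85'],
                        'BILIRUBIN': ['1.0', '0.9', '0.7']})

# Specify the name of the target and of the features
target = 'Target'
features = [c for c in df_demo.columns if c != target]

# Convert every feature column to a number, then get a float feature matrix
X_demo = df_demo[features].apply(pd.to_numeric).to_numpy(dtype=float)

# Get the target vector
y_demo = df_demo[target].to_numpy()

print(X_demo.dtype, X_demo.shape, y_demo.shape)
```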

## Divide the data into training and testing
This part is not necessary for this exercise, since cross-validation is used.

In [4]:
# from sklearn.model_selection import train_test_split

# # Randomly choose 30% of the data for testing (set random_state as 0 and stratify as y)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

## Fit SVM using different settings for the hyperparameters
Here:
- we use the combination of Pipeline and GridSearchCV for hyperparameter tuning and model selection
- this is much more convenient than what we did in exercise 7
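
When a Pipeline is tuned, each hyperparameter is addressed as `<step name>__<parameter name>`, with a double underscore. The short sketch below shows where names such as `clf__C` come from; the variable names `pipe` and `svc_params` are illustrative only.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A pipeline with a scaler followed by a classifier named 'clf'
pipe = Pipeline([('StandardScaler', StandardScaler()), ('clf', SVC())])

# Every tunable parameter of the 'clf' step appears with the prefix 'clf__'
svc_params = sorted(name for name in pipe.get_params() if name.startswith('clf__'))
print(svc_params)

# The same names can be used to set values on the wrapped classifier
pipe.set_params(clf__C=0.1, clf__kernel='linear')
print(pipe.named_steps['clf'].C, pipe.named_steps['clf'].kernel)
```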

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

# Declare the classifier with hyperparameter class_weight and random_state
clf = SVC(class_weight='balanced', random_state=0)

# The pipeline, with StandardScaler and clf defined above
pipe_clf = Pipeline([('StandardScaler', StandardScaler()), ('clf', clf)])

# The list of values for hyperparameter C (penalty parameter)
Cs = [0.01, 0.1, 1]

# The list of choices for hyperparameter kernel
kernels = ['linear', 'rbf', 'sigmoid']

# The parameter grid
param_grid = [{'clf__C': Cs,
               'clf__kernel': kernels}]

# GridSearchCV
gs = GridSearchCV(estimator=pipe_clf,
                  param_grid=param_grid,
                  scoring='accuracy',
                  n_jobs=-1,
                  cv=StratifiedKFold(n_splits=10,
                                     shuffle=True,
                                     random_state=0))

# Hyperparameter tuning and model selection
gs.fit(X, y)

# Print the best settings
print('The best settings are:')
print(gs.best_params_)

The best settings are:
{'clf__C': 1, 'clf__kernel': 'rbf'}
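
Beyond `best_params_`, a fitted GridSearchCV also exposes the score of every setting through `cv_results_`, the best mean score through `best_score_`, and a model refit on all the data through `best_estimator_`. The sketch below is a hedged illustration on synthetic data from `make_classification` (an assumption, so its numbers will not match the hepatitis results); `gs_demo` and `results` are made-up names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the hepatitis data)
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)

pipe = Pipeline([('StandardScaler', StandardScaler()),
                 ('clf', SVC(class_weight='balanced', random_state=0))])

gs_demo = GridSearchCV(estimator=pipe,
                       param_grid={'clf__C': [0.01, 0.1, 1],
                                   'clf__kernel': ['linear', 'rbf', 'sigmoid']},
                       scoring='accuracy',
                       cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
gs_demo.fit(X_demo, y_demo)

# One row per setting, sorted from best to worst
columns = ['param_clf__C', 'param_clf__kernel',
           'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame(gs_demo.cv_results_)[columns].sort_values('rank_test_score')
print(results)

# Mean cross-validated accuracy of the best setting
print(gs_demo.best_score_)

# The best pipeline, refit on all of X_demo, is ready for prediction
print(gs_demo.best_estimator_.predict(X_demo[:5]))
```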


## Discussion

The best settings found above (C = 1 with the rbf kernel) are the same as those in exercise 7. In practice, hyperparameter tuning and model selection are usually done using the combination of Pipeline and GridSearchCV (with StratifiedKFold).
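
To illustrate why StratifiedKFold is the usual companion here, the following sketch checks the class counts in each test fold. The labels are made up (80 samples of class 0 and 20 of class 1), and `fold_counts` is a name chosen for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 80 samples of class 0 and 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)

# The feature values do not influence the split, so a placeholder is enough
X_demo = np.zeros((len(y_demo), 1))

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Count the samples of each class in every test fold
fold_counts = [np.bincount(y_demo[test_idx], minlength=2).tolist()
               for _, test_idx in skf.split(X_demo, y_demo)]
print(fold_counts)
```

Each test fold keeps the 80:20 class ratio of the full label vector, so every setting is scored on folds with the same class balance.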