<h1 align="center"> 
DATS 6202, Fall 2018, Exercise_7
</h1>

<h4 align="center"> 
Yuxiao Huang ([yuxiaohuang@gwu.edu](mailto:yuxiaohuang@gwu.edu))
</h4>

## Note
- Complete the missing parts indicated by # Implement me
- We expect you to follow a reasonable programming style. While we do not mandate a specific style, we require that your code be neat, clear, **documented/commented** and, above all, consistent. **Marks will be deducted if these requirements are not met.**

## Objective
Students are expected to understand:
- how to use Pipeline to sequentially apply a list of transforms and a final estimator
- how to use StratifiedKFold for cross validation
- how to use cross_val_score to wrap StratifiedKFold
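
As a minimal sketch of the first objective, the example below checks that fitting a Pipeline gives the same predictions as applying the scaler and the classifier by hand. The data come from `make_classification` and all `*_demo` names are assumptions for illustration; they are not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the hepatitis data)
X_demo, y_demo = make_classification(n_samples=60, n_features=5, random_state=0)
X_train_demo, X_test_demo = X_demo[:40], X_demo[40:]
y_train_demo = y_demo[:40]

# Pipeline: one fit call fits the scaler, transforms the data, then fits the classifier
pipe_demo = Pipeline([('StandardScaler', StandardScaler()), ('clf', SVC(random_state=0))])
pipe_demo.fit(X_train_demo, y_train_demo)
pipe_pred = pipe_demo.predict(X_test_demo)

# The same steps written out by hand
scaler = StandardScaler()
clf = SVC(random_state=0)
clf.fit(scaler.fit_transform(X_train_demo), y_train_demo)
manual_pred = clf.predict(scaler.transform(X_test_demo))

# Compare the two sets of predictions
print(np.array_equal(pipe_pred, manual_pred))
```

Because the scaler is fitted inside `pipe_demo.fit`, it only ever sees the training rows it is given. This is what makes a pipeline safe to use inside cross validation.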

## Overview
The only difference between this exercise and exercise 6 is as follows:
- in exercise 6, we divided the data into training and testing using one split
- here, we divide the data using cross validation 

## Load the Hepatitis Data

In [2]:
import warnings
warnings.filterwarnings('ignore')
    
import pandas as pd

# Load the data
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data', header=None)

# Specify the name of the columns
df.columns = ['Target', 'AGE', 'SEX', 'STEROID', 'ANTIVIRALS', 'FATIGUE', 'MALAISE', 'ANOREXIA', 'LIVER BIG', 'LIVER FIRM', 'SPLEEN PALPABLE', 'SPIDERS', 'ASCITES', 'VARICES', 'BILIRUBIN', 'ALK PHOSPHATE', 'SGOT', 'ALBUMIN', 'PROTIME', 'HISTOLOGY']

# Show the header and the first five rows
df.head()

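| | Target | AGE | SEX | STEROID | ANTIVIRALS | FATIGUE | MALAISE | ANOREXIA | LIVER BIG | LIVER FIRM | SPLEEN PALPABLE | SPIDERS | ASCITES | VARICES | BILIRUBIN | ALK PHOSPHATE | SGOT | ALBUMIN | PROTIME | HISTOLOGY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 30 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 1.0 | 85 | 18 | 4.0 | ? | 1 |
| 1 | 2 | 50 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 0.9 | 135 | 42 | 3.5 | ? | 1 |
| 2 | 2 | 78 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0.7 | 96 | 32 | 4.0 | ? | 1 |
| 3 | 2 | 31 | 1 | ? | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0.7 | 46 | 52 | 4.0 | 80 | 1 |
| 4 | 2 | 34 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1.0 | ? | 200 | 4.0 | ? | 1 |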


## Remove rows with missing values

In [3]:
import numpy as np

print('Number of rows before removing rows with missing values: ' + str(df.shape[0]))

# Replace ? with np.nan
df = df.replace('?', np.nan)

# Remove rows that contain np.nan
df = df.dropna(how='any')

print('Number of rows after removing rows with missing values: ' + str(df.shape[0]))

Number of rows before removing rows with missing values: 155
Number of rows after removing rows with missing values: 80
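
As an alternative sketch, pandas can parse `?` as missing while reading the file, through the `na_values` parameter of `read_csv`. The example below uses a tiny in-memory CSV with made-up rows (an assumption, so that it runs without downloading the hepatitis file) and also counts the missing values per column before dropping any rows.

```python
import io
import pandas as pd

# A tiny in-memory CSV in the same style as the hepatitis file (made-up rows)
csv_text = "2,30,1.0,?\n1,50,?,80\n2,34,0.7,75\n"

# na_values='?' makes pandas read ? as NaN,
# so the numeric columns are not left as strings
df_demo = pd.read_csv(io.StringIO(csv_text), header=None, na_values='?')

# Count the missing values in each column before dropping anything
missing_per_column = df_demo.isna().sum()

# Keep only the rows without missing values
df_complete = df_demo.dropna(how='any')

print(missing_per_column)
print(df_complete)
```

Looking at the per-column counts first can show whether most of the lost rows are due to one or two sparse columns.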


## Get the feature and target vector

In [4]:
# Specify the name of the target
target = 'Target'

# Get the target vector
y = df[target].values

# Specify the name of the features
features = list(df.drop(target, axis=1).columns)

# Get the feature vector
X = df[features].values

## Divide the data into training and testing
This part is not necessary for this exercise, since cross validation is used instead of a single split.

In [5]:
# from sklearn.model_selection import train_test_split

# # Randomly choose 30% of the data for testing (set random_state as 0 and stratify as y)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

## Fit svm using different settings for the hyperparameters
Here:
- we first use StratifiedKFold to get the indices of training and testing data for each fold
- we then train and test the (pipeline) estimator on the training and testing data

The detailed steps help us to understand what StratifiedKFold returns
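
To make what StratifiedKFold returns concrete, here is a small sketch on toy data. The arrays `X_toy` and `y_toy` are made up for illustration: 12 samples, 8 of class 1 and 4 of class 2. Each iteration of `split` yields a pair of integer index arrays, and each test fold keeps roughly the class proportions of the whole data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (made up): 12 samples, 8 of class 1 and 4 of class 2
X_toy = np.arange(24).reshape(12, 2)
y_toy = np.array([1] * 8 + [2] * 4)

skf_toy = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

fold_summaries = []
for train_index, test_index in skf_toy.split(X_toy, y_toy):
    # train_index and test_index are integer positions into X_toy and y_toy
    # Count how many samples of class 1 and class 2 fall in the test fold
    test_counts = np.bincount(y_toy[test_index], minlength=3)[1:]
    fold_summaries.append((len(train_index), len(test_index), test_counts.tolist()))
    print('train:', train_index, 'test:', test_index, 'test class counts:', test_counts)
```

With these toy sizes, every test fold holds 3 samples, two from class 1 and one from class 2, which mirrors the 2:1 class ratio of `y_toy`.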

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
import numpy as np

# The list of values for hyperparameter C (penalty parameter)
Cs = [0.01, 0.1, 1]

# The list of choices for hyperparameter kernel
kernels = ['linear', 'rbf', 'sigmoid']

# The list of [score, setting], where score is the score of the classifier and setting is a pair of (C, kernel)
score_settings = []

# For each C
for C in Cs:
    # For each kernel
    for kernel in kernels:
        # Declare the classifier with hyperparameter C, kernel, class_weight, and random_state
        clf = SVC(C=C, kernel=kernel, class_weight='balanced', random_state=0)
        
        # The pipeline, with StandardScaler and clf defined above
        pipe_clf = Pipeline([('StandardScaler', StandardScaler()), ('clf', clf)])
        
        # StratifiedKFold, with n_splits=10, shuffle=True, and random_state=0
        skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        
        # The scores across the 10-fold cross validation
        scores = []
        
        # For the training and testing indices with respect to each fold
        for train_index, test_index in skf.split(X, y):
            # Get X_train and X_test
            X_train, X_test = X[train_index], X[test_index]
            # Get y_train and y_test
            y_train, y_test = y[train_index], y[test_index]

            # Fit the pipeline
            pipe_clf.fit(X_train, y_train)

            # Get the score 
            score = pipe_clf.score(X_test, y_test)
            
            # Update scores
            scores.append(score)
        
        # Get the setting, which is a pair of (C, kernel)
        setting = [C, kernel]
        
        # Get the average score across the 10-fold cross validation (rounded to two decimal places)
        mean_score = round(np.mean(scores), 2)

        # Append [mean_score, setting] to score_settings
        score_settings.append([mean_score, setting])
        
# Sort score_settings in descending order of score
score_settings = sorted(score_settings, key=lambda x: x[0], reverse=True)

# Print score_settings
print('The list of [score, setting] is:')
for score_setting in score_settings:
    print(score_setting)
print()

# Print the best setting
print('The best setting is:')
print('C: ' + str(score_settings[0][1][0]))
print('kernel: ' + score_settings[0][1][1])

The list of [score, setting] is:
[0.87, [1, 'rbf']]
[0.84, [1, 'linear']]
[0.82, [0.1, 'sigmoid']]
[0.81, [0.01, 'linear']]
[0.8, [0.01, 'sigmoid']]
[0.79, [0.01, 'rbf']]
[0.79, [0.1, 'linear']]
[0.77, [1, 'sigmoid']]
[0.73, [0.1, 'rbf']]

The best setting is:
C: 1
kernel: rbf


## Discussion

The above results are not the same as those in exercise 6. Here, the default settings for the two hyperparameters (C=1 and kernel='rbf') are indeed the best ones. The difference arises largely because the scores in exercise 6 came from a single split of the input data into training and testing sets, so each score depends on which rows happened to fall on either side of the split and may not be reliable. The scores in this exercise are averages over 10-fold cross validation and are therefore likely to be more reliable.
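
The sketch below illustrates this point on synthetic data, under stated assumptions: the data come from `make_classification`, not from the hepatitis file, and the five seeds are arbitrary. It scores the same pipeline on several different single splits and then with 10-fold cross validation, so that the spread of the single-split scores can be compared with the cross-validation average.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the hepatitis data)
X_demo, y_demo = make_classification(n_samples=80, n_features=19, random_state=0)
pipe_demo = Pipeline([('StandardScaler', StandardScaler()), ('clf', SVC(C=1, kernel='rbf'))])

# One score per single split; only the random seed of the split changes
single_split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed, stratify=y_demo)
    single_split_scores.append(pipe_demo.fit(X_tr, y_tr).score(X_te, y_te))

# One score per fold of a 10-fold stratified cross validation
cv_scores = cross_val_score(
    pipe_demo, X_demo, y_demo,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))

print('Single-split scores:', np.round(single_split_scores, 2))
print('Mean cross-validation score:', round(cv_scores.mean(), 2))
```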

## Fit svm using different settings for the hyperparameters (again)
Here we use cross_val_score, which
- wraps StratifiedKFold
- returns the score for each fold of the 10-fold cross validation (we then average these scores)

This shows us a simple way to use cross validation in practice.

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# The list of values for hyperparameter C (penalty parameter)
Cs = [0.01, 0.1, 1, 10]

# The list of choices for hyperparameter kernel
kernels = ['linear', 'rbf', 'sigmoid']

# The list of [score, setting], where score is the score of the classifier and setting is a pair of (C, kernel)
score_settings = []

# For each C
for C in Cs:
    # For each kernel
    for kernel in kernels:
        # Declare the classifier with hyperparameter C, kernel, class_weight, and random_state
        clf = SVC(C=C, kernel=kernel, class_weight='balanced', random_state=0)
        
        # The pipeline, with StandardScaler and clf defined above
        pipe_clf = Pipeline([('StandardScaler', StandardScaler()), ('clf', clf)])
        
        # Get the score with respect to each fold (using cross_val_score)
        scores = cross_val_score(estimator=pipe_clf,
                                 X=X,
                                 y=y,
                                 scoring='accuracy',
                                 cv=StratifiedKFold(n_splits=10,
                                                    shuffle=True,
                                                    random_state=0),
                                 n_jobs=-1)
        
        # Get the setting, which is a pair of (C, kernel)
        setting = [C, kernel]
        
        # Get the average score (rounding to two decimal places)
        mean_score = round(np.mean(scores), 2)

        # Append [mean_score, setting] to score_settings
        score_settings.append([mean_score, setting])
        
# Sort score_settings in descending order of score
score_settings = sorted(score_settings, key=lambda x: x[0], reverse=True)

# Print score_settings
print('The list of [score, setting] is:')
for score_setting in score_settings:
    print(score_setting)
print()

# Print the best setting
print('The best setting is:')
print('C: ' + str(score_settings[0][1][0]))
print('kernel: ' + score_settings[0][1][1])

The list of [score, setting] is:
[0.87, [1, 'rbf']]
[0.84, [1, 'linear']]
[0.83, [10, 'rbf']]
[0.82, [0.1, 'sigmoid']]
[0.82, [10, 'linear']]
[0.81, [0.01, 'linear']]
[0.8, [0.01, 'sigmoid']]
[0.79, [0.01, 'rbf']]
[0.79, [0.1, 'linear']]
[0.77, [1, 'sigmoid']]
[0.77, [10, 'sigmoid']]
[0.73, [0.1, 'rbf']]

The best setting is:
C: 1
kernel: rbf


## Discussion

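For the settings shared with the previous section, the above results are exactly the same as before; the only new rows are the three settings with C=10, and the best setting is unchanged. While using StratifiedKFold to get the indices of training and testing data helps us to understand what happens under the hood, cross_val_score (which wraps StratifiedKFold) is much more convenient and is thus what is usually used in practice.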
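
The equivalence of the two approaches can also be checked directly. The following is a minimal sketch on synthetic data, where `make_classification` and the `*_demo` names are assumptions for illustration. It computes the per-fold scores once with an explicit StratifiedKFold loop and once with cross_val_score, using the same splitter, and compares them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the hepatitis data)
X_demo, y_demo = make_classification(n_samples=80, n_features=19, random_state=0)

pipe_demo = Pipeline([('StandardScaler', StandardScaler()),
                      ('clf', SVC(C=1, kernel='rbf', class_weight='balanced', random_state=0))])

# With an integer random_state, every call to split yields the same folds
skf_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Explicit loop: fit on each training fold, score on the matching test fold
manual_scores = []
for train_index, test_index in skf_demo.split(X_demo, y_demo):
    pipe_demo.fit(X_demo[train_index], y_demo[train_index])
    manual_scores.append(pipe_demo.score(X_demo[test_index], y_demo[test_index]))

# cross_val_score: the same folds and the same metric in a single call
wrapped_scores = cross_val_score(estimator=pipe_demo, X=X_demo, y=y_demo,
                                 scoring='accuracy', cv=skf_demo)

# Compare the per-fold scores of the two approaches
print(np.allclose(manual_scores, wrapped_scores))
```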