<h1 align="center"> 
DATS 6202, Fall 2018, Exercise_11 (solution extended)
</h1>

<h4 align="center"> 
Yuxiao Huang ([yuxiaohuang@gwu.edu](mailto:yuxiaohuang@gwu.edu))
</h4>

## Note
- Complete the missing parts indicated by # Implement me
- We expect you to follow a reasonable programming style. While we do not mandate a specific style, we require your code to be neat, clear, **documented/commented** and, above all, consistent. **Marks will be deducted if these requirements are not met.**

## 1. Objective
Students are expected to understand:
- The meaning of the class labels in the dataset
- The problem produced by skewed class labels
- How to address this problem

## 2. Overview
Apply hyperparameter tuning and model selection to five classifiers on the [Hepatitis Data Set](https://archive.ics.uci.edu/ml/datasets/hepatitis).

## 3. Data preprocessing

### 3.1. Load the Hepatitis Data

In [1]:
import warnings
warnings.filterwarnings("ignore")
    
import pandas as pd

# Load the data
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data',
                 header=None,
                 names=['TARGET', 'AGE', 'SEX', 'STEROID', 'ANTIVIRALS', 'FATIGUE', 'MALAISE', 'ANOREXIA', 'LIVER BIG', 'LIVER FIRM', 'SPLEEN PALPABLE', 'SPIDERS', 'ASCITES', 'VARICES', 'BILIRUBIN', 'ALK PHOSPHATE', 'SGOT', 'ALBUMIN', 'PROTIME', 'HISTOLOGY'])

# Show the header and the first five rows
df.head()

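| | TARGET | AGE | SEX | STEROID | ANTIVIRALS | FATIGUE | MALAISE | ANOREXIA | LIVER BIG | LIVER FIRM | SPLEEN PALPABLE | SPIDERS | ASCITES | VARICES | BILIRUBIN | ALK PHOSPHATE | SGOT | ALBUMIN | PROTIME | HISTOLOGY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 30 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 1.0 | 85 | 18 | 4.0 | ? | 1 |
| 1 | 2 | 50 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 0.9 | 135 | 42 | 3.5 | ? | 1 |
| 2 | 2 | 78 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0.7 | 96 | 32 | 4.0 | ? | 1 |
| 3 | 2 | 31 | 1 | ? | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0.7 | 46 | 52 | 4.0 | 80 | 1 |
| 4 | 2 | 34 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1.0 | ? | 200 | 4.0 | ? | 1 |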


### 3.2. Remove rows with missing values

In [2]:
import numpy as np

print('Number of rows before removing rows with missing values: ' + str(df.shape[0]))

# Replace ? with np.nan (the np.NaN alias was removed in NumPy 2.0)
df = df.replace('?', np.nan)

# Remove rows that contain at least one np.nan
df = df.dropna(how='any')

print('Number of rows after removing rows with missing values: ' + str(df.shape[0]))

Number of rows before removing rows with missing values: 155
Number of rows after removing rows with missing values: 80
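
The two steps above can be tried on a small frame. The sketch below uses made-up values (not rows from the Hepatitis data) and adds one step that the notebook does not perform, `pd.to_numeric`, because columns that contained `?` are read as strings and keep the `object` dtype after `dropna`. Treat it as an illustration under those assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from the Hepatitis data set)
toy = pd.DataFrame({'BILIRUBIN': ['1.0', '?', '0.7'],
                    'PROTIME': ['?', '80', '75']})

# Replace the missing-value marker with NaN, then drop incomplete rows
toy = toy.replace('?', np.nan).dropna(how='any')

# Assumed extra step (not in the notebook): convert the remaining strings to numbers
toy = toy.apply(pd.to_numeric)

print(toy.shape)
print(toy.dtypes)
```

Only the last toy row is complete, so a single row survives. On the real data the same filter keeps 80 of 155 rows, which is a large loss for such a small data set.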


### 3.3. Get the feature and target vector

In [3]:
# Get the feature vector
X = df.drop(labels='TARGET', axis=1)

# Get the target vector
y = df['TARGET']

### 3.4. Encode the target

In [4]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)
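
To see what the encoded labels mean, it helps to look at the mapping that `LabelEncoder` builds: it sorts the distinct raw labels and numbers them from 0. The sketch below uses a hypothetical list of raw `TARGET` values. As we read the UCI description, the raw coding is 1 = DIE and 2 = LIVE, so after encoding 0 would stand for DIE and 1 for LIVE; please check this assumption against the data set documentation.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw TARGET values (assumed coding: 1 = DIE, 2 = LIVE)
raw = [2, 2, 1, 2]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(raw)

# classes_[i] is the raw label that was mapped to the integer i
print(le_demo.classes_)
print(encoded)
```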

### 3.5. Divide the data into training and testing

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

print([np.where(y_train == 0)[0].shape[0], np.where(y_train == 1)[0].shape[0]])

[9, 47]
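
The counts above show how skewed the training labels are. A quick way to quantify this is the accuracy of a trivial model that always predicts the majority class. The sketch below only does arithmetic on the two printed counts.

```python
import numpy as np

# Class counts printed above for the training split
counts = np.array([9, 47])

# Share of each class, and the accuracy of always predicting the majority class
shares = counts / counts.sum()
majority_baseline = counts.max() / counts.sum()

print(shares.round(3))
print(round(float(majority_baseline), 3))
```

Any classifier whose accuracy is not clearly above this baseline has learned little about the minority class.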


### 3.6. Oversampling

In [6]:
# from imblearn.over_sampling import RandomOverSampler

# ros = RandomOverSampler(random_state=0)
# X_train, y_train = ros.fit_resample(X_train, y_train)  # fit_sample in old imblearn releases

# print([np.where(y_train == 0)[0].shape[0], np.where(y_train == 1)[0].shape[0]])
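
The commented-out cell relies on imbalanced-learn. To show the idea without that dependency, here is a minimal NumPy sketch of random oversampling. The helper `random_oversample` is hypothetical (it is not part of any library) and the toy data only mimics the 9 / 47 imbalance of the training split.

```python
import numpy as np

def random_oversample(X, y, random_state=0):
    """Hypothetical helper: keep all samples and add random duplicates
    (drawn with replacement) until every class matches the majority count."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.where(y == c)[0]
        # Draw only the shortfall; the majority class draws zero extras
        extra = rng.choice(members, size=n_max - n, replace=True)
        idx.append(np.concatenate([members, extra]))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

# Toy labels with the same 9 / 47 imbalance as the training split
y_toy = np.array([0] * 9 + [1] * 47)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_res, y_res = random_oversample(X_toy, y_toy)
print(np.bincount(y_res))
```

Oversampling should be applied to the training split only, as in the commented-out cell, so that the test data keeps the original class distribution.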

## 4. Hyperparameter tuning and model selection
In this section, we will first use the combination of pipeline and GridSearchCV to tune the hyperparameters of five classifiers:
- logistic regression
- multi-layer perceptron
- decision tree
- random forest
- support vector machine

Next we will select the best model across the five classifiers.

### 4.1. Create the dictionary of classifiers
In the dictionary:
- the key is the acronym of the classifier
- the value is the classifier (with random_state=0)

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

clfs = {'lr': LogisticRegression(random_state=0),
        'mlp': MLPClassifier(random_state=0),
        'dt': DecisionTreeClassifier(random_state=0),
        'rf': RandomForestClassifier(random_state=0),
        'svc': SVC(random_state=0, probability=True)}

# clfs = {'lr': LogisticRegression(random_state=0, class_weight='balanced'),
#         'mlp': MLPClassifier(random_state=0),
#         'dt': DecisionTreeClassifier(random_state=0, class_weight='balanced'),
#         'rf': RandomForestClassifier(random_state=0, class_weight='balanced'),
#         'svc': SVC(random_state=0, probability=True, class_weight='balanced')}
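
The commented-out alternative passes `class_weight='balanced'`. scikit-learn documents this mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to the training counts from section 3.5; it is plain arithmetic and does not call scikit-learn.

```python
import numpy as np

# Training class counts from section 3.5
counts = np.array([9, 47])
n_samples, n_classes = counts.sum(), len(counts)

# 'balanced' weights: rarer classes get proportionally larger weights
balanced_weights = n_samples / (n_classes * counts)
print(balanced_weights.round(3))
```

`MLPClassifier` has no `class_weight` parameter, which is why its entry is unchanged in the commented-out dictionary.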

### 4.2. Create the dictionary of pipelines
In the dictionary:
- the key is the acronym of the classifier
- the value is the pipeline (with StandardScaler and the classifier)

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_clfs = {}

for name, clf in clfs.items():
    pipe_clfs[name] = Pipeline([('StandardScaler', StandardScaler()), ('clf', clf)])

### 4.3. Create the dictionary of parameter grids
In the dictionary:
- the key is the acronym of the classifier
- the value is the parameter grid of the classifier

In [9]:
param_grids = {}
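
The keys in the grids below all start with `clf__`. That prefix comes from the pipeline: the parameters of a step are exposed as `<step name>__<parameter name>`, and our classifier step is named `clf`. A minimal sketch with a single pipeline (built here only for illustration):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('StandardScaler', StandardScaler()),
                 ('clf', LogisticRegression(random_state=0))])

# Parameters of the classifier step, as seen from the pipeline
clf_params = sorted(p for p in pipe.get_params() if p.startswith('clf__'))
print(clf_params[:3])

# set_params accepts the same names that GridSearchCV reads from the grids
pipe.set_params(clf__C=0.1)
print(pipe.named_steps['clf'].C)
```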

#### 4.3.1. The parameter grid for logistic regression
The hyperparameters we want to tune are:
- multi_class
- solver
- C

Here we need two dictionaries in the parameter grid because the 'liblinear' solver does not support multi_class='multinomial'. See the [sklearn logistic regression documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for the meaning of these hyperparameters.

In [10]:
C_range = [10 ** i for i in range(-4, 5)]

param_grid = [{'clf__multi_class': ['ovr'], 
               'clf__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
               'clf__C': C_range},
              {'clf__multi_class': ['multinomial'],
               'clf__solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
               'clf__C': C_range}]

param_grids['lr'] = param_grid
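
It is worth checking how many candidates a grid produces before running the search. The sketch below rebuilds the same grid and counts its combinations with `ParameterGrid`. The factor of 10 assumes the 10-fold cross-validation used in section 4.4.

```python
from sklearn.model_selection import ParameterGrid

C_range = [10 ** i for i in range(-4, 5)]

param_grid = [{'clf__multi_class': ['ovr'],
               'clf__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
               'clf__C': C_range},
              {'clf__multi_class': ['multinomial'],
               'clf__solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
               'clf__C': C_range}]

# 9 values of C x 5 solvers, plus 9 values of C x 4 solvers
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per fold (assuming 10 folds)
n_cv_fits = n_candidates * 10
print(n_candidates, n_cv_fits)
```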

#### 4.3.2. The parameter grid for multi-layer perceptron
The hyperparameters we want to tune are:
- hidden_layer_sizes
- activation

See the [sklearn multi-layer perceptron documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier) for the meaning of these hyperparameters.

In [11]:
param_grid = [{'clf__hidden_layer_sizes': [10, 100, 200],
               'clf__activation': ['identity', 'logistic', 'tanh', 'relu']}]

param_grids['mlp'] = param_grid

#### 4.3.3. The parameter grid for decision tree
The hyperparameters we want to tune are:
- min_samples_split
- min_samples_leaf

See the [sklearn decision tree documentation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) for the meaning of these hyperparameters.

In [12]:
param_grid = [{'clf__min_samples_split': [2, 10, 30],
               'clf__min_samples_leaf': [1, 10, 30]}]

param_grids['dt'] = param_grid

#### 4.3.4. The parameter grid for random forest
The hyperparameters we want to tune are:
- n_estimators
- min_samples_split
- min_samples_leaf

See the [sklearn random forest documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for the meaning of these hyperparameters.

In [13]:
param_grid = [{'clf__n_estimators': [2, 10, 30],
               'clf__min_samples_split': [2, 10, 30],
               'clf__min_samples_leaf': [1, 10, 30]}]

param_grids['rf'] = param_grid

#### 4.3.5. The parameter grid for support vector machine
The hyperparameters we want to tune are:
- C
- gamma
- kernel

See the [sklearn support vector machine documentation](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) for the meaning of these hyperparameters.

In [14]:
param_grid = [{'clf__C': [0.01, 0.1, 1, 10, 100],
               'clf__gamma': [0.01, 0.1, 1, 10, 100],
               'clf__kernel': ['linear', 'poly', 'rbf', 'sigmoid']}]

param_grids['svc'] = param_grid
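
One detail of this grid: `gamma` is the kernel coefficient for the 'rbf', 'poly' and 'sigmoid' kernels and is not used by the 'linear' kernel, so the single dictionary above fits the linear kernel five times per value of `C` with identical results. A possible alternative (a sketch, not the grid used in this notebook) splits the grid in two, as we did for logistic regression:

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.01, 0.1, 1, 10, 100]
gamma_values = [0.01, 0.1, 1, 10, 100]

# The grid used above: every kernel is combined with every gamma
full_grid = [{'clf__C': C_values,
              'clf__gamma': gamma_values,
              'clf__kernel': ['linear', 'poly', 'rbf', 'sigmoid']}]

# Alternative sketch: leave gamma out for the linear kernel
split_grid = [{'clf__C': C_values,
               'clf__kernel': ['linear']},
              {'clf__C': C_values,
               'clf__gamma': gamma_values,
               'clf__kernel': ['poly', 'rbf', 'sigmoid']}]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)
```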

### 4.4. Hyperparameter tuning
Here we use two scikit-learn classes for hyperparameter tuning:
- [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html): Exhaustive search over specified parameter values for an estimator
- [StratifiedKFold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html): Stratified K-Folds cross-validator

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

# The list of [best_score_, best_params_, best_estimator_]
best_score_param_estimators = []

# For each classifier
for name in pipe_clfs.keys():
    # GridSearchCV
    gs = GridSearchCV(estimator=pipe_clfs[name],
                      param_grid=param_grids[name],
                      scoring='precision',
                      n_jobs=1,
                      cv=StratifiedKFold(n_splits=10,
                                         shuffle=True,
                                         random_state=0))
    # Fit the pipeline
    gs = gs.fit(X_train, y_train)
    
    # Update best_score_param_estimators
    best_score_param_estimators.append([gs.best_score_, gs.best_params_, gs.best_estimator_])
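
Two properties of this setup matter when reading the results. First, `scoring='precision'` measures precision for the label 1 only (the default positive label), so the minority class 0 is not directly rewarded. Second, the training split has only 9 samples of class 0 but we ask for 10 folds, so at least one validation fold contains no class 0 sample at all. scikit-learn warns about this, but the notebook silences warnings in its first cell. The sketch below reproduces the fold layout on toy labels with the same 9 / 47 counts.

```python
import warnings

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with the same class counts as y_train
y_toy = np.array([0] * 9 + [1] * 47)
X_toy = np.zeros((len(y_toy), 1))

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Count the class 0 samples in each validation fold
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    minority_per_fold = [int(np.sum(y_toy[test_idx] == 0))
                         for _, test_idx in cv.split(X_toy, y_toy)]

print(minority_per_fold)
```

With so few minority samples, each fold's score rests on at most a handful of class 0 predictions, so the cross-validated scores are noisy.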

### 4.5. Model selection

In [16]:
# Sort best_score_param_estimators in descending order of the best_score_
best_score_param_estimators = sorted(best_score_param_estimators, key=lambda x : x[0], reverse=True)

# For each [best_score_, best_params_, best_estimator_]
for best_score_param_estimator in best_score_param_estimators:
    # Print [best_score_, best_params_, classifier type]; best_estimator_ is a pipeline,
    # so we print only the type of its classifier step
    print([best_score_param_estimator[0], best_score_param_estimator[1], type(best_score_param_estimator[2].named_steps['clf'])], end='\n\n')

[0.9821428571428571, {'clf__C': 0.0001, 'clf__multi_class': 'ovr', 'clf__solver': 'liblinear'}, <class 'sklearn.linear_model.logistic.LogisticRegression'>]

[0.9821428571428571, {'clf__activation': 'identity', 'clf__hidden_layer_sizes': 10}, <class 'sklearn.neural_network.multilayer_perceptron.MLPClassifier'>]

[0.9821428571428571, {'clf__min_samples_leaf': 1, 'clf__min_samples_split': 2, 'clf__n_estimators': 2}, <class 'sklearn.ensemble.forest.RandomForestClassifier'>]

[0.9464285714285714, {'clf__C': 0.01, 'clf__gamma': 1, 'clf__kernel': 'poly'}, <class 'sklearn.svm.classes.SVC'>]

[0.9249999999999999, {'clf__min_samples_leaf': 10, 'clf__min_samples_split': 2}, <class 'sklearn.tree.tree.DecisionTreeClassifier'>]
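
Note that three classifiers tie for the best score. Python's `sorted` is stable, also with `reverse=True`, so tied entries keep the order in which they were appended, which is the order of the keys in `clfs`. The sketch below replays the ranking with the scores printed above; the short names stand in for the fitted estimators.

```python
# Scores from the output above, in the order the classifiers were tuned
results = [[0.9821428571428571, 'lr'],
           [0.9821428571428571, 'mlp'],
           [0.9249999999999999, 'dt'],
           [0.9821428571428571, 'rf'],
           [0.9464285714285714, 'svc']]

# Same sort as in the cell above: descending by score, ties keep their order
ranked = sorted(results, key=lambda x: x[0], reverse=True)
ranked_names = [name for _, name in ranked]

# All classifiers that share the top score
top_score = ranked[0][0]
tied = [name for score, name in ranked if score == top_score]

print(ranked_names)
print(tied)
```

Logistic regression therefore ends up first only because it comes first in the dictionary, not because it beat the other two tied models.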



## 5. Print the accuracy of the best model on the testing data

In [17]:
print(best_score_param_estimators[0][2].score(X_test, y_test))

0.75


## 6. Discussion
- Do you know what the score above means?
- Is the best model acceptable?

## 7. Let us find out the answers (to the questions above)

### 7.1. Print the confusion matrix

In [18]:
from sklearn.metrics import confusion_matrix

y_test_pred = best_score_param_estimators[0][2].predict(X_test)

confusion_matrix(y_test, y_test_pred, labels=[0, 1])

array([[ 4,  0],
       [ 6, 14]])
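
The matrix above (rows are true labels, columns are predicted labels) is enough to answer the questions in section 6. The sketch below recomputes accuracy, per-class precision and recall, and the majority-class baseline from the four printed numbers.

```python
import numpy as np

# Confusion matrix printed above: rows = true label, columns = predicted label
cm = np.array([[4, 0],
               [6, 14]])

# Fraction of all test samples that were classified correctly
accuracy = np.trace(cm) / cm.sum()

# Per-class precision (column-wise) and recall (row-wise)
precision = np.diag(cm) / cm.sum(axis=0)
recall = np.diag(cm) / cm.sum(axis=1)

# Accuracy of always predicting the most frequent true class
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(accuracy)
print(precision.round(3), recall.round(3))
print(round(float(majority_baseline), 3))
```

The accuracy of 0.75 is lower than what we would get by always predicting class 1 (20 of 24 test samples). Precision for class 1 is perfect, which is exactly what the grid search optimized, but 6 of the 20 class 1 samples are predicted as class 0, so precision for class 0 is only 0.4. Whether that trade-off is acceptable depends on which kind of error is more costly.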