<h1 align="center"> 
DATS 6202, Fall 2018, Exercise_11 (solution)
</h1>

<h4 align="center"> 
Yuxiao Huang ([yuxiaohuang@gwu.edu](mailto:yuxiaohuang@gwu.edu))
</h4>

## Note
- Complete the missing parts indicated by # Implement me
- We expect you to follow a reasonable programming style. While we do not mandate a specific style, we require your code to be neat, clear, **documented/commented**, and above all consistent. **Marks will be deducted if these requirements are not met.**

## 1. Objective
Students are expected to understand:
- how to use the combination of Pipeline (with PCA) and GridSearchCV for hyperparameter tuning and model selection

## 2. Overview
Apply hyperparameter tuning and model selection to five classifiers on the [Congressional Voting Records Data](https://archive.ics.uci.edu/ml/datasets/congressional+voting+records).

## 3. Data preprocessing

### 3.1. Load the Congressional Voting Records Data

In [1]:
import warnings
warnings.filterwarnings("ignore")
    
import pandas as pd

# Load the data
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data', header=None)

# Specify the name of the columns
df.columns = ['target', 'handicapped-infants', 'water-project-cost-sharing', 'adoption-of-the-budget-resolution', 'physician-fee-freeze', 'el-salvador-aid', 'religious-groups-in-schools', 'anti-satellite-test-ban', 'aid-to-nicaraguan-contras', 'mx-missile', 'immigration', 'synfuels-corporation-cutback', 'education-spending', 'superfund-right-to-sue', 'crime', 'duty-free-exports', 'export-administration-act-south-africa']

# Show the header and the first five rows
df.head()

| | target | handicapped-infants | water-project-cost-sharing | adoption-of-the-budget-resolution | physician-fee-freeze | el-salvador-aid | religious-groups-in-schools | anti-satellite-test-ban | aid-to-nicaraguan-contras | mx-missile | immigration | synfuels-corporation-cutback | education-spending | superfund-right-to-sue | crime | duty-free-exports | export-administration-act-south-africa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | republican | n | y | n | y | y | y | n | n | n | y | ? | y | y | y | n | y |
| 1 | republican | n | y | n | y | y | y | n | n | n | n | n | y | y | y | n | ? |
| 2 | democrat | ? | y | y | ? | y | y | n | n | n | n | y | n | y | y | n | n |
| 3 | democrat | n | y | y | n | ? | y | n | n | n | n | y | n | y | n | n | y |
| 4 | democrat | y | y | y | n | y | y | n | n | n | n | y | ? | y | y | y | y |


### 3.2. Remove rows with missing values

In [2]:
import numpy as np

print('Number of rows before removing rows with missing values: ' + str(df.shape[0]))

# Replace ? with np.nan (the np.NaN alias was removed in NumPy 2.0)
df = df.replace('?', np.nan)

# Remove rows that contain at least one np.nan
df = df.dropna(how='any')

print('Number of rows after removing rows with missing values: ' + str(df.shape[0]))

Number of rows before removing rows with missing values: 435
Number of rows after removing rows with missing values: 232
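
The effect of these two steps can be checked on a tiny made-up frame. This is only a sketch: the column names vote_a and vote_b and their values are hypothetical and are not taken from the voting data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: '?' marks a missing vote
toy = pd.DataFrame({'target': ['republican', 'democrat', 'democrat'],
                    'vote_a': ['n', '?', 'y'],
                    'vote_b': ['y', 'y', 'n']})

# Replace '?' with NaN, then drop every row that contains at least one NaN
toy_clean = toy.replace('?', np.nan).dropna(how='any')

# The row with the missing vote (index 1) is gone; the index is not renumbered
print(toy_clean.index.tolist())
```

Because dropna keeps the original row labels, the remaining rows of df are no longer numbered consecutively, which is why the later output starts at index 5.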


### 3.3. Get the feature and target vector

In [3]:
# Get the feature vector
X = df.drop(labels='target', axis=1)

# Get the target vector
y = df['target']

### 3.4. Encode the features using one-hot-encoding

In [4]:
import pandas as pd

# One-hot encode every feature column
X = pd.get_dummies(X)

# Show the header and the first five rows
X.head()

| | handicapped-infants_n | handicapped-infants_y | water-project-cost-sharing_n | water-project-cost-sharing_y | adoption-of-the-budget-resolution_n | adoption-of-the-budget-resolution_y | physician-fee-freeze_n | physician-fee-freeze_y | el-salvador-aid_n | el-salvador-aid_y | ... | education-spending_n | education-spending_y | superfund-right-to-sue_n | superfund-right-to-sue_y | crime_n | crime_y | duty-free-exports_n | duty-free-exports_y | export-administration-act-south-africa_n | export-administration-act-south-africa_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 8 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| 19 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 23 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 25 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
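
To see what get_dummies does to a y/n column, here is a minimal sketch on a made-up frame (the columns vote_a and vote_b are hypothetical). It passes dtype=int as an assumption, because recent pandas releases return boolean indicator columns by default rather than the 0/1 values shown in the output above.

```python
import pandas as pd

# Hypothetical toy features with the same y/n coding as the voting data
toy = pd.DataFrame({'vote_a': ['n', 'y', 'y'],
                    'vote_b': ['y', 'y', 'n']})

# Each column becomes one indicator column per observed category
toy_encoded = pd.get_dummies(toy, dtype=int)
print(list(toy_encoded.columns))

# With only two categories per column, the _n and _y indicators are redundant;
# drop_first=True keeps a single indicator per original column
toy_reduced = pd.get_dummies(toy, dtype=int, drop_first=True)
print(list(toy_reduced.columns))
```

The notebook keeps both indicators for every vote, which doubles the number of features without adding information. This is harmless here, but it is one reason why PCA with fewer components can still work well on this data.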


### 3.5. Encode the target

In [5]:
from sklearn.preprocessing import LabelEncoder

# Encode the class labels as integers (classes are sorted, so democrat -> 0 and republican -> 1)
le = LabelEncoder()
y = le.fit_transform(y)
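
The mapping chosen by LabelEncoder can be inspected through its classes_ attribute, and inverse_transform turns the integer codes back into labels. The snippet below is a small sketch on a made-up label list, using a separate encoder named le_demo so that the le object above is left untouched.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels in the same vocabulary as the target column
labels = ['republican', 'democrat', 'democrat']

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# Classes are stored in sorted order; a label's code is its position in classes_
print(le_demo.classes_)

# Map the integer codes back to the original labels
print(le_demo.inverse_transform(codes))
```

The same inverse_transform call can be applied to the predictions in Section 5 to report party names instead of 0 and 1.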

## 4. Hyperparameter tuning and model selection
In this section, we will first use the combination of Pipeline and GridSearchCV to tune the hyperparameters of five classifiers:
- logistic regression
- multi-layer perceptron
- decision tree
- random forest
- support vector machine

Next we will select the best model across the five classifiers.

### 4.1. Create the dictionary of classifiers
In the dictionary:
- the key is the acronym of the classifier
- the value is the classifier (with random_state=0)

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

clfs = {'lr': LogisticRegression(random_state=0),
        'mlp': MLPClassifier(random_state=0),
        'dt': DecisionTreeClassifier(random_state=0),
        'rf': RandomForestClassifier(random_state=0),
        'svc': SVC(random_state=0)}

### 4.2. Create the dictionary of pipelines
In the dictionary:
- the key is the acronym of the classifier
- the value is another dictionary that maps each value in n_components (defined in cell 7) to a pipeline made of StandardScaler, PCA (only when that value is smaller than X.shape[1]), and the classifier

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Candidate numbers of principal components: a quarter, a half, and all of the features
n_components = [X.shape[1] // 4, X.shape[1] // 2, X.shape[1]]

pipe_clfs = {}

for name, clf in clfs.items():
    pipe_clfs[name] = {}
    for n_component in n_components:
        if n_component < X.shape[1]:
            # Reduce the dimensionality with PCA before the classifier
            pipe_clfs[name][n_component] = Pipeline([('StandardScaler', StandardScaler()),
                                                     ('PCA', PCA(n_components=n_component, random_state=0)),
                                                     ('clf', clf)])
        else:
            # Keep all the features, so PCA is skipped
            pipe_clfs[name][n_component] = Pipeline([('StandardScaler', StandardScaler()),
                                                     ('clf', clf)])
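
The step names given to Pipeline matter for the parameter grids in the next section: a hyperparameter of a step is addressed as the step name, two underscores, and the parameter name. The following sketch illustrates this on a standalone pipeline named demo_pipe (a hypothetical name, with an arbitrary n_components=2).

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([('StandardScaler', StandardScaler()),
                      ('PCA', PCA(n_components=2, random_state=0)),
                      ('clf', LogisticRegression(random_state=0))])

# '<step name>__<parameter name>' reaches into the named step
demo_pipe.set_params(clf__C=0.1)
print(demo_pipe.named_steps['clf'].C)

# get_params lists every name that a parameter grid may use as a key
print('PCA__n_components' in demo_pipe.get_params())
```

This is why every key in the grids below starts with clf__: the classifier step is always named 'clf', so the same grid works for the pipelines with and without PCA.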

### 4.3. Create the dictionary of parameter grids
In the dictionary:
- the key is the acronym of the classifier
- the value is the parameter grid of the classifier

In [8]:
param_grids = {}

#### 4.3.1. The parameter grid for logistic regression
The hyperparameters we want to tune are:
- multi_class
- solver
- C

The parameter grid needs two dictionaries because the 'liblinear' solver does not support multi_class='multinomial'. See the [sklearn logistic regression documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for the meaning of these hyperparameters.

In [9]:
C_range = [10 ** i for i in range(-4, 5)]

param_grid = [{'clf__multi_class': ['ovr'], 
               'clf__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
               'clf__C': C_range},
              {'clf__multi_class': ['multinomial'],
               'clf__solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
               'clf__C': C_range}]

param_grids['lr'] = param_grid
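
When a grid is a list of dictionaries, GridSearchCV expands each dictionary separately and searches the union, so invalid combinations such as 'multinomial' with 'liblinear' are never generated. As a quick check, the sketch below counts the candidates with ParameterGrid; lr_grid is a copy of the grid above under a hypothetical name.

```python
from sklearn.model_selection import ParameterGrid

C_range = [10 ** i for i in range(-4, 5)]

lr_grid = [{'clf__multi_class': ['ovr'],
            'clf__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
            'clf__C': C_range},
           {'clf__multi_class': ['multinomial'],
            'clf__solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
            'clf__C': C_range}]

# 1 * 5 * 9 candidates from the first dictionary plus 1 * 4 * 9 from the second
n_candidates = len(ParameterGrid(lr_grid))
print(n_candidates)  # 81
```

With 10-fold cross-validation and three pipelines per classifier, these 81 candidates translate into 81 * 10 * 3 = 2430 fits for logistic regression alone. Note also that recent scikit-learn releases deprecate the multi_class parameter, so this grid assumes a version that still accepts it.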

#### 4.3.2. The parameter grid for multi-layer perceptron
The hyperparameters we want to tune are:
- hidden_layer_sizes
- activation

See the [sklearn multi-layer perceptron documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier) for the meaning of these hyperparameters.

In [10]:
param_grid = [{'clf__hidden_layer_sizes': [10, 100, 200],
               'clf__activation': ['identity', 'logistic', 'tanh', 'relu']}]

param_grids['mlp'] = param_grid

#### 4.3.3. The parameter grid for decision tree
The hyperparameters we want to tune are:
- min_samples_split
- min_samples_leaf

See the [sklearn decision tree documentation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) for the meaning of these hyperparameters.

In [11]:
param_grid = [{'clf__min_samples_split': [2, 10, 30],
               'clf__min_samples_leaf': [1, 10, 30]}]

param_grids['dt'] = param_grid

#### 4.3.4. The parameter grid for random forest
The hyperparameters we want to tune are:
- n_estimators
- min_samples_split
- min_samples_leaf

See the [sklearn random forest documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for the meaning of these hyperparameters.

In [12]:
param_grid = [{'clf__n_estimators': [2, 10, 30],
               'clf__min_samples_split': [2, 10, 30],
               'clf__min_samples_leaf': [1, 10, 30]}]

param_grids['rf'] = param_grid

#### 4.3.5. The parameter grid for support vector machine
The hyperparameters we want to tune are:
- C
- gamma
- kernel

See the [sklearn support vector machine documentation](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) for the meaning of these hyperparameters.

In [13]:
param_grid = [{'clf__C': [0.01, 0.1, 1, 10, 100],
               'clf__gamma': [0.01, 0.1, 1, 10, 100],
               'clf__kernel': ['linear', 'poly', 'rbf', 'sigmoid']}]

param_grids['svc'] = param_grid
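
The gamma hyperparameter is a kernel coefficient for the 'poly', 'rbf', and 'sigmoid' kernels only, so the single dictionary above evaluates the linear kernel five times for every value of C. One possible alternative, sketched here as an assumption rather than as part of the exercise, is to split the grid in the same way as for logistic regression. The names single_grid and split_grid are hypothetical.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.01, 0.1, 1, 10, 100]
gamma_values = [0.01, 0.1, 1, 10, 100]

# The grid as defined in the notebook
single_grid = [{'clf__C': C_values,
                'clf__gamma': gamma_values,
                'clf__kernel': ['linear', 'poly', 'rbf', 'sigmoid']}]

# A split grid that varies gamma only for the kernels that use it
split_grid = [{'clf__C': C_values,
               'clf__kernel': ['linear']},
              {'clf__C': C_values,
               'clf__gamma': gamma_values,
               'clf__kernel': ['poly', 'rbf', 'sigmoid']}]

n_single = len(ParameterGrid(single_grid))
n_split = len(ParameterGrid(split_grid))
print(n_single, n_split)  # 100 80
```

The split grid covers the same distinct models with 20 fewer candidates, which saves 200 fits per pipeline under 10-fold cross-validation.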

### 4.4. Hyperparameter tuning
Here we use two classes for hyperparameter tuning:
- [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html): Exhaustive search over specified parameter values for an estimator
- [StratifiedKFold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html): Stratified K-Folds cross-validator
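
To illustrate what "stratified" means, the sketch below splits a made-up label vector with a 2:1 class ratio and prints the class counts of each test fold. The arrays X_demo and y_demo are hypothetical and unrelated to the voting data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 20 samples of class 0 and 10 of class 1
y_demo = np.array([0] * 20 + [1] * 10)
X_demo = np.zeros((30, 1))  # the feature values do not affect how the folds are drawn

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_counts = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # Count the samples of each class in this test fold
    fold_counts.append(np.bincount(y_demo[test_idx]).tolist())

print(fold_counts)  # every fold holds 4 samples of class 0 and 2 of class 1
```

Keeping the class ratio the same in every fold makes the accuracy estimates of the folds comparable, which matters for a data set of only 232 rows.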

In [14]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

# The list of [best_score_, best_params_, best_estimator_]
best_score_param_estimators = []

# For each classifier
for name in pipe_clfs.keys():
    for n_component in n_components:   
        # GridSearchCV
        gs = GridSearchCV(estimator=pipe_clfs[name][n_component],
                          param_grid=param_grids[name],
                          scoring='accuracy',
                          n_jobs=-1,
                          cv=StratifiedKFold(n_splits=10,
                                             shuffle=True,
                                             random_state=0))
        # Fit the pipeline
        gs = gs.fit(X, y)

        # Update best_score_param_estimators
        best_score_param_estimators.append([gs.best_score_, gs.best_params_, gs.best_estimator_])
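
Besides best_score_, best_params_, and best_estimator_, a fitted GridSearchCV keeps the score of every candidate in cv_results_. The following sketch shows how this could be inspected. It runs on synthetic data from make_classification with a deliberately small grid, so the numbers it prints say nothing about the voting data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (X, y)
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=0)

demo_pipe = Pipeline([('StandardScaler', StandardScaler()),
                      ('clf', DecisionTreeClassifier(random_state=0))])

demo_gs = GridSearchCV(estimator=demo_pipe,
                       param_grid=[{'clf__min_samples_split': [2, 10, 30]}],
                       scoring='accuracy',
                       cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
demo_gs.fit(X_demo, y_demo)

# One row per candidate, with the mean and spread of its cross-validated accuracy
results = pd.DataFrame(demo_gs.cv_results_)
print(results[['param_clf__min_samples_split', 'mean_test_score',
               'std_test_score', 'rank_test_score']])
```

Looking at std_test_score next to mean_test_score helps to judge whether the difference between the best candidate and the runner-up is larger than the variation across folds.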

### 4.5. Model selection

In [15]:
# Sort best_score_param_estimators in descending order of the best_score_
best_score_param_estimators = sorted(best_score_param_estimators, key=lambda x : x[0], reverse=True)

# Print out best_estimator
print(best_score_param_estimators[0][2])

Pipeline(memory=None,
     steps=[('StandardScaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=30,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best'))])
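
When several candidates share the top score, the order of the sorted list decides which one is reported. The sketch below uses made-up records (the scores, parameters, and labels are hypothetical) to show that sorted keeps tied entries in their original order, also with reverse=True.

```python
# Hypothetical [score, params, label] records standing in for the real search results
records = [[0.95, {'clf__C': 1}, 'lr, 8 components'],
           [0.97, {'clf__min_samples_split': 30}, 'dt, 32 components'],
           [0.97, {'clf__C': 10}, 'svc, 8 components']]

# sorted() is stable, so tied scores keep the order in which they were appended
ranked = sorted(records, key=lambda record: record[0], reverse=True)

print([record[2] for record in ranked])
```

In the notebook this means that a tie is resolved in favor of the classifier that comes first in clfs and, within a classifier, of the smaller value in n_components.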


## 5. What is next?
After selecting the best model, we can use it to predict the class labels of new samples. These samples could be the unlabeled ones in [Kaggle competitions](https://www.kaggle.com/competitions).

To illustrate how this can be done, here we simply use the input data to mimic the new data.

In [16]:
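# Predict the class labels with the best pipeline (GridSearchCV refits it on all of X and y by default)
y_pred = best_score_param_estimators[0][2].predict(X)
y_pred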

array([0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0])
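
Because the best pipeline was refit on X and y, predictions on the same data tend to look better than predictions on truly new samples. A common way to estimate performance on unseen data is to hold out a test set before any tuning. The sketch below illustrates the idea on synthetic data; the split size, the model, and its min_samples_split value are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (X, y)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hold out 30% of the samples, keeping the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

model = Pipeline([('StandardScaler', StandardScaler()),
                  ('clf', DecisionTreeClassifier(min_samples_split=30, random_state=0))])
model.fit(X_train, y_train)

# Accuracy on the data the model has seen versus the data it has not
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print('train accuracy:', train_acc)
print('test accuracy:', test_acc)
```

In the notebook, the same pattern would mean running the grid search on the training part only and calling predict on the held-out part.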