### Problem Statement

In this project, we build a supervised learning model on the “mushroom classification dataset” from the UCI Machine Learning Repository, then evaluate how effectively it classifies mushrooms as **“edible”** or **“poisonous”**. This makes it a **binary classification** problem.

---

### Outline

We break the notebook into separate steps. The outline and navigation links are below.

* [Step 0](#step0): Set Environment
* [Step 1](#step1): Data Exploration
* [Step 2](#step2): Data Pre-processing
* [Step 3](#step3): Feature Importance Analysis
* [Step 4](#step4): Models
* [Step 5](#step5): Optimization
* [Step 6](#step6): Conclusion


<a id='step0'></a>
## Step 0: Set Environment

- Python
- Numpy
- Pandas
- Matplotlib
- Seaborn
- SciKit Learn

- Math 
- OS
- Time
- Collections

In [None]:
import math
import numpy as np
import pandas as pd
import os
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from time import time
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression as LGR
from sklearn.neural_network import MLPClassifier as MLP
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from collections import OrderedDict


---
<a id='step1'></a>
## Step 1: Data Exploration

### Import Mushroom Dataset
Listed below are the descriptions of each column in the data. The goal is to demonstrate several supervised learning models and find the one that classifies mushrooms most effectively. We also want to identify which features are critical indicators of mushroom edibility.


Attribute Information: (classes: edible=e, poisonous=p)

- **cap-shape**: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
- **cap-surface**: fibrous=f,grooves=g,scaly=y,smooth=s
- **cap-color**: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
- **bruises**: bruises=t,no=f
- **odor**: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
- **gill-attachment**: attached=a,descending=d,free=f,notched=n
- **gill-spacing**: close=c,crowded=w,distant=d
- **gill-size**: broad=b,narrow=n
- **gill-color**: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
- **stalk-shape**: enlarging=e,tapering=t
- **stalk-root**: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
- **stalk-surface-above-ring**: fibrous=f,scaly=y,silky=k,smooth=s
- **stalk-surface-below-ring**: fibrous=f,scaly=y,silky=k,smooth=s
- **stalk-color-above-ring**: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
- **stalk-color-below-ring**: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
- **veil-type**: partial=p,universal=u
- **veil-color**: brown=n,orange=o,white=w,yellow=y
- **ring-number**: none=n,one=o,two=t
- **ring-type**: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
- **spore-print-color**: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
- **population**: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
- **habitat**: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

In [None]:
data = pd.read_csv("../input/mushrooms.csv")
data.head(5)

### Dataset Information
8124 rows x 23 columns with no null values.<p>
22 feature columns and 1 column specifying the class.

In [None]:
# check null values
data.info()

### Dataset Check - Balance
The classes are well balanced, as the 'poisonous-to-edible' ratio below shows:

In [None]:
### Check input data balance
# "poisonous"-to-"edible" ratio
edible_cnt = data[data["class"] == "e"]["class"].count()
poisonus_cnt = data[data["class"] == "p"]["class"].count() 
p_e_ratio = poisonus_cnt/float(edible_cnt)
print("\n'poisonous'-to-'edible' ratio: {}\npoisonus_cnt: {}, edible_cnt: {}"
      .format(p_e_ratio.round(2),poisonus_cnt,edible_cnt))


# visualize "e" vs "p" balance
sns.set(style="ticks", color_codes=True)
plt.title("Balance Checking for input class in Mushroom Dataset",fontsize=14)
sns.countplot(x = data["class"], data = data)

### Dataset Check - Unique Values
1. 'veil-type' has only one unique value, 'p', so it carries no information for classification and can be dropped.
2. 'stalk-root' has 2480 '?' values (~30% of the data); keep this in mind during training.
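
The two checks above can be sketched on a small stand-in frame. This is only an illustration: the `toy` DataFrame and its values are made up, and the variable names `constant_cols`, `missing_counts`, and `missing_share` are not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the mushroom data (values are illustrative only)
toy = pd.DataFrame({
    "class":      ["e", "p", "e", "p"],
    "veil-type":  ["p", "p", "p", "p"],
    "stalk-root": ["b", "?", "?", "e"],
})

# Columns with a single unique value cannot help separate the classes
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Count "?" placeholders per column, and their share of all rows
missing_counts = (toy == "?").sum()
missing_share = (toy == "?").mean()

print(constant_cols)
print(missing_counts["stalk-root"], missing_share["stalk-root"])
```

On the real data the same expressions could be applied to `data` directly, assuming '?' is the only placeholder used for missing values.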

In [None]:
# unique value for each feature
columns = data.columns.values
for column in columns:
    print("{0}: {1}".format(column, data[column].unique()))
    
# check number of "?" values for "stalk-root", which means N/A for the feature.
print("\n There are {} '?' values in the feature 'stalk-root'.".format(data[data["stalk-root"] == "?"]["stalk-root"].count()))

---
<a id='step2'></a>
## Step 2: Data Pre-processing

### Drop Feature 'veil-type'
As the analysis above showed, 'veil-type' is irrelevant to the classification result since it takes a single value, so let's drop it now:

In [None]:
# drop "veil-type" since it has only one unique value; the number of features is reduced to 21 now.
data = data.drop("veil-type", axis = 1)
data.shape

### Data Encoding
Since the learning algorithms expect numerical inputs, we now transform the string-valued dataset into numerical form with one-hot encoding.
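
As a minimal sketch of what one-hot encoding does, the toy frame below (made-up values, not the real dataset) is expanded into one indicator column per column/category pair. For a binary label, the two class indicators are complementary, so one of them can be dropped.

```python
import pandas as pd

# Two categorical columns with single-letter codes, as in the mushroom data
toy = pd.DataFrame({
    "class": ["e", "p", "e"],
    "odor":  ["n", "f", "n"],
})

# One indicator column per (column, category) pair
toy_onehot = pd.get_dummies(toy)
all_columns = list(toy_onehot.columns)
print(all_columns)

# "class_e" is the complement of "class_p", so keep only one of them
toy_onehot = toy_onehot.drop(columns=["class_e"])
print(list(toy_onehot.columns))
```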

In [None]:
# Use one-hot encoder to encode the data
data_onehot = pd.get_dummies(data)

# Since we are dealing with a binary classification problem,
# we can simply drop "class_e" and use only "class_p" as the indicator
data_onehot = data_onehot.drop(['class_e'], axis=1)

# print the name and number of features after one-hot encoding
encoded = list(data_onehot.columns[1:])
print ("{} total features after one-hot encoding.".format(len(encoded)))
print (encoded)

data_onehot.head(5)

### Split Features and Labels

In [None]:
y_onehot = data_onehot['class_p']
X_onehot = data_onehot.drop(['class_p'], axis=1)
X_onehot.head()

In [None]:
y_onehot.head()

---
<a id='step3'></a>
## Step 3: Feature Importance Analysis

### Visualization for Feature Distribution
Plot the class distribution grouped by each feature value:

In [None]:
feature_columns = data.columns[1:]
sns.set(style="darkgrid", color_codes=True)
fig, axes = plt.subplots(nrows= 3, ncols=7,figsize=(35, 15))
#fig, axes = plt.subplots(nrows= 7, ncols=3,figsize=(15, 35))
data['id'] = np.arange(1, data.shape[0] + 1)

for f, ax in zip(feature_columns, axes.ravel()):
    data.groupby(['class', f])['id'].count().unstack(f).plot(kind='bar', ax=ax, legend=True, grid=True,fontsize = 16)
    #ax.set_ylabel('Actual label', style='italic')
    ax.set_title(f, style='oblique', size=24)
    ax.set_xlabel(' \n', style='italic', size=18)
    ax.legend(fontsize=14)   #for 3*7 change to 14
data = data.drop("id",axis = 1) 

### Feature - Class Correlation
Here we compute the correlation between the features and the class. Together with the distribution plots above, this reveals a few things about feature importance:

1. About 96.9% of poisonous mushrooms do not belong to "odor type n".
2. About 55.2% of poisonous mushrooms belong to "odor type f". From points 1 and 2 and the distribution plots, we can see that odor is a significant feature for mushroom classification.
3. About 55.8% of poisonous mushrooms have gill-color_b, and the distribution plots show that few edible mushrooms have gill-color_b.
4. About 55.2% of poisonous mushrooms belong to "stalk-surface-below-ring_k type", and the distribution plots show that few edible mushrooms belong to it.
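
A percentage such as "X% of poisonous mushrooms have feature f" is the mean of the feature's 0/1 indicator over the poisonous rows. The sketch below shows this on a made-up eight-row frame, alongside the Pearson correlation of that indicator with the class. The numbers are illustrative and unrelated to the real dataset.

```python
import pandas as pd

# 4 poisonous rows (class_p == 1) and 4 edible rows; 3 poisonous rows have odor_f
toy = pd.DataFrame({
    "class_p": [1, 1, 1, 1, 0, 0, 0, 0],
    "odor_f":  [1, 1, 1, 0, 0, 0, 0, 0],
})

# Share of poisonous rows that carry the feature, in percent
poisonous = toy[toy["class_p"] == 1]
share_with_feature = 100 * poisonous["odor_f"].mean()

# Pearson correlation between the indicator and the class label
corr_with_class = toy["odor_f"].corr(toy["class_p"])

print(share_with_feature, round(corr_with_class, 3))
```

A high share among poisonous mushrooms combined with a low share among edible ones is what pushes the correlation toward 1.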


In [None]:
# Calculate correlation by pd.corr()
corr = data_onehot.corr().loc[:,'class_p']
top_10_corr =corr.abs().sort_values(ascending=False).head(n=11).iloc[1:]
print ('Top-10 features to class_p correlation:','\n\n',top_10_corr)

top_10_corr_ratio = pd.DataFrame(index=range(2))
for feature in top_10_corr.index:
    feature_grouped = data_onehot[['class_p',feature]].groupby([feature])
    top_10_corr_ratio.loc[:,feature] = 100*feature_grouped.sum()/(poisonus_cnt)

print ('\n\nTop-10 features-class_p poisonus mushrooms ratio:')
top_10_corr_ratio

In [None]:
print ("Visualize Top-10 features to class_p correlation: ")
top_10_corr.plot(kind='bar', grid=True,fontsize = 12)

### PCA (Principal Component Analysis)
Here we perform PCA to find the components that explain the most variance in the data. The explained variance and accumulated ratio by component are plotted below:



In [None]:
# Setup PCA to one-hot encoded features dataset
pca = PCA()
pca.fit(X_onehot)

# Calculate explained_variance and explained_variance_ratio
explained_variance = pca.explained_variance_.round(4)
explained_variance_ratio_ = pca.explained_variance_ratio_.round(4)

# Calculate accumulated explained_variance_ratio
ratio_accm_num=0
ratio_accm=[]
for ratio in explained_variance_ratio_:
    ratio_accm_num += ratio
    ratio_accm.append(ratio_accm_num)

# Print values of explained_variance and accumulated explained_variance_ratio
print ("explained_variance:\n",explained_variance)
print ("\n\naccumulated explained_variance_ratio:\n",np.array(ratio_accm),"\n\n")


# Make plot of explained_variance and accumulated explained_variance_ratio
avl_style_list = plt.style.available
style = avl_style_list[15]
with plt.style.context(style):
    
    ## plot individual explained variance by components
    plt.figure(figsize=(12, 4))
    plt.bar(range(len(X_onehot.columns)),explained_variance, alpha=0.5, align='center', label='individual explained variance')

    ## plot accumulated explained variance ratio by components
    plt.bar(range(len(X_onehot.columns)),ratio_accm, alpha=0.5, align='center', label='accumulated explained variance ratio')
    plt.ylabel('Values')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.title("Explained Variance & Accumulated Ratio by Components",fontsize=14)


### Reduction of Dimension by PCA
We build a new feature dataset **"X_onehot_pca"** by choosing the minimum number of principal components such that **60% of the variance is retained**; this corresponds to **n=8** components.
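
The sketch below illustrates how a fractional `n_components` relates to the cumulative explained variance ratio. It uses a synthetic 0/1 matrix in place of the one-hot features, so the component count it prints is unrelated to the n=8 reported for the mushroom data. Setting `svd_solver="full"` explicitly is an assumption made here so the fractional form behaves the same across scikit-learn versions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
# Synthetic 0/1 matrix standing in for one-hot encoded features
X_toy = (rng.rand(300, 20) < rng.rand(20)).astype(float)

# Fit all components and accumulate the explained variance ratio
pca_full = PCA(svd_solver="full").fit(X_toy)
cum_ratio = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 60%
k_manual = int(np.argmax(cum_ratio > 0.6)) + 1

# Passing a fraction lets PCA make the same choice itself
pca_60 = PCA(n_components=0.6, svd_solver="full").fit(X_toy)
print(k_manual, pca_60.n_components_)
```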

In [None]:
def X_PCA(data,n):
    pca = PCA(n)
    pca.fit(X_onehot)
    X_pca_ = pca.transform(data)

    column_head=[]
    for i in range(X_pca_.shape[1]):
        column_head.append("dimension_"+str(i+1))

    X_pca_ = pd.DataFrame(X_pca_, columns=column_head)
    return X_pca_

X_onehot_pca = X_PCA(X_onehot,n=.6)
X_onehot_pca.head()

---
<a id='step4'></a>
## Step 4: Models

### Split Dataset
Now it's time to build models. Let's split the dataset into a training set (70%) and a testing set (30%), with the random seed set to 42.<p>
We set up two kinds of learning data. One is X_onehot, the original 116 feature columns; the other is X_onehot_pca, transformed by PCA to only 8 dimensions (60% of the explained variance).
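
Because both splits use the same `random_state` and the same number of rows, they should partition the rows identically, so the label vectors of the two splits match. A minimal sketch with synthetic arrays, where `X_a` and `X_b` are hypothetical stand-ins for `X_onehot` and `X_onehot_pca`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_a = rng.rand(100, 6)        # stands in for the full feature matrix
X_b = X_a[:, :2] * 2.0        # stands in for a transformed version of the same rows
y = rng.randint(0, 2, size=100)

# Same test_size and random_state for both calls
Xa_tr, Xa_te, ya_tr, ya_te = train_test_split(X_a, y, test_size=0.3, random_state=42)
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(X_b, y, test_size=0.3, random_state=42)

print(Xa_tr.shape, Xa_te.shape)
# The same rows land in the training set in both splits
print(np.array_equal(ya_tr, yb_tr))
```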

In [None]:
### Split X_onehot (Without PCA transformation) ###
X_train, X_test, y_train, y_test = train_test_split(X_onehot,y_onehot,test_size = 0.3,random_state = 42)

print ('X_train Shape:', X_train.shape)
print ('X_test Shape:', X_test.shape)
print ('y_train Shape:', y_train.shape)
print ('y_test Shape:', y_test.shape)

In [None]:
### Split X_onehot (With PCA transformation s.t. 60% of the variance is retained, n=8) ###
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_onehot_pca,y_onehot,test_size = 0.3,random_state = 42)

print ('X_train_pca Shape:', X_train_pca.shape)
print ('X_test_pca Shape:', X_test_pca.shape)
print ('y_train_pca Shape:', y_train_pca.shape)
print ('y_test_pca Shape:', y_test_pca.shape)

### Define Helper Sub-Functions
We define some helper functions here to make it convenient to compare different algorithms by run time, accuracy, F1 score, precision, and recall.
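
As a reminder of how these scores relate to the confusion matrix, the sketch below derives them by hand from a small made-up label vector. The labels are illustrative only; in scikit-learn's binary confusion matrix the flattened order is TN, FP, FN, TP.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels: 1 = poisonous, 0 = edible
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Rows are actual labels, columns are predicted labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                  # of predicted positives, how many are right
recall = tp / (tp + fn)                     # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, recall, f1, accuracy)
```

For this task a false negative (a poisonous mushroom predicted edible) is arguably the costlier error, which is one reason to look at recall and not only accuracy.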

In [None]:
###### define helper functions for training and prediction

### Train model and calculate t_train
def train_classifier(clf, X_train, y_train, printer):
    '''Fits a classifier to the training data'''
    # Check train time
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    t_train = end - start
    if printer:
        print ("Trained model in {:.4f} seconds!".format(t_train))
    return t_train
    
### Calculate cv scores and calculate t_cv
def train_cv_(clf, X, y, cv_fold, printer):
    '''n-fold CV for the chosen algorithm & training data'''
    # Check cv time
    start = time()
    y_cv = cross_val_predict(clf, X, y, cv=cv_fold)
    end = time()
    t_cv = end - start
    if printer:
        print ("Made {}-fold CV in {:.4f} seconds!".format(cv_fold, t_cv))
    
    # Calculate cv scores
    f1_ = f1_score(y, y_cv)
    precision_ = precision_score(y, y_cv)
    recall_ = recall_score(y, y_cv)
    accuracy_ = accuracy_score(y, y_cv)
    
    # Summary CV score
    if printer:
        print ("\n-------------------CV Score Summary (WITH {}-fold CV)--------------------".format(cv_fold))
        print ("F1 score for cv set: {:.2f}% , Accuracy score for cv set: {:.2f}%.".format(
        f1_*100 , accuracy_*100))

        print("Precision score for cv set: {:.2f}% , Recall score for cv set: {:.2f}%.".format(
        precision_*100 , recall_*100))
        print ("-------------------------------------------------------------------------")
    return t_cv, f1_, precision_, recall_, accuracy_

### Calculate test scores and confusion matrix calculate t_test   
def predict_labels(clf, X, y, printer):
    '''Make predictions using a fit classifier'''
    # Check predict time
    start = time()
    y_predicted = clf.predict(X)
    end = time()
    t_predict = end - start
    
    if printer:
        print ("Made predictions in {:.4f} seconds!".format(t_predict))
    
    # Return predict score / confusion matrix
    f1_ = f1_score(y, y_predicted)
    precision_ = precision_score(y, y_predicted)
    recall_ = recall_score(y, y_predicted)
    accuracy_ = accuracy_score(y, y_predicted)
    conf_matrix = confusion_matrix(y, y_predicted)
    
    return t_predict, f1_, precision_, recall_, accuracy_, conf_matrix  
    

### Core to call train/cv/test for clfs    
def train_predict(clf, X_train, X_test, y_train, y_test, printer, cv=True, cv_fold=10, clf_remark=None):
    ''' Train a classifier, optionally cross-validate it, and score it on the test set'''
    
    # Indicate the classifier and the training set size
    clf_name = clf.__class__.__name__
    if printer:
        print (">>> Training/predicting a {} model now... <<<\n".format(clf_name))
    
    # Train the classifier
    t_train = train_classifier(clf, X_train, y_train, printer=printer)
    
    # Make CV on training set
    if cv:
        t_cv, f1_cv, precision_cv, recall_cv, accuracy_cv = train_cv_(clf, X_train, y_train, cv_fold, printer=printer)
    
       
    # Calculate test scores and confusion matrix
    if printer:
        print ("\nPredicting on testing set...")
    t_predict, f1_test, precision_test, recall_test, accuracy_test, conf_matrix = predict_labels(clf, X_test, y_test, printer=printer)
    
    # Summary score
    if printer:
        print ("\n---------------------- Testing Score Summary -------------------------------------")  
        print ("F1 score for test set: {:.2f}% , Accuracy score for test set: {:.2f}%.".format(
        f1_test*100 , accuracy_test*100))

        print ("Precision score for test set: {:.2f}% , Recall score for test set: {:.2f}%.".format(
        precision_test*100 , recall_test*100))
        print ("\n===============================End of this clf ===============================\n")
        print ("\n------------------------------------------------------------------------------\n")
    
    ### output values ###
        
    if not cv:        
        return clf_name, t_train, t_predict, f1_test, precision_test, recall_test, accuracy_test, conf_matrix 
    
    return clf_name, t_train, t_cv, t_predict, f1_cv, precision_cv, recall_cv, accuracy_cv, f1_test, precision_test, recall_test, accuracy_test, conf_matrix


### Summary table for the result of train_predict()
def clf_summary(clfs, X_train, X_test, y_train, y_test, printer=True, cv=True, cv_fold=10, clfs_rename=None):
    '''Return a summary table about train/test/cv time & scores for given data and a list of classifiers. '''
    result_non_cv = {'Clf_Name':[], 'Time_Train':[], 'Time_Test':[], 'Test_F1':[],
                 'Test_Precision':[], 'Test_Recall':[], 'Test_Accuracy':[], 'Confusion_Matrix':[]}

    result_cv = {'Clf_Name':[], 'Time_Train':[], 'Time_CV':[], 'Time_Test':[], 'CV_F1':[],
             'CV_Precision':[], 'CV_Recall':[], 'CV_Accuracy':[], 'Test_F1':[], 
             'Test_Precision':[], 'Test_Recall':[], 'Test_Accuracy':[], 'Confusion_Matrix':[]}

    if clfs_rename and len(clfs_rename) != len(clfs):
        raise Exception("rename list length error!")
    for ii in range(len(clfs)):
        result_time = []
        result_score = [] 
        result = train_predict(clfs[ii], X_train, X_test, y_train, y_test, printer=printer, cv=cv, cv_fold=cv_fold)

        if cv:
            for i in range(1,4):
                result_time.append(str(round(result[i],2))+'s')   
            for i in range(4,12):
                result_score.append(str(round(100*result[i],2))+'%')

            if clfs_rename:
                result_cv['Clf_Name'].append(clfs_rename[ii])
            else: result_cv['Clf_Name'].append(result[0])
            result_cv['Time_Train'].append(result_time[0])
            result_cv['Time_CV'].append(result_time[1])
            result_cv['Time_Test'].append(result_time[2])
            result_cv['CV_F1'].append(result_score[0])
            result_cv['CV_Precision'].append(result_score[1])
            result_cv['CV_Recall'].append(result_score[2])
            result_cv['CV_Accuracy'].append(result_score[3])
            result_cv['Test_F1'].append(result_score[4])
            result_cv['Test_Precision'].append(result_score[5])
            result_cv['Test_Recall'].append(result_score[6])
            result_cv['Test_Accuracy'].append(result_score[7])
            result_cv['Confusion_Matrix'].append(result[12])


        else:
            for i in range(1,3):
                result_time.append(str(round(result[i],2))+'s')    
            for i in range(3,7):
                result_score.append(str(round(100*result[i],2))+'%')

            if clfs_rename:
                result_non_cv['Clf_Name'].append(clfs_rename[ii])
            else: result_non_cv['Clf_Name'].append(result[0])
            result_non_cv['Time_Train'].append(result_time[0])
            result_non_cv['Time_Test'].append(result_time[1])
            result_non_cv['Test_F1'].append(result_score[0])
            result_non_cv['Test_Precision'].append(result_score[1])
            result_non_cv['Test_Recall'].append(result_score[2])
            result_non_cv['Test_Accuracy'].append(result_score[3])
            result_non_cv['Confusion_Matrix'].append(result[7])
    
    if cv:
        #print result_cv
        clf0_summary_ = pd.DataFrame(result_cv)
        clf0_summary = clf0_summary_.set_index('Clf_Name')
    else:
        clf0_summary_ = pd.DataFrame(result_non_cv)
        clf0_summary = clf0_summary_.set_index('Clf_Name')
    
    return clf0_summary


### Confusion matrix subplot function
def confusion_matrix_plot(clf_conf_list):
    '''Plot confusion matrices for the result of clf_summary, given clf_summary_['Confusion_Matrix']'''
    n_clf = len(clf_conf_list)

    name_clf = []
    Confusion_Matrix_list = []
    for matrix, index_ in zip(clf_conf_list,clf_conf_list.index):
        name_clf.append(index_)
        Confusion_Matrix_list.append(matrix)

    sns.set(style="white", color_codes=True)
    rows_ = int(math.ceil(n_clf/5.0))
    cols_ = 5
    fig_x = 18
    if n_clf < 6:
        cols_ = n_clf
        fig_x = n_clf*(18.0/5.0)
    
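    # squeeze=False always returns a 2-D array of axes, so .ravel() also works for a single classifier
    fig, axn = plt.subplots(nrows= rows_, ncols=cols_, figsize=(fig_x, 6*math.ceil(n_clf/5.0)), squeeze=False)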
    for i, ax in zip(range(n_clf), axn.ravel()):    
        sns.heatmap(Confusion_Matrix_list[i], ax=ax, annot=True,
                    fmt=".0f", linewidths=.5, square = True, cmap = 'Greens')
        ax.set_title('{}\nConfusion Matrix:'.format(name_clf[i]), size = 12, weight='semibold')
        ax.set_ylabel('Actual label', style='italic')
        ax.set_xlabel('Predicted label', style='italic')
        
        
        
########################

## Show detail of GS by different scoring
def Grid_reporter(clf,X_train,y_train,scores,params_,cv=5):
    #g_clfs_dic={}
    g_clfs_dic = OrderedDict()
    i=0
    for score in scores:
        i += 1
        print("# Tuning hyper-parameters for %s" % score)
        print ("")

        g_clf = GridSearchCV(clf, params_, cv=cv,
                           scoring=score)
        g_clf.fit(X_train, y_train)

        print ("Best parameters set found on development set:")
        print (g_clf.best_params_)
        print ("")
        print ("Grid scores on development set:")
        means = g_clf.cv_results_['mean_test_score']
        stds = g_clf.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, g_clf.cv_results_['params']):
            print("%0.3f (+/-%0.03f) for %r"
                  % (mean, std * 2, params))
        g_clfs_dic[str(i)+"_"+score] = g_clf.best_estimator_
    return g_clfs_dic


## Summary of GS by different scoring
def clf_summary_gs(clfs_grid_score,X_train, X_test, y_train, y_test,gs_str):
    clfs = []
    clfs_rename = []
    for key in clfs_grid_score:
        clfs_rename.append(key+gs_str)
        clfs.append(clfs_grid_score[key])
    clf_summary_ = clf_summary(clfs, X_train, X_test, y_train, y_test, 
                               cv=True, cv_fold=10, printer=False, clfs_rename=clfs_rename)
    return clf_summary_.drop('Confusion_Matrix', axis=1).transpose(),clf_summary_['Confusion_Matrix']

### Basic Model Performance
Here we choose four well-known classic supervised learning models (LogisticRegression, SVC, DecisionTreeClassifier, GaussianNB) and an MLP model, all with near-default parameters, and run train/CV/test on the corresponding PCA-transformed splits to get a baseline.<p>
The table below compares train/CV/test time, scores, and confusion matrices across the algorithms. The MLP seems to do slightly better than the others, but once train/CV time is taken into account, DecisionTreeClassifier is faster overall in train/CV/test and also scores well:

In [None]:
clfs = [LGR(random_state=42,solver='lbfgs'),SVC(random_state=42,gamma='auto'),
        tree.DecisionTreeClassifier(random_state=42),GaussianNB(), MLP(random_state=42,max_iter=1000)]
clf_summary_ = clf_summary(clfs, X_train_pca, X_test_pca, y_train_pca, y_test_pca, cv=True, cv_fold=10, printer=False)
clf_summary_.drop('Confusion_Matrix', axis=1).transpose()

In [None]:
confusion_matrix_plot(clf_summary_['Confusion_Matrix'])

---
<a id='step5'></a>
## Step 5: Optimization


### Grid Search

Let's try grid search for parameter optimization on the classifiers. For each algorithm, we optimize against several different scoring functions.
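
The idea can be sketched on synthetic data: the same parameter grid is searched once per scoring function, and the best parameters may differ depending on which score is optimized. The dataset from `make_classification`, the small `param_grid`, and the `best_by_score` dictionary are assumptions made for this illustration, not the grids used in the cells that follow.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the PCA features
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)
param_grid = {"max_depth": [2, 4, None], "criterion": ["gini", "entropy"]}

best_by_score = {}
for scoring in ["accuracy", "f1", "precision", "recall"]:
    # One 5-fold grid search per scoring function
    search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                          param_grid, cv=5, scoring=scoring)
    search.fit(X_toy, y_toy)
    best_by_score[scoring] = (search.best_params_, search.best_score_)

for scoring, (params, score) in best_by_score.items():
    print(scoring, params, round(score, 3))
```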

### Logistic Regression

For LGR, we tune the solver type, class_weight, and warm_start, as well as the values of C and max_iter. The results show no improvement in accuracy.

In [None]:
parameters = {'C': [0.1,1.0,10.0,100.0] ,'solver': ['newton-cg','lbfgs','liblinear','sag'],
             'max_iter': [100,300],'class_weight':['balanced', None],'warm_start':[True,False]}
clf = LGR(random_state=42)
scores_list = ['accuracy', 'f1', 'precision', 'recall']

# report a dict of best clf with different kind of cv score(default fold number of cv=5)
clfs_grid_score = Grid_reporter(clf,X_train_pca,y_train_pca,scores_list,parameters)

In [None]:
clfs_grid_score['5_default'] = LGR(random_state=42,solver='lbfgs')
clfs_grid_score

In [None]:
clf_summary_ , confusion_matrix_ = clf_summary_gs(clfs_grid_score, X_train_pca, X_test_pca, y_train_pca, y_test_pca,
                                                  gs_str="_LGR_GS")

clf_summary_

In [None]:
confusion_matrix_plot(confusion_matrix_)

### SVC
For SVC, we tune the kernel type and the values of C and gamma. The tuned accuracy score reaches 100%.

In [None]:
parameters = [{'kernel': ['rbf'],
               'gamma': [1e-4, 1e-3, 0.01, 0.1, 0.2, 0.5],
                'C': [1, 10, 100, 1000]},
              {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
clf = SVC(random_state=42)
scores_list = ['accuracy', 'f1', 'precision', 'recall']

# report a dict of best clf with different kind of cv score(default fold number of cv=5)
clfs_grid_score = Grid_reporter(clf,X_train_pca,y_train_pca,scores_list,parameters)

In [None]:
clfs_grid_score['5_default'] = SVC(random_state=42,gamma='auto')
clfs_grid_score

In [None]:
clf_summary_ , confusion_matrix_ = clf_summary_gs(clfs_grid_score, X_train_pca, X_test_pca, y_train_pca, y_test_pca,
                                                  gs_str="_SVC_GS")
clf_summary_

In [None]:
confusion_matrix_plot(confusion_matrix_)

### Decision Tree
For the decision tree, we tune the criterion type and the values of max_depth, min_samples_leaf, and max_features. This yields a slight improvement in the accuracy score.

In [None]:
parameters = {'max_depth': [None,6,7,10,11,12,13,14], 'min_samples_leaf': [1,2], 
              'max_features':[None,3,4,5,6,7,8],'criterion':['gini','entropy']}
clf = tree.DecisionTreeClassifier(random_state=42)
scores_list = ['accuracy', 'f1', 'precision', 'recall']

# report a dict of best clf with different kind of cv score(default fold number of cv=5)
clfs_grid_score = Grid_reporter(clf,X_train_pca,y_train_pca,scores_list,parameters)

In [None]:
clfs_grid_score['5_default'] = tree.DecisionTreeClassifier(random_state=42)
clfs_grid_score

In [None]:
clf_summary_ , confusion_matrix_ = clf_summary_gs(clfs_grid_score, X_train_pca, X_test_pca, y_train_pca, y_test_pca,
                                                  gs_str="_DeciTree_GS")
clf_summary_

In [None]:
confusion_matrix_plot(confusion_matrix_)

### Gaussian NB
For GNB we tune only one parameter: priors. It does not really improve the scores; it only changes which label the classifier tends to favor.
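
A small sketch of that effect, on two overlapping synthetic Gaussian clusters (made-up data, not the mushroom features): the fitted class means and variances stay the same, and only the prior shifts the decision threshold, so a larger prior for class 1 leads to more points being predicted as class 1.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(42)
# Class 0 centered at 0.0, class 1 centered at 1.5, both with unit variance
X_toy = np.concatenate([rng.normal(0.0, 1.0, 200),
                        rng.normal(1.5, 1.0, 200)]).reshape(-1, 1)
y_toy = np.array([0] * 200 + [1] * 200)

n_predicted_positive = {}
for priors in ([0.9, 0.1], [0.5, 0.5], [0.1, 0.9]):
    clf = GaussianNB(priors=priors).fit(X_toy, y_toy)
    # Count how many points are assigned to class 1 under this prior
    n_predicted_positive[tuple(priors)] = int(clf.predict(X_toy).sum())

print(n_predicted_positive)
```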

In [None]:
parameters = {'priors': [None,[0.5,0.5],[0.4,0.6],[0.6,0.4],[0.3,0.7],[0.7,0.3],
                         [0.2,0.8],[0.8,0.2],[0.9,0.1],[0.1,0.9],[0.95,0.05],[0.05,0.95],[0.999,0.001],[0.001,0.999]]}
clf = GaussianNB()
scores_list = ['accuracy', 'f1', 'precision', 'recall']

# report a dict of best clf with different kind of cv score(default fold number of cv=5)
clfs_grid_score = Grid_reporter(clf,X_train_pca,y_train_pca,scores_list,parameters)

In [None]:
clfs_grid_score['5_default'] = GaussianNB()
clfs_grid_score

In [None]:
clf_summary_ , confusion_matrix_ = clf_summary_gs(clfs_grid_score, X_train_pca, X_test_pca, y_train_pca, y_test_pca,
                                                  gs_str="_GNB_GS")
clf_summary_

In [None]:
confusion_matrix_plot(confusion_matrix_)

### MLP
For MLP, we tune the value of alpha and the size/number of hidden layers. The default setting is already excellent, with only 1 FN. With larger hidden layers, hidden_layer_sizes=(200,4), the accuracy reaches 100%.
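
To make the `hidden_layer_sizes` notation concrete: each entry of the tuple is the width of one hidden layer, so `(200, 4)` means two hidden layers with 200 and 4 units. The sketch below fits such a network on synthetic 8-feature data and inspects the weight matrix shapes. The data and the short `max_iter` are assumptions chosen to keep the example fast, so the convergence warning is silenced.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(42)
X_toy = rng.rand(200, 8)                               # 8 input features, like the PCA data
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1.0).astype(int)  # synthetic binary label

mlp = MLPClassifier(hidden_layer_sizes=(200, 4), max_iter=50, random_state=42)
with warnings.catch_warnings():
    # The short training run may not converge; that is fine for inspecting shapes
    warnings.simplefilter("ignore", ConvergenceWarning)
    mlp.fit(X_toy, y_toy)

# One weight matrix per layer transition: input->200, 200->4, 4->output
layer_shapes = [w.shape for w in mlp.coefs_]
print(layer_shapes)
```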

In [None]:
alpha_range = 10.0 ** -np.arange(2, 5)
parameters = {'alpha': alpha_range,
              'hidden_layer_sizes': [(8,),(7,7,),(42,3,),(81,4,),(150,5,),(200,4)]}
clf = MLP(random_state=42,max_iter=1000)
scores_list = ['accuracy', 'f1', 'precision', 'recall']
#scores_list = ['accuracy']

# report a dict of best clf with different kind of cv score(default fold number of cv=5)
clfs_grid_score = Grid_reporter(clf,X_train_pca,y_train_pca,scores_list,parameters)

In [None]:
clfs_grid_score['5_default'] = MLP(random_state=42,max_iter=1000)
clfs_grid_score

In [None]:
clf_summary_ , confusion_matrix_ = clf_summary_gs(clfs_grid_score, X_train_pca, X_test_pca, y_train_pca, y_test_pca,
                                                  gs_str="_MLP_GS")
clf_summary_

In [None]:
confusion_matrix_plot(confusion_matrix_)

---
<a id='step6'></a>
## Step 6: Conclusion

### Overall Result
In Step 5 (Optimization) we tuned 5 classifiers over many parameter combinations, using PCA dimension reduction at a 60% explained variance ratio (n=8). Two of them (MLP and SVC) were tuned to 100% accuracy. Let's summarize the 5 classifiers with their best-accuracy settings below:

In [None]:
# Best-accuracy settings found by the grid searches above. Only the tuned or
# non-default parameters are spelled out; arguments that were removed from
# later scikit-learn releases (e.g. presort, min_impurity_split) are omitted.
clfs_best = [LGR(C=1.0, class_weight='balanced', max_iter=100, random_state=42,
                 solver='newton-cg', warm_start=True),
        SVC(C=10, gamma=0.5, kernel='rbf', random_state=42),
        tree.DecisionTreeClassifier(criterion='entropy', max_depth=None,
                          max_features=None, min_samples_leaf=1, random_state=42),
        GaussianNB(priors=[0.1, 0.9]),
        MLP(alpha=0.001, hidden_layer_sizes=(200, 4), max_iter=1000, random_state=42)]


clfs_rename_best = ['LGR_Best','SVC_Best','DT_Best','GNB_Best','MLP_Best']
clf_summary_ = clf_summary(clfs_best, X_train_pca, X_test_pca, y_train_pca, y_test_pca, 
                           cv=True, cv_fold=10, printer=False, clfs_rename=clfs_rename_best)

clf_summary_best = clf_summary_.drop('Confusion_Matrix', axis=1).transpose()

# ---Overall tuned parameters list---
# LGR_parameters = {'C': [0.1,1.0,10.0,100.0] ,'solver': ['newton-cg','lbfgs','liblinear','sag'],
#             'max_iter': [100,300],'class_weight':['balanced', None],'warm_start':[True,False]}
#
# SVC_parameters = [{'kernel': ['rbf'],
#               'gamma': [1e-4, 1e-3, 0.01, 0.1, 0.2, 0.5],
#                'C': [1, 10, 100, 1000]},
#              {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
#
# DT_parameters = {'max_depth': [None,6,7,10,11,12,13,14], 'min_samples_leaf': [1,2], 
#              'max_features':[None,3,4,5,6,7,8],'criterion':['gini','entropy']}
#
# GNB_parameters = {'priors': [None,[0.5,0.5],[0.4,0.6],[0.6,0.4],[0.3,0.7],[0.7,0.3],
#                         [0.2,0.8],[0.8,0.2],[0.9,0.1],[0.1,0.9],[0.95,0.05],[0.05,0.95],[0.999,0.001],[0.001,0.999]]}
#
# MLP_parameters = {'alpha': 10.0 ** -np.arange(2, 5),
#              'hidden_layer_sizes': [(8,),(7,7,),(42,3,),(81,4,),(150,5,),(200,4)]}

In [None]:
clf_summary_best

In [None]:
confusion_matrix_plot(clf_summary_['Confusion_Matrix'])

### Learning Curve
To check whether the models are overfitting, we plot learning curves for the classifiers with their best parameters.
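
A numeric check can complement the plots. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption, not the mushroom dataset): it calls `learning_curve` and reports the gap between the mean training score and the mean cross-validation score at each training size. A gap that stays large as the training size grows suggests overfitting. The helper name `score_gap` and the `*_demo` variables are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.tree import DecisionTreeClassifier


def score_gap(estimator, X, y, cv, train_sizes):
    """Return training sizes and the mean train-minus-validation score gap."""
    sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, train_sizes=train_sizes)
    # Scores have shape (n_sizes, n_splits); average over the splits.
    gap = train_scores.mean(axis=1) - test_scores.mean(axis=1)
    return sizes, gap


# Synthetic stand-in data (assumption): 500 samples, 10 features, 2 classes.
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     random_state=42)
cv_demo = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

sizes, gap = score_gap(DecisionTreeClassifier(random_state=42),
                       X_demo, y_demo, cv_demo, np.linspace(0.1, 1.0, 5))
for n, g in zip(sizes, gap):
    print(f"{int(n):4d} training samples: train - CV score gap = {g:.3f}")
```

The same helper could be applied to each entry of `clfs_best` with `X_onehot_pca` and `y_onehot` to put numbers next to the curves.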

In [None]:
### Helper function for plotting learning curves

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use scikit-learn's default cross-validation (5-fold since version 0.22),
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

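    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).

    train_sizes : array-like, optional
        Relative (float) or absolute (int) numbers of training examples
        used to generate the learning curve (default: 5 evenly spaced
        fractions from 0.1 to 1.0).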
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt


In [None]:
# Cross-validation with 100 shuffle-split iterations gives smoother mean train
# and test score curves; each iteration holds out a random 20% of the data
# as a validation set.


cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
for clf, name in zip(clfs_best, clfs_rename_best):
    plot_learning_curve(estimator=clf, title=name + '_Learning Curve', X=X_onehot_pca,
                        y=y_onehot, ylim=(0.7, 1.01), cv=cv, n_jobs=4,
                        train_sizes=np.linspace(.002, 0.2, 30))


### Result without PCA dimension reduction
Finally, we also train and test the five classifiers with default parameters on the original data (one-hot encoded, without PCA). As the results below show, the accuracy scores are nearly perfect, and three of the classifiers reach 100% accuracy.

So for this problem, the additional dimensionality reduction with PCA is not actually necessary. LGR and Decision Tree with default parameters already do a great job, and they are also very fast.

Still, as a practice exercise, working through the whole procedure was really helpful for understanding machine learning.
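
The with/without-PCA comparison can be illustrated in isolation. The sketch below uses a small toy categorical table (an assumption, not the mushroom data; the column names `odor` and `cap` and every `*_toy` name are hypothetical) and trains the same decision tree once on the one-hot features and once on PCA-reduced features. It is a minimal illustration of the comparison, not the notebook's `clf_summary` pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy categorical data (assumption): the label depends only on "odor".
rng = np.random.RandomState(42)
toy = pd.DataFrame({
    "odor": rng.choice(list("abcd"), size=400),
    "cap": rng.choice(list("xyz"), size=400),
})
y_toy = toy["odor"].isin(["a", "b"]).astype(int).to_numpy()

# One-hot encode, then split once so both variants see the same rows.
X_toy = pd.get_dummies(toy, columns=["odor", "cap"]).to_numpy(dtype=float)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

# Variant 1: classifier trained directly on the one-hot features.
clf_raw = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
acc_raw = accuracy_score(y_te, clf_raw.predict(X_te))

# Variant 2: PCA fitted on the training split only, then the same classifier.
pca = PCA(n_components=3, random_state=42).fit(X_tr)
clf_pca = DecisionTreeClassifier(random_state=42).fit(pca.transform(X_tr), y_tr)
acc_pca = accuracy_score(y_te, clf_pca.predict(pca.transform(X_te)))

print(f"accuracy without PCA: {acc_raw:.3f}")
print(f"accuracy with PCA:    {acc_pca:.3f}")
```

Fitting PCA on the training split only keeps information from the test rows out of the projection, so the two accuracies are compared on equal terms.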

In [None]:
clfs_basic = [LGR(random_state=42),SVC(random_state=42),tree.DecisionTreeClassifier(random_state=42),
              GaussianNB(), MLP(random_state=42)]
clf_summary_ori = clf_summary(clfs_basic, X_train, X_test, y_train, y_test, 
                           cv=True, cv_fold=10, printer=False)
clf_summary_original = clf_summary_ori.drop('Confusion_Matrix', axis=1).transpose()

In [None]:
clf_summary_original

In [None]:
confusion_matrix_plot(clf_summary_ori['Confusion_Matrix'])