# Tutorial 4: SVMs, Neural Nets, Ensembles

In this tutorial you'll implement SVMs, neural nets, and ensemble methods to classify patients as either having or not having diabetic retinopathy. We'll use the same Diabetic Retinopathy dataset that was used in the previous assignments. You can find additional details about the dataset [here](http://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set). You'll explore how to train SVMs, NNs, and ensembles using the `scikit-learn` library. The scikit-learn documentation can be found [here](http://scikit-learn.org/stable/documentation.html).

In [1]:
#You may add additional imports
import warnings
warnings.simplefilter("ignore")
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier



In [3]:
%matplotlib inline

In [4]:
# Read the data from csv file
# Build the 20 column names: two quality flags, six 'ma' columns,
# eight 'exudate' columns, then three remaining features and the label
col_names = (
    ['quality', 'prescreen']
    + ['ma' + str(i) for i in range(2, 8)]        # ma2 .. ma7
    + ['exudate' + str(i) for i in range(8, 16)]  # exudate8 .. exudate15
    + ['euDist', 'diameter', 'amfm_class', 'label']
)

data = pd.read_csv("messidor_features.txt", names = col_names)
print(data.shape)
data.head(10)

(1151, 20)


Unnamed: 0,quality,prescreen,ma2,ma3,ma4,ma5,ma6,ma7,exudate8,exudate9,exudate10,exudate11,exudate12,exudate13,exudate14,exudate15,euDist,diameter,amfm_class,label
0,1,1,22,22,22,19,18,14,49.895756,17.775994,5.27092,0.771761,0.018632,0.006864,0.003923,0.003923,0.486903,0.100025,1,0
1,1,1,24,24,22,18,16,13,57.709936,23.799994,3.325423,0.234185,0.003903,0.003903,0.003903,0.003903,0.520908,0.144414,0,0
2,1,1,62,60,59,54,47,33,55.831441,27.993933,12.687485,4.852282,1.393889,0.373252,0.041817,0.007744,0.530904,0.128548,0,1
3,1,1,55,53,53,50,43,31,40.467228,18.445954,9.118901,3.079428,0.840261,0.272434,0.007653,0.001531,0.483284,0.11479,0,0
4,1,1,44,44,44,41,39,27,18.026254,8.570709,0.410381,0.0,0.0,0.0,0.0,0.0,0.475935,0.123572,0,1
5,1,1,44,43,41,41,37,29,28.3564,6.935636,2.305771,0.323724,0.0,0.0,0.0,0.0,0.502831,0.126741,0,1
6,1,0,29,29,29,27,25,16,15.448398,9.113819,1.633493,0.0,0.0,0.0,0.0,0.0,0.541743,0.139575,0,1
7,1,1,6,6,6,6,2,1,20.679649,9.497786,1.22366,0.150382,0.0,0.0,0.0,0.0,0.576318,0.071071,1,0
8,1,1,22,21,18,15,13,10,66.691933,23.545543,6.151117,0.496372,0.0,0.0,0.0,0.0,0.500073,0.116793,0,1
9,1,1,79,75,73,71,64,47,22.141784,10.054384,0.874633,0.09978,0.023386,0.0,0.0,0.0,0.560959,0.109134,0,1


### 1. Data prep

Q1. Separate the feature columns from the class label column. You should end up with two separate data frames - one that contains all of the feature values and one that contains the class labels. Print the shape of the features DataFrame, the shape of the labels DataFrame, and the head of the features DataFrame.

In [5]:
# create new dataframe with the feature values columns
feature_data = data.drop(['label'],axis=1)
# create new dataframe with the class label values columns
label_data = data[['label']].copy()
# print the shapes of the new dataframes
print(feature_data.shape)
print(label_data.shape)
# print head of the features DataFrame
feature_data.head(10)

(1151, 19)
(1151, 1)


Unnamed: 0,quality,prescreen,ma2,ma3,ma4,ma5,ma6,ma7,exudate8,exudate9,exudate10,exudate11,exudate12,exudate13,exudate14,exudate15,euDist,diameter,amfm_class
0,1,1,22,22,22,19,18,14,49.895756,17.775994,5.27092,0.771761,0.018632,0.006864,0.003923,0.003923,0.486903,0.100025,1
1,1,1,24,24,22,18,16,13,57.709936,23.799994,3.325423,0.234185,0.003903,0.003903,0.003903,0.003903,0.520908,0.144414,0
2,1,1,62,60,59,54,47,33,55.831441,27.993933,12.687485,4.852282,1.393889,0.373252,0.041817,0.007744,0.530904,0.128548,0
3,1,1,55,53,53,50,43,31,40.467228,18.445954,9.118901,3.079428,0.840261,0.272434,0.007653,0.001531,0.483284,0.11479,0
4,1,1,44,44,44,41,39,27,18.026254,8.570709,0.410381,0.0,0.0,0.0,0.0,0.0,0.475935,0.123572,0
5,1,1,44,43,41,41,37,29,28.3564,6.935636,2.305771,0.323724,0.0,0.0,0.0,0.0,0.502831,0.126741,0
6,1,0,29,29,29,27,25,16,15.448398,9.113819,1.633493,0.0,0.0,0.0,0.0,0.0,0.541743,0.139575,0
7,1,1,6,6,6,6,2,1,20.679649,9.497786,1.22366,0.150382,0.0,0.0,0.0,0.0,0.576318,0.071071,1
8,1,1,22,21,18,15,13,10,66.691933,23.545543,6.151117,0.496372,0.0,0.0,0.0,0.0,0.500073,0.116793,0
9,1,1,79,75,73,71,64,47,22.141784,10.054384,0.874633,0.09978,0.023386,0.0,0.0,0.0,0.560959,0.109134,0


### 2. Support Vector Machines (SVM) and Pipelines

Q2. For some classification algorithms, like KNN, SVMs, and Neural Nets, scaling of the data is critical for the algorithm to operate correctly. For other classification algorithms, like Naive Bayes and Decision Trees, data scaling is not necessary (take a minute to think about why that is the case).

We discussed in class how the data scaling should happen on the _training set only_, which means that it should happen _inside_ of the cross validation loop. In other words, in each fold of the cross validation, the data will be separated into training and test sets. The scaling parameters (mean and std, for instance) should be calculated from the values in the _training set only_. Then the test set can be scaled using the values found on the training set. (Refer to the concept of [data leakage](https://machinelearningmastery.com/data-leakage-machine-learning/).)
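To make the leakage point concrete, here is a minimal sketch on synthetic data (the array values and the 80/20 split are made up for illustration and are not part of the assignment). The scaler learns its mean and standard deviation from the training rows only and then reuses those statistics on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix (illustrative values only)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_train, X_test = X[:80], X[80:]

demo_scaler = StandardScaler()
# fit_transform() computes mean and std from the training rows only
X_train_scaled = demo_scaler.fit_transform(X_train)
# transform() reuses the training statistics and learns nothing from the test rows
X_test_scaled = demo_scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # essentially 0 for each column
print(X_test_scaled.mean(axis=0))   # near 0, but generally not exactly 0
```

The test-set means are not exactly zero because the test rows were shifted by the training mean, which is the behavior we want: the model never sees statistics computed from held-out data.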

In order to do this with scikit-learn, you must create what's called a `Pipeline` and pass that into the cross validation. This is a very important concept for Data Mining and Machine Learning, so let's practice it here.

Do the following:
* Create a `sklearn.preprocessing.StandardScaler` object to standardize the dataset’s features (mean = 0 and variance = 1). Do not call `fit` on it yet. Just create the `StandardScaler` object.
* Create a `sklearn.svm.SVC` classifier with a `linear` kernel. Do not call `fit` on it yet. Just create the `SVC` object.
* Create a `sklearn.pipeline.Pipeline` and set the `steps` to the scaler and the SVC objects that you just created. 
* Pass the `pipeline` into `cross_val_score` as the estimator, along with the features and the labels, and use a 5-fold-CV.

In each fold of the cross validation, the training phase will use _only_ the training data for scaling and training the model. Then the testing phase will scale the test data into the scaled space (found on the training data) and run the test data through the trained classifier, to return an accuracy measurement for each fold. Print the average accuracy across all 5 folds. 

In [6]:
# create standard scaler object
scaler = StandardScaler()
# create svc object
svc = SVC(kernel='linear')
# setup pipeline
pipeline = Pipeline(steps=[('scaler', scaler), ('svc', svc)])
# run pipeline with cvs and get accuracy
piped_cvs = cross_val_score(pipeline, feature_data, label_data, cv=5)
print("Accuracy:", piped_cvs.mean()*100)

Accuracy: 72.28646715603239


Q3. The `svm.SVC` defaults to using an rbf (radial basis function) kernel. This kernel may or may not be the best choice for our dataset. We can use nested cross validation to find the best kernel for this dataset.

Set up the inner CV loop:
* Starter code is provided to create the "parameter grid" to search. You will need to change this code! Where I have "svm__kernel", this indicates that I want to tune the "kernel" parameter in the "svm" part of the pipeline. When you created your pipeline above, you named the SVM part of the pipeline with a string. You should replace "svm" in the param_grid below with whatever you named your SVM part of the pipeline: **<replace_this>__kernel.** 
* Create a `sklearn.model_selection.GridSearchCV` that takes in the pipeline you created above (as the estimator), the parameter grid, and uses a 5-fold-CV. Call `fit` on the `GridSearchCV` to find the best kernel. 
* Print out the best kernel (`best_params_`) for this dataset. 
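If you are unsure which names are valid in the parameter grid, the pipeline can list them for you. The sketch below assumes the steps are named `'scaler'` and `'svc'` (any names work, as long as the grid uses the same ones) and uses `get_params()` to show the `<step_name>__<parameter_name>` keys.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The step names 'scaler' and 'svc' are chosen here for illustration
pipe = Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])

# Every tunable parameter is exposed as <step_name>__<parameter_name>
svc_params = sorted(name for name in pipe.get_params() if name.startswith('svc__'))
print(svc_params)

# A grid key must match one of those names exactly
param_grid = {'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid']}
keys_are_valid = all(key in pipe.get_params() for key in param_grid)
print(keys_are_valid)
```

A key such as `svm__kernel` would fail here, because no step in this pipeline is called `svm`.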

In [7]:
# for the 'svc' part of the pipeline, tune the 'kernel' hyperparameter
param_grid = {'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid']}

# use GridSearchCV on the pipeline
piped_grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
# call fit() on the GridSearchCV and pass in the data
piped_grid.fit(feature_data, label_data)
# print out best_params_ from the GridSearchCV
print(piped_grid.best_params_)

{'svc__kernel': 'linear'}


Q4. Now put what you did in Q3 into an outer CV loop to evaluate the accuracy of using that best-found kernel on unseen test data.
* Pass the `GridSearchCV` into `cross_val_score` with 5-fold-CV. Print out the accuracy.

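Note that a better choice of kernel can increase the accuracy. In this run the result is identical to Q2, which is consistent with the grid search selecting the same linear kernel that Q2 already used.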
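To see what the nested loop does in each outer fold, one option is `cross_validate` with `return_estimator=True`, which keeps the fitted `GridSearchCV` from every fold. This is a sketch on synthetic data from `make_classification`, with a reduced two-kernel grid so that it runs quickly; none of the numbers it prints relate to the retinopathy dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the real features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])
inner_search = GridSearchCV(pipe, {'svc__kernel': ['linear', 'rbf']}, cv=3)

# Outer loop: each fold reruns the whole grid search on that fold's training part
outer = cross_validate(inner_search, X_demo, y_demo, cv=5, return_estimator=True)

for fold, (est, score) in enumerate(zip(outer['estimator'], outer['test_score'])):
    print(fold, est.best_params_, round(score, 3))
```

Different outer folds may select different kernels, which is why the outer score estimates the accuracy of the tuning procedure rather than of one fixed model.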

In [8]:
# pass GridSearchCV from Q3 into a cross val
outer_cv_grid = cross_val_score(piped_grid, feature_data, label_data, cv=5)
print("Accuracy:", outer_cv_grid.mean()*100)

Accuracy: 72.28646715603239


Q5. Let's see if we can get the accuracy even higher by tuning additional hyperparameters. SVMs have a parameter called 'C' that is the cost for a misclassification. (More info [here](https://medium.com/@pushkarmandot/what-is-the-significance-of-c-value-in-support-vector-machine-28224e852c5a)).
* Create a parameter grid that includes the kernel (as you have above) and the C value as well. Try values of C from 50 to 100 by increments of 10. (You can use the range function to help you with this.)
* Create a `GridSearchCV` with the pipeline from above, this new parameter grid, and a 5-fold-CV.
* Pass the `GridSearchCV` into a `cross_val_score` with a 5-fold-CV and print out the accuracy.

Be patient, as this can take some time to run. Note that the accuracy has increased even further because the best value of C was found and used on the test data.
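The run time follows from the size of the grid. As a rough sketch (assuming the SVC step is named `'svc'` and that all four kernels are searched), `ParameterGrid` can count the candidate combinations before you commit to a long run.

```python
from sklearn.model_selection import ParameterGrid

# range(50, 101, 10) yields 50, 60, ..., 100 (the stop value is exclusive)
c_values = list(range(50, 101, 10))

param_grid = {
    'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'svc__C': c_values,
}

n_candidates = len(ParameterGrid(param_grid))
# With 5 inner folds and 5 outer folds, each candidate is fit 25 times,
# not counting the refit of the best candidate in each outer fold
n_fits = n_candidates * 5 * 5
print(c_values, n_candidates, n_fits)
```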

In [9]:
# create parameter grid with kernel and C vals
param_grid2 = {'svc__kernel': ['linear'],'svc__C': [50.0, 60.0, 70.0, 80.0, 90.0, 100.0]}
# create gridsearch with pipeline and params
piped_grid2 = GridSearchCV(pipeline, param_grid2, cv=5, scoring='accuracy')
piped_grid2.fit(feature_data, label_data)
# pass the gridsearch into a cvs and print accuracy
outer_cv_grid2 = cross_val_score(piped_grid2, feature_data, label_data, cv=5)
print("Accuracy:", outer_cv_grid2.mean()*100)

Accuracy: 74.54357236965933


### 3. Neural Networks (NN)

Q6. Train a multi-layer perceptron with a single hidden layer using `sklearn.neural_network.MLPClassifier`. 
* Scaling is critical to neural networks. Create a pipeline that includes scaling and an `MLPClassifier`.
* Use `GridSearchCV` with 5 folds to find the best hidden layer size and the best activation function. 
* Try values of `hidden_layer_sizes` ranging from `(10,)` to `(60,)` with gaps of 10.
* Try activation functions `logistic`, `tanh`, `relu`.

Wrap your `GridSearchCV` in a 5-fold `cross_val_score` and report the accuracy of your neural net.

Be patient, as this can take a few minutes to run. 
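One detail that is easy to get wrong is the shape of `hidden_layer_sizes`: each candidate is a tuple with one entry per hidden layer. The sketch below builds the grid with a comprehension; the step name `'mlpclassifier'` is an assumption and must match whatever name you give the classifier in your pipeline.

```python
# Each entry is a one-element tuple: one hidden layer with that many units.
# The trailing comma matters: (10,) is a tuple, while (10) is just the integer 10.
hidden_layer_options = [(n,) for n in range(10, 61, 10)]

nn_param_grid = {
    'mlpclassifier__hidden_layer_sizes': hidden_layer_options,
    'mlpclassifier__activation': ['logistic', 'tanh', 'relu'],
}

print(hidden_layer_options)
print(len(hidden_layer_options) * len(nn_param_grid['mlpclassifier__activation']))
```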

In [147]:
# make a pipeline with scaling and MLPClassifier
mlp = MLPClassifier()
nn_pipeline = Pipeline(steps=[('scaler', scaler), ('mlpclassifier', mlp)])
# use gridsearch to find best Hidden Layer Size and Activation Function
nn_grid = {'mlpclassifier__hidden_layer_sizes': [(10,), (20,), (30,), (40,),(50,),(60,)], 'mlpclassifier__activation': ["logistic", "relu", "tanh"]}
nn_piped_grid = GridSearchCV(nn_pipeline, nn_grid, cv=5, scoring='accuracy')
nn_piped_grid.fit(feature_data, label_data)
# wrap gridsearch into a 5-fold cvs
nn_cvs = cross_val_score(nn_piped_grid, feature_data, label_data, cv=5)
print("Accuracy:", nn_cvs.mean()*100)

Accuracy: 73.32881611142481


### 4. Ensemble Classifiers

Ensemble classifiers combine the predictions of multiple base estimators to improve the accuracy of the predictions. One of the key assumptions that ensemble classifiers make is that the base estimators are built independently (so they are diverse).
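A small calculation shows why independence matters. Under the idealized assumption that the base classifiers make errors independently and the ensemble takes a majority vote, the ensemble is wrong only when more than half of the voters are wrong, which is a binomial tail probability. The numbers below (25 classifiers, each with a 35% error rate) are illustrative and are not taken from this dataset.

```python
from math import comb

def majority_vote_error(n_estimators: int, base_error: float) -> float:
    """Probability that more than half of n independent voters are wrong."""
    first_majority = n_estimators // 2 + 1
    return sum(
        comb(n_estimators, k) * base_error**k * (1 - base_error)**(n_estimators - k)
        for k in range(first_majority, n_estimators + 1)
    )

# Illustrative numbers: 25 independent base classifiers, each wrong 35% of the time
ensemble_error = majority_vote_error(25, 0.35)
print(round(ensemble_error, 3))
```

The ensemble error comes out far below the 35% error of a single voter. Real base estimators are correlated, so the gain in practice is smaller than this idealized figure.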

**A. Random Forests**

Q7. Use `sklearn.ensemble.RandomForestClassifier` to classify the data. Use a `GridSearchCV` to tune the hyperparameters to get the best results. 
* Try `max_depth` ranging from 35-55 (you can use the range function to help you with this)
* Try `min_samples_leaf` of 8, 10, 12
* Try `max_features` of `"sqrt"` and `"log2"`

Wrap your GridSearchCV in a cross_val_score with 5-fold CV to report the accuracy of the model.

Be patient, this can take a few minutes to run.
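The overall structure can be sketched on a small synthetic problem first. The example below uses `make_classification` data, a coarse `max_depth` grid, and a small forest (`n_estimators=25`) so that it finishes in seconds; these choices are assumptions for the demonstration only. No scaling pipeline is used because tree-based models do not depend on feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Small synthetic problem so the sketch runs quickly
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

rf_param_grid = {
    'max_depth': list(range(35, 56, 10)),  # 35, 45, 55: a coarse version of the 35-55 range
    'min_samples_leaf': [8, 10, 12],
    'max_features': ['sqrt', 'log2'],
}

rf = RandomForestClassifier(n_estimators=25, random_state=0)
rf_search = GridSearchCV(rf, rf_param_grid, cv=3)

# The outer 5-fold loop estimates the accuracy of the tuned forest on unseen data
rf_scores = cross_val_score(rf_search, X_demo, y_demo, cv=5)
print(rf_scores.mean())
```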

In [10]:
# startup random forest
forest = RandomForestClassifier()
# setup gridsearchcv
# NOTE: Q7 asks for max_depth from 35 to 55, but this run searched depths 1 to 10
forest_params = {'max_depth': [*range(1,11)], 'min_samples_leaf': [8,10,12], 'max_features':['sqrt','log2']}
forest_grid = GridSearchCV(forest, forest_params)
forest_grid.fit(feature_data, label_data)
# wrap gridsearch into a 5-fold cvs
forest_cvs = cross_val_score(forest_grid, feature_data, label_data, cv=5)
print("Accuracy:", forest_cvs.mean()*100)

Accuracy: 66.72576698663654


**B. AdaBoost**

Random Forests are a kind of ensemble classifier where many estimators are built independently in parallel. In contrast, there is another method of creating an ensemble classifier called *boosting*. Here the classifiers are trained one-by-one in sequence and each time the sampling of the training set depends on the performance of previously generated models.
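The dependence on earlier models can be illustrated with one round of the classic discrete AdaBoost weight update. This is a hand-worked sketch with made-up outcomes for ten training examples; scikit-learn's `AdaBoostClassifier` may differ in its exact formulas, so treat this as the textbook idea rather than the library's implementation.

```python
import math

# Ten training examples with equal starting weights
weights = [0.1] * 10
# Hypothetical outcome of the first weak learner: True = misclassified
misclassified = [True, True, True] + [False] * 7

# Weighted error of the weak learner and its vote strength (alpha)
error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
alpha = 0.5 * math.log((1 - error) / error)

# Raise the weights of misclassified examples, lower the rest, then renormalize
updated = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, misclassified)]
total = sum(updated)
updated = [w / total for w in updated]

print(round(alpha, 4))
print([round(w, 4) for w in updated])
```

After the update, the three misclassified examples together carry half of the total weight, so the next weak learner is pushed to concentrate on them.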

Q8. Evaluate a `sklearn.ensemble.AdaBoostClassifier` classifier on the data. By default, `AdaBoostClassifier` uses decision trees as the base classifiers (but this can be changed). 
* Use a GridSearchCV to find the best number of trees in the ensemble (`n_estimators`). Try values from 50-250 with increments of 25. (you can use the range function to help you with this.)
* Wrap your GridSearchCV in a cross_val_score with 5-fold CV to report the accuracy of the model.

Be patient, this can take a few minutes to run.

Use 150 base decision tree classifiers to make an `AdaBoostClassifier` and evaluate its accuracy with a 5-fold-CV.

In [11]:
# setup AdaBoost 
boost = AdaBoostClassifier()
# do gridsearch
boost_params = {'n_estimators': [*range(50,251,25)]}
boost_grid = GridSearchCV(boost, boost_params)
boost_grid.fit(feature_data, label_data)
# do cvs with gridsearch and print accuracy
boost_cvs = cross_val_score(boost_grid, feature_data, label_data, cv=5)
print("Accuracy:", boost_cvs.mean()*100)

Accuracy: 70.9828722002635


### 5. Deploying a final model

Over the course of three programming assignments, you have tested all kinds of classifiers on this data. Some have performed better than others. 

We could continue trying to improve the accuracy of our models by tweaking their parameters more and/or we could do some feature engineering on our dataset.

Q9. Let's say we got to the point where we had a model with high accuracy and we want to deploy that model and use it for real-world predictions.

* Let's say we're going to deploy our neural net classifier.
* We need to make one final version of this model, where we use ALL of our available data for training (we do not hold out a test set this time, so no outer cross-validation loop). 
* We need to tune the parameters of the model on the FULL dataset, so copy the code you entered for Q6, but remove the outer cross validation loop (remove `cross_val_score`). Just run the `GridSearchCV` by calling `fit` on it and passing in the full dataset. This results in the final trained model with the best parameters for the full dataset. You can print out `best_params_` to see what they are.
* The accuracy of this model is what you assessed and reported in Q6.


* Use the `pickle` package to save your model. We have provided the lines of code for you, just make sure your final model gets passed in to `pickle.dump()`. This will save your model to a file called `finalized_model.sav` in your current working directory.
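The save-and-reload round trip can be checked on a toy model before doing it with the real one. The sketch below is an illustration under stated assumptions: it uses synthetic data and an `SVC` in place of the neural net to stay fast and deterministic, and it pickles to an in-memory bytes object instead of a file.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the full dataset
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)

pipe = Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])
search = GridSearchCV(pipe, {'svc__C': [1, 10]}, cv=3)
# With the default refit=True, the best candidate is retrained on all of X_demo, y_demo
search.fit(X_demo, y_demo)

# Serialize to bytes and restore; writing to a file works the same way
# with pickle.dump / pickle.load inside a `with open(...)` block
restored = pickle.loads(pickle.dumps(search))

same_predictions = bool((restored.predict(X_demo) == search.predict(X_demo)).all())
print(search.best_params_, same_predictions)
```

Because the scaler is part of the pickled pipeline, the restored model expects raw, unscaled feature values and applies the training-set scaling itself.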

In [12]:
import pickle

# final model: same pipeline and grid search as Q6, without the outer cross_val_score
mlp = MLPClassifier()
nn_pipeline = Pipeline(steps=[('scaler', scaler), ('mlpclassifier', mlp)])
# use gridsearch to find best Hidden Layer Size and Activation Function
nn_grid = {'mlpclassifier__hidden_layer_sizes': [(10,), (20,), (30,), (40,),(50,),(60,)], 'mlpclassifier__activation': ["logistic", "relu", "tanh"]}
nn_piped_grid = GridSearchCV(nn_pipeline, nn_grid, cv=5, scoring='accuracy')
final_model = nn_piped_grid.fit(feature_data, label_data)

# print the best parameters and their mean inner-CV accuracy
print(final_model.best_params_)
print(final_model.best_score_)

filename = 'finalized_model.sav'
# the with-block closes the file once the model has been written
with open(filename, 'wb') as f:
    pickle.dump(final_model, f)

{'mlpclassifier__activation': 'relu', 'mlpclassifier__hidden_layer_sizes': (50,)}
0.735881841876629


Q10. Now if someone wants to use your trained, saved classifier to classify a new record, they can load the saved model and just call predict on it. 
* Given this new record, classify it with your saved model and print out either "Negative for disease" or "Positive for disease."

In [14]:
# some time later...

# use this as the new record to classify
record = [ 0.05905386, 0.2982129, 0.68613149, 0.75078865, 0.87119216, 0.88615694,
  0.93600623, 0.98369184, -0.47426472, -0.57642756, -0.53115361, -0.42789774,
 -0.21907738, -0.20090532, -0.21496782, -0.2080998, 0.06692373, -2.81681183,
 -0.7117194 ]

 
# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)


# run the model; predict returns an array with one label per input record
predic = loaded_model.predict([record])

if predic[0] == 1:
    print('Positive for disease')
else:
    print('Negative for disease')

Positive for disease
