# Assignment 3: SVMs, Neural Nets, Ensembles

In this assignment you'll implement SVMs, Neural Nets, and Ensembling methods to classify patients as either having or not having diabetic retinopathy. For this task we'll use the same Diabetic Retinopathy dataset that was used in the previous assignments. You can find additional details about the dataset [here](http://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set). You'll explore how to train SVMs, NNs, and Ensembles using the `scikit-learn` library. The scikit-learn documentation can be found [here](http://scikit-learn.org/stable/documentation.html).

In [1]:
# Tyree Pearson
# Naren Makkapati

#You may add additional imports
import warnings
warnings.simplefilter("ignore")
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

In [2]:
%matplotlib inline

In [3]:
# Read the data from csv file
# Column layout: 2 quality flags, 6 MA counts, 8 exudate counts,
# 3 geometry/AM-FM features, then the class label
col_names = (['quality', 'prescreen']
             + ['ma' + str(i) for i in range(2, 8)]
             + ['exudate' + str(i) for i in range(8, 16)]
             + ['euDist', 'diameter', 'amfm_class', 'label'])

data = pd.read_csv("messidor_features.txt", names = col_names)
print(data.shape)
data.head(10)

(1151, 20)


Unnamed: 0,quality,prescreen,ma2,ma3,ma4,ma5,ma6,ma7,exudate8,exudate9,exudate10,exudate11,exudate12,exudate13,exudate14,exudate15,euDist,diameter,amfm_class,label
0,1,1,22,22,22,19,18,14,49.895756,17.775994,5.27092,0.771761,0.018632,0.006864,0.003923,0.003923,0.486903,0.100025,1,0
1,1,1,24,24,22,18,16,13,57.709936,23.799994,3.325423,0.234185,0.003903,0.003903,0.003903,0.003903,0.520908,0.144414,0,0
2,1,1,62,60,59,54,47,33,55.831441,27.993933,12.687485,4.852282,1.393889,0.373252,0.041817,0.007744,0.530904,0.128548,0,1
3,1,1,55,53,53,50,43,31,40.467228,18.445954,9.118901,3.079428,0.840261,0.272434,0.007653,0.001531,0.483284,0.11479,0,0
4,1,1,44,44,44,41,39,27,18.026254,8.570709,0.410381,0.0,0.0,0.0,0.0,0.0,0.475935,0.123572,0,1
5,1,1,44,43,41,41,37,29,28.3564,6.935636,2.305771,0.323724,0.0,0.0,0.0,0.0,0.502831,0.126741,0,1
6,1,0,29,29,29,27,25,16,15.448398,9.113819,1.633493,0.0,0.0,0.0,0.0,0.0,0.541743,0.139575,0,1
7,1,1,6,6,6,6,2,1,20.679649,9.497786,1.22366,0.150382,0.0,0.0,0.0,0.0,0.576318,0.071071,1,0
8,1,1,22,21,18,15,13,10,66.691933,23.545543,6.151117,0.496372,0.0,0.0,0.0,0.0,0.500073,0.116793,0,1
9,1,1,79,75,73,71,64,47,22.141784,10.054384,0.874633,0.09978,0.023386,0.0,0.0,0.0,0.560959,0.109134,0,1


### 1. Data prep

Q1. Separate the feature columns from the class label column. You should end up with two separate data frames - one that contains all of the feature values and one that contains the class labels.

For some classification algorithms, like SVMs and Neural Nets, scaling of the data is critical for the algorithm to operate correctly. For other classification algorithms, data scaling is not necessary (like Naive Bayes and Decision Trees). But using scaled data with an algorithm that doesn't explicitly need it to be scaled does not hurt the results of that algorithm. So we will go ahead and scale the data and *use the scaled data going forward*. 

Use `sklearn.preprocessing.StandardScaler` to standardize the dataset’s features (mean = 0 and variance = 1). Only standardize the features, not the class labels! Note that `StandardScaler` returns a numpy array.
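
As a minimal sketch of what standardization does, the snippet below scales a small made-up matrix (the `toy_*` names and values are illustrative assumptions, not part of the assignment data) and checks that each column ends up with mean 0 and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix: two columns on very different scales
toy_features = np.array([[1.0, 100.0],
                         [2.0, 300.0],
                         [3.0, 500.0],
                         [4.0, 700.0]])

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy_features)  # returns a numpy array

print(type(toy_scaled))
print(toy_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns are on the same footing, which is what distance-based and gradient-based learners such as SVMs and neural nets rely on.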

In [None]:
# your code goes here
labels = data.loc[:,"label"]
features = data.drop("label",axis=1)

print(labels.head(5))
print("\n")
print(features.head(5))


In [7]:
from sklearn.preprocessing import StandardScaler
import pprint
scaler = StandardScaler()
std_features = scaler.fit_transform(features)


### 2. Support Vector Machines (SVM)

Q2. Create an `sklearn.svm.SVC` (Support Vector Classifier) to classify the data. Use `sklearn.model_selection.GridSearchCV` to find the best kernel for this dataset. Try the kernels: `linear`, `rbf` (radial basis kernel), `poly` (polynomial), and `sigmoid`. Use a 5-fold cross validation and print out the best kernel (`best_params_`) and best accuracy achieved with this kernel (`best_score_`).
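
Besides `best_params_` and `best_score_`, a fitted `GridSearchCV` also keeps the score of every candidate in `cv_results_`. The sketch below shows one way to compare all four kernels side by side. It runs on synthetic data from `make_classification` (an assumption made so the example is self-contained), and the `*_demo` and `kernel_*` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the retinopathy dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

kernel_grid = {'kernel': ['linear', 'rbf', 'poly', 'sigmoid']}
kernel_search = GridSearchCV(SVC(gamma='auto'), kernel_grid, cv=5)
kernel_search.fit(X_demo, y_demo)

# cv_results_ has one row per candidate; mean_test_score is the
# mean accuracy over the 5 folds for that kernel
kernel_table = pd.DataFrame(kernel_search.cv_results_)[['param_kernel', 'mean_test_score']]
print(kernel_table)
print(kernel_search.best_params_, kernel_search.best_score_)
```

Looking at the full table shows how close the runner-up kernels are, which `best_params_` alone does not reveal.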

In [8]:
# your code goes here
from sklearn.model_selection import GridSearchCV 
from sklearn.svm import SVC 
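# GridSearchCV clones and fits the estimator itself, so no separate
# fit (here on the unscaled features) is needed before the search
svc = SVC(gamma = "auto")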
params = {'kernel':('linear','rbf','poly','sigmoid')}
grid = GridSearchCV(svc,params,cv=5)
print(grid.fit(std_features,labels))
print(grid.best_params_)
print(grid.best_score_)

GridSearchCV(cv=5, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'kernel': ('linear', 'rbf', 'poly', 'sigmoid')},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
{'kernel': 'linear'}
0.7228496959165943


Q3. Create a new `sklearn.svm.SVC` using the best kernel that was found in Q2. Use `sklearn.model_selection.GridSearchCV` to find the best value of C for this SVM. Try values from 1 to 250 in increments of 10 (you can use the `range` function to do this). Use a 5-fold cross validation and print out the best value of C (`best_params_`) and best accuracy achieved with this value of C (`best_score_`).

Be patient, as this can take a few minutes to run.

In [9]:
# your code goes here
n = list(range(1,250,10))
print(n)
svc2 = SVC(kernel = "linear")
param = {'C':n}
grid2 = GridSearchCV(svc2,param,cv=5)
grid2.fit(std_features,labels)
print(grid2.best_params_)
print(grid2.best_score_)


[1, 11, 21, 31, 41, 51, 61, 71, 81, 91, 101, 111, 121, 131, 141, 151, 161, 171, 181, 191, 201, 211, 221, 231, 241]
{'C': 221}
0.7463075586446568


### 3. Neural Networks (NN)

Q4. Train a multi-layer perceptron with a single hidden layer using `sklearn.neural_network.MLPClassifier`. 
* Use `GridSearchCV` with 5 fold cross validation to find the best hidden layer size and the best activation function. 
* Try values of `hidden_layer_sizes` ranging from `(10,)` to `(60,)` with gaps of 10.
* Try activation functions `logistic`, `tanh`, `relu`.

Wrap your `GridSearchCV` in a 5-fold `cross_val_score` and report the accuracy of your neural net.

Be patient, as this can take a few minutes to run. 
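
Wrapping a `GridSearchCV` in `cross_val_score` gives nested cross validation: the inner search tunes hyperparameters on each outer training split, and the outer folds score the tuned model on data the search never saw. A minimal sketch follows, assuming synthetic data and a deliberately reduced grid so that it runs quickly; the `*_demo`, `inner_search`, and `outer_scores` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the retinopathy dataset)
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# Reduced grid for speed; the assignment asks for a larger one
demo_grid = {'hidden_layer_sizes': [(10,), (20,)],
             'activation': ['tanh', 'relu']}

# Inner loop: 5-fold grid search picks the hyperparameters
inner_search = GridSearchCV(MLPClassifier(max_iter=200, random_state=0),
                            demo_grid, cv=5)

# Outer loop: 5-fold CV scores the whole "search, then refit" procedure
outer_scores = cross_val_score(inner_search, X_demo, y_demo, cv=5)
print(outer_scores)
print("Nested CV accuracy:", outer_scores.mean())
```

The mean of the outer scores is the accuracy estimate to report, since `best_score_` from the inner search alone tends to be optimistic.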

In [10]:
# your code goes here
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

mlp = MLPClassifier()
params2 = {'hidden_layer_sizes':(10,20,30,40,50,60),'activation':('logistic','tanh','relu')} 
grid3 = GridSearchCV(mlp,params2,cv=5)
grid3.fit(std_features,labels)
print(grid3.best_params_)
# outer 5-fold scores of the tuned model; nn_scores.mean() is the nested-CV accuracy
nn_scores = cross_val_score(grid3,std_features,labels,cv=5)
labels_predict = cross_val_predict(grid3,std_features,labels,cv=5)
print(classification_report(labels,labels_predict))

{'activation': 'tanh', 'hidden_layer_sizes': 40}
             precision    recall  f1-score   support

          0       0.68      0.77      0.72       540
          1       0.77      0.68      0.72       611

avg / total       0.73      0.72      0.72      1151



### 4. Ensemble Classifiers

Ensemble classifiers combine the predictions of multiple base estimators to improve the accuracy of the predictions. One of the key assumptions that ensemble classifiers make is that the base estimators are built independently (so they are diverse).
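
To see why independence matters, consider an idealized case (an assumption for illustration, not a property of any real ensemble): `n` base estimators that are each correct with probability `p` and err independently. The majority vote is then correct with probability given by a binomial tail, which the hypothetical helper below computes.

```python
from math import comb

def majority_vote_accuracy(n_estimators, p_correct):
    """Probability that a strict majority of independent voters is correct."""
    majority = n_estimators // 2 + 1
    return sum(
        comb(n_estimators, k)
        * p_correct ** k
        * (1 - p_correct) ** (n_estimators - k)
        for k in range(majority, n_estimators + 1)
    )

# Odd ensemble sizes avoid ties; each voter is only 60% accurate
for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

With three voters the accuracy is already 0.648, and it keeps rising as more independent voters are added. Real base estimators are correlated, so the gain in practice is smaller.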

**A. Random Forests**

Q5. Use `sklearn.ensemble.RandomForestClassifier` to classify the data. Use a `GridSearchCV` to tune the hyperparameters to get the best results. 
* Try `max_depth` ranging from 35-55
* Try `min_samples_leaf` of 8, 10, 12
* Try `max_features` of `"sqrt"` and `"log2"`

Wrap your GridSearchCV in a cross_val_score with 5-fold CV to report the accuracy of the model.

Be patient, this can take a few minutes to run.

In [11]:
# your code goes here
from sklearn.ensemble import RandomForestClassifier
random = RandomForestClassifier()
num = list(range(35,56))  # 35 through 55 inclusive (range excludes its end value)
params3 = {"max_depth":num,"min_samples_leaf":[8,10,12],"max_features":("sqrt","log2")}
grid4 = GridSearchCV(random, params3, cv=5)
grid4.fit(std_features,labels)
print(grid4.best_params_)
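# outer 5-fold scores of the tuned forest; rf_scores.mean() is the nested-CV accuracy
rf_scores = cross_val_score(grid4,std_features,labels,cv=5)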
labels_predict2 = cross_val_predict(grid4,std_features,labels,cv=5)
print(classification_report(labels,labels_predict2))

{'max_depth': 37, 'max_features': 'log2', 'min_samples_leaf': 12}
             precision    recall  f1-score   support

          0       0.63      0.65      0.64       540
          1       0.68      0.67      0.67       611

avg / total       0.66      0.66      0.66      1151



In [120]:
#random2 = RandomForestClassifier()
#params4 = {"max_depth":43,"min_samples_leaf":12,"max_features":"log2"}
#grid5 = GridSearchCV(random2,params4,cv=5)
#grid5.fit(std_features,labels)
#print(best_score)
#labels_predict3 = cross_val_predict(grid5,std_features,labels,cv=5)
#print(classification_report(labels,labels_predict3))


**B. AdaBoost**

Random Forests are a kind of averaging ensemble classifier, where the driving principle is to build several estimators independently and then average their predictions (by taking a vote). In contrast, *boosting* methods form another class of ensemble classifiers. Here the classifiers are trained one at a time, and each time the sampling of the training set depends on the performance of the previously generated models, so later classifiers concentrate on the records that earlier ones got wrong.

Q6. Evaluate a `sklearn.ensemble.AdaBoostClassifier` classifier on the data. By default, `AdaBoostClassifier` uses decision trees as the base classifiers (but this can be changed). Use 150 base classifiers to make an `AdaBoostClassifier` and evaluate its accuracy with a 5-fold CV.
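
One way to read Q6 is a fixed ensemble of 150 base classifiers scored directly with `cross_val_score`, with no grid search involved. The sketch below illustrates that reading on synthetic data (an assumption made so the example is self-contained); the `*_demo`, `boosted`, and `boost_scores` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the retinopathy dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# n_estimators sets the number of base classifiers (decision trees by default)
boosted = AdaBoostClassifier(n_estimators=150, random_state=0)

# 5-fold CV: one accuracy per fold, averaged for the final figure
boost_scores = cross_val_score(boosted, X_demo, y_demo, cv=5)
print(boost_scores)
print("Mean accuracy:", boost_scores.mean())
```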

In [13]:
# your code goes here
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(algorithm = 'SAMME')
ada.fit(std_features, labels)
n = list(range(1,150))
param2 = {"n_estimators":n}
grid5 = GridSearchCV(ada,param2,cv=5)
grid5.fit(std_features,labels)
print(grid5.best_params_)
print(grid5.best_score_)
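# outer 5-fold scores of the tuned ensemble; ada_scores.mean() is the nested-CV accuracy
ada_scores = cross_val_score(grid5,std_features,labels,cv=5)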
labels_predict3 = cross_val_predict(grid5,std_features,labels, cv=5)
print(classification_report(labels,labels_predict3))

{'n_estimators': 116}
0.6672458731537794
             precision    recall  f1-score   support

          0       0.63      0.67      0.65       540
          1       0.69      0.64      0.67       611

avg / total       0.66      0.66      0.66      1151



### 5. Deploying a final model

Over the course of three programming assignments, you have tested all kinds of classifiers on this data. Some have performed better than others. 

We probably wouldn't want to deploy any of these models in the real world to actually diagnose patients because the accuracies are not very high. We could try to improve the accuracy of our models by tweaking their parameters more (testing more hyperparameters using GridSearchCV) and/or we could do some feature engineering on our dataset.

Q7. Let's say we *did* get to the point where we had a model with very high accuracy and we want to deploy that model and use it for real-world predictions.

* Let's say we're going to deploy our neural net classifier.
* We need to make one final version of this model, where we use ALL of our available data for training (we do not hold out a test set this time, so no outer cross-validation loop). 
* We need to tune the parameters of the model on the FULL dataset, so copy the code you entered for Q4, but remove the outer cross validation loop (remove `cross_val_score`). Just run the `GridSearchCV` by calling `fit` on it and passing in the full dataset. This results in the final trained model with the best parameters for the full dataset. You can print out `best_params_` to see what they are.
* The accuracy of this model is what you assessed and reported in Q4.


* Use the `pickle` package to save your model. We have provided the lines of code for you, just make sure your final model gets passed in to `pickle.dump()`. This will save your model to a file called finalized_model.sav in your current working directory. 
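
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: synthetic data, a reduced grid, and a temporary file path (`demo_path`) rather than `finalized_model.sav`. The `*_demo`, `search`, `tuned_model`, and `restored_model` names are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (an assumption; not the retinopathy dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

# Tune on the full dataset; with refit=True (the default) the search
# retrains the best configuration on all of the data
search = GridSearchCV(MLPClassifier(max_iter=200, random_state=0),
                      {'hidden_layer_sizes': [(10,), (20,)]}, cv=5)
search.fit(X_demo, y_demo)
tuned_model = search.best_estimator_
print(search.best_params_)

# Save the tuned model, then load it back as a later user would
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_model.sav')
with open(demo_path, 'wb') as f:
    pickle.dump(tuned_model, f)
with open(demo_path, 'rb') as f:
    restored_model = pickle.load(f)

# The restored model should reproduce the original predictions
print((restored_model.predict(X_demo) == tuned_model.predict(X_demo)).all())
```

Saving `best_estimator_` (or the fitted search object itself) keeps the tuned hyperparameters. Saving the untuned estimator that was passed into the search would not.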

In [20]:
import pickle

# your code goes here
ada2 = AdaBoostClassifier(algorithm = 'SAMME')
ada2.fit(std_features, labels)
n = list(range(1,150))
param3 = {"n_estimators":n}
grid6 = GridSearchCV(ada2,param3,cv=5)
grid6.fit(std_features,labels)
#replace this final_model with your final model
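# GridSearchCV refits the best configuration on the full dataset (refit=True),
# so save that tuned estimator rather than the untuned ada2
final_model = grid6.best_estimator_

filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    x = pickle.dump(final_model, f)  # pickle.dump returns None
print(x)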

None


Q8. Now if someone wants to use your trained, saved classifier to classify a new record, they can load the saved model and just call predict on it. 
* Given this new record, classify it with your saved model and print out either "Negative for disease" or "Positive for disease."

In [25]:
# some time later...

# use this as the new record to classify
record = [[ 0.05905386, 0.2982129, 0.68613149, 0.75078865, 0.87119216, 0.88615694,
  0.93600623, 0.98369184, -0.47426472, -0.57642756, -0.53115361, -0.42789774,
 -0.21907738, -0.20090532, -0.21496782, -0.2080998, 0.06692373, -2.81681183,
 -0.7117194 ]]

 
# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
a = loaded_model.predict(record)  # array with one predicted label per record

# your code goes here
if a == 1:    
    print("Positive for disease")
else:
    print("Negative for disease")

Positive for disease
