# Classifier

In this notebook we try out different classifiers on several versions of the feature-analysis dataset to find out which combination works best. At the end we compare the accuracy scores; the classifier with the highest score will be used for the widget.

In [None]:
from SimpleCV import *
import numpy as np
import sklearn 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn import linear_model
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn import metrics

#### Cross validation
We use cross validation to lower the chance of overfitting. We use it in two ways.

First, the dataset that the classifiers use is split in two, which is done below. The training set holds 80% of the data and the test set holds the remaining 20%. A classifier trains on the training set, and the test set is then used to check how well it trained.

Second, we use k-fold cross validation. This splits the data randomly into k folds. Each fold serves once as the validation set while the other folds form the training set. This method is used during parameter tuning; GridSearchCV applies it by default.
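As a small illustration of the k-fold idea described above, the sketch below splits ten toy samples into 5 folds with scikit-learn's `KFold`. The array `X_demo` and the variable names are made up for this example and are not part of the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples stand in for the rows of the real dataset (assumption).
X_demo = np.arange(20).reshape(10, 2)

# shuffle=True assigns samples to folds randomly; random_state makes it repeatable.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
for train_idx, val_idx in kfold.split(X_demo):
    # In each round one fold is held out for validation, the rest is used for training.
    fold_sizes.append((len(train_idx), len(val_idx)))

print(fold_sizes)
```

Every sample ends up in a validation fold exactly once, so each round trains on 8 samples and validates on 2.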

In [None]:
df = pd.read_csv("../dataset-numpy/dataset_analysis_normalized_v3.csv")
df.head()

# split df in data (X) and labels (y)
X, y = df.iloc[:,:-1], df.iloc[:,-1]

# create train and test data and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.20, random_state=67)
df.describe()

# PCA
PCA stands for principal component analysis. PCA reduces the number of dimensions in the data by summarizing it in a smaller set of components. In our case we pass 0.95, which keeps as many components as are needed to explain 95% of the variance. When testing our classifiers with the PCA dataset, some classifiers decreased in score and some increased. We chose to use the PCA dataset only for the classifiers whose score increased.

In [None]:
pca = PCA(0.95)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
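To make the meaning of `PCA(0.95)` more concrete, here is a minimal sketch on synthetic data. The data (`latent`, `mixing`, `X_demo`) is an assumption made up for this example: ten features that are built from only three underlying directions plus a little noise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Synthetic stand-in (assumption): 200 samples, 10 features, 3 underlying directions.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_demo = np.dot(latent, mixing) + 0.01 * rng.normal(size=(200, 10))

# A float between 0 and 1 asks PCA to keep enough components
# to explain at least that fraction of the variance.
pca_demo = PCA(0.95)
X_reduced = pca_demo.fit_transform(X_demo)

kept = pca_demo.n_components_
explained = pca_demo.explained_variance_ratio_.sum()
print(kept)
print(explained)
```

The number of components is chosen by the explained variance, not fixed in advance, which is why the reduced dataset can have far fewer columns than the original.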

# Naive Bayes
We chose two naive Bayes models: Gaussian, which can be used if the features follow a normal distribution, and Bernoulli, which works better if the features consist of zeros and ones. The scores of these classifiers were rather low and did not reach the level of the other classifiers.

#### Gaussian : 0.94270833333333337
Gaussian worked better with the PCA dataset. For this classifier most of the confusion was around the number 9, which was mistaken for many other numbers such as 1, 2, 7 and 8. The only numbers that were not confused are 0 and 6.

In [None]:
#Naive Bayes Gaussian
gnb = GaussianNB()
gnb.fit(X_train_pca, y_train)
#print "Gaussian :",gnb.score(X_test_pca, y_test)
prediction=gnb.predict(X_test_pca)
accuracy_score_gnb = metrics.accuracy_score(prediction,y_test)

#evaluation(Accuracy)
print "Accuracy:",accuracy_score_gnb 
#evaluation(Confusion Matrix)
print metrics.confusion_matrix(prediction,y_test)
print metrics.classification_report(prediction, y_test)
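A note on reading these matrices: scikit-learn documents the signature as `confusion_matrix(y_true, y_pred)`, where rows are the true labels. Cells that pass the predictions as the first argument therefore produce the transposed matrix, with rows for the predicted digit and columns for the true digit. The toy labels below are made up for illustration; the sketch only shows that the two argument orders give transposed matrices while the accuracy stays the same.

```python
from sklearn import metrics

# Toy labels (assumption): one sample of class 0 is predicted as class 1.
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 1]

# Documented order: rows are true labels, columns are predictions.
cm_true_first = metrics.confusion_matrix(y_true_demo, y_pred_demo)

# Swapped order: rows are predictions, columns are true labels.
cm_pred_first = metrics.confusion_matrix(y_pred_demo, y_true_demo)

print(cm_true_first)
print(cm_pred_first)

# Accuracy is symmetric, so the argument order does not change it.
acc_a = metrics.accuracy_score(y_true_demo, y_pred_demo)
acc_b = metrics.accuracy_score(y_pred_demo, y_true_demo)
```

The same swap exchanges the precision and recall columns in `classification_report`, which is worth keeping in mind when comparing reports between cells.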

#### Bernoulli: 0.84895833333333337
We also tried the Bernoulli classifier. It scored rather low, because our dataset does not consist only of zeros and ones but also contains other values. In the confusion matrix every number except 0 has been confused with other numbers.

In [None]:
#Naive Bayes Bernoulli
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
#print "Bernoulli:",bnb.score(X_test, y_test)
prediction=bnb.predict(X_test)
accuracy_score_bnb = metrics.accuracy_score(prediction,y_test)
#evaluation(Accuracy)
print "Accuracy:",accuracy_score_bnb
#evaluation(Confusion Matrix)
print metrics.confusion_matrix(prediction,y_test)
print metrics.classification_report(prediction, y_test)
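One possible reason for the low Bernoulli score is how `BernoulliNB` treats non-binary input: it thresholds every feature at its `binarize` parameter (0.0 by default), so all positive values collapse to 1. The sketch below uses a made-up feature row to show that effect with scikit-learn's `Binarizer`; it is an illustration, not a reproduction of the notebook's data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import Binarizer

# A made-up row of normalized features (assumption).
row = np.array([[0.0, 0.2, 0.9]])

# Values above the threshold become 1, everything else becomes 0.
binarized = Binarizer(threshold=0.0).fit_transform(row)
print(binarized)

# BernoulliNB applies the same kind of thresholding internally.
default_threshold = BernoulliNB().binarize
print(default_threshold)
```

After thresholding, 0.2 and 0.9 are indistinguishable, so much of the information in continuous features is lost before the classifier sees it.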

# Random forest 

In [None]:
clfRF = RandomForestClassifier()
clfRF.fit(X_train, y_train)
prediction=clfRF.predict(X_test)

accuracy_score_rf = metrics.accuracy_score(prediction,y_test)
#evaluation(Accuracy)
print "Accuracy:",accuracy_score_rf 

#### Randomized search
When running random forest with default settings the score was about 0.945, so we decided to tune the hyperparameters. To get the best possible parameters we used randomized search in combination with grid search. Randomized search does not try every combination; it samples a fixed number of combinations (n_iter=100 here) from the given ranges. We ran it with k-fold cross validation, which helps against overfitting. In this case k is 5 (cv=5), meaning the training set is split into 5 folds.

In [None]:
#number of trees 
n_estimators=[int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)] 

#number of features considered when looking for the best split
max_features =['sqrt', 'auto']

#method for sampling data points (with or without replacement)
bootstrap=[True, False]

#min number of data points allowed in a leaf node
min_samples_leaf=[1,2,4]

#min number of data points placed in a node before the node is split
min_samples_split=[2,5,10]

#max number of levels in each decision tree
max_depth=[int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

randomgrid = {'n_estimators':n_estimators,
             'max_features': max_features,
             'bootstrap': bootstrap,
             'min_samples_leaf': min_samples_leaf,
             'min_samples_split': min_samples_split,
             'max_depth': max_depth}

clfRF = RandomForestClassifier()

rf_random = RandomizedSearchCV(estimator = clfRF, param_distributions = randomgrid, n_iter = 100, cv = 5, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_score_

###### The best params that came out of the randomized search:
  bootstrap: False
  <br>
  max_depth: 20
  <br>
  max_features: 'sqrt'
  <br>
  min_samples_leaf: 1
  <br>
  min_samples_split: 2
  <br>
  n_estimators: 1800
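The sketch below shows the sampling behaviour of randomized search on a much smaller scale. It uses scikit-learn's bundled `load_digits` data as a stand-in (an assumption, since the notebook's CSV is not available here) and a reduced, made-up parameter space so that it runs quickly.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Stand-in data (assumption): the first 300 samples of the bundled digits dataset.
X_demo, y_demo = load_digits(return_X_y=True)
X_demo, y_demo = X_demo[:300], y_demo[:300]

# 3 * 3 * 3 = 27 possible combinations in total.
param_space = {
    'n_estimators': [10, 20, 30],
    'max_depth': [5, 10, None],
    'min_samples_leaf': [1, 2, 4],
}

# n_iter=5 means only 5 of the 27 combinations are sampled and evaluated.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_space,
    n_iter=5,
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

n_tried = len(search.cv_results_['params'])
print(n_tried)
print(search.best_params_)
```

Because only a sample of the space is evaluated, the result is a good starting point rather than a guaranteed optimum, which is why a narrower grid search around it is a sensible follow-up.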

#### Grid search
To get even better hyperparameters, we use the result of the randomized search to narrow down the range for the grid search. Grid search also uses k-fold cross validation, in this case set to 3 folds. There is a chance that this score turns out better, but it could also decrease.

In [None]:
# Create the parameter grid based on the results of random search 
parameters = {
    'bootstrap': [False],
    'max_depth': [10,20,30,40,50],
    'max_features': ['sqrt'],
    'min_samples_leaf': [1],
    'min_samples_split': [2],
    'n_estimators': [1400,1500,1600,1800,1900,2000]
}

# Create a base model
clfRF = RandomForestClassifier()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = clfRF, param_grid = parameters, 
                          cv = 3, n_jobs = -1, verbose = 2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)
grid_search.best_score_

###### The best params that came out of the grid search
bootstrap: False
<br>
max_depth: 40
<br>
max_features: 'sqrt'
<br>
min_samples_leaf: 1
<br>
min_samples_split: 2
<br>
n_estimators: 1600

With the default settings of the random forest the accuracy score was about 0.945. Running the random forest with the parameters from the grid search brought the accuracy score to 0.9765625. To be sure we also ran the parameters that came out of the randomized search. This gave an accuracy score of 0.979166666667, which is higher than the grid search result and an increase of about 3 percentage points over the default settings.

When looking at the confusion matrix number 8 catches our attention. Number 8 was mismatched to a 1 and 6. Also number 9 gets confused with 1 and 4. 

#### Random forest with parameters from random and grid search

In [None]:
clfRF = RandomForestClassifier(bootstrap= False,  max_depth= 20, max_features= 'sqrt', min_samples_leaf = 1, min_samples_split = 2, n_estimators = 1800)
clfRF.fit(X_train, y_train)
prediction=clfRF.predict(X_test)

accuracy_score_rf = metrics.accuracy_score(prediction,y_test)
#evaluation(Accuracy)
print "Accuracy :",accuracy_score_rf 
#evaluation(Confusion Matrix)
print metrics.confusion_matrix(prediction,y_test) 
print metrics.classification_report(prediction, y_test)

# Decision tree
With default settings the decision tree scored only 0.86979166666666663. Tuning parameters will not bring this classifier up to the accuracy scores of the other classifiers. In the confusion matrix 6 is the only number that does not get confused.

In [None]:
clfDT = DecisionTreeClassifier()
clfDT.fit(X_train, y_train) 
prediction=clfDT.predict(X_test)

accuracy_score_dt = metrics.accuracy_score(prediction,y_test)
#evaluation(Accuracy)
print "Accuracy:",accuracy_score_dt
#evaluation(Confusion Matrix)
print metrics.confusion_matrix(prediction,y_test)
print metrics.classification_report(prediction, y_test)

# KNNeighbors

In [None]:
clfKNN = KNeighborsClassifier()
# Train the classifier with the training data and training labels
clfKNN.fit(X_train, y_train)
# Score the classifier
# (calculates the mean accuracy of the given test data)
prediction=clfKNN.predict(X_test)
print "Accuracy: ", metrics.accuracy_score(prediction,y_test)


When running KNN we first let it run with default settings. The accuracy score was already 0.97135416666666663, so we decided to try to raise it by tuning the hyperparameters. We first ran the classifier in a loop that tries 1 to 9 neighbors.

In [None]:
for i in range (1,10):
    # Create a classifier with k 
    clfKNN = KNeighborsClassifier(n_neighbors=i)
    # Train the classifier with the training data and training labels
    clfKNN.fit(X_train, y_train)
    # Score the classifier
    # (calculates the mean accuracy of the given test data) 
    prediction=clfKNN.predict(X_test)
    print "Accuracy: ", i ,metrics.accuracy_score(prediction,y_test)

##### KNN with neighbor iteration
Accuracy: 0.96354166666666663 Neighbors: 1
<br>
Accuracy: 0.96354166666666663 Neighbors: 2
<br>
Accuracy: 0.97135416666666663 Neighbors: 3
<br>
Accuracy: 0.96875 Neighbors: 4
<br>
Accuracy: 0.97135416666666663 Neighbors: 5
<br>
Accuracy: 0.96614583333333337 Neighbors: 6
<br>
Accuracy: 0.96354166666666663 Neighbors: 7
<br>
Accuracy: 0.96614583333333337 Neighbors: 8
<br>
Accuracy: 0.96614583333333337 Neighbors: 9
<br>
<br>
The highest accuracy score from the loop is the same as with default settings: 0.97135416666666663. As the loop only varied the number of neighbors, we decided to use grid search so that we could tune more than one parameter.
<br>
#### Grid search

In [None]:
#making the instance
clfKNN = KNeighborsClassifier(n_jobs=-1)
#Hyper Parameters Set
params = {'n_neighbors':[1,2,3,4,5,6,7,8,9,10],
          'leaf_size':[1,2,3,5],
          'weights':['uniform', 'distance'],
          'algorithm':['auto', 'ball_tree','kd_tree','brute'],
          'n_jobs':[-1]}
#Making models with hyper parameters sets
clfKNN1 = GridSearchCV(clfKNN, param_grid=params, n_jobs=-1, verbose=1)
#Learning
clfKNN1.fit(X_train,y_train)

##### Best params for KNN with grid search
algorithm: 'auto'
<br>
leaf_size: 1
<br>
n_jobs: -1
<br>
n_neighbors: 6
<br>
weights: 'distance'
<br>
<br>
Running the classifier with the params from the grid search improved the accuracy score to 0.97395833333333337, an improvement of about 0.26 percentage points. When looking at the confusion matrix numbers 4 and 9 stand out. Number 4 gets confused with 1 and 8. Number 9 gets confused with 1, 4 and 7.

In [172]:
clfKNN = KNeighborsClassifier(n_neighbors = 6, n_jobs = -1, weights= 'distance', leaf_size= 1, algorithm= 'auto')
clfKNN.fit(X_train, y_train)
prediction=clfKNN.predict(X_test)

accuracy_score_knn = metrics.accuracy_score(prediction,y_test)
#evaluation(Accuracy)
print "Accuracy:",accuracy_score_knn 
#evaluation(Confusion Matrix)
print metrics.confusion_matrix(prediction,y_test)
print metrics.classification_report(prediction, y_test)

Accuracy: 0.973958333333
[[42  0  0  0  0  0  0  0  0  0]
 [ 0 38  1  1  1  0  0  0  1  1]
 [ 0  0 33  0  0  0  0  0  0  0]
 [ 0  0  0 39  0  1  0  0  0  0]
 [ 0  1  0  0 27  0  0  0  0  1]
 [ 0  0  0  0  0 28  0  0  0  0]
 [ 0  0  0  0  0  0 32  0  0  0]
 [ 0  0  0  0  0  0  0 51  0  1]
 [ 0  0  0  0  1  0  0  0 34  0]
 [ 0  0  0  0  0  0  0  0  0 50]]
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        42
          1       0.97      0.88      0.93        43
          2       0.97      1.00      0.99        33
          3       0.97      0.97      0.97        40
          4       0.93      0.93      0.93        29
          5       0.97      1.00      0.98        28
          6       1.00      1.00      1.00        32
          7       1.00      0.98      0.99        52
          8       0.97      0.97      0.97        35
          9       0.94      1.00      0.97        50

avg / total       0.97      0.97      0.97       384



# SGD
The accuracy score when running SGD with default settings is 0.9375. By changing the params manually we came to a score of 0.958333333333. When looking at the confusion matrix numbers 1 and 8 stand out. Number 1 gets confused with 2 and 4. Number 8 gets confused with 1 and 5.

In [173]:
clfSGD = linear_model.SGDClassifier(loss = 'log', n_iter =1000)
clfSGD.fit(X_train, y_train)
prediction=clfSGD.predict(X_test)

accuracy_score_sgd = metrics.accuracy_score(prediction,y_test)
#evaluation(Accuracy)
print "Accuracy:",accuracy_score_sgd 
#evaluation(Confusion Matrix)
print metrics.confusion_matrix(prediction,y_test) 
print metrics.classification_report(prediction, y_test)



Accuracy: 0.958333333333
[[42  0  0  0  0  0  0  0  0  0]
 [ 0 35  1  0  1  0  0  0  3  2]
 [ 0  2 33  0  0  1  0  1  0  0]
 [ 0  0  0 40  0  0  0  0  0  0]
 [ 0  2  0  0 28  0  0  0  0  2]
 [ 0  0  0  0  0 28  0  0  1  0]
 [ 0  0  0  0  0  0 32  0  0  0]
 [ 0  0  0  0  0  0  0 50  0  0]
 [ 0  0  0  0  0  0  0  0 31  0]
 [ 0  0  0  0  0  0  0  0  0 49]]
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        42
          1       0.90      0.83      0.86        42
          2       0.97      0.89      0.93        37
          3       1.00      1.00      1.00        40
          4       0.97      0.88      0.92        32
          5       0.97      0.97      0.97        29
          6       1.00      1.00      1.00        32
          7       0.98      1.00      0.99        50
          8       0.89      1.00      0.94        31
          9       0.92      1.00      0.96        49

avg / total       0.96      0.96      0.96       384



# Support Vector Machine

#### Default SVM with Support Vector Classification

In [None]:
clf = svm.SVC()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print score

#### SVM with Nu-Support Vector Classification

In [None]:
from sklearn.svm import NuSVC
clf = NuSVC()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print score

#### SVM with Linear Support Vector Classification

In [None]:
from sklearn.svm import LinearSVC

clf = LinearSVC()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print score

Of these three types of support vector machines, __SVC__ scores the highest. We will use this one and tweak the parameters.

#### SVM with Gridsearch

Using grid search we can find which parameters are the best to get the highest score. 

In [None]:
parameters = {'kernel' : ('linear', 'rbf'), 'C': np.arange(1, 2, 0.1), 
              'gamma': np.arange(.1, .5, 0.05)}

clf = GridSearchCV(svm.SVC(), parameters, verbose=1, n_jobs=-1, cv = 5)
clf.fit(X_train, y_train)

clf.score(X_test, y_test)

print clf.best_params_
print clf.best_score_

In [None]:
scores = np.zeros((100))
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.20)
    clf = svm.SVC(kernel ="rbf", C=1.2, gamma=0.3)
    clf.fit(X_train, y_train)
    scores[i] = clf.score(X_test, y_test)

print 'mean', scores.mean()

print 'min', scores.min()
print 'max', scores.max()

##### Best params for SVM with grid search
kernel: 'rbf'
<br>
C: 1.2
<br>
gamma: 0.171

With these parameters the classifier scores 0.979 on average over 100 runs.
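The repeated random 80/20 splits used for this average can also be written with `ShuffleSplit` and `cross_val_score`. The following is a sketch under assumptions: it uses the bundled `load_digits` data with a `MinMaxScaler` as a stand-in for the notebook's normalized CSV, and only 20 repetitions to keep it fast, so its numbers are not comparable to the scores reported here.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in data (assumption): first 400 samples of the bundled digits dataset.
X_demo, y_demo = load_digits(return_X_y=True)
X_demo, y_demo = X_demo[:400], y_demo[:400]

# Scaling inside the pipeline is refit on each training split.
model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=1.2, gamma=0.3))

# 20 independent random 80/20 splits, reproducible through random_state.
splitter = ShuffleSplit(n_splits=20, test_size=0.20, random_state=0)
demo_scores = cross_val_score(model, X_demo, y_demo, cv=splitter)

print(demo_scores.mean())
print(demo_scores.min())
print(demo_scores.max())
```

Reporting the minimum and maximum next to the mean gives an impression of how much a single train/test split can vary.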

#### SVM with PCA

In [None]:
scores = np.zeros((100))
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.20)
    pca = PCA(0.95)
    pca.fit(X_train)
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)
    
    clf = svm.SVC(kernel ="rbf", C=1.2, gamma=0.3)
    clf.fit(X_train_pca, y_train)
    scores[i] = clf.score(X_test_pca, y_test)

print 'mean', scores.mean()

print 'min', scores.min()
print 'max', scores.max()

This also scores 0.979 on average over 100 runs with these parameters. That is the same as without PCA, so we decided to leave PCA out.

### Best Classifier SVM

In [None]:
clf = svm.SVC(kernel ="rbf", C=1.2, gamma=0.3)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)

In [None]:
predictions = clf.predict(X_test)
labels = y_test

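# confusion_matrix and classification_report are accessed through the imported metrics module
print metrics.confusion_matrix(labels, predictions)
print metrics.classification_report(labels, predictions)
print 'accuracy: ', metrics.accuracy_score(predictions, labels)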

We tested the SVM on different datasets to see which would score best:

- __dataset_analysis_normalized_v3.csv - the latest normalized version__ <br>
0.979505208333
- dataset_analysis_v3.csv - the latest version not normalized <br>
0.090546875
- dataset_analysis_normalized_with_contours.csv - normalized version without the column 'blob_amount_contours'  <br>
0.979479166667
- dataset_analysis_with_contours.csv - not normalized version without the column 'blob_amount_contours'  <br>
0.0909635416667

The non-normalized versions scored the worst and are not useful at all. The version without 'blob_amount_contours' scores slightly lower.
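A plausible explanation for the very low scores on the non-normalized data is that the RBF kernel depends on distances between samples: with large feature ranges and gamma=0.3 the kernel values become almost zero for every pair of samples. The sketch below illustrates this on the bundled `load_digits` data (pixel values from 0 to 16), which is a stand-in and not the notebook's dataset; `MinMaxScaler` is likewise an assumption, since the notebook does not state which normalization was used.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in data (assumption): raw pixel intensities in the range 0..16.
X_demo, y_demo = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.20, random_state=67)

# Same SVM settings as in this notebook, trained on the raw features.
raw_model = SVC(kernel="rbf", C=1.2, gamma=0.3).fit(Xtr, ytr)
raw_score = raw_model.score(Xte, yte)

# Identical settings, but every feature is first rescaled to the range 0..1.
scaled_model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=1.2, gamma=0.3)).fit(Xtr, ytr)
scaled_score = scaled_model.score(Xte, yte)

print(raw_score)
print(scaled_score)
```

Putting the scaler in a pipeline also ensures that the scaling parameters are learned from the training data only.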

In [None]:
predictions = clf.predict(X_test)
labels = y_test

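# confusion_matrix and classification_report are accessed through the imported metrics module
print metrics.confusion_matrix(labels, predictions)
print metrics.classification_report(labels, predictions)
print metrics.accuracy_score(predictions, labels)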

# Logistic Regression
When running LG with default settings the score is around 0.9583. We decided to tune the parameters.

In [174]:
clfLG = LogisticRegression()
clfLG.fit(X_train, y_train)
prediction=clfLG.predict(X_test)
accuracy_score_lg = metrics.accuracy_score(prediction,y_test)
#evaluation(Accuracy)
print "Accuracy:",accuracy_score_lg 
#evaluation(Confusion Matrix)
print metrics.confusion_matrix(prediction,y_test)
print metrics.classification_report(prediction, y_test)

Accuracy: 0.958333333333
[[42  0  0  0  0  0  0  0  0  0]
 [ 0 36  1  0  1  0  0  0  3  1]
 [ 0  2 33  0  0  1  0  1  0  0]
 [ 0  0  0 40  0  0  0  0  0  0]
 [ 0  1  0  0 27  0  0  0  0  2]
 [ 0  0  0  0  0 27  0  0  1  0]
 [ 0  0  0  0  0  0 32  0  0  0]
 [ 0  0  0  0  0  0  0 50  0  0]
 [ 0  0  0  0  1  0  0  0 31  0]
 [ 0  0  0  0  0  1  0  0  0 50]]
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        42
          1       0.92      0.86      0.89        42
          2       0.97      0.89      0.93        37
          3       1.00      1.00      1.00        40
          4       0.93      0.90      0.92        30
          5       0.93      0.96      0.95        28
          6       1.00      1.00      1.00        32
          7       0.98      1.00      0.99        50
          8       0.89      0.97      0.93        32
          9       0.94      0.98      0.96        51

avg / total       0.96      0.96      0.96       384



In [None]:
pipeline = Pipeline([('logreg',LogisticRegression(multi_class="multinomial",solver="lbfgs"))])
parameters = {'logreg__C':np.arange(0.01,100,10)}
clfLG = GridSearchCV(pipeline,parameters,cv=5, n_jobs=-1, verbose =2)
clfLG.fit(X_train, y_train)
print "Best parameters:\n", clfLG.best_params_

###### Best params gridsearch
logreg__C: 10.01
<br>
<br>
When testing the LG classifier with this param the score went down to about 0.9557, so the best option for LG is to use the default settings. In the confusion matrix of the default settings number 1 stands out: it is confused with 2, 4, 8 and 9. Number 4 stands out as well and is confused with 1 and 9.
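One thing to note about the grid for `C`: `np.arange(0.01, 100, 10)` steps linearly, so it jumps from 0.01 straight to 10.01 and never tries values close to scikit-learn's default of `C=1.0`. A logarithmic grid is a common alternative for regularization strengths. The sketch below only compares the two grids; the logarithmic range chosen here is an assumption, not something that was tested in this notebook.

```python
import numpy as np

# The linear grid used in the search: 0.01, 10.01, 20.01, ..., 90.01.
linear_grid = np.arange(0.01, 100, 10)

# A logarithmic alternative (assumption): 0.01, 0.1, 1, 10, 100.
log_grid = np.logspace(-2, 2, num=5)

print(linear_grid)
print(log_grid)

# Number of candidates between 0.1 and 10 in each grid.
linear_near_default = int(np.sum((linear_grid >= 0.1) & (linear_grid <= 10)))
log_near_default = int(np.sum((log_grid >= 0.1) & (log_grid <= 10)))
```

With a logarithmic grid the default value of 1.0 is itself one of the candidates, so the search can confirm whether the default really is the best choice.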

In [None]:
clfLG = LogisticRegression(random_state=0, solver='lbfgs',  multi_class='multinomial', C =10.01 )
clfLG.fit(X_train, y_train)
prediction=clfLG.predict(X_test)
#evaluation(Accuracy)
print "Accuracy:",metrics.accuracy_score(prediction,y_test)
#evaluation(Confusion Matrix)
print metrics.confusion_matrix(prediction,y_test)

In [None]:
accuracy_score_svm = scores.mean()
dfResults = pd.Series([accuracy_score_gnb*100, accuracy_score_bnb*100,accuracy_score_rf*100, accuracy_score_dt*100,
                       accuracy_score_knn*100, accuracy_score_sgd*100, accuracy_score_svm*100, accuracy_score_lg*100],
                      index=['Gaussian', 'Bernoulli', 'Random Forest', 'Decision Tree',
                             'KNN', 'SGD','SVM', 'Logistic Regression'])
dfResults.head(8)
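To read off the ranking directly, the results can be sorted. The sketch below rebuilds the series from the accuracy scores reported throughout this notebook, so that it runs on its own; for the SVM it uses the 100-run mean reported for the normalized dataset. The variable names are made up for this example.

```python
import pandas as pd

# Accuracy scores as reported in the sections above, in percent.
results = pd.Series({
    'Gaussian': 0.94270833333333337 * 100,
    'Bernoulli': 0.84895833333333337 * 100,
    'Random Forest': 0.979166666667 * 100,
    'Decision Tree': 0.86979166666666663 * 100,
    'KNN': 0.97395833333333337 * 100,
    'SGD': 0.958333333333 * 100,
    'SVM': 0.979505208333 * 100,
    'Logistic Regression': 0.958333333333 * 100,
})

# Highest accuracy first.
ranked = results.sort_values(ascending=False)
print(ranked.round(2))

best_name = ranked.index[0]
worst_name = ranked.index[-1]
```

The difference between the SVM and the tuned random forest is small, so the ranking of the top two could change with a different train/test split.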
