# Classifier

In this notebook we try out different classifiers on different versions of the feature analysis to see which one performs best. Finally we compare the scores; the classifier with the highest score will be used for the widget. The best classifier turned out to be the SVM.

In [30]:
from SimpleCV import *
import numpy as np
import sklearn 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn import linear_model
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn import metrics
from sklearn.svm import NuSVC
from sklearn.svm import LinearSVC

## Classifiers by order of score

All the classifiers we have implemented are listed below with their accuracy scores. We show the highest score per classifier. The SVM with tweaked parameters scores best, so we will use it for the test application.
 
GaussianNB: 0.942708333333 <br>
Bernoulli: 0.869791666667 <br>
Random Forest: 0.9765625 <br>
Decision Tree: 0.877604166667 <br>
KNN: 0.966145833333 <br>
SGD: 0.958333333333 <br>
Logistic Regression: 0.966145833333 <br>
__SVM: 0.980729166667__ <br>

#### Cross validation
We are using cross validation to lower the chance of overfitting. We are using it in two ways. 

The dataset that the classifiers use has been split up. This is done below. We make a training set with 80% of the data and a test set with the remaining 20%. A classifier trains on the training set, and the test set is used to check whether it trained well. 

Another way that we are using cross validation is with k-fold. This splits the data into k folds. Each fold serves once as the validation set while the other folds form the training set. This method is used in the parameter tuning. GridSearch uses this by default.
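
To make the k-fold idea concrete, here is a minimal sketch in plain Python. The helper `k_fold_indices` and the training-set size of 1536 rows are assumptions for illustration (a 384-row test set at 20% implies roughly that many training rows). scikit-learn's own splitters differ in detail; for classifiers, GridSearchCV uses stratified folds.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, validation) index lists for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# Assumed training-set size, see the note above.
for train, validation in k_fold_indices(n_samples=1536, k=5):
    print("train: {}  validation: {}".format(len(train), len(validation)))
```

Every row ends up in the validation set exactly once, so each candidate parameter setting is scored on all of the training data without ever being scored on rows it was fitted on.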

In [74]:
df = pd.read_csv("../dataset-numpy/dataset_analysis_normalized_v4.csv")

df.head()

# split df in data (X) and labels (y)
X, y = df.iloc[:,:-1], df.iloc[:,-1]

# create train and test data and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.20, random_state=67)

# PCA
PCA stands for principal component analysis. PCA reduces the number of dimensions of the data by summarizing it in a smaller set of components. In our case we pass 0.95, which keeps as many components as are needed to explain 95% of the variance. When testing our classifiers with the PCA dataset some classifiers decreased in score and some increased. We chose to use the PCA dataset only for the classifiers whose score increased.

In [55]:
pca = PCA(0.95)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
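
The effect of passing a fraction such as 0.95 can be sketched as follows. This is a simplified illustration of the selection rule, not scikit-learn's implementation; the function name and the variance ratios are made up for the example.

```python
def components_for_variance(explained_variance_ratio, target=0.95):
    """Return how many leading components are needed to reach the target variance share."""
    cumulative = 0.0
    for count, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= target:
            return count
    return len(explained_variance_ratio)

# Hypothetical ratios, sorted from largest to smallest as PCA reports them.
ratios = [0.40, 0.25, 0.15, 0.10, 0.06, 0.03, 0.01]
print(components_for_variance(ratios))  # 5 of the 7 components are kept
```

On a fitted model, the number of components that was actually kept can be read from `pca.n_components_`.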

# Naive Bayes
We chose two naive Bayes models: Gaussian, which can be used if the features follow a normal distribution, and Bernoulli, which works better if the features are zeros and ones. The scores of these classifiers were rather low and did not reach the level of the other classifiers.

#### Gaussian : 0.942708333333
Gaussian worked better with the PCA dataset. For this classifier most of the confusion was around the digit 9, which was mistaken for other digits such as 1, 3 and 8. The only digits that were never misclassified are 0 and 1.

In [9]:
#Naive Bayes Gaussian
gnb = GaussianNB()
gnb.fit(X_train_pca, y_train)

prediction=gnb.predict(X_test_pca)
accuracy_score_gnb = metrics.accuracy_score(prediction,y_test)

#evaluation
print "Accuracy:",accuracy_score_gnb 

print metrics.confusion_matrix(prediction,y_test)

Accuracy: 0.942708333333
[[42  0  0  0  0  0  0  0  0  0]
 [ 0 39  0  0  0  0  1  1  2  4]
 [ 0  0 33  1  0  0  0  0  0  0]
 [ 0  0  0 39  0  1  0  0  0  1]
 [ 0  0  0  0 28  0  1  2  0  0]
 [ 0  0  0  0  0 24  0  0  1  0]
 [ 0  0  0  0  0  0 30  0  0  0]
 [ 0  0  0  0  0  0  0 48  0  0]
 [ 0  0  0  0  0  0  0  0 32  1]
 [ 0  0  1  0  1  4  0  0  0 47]]
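
The accuracy and the per-digit errors can be recomputed from the printed matrix. The sketch below copies the GaussianNB matrix from the output above; the helper names are illustrative. Because the call passes the predictions as the first argument, rows are predicted digits and columns are true digits.

```python
# GaussianNB confusion matrix from the output above (rows: predicted, columns: true).
matrix = [
    [42, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 39, 0, 0, 0, 0, 1, 1, 2, 4],
    [0, 0, 33, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 39, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 28, 0, 1, 2, 0, 0],
    [0, 0, 0, 0, 0, 24, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 30, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 48, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 32, 1],
    [0, 0, 1, 0, 1, 4, 0, 0, 0, 47],
]

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / float(total)

def errors_per_true_digit(matrix):
    """Off-diagonal column sums: how often each true digit was misclassified."""
    n = len(matrix)
    return [sum(matrix[r][c] for r in range(n) if r != c) for c in range(n)]

print("{:.6f}".format(accuracy(matrix)))  # 0.942708
print(errors_per_true_digit(matrix))      # [0, 0, 1, 1, 1, 5, 2, 3, 3, 6]
```

The last list shows the true digit 9 misclassified six times and the digits 0 and 1 never, which matches the description above.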


#### Bernoulli: 0.869791666667
We also tried the Bernoulli classifier. This classifier scored rather low, because our dataset does not consist only of booleans but of a range of numeric values. In the confusion matrix every digit except 0 has been confused with other digits. 

In [11]:
#Naive Bayes Bernoulli
bnb = BernoulliNB()
bnb.fit(X_train, y_train)

prediction=bnb.predict(X_test)
accuracy_score_bnb = metrics.accuracy_score(prediction,y_test)

#evaluation
print "Accuracy:",accuracy_score_bnb
print metrics.confusion_matrix(prediction,y_test)

Accuracy: 0.869791666667
[[42  0  0  0  0  1  0  0  0  0]
 [ 0 30  0  1  6  0  1  0  0  2]
 [ 0  4 30  2  0  0  0  0  0  0]
 [ 0  0  0 32  0  0  0  0  1  1]
 [ 0  1  0  0 22  0  0  2  0  4]
 [ 0  1  1  2  0 23  0  0  1  0]
 [ 0  0  1  0  0  0 31  0  0  0]
 [ 0  0  0  0  1  0  0 49  1  1]
 [ 0  3  1  1  0  1  0  0 32  2]
 [ 0  0  1  2  0  4  0  0  0 43]]
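
One likely reason for the lower score is that `BernoulliNB` turns every feature into a 0 or 1 using its `binarize` threshold (0.0 by default), so differences between non-zero values are lost. The sketch below mimics that step in plain Python; the feature values are hypothetical and not taken from the dataset.

```python
def binarize(features, threshold=0.0):
    """Map each value to 1 if it is above the threshold, otherwise to 0."""
    return [1 if value > threshold else 0 for value in features]

# Two hypothetical normalized feature vectors that are clearly different...
sample_a = [0.10, 0.85, 0.00, 0.40]
sample_b = [0.90, 0.05, 0.00, 0.70]

# ...but identical after binarization.
print(binarize(sample_a))  # [1, 1, 0, 1]
print(binarize(sample_b))  # [1, 1, 0, 1]
```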


# Random forest 

#### Random Forest with default parameters: 0.950520833333

In [14]:
# default parameters
clfRF = RandomForestClassifier()
clfRF.fit(X_train, y_train)
prediction=clfRF.predict(X_test)

accuracy_score_rf = metrics.accuracy_score(prediction,y_test)

#evaluation
print "Accuracy:",accuracy_score_rf 

Accuracy: 0.950520833333


#### Randomized search
Running the random forest with default settings gave a score of about 0.95, so we decided to tune the hyperparameters. We first ran a randomized search and then used its result to narrow the range for a grid search. The randomized search does not try every combination: it samples a fixed number of combinations (`n_iter = 100`) from the ranges below. It uses k-fold cross validation, which helps to guard against overfitting. In this case k is 5 (`cv = 5`), so the training set is split into 5 folds. 

In [None]:
#number of trees 
n_estimators=[int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)] 

#number of features considered at each split ('auto' is the same as 'sqrt' for classifiers in scikit-learn versions that accept it)
max_features =['sqrt', 'auto']

#method for sampling data points (with or without replacement)
bootstrap=[True, False]

#min number of data points allowed in a leaf node
min_samples_leaf=[1,2,4]

#min number of data points placed in a node before the node is split
min_samples_split=[2,5,10]

#max number of levels in each decision tree
max_depth=[int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

randomgrid = {'n_estimators':n_estimators,
             'max_features': max_features,
             'bootstrap': bootstrap,
             'min_samples_leaf': min_samples_leaf,
             'min_samples_split': min_samples_split,
             'max_depth': max_depth}

clfRF = RandomForestClassifier()

rf_random = RandomizedSearchCV(estimator = clfRF, param_distributions = randomgrid, n_iter = 100, cv = 5, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_score_
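
To give a feeling for how much of the search space the randomized search covers, the sketch below rebuilds the same parameter lists in plain Python, counts all combinations and samples 100 of them. It illustrates the idea only and is not how `RandomizedSearchCV` is implemented; the variable names are assumptions.

```python
import itertools
import random

param_space = {
    'n_estimators': [200 + 200 * i for i in range(10)],     # 200, 400, ..., 2000
    'max_features': ['sqrt', 'auto'],
    'bootstrap': [True, False],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'max_depth': [10 * i for i in range(1, 12)] + [None],   # 10, 20, ..., 110, None
}

names = sorted(param_space)
all_combinations = list(itertools.product(*(param_space[name] for name in names)))
print(len(all_combinations))  # 4320 possible combinations

# Sample 100 distinct combinations, as n_iter = 100 does.
rng = random.Random(42)
sampled = rng.sample(all_combinations, 100)
candidates = [dict(zip(names, combo)) for combo in sampled]

# With cv = 5 every candidate is fitted five times.
print(len(candidates) * 5)  # 500 model fits
```

Trying all 4320 combinations with 5 folds would take 21600 fits, which is why the randomized search is used to find a promising region first.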

###### The best params that came out of the randomized search:
  bootstrap: False
  <br>
  max_depth: 20
  <br>
  max_features: 'sqrt'
  <br>
  min_samples_leaf: 1
  <br>
  min_samples_split: 2
  <br>
  n_estimators: 1800

#### Grid search
To get even better hyperparameters, we use the result of the randomized search to narrow down the range for the grid search. Grid search also uses k-fold cross validation, in this case with 3 folds. The resulting score could be better, but it could also decrease.

In [None]:
# Create the parameter grid based on the results of random search 
parameters = {
    'bootstrap': [False],
    'max_depth': [10,20,30,40,50],
    'max_features': ['sqrt'],
    'min_samples_leaf': [1],
    'min_samples_split': [2],
    'n_estimators': [1400,1500,1600,1800,1900,2000]
}

# Create a base model
clfRF = RandomForestClassifier()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = clfRF, param_grid = parameters, 
                          cv = 3, n_jobs = -1, verbose = 2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)
grid_search.best_score_

###### The best params that came out of the grid search
bootstrap: False
<br>
max_depth: 40
<br>
max_features: 'sqrt'
<br>
min_samples_leaf: 1
<br>
min_samples_split: 2
<br>
n_estimators: 1600

#### Random forest with parameters from random and grid search:  0.9765625

With the default settings the random forest reached an accuracy score of about 0.95. The randomized search and the grid search agree on every parameter except `max_depth` (20 versus 40) and `n_estimators` (1800 versus 1600). The classifier below uses the values from the randomized search, which raise the score by about 2.6 percentage points to 0.9765625.

In the confusion matrix the digit 5 stands out: it was misclassified as a 2 and as a 9. The digit 9 was also confused with 4 and 8. The digits 0, 2, 3, 4, 7 and 8 are all predicted correctly.

In [16]:
clfRF = RandomForestClassifier(bootstrap= False,  max_depth= 20, max_features= 'sqrt', min_samples_leaf = 1, min_samples_split = 2, n_estimators = 1800)
clfRF.fit(X_train, y_train)
prediction=clfRF.predict(X_test)

accuracy_score_rf = metrics.accuracy_score(prediction,y_test)

#evaluation
print "Accuracy :",accuracy_score_rf 
print metrics.confusion_matrix(prediction,y_test)

Accuracy : 0.9765625
[[42  0  0  0  0  0  0  0  0  0]
 [ 0 38  0  0  0  0  1  0  0  0]
 [ 0  0 34  0  0  2  0  0  0  0]
 [ 0  1  0 40  0  0  0  0  0  0]
 [ 0  0  0  0 29  0  0  0  0  2]
 [ 0  0  0  0  0 26  1  0  0  0]
 [ 0  0  0  0  0  0 30  0  0  0]
 [ 0  0  0  0  0  0  0 51  0  0]
 [ 0  0  0  0  0  0  0  0 35  1]
 [ 0  0  0  0  0  1  0  0  0 50]]


# Decision tree
With default settings the decision tree scored only 0.877604166667. Tuning the parameters is unlikely to bring this classifier up to the accuracy scores of the other classifiers. In the confusion matrix the 0 is the only digit that does not get confused for other digits.

In [17]:
clfDT = DecisionTreeClassifier()
clfDT.fit(X_train, y_train) 
prediction=clfDT.predict(X_test)

accuracy_score_dt = metrics.accuracy_score(prediction,y_test)

#evaluation
print "Accuracy:",accuracy_score_dt
print metrics.confusion_matrix(prediction,y_test)

Accuracy: 0.877604166667
[[42  0  1  0  0  0  0  0  0  0]
 [ 0 32  1  0  0  0  1  0  0  2]
 [ 0  1 29  1  0  3  1  0  2  0]
 [ 0  2  0 37  1  0  0  0  0  5]
 [ 0  0  0  0 28  1  0  0  0  2]
 [ 0  1  0  0  0 24  1  0  2  4]
 [ 0  1  0  0  0  0 29  0  0  0]
 [ 0  1  1  0  0  0  0 48  2  1]
 [ 0  1  0  0  0  0  0  1 29  0]
 [ 0  0  2  2  0  1  0  2  0 39]]


# KNNeighbors

#### KNN with default parameters: 0.958333333333

In [71]:
clfKNN = KNeighborsClassifier()
# Train the classifier with the training data and training labels
clfKNN.fit(X_train, y_train)
prediction=clfKNN.predict(X_test)

#evaluation
print "Accuracy: ", metrics.accuracy_score(prediction,y_test)
print metrics.confusion_matrix(prediction,y_test)

Accuracy:  0.958333333333
[[34  0  0  0  0  0  0  0  0  0]
 [ 0 36  0  0  1  0  1  0  3  3]
 [ 0  0 36  0  0  0  0  0  0  0]
 [ 0  0  0 38  0  1  0  0  1  1]
 [ 0  0  0  0 39  0  0  0  0  0]
 [ 0  0  0  0  0 37  0  0  0  0]
 [ 0  0  0  0  1  0 40  0  0  0]
 [ 0  0  0  0  1  0  0 36  0  3]
 [ 0  0  0  0  0  0  0  0 32  0]
 [ 0  0  0  0  0  0  0  0  0 40]]


#### Grid search
We first ran KNN with default settings, which gave an accuracy score of 0.958333333333. To try to raise this score we ran a GridSearch to see which hyperparameters work best.

In [None]:
#making the instance
clfKNN = KNeighborsClassifier(n_jobs=-1)
#Hyper Parameters Set
params = {'n_neighbors':[1,2,3,4,5,6,7,8,9,10],
          'leaf_size':[1,2,3,5],
          'weights':['uniform', 'distance'],
          'algorithm':['auto', 'ball_tree','kd_tree','brute'],
         'n_jobs': [-1]}

#Making models with hyper parameters sets
clfKNN1 = GridSearchCV(clfKNN, param_grid=params, n_jobs=-1, verbose=1)
#Learning
clfKNN1.fit(X_train,y_train)

##### Best params for KNN with grid search
algorithm: 'auto'
<br>
leaf_size: 1
<br>
n_jobs: -1
<br>
n_neighbors: 6
<br>
weights: 'distance'
<br>
<br>
Running the classifier with the params from the grid search did improve the accuracy score: it is now 0.966145833333. In the confusion matrix the digits 4, 5, 6, 8 and 9 get confused. The 0, 1, 2, 3 and 7 don't get confused for other digits.

In [72]:
clfKNN = KNeighborsClassifier(n_neighbors = 6, n_jobs = -1, weights= 'distance', leaf_size= 1, algorithm= 'auto')
clfKNN.fit(X_train, y_train)
prediction=clfKNN.predict(X_test)

accuracy_score_knn = metrics.accuracy_score(prediction,y_test)

#evaluation
print "Accuracy:",accuracy_score_knn 
print metrics.confusion_matrix(prediction,y_test)

Accuracy: 0.966145833333
[[34  0  0  0  0  0  0  0  0  0]
 [ 0 36  0  0  1  0  1  0  3  2]
 [ 0  0 36  0  0  0  0  0  0  0]
 [ 0  0  0 38  0  1  0  0  1  0]
 [ 0  0  0  0 39  0  0  0  0  0]
 [ 0  0  0  0  0 37  0  0  0  0]
 [ 0  0  0  0  1  0 40  0  0  0]
 [ 0  0  0  0  1  0  0 36  0  2]
 [ 0  0  0  0  0  0  0  0 32  0]
 [ 0  0  0  0  0  0  0  0  0 43]]
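
The grid search picked `weights = 'distance'`, which lets closer neighbours count more heavily than distant ones. The sketch below is a minimal plain-Python version of that voting rule on made-up 2D points; the function names and data are assumptions, and scikit-learn's implementation is more elaborate.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, n_neighbors=6, weighted=True):
    """Predict a label by voting among the n_neighbors closest training points."""
    neighbors = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )[:n_neighbors]
    votes = defaultdict(float)
    for distance, label in neighbors:
        if weighted and distance == 0.0:
            return label  # an exact match decides the vote
        # Inverse-distance weight, or one vote per neighbour when unweighted.
        votes[label] += 1.0 / distance if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical data: two points of class 0 near the query, four of class 1 far away.
points = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.2, 1.2)]
labels = [0, 0, 1, 1, 1, 1]
query = (0.2, 0.2)

print(knn_predict(points, labels, query, weighted=False))  # 1: majority of the six neighbours
print(knn_predict(points, labels, query, weighted=True))   # 0: the two close points outweigh the rest
```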


# SGD 
The accuracy score when running SGD with logistic loss (`loss = 'log'`) and 1000 iterations is 0.958333333333. In the confusion matrix the digits 0, 2, 4, 5, 6, 8 and 9 get confused. The 8 and 9 get confused the most often.  

In [73]:
clfSGD = linear_model.SGDClassifier(loss = 'log', n_iter =1000)
clfSGD.fit(X_train, y_train)
prediction=clfSGD.predict(X_test)

accuracy_score_sgd = metrics.accuracy_score(prediction,y_test)

#evaluation
print "Accuracy:",accuracy_score_sgd 
print metrics.confusion_matrix(prediction,y_test) 



Accuracy: 0.958333333333
[[33  0  0  0  0  0  0  0  0  0]
 [ 0 36  0  0  2  0  1  0  2  2]
 [ 0  0 35  0  0  1  0  0  1  0]
 [ 0  0  1 38  0  1  0  0  1  0]
 [ 0  0  0  0 39  0  0  0  0  0]
 [ 0  0  0  0  0 36  0  0  0  0]
 [ 0  0  0  0  1  0 40  0  0  0]
 [ 0  0  0  0  0  0  0 36  0  1]
 [ 1  0  0  0  0  0  0  0 32  1]
 [ 0  0  0  0  0  0  0  0  0 43]]


# Logistic Regression
When running logistic regression (LG) with default settings the score is 0.9583. We decided to tune the parameters with a pipeline and GridSearch. 

In [25]:
clfLG = LogisticRegression()
clfLG.fit(X_train, y_train)
prediction=clfLG.predict(X_test)
accuracy_score_lg = metrics.accuracy_score(prediction,y_test)

#evaluation
print "Accuracy:",accuracy_score_lg 
print metrics.confusion_matrix(prediction,y_test)

Accuracy: 0.958333333333
[[41  0  0  0  0  0  0  0  0  0]
 [ 0 38  0  0  0  0  1  0  0  1]
 [ 0  0 34  0  0  2  0  0  0  1]
 [ 0  0  0 40  0  0  0  0  0  1]
 [ 0  0  0  0 28  0  0  0  0  2]
 [ 1  0  0  0  0 25  1  0  1  0]
 [ 0  0  0  0  0  0 30  0  0  0]
 [ 0  1  0  0  0  0  0 51  0  0]
 [ 0  0  0  0  1  0  0  0 34  1]
 [ 0  0  0  0  0  2  0  0  0 47]]


In [None]:
pipeline = Pipeline([('logreg',LogisticRegression(multi_class="multinomial",solver="lbfgs"))])
parameters = {'logreg__C':np.arange(0.01,100,10)}
clfLG = GridSearchCV(pipeline,parameters,cv=5, n_jobs=-1, verbose =2)
clfLG.fit(X_train, y_train)
print("Best parameters:\n",clfLG.best_params_)

###### Best params gridsearch
logreg__C: 10.01
<br>
<br>
When testing the LG classifier with this param the score went up to 0.966145833333, so the best option for LG is to use these settings. In the confusion matrix the 3 and 4 are the only digits that never get confused for others. The 5 and 9 get confused the most often.

In [28]:
clfLG = LogisticRegression(random_state=0, solver='lbfgs',  multi_class='multinomial', C =10.01 )
clfLG.fit(X_train, y_train)
prediction=clfLG.predict(X_test)
#evaluation(Accuracy)
print "Accuracy:",metrics.accuracy_score(prediction,y_test)
print metrics.confusion_matrix(prediction,y_test)

Accuracy: 0.966145833333
[[41  0  1  0  0  0  1  0  0  0]
 [ 0 38  0  0  0  0  1  0  0  0]
 [ 0  0 33  0  0  2  0  1  0  0]
 [ 0  0  0 40  0  0  0  0  0  1]
 [ 0  1  0  0 29  0  0  0  0  2]
 [ 1  0  0  0  0 26  0  0  1  0]
 [ 0  0  0  0  0  0 30  0  0  0]
 [ 0  0  0  0  0  0  0 50  0  0]
 [ 0  0  0  0  0  0  0  0 34  0]
 [ 0  0  0  0  0  1  0  0  0 50]]


# Support Vector Machine

#### Default SVM with Support Vector Classification

In [29]:
clf = svm.SVC()
clf.fit(X_train, y_train)
prediction=clf.predict(X_test)

print "Accuracy:",metrics.accuracy_score(prediction,y_test)

Accuracy: 0.971354166667


#### SVM with Nu-Support Vector Classification

In [31]:
clf = NuSVC()
clf.fit(X_train, y_train)
prediction=clf.predict(X_test)

print "Accuracy:",metrics.accuracy_score(prediction,y_test)

Accuracy: 0.958333333333


#### SVM with Linear Support Vector Classification

In [33]:
clf = LinearSVC()
clf.fit(X_train, y_train)
prediction=clf.predict(X_test)

print "Accuracy:",metrics.accuracy_score(prediction,y_test)

Accuracy: 0.966145833333


From these three types of support vector machines, the __SVC__ scores the highest. We will use this one and tweak the parameters.

#### SVM with Gridsearch

Using grid search we can find which parameters are the best to get the highest score. 

In [None]:
parameters = {'kernel' : ('linear', 'rbf'), 'C': np.arange(1, 2, 0.1), 
              'gamma': np.arange(.1, .5, 0.05)}

clf = GridSearchCV(svm.SVC(), parameters, verbose=1, n_jobs=-1, cv = 5)
clf.fit(X_train, y_train)

clf.score(X_test, y_test)

print clf.best_params_
print clf.best_score_

##### Best params for SVM with grid search
kernel: 'rbf'
<br>
C: 1.1
<br>
gamma: 0.2
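
As background for the chosen `rbf` kernel and `gamma`, the sketch below computes the RBF similarity exp(-gamma * ||x - y||^2) between feature vectors in plain Python. The vectors are hypothetical; only the value `gamma = 0.2` comes from the grid search result.

```python
import math

def rbf_kernel(x, y, gamma=0.2):
    """RBF similarity: 1.0 for identical vectors, approaching 0 as they move apart."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Hypothetical feature vectors.
x = [0.2, 0.5, 0.9]
near = [0.3, 0.4, 0.8]
far = [0.9, 0.1, 0.0]

print(rbf_kernel(x, x))     # 1.0
print(rbf_kernel(x, near))  # close to 1
print(rbf_kernel(x, far))   # smaller

# A larger gamma makes the similarity drop off faster with distance.
print(rbf_kernel(x, far, gamma=2.0))
```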

In [53]:
clf = svm.SVC(kernel ="rbf", C=1.1, gamma=0.2)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)

predictions = clf.predict(X_test)

print 'accuracy: ', metrics.accuracy_score(predictions, y_test)
print metrics.confusion_matrix (y_test, predictions)
print metrics.classification_report(y_test, predictions)

accuracy:  0.979166666667
[[36  0  0  0  0  0  0  0  0  0]
 [ 0 39  0  0  0  0  0  0  0  0]
 [ 0  0 39  0  0  0  0  0  0  0]
 [ 0  0  0 41  0  1  0  1  0  0]
 [ 0  0  0  0 33  0  0  1  0  0]
 [ 0  0  0  0  0 37  0  0  0  1]
 [ 0  1  0  0  0  1 35  0  0  0]
 [ 0  0  0  0  0  0  0 40  0  0]
 [ 1  0  0  0  0  0  0  0 31  0]
 [ 0  1  0  0  0  0  0  0  0 45]]
             precision    recall  f1-score   support

          0       0.97      1.00      0.99        36
          1       0.95      1.00      0.97        39
          2       1.00      1.00      1.00        39
          3       1.00      0.95      0.98        43
          4       1.00      0.97      0.99        34
          5       0.95      0.97      0.96        38
          6       1.00      0.95      0.97        37
          7       0.95      1.00      0.98        40
          8       1.00      0.97      0.98        32
          9       0.98      0.98      0.98        46

avg / total       0.98      0.98      0.98       384
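
The precision and recall columns of the report can be recomputed from the confusion matrix. The sketch below copies the SVM matrix from the output above; here `y_test` is passed first, so rows are true digits and columns are predicted digits. The helper name is illustrative.

```python
# SVM confusion matrix from the output above (rows: true, columns: predicted).
matrix = [
    [36, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 39, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 39, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 41, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 33, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 37, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 35, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 40, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 31, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 45],
]

def precision_recall(matrix):
    """Per-digit precision (diagonal / column total) and recall (diagonal / row total)."""
    n = len(matrix)
    column_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    precision = [matrix[d][d] / float(column_totals[d]) for d in range(n)]
    recall = [matrix[d][d] / float(sum(matrix[d])) for d in range(n)]
    return precision, recall

precision, recall = precision_recall(matrix)
for digit in range(len(matrix)):
    print("{}  precision {:.2f}  recall {:.2f}".format(digit, precision[digit], recall[digit]))
```

For example, the digit 0 has a precision of 36/37 because one 8 was predicted as a 0, and the digit 3 has a recall of 41/43 because two 3s were predicted as other digits.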



The SVM scores high, with an accuracy of 0.979166666667. The digits 3, 4, 5, 6, 8 and 9 are misclassified, and the wrong predictions are 0, 1, 5, 7 and 9. The most often a digit gets confused is two times. This is the best compared to other classifiers. The precision column shows that, for every digit, about 95% or more of the predictions are correct. 
We run the SVM with the parameters found above 100 times, each time with a different random train and test split, and take the mean of all the scores. This way a single lucky or unlucky split has less influence on the result. The mean score is __0.980729166667__.

In [51]:
scores = np.zeros((100))
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.20)
    clf = svm.SVC(kernel ="rbf", C=1.1, gamma=0.2)
    clf.fit(X_train, y_train)
    scores[i] = clf.score(X_test, y_test)

print 'mean accuracy: ', scores.mean()

mean accuracy:  0.980729166667
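
When averaging over repeated splits it can also help to look at the spread of the scores, not only the mean. The sketch below shows one way to do that in plain Python; the `summarize` helper and the five accuracies are hypothetical and are not results from this notebook.

```python
import math

def summarize(scores):
    """Return the mean and the sample standard deviation of a list of scores."""
    mean = sum(scores) / float(len(scores))
    variance = sum((s - mean) ** 2 for s in scores) / float(len(scores) - 1)
    return mean, math.sqrt(variance)

# Hypothetical accuracies from five random splits (illustration only).
example_scores = [0.9766, 0.9844, 0.9792, 0.9818, 0.9740]
mean, std = summarize(example_scores)
print("mean accuracy: {:.4f} +/- {:.4f}".format(mean, std))
```

With the `scores` array from the cell above, `scores.mean()` and `scores.std()` give the same kind of summary directly.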


#### SVM with PCA  : 0.978072916667
The SVM scores a little better without PCA.

In [60]:
scores = np.zeros((100))
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.20)
    pca = PCA(0.95)
    pca.fit(X_train)
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)
    
    clf = svm.SVC(kernel ="rbf", C=1.1, gamma=0.2)
    clf.fit(X_train_pca, y_train)
    scores[i] = clf.score(X_test_pca, y_test)

print 'mean accuracy: ', scores.mean()

mean accuracy:  0.978072916667
