## SVM on Leaf Classification Data Set

In [122]:
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold

leaf = pd.read_csv('train.csv')

### Data Preprocessing

Class labels need to be encoded as integers before fitting the model. We also drop the species and id columns from the feature set so that the model does not fit to those values. Because the dataset has a large number of classes (99) relative to its size (990 observations), we stratify the split into training and test sets so that every class appears in both.

In [88]:
leaf = pd.read_csv('train.csv')

le = LabelEncoder().fit(leaf.species) 
labels = le.transform(leaf.species) 
leaf = leaf.drop(['species', 'id'], axis=1)  

X_train, X_test, y_train, y_test = train_test_split(leaf, labels, test_size=0.3, stratify=labels)

X_train.head()

| | margin1 | margin2 | margin3 | margin4 | margin5 | margin6 | margin7 | margin8 | margin9 | margin10 | ... | texture55 | texture56 | texture57 | texture58 | texture59 | texture60 | texture61 | texture62 | texture63 | texture64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 524 | 0.019531 | 0.029297 | 0.041016 | 0.001953 | 0.005859 | 0.021484 | 0.007812 | 0.0 | 0.007812 | 0.013672 | ... | 0.0 | 0.003906 | 0.008789 | 0.019531 | 0.012695 | 0.003906 | 0.0 | 0.009766 | 0.0 | 0.000977 |
| 469 | 0.009766 | 0.023438 | 0.029297 | 0.056641 | 0.037109 | 0.013672 | 0.011719 | 0.0 | 0.003906 | 0.013672 | ... | 0.0 | 0.0 | 0.007812 | 0.001953 | 0.089844 | 0.0 | 0.0 | 0.001953 | 0.020508 | 0.0 |
| 458 | 0.021484 | 0.056641 | 0.005859 | 0.048828 | 0.0 | 0.12891 | 0.0 | 0.0 | 0.007812 | 0.005859 | ... | 0.057617 | 0.0 | 0.02832 | 0.0 | 0.014648 | 0.0 | 0.0 | 0.006836 | 0.0 | 0.041992 |
| 954 | 0.0 | 0.007812 | 0.023438 | 0.046875 | 0.011719 | 0.0 | 0.005859 | 0.0 | 0.005859 | 0.013672 | ... | 0.013672 | 0.0 | 0.022461 | 0.0 | 0.017578 | 0.0 | 0.035156 | 0.0 | 0.006836 | 0.00293 |
| 229 | 0.003906 | 0.009766 | 0.066406 | 0.033203 | 0.025391 | 0.0 | 0.019531 | 0.005859 | 0.005859 | 0.013672 | ... | 0.0 | 0.001953 | 0.013672 | 0.003906 | 0.001953 | 0.001953 | 0.0 | 0.00293 | 0.0 | 0.011719 |
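To illustrate why stratification matters with this many classes, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are hypothetical stand-ins (99 classes of 10 samples each, chosen to mirror the 990 observations), not the leaf features themselves. It compares how many test samples each class receives with and without `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the leaf data: 99 classes x 10 samples each.
rng = np.random.default_rng(0)
n_classes, per_class = 99, 10
y_demo = np.repeat(np.arange(n_classes), per_class)
X_demo = rng.normal(size=(y_demo.size, 5))

# Same split size as the notebook, once stratified and once not.
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# Count how many test samples each class received.
strat_counts = np.bincount(y_test_strat, minlength=n_classes)
plain_counts = np.bincount(y_test_plain, minlength=n_classes)

print("stratified test samples per class (min, max):",
      strat_counts.min(), strat_counts.max())
print("unstratified test samples per class (min, max):",
      plain_counts.min(), plain_counts.max())
```

With only ten samples per class, an unstratified split can leave some classes with very few (or no) test samples, which makes the test accuracy a noisier estimate.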


### Parameter Tuning

First we'll use a grid search to determine the ideal SVM model based on the training data. The tuned parameters are:
* C: Penalty parameter of the error term. It regularizes the model: a small C favors a smoother decision boundary with a wider margin, while a large C penalizes misclassified training points more heavily and fits the training data more closely.
* kernel: Kernel type used by the algorithm. Must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, or ‘precomputed’.
* gamma: Kernel coefficient for ‘rbf’, ‘poly’, and ‘sigmoid’. A higher gamma makes the model fit the training data more closely.
* degree: Used only by the polynomial kernel. Sets the degree of the polynomial kernel function.
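The roles of gamma and degree can be made concrete by evaluating the kernel functions directly. The sketch below uses the pairwise kernel helpers in `sklearn.metrics.pairwise` on a small random matrix (`X_demo` is toy data, an assumption for illustration only). It checks that a degree-1 polynomial kernel with `gamma=1` and `coef0=0` matches the linear kernel, and that a larger gamma makes RBF similarity between distinct points decay faster.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel)

rng = np.random.default_rng(0)
X_demo = rng.random((4, 3))  # hypothetical toy data, not the leaf features

# Polynomial kernel: (gamma * <x, y> + coef0) ** degree.
# With degree=1, gamma=1, coef0=0 it reduces to the linear kernel <x, y>.
K_lin = linear_kernel(X_demo)
K_poly1 = polynomial_kernel(X_demo, degree=1, gamma=1.0, coef0=0.0)
print("degree-1 poly equals linear:", np.allclose(K_lin, K_poly1))

# RBF kernel: exp(-gamma * ||x - y||^2).
# A larger gamma shrinks the similarity between distinct points,
# so each training point influences a smaller neighborhood.
K_small_gamma = rbf_kernel(X_demo, gamma=0.1)
K_large_gamma = rbf_kernel(X_demo, gamma=10.0)
off_diag = ~np.eye(X_demo.shape[0], dtype=bool)
print("mean off-diagonal similarity, gamma=0.1:",
      K_small_gamma[off_diag].mean())
print("mean off-diagonal similarity, gamma=10:",
      K_large_gamma[off_diag].mean())
```

For other gamma values a degree-1 polynomial kernel is a rescaled linear kernel, so it covers the same family of models at a different effective C.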

In [126]:
Cs = [0.001, 0.01, 0.1, 1, 10, 25, 50, 100, 1000]
kernels = ["linear", "rbf", "sigmoid", "poly"]
gammas = [0.001, 0.01, 0.1, 1, 10]
degrees = [2,3,4,5,6] # degree = 1 is identical to linear kernel
#different param dict for each kernel to remove redundancies
param_grid = [{'kernel' : ["linear"] ,'C': Cs},
             {'kernel': ["rbf", "sigmoid"], 'C': Cs, 'gamma': gammas},
             {'kernel' : ["poly"], 'C': Cs, 'gamma': gammas, 'degree': degrees}]
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)



GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'kernel': ['linear'], 'C': [0.001, 0.01, 0.1, 1, 10, 25, 50, 100, 1000]}, {'kernel': ['rbf', 'sigmoid'], 'C': [0.001, 0.01, 0.1, 1, 10, 25, 50, 100, 1000], 'gamma': [0.001, 0.01, 0.1, 1, 10]}, {'kernel': ['poly'], 'C': [0.001, 0.01, 0.1, 1, 10, 25, 50, 100, 1000], 'gamma': [0.001, 0.01, 0.1, 1, 10], 'degree': [2, 3, 4, 5, 6]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
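Before launching a search like this it can help to count how many fits it implies. The sketch below rebuilds the same parameter grid and counts its candidates with `ParameterGrid`; the variable names `n_candidates` and `n_folds` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

Cs = [0.001, 0.01, 0.1, 1, 10, 25, 50, 100, 1000]
gammas = [0.001, 0.01, 0.1, 1, 10]
degrees = [2, 3, 4, 5, 6]

# Same three sub-grids as in the search above.
param_grid = [
    {'kernel': ["linear"], 'C': Cs},
    {'kernel': ["rbf", "sigmoid"], 'C': Cs, 'gamma': gammas},
    {'kernel': ["poly"], 'C': Cs, 'gamma': gammas, 'degree': degrees},
]

# 9 linear + 2*9*5 rbf/sigmoid + 9*5*5 poly = 324 candidates.
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5  # matches cv=5

print("candidates:", n_candidates)
print("cross-validation fits:", n_candidates * n_folds)
```

Splitting the grid per kernel keeps the count down: a single dictionary crossing every kernel with every gamma and degree would evaluate many combinations in which the extra parameters have no effect.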

In [127]:
print("Best parameter set from grid search: ", grid_search.best_params_)
print("\nBest Accuracy: ", grid_search.best_score_, '\n')

for mean, params in zip(grid_search.cv_results_['mean_test_score'], grid_search.cv_results_['params']):
    print("%0.3f for %r" % (mean, params))

Best parameter set from grid search:  {'C': 50, 'kernel': 'linear'}

Best Accuracy:  0.924963924964 

0.788 for {'C': 0.001, 'kernel': 'linear'}
0.788 for {'C': 0.01, 'kernel': 'linear'}
0.788 for {'C': 0.1, 'kernel': 'linear'}
0.788 for {'C': 1, 'kernel': 'linear'}
0.851 for {'C': 10, 'kernel': 'linear'}
0.902 for {'C': 25, 'kernel': 'linear'}
0.925 for {'C': 50, 'kernel': 'linear'}
0.925 for {'C': 100, 'kernel': 'linear'}
0.924 for {'C': 1000, 'kernel': 'linear'}
0.788 for {'C': 0.001, 'gamma': 0.001, 'kernel': 'rbf'}
0.788 for {'C': 0.001, 'gamma': 0.001, 'kernel': 'sigmoid'}
0.788 for {'C': 0.001, 'gamma': 0.01, 'kernel': 'rbf'}
0.788 for {'C': 0.001, 'gamma': 0.01, 'kernel': 'sigmoid'}
0.788 for {'C': 0.001, 'gamma': 0.1, 'kernel': 'rbf'}
0.788 for {'C': 0.001, 'gamma': 0.1, 'kernel': 'sigmoid'}
0.797 for {'C': 0.001, 'gamma': 1, 'kernel': 'rbf'}
0.788 for {'C': 0.001, 'gamma': 1, 'kernel': 'sigmoid'}
0.830 for {'C': 0.001, 'gamma': 10, 'kernel': 'rbf'}
0.726 for {'C': 0.001, 'gam

In [128]:
grid_search.best_params_

{'C': 50, 'kernel': 'linear'}

In [129]:
svm = SVC(C=50, kernel="linear", probability=True)
svm.fit(X_train, y_train)
predictions = svm.predict(X_test)

accuracy_score(y_test, predictions)  # scikit-learn's argument order is (y_true, y_pred)

0.94612794612794615
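Because the grid search ran with `refit=True` (visible in its printed configuration), the fitted search object already holds a model trained on the full training set with the best parameters. The sketch below shows this on the small iris dataset, which is a stand-in chosen so the example runs without `train.csv`; the names `search`, `le_demo`, and the reduced grid are assumptions for illustration. It also maps predictions back to species names with `inverse_transform`.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Stand-in data with string class labels, encoded as in the notebook.
iris = load_iris()
species = iris.target_names[iris.target]
le_demo = LabelEncoder().fit(species)
y_demo = le_demo.transform(species)

X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

# Small illustrative grid; refit=True is the GridSearchCV default.
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X_tr, y_tr)

# best_estimator_ is already trained on all of X_tr with the best C.
pred_best = search.best_estimator_.predict(X_te)

# Retraining by hand with the same parameters gives the same predictions.
manual = SVC(kernel="linear", C=search.best_params_["C"]).fit(X_tr, y_tr)
pred_manual = manual.predict(X_te)

print("test accuracy:", accuracy_score(y_te, pred_best))
print("same predictions as manual refit:", (pred_best == pred_manual).all())
print("first predicted species:", le_demo.inverse_transform(pred_best[:5]))
```

Applied to the leaf data, the same pattern would use `grid_search.best_estimator_` and `le.inverse_transform` in place of the stand-in names.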

### Conclusion
Linear SVMs with a high cost parameter performed best, exceeding 90% cross-validation accuracy whenever C >= 25. C=50 had the best cross-validation accuracy (92.5%) and a test accuracy of 94.6%. Likewise, rbf and sigmoid kernels with high cost parameters also fared exceedingly well in cross-validation on the training data. Polynomial kernels fared poorly except at degree = 2 with high values of C and gamma.