## SVM on Leaf Classification Data Set

In [31]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC, NuSVC
from sklearn.metrics import accuracy_score
import time
np.random.seed(1200)


leaf = pd.read_csv('train.csv')

### Data Preprocessing

Class labels need to be numerically encoded before fitting the model. We also drop species and id from the feature set so that the model does not fit to those values. Because the dataset has many classes (99) relative to its size (990 observations), we stratify the split so that every species is represented in both the training and test sets.

In [32]:
le = LabelEncoder().fit(leaf.species) 
labels = le.transform(leaf.species) 
leaf = leaf.drop(['species', 'id'], axis=1)  

X_train, X_test, y_train, y_test = train_test_split(leaf, labels, test_size=0.3, stratify=labels)

X_train.head()

| | margin1 | margin2 | margin3 | margin4 | margin5 | margin6 | margin7 | margin8 | margin9 | margin10 | ... | texture55 | texture56 | texture57 | texture58 | texture59 | texture60 | texture61 | texture62 | texture63 | texture64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 951 | 0.0 | 0.0 | 0.013672 | 0.015625 | 0.035156 | 0.0 | 0.023438 | 0.0 | 0.005859 | 0.017578 | ... | 0.09668 | 0.0 | 0.008789 | 0.0 | 0.011719 | 0.0 | 0.0 | 0.0 | 0.021484 | 0.000977 |
| 127 | 0.011719 | 0.03125 | 0.044922 | 0.009766 | 0.009766 | 0.025391 | 0.035156 | 0.0 | 0.0 | 0.011719 | ... | 0.0 | 0.0 | 0.051758 | 0.0 | 0.005859 | 0.0 | 0.0 | 0.0 | 0.008789 | 0.1416 |
| 825 | 0.005859 | 0.019531 | 0.080078 | 0.0 | 0.001953 | 0.082031 | 0.001953 | 0.0 | 0.003906 | 0.001953 | ... | 0.27051 | 0.0 | 0.023438 | 0.0 | 0.001953 | 0.0 | 0.0 | 0.0 | 0.0 | 0.024414 |
| 775 | 0.025391 | 0.050781 | 0.029297 | 0.015625 | 0.003906 | 0.041016 | 0.03125 | 0.005859 | 0.0 | 0.048828 | ... | 0.0 | 0.0 | 0.018555 | 0.0 | 0.012695 | 0.0 | 0.0 | 0.0 | 0.0 | 0.056641 |
| 755 | 0.0 | 0.0 | 0.001953 | 0.007812 | 0.074219 | 0.0 | 0.0 | 0.0 | 0.003906 | 0.0 | ... | 0.0 | 0.0 | 0.000977 | 0.021484 | 0.014648 | 0.037109 | 0.0 | 0.10059 | 0.001953 | 0.004883 |
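
To make the effect of stratification concrete, here is a minimal sketch on synthetic data. The species names, the four random feature columns, and `random_state=0` are assumptions made for illustration; only the shape (many classes, ten rows each) mirrors the leaf data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the leaf data: 99 species with 10 rows each.
species = np.repeat([f"species_{i:02d}" for i in range(99)], 10)
rng = np.random.RandomState(0)
features = rng.rand(len(species), 4)  # random placeholder features

# Encode the string labels as integers, as in the cell above.
encoder = LabelEncoder().fit(species)
labels = encoder.transform(species)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=0
)

# Count how many rows of each class ended up in each split.
train_counts = np.bincount(y_tr, minlength=99)
test_counts = np.bincount(y_te, minlength=99)
print("train rows per class:", train_counts.min(), "to", train_counts.max())
print("test rows per class:", test_counts.min(), "to", test_counts.max())
```

With `stratify=labels`, every class keeps the same 70/30 proportion, so no species is missing from the test set. Without it, a purely random split of ten rows per class could easily leave some classes with very few or no test rows.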


### SVM Parameter Tuning

First we'll use a grid search to determine the ideal SVM model based on the training data. The tuned parameters are:
* C: Penalty parameter C of the error term. This is used to regularize the model and trades off a smooth fit against exactly fitting the training data.
* kernel: kernel type for the algorithm, must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’
* gamma: Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. A higher gamma value makes the model fit more to the training data
* degree: Only used for polynomial SVM. Determines the degree of the polynomial when making the hyperplane
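
As a rough illustration of the gamma bullet above, the sketch below fits an rbf SVM at three gamma values and reports the accuracy on the same data it was trained on. The `make_moons` toy dataset, the noise level, and the chosen gamma values are assumptions for illustration, not part of the leaf analysis.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Small noisy two-class toy problem (illustrative only).
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

gammas = [0.01, 1, 100]
train_scores = []
for gamma in gammas:
    model = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    # Accuracy on the training data itself, not a held-out set.
    train_scores.append(model.score(X, y))
    print("gamma=%g: training accuracy %.3f" % (gamma, train_scores[-1]))
```

Training accuracy generally rises with gamma because each support vector influences a smaller neighbourhood, which is why gamma needs to be tuned with cross-validation rather than on the training score.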

In [33]:
Cs = [0.001, 0.01, 0.1, 1, 10, 25, 50, 100, 1000]
kernels = ["linear", "rbf", "sigmoid", "poly"]
gammas = [0.001, 0.01, 0.1, 1, 10]
degrees = [2,3,4,5,6] # degree = 1 is identical to linear kernel
#different param dict for each kernel to remove redundancies
param_grid = [{'kernel' : ["linear"] ,'C': Cs},
             {'kernel': ["rbf", "sigmoid"], 'C': Cs, 'gamma': gammas},
             {'kernel' : ["poly"], 'C': Cs, 'gamma': gammas, 'degree': degrees}]
grid_search = GridSearchCV(SVC(), param_grid, cv=5)

grid_search.fit(X_train, y_train)



GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'kernel': ['linear'], 'C': [0.001, 0.01, 0.1, 1, 10, 25, 50, 100, 1000]}, {'kernel': ['rbf', 'sigmoid'], 'C': [0.001, 0.01, 0.1, 1, 10, 25, 50, 100, 1000], 'gamma': [0.001, 0.01, 0.1, 1, 10]}, {'kernel': ['poly'], 'C': [0.001, 0.01, 0.1, 1, 10, 25, 50, 100, 1000], 'gamma': [0.001, 0.01, 0.1, 1, 10], 'degree': [2, 3, 4, 5, 6]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
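
To get a sense of how much work this grid implies, the candidates can be counted with `ParameterGrid` before fitting. This is a small sketch that repeats the grid from the cell above; the fold count of 5 is taken from `cv=5`.

```python
from sklearn.model_selection import ParameterGrid

Cs = [0.001, 0.01, 0.1, 1, 10, 25, 50, 100, 1000]
gammas = [0.001, 0.01, 0.1, 1, 10]
degrees = [2, 3, 4, 5, 6]
param_grid = [
    {"kernel": ["linear"], "C": Cs},
    {"kernel": ["rbf", "sigmoid"], "C": Cs, "gamma": gammas},
    {"kernel": ["poly"], "C": Cs, "gamma": gammas, "degree": degrees},
]

# ParameterGrid expands each dict into its Cartesian product of settings.
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 5  # one fit per candidate per fold
print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

Splitting the grid by kernel keeps the count at 9 + 90 + 225 candidates; a single dict crossing all four parameters would also evaluate gamma and degree values that the linear, rbf, and sigmoid kernels ignore.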

In [34]:
print("Best parameter set from grid search: ", grid_search.best_params_)
print("Best Accuracy: %0.2f%%" % (grid_search.best_score_*100),"\n")

means = grid_search.cv_results_['mean_test_score']
params = grid_search.cv_results_['params']

results = sorted(zip(means, params), key=lambda x: x[0], reverse=True)

for mean, params in results:
    print("%0.2f%% : %r" % (mean*100, params))

Best parameter set from grid search:  {'C': 1000, 'kernel': 'linear'}
Best Accuracy: 92.78% 

92.78% : {'C': 1000, 'kernel': 'linear'}
92.78% : {'C': 1000, 'gamma': 0.1, 'kernel': 'rbf'}
92.64% : {'C': 100, 'kernel': 'linear'}
92.64% : {'C': 100, 'gamma': 1, 'kernel': 'rbf'}
92.64% : {'C': 1000, 'gamma': 0.1, 'kernel': 'sigmoid'}
92.64% : {'C': 1000, 'gamma': 1, 'kernel': 'rbf'}
92.64% : {'C': 1000, 'gamma': 1, 'kernel': 'sigmoid'}
92.50% : {'C': 50, 'kernel': 'linear'}
92.50% : {'C': 100, 'gamma': 1, 'kernel': 'sigmoid'}
92.35% : {'C': 25, 'gamma': 1, 'kernel': 'rbf'}
92.35% : {'C': 50, 'gamma': 1, 'kernel': 'rbf'}
92.35% : {'C': 50, 'gamma': 1, 'kernel': 'sigmoid'}
92.21% : {'C': 10, 'gamma': 10, 'kernel': 'rbf'}
92.21% : {'C': 25, 'gamma': 10, 'kernel': 'rbf'}
92.21% : {'C': 50, 'gamma': 10, 'kernel': 'rbf'}
92.21% : {'C': 100, 'gamma': 10, 'kernel': 'rbf'}
92.21% : {'C': 1000, 'gamma': 10, 'kernel': 'rbf'}
91.63% : {'C': 10, 'degree': 2, 'gamma': 10, 'kernel': 'poly'}
91.63% : {'C'

In [35]:
grid_search.best_params_

{'C': 1000, 'kernel': 'linear'}

In [40]:
svm = SVC(C=1000, kernel="linear", probability=True)
svm.fit(X_train, y_train)
predictions = svm.predict(X_test)

accuracy_score(y_test, predictions)  # signature is (y_true, y_pred)

0.92592592592592593
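
Because the grid search was run with `refit=True` (the default), the fitted search object already holds a model retrained on the full training set with the best parameters. The sketch below, which uses the iris dataset and a tiny grid purely as assumptions for illustration, shows that predicting through the search object matches refitting by hand.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

search = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X_tr, y_tr)

# predict() on the search object delegates to search.best_estimator_.
grid_acc = accuracy_score(y_te, search.predict(X_te))

# Refitting manually with the best parameters, as done in the cell above.
manual = SVC(kernel="linear", **search.best_params_).fit(X_tr, y_tr)
manual_acc = accuracy_score(y_te, manual.predict(X_te))
print(grid_acc, manual_acc)
```

Using `search.predict` avoids copying the best parameters by hand, which removes one opportunity for the retrained model to drift from what the search selected.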

### NuSVC

Additionally, I am trying NuSVC, which replaces the penalty parameter C with nu. The nu parameter sets a lower bound on the fraction of observations that are support vectors and an upper bound on the fraction of margin errors (training points that are misclassified or fall inside the margin).

In [42]:
%%time
kernels = ["linear", "rbf", "sigmoid", "poly"]
gammas = [0.001, 0.01, 0.1, 1, 10]
nus = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
#different param dict for each kernel to remove redundancies
nu_param_grid = [{'kernel' : ["linear"], 'nu' : nus},
             {'kernel': ["rbf", "sigmoid"], 'gamma': gammas, 'nu' : nus}]
nu_grid_search = GridSearchCV(NuSVC(), nu_param_grid, cv=5)
nu_grid_search.fit(X_train, y_train)

CPU times: user 2min 23s, sys: 805 ms, total: 2min 23s
Wall time: 2min 26s




In [43]:
nu_grid_search.best_params_

{'kernel': 'linear', 'nu': 0.1}

In [44]:
print("Best parameter set from grid search: ", nu_grid_search.best_params_)
print("Best Accuracy: %0.2f%%" % (nu_grid_search.best_score_*100),"\n")

means = nu_grid_search.cv_results_['mean_test_score']
params = nu_grid_search.cv_results_['params']

results = sorted(zip(means, params), key=lambda x: x[0], reverse=True)

for mean, params in results:
    print("%0.2f%% : %r" % (mean*100, params))

Best parameter set from grid search:  {'kernel': 'linear', 'nu': 0.1}
Best Accuracy: 92.93% 

92.93% : {'kernel': 'linear', 'nu': 0.1}
92.78% : {'gamma': 1, 'kernel': 'sigmoid', 'nu': 0.1}
92.64% : {'kernel': 'linear', 'nu': 0.2}
92.64% : {'gamma': 1, 'kernel': 'sigmoid', 'nu': 0.2}
92.50% : {'gamma': 1, 'kernel': 'rbf', 'nu': 0.2}
92.35% : {'gamma': 0.1, 'kernel': 'rbf', 'nu': 0.2}
92.35% : {'gamma': 1, 'kernel': 'rbf', 'nu': 0.1}
92.21% : {'gamma': 10, 'kernel': 'rbf', 'nu': 0.1}
92.21% : {'gamma': 10, 'kernel': 'rbf', 'nu': 0.2}
91.63% : {'kernel': 'linear', 'nu': 0.3}
91.63% : {'gamma': 0.1, 'kernel': 'rbf', 'nu': 0.3}
91.49% : {'gamma': 1, 'kernel': 'rbf', 'nu': 0.3}
91.34% : {'gamma': 0.1, 'kernel': 'rbf', 'nu': 0.1}
91.34% : {'gamma': 1, 'kernel': 'sigmoid', 'nu': 0.3}
91.34% : {'gamma': 10, 'kernel': 'rbf', 'nu': 0.3}
91.34% : {'gamma': 10, 'kernel': 'rbf', 'nu': 0.4}
91.05% : {'gamma': 0.1, 'kernel': 'sigmoid', 'nu': 0.3}
90.48% : {'kernel': 'linear', 'nu': 0.4}
90.48% : {'gam

In [47]:
nu_svc = NuSVC(nu=0.1, kernel="linear", probability=True)
nu_svc.fit(X_train, y_train)
predictions = nu_svc.predict(X_test)

accuracy_score(y_test, predictions)  # signature is (y_true, y_pred)

0.92592592592592593

### SVM Conclusion

Linear SVMs with a high cost parameter performed best, topping 90% accuracy whenever C >= 25. C = 1000 had the best cross-validation accuracy (92.78%) and a test accuracy of ~92.6%. Likewise, rbf and sigmoid kernels with high cost parameters also fared exceedingly well in cross-validation on the training data. Polynomial kernels fared poorly outside of degree = 2 with high values for C and gamma.

NuSVC showed that small values of nu were more effective, which is unsurprising since nu has an inverse relationship with C. The linear kernel had the best cross-validation accuracy here (92.93% at nu = 0.1), with the sigmoid and rbf kernels at similarly high accuracy, and I'd think 