## Leaf Classification and Hyperparameter Tuning of SVM

> Dataset Description

The dataset consists of approximately 1,584 images of leaf specimens (990 training + 594 test; 16 samples each of 99 species), which have been converted to binary black leaves against white backgrounds. Three sets of features are also provided per image: a *shape contiguous descriptor*, an *interior texture histogram*, and a *fine-scale margin histogram*. For each feature set, a 64-attribute vector is given per leaf sample, for a total of 3 × 64 = 192 features.

Note that of the original 100 species, we have eliminated one on account of incomplete associated data in the original dataset.
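
The 3 × 64 layout can be illustrated with a small sketch on synthetic data. The column prefixes are an assumption: `margin` and `texture` appear in the column names printed later in this notebook, while `shape` is assumed to follow the same pattern.

```python
import numpy as np
import pandas as pd

# Assumed column names: a prefix followed by 1..64.
# "shape" is an assumption; "margin" and "texture" match the notebook output.
prefixes = ["margin", "shape", "texture"]
columns = [f"{p}{i}" for p in prefixes for i in range(1, 65)]

# Synthetic stand-in for the real feature table (5 fake leaves).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((5, len(columns))), columns=columns)

# Select each 64-attribute block by its column prefix.
groups = {p: demo.filter(regex=rf"^{p}\d+$") for p in prefixes}

for name, block in groups.items():
    print(name, block.shape)

total_features = sum(block.shape[1] for block in groups.values())
print("total features:", total_features)
```

Splitting by prefix like this makes it possible to scale or reduce each feature family separately, should that ever be needed.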

In [1]:
import numpy as np
import pandas as pd

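# Silence all warnings by replacing warnings.warn with a no-op.
# This is a blunt hack: it also hides deprecation and convergence warnings.
def warn(*args, **kwargs): pass
import warnings
warnings.warn = warn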

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, log_loss

from sklearn.svm import SVC
from sklearn import decomposition

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedShuffleSplit

train = pd.read_csv('./leafClf/train.csv')  # 990 rows; 192 feature columns plus id and species
test = pd.read_csv('./leafClf/test.csv')    # 594 rows; 192 feature columns plus id

## Data Preparation

In [2]:
# Swiss army knife function to organize the data

def encode(train, test):
    le = LabelEncoder().fit(train.species) 
    labels = le.transform(train.species)           # encode species strings
    classes = list(le.classes_)                    # save column names for submission
    test_ids = test.id                             # save test ids for submission
    
    train = train.drop(['species', 'id'], axis=1)  
    test = test.drop(['id'], axis=1)
    
    return train, labels, test, test_ids, classes

train, labels, test, test_ids, classes = encode(train, test)
train.head(1)

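|   | margin1 | margin2 | margin3 | margin4 | margin5 | margin6 | margin7 | margin8 | margin9 | margin10 | ... | texture55 | texture56 | texture57 | texture58 | texture59 | texture60 | texture61 | texture62 | texture63 | texture64 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.007812 | 0.023438 | 0.023438 | 0.003906 | 0.011719 | 0.009766 | 0.027344 | 0.0 | 0.001953 | 0.033203 | ... | 0.007812 | 0.0 | 0.00293 | 0.00293 | 0.035156 | 0.0 | 0.0 | 0.004883 | 0.0 | 0.025391 |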
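
The label handling inside `encode` can be seen in isolation with a minimal sketch. The species strings below are hypothetical examples used only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical species names, for illustration only.
species = ["Quercus_Rubra", "Acer_Opalus", "Quercus_Rubra", "Betula_Pendula"]

le = LabelEncoder().fit(species)
labels = le.transform(species)

# classes_ is sorted, so each integer label is an index into it.
print(list(le.classes_))   # ['Acer_Opalus', 'Betula_Pendula', 'Quercus_Rubra']
print(labels.tolist())     # [2, 0, 2, 1]

# inverse_transform maps integer labels back to the original strings.
print(le.inverse_transform(labels).tolist())
```

Because the integer labels index into the sorted `classes_` array, a classifier trained on them returns its probability columns in that same order, which is why the saved `classes` list can later serve as column names.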


## Stratified Train/Test Split

Stratification is necessary for this dataset because the number of classes is large relative to the number of samples (99 classes for 990 training samples, i.e. 10 per class). It ensures that every class is represented in both the train and test indices.
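
A minimal sketch of a stratified shuffle split with the `sklearn.model_selection` API, run on synthetic data with the same class structure (the random features are placeholders, not the leaf data):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in: 99 classes with 10 samples each, 4 random features.
n_classes, per_class = 99, 10
y = np.repeat(np.arange(n_classes), per_class)
X = np.random.default_rng(23).random((y.size, 4))

# In sklearn.model_selection the splitter is configured first and the
# data is passed to .split(), which yields index arrays.
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=23)

for train_index, test_index in sss.split(X, y):
    y_train, y_test = y[train_index], y[test_index]
    # Every class must appear on both sides of the split.
    assert np.unique(y_train).size == n_classes
    assert np.unique(y_test).size == n_classes

print(len(train_index), len(test_index))
```

With only 10 samples per class, an unstratified 80/20 split could easily leave some species out of the test part entirely, which is what the stratified splitter avoids.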

In [3]:
# Disabled: the cells below fit on the full training set instead of a hold-out split.
#sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=23)
#
#for train_index, test_index in sss.split(train.values, labels):
#    X_train, X_test = train.values[train_index], train.values[test_index]
#    y_train, y_test = labels[train_index], labels[test_index]

In [4]:
pca = decomposition.PCA()
pca.fit(train)
train_t = pca.transform(train)
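
A short sketch of what `PCA()` does when no `n_components` is given, using a random matrix of the same shape as the training features. The 0.95 variance threshold in the second half is an illustrative choice, not something this notebook uses.

```python
import numpy as np
from sklearn import decomposition

# Synthetic stand-in for the 990 x 192 feature matrix.
rng = np.random.default_rng(0)
X = rng.random((990, 192))

# With no n_components, PCA keeps min(n_samples, n_features) components,
# so the transform centres and rotates the data without reducing it.
pca_full = decomposition.PCA().fit(X)
X_full = pca_full.transform(X)
print(X_full.shape)

# A float in (0, 1) keeps just enough components to explain that
# fraction of the variance (0.95 is an assumed, illustrative value).
pca_95 = decomposition.PCA(n_components=0.95, svd_solver="full").fit(X)
X_95 = pca_95.transform(X)
print(X_95.shape, pca_95.explained_variance_ratio_.sum())
```

In other words, the default call decorrelates the features but leaves the dimensionality at 192; any reduction has to be requested explicitly.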

In [5]:
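# Set the parameters by cross-validation
# NOTE: 0.10 and 0.1 are the same value, so each C list contains a duplicate
# and that setting is evaluated twice (see the repeated C=0.1 rows in the output).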
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
                     'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]},
                    {'kernel': ['sigmoid'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
                     'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]},
                    {'kernel': ['linear'], 'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]}
                   ]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    clf.fit(train_t, labels)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

#    print("Detailed classification report:")
#    print()
#    print("The model is trained on the full development set.")
#    print("The scores are computed on the full evaluation set.")
#    print()
#    y_true, y_pred = y_test, clf.predict(X_test)
#    print(classification_report(y_true, y_pred))
#    print()

# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'C': 1000, 'kernel': 'linear'}

Grid scores on development set:

0.797 (+/-0.080) for {'C': 0.001, 'gamma': 0.01, 'kernel': 'rbf'}
0.797 (+/-0.080) for {'C': 0.001, 'gamma': 0.001, 'kernel': 'rbf'}
0.796 (+/-0.080) for {'C': 0.001, 'gamma': 0.0001, 'kernel': 'rbf'}
0.797 (+/-0.084) for {'C': 0.001, 'gamma': 1e-05, 'kernel': 'rbf'}
0.797 (+/-0.080) for {'C': 0.1, 'gamma': 0.01, 'kernel': 'rbf'}
0.797 (+/-0.080) for {'C': 0.1, 'gamma': 0.001, 'kernel': 'rbf'}
0.796 (+/-0.080) for {'C': 0.1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.797 (+/-0.084) for {'C': 0.1, 'gamma': 1e-05, 'kernel': 'rbf'}
0.797 (+/-0.080) for {'C': 0.1, 'gamma': 0.01, 'kernel': 'rbf'}
0.797 (+/-0.080) for {'C': 0.1, 'gamma': 0.001, 'kernel': 'rbf'}
0.796 (+/-0.080) for {'C': 0.1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.797 (+/-0.084) for {'C': 0.1, 'gamma': 1e-05, 'kernel': 'rbf'}
0.797 (+/-0.080) for {'C': 10, 'gamma': 0.01, 'kernel': 'r

In [6]:
clf.best_params_

{'C': 1000, 'kernel': 'linear'}
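
Because the loop reassigns `clf` on every pass, the `clf` inspected here is the one from the last scoring metric (recall). As a possible alternative, both metrics can be evaluated in a single search. The sketch below uses the built-in iris data and a deliberately tiny grid as stand-ins; neither is taken from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small built-in dataset as a stand-in for the leaf features.
X, y = load_iris(return_X_y=True)

# Deliberately tiny grid so the sketch runs quickly.
param_grid = [
    {"kernel": ["rbf"], "gamma": [1e-2, 1e-3], "C": [1, 10, 100]},
    {"kernel": ["linear"], "C": [1, 10, 100]},
]

# Evaluate both metrics in one search; `refit` names the metric that
# decides best_params_ and the final refitted estimator.
search = GridSearchCV(
    SVC(),
    param_grid,
    cv=5,
    scoring=["precision_macro", "recall_macro"],
    refit="recall_macro",
)
search.fit(X, y)

print(search.best_params_)
results = search.cv_results_
for params, prec, rec in zip(results["params"],
                             results["mean_test_precision_macro"],
                             results["mean_test_recall_macro"]):
    print(f"precision={prec:.3f} recall={rec:.3f} for {params}")
```

This fits each candidate once per fold instead of once per fold per metric, and makes explicit which metric selects the final model.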

In [7]:
my_svm = SVC(C=1000, kernel="linear", probability=True)
my_svm.fit(train_t, labels)

#print('****Results****')
#train_predictions = my_svm.predict(X_test)
#acc = accuracy_score(y_test, train_predictions)
#print("Accuracy: {:.4%}".format(acc))
#
#train_predictions = my_svm.predict_proba(X_test)
#ll = log_loss(y_test, train_predictions)
#print("Log Loss: {}".format(ll))
#
#
#print("="*30)
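
The commented-out evaluation depends on `X_test` and `y_test`, which are never created in this notebook. A minimal sketch of the same accuracy and log-loss check on a stratified hold-out split is given here, with the iris data as a stand-in and `random_state=23` as an arbitrary seed:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data; stratify=y keeps class proportions equal in both parts.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=23
)

# probability=True enables predict_proba, which log_loss needs.
model = SVC(C=1000, kernel="linear", probability=True, random_state=23)
model.fit(X_train, y_train)

acc = accuracy_score(y_test, model.predict(X_test))
ll = log_loss(y_test, model.predict_proba(X_test), labels=model.classes_)
print("Accuracy: {:.4%}".format(acc))
print("Log Loss: {:.4f}".format(ll))
```

Passing `labels=model.classes_` to `log_loss` keeps the probability columns aligned with the class labels even if a class happens to be missing from `y_test`.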

In [8]:
#train_t = pca.transform(train)
#my_svm.fit(train_t, labels)

test_t = pca.transform(test)
test_predictions = my_svm.predict_proba(test_t)

# Format DataFrame: one probability column per species, with the id column first
submission = pd.DataFrame(test_predictions, columns=classes)
submission.insert(0, 'id', test_ids)

# Export Submission
submission.to_csv('./leafClf/submission.csv', index = False)

In [9]:
subm = pd.read_csv('./leafClf/submission.csv')

In [10]:
subm.iloc[:, 1:].sum(axis='columns')

0      1.0
1      1.0
2      1.0
3      1.0
4      1.0
      ... 
589    1.0
590    1.0
591    1.0
592    1.0
593    1.0
Length: 594, dtype: float64
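
The visual check above can also be written as an assertion. This is a sketch on a hypothetical three-class frame; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-class submission with two rows, for illustration only.
demo = pd.DataFrame(
    {"id": [4, 7], "A": [0.7, 0.1], "B": [0.2, 0.1], "C": [0.1, 0.8]}
)

probs = demo.drop(columns="id")
row_sums = probs.sum(axis="columns")

# allclose tolerates the rounding error a CSV round trip can introduce.
assert np.allclose(row_sums, 1.0)
# Each entry must also be a valid probability.
assert ((probs >= 0) & (probs <= 1)).all().all()
print("all rows are valid probability distributions")
```

An assertion fails loudly if a single row is off, whereas the truncated printout only shows the first and last few rows.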

#### Reference:

- https://www.kaggle.com/code/udaysa/svm-with-scikit-learn-svm-with-parameter-tuning