Assignment Content:

1. Implement an SVM classifier in a Jupyter Notebook using Python and the sklearn package, with any dataset of your choosing.
2. Implement cross-validation and train your classifier on the training data, then test the results on the test data.
3. Tweak the kernel functions and regularization parameters so that the classifier performs well on both the training and test data.
4. Finally, tweak the kernel functions and regularization parameters so that the classifier performs best on the test data (that is, find the optimal test accuracy without caring too much about the training accuracy). Report on your findings, printing them out to the console in Jupyter Notebook.


In [13]:
# Standard library and settings
import os
import sys
import warnings; warnings.simplefilter('ignore')

# Data extensions and settings
import numpy as np
np.set_printoptions(threshold = np.inf, suppress = True)
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.options.display.float_format = '{:,.6f}'.format


# Modeling extensions
import sklearn.metrics as metrics
import sklearn.model_selection as model_selection
import sklearn.pipeline as pipeline
import sklearn.preprocessing as preprocessing
import sklearn.svm as svm


# Load data

In [14]:
# load and inspect data
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                        'machine-learning-databases/wine/wine.data',
                        header = None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                 'Alcalinity of ash','Magnesium', 'Total phenols',
                 'Flavanoids', 'Nonflavanoid phenols','Proanthocyanins',
                 'Color intensity', 'Hue','OD280/OD315 of diluted wines',
                 'Proline']
df_wine[:5]


,Class label,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [15]:
# return counts for each class label
np.unique(df_wine['Class label'].values, return_counts = True)


(array([1, 2, 3]), array([59, 71, 48]))

In [16]:
# drop class 1 so that we have a binary classification problem
df_wine = df_wine[df_wine['Class label'] != 1]


In [17]:
# split labels and features
y = df_wine['Class label'].values
X = df_wine.iloc[:,1:].values

# encode labels
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)

# split into train/test set
XTrain, XTest, yTrain, yTest =\
                model_selection.train_test_split(X
                                                ,y
                                                ,test_size = 0.5
                                                ,random_state = 1
                                                ,stratify = y)                                                


# Train default model

In [18]:
# cross validation on training data
pipe = pipeline.make_pipeline(preprocessing.StandardScaler()
                             ,svm.SVC(random_state = 10))
scores = model_selection.cross_val_score(pipe
                                        ,XTrain
                                        ,yTrain
                                        ,scoring = 'accuracy'
                                        ,cv = 10
                                       )
print('CV accuracy on training data: {:.3f} +/- {:.3f}'.format(np.mean(scores), np.std(scores)))


CV accuracy on training data: 0.980 +/- 0.060


In [19]:
# fit default model on the full training set, create predictions and review accuracy
model = pipe.fit(XTrain, yTrain)
yPredsTest = model.predict(XTest)
print('Accuracy: {:.3f}'.format(metrics.accuracy_score(y_true = yTest, y_pred = yPredsTest)))


Accuracy: 0.983


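> Remarks - The default SVC uses a regularization parameter C of 1.0 and an RBF kernel. With these defaults the model reaches a CV accuracy of 0.980 on the training data and an accuracy of 0.983 on the test data.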
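
The defaults mentioned above can be read from the estimator itself. This is a minimal sketch that assumes only that scikit-learn is installed; the default reported for gamma depends on the installed version, so it is printed rather than assumed.

```python
import sklearn.svm as svm

# Read the constructor defaults directly from an unfitted estimator
defaults = svm.SVC().get_params()
for name in ('C', 'kernel', 'gamma'):
    print('{}: {}'.format(name, defaults[name]))
```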

# Optimize model parameters

In [21]:
# use GridSearchCV to perform CV over several different combinations of parameters
pipe = pipeline.make_pipeline(preprocessing.StandardScaler()
                             ,svm.SVC(random_state = 1))
param_range = [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'svc__C': param_range, 'svc__kernel': ['linear']}
                ,{'svc__C': param_range,'svc__gamma': param_range,'svc__kernel': ['rbf']}
             ]
gs = model_selection.GridSearchCV(estimator = pipe
                                    ,param_grid = param_grid
                                    ,scoring = 'accuracy'
                                    ,cv = 10
                                    ,n_jobs = -1
                                 )
gs = gs.fit(XTrain, yTrain)
print(gs.best_score_)
print(gs.best_params_)


0.9830508474576272
{'svc__C': 0.01, 'svc__kernel': 'linear'}


In [22]:
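# GridSearchCV already refit the best pipeline on the full training set (refit = True by default),
# so reuse it rather than rerunning the whole search, then create predictions and review accuracy
model = gs.best_estimator_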
yPredsTest = model.predict(XTest)
print('Accuracy: {:.3f}'.format(metrics.accuracy_score(y_true = yTest, y_pred = yPredsTest)))


Accuracy: 0.983


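> Remarks - The model selected by GridSearchCV has a marginally higher CV accuracy on the training data than the default SVC (0.983 vs. 0.980, a gap well inside the +/- 0.060 spread across folds) and the same test accuracy (0.983). It selects a C of 0.01 and a linear kernel rather than the default C of 1.0 and RBF kernel. Between the two models, this would be the better choice, as a linear kernel is the simpler model of the two.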
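
With so few training samples, accuracy can only take a handful of distinct values, so several parameter combinations may share the best CV score, and `best_params_` reports only one of them. The sketch below lists every combination ranked first. It is an illustration under assumptions: it uses scikit-learn's bundled `load_wine()` copy of the data instead of the UCI CSV (classes are labelled 0, 1, 2 there), and its numbers may not match the output shown in this notebook.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: load_wine() stands in for the UCI CSV; its classes are 0, 1, 2
X, y = load_wine(return_X_y=True)
keep = y != 0                    # drop the first class to get a binary problem
X, y = X[keep], y[keep] - 1      # relabel the remaining classes as 0 and 1

XTrain, XTest, yTrain, yTest = train_test_split(
    X, y, test_size=0.5, random_state=1, stratify=y)

pipe = make_pipeline(StandardScaler(), SVC())
param_range = [10.0 ** k for k in range(-6, 4)]   # 1e-06 up to 1000.0
param_grid = [
    {'svc__C': param_range, 'svc__kernel': ['linear']},
    {'svc__C': param_range, 'svc__gamma': param_range, 'svc__kernel': ['rbf']},
]
gs = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=10).fit(XTrain, yTrain)

# All combinations whose mean CV accuracy equals the best one share rank 1
ranks = gs.cv_results_['rank_test_score']
tied = np.flatnonzero(ranks == 1)
print('{} of {} combinations tie for the best CV accuracy'.format(len(tied), len(ranks)))
for i in tied:
    print(gs.cv_results_['params'][i],
          round(float(gs.cv_results_['mean_test_score'][i]), 3))
```

If several combinations tie, preferring the linear kernel among them is a judgment about simplicity rather than something the CV score itself decides.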

# Optimize model parameters on test set only

In [24]:
# use GridSearchCV to perform CV over several different combinations of parameters, this time fitting on the test set
pipe = pipeline.make_pipeline(preprocessing.StandardScaler()
                             ,svm.SVC(random_state = 1))
param_range = [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'svc__C': param_range, 'svc__kernel': ['linear']}
                ,{'svc__C': param_range,'svc__gamma': param_range,'svc__kernel': ['rbf']}
             ]
gs = model_selection.GridSearchCV(estimator = pipe
                                    ,param_grid = param_grid
                                    ,scoring = 'accuracy'
                                    ,cv = 10
                                    ,n_jobs = -1
                                 )
gs = gs.fit(XTest, yTest)
print(gs.best_score_)
print(gs.best_params_)


1.0
{'svc__C': 10.0, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}


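> Remarks - When fit on the test set, GridSearchCV chose a very different model from the one it chose on the training set. In this case, the best parameters were C = 10.0, gamma = 0.01 and an RBF kernel, with a CV accuracy of 1.0. Because these parameters were selected on the same data used to score them, that figure is an optimistic estimate rather than an accuracy on unseen data.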
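
Another possible reading of step 4 is to keep fitting on the training data and simply pick the parameters with the highest accuracy on the test data. The sketch below does that with a plain loop over `ParameterGrid`. It is a hedged illustration, not the method used in this notebook: it assumes scikit-learn's bundled `load_wine()` data in place of the UCI CSV, and the resulting accuracy is tuned to the test set, so it is not an estimate of performance on unseen data.

```python
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: load_wine() stands in for the UCI CSV; its classes are 0, 1, 2
X, y = load_wine(return_X_y=True)
keep = y != 0                    # drop the first class to get a binary problem
X, y = X[keep], y[keep] - 1      # relabel the remaining classes as 0 and 1

XTrain, XTest, yTrain, yTest = train_test_split(
    X, y, test_size=0.5, random_state=1, stratify=y)

pipe = make_pipeline(StandardScaler(), SVC())
param_range = [10.0 ** k for k in range(-6, 4)]   # 1e-06 up to 1000.0
param_grid = [
    {'svc__C': param_range, 'svc__kernel': ['linear']},
    {'svc__C': param_range, 'svc__gamma': param_range, 'svc__kernel': ['rbf']},
]

best_acc, best_params = -1.0, None
for params in ParameterGrid(param_grid):
    # Fit a fresh copy of the pipeline on the training data only
    model = clone(pipe).set_params(**params).fit(XTrain, yTrain)
    # Score on the test data; keep the first combination with the highest accuracy
    acc = model.score(XTest, yTest)
    if acc > best_acc:
        best_acc, best_params = acc, params

print('Best test accuracy: {:.3f}'.format(best_acc))
print(best_params)
```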