# Basic imports & Exploration

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We load our data, which comes from the MNIST database (Modified National Institute of Standards and Technology database).

Our dataset consists of handwritten one-digit numbers, and our model will attempt to classify them. 

It can be downloaded from Kaggle at https://www.kaggle.com/c/3004/download/train.csv


In [5]:
data = pd.read_csv("/Users/yotroz/Ironhackers Dropbox/Octavio Ramirez/Work/MDBI_IE/Term_2/AI_ML_STATISTICAL_LEARNING_PREDICTION/SVM-project/train.csv")

We explore our data with some basic pandas functions.

In [6]:
data.shape

(42000, 785)

In [8]:
data.head()

,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
data.tail()

,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
41995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41996,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41997,7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41998,6,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41999,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
data.keys()

Index(['label', 'pixel0', 'pixel1', 'pixel2', 'pixel3', 'pixel4', 'pixel5',
       'pixel6', 'pixel7', 'pixel8',
       ...
       'pixel774', 'pixel775', 'pixel776', 'pixel777', 'pixel778', 'pixel779',
       'pixel780', 'pixel781', 'pixel782', 'pixel783'],
      dtype='object', length=785)

Our dataset contains 42,000 samples. Each row holds the label we need plus 784 features (785 columns in total), and each feature is a pixel intensity between 0 and 255.
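To make the pixel columns more concrete, here is a minimal sketch of how one row could be turned back into an image. It assumes the 784 pixel columns describe a 28 x 28 image stored row by row (so `pixel<k>` sits at row `k // 28`, column `k % 28`); the notebook itself does not state this layout. The helper `pixel_position` and the synthetic `row` are illustrative names, not part of the original analysis.

```python
import numpy as np

IMAGE_SIDE = 28  # assumption: 28 * 28 = 784 pixel columns


def pixel_position(k, side=IMAGE_SIDE):
    """Return (row, column) for column 'pixel<k>', assuming row-major order."""
    return divmod(k, side)


# Synthetic stand-in for one row of pixel values (not real MNIST data).
row = np.arange(IMAGE_SIDE * IMAGE_SIDE) % 256

# Reshape the flat vector of 784 intensities into a 2-D image.
image = row.reshape(IMAGE_SIDE, IMAGE_SIDE)

print(image.shape)
print(pixel_position(0), pixel_position(28), pixel_position(783))
```

With the real data, the same reshape could be applied to `data[features].iloc[i].to_numpy()` before plotting it.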

In [11]:
from sklearn.model_selection import train_test_split

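# Every column after the first ('label') holds a pixel value
features = data.columns[1:]
X = data[features]
Y = data['label']

# Scale pixel values to [0, 1] and hold out 10% of the rows for testing
X_train, X_test, Y_train, y_test = train_test_split(X/255, Y, test_size=0.1, random_state=0)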

We hold out 10% of the dataset for testing and use the remaining 90% for training.

Our data preparation consists of dividing each pixel value by 255, so that pixel intensity is measured on a scale from 0 to 1 instead of 0 to 255.
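The scaling and the 90/10 split can be checked on a small synthetic array. This is only a sketch: `X_demo` and `y_demo` are made-up stand-ins with 100 rows and 4 columns, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 4 "pixel" columns with values in 0..255.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(100, 4))
y_demo = rng.integers(0, 10, size=100)

# Dividing by 255 maps every intensity into the range [0, 1].
X_scaled = X_demo / 255

# test_size=0.1 keeps 10% of the rows aside for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_scaled, y_demo, test_size=0.1, random_state=0
)

print(X_tr.shape, X_te.shape)
print(X_scaled.min() >= 0.0, X_scaled.max() <= 1.0)
```

Fixing `random_state` makes the split reproducible, which matters when comparing two models on the same test rows.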

Next, we import the linear Support Vector Machine classifier (`LinearSVC`) from the scikit-learn library, along with the accuracy score from its metrics module.

In [12]:
from sklearn.svm import LinearSVC
# dual=False solves the primal problem, which scikit-learn recommends when n_samples > n_features
clf_svm = LinearSVC(dual=False, tol=1e-5)
clf_svm.fit(X_train, Y_train)

LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=1e-05,
     verbose=0)

In [13]:
from sklearn.metrics import accuracy_score
y_pred_svm = clf_svm.predict(X_test)
score = accuracy_score(y_test, y_pred_svm)
print('Model score: ', score)

Model score:  0.9102380952380953


With a linear SVM and no tuning, the classifier already reaches about 91% accuracy on the test set.
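Accuracy here is simply the fraction of test samples whose predicted label matches the true label. A minimal sketch of that computation, using a hypothetical helper `accuracy` and made-up labels:

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Made-up labels: four of the five predictions are correct.
y_true_demo = [3, 1, 4, 1, 5]
y_pred_demo = [3, 1, 4, 7, 5]

print(accuracy(y_true_demo, y_pred_demo))
```

On the 4,200 held-out samples, a score of 0.9102 corresponds to 3,823 correct predictions.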

The next step involves finding the optimal combination of hyperparameters. In our specific case, what is the right tolerance?

In [19]:
from sklearn.model_selection import GridSearchCV
tolerances = [1e-3, 1e-4, 1e-5]
param_grid = {'tol': tolerances}

grid_search = GridSearchCV(LinearSVC(dual=False), param_grid, cv=3)
grid_search.fit(X_train, Y_train)
grid_search.best_params_

{'tol': 0.001}

The code above uses a grid search to find the tolerance that gives the best score. Each candidate is evaluated with three-fold cross-validation, so it is always scored on data it was not fitted on, which reduces the risk of overfitting to a single split.
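Roughly speaking, the grid search fits the model once per fold for every candidate value and keeps the candidate with the best mean score. The sketch below spells that loop out with `cross_val_score`. To stay self-contained it uses scikit-learn's small built-in `load_digits` dataset (8 x 8 images, intensities 0 to 16) as a stand-in, so its scores say nothing about the notebook's data.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Small built-in stand-in dataset; pixel values range from 0 to 16.
digits = load_digits()
X_small = digits.data[:600] / 16.0
y_small = digits.target[:600]

mean_scores = {}
for tol in [1e-3, 1e-4, 1e-5]:
    # cv=3 fits the model three times, each time scoring on the held-out fold.
    fold_scores = cross_val_score(
        LinearSVC(dual=False, tol=tol), X_small, y_small, cv=3
    )
    mean_scores[tol] = fold_scores.mean()
    print(tol, fold_scores.round(4), round(mean_scores[tol], 4))

# Keep the tolerance with the highest mean cross-validated accuracy.
best_tol = max(mean_scores, key=mean_scores.get)
print('best tol:', best_tol)
```

`GridSearchCV` additionally refits the best candidate on the full training set, which this loop leaves out.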

We keep the default penalty (`l2`) and the default `C=1.0`. The penalty specifies the norm used to regularize the weight vector, while `C` controls how heavily an observation is penalized when it falls on the wrong side of the classification hyperplane or inside the margin.
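As a sketch of what these settings mean, the function below evaluates the textbook L2-regularized squared-hinge objective that corresponds to `penalty='l2'` and `loss='squared_hinge'` in the estimator output above. It is a simplified binary version with labels in {-1, +1} that leaves the intercept out of the penalty; scikit-learn's solver differs in such details and handles the ten digit classes one-vs-rest. The function name and the toy data are illustrative assumptions.

```python
import numpy as np


def l2_squared_hinge_objective(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum(max(0, 1 - y_i * (w . x_i + b))^2), y_i in {-1, +1}."""
    margins = y * (X @ w + b)
    # Samples with margin >= 1 contribute nothing; the rest are penalized quadratically.
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * float(w @ w) + C * float(np.sum(hinge ** 2))


w_demo = np.array([1.0, 0.0])
X_toy = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
y_toy = np.array([1, 1, -1])

# Only the second sample (margin 0.5) violates the margin.
print(l2_squared_hinge_objective(w_demo, 0.0, X_toy, y_toy, C=1.0))
print(l2_squared_hinge_objective(w_demo, 0.0, X_toy, y_toy, C=2.0))
```

Raising `C` makes margin violations more expensive relative to the size of the weights, so the fitted model regularizes less.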



In [26]:
clf_svm = LinearSVC(dual=False, tol=1e-3)
clf_svm.fit(X_train, Y_train)

LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.001,
     verbose=0)

Now we re-instantiate the SVM classifier with the adjusted tolerance and evaluate it once again.

In [27]:
from sklearn.metrics import accuracy_score
y_pred_svm = clf_svm.predict(X_test)
score = accuracy_score(y_test, y_pred_svm)
print('Adjusted Model score: ', score)

Adjusted Model score:  0.910952380952381


Our model improved only marginally, from 0.9102 to 0.9110 (three more correct predictions out of the 4,200 test samples). We could carry on tweaking the parameters until an acceptable accuracy is reached.
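One possible next step, sketched here rather than run on the notebook's data, is to widen the grid so that it also searches over the regularization strength `C`. The candidate values below are arbitrary illustrations, and the built-in `load_digits` dataset again serves only as a small stand-in.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Small built-in stand-in dataset; pixel values range from 0 to 16.
digits = load_digits()
X_digits = digits.data[:600] / 16.0
y_digits = digits.target[:600]

# Hypothetical candidate values: 4 values of C times 2 tolerances = 8 combinations.
wider_grid = {'C': [0.01, 0.1, 1.0, 10.0], 'tol': [1e-3, 1e-4]}

wider_search = GridSearchCV(LinearSVC(dual=False), wider_grid, cv=3)
wider_search.fit(X_digits, y_digits)

print(wider_search.best_params_)
print(round(wider_search.best_score_, 4))
```

On the full training set the same pattern would be `GridSearchCV(LinearSVC(dual=False), wider_grid, cv=3).fit(X_train, Y_train)`, at the cost of one fit per combination and fold.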
