# Support Vector Machine - Image Classification


In this notebook we will train SVM classifiers to classify images. 

For a comparative understanding, we will compare the performance of the SVM with the Logistic Regression Softmax classifier.

We will use a dimensionality reduction technique (Principal Component Analysis) to project the features into a lower-dimensional space and reduce the training time.


Generally, **image classes are not linearly separable**. Based on this we formulate the following hypotheses:
- The kernelized SVM models will perform significantly better than the linear SVM model.
- The RBF Kernel based SVM will perform better than Softmax regression classifier.
- Dimensionality reduction (by retaining maximum variance) should improve the performance.

We will investigate these hypotheses by conducting the following experiments.


## Experiments

- Experiment 1: Support Vector Machine (LinearSVC) + PCA
- Experiment 2: Support Vector Machine (SVC with RBF Kernel) + PCA
- Experiment 3: Support Vector Machine (SVC with RBF Kernel) 
- Experiment 4: Logistic Regression (Softmax Regression) + PCA

## Dataset: MNIST


We will use the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents.


Each image is 28x28 pixels, and each feature represents one pixel’s intensity, from 0 (white) to 255 (black).

Thus, each image has 784 features. 
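To make the 784-feature layout concrete, here is a minimal sketch of how a flat feature vector maps back to a 28x28 pixel grid. The array `flat_image` is synthetic (an assumption for illustration), not an actual MNIST digit, and the row-major layout is the NumPy default assumed here.

```python
import numpy as np

# Synthetic stand-in for one MNIST row: 784 pixel intensities in [0, 255].
rng = np.random.default_rng(0)
flat_image = rng.integers(0, 256, size=784, dtype=np.uint8)

# A row-major reshape recovers the 28x28 pixel grid.
image = flat_image.reshape(28, 28)

# The pixel at (row r, column c) sits at flat index r * 28 + c.
r, c = 5, 10
same_pixel = bool(image[r, c] == flat_image[r * 28 + c])
print(image.shape, same_pixel)
```

The same reshape can be applied to any row of `X` before passing it to `plt.imshow` for a visual check.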

In [23]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt


from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## Load Data and Create Data Matrix (X) and the Label Vector (y)

In [2]:
# fetch_mldata was removed in scikit-learn 0.22; fetch_openml is its replacement.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

# OpenML returns the labels as strings, so cast them to numbers.
X, y = mnist["data"], mnist["target"].astype(np.float64)

print(X.shape)
print(y.shape)

(70000, 784)
(70000,)




## Split Data Into Training and Test Sets

The MNIST dataset is already split into a training set (the first 60,000 images) and a test set (the last 10,000 images).

We will shuffle the training set to ensure that all cross-validation folds will be similar. 

In [3]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
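The shuffle above is unseeded, so each run produces a different ordering. As a minimal sketch, assuming reproducibility is wanted, a seeded generator gives the same permutation every time. The arrays `X_demo` and `y_demo` are tiny hypothetical stand-ins, built so that each label can be checked against its row after shuffling.

```python
import numpy as np

# Tiny stand-in data: row i starts with 2*i, and its label is i.
X_demo = np.arange(12).reshape(6, 2)
y_demo = np.arange(6)

# A seeded generator makes the permutation repeatable across runs.
rng = np.random.default_rng(42)
perm = rng.permutation(len(X_demo))

# Index features and labels with the SAME permutation to keep them aligned.
X_shuf, y_shuf = X_demo[perm], y_demo[perm]

rows_still_aligned = np.array_equal(X_shuf[:, 0] // 2, y_shuf)
print(rows_still_aligned)
```

Applying one shared index array to both `X` and `y` is what keeps every image paired with its label.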

# Optimization Using Dimensionality Reduction

We can reduce the training time of the classifiers by reducing the number of features. Our assumption is that the essential content of the data does not span all dimensions. The technique for reducing the dimension of data is known as dimensionality reduction.

For a gentle introduction to various dimensionality reduction techniques, see the notebook "Dimensionality Reduction" in the Github repository.

We will use the Principal Component Analysis (PCA) dimensionality reduction technique to project the MNIST dataset (784 features) to a lower dimensional space by retaining maximum variance. 

The goal is to see the improvement in training time due to this dimensionality reduction.

Before we apply the PCA, we need to standardize the data.

## Standardize the Data

PCA is sensitive to the scale of the features, because features with larger variance dominate the principal components. Thus we need to scale the features before applying PCA. 

For understanding the negative effect of not scaling the data, see the following post:

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py

Note that we fit the scaler on the training set only and then use it to transform both the training set and the test set. 

In [4]:
scaler = StandardScaler()

# Fit on training set only.
scaler.fit(X_train)

# Apply transform to both the training set and the test set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
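The following minimal sketch illustrates the fit-on-train, transform-both pattern on a tiny hypothetical dataset (`X_tr` and `X_te` are made up for illustration). It also includes a constant column, since image data can contain pixels that never vary; `StandardScaler` leaves such a column at zero instead of dividing by zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the third column is constant.
X_tr = np.array([[0.0, 10.0, 5.0],
                 [2.0, 20.0, 5.0],
                 [4.0, 30.0, 5.0]])
X_te = np.array([[6.0, 40.0, 5.0]])

# The mean and standard deviation are learned from the training rows only.
demo_scaler = StandardScaler().fit(X_tr)

# Both sets are transformed with the training statistics.
X_tr_std = demo_scaler.transform(X_tr)
X_te_std = demo_scaler.transform(X_te)

print(X_tr_std.mean(axis=0))  # each training column is centered at 0
print(X_te_std)               # test values are expressed in training units
```

Because the test row is scaled with the training mean and standard deviation, no information from the test set leaks into the preprocessing step.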



## Apply PCA

When applying PCA we can set the number of principal components with the "n_components" parameter. More importantly, we can also use this parameter to specify the fraction of variance we want to retain in the extracted features.

For example, if we set it to 0.95, sklearn will choose the **minimum number of principal components** such that 95% of the variance is retained.

In [5]:
%%time
pca = PCA(n_components=0.95)

pca.fit(X_train)

CPU times: user 22.2 s, sys: 1.69 s, total: 23.9 s
Wall time: 7.77 s


## Number of Principal Components

We can find how many components PCA chose after fitting the model by using the following attribute: n_components_

We will see that 95% of the variance amounts to **331 principal components**.

In [21]:
print("Number of Principal Components: ", pca.n_components_)  

Number of Principal Components:  331
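To see how a fractional `n_components` translates into a component count, the sketch below reproduces the selection by hand from the cumulative explained variance. It uses synthetic data (`X_syn`, an assumption for illustration) so that it runs quickly; the same check could be applied to the fitted `pca` object above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 500 samples, 20 features with decreasing spread.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(500, 20)) * np.linspace(5.0, 0.1, 20)

# Let PCA pick the smallest number of components that retains 95% variance.
pca_demo = PCA(n_components=0.95).fit(X_syn)
k = pca_demo.n_components_
retained = pca_demo.explained_variance_ratio_.sum()

# Reproduce the choice manually from a full PCA fit.
full = PCA().fit(X_syn)
cumulative = np.cumsum(full.explained_variance_ratio_)
k_manual = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

print(k, k_manual, round(float(retained), 4))
```

The count is the first position at which the cumulative variance ratio passes the requested threshold, which is why the retained variance is at least, and usually slightly above, 95%.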


## Apply the Mapping (Transform) to both the Training Set and the Test Set

In [6]:
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

## Experiments

We will conduct the following experiments.

- Experiment 1: Support Vector Machine (LinearSVC) + PCA
- Experiment 2: Support Vector Machine (SVC with RBF Kernel) + PCA
- Experiment 3: Support Vector Machine (SVC with RBF Kernel) 
- Experiment 4: Logistic Regression (Softmax Regression) + PCA

## Support Vector Machine: Model Selection via Hyperparameter Tuning

Note that we do not perform a grid search in this notebook. 

Instead, we reuse the best values of the two SVC hyperparameters ($\gamma$ and $C$) found in a prior grid search. For a new dataset, a grid search should be run to fine-tune these hyperparameters.
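For reference, a grid search over $C$ and $\gamma$ could look like the sketch below. It is only an illustration under stated assumptions: it uses a 500-image slice of scikit-learn's small 8x8 `load_digits` dataset instead of MNIST so that it finishes in seconds, and the candidate values in `param_grid` are arbitrary choices, not the grid used to obtain the values in this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small stand-in dataset (8x8 digits), truncated to keep the search fast.
X_small, y_small = load_digits(return_X_y=True)
X_small, y_small = X_small[:500], y_small[:500]

# Scaling lives inside the pipeline so that it is refit on every CV fold.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Illustrative candidate values for the two hyperparameters.
param_grid = {"svc__C": [1, 10], "svc__gamma": [0.001, 0.01]}

search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=1)
search.fit(X_small, y_small)

print(search.best_params_, round(search.best_score_, 3))
```

On the full MNIST training set an RBF SVC takes minutes per fit, so searching on a subsample first and refitting the best setting on all the data is a common compromise.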

## Experiment 1: LinearSVC + PCA

In [24]:
%%time
linear_svc_pca = LinearSVC(loss='hinge', C=1, random_state=42)
linear_svc_pca.fit(X_train_pca, y_train)

CPU times: user 3min 32s, sys: 580 ms, total: 3min 33s
Wall time: 3min 33s




## Experiment 1: Evaluate LinearSVC + PCA on Test Data

In [27]:
%%time

y_test_predicted = linear_svc_pca.predict(X_test_pca)

accuracy_score_test = np.mean(y_test_predicted == y_test)
print("\nTest Accuracy: ", accuracy_score_test)

print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))

print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))


Test Accuracy:  0.918

Test Confusion Matrix:
[[ 957    0    1    2    0    4   10    3    2    1]
 [   0 1112    3    1    0    2    4    1   12    0]
 [  10    5  912   19    9    7   11   11   44    4]
 [   4    0   17  928    1   19    3   10   18   10]
 [   1    2    6    1  905    1   10    5    9   42]
 [   7    3    0   39   12  768   21    9   27    6]
 [  10    3    6    3    7   10  914    1    4    0]
 [   1   11   23   10   11    2    1  944    2   23]
 [  10    8    8   24   12   27    8   13  856    8]
 [  11    8    3   15   40   12    1   24   11  884]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.95      0.98      0.96       980
         1.0       0.97      0.98      0.97      1135
         2.0       0.93      0.88      0.91      1032
         3.0       0.89      0.92      0.90      1010
         4.0       0.91      0.92      0.91       982
         5.0       0.90      0.86      0.88       892
         6.0       
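The accuracy and the per-class scores can be cross-checked directly against the confusion matrix printed above (rows are true classes, columns are predicted classes). A minimal sketch, with the matrix copied from the Experiment 1 output:

```python
import numpy as np

# Confusion matrix from Experiment 1 (rows: true class, columns: predicted).
cm = np.array([
    [957, 0, 1, 2, 0, 4, 10, 3, 2, 1],
    [0, 1112, 3, 1, 0, 2, 4, 1, 12, 0],
    [10, 5, 912, 19, 9, 7, 11, 11, 44, 4],
    [4, 0, 17, 928, 1, 19, 3, 10, 18, 10],
    [1, 2, 6, 1, 905, 1, 10, 5, 9, 42],
    [7, 3, 0, 39, 12, 768, 21, 9, 27, 6],
    [10, 3, 6, 3, 7, 10, 914, 1, 4, 0],
    [1, 11, 23, 10, 11, 2, 1, 944, 2, 23],
    [10, 8, 8, 24, 12, 27, 8, 13, 856, 8],
    [11, 8, 3, 15, 40, 12, 1, 24, 11, 884],
])

# Accuracy: correctly classified images (the diagonal) over all images.
accuracy = np.trace(cm) / cm.sum()

# Recall divides the diagonal by row sums, precision by column sums.
recall = np.diag(cm) / cm.sum(axis=1)
precision = np.diag(cm) / cm.sum(axis=0)

print(accuracy)  # 0.918
print(np.round(recall, 2))
print(np.round(precision, 2))
```

The lowest recall belongs to digit 5 (768 of 892 images), which the linear model most often confuses with 3 and 8.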

## Experiment 2: SVC (RBF Kernel) + PCA

In [20]:
%%time
svm_clf_pca = SVC(C=1, gamma=0.001)
svm_clf_pca.fit(X_train_pca, y_train)

CPU times: user 3min 34s, sys: 396 ms, total: 3min 34s
Wall time: 3min 34s
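The `gamma` argument controls the width of the RBF kernel, $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$. The sketch below evaluates this formula by hand for two made-up vectors and compares it with scikit-learn's `rbf_kernel`; the helper `rbf` and the vectors are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def rbf(x, z, gamma):
    """RBF kernel value: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))


x = np.array([1.0, 2.0, 3.0])
z = np.array([2.0, 0.0, 3.0])
gamma = 0.001  # same value as passed to SVC above

k_manual = rbf(x, z, gamma)
k_sklearn = rbf_kernel(x.reshape(1, -1), z.reshape(1, -1), gamma=gamma)[0, 0]

print(k_manual, k_sklearn)
```

A small `gamma` makes the kernel decay slowly with distance, so each training point influences a wide region and the decision boundary is smoother.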


## Experiment 2: Evaluate SVC (RBF Kernel) + PCA on Test Data

In [21]:
%%time

y_test_predicted = svm_clf_pca.predict(X_test_pca)

accuracy_score_test = np.mean(y_test_predicted == y_test)
print("\nTest Accuracy: ", accuracy_score_test)

print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))

print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))


Test Accuracy:  0.9659

Test Confusion Matrix:
[[ 968    0    2    1    0    3    3    1    2    0]
 [   0 1126    3    0    0    1    3    0    2    0]
 [   6    2  993    3    2    0    1   14   10    1]
 [   0    0    2  984    1    7    0   10    6    0]
 [   1    0    8    0  945    2    4    7    2   13]
 [   2    0    1   12    3  854    7    5    7    1]
 [   6    2    1    0    4    9  930    2    4    0]
 [   0    7   17    3    1    1    0  986    0   13]
 [   3    0    4    9    6   12    3    9  926    2]
 [   4    6    4   12   17    2    0   14    3  947]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.98      0.99      0.98       980
         1.0       0.99      0.99      0.99      1135
         2.0       0.96      0.96      0.96      1032
         3.0       0.96      0.97      0.97      1010
         4.0       0.97      0.96      0.96       982
         5.0       0.96      0.96      0.96       892
         6.0      

## Experiment 3: SVC (RBF Kernel) 

We experiment with the SVC (RBF Kernel) without applying dimensionality reduction to the data.

In [17]:
%%time
svm_clf = SVC(gamma=0.001)
svm_clf.fit(X_train, y_train)

CPU times: user 8min 17s, sys: 614 ms, total: 8min 17s
Wall time: 8min 17s


## Experiment 3: Evaluate SVC (RBF Kernel) on Test Data

In [18]:
%%time

y_test_predicted = svm_clf.predict(X_test)

accuracy_score_test = np.mean(y_test_predicted == y_test)
print("\nAccuracy: ", accuracy_score_test)


print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))


print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))


Accuracy:  0.9657

Test Confusion Matrix:
[[ 968    0    2    1    0    3    3    1    2    0]
 [   0 1125    3    0    0    1    3    1    2    0]
 [   6    1  992    3    2    0    1   15   11    1]
 [   0    0    2  982    1    8    0   11    6    0]
 [   1    0    8    0  945    2    4    8    3   11]
 [   2    0    1   13    2  853    7    5    8    1]
 [   6    2    1    0    4    9  931    2    3    0]
 [   1    5   15    3    3    0    0  988    0   13]
 [   3    1    5    8    6   11    3   10  924    3]
 [   4    6    4   12   15    1    0   15    3  949]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.98      0.99      0.98       980
         1.0       0.99      0.99      0.99      1135
         2.0       0.96      0.96      0.96      1032
         3.0       0.96      0.97      0.97      1010
         4.0       0.97      0.96      0.96       982
         5.0       0.96      0.96      0.96       892
         6.0       0.98

## Experiment 4: Logistic Regression (Softmax Regression) + PCA

We use the best performing solver (i.e., lbfgs) from the previous notebook to train the logistic regression model on the PCA-transformed data.

In [25]:
%%time
softmax_reg_pca = LogisticRegression(solver='lbfgs', multi_class='multinomial')

softmax_reg_pca.fit(X_train_pca, y_train)

CPU times: user 37.3 s, sys: 3.36 s, total: 40.7 s
Wall time: 10.5 s
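To illustrate what "softmax regression" means here, the sketch below checks that the class probabilities of a multiclass `LogisticRegression` are the softmax of its linear decision scores. It uses the small iris dataset as a stand-in (an assumption for speed) and relies on the default multinomial behaviour of the lbfgs solver in recent scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)

# max_iter is raised from its default of 100 so that lbfgs can converge.
clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_iris, y_iris)

# Linear scores, one column per class.
scores = clf.decision_function(X_iris)

# Softmax: subtract the row maximum for numerical stability, then normalize.
scores = scores - scores.max(axis=1, keepdims=True)
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

matches = np.allclose(probs, clf.predict_proba(X_iris))
print(matches)
```

The predicted class is simply the column with the highest score, so the softmax changes the confidence values but not the decision.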




## Experiment 4: Evaluate Softmax Regression + PCA on Test Data

In [26]:
print("No. of Iterations:", softmax_reg_pca.n_iter_ )


y_test_predicted = softmax_reg_pca.predict(X_test_pca)

accuracy_score_test = np.mean(y_test_predicted == y_test)
print("\nTest Accuracy: ", accuracy_score_test)


print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))


print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))

No. of Iterations: [100]

Test Accuracy:  0.9265

Test Confusion Matrix:
[[ 957    0    1    2    1    6    8    3    2    0]
 [   0 1114    3    2    0    1    3    2   10    0]
 [   7    5  931   17   12    3    9   11   34    3]
 [   3    3   18  919    1   22    3   11   23    7]
 [   1    2    8    2  917    0   10    4    9   29]
 [   7    5    3   33    8  778   13    6   35    4]
 [  12    3    8    2    6   12  912    1    2    0]
 [   0    9   29    5    6    1    0  948    0   30]
 [   6    6    6   22    9   24    7   11  874    9]
 [   9    7    2   10   25    7    0   26    8  915]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.96      0.98      0.97       980
         1.0       0.97      0.98      0.97      1135
         2.0       0.92      0.90      0.91      1032
         3.0       0.91      0.91      0.91      1010
         4.0       0.93      0.93      0.93       982
         5.0       0.91      0.87      0.89    

# Summary of Results from 4 Experiments

In [28]:
data = [["LinearSVC + PCA", 0.918, "3min 33s"], 
        ["SVM(RBF) + PCA", 0.9659, "3min 34s"],
        ["SVM(RBF)", 0.9657, "8min 17s"],
        ["Softmax + PCA", 0.9265, "10.5 s"]]

pd.DataFrame(data, columns=["Classifier", "Accuracy", "Running-Time"])


|   | Classifier | Accuracy | Running-Time |
|---|---|---|---|
| 0 | LinearSVC + PCA | 0.918 | 3min 33s |
| 1 | SVM(RBF) + PCA | 0.9659 | 3min 34s |
| 2 | SVM(RBF) | 0.9657 | 8min 17s |
| 3 | Softmax + PCA | 0.9265 | 10.5 s |
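Because the running times are stored as strings, they cannot be compared numerically as they stand. The sketch below converts them to seconds and computes how much faster the RBF SVM trains with PCA; the helper `to_seconds` is hypothetical and only handles the two formats that appear in the table.

```python
import re


def to_seconds(text):
    """Convert strings such as '3min 33s' or '10.5 s' to seconds."""
    minutes = re.search(r"(\d+)\s*min", text)
    seconds = re.search(r"([\d.]+)\s*s\b", text)
    total = 0.0
    if minutes:
        total += 60 * int(minutes.group(1))
    if seconds:
        total += float(seconds.group(1))
    return total


# Wall times copied from the summary table.
times = {
    "LinearSVC + PCA": to_seconds("3min 33s"),
    "SVM(RBF) + PCA": to_seconds("3min 34s"),
    "SVM(RBF)": to_seconds("8min 17s"),
    "Softmax + PCA": to_seconds("10.5 s"),
}

# Training-time ratio of the RBF SVM without PCA to the one with PCA.
speedup = times["SVM(RBF)"] / times["SVM(RBF) + PCA"]
print(times)
print(round(speedup, 2))  # 2.32
```

Storing the times as numbers in the DataFrame from the start would make such comparisons, and sorting by running time, straightforward.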


## Comparative Understanding

We have done 4 experiments using SVM and Logistic Regression classifiers.

The first 3 experiments use 2 SVM algorithms (LinearSVC and SVC with an RBF kernel) and show the effect of PCA.

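The experimental results support our hypotheses, with one qualification:
- The kernelized SVM models performed substantially better than the linear SVM model (about 0.966 vs. 0.918 test accuracy).
- The RBF Kernel based SVM performed better than the Softmax regression classifier (about 0.966 vs. 0.9265).
- Dimensionality reduction (by retaining maximum variance) improved accuracy only marginally (0.9659 vs. 0.9657); its main benefit was a much shorter training time.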

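We make the following observations.
- The RBF kernel based SVM classifiers perform **substantially** better than the Softmax classifier, whereas the linear SVM (0.918) performs slightly worse than Softmax (0.9265).
- The SVM training time is much **longer**: between 3.5 and 8 minutes, compared with about 10 seconds for Softmax regression.
- The RBF kernel based SVM classifier performs better than the linear SVM classifier. This indicates that for this non-linear image classification problem the kernelized SVM is the most suitable of the algorithms compared.
- Dimensionality reduction leaves the accuracy of the RBF kernel based SVM essentially unchanged (0.9659 vs. 0.9657) but cuts its training time from 8min 17s to 3min 34s.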
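To put the accuracy gap between the two RBF SVM runs in perspective, the sketch below converts it into a number of test images and compares it with a rough binomial standard error of the accuracy estimate. This is a back-of-the-envelope check under the simplifying assumption that test predictions are independent; it is not a formal significance test.

```python
import math

n_test = 10_000                      # size of the MNIST test set
acc_pca, acc_raw = 0.9659, 0.9657    # RBF SVM with and without PCA

# The accuracy gap expressed as a number of test images.
diff_images = round((acc_pca - acc_raw) * n_test)

# Rough binomial standard error of an accuracy estimated on n_test images.
std_error = math.sqrt(acc_pca * (1 - acc_pca) / n_test)

print(diff_images)           # 2
print(round(std_error, 4))   # 0.0018
```

A gap of 0.0002 is about a tenth of one standard error, so the two accuracies are practically indistinguishable, while the difference in training time is large.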

### Thus, for image classification problems such as MNIST, an RBF kernel based SVM model should be used with dimensionality reduction.