<a href="https://colab.research.google.com/github/saadhassan99/SVM-on-MNIST/blob/main/SVM_classifier_on_the_MNIST.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

In [None]:
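# Download MNIST (70,000 images, 784 pixel features each) from OpenML.
# as_frame=False returns NumPy arrays rather than a pandas DataFrame
# (the default for this dataset in newer scikit-learn releases).
mnist = fetch_openml(name='mnist_784', version=1, cache=True, as_frame=False)

X = mnist["data"]
# Labels are returned as strings; convert them to small integers.
y = mnist["target"].astype(np.uint8)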

In [None]:
X_train = X[:60000]
y_train = y[:60000]
X_test = X[60000:]
y_test = y[60000:]

Many training algorithms are sensitive to the order of the training instances, so it's generally good practice to shuffle them first. However, the dataset is already shuffled, so we do not need to do it.
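For a dataset that is not already shuffled, a minimal sketch of the idea follows. The arrays `X_demo` and `y_demo` are hypothetical toy data, not MNIST; the point is that one permutation is applied to both features and labels so they stay aligned.

```python
import numpy as np

# Toy stand-ins for a feature matrix and its labels (hypothetical data).
X_demo = np.arange(20).reshape(10, 2)   # row i is [2*i, 2*i + 1]
y_demo = np.arange(10)                  # label i belongs to row i

# Draw a single random permutation and apply it to both arrays,
# so every row keeps its own label after shuffling.
rng = np.random.default_rng(42)
shuffle_idx = rng.permutation(len(X_demo))
X_shuffled = X_demo[shuffle_idx]
y_shuffled = y_demo[shuffle_idx]
```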

Let's start simple, with a linear SVM classifier. It will automatically use the One-vs-All (also called One-vs-the-Rest, OvR) strategy, so there's nothing special we need to do. Easy!
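To see what One-vs-the-Rest means in practice, here is a small sketch on scikit-learn's bundled 8x8 digits dataset (`load_digits`), used as an assumption instead of MNIST so that it runs in seconds. With 10 classes, `LinearSVC` learns 10 binary classifiers, one weight vector per class, and predicts the class with the highest decision score.

```python
from sklearn.datasets import load_digits
from sklearn.svm import LinearSVC

# Small stand-in dataset (assumption): 1,797 images, 64 features, 10 classes.
digits = load_digits()
X_small = digits.data / 16.0   # pixel values range from 0 to 16
y_small = digits.target

ovr_clf = LinearSVC(random_state=2021, max_iter=5000)
ovr_clf.fit(X_small, y_small)

# One row of coefficients per class: one binary "class vs. rest" model each.
n_classes = len(ovr_clf.classes_)
coef_shape = ovr_clf.coef_.shape

# One decision score per class for a single instance.
scores = ovr_clf.decision_function(X_small[:1])
predicted = ovr_clf.classes_[scores.argmax(axis=1)]
```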

Do not forget to scale the features: SVMs are very sensitive to feature scales. If some features span much larger ranges than others, they dominate the fit and the resulting decision boundary is poor.
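As a quick illustration of what `StandardScaler` does, the sketch below standardizes a tiny hypothetical matrix whose two columns live on very different scales. After scaling, each column has zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second feature is about 1,000 times larger than the first.
X_raw = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)

# Each column is now centered at 0 with a standard deviation of 1.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
```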

In [None]:
lin_svc_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_SVM", LinearSVC(random_state=2021)),
])

In [None]:
lin_svc_pipeline.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('linear_SVM',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='squared_hinge', max_iter=1000,
                           multi_class='ovr', penalty='l2', random_state=2021,
                           tol=0.0001, verbose=0))],
         verbose=False)

In [None]:
# Accuracy on the training set (accuracy_score expects y_true first, then y_pred)
y_pred = lin_svc_pipeline.predict(X_train)
accuracy_score(y_train, y_pred)

0.9205666666666666


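This is not great accuracy for MNIST, especially since it is measured on the training set. A linear model is too simple for this task, so if we want to stick with an SVM, we will have to use a kernel. Let's try an SVC with an RBF kernel.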
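For intuition, the RBF kernel scores the similarity of two instances as exp(-gamma * ||a - b||^2): 1 for identical points, decaying toward 0 as they move apart. The helper `rbf_kernel_value` below is a hypothetical name written for illustration only; `SVC(kernel='rbf')` computes the kernel internally.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma):
    """Return exp(-gamma * ||a - b||^2), the RBF similarity of two vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Similarity shrinks as the squared distance grows (0, 1, and 25 here).
same = rbf_kernel_value([0.0, 0.0], [0.0, 0.0], gamma=0.5)
near = rbf_kernel_value([0.0, 0.0], [1.0, 0.0], gamma=0.5)
far = rbf_kernel_value([0.0, 0.0], [3.0, 4.0], gamma=0.5)
```

A larger `gamma` makes the similarity fall off faster, so each training instance influences a smaller neighborhood.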

In [None]:
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel='rbf')),
])

In [None]:
rbf_kernel_svm_clf.fit(X_train[:10000], y_train[:10000])

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svm_clf',
                 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape='ovr', degree=3,
                     gamma='scale', kernel='rbf', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [None]:
# Accuracy on the full training set, although the model saw only the first 10,000 instances
y_pred = rbf_kernel_svm_clf.predict(X_train)
accuracy_score(y_train, y_pred)

0.94405

That's promising: we get better performance even though we trained the model on six times less data (10,000 instances instead of 60,000). Let's tune the hyperparameters with a randomized search with cross-validation. We will run the search on a small subset (the first 1,000 training instances) just to speed up the process:
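Before running the search, it helps to check what the two SciPy distributions actually sample. The sketch below draws values from them directly; the sample size and seed are arbitrary choices. Note that `uniform(loc, scale)` covers `[loc, loc + scale]`, so `uniform(1, 10)` yields values between 1 and 11, while `reciprocal(0.001, 0.1)` is log-uniform between its two bounds.

```python
from scipy.stats import reciprocal, uniform

gamma_dist = reciprocal(0.001, 0.1)   # log-uniform on [0.001, 0.1]
C_dist = uniform(1, 10)               # loc=1, scale=10, i.e. uniform on [1, 11]

# Draw samples the same way RandomizedSearchCV does, via rvs().
gamma_samples = gamma_dist.rvs(size=1000, random_state=0)
C_samples = C_dist.rvs(size=1000, random_state=0)

gamma_range = (gamma_samples.min(), gamma_samples.max())
C_range = (C_samples.min(), C_samples.max())
```

A log-uniform distribution suits `gamma` because its plausible values span several orders of magnitude.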

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal, uniform

# Parameters of pipeline steps are set using '<step name>__<parameter>' names.
# Note: uniform(1, 10) means loc=1, scale=10, so C is drawn from [1, 11].
param_distributions = {"svm_clf__gamma": reciprocal(0.001, 0.1), "svm_clf__C": uniform(1, 10)}
rnd_search_cv = RandomizedSearchCV(rbf_kernel_svm_clf, param_distributions, n_iter=10, verbose=2, cv=3)
rnd_search_cv.fit(X_train[:1000], y_train[:1000])

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] svm_clf__C=10.613968931268783, svm_clf__gamma=0.007053058879001415 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  svm_clf__C=10.613968931268783, svm_clf__gamma=0.007053058879001415, total=   1.2s
[CV] svm_clf__C=10.613968931268783, svm_clf__gamma=0.007053058879001415 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.2s remaining:    0.0s


[CV]  svm_clf__C=10.613968931268783, svm_clf__gamma=0.007053058879001415, total=   1.2s
[CV] svm_clf__C=10.613968931268783, svm_clf__gamma=0.007053058879001415 
[CV]  svm_clf__C=10.613968931268783, svm_clf__gamma=0.007053058879001415, total=   1.2s
[CV] svm_clf__C=10.911179249108393, svm_clf__gamma=0.005590781875127783 
[CV]  svm_clf__C=10.911179249108393, svm_clf__gamma=0.005590781875127783, total=   1.2s
[CV] svm_clf__C=10.911179249108393, svm_clf__gamma=0.005590781875127783 
[CV]  svm_clf__C=10.911179249108393, svm_clf__gamma=0.005590781875127783, total=   1.2s
[CV] svm_clf__C=10.911179249108393, svm_clf__gamma=0.005590781875127783 
[CV]  svm_clf__C=10.911179249108393, svm_clf__gamma=0.005590781875127783, total=   1.2s
[CV] svm_clf__C=7.74601983446654, svm_clf__gamma=0.0029762976640808304 
[CV]  svm_clf__C=7.74601983446654, svm_clf__gamma=0.0029762976640808304, total=   1.1s
[CV] svm_clf__C=7.74601983446654, svm_clf__gamma=0.0029762976640808304 
[CV]  svm_clf__C=7.74601983446654, sv

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:   34.4s finished


RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=Pipeline(memory=None,
                                      steps=[('scaler',
                                              StandardScaler(copy=True,
                                                             with_mean=True,
                                                             with_std=True)),
                                             ('svm_clf',
                                              SVC(C=1.0, break_ties=False,
                                                  cache_size=200,
                                                  class_weight=None, coef0=0.0,
                                                  decision_function_shape='ovr',
                                                  degree=3, gamma='scale',
                                                  kernel='rbf', max_iter=-1,
                                                  probability=False,
                                            

In [None]:
rnd_search_cv.best_estimator_

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svm_clf',
                 SVC(C=5.962064738868197, break_ties=False, cache_size=200,
                     class_weight=None, coef0=0.0,
                     decision_function_shape='ovr', degree=3,
                     gamma=0.0011699064117295366, kernel='rbf', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [None]:
rnd_search_cv.best_score_

0.8379757002511493

In [None]:
# Retrain the best estimator on the full training set

rnd_search_cv.best_estimator_.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svm_clf',
                 SVC(C=5.962064738868197, break_ties=False, cache_size=200,
                     class_weight=None, coef0=0.0,
                     decision_function_shape='ovr', degree=3,
                     gamma=0.0011699064117295366, kernel='rbf', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [None]:
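# Don't forget to save the trained model.
# sklearn.externals.joblib was removed from recent scikit-learn releases;
# import the standalone joblib package instead.
import joblib
joblib.dump(rnd_search_cv.best_estimator_, 'final_svm_model_with_rbf_kernel.pkl')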



['final_svm_model_with_rbf_kernel.pkl']
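A saved pipeline can be restored later with `joblib.load`. The sketch below shows the round trip on a small stand-in model trained on scikit-learn's bundled digits dataset, assuming the standalone `joblib` package is installed; the file name `demo_svm_model.pkl` and the temporary directory are illustrative choices.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Train a small stand-in pipeline (assumption: digits instead of MNIST).
digits = load_digits()
demo_model = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf")),
])
demo_model.fit(digits.data, digits.target)

# Save the fitted pipeline, then load it back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_svm_model.pkl")
    joblib.dump(demo_model, model_path)
    restored_model = joblib.load(model_path)

# The restored pipeline (scaler and SVM together) predicts like the original.
same_predictions = np.array_equal(
    demo_model.predict(digits.data[:100]),
    restored_model.predict(digits.data[:100]),
)
```

Saving the whole pipeline, rather than the SVM alone, keeps the fitted scaler with the model, so new data is scaled the same way at prediction time.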

In [None]:
y_pred = rnd_search_cv.best_estimator_.predict(X_train)
accuracy_score(y_train, y_pred)

0.9973333333333333

Ah, this looks good, although it is still accuracy on the training set. Let's select this model. Now we can test it on the test set:

In [None]:
y_pred = rnd_search_cv.best_estimator_.predict(X_test)
accuracy_score(y_test, y_pred)

0.9728


Not too bad, but the gap between training accuracy (99.7%) and test accuracy (97.3%) suggests the model is overfitting slightly. It's tempting to tweak the hyperparameters a bit more (e.g. decreasing C and/or gamma), but we would run the risk of overfitting the test set. Other people have found that the hyperparameters C=5 and gamma=0.005 yield even better performance (over 98% accuracy). By running the randomized search for longer and on a larger part of the training set, you may be able to find similar values as well.
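One way to compare further hyperparameter settings without touching the test set is cross-validation on the training data. The sketch below runs on scikit-learn's bundled digits dataset so that it finishes quickly; that dataset and the two candidate settings are assumptions for illustration, and the scores it produces say nothing about MNIST itself.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small stand-in for the training set (assumption: digits, not MNIST).
digits = load_digits()
X_small, y_small = digits.data, digits.target

# Illustrative candidate settings to compare.
candidates = {
    "C=1, gamma=0.001": SVC(kernel="rbf", C=1, gamma=0.001),
    "C=5, gamma=0.005": SVC(kernel="rbf", C=5, gamma=0.005),
}

# Mean 3-fold cross-validated accuracy for each candidate.
# The scaler sits inside the pipeline, so it is refit on each training fold.
cv_means = {}
for name, svc in candidates.items():
    pipeline = Pipeline([("scaler", StandardScaler()), ("svm_clf", svc)])
    scores = cross_val_score(pipeline, X_small, y_small, cv=3)
    cv_means[name] = scores.mean()
```

On MNIST the same loop would use `X_train` and `y_train`, with the test set evaluated only once at the very end.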