Here, we tune the hyperparameters of an RBF-kernel support vector machine (SVM) on the handwritten digits data set, using scikit-learn's GridSearchCV to choose gamma and C by cross-validation.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import numpy as np

The handwritten digits dataset is already loaded into the variables X and y. The show_digit function takes in an integer index and plots the corresponding image, with some extra information displayed above the image.

The radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms.
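For two feature vectors x and x', the RBF kernel is K(x, x') = exp(-gamma * ||x - x'||^2), which is the form scikit-learn uses. The sketch below computes it by hand with NumPy and compares the result with `sklearn.metrics.pairwise.rbf_kernel`. The helper name `rbf_kernel_manual` and the toy vectors are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_kernel_manual(A, B, gamma):
    """Hypothetical helper: K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    # Pairwise differences via broadcasting, shape (len(A), len(B), n_features)
    diff = A[:, np.newaxis, :] - B[np.newaxis, :, :]
    sq_dists = np.sum(diff ** 2, axis=2)
    return np.exp(-gamma * sq_dists)

# Toy data, made up for illustration
A = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
gamma = 0.5

K_manual = rbf_kernel_manual(A, A, gamma)
K_sklearn = rbf_kernel(A, A, gamma=gamma)

print(np.allclose(K_manual, K_sklearn))
print(K_manual.round(4))
```

A larger gamma makes the kernel value fall off faster with distance, so each training point influences a smaller neighbourhood.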

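Increasing the RBF kernel hyperparameter gamma makes the decision boundary more flexible, so training accuracy generally increases, but a gamma that is too large overfits the training data.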
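One way to see this effect is to fit the same SVM with several gamma values and score each model on the data it was trained on. The following is a minimal sketch, assuming the full ten-class digits data and a fixed `random_state=0` split (both are choices made here for repeatability); the exact numbers depend on the split.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
# random_state=0 is an assumption made here so the sketch is repeatable
X_tr, X_te, y_tr, y_te = train_test_split(digits.data, digits.target, random_state=0)

gammas = [0.00001, 0.0001, 0.001, 0.01, 0.1]
train_acc, test_acc = [], []
for g in gammas:
    model = SVC(gamma=g).fit(X_tr, y_tr)
    train_acc.append(model.score(X_tr, y_tr))   # accuracy on the data used for fitting
    test_acc.append(model.score(X_te, y_te))    # accuracy on held-out data
    print(f"gamma={g:g}  train={train_acc[-1]:.3f}  test={test_acc[-1]:.3f}")
```

If training accuracy keeps rising while held-out accuracy falls at the largest gamma values, the model is overfitting, which is why gamma should be chosen by cross-validation accuracy rather than by training accuracy.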

Here we search for the gamma that maximizes cross-validation accuracy using scikit-learn's GridSearchCV. The handwritten digits dataset, with all ten digit classes as targets, is split into a training set, stored in the variables X and y, and a held-out test set, stored in X_test and y_test.

In [None]:
from sklearn import datasets

# Load the digits data and hold out a test set (25% of the samples by default)
digits = datasets.load_digits()
X, X_test, y, y_test = train_test_split(digits.data, digits.target)

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

In [None]:
# Instantiate an RBF SVM
svm = SVC()

# Instantiate the GridSearchCV object and run the search
parameters = {'gamma':[0.00001, 0.0001, 0.001, 0.01, 0.1]}
searcher = GridSearchCV(svm, parameters)
searcher.fit(X, y)

# Report the best parameters
print("Best CV params", searcher.best_params_)

Best CV params {'gamma': 0.001}
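
Beyond `best_params_`, the fitted searcher keeps the mean cross-validation score of every candidate in its `cv_results_` attribute. Here is a minimal sketch of inspecting it, assuming a fixed `random_state=0` split (a choice made here, so the numbers will not necessarily match the output above):

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
# random_state=0 is an assumption for repeatability; the notebook uses a random split
X, X_test, y, y_test = train_test_split(digits.data, digits.target, random_state=0)

parameters = {'gamma': [0.00001, 0.0001, 0.001, 0.01, 0.1]}
searcher = GridSearchCV(SVC(), parameters)
searcher.fit(X, y)

# One mean CV accuracy per candidate, in the same order as the grid
for params, mean_score in zip(searcher.cv_results_['params'],
                              searcher.cv_results_['mean_test_score']):
    print(params, round(mean_score, 4))
```

Looking at the whole list shows how sharply accuracy changes around the best value, which a single `best_params_` entry cannot show.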


In [None]:
X_train = X
y_train = y

The search above selected gamma = 0.001 with C left at its default value of 1.

Now we search for the best combination of C and gamma using GridSearchCV.
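
A grid search tries every combination of the listed values, so a grid with 3 values of C and 5 values of gamma has 15 candidates. With GridSearchCV's default 5-fold cross-validation (scikit-learn 0.22 and later), that means 75 fits plus one final refit. Here is a small sketch that enumerates the candidates with `sklearn.model_selection.ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

parameters = {'C': [0.1, 1, 10], 'gamma': [0.00001, 0.0001, 0.001, 0.01, 0.1]}
grid = list(ParameterGrid(parameters))

print(len(grid))   # number of (C, gamma) candidates
print(grid[0])     # each candidate is a plain dict of parameter values
```

Because the number of candidates is the product of the list lengths, each extra hyperparameter or value multiplies the search cost.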

The same digits data is used, but this time the training set is referred to as X_train and y_train, and the held-out X_test and y_test are used to evaluate the result.

Even though cross-validation already splits the training set into parts, it still makes sense to hold out a separate test set to check that the cross-validation results are sensible.

In [None]:
# Instantiate an RBF SVM
svm = SVC()

# Instantiate the GridSearchCV object and run the search
parameters = {'C':[0.1, 1, 10], 'gamma':[0.00001, 0.0001, 0.001, 0.01, 0.1]}
searcher = GridSearchCV(svm, parameters)
searcher.fit(X_train, y_train)

# Report the best parameters and the corresponding score
print("Best CV params", searcher.best_params_)
print("Best CV accuracy", searcher.best_score_)

# Report the test accuracy using these best parameters
print("Test accuracy of best grid search hypers:", searcher.score(X_test, y_test))

Best CV params {'C': 10, 'gamma': 0.001}
Best CV accuracy 0.9918298223874432
Test accuracy of best grid search hypers: 0.9888888888888889
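
By default (`refit=True`), GridSearchCV retrains the best candidate on the whole training set and exposes it as `best_estimator_`; `searcher.score` and `searcher.predict` delegate to that model. The following is a minimal sketch, assuming a fixed `random_state=0` split and a reduced grid to keep it quick (both are choices made here, not the notebook's settings):

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
# random_state=0 and the smaller grid are assumptions made for this sketch
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

parameters = {'C': [1, 10], 'gamma': [0.0001, 0.001]}
searcher = GridSearchCV(SVC(), parameters)
searcher.fit(X_train, y_train)

# The refitted model can be used directly, e.g. for prediction
best_model = searcher.best_estimator_
print(best_model)

# Predictions and scores from the searcher match those of best_estimator_
same_preds = np.array_equal(searcher.predict(X_test), best_model.predict(X_test))
print(same_preds)
print(searcher.score(X_test, y_test), best_model.score(X_test, y_test))
```

The test set plays no part in the search itself, so the test accuracy is an independent check on the cross-validation estimate.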
