<a href="https://colab.research.google.com/github/todnewman/coe_training/blob/master/Cross_Validation_Simple.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Cross-Validation Example
Code Developed by: W. Tod Newman

### Overview
Here we show how to use grid search to identify the best hyperparameters for a Support Vector Machine.  We evaluate linear vs. non-linear (RBF) kernels, kernel coefficient (gamma) values, and regularization penalty (C) values.  The dataset is scikit-learn's handwritten digits dataset (`load_digits`: 8x8 grayscale images of the digits 0-9), not the full MNIST dataset.

### Note:
* This example is deliberately small because grid search takes a very long time to run.  In class, none of us have that kind of patience.
* However, grid search is still valuable and DOESN'T have to be run every time you train a model.  Often, running it once early in a prototyping effort is all you need.
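
To get a feel for why grid search is slow, it helps to count the fits it performs. The sketch below uses scikit-learn's `ParameterGrid` to enumerate the candidates in the same grid this notebook searches; `n_folds = 5` mirrors the `cv=5` setting used in the notebook, and the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the notebook's `tuned_parameters`
param_grid = [
    {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4, 1e-2, 2e-3],
     'C': [1, 10, 100, 1000]},
    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
]

# 4 gammas * 4 C values for rbf, plus 4 C values for linear
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fit once per cross-validation fold
n_folds = 5
n_fits = n_candidates * n_folds

print("candidates:", n_candidates)
print("cross-validation fits per scoring metric:", n_fits)
```

This grid has 20 candidates, so each scoring metric costs 100 cross-validation fits plus one final refit on the whole training set. Adding one more value to every list grows that count multiplicatively, which is why larger grids become slow quickly.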

### Learning Objectives
Show how grid search can be an effective way to evaluate hyperparameters of small models.

Opportunities to extend this learning include evaluating the contribution of the penalty factors and potential approaches to force overfitting.  Applying this tool to different datasets that are less predictable will also reveal larger differences between the hyperparameter sets.
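
One way to start on the extension ideas above is to compare training scores with cross-validated scores as C changes. The following is a minimal sketch, not part of the original notebook: the fixed `gamma=1e-3`, the three C values, and `cv=3` are arbitrary choices made to keep the run short.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Load and flatten the 8x8 digit images into 64-feature rows
digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# Small, illustrative grid over C only (values are assumptions)
search = GridSearchCV(
    SVC(kernel='rbf', gamma=1e-3),
    {'C': [0.01, 1, 100]},
    cv=3,
    return_train_score=True,  # also record scores on the training folds
)
search.fit(X, y)

results = search.cv_results_
for c, train, test in zip(results['param_C'],
                          results['mean_train_score'],
                          results['mean_test_score']):
    # A large train/validation gap is a sign of overfitting
    print("C=%-6s train=%.3f  validation=%.3f  gap=%.3f"
          % (c, train, test, train - test))
```

A training score that keeps rising while the validation score stalls or drops suggests that the penalty setting is letting the model overfit.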


In [8]:
from __future__ import print_function

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
import warnings
warnings.filterwarnings('ignore')


# Loading the Digits dataset
digits = datasets.load_digits()

# To apply a classifier to this data, we need to flatten each image,
# turning the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target


# Split the dataset into a 70% training set and a 30% test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
    
"""
Create a list of dictionaries to use for our Cross-Validation

PARAMETERS:
*  C is the regularization parameter for the SVC classifier.  The strength of the
      regularization is inversely proportional to C.
*  kernel specifies the kernel type.  There are a few options but we'll experiment
      with the radial basis function kernel (rbf) and the linear kernel.
*  gamma is the kernel coefficient for rbf.  It acts like the inverse of the radius
      of influence of a datapoint: small values reach far, large values stay close.
"""

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4, 1e-2, 2e-3],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("***********************************************************")
    print("******** Tuning hyper-parameters for %s ************" % score) 
    print("***********************************************************")
    print()
    
    #
    # The grid search provided by GridSearchCV exhaustively generates 
    # candidates from a grid of parameter values specified with the 
    # param_grid parameter (in this case, our list of dictionaries, tuned_parameters). 
    #
    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    
    print(X_train.shape, y_train.shape)
    clf.fit(X_train, y_train) # FIT our grid_search model

    print("***********************************************************")
    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()
    print("***********************************************************")
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()



***********************************************************
******** Tuning hyper-parameters for precision ************
***********************************************************

(1257, 64) (1257,)
***********************************************************
Best parameters set found on development set:

{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}

Grid scores on development set:

0.989 (+/-0.006) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.970 (+/-0.013) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.933 (+/-0.009) for {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}
0.987 (+/-0.010) for {'C': 1, 'gamma': 0.002, 'kernel': 'rbf'}
0.991 (+/-0.006) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.987 (+/-0.013) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.936 (+/-0.011) for {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
0.988 (+/-0.011) for {'C': 10, 'gamma': 0.002, 'kernel': 'rbf'}
0.991 (+/-0.006) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.988 (+/-0.013) for {'C': 100, 'gamma': 0.