<a href="https://colab.research.google.com/github/remijul/tutorial/blob/master/CrossValidation_%26_GridSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cross Validation & Grid Search for Machine Learning
---

Learning the parameters of a prediction function and testing it on the same data is a **methodological mistake**: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data.  
This situation is called **overfitting**.  
To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set `X_test, y_test`.  
Note that the word “experiment” is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally. 
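
To make the risk concrete, here is a minimal sketch that scores one model on the data it was trained on and on a held-out test set. The choice of an unconstrained `DecisionTreeClassifier` is an assumption made purely for illustration (it is not used elsewhere in this tutorial); the split mirrors the one used in the sections below.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Assumption for illustration: an unconstrained tree can memorize its training samples
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # scored on data the model has already seen
test_acc = tree.score(X_test, y_test)     # scored on held-out data

print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

Only the score on the held-out data says something about how the model may behave on new samples.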

Please visit this great resource: [Cross-validation module from scikit-learn.org](https://scikit-learn.org/stable/modules/cross_validation.html).

## Objectives
- Understand the usefulness of **cross validation** and **grid search**.
- Be able to apply them to easily set up an advanced Machine Learning model.
- Practice on a dedicated exercise.


See the great resource [machinelearnia](https://www.youtube.com/watch?v=VoyMOVfCSfc) for more explanations.


---


## Cross Validation
The simplest way to use cross-validation is to call the `cross_val_score` helper function on the estimator and the dataset.

Check the documentation for further explanation: [Cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html).


### Prepare libraries and data

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)


clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

0.9666666666666667

### Use `cross_val_score`
The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):

In [None]:
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
scores

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

The mean score and the standard deviation are hence given by:

In [None]:
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.98 accuracy with a standard deviation of 0.02
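
As a rough sketch of what `cross_val_score` does behind the scenes, the loop below splits the data, fits a fresh copy of the model and scores it, once per fold. It assumes the documented default for a classifier with an integer `cv`, namely stratified K-fold without shuffling; the names `manual_scores` and `model` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=42)

# Assumption: with a classifier and cv=5, scikit-learn uses StratifiedKFold(n_splits=5)
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    model = clone(clf)                      # fresh, unfitted copy for each fold
    model.fit(X[train_idx], y[train_idx])   # fit on 4 folds
    manual_scores.append(model.score(X[test_idx], y[test_idx]))  # score on the 5th

manual_scores = np.array(manual_scores)
print(manual_scores)

# Compare with the helper function
print(np.allclose(manual_scores, cross_val_score(clf, X, y, cv=5)))
```

Every sample is used for testing exactly once, which is why the mean of the fold scores is a more stable estimate than a single train/test split.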


### Use `cross_validate`
The `cross_validate` function differs from `cross_val_score` in two ways:

- It allows specifying multiple metrics for evaluation.

- It returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score.

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import recall_score
scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, X, y, scoring=scoring)
scores

{'fit_time': array([0.00242615, 0.00127101, 0.00112796, 0.00110555, 0.00103307]),
 'score_time': array([0.00284362, 0.00178933, 0.00161648, 0.00158119, 0.0016005 ]),
 'test_precision_macro': array([0.96969697, 1.        , 0.96969697, 0.96969697, 1.        ]),
 'test_recall_macro': array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])}
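
The optional outputs mentioned above can be requested with the `return_train_score` and `return_estimator` arguments. The following is a small sketch under the same setup as the previous cell; `first_model` is a name chosen here for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_validate

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=0)

scores = cross_validate(
    clf, X, y,
    scoring=['precision_macro', 'recall_macro'],
    cv=5,
    return_train_score=True,   # adds train_<metric> entries
    return_estimator=True,     # adds the estimator fitted on each split
)

print(sorted(scores.keys()))
print(scores['train_recall_macro'])

# Each fitted estimator can be reused, for example to predict
first_model = scores['estimator'][0]
print(first_model.predict(X[:3]))
```

Comparing the `train_*` and `test_*` entries is a quick way to spot overfitting on individual folds.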


---


## Grid Search
Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes.

It is possible and recommended to search the hyper-parameter space for the best cross validation score.  
Check the documentation for further explanation: [Grid Search](https://scikit-learn.org/stable/modules/grid_search.html).
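
As a small illustration of these two ideas, the sketch below first inspects hyper-parameters set through the constructor, then enumerates the combinations that a grid over `kernel` and `C` would contain. It uses `ParameterGrid` from `sklearn.model_selection`; the variable names `param_grid` and `candidates` are chosen here for illustration.

```python
from sklearn import svm
from sklearn.model_selection import ParameterGrid

# Hyper-parameters are set through the constructor, not learnt from the data
svc = svm.SVC(C=1, kernel='linear')
print(svc.get_params()['C'], svc.get_params()['kernel'])

# A grid is the Cartesian product of the listed values
param_grid = {'kernel': ['linear', 'rbf'], 'C': [1, 10]}
candidates = list(ParameterGrid(param_grid))
for params in candidates:
    print(params)
print(len(candidates), "candidates")
```

A grid search fits and cross-validates the estimator once per candidate and per fold, so the cost grows quickly with the number of values listed.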


### Prepare libraries and data

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

### Declare model and parameter for Grid Search

In [None]:
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)

GridSearchCV(estimator=SVC(),
             param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')})

### Fit the model

In [None]:
clf.fit(X_train, y_train)

### Access the cross-validation results
List of available results:

In [None]:
sorted(clf.cv_results_.keys())

['mean_fit_time',
 'mean_score_time',
 'mean_test_score',
 'param_C',
 'param_kernel',
 'params',
 'rank_test_score',
 'split0_test_score',
 'split1_test_score',
 'split2_test_score',
 'split3_test_score',
 'split4_test_score',
 'std_fit_time',
 'std_score_time',
 'std_test_score']

Access the best score:

In [None]:
clf.best_score_

0.9888888888888889

Access the parameters that give the best score:

In [None]:
clf.best_params_

{'C': 1, 'kernel': 'linear'}
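
To see how every candidate performed, and not only the best one, the entries of `cv_results_` can be read side by side. This is a minimal sketch that repeats the setup above so that it runs on its own; the printed layout is an arbitrary choice.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = GridSearchCV(svm.SVC(), parameters)
clf.fit(X_train, y_train)

results = clf.cv_results_
rows = zip(results['rank_test_score'], results['mean_test_score'],
           results['std_test_score'], results['params'])

# One line per candidate, best rank first
for rank, mean, std, params in sorted(rows, key=lambda row: row[0]):
    print(f"rank {rank}: {mean:.3f} +/- {std:.3f}  {params}")
```

The candidate at index `clf.best_index_` is the one reported by `best_params_` and `best_score_`.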


---


## Combine GridSearchCV & Pipeline

Check the documentation for further explanation: [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline).


### Prepare libraries and data

In [None]:
# data library
import numpy as np
from sklearn import datasets

# Preprocessing
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, RobustScaler, MinMaxScaler

# Pipeline and model
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn import svm
from sklearn.model_selection import GridSearchCV

In [None]:
# data import
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

### Declare model and parameter for Grid Search
Check the documentation for further explanation: [SVC()](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

In [None]:
parameters = {'model__kernel':('linear', 'rbf'), 'model__C':[1, 10]}
svc = svm.SVC()

### Declare the pipeline

In [None]:
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', svc)]
    )

### Declare the Grid Search method

In [None]:
grid = GridSearchCV(pipe, parameters, cv = 5, n_jobs =-1, verbose = 1)

### Fit the model

In [None]:
grid.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('model', SVC())]),
             n_jobs=-1,
             param_grid={'model__C': [1, 10],
                         'model__kernel': ('linear', 'rbf')},
             verbose=1)

### Access the cross-validation results
List of available parameters:

In [None]:
grid.get_params().keys()

dict_keys(['cv', 'error_score', 'estimator__memory', 'estimator__steps', 'estimator__verbose', 'estimator__scaler', 'estimator__model', 'estimator__scaler__copy', 'estimator__scaler__with_mean', 'estimator__scaler__with_std', 'estimator__model__C', 'estimator__model__break_ties', 'estimator__model__cache_size', 'estimator__model__class_weight', 'estimator__model__coef0', 'estimator__model__decision_function_shape', 'estimator__model__degree', 'estimator__model__gamma', 'estimator__model__kernel', 'estimator__model__max_iter', 'estimator__model__probability', 'estimator__model__random_state', 'estimator__model__shrinking', 'estimator__model__tol', 'estimator__model__verbose', 'estimator', 'n_jobs', 'param_grid', 'pre_dispatch', 'refit', 'return_train_score', 'scoring', 'verbose'])
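
The keys listed above follow scikit-learn's `<step name>__<parameter>` naming convention for nested estimators. They carry an extra `estimator__` prefix in this listing because the pipeline is itself a parameter of the grid search. The sketch below shows the same convention directly on a pipeline, using `set_params`; the values chosen are arbitrary.

```python
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', svm.SVC())]
    )

# "<step name>__<parameter>" reaches a parameter of a step inside the pipeline
pipe.set_params(model__C=10, scaler__with_mean=False)

print(pipe.get_params()['model__C'])
print(pipe.named_steps['model'].C)
print(pipe.named_steps['scaler'].with_mean)
```

This is why the grid declared earlier uses `model__kernel` and `model__C` rather than `kernel` and `C`.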

List of available results:

In [None]:
sorted(grid.cv_results_.keys())

['mean_fit_time',
 'mean_score_time',
 'mean_test_score',
 'param_model__C',
 'param_model__kernel',
 'params',
 'rank_test_score',
 'split0_test_score',
 'split1_test_score',
 'split2_test_score',
 'split3_test_score',
 'split4_test_score',
 'std_fit_time',
 'std_score_time',
 'std_test_score']

Access the best score:

In [None]:
grid.best_score_

0.9777777777777779

Access the parameters that give the best score:

In [None]:
grid.best_params_

{'model__C': 1, 'model__kernel': 'linear'}

### Make predictions

In [None]:
y_pred = grid.predict(X_test)

### Evaluate model performance

In [None]:
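from sklearn.metrics import r2_score
# Note: R2 is a regression metric, computed here on the integer class labels (0, 1, 2).
# For a classifier, accuracy_score is usually the more natural choice.
print("R2:", r2_score(y_test, y_pred))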

R2: 0.8906605922551253
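
Since iris is a classification problem, R2 (a regression metric) is not the most informative score. As a hedged alternative, the sketch below evaluates the same kind of fitted grid search with classification metrics from `sklearn.metrics`. It repeats the setup so that it runs on its own, and leaves `n_jobs` at its default, which is an arbitrary simplification.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', svm.SVC())]
    )
parameters = {'model__kernel': ('linear', 'rbf'), 'model__C': [1, 10]}
grid = GridSearchCV(pipe, parameters, cv=5).fit(X_train, y_train)

y_pred = grid.predict(X_test)

acc = accuracy_score(y_test, y_pred)   # fraction of correctly classified samples
cm = confusion_matrix(y_test, y_pred)  # rows: true classes, columns: predicted classes

print("Accuracy:", acc)
print(cm)
print(classification_report(y_test, y_pred))
```

With the default `scoring=None`, `grid.score(X_test, y_test)` returns this same accuracy for a classifier.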


## Wrap-up
Let's bring all the pieces of code together into one complete script.

In [None]:
# data library
import numpy as np
from sklearn import datasets

# Preprocessing
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, RobustScaler, MinMaxScaler

# Pipeline and model
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Metrics
from sklearn.metrics import r2_score

# data import
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Declare model and parameter for Grid Search
parameters = {'model__kernel':('linear', 'rbf'), 'model__C':[1, 10]}
svc = svm.SVC()

# Declare the pipeline
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', svc)]
    )

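# Declare the Grid Search method
# Note: 'r2' is a regression metric applied here to the integer class labels;
# scoring='accuracy' is the usual choice for a classifier.
grid = GridSearchCV(pipe, parameters, scoring='r2', cv = 5, n_jobs =-1, verbose = 1)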

# Fit the model
grid.fit(X_train, y_train)

# Evaluate cross validation performance 
print("CV score - R2:", grid.best_score_)

# Make predictions
y_pred = grid.predict(X_test)

# Evaluate model performance
print("Test score - R2:", r2_score(y_test, y_pred))

Fitting 5 folds for each of 4 candidates, totalling 20 fits
CV score - R2: 0.9690987124463518
Test score - R2: 0.8906605922551253
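
The wrap-up imports `RobustScaler` and `MinMaxScaler` without using them. One possible extension, sketched below as an assumption rather than as part of the original workflow, is to treat the scaler itself as a hyper-parameter: scikit-learn allows a whole pipeline step to be replaced from the parameter grid by using the step name as the key. The sketch also scores with `'accuracy'`, which suits a classifier.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', svm.SVC())]
    )

# The key 'scaler' swaps the whole step; 'model__*' keys tune the SVC
parameters = {
    'scaler': [StandardScaler(), RobustScaler(), MinMaxScaler()],
    'model__kernel': ['linear', 'rbf'],
    'model__C': [1, 10],
}

grid = GridSearchCV(pipe, parameters, scoring='accuracy', cv=5)
grid.fit(X_train, y_train)

print("Candidates:", len(grid.cv_results_['params']))
print("Best params:", grid.best_params_)
print("CV accuracy:", grid.best_score_)
print("Test accuracy:", grid.score(X_test, y_test))
```

Because the scaler is fitted inside each cross-validation fold, no information from the validation fold leaks into the preprocessing.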


## Conclusions
The design and implementation of a machine learning pipeline is at the core of enterprise AI software applications and fundamentally determines the performance and effectiveness. In addition to the software design, additional factors must be considered, including choice of machine learning libraries and runtime environments (processor requirements, memory, and storage).

Many real-world machine learning use cases involve complex, multi-step pipelines. Each step may require different libraries and runtimes and may need to execute on specialized hardware profiles. It is therefore critical to factor in management of libraries, runtimes, and hardware profiles during algorithm development and ongoing maintenance activities. Design choices can have a significant impact on both costs and algorithm performance ([https://c3.ai/](https://c3.ai/glossary/machine-learning/machine-learning-pipeline/#:~:text=A%20machine%20learning%20pipeline%20is,model%20parameters%2C%20and%20prediction%20outputs)).