# Short Course: Machine Learning for Exploration Geophysics

Hamburg, 10. - 12. March 2020

#### Computer Class 1.3: Cross validation and Hyperparameter search

Table of Contents:
- [Packages](#Packages)
- [Import and preprocess the dataset](#Import-and-preprocess-the-dataset)
- [k-Fold Cross-Validation](#k-Fold-Cross-Validation)
- [Evaluate a score by cross-validation](#Evaluate-a-score-by-cross-validation)
- [Tuning the hyper-parameters of an estimator](#Tuning-the-hyper-parameters-of-an-estimator)
- [Combine all together](#Combine-all-together)

## Packages

Let's first import all the packages: 
- [numpy](https://www.numpy.org/) is the fundamental package for scientific computing with Python.
- [pandas](https://pandas.pydata.org/) is a software library for data manipulation and analysis.
- [matplotlib](https://matplotlib.org/) is a comprehensive library for creating static, animated, and interactive visualizations in Python.
- [seaborn](https://seaborn.pydata.org/) is a statistical data visualization library based on matplotlib.
- [scikit-learn](https://scikit-learn.org/stable/) is a library of simple and efficient tools for predictive data analysis.

In [1]:
#!pip install pandas
#!pip install -U scikit-learn

#!python -m pip install -U pip
#!python -m pip install -U matplotlib

#!pip install seaborn

In [2]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import RidgeCV

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings('ignore')

print('Pandas:  ' + pd.__version__)
print('Numpy:   ' + np.__version__)
print('Sklearn: ' + sklearn.__version__)

Pandas:  0.24.2
Numpy:   1.17.4
Sklearn: 0.22.2.post1


## Import and preprocess the dataset

In [3]:
df = pd.read_csv('data/berlin_flat_price.csv')
df['Price'] = df['Price'] / 1000  # price in thousands of euros

X = df[['Size']].values
y = df['Price'].values    

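# Standardize the feature (zero mean, unit variance).
# Note: X_scaled is not used below; the split and all later cells work on the unscaled X.
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

#X_scaled.mean()
#X_scaled.std()

# Hold out the last 20% of the rows as the test set (shuffle=False keeps the original order).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, shuffle=False)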
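
One way to make sure the scaling is actually applied, and fitted on the training data only, is to chain the scaler and the regressor with `make_pipeline` (already imported above). The following is a minimal sketch on synthetic data; the `_syn`, `_tr` and `_te` names and the generated sizes and prices are illustrative assumptions, not the Berlin dataset.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a size/price dataset (assumption, for illustration only)
rng = np.random.default_rng(0)
X_syn = rng.uniform(20, 150, size=(100, 1))
y_syn = 3.0 * X_syn[:, 0] + 50.0 + rng.normal(0, 20, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

# The pipeline fits the scaler on the training data only and
# re-applies the same transformation when predicting on new data.
pipe = make_pipeline(StandardScaler(), ElasticNet())
pipe.fit(X_tr, y_tr)

scaler_fitted = pipe.named_steps['standardscaler']
print('Scaler mean (from training data): %.2f' % scaler_fitted.mean_[0])
print('Test R2: %.2f' % pipe.score(X_te, y_te))
```

Because the scaler lives inside the estimator, the same pipeline object can be passed to `cross_val_score` or `GridSearchCV`, and the scaling is then refitted within every fold.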

## k-Fold Cross-Validation

See [A Gentle Introduction to k-fold Cross-Validation](https://machinelearningmastery.com/k-fold-cross-validation/) for details. In short, the general procedure is as follows:

- Shuffle the dataset randomly.
- Split the dataset into k groups.
- For each unique group:
    - Take the group as a hold-out or test data set.
    - Take the remaining groups as a training data set.
    - Fit a model on the training set and evaluate it on the test set.
    - Retain the evaluation score and discard the model.
- Summarize the skill of the model using the sample of model evaluation scores.
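
The steps above can be sketched by hand with NumPy alone. The helper `manual_kfold_scores`, the straight-line model fitted with `np.polyfit`, and the synthetic data are assumptions chosen for illustration; scikit-learn's own implementation is introduced right after.

```python
import numpy as np

def manual_kfold_scores(x, y, k=5, seed=0):
    """Hypothetical helper: k-fold CV of a least-squares line, one R2 score per fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))        # shuffle the dataset randomly
    folds = np.array_split(indices, k)       # split into k groups
    scores = []
    for i in range(k):
        test_idx = folds[i]                  # this group is the hold-out set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit y = a*x + b on the training part only
        a, b = np.polyfit(x[train_idx], y[train_idx], deg=1)
        y_pred = a * x[test_idx] + b
        # Evaluate with R2 on the hold-out part, keep the score, discard the model
        ss_res = np.sum((y[test_idx] - y_pred) ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

# Synthetic data (assumption, for illustration only)
rng = np.random.default_rng(42)
x_demo = rng.uniform(20, 150, size=100)
y_demo = 3.0 * x_demo + 50.0 + rng.normal(0, 20, size=100)

demo_scores = manual_kfold_scores(x_demo, y_demo, k=5)
print('Per-fold R2:', np.round(demo_scores, 3))
print('Mean R2: %.3f' % demo_scores.mean())   # summary of the model's skill
```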

[KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) provides train/test indices for splitting data into training and test sets. It splits the dataset into k consecutive folds, without shuffling by default; the example below sets `shuffle=True` so that the folds are drawn at random.

In [4]:
kfold = KFold(n_splits=5, shuffle=True, random_state=11)

data = np.linspace(0,9,10)

for ind_train, ind_test in kfold.split(data):
    print('train: %s, test: %s' % (data[ind_train], data[ind_test]))

train: [0. 1. 2. 3. 4. 5. 6. 9.], test: [7. 8.]
train: [0. 1. 3. 4. 5. 7. 8. 9.], test: [2. 6.]
train: [0. 1. 2. 3. 6. 7. 8. 9.], test: [4. 5.]
train: [0. 2. 4. 5. 6. 7. 8. 9.], test: [1. 3.]
train: [1. 2. 3. 4. 5. 6. 7. 8.], test: [0. 9.]
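
For comparison, a small sketch of the default behaviour without shuffling, where the test folds are simply consecutive blocks of the data. The variable names `data_demo`, `kfold_plain` and `test_folds` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

data_demo = np.arange(10)

# shuffle defaults to False, so the folds follow the original order of the samples
kfold_plain = KFold(n_splits=5)

test_folds = [data_demo[ind_test].tolist() for _, ind_test in kfold_plain.split(data_demo)]
for fold in test_folds:
    print('test:', fold)
# test: [0, 1]
# test: [2, 3]
# test: [4, 5]
# test: [6, 7]
# test: [8, 9]
```

Consecutive folds can be misleading when the rows of a dataset are sorted (for example by size or by date), which is one reason to shuffle.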


## Evaluate a score by cross-validation

The function [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) evaluates an estimator by k-fold cross-validation and returns one score per fold.

In [5]:
reg = ElasticNet()

kfold = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(estimator = reg, X = X_train, y = y_train, cv = kfold, scoring='r2')
print("K-fold CV average score: %.2f" % scores.mean())
print(scores)

K-fold CV average score: 0.81
[0.85407685 0.84879132 0.78849311 0.86646394 0.66781065]
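
To see what `cross_val_score` does internally, the sketch below reproduces its scores with an explicit loop over the folds. It runs on synthetic data (an assumption, not the flat-price dataset) and fixes `random_state` so that both passes use the same splits.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_syn = rng.uniform(20, 150, size=(100, 1))
y_syn = 3.0 * X_syn[:, 0] + 50.0 + rng.normal(0, 20, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = ElasticNet()

# One call ...
scores_auto = cross_val_score(model, X_syn, y_syn, cv=cv, scoring='r2')

# ... is equivalent to fitting a fresh copy of the model on each training split
scores_manual = []
for train_idx, test_idx in cv.split(X_syn):
    fold_model = clone(model).fit(X_syn[train_idx], y_syn[train_idx])
    y_fold_pred = fold_model.predict(X_syn[test_idx])
    scores_manual.append(r2_score(y_syn[test_idx], y_fold_pred))
scores_manual = np.array(scores_manual)

print('cross_val_score:', np.round(scores_auto, 4))
print('manual loop:    ', np.round(scores_manual, 4))
print('mean = %.3f, std = %.3f' % (scores_auto.mean(), scores_auto.std()))
```

Reporting the standard deviation alongside the mean gives an impression of how much the score depends on the particular split.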


## Tuning the hyper-parameters of an estimator

The grid search provided by [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) exhaustively generates candidates from a grid of parameter values specified with the 'param_grid' parameter.

While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favourable properties. [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV) implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values.
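
A minimal sketch of a randomized search for `ElasticNet`, assuming SciPy is available for the sampling distributions (scikit-learn depends on it). The synthetic data, the ranges of the distributions, and the choice of `n_iter=20` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import loguniform, uniform
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_syn = rng.uniform(20, 150, size=(100, 1))
y_syn = 3.0 * X_syn[:, 0] + 50.0 + rng.normal(0, 20, size=100)

# Each candidate is drawn from these distributions instead of a fixed grid
param_distributions = {
    'alpha': loguniform(1e-3, 1e2),    # log-uniform between 0.001 and 100
    'l1_ratio': uniform(0.1, 0.9),     # uniform on [0.1, 1.0] (loc=0.1, scale=0.9)
}

search = RandomizedSearchCV(estimator=ElasticNet(max_iter=10000),
                            param_distributions=param_distributions,
                            n_iter=20,             # number of sampled candidates
                            scoring='r2',
                            cv=KFold(n_splits=5, shuffle=True, random_state=0),
                            random_state=0)
search.fit(X_syn, y_syn)

print(search.best_params_)
print('Best mean CV R2: %.3f' % search.best_score_)
```

The cost is controlled by `n_iter` rather than by the size of a grid, so adding another hyperparameter does not multiply the number of fits.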

In [6]:
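# l1_ratio must lie in [0, 1] (0 = pure L2 penalty, 1 = pure L1 penalty),
# so values above 1.0 are not valid candidates for ElasticNet.
param_grid = [{'alpha': [1, 10, 100], 'l1_ratio': [0.8, 0.9, 1.0]}]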

kfold = KFold(n_splits=5, shuffle=True)
grid = GridSearchCV(estimator = reg,
                    param_grid = param_grid,
                    scoring = 'r2',
                    cv = kfold,
                    n_jobs = -1)

grid = grid.fit(X_train, y_train)

In [7]:
print(grid.best_score_)

0.7969018159993049


In [8]:
print(grid.best_params_)

{'alpha': 100, 'l1_ratio': 0.8}
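
`best_score_` and `best_params_` describe only the winning candidate. The fitted search object also keeps the scores of every candidate in its `cv_results_` attribute, which can be loaded into a pandas DataFrame. A small sketch on synthetic data (an assumption, so the numbers differ from the ones above):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_syn = rng.uniform(20, 150, size=(100, 1))
y_syn = 3.0 * X_syn[:, 0] + 50.0 + rng.normal(0, 20, size=100)

grid_demo = GridSearchCV(estimator=ElasticNet(),
                         param_grid={'alpha': [1, 10, 100], 'l1_ratio': [0.8, 0.9, 1.0]},
                         scoring='r2',
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
grid_demo.fit(X_syn, y_syn)

# One row per parameter combination, with mean and spread of the fold scores
results = pd.DataFrame(grid_demo.cv_results_)
columns = ['param_alpha', 'param_l1_ratio', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))
```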


## Combine all together

In [9]:
def grid_search(reg, param_grid, X_train, y_train):
    
    kfold = KFold(n_splits=5, shuffle=True)
    
    grid = GridSearchCV(estimator = reg, 
                        param_grid = param_grid, 
                        scoring = 'r2', 
                        cv = kfold)
    grid.fit(X_train, y_train)
    print('R2 score: %.2f' %grid.best_score_)
    
    return grid.best_estimator_

In [10]:
reg = ElasticNet()
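# An empty grid evaluates only the default hyperparameters of the estimator;
# use the commented line instead to search over param_grid.
#reg_best = grid_search(reg, param_grid, X_train, y_train)
reg_best = grid_search(reg, {}, X_train, y_train)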
y_pred = reg_best.predict(X_test)
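# The square root of the MSE gives the RMSE; this form does not depend on the
# 'squared' argument, which is not available in all scikit-learn versions.
print('RMSE = %.2f' % np.sqrt(mean_squared_error(y_test, y_pred)))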

R2 score: 0.80
RMSE = 96.48
