# Chapter 6: Learning best practices for model evaluation and hyperparameter tuning

## Streamlining workflows with pipelines

Pipelines allow fitting a model together with an arbitrary number of transformation steps, and then applying the same steps to new data.

### Reading in the data

1. Get the data

In [1]:
import pandas as pd
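# Note: the raw wdbc.data file has no header row; without header=None,
# pandas treats the first record as column names and drops it from the data.
df = pd.read_csv('wdbc.data')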

2. Use `LabelEncoder` to transform the class labels from strings into integers

(M = malignant tumors, B = benign tumors)

In [2]:
from sklearn.preprocessing import LabelEncoder 
X = df.iloc[:, 2:].values
y = df.iloc[:, 1].values
le = LabelEncoder()
y = le.fit_transform(y)
le.classes_

array(['B', 'M'], dtype=object)

In [3]:
le.transform(['M', 'B'])

array([1, 0])
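The encoding can also be reversed. The sketch below uses a few toy labels as a stand-in for the diagnosis column (the list `labels` and the name `le_demo` are illustrative assumptions, not part of the notebook) to show that `inverse_transform` maps the integers back to the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the diagnosis column (illustrative values only)
labels = ['M', 'B', 'B', 'M']

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)       # classes are sorted: 'B' -> 0, 'M' -> 1
decoded = le_demo.inverse_transform(encoded)  # map integers back to the strings

print(encoded)
print(decoded)
```

This is handy when predictions need to be reported as `'M'` or `'B'` rather than as 1 or 0.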

3. Split the data into 80% training and 20% test data, stratified by class label

In [4]:
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=1)

### Apply the transforms with a pipeline

The data needs to be standardized, and PCA will be used to compress the 30 features into a lower-dimensional subspace.

In [5]:
from sklearn.preprocessing import StandardScaler 
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression 
from sklearn.pipeline import make_pipeline #this will make the pipeline

Build a pipeline that chains the scaler, PCA dimensionality reduction, and the model. `make_pipeline` takes an arbitrary number of transformer objects (objects that implement `fit` and `transform`), followed by an estimator that implements `fit` and `predict`. The resulting pipeline object acts like a "meta-estimator".

In [6]:
pipe_lr = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
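As a small aside, `make_pipeline` names each step automatically after its class. The sketch below rebuilds an equivalent pipeline under the hypothetical name `pipe_demo` and shows how the steps can be listed and looked up by name.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe_demo = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())

# Step names are the lowercased class names, in the order the steps were given
step_names = [name for name, _ in pipe_demo.steps]
print(step_names)

# Individual steps can be retrieved by name, e.g. to inspect the PCA settings
pca_step = pipe_demo.named_steps['pca']
print(pca_step.n_components)
```

The same names are used as prefixes when setting parameters on a step, for example `pipe_demo.set_params(pca__n_components=3)`.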

`pipe_lr` will now act like a model object for the data:

In [7]:
pipe_lr.fit(X_train, y_train)
y_pred = pipe_lr.predict(X_test)
test_acc = pipe_lr.score(X_test, y_test)
print(f'Test accuracy: {test_acc:.3f}')

Test accuracy: 0.930


Calling `pipe_lr.fit()` on the training data calls `fit` and `transform` on each transformer in turn, and then `fit` on the final estimator. Calling `pipe_lr.predict()` on the test data applies only the `transform` methods of the already-fitted transformers, followed by the estimator's `predict`. The test data therefore never influences the scaling or PCA parameters.
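To make the call sequence concrete, here is a minimal sketch that performs the steps by hand and compares the result with a pipeline. It uses synthetic data from `make_classification` as a stand-in (an assumption made so the example is self-contained; it is not the WDBC data), and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the WDBC dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1)

# Manual version: fit + transform on training data, transform only on test data
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression()
X_tr_red = pca.fit_transform(scaler.fit_transform(X_tr))
clf.fit(X_tr_red, y_tr)
X_te_red = pca.transform(scaler.transform(X_te))
manual_pred = clf.predict(X_te_red)

# Pipeline version: the same sequence behind a single fit/predict interface
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Check whether both routes give identical predictions
print(np.array_equal(manual_pred, pipe_pred))
```

Besides being shorter, the pipeline version makes it hard to accidentally call `fit` on test data.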

## Using k-fold cross validation to assess model performance

### Holdout method

*Holdout cross-validation* sets aside a test set so that the model is evaluated on unseen data. Tuning hyperparameters against that same test set would make it part of model selection, so the method is usually extended to three parts: a training set for fitting, a validation set for hyperparameter tuning and model selection, and a test set that is used only once, to estimate the performance of the final model. The drawback is that the estimate can be sensitive to how the data is partitioned into training, validation, and test sets.
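One common way to get the three partitions is to call `train_test_split` twice. The sketch below uses toy arrays and a 60/20/20 split; the proportions and the `*_demo` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with unique values, balanced binary labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# First carve off the test set (20% of all rows)
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=1)

# Then split the remainder: 25% of the remaining 80% is 20% of all rows
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=1)

sizes = (len(X_train_demo), len(X_val_demo), len(X_test_demo))
print(sizes)
```

Changing `random_state` produces a different partition, which is exactly the sensitivity the holdout method suffers from.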

### K-fold cross validation

*K-fold cross-validation* is a more robust alternative. The training set is split into $k$ folds without replacement; $k - 1$ folds are used for training and the remaining fold is the test fold used for performance evaluation. This is repeated $k$ times, with each fold serving as the test fold once, giving $k$ models and $k$ performance estimates.
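The fold mechanics can be sketched with NumPy alone. The sample count, `k`, and the random seed below are arbitrary assumptions; the point is only to show that each iteration trains on $k - 1$ folds and that every sample lands in a test fold exactly once.

```python
import numpy as np

n_samples, k = 20, 5
rng = np.random.default_rng(0)

# Shuffle the indices once, then cut them into k disjoint folds
indices = rng.permutation(n_samples)
folds = np.array_split(indices, k)

test_counts = np.zeros(n_samples, dtype=int)  # how often each sample is tested
train_sizes = []
for i in range(k):
    test_idx = folds[i]
    # All folds except the i-th form the training split
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    test_counts[test_idx] += 1
    train_sizes.append(len(train_idx))

print(train_sizes)   # size of each training split
print(test_counts)   # number of times each sample appeared in a test fold
```

With 20 samples and 5 folds, each training split holds 16 samples, which is $\frac{k-1}{k}$ of the data.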

K-fold cross-validation is typically used for hyperparameter tuning: each candidate setting is scored by its average performance across the $k$ test folds. Once satisfactory hyperparameters are found, the model is retrained on the entire training set and evaluated once on the independent test set. The $k$ models fitted during cross-validation serve only to produce the performance estimate and are then discarded.

The estimated performance is the average over the $k$ folds:
$$
    E = \frac{1}{k} \sum_{i=1}^{k} E_i
$$
where $E_i$ is the evaluation metric of the model from the $i$-th iteration.

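Unlike the holdout method, k-fold cross-validation uses every data point for both training and validation, and each data point is part of a test fold exactly once. The value of $k$ trades off the quality of the estimate against runtime; $k = 10$ is a good default.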

(Note: each of the $k$ models is trained on $k - 1$ folds, i.e. $\frac{k-1}{k}$ of the training set, and evaluated on the remaining $\frac{1}{k}$. Only the final model is trained on the entire training set.)

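Smaller datasets can benefit from a larger $k$, so that more of the data is used for training in each iteration. This increases runtime, and it also increases the variance of the estimate because the training splits become more similar to one another. Larger datasets can use a smaller $k$ (for example $k = 5$), because it reduces runtime while each training split is still large enough for a reliable estimate.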

*Leave-one-out cross-validation*: set $k = n$, the number of training examples, so that a single record is held out for prediction in each iteration and every record is held out exactly once. This is useful for very small datasets.
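scikit-learn provides a `LeaveOneOut` splitter that can be passed as `cv`. The sketch below runs it on a small synthetic dataset (an assumption for illustration; the data and the plain `LogisticRegression` model are not from the notebook).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (assumption: stand-in for a genuinely small dataset)
X_small, y_small = make_classification(n_samples=30, n_features=5, random_state=0)

# One model is fitted per sample; each score is 1.0 (correct) or 0.0 (wrong)
scores_loo = cross_val_score(LogisticRegression(), X_small, y_small, cv=LeaveOneOut())

print(len(scores_loo))    # one score per held-out record
print(scores_loo.mean())  # fraction of held-out records predicted correctly
```

Because one model is fitted per record, the cost grows linearly with the dataset size, which is why this variant is reserved for small datasets.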

*Stratified k-fold cross-validation*: the class label proportions are preserved in each fold, so every fold is representative of the class distribution in the training set. Use scikit-learn's `StratifiedKFold`.

In [14]:
import numpy as np
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=10).split(X_train, y_train) 
scores = []
for k, (train, test) in enumerate(kfold):
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test]) 
    scores.append(score)
    print(f"Fold: {k+1:02d}, Class distr.: {np.bincount(y_train[train])}, Acc.: {score:.3f}")

Fold: 01, Class distr.: [256 152], Acc.: 0.913
Fold: 02, Class distr.: [256 152], Acc.: 0.935
Fold: 03, Class distr.: [256 152], Acc.: 0.957
Fold: 04, Class distr.: [256 152], Acc.: 0.891
Fold: 05, Class distr.: [257 152], Acc.: 0.978
Fold: 06, Class distr.: [257 152], Acc.: 0.978
Fold: 07, Class distr.: [257 152], Acc.: 0.978
Fold: 08, Class distr.: [257 152], Acc.: 0.911
Fold: 09, Class distr.: [257 152], Acc.: 0.933
Fold: 10, Class distr.: [256 153], Acc.: 0.978


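- `n_splits` sets the number of folds, $k$.
- `.split(X_train, y_train)` takes the class labels so that the folds can be stratified, and yields a pair of index arrays (`train`, `test`) for each fold.
- The loop fits the pipeline on each training split and scores it on the corresponding test fold.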
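The effect of stratification is easiest to see on imbalanced labels. The sketch below compares plain `KFold` with `StratifiedKFold` on a toy label vector with a 90/10 class split (the data and variable names are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 samples of class 0 followed by 10 of class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((100, 1))  # features are irrelevant for the split itself

# Class counts in each test fold, without and with stratification
plain = [np.bincount(y_imb[test], minlength=2).tolist()
         for _, test in KFold(n_splits=5).split(X_imb)]
strat = [np.bincount(y_imb[test], minlength=2).tolist()
         for _, test in StratifiedKFold(n_splits=5).split(X_imb, y_imb)]

print('KFold:          ', plain)
print('StratifiedKFold:', strat)
```

Without shuffling, plain `KFold` puts no minority-class samples into the first test folds, while `StratifiedKFold` keeps the 90/10 ratio in every fold.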

In [15]:
mean_acc = np.mean(scores)
std_acc = np.std(scores)
print(f'\nCV accuracy: {mean_acc:.3f} +/- {std_acc:.3f}')


CV accuracy: 0.945 +/- 0.031


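The mean and standard deviation of the fold accuracies summarize the cross-validation result. `cross_val_score` does the same in one call, where `cv` is the number of splits; for a classifier, an integer `cv` uses stratified folds. `n_jobs` sets how many CPUs the folds are distributed across: `n_jobs=1` uses a single CPU, and `n_jobs=-1` uses all available CPUs.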

In [17]:
from sklearn.model_selection import cross_val_score 
scores = cross_val_score(estimator=pipe_lr, X=X_train, y=y_train, cv=10, n_jobs=1)
print(f'CV accuracy scores: {scores}')
print(f'CV accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}')

CV accuracy scores: [0.91304348 0.93478261 0.95652174 0.89130435 0.97777778 0.97777778
 0.97777778 0.91111111 0.93333333 0.97777778]
CV accuracy: 0.945 +/- 0.031
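The scores match the manual loop because, for a classifier, an integer `cv` makes `cross_val_score` use stratified folds. The sketch below checks this on synthetic data (an assumption so the example runs on its own; the shorter pipeline without PCA is also just for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the WDBC dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Integer cv versus an explicit StratifiedKFold splitter
scores_int = cross_val_score(pipe, X_demo, y_demo, cv=10, n_jobs=1)
scores_skf = cross_val_score(pipe, X_demo, y_demo,
                             cv=StratifiedKFold(n_splits=10), n_jobs=1)

# Check whether both calls produce the same fold scores
print(np.allclose(scores_int, scores_skf))
```

Passing a splitter object explicitly is useful when shuffling or a fixed `random_state` is wanted, since the integer shortcut does not shuffle.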
