# Learning Best Practices for Model Evaluation and Hyperparameter Tuning

In this chapter we will learn how to:

- Obtain unbiased estimates of a model's performance
- Diagnose the common problems of machine learning algorithms
- Fine-tune machine learning models
- Evaluate predictive models using different performance metrics

In [1]:
import warnings
warnings.simplefilter('ignore')

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Streamlining workflows with pipelines

In [3]:
df = pd.read_csv('wdbc.data', header = None)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
0     569 non-null int64
1     569 non-null object
2     569 non-null float64
3     569 non-null float64
4     569 non-null float64
5     569 non-null float64
6     569 non-null float64
7     569 non-null float64
8     569 non-null float64
9     569 non-null float64
10    569 non-null float64
11    569 non-null float64
12    569 non-null float64
13    569 non-null float64
14    569 non-null float64
15    569 non-null float64
16    569 non-null float64
17    569 non-null float64
18    569 non-null float64
19    569 non-null float64
20    569 non-null float64
21    569 non-null float64
22    569 non-null float64
23    569 non-null float64
24    569 non-null float64
25    569 non-null float64
26    569 non-null float64
27    569 non-null float64
28    569 non-null float64
29    569 non-null float64
30    569 non-null float64
31    569 non-null float64
dtypes: float64(30), int64(1), obj

In [5]:
X = df.iloc[:,2:].values
y = df.iloc[:,1].values

In [6]:
np.unique(y)

array(['B', 'M'], dtype=object)

In [7]:
#let's do label encoding of our classes
from sklearn.preprocessing import LabelEncoder

In [8]:
le = LabelEncoder()

y = le.fit_transform(y)

In [9]:
np.bincount(y)  # class counts: B (benign) is encoded as 0, M (malignant) as 1

array([357, 212], dtype=int64)

In [10]:
#let's split our dataset
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1, stratify = y)

## Combining transformers and estimators in a pipeline

Recall that we standardize the features before fitting a logistic regression model: its solver is iterative, and features on a comparable scale help the optimizer reach the minimum of the cost function in fewer iterations.

Here we use the pipeline utilities in scikit-learn, which chain the preprocessing and modeling steps so that exactly the same sequence is applied to both the training set and the test set.

The make_pipeline function takes an arbitrary number of scikit-learn transformers (objects that support the fit and transform methods) as inputs, followed by an estimator, and constructs a scikit-learn Pipeline object from them.<br>
There is no limit to the number of intermediate steps in a pipeline; however, the last pipeline element has to be an estimator (model).
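To make the chaining concrete, the sketch below compares a pipeline with the same steps applied by hand. It is only an illustration under a few assumptions: it loads the copy of the Wisconsin breast cancer data bundled with scikit-learn via `load_breast_cancer` instead of reading `wdbc.data`, it uses its own variable names (`X_demo`, `manual_pred`, and so on), and it fixes `svd_solver='full'` so that both PCA fits are deterministic.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for wdbc.data
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Manual version: fit each transformer on the training data only,
# then reuse the fitted transformers on the test data.
sc = StandardScaler()
pca = PCA(n_components=2, svd_solver='full')
lr = LogisticRegression(random_state=1)

X_tr_2d = pca.fit_transform(sc.fit_transform(X_tr))
lr.fit(X_tr_2d, y_tr)
manual_pred = lr.predict(pca.transform(sc.transform(X_te)))

# Pipeline version: fit() calls fit_transform on every intermediate step
# and fit on the final estimator; predict() calls transform, then predict.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2, svd_solver='full'),
    LogisticRegression(random_state=1),
)
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

print('Same predictions:', np.array_equal(manual_pred, pipe_pred))
print('Step names:', [name for name, _ in pipe.steps])
```

Besides being shorter, the pipeline version makes it hard to accidentally fit a transformer on the test data, because the test set only ever passes through `transform`.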

In [12]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

In [13]:
pipe_lr = make_pipeline(
    StandardScaler(), PCA(n_components=2), LogisticRegression(random_state=1)
)

In [14]:
pipe_lr.fit(X_train, y_train)


Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('pca',
                 PCA(copy=True, iterated_power='auto', n_components=2,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=None,
                                    penalty='l2', random_state=1, solver='warn',
                                    tol=0.0001, verbose=0, warm_start=False))],
         verbose=False)

In [15]:
y_pred = pipe_lr.predict(X_test)

In [16]:
print('Test Accuracy: {:.3}'.format(pipe_lr.score(X_test,y_test)))

Test Accuracy: 0.956


## Using k-fold cross validation to assess model performance

To find an acceptable bias-variance trade-off, we need to evaluate our model carefully. **Cross-validation** helps us obtain reliable estimates of the model's generalization performance, i.e. how well the model performs on unseen data.

We are going to discuss two cross-validation techniques here:

- Holdout cross validation
- K-fold cross validation

Using the holdout method, we split our initial dataset into a training dataset and a test dataset: the training dataset is used for training our model, and the test dataset is used to estimate its generalization performance.

But we also have to do *model selection*, which means choosing, for a given classification problem, the optimal values of the tuning parameters (also called hyperparameters).

The problem is that if we reuse the same test dataset over and over again during **model selection**, it effectively becomes part of our training data, and the model will be more likely to overfit. Thus it is not fair to use the same test dataset both for model selection and for the final evaluation of the model.

A better way of using the holdout method for model selection is to separate the data into three parts:

- A training set
- A validation set
- A test set

The training set is used to fit the different models.<br>
The performance on the validation set is then used for model selection.<br>
The test data has not been exposed to the model during training or model selection, so it is completely unseen, and it therefore provides a less biased estimate of the model's ability to generalize to new data.

A *disadvantage* of the holdout method is that the performance estimate may be very sensitive to how we partition the data into training and validation sets.
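The three-way split described above can be sketched with two calls to `train_test_split`. This is a minimal illustration, assuming the bundled `load_breast_cancer` dataset as a stand-in for `wdbc.data`; the 60/20/20 proportions and the `*_demo` variable names are choices made for this example only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: the bundled dataset stands in for wdbc.data
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Step 1: set aside 20% of the data as the final test set.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Step 2: split the remainder into training and validation sets.
# 25% of the remaining 80% is roughly 20% of the full dataset.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1, stratify=y_rest
)

print('train / validation / test sizes:',
      len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

Changing `random_state` in either call produces a different partition, which is exactly the source of the sensitivity mentioned above.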

## K-Fold Cross validation

In k-fold cross-validation, we randomly split the training dataset into k folds without replacement, where k-1 folds are used for model training and one fold is used for performance evaluation. This procedure is repeated k times, so that we obtain k models and k performance estimates.

We then calculate the average performance of the models based on the different, independent folds to obtain a performance estimate that is less sensitive to the sub-partitioning of the training data than the holdout method.

Typically we use k-fold cross-validation for **model tuning**, i.e. finding the optimal hyperparameter values that yield a satisfying generalization performance.

Once we have found satisfactory hyperparameter values, we can retrain the model on the complete training set and obtain a final performance estimate using the independent test set. We retrain after choosing the hyperparameters because giving the learning algorithm more training samples usually results in a more accurate and robust model.
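The tune-then-retrain workflow described above might look like the following sketch. The candidate values in `candidate_C`, the use of `max_iter=1000`, and the bundled `load_breast_cancer` dataset (standing in for `wdbc.data`) are all assumptions made for illustration, not settings taken from this chapter.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for wdbc.data
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

def build_model(C):
    # Scaling lives inside the pipeline so it is refit on every CV training fold.
    return make_pipeline(
        StandardScaler(),
        LogisticRegression(C=C, random_state=1, max_iter=1000),
    )

# Hypothetical grid of inverse regularization strengths to compare.
candidate_C = [0.01, 0.1, 1.0, 10.0]

# Model tuning: 10-fold CV on the training set only; the test set is untouched.
mean_scores = {
    C: cross_val_score(build_model(C), X_tr, y_tr, cv=10).mean()
    for C in candidate_C
}
best_C = max(mean_scores, key=mean_scores.get)

# Retrain on the complete training set, then evaluate once on the test set.
final_model = build_model(best_C).fit(X_tr, y_tr)
test_acc = final_model.score(X_te, y_te)

print('Best C by CV:', best_C)
print('Test accuracy: {:.3f}'.format(test_acc))
```

The test set is used exactly once, after the hyperparameter has been fixed, so the final number is not influenced by the selection step.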

**Note: A good standard value for *k* in k-fold cross-validation is 10, as empirical evidence suggests that it offers the best trade-off between bias and variance.**

A special case of k-fold cross-validation is the **leave-one-out cross-validation (LOOCV)** method. In LOOCV, we set the number of folds equal to the number of training samples (k=n), so that only one training sample is used for testing during each iteration. This is a recommended approach for working with very small datasets.
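As a small illustration of LOOCV, the sketch below uses scikit-learn's `LeaveOneOut` splitter with `cross_val_score`. To mimic a very small dataset it draws a stratified subsample of 60 rows from the bundled `load_breast_cancer` data; the subsample size and the variable names are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a 60-sample stratified subsample plays the role of a tiny dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_small, _, y_small, _ = train_test_split(
    X_demo, y_demo, train_size=60, random_state=1, stratify=y_demo
)

model = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))

# LeaveOneOut yields n splits; each one tests on a single held-out sample,
# so every individual score is either 0.0 (wrong) or 1.0 (correct).
loo_scores = cross_val_score(model, X_small, y_small, cv=LeaveOneOut())

print('Number of fits:', len(loo_scores))
print('LOOCV accuracy: {:.3f}'.format(np.mean(loo_scores)))
```

Because the model is refit n times, LOOCV becomes expensive as the dataset grows, which is one reason it is mostly reserved for small datasets.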

An improvement upon standard k-fold cross-validation is **stratified k-fold cross-validation**, which can yield better bias and variance estimates, especially in cases of unequal class proportions.

In stratified k-fold cross-validation, <u>the class proportions are preserved in each fold</u> to ensure that each fold is representative of the class proportions in the training dataset.
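The effect of stratification is easiest to see on a deliberately extreme toy example. The sketch below uses made-up labels (90 samples of class 0 followed by 10 samples of class 1, an assumption chosen only to exaggerate the imbalance) and counts how many minority samples land in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data (assumption): features are irrelevant here, only the labels matter.
X_toy = np.zeros((100, 1))
y_toy = np.array([0] * 90 + [1] * 10)

def minority_counts(splitter):
    # Number of class-1 samples in the test fold of each split.
    return [int(y_toy[test_idx].sum())
            for _, test_idx in splitter.split(X_toy, y_toy)]

plain_counts = minority_counts(KFold(n_splits=5))
stratified_counts = minority_counts(StratifiedKFold(n_splits=5))

print('KFold:          ', plain_counts)       # [0, 0, 0, 0, 10]
print('StratifiedKFold:', stratified_counts)  # [2, 2, 2, 2, 2]
```

With plain, unshuffled `KFold`, four of the five test folds contain no minority samples at all, while `StratifiedKFold` keeps the 9:1 ratio in every fold.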

In [17]:
from sklearn.model_selection import StratifiedKFold

In [18]:
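# random_state only has an effect when shuffle=True; recent scikit-learn
# versions raise an error if it is set without shuffling, so it is omitted here.
kfold = StratifiedKFold(n_splits=10).split(X_train, y_train)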

In [19]:
scores = []

for k, (train, test) in enumerate(kfold):  # train/test are index arrays for each fold
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    
    print('Fold: {}, Class dist:. {}, Acc: {:.3}'.format(k+1, np.bincount(y_train[train]), score))

Fold: 1, Class dist:. [256 153], Acc: 0.935
Fold: 2, Class dist:. [256 153], Acc: 0.935
Fold: 3, Class dist:. [256 153], Acc: 0.957
Fold: 4, Class dist:. [256 153], Acc: 0.957
Fold: 5, Class dist:. [256 153], Acc: 0.935
Fold: 6, Class dist:. [257 153], Acc: 0.956
Fold: 7, Class dist:. [257 153], Acc: 0.978
Fold: 8, Class dist:. [257 153], Acc: 0.933
Fold: 9, Class dist:. [257 153], Acc: 0.956
Fold: 10, Class dist:. [257 153], Acc: 0.956


In [20]:
print('CV accuracy {:.3} +/- {:.3}'.format(np.mean(scores), np.std(scores)))

CV accuracy 0.95 +/- 0.0139


We can also use the cross_val_score function provided by scikit-learn to do the above. One benefit of cross_val_score is its n_jobs parameter, which distributes the evaluation of the different folds across multiple CPU cores so that they run in parallel and the total execution time is reduced. Setting n_jobs=-1 uses all available cores.

In [21]:
from sklearn.model_selection import cross_val_score

In [22]:
scores = cross_val_score(estimator=pipe_lr, X=X_train, y=y_train, cv = 10, n_jobs=-1)

In [24]:
print('CV accuracy score', scores)

CV accuracy score [0.93478261 0.93478261 0.95652174 0.95652174 0.93478261 0.95555556
 0.97777778 0.93333333 0.95555556 0.95555556]


In [25]:
print('CV average accuracy: {:.3} +/- {:.3}'.format(np.mean(scores), np.std(scores)))

CV average accuracy: 0.95 +/- 0.0139


## Debugging algorithms with learning and validation curves