# Iterative Model Development<br/>*The ML Model Building Process*

## Goals
1. Describe Iterative Model Development.
2. Describe how to Compare Models.

## Introduction to Iterative Model Development Series

These notebooks are designed to be used with Jupyter Lab and the [Table of Contents](https://github.com/jupyterlab/jupyterlab-toc) plugin.  This plugin allows an outline of the notebook to be viewed in the left-hand panel with the ability to jump to any section. 

This is the first in a series of notebooks about building Machine Learning models with Scikit Learn.
* Only Supervised Learning is considered.
* Prediction accuracy is emphasized over model interpretability.
* The last notebook in the series will present a simple, interpretable, and well-performing model.

The content is intended to cover common questions a beginner may have after having made an initial attempt to learn Python, Pandas, Scikit Learn, and Machine Learning elsewhere.

### Machine Learning Models

A model is a simplified representation of a system or process.

A Machine Learning (ML) Model is created by a machine, that is, by a computer running a learning algorithm.

An ML Model is often a representation of a data generation process.  The Model can be used to classify or predict new data, or to understand existing data.

Creating an ML Model is a semi-automated, data-driven process.  A reasonable model can often be created without knowing much about the domain of the data.  A better model can often be created by extracting features and applying data transformations specific to the domain of the data.

A learning algorithm "learns" or "fits" the data it is presented with to create a model.  A model represents data in a manner specific to the learning algorithm that created it.  For example, a Linear Regression model represents data as a line.  A Decision Tree Classifier represents data as a hierarchical set of if-then-else rules.

Models with simple internal data representations, such as Linear Regression and Decision Trees, can have their internal representation examined by an end-user to provide insight into the user's data.  Models with complex internal representations are harder to examine and are typically used when the goal is to make the best predictions possible.

The fitting process is performed by an algorithm which optimizes the model's internal representation of the data.  In Scikit Learn terminology, fitting the data to create a model is called "minimizing the objective function". The objective function is a measure of the "distance" between the model's representation of the data and the actual data. When this distance has been minimized, a model has been created which has "learned" or "fit" the data.

In general, the measure of distance used by a learning algorithm to best fit a model to the data is not the same as the measure of distance used by an end-user to evaluate the usefulness of that model for their particular application.  In Scikit Learn terminology, the "objective function" used by the learning algorithm is usually different from the "scoring" function selected by the end-user.

The scoring function should be applied to data that was not used to build the model. The practical value of a predictive model depends upon how well it can predict on new data.
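The difference between the objective function and the scoring function can be illustrated with a short sketch.  It assumes synthetic data from `make_classification` as a stand-in for a real data set, and uses LogisticRegression, whose fitting procedure minimizes a regularized log loss, while the end-user scores the model with accuracy on data held back from fitting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption; stands in for real data).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold back data that the learning algorithm never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fitting minimizes the estimator's own objective (a regularized log loss).
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The end-user chooses the scoring function; here, accuracy on unseen data.
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# Log loss on the same data is closer to what the algorithm optimized.
test_log_loss = log_loss(y_test, model.predict_proba(X_test))

print(f"accuracy: {test_accuracy:.3f}  log loss: {test_log_loss:.3f}")
```

The two numbers measure different things, so a change that improves one does not necessarily improve the other.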

In Scikit Learn terminology:
1. An **estimator** is a Python object which:
   * has an algorithm to fit data
   * has internal data structures to hold the results of having fit the data
2. **estimator.fit()** fits the data (it creates the model)
3. **estimator.predict()** uses the model to make new predictions
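As a minimal sketch of these three points, the example below uses a DecisionTreeClassifier on synthetic data (both are choices made for illustration only).  After `fit()`, the learned state lives in attributes whose names end with an underscore.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only so the example is self-contained.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The estimator object bundles a fitting algorithm with storage for its results.
estimator = DecisionTreeClassifier(max_depth=3, random_state=0)

# fit() runs the learning algorithm; this is the step that creates the model.
estimator.fit(X, y)

# The fitted tree is held in internal data structures such as tree_.
print("nodes in fitted tree:", estimator.tree_.node_count)
print("depth of fitted tree:", estimator.get_depth())

# predict() uses the fitted model to label records.
predictions = estimator.predict(X[:5])
print("predictions:", predictions)
```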

## The Model Building Process
This series of notebooks is about how to build ML models in general.  For concreteness, the Titanic data set from Kaggle will be used.

Some of the model building techniques presented in the later notebooks of this series are overkill for a data set as simple as the Titanic. However they are likely to be useful for larger and more complex data sets. 

Model Building Process:  
1. Identify and quantify goal
2. Perform Exploratory Data Analysis (EDA)
3. Quickly build first model and evaluate its usefulness
4. Create a better model by improving:
    * Exploratory Data Analysis
    * Preprocessing (data transformations)
    * Feature Extraction
    * Tuning of hyperparameters
    * Choice of estimator (other model building algorithms)
    * Stacking of models
    * and more ...
5. Compare new model to previous model
6. Repeat steps 4 and 5 (a reasonable number of times)

### 1. Identify and Quantify Goal
This example is with respect to the Titanic Data Set.

**Goal**  
Predict whether a passenger would have survived or not, from information about the passenger.  This is a binary classification problem.

**Quantify**  
Evaluate the "goodness" of the model as the fraction of predictions that were correct.  This can be scored with the Scikit Learn metric: accuracy_score.  
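A small example of the metric, using made-up labels purely for illustration.  Note that `accuracy_score` returns a fraction between 0 and 1 rather than a percentage.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 1 = survived, 0 = did not survive.
y_true = [1, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

# 6 of the 8 predictions match the true labels, so the accuracy is 6 / 8.
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)  # 0.75
```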

### 2. Exploratory Data Analysis
Use EDA to:
* get an overview of the data
* choose a few simple predictors to start with
* document ideas for additional predictors, data transformations, and extracted features to try later

This requires a data visualization tool such as Seaborn.  An excellent online course for Seaborn is: [Data Visualization with Seaborn](https://www.datacamp.com/courses/data-visualization-with-seaborn).

### 3. Create the Initial Model

The key point is to get started.  Once something is up and running, it is easy to refine it.

### 4. Improve the Model
The goal of each iteration is to create a better model based on what was learned from the previous iteration and what was left untried.

### 5. Compare Model Performance
In order to understand if progress is being made, it is necessary to score the model and compare that score to previous models.

This topic is more subtle than it may seem.  As such, an entire section is provided below.

For this series of notebooks:
1. Model Evaluation
   * an absolute score used to estimate performance on unseen data
   * use 10-Repeated 5-Fold or 10-Fold CV
2. Model Selection
   * a relative score used to rank models by performance
   * use 10-Repeated 2-Fold CV
3. Score Significance
   * CV scores are "noisy".  Decide if one score is better than another, above the noise level, by comparing their confidence intervals.

The above is described in more detail in the section on Model Selection.

### 6. Iterate
Repeat steps 4 and 5 above until the desired performance is reached or no additional performance can be obtained with the available resources.

## Model Selection<br/>Ranking Models by Score
This section presents a deeper look into how to evaluate and compare model performance.

This section is more advanced and can be skipped.

The primary references used for this section are:
1. [Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning by Sebastian Raschka](https://sebastianraschka.com/pdf/manuscripts/model-eval.pdf)
2. [Model Selection by David Schonleber](https://towardsdatascience.com/a-short-introduction-to-model-selection-bb1bb9c73376)
3. [Cross Validation for Selecting a Model Selection Procedure by Yongli Zhang and Yuhong Yang](http://users.stat.umn.edu/~yangx374/papers/ACV_v30.pdf), in particular, Section 7: "Misconceptions on the use of CV".
4. [On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation by Cawley and Talbot](http://www.jmlr.org/papers/volume11/cawley10a/cawley10a.pdf)

The discussion in this notebook is simpler than the above and oriented towards small data sets.

**Hold Out Set**  
If there is sufficient data, it is best to keep a final hold-out set (aka test set) to score the model on after all experimentation is complete.

With small amounts of data, this is not possible.  Setting aside a hold-out set would leave either too little data for training the model well, or too little data for scoring the model well.

**Bias and Variance**  
The terms bias and variance are used here with respect to the estimate of the model's performance (i.e. its score).

A model may have a score that is biased low, because it was trained on too little training data.

A model may have a score that has a large variance, because it was scored on too little validation data.

### Three Reasons for Model Evaluation
1. Estimate the performance of the model on unseen data
2. Select the best performing hyperparameters for a given algorithm
   * out of the set of hyperparameters tried
   * AND estimate performance on unseen data
3. Select the best performing algorithm
   * out of the set of algorithms tried
   * out of the set of hyperparameters tried for each algorithm  
   * AND estimate performance on unseen data

### 1. Estimate Performance on Unseen Data

As suggested above, with a small amount of data, Cross Validation can provide a better score than using a train/test split.

This discussion applies to using an out-of-the-box Scikit Learn estimator with default hyperparameter values.

10 Fold Cross Validation is often recommended.

Repeating the 10-Fold Cross Validation is often recommended.  Repeating reduces the variance of the score, but not its bias.

Repeating perhaps 10 or 20 times is usually sufficient to reduce the variance about as much as is possible.
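A minimal sketch of 10-Repeated 10-Fold Cross Validation follows.  It assumes synthetic data and uses `RepeatedStratifiedKFold`, which is one way to repeat K-Fold CV in Scikit Learn and not necessarily the one used in the later notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic data, used only to keep the example self-contained.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 10 folds, repeated 10 times with different shuffles: 100 scores in total.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

# An out-of-the-box estimator (max_iter raised only to avoid convergence warnings).
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

print("number of scores:", len(scores))  # 100
print(f"mean accuracy: {scores.mean():.3f}  standard deviation: {scores.std():.3f}")
```

Averaging over repeats smooths out the effect of any one particular split, which is why the variance of the mean score drops while its bias stays the same.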

### 2. Select the Best Hyperparameters AND Evaluate Performance on Unseen Data

There are various methods for exploring the hyperparameter space:
* GridSearchCV
* RandomizedSearchCV
* hand-coded search
* Bayesian Hyperparameter tuning using a third-party library

However the hyperparameter values are chosen, the resulting models are scored with Cross Validation.
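As a sketch of the first method, the example below runs GridSearchCV over a small, arbitrary grid of `C` values for LogisticRegression on synthetic data.  The grid, the data, and the fold count of 5 are placeholders chosen for illustration; the choice of K for ranking is taken up under Ranking Models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only to keep the example self-contained.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# An arbitrary grid of regularization strengths to try.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Each candidate is fit and scored with Cross Validation.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)

# cv_results_ holds the mean and spread of the CV scores for every candidate.
results = search.cv_results_
for params, mean, sd in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(f"{params}: {mean:.3f} +/- {sd:.3f}")

print("best:", search.best_params_)
```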

There are two parts to the problem:
1. rank each of the models to find the best hyperparameter values
2. estimate the performance of this model on unseen data

**Ranking Models**  
The question is, what value of K (the number of folds in K-Fold Cross Validation) produces the best ranking of the models?

Above it was mentioned that K=10 is a good rule of thumb for estimating a model's performance on unseen data.  It might seem that K=10 would also work best with respect to ranking models, but this is not the case.

An analogy may help to make this clear.  If you had a stopwatch that was precisely 10% too slow, you could correctly rank how fast you ran a mile each day.  On the other hand, if you had a stopwatch that was on average neither slow nor fast, but did not run consistently at the same speed, you could not correctly rank how fast you ran each day.  The ability to properly rank how well you ran depends upon both the variance and the bias of the score.

When a model is trained on too little data, it has a score that is biased downwards.  If two models are trained on too little data, they both have scores that are biased downward.  To compare the models, the difference in their scores is taken.  If both have a downward bias, the biases largely cancel out.

The amount of data used for hyperparameter optimization is usually limited either by computational resources or by simply having too little data.  When using K-Fold Cross Validation, a decision has to be made as to how much data to train on, and how much data to validate on.

K=2 provides 50% of the data for training and 50% for score validation.  
K=10 provides 90% of the data for training, and 10% of the data for score validation.

At K=2, the model scores may be biased downward, but part of this downward bias will cancel out when the scores are subtracted from one another.

At K=10, the model scores will have more variance than at K=2 and this variance may obscure which model is best.

Of course, if there is much too little data for training, the model is not viable at all, and comparing scores becomes meaningless.

A [Learning Curve](https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html) can be used to determine how close to capacity a model is with a given number of records for training.
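A minimal sketch of computing the numbers behind a learning curve, assuming synthetic data of roughly the size discussed here.  When the validation score stops improving as the training size grows, the model is close to capacity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data with 900 records (an assumption; not the Titanic data).
X, y = make_classification(n_samples=900, n_features=8, random_state=0)

# Fit on increasing fractions of the training folds and score on the rest.
train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# One row per training size, one column per CV fold; average over the folds.
for n_train, score in zip(train_sizes, valid_scores.mean(axis=1)):
    print(f"trained on {n_train:3d} records: validation accuracy {score:.3f}")
```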

To use a concrete example taken from later notebooks:
* The Titanic data set can be well trained on about 600 records using LogisticRegression
* About 900 records are available
* K=2 implies 450 records for training and 450 for scoring
* K=10 implies 810 records for training and 90 records for scoring

In the above, K=2 will likely produce a better ranking of models than K=10.

Although the scores using 450 records will be biased downward as the model capacity has not yet been reached, the bias will largely cancel out when the difference in scores is taken.  The reduction in variance between scoring on 450 records vs scoring on 90 records will allow K=2 to produce a better ranking than K=10.
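The record counts in the example above follow from simple arithmetic, sketched below.  To give a rough feel for the variance argument, the sketch also computes the binomial standard error of the accuracy measured on a single validation fold.  This treats each prediction as an independent trial and assumes a true accuracy of 0.8, a hypothetical value chosen only for illustration.

```python
import math


def kfold_split_sizes(n_records, k):
    """Return (train_size, validation_size) for one fold of K-Fold CV.

    Assumes n_records divides evenly by k, as it does in this example.
    """
    validation_size = n_records // k
    return n_records - validation_size, validation_size


def accuracy_standard_error(accuracy, n_scored):
    """Binomial standard error of an accuracy measured on n_scored records."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_scored)


N_RECORDS = 900          # approximate number of records, from the example above
ASSUMED_ACCURACY = 0.8   # hypothetical value, for illustration only

for k in (2, 10):
    train_size, validation_size = kfold_split_sizes(N_RECORDS, k)
    se = accuracy_standard_error(ASSUMED_ACCURACY, validation_size)
    print(
        f"K={k}: train on {train_size}, score on {validation_size}, "
        f"standard error of one fold's accuracy is about {se:.3f}"
    )
```

Scoring on 5 times as many records shrinks the standard error of a single fold's score by a factor of the square root of 5, a little more than 2.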

**Estimating Performance of Best Model**  
The hyperparameter optimization process repeatedly reuses the Cross Validation validation folds.  Because this validation data was used to select the model, the Cross Validation score of the best model is optimistically biased.

If there is sufficient data to have a final hold-out test set, then the best hyperparameters that were found can be used to train a model on all the training data, and that model can be scored on the test data.

However if no final hold-out set is possible due to too little data, then [Nested Cross Validation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html) can be used.

In Nested Cross Validation, the process of selecting the best hyperparameters is performed within the Cross Validation loop.  In the inner loop, model comparison is being performed, so a low K (usually K=2) is used.  In the outer loop, the performance of the model on unseen data is being estimated, so a higher K (usually K=5) is used.  I suspect K=5 is used in this situation because it reduces the computation time relative to K=10 and yet produces very similar results.

Nested Cross Validation is very computationally intensive, but as it is only needed when there is too little data for a final hold-out test set, that may not be a problem.
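A minimal sketch of Nested Cross Validation, assuming synthetic data and an arbitrary grid of `C` values.  The GridSearchCV object is itself treated as an estimator, so the hyperparameter search is rerun from scratch inside each outer training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data, used only to keep the example self-contained.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Inner loop: low K, used to rank hyperparameter candidates.
inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

# Outer loop: higher K, used to estimate performance on unseen data.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# The search (selection step) becomes part of the model being evaluated.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="accuracy",
)

# Each outer validation fold is never seen by the search that is scored on it.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")

print("outer fold scores:", nested_scores.round(3))
print(f"estimated accuracy on unseen data: {nested_scores.mean():.3f}")
```

The cost is visible in the structure: every outer fold triggers a full grid search, so the number of model fits is roughly the product of the outer folds, inner folds, and grid size.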

### 3. Select the Best Hyperparameter Optimized Algorithm AND Evaluate Performance on Unseen Data

The key point about having a final hold-out test set for evaluation is that it can only be used once (or very few times).

If several different learning algorithms are tried, and each is evaluated on the same test set, then the test set has been used multiple times and it no longer provides an unbiased estimate of model performance.

If there were a very large amount of data, then there could be many final mutually exclusive hold-out test sets, one for each algorithm, and this would be fine.  However this requires a very large amount of data.

### Confidence Intervals  
Is 3 close to 7?
* 3 ${\pm}$ 1 is not close to 7 ${\pm}$ 1
* 3 ${\pm}$ 5 is close to 7 ${\pm}$ 5

In order to determine if two values are distinguishable from one another, a confidence interval or equivalent is needed.

In the second example above, 3 ${\pm}$ 5 represents the confidence interval: \[-2, 8\].  As 7 is within this confidence interval, the two measurements could be considered to be the same.  That is, random chance alone may explain the observed difference in values.
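The comparison above can be written out as a small check.  The helper names are hypothetical, and the rule used (one value falling inside the other's interval) is the simple criterion described in this section, not a formal statistical test.

```python
def interval(center, half_width):
    """Return the (low, high) bounds of center +/- half_width."""
    return center - half_width, center + half_width


def is_within(value, center, half_width):
    """True if value falls inside the interval center +/- half_width."""
    low, high = interval(center, half_width)
    return low <= value <= high


print(interval(3, 5))      # (-2, 8)
print(is_within(7, 3, 5))  # True: 7 is inside [-2, 8]
print(is_within(7, 3, 1))  # False: 7 is outside [2, 4]
```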

For more detail, see an introductory book on probability and statistics.  A good book for Data Scientists is: [Practical Statistics for Data Scientists](https://www.amazon.com/Practical-Statistics-Data-Scientists-Essential-ebook/dp/B071NVDFD6/)

For Cross Validation, there is no statistically precise definition of variance, standard deviation, margin of error, confidence interval and the like.  Nevertheless, these values can be computed and are useful.  For an advanced discussion, see: [No Unbiased Estimator of Variance of K-Fold CV](http://www.jmlr.org/papers/volume5/grandvalet04a/grandvalet04a.pdf)

Some ways to compute a confidence interval include: 
* K-Fold CV: consider each of the K performance estimates to be IID.  The confidence interval is:
```[cv_score_mean - cv_score_sd, cv_score_mean + cv_score_sd]```
* M-Repeated K-Fold CV: consider the M\*K performance estimates to be IID.   The confidence interval is:
```[cv_score_mean - cv_score_sd, cv_score_mean + cv_score_sd]```
* Alternative: if M\*K is sufficiently large, use the median as the point estimate and the Interquartile Range as the confidence interval
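The intervals listed above can be computed as in the following sketch.  It assumes synthetic data and 10-Repeated 5-Fold CV, and it treats the M\*K scores as if they were IID, which is the loose usage described in this section.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic data, used only to keep the example self-contained.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# M=10 repeats of K=5 folds gives M*K = 50 scores.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

# Mean +/- 1 standard deviation, treating the scores as IID.
cv_score_mean = scores.mean()
cv_score_sd = scores.std(ddof=1)
mean_interval = (cv_score_mean - cv_score_sd, cv_score_mean + cv_score_sd)

# Alternative: median as the point estimate, Interquartile Range as the interval.
cv_score_median = np.median(scores)
q1, q3 = np.percentile(scores, [25, 75])

print(f"mean {cv_score_mean:.3f}, interval [{mean_interval[0]:.3f}, {mean_interval[1]:.3f}]")
print(f"median {cv_score_median:.3f}, IQR [{q1:.3f}, {q3:.3f}]")
```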

If the distribution of the model scores is normal, then +/-1 standard deviation covers about 68% of the scores.  Interquartile Range (IQR), as is presented visually on a boxplot, covers 50% of the scores, so the box in a boxplot is similar to a confidence interval defined as the mean +/- 1 standard deviation.

If the score of one model is within the confidence interval of another, they have effectively the same performance.  This means some other criterion can be used to distinguish between them, such as complexity, cost to deploy, etc.  Note that recent versions of GridSearchCV accept a callable for the `refit` parameter, which lets the user decide which model is best using any user-defined criterion.
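One way to act on this is sketched below, using a callable passed as the `refit` argument of GridSearchCV.  The helper `simplest_within_one_sd` is hypothetical, and treating a smaller `C` (stronger regularization) as the "simpler" LogisticRegression model is an assumption made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def simplest_within_one_sd(cv_results):
    """Return the index of the smallest-C candidate whose mean score is
    within one standard deviation of the best mean score."""
    means = np.asarray(cv_results["mean_test_score"])
    sds = np.asarray(cv_results["std_test_score"])
    best = int(np.argmax(means))
    threshold = means[best] - sds[best]
    # Candidates that are indistinguishable from the best, by this loose rule.
    candidates = np.flatnonzero(means >= threshold)
    c_values = np.array([params["C"] for params in cv_results["params"]])
    # Assumption: smaller C means a simpler (more regularized) model.
    return int(candidates[np.argmin(c_values[candidates])])


# Synthetic data, used only to keep the example self-contained.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
    refit=simplest_within_one_sd,
)
search.fit(X, y)

print("selected index:", search.best_index_)
print("selected parameters:", search.best_params_)
```

The selected candidate may score slightly lower than the top-ranked one, but by the criterion above the difference is within the noise.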

Again, the terms confidence interval, standard deviation, variance and the like are being used loosely here.  These terms are only well defined for IID data and resampled data is not IID.