# TOC

+ Bias vs Variance
+ Validation
+ Training Curves
+ Hyperparameter Tuning


# I. Validation

Using a validation set will limit the risk of overfitting our model on the training set. Once we are satisfied with a given model, we can retrain it on the entire dataset with the same hyperparameters.

Of particular interest will be how the sample data was extracted from the database: was it random, or were some classes over-sampled to produce a more balanced dataset? This will be crucial to set up a proper validation scheme: when splitting the training set to create the validation set, we want the resulting split to be as close as possible to the split between training and test sets.

The scikit-learn documentation has a section dedicated to [validation](https://scikit-learn.org/stable/modules/cross_validation.html).

## I.1. Validation Methods

How to split / what ratios to use:
+ holdout (`sklearn.model_selection.ShuffleSplit`): split the dataset in two and train only on one part. This is a good choice when the dataset is large or when the model score is likely to be consistent across splits.
+ K-fold (`sklearn.model_selection.KFold`): split the dataset into K subsets; train K times on K-1 subsets. This method ensures that a given sample is used for validation only once. The end score is the average over these K folds. This is a good choice when the dataset is of medium size.
+ Leave-one-out (`sklearn.model_selection.LeaveOneOut`): K-fold where K is equal to the number of samples in the training set. This is a good choice when the dataset is small and the models are relatively fast to train. 
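
The three splitters above can be compared on a toy array. This is a minimal sketch: the dataset, the 80/20 holdout ratio and the choice of 5 folds are assumptions made for illustration, not recommendations from these notes.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit

# Toy dataset: 20 samples, 3 features (values are arbitrary).
X = np.arange(60).reshape(20, 3)

# Holdout: a single random 80/20 split.
holdout = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
# K-fold: 5 folds, each sample lands in exactly one validation fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
# Leave-one-out: as many folds as there are samples.
loo = LeaveOneOut()

splits = {}
for name, splitter in [("holdout", holdout), ("kfold", kfold), ("loo", loo)]:
    splits[name] = list(splitter.split(X))
    train_idx, valid_idx = splits[name][0]
    print(f"{name}: {len(splits[name])} split(s), first split has "
          f"{len(train_idx)} train / {len(valid_idx)} validation samples")
```

The number of model fits grows from 1 (holdout) to K (K-fold) to the number of samples (leave-one-out), which is why the latter is only practical for small datasets and fast models.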

_Note: it is important to ensure that there is no overlap between training and validation sets, which could occur if there are duplicate samples. In this case, the validation score risks being misleadingly high._

_Note: with K-fold and LOO, you can also estimate the mean and variance of the loss. This can be used to measure whether improvements are statistically significant._
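
As a rough illustration of the note above, `cross_val_score` returns one score per fold, from which a mean and a spread can be computed. The synthetic dataset and the logistic regression model are assumptions chosen only to keep the sketch short, and it reports accuracy rather than a loss.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

mean_score = scores.mean()
std_score = scores.std(ddof=1)  # sample standard deviation across folds
print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f} over {len(scores)} folds")
```

An improvement that is much smaller than the fold-to-fold standard deviation is hard to distinguish from noise.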

For classification datasets that are either small or have a large number of categories, it is good practice to use stratification: this method ensures that the distribution of classes is similar across folds.
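
A small sketch of stratification, using `sklearn.model_selection.StratifiedKFold` on deliberately imbalanced toy labels (the 90/10 class ratio is an assumption for the example):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of the minority class in each validation fold.
minority_share = []
for train_idx, valid_idx in skf.split(X, y):
    minority_share.append(y[valid_idx].mean())

print(minority_share)
```

Each validation fold keeps the 10% minority share of the full dataset, which a plain K-fold does not guarantee.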

## I.2. Inconsistent Scores 

There are two main reasons for observing vastly different scores for our different folds:
+ the dataset has clear patterns but few samples. The model will not be able to generalize well. Each fold trains on slightly different patterns, which can lead to vastly different scores.
+ the data is inconsistent. In this case where the variance is high, the model will struggle to generalize.

Running several K-folds with different seeds can help get a better estimate of the model's performance. It can also be helpful to adjust hyperparameters with one seed and estimate performance with another one.
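
One possible way to run several K-folds with different seeds is a plain loop over `random_state` values, as sketched below. The data, model and number of seeds are assumptions; scikit-learn also provides `RepeatedKFold` for the same purpose.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# One 5-fold run per seed; each seed shuffles the data differently.
seed_means = []
for seed in range(5):
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    seed_means.append(cross_val_score(model, X, y, cv=cv).mean())

seed_means = np.array(seed_means)
print(f"mean over seeds: {seed_means.mean():.3f}, "
      f"spread between seeds: {seed_means.std(ddof=1):.3f}")
```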


## I.3. Competition-specific Steps
A few extra steps can be taken during competitions to compare validation scores with leaderboard scores:
+ Ideally, improving your validation score leads to improving your leaderboard score.
+ Sometimes, the validation score is very different from the leaderboard one. This usually comes from using a validation set that is not representative of the test set. If the test set distribution is different from that of the validation set, doing some leaderboard probing to get mean values can help improve the score significantly.
+ A good practice for final submissions is to submit one model that performs well on the validation set (to cover cases where the test set and validation set have similar distributions) and one that performs well on the public leaderboard (to cover cases where the distribution of the test set is very different from that of the validation set).

It also helps to create a validation set that mimics the test set as closely as possible. A few examples:
+ if the test set asks you to predict three months in the future, your validation set should also predict three months after the end of your training set.
+ if the test set has different customers than the training set, then your validation set should not share any customers with your training set.
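
The two examples above can be sketched as follows. The monthly data and customer identifiers are hypothetical; the time-based split is done by hand, and the customer-based split uses `sklearn.model_selection.GroupKFold`.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 12 monthly rows (months 1..12) for each of 4 customers.
months = np.tile(np.arange(1, 13), 4)
customers = np.repeat(np.array(["a", "b", "c", "d"]), 12)
X = np.zeros((len(months), 1))  # placeholder features

# Time-based split: train on months 1-9, validate on the 3 months that follow.
train_time = np.where(months <= 9)[0]
valid_time = np.where(months > 9)[0]

# Group-based split: no customer appears on both sides of a fold.
gkf = GroupKFold(n_splits=4)
group_folds = list(gkf.split(X, groups=customers))
shared_customers = [
    set(customers[train_idx]) & set(customers[valid_idx])
    for train_idx, valid_idx in group_folds
]
print("customers shared between train and validation:", shared_customers)
```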

# II. Training Curves

TODO.


# III. Hyperparameter Tuning

Understanding the effect of each hyperparameter allows us to select the ones to tune in order to address either underfitting or overfitting. A good approach consists of overfitting the model first, then tuning it to find optimal parameters.

A few important things to note:
+ adding meaningful features and insights will bring far more value than a finely tuned model built on default features. It is best not to spend too much time on hyperparameters tuning too early.
+ it can take a few thousand rounds for models to fit.
+ it is good to average predictions from different seeds and/or small variations from optimal hyperparameter values.
+ tuning too many times on small datasets can lead to the [multiple comparisons fallacy](https://en.wikipedia.org/wiki/Multiple_comparisons_problem).
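
Averaging predictions over seeds, as suggested in the list above, might look like the sketch below. The random forest, the number of seeds and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train the same model with several seeds and collect class-1 probabilities.
seed_probs = []
for seed in range(5):
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    seed_probs.append(model.predict_proba(X_valid)[:, 1])

# Average the probabilities, then threshold once.
avg_prob = np.mean(seed_probs, axis=0)
avg_pred = (avg_prob >= 0.5).astype(int)
print("accuracy of the averaged prediction:", (avg_pred == y_valid).mean())
```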


## III.1. GBDT

There are three GBDT algorithms that will benefit from these techniques: XGBoost, LightGBM and CatBoost. Scikit-learn has RandomForest and ExtraTrees.

| XGBoost                             | LightGBM                                | RandomForest     | Description                                                                                                                                     | Impact | Good starting value |
|-------------------------------------|-----------------------------------------|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|--------|---------------------|
| max_depth                           | max_depth, num_leaves                   | max_depth        | Max depth of a tree and, for LightGBM, max number of leaves per tree. The optimal value is typically higher for Random Forests than for GBDT.   | +      | 7                   |
| subsample                           | bagging_fraction                        |                  | Fraction of objects to use when fitting a tree.                                                                                                 | +      |                     |
| colsample_bytree, colsample_bylevel | feature_fraction                        | max_features     | Fraction of features to use when fitting a tree.                                                                                                | +      |                     |
| min_child_weight, lambda, alpha     | min_data_in_leaf, lambda_l1, lambda_l2  | min_samples_leaf | Regularization parameters. min_child_weight has the biggest impact and is one of the most important hyperparameters.                            | -      | 1 - 300             |
| eta, num_boost_round                | learning_rate, num_iterations           | NA, n_estimators | Learning rate (similar to gradient descent) and number of learning steps (i.e. trees) to build.                                                 |        |                     |
| seed                                | *_seed                                  | random_state     | Random seed.                                                                                                                                    |        |                     |
|                                     |                                         | criterion        | For Random Forest Classifiers only. Either Gini or Entropy.                                                                                     |        |                     |
|                                     |                                         | n_jobs           | Number of CPU cores to use for training (default is 1).                                                                                         |        |                     |

**Max Depth**

Increasing the max depth will lead to longer training times, so it's better to do it only when necessary.

If increasing the depth of trees does not lead to overfitting, it means that there is a lot of important information to extract from the data. In this case, it might be useful to stop tuning and try to generate new features.

**Subsample**

Reducing the fraction of objects to use for each tree might reduce overfitting. It's akin to a regularization parameter.


**Learning Rate**

A large learning rate lets the model fit faster but makes it prone to overfitting, and a learning rate that is too large will not converge, so the model won't fit. A smaller learning rate leads to less overfitting, but a learning rate that is too small will learn nothing even after many rounds.

We can start by using a relatively small learning rate, say 0.1 or 0.01, and use early stopping to find the number of iterations it takes for the model to overfit. We can then divide the learning rate and multiply the number of rounds by the same amount to improve the model's performance.
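
The procedure above can be sketched with scikit-learn's `GradientBoostingClassifier`, used here as a stand-in for XGBoost or LightGBM so that the example has no extra dependency. The dataset, the early-stopping patience and the scaling factor of 2 are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Step 1: probe with a moderate learning rate and early stopping.
base_lr = 0.1
probe = GradientBoostingClassifier(
    learning_rate=base_lr,
    n_estimators=300,         # upper bound on the number of rounds
    validation_fraction=0.2,  # internal holdout used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
probe.fit(X, y)
rounds_found = probe.n_estimators_  # rounds actually built before stopping

# Step 2: divide the learning rate and multiply the rounds by the same factor.
factor = 2  # hypothetical scaling factor
final_lr = base_lr / factor
final_rounds = rounds_found * factor
final = GradientBoostingClassifier(
    learning_rate=final_lr, n_estimators=final_rounds, random_state=0
)
final.fit(X, y)
print(f"probe: lr={base_lr}, rounds={rounds_found}; "
      f"final: lr={final_lr}, rounds={final_rounds}")
```

The product of learning rate and number of rounds stays constant, so the final model takes smaller steps over a proportionally longer schedule.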


Unlike GBDT, which builds trees one after the other, Random Forests build trees independently. This means that having many trees does not lead to overfitting. So the first step is to identify the number of trees (`n_estimators`) that is sufficient for the problem at hand.
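
One way to look for a sufficient number of trees is to grow the same forest step by step with `warm_start=True` and watch the validation score level off. The tree counts and the synthetic data below are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# warm_start=True keeps the trees already built and only fits the new ones.
forest = RandomForestClassifier(warm_start=True, random_state=0)
tree_counts = [10, 25, 50, 100, 200]
valid_scores = []
for n_trees in tree_counts:
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_train, y_train)
    valid_scores.append(forest.score(X_valid, y_valid))

for n_trees, score in zip(tree_counts, valid_scores):
    print(f"{n_trees:>4} trees: validation accuracy = {score:.3f}")
```

Once the score stops improving, adding more trees mostly costs training time.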

Additional resources:
+ [Tuning the hyper-parameters of an estimator (sklearn)](http://scikit-learn.org/stable/modules/grid_search.html).
+ [Optimizing hyperparameters with hyperopt](http://fastml.com/optimizing-hyperparams-with-hyperopt).
+ [Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/).


## III.2. Neural Nets

**Layers & Neurons**

Increasing the number of neurons per layer allows the network to learn more complex decision boundaries, but it risks overfitting. The same happens when increasing the number of layers, with the additional risk that training fails to converge.


**Optimization Method**

Selecting an optimization method is often crucial:

+ Stochastic Gradient Descent (SGD) with momentum: converges more slowly but with less risk of overfitting.
+ Adaptive methods (Adam, Adagrad, Adadelta...): converge faster but with more risk of overfitting.


**Batch Size**

Large batch sizes typically lead to more overfitting. A good rule of thumb is to start with either 32 or 64, then reduce the batch size if the model still overfits. 

Note that:
+ batch sizes that are too small make the gradient too noisy.
+ for the same number of epochs, a model with a smaller batch size is updated more often, which makes it take longer to train.
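
The second point is simple arithmetic: the number of weight updates per epoch is the dataset size divided by the batch size, rounded up. The dataset size below is a hypothetical value.

```python
import math


def updates_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of gradient updates needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


n_samples = 50_000  # hypothetical training set size
for batch_size in (32, 64, 256):
    print(f"batch size {batch_size}: "
          f"{updates_per_epoch(n_samples, batch_size)} updates per epoch")
```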

**Learning Rate**

If the learning rate is too high, the network will never converge, and if it's too small, it will take forever to learn. A good approach is to start with a very high learning rate like 0.1, then slowly decrease it until the network starts to converge.

Batch size and learning rates are correlated: a good rule of thumb when increasing the batch size by a factor of $\alpha$ is to also increase the learning rate by the same factor.
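
The rule of thumb can be written as a one-line helper. The function name and the example values are hypothetical.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Scale the learning rate by the same factor alpha as the batch size."""
    alpha = new_batch_size / base_batch_size
    return base_lr * alpha


# Going from a batch size of 32 to 128 (alpha = 4) multiplies the learning rate by 4.
new_lr = scale_learning_rate(base_lr=0.01, base_batch_size=32, new_batch_size=128)
print(new_lr)
```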

**Regularization**

L1 and L2 methods were commonly applied to neural network weights, but the most common regularization method today is dropout. Note that applying a dropout layer just after the data layer is not recommended, as the network will lose some information completely.
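
For intuition, dropout on a hidden layer can be sketched in NumPy. This uses the "inverted dropout" variant, which rescales the kept units during training; that choice, the helper name and the 0.5 rate are assumptions for illustration, and deep learning frameworks provide their own dropout layers.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is disabled at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))  # stand-in for hidden-layer activations
dropped = dropout(hidden, rate=0.5, rng=rng)
print("fraction of units zeroed:", (dropped == 0).mean())
```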



## III.3. Linear Models

Support Vector Machine models require almost no tuning; the parameter C is inversely proportional to the regularization strength. A good rule of thumb is to start with very small values (10e-6), then increase by a factor of 10 at each iteration (more details [here](https://stats.stackexchange.com/questions/31066/what-is-the-influence-of-c-in-svms-with-linear-kernel)). Note that training time typically increases as the value of C increases.
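
A sweep over C in powers of 10 might look like the sketch below. The grid bounds, the synthetic data, the feature scaling step and the use of `LinearSVC` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Start very small and multiply C by 10 at each step.
c_values = np.logspace(-6, 2, num=9)
cv_scores = {}
for c in c_values:
    # dual=False solves the primal problem, suited to n_samples > n_features.
    model = make_pipeline(StandardScaler(), LinearSVC(C=c, dual=False))
    cv_scores[c] = cross_val_score(model, X, y, cv=5).mean()

best_c = max(cv_scores, key=cv_scores.get)
for c, score in cv_scores.items():
    print(f"C={c:.0e}: mean CV accuracy = {score:.3f}")
print(f"best C: {best_c:.0e}")
```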

Linear models mostly use L1 and L2 regularization. A good rule of thumb is also to start with very small values. Note that L1 can be used for feature selection because it produces sparse weights.
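
The sparsity effect of L1 can be seen by comparing `Lasso` (L1) with `Ridge` (L2) on synthetic data where only a few features are informative. The data and the `alpha` values are assumptions; note that `alpha` here is a regularization weight, so larger means stronger regularization.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, of which only 5 actually influence the target.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=1.0, random_state=0
)

lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty

# Count the coefficients driven exactly to zero by each penalty.
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print(f"zero coefficients: Lasso={lasso_zeros}, Ridge={ridge_zeros}")
```

Features whose Lasso coefficient is exactly zero are candidates for removal.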
