# TOC

+ Bias vs Variance
+ Validation
+ Training Curves
+ Hyperparameter Tuning
+ Ensembling


# I. Validation

Using a validation set will limit the risk of overfitting our model on the training set. Once we are satisfied with a given model, we can retrain it on the entire dataset with the same hyperparameters.

Of particular interest will be how the sample data was extracted from the database: was it random, or were some classes over-sampled to produce a more balanced dataset? This is crucial for setting up a proper validation scheme: when splitting the training set to create the validation set, we want the resulting split to be as close as possible to the split between the training and test sets.

The scikit-learn documentation has a section dedicated to [validation](https://scikit-learn.org/stable/modules/cross_validation.html).

## I.1. Validation Methods

How to split / what ratios to use:
+ holdout (`sklearn.model_selection.ShuffleSplit`): split the dataset in two and train only on one part. This is a good choice when the dataset is large or when the model score is likely to be consistent across splits.
+ K-fold (`sklearn.model_selection.KFold`): split the dataset into K subsets and train K times, each time on K-1 subsets while validating on the remaining one. This method ensures that a given sample is used for validation only once. The final score is the average over the K folds. This is a good choice when the dataset is of medium size.
+ Leave-one-out (`sklearn.model_selection.LeaveOneOut`): K-fold where K is equal to the number of samples in the training set. This is a good choice when the dataset is small and the models are relatively fast to train. 
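
The three schemes above can be compared with a minimal sketch. The synthetic dataset, the `Ridge` model and the split ratios are assumptions chosen only for illustration; the point is how many fits each scheme requires.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, cross_val_score

# Synthetic data, purely for illustration.
X, y = make_regression(n_samples=60, n_features=5, noise=10.0, random_state=0)
model = Ridge(alpha=1.0)

splitters = {
    "holdout": ShuffleSplit(n_splits=1, test_size=0.25, random_state=0),
    "k-fold": KFold(n_splits=5, shuffle=True, random_state=0),
    "leave-one-out": LeaveOneOut(),
}

n_fits = {}
for name, cv in splitters.items():
    # One score per split; MSE is used because R^2 is undefined on a single sample.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    n_fits[name] = len(scores)
    print(f"{name:>14}: {len(scores):2d} fits, mean MSE = {-scores.mean():.2f}")
```

Leave-one-out needs as many fits as there are samples, which is why it is only practical for small datasets and fast models.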

_Note: it is important to ensure that there is no overlap between training and validation sets, which could occur if there are duplicate samples. In this case, the accuracy on the validation set risks being incorrectly high._

_Note: with K-fold and LOO, you can also estimate the mean and variance of the loss. This can be used to measure whether improvements are statistically significant._

For classification datasets that are either small or have a large number of categories, it is good practice to use stratification: this method ensures that the distribution of classes is similar across folds.
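
As a small sketch of stratification, the snippet below uses `StratifiedKFold` on hypothetical imbalanced labels (90 samples of class 0 and 10 of class 1, an assumption made for illustration) and records the minority-class share in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 90 samples of class 0 and 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
minority_share = []
for train_idx, valid_idx in skf.split(X, y):
    # Fraction of class-1 samples in this validation fold.
    minority_share.append(float(y[valid_idx].mean()))

print(minority_share)
```

Each validation fold keeps the overall 10% minority share, which a plain shuffled K-fold does not guarantee.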

## I.2. Inconsistent Scores 

There are two main reasons for observing vastly different scores for our different folds:
+ the data has clear patterns but too few samples. The model will not be able to generalize these patterns well: each fold trains on slightly different patterns, which can lead to vastly different scores.
+ the data is inconsistent. In this case, where the variance is high, the model will struggle to generalize.

Running several K-folds with different seeds can help get a better estimate of the model's performance. It can also be helpful to adjust hyperparameters with one seed and estimate performance with another one.
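
A minimal sketch of this idea, assuming a synthetic regression dataset and a `Ridge` model (both placeholders): the same 5-fold procedure is repeated with several shuffling seeds, and the spread of the per-seed means indicates how much of a score difference is just split noise.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=80, n_features=5, noise=15.0, random_state=0)
model = Ridge(alpha=1.0)

seed_means = []
for seed in range(5):
    # Same model, same data, different shuffling of the folds.
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    seed_means.append(scores.mean())

seed_means = np.array(seed_means)
print(f"MSE over seeds: {seed_means.mean():.2f} +/- {seed_means.std():.2f}")
```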


## I.3. Competition-specific Steps
A few extra steps can be taken during competitions to compare validation scores with leaderboard scores:
+ Ideally, improving your validation score leads to improving your leaderboard score.
+ Sometimes, the validation score is very different from the leaderboard one. This usually comes from using a validation set that is not representative of the test set. If the test set distribution differs from that of the validation set, doing some leaderboard probing to get mean values can help improve the score significantly.
+ A good practice for final submissions is to submit one model that performs well on the validation set (to cover cases where the test set and validation set have similar distributions) and one that performs well on the public leaderboard (to cover cases where the distribution of the test set is very different from that of the validation set).

It also helps to create a validation set that mimics the test set as closely as possible. A few examples:
+ if the test set asks you to predict three months in the future, your validation set should also predict three months after the end of your training set.
+ if the test set has different customers than the training set, then your validation set should not share any customers with your training set.
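
Both examples can be sketched with a few lines. The `month` and `customer` arrays below are hypothetical columns invented for illustration: the first split validates on the three months that follow the training period, and the second uses `GroupKFold` so that no customer appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 12 rows, each with a month index and a customer id.
month = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6])
customer = np.array([0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 4, 5])
X = np.arange(len(month)).reshape(-1, 1)

# Time-based holdout: train on months 1-3, validate on the next three months.
train_idx = np.where(month <= 3)[0]
valid_idx = np.where(month > 3)[0]

# Group-based folds: a customer is never in both training and validation.
gkf = GroupKFold(n_splits=3)
shared = []
for tr, va in gkf.split(X, groups=customer):
    shared.append(set(customer[tr]) & set(customer[va]))

print(shared)  # one (empty) set of shared customers per fold
```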

# II. Training Curves

TODO.


# III. Hyperparameter Tuning

Understanding the effect of each hyperparameter allows us to select the ones to tune in order to address either underfitting or overfitting. A good approach consists of overfitting the model first, then tuning the model to find optimal parameters.

A few important things to note:
+ adding meaningful features and insights will bring far more value than a finely tuned model built on default features. It is best not to spend too much time on hyperparameter tuning too early.
+ it can take a few thousand rounds for models to fit.
+ it is good to average predictions from different seeds and/or small variations around the optimal hyperparameter values.
+ tuning too many times on small datasets can lead to the [multiple comparisons fallacy](https://en.wikipedia.org/wiki/Multiple_comparisons_problem).


## III.1. GBDT

There are three GBDT algorithms that will benefit from these techniques: XGBoost, LightGBM and CatBoost. Scikit-learn also provides two tree ensembles, RandomForest and ExtraTrees.

| XGBoost | LightGBM | RandomForest | Description | Impact | Good starting value |
|---------|----------|--------------|-------------|--------|---------------------|
| max_depth | max_depth, num_leaves | max_depth | Maximum depth of a tree and, for LightGBM, maximum number of leaves per tree. The optimal value is typically higher for Random Forests than for GBDT. | + | 7 |
| subsample | bagging_fraction | | Fraction of objects (rows) to use when fitting a tree. | + | |
| colsample_bytree, colsample_bylevel | feature_fraction | max_features | Fraction of features to use when fitting a tree. | + | |
| min_child_weight, lambda, alpha | min_data_in_leaf, lambda_l1, lambda_l2 | min_samples_leaf | Regularization parameters. min_child_weight has the biggest impact and is one of the most important hyperparameters. | - | 1 - 300 |
| eta, num_rounds | learning_rate, num_iterations | NA, n_estimators | Learning rate (similar to gradient descent) and number of learning steps (i.e. trees) to build. | | |
| seed | *_seed | random_state | Random seed. | | |
| | | criterion | For Random Forest Classifiers only. Either Gini or Entropy. | | |
| | | n_jobs | Number of CPU cores to use for training (default is 1). | | |

**Max Depth**

Increasing the max depth will lead to longer training times, so it's better to do it only when necessary.

If increasing the depth of the trees does not lead to overfitting, it means that there is a lot of important information to extract from the data. In this case, it might be useful to stop tuning and try to generate new features.

**Subsample**

Reducing the fraction of objects to use for each tree might reduce overfitting. It's akin to a regularization parameter.


**Learning Rate**

Large learning rates lead the model to fit faster but are prone to overfitting, and a learning rate that is too large will not converge at all, so the model won't fit. Smaller learning rates lead to less overfitting, but a learning rate that is too small will learn nothing even after many rounds.

We can start by using a relatively small learning rate, say 0.1 or 0.01, and use early stopping to find the number of iterations it takes for the model to overfit. We can then divide the learning rate and multiply the number of rounds by the same amount to improve the model's performance.
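
The procedure can be sketched as follows. This is an illustration under assumptions: it uses scikit-learn's `GradientBoostingRegressor` (not XGBoost or LightGBM) with its built-in early stopping on an internal validation split, a synthetic dataset, and an arbitrary scaling `factor` of 2.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)

# Step 1: moderate learning rate, generous round budget, early stopping
# on an internal validation split (20% of the training data).
probe = GradientBoostingRegressor(
    learning_rate=0.1,
    n_estimators=300,
    validation_fraction=0.2,
    n_iter_no_change=10,
    random_state=0,
)
probe.fit(X, y)
rounds = probe.n_estimators_  # number of trees actually kept by early stopping

# Step 2: divide the learning rate and multiply the rounds by the same factor.
factor = 2  # assumption: any moderate factor can be tried
final = GradientBoostingRegressor(
    learning_rate=0.1 / factor,
    n_estimators=rounds * factor,
    random_state=0,
)
final.fit(X, y)
print(rounds, final.n_estimators)
```

XGBoost and LightGBM expose their own early-stopping options; the logic of the two steps stays the same.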


Contrary to GBDT, which builds trees one after the other, Random Forests build trees independently. This means that having many trees does not lead to overfitting, so the first step is to identify the number of trees (`n_estimators`) that is sufficient for the problem at hand.
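
One way to find a sufficient number of trees is to grow the forest incrementally and watch where the validation error flattens. The sketch below assumes a synthetic dataset and an arbitrary list of tree counts; it relies on scikit-learn's `warm_start` option so that each call to `fit` only adds the missing trees.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestRegressor(warm_start=True, random_state=0)
history = {}
for n_trees in (10, 25, 50, 100, 200):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_tr, y_tr)  # with warm_start, only the additional trees are fit
    history[n_trees] = mean_squared_error(y_va, forest.predict(X_va))

for n_trees, mse in history.items():
    print(f"{n_trees:4d} trees: validation MSE = {mse:.1f}")
```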

Additional resources:
+ [Tuning the hyper-parameters of an estimator (sklearn)](http://scikit-learn.org/stable/modules/grid_search.html).
+ [Optimizing hyperparameters with hyperopt](http://fastml.com/optimizing-hyperparams-with-hyperopt).
+ [Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/).


## III.2. Neural Nets

**Layers & Neurons**

Increasing the number of neurons per layer allows the network to learn more complex decision boundaries, but it also increases the risk of overfitting. The same happens when increasing the number of layers, with the additional risk that training fails to converge.


**Optimization Method**

Selecting an optimization method is often crucial:

+ Stochastic Gradient Descent (SGD) with momentum. Converges more slowly but carries less risk of overfitting.
+ Adaptive methods: Adam, Adagrad, Adadelta... Converge faster but carry more risk of overfitting.


**Batch Size**

Large batch sizes typically lead to more overfitting. A good rule of thumb is to start with either 32 or 64, then reduce the batch size if the model still overfits. 

Note that:
+ too small batch sizes lead to the gradient being too noisy.
+ for the same number of epochs, a model with a smaller batch size is updated more often, which makes it take longer to train.

**Learning Rate**

If the learning rate is too high, the network will never converge, and if it is too small, it will take forever to learn. A good approach is to start with a very high learning rate like 0.1, then slowly decrease it until the network starts to converge.

Batch size and learning rates are correlated: a good rule of thumb when increasing the batch size by a factor of $\alpha$ is to also increase the learning rate by the same factor.
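
As a tiny worked example of this rule of thumb (the helper name `scale_learning_rate` and the numbers are hypothetical), going from a batch size of 32 to 128 gives $\alpha = 4$, so a learning rate of 0.01 becomes 0.04:

```python
def scale_learning_rate(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Scale the learning rate by the same factor as the batch size."""
    alpha = new_batch / base_batch
    return base_lr * alpha


new_lr = scale_learning_rate(base_lr=0.01, base_batch=32, new_batch=128)
print(new_lr)
```

This is only a starting point; the scaled value still needs to be checked on a validation set.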

**Regularization**

L1 and L2 methods were commonly applied to neural network weights, but the most common regularization method today is dropout. Note that applying a dropout layer just after the data layer is not recommended, as the network will lose some information completely.



## III.3. Linear Models

Support Vector Machine models require almost no tuning; the parameter C is inversely proportional to the regularization weight. A good rule of thumb is to start with very small values (10e-6), then increase by a factor of 10 at each iteration (more details [here](https://stats.stackexchange.com/questions/31066/what-is-the-influence-of-c-in-svms-with-linear-kernel)). Note that training time typically increases as the value of C increases.
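
A minimal sketch of this sweep, assuming a synthetic classification dataset and scikit-learn's `LinearSVC`; the grid bounds (seven values from 1e-6 to 1) are illustrative choices, not a recommendation from the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Start very small and multiply C by 10 at each step.
C_values = [10.0 ** k for k in range(-6, 1)]

results = {}
for C in C_values:
    clf = LinearSVC(C=C, max_iter=10000)
    results[C] = cross_val_score(clf, X, y, cv=5).mean()  # mean accuracy

best_C = max(results, key=results.get)
for C, acc in results.items():
    print(f"C = {C:g}: accuracy = {acc:.3f}")
print("best C:", best_C)
```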

Linear models mostly use L1 and L2 regularization methods. A good rule of thumb is also to start with very small values. Note that L1 can be used for feature selection because it produces sparse weights.
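
The sparsity effect of L1 can be seen on a small synthetic problem. In this sketch (dataset and `alpha` values are assumptions), only 5 of the 20 features carry signal; the L1-penalized `Lasso` sets some coefficients exactly to zero, while the L2-penalized `Ridge` only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, of which only 5 are informative.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("zero coefficients with L1:", n_zero_lasso)
print("zero coefficients with L2:", n_zero_ridge)
```

Features whose L1 coefficient is exactly zero are candidates for removal.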


# IV. Ensembling

Ensembling is the practice of combining the predictions of several models to improve their performance. The simplest method is taking the average predictions (either simple, weighted or conditional weighted). More advanced methods include bagging, boosting and stacking.
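
The three averaging variants can be sketched with toy numbers. The predictions, the weights and the rule used for conditional averaging (use model A where a hypothetical `feature` is below 50, model B otherwise) are all assumptions made for illustration.

```python
import numpy as np

# Predictions of two hypothetical models on three samples.
preds_a = np.array([0.2, 0.8, 0.5])
preds_b = np.array([0.4, 0.6, 0.9])

# Simple average: both models count equally.
simple_avg = (preds_a + preds_b) / 2

# Weighted average: model A is trusted more than model B.
weighted_avg = np.average([preds_a, preds_b], axis=0, weights=[0.7, 0.3])

# Conditional average: pick the model depending on a feature value.
feature = np.array([10, 60, 30])
conditional = np.where(feature < 50, preds_a, preds_b)

print(simple_avg, weighted_avg, conditional)
```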

## IV.1. Bagging

Bagging means averaging slightly different versions of the same model in order to reduce variance; a common example of bagging is random forests. In bagging, each model is called a bag.

Here are common bagging methods:
+ using different seeds.
+ rows subsampling / bootstrapping (random sampling with replacement).
+ columns subsampling.
+ tweaking hyperparameters.


```python
# example code (train, y and test are defined in the data cell of the stacking example below)
import numpy as np
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
bags = 10
seed = 1

# create array to hold bagged predictions
bagged_predictions = np.zeros(len(test))

# loop over each bag, changing only the random seed
for n in range(bags):
    model.set_params(random_state=seed + n)
    model.fit(train, y)
    preds = model.predict(test)
    bagged_predictions += preds

# final predictions (average of all predictions)
bagged_predictions /= bags
```


## IV.2. Boosting

Boosting means training models sequentially, one model focusing its efforts on areas where the previous one didn't perform so well. There are two main methods:

+ weight-based: assigning weights to put more emphasis on rows where the error was high (requires models that take weights as inputs - see AdaBoost).
+ residual-based: training each model on the residuals of the previous one; this is the way GBDT works.

At each step, the prediction becomes: $\text{prediction}_N = \text{pred}_0 + \eta \cdot \text{pred}_1 + \dots + \eta \cdot \text{pred}_N$, where $\eta$ (eta) is the learning rate.
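
Residual-based boosting can be written in a few lines. This is a bare-bones sketch for squared error, not how XGBoost or LightGBM are implemented: it assumes that the initial prediction is the mean of the target, uses shallow scikit-learn decision trees as weak learners, and picks `eta` and `n_rounds` arbitrarily.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

eta = 0.1       # learning rate
n_rounds = 50   # number of trees

# Initial prediction: a constant baseline (assumption: the target mean).
prediction = np.full(len(y), y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    residuals = y - prediction                 # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    prediction += eta * tree.predict(X)        # add a damped correction
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"training MSE: {initial_mse:.1f} -> {final_mse:.1f}")
```

The training error shrinks at every round, which is exactly why a validation set and early stopping are needed to decide when to stop.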

Here are common boosting methods:
+ tweaking the learning rate.
+ rows subsampling / bootstrapping (random sampling with replacement).
+ columns subsampling.

Here are a few common implementations of the residual-based boosting:
+ XGBoost
+ LightGBM
+ H2O GBM
+ CatBoost
+ Scikit-learn GBM


## IV.3. Stacking

**Concept**

Stacking starts with several models fit to a training dataset. Their predictions on a validation set are used as features ("predictions stacking") to fit a final meta-model against the validation targets; the resulting weights allow us to combine the base learners on any unseen dataset.

The meta-model doesn't have to be very sophisticated; a linear model will usually give good results.

_Note: there are several tools to help with several layers of ensembling: StackNet, Xcessiv, etc._


**Tips**

The models used for stacking must be diverse and bring different inputs:
+ different algorithms. For instance:
    + 2-3 GBDT from different implementations (LightGBM, XGBoost, CatBoost, H2O) with different depths & hyperparameter values.
    + 2-3 neural nets (pytorch, keras) with different hidden layers (1/2/3).
    + 1-2 ExtraTrees/RandomForest (scikit-learn).
    + 1-2 linear models like ridge or linear SVM (scikit-learn).
    + 1-2 KNN (scikit-learn).
    + 1 factorization machine (libFM).
      
+ different sets of features and transformations of the raw data. For instance:
    + categorical data: one-hot/label/target encoding, frequency, etc.
    + numerical: outliers, binning, derivatives, percentiles, etc.
    + interactions between features: col1+col2/col1*col2, grouping & averaging by categorical value, etc.
    + unsupervised learning: K-means, PCA, etc.


For any layer after the first one, use only simple (shallow) models:
+ GBDT with a depth of 2 or 3.
+ linear models with high regularization.
+ neural nets with only one hidden layer.
+ KNN with Bray-Curtis distance.

_Note: for time series, the timeline must be preserved for stacking to work well. This means training before validation before test._


```python
# train is the training data
train = [
    [0.624,0.583],[0.321,0.016],[0.095,0.285],[0.235,0.573],[0.027,0.146],[0.514,0.06],[0.612,0.083],[0.054,0.597],
    [0.344,0.108],[0.671,0.089],[0.486,0.583],[0.829,0.929],[0.451,0.018],[0.448,0.286],[0.26,0.914],[0.893,0.706],
    [0.951,0.487],[0.811,0.075],[0.65,0.505],[0.902,0.24],[0.312,0.617],[0.907,0.844],[0.629,0.194],[0.333,0.559],
    [0.98,0.983],[0.87,0.706],[0.611,0.623],[0.463,0.097],[0.957,0.507],[0.341,0.792],[0.384,0.482],[0.584,0.655],
    [0.446,0.454],[0.314,0.396],[0.061,0.712],[0.951,0.691],[0.71,0.444],[0.238,0.809],[0.943,0.874],[0.325,0.619],
    [0.438,0.146],[0.131,0.055],[0.884,0.083],[0.306,0.641],[0.071,0.32],[0.765,0.402],[0.321,0.584],[0.714,0.444],
    [0.533,0.811],[0.644,0.293],[0.403,0.579],[0.278,0.577],[0.888,0.902],[0.99,0.182],[0.212,0.072],[0.692,0.386],
    [0.919,0.318],[0.082,0.234],[0.99,0.597],[0.867,0.371],[0.158,0.154],[0.304,0.826],[0.088,0.638],[0.382,0.87],
    [0.491,0.75],[0.155,0.731],[0.291,0.494],[0.76,0.304],[0.602,0.904],[0.512,0.713],[0.28,0.626],[0.99,0.566],
    [0.26,0.613],[0.312,0.561],[0.84,0.695],[0.112,0.245],[0.701,0.479],[0.974,0.103],[0.507,0.188],[0.583,0.586],
    [0.965,0.96],[0.112,0.007],[0.018,0.752],[0.063,0.967],[0.456,0.024],[0.214,0.107],[0.086,0.352],[0.892,0.356],
    [0.533,0.533],[0.276,0.241],[0.514,0.363],[0.241,0.765],[0.829,0.821],[0.73,0.54],[0.136,0.635],[0.431,0.248],
    [0.288,0.259],[0.008,0.663],[0.856,0.954],[0.579,0.972]
]

# y is the target variable for the train data
y = [
    19.444,4.992,2.11,13.708,0.805,8.619,10.21,11.443,5.624,11.092,17.783,19.544,7.135,7.411,10.633,19.133,23.157,
    12.794,18.729,14.939,9.564,19.852,10.028,14.571,22.387,18.742,14.089,8.017,23.241,11.613,14.473,13.893,15.127,
    12.062,5.968,19.842,18.636,9.556,21.748,9.319,7.09,2.423,14.594,9.14,6.539,18.883,15.257,18.729,14.114,10.36,
    16.359,14.56,19.991,15.411,3.763,17.696,19.243,1.937,26.017,19.858,2.971,10.657,5.808,11.943,13.442,7.663,
    13.304,17.375,15.677,13.589,8.728,25.399,8.824,14.796,18.179,2.405,19.424,15.835,8.254,19.661,21.516,1.764,
    5.845,8.68,7.262,3.478,7.544,19.942,17.491,4.847,14.444,9.097,18.873,20.894,7.049,7.103,5.038,5.026,20.896,16.037
]

# test is the test data
test = [
    [0.463,0.496],[0.45,0.365],[0.131,0.283],[0.015,0.827],[0.076,0.302],[0.092,0.356],[0.765,0.039],[0.94,0.767],[0.413,0.343],
    [0.484,0.155],[0.464,0.695],[0.574,0.767],[0.81,0.848],[0.888,0.317],[0.802,0.776],[0.197,0.417],[0.076,0.9],[0.071,0.248],
    [0.377,0.356],[0.523,0.538],[0.282,0.151],[0.299,0.342],[0.171,0.879],[0.125,0.123],[0.38,0.554],[0.138,0.919],[0.984,0.361],
    [0.07,0.95],[0.674,0.511],[0.514,0.808],[0.808,0.83],[0.573,0.622],[0.719,0.961],[0.479,0.144],[0.158,0.708],[0.365,0.306],
    [0.704,0.963],[0.959,0.614],[0.36,0.8],[0.937,0.178],[0.412,0.69],[0.145,0.122],[0.386,0.832],[0.419,0.622],[0.908,0.44],
    [0.139,0.227],[0.57,0.852],[0.322,0.763],[0.407,0.94],[0.972,0.735],[0.027,0.671],[0.875,0.533],[0.117,0.829],[0.837,0.725],
    [0.963,0.674],[0.065,0.641],[0.271,0.693],[0.845,0.423],[0.332,0.341],[0.548,0.883],[0.979,0.094],[0.806,0.249],[0.924,0.513],
    [0.564,0.971],[0.768,0.098],[0.258,0.096],[0.365,0.811],[0.241,0.83],[0.636,0.481],[0.583,0.037],[0.408,0.535],[0.147,0.737],
    [0.027,0.452],[0.871,0.599],[0.774,0.614],[0.563,0.268],[0.573,0.424],[0.902,0.863],[0.274,0.253],[0.312,0.135],[0.435,0.416],
    [0.973,0.094],[0.541,0.022],[0.501,0.773],[0.18,0.936],[0.253,0.042],[0.354,0.242],[0.268,0.671],[0.253,0.382],[0.488,0.956],
    [0.081,0.715],[0.786,0.647],[0.813,0.999],[0.967,0.846],[0.3,0.26],[0.06,0.658],[0.366,0.988],[0.397,0.978],[0.535,0.935]
]

# ytest is the target variable for the test data
ytest = [
    15.828,13.165,2.805,6.464,6.338,7.868,11.866,19.99,12.631,7.707,12.534,14.495,19.236,19.504,17.825,10.311,7.744,1.643,12.255,
    17.601,4.874,10.789,9.428,2.327,15.502,8.762,21.087,8.585,19.251,13.568,19.096,13.478,18.991,7.625,7.578,10.835,18.129,19.543,
    11.518,15.077,11.418,2.585,12.003,11.192,22.111,2.727,15.859,10.681,13.344,20.361,5.576,23.465,8.48,18.288,20.473,5.871,9.61,
    19.963,10.884,15.108,15.787,13.071,23.744,16.452,12.334,4.794,11.67,9.674,18.095,9.58,15.796,7.597,8.211,24.461,16.518,9.306,
    16.248,20.532,4.874,5.217,14.104,15.786,8.479,13.659,9.66,4.149,6.156,9.4,10.784,14.966,6.84,16.894,19.311,21.661,5.455,
    6.063,13.329,13.911,15.998
]
```


```python
import numpy as np
from sklearn.metrics import mean_squared_error        # the metric to test
from sklearn.ensemble import RandomForestRegressor    # import model
from sklearn.linear_model import LinearRegression     # import model
from sklearn.model_selection import train_test_split  # split the training data

# train is the training data
# y is the target variable for the train data
# test is the test data
# ytest is the target variable for the test data

# split the train data into 2 parts: training and validation
training, valid, ytraining, yvalid = train_test_split(
    train, y,
    test_size=0.5,
    random_state=2
)

# predictions
valid_predictions = []
test_predictions = []

# models
base_learners = [
    RandomForestRegressor(random_state=2),
    LinearRegression()
]

# fit & predict for each model
for model in base_learners:
    # fit model
    model.fit(training, ytraining)
    # predict validation set
    preds = model.predict(valid)
    valid_predictions.append(preds)
    # predict test set
    test_preds = model.predict(test)
    test_predictions.append(test_preds)
    # print performance on the test set
    print('MSE of base learner:  {: .3f}'.format(mean_squared_error(ytest, test_preds)))

# create new datasets by stacking predictions column-wise
stacked_predictions = np.column_stack(valid_predictions)
stacked_test_predictions = np.column_stack(test_predictions)

# fit meta model on stacked validation set
meta_model = LinearRegression()
meta_model.fit(stacked_predictions, yvalid)

# apply meta model to stacked test set
final_predictions = meta_model.predict(stacked_test_predictions)
print('MSE of stacked model: {: .3f}'.format(mean_squared_error(ytest, final_predictions)))
```

```
MSE of base learner:   3.237
MSE of base learner:   4.511
MSE of stacked model:  2.650
```


## IV.4. Second-level models

_Note: all predictions are used as meta-features._

**Simple holdout scheme**

+ Holdout Scheme: split train data into three parts
  + partA (first-level training).
  + partB (second-level training).
  + partC (second-level validation).

+ Holdout Meta-Features
  + Fit N diverse models on partA.
  + Predict for partB and partC: \[partB_meta, partC_meta\].

+ Meta Holdout
  + Fit a metamodel to \[partB_meta\].
  + Validate its hyperparameters on \[partC_meta\].

+ Test Meta-Features & Predictions
  + Fit N diverse models on partA.
  + Predict for test data: \[test_meta\].
  + Fit metamodel to \[partB_meta, partC_meta\].
  + Predict for \[test_meta\].
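
The simple holdout scheme can be sketched end to end. Everything concrete here is an assumption for illustration: the synthetic data, the two base models, the `Ridge` metamodel and the small grid of `alpha` values validated on partC.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Holdout scheme: partA (level 1), partB (level-2 training), partC (level-2 validation).
X_a, X_bc, y_a, y_bc = train_test_split(X_train, y_train, test_size=0.5, random_state=0)
X_b, X_c, y_b, y_c = train_test_split(X_bc, y_bc, test_size=0.5, random_state=0)

# Fit the diverse first-level models on partA only.
base_models = [RandomForestRegressor(n_estimators=50, random_state=0), LinearRegression()]
for m in base_models:
    m.fit(X_a, y_a)

def meta_features(models, data):
    """One column of predictions per first-level model."""
    return np.column_stack([m.predict(data) for m in models])

partB_meta = meta_features(base_models, X_b)
partC_meta = meta_features(base_models, X_c)
test_meta = meta_features(base_models, X_test)

# Meta holdout: fit on partB_meta, validate the hyperparameter on partC_meta.
alphas = [0.01, 1.0, 100.0]
scores = {
    a: mean_squared_error(y_c, Ridge(alpha=a).fit(partB_meta, y_b).predict(partC_meta))
    for a in alphas
}
best_alpha = min(scores, key=scores.get)

# Final metamodel: refit on partB_meta and partC_meta, then predict the test set.
meta_model = Ridge(alpha=best_alpha).fit(
    np.vstack([partB_meta, partC_meta]), np.concatenate([y_b, y_c])
)
final_predictions = meta_model.predict(test_meta)
```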


**Meta holdout scheme with OOF meta-features**

+ OOF Meta-Features
  + Split train data into K-folds. 
  + Fit N diverse models on K-1 folds (loop).
  + Predict for Kth fold.
  + Aggregate OOF predictions to \[train_meta\].

+ Meta Holdout
  + Split \[train_meta\] in two parts: \[train_metaA\] and \[train_metaB\].
  + Fit a metamodel to \[train_metaA\].
  + Validate its hyperparameters on \[train_metaB\].

+ Test Meta-Features & Predictions
  + Fit N diverse models on the entire training data.
  + Predict for test data: \[test_meta\].
  + Fit metamodel to \[train_meta\].
  + Predict for \[test_meta\].
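
A possible sketch of this scheme uses scikit-learn's `cross_val_predict` to build the out-of-fold (OOF) meta-features. The data, base models, number of folds and the 70/30 meta holdout split are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict, train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

base_models = [RandomForestRegressor(n_estimators=50, random_state=0), LinearRegression()]
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# OOF meta-features: every training row is predicted by a model that never saw it.
train_meta = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=cv) for m in base_models]
)

# Meta holdout: fit on train_metaA, validate on train_metaB.
meta_A, meta_B, y_A, y_B = train_test_split(train_meta, y_train, test_size=0.3, random_state=0)
holdout_mse = mean_squared_error(y_B, LinearRegression().fit(meta_A, y_A).predict(meta_B))

# Test meta-features: refit the base models on the entire training data.
test_meta = np.column_stack(
    [m.fit(X_train, y_train).predict(X_test) for m in base_models]
)

# Final metamodel on all of train_meta, applied to test_meta.
meta_model = LinearRegression().fit(train_meta, y_train)
final_predictions = meta_model.predict(test_meta)
print(f"meta holdout MSE: {holdout_mse:.1f}")
```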


**Meta KFold scheme with OOF meta-features**

_Same OOF Meta-Features as above_

+ Meta KFold:
  + Use KFold on \[train_meta\] to validate hyperparameters.
  + A common practice is to use the same seed for first- and second-level KFolds.


**Holdout scheme with OOF meta-features**

_Holdout Scheme for second-level validation_

+ Holdout Scheme: Split train data into two parts
  + partA (first- and second-level training).
  + partB (second-level validation).

+ OOF Meta-Features
  + Split partA into K-folds. 
  + Fit N diverse models on K-1 folds (loop).
  + Predict for Kth fold.
  + Aggregate OOF predictions to \[partA_meta\].

+ Holdout Meta-Features
  + Fit N diverse models on partA.
  + Predict for partB: \[partB_meta\].

+ Meta Holdout
  + Fit a metamodel to \[partA_meta\].
  + Validate its hyperparameters on \[partB_meta\].

_OOF Meta-Features for Final Predictions_

+ OOF Meta-Features
  + Split train data into K-folds. 
  + Fit N diverse models on K-1 folds (loop).
  + Predict for Kth fold.
  + Aggregate OOF predictions to \[train_meta\].

+ Test Meta-Features & Predictions
  + Fit N diverse models on train data.
  + Predict for test data: \[test_meta\].
  + Fit metamodel to \[train_meta\].
  + Predict for \[test_meta\].


**KFold scheme with OOF meta-features**

This method is the same as the one above, but we divide the train data into partA and partB M times, using a KFold strategy with M folds. The final predictions are built the same way.


**KFold scheme in time series**

In time-series tasks, we are usually asked to predict a fixed period of time, such as a day, a week, a month, or an arbitrary period of duration T.

+ OOF Meta-Features
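  + Split the train data into M+K+1 chunks of duration T, in chronological order.
  + Fit N diverse models on the first M chunks and predict for chunk M+1: \[metaM+1\].
  + Repeat with an expanding window (fit on the first M+1 chunks, predict for chunk M+2, and so on) until the last chunk is reached.

+ Meta KFold
  + Fit a metamodel to \[metaM+1, ..., metaM+K\].
  + Validate its hyperparameters on \[metaM+K+1\].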

+ Test Meta-Features & Predictions
  + Fit N diverse models on train data.
  + Predict for the test data: \[test_meta\].
  + Fit metamodel to \[train_meta\].
  + Predict for \[test_meta\].
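
The meta-feature part of this scheme can be sketched with an expanding window. The setup is hypothetical: the toy data has 8 chunks of 30 rows in time order, the first `M = 4` chunks are used only for first-level training, and the last chunk is held out to validate the metamodel. The final refit on all the train data and the test predictions are omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_chunks, chunk_size = 8, 30               # chunks of duration T, in time order
X = rng.normal(size=(n_chunks * chunk_size, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=len(X))
chunk = np.repeat(np.arange(n_chunks), chunk_size)   # chunk index of each row

M = 4  # number of initial chunks used only for first-level training
base_models = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=3, random_state=0)]

# Expanding window: fit on all earlier chunks, predict the current one.
meta, target = {}, {}
for c in range(M, n_chunks):
    past, current = chunk < c, chunk == c
    meta[c] = np.column_stack(
        [m.fit(X[past], y[past]).predict(X[current]) for m in base_models]
    )
    target[c] = y[current]

# Meta level: fit on every meta chunk except the last, validate on the last.
fit_chunks = list(range(M, n_chunks - 1))
meta_model = LinearRegression().fit(
    np.vstack([meta[c] for c in fit_chunks]),
    np.concatenate([target[c] for c in fit_chunks]),
)
last = n_chunks - 1
valid_mse = float(np.mean((target[last] - meta_model.predict(meta[last])) ** 2))
print(f"meta validation MSE: {valid_mse:.3f}")
```

No model ever sees data from the chunk it predicts or from any later chunk, which preserves the timeline.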


**KFold scheme in time series with limited amount of data**

The method described above is not applicable when the amount of data is limited. In these cases, we can relax the timeline constraints and treat chunks as regular folds.
