## XGBoost

https://xgboost.readthedocs.io/en/latest/parameter.html
    
https://blog.cambridgespark.com/hyperparameter-tuning-in-xgboost-4ff9100a3b2f
    
https://stats.stackexchange.com/questions/317073/explanation-of-min-child-weight-in-xgboost-algorithm

https://medium.com/data-design/xgboost-hi-im-gamma-what-can-i-do-for-you-and-the-tuning-of-regularization-a42ea17e6ab6

### Parameters

_**eta**_ (the name comes from the Greek letter η) - learning rate. 

Shrinks the contribution of every successive tree to the overall prediction (which is why eta is normally less than 1). This makes the boosting process more conservative: the ensemble's predictions after the next boosting step stay close to its predictions after the previous step, so no single tree introduces anything "significantly new".

Range: [0, 1]
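
The shrinkage effect can be illustrated with a toy sketch. This is not XGBoost's implementation: it assumes squared-error loss and replaces each tree with a hypothetical constant predictor (the mean residual). The name `boost_constant` and its arguments are made up for illustration.

```python
def boost_constant(targets, eta, n_rounds):
    """Toy boosting for squared error.

    Each round fits a constant (the mean residual) in place of a real
    tree and adds it to the running prediction, scaled by eta.
    """
    prediction = 0.0
    history = []
    for _ in range(n_rounds):
        residuals = [y - prediction for y in targets]
        update = sum(residuals) / len(residuals)  # stand-in for a tree's output
        prediction += eta * update                # shrinkage step
        history.append(prediction)
    return history


targets = [10.0, 10.0, 10.0]
fast = boost_constant(targets, eta=1.0, n_rounds=3)  # jumps to the target at once
slow = boost_constant(targets, eta=0.1, n_rounds=3)  # creeps towards it
```

With `eta=1.0` the first round already absorbs the whole residual, while with `eta=0.1` each round only closes 10% of the remaining gap, so more rounds are needed but each one changes the ensemble less.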

_**gamma**_ - Minimum loss reduction required to make a further partition on a leaf node of the tree. 

The larger gamma is (meaning we want each split to significantly reduce the loss), the more conservative the algorithm will be (thus reducing the variance). 

Range: [0, ∞)
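
To see where gamma enters, here is a minimal sketch of the split gain as written in the XGBoost paper, assuming the gradient sums (G) and hessian sums (H) of the two candidate children are already known. The function name `split_gain` and the sample numbers are illustrative, not part of the library.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, minus the gamma penalty.

    gain = 0.5 * [G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda)
                  - (G_L+G_R)^2/(H_L+H_R+lambda)] - gamma
    """
    def score(g, h):
        return g * g / (h + reg_lambda)

    raw = 0.5 * (
        score(g_left, h_left)
        + score(g_right, h_right)
        - score(g_left + g_right, h_left + h_right)
    )
    return raw - gamma


# The same candidate split, judged with two different gamma values.
kept = split_gain(-4.0, 4.0, 4.0, 4.0, reg_lambda=1.0, gamma=0.0)
rejected = split_gain(-4.0, 4.0, 4.0, 4.0, reg_lambda=1.0, gamma=10.0)
```

A split is only worth keeping when the returned value is positive, so raising gamma from 0 to 10 turns this particular split from useful into one that gets pruned.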

_**max_depth**_ - Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit (the variance will be larger!). 

Range: [0, ∞)

_**min_child_weight**_ - Minimum sum of instance weight (hessian) (or number of samples if all samples have a weight of 1) needed in a child.

If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning.

In a regression task with squared-error loss, this simply corresponds to the minimum number of instances required in each node. The loss for each instance is 1/2*(y-y_hat)^2, so its hessian (the second derivative w.r.t. y_hat) is +1. Summing these hessians over the instances in a node therefore gives their count. Similar logic applies to other objectives, although there the hessian is generally not constant (for logistic loss it is p*(1-p)), so the sum behaves like a weighted count rather than a plain one.

The larger min_child_weight is, the more conservative the algorithm will be (thus reducing the variance).

A smaller min_child_weight allows the algorithm to create children that correspond to fewer samples, thus allowing for more complex trees, but again, more likely to overfit (the variance will be larger!).

Range: [0, ∞)
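
The hessian-sum interpretation of min_child_weight can be checked with a small sketch. The helper names are hypothetical, and the logistic case (hessian p*(1-p)) is added here only to contrast with squared error.

```python
import math


def squared_error_hessian(y, y_hat):
    # Second derivative of 0.5 * (y - y_hat)^2 w.r.t. y_hat is constant.
    return 1.0


def logistic_hessian(margin):
    # For logistic loss the hessian is p * (1 - p), with p = sigmoid(margin).
    p = 1.0 / (1.0 + math.exp(-margin))
    return p * (1.0 - p)


def child_weight(hessians):
    # This sum is the quantity compared against min_child_weight.
    return sum(hessians)


# Four instances under squared error: the sum equals the instance count.
regression_child = child_weight(
    [squared_error_hessian(y, 0.0) for y in [1.0, 2.0, 3.0, 4.0]]
)

# Four instances that a classifier already predicts very confidently.
confident_child = child_weight([logistic_hessian(6.0) for _ in range(4)])
```

Under squared error the four instances add up to exactly 4, whereas four confidently classified instances add up to far less than 1, so a child like that would fail even `min_child_weight=1`.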

_**subsample**_ - Subsample ratio of the training instances.

Setting it to 0.5 means that XGBoost randomly samples half of the training data before growing trees, which helps prevent overfitting. Subsampling occurs once in every boosting iteration.

Range: (0, 1]

_**colsample_bytree**_ - the subsample ratio of columns when constructing each tree. 

Subsampling occurs once for every tree constructed.

Range: (0, 1]

_**colsample_bylevel**_ - the subsample ratio of columns for each level. 

Subsampling occurs once for every new depth level reached in a tree. 

Columns are subsampled from the set of columns chosen for the current tree.

Range: (0, 1]

_**colsample_bynode**_ - the subsample ratio of columns for each node (split). 

Subsampling occurs once every time a new split is evaluated. 

Columns are subsampled from the set of columns chosen for the current level.

Range: (0, 1]

_**lambda**_ - L2 regularization term on weights. 

Increasing this value will make the model more conservative (thus reducing the variance).

_**alpha**_ - L1 regularization term on weights. 

Increasing this value will make the model more conservative (thus reducing the variance).
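
How lambda and alpha act on a leaf can be sketched with the closed-form leaf weight from the XGBoost paper, extended with a soft threshold for the L1 term. This is a simplified illustration under that assumption: it ignores eta and other details, and `leaf_weight` is a made-up name.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf value for given gradient and hessian sums.

    alpha soft-thresholds the gradient sum (it can force the leaf to 0),
    lambda enlarges the denominator (it shrinks the leaf towards 0).
    """
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        return 0.0
    return -shrunk / (hess_sum + reg_lambda)


unregularized = leaf_weight(-6.0, 3.0, reg_lambda=0.0, reg_alpha=0.0)
with_l2 = leaf_weight(-6.0, 3.0, reg_lambda=3.0, reg_alpha=0.0)
with_l1 = leaf_weight(-6.0, 3.0, reg_lambda=0.0, reg_alpha=2.0)
zeroed_by_l1 = leaf_weight(-6.0, 3.0, reg_lambda=0.0, reg_alpha=10.0)
```

Both penalties pull the leaf value towards zero, which is what makes the model more conservative. The difference is that a large enough alpha sets the leaf to exactly zero, while lambda only shrinks it.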

_**scale_pos_weight**_ - Control the balance of positive and negative weights, useful for unbalanced classes. 

A typical value to consider: sum(negative instances) / sum(positive instances).
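
As a small sketch of that ratio, assuming binary labels encoded as 0 (negative) and 1 (positive); the helper name is hypothetical:

```python
def suggested_scale_pos_weight(labels):
    """Return count(negative) / count(positive) for 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive instances, the ratio is undefined")
    return negatives / positives


# 90 negatives and 10 positives give a ratio of 9.
labels = [0] * 90 + [1] * 10
ratio = suggested_scale_pos_weight(labels)
```

The result is only a starting point and would normally still be tuned with cross-validation.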

**The `colsample_by*` parameters work cumulatively. For instance, the combination `{'colsample_bytree':0.5, 'colsample_bylevel':0.5, 'colsample_bynode':0.5}` with 64 features will leave 8 features to choose from at each split.**
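
The arithmetic behind the 64-feature example can be sketched as follows. The rounding rule (truncate, but keep at least one feature per stage) is an assumption made for this illustration, and `features_per_split` is not a library function.

```python
def features_per_split(n_features, bytree=1.0, bylevel=1.0, bynode=1.0):
    """Number of candidate features left after the three sampling stages.

    Each stage samples from the columns kept by the previous one:
    tree -> level -> node.
    """
    remaining = n_features
    for ratio in (bytree, bylevel, bynode):
        # Assumed rounding: truncate, but never drop below one feature.
        remaining = max(1, int(ratio * remaining))
    return remaining


# 64 -> 32 (per tree) -> 16 (per level) -> 8 (per node)
candidates = features_per_split(64, bytree=0.5, bylevel=0.5, bynode=0.5)
```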

### gamma vs min_child_weight
Always start with gamma at 0, run xgb.cv, and look at how the train and test scores are faring. 

If your train CV score skyrockets past your test CV score at a blazing speed, this is where gamma is useful instead of min_child_weight (because you need to control the complexity coming from the loss itself, not from the loss derivative, i.e. the hessian weight that min_child_weight uses).

### gamma tuning 
The higher gamma is, the smaller the gap between the train and test CV scores will be. 

If you have no idea which value to use, try 10 and see what happens.

If your train and test CV scores always lie too close together, you are controlling the complexity of xgboost far too much, and the model can't grow trees without pruning them (the loss-reduction threshold set by gamma is not reached). Lower gamma (a good relative step if you don't know: cut 20% of gamma away until your test CV score improves without the train CV score staying frozen).

If your train and test CV scores differ too much, you are not controlling the complexity of xgboost enough, and the model grows overly complex trees without pruning them (the loss-reduction threshold set by gamma is too easily reached). Raise gamma (a good absolute step if you don't know: +2, until your test CV score can keep up with a train CV score that now rises more slowly; your test CV score should then be able to peak).

If your train CV score is stuck (not increasing, or increasing far too slowly), decrease gamma: the value was too high and xgboost keeps pruning trees until it can find something appropriate (or it may end up in a cycle of testing and adding nodes, only to prune them straight away…).
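
The rules of thumb above can be condensed into a small helper. This is only a sketch: it assumes a higher-is-better metric (such as AUC), and the gap thresholds `gap_too_small` and `gap_too_large` are hypothetical values that would need to be chosen per problem.

```python
def adjust_gamma(gamma, train_score, test_score,
                 gap_too_small=0.01, gap_too_large=0.05):
    """Suggest the next gamma to try from one round of CV scores.

    - train far above test: complexity is under-controlled, add 2
    - train and test nearly identical: over-controlled, cut 20%
    - otherwise: keep the current value
    """
    gap = train_score - test_score
    if gap > gap_too_large:
        return gamma + 2.0
    if gap < gap_too_small:
        return gamma * 0.8
    return gamma


raised = adjust_gamma(10.0, train_score=0.99, test_score=0.80)
lowered = adjust_gamma(10.0, train_score=0.850, test_score=0.848)
unchanged = adjust_gamma(10.0, train_score=0.88, test_score=0.85)
```

In practice the two scores would come from the last row of an `xgb.cv` run, and the loop of "run CV, adjust, run again" would stop once the suggestion no longer changes.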