[Tabular Playground Series - Feb 2021](https://www.kaggle.com/c/tabular-playground-series-feb-2021/overview)

# Learn Boosting Techniques through TPS Feb 2021 ([Link](https://www.kaggle.com/c/tabular-playground-series-feb-2021/submit))

In this notebook, we will learn about the popular gradient boosting library XGBoost. We will see how far up the leaderboard a single XGBoost model can take us, focusing on understanding its hyperparameters and how to tune them.

# How is XGBoost different from vanilla Gradient Boosting?

WIP

# Import Libraries & Read Data

In [None]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
import lightgbm as lgb
import altair as alt
from sklearn import tree
from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection
from sklearn import preprocessing
from sklearn import compose
from sklearn import decomposition
from sklearn import pipeline
from functools import partial
from skopt import space
from skopt import gp_minimize

# Settings
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)

# Read Data
PATH = '/kaggle/input/tabular-playground-series-feb-2021'
train = pd.read_csv(f'{PATH}/train.csv')
test = pd.read_csv(f'{PATH}/test.csv')
ss = pd.read_csv(f'{PATH}/sample_submission.csv')

# Utility Code

In [None]:
# Function to fit and evaluate the model on validation set.
def mfe(model, X_train, X_val, y_train, y_val, oob_score=False):
    model.fit(X_train, y_train) 
    
    y_train_pred = model.predict(X_train)
    train_error = np.sqrt(metrics.mean_squared_error(y_train, np.where(y_train_pred < 0, 0, y_train_pred))) 
    
    y_val_pred = model.predict(X_val) 
    val_error = np.sqrt(metrics.mean_squared_error(y_val, np.where(y_val_pred < 0, 0, y_val_pred)))
    
    if oob_score:
        result = {'Train Error': train_error, 'Validation. Error': val_error, 'oob_score': model.oob_score_} 
    else:
        result = {'Train Error': train_error, 'Validation. Error': val_error} 
    print(result)

In [None]:
# Function to plot feature importances
def fi(model, X):
    fi = pd.DataFrame({
        'feature':X.columns.tolist(), 
        'importance':(model.feature_importances_ * 100).tolist()}).round(4).sort_values(by='importance', ascending=False)
    bars = alt.Chart(fi).mark_bar().encode(
      x='importance',
      y=alt.Y('feature', sort='-x'),
      tooltip=['feature','importance']
    )

    text = bars.mark_text(
      align='left',
      baseline='middle',
      dx=3  # Nudges text to right so it doesn't appear on top of the bar
    ).encode(
      text='importance:Q'
    )
    return (bars + text).properties(width=400).configure_axis(labelFontSize=13, titleFontSize=16)

# Data Preprocessing

In [None]:
# Define features & target
num_feats = train.drop(columns=['id','target']).select_dtypes(include='number').columns.tolist()
cat_feats = train.select_dtypes(include='object').columns.tolist()
feats = num_feats+cat_feats
target = 'target'

# Create X & y datasets.
X = train.loc[:, feats]
y = train[target].to_numpy()  # 1-D NumPy array; Series.ravel() is deprecated in recent pandas
X_test = test[feats].copy()

# Split the dataset.
X_train, X_val, y_train, y_val = model_selection.train_test_split(X, y, test_size=0.20, random_state=42)

# To avoid setting with copy warnings we copy the data after splitting
X_train = X_train.copy()
X_val = X_val.copy()

# Define the transformation steps needed for the data.
ct = compose.ColumnTransformer(
transformers = [
    # np.int was removed in NumPy 1.24; use an explicit integer dtype instead.
    ('cat', preprocessing.OrdinalEncoder(categories='auto', dtype=np.int64), cat_feats)
], remainder='passthrough'
)

# Transform the data.
X_train.loc[:, cat_feats] = ct.fit_transform(X_train.loc[:, cat_feats])
X_val.loc[:, cat_feats] = ct.transform(X_val.loc[:, cat_feats])
X_test.loc[:, cat_feats] = ct.transform(X_test.loc[:, cat_feats])

In [None]:
X_train.head()

# Random Forest Baseline

In [None]:
%%time
rf = ensemble.RandomForestRegressor(n_estimators=100, n_jobs=-1)
mfe(rf, X_train, X_val, y_train, y_val)

In [None]:
fi(rf, X_train)

In [None]:
preds = rf.predict(X_test)
dsub = pd.DataFrame({'id': test.loc[:, 'id'], 'target': preds})
dsub.to_csv('dsub2.csv', index=False)

In [None]:
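# NOTE: `clf` is the XGBRegressor defined and fitted in the Scikit-Learn API section below; run this cell only after that one.
preds = clf.predict(X_test)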
dsub = pd.DataFrame({'id': test.loc[:, 'id'], 'target': preds})
dsub.to_csv('dsub4.csv', index=False)

# [XGBoost: Intro to python API](https://xgboost.readthedocs.io/en/latest/python/python_intro.html#)

* [XGB Hyperparameters](https://xgboost.readthedocs.io/en/latest/parameter.html)
* [XGB specific notes](https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html)

### General Parameters
**booster**
* Default value is `gbtree`; can be `gbtree`, `gblinear` or `dart`.
* gbtree and dart use tree based models while gblinear uses linear functions.

**verbosity**
* Verbosity of printed messages. Default value is 1.
* Valid values are 0 (silent), 1 (warning), 2 (info), 3 (debug).

### Parameters for Tree Booster
**eta**
* Default value is 0.3. Range is [0,1], aka learning_rate.
* Used to scale the output from each tree. Shrinkage parameter. Speed of learning. 
* Higher values may lead to overfitting.
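
To make the shrinkage role of `eta` concrete, here is a minimal sketch of plain gradient boosting for squared error, written with NumPy only. It uses one-split stumps and no regularization, so it is a simplification for illustration and not XGBoost's actual algorithm; the helper names `fit_stump` and `boost` and the toy data are assumptions made up for this example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that minimises squared error on the residuals."""
    best = None
    for t in np.unique(x)[:-1]:          # skip the max so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda z: np.where(z <= t, left_value, right_value)

def boost(x, y, eta, n_rounds):
    """Each round fits a stump to the current residuals and adds eta times its output."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)
        pred = pred + eta * stump(x)     # eta scales every tree's contribution
    return pred

# Toy 1-D regression problem (illustrative data, not the competition data).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = np.sin(6 * x) + rng.normal(scale=0.1, size=x.size)

def rmse(pred):
    return float(np.sqrt(np.mean((y - pred) ** 2)))

baseline_rmse = rmse(np.full_like(y, y.mean()))
rmse_small_eta = rmse(boost(x, y, eta=0.05, n_rounds=20))
rmse_large_eta = rmse(boost(x, y, eta=0.5, n_rounds=20))
print(baseline_rmse, rmse_small_eta, rmse_large_eta)
```

With the same number of rounds, the larger `eta` fits the training data much more closely, which is why a small `eta` needs more rounds and a large `eta` overfits sooner.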

**gamma**
* Default value is 0. Range [0,∞]. aka `min_split_loss`.
* Is meant to encourage pruning of the tree, though XGBoost can prune even if gamma is zero.
* Minimum loss reduction required to make a further partition on a leaf node. 
* Increasing `gamma` regularizes the model and reduces the overfitting. 
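
The "minimum loss reduction" can be sketched with the split gain formula from the XGBoost documentation, where `G` and `H` are the sums of gradients and Hessians of the rows in a node. The function names and the numbers below are made up for illustration.

```python
def leaf_score(G, H, lam):
    """Structure score of a leaf: G^2 / (H + lambda)."""
    return G * G / (H + lam)

def split_gain(G_left, H_left, G_right, H_right, lam=1.0, gamma=0.0):
    """Loss reduction of a split; the split is kept only if this is positive."""
    parent = leaf_score(G_left + G_right, H_left + H_right, lam)
    children = leaf_score(G_left, H_left, lam) + leaf_score(G_right, H_right, lam)
    return 0.5 * (children - parent) - gamma

# Hypothetical gradient statistics for one candidate split.
stats = dict(G_left=-4.0, H_left=4.0, G_right=3.0, H_right=3.0, lam=1.0)

gain_gamma_0 = split_gain(**stats, gamma=0.0)   # raw gain
gain_gamma_1 = split_gain(**stats, gamma=1.0)   # still positive: split survives
gain_gamma_3 = split_gain(**stats, gamma=3.0)   # negative: split is pruned
print(gain_gamma_0, gain_gamma_1, gain_gamma_3)
```

The same candidate split is kept for `gamma=1` and removed for `gamma=3`, which is the sense in which a larger `gamma` makes trees more conservative.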

**max_depth**
* Default value: 6; range: [0,∞]
* Maximum depth of a tree. How many interactions do you want to allow in your model?  
* Larger values increase complexity and lead to overfitting.
* If the validation performance keeps improving as `max_depth` increases, it means there are a lot of interactions that can be extracted from the data. In that case it's better to stop tuning and try to generate some features.
* 7 is a good value to start.

**min_child_weight**:1
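* Minimum sum of instance weights (Hessian) needed in a child. For squared-error regression each row has a Hessian of 1, so it behaves like `min_samples_leaf` in Random Forest.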
* Increasing the value regularizes the model and reduces the overfitting.
* One of the most important parameters to tune in XGB & LGB
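
A small sketch of why `min_child_weight` only matches a row count for squared-error regression: the quantity that is compared against it is the sum of per-row Hessians in the child. The helper names and the example rows are assumptions for illustration.

```python
def hessian_squared_error(pred):
    """Second derivative of 0.5 * (pred - y)^2 with respect to pred: always 1."""
    return 1.0

def hessian_logistic(prob):
    """Second derivative of log loss with respect to the raw score: p * (1 - p)."""
    return prob * (1.0 - prob)

n_rows = 8
min_child_weight = 7

# Regression: 8 rows contribute a Hessian of 1 each.
sum_hess_regression = sum(hessian_squared_error(0.0) for _ in range(n_rows))

# Classification: 8 rows that are already predicted confidently (p = 0.95).
sum_hess_logistic = sum(hessian_logistic(0.95) for _ in range(n_rows))

child_allowed_regression = sum_hess_regression >= min_child_weight
child_allowed_logistic = sum_hess_logistic >= min_child_weight
print(sum_hess_regression, sum_hess_logistic)
```

For this regression competition the two notions coincide, so `min_child_weight = 7` can be read as "at least 7 rows per leaf".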

**max_delta_step**:0
* Default value:  0; range: [0,∞]
* Maximum delta step we allow each leaf output to be.
* Usually not needed, but it might help in logistic regression when class is extremely imbalanced. Suggested range: [1-10].

**subsample**
* Default value: 1, range: (0,1]
* Subsample ratio of the training instances.
* Lower values regularize the model and prevent overfitting.

**colsample_bytree**
* Default value: 1, range: (0,1]
* What fraction of features will be used for building each tree. Subsampling occurs once for every tree constructed.
* Lower values regularize the model and prevent overfitting.

**colsample_bylevel**: 1 
* Default value: 1, range: (0,1]
* Subsample ratio of columns for each level. 
* Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree.
* Lower values regularize the model and prevent overfitting.

**colsample_bynode**: 1 
* Default value: 1, range: (0,1]
* Subsample ratio of columns for each node (split). 
* Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level.
* Lower values regularize the model and prevent overfitting.
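
The three `colsample_by*` parameters work cumulatively. The arithmetic below reproduces the example given in the XGBoost documentation (64 features with all three ratios at 0.5 leave 8 features per split). The function is a hypothetical helper, and the exact rounding used internally by XGBoost is not modelled here.

```python
def features_per_split(n_features, bytree=1.0, bylevel=1.0, bynode=1.0):
    """Approximate number of candidate features left at each split."""
    after_tree = n_features * bytree      # sampled once per tree
    after_level = after_tree * bylevel    # sampled again per depth level
    after_node = after_level * bynode     # sampled again per split
    return after_node

n_candidates = features_per_split(64, bytree=0.5, bylevel=0.5, bynode=0.5)
print(n_candidates)  # 8.0
```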

**lambda**
* Default value: 1, range: [0,∞]; aka `reg_lambda`
* Used to regularize the output values from leaves. As lambda increases the output value from a leaf shifts closer to zero.
* L2 regularization term on weights. 
* Higher values reduce overfitting.
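
The shrinking effect of `lambda` on leaf outputs follows from the optimal leaf weight in the XGBoost derivation, `w = -G / (H + lambda)`, where `G` and `H` are the sums of gradients and Hessians in the leaf. The numbers below are made up for illustration.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf output for a leaf with gradient sum G and Hessian sum H."""
    return -G / (H + lam)

G, H = -6.0, 3.0   # hypothetical leaf statistics

weights = {lam: leaf_weight(G, H, lam) for lam in (0.0, 1.0, 9.0)}
print(weights)  # {0.0: 2.0, 1.0: 1.5, 9.0: 0.5}
```

Leaves with few rows (small `H`) are shrunk the most, which is how `lambda` dampens the influence of sparsely populated leaves.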

**alpha**
* Default value: 0, range: [0,∞]; aka `reg_alpha`
* L1 regularization term on weights.
* Can be useful with very high-dimensional data, where it may make the algorithm run faster.
* Higher values reduce overfitting.

**tree_method**
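* Specifies which tree construction algorithm to use in XGBoost.
* `auto`, `exact`, `approx`, `hist` and `gpu_hist` are the available options. (XGBoost 2.0 deprecates `gpu_hist` in favor of `tree_method='hist'` with `device='cuda'`.)
* For large datasets prefer `approx`, `hist` or `gpu_hist`. For smaller datasets use `exact`.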

**scale_pos_weight**
* Control the balance of positive and negative weights, useful for unbalanced classes. 
* A typical value to consider: sum(negative instances) / sum(positive instances)
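
The suggested ratio can be computed directly from binary labels, as sketched below. This only applies to binary classification and is not needed for the regression task in this notebook; the function name and the toy labels are assumptions.

```python
def suggested_scale_pos_weight(labels):
    """Return count(negative) / count(positive) for 0/1 labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("labels contain no positive instances")
    return n_neg / n_pos

labels = [0] * 90 + [1] * 10          # hypothetical 90/10 class split
weight = suggested_scale_pos_weight(labels)
print(weight)  # 9.0
```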

**max_leaves**
* Maximum number of nodes to be added. Only relevant when grow_policy=lossguide is set.

### Learning Task Parameters
**objective**: 
* reg:squarederror Optimization objective for regression.

**base_score**
* 0.5, The initial prediction score of all instances, global bias

**eval_metric**:
* Evaluation metrics for validation data, a default metric will be assigned according to objective

**seed**: 
* 42 # Reproducibility

### How to control overfitting?
<br/>

When you observe high training accuracy but low test accuracy, it is likely that you have encountered an overfitting problem. There are in general two ways to control overfitting in XGBoost:
* The first way is to directly control model complexity through `max_depth`, `min_child_weight` and `gamma`.
* The second way is to add randomness to make training robust to noise by using `subsample` and `colsample_bytree`.
* You can also reduce stepsize `eta`. Remember to increase num_round when you do so.

### General Tips
* In general, a small learning rate (and large number of estimators) will yield more accurate XGBoost models, though it will also take the model longer to train since it does more iterations through the cycle.
* We can try a very small learning rate and a large number of boosting iterations to see how long it takes for the model to overfit. When the validation loss stops decreasing, we can exit the training. To get even more accuracy, we can multiply the `num_rounds` by k & divide `eta` by k. 
* We can change the `seed` parameter to see how different values affect our results.
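
The "multiply the rounds by k and divide `eta` by k" tip can be written as a small helper. This is a sketch: `rescale_schedule` is a hypothetical name, and the starting values are simply the ones used elsewhere in this notebook.

```python
def rescale_schedule(params, num_rounds, k):
    """Return a copy of params with eta divided by k, and num_rounds multiplied by k."""
    scaled = dict(params)                  # do not mutate the caller's dict
    scaled['eta'] = params['eta'] / k
    return scaled, int(round(num_rounds * k))

base_params = {'eta': 0.15, 'max_depth': 6}
slow_params, slow_rounds = rescale_schedule(base_params, num_rounds=100, k=3)
print(slow_params, slow_rounds)
```

The rescaled schedule trains roughly k times longer, so it is usually worth applying only once the other hyperparameters are settled.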

## Training XGBoost Model ([Link](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train))

In [None]:
# Define the train, validation and test Dmatrix objects.
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
dtest = xgb.DMatrix(X_test)

In [None]:
# Define the model params.
params = {
    'eta': 0.15,
    'gamma': 1,
    'max_depth':6 ,
    'min_child_weight': 8,
    'subsample': 1,
    'colsample_bytree': 1,
    'colsample_bylevel': 1,
    'colsample_bynode': 1, 
    'lambda': 1,
    'alpha': 1,
    'tree_method': 'exact',
    'objective': 'reg:squarederror',
    'eval_metric':'rmse',
    #'tree_method':'gpu_hist',
    'seed': 42
} 

The argument `early_stopping_rounds` offers a way to find a good number of boosting rounds automatically. Early stopping makes the model stop iterating once the validation score stops improving, even if it has not reached the hard limit set by `num_boost_round` (`n_estimators` in the Scikit-Learn API). It is therefore sensible to set a high limit and let `early_stopping_rounds` decide when to stop.

Since random chance sometimes produces a round in which the validation score does not improve, we specify how many consecutive rounds without improvement to tolerate before stopping. `early_stopping_rounds = 10` is a reasonable value: training stops after 10 consecutive rounds with no improvement in the validation score.
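
The stopping rule described above can be sketched in plain Python. This illustrates the logic only, not XGBoost's internal code; the function name and the validation scores are made up.

```python
def early_stopping_round(val_scores, patience):
    """Return (best_round, stop_round) for a metric where lower is better.

    Training stops at the first round that is `patience` rounds past the
    best round seen so far without any improvement.
    """
    best_score = float('inf')
    best_round = -1
    for i, score in enumerate(val_scores):
        if score < best_score:
            best_score, best_round = score, i
        elif i - best_round >= patience:
            return best_round, i
    return best_round, len(val_scores) - 1   # never triggered: ran all rounds

# Hypothetical validation RMSE per boosting round.
scores = [0.900, 0.880, 0.870, 0.871, 0.872, 0.869, 0.870, 0.871, 0.872]

result_patience_3 = early_stopping_round(scores, patience=3)
result_patience_2 = early_stopping_round(scores, patience=2)
print(result_patience_3, result_patience_2)  # (5, 8) (2, 4)
```

With a patience of 2 the run stops before reaching the later, better score at round 5, which shows why too small a value can end training prematurely.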

In [None]:
%%time
model = xgb.train(
    params, 
    dtrain, 
    num_boost_round = 100, 
    evals = [(dtrain, 'train'), (dval, 'val')], # Datasets evaluated during training; the last one (val) is used for early stopping
    early_stopping_rounds = 10,
    verbose_eval = 20
)

In [None]:
xgb.plot_importance(model);

In [None]:
model.attributes()

In [None]:
# Evaluate the model on a DMatrix.
model.eval(dtrain, name='eval', iteration=0)

In [None]:
model.get_fscore()

Feature Importance types in XGBoost

* `weight`: the number of times a feature is used to split the data across all trees.
* `gain`: the average gain across all splits the feature is used in.
* `cover`: the average coverage across all splits the feature is used in.
* `total_gain`: the total gain across all splits the feature is used in.
* `total_cover`: the total coverage across all splits the feature is used in.
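
To see how these importance types can rank features differently, here is a sketch that recomputes them from a made-up list of splits. The split records and the `importance` helper are hypothetical; in practice `model.get_score(importance_type=...)` returns these values.

```python
from collections import defaultdict

# Hypothetical splits across all trees: (feature, gain, cover).
splits = [
    ("cont1", 10.0, 100.0),
    ("cont1", 2.0, 40.0),
    ("cat5", 8.0, 60.0),
]

def importance(splits, importance_type):
    """Aggregate per-split statistics into a per-feature importance score."""
    count = defaultdict(int)
    gain = defaultdict(float)
    cover = defaultdict(float)
    for feature, g, c in splits:
        count[feature] += 1
        gain[feature] += g
        cover[feature] += c
    if importance_type == "weight":
        return dict(count)
    if importance_type == "total_gain":
        return dict(gain)
    if importance_type == "total_cover":
        return dict(cover)
    if importance_type == "gain":
        return {f: gain[f] / count[f] for f in count}
    if importance_type == "cover":
        return {f: cover[f] / count[f] for f in count}
    raise ValueError(f"unknown importance_type: {importance_type}")

for kind in ("weight", "gain", "total_gain"):
    print(kind, importance(splits, kind))
```

Here `cont1` leads on `weight` and `total_gain`, while `cat5` leads on average `gain`, so it is worth checking more than one type before drawing conclusions.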

In [None]:
model.get_score(importance_type='weight')

In [None]:
model.get_split_value_histogram('cat5', bins=None, as_pandas=True)

In [None]:
model.num_boosted_rounds()

In [None]:
model.num_features()

## XGBoost CV ([Link](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.cv))

In [None]:
xgb.cv(
    params, #  Booster params.
    dtrain, # Data to be trained.
    num_boost_round=50, # Number of boosting iterations.
    nfold=3, # Number of folds in CV
    stratified=False, # Perform stratified sampling.
    metrics='rmse', # Evaluation metrics to be watched in CV
    early_stopping_rounds=10, 
    verbose_eval=10,  
    seed=0)
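
`xgb.cv` returns a DataFrame with one row per boosting round and, for `metrics='rmse'`, the columns `train-rmse-mean`, `train-rmse-std`, `test-rmse-mean` and `test-rmse-std`. A sketch of reading the best round from it follows. The variable name `cv_results` is an assumption (the cell above does not store the result), and the numbers are made up so that the example runs without training.

```python
import pandas as pd

# Made-up values in the shape that xgb.cv returns for metrics='rmse'.
cv_results = pd.DataFrame({
    'train-rmse-mean': [0.900, 0.860, 0.840, 0.825, 0.815],
    'train-rmse-std':  [0.001, 0.001, 0.001, 0.001, 0.001],
    'test-rmse-mean':  [0.905, 0.870, 0.855, 0.853, 0.856],
    'test-rmse-std':   [0.002, 0.002, 0.002, 0.002, 0.002],
})

# Row index i corresponds to boosting round i + 1.
best_idx = int(cv_results['test-rmse-mean'].idxmin())
best_num_rounds = best_idx + 1
best_rmse = float(cv_results.loc[best_idx, 'test-rmse-mean'])
print(best_num_rounds, best_rmse)
```

The training error keeps falling in every round while the held-out error turns upward, and the round where it turns is the one to keep.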

# Using the [Scikit-Learn API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) for XGBoost

Here is the Scikit-Learn implementation of XGBoost. 

In [None]:
clf = xgb.XGBRegressor(
    n_estimators = 100,
    max_depth = 6,
    learning_rate = 0.15,
    #verbosity = 0,
    objective = 'reg:squarederror',
    n_jobs = -1,
    gamma = 0,
    min_child_weight = 7,
    max_delta_step = 0,
    subsample = 1,
    colsample_bytree = 1,
    colsample_bylevel = 1,
    colsample_bynode = 1,
    reg_alpha = 1,
    reg_lambda = 1,
    scale_pos_weight = 1,
    random_state = 42)

Changing `reg_alpha` from 0 to 1 decreases the validation RMSE from 0.84584 to 0.84576.

In [None]:
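# `eval_metric` and `early_stopping_rounds` have been deprecated as `fit()` arguments
# since XGBoost 1.6 and are rejected by newer releases, so set them on the estimator.
clf.set_params(eval_metric='rmse', early_stopping_rounds=10)
clf.fit(
    X_train, y_train,
    eval_set = [(X_train, y_train), (X_val, y_val)],
    verbose = 10);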

## Hyperparameter Tuning

Some Resources for Learning Model Tuning.

* [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)
* [Grid and Random Search](https://www.kaggle.com/willkoehrsen/intro-to-model-tuning-grid-and-random-search)

#### Advice from top Kagglers

1. Select the most influential parameters. There are tons of parameters and we can't tune them all.
2. For the selected set of parameters, understand how they influence the training.
3. Tune the selected parameters. In practice it's faster to tune manually.

Different values of parameters result in 3 different fitting behaviors. 
* Underfitting
* Good fit and generalization
* Overfitting

We will split the parameters into two groups. 
* Brakers: Parameters that regularize the model as their value increases. If we increase their value, the model's behaviour shifts from overfitting towards underfitting.
* Speeders: Parameters that can lead to overfitting as their value increases. Increase their value if the model underfits and decrease it if the model overfits.


In [None]:
# Hyperparameters that matter and their XGBoost & LightGBM names.
# XGBoost's `lambda` is the L2 term and `alpha` the L1 term, so they map to lambda_l2 and lambda_l1.
pd.DataFrame({
'Type': ['Speeder', 'Speeder', 'Speeder', 'Braker', 'Braker', 'Braker', 'Speeder', 'Speeder'],
'XGBoost': ['max_depth', 'subsample', 'colsample_bytree/bylevel', 'min_child_weight', 'lambda', 'alpha', 'eta', 'num_round'],
'LightGBM': ['max_depth/num_leaves', 'bagging_fraction', 'feature_fraction', 'min_data_in_leaf', 'lambda_l2', 'lambda_l1', 'learning_rate', 'num_iterations']
})

#### Step 1: Fix learning rate and number of estimators for tuning tree-based parameters

Since we have set n_estimators = 100 & learning_rate = 0.15 from above, let's tune `max_depth` & `min_child_weight`

#### Step 2: Tune max_depth and min_child_weight

In [None]:
param_grid_1 = {
    "max_depth": [6, 10, 15],
    "min_child_weight": [1, 3, 5, 7]
}

clf_1 = xgb.XGBRegressor(n_estimators=100, learning_rate=0.15, objective = 'reg:squarederror', reg_alpha=1, n_jobs=-1)

gs_1 = model_selection.GridSearchCV(
    estimator = clf_1,
    param_grid = param_grid_1,
    scoring = 'neg_root_mean_squared_error',
    verbose = 10,
    n_jobs = 1,
    cv = 5
)

# Fit the model and extract best score
gs_1.fit(X_train, y_train)
print(f"Best score: {gs_1.best_score_}")

print("Best parameters set:")
best_parameters = gs_1.best_estimator_.get_params()
for param_name in sorted(param_grid_1.keys()):
    print(f"\t{param_name}: {best_parameters[param_name]}")

print(f'Overall Results are : {gs_1.cv_results_}')    

Notes from above run.

Best score: -0.8467124937346888
Best parameters set: max_depth: 6; min_child_weight: 7
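
Two details of the grid search above are easy to miss: how many models it trains, and why the best score is negative. The sketch below enumerates the same grid with the standard library. `candidates` and `n_cv_fits` are names made up for this example, and the score is copied from the notes above.

```python
from itertools import product

param_grid_1 = {
    "max_depth": [6, 10, 15],
    "min_child_weight": [1, 3, 5, 7],
}
cv = 5

# Every combination of the listed values is one candidate model.
names = sorted(param_grid_1)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid_1[name] for name in names))
]

# Each candidate is fitted once per fold (plus one final refit of the best one).
n_cv_fits = len(candidates) * cv
print(len(candidates), n_cv_fits)  # 12 60

# scoring='neg_root_mean_squared_error' reports -RMSE so that higher is better.
best_score = -0.8467124937346888
best_rmse = -best_score
print(best_rmse)
```

Since the cost grows multiplicatively with each added parameter, tuning two parameters at a time, as done here, keeps the number of fits manageable.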

In [None]:
param_grid_2 = {
    "min_child_weight": [5, 7, 9]
}

clf_2 = xgb.XGBRegressor(max_depth=6, n_estimators=100, learning_rate=0.15, objective = 'reg:squarederror', reg_alpha=1, n_jobs=-1)

gs_2 = model_selection.GridSearchCV(
    estimator = clf_2,  # use the estimator defined in this cell (was clf_1)
    param_grid = param_grid_2,
    scoring = 'neg_root_mean_squared_error',
    verbose = 10,
    n_jobs = 1,
    cv = 5
)

# Fit the model and extract best score
gs_2.fit(X_train, y_train)
print(f"Best score: {gs_2.best_score_}")

print("Best parameters set:")
best_parameters = gs_2.best_estimator_.get_params()
for param_name in sorted(param_grid_2.keys()):
    print(f"\t{param_name}: {best_parameters[param_name]}")

#### Step 3: Tune gamma

#### Step 4: Tune subsample and colsample_bytree

#### Step 5: Tune regularization parameters

**Predictions on test data**

In [None]:
ypred = model.predict(dtest)
dsub = pd.DataFrame({'id': test.loc[:, 'id'], 'target': ypred})
dsub.to_csv('dsub.csv', index=False)