## XGBoost, LightGBM, CatBoost

- the theory behind all three models is based on Gradient Boosting, with some extensions
- each library adds practical implementation features that:
    - reduce model variance further (better cross-validation and test performance)
    - make the runtime faster (more efficient implementation)

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from xgboost import XGBRegressor, XGBClassifier

In [3]:
trainf = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_features_train.csv")
trainp = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_prices_train.csv")
train = pd.merge(trainf, trainp)

testf = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_features_test.csv")
testp = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_prices_test.csv")
test = pd.merge(testf, testp)

predictors = ['mpg', 'engineSize', 'year', 'mileage']
target = 'price'
X_train = train[predictors]
y_train = train[target]
X_test = test[predictors]
y_test = test[target]

- XGBoost is faster than Gradient Boosting because:
    - parallel processing WITHIN EACH TREE: the trees are still added one after another, but the split search at the nodes of a tree runs on multiple threads
    - histogram-based split finding: while finding the decision rule at each node:
        - take a predictor and bin its values into a histogram
        - search for the threshold among the bin edges only, instead of among every distinct value
        - this method gives very comparable performance but runs much faster
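
The histogram idea above can be sketched in plain Python. This is only an illustration under simplifying assumptions (equal-width bins, squared-error loss, a single predictor); the function name `best_histogram_split` is hypothetical, and XGBoost's own binning and gain computation are more elaborate.

```python
def best_histogram_split(x, y, n_bins=16):
    """Return (threshold, sse_reduction) for the best split on a bin edge."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return None, 0.0  # a constant predictor cannot be split
    width = (hi - lo) / n_bins

    # Per-bin statistics: number of rows and sum of targets.
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    for xi, yi in zip(x, y):
        b = min(int((xi - lo) / width), n_bins - 1)  # largest value goes in the last bin
        counts[b] += 1
        sums[b] += yi

    total_n, total_sum = len(y), sum(y)
    best_threshold, best_gain = None, 0.0
    left_n, left_sum = 0, 0.0
    # Only the n_bins - 1 interior bin edges are candidate thresholds.
    for b in range(n_bins - 1):
        left_n += counts[b]
        left_sum += sums[b]
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        right_sum = total_sum - left_sum
        # Reduction in the sum of squared errors obtained by this split.
        gain = (left_sum ** 2 / left_n + right_sum ** 2 / right_n
                - total_sum ** 2 / total_n)
        if gain > best_gain:
            best_threshold, best_gain = lo + (b + 1) * width, gain
    return best_threshold, best_gain


# Toy data: the target jumps from 10 to 20 once x reaches 50.
x = [float(i) for i in range(100)]
y = [10.0 if xi < 50 else 20.0 for xi in x]
threshold, gain = best_histogram_split(x, y)
print(threshold, gain)
```

With 100 distinct values and 16 bins, the search evaluates 15 candidate thresholds instead of 99, which is where the speed-up comes from. In XGBoost itself, histogram-based training is selected with `tree_method="hist"`.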

- XGBoost often performs better than Gradient Boosting because of its extra hyperparameters:
    - they are introduced under Model Inputs
    - they regularize the model strongly to limit overfitting
    - they make the model less sensitive to outliers

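- Does XGBoost miss anything? Yes, Huber loss:
    - Gradient Boosting implements Huber loss very well
    - XGBoost only has the "pseudo-Huber loss" (a smooth approximation of Huber loss), which is not implemented as well
    - when a robust loss is needed in XGBoost, just use MAE (`objective = "reg:absoluteerror"` in recent releases)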
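
To make the comparison concrete, here is a minimal sketch of the three per-residual losses, written from their standard textbook definitions. The function names are illustrative and `delta` is the usual transition point; this is not code taken from either library. In scikit-learn the Huber option is `loss="huber"` on `GradientBoostingRegressor`, and in XGBoost the pseudo-Huber objective is `"reg:pseudohubererror"`.

```python
import math


def huber(residual, delta=1.0):
    """Quadratic for small residuals, linear for large ones."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2
    return delta * (a - 0.5 * delta)


def pseudo_huber(residual, delta=1.0):
    """Smooth approximation of Huber loss with no kink at |residual| = delta."""
    return delta ** 2 * (math.sqrt(1.0 + (residual / delta) ** 2) - 1.0)


def absolute_error(residual):
    """Per-observation MAE term: linear everywhere."""
    return abs(residual)


# Compare how each loss grows as the residual moves from small to outlier-sized.
for r in (0.5, 3.0, 100.0):
    print(r, huber(r), pseudo_huber(r), absolute_error(r))
```

All three grow only linearly for large residuals, which is why they are less sensitive to outliers than squared error.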

### Model Inputs

In [None]:
model = XGBRegressor(
    random_state = 12,
    max_depth = 6, # maximum depth of each tree, same role as in Gradient Boosting
    learning_rate = 0.1, # same as in AdaBoost and Gradient Boosting
    subsample = 0.8, # fraction of training rows sampled for each tree, same as in Gradient Boosting

    # the extra hyperparameters
    reg_lambda = 1, # the factor multiplying the Ridge (L2) penalty added to the cost function of each tree
    # the Ridge penalty is the sum of squared "leaf weights": a leaf's weight is the value that leaf outputs
        # a larger reg_lambda shrinks the leaf weights toward zero, so each tree makes smaller corrections

    gamma = 0.1, # "tree pruning" hyperparameter: the minimum loss reduction a split must achieve to be kept; splits whose gain falls below gamma are pruned

    colsample_bytree = 0.5, # a feature/predictor subset that each tree sees, tries to break any excess correlation between trees

)
# a group of three hyperparams (each is a fraction between 0 and 1, and the three multiply together):
# colsample_bytree: the fraction of predictors seen by each tree
# colsample_bylevel: the fraction of predictors seen by each level of the tree
# colsample_bynode: the fraction of predictors seen by each node (split) of the tree
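
How `reg_lambda` and `gamma` enter the tree-building step can be sketched with the leaf-weight and split-gain formulas from the XGBoost paper. The helper names below are illustrative, not part of the xgboost API. The inputs are sums of first-order gradients (`g`) and second-order gradients (`h`) over the rows in a node; for squared-error loss, each row contributes `g = prediction - y` and `h = 1`.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    """Optimal output value of a leaf; a larger reg_lambda shrinks it toward 0."""
    return -grad_sum / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node, net of the gamma penalty."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    children = score(g_left, h_left) + score(g_right, h_right)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (children - parent) - gamma


# A split that separates rows with opposite residuals is clearly worth keeping.
print(split_gain(-10.0, 4.0, 10.0, 4.0, reg_lambda=1.0, gamma=0.1))

# A split between two groups with nearly identical residuals has negative
# net gain, so it would be pruned.
print(split_gain(-1.0, 4.0, -1.2, 4.0, reg_lambda=1.0, gamma=0.1))
```

A split is kept only when its net gain is positive, so raising `gamma` removes more splits, while raising `reg_lambda` both shrinks the leaf outputs and lowers the gain of every candidate split.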