# Very simple XGBoost regression

[XGBoost](https://xgboost.readthedocs.io/en/latest/) was the first of ***The Big Three*** [gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting) frameworks, released in 2014. The other two are [LightGBM](https://www.microsoft.com/en-us/research/project/lightgbm/), launched by Microsoft in 2016, and [CatBoost](https://catboost.ai/), launched by Yandex in 2017. Each of these frameworks is a magnificent tool for tackling tabular data problems, whether regression or classification.

#### What is '*boosting*'?
First there was a **tree**. The underlying element of this technique is the decision tree. Decision trees were among the first machine learning algorithms, dating back to the 1980s, with examples such as CART, and ID3, C4.5 and C5.0 by Quinlan. Trees are wonderfully intuitive, leading to easily interpretable results. See for example the notebook ["*Titanic: some sex, a bit of class, and a tree...*"](https://www.kaggle.com/carlmcbrideellis/titanic-some-sex-a-bit-of-class-and-a-tree). The most important hyperparameter for the [`DecisionTreeRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) is `max_depth`, although being a tree perhaps this should really have been called the maximum height...
However, despite appealing aspects such as few hyperparameters and interpretability, the drawback of decision trees is their high variance: a slight change in the input data can lead to a radically different tree structure. A similar thing can happen if there are [collinear variables](https://en.wikipedia.org/wiki/Multicollinearity) present. Individual decision trees are sometimes known as weak predictors, or weak learners.
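
The effect of `max_depth`, and the high variance mentioned above, can be seen in a small sketch. The synthetic sine-wave data and the two bootstrap resamples are assumptions made purely for illustration; they are not part of the House Prices workflow.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (an assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# max_depth caps the number of successive splits:
# a tree of depth d has at most 2**d leaves
for depth in (1, 3, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
    print("max_depth =", depth, "-> depth", tree.get_depth(),
          "with", tree.get_n_leaves(), "leaves")

# High variance: fully grown trees fitted to two resamples of the
# same data can give noticeably different predictions
X_grid = np.linspace(0, 10, 50).reshape(-1, 1)
preds = []
for seed in (1, 2):
    idx = np.random.default_rng(seed).integers(0, len(X), size=len(X))
    deep_tree = DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx])
    preds.append(deep_tree.predict(X_grid))

disagreement = np.mean(np.abs(preds[0] - preds[1]))
print("mean absolute disagreement between the two trees:", disagreement)
```

Limiting `max_depth` trades some flexibility for stability, which is why it is usually the first hyperparameter to tune.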

Then we have a **forest**. As we all know, listening to many opinions and taking an average leads to a more balanced consensus. With this in mind, why not randomly plant a lot of trees and then ensemble them into one aggregate output? Each tree is slightly different: it is grown on a "bootstrapped" copy of the dataset, made up of samples drawn with replacement from the original dataset, and it uses a randomly selected subset of the features.
We now have the random forest, which usually outperforms the individual decision tree. Random forests are great at reducing the [variance](https://en.wikipedia.org/wiki/Variance) with respect to a single decision tree, are relatively robust to overfitting, and are wonderful for obtaining a baseline score against which to compare more extravagant techniques. For more details see the [introduction by Breiman and Cutler](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm), who invented the Random Forest in the early 2000s. With the [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) the most important hyperparameters are now `max_depth` as before, as well as `n_estimators`, which is the number of trees in the forest.
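
As a rough illustration of the variance reduction, the sketch below compares the cross-validated R² of a single tree with that of a forest. The `make_regression` data and the value `max_features=0.5` are assumptions chosen for the demonstration, not tuned settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic tabular data (an assumption, for illustration only)
X, y = make_regression(n_samples=300, n_features=10, n_informative=5,
                       noise=20.0, random_state=0)

# A single fully grown tree versus an ensemble of 100 bootstrapped trees,
# each split considering a random half of the features
tree = DecisionTreeRegressor(random_state=0)
forest = RandomForestRegressor(n_estimators=100, max_depth=None,
                               max_features=0.5, random_state=0)

# cross_val_score uses the estimator's default score, R^2 for regressors
tree_r2 = cross_val_score(tree, X, y, cv=5).mean()
forest_r2 = cross_val_score(forest, X, y, cv=5).mean()

print("single tree   R^2: %.3f" % tree_r2)
print("random forest R^2: %.3f" % forest_r2)
```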

**Gradient boosting**. This time, instead of planting a load of independent trees all at once at random (bootstrapping and aggregating, a.k.a. [bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating)), we plant the trees one after another, and each successive tree is fitted so as to compensate for the weaknesses (residual errors) of the trees planted before it. This is known as [boosting](https://en.wikipedia.org/wiki/Gradient_boosting). We have the hyperparameters `max_depth` and `n_estimators` as before, and now we also have a `learning_rate` hyperparameter, which lies between 0 and 1 and controls the amount of *shrinkage* applied to the contribution of each successive new tree.
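
The idea can be sketched in a few lines using plain scikit-learn trees. This is a minimal squared-error gradient boosting loop on synthetic data (an assumption made for illustration); it is not XGBoost's actual algorithm, which adds regularisation and second-order gradient information on top of this basic scheme.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (an assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

n_estimators, learning_rate, max_depth = 100, 0.1, 2

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y, y.mean())
trees, train_mse = [], []

for _ in range(n_estimators):
    # For squared error the residuals are the negative gradient of the loss
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
    # Shrinkage: only a fraction of each new tree's correction is added
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    train_mse.append(np.mean((y - prediction) ** 2))

def predict(X_new):
    """Sum the initial constant and the shrunken contribution of every tree."""
    out = np.full(len(X_new), y.mean())
    for fitted_tree in trees:
        out += learning_rate * fitted_tree.predict(X_new)
    return out

print("training MSE after 1 tree:    %.4f" % train_mse[0])
print("training MSE after %d trees: %.4f" % (n_estimators, train_mse[-1]))
```

A smaller `learning_rate` makes each step more cautious, so it generally needs a larger `n_estimators` to reach the same training error; this is why the two are usually tuned together.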


### Sample script:

This is a minimalist script which applies XGBoost regression to the [House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) data set. The purpose of this script is to serve as a basic starting framework which one can easily adapt.

Some suggestions for ways to improve the score are:

* Feature selection: for example see [recursive feature elimination script](https://www.kaggle.com/carlmcbrideellis/recursive-feature-elimination-hp-v1)
* Feature engineering: creating new features out of the existing features
* Outlier removal
* Imputation of missing values: XGBoost is resilient to missing values; however, one may like to try using the [missingpy](https://github.com/epsilon-machine/missingpy) library
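
As a small sketch of two of these suggestions, the snippet below builds one new feature and removes one outlier. The column names are those of the House Prices data, but the four rows of values, the combined `TotalSF` feature and the 4000 square feet threshold are all assumptions made for illustration.

```python
import pandas as pd

# Toy rows using House Prices column names; the values are made up
df = pd.DataFrame({
    "TotalBsmtSF": [856, 1262, 920, 3000],
    "1stFlrSF":    [856, 1262, 920, 3200],
    "2ndFlrSF":    [854, 0, 866, 1500],
    "GrLivArea":   [1710, 1262, 1786, 4700],
    "SalePrice":   [208500, 181500, 223500, 160000],
})

# Feature engineering: combine existing columns into a new feature
# (TotalSF is a hypothetical name for the total floor area)
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]

# Outlier removal: drop training rows above a hypothetical
# living-area threshold
cleaned = df[df["GrLivArea"] <= 4000].copy()

print(cleaned[["TotalSF", "GrLivArea", "SalePrice"]])
```

Any engineered feature has to be added to both the training and the test data, whereas outlier removal should only ever be applied to the training rows.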

**The best hyperparameters**. XGBoost has a multitude of hyperparameters, but here we shall only be using three of them. The optimal choice of these parameters can lead to a significant improvement in one's final score, so choosing the best values for these hyperparameters is important. To do this we shall perform a [cross-validated](https://scikit-learn.org/stable/modules/cross_validation.html) grid-search using the scikit-learn [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) routine. Note that this can be quite time-consuming, as it tries out each and every hyperparameter combination, exhaustively checking for the best result.
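
To get a feel for the cost, one can count the model fits before launching the search. The sketch below uses scikit-learn's `ParameterGrid` with the same grid and `cv=5` as the script in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# The same search grid as used in the script
param_grid = {"max_depth": [2, 3, 4, 5],
              "n_estimators": [400, 500, 600, 700],
              "learning_rate": [0.015, 0.020, 0.025]}
cv = 5

# Every combination is fitted once per cross-validation fold
n_combinations = len(ParameterGrid(param_grid))
n_fits = n_combinations * cv

print(n_combinations, "combinations,", n_fits, "fits")  # 48 combinations, 240 fits
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best combination on the whole training set. Adding a value to any one list multiplies the total, so the grid is best kept small.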

In [None]:
#===========================================================================
# load up the libraries
#===========================================================================
import pandas  as pd
import numpy   as np
import xgboost as xgb

#===========================================================================
# read in the data
#===========================================================================
train_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv',index_col=0)
test_data  = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv',index_col=0)

#===========================================================================
# here, for this simple demonstration we shall only use the numerical columns
# and ignore the categorical features
#===========================================================================
X_train = train_data.select_dtypes(include=['number']).copy()
X_train = X_train.drop(['SalePrice'], axis=1)
y_train = train_data["SalePrice"]
X_test  = test_data.select_dtypes(include=['number']).copy()

#===========================================================================
# XGBoost regression: 
# Parameters: 
# n_estimators  "Number of gradient boosted trees. Equivalent to number 
#                of boosting rounds."
# learning_rate "Boosting learning rate (also known as “eta”)"
# max_depth     "Maximum depth of a tree. Increasing this value will make 
#                the model more complex and more likely to overfit." 
#===========================================================================
regressor=xgb.XGBRegressor()

#===========================================================================
# exhaustively search for the optimal hyperparameters
#===========================================================================
from sklearn.model_selection import GridSearchCV
# set up our search grid
param_grid = {"max_depth": [2, 3, 4, 5],
              "n_estimators": [400, 500, 600, 700],
              "learning_rate": [0.015, 0.020, 0.025]}

# try out every combination of the above values
search = GridSearchCV(regressor, param_grid, cv=5).fit(X_train, y_train)

print("The best hyperparameters are ",search.best_params_)

We shall now use these values for our hyperparameters in our final calculation:

In [None]:
regressor=xgb.XGBRegressor(learning_rate = search.best_params_["learning_rate"],
                           n_estimators  = search.best_params_["n_estimators"],
                           max_depth     = search.best_params_["max_depth"])

regressor.fit(X_train, y_train)

#===========================================================================
# To use early_stopping_rounds: 
# "Validation metric needs to improve at least once in every 
# early_stopping_rounds round(s) to continue training."
#===========================================================================
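# first perform a train/validation split, taking care not to overwrite
# X_test, which holds the competition test data used for the predictions
#from sklearn.model_selection import train_test_split

#X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size = 0.2)
#regressor.fit(X_tr, y_tr, early_stopping_rounds=6, eval_set=[(X_val, y_val)], verbose=False)
# note: recent XGBoost releases expect early_stopping_rounds to be passed to
# the XGBRegressor constructor rather than to fit()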

#===========================================================================
# use the model to predict the prices for the test data
#===========================================================================
predictions = regressor.predict(X_test)

Let us now calculate our score (for more details see my notebook ["*House Prices: How to work offline*"](https://www.kaggle.com/carlmcbrideellis/house-prices-how-to-work-offline)):

In [None]:
# read in the ground truth file
solution   = pd.read_csv('../input/house-prices-advanced-regression-solution-file/solution.csv')
y_true     = solution["SalePrice"]

from sklearn.metrics import mean_squared_log_error
RMSLE = np.sqrt( mean_squared_log_error(y_true, predictions) )
print("The score is %.5f" % RMSLE )
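
The score used above is the root mean squared logarithmic error (RMSLE). A minimal NumPy sketch of the same quantity is given below; the `rmsle` helper and the demo prices are hypothetical, introduced only to make the formula explicit.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared error between log(1 + prediction) and log(1 + truth)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Made-up prices: 10% too high, exact, and 10% too low
y_true_demo = [200000, 150000, 300000]
y_pred_demo = [220000, 150000, 270000]

print("RMSLE of the demo predictions: %.5f" % rmsle(y_true_demo, y_pred_demo))
```

Because the errors are measured on a logarithmic scale, what matters is the ratio between the predicted and the true price, so a given relative error on a cheap house is penalised about as much as the same relative error on an expensive one.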

Finally, we write out a `submission.csv` for the competition:

In [None]:
#===========================================================================
# write out CSV submission file
#===========================================================================
output = pd.DataFrame({"Id":test_data.index, "SalePrice":predictions})
output.to_csv('submission.csv', index=False)

### Feature importance
Let us also take a very quick look at the feature importance too:

In [None]:
from xgboost import plot_importance
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 16})

fig, ax = plt.subplots(figsize=(12,6))
plot_importance(regressor, max_num_features=8, ax=ax)
plt.show();

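Here the meaning of the `F score` depends on the `importance_type` argument of `plot_importance`. With the default, `importance_type="weight"`, it is simply the number of times a feature is used to split the data across all trees. The measure described in [1] as "*...based on the number of times a variable is selected for splitting, weighted by the squared improvement to the model as a result of each split, and averaged over all trees*" is closer in spirit to `importance_type="gain"`.

### Links: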
* XGBoost: [documentation](https://xgboost.readthedocs.io/en/latest/index.html), [GitHub](https://github.com/dmlc/xgboost).
* LightGBM: [documentation](https://lightgbm.readthedocs.io/en/latest/index.html), [GitHub](https://github.com/microsoft/LightGBM).
* CatBoost: [documentation](https://catboost.ai/docs/), [GitHub](https://github.com/catboost).

*See also*:

* [Automatic tuning of XGBoost with XGBTune](https://www.kaggle.com/carlmcbrideellis/automatic-tuning-of-xgboost-with-xgbtune)
* [GPU accelerated SHAP values with XGBoost](https://www.kaggle.com/carlmcbrideellis/gpu-accelerated-shap-values-jane-street-example)

[1] [J. Elith, J. R. Leathwick, and T. Hastie "*A working guide to boosted regression trees*", Journal of Animal Ecology **77** pp. 802-813 (2008)](https://doi.org/10.1111/j.1365-2656.2008.01390.x)