# Boosting

In this notebook, we will apply boosting-based ensemble techniques to the regression problem on the `California Housing Dataset`.

We have already applied different regressors to this dataset. In this notebook, we will use:
- AdaBoost Regressor
- Gradient Boosting Regressor
- XGBoost Regressor

In [1]:
import numpy as np
import pandas as pd

np.random.seed(69)

from sklearn.datasets import fetch_california_housing

from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor

from sklearn.metrics import mean_absolute_error

from sklearn.model_selection import train_test_split, cross_validate, RandomizedSearchCV, ShuffleSplit

In [12]:
# To ignore all the warnings

import warnings
warnings.filterwarnings('ignore')

We use `ShuffleSplit` cross-validation with $10$ splits, setting aside $20\%$ of the data in each split as held-out data for model evaluation.

In [2]:
shufflesplit_cv = ShuffleSplit(
    n_splits=10,
    test_size=0.2,
    random_state=69
)
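
To make the splitting scheme concrete, here is a minimal sketch that applies the same `ShuffleSplit` configuration to a small synthetic array. The names `X_demo`, `demo_cv` and `split_sizes` are illustrative only and are not used elsewhere in the notebook. Each iteration independently draws a random 80/20 partition, so, unlike k-fold cross-validation, the held-out sets of different iterations may overlap.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Illustrative data only: 50 samples with a single feature.
X_demo = np.arange(50).reshape(-1, 1)

demo_cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=69)

# Record the (train, held-out) sizes produced by every split.
split_sizes = [
    (len(train_idx), len(test_idx))
    for train_idx, test_idx in demo_cv.split(X_demo)
]

print(f"number of splits: {len(split_sizes)}")
print(f"(train, held-out) sizes of the first split: {split_sizes[0]}")
```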

Let's download the data and split it into training and test sets.

In [3]:
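# fetch dataset as pandas objects
features, labels = fetch_california_housing(as_frame=True, return_X_y=True)
# the target is given in units of $100,000; rescale it to thousands of dollars (k$)
labels *= 100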

# train-test split
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=69)
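
Since `train_test_split` is called above without `test_size` or `train_size`, scikit-learn falls back to its default of holding out 25% of the samples. The short sketch below checks this on a small synthetic array; `X_demo`, `y_demo` and the `*_tr`/`*_te` names are illustrative and not part of the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data only: 100 samples with 2 features each.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# No test_size given, so the default 75/25 split is used.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=69)

print(f"training samples: {X_tr.shape[0]}, test samples: {X_te.shape[0]}")
```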

We define a helper function to train different regressors on the `California Housing Dataset`.

In [4]:
def train_regressor(estimator, X_train, y_train, cv, name):
    """Cross-validate `estimator` and print its mean absolute error in k$."""
    cv_results = cross_validate(
        estimator,
        X_train,
        y_train,
        cv=cv,
        scoring='neg_mean_absolute_error',
        return_train_score=True,
        return_estimator=True
    )

    # The scorer returns negated MAE values, so flip the sign to get errors.
    cv_train_error = -cv_results['train_score']
    cv_test_error = -cv_results['test_score']

    # "test set" below refers to the held-out part of each cross-validation split.
    print(f"On an average, {name} model makes an error of " f"{cv_train_error.mean():3f}k +/- {cv_train_error.std():3f}k on the training set.")
    print(f"On an average, {name} model makes an error of " f"{cv_test_error.mean():3f}k +/- {cv_test_error.std():3f}k on the test set.")

# AdaBoost

In [5]:
train_regressor(
    AdaBoostRegressor(random_state=69),
    X_train,
    y_train,
    cv=shufflesplit_cv,
    name="AdaBoost Regressor"
)

On an average, AdaBoost Regressor model makes an error of 69.595506k +/- 3.309228k on the training set.
On an average, AdaBoost Regressor model makes an error of 70.026886k +/- 3.182938k on the test set.


In [9]:
help(AdaBoostRegressor)

Help on class AdaBoostRegressor in module sklearn.ensemble._weight_boosting:

class AdaBoostRegressor(sklearn.base.RegressorMixin, BaseWeightBoosting)
 |  AdaBoostRegressor(base_estimator=None, *, n_estimators=50, learning_rate=1.0, loss='linear', random_state=None)
 |  
 |  An AdaBoost regressor.
 |  
 |  An AdaBoost [1] regressor is a meta-estimator that begins by fitting a
 |  regressor on the original dataset and then fits additional copies of the
 |  regressor on the same dataset but where the weights of instances are
 |  adjusted according to the error of the current prediction. As such,
 |  subsequent regressors focus more on difficult cases.
 |  
 |  This class implements the algorithm known as AdaBoost.R2 [2].
 |  
 |  Read more in the :ref:`User Guide <adaboost>`.
 |  
 |  .. versionadded:: 0.14
 |  
 |  Parameters
 |  ----------
 |  base_estimator : object, default=None
 |      The base estimator from which the boosted ensemble is built.
 |      If ``None``, then the base es
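
The docstring above says that instance weights are adjusted according to the error of the current prediction. The following is a simplified sketch of one such reweighting step in the style of AdaBoost.R2 with the linear loss. It is not scikit-learn's actual implementation: the function name `adaboost_r2_reweight` and the toy arrays are hypothetical, and the sketch assumes that at least one prediction is wrong and that the weighted average loss is below 0.5.

```python
import numpy as np


def adaboost_r2_reweight(sample_weight, y_true, y_pred, learning_rate=1.0):
    """One AdaBoost.R2-style weight update with the linear loss (sketch)."""
    abs_error = np.abs(y_true - y_pred)
    # Linear loss: scale every absolute error into [0, 1].
    loss = abs_error / abs_error.max()
    # Weighted average loss of the current regressor.
    avg_loss = np.sum(sample_weight * loss)
    # beta is below 1 whenever the average loss is below 0.5.
    beta = avg_loss / (1.0 - avg_loss)
    # Well-predicted samples (loss near 0) are shrunk by roughly beta;
    # badly predicted samples (loss near 1) keep most of their weight.
    new_weight = sample_weight * beta ** ((1.0 - loss) * learning_rate)
    return new_weight / new_weight.sum(), beta


# Toy example: the last sample is predicted much worse than the others.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.0, 3.0, 6.0])
initial_weight = np.full(4, 0.25)

updated_weight, beta = adaboost_r2_reweight(initial_weight, y_true, y_pred)
print("updated weights:", updated_weight)
print("beta:", beta)
```

After the update, the poorly predicted sample carries the largest weight, which is how subsequent regressors come to focus on difficult cases.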

# GradientBoosting

In [13]:
help(GradientBoostingRegressor)

Help on class GradientBoostingRegressor in module sklearn.ensemble._gb:

class GradientBoostingRegressor(sklearn.base.RegressorMixin, BaseGradientBoosting)
 |  GradientBoostingRegressor(*, loss='squared_error', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, init=None, random_state=None, max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None, warm_start=False, validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)
 |  
 |  Gradient Boosting for regression.
 |  
 |  GB builds an additive model in a forward stage-wise fashion;
 |  it allows for the optimization of arbitrary differentiable loss functions.
 |  In each stage a regression tree is fit on the negative gradient of the
 |  given loss function.
 |  
 |  Read more in the :ref:`User Guide <gradient_boosting>`.
 |  
 |  Parameters
 |  ----------
 |  loss : {'squared_er
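
The docstring above describes fitting a regression tree on the negative gradient of the loss at each stage. For the squared-error loss, that negative gradient is simply the residual, which the minimal sketch below exploits. It uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `mse_per_stage`) and leaves out most features of `GradientBoostingRegressor`, such as subsampling, early stopping, and other losses.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative synthetic data: a noisy sine curve.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_stages = 50

# Stage 0: a constant model; the mean minimizes the squared error.
prediction = np.full_like(y_demo, y_demo.mean())
mse_per_stage = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_stages):
    # For the loss 0.5 * (y - F)^2, the negative gradient w.r.t. F is y - F.
    residual = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residual)
    # Additive, stage-wise update shrunk by the learning rate.
    prediction += learning_rate * tree.predict(X_demo)
    mse_per_stage.append(np.mean((y_demo - prediction) ** 2))

print(f"training MSE after stage 0: {mse_per_stage[0]:.4f}")
print(f"training MSE after stage {n_stages}: {mse_per_stage[-1]:.4f}")
```

The values `max_depth=3` and `learning_rate=0.1` mirror the defaults shown in the signature above; they are not tuned.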

In [6]:
train_regressor(
    GradientBoostingRegressor(random_state=69),
    X_train,
    y_train,
    cv=shufflesplit_cv,
    name="GradientBoosting Regressor"
)

On an average, GradientBoosting Regressor model makes an error of 35.244218k +/- 0.165517k on the training set.
On an average, GradientBoosting Regressor model makes an error of 36.903203k +/- 0.681527k on the test set.


# XGBoost

In [14]:
from xgboost import XGBRegressor

Extreme Gradient Boosting (XGBoost) is a more recent boosting technique. It is a more regularized form of gradient boosting. Thanks to this regularization, it can often achieve better generalization performance than plain Gradient Boosting.
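
For reference, the regularization mentioned above can be written out explicitly. The notation below follows the XGBoost documentation rather than anything defined in this notebook: $l$ is the training loss, $f_1, \dots, f_K$ are the $K$ trees of the ensemble, and $\Omega$ is a complexity penalty on each tree (the $\alpha$ term is the optional L1 penalty).

$$
\text{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} \lvert w_j \rvert
$$

Here $T$ is the number of leaves of a tree and $w_j$ is the value predicted by leaf $j$. In `XGBRegressor`, $\gamma$, $\lambda$ and $\alpha$ correspond to the `gamma`, `reg_lambda` and `reg_alpha` parameters.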

In [15]:
help(XGBRegressor)

Help on class XGBRegressor in module xgboost.sklearn:

class XGBRegressor(XGBModel, sklearn.base.RegressorMixin)
 |  XGBRegressor(*, objective: Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType] = 'reg:squarederror', **kwargs: Any) -> None
 |  
 |  Implementation of the scikit-learn API for XGBoost regression.
 |  
 |  
 |  Parameters
 |  ----------
 |  
 |      n_estimators : int
 |          Number of gradient boosted trees.  Equivalent to number of boosting
 |          rounds.
 |  
 |      max_depth :  Optional[int]
 |          Maximum tree depth for base learners.
 |      learning_rate : Optional[float]
 |          Boosting learning rate (xgb's "eta")
 |      verbosity : Optional[int]
 |          The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
 |      objective : typing.Union[str, typing.Callable[[numpy.ndarray, numpy.ndarray], typing.Tuple[numpy.ndarray, numpy.ndarray]], NoneType]
 |          Specify the learning ta

In [16]:
train_regressor(
    XGBRegressor(random_state=69),
    X_train,
    y_train,
    cv=shufflesplit_cv,
    name="XGBoost Regressor"
)

On an average, XGBoost Regressor model makes an error of 18.111425k +/- 0.222967k on the training set.
On an average, XGBoost Regressor model makes an error of 31.811765k +/- 0.509883k on the test set.

