# Bagging and Random Forest Regressors

In this notebook, we will apply ensemble techniques to a regression problem on the California housing dataset.

We have already applied different regressors to the California housing dataset. In this notebook, we will make use of:
- Decision Tree Regressor
- Bagging Regressor
- Random Forest Regressor

We will observe a performance improvement when we use Random Forest over both the Decision Tree regressor and Bagging, which also implicitly uses Decision Tree regressors.

# Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_california_housing

from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

from sklearn.metrics import mean_absolute_error

from sklearn.model_selection import train_test_split, cross_validate, RandomizedSearchCV, ShuffleSplit

from sklearn.tree import DecisionTreeRegressor

In [2]:
np.random.seed(69)

We use `ShuffleSplit` cross-validation with $10$ splits, setting aside $20\%$ of the data in each split as test data for model evaluation.

In [3]:
shufflesplit_cv = ShuffleSplit(
    n_splits=10,
    test_size=0.2,
    random_state=69
)
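To make the splitting scheme concrete, the sketch below runs the same `ShuffleSplit` configuration on a toy array of 100 samples. The array `X_toy` and its size are illustrative assumptions, not part of the notebook's data. Each of the 10 iterations draws a fresh random 80/20 partition, so, unlike k-fold, the test sets of different iterations may overlap.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data (hypothetical): 100 samples with a single feature.
X_toy = np.arange(100).reshape(-1, 1)

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=69)

# Record the (train, test) sizes of every split.
split_sizes = [
    (len(train_idx), len(test_idx)) for train_idx, test_idx in cv.split(X_toy)
]
print(len(split_sizes), split_sizes[0])
```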

Let's download the data and split it into training and test sets.

In [4]:
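# fetch dataset
features, labels = fetch_california_housing(as_frame=True, return_X_y=True)
# the target is expressed in units of $100,000; rescale it to thousands of dollars (k$)
labels *= 100

# train-test split (by default, 25% of the data is held out as the test set)
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=69)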

We define a helper function to train different regressors on the California housing dataset.

In [5]:
def train_regressor(estimator, X_train, y_train, cv, name):
    """Cross-validate `estimator` and print its mean absolute error (in k$)."""
    cv_results = cross_validate(
        estimator,
        X_train,
        y_train,
        cv=cv,
        scoring='neg_mean_absolute_error',
        return_train_score=True,
        return_estimator=True
    )

    # scikit-learn reports negated errors as scores, so flip the sign back
    cv_train_error = -1 * cv_results['train_score']
    cv_test_error = -1 * cv_results['test_score']

    # "test set" here means the held-out part of each cross-validation split
    print(f"On an average, {name} model makes an error of "
          f"{cv_train_error.mean():3f}k +/- {cv_train_error.std():3f}k on the training set.")
    print(f"On an average, {name} model makes an error of "
          f"{cv_test_error.mean():3f}k +/- {cv_test_error.std():3f}k on the test set.")
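As a quick illustration of what `cross_validate` hands back to the helper, here is a minimal sketch on synthetic data. The dataset from `make_regression` and the names `X_toy`, `y_toy`, and `results` are assumptions made for this example only. The returned dictionary holds one entry per split for each key, and the scores are negated errors, hence non-positive.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (hypothetical).
X_toy, y_toy = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

results = cross_validate(
    DecisionTreeRegressor(random_state=0),
    X_toy,
    y_toy,
    cv=ShuffleSplit(n_splits=3, test_size=0.2, random_state=0),
    scoring="neg_mean_absolute_error",
    return_train_score=True,
    return_estimator=True,
)

# Keys include the timings, the fitted estimators, and the train/test scores.
print(sorted(results.keys()))
# One negated MAE per split; multiply by -1 to read it as an error.
print(-results["test_score"])
```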

# Decision Tree Regressor

In [6]:
train_regressor(
    DecisionTreeRegressor(random_state=69),
    X_train,
    y_train,
    cv=shufflesplit_cv,
    name="Decision Tree Regressor"
)

On an average, Decision Tree Regressor model makes an error of 0.000000k +/- 0.000000k on the training set.
On an average, Decision Tree Regressor model makes an error of 46.533976k +/- 1.283197k on the test set.


As the output shows, the decision tree overfits heavily: it fits the training data perfectly (zero error) yet makes an average error of about 46.5k on held-out data.
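The zero training error is what one would expect from a fully grown tree, which can keep splitting until every leaf holds a single training sample. The following sketch, on synthetic data (an assumption; it does not reproduce the housing numbers), contrasts an unrestricted tree with one capped at `max_depth=3`, a value chosen arbitrarily for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression data (hypothetical stand-in).
X_toy, y_toy = make_regression(
    n_samples=500, n_features=5, noise=20.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

errors = {}
for label, depth in [("unrestricted", None), ("max_depth=3", 3)]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    errors[label] = (
        mean_absolute_error(y_tr, tree.predict(X_tr)),  # training error
        mean_absolute_error(y_te, tree.predict(X_te)),  # held-out error
    )
    print(f"{label}: train MAE={errors[label][0]:.3f}, test MAE={errors[label][1]:.3f}")
```

The unrestricted tree memorizes the training split, so the gap between its two errors is a direct measure of overfitting.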

# Bagging Regressor

In [7]:
train_regressor(
    BaggingRegressor(random_state=69),
    X_train,
    y_train,
    cv=shufflesplit_cv,
    name="Bagging Regressor"
)

On an average, Bagging Regressor model makes an error of 14.266920k +/- 0.123606k on the training set.
On an average, Bagging Regressor model makes an error of 35.320800k +/- 0.718662k on the test set.


Although the error on the training set has increased, the model generalizes better: the error on the test set has dropped from about 46.5k to about 35.3k.
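One way to see why the training error is no longer zero is to build a bagging ensemble by hand. The sketch below is a simplified illustration of the idea, not scikit-learn's actual implementation: it draws bootstrap samples with NumPy, fits one tree per sample, and averages the predictions. The synthetic data and all variable names are assumptions. Each bootstrap sample contains only about 63% of the distinct training points, so every training point is unseen by some of the trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy, y_toy = make_regression(
    n_samples=1000, n_features=5, noise=10.0, random_state=0
)

trees, unique_fractions = [], []
for _ in range(10):
    # Bootstrap: sample n indices with replacement.
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    unique_fractions.append(len(np.unique(idx)) / len(X_toy))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_toy[idx], y_toy[idx]))

# The bagged prediction is the average of the individual tree predictions.
bagged_pred = np.mean([tree.predict(X_toy) for tree in trees], axis=0)
train_mae = mean_absolute_error(y_toy, bagged_pred)

print(f"mean fraction of distinct samples per bootstrap: {np.mean(unique_fractions):.3f}")
print(f"training MAE of the hand-built ensemble: {train_mae:.3f}")
```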

# Random Forest Regressor

In [8]:
train_regressor(
    RandomForestRegressor(random_state=69),
    X_train,
    y_train,
    cv=shufflesplit_cv,
    name="Random Forest Regressor"
)

On an average, Random Forest Regressor model makes an error of 12.558201k +/- 0.082668k on the training set.
On an average, Random Forest Regressor model makes an error of 33.435454k +/- 0.633468k on the test set.
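When comparing the two ensembles, it may help to keep their default settings in mind: `BaggingRegressor` builds 10 trees by default, whereas `RandomForestRegressor` builds 100, so part of the gap could come from ensemble size alone. The sketch below simply prints a few defaults side by side; the value shown for `max_features` depends on the installed scikit-learn version, so no particular value is assumed here.

```python
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

bagging_defaults = BaggingRegressor().get_params()
forest_defaults = RandomForestRegressor().get_params()

# Compare a few hyperparameters that both ensembles expose.
for key in ("n_estimators", "max_features", "bootstrap"):
    print(f"{key}: bagging={bagging_defaults[key]!r}, forest={forest_defaults[key]!r}")
```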


# Hyperparameter Tuning for Random Forest Regressor

In [10]:
rfr_params = {
    'n_estimators':[1,2,5,10,20,50,100,200,500],
    'max_leaf_nodes':[2,5,10,20,50,100]
}

rfr_search_cv = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=2, random_state=69),
    param_distributions=rfr_params,
    scoring='neg_mean_absolute_error',
    n_iter=10,
    random_state=69,
    n_jobs=2
)

In [11]:
rfr_search_cv.fit(X_train,y_train)

RandomizedSearchCV(estimator=RandomForestRegressor(n_jobs=2, random_state=69),
                   n_jobs=2,
                   param_distributions={'max_leaf_nodes': [2, 5, 10, 20, 50,
                                                           100],
                                        'n_estimators': [1, 2, 5, 10, 20, 50,
                                                         100, 200, 500]},
                   random_state=69, scoring='neg_mean_absolute_error')
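The search space above contains $9 \times 6 = 54$ combinations, of which `n_iter=10` are tried. The sketch below uses `ParameterGrid` and `ParameterSampler` to show this sampling in isolation; it illustrates the idea and is not guaranteed to reproduce the exact candidates drawn inside the fitted search.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

rfr_params = {
    "n_estimators": [1, 2, 5, 10, 20, 50, 100, 200, 500],
    "max_leaf_nodes": [2, 5, 10, 20, 50, 100],
}

# Size of the full grid that an exhaustive search would have to cover.
n_combinations = len(ParameterGrid(rfr_params))

# Randomized search evaluates only a random subset of that grid.
sampled = list(ParameterSampler(rfr_params, n_iter=10, random_state=69))

print(n_combinations, len(sampled))
print(sampled[0])
```

Because every parameter is given as a list, the candidates are sampled without replacement, so the 10 settings are all distinct.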

Let's look at the results of this randomized parameter search.

In [12]:
columns = [f"param_{name}" for name in rfr_params.keys()]
columns += ['mean_test_error', 'std_test_error']
cv_results = pd.DataFrame(rfr_search_cv.cv_results_)
# scores are negated errors, so flip the sign of the mean
cv_results['mean_test_error'] = -cv_results['mean_test_score']
# a standard deviation is unaffected by negation, so it is kept as-is
cv_results['std_test_error'] = cv_results['std_test_score']
cv_results[columns].sort_values(by='mean_test_error')

| | param_n_estimators | param_max_leaf_nodes | mean_test_error | std_test_error |
|---|---|---|---|---|
| 7 | 5 | 100 | 41.875612 | 0.525083 |
| 1 | 10 | 20 | 49.718961 | 0.698376 |
| 9 | 1 | 50 | 49.978396 | 0.780722 |
| 6 | 1 | 20 | 54.061509 | 0.522468 |
| 0 | 200 | 10 | 54.704262 | 0.603559 |
| 4 | 2 | 10 | 56.582734 | 0.660194 |
| 5 | 20 | 2 | 72.419007 | 1.085487 |
| 3 | 200 | 2 | 72.551403 | 1.182701 |
| 2 | 500 | 2 | 72.601623 | 1.161199 |
| 8 | 2 | 2 | 74.424927 | 1.036156 |


In [13]:
error = -rfr_search_cv.score(X_test, y_test)
print(f"On average, our Random Forest Regressor makes an error of {error:.2f}k")

On average, our Random Forest Regressor makes an error of 42.73k
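The sign flip in the last cell is needed because a search configured with `scoring='neg_mean_absolute_error'` returns the negated error from `score`. A minimal sketch on synthetic data, with a deliberately small search space (both are assumptions made to keep the example fast), checks this against `mean_absolute_error` and shows where the selected hyperparameters can be read.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_toy, y_toy = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [5, 10, 20],
        "max_leaf_nodes": [5, 20, 50],
    },
    scoring="neg_mean_absolute_error",
    n_iter=3,
    random_state=0,
)
search.fit(X_tr, y_tr)

# score() applies the configured scorer to the refitted best estimator.
test_error = -search.score(X_te, y_te)
manual_error = mean_absolute_error(y_te, search.predict(X_te))

print(search.best_params_)
print(f"{test_error:.3f} vs {manual_error:.3f}")
```

In the notebook's own search, every candidate caps `max_leaf_nodes` at 100 or fewer, which may explain why the tuned model's 42.73k test error is higher than the roughly 33.4k cross-validated error of the unconstrained forest.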
