## Bagging Regressor

In [1]:
import pandas as pd
import numpy as np
# Importing the California Housing dataset
from sklearn.datasets import fetch_california_housing


# Loading the dataset
housing_data = fetch_california_housing()
housing = pd.DataFrame(data = housing_data['data'], columns = housing_data['feature_names'])


In [2]:
housing['MedHouseValue'] = housing_data['target']

In [9]:
housing.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseValue
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [10]:
X,y = fetch_california_housing(return_X_y=True)

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

In [6]:
from sklearn.model_selection import train_test_split


In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X, y , train_size=0.80, test_size=0.20, random_state=123)
print('Train/Test Sets Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

Train/Test Sets Sizes :  (16512, 8) (4128, 8) (16512,) (4128,)


In [12]:
lr = LinearRegression()
dt = DecisionTreeRegressor()
knn = KNeighborsRegressor()

In [14]:
lr.fit(X_train,Y_train)

In [15]:
dt.fit(X_train,Y_train)


In [16]:
knn.fit(X_train,Y_train)

In [17]:
y_pred1 = lr.predict(X_test)
y_pred2 = dt.predict(X_test)
y_pred3 = knn.predict(X_test)

In [20]:
print("R^2 score for LR: {:.2f}".format(r2_score(Y_test,y_pred1)))
print("R^2 score for DT: {:.2f}".format(r2_score(Y_test,y_pred2)))
print("R^2 score for KNN: {:.2f}".format(r2_score(Y_test,y_pred3)))

R^2 score for LR: 0.61
R^2 score for DT: 0.62
R^2 score for KNN: 0.16
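
For reference, the R² values above follow the usual definition of the coefficient of determination, 1 - SS_res / SS_tot. The snippet below is a minimal pure-Python sketch of that formula, not the scikit-learn implementation; the function name `r2` and the toy numbers are made up for illustration.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]               # toy targets, mean = 6.0

perfect = r2(y_true, y_true)                 # 1.0: every prediction is exact
mean_only = r2(y_true, [6.0] * 4)            # 0.0: no better than predicting the mean
worse = r2(y_true, [9.0, 7.0, 5.0, 3.0])     # -3.0: worse than predicting the mean

print(perfect, mean_only, worse)
```

A score of 0.16, as for KNN above, therefore means the model is only slightly better than always predicting the mean target.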


In [21]:
from sklearn.ensemble import BaggingRegressor

bag_regressor = BaggingRegressor(random_state=1)  # default settings: 10 decision trees, rows sampled with replacement
bag_regressor.fit(X_train, Y_train)

In [22]:
Y_preds = bag_regressor.predict(X_test)

print('Training Coefficient of R^2 : %.3f'%bag_regressor.score(X_train, Y_train))
print('Test Coefficient of R^2 : %.3f'%bag_regressor.score(X_test, Y_test))

Training Coefficient of R^2 : 0.963
Test Coefficient of R^2 : 0.792
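
To make the idea behind these numbers concrete, here is a minimal from-scratch sketch of bagging: draw bootstrap samples of the training set, fit one base learner per sample, and average their predictions. It is an illustration under stated assumptions, not what `BaggingRegressor` does internally. The helper names `fit_1nn` and `fit_bagging` are hypothetical, the base learner is a 1-nearest-neighbour stand-in rather than a decision tree, and the data is a synthetic 1-D toy set.

```python
import random


def fit_1nn(xs, ys):
    """Return a 1-nearest-neighbour predictor for 1-D inputs (stand-in base learner)."""
    points = list(zip(xs, ys))

    def predict(x):
        # Target value of the training point closest to x
        return min(points, key=lambda p: abs(p[0] - x))[1]

    return predict


def fit_bagging(xs, ys, n_estimators=10, seed=0):
    """Fit n_estimators base learners on bootstrap samples and average them."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        # Bootstrap sample: n row indices drawn with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_1nn([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        # Ensemble prediction is the mean of the individual predictions
        return sum(m(x) for m in models) / len(models)

    return predict


# Synthetic toy data: a noisy line y = 2x
data_rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2 * x + data_rng.gauss(0, 1) for x in xs]

bagged = fit_bagging(xs, ys)
prediction = bagged(5.0)
print(prediction)
```

Averaging over resampled fits smooths out the noise that any single high-variance learner picks up, which is the usual explanation for the gap between the single decision tree and the bagged ensemble above.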



- Without tuning any hyperparameters, the Bagging Regressor outperforms all three single models on the test set (R² of 0.792 versus 0.61, 0.62, and 0.16).
- Rather than trying variants such as Pasting and Random Patches one at a time, we will use GridSearchCV to search over the sampling options.
- The grid covers all four combinations of `bootstrap` and `bootstrap_features`, along with different base estimators, ensemble sizes, and sample and feature fractions.
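
The variants mentioned above differ only in how rows and columns are drawn for each base estimator. The sketch below illustrates that with plain index sampling. The scheme names follow the terminology commonly used in the scikit-learn user guide, while the helper `draw`, the 0.5 fractions, and the 10 x 8 toy shape are assumptions chosen for illustration.

```python
import random


def draw(n, fraction, with_replacement, rng):
    """Pick int(fraction * n) indices out of range(n), returned sorted."""
    k = max(1, int(fraction * n))
    if with_replacement:
        return sorted(rng.randrange(n) for _ in range(k))
    return sorted(rng.sample(range(n), k))


# scheme name -> (row fraction, rows drawn with replacement?, column fraction)
schemes = {
    "bagging":          (1.0, True, 1.0),   # bootstrap the rows, keep all columns
    "pasting":          (0.5, False, 1.0),  # row subset without replacement
    "random subspaces": (1.0, False, 0.5),  # all rows, column subset
    "random patches":   (0.5, False, 0.5),  # subsets of both rows and columns
}

rng = random.Random(0)
n_rows, n_cols = 10, 8
subsets = {}
for name, (row_frac, row_repl, col_frac) in schemes.items():
    rows = draw(n_rows, row_frac, row_repl, rng)
    cols = draw(n_cols, col_frac, False, rng)
    subsets[name] = (rows, cols)
    print(f"{name:17s} rows={rows} cols={cols}")
```

In `BaggingRegressor` these choices correspond to the `bootstrap`, `bootstrap_features`, `max_samples`, and `max_features` parameters that the grid search varies.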



In [26]:
%%time

n_samples = X.shape[0]
n_features = X.shape[1]

params = {'estimator': [None, LinearRegression(), KNeighborsRegressor()],
          'n_estimators': [20,50,100],
          'max_samples': [0.5,1.0],
          'max_features': [0.5,1.0],
          'bootstrap': [True, False],
          'bootstrap_features': [True, False]}

bagging_regressor_grid = GridSearchCV(BaggingRegressor(random_state=1,
                                                       n_jobs=-1),
                                      param_grid =params,
                                      cv=3,
                                      n_jobs=-1,
                                      verbose=1)

bagging_regressor_grid.fit(X_train, Y_train)

Fitting 3 folds for each of 144 candidates, totalling 432 fits
CPU times: total: 1.62 s
Wall time: 5min 52s
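
The candidate count in the log can be checked by hand: the grid is the Cartesian product of all parameter lists. The sketch below reproduces that arithmetic with `itertools.product`; the estimators are replaced by placeholder strings so the snippet runs without scikit-learn.

```python
from itertools import product

# Same grid as above, with placeholder strings standing in for the estimator objects
grid = {
    "estimator": ["None (decision tree)", "LinearRegression", "KNeighborsRegressor"],
    "n_estimators": [20, 50, 100],
    "max_samples": [0.5, 1.0],
    "max_features": [0.5, 1.0],
    "bootstrap": [True, False],
    "bootstrap_features": [True, False],
}

candidates = list(product(*grid.values()))  # every combination of parameter values
n_candidates = len(candidates)              # 3 * 3 * 2 * 2 * 2 * 2
n_fits = n_candidates * 3                   # one fit per candidate per CV fold (cv=3)

print(n_candidates, n_fits)
```

Because the number of fits grows multiplicatively with each added parameter, a wall time of several minutes for this grid is expected.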


In [27]:
print('Train R^2 Score : %.3f'%bagging_regressor_grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%bagging_regressor_grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%bagging_regressor_grid.best_score_)
print('Best Parameters : ',bagging_regressor_grid.best_params_)

Train R^2 Score : 0.973
Test R^2 Score : 0.816
Best R^2 Score Through Grid Search : 0.801
Best Parameters :  {'bootstrap': True, 'bootstrap_features': True, 'estimator': None, 'max_features': 1.0, 'max_samples': 1.0, 'n_estimators': 100}


## Summary

- Without tuning any hyperparameters, the Bagging Regressor achieves better results than the other three algorithms (Linear Regression, Decision Tree, KNN).
- The default Bagging Regressor achieved an R² score of 0.792 on the test set, compared with 0.963 on the training set.
- Using GridSearchCV to search over the bagging parameters improved the performance further.
- The best mean cross-validated R² score found by GridSearchCV (3 folds on the training data) was 0.801.
- The best parameters found were:
  - `bootstrap`: True
  - `bootstrap_features`: True
  - `estimator`: None (defaults to a decision tree)
  - `max_features`: 1.0
  - `max_samples`: 1.0
  - `n_estimators`: 100
- With these parameters, the Bagging Regressor achieved an R² score of 0.816 on the test set (0.973 on the training set).
- The tuned model is a modest improvement over the default Bagging Regressor (test R² of 0.816 versus 0.792). Note that R² is not an accuracy: it means the model explains roughly 82% of the variance in the test targets.

