# Exercise - RFs for regression

1. Use the **fetch_california_housing** data (remember to split your data into a train, validation, and test data). Using your training and validation data, optimize the parameters of your RF. How well does your optimized model perform on the test data?
   
**Note**: This dataset is **much** larger than what we have otherwise been using. This means you cannot try a million different things without the code running very slowly!

**See slides for more details!**

In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn import ensemble
import tqdm
import pandas as pd
import numpy as np

X, y = fetch_california_housing(return_X_y=True)

housing_data = fetch_california_housing()
print(housing_data.DESCR)

# Use `train_test_split` to split your data into a train and a test set.
X_train, X_test, y_train, y_test = X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

# Use `train_test_split` to split your train data into a train and a validation  set.
X_train, X_val, y_train, y_val   = train_test_split(X_train,
                                                   y_train,
                                                   test_size=0.2,
                                                   random_state=42)

print(X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

# Exercise 1

Use the **fetch_california_housing** data (remember to split your data into a train, validation, and test data). Using your training and validation data, optimize the parameters of your RF. How well does your optimized model perform on the test data?

Let us start by ensuring we can just run an RF without any optimization. Note how it is slower than a lot of what we have done so far!

In [3]:
rf_current = ensemble.RandomForestRegressor()
rf_current.fit(X_train, y_train)
y_val_hat = rf_current.predict(X_val)
mse = mean_squared_error(y_val, y_val_hat)

print(f'RF with default settings has validation MSE of {mse}.')

RF with default settings has validation MSE of 0.275505019417254.


In [4]:
# Remember you can try other stuff than these specific parameters.
# Just here to get you started!

n_estimators_list = [2,4,6]
min_samples_split_list = [5, 8, 10]
min_samples_leaf_list = [10,15,20]

results = []

for n_estimators in n_estimators_list:
    for min_samples_split in min_samples_split_list:
        for min_samples_leaf in min_samples_leaf_list:
            rf_current = ensemble.RandomForestRegressor(
                n_estimators=n_estimators,
                min_samples_split=min_samples_split,
                min_samples_leaf=min_samples_leaf,
                )
            rf_current.fit(X_train, y_train)
            y_val_hat = rf_current.predict(X_val)
            mse = mean_squared_error(y_val, y_val_hat)

            results.append([mse, n_estimators, min_samples_split, min_samples_leaf])

results = pd.DataFrame(results)
results.columns = ['MSE', 'n_estimators', 'min_samples_split', 'min_samples_leaf']
print(results)

         MSE  n_estimators  min_samples_split  min_samples_leaf
0   0.352073             2                  5                10
1   0.365053             2                  5                15
2   0.366284             2                  5                20
3   0.355221             2                  8                10
4   0.353240             2                  8                15
5   0.362745             2                  8                20
6   0.386546             2                 10                10
7   0.356868             2                 10                15
8   0.369001             2                 10                20
9   0.315994             4                  5                10
10  0.338576             4                  5                15
11  0.342273             4                  5                20
12  0.328047             4                  8                10
13  0.332032             4                  8                15
14  0.337263             4              

In [8]:
# Extract best parameters.
results[results['MSE'] == results['MSE'].max()]

Unnamed: 0,MSE,n_estimators,min_samples_split,min_samples_leaf
6,0.386546,2,10,10


In [9]:
# Initialize your final model

rf_current_best = ensemble.RandomForestRegressor(
                n_estimators=2,
                min_samples_split=10,
                min_samples_leaf=10,
                )

rf_current_best.fit(np.concatenate([X_train, X_val]), np.concatenate([y_train, y_val]))

y_val_hat_rf_current_best = rf_current_best.predict(X_test)

mse_best = mean_squared_error(y_val, y_val_hat)
print(f'Optimized RF Current {mse_best}% accuracy.')

# Use both training and validation data to fit it using np.concatenate (np.concatenate "stacks" the array like rbind in R)

# Predict on test data

# Obtain and check mse on test data


Optimized RF Current 0.3314576437988222% accuracy.
