# LAB 07 - Random Forest for Regression

In this lab we extend the previous lab on decision trees and build a regression model using a random forest.

For simplicity, we will use the same dataset as in the previous lab (you can find it on ECLASS).

**IMPORTANT:** If you haven't finished your code from last week's lab on decision trees, you may use the sklearn implementation of a regression tree for this lab. However, this doesn't mean that you should skip the previous lab; the option exists only so that you don't fall behind on the content or spend all of today's time on the previous lab.

In [1]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

As mentioned before, use the Boston Housing data and prepare your train/val/test split as usual.

In [2]:
# your code here
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('./BostonHousing.txt')
X = data.values[:,:-1]
y = data.values[:,-1].reshape(-1, 1)

In [3]:
# your code here
random_state=42
X_train, X_tv, y_train, y_tv = train_test_split(X, y, test_size=0.2, random_state=random_state)
X_test, X_val, y_test, y_val = train_test_split(X_tv, y_tv, test_size=0.5, random_state=random_state)

## Exercise 1 -- Bootstrap

Bootstrapping is the first half of [bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating) (bootstrap aggregating). Bagging consists of drawing several samples of the original data with replacement, training an estimator on each sample, and then aggregating the estimators' predictions by averaging them (this is also a type of model ensemble).

In [4]:
def bootstrap(X:np.ndarray, num_bags:int=10) -> list:
    """
    Given a dataset and a number of bags,
    sample the dataset with replacement.
    
    This function does not return a copy
    of the datapoints, but a list of indices
    with compatible dimensionality
    
    Parameters
    ----------
    X : ndarray
        A dataset
    num_bags : int, default 10
        The number of bags to create
    
    Returns
    -------
    list of ndarray
        The list contains `num_bags` integer one-dimensional ndarrays.
        Each of these contains the indices corresponding to the 
        sampled datapoints in `X`
    
    Notes
    -----
    * The number of datapoints in each bag will
      match the number of datapoints in the given
      dataset.
    * The
    """
    rng = np.random.default_rng(0) # you can change the seed, or use 0 to replicate my results
    # Your code here
    num_samples = len(X)
    bags = list()

    for _ in range(num_bags):
        # Draw as many indices as there are datapoints, with replacement
        indices = rng.choice(num_samples, size=num_samples, replace=True)
        bags.append(indices)

    return bags

In [5]:
rng = np.random.default_rng(0)
X_small = rng.random(size=(100,2))
bags = bootstrap(X_small)
bags[0]

array([85, 63, 51, 26, 30,  4,  7,  1, 17, 81, 64, 91, 50, 60, 97, 72, 63,
       54, 55, 93, 27, 81, 67,  0, 39, 85, 55,  3, 76, 72, 84, 17,  8, 86,
        2, 54,  8, 29, 48, 42, 40,  2,  0, 12,  0, 67, 52, 64, 25, 61, 76,
       38, 46, 99, 80, 98, 37, 68, 95, 65, 84, 68, 70, 38, 87, 13, 57, 72,
       84, 52, 37, 31, 42, 48, 71, 88,  7, 93, 53, 35, 67, 57, 25, 32, 71,
       59, 50, 33, 76, 39, 32, 89, 26, 22, 71, 62,  4,  8, 37, 83],
      dtype=int64)
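
Because the indices are drawn with replacement, some datapoints appear several times in a bag and others not at all. The sketch below is an illustration only (the names `n_samples`, `unique_fraction` and `expected_fraction` are made up here and are not part of the lab): it draws one bootstrap sample and compares the fraction of distinct datapoints with the theoretical value $1 - (1 - 1/n)^n$, which approaches $1 - 1/e$ for large $n$.

```python
import numpy as np

# Hypothetical illustration: draw one bootstrap sample of indices and
# measure how many distinct datapoints it contains.
rng = np.random.default_rng(0)
n_samples = 10_000
indices = rng.choice(n_samples, size=n_samples, replace=True)

# Fraction of the original datapoints that appear at least once in the bag
unique_fraction = np.unique(indices).size / n_samples

# Probability that a given datapoint is drawn at least once in n draws
expected_fraction = 1 - (1 - 1 / n_samples) ** n_samples

print(f"unique fraction:   {unique_fraction:.3f}")
print(f"expected fraction: {expected_fraction:.3f}")
```

The datapoints left out of a bag differ from bag to bag, which is one reason the trees trained on different bags end up different from each other.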

## Exercise 2 -- Aggregation

The second part of bagging: combine the predictions of the individual estimators into a single prediction by averaging them.

In [6]:
def aggregate_regression(preds:list) -> np.ndarray:
    """
    Aggregate predictions by several estimators
    
    Parameters
    ----------
    preds : list of ndarray
        Predictions from multiple estimators.
        All ndarrays in this list should have the same
        dimensionality.
        
    Returns
    -------
    ndarray
        The mean of the predictions
    """
    # Your code here
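    # Accumulate in floating point so the in-place sum also works when the
    # first prediction array has an integer dtype and later ones are floats
    sum_preds = np.zeros_like(preds[0], dtype=float)

    for pred in preds:
        sum_preds += pred

    return sum_preds / len(preds)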
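
As a quick check of what the aggregation should produce, here is a minimal sketch with made-up predictions from three estimators (the values are hypothetical, not taken from the lab data). It uses `np.stack` and `np.mean` as a vectorised equivalent of summing the predictions and dividing by their count.

```python
import numpy as np

# Hypothetical predictions from three estimators for the same four datapoints
preds = [
    np.array([20.0, 30.0, 10.0, 40.0]),
    np.array([22.0, 28.0, 13.0, 40.0]),
    np.array([24.0, 32.0, 13.0, 43.0]),
]

# Stack into shape (num_estimators, num_datapoints), then average over estimators
mean_pred = np.mean(np.stack(preds), axis=0)
print(mean_pred)  # [22. 30. 12. 41.]
```

Each entry of the result is the average of the three estimators' predictions for that datapoint.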

## Exercise 3 -- Random Forest for regression

Using the functions you implemented above, it is now time to put them together: train several decision trees and then ensemble them to output a single prediction. A random forest also adds randomness over the features. The standard algorithm selects a random subset of features at each split of each tree; in this lab we use a simpler variant in which each tree is trained on its own random subset of `num_features` features.

For this part, you can use the sklearn implementation of a decision tree for regression (`DecisionTreeRegressor`) as the estimator for each bag and set of features. See below an example of how to do this, and always remember to check the documentation when using an external function.

Some parameters you will have to set are: 
* num_features: number of randomly selected features used by each estimator
* min_samples: minimum number of samples per leaf node
* max_depth: maximum depth of each decision tree (estimator)
* num_estimators: number of decision trees in the forest, each trained on its own bag and random set of features

In [7]:
# example of sklearn Decision tree
estimator = DecisionTreeRegressor(max_depth=4)
estimator.fit(X_train, y_train)
estimator.predict(X_val)

array([16.43181818, 21.63205128, 27.49767442, 21.63205128, 44.15789474,
       21.22      , 21.63205128, 14.77142857, 16.43181818, 21.63205128,
       44.15789474, 21.63205128, 10.27567568, 27.49767442, 21.63205128,
       16.43181818, 16.43181818, 21.63205128, 10.27567568, 21.63205128,
       21.63205128, 27.49767442, 16.43181818, 27.49767442, 21.63205128,
       14.77142857, 32.63243243, 21.63205128, 21.63205128, 21.63205128,
       21.63205128, 16.43181818, 21.63205128, 21.63205128, 16.43181818,
       27.49767442, 21.63205128, 10.27567568, 16.43181818, 16.43181818,
       16.43181818, 10.27567568, 16.43181818, 16.43181818, 21.63205128,
       10.27567568, 14.77142857, 14.77142857, 10.27567568, 14.77142857,
       44.15789474])
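
The parameters listed above map onto `DecisionTreeRegressor` arguments roughly as follows: `max_depth` is passed as is, and `min_samples` corresponds to `min_samples_leaf`. If you would rather subsample features at each split than once per tree, sklearn also offers the `max_features` argument. The sketch below shows these arguments on synthetic data; `X_demo` and `y_demo` are made up for illustration and are not the lab dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical, for illustration only): 200 datapoints, 6 features
rng = np.random.default_rng(0)
X_demo = rng.random(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1] + 0.1 * rng.standard_normal(200)

# max_features=3: at every split the tree considers only 3 randomly chosen features
tree = DecisionTreeRegressor(
    max_depth=4,
    min_samples_leaf=5,
    max_features=3,
    random_state=0,
)
tree.fit(X_demo, y_demo)

pred_demo = tree.predict(X_demo)
print(pred_demo.shape)
print(tree.get_depth())
```

Note that per-split subsampling with `max_features` is a different design from selecting one feature subset per tree: with `max_features` every tree can still use all features across its splits.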

In [8]:
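def train_random_forest(X:np.ndarray, y:np.ndarray, num_features:int, min_samples:int, max_depth:int, num_estimators:int) -> list:
    """
    Train `num_estimators` regression trees, each on its own bootstrap
    sample and its own random subset of `num_features` features.

    Returns a list of [tree, features] pairs, where `features` holds the
    column indices the tree was trained on.
    """
    rng = np.random.default_rng(0) # seeded so the feature subsets are reproducible
    forest = list()
    bags = bootstrap(X, num_estimators)

    for indices in bags:
        # Rows sampled with replacement for this tree
        X_sample = X[indices]
        y_sample = y[indices]

        # Random subset of columns (without replacement) for this tree
        features = rng.choice(X.shape[1], size=num_features, replace=False)
        X_sample = X_sample[:, features]

        tree = DecisionTreeRegressor(max_depth=max_depth, min_samples_leaf=min_samples)
        tree.fit(X_sample, y_sample)

        # Keep the feature indices: they are needed again at prediction time
        forest.append([tree, features])

    return forest

def predict_random_forest(X:np.ndarray, forest:list) -> np.ndarray:
    """
    Predict with every tree in `forest`, using the same feature subset
    it was trained on, and average the predictions.
    """
    predictions = []

    for tree, features in forest:
        pred = tree.predict(X[:, features])
        predictions.append(pred)

    return aggregate_regression(predictions)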

def rmse(y_real:np.ndarray, y_pred:np.ndarray) -> float:
    # Flatten both arrays first: y has shape (n, 1) while the predictions have
    # shape (n,), so subtracting them directly would broadcast to an (n, n) matrix
    y_real = np.ravel(y_real)
    y_pred = np.ravel(y_pred)
    return np.sqrt(np.mean(np.power(y_real - y_pred, 2)))

num_features = 3
min_samples = 25
max_depth = 10
num_estimators = 100

forest = train_random_forest(X_train, y_train, num_features, min_samples, max_depth, num_estimators)

y_val_pred = predict_random_forest(X_val, forest)
print(f"RMSE Val Set: {rmse(y_val, y_val_pred)}")
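
A shape detail worth checking in the RMSE computation: `y` was reshaped to a column of shape `(n, 1)`, while `predict` returns an array of shape `(n,)`. Subtracting arrays with these two shapes broadcasts to an `(n, n)` matrix of pairwise differences, which inflates the error. The sketch below uses small made-up arrays (not the lab data) to show the effect and how flattening both arrays avoids it.

```python
import numpy as np

# Hypothetical targets stored as a column vector, as produced by reshape(-1, 1)
y_real = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1)
# Hypothetical predictions as a flat vector, like the output of tree.predict
y_pred = np.array([1.0, 2.0, 3.0])         # shape (3,)

# Broadcasting (3, 1) against (3,) yields a (3, 3) matrix of all pairwise differences
diff_broadcast = y_real - y_pred
print(diff_broadcast.shape)

# Flattening both arrays gives the intended elementwise differences
diff_flat = y_real.ravel() - y_pred.ravel()
print(diff_flat.shape)

rmse_broadcast = np.sqrt(np.mean(diff_broadcast ** 2))
rmse_flat = np.sqrt(np.mean(diff_flat ** 2))
print(rmse_broadcast, rmse_flat)
```

Here the predictions match the targets exactly, so the flattened RMSE is zero, while the broadcast version reports a non-zero error.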
