# Random Forest for for Regression

In this lab, we will be using the solution from the last lab, and we will extend it to implement a random forest for regression. To simplify the process, we will be using the same dataset, and therefore code to load it and splitting it into training and testing is provided below.

As a reminder we are making a dataset where the features are

* `variance`
* `skewness`
* `curtosis`

and the target variable is

* `entropy`

For this lab, we simply ignore the `class` column

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt', names=['variance', 'skewness', 'curtosis', 'entropy', 'class'])

X = data.loc[:, ["variance", "skewness", "curtosis"]].values
y = data["entropy"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


## DecisionTree Regressor 

In the following cell, I simply copy the solution from the previous lab, and I turn it into a scikit-learn regressor for convenience. I also copy the function to print the tree, as part of the regressor.

> This code SEEMS extremely long, but you can delete the documentation lines and see that it's actually not too bad <span style="font-size:20pt;">😃</span>

In [2]:
from sklearn.base import RegressorMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class DTRegressor(RegressorMixin):

    """
    CART Decision Tree Regressor
    
    This is not an efficient implementation, but it should serve to learn how to put together 
    a simple decision tree for regression.
    
    All features are assumed to be numeric.
    
    Parameters
    ----------
    min_samples : int
        parameter for stopping criterion if a node has <= min_samples datapoints
    max_depth : int
        parameter for stopping criterion if a node belongs to this depth
    """
    def __init__(self,
                 min_samples,
                 max_depth):
        self.min_samples_ = min_samples
        self.max_depth_ = max_depth
        self.root_ = None # placeholder to hold the root node for predictions
    
    def print_tree(self, depth):
        if 'value' in node.keys():
            print('.  '*(depth-1), f"[{node['value']}]")
        else:
            print('.  '*depth, f'X_{node["feature_index"]} < {node["tau"]}')
            print_tree(node['left'], depth+1)
            print_tree(node['right'], depth+1)

    def recursive_growth(self, node, current_depth, X, y):
        """
        Recursively grows a decision tree.

        Parameters
        ----------
        node : dictionary
            If the node is terminal, it contains only the "value" key, which determines the value to be used as a prediction.
            If the node is not terminal, the dictionary has the structure defined by `get_split`
        depth : int
            current distance from the root
        X : array (n_samples, n_features)
            features (full dataset)
        y : array (n_samples, )
            labels (full dataset)

        Notes
        -----
        To create a terminal node, a dictionary is created with a single "value" key, with a value that
        is the mean of the target variable

        'left' and 'right' keys are added to non-terminal nodes, which contain (possibly terminal) nodes 
        from higher levels of the tree:
        'left' corresponds to the 'low_region' key, and 'right' to the 'high_region' key
        """
        if 'low_region' in node.keys(): # not a terminal node
            lo = node['low_region']
            hi = node['high_region']
            # process left
            if len(lo) <= self.min_samples_ or current_depth == self.max_depth_:
                node['left'] = {'value':y[lo].mean()}
            else:
                node['left'] = self.get_split(X[lo], y[lo])
                self.recursive_growth(node['left'], current_depth + 1, X, y)

            # process right
            if len(hi) <= self.min_samples_ or current_depth == self.max_depth_:
                node['right'] = {'value':y[lo].mean()}
            else:
                node['right'] = self.get_split(X[hi], y[hi])
                self.recursive_growth(node['right'], current_depth + 1, X, y)

    def get_split(self, X, y):
        """
        Given a dataset (full or partial), splits it on the feature of that minimizes the sum of squared error

        Parameters
        ----------
        X : array (n_samples, n_features)
            features 
        y : array (n_samples, )
            labels

        Returns
        -------
        decision : dictionary
            keys are:
            * 'feature_index' -> an integer that indicates the feature (column) of `X` on which the data is split
            * 'tau' -> the threshold used to make the split
            * 'low_region' -> array of indices where the `feature_index`th feature of X is lower than `tau`
            * 'high_region' -> indices not in `low_region`
        """
        best_criterion = float("inf") # unreasonably high Gini Index
        best_feature_index = None
        best_tau = None
        best_lo = None
        best_hi = None
        for feature_index in range(X.shape[1]):
            for tau in X[:, feature_index]:
                lo, hi = self.split_region(X, feature_index, tau)
                criterion = (self.regression_criterion(y[lo]) + 
                             self.regression_criterion(y[hi]))
                if criterion < best_criterion:
                    best_criterion = criterion
                    best_feature_index = feature_index
                    best_tau = tau
                    best_lo = lo
                    best_hi = hi
        return {
            'feature_index': best_feature_index,
            'tau': best_tau,
            'low_region' : best_lo,
            'high_region' : best_hi,
        }

    @staticmethod
    def split_region(region, feature_index, tau):
        """
        Given a region, splits it based on the feature indicated by
        `feature_index`, the region will be split in two, where
        one side will contain all points with the feature with values 
        lower than `tau`, and the other split will contain the 
        remaining datapoints.

        Parameters
        ----------
        region : array of size (n_samples, n_features)
            a partition of the dataset (or the full dataset) to be split
        feature_index : int
            the index of the feature (column of the region array) used to make this partition
        tau : float
            The threshold used to make this partition

        Return
        ------
        low_partition : array
            indices of the datapoints in `region` where feature < `tau`
        high_partition : array
            indices of the datapoints in `region` where feature >= `tau` 
        """
        return np.where(region[:, feature_index] < tau)[0], np.where(region[:, feature_index] >= tau)[0]

    @staticmethod
    def regression_criterion(region):
        """
        Implements the sum of squared error criterion in a region

        Parameters
        ----------
        region : ndarray
            Array of shape (N,) containing the values of the target values 
            for N datapoints in the training set.

        Returns
        -------
        float
            The sum of squared error

        Note
        ----
        The error for an empty region should be infinity
        This avoids creating empty regions
        """
        N = region.shape[0]
        if N > 0:
            return np.sum(region - region.mean())**2
        return float("inf")

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        
        self.n_features_in_ = X.shape[1]
        
        self.root_ = self.get_split(X, y)
        self.recursive_growth(self.root_, 1, X, y)
        
        return self
    
    @staticmethod
    def predict_sample(node, sample):
        """
        Makes a prediction based on the decision tree defined by `node`

        Parameters
        ----------
        node : dictionary
            A node created one of the methods above
        sample : array of size (n_features,)
            a sample datapoint
        """
        if 'value' in node.keys():
            return node['value']
        if sample[node['feature_index']] < node['tau']:
            return DTRegressor.predict_sample(node['left'], sample)
        return DTRegressor.predict_sample(node['right'], sample)
    
    def predict(self, X):
        """
        Makes a prediction based on the decision tree defined by `node`

        Parameters
        ----------
        X : array of size (n_samples, n_features)
            n_samples predictions will be made
        """
        check_is_fitted(self)
        
        X = check_array(X)

        prediction = np.zeros(X.shape[0])
        
        for i, sample in enumerate(X):
            prediction[i] = self.predict_sample(self.root_, sample)
        
        return prediction

We can test this implementation also using the code from the last lab

In [3]:
from sklearn.metrics import mean_squared_error
dt = DTRegressor(min_samples=20, max_depth=6)
dt.fit(X_train, y_train)
test_mse = mean_squared_error(y_test, dt.predict(X_test))
train_mse = mean_squared_error(y_train, dt.predict(X_train))

print(f'Train MSE : {train_mse}')
print(f'Test MSE : {test_mse}')

Train MSE : 4.582477460507463
Test MSE : 4.704711300439497


## Exercise 1 -- Bootstrap

Also known as [bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating), this technique consists of making several samples with replacement of the original data, using each of the samples to train an estimator, and then aggregating the predictions using the average (this is also a type of model ensemble).

In [4]:
def bootstrap(X, num_bags=10):
    """
    Given a dataset and a number of bags,
    sample the dataset with replacement.
    
    This function does not return a copy
    of the datapoints, but a list of indices
    with compatible dimensionality
    
    Parameters
    ----------
    X : ndarray
        A dataset
    num_bags : int, default 10
        The number of bags to create
    
    Returns
    -------
    list of ndarray
        The list contains `num_bags` integer one-dimensional ndarrays.
        Each of these contains the indices corresponding to the 
        sampled datapoints in `X`
    
    Notes
    -----
    * The number of datapoints in each bach will
      match the number of datapoints in the given
      dataset.
    * The
    """
    rng = np.random.default_rng(0) # you can change the seed, or use 0 to replicate my results
    # Your code here

In [5]:
rng = np.random.default_rng(0)
X_small = rng.random(size=(100,2))
bags = bootstrap(X_small)
bags[0]

array([85, 63, 51, 26, 30,  4,  7,  1, 17, 81, 64, 91, 50, 60, 97, 72, 63,
       54, 55, 93, 27, 81, 67,  0, 39, 85, 55,  3, 76, 72, 84, 17,  8, 86,
        2, 54,  8, 29, 48, 42, 40,  2,  0, 12,  0, 67, 52, 64, 25, 61, 76,
       38, 46, 99, 80, 98, 37, 68, 95, 65, 84, 68, 70, 38, 87, 13, 57, 72,
       84, 52, 37, 31, 42, 48, 71, 88,  7, 93, 53, 35, 67, 57, 25, 32, 71,
       59, 50, 33, 76, 39, 32, 89, 26, 22, 71, 62,  4,  8, 37, 83])

## Exercise 2 -- Aggregation

The second part of bagging.

In [6]:
def aggregate_regression(preds):
    """
    Aggregate predictions by several estimators
    
    Parameters
    ----------
    preds : list of ndarray
        Predictions from multiple estimators.
        All ndarrays in this list should have the same
        dimensionality.
        
    Return
    ------
    ndarray
        The mean of the predictions
    """
    # Your code here

## Exercise 3 -- Random Forest for regression

Using the functions you implemented above, it is now time to put all of them together to train several decision trees and then ensemble them to output a single prediction. For the random forest, however, we need to select a subset of features at each split on the decision tree. 

A convenient way to implement this, is to specialize the existing implementation of the `DTRegressor` class that is provided in this lab, add a couple of parameters, and overrride a small set of functions (namely, `fit`, `predict`, and `get_split`). If you don't feel comfortable using object oriented inheritance, you can always copy the code above, and make the necessary modifications.


In [7]:
class RFRegressor(DTRegressor):
    """
    A CART-based random forest for regression
    
    Parameters
    ----------
    min_samples : int
        parameter for stopping criterion if a node has <= min_samples datapoints
    max_depth : int
        parameter for stopping criterion if a node belongs to this depth
    num_estimators : int, default 10
        The number of trees in the forest.
    num_features : int
        The number of features to consider at each split
        
    Notes
    -----
    This implementation uses bootstrap samples, each estimator is trained
    on a different sample of datapoints.
    """
    def __init__(self, min_samples, max_depth, num_estimators, num_features):
        # Your code here

    def get_split(self, X, y):
        """
        Given a dataset (full or partial), splits it on the feature of that minimizes the sum of squared error

        Parameters
        ----------
        X : array (n_samples, n_features)
            features 
        y : array (n_samples, )
            labels

        Returns
        -------
        decision : dictionary
            keys are:
            * 'feature_index' -> an integer that indicates the feature (column) of `X` on which the data is split
            * 'tau' -> the threshold used to make the split
            * 'low_region' -> array of indices where the `feature_index`th feature of X is lower than `tau`
            * 'high_region' -> indices not in `low_region`
        """
        # Your code here


    
    def fit(self, X, y):
        # Your code here

        
    def predict(self, X):
        # Your code here

        

The dataset we used for part 1 has very few features to be useful for random forest, for this, we can use scikit learn and create a random regression problem

In [8]:
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=400, n_features=15)
X_train, X_test, y_train, y_test = train_test_split(X, y)

num_features = X_train.shape[1] // 3

rf = RFRegressor(min_samples=20, max_depth=6, num_estimators=10, num_features=num_features)
rf.fit(X_train, y_train)
test_mse = mean_squared_error(y_test, rf.predict(X_test))
train_mse = mean_squared_error(y_train, rf.predict(X_train))

print(f'Train MSE : {train_mse}')
print(f'Test MSE : {test_mse}')

Train MSE : 36032.860933058844
Test MSE : 29970.33167009771


These numbers look absolutely massive compared to the numbers we obtained in the previous lab. But remember that they don't mean much in isolation, but only when compared to another model trained on the same training set, and tested on the same testing set. 

Let's see the performance of `DTRegressor` on this dataset

In [9]:
dt = DTRegressor(min_samples=20, max_depth=6)
dt.fit(X_train, y_train)
test_mse = mean_squared_error(y_test, dt.predict(X_test))
train_mse = mean_squared_error(y_train, dt.predict(X_train))

print(f'Train MSE : {train_mse}')
print(f'Test MSE : {test_mse}')

Train MSE : 34747.7357686727
Test MSE : 28375.69828304471
