# Random Forests

## Ensemble Theory

Ensemble methods are a powerful family of approaches in machine learning. The idea is to combine multiple individual models, an ensemble, such that they collectively produce better predictions with greater generalisability than any single model. The individual models are often referred to as base models or "weak learners" and can be constructed using a single modelling algorithm or several different algorithms. Variability between the base models is desirable because their individual errors are then less likely to overlap, allowing the ensemble to correct for mistakes made by any single model. This diversity helps reduce overfitting, improves robustness, and leads to more stable and accurate predictions overall.

A common approach to creating an ensemble is **bagging** (bootstrap aggregating), which uses a technique known as **bootstrapping**. In bootstrapping, each base model is trained on its own random sample of the full training dataset, drawn with replacement. The base models are trained independently of one another, so this can be performed in parallel.

In this section, we will implement a **random forest classifier**, which is probably the most well-known ensemble model. In a generic random forest, the base model is a decision tree and the final class prediction is determined by the majority vote of the trees. Our implementation of a decision tree is as follows,

In [None]:
# Imports,
import numpy as np
import matplotlib.pyplot as plt

class DecisionTreeClassifier():

    def __init__(self, max_depth):
        """Constructor method for the DecisionTreeClassifier class. We simply create the class variables."""
        
        # Class variables for the data and nodes,
        self.X, self.y, self.nodes, self.leaves = None, None, None, None

        # Stopping criteria,
        self.max_depth = max_depth

        return None
    
    def fit(self, X, y):
        """Grows the tree on the features X and the integer class labels y."""
        self.X, self.y = X, y

        # Creating the root node,
        root_node = Node(self.X, self.y, parent_node=None)
        self.nodes, self.leaves = [root_node], [root_node]

        # Growth algorithm,
        for i in range(self.max_depth):
            new_leaves = self.grow_tree()

            # Adding child nodes,
            self.nodes.extend(new_leaves)

        # Assigning class predictions to the remaining leaves (they become terminal nodes) based on majority vote,
        for leaf_node in self.leaves:
            leaf_node.class_prediction = majority_vote(leaf_node.y)

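    def grow_tree(self):
        """Attempts to split every current leaf once, growing the tree by one layer. Returns the list of new leaves."""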
    
        # Placeholder list,
        new_leaves = []

        # Looping through current leaves,
        for node in self.leaves:

            # Performing split,
            child_node_left, child_node_right, valid_split = self.split(node.X, node.y, parent_node=node)

            if valid_split:
                # Assigning child nodes,
                node.child_left, node.child_right = child_node_left, child_node_right

                # Appending nodes to list,
                new_leaves.extend([child_node_left, child_node_right])
            else:
                # Node becomes a terminal node and is assigned a class prediction,
                node.class_prediction = majority_vote(node.y)

        # Updating leaves,
        self.leaves = new_leaves

        return new_leaves

    def split(self, X, y, parent_node):
        """Binary splits a parent node into two child nodes based on the decision that maximises information gain in accordance
        with Shannon entropy."""

        # Placeholder variables,
        max_gain = -1
        split_threshold_value = None
        found_valid_split = False
        X_best_left_split, X_best_right_split, y_best_left_split, y_best_right_split = None, None, None, None

        # Computing entropy before split,
        S_parent = compute_entropy(y)

        # Randomly selecting a subset of features to consider for this split (the subset size is itself random, from 1 to all features),
        subset_size = np.random.randint(low=1, high=(X.shape[1]+1), size=1)[0]
        feature_idxs = np.arange(start=0, stop=X.shape[1], step=1)
        selected_feature_idxs = np.random.choice(feature_idxs, size=subset_size, replace=False)

        # Double loop, first for each feature, second for each threshold value,
        for feature_idx in selected_feature_idxs:

            # Extracting feature values and thresholds,
            X_feature = X[:, feature_idx]
            thresholds = np.unique(X_feature)

            for threshold_value in thresholds:

                # Splitting data in parent node into child nodes,
                left_split_idxs, right_split_idxs = np.where(X_feature <= threshold_value)[0], np.where(X_feature > threshold_value)[0]
                X_left_split, X_right_split = X[left_split_idxs], X[right_split_idxs]
                y_left_split, y_right_split = y[left_split_idxs], y[right_split_idxs]

                # Reject splits which result in empty child nodes,
                if len(left_split_idxs) == 0 or len(right_split_idxs) == 0:
                    continue
                else:
                    found_valid_split = True

                # Compute entropy after split,
                S_left, S_right = compute_entropy(y_left_split), compute_entropy(y_right_split)

                # Calculating information gain,
                w1, w2 = len(y_left_split)/len(y), len(y_right_split)/len(y)
                delta_S = S_parent - (w1*S_left + w2*S_right)
            
                # Tracking maximum information gain,
                if delta_S > max_gain:

                    # Updating nodes associated with the best split,
                    max_gain, split_threshold_value, split_feature = delta_S, threshold_value, feature_idx
                    X_best_left_split, X_best_right_split, y_best_left_split, y_best_right_split = X_left_split, X_right_split, y_left_split, y_right_split

        # Creating node objects for the child nodes,
        if found_valid_split:
            child_node_left, child_node_right = Node(X_best_left_split, y_best_left_split, parent_node), Node(X_best_right_split, y_best_right_split, parent_node)
            parent_node.child_left, parent_node.child_right = child_node_left, child_node_right
            parent_node.decision = (split_feature, split_threshold_value)
            return child_node_left, child_node_right, True
        else:
            return None, None, False
        
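    def predict_sample(self, X_sample):
        """Returns the class prediction for a single sample by walking from the root node down to a terminal node."""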
        current_node = self.nodes[0]

        while current_node.decision is not None:
            feature_idx, threshold_value = current_node.decision

            if X_sample[feature_idx] <= threshold_value:
                current_node = current_node.child_left
            else:
                current_node = current_node.child_right

        return current_node.class_prediction

    def score(self, X, y):
        """Returns the prediction accuracy of the tree on a given dataset."""

        correct = 0
        n_samples = X.shape[0]
        for i in range(n_samples):
            pred, target = self.predict_sample(X_sample=X[i]), y[i]

            if pred == target:
                correct += 1

        accuracy = correct/n_samples
        
        return accuracy
    
    def info(self):
        """Prints the parent, decision and class prediction of every node in the tree."""
        for node in self.nodes:
            node.info(verbose=True)

class Node():
    """The class for node objects. Essentially used as a container."""

    def __init__(self, X, y, parent_node):
        """Constructor method for the node. Class variables contain node information and encode its location in the tree
        required for predictions."""

        # Node information,
        self.X, self.y = X, y
        self.decision = None
        self.class_prediction = None

        # Encodes location in the tree,
        self.parent, self.child_left, self.child_right = parent_node, None, None

    def info(self, verbose=False):
        """Returns information about the node."""

        if verbose:
            print(f"Parent: {self.parent}, Decision: {self.decision}, Class prediction: {self.class_prediction}")

        return self.parent, self.decision, self.class_prediction

def compute_entropy(y):
    """Helper function which computes the Shannon entropy of a given node."""

    # Computing probabilities P_j,
    classes, classes_counts = np.unique(y, return_counts=True)
    classes_probs = classes_counts/len(y)

    # Computing the Shannon entropy of the node,
    entropy = -np.sum(classes_probs*np.log2(classes_probs))

    return entropy

def majority_vote(array):
    """Returns the most frequent class index. Expects non-negative integer labels; ties resolve to the smallest index."""
    return np.bincount(array).argmax()
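
To see what `split` is optimising, it can help to evaluate the entropy and information gain by hand on a toy node. The sketch below is a minimal, standalone illustration: the label arrays and the helper names `entropy` and `information_gain` are made up for this example and are not part of the classes above.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    probs = counts / len(y)
    return -np.sum(probs * np.log2(probs))

def information_gain(y_parent, y_left, y_right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    w_left, w_right = len(y_left) / len(y_parent), len(y_right) / len(y_parent)
    return entropy(y_parent) - (w_left * entropy(y_left) + w_right * entropy(y_right))

# Hypothetical parent node with two balanced classes,
y_parent = np.array([0, 0, 1, 1])

# A split which separates the classes perfectly,
gain_perfect = information_gain(y_parent, np.array([0, 0]), np.array([1, 1]))

# A split which leaves both children as mixed as the parent,
gain_useless = information_gain(y_parent, np.array([0, 1]), np.array([0, 1]))

print(entropy(y_parent))  # 1.0 bit for a 50/50 node
print(gain_perfect)       # 1.0, all of the uncertainty is removed
print(gain_useless)       # 0.0, nothing is learned
```

The tree always prefers the candidate split with the larger gain, so in this toy case it would pick the first one.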

## Basic Implementation

Typically, a random forest incorporates two design choices which allow for variability between the trees. We have,

<ul>
  <li>Bootstrapping: As previously mentioned, each tree is trained on a random sample of the full training dataset, drawn with replacement. Because each tree sees different data, there is now an element of stochasticity in our model and the trees differ from one another. </li>
  <li>Random Feature Selection: When splitting a node in a tree, we randomly select a subset of features to make the decision on (see the <code>split</code> method above). This is different from the way we previously coded our decision trees. In the 
  original code, the construction of a tree was purely deterministic because we considered all features whenever we made a split. The purpose of random feature selection is to discourage correlation between trees, that is, to 
  make it less likely that trees will share the same node structure as a result of considering the same features. This means that the trees will be less likely to make the same mistakes. </li>
</ul>
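
As a rough illustration of the first point, the snippet below draws a single bootstrap sample of row indices with replacement and counts how many distinct rows it contains. The dataset size `n_samples` and the seed are arbitrary choices for this sketch. For large datasets, a bootstrap sample of the same size as the original contains roughly 63% of the distinct rows (the limit is 1 - 1/e), so each tree sees a noticeably different view of the data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # Seed chosen arbitrarily for reproducibility
n_samples = 10_000                   # Hypothetical dataset size

# Drawing a bootstrap sample of indices, the same size as the dataset, with replacement,
bootstrap_idxs = rng.choice(n_samples, size=n_samples, replace=True)

# Fraction of the original rows which appear at least once in the sample,
unique_fraction = len(np.unique(bootstrap_idxs)) / n_samples

print(f"Unique fraction: {unique_fraction:.3f}")
print(f"Theoretical limit 1 - 1/e: {1 - np.exp(-1):.3f}")
```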

Below, we have a basic implementation of a random forest classifier,

In [24]:
class RandomForestClassifier():
    """Class for the random forest ensemble model.

    PARAMETERS
    n_trees (int): The number of decision trees in the random forest.
    max_depth (int): The maximum number of layers a given tree may have.
    min_depth (int): The minimum number of layers a given tree may have.
    bootstrap_ratio (float, 0 < r <= 1): The ratio of the number of samples in each bootstrap sample to the number of samples in the full dataset.
    randomise_depth (bool): Toggle whether or not to randomise the max depth of a given tree between min_depth and max_depth."""

    def __init__(self, n_trees, max_depth, min_depth=1, bootstrap_ratio=1, randomise_depth=False):
        """Constructor method for the random forest ensemble model. Placeholder and class variables are assigned using this method."""

        # Class variables,
        self.n_trees, self.max_depth, self.min_depth = n_trees, max_depth, min_depth
        self.bootstrap_ratio, self.randomise_depth = bootstrap_ratio, randomise_depth

        # Ensemble list,
        self.trees = []

    def fit(self, X, y):
        """This method populates the random forest using the data supplied."""

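        # Clearing any previously fitted trees so that repeated calls to fit do not accumulate,
        self.trees = []

        # A new tree is created in each iteration,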
        for i in range(self.n_trees):

            # Selecting a subset of the training data (bootstrapping)
            X_sub, y_sub = self.bootstrap(X, y)

            # Determining the max depth of the tree (the upper bound of randint is exclusive, hence the +1),
            if self.randomise_depth:
                tree_depth = np.random.randint(self.min_depth, self.max_depth + 1)
            else:
                tree_depth = self.max_depth
            
            # Creating and fitting tree,
            tree = DecisionTreeClassifier(max_depth=tree_depth)
            tree.fit(X_sub, y_sub)

            # Adding the tree to the forest,
            self.trees.append(tree)

    def predict_sample(self, X_sample):
        """This function returns the class prediction of a single data sample."""

        # Placeholder,
        trees_pred = []

        # Looping over all trees,
        for tree in self.trees:

            # Individual tree prediction,
            tree_pred = tree.predict_sample(X_sample)
            trees_pred.append(tree_pred)

        # Finding majority-vote,
        trees_pred = np.array(trees_pred)
        model_pred = majority_vote(trees_pred)

        # Returning model prediction,
        return model_pred

    def score(self, X, y):
        """Returns the prediction accuracy of the model on a given dataset."""

        # Storing predictions as an array,
        preds = np.array([self.predict_sample(x) for x in X])

        # Calculating accuracy,
        correct, n_samples = len(np.where(preds == y)[0]), len(y)
        accuracy = correct/n_samples

        return accuracy

    def bootstrap(self, X, y):
        """Helper function used to create a bootstrap sample (drawn with replacement) from a larger set of data points. Returns a tuple (X_sub, y_sub)."""

        # Determining the number of samples in the full set and subset,
        n_samples = X.shape[0]
        n_subsamples = int(self.bootstrap_ratio*n_samples)

        # Computing the indices of the subset,
        subset_idxs = np.random.choice(n_samples, size=n_subsamples, replace=True)

        # Returning subset,
        return X[subset_idxs], y[subset_idxs]
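
The majority vote in `predict_sample` is what lets the forest outperform its individual trees. As a back-of-the-envelope sketch, suppose (idealistically) that each tree in a two-class problem is correct with the same probability `p` and that the trees err independently. The probability that the majority is correct is then a binomial tail sum. Real trees are correlated, so this is an idealisation rather than a prediction for our model; the function name and the value of `p` are assumptions for illustration.

```python
from math import comb

def majority_vote_accuracy(n_trees, p):
    """Probability that more than half of n_trees independent voters are correct,
    when each is correct with probability p (n_trees is assumed odd to avoid ties)."""
    k_min = n_trees // 2 + 1
    return sum(comb(n_trees, k) * p**k * (1 - p)**(n_trees - k) for k in range(k_min, n_trees + 1))

# Hypothetical base accuracy of 60% per tree,
for n_trees in (1, 11, 101):
    print(n_trees, round(majority_vote_accuracy(n_trees, p=0.6), 3))
```

Under these assumptions the accuracy of the vote climbs with the number of trees. This is why bootstrapping and random feature selection matter: the closer the trees are to making independent errors, the more the vote helps.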

Note that we also have the option to randomise the maximum depth of the trees in the forest. This is slightly atypical, but also discourages correlation between trees. We test our model on the Wine dataset,

In [25]:
# Importing,
from sklearn.model_selection import train_test_split
from sklearn import datasets

# Loading dataset,
wine_dataset = datasets.load_wine()

# Extracting features,
X, y = wine_dataset["data"], wine_dataset["target"]

# Creating training and test splits,
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=31)

# Creating and fitting model,
clf = RandomForestClassifier(n_trees=100, max_depth=5, randomise_depth=True)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.8888888888888888
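
As a rough sanity check, the same train/test split can be passed to scikit-learn's own `RandomForestClassifier`. The sketch below is self-contained and imports the library class under the alias `SklearnRandomForest` (an alias chosen here to avoid clashing with our class of the same name). The hyperparameters mirror ours where they have an obvious counterpart, and `random_state=0` is an arbitrary choice. The two implementations differ in several details (for example the default split criterion and how many features are considered per split), so the scores are not expected to match exactly.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier as SklearnRandomForest
from sklearn.model_selection import train_test_split

# Recreating the same split as above,
X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=31)

# Fitting the library model with comparable settings,
sk_clf = SklearnRandomForest(n_estimators=100, max_depth=5, random_state=0)
sk_clf.fit(X_train, y_train)

# Accuracy on the held-out test set,
sk_accuracy = sk_clf.score(X_test, y_test)
print(f"scikit-learn accuracy: {sk_accuracy:.3f}")
```

Since the `np.random` calls in our implementation are not seeded, our own score will vary somewhat from run to run, so a single figure should be read as indicative only.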