
Here, I will implement the `Random Forest` algorithm.

In [29]:
import numpy as np
from collections import Counter
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [2]:
def node_initialization(feature_index=None,threshold_value=None,left_child_node=None,right_child_node=None,class_value=None):
  node={
      "feature":feature_index,
      "threshold":threshold_value,
      "left":left_child_node,
      "right":right_child_node,
      "value":class_value
  }
  return node

In [3]:
def is_leaf_node(node):
  return node["value"] is not None

In [4]:
min_samples_split=2
max_depth=10
n_features=None
root=None

In [38]:
def fit_decision_tree(X,y,_min_samples_split=2,_max_depth=100,_n_features=None):
  global min_samples_split,max_depth,n_features,root
  min_samples_split=_min_samples_split
  max_depth=_max_depth
  n_features=_n_features
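  # Pass the hyperparameters explicitly; otherwise grow_tree silently falls back to its own
  # defaults (max_depth=100, n_features=None) and ignores the values given here
  root=grow_tree(X,y,0,min_samples_split,max_depth,n_features)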
  return root

In [6]:
def grow_tree(X,y,depth=0,min_samples_split=2,max_depth=100,n_features=None):
  n_samples, n_feats=X.shape

  n_labels=len(np.unique(y))

  # Stop splitting and return a leaf if any of these 3 conditions holds:
  # * the current depth has reached the maximum depth allowed
  # * all the samples belong to a single class (i.e. the node is 100% pure)
  # * the number of samples is at most min_samples_split (note the `<=` below)
  if (depth >= max_depth) or (n_labels == 1) or (n_samples <= min_samples_split):
    leaf_value=most_common_label(y)
    return create_node(value=leaf_value)

  feat_idxs = np.random.choice(n_feats,n_features if n_features else n_feats,replace=False)
  # Above, we choose which features this node may split on. If 'n_features' was passed as an
  # argument, we sample that many feature indices without replacement; otherwise we use all
  # n_feats features

  # Now, we focus on finding the best split among all the splits possible
  # Also, for a particular feature, since it may not always be a binary one, we also have to
  # decide what threshold we choose i.e. where we draw a line for a particular feature's values
  best_feature, best_thresh = best_split(X,y,feat_idxs,min_samples_split)

  # Now that we know what the best feature to split is, we actually split the nodes
  left_idxs, right_idxs = split(X[:,best_feature],best_thresh)
  left=grow_tree(X[left_idxs,:],y[left_idxs],depth+1,min_samples_split,max_depth,n_features)
  right=grow_tree(X[right_idxs,:],y[right_idxs],depth+1,min_samples_split,max_depth,n_features)

  return create_node(feature=best_feature,threshold=best_thresh,left=left,right=right)
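
The feature-sampling line in `grow_tree` can be tried in isolation. The sketch below assumes a hypothetical node with 30 features and, purely as an illustration, draws `int(sqrt(n_feats))` of them, which is a common heuristic for classification forests. The run in this notebook passes `_n_features=None`, so every split here considers all features.

```python
import numpy as np

np.random.seed(0)  # seed chosen only so the sketch is repeatable

n_feats = 30                        # hypothetical feature count for this illustration
n_features = int(np.sqrt(n_feats))  # assumed heuristic, not what this notebook uses

# Same call as in grow_tree: draw distinct feature indices (no repeats)
feat_idxs = np.random.choice(n_feats, n_features if n_features else n_feats, replace=False)
print(feat_idxs)

# With n_features=None the expression falls back to all n_feats indices, in shuffled order
all_idxs = np.random.choice(n_feats, n_feats, replace=False)
print(sorted(all_idxs.tolist()) == list(range(n_feats)))
```

Because `replace=False`, asking for more features than the data has would raise an error, so `n_features` should not exceed the number of columns in `X`.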

In [7]:
def create_node(feature=None, threshold=None, left=None, right=None, value=None):
  return {"feature":feature, "threshold":threshold, "left":left, "right":right, "value":value}

In [8]:
def best_split(X,y,feat_idxs,min_samples_split):
  best_gain=-1
  split_idx=None
  split_threshold=None

  for feat_idx in feat_idxs:
    X_column=X[:,feat_idx]
    thresholds=np.unique(X_column)
    # i.e. the candidate thresholds are not computed by some formula; they are simply the
    # values that already occur in this column
    for threshold in thresholds:
      #gain=information_gain(y,X_column, threshold, min_samples_split)
      gain=information_gain(y,X_column, threshold)


      if gain>best_gain:
        best_gain=gain
        split_idx=feat_idx
        split_threshold=threshold

  return split_idx, split_threshold

In [9]:
def information_gain(y,X_column, threshold):
  # Calculate the entropy of the parent
  parent_entropy=entropy(y)

  # Now, create children using the threshold value
  left_idxs, right_idxs=split(X_column, threshold)

  if((len(left_idxs)==0) or (len(right_idxs)==0)):
    return 0

  n=len(y)

  n_l=len(left_idxs)
  e_l=entropy(y[left_idxs])

  n_r=len(right_idxs)
  e_r=entropy(y[right_idxs])

  child_entropy=(n_l/n)*(e_l)+(n_r/n)*(e_r)

  return (parent_entropy - child_entropy)
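
In formula form, `information_gain` computes the following. The symbols $H$, $y_L$, $y_R$, $n_L$ and $n_R$ are notation introduced here for what the code calls `entropy`, `y[left_idxs]`, `y[right_idxs]`, `n_l` and `n_r`, and $p_c$ is the fraction of samples with class $c$:

$$
\mathrm{IG}(y, t) = H(y) - \left( \frac{n_L}{n} H(y_L) + \frac{n_R}{n} H(y_R) \right),
\qquad
H(y) = -\sum_{c} p_c \log_2 p_c
$$

As a small worked example, a parent with labels `[0, 0, 1, 1]` has $H(y) = 1$. A threshold that separates it into `[0, 0]` and `[1, 1]` gives two pure children with entropy 0, so the gain is $1 - \left(\tfrac{2}{4}\cdot 0 + \tfrac{2}{4}\cdot 0\right) = 1$, the largest value possible with two classes.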

In [10]:
def split(X_column, split_thresh):
  left_idxs = np.argwhere(X_column<=split_thresh).flatten()
  right_idxs= np.argwhere(X_column >split_thresh).flatten()
  return left_idxs, right_idxs
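
To see what `split` returns, here is a minimal sketch on a made-up feature column. The function body repeats the logic above so that the snippet runs on its own; the values and the threshold are arbitrary.

```python
import numpy as np

def split(X_column, split_thresh):
    # Same logic as the split function above: values at or below the threshold go left
    left_idxs = np.argwhere(X_column <= split_thresh).flatten()
    right_idxs = np.argwhere(X_column > split_thresh).flatten()
    return left_idxs, right_idxs

X_column = np.array([2.0, 5.0, 3.0, 8.0])  # made-up values for one feature
left_idxs, right_idxs = split(X_column, 3.0)

print(left_idxs)   # [0 2]  (values 2.0 and 3.0)
print(right_idxs)  # [1 3]  (values 5.0 and 8.0)
```

Note that the function returns row indices, not the values themselves, which is why `grow_tree` can use them to index both `X` and `y`.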

In [11]:
def entropy(y):
  hist=np.bincount(y)
  ps=hist/len(y)
  # The above gives us the occurrence ratio for each index, where we interpret each index
  # as a class label
  return -np.sum([p*np.log2(p) for p in ps if p>0])
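
A few hand-checkable inputs make the behaviour of `entropy` concrete. The sketch below repeats the function so that it runs standalone; the label arrays are made up.

```python
import numpy as np

def entropy(y):
    # Same formula as above: -sum(p * log2(p)) over the classes that actually occur
    hist = np.bincount(y)
    ps = hist / len(y)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])

balanced = entropy(np.array([0, 0, 1, 1]))  # two classes, 50/50
skewed = entropy(np.array([0, 0, 0, 1]))    # 75/25
pure = entropy(np.array([1, 1, 1, 1]))      # a single class

print(balanced)          # 1.0
print(round(skewed, 4))  # 0.8113
print(pure == 0)         # True
```

Since `np.bincount` only accepts non-negative integers, this implementation assumes the labels are encoded as 0, 1, 2, and so on, as they are in the breast cancer dataset used here.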

In [12]:
def most_common_label(y):
  counter=Counter(y)
  most_common=counter.most_common(1)[0][0]
  return most_common

In [13]:
def predict(X,root):
  return np.array([traverse_tree(x,root) for x in X])

In [14]:
def traverse_tree(x,node):
  if is_leaf_node(node):
    return node["value"]

  if x[node["feature"]]<=node["threshold"]:
    return traverse_tree(x,node["left"])

  else:
    return traverse_tree(x,node["right"])
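
The traversal is easiest to follow on a tiny tree built by hand. The sketch below is self-contained: it repeats `create_node` and a compact version of `traverse_tree`, and the tree itself (`toy_root`, its features and thresholds) is hypothetical.

```python
def create_node(feature=None, threshold=None, left=None, right=None, value=None):
    return {"feature": feature, "threshold": threshold, "left": left, "right": right, "value": value}

def traverse_tree(x, node):
    # A leaf is any node whose "value" is set
    if node["value"] is not None:
        return node["value"]
    if x[node["feature"]] <= node["threshold"]:
        return traverse_tree(x, node["left"])
    return traverse_tree(x, node["right"])

# Hand-built, hypothetical tree:
#   feature 0 <= 2.5 ?  yes -> class 0
#                       no  -> feature 1 <= 1.0 ?  yes -> class 1,  no -> class 0
toy_root = create_node(
    feature=0, threshold=2.5,
    left=create_node(value=0),
    right=create_node(feature=1, threshold=1.0,
                      left=create_node(value=1),
                      right=create_node(value=0)),
)

print(traverse_tree([1.0, 9.0], toy_root))  # 0  (goes left at the root)
print(traverse_tree([4.0, 0.5], toy_root))  # 1  (right, then left)
print(traverse_tree([4.0, 3.0], toy_root))  # 0  (right, then right)
```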

In [15]:
def accuracy(y_test, y_pred):
    return np.sum(y_test == y_pred) / len(y_test)

In [39]:
data = datasets.load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)


# def fit_decision_tree(X,y,_min_samples_split=2,_max_depth=100,_n_features=None)
fit_decision_tree(X_train, y_train, _min_samples_split=2, _max_depth=10, _n_features=None)


predictions = predict(X_test, root)

print(f"Accuracy: {100*accuracy(y_test, predictions)} percent")

Accuracy: 92.10526315789474 percent


The above code is the same code that I used in the decision tree model.

This is because, as the name suggests, a `Random Forest` is a forest, i.e. in this context, a collection of trees. It is called random because the set of examples used to train each tree in the forest is chosen randomly.

Before making predictions, we train as many trees as we want. Then, to arrive at a final prediction, we combine the outputs of all the trees:
* if it is a classification problem, we choose the class that the trees predict most often
* if it is a regression problem, we average the trees' predictions
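
The two ways of combining the trees' outputs can be sketched with made-up numbers. The lists below stand in for the outputs of five hypothetical trees on a single sample. Only the classification case is implemented in this notebook; the regression line is included for comparison.

```python
from collections import Counter
from statistics import mean

# Hypothetical outputs of five trees for one sample
class_votes = [1, 0, 1, 1, 0]            # classification: each tree votes for a class
value_preds = [2.0, 3.0, 2.5, 3.5, 4.0]  # regression: each tree predicts a number

majority_class = Counter(class_votes).most_common(1)[0][0]
average_value = mean(value_preds)

print(majority_class)  # 1
print(average_value)   # 3.0
```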

Now, we begin actually coding up our random forest model building upon the decision tree code previously written.

In [18]:
n_trees=10
max_depth=10
min_samples_split=2
n_features=None
trees=[]

We just declared some global variables so that every function can use them without having to receive them as arguments.

Now, we write a simple function to initialize our forest.

In [19]:
def initialize_random_forest(_n_trees=10,_max_depth=10,_min_samples_split=2,_n_features=None):
  global n_trees,max_depth,min_samples_split,n_features,trees
  n_trees=_n_trees
  max_depth=_max_depth
  min_samples_split=_min_samples_split
  n_features=_n_features
  trees=[] # Do this to reset the list of trees and eliminate trees that may have been built
           # previously

Now, we write the `build_random_forest` function, which trains the model.

In [43]:
def build_random_forest(X,y,_n_trees=10,_max_depth=10,_min_samples_split=2,_n_features=None):
  global trees

  for _ in range(_n_trees):
    X_sample, y_sample=bootstrap_samples(X,y)
    # 'fit_decision_tree' originally returned nothing and only updated the global variables
    # used by the standalone decision tree. It now returns the root of the tree, which is
    # how we represent a tree.
    # This matters because the decision tree's predict function also takes the tree's root
    # in order to traverse it, so it makes sense to pass a tree around by its root node.
    tree=fit_decision_tree(X_sample,y_sample,_min_samples_split=_min_samples_split,
                           _max_depth=_max_depth,_n_features=_n_features)
    trees.append(tree)

To keep `build_random_forest` from getting too bulky, we write a separate `bootstrap_samples` function that creates the samples to train each decision tree on.

This function creates the dataset for a particular tree by choosing data points from the dataset randomly but **with replacement**.

What this means is that after 'choosing' a sample, we put it back in our dataset. This makes it possible for the same sample to be chosen again.

Hence, when we use this method to build a dataset, some samples from our original dataset might occur more than once in our bootstrapped dataset while some samples might not occur at all.

This randomness in the datasets lets the trees differ substantially from one another, with different `node splits` at each level. Combining them gives a random forest model that is more robust than any single decision tree: it reacts less strongly to minor changes in the dataset.

In [21]:
def bootstrap_samples(X,y):
  n_samples=X.shape[0]
  idxs=np.random.choice(n_samples,n_samples,replace=True)
  return X[idxs],y[idxs]
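
A quick experiment shows the effect of sampling with replacement. The sketch below repeats `bootstrap_samples` so that it runs on its own and uses a made-up dataset in which each row stores its own index. For a large dataset, the expected share of distinct original rows in a bootstrap sample is about 1 - 1/e, i.e. roughly 63%.

```python
import numpy as np

np.random.seed(0)  # seed chosen only so the sketch is repeatable

def bootstrap_samples(X, y):
    # Same logic as above: draw n_samples row indices with replacement
    n_samples = X.shape[0]
    idxs = np.random.choice(n_samples, n_samples, replace=True)
    return X[idxs], y[idxs]

n = 10_000                       # hypothetical dataset size
X = np.arange(n).reshape(-1, 1)  # each row stores its own index, so we can track it
y = np.zeros(n, dtype=int)

X_sample, y_sample = bootstrap_samples(X, y)
unique_fraction = len(np.unique(X_sample)) / n

print(X_sample.shape)             # (10000, 1), same size as the original
print(round(unique_fraction, 2))  # close to 0.63
```

The remaining rows, roughly a third of the data, are never seen by that particular tree, which is part of why the trees end up different from one another.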

Now, we only need to write the prediction function, `predict_random_forest`.

In [41]:
def predict_random_forest(X):
  predictions=np.array([predict(X,tree) for tree in trees])
  predictions=np.swapaxes(predictions,0,1)
  #predictions=predictions.T
  final_predictions=np.array([most_common_label(pred) for pred in predictions])

  return final_predictions

While the above code looks simple, it actually does a lot.

First, for the test dataset (or whatever dataset we are asked to predict on), we get every tree's prediction for every sample and store the results in a numpy array.

In this form, the `predictions` array is a bit inconvenient for us. In it, each row contains the predictions from a particular tree for all the rows in X.

i.e. the 1st row holds the 1st tree's predictions for all the rows in X, and so on. Equivalently, the 1st column contains all the trees' predictions for the 1st row in X.

But we need to predict a particular row's class using the predictions of all the trees in the forest. For this, we would like all the trees' predictions for a particular row to be in the same row.

This can be achieved with a simple transpose of the matrix, i.e. here, a swap of the axes of the array.

After the swap, the 1st row holds all the trees' predictions for the 1st row in X.

Then, going row by row, we simply choose the most commonly occurring label and make it the predicted class for that row.
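
The reshaping and voting steps described above can be checked on a tiny example. The array below is a made-up set of predictions from 3 trees for 4 samples, not output from the model in this notebook.

```python
import numpy as np
from collections import Counter

# Hypothetical predictions of 3 trees (rows) for 4 samples (columns)
tree_preds = np.array([
    [0, 1, 1, 0],  # tree 1
    [0, 1, 0, 0],  # tree 2
    [1, 1, 0, 1],  # tree 3
])

per_sample = np.swapaxes(tree_preds, 0, 1)  # one row per sample, one column per tree
print(per_sample.shape)  # (4, 3)

# Majority vote across the trees, one sample at a time
final = np.array([Counter(row).most_common(1)[0][0] for row in per_sample])
print(final)  # [0 1 0 0]
```

For a 2-D array like this one, `np.swapaxes(a, 0, 1)` gives the same result as `a.T`.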

In [46]:
# Loading the dataset
data = datasets.load_breast_cancer()
X, y = data.data, data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)


# Initializing the random forest with all its parameters and their values
initialize_random_forest(_n_trees=100, _max_depth=10, _min_samples_split=2, _n_features=None)

# Here we build the random forest model using the training data
build_random_forest(X_train, y_train, _n_trees=100, _max_depth=10, _min_samples_split=2, _n_features=None)

# Making predictions on the test set using the random forest model
predictions_rf = predict_random_forest(X_test)

# Calculate the accuracy
print(f"Random Forest Accuracy: {100*accuracy(y_test, predictions_rf)} percent")

# Additionally, we calculate and print the F1 score
f1 = f1_score(y_test, predictions_rf, average='binary')  # Adjust 'average' as necessary
print(f"Random Forest F1 Score: {f1}")


Random Forest Accuracy: 93.85964912280701 percent
Random Forest F1 Score: 0.9496402877697843
