
In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

**Building a Random Forest using Decision Trees**

Here we will construct a Random Forest from individual Decision Trees. We will be using the moons dataset (generated with sklearn's make_moons()) with a sample size of 10,000 for this purpose. 90% of it will go into the train set and the remaining 10% into the test set.

Our implementation is going to be simple and will involve the following steps:

1. Generate the dataset using make_moons() and split it into train and test sets.

2. Draw 1,000 subsets from the train set, each containing 100 randomly selected instances. We will be using sklearn's ShuffleSplit class for this.

3. Train a decision tree on each subset. We will be setting max_leaf_nodes=15 and keeping all other hyperparameters at their defaults. To see why this is the best hyperparameter combination for the decision trees here, refer to the small notebook at https://bit.ly/43r9W9Q, which explains how to determine the best combination using a grid search.

4. For each test set instance, generate the predictions of all 1,000 Decision Trees and keep only the most frequent prediction (we use SciPy's mode() function for this). This gives us the majority-vote predictions over the test set. Our Random Forest model becomes ready at this stage.

5. Evaluate these majority-vote predictions over the test set and check the accuracy.

Now, let's get to work.

**Generating The Dataset**

First, we generate the dataset with a sample size of 10,000 and some noise (0.4, to be precise).

In [2]:
from sklearn.datasets import make_moons
moons_data=make_moons(n_samples=10000,noise=0.4)

In [3]:
moons_data

(array([[ 1.04310617,  0.86590428],
        [ 0.481051  ,  0.31546363],
        [-0.23872684,  0.18247249],
        ...,
        [ 0.71203957,  0.65880818],
        [ 1.16347422, -0.1454104 ],
        [ 0.13703999, -0.54424976]]),
 array([0, 1, 1, ..., 0, 1, 1]))

In [4]:
X=moons_data[0]
y=moons_data[1]

In [5]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=42,test_size=0.1)
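
As a quick sanity check on the split above, the sketch below regenerates a dataset of the same size and confirms the 9,000/1,000 division and the class balance. The `random_state=42` passed to make_moons() and the `*_demo` names are assumptions added here for reproducibility; the notebook's own call leaves the seed unset, which is why its results change from run to run.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# random_state on make_moons is an assumption added for reproducibility
X_demo, y_demo = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)

print(X_tr.shape, X_te.shape)   # sizes of the train and test sets
print(np.bincount(y_demo))      # number of instances per class
```

With test_size=0.1, 9,000 instances land in the train set and 1,000 in the test set, and make_moons() produces the two classes in equal numbers.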

**Splitting Train Set into 1000 Subsets**

Now, we draw 1,000 subsets from the train set, each containing 100 randomly selected instances, using ShuffleSplit.

In [6]:
from sklearn.model_selection import ShuffleSplit
# train_size=100 draws exactly 100 instances per subset
# (same as the fraction 0.0112, since floor(0.0112 * 9000) = 100)
ss = ShuffleSplit(n_splits=1000, train_size=100, random_state=42)
train_splits = []
for train_index, _ in ss.split(X_train):
    # index the NumPy arrays directly; avoids reusing the names x and y
    train_splits.append([X_train[train_index], y_train[train_index]])

# separate lists of feature subsets and label subsets
X_train_splits = [X_sub for X_sub, _ in train_splits]
y_train_splits = [y_sub for _, y_sub in train_splits]
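
The subset size can be checked directly. This is a minimal sketch on a dummy array of 9,000 rows (a stand-in, not the real train set): it suggests that the fraction 0.0112 and the integer 100 both give subsets of 100 instances, and that indices do not repeat within one subset.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

dummy = np.zeros((9000, 1))  # stand-in with the same number of rows as X_train

subset_sizes = []
for size in (0.0112, 100):
    splitter = ShuffleSplit(n_splits=3, train_size=size, random_state=42)
    for subset_idx, _ in splitter.split(dummy):
        # record (requested size, actual size, number of distinct indices)
        subset_sizes.append((size, len(subset_idx), len(np.unique(subset_idx))))

for row in subset_sizes:
    print(row)
```

Different subsets can still share instances, since 1,000 subsets of 100 are drawn from only 9,000 rows.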

**Training a decision tree on each subset**

Now, we perform a grid search and clone the best estimator obtained 1,000 times to build our forest. Then we fit each tree on one subset. This is step 3 from the list at the start.

In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
decision_tree_clf= DecisionTreeClassifier()
param_grid={'criterion':["gini","entropy","log_loss"],'splitter':["best","random"],
            'max_depth':[None,2,3,5,7,10,13,14],'min_samples_split':[2,3,4],
            'max_leaf_nodes':[5, 7, 10, 13, 15,16,17,18]}
grid_search = GridSearchCV(decision_tree_clf, param_grid,verbose=1, n_jobs=-1, cv=3)
grid_search.fit(X_train,y_train)
grid_search.best_estimator_

Fitting 3 folds for each of 1152 candidates, totalling 3456 fits
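
After fitting, the chosen settings can be read from the fitted search object. Below is a minimal sketch on a much smaller grid and a smaller, seeded dataset (both are assumptions made to keep it fast, so the winning values need not match the full search above).

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Smaller, seeded dataset and reduced grid: illustrative assumptions only
X_small, y_small = make_moons(n_samples=2000, noise=0.4, random_state=42)
small_grid = {"max_leaf_nodes": [5, 10, 15, 20], "min_samples_split": [2, 3, 4]}

search = GridSearchCV(DecisionTreeClassifier(random_state=42), small_grid, cv=3)
search.fit(X_small, y_small)

print(search.best_params_)           # dict of the winning hyperparameters
print(round(search.best_score_, 3))  # mean cross-validated accuracy of that combination
```

`best_estimator_` is the tree refitted on all of the data passed to fit() with those parameters, which is what gets cloned to build the forest.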


In [8]:
from sklearn.metrics import accuracy_score

In [9]:
from sklearn.base import clone
forest = [clone(grid_search.best_estimator_) for _ in range(1000)]
accuracy_scores = []
for tree, (X_mini_train, y_mini_train) in zip(forest, train_splits):
    tree.fit(X_mini_train, y_mini_train)
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))
np.mean(accuracy_scores)

0.808007

In [10]:
Y_pred = np.empty([1000, len(X_test)], dtype=np.uint8)
for tree_index, tree in enumerate(forest):
    Y_pred[tree_index] = tree.predict(X_test)

Now, we implement step 4: the majority vote.

In [11]:
from scipy.stats import mode
y_pred_majority_votes, n_votes = mode(Y_pred, axis=0,keepdims=True)
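
To see what the vote does, here is a tiny worked example with 3 trees and 4 test instances (the predictions are made up). For binary 0/1 labels, the most frequent prediction can also be computed with plain NumPy by thresholding the column mean; this is a swapped-in equivalent for illustration, not the mode() call used above.

```python
import numpy as np

# rows = trees, columns = test instances (made-up 0/1 predictions)
votes = np.array([
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
])

ones = votes.sum(axis=0)                              # votes for class 1 per instance
majority = (votes.mean(axis=0) > 0.5).astype(int)     # class with more than half the votes
n_votes = np.where(majority == 1, ones, len(votes) - ones)  # size of the winning side

print(majority)  # [1 1 0 0]
print(n_votes)   # [2 3 2 2]
```

Each column is one test instance, so taking the vote along axis=0 collapses the 3 tree predictions into a single forest prediction per instance.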

Our Random Forest Model is ready.

In [12]:
accuracy_score(y_test, y_pred_majority_votes.reshape([-1]))

0.874

We notice that the individual trees have an average accuracy of about 80.8%, but the forest has an accuracy of 87.4%. These figures can vary by up to ±2% depending on the random dataset generated.
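
The gain from voting can be illustrated with a small calculation. Assume, hypothetically, that each tree is right with probability p and that the trees err independently; the chance that a strict majority of n trees is right then follows from the binomial distribution. Real trees trained on overlapping subsets of the same data are correlated, which is one reason the forest above improves by a few points rather than approaching 100%.

```python
from math import comb

def majority_vote_accuracy(n_trees: int, p: float) -> float:
    """Probability that more than half of n_trees independent voters are correct."""
    needed = n_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_trees, k) * p**k * (1 - p) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

# p = 0.8 is an illustrative value close to the single-tree accuracy above
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.8), 4))
```

For n = 3 and p = 0.8 this gives 3(0.8²)(0.2) + 0.8³ = 0.896, already above the accuracy of a single voter.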

Now, let's see how it compares to the inbuilt RandomForestClassifier in sklearn. As we will see, the two perform almost equally well.

In [13]:
from sklearn.ensemble import RandomForestClassifier
rand_clf=RandomForestClassifier(n_estimators=1000,max_leaf_nodes=17,n_jobs=-1)
rand_clf.fit(X_train,y_train)

In [14]:
ypred2=rand_clf.predict(X_test)
accuracy_score(y_test,ypred2)

0.867

An easy alternative implementation of this notebook uses an ensemble learning technique called bagging. What we did above follows the same idea (strictly speaking, drawing each subset without replacement, as ShuffleSplit does, is known as pasting), but we can use the inbuilt BaggingClassifier in sklearn to save time.

In [15]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# 1,000 trees, each trained on 100 instances sampled with replacement
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=1000,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred3 = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred3)

0.874
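
For a closer match to the ShuffleSplit approach, where each subset is drawn without replacement, BaggingClassifier can be run with bootstrap=False. The sketch below uses a smaller seeded dataset and 100 trees to keep it quick; those sizes, the random_state values, and the use of max_leaf_nodes=15 (taken from the steps at the start) are choices made for this illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller, seeded dataset: an assumption to keep the sketch fast and repeatable
X_demo, y_demo = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)

paste_clf = BaggingClassifier(
    DecisionTreeClassifier(max_leaf_nodes=15),
    n_estimators=100,
    max_samples=100,   # 100 instances per tree, as in the manual forest
    bootstrap=False,   # sample without replacement
    random_state=42,
)
paste_clf.fit(X_tr, y_tr)
paste_acc = accuracy_score(y_te, paste_clf.predict(X_te))
print(round(paste_acc, 3))
```

With bootstrap=False no instance appears twice in a tree's training subset, which mirrors how the manual subsets were drawn.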