# Random Forest from Scratch

In this notebook, I cover the Random Forest algorithm from the ground up. It starts with a simple decision tree implementation and then works through the topics that build on it. The dataset is a NASA star type classification dataset, and the goal is for our tree to predict the correct star type from the rules it learns.

## Importing and Prepping Data

Temperature -- K <br>
L -- L/Lo<br>
R -- R/Ro<br>
A_M -- Mv (absolute magnitude)<br>
Color -- General Color of Spectrum<br>
Spectral_Class -- O,B,A,F,G,K,M / SMASS - https://en.wikipedia.org/wiki/Asteroid_spectral_types<br>
Type -- Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence, Super Giants, Hyper Giants<br>

In [3]:
import numpy as np
import pandas as pd

df = pd.read_csv("C:\\Users\\rogerree\\OneDrive - Merck Sharp & Dohme LLC\\Documents\\Personal Projects\\Data\\Stars.csv")
df.shape


In [320]:
df.dtypes

Temperature         int64
L                 float64
R                 float64
A_M               float64
Color              object
Spectral_Class     object
Type                int64
dtype: object

In [321]:
df.head()

Unnamed: 0,Temperature,L,R,A_M,Color,Spectral_Class,Type
0,3068,0.0024,0.17,16.12,Red,M,0
1,3042,0.0005,0.1542,16.6,Red,M,0
2,2600,0.0003,0.102,18.7,Red,M,0
3,2800,0.0002,0.16,16.65,Red,M,0
4,1939,0.000138,0.103,20.06,Red,M,0


In [322]:
# a decision tree / random forest can work with both categorical variables, such as color, and
# numerical attributes, but the categorical attributes need to be encoded before the data is
# passed into a model.

# for color, one of our categorical variables, we will use one-hot encoding, as there is no
# ordinal relationship between the colors (one color is not greater than another). We
# will do the same for spectral class.

df = pd.get_dummies(df, columns=['Color', 'Spectral_Class'], dtype=int)

# Move the "Type" column to the end
type_column = df.pop("Type")
df["Type"] = type_column

df.head()

Unnamed: 0,Temperature,L,R,A_M,Color_Blue,Color_Blue White,Color_Blue white,Color_Blue-White,Color_Blue-white,Color_Orange,...,Color_yellow-white,Color_yellowish,Spectral_Class_A,Spectral_Class_B,Spectral_Class_F,Spectral_Class_G,Spectral_Class_K,Spectral_Class_M,Spectral_Class_O,Type
0,3068,0.0024,0.17,16.12,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,3042,0.0005,0.1542,16.6,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,2600,0.0003,0.102,18.7,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,2800,0.0002,0.16,16.65,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,1939,0.000138,0.103,20.06,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
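
The column list above contains several spellings of what looks like the same color (`Color_Blue White`, `Color_Blue white`, `Color_Blue-White`, `Color_Blue-white`), so each spelling ends up as its own dummy column. One option is to normalize the labels before calling `pd.get_dummies`. The sketch below assumes that differences in case, hyphenation, and surrounding whitespace carry no meaning; `normalize_color` is a hypothetical helper, not part of the original notebook.

```python
def normalize_color(label: str) -> str:
    """Collapse case, hyphen, and whitespace variants of a color label.

    Assumption: 'Blue White', 'Blue white', 'Blue-White' and 'Blue-white'
    all refer to the same color.
    """
    # lowercase, turn hyphens into spaces, then collapse runs of whitespace
    words = label.lower().replace("-", " ").split()
    return " ".join(words)


variants = ["Blue White", "Blue white", "Blue-White", "Blue-white ", "Red"]
print(sorted({normalize_color(v) for v in variants}))
# prints ['blue white', 'red']
```

With pandas, this could be applied as `df["Color"] = df["Color"].map(normalize_color)` before the one-hot encoding step, which would likely reduce the number of `Color_*` columns.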


In [323]:
# scaling is not necessary for a decision tree: each split compares a single feature against a
# threshold, so rescaling a feature does not change which splits give the best reduction in
# entropy (the most information gain)

## Decision Trees

Based on how a decision tree works, our from-scratch implementation will consist of the following parts:

- Node class: represents a node of the decision tree. Each node has attributes such as a split dimension (the attribute this node splits on), a split point (the value of that split dimension that was settled on), and a label (the class prediction if the node is a leaf).

- Entropy Computation: computes the entropy of a set of labels

- Split Information: computes the split information (the weighted entropy of the two subsets) for a specific split point

- Fitting the Tree: builds the decision tree model recursively

- Classify: uses built model to predict labels of data
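
To make the entropy and split-information parts concrete, here is a small standalone sketch on plain Python lists. The names `entropy` and `weighted_split_entropy` are illustrative only and are not the method names used in the class below; the four-point dataset is made up for the example.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy, in bits, of a collection of class labels."""
    total = len(labels)
    info = 0.0
    for count in Counter(labels).values():
        probability = count / total
        info -= probability * math.log2(probability)
    return info


def weighted_split_entropy(values: List[float], labels: List[int], split_point: float) -> float:
    """Entropy of the two subsets produced by a threshold, weighted by subset size."""
    left = [y for x, y in zip(values, labels) if x <= split_point]
    right = [y for x, y in zip(values, labels) if x > split_point]
    n = len(labels)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)


values = [1.0, 2.0, 3.0, 4.0]   # one feature
labels = [0, 0, 1, 1]           # two classes, evenly mixed

parent = entropy(labels)                              # 1 bit: a 50/50 mix
after = weighted_split_entropy(values, labels, 2.5)   # both sides are pure
print(parent - after)  # information gain of the split; prints 1.0
```

Information gain is the parent entropy minus the weighted entropy after the split, so a split that leaves both sides pure recovers the full parent entropy.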

### Decision Tree from Scratch

In [309]:
from typing import List, Tuple
import math
from collections import Counter
import pandas as pd


class Node:
    def __init__(self, split_dim=None, split_point=None, label=None):
        self.split_dim = split_dim
        self.split_point = split_point
        self.label = label
        self.left = None
        self.right = None
        
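    def is_leaf(self):
        # internal nodes also carry a (majority) label, so a leaf is a node without children
        return self.left is None and self.right is None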
    
#_______________________________________________________________________________________________

class Solution:
    def __init__(self):
        self.root = None

    # computes the information (entropy) 
    def compute_info(self, labels: pd.Series) -> float:
        total_samples = len(labels)
        label_counts = labels.value_counts()
        info = 0.0

        # this is the key step: entropy is the negative sum, over all classes, of the class
        # probability times the log (base 2) of that same probability
        for count in label_counts.values:
            probability = count / total_samples
            info -= probability * math.log2(probability)

        return info

    # splits the data at split_point and returns the weighted entropy of the two subsets (lower is better)
    def split_info(self, data: pd.DataFrame, label: pd.Series, split_dim: int, split_point: float) -> float:
        data_left, label_left, data_right, label_right = self._split_data(data, label, split_dim, split_point)

        total_samples = len(data)
        total_left = len(data_left)
        total_right = len(data_right)

        p_left = total_left / total_samples
        p_right = total_right / total_samples

        info_left = self.compute_info(label_left)
        info_right = self.compute_info(label_right)
        # we then weigh the information we gather from each side's split, and return the sum of those 
        # as info_a
        info_a = (p_left * info_left) + (p_right * info_right)

        return info_a

#_______________________________________________________________________________________________

    # performs the actual split. The recursive build needs it, and split_info reuses it
    def _split_data(self, data: pd.DataFrame, labels: pd.Series, split_dim: int, split_point: float) -> Tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:

        # left and right are the two subsets created by the split: rows whose value in
        # split_dim is <= split_point go left, rows whose value is > split_point go right.
        # The same boolean masks select both the data rows and their labels
        column = data.iloc[:, split_dim]
        go_left = column <= split_point
        go_right = column > split_point

        return data[go_left], labels[go_left], data[go_right], labels[go_right]
    
    # information gain: the entropy of the parent minus the weighted entropy after the split
    # (split_info returns only the second term)
    def _calculate_info_gain(self, data: pd.DataFrame, labels: pd.Series, split_dim: int, split_point: float) -> float:
        info = self.compute_info(labels)
        info_after_split = self.split_info(data, labels, split_dim, split_point)
        return info - info_after_split
    
    # fit just calls recursive build
    def fit(self, train_data: pd.DataFrame, train_label: pd.Series) -> None:
        self.root = self._recursive_build(train_data, train_label)

    # builds the tree
    def _recursive_build(self, data: pd.DataFrame, labels: pd.Series, depth: int = 1) -> Node:
        label_counts = labels.value_counts()
        majority_label = label_counts.idxmax()
        
        # our stopping criteria: if the subset contains a single label (a pure node) or the depth exceeds 2, we return a leaf
        if len(label_counts) == 1 or depth > 2:
            return Node(label=majority_label, split_dim=-1, split_point=-1.0)

        num_features = data.shape[1]
        best_info_gain = float('-inf')
        best_split_dim = -1
        best_split_point = -1

        # find best information gain among split dimensions and split points
        for split_dim in range(num_features):
            split_points = self._calculate_split_points(data, split_dim)
            for split_point in split_points:
                info_gain = self._calculate_info_gain(data, labels, split_dim, split_point)
                if info_gain > best_info_gain:
                    best_info_gain = info_gain
                    best_split_dim = split_dim
                    best_split_point = split_point

        # set the node's properties to the split dimension, split point and label we found
        node = Node(split_dim=best_split_dim, split_point=best_split_point, label=majority_label)

        # now split again
        data_left, labels_left, data_right, labels_right = self._split_data(data, labels, best_split_dim, best_split_point)

        # recurse only if the split actually separated the data; otherwise this node has no children and acts as a leaf
        if not data_left.empty and not data_right.empty:
            node.left = self._recursive_build(data_left, labels_left, depth=depth + 1)
            node.right = self._recursive_build(data_right, labels_right, depth=depth + 1)

        return node

    # needed in order to calculate the potential split points. Takes in the data and a split dimension
    def _calculate_split_points(self, data: pd.DataFrame, split_dim: int) -> List[float]:

        # sorts the values of the selected dimension in ascending order (duplicates are kept, so
        # the midpoint between two equal values is just that value)
        attribute_values = sorted(data.iloc[:, split_dim])

        # grabs all midpoints between each adjacent pair of values from our dimension
        split_points = [(attribute_values[i] + attribute_values[i + 1]) / 2 for i in range(len(attribute_values) - 1)]

        # returns the list of all split points
        return split_points
    
#_______________________________________________________________________________________________

    # fits the tree on the training data, then predicts a label for each row of the test data
    def classify(self, train_data: pd.DataFrame, train_label: pd.Series, test_data: pd.DataFrame) -> List[int]:
        self.fit(train_data, train_label)
        predictions = []

        # for each data point in the test data, set the predicted label to the value derived by traversing the tree
        for _, data_point in test_data.iterrows():
            predicted_label = self._traverse_tree(data_point, self.root)
            predictions.append(predicted_label)

        return predictions

    # look into the tree
    def _traverse_tree(self, data_point: pd.Series, node: Node) -> int:
        # if we are at leaf node, return the label
        if node.left is None and node.right is None:
            return node.label
        # if the data point's value in the split dimension is <= the split point, go left.
        # split_dim is a column position, so use positional indexing
        if data_point.iloc[node.split_dim] <= node.split_point:
            return self._traverse_tree(data_point, node.left)
        else:
            # else, go right and call traverse again
            return self._traverse_tree(data_point, node.right)
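
`classify` above refits the tree on every call, which matches the handout-style tests but is wasteful when one trained tree is used for many predictions. A possible alternative is a helper that only walks an already-built tree. The sketch below is an assumption about how that could look, not part of the original class: `predict_one`, `predict`, and `make_node` are hypothetical names, rows are plain sequences, and the only requirement on a node is that it exposes `left`, `right`, `split_dim`, `split_point`, and `label`.

```python
from types import SimpleNamespace
from typing import Any, List, Sequence


def predict_one(node: Any, row: Sequence[float]) -> int:
    """Walk from `node` down to a leaf and return that leaf's label."""
    # a node with both children is an internal node; anything else is treated as a leaf
    while node.left is not None and node.right is not None:
        if row[node.split_dim] <= node.split_point:
            node = node.left
        else:
            node = node.right
    return node.label


def predict(root: Any, rows: Sequence[Sequence[float]]) -> List[int]:
    """Predict a label for every row without refitting the tree."""
    return [predict_one(root, row) for row in rows]


def make_node(label, split_dim=-1, split_point=-1.0, left=None, right=None):
    """Stand-in for the notebook's Node class, used only to keep this sketch self-contained."""
    return SimpleNamespace(label=label, split_dim=split_dim, split_point=split_point,
                           left=left, right=right)


# a hand-built stump: feature 0 <= 1.5 goes to label 1, everything else to label 3
root = make_node(label=1, split_dim=0, split_point=1.5,
                 left=make_node(1), right=make_node(3))
print(predict(root, [[1.0, 2.2], [4.5, 1.0]]))  # prints [1, 3]
```

With the class above, this would be called after `fit`, for example `predict(solution.root, test_data.to_numpy())`.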

### Testing

In [216]:
# Testing for Entropy and Split Info

def parse_input_dataframe(df):
    split_dim = 2  # dimension (column position) to split on; position 2 is the R column here
    split_point = 13  # threshold to evaluate for that column

    data = df.iloc[:15]  # take the first 15 records
    labels = df.iloc[:15]['Type']  # extract the 'Type' column as the labels

    return data, labels, split_dim, split_point

data, labels, split_dim, split_point = parse_input_dataframe(df)

solution = Solution()

# Call the split_info method with the parsed data
result = solution.split_info(data, labels, split_dim, split_point)
print(result)

0.9182958340544896


In [217]:
# Testing for the tree creation and fit

dataset = df.iloc[:1000, :-1]
labels = df.iloc[:1000, -1]

# Create an instance of the Solution class
solution = Solution()

# Build the decision tree using the fit function
solution.fit(dataset, labels)

def preorder_traversal(node):
    if node is None:
        return ""
    result = "{" + f"split_dim: {node.split_dim}, split_point: {node.split_point}, label: {node.label}" + "}"
    if node.left or node.right:
        result += "{" + preorder_traversal(node.left) + "}"
        result += "{" + preorder_traversal(node.right) + "}"
    return result

def inorder_traversal(node):
    if node is None:
        return ""
    result = ""
    if node.left or node.right:
        result += "{" + inorder_traversal(node.left) + "}"
    result += "{" + f"split_dim: {node.split_dim}, split_point: {node.split_point}, label: {node.label}" + "}"
    if node.left or node.right:
        result += "{" + inorder_traversal(node.right) + "}"
    return result

# Perform preorder traversal on the decision tree
preorder_result = preorder_traversal(solution.root)
print(preorder_result)

print()

# Perform inorder traversal on the decision tree
inorder_result = inorder_traversal(solution.root)
print(inorder_result)

{split_dim: 1, split_point: 0.07050000000000001, label: 0}{{split_dim: 0, split_point: 5396.0, label: 0}{{split_dim: -1, split_point: -1.0, label: 0}}{{split_dim: -1, split_point: -1.0, label: 2}}}{{split_dim: 2, split_point: 11.3, label: 3}{{split_dim: -1, split_point: -1.0, label: 3}}{{split_dim: -1, split_point: -1.0, label: 4}}}

{{{split_dim: -1, split_point: -1.0, label: 0}}{split_dim: 0, split_point: 5396.0, label: 0}{{split_dim: -1, split_point: -1.0, label: 2}}}{split_dim: 1, split_point: 0.07050000000000001, label: 0}{{{split_dim: -1, split_point: -1.0, label: 3}}{split_dim: 2, split_point: 11.3, label: 3}{{split_dim: -1, split_point: -1.0, label: 4}}}
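
The brace-delimited strings above are compact but become hard to read as the tree grows. One option is an indented printout with one node per line. This is only a sketch: `format_tree` is a hypothetical helper, and the `SimpleNamespace` nodes stand in for the notebook's `Node` objects, using the values of one subtree from the output above.

```python
from types import SimpleNamespace
from typing import Any, List


def format_tree(node: Any, depth: int = 0) -> List[str]:
    """Return one line per node in preorder, indented by depth."""
    if node is None:
        return []
    indent = "  " * depth
    if node.left is None and node.right is None:
        lines = [f"{indent}leaf: label={node.label}"]
    else:
        lines = [f"{indent}x[{node.split_dim}] <= {node.split_point} (majority label={node.label})"]
    # children are printed one level deeper, left before right
    lines += format_tree(node.left, depth + 1)
    lines += format_tree(node.right, depth + 1)
    return lines


def leaf(label):
    return SimpleNamespace(label=label, split_dim=-1, split_point=-1.0, left=None, right=None)


root = SimpleNamespace(label=0, split_dim=0, split_point=5396.0,
                       left=leaf(0), right=leaf(2))
print("\n".join(format_tree(root)))
# prints:
# x[0] <= 5396.0 (majority label=0)
#   leaf: label=0
#   leaf: label=2
```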


In [220]:
# another test for structure

import pandas as pd

data = [
    ['1', '1.0', '1.0'],
    ['1', '1.0', '2.0'],
    ['1', '2.0', '1.0'],
    ['3', '2.0', '2.0'],
    ['1', '3.0', '1.0'],
    ['3', '3.0', '2.0'],
    ['3', '3.0', '3.0'],
    ['3', '4.5', '3.0']
]

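# Creating the DataFrame (under its own name, so the star DataFrame `df` is not overwritten)
toy_df = pd.DataFrame(data, columns=['label', 'attribute 0', 'attribute 1'])

# testing

classifier = Solution()

# rows labelled '-1' are unlabelled test rows; this dataset has none, so test_data is empty
train_data = toy_df[toy_df['label'] != '-1'].drop('label', axis=1).astype(float)
train_label = toy_df[toy_df['label'] != '-1']['label'].astype(int)
test_data = toy_df[toy_df['label'] == '-1'].drop('label', axis=1).astype(float)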

predictions = classifier.classify(train_data, train_label, test_data)

# Perform preorder traversal on the decision tree
preorder_result = preorder_traversal(classifier.root)
print(preorder_result)

{split_dim: 1, split_point: 1.0, label: 1}{{split_dim: -1, split_point: -1.0, label: 1}}{{split_dim: 0, split_point: 1.5, label: 3}{{split_dim: -1, split_point: -1.0, label: 1}}{{split_dim: -1, split_point: -1.0, label: 3}}}


In [221]:
# test the traversal of the tree and classification

import pandas as pd

data = [
    ['1', '1.0', '1.0'],
    ['1', '1.0', '2.0'],
    ['1', '2.0', '1.0'],
    ['3', '2.0', '2.0'],
    ['1', '3.0', '1.0'],
    ['3', '3.0', '2.0'],
    ['3', '3.0', '3.0'],
    ['3', '4.5', '3.0'],
    ['-1', '1.0', '2.2'],
    ['-1', '4.5', '1.0']
]

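# Creating the DataFrame (under its own name, so the star DataFrame `df` is not overwritten)
toy_df = pd.DataFrame(data, columns=['label', 'attribute 0', 'attribute 1'])

# testing

classifier = Solution()

# rows labelled '-1' are the unlabelled test rows; all other rows are training data
train_data = toy_df[toy_df['label'] != '-1'].drop('label', axis=1).astype(float)
train_label = toy_df[toy_df['label'] != '-1']['label'].astype(int)
test_data = toy_df[toy_df['label'] == '-1'].drop('label', axis=1).astype(float)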

predictions = classifier.classify(train_data, train_label, test_data)

print(predictions)

preorder_result = preorder_traversal(classifier.root)
print(preorder_result)

print()

[1, 1]
{split_dim: 1, split_point: 1.0, label: 1}{{split_dim: -1, split_point: -1.0, label: 1}}{{split_dim: 0, split_point: 1.5, label: 3}{{split_dim: -1, split_point: -1.0, label: 1}}{{split_dim: -1, split_point: -1.0, label: 3}}}



In [222]:
import pandas as pd

# Read the data from the text file
with open(r"C:\Users\rogerree\OneDrive - Merck Sharp & Dohme LLC\Desktop\input01.txt", "r") as file:
    lines = file.read().splitlines()

# Convert the data into a list of lists
data = [line.split() for line in lines]

# Extract the label and remove the attribute prefixes from the data
data = [[row[0]] + [value.split(":")[1] for value in row[1:] if ":" in value] for row in data]

# Filter out rows with no values after removing prefixes
data = [row for row in data if len(row) > 1]

# Create a DataFrame from the data (under its own name, so the star DataFrame `df` is kept)
handout_df = pd.DataFrame(data)

classifier = Solution()

# Rename the columns in the DataFrame
handout_df.columns = ['label'] + [f'attr_{i}' for i in range(1, len(handout_df.columns))]

# Split the data into train and test (rows labelled '-1' are the unlabelled test rows)
train_data = handout_df[handout_df['label'] != '-1'].drop('label', axis=1).astype(float)
train_label = handout_df[handout_df['label'] != '-1']['label'].astype(int)
test_data = handout_df[handout_df['label'] == '-1'].drop('label', axis=1).astype(float)

# Classify using the test data
predictions = classifier.classify(train_data, train_label, test_data)
# Perform preorder traversal on the decision tree
preorder_result = preorder_traversal(classifier.root)
print(preorder_result)
print()
print(predictions)


{split_dim: 1, split_point: 2.076767332959929, label: 3}{{split_dim: 2, split_point: 2.0413842935532105, label: 2}{{split_dim: -1, split_point: -1.0, label: 2}}{{split_dim: -1, split_point: -1.0, label: 2}}}{{split_dim: 2, split_point: 3.041219385179334, label: 3}{{split_dim: -1, split_point: -1.0, label: 3}}{{split_dim: -1, split_point: -1.0, label: 2}}}

[2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3, 3, 2, 3, 3, 2, 2, 3, 3, 2, 2, 3, 2, 3, 2, 2, 2, 2, 3, 3, 3, 2, 3, 2, 2, 2, 2, 3, 2, 2, 3, 2, 3, 3, 3, 2, 3, 3, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 3, 2, 2, 3, 2, 2, 3, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 3, 2, 3, 3, 3, 3, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 3, 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 2, 3, 3, 2, 2, 3, 2, 3, 2, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 2, 3, 2, 2, 3, 2, 2, 2, 3, 2, 2, 3, 2, 3, 3, 2, 3, 2, 3, 2, 2, 2, 2, 3, 3, 3, 3, 2, 2, 3, 2, 3, 2, 2, 2, 2, 3

### Evaluation of Model and Cross-Validation

In [324]:
# first, with a basic 80/20 train/test split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Splitting the data into train and test DataFrames (80% train, 20% test)
train_df, test_df = train_test_split(df, test_size=0.2)

# setting training/testing data and labels; the label ("Type") is the last column
test_data = test_df.iloc[:, :-1]
test_labels = test_df.iloc[:, -1]

train_data = train_df.iloc[:, :-1]
train_labels = train_df.iloc[:, -1]

# Instantiating the decision tree model. classify() fits the tree on the training data
# itself, so no separate fit() call is needed
model = Solution()

# Generating predictions on the test set
predictions = model.classify(train_data, train_labels, test_data)

# Calculating the accuracy
accuracy = accuracy_score(test_labels, predictions)

# Printing the accuracy
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.5833
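
The heading mentions cross-validation, but the cell above evaluates a single random split, so the reported accuracy depends on which rows happen to land in the test set. A hedged sketch of k-fold cross-validation in plain Python follows. `k_fold_indices` and `cross_val_accuracy` are hypothetical helpers, `classify_fn` stands for any function with the `(train_data, train_labels, test_data) -> predictions` shape of `Solution.classify`, and the majority-class baseline and toy data exist only to keep the example self-contained.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle row positions once, then return (train_idx, test_idx) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


def cross_val_accuracy(classify_fn: Callable, data: Sequence, labels: Sequence,
                       k: int = 5, seed: int = 0) -> float:
    """Mean accuracy over k folds; each row is used for testing exactly once."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(data), k, seed):
        train_x = [data[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        test_x = [data[i] for i in test_idx]
        test_y = [labels[i] for i in test_idx]
        predictions = classify_fn(train_x, train_y, test_x)
        correct = sum(p == y for p, y in zip(predictions, test_y))
        scores.append(correct / len(test_y))
    return sum(scores) / len(scores)


def majority_baseline(train_x, train_y, test_x):
    """Stand-in classifier: always predicts the most common training label."""
    majority = Counter(train_y).most_common(1)[0][0]
    return [majority] * len(test_x)


data = [[float(i)] for i in range(20)]
labels = [0] * 14 + [1] * 6
print(cross_val_accuracy(majority_baseline, data, labels, k=5))  # prints 0.7
```

For the notebook's `Solution`, which expects pandas objects, the index lists could instead be passed to `.iloc` (for example `df.iloc[train_idx]`) before calling `classify`.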


### Pruning and Overfitting

## Random Forests

### Random Forest as an Ensemble of Decision Trees

### Random Subspace Method for Feature Selection

### Bootstrapping 

### Hyperparameter Tuning

## Evaluation of Models and Cross-validation

## Variations and Improvements

### Feature Importance Estimation with Random Forest

### Handling Missing Data and Outliers

### Extending Random Forest to Handle Imbalanced Datasets

### Multiclass Classification and Regression Problems

## Coursework for Random Forest

In [193]:
# Classification Test

def parse_classification_data(file_path):
    with open(file_path, 'r') as file:
        train_data = []
        train_label = []
        test_data = []
        
        for line in file:
            line = line.strip()
            parts = line.split()
            label = int(parts[0])
            attributes = [float(attr.split(':')[1]) for attr in parts[1:]]
            
            if label != -1:
                train_data.append(attributes)
                train_label.append(label)
            else:
                test_data.append(attributes)
    
    return train_data, train_label, test_data

input_file = r"C:\Users\rogerree\Downloads\01mSTLanT0igEUUS8kDmDQ_2834ea673e8949dea706a3da771f23f1_PA-DT-Handout\dt_handout\sample_test_cases\classification\input01.txt"
train_data, train_label, test_data = parse_classification_data(input_file)

# create the classifier before using it
solution = Solution()

# Call the classify method with the extracted data. The Solution class above works on
# pandas objects, so the parsed lists are wrapped in a DataFrame / Series first
result = solution.classify(pd.DataFrame(train_data), pd.Series(train_label), pd.DataFrame(test_data))
print(result)

[2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3, 3, 2, 3, 3, 2, 2, 3, 3, 2, 2, 3, 2, 3, 2, 2, 2, 2, 3, 3, 3, 2, 3, 2, 2, 2, 2, 3, 2, 2, 3, 2, 3, 3, 3, 2, 3, 3, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 3, 2, 2, 3, 2, 2, 3, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 3, 2, 3, 3, 3, 3, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 3, 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 2, 3, 3, 2, 2, 3, 2, 3, 2, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 2, 3, 2, 2, 3, 2, 2, 2, 3, 2, 2, 3, 2, 3, 3, 2, 3, 2, 3, 2, 2, 2, 2, 3, 3, 3, 3, 2, 2, 3, 2, 3, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3, 3]
