# Random Forest Regressor from Scratch
***
## Table of Contents
1. [Introduction](#1-introduction)
***

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from typing import Tuple, List, Dict, Optional, Any
from numpy.typing import NDArray
from joblib import Parallel, delayed

## 1. Introduction
This notebook is an extension of [Decision Tree Regressor from Scratch](https://github.com/tsu76i/DS-playground/blob/main/2.%20Building%20ML%20Models%20From%20Scratch/2.3%20CART/decision_tree_regressor.ipynb).

Random forests are an ensemble learning technique that combines multiple decision trees, each trained on a bootstrap sample of the data (drawn with replacement) and restricted to a random subset of features at each split. The final prediction is made by aggregating the results of all trees (**majority vote** for classification, **average** for regression). Compared to a single decision tree, this approach typically provides better accuracy, reduced overfitting and more stable predictions, at the cost of increased computational complexity and reduced interpretability. The method introduces two key randomisation techniques during training:

1. **Bootstrap Sampling**: Each tree is trained on a bootstrapped dataset, which is a random sample of the original dataset created *with replacement*. This ensures diversity among the trees.
2. **Feature Randomisation**: At each split in a tree, a random subset of features is considered rather than evaluating all features. This prevents dominant features from appearing in every tree and further promotes diversity.
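
A standard result makes the benefit of diversity concrete. As an idealisation, assume each of the $B$ trees produces a prediction $T_b(x)$ with the same variance $\sigma^2$, and that any two trees have pairwise correlation $\rho$ (these symbols are introduced here purely for illustration). The variance of the averaged prediction is then:

\begin{align*}
\mathrm{Var}\left(\dfrac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\sigma^{2} + \dfrac{1-\rho}{B}\sigma^{2}
\end{align*}

The second term shrinks as more trees are added, while the first depends only on how correlated the trees are. Bootstrap sampling and feature randomisation both aim to lower that correlation.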

## 2. Loading Data
Retrieved from [GitHub - YBI Foundation](https://github.com/YBI-Foundation/Dataset/blob/main/Admission%20Chance.csv)

In [2]:
df = pd.read_csv(
    'https://raw.githubusercontent.com/YBI-Foundation/Dataset/refs/heads/main/Admission%20Chance.csv')
df.head()

Unnamed: 0,Serial No,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [3]:
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
feature_names = df.columns[:-1].tolist()  # All columns except the last one

# Check the shape of the data
print(f'Features shape: {X.shape}')
print(f'Target shape: {y.shape}')
print(f'Features: \n{feature_names}')

Features shape: (400, 8)
Target shape: (400,)
Features: 
['Serial No', 'GRE Score', 'TOEFL Score', 'University Rating', ' SOP', 'LOR ', 'CGPA', 'Research']


## 3. Train Test Split
Train test split is a fundamental model validation technique in machine learning. It divides a dataset into two separate portions: a **training set** used to train a model, and a **testing set** used to evaluate how well the model can perform on unseen data. 

The typical split ratio is 80% for training and 20% for testing, though this can vary (70/30 or 90/10 are also common). The key principle is that the test set must remain completely separate during the model training process, and should never be used to make decisions about the model or to tune parameters.

The split is usually done randomly to ensure both sets are representative of the overall dataset, and many libraries (such as scikit-learn) provide built-in functions that handle this process automatically while maintaining proper randomisation.


In [4]:
def train_test_split(X: pd.DataFrame, y: pd.Series, test_size: float = 0.2,
                     random_state: Optional[int] = None) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """
    Split arrays or matrices into random train and test subsets.

    Args:
        X: Input features, a 2D array with rows (samples) and columns (features).
        y: Target values/labels, a 1D array with rows (samples).
        test_size: Proportion of the dataset to include in the test split. Must be between 0.0 and 1.0. default = 0.2
        random_state: Seed for the random number generator to ensure reproducible results. default = None

    Returns:
        A tuple containing:
            - X_train: Training set features.
            - X_test: Testing set features.
            - y_train: Training set target values.
            - y_test: Testing set target values.
    """
    # Seed the random number generator if a seed is given (0 is a valid seed)
    if random_state is not None:
        np.random.seed(random_state)

    # Create an array of indices from 0 to len(X) - 1
    indices = np.arange(len(X))

    # Shuffle the indices
    np.random.shuffle(indices)

    # Define the size of our test data from len(X)
    test_size = int(test_size * len(X))

    # Generate indices for test and train data
    test_indices: NDArray[np.int64] = indices[:test_size]
    train_indices: NDArray[np.int64] = indices[test_size:]

    # Return: X_train, X_test, y_train, y_test
    return X.iloc[train_indices], X.iloc[test_indices], y.iloc[train_indices], y.iloc[test_indices]
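
As a quick sanity check of the shuffle-and-slice idea, the sketch below repeats the same steps on NumPy arrays. The data is synthetic, the `*_demo` names are hypothetical, and it uses `np.random.default_rng` rather than the global seed used above, so treat it as an illustration rather than a drop-in replacement.

```python
import numpy as np

# Hypothetical toy data: 400 samples, 3 features (not the admission dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 3))
y_demo = rng.normal(size=400)

# Shuffle the row indices once, then slice them into test and train parts.
indices = rng.permutation(len(X_demo))
n_test = int(0.2 * len(X_demo))
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train_demo, X_test_demo = X_demo[train_idx], X_demo[test_idx]
y_train_demo, y_test_demo = y_demo[train_idx], y_demo[test_idx]

print(X_train_demo.shape, X_test_demo.shape)  # (320, 3) (80, 3)
# No row index appears in both subsets.
print(len(np.intersect1d(train_idx, test_idx)))  # 0
```

Because the test indices are taken from the front of a single permutation and the training indices from the rest, the two subsets cannot overlap.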

## 4. Loss Functions for Regression
### Variance

In [5]:
def variance(y: pd.Series) -> float:
    return np.var(y)

### Mean Squared Error (MSE) as a Loss Function

In [6]:
def mse(y: pd.Series) -> float:
    mean = np.mean(y)
    return np.mean((y - mean) ** 2)

In [7]:
print(f"Variance: {variance(y):.5f}")
print(f"MSE: {mse(y):.5f}")

Variance: 0.02029
MSE: 0.02029
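
The two values coincide because `mse` measures the squared error around the sample mean, which is the population variance that `np.var` computes with its default `ddof=0`:

\begin{align*}
\mathrm{Var}(y) = \dfrac{1}{n}\sum_{i=1}^{n}(y_{i} - \bar y)^{2}, \quad \bar y = \dfrac{1}{n}\sum_{i=1}^{n} y_{i}
\end{align*}

In other words, the node-level MSE used here is the error of predicting $\bar y$ for every sample in the node.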


## 5. Information Gain
Information Gain measures how effectively a split on a feature separates a dataset into subsets that are more homogeneous with respect to the target variable. For regression, it quantifies the reduction in variance (or MSE); a higher information gain indicates a better split.

\begin{align*}
IG(S, A) = H(S) - \sum_{i=1}^{n} \dfrac{|S_i|}{|S|}H(S_{i})
\end{align*}

where:
- $H(S)$: Variance (or MSE) of the original dataset $S$.
- $S_{i}$: Subset of $S$ created by splitting on feature $A$ for the $i$-th value or range of the feature.
- $\dfrac{|S_i|}{|S|}$: Proportion of samples in subset $S_{i}$.
- $H(S_{i})$: Variance (or MSE) of subset $S_{i}$.



The following `information_gain` function calculates the difference between the metric for the parent node and the weighted average of the metrics for the child nodes (left and right splits).

In [8]:
def information_gain(y: pd.Series, y_left: pd.Series, y_right: pd.Series,
                     metric: str = 'variance') -> float:
    """
    Calculate the information gain for regression.

    Args:
        y: Target variables of the parent node.
        y_left: Target variables of the left child node after the split.
        y_right: Target variables of the right child node after the split.
        metric: Splitting criterion, either 'variance' or 'mse'. Defaults to 'variance'.

    Returns:
        Information gain resulting from the split.
    """
    if metric == 'variance':
        parent_metric = variance(y)
        left_metric = variance(y_left)
        right_metric = variance(y_right)
    else:  # metric == "mse"
        parent_metric = mse(y)
        left_metric = mse(y_left)
        right_metric = mse(y_right)

    weighted_metric = (
        len(y_left) / len(y) * left_metric
        + len(y_right) / len(y) * right_metric
    )
    return parent_metric - weighted_metric
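
To make the formula concrete, here is a small worked example on made-up targets. It recomputes the gain directly with `np.var` rather than calling the function above, and the variable names are illustrative only.

```python
import numpy as np

# Toy targets: two well-separated groups (illustrative values only).
y_parent = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_left, y_right = y_parent[:3], y_parent[3:]

parent_var = np.var(y_parent)
# Weight each child's variance by its share of the parent's samples.
weighted_child_var = (
    len(y_left) / len(y_parent) * np.var(y_left)
    + len(y_right) / len(y_parent) * np.var(y_right)
)
gain = parent_var - weighted_child_var

print(f"parent variance: {parent_var:.4f}")                  # 20.9167
print(f"weighted child variance: {weighted_child_var:.4f}")  # 0.6667
print(f"information gain: {gain:.4f}")                       # 20.2500
```

Almost all of the parent's variance comes from the gap between the two groups, so a split that separates them removes most of it.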

## 6. Bootstrapping
Bootstrapping is a statistical resampling method that involves sampling data points with replacement. In creating a new dataset (**bootstrap sample**) from the original dataset, some data points may appear multiple times, while others may be excluded. Though individual data points may repeat, the size of bootstrap sample $n$ is typically the same as the original dataset. This method ensures variability among datasets, which helps reduce overfitting when used in ensemble learning.

For a dataset with $n$ examples, each sample has a $1 - \left( 1 - \dfrac{1}{n} \right)^{n}$ chance of being selected at least once in the bootstrap sample. As $n$ becomes large, this value approaches $1-\text{e}^{-1} \approx 0.632$. Hence, about 63.2% of the original dataset is expected to appear in any given bootstrap sample.
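
The 63.2% figure can be checked empirically. The sketch below draws one bootstrap sample of indices and compares the share of distinct rows with the formula; the seed and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Draw n indices with replacement and count how many distinct rows were hit.
indices = rng.integers(0, n, size=n)
unique_fraction = len(np.unique(indices)) / n

theoretical = 1 - (1 - 1 / n) ** n
print(f"empirical:   {unique_fraction:.3f}")
print(f"theoretical: {theoretical:.3f}")
```

The remaining rows (roughly 36.8%) are never drawn for that tree and are often called out-of-bag samples.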

In [9]:
def bootstrap_sample(X: pd.DataFrame, y: pd.Series, n_samples: Optional[int] = None,
                     random_state: Optional[int] = None) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Generate a bootstrap sample from the dataset.

    Args:
        X: Input features.
        y: Target labels.
        n_samples: Samples to draw (default: dataset size).
        random_state: Random seed.

    Returns:
        Bootstrapped (X, y) tuple.
    """
    if random_state is not None:
        np.random.seed(random_state)
    if n_samples is None:
        n_samples = len(X)
    indices = np.random.randint(0, len(X), size=n_samples)
    return X.iloc[indices], y.iloc[indices]

In [10]:
bootstrap_sample(X, y, random_state=42)[0][:10]  # X

Unnamed: 0,Serial No,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research
102,103,314,106,2,4.0,3.5,8.25,0
348,349,302,99,1,2.0,2.0,7.25,0
270,271,306,105,2,2.5,3.0,8.22,1
106,107,329,111,4,4.5,4.5,9.18,1
71,72,336,112,5,5.0,5.0,9.76,1
188,189,331,115,5,4.5,3.5,9.36,1
20,21,312,107,3,3.0,2.0,7.9,1
102,103,314,106,2,4.0,3.5,8.25,0
121,122,334,119,5,4.5,4.5,9.48,1
214,215,331,117,4,4.5,5.0,9.42,1


In [11]:
bootstrap_sample(X, y, random_state=42)[1][:10]

102    0.62
348    0.57
270    0.72
106    0.87
71     0.96
188    0.93
20     0.64
102    0.62
121    0.94
214    0.94
Name: Chance of Admit , dtype: float64

## 7. Identifying the Best Split
This function identifies the best feature and threshold to split the data using the specified metric (Variance or MSE).

Steps are:

1. Randomly select a subset of features (`sqrt(n_features)` is a common choice for classification; this notebook defaults to `int(log2(n_features))`).

2. For each selected feature, iterate over all unique thresholds.

3. Split the data into left and right subsets based on the threshold (skip invalid ones).

4. Compute the variance/MSE for both subsets and calculate Information Gain.

5. If the newly computed `info_gain` > `best_info_gain`, update `best_info_gain` and store the corresponding feature and threshold as the best split.
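
Steps 2 to 5 can be sketched for a single feature as follows. The data is made up, and the loop is a simplified stand-in for the full search over several randomly chosen features.

```python
import numpy as np

# Toy single-feature data (illustrative): the target jumps between x = 3 and x = 4.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_toy = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])

best_gain, best_threshold = float("-inf"), None
for threshold in np.unique(x):
    left, right = y_toy[x <= threshold], y_toy[x > threshold]
    if len(left) == 0 or len(right) == 0:
        continue  # skip splits that leave one side empty
    weighted = (len(left) * np.var(left) + len(right) * np.var(right)) / len(y_toy)
    gain = np.var(y_toy) - weighted
    if gain > best_gain:
        best_gain, best_threshold = gain, float(threshold)

print(best_threshold)  # 3.0
```

The largest unique value is always skipped, because the condition `x > threshold` leaves the right side empty.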

In [12]:
def best_split(X: pd.DataFrame, y: pd.Series, metric: str = 'variance', feature_names=None, max_features=None) -> Dict[str, Any]:
    """
    Find the best split for a dataset for regression.

    Args:
        X: Input features.
        y: Target variables.
        metric: Splitting criterion, either 'variance' or 'mse'. Defaults to 'variance'.
        feature_names: List of feature names. If None, indices are used. Defaults to None.
        max_features: Number of features to consider at each split. None(log2(total_n_features)) or int(<=total_n_features). Defaults to None.
    Returns:
        Dictionary containing the best split with keys:
            - 'feature_index': Index of the feature used for the split.
            - 'feature_name': Name or index of the feature.
            - 'threshold' : Threshold value for the split.
    """
    if feature_names is None and hasattr(X, 'columns'):
        feature_names = X.columns.tolist()

    # Convert X if DataFrame
    if hasattr(X, 'to_numpy'):
        X = X.to_numpy()

    best_info_gain = float('-inf')
    best_split = None
    total_n_features = X.shape[1]

    if isinstance(max_features, int):  # if max_features is int
        selected_n_features = max_features if max_features <= total_n_features else total_n_features
    else:  # Default = log2(total_n_features), but always at least one feature
        selected_n_features = max(1, int(np.log2(total_n_features)))

    selected_features_idx = np.random.choice(
        a=total_n_features, size=selected_n_features, replace=False)

    # Iterate over randomly selected features.
    for feature in selected_features_idx:
        # Iterate over all unique thresholds for each random feature.
        thresholds = np.unique(X[:, feature])
        for threshold in thresholds:
            # Split the data into left and right subsets based on the threshold.
            left_mask = X[:, feature] <= threshold
            right_mask = X[:, feature] > threshold

            # Skip invalid splits.
            if sum(left_mask) == 0 or sum(right_mask) == 0:
                continue

            # Compute IG.
            info_gain = information_gain(
                y, y[left_mask], y[right_mask], metric)

            # Update `best_info_gain` if `info_gain` > `best_info_gain`.
            if info_gain > best_info_gain:
                best_info_gain = info_gain
                best_split = {
                    'feature_index': int(feature),
                    'feature_name': feature_names[feature] if feature_names is not None else feature,
                    'threshold': float(threshold),
                }

    return best_split

In [13]:
split = best_split(X, y, metric='variance', feature_names=feature_names)
print('Best Split:', split)

Best Split: {'feature_index': 3, 'feature_name': 'University Rating', 'threshold': 3.0}


## 8. Building the Decision Tree
This function recursively creates the tree structure as a nested dictionary with conditions (`feature` and `threshold`) and leaf nodes.

In [14]:
def build_tree(X: pd.DataFrame, y: pd.Series, max_depth: int = None,
               depth: int = 0, metric: str = 'variance', feature_names: List[str | int] = None, max_features=None) -> Dict[
        str, Any]:
    """
    Build a decision tree using recursive splitting.

    Args:
        X: Input features(pd.DataFrame).
        y: Labels (pd.Series).
        max_depth: Maximum depth of the tree. Defaults to None (unlimited depth).
        depth: Current depth of the tree. Used internally for recursion. Defaults to 0.
        metric: Splitting criterion, either 'variance' or 'mse'. Defaults to 'variance'.
        feature_names: List of feature names. If None, indices are used. Defaults to None.
        max_features: Number of features to consider at each split. None(log2(total_n_features)) or int(<=total_n_features). Defaults to None.

    Returns:
        - Nested dictionary representing the tree structure.
        - Nodes contain keys: 'type', 'feature', 'threshold', 'left', 'right'.
        - Leaf nodes contain keys: 'type', 'value'.
    """

    # Convert DataFrames to NumPy arrays
    if hasattr(X, 'to_numpy'):
        X = X.to_numpy()
    if hasattr(y, 'to_numpy'):
        y = y.to_numpy().flatten()  # Ensure 1D array

    # Stop the recursion if all labels are identical or the maximum depth is reached.
    if len(set(y)) == 1 or (max_depth is not None and depth == max_depth):
        return {'type': 'leaf', 'value': np.mean(y)}

    # Find the best split.
    split = best_split(X, y, metric, feature_names, max_features)
    if not split:
        return {'type': 'leaf', 'value': np.mean(y)}

    # Split the data into left and right subsets.
    # Use feature_index for calculations.
    left_mask = X[:, split['feature_index']] <= split['threshold']
    right_mask = X[:, split['feature_index']] > split['threshold']

    # Recursively build the left and right subtrees.
    left_tree = build_tree(X[left_mask], y[left_mask],
                           max_depth, depth + 1, metric, feature_names, max_features)
    right_tree = build_tree(X[right_mask], y[right_mask],
                            max_depth, depth + 1, metric, feature_names, max_features)

    # Return the tree structure as a nested dictionary.
    return {
        'type': 'node',
        'feature': split['feature_name'],
        'threshold': split['threshold'],
        'left': left_tree,
        'right': right_tree,
    }
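
For orientation, the sketch below shows what such a nested dictionary might look like for a depth-2 tree. The feature names, thresholds and leaf values are invented, and `tree_depth` and `count_leaves` are hypothetical helpers that are not part of the notebook.

```python
# Hand-written example of the dictionary layout (all values are illustrative).
example_tree = {
    'type': 'node',
    'feature': 'CGPA',
    'threshold': 8.5,
    'left': {'type': 'leaf', 'value': 0.62},
    'right': {
        'type': 'node',
        'feature': 'GRE Score',
        'threshold': 320.0,
        'left': {'type': 'leaf', 'value': 0.78},
        'right': {'type': 'leaf', 'value': 0.91},
    },
}


def tree_depth(tree):
    """Return the number of split levels on the longest root-to-leaf path."""
    if tree['type'] == 'leaf':
        return 0
    return 1 + max(tree_depth(tree['left']), tree_depth(tree['right']))


def count_leaves(tree):
    """Return the number of leaf nodes in the tree."""
    if tree['type'] == 'leaf':
        return 1
    return count_leaves(tree['left']) + count_leaves(tree['right'])


print(tree_depth(example_tree), count_leaves(example_tree))  # 2 3
```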

## 9. Building Random Forest
We now create a `build_random_forest` function that repeats bootstrap sampling and tree building `n_estimators` times. To reduce execution time, the trees are built in parallel (using all CPU cores by default).

In [15]:
def build_random_forest(X_train: pd.DataFrame, y_train: pd.Series, n_estimators: int,
                        n_jobs: int = -1, max_depth: int = 15) -> List[Dict[str, Any]]:
    """
    Optimised random forest builder using parallel processing

    Args:
        X_train: Training features
        y_train: Training labels
        n_estimators: Number of trees
        n_jobs: Number of CPU cores to use (-1 = all cores)
        max_depth: Maximum tree depth

    Returns:
        List of decision trees
    """
    # Build single tree
    def _build_single_tree(i):
        X_boot, y_boot = bootstrap_sample(X_train, y_train, random_state=i)
        # Take the feature names from the training data rather than a global variable.
        return build_tree(X_boot, y_boot, max_depth=max_depth,
                          metric='variance', feature_names=X_train.columns.tolist())

    # Parallel execution
    forest = Parallel(n_jobs=n_jobs)(
        delayed(_build_single_tree)(i)
        for i in range(n_estimators)
    )

    return forest
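
When trees are built in separate worker processes, state set through the global `np.random.seed` in the parent process is not necessarily shared with the workers. One possible alternative, shown here as a sketch and not as the approach used in this notebook, is to derive an independent generator per tree from a single master seed with `np.random.SeedSequence`. The sizes below are arbitrary.

```python
import numpy as np

n_estimators = 4  # small number for illustration

# Spawn one independent child seed per tree from a single master seed.
child_seeds = np.random.SeedSequence(42).spawn(n_estimators)
rngs = [np.random.default_rng(seed) for seed in child_seeds]

# Each generator would drive one tree's bootstrap indices and feature subsets.
n_rows, n_features = 10, 8
draws = []
for tree_id, rng in enumerate(rngs):
    boot_idx = rng.integers(0, n_rows, size=n_rows)
    feat_idx = rng.choice(n_features, size=3, replace=False)
    draws.append((boot_idx, feat_idx))
    print(tree_id, boot_idx, feat_idx)
```

Re-running the block with the same master seed reproduces the same draws for every tree, regardless of the order in which the trees are built.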

## 10. Traversing the Tree for Prediction
This function traverses the tree to make predictions by following the tree from the root to a leaf node.

In [16]:
def traverse_tree(x: pd.DataFrame, tree: Dict[str, Any],
                  feature_names: List[str | int] = None, max_features=None) -> float:
    """
    Traverse a decision tree to make a prediction for a single sample.

    Args:
        x: Single sample.
        tree: Decision tree structure.
        feature_names: List of feature names. Needed for name-to-index mapping. Defaults to None.
        max_features: Not used during traversal; kept for signature compatibility. Defaults to None.

    Returns:
        Predicted value stored in the reached leaf.
    """
    if tree['type'] == 'leaf':
        return tree['value']

    # Resolve feature index if feature_names is provided
    feature_index = feature_names.index(
        tree['feature']) if feature_names is not None else tree['feature']

    if x[feature_index] <= tree['threshold']:
        return traverse_tree(x, tree['left'], feature_names, max_features)
    else:
        return traverse_tree(x, tree['right'], feature_names, max_features)
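
The same root-to-leaf walk can be written as a loop. The sketch below is an iterative variant on a hand-written toy tree that uses integer feature indices; `traverse_iteratively` is a hypothetical name and the values are illustrative.

```python
import numpy as np

# Toy tree with integer feature indices (illustrative values only).
toy_tree = {
    'type': 'node', 'feature': 0, 'threshold': 8.5,
    'left': {'type': 'leaf', 'value': 0.62},
    'right': {
        'type': 'node', 'feature': 1, 'threshold': 320.0,
        'left': {'type': 'leaf', 'value': 0.78},
        'right': {'type': 'leaf', 'value': 0.91},
    },
}


def traverse_iteratively(x, tree):
    """Follow the split conditions from the root until a leaf is reached."""
    node = tree
    while node['type'] != 'leaf':
        branch = 'left' if x[node['feature']] <= node['threshold'] else 'right'
        node = node[branch]
    return node['value']


print(traverse_iteratively(np.array([8.0, 330.0]), toy_tree))  # 0.62
print(traverse_iteratively(np.array([9.0, 310.0]), toy_tree))  # 0.78
print(traverse_iteratively(np.array([9.0, 330.0]), toy_tree))  # 0.91
```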

## 11. Predictions
This function predicts target values for all samples in the dataset.

In [17]:
def predict(X: pd.DataFrame, tree: Dict[str, Any],
            feature_names: List[str | int] = None) -> float | NDArray[np.float64]:
    """
    Predict target values for the given dataset using a decision tree regressor.

    Args:
        X: Input features.
        tree: Decision tree structure.
        feature_names : List of feature names. Needed for name-to-index mapping. Defaults to None.

    Returns:
        Predicted labels (1D array for multiple samples or a single label for one sample).
    """
    # Convert DataFrames to NumPy arrays
    if hasattr(X, 'to_numpy'):
        X = X.to_numpy()

    if len(X.shape) == 1:  # If a single sample is provided
        return traverse_tree(X, tree, feature_names)
    return np.array([traverse_tree(x, tree, feature_names) for x in X])

After each of the `n_estimators` trees has made its predictions, we take the average across trees to determine the final prediction.

In [18]:
def predict_average(forest: List[Dict[str, Any]], X: pd.DataFrame,
                    feature_names: List[str] = None) -> List[float]:
    all_preds = []
    for tree in forest:
        preds = predict(X, tree, feature_names)
        all_preds.append(preds)
    all_preds = np.array(all_preds)
    all_means = np.mean(all_preds, axis=0)
    return all_means
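
The aggregation step is a column-wise mean over a (trees × samples) array. A minimal illustration with made-up predictions from three trees for two samples:

```python
import numpy as np

# Rows = trees, columns = samples (made-up predictions for illustration).
all_preds = np.array([
    [0.70, 0.50],
    [0.80, 0.60],
    [0.90, 0.40],
])
forest_pred = all_preds.mean(axis=0)  # average over trees, one value per sample
print(forest_pred.shape)         # (2,)
print(np.round(forest_pred, 2))  # [0.8 0.5]
```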

## 12. Evaluation Metrics
### Mean Squared Error (MSE)
Mean Squared Error measures the average squared difference between predicted ($\hat y$) and actual ($y$) values. Large errors are penalised heavily. Smaller MSE indicates better predictions.

\begin{align*}
MSE = \dfrac{1}{n} \sum_{i=1}^{n}(\hat y_{i} - y_{i})^2
\end{align*}

In [19]:
def calculate_MSE(y_true: pd.Series, y_pred: NDArray[np.float64]) -> float:
    return np.mean((y_true - y_pred) ** 2)

### Root Mean Squared Error (RMSE)
RMSE is the square root of MSE. It expresses the error in the same unit as the target variable ($y$), which makes it easier to interpret.

\begin{align*}
RMSE = \sqrt{(MSE)}
\end{align*}

In [20]:
def calculate_RMSE(y_true: pd.Series, y_pred: NDArray[np.float64]) -> float:
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

### Mean Absolute Error (MAE)
Mean Absolute Error measures the average absolute difference between predicted ($\hat y$) and actual ($y$) values. It is less sensitive to outliers than MSE. Smaller MAE indicates better predictions.

\begin{align*}
MAE = \dfrac{1}{n} \sum_{i=1}^{n}|\hat y_{i} - y_{i}|
\end{align*}

In [21]:
def calculate_MAE(y_true: pd.Series, y_pred: NDArray[np.float64]) -> float:
    return np.mean(np.abs(y_true - y_pred))

<a id="r-squared"></a>
### R-Squared($R^2$)

R-squared indicates the proportion of variance in the dependent variable that is predictable from the independent variables. It is at most 1, with values closer to 1 indicating a better fit; it can be negative when the model performs worse than always predicting the mean of $y$.



Residual Sum of Squares ($SS_{residual}$): 
\begin{align*}
SS_{residual} = \sum_{i=1}^{n} (y_{i} - \hat y_{i})^{2}
\end{align*}

Total Sum of Squares ($SS_{total}$): 
\begin{align*}
SS_{total} = \sum_{i=1}^{n} (y_{i} - \bar y)^{2}
\end{align*}

$R^2$ is computed as:

\begin{align*}
R^2 = 1 - \dfrac{SS_{residual}}{SS_{total}} = 1 - \dfrac{\sum_{i=1}^{n} (y_{i} - \hat y_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar y)^{2}}
\end{align*}

where:

$y$: Actual target values.

$\bar y$: Mean of the actual target values.

$\hat y$: Predicted target values.

In [22]:
def calculate_r2(y_true: pd.Series, y_pred: NDArray[np.float64]) -> float:
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
    ss_residual = np.sum((y_true - y_pred) ** 2)
    r2 = 1 - (ss_residual / ss_total)
    return r2

In [23]:
def evaluate(y_true: pd.Series, y_pred: NDArray[np.float64]) -> Tuple[float, float, float, float]:
    """
    Calculate and return evaluation metrics for a regression model, including MSE, RMSE, MAE, and R-squared.

    Args:
        y_true: True target values.
        y_pred: Predicted target values.

    Returns:
        - mse: Mean Squared Error (MSE), indicating the average of the squared differences between predicted and true values.
        - rmse: Root Mean Squared Error (RMSE), indicating the standard deviation of the residuals.
        - mae: Mean Absolute Error (MAE), representing the average absolute difference between predicted and true values.
        - r2: R-squared (coefficient of determination), showing the proportion of variance in the dependent variable that is predictable from the independent variable(s).
    """
    mse = calculate_MSE(y_true, y_pred)
    rmse = calculate_RMSE(y_true, y_pred)
    mae = calculate_MAE(y_true, y_pred)
    r2 = calculate_r2(y_true, y_pred)
    return mse, rmse, mae, r2
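
A small worked example helps to check the four formulas by hand. The arrays below are made up and the `*_demo` names are illustrative; the last lines show a case in which $R^2$ becomes negative.

```python
import numpy as np

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.5, 2.0, 2.5, 4.0])

errors = y_true_demo - y_pred_demo
mse_demo = np.mean(errors ** 2)         # 0.125
rmse_demo = np.sqrt(mse_demo)           # about 0.3536
mae_demo = np.mean(np.abs(errors))      # 0.25
ss_res = np.sum(errors ** 2)            # 0.5
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # 5.0
r2_demo = 1 - ss_res / ss_tot           # 0.9

# Reversed predictions are worse than predicting the mean, so R^2 is negative.
y_bad = y_true_demo[::-1]
r2_bad = 1 - np.sum((y_true_demo - y_bad) ** 2) / ss_tot  # -3.0
print(mse_demo, rmse_demo, mae_demo, r2_demo, r2_bad)
```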

## 13. Encapsulation

In [24]:
class CustomRandomForestRegressor:
    def __init__(self, n_estimators: int = 100, max_depth: int = 15, min_samples_leaf: int = 1,
                 min_samples_split=2, metric: str = 'variance', max_features: Optional[int] = None,
                 random_state: Optional[int] = None, n_jobs: int = -1) -> None:
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.min_samples_split = min_samples_split
        self.metric = metric
        self.max_features = max_features
        self.random_state = random_state
        self.forest = None
        self.n_jobs = n_jobs
        self.feature_names = None  # Set in fit() when X provides column names

    def variance(self, y: pd.Series) -> float:
        """
        Calculate the variance.

        Args:
            y: Series of values.

        Returns:
            Variance value.
        """
        return np.var(y) if len(y) > 0 else 0

    def mse(self, y: pd.Series) -> float:
        """
        Calculate the mean squared error.

        Args:
            y: Series of values.

        Returns:
            Mean squared error value.
        """
        return np.mean((y - np.mean(y)) ** 2) if len(y) > 0 else 0

    def _information_gain(self, y: pd.Series, y_left: pd.Series, y_right: pd.Series) -> float:
        """
        Compute the information gain of a split.

        Args:
            y: Values of the parent node.
            y_left: Values of the left child node.
            y_right: Values of the right child node.

        Returns:
            Information gain from the split.
        """
        if self.metric == 'variance':
            parent_metric = self.variance(y)
            left_metric = self.variance(y_left)
            right_metric = self.variance(y_right)
        else:  # metric == "mse"
            parent_metric = self.mse(y)
            left_metric = self.mse(y_left)
            right_metric = self.mse(y_right)

        weighted_metric: float = (
            len(y_left) / len(y) * left_metric
            + len(y_right) / len(y) * right_metric
        )
        return parent_metric - weighted_metric

    def _bootstrap_sample(self, X: pd.DataFrame, y: pd.Series, n_samples: Optional[int] = None,
                          random_state: Optional[int] = None) -> Tuple[pd.DataFrame, pd.Series]:
        """
        Generate a bootstrap sample from the dataset.

        Args:
            X: Input features.
            y: Target labels.
            n_samples: Samples to draw (default: dataset size).
            random_state: Random seed.

        Returns:
            Bootstrapped (X, y) tuple.
        """
        rng = np.random.RandomState(random_state)
        if n_samples is None:
            n_samples = len(X)
        indices = rng.randint(0, len(X), size=n_samples)
        return X.iloc[indices], y.iloc[indices]

    def _best_split(self, X: NDArray[np.float64], y: NDArray[np.float64]) -> Dict[str, Any]:
        """
        Find the best split for a dataset.

        Args:
            X: Input features (array of shape [n_samples, total_n_features]).
            y: Target values (array of shape [n_samples]).

        The splitting criterion (`self.metric`), the feature names (`self.feature_names`)
        and the number of candidate features (`self.max_features`) are read from the instance.

        Returns:
            Dictionary containing the best split with keys:
                - 'feature_index' : Index of the feature used for the split.
                - 'feature_name': Name or index of the feature.
                - 'threshold' : Threshold value for the split.
        """

        best_info_gain = float('-inf')
        best_split = None
        total_n_features = X.shape[1]

        if isinstance(self.max_features, int):  # if max_features is int
            selected_n_features = self.max_features if self.max_features <= total_n_features else total_n_features
        else:  # Default = log2(total_n_features), but always at least one feature
            selected_n_features = max(1, int(np.log2(total_n_features)))

        selected_features_idx = np.random.choice(
            a=total_n_features, size=selected_n_features, replace=False)

        # Iterate over randomly selected features.
        for feature in selected_features_idx:
            # Iterate over all unique thresholds for each random feature.
            thresholds = np.unique(X[:, feature])
            for threshold in thresholds:
                # Split the data into left and right subsets based on the threshold.
                left_mask = X[:, feature] <= threshold
                right_mask = X[:, feature] > threshold

                # Skip invalid splits.
                if sum(left_mask) < self.min_samples_leaf or sum(right_mask) < self.min_samples_leaf:
                    continue

                # Compute IG.
                info_gain = self._information_gain(
                    y, y[left_mask], y[right_mask])

                # Update `best_info_gain` if `info_gain` > `best_info_gain`.
                if info_gain > best_info_gain:
                    best_info_gain = info_gain
                    best_split = {
                        'feature_index': feature,
                        'feature_name': self.feature_names[feature] if self.feature_names is not None else feature,
                        'threshold': threshold,
                    }

        return best_split

    def _build_tree(self, X: pd.DataFrame, y: pd.Series, depth: int = 0) -> Dict[str, Any]:
        """
        Recursively build a decision tree.

        Args:
            X: Input features.
            y: Target labels.
            depth: Current tree depth.

        Returns:
            Tree structure dictionary.
        """

        # Convert to numpy arrays
        X_np = X.to_numpy() if hasattr(X, 'to_numpy') else np.array(X)
        y_np = y.to_numpy().flatten() if hasattr(
            y, 'to_numpy') else np.array(y).flatten()

        # Stopping conditions
        if len(np.unique(y_np)) == 1 or (self.max_depth is not None and depth == self.max_depth):
            return {'type': 'leaf', 'value': np.mean(y)}

        if len(y_np) < self.min_samples_split:
            return {'type': 'leaf', 'value': np.mean(y)}

        # Find best split
        split = self._best_split(X_np, y_np)
        if not split:
            return {'type': 'leaf', 'value': np.mean(y)}

        # Apply split
        feature_idx = split['feature_index']
        left_mask = X_np[:, feature_idx] <= split['threshold']
        right_mask = X_np[:, feature_idx] > split['threshold']

        # Recursive tree building
        left_tree = self._build_tree(
            X.iloc[left_mask] if hasattr(X, 'iloc') else X[left_mask],
            y.iloc[left_mask] if hasattr(y, 'iloc') else y[left_mask],
            depth + 1
        )
        right_tree = self._build_tree(
            X.iloc[right_mask] if hasattr(X, 'iloc') else X[right_mask],
            y.iloc[right_mask] if hasattr(y, 'iloc') else y[right_mask],
            depth + 1
        )

        return {
            'type': 'node',
            'feature': split['feature_name'],
            'threshold': split['threshold'],
            'left': left_tree,
            'right': right_tree
        }

    def fit(self, X: pd.DataFrame, y: pd.Series) -> None:
        """
        Train the random forest on input data.

        Args:
            X: Training features.
            y: Training labels.
        """
        # Store feature names
        if hasattr(X, 'columns'):
            self.feature_names = X.columns.tolist()

        # Set random seeds for reproducibility
        if self.random_state is not None:
            np.random.seed(self.random_state)
        seeds = np.random.randint(0, 10000, size=self.n_estimators)

        # Build trees in parallel
        self.forest = Parallel(n_jobs=self.n_jobs)(
            delayed(self._build_single_tree)(X, y, seed)
            for seed in seeds
        )

    def _build_single_tree(self, X: pd.DataFrame, y: pd.Series,
                           seed: int) -> Dict[str, Any]:
        """
        Build a single decision tree with bootstrap sampling.
        """
        X_boot, y_boot = self._bootstrap_sample(X, y, random_state=seed)
        return self._build_tree(X_boot, y_boot)

    def _traverse_tree(self, x: np.ndarray,
                       tree: Dict[str, Any]) -> float:
        """
        Traverse a tree to make a prediction for a single sample.

        Args:
            x: Input sample (1D array).
            tree: Decision tree structure.

        Returns:
            Predicted label.
        """
        if tree['type'] == 'leaf':
            return tree['value']

        # Resolve feature index
        if self.feature_names is not None:
            feature_index = self.feature_names.index(tree['feature'])
        else:
            feature_index = tree['feature']  # Assume integer index

        if x[feature_index] <= tree['threshold']:
            return self._traverse_tree(x, tree['left'])
        else:
            return self._traverse_tree(x, tree['right'])

    def predict(self,
                X: pd.DataFrame | NDArray[np.float64]) -> NDArray[np.float64]:
        """
        Predict target values for input data by averaging the predictions of all trees.

        Args:
            X: Input features (DataFrame or array)

        Returns:
            Predicted labels (1D array)
        """
        if self.forest is None:
            raise RuntimeError("Model not trained. Call fit() first.")

        # Convert to numpy array
        X_np = X.to_numpy() if hasattr(X, 'to_numpy') else np.array(X)

        # Single sample case
        if len(X_np.shape) == 1:
            return np.mean([self._traverse_tree(X_np, tree) for tree in self.forest])

        # Batch predictions
        all_preds = [[self._traverse_tree(x, tree) for x in X_np]
                     for tree in self.forest
                     ]
        all_means = np.mean(all_preds, axis=0)
        return all_means

In [25]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train the random forest
tree = CustomRandomForestRegressor(
    n_estimators=100, n_jobs=-1, random_state=42)
tree.fit(X_train, y_train)

# Predict and evaluate
y_pred = tree.predict(X_test)
mse_custom, rmse_custom, mae_custom, r2_custom = evaluate(y_test, y_pred)
print(f'MSE (Custom): {mse_custom:.4f}')
print(f'RMSE (Custom): {rmse_custom:.4f}')
print(f'MAE (Custom): {mae_custom:.4f}')
print(f'R-Squared (Custom): {r2_custom:.4f}')
print('----------')

MSE (Custom): 0.0039
RMSE (Custom): 0.0628
MAE (Custom): 0.0433
R-Squared (Custom): 0.8473
----------


## 14. Comparison with Scikit-Learn

In [26]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

regressor = RandomForestRegressor(
    n_estimators=100
)
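# Caution: fitting on the full dataset (X, y) means the test rows are seen during
# training, which inflates the scikit-learn scores below. Fit on (X_train, y_train)
# for a like-for-like comparison with the custom model.
regressor.fit(X, y)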

# Predict and evaluate
y_pred = regressor.predict(X_test)
mse_sk = mean_squared_error(y_test, y_pred)
rmse_sk = np.sqrt(mse_sk)
mae_sk = mean_absolute_error(y_test, y_pred)
r2_sk = r2_score(y_test, y_pred)

print(f'MSE (SK): {mse_sk:.4f}')
print(f'MSE (Custom): {mse_custom:.4f}')
print('----------')
print(f'RMSE (SK): {rmse_sk:.4f}')
print(f'RMSE (Custom): {rmse_custom:.4f}')
print('----------')
print(f'MAE (SK): {mae_sk:.4f}')
print(f'MAE (Custom): {mae_custom:.4f}')
print('----------')
print(f'R-Squared (SK): {r2_sk:.4f}')
print(f'R-Squared (Custom): {r2_custom:.4f}')

MSE (SK): 0.0006
MSE (Custom): 0.0039
----------
RMSE (SK): 0.0240
RMSE (Custom): 0.0628
----------
MAE (SK): 0.0166
MAE (Custom): 0.0433
----------
R-Squared (SK): 0.9777
R-Squared (Custom): 0.8473
