# Random Forest Classifier from Scratch
***
## Table of Contents
1. [Introduction](#1-introduction)
2. [Loading Data](#2-loading-data)
3. [Train Test Split](#3-train-test-split)
4. [Gini Impurity and Entropy Metrics](#4-gini-impurity-and-entropy-metrics)
    - [Gini Impurity](#gini-impurity)
    - [Entropy](#entropy)
5. [Information Gain](#5-information-gain)
6. [Bootstrapping](#6-bootstrapping)
7. [Identifying the Best Split](#7-identifying-the-best-split)
8. [Building the Decision Tree](#8-building-the-decision-tree)
9. [Building Random Forest](#9-building-random-forest)
10. [Traversing the Tree for Prediction](#10-traversing-the-tree-for-prediction)
11. [Predictions](#11-predictions)
12. [Evaluation Metrics](#12-evaluation-metrics)
    - [Binary Confusion Matrix](#binary-confusion-matrix)
    - [Multi-Class Confusion Matrix](#multi-class-confusion-matrix)
    - [Accuracy](#accuracy)
    - [Precision](#precision)
    - [Recall](#recall)
    - [F1-Score](#f1-score)
13. [Encapsulation](#13-encapsulation)
14. [Comparison with Scikit-Learn](#14-comparison-with-scikit-learn)
***

In [62]:
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
from typing import Tuple, List, Dict, Optional, Any
from numpy.typing import NDArray
from joblib import Parallel, delayed
from scipy.stats import mode

## 1. Introduction
This notebook is an extension of [Decision Tree Classifier from Scratch](https://github.com/tsu76i/DS-playground/blob/main/2.%20Building%20ML%20Models%20From%20Scratch/2.3%20CART/decision_tree_classifier.ipynb).

Random forests are an ensemble learning technique that combines multiple decision trees, each trained on a bootstrap sample of the data (drawn with replacement) and restricted to a random subset of features at each split. The final prediction aggregates the results of all trees (**majority vote** for classification, **average** for regression). Compared to a single decision tree, this approach typically provides better accuracy, less overfitting and more stable predictions, at the cost of increased computational complexity and reduced interpretability. The method introduces two key randomisation techniques during training:

1. **Bootstrap Sampling**: Each tree is trained on a bootstrapped dataset, which is a random sample of the original dataset created *with replacement*. This ensures diversity among the trees.
2. **Feature Randomisation**: At each split in a tree, a random subset of features is considered rather than evaluating all features. This prevents dominant features from appearing in every tree and further promotes diversity.

## 2. Loading Data

In [63]:
# Load the dataset
data = load_breast_cancer()
feature_names = data.feature_names.tolist()
class_names = data.target_names.tolist()
X, y = data.data, data.target
df = pd.DataFrame(X, columns=feature_names)
df['diagnosis'] = y
X, y = df.drop('diagnosis', axis=1), df['diagnosis']

# Check the shape of the data
print(f'Features shape: {X.shape}')
print(f'Target shape: {y.shape}')
print(f'Features: \n{feature_names}')

Features shape: (569, 30)
Target shape: (569,)
Features: 
['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']


In [64]:
df.head()

|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | diagnosis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


In [65]:
y.value_counts()

diagnosis
1    357
0    212
Name: count, dtype: int64

In [66]:
print(class_names)

['malignant', 'benign']


In this dataset, the features describe measured characteristics of each tumour (e.g., radius, texture), while the target is a binary label indicating whether the tumour is malignant (0) or benign (1).

## 3. Train Test Split
Train test split is a fundamental model validation technique in machine learning. It divides a dataset into two separate portions: a **training set** used to train a model, and a **testing set** used to evaluate how well the model can perform on unseen data. 

The typical split ratio is 80% for training and 20% for testing, though this can vary (70/30 or 90/10 are also common). The key principle is that the test set must remain completely separate during the model training process and should never be used to tune parameters or make other decisions about the model.

The split is usually done randomly so that both sets are representative of the overall dataset, and many libraries (such as scikit-learn) provide built-in functions that handle this process automatically while maintaining proper randomisation.
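
As a minimal, library-free sketch of the idea (the helper name `split_indices` is hypothetical and not part of this notebook), shuffling the row indices once and cutting them at the test size yields two disjoint index sets:

```python
import random


def split_indices(n_samples, test_size=0.2, seed=None):
    """Return (train_idx, test_idx) as two disjoint lists of row indices."""
    rng = random.Random(seed)          # local RNG, global random state is untouched
    indices = list(range(n_samples))   # 0, 1, ..., n_samples - 1
    rng.shuffle(indices)
    n_test = int(test_size * n_samples)
    # The first n_test shuffled indices form the test set, the rest the training set
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))
```

Because the same shuffled list is cut in one place, no row can end up in both sets, which is the property the test set depends on.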


In [67]:
def train_test_split(X: pd.DataFrame, y: pd.Series, test_size: float = 0.2,
                     random_state: Optional[int] = None) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """
    Split arrays or matrices into random train and test subsets.

    Args:
        X: Input features, a 2D array with rows (samples) and columns (features).
        y: Target values/labels, a 1D array with rows (samples).
        test_size: Proportion of the dataset to include in the test split. Must be between 0.0 and 1.0. default = 0.2
        random_state: Seed for the random number generator to ensure reproducible results. default = None

    Returns:
        A tuple containing:
            - X_train: Training set features.
            - X_test: Testing set features.
            - y_train: Training set target values.
            - y_test: Testing set target values.
    """
    # Set the random seed if one was given (0 is a valid seed)
    if random_state is not None:
        np.random.seed(random_state)

    # Create an array of indices from 0 to len(X) - 1
    indices = np.arange(len(X))

    # Shuffle the indices
    np.random.shuffle(indices)

    # Define the size of our test data from len(X)
    test_size = int(test_size * len(X))

    # Generate indices for test and train data
    test_indices: NDArray[np.int64] = indices[:test_size]
    train_indices: NDArray[np.int64] = indices[test_size:]

    # Return: X_train, X_test, y_train, y_test
    return X.iloc[train_indices], X.iloc[test_indices], y.iloc[train_indices], y.iloc[test_indices]

## 4. Gini Impurity and Entropy Metrics
### Gini Impurity
Gini impurity measures the likelihood that a randomly chosen sample from a node would be misclassified if it were labelled at random according to the node's class distribution. It quantifies how impure a node is, with values ranging from $0$ (minimum impurity) to $0.5$ (maximum impurity) for binary classification. For multi-class problems, the maximum occurs when all $k$ classes are equally probable and equals $1 - 1/k$. The formula for Gini impurity is:

\begin{align*}
G = 1 - \sum_{i=1}^{k} p_{i}^{2}
\end{align*}

where:
- $k$: Number of classes.
- $p_{i}$: Proportion of samples belonging to class $i$ in the node.

In [68]:
def gini(y: pd.Series) -> float:
    proportions = np.bincount(y) / len(y)
    return 1 - np.sum(proportions**2)
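
The boundary values quoted above can be checked on toy labels. The following is a small sketch in plain Python (the helper `gini_from_labels` is a hypothetical stand-in for the notebook's `gini`, written without NumPy so it runs on its own):

```python
from collections import Counter


def gini_from_labels(labels):
    """Gini impurity 1 - sum(p_i^2) computed from a plain list of labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_from_labels([0, 0, 0, 0]))   # pure node: no impurity
print(gini_from_labels([0, 0, 1, 1]))   # balanced binary node: maximum of 0.5
print(gini_from_labels([0, 1, 2]))      # three equally likely classes: 1 - 1/3
```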

### Entropy
Entropy measures the amount of uncertainty or randomness in the data. It comes from information theory and represents the expected amount of information (in bits) required to classify a sample. For binary classification, it ranges from $0$ (minimum entropy) to $1$ (maximum entropy). For $k$ classes, the range is from $0$ to $\log_{2}(k)$. The formula for entropy is:


\begin{align*}
H = - \sum_{i=1}^{k} p_{i} \log_{2}(p_{i})
\end{align*}

where:
- $k$: Number of classes.
- $p_{i}$: Proportion of samples belonging to class $i$ in the node.

Gini tends to favour splits that isolate the most frequent class, while entropy is somewhat more sensitive to changes in small class probabilities, which can matter with many classes or highly imbalanced distributions. In practice both metrics usually produce similar trees, and Gini is often preferred because it avoids computing logarithms.

In [69]:
def entropy(y: pd.Series) -> float:
    proportions = np.bincount(y) / len(y)
    proportions = proportions[proportions > 0]  # Avoid log(0)
    return -np.sum(proportions * np.log2(proportions))
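
To illustrate the stated range of $0$ to $\log_{2}(k)$, here is a short sketch in plain Python (`entropy_from_labels` is a hypothetical helper that mirrors the formula above, not the notebook's NumPy version):

```python
import math
from collections import Counter


def entropy_from_labels(labels):
    """Shannon entropy in bits, computed from a plain list of labels."""
    n = len(labels)
    # Counter only holds classes that occur, so log2 is never evaluated at 0
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


print(entropy_from_labels([1, 1, 1, 1]))   # pure node
print(entropy_from_labels([0, 0, 1, 1]))   # balanced binary node: log2(2)
print(entropy_from_labels([0, 1, 2, 3]))   # four equally likely classes: log2(4)
```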

In [70]:
print(y.value_counts())

diagnosis
1    357
0    212
Name: count, dtype: int64


In [71]:
print(f"Gini Impurity: {gini(y):.5f}")
print(f"Entropy: {entropy(y):.5f}")

Gini Impurity: 0.46753
Entropy: 0.95264


## 5. Information Gain
Information Gain measures how effectively a split on a feature separates a dataset into subsets that are purer with respect to the target variable. It quantifies the reduction in entropy or Gini impurity; a higher information gain indicates a better split.

\begin{align*}
IG(S, A) = H(S) - \sum_{i=1}^{n} \dfrac{|S_i|}{|S|}H(S_{i})
\end{align*}

where:
- $H(S)$: Entropy (or Gini impurity) of the original dataset $S$.
- $S_{i}$: Subset of $S$ created by splitting on feature $A$ for the $i$-th value or range of the feature.
- $\dfrac{|S_i|}{|S|}$: Proportion of samples in subset $S_{i}$.
- $H(S_{i})$: Entropy (or Gini impurity) of subset $S_{i}$.



The following `information_gain` function calculates the difference between the metric for the parent node and the weighted average of the metrics for the child nodes (left and right splits).

In [72]:
def information_gain(y: pd.Series, y_left: pd.Series, y_right: pd.Series,
                     metric: str = 'gini') -> float:
    """
    Calculate the information gain of a split.

    Args:
        y: Labels of the parent node.
        y_left: Labels of the left child node after the split.
        y_right: Labels of the right child node after the split.
        metric: Splitting criterion, either 'gini' or 'entropy'. Defaults to 'gini'.

    Returns:
        Information gain resulting from the split.
    """
    if metric == 'gini':
        parent_metric = gini(y)
        left_metric = gini(y_left)
        right_metric = gini(y_right)
    else:  # metric == "entropy"
        parent_metric = entropy(y)
        left_metric = entropy(y_left)
        right_metric = entropy(y_right)

    weighted_metric = (
        len(y_left) / len(y) * left_metric
        + len(y_right) / len(y) * right_metric
    )
    return parent_metric - weighted_metric
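
A small worked example may help to show why a cleaner split scores higher. The sketch below uses plain Python and Gini impurity only; the helper names `gini` and `gain` are illustrative re-implementations, not the functions defined in this notebook:

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a plain list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
    return gini(parent) - weighted


parent = [0, 0, 0, 1, 1, 1]
print(gain(parent, [0, 0, 0], [1, 1, 1]))   # perfect split: gain equals the parent impurity
print(gain(parent, [0, 0, 1], [0, 1, 1]))   # weak split: children are still mixed
```

The perfect split removes all impurity (gain of 0.5 here), while the weak split leaves each child with impurity 4/9 and gains only 1/18.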

## 6. Bootstrapping
Bootstrapping is a statistical resampling method that involves sampling data points with replacement. In creating a new dataset (**bootstrap sample**) from the original dataset, some data points may appear multiple times, while others may be excluded. Though individual data points may repeat, the size of bootstrap sample $n$ is typically the same as the original dataset. This method ensures variability among datasets, which helps reduce overfitting when used in ensemble learning.

For a dataset with $n$ examples, each sample has a $1 - \left( 1 - \dfrac{1}{n} \right)^{n}$ chance of being selected at least once in the bootstrap sample. As $n$ becomes large, this value approaches $1-\text{e}^{-1} \approx 0.632$. Hence, about 63.2% of the original dataset is expected to appear in any given bootstrap sample.
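
The 63.2% figure is easy to check empirically. The following sketch (standard library only, with an arbitrary seed and an arbitrary $n$ of 10,000) draws one bootstrap sample of indices and compares the fraction of distinct indices with the analytic value:

```python
import math
import random

n = 10_000
rng = random.Random(0)                    # fixed seed, chosen arbitrarily
sample = rng.choices(range(n), k=n)       # n draws with replacement
unique_fraction = len(set(sample)) / n    # share of original rows that were drawn
analytic = 1 - (1 - 1 / n) ** n           # exact probability for this n

print(f"empirical: {unique_fraction:.3f}")
print(f"analytic:  {analytic:.3f}")
print(f"limit:     {1 - math.exp(-1):.3f}")
```

The remaining rows (roughly 36.8%) are never drawn for that tree and are commonly called out-of-bag samples.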

In [73]:
def bootstrap_sample(X: pd.DataFrame, y: pd.Series, n_samples: Optional[int] = None,
                     random_state: Optional[int] = None) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Generate a bootstrap sample from the dataset.

    Args:
        X: Input features.
        y: Target labels.
        n_samples: Samples to draw (default: dataset size).
        random_state: Random seed.

    Returns:
        Bootstrapped (X, y) tuple.
    """
    if random_state is not None:
        np.random.seed(random_state)
    if n_samples is None:
        n_samples = len(X)
    indices = np.random.choice(len(X), size=n_samples, replace=True)
    return X.iloc[indices], y.iloc[indices]

In [74]:
bootstrap_sample(X, y, random_state=42)[0][:10]  # X

|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 102 | 12.18 | 20.52 | 77.22 | 458.7 | 0.08013 | 0.04038 | 0.02383 | 0.0177 | 0.1739 | 0.05677 | ... | 13.34 | 32.84 | 84.58 | 547.8 | 0.1123 | 0.08862 | 0.1145 | 0.07431 | 0.2694 | 0.06878 |
| 435 | 13.98 | 19.62 | 91.12 | 599.5 | 0.106 | 0.1133 | 0.1126 | 0.06463 | 0.1669 | 0.06544 | ... | 17.04 | 30.8 | 113.9 | 869.3 | 0.1613 | 0.3568 | 0.4069 | 0.1827 | 0.3179 | 0.1055 |
| 270 | 14.29 | 16.82 | 90.3 | 632.6 | 0.06429 | 0.02675 | 0.00725 | 0.00625 | 0.1508 | 0.05376 | ... | 14.91 | 20.65 | 94.44 | 684.6 | 0.08567 | 0.05036 | 0.03866 | 0.03333 | 0.2458 | 0.0612 |
| 106 | 11.64 | 18.33 | 75.17 | 412.5 | 0.1142 | 0.1017 | 0.0707 | 0.03485 | 0.1801 | 0.0652 | ... | 13.14 | 29.26 | 85.51 | 521.7 | 0.1688 | 0.266 | 0.2873 | 0.1218 | 0.2806 | 0.09097 |
| 71 | 8.888 | 14.64 | 58.79 | 244.0 | 0.09783 | 0.1531 | 0.08606 | 0.02872 | 0.1902 | 0.0898 | ... | 9.733 | 15.67 | 62.56 | 284.4 | 0.1207 | 0.2436 | 0.1434 | 0.04786 | 0.2254 | 0.1084 |
| 20 | 13.08 | 15.71 | 85.63 | 520.0 | 0.1075 | 0.127 | 0.04568 | 0.0311 | 0.1967 | 0.06811 | ... | 14.5 | 20.49 | 96.09 | 630.5 | 0.1312 | 0.2776 | 0.189 | 0.07283 | 0.3184 | 0.08183 |
| 121 | 18.66 | 17.12 | 121.4 | 1077.0 | 0.1054 | 0.11 | 0.1457 | 0.08665 | 0.1966 | 0.06213 | ... | 22.25 | 24.9 | 145.4 | 1549.0 | 0.1503 | 0.2291 | 0.3272 | 0.1674 | 0.2894 | 0.08456 |
| 466 | 13.14 | 20.74 | 85.98 | 536.9 | 0.08675 | 0.1089 | 0.1085 | 0.0351 | 0.1562 | 0.0602 | ... | 14.8 | 25.46 | 100.9 | 689.1 | 0.1351 | 0.3549 | 0.4504 | 0.1181 | 0.2563 | 0.08174 |
| 214 | 14.19 | 23.81 | 92.87 | 610.7 | 0.09463 | 0.1306 | 0.1115 | 0.06462 | 0.2235 | 0.06433 | ... | 16.86 | 34.85 | 115.0 | 811.3 | 0.1559 | 0.4059 | 0.3744 | 0.1772 | 0.4724 | 0.1026 |
| 330 | 16.03 | 15.51 | 105.8 | 793.2 | 0.09491 | 0.1371 | 0.1204 | 0.07041 | 0.1782 | 0.05976 | ... | 18.76 | 21.98 | 124.3 | 1070.0 | 0.1435 | 0.4478 | 0.4956 | 0.1981 | 0.3019 | 0.09124 |


In [75]:
bootstrap_sample(X, y, random_state=42)[1][:10]

102    1
435    0
270    1
106    1
71     1
20     1
121    0
466    1
214    0
330    0
Name: diagnosis, dtype: int64

## 7. Identifying the Best Split
This function identifies the best feature and threshold to split the data using the specified metric (Gini or Entropy).

Steps are:

1. Select a random subset of features (a common choice is $\sqrt{p}$ of the $p$ features for classification, which is the default used here; $p/3$ is a common suggestion for regression).

2. For each selected feature, iterate over all unique thresholds.

3. Split the data into left and right subsets based on the threshold (skip invalid ones).

4. Compute the Gini/Entropy for both subsets and calculate Information Gain.

5. If the newly computed `info_gain` exceeds `best_info_gain`, update `best_info_gain` and record the corresponding feature and threshold.
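
Steps 2 to 5 can be sketched for a single feature in plain Python. This toy version leaves out the random feature selection of step 1, and the helper name `best_threshold` and the data are made up for illustration:

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a plain list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Scan every unique value of one feature and return (threshold, gain)."""
    parent = gini(labels)
    best = (None, float("-inf"))
    for threshold in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        if not left or not right:      # skip splits that leave one side empty
            continue
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        gain = parent - weighted
        if gain > best[1]:             # keep the split with the highest gain so far
            best = (threshold, gain)
    return best


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # hypothetical feature column
labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(values, labels))         # (3.0, 0.5)
```

The largest unique value is always skipped because nothing lies to its right, which is the "invalid split" case from step 3.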

In [76]:
def best_split(X: pd.DataFrame, y: pd.Series, metric: str = 'gini', feature_names=None, max_features=None) -> Dict[str, Any]:
    """
    Find the best split for a dataset.

    Args:
        X: Input features (DataFrame of shape [n_samples, total_n_features]).
        y: Labels (Series of shape [n_samples]).
        metric: Splitting criterion, either "gini" or "entropy". Defaults to 'gini'.
        feature_names: List of feature names. If None, indices are used. Defaults to None.
        max_features: Number of features to consider at each split. None(√total_n_features) or int(<=total_n_features). Defaults to None.
    Returns:
        Dictionary containing the best split with keys:
              - 'feature_index' : Index of the feature used for the split.
              - 'feature_name': Name or index of the feature.
              - 'threshold' : Threshold value for the split.
    """
    if feature_names is None and hasattr(X, 'columns'):
        feature_names = X.columns.tolist()

    # Convert X if DataFrame
    if hasattr(X, 'to_numpy'):
        X = X.to_numpy()

    best_info_gain = float('-inf')
    best_split = None
    total_n_features = X.shape[1]

    if isinstance(max_features, int):  # if max_features is int
        selected_n_features = max_features if max_features <= total_n_features else total_n_features
    else:  # Default = √total_n_features
        selected_n_features = int(np.sqrt(total_n_features))

    selected_features_idx = np.random.choice(
        a=total_n_features, size=selected_n_features, replace=False)

    # Iterate over randomly selected features.
    for feature in selected_features_idx:
        # Iterate over all unique thresholds for each random feature.
        thresholds = np.unique(X[:, feature])
        for threshold in thresholds:
            # Split the data into left and right subsets based on the threshold.
            left_mask = X[:, feature] <= threshold
            right_mask = X[:, feature] > threshold

            # Skip invalid splits.
            if not left_mask.any() or not right_mask.any():
                continue

            # Compute IG.
            info_gain = information_gain(
                y, y[left_mask], y[right_mask], metric)

            # Update `best_info_gain` if `info_gain` > `best_info_gain`.
            if info_gain > best_info_gain:
                best_info_gain = info_gain
                best_split = {
                    'feature_index': int(feature),
                    'feature_name': feature_names[feature] if feature_names is not None else feature,
                    'threshold': float(threshold),
                }

    return best_split

In [77]:
split = best_split(X, y, metric='entropy')
print('Best Split:', split)

Best Split: {'feature_index': 3, 'feature_name': 'mean area', 'threshold': 693.7}


## 8. Building the Decision Tree
This function recursively builds the tree as a nested dictionary of decision nodes (each holding a `feature` and a `threshold`) and leaf nodes.

In [78]:
def build_tree(X: pd.DataFrame, y: pd.Series, max_depth: Optional[int] = None,
               depth: int = 0, metric: str = 'gini', feature_names: Optional[List[str | int]] = None,
               max_features: Optional[int] = None) -> Dict[str, Any]:
    """
    Build a decision tree using recursive splitting.

    Args:
        X: Input features (DataFrame of shape [n_samples, n_features]).
        y: Labels (Series of shape [n_samples]).
        max_depth: Maximum depth of the tree. Defaults to None (unlimited depth).
        depth: Current depth of the tree. Used internally for recursion. Defaults to 0.
        metric: Splitting criterion, either 'gini' or 'entropy'. Defaults to 'gini'.
        feature_names: List of feature names. If None, indices are used. Defaults to None.
        max_features: Number of features to consider at each split. None(√total_n_features) or int(<=total_n_features). Defaults to None.

    Returns:
        - Nested dictionary representing the tree structure.
        - Nodes contain keys: 'type', 'feature', 'threshold', 'left', 'right'.
        - Leaf nodes contain keys: 'type', 'value'.
    """

    # Convert DataFrames to NumPy arrays
    if hasattr(X, 'to_numpy'):
        X = X.to_numpy()
    if hasattr(y, 'to_numpy'):
        y = y.to_numpy().flatten()  # Ensure 1D array

    # Stop the recursion if all labels are identical or the maximum depth is reached.
    if len(set(y)) == 1 or (max_depth is not None and depth == max_depth):
        return {'type': 'leaf', 'value': int(np.argmax(np.bincount(y)))}

    # Find the best split.
    split = best_split(X, y, metric, feature_names, max_features)
    if not split:
        return {'type': 'leaf', 'value': int(np.argmax(np.bincount(y)))}

    # Split the data into left and right subsets.
    # Use feature_index for calculations.
    left_mask = X[:, split['feature_index']] <= split['threshold']
    right_mask = X[:, split['feature_index']] > split['threshold']

    # Recursively build the left and right subtrees.
    left_tree = build_tree(X[left_mask], y[left_mask],
                           max_depth, depth + 1, metric, feature_names, max_features)
    right_tree = build_tree(X[right_mask], y[right_mask],
                            max_depth, depth + 1, metric, feature_names, max_features)

    # Return the tree structure as a nested dictionary.
    return {
        'type': 'node',
        'feature': split['feature_name'],
        'threshold': split['threshold'],
        'left': left_tree,
        'right': right_tree,
    }

## 9. Building Random Forest
We now create a `build_random_forest` function that repeats two steps `n_estimators` times: draw a bootstrap sample, then build a tree on it. To reduce execution time, the trees are built in parallel (using all CPU cores by default).

In [79]:
def build_random_forest(X_train: pd.DataFrame, y_train: pd.Series, n_estimators: int,
                        n_jobs: int = -1, max_depth: int = 15) -> List[Dict[str, Any]]:
    """
    Optimised random forest builder using parallel processing

    Args:
        X_train: Training features
        y_train: Training labels
        n_estimators: Number of trees
        n_jobs: Number of CPU cores to use (-1 = all cores)
        max_depth: Maximum tree depth

    Returns:
        List of decision trees
    """
    # Build single tree
    def _build_single_tree(i):
        X_boot, y_boot = bootstrap_sample(X_train, y_train, random_state=i)
        return build_tree(X_boot, y_boot, max_depth=max_depth,
                          metric='gini', feature_names=feature_names)

    # Parallel execution
    forest = Parallel(n_jobs=n_jobs)(
        delayed(_build_single_tree)(i)
        for i in range(n_estimators)
    )

    return forest

## 10. Traversing the Tree for Prediction
This function traverses the tree to make predictions by following the tree from the root to a leaf node.

In [80]:
def traverse_tree(x: pd.DataFrame, tree: Dict[str, Any],
                  feature_names: List[str | int] = None, max_features=None) -> int:
    """
    Traverse a decision tree to make a prediction for a single sample.

    Args:
        x: Single sample.
        tree: Decision tree structure.
        feature_names: List of feature names. Needed for name-to-index mapping. Defaults to None.
        max_features: Not used during traversal (it only affects training); it is simply passed along in the recursive calls. Defaults to None.

    Returns:
        Predicted label.
    """
    if tree['type'] == 'leaf':
        return tree['value']

    # Resolve feature index if feature_names is provided
    feature_index = feature_names.index(
        tree['feature']) if feature_names is not None else tree['feature']

    if x[feature_index] <= tree['threshold']:
        return traverse_tree(x, tree['left'], feature_names, max_features)
    else:
        return traverse_tree(x, tree['right'], feature_names, max_features)
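
To make the nested-dictionary layout concrete, the sketch below walks a small hand-written tree. The thresholds and leaf values in `toy_tree` are invented for illustration and do not come from a fitted model; `walk` is a hypothetical iterative counterpart of the recursive function above:

```python
# Hand-written tree using the same keys as the dictionaries built in section 8
toy_tree = {
    "type": "node", "feature": "mean area", "threshold": 700.0,
    "left": {"type": "leaf", "value": 1},
    "right": {
        "type": "node", "feature": "mean texture", "threshold": 18.0,
        "left": {"type": "leaf", "value": 1},
        "right": {"type": "leaf", "value": 0},
    },
}
names = ["mean area", "mean texture"]


def walk(sample, tree, feature_names):
    """Follow left/right branches until a leaf is reached, then return its value."""
    node = tree
    while node["type"] != "leaf":
        idx = feature_names.index(node["feature"])   # map feature name to column index
        node = node["left"] if sample[idx] <= node["threshold"] else node["right"]
    return node["value"]


print(walk([500.0, 25.0], toy_tree, names))   # goes left at the root
print(walk([900.0, 20.0], toy_tree, names))   # goes right, then right again
```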

## 11. Predictions
This function predicts labels for all samples in the dataset.

In [81]:
def predict(X: pd.DataFrame, tree: Dict[str, Any],
            feature_names: List[str | int] = None) -> int | NDArray[np.int64]:
    """
    Predict labels for the given dataset using a decision tree classifier.

    Args:
        X: Input features.
        tree: Decision tree structure.
        feature_names : List of feature names. Needed for name-to-index mapping. Defaults to None.

    Returns:
        Predicted labels (1D array for multiple samples or a single label for one sample).
    """
    # Convert DataFrames to NumPy arrays
    if hasattr(X, 'to_numpy'):
        X = X.to_numpy()

    if len(X.shape) == 1:  # If a single sample is provided
        return traverse_tree(X, tree, feature_names)
    return np.array([traverse_tree(x, tree, feature_names) for x in X])

Once each of the `n_estimators` trees has produced its predictions, a majority vote across trees determines the final prediction for every sample.

In [82]:
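def predict_majority_vote(forest: List[Dict[str, Any]], X: pd.DataFrame,
                          feature_names: Optional[List[str]] = None) -> NDArray[np.int64]:
    """
    Predict labels with a forest by majority vote across its trees.
    """
    # Rows = trees, columns = samples
    all_preds = np.array([predict(X, tree, feature_names) for tree in forest])
    # Most frequent label per sample (column-wise mode)
    majority_vote, _ = mode(all_preds, axis=0)
    # ravel() yields a 1D array whether or not SciPy keeps the reduced axis
    return np.asarray(majority_vote).ravel()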
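
The voting step itself can be sketched without SciPy. In this toy example (made-up predictions from three trees for four samples, with a hypothetical helper `majority_vote`), each column is one sample and the most common label in the column wins:

```python
from collections import Counter

# One row per tree, one column per sample (values invented for illustration)
tree_predictions = [
    [1, 0, 1, 1],   # tree 0
    [1, 1, 0, 1],   # tree 1
    [0, 0, 1, 1],   # tree 2
]


def majority_vote(per_tree):
    """Return the most frequent label per sample across all trees."""
    votes = []
    for sample_preds in zip(*per_tree):   # regroup the predictions by sample
        votes.append(Counter(sample_preds).most_common(1)[0][0])
    return votes


print(majority_vote(tree_predictions))    # [1, 0, 1, 1]
```

With an even number of trees a tie is possible; this sketch resolves ties by first-seen order, which may differ from the tie-breaking of `scipy.stats.mode`.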

## 12. Evaluation Metrics
### Binary Confusion Matrix
In a confusion matrix, the terms True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) describe the classification performance for binary classification. 

|                     | Predicted Negative  | Predicted Positive  |
| ------------------- | ------------------- | ------------------- |
| **Actual Negative** | True Negative (TN)  | False Positive (FP) |
| **Actual Positive** | False Negative (FN) | True Positive (TP)  |


1. True Positive (TP): The number of instances correctly predicted as positive (e.g., a disease correctly identified).

2. True Negative (TN): The number of instances correctly predicted as negative (e.g., no disease correctly identified).

3. False Positive (FP): The number of instances incorrectly predicted as positive (e.g., predicting disease when there isn't any).

4. False Negative (FN): The number of instances incorrectly predicted as negative (e.g., missing a disease when it exists).

### Multi-Class Confusion Matrix
For multi-class classification, the concepts can be extended by treating one class as the "positive" class and all others as "negative" classes in a one-vs-all approach. Rows represent the actual classes (true labels), and columns represent the predicted classes. For a class $C$,
1. True Positive (TP): The count in the diagonal cell corresponding to class $C$ ($\text{matrix} [C][C]$).
2. False Positive (FP): The sum of the column for class $C$, excluding the diagonal ($\sum(\text{matrix} [:, C]) - \text{matrix} [C][C]$).
3. False Negative (FN): The sum of the row for class $C$, excluding the diagonal ($\sum(\text{matrix} [C, :]) - \text{matrix} [C][C]$).
4. True Negative (TN): All other cells not in the row or column for class $C$ ($\text{total} - (FP + FN + TP)$).

|                  | Predicted Class 0 | Predicted Class 1 | Predicted Class 2 |
| ---------------- | ----------------- | ----------------- | ----------------- |
| **True Class 0** | 5                 | 2                 | 0                 |
| **True Class 1** | 1                 | 6                 | 1                 |
| **True Class 2** | 0                 | 2                 | 7                 |


For Class 0:
- TP = 5 (diagonal element for Class 0)
- FP = 1 (sum of column 0 minus TP: 1 + 0)
- FN = 2 (sum of row 0 minus TP: 2 + 0)
- TN = 6 + 1 + 2 + 7 = 16 (all other cells not in row 0 or column 0)

For Class 1:
- TP = 6 (diagonal element for Class 1)
- FP = 4 (sum of column 1 minus TP: 2 + 2)
- FN = 2 (sum of row 1 minus TP: 1 + 1)
- TN = 5 + 0 + 0 + 7 = 12 (all other cells not in row 1 or column 1)
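
The one-vs-all counts above can be reproduced directly from the example matrix. This is a small sketch in plain Python; `one_vs_rest_counts` is a hypothetical helper name:

```python
# Rows are true classes, columns are predicted classes (same values as the table above)
matrix = [
    [5, 2, 0],
    [1, 6, 1],
    [0, 2, 7],
]


def one_vs_rest_counts(matrix, c):
    """Return TP, FP, FN and TN for class c, treating all other classes as negative."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[c][c]
    fp = sum(row[c] for row in matrix) - tp   # column c without the diagonal
    fn = sum(matrix[c]) - tp                  # row c without the diagonal
    tn = total - tp - fp - fn
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


for c in range(len(matrix)):
    print(c, one_vs_rest_counts(matrix, c))
```

For Class 2, which is not worked out above, the same rules give TP = 7, FP = 1, FN = 2 and TN = 14.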

In [83]:
def confusion_matrix(y_true: NDArray[np.int64], y_pred: NDArray[np.int64],
                     class_names: List[str] = None) -> Tuple[NDArray[np.int64], List[str]]:
    """
    Calculate the confusion matrix.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.
        class_names: List of class names. Defaults to None.

    Returns:
        Tuple: 
        - Confusion matrix.
        - List of class names.
    """
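    # Collect every label seen in either array so that a predicted class
    # absent from y_true does not raise a KeyError below
    unique_classes = np.unique(np.concatenate([np.asarray(y_true), np.asarray(y_pred)]))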
    if class_names is None:
        class_names = [str(cls) for cls in unique_classes]
    class_to_index = {cls: i for i, cls in enumerate(unique_classes)}

    n_classes = len(unique_classes)
    matrix = np.zeros((n_classes, n_classes), dtype=int)

    for true, pred in zip(y_true, y_pred):
        true_idx = class_to_index[true]
        pred_idx = class_to_index[pred]
        matrix[true_idx][pred_idx] += 1

    return matrix, class_names

### Accuracy
Accuracy is the most common evaluation metric for classification problems, representing the percentage of correct predictions out of total predictions. It provides a simple measure of how often the classifier makes correct predictions across all classes.

\begin{align*}
\text{Accuracy} = \dfrac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Samples}}
\end{align*}

In [84]:
def accuracy(y_true: NDArray[np.int64],
             y_pred: NDArray[np.int64]) -> float:
    """
    Calculate the accuracy of predictions by comparing true and predicted labels.

    Args:
        y_true: Ground truth target values. Contains the actual class labels for each sample.
        y_pred: Estimated target as returned by a classifier. Contains the predicted class labels for each sample.
    Returns:
        Classification accuracy (0.0 to 1.0).
    """
    return np.mean(y_true == y_pred)

### Precision
Precision measures the proportion of true positive predictions out of all positive predictions made by the classifier.

\begin{align*}
\text{Precision} = \dfrac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
\end{align*}

In [85]:
def precision(y_true: NDArray[np.int64], y_pred: NDArray[np.int64]) -> NDArray[np.float64]:
    """
    Calculate precision for each class.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.

    Returns:
        Precision values for each class.
    """
    cm, _ = confusion_matrix(y_true, y_pred)
    return np.diag(cm) / (np.sum(cm, axis=0) + 1e-7)

### Recall
Recall measures the proportion of true positive predictions out of all actual positive cases.

\begin{align*}
\text{Recall} = \dfrac{\text{True Positives (TP)} }{\text{True Positives (TP)} + \text{False Negatives (FN)}}
\end{align*}

In [86]:
def recall(y_true: NDArray[np.int64], y_pred: NDArray[np.int64]) -> NDArray[np.float64]:
    """
    Calculate recall for each class.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.

    Returns:
        Recall values for each class.
    """
    cm, _ = confusion_matrix(y_true, y_pred)
    return np.diag(cm) / (np.sum(cm, axis=1) + 1e-7)

### F1-Score
The F1-Score is the harmonic mean of precision and recall.

\begin{align*}
\text{F1-Score} = 2 \times \dfrac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}

In [87]:
def f1_score(y_true: NDArray[np.int64], y_pred: NDArray[np.int64]) -> NDArray[np.float64]:
    """
    Calculate F1-score for each class.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.

    Returns:
        F1-scores for each class.
    """
    prec = precision(y_true, y_pred)
    rec = recall(y_true, y_pred)
    return 2 * (prec * rec) / (prec + rec + 1e-7)
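
As a worked example, the metrics can be computed by hand for the 3-class confusion matrix shown earlier in this section. The sketch below uses plain Python and omits the small `1e-7` stabiliser used by the functions above, since no row or column of this matrix sums to zero:

```python
# Rows are true classes, columns are predicted classes
matrix = [
    [5, 2, 0],
    [1, 6, 1],
    [0, 2, 7],
]
n = len(matrix)

accuracy = sum(matrix[i][i] for i in range(n)) / sum(map(sum, matrix))
precision = [matrix[c][c] / sum(row[c] for row in matrix) for c in range(n)]   # TP / column sum
recall = [matrix[c][c] / sum(matrix[c]) for c in range(n)]                     # TP / row sum
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

print(f"accuracy: {accuracy:.4f}")
for c in range(n):
    print(f"class {c}: precision={precision[c]:.4f} recall={recall[c]:.4f} f1={f1[c]:.4f}")
```

For Class 0 this gives a precision of 5/6, a recall of 5/7 and an F1-score of 10/13, while the overall accuracy is 18/24 = 0.75.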

In [88]:
def evaluate(y_true: NDArray[np.int64], y_pred: NDArray[np.int64],
             class_names: Optional[List[str]] = None) -> Tuple[float, float, float, float, NDArray[np.int64]]:
    """
    Compute accuracy, macro-averaged precision, recall and F1-score, and the confusion matrix.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.
        class_names: List of class names. Defaults to None.

    Returns:
        Tuple of (accuracy, mean precision, mean recall, mean F1-score, confusion matrix).
    """
    cm, class_names = confusion_matrix(y_true, y_pred, class_names)
    acc = accuracy(y_true, y_pred)
    prec = precision(y_true, y_pred)
    rec = recall(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    return acc, np.mean(prec), np.mean(rec), np.mean(f1), cm

## 13. Encapsulation

In [89]:
class CustomRandomForest:
    def __init__(self, n_estimators: int = 100, max_depth: int = 15, min_samples_leaf: int = 1, metric: str = 'gini',
                 max_features: Optional[int] = None, random_state: Optional[int] = None, n_jobs: int = -1) -> None:
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.metric = metric
        self.max_features = max_features
        self.random_state = random_state
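        self.forest = None
        self.n_jobs = n_jobs
        # Set in fit(); stays None when the training data has no column names.
        self.feature_names: Optional[List[str]] = None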

    def _gini(self, y: pd.Series) -> float:
        """
        Calculate the Gini impurity for a set of labels.
        """
        if len(y) == 0:
            return 0
        proportions = np.bincount(y) / len(y)
        return 1 - np.sum(proportions ** 2)

    def _entropy(self, y: pd.Series) -> float:
        """
        Calculate the entropy for a set of labels.
        """
        if len(y) == 0:
            return 0
        proportions = np.bincount(y) / len(y)
        proportions = proportions[proportions > 0]  # Avoid log(0)
        return -np.sum(proportions * np.log2(proportions))

    def _information_gain(self, y: pd.Series, y_left: pd.Series, y_right: pd.Series) -> float:
        """
        Compute the information gain of a split.

        Args:
            y: Series of the parent node.
            y_left: Series of the left child node.
            y_right: Series of the right child node.

        Returns:
            Information gain from the split.
        """
        if self.metric == 'gini':
            parent_metric = self._gini(y)
            left_metric = self._gini(y_left)
            right_metric = self._gini(y_right)
        else:  # metric == "entropy"
            parent_metric = self._entropy(y)
            left_metric = self._entropy(y_left)
            right_metric = self._entropy(y_right)

        weighted_metric: float = (
            len(y_left) / len(y) * left_metric
            + len(y_right) / len(y) * right_metric
        )
        return parent_metric - weighted_metric

    def _bootstrap_sample(self, X: pd.DataFrame, y: pd.Series, n_samples: Optional[int] = None,
                          random_state: Optional[int] = None) -> Tuple[pd.DataFrame, pd.Series]:
        """
        Generate a bootstrap sample from the dataset.

        Args:
            X: Input features.
            y: Target labels.
            n_samples: Samples to draw (default: dataset size).
            random_state: Random seed.

        Returns:
            Bootstrapped (X, y) tuple.
        """
        rng = np.random.RandomState(random_state)
        if n_samples is None:
            n_samples = len(X)
        # Draw from the seeded generator so that `random_state` actually controls the sample.
        indices = rng.choice(len(X), size=n_samples, replace=True)
        return X.iloc[indices], y.iloc[indices]

    def _best_split(self, X: NDArray[np.float64], y: NDArray[np.int16]) -> Optional[Dict[str, Any]]:
        """
        Find the best split for a dataset.

        The splitting criterion, feature names and number of candidate features are read from
        `self.metric`, `self.feature_names` and `self.max_features` (None means √total_n_features;
        an int is capped at total_n_features).

        Args:
            X: Input features (array of shape [n_samples, total_n_features]).
            y: Labels (array of shape [n_samples]).

        Returns:
            Dictionary describing the best split, or None if no valid split was found, with keys:
                - 'feature_index' : Index of the feature used for the split.
                - 'feature_name': Name or index of the feature.
                - 'threshold' : Threshold value for the split.
        """

        best_info_gain = float('-inf')
        best_split = None
        total_n_features = X.shape[1]

        if isinstance(self.max_features, int):  # if max_features is int
            selected_n_features = self.max_features if self.max_features <= total_n_features else total_n_features
        else:  # Default = √total_n_features
            selected_n_features = int(np.sqrt(total_n_features))

        selected_features_idx = np.random.choice(
            a=total_n_features, size=selected_n_features, replace=False)

        # Iterate over randomly selected features.
        for feature in selected_features_idx:
            # Iterate over all unique thresholds for each random feature.
            thresholds = np.unique(X[:, feature])
            for threshold in thresholds:
                # Split the data into left and right subsets based on the threshold.
                left_mask = X[:, feature] <= threshold
                right_mask = X[:, feature] > threshold

                # Skip invalid splits.
                if left_mask.sum() < self.min_samples_leaf or right_mask.sum() < self.min_samples_leaf:
                    continue

                # Compute IG.
                info_gain = self._information_gain(
                    y, y[left_mask], y[right_mask])

                # Update `best_info_gain` if `info_gain` > `best_info_gain`.
                if info_gain > best_info_gain:
                    best_info_gain = info_gain
                    best_split = {
                        'feature_index': int(feature),
                        'feature_name': self.feature_names[feature] if self.feature_names is not None else feature,
                        'threshold': float(threshold),
                    }

        return best_split

    def _build_tree(self, X: pd.DataFrame, y: pd.Series, depth: int = 0) -> Dict[str, Any]:
        """
        Recursively build a decision tree.

        Args:
            X: Input features.
            y: Target labels.
            depth: Current tree depth.

        Returns:
            Tree structure dictionary.
        """

        # Convert to numpy arrays
        X_np = X.to_numpy() if hasattr(X, 'to_numpy') else np.array(X)
        y_np = y.to_numpy().flatten() if hasattr(
            y, 'to_numpy') else np.array(y).flatten()

        # Stopping conditions
        if len(np.unique(y_np)) == 1 or (self.max_depth is not None and depth == self.max_depth):
            return {'type': 'leaf', 'value': int(np.argmax(np.bincount(y_np)))}

        if len(y) < self.min_samples_leaf:
            return {'type': 'leaf', 'value': int(np.argmax(np.bincount(y_np)))}

        # Find best split
        split = self._best_split(X_np, y_np)
        if not split:
            return {'type': 'leaf', 'value': int(np.argmax(np.bincount(y_np)))}

        # Apply split
        feature_idx = split['feature_index']
        left_mask = X_np[:, feature_idx] <= split['threshold']
        right_mask = X_np[:, feature_idx] > split['threshold']

        # Recursive tree building
        left_tree = self._build_tree(
            X.iloc[left_mask] if hasattr(X, 'iloc') else X[left_mask],
            y.iloc[left_mask] if hasattr(y, 'iloc') else y[left_mask],
            depth + 1
        )
        right_tree = self._build_tree(
            X.iloc[right_mask] if hasattr(X, 'iloc') else X[right_mask],
            y.iloc[right_mask] if hasattr(y, 'iloc') else y[right_mask],
            depth + 1
        )

        return {
            'type': 'node',
            'feature': split['feature_name'],
            'threshold': split['threshold'],
            'left': left_tree,
            'right': right_tree
        }

    def fit(self, X: pd.DataFrame, y: pd.Series) -> None:
        """
        Train the random forest on input data.

        Args:
            X: Training features.
            y: Training labels.
        """
        # Store feature names
        # Reset to None for inputs without column names (e.g. plain arrays).
        self.feature_names = X.columns.tolist() if hasattr(X, 'columns') else None

        # Set random seeds for reproducibility
        if self.random_state is not None:
            np.random.seed(self.random_state)
        seeds = np.random.randint(0, 10000, size=self.n_estimators)

        # Build trees in parallel
        self.forest = Parallel(n_jobs=self.n_jobs)(
            delayed(self._build_single_tree)(X, y, seed)
            for seed in seeds
        )

    def _build_single_tree(self, X: pd.DataFrame, y: pd.Series,
                           seed: int) -> Dict[str, Any]:
        """
        Build a single decision tree with bootstrap sampling.
        """
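        # Seed NumPy's global generator in this worker so that the feature
        # sub-sampling in _best_split is reproducible for a given seed.
        np.random.seed(seed)
        X_boot, y_boot = self._bootstrap_sample(X, y, random_state=seed)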
        return self._build_tree(X_boot, y_boot)

    def _traverse_tree(self, x: np.ndarray,
                       tree: Dict[str, Any]) -> int:
        """
        Traverse a tree to make a prediction for a single sample.

        Args:
            x: Input sample (1D array).
            tree: Decision tree structure.

        Returns:
            Predicted label.
        """
        if tree['type'] == 'leaf':
            return tree['value']

        # Resolve feature index
        if self.feature_names is not None:
            feature_index = self.feature_names.index(tree['feature'])
        else:
            feature_index = tree['feature']  # Assume integer index

        if x[feature_index] <= tree['threshold']:
            return self._traverse_tree(x, tree['left'])
        else:
            return self._traverse_tree(x, tree['right'])

    def predict(self,
                X: pd.DataFrame | NDArray[np.float64]) -> NDArray[np.int16]:
        """
        Predict labels for input data using majority voting.

        Args:
            X: Input features (DataFrame or array)

        Returns:
            Predicted labels (1D array)
        """
        if self.forest is None:
            raise RuntimeError("Model not trained. Call fit() first.")

        # Convert to numpy array
        X_np = X.to_numpy() if hasattr(X, 'to_numpy') else np.array(X)

        # Single sample case
        if len(X_np.shape) == 1:
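            tree_preds = [self._traverse_tree(
                X_np, tree) for tree in self.forest]
            # Majority vote via bincount. This avoids indexing the result of
            # scipy.stats.mode, whose shape for 1D input differs between SciPy versions.
            return int(np.bincount(tree_preds).argmax())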

        # Batch predictions
        all_preds = np.zeros((len(self.forest), len(X_np)), dtype=int)
        for i, tree in enumerate(self.forest):
            all_preds[i] = [self._traverse_tree(x, tree) for x in X_np]

        majority_vote, _ = mode(all_preds, axis=0)
        return majority_vote.flatten()
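
The batch branch of `predict` stacks one row of predictions per tree and takes the most frequent label in each column. The sketch below shows the same idea on a toy array using only NumPy. `majority_vote` is a hypothetical standalone helper, not a method of `CustomRandomForest`, and it assumes that labels are non-negative integers.

```python
import numpy as np


def majority_vote(all_preds: np.ndarray) -> np.ndarray:
    """Column-wise majority vote over an (n_trees, n_samples) array of integer labels."""
    n_classes = int(all_preds.max()) + 1
    # counts[c, j] = number of trees that predicted class c for sample j
    counts = np.stack([(all_preds == c).sum(axis=0) for c in range(n_classes)])
    # argmax returns the first maximum, so ties go to the smallest label
    return counts.argmax(axis=0)


# Toy predictions: 3 trees (rows) voting on 4 samples (columns).
all_preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
])
print(majority_vote(all_preds))
```

With an even number of trees, ties are possible; in this sketch they resolve to the smallest label.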

In [90]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Initialise and train
rf = CustomRandomForest(n_estimators=100, max_depth=15,
                        n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

# Predict
y_pred = rf.predict(X_test)
# Evaluate
acc_custom, prec_custom, rec_custom, f1_custom, cm_custom = evaluate(
    y_test, y_pred)
print(f'Accuracy (Custom): {acc_custom:.4f}')
print(f'Precision (Custom): {prec_custom:.4f}')
print(f'Recall (Custom): {rec_custom:.4f}')
print(f'F1-Score (Custom): {f1_custom:.4f}')
print(f'Confusion Matrix (Custom):\n{cm_custom}')

Accuracy (Custom): 0.9558
Precision (Custom): 0.9548
Recall (Custom): 0.9502
F1-Score (Custom): 0.9524
Confusion Matrix (Custom):
[[39  3]
 [ 2 69]]
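
As a sanity check, the reported figures can be re-derived from the confusion matrix alone. The sketch below assumes that rows are true labels and columns are predicted labels, and that the reported precision, recall and F1-score are unweighted means over the two classes. The variable names are chosen for this illustration so that they do not shadow the notebook's metric functions.

```python
import numpy as np

# Confusion matrix copied from the output above (assumed layout: rows = true, columns = predicted).
cm = np.array([[39, 3],
               [2, 69]])

tp = np.diag(cm)                            # correctly classified samples per class
precision_per_class = tp / cm.sum(axis=0)   # TP / (TP + FP): column sums
recall_per_class = tp / cm.sum(axis=1)      # TP / (TP + FN): row sums
f1_per_class = (2 * precision_per_class * recall_per_class
                / (precision_per_class + recall_per_class))
accuracy_from_cm = tp.sum() / cm.sum()

print(f"Accuracy:  {accuracy_from_cm:.4f}")
print(f"Precision: {precision_per_class.mean():.4f}")
print(f"Recall:    {recall_per_class.mean():.4f}")
print(f"F1-Score:  {f1_per_class.mean():.4f}")
```

Under these assumptions the four values agree with the output above to four decimal places.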


## 14. Comparison with Scikit-Learn

In [91]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Predict on test data
y_pred = clf.predict(X_test)

# Evaluate accuracy
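# Stored as `acc_sklearn` so that the custom `accuracy` function used by `evaluate` is not overwritten.
acc_sklearn = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc_sklearn:.2f}")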

Accuracy: 0.96
