# Light Gradient Boosting Machine from Scratch
***
## Table of Contents
1. [Introduction](#1-introduction)
    - [Advantages](#advantages)
    - [Limitations](#limitations)
    - [Steps](#steps)
1. [Loading Data](#2-loading-data)
1. [Loss Function](#3-loss-function)
    - [Regression](#regression)
    - [Binary Classification](#binary-classification)
    - [Multi-class Classification](#multi-class-classification)
1. [Initialising Model](#4-initialising-model)
1. [Gradient and Hessian Computation](#5-gradient-and-hessian-computation)
1. [Histogram Binning](#6-histogram-binning)
1. [Finding the Best Split](#7-finding-the-best-split)
    - [Regularised Objective Function](#regularised-objective-function)
    - [Second-Order Taylor Expansion](#second-order-taylor-expansion)
    - [Regularisation Term](#regularisation-term)
    - [Total Objective Function](#total-objective-function)
    - [Optimising Leaf Weights](#optimising-leaf-weights)
    - [Histogram-Based Split Finding](#histogram-based-split-finding)
1. [Building Trees](#8-building-trees)
1. [Predictions (Trees)](#9-predictions-trees)
1. [Training Model](#10-training-model)
1. [Final Predictions](#11-final-predictions)
1. [Evaluation Metrics](#12-evaluation-metrics)
    - [Binary Confusion Matrix](#binary-confusion-matrix)
    - [Multi-Class Confusion Matrix](#multi-class-confusion-matrix)
    - [Accuracy](#accuracy)
    - [Precision](#precision)
    - [Recall](#recall)
    - [F1-Score](#f1-score)
1. [Comparison with LightBGM](#13-comparison-with-lightgbm)
1. [References](#14-references)
***

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from numpy.typing import NDArray

## 1. Introduction

**Light Gradient Boosting Machine (LightGBM)** is a gradient boosting framework optimised for speed and scalability, making it highly suitable for regression and classification tasks involving large datasets. LightGBM uses **gradient-based one-sided sampling (GOSS)** and **exclusive feature bundling (EFB)** to speed up training and handle high-dimensional data efficiently. Traditional gradient boosting and XGBoost grow trees level-wise (depth-wise), whereas LightGBM grows trees leaf-wise. At each step, LightGBM splits the leaf with the largest loss reduction, which can lead to deeper, more complex trees and often higher accuracy, but also a higher risk of overfitting if not properly regularised.


**Gradient Boosting Machine (GBM)** is an ensemble machine learning model that builds a strong predictive model by sequentially combining multiple weak models (typically decision trees) in a stage-wise manner. The core idea is to iteratively add new models that correct the errors made by the existing ensemble, thereby improving overall predictive accuracy.

Suppose we have a dataset ${(x_i, y_i)}^n_{i=1}$ where $x_i$ are the features and $y_i$ are the target values. The goal of gradient boosting is to find a function $F(x)$ that minimises a given differentiable loss function $L(y, F(x))$:

\begin{align*}
    F(x) = F_0(x) + \sum^{M}_{m=1}\gamma_m h_m(x)
\end{align*}

where:
- $F_0(x)$: Initial model (e.g., the mean of $y$).
- $\gamma_i$: Weight (step size) for the $m$-th weak learner, typically determined by minimising the loss function along the direction of $h_m(x)$.
- $M$: Number of boosting iterations (e.g., the number of weak learners).
- $h_m$: prediction from the $m$-th weak learner (e.g., a decision tree).


### Advantages
- The histogram-based approach and leaf-wise growth strategy make LightGBM significantly faster and more memory-efficient, especially on large datasets.
- The leaf-wise spliting often leads to better accuracy, as it focuses on reducing the largest errors at each iteration.
- Efficient binning and feature bundling reduce memory usage.

### Limitations
- Overgitting risk due to the leaf-wise tree growth.
- Sensitive to hypermarameters (e.g., learning rate, number of leaves, regularisation parameter).

### Steps
1. Initialise the model:
    - Start with a simple model, typically a constant value:
        - For regression: the mean of the target variable.
        - For binary classification: the log-odds of the positive classes.
1. Calculate residuals (Negative Gradients) and Hessian:
    - For each iteration, compute the the negative gradients (and Hessian for second-order methods) of the loss function with respect to the current predictions.
1. Fit a new weak model to predict the residuals:
    - Train a weak learner (decision tree) to predict the rediduals.
    - LightGBM grows the tree by splitting the leaf with the largest loss reduction (max delta loss), rather than level-wise as in traditional boosting.
1. Update the model:
    - Add the predictions from the new weak learners to the current model's predictions, scaled by a learning rate (shrinkage parameter).
    - The update corrects the errors made by the current ensemble
1. Repeat steps 2-4 for a pre-defined number of iterations.
1. Final prediction:
    - The sum of the initial prediction and the scaled outputs of all weak learners forms the final model.

## 2. Loading Data
Retrieved from [GitHub - YBI Foundation](https://github.com/YBI-Foundation/Dataset/blob/main/Admission%20Chance.csv)

In [22]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/YBI-Foundation/Dataset/refs/heads/main/Admission%20Chance.csv"
)
df.head()

Unnamed: 0,Serial No,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [23]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
feature_names = df.columns[:-1].tolist()  # All columns except the last one

# Check the shape of the data
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Features: \n{feature_names}")

Features shape: (400, 8)
Target shape: (400,)
Features: 
['Serial No', 'GRE Score', 'TOEFL Score', 'University Rating', ' SOP', 'LOR ', 'CGPA', 'Research']


## 3. Loss Function
In LightGBM, the loss function is used to compute the training loss, which is then combined with regularisation terms to construct the overall objective function that is minimised during training. The gradients and Hessians used to fit new trees are calculated with respect to the training loss, but the final optimisation process incorporates both the loss and the regularisation components.

### Regression
The most common loss function for regression is the mean squared error (MSE):

\begin{align*}
    l(y, \hat y) = \dfrac{1}{n}\sum^{n}_{i=1}(y_i - \hat y_i)^2
\end{align*}

where:
- $y_i$: True value.
- $\hat y_i$: Predicted value.
- $n$: Number of samples.

### Binary Classification
The standard loss function for binary classification in XGBoost is the binary cross-entropy (log loss):

\begin{align*}
    l(y, \hat p) = - \dfrac{1}{n} \sum^{n}_{i=1}\left[y_i \log{(\hat p_i) + (1-y_i) \log{(1- \hat p_i)}} \right]
\end{align*}

where:
- $y_i$: True label ($0$ or $1$).
- $\hat p_i$: Predicted probability for class 1 (after applying the sigmoid function to the raw score).
- $n$: Number of samples.

### Multi-class Classification
For multi-class classification (with $K$ classes), XGBoost uses the multi-class cross-entropy (also called softmax loss):

\begin{align*}
    l(y, \hat P) = - \dfrac{1}{n} \sum^{n}_{i=1} \sum^{K}_{k=1} \mathbb{I}(y_i = k) \log{(\hat p_{ik})}
\end{align*}

where:
- $y_i$: True class label for sample $i$.
- $\hat p_i$: Predicted probability that sample $i$ belongs to class $k$ (output of the softmax function).
- $\mathbb{I}(y_i = k)$: Indicator function, equal to $1$ if $y_i = k$ and $0$ otherwise.
- $n$: Number of samples.

## 4. Initialising Model
First, we need to initialise the model with a constant function that minimises the loss (initial predictions).

\begin{align*}
    F_0(x) = \arg \min_{\gamma}\sum^{n}_{i=1}L(y_i, \gamma)
\end{align*}

For squared error, the best constant is the mean of the target values. Thus,

\begin{align*}
    F_0(x) = \bar y = \dfrac{1}{n}\sum^{n}_{i=1}y_i
\end{align*}

In [24]:
def initialise_model(y: NDArray[np.float64]) -> NDArray[np.float64]:
    """
    Initialise predictions with the mean of the target values.

    Parameters:
        y: Target values, shape (n_samples,).

    Returns:
        Array of initial predictions, each set to mean of y.
    """
    return np.full_like(y, np.mean(y), dtype=float)

## 5. Gradient and Hessian Computation
We will need the first derivative (gradient) and the second derivative (Hessian) of the loss function for each sample $i$. They will be used later in the algorithm.

- $g_i = \dfrac{\partial l(y_i, \hat y_i)}{\partial \hat y_i} = \dfrac{\partial}{\partial \hat y_i} \dfrac{1}{2}(y_i-\hat y_i)^2 = \hat y_i - y_i$
- $h_i = \dfrac{\partial^2 l(y_i, \hat y_i)}{\partial \hat y_i ^2} = 1$

In [None]:
def compute_gradients_and_hessians(
    y_true: NDArray[np.float64], y_pred: NDArray[np.float64]
) -> tuple[NDArray[np.float64], NDArray[np.float64]]:
    """
    Compute gradients and Hessians for squared error loss.

    Args:
        y_true: True target values, shape (n_samples,).
        y_pred: Predicted values, shape (n_samples,).

    Returns:
        A tuple with gradients and Hessians, both of shape (n_samples,).
    """
    gradients = y_pred - y_true
    hessians = np.ones_like(y_true)
    return gradients, hessians

## 6. Histogram Binning
The `bin_features()` function discretises continuous feature values in a dataset into a fixed number of bins using a process called **histogram binning**. This is a key step in histogram-based gradient boosting algorithms such as LightGBM, which use binned features to accelerate split finding and reduce memory usage.

For each feature (column) $j$:
- Extract the column: $\text{col} = X \left[:, j\right]$

- Compute bin edges:
    - `np.linspace()` creates `n_bins` equally spaced intervals (bins) between the minimum and maximum values of the feature. Mathematically, for feature $j$, the bin edges are:
    $$
    b_0 = \min{(x_{:, j})}, b_{\text{n\textunderscore bins}} = \max{(x_{:, j})}
    $$

- Store bin edges:
    - The bin edges for each feature are appended to the `bins` list for later use.

- Digitise feature values:
    - Each value in the feature column is assigned a bin index (from $0$ to $\text{n\textunderscore bins} - 1$), indicating which interval it falls into.

Suppose we have a sample dataset with two continuous features and five samples:

| Sample | Feature 1 | Feature 2 |
|--------|-----------|-----------|
|   1    |   2.1     |   8.5     |
|   2    |   3.4     |   7.3     |
|   3    |   1.8     |   6.9     |
|   4    |   2.9     |   9.1     |
|   5    |   3.0     |   7.8     |

Assume we want to bin each feature into **3 bins**.

**Step 1: Compute Bin Edges**
For each feature, bin edges are calculated using equally spaced intervals between the minimum and maximum values.
- Feature 1: $\min = 1.8$, $\max = 3.4$
Bin edges: $\left[1.8, 2.3333..., 2.866..., 3.4\right]$

- Feature 2: $\min = 6.9$, $\max = 9.1$
Bin edges: $\left[6.9, 7.633..., 8.366..., 9.1\right]$

| Feature   | Min   | Max   | Bin Edges                       |
|-----------|-------|-------|---------------------------------|
| Feature 1 | 1.8   | 3.4   | 1.8, 2.333..., 2.866..., 3.4    |
| Feature 2 | 6.9   | 9.1   | 6.9, 7.633..., 8.366..., 9.1    |



**Step 2: Assign Bin Indices**
Each value is assigned a bin index ($0$, $1$, or $2$) based on which interval it falls into.

| Sample | Feature 1 | Bin Index (F1) | Feature 2 | Bin Index (F2) |
|--------|-----------|----------------|-----------|----------------|
|   1    |   2.1     |      0         |   8.5     |      2         |
|   2    |   3.4     |      2         |   7.3     |      0         |
|   3    |   1.8     |      0         |   6.9     |      0         |
|   4    |   2.9     |      1         |   9.1     |      2         |
|   5    |   3.0     |      2         |   7.8     |      1         |


**Step 3: Binned Feature Matrix**
The resulting binned feature matrix (each value is the bin index):

| Sample | Feature 1 (Binned) | Feature 2 (Binned) |
|--------|--------------------|--------------------|
|   1    |         0          |         2          |
|   2    |         2          |         0          |
|   3    |         0          |         0          |
|   4    |         1          |         2          |
|   5    |         2          |         1          |

This transformation enables efficient histogram-based split finding as the algorithm only needs to consider splits at bin boundaries rather than every unique feature values.

In [None]:
def bin_features(
    X: np.ndarray, n_bins: int = 255
) -> tuple[np.ndarray, list[np.ndarray]]:
    """
    Discretise continuous features into bins using histogram binning.

    Args:
        X: Feature matrix, shape (n_samples, n_features).
        n_bins: Number of bins to discretise each feature.

    Returns:
        Tuple containing:
            - X_binned: Binned feature matrix of same shape as X, with bin indices.
            - bins: List of bin edges for each feature.
    """
    bins = []
    X_binned = np.zeros_like(X, dtype=np.uint8)
    for j in range(X.shape[1]):
        col = X[:, j]
        bin_edges = np.linspace(col.min(), col.max(), n_bins + 1)
        bins.append(bin_edges)
        X_binned[:, j] = np.digitize(col, bin_edges) - 1
    return X_binned, bins

## 7. Finding the Best Split
In LightGBM, the process of finding the best split is mathematically similar to XGBoost, as both are based on gradient boosting with second-order Taylor expansion. However, LightGBM introduces several algorithmic innovations, particularly the use of histogram-based split finding and a leaf-wise tree growth strategy. Below is a detailed explanation, with a focus on the LightGBM approach.

### Regularised Objective Function
For a tree at boosting iteration $t$, the regularised objective function is:

\begin{align*}
    \mathcal{L}^{(t)} = \sum^{n}_{i=1}l(y_i, \hat y_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)
\end{align*}

where:
- $l$: Loss function (e.g., mean squared error).
- $\hat y_i^{(t-1)}$: Prediction from previous trees.
- $f_t$: New tree.
- $\Omega(f_t)$: Regularisation term.

### Second-Order Taylor Expansion
LightGBM, like XGBoost, uses a second-order Taylor expansion of the loss function around the current prediction:

\begin{align*}
    l(y_i, \hat y_i^{(t-1)} + f_t(x_i)) \approx l(y_i, \hat y_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)
\end{align*}

where:
- $g_i = \dfrac{\partial l(y_i, \hat y_i)}{\partial \hat y_i}$
- $h_i = \dfrac{\partial^2 l(y_i, \hat y_i)}{\partial \hat y_i ^2}$

Assume the new tree $f_t$ assigns a constant score $w_j$ to all samples in leaft $j$:
\begin{align*}
    f_t(x_i) = w_{q(x_i)}
\end{align*}

where $q(x_i)$ maps sample $i$ to its leaf.

### Regularisation Term
LightGBM typically uses only L2 regularisation for leaf weights in most practical scenarios:

\begin{align*}
    \Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum^{T}_{j=1}w_j^2
\end{align*}

where:
- $T$: Number of leaves.
- $\gamma$: Penalty for the number of leaves (tree complexity).
- $\lambda$ L2 regularisation parameter.

### Total Objective Function

\begin{align*}
    \mathcal{\tilde L}^{(t)} = \sum^{T}_{j=1} \left[
        G_j w_j + \frac{1}{2}(H_j+\lambda)w_j^2
        \right] + \gamma T
\end{align*}

where:
- $G_j = \sum_{i \in I_j} g_i$: Sum of gradients in leaf $j$.
- $H_j = \sum_{i \in I_j} h_i$: Sum of Hessians in leaf $j$.

### Optimising Leaf Weights

To find the optimal leaf weight, minimise $\mathcal{\tilde L^{(t)}}$ with respect to $w_j$:

\begin{align*}
    \dfrac{\partial \mathcal{\tilde L^{(t)}}}{\partial w_j} = G_j + (H_j + \lambda)w_j = 0 \rightarrow w_j^* = - \dfrac{G_j}{H_j + \lambda}
\end{align*}

Plug $w_j^*$ back into the objective:

\begin{align*}
    \mathcal{\tilde L}^{(t)} = - \dfrac{1}{2} \sum^{T}_{j=1} 
        \dfrac{G_j^2}{H_j + \lambda}
         + \gamma T
\end{align*}

Suppose a node is split into left ($L$) and right ($R$) children, the gain is the reduction in the objective:

\begin{align*}
    \text{Gain} = \dfrac{1}{2} \left( 
        \dfrac{G^2_L}{H_L + \lambda} + \dfrac{G^2_R}{H_R + \lambda} - \dfrac{(G_L + G_R)^2}{H_L + H_R + \lambda}
        \right) - \gamma
\end{align*}

where:
- $G_L, H_L$: Sums for the left child.
- $G_R, H_R$: Sums for the right child.

### Histogram-Based Split Finding

- **Feature Binning**: LightGBM bins continuous feature values into discrete intervals (bins), greatly reducing the number of possible split points.

- **Histogram Construction**: For each feature, LightGBM builds histograms of gradient and Hessian sums for each bin.

- **Split Search**: The gain formula above is evaluated only at bin boundaries, making the process much faster and more memory-efficient.

In [None]:
def best_split_histogram(
    X_binned: NDArray[np.int64],
    gradients: NDArray[np.float64],
    hessians: NDArray[np.float64],
    min_samples_leaf: int,
    lambda_: float,
    gamma: float,
    n_bins: int = 255,
) -> tuple:
    """
    Find the best split for a node using histogram-based split finding.

    Args:
        X_binned: Binned feature matrix of shape (n_samples, n_features).
        gradients: Gradients for each sample, shape (n_samples,).
        hessians: Hessians for each sample, shape (n_samples,).
        min_samples_leaf: Minimum number of samples required in a leaf.
        lambda_: L2 regularisation parameter.
        gamma: Minimum loss reduction required to make a split.
        n_bins: Number of bins per feature.

    Returns:
        Tuple containing:
            - best_feature: Index of the best feature to split.
            - best_bin: Bin index for the best split.
            - best_gain: Gain value for the best split.
    """
    m, n = X_binned.shape
    best_feature, best_bin, best_gain = None, None, -np.inf
    for feature in range(n):
        grad_hist = np.zeros(n_bins)
        hess_hist = np.zeros(n_bins)
        for b in range(n_bins):
            mask = X_binned[:, feature] == b
            grad_hist[b] = gradients[mask].sum()
            hess_hist[b] = hessians[mask].sum()
        G_total, H_total = grad_hist.sum(), hess_hist.sum()
        G_L, H_L = 0.0, 0.0
        for b in range(n_bins - 1):
            G_L += grad_hist[b]
            H_L += hess_hist[b]
            G_R = G_total - G_L
            H_R = H_total - H_L
            if H_L < min_samples_leaf or H_R < min_samples_leaf:
                continue
            gain = (
                0.5
                * (
                    G_L**2 / (H_L + lambda_)
                    + G_R**2 / (H_R + lambda_)
                    - (G_L + G_R) ** 2 / (H_L + H_R + lambda_)
                )
                - gamma
            )
            if gain > best_gain:
                best_feature, best_bin, best_gain = feature, b, gain
    return best_feature, best_bin, best_gain

## 8. Building Trees
The following `build_tree_leafwise()` function implements the **leaf-wise tree growth strategy** for constructing a single decision tree in a gradient boosting ensemble. With this strategy, the algorithm splits the leaf with the highest gain, leading to potentially deeper and more complex trees compared to level-wise (depth-wise) growth. The formula for the optimal leaf value is:

\begin{align*}
    w^* = - \dfrac{\sum g_i}{\sum h_i + \lambda}
\end{align*}

where $g_i$ and $h_i$ are the gradients and Hessians, and $\lambda$ is the L2 regularisation parameter.

The function relies on binned features and pre-computed histograms for efficiency, only evaluating splits at bin boundaries.

In [None]:
def build_tree_leafwise(
    X_binned: NDArray[np.uint8],
    gradients: NDArray[np.float64],
    hessians: NDArray[np.float64],
    max_leaves: int,
    min_samples_leaf: int,
    lambda_: float,
    gamma: float,
    n_bins: int = 255,
) -> list:
    """
    Build a decision tree using the leaf-wise growth strategy.

    Args:
        X_binned: Binned feature matrix of shape (n_samples, n_features).
        gradients: Gradients for each sample, shape (n_samples,).
        hessians: Hessians for each sample, shape (n_samples,).
        max_leaves: Maximum number of leaves in the tree.
        min_samples_leaf: Minimum number of samples required in a leaf.
        lambda_: L2 regularisation parameter.
        gamma: Minimum loss reduction required to make a split.
        n_bins: Number of bins per feature.

    Returns:
        List of leaves representing the tree structure.
    """
    m = X_binned.shape[0]
    leaves = [
        {
            "indices": np.arange(m),
            "depth": 0,
            "parent": None,
            "gain": 0.0,
            "value": -gradients.sum() / (hessians.sum() + lambda_),
        }
    ]
    for _ in range(max_leaves - 1):
        best_gain = -np.inf
        best_split = None
        for leaf_idx, leaf in enumerate(leaves):
            idxs = leaf["indices"]
            if len(idxs) <= min_samples_leaf or "split" in leaf:
                continue
            feature, bin_idx, gain = best_split_histogram(
                X_binned[idxs],
                gradients[idxs],
                hessians[idxs],
                min_samples_leaf,
                lambda_,
                gamma,
                n_bins,
            )
            if gain > best_gain:
                best_gain = gain
                best_split = (leaf_idx, feature, bin_idx, gain)
        if best_split is None or best_gain <= 0:
            break
        leaf_idx, feature, bin_idx, gain = best_split
        idxs = leaves[leaf_idx]["indices"]
        left_mask = X_binned[idxs, feature] <= bin_idx
        right_mask = ~left_mask
        left_indices = idxs[left_mask]
        right_indices = idxs[right_mask]
        leaves[leaf_idx]["split"] = (feature, bin_idx)
        leaves[leaf_idx]["left"] = len(leaves)
        leaves[leaf_idx]["right"] = len(leaves) + 1
        leaves.append(
            {
                "indices": left_indices,
                "depth": leaves[leaf_idx]["depth"] + 1,
                "parent": leaf_idx,
                "gain": gain,
                "value": -gradients[left_indices].sum()
                / (hessians[left_indices].sum() + lambda_),
            }
        )
        leaves.append(
            {
                "indices": right_indices,
                "depth": leaves[leaf_idx]["depth"] + 1,
                "parent": leaf_idx,
                "gain": gain,
                "value": -gradients[right_indices].sum()
                / (hessians[right_indices].sum() + lambda_),
            }
        )
    return leaves

## 9. Predictions (Trees)
The `predict_tree_batch()` function efficiently makes predictions for a batch of samples using a decision tree represented in a specific structure (as constructed by the `build_tree_leafwise()` function). 

In [None]:
def predict_tree_batch(tree: list, X_binned: NDArray[np.int64]) -> NDArray[np.float64]:
    """
    Predict outputs for a batch of samples using a decision tree.

    Args:
        tree: List of leaves representing the tree structure.
        X_binned: Binned feature matrix of shape (n_samples, n_features).

    Returns:
        Predicted values of shape (n_samples,).
    """
    # For each sample, traverse the tree to find the leaf value
    y_pred = np.zeros(X_binned.shape[0])
    for i in range(X_binned.shape[0]):
        node = 0
        while "split" in tree[node]:
            feature, bin_idx = tree[node]["split"]
            if X_binned[i, feature] <= bin_idx:
                node = tree[node]["left"]
            else:
                node = tree[node]["right"]
        y_pred[i] = tree[node]["value"]
    return y_pred

## 10. Training Model
During the training process of LightGBM:
1. Initialise model predictions as $ F_0(x) = \bar y $.
1. Continuous features in $X$ are discretised into a fixed number of bins using histogram binning.
1. Boosting loop from $ m=1 $ to $ M $ (number of boosting rounds `n_estimators`):
    - Compute gradients and hessians.
    - LightGBM splits the leaf with the highest potential gain at each step, rather than growing trees level-by-level as in XGBoost.
    - Fit a tree $ h_m^{(x)} $ to the gradients and hessians, optimising the regularised objective function:

      $$
      h_m^{(x)} = \text{Tree}(X, \{g_i^{(m)}, h_i^{(m)}\})
      $$

      For each candidate split (feature and bin), the gain is computed as:

      $$
      \text{Gain} = \frac{1}{2} \left( 
          \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
          \right) - \gamma
      $$

      where $G_L$, $H_L$ and $G_R$, $H_R$ are the sums of gradients and hessians for the left and right splits respectively.

    - Update predictions.

      The model is updated additively:

      $$
      F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)
      $$

      where $\eta$ is the learning rate.
1. The function returns the initial prediction (mean of $y$), the list of fitted trees, the learning rate and the bin edges for each feature.

In [None]:
def fit(
    X: NDArray[np.float64],
    y: NDArray[np.float64],
    n_estimators: int = 10,
    learning_rate: float = 0.1,
    max_leaves: int = 31,
    min_samples_leaf: int = 20,
    lambda_: float = 1.0,
    gamma: float = 0.0,
    n_bins: int = 255,
) -> tuple[float, list[list], float, list[NDArray[np.float64]]]:
    """
    Fit a LightGBM-style gradient boosting model for regression.

    Args:
        X: Feature matrix of shape (n_samples, n_features).
        y: Target values, shape (n_samples,).
        n_estimators: Number of boosting rounds.
        learning_rate: Learning rate.
        max_leaves: Maximum number of leaves per tree.
        min_samples_leaf: Minimum samples per leaf.
        lambda_: L2 regularisation parameter.
        gamma: Minimum gain required to split.
        n_bins: Number of bins for feature discretisation.

    Returns:
        Tuple containing:
            - initial_prediction: Initial prediction (mean of y).
            - models: List of fitted trees.
            - learning_rate: Learning rate used.
            - bins: List of bin edges for each feature.
    """
    models = []
    X_binned, bins = bin_features(X, n_bins)
    y_pred = initialise_model(y)
    for m in range(n_estimators):
        gradients, hessians = compute_gradients_and_hessians(y, y_pred)
        tree = build_tree_leafwise(
            X_binned,
            gradients,
            hessians,
            max_leaves,
            min_samples_leaf,
            lambda_,
            gamma,
            n_bins,
        )
        update = predict_tree_batch(tree, X_binned)
        y_pred += learning_rate * update
        models.append(tree)
    return np.mean(y), models, learning_rate, bins

## 11. Final Predictions
The final predictions after $M$ trees is:
\begin{align*}
    F_M(x) = F_0(x) + \sum^{M}_{m=1} \eta \cdot h_m(x)
\end{align*}

where:
- $F_0(x)$: Initial prediction (mean of $y$).
- $h_m(x)$: Prediction of the $m$-th tree.
- $\eta$: Learning rate.

In [None]:
def predict(
    X: NDArray[np.float64],
    initial_prediction: float,
    models: list[list],
    learning_rate: float,
    bins: list[NDArray[np.float64]],
) -> NDArray[np.float64]:
    """
    Predict outputs for a batch of samples using the fitted LightGBM model.

    Args:
        X: Feature matrix of shape (n_samples, n_features).
        initial_prediction: Initial prediction (mean of y from training).
        models: List of fitted trees.
        learning_rate: Learning rate.
        bins: List of bin edges for each feature.

    Returns:
        Predicted values of shape (n_samples,).
    """
    X_binned = np.zeros_like(X, dtype=np.uint8)
    for j in range(X.shape[1]):
        X_binned[:, j] = np.digitize(X[:, j], bins[j]) - 1
    y_pred = np.full(X.shape[0], initial_prediction, dtype=float)
    for tree in models:
        update = predict_tree_batch(tree, X_binned)
        y_pred += learning_rate * update
    return y_pred

In [32]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit model
init_pred, models, lr, bins = fit(
    X_train,
    y_train,
    n_estimators=20,
    learning_rate=0.2,
    max_leaves=16,
    min_samples_leaf=5,
    lambda_=1.0,
    gamma=0.0,
    n_bins=64,
)

# Predict
y_pred = predict(X_test, init_pred, models, lr, bins)

## 12. Evaluation Metrics
### Mean Squared Error (MSE)
Mean Squared Error measures the average squared difference between predicted ($\hat y$) and actual ($y$) values. Large errors are penalised heavily. Smaller MSE indicates better predictions.

\begin{align*}
MSE = \dfrac{1}{n} \sum_{i=1}^{n}(\hat y_{i} - y_{i})^2
\end{align*}

In [33]:
def calculate_MSE(y_true: NDArray[np.float64], y_pred: NDArray[np.float64]) -> float:
    return np.mean((y_true - y_pred) ** 2)

### Root Mean Squared Error (RMSE)
Square root of MSE. It provides error in the same unit as the target variable ($y$) and easier to interpret.

\begin{align*}
RMSE = \sqrt{(MSE)}
\end{align*}

In [34]:
def calculate_RMSE(y_true: NDArray[np.float64], y_pred: NDArray[np.float64]) -> float:
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

### Mean Absolute Error (MAE)
Mean Absolute Error measures the average absolute difference between predicted ($\hat y$) and actual ($y$) values. It is less sensitive to outliers than MSE. Smaller MAE indicates better predictions.

\begin{align*}
MAE = \dfrac{1}{n} \sum_{i=1}^{n}|\hat y_{i} = y_{i}|
\end{align*}

In [35]:
def calculate_MAE(y_true: NDArray[np.float64], y_pred: NDArray[np.float64]) -> float:
    return np.mean(np.abs(y_true - y_pred))

<a id="r-squared"></a>
### R-Squared($R^2$)

R-squared indicated the proportion of variance in the dependent variable that is predictable from the independent variables. Value ranges from 0 to 1. Closer to 1 indicates a better fit.



Residual Sum of Squares ($SS_{residual}$): 
\begin{align*}
SS_{residual} = \sum_{i=1}^{n} (y_{i} - \hat y_{i})^{2}
\end{align*}

Total Sum of Squares ($SS_{total}$): 
\begin{align*}
SS_{total} = \sum_{i=1}^{n} (y_{i} - \bar y_{i})^{2}
\end{align*}

$R^2$ is computed as:

\begin{align*}

R^2 = 1 - \dfrac{SS_{residual}}{SS_{total}} = 1 - \dfrac{\sum_{i=1}^{n} (y_{i} - \hat y_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar y_{i})^{2}}

\end{align*}

where:

$y$: Actual target values.

$\bar y$: Mean of the actual target values.

$\hat y$: Precicted target values.

In [36]:
def calculate_r2(y_true: NDArray[np.float64], y_pred: NDArray[np.float64]) -> float:
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
    ss_residual = np.sum((y_true - y_pred) ** 2)
    r2 = 1 - (ss_residual / ss_total)
    return r2

In [None]:
def evaluate(
    y_true: NDArray[np.float64], y_pred: NDArray[np.float64]
) -> tuple[float, float, float, float]:
    """
    Calculate and return evaluation metrics for a regression model, including MSE, RMSE, MAE, and R-squared.

     Args:
        y_true: True labels.
        y_pred: Predicted labels.
        class_names: List of class names. Defaults to None.
    Returns:
        - mse: Mean Squared Error (MSE), indicating the average of the squared differences between predicted and true values.
        - rmse: Root Mean Squared Error (RMSE), indicating the standard deviation of the residuals.
        - mae: Mean Absolute Error (MAE), representing the average absolute difference between predicted and true values.
        - r2: R-squared (coefficient of determination), showing the proportion of variance in the dependent variable that is predictable from the independent variable(s).
    """
    mse = calculate_MSE(y_true, y_pred)
    rmse = calculate_RMSE(y_true, y_pred)
    mae = calculate_MAE(y_true, y_pred)
    r2 = calculate_r2(y_true, y_pred)
    return mse, rmse, mae, r2

In [None]:
class CustomLightGBM:
    """
    LightGBM-style Gradient Boosting Regressor.
    """

    def __init__(
        self,
        n_estimators: int = 10,
        learning_rate: float = 0.1,
        max_leaves: int = 31,
        min_samples_leaf: int = 20,
        lambda_: float = 1.0,
        gamma: float = 0.0,
        n_bins: int = 255,
    ):
        """
        Initialise the CustomLightGBM regressor.

        Args:
            n_estimators: Number of boosting rounds.
            learning_rate: Learning rate.
            max_leaves: Maximum number of leaves per tree.
            min_samples_leaf: Minimum samples per leaf.
            lambda_: L2 regularisation parameter.
            gamma: Minimum gain required to split.
            n_bins: Number of bins for feature discretisation.
        """
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_leaves = max_leaves
        self.min_samples_leaf = min_samples_leaf
        self.lambda_ = lambda_
        self.gamma = gamma
        self.n_bins = n_bins
        self.models: list[list] = []
        self.bins: list[NDArray[np.float64]] = []
        self.initial_prediction: float = 0.0

    def _bin_features(
        self, X: NDArray[np.float64]
    ) -> tuple[NDArray[np.uint8], list[NDArray[np.float64]]]:
        """
        Discretise continuous features into bins using histogram binning.

        Args:
            X: Feature matrix of shape (n_samples, n_features).

        Returns:
            Tuple containing:
                - X_binned: Binned feature matrix of same shape as X, with bin indices.
                - bins: List of bin edges for each feature.
        """
        bins = []
        X_binned = np.zeros_like(X, dtype=np.uint8)
        for j in range(X.shape[1]):
            col = X[:, j]
            bin_edges = np.linspace(col.min(), col.max(), self.n_bins + 1)
            bins.append(bin_edges)
            X_binned[:, j] = np.digitize(col, bin_edges) - 1
        return X_binned, bins

    def _best_split_histogram(
        self,
        X_binned: NDArray[np.uint8],
        gradients: NDArray[np.float64],
        hessians: NDArray[np.float64],
    ) -> tuple:
        """
        Find the best split for a node using histogram-based split finding.

        Args:
            X_binned: Binned feature matrix of shape (n_samples, n_features).
            gradients: Gradients for each sample, shape (n_samples,).
            hessians: Hessians for each sample, shape (n_samples,).

        Returns:
            Tuple containing:
                - best_feature: Index of the best feature to split.
                - best_bin: Bin index for the best split.
                - best_gain: Gain value for the best split.
        """
        m, n = X_binned.shape
        best_feature, best_bin, best_gain = None, None, -np.inf
        for feature in range(n):
            grad_hist = np.zeros(self.n_bins)
            hess_hist = np.zeros(self.n_bins)
            for b in range(self.n_bins):
                mask = X_binned[:, feature] == b
                grad_hist[b] = gradients[mask].sum()
                hess_hist[b] = hessians[mask].sum()
            G_total, H_total = grad_hist.sum(), hess_hist.sum()
            G_L, H_L = 0.0, 0.0
            for b in range(self.n_bins - 1):
                G_L += grad_hist[b]
                H_L += hess_hist[b]
                G_R = G_total - G_L
                H_R = H_total - H_L
                if H_L < self.min_samples_leaf or H_R < self.min_samples_leaf:
                    continue
                gain = (
                    0.5
                    * (
                        G_L**2 / (H_L + self.lambda_)
                        + G_R**2 / (H_R + self.lambda_)
                        - (G_L + G_R) ** 2 / (H_L + H_R + self.lambda_)
                    )
                    - self.gamma
                )
                if gain > best_gain:
                    best_feature, best_bin, best_gain = feature, b, gain
        return best_feature, best_bin, best_gain

    def _build_tree_leafwise(
        self,
        X_binned: NDArray[np.uint8],
        gradients: NDArray[np.float64],
        hessians: NDArray[np.float64],
    ) -> list:
        """
        Build a decision tree using the leaf-wise growth strategy.

        Args:
            X_binned: Binned feature matrix of shape (n_samples, n_features).
            gradients: Gradients for each sample, shape (n_samples,).
            hessians: Hessians for each sample, shape (n_samples,).

        Returns:
            List of leaves representing the tree structure.
        """
        m = X_binned.shape[0]
        leaves = [
            {
                "indices": np.arange(m),
                "depth": 0,
                "parent": None,
                "gain": 0.0,
                "value": -gradients.sum() / (hessians.sum() + self.lambda_),
            }
        ]
        for _ in range(self.max_leaves - 1):
            best_gain = -np.inf
            best_split = None
            for leaf_idx, leaf in enumerate(leaves):
                idxs = leaf["indices"]
                if len(idxs) <= self.min_samples_leaf or "split" in leaf:
                    continue
                feature, bin_idx, gain = self._best_split_histogram(
                    X_binned[idxs], gradients[idxs], hessians[idxs]
                )
                if gain > best_gain:
                    best_gain = gain
                    best_split = (leaf_idx, feature, bin_idx, gain)
            if best_split is None or best_gain <= 0:
                break
            leaf_idx, feature, bin_idx, gain = best_split
            idxs = leaves[leaf_idx]["indices"]
            left_mask = X_binned[idxs, feature] <= bin_idx
            right_mask = ~left_mask
            left_indices = idxs[left_mask]
            right_indices = idxs[right_mask]
            leaves[leaf_idx]["split"] = (feature, bin_idx)
            leaves[leaf_idx]["left"] = len(leaves)
            leaves[leaf_idx]["right"] = len(leaves) + 1
            leaves.append(
                {
                    "indices": left_indices,
                    "depth": leaves[leaf_idx]["depth"] + 1,
                    "parent": leaf_idx,
                    "gain": gain,
                    "value": -gradients[left_indices].sum()
                    / (hessians[left_indices].sum() + self.lambda_),
                }
            )
            leaves.append(
                {
                    "indices": right_indices,
                    "depth": leaves[leaf_idx]["depth"] + 1,
                    "parent": leaf_idx,
                    "gain": gain,
                    "value": -gradients[right_indices].sum()
                    / (hessians[right_indices].sum() + self.lambda_),
                }
            )
        return leaves

    def _predict_tree_batch(
        self, tree: list, X_binned: NDArray[np.uint8]
    ) -> NDArray[np.float64]:
        """
        Predict outputs for a batch of samples using a decision tree.

        Args:
            tree: List of leaves representing the tree structure.
            X_binned: Binned feature matrix of shape (n_samples, n_features).

        Returns:
            Predicted values of shape (n_samples,).
        """
        y_pred = np.zeros(X_binned.shape[0])
        for i in range(X_binned.shape[0]):
            node = 0
            while "split" in tree[node]:
                feature, bin_idx = tree[node]["split"]
                if X_binned[i, feature] <= bin_idx:
                    node = tree[node]["left"]
                else:
                    node = tree[node]["right"]
            y_pred[i] = tree[node]["value"]
        return y_pred

    def _initialise_model(self, y: NDArray[np.float64]) -> NDArray[np.float64]:
        """Initialises predictions with the mean of the target values.

        Args:
            y: Target values of shape (n_samples,).

        Returns:
            Array of initial predictions, each set to mean of y.
        """
        return np.full_like(y, np.mean(y), dtype=float)

    def _compute_gradients_and_hessians(
        self, y_true: NDArray[np.float64], y_pred: NDArray[np.float64]
    ) -> tuple[NDArray[np.float64], NDArray[np.float64]]:
        """Computes gradients and Hessians for squared error loss.

        Args:
            y_true: True target values, shape (n_samples,).
            y_pred: Predicted values, shape (n_samples,).

        Returns:
            Tuple of gradients and Hessians, both of shape (n_samples,).
        """
        gradients = y_pred - y_true
        hessians = np.ones_like(y_true)
        return gradients, hessians

    def fit(self, X: NDArray[np.float64], y: NDArray[np.float64]) -> None:
        """Fits the CustomLightGBM model to the training data.

        Args:
            X: Feature matrix of shape (n_samples, n_features).
            y: Target values, shape (n_samples,).
        """
        self.models = []
        X_binned, self.bins = self._bin_features(X)
        y_pred = self._initialise_model(y)
        self.initial_prediction = float(np.mean(y))
        for m in range(self.n_estimators):
            gradients, hessians = self._compute_gradients_and_hessians(y, y_pred)
            tree = self._build_tree_leafwise(X_binned, gradients, hessians)
            update = self._predict_tree_batch(tree, X_binned)
            y_pred += self.learning_rate * update
            self.models.append(tree)

    def predict(self, X: NDArray[np.float64]) -> NDArray[np.float64]:
        """Predicts outputs for a batch of samples using the fitted model.

        Args:
            X: Feature matrix of shape (n_samples, n_features).

        Returns:
            Predicted values of shape (n_samples,).
        """
        X_binned = np.zeros_like(X, dtype=np.uint8)
        for j in range(X.shape[1]):
            X_binned[:, j] = np.digitize(X[:, j], self.bins[j]) - 1
        y_pred = np.full(X.shape[0], self.initial_prediction, dtype=float)
        for tree in self.models:
            update = self._predict_tree_batch(tree, X_binned)
            y_pred += self.learning_rate * update
        return y_pred

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Instantiate and fit the model
model_custom = CustomLightGBM(
    n_estimators=100, learning_rate=0.1, max_leaves=20, lambda_=0.01
)
model_custom.fit(X_train, y_train)

# Predict and evaluate
y_pred_custom = model_custom.predict(X_test)
mse_custom, rmse_custom, mae_custom, r2_custom = evaluate(y_test, y_pred_custom)
print(f"MSE (Custom): {mse_custom:.4f}")
print(f"RMSE (Custom): {rmse_custom:.4f}")
print(f"MAE (Custom): {mae_custom:.4f}")
print(f"R-Squared (Custom): {r2_custom:.4f}")

MSE (Custom): 0.0049
RMSE (Custom): 0.0699
MAE (Custom): 0.0469
R-Squared (Custom): 0.8106


## 13. Comparison with LightGBM

In [None]:
import lightgbm as lgb
from sklearn.metrics import (
    mean_squared_error,
    root_mean_squared_error,
    mean_absolute_error,
    r2_score,
)

train_data = lgb.Dataset(X_train, label=y_train)
params = {
    "objective": "regression",
    "metric": "rmse",
    "learning_rate": 0.1,
    "num_leaves": 31,
    "verbose": -1,  # Suppress LightGBM internal logs
}
model_lgb = lgb.train(params, train_data, num_boost_round=100)
y_pred_lgb = model_lgb.predict(X_test)

mse_lgb = mean_squared_error(y_test, y_pred_lgb)
rmse_lgb = root_mean_squared_error(y_test, y_pred_lgb)
mae_lgb = mean_absolute_error(y_test, y_pred_lgb)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"MSE (Custom): {mse_custom:.4f}")
print(f"MAE (Custom): {mae_custom:.4f}")
print(f"RMSE (Custom): {rmse_custom:.4f}")
print(f"R-Squared (Custom): {r2_custom:.4f}")
print("----------")
print(f"MSE (lightgbm): {mse_lgb:.4f}")
print(f"MAE (lightgbm): {mae_lgb:.4f}")
print(f"RMSE (lightgbm): {rmse_lgb:.4f}")
print(f"R-Squared (lightgbm): {r2_lgb:.4f}")


MSE (Custom): 0.0049
MSE (lightgbm): 0.0045
RMSE (Custom): 0.0699
RMSE (lightgbm): 0.0669
MAE (Custom): 0.0469
MAE (lightgbm): 0.0452
R-Squared (Custom): 0.8106
R-Squared (lightgbm): 0.8264


## 14. References

1. Andreas Mueller. (2020). *Applied ML 2020 - 08 - Gradient Boosting.* <br>
https://www.youtube.com/watch?v=yrTW5YTmFjw

1. Bex Tuychiev. (2023). *A Guide to The Gradient Boosting Algorithm.* <br>
https://www.datacamp.com/tutorial/guide-to-the-gradient-boosting-algorithm

1. DataMListic. (2023). *Gradient Boosting with Regression Trees Explained* [YouTube Video]. <br>
https://youtu.be/lOwsMpdjxog

1. DMLC XGBOOST. (2022). *Introduction to Boosted Trees*. <br>
https://xgboost.readthedocs.io/en/stable/tutorials/model.html

1. GeeksforGeeks. (2025). *LightGBM (Light Gradient Boosting Machine)*. <br>
https://www.geeksforgeeks.org/machine-learning/lightgbm-light-gradient-boosting-machine/

1. GeeksforGeeks. (2025). *XGBoost*. <br>
https://www.geeksforgeeks.org/machine-learning/xgboost/

1. IBM. (2024). *What is XGBoost?* <br>
https://www.ibm.com/think/topics/xgboost

1. Jason Brownlee. (2021). *How to Develop Your First XGBoost Model in Python*.<br>
https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/

1. M Iqbal. (2025). *LightGBM Explained: Fast, Accurate, and Lightweight Machine Learning*. <br>
https://youtu.be/DJYjUOtEHCE

1. Nilimesh Halder. (2023). *Unpacking XGBoost: A Comprehensive Guide to Enhanced Gradient Boosting in Machine Learning*. <br>
https://blog.gopenai.com/unpacking-xgboost-a-comprehensive-guide-to-enhanced-gradient-boosting-in-machine-learning-c145acec09fc

1. nVIDIA. (n.d.). *What is XGBoost?* <br>
https://www.nvidia.com/en-us/glossary/xgboost/

1. StatQuest with Josh Starmer. (2019). *Gradient Boost Part 1 (of 4): Regression Main Ideas* [YouTube Video]. <br>
https://youtu.be/3CC4N4z3GJc

1. StatQuest with Josh Starmer. (2019). *Gradient Boost Part 2 (of 4): Regression Details* [YouTube Video]. <br>
https://youtu.be/2xudPOBz-vs

1. StatQuest with Josh Starmer. (2019). *XGBoost Part 1 (of 4): Regression* [YouTube Video]. <br>
https://youtu.be/OtD8wVaFm6E

1. Terence Parr and Jeremy Howard. (n.d.). *How to explain gradient boosting.* <br>
https://explained.ai/gradient-boosting/index.html

1. The IoT Academy. (2024). *What is the XGBoost Algorithm in ML – Explained With Steps*. <br>
https://www.theiotacademy.co/blog/xgboost-algorithm/

1. Tomonori Masui. (2022). *All You Need to Know about Gradient Boosting Algorithm − Part 1. Regression.* <br>
https://medium.com/data-science/all-you-need-to-know-about-gradient-boosting-algorithm-part-1-regression-2520a34a502

1. ultralytics. (n.d.). *LightGBM*. <br>
https://www.ultralytics.com/glossary/lightgbm

1. Unfold Data Science. (2022). *LightGBM algorithm explained | Lightgbm vs xgboost | lightGBM regression| LightGBM model*. <br>
https://youtu.be/9uxWzeLglr0

1. Wiens, M., Verone-Boyle, A., Henscheid, N., Podichetty, J. T., & Burton, J. (2025). A Tutorial and Use Case Example of the eXtreme Gradient Boosting (XGBoost) Artificial Intelligence Algorithm for Drug Development Applications. Clinical and translational science, 18(3), e70172. <br>
https://doi.org/10.1111/cts.70172