# Extreme Gradient Boosting from Scratch
***

In [74]:
import numpy as np
import pandas as pd
from typing import Tuple, List, Dict, Any
from sklearn.model_selection import train_test_split
from numpy.typing import NDArray

## 1. Introduction
**Extreme Gradient Boosting (XGBoost)** is an advanced implementation of the gradient boosting framework, specifically designed for speed, efficiency, and scalability in supervised machine learning tasks such as classification and regression. It is particularly renowned for its performance on structured (tabular) data and has become a standard tool in data science competitions (e.g., Kaggle) and industry applications.

XGBoost is based on **Gradient Boosting Machine (GBM)**, which is an ensemble machine learning model that builds a strong predictive model by sequentially combining multiple weak models (typically decision trees) in a stage-wise manner. The core idea is to iteratively add new models that correct the errors made by the existing ensemble, thereby improving overall predictive accuracy.

Suppose we have a dataset $\{(x_i, y_i)\}^n_{i=1}$ where $x_i$ are the features and $y_i$ are the target values. XGBoost builds an additive model $F(x)$, chosen to minimise a regularised objective function $\mathcal{L}$:

\begin{align*}
    F(x) = F_0(x) +  \sum^{M}_{m=1} \eta \cdot h_m(x)
\end{align*}

where:
- $F_0(x)$: Initial model (e.g., the mean of $y$).
- $\eta$: Learning rate.
- $M$: Number of boosting iterations (i.e., the number of weak learners).
- $h_m$: prediction from the $m$-th weak learner (e.g., a decision tree).
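
As a rough illustration of the additive form above, the following sketch evaluates $F(x)$ for a scalar input. The function `additive_predict` and the two threshold rules in `toy_learners` are hypothetical stand-ins for fitted trees, not part of this notebook's implementation.

```python
from typing import Callable, Sequence


def additive_predict(x: float, f0: float,
                     learners: Sequence[Callable[[float], float]],
                     eta: float) -> float:
    """Evaluate F(x) = F_0 + sum_m eta * h_m(x) for a scalar input x."""
    return f0 + sum(eta * h(x) for h in learners)


# Two hypothetical weak learners: simple threshold rules on a scalar feature.
toy_learners = [
    lambda x: 1.0 if x > 2.0 else -1.0,
    lambda x: 0.5 if x > 4.0 else -0.5,
]

# With x = 3: the first rule outputs 1.0, the second outputs -0.5.
toy_prediction = additive_predict(3.0, f0=10.0, learners=toy_learners, eta=0.1)
print(toy_prediction)
```

Each learner contributes only a fraction $\eta$ of its output, which is why many small corrections are usually needed.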


And the regularised objective function to be minimised:

\begin{align*}
    \mathcal{L} = \sum^{n}_{i=1}l(y_i, \hat y_i) + \sum^{K}_{k=1}\Omega(f_k)
\end{align*}

where:
- $l(y_i, \hat y_i)$: Loss function for the $i$-th instance (**Training Loss**).
- $\Omega(f_k)$: Regularisation term for the $k$-th tree $f_k$ (**Regularisation Term**).
- $K$: Number of trees in the ensemble.


### Differences: Gradient Boosting vs. XGBoost
- Gradient Boosting fits trees to the negative gradient (residuals) of the loss function, typically using only first-order derivatives and mean values at leaves.

- XGBoost uses both first- and second-order derivatives (gradient and Hessian), incorporates regularisation, and computes optimal leaf weights analytically. It also uses a more sophisticated split criterion (gain) and regularisation to control overfitting.

### Advantages
- XGBoost adds L1 (Lasso) and L2 (Ridge) regularisation to the loss function, which helps prevent overfitting and improves overall model performance.
- Rather than stopping splits when no further gain is achieved, XGBoost grows trees to a maximum depth and then prunes them backward, removing splits with negative gain (depth-first approach).
- Natively handles missing values by learning the optimal direction to take when a value is missing.
- Supports parallel tree construction and distributed computation, making it much faster than traditional gradient boosting implementations.

### Limitations
- Memory usage can be high with large datasets.
- Not ideal for text or images. XGBoost is better suited to tabular data than to unstructured inputs.
- Sensitive to hyperparameters (e.g., learning rate, number of trees, regularisation parameter).

### Steps
1. Initialise the model:
    - Start with a simple model, typically a constant value:
        - For regression: the mean of the target variable.
        - For binary classification: the log-odds of the positive class.
1. Calculate residuals (negative gradients) and Hessians:
    - For each iteration, compute the negative gradients (and Hessians for second-order methods) of the loss function with respect to the current predictions.
1. Fit a new weak model to predict the residuals:
    - Train a weak learner (typically a shallow decision tree) to predict the residuals.
    - The weak learner focuses on correcting the errors made by the current ensemble.
1. Update the model:
    - Add the predictions from the new weak learner to the current model's predictions, scaled by a learning rate (shrinkage parameter).
    - This update moves the ensemble closer to the true values by correcting previous errors.
1. Repeat steps 2-4 for a pre-defined number of iterations.
1. Final prediction:
    - The sum of the initial prediction and the scaled outputs of all weak learners.
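
The steps above can be sketched with a deliberately small example. This is plain first-order gradient boosting with one-split stumps on a single synthetic feature, not the second-order XGBoost procedure developed later; the helper names `fit_stump`, `predict_stump`, and `boost` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def fit_stump(x, residuals):
    """Fit a one-split regression stump on a single feature by least squares."""
    best = None
    for threshold in np.unique(x)[1:]:  # skip the minimum so the left side is never empty
        left = x < threshold
        left_value, right_value = residuals[left].mean(), residuals[~left].mean()
        sse = (((residuals[left] - left_value) ** 2).sum()
               + ((residuals[~left] - right_value) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x < threshold, left_value, right_value)


def boost(x, y, n_rounds=50, eta=0.1):
    prediction = np.full(y.shape, y.mean())          # Step 1: constant initial model
    stumps = []
    for _ in range(n_rounds):                        # Step 5: repeat
        residuals = y - prediction                   # Step 2: negative gradient of 1/2 (y - F)^2
        stump = fit_stump(x, residuals)              # Step 3: fit a weak learner
        prediction = prediction + eta * predict_stump(stump, x)  # Step 4: shrunken update
        stumps.append(stump)
    return y.mean(), stumps, prediction              # Step 6: final fitted values


rng = np.random.default_rng(0)
x_toy = rng.uniform(0.0, 10.0, size=200)
y_toy = np.sin(x_toy) + rng.normal(0.0, 0.1, size=200)

f0, toy_stumps, fitted = boost(x_toy, y_toy)
mse_start = np.mean((y_toy - f0) ** 2)
mse_end = np.mean((y_toy - fitted) ** 2)
print(f"Training MSE before boosting: {mse_start:.4f}, after: {mse_end:.4f}")
```

The training error should fall as rounds are added, because every stump is fitted to what the current ensemble still gets wrong.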


## 2. Loading Data
Retrieved from [GitHub - YBI Foundation](https://github.com/YBI-Foundation/Dataset/blob/main/Admission%20Chance.csv)

In [75]:
df = pd.read_csv(
    'https://raw.githubusercontent.com/YBI-Foundation/Dataset/refs/heads/main/Admission%20Chance.csv')
df.head()

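| | Serial No | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |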


In [76]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
feature_names = df.columns[:-1].tolist()  # All columns except the last one

# Check the shape of the data
print(f'Features shape: {X.shape}')
print(f'Target shape: {y.shape}')
print(f'Features: \n{feature_names}')

Features shape: (400, 8)
Target shape: (400,)
Features: 
['Serial No', 'GRE Score', 'TOEFL Score', 'University Rating', ' SOP', 'LOR ', 'CGPA', 'Research']


## 3. Loss Function
In XGBoost, the loss function is used to calculate the training loss, which is then combined with regularisation terms to form the overall objective function that is minimised during training. The gradients used to fit new trees are computed with respect to the training loss, but the final optimisation considers both the loss and regularisation. 

### Regression
The most common loss function for regression is the mean squared error (MSE):

\begin{align*}
    l(y, \hat y) = \dfrac{1}{n}\sum^{n}_{i=1}(y_i - \hat y_i)^2
\end{align*}

where:
- $y_i$: True value.
- $\hat y_i$: Predicted value.
- $n$: Number of samples.

### Binary Classification
The standard loss function for binary classification in XGBoost is the binary cross-entropy (log loss):

\begin{align*}
    l(y, \hat p) = - \dfrac{1}{n} \sum^{n}_{i=1}\left[y_i \log{(\hat p_i)} + (1-y_i) \log{(1- \hat p_i)} \right]
\end{align*}

where:
- $y_i$: True label ($0$ or $1$).
- $\hat p_i$: Predicted probability for class 1 (after applying the sigmoid function to the raw score).
- $n$: Number of samples.
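
A minimal sketch of this loss in NumPy is given below. The function names, the clipping constant `eps`, and the toy labels are assumptions for illustration. The per-sample gradient $\hat p_i - y_i$ and Hessian $\hat p_i (1 - \hat p_i)$ are taken with respect to the raw score, which is the quantity a boosted tree would update.

```python
import numpy as np


def sigmoid(raw_score):
    return 1.0 / (1.0 + np.exp(-raw_score))


def binary_log_loss(y_true, raw_score, eps=1e-12):
    """Mean binary cross-entropy computed from raw (pre-sigmoid) scores."""
    p = np.clip(sigmoid(raw_score), eps, 1.0 - eps)  # clip to avoid log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))


def log_loss_gradients_and_hessians(y_true, raw_score):
    """Per-sample derivatives of the log loss with respect to the raw score."""
    p = sigmoid(raw_score)
    return p - y_true, p * (1.0 - p)


labels = np.array([1.0, 0.0, 1.0, 0.0])
uninformed = binary_log_loss(labels, np.zeros(4))                       # every p = 0.5
confident = binary_log_loss(labels, np.array([3.0, -3.0, 3.0, -3.0]))   # correct and confident
grad, hess = log_loss_gradients_and_hessians(labels, np.zeros(4))
print(uninformed, confident)
```

An uninformed score of zero gives a loss of $\log 2$, and confident correct scores drive the loss towards zero.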

### Multi-class Classification
For multi-class classification (with $K$ classes), XGBoost uses the multi-class cross-entropy (also called softmax loss):

\begin{align*}
    l(y, \hat P) = - \dfrac{1}{n} \sum^{n}_{i=1} \sum^{K}_{k=1} \mathbb{I}(y_i = k) \log{(\hat p_{ik})}
\end{align*}

where:
- $y_i$: True class label for sample $i$.
- $\hat p_{ik}$: Predicted probability that sample $i$ belongs to class $k$ (output of the softmax function).
- $\mathbb{I}(y_i = k)$: Indicator function, equal to $1$ if $y_i = k$ and $0$ otherwise.
- $n$: Number of samples.
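
The softmax loss can be sketched in the same way. Note one assumption that differs from the formula: classes are indexed from $0$ to $K-1$ here so that the labels can be used directly as array indices. The function names and toy inputs are illustrative only.

```python
import numpy as np


def softmax(raw_scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = raw_scores - raw_scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


def multiclass_log_loss(y_true, raw_scores, eps=1e-12):
    """Mean cross-entropy: only the probability of the true class contributes."""
    probabilities = np.clip(softmax(raw_scores), eps, 1.0)
    n = y_true.shape[0]
    return -np.mean(np.log(probabilities[np.arange(n), y_true]))


labels = np.array([0, 2, 1])               # assumed 0-based class labels
uniform_scores = np.zeros((3, 3))          # equal scores give p = 1/3 for every class
uniform_loss = multiclass_log_loss(labels, uniform_scores)
row_sums = softmax(np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 5.0]])).sum(axis=1)
print(uniform_loss)
```

With equal scores for all three classes the loss is $\log 3$, the multi-class analogue of an uninformed prediction.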

## 4. Initialising Model
First, we need to initialise the model with a constant function that minimises the loss (initial predictions).

\begin{align*}
    F_0(x) = \arg \min_{c}\sum^{n}_{i=1}l(y_i, c)
\end{align*}

For squared error, the best constant is the mean of the target values. Thus,

\begin{align*}
    F_0(x) = \bar y = \dfrac{1}{n}\sum^{n}_{i=1}y_i
\end{align*}

In [77]:
def initialise_model(y: NDArray[np.float64]) -> NDArray[np.float64]:
    """
    Initialise predictions with the mean of the target values.

    Parameters:
        y: Target values, shape (n_samples,).

    Returns:
        Array of initial predictions, each set to mean of y.
    """
    return np.full_like(y, np.mean(y), dtype=float)
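
As a quick numerical sanity check of the claim that the mean is the best constant under squared error, the sketch below searches a grid of candidate constants. The toy target values and the grid are assumptions chosen for illustration.

```python
import numpy as np

y_toy = np.array([0.2, 0.5, 0.9, 0.4])
candidates = np.linspace(0.0, 1.0, 1001)

# Sum of squared errors for every candidate constant (one column per candidate).
sse = ((y_toy[:, None] - candidates[None, :]) ** 2).sum(axis=0)
best_constant = candidates[np.argmin(sse)]

print(best_constant, y_toy.mean())
```

The grid minimiser coincides with the sample mean, up to the grid resolution.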

## 5. Gradient and Hessian Computation
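We need the first derivative (gradient) and the second derivative (Hessian) of the loss with respect to the prediction for each sample $i$. The per-sample squared error is written with the conventional factor of $\frac{1}{2}$, which only rescales the loss and keeps the derivatives simple: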

- $g_i = \dfrac{\partial l(y_i, \hat y_i)}{\partial \hat y_i} = \dfrac{\partial}{\partial \hat y_i} \dfrac{1}{2}(y_i-\hat y_i)^2 = \hat y_i - y_i$
- $h_i = \dfrac{\partial^2 l(y_i, \hat y_i)}{\partial \hat y_i ^2} = 1$

In [78]:
def compute_gradients_and_hessians(y_true: NDArray[np.float64], y_pred: NDArray[np.float64]) -> Tuple[NDArray[np.float64], NDArray[np.float64]]:
    """
    Compute gradients and Hessians for squared error loss.

    Args:
        y_true: True target values, shape (n_samples,).
        y_pred: Predicted values, shape (n_samples,).

    Returns:
        A tuple with gradients and Hessians, both of shape (n_samples,).
    """
    gradients = y_pred - y_true
    hessians = np.ones_like(y_true)
    return gradients, hessians
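
The analytic derivatives can be checked against central finite differences. The sketch below assumes the per-sample loss $\frac{1}{2}(y_i - \hat y_i)^2$ used above; the helper `half_squared_error`, the toy arrays, and the step size are illustrative choices.

```python
import numpy as np


def half_squared_error(y_true, y_pred):
    return 0.5 * (y_true - y_pred) ** 2


y_true = np.array([0.9, 0.4, 0.7])
y_pred = np.array([0.6, 0.6, 0.6])
step = 1e-4

# Central difference for the first derivative with respect to the prediction.
numeric_gradient = (half_squared_error(y_true, y_pred + step)
                    - half_squared_error(y_true, y_pred - step)) / (2 * step)

# Second-order central difference for the second derivative.
numeric_hessian = (half_squared_error(y_true, y_pred + step)
                   - 2 * half_squared_error(y_true, y_pred)
                   + half_squared_error(y_true, y_pred - step)) / step ** 2

print(numeric_gradient, y_pred - y_true)
print(numeric_hessian)
```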

## 5. Finding the Best Split
In XGBoost, the split is chosen to maximise the gain. The gain formula is derived using a second-order Taylor expansion of the loss function.

### Regularised Objective Function
For a tree at boosting iteration $t$, the regularised objective function is:

\begin{align*}
    \mathcal{L}^{(t)} = \sum^{n}_{i=1}l(y_i, \hat y_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)
\end{align*}

where:
- $l$: Loss function (e.g., mean squared error).
- $\hat y_i^{(t-1)}$: Prediction from previous trees.
- $f_t$: New tree.
- $\Omega(f_t)$: Regularisation term.

### Second-Order Taylor Expansion
Expand the loss function around the current prediction $\hat y_i^{(t-1)}$:

\begin{align*}
    l(y_i, \hat y_i^{(t-1)} + f_t(x_i)) \approx l(y_i, \hat y_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)
\end{align*}

where:
- $g_i = \dfrac{\partial l(y_i, \hat y_i)}{\partial \hat y_i}$
- $h_i = \dfrac{\partial^2 l(y_i, \hat y_i)}{\partial \hat y_i ^2}$
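
For squared error the expansion is exact, because the loss is already quadratic in the prediction. To see that it is only an approximation in general, the sketch below compares the exact and expanded values for the log loss on a single sample. The loss is written in terms of the raw score $s$ as $\log(1 + e^{s}) - y s$, and the numbers are arbitrary illustrative values.

```python
import math


def logistic_loss(y, raw_score):
    """Binary log loss in terms of the raw score: log(1 + e^s) - y * s."""
    return math.log1p(math.exp(raw_score)) - y * raw_score


y, current_score, tree_output = 1.0, 0.3, 0.2

p = 1.0 / (1.0 + math.exp(-current_score))
g = p - y              # first derivative at the current score
h = p * (1.0 - p)      # second derivative at the current score

exact = logistic_loss(y, current_score + tree_output)
approx = logistic_loss(y, current_score) + g * tree_output + 0.5 * h * tree_output ** 2

print(exact, approx, abs(exact - approx))
```

The gap shrinks as the tree output gets smaller, which is one reason a small learning rate pairs well with the second-order approximation.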

Assume the new tree $f_t$ assigns a constant score $w_j$ to all samples in leaf $j$:
\begin{align*}
    f_t(x_i) = w_{q(x_i)}
\end{align*}

where $q(x_i)$ maps sample $i$ to its leaf.

### Regularisation Term
The regularisation term for both L1 and L2 is:

\begin{align*}
    \Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum^{T}_{j=1}w_j^2 + \alpha \sum^{T}_{j=1}|w_j|
\end{align*}

where:
- $T$: Number of leaves.
- $\gamma$: Penalty for the number of leaves (tree complexity).
- $\lambda$: L2 regularisation parameter.
- $\alpha$: L1 regularisation parameter.

However, for simplicity, only L2 regularisation will be considered in this notebook. Thus:

\begin{align*}
    \Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum^{T}_{j=1}w_j^2
\end{align*}
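
For a concrete feel of the L2-only penalty, the sketch below evaluates $\Omega$ for a hypothetical tree with three leaves. The function name `tree_complexity` and the leaf weights are made up for illustration.

```python
from typing import Sequence


def tree_complexity(leaf_weights: Sequence[float], gamma: float, lambda_: float) -> float:
    """Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2, with T the number of leaves."""
    n_leaves = len(leaf_weights)
    return gamma * n_leaves + 0.5 * lambda_ * sum(w * w for w in leaf_weights)


# Three leaves: the leaf-count term is 0.1 * 3, the weight term is 0.5 * 0.3225.
penalty = tree_complexity([0.5, -0.25, 0.1], gamma=0.1, lambda_=1.0)
print(penalty)
```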


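### Total Objective Function

Substituting the Taylor expansion and $f_t(x_i) = w_{q(x_i)}$ into $\mathcal{L}^{(t)}$, dropping the constant term $l(y_i, \hat y_i^{(t-1)})$, and grouping the samples by leaf gives the approximate objective: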

\begin{align*}
    \mathcal{\tilde L}^{(t)} = \sum^{T}_{j=1} \left[
        G_j w_j + \frac{1}{2}(H_j+\lambda)w_j^2
        \right] + \gamma T
\end{align*}

where:
- $I_j = \{i : q(x_i) = j\}$: Set of samples assigned to leaf $j$.
- $G_j = \sum_{i \in I_j} g_i$: Sum of gradients in leaf $j$.
- $H_j = \sum_{i \in I_j} h_i$: Sum of Hessians in leaf $j$.

### Optimising Leaf Weights

To find the optimal leaf weight, minimise $\mathcal{\tilde L}^{(t)}$ with respect to $w_j$:

\begin{align*}
    \dfrac{\partial \mathcal{\tilde L}^{(t)}}{\partial w_j} = G_j + (H_j + \lambda)w_j = 0 \quad \Rightarrow \quad w_j^* = - \dfrac{G_j}{H_j + \lambda}
\end{align*}

Plug $w_j^*$ back into the objective:

\begin{align*}
    \mathcal{\tilde L}^{(t)} = - \dfrac{1}{2} \sum^{T}_{j=1} 
        \dfrac{G_j^2}{H_j + \lambda}
         + \gamma T
\end{align*}
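
A small worked example for a single leaf may help. With the illustrative values $G = -3$, $H = 4$, and $\lambda = 1$, the closed form gives $w^* = 0.6$ and a leaf objective of $-0.9$ (the constant $\gamma$ per leaf is left out because it does not depend on $w$). The sketch below confirms this against a brute-force grid search.

```python
import numpy as np


def leaf_objective(w, G, H, lambda_):
    """Contribution of one leaf to the approximate objective, excluding gamma."""
    return G * w + 0.5 * (H + lambda_) * w ** 2


G, H, lambda_ = -3.0, 4.0, 1.0

w_star = -G / (H + lambda_)                          # closed-form optimal weight
closed_form_minimum = -0.5 * G ** 2 / (H + lambda_)  # objective at the optimum

# Brute-force check on a grid with spacing 0.001.
grid = np.linspace(-2.0, 2.0, 4001)
w_grid = grid[np.argmin(leaf_objective(grid, G, H, lambda_))]

print(w_star, closed_form_minimum, w_grid)
```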

When a node is split into left ($L$) and right ($R$) children, the gain is the resulting reduction in the objective:

\begin{align*}
    \text{Gain} = \dfrac{1}{2} \left( 
        \dfrac{G^2_L}{H_L + \lambda} + \dfrac{G^2_R}{H_R + \lambda} - \dfrac{(G_L + G_R)^2}{H_L + H_R + \lambda}
        \right) - \gamma
\end{align*}

where:
- $G_L, H_L$: Sums for the left child.
- $G_R, H_R$: Sums for the right child.
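
To make the formula concrete, the sketch below evaluates the gain for one candidate split of four samples. The gradient values are invented, the Hessians are all $1$ as for squared error, and `split_gain` is a hypothetical helper rather than part of the notebook's code.

```python
def split_gain(G_L: float, H_L: float, G_R: float, H_R: float,
               lambda_: float, gamma: float) -> float:
    """Gain of a split, following the formula above."""
    def score(G: float, H: float) -> float:
        return G ** 2 / (H + lambda_)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma


# Left child holds gradients [-0.4, -0.2], right child holds [0.3, 0.5].
G_L, H_L = -0.4 + -0.2, 2.0
G_R, H_R = 0.3 + 0.5, 2.0

gain_no_penalty = split_gain(G_L, H_L, G_R, H_R, lambda_=1.0, gamma=0.0)
gain_with_penalty = split_gain(G_L, H_L, G_R, H_R, lambda_=1.0, gamma=0.2)
print(gain_no_penalty, gain_with_penalty)
```

With $\gamma = 0$ the split is worthwhile, while $\gamma = 0.2$ makes the gain negative, so the same split would be rejected.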


In [79]:
def best_split(X: NDArray[np.float64], gradients: NDArray[np.float64], hessians: NDArray[np.float64],
               min_samples_leaf: int, lambda_: float, gamma: float) -> Tuple[int | None, float | None, float]:
    """
    Find the best split for a node in the XGBoost tree.

    Args:
        X: Feature matrix, shape (n_samples, n_features).
        gradients: Gradients for each sample, shape (n_samples,).
        hessians: Hessians for each sample, shape (n_samples,).
        min_samples_leaf: Minimum number of samples required in a leaf.
        lambda_ : L2 regularisation parameter.
        gamma: Minimum loss reduction required to make a split.

    Returns:
        A tuple with best feature index, best threshold, and best gain. The feature and threshold are None when no valid split exists.
    """
    n_features = X.shape[1]
    best_feature, best_threshold, best_gain = None, None, -np.inf
    for feature in range(n_features):
        thresholds = np.unique(X[:, feature])
        for threshold in thresholds:
            left_mask = X[:, feature] < threshold
            right_mask = ~left_mask
            if left_mask.sum() < min_samples_leaf or right_mask.sum() < min_samples_leaf:
                continue
            G_L, H_L = gradients[left_mask].sum(), hessians[left_mask].sum()
            G_R, H_R = gradients[right_mask].sum(), hessians[right_mask].sum()
            gain = 0.5 * (G_L**2 / (H_L + lambda_) + G_R**2 / (H_R + lambda_) -
                          (G_L + G_R)**2 / (H_L + H_R + lambda_)) - gamma
            if gain > best_gain:
                best_feature, best_threshold, best_gain = feature, threshold, gain
    return best_feature, best_threshold, best_gain
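
The exhaustive search above recomputes the left and right sums for every threshold. A common alternative, shown here as a sketch for a single feature, sorts the feature once and uses cumulative sums. This is a different technique from the function above, the name `best_split_one_feature` is hypothetical, and it assumes `min_samples_leaf = 1`.

```python
import numpy as np


def best_split_one_feature(x, gradients, hessians, lambda_, gamma):
    """Scan all thresholds of one feature using prefix sums over the sorted samples."""
    order = np.argsort(x, kind="stable")
    x_sorted = x[order]
    # Entry k holds the sums over the first k + 1 sorted samples (the left child).
    G_left = np.cumsum(gradients[order])[:-1]
    H_left = np.cumsum(hessians[order])[:-1]
    G_total, H_total = gradients.sum(), hessians.sum()
    G_right, H_right = G_total - G_left, H_total - H_left
    gains = 0.5 * (G_left ** 2 / (H_left + lambda_)
                   + G_right ** 2 / (H_right + lambda_)
                   - G_total ** 2 / (H_total + lambda_)) - gamma
    # The rule "x < threshold" can only separate two distinct neighbouring values.
    valid = x_sorted[1:] > x_sorted[:-1]
    if not valid.any():
        return None, -np.inf
    gains = np.where(valid, gains, -np.inf)
    k = int(np.argmax(gains))
    return x_sorted[k + 1], gains[k]


x_toy = np.array([3.0, 1.0, 4.0, 2.0])
g_toy = np.array([0.3, -0.4, 0.5, -0.2])
h_toy = np.ones(4)

threshold, gain = best_split_one_feature(x_toy, g_toy, h_toy, lambda_=1.0, gamma=0.0)
print(threshold, gain)
```

Here the best rule is `x < 3.0`, which sends the two negative gradients to the left child and the two positive ones to the right.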

## 6. Building Trees
The `build_tree()` function recursively constructs a regression tree by splitting the data at each node using the best split found, until the stopping criteria are met.

The formula for the optimal leaf value is:

\begin{align*}
    w^* = - \dfrac{\sum g_i}{\sum h_i + \lambda}
\end{align*}

- Stopping criteria: a node becomes a leaf when either
    1. `max_depth` is reached, or
    2. the number of samples in the node is less than or equal to `min_samples_leaf`.
- If no valid split is found or the best gain is non-positive, a leaf is also created.

In [80]:
def build_tree(X: NDArray[np.float64], gradients: NDArray[np.float64], hessians: NDArray[np.float64],
               max_depth: int, min_samples_leaf: int, lambda_: float,
               gamma: float, depth: int = 0) -> Dict[str, Any] | float:
    """
    Recursively build a decision tree for XGBoost.

    Args:
        X: Feature matrix, shape (n_samples, n_features).
        gradients: Gradients for each sample, shape (n_samples,).
        hessians: Hessians for each sample, shape (n_samples,).
        max_depth: Maximum depth of the tree.
        min_samples_leaf: Minimum number of samples required in a leaf.
        lambda_ : L2 regularisation parameter.
        gamma: Minimum loss reduction required to make a split.
        depth: Current depth of the tree. Defaults to 0.

    Returns:
        Tree structure as nested dictionaries, or a float for leaf value.
    """

    if depth >= max_depth or len(gradients) <= min_samples_leaf:
        return -gradients.sum() / (hessians.sum() + lambda_)
    feature, threshold, gain = best_split(
        X, gradients, hessians, min_samples_leaf, lambda_, gamma)
    if feature is None or gain <= 0:
        return -gradients.sum() / (hessians.sum() + lambda_)
    left_mask = X[:, feature] < threshold
    right_mask = ~left_mask
    return {
        'feature': feature,
        'threshold': threshold,
        'left': build_tree(X[left_mask], gradients[left_mask], hessians[left_mask],
                           max_depth, min_samples_leaf, lambda_, gamma, depth + 1),
        'right': build_tree(X[right_mask], gradients[right_mask], hessians[right_mask],
                            max_depth, min_samples_leaf, lambda_, gamma, depth + 1)
    }
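
Because a tree is stored as nested dictionaries with floats at the leaves, it can be inspected with short recursive helpers. The sketch below uses a hand-written `toy_tree` in the same format; the helpers `count_leaves` and `tree_depth` and the toy values are assumptions for illustration.

```python
from typing import Any, Dict, Union

Tree = Union[Dict[str, Any], float]


def count_leaves(tree: Tree) -> int:
    """Number of leaves T, i.e. the quantity penalised by gamma."""
    if not isinstance(tree, dict):
        return 1
    return count_leaves(tree['left']) + count_leaves(tree['right'])


def tree_depth(tree: Tree) -> int:
    """Number of splits on the longest path from the root to a leaf."""
    if not isinstance(tree, dict):
        return 0
    return 1 + max(tree_depth(tree['left']), tree_depth(tree['right']))


# Hand-written tree in the same nested-dictionary format as build_tree's output.
toy_tree = {
    'feature': 0, 'threshold': 320.0,
    'left': -0.05,
    'right': {'feature': 1, 'threshold': 8.5, 'left': 0.02, 'right': 0.08},
}

print(count_leaves(toy_tree), tree_depth(toy_tree))
```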

## 7. Predictions (Trees)
The two functions `predict_tree()` and `predict_tree_batch()` are used to make predictions with a regression decision tree represented as a nested dictionary.

- `predict_tree()`: Predicts the output for a single data sample $x$ by traversing the decision tree.
- `predict_tree_batch()`: Generates predictions for a batch of samples by applying the `predict_tree` function to each sample in the input array.

In [81]:
def predict_tree(tree: Dict[str, Any] | float, x: NDArray[np.float64]) -> float:
    """
    Predict the output for a single sample using a decision tree.

    Args:
        tree: Tree structure or leaf value.
        x: Feature vector, shape (n_features,).

    Returns:
        Predicted value for the sample.
    """
    while isinstance(tree, dict):
        if x[tree['feature']] < tree['threshold']:
            tree = tree['left']
        else:
            tree = tree['right']
    return tree


def predict_tree_batch(tree: Dict[str, Any] | float, X: NDArray[np.float64]) -> NDArray[np.float64]:
    """
    Predict outputs for a batch of samples using a decision tree.

    Args:
        tree: Tree structure or leaf value.
        X: Feature matrix, shape (n_samples, n_features).

    Returns:
        Predicted values, shape (n_samples,).
    """
    return np.array([predict_tree(tree, x) for x in X])
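
Row-by-row traversal is easy to follow but slow in pure Python for large inputs. One possible alternative, sketched below, routes whole sets of row indices down the tree with boolean masks. The function `predict_tree_vectorised` and the toy tree are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np


def predict_tree_vectorised(tree, X):
    """Route whole index sets down the tree instead of one row at a time."""
    out = np.empty(X.shape[0], dtype=float)

    def descend(node, rows):
        if not isinstance(node, dict):
            out[rows] = node                  # leaf: assign its value to all rows that reached it
            return
        go_left = X[rows, node['feature']] < node['threshold']
        descend(node['left'], rows[go_left])
        descend(node['right'], rows[~go_left])

    descend(tree, np.arange(X.shape[0]))
    return out


toy_tree = {
    'feature': 0, 'threshold': 320.0,
    'left': -0.05,
    'right': {'feature': 1, 'threshold': 8.5, 'left': 0.02, 'right': 0.08},
}
X_toy = np.array([[310.0, 9.0], [330.0, 8.0], [330.0, 9.0]])

toy_predictions = predict_tree_vectorised(toy_tree, X_toy)
print(toy_predictions)
```

The comparison uses the same `<` rule as the per-sample version, so both approaches should agree on any tree in this format.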

## 8. Training Model
During the training process of gradient boosting:
1. Initialise model predictions as $ F_0(x) = \bar y $.
1. Boosting loop from $ m=1 $ to $ M $ (number of boosting rounds `n_estimators`):
    - Compute gradients and Hessians.
    - Fit a tree $ h_m(x) $ to the gradients and Hessians, optimising the regularised objective function:

      $$
      h_m(x) = \text{Tree}(X, \{g_i^{(m)}, h_i^{(m)}\})
      $$

      The tree structure and leaf values are chosen to maximise the gain:

      $$
      \text{Gain} = \frac{1}{2} \left( 
          \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
          \right) - \gamma
      $$

      where $G_L$, $H_L$ and $G_R$, $H_R$ are the sums of gradients and hessians for the left and right splits respectively.

    - Update predictions.

      The model is updated additively:

      $$
      F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)
      $$

      where $\eta$ is the learning rate.
1. The function returns the initial prediction (mean of $y$), the list of fitted trees, and the learning rate.

In [82]:
def fit(X: NDArray[np.float64], y: NDArray[np.float64], n_estimators: int = 10, learning_rate: float = 0.1,
        max_depth: int = 3, min_samples_leaf: int = 1, lambda_: float = 1.0, gamma: float = 0.0
        ) -> Tuple[float, List[Dict[str, Any] | float], float]:
    """
    Fit an XGBoost-like model for regression.

    Args:
        X: Feature matrix, shape (n_samples, n_features).
        y: Target values, shape (n_samples,).
        n_estimators: Number of boosting rounds. Defaults to 10.
        learning_rate: Learning rate. Defaults to 0.1.
        max_depth: Maximum depth of each tree. Defaults to 3.
        min_samples_leaf: Minimum samples per leaf. Defaults to 1.
        lambda_: L2 regularisation parameter. Defaults to 1.0.
        gamma: Minimum loss reduction for a split. Defaults to 0.0.

    Returns:
        Tuple[float, List[Dict[str, Any] | float], float]: Initial prediction (mean of y), list of fitted trees, and learning rate.
    """
    models = []
    y_pred = initialise_model(y)
    for m in range(n_estimators):
        gradients, hessians = compute_gradients_and_hessians(y, y_pred)
        tree = build_tree(
            X, gradients, hessians, max_depth, min_samples_leaf, lambda_, gamma)
        update = predict_tree_batch(tree, X)
        y_pred += learning_rate * update
        models.append(tree)
    return np.mean(y), models, learning_rate

## 9. Final Predictions
The final prediction after $M$ trees is:
\begin{align*}
    F_M(x) = F_0(x) + \sum^{M}_{m=1} \eta \cdot h_m(x)
\end{align*}

where:
- $F_0(x)$: Initial prediction (mean of $y$).
- $h_m(x)$: Prediction of the $m$-th tree.
- $\eta$: Learning rate.

In [83]:
def predict(X: NDArray[np.float64], initial_prediction: float, models: List[Dict[str, Any] | float],
            learning_rate: float) -> NDArray[np.float64]:
    """
    Predict outputs for a batch of samples using the fitted model.

    Args:
        X: Feature matrix, shape (n_samples, n_features).
        initial_prediction: Initial prediction (mean of y from training).
        models: List of fitted trees.
        learning_rate: Learning rate.

    Returns:
        Predicted values, shape (n_samples,).
    """
    y_pred = np.full(X.shape[0], initial_prediction, dtype=float)
    for tree in models:
        y_pred += learning_rate * predict_tree_batch(tree, X)
    return y_pred

## 10. Evaluation Metrics
### Mean Squared Error (MSE)
Mean Squared Error measures the average squared difference between predicted ($\hat y$) and actual ($y$) values. Large errors are penalised heavily. Smaller MSE indicates better predictions.

\begin{align*}
MSE = \dfrac{1}{n} \sum_{i=1}^{n}(\hat y_{i} - y_{i})^2
\end{align*}

In [84]:
def calculate_MSE(y_true: NDArray[np.float64], y_pred: NDArray[np.float64]) -> float:
    return np.mean((y_true - y_pred) ** 2)

### Root Mean Squared Error (RMSE)
RMSE is the square root of MSE. It expresses the error in the same unit as the target variable ($y$), which makes it easier to interpret.

\begin{align*}
RMSE = \sqrt{(MSE)}
\end{align*}

In [85]:
def calculate_RMSE(y_true: NDArray[np.float64], y_pred: NDArray[np.float64]) -> float:
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

### Mean Absolute Error (MAE)
Mean Absolute Error measures the average absolute difference between predicted ($\hat y$) and actual ($y$) values. It is less sensitive to outliers than MSE. Smaller MAE indicates better predictions.

\begin{align*}
MAE = \dfrac{1}{n} \sum_{i=1}^{n}|\hat y_{i} - y_{i}|
\end{align*}

In [86]:
def calculate_MAE(y_true: NDArray[np.float64], y_pred: NDArray[np.float64]) -> float:
    return np.mean(np.abs(y_true - y_pred))

<a id="r-squared"></a>
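### R-Squared ($R^2$)

R-squared indicates the proportion of variance in the dependent variable that is predictable from the independent variables. On training data it typically lies between 0 and 1, with values closer to 1 indicating a better fit; it becomes negative when the model predicts worse than always using the mean of $y$.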



Residual Sum of Squares ($SS_{residual}$): 
\begin{align*}
SS_{residual} = \sum_{i=1}^{n} (y_{i} - \hat y_{i})^{2}
\end{align*}

Total Sum of Squares ($SS_{total}$): 
\begin{align*}
SS_{total} = \sum_{i=1}^{n} (y_{i} - \bar y)^{2}
\end{align*}

$R^2$ is computed as:

\begin{align*}
R^2 = 1 - \dfrac{SS_{residual}}{SS_{total}} = 1 - \dfrac{\sum_{i=1}^{n} (y_{i} - \hat y_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar y)^{2}}
\end{align*}

where:
- $y$: Actual target values.
- $\bar y$: Mean of the actual target values.
- $\hat y$: Predicted target values.

In [87]:
def calculate_r2(y_true: NDArray[np.float64], y_pred: NDArray[np.float64]) -> float:
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
    ss_residual = np.sum((y_true - y_pred) ** 2)
    r2 = 1 - (ss_residual / ss_total)
    return r2

In [88]:
def evaluate(y_true: NDArray[np.float64], y_pred: NDArray[np.float64]) -> Tuple[float, float, float, float]:
    """
    Calculate and return evaluation metrics for a regression model, including MSE, RMSE, MAE, and R-squared.

    Args:
        y_true: True target values.
        y_pred: Predicted target values.

    Returns:
        - mse: Mean Squared Error (MSE), indicating the average of the squared differences between predicted and true values.
        - rmse: Root Mean Squared Error (RMSE), indicating the standard deviation of the residuals.
        - mae: Mean Absolute Error (MAE), representing the average absolute difference between predicted and true values.
        - r2: R-squared (coefficient of determination), showing the proportion of variance in the dependent variable that is predictable from the independent variable(s).
    """
    mse = calculate_MSE(y_true, y_pred)
    rmse = calculate_RMSE(y_true, y_pred)
    mae = calculate_MAE(y_true, y_pred)
    r2 = calculate_r2(y_true, y_pred)
    return mse, rmse, mae, r2
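
The four metrics can be followed by hand on three made-up observations. The sketch below recomputes them directly from the formulas rather than calling the functions above, and also shows a case where $R^2$ is negative.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_true - y_pred                             # [1, 0, -2]
mse = np.mean(errors ** 2)                           # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)
mae = np.mean(np.abs(errors))                        # (1 + 0 + 2) / 3
ss_residual = np.sum(errors ** 2)                    # 5
ss_total = np.sum((y_true - y_true.mean()) ** 2)     # 4 + 0 + 4
r2 = 1 - ss_residual / ss_total

# Predicting a constant far from the mean is worse than predicting the mean itself.
r2_bad = 1 - np.sum((y_true - 10.0) ** 2) / ss_total

print(f'MSE: {mse:.4f}, RMSE: {rmse:.4f}, MAE: {mae:.4f}, R2: {r2:.4f}, R2 (bad): {r2_bad:.4f}')
```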

## 11. Encapsulation

In [89]:
class CustomXGBoost:
    """
    Custom Extreme Gradient Boosting for regression with decision trees.
    """

    def __init__(self, n_estimators: int = 10, learning_rate: float = 0.1,
                 max_depth: int = 3, min_samples_leaf: int = 1, lambda_: float = 1.0,
                 gamma: float = 0.0) -> None:
        """
        Initialise the CustomXGBoost regressor with specified hyperparameters.

        Parameters:
            n_estimators: Number of boosting rounds (trees).
            learning_rate: Shrinkage factor for each tree's contribution.
            max_depth: Maximum depth of each regression tree.
            min_samples_leaf: Minimum number of samples required in a leaf node.
            lambda_: L2 regularisation parameter for leaf weights.
            gamma: Minimum loss reduction required to make a split.
        """
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.lambda_ = lambda_
        self.gamma = gamma
        self.initial_prediction = None
        self.models: List[Dict[str, Any] | float] = []

    def _compute_gradients_and_hessians(self, y_true: NDArray[np.float64],
                                        y_pred: NDArray[np.float64]) -> Tuple[NDArray[np.float64], NDArray[np.float64]]:
        """
        Compute gradients and Hessians for squared error loss.

        Args:
            y_true: True target values, shape (n_samples,).
            y_pred: Predicted values, shape (n_samples,).

        Returns:
            A tuple with gradients and Hessians, both of shape (n_samples,).
        """
        gradients = y_pred - y_true
        hessians = np.ones_like(y_true)
        return gradients, hessians

    def _best_split(self, X: NDArray[np.float64], gradients: NDArray[np.float64],
                    hessians: NDArray[np.float64]) -> Tuple[int | None, float | None, float]:
        """
        Find the best split for a node in the XGBoost tree.

        Args:
            X: Feature matrix, shape (n_samples, n_features).
            gradients: Gradients for each sample, shape (n_samples,).
            hessians: Hessians for each sample, shape (n_samples,).
        Returns:
            A tuple with best feature index, best threshold, and best gain.
        """
        best_feature, best_threshold, best_gain = None, None, -np.inf
        for feature in range(X.shape[1]):
            thresholds = np.unique(X[:, feature])
            for threshold in thresholds:
                left_mask = X[:, feature] < threshold
                right_mask = ~left_mask
                if left_mask.sum() < self.min_samples_leaf or right_mask.sum() < self.min_samples_leaf:
                    continue
                G_L, H_L = gradients[left_mask].sum(), hessians[left_mask].sum()
                G_R, H_R = gradients[right_mask].sum(), hessians[right_mask].sum()
                gain = 0.5 * (
                    G_L**2 / (H_L + self.lambda_) +
                    G_R**2 / (H_R + self.lambda_) -
                    (G_L + G_R)**2 / (H_L + H_R + self.lambda_)
                ) - self.gamma
                if gain > best_gain:
                    best_feature, best_threshold, best_gain = feature, threshold, gain
        return best_feature, best_threshold, best_gain

    def _build_tree(self, X: NDArray[np.float64], gradients: NDArray[np.float64],
                    hessians: NDArray[np.float64], depth: int = 0) -> Dict[str, Any] | float:
        """
        Recursively build a decision tree for XGBoost.

        Args:
            X: Feature matrix, shape (n_samples, n_features).
            gradients: Gradients for each sample, shape (n_samples,).
            hessians: Hessians for each sample, shape (n_samples,).
            depth: Current depth of the tree. Defaults to 0.

        Returns:
            Tree structure as nested dictionaries, or a float for leaf value.
        """
        if depth >= self.max_depth or len(gradients) <= self.min_samples_leaf:
            return -gradients.sum() / (hessians.sum() + self.lambda_)
        feature, threshold, gain = self._best_split(X, gradients, hessians)
        if feature is None or gain <= 0:
            return -gradients.sum() / (hessians.sum() + self.lambda_)
        left_mask = X[:, feature] < threshold
        right_mask = ~left_mask
        return {
            'feature': feature,
            'threshold': threshold,
            'left': self._build_tree(
                X[left_mask], gradients[left_mask], hessians[left_mask], depth + 1
            ),
            'right': self._build_tree(
                X[right_mask], gradients[right_mask], hessians[right_mask], depth + 1
            )
        }

    def _predict_tree(self, tree: Dict[str, Any] | float, x: NDArray[np.float64]) -> float:
        """
        Predict the output for a single sample using a decision tree.

        Args:
            tree: Tree structure or leaf value.
            x: Feature vector, shape (n_features,).

        Returns:
            Predicted value for the sample.
        """
        while isinstance(tree, dict):
            if x[tree['feature']] < tree['threshold']:
                tree = tree['left']
            else:
                tree = tree['right']

        return tree

    def _predict_tree_batch(self, tree: Dict[str, Any] | float, X: NDArray[np.float64]) -> NDArray[np.float64]:
        """
        Predict outputs for a batch of samples using a decision tree.

        Args:
            tree: Tree structure or leaf value.
            X: Feature matrix, shape (n_samples, n_features).

        Returns:
            Predicted values, shape (n_samples,).
        """
        return np.array([self._predict_tree(tree, x) for x in X])

    def fit(self, X: NDArray[np.float64], y: NDArray[np.float64]) -> None:
        """
        Fit an XGBoost-like model for regression.

        Args:
            X: Feature matrix, shape (n_samples, n_features).
            y: Target values, shape (n_samples,).
        """
        self.models = []
        y_pred = np.full_like(y, np.mean(y), dtype=float)
        self.initial_prediction = np.mean(y)
        for _ in range(self.n_estimators):
            gradients, hessians = self._compute_gradients_and_hessians(
                y, y_pred)
            tree = self._build_tree(X, gradients, hessians)
            update = self._predict_tree_batch(tree, X)
            y_pred += self.learning_rate * update
            self.models.append(tree)

    def predict(self, X: NDArray[np.float64]) -> NDArray[np.float64]:
        """
        Predict outputs for a batch of samples using the fitted model.

        Args:
            X: Feature matrix, shape (n_samples, n_features).

        Returns:
            Predicted values, shape (n_samples,).
        """
        y_pred = np.full(X.shape[0], self.initial_prediction, dtype=float)
        for tree in self.models:
            y_pred += self.learning_rate * self._predict_tree_batch(tree, X)
        return y_pred

In [90]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Instantiate and fit the model
model_custom = CustomXGBoost(
    n_estimators=100, learning_rate=0.1, max_depth=2, lambda_=0.01)
model_custom.fit(X_train, y_train)

# Predict and evaluate
y_pred_custom = model_custom.predict(X_test)
mse_custom, rmse_custom, mae_custom, r2_custom = evaluate(
    y_test, y_pred_custom)
print(f'MSE (Custom): {mse_custom:.4f}')
print(f'RMSE (Custom): {rmse_custom:.4f}')
print(f'MAE (Custom): {mae_custom:.4f}')
print(f'R-Squared (Custom): {r2_custom:.4f}')

MSE (Custom): 0.0038
RMSE (Custom): 0.0618
MAE (Custom): 0.0436
R-Squared (Custom): 0.8522


## 12. Comparison with xgboost

In [91]:
from xgboost import XGBRegressor
# Note: root_mean_squared_error requires scikit-learn 1.4 or later.
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    root_mean_squared_error,
)

# Instantiate and fit the xgboost model
xgb = XGBRegressor(
    n_estimators=100,        # Number of boosting rounds (trees)
    learning_rate=0.1,       # Step size shrinkage
    max_depth=5,             # Maximum tree depth
    subsample=0.8,           # Row sampling
    colsample_bytree=0.8,    # Feature sampling
    objective='reg:squarederror',  # Loss function for regression
    random_state=42
)
xgb.fit(X_train, y_train)

# Predict and evaluate
y_pred_xgboost = xgb.predict(X_test)
mse_xgboost = mean_squared_error(y_test, y_pred_xgboost)
rmse_xgboost = root_mean_squared_error(y_test, y_pred_xgboost)
mae_xgboost = mean_absolute_error(y_test, y_pred_xgboost)
r2_xgboost = r2_score(y_test, y_pred_xgboost)
print(f'MSE (Custom): {mse_custom:.4f}')
print(f'MSE (xgboost): {mse_xgboost:.4f}')
print(f'RMSE (Custom): {rmse_custom:.4f}')
print(f'RMSE (xgboost): {rmse_xgboost:.4f}')
print(f'MAE (Custom): {mae_custom:.4f}')
print(f'MAE (xgboost): {mae_xgboost:.4f}')
print(f'R-Squared (Custom): {r2_custom:.4f}')
print(f'R-Squared (xgboost): {r2_xgboost:.4f}')

MSE (Custom): 0.0038
MSE (xgboost): 0.0039
RMSE (Custom): 0.0618
RMSE (xgboost): 0.0623
MAE (Custom): 0.0436
MAE (xgboost): 0.0417
R-Squared (Custom): 0.8522
R-Squared (xgboost): 0.8495
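
The two models above are fitted with different settings (`max_depth=5` with row and column subsampling for `XGBRegressor`, `max_depth=2` for `CustomXGBoost`), so the near-identical scores are indicative rather than a like-for-like check. A closer comparison could align the hyperparameters as in the sketch below. The mapping of `lambda_` to `reg_lambda` is an assumption about how `CustomXGBoost` uses that parameter, and library defaults not listed here (such as `min_child_weight` or the split-finding method) may still cause differences.

```python
# Settings chosen to mirror CustomXGBoost(n_estimators=100, learning_rate=0.1,
# max_depth=2, lambda_=0.01). All keys are standard XGBRegressor arguments.
matched_params = {
    'n_estimators': 100,
    'learning_rate': 0.1,
    'max_depth': 2,
    'reg_lambda': 0.01,       # assumed counterpart of the custom `lambda_`
    'subsample': 1.0,         # use every row for each tree
    'colsample_bytree': 1.0,  # use every feature for each tree
    'objective': 'reg:squarederror',
    'random_state': 42,
}

# Settings used in the XGBRegressor call earlier in this section.
original_params = {'max_depth': 5, 'subsample': 0.8, 'colsample_bytree': 0.8}

# Show (original, matched) for each setting that changes.
changed = {k: (original_params[k], matched_params[k]) for k in original_params}
print(changed)
```

With xgboost installed, `XGBRegressor(**matched_params).fit(X_train, y_train)` could then be scored with the same four metrics as above.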


## 13. References

--- 

1. Andreas Mueller. (2020). *Applied ML 2020 - 08 - Gradient Boosting* [YouTube Video]. <br>
https://www.youtube.com/watch?v=yrTW5YTmFjw

1. Bex Tuychiev. (2023). *A Guide to The Gradient Boosting Algorithm.* <br>
https://www.datacamp.com/tutorial/guide-to-the-gradient-boosting-algorithm

1. DataMListic. (2023). *Gradient Boosting with Regression Trees Explained* [YouTube Video]. <br>
https://youtu.be/lOwsMpdjxog

1. DMLC XGBoost. (2022). *Introduction to Boosted Trees*. <br>
https://xgboost.readthedocs.io/en/stable/tutorials/model.html

1. GeeksforGeeks. (2025). *XGBoost*. <br>
https://www.geeksforgeeks.org/machine-learning/xgboost/

1. IBM. (2024). *What is XGBoost?* <br>
https://www.ibm.com/think/topics/xgboost

1. Jason Brownlee. (2021). *How to Develop Your First XGBoost Model in Python*. <br>
https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/

1. Nilimesh Halder. (2023). *Unpacking XGBoost: A Comprehensive Guide to Enhanced Gradient Boosting in Machine Learning*. <br>
https://blog.gopenai.com/unpacking-xgboost-a-comprehensive-guide-to-enhanced-gradient-boosting-in-machine-learning-c145acec09fc

1. NVIDIA. (n.d.). *What is XGBoost?* <br>
https://www.nvidia.com/en-us/glossary/xgboost/

1. StatQuest with Josh Starmer. (2019). *Gradient Boost Part 1 (of 4): Regression Main Ideas* [YouTube Video]. <br>
https://youtu.be/3CC4N4z3GJc

1. StatQuest with Josh Starmer. (2019). *Gradient Boost Part 2 (of 4): Regression Details* [YouTube Video]. <br>
https://youtu.be/2xudPOBz-vs

1. StatQuest with Josh Starmer. (2019). *XGBoost Part 1 (of 4): Regression* [YouTube Video]. <br>
https://youtu.be/OtD8wVaFm6E

1. Terence Parr and Jeremy Howard. (n.d.). *How to explain gradient boosting.* <br>
https://explained.ai/gradient-boosting/index.html

1. The IoT Academy. (2024). *What is the XGBoost Algorithm in ML – Explained With Steps*. <br>
https://www.theiotacademy.co/blog/xgboost-algorithm/

1. Tomonori Masui. (2022). *All You Need to Know about Gradient Boosting Algorithm − Part 1. Regression.* <br>
https://medium.com/data-science/all-you-need-to-know-about-gradient-boosting-algorithm-part-1-regression-2520a34a502

1. Wiens, M., Verone-Boyle, A., Henscheid, N., Podichetty, J. T., & Burton, J. (2025). *A Tutorial and Use Case Example of the eXtreme Gradient Boosting (XGBoost) Artificial Intelligence Algorithm for Drug Development Applications*. Clinical and Translational Science, 18(3), e70172. <br>
https://doi.org/10.1111/cts.70172