# Categorical Gradient Boosting Machine from Scratch
***
## Table of Contents
1. [Introduction](#1-introduction)
    - [Advantages](#advantages)
    - [Limitations](#limitations)
    - [Steps](#steps)
1. [Loading Data](#2-loading-data)
1. [Ordered Target Encoding](#3-ordered-target-encoding)
1. [Finding the Best Split](#4-finding-the-best-split)
1. [Building Trees](#5-building-trees)
1. [Predictions (Trees)](#6-predictions-trees)
1. [Training Model](#7-training-model)
1. [Final Predictions](#8-final-predictions)
1. [Evaluation Metrics](#9-evaluation-metrics)
    - [Binary Confusion Matrix](#binary-confusion-matrix)
    - [Multi-Class Confusion Matrix](#multi-class-confusion-matrix)
    - [Accuracy](#accuracy)
    - [Precision](#precision)
    - [Recall](#recall)
    - [F1-Score](#f1-score)
1. [Encapsulation](#10-encapsulation)
1. [Comparison with catboost](#11-comparison-with-catboost)
1. [References](#12-references)
***

In [1]:
import numpy as np
import pandas as pd
from typing import Tuple, List, Dict, Any, Optional
from tqdm import tqdm
from numpy.typing import NDArray

## 1. Introduction

**Categorical Gradient Boosting (CatBoost)** is the application of gradient boosting algorithms to datasets containing categorical features. CatBoost uses an advanced target encoding technique called **ordered target statistics** to convert categorical values to numeric representations while avoiding target leakage and overfitting. The encoding is based on the mean target value for each category, calculated in a way that prevents the model from '*seeing the future*' (i.e., using information from the current or future rows to encode the category).


**Gradient Boosting Machine (GBM)** is an ensemble machine learning model that builds a strong predictive model by sequentially combining multiple weak models (typically decision trees) in a stage-wise manner. The core idea is to iteratively add new models that correct the errors made by the existing ensemble, thereby improving overall predictive accuracy.

Suppose we have a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ where $x_i$ are the features and $y_i$ are the target values. The goal of gradient boosting is to find a function $F(x)$ that minimises a given differentiable loss function $L(y, F(x))$:

\begin{align*}
    F(x) = F_0(x) + \sum^{M}_{m=1}\gamma_m h_m(x)
\end{align*}

where:
- $F_0(x)$: Initial model (e.g., the mean of $y$).
- $\gamma_m$: Weight (step size) for the $m$-th weak learner, typically determined by minimising the loss function along the direction of $h_m(x)$.
- $M$: Number of boosting iterations (i.e., the number of weak learners).
- $h_m(x)$: Prediction from the $m$-th weak learner (e.g., a decision tree).
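
The additive form above can be sketched in a few lines. This is a minimal illustration, not the notebook's implementation: `ensemble_predict`, the two threshold rules `h1` and `h2`, and the constant weights are hypothetical names and values chosen only to show how $F_0$, $\gamma_m$, and $h_m$ combine.

```python
import numpy as np


def ensemble_predict(x, f0, learners, gammas):
    """Evaluate F(x) = F_0 + sum_m gamma_m * h_m(x) for a batch of inputs."""
    pred = np.full(x.shape[0], f0, dtype=np.float64)
    for gamma, h in zip(gammas, learners):
        pred += gamma * h(x)
    return pred


# Two hypothetical weak learners: threshold rules on the first feature.
def h1(x):
    return np.where(x[:, 0] <= 2.0, -1.0, 1.0)


def h2(x):
    return np.where(x[:, 0] <= 3.0, -0.5, 0.5)


x_demo = np.array([[1.0], [2.5], [4.0]])
f_demo = ensemble_predict(x_demo, f0=10.0, learners=[h1, h2], gammas=[0.1, 0.1])
print(f_demo)
```

Each weak learner only nudges the initial prediction of $10$; here the three inputs end up at approximately $9.85$, $10.05$, and $10.15$.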


### Advantages
- Can process categorical variables directly without manual encoding.
- Ordered boosting and target encoding reduce overfitting.
- Symmetric tree structures enable efficient and swift predictions.

### Limitations
- Memory consumption may be significant, especially for large datasets with many categorical features.
- Computationally expensive and time-consuming for large datasets or a large number of trees.


### Steps
1. Identify categorical features in the dataset.
    - Encode categorical variables.
        - Use *target statistics* or *ordered target encoding* to avoid target leakage.
        - For each categorical variable, the encoding for a given row is computed *using only information from previous rows* (ordered by a random permutation).
1. Initialise the model:
    - Start with a simple model, typically a constant value:
        - For regression: the mean of the target variable.
        - For binary classification: the log-odds of the positive class.
1. Calculate residuals (Negative Gradients):
    - For each iteration, compute the residuals, which are the negative gradients of the loss function with respect to the current predictions.
    - This step can be generalised to any differentiable loss function, not just the mean squared error.
1. Fit a new weak model to predict the residuals:
    - Train a weak learner (typically a symmetric or oblivious decision tree) to predict the residuals.
    - Tree splits involving categorical features:
        - The algorithm searches for the optimal way to split categories, often by grouping categories based on their target statistics.
        - CatBoost considers all possible binary splits of the categories, using their encoded values to find the best split.
1. Update the model:
    - Add the predictions from the new weak learners to the current model's predictions, scaled by a learning rate (shrinkage parameter).
    - The update is performed using the encoded categorical features and the ordered boosting mechanism.
1. Repeat steps 3-5 for a pre-defined number of iterations.
    - At each iteration, categorical encodings and tree splits are recalculated as necessary, always using the ordered approach to avoid target leakage.
1. Final prediction:
    - The final model is the sum of the initial prediction and the scaled outputs of all weak learners.
    - For categorical features, the final prediction includes the effect of native encoding and optimal splits discovered during training.
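
Steps 2 to 6 can be sketched for the simplest case: regression with squared error on a single numeric feature, using depth-1 trees (stumps) as weak learners. This is a simplified sketch under those assumptions; it leaves out categorical encoding and ordered boosting entirely, and `fit_stump`, `predict_stump`, and `boost` are hypothetical helper names rather than functions from this notebook.

```python
import numpy as np


def fit_stump(x, residuals):
    """Fit a depth-1 regression tree on one feature by minimising squared error."""
    values = np.unique(x)
    # Fallback when no split is possible: predict the mean residual everywhere.
    best = (np.inf, None, residuals.mean(), residuals.mean())
    for split in (values[:-1] + values[1:]) / 2:
        left, right = residuals[x <= split], residuals[x > split]
        score = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if score < best[0]:
            best = (score, split, left.mean(), right.mean())
    return best[1:]  # (split, left_value, right_value)


def predict_stump(stump, x):
    split, left_value, right_value = stump
    if split is None:
        return np.full(x.shape[0], left_value)
    return np.where(x <= split, left_value, right_value)


def boost(x, y, n_rounds=50, learning_rate=0.1):
    f0 = y.mean()                      # Step 2: constant initial model
    pred = np.full(y.shape[0], f0)
    stumps = []
    for _ in range(n_rounds):          # Step 6: repeat
        residuals = y - pred           # Step 3: negative gradient of 0.5 * (y - F)^2
        stump = fit_stump(x, residuals)                           # Step 4
        pred = pred + learning_rate * predict_stump(stump, x)     # Step 5
        stumps.append(stump)
    return f0, stumps, pred


x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_toy = np.array([1.0, 1.2, 0.9, 3.1, 2.9, 3.0])
f0, stumps, fitted = boost(x_toy, y_toy)
mse_start = np.mean((y_toy - f0) ** 2)
mse_end = np.mean((y_toy - fitted) ** 2)
print(f"MSE before boosting: {mse_start:.4f}, after: {mse_end:.4f}")
```

With squared error the residuals coincide with the negative gradients, which is why the loop fits each stump directly to `y - pred`. Other differentiable losses only change how `residuals` is computed.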

## 2. Loading Data
Retrieved from [GitHub - ivalada/Categorical-Naive-Bayes-Implementation/dataset](https://github.com/ivalada/Categorical-Naive-Bayes-Implementation/tree/main/dataset)

In [2]:
df = pd.read_csv("../../_datasets/material.csv", index_col=0)
df.head()

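|     | size | material  | color  | sleeves | demand |
|-----|------|-----------|--------|---------|--------|
| 0   | S    | nylon     | white  | long    | medium |
| 1   | XL   | polyester | cream  | short   | high   |
| 2   | S    | silk      | blue   | short   | medium |
| 3   | M    | cotton    | black  | short   | medium |
| 4   | XL   | polyester | orange | long    | medium |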


In [3]:
X = df.drop("demand", axis=1)
y = df["demand"]

## 3. Ordered Target Encoding
The following function `ordered_target_encode()` transforms a categorical column into a numerical feature by encoding each value based on the mean of the target variable for all *previous* occurrences of that category. Because a row's own target never enters its encoding, this limits target leakage, which matters for boosting algorithms that fit repeatedly to the same targets. Given a dataset with $n$ rows, the ordered target encoding for row $i$ is:

$$
\text{Encoded}_i = 
\begin{cases}
\frac{S_{x_i}^{<i}}{N_{x_i}^{<i}}, & \text{if } N_{x_i}^{<i} > 0 \\
\bar{y}, & \text{if } N_{x_i}^{<i} = 0
\end{cases}
$$

where:
- $x_i$: Categorical feature for each observation $i$.
- $y_i$: Target variable for each observation $i$.
- $S_{x_i}^{<i}$: Sum of target values for category $x_i$ in rows before $i$.
- $N_{x_i}^{<i}$: Count of category $x_i$ in rows before $i$.
- $\bar y$: Global mean of the target variable.

For example, suppose we have the following dataset:

| Row | Category | Target |
|-----|----------|--------|
| 1   | A        | 10     |
| 2   | B        | 20     |
| 3   | A        | 30     |
| 4   | B        | 40     |
| 5   | A        | 50     |

First, compute the global mean:

$$
\bar y = \dfrac{10+20+30+40+50}{5} = 30
$$

Now, compute the ordered target encoding for each row:

| Row | Category | Target | Previous Sums $S_{x_i}^{<i}$ | Previous Counts $N_{x_i}^{<i}$ | Encoded Value |
|-----|----------|--------|----------------------------|------------------------------|---------------|
| 1   | A        | 10     | 0                          | 0                            | 30            |
| 2   | B        | 20     | 0                          | 0                            | 30            |
| 3   | A        | 30     | 10                         | 1                            | 10            |
| 4   | B        | 40     | 20                         | 1                            | 20            |
| 5   | A        | 50     | 10 + 30 = 40               | 2                            | 20            |

- Row 1(A): No previous A, so use global mean $30$.
- Row 2(B): No previous B, so use global mean $30$.
- Row 3(A): One previous A (row 1, target $10$), so encoding is $\frac{10}{1} = 10$.
- Row 4(B): One previous B (row 2, target $20$), so encoding is $\frac{20}{1} = 20$.
- Row 5(A): Two previous A (row 1 and 3, targets $10$ and $30$), so encoding is $\frac{10 + 30}{2} = 20$.




In [4]:
def ordered_target_encode(df: pd.DataFrame, col: str, target: str) -> pd.Series:
    """
    Vectorised ordered target encoding for a single categorical column.

    Args:
        df: Input DataFrame containing the data.
        col: Name of the categorical column to encode.
        target: Name of the target column.

    Returns:
        Encoded column with ordered target encoding.
    """
    global_mean = df[target].mean()
    # Prepare cumulative sum and count per category
    cumsum = df.groupby(col)[target].cumsum() - df[target]
    cumcnt = df.groupby(col).cumcount()
    enc = cumsum / cumcnt.replace(0, np.nan)
    enc.fillna(global_mean, inplace=True)
    return enc


def apply_ordered_target_encoding(
    df: pd.DataFrame, cat_cols: list[str], target: str
) -> pd.DataFrame:
    """
    Apply ordered target encoding to all categorical columns in a DataFrame.

    Args:
        df: Input DataFrame containing the data.
        cat_cols: List of categorical column names to encode.
        target: Name of the target column.

    Returns:
        DataFrame with encoded categorical columns.
    """
    df_enc = df.copy()
    for col in cat_cols:
        df_enc[col] = ordered_target_encode(df_enc, col, target)
    return df_enc

In [5]:
sample_data = {
    "Category": ["A", "B", "A", "B", "A"],
    "Target": [10, 20, 30, 40, 50],
}
df_sample = pd.DataFrame(sample_data)

print(
    f"Ordered Target Encoding for 'Category':\n{ordered_target_encode(df_sample, 'Category', 'Target')}"
)

Ordered Target Encoding for 'Category':
0    30.0
1    30.0
2    10.0
3    20.0
4    20.0
dtype: float64
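
The Steps section mentions that rows are ordered by a random permutation, whereas `ordered_target_encode()` uses the DataFrame's existing row order. A possible variant that shuffles first is sketched below. It assumes the DataFrame has a unique index, and `ordered_target_encode_permuted` and its `seed` parameter are hypothetical additions rather than part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd


def ordered_target_encode_permuted(df, col, target, seed=0):
    """Ordered target encoding computed along a random permutation of the rows."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))
    shuffled = df.iloc[order]  # rows in their "virtual time" order

    # Sum and count of the target over earlier rows of the same category.
    prev_sum = shuffled.groupby(col)[target].cumsum() - shuffled[target]
    prev_cnt = shuffled.groupby(col).cumcount()

    enc = prev_sum / prev_cnt.replace(0, np.nan)
    enc = enc.fillna(shuffled[target].mean())  # first occurrence: global mean
    return enc.reindex(df.index)               # restore the original row order


df_demo = pd.DataFrame(
    {"Category": ["A", "B", "A", "B", "A"], "Target": [10, 20, 30, 40, 50]}
)
enc_demo = ordered_target_encode_permuted(df_demo, "Category", "Target", seed=0)
print(enc_demo)
```

The encoded values depend on the seed, since a different permutation changes which rows count as "previous" for each observation. In every permutation, the first row seen for each category falls back to the global mean.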


## 4. Finding the Best Split
The following function searches for the feature and threshold that split the data into two groups such that the size-weighted sum of the groups' target variances is minimised.

For each feature $f$, consider all possible split points $s$. For each split:

- Left group: $\mathcal{L} = \{i: X_{i, f} \leq s\}$
- Right group: $\mathcal{R} = \{i: X_{i, f} > s\}$

Compute the weighted variance:
\begin{align*}
    \text{score}(f, s) = |\mathcal{L}| \cdot \text{Var}(y_{\mathcal{L}}) + |\mathcal{R}| \cdot \text{Var}(y_{\mathcal{R}})
\end{align*}

Select the split with the lowest score.
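
To make the score concrete, the sketch below evaluates it for every midpoint of a four-sample toy feature. `split_score`, `x_demo`, and `y_demo` are hypothetical names and data used only for illustration.

```python
import numpy as np


def split_score(x, y, s):
    """Size-weighted variance of the target after splitting on x <= s."""
    left = x <= s
    right = ~left
    return left.sum() * np.var(y[left]) + right.sum() * np.var(y[right])


x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = np.array([0.0, 0.0, 1.0, 1.0])

scores_demo = {s: split_score(x_demo, y_demo, s) for s in (1.5, 2.5, 3.5)}
best_split_demo = min(scores_demo, key=scores_demo.get)
print(scores_demo, best_split_demo)
```

The split at $2.5$ separates the two target values perfectly and scores $0$, while the splits at $1.5$ and $3.5$ each leave one mixed group and score $3 \cdot \frac{2}{9} = \frac{2}{3}$.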

In [6]:
def find_best_split(
    X: NDArray[np.float64], y: NDArray[np.float64]
) -> Tuple[Optional[int], Optional[float], Optional[NDArray[np.bool_]], Optional[NDArray[np.bool_]]]:
    """
    Find the best feature and split value to minimise the weighted variance of the target.

    Args:
        X: Feature matrix of shape (n_samples, n_features).
        y: Target values of shape (n_samples,).

    Returns:
        Tuple: (best_feature, best_split, best_left, best_right)
            best_feature: Index of the best feature to split on.
            best_split: Value of the best split.
            best_left: Boolean mask for samples going to the left child.
            best_right: Boolean mask for samples going to the right child.
    """
    n_samples, n_features = X.shape
    best_feature, best_split, best_score, best_left, best_right = (
        None,
        None,
        np.inf,
        None,
        None,
    )
    for feature_idx in range(n_features):
        values = np.unique(X[:, feature_idx])
        if len(values) == 1:
            continue
        # Try all midpoints between sorted unique values
        sorted_vals = np.sort(values)
        splits = (sorted_vals[:-1] + sorted_vals[1:]) / 2
        for split_val in splits:
            left_idx = X[:, feature_idx] <= split_val
            right_idx = ~left_idx
            if not left_idx.any() or not right_idx.any():
                continue
            score = (
                np.var(y[left_idx]) * left_idx.sum()
                + np.var(y[right_idx]) * right_idx.sum()
            )
            if score < best_score:
                best_feature = feature_idx
                best_split = split_val
                best_score = score
                best_left = left_idx
                best_right = right_idx
    return best_feature, best_split, best_left, best_right

## 5. Building Trees
This function recursively constructs a regression tree by splitting the data at each node using the best split found, until stopping criteria are met. 

Stopping Criteria:
- Maximum depth reached.
- Not enough samples to split.
- All targets are identical.

If stopping criteria are met, return the mean of $y$ at the node. Otherwise, split the data at the best feature and split value, and recursively build left and right subtrees.
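
The resulting tree is a nested dictionary with the keys `feature`, `split`, `left`, and `right`, where a leaf is a plain float. The sketch below uses a hand-written tree in that format; `tree_demo`, `tree_depth`, and `count_leaves` are hypothetical and serve only to show the structure.

```python
# A hand-written tree in the nested-dictionary format: internal nodes are
# dicts, leaves are floats (the mean residual of the samples in that leaf).
tree_demo = {
    "feature": 0,
    "split": 2.5,
    "left": -0.3,
    "right": {"feature": 1, "split": 0.5, "left": 0.1, "right": 0.4},
}


def tree_depth(node):
    """Number of splits on the longest root-to-leaf path."""
    if not isinstance(node, dict):
        return 0
    return 1 + max(tree_depth(node["left"]), tree_depth(node["right"]))


def count_leaves(node):
    """Number of leaf values in the tree."""
    if not isinstance(node, dict):
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])


depth_demo = tree_depth(tree_demo)
leaves_demo = count_leaves(tree_demo)
print(depth_demo, leaves_demo)
```

Note that trees built this way choose a separate split at every node, so they are ordinary (non-symmetric) regression trees rather than the oblivious trees mentioned in the introduction.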

In [7]:
def build_tree(
    X: NDArray[np.float64],
    y: NDArray[np.float64],
    max_depth: int,
    min_samples_split: int,
    depth: int = 0,
) -> Dict[str, Any] | float:
    """
    Recursively build a decision tree for regression.

    Args:
        X: Feature matrix of shape (n_samples, n_features).
        y: Target values of shape (n_samples,).
        max_depth: Maximum depth of the tree.
        min_samples_split: Minimum number of samples required to split.
        depth: Current depth of the tree. Defaults to 0.

    Returns:
        A decision tree node represented as a dictionary or a leaf value (float).
    """
    n_samples = X.shape[0]
    if depth >= max_depth or n_samples < min_samples_split or np.all(y == y[0]):
        return np.mean(y)
    best_feature, best_split, best_left, best_right = find_best_split(X, y)
    if best_feature is None:
        return np.mean(y)
    return {
        "feature": best_feature,
        "split": best_split,
        "left": build_tree(
            X[best_left], y[best_left], max_depth, min_samples_split, depth + 1
        ),
        "right": build_tree(
            X[best_right], y[best_right], max_depth, min_samples_split, depth + 1
        ),
    }

## 6. Predictions (Trees)
The `predict_tree()` function predicts the output for a single data sample $x$ by traversing the decision tree.
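
As an illustration of the traversal, the sketch below records which branch is taken at each node of a small hand-written tree. `trace_path` and `tree_demo` are hypothetical; the comparison rule (`<=` goes left, otherwise right) matches the split convention used when the tree is built.

```python
import numpy as np

tree_demo = {
    "feature": 0,
    "split": 2.5,
    "left": -0.3,
    "right": {"feature": 1, "split": 0.5, "left": 0.1, "right": 0.4},
}


def trace_path(node, row):
    """Walk from the root to a leaf, returning the branches taken and the leaf value."""
    path = []
    while isinstance(node, dict):
        branch = "left" if row[node["feature"]] <= node["split"] else "right"
        path.append(branch)
        node = node[branch]
    return path, node


path_demo, leaf_demo = trace_path(tree_demo, np.array([3.0, 0.2]))
print(path_demo, leaf_demo)
```

For the sample `[3.0, 0.2]`, the first feature exceeds $2.5$, so the walk goes right; the second feature is at most $0.5$, so it then goes left and ends at the leaf value $0.1$.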

In [8]:
def predict_tree(node: Dict[str, Any] | float, row: NDArray[np.float64]) -> float:
    """
    Predict the target value for a single sample using the decision tree.

    Args:
        node: Decision tree node or leaf value.
        row: Feature values of the sample.

    Returns:
        Predicted target value.
    """
    while isinstance(node, dict):
        if row[node["feature"]] <= node["split"]:
            node = node["left"]
        else:
            node = node["right"]
    return node

## 7. Training Model
The `fit()` function fits a gradient boosting model using the multi-class log-loss (cross-entropy) as the loss function. At each step, the negative gradient (the residuals) is approximated by one regression tree per class.

Let $F_{i, k}$ be the current score (logit) for sample $i$ and class $k$. At each boosting iteration:
1. Compute softmax probabilities:
    $$
    P_{i, k} = \dfrac{e^{F_{i, k}}}{\sum^{K}_{l=1} e^{F_{i, l}}}
    $$
2. Compute residuals for each class:
    $$
    r_{i, k} = y_{i, k} - P_{i, k}
    $$
    where $y_{i, k}$ is $1$ if $y_i = k$, else $0$.
3. Fit a tree to $r_{i, k}$ for each class $k$.
4. Update:
    $$
    F_{i, k} \leftarrow F_{i, k} + \eta \cdot \text{tree}_k(X_i)
    $$
    where $\eta$ is the learning rate.
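
Steps 1 and 2 can be checked by hand on a tiny example. The sketch below uses two samples and three classes; `F_demo` and `y_demo` are made-up values, not taken from the dataset.

```python
import numpy as np

# Current logits for two samples and three classes (hypothetical values).
F_demo = np.array([[0.0, 0.0, 0.0],
                   [2.0, 1.0, 0.0]])
y_demo = np.array([2, 0])  # true class of each sample

# Step 1: softmax probabilities, row by row.
P_demo = np.exp(F_demo)
P_demo /= P_demo.sum(axis=1, keepdims=True)

# Step 2: residuals are the one-hot labels minus the probabilities.
Y_demo = np.eye(3)[y_demo]
R_demo = Y_demo - P_demo
print(R_demo)
```

For the first sample all logits are equal, so every class has probability $\frac{1}{3}$ and the residuals are $(-\frac{1}{3}, -\frac{1}{3}, \frac{2}{3})$. The residuals of each sample always sum to zero, because both the one-hot row and the probability row sum to one.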

In [9]:
def fit(
    X: NDArray[np.float64],
    y: NDArray[np.int64],
    n_classes: int,
    n_estimators: int,
    learning_rate: float,
    max_depth: int,
    min_samples_split: int,
) -> List[List[Any]]:
    """
    Fit a gradient boosting model for multi-class classification.

    Args:
        X: Feature matrix of shape (n_samples, n_features).
        y: Target labels of shape (n_samples,).
        n_classes: Number of classes.
        n_estimators: Number of boosting iterations.
        learning_rate: Learning rate for updates.
        max_depth: Maximum depth of each tree.
        min_samples_split: Minimum samples required to split a node.

    Returns:
        List of lists of trees for each boosting iteration and class.
    """
    N = X.shape[0]
    F = np.zeros((N, n_classes), dtype=np.float64)
    y_onehot = np.eye(n_classes)[y]
    trees = []
    for _ in tqdm(range(n_estimators)):
        trees_m = []
        P = np.exp(F - F.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        for k in range(n_classes):
            residual = y_onehot[:, k] - P[:, k]
            tree = build_tree(X, residual, max_depth, min_samples_split)
            update = np.array([predict_tree(tree, row) for row in X])
            F[:, k] += learning_rate * update
            trees_m.append(tree)
        trees.append(trees_m)
    return trees

## 8. Final Predictions
Final predictions using the ensemble of boosted trees.

- For each tree in the sequence and each class, accumulate the predictions (logits).
- Compute softmax probabilities from the logits:
    $$
        P_{i, k} = \dfrac{e^{F_{i, k}}}{\sum^{K}_{l=1} e^{F_{i, l}}}
    $$
- The predicted class is the one with the highest probability:
    $$
        \hat y_i = \arg \max_k P_{i, k}
    $$
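
The code in this notebook subtracts the row-wise maximum from the logits before exponentiating. This leaves the softmax unchanged, since the common factor cancels between numerator and denominator, but it keeps `np.exp` from overflowing when logits are large. A minimal sketch, with `softmax_stable` and `F_demo` as hypothetical names:

```python
import numpy as np


def softmax_stable(F):
    """Row-wise softmax that shifts each row by its maximum before exponentiating."""
    shifted = F - F.max(axis=1, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / exp_shifted.sum(axis=1, keepdims=True)


# The first row would overflow float64 with a naive np.exp(1000.0).
F_demo = np.array([[1000.0, 1001.0, 999.0],
                   [0.5, -0.2, 0.1]])
P_demo = softmax_stable(F_demo)
labels_demo = P_demo.argmax(axis=1)
print(P_demo, labels_demo)
```

Because the softmax is monotonic in the logits, taking the arg max of the probabilities gives the same class as taking the arg max of the logits directly; the probabilities are only needed when they are reported alongside the labels.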


In [10]:
def predict(
    X: NDArray[np.float64], trees: List[List[Any]], n_classes: int, learning_rate: float
) -> Tuple[NDArray[np.int64], NDArray[np.float64]]:
    """
    Predict class labels and probabilities using the fitted gradient boosting model.

    Args:
        X: Feature matrix of shape (n_samples, n_features).
        trees: List of lists of trees from the fit function.
        n_classes: Number of classes.
        learning_rate: Learning rate used during training.

    Returns:
        Tuple: (predicted_labels, predicted_probabilities)
            predicted_labels: Array of predicted class labels.
            predicted_probabilities: Array of predicted class probabilities.
    """
    N = X.shape[0]
    F = np.zeros((N, n_classes), dtype=np.float64)
    for trees_m in trees:
        for k, tree in enumerate(trees_m):
            update = np.array([predict_tree(tree, row) for row in X])
            F[:, k] += learning_rate * update
    P = np.exp(F - F.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return np.argmax(P, axis=1), P

In [11]:
features = ["size", "material", "color", "sleeves"]
target = "demand"

# Encode target as integer
target_map = {v: i for i, v in enumerate(df[target].unique())}
df["target_enc"] = df[target].map(target_map)

# Apply ordered target encoding
df_enc = apply_ordered_target_encoding(df, features, "target_enc")

# Prepare data for boosting
X = df_enc[features].values.astype(np.float64)
y = df["target_enc"].values.astype(np.int64)
n_classes = len(target_map)

# Fit model
trees = fit(
    X,
    y,
    n_classes,
    n_estimators=10,  # Increase for higher accuracy
    learning_rate=0.1,
    max_depth=4,
    min_samples_split=10,  # Increase to speed up on very large data
)

# Predict
y_pred, y_proba = predict(X, trees, n_classes, learning_rate=0.1)

# print("Predicted classes:", y_pred)
print(f"Accuracy: {np.mean(y == y_pred)}")

100%|██████████| 10/10 [07:21<00:00, 44.17s/it]


Accuracy: 0.8022


## 9. Evaluation Metrics
### Binary Confusion Matrix
In a confusion matrix, the terms True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) describe the classification performance for binary classification. 

|                     | Predicted Negative  | Predicted Positive  |
| ------------------- | ------------------- | ------------------- |
| **Actual Negative** | True Negative (TN)  | False Positive (FP) |
| **Actual Positive** | False Negative (FN) | True Positive (TP)  |


1. True Positive (TP): The number of instances correctly predicted as positive (e.g., a disease correctly identified).

2. True Negative (TN): The number of instances correctly predicted as negative (e.g., no disease correctly identified).

3. False Positive (FP): The number of instances incorrectly predicted as positive (e.g., predicting disease when there isn't any).

4. False Negative (FN): The number of instances incorrectly predicted as negative (e.g., missing a disease when it exists).

### Multi-Class Confusion Matrix
For multi-class classification, the concepts can be extended by treating one class as the "positive" class and all others as "negative" classes in a one-vs-all approach. Rows represent the actual classes (true labels), and columns represent the predicted classes. For a class $C$,
1. True Positive (TP): The count in the diagonal cell corresponding to class $C$ ($\text{matrix} [C][C]$).
2. False Positive (FP): The sum of the column for class $C$, excluding the diagonal ($\sum(\text{matrix} [:, C]) - \text{matrix} [C][C]$).
3. False Negative (FN): The sum of the row for class $C$, excluding the diagonal ($\sum(\text{matrix} [C, :]) - \text{matrix} [C][C]$).
4. True Negative (TN): All other cells not in the row or column for class $C$ ($\text{total} - (FP + FN + TP)$).

|                  | Predicted Class 0 | Predicted Class 1 | Predicted Class 2 |
| ---------------- | ----------------- | ----------------- | ----------------- |
| **True Class 0** | 5                 | 2                 | 0                 |
| **True Class 1** | 1                 | 6                 | 1                 |
| **True Class 2** | 0                 | 2                 | 7                 |


For Class 0:
- TP = 5 (diagonal element for Class 0)
- FP = 1 (sum of column 0 minus TP: 1 + 0)
- FN = 2 (sum of row 0 minus TP: 2 + 0)
- TN = 6 + 1 + 2 + 7 = 16 (all other cells not in row 0 or column 0)

For Class 1:
- TP = 6 (diagonal element for Class 1)
- FP = 4 (sum of column 1 minus TP: 2 + 2)
- FN = 2 (sum of row 1 minus TP: 1 + 1)
- TN = 5 + 0 + 0 + 7 = 12 (all other cells not in row 1 or column 1)
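
The same one-vs-all counts can be computed for every class at once with NumPy. The sketch below reuses the example matrix from the table above; the variable names are illustrative.

```python
import numpy as np

# Rows are true classes, columns are predicted classes.
cm_demo = np.array([[5, 2, 0],
                    [1, 6, 1],
                    [0, 2, 7]])

tp = np.diag(cm_demo)                 # diagonal cells
fp = cm_demo.sum(axis=0) - tp         # column sums minus the diagonal
fn = cm_demo.sum(axis=1) - tp         # row sums minus the diagonal
tn = cm_demo.sum() - (tp + fp + fn)   # everything else

for c in range(cm_demo.shape[0]):
    print(f"Class {c}: TP={tp[c]}, FP={fp[c]}, FN={fn[c]}, TN={tn[c]}")
```

This reproduces the values worked out above for Class 0 and Class 1, and gives TP = 7, FP = 1, FN = 2, TN = 14 for Class 2.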

In [12]:
def confusion_matrix(
    y_true: NDArray[np.int64], y_pred: NDArray[np.int64], class_names: Optional[List[str]] = None
) -> Tuple[NDArray[np.int64], List[str]]:
    """
    Calculate the confusion matrix.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.
        class_names: List of class names. Defaults to None.

    Returns:
        - Confusion matrix.
        - List of class names.
    """
    # Encode labels as integers
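    # Include predicted labels too, so a class that only appears in y_pred does not raise a KeyError
    unique_classes = np.unique(np.concatenate([y_true, y_pred]))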
    if class_names is None:
        class_names = [str(cls) for cls in unique_classes]
    class_to_index = {cls: i for i, cls in enumerate(unique_classes)}

    n_classes = len(unique_classes)
    matrix = np.zeros((n_classes, n_classes), dtype=int)

    for true, pred in zip(y_true, y_pred):
        true_idx = class_to_index[true]
        pred_idx = class_to_index[pred]
        matrix[true_idx][pred_idx] += 1

    return matrix, class_names

### Accuracy
Accuracy is the most common evaluation metric for classification problems, representing the proportion of correct predictions out of all predictions. It provides a simple measure of how often the classifier is correct across all classes.

\begin{align*}
\text{Accuracy} = \dfrac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Samples}}
\end{align*}

In [13]:
def accuracy(y_true: NDArray[np.int64], y_pred: NDArray[np.int64]) -> float:
    """
    Calculate the accuracy of predictions by comparing true and predicted labels.

    Args:
        y_true: Ground truth target values. Contains the actual class labels for each sample.
        y_pred: Estimated target as returned by a classifier. Contains the predicted class labels for each sample.
    Returns:
        Classification accuracy as a fraction (0.0 to 1.0).
    """
    return np.mean(y_true == y_pred)

### Precision
Precision measures the proportion of true positive predictions out of all positive predictions made by the classifier.

\begin{align*}
\text{Precision} = \dfrac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
\end{align*}

In [14]:
def precision(
    y_true: NDArray[np.int64], y_pred: NDArray[np.int64]
) -> NDArray[np.float64]:
    """
    Calculate precision for each class.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.

    Returns:
        Precision values for each class.
    """
    cm, _ = confusion_matrix(y_true, y_pred)
    return np.diag(cm) / (np.sum(cm, axis=0) + 1e-7)

### Recall
Recall measures the proportion of true positive predictions out of all actual positive cases.

\begin{align*}
\text{Recall} = \dfrac{\text{True Positives (TP)} }{\text{True Positives (TP)} + \text{False Negatives (FN)}}
\end{align*}

In [15]:
def recall(y_true: NDArray[np.int64], y_pred: NDArray[np.int64]) -> NDArray[np.float64]:
    """
    Calculate recall for each class.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.

    Returns:
        Recall values for each class.
    """
    cm, _ = confusion_matrix(y_true, y_pred)
    return np.diag(cm) / (np.sum(cm, axis=1) + 1e-7)

### F1-Score
The F1-Score is the harmonic mean of precision and recall.

\begin{align*}
\text{F1-Score} = 2 \times \dfrac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
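
As a worked example, the sketch below computes per-class precision, recall, and F1-score for the three-class confusion matrix used in the Multi-Class Confusion Matrix section. The variable names are illustrative, and the small epsilon that the notebook's functions add to the denominators is omitted because no row or column of this matrix is empty.

```python
import numpy as np

# Rows are true classes, columns are predicted classes.
cm_demo = np.array([[5, 2, 0],
                    [1, 6, 1],
                    [0, 2, 7]])

tp = np.diag(cm_demo)
precision_demo = tp / cm_demo.sum(axis=0)   # TP / (TP + FP): column sums
recall_demo = tp / cm_demo.sum(axis=1)      # TP / (TP + FN): row sums
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(np.round(precision_demo, 4))
print(np.round(recall_demo, 4))
print(np.round(f1_demo, 4))
```

For Class 0 this gives a precision of $\frac{5}{6}$, a recall of $\frac{5}{7}$, and an F1-score of $\frac{10}{13} \approx 0.769$, which sits between the two but closer to the smaller value, as expected of a harmonic mean.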

In [16]:
def f1_score(
    y_true: NDArray[np.int64], y_pred: NDArray[np.int64]
) -> NDArray[np.float64]:
    """
    Calculate F1-score for each class.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.

    Returns:
        F1-scores for each class.
    """
    prec = precision(y_true, y_pred)
    rec = recall(y_true, y_pred)
    return 2 * (prec * rec) / (prec + rec + 1e-7)

In [17]:
def evaluate(
    y_true: NDArray[np.int64], y_pred: NDArray[np.int64], class_names: List[str] = None
) -> Tuple[float, float, float, float, NDArray[np.int64]]:
    """
    Calculate evaluation metrics including accuracy, precision, recall, and F1-score for each class.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.
        class_names: List of class names. Defaults to None.

    Returns:
        - Overall accuracy.
        - Average precision.
        - Average recall.
        - Average F1-score.
        - Confusion matrix.
    """
    cm, class_names = confusion_matrix(y_true, y_pred, class_names)
    acc = accuracy(y_true, y_pred)
    prec = precision(y_true, y_pred)
    rec = recall(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    # print("Class\tPrecision\tRecall\tF1-Score")
    # for i, class_name in enumerate(class_names):
    #     print(f"{class_name}\t{prec[i]:.4f}\t\t{rec[i]:.4f}\t{f1[i]:.4f}")
    return acc, np.mean(prec), np.mean(rec), np.mean(f1), cm

## 10. Encapsulation

In [18]:
class CustomCatBoost:
    def __init__(
        self,
        n_estimators: int = 100,
        learning_rate: float = 0.1,
        max_depth: int = 3,
        min_samples_split: int = 2,
    ) -> None:
        """
        Initialise the CustomCatBoost model with hyperparameters.

        Args:
            n_estimators: Number of boosting iterations.
            learning_rate: Learning rate for updates.
            max_depth: Maximum depth of each tree.
            min_samples_split: Minimum samples required to split a node.
        """
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.trees = []

    def ordered_target_encode(
        self, df: pd.DataFrame, col: str, target: str
    ) -> pd.Series:
        """
        Vectorised ordered target encoding for a single categorical column.

        Args:
            df: Input DataFrame containing the data.
            col: Name of the categorical column to encode.
            target: Name of the target column.

        Returns:
            Encoded column with ordered target encoding.
        """
        global_mean = df[target].mean()
        cumsum = df.groupby(col)[target].cumsum() - df[target]
        cumcnt = df.groupby(col).cumcount()
        enc = cumsum / cumcnt.replace(0, np.nan)
        enc.fillna(global_mean, inplace=True)
        return enc

    def apply_ordered_target_encoding(
        self, df: pd.DataFrame, cat_cols: List[str], target: str
    ) -> pd.DataFrame:
        """
        Apply ordered target encoding to all categorical columns in a DataFrame.

        Args:
            df: Input DataFrame containing the data.
            cat_cols: List of categorical column names to encode.
            target: Name of the target column.

        Returns:
            DataFrame with encoded categorical columns.
        """
        df_enc = df.copy()
        for col in cat_cols:
            df_enc[col] = self.ordered_target_encode(df_enc, col, target)
        return df_enc

    def find_best_split(
        self, X: NDArray[np.float64], y: NDArray[np.float64]
    ) -> Tuple[Optional[int], Optional[float], NDArray[np.bool_], NDArray[np.bool_]]:
        """
        Find the best feature and split value to minimise the weighted variance of the target.

        Args:
            X: Feature matrix of shape (n_samples, n_features).
            y: Target values of shape (n_samples,).

        Returns:
                best_feature: Index of the best feature to split on.
                best_split: Value of the best split.
                best_left: Boolean mask for samples going to the left child.
                best_right: Boolean mask for samples going to the right child.
        """
        n_samples, n_features = X.shape
        best_feature, best_split, best_score, best_left, best_right = (
            None,
            None,
            np.inf,
            None,
            None,
        )
        for feature_idx in range(n_features):
            values = np.unique(X[:, feature_idx])
            if len(values) == 1:
                continue
            sorted_vals = np.sort(values)
            splits = (sorted_vals[:-1] + sorted_vals[1:]) / 2
            for split_val in splits:
                left_idx = X[:, feature_idx] <= split_val
                right_idx = ~left_idx
                if not left_idx.any() or not right_idx.any():
                    continue
                score = (
                    np.var(y[left_idx]) * left_idx.sum()
                    + np.var(y[right_idx]) * right_idx.sum()
                )
                if score < best_score:
                    best_feature = feature_idx
                    best_split = split_val
                    best_score = score
                    best_left = left_idx
                    best_right = right_idx
        return best_feature, best_split, best_left, best_right

    def build_tree(
        self, X: NDArray[np.float64], y: NDArray[np.float64], depth: int = 0
    ) -> Dict[str, Any] | float:
        """
        Recursively build a decision tree for regression.

        Args:
            X: Feature matrix of shape (n_samples, n_features).
            y: Target values of shape (n_samples,).
            depth: Current depth of the tree. Defaults to 0.

        Returns:
            A decision tree node represented as a dictionary or a leaf value (float).
        """
        n_samples = X.shape[0]
        if (
            depth >= self.max_depth
            or n_samples < self.min_samples_split
            or np.all(y == y[0])
        ):
            return np.mean(y)
        best_feature, best_split, best_left, best_right = self.find_best_split(X, y)
        if best_feature is None:
            return np.mean(y)
        return {
            "feature": best_feature,
            "split": best_split,
            "left": self.build_tree(X[best_left], y[best_left], depth + 1),
            "right": self.build_tree(X[best_right], y[best_right], depth + 1),
        }

    def predict_tree(
        self, node: Dict[str, Any] | float, row: NDArray[np.float64]
    ) -> float:
        """
        Predict the target value for a single sample using the decision tree.

        Args:
            node: Decision tree node or leaf value.
            row: Feature values of the sample.

        Returns:
            Predicted target value.
        """
        while isinstance(node, dict):
            if row[node["feature"]] <= node["split"]:
                node = node["left"]
            else:
                node = node["right"]
        return node

    def fit(
        self,
        X: NDArray[np.float64],
        y: NDArray[np.int64],
        n_classes: int,
        cat_cols: Optional[List[str]] = None,
        df: Optional[pd.DataFrame] = None,
        target_col: Optional[str] = None,
    ) -> None:
        """
        Fit the gradient boosting model for multi-class classification.

        Args:
            X: Feature matrix of shape (n_samples, n_features).
            y: Target labels of shape (n_samples,).
            n_classes: Number of classes.
            cat_cols: List of categorical columns for ordered target encoding. Defaults to None.
            df: DataFrame containing original data for encoding. Defaults to None.
            target_col: Target column name in DataFrame. Defaults to None.
        """
        if cat_cols is not None and df is not None and target_col is not None:
            df_enc = self.apply_ordered_target_encoding(df, cat_cols, target_col)
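            # Note: the column position is looked up in `df`, so this assumes X
            # shares df's column order. X is overwritten in place.
            for col in cat_cols:
                if col in df_enc.columns:
                    X[:, df.columns.get_loc(col)] = df_enc[col].values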

        N = X.shape[0]
        F = np.zeros((N, n_classes), dtype=np.float64)
        y_onehot = np.eye(n_classes)[y]
        self.trees = []
        for _ in tqdm(range(self.n_estimators)):
            trees_m = []
            P = np.exp(F - F.max(axis=1, keepdims=True))
            P /= P.sum(axis=1, keepdims=True)
            for k in range(n_classes):
                residual = y_onehot[:, k] - P[:, k]
                tree = self.build_tree(X, residual)
                update = np.array([self.predict_tree(tree, row) for row in X])
                F[:, k] += self.learning_rate * update
                trees_m.append(tree)
            self.trees.append(trees_m)

    def predict(
        self, X: NDArray[np.float64], n_classes: int
    ) -> Tuple[NDArray[np.int64], NDArray[np.float64]]:
        """
        Predict class labels and probabilities using the fitted gradient boosting model.

        Args:
            X: Feature matrix of shape (n_samples, n_features).
            n_classes: Number of classes.

        Returns:
            A tuple (predicted_labels, predicted_probabilities): the predicted
            class labels of shape (n_samples,) and the class probabilities of
            shape (n_samples, n_classes).
        """
        N = X.shape[0]
        F = np.zeros((N, n_classes), dtype=np.float64)
        for trees_m in self.trees:
            for k, tree in enumerate(trees_m):
                update = np.array([self.predict_tree(tree, row) for row in X])
                F[:, k] += self.learning_rate * update
        P = np.exp(F - F.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        return np.argmax(P, axis=1), P

    def fit_predict(
        self,
        X: NDArray[np.float64],
        y: NDArray[np.int64],
        n_classes: int,
        cat_cols: Optional[List[str]] = None,
        df: Optional[pd.DataFrame] = None,
        target_col: Optional[str] = None,
    ) -> Tuple[NDArray[np.int64], NDArray[np.float64]]:
        """
        Fit the model and predict class labels and probabilities.

        Args:
            X: Feature matrix of shape (n_samples, n_features).
            y: Target labels of shape (n_samples,).
            n_classes: Number of classes.
            cat_cols: List of categorical columns for ordered target encoding. Defaults to None.
            df: DataFrame containing original data for encoding. Defaults to None.
            target_col: Target column name in DataFrame. Defaults to None.

        Returns:
            A tuple (predicted_labels, predicted_probabilities): the predicted
            class labels of shape (n_samples,) and the class probabilities of
            shape (n_samples, n_classes).
        """
        self.fit(X, y, n_classes, cat_cols, df, target_col)
        return self.predict(X, n_classes)
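
The boosting loop in `fit` fits one regression tree per class to the residual `y_onehot - P`. This residual is the negative gradient of the multi-class log loss with respect to the raw scores `F`. The sketch below checks that numerically on toy data; the helper names `softmax` and `log_loss` and the toy arrays are assumptions made for illustration and are not part of `CustomCatBoost`.

```python
import numpy as np


def softmax(F: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the same max-shift used in `fit` and `predict`."""
    P = np.exp(F - F.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)


def log_loss(F: np.ndarray, y_onehot: np.ndarray) -> float:
    """Total multi-class cross-entropy for raw scores F."""
    return float(-np.sum(y_onehot * np.log(softmax(F))))


rng = np.random.default_rng(0)
F = rng.normal(size=(5, 3))      # toy raw scores: 5 samples, 3 classes
y = np.array([0, 2, 1, 1, 0])    # toy integer labels
y_onehot = np.eye(3)[y]

# Residual used as the tree target in the boosting loop
residual = y_onehot - softmax(F)

# Central finite-difference estimate of the negative gradient -dL/dF
eps = 1e-6
neg_grad = np.zeros_like(F)
for i in range(F.shape[0]):
    for k in range(F.shape[1]):
        F_plus, F_minus = F.copy(), F.copy()
        F_plus[i, k] += eps
        F_minus[i, k] -= eps
        diff = log_loss(F_plus, y_onehot) - log_loss(F_minus, y_onehot)
        neg_grad[i, k] = -diff / (2 * eps)

print(np.allclose(residual, neg_grad, atol=1e-5))
```

Each row of the residual sums to zero, because both the one-hot target and the softmax probabilities sum to one.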

In [None]:
cat_features = ["size", "material", "color", "sleeves"]
target = "demand"

# Encode target as integer
target_map = {v: i for i, v in enumerate(df[target].unique())}
df["target_enc"] = df[target].map(target_map)

# Instantiate model
model_custom = CustomCatBoost(
    n_estimators=20, learning_rate=0.2, max_depth=4, min_samples_split=5
)

# Apply ordered target encoding
df_enc = model_custom.apply_ordered_target_encoding(df, cat_features, "target_enc")

# Prepare data
X = df_enc[cat_features].values.astype(np.float64)
y = df["target_enc"].values.astype(np.int64)
n_classes = len(target_map)

# Fit model (X is already encoded above, so the in-fit encoding is skipped)
model_custom.fit(X, y, n_classes, cat_cols=None, df=None, target_col=None)

# Predict and evaluate on the training data
y_pred, y_proba = model_custom.predict(X, n_classes)
acc_custom, prec_custom, rec_custom, f1_custom, cm_custom = evaluate(y, y_pred)
print(f"Accuracy (Custom): {acc_custom:.4f}")
print(f"Precision (Custom): {prec_custom:.4f}")
print(f"Recall (Custom): {rec_custom:.4f}")
print(f"F1-Score (Custom): {f1_custom:.4f}")
print(f"Confusion Matrix (Custom):\n{cm_custom}")
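
`predict_tree` walks the tree once per row, so both `fit` and `predict` call it inside a Python list comprehension. One possible alternative, sketched here, routes all rows through a node at once with a boolean mask, which may reduce Python-level overhead. The name `predict_tree_batch` and the toy tree are hypothetical and not part of `CustomCatBoost`; the node layout (`feature`, `split`, `left`, `right`, float leaves) follows `build_tree`.

```python
from typing import Any, Dict, Union

import numpy as np

Node = Union[Dict[str, Any], float]


def predict_tree_row(node: Node, row: np.ndarray) -> float:
    """Row-wise traversal, mirroring `predict_tree`."""
    while isinstance(node, dict):
        if row[node["feature"]] <= node["split"]:
            node = node["left"]
        else:
            node = node["right"]
    return float(node)


def predict_tree_batch(node: Node, X: np.ndarray) -> np.ndarray:
    """Hypothetical batch variant: split the rows with a mask at each node."""
    if not isinstance(node, dict):
        # Leaf: every row that reached this node gets the leaf value
        return np.full(X.shape[0], node, dtype=np.float64)
    goes_left = X[:, node["feature"]] <= node["split"]
    out = np.empty(X.shape[0], dtype=np.float64)
    out[goes_left] = predict_tree_batch(node["left"], X[goes_left])
    out[~goes_left] = predict_tree_batch(node["right"], X[~goes_left])
    return out


# Toy tree in the same dictionary layout that `build_tree` returns
toy_tree = {
    "feature": 0,
    "split": 0.5,
    "left": -0.2,
    "right": {"feature": 1, "split": 0.3, "left": 0.1, "right": 0.4},
}
rng = np.random.default_rng(0)
X_toy = rng.random((8, 2))

row_wise = np.array([predict_tree_row(toy_tree, row) for row in X_toy])
batched = predict_tree_batch(toy_tree, X_toy)
print(np.allclose(row_wise, batched))
```

Both versions apply the same `<=` rule at every node, so they should agree on any input; only the order of iteration differs.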

## 11. Comparison with catboost 

In [None]:
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,  # noqa: F811
    confusion_matrix,  # noqa: F811
    classification_report,
)


df_cb = pd.read_csv("../../_datasets/material.csv", index_col=0)

cat_features = ["size", "material", "color", "sleeves"]
target = "demand"

# Prepare features and target
X_cb = df_cb.drop("demand", axis=1)
y_cb = df_cb["demand"]


# Create CatBoost Pool (optional, but recommended for categorical data)
train_pool = Pool(X_cb, y_cb, cat_features=cat_features)

# Initialise and fit the classifier
model = CatBoostClassifier(
    iterations=20,
    learning_rate=0.2,
    depth=4,
    loss_function="MultiClass",
    eval_metric="MultiClass",
    verbose=0,
    random_seed=42,
)
model.fit(train_pool)

# Predict class labels and probabilities
y_pred = model.predict(X_cb).flatten()
preds_proba = model.predict_proba(X_cb)

# Calculate evaluation metrics
acc_cb = accuracy_score(y_cb, y_pred)
prec_cb = precision_score(y_cb, y_pred, average="weighted")
rec_cb = recall_score(y_cb, y_pred, average="weighted")
f1_cb = f1_score(y_cb, y_pred, average="weighted")
cm_cb = confusion_matrix(y_cb, y_pred)

print(f"Accuracy (Custom): {acc_custom:.4f}")
print(f"Precision (Custom): {prec_custom:.4f}")
print(f"Recall (Custom): {rec_custom:.4f}")
print(f"F1-Score (Custom): {f1_custom:.4f}")
print(f"Confusion Matrix (Custom):\n{cm_custom}")
print("----------")
print(f"Accuracy (CatBoost): {acc_cb:.4f}")
print(f"Precision (CatBoost): {prec_cb:.4f}")
print(f"Recall (CatBoost): {rec_cb:.4f}")
print(f"F1-Score (CatBoost): {f1_cb:.4f}")
print(f"Confusion Matrix (CatBoost):\n{cm_cb}")
print(classification_report(y_cb, y_pred))
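
When reading the two confusion matrices side by side, the class order matters. The custom model numbers classes in order of first appearance (via `target_map`), whereas scikit-learn's `confusion_matrix` sorts the labels unless a `labels` argument is given. The sketch below shows one way to map integer predictions back to label names and count them in an explicit order. The class names, the toy arrays, and the `confusion` helper are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical class names, listed in order of first appearance in the data
labels_in_order = ["medium", "high", "low"]
target_map = {v: i for i, v in enumerate(labels_in_order)}
inverse_map = {i: v for v, i in target_map.items()}

# Toy integer-encoded targets and predictions
y_true_int = np.array([0, 1, 2, 1, 0])
y_pred_int = np.array([0, 2, 2, 1, 0])

# Map integers back to the original label names
y_true_lbl = [inverse_map[int(i)] for i in y_true_int]
y_pred_lbl = [inverse_map[int(i)] for i in y_pred_int]


def confusion(y_true, y_pred, labels):
    """Count (true, predicted) pairs with rows and columns in `labels` order."""
    index = {lab: j for j, lab in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm


# Alphabetical order, matching scikit-learn's default for string labels
sorted_labels = sorted(labels_in_order)
cm_sorted = confusion(y_true_lbl, y_pred_lbl, sorted_labels)
print(sorted_labels)
print(cm_sorted)
```

With both matrices indexed by the same label list, each cell refers to the same pair of classes in either model.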

## 12. References
1. Andreas Mueller. (2020). *Applied ML 2020 - 08 - Gradient Boosting* [YouTube Video]. <br>
https://www.youtube.com/watch?v=yrTW5YTmFjw

1. Artem Oppermann. (2023). *What is CatBoost?* <br>
https://builtin.com/machine-learning/catboost

1. Bex Tuychiev. (2023). *A Guide to The Gradient Boosting Algorithm.* <br>
https://www.datacamp.com/tutorial/guide-to-the-gradient-boosting-algorithm

1. CatBoost. (n.d.). *Quick start.* <br>
https://catboost.ai/docs/en/concepts/python-quickstart

1. DataMListic. (2023). *Gradient Boosting with Regression Trees Explained* [YouTube Video]. <br>
https://youtu.be/KXOTSkPL2X4

1. GeeksforGeeks. (2025). *CatBoost in Machine Learning.* <br>
https://www.geeksforgeeks.org/machine-learning/catboost-ml/

1. GeeksforGeeks. (2024). *CatBoost's Categorical Encoding: One-Hot vs. Target Encoding.* <br>
https://www.geeksforgeeks.org/machine-learning/catboosts-categorical-encoding-one-hot-vs-target-encoding/

1. M Iqbal. (2025). *CatBoost Explained: Intuition, Advantages, and Math Behind Classification & Regression 🚀* [YouTube Video]. <br>
https://youtu.be/DGwLsx47Quc

1. StatQuest with Josh Starmer. (2023). *CatBoost Part 1: Ordered Target Encoding* [YouTube Video]. <br>
https://youtu.be/2xudPOBz-vs

1. StatQuest with Josh Starmer. (2023). *CatBoost Part 2: Building and Using Trees* [YouTube Video]. <br>
https://youtu.be/3Bg2XRFOTzg

1. StatQuest with Josh Starmer. (2019). *Gradient Boost Part 1 (of 4): Regression Main Ideas* [YouTube Video]. <br>
https://youtu.be/3CC4N4z3GJc

1. StatQuest with Josh Starmer. (2019). *Gradient Boost Part 2 (of 4): Regression Details* [YouTube Video]. <br>
https://youtu.be/2xudPOBz-vs

1. Terence Parr and Jeremy Howard. (n.d.). *How to explain gradient boosting.* <br>
https://explained.ai/gradient-boosting/index.html

1. Tomonori Masui. (2022). *All You Need to Know about Gradient Boosting Algorithm − Part 1. Regression.* <br>
https://medium.com/data-science/all-you-need-to-know-about-gradient-boosting-algorithm-part-1-regression-2520a34a502