# Stochastic Gradient Boosting from Scratch
***

In [1]:
import numpy as np
import pandas as pd
from typing import Tuple, List, Dict, Any
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, root_mean_squared_error, mean_absolute_error, r2_score
from numpy.typing import NDArray

## 1. Introduction
This notebook is an extension of [Gradient Boosting Machine from Scratch](https://github.com/tsu76i/DS-playground/tree/main/2.%20Building%20ML%20Models%20From%20Scratch/2.1%20Supervised%20Learning/2.1.6%20Boosting/gradient_boosting_machine.ipynb).


**Gradient Boosting Machine (GBM)** is an ensemble machine learning model that builds a strong predictive model by sequentially combining multiple weak models (typically decision trees) in a stage-wise manner. The core idea is to iteratively add new models that correct the errors made by the existing ensemble, thereby improving overall predictive accuracy.

Suppose we have a dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ are the features and $y_i$ are the target values. The goal of gradient boosting is to find a function $F(x)$ that minimises a given differentiable loss function $L(y, F(x))$. The model is built as an additive expansion:

\begin{align*}
    F(x) = F_0(x) + \sum^{M}_{m=1}\gamma_m h_m(x)
\end{align*}

where:
- $F_0(x)$: Initial model (e.g., the mean of $y$).
- $\gamma_m$: Weight (step size) for the $m$-th weak learner, typically determined by minimising the loss function along the direction of $h_m(x)$.
- $M$: Number of boosting iterations (i.e., the number of weak learners).
- $h_m(x)$: Prediction from the $m$-th weak learner (e.g., a decision tree).
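
As a worked case, assume the squared-error loss $L(y, F) = \tfrac{1}{2}(y - F)^2$ and write the shrinkage factor (learning rate) as $\nu$; both are illustrative choices rather than something fixed by the definitions above. Under these assumptions the negative gradient at stage $m$ reduces to the ordinary residual, and each stage adds one scaled weak learner:

\begin{align*}
    r_{im} &= -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}} = y_i - F_{m-1}(x_i) \\
    F_m(x) &= F_{m-1}(x) + \nu \, \gamma_m h_m(x)
\end{align*}

Setting $\nu = 1$ and unrolling the recursion from $m = 1$ to $M$ recovers the additive form of $F(x)$ given above.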


**Stochastic Gradient Boosting (SGB)** extends gradient boosting by injecting randomness into the training process, which tends to reduce overfitting and improve generalisation. The randomness enters in two places:

- **Subsampling the training data**: Instead of training each new tree on the entire dataset, a random subset (usually 40 - 80%) is selected *without replacement* for each iteration.
- **Feature subsampling**: At each split in a tree, only a random subset of features is considered, further increasing diversity among the trees.
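
The two sampling mechanisms above can be sketched with NumPy as follows. This is only an illustration: the sizes, fractions and variable names (`n_rows`, `row_fraction`, and so on) are hypothetical and are not taken from the implementation later in this notebook.

```python
import numpy as np

# Hypothetical sizes and fractions, chosen only for illustration.
n_rows, n_cols = 10, 5
row_fraction, col_fraction = 0.5, 0.8

rng = np.random.default_rng(0)

# Row subsampling: drawn once per boosting iteration, without replacement,
# so no training row appears twice in the same subsample.
row_idx = rng.choice(n_rows, size=int(row_fraction * n_rows), replace=False)

# Feature subsampling: drawn again at every candidate split of a tree.
col_idx = rng.choice(n_cols, size=int(col_fraction * n_cols), replace=False)

print(sorted(row_idx.tolist()))
print(sorted(col_idx.tolist()))
```

Because both draws use `replace=False`, each subsample contains distinct indices, which is what separates this scheme from bootstrap sampling as used in bagging.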

### Steps
1. Initialise the model
    - Start with a simple model, typically a constant value:
        - For regression: the mean of the target variable.
        - For binary classification: the log-odds of the positive class.
1. Calculate residuals (Negative Gradients)
    - For each iteration, compute the residuals, which are the negative gradients of the loss function with respect to the current predictions.
    - This step can be generalised to any differentiable loss function, not just the mean squared error.
1. Fit a new weak model to predict the residuals.
    - **At each boosting iteration, randomly select a subset of the training data without replacement.**
    - Train a weak learner (typically a shallow decision tree) to predict the residuals, **but only using the selected subsample**.
    - **The subsample fraction (e.g., 40 - 80%) is a key hyperparameter.**
    - The weak learner focuses on correcting the errors made by the current ensemble.
    - **At each node split, randomly select a subset of features to consider**.
1. Update the model
    - Add the predictions from the new weak learner to the current model's predictions, scaled by a learning rate (shrinkage parameter).
    - This update moves the ensemble closer to the true values by correcting previous errors.
1. Repeat steps 2-4 for a pre-defined number of iterations
1. Final prediction
    - The sum of the initial prediction and the scaled outputs of all weak learners.
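
The steps above can be sketched in a few lines. This is a minimal illustration, assuming squared-error loss and borrowing scikit-learn's `DecisionTreeRegressor` as the weak learner instead of the from-scratch tree built later in this notebook. The function name `sgb_fit_predict` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def sgb_fit_predict(X, y, n_rounds=50, learning_rate=0.1,
                    subsample=0.8, max_features=0.8, seed=0):
    """Illustrative stochastic gradient boosting loop for squared-error loss."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_sub = max(1, int(subsample * n))
    # Step 1: constant initial model (mean of the target).
    init = float(np.mean(y))
    pred = np.full(n, init)
    trees = []
    for _ in range(n_rounds):
        # Step 2: residuals, i.e. the negative gradient of squared error.
        residuals = y - pred
        # Step 3: fit a shallow tree on a random row subsample; the tree
        # considers a random subset of features at each split.
        idx = rng.choice(n, size=n_sub, replace=False)
        tree = DecisionTreeRegressor(max_depth=2, max_features=max_features,
                                     random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], residuals[idx])
        # Step 4: shrink the tree's output and add it to the ensemble.
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    # Step 6: `pred` is the initial value plus all scaled tree outputs.
    return init, trees, pred


# Synthetic data, used only to exercise the loop.
rng_demo = np.random.default_rng(1)
X_demo = rng_demo.normal(size=(200, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng_demo.normal(scale=0.1, size=200)

init, trees, fitted = sgb_fit_predict(X_demo, y_demo)
baseline_mse = np.mean((y_demo - init) ** 2)
boosted_mse = np.mean((y_demo - fitted) ** 2)
print(f'Constant model MSE: {baseline_mse:.3f}, after boosting: {boosted_mse:.3f}')
```

Note that each tree is fitted on the subsample only, while the ensemble's predictions are updated for every row, so the next round's residuals are always computed on the full training set.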

## 2. Loading Data
Retrieved from [GitHub - YBI Foundation](https://github.com/YBI-Foundation/Dataset/blob/main/Admission%20Chance.csv)

In [2]:
df = pd.read_csv(
    'https://raw.githubusercontent.com/YBI-Foundation/Dataset/refs/heads/main/Admission%20Chance.csv')
df.head()

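|   | Serial No | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |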


In [3]:
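# Features: every column except the last. Note that 'Serial No' is a row
# identifier rather than a predictor; it is retained here as a feature.
X = df.iloc[:, :-1].values
# Target: the last column (chance of admission)
y = df.iloc[:, -1].values
feature_names = df.columns[:-1].tolist()  # All columns except the last one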

# Check the shape of the data
print(f'Features shape: {X.shape}')
print(f'Target shape: {y.shape}')
print(f'Features: \n{feature_names}')

Features shape: (400, 8)
Target shape: (400,)
Features: 
['Serial No', 'GRE Score', 'TOEFL Score', 'University Rating', ' SOP', 'LOR ', 'CGPA', 'Research']


## 3. Evaluation Metrics
We will use the same evaluation metrics as [Gradient Boosting Machine from Scratch](https://github.com/tsu76i/DS-playground/tree/main/2.%20Building%20ML%20Models%20From%20Scratch/2.1%20Supervised%20Learning/2.1.6%20Boosting/gradient_boosting_machine.ipynb).
### Mean Squared Error (MSE)

In [4]:
def calculate_MSE(y_true: pd.Series, y_pred: NDArray[np.float64]) -> float:
    return np.mean((y_true - y_pred) ** 2)

### Root Mean Squared Error (RMSE)

In [5]:
def calculate_RMSE(y_true: pd.Series, y_pred: NDArray[np.float64]) -> float:
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

### Mean Absolute Error (MAE)

In [6]:
def calculate_MAE(y_true: pd.Series, y_pred: NDArray[np.float64]) -> float:
    return np.mean(np.abs(y_true - y_pred))

<a id="r-squared"></a>
### R-Squared($R^2$)
$\hat y$: Predicted target values.
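
Written out, with $\bar y$ denoting the mean of the true targets (a symbol introduced here for illustration), the coefficient of determination is:

\begin{align*}
    R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat y_i)^2}{\sum_{i=1}^{n}(y_i - \bar y)^2}
\end{align*}

The numerator is the residual sum of squares and the denominator is the total sum of squares, so $R^2 = 1$ for a perfect fit and $R^2 = 0$ for a model that always predicts $\bar y$.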

In [7]:
def calculate_r2(y_true: pd.Series, y_pred: NDArray[np.float64]) -> float:
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
    ss_residual = np.sum((y_true - y_pred) ** 2)
    r2 = 1 - (ss_residual / ss_total)
    return r2

In [8]:
def evaluate(y_true: pd.Series, y_pred: NDArray[np.float64]) -> Tuple[float, float, float, float]:
    """
    Calculate and return evaluation metrics for a regression model, including MSE, RMSE, MAE, and R-squared.

    Args:
        y_true: True target values.
        y_pred: Predicted target values.

    Returns:
        - mse: Mean Squared Error (MSE), indicating the average of the squared differences between predicted and true values.
        - rmse: Root Mean Squared Error (RMSE), indicating the standard deviation of the residuals.
        - mae: Mean Absolute Error (MAE), representing the average absolute difference between predicted and true values.
        - r2: R-squared (coefficient of determination), showing the proportion of variance in the dependent variable that is predictable from the independent variable(s).
    """
    mse = calculate_MSE(y_true, y_pred)
    rmse = calculate_RMSE(y_true, y_pred)
    mae = calculate_MAE(y_true, y_pred)
    r2 = calculate_r2(y_true, y_pred)
    return mse, rmse, mae, r2
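
As a quick sanity check, the same formulas can be compared against scikit-learn on a handful of values. The arrays below are made up purely for illustration, and the formulas are restated inline so that the snippet runs on its own.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up values, used only to exercise the formulas.
y_true_demo = np.array([0.9, 0.7, 0.8, 0.6])
y_pred_demo = np.array([0.85, 0.75, 0.7, 0.65])

errors = y_true_demo - y_pred_demo
mse_demo = np.mean(errors ** 2)
rmse_demo = np.sqrt(mse_demo)
mae_demo = np.mean(np.abs(errors))
ss_total = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1 - np.sum(errors ** 2) / ss_total

# The hand-written formulas should agree with scikit-learn's implementations.
print(np.isclose(mse_demo, mean_squared_error(y_true_demo, y_pred_demo)))
print(np.isclose(rmse_demo, np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))))
print(np.isclose(mae_demo, mean_absolute_error(y_true_demo, y_pred_demo)))
print(np.isclose(r2_demo, r2_score(y_true_demo, y_pred_demo)))
```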

## 4. From Gradient Boosting to Stochastic Gradient Boosting
Stochastic Gradient Boosting shares its core functionality with classical gradient boosting, except for two changes:
1. Random row subsampling (samples).
    - At each boosting iteration, instead of using the entire dataset to fit the new tree, randomly select a fraction of the data (e.g., 40 - 80%) *without replacement* and fit the tree only on that subset.
    - Add a new `sub_samples` parameter (default = $0.8$).
    - In the `fit()` method, before building each tree, randomly select a subset of the data using `np.random.choice`.
    - Use only the selected subset (and its residuals) to fit the tree.
2. Random column subsampling (features).
    - At each split in the tree, use only a random subset of features rather than all features. This is especially useful for high-dimensional data and further increases model diversity.
    - Add a new `sub_features` parameter (default = $0.8$).
    - In the `_best_split()` method, at each node, randomly select a subset of features to consider for splitting.

In [9]:
import numpy as np
from typing import Any, Dict, List, Tuple
from numpy.typing import NDArray


class CustomSGBRegressor:
    """
    Custom Stochastic Gradient Boosting for regression with decision trees.
    """

    def __init__(
        self,
        n_estimators: int = 3,
        learning_rate: float = 0.1,
        max_depth: int = 3,
        min_samples_leaf: int = 1,
        sub_samples: float = 0.8,   # Added
        sub_features: float = 0.8,   # Added
        random_state: int | None = None    # Added
    ):
        """
        Initialises the CustomSGBRegressor.

        Args:
            n_estimators: Number of boosting rounds.
            learning_rate: Learning rate (shrinkage).
            max_depth: Maximum depth of each tree.
            min_samples_leaf: Minimum samples per leaf.
            sub_samples: Fraction of rows sampled (without replacement) for each tree. Default = 0.8.
            sub_features: Fraction of features considered at each split. Default = 0.8.
            random_state: Seed for the random number generator to ensure reproducibility.
        """
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.models: List[Dict[str, Any] | float] = []
        self.initial_prediction: float = 0.0
        self.sub_samples = sub_samples      # Added
        self.sub_features = sub_features    # Added
        self.random_state = random_state    # Added

    def fit(self, X: NDArray[np.float64], y: NDArray[np.float64]) -> None:
        """
        Fit the gradient boosting regressor to the data.

        Args:
            X: Feature matrix, shape (n_samples, n_features).
            y: Target values, shape (n_samples,).
        """
        if self.random_state is not None:
            np.random.seed(self.random_state)
        
        self.models = []
        self.initial_prediction = float(np.mean(y))
        predictions = np.full_like(y, self.initial_prediction, dtype=float)
        for _ in range(self.n_estimators):
            residuals = y - predictions
            # Stochastic row subsampling
            # Guard against an empty subsample when the fraction is very small
            n_samples = max(1, int(self.sub_samples * X.shape[0]))
            sample_indices = np.random.choice(
                X.shape[0], n_samples, replace=False)
            X_sub, residuals_sub = X[sample_indices], residuals[sample_indices]
            tree = self._build_tree(
                # Build trees with subsamples
                X_sub, residuals_sub, self.max_depth, self.min_samples_leaf)
            self.models.append(tree)
            # Predictions on the entire dataset
            update = self._predict_tree_batch(tree, X)
            predictions += self.learning_rate * update

    def predict(self, X: NDArray[np.float64]) -> NDArray[np.float64]:
        """
        Predict target values for given feature matrix.

        Args:
            X: Feature matrix, shape (n_samples, n_features).

        Returns:
            Predicted values.
        """
        y_pred = np.full(X.shape[0], self.initial_prediction, dtype=np.float64)
        for tree in self.models:
            y_pred += self.learning_rate * self._predict_tree_batch(tree, X)
        return y_pred

    def _variance(self, y: NDArray[np.float64]) -> float:
        """
        Calculate the variance of the target values.

        Args:
            y: Target values.

        Returns:
            float: Variance of y.
        """
        return np.var(y)

    def _split_dataset(self, X: NDArray[np.float64],
                       y: NDArray[np.float64], feature_index: int,
                       threshold: float) -> Tuple[NDArray[np.float64], NDArray[np.float64], NDArray[np.float64], NDArray[np.float64]]:
        """
        Splits the dataset based on a feature and threshold.

        Args:
            X: Feature matrix, shape (n_samples, n_features).
            y: Target values, shape (n_samples,).
            feature_index: Index of the feature to split on.
            threshold: Threshold value for the split.

        Returns:
            Tuple containing X_left, y_left, X_right, y_right after the split.
        """
        left_mask = X[:, feature_index] < threshold
        right_mask = ~left_mask
        return X[left_mask], y[left_mask], X[right_mask], y[right_mask]

    def _best_split(self, X: NDArray[np.float64],
                    y: NDArray[np.float64], min_samples_leaf: int) -> Tuple[int | None, float | None]:
        """
        Find the best feature and threshold to split the dataset, minimising weighted variance.

        Args:
            X: Feature matrix, shape (n_samples, n_features).
            y: Target values.
            min_samples_leaf: Minimum number of samples required at a leaf node.

        Returns:
            Tuple of (best_feature, best_threshold). Returns (None, None) if no valid split is found.
        """
        m, n = X.shape
        # Stochastic feature subsampling
        # Always consider at least one feature, even for small fractions
        n_features = max(1, int(self.sub_features * n))
        feature_indices = np.random.choice(n, n_features, replace=False)

        best_feature, best_threshold, best_var = None, None, float('inf')
        for feature in feature_indices:
            thresholds = np.unique(X[:, feature])
            for threshold in thresholds:
                _, y_left, _, y_right = self._split_dataset(
                    X, y, feature, threshold)
                if len(y_left) < min_samples_leaf or len(y_right) < min_samples_leaf:
                    continue
                var_left = self._variance(y_left)
                var_right = self._variance(y_right)
                var_split = (len(y_left) * var_left +
                             len(y_right) * var_right) / m
                if var_split < best_var:
                    best_feature = feature
                    best_threshold = threshold
                    best_var = var_split
        return best_feature, best_threshold

    def _build_tree(self, X: NDArray[np.float64], y: NDArray[np.float64],
                    max_depth: int, min_samples_leaf: int, depth: int = 0) -> Dict[str, Any] | float:
        """
        Recursively build a regression tree.

        Args:
            X: Feature matrix.
            y: Target values.
            max_depth: Maximum depth of the tree.
            min_samples_leaf: Minimum samples required at a leaf node.
            depth: Current depth of the tree (default is 0).

        Returns:
            Tree as a nested dictionary, or a float if a leaf node.
        """
        if depth >= max_depth or len(y) <= min_samples_leaf:
            return float(np.mean(y))
        feature, threshold = self._best_split(X, y, min_samples_leaf)
        if feature is None:
            return float(np.mean(y))
        X_left, y_left, X_right, y_right = self._split_dataset(
            X, y, feature, threshold)
        return {
            'feature': feature,
            'threshold': threshold,
            'left': self._build_tree(X_left, y_left, max_depth, min_samples_leaf, depth + 1),
            'right': self._build_tree(X_right, y_right, max_depth, min_samples_leaf, depth + 1)
        }

    def _predict_tree(self, tree: Dict[str, Any] | float, x: NDArray[np.float64]) -> float:
        """
        Predict the target value for a single sample using the regression tree.

        Args:
            tree: The regression tree or a leaf value.
            x: Feature vector for a single sample.

        Returns:
            float: Predicted value.
        """
        while isinstance(tree, dict):
            if x[tree['feature']] < tree['threshold']:
                tree = tree['left']
            else:
                tree = tree['right']
        return float(tree)

    def _predict_tree_batch(self, tree: Dict[str, Any] | float, X: NDArray[np.float64]) -> NDArray[np.float64]:
        """
        Predict target values for a batch of samples using the regression tree.

        Args:
            tree: The regression tree or a leaf value.
            X: Feature matrix, shape (n_samples, n_features).

        Returns:
            Predicted values for all samples.
        """
        return np.array([self._predict_tree(tree, x) for x in X], dtype=np.float64)

In [10]:
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Instantiate and fit the model
model_custom = CustomSGBRegressor(n_estimators=200,
                                  learning_rate=0.1,
                                  max_depth=2,
                                  random_state=42,
                                  min_samples_leaf=1,
                                  sub_features=0.8,  # Added
                                  sub_samples=0.8)  # Added
model_custom.fit(X_train, y_train)

# Predict and evaluate
y_pred_custom = model_custom.predict(X_test)
mse_custom, rmse_custom, mae_custom, r2_custom = evaluate(
    y_test, y_pred_custom)
print(f'MSE (Custom): {mse_custom:.4f}')
print(f'RMSE (Custom): {rmse_custom:.4f}')
print(f'MAE (Custom): {mae_custom:.4f}')
print(f'R-Squared (Custom): {r2_custom:.4f}')
print('----------')

MSE (Custom): 0.0037
RMSE (Custom): 0.0605
MAE (Custom): 0.0408
R-Squared (Custom): 0.8582
----------


## 5. Comparison with Scikit-Learn

In [11]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, root_mean_squared_error, mean_absolute_error, r2_score

# Instantiate and fit the sklearn model
sklearn_gbm = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=2,
    min_samples_leaf=1,
    random_state=42,
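    # Row subsampling only: max_features stays at its default (None), so the
    # per-split feature subsampling of the custom model is not mirrored here.
    subsample=0.8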
)
sklearn_gbm.fit(X_train, y_train)

# Predict and evaluate
y_pred_sklearn = sklearn_gbm.predict(X_test)
mse_sklearn = mean_squared_error(y_test, y_pred_sklearn)
rmse_sklearn = root_mean_squared_error(y_test, y_pred_sklearn)
mae_sklearn = mean_absolute_error(y_test, y_pred_sklearn)
r2_sklearn = r2_score(y_test, y_pred_sklearn)
print(f'MSE (SK): {mse_sklearn:.4f}')
print(f'RMSE (SK): {rmse_sklearn:.4f}')
print(f'MAE (SK): {mae_sklearn:.4f}')
print(f'R-Squared (SK): {r2_sklearn:.4f}')
print('----------')

MSE (SK): 0.0035
RMSE (SK): 0.0588
MAE (SK): 0.0427
R-Squared (SK): 0.8662
----------
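
To mirror the custom model's per-split feature subsampling as well, scikit-learn's `GradientBoostingRegressor` accepts a `max_features` argument. The sketch below is illustrative only: it uses synthetic data from `make_regression` in place of the admissions dataset, so its score is not comparable to the results above, and the variable names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the admissions dataset here.
X_syn, y_syn = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42)

sgb_sklearn = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=2,
    min_samples_leaf=1,
    subsample=0.8,       # Row subsampling per boosting iteration
    max_features=0.8,    # Feature subsampling at each split
    random_state=42,
)
sgb_sklearn.fit(X_tr, y_tr)
print(f'R-Squared on synthetic test split: {sgb_sklearn.score(X_te, y_te):.4f}')
```

With 8 features, a fraction of 0.8 means 6 features are considered at each split, which matches `int(0.8 * 8)` in the custom `_best_split()` method.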
