# Gradient Boosting from Scratch - Binary Classification

A bare-bones implementation of gradient boosting for binary classification, using scikit-learn regression trees as the weak learners.

**Key Concepts:**
- Sequential ensemble learning
- Fitting trees to pseudo-residuals (negative gradients of the loss)
- Additive model building
- Learning rate (shrinkage)
- Decision tree as weak learner


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.metrics import confusion_matrix, classification_report

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

print("=" * 80)
print("GRADIENT BOOSTING FROM SCRATCH - BINARY CLASSIFICATION")
print("=" * 80)

## Mathematical Foundation

**Gradient Boosting for Binary Classification:**

**1. Model:**
$$F_M(x) = F_0(x) + \sum_{m=1}^{M} \nu \cdot h_m(x)$$

where:
- $F_M(x)$ = final model (log-odds)
- $F_0(x)$ = initial prediction (constant)
- $h_m(x)$ = weak learner at iteration $m$ (decision tree)
- $\nu$ = learning rate (shrinkage)
- $M$ = number of boosting iterations
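
As a rough illustration of the additive form, the sketch below evaluates the model as a running sum and yields every partial model $F_m$ along the way. The names `staged_raw_scores`, `f0`, and `learners` are hypothetical and not part of this notebook's class; the weak learners are plain Python callables standing in for fitted trees.

```python
import numpy as np

def staged_raw_scores(X, f0, learners, learning_rate):
    """Yield F_1(X), F_2(X), ..., F_M(X) for the additive model.

    f0 is the constant initial score F_0; learners is a sequence of
    callables h_m(X) -> array of per-sample corrections (hypothetical
    stand-ins for fitted trees).
    """
    F = np.full(len(X), f0, dtype=float)
    for h in learners:
        # F_m = F_{m-1} + nu * h_m; a new array is created at each stage
        F = F + learning_rate * h(X)
        yield F

# Toy usage: two hand-written "weak learners" on a single feature
X_demo = np.array([[-1.0], [0.5], [2.0]])
learners_demo = [
    lambda X: np.where(X[:, 0] > 0, 1.0, -1.0),
    lambda X: np.where(X[:, 0] > 1, 0.5, -0.5),
]
stages = list(staged_raw_scores(X_demo, f0=0.0,
                                learners=learners_demo, learning_rate=0.1))
```

Each element of `stages` is the log-odds of a shorter ensemble, so a model with $M$ learners also contains every model with fewer learners.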

**2. Loss Function (Log Loss / Cross-Entropy):**
$$L(y, F(x)) = -[y \log(p) + (1-y) \log(1-p)]$$

where $p = \sigma(F(x)) = \frac{1}{1 + e^{-F(x)}}$ is the predicted probability of the positive class (the sigmoid of the log-odds)
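
For large $|F(x)|$ the sigmoid saturates and $\log(p)$ or $\log(1-p)$ can underflow. Substituting the sigmoid into the loss gives the equivalent form $L = \log(1 + e^{F}) - yF$, which can be evaluated directly on the log-odds. The following is a minimal sketch of that identity; `log_loss_from_log_odds` is an illustrative name, not something used elsewhere in this notebook.

```python
import numpy as np

def log_loss_from_log_odds(y, F):
    """Per-sample log loss computed from log-odds F, without forming p.

    Uses L = log(1 + exp(F)) - y * F, where np.logaddexp(0, F)
    evaluates log(exp(0) + exp(F)) in a numerically stable way.
    """
    y = np.asarray(y, dtype=float)
    F = np.asarray(F, dtype=float)
    return np.logaddexp(0.0, F) - y * F

# The two forms agree for moderate scores
y_demo = np.array([0.0, 1.0, 1.0])
F_demo = np.array([-2.0, 0.0, 3.0])
p_demo = 1.0 / (1.0 + np.exp(-F_demo))
direct = -(y_demo * np.log(p_demo) + (1 - y_demo) * np.log(1 - p_demo))
stable = log_loss_from_log_odds(y_demo, F_demo)
```

The class below takes the other common route and clips probabilities before taking logarithms; both approaches avoid `log(0)`.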

**3. Negative Gradient (Pseudo-Residuals):**
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}} = y_i - p_i, \quad p_i = \sigma(F_{m-1}(x_i))$$
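
The result $y_i - p_i$ follows from the chain rule. Writing $p = \sigma(F)$ and using $\frac{dp}{dF} = p(1-p)$, a short derivation is:

$$\frac{\partial L}{\partial F} = -\left[\frac{y}{p} - \frac{1-y}{1-p}\right] p(1-p) = -\left[y(1-p) - (1-y)\,p\right] = p - y$$

so the negative gradient is $y - p$.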

**4. Algorithm:**
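1. Initialize: $F_0(x) = \log\left(\frac{\bar{p}}{1-\bar{p}}\right)$, the log-odds of $\bar{p}$, the fraction of positive labels in the training set
2. For $m = 1$ to $M$:
   - Compute pseudo-residuals: $r_{im} = y_i - p_i$ where $p_i = \sigma(F_{m-1}(x_i))$
   - Fit weak learner $h_m(x)$ to the pseudo-residuals $r_{im}$ (in this implementation each leaf predicts the mean pseudo-residual of its samples; no per-leaf line search is performed)
   - Update: $F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$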
3. Output: $\hat{p}(x) = \sigma(F_M(x))$
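
To make the steps concrete, here is a small worked example of the initialization and the first boosting update on four hand-picked labels. It is only a sketch: instead of fitting a tree, it assumes a depth-1 tree that isolates the single negative sample, so each leaf predicts the mean pseudo-residual of its samples. All `*_demo` names and the helper `mean_log_loss` are illustrative.

```python
import numpy as np

y_demo = np.array([1.0, 1.0, 1.0, 0.0])
nu = 0.1  # learning rate

# Step 1: F_0 is the log-odds of the positive fraction (0.75 here)
p_bar = y_demo.mean()
F0 = np.log(p_bar / (1 - p_bar))          # log(3)
F = np.full(len(y_demo), F0)

# Step 2a: pseudo-residuals r_i = y_i - sigmoid(F_0)
p = 1.0 / (1.0 + np.exp(-F))
residuals_demo = y_demo - p               # [0.25, 0.25, 0.25, -0.75]

# Step 2b: assumed stump with leaves {samples 0-2} and {sample 3};
# each leaf outputs the mean residual of its samples
leaf_of = np.array([0, 0, 0, 1])
leaf_values = np.array([residuals_demo[leaf_of == k].mean() for k in (0, 1)])
h1 = leaf_values[leaf_of]

# Step 2c: shrunken update of the log-odds
F1 = F + nu * h1

def mean_log_loss(y, F):
    """Mean log loss from log-odds: log(1 + exp(F)) - y * F."""
    return np.mean(np.logaddexp(0.0, F) - y * F)

loss_before = mean_log_loss(y_demo, F)
loss_after = mean_log_loss(y_demo, F1)
```

The positives move to a slightly higher score and the negative to a lower one, so `loss_after` is smaller than `loss_before` after a single step.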

**5. Prediction:**
$$\hat{y} = \begin{cases} 
1 & \text{if } \sigma(F_M(x)) \geq 0.5 \\
0 & \text{otherwise}
\end{cases}$$
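
Because the sigmoid is monotonic, thresholding the probability at $t$ is equivalent to thresholding the log-odds at $\log\frac{t}{1-t}$, which is $0$ for $t = 0.5$. A small sketch of that equivalence follows; the helper name `predict_from_log_odds` is hypothetical.

```python
import numpy as np

def predict_from_log_odds(F, threshold=0.5):
    """Class labels from log-odds, without computing probabilities."""
    # The cutoff on the log-odds scale; 0.0 when threshold == 0.5
    cutoff = np.log(threshold / (1.0 - threshold))
    return (np.asarray(F) >= cutoff).astype(int)

F_demo = np.array([-2.0, -0.1, 0.0, 0.4, 3.0])
via_log_odds = predict_from_log_odds(F_demo)
via_probability = (1.0 / (1.0 + np.exp(-F_demo)) >= 0.5).astype(int)
```

Both routes give the same labels, so the sign of $F_M(x)$ alone decides the class at the default threshold.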

## Gradient Boosting Classifier Implementation

In [None]:
class GradientBoostingClassifierScratch:
    """
    Gradient Boosting for binary classification implemented from scratch.
    
    Uses decision trees as weak learners and fits them sequentially
    to negative gradients (residuals) of the log loss.
    
    Parameters:
    -----------
    n_estimators : int, default=100
        Number of boosting iterations (trees)
    learning_rate : float, default=0.1
        Shrinkage parameter (0 < lr <= 1)
    max_depth : int, default=3
        Maximum depth of individual trees
    min_samples_split : int, default=2
        Minimum samples required to split a node
    subsample : float, default=1.0
        Fraction of samples to use for fitting each tree (stochastic GB)
    verbose : bool, default=False
        Whether to print progress
    """
    
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3,
                 min_samples_split=2, subsample=1.0, verbose=False):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.subsample = subsample
        self.verbose = verbose
        
        # Model components
        self.trees = []
        self.init_pred = None  # F_0(x)
        
        # Training history
        self.train_losses = []
        self.train_scores = []
        
    def _sigmoid(self, z):
        """
        Sigmoid function (logistic function).
        Converts log-odds to probabilities.
        """
        # Clip to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    def _log_loss(self, y_true, y_pred_proba):
        """
        Binary cross-entropy loss.
        """
        epsilon = 1e-15
        y_pred_proba = np.clip(y_pred_proba, epsilon, 1 - epsilon)
        return -np.mean(y_true * np.log(y_pred_proba) + 
                       (1 - y_true) * np.log(1 - y_pred_proba))
    
    def _compute_residuals(self, y_true, y_pred_proba):
        """
        Compute negative gradient (pseudo-residuals).
        
        For log loss: -∂L/∂F = y - p
        """
        return y_true - y_pred_proba
    
    def fit(self, X, y):
        """
        Fit gradient boosting classifier.
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
            Training data
        y : array-like, shape (n_samples,)
            Binary target values (0 or 1)
        """
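        # Accept lists and DataFrames as well as NumPy arrays
        X = np.asarray(X)
        y = np.asarray(y)
        n_samples = len(y)

        # Reset state so that calling fit() again does not append to an earlier model
        self.trees = []
        self.train_losses = []
        self.train_scores = []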
        
        # Step 1: Initialize with constant (log-odds of mean)
        mean_y = np.mean(y)
        # Avoid log(0) or log(inf)
        mean_y = np.clip(mean_y, 1e-15, 1 - 1e-15)
        self.init_pred = np.log(mean_y / (1 - mean_y))
        
        # Initialize predictions (log-odds)
        F = np.full(n_samples, self.init_pred)
        
        if self.verbose:
            print(f"Initial prediction (log-odds): {self.init_pred:.4f}")
            print(f"Initial probability: {self._sigmoid(self.init_pred):.4f}")
            print("\nStarting boosting iterations...\n")
        
        # Step 2: Boosting iterations
        for m in range(self.n_estimators):
            # Convert log-odds to probabilities
            probabilities = self._sigmoid(F)
            
            # Compute pseudo-residuals (negative gradient)
            residuals = self._compute_residuals(y, probabilities)
            
            # Subsample if needed (Stochastic Gradient Boosting)
            if self.subsample < 1.0:
                sample_indices = np.random.choice(
                    n_samples, 
                    size=int(n_samples * self.subsample),
                    replace=False
                )
                X_sample = X[sample_indices]
                residuals_sample = residuals[sample_indices]
            else:
                X_sample = X
                residuals_sample = residuals
            
            # Fit a regression tree to residuals
            tree = DecisionTreeRegressor(
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                random_state=m
            )
            tree.fit(X_sample, residuals_sample)
            
            # Update predictions (all samples)
            update = self.learning_rate * tree.predict(X)
            F += update
            
            # Store tree
            self.trees.append(tree)
            
            # Track training performance
            train_proba = self._sigmoid(F)
            train_loss = self._log_loss(y, train_proba)
            train_acc = np.mean((train_proba >= 0.5).astype(int) == y)
            
            self.train_losses.append(train_loss)
            self.train_scores.append(train_acc)
            
            # Print progress
            if self.verbose and (m % 10 == 0 or m == self.n_estimators - 1):
                print(f"Iteration {m+1:3d} | Loss: {train_loss:.6f} | "
                      f"Accuracy: {train_acc:.4f}")
        
        return self
    
    def _predict_raw(self, X):
        """
        Predict raw values (log-odds / margins).
        F(x) = F_0 + learning_rate * sum(trees)
        """
        # Start with initial prediction
        F = np.full(len(X), self.init_pred)
        
        # Add predictions from all trees
        for tree in self.trees:
            F += self.learning_rate * tree.predict(X)
        
        return F
    
    def predict_proba(self, X):
        """
        Predict class probabilities.
        
        Returns:
        --------
        proba : array, shape (n_samples,)
            Probability of positive class
        """
        raw_predictions = self._predict_raw(X)
        return self._sigmoid(raw_predictions)
    
    def predict(self, X, threshold=0.5):
        """
        Predict class labels.
        
        Parameters:
        -----------
        threshold : float, default=0.5
            Decision threshold
        """
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int)
    
    def get_feature_importance(self, X):
        """
        Average the impurity-based feature importances of all fitted trees.
        X is used only to determine the number of features.
        """
        n_features = X.shape[1]
        importance = np.zeros(n_features)
        
        for tree in self.trees:
            importance += tree.feature_importances_
        
        # Normalize
        importance /= len(self.trees)
        return importance

print("\n✓ GradientBoostingClassifierScratch class defined")
print("  - Sequential tree building")
print("  - Fits trees to residuals")
print("  - Learning rate for shrinkage")

## Example 1: Simple Synthetic Dataset

In [None]:
# Generate synthetic dataset
X_simple, y_simple = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=8,
    n_redundant=2,
    n_classes=2,
    random_state=42,
    flip_y=0.1  # Add 10% label noise
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_simple, y_simple, test_size=0.2, random_state=42, stratify=y_simple
)

print("Synthetic Dataset:")
print(f"  Training samples: {len(X_train)}")
print(f"  Test samples: {len(X_test)}")
print(f"  Features: {X_train.shape[1]}")
print(f"  Classes: {len(np.unique(y_simple))}")
print("\nClass distribution:")
print(f"  Class 0: {np.sum(y_train == 0)} samples")
print(f"  Class 1: {np.sum(y_train == 1)} samples")

In [None]:
# Train gradient boosting
print("\n" + "=" * 80)
print("TRAINING GRADIENT BOOSTING CLASSIFIER")
print("=" * 80)

gb_model = GradientBoostingClassifierScratch(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    min_samples_split=2,
    subsample=1.0,
    verbose=True
)

gb_model.fit(X_train, y_train)

In [None]:
# Evaluate on test set
y_pred_proba = gb_model.predict_proba(X_test)
y_pred = gb_model.predict(X_test)

test_acc = accuracy_score(y_test, y_pred)
test_auc = roc_auc_score(y_test, y_pred_proba)

print("\n" + "=" * 80)
print("TEST SET PERFORMANCE")
print("=" * 80)
print(f"Accuracy: {test_acc:.4f}")
print(f"ROC-AUC:  {test_auc:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1']))

## Visualize Training Progress

In [None]:
# Plot training curves
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Loss curve
axes[0].plot(gb_model.train_losses, linewidth=2)
axes[0].set_xlabel('Boosting Iteration', fontsize=12)
axes[0].set_ylabel('Log Loss', fontsize=12)
axes[0].set_title('Training Loss Over Iterations', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Accuracy curve
axes[1].plot(gb_model.train_scores, linewidth=2, color='green')
axes[1].set_xlabel('Boosting Iteration', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Training Accuracy Over Iterations', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim([0, 1])

plt.tight_layout()
plt.show()

print("\n✓ Training loss decreases and training accuracy increases as trees are added")

## ROC Curve and Confusion Matrix

In [None]:
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# ROC Curve
axes[0].plot(fpr, tpr, linewidth=2, label=f'ROC curve (AUC = {test_auc:.3f})')
axes[0].plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random classifier')
axes[0].set_xlabel('False Positive Rate', fontsize=12)
axes[0].set_ylabel('True Positive Rate', fontsize=12)
axes[0].set_title('ROC Curve', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Confusion Matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1],
            xticklabels=['Class 0', 'Class 1'],
            yticklabels=['Class 0', 'Class 1'])
axes[1].set_xlabel('Predicted Label', fontsize=12)
axes[1].set_ylabel('True Label', fontsize=12)
axes[1].set_title('Confusion Matrix', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

## Feature Importance

In [None]:
# Get feature importance
feature_importance = gb_model.get_feature_importance(X_train)
feature_names = [f'Feature {i}' for i in range(len(feature_importance))]

# Sort by importance
indices = np.argsort(feature_importance)[::-1]

print("\nFeature Importance (sorted):")
print(f"{'Rank':<6} {'Feature':<12} {'Importance':<12}")
print("-" * 35)
for i, idx in enumerate(indices, 1):
    print(f"{i:<6} {feature_names[idx]:<12} {feature_importance[idx]:.6f}")

# Visualize
plt.figure(figsize=(10, 6))
plt.barh(range(len(feature_importance)), feature_importance[indices])
plt.yticks(range(len(feature_importance)), [feature_names[i] for i in indices])
plt.xlabel('Importance', fontsize=12)
plt.title('Feature Importance', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

## Effect of Learning Rate

In [None]:
print("\n" + "=" * 80)
print("EFFECT OF LEARNING RATE")
print("=" * 80)

learning_rates = [0.01, 0.05, 0.1, 0.5]
lr_results = {}

for lr in learning_rates:
    print(f"\nTraining with learning_rate={lr}...")
    
    model = GradientBoostingClassifierScratch(
        n_estimators=100,
        learning_rate=lr,
        max_depth=3,
        verbose=False
    )
    
    model.fit(X_train, y_train)
    
    # Evaluate
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)
    
    test_acc = accuracy_score(y_test, y_pred)
    test_auc = roc_auc_score(y_test, y_pred_proba)
    
    lr_results[lr] = {
        'model': model,
        'accuracy': test_acc,
        'auc': test_auc
    }
    
    print(f"  Test Accuracy: {test_acc:.4f}")
    print(f"  Test AUC: {test_auc:.4f}")

In [None]:
# Plot learning rate comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for idx, (lr, results) in enumerate(lr_results.items()):
    model = results['model']
    
    axes[idx].plot(model.train_losses, linewidth=2)
    axes[idx].set_xlabel('Boosting Iteration', fontsize=11)
    axes[idx].set_ylabel('Log Loss', fontsize=11)
    axes[idx].set_title(f'Learning Rate: {lr} (Test Acc: {results["accuracy"]:.4f})',
                       fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Effect of Learning Rate on Convergence', 
            fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n✓ Lower learning rate → smoother convergence, may need more trees")
print("✓ Higher learning rate → faster initial progress, risk of overfitting")

## Effect of Number of Trees

In [None]:
print("\n" + "=" * 80)
print("EFFECT OF NUMBER OF TREES")
print("=" * 80)

# Train model with many trees
model_many_trees = GradientBoostingClassifierScratch(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    verbose=False
)

model_many_trees.fit(X_train, y_train)

# Evaluate at different stages
n_trees_to_test = [10, 25, 50, 100, 150, 200]
train_accs = []
test_accs = []

for n_trees in n_trees_to_test:
    # Temporarily limit trees
    original_trees = model_many_trees.trees
    model_many_trees.trees = original_trees[:n_trees]
    
    # Evaluate
    train_pred = model_many_trees.predict(X_train)
    test_pred = model_many_trees.predict(X_test)
    
    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, test_pred)
    
    train_accs.append(train_acc)
    test_accs.append(test_acc)
    
    print(f"Trees: {n_trees:3d} | Train Acc: {train_acc:.4f} | Test Acc: {test_acc:.4f}")
    
    # Restore all trees
    model_many_trees.trees = original_trees

In [None]:
# Plot train vs test performance
plt.figure(figsize=(10, 6))
plt.plot(n_trees_to_test, train_accs, marker='o', linewidth=2, 
         markersize=8, label='Training Accuracy')
plt.plot(n_trees_to_test, test_accs, marker='s', linewidth=2, 
         markersize=8, label='Test Accuracy')
plt.xlabel('Number of Trees', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Model Performance vs Number of Trees', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n✓ Monitor train vs test to detect overfitting")
print("✓ If test performance plateaus/degrades, stop adding trees")

## Example 2: Breast Cancer Dataset

In [None]:
# Load breast cancer dataset
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target

# Split data
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

print("\nBreast Cancer Dataset:")
print(f"  Training samples: {len(X_train_c)}")
print(f"  Test samples: {len(X_test_c)}")
print(f"  Features: {X_train_c.shape[1]}")
print("  Classes: Malignant (0), Benign (1)")
print("\nClass distribution (train):")
print(f"  Malignant: {np.sum(y_train_c == 0)}")
print(f"  Benign: {np.sum(y_train_c == 1)}")

In [None]:
# Train gradient boosting
print("\n" + "=" * 80)
print("TRAINING ON BREAST CANCER DATA")
print("=" * 80)

gb_cancer = GradientBoostingClassifierScratch(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    verbose=True
)

gb_cancer.fit(X_train_c, y_train_c)

In [None]:
# Evaluate
y_pred_c = gb_cancer.predict(X_test_c)
y_pred_proba_c = gb_cancer.predict_proba(X_test_c)

test_acc_c = accuracy_score(y_test_c, y_pred_c)
test_auc_c = roc_auc_score(y_test_c, y_pred_proba_c)

print("\n" + "=" * 80)
print("BREAST CANCER TEST RESULTS")
print("=" * 80)
print(f"Accuracy: {test_acc_c:.4f}")
print(f"ROC-AUC:  {test_auc_c:.4f}")

print("\nClassification Report:")
print(classification_report(y_test_c, y_pred_c, 
                          target_names=['Malignant', 'Benign']))

In [None]:
# Top features
feature_importance_c = gb_cancer.get_feature_importance(X_train_c)
top_k = 10
top_indices = np.argsort(feature_importance_c)[::-1][:top_k]

print(f"\nTop {top_k} Most Important Features:")
for i, idx in enumerate(top_indices, 1):
    print(f"{i:2d}. {cancer.feature_names[idx]:<30s} {feature_importance_c[idx]:.6f}")

# Visualize
plt.figure(figsize=(12, 6))
plt.barh(range(top_k), feature_importance_c[top_indices])
plt.yticks(range(top_k), [cancer.feature_names[i] for i in top_indices])
plt.xlabel('Importance', fontsize=12)
plt.title(f'Top {top_k} Feature Importances - Breast Cancer', 
         fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

## Summary

**Gradient Boosting - Key Concepts:**

1. **Sequential Learning:**
   - Trees added one at a time
   - Each new tree corrects errors of previous ensemble
   - Trees cannot be built in parallel, because each one depends on the ensemble before it

2. **Gradient Descent in Function Space:**
   - Instead of updating a parameter vector, each step updates the predictions $F(x_i)$ directly
   - Negative gradient = residuals (for squared loss) or pseudo-residuals (for other losses)
   - New tree approximates negative gradient

3. **Loss Function:**
   - Binary classification: Log loss (cross-entropy)
   - Regression: MSE, MAE, Huber
   - Ranking: Pairwise loss

4. **Learning Rate (Shrinkage):**
   - Scale contribution of each tree
   - Smaller → usually better generalization, but more trees are needed
   - Larger → faster convergence, but a higher risk of overfitting
   - Typical values: 0.01 - 0.3

5. **Hyperparameters:**
   - `n_estimators`: Number of trees
   - `learning_rate`: Shrinkage factor
   - `max_depth`: Tree complexity (usually 3-8)
   - `min_samples_split`: Minimum samples to split
   - `subsample`: Fraction for stochastic GB

6. **Advantages:**
   - High predictive accuracy
   - Handles mixed data types (with scikit-learn trees, as used here, categorical features must be numerically encoded first)
   - Robust to outliers (with appropriate loss)
   - Feature importance built-in
   - Flexible (custom loss functions)

7. **Disadvantages:**
   - Sequential (trees cannot be built in parallel)
   - Prone to overfitting when too many or too deep trees are used
   - Requires careful tuning
   - Often slower to train than Random Forest

8. **Best Practices:**
   - Start with a small learning rate (0.01-0.1)
   - Use cross-validation for tuning
   - Monitor train vs validation performance
   - Consider early stopping
   - Use subsample < 1.0 for regularization
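
Early stopping can be sketched as a patience rule over a validation-loss curve: keep the iteration with the lowest loss and stop once no improvement has been seen for `patience` rounds. The function below is a hypothetical helper, not part of the class above. It assumes a list of validation losses has already been recorded, one per boosting iteration, and the loss values in the usage lines are made up for illustration.

```python
def best_iteration(val_losses, patience=10, min_delta=0.0):
    """Return the 1-based iteration with the best validation loss.

    Scanning stops early once `patience` consecutive iterations fail to
    improve on the best loss by more than `min_delta`.
    Returns 0 if `val_losses` is empty.
    """
    best_loss = float("inf")
    best_iter = 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_iter = i
        elif i - best_iter >= patience:
            break  # no improvement for `patience` rounds
    return best_iter

# Illustrative (made-up) validation curve
val_losses_demo = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]
stop_short = best_iteration(val_losses_demo, patience=3)
stop_long = best_iteration(val_losses_demo, patience=10)
```

With a short patience the scan stops before reaching the late improvement at the last iteration, which is the usual trade-off when choosing `patience`. The returned count could then be used to keep only the first trees of a fitted model, in the same way the "Effect of Number of Trees" cell slices `model_many_trees.trees`.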
