# Logistic Regression from Scratch - Binary Classification

This notebook implements logistic regression for binary classification without using scikit-learn's `LogisticRegression`. We build the model from the ground up to understand the mathematics and implementation details; scikit-learn is used only for data generation, splitting, feature scaling, and evaluation metrics.

**Author:** Educational Implementation  
**Date:** February 2026

---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

print("=" * 80)
print("LOGISTIC REGRESSION FROM SCRATCH")
print("=" * 80)

## Part 1: Mathematical Foundation

### Core Concepts

**1. Sigmoid Function (Activation):**
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Properties:
- Maps any real number to (0, 1)
- Output can be interpreted as probability
- Derivative: $\sigma'(z) = \sigma(z) \cdot (1 - \sigma(z))$
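
The derivative identity above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the names `sigmoid`, `analytic`, `numeric`, and `max_abs_diff` are illustrative and not part of the notebook's class. It compares $\sigma(z)(1 - \sigma(z))$ against a central finite difference.

```python
import numpy as np

def sigmoid(z):
    """Plain sigmoid, as in the formula above (fine for moderate |z|)."""
    return 1.0 / (1.0 + np.exp(-z))

# Evaluate on a small grid of inputs
z = np.linspace(-5.0, 5.0, 11)

# Closed-form derivative: sigma(z) * (1 - sigma(z))
analytic = sigmoid(z) * (1.0 - sigmoid(z))

# Central finite-difference approximation of the derivative
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(f"max |analytic - numeric| = {max_abs_diff:.2e}")
```

The two should agree to several decimal places; a large gap would point to a mistake in the derivative.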

**2. Hypothesis:**
$$h(x) = \sigma(w^T x + b)$$

where $w$ is the weight vector, $b$ is the bias, and $x$ is the input feature vector.
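
As a small illustration of the hypothesis applied to several inputs at once, the sketch below stacks three inputs as rows of a matrix. The function `hypothesis` and the values in `X_demo`, `w_demo`, and `b_demo` are made up for this example.

```python
import numpy as np

def hypothesis(X, w, b):
    """h(x) = sigmoid(w^T x + b), applied to every row of X."""
    z = X @ w + b                      # linear score per sample
    return 1.0 / (1.0 + np.exp(-z))    # squash each score into (0, 1)

# Illustrative values only
X_demo = np.array([[0.0, 0.0],
                   [1.0, 2.0],
                   [-1.0, -2.0]])
w_demo = np.array([1.0, 0.5])
b_demo = -1.0

probs = hypothesis(X_demo, w_demo, b_demo)
print(probs)  # one probability per row of X_demo
```

Rows with a positive linear score get a probability above 0.5, and rows with a negative score get one below 0.5.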

**3. Binary Cross-Entropy Loss:**
$$L(w, b) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(h(x^{(i)})) + (1-y^{(i)}) \log(1-h(x^{(i)}))]$$

For a single sample:
- If $y=1$: loss $= -\log(h(x))$ (want $h(x)$ close to 1)
- If $y=0$: loss $= -\log(1-h(x))$ (want $h(x)$ close to 0)
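
To make the two cases concrete, here is a minimal sketch of the per-sample loss on four made-up predictions. The helper `bce_per_sample` and the demo arrays are illustrative; the clipping constant mirrors the one used later in the notebook.

```python
import numpy as np

def bce_per_sample(y, p, eps=1e-15):
    """Per-sample binary cross-entropy; p is clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Two confident-correct and two confident-wrong predictions (made-up values)
y_demo = np.array([1, 1, 0, 0])
p_demo = np.array([0.9, 0.1, 0.1, 0.9])

losses = bce_per_sample(y_demo, p_demo)
for y_i, p_i, l_i in zip(y_demo, p_demo, losses):
    print(f"y={y_i}  h(x)={p_i:.1f}  loss={l_i:.4f}")
```

The confident wrong predictions (second and fourth rows) incur a much larger loss than the confident correct ones, which is the behaviour the two bullet points describe.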

**4. Gradient Descent Update:**
$$w := w - \alpha \frac{\partial L}{\partial w}$$
$$b := b - \alpha \frac{\partial L}{\partial b}$$

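where, with $X$ the $m \times n$ matrix whose rows are the samples and $h(X)$ the vector of predictions for all rows:
$$\frac{\partial L}{\partial w} = \frac{1}{m} X^T (h(X) - y)$$
$$\frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})$$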
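
A common sanity check for gradient formulas like these is to compare them with finite differences of the loss. The sketch below does this on small random data; all names (`loss`, `analytic_grads`, `X_demo`, and so on) are illustrative, and the data is synthetic noise rather than the notebook's dataset.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, X, y):
    """Unregularized binary cross-entropy, averaged over samples."""
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grads(w, b, X, y):
    """Gradients from the formulas above."""
    m = len(y)
    error = sigmoid(X @ w + b) - y
    return X.T @ error / m, np.sum(error) / m

# Small synthetic problem (values are arbitrary)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (rng.random(50) < 0.5).astype(float)
w_demo = rng.normal(size=3) * 0.1
b_demo = 0.05

dw, db = analytic_grads(w_demo, b_demo, X_demo, y_demo)

# Central finite differences, one parameter at a time
h = 1e-6
dw_num = np.zeros_like(w_demo)
for j in range(len(w_demo)):
    step = np.zeros_like(w_demo)
    step[j] = h
    dw_num[j] = (loss(w_demo + step, b_demo, X_demo, y_demo)
                 - loss(w_demo - step, b_demo, X_demo, y_demo)) / (2 * h)
db_num = (loss(w_demo, b_demo + h, X_demo, y_demo)
          - loss(w_demo, b_demo - h, X_demo, y_demo)) / (2 * h)

max_dw_err = np.max(np.abs(dw - dw_num))
db_err = abs(db - db_num)
print(f"max |dw - dw_num| = {max_dw_err:.2e}, |db - db_num| = {db_err:.2e}")
```

If either error is far from zero, the analytic gradient and the loss are inconsistent with each other.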

In [None]:
class LogisticRegressionScratch:
    """
    Logistic Regression implemented from scratch using gradient descent.
    
    Parameters:
    -----------
    learning_rate : float, default=0.01
        Step size for gradient descent updates
    n_iterations : int, default=1000
        Number of gradient descent iterations
    regularization : str, default=None
        Type of regularization: None, 'l1', or 'l2'
    lambda_reg : float, default=0.01
        Regularization strength
    verbose : bool, default=False
        Whether to print training progress
    """
    
    def __init__(self, learning_rate=0.01, n_iterations=1000, 
                 regularization=None, lambda_reg=0.01, verbose=False):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.regularization = regularization
        self.lambda_reg = lambda_reg
        self.verbose = verbose
        
        # Model parameters (initialized during fit)
        self.weights = None
        self.bias = None
        
        # Training history
        self.loss_history = []
        self.accuracy_history = []
        
    def _sigmoid(self, z):
        """
        Sigmoid activation function.
        
        Numerical stability: clip z to avoid overflow in exp
        """
        z = np.clip(z, -500, 500)  # Prevent overflow
        return 1 / (1 + np.exp(-z))
    
    def _compute_loss(self, y_true, y_pred, weights):
        """
        Compute binary cross-entropy loss with optional regularization.
        
        Parameters:
        -----------
        y_true : array-like, shape (n_samples,)
            True binary labels
        y_pred : array-like, shape (n_samples,)
            Predicted probabilities
        weights : array-like, shape (n_features,)
            Current weight values
            
        Returns:
        --------
        loss : float
            Average loss across samples
        """
        m = len(y_true)
        
        # Clip predictions to avoid log(0)
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        
        # Binary cross-entropy
        bce_loss = -np.mean(
            y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
        )
        
        # Add regularization term
        reg_term = 0
        if self.regularization == 'l2':
            reg_term = (self.lambda_reg / (2 * m)) * np.sum(weights ** 2)
        elif self.regularization == 'l1':
            reg_term = (self.lambda_reg / m) * np.sum(np.abs(weights))
            
        return bce_loss + reg_term
    
    def _compute_gradients(self, X, y_true, y_pred):
        """
        Compute gradients for weights and bias.
        
        Mathematical derivation:
        For loss L = -1/m Σ[y*log(h) + (1-y)*log(1-h)]
        
        ∂L/∂w_j = 1/m Σ(h(x_i) - y_i) * x_ij
        ∂L/∂b = 1/m Σ(h(x_i) - y_i)
        """
        m = len(y_true)
        error = y_pred - y_true  # Shape: (m,)
        
        # Gradient for weights
        dw = (1/m) * X.T.dot(error)  # Shape: (n_features,)
        
        # Add regularization gradient
        if self.regularization == 'l2':
            dw += (self.lambda_reg / m) * self.weights
        elif self.regularization == 'l1':
            # Subgradient of |w|; np.sign returns 0 where a weight is exactly 0
            dw += (self.lambda_reg / m) * np.sign(self.weights)
        
        # Gradient for bias (no regularization on bias)
        db = (1/m) * np.sum(error)
        
        return dw, db
    
    def fit(self, X, y):
        """
        Fit logistic regression model using gradient descent.
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
            Training data
        y : array-like, shape (n_samples,)
            Binary target values (0 or 1)
        """
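        # Initialize parameters
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        # Reset training history so repeated fit() calls do not accumulate
        self.loss_history = []
        self.accuracy_history = []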
        
        # Gradient descent
        for iteration in range(self.n_iterations):
            # Forward pass: compute predictions
            linear_output = X.dot(self.weights) + self.bias
            y_pred = self._sigmoid(linear_output)
            
            # Compute loss
            loss = self._compute_loss(y, y_pred, self.weights)
            self.loss_history.append(loss)
            
            # Compute accuracy
            y_pred_class = (y_pred >= 0.5).astype(int)
            accuracy = np.mean(y_pred_class == y)
            self.accuracy_history.append(accuracy)
            
            # Compute gradients
            dw, db = self._compute_gradients(X, y, y_pred)
            
            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            
            # Print progress
            if self.verbose and (iteration % 100 == 0 or iteration == self.n_iterations - 1):
                print(f"Iteration {iteration:4d} | Loss: {loss:.6f} | Accuracy: {accuracy:.4f}")
        
        return self
    
    def predict_proba(self, X):
        """
        Predict class probabilities.
        
        Returns:
        --------
        proba : array-like, shape (n_samples,)
            Probability of positive class
        """
        linear_output = X.dot(self.weights) + self.bias
        return self._sigmoid(linear_output)
    
    def predict(self, X, threshold=0.5):
        """
        Predict binary class labels.
        
        Parameters:
        -----------
        threshold : float, default=0.5
            Decision threshold for classification
        """
        proba = self.predict_proba(X)
        return (proba >= threshold).astype(int)
    
    def get_params(self):
        """Return model parameters."""
        return {
            'weights': self.weights,
            'bias': self.bias,
            'loss_history': self.loss_history,
            'accuracy_history': self.accuracy_history
        }

print("\n✓ LogisticRegressionScratch class defined")
print("  - Sigmoid activation")
print("  - Binary cross-entropy loss")
print("  - Gradient descent optimization")
print("  - Optional L1/L2 regularization")

## Part 2: Generate Synthetic Dataset

In [None]:
print("\n" + "=" * 80)
print("GENERATING SYNTHETIC DATASET")
print("=" * 80)

# Create binary classification dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    random_state=42,
    flip_y=0.1  # Add 10% label noise
)

print(f"\nDataset shape: {X.shape}")
print("Class distribution:")
print(f"  Class 0: {np.sum(y == 0)} samples ({np.mean(y == 0)*100:.1f}%)")
print(f"  Class 1: {np.sum(y == 1)} samples ({np.mean(y == 1)*100:.1f}%)")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features (important for gradient descent!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nTraining set: {X_train_scaled.shape}")
print(f"Test set: {X_test_scaled.shape}")

## Part 3: Train Basic Model

In [None]:
print("\n" + "=" * 80)
print("TRAINING BASIC LOGISTIC REGRESSION")
print("=" * 80)

# Train model
model_basic = LogisticRegressionScratch(
    learning_rate=0.1,
    n_iterations=1000,
    regularization=None,
    verbose=True
)

model_basic.fit(X_train_scaled, y_train)

# Evaluate on test set
y_pred_proba = model_basic.predict_proba(X_test_scaled)
y_pred = model_basic.predict(X_test_scaled)

test_accuracy = np.mean(y_pred == y_test)
print(f"\n{'='*80}")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"{'='*80}")

## Part 4: Visualize Training Progress

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Loss curve
axes[0].plot(model_basic.loss_history, linewidth=2)
axes[0].set_xlabel('Iteration', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Training Loss Over Time', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Accuracy curve
axes[1].plot(model_basic.accuracy_history, linewidth=2, color='green')
axes[1].set_xlabel('Iteration', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Training Accuracy Over Time', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim([0, 1])

plt.tight_layout()
plt.show()

print("✓ Training curves plotted (a flattening loss curve indicates convergence)")

## Part 5: Detailed Evaluation Metrics
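
The metrics in this part all derive from the four confusion-matrix counts. As a minimal sketch, the hypothetical helper `binary_metrics` below computes them from raw counts and returns 0.0 when a denominator is zero. That fallback is a convention chosen for this example, not something the notebook's own code does. The counts passed in are made up.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero (a chosen convention)."""
    return num / den if den else 0.0

def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": safe_div(tp, tp + fp),       # correct among predicted positives
        "recall": safe_div(tp, tp + fn),          # found among actual positives
        "specificity": safe_div(tn, tn + fp),     # found among actual negatives
        "f1": safe_div(2 * tp, 2 * tp + fp + fn)  # harmonic mean of precision and recall
    }

# Made-up counts for illustration
metrics_demo = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics_demo.items():
    print(f"{name:<12} {value:.4f}")
```

The guard matters when a model predicts no positives at all, in which case `tp + fp` is zero and an unguarded precision would raise a division error.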

In [None]:
print("\n" + "=" * 80)
print("DETAILED EVALUATION")
print("=" * 80)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['Class 0', 'Class 1'],
            yticklabels=['Class 0', 'Class 1'])
axes[0].set_xlabel('Predicted Label', fontsize=12)
axes[0].set_ylabel('True Label', fontsize=12)
axes[0].set_title('Confusion Matrix', fontsize=14, fontweight='bold')

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

axes[1].plot(fpr, tpr, linewidth=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
axes[1].plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random classifier')
axes[1].set_xlabel('False Positive Rate', fontsize=12)
axes[1].set_ylabel('True Positive Rate', fontsize=12)
axes[1].set_title('ROC Curve', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1']))

# Additional metrics
tn, fp, fn, tp = cm.ravel()
print("\nDetailed Metrics:")
print(f"  True Positives:  {tp}")
print(f"  True Negatives:  {tn}")
print(f"  False Positives: {fp}")
print(f"  False Negatives: {fn}")
print(f"  Sensitivity (Recall): {tp/(tp+fn):.4f}")
print(f"  Specificity: {tn/(tn+fp):.4f}")
print(f"  Precision: {tp/(tp+fp):.4f}")
print(f"  F1-Score: {2*tp/(2*tp+fp+fn):.4f}")

## Part 6: Compare Regularization Techniques
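
Before comparing the trained models, it may help to see the two penalties side by side. The sketch below mirrors the penalty terms and their (sub)gradients as the `LogisticRegressionScratch` class defines them. The standalone function `penalty_and_grad` and the demo weights are illustrative only.

```python
import numpy as np

def penalty_and_grad(weights, kind, lambda_reg, m):
    """Penalty value and its (sub)gradient with respect to the weights."""
    if kind == 'l2':
        penalty = (lambda_reg / (2 * m)) * np.sum(weights ** 2)
        grad = (lambda_reg / m) * weights            # shrinks in proportion to w
    elif kind == 'l1':
        penalty = (lambda_reg / m) * np.sum(np.abs(weights))
        grad = (lambda_reg / m) * np.sign(weights)   # constant-size push toward 0
    else:
        penalty, grad = 0.0, np.zeros_like(weights)
    return penalty, grad

# Made-up weights: one large, one small, one exactly zero
w_demo = np.array([2.0, -0.5, 0.0])
l2_pen, l2_grad = penalty_and_grad(w_demo, 'l2', lambda_reg=1.0, m=10)
l1_pen, l1_grad = penalty_and_grad(w_demo, 'l1', lambda_reg=1.0, m=10)

print("L2 penalty:", l2_pen, "gradient:", l2_grad)
print("L1 penalty:", l1_pen, "gradient:", l1_grad)
```

The L2 gradient scales with each weight, while the L1 subgradient has the same magnitude for every nonzero weight, which is why L1 pushes small weights toward zero more aggressively.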

In [None]:
print("\n" + "=" * 80)
print("COMPARING REGULARIZATION TECHNIQUES")
print("=" * 80)

models = {}
regularizations = [None, 'l1', 'l2']

for reg in regularizations:
    print(f"\nTraining with regularization: {reg if reg else 'None'}")
    
    model = LogisticRegressionScratch(
        learning_rate=0.1,
        n_iterations=1000,
        regularization=reg,
        lambda_reg=0.01,
        verbose=False
    )
    
    model.fit(X_train_scaled, y_train)
    
    # Evaluate
    y_pred = model.predict(X_test_scaled)
    accuracy = np.mean(y_pred == y_test)
    
    models[reg if reg else 'none'] = {
        'model': model,
        'accuracy': accuracy
    }
    
    print(f"  Test Accuracy: {accuracy:.4f}")

# Compare regularization effects
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (reg_type, model_info) in enumerate(models.items()):
    model = model_info['model']
    
    axes[idx].plot(model.loss_history, linewidth=2)
    axes[idx].set_xlabel('Iteration', fontsize=12)
    axes[idx].set_ylabel('Loss', fontsize=12)
    axes[idx].set_title(f'{reg_type.upper()} Regularization\n(Accuracy: {model_info["accuracy"]:.4f})',
                       fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Part 7: Feature Importance Analysis

In [None]:
print("\n" + "=" * 80)
print("FEATURE IMPORTANCE ANALYSIS")
print("=" * 80)

# Use the unregularized model. Weight magnitudes are comparable across features
# only because the inputs were standardized.
weights = model_basic.weights
feature_importance = np.abs(weights)

# Sort features by importance
sorted_idx = np.argsort(feature_importance)[::-1]

# Display top features
print("\nTop 10 Most Important Features:")
print(f"{'Rank':<6} {'Feature':<12} {'Weight':<12} {'|Weight|':<12}")
print("-" * 48)
for i, idx in enumerate(sorted_idx[:10], 1):
    print(f"{i:<6} Feature {idx:<3}   {weights[idx]:>10.4f}   {feature_importance[idx]:>10.4f}")

# Visualize all feature weights
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Bar plot of feature importance
axes[0].bar(range(len(feature_importance)), feature_importance[sorted_idx])
axes[0].set_xlabel('Feature Rank', fontsize=12)
axes[0].set_ylabel('|Weight|', fontsize=12)
axes[0].set_title('Feature Importance (Absolute Weight Values)', 
                  fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')

# Weight distribution
axes[1].hist(weights, bins=20, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='red', linestyle='--', linewidth=2, label='Zero')
axes[1].set_xlabel('Weight Value', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Distribution of Feature Weights', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Part 8: Decision Boundary Visualization (2D Projection)
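
With two features, the decision boundary at probability 0.5 is the straight line $w_1 x_1 + w_2 x_2 + b = 0$, because $\sigma(z) = 0.5$ exactly when $z = 0$. The sketch below solves that line for $x_2$. The weights and bias are hypothetical values, not the ones a trained model would produce.

```python
import numpy as np

def boundary_x2(x1, w, b):
    """x2 on the p = 0.5 boundary: w[0]*x1 + w[1]*x2 + b = 0 (requires w[1] != 0)."""
    return -(w[0] * x1 + b) / w[1]

# Hypothetical parameters for illustration
w_demo = np.array([1.5, -2.0])
b_demo = 0.5

x1 = np.array([-1.0, 0.0, 1.0])
x2 = boundary_x2(x1, w_demo, b_demo)

# Points on the line should have linear score 0, hence probability 0.5
z_on_boundary = w_demo[0] * x1 + w_demo[1] * x2 + b_demo
p_on_boundary = 1.0 / (1.0 + np.exp(-z_on_boundary))
print(x2, p_on_boundary)
```

This is the same line that a contour at level 0.5 traces when the predicted probability is evaluated on a mesh grid.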

In [None]:
print("\n" + "=" * 80)
print("DECISION BOUNDARY VISUALIZATION")
print("=" * 80)

# Take the two most important features for visualization
top_2_features = sorted_idx[:2]
X_train_2d = X_train_scaled[:, top_2_features]
X_test_2d = X_test_scaled[:, top_2_features]

# Train model on 2D data
model_2d = LogisticRegressionScratch(
    learning_rate=0.1,
    n_iterations=1000,
    regularization=None,
    verbose=False
)

model_2d.fit(X_train_2d, y_train)
y_pred_2d = model_2d.predict(X_test_2d)
accuracy_2d = np.mean(y_pred_2d == y_test)

print(f"\n2D Model Accuracy: {accuracy_2d:.4f}")
print(f"Features used: {top_2_features[0]} and {top_2_features[1]}")

# Create mesh grid for decision boundary
x_min, x_max = X_train_2d[:, 0].min() - 1, X_train_2d[:, 0].max() + 1
y_min, y_max = X_train_2d[:, 1].min() - 1, X_train_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))

# Predict on mesh
Z = model_2d.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
fig, ax = plt.subplots(figsize=(12, 8))

# Decision boundary and regions
contourf = ax.contourf(xx, yy, Z, levels=20, cmap='RdYlBu', alpha=0.6)
contour = ax.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=3)

# Scatter plot of data points
scatter = ax.scatter(X_test_2d[:, 0], X_test_2d[:, 1], 
                    c=y_test, cmap='RdYlBu', s=100, 
                    edgecolors='black', linewidth=1.5)

ax.set_xlabel(f'Feature {top_2_features[0]}', fontsize=12)
ax.set_ylabel(f'Feature {top_2_features[1]}', fontsize=12)
ax.set_title(f'Decision Boundary (2D Projection)\nAccuracy: {accuracy_2d:.4f}', 
            fontsize=14, fontweight='bold')

# Colorbar for probability
cbar = plt.colorbar(contourf, ax=ax)
cbar.set_label('P(Class = 1)', fontsize=12)

# Legend
legend_labels = ['Class 0', 'Class 1']
legend_handles = [plt.Line2D([0], [0], marker='o', color='w', 
                            markerfacecolor=scatter.cmap(scatter.norm(i)), 
                            markersize=10, markeredgecolor='black')
                 for i in [0, 1]]
ax.legend(legend_handles, legend_labels, loc='upper right', fontsize=10)

plt.tight_layout()
plt.show()

## Part 9: Learning Rate Comparison
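
The effect of step size is easiest to see on a toy problem. The sketch below runs fixed-step gradient descent on $f(w) = w^2$, which is a stand-in chosen for this example and not the logistic loss. The value 1.1 is a deliberately oversized, hypothetical step that is not among the learning rates tried in this part.

```python
def gradient_descent_1d(lr, n_steps=50, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) with a fixed step size."""
    w = w0
    for _ in range(n_steps):
        w -= lr * 2 * w  # each step multiplies w by (1 - 2*lr)
    return w

# Small, moderate, ideal-for-this-toy, and too-large step sizes
final_w = {lr: gradient_descent_1d(lr) for lr in (0.01, 0.1, 0.5, 1.1)}
for lr, w in final_w.items():
    print(f"lr={lr:<5} final |w| = {abs(w):.3e}")
```

A small step converges slowly, a moderate one converges quickly, and a step with $|1 - 2\alpha| > 1$ makes the iterates grow instead of shrink. The same qualitative pattern is what the loss curves for different learning rates are meant to reveal.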

In [None]:
print("\n" + "=" * 80)
print("LEARNING RATE COMPARISON")
print("=" * 80)

learning_rates = [0.001, 0.01, 0.1, 0.5]
lr_results = {}

for lr in learning_rates:
    model = LogisticRegressionScratch(
        learning_rate=lr,
        n_iterations=1000,
        regularization=None,
        verbose=False
    )
    
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    accuracy = np.mean(y_pred == y_test)
    
    lr_results[lr] = {
        'model': model,
        'accuracy': accuracy
    }
    
    print(f"Learning Rate: {lr:6.3f} | Test Accuracy: {accuracy:.4f}")

# Plot learning curves for different learning rates
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for idx, (lr, results) in enumerate(lr_results.items()):
    model = results['model']
    
    axes[idx].plot(model.loss_history, linewidth=2)
    axes[idx].set_xlabel('Iteration', fontsize=11)
    axes[idx].set_ylabel('Loss', fontsize=11)
    axes[idx].set_title(f'Learning Rate: {lr} (Acc: {results["accuracy"]:.4f})',
                       fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Effect of Learning Rate on Convergence', 
            fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

## Part 10: Probability Calibration Analysis
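
A reliability diagram groups predictions into probability bins and compares the mean predicted probability in each bin with the observed fraction of positives. The sketch below shows one way to do the binning with `np.digitize`, so that a prediction of exactly 1.0 falls into the last bin. The helper `reliability_bins` and the demo arrays are illustrative assumptions, not the notebook's data.

```python
import numpy as np

def reliability_bins(y_true, y_prob, n_bins=10):
    """Return (mean predicted prob, observed positive fraction, count) per non-empty bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use interior edges only: bin ids run 0..n_bins-1 and p == 1.0 lands in the last bin
    bin_ids = np.digitize(y_prob, edges[1:-1])
    rows = []
    for k in range(n_bins):
        mask = bin_ids == k
        if mask.any():
            rows.append((float(y_prob[mask].mean()),
                         float(y_true[mask].mean()),
                         int(mask.sum())))
    return rows

# Tiny made-up example with two bins: [0, 0.5) and [0.5, 1]
y_true_demo = np.array([0, 0, 1, 1, 1])
y_prob_demo = np.array([0.05, 0.15, 0.85, 0.95, 1.0])

bins_demo = reliability_bins(y_true_demo, y_prob_demo, n_bins=2)
for mean_pred, frac_pos, count in bins_demo:
    print(f"mean predicted={mean_pred:.3f}  observed={frac_pos:.3f}  n={count}")
```

Bins with very few samples give noisy observed fractions, so the per-bin counts are worth inspecting alongside the curve.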

In [None]:
print("\n" + "=" * 80)
print("PROBABILITY CALIBRATION ANALYSIS")
print("=" * 80)

# Get predicted probabilities
y_pred_proba = model_basic.predict_proba(X_test_scaled)

# Create probability bins
n_bins = 10
bins = np.linspace(0, 1, n_bins + 1)
bin_centers = (bins[:-1] + bins[1:]) / 2

# Calculate actual positive rate in each bin
bin_true_probs = []
bin_pred_probs = []
bin_counts = []

for i in range(n_bins):
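    # Include the right edge in the last bin so a probability of exactly 1.0 is counted
    upper = (y_pred_proba <= bins[i+1]) if i == n_bins - 1 else (y_pred_proba < bins[i+1])
    mask = (y_pred_proba >= bins[i]) & upper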
    if np.sum(mask) > 0:
        bin_true_probs.append(np.mean(y_test[mask]))
        bin_pred_probs.append(np.mean(y_pred_proba[mask]))
        bin_counts.append(np.sum(mask))
    else:
        bin_true_probs.append(np.nan)
        bin_pred_probs.append(np.nan)
        bin_counts.append(0)

# Plot calibration curve
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Reliability diagram
axes[0].plot([0, 1], [0, 1], 'k--', linewidth=2, label='Perfect calibration')
axes[0].plot(bin_pred_probs, bin_true_probs, 'o-', linewidth=2, 
            markersize=8, label='Model calibration')
axes[0].set_xlabel('Predicted Probability', fontsize=12)
axes[0].set_ylabel('Observed Fraction of Positives', fontsize=12)
axes[0].set_title('Calibration Curve (Reliability Diagram)', 
                 fontsize=14, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)
axes[0].set_xlim([0, 1])
axes[0].set_ylim([0, 1])

# Histogram of predicted probabilities
axes[1].hist([y_pred_proba[y_test == 0], y_pred_proba[y_test == 1]], 
            bins=20, label=['Class 0', 'Class 1'], alpha=0.7, edgecolor='black')
axes[1].axvline(x=0.5, color='red', linestyle='--', linewidth=2, 
               label='Decision threshold')
axes[1].set_xlabel('Predicted Probability', fontsize=12)
axes[1].set_ylabel('Count', fontsize=12)
axes[1].set_title('Distribution of Predicted Probabilities', 
                 fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n✓ Calibration analysis complete")
print("  - Well-calibrated: points close to the diagonal")
print("  - Over-predicting class 1: points below the diagonal (predicted > observed)")
print("  - Under-predicting class 1: points above the diagonal (predicted < observed)")

## Part 11: Gradient Magnitude Analysis

In [None]:
print("\n" + "=" * 80)
print("GRADIENT MAGNITUDE TRACKING")
print("=" * 80)

# Create a new model with gradient tracking
class LogisticRegressionWithGradients(LogisticRegressionScratch):
    """Extended version that tracks gradient magnitudes."""
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.gradient_magnitudes = []
    
    def fit(self, X, y):
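        # Initialize parameters and reset histories so refitting starts clean
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.loss_history = []
        self.accuracy_history = []
        self.gradient_magnitudes = []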
        
        # Gradient descent with gradient tracking
        for iteration in range(self.n_iterations):
            # Forward pass
            linear_output = X.dot(self.weights) + self.bias
            y_pred = self._sigmoid(linear_output)
            
            # Compute loss
            loss = self._compute_loss(y, y_pred, self.weights)
            self.loss_history.append(loss)
            
            # Compute accuracy
            y_pred_class = (y_pred >= 0.5).astype(int)
            accuracy = np.mean(y_pred_class == y)
            self.accuracy_history.append(accuracy)
            
            # Compute gradients
            dw, db = self._compute_gradients(X, y, y_pred)
            
            # Track gradient magnitude
            grad_magnitude = np.sqrt(np.sum(dw**2) + db**2)
            self.gradient_magnitudes.append(grad_magnitude)
            
            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            
            if self.verbose and (iteration % 100 == 0):
                print(f"Iter {iteration:4d} | Loss: {loss:.6f} | "
                      f"Grad Magnitude: {grad_magnitude:.6f}")
        
        return self

# Train model with gradient tracking
model_grad = LogisticRegressionWithGradients(
    learning_rate=0.1,
    n_iterations=1000,
    regularization=None,
    verbose=False
)

model_grad.fit(X_train_scaled, y_train)

# Plot gradient magnitudes
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Linear scale
axes[0].plot(model_grad.gradient_magnitudes, linewidth=2, color='purple')
axes[0].set_xlabel('Iteration', fontsize=12)
axes[0].set_ylabel('Gradient Magnitude', fontsize=12)
axes[0].set_title('Gradient Magnitude Over Time (Linear Scale)', 
                 fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Log scale
axes[1].semilogy(model_grad.gradient_magnitudes, linewidth=2, color='purple')
axes[1].set_xlabel('Iteration', fontsize=12)
axes[1].set_ylabel('Gradient Magnitude (log scale)', fontsize=12)
axes[1].set_title('Gradient Magnitude Over Time (Log Scale)', 
                 fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✓ Gradient magnitude analysis complete")
print(f"  - Initial gradient: {model_grad.gradient_magnitudes[0]:.6f}")
print(f"  - Final gradient: {model_grad.gradient_magnitudes[-1]:.6f}")
print(f"  - Reduction factor: {model_grad.gradient_magnitudes[0]/model_grad.gradient_magnitudes[-1]:.2f}x")

## Part 12: Summary and Key Takeaways

In [None]:
print("\n" + "=" * 80)
print("SUMMARY AND KEY TAKEAWAYS")
print("=" * 80)

summary_data = {
    'Model': ['Basic', 'L1 Regularized', 'L2 Regularized'],
    'Test Accuracy': [
        models['none']['accuracy'],
        models['l1']['accuracy'],
        models['l2']['accuracy']
    ],
    'Final Loss': [
        models['none']['model'].loss_history[-1],
        models['l1']['model'].loss_history[-1],
        models['l2']['model'].loss_history[-1]
    ]
}

summary_df = pd.DataFrame(summary_data)
print("\nModel Comparison:")
print(summary_df.to_string(index=False))

print("\n" + "="*80)
print("KEY CONCEPTS DEMONSTRATED:")
print("="*80)
print("""
1. SIGMOID FUNCTION
   - Maps linear output to (0,1) probability range
   - Smooth, differentiable activation function
   
2. BINARY CROSS-ENTROPY LOSS
   - Measures how well probabilities match true labels
   - Penalizes confident wrong predictions heavily
   
3. GRADIENT DESCENT OPTIMIZATION
   - Iteratively updates weights to minimize loss
   - Learning rate controls step size
   - Gradients computed via chain rule
   
4. REGULARIZATION
   - L1: Pushes weights toward zero (sparsity); plain subgradient steps rarely reach exactly zero
   - L2: Encourages small weights (helps reduce overfitting)
   
5. MODEL EVALUATION
   - Accuracy: Overall correctness
   - Precision: Positive predictions that are correct
   - Recall: Actual positives that are found
   - ROC-AUC: Threshold-independent performance
   
6. FEATURE IMPORTANCE
   - On standardized features, weight magnitude indicates relative influence
   - Sign indicates whether a feature raises or lowers the predicted log-odds
   
7. PROBABILITY CALIBRATION
   - Predicted probabilities should match empirical frequencies
   - Important for decision-making under uncertainty
""")

print("\n" + "="*80)
print("IMPLEMENTATION NOTES:")
print("="*80)
print("""
✓ Numerical Stability:
  - Clip sigmoid inputs to prevent overflow
  - Clip predicted probabilities to [epsilon, 1 - epsilon] to avoid log(0)
  
✓ Feature Scaling:
  - StandardScaler applied before training
  - Critical for gradient descent convergence
  
✓ Vectorization:
  - Matrix operations for efficiency
  - Avoids slow Python loops
  
✓ Training Monitoring:
  - Track loss and accuracy
  - Visualize convergence
  - Detect potential issues early
""")

print("\n" + "="*80)
print("NOTEBOOK COMPLETE!")
print("="*80)