# Linear Regression from Scratch - Gradient Descent

Implementation of linear regression using gradient descent with L1 (Lasso) and L2 (Ridge) regularization.

**Key Concepts:**
- Mean Squared Error (MSE) loss
- Gradient descent optimization
- L1 regularization (Lasso)
- L2 regularization (Ridge)
- Feature scaling


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes, make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

print("=" * 80)
print("LINEAR REGRESSION FROM SCRATCH")
print("=" * 80)

## Mathematical Foundation

**Linear Regression Hypothesis:**
$$h(x) = w^T x + b$$

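where $w$ is the weight vector, $b$ is the bias, and $x$ is the input feature vector. Below, $m$ is the number of samples and $n$ the number of features.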

**Mean Squared Error Loss:**
$$L(w, b) = \frac{1}{2m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2$$

**L2 Regularization (Ridge):**
$$L_{Ridge} = L_{MSE} + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$

**L1 Regularization (Lasso):**
$$L_{Lasso} = L_{MSE} + \frac{\lambda}{m} \sum_{j=1}^{n} |w_j|$$

**Gradient Descent Update:**
$$w := w - \alpha \frac{\partial L}{\partial w}$$
$$b := b - \alpha \frac{\partial L}{\partial b}$$

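where, for the unregularized MSE loss ($X$ is the $m \times n$ design matrix and $\hat{y} = Xw + b$ is the vector of predictions):
$$\frac{\partial L}{\partial w} = \frac{1}{m} X^T (\hat{y} - y)$$
$$\frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})$$

Regularization adds $\frac{\lambda}{m} w$ (Ridge) or $\frac{\lambda}{m} \operatorname{sign}(w)$ (Lasso, a subgradient) to $\frac{\partial L}{\partial w}$. The bias is not regularized.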
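
A quick way to gain confidence in gradient formulas like the ones above is to compare them with central finite differences. The sketch below does this for the unregularized loss only; the helper names (`mse_loss`, `mse_gradients`, `numerical_gradients`) and the random test data are illustrative assumptions, not part of the notebook's class.

```python
import numpy as np

def mse_loss(X, y, w, b):
    """Half mean squared error, matching L(w, b) in the text."""
    m = len(y)
    residual = X @ w + b - y
    return np.sum(residual ** 2) / (2 * m)

def mse_gradients(X, y, w, b):
    """Analytic gradients of the half-MSE loss with respect to w and b."""
    m = len(y)
    residual = X @ w + b - y
    return X.T @ residual / m, np.sum(residual) / m

def numerical_gradients(X, y, w, b, eps=1e-6):
    """Central finite differences, perturbing one parameter at a time."""
    dw = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        dw[j] = (mse_loss(X, y, w + step, b) - mse_loss(X, y, w - step, b)) / (2 * eps)
    db = (mse_loss(X, y, w, b + eps) - mse_loss(X, y, w, b - eps)) / (2 * eps)
    return dw, db

# Small random problem (hypothetical data, used only for the check)
rng = np.random.default_rng(0)
X_chk = rng.normal(size=(20, 3))
y_chk = rng.normal(size=20)
w_chk = rng.normal(size=3)
b_chk = 0.5

dw_a, db_a = mse_gradients(X_chk, y_chk, w_chk, b_chk)
dw_n, db_n = numerical_gradients(X_chk, y_chk, w_chk, b_chk)
print("max |dw difference|:", np.max(np.abs(dw_a - dw_n)))
print("|db difference|:    ", abs(db_a - db_n))
```

Because the loss is quadratic, the two estimates should agree up to floating-point rounding.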

## Linear Regression Class

In [None]:
class LinearRegressionScratch:
    """
    Linear Regression implemented from scratch using gradient descent.
    
    Parameters:
    -----------
    learning_rate : float, default=0.01
        Step size for gradient descent updates
    n_iterations : int, default=1000
        Number of gradient descent iterations
    regularization : str or None, default=None
        Type of regularization: None, 'l1' (Lasso), or 'l2' (Ridge)
    lambda_reg : float, default=0.01
        Regularization strength
    verbose : bool, default=False
        Whether to print training progress
    """
    
    def __init__(self, learning_rate=0.01, n_iterations=1000, 
                 regularization=None, lambda_reg=0.01, verbose=False):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
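        # Reject unknown values early; otherwise a typo would silently mean "no regularization"
        if regularization not in (None, 'l1', 'l2'):
            raise ValueError("regularization must be None, 'l1', or 'l2'")
        self.regularization = regularization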
        self.lambda_reg = lambda_reg
        self.verbose = verbose
        
        # Model parameters
        self.weights = None
        self.bias = None
        
        # Training history
        self.loss_history = []
        
    def fit(self, X, y):
        """
        Fit linear regression model using gradient descent.
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
            Training data
        y : array-like, shape (n_samples,)
            Target values
        """
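        # Convert inputs to float arrays so lists and DataFrames also work;
        # flatten y so a column vector does not broadcast against predictions
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float).ravel()

        # Initialize parameters and reset the history so refitting starts clean
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.loss_history = []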
        
        # Gradient descent
        for iteration in range(self.n_iterations):
            # Forward pass: compute predictions
            y_pred = self._predict(X)
            
            # Compute loss
            loss = self._compute_loss(y, y_pred)
            self.loss_history.append(loss)
            
            # Compute gradients
            dw, db = self._compute_gradients(X, y, y_pred)
            
            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            
            # Print progress
            if self.verbose and (iteration % 100 == 0 or iteration == self.n_iterations - 1):
                print(f"Iteration {iteration:4d} | Loss: {loss:.6f}")
        
        return self
    
    def _predict(self, X):
        """
        Make predictions using current parameters.
        """
        return X.dot(self.weights) + self.bias
    
    def _compute_loss(self, y_true, y_pred):
        """
        Compute Mean Squared Error loss with optional regularization.
        """
        m = len(y_true)
        
        # MSE loss
        mse_loss = (1 / (2 * m)) * np.sum((y_pred - y_true) ** 2)
        
        # Add regularization term
        reg_term = 0
        if self.regularization == 'l2':
            reg_term = (self.lambda_reg / (2 * m)) * np.sum(self.weights ** 2)
        elif self.regularization == 'l1':
            reg_term = (self.lambda_reg / m) * np.sum(np.abs(self.weights))
        
        return mse_loss + reg_term
    
    def _compute_gradients(self, X, y_true, y_pred):
        """
        Compute gradients for weights and bias.
        """
        m = len(y_true)
        error = y_pred - y_true
        
        # Gradient for weights
        dw = (1 / m) * X.T.dot(error)
        
        # Add regularization gradient
        if self.regularization == 'l2':
            dw += (self.lambda_reg / m) * self.weights
        elif self.regularization == 'l1':
            # Subgradient of |w|; np.sign(0) == 0, so weights at exactly zero get no push
            dw += (self.lambda_reg / m) * np.sign(self.weights)
        
        # Gradient for bias (no regularization on bias)
        db = (1 / m) * np.sum(error)
        
        return dw, db
    
    def predict(self, X):
        """
        Predict target values for samples in X.
        """
        return self._predict(np.asarray(X, dtype=float))
    
    def get_params(self):
        """
        Return model parameters.
        """
        return {
            'weights': self.weights,
            'bias': self.bias,
            'loss_history': self.loss_history
        }

print("\n✓ LinearRegressionScratch class defined")
print("  - Gradient descent optimization")
print("  - MSE loss function")
print("  - Optional L1/L2 regularization")

## Example 1: Simple Synthetic Data
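
As a reference point for the gradient descent fit in this example, the same kind of data can be solved in closed form with least squares. The sketch below regenerates data with the same recipe (the names `X_demo`, `y_demo`, and `theta_ls` are assumptions for illustration) and compares `np.linalg.lstsq` with a bare gradient descent loop; with enough iterations the two should agree closely.

```python
import numpy as np

np.random.seed(42)
X_demo = 2 * np.random.rand(100, 1)
y_demo = 4 + 3 * X_demo.ravel() + np.random.randn(100)

# Closed-form least squares: prepend a column of ones so the
# first coefficient plays the role of the bias
A = np.column_stack([np.ones(len(X_demo)), X_demo])
theta_ls, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# Plain batch gradient descent on the half-MSE loss
w, b = np.zeros(1), 0.0
for _ in range(1000):
    residual = X_demo @ w + b - y_demo
    w -= 0.1 * (X_demo.T @ residual) / len(y_demo)
    b -= 0.1 * residual.mean()

print(f"least squares:    bias={theta_ls[0]:.4f}, weight={theta_ls[1]:.4f}")
print(f"gradient descent: bias={b:.4f}, weight={w[0]:.4f}")
```

Neither estimate is expected to equal the true values 4 and 3 exactly, because the noise shifts the best-fit line.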

In [None]:
# Generate simple 1D data for visualization
np.random.seed(42)
X_simple = 2 * np.random.rand(100, 1)
y_simple = 4 + 3 * X_simple.ravel() + np.random.randn(100)

print("Simple 1D Dataset")
print(f"Samples: {len(X_simple)}")
print(f"Features: {X_simple.shape[1]}")
print(f"True equation: y ≈ 4 + 3x + noise")

In [None]:
# Train model
model_simple = LinearRegressionScratch(
    learning_rate=0.1,
    n_iterations=1000,
    regularization=None,
    verbose=True
)

model_simple.fit(X_simple, y_simple)

# Get predictions
y_pred_simple = model_simple.predict(X_simple)

# Learned parameters
print(f"\nLearned parameters:")
print(f"  Weight: {model_simple.weights[0]:.4f} (true: 3.0)")
print(f"  Bias: {model_simple.bias:.4f} (true: 4.0)")

# Metrics
mse = mean_squared_error(y_simple, y_pred_simple)
rmse = np.sqrt(mse)
r2 = r2_score(y_simple, y_pred_simple)
print(f"\nMetrics:")
print(f"  MSE:  {mse:.4f}")
print(f"  RMSE: {rmse:.4f}")
print(f"  R²:   {r2:.4f}")

In [None]:
# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Scatter plot with regression line
axes[0].scatter(X_simple, y_simple, alpha=0.5, label='Data')
axes[0].plot(X_simple, y_pred_simple, color='red', linewidth=2, label='Fitted line')
axes[0].set_xlabel('X', fontsize=12)
axes[0].set_ylabel('y', fontsize=12)
axes[0].set_title('Linear Regression Fit', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Loss curve
axes[1].plot(model_simple.loss_history, linewidth=2)
axes[1].set_xlabel('Iteration', fontsize=12)
axes[1].set_ylabel('Loss', fontsize=12)
axes[1].set_title('Training Loss', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Example 2: Diabetes Dataset
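
Standardization matters for gradient descent because the loss surface becomes badly conditioned when features live on very different scales. The sketch below illustrates this on two hypothetical features (not the diabetes data): it standardizes by hand, using the same per-column transform that `StandardScaler` applies (zero mean, unit population standard deviation), and compares the condition number of the Hessian $X^T X / m$ before and after.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical features on very different scales
X_raw = np.column_stack([
    rng.normal(0, 1, size=200),
    rng.normal(0, 1000, size=200),
])

def hessian_condition_number(X):
    """Condition number of the half-MSE Hessian X^T X / m for the weights."""
    H = X.T @ X / len(X)
    return np.linalg.cond(H)

# Standardize by hand: zero mean, unit variance per column
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

cond_raw = hessian_condition_number(X_raw)
cond_std = hessian_condition_number(X_std)
print(f"condition number, raw features:          {cond_raw:.1f}")
print(f"condition number, standardized features: {cond_std:.3f}")
```

A large condition number forces a small learning rate for the steep direction while the flat direction barely moves, so convergence is slow; after scaling, a single learning rate suits all directions.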

In [None]:
# Load diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features (important for gradient descent!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nDiabetes Dataset")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Features: {X.shape[1]}")
print(f"Target range: [{y.min():.2f}, {y.max():.2f}]")

## Compare Regularization Techniques

In [None]:
# Train models with different regularization
models = {}
regularizations = [None, 'l1', 'l2']

for reg in regularizations:
    print(f"\nTraining with regularization: {reg if reg else 'None'}")
    
    model = LinearRegressionScratch(
        learning_rate=0.1,
        n_iterations=1000,
        regularization=reg,
        lambda_reg=0.1,
        verbose=False
    )
    
    model.fit(X_train_scaled, y_train)
    
    # Predictions
    y_pred_train = model.predict(X_train_scaled)
    y_pred_test = model.predict(X_test_scaled)
    
    # Metrics
    train_r2 = r2_score(y_train, y_pred_train)
    test_r2 = r2_score(y_test, y_pred_test)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    
    models[reg if reg else 'none'] = {
        'model': model,
        'train_r2': train_r2,
        'test_r2': test_r2,
        'test_rmse': test_rmse
    }
    
    print(f"  Train R²: {train_r2:.4f}")
    print(f"  Test R²:  {test_r2:.4f}")
    print(f"  Test RMSE: {test_rmse:.2f}")

In [None]:
# Compare loss curves
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (reg_type, model_info) in enumerate(models.items()):
    model = model_info['model']
    
    axes[idx].plot(model.loss_history, linewidth=2)
    axes[idx].set_xlabel('Iteration', fontsize=12)
    axes[idx].set_ylabel('Loss', fontsize=12)
    axes[idx].set_title(
        f'{reg_type.upper()} Regularization\n(Test R²: {model_info["test_r2"]:.4f})',
        fontsize=12, fontweight='bold'
    )
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Feature Importance Analysis

In [None]:
# Compare weights across regularization methods
feature_names = diabetes.feature_names

# Create dataframe with weights
weights_df = pd.DataFrame({
    'Feature': feature_names,
    'No Reg': models['none']['model'].weights,
    'L1 (Lasso)': models['l1']['model'].weights,
    'L2 (Ridge)': models['l2']['model'].weights
})

print("\nFeature Weights Comparison:")
print(weights_df.to_string(index=False))

In [None]:
# Visualize weights
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (reg_type, model_info) in enumerate(models.items()):
    weights = model_info['model'].weights
    
    axes[idx].barh(feature_names, weights)
    axes[idx].set_xlabel('Weight Value', fontsize=12)
    axes[idx].set_title(
        f'{reg_type.upper()} Regularization',
        fontsize=12, fontweight='bold'
    )
    axes[idx].axvline(x=0, color='red', linestyle='--', linewidth=1)
    axes[idx].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

## Effect of Regularization Strength
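
For L2 regularization the minimizer of the loss defined earlier has a closed form, which gives a useful cross-check on the gradient descent results. The sketch below is written under the assumption that the bias is unpenalized (as in the class above), so the data is centered first; `ridge_closed_form` and the synthetic data are hypothetical names introduced for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (1/2m)||Xw + b - y||^2 + (lam/2m)||w||^2 with an unpenalized bias."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering removes the bias from the problem
    n_features = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

rng = np.random.default_rng(0)
X_r = rng.normal(size=(100, 5))
y_r = X_r @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=100)

ridge_norms = []
for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    w_r, b_r = ridge_closed_form(X_r, y_r, lam)
    ridge_norms.append(np.linalg.norm(w_r))
    print(f"lambda={lam:7.1f}  ||w||_2={ridge_norms[-1]:.4f}")
```

The weight norm shrinks as $\lambda$ grows, which is the trend the experiment in this section measures with the sum of absolute weights.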

In [None]:
# Test different lambda values for L2 regularization
lambda_values = [0.001, 0.01, 0.1, 1.0, 10.0]
lambda_results = []

for lambda_val in lambda_values:
    model = LinearRegressionScratch(
        learning_rate=0.1,
        n_iterations=1000,
        regularization='l2',
        lambda_reg=lambda_val,
        verbose=False
    )
    
    model.fit(X_train_scaled, y_train)
    
    y_pred_train = model.predict(X_train_scaled)
    y_pred_test = model.predict(X_test_scaled)
    
    lambda_results.append({
        'lambda': lambda_val,
        'train_r2': r2_score(y_train, y_pred_train),
        'test_r2': r2_score(y_test, y_pred_test),
        'test_rmse': np.sqrt(mean_squared_error(y_test, y_pred_test)),
        'weight_sum': np.sum(np.abs(model.weights))
    })

lambda_df = pd.DataFrame(lambda_results)
print("\nEffect of Lambda (L2 Regularization):")
print(lambda_df.to_string(index=False))

In [None]:
# Visualize effect of lambda
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# R² vs Lambda
axes[0].plot(lambda_df['lambda'], lambda_df['train_r2'], 
            marker='o', linewidth=2, label='Train R²')
axes[0].plot(lambda_df['lambda'], lambda_df['test_r2'], 
            marker='s', linewidth=2, label='Test R²')
axes[0].set_xlabel('Lambda (Regularization Strength)', fontsize=12)
axes[0].set_ylabel('R² Score', fontsize=12)
axes[0].set_title('Model Performance vs Lambda', fontsize=14, fontweight='bold')
axes[0].set_xscale('log')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Weight magnitude vs Lambda
axes[1].plot(lambda_df['lambda'], lambda_df['weight_sum'], 
            marker='o', linewidth=2, color='green')
axes[1].set_xlabel('Lambda (Regularization Strength)', fontsize=12)
axes[1].set_ylabel('Sum of Absolute Weights', fontsize=12)
axes[1].set_title('Weight Magnitude vs Lambda', fontsize=14, fontweight='bold')
axes[1].set_xscale('log')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Learning Rate Comparison
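
For a quadratic loss such as MSE, batch gradient descent converges only if the learning rate stays below $2 / \lambda_{max}$, where $\lambda_{max}$ is the largest eigenvalue of the Hessian. The sketch below checks this on hypothetical synthetic data (not the diabetes set); `A_lr`, `lr_max`, and `final_loss` are illustrative names, and the bias is folded in as a column of ones so one Hessian covers all parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
X_lr = rng.normal(size=(200, 3))
y_lr = X_lr @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Augment with a ones column so theta = [bias, weights]
A_lr = np.column_stack([np.ones(len(X_lr)), X_lr])
H = A_lr.T @ A_lr / len(A_lr)               # Hessian of the half-MSE loss
lr_max = 2.0 / np.linalg.eigvalsh(H).max()  # stability bound

def final_loss(lr, n_iterations=200):
    """Run batch gradient descent from zeros and return the last half-MSE loss."""
    theta = np.zeros(A_lr.shape[1])
    for _ in range(n_iterations):
        theta -= lr * A_lr.T @ (A_lr @ theta - y_lr) / len(y_lr)
    return np.sum((A_lr @ theta - y_lr) ** 2) / (2 * len(y_lr))

initial_loss = np.sum(y_lr ** 2) / (2 * len(y_lr))
loss_stable = final_loss(0.5 * lr_max)
loss_unstable = final_loss(1.1 * lr_max)

print(f"stability bound 2 / lambda_max: {lr_max:.4f}")
print(f"initial loss:                   {initial_loss:.4f}")
print(f"loss at 0.5 * bound:            {loss_stable:.4f}")
print(f"loss at 1.1 * bound:            {loss_unstable:.3e}")
```

Computing this bound for the training matrix before choosing learning rates shows in advance which values in a comparison will diverge.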

In [None]:
# Compare different learning rates
learning_rates = [0.001, 0.01, 0.1, 0.5]
lr_models = {}

for lr in learning_rates:
    model = LinearRegressionScratch(
        learning_rate=lr,
        n_iterations=1000,
        regularization=None,
        verbose=False
    )
    
    model.fit(X_train_scaled, y_train)
    y_pred_test = model.predict(X_test_scaled)
    test_r2 = r2_score(y_test, y_pred_test)
    
    lr_models[lr] = {
        'model': model,
        'test_r2': test_r2
    }
    
    print(f"Learning Rate: {lr:6.3f} | Test R²: {test_r2:.4f}")

In [None]:
# Plot learning curves
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for idx, (lr, results) in enumerate(lr_models.items()):
    model = results['model']
    
    axes[idx].plot(model.loss_history, linewidth=2)
    axes[idx].set_xlabel('Iteration', fontsize=11)
    axes[idx].set_ylabel('Loss', fontsize=11)
    axes[idx].set_title(
        f'Learning Rate: {lr} (R²: {results["test_r2"]:.4f})',
        fontsize=12, fontweight='bold'
    )
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Effect of Learning Rate on Convergence', 
            fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

## Example 3: High-Dimensional Synthetic Data
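
The class above applies L1 regularization through a subgradient step, which tends to leave small but non-zero weights. A different technique, proximal gradient descent (ISTA) with soft-thresholding, produces exact zeros. The sketch below is an alternative for comparison, not the notebook's method; `soft_threshold`, `lasso_proximal_gd`, the synthetic data, and `lam=40.0` are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1: shrink toward zero, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_proximal_gd(X, y, lam, learning_rate=0.1, n_iterations=1000):
    """Proximal gradient descent (ISTA) for half-MSE + (lam/m) * ||w||_1."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iterations):
        residual = X @ w + b - y
        # Gradient step on the smooth MSE part, then soft-threshold for the L1 part
        w = soft_threshold(w - learning_rate * (X.T @ residual) / m,
                           learning_rate * lam / m)
        b -= learning_rate * residual.mean()  # bias is not regularized
    return w, b

rng = np.random.default_rng(0)
X_sp = rng.normal(size=(200, 10))
true_w = np.array([5.0, -3.0, 2.0] + [0.0] * 7)  # only 3 informative features
y_sp = X_sp @ true_w + rng.normal(scale=0.5, size=200)

w_sp, b_sp = lasso_proximal_gd(X_sp, y_sp, lam=40.0)
print("weights:", np.round(w_sp, 3))
print("exact zeros:", int(np.sum(w_sp == 0.0)), "of", len(w_sp))
```

With this update no magnitude threshold is needed to count zero weights, since uninformative coefficients are set to exactly `0.0`.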

In [None]:
# Generate high-dimensional dataset
X_syn, y_syn = make_regression(
    n_samples=500,
    n_features=50,
    n_informative=20,
    noise=10,
    random_state=42
)

X_train_syn, X_test_syn, y_train_syn, y_test_syn = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Standardize
scaler_syn = StandardScaler()
X_train_syn_scaled = scaler_syn.fit_transform(X_train_syn)
X_test_syn_scaled = scaler_syn.transform(X_test_syn)

print("\nHigh-Dimensional Synthetic Dataset")
print(f"Training samples: {len(X_train_syn)}")
print(f"Test samples: {len(X_test_syn)}")
print(f"Features: {X_syn.shape[1]}")
print(f"Informative features: 20")

In [None]:
# Train models with different regularization
models_syn = {}

for reg in [None, 'l1', 'l2']:
    model = LinearRegressionScratch(
        learning_rate=0.1,
        n_iterations=1000,
        regularization=reg,
        lambda_reg=0.5,
        verbose=False
    )
    
    model.fit(X_train_syn_scaled, y_train_syn)
    
    y_pred_train = model.predict(X_train_syn_scaled)
    y_pred_test = model.predict(X_test_syn_scaled)
    
    models_syn[reg if reg else 'none'] = {
        'model': model,
        'train_r2': r2_score(y_train_syn, y_pred_train),
        'test_r2': r2_score(y_test_syn, y_pred_test),
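        # The subgradient L1 update rarely lands on exactly zero,
        # so "non-zero" here means |w| above a small threshold
        'non_zero_weights': np.sum(np.abs(model.weights) > 0.01)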
    }

# Summary
print("\nResults on High-Dimensional Data:")
for reg_type, results in models_syn.items():
    print(f"\n{reg_type.upper()} Regularization:")
    print(f"  Train R²: {results['train_r2']:.4f}")
    print(f"  Test R²:  {results['test_r2']:.4f}")
    print(f"  Weights with |w| > 0.01: {results['non_zero_weights']}/{X_syn.shape[1]}")

## Predictions vs Actual Values
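
The plots in this section report $R^2$, which compares the model's squared error with that of always predicting the mean of the targets. A minimal sketch of the computation follows; `r2_manual` and the toy arrays are hypothetical names used only here.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean predictor
    return 1.0 - ss_res / ss_tot

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
r2_perfect = r2_manual(y_true_demo, y_true_demo)                        # 1.0
r2_mean = r2_manual(y_true_demo, np.full(4, y_true_demo.mean()))        # 0.0
r2_reversed = r2_manual(y_true_demo, y_true_demo[::-1])                 # -3.0
print(r2_perfect, r2_mean, r2_reversed)
```

A negative value means the model is worse than predicting the mean, which is what a diverged gradient descent run produces.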

In [None]:
# Plot predictions vs actual for diabetes dataset
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (reg_type, model_info) in enumerate(models.items()):
    model = model_info['model']
    y_pred = model.predict(X_test_scaled)
    
    axes[idx].scatter(y_test, y_pred, alpha=0.5)
    axes[idx].plot([y_test.min(), y_test.max()], 
                   [y_test.min(), y_test.max()], 
                   'r--', linewidth=2, label='Perfect prediction')
    axes[idx].set_xlabel('Actual Values', fontsize=12)
    axes[idx].set_ylabel('Predicted Values', fontsize=12)
    axes[idx].set_title(
        f'{reg_type.upper()} (R²: {model_info["test_r2"]:.4f})',
        fontsize=12, fontweight='bold'
    )
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Summary

**Key Takeaways:**

1. **Gradient Descent**:
   - Iteratively updates weights to minimize loss
   - Learning rate controls convergence speed; too large a value makes the loss diverge
   - Feature scaling puts all features on a comparable scale, which speeds up convergence

2. **L2 Regularization (Ridge)**:
   - Penalizes large weights: $\frac{\lambda}{2m} \sum w_j^2$
   - Shrinks all weights toward zero
   - Good for correlated features
   - Generally does not set weights exactly to zero

3. **L1 Regularization (Lasso)**:
   - Penalizes absolute weights: $\frac{\lambda}{m} \sum |w_j|$
   - Encourages sparsity (feature selection)
   - Can set some weights to exactly zero at the optimum; the subgradient update used here only drives them close to zero
   - Useful for high-dimensional data

4. **Regularization Strength (Lambda)**:
   - Higher λ → stronger regularization → simpler model
   - Lower λ → weaker regularization → more complex model
   - Tune via cross-validation

5. **When to Use Which**:
   - No regularization: Many more samples than features, low noise
   - Ridge (L2): Correlated features, prevent overfitting
   - Lasso (L1): Feature selection, sparse solutions
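
Tuning $\lambda$ via cross-validation, as suggested in the takeaways, can be sketched as follows. This is a minimal k-fold loop under stated assumptions: it uses a closed-form ridge solver instead of the gradient descent class to keep the example short, and `ridge_fit`, `cv_mse`, the synthetic data, and the candidate grid are all hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge with an unpenalized bias (data is centered first)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w

def cv_mse(X, y, lam, n_folds=5, seed=0):
    """Mean validation MSE over shuffled k folds."""
    indices = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(indices, n_folds)
    scores = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        w, b = ridge_fit(X[train_idx], y[train_idx], lam)
        scores.append(np.mean((X[val_idx] @ w + b - y[val_idx]) ** 2))
    return float(np.mean(scores))

# Hypothetical data: 30 features, only the first 5 carry signal
rng = np.random.default_rng(1)
X_cv = rng.normal(size=(60, 30))
y_cv = X_cv[:, :5] @ np.array([2.0, -1.0, 1.5, 0.5, -2.0]) + rng.normal(scale=2.0, size=60)

candidate_lambdas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
cv_scores = {lam: cv_mse(X_cv, y_cv, lam) for lam in candidate_lambdas}
best_lambda = min(cv_scores, key=cv_scores.get)

for lam, score in cv_scores.items():
    print(f"lambda={lam:8.2f}  CV MSE={score:.4f}")
print("selected lambda:", best_lambda)
```

With real data, any scaler should be fit inside each training fold so that the validation fold does not leak into the preprocessing.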
