# ML Practice Questions Part 4: Linear Models and Regularization

This notebook covers fundamental linear models, regularization techniques, and their theoretical foundations. Each question includes detailed explanations, mathematical derivations, and practical implementations.

**Topics Covered:**
- Linear regression variants and assumptions
- Ridge, Lasso, and Elastic Net regularization
- Logistic regression and generalized linear models
- Feature selection and regularization paths
- Coordinate descent and optimization techniques

**Format:** Each question includes theory, implementation, and analysis sections.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression, make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, log_loss
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8')
np.random.seed(42)

## Question 1: Linear Regression Assumptions and Diagnostics

**Question:** What are the key assumptions of linear regression, and how would you test for violations? Implement diagnostic tests and demonstrate their use.

### Theory

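Linear regression assumes:
1. **Linearity**: The conditional mean of y is a linear function of the features
2. **Independence**: Errors are independent across observations
3. **Homoscedasticity**: Errors have constant variance
4. **Normality**: Errors are normally distributed (needed for exact t/F inference, not for unbiased coefficient estimates)
5. **No perfect multicollinearity**: No feature is an exact linear combination of the others; near-collinearity inflates coefficient variance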

**Mathematical Model:**
$$y = X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 I)$$

**Diagnostic Tests:**
- Residual plots for linearity and homoscedasticity
- Q-Q plots for normality
- VIF for multicollinearity
- Durbin-Watson for independence
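
The Durbin-Watson check listed above can be computed directly from its definition, $DW = \sum_{t=2}^{n}(e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$. The following is a minimal sketch, not part of the original notebook: the helper name `durbin_watson_stat` and the simulated AR(1) residuals are illustrative assumptions. Values near 2 suggest no first-order autocorrelation, while values well below 2 suggest positive autocorrelation.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Durbin-Watson statistic; assumes residuals are in observation order."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)

# Independent residuals: the statistic should be close to 2
white_noise = rng.normal(size=2000)

# AR(1) residuals with coefficient 0.8 (hypothetical example): well below 2
ar1 = np.zeros(2000)
for t in range(1, len(ar1)):
    ar1[t] = 0.8 * ar1[t - 1] + rng.normal()

dw_white = durbin_watson_stat(white_noise)
dw_ar1 = durbin_watson_stat(ar1)
print(f"DW (independent residuals): {dw_white:.2f}")
print(f"DW (AR(1) residuals):       {dw_ar1:.2f}")
```

The statistic is only meaningful when the rows have a natural ordering (for example, time); for shuffled cross-sectional data it says little about independence.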

In [None]:
def linear_regression_diagnostics(X, y, model=None):
    """
    Comprehensive linear regression diagnostic tests.
    
    Args:
        X: Feature matrix
        y: Target vector
        model: Fitted sklearn model (optional)
    
    Returns:
        Dictionary of diagnostic results
    """
    if model is None:
        model = LinearRegression().fit(X, y)
    
    y_pred = model.predict(X)
    residuals = y - y_pred
    
    diagnostics = {}
    
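    # 1. Linearity (heuristic): OLS residuals are uncorrelated with the fitted
    #    values by construction, so look for leftover curvature instead by
    #    correlating residuals with the squared, centered fitted values
    curvature = (y_pred - y_pred.mean()) ** 2
    linearity_corr = np.corrcoef(curvature, residuals)[0, 1]
    diagnostics['linearity_violation'] = abs(linearity_corr) > 0.1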
    
    # 2. Homoscedasticity (Breusch-Pagan test)
    from scipy.stats import chi2
    aux_reg = LinearRegression().fit(X, residuals**2)
    bp_statistic = len(X) * aux_reg.score(X, residuals**2)
    bp_pvalue = 1 - chi2.cdf(bp_statistic, X.shape[1])
    diagnostics['homoscedasticity_violation'] = bp_pvalue < 0.05
    
    # 3. Normality (Shapiro-Wilk test; p-values are unreliable above ~5000 samples)
    if len(residuals) <= 5000:
        _, normality_pvalue = stats.shapiro(residuals)
        diagnostics['normality_violation'] = normality_pvalue < 0.05
    else:
        diagnostics['normality_violation'] = None  # test skipped
    
    # 4. Multicollinearity (VIF)
    vif_scores = []
    for i in range(X.shape[1]):
        X_temp = np.delete(X, i, axis=1)
        r_squared = LinearRegression().fit(X_temp, X[:, i]).score(X_temp, X[:, i])
        # Guard only against exact collinearity so large finite VIFs are still reported
        vif = 1 / (1 - r_squared) if r_squared < 1 else float('inf')
        vif_scores.append(vif)
    
    diagnostics['multicollinearity_violation'] = any(vif > 10 for vif in vif_scores)
    diagnostics['max_vif'] = max(vif_scores)
    
    return diagnostics, residuals, y_pred

# Generate test data with known violations
n_samples, n_features = 500, 5
X, y = make_regression(n_samples=n_samples, n_features=n_features, noise=10, random_state=42)

# Add heteroscedasticity
y = y + np.random.normal(0, abs(y) * 0.1)

# Add multicollinearity
X[:, -1] = X[:, 0] + np.random.normal(0, 0.1, n_samples)

# Run diagnostics
diagnostics, residuals, y_pred = linear_regression_diagnostics(X, y)

print("Linear Regression Diagnostic Results:")
print(f"Linearity violation: {diagnostics['linearity_violation']}")
print(f"Homoscedasticity violation: {diagnostics['homoscedasticity_violation']}")
print(f"Normality violation: {diagnostics['normality_violation']}")
print(f"Multicollinearity violation: {diagnostics['multicollinearity_violation']}")
print(f"Maximum VIF: {diagnostics['max_vif']:.2f}")

In [None]:
# Visualization of diagnostic plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Residuals vs Fitted
axes[0, 0].scatter(y_pred, residuals, alpha=0.6)
axes[0, 0].axhline(y=0, color='red', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted Values')

# Q-Q plot
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot (Normality Check)')

# Scale-Location plot
standardized_residuals = residuals / np.std(residuals)
axes[1, 0].scatter(y_pred, np.sqrt(np.abs(standardized_residuals)), alpha=0.6)
axes[1, 0].set_xlabel('Fitted Values')
axes[1, 0].set_ylabel('√|Standardized Residuals|')
axes[1, 0].set_title('Scale-Location Plot')

# Residual histogram
axes[1, 1].hist(residuals, bins=30, density=True, alpha=0.7)
x_norm = np.linspace(residuals.min(), residuals.max(), 100)
axes[1, 1].plot(x_norm, stats.norm.pdf(x_norm, 0, np.std(residuals)), 'r-', label='Normal')
axes[1, 1].set_xlabel('Residuals')
axes[1, 1].set_ylabel('Density')
axes[1, 1].set_title('Residual Distribution')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

## Question 2: Ridge Regression Theory and Implementation

**Question:** Derive the Ridge regression solution and explain how the regularization parameter affects the bias-variance tradeoff. Implement it from scratch and compare with sklearn.

### Theory

**Ridge Regression Objective:**
$$\min_{\beta} \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2$$

**Closed-form Solution:**
$$\hat{\beta}_{ridge} = (X^TX + \lambda I)^{-1}X^Ty$$

**Bias-Variance Analysis:**
- **Bias**: Increases with λ (shrinks coefficients toward zero)
- **Variance**: Decreases with λ (stabilizes estimates)
- **Total Error**: U-shaped curve in λ
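
For a fixed design matrix, both quantities follow in closed form from the solution above: the bias is $-\lambda(X^TX + \lambda I)^{-1}\beta$ and the covariance is $\sigma^2 (X^TX + \lambda I)^{-1} X^TX (X^TX + \lambda I)^{-1}$. The sketch below evaluates them on a small synthetic problem. The sizes, noise level, and true coefficients are hypothetical values chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 5, 2.0          # hypothetical problem size and noise level
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)    # hypothetical true coefficients
G = X.T @ X

def ridge_bias_variance(lam):
    """Squared bias and total variance of the ridge estimator for fixed X."""
    A_inv = np.linalg.inv(G + lam * np.eye(p))
    bias = -lam * A_inv @ beta_true            # E[beta_hat] - beta_true
    cov = sigma**2 * A_inv @ G @ A_inv         # Cov(beta_hat)
    return bias @ bias, np.trace(cov)

lams = np.array([0.0, 0.1, 1.0, 10.0, 100.0])
bias_sq, variance = np.array([ridge_bias_variance(lam) for lam in lams]).T

for lam, b, v in zip(lams, bias_sq, variance):
    print(f"lambda={lam:7.1f}  bias^2={b:.4f}  variance={v:.4f}  sum={b + v:.4f}")
```

Squared bias rises and variance falls as λ grows; where their sum is smallest depends on the signal strength and the noise level.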

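**Key Properties:**
- Has a unique solution for any λ > 0, even when X^TX is singular
- Shrinks coefficients toward zero without setting them exactly to zero; shrinkage is strongest along low-variance directions of X
- Equivalent to the MAP estimate under a zero-mean Gaussian prior on the coefficients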
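
One way to inspect the shrinkage is through the SVD $X = UDV^T$, under which the closed form becomes $\hat{\beta}_{ridge} = V\,\text{diag}\!\left(\frac{d_i}{d_i^2 + \lambda}\right)U^Ty$. The sketch below checks this against the direct solve on synthetic data; the column scales are hypothetical, chosen to give clearly different singular values. The factors $d_i^2/(d_i^2 + \lambda)$ show how much each principal direction is shrunk relative to least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
# Columns on different scales (hypothetical) to spread out the singular values
X = rng.normal(size=(40, 4)) * np.array([5.0, 2.0, 1.0, 0.2])
y = rng.normal(size=40)
lam = 1.0

# Closed form: (X^T X + lam I)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# SVD form: V diag(d / (d^2 + lam)) U^T y
U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((d / (d**2 + lam)) * (U.T @ y))

# Shrinkage factor applied along each principal direction of X
shrinkage = d**2 / (d**2 + lam)
print("max |closed - svd|:", np.max(np.abs(beta_closed - beta_svd)))
print("shrinkage factors :", np.round(shrinkage, 3))
```

Directions with small singular values get factors close to 0, and directions with large singular values are left almost untouched.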

In [None]:
class RidgeRegressionCustom:
    """
    Ridge Regression implementation from scratch.
    """
    
    def __init__(self, alpha=1.0, fit_intercept=True):
        self.alpha = alpha
        self.fit_intercept = fit_intercept
        self.coef_ = None
        self.intercept_ = None
    
    def fit(self, X, y):
        """Fit Ridge regression model."""
        X = np.array(X)
        y = np.array(y)
        
        if self.fit_intercept:
            # Center the data
            self.X_mean_ = np.mean(X, axis=0)
            self.y_mean_ = np.mean(y)
            X_centered = X - self.X_mean_
            y_centered = y - self.y_mean_
        else:
            X_centered = X
            y_centered = y
            self.y_mean_ = 0
        
        # Ridge solution: (X^T X + λI)^(-1) X^T y
        n_features = X_centered.shape[1]
        A = X_centered.T @ X_centered + self.alpha * np.eye(n_features)
        b = X_centered.T @ y_centered
        
        self.coef_ = np.linalg.solve(A, b)
        
        if self.fit_intercept:
            self.intercept_ = self.y_mean_ - np.dot(self.X_mean_, self.coef_)
        else:
            self.intercept_ = 0
        
        return self
    
    def predict(self, X):
        """Make predictions."""
        return X @ self.coef_ + self.intercept_
    
    def get_regularization_path(self, X, y, alphas):
        """Compute coefficient path for different alpha values."""
        coef_path = []
        
        for alpha in alphas:
            self.alpha = alpha
            self.fit(X, y)
            coef_path.append(self.coef_.copy())
        
        return np.array(coef_path)

# Generate data for demonstration
X, y = make_regression(n_samples=100, n_features=10, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare custom implementation with sklearn
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
results = []

for alpha in alphas:
    # Custom implementation
    ridge_custom = RidgeRegressionCustom(alpha=alpha)
    ridge_custom.fit(X_train_scaled, y_train)
    y_pred_custom = ridge_custom.predict(X_test_scaled)
    mse_custom = mean_squared_error(y_test, y_pred_custom)
    
    # Sklearn implementation
    ridge_sklearn = Ridge(alpha=alpha)
    ridge_sklearn.fit(X_train_scaled, y_train)
    y_pred_sklearn = ridge_sklearn.predict(X_test_scaled)
    mse_sklearn = mean_squared_error(y_test, y_pred_sklearn)
    
    results.append({
        'alpha': alpha,
        'mse_custom': mse_custom,
        'mse_sklearn': mse_sklearn,
        'coef_diff': np.mean(np.abs(ridge_custom.coef_ - ridge_sklearn.coef_))
    })

results_df = pd.DataFrame(results)
print("Comparison between custom and sklearn Ridge implementations:")
print(results_df.round(6))

In [None]:
# Visualize regularization path
alphas_path = np.logspace(-3, 2, 50)
ridge_custom = RidgeRegressionCustom()
coef_path = ridge_custom.get_regularization_path(X_train_scaled, y_train, alphas_path)

plt.figure(figsize=(12, 5))

# Coefficient paths
plt.subplot(1, 2, 1)
for i in range(coef_path.shape[1]):
    plt.plot(alphas_path, coef_path[:, i], label=f'Feature {i+1}')
plt.xscale('log')
plt.xlabel('Alpha (λ)')
plt.ylabel('Coefficient Value')
plt.title('Ridge Regularization Path')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)

# Bias-variance tradeoff simulation
plt.subplot(1, 2, 2)
test_errors = []
train_errors = []

for alpha in alphas_path:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    
    train_pred = ridge.predict(X_train_scaled)
    test_pred = ridge.predict(X_test_scaled)
    
    train_errors.append(mean_squared_error(y_train, train_pred))
    test_errors.append(mean_squared_error(y_test, test_pred))

plt.plot(alphas_path, train_errors, label='Training Error', alpha=0.7)
plt.plot(alphas_path, test_errors, label='Test Error', alpha=0.7)
plt.xscale('log')
plt.xlabel('Alpha (λ)')
plt.ylabel('Mean Squared Error')
plt.title('Bias-Variance Tradeoff')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

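# Find the alpha with the lowest test error (illustration only; in practice
# tune alpha with cross-validation so the test set stays untouched)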
optimal_idx = np.argmin(test_errors)
optimal_alpha = alphas_path[optimal_idx]
print(f"Optimal alpha: {optimal_alpha:.4f}")
print(f"Minimum test error: {test_errors[optimal_idx]:.4f}")

## Question 3: Lasso Regression and Feature Selection

**Question:** Explain why Lasso performs automatic feature selection while Ridge doesn't. Implement coordinate descent for Lasso and demonstrate the sparsity-inducing effect.

### Theory

**Lasso Objective:**
$$\min_{\beta} \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$

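**Why Lasso Induces Sparsity:**
- The L1 constraint region is a diamond with corners on the coordinate axes
- Contours of the quadratic loss tend to first touch that region at a corner, where some coefficients are exactly zero
- The subdifferential of |β_j| at zero is the interval [-1, 1], so β_j = 0 satisfies the optimality condition whenever feature j's scaled correlation with the partial residual is at most λ in magnitude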

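**Coordinate Descent Algorithm:**
For each coordinate j, assuming features are scaled so that $\frac{1}{n}X_j^TX_j = 1$ (otherwise divide the update by $\frac{1}{n}X_j^TX_j$):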
$$\beta_j^{(new)} = S\left(\frac{1}{n}X_j^T(y - X_{-j}\beta_{-j}), \lambda\right)$$

Where S(z,λ) is the soft-thresholding operator:
$$S(z, \lambda) = \text{sign}(z)(|z| - \lambda)_+$$
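
A small numeric illustration of the operator may help. The sketch below applies $S(z, \lambda)$ to a few hypothetical values of $z$ and, for contrast, the one-dimensional ridge update $z/(1+\lambda)$, which solves $\min_\beta \frac{1}{2}(z - \beta)^2 + \frac{\lambda}{2}\beta^2$. The input values are made up for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])   # hypothetical unpenalized coordinate solutions
lam = 1.0

lasso_update = soft_threshold(z, lam)        # exactly zero whenever |z| <= lam
ridge_update = z / (1.0 + lam)               # rescaled, zero only when z == 0

print("z     :", z)
print("lasso :", lasso_update)
print("ridge :", ridge_update)
```

Soft-thresholding maps every input with $|z| \le \lambda$ to exactly zero and moves the rest toward zero by $\lambda$, whereas the ridge update only rescales.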

In [None]:
class LassoRegressionCustom:
    """
    Lasso Regression using coordinate descent.
    """
    
    def __init__(self, alpha=1.0, max_iter=1000, tol=1e-4, fit_intercept=True):
        self.alpha = alpha
        self.max_iter = max_iter
        self.tol = tol
        self.fit_intercept = fit_intercept
        self.coef_ = None
        self.intercept_ = None
        self.n_iter_ = None
    
    def _soft_threshold(self, z, lambda_val):
        """Soft thresholding operator."""
        return np.sign(z) * np.maximum(np.abs(z) - lambda_val, 0)
    
    def fit(self, X, y):
        """Fit Lasso regression using coordinate descent."""
        X = np.array(X, dtype=float)
        y = np.array(y, dtype=float)
        n_samples, n_features = X.shape
        
        # Center data if fitting intercept
        if self.fit_intercept:
            self.X_mean_ = np.mean(X, axis=0)
            self.y_mean_ = np.mean(y)
            X = X - self.X_mean_
            y = y - self.y_mean_
        
        # Per-feature curvature (1/n) * ||X_j||^2, the denominator of each update
        col_scale = np.sum(X**2, axis=0) / n_samples
        
        # Initialize coefficients
        beta = np.zeros(n_features)
        
        # Coordinate descent on (1/2n)||y - X beta||^2 + alpha * ||beta||_1
        for iteration in range(self.max_iter):
            beta_old = beta.copy()
            
            for j in range(n_features):
                if col_scale[j] == 0:
                    continue  # constant feature: coefficient stays at zero
                
                # Partial residual that excludes feature j's contribution
                r_j = y - X @ beta + beta[j] * X[:, j]
                
                # Soft-threshold the scaled correlation, then divide by the curvature
                z_j = np.dot(X[:, j], r_j) / n_samples
                beta[j] = self._soft_threshold(z_j, self.alpha) / col_scale[j]
            
            # Check convergence
            if np.max(np.abs(beta - beta_old)) < self.tol:
                self.n_iter_ = iteration + 1
                break
        else:
            self.n_iter_ = self.max_iter
        
        # Features were not rescaled, so beta is already in the input units
        self.coef_ = beta
        
        # Compute intercept
        if self.fit_intercept:
            self.intercept_ = self.y_mean_ - np.dot(self.X_mean_, self.coef_)
        else:
            self.intercept_ = 0
        
        return self
    
    def predict(self, X):
        """Make predictions."""
        return X @ self.coef_ + self.intercept_
    
    def get_regularization_path(self, X, y, alphas):
        """Compute Lasso path for different alpha values."""
        coef_path = []
        sparsity_path = []
        
        for alpha in alphas:
            self.alpha = alpha
            self.fit(X, y)
            coef_path.append(self.coef_.copy())
            sparsity_path.append(np.sum(np.abs(self.coef_) > 1e-6))
        
        return np.array(coef_path), np.array(sparsity_path)

# Generate data with some irrelevant features
n_samples, n_features = 100, 20
n_informative = 5

X, y = make_regression(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=n_informative,
    noise=10,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare implementations
alpha = 1.0

# Custom Lasso
lasso_custom = LassoRegressionCustom(alpha=alpha)
lasso_custom.fit(X_train_scaled, y_train)
y_pred_custom = lasso_custom.predict(X_test_scaled)

# Sklearn Lasso
lasso_sklearn = Lasso(alpha=alpha, max_iter=1000)
lasso_sklearn.fit(X_train_scaled, y_train)
y_pred_sklearn = lasso_sklearn.predict(X_test_scaled)

print(f"Custom Lasso - Test MSE: {mean_squared_error(y_test, y_pred_custom):.4f}")
print(f"Sklearn Lasso - Test MSE: {mean_squared_error(y_test, y_pred_sklearn):.4f}")
print(f"Custom Lasso - Non-zero coefficients: {np.sum(np.abs(lasso_custom.coef_) > 1e-6)}")
print(f"Sklearn Lasso - Non-zero coefficients: {np.sum(np.abs(lasso_sklearn.coef_) > 1e-6)}")
print(f"Custom Lasso - Iterations: {lasso_custom.n_iter_}")

In [None]:
# Visualize Lasso path and feature selection
alphas_path = np.logspace(-3, 1, 50)
lasso_custom = LassoRegressionCustom()
coef_path, sparsity_path = lasso_custom.get_regularization_path(X_train_scaled, y_train, alphas_path)

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Lasso path
axes[0, 0].plot(alphas_path, coef_path)
axes[0, 0].set_xscale('log')
axes[0, 0].set_xlabel('Alpha (λ)')
axes[0, 0].set_ylabel('Coefficient Value')
axes[0, 0].set_title('Lasso Regularization Path')
axes[0, 0].grid(True, alpha=0.3)

# Sparsity vs Alpha
axes[0, 1].plot(alphas_path, sparsity_path, 'o-')
axes[0, 1].set_xscale('log')
axes[0, 1].set_xlabel('Alpha (λ)')
axes[0, 1].set_ylabel('Number of Non-zero Coefficients')
axes[0, 1].set_title('Feature Selection Effect')
axes[0, 1].grid(True, alpha=0.3)

# Coefficient comparison at specific alpha
test_alpha = 0.1
lasso_test = LassoRegressionCustom(alpha=test_alpha)
lasso_test.fit(X_train_scaled, y_train)

feature_indices = np.arange(len(lasso_test.coef_))
axes[1, 0].bar(feature_indices, lasso_test.coef_, alpha=0.7)
axes[1, 0].set_xlabel('Feature Index')
axes[1, 0].set_ylabel('Coefficient Value')
axes[1, 0].set_title(f'Lasso Coefficients (α={test_alpha})')
axes[1, 0].grid(True, alpha=0.3)

# L1 vs L2 penalty visualization (2D case)
theta = np.linspace(0, 2*np.pi, 1000)

# L1 ball |β₁| + |β₂| <= 1 is a diamond with vertices on the axes
l1_x = np.array([1, 0, -1, 0, 1])
l1_y = np.array([0, 1, 0, -1, 0])

# L2 ball β₁² + β₂² <= 1 is bounded by the unit circle
l2_x = np.cos(theta)
l2_y = np.sin(theta)

axes[1, 1].fill(l1_x, l1_y, alpha=0.3, label='L1 (Lasso)', color='red')
axes[1, 1].plot(l2_x, l2_y, label='L2 (Ridge)', color='blue', linewidth=2)
axes[1, 1].set_xlim(-1.5, 1.5)
axes[1, 1].set_ylim(-1.5, 1.5)
axes[1, 1].set_xlabel('β₁')
axes[1, 1].set_ylabel('β₂')
axes[1, 1].set_title('L1 vs L2 Penalty Regions')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].set_aspect('equal')

plt.tight_layout()
plt.show()

## Question 4: Elastic Net and Combined Regularization

**Question:** When would you use Elastic Net over Ridge or Lasso? Derive the optimization conditions and implement the algorithm.

### Theory

**Elastic Net Objective:**
$$\min_{\beta} \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\left[\alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2\right]$$

**When to Use Elastic Net:**
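1. **Grouped selection**: Among highly correlated features, Lasso tends to keep one and drop the rest somewhat arbitrarily; Elastic Net tends to keep or drop them together
2. **n << p**: With fewer samples than features, Lasso selects at most n features; Elastic Net has no such cap
3. **Stability**: The L2 term stabilizes variable selection when features are correlated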

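**Coordinate Descent Update** (features scaled so that $\frac{1}{n}X_j^TX_j = 1$):
$$\beta_j^{(new)} = \frac{S(z_j, \lambda\alpha)}{1 + \lambda(1-\alpha)}$$

Where $z_j = \frac{1}{n}X_j^T(y - X_{-j}\beta_{-j})$. For general scaling, the 1 in the denominator becomes $\frac{1}{n}X_j^TX_j$. In the implementation below, λ corresponds to `alpha` and α to `l1_ratio`.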
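
To make the grouped-selection point concrete, the sketch below runs the coordinate update above on an extreme case: two identical standardized features. It is an illustration under stated assumptions, not the notebook's implementation. The helper `enet_coordinate_descent`, the data, and the values of `lam` and `alpha` (the theory's λ and α) are all hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def enet_coordinate_descent(X, y, lam, alpha, n_sweeps=500):
    """Cyclic coordinate descent for the Elastic Net objective stated above.

    Illustrative helper; assumes centered y and columns of X scaled so that
    (1/n) * ||X_j||^2 = 1.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            z_j = X[:, j] @ partial_residual / n
            beta[j] = soft_threshold(z_j, lam * alpha) / (1.0 + lam * (1.0 - alpha))
    return beta

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
x = (x - x.mean()) / x.std()            # now (1/n) * ||x||^2 = 1
X = np.column_stack([x, x])             # two identical features
y = 3.0 * x + rng.normal(scale=0.5, size=n)
y = y - y.mean()

beta_lasso = enet_coordinate_descent(X, y, lam=0.5, alpha=1.0)
beta_enet = enet_coordinate_descent(X, y, lam=0.5, alpha=0.5)
print("Lasso (alpha=1.0)      :", np.round(beta_lasso, 3))
print("Elastic Net (alpha=0.5):", np.round(beta_enet, 3))
```

With the pure L1 penalty, the cyclic updates put all the weight on whichever duplicate is visited first. Once an L2 term is present the objective is strictly convex, so the two identical features receive equal coefficients.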

In [None]:
class ElasticNetCustom:
    """
    Elastic Net regression using coordinate descent.
    """
    
    def __init__(self, alpha=1.0, l1_ratio=0.5, max_iter=1000, tol=1e-4, fit_intercept=True):
        self.alpha = alpha  # Overall regularization strength
        self.l1_ratio = l1_ratio  # Mix between L1 and L2 (0=Ridge, 1=Lasso)
        self.max_iter = max_iter
        self.tol = tol
        self.fit_intercept = fit_intercept
        self.coef_ = None
        self.intercept_ = None
        self.n_iter_ = None
    
    def _soft_threshold(self, z, lambda_val):
        """Soft thresholding operator."""
        return np.sign(z) * np.maximum(np.abs(z) - lambda_val, 0)
    
    def fit(self, X, y):
        """Fit Elastic Net using coordinate descent."""
        X = np.array(X, dtype=float)
        y = np.array(y, dtype=float)
        n_samples, n_features = X.shape
        
        # Center data
        if self.fit_intercept:
            self.X_mean_ = np.mean(X, axis=0)
            self.y_mean_ = np.mean(y)
            X = X - self.X_mean_
            y = y - self.y_mean_
        
        # Per-feature curvature (1/n) * ||X_j||^2
        col_scale = np.sum(X**2, axis=0) / n_samples
        
        # Regularization parameters
        lambda_l1 = self.alpha * self.l1_ratio
        lambda_l2 = self.alpha * (1 - self.l1_ratio)
        
        # Initialize coefficients
        beta = np.zeros(n_features)
        
        # Coordinate descent
        for iteration in range(self.max_iter):
            beta_old = beta.copy()
            
            for j in range(n_features):
                # Partial residual that excludes feature j's contribution
                r_j = y - X @ beta + beta[j] * X[:, j]
                
                # Scaled correlation of feature j with the partial residual
                z_j = np.dot(X[:, j], r_j) / n_samples
                
                # Elastic Net update: soft threshold, then divide by curvature plus L2 strength
                denom = col_scale[j] + lambda_l2
                beta[j] = self._soft_threshold(z_j, lambda_l1) / denom if denom > 0 else 0.0
            
            # Check convergence
            if np.max(np.abs(beta - beta_old)) < self.tol:
                self.n_iter_ = iteration + 1
                break
        else:
            self.n_iter_ = self.max_iter
        
        # Features were not rescaled, so beta is already in the input units
        self.coef_ = beta
        
        # Compute intercept
        if self.fit_intercept:
            self.intercept_ = self.y_mean_ - np.dot(self.X_mean_, self.coef_)
        else:
            self.intercept_ = 0
        
        return self
    
    def predict(self, X):
        """Make predictions."""
        return X @ self.coef_ + self.intercept_

# Create dataset with correlated features to demonstrate Elastic Net benefits
np.random.seed(42)
n_samples, n_features = 100, 50

# Generate correlated features
X_base = np.random.randn(n_samples, 10)
X_corr = []

for i in range(5):
    # Create groups of correlated features
    base_feature = X_base[:, i]
    for j in range(3):
        noise = np.random.randn(n_samples) * 0.1
        corr_feature = base_feature + noise
        X_corr.append(corr_feature)

# Add some random features
for i in range(35):
    X_corr.append(np.random.randn(n_samples))

X = np.column_stack(X_corr)

# Generate target with only first 15 features being relevant
true_coef = np.zeros(n_features)
true_coef[:15] = np.random.randn(15) * 2
y = X @ true_coef + np.random.randn(n_samples) * 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

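# Compare Ridge, Lasso, and Elastic Net
# Note: sklearn's Ridge penalizes the unnormalized squared error, while Lasso and
# ElasticNet divide it by 2n, so the same numeric alpha is a far weaker penalty for Ridge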
alpha = 0.1
models = {
    'Ridge': Ridge(alpha=alpha),
    'Lasso': Lasso(alpha=alpha),
    'Elastic Net (0.5)': ElasticNet(alpha=alpha, l1_ratio=0.5),
    'Custom Elastic Net': ElasticNetCustom(alpha=alpha, l1_ratio=0.5)
}

results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    n_nonzero = np.sum(np.abs(model.coef_) > 1e-6)
    
    results[name] = {
        'MSE': mse,
        'R²': r2,
        'Non-zero coef': n_nonzero
    }

results_df = pd.DataFrame(results).T
print("Model Comparison on Correlated Features:")
print(results_df.round(4))

In [None]:
# Visualize effect of l1_ratio in Elastic Net
l1_ratios = np.linspace(0, 1, 11)
alpha = 0.1

elastic_results = []
for l1_ratio in l1_ratios:
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    mse = mean_squared_error(y_test, y_pred)
    n_nonzero = np.sum(np.abs(model.coef_) > 1e-6)
    
    elastic_results.append({
        'l1_ratio': l1_ratio,
        'mse': mse,
        'n_nonzero': n_nonzero,
        'coefficients': model.coef_.copy()
    })

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# MSE vs l1_ratio
mse_values = [r['mse'] for r in elastic_results]
axes[0, 0].plot(l1_ratios, mse_values, 'o-')
axes[0, 0].set_xlabel('L1 Ratio')
axes[0, 0].set_ylabel('Test MSE')
axes[0, 0].set_title('MSE vs L1 Ratio')
axes[0, 0].grid(True, alpha=0.3)

# Sparsity vs l1_ratio
sparsity_values = [r['n_nonzero'] for r in elastic_results]
axes[0, 1].plot(l1_ratios, sparsity_values, 'o-', color='orange')
axes[0, 1].set_xlabel('L1 Ratio')
axes[0, 1].set_ylabel('Number of Non-zero Coefficients')
axes[0, 1].set_title('Sparsity vs L1 Ratio')
axes[0, 1].grid(True, alpha=0.3)

# Coefficient paths for different l1_ratios
coef_matrix = np.array([r['coefficients'] for r in elastic_results])
for i in range(min(10, coef_matrix.shape[1])):
    axes[1, 0].plot(l1_ratios, coef_matrix[:, i], alpha=0.7, label=f'Feature {i+1}')
axes[1, 0].set_xlabel('L1 Ratio')
axes[1, 0].set_ylabel('Coefficient Value')
axes[1, 0].set_title('Coefficient Paths vs L1 Ratio')
axes[1, 0].grid(True, alpha=0.3)

# Feature correlation heatmap (first 15 features)
corr_matrix = np.corrcoef(X_train_scaled[:, :15].T)
im = axes[1, 1].imshow(corr_matrix, cmap='coolwarm', vmin=-1, vmax=1)
axes[1, 1].set_title('Feature Correlation Matrix')
axes[1, 1].set_xlabel('Feature Index')
axes[1, 1].set_ylabel('Feature Index')
plt.colorbar(im, ax=axes[1, 1])

plt.tight_layout()
plt.show()

# Find optimal l1_ratio
optimal_idx = np.argmin(mse_values)
optimal_l1_ratio = l1_ratios[optimal_idx]
print(f"Optimal L1 ratio: {optimal_l1_ratio:.2f}")
print(f"Minimum test MSE: {mse_values[optimal_idx]:.4f}")
print(f"Non-zero coefficients at optimum: {sparsity_values[optimal_idx]}")

## Question 5: Logistic Regression and Maximum Likelihood

**Question:** Derive the logistic regression cost function from maximum likelihood estimation. Implement gradient descent and compare with Newton's method.

### Theory

**Logistic Model:**
$$P(y=1|x) = \sigma(x^T\beta) = \frac{1}{1 + e^{-x^T\beta}}$$

**Likelihood Function:**
$$L(\beta) = \prod_{i=1}^n \sigma(x_i^T\beta)^{y_i}(1-\sigma(x_i^T\beta))^{1-y_i}$$

**Log-Likelihood:**
$$\ell(\beta) = \sum_{i=1}^n [y_i \log \sigma(x_i^T\beta) + (1-y_i) \log(1-\sigma(x_i^T\beta))]$$

**Cost Function (Negative Log-Likelihood):**
$$J(\beta) = -\frac{1}{n}\sum_{i=1}^n [y_i \log \sigma(x_i^T\beta) + (1-y_i) \log(1-\sigma(x_i^T\beta))]$$

**Gradient:**
$$\nabla J(\beta) = \frac{1}{n}X^T(\sigma(X\beta) - y)$$

**Hessian:**
$$H = \frac{1}{n}X^T W X, \quad W = \text{diag}(\sigma(X\beta)(1-\sigma(X\beta)))$$
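
A quick way to sanity-check the gradient and Hessian formulas above is to compare the analytic gradient with central finite differences of $J$ and to confirm that the Hessian is positive definite. The sketch below does this on a small synthetic dataset. The data, the coefficient values, and the helper names are hypothetical and serve only as a check of the formulas.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(beta, X, y):
    p = sigmoid(X @ beta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(beta, X, y):
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

def hessian(beta, X):
    p = sigmoid(X @ beta)
    return X.T @ ((p * (1 - p))[:, None] * X) / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # hypothetical small dataset
y = (rng.uniform(size=200) < sigmoid(X @ np.array([1.0, -2.0, 0.5]))).astype(float)
beta = rng.normal(size=3) * 0.1

# Central finite differences of J as an independent check on the analytic gradient
eps = 1e-6
numeric_grad = np.array([
    (cost(beta + eps * e, X, y) - cost(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])

max_grad_error = np.max(np.abs(numeric_grad - gradient(beta, X, y)))
min_eigenvalue = np.linalg.eigvalsh(hessian(beta, X)).min()
print(f"max gradient error: {max_grad_error:.2e}")
print(f"smallest Hessian eigenvalue: {min_eigenvalue:.4f}")
```

A positive smallest eigenvalue confirms that $J$ is strictly convex at this point. Strict convexity is what makes the Newton step $\beta - H^{-1}\nabla J$ well defined.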

In [None]:
class LogisticRegressionCustom:
    """
    Logistic Regression with multiple optimization methods.
    """
    
    def __init__(self, method='gradient_descent', learning_rate=0.01, max_iter=1000, tol=1e-6, fit_intercept=True):
        self.method = method
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.tol = tol
        self.fit_intercept = fit_intercept
        self.coef_ = None
        self.intercept_ = None
        self.cost_history_ = []
        self.n_iter_ = None
    
    def _sigmoid(self, z):
        """Stable sigmoid function."""
        # Clip to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    def _cost_function(self, y_true, y_pred_proba):
        """Binary cross-entropy cost function."""
        # Add small epsilon to prevent log(0)
        epsilon = 1e-15
        y_pred_proba = np.clip(y_pred_proba, epsilon, 1 - epsilon)
        
        cost = -np.mean(y_true * np.log(y_pred_proba) + (1 - y_true) * np.log(1 - y_pred_proba))
        return cost
    
    def _gradient(self, X, y, beta):
        """Compute gradient of cost function."""
        z = X @ beta
        predictions = self._sigmoid(z)
        gradient = X.T @ (predictions - y) / len(y)
        return gradient
    
    def _hessian(self, X, beta):
        """Compute Hessian matrix."""
        z = X @ beta
        predictions = self._sigmoid(z)
        W = predictions * (1 - predictions)
        # Add small regularization to ensure positive definiteness
        W = W + 1e-8
        hessian = X.T @ (W[:, np.newaxis] * X) / len(X)
        return hessian
    
    def fit(self, X, y):
        """Fit logistic regression model."""
        X = np.array(X)
        y = np.array(y)
        
        # Add intercept term
        if self.fit_intercept:
            X = np.column_stack([np.ones(len(X)), X])
        
        # Initialize parameters
        beta = np.zeros(X.shape[1])
        self.cost_history_ = []
        
        # Optimization
        for iteration in range(self.max_iter):
            # Compute predictions and cost
            z = X @ beta
            predictions = self._sigmoid(z)
            cost = self._cost_function(y, predictions)
            self.cost_history_.append(cost)
            
            # Compute gradient
            gradient = self._gradient(X, y, beta)
            
            if self.method == 'gradient_descent':
                # Gradient descent update
                beta_new = beta - self.learning_rate * gradient
                
            elif self.method == 'newton':
                # Newton's method update: solve H @ step = gradient
                hessian = self._hessian(X, beta)
                try:
                    beta_new = beta - np.linalg.solve(hessian, gradient)
                except np.linalg.LinAlgError:
                    # Fall back to gradient descent if Hessian is singular
                    beta_new = beta - self.learning_rate * gradient
            
            else:
                raise ValueError(f"Unknown method: {self.method!r}")
            
            # Check convergence on the step size, keeping the latest iterate
            step_norm = np.linalg.norm(beta_new - beta)
            beta = beta_new
            if step_norm < self.tol:
                self.n_iter_ = iteration + 1
                break
        else:
            self.n_iter_ = self.max_iter
        
        # Store coefficients
        if self.fit_intercept:
            self.intercept_ = beta[0]
            self.coef_ = beta[1:]
        else:
            self.intercept_ = 0
            self.coef_ = beta
        
        return self
    
    def predict_proba(self, X):
        """Predict class probabilities."""
        if self.fit_intercept:
            X = np.column_stack([np.ones(len(X)), X])
            beta = np.concatenate([[self.intercept_], self.coef_])
        else:
            beta = self.coef_
        
        z = X @ beta
        return self._sigmoid(z)
    
    def predict(self, X):
        """Make binary predictions."""
        return (self.predict_proba(X) >= 0.5).astype(int)

# Generate binary classification data
X, y = make_classification(n_samples=1000, n_features=10, n_redundant=0, 
                         n_informative=10, n_clusters_per_class=1, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare optimization methods
methods = {
    'Gradient Descent': LogisticRegressionCustom(method='gradient_descent', learning_rate=0.1),
    'Newton Method': LogisticRegressionCustom(method='newton'),
    'Sklearn': LogisticRegression(max_iter=1000)
}

results = {}
for name, model in methods.items():
    model.fit(X_train_scaled, y_train)
    
    # Custom models return a 1D array of P(y=1); sklearn returns an (n, 2) array
    y_pred_proba = np.asarray(model.predict_proba(X_test_scaled))
    if y_pred_proba.ndim > 1:
        y_pred_proba = y_pred_proba[:, 1]
    
    y_pred = (y_pred_proba >= 0.5).astype(int)
    
    accuracy = accuracy_score(y_test, y_pred)
    logloss = log_loss(y_test, y_pred_proba)
    
    # sklearn stores n_iter_ as an array; the custom class stores an int
    iterations = int(np.ravel(model.n_iter_)[0])
    
    results[name] = {
        'Accuracy': accuracy,
        'Log Loss': logloss,
        'Iterations': iterations
    }

results_df = pd.DataFrame(results).T
print("Optimization Method Comparison:")
print(results_df.round(4))

In [None]:
# Visualize convergence and decision boundary
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Convergence comparison
gd_model = LogisticRegressionCustom(method='gradient_descent', learning_rate=0.1, max_iter=1000)
newton_model = LogisticRegressionCustom(method='newton', max_iter=100)

gd_model.fit(X_train_scaled, y_train)
newton_model.fit(X_train_scaled, y_train)

axes[0, 0].plot(gd_model.cost_history_, label='Gradient Descent', alpha=0.8)
axes[0, 0].plot(newton_model.cost_history_, label='Newton Method', alpha=0.8)
axes[0, 0].set_xlabel('Iteration')
axes[0, 0].set_ylabel('Cost (Log Loss)')
axes[0, 0].set_title('Convergence Comparison')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Learning rate sensitivity
learning_rates = [0.001, 0.01, 0.1, 1.0]
final_costs = []
iterations_to_converge = []

for lr in learning_rates:
    model = LogisticRegressionCustom(method='gradient_descent', learning_rate=lr, max_iter=2000)
    model.fit(X_train_scaled, y_train)
    final_costs.append(model.cost_history_[-1])
    iterations_to_converge.append(model.n_iter_)

axes[0, 1].semilogx(learning_rates, final_costs, 'o-', label='Final Cost')
axes[0, 1].set_xlabel('Learning Rate')
axes[0, 1].set_ylabel('Final Cost')
axes[0, 1].set_title('Learning Rate Sensitivity')
axes[0, 1].grid(True, alpha=0.3)

# Decision boundary (2D visualization)
# Use only first 2 features for visualization
X_2d = X_train_scaled[:, :2]
model_2d = LogisticRegressionCustom(method='newton')
model_2d.fit(X_2d, y_train)

# Create mesh
h = 0.02
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

mesh_points = np.c_[xx.ravel(), yy.ravel()]
Z = model_2d.predict_proba(mesh_points)
Z = Z.reshape(xx.shape)

axes[1, 0].contourf(xx, yy, Z, levels=50, alpha=0.8, cmap='RdYlBu')
scatter = axes[1, 0].scatter(X_2d[:, 0], X_2d[:, 1], c=y_train, cmap='RdYlBu', edgecolors='black')
axes[1, 0].set_xlabel('Feature 1')
axes[1, 0].set_ylabel('Feature 2')
axes[1, 0].set_title('Decision Boundary (2D Projection)')

# Sigmoid function visualization
z = np.linspace(-10, 10, 100)
sigmoid_values = 1 / (1 + np.exp(-z))

axes[1, 1].plot(z, sigmoid_values, 'b-', linewidth=2, label='σ(z) = 1/(1+e⁻ᶻ)')
axes[1, 1].axhline(y=0.5, color='r', linestyle='--', alpha=0.7, label='Decision threshold')
axes[1, 1].axvline(x=0, color='r', linestyle='--', alpha=0.7)
axes[1, 1].set_xlabel('z = xᵀβ')
axes[1, 1].set_ylabel('P(y=1|x)')
axes[1, 1].set_title('Sigmoid Function')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nGradient Descent - Iterations to convergence: {gd_model.n_iter_}")
print(f"Newton Method - Iterations to convergence: {newton_model.n_iter_}")
print(f"Final cost difference: {abs(gd_model.cost_history_[-1] - newton_model.cost_history_[-1]):.6f}")

## Summary and Key Takeaways

### Linear Models and Regularization:

1. **Linear Regression Assumptions**: 
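   - Matter most for valid inference (standard errors, p-values); point predictions are less sensitive to non-normal errors
   - Diagnostic tests help identify violations
   - VIF > 10 is a common rule-of-thumb threshold for problematic multicollinearity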

2. **Ridge Regression**:
   - L2 penalty shrinks coefficients toward zero but never exactly to zero
   - Unique solution for any λ > 0: β = (XᵀX + λI)⁻¹Xᵀy
   - Reduces variance at the cost of increased bias

3. **Lasso Regression**:
   - L1 penalty induces sparsity (feature selection)
   - Coordinate descent with soft thresholding
   - Can select at most n features when n < p

4. **Elastic Net**:
   - Combines L1 and L2 penalties
   - Handles correlated features better than Lasso
   - l1_ratio = 0 → Ridge, l1_ratio = 1 → Lasso

5. **Logistic Regression**:
   - Maximum likelihood estimation
   - Newton's method typically needs far fewer iterations than gradient descent, at a higher cost per iteration
   - Sigmoid function provides probabilistic interpretation

### Practical Guidelines:
- Use Ridge when you want to keep all features
- Use Lasso for automatic feature selection
- Use Elastic Net when features are correlated and you still want a sparse model
- Always check assumptions and use diagnostics
- Standardize features for regularized methods
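
As one way to combine the last two guidelines, the sketch below selects the ridge penalty by K-fold cross-validation, computing the standardization statistics inside each training fold so the validation fold does not leak into the scaling. It is a NumPy-only illustration. The helpers `ridge_fit` and `cv_ridge_alpha`, the synthetic data, and the alpha grid are assumptions, not part of the notebook above.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge on centered data; returns (coef, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def cv_ridge_alpha(X, y, alphas, n_folds=5, seed=0):
    """Mean validation MSE per alpha, standardizing inside each fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    cv_mse = np.zeros(len(alphas))
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        # Scaling statistics come from the training fold only (avoids leakage)
        mu, sd = X[train_idx].mean(axis=0), X[train_idx].std(axis=0)
        X_tr, X_val = (X[train_idx] - mu) / sd, (X[val_idx] - mu) / sd
        for a, alpha in enumerate(alphas):
            coef, intercept = ridge_fit(X_tr, y[train_idx], alpha)
            pred = X_val @ coef + intercept
            cv_mse[a] += np.mean((y[val_idx] - pred) ** 2) / n_folds
    return cv_mse

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 8)) * rng.uniform(0.5, 5.0, size=8)   # hypothetical data
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=120)

alphas = np.logspace(-3, 3, 13)
cv_mse = cv_ridge_alpha(X, y, alphas)
best_alpha = alphas[np.argmin(cv_mse)]
print(f"alpha selected by 5-fold CV: {best_alpha:.4g}")
```

The held-out test set is then used only once, to report the error of the model refit with the selected alpha.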