# ML Practice Questions Part 5: Optimization and Gradient-Based Methods

This notebook covers optimization algorithms fundamental to machine learning, from basic gradient descent to advanced second-order methods. Each question includes mathematical derivations, algorithmic implementations, and performance analysis.

**Topics Covered:**
- Gradient descent variants and convergence analysis
- Momentum and adaptive learning rate methods
- Second-order optimization methods
- Constrained optimization and Lagrangian methods
- Optimization challenges in high-dimensional spaces

**Format:** Each question includes theory, implementation, and empirical analysis sections.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression, make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, log_loss
import seaborn as sns
from scipy.optimize import minimize, minimize_scalar
from scipy.linalg import norm
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8')
np.random.seed(42)

## Question 1: Gradient Descent Variants and Convergence Analysis

**Question:** Compare batch, stochastic, and mini-batch gradient descent. Analyze their convergence properties and implement each variant with theoretical guarantees.

### Theory

**Batch Gradient Descent (BGD):**
$$\theta_{t+1} = \theta_t - \alpha \nabla J(\theta_t)$$
- Uses entire dataset for each update
- Converges to the global minimum for convex, $L$-smooth functions when $\alpha \le 1/L$
- Convergence rate: O(1/k) for convex, linear O(ρ^k) with ρ < 1 for strongly convex
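
As a rough illustration of the batch update on a least-squares problem, the sketch below uses the step size 1/L, where L is the largest eigenvalue of the Hessian $X^T X / m$. The synthetic data, the seed, and the names `cost`, `grad`, and `L_smooth` are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=200)
m = len(y)

def cost(theta):
    """Half mean squared error, matching J(theta) in this notebook."""
    return np.sum((X @ theta - y) ** 2) / (2 * m)

def grad(theta):
    """Full-batch gradient of the cost."""
    return X.T @ (X @ theta - y) / m

# Smoothness constant: largest eigenvalue of the Hessian X^T X / m
L_smooth = np.linalg.eigvalsh(X.T @ X / m).max()
alpha = 1.0 / L_smooth

theta = np.zeros(3)
costs = [cost(theta)]
for _ in range(200):
    theta = theta - alpha * grad(theta)
    costs.append(cost(theta))

# Closed-form least-squares solution for comparison
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.abs(theta - theta_star).max())
```

With this step size the cost should not increase from one iteration to the next, and on a well-conditioned problem like this one the iterates approach the least-squares solution quickly.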

**Stochastic Gradient Descent (SGD):**
$$\theta_{t+1} = \theta_t - \alpha \nabla J_i(\theta_t)$$
- Uses single sample for each update
- Faster updates but noisy convergence
- With a constant learning rate it only reaches a neighborhood of the minimum; exact convergence requires a decreasing learning rate

**Mini-batch Gradient Descent:**
$$\theta_{t+1} = \theta_t - \alpha \frac{1}{|B|} \sum_{i \in B} \nabla J_i(\theta_t)$$
- Compromise between BGD and SGD
- Batch size affects convergence speed and stability
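
One way to see why mini-batches are a reasonable compromise: when the data is split into equal-size batches, the average of the mini-batch gradients equals the full-batch gradient, while each individual batch deviates from it. The sketch below checks this for the squared-error gradient; the data and the helper name `grad` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = rng.normal(size=120)
theta = rng.normal(size=4)

def grad(X_batch, y_batch, theta):
    """Gradient of the half-MSE cost on the given batch."""
    return X_batch.T @ (X_batch @ theta - y_batch) / len(y_batch)

full = grad(X, y, theta)

# Shuffle the indices and split them into 4 batches of 30 samples each
batches = rng.permutation(120).reshape(-1, 30)
batch_grads = np.array([grad(X[b], y[b], theta) for b in batches])
mean_of_batches = batch_grads.mean(axis=0)

# Per-batch deviation from the full gradient (the "noise" of each update)
deviations = np.linalg.norm(batch_grads - full, axis=1)
print(deviations)
```

Larger batches shrink these deviations at the price of more computation per update.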

**Learning Rate Requirements:**
For SGD convergence: $\sum_{t=1}^\infty \alpha_t = \infty$ and $\sum_{t=1}^\infty \alpha_t^2 < \infty$
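
A small sanity check of these conditions, under the assumption of a one-dimensional objective $J_i(\theta) = \frac{1}{2}(\theta - x_i)^2$: with $\alpha_t = 1/t$ (which satisfies both conditions) the SGD iterate is exactly the running mean of the samples, whereas a constant step keeps fluctuating around the minimizer. The sample distribution and variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
samples = rng.normal(loc=3.0, scale=1.0, size=5000)

theta_decay = 0.0   # uses alpha_t = 1 / t
theta_const = 0.0   # uses a constant alpha = 0.1
for t, x in enumerate(samples, start=1):
    # Gradient of 0.5 * (theta - x)^2 with respect to theta is (theta - x)
    theta_decay -= (1.0 / t) * (theta_decay - x)
    theta_const -= 0.1 * (theta_const - x)

print(theta_decay, samples.mean(), theta_const)
```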

In [None]:
class GradientDescentOptimizer:
    """
    Implementation of gradient descent variants for linear regression.
    """
    
    def __init__(self, method='batch', learning_rate=0.01, batch_size=32, max_iter=1000, tol=1e-6):
        self.method = method
        self.learning_rate = learning_rate
        self.batch_size = batch_size
        self.max_iter = max_iter
        self.tol = tol
        self.cost_history_ = []
        self.theta_history_ = []
        self.grad_norms_ = []
        
    def _cost_function(self, X, y, theta):
        """Mean squared error cost function."""
        m = len(y)
        predictions = X @ theta
        cost = np.sum((predictions - y) ** 2) / (2 * m)
        return cost
    
    def _gradient(self, X, y, theta):
        """Compute gradient of MSE cost function."""
        m = len(y)
        predictions = X @ theta
        gradient = X.T @ (predictions - y) / m
        return gradient
    
    def _get_learning_rate(self, iteration):
        """Get learning rate (can be adaptive for SGD)."""
        if self.method == 'sgd_adaptive':
            # Learning rate schedule: α_t = α_0 / (1 + 0.01 t)
            return self.learning_rate / (1 + iteration * 0.01)
        return self.learning_rate
    
    def fit(self, X, y):
        """Fit model using specified gradient descent variant."""
        # Add intercept term
        X_with_intercept = np.column_stack([np.ones(len(X)), X])
        m, n = X_with_intercept.shape
        
        # Initialize parameters
        theta = np.random.normal(0, 0.01, n)
        self.cost_history_ = []
        self.theta_history_ = [theta.copy()]
        self.grad_norms_ = []
        
        for iteration in range(self.max_iter):
            if self.method == 'batch':
                # Batch Gradient Descent
                gradient = self._gradient(X_with_intercept, y, theta)
                lr = self._get_learning_rate(iteration)
                theta = theta - lr * gradient
                
            elif self.method in ['sgd', 'sgd_adaptive']:
                # Stochastic Gradient Descent
                for i in range(m):
                    # Random sample
                    idx = np.random.randint(0, m)
                    X_sample = X_with_intercept[idx:idx+1]
                    y_sample = y[idx:idx+1]
                    
                    gradient = self._gradient(X_sample, y_sample, theta)
                    lr = self._get_learning_rate(iteration * m + i)
                    theta = theta - lr * gradient
                    
            elif self.method == 'mini_batch':
                # Mini-batch Gradient Descent
                # Shuffle data
                indices = np.random.permutation(m)
                X_shuffled = X_with_intercept[indices]
                y_shuffled = y[indices]
                
                # Process mini-batches
                for start_idx in range(0, m, self.batch_size):
                    end_idx = min(start_idx + self.batch_size, m)
                    X_batch = X_shuffled[start_idx:end_idx]
                    y_batch = y_shuffled[start_idx:end_idx]
                    
                    gradient = self._gradient(X_batch, y_batch, theta)
                    lr = self._get_learning_rate(iteration)
                    theta = theta - lr * gradient
            
            # Record history
            cost = self._cost_function(X_with_intercept, y, theta)
            gradient_full = self._gradient(X_with_intercept, y, theta)
            
            self.cost_history_.append(cost)
            self.theta_history_.append(theta.copy())
            self.grad_norms_.append(np.linalg.norm(gradient_full))
            
            # Check convergence
            if len(self.cost_history_) > 1:
                cost_diff = abs(self.cost_history_[-2] - self.cost_history_[-1])
                if cost_diff < self.tol:
                    break
        
        self.theta_ = theta
        self.intercept_ = theta[0]
        self.coef_ = theta[1:]
        return self
    
    def predict(self, X):
        """Make predictions."""
        return X @ self.coef_ + self.intercept_

# Generate regression data
X, y = make_regression(n_samples=1000, n_features=5, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare gradient descent variants
methods = {
    'Batch GD': GradientDescentOptimizer(method='batch', learning_rate=0.01, max_iter=1000),
    'SGD': GradientDescentOptimizer(method='sgd', learning_rate=0.01, max_iter=100),
    'SGD Adaptive': GradientDescentOptimizer(method='sgd_adaptive', learning_rate=0.1, max_iter=100),
    'Mini-batch GD': GradientDescentOptimizer(method='mini_batch', learning_rate=0.01, 
                                            batch_size=50, max_iter=1000)
}

results = {}
fitted_models = {}

for name, model in methods.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    mse = mean_squared_error(y_test, y_pred)
    final_cost = model.cost_history_[-1]
    iterations = len(model.cost_history_)
    final_grad_norm = model.grad_norms_[-1]
    
    results[name] = {
        'Test MSE': mse,
        'Final Cost': final_cost,
        'Iterations': iterations,
        'Final Grad Norm': final_grad_norm
    }
    fitted_models[name] = model

results_df = pd.DataFrame(results).T
print("Gradient Descent Variants Comparison:")
print(results_df.round(6))

In [None]:
# Visualize convergence behavior
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Cost convergence
for name, model in fitted_models.items():
    axes[0, 0].plot(model.cost_history_, label=name, alpha=0.8)
axes[0, 0].set_xlabel('Iteration')
axes[0, 0].set_ylabel('Cost')
axes[0, 0].set_title('Cost Function Convergence')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_yscale('log')

# Gradient norm convergence
for name, model in fitted_models.items():
    axes[0, 1].plot(model.grad_norms_, label=name, alpha=0.8)
axes[0, 1].set_xlabel('Iteration')
axes[0, 1].set_ylabel('Gradient Norm')
axes[0, 1].set_title('Gradient Norm Convergence')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_yscale('log')

# Parameter trajectory (first parameter)
for name, model in fitted_models.items():
    theta_trajectory = [theta[1] for theta in model.theta_history_]  # First feature coefficient
    axes[1, 0].plot(theta_trajectory, label=name, alpha=0.8)
axes[1, 0].set_xlabel('Iteration')
axes[1, 0].set_ylabel('Parameter Value (θ₁)')
axes[1, 0].set_title('Parameter Trajectory')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Learning rate sensitivity for SGD
learning_rates = [0.001, 0.01, 0.1, 1.0]
sgd_results = []

for lr in learning_rates:
    model = GradientDescentOptimizer(method='sgd', learning_rate=lr, max_iter=50)
    model.fit(X_train_scaled, y_train)
    final_cost = model.cost_history_[-1]
    sgd_results.append(final_cost)

axes[1, 1].semilogx(learning_rates, sgd_results, 'o-', linewidth=2, markersize=8)
axes[1, 1].set_xlabel('Learning Rate')
axes[1, 1].set_ylabel('Final Cost')
axes[1, 1].set_title('SGD Learning Rate Sensitivity')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Question 2: Momentum and Adaptive Learning Rate Methods

**Question:** Implement and compare momentum-based methods (momentum, Nesterov) and adaptive methods (AdaGrad, RMSprop, Adam). Analyze their convergence properties and robustness.

### Theory

**Momentum:**
$$v_t = \beta v_{t-1} + (1-\beta)\nabla J(\theta_t)$$
$$\theta_{t+1} = \theta_t - \alpha v_t$$

**Nesterov Accelerated Gradient:**
$$v_t = \beta v_{t-1} + \nabla J(\theta_t - \alpha\beta v_{t-1})$$
$$\theta_{t+1} = \theta_t - \alpha v_t$$

**AdaGrad:**
$$G_t = G_{t-1} + \nabla J(\theta_t) \odot \nabla J(\theta_t)$$
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t} + \epsilon} \odot \nabla J(\theta_t)$$

**RMSprop:**
$$v_t = \beta v_{t-1} + (1-\beta)\nabla J(\theta_t)^2$$
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{v_t} + \epsilon} \odot \nabla J(\theta_t)$$
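
To make the difference between AdaGrad and RMSprop concrete, the sketch below feeds both a constant scalar gradient, which is a deliberately artificial case. It places $\epsilon$ outside the square root, as the implementation in this notebook does. AdaGrad's step then shrinks like $\alpha / \sqrt{t}$, while RMSprop's step settles near $\alpha$.

```python
import numpy as np

alpha, beta, eps = 0.1, 0.9, 1e-8
g = 2.0            # constant gradient (illustrative assumption)
G, v = 0.0, 0.0    # AdaGrad accumulator and RMSprop moving average
adagrad_steps, rmsprop_steps = [], []

for t in range(1, 201):
    G += g ** 2
    v = beta * v + (1 - beta) * g ** 2
    adagrad_steps.append(alpha * g / (np.sqrt(G) + eps))
    rmsprop_steps.append(alpha * g / (np.sqrt(v) + eps))

print(adagrad_steps[0], adagrad_steps[-1])
print(rmsprop_steps[0], rmsprop_steps[-1])
```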

**Adam:**
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla J(\theta_t)$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\nabla J(\theta_t)^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
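
The effect of the bias correction is easiest to see at $t = 1$: the corrected moments reduce to $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the first update has magnitude close to $\alpha$ in every coordinate regardless of the gradient scale. The helper `adam_step` below is a hypothetical stand-alone function written for this check, not part of the class defined later.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following the four equations above."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Gradient coordinates spanning several orders of magnitude
grad = np.array([1e-3, 5.0, -200.0])
theta0 = np.zeros(3)
theta1, m1, v1 = adam_step(theta0, grad, np.zeros(3), np.zeros(3), t=1)
first_update = theta1 - theta0
print(first_update)
```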

In [None]:
class AdvancedOptimizer:
    """
    Implementation of advanced optimization algorithms.
    """
    
    def __init__(self, method='adam', learning_rate=0.001, beta1=0.9, beta2=0.999, 
                 epsilon=1e-8, max_iter=1000, tol=1e-6):
        self.method = method
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.max_iter = max_iter
        self.tol = tol
        self.cost_history_ = []
        self.grad_norms_ = []
        
    def _cost_function(self, X, y, theta):
        """Mean squared error cost function."""
        m = len(y)
        predictions = X @ theta
        cost = np.sum((predictions - y) ** 2) / (2 * m)
        return cost
    
    def _gradient(self, X, y, theta):
        """Compute gradient of MSE cost function."""
        m = len(y)
        predictions = X @ theta
        gradient = X.T @ (predictions - y) / m
        return gradient
    
    def fit(self, X, y):
        """Fit model using specified optimization method."""
        # Add intercept term
        X_with_intercept = np.column_stack([np.ones(len(X)), X])
        n = X_with_intercept.shape[1]  # Number of parameters (m is used for Adam's first moment below)
        
        # Initialize parameters
        theta = np.random.normal(0, 0.01, n)
        self.cost_history_ = []
        self.grad_norms_ = []
        
        # Initialize optimizer-specific variables
        if self.method in ['momentum', 'nesterov']:
            v = np.zeros(n)  # Velocity
        elif self.method == 'adagrad':
            G = np.zeros(n)  # Accumulated squared gradients
        elif self.method == 'rmsprop':
            v = np.zeros(n)  # Moving average of squared gradients
        elif self.method == 'adam':
            m = np.zeros(n)  # First moment estimate
            v = np.zeros(n)  # Second moment estimate
        
        for t in range(1, self.max_iter + 1):
            # Compute gradient
            if self.method == 'nesterov':
                # Look ahead gradient
                theta_lookahead = theta - self.learning_rate * self.beta1 * v
                gradient = self._gradient(X_with_intercept, y, theta_lookahead)
            else:
                gradient = self._gradient(X_with_intercept, y, theta)
            
            # Update parameters based on method
            if self.method == 'momentum':
                v = self.beta1 * v + (1 - self.beta1) * gradient
                theta = theta - self.learning_rate * v
                
            elif self.method == 'nesterov':
                v = self.beta1 * v + gradient
                theta = theta - self.learning_rate * v
                
            elif self.method == 'adagrad':
                G = G + gradient ** 2
                adapted_lr = self.learning_rate / (np.sqrt(G) + self.epsilon)
                theta = theta - adapted_lr * gradient
                
            elif self.method == 'rmsprop':
                v = self.beta2 * v + (1 - self.beta2) * gradient ** 2
                adapted_lr = self.learning_rate / (np.sqrt(v) + self.epsilon)
                theta = theta - adapted_lr * gradient
                
            elif self.method == 'adam':
                m = self.beta1 * m + (1 - self.beta1) * gradient
                v = self.beta2 * v + (1 - self.beta2) * gradient ** 2
                
                # Bias correction
                m_hat = m / (1 - self.beta1 ** t)
                v_hat = v / (1 - self.beta2 ** t)
                
                theta = theta - self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
            
            # Record history
            cost = self._cost_function(X_with_intercept, y, theta)
            grad_norm = np.linalg.norm(gradient)
            
            self.cost_history_.append(cost)
            self.grad_norms_.append(grad_norm)
            
            # Check convergence
            if grad_norm < self.tol:
                break
        
        self.theta_ = theta
        self.intercept_ = theta[0]
        self.coef_ = theta[1:]
        return self
    
    def predict(self, X):
        """Make predictions."""
        return X @ self.coef_ + self.intercept_

# Generate data with correlated, nonlinear features (the least-squares problem itself remains convex)
np.random.seed(42)
X = np.random.randn(500, 10)
# Add some feature interactions to make optimization more challenging
X_complex = np.column_stack([X, X[:, 0] * X[:, 1], X[:, 2] ** 2])
true_coef = np.random.randn(12) * 2
y = X_complex @ true_coef + np.random.randn(500) * 0.5

X_train, X_test, y_train, y_test = train_test_split(X_complex, y, test_size=0.3, random_state=42)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare optimization methods
optimizers = {
    'GD': AdvancedOptimizer(method='momentum', learning_rate=0.01, beta1=0.0),  # No momentum = plain full-batch gradient descent
    'Momentum': AdvancedOptimizer(method='momentum', learning_rate=0.01, beta1=0.9),
    'Nesterov': AdvancedOptimizer(method='nesterov', learning_rate=0.01, beta1=0.9),
    'AdaGrad': AdvancedOptimizer(method='adagrad', learning_rate=0.1),
    'RMSprop': AdvancedOptimizer(method='rmsprop', learning_rate=0.01, beta2=0.9),
    'Adam': AdvancedOptimizer(method='adam', learning_rate=0.01)
}

results = {}
fitted_optimizers = {}

for name, optimizer in optimizers.items():
    optimizer.fit(X_train_scaled, y_train)
    y_pred = optimizer.predict(X_test_scaled)
    
    mse = mean_squared_error(y_test, y_pred)
    final_cost = optimizer.cost_history_[-1]
    iterations = len(optimizer.cost_history_)
    final_grad_norm = optimizer.grad_norms_[-1]
    
    results[name] = {
        'Test MSE': mse,
        'Final Cost': final_cost,
        'Iterations': iterations,
        'Final Grad Norm': final_grad_norm
    }
    fitted_optimizers[name] = optimizer

results_df = pd.DataFrame(results).T
print("Advanced Optimization Methods Comparison:")
print(results_df.round(6))

In [None]:
# Visualize optimization behavior
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Cost convergence
for name, optimizer in fitted_optimizers.items():
    axes[0, 0].semilogy(optimizer.cost_history_, label=name, alpha=0.8, linewidth=2)
axes[0, 0].set_xlabel('Iteration')
axes[0, 0].set_ylabel('Cost (log scale)')
axes[0, 0].set_title('Cost Function Convergence')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Gradient norm convergence
for name, optimizer in fitted_optimizers.items():
    axes[0, 1].semilogy(optimizer.grad_norms_, label=name, alpha=0.8, linewidth=2)
axes[0, 1].set_xlabel('Iteration')
axes[0, 1].set_ylabel('Gradient Norm (log scale)')
axes[0, 1].set_title('Gradient Norm Convergence')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Final performance comparison
method_names = list(results.keys())
test_mses = [results[name]['Test MSE'] for name in method_names]
final_costs = [results[name]['Final Cost'] for name in method_names]

x_pos = np.arange(len(method_names))
axes[0, 2].bar(x_pos, test_mses, alpha=0.7)
axes[0, 2].set_xlabel('Optimization Method')
axes[0, 2].set_ylabel('Test MSE')
axes[0, 2].set_title('Final Test Performance')
axes[0, 2].set_xticks(x_pos)
axes[0, 2].set_xticklabels(method_names, rotation=45)
axes[0, 2].grid(True, alpha=0.3)

# Learning rate robustness for different methods
learning_rates = np.logspace(-3, 0, 10)
methods_to_test = ['momentum', 'rmsprop', 'adam']
robustness_results = {method: [] for method in methods_to_test}

for method in methods_to_test:
    for lr in learning_rates:
        try:
            opt = AdvancedOptimizer(method=method, learning_rate=lr, max_iter=200)
            opt.fit(X_train_scaled, y_train)
            final_cost = opt.cost_history_[-1]
            robustness_results[method].append(final_cost)
        except Exception:  # Divergence usually shows up as inf/nan costs rather than an exception
            robustness_results[method].append(float('inf'))

for method in methods_to_test:
    axes[1, 0].loglog(learning_rates, robustness_results[method], 'o-', label=method, alpha=0.8)
axes[1, 0].set_xlabel('Learning Rate')
axes[1, 0].set_ylabel('Final Cost')
axes[1, 0].set_title('Learning Rate Robustness')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Adam hyperparameter sensitivity
beta1_values = np.linspace(0.5, 0.99, 10)
adam_sensitivity = []

for beta1 in beta1_values:
    opt = AdvancedOptimizer(method='adam', learning_rate=0.01, beta1=beta1, max_iter=200)
    opt.fit(X_train_scaled, y_train)
    final_cost = opt.cost_history_[-1]
    adam_sensitivity.append(final_cost)

axes[1, 1].plot(beta1_values, adam_sensitivity, 'o-', linewidth=2, markersize=6)
axes[1, 1].set_xlabel('β₁ (momentum parameter)')
axes[1, 1].set_ylabel('Final Cost')
axes[1, 1].set_title('Adam β₁ Sensitivity')
axes[1, 1].grid(True, alpha=0.3)

# Convergence speed comparison (iterations to reach threshold)
threshold_cost = min([min(opt.cost_history_) for opt in fitted_optimizers.values()]) * 1.1
convergence_speeds = []

for name, optimizer in fitted_optimizers.items():
    # Find first iteration where cost drops below threshold
    try:
        conv_iter = next(i for i, cost in enumerate(optimizer.cost_history_) if cost <= threshold_cost)
    except StopIteration:
        conv_iter = len(optimizer.cost_history_)
    convergence_speeds.append(conv_iter)

axes[1, 2].bar(method_names, convergence_speeds, alpha=0.7, color='orange')
axes[1, 2].set_xlabel('Optimization Method')
axes[1, 2].set_ylabel('Iterations to Convergence')
axes[1, 2].set_title('Convergence Speed Comparison')
axes[1, 2].set_xticks(range(len(method_names)))
axes[1, 2].set_xticklabels(method_names, rotation=45)
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nConvergence Analysis:")
print(f"Threshold cost for convergence: {threshold_cost:.6f}")
for name, speed in zip(method_names, convergence_speeds):
    print(f"{name}: {speed} iterations")

## Question 3: Second-Order Optimization Methods

**Question:** Implement Newton's method and quasi-Newton methods (BFGS, L-BFGS). Compare computational complexity and convergence rates with first-order methods.

### Theory

**Newton's Method:**
$$\theta_{t+1} = \theta_t - H^{-1}(\theta_t) \nabla J(\theta_t)$$
- Quadratic convergence near optimum
- Requires Hessian computation and inversion: O(n³) per iteration
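
On a quadratic objective the Newton step is exact: a single update from any starting point lands on the minimizer. The sketch below checks this for the least-squares cost used earlier in the notebook; the random data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)
m = len(y)

theta0 = rng.normal(size=4)               # arbitrary starting point
gradient = X.T @ (X @ theta0 - y) / m     # gradient of the half-MSE cost
hessian = X.T @ X / m                     # constant Hessian of a quadratic

# Solve H d = g instead of forming the inverse explicitly
theta1 = theta0 - np.linalg.solve(hessian, gradient)

theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.abs(theta1 - theta_star).max())
```

For non-quadratic costs such as logistic regression, several Newton steps are needed, and the fast quadratic convergence only applies once the iterate is close to the optimum.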

**BFGS (Broyden-Fletcher-Goldfarb-Shanno):**
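- Builds a Hessian approximation $B_k$ from gradient differences, with no second derivatives required
- Update rule: $B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}$
- Where $s_k = \theta_{k+1} - \theta_k$ and $y_k = \nabla J(\theta_{k+1}) - \nabla J(\theta_k)$
- Implementations usually maintain the inverse $H_k = B_k^{-1}$ directly: $H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T$ with $\rho_k = 1/(y_k^T s_k)$, at O(n²) cost per iteration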
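
A minimal sketch of the inverse-Hessian form of the BFGS update, the form that can be multiplied with the gradient directly. Here `H` denotes the inverse approximation and `bfgs_inverse_update` is a hypothetical helper name. The check uses a quadratic with a known positive definite Hessian, so that $y = \text{Hessian} \cdot s$ exactly; after one update the secant equation $H_{k+1} y_k = s_k$ should hold.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """Inverse-form BFGS update: returns H_new with H_new @ y == s."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 5))
hess = A @ A.T + 5 * np.eye(5)   # symmetric positive definite Hessian
s = rng.normal(size=5)           # parameter step
y = hess @ s                     # gradient difference for a quadratic

H_new = bfgs_inverse_update(np.eye(5), s, y)
print(np.abs(H_new @ y - s).max())
```

Because $y^T s > 0$ here, the update also keeps the approximation symmetric and positive definite, which is why implementations skip the update when the curvature condition fails.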

**L-BFGS (Limited-memory BFGS):**
- Stores only m recent vector pairs instead of full matrix
- Memory: O(mn) instead of O(n²)
- Two-loop recursion for computing search direction
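
For comparison with a hand-written version, SciPy exposes a limited-memory variant through `scipy.optimize.minimize` with `method='L-BFGS-B'`, where the `maxcor` option sets the number of stored vector pairs. The sketch below applies it to a small logistic regression problem; the synthetic data and the names `labels` and `loss_and_grad` are assumptions for this example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
# Sample noisy 0/1 labels from the logistic model
labels = (rng.random(300) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def loss_and_grad(w):
    """Mean logistic loss and its gradient, returned together (jac=True)."""
    z = X @ w
    # log(1 + exp(z)) - y*z is the cross-entropy written in a stable form
    loss = np.mean(np.logaddexp(0, z) - labels * z)
    grad = X.T @ (1 / (1 + np.exp(-z)) - labels) / len(labels)
    return loss, grad

res = minimize(loss_and_grad, np.zeros(5), jac=True,
               method='L-BFGS-B', options={'maxcor': 10})
print(res.fun, res.nit)
```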

**Convergence Rates:**
- Newton: Quadratic (very fast near optimum)
- BFGS: Superlinear
- L-BFGS: Linear in theory, though often close to BFGS in practice
- First-order: Sublinear for general convex problems, linear for strongly convex ones
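
The gap between quadratic and linear convergence can be seen on a one-dimensional toy problem. The sketch below minimizes $f(x) = e^x - 2x$, whose minimizer is $x^* = \ln 2$, with Newton's method and with gradient descent at a fixed step of 0.25. The function, starting point, and step size are illustrative assumptions.

```python
import numpy as np

x_star = np.log(2.0)   # minimizer of f(x) = exp(x) - 2x

def grad(x):
    return np.exp(x) - 2.0

def hess(x):
    return np.exp(x)

x_newton, x_gd = 1.0, 1.0
newton_errors, gd_errors = [], []
for _ in range(5):
    x_newton -= grad(x_newton) / hess(x_newton)   # Newton step
    x_gd -= 0.25 * grad(x_gd)                     # gradient descent step
    newton_errors.append(abs(x_newton - x_star))
    gd_errors.append(abs(x_gd - x_star))

for k, (en, eg) in enumerate(zip(newton_errors, gd_errors), start=1):
    print(f"iter {k}: Newton error {en:.2e}, GD error {eg:.2e}")
```

The Newton error is roughly squared at each step, while the gradient descent error shrinks by an approximately constant factor.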

In [None]:
class SecondOrderOptimizer:
    """
    Implementation of second-order optimization methods.
    """
    
    def __init__(self, method='newton', memory_size=10, max_iter=100, tol=1e-8, line_search=True):
        self.method = method
        self.memory_size = memory_size  # For L-BFGS
        self.max_iter = max_iter
        self.tol = tol
        self.line_search = line_search
        self.cost_history_ = []
        self.grad_norms_ = []
        self.step_sizes_ = []
        
    def _cost_function(self, X, y, theta):
        """Logistic regression cost function."""
        z = X @ theta
        # Stable sigmoid computation
        z = np.clip(z, -500, 500)
        sigmoid = 1 / (1 + np.exp(-z))
        
        # Binary cross-entropy; clip probabilities to avoid log(0)
        epsilon = 1e-15
        sigmoid = np.clip(sigmoid, epsilon, 1 - epsilon)
        cost = -np.mean(y * np.log(sigmoid) + (1 - y) * np.log(1 - sigmoid))
        return cost
    
    def _gradient(self, X, y, theta):
        """Gradient of logistic regression cost."""
        z = X @ theta
        z = np.clip(z, -500, 500)
        sigmoid = 1 / (1 + np.exp(-z))
        gradient = X.T @ (sigmoid - y) / len(y)
        return gradient
    
    def _hessian(self, X, y, theta):
        """Hessian of logistic regression cost."""
        z = X @ theta
        z = np.clip(z, -500, 500)
        sigmoid = 1 / (1 + np.exp(-z))
        weights = sigmoid * (1 - sigmoid) + 1e-8  # Add small regularization
        hessian = X.T @ (weights[:, np.newaxis] * X) / len(y)
        return hessian
    
    def _line_search(self, X, y, theta, direction, gradient):
        """Backtracking line search."""
        alpha = 1.0
        c1 = 1e-4  # Armijo condition parameter
        rho = 0.5  # Backtracking parameter
        
        f0 = self._cost_function(X, y, theta)
        grad_dot_dir = np.dot(gradient, direction)
        
        for _ in range(20):  # Max line search iterations
            theta_new = theta + alpha * direction
            f_new = self._cost_function(X, y, theta_new)
            
            # Armijo condition
            if f_new <= f0 + c1 * alpha * grad_dot_dir:
                return alpha
            
            alpha *= rho
        
        return alpha
    
    def _lbfgs_direction(self, gradient, s_list, y_list, rho_list):
        """Compute L-BFGS search direction using two-loop recursion."""
        q = gradient.copy()
        alpha_list = []
        
        # First loop
        for i in reversed(range(len(s_list))):
            alpha_i = rho_list[i] * np.dot(s_list[i], q)
            alpha_list.append(alpha_i)
            q = q - alpha_i * y_list[i]
        
        # Initial Hessian approximation (identity scaled)
        if len(s_list) > 0:
            gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
        else:
            gamma = 1.0
        
        r = gamma * q
        
        # Second loop
        alpha_list.reverse()
        for i in range(len(s_list)):
            beta = rho_list[i] * np.dot(y_list[i], r)
            r = r + s_list[i] * (alpha_list[i] - beta)
        
        return -r
    
    def fit(self, X, y):
        """Fit model using specified second-order method."""
        # Add intercept
        X_with_intercept = np.column_stack([np.ones(len(X)), X])
        n_features = X_with_intercept.shape[1]
        
        # Initialize
        theta = np.random.normal(0, 0.01, n_features)
        self.cost_history_ = []
        self.grad_norms_ = []
        self.step_sizes_ = []
        
        # For BFGS/L-BFGS
        if self.method in ['bfgs', 'lbfgs']:
            if self.method == 'bfgs':
                B_inv = np.eye(n_features)  # Initial inverse Hessian approximation
            else:  # L-BFGS
                s_list, y_list, rho_list = [], [], []
        
        for iteration in range(self.max_iter):
            # Compute cost and gradient
            cost = self._cost_function(X_with_intercept, y, theta)
            gradient = self._gradient(X_with_intercept, y, theta)
            grad_norm = np.linalg.norm(gradient)
            
            self.cost_history_.append(cost)
            self.grad_norms_.append(grad_norm)
            
            # Check convergence
            if grad_norm < self.tol:
                break
            
            # Compute search direction
            if self.method == 'newton':
                hessian = self._hessian(X_with_intercept, y, theta)
                try:
                    direction = -np.linalg.solve(hessian, gradient)
                except np.linalg.LinAlgError:
                    # Fallback to gradient descent if Hessian is singular
                    direction = -gradient
                    
            elif self.method == 'bfgs':
                direction = -B_inv @ gradient
                
            elif self.method == 'lbfgs':
                direction = self._lbfgs_direction(gradient, s_list, y_list, rho_list)
            
            # Line search
            if self.line_search:
                step_size = self._line_search(X_with_intercept, y, theta, direction, gradient)
            else:
                step_size = 1.0
            
            self.step_sizes_.append(step_size)
            
            # Update parameters
            theta_new = theta + step_size * direction
            
            # Update inverse Hessian approximation for BFGS methods
            if self.method in ['bfgs', 'lbfgs']:
                s_vec = theta_new - theta
                # Named y_vec so the label vector y is not overwritten
                y_vec = self._gradient(X_with_intercept, y, theta_new) - gradient
                sy = np.dot(s_vec, y_vec)

                if sy > 1e-10:  # Curvature condition
                    rho = 1.0 / sy

                    if self.method == 'bfgs':
                        # BFGS update of the inverse Hessian approximation
                        A1 = np.eye(n_features) - rho * np.outer(s_vec, y_vec)
                        A2 = np.eye(n_features) - rho * np.outer(y_vec, s_vec)
                        B_inv = A1 @ B_inv @ A2 + rho * np.outer(s_vec, s_vec)

                    else:  # L-BFGS
                        # Drop the oldest pair once the memory is full
                        if len(s_list) >= self.memory_size:
                            s_list.pop(0)
                            y_list.pop(0)
                            rho_list.pop(0)

                        s_list.append(s_vec)
                        y_list.append(y_vec)
                        rho_list.append(rho)
            
            theta = theta_new
        
        self.theta_ = theta
        self.intercept_ = theta[0]
        self.coef_ = theta[1:]
        self.n_iter_ = len(self.cost_history_)
        return self
    
    def predict_proba(self, X):
        """Predict class probabilities."""
        z = X @ self.coef_ + self.intercept_
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    def predict(self, X):
        """Make binary predictions."""
        return (self.predict_proba(X) >= 0.5).astype(int)

# Generate logistic regression data
X, y = make_classification(n_samples=1000, n_features=20, n_redundant=0, 
                         n_informative=15, n_clusters_per_class=1, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare second-order methods
second_order_methods = {
    'Newton': SecondOrderOptimizer(method='newton', max_iter=50),
    'BFGS': SecondOrderOptimizer(method='bfgs', max_iter=100),
    'L-BFGS': SecondOrderOptimizer(method='lbfgs', memory_size=10, max_iter=100),
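    # Note: AdvancedOptimizer minimizes squared error on the 0/1 labels, so its
    # cost curve is not directly comparable to the log-loss curves above
    'Adam (reference)': AdvancedOptimizer(method='adam', learning_rate=0.01, max_iter=200)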
}

results = {}
fitted_second_order = {}

import time

for name, optimizer in second_order_methods.items():
    start_time = time.time()
    optimizer.fit(X_train_scaled, y_train)
    training_time = time.time() - start_time
    
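    if hasattr(optimizer, 'predict_proba'):
        y_pred_proba = optimizer.predict_proba(X_test_scaled)
        if len(y_pred_proba.shape) > 1:
            y_pred_proba = y_pred_proba[:, 1]
    else:
        # AdvancedOptimizer has no predict_proba: treat its linear outputs as
        # scores and clip them into (0, 1) so that log_loss is defined
        y_pred_proba = np.clip(optimizer.predict(X_test_scaled), 1e-15, 1 - 1e-15)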
    
    logloss = log_loss(y_test, y_pred_proba)
    
    if hasattr(optimizer, 'n_iter_'):
        iterations = optimizer.n_iter_
    else:
        iterations = len(optimizer.cost_history_)
    
    final_grad_norm = optimizer.grad_norms_[-1] if optimizer.grad_norms_ else np.nan
    
    results[name] = {
        'Log Loss': logloss,
        'Iterations': iterations,
        'Training Time (s)': training_time,
        'Final Grad Norm': final_grad_norm,
        'Time per Iter': training_time / iterations if iterations > 0 else 0
    }
    fitted_second_order[name] = optimizer

results_df = pd.DataFrame(results).T
print("Second-Order Optimization Methods Comparison:")
print(results_df.round(6))

In [None]:
# Visualize second-order methods behavior
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Cost convergence comparison
for name, optimizer in fitted_second_order.items():
    if optimizer.cost_history_:
        axes[0, 0].semilogy(optimizer.cost_history_, label=name, alpha=0.8, linewidth=2, marker='o', markersize=3)
axes[0, 0].set_xlabel('Iteration')
axes[0, 0].set_ylabel('Cost (log scale)')
axes[0, 0].set_title('Convergence Speed Comparison')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Gradient norm convergence
for name, optimizer in fitted_second_order.items():
    if optimizer.grad_norms_:
        axes[0, 1].semilogy(optimizer.grad_norms_, label=name, alpha=0.8, linewidth=2, marker='s', markersize=3)
axes[0, 1].set_xlabel('Iteration')
axes[0, 1].set_ylabel('Gradient Norm (log scale)')
axes[0, 1].set_title('Gradient Norm Convergence')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Step sizes (for second-order methods with line search)
for name, optimizer in fitted_second_order.items():
    if hasattr(optimizer, 'step_sizes_') and optimizer.step_sizes_:
        axes[0, 2].plot(optimizer.step_sizes_, label=name, alpha=0.8, linewidth=2, marker='^', markersize=3)
axes[0, 2].set_xlabel('Iteration')
axes[0, 2].set_ylabel('Step Size')
axes[0, 2].set_title('Step Size Evolution')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

# Performance vs computational cost
method_names = list(results.keys())
log_losses = [results[name]['Log Loss'] for name in method_names]
training_times = [results[name]['Training Time (s)'] for name in method_names]
iterations = [results[name]['Iterations'] for name in method_names]

scatter = axes[1, 0].scatter(training_times, log_losses, c=iterations, s=100, alpha=0.7, cmap='viridis')
for i, name in enumerate(method_names):
    axes[1, 0].annotate(name, (training_times[i], log_losses[i]), 
                       xytext=(5, 5), textcoords='offset points', fontsize=9)
axes[1, 0].set_xlabel('Training Time (seconds)')
axes[1, 0].set_ylabel('Test Log Loss')
axes[1, 0].set_title('Performance vs Computational Cost')
plt.colorbar(scatter, ax=axes[1, 0], label='Iterations')
axes[1, 0].grid(True, alpha=0.3)

# L-BFGS memory size sensitivity
memory_sizes = [3, 5, 10, 20, 50]
lbfgs_performance = []
lbfgs_times = []

for m in memory_sizes:
    start_time = time.time()
    lbfgs = SecondOrderOptimizer(method='lbfgs', memory_size=m, max_iter=100)
    lbfgs.fit(X_train_scaled, y_train)
    training_time = time.time() - start_time
    
    y_pred_proba = lbfgs.predict_proba(X_test_scaled)
    logloss = log_loss(y_test, y_pred_proba)
    
    lbfgs_performance.append(logloss)
    lbfgs_times.append(training_time)

axes[1, 1].plot(memory_sizes, lbfgs_performance, 'o-', linewidth=2, markersize=8, label='Log Loss')
ax_twin = axes[1, 1].twinx()
ax_twin.plot(memory_sizes, lbfgs_times, 's-', color='orange', linewidth=2, markersize=8, label='Training Time')

axes[1, 1].set_xlabel('L-BFGS Memory Size')
axes[1, 1].set_ylabel('Test Log Loss', color='blue')
ax_twin.set_ylabel('Training Time (s)', color='orange')
axes[1, 1].set_title('L-BFGS Memory Size Effect')
axes[1, 1].grid(True, alpha=0.3)

# Convergence rate analysis (log-log plot)
newton_opt = fitted_second_order['Newton']
if newton_opt.grad_norms_:
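    # Gradient-norm decay of Newton's method (a steep drop is consistent with fast
    # local convergence; a log-log slope alone does not establish quadratic order)
    grad_norms = np.array(newton_opt.grad_norms_[1:])  # Skip first point
    iterations_range = np.arange(1, len(grad_norms) + 1)
    
    # Fit a power law in log-log space: ||∇f|| ≈ C * k^slope
    if len(grad_norms) > 3:
        # Keep iteration indices aligned with the strictly positive norms
        positive = grad_norms > 0
        log_grad_norms = np.log(grad_norms[positive])
        log_iterations = np.log(iterations_range[positive])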
        
        # Linear fit in log space
        coeffs = np.polyfit(log_iterations, log_grad_norms, 1)
        slope = coeffs[0]
        
        axes[1, 2].loglog(iterations_range[:len(grad_norms)], grad_norms, 'o-', 
                         label=f'Newton (slope≈{slope:.2f})', linewidth=2)

# Compare with first-order method
adam_opt = fitted_second_order['Adam (reference)']
if adam_opt.grad_norms_:
    adam_grad_norms = np.array(adam_opt.grad_norms_[1:20])  # First 20 iterations
    adam_iterations = np.arange(1, len(adam_grad_norms) + 1)
    axes[1, 2].loglog(adam_iterations, adam_grad_norms, 's-', 
                     label='Adam (first-order)', linewidth=2, alpha=0.7)

axes[1, 2].set_xlabel('Iteration')
axes[1, 2].set_ylabel('Gradient Norm')
axes[1, 2].set_title('Convergence Rate Analysis')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary statistics
print("\nComputational Complexity Analysis:")
print(f"Newton's method - Average time per iteration: {results['Newton']['Time per Iter']:.4f}s")
print(f"BFGS - Average time per iteration: {results['BFGS']['Time per Iter']:.4f}s")
print(f"L-BFGS - Average time per iteration: {results['L-BFGS']['Time per Iter']:.4f}s")
print(f"Adam - Average time per iteration: {results['Adam (reference)']['Time per Iter']:.4f}s")

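# Both log loss and time are lower-is-better, so this ratio is not a ranking on its own
print("\nLog Loss per Second of Training (read alongside the table above):")
for name in method_names:
    train_time = results[name]['Training Time (s)']
    # Guard against a zero elapsed time on very fast runs
    efficiency = results[name]['Log Loss'] / train_time if train_time > 0 else float('nan')
    print(f"{name}: {efficiency:.4f}")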
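
The log-log slope in the last panel describes a power-law decay, which is not the same thing as quadratic convergence. A minimal sketch of a more direct check follows, assuming only a sequence of strictly positive error values. The helper name `estimate_order` and the two synthetic sequences are illustrative and are not part of the notebook's optimizer classes. The local order is estimated from three consecutive errors as $p_k = \log(e_{k+1}/e_k) / \log(e_k/e_{k-1})$.

```python
import numpy as np

def estimate_order(errors):
    """Estimate the local convergence order from consecutive positive errors.

    Uses p_k = log(e_{k+1} / e_k) / log(e_k / e_{k-1}); values near 1 suggest
    linear convergence and values near 2 suggest quadratic convergence.
    """
    e = np.asarray(errors, dtype=float)
    return np.log(e[2:] / e[1:-1]) / np.log(e[1:-1] / e[:-2])

# Synthetic sequences (illustrative only, not optimizer output)
linear_errors = [0.5 ** k for k in range(1, 8)]         # e_{k+1} = 0.5 * e_k
quadratic_errors = [0.5 ** (2 ** k) for k in range(5)]  # e_{k+1} = e_k ** 2

linear_orders = estimate_order(linear_errors)
quadratic_orders = estimate_order(quadratic_errors)
print("Linear sequence orders:   ", np.round(linear_orders, 3))
print("Quadratic sequence orders:", np.round(quadratic_orders, 3))
```

Applied to `newton_opt.grad_norms_`, values near 2 over the last few iterations would support a quadratic-convergence reading. The estimate becomes noisy once the norms approach floating-point precision.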

## Question 4: Constrained Optimization and Lagrangian Methods

**Question:** Implement constrained optimization using Lagrange multipliers for SVM-like problems. Compare penalty methods with barrier methods for handling constraints.

### Theory

**Constrained Optimization Problem:**
$$\min_{x} f(x) \quad \text{subject to} \quad g_i(x) \leq 0, \quad h_j(x) = 0$$

**Lagrangian Function:**
$$L(x, \lambda, \mu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \mu_j h_j(x)$$

**KKT Conditions:**
1. Stationarity: $\nabla f(x^*) + \sum_i \lambda_i^* \nabla g_i(x^*) + \sum_j \mu_j^* \nabla h_j(x^*) = 0$
2. Primal feasibility: $g_i(x^*) \leq 0, \quad h_j(x^*) = 0$
3. Dual feasibility: $\lambda_i^* \geq 0$
4. Complementary slackness: $\lambda_i^* g_i(x^*) = 0$
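
As a quick sanity check of the four conditions, here is a minimal sketch on a toy problem that is not part of the notebook: minimize $x_1^2 + x_2^2$ subject to $g(x) = 1 - x_1 - x_2 \leq 0$. Solving the KKT system by hand gives $x^* = (0.5, 0.5)$ and $\lambda^* = 1$. The helper `kkt_residuals` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

def f_grad(x):
    """Gradient of f(x) = x1^2 + x2^2."""
    return 2.0 * x

def g(x):
    """Inequality constraint in g(x) <= 0 form: 1 - x1 - x2 <= 0."""
    return 1.0 - x[0] - x[1]

def g_grad(x):
    """Gradient of g (constant for this linear constraint)."""
    return np.array([-1.0, -1.0])

def kkt_residuals(x, lam):
    """Return one non-negative residual per KKT condition (0 means satisfied)."""
    return {
        "stationarity": float(np.linalg.norm(f_grad(x) + lam * g_grad(x))),
        "primal_violation": max(0.0, g(x)),
        "dual_violation": max(0.0, -lam),
        "complementarity": abs(lam * g(x)),
    }

# Hand-derived KKT point
residuals_opt = kkt_residuals(np.array([0.5, 0.5]), 1.0)
# Feasible but non-optimal point: the constraint is inactive, so
# complementary slackness forces lambda = 0 and stationarity fails
residuals_other = kkt_residuals(np.array([1.0, 1.0]), 0.0)

print("KKT point:      ", residuals_opt)
print("Non-optimal pt.:", residuals_other)
```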

**Penalty Method:**
$$\min_{x} f(x) + \rho \sum_i \max(0, g_i(x))^2 + \rho \sum_j h_j(x)^2$$
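
To see how the penalty parameter drives the iterate toward feasibility, consider a one-dimensional toy problem (an assumption for illustration, not the SVM used later): minimize $x^2$ subject to $1 - x \leq 0$, whose solution is $x^* = 1$ with multiplier $\lambda^* = 2$. Setting the derivative of $x^2 + \rho(1 - x)^2$ to zero gives the penalized minimizer $x_\rho = \rho / (1 + \rho)$, which is always slightly infeasible. Differentiating the penalty term also yields the multiplier estimate $2\rho \max(0, g(x_\rho))$.

```python
import numpy as np

def penalty_minimizer(rho):
    """Closed-form minimizer of x**2 + rho * max(0, 1 - x)**2."""
    return rho / (1.0 + rho)

rhos = np.array([1.0, 10.0, 100.0, 1000.0])
x_rho = penalty_minimizer(rhos)

# Constraint g(x) = 1 - x; the penalized solution violates it by 1 / (1 + rho)
violation = np.maximum(0.0, 1.0 - x_rho)

# Multiplier estimate implied by the quadratic penalty: lambda ~ 2 * rho * max(0, g)
lambda_estimate = 2.0 * rhos * violation

for rho, x, v, lam in zip(rhos, x_rho, violation, lambda_estimate):
    print(f"rho={rho:7.1f}  x={x:.6f}  violation={v:.6f}  lambda_est={lam:.6f}")
```

The violation shrinks like $1/\rho$, which is why penalty methods need large $\rho$ for accurate feasibility and why the subproblems become ill-conditioned as $\rho$ grows.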

**Barrier Method:**
$$\min_{x} f(x) - \mu \sum_i \log(-g_i(x))$$
(defined only at strictly feasible points, $g_i(x) < 0$; the barrier term grows without bound as any $g_i(x) \to 0^-$)
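
A minimal sketch of the barrier behaviour on a one-dimensional toy problem (an assumption for illustration, not the SVM used later): minimize $x^2$ subject to $1 - x \leq 0$, with solution $x^* = 1$ and multiplier $\lambda^* = 2$. The barrier objective is $x^2 - \mu \log(x - 1)$, and setting its derivative to zero gives $x_\mu = \tfrac{1}{2}\left(1 + \sqrt{1 + 2\mu}\right)$. The stationarity condition also suggests the multiplier estimate $\lambda \approx -\mu / g(x_\mu)$.

```python
import numpy as np

def barrier_minimizer(mu):
    """Closed-form minimizer of x**2 - mu * log(x - 1) on x > 1."""
    return 0.5 * (1.0 + np.sqrt(1.0 + 2.0 * mu))

mus = np.array([1.0, 0.1, 0.01, 0.001])
x_mu = barrier_minimizer(mus)

# Constraint values g(x) = 1 - x stay strictly negative (interior iterates)
g_mu = 1.0 - x_mu

# Multiplier estimate implied by the log barrier: lambda ~ mu / (-g)
lambda_estimate = -mus / g_mu

for mu, x, lam in zip(mus, x_mu, lambda_estimate):
    print(f"mu={mu:6.3f}  x={x:.6f}  g={1.0 - x:.6f}  lambda_est={lam:.6f}")
```

Unlike the penalty iterates, every $x_\mu$ is feasible, and the sequence approaches the boundary from the inside as $\mu$ decreases.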

In [None]:
class ConstrainedOptimizer:
    """
    Constrained optimization using penalty and barrier methods.
    """
    
    def __init__(self, method='penalty', penalty_param=1.0, barrier_param=1.0, 
                 max_iter=100, tol=1e-6):
        self.method = method
        self.penalty_param = penalty_param
        self.barrier_param = barrier_param
        self.max_iter = max_iter
        self.tol = tol
        self.cost_history_ = []
        self.penalty_history_ = []
        self.constraint_violations_ = []
        
    def _svm_objective(self, w, b, X, y):
        """SVM objective function: ||w||² / 2"""
        return 0.5 * np.dot(w, w)
    
    def _svm_constraints(self, w, b, X, y):
        """SVM constraints: y_i(w^T x_i + b) >= 1"""
        # Convert to g_i(x) <= 0 form: 1 - y_i(w^T x_i + b) <= 0
        margins = y * (X @ w + b)
        return 1 - margins  # <= 0 for feasible points
    
    def _penalty_function(self, w, b, X, y, rho):
        """Penalty method objective function."""
        objective = self._svm_objective(w, b, X, y)
        constraints = self._svm_constraints(w, b, X, y)
        
        # Quadratic penalty for violated constraints
        penalty = rho * np.sum(np.maximum(0, constraints) ** 2)
        
        return objective + penalty, penalty
    
    def _barrier_function(self, w, b, X, y, mu):
        """Barrier method objective function."""
        objective = self._svm_objective(w, b, X, y)
        constraints = self._svm_constraints(w, b, X, y)
        
        # Log barrier for inequality constraints
        # The log barrier needs strict feasibility (constraint < 0); reject anything else
        if np.any(constraints >= -1e-8):  # Small tolerance for numerical issues
            return float('inf'), float('inf')
        
        barrier = -mu * np.sum(np.log(-constraints))
        
        return objective + barrier, barrier
    
    def _gradient_penalty(self, w, b, X, y, rho):
        """Gradient of penalty function."""
        # Objective gradient
        grad_w_obj = w
        grad_b_obj = 0
        
        # Constraint gradients
        constraints = self._svm_constraints(w, b, X, y)
        violated = constraints > 0
        
        if np.any(violated):
            grad_w_penalty = -2 * rho * np.sum(
                (constraints[violated, np.newaxis] * y[violated, np.newaxis] * X[violated]), axis=0
            )
            grad_b_penalty = -2 * rho * np.sum(constraints[violated] * y[violated])
        else:
            grad_w_penalty = np.zeros_like(w)
            grad_b_penalty = 0
        
        return grad_w_obj + grad_w_penalty, grad_b_obj + grad_b_penalty
    
    def _gradient_barrier(self, w, b, X, y, mu):
        """Gradient of barrier function."""
        # Objective gradient
        grad_w_obj = w
        grad_b_obj = 0
        
        # Barrier gradients
        constraints = self._svm_constraints(w, b, X, y)
        
        # Check feasibility
        if np.any(constraints >= -1e-8):
            return np.full_like(w, float('inf')), float('inf')
        
        barrier_weights = mu / constraints
        grad_w_barrier = np.sum(
            (barrier_weights[:, np.newaxis] * y[:, np.newaxis] * X), axis=0
        )
        grad_b_barrier = np.sum(barrier_weights * y)
        
        return grad_w_obj + grad_w_barrier, grad_b_obj + grad_b_barrier
    
    def fit(self, X, y):
        """Solve SVM using constrained optimization."""
        n_samples, n_features = X.shape
        
        # Initialize parameters
        w = np.random.normal(0, 0.1, n_features)
        b = 0.0
        
        self.cost_history_ = []
        self.penalty_history_ = []
        self.constraint_violations_ = []
        
        # Adaptive penalty/barrier parameter
        rho = self.penalty_param
        mu = self.barrier_param
        
        for iteration in range(self.max_iter):
            if self.method == 'penalty':
                # Penalty method
                cost, penalty = self._penalty_function(w, b, X, y, rho)
                grad_w, grad_b = self._gradient_penalty(w, b, X, y, rho)
                
                # Update with gradient descent
                learning_rate = 0.01 / (1 + iteration * 0.01)
                w = w - learning_rate * grad_w
                b = b - learning_rate * grad_b
                
                # Increase penalty parameter
                if iteration % 20 == 19:
                    rho *= 2
                    
            elif self.method == 'barrier':
                # Barrier method
                cost, barrier = self._barrier_function(w, b, X, y, mu)
                
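                if cost == float('inf'):
                    # Infeasible iterate: the log barrier is undefined here.
                    # Shrinking (w, b) moves every margin toward 0, away from the
                    # feasible region, so rescale upward instead. This restores
                    # strict feasibility only when all margins are already positive.
                    margins = y * (X @ w + b)
                    min_margin = margins.min()
                    if min_margin > 0:
                        scale = 1.1 / min_margin
                        w *= scale
                        b *= scale
                    continue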
                
                grad_w, grad_b = self._gradient_barrier(w, b, X, y, mu)
                
                if np.any(np.isinf(grad_w)) or np.isinf(grad_b):
                    # Projection step
                    w *= 0.95
                    b *= 0.95
                    continue
                
                # Update with gradient descent
                learning_rate = 0.001 / (1 + iteration * 0.01)
                w = w - learning_rate * grad_w
                b = b - learning_rate * grad_b
                
                # Decrease barrier parameter
                if iteration % 20 == 19:
                    mu *= 0.5
                    
                penalty = barrier
            
            # Record history
            constraints = self._svm_constraints(w, b, X, y)
            max_violation = np.maximum(0, constraints).max()
            
            self.cost_history_.append(cost)
            self.penalty_history_.append(penalty)
            self.constraint_violations_.append(max_violation)
            
            # Check convergence
            if len(self.cost_history_) > 1:
                cost_change = abs(self.cost_history_[-1] - self.cost_history_[-2])
                if cost_change < self.tol and max_violation < self.tol:
                    break
        
        self.w_ = w
        self.b_ = b
        self.n_iter_ = len(self.cost_history_)
        return self
    
    def predict(self, X):
        """Make predictions."""
        return np.sign(X @ self.w_ + self.b_)
    
    def decision_function(self, X):
        """Compute decision function values."""
        return X @ self.w_ + self.b_

# Generate two-class data for the SVM experiment
np.random.seed(42)
n_samples = 200
X = np.random.randn(n_samples, 2)
# Labels are linearly separable before noise is added
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
# Flip 10% of the labels; the data is then generally not separable, so the
# hard-margin constraints cannot all be satisfied
noise_idx = np.random.choice(n_samples, size=int(0.1 * n_samples), replace=False)
y[noise_idx] *= -1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Compare penalty and barrier methods
constrained_optimizers = {
    'Penalty Method': ConstrainedOptimizer(method='penalty', penalty_param=1.0, max_iter=200),
    'Barrier Method': ConstrainedOptimizer(method='barrier', barrier_param=1.0, max_iter=200)
}

results = {}
fitted_constrained = {}

for name, optimizer in constrained_optimizers.items():
    try:
        optimizer.fit(X_train, y_train)
        y_pred = optimizer.predict(X_test)
        
        accuracy = np.mean(y_pred == y_test)
        final_violation = optimizer.constraint_violations_[-1]
        objective_value = 0.5 * np.dot(optimizer.w_, optimizer.w_)
        
        results[name] = {
            'Accuracy': accuracy,
            'Final Constraint Violation': final_violation,
            'Objective Value': objective_value,
            'Iterations': optimizer.n_iter_
        }
        fitted_constrained[name] = optimizer
        
    except Exception as e:
        print(f"Error with {name}: {e}")
        results[name] = {'Error': str(e)}

# Add sklearn SVM for comparison
from sklearn.svm import SVC
svm_sklearn = SVC(kernel='linear', C=1000)  # Large C approximates a hard margin
svm_sklearn.fit(X_train, y_train)
y_pred_sklearn = svm_sklearn.predict(X_test)

# Measure the margin violation directly instead of assuming it is zero
sklearn_margins = y_train * svm_sklearn.decision_function(X_train)

results['sklearn SVM'] = {
    'Accuracy': np.mean(y_pred_sklearn == y_test),
    'Final Constraint Violation': np.maximum(0, 1 - sklearn_margins).max(),
    'Objective Value': 0.5 * np.sum(svm_sklearn.coef_[0] ** 2),
    'Iterations': 'N/A'
}

results_df = pd.DataFrame(results).T
print("Constrained Optimization Methods Comparison:")
print(results_df.round(6))

In [None]:
# Visualize constrained optimization behavior
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Cost evolution
for name, optimizer in fitted_constrained.items():
    axes[0, 0].semilogy(optimizer.cost_history_, label=name, alpha=0.8, linewidth=2)
axes[0, 0].set_xlabel('Iteration')
axes[0, 0].set_ylabel('Cost (log scale)')
axes[0, 0].set_title('Cost Function Evolution')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Constraint violations
for name, optimizer in fitted_constrained.items():
    axes[0, 1].semilogy(optimizer.constraint_violations_, label=name, alpha=0.8, linewidth=2)
axes[0, 1].set_xlabel('Iteration')
axes[0, 1].set_ylabel('Max Constraint Violation (log scale)')
axes[0, 1].set_title('Constraint Satisfaction')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Penalty/Barrier term evolution
for name, optimizer in fitted_constrained.items():
    if optimizer.penalty_history_:
        penalty_values = [p for p in optimizer.penalty_history_ if p != float('inf')]
        if penalty_values:
            axes[0, 2].semilogy(penalty_values, label=f'{name} Penalty/Barrier', alpha=0.8, linewidth=2)
axes[0, 2].set_xlabel('Iteration')
axes[0, 2].set_ylabel('Penalty/Barrier Term (log scale)')
axes[0, 2].set_title('Penalty/Barrier Evolution')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

# Decision boundaries visualization
if fitted_constrained:
    # Create mesh for decision boundary
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Shared by both panels, so build it outside the per-method branches
    mesh_points = np.c_[xx.ravel(), yy.ravel()]
    
    # Plot for penalty method
    if 'Penalty Method' in fitted_constrained:
        penalty_opt = fitted_constrained['Penalty Method']
        Z = penalty_opt.decision_function(mesh_points)
        Z = Z.reshape(xx.shape)
        
        axes[1, 0].contour(xx, yy, Z, levels=[0], colors='black', linestyles='--', linewidths=2)
        axes[1, 0].scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='RdYlBu', alpha=0.8)
        axes[1, 0].set_title('Penalty Method Decision Boundary')
        axes[1, 0].set_xlabel('Feature 1')
        axes[1, 0].set_ylabel('Feature 2')
    
    # Plot for barrier method
    if 'Barrier Method' in fitted_constrained:
        barrier_opt = fitted_constrained['Barrier Method']
        Z = barrier_opt.decision_function(mesh_points)
        Z = Z.reshape(xx.shape)
        
        axes[1, 1].contour(xx, yy, Z, levels=[0], colors='black', linestyles='--', linewidths=2)
        axes[1, 1].scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='RdYlBu', alpha=0.8)
        axes[1, 1].set_title('Barrier Method Decision Boundary')
        axes[1, 1].set_xlabel('Feature 1')
        axes[1, 1].set_ylabel('Feature 2')

# Performance comparison
method_names = [name for name in results.keys() if 'Error' not in results[name]]
accuracies = [results[name]['Accuracy'] for name in method_names]
objective_values = [results[name]['Objective Value'] for name in method_names]

x_pos = np.arange(len(method_names))
width = 0.35

ax1 = axes[1, 2]
ax2 = ax1.twinx()

bars1 = ax1.bar(x_pos - width/2, accuracies, width, label='Accuracy', alpha=0.7, color='blue')
bars2 = ax2.bar(x_pos + width/2, objective_values, width, label='Objective Value', alpha=0.7, color='orange')

ax1.set_xlabel('Method')
ax1.set_ylabel('Accuracy', color='blue')
ax2.set_ylabel('Objective Value', color='orange')
ax1.set_title('Performance Comparison')
ax1.set_xticks(x_pos)
ax1.set_xticklabels(method_names, rotation=45)

# Add value labels on bars
for bar, val in zip(bars1, accuracies):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 0.01,
             f'{val:.3f}', ha='center', va='bottom', fontsize=9)

for bar, val in zip(bars2, objective_values):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + height*0.05,
             f'{val:.3f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

# Analysis of constraint satisfaction
print("\nConstraint Satisfaction Analysis:")
for name, optimizer in fitted_constrained.items():
    # Check final constraint satisfaction
    constraints = optimizer._svm_constraints(optimizer.w_, optimizer.b_, X_train, y_train)
    n_violated = np.sum(constraints > 1e-6)
    max_violation = np.maximum(0, constraints).max()
    
    print(f"{name}:")
    print(f"  Constraints violated: {n_violated}/{len(constraints)}")
    print(f"  Maximum violation: {max_violation:.6f}")
    print(f"  Final objective: {results[name]['Objective Value']:.6f}")
    print()

## Question 5: Optimization Challenges in High-Dimensional Spaces

**Question:** Analyze the curse of dimensionality in optimization. Implement coordinate descent and demonstrate its effectiveness on high-dimensional problems compared to full gradient methods.

### Theory

**Curse of Dimensionality in Optimization:**
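1. **Critical point proliferation**: In non-convex problems, the number of critical points can grow exponentially with dimension
2. **Gradient vanishing**: Gradient norms can become very small on plateaus and near saddle points, which stalls first-order updates
3. **Saddle point dominance**: Most critical points are saddle points, not minima
4. **Poor conditioning**: The Hessian eigenvalue spectrum often spreads out, so the condition number grows and first-order methods slow down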
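
One way to build intuition for saddle point dominance is a small simulation. The sketch below treats the Hessian at a critical point as a random symmetric Gaussian matrix, which is a modelling assumption and not a property of any particular loss. It estimates how often all eigenvalues are positive, meaning the critical point would be a local minimum, as the dimension grows. The function name `fraction_positive_definite` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_positive_definite(dim, n_samples=2000):
    """Fraction of random symmetric Gaussian matrices with all eigenvalues > 0."""
    M = rng.standard_normal((n_samples, dim, dim))
    H = (M + M.transpose(0, 2, 1)) / 2.0   # symmetrize each sample
    eigvals = np.linalg.eigvalsh(H)        # shape (n_samples, dim)
    return float(np.mean(eigvals.min(axis=1) > 0))

fractions = {d: fraction_positive_definite(d) for d in (1, 2, 4, 8)}
for d, frac in fractions.items():
    print(f"dim={d}: fraction with all-positive eigenvalues = {frac:.4f}")
```

Under this model the fraction falls off quickly with dimension, so a randomly drawn critical point is far more likely to have mixed curvature (a saddle) than to be a minimum.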

**Coordinate Descent:**
$$\theta_j^{(t+1)} = \arg\min_{\theta_j} f(\theta_1^{(t+1)}, \ldots, \theta_{j-1}^{(t+1)}, \theta_j, \theta_{j+1}^{(t)}, \ldots, \theta_d^{(t)})$$

**Advantages in High Dimensions:**
- Each update needs one partial derivative rather than the full gradient (for a dense quadratic, O(d) per coordinate versus O(d²) for a full gradient)
- Natural for sparsity-inducing problems
- Can handle non-differentiable objectives
- Memory efficient
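
The per-coordinate cost can be made concrete with a minimal sketch for the objective $\tfrac{1}{2}\theta^T A \theta - b^T \theta + \lambda \|\theta\|_1$ with symmetric positive definite $A$. The function names and the small random test problem are assumptions for illustration. The sketch keeps the smooth gradient $A\theta - b$ up to date incrementally, so each coordinate update costs O(d), and it handles the non-differentiable L1 term with soft thresholding.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def coordinate_descent(A, b, lam, n_sweeps=200):
    """Cyclic coordinate descent for 0.5 * th^T A th - b^T th + lam * ||th||_1."""
    d = len(b)
    theta = np.zeros(d)
    grad = -b.astype(float)  # smooth gradient A @ theta - b at theta = 0
    for _ in range(n_sweeps):
        for j in range(d):
            # Exact minimizer over theta_j with the other coordinates fixed
            z = theta[j] - grad[j] / A[j, j]
            new_value = soft_threshold(z, lam / A[j, j])
            delta = new_value - theta[j]
            if delta != 0.0:
                grad += delta * A[:, j]  # O(d) refresh instead of a full A @ theta
                theta[j] = new_value
    return theta

# Small synthetic problem (illustrative only)
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 5))
A = M.T @ M + 0.1 * np.eye(5)
b = rng.standard_normal(5)

theta_smooth = coordinate_descent(A, b, lam=0.0)
theta_sparse = coordinate_descent(A, b, lam=0.5)
print("lam=0.0:", np.round(theta_smooth, 4))
print("lam=0.5:", np.round(theta_sparse, 4))
```

With `lam=0.0` the result should agree with `np.linalg.solve(A, b)`. With `lam > 0` the solution satisfies the subgradient optimality conditions, and some coordinates may be exactly zero.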

**Block Coordinate Descent:**
Updates blocks of coordinates simultaneously:
$$\theta_{B_k}^{(t+1)} = \arg\min_{\theta_{B_k}} f(\theta_{B_1}^{(t+1)}, \ldots, \theta_{B_{k-1}}^{(t+1)}, \theta_{B_k}, \theta_{B_{k+1}}^{(t)}, \ldots)$$
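
For a smooth quadratic $\tfrac{1}{2}x^T A x - b^T x$, the block update has a closed form: $x_B = A_{BB}^{-1}(b_B - A_{B,\neg B}\, x_{\neg B})$. The following is a minimal sketch under that assumption, with no L1 term. The function name and test problem are illustrative, and the final block is allowed to be smaller when the dimension is not a multiple of the block size.

```python
import numpy as np

def block_coordinate_descent(A, b, block_size, n_sweeps=100):
    """Minimize 0.5 * x^T A x - b^T x by exact minimization over index blocks."""
    d = len(b)
    x = np.zeros(d)
    # Consecutive index blocks; the last one may be shorter
    blocks = [np.arange(s, min(s + block_size, d)) for s in range(0, d, block_size)]
    for _ in range(n_sweeps):
        for B in blocks:
            A_BB = A[np.ix_(B, B)]
            # b_B minus the contribution of coordinates outside the block
            rhs = b[B] - (A[B] @ x - A_BB @ x[B])
            x[B] = np.linalg.solve(A_BB, rhs)
    return x

# Small synthetic problem (illustrative only); 7 is not a multiple of 3
rng = np.random.default_rng(1)
M = rng.standard_normal((30, 7))
A = M.T @ M + np.eye(7)
b = rng.standard_normal(7)

x_bcd = block_coordinate_descent(A, b, block_size=3)
x_direct = np.linalg.solve(A, b)
print("Max abs difference from direct solve:", np.max(np.abs(x_bcd - x_direct)))
```

Solving each block exactly is what distinguishes this from running single-coordinate updates in sequence over the same indices.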

In [None]:
class HighDimensionalOptimizer:
    """
    Optimizer for high-dimensional problems comparing different approaches.
    """
    
    def __init__(self, method='coordinate_descent', block_size=1, max_iter=1000, tol=1e-6, 
                 learning_rate=0.01):
        self.method = method
        self.block_size = block_size
        self.max_iter = max_iter
        self.tol = tol
        self.learning_rate = learning_rate
        self.cost_history_ = []
        self.grad_norms_ = []
        self.coord_updates_ = []  # Track which coordinates are updated
        
    def _sparse_quadratic_objective(self, theta, A, b, lambda_reg=0.01):
        """Sparse quadratic objective: 1/2 * theta^T A theta - b^T theta + lambda * ||theta||_1"""
        quadratic_term = 0.5 * theta.T @ A @ theta
        linear_term = -b.T @ theta
        l1_term = lambda_reg * np.sum(np.abs(theta))
        return quadratic_term + linear_term + l1_term
    
    def _gradient_sparse_quadratic(self, theta, A, b):
        """Gradient of quadratic part (without L1 term)."""
        return A @ theta - b
    
    def _coordinate_descent_update(self, theta, A, b, lambda_reg, j):
        """Single coordinate update for sparse quadratic objective."""
        # Compute partial residual
        r_j = A[j, :] @ theta - A[j, j] * theta[j] - b[j]
        
        # Unregularized minimizer along coordinate j
        z_j = -r_j / A[j, j] if A[j, j] != 0 else 0
        
        # L1 soft thresholding
        if A[j, j] != 0:
            lambda_scaled = lambda_reg / A[j, j]
            theta_new_j = np.sign(z_j) * max(0, abs(z_j) - lambda_scaled)
        else:
            theta_new_j = 0
            
        return theta_new_j
    
    def _create_high_dim_problem(self, n_features, sparsity=0.1, condition_number=100, random_state=42):
        """Create a high-dimensional sparse optimization problem."""
        np.random.seed(random_state)
        
        # Create a positive definite matrix with the requested condition number
        eigenvals = np.logspace(0, np.log10(condition_number), n_features)
        Q = np.random.randn(n_features, n_features)
        Q, _ = np.linalg.qr(Q)  # Orthogonal matrix
        A = Q @ np.diag(eigenvals) @ Q.T
        
        # Create sparse true solution
        theta_true = np.zeros(n_features)
        n_nonzero = int(sparsity * n_features)
        nonzero_idx = np.random.choice(n_features, n_nonzero, replace=False)
        theta_true[nonzero_idx] = np.random.randn(n_nonzero) * 2
        
        # Create b such that theta_true is close to optimum
        b = A @ theta_true + np.random.randn(n_features) * 0.1
        
        return A, b, theta_true
    
    def fit(self, A, b, lambda_reg=0.01, theta_init=None):
        """Fit the high-dimensional optimization problem."""
        n_features = len(b)
        
        if theta_init is None:
            theta = np.zeros(n_features)
        else:
            theta = theta_init.copy()
        
        self.cost_history_ = []
        self.grad_norms_ = []
        self.coord_updates_ = []
        
        for iteration in range(self.max_iter):
            theta_old = theta.copy()
            
            if self.method == 'coordinate_descent':
                # Cyclic coordinate descent
                for j in range(n_features):
                    theta[j] = self._coordinate_descent_update(theta, A, b, lambda_reg, j)
                    
            elif self.method == 'random_coordinate_descent':
                # Random coordinate descent
                j = np.random.randint(n_features)
                theta[j] = self._coordinate_descent_update(theta, A, b, lambda_reg, j)
                self.coord_updates_.append(j)
                
            elif self.method == 'block_coordinate_descent':
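                # Block coordinate descent (coordinates inside a block are updated
                # sequentially, so the visiting order matches cyclic CD)
                # Ceiling division so a trailing partial block is not skipped
                n_blocks = (n_features + self.block_size - 1) // self.block_size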
                for block in range(n_blocks):
                    start_idx = block * self.block_size
                    end_idx = min((block + 1) * self.block_size, n_features)
                    
                    for j in range(start_idx, end_idx):
                        theta[j] = self._coordinate_descent_update(theta, A, b, lambda_reg, j)
                        
            elif self.method == 'gradient_descent':
                # Proximal gradient step (ISTA): gradient step on the smooth quadratic part
                gradient = self._gradient_sparse_quadratic(theta, A, b)
                theta = theta - self.learning_rate * gradient
                
                # Soft thresholding is the proximal operator of the L1 term
                theta = np.sign(theta) * np.maximum(np.abs(theta) - self.learning_rate * lambda_reg, 0)
            
            # Record history
            cost = self._sparse_quadratic_objective(theta, A, b, lambda_reg)
            gradient = self._gradient_sparse_quadratic(theta, A, b)
            grad_norm = np.linalg.norm(gradient)
            
            self.cost_history_.append(cost)
            self.grad_norms_.append(grad_norm)
            
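            # Check convergence
            if self.method == 'random_coordinate_descent':
                # A single random update can leave theta unchanged far from the optimum
                # (e.g. the same index drawn twice in a row), so compare the cost
                # across a window of n_features updates instead
                if (len(self.cost_history_) > n_features and
                        abs(self.cost_history_[-1] - self.cost_history_[-1 - n_features]) < self.tol):
                    break
            elif np.linalg.norm(theta - theta_old) < self.tol:
                break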
        
        self.theta_ = theta
        self.n_iter_ = len(self.cost_history_)
        return self

# Test high-dimensional optimization
dimensions = [100, 500, 1000, 2000]
methods = {
    'Coordinate Descent': 'coordinate_descent',
    'Random CD': 'random_coordinate_descent', 
    'Block CD (10)': 'block_coordinate_descent',
    'Gradient Descent': 'gradient_descent'
}

performance_results = []

for dim in dimensions:
    print(f"\nTesting dimension: {dim}")
    
    # Create problem
    optimizer = HighDimensionalOptimizer()
    A, b, theta_true = optimizer._create_high_dim_problem(dim, sparsity=0.05, condition_number=50)
    
    for method_name, method_key in methods.items():
        start_time = time.time()
        
        if method_key == 'block_coordinate_descent':
            opt = HighDimensionalOptimizer(method=method_key, block_size=10, max_iter=1000)
        elif method_key == 'gradient_descent':
            opt = HighDimensionalOptimizer(method=method_key, learning_rate=0.001, max_iter=1000)
        else:
            opt = HighDimensionalOptimizer(method=method_key, max_iter=1000)
        
        try:
            opt.fit(A, b, lambda_reg=0.01)
            training_time = time.time() - start_time
            
            final_cost = opt.cost_history_[-1]
            final_grad_norm = opt.grad_norms_[-1]
            n_nonzero = np.sum(np.abs(opt.theta_) > 1e-6)
            recovery_error = np.linalg.norm(opt.theta_ - theta_true)
            
            performance_results.append({
                'Dimension': dim,
                'Method': method_name,
                'Final Cost': final_cost,
                'Training Time': training_time,
                'Iterations': opt.n_iter_,
                'Final Grad Norm': final_grad_norm,
                'Non-zero Coeffs': n_nonzero,
                'Recovery Error': recovery_error
            })
            
        except Exception as e:
            print(f"Error with {method_name} at dim {dim}: {e}")
            performance_results.append({
                'Dimension': dim,
                'Method': method_name,
                'Error': str(e)
            })

# Convert to DataFrame for analysis
results_df = pd.DataFrame(performance_results)
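# Keep only runs without an error; the 'Error' column exists only if a run failed
if 'Error' in results_df.columns:
    successful_results = results_df[results_df['Error'].isna()]
else:
    successful_results = results_df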

print("\nHigh-Dimensional Optimization Results:")
for dim in dimensions:
    print(f"\nDimension {dim}:")
    dim_results = successful_results[successful_results['Dimension'] == dim]
    if not dim_results.empty:
        print(dim_results[['Method', 'Final Cost', 'Training Time', 'Iterations', 'Non-zero Coeffs']].round(4))
    else:
        print("No successful results for this dimension")

In [None]:
# Detailed analysis for a specific dimension
analysis_dim = 1000
print(f"\nDetailed Analysis for {analysis_dim} dimensions:")

# Create problem for detailed analysis
optimizer = HighDimensionalOptimizer()
A, b, theta_true = optimizer._create_high_dim_problem(analysis_dim, sparsity=0.05, condition_number=50)

# Run all methods and store detailed results
detailed_optimizers = {}
detailed_results = {}

methods_detailed = {
    'Coordinate Descent': HighDimensionalOptimizer(method='coordinate_descent', max_iter=200),
    'Random CD': HighDimensionalOptimizer(method='random_coordinate_descent', max_iter=200*analysis_dim),
    'Block CD': HighDimensionalOptimizer(method='block_coordinate_descent', block_size=50, max_iter=200),
    'Gradient Descent': HighDimensionalOptimizer(method='gradient_descent', learning_rate=0.0001, max_iter=200)
}

for name, opt in methods_detailed.items():
    start_time = time.time()
    opt.fit(A, b, lambda_reg=0.01)
    training_time = time.time() - start_time
    
    detailed_optimizers[name] = opt
    detailed_results[name] = {
        'training_time': training_time,
        'final_cost': opt.cost_history_[-1],
        'recovery_error': np.linalg.norm(opt.theta_ - theta_true),
        'sparsity': np.sum(np.abs(opt.theta_) > 1e-6)
    }

# Visualize detailed results
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Cost convergence
for name, opt in detailed_optimizers.items():
    if opt.cost_history_:
        axes[0, 0].semilogy(opt.cost_history_, label=name, alpha=0.8, linewidth=2)
axes[0, 0].set_xlabel('Iteration')
axes[0, 0].set_ylabel('Cost (log scale)')
axes[0, 0].set_title(f'Convergence Comparison ({analysis_dim}D)')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Gradient norm convergence
for name, opt in detailed_optimizers.items():
    if opt.grad_norms_:
        axes[0, 1].semilogy(opt.grad_norms_, label=name, alpha=0.8, linewidth=2)
axes[0, 1].set_xlabel('Iteration')
axes[0, 1].set_ylabel('Gradient Norm (log scale)')
axes[0, 1].set_title('Gradient Norm Convergence')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Solution sparsity comparison
true_sparsity = np.sum(np.abs(theta_true) > 1e-6)
method_names = list(detailed_results.keys())
sparsities = [detailed_results[name]['sparsity'] for name in method_names]

bars = axes[0, 2].bar(method_names, sparsities, alpha=0.7)
axes[0, 2].axhline(y=true_sparsity, color='red', linestyle='--', linewidth=2, label=f'True sparsity: {true_sparsity}')
axes[0, 2].set_ylabel('Number of Non-zero Coefficients')
axes[0, 2].set_title('Solution Sparsity')
axes[0, 2].legend()
axes[0, 2].set_xticklabels(method_names, rotation=45)
for bar, val in zip(bars, sparsities):
    axes[0, 2].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 1,
                   f'{val}', ha='center', va='bottom')

# Scalability analysis across dimensions
if not successful_results.empty:
    for method in methods.keys():
        method_data = successful_results[successful_results['Method'] == method]
        if not method_data.empty:
            axes[1, 0].loglog(method_data['Dimension'], method_data['Training Time'], 
                             'o-', label=method, alpha=0.8, linewidth=2, markersize=6)
    
    axes[1, 0].set_xlabel('Dimension')
    axes[1, 0].set_ylabel('Training Time (s)')
    axes[1, 0].set_title('Scalability Analysis')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

# Recovery error comparison
recovery_errors = [detailed_results[name]['recovery_error'] for name in method_names]
bars = axes[1, 1].bar(method_names, recovery_errors, alpha=0.7, color='orange')
axes[1, 1].set_ylabel('Recovery Error ||θ - θ_true||')
axes[1, 1].set_title('Solution Quality')
axes[1, 1].set_xticklabels(method_names, rotation=45)
for bar, val in zip(bars, recovery_errors):
    axes[1, 1].text(bar.get_x() + bar.get_width()/2., bar.get_height() + bar.get_height()*0.05,
                   f'{val:.3f}', ha='center', va='bottom', fontsize=9)

# Coordinate update pattern for Random CD
if 'Random CD' in detailed_optimizers and detailed_optimizers['Random CD'].coord_updates_:
    coord_updates = detailed_optimizers['Random CD'].coord_updates_[:1000]  # First 1000 updates
    axes[1, 2].hist(coord_updates, bins=50, alpha=0.7, density=True)
    axes[1, 2].set_xlabel('Coordinate Index')
    axes[1, 2].set_ylabel('Update Frequency (normalized)')
    axes[1, 2].set_title('Random CD Coordinate Selection Pattern')
    axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print detailed comparison
print(f"\nDetailed Results for {analysis_dim}D problem:")
print(f"True solution sparsity: {true_sparsity}")
print("\nMethod Comparison:")
for name in method_names:
    result = detailed_results[name]
    print(f"{name}:")
    print(f"  Training time: {result['training_time']:.4f}s")
    print(f"  Final cost: {result['final_cost']:.6f}")
    print(f"  Recovery error: {result['recovery_error']:.6f}")
    print(f"  Solution sparsity: {result['sparsity']}")
    print()

## Summary and Key Takeaways

### Optimization Methods in Machine Learning:

1. **Gradient Descent Variants**:
   - **Batch GD**: Stable convergence but slow for large datasets
   - **SGD**: Fast updates but noisy convergence; requires learning rate scheduling
   - **Mini-batch GD**: Good compromise between stability and speed

2. **Advanced First-Order Methods**:
   - **Momentum**: Accelerates convergence and dampens oscillations
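   - **Nesterov**: Evaluates the gradient at the look-ahead point; improves the rate on smooth convex problems from O(1/k) to O(1/k²)
   - **AdaGrad**: Adapts the learning rate per parameter, but the accumulated squared gradients shrink the step size monotonically, so progress can stall early
   - **RMSprop**: Replaces AdaGrad's running sum with an exponential moving average, which keeps the step size from decaying toward zero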
   - **Adam**: Combines momentum and adaptive learning rates; robust default choice

3. **Second-Order Methods**:
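   - **Newton's Method**: Quadratic convergence near the optimum, but each iteration costs O(n³) to solve the Newton system with a dense Hessian
   - **BFGS**: Quasi-Newton method with superlinear convergence; builds a Hessian approximation from gradients at O(n²) cost per iteration
   - **L-BFGS**: Memory-efficient version for large-scale problems; stores only the most recent curvature pairs instead of a full n×n matrix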
   - Trade-off: Faster convergence vs. computational cost per iteration

4. **Constrained Optimization**:
   - **Penalty Methods**: Convert constraints to penalty terms; simple, but the subproblems become ill-conditioned as the penalty weight grows
   - **Barrier Methods**: Use logarithmic barriers; iterates stay feasible, but a strictly feasible starting point is required
   - **Lagrangian Methods**: Provide the theoretical foundation for optimality conditions (KKT) and duality

5. **High-Dimensional Challenges**:
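   - **Curse of Dimensionality**: Per-iteration cost grows with dimension, and in non-convex problems saddle points tend to far outnumber local minima
   - **Coordinate Descent**: Effective for sparse problems and separable objectives, where each one-dimensional update is cheap
   - **Block Coordinate Descent**: Updates groups of coordinates together, trading per-update cost against opportunities for vectorization and parallelism
   - **Random Coordinate Selection**: Can converge faster than a fixed cyclic order on some problems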
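
To make the coordinate descent point above concrete, here is a minimal sketch of cyclic coordinate descent for an L1-regularized least-squares objective. The names `soft_threshold`, `lasso_cd`, `X_demo`, `y_demo`, and `w_true` are illustrative only and are not the classes used earlier in this notebook; the objective scaling (1/(2n)) and the synthetic data are assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for (1/(2n)) * ||y - X w||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    residual = y - X @ w                    # kept in sync with w after every update
    col_sq = (X ** 2).sum(axis=0) / n       # per-coordinate curvature ||x_j||^2 / n
    for _ in range(n_sweeps):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue                    # an all-zero column carries no information
            # Correlation of column j with the residual that excludes w[j]'s contribution
            rho = X[:, j] @ residual / n + col_sq[j] * w[j]
            w_new = soft_threshold(rho, lam) / col_sq[j]
            residual += X[:, j] * (w[j] - w_new)   # O(n) residual update
            w[j] = w_new
    return w

# Synthetic sparse regression problem (assumed for illustration)
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 50))
w_true = np.zeros(50)
w_true[:5] = [3.0, -2.0, 1.5, -1.0, 2.5]
y_demo = X_demo @ w_true + 0.01 * rng.standard_normal(200)

w_hat = lasso_cd(X_demo, y_demo, lam=0.1)
n_nonzero = int(np.count_nonzero(w_hat))
print("non-zero coefficients:", n_nonzero)
print("recovery error:", np.linalg.norm(w_hat - w_true))
```

Each coordinate update costs O(n) because only the residual is touched, which is why the method stays cheap when most coordinates remain exactly zero. The soft-thresholding step is what produces exact zeros, something a plain gradient step on a smooth surrogate would not do.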

### Practical Guidelines:

- **Default Choice**: Adam optimizer for neural networks
- **Large Datasets**: Mini-batch SGD with momentum
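- **Sparse Problems**: Coordinate descent for L1-regularized objectives; L-BFGS when the objective is smooth
- **Small Datasets**: L-BFGS or Newton's method, since full-batch gradients (and small Hessians) are affordable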
- **Constrained Problems**: Penalty methods for simple constraints, specialized solvers for complex ones
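
As a rough illustration of the small-dataset guideline, the sketch below fits an L2-regularized logistic regression with SciPy's L-BFGS-B solver. The helper `logistic_loss_and_grad`, the regularization strength `l2`, and the synthetic data (`X_small`, `y_small`, `w_gen`) are assumptions introduced for this example, not objects defined earlier in the notebook.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def logistic_loss_and_grad(w, X, y, l2=1e-2):
    """Mean logistic loss plus an L2 penalty, for labels y in {0, 1}.

    Returns (loss, gradient) so it can be passed to minimize with jac=True.
    """
    z = X @ w
    # log(1 + exp(z)) - y*z is the negative log-likelihood, computed stably
    loss = np.mean(np.logaddexp(0.0, z) - y * z) + 0.5 * l2 * (w @ w)
    grad = X.T @ (expit(z) - y) / len(y) + l2 * w
    return loss, grad

# Small synthetic classification problem (assumed for illustration)
rng = np.random.default_rng(0)
X_small = rng.standard_normal((300, 5))
w_gen = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
y_small = (rng.random(300) < expit(X_small @ w_gen)).astype(float)

result = minimize(
    logistic_loss_and_grad,
    x0=np.zeros(5),
    args=(X_small, y_small),
    jac=True,               # the objective returns (value, gradient)
    method="L-BFGS-B",
)
print("converged:", result.success, "| iterations:", result.nit)
print("gradient norm at solution:", np.linalg.norm(result.jac))
```

Supplying the analytic gradient through `jac=True` avoids finite-difference gradient estimates, which matters more as the number of parameters grows.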

### Performance Insights:
- Second-order methods converge in fewer iterations but cost more per iteration
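- Adaptive methods (Adam, RMSprop) are less sensitive to the initial learning rate than plain SGD, though they still benefit from tuning
- Coordinate descent scales better with dimension than full-gradient methods on sparse problems, since each update touches a single coordinate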
- Proper learning rate scheduling is crucial for SGD variants
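
The trade-off between iteration count and per-iteration cost can be seen on a small ill-conditioned quadratic. The following is a sketch under assumed values: the matrix `A`, vector `b`, tolerance, and helper names `run_gd` and `run_newton` are all chosen for illustration.

```python
import numpy as np

# f(x) = 0.5 * x^T A x - b^T x with A symmetric positive definite (condition number 100)
A = np.diag([1.0, 10.0, 100.0])
b = np.array([1.0, 1.0, 1.0])
x_star = np.linalg.solve(A, b)      # exact minimizer, for reference

def grad(x):
    return A @ x - b

def run_gd(x0, step, tol=1e-8, max_iter=100_000):
    """Fixed-step gradient descent; returns (x, iterations used)."""
    x = x0.copy()
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, k
        x = x - step * g
    return x, max_iter

def run_newton(x0, tol=1e-8, max_iter=50):
    """Newton's method; the Hessian of this quadratic is the constant matrix A."""
    x = x0.copy()
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, k
        x = x - np.linalg.solve(A, g)   # a dense linear solve is O(n^3) in general
    return x, max_iter

x0 = np.zeros(3)
x_gd, gd_iters = run_gd(x0, step=1.0 / 100.0)   # step 1/L, with L the largest eigenvalue
x_nt, nt_iters = run_newton(x0)
print("gradient descent iterations:", gd_iters)
print("Newton iterations:", nt_iters)
```

On a quadratic, Newton's method is a best case because the second-order model is exact; on general objectives the gap in iteration counts is usually smaller, and the cost of each linear solve dominates as the dimension grows.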