# Homework 7: Regression, Part 2

In [None]:
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from typing import Callable, List, Tuple
from helper import load_diabetes, print_table, print_stats_table, mse, pearson_correlation, standardize

We will load the same diabetes dataset as last week. The provided file `helper.py` contains the functions we implemented last week.

In [None]:
data, target, feature_labels = load_diabetes()
data = standardize(data)
p_corrs = pearson_correlation(data, target)

## Regression with Gradient Descent (GD) 

Previously, we used closed-form solutions to approximate our data-generating function, mapping data $ X $ to targets $ y $. However, these methods have limitations:
- They are **not universally applicable**: closed-form solutions do not exist for all $ p $-norms used in regularization, and in real-world settings they are usually not available.
- **Computing matrix inverses becomes increasingly expensive** as dataset sizes grow.

To overcome these challenges, we turn to Gradient Descent (GD), a fundamental optimization algorithm in machine learning and deep learning. Unlike closed-form solutions, GD **iteratively minimizes the objective function**. By adjusting model parameters in the direction opposite the gradient of the cost function, GD refines models to converge towards optimal parameter values. This iterative approach ensures models minimize prediction errors while respecting regularization constraints.

<center>
    <a href="https://www.analyticsvidhya.com/blog/2020/10/how-does-the-gradient-descent-algorithm-work-in-machine-learning/">
        <img src="GD_quick_guide.png" alt="GD image" style="width:30%;">
    </a>
</center>



For this exercise, we focus on Ridge Regression, i.e., linear regression with $ L2 $-norm regularization.

Recall, we aim to find parameters $ \theta $ and $ \theta_0 $ that minimize the objective function $ J(X, Y, \theta, \theta_0) $:

$$
\theta^*, \theta_0^* = \underset{\theta, \theta_0}{\arg\min}\ J(X, Y, \theta, \theta_0) = \underset{\theta, \theta_0}{\arg\min} \ \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2  + \lambda \|[\theta,\theta_0]\|_2^2
$$

**Gradient Descent Algorithm:**

1. Initialize $ \theta^{(1)} $ randomly or with zeros.
2. **for** $ \text{t} = 1 $ **to** $ \text{num\_iterations} $ **do**:
   - Compute predictions: $ \hat{Y} = X \theta^{(t)} $.
   - Compute gradient w.r.t. parameters:
     $ \nabla_{\theta^{(t)}} = -\frac{2}{n} X^T (Y - \hat{Y}) + 2 \lambda \theta^{(t)} $.
   - Update parameters:
     $ \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_{\theta^{(t)}} $.

**Where:**
- $ X $ and $ Y $ are the training examples and target values, respectively.
- $ X $ is an $ n \times (m+1) $ matrix, and $ \theta $ is an $ (m+1) $-dimensional vector (including bias).
- $ \alpha $ is the learning rate controlling the step size in each iteration.
- $ \lambda $ is the regularization parameter that controls the strength of regularization.
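The gradient used in the algorithm follows directly from the objective. As a short derivation, assuming the bias is folded into $ \theta $ (so that $ \hat{Y} = X\theta $ with a column of ones in $ X $) and writing the objective in matrix form:

$$
\begin{aligned}
J(\theta) &= \frac{1}{n}(Y - X\theta)^T (Y - X\theta) + \lambda\, \theta^T \theta \\
\nabla_{\theta} J &= \frac{1}{n}\left(-2 X^T Y + 2 X^T X \theta\right) + 2 \lambda \theta \\
&= -\frac{2}{n} X^T (Y - X\theta) + 2 \lambda \theta
\end{aligned}
$$

The first term pulls $ \theta $ towards a better fit of the data, while the second term shrinks $ \theta $ towards zero with a strength proportional to $ \lambda $.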

Let us start by implementing the ridge objective function (`ridge_objective` function). 

**Hint:** Make use of the previously implemented `mse` function and the method `np.linalg.norm`.

In [None]:

### Please enter your solution here ###



Next, implement the `compute_ridge_gradient` function. It takes the input features $X$, the targets $y$, the weights $\theta$ (including the bias term), and the regularization hyperparameter $\lambda$.

It should compute and return the gradient $\nabla_\theta$ efficiently, using vectorized NumPy operations instead of Python loops.
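A common way to test a hand-written gradient is to compare it against a finite-difference approximation. The following is a minimal sketch on random toy data, not the reference solution: the names `toy_objective`, `toy_gradient`, and `numerical_gradient` are hypothetical and only serve this illustration.

```python
import numpy as np

def toy_objective(X, y, theta, lam):
    """Ridge objective: MSE plus lam times the squared L2 norm of theta."""
    residual = y - X @ theta
    return np.mean(residual ** 2) + lam * np.sum(theta ** 2)

def toy_gradient(X, y, theta, lam):
    """Analytic gradient of toy_objective with respect to theta."""
    n = X.shape[0]
    return -2.0 / n * X.T @ (y - X @ theta) + 2.0 * lam * theta

def numerical_gradient(f, theta, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# Toy data: 20 samples, 3 features plus a column of ones for the bias.
rng = np.random.default_rng(0)
X_toy = np.hstack([rng.normal(size=(20, 3)), np.ones((20, 1))])
y_toy = rng.normal(size=20)
theta_toy = rng.normal(size=4)
lam_toy = 0.2

analytic = toy_gradient(X_toy, y_toy, theta_toy, lam_toy)
numeric = numerical_gradient(
    lambda t: toy_objective(X_toy, y_toy, t, lam_toy), theta_toy
)
max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)  # should be very close to zero
```

If the two gradients disagree noticeably, the usual suspects are a missing factor of $2$, a wrong sign, or a forgotten $\frac{1}{n}$.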

In [None]:

### Please enter your solution here ###




Next, implement the `gradient_descent` function to perform Gradient Descent for Ridge Regression. Add type hints, a NumPy-style docstring, and some input checks to ensure the function is used correctly.
- Inputs:
    - features $X$
    - targets $y$
    - the learning rate $\alpha$ 
    - the number of iterations we want to run GD 
    - the regularization hyperparameter $\lambda$
    - a flag to indicate whether to initialize the weights randomly or with zeros
    - seed for reproducibility
- Outputs:
    - the optimized parameters $\theta$ 
    - the history of the objective function. We will use it to observe the learning curves for different $\alpha$. 

**Note**: When the weights are initialized randomly, it is good practice to set a random seed for reproducibility; see [`np.random.seed`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html).

**Hint**: Make use of the previously implemented `ridge_objective` and  `compute_ridge_gradient` functions. 

In [None]:

### Please enter your solution here ###




**Note**: If you implemented the function rigorously, a considerable part of the function `gradient_descent` is function description and input checks. This is a good practice to ensure that the function is used correctly.

Test your code with the following settings.
- $1000$ iterations of GD
- $\alpha = 0.01$
- $\lambda = 0.2$
- `zero` as the weights initialization mode. 

**Note**: If you did everything correctly, you should get similar results, i.e., feature coefficients and MSE error, to the closed-form solutions. 
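To illustrate why GD and a closed-form solution should agree, here is a small sketch on synthetic toy data. It assumes the objective exactly as defined above (MSE plus $\lambda \|\theta\|^2$ with the bias included in $\theta$), for which setting the gradient to zero gives $(X^T X + n \lambda I)\,\theta = X^T Y$. The scaling of $\lambda$ in last week's closed-form solution may differ, and all variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, alpha = 50, 0.2, 0.1

# Toy design matrix: 3 random features plus a column of ones for the bias.
X_toy = np.hstack([rng.normal(size=(n, 3)), np.ones((n, 1))])
y_toy = X_toy @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=n)

# Plain full-batch gradient descent, starting from zeros.
theta_gd = np.zeros(X_toy.shape[1])
for _ in range(1000):
    grad = -2.0 / n * X_toy.T @ (y_toy - X_toy @ theta_gd) + 2.0 * lam * theta_gd
    theta_gd = theta_gd - alpha * grad

# Closed-form minimizer of the same objective.
theta_closed = np.linalg.solve(
    X_toy.T @ X_toy + n * lam * np.eye(X_toy.shape[1]), X_toy.T @ y_toy
)

max_abs_diff = np.max(np.abs(theta_gd - theta_closed))
print(max_abs_diff)  # should be close to zero once GD has converged
```

On the real dataset the agreement depends on the learning rate and the number of iterations, so expect similar rather than identical values.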

In [None]:

### Please enter your solution here ###



To verify that your solution is correct, here are the results we get when running the notebook from a fresh kernel. Full-batch gradient descent with zero initialization is deterministic, so your numbers should match closely; small deviations can come from implementation details.

`MSE: 3562.265...`

`AGE Weight/P-Cor: 0.368/0.188`

`SEX Weight/P-Cor: -8.702/0.043`

`BMI Weight/P-Cor: 21.746/0.586`

For each weight initialization method (`random` or `zero`), plot the error histories, i.e., learning curves, for several learning rates $\alpha \in \{0.00001, 0.0001, 0.001, 0.01, 0.1\}$.

Make sure that the alpha values have the same coloring in both plots. 

Remember to set $\lambda = 0.2$. 

Play around with the number of iterations of GD. What do you observe?

**Hint**: You can use `plt.cm.viridis` to build the colormap object.

In [None]:

### Please enter your solution here ###



If you did everything correctly, you should notice several things:
- The initialization modes appear to behave similarly. 
- The smaller the learning rate, the slower the convergence.
- The two smallest learning rates have not converged after 2000 iterations.

## Model Selection: How do we select the right model and hyperparameters?

So far, we fixed the hyperparameter $\lambda$ to $0.2$. Let's observe the behaviour of the objective function for increasing regularization. 

### Plotting $\lambda$

Write a function that plots $\lambda$ on the horizontal axis and Ridge loss on the vertical axis. Call this function `plot_lambda`. This function should have the following parameters and return values:

Parameters:
- `func` -- A function that takes a float and returns a float
- `x` -- The values to be plotted on the horizontal axis and used as input into `func`

Return: `None`

To plot $\lambda$ you will need to create a function that finds the parameters of our model using GD, finds Ridge loss using these parameters, and then outputs the Ridge loss. 

Create a function called `ridge_objective_after_GD` that calculates the Ridge loss after a specified number of iterations of GD.

Parameters:
- `data` -- The data as a numpy array.
- `target` -- The targets as a numpy array
- `lam` -- The lambda value used in the objective function
- `alpha` -- The learning rate
- `num_iterations` -- Number of iterations for which to run GD.
- `init_mode` -- A string that specifies whether we are initializing our parameters to zero or to random values.

Encapsulate the above function in a `lambda` function (no relation to $\lambda$) that specifies all parameters that we want to pass to `ridge_objective_after_GD` aside from `lam`. Specifically, set `alpha` to 0.01, `num_iterations` to 100, and `init_mode` to 'random'.  We will pass this function to `plot_lambda`. 

Plot for the $\lambda$ values `np.logspace(-3, 1, num=25, base=10)`.

**Hint**: Make use of the previously implemented functions. 

In [None]:

### Please enter your solution here ###



The above plot should show a very low error for $\lambda$ less than $10^{-1}$, after which the error increases drastically. It should look a bit like the left side of a normal (Gaussian) distribution, although I can assure you that it is not.

It would seem that adding regularization doesn't help decrease the error rate, but we have made a fundamental error in our analysis: We are testing our model with our training data!

If we test our model with our training data, we are essentially measuring how well our method is able to remember the training data. In practical applications, this is almost never what we want! We almost always want our model to **generalize** to other data drawn from the same distribution as our training data.

To properly test our model, we need to hold out some data to test with. We can simply split our dataset into two parts, or we can do something a bit more sophisticated -- we can split it into several equally sized parts, called **folds**. We'll then train multiple models, each validated on one part of our split and trained on the rest. This principle is called **cross-validation (CV)**. Find more information [here](https://scikit-learn.org/stable/modules/cross_validation.html).

*However, it is important to note that our final model validation needs to be done on a test set that has not been seen during model training.* 

<center>
    <a href="https://scikit-learn.org/stable/modules/cross_validation.html">
        <img src="grid_search_cross_validation.png" alt="5-Fold-CV" style="width:50%;">
    </a>
</center>
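The fold construction described above can be sketched in a few lines of NumPy. This is only an illustration of how the indices are split, assuming a hypothetical helper named `kfold_indices`; it does not train any model and is not the required `ridge_reg_cross_validation_nfolds` function.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, shuffle=True, rnd_seed=42):
    """Yield (train_idx, val_idx) pairs, one pair per fold."""
    indices = np.arange(n_samples)
    if shuffle:
        rng = np.random.default_rng(rnd_seed)
        indices = rng.permutation(indices)
    # array_split also handles sample counts that are not divisible by n_folds.
    folds = np.array_split(indices, n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=10, n_folds=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))  # 8 training and 2 validation indices per fold
```

Every sample lands in the validation part exactly once, so each fold's model is evaluated on data it has not been trained on.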

**Remark about data standardization and normalization**. 

Previously, we standardized the features. When splitting the dataset, we need to "learn" the standardization and normalization parameters from the training set and apply them to the test set. This is because we must not use any information from the test set for data preprocessing.
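As a minimal sketch of this idea on synthetic toy data (the arrays and variable names below are made up for illustration), the mean and standard deviation are computed on the training split only and then reused for the test split:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_test_toy = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# "Learn" the standardization parameters from the training data only.
mean_train = X_train_toy.mean(axis=0)
std_train = X_train_toy.std(axis=0)

# Apply the same parameters to both splits.
X_train_std = (X_train_toy - mean_train) / std_train
X_test_std = (X_test_toy - mean_train) / std_train

print(X_train_std.mean(axis=0))  # zero up to floating-point error
print(X_test_std.mean(axis=0))   # close to, but not exactly, zero
```

The standardized test features do not have exactly zero mean and unit variance, and that is intended. With scikit-learn, the same pattern corresponds to calling `fit` on the training data and `transform` on both splits.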

First, create a function `train_test_split` that splits passed data into a train-test split (80% vs. 20%).
- Inputs:
    - `X` -- Data points as a numpy array
    - `y` -- Target values as a numpy array 
    - `test_size` -- Ratio of samples that shall belong to the test dataset. Must be a value in $[0,1]$. Default is 0.2.
    - `shuffle` -- Boolean that determines if data is shuffled before splitting. Default is True.
    - `rnd_seed` -- Sets the random seed for the shuffling process. Default is 42.
- Outputs:
    - `X_train` -- Data points belonging to the train dataset as a numpy array
    - `X_test` -- Data points belonging to the test dataset as a numpy array
    - `y_train` -- Train target values as a numpy array
    - `y_test` -- Test target values as a numpy array


**Note:** You should **not** use the function `sklearn.model_selection.train_test_split`. 

In [None]:

### Please enter your solution here ###



Reload the dataset and split into train and test. Use the default values for all parameters. 

In [None]:

### Please enter your solution here ###



Now, we want to implement $10$-fold cross-validation to find the best parameter $\lambda$. 

Write a function `ridge_reg_cross_validation_nfolds` specified as follows:
- Inputs:
    - `X` -- Train data points as a numpy array
    - `y` -- Train target values as a numpy array 
    - `nfolds` -- Number of splits we want to use for cross-validation.
    - `lambdas` -- List of lambda values to test
    - `shuffle` -- Boolean that determines if data is shuffled before splitting. Default is True.
    - `rnd_seed` -- Sets the random seed for the shuffling process. Default is 42.
    - Other parameters used for gradient descent.
- Outputs:
    - `avg_errors` -- List of average errors for each lambda value.
    - `best_lambda` -- The best lambda value. 
    - `best_error` -- The error for the best lambda value.

**Note**: You should **not** use pre-implemented CV functions, i.e., those listed [here](https://scikit-learn.org/stable/api/sklearn.model_selection.html)

**Hints**: 
- Inspect the picture above. 
- Given a single $\lambda$ value, you need to train $k$ different models, where $k$ is the number of folds in our cross-validation. To evaluate the performance of models with a given $\lambda$, use the average performance across the $k$ folds. 
- Don't forget to standardize the data. For simplicity's sake, we are using `StandardScaler` from sklearn. 

In [None]:
from sklearn.preprocessing import StandardScaler


### Please enter your solution here ###



Now, let's test the cross-validation method with the following settings.
- $10$-fold cross-validation
- $\alpha = 0.01$
- $1000$ iterations of GD
- $\lambda \in [10^{-3}, 10^{1}]$ with $25$ values
- `random` as the weights initialization mode. 
- Random seed $42$

What is the best $\lambda$ and the corresponding error?

In [None]:

### Please enter your solution here ###



If you did everything correctly, you should get the best $\lambda$ value of $0.001$ with an error of around $2977.073$.

Plot the average errors for each lambda value.
- The x-axis should be the lambda values.
- The y-axis should be the average error values.
- Highlight the best lambda value.

In [None]:

### Please enter your solution here ###



Let us now retrain a model with the best $\lambda$ on the entire train dataset. 
Provide the MSE error for train and test set. 
What do you observe? Are the errors similar?

**Hint**: Do not forget to standardize the data.

In [None]:

### Please enter your solution here ###



As a sanity check, here are the results obtained after starting a fresh kernel and running the entire notebook:

`Train set MSE: 2768.360...`

`Test set MSE: 3562.265...`

`AGE Weight/P-Cor: 0.284/0.188`

`SEX Weight/P-Cor: -11.163/0.043`

`BMI Weight/P-Cor: 24.588/0.586`

## Ridge Regression with Mini-batch Gradient Descent in PyTorch
In this homework, so far, we have seen how to implement Ridge Regression with Gradient Descent in NumPy. Now we take a small detour to understand why (full) gradient descent is not always the best choice. 
In real-world scenarios, we often have large datasets that do not fit into memory. In such cases, we need to use a variant of gradient descent called **Mini-Batch Gradient Descent**, which uses only a subset of the training data (a mini-batch) to compute the gradient at each step. This is computationally cheaper per step and can be parallelized. However, it comes at a cost: each mini-batch gradient is only a noisy estimate of the full gradient, so individual updates are less accurate than those of full gradient descent.

The following figure illustrates the difference between full gradient descent and mini-batch gradient descent.

<center>
    <a href="https://www.analyticsvidhya.com/blog/2022/07/gradient-descent-and-its-types/">
        <img src="gd_versions.png" alt="GD vs. MBGD" style="width:50%;">
    </a>
</center>
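To make the difference concrete, the following NumPy sketch compares the full-batch MSE gradient with the gradients of four mini-batches on synthetic toy data. The data and the helper `mse_gradient` are assumptions made for this illustration; regularization is left out to keep it short.

```python
import numpy as np

def mse_gradient(X, y, theta):
    """Gradient of the mean squared error with respect to theta."""
    n = X.shape[0]
    return -2.0 / n * X.T @ (y - X @ theta)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(64, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=64)
theta_toy = np.zeros(3)

# Gradient computed on all 64 samples at once.
full_grad = mse_gradient(X_toy, y_toy, theta_toy)

# Gradients computed on 4 shuffled mini-batches of 16 samples each.
batch_size = 16
perm = rng.permutation(64)
batch_grads = np.array([
    mse_gradient(X_toy[perm[i:i + batch_size]], y_toy[perm[i:i + batch_size]], theta_toy)
    for i in range(0, 64, batch_size)
])

# Each mini-batch gradient deviates from the full gradient ...
print(np.linalg.norm(batch_grads - full_grad, axis=1))
# ... but for equally sized batches their average recovers it.
mean_of_batches = batch_grads.mean(axis=0)
print(np.allclose(mean_of_batches, full_grad))
```

Each mini-batch step therefore moves in a slightly "wrong" direction, but the steps point in the right direction on average.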

Let's make things more interesting and use PyTorch (take a hike NumPy!) to implement Ridge Regression with Mini-batch Gradient Descent.

Let's start by implementing the Ridge Regression model in PyTorch.
 
Create a class `RidgeRegression` that inherits from `torch.nn.Module`.
- The class should have the following methods:
    - `__init__`: The constructor that takes the number of features as input and initializes a [linear layer](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) with one output dimension. Note that we want to learn the bias term as well.
    - `forward`: The forward method that takes the input data and returns the predicted values.

In [None]:

### Please enter your solution here ###



Just as we previously did with NumPy, define the Ridge objective function in PyTorch.


- Inputs:
    - `y_pred` -- Predicted values as a tensor
    - `y_true` -- True values as a tensor
    - `theta` -- Weights as a tensor
    - `theta_0` -- Bias term as tensor
    - `lam` -- Regularization parameter as a float
- Output:
    - `loss` -- Ridge loss as a tensor

For simplicity, you can use the PyTorch method  `torch.nn.functional.mse_loss` to compute the MSE loss.

In [None]:

### Please enter your solution here ###



PyTorch provides a convenient way to automatically split the data into mini-batches using `torch.utils.data.DataLoader`. However, to be able to use this class, we need to wrap our data in a PyTorch `Dataset`, e.g., `torch.utils.data.TensorDataset`. [This tutorial](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) shows in detail how to use these classes.

Your next task is to:
1. Create a train and test dataset using the `TensorDataset` class.
2. Create a train and test data loader using the `DataLoader` class.
   - Use mini-batches of 16 samples. 
   - How do you fix the (mini-)batch size?
   - How do you shuffle the data? Which data should be shuffled?

**Hint**: 
- You can use the `torch.tensor` function to convert NumPy arrays to PyTorch tensors.
- It might be useful to reshape the target arrays to have the shape `(n_samples, 1)`.

In [None]:

### Please enter your solution here ###



Before we start training the model, we need to initialize the model and define the optimizer. PyTorch provides a wide range of optimizers, see [here](https://pytorch.org/docs/stable/optim.html). For this task, we will use the stochastic gradient descent optimizer `torch.optim.SGD`. Note that we will do mini-batch gradient descent by using the batches provided by the data loader and the optimizer.

Initialize the model and the optimizer. Fix, for the moment, the learning rate to $0.01$ and the regularization parameter to $0.001$.

In [None]:

### Please enter your solution here ###



In comparison to the gradient descent in NumPy, one epoch in PyTorch corresponds to one pass through the entire dataset, which amounts to `len(train_loader)` gradient updates, i.e., `ceil(data_train.shape[0] / batch_size)` with the data loader's default `drop_last=False`. Thus, the total number of iterations in PyTorch is `num_epochs * len(train_loader)`. In the NumPy implementation, by contrast, every iteration computed the gradient on the entire dataset.
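As a quick check of how many gradient updates one epoch performs, the sketch below builds a data loader over random toy tensors (100 samples, batch size 16; all names are hypothetical) and counts the batches. With the default `drop_last=False`, the last batch is simply smaller.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy tensors: 100 samples with 3 features and one target each.
X_toy = torch.randn(100, 3)
y_toy = torch.randn(100, 1)
toy_loader = DataLoader(TensorDataset(X_toy, y_toy), batch_size=16, shuffle=True)

n_batches = len(toy_loader)  # number of gradient updates per epoch
batch_sizes = [xb.shape[0] for xb, _ in toy_loader]

print(n_batches, math.ceil(100 / 16))
print(batch_sizes)  # six batches of 16 samples and one final batch of 4
```

Training for `num_epochs` epochs on this toy loader would therefore perform `num_epochs * 7` parameter updates.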

Finally, create a function `ridge_reg_training_loop` that trains the model for a given number of epochs. The function should: 
- Take the model, optimizer, train loader, number of epochs, and regularization parameter as input (the learning rate is already set in the optimizer).
- For each epoch, iterate over the mini-batches, compute the loss, and perform the backward pass.
- The loss should be printed for each epoch.
- Return the trained model.

**Hints**:
- The loss should be computed using the `ridge_objective` function.
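- The backward pass (`loss.backward()`) computes the gradients, and `optimizer.step()` then updates the weights. Zero the gradients with `optimizer.zero_grad()` before each backward pass, since PyTorch accumulates gradients otherwise.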

(If you are completely lost, have a look at this introductory [tutorial](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html), in particular the section "Optimizing the Model Parameters".) 
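The order of the optimizer calls can be seen in isolation in the following sketch. It deliberately uses a different, simpler setting than the task: a one-dimensional toy problem, plain MSE without regularization, and full-batch updates. All names (`toy_model`, `toy_optimizer`, ...) are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Noise-free toy data generated from y = 2x + 1.
x_toy = torch.linspace(-1, 1, 32).reshape(-1, 1)
y_toy = 2.0 * x_toy + 1.0

toy_model = nn.Linear(1, 1)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

losses = []
for step in range(200):
    toy_optimizer.zero_grad()  # reset gradients accumulated in the previous step
    loss = nn.functional.mse_loss(toy_model(x_toy), y_toy)
    loss.backward()            # compute gradients of the loss w.r.t. the parameters
    toy_optimizer.step()       # update the parameters using those gradients
    losses.append(loss.item())

print(losses[0], losses[-1])  # the loss should shrink towards zero
```

In the actual training loop, the same three calls happen once per mini-batch inside the loop over epochs, with the Ridge loss in place of the plain MSE.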


In [None]:

### Please enter your solution here ###



Now, train the model for $10$ epochs.

In [None]:
model = ridge_reg_training_loop(model, optimizer, train_loader, num_epochs=10, lam=lambda_)

Finally, evaluate the model on the train and test set. Do we observe similar errors as in the NumPy implementation?

In [None]:

### Please enter your solution here ###



You should observe that the test error is higher than the train error. This is expected as the model has seen the training data but not the test data. Do you see similar results as with our NumPy implementation?

Specifically, after running the entire notebook with a fresh kernel you should see:

`Train set MSE: 2793.823497791269`

`Test set MSE: 3320.599889138388`