## Part III: Ridgeless and double descent

### Objective


So far in this course, we have used the U-shaped bias-variance trade-off curve as a key tool for model selection, for example in ridge/lasso regression, tree pruning, and smoothing splines.

A key observation is that when a model interpolates the training data, so that the residual sum of squares (RSS) equals zero, this is typically a red flag for overfitting. Such models are expected to perform poorly on new, unseen data.
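As a small illustration of interpolation (a sketch on synthetic data, not part of the assignment): once the number of regression parameters reaches the number of training points, least squares can fit even pure noise exactly. The function name `interpolation_rss` and the random data are assumptions made for this example.

```python
import numpy as np

def interpolation_rss(n: int, p: int, seed: int = 0) -> float:
    """Training RSS of least squares on random data with n rows and p features."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)            # pure noise: there is nothing to learn
    beta = np.linalg.pinv(X) @ y      # least squares; minimum-norm solution when p >= n
    return float(np.sum((y - X @ beta) ** 2))

rss_small = interpolation_rss(n=20, p=5)    # p < n: RSS stays positive
rss_large = interpolation_rss(n=20, p=30)   # p > n: the fit passes through every point
print(rss_small, rss_large)
```

A zero training RSS here says nothing about how the fitted model behaves on new data, which is the question the rest of this part explores.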

> However, in modern practice, very rich models such as neural networks are trained to exactly fit (i.e., interpolate) the data. Classically, such models would be considered overfitted, and yet they often obtain high accuracy on test data. This apparent contradiction has raised questions about the mathematical foundations of machine learning and their relevance to practitioners. ([Belkin et al. 2019](https://liangfgithub.github.io/Coding/DoubleDescent_PNAS_2019.pdf))

In this assignment, we will use ridgeless least squares to illustrate the double descent phenomenon. Our setup is similar to, but not the same as, [Section 8 in Hastie (2020)](https://liangfgithub.github.io/Coding/Ridge_Hastie_2020.pdf).

### Data

Remember the dataset used in Coding 2 Part I? It consisted of 506 rows (i.e., n = 506) and 14 columns: *Y*, *X1* through *X13*.

Based on this dataset, we have formed <u>Coding3_dataH.csv</u>, which is structured as follows:

- It contains 506 rows, corresponding to *n* = 506.
- There are 241 columns in total. The first column is *Y*. The remaining 240 columns are natural cubic spline (NCS) basis functions for the 13 X variables, with the number of knots determined individually for each feature.

### Task 1: Ridgeless function

Ridgeless least squares is equivalent to principal component regression (PCR) when all principal components are used. For our simulation study, we’ll use the PCR version with the **scale = FALSE** option, meaning that each column of the design matrix is centered using the training data but not scaled.
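For reference, the equivalence can be written out as follows. This is a sketch; the symbols $\tilde{X}$, $U$, $D$, $V$, $Z$, and $r$ are notation introduced here, not taken from the assignment. Let $\tilde{X}$ be the column-centered training design matrix with thin SVD $\tilde{X} = U D V^\top$, keeping only the $r$ nonzero singular values $d_1 \ge \dots \ge d_r > 0$. Ridgeless least squares is the limit of ridge regression as the penalty vanishes:

$$
\hat{\beta}_{\text{ridgeless}}
= \lim_{\lambda \to 0^+} \left(\tilde{X}^\top \tilde{X} + \lambda I\right)^{-1} \tilde{X}^\top y
= \tilde{X}^{+} y
= V D^{-1} U^\top y .
$$

PCR with all $r$ components regresses $y$ on the scores $Z = \tilde{X} V = U D$. Because $Z^\top Z = D^2$ is diagonal, the coefficients need no matrix inversion:

$$
\hat{\gamma}_j = \frac{z_j^\top y}{\lVert z_j \rVert^2}, \quad j = 1, \dots, r,
\qquad
\hat{\gamma} = D^{-2} Z^\top y = D^{-1} U^\top y,
\qquad
V \hat{\gamma} = \hat{\beta}_{\text{ridgeless}} .
$$

Since the columns of $Z$ are centered, the intercept is simply $\bar{y}$, the mean of the training responses.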

Your task is to write a function that accepts training and test datasets and returns the training and test errors of the ridgeless estimator. For both datasets, the initial column represents the response vector *Y*.

- You can use R/Python packages or built-in functions for PCA/SVD, but you are not allowed to use packages or functions tailored for linear regression, PCR, or ridge regression.

- Post PCA/SVD, you’ll notice that the updated design matrix comprises orthogonal columns. This allows for the calculation of least squares coefficients through simple matrix multiplication, eliminating the need for matrix inversion.

- For computational stability, you need to exclude directions with extremely small eigenvalues (in PCA) or singular values (in SVD). As a reference, consider setting **eps = 1e-10** as the threshold for singular values.

- Although training errors aren’t a requisite for our simulation, I recommend including them in the ridgeless output. This serves as a useful debugging tool. Ideally, your training error should align with the RSS derived from a standard linear regression model.
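The steps in the list above can be sketched with a plain NumPy SVD instead of scikit-learn's PCA. This is a minimal sketch under stated assumptions, not the reference solution: the function name `ridgeless_svd_sketch`, its `eps` argument, and the synthetic data in the usage lines are hypothetical, and it returns raw (not log) mean squared errors for both the training and test sets.

```python
import numpy as np

def ridgeless_svd_sketch(train: np.ndarray, test: np.ndarray, eps: float = 1e-10):
    """Return (train_mse, test_mse); column 0 of each array is the response Y."""
    y_train, X_train = train[:, 0], train[:, 1:]
    y_test, X_test = test[:, 0], test[:, 1:]

    # Center with the training column means only (no scaling).
    col_means = X_train.mean(axis=0)
    Xc_train = X_train - col_means
    Xc_test = X_test - col_means

    # Thin SVD: Xc_train = U @ diag(s) @ Vt.
    U, s, Vt = np.linalg.svd(Xc_train, full_matrices=False)
    keep = s > eps                      # drop numerically null directions
    V = Vt[keep].T

    # Scores Z = Xc_train @ V have orthogonal columns, so each coefficient
    # is a simple ratio and no matrix inversion is needed.
    Z_train = Xc_train @ V
    gamma = (Z_train.T @ y_train) / np.sum(Z_train ** 2, axis=0)
    intercept = y_train.mean()

    train_pred = Z_train @ gamma + intercept
    test_pred = (Xc_test @ V) @ gamma + intercept
    return (float(np.mean((y_train - train_pred) ** 2)),
            float(np.mean((y_test - test_pred) ** 2)))

# Usage on synthetic data (hypothetical): 40 rows, 1 response + 6 features.
rng = np.random.default_rng(0)
data = rng.normal(size=(40, 7))
train_mse, test_mse = ridgeless_svd_sketch(data[:30], data[30:])

# Debugging check: the training error should match ordinary least squares
# with an intercept (np.linalg.lstsq is used here only for checking).
A = np.column_stack([np.ones(30), data[:30, 1:]])
beta, *_ = np.linalg.lstsq(A, data[:30, 0], rcond=None)
ols_train_mse = float(np.mean((data[:30, 0] - A @ beta) ** 2))
print(train_mse, ols_train_mse)
```

When there are more features than training rows, the same function should return a training error that is numerically zero, which is the interpolation regime studied in Task 2.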

In [1]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

import pandas as pd
import numpy as np
import plotly.graph_objects as go

In [2]:
def ridgeless_function(training_data: np.ndarray, testing_data: np.ndarray) -> float:
    """Fit ridgeless least squares via PCA and return the log of the test MSE."""
    X_train = training_data[:, 1:]
    Y_train = training_data[:, 0]
    X_test = testing_data[:, 1:]
    Y_test = testing_data[:, 0]

    # Center each column with the training means; with_std=False means no scaling.
    scaler = StandardScaler(with_mean=True, with_std=False)
    pca = PCA()

    pipeline = Pipeline([('scaling', scaler), ('pca', pca)])
    pipeline.fit(X_train)

    # Keep only directions whose singular value exceeds the stability threshold.
    keep = pca.singular_values_ > 1e-10

    # Principal component scores of the training data; their columns are orthogonal.
    Z_train = pipeline.transform(X_train)[:, keep]
    # Orthogonal columns: each coefficient is a ratio, no matrix inversion needed.
    coefs = Z_train.T @ Y_train / np.sum(Z_train**2, axis=0)
    b0 = np.mean(Y_train)

    # Project the test data with the training centering and rotation.
    Z_test = pipeline.transform(X_test)[:, keep]

    preds = Z_test @ coefs + b0
    log_test_error = np.log(np.mean((Y_test - preds)**2))

    return log_test_error


### Task 2: Simulation Study

Execute the procedure below *T* = 30 times.

In each iteration,
- Randomly partition the data into training (25%) and test (75%).
- Calculate and record the test error from the ridgeless method using the first *d* columns of **myData**, where *d* ranges from 6 to 241. The number of regression parameters therefore spans from 5 to 240, because the first column represents *Y*.

This produces 236 test errors per iteration, each a mean squared error computed on the test data. A practical way to store them is a 30-by-236 matrix holding the test errors from the whole simulation study.

**Graphical display**:
Plot the median of the test errors (taken over the 30 iterations) on a **log scale** against the number of regression parameters, which spans from 5 to 240.
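One way to organise the bookkeeping is sketched below, assuming raw mean squared errors are stored and the logarithm is taken only after the median. The array `test_errors` is filled with placeholder random numbers purely to show the shapes involved; it is not real output, and the variable names are hypothetical.

```python
import numpy as np

T = 30                               # number of random splits
d_values = np.arange(6, 242)         # number of columns used, including Y
n_params = d_values - 1              # regression parameters: 5, 6, ..., 240

# Placeholder values standing in for ridgeless test MSEs (hypothetical data).
rng = np.random.default_rng(0)
test_errors = rng.uniform(0.5, 2.0, size=(T, d_values.size))

median_errors = np.median(test_errors, axis=0)   # one median per model size
log_median = np.log(median_errors)               # values to plot against n_params
print(test_errors.shape, n_params[0], n_params[-1])
```

With an even number of runs, `np.median` averages the two middle values, so the log of the median and the median of the logs can differ slightly; either is a reasonable summary for the plot.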

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
Shu_UIN = 8298  # Last 4-digits of my UIN 
T = 30
N_PARAM = 236

Data = pd.read_csv("Coding3_dataH.csv", header=None)

log_test_error_array = np.zeros((T, N_PARAM))

for t in range(T):
    # 25% training / 75% test split, seeded differently for each iteration.
    train_t, test_t = train_test_split(Data.values, test_size=0.75, random_state=Shu_UIN+t)
    for d in range(6, 242):
        # First d columns: Y plus d - 1 regression parameters (5 to 240).
        train_t_d, test_t_d = train_t[:, :d], test_t[:, :d]
        log_test_error_array[t, d-6] = ridgeless_function(train_t_d, test_t_d)

In [5]:
log_test_error_median_array = np.median(log_test_error_array, axis=0)
number_of_feature_array = np.arange(5, 241)  # 5, 6, ..., 240 regression parameters

In [6]:
fig = go.Figure()
fig.update_layout(width=1000,height=500,
                  xaxis_title="# of features",
                  yaxis_title="Log of Test Error")

fig.add_trace(go.Scatter(x=number_of_feature_array,y=log_test_error_median_array,
                    mode='markers',
                    marker_color="blue"))