#### In the class we mentioned that for linear models, the LOOCV test error estimate can be obtained as a closed form expression, and there is no need to perform n linear fits for a data set of size n. The goal of this question is for you to obtain that closed-form expression, for a class of linear models. We focus on the simple linear model.

#### (f) In the homework folder you have access to the data file **SimpleReg.csv**. The data contains a feature column x and a response column y. Read the data, then center the x data, and fit a linear model in the form of $y = β0 + β1x$. Now use an R or Python program to calculate the LOOCV CVn, as we did in the class (if you use R, pick the first element of delta). Also write a code that calculates the CVn using equation (3). You should see that the two methods produce identical results. You may also be surprised with how faster your customized code is, compared to the R cv.glm function!

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from ISLP.models import sklearn_sm
from ISLP.models import (ModelSpec as MS,summarize)
from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

In [2]:
Data = pd.read_csv('SimpleReg.csv')
y = Data['y']
df = pd.DataFrame({'intercept': np.ones(Data.shape[0]), 'x':Data['x']})
# standardlize training data
X = pd.DataFrame(StandardScaler(with_mean=True,  with_std=True).fit_transform(df), 
                 columns=df.columns, 
                 index=df.index)
# print(X)

### = = = = = From Sample code = = = = =

In [9]:
M = sklearn_sm(sm.OLS)
M_CV = cross_validate(M, X, y, cv=Data.shape[0])
cv_error = np.mean(M_CV['test_score'])
print('CVn:', cv_error)

CVn: 0.7256502337450886


In [4]:
def split_data_leave_one_out(X, y, i):
    X_train = X.drop(i)
    y_train = y.drop(i)
    X_test = X.iloc[i:i+1]
    y_test = y.iloc[i]
    return X_train, y_train, X_test, y_test

In [10]:
MSE_LR = np.zeros(len(Data))
for i in range(len(Data)):
    X_train, y_train, X_test, y_test = split_data_leave_one_out(X, y, i)
    # fit model
    model = sm.OLS(y_train, X_train).fit()
    # prediction of the left sample
    pred = model.predict(exog=X_test)
    # calculate MSE
    MSE_LR[i] = mean_squared_error([y_test], [pred])
    # if i in range(3):
    #     print(model.params[0], model.params[1])
print(f'CVn: {np.mean(MSE_LR)}')

CVn: 0.7256502337450886


### = = = = = $CV_n = \frac{1}{n} \sum_{j=1}^{n} (\frac{y_j - \hat{y}}{1 - h_j})^2$ = = = = =

In [7]:
n = len(Data)
MSE = np.zeros(n)

x = X['x']
Sxy = np.sum(Data.x * Data.y)
Sxx = np.sum(Data.x * Data.x)
muY = np.mean(y)

for j in range(n):
    
    beta_1_hat = (Sxy - (n/n-1) * x[j] * (y[j] - muY)) / (Sxx - (n/n-1) * x[j] * x[j])
    beta_0_hat = y[j] - (n * (y[j] - muY) / (n-1)) + (beta_1_hat * x[j] / (n-1))
    
    # predict y_hat
    y_hat = beta_0_hat + (beta_1_hat * x[j])
    
    # calculate MSE
    h_j = 1/n + ((x[j] * x[j]) / Sxx)
    MSE[j] = ( (y[j] - y_hat) / (1 - h_j) ) ** 2

    # if j in range(3):
    #     print('b0:', beta_0_hat, 'b1:', beta_1_hat)
    #     print('yj: ', y[j], 'y_hat:', y_hat, '\n= = = = = = =')

print('CVn:', np.sum(MSE)/len(Data))

CVn: 0.37539681052299934
