## Imports

In [1]:
# To embed plots in the notebooks
%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np # numpy library
import scipy.linalg as lng # linear algebra from scipy library
from scipy.spatial import distance # load distance function
from sklearn import preprocessing as preproc # load preprocessing function

# seaborn can be used to "prettify" default matplotlib plots by importing and setting as default
import seaborn as sns
sns.set() # Set seaborn as default

## Load dataset

In [2]:
diabetPath = './DiabetesDataNormalized.txt'
T = np.loadtxt(diabetPath, delimiter = ' ', skiprows = 1)
y = T[:, 10]
X = T[:,:10]

# Get number of observations (n) and number of independent variables (p)
[n, p] = np.shape(X)

M = X

## 1 Solve the Ordinary Least Squares (OLS) computationally (for the diabetes data set):

> (a) What is the difference between a brute-force (analytical) implementation of an OLS solver and a numerically 'smarter' implementation? Compute the ordinary least squares solution for the diabetes data set with both options and look at the relative difference. Use, for example, lng.lstsq to invert the matrix or to solve the linear system of equations.

In [3]:
def ols_numerical(X, y):
    # Call lstsq from lng to get betas
    # beta, residues, rank, sing_val = lng.lstsq(a=X, b=y)
    return lng.lstsq(a=X, b=y)


def ols_analytical(X, y):
    # Implement the analytical closed form way of calculating the betas 
    beta = np.linalg.inv(X.T @ X) @ X.T @ y
    return beta

In [36]:
# numerical solution
beta_num, _, _, _ = ols_numerical(X, y)
print(f'The list of betas: \n{beta_num}')

The list of betas: 
[-0.00618293 -0.14813008  0.32110005  0.20036692 -0.48931352  0.29447365
  0.06241272  0.10936897  0.46404908  0.04177187]


In [37]:
# analytical solution
beta_ana = ols_analytical(X,y)
print(f'The list of betas: \n{beta_ana}')

The list of betas: 
[-0.00618293 -0.14813008  0.32110005  0.20036692 -0.48931352  0.29447365
  0.06241272  0.10936897  0.46404908  0.04177187]


In [38]:
# difference in solutions
norm = np.linalg.norm(beta_ana-beta_num)
print(f'The norm of the difference between betas: \n{norm}')

The norm of the difference between betas: 
2.1112011768850182e-14


Is the difference significant? 

What can we conclude relating to numerical vs analytical solutions?
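
One way to explore these two questions is to make the problem deliberately ill-conditioned and watch how the two solvers drift apart. The sketch below uses synthetic data, not the diabetes set, and the helper names `make_design` and `relative_gap` are made up for illustration. The gap is expected to grow with the condition number of `X`, because forming `X.T @ X` squares that condition number before the inverse is taken.

```python
import numpy as np
import scipy.linalg as lng

rng = np.random.default_rng(0)


def make_design(n, p, collinearity):
    """Synthetic design matrix whose last column is nearly a copy of the first.

    Smaller `collinearity` means stronger collinearity (assumption: illustration only).
    """
    X = rng.standard_normal((n, p))
    X[:, -1] = X[:, 0] + collinearity * rng.standard_normal(n)
    return X


def relative_gap(X, y):
    """Relative difference between the explicit-inverse and lstsq solutions."""
    beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y
    beta_lstsq = lng.lstsq(X, y)[0]
    return np.linalg.norm(beta_inv - beta_lstsq) / np.linalg.norm(beta_lstsq)


results = []
for eps in (1.0, 1e-3, 1e-6):
    X_demo = make_design(200, 5, eps)
    y_demo = X_demo @ np.ones(5) + 0.1 * rng.standard_normal(200)
    results.append((eps, np.linalg.cond(X_demo), relative_gap(X_demo, y_demo)))
    print(f'eps={eps:g}  cond(X)={results[-1][1]:.2e}  relative gap={results[-1][2]:.2e}')
```

For a well-conditioned problem such as the normalized diabetes data, both approaches agree to roughly machine precision, which is consistent with the norm of about 2e-14 reported above.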

> (b) How could you include an intercept term in Python? This means using the model: $y = β_0 + x_1β_1 + ... + x_pβ_p + ε$ rather than: $y = x_1β_1 + ... + x_pβ_p + ε$.

In [39]:
# Include offset / intercept
M = np.hstack(((np.ones_like(X[:,0]))[:, np.newaxis], X))

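# numerical solution
beta_num_bias, _, _, _ = ols_numerical(M, y)
# NOTE: this prints beta_num (the fit without intercept); beta_num_bias[0] holds the intercept
print(f'The list of betas: \n{beta_num}')

# analytical solution
beta_ana_bias = ols_analytical(M,y)
# NOTE: this prints beta_ana (the fit without intercept); beta_ana_bias[0] holds the intercept
print(f'The list of betas: \n{beta_ana}')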

# difference in solutions
norm = np.linalg.norm(beta_ana_bias-beta_num_bias)
print(f'The norm of the difference between betas: \n{norm}')

The list of betas: 
[-0.00618293 -0.14813008  0.32110005  0.20036692 -0.48931352  0.29447365
  0.06241272  0.10936897  0.46404908  0.04177187]
The list of betas: 
[-0.00618293 -0.14813008  0.32110005  0.20036692 -0.48931352  0.29447365
  0.06241272  0.10936897  0.46404908  0.04177187]
The norm of the difference between betas: 
4.330529936130431e-15


What is the value of the intercept coefficient?

Can you explain why?

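<span style="color:yellow"> Value of intercept: about 0 </span>

<span style="color:yellow"> Reason: the data set is normalized (as the file name suggests), so y and every column of X have approximately zero mean. The OLS intercept is $\hat{β}_0 = \bar{y} - \bar{x}^T\hat{β}$, which is 0 when all the means are 0. </span>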
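
The effect of centering on the intercept can be checked on synthetic data. The sketch below is an illustration under assumed data (the numbers and the helper `fit_with_intercept` are not part of the exercise): it fits the same model before and after subtracting the column means.

```python
import numpy as np
import scipy.linalg as lng

rng = np.random.default_rng(1)

# Synthetic, uncentered data with a true intercept of 4.0 (assumption for illustration)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_raw = 4.0 + X_raw @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)


def fit_with_intercept(X, y):
    """Prepend a column of ones and solve the least squares problem."""
    M = np.column_stack((np.ones(len(y)), X))
    return lng.lstsq(M, y)[0]


beta_raw = fit_with_intercept(X_raw, y_raw)

# Center every column of X and y, then refit
X_c = X_raw - X_raw.mean(axis=0)
y_c = y_raw - y_raw.mean()
beta_centered = fit_with_intercept(X_c, y_c)

print(f'intercept on raw data:      {beta_raw[0]}')
print(f'intercept on centered data: {beta_centered[0]}')
print(f'slopes unchanged: {np.allclose(beta_raw[1:], beta_centered[1:])}')
```

Centering leaves the slope coefficients unchanged and moves the whole offset out of the model, which is why the intercept column adds nothing on already-centered data.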

> (c) Calculate the mean squared error $MSE = \frac{1}{n} \sum_{i=1}^n (y_i - x_iβ)^2$.

In [40]:
# Calculate the estimated y values and use these to calculate the MSE.
def compute_mse(X,beta,y):
    y_hat = X @ beta
    res = y - y_hat
    mse = np.mean((res)**2)
    return mse, res, y_hat

In [41]:
mse_ana, res_ana, yhat_ana = compute_mse(X,beta_ana,y)

print(f'mse from the analytical solution: {mse_ana}')

mse from the analytical solution: 0.48116051086159695


What happens to the MSE if we change some of the betas?

Is that what you expected?

In [44]:
beta_new = beta_ana.copy() # copy, so that beta_ana itself is not modified
beta_new[5] = 0

mse_new, res_new, yhat_new = compute_mse(X,beta_new,y)

print(f'mse from the changed betas: {mse_new}')

mse from the changed betas: 0.5676790520132494


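<span style="color:yellow"> If we change the values of beta, we get a worse MSE. This makes sense: OLS minimizes the sum of squared residuals, so any other choice of beta can only give an equal or larger MSE on the same data. </span>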
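
The same observation can be tested more broadly than by zeroing a single coefficient. The following sketch, on synthetic data with a hypothetical `mse` helper, perturbs the OLS solution in many random directions and checks that the MSE never drops below the OLS value. For a perturbation $d$, the increase is exactly $\|Xd\|_2^2 / n$, because the OLS residual is orthogonal to the columns of $X$.

```python
import numpy as np
import scipy.linalg as lng

rng = np.random.default_rng(2)

# Synthetic regression problem (assumption for illustration)
X_demo = rng.standard_normal((150, 4))
y_demo = X_demo @ np.array([0.5, -1.0, 0.0, 2.0]) + 0.5 * rng.standard_normal(150)

beta_ols = lng.lstsq(X_demo, y_demo)[0]


def mse(X, beta, y):
    """Mean squared error of the linear prediction X @ beta."""
    return np.mean((y - X @ beta) ** 2)


mse_ols = mse(X_demo, beta_ols, y_demo)

# Random perturbations of the OLS coefficients
perturbations = 0.1 * rng.standard_normal((1000, 4))
mse_perturbed = np.array([mse(X_demo, beta_ols + d, y_demo) for d in perturbations])

# Predicted increase for each perturbation: ||X d||^2 / n
predicted_increase = np.sum((X_demo @ perturbations.T) ** 2, axis=0) / len(y_demo)

print(f'OLS MSE:                {mse_ols}')
print(f'smallest perturbed MSE: {mse_perturbed.min()}')
```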

> (d) Calculate the residual sum of squares $RSS = \|{\bf y} - Xβ\|_2^2$ and the total sum of squares $TSS = \|{\bf y} - \bar{y}\|_2^2$, where $\bar{y}$ is the estimated mean of ${\bf y}$. Report on the $R^2$ measure, that is, the proportion of variance in the sample set explained by the model: $R^2 = 1 - \frac{RSS}{TSS}$

In [47]:
RSS = np.linalg.norm(y - M@beta_num_bias)**2
TSS = np.linalg.norm(y - np.mean(y))**2 # TSS is measured around the mean of y, not of the fitted values

R2 = 1 - RSS/TSS
R2

0.51774842222035

How much variance in <strong>y</strong> can we explain using this simple model?

<span style="color:yellow"> With the simple model, we can explain about 52% of the variance in y </span>
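
The $R^2$ computation can be wrapped in a small reusable function. This is a minimal sketch on synthetic data; the name `r_squared` and the demo variables are assumptions for illustration. It also checks two known properties of OLS with an intercept: the fitted values have the same mean as `y`, and $R^2$ equals the squared correlation between `y` and the fitted values.

```python
import numpy as np
import scipy.linalg as lng


def r_squared(y, y_hat):
    """Proportion of variance explained: 1 - RSS / TSS, with TSS around mean(y)."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - rss / tss


rng = np.random.default_rng(3)

# Synthetic data with an intercept of 1.5 (assumption for illustration)
X_demo = rng.standard_normal((200, 3))
y_demo = 1.5 + X_demo @ np.array([1.0, 0.0, -0.5]) + rng.standard_normal(200)

# Fit with an intercept column, then evaluate the fit
M_demo = np.column_stack((np.ones(len(y_demo)), X_demo))
beta_demo = lng.lstsq(M_demo, y_demo)[0]
y_hat_demo = M_demo @ beta_demo

r2_demo = r_squared(y_demo, y_hat_demo)
corr_demo = np.corrcoef(y_demo, y_hat_demo)[0, 1]

print(f'R^2:                 {r2_demo}')
print(f'squared correlation: {corr_demo ** 2}')
```

Predicting the mean of `y` for every observation gives $RSS = TSS$ and therefore $R^2 = 0$, which is the baseline the model is compared against.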