#Regularisation: Ridge and Lasso Regression

**Introduction**

Imagine a data set with two input variables, X<sub>i</sub> and X<sub>j</sub>, where X<sub>j</sub>=-X<sub>i</sub>, and neither has any genuine bearing on the output y. A linear model that sets β<sub>j</sub>=β<sub>i</sub> will still have the same accuracy as one from which X<sub>i</sub> and X<sub>j</sub> are removed, however large those two coefficients are, because the two terms in the sum β<sub>i</sub>X<sub>i</sub>+β<sub>j</sub>X<sub>j</sub> cancel out. This situation can be mitigated if the loss function of our linear regression, the ordinary least squares (OLS) loss, is extended with an additional *penalty* term that pushes down the parameters β<sub>1</sub>... β<sub>p</sub>. Everything else being equal, this lets us shrink the parameters β<sub>i</sub> and β<sub>j</sub> of our example as close to zero as possible without any loss of accuracy.
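The cancellation can be checked numerically. The following is a minimal sketch on synthetic data (the sample size, the random seed and the penalty weight `lam` are illustrative assumptions, not values from the exercise): with X<sub>j</sub>=-X<sub>i</sub>, putting the same coefficient `c` on both inputs leaves the OLS loss unchanged, while an added L2 penalty grows with `c`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x_i = rng.normal(size=n)
x_j = -x_i                       # second input is the exact negative of the first
X = np.column_stack([x_i, x_j])
y = rng.normal(size=n)           # output is unrelated to either input


def ols_loss(beta):
    """Sum of squared residuals for the coefficient vector beta."""
    return float(np.sum((y - X @ beta) ** 2))


def l2_penalty(beta):
    """Sum of squared coefficients."""
    return float(np.sum(beta ** 2))


lam = 1.0  # illustrative penalty weight
for c in (0.0, 1.0, 10.0):
    beta = np.array([c, c])      # c * x_i + c * x_j = 0 because x_j = -x_i
    print(f"c={c:5.1f}  OLS={ols_loss(beta):.3f}  "
          f"OLS+lam*L2={ols_loss(beta) + lam * l2_penalty(beta):.3f}")
```

The OLS column should be the same in every row (up to rounding), whereas the penalised column is smallest at `c = 0`, which is why the penalty drives both coefficients towards zero.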

When the penalty term is proportional to L2 = Σ(β<sub>i</sub><sup>2</sup>), the resulting regression is known as [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn-linear-model-ridge) regression (with a loss function OLS+λ.L2), while using penalty term L1 = Σ|β<sub>i</sub>| (loss function OLS+λ.L1) corresponds to the so-called [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso)  regression.
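Written out in full, the two loss functions might look as follows. This is a sketch in notation assumed here rather than taken from the text: N is the number of training points, x<sub>ni</sub> is the value of input i for point n, and β<sub>0</sub> is an intercept that is left out of the penalty.

```latex
\begin{aligned}
\mathrm{OLS}(\beta) &= \sum_{n=1}^{N}\Bigl(y_n - \beta_0 - \sum_{i=1}^{p}\beta_i x_{ni}\Bigr)^{2} \\
J_{\mathrm{Ridge}}(\beta) &= \mathrm{OLS}(\beta) + \lambda \sum_{i=1}^{p}\beta_i^{2} \\
J_{\mathrm{Lasso}}(\beta) &= \mathrm{OLS}(\beta) + \lambda \sum_{i=1}^{p}\lvert\beta_i\rvert
\end{aligned}
```

Note that scikit-learn's Lasso divides the squared-error term by 2N, so its penalty weight is not numerically interchangeable with the λ written here.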

L1 and L2 are also known as *regularisation* terms. λ is referred to as the complexity parameter. You may also see the letter α (alpha) used instead of λ, as is the case throughout the scikit-learn documentation.

Using regularisation terms (L1, L2 or their combination, with the result known as [Elastic Net](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html?highlight=elastic%20net#sklearn-linear-model-elasticnet)) reduces the variance of the model parameters β<sub>1</sub>... β<sub>p</sub> as we vary the training data. The price to pay is a larger model bias, i.e. a systematic error that remains even after averaging the fitted models over many different training sets.
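The variance reduction can be observed in a small simulation. The sketch below is built on assumptions chosen for illustration: synthetic data with two strongly correlated inputs, a hypothetical helper `ridge_fit` that uses the closed-form ridge solution without an intercept, and an arbitrary penalty of 1.0. It refits the model on many freshly drawn training sets and compares how much the coefficients move around.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge solution without intercept: (X^T X + lam*I)^-1 X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


rng = np.random.default_rng(0)
true_beta = np.array([2.0, 0.0])


def sample_training_set(n=30):
    """Draw a fresh training set with two strongly correlated inputs."""
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)
    X = np.column_stack([x1, x2])
    y = X @ true_beta + rng.normal(size=n)
    return X, y


estimates = {0.0: [], 1.0: []}   # lam = 0.0 corresponds to plain OLS
for _ in range(500):
    X, y = sample_training_set()
    for lam in estimates:
        estimates[lam].append(ridge_fit(X, y, lam))

for lam, betas in estimates.items():
    betas = np.array(betas)
    print(f"lam={lam}: mean={betas.mean(axis=0).round(2)}, "
          f"variance={betas.var(axis=0).round(3)}")
```

In this setup the penalised coefficients should vary far less between training sets, while their average drifts away from `true_beta`, which is the bias side of the trade-off.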


**Exercise**

In this exercise you will use both Ridge and Lasso regression, and also plain Linear Regression to build regression models for the California house prices dataset. The task for this dataset is to learn a model that predicts the median price of a house (in a California district) from 8 variables describing that district.

To choose the correct complexity penalty for Ridge and Lasso regression you should use the built-in *cross-validation* (see online lecture) that is available in scikit-learn via the classes [RidgeCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html?highlight=ridgecv#sklearn.linear_model.RidgeCV) and [LassoCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html). Look up how to use them in the scikit-learn documentation.
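To make the idea of choosing the penalty by cross-validation concrete, here is a minimal hand-rolled sketch on synthetic data. It is not how RidgeCV or LassoCV are implemented; the helpers `ridge_fit` and `cv_mse`, the candidate grid and the choice of 5 folds are all assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge solution without intercept."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def cv_mse(X, y, lam, k=5):
    """Mean held-out squared error of ridge with penalty lam over k folds."""
    indices = np.arange(len(y))
    errors = []
    for held_out in np.array_split(indices, k):
        train = np.setdiff1d(indices, held_out)
        beta = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[held_out] - X[held_out] @ beta) ** 2))
    return float(np.mean(errors))


rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.5, 0.0, 0.0, -2.0, 0.0]) + rng.normal(size=60)

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(X, y, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)   # penalty with the lowest held-out error
print(scores)
print("selected penalty:", best_lam)
```

RidgeCV and LassoCV perform this kind of search for you and expose the selected value as the fitted attribute `alpha_`.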

To help you out, here is the initial part of my Python program for doing this:


In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import RidgeCV, LinearRegression, LassoCV
import numpy as np
california = fetch_california_housing()
print(california.data.shape)
print(california.target.shape)
print(california.feature_names)




(20640, 8)
(20640,)
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']


**In your first experiment:**

*   learn from the first 15,000 datapoints
*   print out the learned parameter values for each predictor
*   compute the R<sup>2</sup> score on the remainder.

Do this for linear regression, ridge regression and lasso regression, then:

*   For ridge and lasso regression print out the complexity penalty value cross-validation found.
*   Inspect the learned parameter values for each predictor, and comment on their significance.


**In your second experiment:**

*  do the same except this time just train on the first 150 datapoints.

The results you get from the first and second experiments should be quite different - consider the reasons for that.



##Solution

(to the above two experiments combined)

In [3]:
X = california.data
y = california.target

X_train_large, X_test_large = X[:15000], X[15000:]
y_train_large, y_test_large = y[:15000], y[15000:]

X_train_small, X_test_small = X[:150], X[150:]
y_train_small, y_test_small = y[:150], y[150:]

In [4]:
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV

linear_model = LinearRegression(fit_intercept=True)
lasso_model = LassoCV(fit_intercept=True)
ridge_model = RidgeCV(fit_intercept=True)

In [5]:
linear_model.fit(X_train_large, y_train_large)
lasso_model.fit(X_train_large, y_train_large)
ridge_model.fit(X_train_large, y_train_large)


In [6]:
models = {
    "Linear Regression": linear_model,
    "Lasso Regression": lasso_model,
    "Ridge Regression": ridge_model,
}

for model_name, model in models.items():
    print(f"\n{model_name}:")
    print("Intercept:", f"{model.intercept_:.3f}")
    print("Coefficients:")
    for feature, coef in zip(california.feature_names, model.coef_):
        print(f"  {feature}: {coef:.3f}")



Linear Regression:
Intercept: -30.192
Coefficients:
  MedInc: 0.443
  HouseAge: 0.007
  AveRooms: -0.106
  AveBedrms: 0.621
  Population: -0.000
  AveOccup: -0.008
  Latitude: -0.385
  Longitude: -0.368

Lasso Regression:
Intercept: -20.973
Coefficients:
  MedInc: 0.385
  HouseAge: 0.009
  AveRooms: 0.004
  AveBedrms: 0.000
  Population: -0.000
  AveOccup: -0.008
  Latitude: -0.307
  Longitude: -0.269

Ridge Regression:
Intercept: -30.134
Coefficients:
  MedInc: 0.442
  HouseAge: 0.007
  AveRooms: -0.104
  AveBedrms: 0.609
  Population: -0.000
  AveOccup: -0.008
  Latitude: -0.385
  Longitude: -0.367


In [7]:
from sklearn.metrics import r2_score

print("\nR2 Scores")
for model_name, model in models.items():
    r2 = model.score(X_test_large, y_test_large)
    print(f"{model_name}: {r2:.3f}")



R2 Scores
Linear Regression: 0.593
Lasso Regression: 0.551
Ridge Regression: 0.593
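For regressors, `model.score` returns the coefficient of determination R<sup>2</sup>. As a reminder of what that number measures, here is a minimal sketch of the definition (the helper name `r2` and the toy values are made up for illustration). It also shows that R<sup>2</sup> is not bounded below: predictions that are worse than always guessing the mean give a negative score.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean
    return 1.0 - ss_res / ss_tot


y_true = np.array([1.0, 2.0, 3.0, 4.0])
print(r2(y_true, y_true))                      # perfect predictions: 1.0
print(r2(y_true, np.full(4, y_true.mean())))   # always predicting the mean: 0.0
print(r2(y_true, y_true + 10.0))               # systematically far off: -79.0
```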


In [8]:
linear_model_small = LinearRegression(fit_intercept=True)
ridge_model_small = RidgeCV(fit_intercept=True)
lasso_model_small = LassoCV(fit_intercept=True)

In [9]:
linear_model_small.fit(X_train_small, y_train_small)
lasso_model_small.fit(X_train_small, y_train_small)
ridge_model_small.fit(X_train_small, y_train_small)

In [12]:
models = {
    "Linear Regression": linear_model_small,
    "Lasso Regression": lasso_model_small,
    "Ridge Regression": ridge_model_small,
}

for model_name, model in models.items():
    print(f"\n{model_name}:")
    print("Intercept:", f"{model.intercept_:.3f}")
    print("Coefficients:")
    for feature, coef in zip(california.feature_names, model.coef_):
        print(f"  {feature}: {coef:.3f}")

    r2 = model.score(X_test_small, y_test_small)
    print("R²:", f"{r2:.3f}")

    if model_name in ["Ridge Regression", "Lasso Regression"]:
        print("Alpha:", f"{model.alpha_:.3f}")



Linear Regression:
Intercept: 1998.033
Coefficients:
  MedInc: 0.223
  HouseAge: 0.003
  AveRooms: -0.085
  AveBedrms: -0.500
  Population: 0.000
  AveOccup: 0.027
  Latitude: 2.051
  Longitude: 16.961
R²: -1961.050

Lasso Regression:
Intercept: 0.785
Coefficients:
  MedInc: 0.359
  HouseAge: -0.000
  AveRooms: -0.020
  AveBedrms: -0.000
  Population: 0.000
  AveOccup: -0.000
  Latitude: -0.000
  Longitude: 0.000
R²: 0.334
Alpha: 0.101

Ridge Regression:
Intercept: 481.972
Coefficients:
  MedInc: 0.371
  HouseAge: 0.004
  AveRooms: -0.109
  AveBedrms: -0.881
  Population: 0.000
  AveOccup: -0.013
  Latitude: -0.383
  Longitude: 3.807
R²: -148.736
Alpha: 0.100


##L1 and L2 regularisation with PyTorch

So, Lasso regression is simply a regression model that uses L1 regularisation;  a model that uses L2 regularisation is called Ridge regression. This means that we can easily modify our implementation of Linear regression by tweaking the loss function.
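To see what "tweaking the loss function" amounts to without any autograd machinery, here is a minimal NumPy sketch of gradient descent on a penalised loss. Everything in it is an illustrative assumption: the synthetic data, the penalty weight `lam`, the step size `lr`, the absence of an intercept, and the choice of mean squared error plus `lam` times the L2 term as the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.0]) + 0.1 * rng.normal(size=n)

lam = 0.1          # penalty weight (illustrative value)
lr = 0.1           # step size
beta = np.zeros(p)

for step in range(2000):
    residual = y - X @ beta
    grad_fit = (-2.0 / n) * (X.T @ residual)   # gradient of the mean squared error
    grad_reg = 2.0 * lam * beta                # gradient of lam * sum(beta**2)
    # For an L1 penalty, a subgradient would be lam * np.sign(beta) instead.
    beta -= lr * (grad_fit + grad_reg)

# Closed-form minimiser of the same penalised loss, for comparison
beta_exact = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
print(beta)
print(beta_exact)
```

The only difference from unpenalised gradient descent is the extra `grad_reg` term. With an autograd framework, the same effect is obtained by adding the penalty to the loss before calling the backward pass.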

In [14]:
import torch

# Function to compute the L1 penalty over all parameters (weights and bias) of a model
def L1regloss(model):
  reg_loss = 0
  for param in model.parameters():
    reg_loss += param.abs().sum()
  return reg_loss


# Function to compute the L2 penalty over all parameters (weights and bias) of a model
def L2regloss(model):
  reg_loss = 0
  for param in model.parameters():
    reg_loss += param.pow(2).sum()
  return reg_loss

model = torch.nn.Linear(8,1)
X_train = torch.from_numpy(np.float32(california.data[:15000,:]))
y_train = torch.from_numpy(np.float32(california.target[:15000])).unsqueeze(1)
X_test = torch.from_numpy(np.float32(california.data[15000:,:]))
y_test = torch.from_numpy(np.float32(california.target[15000:])).unsqueeze(1)
epochs = 50

# Setup optimiser
optim = torch.optim.SGD(model.parameters(), lr=1e-7)
criterion = torch.nn.MSELoss()

# Main training loop
for epoch in range(epochs):
    y_predict = model(X_train)
    fit_loss = criterion(y_predict, y_train)
    # L1 penalty gives Lasso; use L2regloss instead for Ridge
    reg_loss = L1regloss(model)
    loss = fit_loss+reg_loss
    optim.zero_grad()
    loss.backward()
    optim.step()
    print('epoch {}, fit loss {}, reg loss {}'.format(epoch, fit_loss.item(), reg_loss.item()))

# Generalisation error (mean squared error on the held-out data)
with torch.no_grad():
    print(criterion(model(X_test), y_test).item())

epoch 0, fit loss 126394.5234375, reg loss 2.0490219593048096
epoch 1, fit loss 14053.783203125, reg loss 1.9282405376434326
epoch 2, fit loss 2179.454833984375, reg loss 1.8885183334350586
epoch 3, fit loss 922.7672119140625, reg loss 1.8751482963562012
epoch 4, fit loss 788.184814453125, reg loss 1.8703457117080688
epoch 5, fit loss 772.1934204101562, reg loss 1.8683290481567383
epoch 6, fit loss 768.7390747070312, reg loss 1.8672184944152832
epoch 7, fit loss 766.6140747070312, reg loss 1.8664028644561768
epoch 8, fit loss 764.6339721679688, reg loss 1.865683913230896
epoch 9, fit loss 762.673828125, reg loss 1.8649970293045044
epoch 10, fit loss 760.7201538085938, reg loss 1.8643211126327515
epoch 11, fit loss 758.771728515625, reg loss 1.8636491298675537
epoch 12, fit loss 756.8282470703125, reg loss 1.8629794120788574
epoch 13, fit loss 754.889892578125, reg loss 1.862310767173767
epoch 14, fit loss 752.9564208984375, reg loss 1.8616429567337036
epoch 15, fit loss 751.02789306640

**To do:**

Modify the loss function in the above solution to use the L2 loss as penalty instead of L1 loss, thus obtaining an implementation of Ridge regression.