# Regularized Linear Models

In [0]:
from sklearn import datasets
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams["figure.figsize"] = (20,15)
sns.set_theme(style="whitegrid")

## Load data

We load the [Boston housing data](http://lib.stat.cmu.edu/datasets/boston) and split it into train and test data. 
As in the last notebook, we generate [polynomial features](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
of the second degree.
We will work further with `X_train_poly`, `y_train`, `X_test_poly` and `y_test`. 
Run the cell below.

### Ethical considerations

The dataset used in this exercise has an ethical problem.
A thorough discussion of the issues can be found in [this article](https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8).  
The key takeaway is that the data contains an attribute called 'B'.
The
[original authors](https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air)
of the dataset engineered this feature assuming that racial self-segregation has a positive impact on house prices.
Such an attribute furthers systemic racism and must not be used outside of educational purposes.


In [0]:
# The data set was originally downloaded from "http://lib.stat.cmu.edu/datasets/boston".

raw_df = pd.read_csv('../../../Data/Boston.csv')

y = raw_df['target']
X = pd.DataFrame(raw_df.iloc[:,1:-1])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

poly = PolynomialFeatures(2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
# depending on the version of sklearn, this will cause an error
# in that case, replace "get_feature_names_out" with "get_feature_names"
poly_names = poly.get_feature_names_out()

### Exercise

How many features are there in total?

In [0]:
# Task 1


We will also use the user-defined function `plot_coef`, which takes the coefficients of a fitted model as input and plots their values.

In [0]:
def plot_coef(lr_coef, names=None, ordered=True, hide_zero=False, figsize=(12,20)):
    """
    The function plots coefficients' values from the linear model.
    --------
    params:
        lr_coef: coefficients as they are returned from the model's attributes
        names: names for the coefficients, if left empty x0, x1, ... will be used
        ordered: order the coefficients according to their value
        hide_zero: hide all coefficients which are equal to 0
        figsize: tuple specifying the size of the plot
    """
    # avoid a mutable default argument; fall back to generic names
    if names is None or len(names) < 1:
        names = [f"x{i}" for i in range(len(lr_coef))]

    named_coef = pd.DataFrame({"attr": names, "coef": lr_coef})

    if hide_zero:
        named_coef = named_coef[named_coef["coef"] != 0]

    if ordered:
        # sort into a new frame instead of modifying a filtered slice in place
        named_coef = named_coef.sort_values(by="coef", ascending=True)

    fig, ax = plt.subplots(figsize=figsize)

    ax.axvline(x=0, c="orange", ls="--")
    ax.scatter(x="coef", y="attr", data=named_coef)
    ax.margins(y=0.01)
    ax.set_title("Coefficients' values")

## Fit linear regression without regularization

### Exercise

- Instantiate a [linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) under the variable `lr`.
- Fit `lr` to `X_train_poly` and `y_train`.
- Predict with `lr` on `X_train_poly` and store the results to `y_hat_train`.
- Predict with `lr` on `X_test_poly` and store the results to `y_hat_test`.
- Return the RMSE for `y_hat_train` as well as for `y_hat_test`. 

How do you interpret the difference in the model's performance on the train and test datasets? Can you tell if the model overfits or underfits?

In [0]:
# Task 2

lr = ...
...

y_hat_train = ...
y_hat_test = ...

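# depending on the version of sklearn, the "squared" argument will cause an error
# in that case, drop it and wrap the call in np.sqrt(...)
print(f"RMSE train: {mean_squared_error(..., ..., squared=False)}")
print(f"RMSE test: {mean_squared_error(..., ..., squared=False)}")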

The RMSE is almost twice as large on the test set as on the train set. This suggests overfitting and poor generalization of the model.
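
The gap between train and test error can be reproduced on synthetic data. The sketch below is an illustration under assumptions (random data, hypothetical variable names), not the notebook's solution. It computes the RMSE as the square root of the MSE, which works regardless of whether the installed scikit-learn still accepts the `squared` argument.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a purely linear signal plus noise (assumed for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=1.0, size=80)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# 44 polynomial features for only 60 training rows invite overfitting.
poly_demo = PolynomialFeatures(2, include_bias=False)
X_tr_p = poly_demo.fit_transform(X_tr)
X_te_p = poly_demo.transform(X_te)

model = LinearRegression().fit(X_tr_p, y_tr)
rmse_tr = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr_p)))
rmse_te = np.sqrt(mean_squared_error(y_te, model.predict(X_te_p)))

print(f"RMSE train: {rmse_tr:.3f}")
print(f"RMSE test:  {rmse_te:.3f}")
```

A test RMSE well above the train RMSE is the pattern to look for when diagnosing overfitting.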

We use the function `plot_coef` on the coefficients of the fitted model to see the values of the coefficients.

In [0]:
plot_coef(lr.coef_, poly_names)

The error values on train and test suggest that the model overfits the given set of polynomial features. 
We should therefore use **regularization**. 

## Standardization

Before fitting any regularized model, scaling the features is crucial.
Otherwise the penalty would treat features of different scales unequally: a feature with small numeric values needs a large coefficient and is therefore penalized more.
Regularized linear models assume that the inputs have zero mean and variances of the same magnitude.
[`StandardScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
subtracts the mean and divides by the standard deviation. 
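
As a minimal sketch of this behaviour, the snippet below standardizes two made-up features that live on very different scales. All names and numbers are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two hypothetical features on very different scales.
X_tr = np.column_stack([rng.normal(5, 1, 200), rng.normal(3000, 500, 200)])
X_te = np.column_stack([rng.normal(5, 1, 50), rng.normal(3000, 500, 50)])

sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr)  # learn mean and std from the train data only
X_te_s = sc.transform(X_te)      # reuse the train statistics on the test data

print("train mean:", X_tr_s.mean(axis=0).round(6))
print("train std: ", X_tr_s.std(axis=0).round(6))
print("test mean: ", X_te_s.mean(axis=0).round(2))
```

The train columns end up with mean 0 and standard deviation 1, while the test columns are only approximately centered because they are transformed with the train statistics.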

### Exercise

- Instantiate
[`StandardScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
under the name `scaler`.
- Apply the `fit_transform` method with the input `X_train_poly` to `scaler` and store the result into `X_train_scaled`.
- Once the scaler is fit to `X_train_poly` you can directly transform `X_test_poly` and store it in the variable `X_test_scaled`. You never want to fit on a test sample, because that way information from the test data might leak. Test data serves only for evaluation.

In [0]:
# Task 3

scaler = ...
X_train_scaled = ...
X_test_scaled = ...

If you applied the standardization correctly, the right-hand chart should show the distributions of all features centered around zero with similar spread.

In [0]:
fig, axs = plt.subplots(1, 2, sharey=True, figsize=(12, 20))

axs[0].boxplot(X_train_poly, vert=False, labels=poly_names)
axs[0].set_title('Original polynomial features')

axs[1].boxplot(X_train_scaled, vert=False, labels=poly_names)
axs[1].set_title('Scaled features')

plt.tight_layout()
plt.show()

## Lasso
Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

### Exercise
- Instantiate a Lasso regression under the name `lr_l`.
- Fit the model to `X_train_scaled` and `y_train`.
- Predict on `X_train_scaled` and `X_test_scaled` and store the predictions in `y_hat_train` and `y_hat_test`, respectively.

Did the overfitting change?

In [0]:
# Task 4


from sklearn.linear_model import Lasso

lr_l = ...
...

y_hat_train = ...
y_hat_test = ...

print(f"RMSE train: {mean_squared_error(..., ..., squared=False)}")
print(f"RMSE test: {mean_squared_error(..., ..., squared=False)}")

The performance is now comparable on the train and test datasets. Hence, the model generalizes better.

### Exercise

Use `plot_coef()` on the coefficients of the lasso model.

In [0]:
# Task 5



The average magnitude of the coefficients is much smaller now. Also, many of the coefficients are exactly 0.
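
The sparsity effect of the L1 penalty can be seen on a small synthetic problem. This is a sketch under assumptions: the data, the true coefficients and `alpha=0.3` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X_syn = rng.normal(size=(n_samples, n_features))

# By construction, only the first three features influence the target.
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y_syn = X_syn @ true_coef + rng.normal(scale=0.5, size=n_samples)

ols = LinearRegression().fit(X_syn, y_syn)
lasso = Lasso(alpha=0.3).fit(X_syn, y_syn)

print("non-zero OLS coefficients:  ", np.sum(ols.coef_ != 0))
print("non-zero Lasso coefficients:", np.sum(lasso.coef_ != 0))
```

Ordinary least squares assigns a small non-zero weight to every noise feature, whereas Lasso sets most of them exactly to zero.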

In [0]:
print(f'After applying Lasso on polynomial scaled features we are left with {np.sum(lr_l.coef_!=0)} variables.')
print('\nThe selected variables are:\n- ', end="")
print("\n- ".join(poly_names[lr_l.coef_ != 0]))

### Exercise

- Take the subset of `X_train_scaled` with only those variables that have a non-zero coefficient and store it in the variable `X_train_lasso`
- Do the same selection on `X_test_scaled` and save it to `X_test_lasso`.
- How many variables are remaining? Check it with the cell above.

In [0]:
# Task 6

X_train_lasso = ...
X_test_lasso = ...
...

## Ridge
Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

We have effectively performed a feature selection with Lasso. Now we will compare it to Ridge regression.

Let's try different values for the regularization strength, alpha. By default it is equal to 1 and it must be non-negative. Larger values specify stronger regularization. Alpha can also be set in Lasso and Elastic Net.
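
One way to see what "stronger regularization" means is to track the size of the coefficient vector as alpha grows. The sketch below uses synthetic data and hypothetical names (`alphas_demo`, `coef_norms`). It looks at the coefficient norm rather than at the RMSE.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 10))
y_syn = X_syn @ rng.normal(size=10) + rng.normal(size=100)

alphas_demo = [10.0**i for i in range(-3, 3)]
coef_norms = []
for alpha in alphas_demo:
    ridge = Ridge(alpha=alpha).fit(X_syn, y_syn)
    # Euclidean length of the coefficient vector for this alpha
    coef_norms.append(np.linalg.norm(ridge.coef_))
    print(f"alpha={alpha:>7}: ||coef|| = {coef_norms[-1]:.4f}")
```

The norm shrinks as alpha increases, which is the mechanism that trades variance for bias.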

### Exercise
- Fit the ridge regression to `X_train_scaled` and `y_train` with the values of alpha being 0.001, 0.01, 0.1, 1, 10 and 100 to see the effect of the regularization strength.
- Return the RMSE for `X_train_scaled` and `X_test_scaled` for each of the alpha options.
- Visualize both RMSE curves.

Are you able to find the ranges where the model is over- or underfitted?

In [0]:
# Task 7

rmses = pd.DataFrame(columns=["alpha", "train", "test"])
alphas = [10**i for i in range(-3, 3)]

for alpha in alphas:    
    lr_r = ...
    ...

    y_hat_train = ...
    y_hat_test = ...

    rmse_train = mean_squared_error(..., ..., squared=False)
    rmse_test = mean_squared_error(..., ..., squared=False)
    rmses = pd.concat([rmses, pd.DataFrame([{"alpha": alpha, "train": rmse_train, "test": rmse_test}])], axis=0, ignore_index=True)

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot("alpha", "train", data=rmses, label="RMSE for train set", c="b", ls="--")
ax.plot("alpha", "test", data=rmses, label="RMSE for test set", c="r")

ax.set_xscale("log")
ax.legend()
ax.set_xlabel(r"$\alpha$")
ax.set_ylabel("RMSE")

plt.show();

In the above plot, we can observe a clear trend in the training data: as the regularization parameter \\(\alpha\\) increases, the Root Mean Square Error (RMSE) also increases monotonically. 
This is expected, as a higher \\(\alpha\\) imposes more restriction on the coefficients, leading to a simpler model.  
The more intriguing effect is seen when we look at the RMSE on the test data. 
As anticipated, the RMSE is high for large \\(\alpha\\) values, a phenomenon known as underfitting.
However, as \\(\alpha\\) decreases, the test RMSE eventually starts to rise again.
This is because the coefficients are not sufficiently constrained, leading to an overly complex model, a situation referred to as overfitting.

**Note:** It’s crucial not to use your test data when optimizing the hyperparameter.
If you aim to optimize the hyperparameter, consider using cross-validation or alternative metrics such as Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC).
These metrics penalize complex models and enable you to make an informed decision based solely on your training data.

All of these observations also hold for Lasso Regression.
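
As a hedged sketch of the cross-validation route mentioned in the note, `RidgeCV` can pick alpha from a grid using only the training split. The data and the grid `alphas_grid` are assumptions for illustration. The synthetic features are already on a common scale, so no scaler is used here.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 30))
signal = X_syn[:, :5] @ np.array([2.0, -1.0, 0.5, 1.5, -2.5])
y_syn = signal + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

alphas_grid = [10.0**i for i in range(-3, 3)]
# 5-fold cross-validation on the training split only; the test split is untouched.
ridge_cv = RidgeCV(alphas=alphas_grid, cv=5).fit(X_tr, y_tr)

print("alpha chosen by cross-validation:", ridge_cv.alpha_)
print("R^2 on the held-out test split:  ", round(ridge_cv.score(X_te, y_te), 3))
```

The test split is used once at the end to report performance, never to choose the hyperparameter.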

### Exercise
- Fit the model with a high value of \\(\alpha=100\\).
- Check how many coefficients equal 0 and plot their values using `plot_coef`.

In [0]:
# Task 8

lr_r_high = Ridge(...).fit(..., ...)
print(f"There are {(lr_r_high.coef_ == 0).sum()} coefficients equal to 0 for this model.")

...

Even for a highly penalized Ridge regression model, all the coefficients are non-zero: Ridge shrinks coefficients toward zero but, unlike Lasso, does not set them exactly to zero.

## Elastic Net
[Elastic Net](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html)
combines the Lasso (L1) and Ridge (L2) penalties; the mix is controlled by the parameter `l1_ratio`.
If it is equal to 1, the model is equivalent to Lasso; if it is 0, only the L2 penalty remains, as in Ridge regression.
The regularization strength alpha can be defined just as in Ridge or Lasso. 

You can force the coefficients to be non-negative with the parameter `positive=True`.
This option is also available for Lasso. 
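
The two parameters described above can be checked on a small synthetic example. This is a sketch under assumptions: the data, the true coefficients and the chosen `alpha` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 10))
# The second feature has a negative effect by construction.
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y_syn = X_syn @ true_coef + rng.normal(scale=0.5, size=200)

# l1_ratio=1 reduces Elastic Net to Lasso.
lasso = Lasso(alpha=0.1).fit(X_syn, y_syn)
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_syn, y_syn)
print("same as Lasso:", np.allclose(lasso.coef_, enet_l1.coef_))

# positive=True forbids negative coefficients.
enet_pos = ElasticNet(alpha=0.1, l1_ratio=0.5, positive=True).fit(X_syn, y_syn)
print("coefficients with positive=True:", enet_pos.coef_.round(2))
```

With `positive=True` the feature with the negative true effect cannot receive a negative weight, so its coefficient is held at zero.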

For all variations of the linear regression, you can fit the model without an intercept by setting the parameter `fit_intercept=False`.
In that case the data is assumed to be already centered.

Older versions of scikit-learn offered an option (`normalize=True`) to scale the data by the norm of each feature; the same scaling was then applied automatically in `predict()`.
This option was deprecated in version 1.0 and removed in version 1.2, so scaling now has to be done explicitly, e.g. with the standard scaling done at the beginning. 
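
Scaling can also be bundled with the model so that it is applied automatically at prediction time. The sketch below uses `make_pipeline`, which is an assumption here and not something the notebook itself uses; the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three hypothetical features with very different spreads.
X_raw = rng.normal(loc=50.0, scale=[1.0, 10.0, 100.0], size=(150, 3))
y_raw = 0.5 * X_raw[:, 0] + 0.02 * X_raw[:, 2] + rng.normal(size=150)

pipe = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1))
pipe.fit(X_raw[:100], y_raw[:100])    # the scaler is fit on these rows only
y_pred = pipe.predict(X_raw[100:])    # the same scaling is reused here

print(y_pred[:5].round(2))
```

Because the scaler lives inside the pipeline, it is never fit on the rows used for prediction.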

### Exercise

Experiment with the parameters of `ElasticNet()`.
Fit the model to `X_train_scaled` and `y_train` with different sets of options, e.g.
- `positive=False`
- `l1_ratio = 0`, `0.5`, `1`
- `alpha = 0.001`, `0.01`, `0.1`, `1`, `10`, `100` 

Plot the coefficients with `plot_coef` to see the effect of the options.
Return the RMSE on train and test set.

In [0]:
# Task 9



Material adapted for RBI internal purposes with full permissions from original authors.