# TP Feature Selection / Regularisation

### author: Anastasios Giovanidis, 2021-2022

date of TP: 27 October 2021

**Student name:**

This is the TP related to Feature Selection / Regularisation. We will need to import the following libraries.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
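# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires scikit-learn < 1.2.
from sklearn.datasets import load_boston, load_diabetes, load_linnerud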
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.cross_decomposition import PLSRegression

In this class we will learn how to use Python for regression with methods that choose a subset of features, reduce the dimension, or shrink the coefficients. Specifically, we will study:

- Ridge Regression
- The Lasso
- Partial Least Squares (PLS)*
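For reference, Ridge and the Lasso differ only in the penalty they add to the least-squares criterion. The block below writes both in standard textbook notation; the symbols $\beta_0$, $\beta_j$, $x_{ij}$, $n$ and $p$ are introduced here for illustration and are not defined elsewhere in this TP.

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}
  \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
  + \lambda \sum_{j=1}^{p} \beta_j^2

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}
  \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
  + \lambda \sum_{j=1}^{p} |\beta_j|
```

In sklearn the parameter $\lambda$ is called alpha, but the two classes scale the loss differently: `Ridge` minimises $\|y - X\beta\|_2^2 + \alpha\|\beta\|_2^2$, while `Lasso` minimises $\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$. The same numerical value of alpha therefore does not correspond to the same amount of shrinkage in the two methods.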

## Preparation: Importing datasets

To do so, we will first import a dataset with a large number of features:

- Dataset 1: Boston House Prices dataset (Linear Regression example)

In [2]:
# load_boston(): from the sklearn.datasets library
boston=load_boston()

One can read the general description of the "Boston" dataset and find the names of the features by printing the following:

In [3]:
print(boston.keys())
#print(boston.DESCR)
#print(boston.feature_names)

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])


We create a pandas DataFrame to keep all the features in a compact format.

In [4]:
boston_df=pd.DataFrame(boston.data,columns=boston.feature_names)
#print(boston_df)
#print(boston_df.info())

The response Y is the house price, which the dataset stores as 'target'. We add it to the DataFrame as a new column.

In [5]:
boston_df['Price']=boston.target

## Exercise 1 (Ridge Regression)

(A) Use the Boston dataset to perform Linear Regression, and afterwards, Ridge Regression with the sklearn functions. For Ridge use $\lambda = 5$ (the sklearn Ridge parameter is called alpha). Compare the two methods, based on the Mean Squared Error, $R^2$ and the coefficient values. What do you observe?


(B) As a second step, plot (i) the MSE test error and (ii) the values of feature coefficients (one curve per feature), for $\lambda\in[0,25]$.

note 1: To use Ridge, you need to **rescale** the data (not the target) so that the mean is $0$ and the standard deviation is $1$.

note 2: You should also split the dataset into a Train set and a Test set (30\% of the data for the Test set).

note 3: The Ridge Regression parameter is 'alpha'. For documentation, see: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
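The comparison asked for in (A) could be sketched as follows. This is a minimal illustration, not the expected answer: it assumes the diabetes dataset (`load_diabetes`) as a stand-in for Boston, and the names `models` and `results` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# Assumption: the diabetes dataset stands in for Boston in this sketch.
X, y = load_diabetes(return_X_y=True)
X = scale(X)  # standardise the features only, not the target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

models = {"OLS": LinearRegression(), "Ridge (alpha=5)": Ridge(alpha=5)}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
        "coef": model.coef_,
    }
    print(f"{name}: test MSE = {results[name]['mse']:.2f}, "
          f"R^2 = {results[name]['r2']:.3f}")
    print("  coefficients:", np.round(model.coef_, 2))
```

Printing the two coefficient vectors side by side makes the shrinkage visible: the Ridge coefficients are pulled towards zero, but none of them becomes exactly zero.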

**Answer:**

Preparation: Rescale + Train-Test Split. We will first need to rescale and standardize the data set.

In [6]:
# scale: from the preprocessing library
Xboston = scale(boston_df.drop('Price',axis=1))
Yboston = boston_df['Price']

In [8]:
#check that the scale worked well:
print(Xboston.mean(0).round())

[-0.  0.  0. -0. -0. -0. -0. -0. -0.  0. -0. -0. -0.]
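To see what `scale` actually computes, the short sketch below standardises a toy matrix by hand and compares the result. The values in `X` are made up for illustration; checking the standard deviation as well as the mean gives a more complete sanity check than the mean alone.

```python
import numpy as np
from sklearn.preprocessing import scale

# Toy feature matrix (hypothetical values): 4 samples, 2 features
# measured on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

X_scaled = scale(X)

# Same computation by hand: subtract the column mean and divide by the
# column standard deviation (population version, ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```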


Then we split the dataset into a Train set and a Test set (validation-set approach): 30% of the samples, chosen at random with a fixed seed, form the Test set.

In [9]:
# train_test_split: from the sklearn.model_selection library
Xbo_train, Xbo_test, ybo_train, ybo_test = train_test_split(Xboston, Yboston, test_size=0.3, random_state=10)
#print(len(ybo_train))

- Linear Regression

- Ridge Regression

In [1]:
# Ridge: from the sklearn.linear_model library
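For part (B), one possible way to trace the Test MSE and the coefficient paths over a grid of alpha values is sketched below. It assumes the diabetes dataset as a stand-in for Boston and a grid of 51 points on $[0,25]$; the names `alphas`, `mse_test`, `coefs` and `best_alpha` are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# Assumption: the diabetes dataset stands in for Boston in this sketch.
data = load_diabetes()
X, y = scale(data.data), data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

alphas = np.linspace(0, 25, 51)  # assumed grid; alpha=0 corresponds to OLS
mse_test, coefs = [], []
for a in alphas:
    ridge = Ridge(alpha=a).fit(X_train, y_train)
    mse_test.append(mean_squared_error(y_test, ridge.predict(X_test)))
    coefs.append(ridge.coef_)
coefs = np.array(coefs)  # shape: (number of alphas, number of features)

best_alpha = alphas[int(np.argmin(mse_test))]
print("alpha with the lowest test MSE on this grid:", best_alpha)

fig, (ax_mse, ax_coef) = plt.subplots(1, 2, figsize=(11, 4))
ax_mse.plot(alphas, mse_test)
ax_mse.set_xlabel("alpha")
ax_mse.set_ylabel("Test MSE")
ax_coef.plot(alphas, coefs)  # one curve per feature (one per column)
ax_coef.set_xlabel("alpha")
ax_coef.set_ylabel("coefficient value")
ax_coef.legend(feature_names, ncol=2, fontsize="small")
fig.tight_layout()
# In a notebook the figure renders at the end of the cell;
# in a script, call plt.show() to display it.
```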

## Exercise 2 (The Lasso)

- Use the same Boston dataset, with same Train-test split, to perform the Lasso using the sklearn function. Use $\lambda = 1$ (the sklearn Lasso parameter is called alpha). Print the Mean Squared Error, $R^2$ and the coefficient values. What do you observe?


- As a second step, plot (i) the MSE test error and (ii) the values of feature coefficients, for $\lambda\in[0,0.3]$.


- Finally, compare the minimum Test MSE for the Lasso with those for Ridge and OLS. Which method performs best?

note 1: To use Lasso, you need to keep the **rescaled** data from before.

note 2: Keep the same split of the dataset as above in a Train and a Test set.

note 3: The Lasso Regression parameter is 'alpha'. For documentation, see: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
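A minimal sketch of the first step, fitting the Lasso with alpha = 1 and inspecting which coefficients are set to zero, might look like this. It assumes the diabetes dataset as a stand-in for Boston; `n_zero` is a hypothetical helper variable.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# Assumption: the diabetes dataset stands in for Boston in this sketch.
X, y = load_diabetes(return_X_y=True)
X = scale(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

lasso = Lasso(alpha=1.0).fit(X_train, y_train)
y_pred = lasso.predict(X_test)

lasso_mse = mean_squared_error(y_test, y_pred)
lasso_r2 = r2_score(y_test, y_pred)
n_zero = int(np.sum(lasso.coef_ == 0))  # features dropped by the Lasso

print(f"Lasso (alpha=1): test MSE = {lasso_mse:.2f}, R^2 = {lasso_r2:.3f}")
print("coefficients:", np.round(lasso.coef_, 2))
print("coefficients set exactly to zero:", n_zero)
```

Unlike Ridge, the Lasso can set coefficients exactly to zero, which is why it also acts as a feature-selection method.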

**Answer:**

- The Lasso

In [2]:
# Lasso: from the sklearn.linear_model library
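The final comparison between OLS, Ridge and the Lasso could be organised as in the sketch below. It assumes the diabetes dataset as a stand-in for Boston; the grids, the helper `fit_and_score` and the dictionary `min_mse` are hypothetical. The Lasso grid starts slightly above 0 because sklearn advises against `Lasso(alpha=0)` and recommends `LinearRegression` for that case.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# Assumption: the diabetes dataset stands in for Boston in this sketch.
X, y = load_diabetes(return_X_y=True)
X = scale(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)


def fit_and_score(model):
    """Fit on the Train set and return the MSE on the Test set."""
    model.fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))


ridge_alphas = np.linspace(0, 25, 51)     # assumed grid for Ridge
lasso_alphas = np.linspace(0.01, 0.3, 30)  # assumed grid for Lasso, alpha > 0

min_mse = {
    "OLS": fit_and_score(LinearRegression()),
    "Ridge": min(fit_and_score(Ridge(alpha=a)) for a in ridge_alphas),
    "Lasso": min(fit_and_score(Lasso(alpha=a, max_iter=10000))
                 for a in lasso_alphas),
}

for name, value in min_mse.items():
    print(f"{name}: minimum test MSE = {value:.2f}")
```

Note that choosing alpha by minimising the Test MSE makes the reported minimum optimistic; it serves here only as the comparison the exercise asks for.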

## Exercise 3 (bonus) - *to do after classification course*

Repeat the comparison between OLS, Ridge and Lasso using the two other datasets, i.e. 'load_diabetes' and 'load_linnerud'.

## Exercise 4 (Partial Least Squares) - *not in our material*

Use the same dataset to perform Partial Least Squares (PLS). 

- Find the Test MSE for every possible number of linear combinations of features. Use the Train-Test split from the previous exercises. Which number $M\in\{1,\ldots,p\}$ of components gives the minimum Test MSE? Compare with OLS, Ridge and Lasso.

- Instead of a single Train-Test split, use the Cross-Validation framework, which makes use of the entire available dataset. What number of linear combinations of the original features, $M\in\{1,\ldots,p\}$, gives the lowest Test MSE?

note 1: You will need the 'cross_validate' function from sklearn.model_selection. For documentation, see: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html

note 2: You will need the 'PLSRegression' class from sklearn.cross_decomposition. For documentation, see: https://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.PLSRegression.html

**Answer:**

In [3]:
# We first use the same Train-Test split as before, with Test equal to 30% of the original dataset.
# apply: PLSRegression

In [4]:
# We use Cross-Validation on the entire dataset.
# apply: PLSRegression, cross_validate