# <font color=darkcyan> Multivariate linear regression - Kernel regression </font>

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib import colors 
from sklearn.metrics import mean_squared_error

A function $k:\mathbb{R}^d\times\mathbb{R}^d\to \mathbb{R}$ is said to be a positive semi-definite kernel if and only if it is symmetric and if, for all $n\geqslant 1$, all $(x_1,\ldots,x_n)\in(\mathbb{R}^d)^n$ and all $(a_1,\ldots,a_n)\in\mathbb{R}^n$,
$$
\sum_{1\leqslant i,j\leqslant n}a_ia_jk(x_i,x_j) \geqslant 0\,.
$$

The following functions, defined on $\mathbb{R}^d\times\mathbb{R}^d$, are positive semi-definite kernels:

$$
k:(x,y)\mapsto x^Ty \quad\mathrm{and}\quad k:(x,y)\mapsto \mathrm{exp}\left(-\|x-y\|^2/\left(2\sigma^2\right)\right)\,,\; \sigma>0\,.
$$
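The definition can be checked numerically on a finite sample: the Gram matrix $(k(x_i,x_j))_{i,j}$ must have nonnegative eigenvalues. The sketch below is only an illustration on synthetic points. The names `rbf_gram` and `X_demo` are hypothetical and not part of the course material, and a small tolerance is needed because of floating-point error.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    # Pairwise squared Euclidean distances between the rows of X
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))      # 20 synthetic points in R^3 (assumption)

K_lin = X_demo @ X_demo.T              # Gram matrix of the linear kernel
K_rbf = rbf_gram(X_demo, sigma=1.5)    # Gram matrix of the RBF kernel

# A symmetric matrix is positive semi-definite iff all its eigenvalues are >= 0
min_eig_lin = np.linalg.eigvalsh(K_lin).min()
min_eig_rbf = np.linalg.eigvalsh(K_rbf).min()
print(min_eig_lin, min_eig_rbf)
```

Both minimal eigenvalues should be nonnegative up to rounding, which is consistent with (but of course does not prove) positive semi-definiteness.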


Let $\mathcal{F}$ be a Hilbert space of functions $f:\mathbb{R}^d\to\mathbb{R}$. A symmetric function $k:\mathbb{R}^d\times\mathbb{R}^d\to \mathbb{R}$ is said to be a reproducing kernel of $\mathcal{F}$ if and only if:

1. for all $x\in\mathbb{R}^d$, $k(x,\cdot)\in\mathcal{F}$ ; 

2. for all $x\in\mathbb{R}^d$ and all $f\in\mathcal{F}$, $\langle f; k(x,\cdot)\rangle_\mathcal{F} = f(x)$ . 

The space $\mathcal{F}$ is said to be a reproducing kernel Hilbert space with kernel $k$.



Let $k:\mathbb{R}^d\times\mathbb{R}^d\to \mathbb{R}$ be a positive semi-definite kernel and $\mathcal{F}$ the RKHS with kernel $k$. Then, 

$$
\widehat f^n_{\mathcal{F}} \in \underset{f\in\mathcal{F}}{\mathrm{argmin}}\;\left\{\frac{1}{n}\sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda\|f\|_\mathcal{F}^2\right\}\,,
$$

where $\|f\|^2_\mathcal{F} = \langle f\,;\, f\rangle_\mathcal{F}$, is given by $\widehat f^n_{\mathcal{F}} : x \mapsto \sum_{i=1}^n \widehat \alpha_i k(X_i,x)$, where

$$
\widehat\alpha \in \underset{\alpha \in \mathbb{R}^n}{\mathrm{argmin}}\;\left\{\frac{1}{n}\|Y - K\alpha\|^2_2 + \lambda \sum_{1\leqslant i,j \leqslant n}\alpha_i \alpha_j k(X_i,X_j)\right\} = \underset{\alpha \in \mathbb{R}^n}{\mathrm{argmin}}\;\left\{\frac{1}{n}\|Y - K\alpha\|^2_2 + \lambda \alpha^TK\alpha\right\}\,,
$$

where for all $1\leqslant i,j\leqslant n$, $K_{i,j} = k(X_i,X_j)$.

In practice, once the matrix $K$ is built, kernel ridge regression boils down to solving this optimization problem to obtain $\widehat \alpha$. The estimated function $\widehat f^n_{\mathcal{F}}$ is then a linear combination of the kernel functions $k(X_i,\cdot)$ centered at the data points, with weights given by $\widehat\alpha$.

<font color=darkred> Provide the value of $\widehat \alpha$ as a function of $n$, $\lambda$, $K$ and  $Y$.</font>
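One way to approach this question, sketched under the assumption $\lambda>0$, is to set the gradient of the convex objective to zero. Since $K$ is symmetric,

$$
\nabla_\alpha\left\{\frac{1}{n}\|Y - K\alpha\|^2_2 + \lambda \alpha^TK\alpha\right\} = -\frac{2}{n}K(Y - K\alpha) + 2\lambda K\alpha = \frac{2}{n}K\left\{(K + n\lambda I_n)\alpha - Y\right\}\,,
$$

which vanishes for

$$
\widehat\alpha = (K + n\lambda I_n)^{-1}Y\,.
$$

The matrix $K + n\lambda I_n$ is invertible because $K$ is positive semi-definite and $n\lambda>0$. When $K$ is singular, other minimizers exist, but they differ from $\widehat\alpha$ by a vector $\delta$ such that $K\delta = 0$ and therefore define the same function $\widehat f^n_{\mathcal{F}}$.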

#### Import data

In [2]:
import pandas as pd

Data can be imported as data frames using pandas. A ``DataFrame`` is a two-dimensional tabular data structure whose columns may have different types.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

<font color=darkred>
    
Import data in the file BRinf using ``read_csv``, display the first rows with ``head`` and the shape of the dataframe using ``shape``.
</font>

In [3]:
# In this section, multivariate linear regression is used to predict the Brazilian inflation based on
# many observed variables, see https://github.com/gabrielrvsc/HDeconometrics/
df = pd.read_csv('BRinf.txt')
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Brazil CPI IPCA MoM | FGV Brazil General Prices IGP- | FGV Brazil General Prices IGP-.1 | FGV Brazil General Prices IGP-.2 | FGV Brazil General Prices IGP-.3 | Brazil CPI IPCA Median Market | Brazil Total Electricity Consu | Brazil Industrial Electricity | BofA Merrill Lynch Economic Co | ... | X5t.12 | X5t.13 | fmed2 | fmean | fmean2 | fdp | lfmed2 | lfmean | lfmean2 | lfdp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003-01-31 | 2.25 | 2.33 | 2.17 | 5.816 | 2.29 | 11.56 | 25422.0 | -359 | -1.47 | ... | 0.8 | 0.68 | 4.0 | 1.96 | 3.8416 | 0.18 | 1.3225 | 1.15 | 1.3225 | 0.24 |
| 1 | 2003-02-28 | 1.57 | 2.28 | 1.59 | 4.351 | 2.42 | 10.95 | 25163.0 | -303 | -1.21 | ... | 0.68 | 0.48 | 2.56 | 1.6 | 2.56 | 0.26 | 0.7569 | 0.91 | 0.8281 | 0.27 |
| 2 | 2003-03-31 | 1.23 | 1.53 | 1.66 | 4.593 | 1.58 | 9.58 | 24960.0 | 115 | -0.35 | ... | 0.55 | 0.5 | 0.8836 | 0.93 | 0.8649 | 0.13 | 0.49 | 0.73 | 0.5329 | 0.17 |
| 3 | 2003-04-30 | 0.97 | 0.92 | 0.41 | 1.173 | 1.24 | 8.88 | 25284.0 | 93 | -0.15 | ... | 0.45 | 0.78 | 1.0 | 0.97 | 0.9409 | 0.14 | 0.25 | 0.52 | 0.2704 | 0.18 |
| 4 | 2003-05-31 | 0.61 | -0.26 | -0.67 | -1.86 | 0.02 | 8.22 | 24785.0 | 285 | -0.47 | ... | 0.6 | 0.43 | 0.25 | 0.5 | 0.25 | 0.1 | 0.3025 | 0.55 | 0.3025 | 0.15 |


In [4]:
# number of observations, number of variables
df.shape

(156, 93)

In [5]:
X = df.iloc[:,2:]
Y = df.iloc[:,1] 

<font color=darkred>
    
Use the ``StandardScaler`` of sklearn to preprocess the input variables.
</font>

``StandardScaler`` standardizes the input variables by removing the mean and scaling to unit variance.
We will not analyze standardization closely in this course. However, it is often very useful (even mandatory in some cases) for the stability of learning procedures.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [6]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)
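To make the effect of ``StandardScaler`` concrete, the following sketch compares it with a manual computation on synthetic data. The array `X_demo` is a hypothetical stand-in for the input variables, and the comparison relies on the fact that ``StandardScaler`` uses the biased standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic inputs with nonzero mean and non-unit variance (assumption)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

X_scaled = StandardScaler().fit_transform(X_demo)

# Manual standardization, column by column: subtract the mean, divide by the std
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```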

In [7]:
# first coordinate is the number of samples
# second coordinate is the number of input features (the observations Y are stored separately)
np.shape(X)

(156, 91)

<font color=darkred>
    
Divide the input data ``X`` and observations ``Y`` into training and test sets using the ``train_test_split`` function. This function randomly splits arrays or matrices into training and test subsets. It makes it possible to train a model several times with different training sets and to analyze the variability of the performance on the test set.
</font>

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1)
np.shape(X_train)

(140, 91)
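The variability mentioned above can be illustrated by repeating the split with different seeds. This is a minimal sketch on synthetic data: `X_demo`, `Y_demo` and the use of a ``Ridge`` model as a placeholder estimator are assumptions made for the illustration, not part of the exercise.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic linear data with the same number of rows as the data frame (assumption)
X_demo = rng.normal(size=(156, 5))
Y_demo = X_demo @ np.arange(1.0, 6.0) + rng.normal(scale=0.5, size=156)

test_mses = []
for seed in range(20):
    # A different random_state gives a different random split
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        X_demo, Y_demo, test_size=0.1, random_state=seed
    )
    model = Ridge(alpha=1.0).fit(X_tr, Y_tr)
    test_mses.append(mean_squared_error(Y_te, model.predict(X_te)))

# Mean and spread of the test MSE over the 20 random splits
print(np.mean(test_mses), np.std(test_mses))
```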

#### Kernel Regression from scratch

<font color=darkred>

Write a ``linear_kernel`` function with the arguments two vectors ``x`` and ``y``, which returns the result of the linear kernel function defined as $k:(x,y)\mapsto x^Ty$.
</font>

In [10]:
def linear_kernel(x,y):
    # one possible implementation: inner product x^T y of two 1-D vectors
    return np.dot(x, y)

x = np.ones(10)
y = np.ones(10)
linear_kernel(x,y)

<font color=darkred>

Write a ``rbf_kernel`` function with the arguments two vectors ``x`` and ``y`` and a scalar ``sigma``, which returns the result of the radial basis function (RBF) kernel defined as $k:(x,y)\mapsto \mathrm{exp}\left(-\|x-y\|^2 /\left(2\sigma^2\right)\right)$.
</font>
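A possible implementation is sketched below. It assumes that ``x`` and ``y`` are one-dimensional arrays of the same length; the default value `sigma=1.0` is an arbitrary choice made for the illustration.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Squared Euclidean distance ||x - y||^2 between the two vectors
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    sq_dist = np.dot(diff, diff)
    return np.exp(-sq_dist / (2 * sigma ** 2))

x = np.ones(10)
y = np.ones(10)
print(rbf_kernel(x, y, sigma=2.0))  # identical vectors: the kernel equals exp(0)
```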

<font color=darkred>

Write a ``kernel_ridge_regression`` function with arguments the data ``X`` and ``y``, the kernel matrix ``K`` and a penalty parameter ``lam`` (``lambda`` is a reserved keyword in Python and cannot be used as an argument name). The function returns the parameter estimate $\widehat \alpha$.
</font>
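The sketch below shows one way to write this function, assuming the closed form $\widehat\alpha = (K + n\lambda I_n)^{-1}Y$ that follows from the optimization problem stated earlier. The helpers `gram_matrix` and `predict`, the argument name `lam` and the synthetic data are hypothetical choices. The argument ``X`` is kept only to match the requested signature, since ``K`` already contains all the information needed.

```python
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def gram_matrix(XA, XB, kernel):
    # Matrix with entries kernel(a, b) for a in rows of XA and b in rows of XB
    return np.array([[kernel(a, b) for b in XB] for a in XA])

def kernel_ridge_regression(X, y, K, lam):
    # Solve (K + n * lam * I) alpha = y rather than forming the inverse explicitly
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict(X_train, X_new, alpha_hat, kernel):
    # f(x) = sum_i alpha_i k(X_i, x), evaluated at each row of X_new
    return gram_matrix(X_new, X_train, kernel) @ alpha_hat

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))                    # synthetic inputs (assumption)
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 3.0])    # synthetic noiseless targets

lam = 1e-3
K = gram_matrix(X_demo, X_demo, linear_kernel)
alpha_hat = kernel_ridge_regression(X_demo, y_demo, K, lam)
y_fit = predict(X_demo, X_demo, alpha_hat, linear_kernel)
print(alpha_hat.shape, np.mean((y_fit - y_demo) ** 2))
```

Using `np.linalg.solve` instead of `np.linalg.inv` is a common numerical choice; both correspond to the same formula.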

<font color=darkred>
    Run the algorithm with several values of $\lambda$ using ``X_train`` and ``Y_train`` and compute the MSE obtained on the test set.
    </font>
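A loop over $\lambda$ could look like the sketch below. It runs on synthetic one-dimensional data rather than on the inflation dataset, uses a vectorized RBF Gram matrix (`rbf_gram`, a hypothetical helper), and again assumes the closed form $\widehat\alpha = (K + n\lambda I_n)^{-1}Y$. The grid of values and `sigma = 1.0` are arbitrary.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def rbf_gram(XA, XB, sigma):
    # Squared distances via ||a||^2 + ||b||^2 - 2 a^T b, clipped at 0 for safety
    sq = (np.sum(XA ** 2, axis=1)[:, None]
          + np.sum(XB ** 2, axis=1)[None, :]
          - 2 * XA @ XB.T)
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(156, 1))                        # synthetic inputs
Y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=156)   # noisy sine
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.1, random_state=0
)

sigma = 1.0
K_tr = rbf_gram(X_tr, X_tr, sigma)   # kernel matrix between training points
K_te = rbf_gram(X_te, X_tr, sigma)   # kernel between test and training points
n = K_tr.shape[0]

mse_by_lambda = {}
for lam in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
    alpha_hat = np.linalg.solve(K_tr + n * lam * np.eye(n), Y_tr)
    mse_by_lambda[lam] = mean_squared_error(Y_te, K_te @ alpha_hat)

for lam, mse in mse_by_lambda.items():
    print(f"lambda={lam:g}  test MSE={mse:.4f}")
```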

#### Kernel Regression with Sklearn

In [11]:
from sklearn.kernel_ridge import KernelRidge

<font color=darkred>
    
Fit a ``KernelRidge`` from sklearn to train a kernel regression model, display the parameter estimate $\widehat \alpha$ and compute the MSE obtained on the test set.
</font>
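The following sketch fits ``KernelRidge`` on synthetic data and compares its ``dual_coef_`` attribute with the closed form. Two conventions are worth keeping in mind: scikit-learn's RBF kernel is parametrized by ``gamma``, which corresponds to $1/(2\sigma^2)$, and its ``alpha`` penalizes the unnormalized squared loss, so it plays the role of $n\lambda$ in the notation used here. The data and the values of `sigma` and `lam` are arbitrary assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import rbf_kernel as sk_rbf_kernel
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(156, 1))                        # synthetic inputs
Y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=156)   # noisy sine
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.1, random_state=0
)

sigma, lam = 1.0, 1e-3
n = X_tr.shape[0]
gamma = 1.0 / (2 * sigma ** 2)

# alpha = n * lam matches the (1/n)-normalized objective of this notebook
krr = KernelRidge(kernel="rbf", gamma=gamma, alpha=n * lam).fit(X_tr, Y_tr)
alpha_hat = krr.dual_coef_            # estimated weights, one per training point
test_mse = mean_squared_error(Y_te, krr.predict(X_te))

# Closed-form solution computed by hand for comparison
K_tr = sk_rbf_kernel(X_tr, X_tr, gamma=gamma)
alpha_closed = np.linalg.solve(K_tr + n * lam * np.eye(n), Y_tr)

print(test_mse, np.max(np.abs(alpha_hat - alpha_closed)))
```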

#### Cross Validation

<font color=darkred>
    
Implement a grid search from scratch for the given ``KernelRidge`` model with radial basis function (RBF) kernel and hyperparameter ``alpha``. Create a new training dataset (X_train, Y_train) and a validation dataset (X_val, Y_val) by splitting the current (X_train, Y_train) dataset. Write a function that performs a grid search to find the best value of ``alpha`` and displays the best hyperparameter along with the corresponding validation MSE.
</font>

In [12]:
model = KernelRidge(kernel='rbf', gamma=0.1)

# Validation set
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.1)


# Hyperparameter grid for alpha
alphas = [1e0, 0.1, 1e-2, 1e-3]

best_alpha = None
best_score = float('inf')
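The search itself can be written as a simple loop, as in the sketch below. The function name `grid_search_alpha`, its signature and the synthetic data are hypothetical. The idea is to fit one model per candidate value on the training part and to keep the value with the smallest validation MSE.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def grid_search_alpha(X_fit, Y_fit, X_val, Y_val, alphas, gamma=0.1):
    best_alpha, best_score = None, float("inf")
    for alpha in alphas:
        model = KernelRidge(kernel="rbf", gamma=gamma, alpha=alpha)
        model.fit(X_fit, Y_fit)
        score = mean_squared_error(Y_val, model.predict(X_val))
        # Keep the candidate with the smallest validation MSE so far
        if score < best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(140, 1))                        # synthetic inputs
Y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=140)   # noisy sine
X_fit, X_val, Y_fit, Y_val = train_test_split(
    X_demo, Y_demo, test_size=0.1, random_state=0
)

alphas = [1e0, 0.1, 1e-2, 1e-3]
best_alpha, best_score = grid_search_alpha(X_fit, Y_fit, X_val, Y_val, alphas)
print(f"best alpha={best_alpha:g}  validation MSE={best_score:.4f}")
```

A single validation split gives a noisy estimate of the best ``alpha``, which is one motivation for cross-validation.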



<font color=darkred>
    
Implement a grid search using either ``GridSearchCV`` or ``RandomizedSearchCV`` for the given ``KernelRidge`` model with a radial basis function (RBF) kernel and hyperparameters ``alpha`` and ``gamma``. Display the optimal hyperparameters along with the corresponding validation MSE.
    
</font>

In [13]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [14]:
model = KernelRidge(kernel='rbf', gamma=0.1)
param_grid={"alpha": [1e0, 0.1, 1e-2, 1e-3],
            "gamma": np.logspace(-2, 2, 5)}

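With ``GridSearchCV``, the search over both hyperparameters could be sketched as follows. The synthetic data, the 5-fold setting and the scoring choice are assumptions. Note that scikit-learn maximizes scores, so the MSE is obtained by negating ``best_score_`` when using ``scoring="neg_mean_squared_error"``.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(140, 1))                        # synthetic inputs
Y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=140)   # noisy sine

param_grid = {"alpha": [1e0, 0.1, 1e-2, 1e-3],
              "gamma": np.logspace(-2, 2, 5)}

# 5-fold cross-validation over the 4 x 5 grid of (alpha, gamma) pairs
search = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_demo, Y_demo)

best_params = search.best_params_
best_cv_mse = -search.best_score_   # mean validation MSE of the best pair
print(best_params, best_cv_mse)
```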


#### Ridge vs Lasso

In [15]:
from sklearn.linear_model import Lasso, Ridge

In [16]:
import warnings
from sklearn.exceptions import ConvergenceWarning

# Ignore ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

<font color=darkred>
Create a NumPy array with several values of the penalty parameter (called ``alpha`` in scikit-learn).
</font>

In [17]:
n_alphas = 100
alphas = np.logspace(-10, 3, n_alphas)

<font color=darkred>
Use the ``fit`` method of sklearn to fit a Lasso model and a Ridge model for each value of $\alpha$. 
    
Store the estimated parameter, the number of zeros in the estimated parameter and the MSE on the test set after each training.
</font>
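The loop below is one possible way to collect these quantities. It is a sketch on synthetic sparse data (only three nonzero true coefficients), with a narrower and coarser grid than the one defined above to keep the run short. The `results` dictionary and the value of `max_iter` are hypothetical choices.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore", category=ConvergenceWarning)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(156, 20))            # synthetic inputs (assumption)
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]                  # sparse ground truth
Y_demo = X_demo @ w_true + rng.normal(scale=0.1, size=156)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.1, random_state=0
)

alphas = np.logspace(-4, 1, 20)
results = {name: {"coef": [], "n_zeros": [], "mse": []} for name in ("lasso", "ridge")}

for a in alphas:
    estimators = {"lasso": Lasso(alpha=a, max_iter=10000), "ridge": Ridge(alpha=a)}
    for name, est in estimators.items():
        est.fit(X_tr, Y_tr)
        results[name]["coef"].append(est.coef_.copy())
        # Count coefficients that are exactly zero
        results[name]["n_zeros"].append(int(np.sum(est.coef_ == 0)))
        results[name]["mse"].append(mean_squared_error(Y_te, est.predict(X_te)))

print(results["lasso"]["n_zeros"])
print(results["ridge"]["n_zeros"])
```

On data of this kind, the Lasso typically sets more and more coefficients exactly to zero as the penalty grows, whereas Ridge shrinks them without zeroing them.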

<font color=darkred>
Display the estimated parameters as a function of the penalty parameter.
</font>

<font color=darkred>
Display the number of zero coefficients of the estimated parameter as a function of the penalty parameter.
</font>

<font color=darkred>
Display the MSE on the test set as a function of the penalty parameter.
</font>
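The three displays requested above can be produced with matplotlib, for instance as in the sketch below. It refits the models on synthetic data so that the block is self-contained. The grid, the figure layout and the `models` dictionary are arbitrary choices. In a notebook the figure renders at the end of the cell.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(156, 20))            # synthetic inputs (assumption)
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
Y_demo = X_demo @ w_true + rng.normal(scale=0.1, size=156)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.1, random_state=0
)

alphas = np.logspace(-4, 1, 20)
models = {"Lasso": lambda a: Lasso(alpha=a, max_iter=10000),
          "Ridge": lambda a: Ridge(alpha=a)}
ylabels = ["coefficients", "number of zero coefficients", "test MSE"]

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for row, (name, make) in enumerate(models.items()):
    coefs, n_zeros, mses = [], [], []
    for a in alphas:
        est = make(a).fit(X_tr, Y_tr)
        coefs.append(est.coef_.copy())
        n_zeros.append(int(np.sum(est.coef_ == 0)))
        mses.append(mean_squared_error(Y_te, est.predict(X_te)))
    axes[row, 0].plot(alphas, np.array(coefs))   # one curve per coefficient
    axes[row, 1].plot(alphas, n_zeros)
    axes[row, 2].plot(alphas, mses)
    for ax, ylabel in zip(axes[row], ylabels):
        ax.set_xscale("log")                     # penalties span several decades
        ax.set_xlabel("alpha")
        ax.set_ylabel(ylabel)
        ax.set_title(name)
fig.tight_layout()
```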