# Gaussian Process Regression


________
### Table of contents

[1. Regression model](#model)<br>
[2. Prediction](#prediction)<br>
[3. Finding the best kernel parameters](#kernel)<br>
_________

**Dataset :**

Let $(X_i)_{i\in[\![1,n]\!]}$ be i.i.d. random variables in $\mathbb{R}^d$ and consider the matrix $X \in \mathbb{R}^{n \times d}$ whose i-th row is the observation $X_i^T$. For all $1 \leq i \leq n$, the observation $X_i$ is associated with a label $Y_i \in \mathbb{R}$, and we consider the vector $Y = [Y_1, Y_2, ..., Y_n]^T \in \mathbb{R}^n$.

<a id='model'></a>
## 1. Regression model

In the Gaussian process regression model, $Y = f(X) + \epsilon$. 
<br><br>
The $(\epsilon_i)_{i\in[\![1,n]\!]}$ are i.i.d. normally distributed noise variables, so that $\epsilon = [\epsilon_1, \epsilon_2, ..., \epsilon_n]^T \sim N(0, \sigma ^2 I_n)$.
$f$ is a Gaussian Process, i.e. a collection of random variables such that any finite subset of them, for instance $f(X) = [f(X_1), ..., f(X_n)]^T$, follows a joint Gaussian distribution. It is defined by :


$$
f(.) \sim \mathcal{GP}(0, k_{\gamma}(. , .))
$$

where $k_{\gamma}(. , .)$ is a valid covariance function and $\gamma$ is the parameter to optimize.

A commonly used kernel function is the squared exponential or radial basis function (RBF) kernel, defined as follows:

$$k_\gamma (z, z') = \exp\left(-\frac{\| z - z' \|^2}{2 \gamma^2}\right)$$
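
As an illustration, the RBF kernel matrix between two sets of points can be sketched in NumPy as follows. The function name `rbf_kernel` and the demo array `X_demo` are hypothetical names chosen for this example, not part of the original notebook.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """RBF kernel matrix between the rows of A (m, d) and the rows of B (p, d)."""
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * gamma**2))


# Hypothetical toy inputs: three points in R^2
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K_demo = rbf_kernel(X_demo, X_demo, gamma=1.0)
```

The resulting matrix is symmetric with ones on the diagonal, since $k_\gamma(z, z) = 1$ for every $z$.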


<a id='prediction'></a>
## 2. Prediction

Gaussian Process Regression is a nonparametric model. Prediction is therefore performed directly using the conditional Gaussian distribution.

Given a dataset of observations $\lbrace (x_i, y_i) \rbrace_{1 \leq i \leq n}$, we want to predict the output $Y_{test}$ of a test set $X_{test}$ drawn from the same distribution.


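With $Y_0 = [Y, Y_{test}]^T$, $X_0 = [X, X_{test}]^T$ and $\epsilon_0 = [\epsilon, 0]^T$ (the test outputs are treated as noise-free), the model is thus :

$$Y_0 = f(X_0) + \epsilon_0 \sim N\left(0, K_0 + \begin{pmatrix} \sigma^2 I_n & 0 \\ 0 & 0 \end{pmatrix}\right)$$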


where 
$$K_0 = \begin{pmatrix}
          K_{aa} & K_{ab} \\
          K_{ba} & K_{bb} \\
         \end{pmatrix}$$ and 
$$\begin{equation}
    \begin{cases}
      K_{aa} = (k_\gamma(X_i, X_j))\\
      K_{ab} = (k_\gamma(X_i, X_{test, j}))\\
      K_{ba} = (k_\gamma(X_{test, i}, X_j))\\
      K_{bb} = (k_\gamma(X_{test, i}, X_{test, j}))
    \end{cases}       
\end{equation}$$

We can then compute the conditional distribution $(Y_{test}|X_{test}, X, Y) \sim \mathcal{N}(m, D)$ by using the conditional Gaussian distribution formulas, where the block $a$ corresponds to the training points and the block $b$ to the test points :

$$m = K_{ba}(K_{aa} + \sigma^2I_n)^{-1}Y$$
$$D = K_{bb} - K_{ba}(K_{aa} + \sigma^2I_n)^{-1}K_{ab}$$
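
The prediction step can be sketched in NumPy as below. It implements the standard conditional Gaussian form $m = K_{ba}(K_{aa} + \sigma^2 I_n)^{-1}Y$ and $D = K_{bb} - K_{ba}(K_{aa} + \sigma^2 I_n)^{-1}K_{ab}$, with an RBF kernel and noise-free test outputs. The names `rbf_kernel` and `gp_predict`, the synthetic sine data and the values of `gamma` and `sigma` are assumptions made for the example. A Cholesky factorization is used instead of an explicit inverse, which is a common choice for numerical stability.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-np.sum(diff**2, axis=-1) / (2.0 * gamma**2))


def gp_predict(X, Y, X_test, gamma, sigma):
    """Posterior mean m and covariance D of the test outputs."""
    n = X.shape[0]
    K_aa = rbf_kernel(X, X, gamma)            # train / train
    K_ba = rbf_kernel(X_test, X, gamma)       # test / train
    K_bb = rbf_kernel(X_test, X_test, gamma)  # test / test
    # Cholesky factor of the noisy training covariance K_aa + sigma^2 I_n
    L = np.linalg.cholesky(K_aa + sigma**2 * np.eye(n))
    # alpha = (K_aa + sigma^2 I_n)^{-1} Y, obtained by two triangular systems
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    m = K_ba @ alpha
    # V = L^{-1} K_ab, so that V^T V = K_ba (K_aa + sigma^2 I_n)^{-1} K_ab
    V = np.linalg.solve(L, K_ba.T)
    D = K_bb - V.T @ V
    return m, D


# Hypothetical synthetic data: noisy samples of a sine function
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(20, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
X_test = np.linspace(-3.0, 3.0, 5)[:, None]

m, D = gp_predict(X, Y, X_test, gamma=1.0, sigma=0.1)
```

The diagonal of `D` gives the predictive variance at each test point, which can be used to draw confidence bands around `m`.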

<a id='kernel'></a>
## 3. Finding the best kernel parameters

With $X$, $Y$ and $\epsilon$ as previously defined, let's apply the model to the dataset : $Y = f(X) + \epsilon$.

As $f(.)$ is a Gaussian Process, $f(X) \sim N(0, K_\gamma)$ where $K_\gamma = (k_\gamma (X_i, X_j))_{i,j \in [\![1,n]\!]}$, and since $\epsilon \sim N(0, \sigma ^2 I_n)$ is independent of $f(X)$, it follows that $(Y | X; \gamma) \sim N(0, K_\gamma + \sigma ^2 I_n)$.

Consequently, the probability density of $Y$ is:  
$$P(Y=y | X; \gamma) = \frac{1}{(2\pi)^{n/2} \det(K_\gamma + \sigma^2 I_n)^{1/2}}\exp\left(-\frac{1}{2} y^T (K_\gamma + \sigma^2 I_n)^{-1} y\right)$$


The likelihood of the model, given the observed data, is defined as:  
$$L(\gamma) = P(Y=y | X; \gamma)$$


The aim of the training is to find the parameter $\gamma$ that maximizes the likelihood function (Maximum Likelihood Estimation), which is equivalent to minimizing the negative log-likelihood. To stay consistent with the neural network architecture used later, we choose to minimize the negative log-likelihood :  
$$l(\gamma) = - \log (P(Y=y | X; \gamma))$$


The resulting equation is thus:  
$$l(\gamma) = \frac{1}{2} \left( n\log(2\pi) + \log \det(K_\gamma + \sigma^2 I_n) + y^T (K_\gamma + \sigma^2 I_n)^{-1} y \right)$$
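
A minimal NumPy sketch of this negative log-likelihood is given below, assuming the RBF kernel and a known noise level $\sigma$. The names `rbf_kernel` and `negative_log_likelihood` and the random demo data are hypothetical. The log-determinant is read from the Cholesky factor, since $\log \det C = 2 \sum_i \log L_{ii}$ when $C = L L^T$.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-np.sum(diff**2, axis=-1) / (2.0 * gamma**2))


def negative_log_likelihood(X, y, gamma, sigma):
    """l(gamma) = 0.5 * (n log(2 pi) + log det(C) + y^T C^{-1} y), C = K + sigma^2 I."""
    n = X.shape[0]
    C = rbf_kernel(X, X, gamma) + sigma**2 * np.eye(n)
    L = np.linalg.cholesky(C)
    # alpha = C^{-1} y through two triangular systems
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log det(C) = 2 * sum(log(diag(L)))
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * (n * np.log(2.0 * np.pi) + log_det + y @ alpha)


# Hypothetical demo data
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))
y = rng.normal(size=8)
nll_value = negative_log_likelihood(X, y, gamma=1.0, sigma=0.5)
```

Evaluating this function on a grid of `gamma` values is a simple way to inspect the objective before running a gradient-based optimizer.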

To simplify the notation, from now on $K_\gamma$ denotes the full covariance matrix $K_\gamma + \sigma ^2 I_n$.

Again, to unify this with the neural network approach (based on the chain rule), we need to compute the derivative of the negative log-likelihood with respect to $K_\gamma$ :  
$$\frac {\partial l}{\partial K_\gamma} = \frac{1}{2} \left( \frac {\partial n\log(2\pi)}{\partial K_\gamma} + \frac {\partial \log \det K_\gamma }{\partial K_\gamma} + \frac {\partial y^T K_\gamma^{-1} y}{\partial K_\gamma} \right)$$


Pre-requisites (for a symmetric invertible matrix $A$) :
$$\begin{equation}
    \begin{cases}
      \frac {\partial \log \det A}{\partial A} = A^{-1}\\
      \frac {\partial A^{-1}}{\partial A_{k,l}} = -A^{-1} \frac {\partial A}{\partial A_{k,l}} A^{-1}
    \end{cases}       
\end{equation}$$

Consequently :
$$\begin{equation}
    \begin{cases}
      \frac {\partial n\log(2\pi)}{\partial K_\gamma} = 0\\
      \frac {\partial \log \det K_\gamma}{\partial K_\gamma} = K_\gamma ^{-1}\\
      \frac {\partial y^T K_\gamma^{-1} y}{\partial K_\gamma} = - K_\gamma^{-1} y y^T K_\gamma^{-1}
    \end{cases}       
\end{equation}$$


> PROOF for $\frac {\partial y^T K_\gamma^{-1} y}{\partial K_\gamma} = - K_\gamma^{-1} y y^T K_\gamma^{-1}$ (writing $K$ for $K_\gamma$) :
>
> As $y^T K^{-1} y = \sum_{i,j} y_i y_j (K^{-1})_{i,j}$, 
> $$\frac {\partial y^T K^{-1} y}{\partial K_{k,l}} = \sum_{i,j} y_i y_j \left(\frac {\partial K^{-1}} {\partial K_{k,l}}\right)_{i,j}$$ 
>
> Given that $\frac {\partial K^{-1}} {\partial K_{k,l}} =  - K^{-1} \frac {\partial K}{\partial K_{k,l}} K^{-1}$ and that $\frac {\partial K}{\partial K_{k,l}}$ is the $(n,n)$ zero matrix with a 1 at position $(k,l)$, we obtain the following equation:
> $$\frac {\partial K^{-1}} {\partial K_{k,l}} =  - K^{-1} \frac {\partial K}{\partial K_{k,l}} K^{-1} = - [(K^{-1}_{ik} K^{-1}_{lj})_{i,j}]$$
> Consequently :
> $$\left(\frac {\partial K^{-1}} {\partial K_{k,l}}\right)_{i,j} = - K^{-1}_{ik} K^{-1}_{lj}$$
> Finally, the derivative equals : 
> $$\frac {\partial y^T K^{-1} y}{\partial K_{k,l}} = \sum_{i,j} - y_i y_j K^{-1}_{ik} K^{-1}_{lj} = - (K^{-1} y)_k (K^{-1} y)_l$$
> where the last equality uses the symmetry of $K^{-1}$.
>
> In matrix form, this may be written as :
> $$\frac {\partial y^T K_\gamma^{-1} y}{\partial K_\gamma}  = - K_\gamma^{-1} y y^T K_\gamma^{-1}$$

Finally :  
$$\frac {\partial l}{\partial K_\gamma} = \frac{1}{2} (K_\gamma ^{-1} - K_\gamma^{-1} y y^T K_\gamma^{-1})$$
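
This closed form can be checked numerically. The sketch below compares it with central finite differences on a small random symmetric positive-definite matrix, perturbing each entry of the covariance independently. The function names `nll_from_cov` and `nll_grad_wrt_cov` and the test matrix are hypothetical choices for the example.

```python
import numpy as np


def nll_from_cov(K, y):
    """Negative log-likelihood as a function of the full covariance matrix K."""
    n = y.shape[0]
    _, log_det = np.linalg.slogdet(K)
    return 0.5 * (n * np.log(2.0 * np.pi) + log_det + y @ np.linalg.solve(K, y))


def nll_grad_wrt_cov(K, y):
    """Closed form 0.5 * (K^{-1} - K^{-1} y y^T K^{-1})."""
    K_inv = np.linalg.inv(K)
    alpha = K_inv @ y
    return 0.5 * (K_inv - np.outer(alpha, alpha))


# Hypothetical symmetric positive-definite matrix and target vector
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
K = B @ B.T + 4.0 * np.eye(4)
y = rng.normal(size=4)

analytic = nll_grad_wrt_cov(K, y)

# Central finite differences, one matrix entry at a time
numeric = np.zeros_like(K)
eps = 1e-6
for k in range(4):
    for l in range(4):
        E = np.zeros_like(K)
        E[k, l] = eps
        numeric[k, l] = (nll_from_cov(K + E, y) - nll_from_cov(K - E, y)) / (2.0 * eps)

max_abs_error = np.max(np.abs(analytic - numeric))
```

A small `max_abs_error` indicates that the analytic gradient and the numerical one agree at this symmetric matrix.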


In Deep Kernel Learning, we have $K_{\gamma, w} = (k_\gamma (h_w(X_i), h_w(X_j)))_{i,j \in [\![1,n]\!]}$ where $h_w(.)$ represents the neural network with weights $w$.


During the backpropagation, we use the chain rule to compute $\frac {\partial l}{\partial \gamma}$ and $\frac {\partial l}{\partial w}$ from $\frac {\partial K_{\gamma, w}}{\partial \gamma}$ and $\frac {\partial K_{\gamma, w}}{\partial w}$, summing over the entries of the kernel matrix :  

$$\begin{equation}
    \begin{cases}
      \frac {\partial l}{\partial \gamma} = \sum_{i,j} \frac {\partial l}{\partial (K_{\gamma, w})_{i,j}} \frac {\partial (K_{\gamma, w})_{i,j}}{\partial \gamma}\\
      \frac {\partial l}{\partial w} = \sum_{i,j} \frac {\partial l}{\partial (K_{\gamma, w})_{i,j}} \frac {\partial (K_{\gamma, w})_{i,j}}{\partial w}
    \end{cases}       
\end{equation}$$
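
To make the chain rule concrete for the kernel parameter, here is a sketch for the RBF kernel, for which $\frac{\partial k_\gamma(z, z')}{\partial \gamma} = k_\gamma(z, z') \frac{\| z - z' \|^2}{\gamma^3}$. It assumes that $h_w$ is the identity map (no neural network) and that the noise level $\sigma$ is fixed, so only $\frac{\partial l}{\partial \gamma}$ is computed. All function names and the demo data are hypothetical.

```python
import numpy as np


def sq_dists(X):
    """Pairwise squared Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff**2, axis=-1)


def nll(X, y, gamma, sigma):
    """Negative log-likelihood with an RBF kernel plus noise sigma^2 I."""
    n = X.shape[0]
    C = np.exp(-sq_dists(X) / (2.0 * gamma**2)) + sigma**2 * np.eye(n)
    _, log_det = np.linalg.slogdet(C)
    return 0.5 * (n * np.log(2.0 * np.pi) + log_det + y @ np.linalg.solve(C, y))


def nll_grad_gamma(X, y, gamma, sigma):
    """dl/dgamma = sum_ij (dl/dC)_ij * (dC/dgamma)_ij."""
    n = X.shape[0]
    D2 = sq_dists(X)
    K = np.exp(-D2 / (2.0 * gamma**2))
    C_inv = np.linalg.inv(K + sigma**2 * np.eye(n))
    alpha = C_inv @ y
    # Gradient of l with respect to the full covariance matrix
    dl_dC = 0.5 * (C_inv - np.outer(alpha, alpha))
    # Entry-wise derivative of the RBF kernel; the noise term does not depend on gamma
    dC_dgamma = K * D2 / gamma**3
    return np.sum(dl_dC * dC_dgamma)


# Hypothetical demo data and parameter values
rng = np.random.default_rng(2)
X = rng.normal(size=(6, 2))
y = rng.normal(size=6)
gamma, sigma = 1.3, 0.5

grad_analytic = nll_grad_gamma(X, y, gamma, sigma)

# Central finite difference in gamma as a sanity check
h = 1e-6
grad_numeric = (nll(X, y, gamma + h, sigma) - nll(X, y, gamma - h, sigma)) / (2.0 * h)
```

With a neural network $h_w$, the same entry-wise sum applies to each weight, and automatic differentiation frameworks typically carry it out during backpropagation.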
