# Gaussian Process Regression

Dataset : $(X_i, Y_i)_{1⩽i⩽n}$ where $X_i  \in \mathbb{R}^d$ and $Y_i \in \mathbb{R}$

Regression problem : $\forall i \in \{1, 2, \dots, n\}, Y_i = f^*(X_i) + \epsilon_i$  


Let's write the matrix form of this : $Y = f^*(X) + \epsilon$  
where $Y = [Y_1, Y_2, \dots, Y_n]^T \in \mathbb{R}^n$  
and 
$X = \begin{pmatrix}
          - & X_1 & - \\
          - & X_2 & - \\
            & \vdots & \\
          - & X_n & -
         \end{pmatrix} \in \mathbb{R}^{n \times d}$  
and $\epsilon = [\epsilon_1, \epsilon_2, \dots, \epsilon_n]^T \in \mathbb{R}^n$

## Model

Model : $y = f(x) + \epsilon$ where $f(.) \sim GP(0, k_{\gamma})$ and $\epsilon \sim N(0, \sigma ^2 I_n)$

The kernel most commonly used is the RBF kernel, defined as follows :  
$k_\gamma (z, z') = \exp\left(-\frac{\|z - z'\|^2}{2 l^2}\right)$  
where the kernel parameter $\gamma$ is here the length scale $l$.
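As an illustration, the RBF kernel matrix between two sets of points could be computed as in the sketch below. The function name `rbf_kernel` and the parameter name `length_scale` are illustrative choices, and the kernel is assumed to depend on the squared Euclidean distance $\|z - z'\|^2$.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """RBF kernel matrix between the rows of A (n_a, d) and the rows of B (n_b, d)."""
    # Pairwise squared Euclidean distances ||a - b||^2, computed by broadcasting.
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)

# Three 1-D points: nearby points get a kernel value close to 1,
# distant points a value close to 0.
X = np.array([[0.0], [1.0], [3.0]])
K = rbf_kernel(X, X, length_scale=1.0)
```

The resulting matrix is symmetric with ones on the diagonal, since $k_\gamma(z, z) = \exp(0) = 1$.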

## Prediction

Gaussian Process Regression is a nonparametric model. Predictions are obtained directly from the conditional Gaussian distribution.

Given a dataset of observed inputs and outputs $(X, Y)$, we want to predict the outputs at a test set $X_{test}$.  
Let's define $Y_0 = [Y_{test}, Y]^T$, $X_0 = [X_{test}, X]^T$ and $\epsilon_0 = [0, \epsilon]^T$.  
The model is then : $[Y_{test}, Y]^T = f([X_{test}, X]^T) + [0, \epsilon]^T \sim N(0, K_0 + \Sigma_0)$  
where $K_0 = \begin{pmatrix}
          K_{aa} & K_{ab} \\
          K_{ba} & K_{bb} \\
         \end{pmatrix}$ and $\Sigma_0 = \begin{pmatrix}
          0 & 0 \\
          0 & \sigma^2 I_n \\
         \end{pmatrix}$, since the noise only affects the observed outputs $Y$. The index $a$ refers to the test points and the index $b$ to the training points.  
We can then compute the conditional distribution $(Y_{test}|X_{test}, X, Y) \sim N(m, D)$  
By using the conditional Gaussian distribution formulas, we find the following results :  
$m = K_{ab}(K_{bb} + \sigma^2I_n)^{-1}Y$  
$D = K_{aa} - K_{ab}(K_{bb} + \sigma^2I_n)^{-1}K_{ba}$  
where $K_{aa} = (k_\gamma(X_{test, i}, X_{test, j}))_{i,j}$  
$K_{ab} = (k_\gamma(X_{test, i}, X_j))_{i,j}$  
$K_{ba} = (k_\gamma(X_i, X_{test, j}))_{i,j}$  
$K_{bb} = (k_\gamma(X_i, X_j))_{i,j}$
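The conditioning step can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the names `gp_predict`, `K_train`, `K_cross` and `K_test` are hypothetical, the kernel is assumed to be the RBF kernel, and the noise is assumed to affect only the training outputs. The mean is computed as (test/train kernel) times (noisy train kernel)$^{-1}$ times $Y$, and the covariance as the test kernel minus the corresponding correction term.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)

def gp_predict(X, Y, X_test, length_scale=1.0, noise_var=0.1):
    """Posterior mean m and covariance D of the noise-free outputs at X_test."""
    K_train = rbf_kernel(X, X, length_scale) + noise_var * np.eye(len(X))
    K_cross = rbf_kernel(X_test, X, length_scale)      # test rows, train columns
    K_test = rbf_kernel(X_test, X_test, length_scale)  # test rows, test columns
    # Use a Cholesky factor K_train = L L^T instead of an explicit inverse.
    L = np.linalg.cholesky(K_train)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))  # K_train^{-1} Y
    V = np.linalg.solve(L, K_cross.T)                    # L^{-1} K_cross^T
    m = K_cross @ alpha
    D = K_test - V.T @ V
    return m, D

# Toy 1-D example: three noisy observations of sin, two test points.
X = np.array([[0.0], [1.0], [2.0]])
Y = np.sin(X).ravel()
X_test = np.array([[0.5], [1.5]])
m, D = gp_predict(X, Y, X_test, length_scale=1.0, noise_var=1e-2)
```

The diagonal of `D` gives the predictive variance at each test point. Solving with the Cholesky factor is a common design choice because it is cheaper and numerically more stable than forming the inverse.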

## Find the best kernel parameters

Let's apply the model to the dataset : $Y = f(X) + \epsilon$  
where $Y \in \mathbb{R}^n$, $X  \in \mathbb{R}^{n \times d}$ and $\epsilon \in \mathbb{R}^n$

Because $f(.)$ is a Gaussian Process, its values at the inputs follow a multivariate Gaussian distribution : $f(X) \sim N(0, K_\gamma)$ where $K_\gamma = (k_\gamma (X_i, X_j))_{1≤i, j≤n}$

Since $\epsilon \sim N(0, \sigma ^2 I_n)$ is independent of $f(X)$, the covariances add up and $(Y | X; \gamma) \sim N(0, K_\gamma + \sigma ^2 I_n)$
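To make this marginal distribution concrete, the sketch below draws one sample of $Y$ from $N(0, K_\gamma + \sigma^2 I_n)$ for synthetic 1-D inputs. The inputs, the length scale and the noise variance are arbitrary values chosen for illustration, and the RBF kernel is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, length_scale, noise_var = 50, 1.0, 0.1

# Synthetic inputs X in R^{n x 1} (an arbitrary choice for this example).
X = rng.uniform(-3.0, 3.0, size=(n, 1))

# RBF kernel matrix K_gamma on the inputs.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_gamma = np.exp(-0.5 * sq_dists / length_scale**2)

# Marginal covariance of Y: kernel part plus noise part.
cov_Y = K_gamma + noise_var * np.eye(n)

# If z ~ N(0, I_n) and cov_Y = L L^T, then L z ~ N(0, cov_Y).
L = np.linalg.cholesky(cov_Y)
Y = L @ rng.normal(size=n)
```

Adding the noise term $\sigma^2 I_n$ also makes the covariance matrix well conditioned, which is why the Cholesky factorisation succeeds here without extra jitter.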

Let's write the probability density :  
$P(Y=y | X; \gamma) = \frac{1}{(2\pi)^{n/2} \det(K_\gamma + \sigma^2 I_n)^{1/2}}\exp\left(-\frac{1}{2} y^T (K_\gamma + \sigma^2 I_n)^{-1} y\right)$  

We define the likelihood of the model, given the observed data :  
$L(\gamma) = P(Y=y | X; \gamma)$  
We want to find the parameters $\gamma$ that maximize the likelihood function (Maximum Likelihood Estimation). In practice, we prefer to maximize the log-likelihood : $\log(L(\gamma))$  
To stay consistent with the neural network architecture that we will use later, we will instead minimize the negative log-likelihood :  
$l(\gamma) = - \log (P(Y=y | X; \gamma))$

Let's write the resulting equation :  
$l(\gamma) = \frac{1}{2} ( n\log(2\pi) + \log \det(K_\gamma + \sigma^2 I_n) + y^T (K_\gamma + \sigma^2 I_n)^{-1} y )$  
To simplify the notation, from now on $K_\gamma$ denotes the noisy covariance matrix $K_\gamma + \sigma ^2 I_n$.
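The negative log-likelihood can be evaluated numerically as in the following sketch. The function name `negative_log_likelihood` is illustrative, and the argument `K_noisy` is assumed to already contain the noise term, i.e. the kernel matrix plus $\sigma^2 I_n$. The Cholesky factor is used for both the log-determinant and the quadratic form.

```python
import numpy as np

def negative_log_likelihood(K_noisy, y):
    """0.5 * (n log(2 pi) + log det K + y^T K^{-1} y), with K = kernel + sigma^2 I."""
    n = len(y)
    L = np.linalg.cholesky(K_noisy)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))           # log det K from the factor
    return 0.5 * (n * np.log(2.0 * np.pi) + log_det + y @ alpha)

# Small example where every term is easy to check by hand:
# K = 2 I_2 gives log det K = 2 log 2 and y^T K^{-1} y = 1 for y = (1, 1).
K_noisy = 2.0 * np.eye(2)
y = np.array([1.0, 1.0])
value = negative_log_likelihood(K_noisy, y)
```

Computing $\log \det$ from the diagonal of the Cholesky factor avoids the overflow or underflow that a direct determinant can produce for large $n$.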

Again, to stay consistent with the neural network approach (based on the chain rule), we need to compute the derivative of the negative log-likelihood with respect to $K_\gamma$ :  
$\frac {\partial l}{\partial K_\gamma} = \frac{1}{2} \left( \frac {\partial n\log(2\pi)}{\partial K_\gamma} + \frac {\partial \log \det K_\gamma }{\partial K_\gamma} + \frac {\partial y^T K_\gamma^{-1} y}{\partial K_\gamma} \right)$

Pre-requisites (for a symmetric invertible matrix $A$ and a vector $y$) :  
$\frac {\partial \log \det A}{\partial A} = A^{-1}$  
$\frac {\partial y^T A^{-1} y}{\partial A} = -A^{-1} y y^T A^{-1}$

The second identity follows from $d(A^{-1}) = -A^{-1} (dA) A^{-1}$, which gives $d(y^T A^{-1} y) = -y^T A^{-1} (dA) A^{-1} y$. The result is a matrix with the same shape as $A$, not a scalar.

Applied to each term :  
$\frac {\partial n\log(2\pi)}{\partial K_\gamma} = 0$  
$\frac {\partial \log \det K_\gamma}{\partial K_\gamma} = K_\gamma ^{-1}$  
$\frac {\partial y^T K_\gamma^{-1} y}{\partial K_\gamma} = -K_\gamma^{-1} y y^T K_\gamma^{-1}$

Finally :  
$\frac {\partial l}{\partial K_\gamma} = \frac{1}{2} (K_\gamma ^{-1} - K_\gamma^{-1} y y^T K_\gamma^{-1})$  
This is the final result given in the "Deep Kernel Learning" paper.
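The matrix expression $\frac{1}{2} (K_\gamma ^{-1} - K_\gamma^{-1} y y^T K_\gamma^{-1})$ can be checked numerically with central finite differences, as sketched below. The function names `nll` and `nll_grad_K` are illustrative, and the test matrix is a random symmetric positive definite matrix rather than an actual kernel matrix. Each entry of $K$ is perturbed independently, so the check uses `slogdet` and `solve`, which do not require symmetry.

```python
import numpy as np

def nll(K, y):
    """Negative log-likelihood as a function of the (noisy) covariance K."""
    n = len(y)
    _, log_det = np.linalg.slogdet(K)
    return 0.5 * (n * np.log(2.0 * np.pi) + log_det + y @ np.linalg.solve(K, y))

def nll_grad_K(K, y):
    """Analytic gradient 0.5 * (K^{-1} - K^{-1} y y^T K^{-1}) for symmetric K."""
    K_inv = np.linalg.inv(K)
    alpha = K_inv @ y
    return 0.5 * (K_inv - np.outer(alpha, alpha))

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
K = A @ A.T + 4.0 * np.eye(4)   # random symmetric positive definite matrix
y = rng.normal(size=4)

# Central finite differences, perturbing one entry of K at a time.
eps = 1e-6
numeric = np.zeros_like(K)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(K)
        E[i, j] = eps
        numeric[i, j] = (nll(K + E, y) - nll(K - E, y)) / (2 * eps)

max_err = np.max(np.abs(numeric - nll_grad_K(K, y)))
```

A small `max_err` indicates that the analytic gradient and the numerical one agree entry by entry.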

In the case of Deep Kernel Learning, we have $K_{\gamma, w} = (k_\gamma (h_w(X_i), h_w(X_j)))_{1≤i, j≤n}$ where $h_w(.)$ represents the Neural Network.  
During the backpropagation, we use the chain rule to compute $\frac {\partial l}{\partial \gamma}$ and $\frac {\partial l}{\partial w}$ from $\frac {\partial K_{\gamma, w}}{\partial \gamma}$ and $\frac {\partial K_{\gamma, w}}{\partial w}$ in the following way, each product being summed over the entries of $K_{\gamma, w}$ :  
$\frac {\partial l}{\partial \gamma} = \frac {\partial l}{\partial K_{\gamma, w}} \frac {\partial K_{\gamma, w}}{\partial \gamma}$  
$\frac {\partial l}{\partial w} = \frac {\partial l}{\partial K_{\gamma, w}} \frac {\partial K_{\gamma, w}}{\partial w}$  
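
As a minimal sketch of this chain rule, the example below computes the derivative of the negative log-likelihood with respect to the length scale of an RBF kernel and compares it with a finite-difference estimate. Several simplifying assumptions are made: $h_w$ is taken to be the identity (no neural network), the only kernel parameter is the length scale, and the function names are hypothetical. The chain rule is applied as a sum over all entries of the kernel matrix.

```python
import numpy as np

def sq_dists(X):
    return np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

def nll(length_scale, X, y, noise_var):
    K = np.exp(-0.5 * sq_dists(X) / length_scale**2) + noise_var * np.eye(len(X))
    _, log_det = np.linalg.slogdet(K)
    return 0.5 * (len(y) * np.log(2.0 * np.pi) + log_det + y @ np.linalg.solve(K, y))

def nll_grad_length_scale(length_scale, X, y, noise_var):
    D2 = sq_dists(X)
    K_rbf = np.exp(-0.5 * D2 / length_scale**2)
    K = K_rbf + noise_var * np.eye(len(X))
    K_inv = np.linalg.inv(K)
    alpha = K_inv @ y
    dl_dK = 0.5 * (K_inv - np.outer(alpha, alpha))  # dl/dK
    dK_dl = K_rbf * D2 / length_scale**3            # dK/d(length_scale), entrywise
    # Chain rule: sum over all entries (i, j) of dl/dK_ij * dK_ij/d(length_scale).
    return np.sum(dl_dK * dK_dl)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
y = rng.normal(size=5)
ls, noise_var, eps = 0.8, 0.1, 1e-6

analytic = nll_grad_length_scale(ls, X, y, noise_var)
numeric = (nll(ls + eps, X, y, noise_var) - nll(ls - eps, X, y, noise_var)) / (2 * eps)
```

With a neural network $h_w$, the same entrywise sum would be carried out by backpropagation through the kernel inputs; automatic differentiation frameworks handle this step without writing $\frac {\partial K_{\gamma, w}}{\partial w}$ by hand.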