# Gaussian Process

So far we have inferred the parameters of a function, $p(\theta|D)$, rather than the function itself, $p(f|D)$. Given inputs $x_i$ and outputs $y_i$, we assume $y_i = f(x_i)$ for some unknown function $f$. The optimal approach is to infer the posterior over functions, $p(f|X,y)$, and use it to make predictions:

$$
\begin{equation}
p(y_*| x_*,X,y) = \int p(y_*|f,x_*)p(f|X,y) df
\end{equation}
$$

A Gaussian process (GP) provides a way to do this. A GP defines a prior over functions, which can be converted into a posterior over functions once we have seen some data. Although it is hard to represent a distribution over functions directly, it suffices to define a distribution over the function's values at a finite but arbitrary set of points. A GP assumes these values are jointly Gaussian.

GPs give well-calibrated probabilistic outputs, whereas some other kernel methods, such as L1VM, RVM and SVM, do not.

## Regression
Let the prior on the regression function be a GP:

$$
\begin{equation}
f(x) \sim GP(m(x),\kappa(x,x'))
\end{equation}
$$

where,

$$
\begin{align}
m(x) &= E[f(x)] \\
\kappa(x,x') &= E[(f(x)-m(x))(f(x')-m(x'))^T]
\end{align}
$$

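For any finite set of points $X = \{x_1,\dots,x_N\}$, the GP defines a joint Gaussian over the function values $\mathbf{f} = (f(x_1),\dots,f(x_N))$, with $\mu_i = m(x_i)$ and $K_{ij} = \kappa(x_i,x_j)$: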

$$
\begin{equation}
p(\mathbf{f|X}) = N(\mathbf{f|\mu,K})
\end{equation}
$$

The mean function is usually set to be 0, as the GP is flexible enough to model the mean arbitrarily well.
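To make the finite-dimensional view concrete, here is a minimal sketch that draws function values from a zero-mean GP prior at a grid of inputs. The squared-exponential kernel, its hyperparameters (`length_scale`, `signal_var`), the jitter term, and the helper names `se_kernel` and `sample_gp_prior` are all assumptions chosen for illustration; the notes above do not fix a particular kernel.

```python
import numpy as np

def se_kernel(xa, xb, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays xa and xb.

    The kernel choice is an assumption for this sketch, not part of the notes.
    """
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

def sample_gp_prior(x, n_samples=3, jitter=1e-6, seed=0):
    """Draw f ~ N(0, K) at the finite set of inputs x (zero mean function)."""
    rng = np.random.default_rng(seed)
    # A small diagonal jitter keeps the Cholesky factorization numerically stable.
    K = se_kernel(x, x) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)
    z = rng.standard_normal((len(x), n_samples))
    # If z ~ N(0, I), then L @ z ~ N(0, L L^T) = N(0, K).
    return (L @ z).T

x = np.linspace(-5.0, 5.0, 50)
samples = sample_gp_prior(x)
print(samples.shape)  # (3, 50)
```

Each row of `samples` is one draw of the function evaluated on the grid. Shrinking `length_scale` should produce wigglier draws, which is one way to see how the kernel encodes prior assumptions about smoothness.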


### Predictions using noise-free observation

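When the observations are assumed to be noiseless, $f(\mathbf{x})$ has no uncertainty at a training input $\mathbf{x}$, so the GP should interpolate the training data exactly. By the definition of a GP, the joint distribution of the training values $f$ and the test values $f_*$ has the following form: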

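$$
\begin{equation}
\begin{pmatrix} f \\ f_* \end{pmatrix}
\sim N\left(
\begin{pmatrix} \mu \\ \mu_* \end{pmatrix},
\begin{pmatrix} K & K_* \\ K_*^T & K_{**} \end{pmatrix}
\right)
\end{equation}
$$

where $K = \kappa(X,X)$, $K_* = \kappa(X,X_*)$ and $K_{**} = \kappa(X_*,X_*)$.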
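As a rough illustration of noise-free prediction, the sketch below conditions a joint Gaussian over training values $f$ and test values $f_*$ on the observed $f$, using the standard Gaussian conditioning result with a zero mean function: posterior mean $K_*^T K^{-1} f$ and posterior covariance $K_{**} - K_*^T K^{-1} K_*$. The squared-exponential kernel, the jitter value, the toy data, and the function name `gp_posterior_noise_free` are assumptions made for this example.

```python
import numpy as np

def se_kernel(xa, xb, length_scale=1.0):
    """Squared-exponential kernel (an assumed choice for this sketch)."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior_noise_free(x_train, f_train, x_test, jitter=1e-8):
    """Posterior mean and covariance of f_* given noise-free f (zero prior mean)."""
    K = se_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    K_s = se_kernel(x_train, x_test)   # shape (N, N_*)
    K_ss = se_kernel(x_test, x_test)   # shape (N_*, N_*)
    # Solve with a Cholesky factor instead of forming K^{-1} explicitly.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f_train))  # K^{-1} f
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v               # K_** - K_*^T K^{-1} K_*
    return mean, cov

# Toy data: noise-free samples of sin(x); two test points coincide with training inputs.
x_train = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
f_train = np.sin(x_train)
x_test = np.array([-2.0, 0.5, 3.0])
mean, cov = gp_posterior_noise_free(x_train, f_train, x_test)
```

At test points that coincide with training inputs, the posterior mean should reproduce the observed value and the posterior variance should be close to zero, while the variance at `0.5` stays strictly positive because that point was never observed.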