**Gaussian Process**

## Linear Regression (radial basis function regression) and Curse of Dimensionality

For example, consider a one-dimensional input $x$ and the feature vector $\phi(x)=(1,x,x^2,x^3)^T$. A cubic function $y$ of $x$ can then be written with the weight vector $\mathbf{w}=(w_0,w_1,w_2,w_3)^T$ as
$$
\begin{eqnarray}
y &=& w_0+w_1x+w_2x^2+w_3x^3\\
   &=&\mathbf{w}^T\phi(x)
\end{eqnarray}
$$
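The identity above can be checked with a few lines of NumPy. This is a minimal sketch: the weights and the input value are hypothetical numbers chosen only for illustration.

```python
import numpy as np

def phi(x):
    """Feature vector (1, x, x^2, x^3) for a scalar input x."""
    return np.array([1.0, x, x**2, x**3])

# Hypothetical weights (w0, w1, w2, w3), chosen only for illustration.
w = np.array([1.0, -2.0, 0.5, 3.0])

x = 2.0
y_inner = w @ phi(x)                                   # w^T phi(x)
y_direct = w[0] + w[1] * x + w[2] * x**2 + w[3] * x**3  # explicit cubic

print(y_inner, y_direct)
```

Both expressions evaluate the same cubic, so the model is linear in $\mathbf{w}$ even though it is nonlinear in $x$.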
If we instead use Gaussian-shaped basis functions, i.e.
$$
\phi_h(x)=\exp\left(-\frac{(x-\mu_h)^2}{\sigma^2}\right)
$$
place the centers at $\mu_h\in\{-H,...,-2,-1,0,1,2,...,H\}$, and assign each a weight $w_h\in\mathbb{R}$, then $y$ is written as
$$
y = \sum_{h=-H}^{H}w_h\exp\left(-\frac{(x-\mu_h)^2}{\sigma^2}\right)
$$
This approach is called **radial basis function regression**. At first glance it seems to work well, but it has a critical issue. In one dimension the number of centers $\mu_h$ is $2H+1$, and in two dimensions it is $(2H+1)^2$. In general, for a $D$-dimensional input the number of centers, and with it the dimension of $\mathbf{w}$, grows as $(2H+1)^D$, so the computation quickly becomes intractable as $D$ increases. This problem is known as the **curse of dimensionality**.
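As a rough illustration, the sketch below fits a one-dimensional radial basis function regression by ordinary least squares and then counts how many grid centers would be needed in higher dimensions. The toy data (noisy samples of $\sin x$), the values of $H$ and $\sigma$, and the use of least squares to estimate $\mathbf{w}$ are all assumptions made for this example, not part of the text above.

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """Design matrix with entries exp(-(x_n - mu_h)^2 / sigma^2)."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-diff**2 / sigma**2)

H = 5
sigma = 1.0
centers = np.arange(-H, H + 1, dtype=float)  # mu_h = -H, ..., H

# Toy data (assumed for illustration): noisy samples of sin(x).
rng = np.random.default_rng(0)
x_train = np.linspace(-4.0, 4.0, 40)
y_train = np.sin(x_train) + 0.05 * rng.standard_normal(x_train.shape)

# Least-squares estimate of the 2H+1 weights.
Phi = rbf_features(x_train, centers, sigma)  # shape (40, 2H+1)
w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
y_fit = Phi @ w

# Number of grid centers (= number of weights) for a D-dimensional input.
n_centers = {D: (2 * H + 1) ** D for D in (1, 2, 3, 10)}
for D, n in n_centers.items():
    print(f"D={D}: {n} basis functions")
```

Even with the modest choice $H=5$, a ten-dimensional input would already require more than $10^{10}$ weights.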

## Gaussian Process

### Gaussian Process

How can we solve this problem? The solution is to take the expectation over $\mathbf{w}$ and remove $\mathbf{w}$ from the model by integrating it out. Consider the following setting, where $\Phi$ is the design matrix whose $n$-th row is $\phi(\mathbf{x}_n)^T$:
$$
\mathbf{y} =\Phi\mathbf{w}\\
\mathbf{w}\sim\mathcal{N}(\mathbf{0},\lambda^2\mathbf{I})
$$
This means that **$\mathbf{y}$ is a Gaussian-distributed vector $\mathbf{w}$ transformed by the constant matrix $\Phi$**. A linear transformation of a Gaussian vector is again Gaussian, so $\mathbf{y}$ also follows a Gaussian distribution. Its expected value $\mathbb{E}[\mathbf{y}]$ and covariance matrix $\Sigma$ are
$$
\mathbb{E}[\mathbf{y}]=\mathbb{E}[\Phi\mathbf{w}]=\Phi\mathbb{E}[\mathbf{w}]=\mathbf{0}\\
\begin{eqnarray}
\Sigma&=&\mathbb{E}[\mathbf{yy}^T]-\mathbb{E}[\mathbf{y}]\mathbb{E}[\mathbf{y}^T]\\
&=&\mathbb{E}[(\Phi\mathbf{w})(\Phi\mathbf{w})^T]\\
&=&\Phi\mathbb{E}[\mathbf{ww}^T]\Phi^T\\
&=&\lambda^2\Phi\Phi^T
\end{eqnarray}
$$
As a result, $\mathbf{y}$ follows the multivariate Gaussian distribution
$$
\mathbf{y}\sim\mathcal{N}(\mathbf{0},\lambda^2\Phi\Phi^T)=\mathcal{N}(\mathbf{0},\mathbf{K})
$$
This relation is called a **Gaussian process**. It can be regarded as **a Gaussian distribution of infinite dimension**, i.e. **if $\mathbf{w}$ follows a Gaussian distribution, then $\mathbf{y}$ follows a Gaussian process.**
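The result $\mathbf{y}\sim\mathcal{N}(\mathbf{0},\lambda^2\Phi\Phi^T)$ can be checked numerically by sampling many weight vectors and comparing the empirical covariance of $\mathbf{y}=\Phi\mathbf{w}$ with $\mathbf{K}$. The sketch below assumes the cubic features from the first section, four arbitrary input points, and a hypothetical value $\lambda=0.5$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5  # hypothetical prior scale lambda

# Design matrix Phi for N = 4 inputs with features (1, x, x^2, x^3).
x = np.linspace(-1.0, 1.0, 4)
Phi = np.stack([np.ones_like(x), x, x**2, x**3], axis=1)  # shape (4, 4)

# Covariance predicted by the derivation: K = lambda^2 Phi Phi^T.
K = lam**2 * Phi @ Phi.T

# Monte Carlo: draw S weight vectors w ~ N(0, lambda^2 I), map to y = Phi w.
S = 200_000
W = lam * rng.standard_normal((S, Phi.shape[1]))  # each row is one w
Y = W @ Phi.T                                     # each row is one y

mean_emp = Y.mean(axis=0)  # should be close to 0
K_emp = Y.T @ Y / S        # should be close to K

print(np.max(np.abs(mean_emp)))
print(np.max(np.abs(K_emp - K)))
```

Both printed deviations shrink as the number of samples `S` grows, which is consistent with the analytic mean and covariance.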

### Kernel Trick (Kernel function) 

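The distribution of $\mathbf{y}$ is determined solely by the entries of the covariance matrix $\mathbf{K}=\lambda^2\Phi\Phi^T$, i.e.
$$
K_{nn'}=\lambda^2\phi(\mathbf{x}_n)^{\mathrm{T}}\phi(\mathbf{x}_{n'})
$$
The constant $\lambda^2$ can be absorbed into the features by rescaling $\phi(\mathbf{x})$ to $\lambda\phi(\mathbf{x})$, so it is omitted below. If the value of $K_{nn'}$ is known in advance, $\phi(\mathbf{x})$ never has to be computed explicitly. The function that gives the value of $K_{nn'}$ is therefore called the **kernel function of $\mathbf{x}_n$ and $\mathbf{x}_{n'}$** and is written as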
$$
k(\mathbf{x}_n,\mathbf{x}_{n'})=\phi(\mathbf{x}_n)^{\mathrm{T}}\phi(\mathbf{x}_{n'})\\
$$
and the covariance matrix $\mathbf{K}$ is called the **kernel matrix** or the **Gram matrix of $\Phi$**. One example of a kernel function is the **polynomial kernel**, here of degree 2, which is defined as
$$
k(\mathbf{x},\mathbf{x}')=(\mathbf{x}^{\mathrm{T}}\mathbf{x}'+1)^2
$$
If $\mathbf{x}=(x_1,x_2)^{\mathrm{T}}$ and $\mathbf{x}'=(x_1',x_2')^{\mathrm{T}}$,
$$
\begin{eqnarray}
k(\mathbf{x},\mathbf{x}')&=&(x_1x_1'+x_2x_2'+1)^2\\
&=&x_1^2x_1'^2+x_2^2x_2'^2+2x_1x_2x_1'x_2'+2x_1x_1'+2x_2x_2'+1\\
&=&(x_1^2,x_2^2,\sqrt{2}x_1x_2,\sqrt{2}x_1,\sqrt{2}x_2,1)\cdot(x_1'^2,x_2'^2,\sqrt{2}x_1'x_2',\sqrt{2}x_1',\sqrt{2}x_2',1)
\end{eqnarray}
$$
This means the kernel corresponds to the feature vector $\phi(\mathbf{x})=(x_1^2,x_2^2,\sqrt{2}x_1x_2,\sqrt{2}x_1,\sqrt{2}x_2,1)^{\mathrm{T}}$. Computing the inner product through the kernel function alone, without ever forming the feature vector $\phi(\mathbf{x})$ explicitly, is called the **kernel trick**.
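The equivalence between the degree-2 polynomial kernel and the explicit six-dimensional feature vector can be verified numerically. The following sketch uses arbitrary example points; the function names `poly_kernel` and `phi` are introduced here only for illustration.

```python
import numpy as np

def poly_kernel(x, xp):
    """Polynomial kernel k(x, x') = (x^T x' + 1)^2."""
    return (x @ xp + 1.0) ** 2

def phi(x):
    """Explicit feature vector of the kernel for 2-D inputs."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([x1**2, x2**2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0])

# Arbitrary example inputs.
x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

k_trick = poly_kernel(x, xp)     # no feature vector needed
k_explicit = phi(x) @ phi(xp)    # inner product in feature space
print(k_trick, k_explicit)       # both are 4 up to rounding error

# The same check for a whole kernel (Gram) matrix.
X = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.0]])
K_trick = (X @ X.T + 1.0) ** 2
Phi = np.stack([phi(row) for row in X])
K_explicit = Phi @ Phi.T
```

The kernel version works directly on the two-dimensional inputs, while the explicit version first has to build the six-dimensional features.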

Note: For some kernel functions, the feature vector $\phi(\mathbf{x})$ is infinite-dimensional. Even then, the kernel matrix can be computed easily from the kernel function, without representing $\phi(\mathbf{x})$ explicitly.

### Formal definition of Gaussian Process

Suppose that, for any natural number $N$ and any inputs $\mathbf{x}_1,\mathbf{x}_2,...,\mathbf{x}_N\in\mathcal{X}$, the corresponding output vector
$$
\mathbf{f}=(f(\mathbf{x}_1),f(\mathbf{x}_2),...,f(\mathbf{x}_N))
$$
follows $\mathcal{N}(\boldsymbol{\mu},\mathbf{K})$, where $\boldsymbol{\mu}=(\mu(\mathbf{x}_1),...,\mu(\mathbf{x}_N))$ and $K_{nn'}=k(\mathbf{x}_n,\mathbf{x}_{n'})$. Then $f$ is said to **follow a Gaussian process**, which is written as
$$
f\sim GP(\mu(\mathbf{x}),k(\mathbf{x},\mathbf{x}'))
$$
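The definition suggests a direct way to draw functions from a Gaussian process: pick $N$ inputs, build $\boldsymbol{\mu}$ and $\mathbf{K}$, and sample from $\mathcal{N}(\boldsymbol{\mu},\mathbf{K})$. The sketch below assumes a zero mean function and a Gaussian (RBF) kernel with hypothetical parameters `theta1` and `theta2`; neither choice comes from the text above. A small diagonal term is added to $\mathbf{K}$ purely for numerical stability of the Cholesky factorization.

```python
import numpy as np

def rbf_kernel(xa, xb, theta1=1.0, theta2=1.0):
    """Gaussian kernel theta1 * exp(-(x - x')^2 / theta2) for 1-D inputs."""
    d = xa[:, None] - xb[None, :]
    return theta1 * np.exp(-d**2 / theta2)

rng = np.random.default_rng(0)

# N input points and the corresponding mean vector and kernel matrix.
x = np.linspace(-5.0, 5.0, 50)
mu = np.zeros_like(x)          # assumed mean function mu(x) = 0
K = rbf_kernel(x, x)           # K[n, n'] = k(x_n, x_n')

# Sample f ~ N(mu, K) via a Cholesky factor: f = mu + L z with z ~ N(0, I).
# The jitter term is a numerical safeguard, not part of the model.
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))
z = rng.standard_normal((len(x), 3))
samples = mu[:, None] + L @ z  # shape (50, 3); each column is one draw of f

print(samples.shape)
```

Each column of `samples` holds the values of one random function at the 50 inputs; using a finer grid of inputs gives a finer view of the same kind of function.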