## Hamiltonian Monte Carlo for linear regression

In linear regression, we are given a dataset $\mathcal{D}=\{(\mathbf{x}_i, y_i)\}_{i=1}^N$, and the objective is to fit a model of the form $p(y|\mathbf{w}, \mathbf{x}) = \mathcal{N}(y | \mathbf{w}^T\mathbf{x} + b, \sigma^2)$ to this data.

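To simplify the notation, let us define the design matrix
$$
\mathbf{X} = \begin{pmatrix}
x_1^1 & x_1^2 & \cdots & x_1^d & 1\\
x_2^1 & x_2^2 & \cdots & x_2^d & 1\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
x_N^1 & x_N^2 & \cdots & x_N^d & 1
\end{pmatrix}
$$
where $x_i^j$ denotes the $j$-th feature of the $i$-th input (a superscript index, not a power). The final column of ones carries the bias.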

Let us also collect the parameters of the mean into the vector $\boldsymbol{\theta} = [w_1, w_2, \dots, w_d, b]^T$ and the outputs into the vector $\mathbf{y} = [y_1, y_2, \dots, y_N]^T$.

Then
  
$$\mathbf{X}\boldsymbol{\theta} = \begin{pmatrix}
\mathbf{w}^T\mathbf{x}_1 + b\\
\mathbf{w}^T\mathbf{x}_2 + b\\
\vdots\\
\mathbf{w}^T\mathbf{x}_N + b
\end{pmatrix}$$
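As a quick check of this construction, the following minimal NumPy sketch builds the design matrix by appending a column of ones and confirms that each row of $\mathbf{X}\boldsymbol{\theta}$ equals $\mathbf{w}^T\mathbf{x}_i + b$. The helper name `make_design_matrix` and the numeric values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def make_design_matrix(features):
    """Append a column of ones to an (N, d) feature array, giving an (N, d+1) matrix."""
    features = np.asarray(features, dtype=float)
    ones = np.ones((features.shape[0], 1))
    return np.hstack([features, ones])

# Illustrative values only (hypothetical, not taken from the text).
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])
w = np.array([0.5, -1.0])
b = 2.0

X = make_design_matrix(features)
theta = np.append(w, b)   # theta = [w_1, ..., w_d, b]
means = X @ theta         # row i equals w^T x_i + b
```

Here `means` agrees with `features @ w + b`, so the bias no longer needs separate handling once it is folded into `theta`.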

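We next define noninformative priors for the model parameters. For the weights $\mathbf{w}$, the support is $\mathbb{R}^d$. A natural choice is therefore $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mathbf{0}, \sigma_0^2\mathbf{I})$, where the variance $\sigma_0^2$ is large. No separate prior is specified for the bias $b$; the posterior below contains only a $\mathbf{w}^T\mathbf{w}$ term, which corresponds to a flat prior on $b$.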

We place the prior for the noise on the precision $\tau = \sigma^{-2}$. Its support is $(0, \infty)$, so a natural choice is a Gamma distribution, $p(\tau) = \mathrm{Gamma}(\tau | \alpha_0, \beta_0)$, with shape $\alpha_0$ and rate $\beta_0$.

The posterior distribution is
$$p(\boldsymbol{\theta}, \tau | \mathcal{D}) = \frac{p(\boldsymbol{\theta}, \tau)p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}, \tau)}{Z} \propto \exp\left(-\frac{1}{2\sigma_0^2}\mathbf{w}^T\mathbf{w}\right) \tau^{\alpha_0 - 1} \exp(-\beta_0\tau) \prod_{i=1}^N \sqrt{\frac{\tau}{2\pi}}\exp\left(-\frac{\tau}{2}(y_i - \mathbf{w}^T\mathbf{x}_i - b)^2\right)$$

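In the likelihood, the product of exponentials becomes the exponential of a sum, and the $N$ factors of $\sqrt{\tau}$ combine with the prior term $\tau^{\alpha_0 - 1}$. Dropping constants and collecting terms gives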

$$
\begin{aligned}
p(\boldsymbol{\theta}, \tau | \mathcal{D}) &\propto \tau^{\alpha_0 + N/2 - 1} \exp\left(-\frac{\tau}{2}\sum_{i=1}^N (y_i - \mathbf{w}^T\mathbf{x}_i - b)^2 - \frac{1}{2\sigma_0^2}\mathbf{w}^T\mathbf{w} - \beta_0\tau\right) \\
&= \tau^{\alpha_0 + N/2 - 1} \exp\left(-\frac{\tau}{2}(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) - \frac{1}{2\sigma_0^2}\mathbf{w}^T\mathbf{w} - \beta_0\tau\right)
\end{aligned}
$$
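The unnormalized joint log posterior can be written down directly from this expression. The sketch below is one possible NumPy version; the function name `log_posterior` and the default hyperparameters (`sigma0_sq=100.0`, `alpha0=1.0`, `beta0=1.0`) are assumptions made for illustration, and the Gaussian prior is applied only to the weights, as in the $\mathbf{w}^T\mathbf{w}$ term.

```python
import numpy as np

def log_posterior(theta, tau, X, y, sigma0_sq=100.0, alpha0=1.0, beta0=1.0):
    """Unnormalized log p(theta, tau | D).

    theta[:-1] holds the weights w and theta[-1] the bias b. Only w gets the
    Gaussian prior term; the hyperparameter defaults are illustrative assumptions.
    """
    if tau <= 0:
        return -np.inf                      # tau has positive support
    n = y.shape[0]
    w = theta[:-1]
    resid = y - X @ theta                   # y - X theta
    return ((alpha0 + n / 2 - 1) * np.log(tau)
            - 0.5 * tau * (resid @ resid)
            - 0.5 * (w @ w) / sigma0_sq
            - beta0 * tau)
```

Working on the log scale avoids underflow when $N$ is large, and the additive constant $\log Z$ is irrelevant for MCMC.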

Jointly in $(\boldsymbol{\theta}, \tau)$, this does not resemble any known distribution, so we will use MCMC to sample from the posterior. We will use a Gibbs sampling scheme that alternates between sampling from the conditional distributions $p(\boldsymbol{\theta}|\tau, \mathcal{D})$ and $p(\tau|\boldsymbol{\theta}, \mathcal{D})$. For sampling the weights and the bias from $p(\boldsymbol{\theta}|\tau, \mathcal{D})$, we will use HMC. However, HMC is not well suited to sampling the precision from $p(\tau|\boldsymbol{\theta}, \mathcal{D})$, since the support is constrained to be positive. Since the prior we have chosen for $\tau$ is conjugate to the Gaussian likelihood when $\boldsymbol{\theta}$ is held fixed, we conveniently get

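$$p(\tau|\boldsymbol{\theta}, \mathcal{D}) \propto \tau^{\alpha_0 + N/2 - 1} \exp\left(-\tau\left(\beta_0 + \frac{1}{2}(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})\right)\right)$$

which is a Gamma distribution with shape $\alpha_0 + N/2$ and rate $\beta_0 + \frac{1}{2}(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})$, so $\tau$ can be sampled exactly.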
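To make the alternating scheme concrete, here is a minimal sketch of one way to implement it: a leapfrog HMC update for $\boldsymbol{\theta}$ given $\tau$, followed by an exact draw of $\tau$ from a Gamma distribution with shape $\alpha_0 + N/2$ and rate $\beta_0 + \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2$. All function names, the identity mass matrix, and the defaults for the step size, number of leapfrog steps, and hyperparameters are assumptions for illustration, not the notebook's own implementation.

```python
import numpy as np

def neg_log_cond(theta, tau, X, y, sigma0_sq):
    """Potential energy U(theta) = -log p(theta | tau, D), up to a constant."""
    resid = y - X @ theta
    w = theta[:-1]
    return 0.5 * tau * (resid @ resid) + 0.5 * (w @ w) / sigma0_sq

def grad_neg_log_cond(theta, tau, X, y, sigma0_sq):
    """Gradient of U with respect to theta; the bias has no prior term."""
    grad = -tau * (X.T @ (y - X @ theta))
    grad[:-1] += theta[:-1] / sigma0_sq
    return grad

def hmc_step(theta, tau, X, y, sigma0_sq, rng, step_size, n_leapfrog):
    """One HMC transition for theta with an identity mass matrix."""
    p0 = rng.standard_normal(theta.shape)
    q = theta.copy()
    # Initial half step for the momentum.
    p = p0 - 0.5 * step_size * grad_neg_log_cond(q, tau, X, y, sigma0_sq)
    for i in range(n_leapfrog):
        q = q + step_size * p
        # Full momentum steps, except a half step at the end of the trajectory.
        scale = 1.0 if i < n_leapfrog - 1 else 0.5
        p = p - scale * step_size * grad_neg_log_cond(q, tau, X, y, sigma0_sq)
    h_old = neg_log_cond(theta, tau, X, y, sigma0_sq) + 0.5 * (p0 @ p0)
    h_new = neg_log_cond(q, tau, X, y, sigma0_sq) + 0.5 * (p @ p)
    # Metropolis correction for the discretization error.
    if np.log(rng.uniform()) < h_old - h_new:
        return q, True
    return theta, False

def sample_tau(theta, X, y, alpha0, beta0, rng):
    """Exact draw from the Gamma conditional of tau."""
    resid = y - X @ theta
    shape = alpha0 + 0.5 * y.shape[0]
    rate = beta0 + 0.5 * (resid @ resid)
    return rng.gamma(shape, 1.0 / rate)   # NumPy parameterizes by scale = 1 / rate

def gibbs_hmc(X, y, n_samples=2000, sigma0_sq=100.0, alpha0=1.0, beta0=1.0,
              step_size=0.02, n_leapfrog=20, seed=0):
    """Alternate an HMC update of theta with a Gibbs update of tau."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    tau = 1.0
    thetas = np.empty((n_samples, X.shape[1]))
    taus = np.empty(n_samples)
    for s in range(n_samples):
        theta, _ = hmc_step(theta, tau, X, y, sigma0_sq, rng, step_size, n_leapfrog)
        tau = sample_tau(theta, X, y, alpha0, beta0, rng)
        thetas[s], taus[s] = theta, tau
    return thetas, taus
```

A call such as `thetas, taus = gibbs_hmc(X, y)` returns the chains for $\boldsymbol{\theta}$ and $\tau$; early samples would normally be discarded as burn-in. The step size needs tuning to the data, because the curvature of the potential scales with $\tau\,\mathbf{X}^T\mathbf{X}$.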

```python
import numpy as np
import matplotlib.pyplot as plt
```