In [1]:
import torch
from torch.distributions import MultivariateNormal
import math

## Multivariate Bayesian Inference of the Mean of a Gaussian Likelihood with Known Precision

Previously, we studied the univariate case. Now let us consider the scenario where the data instances are multi-dimensional, i.e., vectors. This leads us to the multivariate case.

In particular, the training dataset is
$$
X \equiv \left\lbrace 
\vec{x}^{ \left( 1 \right) }, \vec{x}^{ \left( 2 \right) }, \cdots, \vec{x}^{ \left( i \right) }, \cdots, \vec{x}^{ \left( n \right) }
\right\rbrace
$$

Here, we assume the covariance of the data is known (a constant), but the mean of the data is unknown and is modeled as a Gaussian random variable.

We will express the Gaussian in terms of the precision matrix ${\Lambda}$, instead of the covariance matrix ${\Sigma}$ where ${\Lambda} = {\Sigma}^{-1}$.

Since we assume that the data is normally distributed, the likelihood is
$$
p\left( X \middle\vert \vec{\mu} \right) \propto e^{ -\frac{1}{2} \sum_{i=1}^{n} \left( \vec{x}^{ \left( i \right) } - \vec{ \mu } \right)^{T} {\Lambda} \left( \vec{x}^{ \left( i \right) } - \vec{ \mu } \right) }$$

The precision matrix $\Lambda$ is known; hence it is treated as a constant rather than a random variable.

The mean  $\vec{\mu}$  is unknown and is treated as a random variable. This too is assumed to be a Gaussian, with mean  $\vec{\mu_{0}}$  and precision matrix $\Lambda_{0}$ (not to be confused with  $\vec{\mu}$  and  $\Lambda$  - the mean and precision matrix of the data itself ). Hence, the prior is

$$p\left( \vec{\mu }\right) \propto e^{ -\frac{1}{2} \left( \vec{\mu }- \vec{ \mu_{0} }\right)^{T} {\Lambda}_{0} \left( \vec{\mu }- \vec{ \mu_{0} }\right) }
$$


Using Bayes' theorem, the posterior probability is, up to a normalizing constant that does not depend on $\vec{\mu}$,

$$\overbrace{
p\left(\vec{\mu} \middle\vert X \right)
}^{posterior}
\propto
\overbrace{
p\left( X \middle\vert \vec{\mu} \right)
}^{likelihood}
\overbrace{
p\left(\vec{\mu} \right)
}^{prior}$$

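The right-hand side is the product of two Gaussians in $\vec{\mu}$, which is itself proportional to a Gaussian. Let us denote its mean and precision matrix as $\vec{\mu_{n}}$ and $\Lambda_{n}$. With $\bar{\vec{x}} = \frac{1}{n} \sum_{i=1}^{n} \vec{x}^{ \left( i \right) }$ denoting the sample mean, they are given by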
$$
\begin{align*}
&{\Lambda}_{n} = n {\Lambda} + {\Lambda}_{0} \\
& \vec{\mu_{n}} = {\Lambda}_{n}^{-1} \left( n{\Lambda} \bar{\vec{x}} + {\Lambda}_{0} \vec{\mu}_{0} \right)    
\end{align*}$$
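
These expressions follow from completing the square in the exponent of the likelihood-prior product. A sketch of that step, with $\bar{\vec{x}}$ the sample mean, using the symmetry of ${\Lambda}$ and ${\Lambda}_{0}$ and collecting every term that does not depend on $\vec{\mu}$ into a constant:

$$
\begin{align*}
-2 \log p\left( \vec{\mu} \middle\vert X \right)
&= \sum_{i=1}^{n} \left( \vec{x}^{ \left( i \right) } - \vec{\mu} \right)^{T} {\Lambda} \left( \vec{x}^{ \left( i \right) } - \vec{\mu} \right)
 + \left( \vec{\mu} - \vec{\mu}_{0} \right)^{T} {\Lambda}_{0} \left( \vec{\mu} - \vec{\mu}_{0} \right) + \text{const} \\
&= \vec{\mu}^{T} \left( n{\Lambda} + {\Lambda}_{0} \right) \vec{\mu}
 - 2 \vec{\mu}^{T} \left( n{\Lambda} \bar{\vec{x}} + {\Lambda}_{0} \vec{\mu}_{0} \right) + \text{const} \\
&= \left( \vec{\mu} - \vec{\mu}_{n} \right)^{T} {\Lambda}_{n} \left( \vec{\mu} - \vec{\mu}_{n} \right) + \text{const}
\end{align*}
$$

Matching the quadratic term gives ${\Lambda}_{n} = n{\Lambda} + {\Lambda}_{0}$, and matching the linear term gives ${\Lambda}_{n} \vec{\mu}_{n} = n{\Lambda} \bar{\vec{x}} + {\Lambda}_{0} \vec{\mu}_{0}$.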


In [2]:
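def inference_known_precision(X, prior_dist, precision_known):
    """Posterior over the mean of a Gaussian whose precision matrix is known.

    X: (n, d) tensor of samples; prior_dist: MultivariateNormal prior on the mean;
    precision_known: (d, d) known precision matrix of the data.
    """
    mu_mle = X.mean(dim=0)  # sample mean, which is also the MLE of the mean
    n = X.shape[0]

    # Parameters of the prior
    mu_0 = prior_dist.mean
    precision_0 = prior_dist.precision_matrix

    # Parameters of the posterior: Lambda_n = n * Lambda + Lambda_0
    precision_n = n * precision_known + precision_0
    # mu_n as a row vector: (n * x_bar^T Lambda + mu_0^T Lambda_0) Lambda_n^{-1}.
    # This matches the column-vector formula because the precision matrices are symmetric.
    weighted_sum = (n * torch.matmul(mu_mle.unsqueeze(0), precision_known)
                    + torch.matmul(mu_0.unsqueeze(0), precision_0))
    mu_n = torch.matmul(weighted_sum, torch.inverse(precision_n))  # shape (1, d)
    posterior_dist = MultivariateNormal(mu_n, precision_matrix=precision_n)
    return posterior_dist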

In [3]:
# Let us assume that the true distribution is a normal distribution. The true distribution corresponds 
# to a single class.
precision_known = torch.tensor([[0.1, 0], [0, 0.1]], dtype=torch.float)
true_dist = MultivariateNormal(torch.tensor([20, 10], dtype=torch.float), precision_matrix=precision_known)

In [4]:
# Case 1
# Let us assume our prior is a Normal distribution with a good estimate of the mean

prior_mu = torch.tensor([19, 9], dtype=torch.float)
prior_precision = torch.tensor([[0.33, 0], [0, 0.33]], dtype=torch.float)
prior_dist = MultivariateNormal(prior_mu, precision_matrix=prior_precision)

torch.manual_seed(0)
                           
#Number of samples is low. 
n = 3
X = true_dist.sample((n,))
posterior_dist_low_n = inference_known_precision(X, prior_dist, precision_known)

mu_mle = X.mean(dim=0)
mu_map = posterior_dist_low_n.mean


# When n is low, the prior carries substantial weight in the posterior. Thus, a good prior can help offset the lack of data.
# We can see this in the following case.

# With a small sample (n=3), the MLE of the mean is farther from the true mean than the MAP estimate.

print(f"True mean: {true_dist.mean}")
print(f"MAP mean: {mu_map}")
print(f"MLE mean: {mu_mle}")

True mean: tensor([20., 10.])
MAP mean: tensor([[18.6117,  8.9122]])
MLE mean: tensor([18.1845,  8.8156])


In [5]:
# Case 2
prior_mu = torch.tensor([19, 9], dtype=torch.float)
prior_precision = torch.tensor([[0.33, 0], [0, 0.33]], dtype=torch.float)
prior_dist = MultivariateNormal(prior_mu, precision_matrix=prior_precision)

torch.manual_seed(0)
                           
#Number of samples is high. 
n = 1000
X = true_dist.sample((n,))
posterior_dist_high_n = inference_known_precision(X, prior_dist, precision_known)

mu_mle = X.mean(dim=0)
mu_map = posterior_dist_high_n.mean


# When n is high, the MLE tends to converge to the true mean. The MAP estimate in turn approaches the MLE,
# and therefore also converges to the true mean.

print(f"True mean: {true_dist.mean}")
print(f"MAP mean: {mu_map}")
print(f"MLE mean: {mu_mle}")

True mean: tensor([20., 10.])
MAP mean: tensor([[19.9701, 10.1186]])
MLE mean: tensor([19.9733, 10.1223])
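
To see why the two cases behave so differently, it can help to look at how much weight the posterior mean gives to the data. The sketch below assumes the isotropic setting used in the cells above (${\Lambda} = \lambda I$ and ${\Lambda}_{0} = \lambda_{0} I$), in which each coordinate of $\vec{\mu}_{n}$ reduces to a weighted average of the sample mean and the prior mean. The helper name `data_weight` is introduced here for illustration only.

```python
# Scalar view of the posterior-mean update, valid when Lambda = lam * I and
# Lambda_0 = lam_0 * I. Each coordinate of mu_n is then
#   w * (sample mean) + (1 - w) * (prior mean),  with  w = n*lam / (n*lam + lam_0).

def data_weight(n, lam, lam_0):
    """Fraction of the posterior mean contributed by the sample mean (hypothetical helper)."""
    return n * lam / (n * lam + lam_0)

lam, lam_0 = 0.1, 0.33  # known and prior precisions from the cells above

w_low = data_weight(3, lam, lam_0)
w_high = data_weight(1000, lam, lam_0)
print(f"n=3:    weight on sample mean = {w_low:.3f}")
print(f"n=1000: weight on sample mean = {w_high:.3f}")

# Reproduce the first coordinate of the n=3 MAP mean from the printed
# MLE (18.1845) and the prior mean (19).
mu_map_0 = w_low * 18.1845 + (1 - w_low) * 19.0
print(f"n=3 MAP mean, first coordinate: {mu_map_0:.4f}")
```

With three samples the data and the prior contribute roughly equally, whereas with a thousand samples the weight on the data is close to one, so the MAP estimate is nearly identical to the MLE.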


### How to use the estimated mean parameter?

We typically find $\vec\mu_{*}$, the value of $\vec\mu$ that maximizes this posterior probability. In this particular case, the maximum of a Gaussian probability density occurs at its mean; hence, $\vec\mu_{*} = \vec\mu_{n}$.

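Given an arbitrary new data instance $\vec x$, the probability density of it belonging to the class from which the training data has been sampled is $\mathcal{N}\left( \vec x; \vec\mu_{n},  \Lambda \right)$. We plug the MAP estimate of the mean into the data distribution and keep the known precision $\Lambda$; the posterior precision $\Lambda_{n}$ describes our uncertainty about $\vec\mu$, not the spread of the data.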

In [6]:
map_dist = MultivariateNormal(posterior_dist_high_n.mean, precision_matrix=precision_known)
print(f"MAP distribution mu: {map_dist.mean} precision:{map_dist.precision_matrix}")

MAP distribution mu: tensor([[19.9701, 10.1186]]) precision:tensor([[[0.1000, 0.0000],
         [0.0000, 0.1000]]])
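
As a rough illustration of how such a distribution might be used, the sketch below scores two new points with `log_prob`. The mean is copied from the printed output above rather than recomputed, and the two query points `x_near` and `x_far` are made-up values chosen only for this example.

```python
import math
import torch
from torch.distributions import MultivariateNormal

# Values copied from the printed output above (n = 1000 case).
mu_n = torch.tensor([19.9701, 10.1186])
precision_known = torch.tensor([[0.1, 0.0], [0.0, 0.1]])

# Plug-in (MAP) model of the data: N(x; mu_n, Lambda)
map_dist = MultivariateNormal(mu_n, precision_matrix=precision_known)

# Hypothetical query points: one close to the class mean, one far from it.
x_near = torch.tensor([20.0, 10.0])
x_far = torch.tensor([40.0, -5.0])

# log_prob returns the log of the probability density at each point.
log_p_near = map_dist.log_prob(x_near)
log_p_far = map_dist.log_prob(x_far)
print(f"log density near the mean: {log_p_near.item():.4f}")
print(f"log density far away:      {log_p_far.item():.4f}")
```

A larger log density means the point is more plausible under the fitted class model. Turning that score into a class decision would need a threshold or a comparison against other classes, which is outside what this notebook covers.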
