## Introduction to Bayesian Statistics
LSSTC Data Science Fellowship Program Session 16

**Jiayin Dong**, Flatiron Research Fellow

CCA, Flatiron Institute

September 2022

---

In the lecture, we learned Bayesian inference through three examples: (1) drawing balls out of a bag, (2) observations with Gaussian noise, and (3) fitting a straight line to data. The lecture material is available in this folder (https://github.com/jiayindong/LSSTC-DSFP-Sessions/tree/main/Sessions/Session16/Day1).

The principle of Bayesian inference is to update our knowledge of some parameters of interest $\theta$ by combining data $D$ with prior knowledge of $\theta$.

Bayes' Theorem 

$p(\theta|D) = \cfrac{p(D|\theta)p(\theta)}{p(D)}$,

where $p(\theta|D)$ is the posterior, $p(D|\theta)$ is the likelihood, and $p(\theta)$ is the prior. $p(D)$ is the probability of observing $D$ (the evidence), i.e., $p(D) = \int p(D|\theta)\,p(\theta)\,d\theta$.

The approach we used to compute the posterior $p(\theta|D)$ in the lecture is called *grid approximation*.

Grid approximation in five steps:
1. Build a grid for parameters of interest $\theta$. The dimension of the grid depends on the number of parameters.
2. At each parameter value on the grid, calculate the prior $p(\theta_{\rm grid})$.
3. At each parameter value on the grid, calculate the likelihood $p(D|\theta_{\rm grid})$.
4. At each parameter value on the grid, multiply the likelihood by the prior, $p(D|\theta_{\rm grid})p(\theta_{\rm grid})$. Note, this is the unnormalized posterior.
5. Lastly, normalize the posterior by the sum of all values on the grid $\Sigma_{\theta_{\rm grid}} p(D|\theta_{\rm grid})p(\theta_{\rm grid})$.
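The five steps map onto a few lines of NumPy. The sketch below is illustrative only: the helper name `grid_posterior` and the coin-bias toy problem are assumptions made here, not part of the lecture, and the helper expects the prior and likelihood to be evaluated on the same grid.

```python
import numpy as np

def grid_posterior(prior, likelihood):
    """Return the normalized posterior on a grid (hypothetical helper)."""
    prior = np.asarray(prior, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    unnormalized = likelihood * prior            # step 4: likelihood times prior
    return unnormalized / unnormalized.sum()     # step 5: normalize over the grid

# Toy problem: infer a coin's bias after observing a single head
bias_grid = np.linspace(0, 1, 11)                          # step 1: the grid
prior = np.full(bias_grid.size, 1 / bias_grid.size)        # step 2: flat prior
likelihood = bias_grid                                     # step 3: p(head | bias)
posterior = grid_posterior(prior, likelihood)
print(posterior)
```

The same helper applies to any number of parameters, because the multiplication and the sum act on the whole array regardless of its shape.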

This problem set applies the grid approximation to the three problems discussed in the lecture.

---

In [1]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

In [2]:
from matplotlib import rc
rc('font', **{'family':'sans-serif'})
rc('text', usetex=True)

### Problem 1: Draw balls out of a bag

We have a bag containing 4 balls. Each ball has two possible colors: black and white. We begin with no information on the number of black and white balls in the bag and want to update our guess on the number of black balls from observations (i.e., drawing balls out of the bag).

####  Problem 1(a) Make one draw

We drew one ball out of the bag, and it is black.

Step 1: Build a grid for $\theta$. Let $\theta$ here be the configurations of black and white balls.

$\theta$ has five possible states: 0, 1, 2, 3, 4 black balls out of 4 balls.

In [3]:
θ = np.arange(5)
θ

array([0, 1, 2, 3, 4])

Step 2: At each state of $\theta$, calculate the prior.

Since we have no previous information about the configuration, we use an uninformative prior.

In [4]:
prior = np.ones(5)/5

In [5]:
## Plot the prior distribution
plt.figure(figsize=(3.5,2.7),dpi=110)
plt.scatter(θ, prior, c='k')
plt.xlabel('Configurations (num of black balls)')
plt.ylabel(r'$p(\theta)$')
plt.tight_layout()
plt.show()


Step 3: At each state of $\theta$, calculate the likelihood.

At $\theta_0$, which has no black balls, $p( \textrm{"draw a black ball"}|\theta_0) = 0$

At $\theta_1$, which has one black ball and three white balls, $p( \textrm{"draw a black ball"}|\theta_1) = 1/4$

...

At $\theta_4$, which has no white balls, $p( \textrm{"draw a black ball"}|\theta_4) = 1$

In [6]:
likelihood = np.array([0, 1/4, 2/4, 3/4, 1])

In [7]:
## Plot the likelihood distribution -- YOUR CODE HERE
plt.figure(figsize=(3.5,2.7),dpi=110)
plt.scatter(θ, likelihood, c='k')
plt.xlabel('Configurations (num of black balls)')
plt.ylabel(r'$p(D|\theta)$')
plt.tight_layout()
plt.show()


Step 4: At each state of $\theta$, calculate the unnormalized posterior.

In [8]:
unnormalized_posterior = prior*likelihood

In [9]:
## Plot the unnormalized posterior distribution -- YOUR CODE HERE
plt.figure(figsize=(3.5,2.7),dpi=110)
plt.scatter(θ, unnormalized_posterior, c='k')
plt.xlabel('Configurations (num of black balls)')
plt.ylabel(r'unnormalized $p(\theta|D)$')
plt.tight_layout()
plt.show()


Step 5: At each state of $\theta$, normalize the posterior.

In [10]:
posterior = unnormalized_posterior/np.sum(unnormalized_posterior)

In [11]:
## Plot the posterior distribution -- YOUR CODE HERE
plt.figure(figsize=(3.5,2.7),dpi=110)
plt.scatter(θ, posterior, c='k')
plt.xlabel('Configurations (num of black balls)')
plt.ylabel(r'$p(\theta|D)$')
plt.tight_layout()
plt.show()


That's it. We have computed the posterior of $\theta$, i.e., our updated guess of the number of black balls after drawing a black ball out of the bag.
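As a sanity check, the same five steps can be repeated with exact fractions, which makes the evidence $p(D)$ and the posterior values easy to read off. This is a minimal sketch using Python's standard `fractions` module; the variable names are illustrative and not taken from the lecture.

```python
from fractions import Fraction

n_balls = 4
states = range(n_balls + 1)                           # number of black balls: 0..4
prior = [Fraction(1, n_balls + 1) for _ in states]    # flat prior, 1/5 per state
likelihood = [Fraction(k, n_balls) for k in states]   # p(draw black | k black balls)

unnormalized = [p * l for p, l in zip(prior, likelihood)]
evidence = sum(unnormalized)                          # p(D), the normalizing constant
posterior = [u / evidence for u in unnormalized]

print(evidence)                       # 1/2
print([str(p) for p in posterior])    # ['0', '1/10', '1/5', '3/10', '2/5']
```

The posterior is proportional to the number of black balls, so the all-black bag is now the single most probable configuration.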

####  Problem 1(b) Make another draw

After drawing the black ball, we put it back into the bag. We make another draw to improve our guess of the number of black balls in the bag. This time we drew a white ball.

Step 1: Build the grid. The grid of $\theta$ remains the same.

Step 2: Calculate the prior. Now we **do** have prior knowledge, which comes from the previous draw: the posterior we calculated from the previous draw becomes our prior.

In [12]:
prior = np.array([0,0.1,0.2,0.3,0.4]) # Note: This distribution is the posterior from Problem 1(a).

In [13]:
## Plot the prior distribution
plt.figure(figsize=(3.5,2.7),dpi=110)
plt.scatter(θ, prior, c='k')
plt.xlabel('Configurations (num of black balls)')
plt.ylabel(r'$p(\theta)$')
plt.tight_layout()
plt.show()


Step 3: Calculate the likelihood. 

At $\theta_0$, which has no black balls, $p( \textrm{"draw a white ball"}|\theta_0) = 1$

At $\theta_1$, which has one black ball and three white balls, $p( \textrm{"draw a white ball"}|\theta_1) = 3/4$

...

At $\theta_4$, which has no white balls, $p( \textrm{"draw a white ball"}|\theta_4) = 0$

In [14]:
likelihood = np.array([1, 3/4, 2/4, 1/4, 0]) # YOUR_CODE_HERE

In [15]:
## Plot the likelihood distribution -- YOUR CODE HERE
plt.figure(figsize=(3.5,2.7),dpi=110)
plt.scatter(θ, likelihood, c='k')
plt.xlabel('Configurations (num of black balls)')
plt.ylabel(r'$p(D|\theta)$')
plt.tight_layout()
plt.show()


Step 4: At each state of $\theta$, calculate the unnormalized posterior.

In [16]:
unnormalized_posterior = prior*likelihood # YOUR_CODE_HERE

In [17]:
## Plot the unnormalized posterior distribution -- YOUR CODE HERE
plt.figure(figsize=(3.5,2.7),dpi=110)
plt.scatter(θ, unnormalized_posterior, c='k')
plt.xlabel('Configurations (num of black balls)')
plt.ylabel(r'unnormalized $p(\theta|D)$')
plt.tight_layout()
plt.show()


Step 5: At each state of $\theta$, normalize the posterior.

In [18]:
posterior = unnormalized_posterior/np.sum(unnormalized_posterior)

In [19]:
## Plot the posterior distribution -- YOUR CODE HERE
plt.figure(figsize=(3.5,2.7),dpi=110)
plt.scatter(θ, posterior, c='k')
plt.xlabel('Configurations (num of black balls)')
plt.ylabel(r'$p(\theta|D)$')
plt.tight_layout()
plt.show()


Again, we get our posterior. Since one draw is black and one draw is white, we get zero probability for $\theta_0$ (no black balls) and $\theta_4$ (all black balls). Our observations favor an equal number of black and white balls, which is why the posterior peaks at $\theta_2$ and is symmetric around it.
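Updating one draw at a time gives the same answer as a single update that uses both draws at once, because the ball is returned to the bag and the draws are therefore independent given the configuration. A short sketch of that equivalence follows; the helper `normalize` and the variable names are illustrative.

```python
import numpy as np

theta = np.arange(5)                  # number of black balls in the bag
p_black = theta / 4                   # chance of drawing black, with replacement

def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    return weights / weights.sum()

# Sequential: update on the black draw, then reuse that posterior as the prior
flat_prior = np.full(5, 1 / 5)
after_black = normalize(flat_prior * p_black)
sequential = normalize(after_black * (1 - p_black))

# Batch: one update with the joint likelihood of both draws
batch = normalize(flat_prior * p_black * (1 - p_black))

print(sequential)
print(np.allclose(sequential, batch))
```

The order of the two draws does not matter either, since the joint likelihood is a plain product.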

#### Problem 1(c) Add more balls.

Let's say, instead of a bag of four balls, we now have a bag of 100 balls.

We made 10 draws and got 6 black balls and 4 white balls.

We begin with an uninformative prior. What is the posterior?

In [20]:
# Step 1: define the theta
θ = np.arange(101)

# Step 2: write the prior
prior = np.ones(101)/101

# Step 3: write the likelihood
likelihood = np.linspace(0,1,101)**6*np.linspace(1,0,101)**4

# Step 4 & 5: calculate and normalize the posterior
posterior = prior*likelihood
posterior /= np.sum(posterior)

**Trick** We may instead write our likelihood as a binomial distribution.

Call the SciPy binomial function [stats.binom.pmf(k, n, p)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html#scipy.stats.binom) where k = 6, n = 10, and p = $\theta$/100.
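The trick works because the binomial PMF differs from the product $p^6(1-p)^4$ only by the constant factor $\binom{10}{6} = 210$, and a constant factor cancels when the posterior is normalized. A minimal numerical check, assuming the same 101-point grid; the variable names are illustrative:

```python
import numpy as np
from math import comb
from scipy import stats

p = np.arange(101) / 100                     # fraction of black balls in the bag
by_hand = p**6 * (1 - p)**4                  # likelihood without the constant
binomial = stats.binom.pmf(6, 10, p)         # likelihood with the constant

# Away from the endpoints (where both vanish) the ratio is the same everywhere
ratio = binomial[1:-1] / by_hand[1:-1]
print(comb(10, 6), np.allclose(ratio, comb(10, 6)))

# The constant cancels once each curve is normalized over the grid
print(np.allclose(by_hand / by_hand.sum(), binomial / binomial.sum()))
```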

In [21]:
likelihood_binomial = stats.binom.pmf(6, 10, θ/100)

Do the two likelihood distributions look the same?

In [22]:
## Plot and compare likelihood and likelihood_binomial
plt.figure(figsize=(3.5,2.7),dpi=110)
plt.plot(θ, likelihood/np.sum(likelihood),label='Our likelihood')
plt.plot(θ, likelihood_binomial/np.sum(likelihood_binomial), c='k', linestyle='--',label='Binomial')
plt.xlabel('Configurations (num of black balls)')
plt.ylabel(r'$p(D|\theta)$')
plt.legend(framealpha=0)
plt.tight_layout()
plt.show()


In [23]:
## Plot the posterior distribution
plt.figure(figsize=(3.5,2.7),dpi=110)
plt.plot(θ, posterior, c='k')
plt.xlabel('Configurations (num of black balls)')
plt.ylabel(r'$p(\theta|D)$')
plt.tight_layout()
plt.show()


### Problem 2: Observations with Gaussian noise

Astronomical observations come with noise. Instead of observing the *true* value (or mean value) of a parameter of interest, we observe the true value plus Gaussian noise with standard deviation $\sigma$.

We make observations of a parameter of interest $x$ with known Gaussian noise $\sigma$. We want to infer $x$ from these observations.

#### Problem 2(a) Make one observation.

Let's first generate some data.

In [24]:
np.random.seed(42)

xtrue = 1.  # the "true" value of x
σ = 1.  # the uncertainty sigma

xobs = xtrue + np.random.normal(loc=0, scale=σ) # the observed x

print(xobs)

1.4967141530112327


Now we apply the grid approximation method to approximate the posterior of x given the observation, $p(x_{\rm true}|x_{\rm obs})$.

As a reminder, the grid approximation proceeds as follows:
1. Build a grid for parameters of interest $\theta$. The dimension of the grid depends on the number of parameters.
2. At each parameter value on the grid, calculate the prior $p(\theta_{\rm grid})$.
3. At each parameter value on the grid, calculate the likelihood $p(D|\theta_{\rm grid})$.
4. At each parameter value on the grid, multiply the likelihood by the prior, $p(D|\theta_{\rm grid})p(\theta_{\rm grid})$. Note, this is the unnormalized posterior.
5. Lastly, normalize the posterior by the sum of all values on the grid $\Sigma_{\theta_{\rm grid}} p(D|\theta_{\rm grid})p(\theta_{\rm grid})$.

Step 1: Build a grid for xtrue.

In [25]:
## Let xgrid be evenly spaced from -3 to 3 with 100 samples.
xgrid = np.linspace(-3,3,100)

Step 2: At each value on $x_{\rm grid}$, calculate the prior $p(x_{\rm grid})$.

Let's assume a uniform prior on x.

In [26]:
prior = np.ones_like(xgrid)/100

In [27]:
## Plot the prior distribution
plt.figure(figsize=(3.5,2.7),dpi=110)
plt.plot(xgrid, prior, c='k')
plt.xlabel(r'x')
plt.ylabel(r'$p(x)$')
plt.tight_layout()
plt.show()


Step 3: At each value on xgrid, calculate the likelihood $p(x_{\rm obs}|x_{\rm grid})$.

For a given value $x_{\rm grid,i}$, the likelihood $p(x_{\rm obs}|x_{\rm grid,i})$ is the probability density of observing $x_{\rm obs}$ under a Normal distribution with mean $\mu = x_{\rm grid,i}$ and standard deviation $\sigma$.

We use the SciPy function [stats.norm.pdf(x, loc, scale)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html?highlight=norm#scipy.stats.norm) to calculate the likelihood.

In [28]:
likelihood = stats.norm.pdf(xobs, xgrid, σ)

In [29]:
## Plot the likelihood distribution -- YOUR CODE HERE
plt.figure(figsize=(3.5,2.7),dpi=110)
plt.plot(xgrid, likelihood, c='k')
plt.xlabel('x')
plt.ylabel(r'$p(x_{\textrm{obs}}|x_{\textrm{true}})$')
plt.tight_layout()
plt.show()


Steps 4 & 5: At each value on xgrid, calculate the posterior. And lastly, normalize the posterior.

In [30]:
posterior = prior*likelihood
posterior /= np.sum(posterior)

In [31]:
## Plot the posterior distribution
plt.figure(figsize=(3.5,2.7),dpi=110)
plt.plot(xgrid, posterior, c='k')
plt.xlabel('x')
plt.ylabel(r'$p(x_{\textrm{true}}|x_{\textrm{obs}})$')
plt.tight_layout()
plt.show()


The posterior has the same shape as the likelihood because we used a uniform prior.

Note that stats.norm.pdf(xgrid, xobs, $\sigma$) and stats.norm.pdf(xobs, xgrid, $\sigma$) return the same values, because the Gaussian density depends only on the difference between its argument and its mean. They have different statistical meanings, however.
- stats.norm.pdf(xobs, xgrid, $\sigma$) calculates the probability density at xobs under each of the Normal distributions N(xgrid, $\sigma$).

- stats.norm.pdf(xgrid, xobs, $\sigma$) calculates the probability densities at xgrid under a single Normal distribution N(xobs, $\sigma$).
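The numerical agreement of the two calls can be confirmed directly. The sketch below uses an illustrative observed value of 1.5 rather than the notebook's `xobs`; only the argument order differs between the two lines.

```python
import numpy as np
from scipy import stats

sigma = 1.0
x_obs = 1.5                                   # illustrative observed value
x_grid = np.linspace(-3, 3, 100)

# Likelihood reading: density of x_obs under N(mu, sigma) for each candidate mean on the grid
as_likelihood = stats.norm.pdf(x_obs, loc=x_grid, scale=sigma)

# Sampling-distribution reading: density of a single N(x_obs, sigma) at each grid point
as_density = stats.norm.pdf(x_grid, loc=x_obs, scale=sigma)

print(np.allclose(as_likelihood, as_density))
```

Writing the call in the likelihood order keeps the code aligned with $p(x_{\rm obs}|x_{\rm grid})$, which matters for models whose density is not symmetric in the data and the parameter.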

#### Problem 2(b) Informative prior.

Assume that we have some prior knowledge of x: our prior is a Normal distribution with mean 0.8 and standard deviation 0.5, i.e., $x \sim \mathcal{N}(0.8, 0.5)$.

Redo the analysis.

In [32]:
prior = stats.norm.pdf(xgrid, 0.8, 0.5)
posterior = prior*likelihood
posterior /= np.sum(posterior)

plt.figure(figsize=(3.5,2.7),dpi=110)
plt.plot(xgrid, posterior, c='k')
plt.xlabel('x')
plt.ylabel(r'$p(x_{\textrm{true}}|x_{\textrm{obs}})$')
plt.tight_layout()
plt.show()


Let's ponder that for a moment. How did the posterior change, and why?
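One way to check your reasoning: for a Gaussian prior and a Gaussian likelihood with known $\sigma$, the posterior is itself Gaussian, with a precision-weighted mean that lies between the prior mean and the observation. The sketch below compares that standard closed-form result with the grid approximation. The observed value is copied from the output of Problem 2(a); the variable names are illustrative.

```python
import numpy as np
from scipy import stats

x_obs, sigma = 1.4967141530112327, 1.0        # observation and noise from Problem 2(a)
mu0, sigma0 = 0.8, 0.5                        # prior mean and standard deviation

# Grid posterior, following the same recipe as the notebook
x_grid = np.linspace(-3, 3, 100)
posterior = stats.norm.pdf(x_grid, mu0, sigma0) * stats.norm.pdf(x_obs, x_grid, sigma)
posterior /= posterior.sum()
grid_mean = np.sum(x_grid * posterior)
grid_std = np.sqrt(np.sum((x_grid - grid_mean) ** 2 * posterior))

# Closed form: precisions (inverse variances) add, and the mean is precision-weighted
precision = 1 / sigma0**2 + 1 / sigma**2
exact_mean = (mu0 / sigma0**2 + x_obs / sigma**2) / precision
exact_std = np.sqrt(1 / precision)

print(grid_mean, exact_mean)
print(grid_std, exact_std)
```

Because the prior is four times more precise than a single observation, the posterior mean sits much closer to 0.8 than to the observed value, and the posterior is narrower than both the prior and the likelihood.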

#### Problem 2(c):  Many observations.

Let's keep the assumptions the same but now make 10 observations of x.

Generate some data.

In [33]:
xobs = xtrue + np.random.normal(0,σ,size=10)
print(xobs)

[0.8617357  1.64768854 2.52302986 0.76584663 0.76586304 2.57921282
 1.76743473 0.53052561 1.54256004 0.53658231]


Step 1: The xgrid remains the same.

Step 2: Prior. Let's keep using the normal prior from the previous question.

Step 3: Likelihood. Since now we have 10 observations, we need to calculate the likelihood for each observation and multiply them all together.

In [34]:
likelihood = np.ones_like(xgrid)
for i, xobs_i in enumerate(xobs):
    likelihood *= stats.norm.pdf(xobs_i, xgrid, σ)
    
print(likelihood[:10])

[5.00307601e-47 6.86686152e-46 9.08505227e-45 1.15862935e-43
 1.42432656e-42 1.68780618e-41 1.92789566e-40 2.12271863e-39
 2.25293826e-38 2.30491069e-37]


You may notice that the likelihood values become much smaller as we add more observations. This is troublesome: with enough observations the product underflows, i.e., it falls below the smallest number representable in double (or single) precision and is rounded to zero.

Instead, we could use log_likelihood, which is the natural log (ln) of the likelihood.
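To see how severe the problem can get, the sketch below simulates 1000 observations and compares the direct product with the sum of logs. The number of observations, the random seed, and the flat prior are choices made here for illustration. Normalizing with `scipy.special.logsumexp` is one option, not necessarily what the lecture used.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(0)                 # seed chosen only for reproducibility
sigma = 1.0
x_obs = 1.0 + rng.normal(0, sigma, size=1000)  # 1000 simulated observations
x_grid = np.linspace(-3, 3, 100)

# Direct product over observations: every grid value underflows to exactly zero
direct = np.prod(stats.norm.pdf(x_obs[:, None], x_grid[None, :], sigma), axis=0)
print(direct.max())

# Sum of logs over observations: finite everywhere on the grid
log_likelihood = stats.norm.logpdf(x_obs[:, None], x_grid[None, :], sigma).sum(axis=0)
print(log_likelihood.max())

# Normalize in log space, then exponentiate (flat prior assumed)
log_posterior = log_likelihood - logsumexp(log_likelihood)
posterior = np.exp(log_posterior)
print(posterior.sum())
```

Subtracting the log of the normalizing constant before exponentiating keeps the largest posterior values of order one, so the result stays usable even when the raw likelihood is far below the floating-point range.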

Redo the likelihood calculation above with log_likelihood.

In [35]:
log_likelihood = np.zeros_like(xgrid)
for i, xobs_i in enumerate(xobs):
    log_likelihood += stats.norm.logpdf(xobs_i, xgrid, σ)
    
print(log_likelihood[:10])

[-106.61144645 -103.99220711 -101.40969873  -98.86392129  -96.3548748
  -93.88255925  -91.44697464  -89.04812099  -86.68599828  -84.36060651]


Steps 4 & 5: Calculate the posterior and normalize it.

In [36]:
log_posterior = np.log(prior) + log_likelihood
posterior = np.exp(log_posterior)
posterior /= np.sum(posterior)

In [37]:
plt.figure(figsize=(3.5,2.7),dpi=110)
plt.plot(xgrid, posterior, c='k')
plt.xlabel('x')
plt.ylabel(r'$p(x_{\textrm{true}}|x_{\textrm{obs}})$')
plt.tight_layout()
plt.show()


### Problem 3: Fit a straight line to data

Lastly, let's apply the grid approximation method to model a straight line.

We have two parameters of interest, the slope of the line $m$ and the intercept of the line $b$. The line model is described as $y = mx + b$.

We made 10 observations of y but with some Gaussian noise $\sigma$. Assume that we know the exact x values.

Let's generate some data.

In [38]:
x = np.linspace(-1,1,10)

m_true = 1.
b_true = 0.5
y_true = m_true*x + b_true

σ = 0.2

y_obs = y_true + np.random.normal(size=len(y_true))*σ

In [39]:
plt.figure(figsize=(4,2.7),dpi=110)
plt.errorbar(x, y_obs, yerr=σ, linestyle='', fmt='o', c='k')
plt.plot(x, y_true, c='grey', lw=1, zorder=0)
plt.ylabel('y')
plt.xlabel('x')
plt.tight_layout()
plt.show()


Let's do the grid approximation to infer the slope and the intercept $\{m, b\}$.

1. Build a grid for parameters of interest $\theta$. The dimension of the grid depends on the number of parameters.
2. At each parameter value on the grid, calculate the prior $p(\theta_{\rm grid})$.
3. At each parameter value on the grid, calculate the likelihood $p(D|\theta_{\rm grid})$.
4. At each parameter value on the grid, multiply the likelihood by the prior, $p(D|\theta_{\rm grid})p(\theta_{\rm grid})$. Note, this is the unnormalized posterior.
5. Lastly, normalize the posterior by the sum of all values on the grid $\Sigma_{\theta_{\rm grid}} p(D|\theta_{\rm grid})p(\theta_{\rm grid})$.

Step 1: Build the grid. We use [np.meshgrid](https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html) to broadcast the vectors.
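A tiny example makes the layout of the `np.meshgrid` output concrete. With the default indexing, rows follow the second input and columns follow the first. The small arrays here are illustrative stand-ins for the full grids.

```python
import numpy as np

m_small = np.array([0.0, 1.0, 2.0])           # three slope values
b_small = np.array([0.0, 0.5])                # two intercept values
mv_small, bv_small = np.meshgrid(m_small, b_small)

print(mv_small.shape)                         # (2, 3): rows follow b, columns follow m
print(mv_small)                               # each row repeats the slope values
print(bv_small)                               # each column repeats the intercept values
```

Element `[i, j]` of both arrays therefore describes the grid point with intercept `b_small[i]` and slope `m_small[j]`.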

In [40]:
mgrid = np.linspace(0,2,200)
bgrid = np.linspace(0,1,100)
mv, bv = np.meshgrid(mgrid, bgrid)

In [41]:
plt.figure(figsize=(5,3.5))
plt.pcolormesh(mgrid,bgrid,mv,cmap='plasma')
plt.xlabel('m grid')
plt.ylabel('b grid')
plt.colorbar()
plt.title('Broadcast m grid along b direction')
plt.tight_layout()
plt.show()


In [42]:
plt.figure(figsize=(5,3.5))
plt.pcolormesh(mgrid,bgrid,bv,cmap='plasma')
plt.xlabel('m grid')
plt.ylabel('b grid')
plt.colorbar()
plt.title('Broadcast b grid along m direction')
plt.tight_layout()
plt.show()


Step 2: At each grid value $(m_i, b_j)$, calculate the prior.

Let's assume uniform priors on $m$ and $b$. Note that because of the geometry of the problem, a uniform prior on $m$ is biased towards high slopes.

$ m \sim {\rm Uniform}(0, 2)$

$ b \sim {\rm Uniform}(0, 1)$
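The remark about steep slopes can be made concrete by expressing the slope as an angle, $\arctan(m)$. The sketch below is an illustration added here, not part of the lecture: it compares how much prior mass a flat prior on $m$ assigns to lines steeper than 45 degrees with how much of the allowed angle range those lines occupy.

```python
import numpy as np

m_grid = np.linspace(0, 2, 200)                   # slope grid from 0 to 2
prior_m = np.full(m_grid.size, 1 / m_grid.size)   # flat in m

# Prior mass on slopes steeper than 45 degrees (m > 1)
mass_steep = prior_m[m_grid > 1].sum()

# Share of the allowed angle range occupied by those same slopes
angle_max = np.arctan(2.0)
angle_share = (angle_max - np.arctan(1.0)) / angle_max

print(mass_steep)                                 # half of the prior mass
print(angle_share)                                # under a third of the angle range
```

A prior that is flat in angle would be one alternative to try in the bonus problem.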

In [43]:
prior_m = np.ones_like(mgrid)/200
prior_b = np.ones_like(bgrid)/100

prior_mv, prior_bv = np.meshgrid(prior_m, prior_b)

In [44]:
plt.figure(figsize=(5,3.5))
plt.pcolormesh(mgrid,bgrid,prior_mv,cmap='plasma')
plt.xlabel('m grid')
plt.ylabel('b grid')
plt.colorbar()
plt.title('m prior (constant)')
plt.tight_layout()
plt.show()


In [45]:
plt.figure(figsize=(5,3.5))
plt.pcolormesh(mgrid,bgrid,prior_bv,cmap='plasma')
plt.xlabel('m grid')
plt.ylabel('b grid')
plt.colorbar()
plt.title('b prior (constant)')
plt.tight_layout()
plt.show()


Our prior here should now be the product $p(m)p(b)$, or the sum $\log p(m)+\log p(b)$ if we take the natural log.

In [46]:
log_prior = np.log(prior_mv)+np.log(prior_bv)

In [47]:
plt.figure(figsize=(5,3.5))
plt.pcolormesh(mgrid,bgrid,log_prior,cmap='plasma')
plt.xlabel('m grid')
plt.ylabel('b grid')
plt.colorbar()
plt.title('log-prior (constant)')
plt.tight_layout()
plt.show()


Step 3: At each grid value $(m_i, b_j)$, calculate the log-likelihood.
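Written out, the log-likelihood at a grid point is $\ln p(D|m,b) = -\frac{1}{2}\sum_i \left(\frac{y_i - m x_i - b}{\sigma}\right)^2 - N\ln\left(\sigma\sqrt{2\pi}\right)$, where $N$ is the number of data points. The sketch below evaluates this expression for the whole grid at once using broadcasting. It generates its own illustrative data with a separate seed, so the numbers will differ from the notebook's.

```python
import numpy as np

# Illustrative data, generated like the notebook's but with its own seed
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 10)
sigma = 0.2
y_obs = 1.0 * x + 0.5 + rng.normal(size=x.size) * sigma

m_grid = np.linspace(0, 2, 200)
b_grid = np.linspace(0, 1, 100)
mv, bv = np.meshgrid(m_grid, b_grid)          # both have shape (100, 200)

# Model prediction at every grid point for every data point: shape (100, 200, 10)
y_model = mv[..., None] * x + bv[..., None]
residual = y_obs - y_model

# Gaussian log-density summed over the data axis
log_likelihood = (-0.5 * np.sum((residual / sigma) ** 2, axis=-1)
                  - x.size * np.log(sigma * np.sqrt(2 * np.pi)))

# Grid point with the highest likelihood
i_b, i_m = np.unravel_index(np.argmax(log_likelihood), log_likelihood.shape)
print(m_grid[i_m], b_grid[i_b])
```

The broadcast version trades memory for speed: it holds one value per grid point per data point, which is fine here but grows quickly with larger grids or data sets.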

In [48]:
log_likelihood = 0.
for i, this_y_obs in enumerate(y_obs):
    
    this_y_true = mv*x[i] + bv
    
    log_likelihood += stats.norm.logpdf(this_y_obs, this_y_true, σ)

In [49]:
plt.figure(figsize=(5,3.5))
plt.pcolormesh(mgrid,bgrid,log_likelihood,cmap='plasma')
plt.xlabel('m grid')
plt.ylabel('b grid')
plt.colorbar()
plt.title('log-likelihood')
plt.tight_layout()
plt.show()


In [50]:
plt.figure(figsize=(5,3.5))
plt.pcolormesh(mgrid,bgrid,np.exp(log_likelihood),cmap='plasma')
plt.xlabel('m grid')
plt.ylabel('b grid')
plt.colorbar()
plt.title('likelihood')
plt.tight_layout()
plt.show()


Steps 4 & 5: At each grid value $(m_i, b_j)$, calculate the posterior and normalize it.

In [51]:
log_posterior = log_prior + log_likelihood

In [52]:
plt.figure(figsize=(5,3.5))
plt.pcolormesh(mgrid,bgrid,log_posterior,cmap='plasma')
plt.xlabel('m grid')
plt.ylabel('b grid')
plt.colorbar()
plt.title('log-posterior')
plt.tight_layout()
plt.show()


In [53]:
posterior = np.exp(log_posterior)
posterior /= np.sum(posterior)

In [54]:
plt.figure(figsize=(5,3.5))
plt.pcolormesh(mgrid,bgrid,posterior,cmap='plasma')
plt.xlabel('m grid')
plt.ylabel('b grid')
plt.colorbar()
plt.title('posterior')
plt.tight_layout()
plt.show()


**Bonus Problem** Try different priors on $m$ and $b$. How does the posterior change?
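To compare posteriors obtained with different priors, it can help to reduce each 2D posterior to per-parameter summaries. The sketch below marginalizes over one axis at a time and reports the mean and standard deviation of each parameter. It uses a stand-in posterior (an uncorrelated Gaussian bump with made-up centre and widths) so that it runs on its own.

```python
import numpy as np

# Stand-in posterior on the (b, m) grid, used purely for illustration
m_grid = np.linspace(0, 2, 200)
b_grid = np.linspace(0, 1, 100)
mv, bv = np.meshgrid(m_grid, b_grid)
posterior = np.exp(-0.5 * ((mv - 1.0) / 0.1) ** 2 - 0.5 * ((bv - 0.5) / 0.06) ** 2)
posterior /= posterior.sum()

# Marginalize: rows index b and columns index m, so sum over the other axis
posterior_m = posterior.sum(axis=0)           # p(m | D), length 200
posterior_b = posterior.sum(axis=1)           # p(b | D), length 100

m_mean = np.sum(m_grid * posterior_m)
b_mean = np.sum(b_grid * posterior_b)
m_std = np.sqrt(np.sum((m_grid - m_mean) ** 2 * posterior_m))
b_std = np.sqrt(np.sum((b_grid - b_mean) ** 2 * posterior_b))
print(m_mean, m_std)
print(b_mean, b_std)
```

Replacing the stand-in with the `posterior` array from the grid approximation gives the summaries for the actual fit.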