# Chapter 2

Bayes' theorem applied to data given some prior:

\begin{equation}
P(p | data ) \propto P( data | p) \cdot P(p)  
\tag{1}
\end{equation}


Normalised by $P(data) = E[P(data | p)] = \int P(data | p) P(p) \, dp$ so that the posterior integrates to 1.

Terms:
- $ P (p | data) $ - the posterior, the thing we want, the probability of our model given the data
- $ P (data | p) $ - the likelihood
- $ P (p) $ - the prior, the thing we choose in advance: our best guess, before seeing the data, of the distribution of our parameters of interest


So we have a machine that conditions the priors on the data. In simple cases this can be done analytically, but in this chapter we look at three numerical techniques instead.

1. Grid approximation
2. Quadratic approximation
3. Markov chain Monte Carlo (MCMC)

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
import seaborn as sns

sns.set(style='whitegrid')

## Grid approximation

The parameters (i.e. the unobserved variables in our model, as opposed to the data, which are the observed variables) are normally continuous, but we can discretise them and evaluate equation 1 above at a finite grid of parameter values ($p$ above). The number of grid points grows exponentially with the number of parameters, so this becomes infeasible quite quickly.
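To make the scaling concrete, here is a small sketch that counts how many grid points a full grid needs as parameters are added. The helper name `grid_evaluations` is made up for illustration, and it assumes the same number of grid points for every parameter.

```python
def grid_evaluations(points_per_param: int, num_params: int) -> int:
    """Number of points in a full grid with the same resolution in every dimension."""
    return points_per_param ** num_params

# with 100 points per parameter, the grid size explodes as parameters are added
for num_params in (1, 2, 5, 10):
    print(num_params, grid_evaluations(100, num_params))
```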

In what follows, we are trying to estimate the proportion of water on a globe, $p$, by randomly sampling its surface (e.g. by chucking it and recording what our thumb is touching when we catch it). This is equivalent to tossing a weighted coin with P(Heads) = $p$. If W is the number of times water is recorded, L the number of times land is recorded, and N the total number of samples, then we model:

$ W \sim \text{Binomial} (N, p) $

And take a uniform prior $p \sim \text{Uniform}(0,1)$



In [None]:
# observed data:
num_waters = 6
num_land = 3
num_total = num_waters + num_land

def grid_approximate_binomial(n: int, k: int, grid_size: int, prior: np.ndarray=None, plot=True) -> np.ndarray:
    p_grid = np.linspace(0,1, grid_size)
    # if prior is None, assume a uniform distribution over the grid.
    if prior is None:
        prior = np.ones(grid_size)
    # evaluate the probability of our observed data given our model.
    # binomial(n, p, k): (n choose k) * p^k * (1-p)^(n-k)
    likelihood = scipy.stats.binom.pmf(n=n, k=k, p=p_grid)   
    posterior_unscaled  = likelihood * prior 
    posterior = posterior_unscaled / posterior_unscaled.sum()
    
    if plot:
        fig, ax = plt.subplots()
        sns.lineplot(x=p_grid, y=posterior, ax=ax, marker='o')
        ax.set_xlim(p_grid.min(), p_grid.max())
        plt.ylabel('posterior')
        plt.xlabel('p')
        plt.show()
        
    return posterior

In [None]:
_ = grid_approximate_binomial(n=num_total, k=num_waters, grid_size=5)

In [None]:
_ = grid_approximate_binomial(n=num_total, k=num_waters, grid_size=50)

## Quadratic approximation

By making stronger assumptions we can handle more complicated models. Specifically, we assume that the region near the peak of the posterior is well-approximated by a Gaussian. The Gaussian is well-behaved and described by only the mean and the variance. This approximation is quadratic because the logarithm of the Gaussian is a quadratic function:
\begin{equation}
 P(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{equation}
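
Taking the logarithm of this density makes the quadratic form explicit:
\begin{equation}
 \log P(x | \mu, \sigma^2) = -\frac{(x-\mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)
\end{equation}

This is a parabola in $x$ with its peak at $\mu$ and second derivative $-1/\sigma^2$, so the curvature of the log-posterior at its mode gives an estimate of the variance.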

So to approximate the posterior of our model, we find the posterior mode with one of the many optimizers available (e.g. gradient descent on the negative log-posterior), then fit a Gaussian centred on the mode by estimating the curvature of the log-posterior there.


The pymc code would look like:
```python
import pymc as pm
import arviz as az

# Data
W = 6
L = 3
N = W + L

# Model
with pm.Model() as globe_model:
    # Uniform prior
    p = pm.Uniform('p', 0, 1)
    
    # Binomial likelihood
    W_obs = pm.Binomial('W_obs', n=N, p=p, observed=W)
    
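    # Fit an approximation to the posterior
    # (pm.fit defaults to ADVI, a variational method, rather than a true quadratic approximation)
    approx = pm.fit()

    # Draw samples from the fitted approximation and summarise them
    trace = approx.sample(1000)
    print(az.summary(trace, kind='stats'))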

```

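As far as I can tell, `pm.fit()` defaults to variational inference (ADVI) rather than a true quadratic approximation. It does fit a Gaussian, but it does so in the transformed (unconstrained) parameter space and by optimising a variational objective, not by measuring the curvature at the posterior mode, so treat the result as a stand-in for quap.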
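
For comparison, the quadratic approximation described above can be sketched by hand for the globe model. This is a minimal illustration using `scipy.optimize`, not the pymc or `py_quap` route; the names `neg_log_posterior`, `p_mode` and `sigma_quap` are made up here, and the curvature is estimated with a simple finite difference.

```python
import numpy as np
import scipy.optimize
import scipy.stats

W, N = 6, 9  # globe data: 6 waters in 9 tosses

def neg_log_posterior(p: float) -> float:
    # the uniform prior on (0, 1) only adds a constant, so the likelihood is enough
    return -scipy.stats.binom.logpmf(k=W, n=N, p=p)

# 1. find the posterior mode by minimising the negative log-posterior
result = scipy.optimize.minimize_scalar(
    neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method='bounded'
)
p_mode = result.x

# 2. estimate the curvature at the mode with a central finite difference
h = 1e-4
curvature = (
    neg_log_posterior(p_mode + h)
    - 2 * neg_log_posterior(p_mode)
    + neg_log_posterior(p_mode - h)
) / h**2

# 3. the Gaussian's standard deviation is 1 / sqrt(curvature)
sigma_quap = 1 / np.sqrt(curvature)
print(p_mode, sigma_quap)
```

The mode should sit at W/N = 2/3, and `scipy.stats.norm(p_mode, sigma_quap)` is then the approximate posterior.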

In [None]:
# TODO use py_quap

In [None]:
import pymc as pm
import arviz as az

# Data
W = 6
L = 3
N = W + L

# Model
with pm.Model() as globe_model:
    # Uniform prior
    p = pm.Uniform('p', 0, 1)
    
    # Binomial likelihood
    W_obs = pm.Binomial('W_obs', n=N, p=p, observed=W)
    
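    # Fit an approximation to the posterior
    # (pm.fit defaults to ADVI, a variational method, rather than a true quadratic approximation)
    approx = pm.fit()

    # Draw samples from the fitted approximation and summarise them
    trace = approx.sample(1000)
    print(az.summary(trace, kind='stats'))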


In [None]:
# plot and compare
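x = np.linspace(0, 1, 1000)

# approx.sample returns an InferenceData object in recent pymc versions,
# so the draws live in the posterior group
p_samples = trace.posterior['p'].values.ravel()
mu = np.mean(p_samples)
sigma = np.std(p_samples)
quap_posterior = scipy.stats.norm.pdf(x, mu, sigma)
grid_size = 50
p_grid = np.linspace(0, 1, grid_size)
# convert grid probabilities to a density by dividing by the grid spacing,
# which is 1 / (grid_size - 1) because linspace includes both endpoints
grid_posterior = grid_approximate_binomial(n=num_total, k=num_waters, grid_size=grid_size, plot=False) / (p_grid[1] - p_grid[0])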

fig, ax = plt.subplots()
sns.lineplot(x=p_grid, y=grid_posterior, ax=ax, marker='o', label='Grid')
sns.lineplot(x=x, y=quap_posterior, ax=ax, label='QUAP')
ax.set_xlim(p_grid.min(), p_grid.max())
plt.ylabel('posterior')
plt.xlabel('p')
plt.show()

NB: in our case the analytical form of the posterior is known. With a uniform prior and a binomial likelihood it is a beta distribution, Beta(W + 1, L + 1).
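
As a quick check of that claim, the sketch below compares the grid posterior (rescaled to a density) against the Beta(W + 1, L + 1) density on the same grid. It recomputes the grid posterior inline so that it runs on its own; the variable names are illustrative.

```python
import numpy as np
import scipy.stats

W, L = 6, 3
grid_size = 50
p_grid = np.linspace(0, 1, grid_size)

# grid posterior with a uniform prior, normalised to sum to 1
likelihood = scipy.stats.binom.pmf(k=W, n=W + L, p=p_grid)
grid_posterior = likelihood / likelihood.sum()

# divide by the grid spacing to turn probabilities into a density
grid_density = grid_posterior / (p_grid[1] - p_grid[0])

# exact posterior density under the uniform prior
exact_density = scipy.stats.beta.pdf(p_grid, W + 1, L + 1)

max_abs_diff = np.max(np.abs(grid_density - exact_density))
print(max_abs_diff)
```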

## Markov chain Monte Carlo

MCMC draws samples from the posterior distribution without needing its normalising constant; a histogram of those samples then approximates the posterior. We work directly with the samples rather than with a functional approximation of the posterior. The samples can be generated with e.g. the Metropolis algorithm. Details are saved for Chapter 9.
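
As a taster, here is a minimal Metropolis sketch for the globe example with a uniform prior. It is an illustration under simple assumptions (Gaussian random-walk proposals, a fixed step size, no burn-in or tuning), not necessarily the version covered in Chapter 9, and the function names are made up.

```python
import numpy as np

def log_posterior_unnormalised(p: float, w: int, n: int) -> float:
    """Binomial log-likelihood plus a flat log-prior, dropping constant terms."""
    return w * np.log(p) + (n - w) * np.log(1 - p)

def metropolis_globe(w: int, n: int, num_samples: int = 20000,
                     step: float = 0.1, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    samples = np.empty(num_samples)
    p_current = 0.5
    for i in range(num_samples):
        # symmetric random-walk proposal
        p_proposal = p_current + rng.normal(0, step)
        # proposals outside (0, 1) have zero prior probability, so they are always rejected
        if 0 < p_proposal < 1:
            log_ratio = (log_posterior_unnormalised(p_proposal, w, n)
                         - log_posterior_unnormalised(p_current, w, n))
            # accept with probability min(1, posterior ratio)
            if np.log(rng.uniform()) < log_ratio:
                p_current = p_proposal
        samples[i] = p_current
    return samples

samples = metropolis_globe(w=6, n=9)
print(samples.mean(), samples.std())
```

A histogram of `samples` should resemble the grid posterior from earlier, with a mean near 7/11, the mean of Beta(7, 4).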

## Solutions to exercises (spoilers)

Don't view these if you're trying to learn anything for yourself, they can't be unseen!

- 2E1: 2 and 4 (these are equivalent formulations)
- 2E2: 3
- 2E3: 1 and 4 (again, equivalent)
- 2E4: The frequentist would say that 0.7 is the long-run proportion of throws landing on water, given infinitely many throws. The Bayesian would say that 0.7 expresses our uncertainty about the outcome of a single throw: the result is always 1 (water) or 0 (land), but 1 is more likely than 0.


In [None]:
# 2M1:
# Assuming uniform p, calculate the grid approximate posterior for the globe example given the following data.
# 1. W,W,W
_ = grid_approximate_binomial(n=3, k=3, grid_size=50, plot=True)
# 2. WWWL
_ = grid_approximate_binomial(n=4, k=3, grid_size=50, plot=True)
# 3. LWWLWWW
_ = grid_approximate_binomial(n=7, k=5, grid_size=50, plot=True)

In [None]:
# 2M2:
# assume a prior for p that is zero when p < 0.5, positive constant when p >= 0.5. Get the posteriors as above.
grid_size = 50
p_grid = np.linspace(0,1, grid_size)
prior = np.where(p_grid < 0.5, 0, 2)
# 1. W,W,W
_ = grid_approximate_binomial(n=3, k=3, grid_size=grid_size, prior=prior, plot=True)
# 2. WWWL
_ = grid_approximate_binomial(n=4, k=3, grid_size=grid_size, prior=prior, plot=True)
# 3. LWWLWWW
_ = grid_approximate_binomial(n=7, k=5, grid_size=grid_size, prior=prior, plot=True)

- 2M3: Given two globes with P(water | Earth) = 0.7 and P(water | Mars) = 0, one of which is picked with equal probability and tossed to produce a land observation, find P(Earth | land).
 
\begin{equation}
\begin{aligned}
    P(\text{Earth} | \text{land}) &= \frac{P(\text{land} |\text{Earth}) P(\text{Earth})} {P(\text{land}|\text{Earth}) P(\text{Earth}) + P(\text{land}|\text{Mars}) P(\text{Mars})} \\
                    &= \frac{0.3 \times 0.5}{0.3 \times 0.5 + 1 \times 0.5} \\
                    &= \frac{0.15}{0.65} \\
                    &= 0.23
\end{aligned}
\end{equation}             

- 2M4: Three cards: one with two black sides, one with one black and one white side, and one with two white sides. Shuffle the cards, draw one, and place it flat on the table. The side facing up is black. What is the probability the other side is also black, using the counting method?

There are three ways of drawing a card with a black facing up, two of those are from the double-black card, so the probability is 2/3.

- 2M5: Four cards, B/B, B/W, W/W, B/B. What is the probability of the underside being black, given the topside is?

There are five ways of generating a black topside (two from each B/B card, one from the B/W card), four of which have a black underside.
Therefore P = 4/5

- 2M6: B/B, B/W, W/W. Now the draw probability is not uniform - for every way to pull the B/B card, there are 2 ways to pull the B/W and 3 ways to pull the W/W. What is the new probability of the underside being black, given that the topside is?

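Weight each card by the number of ways to pull it: B/B gives $1 \times 2 = 2$ ways of showing black, B/W gives $2 \times 1 = 2$, and W/W gives $3 \times 0 = 0$. Total ways = 2 + 2 = 4, ways with a black underside = 2, so P = 1/2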


- 2M7: B/B, B/W, W/W as before. Now two cards are drawn. Face up black card, face up white card. What is the chance of the first card being B/B now?

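Count the ways of drawing a black-up first card followed by a white-up second card. If the first card is B/B (2 ways of showing black), the second card shows white in 3 ways (1 from B/W, 2 from W/W), giving $2 \times 3 = 6$. If the first card is B/W (1 way of showing black), the second must be W/W (2 ways), giving $1 \times 2 = 2$. The total is $6 + 2 = 8$ ways, of which 6 have B/B as the first card.

Therefore P = 6/8 = 0.75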
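
The counting arguments for 2M4 to 2M6 can be checked by brute-force enumeration. The sketch below is only an illustration: the function name `prob_other_side_black` and the `weights` argument (relative number of ways to pull each card) are made up for this purpose, and 2M7 is not covered.

```python
from fractions import Fraction

def prob_other_side_black(cards, weights=None):
    """P(underside is black | topside is black), by counting ways.

    cards: list of (side_1, side_2) pairs with 'B' or 'W' entries.
    weights: relative number of ways to pull each card (defaults to 1 each).
    """
    if weights is None:
        weights = [1] * len(cards)
    ways_black_up = 0
    ways_black_both = 0
    for (side_1, side_2), weight in zip(cards, weights):
        # either side of the card can end up facing up
        for up, down in ((side_1, side_2), (side_2, side_1)):
            if up == 'B':
                ways_black_up += weight
                if down == 'B':
                    ways_black_both += weight
    return Fraction(ways_black_both, ways_black_up)

three_cards = [('B', 'B'), ('B', 'W'), ('W', 'W')]
p_2m4 = prob_other_side_black(three_cards)
p_2m5 = prob_other_side_black(three_cards + [('B', 'B')])
p_2m6 = prob_other_side_black(three_cards, weights=[1, 2, 3])
print(p_2m4, p_2m5, p_2m6)
```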


- 2H1: Two species of panda bear. $P(\text{twins} | A ) = 0.1$. $P(\text{twins} | B) = 0.2$

Given a panda has just given birth to twins, what is the probability the next birth will be twins?

Naively, P(twins) = P(twins|A)P(A) + P(twins|B)P(B) = $0.1*0.5 + 0.2*0.5 = 0.15$

This doesn't use the information that we have one twin birth.

P(A | prev twins) = P(prev twins | A) P(A) / P(prev twins) $= 0.1*0.5 / 0.15 = 1/3$

Likewise, P(B | prev twins) $= 0.2*0.5/0.15 = 2/3$

Using these updated probabilities, P(twins | prev twins) $= 0.1 * 1/3 + 0.2 * 2/3 = 1/6$

                  

- 2H2: As in 2H1. Find P(A | twins)

Did this as part of 2H1 - probability is 1/3

- 2H3: As in 2H2, but now the second birth is a single infant.  What is P(A | singleton, twins)?

Priors are P(A) = 1/3, P(B) = 2/3

P(A | singleton) = P(singleton | A) P(A) / P(singleton)

$= 0.9 * 1/3 / (0.9 * 1/3 + 0.8 * 2/3) = 0.36$

- 2H4: As in 2H3, but now we have a panda test with the following accuracy:

P(test says A | A) = 0.8

P(test says B | B) = 0.65

To start with, find P(A | test says A) with naive P(A) and P(B)


P(A | test says A) = P(test says A | A) P(A) / P(test says A) $= 0.8 *0.5 / (0.8*0.5+0.35*0.5) = 0.7$

But with our birth information,

P(A | test says A) $= 0.8*0.36 / (0.8*0.36 + 0.35 * (1-0.36)) =0.56$
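
The panda answers (2H1 to 2H4) are all repeated applications of the same two-hypothesis update, so they can be checked with one small helper. This is a sketch with made-up names (`update`, `p_a_twins`, and so on); each call takes the current P(A) and the likelihood of the new observation under species A and under species B.

```python
def update(prior_a: float, like_a: float, like_b: float) -> float:
    """Posterior P(A) after one observation with likelihoods like_a and like_b."""
    return like_a * prior_a / (like_a * prior_a + like_b * (1 - prior_a))

p_a = 0.5                                    # no information yet
p_a_twins = update(p_a, 0.1, 0.2)            # 2H2: after a twin birth
p_next_twins = 0.1 * p_a_twins + 0.2 * (1 - p_a_twins)  # 2H1: next birth is twins
p_a_births = update(p_a_twins, 0.9, 0.8)     # 2H3: then a singleton birth
p_a_test_only = update(p_a, 0.8, 0.35)       # 2H4: test says A, ignoring births
p_a_all = update(p_a_births, 0.8, 0.35)      # 2H4: test says A, using births

print(p_a_twins, p_next_twins, p_a_births, p_a_test_only, p_a_all)
```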