# The Metropolis-Hastings algorithm
Cleverly sampling probability distributions.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
import pandas as pd
# warnings.simplefilter(action='ignore', category=FutureWarning)

The Metropolis-Hastings algorithm is a way to sample from a probability distribution and is widely used in probabilistic programming. `pymc` uses a variation of this when you run `pm.sample()`. The Metropolis-Hastings algorithm is an example of a Markov Chain Monte Carlo (MCMC) method.
 
The Markov Chain part means that the next sample depends only on the last one; you don't need to know anything about the history, just what your last step was. This might seem extreme, but compare it to navigating with GPS: to find your way towards a goal, you don't care about your history, you only need to know where you are right now.

The Monte Carlo part means that we use random techniques to approximate the answer, instead of analytical methods.
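
As a small illustration of the Monte Carlo idea (not part of the original notebook), the sketch below estimates the probability that a standard normal draw lands between -1 and 1 by simply counting random draws, and compares that with the analytical value. The seed, the sample size and the variable names are arbitrary choices.

```python
import math

import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Monte Carlo estimate: the fraction of draws that land in [-1, 1]
mc_estimate = np.mean(np.abs(samples) <= 1.0)

# Analytical answer for a standard normal: erf(1 / sqrt(2)), about 0.683
exact = math.erf(1.0 / math.sqrt(2.0))

print(f"Monte Carlo: {mc_estimate:.4f}, exact: {exact:.4f}")
```

The more draws we take, the closer the random estimate tends to get to the analytical answer.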

In this notebook, you will learn a few concepts:

- trace plot
- random walk
- compare probabilities between distributions
- Markov Chain Monte Carlo

The goal is to get a bit more background on the MCMC method. The building blocks are random walks and calculating probabilities to make the random walk a bit less random. 

## 1. Trace plot
Let's draw a thousand samples from a normal distribution with $\mu=0$ and $\sigma=1$. We do this over time, and track what we have drawn. This is called a *trace plot*.

In [None]:

t = np.linspace(0, 10, 1000)
obs = []
for step in t:
    draw = stats.norm(loc=0, scale=1).rvs(size=1)
    obs.append(draw)
    plt.plot(
        step,
        draw,
        "b.",
    )

As you can see, this is a normal distribution, but sideways. The x-axis is time, and the y-axis is the value we have drawn. The trace plot is a way to visualize a sequence of draws over time.

We could also plot all the observations without the time dimension in a histogram, and we will recognize the familiar normal distribution from which we have been sampling.

In [None]:
plt.hist(np.array(obs), bins=50);

## 2. Random walk

Now, we are going to do something else.
We will draw from a distribution, but the distribution we are going to draw from is connected to the previous draw.

This is called a *random walk*. So: we start with $\mu=0$, $\sigma=1$. We draw a sample, let's say we draw 0.74. This will be the mean of the next draw, so: $\mu=0.74$, $\sigma=1$. Now we might draw 0.41. So, our next draw will be from a distribution with $\mu=0.41$, $\sigma=1$.

You could compare this to the following: instead of doing the same thing over and over again, you will vary just a little bit, but usually you stay close to your previous behaviour. So you explore, but usually you don't take big jumps (to be precise: the standard deviation of your jump is 1).

If we make a trace plot of this, we will see a line that drifts, one way or another.

In [None]:
# random walk
t = np.linspace(0, 10, 1000)
draw = 0
obs = []
for step in t:
    draw = stats.norm(loc=draw, scale=1).rvs(size=1)
    obs.append(draw)
    plt.plot(
        step,
        draw,
        "b.",
    )

This result can't be mapped to a normal distribution easily! This is because the distribution we are sampling from is drifting over time, so this data is actually generated by many different distributions. It is like someone who keeps randomly changing their mind about where to go.
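
To make the drifting a bit more concrete, here is a minimal sketch (an addition, not from the original notebook) that simulates many random walks at once. It relies on the fact that a walk whose next mean is the previous draw is the same as a cumulative sum of independent $N(0, 1)$ steps. The number of walks, the number of steps and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_walks, n_steps = 2000, 400

# Each row is one random walk: the cumulative sum of N(0, 1) steps
steps = rng.normal(loc=0.0, scale=1.0, size=(n_walks, n_steps))
walks = np.cumsum(steps, axis=1)

# Spread across the walks after 100 steps and after 400 steps
std_100 = walks[:, 99].std()
std_400 = walks[:, 399].std()
print(std_100, std_400)  # roughly sqrt(100) = 10 and sqrt(400) = 20
```

The spread keeps growing with the square root of the number of steps, which is why a single histogram of a random walk does not settle into one fixed normal distribution.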

## 3. Compare probabilities
### 3.1 Generate data
Before we can start comparing, we need some data.
Let's start by generating a large population of 30000 values.

In [None]:
np.random.seed(seed=42)
population = stats.norm(loc=20, scale=10).rvs(size=30000)

And from this population, we will sample a small observation of 100. We know what the underlying distribution is, but the data has become a bit more random.

In [None]:
np.random.seed(seed=42)
observation = np.random.choice(population, size=100, replace=False)

In [None]:
plt.figure(figsize=(6, 6))
sns.kdeplot(population, bw_adjust=0.2, color="black", label="population")
sns.histplot(observation, stat="density", bins=30, alpha=0.5, label="observation")
plt.legend()

Now, let's imagine we have access to the mean:

In [None]:
mu_obs = observation.mean()
mu_obs

But we want to estimate the standard deviation. Obviously, in this case, we could simply calculate the std as well (and we know that the std was actually 10 because we generated the data ourselves).

But this example is deliberately kept as simple as possible to show how the process works. The advantage is that when we later run into more complex examples where it is not straightforward to calculate a value, we can still use inference.

In addition to that, our inference will give an estimate of how close we are
to the "real" std. For comparison, let's just directly calculate it as well:
In [None]:
observation.std()

Now, let's try how close we can get to this value with MCMC sampling.

### 3.2 Calculate the probability of the direction
The Metropolis-Hastings algorithm makes a random walk, but it needs to determine how likely a new configuration is. This can serve as a kind of GPS or compass: it tells us whether we are going in the right direction.

We can calculate that with the pdf function. How likely is it to draw a 0 from a distribution with parameters $\mu=0$ and $\sigma=1$?

In [None]:
stats.norm.pdf(0, loc=0, scale=1)

The probability density at 0 for a normal distribution with $\mu=0$ and
$\sigma=1$ is about 0.4.

Note: this is NOT a probability or a percentage! For a continuous distribution, you cannot give the
probability of a single point, only of a range. E.g. the probability of the outside temperature
being exactly 17.4 degrees is zero. You can only talk about the probability of the
temperature lying in an interval, e.g. between 17.0 and 18.0, or maybe
between 17.3 and 17.5. However, we can divide the probability of an interval by its width and
take the limit as the interval shrinks, which gives us the density.
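
The relation between an interval probability and the density can be checked numerically. The sketch below is an illustrative addition: it uses the `cdf` of the standard normal to get the probability of a narrow interval around 0 and divides it by the interval width. The width of 0.01 is an arbitrary choice.

```python
from scipy import stats

dist = stats.norm(loc=0, scale=1)

# Probability of landing in a narrow interval around 0, from the CDF
width = 0.01
prob_interval = dist.cdf(width / 2) - dist.cdf(-width / 2)

# Dividing by the interval width approximates the density at 0
approx_density = prob_interval / width

print(prob_interval)   # a small probability, about 0.004
print(approx_density)  # close to dist.pdf(0), about 0.3989
```

So the interval has a genuine (small) probability, while the density is the probability per unit of width.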

In [None]:
stats.norm.pdf(2, loc=0, scale=1)

A value of 2 is much less likely. This means we can use the pdf to compare two draws, and to figure out whether a draw is more or less likely to come from a specific distribution. Let's plot the pdf for a range of points:

In [None]:
x = np.linspace(-3, 3, 50)
pdf = stats.norm.pdf(x, loc=0, scale=1)
plt.scatter(x, pdf)
plt.title("Samples from the pdf of the normal distribution");

This shows us that we can calculate the densities for a whole array of data points at once. If we use a nicely spread range of data points, we get the familiar bell-shaped curve.

However, we can also calculate the density for every item in our observed
data, under the assumption that the items are drawn from a normal distribution
with a given mean and scale.

In [None]:
probs = stats.norm(loc=mu_obs, scale=1).pdf(observation)
probs[:10]

These are the densities of all the observations we generated, assuming that the distribution we were sampling from was a normal distribution with $\mu=\texttt{mu\_obs}$ and $\sigma=1$.

The nice thing is we can compare this to other assumed means and standard deviations!


We could multiply all the probabilities together, but multiplying a lot of small numbers quickly underflows to zero in floating-point arithmetic, so we are usually better off taking the logs and summing them.

In [None]:
np.sum(np.log(probs))
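
To see why the log matters, here is a minimal sketch on synthetic data (an assumption for illustration, not the notebook's `observation`). With 1000 points, the plain product of densities underflows, while the sum of log-densities stays finite. It uses `logpdf`, which computes the log-density directly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
data = rng.normal(loc=20, scale=10, size=1000)  # synthetic data for illustration

dist = stats.norm(loc=20, scale=10)

# Multiplying 1000 small densities underflows to exactly 0.0
product = np.prod(dist.pdf(data))

# Summing log-densities stays finite; logpdf avoids computing the pdf first
log_likelihood = np.sum(dist.logpdf(data))

print(product, log_likelihood)
```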

### 3.3 Picking the most likely distribution
Let's compare our observations with two different distributions that share the mean `mu_obs` but have different standard deviations: $\sigma=1$ and $\sigma=2$.

In [None]:
probs1 = stats.norm(loc=mu_obs, scale=1).pdf(observation)
probs2 = stats.norm(loc=mu_obs, scale=2).pdf(observation)
np.sum(np.log(probs1)) < np.sum(np.log(probs2))

As you can see, a std of 2 is more likely than a std of 1! (This is what we expect: we generated the data ourselves with std 10, and 2 is closer to 10 than 1 is.)
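
The same comparison can be extended from two candidates to a whole grid of standard deviations. The sketch below is an illustrative addition that uses synthetic data as a stand-in for `observation`; the grid range and step size are arbitrary. The most likely grid value should sit right next to the directly calculated std.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed
data = rng.normal(loc=20, scale=10, size=100)  # stand-in for `observation`
mu = data.mean()

# Log-likelihood of the data for a grid of candidate standard deviations
sigmas = np.linspace(1, 20, 381)  # grid with a step of 0.05
log_liks = np.array(
    [stats.norm(loc=mu, scale=s).logpdf(data).sum() for s in sigmas]
)

best_sigma = sigmas[np.argmax(log_liks)]
print(best_sigma, data.std())  # the best grid value sits next to the sample std
```

A grid search like this only works for one or two parameters, which is one reason to let a random walk do the exploring instead.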

Because we will be comparing distributions, we don't need to translate the logs back with
`np.exp` (we could do it, but it does not change the order, so for comparison it doesn't matter and we don't want to waste
compute on something that doesn't matter).

This gives us a metric that allows us to compare different distributions. Let's take two normal distributions with the same mean, one with $\sigma=3$ and the other with $\sigma=1$.

In [None]:
from inference import Metropolis, Dist

metropolis = Metropolis()
dist_a = Dist(mu_obs, scale=3)
dist_b = Dist(mu_obs, scale=1)
a = metropolis.get_log_probs(observation, dist=dist_a)
b = metropolis.get_log_probs(observation, dist=dist_b)
a, b

So, we can see, there is a much higher probability that our observations are
coming from a distribution a with $\sigma=3$ than from a distribution b with
$\sigma=1$.

To keep things neat, we used a dataclass for our distributions.
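
The `inference` module itself is not shown in this notebook. As a rough idea of what such a dataclass and log-probability helper could look like, here is a hypothetical sketch. The field names `loc` and `scale` follow the usage above, but the body of `get_log_probs` is an assumption and not necessarily the actual implementation. It runs on synthetic data as a stand-in for `observation`.

```python
from dataclasses import dataclass

import numpy as np
from scipy import stats


@dataclass
class Dist:
    """Hypothetical stand-in for the notebook's `Dist`: a normal distribution."""

    loc: float
    scale: float


def get_log_probs(observation: np.ndarray, dist: Dist) -> float:
    """Sum of log-densities of the observations under `dist` (assumed behaviour)."""
    frozen = stats.norm(loc=dist.loc, scale=dist.scale)
    return float(np.sum(frozen.logpdf(observation)))


# Usage on synthetic data (a stand-in for the notebook's `observation`)
rng = np.random.default_rng(42)
data = rng.normal(loc=20, scale=10, size=100)
a = get_log_probs(data, Dist(data.mean(), scale=3))
b = get_log_probs(data, Dist(data.mean(), scale=1))
print(a, b)  # a is much larger (less negative) than b
```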

### 3.4 Accepting or rejecting proposals
Now let's create two distributions, the first one starting with $\mu=0$ and $\sigma=1$.

In [None]:
d1 = Dist(0, 1)

d1

We will make a random walk, but we take just 1 step. We start with `d1` and we will use that standard deviation for a random walk.

Because we want our new standard deviation to be positive, we need to pick a distribution that is always positive. We could pick many things for this (a half-Cauchy, inverse-gamma, half-normal or exponential distribution). My implementation in `Metropolis.random_walk` assumes that $\sigma$ comes from an exponential distribution, but you could change that to something else.

In [None]:
d2 = metropolis.random_walk(d1)
d1, d2

So, we took a first random-walk step. We now have the old distribution `d1` and the new proposal distribution `d2`, which has a different `scale` (i.e. standard deviation) but the same mean. Let's make a trace plot of this process:

In [None]:
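d1 = Dist(0, 1)
for step in range(20):
    print(d1.scale)
    # Propose a new scale: the current scale plus an Exp(1) draw (a shifted exponential)
    s = stats.expon.rvs(loc=d1.scale)
    # Keep the mean, replace the scale
    d1 = Dist(d1.loc, scale=s)
    plt.plot(
        step,
        d1.scale,
        "b.",
    )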

This is just a random walk that explores the space of probabilities by wandering around. But some wandering brings us to places that are more likely to succeed than others.
Let's say we have a current distribution with $\sigma=8$, and we would wander into two different directions. One brings us a $\sigma=7.5$, the other $\sigma=10$:

In [None]:
current = metropolis.get_log_probs(observation, dist=Dist(mu_obs, 8))
proposed1 = metropolis.get_log_probs(observation, dist=Dist(mu_obs, 7.5))
proposed2 = metropolis.get_log_probs(observation, dist=Dist(mu_obs, 10))

Now, what would be the more likely direction?

In [None]:
current, proposed1, proposed2

`proposed1` is a bit less likely than our current distribution, while `proposed2` is more likely.

Metropolis-Hastings handles this situation as follows:

- if the proposed distribution is more likely, always pick it.
- if the proposed distribution is less likely, pick it with a probability equal to the ratio between the proposed and the current likelihood.

To do this, we need to recalculate the actual probability (remember we have been working with *logprobs* so far).

In [None]:
np.exp(proposed1 - current)

If this ratio is 0.5, we will pick the new distribution 50% of the time, even though it is worse. In this case, the chance is about 1 in 100. Let's see how that works:

In [None]:
np.random.seed(seed=42)
np.mean([metropolis.accept(proposed1, current) for _ in range(1000)])

So, yes, indeed the new distribution is accepted only about 1% of the time...

In [None]:
np.mean([metropolis.accept(proposed2, current) for _ in range(1000)])

...while the more likely proposed2 is always accepted.
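
For reference, an acceptance rule like this can be written in a few lines. The function below is a hypothetical sketch and not necessarily how `metropolis.accept` is implemented. It turns the difference of the log-probabilities back into a ratio and compares it with a uniform random draw.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed


def accept(proposed_logp: float, current_logp: float) -> bool:
    """Metropolis acceptance rule on log-probabilities (hypothetical helper)."""
    # Ratio of the likelihoods, recovered from the difference of the logs;
    # min(0, ...) caps the ratio at 1 and avoids overflow in exp
    ratio = np.exp(min(0.0, proposed_logp - current_logp))
    # A uniform draw in [0, 1) falls below `ratio` with probability `ratio`
    return bool(rng.uniform() < ratio)


# A proposal that is half as likely is accepted about half of the time
worse = np.mean([accept(np.log(0.5), 0.0) for _ in range(10_000)])
# A more likely proposal is always accepted
better = np.mean([accept(1.0, 0.0) for _ in range(1_000)])
print(worse, better)
```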

## 4. Putting it all together

We can now wrap this all together:

1. We start with an initial distribution.
2. We make a random walk step from the current distribution to propose a new one.
3. We calculate how likely the proposed distribution is, given the data.
4. If the proposed distribution is more likely, we accept it. If not, we accept
   it with a probability equal to the likelihood ratio.

Look it up in the source code, in `src.models.inference`!
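
If you don't have the source at hand, the four steps above can also be sketched in a self-contained way. Note that this is a simplified stand-in and not the implementation from the source code. It uses a symmetric normal proposal instead of the exponential one mentioned earlier, so no correction for an asymmetric proposal is needed, and it simply rejects non-positive proposals. The names `log_likelihood` and `metropolis_sigma`, the step size and the synthetic data are all assumptions.

```python
import numpy as np
from scipy import stats


def log_likelihood(data: np.ndarray, mu: float, sigma: float) -> float:
    """Sum of normal log-densities; -inf for a non-positive sigma."""
    if sigma <= 0:
        return -np.inf
    return float(np.sum(stats.norm(loc=mu, scale=sigma).logpdf(data)))


def metropolis_sigma(data, mu, n=1000, start=1.0, step=1.0, seed=42):
    """Random-walk Metropolis over sigma, returning (sigma, accepted) pairs."""
    rng = np.random.default_rng(seed)
    sigma = start  # step 1: the initial distribution
    current = log_likelihood(data, mu, sigma)
    trace = []
    for _ in range(n):
        # Step 2: symmetric random-walk proposal around the current sigma
        proposal = sigma + rng.normal(scale=step)
        # Step 3: how likely is the proposal, given the data?
        proposed = log_likelihood(data, mu, proposal)
        # Step 4: accept with probability min(1, likelihood ratio)
        accepted = bool(np.log(rng.uniform()) < proposed - current)
        if accepted:
            sigma, current = proposal, proposed
        trace.append((sigma, accepted))
    return trace


rng = np.random.default_rng(0)
data = rng.normal(loc=20, scale=10, size=100)  # synthetic stand-in for `observation`
trace = metropolis_sigma(data, mu=data.mean(), n=2000)
sigmas = np.array([s for s, _ in trace])
# After discarding the first 500 steps, the chain hovers near the sample std
print(sigmas[500:].mean(), data.std())
```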

Let's test this!

In [None]:
trace = metropolis(n=1000, observation=observation, mu_obs=mu_obs)

In [None]:
data = pd.DataFrame(trace, columns=["sigma", "accept"]).reset_index()
plt.figure(figsize=(12, 6))
sns.scatterplot(data=data, x="index", y="sigma", hue="accept")
plt.ylim(0, 20)

What do we see? The process takes a few steps to walk from the initial distribution with $\sigma=1$ to the more likely area around $\sigma=10$.

This is called the *burn-in* period. Once we are around that value, we will still occasionally accept other values, but the wandering stays concentrated around a value of 10 because proposals that are too far away from it are unlikely to be accepted.

## NUTS
After exploring the Metropolis-Hastings algorithm in detail, it's worth contrasting it with the No-U-Turn Sampler (NUTS) that's commonly used in PyMC. Metropolis-Hastings proposes new states from a proposal distribution and accepts or rejects them based on the target distribution. NUTS builds on Hamiltonian Monte Carlo: it uses gradients of the target distribution to follow a trajectory through parameter space, and it adaptively chooses the trajectory length by detecting when the path starts to double back on itself (making a "U-turn"). This adaptive behavior allows NUTS to explore the posterior distribution efficiently without manual tuning of the trajectory length. NUTS typically converges faster than Metropolis-Hastings for complex models with many continuous parameters, which explains its popularity as the default sampler in modern probabilistic programming frameworks like PyMC.