# Bayesian hierarchical modeling using HMC/NUTS
The [Bayesian Regression](bayesian_regression.ipynb) tutorial shows how to fit a Bayesian model using variational inference. Markov chain Monte Carlo (MCMC) is another family of methods for fitting Bayesian models. An MCMC algorithm produces a chain of samples which converges to the posterior distribution of the model latent variables. Hamiltonian Monte Carlo (HMC) [1] is an algorithm for efficiently evolving the chain using gradient information. The No-U-Turn Sampler (NUTS) [2] is an extension of HMC that eliminates the need to choose certain tuning parameters, making it suitable for black-box inference in Pyro. You can view a nice visualization of these methods [here](https://chi-feng.github.io/mcmc-demo/app.html#EfficientNUTS,banana) [5].
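
To make the role of gradient information concrete, here is a minimal sketch of plain HMC for a one-dimensional standard normal target, written in pure Python rather than with Pyro's `HMC` kernel. The function name `hmc_standard_normal`, the step size, and the number of leapfrog steps are illustrative assumptions; hand-picking the path length like this is exactly what NUTS is designed to avoid.

```python
import math
import random

def hmc_standard_normal(num_samples, step_size=0.2, num_steps=10, seed=0):
    """Sample from N(0, 1) with basic HMC (fixed step size and path length)."""
    rng = random.Random(seed)

    def grad_neg_log_density(q):
        # The target has negative log density U(q) = q^2 / 2, so dU/dq = q.
        return q

    q, samples = 0.0, []
    for _ in range(num_samples):
        p = rng.gauss(0.0, 1.0)  # resample the auxiliary momentum
        q_new, p_new = q, p
        # Leapfrog integration of the Hamiltonian dynamics.
        p_new -= 0.5 * step_size * grad_neg_log_density(q_new)
        for step in range(num_steps):
            q_new += step_size * p_new
            if step < num_steps - 1:
                p_new -= step_size * grad_neg_log_density(q_new)
        p_new -= 0.5 * step_size * grad_neg_log_density(q_new)
        # Metropolis correction for the discretization error.
        h_old = 0.5 * q * q + 0.5 * p * p
        h_new = 0.5 * q_new * q_new + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples

samples = hmc_standard_normal(2000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print("sample mean: {:.3f}, sample variance: {:.3f}".format(mean, var))
```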

Bayesian hierarchical models estimate posteriors of latent variables in nested hierarchies of sub-models. The posterior of a sub-model latent variable is determined by both the sub-model data and the population variables in the hierarchy above it, which are in turn determined by priors and data from other sub-models. In this sense, information is said to be partially pooled across sub-models.

In this tutorial, we use NUTS to compare partial pooling with two models that represent the two extremes of information sharing. At one end of the spectrum, we can throw away knowledge of sub-models and treat every data point as coming from a single population distribution. At the other, we can treat each sub-model as completely independent. Partial pooling lies in between, estimating latent variables for both the population and individual sub-models. This avoids overfitting while retaining the ability to make inferences about the sub-models.

This tutorial is based on [3] and uses the baseball example data from [4]. Baseball players have at-bats which are opportunities to get hits. The hits $h_i$ for player $i$ are modeled as having a binomial distribution with success probability $\phi_i$ and $K_i$ trials. The various models show different ways of jointly modeling the $\phi_i$ variables for all players. 
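
As a small illustration of this likelihood, the sketch below evaluates the binomial log probability in plain Python for 18 hits in 45 at-bats (the first row of the dataset used in this tutorial). The helper `binomial_log_prob` and the candidate values of $\phi$ are assumptions made for illustration; the models themselves use Pyro's `Binomial` distribution.

```python
import math

def binomial_log_prob(hits, at_bats, phi):
    """Log probability of `hits` successes in `at_bats` trials with success probability `phi`."""
    log_choose = (math.lgamma(at_bats + 1)
                  - math.lgamma(hits + 1)
                  - math.lgamma(at_bats - hits + 1))
    return log_choose + hits * math.log(phi) + (at_bats - hits) * math.log(1 - phi)

# Candidate values of phi are arbitrary, chosen only for illustration.
for phi in (0.2, 0.3, 0.4):
    print("phi={:.1f}  log prob={:.3f}".format(phi, binomial_log_prob(18, 45, phi)))
```

The log probability is largest at $\phi = 18/45 = 0.4$, the raw batting average, which is the maximum likelihood estimate for a single player.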

## Setup
Import modules and set configuration variables.

In [None]:
from __future__ import absolute_import, division, print_function

import argparse
import logging
import math
import os
from IPython.display import display, Markdown

import numpy as np
import pandas as pd
import torch

import pyro
import pyro.poutine as poutine
from pyro.distributions import Beta, Binomial, HalfCauchy, Normal, Pareto, Uniform
from pyro.distributions.util import logsumexp
from pyro.infer import EmpiricalMarginal
from pyro.infer.abstract_infer import TracePredictive
from pyro.infer.mcmc import MCMC, NUTS

pyro.enable_validation(True)
DATA_URL = "https://d2fefpcigoriu7.cloudfront.net/datasets/EfronMorrisBB.txt"
pyro.set_rng_seed(1)

smoke_test = 'CI' in os.environ 
num_samples = 1 if smoke_test else 200
num_chains = 4
warmup_steps = 1 if smoke_test else 100

## Data
Ingest and take a look at the data. The data from the first 45 at-bats are already in separate columns. We will use these as the training set and test on the rest of the season.

In [None]:
def train_test_split(pd_dataframe):
    pd_dataframe["RemainingHits"] = pd_dataframe["SeasonHits"] - pd_dataframe["Hits"]
    train_data = torch.tensor(pd_dataframe[["At-Bats", "Hits"]].values, dtype=torch.float)
    test_data = torch.tensor(pd_dataframe[["RemainingAt-Bats", "RemainingHits"]].values, dtype=torch.float)
    first_name = pd_dataframe["FirstName"].values
    last_name = pd_dataframe["LastName"].values
    player_names = [" ".join([first, last]) for first, last in zip(first_name, last_name)]
    return train_data, test_data, player_names

baseball_dataset = pd.read_csv(DATA_URL, sep="\t")
train, _, player_names = train_test_split(baseball_dataset)
at_bats, hits = train[:, 0], train[:, 1]
baseball_dataset.head()

Sample output:

|&nbsp;|FirstName|LastName|At-Bats|Hits|BattingAverage|RemainingAt-Bats|RemainingAverage|SeasonAt-Bats|SeasonHits|SeasonAverage|RemainingHits|
|---|---|---|---|---|---|---|---|---|---|---|---|
|__0__|Roberto|Clemente|45|18|0.4|367|0.346|412|145|0.352|127|


## Models
### Fully pooled
One possible approach is to assume a common probability $\phi$ of a hit for every at-bat, regardless of player.

\begin{align} \phi \sim&\ \textrm{Unif}(0, 1) \\
h_i \sim&\ \textrm{Binom}(\phi, K_i)\end{align}
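
Because the uniform prior is a $\textrm{Beta}(1, 1)$ distribution, which is conjugate to the binomial likelihood, this model has a closed-form posterior, $\phi \mid h \sim \textrm{Beta}(1 + \sum_i h_i,\ 1 + \sum_i (K_i - h_i))$, that can serve as a sanity check on sampler output. The sketch below computes its mean and standard deviation in plain Python. The helper name is hypothetical, and all data values except the first pair (18 hits in 45 at-bats) are made up for illustration.

```python
import math

def fully_pooled_posterior(hits, at_bats):
    """Return (alpha, beta) of the Beta posterior for phi under a Unif(0, 1) prior."""
    alpha = 1 + sum(hits)
    beta = 1 + sum(k - h for h, k in zip(hits, at_bats))
    return alpha, beta

# Hypothetical data: only the first player (18 of 45) comes from the dataset.
hits = [18, 12, 7]
at_bats = [45, 45, 45]

alpha, beta = fully_pooled_posterior(hits, at_bats)
posterior_mean = alpha / (alpha + beta)
posterior_std = math.sqrt(alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))
print("Beta({}, {}): mean={:.3f}, std={:.3f}".format(alpha, beta, posterior_mean, posterior_std))
```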

In [None]:
def fully_pooled(at_bats):
    phi_prior = Uniform(at_bats.new_tensor(0), at_bats.new_tensor(1))
    phi = pyro.sample("phi", phi_prior)
    return pyro.sample("obs", Binomial(at_bats, phi))

### Not pooled
We could also assume independent uniform priors for each $\phi_i$, so each player sub-model is completely separate.

\begin{align}\phi_i \sim&\ \textrm{Unif}(0, 1) \\
h_i \sim&\ \textrm{Binom}(\phi_i, K_i) \end{align}

In [None]:
def not_pooled(at_bats):
    num_players = at_bats.shape[0]
    phi_prior = Uniform(at_bats.new_tensor(0), at_bats.new_tensor(1)).expand_by([num_players]).independent(1)
    phi = pyro.sample("phi", phi_prior)
    return pyro.sample("obs", Binomial(at_bats, phi))

### Partial pooling: beta-binomial
Using a hierarchical model, we can specify the player hit probabilities as random draws from a common population distribution. In this case, the $\phi_i$ have a beta distribution. The two parameters of the beta distribution are written in terms of $m$, the mean, and $\kappa$, a concentration parameter that controls how tightly the distribution is concentrated around its mean. We give $m$ a uniform prior and $\kappa$ a Pareto prior with support on $[1, \infty)$ and shape parameter 1.5. The complete model is

\begin{align} m \sim&\ \textrm{Unif}(0,1) \\
\kappa \sim&\ \textrm{Pareto}(1, 1.5) \\
\phi_i \sim&\ \textrm{Beta}(m\kappa, (1-m)\kappa) \\
h_i \sim&\ \textrm{Binom}(\phi_i, K_i)\end{align}
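
To see where the shrinkage comes from, condition on fixed values of $m$ and $\kappa$. By beta-binomial conjugacy, $\phi_i \mid h_i, m, \kappa \sim \textrm{Beta}(m\kappa + h_i,\ (1-m)\kappa + K_i - h_i)$, whose mean $(m\kappa + h_i)/(\kappa + K_i)$ is a weighted average of the population mean $m$ and the raw average $h_i / K_i$. The sketch below evaluates this mean for a player with 18 hits in 45 at-bats. The values of `m` and `kappa` are hypothetical; the full model infers them from the data.

```python
def conditional_posterior_mean(hits, at_bats, m, kappa):
    """Mean of phi_i given the data and fixed population parameters m and kappa."""
    return (m * kappa + hits) / (kappa + at_bats)

hits, at_bats = 18, 45  # raw average 0.4
m = 0.27                # hypothetical population mean
for kappa in (1.0, 50.0, 1000.0):  # hypothetical concentrations
    shrunk = conditional_posterior_mean(hits, at_bats, m, kappa)
    print("kappa={:7.1f}  posterior mean={:.3f}".format(kappa, shrunk))
```

Small $\kappa$ leaves the estimate near the raw average, close to the unpooled model, while large $\kappa$ pulls it toward $m$, close to the fully pooled model.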

In [None]:
def partially_pooled(at_bats):
    num_players = at_bats.shape[0]
    m = pyro.sample("m", Uniform(at_bats.new_tensor(0), at_bats.new_tensor(1)))
    kappa = pyro.sample("kappa", Pareto(at_bats.new_tensor(1), at_bats.new_tensor(1.5)))
    phi_prior = Beta(m * kappa, (1 - m) * kappa).expand_by([num_players]).independent(1)
    phi = pyro.sample("phi", phi_prior)
    return pyro.sample("obs", Binomial(at_bats, phi))

### Partial pooling: log-odds
Another hierarchical model places a prior on the log odds of getting a hit. An advantage of this approach is that covariates can easily be added as linear terms on the log-odds scale, producing a Bayesian logistic regression. Here the log odds of a hit, $\alpha_i$, have a normal distribution. The hyperprior on the mean of this normal is itself a normal with mean -1 and standard deviation 1, reflecting prior knowledge that long-term batting averages are less than 0.5. The hyperprior on the standard deviation of the normal is a half-Cauchy distribution.

\begin{align}
\mu\sim&\ \textrm{Normal}(-1, 1)\\
\sigma \sim&\ \textrm{HalfCauchy}(1)\\
\alpha_i \sim&\ \textrm{Normal}(\mu, \sigma)\\
\phi_i =&\ (1 + e^{-\alpha_i})^{-1} \\
h_i \sim& \ \textrm{Binom}(\phi_i, K_i)
\end{align}
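
The inverse-logit (sigmoid) transform in this model maps log odds to probabilities. As a quick illustration in plain Python, the hyperprior mean of -1 corresponds to a hit probability of roughly 0.27. The helper names `sigmoid` and `logit` are introduced here for illustration only.

```python
import math

def sigmoid(alpha):
    """Map log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-alpha))

def logit(phi):
    """Map a probability to log odds (the inverse of sigmoid)."""
    return math.log(phi / (1.0 - phi))

print(sigmoid(-1.0))  # probability implied by log odds of -1
print(logit(0.5))     # a probability of 0.5 corresponds to log odds of 0
```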

In [None]:
def partially_pooled_with_logit(at_bats):
    num_players = at_bats.shape[0]
    loc = pyro.sample("loc", Normal(at_bats.new_tensor(-1), at_bats.new_tensor(1)))
    scale = pyro.sample("scale", HalfCauchy(scale=at_bats.new_tensor(1)))
    alpha = pyro.sample("alpha", Normal(loc, scale).expand_by([num_players]).independent(1))
    return pyro.sample("obs", Binomial(at_bats, logits=alpha))

## Other functions
Return a conditioned model object given a model function, hits data, and fixed at-bat data.

In [None]:
def conditioned_model(model, at_bats, hits):
    return poutine.condition(model, data={"obs": hits})(at_bats)

Summarize posterior samples from MCMC chains.

In [None]:
def get_site_stats(array, player_names):
    if len(array.shape) == 1:
        df = pd.DataFrame(array).transpose()
    else:
        df = pd.DataFrame(array, columns=player_names).transpose()
    return df.apply(pd.Series.describe, axis=1)[["mean", "std", "25%", "50%", "75%"]]

def summary(traces, sites, player_names, transforms={}):
    marginal = EmpiricalMarginal(traces, sites).get_samples_and_weights()[0].numpy()
    site_stats = {}
    for i in range(marginal.shape[1]):
        site_name = sites[i]
        marginal_site = marginal[:, i]
        if site_name in transforms:
            marginal_site = transforms[site_name](marginal_site)
        site_stats[site_name] = get_site_stats(marginal_site, player_names)
    return site_stats

Run each sample from the posterior on both training and test sets and compare summarized predictions to actual outcomes.

In [None]:
def sample_posterior_predictive(posterior_predictive, baseball_dataset):
    train, test, player_names = train_test_split(baseball_dataset)
    at_bats = train[:, 0]
    at_bats_remaining = test[:, 0]
    display(Markdown("### Posterior predictions \n #### Hit Rate - Initial 45 At Bats"))
    train_predict = posterior_predictive.run(at_bats)
    train_summary = summary(train_predict, sites=["obs"], player_names=player_names)["obs"]
    train_summary = train_summary.assign(ActualHits=baseball_dataset[["Hits"]].values)
    display(train_summary)
    display(Markdown("#### Hit Rate - Season Predictions"))
    test_predict = posterior_predictive.run(at_bats_remaining)
    test_summary = summary(test_predict, sites=["obs"], player_names=player_names)["obs"]
    test_summary = test_summary.assign(ActualHits=baseball_dataset[["RemainingHits"]].values)
    display(test_summary)

Evaluate the log probability density of the observed test data under the posterior samples. A higher density indicates a better fit between the data and the estimated posterior.
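
The average over posterior samples is taken in log space to avoid underflow. With per-sample log densities $\ell_1, \dots, \ell_S$, the quantity of interest is $\log \frac{1}{S}\sum_s e^{\ell_s} = \textrm{logsumexp}(\ell) - \log S$. A minimal pure-Python sketch of that identity follows; `log_mean_exp` is a hypothetical helper and the log densities are made-up numbers.

```python
import math

def log_mean_exp(log_values):
    """Compute log(mean(exp(x))) stably by factoring out the maximum."""
    largest = max(log_values)
    total = sum(math.exp(x - largest) for x in log_values)
    return largest + math.log(total) - math.log(len(log_values))

# Made-up log densities, small enough that exp() underflows to 0.0 on its own.
log_densities = [-1000.0, -1001.0, -1002.0]
print(log_mean_exp(log_densities))
```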

In [None]:
def evaluate_log_predictive_density(model, model_trace_posterior, baseball_dataset):
    _, test, player_names = train_test_split(baseball_dataset)
    at_bats_remaining, hits_remaining = test[:, 0], test[:, 1]
    test_eval = TracePredictive(conditioned_model,
                                model_trace_posterior,
                                num_samples=num_samples)
    test_eval.run(model, at_bats_remaining, hits_remaining)
    trace_log_pdf = []
    for tr in test_eval.exec_traces:
        trace_log_pdf.append(tr.log_prob_sum())
    posterior_pred_density = logsumexp(torch.stack(trace_log_pdf), dim=-1) - math.log(len(trace_log_pdf))
    print("Log posterior predictive density: {:.4f}\n".format(posterior_pred_density))

## Model fitting
We are now ready to run our MCMC samplers and examine the performance of the four models. We will walk through the fully pooled model step by step.

First, construct a NUTS kernel using our `conditioned_model` function. This object can be reused for each model, since we will pass the model function in when we run the sampler. Note that we could use `HMC` in place of `NUTS` here if we additionally specify a step size and number of steps.

In [None]:
nuts_kernel = NUTS(conditioned_model)

Next, construct and run the `MCMC` object. Among other details, this object contains an `.exec_traces` attribute, which holds the posterior samples generated at each post-warmup step of the Markov chains. This is the most time-consuming step, with run time roughly proportional to the number of samples drawn.

In [None]:
posterior_fully_pooled = MCMC(nuts_kernel,
                            num_samples=num_samples,
                            warmup_steps=warmup_steps,
                            num_chains=num_chains) \
    .run(fully_pooled, at_bats, hits)

We can examine summary statistics of the empirical marginal distribution of our population hit probability $\phi$.

In [None]:
summary(posterior_fully_pooled, sites=["phi"], player_names=player_names)["phi"]

The `TracePredictive` object holds a model function and the list of posterior samples. When it is called inside `sample_posterior_predictive`, the model is run for each posterior sample with latent variables set to their values in the sample. This produces a sample of the model's return value under the approximate posterior.
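
Conceptually, this amounts to drawing one set of simulated hits per posterior sample. The plain-Python sketch below mimics that loop for a single player without using Pyro. The posterior samples of $\phi$ are hypothetical stand-ins for MCMC output, and `simulate_hits` is an illustrative helper rather than part of any library.

```python
import random

def simulate_hits(phi_samples, at_bats, seed=0):
    """Draw one binomial outcome per posterior sample of phi."""
    rng = random.Random(seed)
    return [sum(rng.random() < phi for _ in range(at_bats)) for phi in phi_samples]

# Hypothetical posterior samples of phi (stand-ins for MCMC output).
phi_samples = [0.25, 0.26, 0.27, 0.28, 0.29] * 100
predicted_hits = simulate_hits(phi_samples, at_bats=45)
print("posterior predictive mean hits:", sum(predicted_hits) / len(predicted_hits))
```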

In [None]:
posterior_predictive = TracePredictive(fully_pooled,
                                       posterior_fully_pooled,
                                       num_samples=num_samples)
sample_posterior_predictive(posterior_predictive, baseball_dataset)

Finally, we evaluate the predictive density of the actual hits by each player in the test dataset. This gives us a basis for comparison with other models. 

In [None]:
evaluate_log_predictive_density(fully_pooled, posterior_fully_pooled, baseball_dataset)

Now we run the other three models.

In [None]:
posterior_not_pooled = MCMC(nuts_kernel,
                            num_samples=num_samples,
                            warmup_steps=warmup_steps,
                            num_chains=num_chains) \
    .run(not_pooled, at_bats, hits)
evaluate_log_predictive_density(not_pooled, 
                                posterior_not_pooled, 
                                baseball_dataset)


posterior_partially_pooled = MCMC(nuts_kernel,
                            num_samples=num_samples,
                            warmup_steps=warmup_steps,
                            num_chains=num_chains) \
    .run(partially_pooled, at_bats, hits)
evaluate_log_predictive_density(partially_pooled, 
                                posterior_partially_pooled, 
                                baseball_dataset)

posterior_partially_pooled_with_logit = MCMC(nuts_kernel,
                            num_samples=num_samples,
                            warmup_steps=warmup_steps,
                            num_chains=num_chains) \
    .run(partially_pooled_with_logit, at_bats, hits)
evaluate_log_predictive_density(partially_pooled_with_logit, 
                                posterior_partially_pooled_with_logit, 
                                baseball_dataset)

The partial pooling models have higher predictive density, with the winner depending on the choice of random seed. Looking at the test predictions from the logit model, we can see that they are much closer to the actual outcomes than in the fully pooled model above. However, in contrast to the unpooled model, the player $\phi_i$ values have been shrunk toward the population mean from the raw batting averages.

In [None]:
posterior_predictive = TracePredictive(partially_pooled_with_logit,
                                       posterior_partially_pooled_with_logit,
                                       num_samples=num_samples)
sample_posterior_predictive(posterior_predictive, baseball_dataset)
display(Markdown("$\\phi$=Sigmoid($\\alpha$)"),
        summary(posterior_partially_pooled_with_logit, sites=["alpha"], player_names=player_names,
                transforms={"alpha": lambda x: 1. / (1 + np.exp(-x))})["alpha"])

This code is available on [GitHub](https://github.com/uber/pyro/blob/dev/examples/baseball.py).

## References
[1] Radford M. Neal, [MCMC using Hamiltonian dynamics](https://arxiv.org/pdf/1206.1901.pdf)

[2] Matthew D. Hoffman and Andrew Gelman, [The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo](http://jmlr.org/papers/volume15/hoffman14a/hoffman14a.pdf)

[3] Bob Carpenter, [Hierarchical Partial Pooling for Repeated Binary Trials](http://mc-stan.org/users/documentation/case-studies/pool-binary-trials.html)

[4] Bradley Efron and Carl Morris (1975), "Data analysis using Stein's estimator and its generalizations", J. Amer. Statist. Assoc., 70, 311-319.

[5] https://chi-feng.github.io/mcmc-demo/app.html