# Benchmark NumPyro on a large dataset

This notebook uses `numpyro` to replicate the experiments in reference [1], which evaluates the performance of NUTS on various frameworks. The benchmark is run with CUDA 10.0 on an NVIDIA RTX 2070.

In [1]:
import time

import numpy as onp

import jax.numpy as np
from jax import random

import numpyro
import numpyro.distributions as dist
from numpyro.examples.datasets import COVTYPE, load_dataset
from numpyro.infer import HMC, MCMC, NUTS
assert numpyro.__version__.startswith('0.2.1')

# NB: replace "gpu" with "cpu" to run this notebook on CPU
numpyro.set_platform("gpu")

We follow the preprocessing steps in the [source code](https://github.com/google-research/google-research/blob/master/simple_probabilistic_programming/no_u_turn_sampler/logistic_regression.py) of reference [1]:

In [2]:
_, fetch = load_dataset(COVTYPE, shuffle=False)
features, labels = fetch()

# normalize features and add intercept
features = (features - features.mean(0)) / features.std(0)
features = np.hstack([features, np.ones((features.shape[0], 1))])

# make binary labels (as in the reference code, compare against the argmax index of the class counts)
_, counts = onp.unique(labels, return_counts=True)
specific_category = np.argmax(counts)
labels = (labels == specific_category)

N, dim = features.shape
print("Data shape:", features.shape)
print("Label distribution: {} has label 1, {} has label 0"
      .format(labels.sum(), N - labels.sum()))

Data shape: (581012, 55)
Label distribution: 211840 has label 1, 369172 has label 0
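As a sanity check, the same preprocessing can be run on a tiny array. This is only a sketch: `toy_features` and `toy_labels` are made-up stand-ins for the COVTYPE arrays, and plain NumPy (imported as `onp`, as in the notebook) replaces `jax.numpy`.

```python
import numpy as onp

# Hypothetical toy data standing in for the COVTYPE features and labels.
toy_features = onp.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0], [4.0, 30.0]])
toy_labels = onp.array([1, 2, 2, 3])

# Standardize each column, then append a column of ones as the intercept.
toy_features = (toy_features - toy_features.mean(0)) / toy_features.std(0)
toy_features = onp.hstack([toy_features, onp.ones((toy_features.shape[0], 1))])

# Binarize the labels with the same three lines as the cell above.
_, counts = onp.unique(toy_labels, return_counts=True)
specific_category = onp.argmax(counts)  # position in the sorted unique labels
toy_binary = toy_labels == specific_category

print("shape:", toy_features.shape)
print("binary labels:", toy_binary)
```

Note that `argmax` returns a position in the sorted unique labels rather than a label value, so when labels start at 1 the comparison selects the label whose value equals that position.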


Now, we construct the model:

In [3]:
def model(data, labels):
    coefs = numpyro.sample('coefs', dist.Normal(np.zeros(dim), np.ones(dim)))
    logits = np.dot(data, coefs)
    return numpyro.sample('obs', dist.Bernoulli(logits=logits), obs=labels)
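To make explicit what the samplers evaluate at every leapfrog step, here is a minimal sketch of the negative log joint density of this model and its gradient. It is written by hand in plain NumPy (imported as `onp`) rather than taken from NumPyro; the function names `potential_energy` and `potential_grad` and the toy data are assumptions for illustration only.

```python
import numpy as onp

def potential_energy(coefs, data, labels):
    """Negative log joint density: standard normal prior + Bernoulli likelihood."""
    logits = data @ coefs
    # Bernoulli log-likelihood with logits z: y * z - log(1 + exp(z))
    log_lik = onp.sum(labels * logits - onp.logaddexp(0.0, logits))
    # Independent standard normal log-prior on each coefficient
    log_prior = onp.sum(-0.5 * coefs ** 2 - 0.5 * onp.log(2.0 * onp.pi))
    return -(log_lik + log_prior)

def potential_grad(coefs, data, labels):
    """Gradient of potential_energy with respect to coefs."""
    logits = data @ coefs
    probs = onp.exp(-onp.logaddexp(0.0, -logits))  # numerically stable sigmoid
    return data.T @ (probs - labels) + coefs

# Hypothetical toy problem: 5 rows, 3 columns.
rng = onp.random.default_rng(0)
toy_data = rng.normal(size=(5, 3))
toy_labels = onp.array([1.0, 0.0, 1.0, 1.0, 0.0])

print(potential_energy(onp.zeros(3), toy_data, toy_labels))
print(potential_grad(onp.zeros(3), toy_data, toy_labels))
```

Both functions touch every row of `data` once, so for the 581012-row dataset the cost of one leapfrog step is dominated by a matrix-vector product and its transpose.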

## Benchmark HMC

In [4]:
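# fixed step size (no adaptation); each trajectory spans 10 leapfrog steps
step_size = np.sqrt(0.5 / N)
kernel = HMC(model, step_size=step_size, trajectory_length=(10 * step_size), adapt_step_size=False)
# collect_warmup=True below keeps the warmup iterations, so num_steps covers all 1000 iterations
mcmc = MCMC(kernel, num_warmup=500, num_samples=500, progress_bar=False)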
tic = time.time()
mcmc.run(random.PRNGKey(2019), features, labels, extra_fields=['num_steps'], collect_warmup=True)
num_leapfrogs = mcmc.get_extra_fields()['num_steps'].sum().copy()
toc = time.time()
print("number of leapfrog steps:", num_leapfrogs)
print("avg. time for each step :", (toc - tic) / num_leapfrogs)
mcmc.print_summary()

number of leapfrog steps: 10000
avg. time for each step : 0.0025879729270935057


                mean       std    median      5.0%     95.0%     n_eff     r_hat
  coefs[0]      2.16      0.08      2.15      2.14      2.15     98.97      1.00
  coefs[1]      0.07      0.07      0.07      0.06      0.09    237.32      1.00
  coefs[2]      0.02      0.07      0.02      0.01      0.05    133.57      1.00
  coefs[3]     -0.41      0.06     -0.41     -0.42     -0.40    817.35      1.00
  coefs[4]     -0.13      0.03     -0.13     -0.13     -0.12   1369.19      1.00
  coefs[5]     -0.23      0.02     -0.23     -0.24     -0.23    139.04      1.01
  coefs[6]     -0.22      0.09     -0.22     -0.25     -0.21    199.85      1.01
  coefs[7]     -0.66      0.04     -0.66     -0.68     -0.65     79.01      1.00
  coefs[8]      0.36      0.06      0.36      0.36      0.37    108.22      1.00
  coefs[9]     -0.04      0.02     -0.04     -0.05     -0.04    328.03      1.00
 coefs[10]      0.54      0

On CPU, we get `avg. time for each step : 0.029236876797676087`.

## Benchmark NUTS

In [5]:
mcmc = MCMC(NUTS(model), num_warmup=50, num_samples=50, progress_bar=False)
tic = time.time()
mcmc.run(random.PRNGKey(2019), features, labels, extra_fields=['num_steps'], collect_warmup=True)
num_leapfrogs = mcmc.get_extra_fields()['num_steps'].sum().copy()
toc = time.time()
print("number of leapfrog steps:", num_leapfrogs)
print("avg. time for each step :", (toc - tic) / num_leapfrogs)
mcmc.print_summary()

number of leapfrog steps: 60657
avg. time for each step : 0.002400124275651143


                mean       std    median      5.0%     95.0%     n_eff     r_hat
  coefs[0]      2.12      0.42      1.97      1.93      2.55      9.27      1.12
  coefs[1]     -0.10      0.22     -0.04     -0.16     -0.01     13.85      1.06
  coefs[2]     -0.10      0.14     -0.06     -0.20      0.14     23.91      1.06
  coefs[3]     -0.24      0.24     -0.30     -0.32     -0.13     13.89      1.07
  coefs[4]     -0.15      0.19     -0.09     -0.58     -0.08     10.13      1.11
  coefs[5]     -0.08      0.24     -0.14     -0.20      0.05     13.84      1.06
  coefs[6]      0.23      0.24      0.25     -0.18      0.32     17.18      1.00
  coefs[7]     -0.57      0.22     -0.65     -0.71     -0.26      6.89      1.18
  coefs[8]      0.44      0.31      0.58      0.04      0.67      5.10      1.25
  coefs[9]      0.02      0.12     -0.01     -0.02      0.03     16.53      1.05
 coefs[10]      0.63      0.

On CPU, we get `avg. time for each step : 0.029149702710295957`.

## Compare to other frameworks

Average time per leapfrog step:

| Framework     |    HMC    |    NUTS   |
| ------------- |----------:|----------:|
| Edward2 (CPU) |           |  56.1 ms  |
| Edward2 (GPU) |           |   9.4 ms  |
| Pyro (CPU)    |  35.4 ms  |  35.3 ms  |
| Pyro (GPU)    |   3.5 ms  |   4.2 ms  |
| NumPyro (CPU) |  29.2 ms  |  29.1 ms  |
| NumPyro (GPU) |   2.6 ms  |   2.4 ms  |

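Note that in some situations, HMC is slower per leapfrog step than NUTS. A likely reason is that the number of leapfrog steps in each HMC trajectory is fixed to $10$, while in NUTS it is not fixed, so any per-iteration overhead is spread over a different number of steps.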
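The leapfrog counts reported by the two runs can be reproduced with a little arithmetic. The sketch below assumes the number of steps per HMC trajectory is `trajectory_length / step_size` (rounded here to sidestep floating-point error); all other numbers are copied from the cells and outputs above.

```python
import math

N = 581012                          # number of rows, from the data shape
step_size = math.sqrt(0.5 / N)
trajectory_length = 10 * step_size

# Assumption: steps per trajectory = trajectory_length / step_size.
hmc_steps_per_iter = round(trajectory_length / step_size)
hmc_iters = 500 + 500               # warmup + samples, both collected
hmc_total = hmc_steps_per_iter * hmc_iters

nuts_total = 60657                  # reported by the NUTS run
nuts_iters = 50 + 50                # warmup + samples, both collected
nuts_steps_per_iter = nuts_total / nuts_iters

print("HMC  steps per iteration:", hmc_steps_per_iter, "total:", hmc_total)
print("NUTS steps per iteration (average):", nuts_steps_per_iter)
```

Under this assumption the HMC total matches the reported 10000 steps, while the NUTS run averages several hundred leapfrog steps per iteration.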

**Some takeaways:**
+ The overhead of iterative NUTS is pretty small, so most of the computation time is indeed spent evaluating the potential function and its gradient.
+ The GPU outperforms the CPU by a large margin. The data is large, so evaluating the potential function on GPU is clearly faster than doing so on CPU.
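To put a number on "a large margin", the CPU-to-GPU speedup can be computed directly from the table. This is a small sketch; the `times_ms` dictionary simply transcribes the table entries (in milliseconds per leapfrog step), and the empty Edward2 HMC cells are left out.

```python
# Per-leapfrog-step times in milliseconds as (CPU, GPU), copied from the table.
times_ms = {
    "Edward2": {"NUTS": (56.1, 9.4)},
    "Pyro": {"HMC": (35.4, 3.5), "NUTS": (35.3, 4.2)},
    "NumPyro": {"HMC": (29.2, 2.6), "NUTS": (29.1, 2.4)},
}

# Speedup = CPU time / GPU time for each framework and sampler.
speedups = {
    (framework, sampler): cpu / gpu
    for framework, samplers in times_ms.items()
    for sampler, (cpu, gpu) in samplers.items()
}

for (framework, sampler), ratio in speedups.items():
    print(f"{framework:8s} {sampler:4s} GPU speedup: {ratio:.1f}x")
```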

## References

1. Dustin Tran, Matthew D. Hoffman, Dave Moore, Christopher Suter, Srinivas Vasudevan, Alexey Radul, Matthew Johnson, Rif A. Saurous. *Simple, Distributed, and Accelerated Probabilistic Programming*. [arXiv:1811.02091](https://arxiv.org/abs/1811.02091)