
JIT tscan and HMC warmup steps #115

Merged
merged 11 commits into master from scan-fix on Apr 19, 2019
Conversation

@neerajprad (Member) commented Apr 18, 2019

Fixes #114.

This fixes the issue with tscan not being jittable. With this change and jitting the warmup_update step, we get significantly faster run times, especially for NUTS. e.g. for test_beta_bernoulli, the run time goes down from 21s to 15s.
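As a rough sketch of the kind of change involved (the function name, signature, and adaptation rule below are illustrative only, not numpyro's actual API), jitting a warmup update means the per-step adaptation logic is traced and compiled once, and later calls hit XLA's compile cache:

```python
import jax
import jax.numpy as jnp

# Hypothetical warmup update -- the name and the crude adaptation rule are
# assumptions for illustration, not numpyro's actual implementation.
@jax.jit
def warmup_update(step_size, accept_prob, target=0.8):
    # Grow the step size when the acceptance probability is above target,
    # shrink it otherwise.
    return step_size * jnp.where(accept_prob > target, 1.02, 0.98)

step_size = 0.1
for accept_prob in [0.9, 0.7, 0.85]:
    # After the first call, this dispatches to already-compiled XLA code.
    step_size = warmup_update(step_size, accept_prob)
```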

I have also added a minimal version of @fehiepsi's #92 as an example, to test how this change fares on the benchmark and to make it easy to tweak things and see how the benchmark moves (which takes a few more steps in a Jupyter notebook). It seems that this is competitive for HMC, but NUTS is still quite slow. @fehiepsi - is it important to start with the step size and initial param values as in the notebook?

TODO:

  • Figure out why NUTS is still slow relative to the original benchmark.

@fehiepsi (Member) left a comment

Many thanks for making the benchmark script! We can keep the notebook for a while to compare different strategies; then we should remove it. If we make a script, it would be nice to store the benchmark results somewhere (e.g. in the wiki) so we can keep track of performance.

About the slowness of NUTS: if you set num_samples=100, it will take some time to finish because NUTS has a larger trajectory length than HMC. In my run (with the version in the notebook), it took 65322 leapfrog steps to get 100 samples, so it will basically take 65x longer than HMC. It took me 5m on GPU to get 100 NUTS samples (with HMC init params/step size), so on CPU I expect it will take more than half an hour. ^^!

# TODO: Remove with jax v0.1.26
@patch_dependency('jax.interpreters.partial_eval.trace_unwrapped_to_jaxpr', jax)
def _trace_unwrapped_to_jaxpr(fun, pvals, **kwargs):
    return pe.trace_to_jaxpr(lu.wrap_init(fun, kwargs), pvals)

+1

Resolved review threads on test/conftest.py, numpyro/mcmc.py, and numpyro/examples/covtype.py.
@neerajprad (Member Author)

We can keep the notebook for a while to compare different strategies; then we should remove it. If we make a script, it would be nice to store the benchmark results somewhere (e.g. in the wiki) so we can keep track of performance.

We can retain both. I think it's nice to have a notebook that gives more detail, but a script is handier for day-to-day comparisons.

About the slowness of NUTS: if you set num_samples=100, it will take some time to finish because NUTS has a larger trajectory length than HMC. In my run (with the version in the notebook), it took 65322 leapfrog steps to get 100 samples, so it will basically take 65x longer than HMC. It took me 5m on GPU to get 100 NUTS samples (with HMC init params/step size), so on CPU I expect it will take more than half an hour. ^^!

I see. From one of the runs, the time per leapfrog step was also quite a bit higher; let me run it for half an hour, then, and see what I get.

@neerajprad (Member Author)

it took 65322 leapfrog steps to get 100 samples

This seems a bit surprising. With the benchmark script, NUTS terminates within 1138 leapfrog steps, but each step takes around 0.1s, which is much higher than the 0.06s for HMC. Could you disable fast math mode and try the benchmark on your system?

@fehiepsi (Member) commented Apr 18, 2019

@neerajprad That's the run on GPU, where precision is much better than on CPU. On CPU with fast math disabled, NUTS on my system gives:

100%|██████████| 100/100 [01:08<00:00,  1.92it/s]

number of leapfrog steps: 1138
avg. time for each step : 0.06028635409678642

while in HMC,

100%|██████████| 100/100 [00:59<00:00,  1.68it/s]

number of leapfrog steps: 1000
avg. time for each step : 0.05941483545303345

If you disable fast math and plot the samples of the first coef, you will see that the HMC samples are constant at zero and the NUTS samples are highly correlated.
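For reproducibility, one common way to toggle XLA's CPU fast-math behavior is the XLA_FLAGS environment variable, set before JAX is first imported. The exact flag name is an assumption to check against your XLA version:

```python
import os

# The flag name below is an assumption; verify it against your XLA build.
# It must be set before jax is first imported, or it has no effect.
os.environ["XLA_FLAGS"] = "--xla_cpu_enable_fast_math=false"

# import jax  # import only after the flag is set
```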

@neerajprad (Member Author) commented Apr 18, 2019

It is interesting that you get the same number of leapfrog steps for NUTS, but your timings look better (0.06 vs. 0.1 sec per step). Just to confirm: this is with covtype.py, using the same initialization for both?

@fehiepsi (Member)

Yes, both with step_size = np.sqrt(0.5 / N) and init_params = {"coefs": np.zeros(dim)}.
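For concreteness, a sketch of that initialization. N is the number of rows in the covtype dataset (581,012); dim is assumed here to be 54, the number of covtype features, though the actual model may differ:

```python
import jax.numpy as np

N = 581012   # rows in the covtype dataset
dim = 54     # assumed number of coefficients (covtype has 54 features)

# The initialization under discussion: a small step size scaled by the
# dataset size, and all-zero regression coefficients.
step_size = np.sqrt(0.5 / N)
init_params = {"coefs": np.zeros(dim)}
```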

@fehiepsi (Member)

@neerajprad All the info above from my side is from modifications (step size, init params) of the notebook (which does not use tscan). I have not used your script.

@neerajprad (Member Author)

@neerajprad All the info above from my side is from modifications (step size, init params) of the notebook (which does not use tscan). I have not used your script.

Then the script seems to be slower than your notebook for NUTS; I'll need to take a look at why that's the case. Hopefully, if this (or a similar) approach works, we can have a fast implementation by default instead of relying on users to bypass warmup and compile sample_kernel.

@fehiepsi (Member)

I hope so too. Let me play around with your idea to see if we are missing something.

@neerajprad (Member Author)

Feel free to push to this PR itself; I might not be able to get to this until tomorrow.

@neerajprad (Member Author) commented Apr 19, 2019

@fehiepsi - With fast math disabled, I'm getting 0.1 sec per step for HMC (0.06 with fast math) and 0.09 for NUTS. I have also changed the initialization so that HMC is not just producing zeros, which may have been the reason for the faster 0.06 sec run time I was getting earlier.

I'm happy with the benchmark in that both HMC and NUTS have similar times. I haven't compared this with your original benchmark on my system, but I think it will be similar. All the benchmark numbers are uniformly better on your system, though (probably because you have a more powerful CPU). In any case, this is ready to merge unless you have further comments.

@fehiepsi (Member) left a comment

Looks great overall! I guess we still haven't solved the problem of compiling two times (#88)? I just have a few small comments:

Resolved review threads on numpyro/examples/covtype.py.
from numpyro.mcmc import hmc
from numpyro.util import tscan


Review comment (Member) - nit: lint error for the 5 blank lines here.

@fehiepsi (Member)

I'm happy with the benchmark in that both HMC and NUTS have similar times. I haven't compared this with your original benchmark on my system, but I think it will be similar. All the benchmark numbers are uniformly better on your system, though (probably because you have a more powerful CPU).

I just want to verify something to make sure we don't have different interpretations of the problem:

  • With the current init_params, the first num_steps of NUTS is 1, so the step_size=1. hack is not important. In addition, we run NUTS with num_samples=100, so the effect of the first sampling step is small. When the first num_steps is large, compile time contributes a lot to the benchmark. I think that is the reason you didn't use the NUTS init_params and step_size from the notebook's benchmark (NUTS would take a long time to finish its first step). If that is the case, I would prefer to keep init_params and step_size as in the notebook for further enhancement (in later PRs) and change back to the default init_params when the problems are resolved.
  • The following timings on my system (CPU, fastmath disabled) show that the first step is important:
HMC script (n=10): 122ms
HMC notebook (n=10): 61ms
HMC script (n=100): 64ms
HMC notebook (n=100): 61ms

@neerajprad (Member Author)

When the first num_steps is large, compile time contributes a lot to the benchmark. I think that is the reason you didn't use the NUTS init_params and step_size from the notebook's benchmark (NUTS would take a long time to finish its first step). If that is the case, I would prefer to keep init_params and step_size as in the notebook for further enhancement (in later PRs) and change back to the default init_params when the problems are resolved.

Good point. I didn't use init_params from the notebook because it's hard to reliably compare our benchmark against the paper due to a variety of system-level differences. I think we can use any reasonable initialization and run the benchmarks for the other implementations, so I tried to keep it simple. But if that is a particularly problematic initialization, let us use it for the time being and move to a simpler one later, when that problem is fixed.

@neerajprad (Member Author) commented Apr 19, 2019

@fehiepsi - let me address all your comments. I think it would be best to just use the notebook's initial params for both HMC and NUTS.

@neerajprad (Member Author)

Okay, I think this should address all your comments. I just had a couple of questions based on your last comment:

The following timings on my system (CPU, fastmath disabled) show that the first step is important.

Are these numbers with the earlier initialization of 0s or the one that is in the notebook? I changed to the one in the notebook, and the performance per leapfrog step remains the same. It is possible that the initial compilation time is large and gets amortized over the larger number of leapfrog steps. But if that's the case, I think we probably should not overly focus our efforts on fixed costs (like compiling the sample_kernel twice) that are likely to get amortized away on larger models / larger number of samples. We can probably rely on JAX fixing that over the longer term. What would be nice is if we are very competitive on large models (I'm not sure if we are there yet), even if we are 10 seconds slower on smaller models (this is important too, but it is kind of optimizing for the tail so early in the project). What do you think?
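The compile-vs-amortized split is easy to measure directly. In this sketch the kernel is a toy stand-in for a single sampler step, and block_until_ready() keeps JAX's async dispatch from skewing the timings:

```python
import time
import jax
import jax.numpy as jnp

@jax.jit
def kernel(x):
    # toy stand-in for one sample_kernel step
    return x - 0.1 * jnp.sin(x)

x = jnp.ones(10)

t0 = time.perf_counter()
kernel(x).block_until_ready()   # first call: trace + compile + run
first_call = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(100):
    x = kernel(x)                # later calls hit the compile cache
x.block_until_ready()
per_step = (time.perf_counter() - t0) / 100
```

On a small kernel like this, the first call is dominated by compilation, which is exactly the fixed cost that amortizes away over many leapfrog steps.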

@fehiepsi (Member) left a comment

LGTM. Could you address the following two nits?

Resolved review threads on numpyro/examples/covtype.py.
@fehiepsi (Member)

Are these numbers with the earlier initialization of 0s or the one that is in the notebook?

Those are numbers from init_params = {'coefs': random.normal(key=random.PRNGKey(0), shape=(dim,))}.

But if that's the case, I think we probably should not overly focus our efforts on fixed costs (like compiling the sample_kernel twice) that are likely to get amortized away on larger models / larger number of samples. We can probably rely on JAX fixing that over the longer term.

Yes, it would be great if this issue were fixed; then we would just need to jit sample_kernel in the hmc implementation and use it across the init and sample stages. But right now, compile time is a big problem:

  • when the first step of NUTS takes hundreds of leapfrog steps
  • when there are many sample statements in the model

To make a fair benchmark (while waiting for the issue to be fixed upstream), I think we can use the non-prim version for a while, with the trade-off that we have to concatenate samples without compiling. The other option is to be aggressive and make mcmc work as follows:

mcmc = MCMC(sample_kernel, state, num_samples=1000)  # build a jitted version of tscan here; num_samples, i.e. the length of `bs`, is known
mcmc.compile()  # run tscan with 1 step; we can also trigger the hack for fast compiling here
mcmc.run()  # run tscan with 1000 steps

But I don't like this idea because it is only suitable for benchmarking: we would have to recompile to draw another 1000 samples. So for the time being, I would stick with the non-prim version and wait for the issue to be fixed upstream. ^^
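The "jit a scan whose length is known" idea can be sketched with jax.lax.scan; the kernel here is a toy stand-in, not numpyro's sample_kernel. Because the length is baked in at trace time, drawing a different number of samples forces a recompile (the trade-off described above):

```python
import jax
import jax.numpy as jnp
from jax import lax

def sample_kernel(state, _):
    # toy stand-in kernel: decay the state toward zero
    new_state = 0.99 * state
    return new_state, new_state  # (carry, sample to collect)

@jax.jit
def run_chain(init_state):
    # length=1000 is fixed at trace time, so the whole sampling loop
    # compiles into one XLA program; changing the length recompiles.
    _, samples = lax.scan(sample_kernel, init_state, None, length=1000)
    return samples

samples = run_chain(jnp.ones(3))
```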

What would be nice is if we are very competitive on large models (I'm not sure if we are there yet), even if we are 10 seconds slower on smaller models

Yes, I also think so.

@fehiepsi fehiepsi merged commit 45eac0a into master Apr 19, 2019
@neerajprad neerajprad deleted the scan-fix branch November 19, 2019 19:05
Successfully merging this pull request may close these issues.

Make tscan jittable