sampler testing framework #318

bob-carpenter · 2013-10-23T18:18:39Z

We can't get new patches into samplers because there aren't any reliable tests.

We need tests for the samplers for

accuracy on means
accuracy on variances
speed regression tests

We also want to test things that Michael has suggested for HMC like

step size * 2 ^ tree_depth is in a range --- how often and what range?

We have to make all these sensitive to the fact that we have MCMC.
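
One way to make the "how often and what range?" question concrete is to compute, from recorded per-iteration diagnostics, the fraction of iterations whose product falls inside a candidate range. The sketch below is illustrative only: the function name, the argument layout, and the example range are assumptions, not part of Stan.

```python
def fraction_in_range(step_sizes, tree_depths, lower, upper):
    """Fraction of iterations with lower <= step_size * 2**tree_depth <= upper.

    step_sizes and tree_depths are hypothetical per-iteration records of the
    sampler's step size and tree depth; lower and upper are a candidate range.
    """
    if len(step_sizes) != len(tree_depths):
        raise ValueError("step_sizes and tree_depths must have the same length")
    if not step_sizes:
        raise ValueError("need at least one iteration")
    products = [eps * 2 ** depth for eps, depth in zip(step_sizes, tree_depths)]
    inside = sum(1 for value in products if lower <= value <= upper)
    return inside / len(products)


# Made-up diagnostics for four iterations, checked against a made-up range.
print(fraction_in_range([0.1, 0.1, 0.2, 0.05], [3, 5, 2, 4], 0.5, 1.0))
```

A test could then require that this fraction stay above some threshold, which still leaves the range and the threshold to be chosen per sampler.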

component testing for mcmc

betanalpha · 2013-10-23T20:45:37Z

> We can't get new patches into samplers because there aren't any reliable tests.
>
> We need tests for the samplers for
>
> • accuracy on means
> • accuracy on variances
> • speed regression tests
>
> We also want to test things that Michael has suggested for HMC like
>
> • step size * 2 ^ tree_depth is in a range --- how often and what range?
>
> We have to make all these sensitive to the fact that we have MCMC.

We have to be careful because, by construction, MCMC is stochastic and not exactly amenable to unit tests
as they are usually defined.

Mean/variance estimation:

Assuming a Monte Carlo CLT we'll still have to worry about the expected randomness. Running an ensemble
of tests and only requiring the expected number pass would help, but also make the tests much more demanding.

That said, iid gaussian and a correlated gaussian are natural first tests.
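
A minimal sketch of a CLT-based check on a mean, assuming the Monte Carlo standard error is estimated by batch means and the tolerance is a multiple `z` of it. Everything here is hypothetical: the AR(1) chain only stands in for sampler output with a known stationary mean of 0, and the function names, batch count, and `z = 4` are illustrative choices.

```python
import math
import random
import statistics


def batch_means_mcse(draws, n_batches=20):
    """Estimate the Monte Carlo standard error of the mean by batch means."""
    batch_size = len(draws) // n_batches
    if batch_size < 2:
        raise ValueError("too few draws for the requested number of batches")
    means = [
        statistics.fmean(draws[i * batch_size:(i + 1) * batch_size])
        for i in range(n_batches)
    ]
    # The standard deviation of batch means, scaled down by sqrt(n_batches),
    # approximates the standard error of the overall mean.
    return statistics.stdev(means) / math.sqrt(n_batches)


def mean_within_tolerance(draws, true_mean, z=4.0, n_batches=20):
    """True if the sample mean is within z estimated standard errors of true_mean."""
    usable = (len(draws) // n_batches) * n_batches
    used = draws[:usable]  # drop the remainder so all batches have equal size
    estimate = statistics.fmean(used)
    return abs(estimate - true_mean) <= z * batch_means_mcse(used, n_batches)


def ar1_chain(n, rho, rng):
    """Stand-in for sampler output: AR(1) chain with stationary N(0, 1)."""
    innovation_sd = math.sqrt(1.0 - rho ** 2)
    x = rng.gauss(0.0, 1.0)
    chain = []
    for _ in range(n):
        x = rho * x + innovation_sd * rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


draws = ar1_chain(20000, 0.5, random.Random(1234))
print(mean_within_tolerance(draws, true_mean=0.0))
```

Even with a correct sampler this check fails occasionally by design. A wider `z` lowers the false alarm rate at the cost of sensitivity.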

Adaptation:

Some distributions undercut the usual optimization criteria that we use for adaptation. Hierarchical models like
the funnel are a big example that we might want to test.

The interaction between the distributions and adaptation would require sampler-specific tests, not happy generic tests.
There are some exceptions -- the gaussians mentioned above are "linear" and about as easy to adapt to as possible.

Speed regression tests:

Depends on the machine running the tests, so we can't just define definite thresholds. Is it possible to build up the
testing framework to run examples using two different tags for comparison?
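
A rough sketch of what a relative comparison could look like, assuming the same example has been timed several times under each tag on the same machine. The function names, the 1.25 ratio, and the timings are made up for illustration, and collecting the timings (checking out each tag, building, running) is left out.

```python
import statistics


def slowdown_ratio(baseline_seconds, candidate_seconds):
    """Median candidate run time divided by median baseline run time."""
    return statistics.median(candidate_seconds) / statistics.median(baseline_seconds)


def is_speed_regression(baseline_seconds, candidate_seconds, max_ratio=1.25):
    """Flag a regression when the candidate is more than max_ratio times slower."""
    return slowdown_ratio(baseline_seconds, candidate_seconds) > max_ratio


# Hypothetical wall-clock timings (seconds) for the same model under two tags.
baseline = [10.2, 9.8, 10.0, 10.5, 9.9]
candidate = [13.1, 12.7, 13.0, 12.9, 13.4]
print(slowdown_ratio(baseline, candidate), is_speed_regression(baseline, candidate))
```

Because only the ratio matters, the threshold does not depend on how fast the test machine is, and medians damp the effect of an occasional slow run.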

bob-carpenter · 2013-10-23T21:22:57Z

On 10/23/13 4:45 PM, Michael Betancourt wrote:

> We can't get new patches into samplers because there aren't any reliable tests.
>
> We need tests for the samplers for
>
> • accuracy on means
> • accuracy on variances
> • speed regression tests
>
> We also want to test things that Michael has suggested for HMC like
>
> • step size * 2 ^ tree_depth is in a range --- how often and what range?
>
> We have to make all these sensitive to the fact that we have MCMC.
>
> We have to be careful because, by construction, MCMC is stochastic and not exactly amenable to unit tests
> as they are usually defined.

Right. That's why, for example, the RNG tests that Peter
wrote do a very large number of samples and then use a very
liberal threshold for a chi-square test. We have a classical
multiple testing problem where we want to control the false positive
rate.

This is similar to what Andrew calls the "Cook-Gelman-Rubin" approach.

> Mean/variance estimation:
>
> Assuming a Monte Carlo CLT we'll still have to worry about the expected randomness. Running an ensemble
> of tests and only requiring the expected number pass would help, but also make the tests much more demanding.

Right. That's what we're doing for the RNGs, but those
are much simpler to run multiple times.
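
To illustrate the "expected number pass" idea with a controlled false positive rate: if each of `n_tests` independent checks fails with probability `alpha` under a correct sampler, the failure count is binomial, and a cutoff can be chosen so the whole ensemble rarely raises a false alarm. This is a sketch under that independence assumption, not how the RNG tests are actually written. The names and default values are hypothetical.

```python
import math


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k + 1))


def max_allowed_failures(n_tests, alpha, overall_false_alarm=0.001):
    """Smallest k with P(failures > k) <= overall_false_alarm.

    Assumes each test fails independently with probability alpha when the
    sampler is correct.
    """
    for k in range(n_tests + 1):
        if 1.0 - binomial_cdf(k, n_tests, alpha) <= overall_false_alarm:
            return k
    return n_tests


def ensemble_passes(n_failures, n_tests, alpha, overall_false_alarm=0.001):
    """Accept the ensemble when the failure count does not exceed the cutoff."""
    return n_failures <= max_allowed_failures(n_tests, alpha, overall_false_alarm)


# Cutoff for 100 checks that each fail 5% of the time under a correct sampler.
print(max_allowed_failures(100, 0.05))
```

The cost is the one noted above: every check in the ensemble needs its own sampler run, which is far more expensive than drawing from an RNG.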

> That said, iid gaussian and a correlated gaussian are natural first tests.

We mostly want to have tests in place to make sure we didn't mess
anything up badly. Finer-grained performance testing can't be part
of our "unit testing" framework. (Though I do believe Jenkins currently
reports total time for all the tests in a browsable way, not that I've
ever browsed it.)

> Adaptation:
>
> Some distributions undercut the usual optimization criteria that we use for adaptation. Hierarchical models like
> the funnel are a big example that we might want to test.
>
> The interaction between the distributions and adaptation would require sampler-specific tests, not happy generic tests.
> There are some exceptions -- the gaussians mentioned above are "linear" and about as easy to adapt to as possible.

We already have tests that vary configuration (e.g., for number
of iterations) for different models.

> Speed regression tests:
>
> Depends on the machine running the tests, so we can't just define definite thresholds. Is it possible to build up the
> testing framework to run examples using two different tags for comparison?

I don't see why not. Daniel's a wizard with Jenkins.

For the foreseeable future, the machine running the tests will
be the Jenkins Windows box. Our latest grant proposal applied
for some more hardware for ongoing testing.

And we can test on our own machines.

Bob

betanalpha · 2013-10-23T21:42:59Z

It's not a matter of varying the parameters but figuring out how they need to be varied. Just warning that because of these
interactions it will be hard to have generic "sampler" tests instead of individually-tuned tests for each sampler.

On Oct 23, 2013, at 10:22 PM, Bob Carpenter notifications@github.com wrote:

> We already have tests that vary configuration (e.g., for number
> of iterations) for different models.

syclik · 2013-10-24T20:00:14Z

I'm ok with individually-tuned tests for each sampler.

betanalpha · 2015-03-17T15:48:45Z

Testing framework proposed in https://github.com/stan-dev/stan/tree/feature/stat_valid_test -- currently needs to be updated so that the tests can be run without depending on CmdStan.

syclik · 2016-11-30T14:34:58Z

@bob-carpenter, this is what we were talking about doing. This will depend on #1751, so I'll branch from there as I start working.

ghost assigned syclik Oct 23, 2013

syclik modified the milestones: Future, v2.3.0 May 15, 2014

bob-carpenter added the project label May 15, 2014

syclik removed the C++ API label Sep 19, 2014

betanalpha mentioned this issue Apr 29, 2015

Feature/issue 1426 ibeta derivs large args #1433

Merged

This was referenced Mar 12, 2019

sampler testing framework alashworth/test-issue-import#5

Open

sampler testing framework alashworth/test-issue-import#13

Open

sampler testing framework alashworth/test-issue-import#21

Open
