[WIP] Feature/issue 2224 allow abs state tols #2244

Closed
wants to merge 10 commits

Conversation

@wds15 (Contributor) commented Dec 7, 2020

Summary

This is to address performance issues with ODEs whose state variables have differing absolute tolerance needs. The absolute tolerance critically influences integration performance, and giving users more control over it should allow for faster ODE integration. This feature is supported by the CVODES integrators (bdf and Adams).

The implementation is based on overloading the ode_bdf_tol and ode_adams_tol calls. So far the absolute tolerance argument was a scalar only; now there is an additional signature available where the absolute tolerance is an Eigen vector. This vector must have the same length as the state vector and specifies, for each state, the absolute tolerance to use during integration.
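
For illustration, a sketch of how the two variants would look from the Stan language (rhs, theta, N, y0, t0, ts and the tolerance values are placeholders, not taken from this PR):

// existing signature: one scalar absolute tolerance for all states
vector[2] y1[N] = ode_bdf_tol(rhs, y0, t0, ts, 1e-6, 1e-8, 10000, theta);
// proposed overload: one absolute tolerance per state
vector[2] y2[N] = ode_bdf_tol(rhs, y0, t0, ts, 1e-6, [1e-8, 1e-10]', 10000, theta);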

To make this feature available to Stan users, the stanc3 parser needs an update as well (@rok-cesnovar could you look at this possibly?).

Tests

Tests were added for invalid input, and one case was added where a harmonic oscillator problem was integrated with different absolute tolerances per state.

Side Effects

None.

Release notes

Allow absolute tolerances to be specified for each ODE state for the ode_bdf_tol and ode_adams_tol integrators.

Checklist

  • Math issue BDF ODE integrator slow #2224

  • Copyright holder: Sebastian Weber

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@rok-cesnovar (Member) commented Dec 7, 2020

To make this feature available to Stan users, the stanc3 parser needs an update as well (@rok-cesnovar could you look at this possibly?).

Sure, no problem. This is a 3 minute job. Once this PR is in, just make a stanc3 issue and please also assign me.

EDIT: or if you need a version to test this in a model let me know and I can do it now.

@wds15 (Contributor, Author) commented Dec 7, 2020

If you could prep a branch for testing in stanc3 then that would be great. It's good for any reviewer here to know that things will work out upstream.

@wds15 (Contributor, Author) commented Dec 7, 2020

@bbbales2 maybe you are anyway going to review, but I would like to hear your opinion about passing absolute tolerances by state which is done via Eigen vectors. I was thinking about doing this via std vectors, but I thought that we handle state vectors of the ODE system as Eigen things and as such an Eigen vector makes sense. Would you agree with an Eigen vector or prefer a std vector for this?

@bbbales2 (Member) commented Dec 7, 2020

The new interfaces favor Eigen Vectors, and so that'd be my preference here.

As far as the feature goes, I'm just so disconnected from ODE modeling in my day to day that it's hard for me to think about how to evaluate these things. This looks like the type of thing you'd want if you had some variables get near zero but others stay away.

I saw the conversation on Discourse over the sensitivity tolerance option that we turned on a couple releases ago, which makes me instantly nervous about changing anything. This isn't necessarily changing anything though -- it's adding a new argument, which the way Stan works means adding a new function.

Would default arguments or named arguments at a stanc3 level get us anywhere with this? In that case we'd be more free to add new options and such without messing up the default experience or exploding the number of different ODE functions.

I don't know how difficult that would be, but the combinatorics of new ODE functions isn't gonna be fun at all, nor is changing people's defaults.

@stan-buildbot (Contributor)


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.36 3.55 0.94 -5.85% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.96 -4.49% slower
eight_schools/eight_schools.stan 0.12 0.11 1.03 2.69% faster
gp_regr/gp_regr.stan 0.17 0.16 1.03 3.13% faster
irt_2pl/irt_2pl.stan 5.74 5.76 1.0 -0.39% slower
performance.compilation 88.12 85.86 1.03 2.57% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.43 8.49 0.99 -0.79% slower
pkpd/one_comp_mm_elim_abs.stan 28.88 28.66 1.01 0.76% faster
sir/sir.stan 130.65 128.69 1.02 1.51% faster
gp_regr/gen_gp_data.stan 0.04 0.05 0.99 -0.9% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.92 2.92 1.0 0.11% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.4 0.42 0.94 -6.36% slower
arK/arK.stan 1.79 1.8 1.0 -0.44% slower
arma/arma.stan 0.75 0.73 1.03 2.52% faster
garch/garch.stan 0.61 0.61 1.0 0.06% faster
Mean result: 0.996918545656

Jenkins Console Log
Blue Ocean
Commit hash: f63b006


Machine information:
ProductName: Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@wds15 (Contributor, Author) commented Dec 8, 2020

This looks like the type of thing you'd want if you had some variables get near zero but others stay away.

This change allows setting the absolute error for each system state. Since absolute errors are only relevant for things which go down to zero, this change matters if you have an ODE system with multiple states tending towards zero which have different noise levels. An example is a pharmacokinetic / pharmacodynamic model which looks at drug concentrations and at effects of the drug on some clinical output. The concentrations and the clinical measurements live on different scales, and their noise levels usually differ.

I saw the conversation on Discourse over the sensitivity tolerance option that we turned on a couple releases ago, which makes me instantly nervous about changing anything. This isn't necessarily changing anything though -- it's adding a new argument, which the way Stan works means adding a new function.

Well, I tried to do a more thorough evaluation of the change. I think the study I did justifies our change.

This PR would not need to introduce a new function name - my suggestion here is to go with a simple overload of the ode_bdf_tol function: so far we only allowed a scalar for the absolute tolerance, and now we provide another overload where the absolute tolerance is an Eigen vector whose length equals the number of states.

I don't know how difficult that would be, but the combinatorics of new ODE functions isn't gonna be fun at all, nor is changing people's defaults.

Changing defaults is never nice, sure. To me the default call ode_bdf was a "please take care of things for me as well as possible". In that sense our change to the error evaluation strategy was ok - we improved it, as we are now closer to what we promise to the user... and most importantly the change made things work for Adams, which went from "does not converge" to "works fine".

In any case, this change here extends only the advanced use case of the ode functions. Named arguments at some point sound like a good thing, but for now everything is still fine. Maybe I am going to play with scaling factors at a later point, which would give the solver order-of-magnitude information about the parameter sensitivities to tune it even further, but let's go one step at a time.

If you could review the change that would be great, of course. I am not so sure who else could review it with limited effort, since you do know the structure of the code very well... and it's really not a massive change: the absolute tolerance becomes a vector, nothing more.

Once @rok-cesnovar has a prototype of the stanc3 parser for this I am happy to code up an example using this facility with a PK/PD example.

@bbbales2 (Member) commented Dec 8, 2020

my suggestion here is to go with a simple overload

Hmm, you're right that we do have overloads at the Stan language level. And that's only three additional functions to doc if we do each _tol interface like this.

I guess also can we do this for rk45?

Couldn't a user also work around this issue manually by doing:

vector f(real t, vector y_, vector scales) {
  // recover the original states from the rescaled ones
  vector[num_elements(y_)] y = scales .* y_;

  ... // compute dydt from y as usual

  // return the derivatives in the rescaled space
  return dydt ./ scales;
}
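
The integrator would then solve for the rescaled states, so the initial condition and the solutions have to be scaled accordingly; a hypothetical usage sketch (names and scale values are illustrative):

vector[2] scales = [1e-6, 1.0]';  // illustrative per-state scales
// solve in the rescaled space: divide the initial condition by the scales
vector[2] y[N] = ode_rk45_tol(f, y0 ./ scales, t0, ts, rel_tol, abs_tol, 10000, scales);
// undo the scaling before comparing to data
for (i in 1:N)
  y[i] = scales .* y[i];

With a scalar abs_tol on the rescaled states, this behaves like a per-state absolute tolerance of scales .* abs_tol on the original states.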

And the difficulty from a user perspective is all going to be in documentation anyway -- how we communicate what this is for and what the interface is, etc. We could just document how to do a scale transformation like this somewhere, and then that saves us from all the coding, testing, interface doc, etc. It can just be a user's guide/case study thing.

I think the study I did justifies our change.

At this point I'm basically willing to accept that there is a use case for any option CVODES includes in their library.

I thought the experiments you did were interesting and the plots were really good. However, it seems like with these ODE things it's fairly easy to find a counterexample to any attempt to establish Good Defaults, and so our practical options are less about justifying defaults and more about how we expose options in a way that isn't a ton of work.

@wds15 (Contributor, Author) commented Dec 8, 2020

So, yes, the approach you outline of rescaling stuff should be equivalent to setting the per-state tolerances. It's just that you force the user to change the problem formulation in a possibly awkward way that is prone to errors, I think. Just giving users the option to set tolerances differently by state is a lot easier than having to rescale the ODE integration process and then scale back when you compare things to the data.

I think that this change is easy enough for us to implement that we should do it. ODE problems are really a difficult thing and it's good to give users knobs which improve performance... and absolute tolerance can be such a knob depending on the problem.

The plots I did were very helpful for me. The counter-example given by Yi did not even meet his own criteria for what counts as a worthwhile experiment for a bdf integrator, but it's definitely true that there are no ubiquitous defaults.

So, yes, I think exposing options from CVODES is a sensible thing to do. These options are in that software for some reason. I do not quite see that we have to make firm recommendations in our doc about every option of ODE solving. The topic is way too complex... but we should provide at least some control for those users who have the problems where it's worth exploring these options.

It's possible to code this up for RK45 as well, but it's far more work, since odeint does not have the concept of a per-state absolute tolerance. One would have to code the error checking in a custom way completely from scratch, which could be easy or a lot of work (I don't know yet).

@bbbales2 (Member) commented Dec 8, 2020

scale back when you compare things to the data.

It's true that you have to scale the output and initial conditions too, and that is inconvenient.

Also, as much as I don't want to break symmetry in the solver interfaces (that seems confusing), we're certainly not going to go about duplicating in rk45 all the other CVODES behaviors as we enable them.

Vaguely this seems okay cause we can do it without modifying defaults via overloading. I'm gonna have to think on this some more.

But I guess a simple way of possibly avoiding another disagreement here is asking -- @yizhang-yiz -- vector absolute tolerance arguments, do you think exposing this is good or no (assuming we expose this in a way that does not affect current defaults)?

@bbbales2 (Member) commented Dec 8, 2020

I'm gonna have to think on this some more.

And by that I mean if I go silent here for a couple days, ping me. Just not sure I'm thinking all the way through the problem at this second.

@rok-cesnovar (Member)

Once @rok-cesnovar has a prototype of the stanc3 parser for this I am happy to code up an example using this facility with a PK/PD example.

https://github.com/stan-dev/stanc3/actions/runs/409006719

The signature compiles but the error messages were not updated. But for testing it will suffice.

@wds15 (Contributor, Author) commented Dec 13, 2020

@bbbales2 okdoki... I now used @rok-cesnovar's stanc3 binary together with the really cool new cmdstanr expose_stan_function feature to benchmark a very simple system. The ODE system is a 2-cmt oral dosing PK system coupled to a turnover PD. The Stan code below gives the equations.

I ran this ODE with fixed parameter values at a low (1E-4) and a high (1E-10) absolute tolerance and a relative tolerance of 1E-8. This evaluation is for the basic case without any sensitivities:

Unit: microseconds
      expr     min       lq     mean   median       uq      max neval
      high 430.055 480.4465 515.0107 506.0725 524.1485  852.905   500
      low1 418.729 467.1560 499.1071 493.1225 504.7150  835.634   500
      low2 395.812 444.2555 472.7079 467.3970 481.3845  692.243   500
      low3 381.697 427.0020 460.9625 450.7065 465.1390  853.172   500
      low4 180.049 198.4460 212.8404 209.5215 213.8555  380.342   500
    low_pd 402.840 450.7460 483.4028 474.3370 488.8185 1351.513   500
 high_pkpd 400.880 446.2325 478.1826 468.0600 481.8160  811.185   500

high has all 4 states at high precision; low1..low4 have state 1, states 1-2, ..., states 1-4 at the lower precision.
low_pd keeps the PD state at low precision but the PK system at high precision, and high_pkpd sets the observed states (2 and 4) to high precision and the rest to low precision.

Obviously, the biggest gain comes from setting the absolute tolerance for all states low, but the speed gains from setting things partially to a lower precision are considerable. In particular, the gains should be much larger once sensitivities are computed, as the lower precision for one state is propagated to the respective sensitivities.

To me this all looks quite good and an easy knob for users to get more speed with these nasty ODE problems.

functions {
  vector pk_2cmt(real time, vector a, real ka, real ke, real k12, real k21) {
    vector[3] dydt;
    dydt[1] = -ka * a[1];
    dydt[2] =     +ka * a[1] - ke * a[2] - k12 * a[2] + k21 * a[3];
    dydt[3] =     +k12 * a[2] - k21 * a[3];
    return dydt;
  }

  vector turnover(real time, vector pd, real kin, real kout) {
    vector[1] dydt;
    dydt[1] = kin - kout * pd[1];
    return dydt;
  }

  vector pkpd_rhs(real time, vector pkpd,
                  real ka, real ke, real k12, real k21,
                  real kin, real kout, real ea50) {
    vector[rows(pkpd)] dydt;
    real a_main = pkpd[2];
    dydt[1:3] = pk_2cmt(time, pkpd[1:3], ka, ke, k12, k21);
    dydt[4:4] = turnover(time, pkpd[4:4], kin, kout * (1.0 - a_main / (ea50 + a_main)));
    return dydt;
  }

  vector transform_2cmt_cl_rate(real Cl, real Q, real V1, real V2) {
    real k12 = Q / V1;
    return [
        Cl / V1,      // ke
        k12,          // k12
        k12 * V1 / V2 // k21
            ]';
  }

  vector[] pk_simulate(real t0, vector a0, real[] ts, real rel_tol, vector abs_tol, real ka, real Cl, real Q, real V1, real V2) {
    int num_sim = num_elements(ts);
    vector[3] rates = transform_2cmt_cl_rate(Cl, Q, V1, V2);
    real ke = rates[1];
    real k12 = rates[2];
    real k21 = rates[3];
    vector[3] y[num_sim] = ode_bdf_tol(pk_2cmt, a0, t0, ts, rel_tol, abs_tol, 10000, ka, ke, k12, k21);
    for(i in 1:num_sim) {
      y[i,2] = y[i,2] / V1;
      y[i,3] = y[i,3] / V2;
    }
    return y;
  }

  vector[] pkpd_simulate(real t0, vector a0, real[] ts, real rel_tol, vector abs_tol,
                         real ka, real Cl, real Q, real V1, real V2,
                         real kin, real kout, real ec50) {
    int num_sim = num_elements(ts);
    vector[3] rates = transform_2cmt_cl_rate(Cl, Q, V1, V2);
    real ke = rates[1];
    real k12 = rates[2];
    real k21 = rates[3];
    vector[4] y[num_sim] = ode_bdf_tol(pkpd_rhs, a0, t0, ts, rel_tol, abs_tol, 10000,
                                       ka, ke, k12, k21,
                                       kin, kout, ec50 * V1);
    for(i in 1:num_sim) {
      y[i,2] = y[i,2] / V1;
      y[i,3] = y[i,3] / V2;
    }
    return y;
  }

}

@wds15 (Contributor, Author) commented Dec 13, 2020

So I wanted to see this in action for an actual model. I turned the above example into a full Stan model which simulates fake data from the above parameters and then fits the problem. This triggers all sensitivities to be calculated. The relative tolerance is always 1E-6 and I set up 3 scenarios for the absolute tolerance per state (the snippet after the list shows how the mid scenario is passed):

  • high: all at 1E-10
  • mid: 1E-6, 1E-10, 1E-6, 1E-6; in this scenario only the concentration in the central compartment gets the higher 1E-10 absolute precision, since the other PK compartments are latent and the PD compartment is a lot noisier than the PK compartments.
  • low: all at 1E-6
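
For concreteness, the mid scenario corresponds to a call like this (a sketch reusing the pkpd_rhs function from my previous comment):

vector[4] abs_tol_mid = [1e-6, 1e-10, 1e-6, 1e-6]';
vector[4] y[num_sim] = ode_bdf_tol(pkpd_rhs, a0, t0, ts, 1e-6, abs_tol_mid, 10000,
                                   ka, ke, k12, k21, kin, kout, ec50 * V1);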

Timings for 500 warmup + 500 iterations are this:

high               81.374 seconds (Total)
mid                54.022 seconds (Total)
low                36.836 seconds (Total)

So I think the approach is useful and gives users more control over the precision, which obviously has a huge impact on speed. The Stan results do agree closely in this case.

@yizhang-yiz (Contributor) commented Dec 13, 2020

But I guess a simple way of possibly avoiding another disagreement here is asking -- @yizhang-yiz -- vector absolute tolerance arguments, do you think exposing this is good or no (assuming we expose this in a way that does not affect current defaults)?

I'm all for vector abstol. My questions:

  • How do we convey to users when to use what? Does the above pkpd model indicate that abstol affects speed in general, and if so, in which way? Is speed the message we'd like to pass to users regarding this feature? How do we suggest users apply it to their own problems? These questions have been asked already for uniform abstol and we have yet to address them. With vector abstol the answers will be more involved. With no informative documentation the claimed benefit becomes a myth.
  • Perhaps more importantly, when will vector abstol not benefit users, if ever? Is there a problem where it doesn't help, or where it even makes things worse if one particular state has a lower abstol? Or in general, when does abstol (vector or not) stop affecting solution performance?
  • Why leave rk45 behind? That's the most popular integrator.
  • What benchmark problem do we have so far that could separate bdf from rk45 in performance, so we can further improve speed by using vector abstol? That is, if we have a problem where rk45 with a single low tolerance is faster than bdf with a vector tolerance, what message do we pass to users?

@wds15 (Contributor, Author) commented Dec 14, 2020

Ok, great, then we have agreement that we want this PR in.

To your questions, @yizhang-yiz:

  1. How to set absolute tolerances is very problem specific. The CVODES manual gives some general recommendations, but refrains from being too specific with the argument that this is all super problem specific. I do not think we need to, nor can, go beyond that. Lowering the absolute tolerance targets leads to speedups but comes at a risk, of course. The general advice I would give is to apply this feature such that different precision targets can be taken advantage of as in the example: There is a PK and a PD system. Both can have separate targets due to the fact that different things are measured with different noise levels. Furthermore, there are latent states from the PK system which do not enter the log-likelihood. These are reasons to run with different tolerances. How to set them? That's problem specific.
  2. I would not advise a vector abs tol as a default. That is really for fine tuning things to get as much speed as you can with a compromise in precision targets. This makes this even more problem specific, and as such even less advice can be given to users in general. My suggestion is actually to take advantage of the CVODES manuals and refer Stan users to their documentation. That's efficient for us, and I do not think that we can do a better job at documenting this anyway.
  3. I would like to have the feature for RK45, of course. The problem is that odeint from boost does not support it and I am not willing to invest the work in doing that. I looked at the sources and it wasn't apparent to me how to do it quickly. My suggestion is to file an issue with odeint and wait to see what the author does. He has been very collaborative in the past and maybe the above benchmark convinces him of such a feature. In addition we file an issue for Stan-math which tracks the status. As long as the base integrator does not provide this functionality, I do not think we should invest our resources into retrofitting it. Also... it's anyway going to be a bit more work with odeint, which does not have a notion of what a sensitivity equation is.
  4. Preferring a bdf with a vector absolute tolerance over an rk45 without one - I don't think we need to go there. One library supports the feature, the other does not. This is an advanced feature and users need to settle this on their own - again because things are problem specific.

Still, the above example gives one case where this feature is useful. The example is chosen so that it should generalise to a number of other problems, as I said (different PK and PD systems; latent states may need less precision). Given we have one case here where the feature shows its utility, we can move ahead and merge.

@bbbales2 bump?!

@bbbales2 (Member)

then we have agreement that we want this PR in.

We have agreement that vector absolute tolerances are something we want.

These questions have been asked already for uniform abstol and we have yet to address them.

I think we should try to doc this, especially since we're expanding it. The user's guide is an appropriate place, and the example you gave is probably good for the docs.

The CVODES manual gives some general recommendations, but refrains from being too specific with the argument that this is all super problem specific.

Well we should be able to say things without overstating our confidence or generalizing too far. Like we have to make all of these decisions ourselves, so what are we thinking?

  1. The reason abstol is important is for problems where the solution goes near zero and reltol fails.

  2. There are certain problems where a vector abstol is useful: it is sometimes possible to take advantage of it to get higher performance when only some states need a very low abstol.

I would like to have the feature for RK45, of course

Can we rescale all the states internally and get this behavior? Or will this mess up the relative tolerances? We have the eq. for the RK45 tolerances: https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.RK45.html

file an issue with odeint and wait to see what the author does

Go ahead and ask. If they're willing to implement such a thing, that makes our lives easier.

--

Yeah, I can do a code review of what is here, but I'd like to have an idea of how RK45 is going to work out.

@yizhang-yiz (Contributor) commented Dec 14, 2020

Ok, great, then we have agreement that we want this PR in.

No. What I said is that I agree we want the vector-abstol feature. I can say little about this PR without looking into the code.

Lowering the absolute tolerance targets leads to speedups but comes at a risk, of course.

I'm assuming by "lowering" you meant "increasing" but still don't quite understand what you mean. Could you please explain?

The general advice I would give is to apply this feature such that different precision targets can be taken advantage of as in the example: There is a PK and a PD system.

In that case I suggest we add this model in this PR for future reference, so that when people ask about it, instead of simply saying "go dig up the PR", we can point to the model and tests. In particular, I'd suggest adding the model and documenting the fact that it

can have separate targets due to the fact that different things are measured with different noise levels.

because what's emphasized earlier was speed, while "noise level" indicates accuracy.

Furthermore, there are latent states from the PK system which do not enter the log-likelihood.

Are latent states of an ODE system always allowed to be less accurate than the states added to the likelihood?

I would not advise a vector abs tol as a default. That is really for fine tuning things to get as much speed as you can with a compromise in precision targets. This makes this even more problem specific, and as such even less advice can be given to users in general. My suggestion is actually to take advantage of the CVODES manuals and refer Stan users to their documentation. That's efficient for us, and I do not think that we can do a better job at documenting this anyway.

You're addressing it only in general terms, as this doesn't answer my second question. If there's a situation where vector-abstol doesn't help, or in general where decreasing abstol doesn't help, we should add it to this PR too. Such a problem renders the above PKPD model argument less powerful, and can serve as an antidote. I believe so far all our ODE development has been relying on a handful of models and we're eager to demo "it works!". To show/document "when it shouldn't work" is just as important. Bugs can exist in both ways.

As long as the base integrator does not provide this functionality, I do not think we should invest our resources into retrofitting it. Also... it's anyway going to be a bit more work with odeint, which does not have a notion of what a sensitivity equation is.

This statement is self-contradictory. odeint doesn't provide sensitivity analysis, but we still did it. Had we stayed with "as long as the base integrator does not provide this functionality", there wouldn't be rk45, not to mention continuing to improve it.

Preferring a bdf with a vector absolute tolerance over an rk45 without one - I don't think we need to go there.

Why not? Is that PKPD model believed to be stiff? If not, why do we want to suggest users use the bdf integrator, as we clearly indicate in the user guide it's "for stiff systems"? If rk45 is more efficient with a single lower abstol, the above PKPD model becomes more or less artificial and loses some of its speedup claim power.

@wds15 (Contributor, Author) commented Dec 14, 2020

@bbbales2

Sure, we can add the intuition behind my example here to the docs. That makes sense. Still, I want to link to the CVODES documentation as the main source for this. The CVODES folks are the experts on ODEs; I am not, in comparison.

I raised an issue with odeint, but that is going to take a while, if it happens at all. Are you fine with filing an issue for RK45 absolute tolerance vectors in stan-math for now and postponing the feature for RK45 for the moment?

I looked into how the errors are calculated; the CVODES docs say that they use in their error norms the term

|y_i| * rel_error + abs_error_i

And hence scaling is not a good idea, but shifting y may do it... though I find it very messy to shift/unshift states back and forth. I suggest we wait for the odeint author to answer and take it from there. In any case, my suggestion is to move forward here now and enable what CVODES offers.
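
(For reference, the full error test in the CVODES user guide is a weighted RMS norm built from exactly this term; a LaTeX sketch:

W_i = \frac{1}{\mathrm{rtol}\,|y_i| + \mathrm{atol}_i}, \qquad
\|e\|_{\mathrm{WRMS}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (W_i e_i)^2} \le 1

so a per-state atol_i enters each weight directly.)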

Does that sound like a plan?

To answer @yizhang-yiz's questions:

No. What I said is that I agree we want the vector-abstol feature. I can say little about this PR without looking into the code.

Your input was asked on whether you agree with the feature, which you do, not about the code.

Are latent states of an ODE system always allowed to be less accurate than the states added to the likelihood?

Latent states only need to be as accurate as necessary to give sufficient accuracy for what is being put into the log-likelihood. I have seen the ESS improve whenever ODE integrators are asked to spit out greater accuracy... but I will not make any claim here about this, nor plan to benchmark it. If in doubt, this is again very problem specific and it's extremely difficult to say what gives the better ESS per time. A lot of work is needed to tease this out and that is not needed for a PR.

I believe so far all our ODE development has been relying on a handful of models and we're eager to demo "it works!". To show/document "when it shouldn't work" is just as important as part of tests/benchmarks. Bugs can exist in both ways.

I don't think we need to raise the bar for including things. If you have an example to share which you want to contribute, please do so. Other than that, the reasoning to include these changes is that "CVODES implements it for a reason" and "here is an example, which should generalize to some extent, where it's useful".

I am happy to include this example model in some form...but someone would need to tell me how. We have never done this.

odeint doesn't provide sensitivity analysis, but we still did it.

CVODES knows that we are calculating sensitivities of the ODE solution. It knows what we are structurally doing. odeint, on the contrary, is just given a huge coupled ODE system when we ask for sensitivities. That's the difference. Since CVODES knows the structure, it will do the right thing for the sensitivities depending on the state absolute tolerances.

Why not? Is that PKPD model believed to be stiff? If not, why do we want to suggest users use the bdf integrator, as we clearly indicate in the user guide it's "for stiff systems"? If rk45 is more efficient with a single lower abstol, the above PKPD model becomes more or less artificial and loses some of its speedup claim power.

Things are too problem specific. In principle RK45 should be able to handle an oral 2cmt system coupled to a turn-over model just fine... but for Stan this is not always the case, as I have seen. When you use weak priors, Stan will kill an RK45 integrator during warmup and things will crawl until you reach sampling (provided the data you fit is informative, sampling will run OK-ish). So doing warmup with bdf and then switching to RK45 during sampling is the best thing to do, as @charlesm93 showed a while ago. However, this is way too intricate to detail to users, nor is it settled enough that I would document it in our manuals. Still, bdf is useful for this system and RK45 is also useful (with rather informative priors). It's too complicated for my taste, such that I do not think I would be able to write down anything useful in this regard. These choices are hard and users need to find their own way through their problems.

@bbbales2 (Member)

Still, I want to link to the CVODES documentation as the main source for this.

Yeah, I am happy to provide this upfront and recommend it for details.

I am happy to include this example model in some form...but someone would need to tell me how.

Want to try writing a User's Guide section for this? That way @yizhang-yiz can suggest changes directly to it, and we can iterate on exactly what needs to be said to make it clear how/when to use this feature.

That will give us some time to hear back from odeint people on rk45 too. @yizhang-yiz, do you have any ideas on doing this for rk45?

@yizhang-yiz (Contributor) commented Dec 16, 2020

Higher precision solutions near zero
Does not affect solutions away from zero (reltol is dominant)

If that's what the physical phenomenon behind the ODE requires, then yes.

Higher computation requirement

Not necessarily, as shown in my example. I can also show an example where the computation cost remains the same across different abstol. But my point is that there's not much definitive we can say about computation.

Can create instabilities due to zero crossings

Yes. Even if not, the solution may be polluted. In the above 3-state 1e-10 vs 1e-12 example, if we keep solving state 2 when it reaches 1e-10 (its presumed physical scale), even before it crosses zero its solution is useless because it's no longer at its right scale.

Solutions:
If your problem is computational, you can try vector abstol
If your problem is instabilities, you can try vector abstol, or you can try log transform

"computation" & "instability" are intertwined. vector abstol is the solution only when user believe states have different noise level. This is what cvodes document suggests, and it says nothing about computation efficiency(I don't want to give you bad news but if you go back to textbook you'll find out even retol + vector atol cannot guarantee global accuracy).

@bbbales2 (Member)

even reltol + vector abstol cannot guarantee global accuracy

Well yeah, but there is something we can say. The same things happen with statistics in general, all models are wrong, etc. etc.

Alright so I'll update my list:

Pros of lower abstol:

  1. Higher precision solutions near zero
  2. Does not affect solutions away from zero (reltol is dominant)
  3. Can lead to lower computation cost if X [I don't know what to say here]

And cons:

  1. Can lead to higher computation cost if extra precision is unnecessary
  2. Can create instabilities due to zero crossings

Solutions for if you tried abstol and hit one of the cons:

  1. If your problem is computational, you can try vector abstol
  2. If your problem is instabilities, you can try vector abstol, or you can try log transform

vector abstol is the solution only when the user believes states have different noise levels

What does noise level mean in an ODE solution? I'm assuming there's no data/inverse problem/statistics happening.

@bbbales2 (Member)

Also I tried to make a list without computational cost, but both you and Sebastian have provided examples now where computational cost is a big component of what happened.

Different things are happening in those two problems and we don't get to say something simple, but there should be something there that gives people a heads up.

@yizhang-yiz (Contributor)

What does noise level mean in an ODE solution?

It roughly means a threshold below which the state is not observable in reality, and its effect can be safely ignored without compromising the computation of states that depend on or are coupled with it.

Different things are happening in those two problems and we don't get to say something simple, but there should be something there that gives people a heads up.

Which is why I said efficiency is a byproduct, not the purpose, of tolerance, and users should focus on the accuracy of their solution and do whatever that requires. What we can say is:

  • accuracy is important for Stan to get correct and efficient warmup
  • one controls accuracy through tolerance
  • tolerance should be set according to mathematical & physical understanding and numerical testing
  • adjusting tolerance may affect model running time
  • setting abstol too low may cause the solver to spend too much time near zero, below the noise level that's meaningful to the problem

@bbbales2 (Member)

users should focus on the accuracy of their solution and do whatever that requires

Just taking a pause from abstol -- will get back to this tomorrow, but is reltol this finicky?

Or is it as simple as, more precision -> more computation

@yizhang-yiz (Contributor) commented Dec 17, 2020

@bbbales2

Or is it as simple as, more precision -> more computation

The story here is similar: there is no definitive map between rtol & speed. If you expect lower rtol to lead to more computation, here's a counterexample.

functions {
  vector ode(real t, vector y) {
    vector[2] dydt;
    dydt[1] = 1.0 + sin(y[2] - y[1]);
    dydt[2] = 1.5 + sin(y[1] - y[2]);
    return dydt;
  }
}
data {
  // declarations implied by the snippet: output times and tolerances
  int n;
  real t[n];
  real rtol;
  real atol;
}
transformed parameters {
  vector[2] y0 = [3., 0.]';
  vector[2] y[n] = ode_rk45_tol(ode, y0, 0.0, t, rtol, atol, 100000);
}

The model doesn't do any fitting, so I ran 1000 fixed-param iterations to make the timing more obvious. Since the solution at the end is at a scale of 1.e-1 to 1.e+4, I fix atol=1.e-20 so it has no effect. Now solve the ODE with t={0.1, 100, 20000}:

rtol    wall time (s)
1e-4    2.865
1e-6    8.373
1e-8    10.585
1e-10   10.596
1e-12   10.087
1e-15   8.955

@bbbales2 (Member)

no definitive map between rtol & speed

Huh. I always thought the basic ODE tradeoff was speed vs. accuracy where tolerance was a proxy for accuracy.

So let me extend the list to include reltol:

Pros of lower reltol:

  1. High precision solutions away from zero (abstol more important near zero)

And cons:

  1. Sometimes you're doing too much work. If you don't need a solution to a certain precision, then there is no need to lower the tolerance. Caveat: sometimes lowering reltol will actually improve performance, and it is difficult to predict this.

Pros of lower abstol:

  1. Higher precision solutions near zero
  2. Does not affect solutions away from zero (reltol is dominant)

And cons:

  1. Same con as reltol, but with reltol replaced with abstol
  2. Can create instabilities due to zero crossings

Solutions for if you tried abstol and hit one of the cons:

  1. If your problem is computational, you can try vector abstol
  2. If your problem is instabilities, you can try vector abstol, or you can try log transform

--

I don't want to get rid of the performance component yet, because it is kinda dramatic. Like, by this "tolerance is all that matters" argument, we could flip a lot of switches and be conservative about things and end up with really slow solvers.

But is it fair to say that the method is the most important thing in terms of performance? And then the tolerance is a smaller knob that you can jiggle a bit if you want?

@yizhang-yiz (Contributor) commented Dec 17, 2020

I like the saying "tolerance was a proxy for accuracy" but believe the "speed-accuracy tradeoff" framing is dangerous. In addition to the counterexamples showing there's no definitive relation between the two, many users will not hesitate to sacrifice accuracy in pursuit of speed (I've committed the same sin because it's so attractive) with the documentation's blessing, and are extremely reluctant to put thought into tolerance in the context of the problem. So let me repeat: tolerance is a tool for the sake of accuracy, not speed tuning, and we need to test that and document that. The rest is just background noise.

I want to like your list but it's still missing something (if it's intended for documentation, and if so, maybe we should move the discussion there). There are mainly two factors that determine a solver's stepsize & order: accuracy & stability. If stability dominates, the cost is nearly unaffected by tolerance: one can use rk45 with a crude tolerance on a stiff problem and it's still gonna cost more than bdf with a fine tolerance, and rk45's cost can remain basically the same across different orders of tolerance. This is another circumstance where tuning tolerance for speed's sake is not "totally obvious and trivial". For that I give one last example, based on the ODE I showed earlier. A user unknowingly using rk45 on this stiff problem will get wall times akin to the following (here I ran only one subject among the 8 subjects from the original post, with fixed params and y0=[1, 0, 0]', integrating from 0 to 4e+7):

rtol=atol       1.e-6     1.e-8     1.e-10    1.e-12
wall time (s)   7.325     7.919     7.078     7.118

See? Tolerance has no power here. And the user can be rightly assured that each run has reached the accuracy he requested, because that's the purpose of tolerance. Had he believed he could get better speed by tuning atol, he'd be grossly confused.

You may think our bdf is free from this kind of thing but it's not. It's not a silver bullet for stiff problems. But I think that's a topic for another day.

@bbbales2 (Member)

if it's intended for documentation, and if so, maybe we should move the discussion there

Yes, probably. I agree at this point about the no-simple-relationship thing.

Obviously it's still valuable to change your tolerances if you find a set that gives a suitably accurate answer and the solve runs faster, but the predictive power of raise-tolerance-get-speed isn't gonna be great (cause there's so much else going on).

Gonna take a pause on this for a bit though. I think we should try to write something for the User's Guide on why the speed-vs-accuracy tradeoff framing is misleading.

For this I guess the question that remains is odeint. You think you can get to that in the not-too-distant future (before 2.26, which feature freezes on January 13th)?

@jtimonen commented Dec 18, 2020

I like that there is discussion about this. I have also talked about a speed-accuracy tradeoff, originating from the fact that when doing a single ODE solve with a fixed-step explicit solver, there is theory showing that a smaller step size means more accuracy, and more steps need to be taken, so more computation. But it is a good point that maybe it is bad to use the term tradeoff with adaptive solvers, because the total number of steps that need to be taken depends on the curvature of the solution and how that changes over the time course.
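
To spell out that fixed-step argument in symbols (a sketch for an explicit method of order p, step size h, over a time span T, with target error \varepsilon):

\varepsilon \approx C\,h^{p}, \qquad N_{\mathrm{steps}} = \frac{T}{h} \quad\Rightarrow\quad \mathrm{cost} \propto N_{\mathrm{steps}} \propto \varepsilon^{-1/p}

so for a fixed-step method a smaller target error directly means more steps; with adaptive solvers this link is much looser, as noted above.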

Then the MCMC aspect to it. Even if some relation between control parameters and speed exists for a single ODE solution, it might not be appropriate to say that when applied with HMC, because the accuracy will affect the accuracy of gradients, log posterior evaluations, and thus things that can cause a change in the overall speed of fitting a model.

Edit: For example, I imagine that when changing the tolerance (either way), the sampler can be allowed into a new parameter region where the step size needs to be adapted to be very small, and therefore cause a dramatic change in speed.

@yizhang-yiz (Contributor) commented Dec 19, 2020

@bbbales2

Gonna take a pause on this for a bit though. I think we should try to write something for the User's Guide on why the speed-vs-accuracy tradeoff framing is misleading.

For this I guess the question that remains is odeint. You think you can get to that in the not-too-distant future (before 2.26, which feature freezes on January 13th)?

I couldn't agree more that we should fix the user guide. In fact, I'd like to have that first, before this feature. If this one doesn't make it into 2.26, the documentation certainly can.

@jtimonen
You make a great point. At the ODE level, adaptive stepsize & order play an important role in decoupling speed from tolerance, and at the sampler level that relationship gets even more unclear. This is why I think accuracy is the message we should pass on: correct sampling relies on accuracy, and for that there is a dependable knob the user can turn, which is tolerance.

@stan-buildbot (Contributor)


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.42 3.46 0.99 -1.16% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 1.04 4.23% faster
eight_schools/eight_schools.stan 0.12 0.11 1.02 1.63% faster
gp_regr/gp_regr.stan 0.15 0.15 1.01 0.74% faster
irt_2pl/irt_2pl.stan 5.18 5.18 1.0 -0.1% slower
performance.compilation 92.64 89.84 1.03 3.03% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.67 8.75 0.99 -0.97% slower
pkpd/one_comp_mm_elim_abs.stan 28.41 29.69 0.96 -4.51% slower
sir/sir.stan 133.32 131.22 1.02 1.57% faster
gp_regr/gen_gp_data.stan 0.04 0.04 0.99 -0.67% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 3.05 3.04 1.0 0.26% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.38 0.4 0.95 -5.0% slower
arK/arK.stan 2.53 2.51 1.01 0.54% faster
arma/arma.stan 0.61 0.61 1.0 0.45% faster
garch/garch.stan 0.67 0.67 1.0 0.17% faster
Mean result: 1.00066015254

Jenkins Console Log
Blue Ocean
Commit hash: f63b006


Machine information:
ProductName: Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@wds15 (Contributor, Author) commented Jan 10, 2021

I am not aware of any progress on the ODE doc, nor have there been new commits to this PR for now more than 3 weeks. As such, I propose to move forward with this PR in its current form. The rationale is that

  • we are exposing a useful feature of CVODES
  • the tests included in this PR do test that a vector absolute tolerance can be passed to our functions
  • we do not have any tests so far which actually test any tolerance guarantees by our integrators. We only test that we get sufficiently accurate results compared to an analytic solution.
  • we can rely on CVODES having done their homework; we just expose the functionality as provided by CVODES, which is CVODES' business to test

If that makes sense to others, then we should proceed and merge the PR - then we can look forward to having this available in 2.26, which is around the corner, and that would be great.

@bbbales2 (Member)

Well, three weeks ago was the time to plan this for going into 2.26. I don't mind rebooting this, but the reason this stopped moving forward was that it dropped as a priority (#2244 (comment)).

I get the argument that we can just add this feature cause it's CVODES and presumably it's there for a reason, but I don't like the asymmetry of the interfaces, and I don't like the problems we have even describing what this feature is and how it's useful.

I honestly don't have time to try to resolve this stuff in three days, so I want to wait on this.

@wds15 (Contributor, Author) commented Jan 10, 2021

There is anyway going to be asymmetry between the solvers once we introduce adjoint. In the adjoint discussion we also agreed on saying in the doc that things are difficult in general, which also applies here given the discussion.

Anyway, whoever wants to take this up can do so of course.

@wds15 closed this Jan 10, 2021
@bbbales2 (Member)

Hmm, I don't think it's a good resolution to close this. We all agreed it was probably useful; we disagreed on the way forward.

I'm down to do these changes, but I wanna work out the RK45/documentation problem. If you disagree with this, call a vote. We're supposed to use those when we get deadlocked like this. It's not a big deal. Maybe everyone else that looks at this thinks we should just merge.

Anyway, apologies for slowing this down. It wasn't my intention to stop it though (take as evidence all the conversations we've had so far about this feature -- if I didn't think it was worth doing, I'd like to think I would have told you earlier). This isn't the highest thing on my priority list though, so when you decided not to spend much time on this a few weeks ago, I happily mirrored that. We can certainly knock the dust off, but not in three days.

@wds15 (Contributor, Author) commented Jan 11, 2021

My apologies, but it was not clear to me that you wanted to work on this. However, the code in here is by now not even feature complete anymore, given that RK45 support is deemed a must-have.

This PR is at best a draft PR by now, so as not to waste too many resources. I do not mean to stop this activity by closing the PR. It just reflects that no one is working on it and things have anyway changed too much. So please feel free to just reopen this PR.

I am personally concentrating on the adjoint thing (really cool that you got that started!), which promises to make much larger problems feasible for Stan due to the much better scaling properties of adjoint compared to forward mode.

@bbbales2 (Member)

It just reflects that no one is working on it

Yeah, that's true. I'm gonna reopen this and leave it, cause it has some unfinished code/commentary and there's really no better place to put it (even if this isn't the cleanest way to record this).

Anyhow we can always close it in the future.

@bbbales2 reopened this Jan 11, 2021
@bbbales2 marked this pull request as draft January 11, 2021 13:20
@bbbales2 changed the title from "Feature/issue 2224 allow abs state tols" to "[WIP] Feature/issue 2224 allow abs state tols" Jan 11, 2021
@yizhang-yiz (Contributor)

Sorry guys, I want to apologize for not following up in a couple of weeks. I took a 2-week time-off around the holidays and last week got caught up in an SGB crash course. To echo @bbbales2's points, I believe we need to address the rk45 & doc gaps; in addition, it's also agreed among some of us that more tests are needed to test the effect of tolerance on accuracy. For rk45 I'm moving Torsten toward using ARKode instead of boost::odeint, and can follow up later when it's ready. Vector tol is not my main motivation for using ARKode there but does come out as a trivial by-product. For documentation there's been some discussion, though I haven't had time to respond. So I wouldn't say everything is in limbo, but it may just take a while since it's not a priority.

@yizhang-yiz mentioned this pull request Feb 3, 2021
@bbbales2 (Member)

Alright, I haven't gotten around to doing anything here for another month and a half. Does anyone object to me closing this down? We can re-open a new pull request in the future.

@bbbales2 closed this Feb 25, 2021
@wds15 (Contributor, Author) commented Apr 26, 2021

@betanalpha did you overlook this PR? I was lobbying hard for per-state absolute tolerances to be added to our CVODES solvers, but I was not successful due to too much resistance from others. I am not sure you are aware of this episode.

@betanalpha (Contributor)

@wds15 Yes, I agree that the vector tolerances in the BDF integrator should be exposed for the same reason all of the tolerances should have been exposed in the adjoint solver -- to allow advanced users to experiment with the integrator configurations on realistic problems to develop better understanding of what configurations are robust enough for use with Markov chain Monte Carlo. We should not expect to know how to optimally configure the integrators before we expose those configurations; that is an incredibly challenging problem that is not going to be solved in a GitHub thread. We have no idea how good the default configurations actually are, and we cannot even explore that question until the configurations are exposed to as many advanced users as possible for experimentation. Maintaining arbitrary defaults is not a "safe" or "conservative" option.

I fear that the precedent set by the adjoint integrator design doc discussion will provide an excuse for not exposing integrator configurations here as well, but my opinion is the same in both cases: until we have an explicit argument for how to set default configurations robust to Markov chain Monte Carlo use, all of the configurations should be exposed.

@wds15 (Contributor, Author) commented May 5, 2021

@betanalpha this PR has all the code ready to expose vector absolute tolerances for the states. No other extra options are exposed and this is only done for the CVODES solvers (adams and bdf). Should we re-open this PR with the content as is?

If we re-open I would be happy to write some tests which integrate the new signatures with the testing framework in a basic way.

(This is not to excuse anything here, but take it as a matter of fact that exposing what is already implemented is straightforward).
