Fix performance regression due to #809 #836

seantalts · 2018-04-12T17:50:12Z

I'm hoping to get comments on this, because it seems too simple but has the performance characteristics we want, and I don't think we have to worry about any of the thread contention issues that cause people to use most of the singleton patterns we've seen flying around. Those are all trying to avoid one thread accessing a static global before it has been initialized, but we are either running in single-threaded mode or we are running with a thread_local static global variable instead. So there's never an opportunity for contention.

Fixes #824.

Develop vs this PR:

('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.06)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 0.97)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.08)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.0)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 1.18)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 1.19)
('compilation', 1.02)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.21)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.1)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.07)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.19)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.09)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.17)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 0.84)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.15)
1.08304631135

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company):

Columbia University

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

seantalts · 2018-04-12T17:53:04Z

Particularly wondering what @bob-carpenter and @wds15 think of this one. I wonder if I'm overlooking something.

…xp1 (branches/release_50)

wds15 · 2018-04-12T19:16:08Z

This will work and will be very fast, I think. However: You have declared the global stack to be static at the namespace level. This specifies that the stack will have internal linkage in c++ within the translational unit. This should work fine given the way we build and compile our models (as a single translational unit). It should actually also work for rstanarm if rstanarm works as I think. That is, rstanarm compiles each model as a translational unit and links all of these together. This will give each compiled model its own autodiff stack. That should work for rstanarm arm. Where this approach does not work is if stan-math is used in a bigger project where many translational units make up the whole thing. In this last setting I am not sure if this will work, but this last case may not be relevant (and I could be wrong).

bob-carpenter · 2018-04-12T21:03:14Z

Isn't every global in some namespace, even if only ::? I'm not sure what internal linkage means here. Does it mean the global stack is invisible from other translation units? If so, I'd think that'd be a problem already.

seantalts · 2018-04-13T01:23:18Z

Yeah, I think "namespace-level" and "global" are basically the same.

It means each translation unit that includes this chainablestack.hpp will have its own copy of the AD stack. I actually think this is most likely fine for the reasons Sebastian outlined above (and we had some previous discussion).

For this not to work (and as far as I can tell), we would need someone to compile some Stan functions into one translation unit, and then link it with some other Stan functions in another translation unit, and for this user to expect that both would access the same AD graph. This seems like a weird use-case, no? Plus, I imagine the way that we inline every function/method (thus generating a separate copy of each method for each translation unit) would have the same effect as it is?

wds15 · 2018-04-13T06:45:12Z

No, i don't think that the inlining leads to the same effect. The ad stack is now part of the translational unit whereas in the standard c++ singleton design which is now in, it is guaranteed that in a given program (no matter how many translational units) there will only ever be a single stack instance.

So our ad stack is not any more a true singleton now, but it is a translational unit/namespace instance. Since each model is its own translational unit everything is fine given how stan works...though people on the net advise against this pattern as it causes confusion. Unless someone else has good reasons not to merge this, we should probably go ahead with this for now until we find a fast singleton pattern.

bob-carpenter · 2018-04-13T15:21:21Z

Not weird at all. It's how we'd be trying to save compile time for instance---we'd compile all the matrix operations in their own translation unit and then link rather than recompile every time. In fact, I think doing this should be a priority for us as they take a long time to compile and won't be too large to precompile---but that's a different issue.

seantalts · 2018-04-13T16:06:28Z

Damn. Guess this won't work either.

…

On Fri, Apr 13, 2018 at 11:21 Bob Carpenter ***@***.***> wrote: Not weird at all. It's how we'd be trying to save compile time for instance---we'd compile all the matrix operations in their own translation unit and then link rather than recompile every time. In fact, I think doing this should be a priority for us as they take a long time to compile and won't be too large to precompile---but that's a different issue. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#836 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAxJ7K-5anpW7f9G9T46sv3PhGTERryUks5toMJygaJpZM4TSOcK> .

seantalts · 2018-04-13T17:53:56Z

Wait, I'm confused - I don't think we could do that as is! As far as I can see, a symbol either has external or internal linkage, meaning either the symbol is exported during link time and visible from other translation units, or it is not. If it is, then you would get conflicts for defining the same symbol in different translation units. Classes have something special in C++ if they are exactly the same class, but static class members suffer the same problem.

I propose that this PR maintains the current state of affairs (before the threaded AD PR), solves the performance issues, and lets us use threading. I started a discourse thread on how to add new libraries easily, which I believe is the proper solution (compiling a separate library with just the static ChainableStack, and linking that in at the end just once).

bob-carpenter · 2018-04-13T19:33:16Z

All the matrix functions would be defined exactly once in their own translation unit. They will be externally linked (default behavior). Other translation units would have to use pure header files (without definitions) for matrix functions to ensure the matrix functions don't get recompiled (like they would if we inlined definitions). If this PR prevents compiled autodiff functions in standalone translation units, it's going to incur a lot of technical debt that we'll probably have to pay off in the future. And it'll probably break things our users are doing.

…

On Apr 13, 2018, at 1:53 PM, seantalts ***@***.***> wrote: Wait, I'm confused - I don't think we could do that as is! As far as I can see, a symbol either has external or internal linkage, meaning either the symbol is exported during link time and visible from other translation units, or it is not. If it is, then you would get conflicts for defining the same symbol in different translation units. Classes have something special in C++ if they are exactly the same class, but static class members suffer the same problem. I propose that this PR maintains the current state of affairs (before the threaded AD PR), solves the performance issues, and lets us use threading. I started a discourse thread on how to add new libraries easily, which I believe is the proper solution (compiling a separate library with just the static ChainableStack, and linking that in at the end just once). — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

seantalts · 2018-04-13T19:42:06Z

Yep, sounds good. But where does the ChainableStack live? There must be two copies, or else we wouldn't be able to link two translation units together that both include chainablestack.hpp.

My hypothesis is that this preserves the status quo here, as static globals have the same linkage as static class members (what we had before these PRs).

I think it's best if I make a proof-of-concept...

bob-carpenter · 2018-04-13T20:14:14Z

I'm happy to defer to your wisdom on this. As long as this doesn't preclude us from writing matrix functions in a standalone translation unit any more than whatever we had before, I'm totally OK with the change. I'm less keen if it introduces technical debt that precludes us from ever compiling in multiple translation units. Maybe we need to talk about this in person. I'm not very good with all the build terminology, so may be describing things the wrong way and just confusing everyone.

…

On Apr 13, 2018, at 3:42 PM, seantalts ***@***.***> wrote: Yep, sounds good. But where does the ChainableStack live? There must be two copies, or else we wouldn't be able to link two translation units together that both include chainablestack.hpp. My hypothesis is that this preserves the status quo here, as static globals have the same linkage as static class members (what we had before these PRs). I think it's best if I make a proof-of-concept... — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

seantalts · 2018-04-13T20:30:10Z

I made a POC and things do not operate like I have been reading / thinking they should. The version we used to have has a single AD stack across multiple translation units and no link-time errors about multiple definitions (maybe due to the weird thing C++ does for classes in these cases). The version in this PR behaves as expected and has a different AD stack for each TU. So let's not do this version... Any ideas for doing this in a performant way? @wds15 I didn't really understand what that guy was proposing, exactly... Basically if you wrapped chainable_stack in this PR in a class as a static variable instead of as a namespace level global static?

wds15 · 2018-04-13T21:22:51Z

A possible quick solution for now could be to declare the global ad stack as external in the header files and always require that an additional cpp is linked with the program using stan-math. This should give us a single AD stack in any program, but requires to pass into the linker this additional file.

I did not get the solution I linked immediately up and running. I need a moment to go through that and will do that early next week the latest. But basically, I think you are right in that we are looking for a static class member (with fast access).

seantalts · 2018-04-13T21:37:39Z

I got a version along the lines of the 2nd thing working, but need to test performance

…

On Fri, Apr 13, 2018 at 17:22 wds15 ***@***.***> wrote: A possible quick solution for now could be to declare the global ad stack as external in the header files and always require that an additional cpp is linked with the program using stan-math. This should give us a single AD stack in any program, but requires to pass into the linker this additional file. I did not get the solution I linked immediately up and running. I need a moment to go through that and will do that early next week the latest. But basically, I think you are right in that we are looking for a static class member (with fast access). — You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub <#836 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAxJ7PM4cV6i_IdtcozhW2XZJwYhHQ_aks5toRcsgaJpZM4TSOcK> .

bob-carpenter · 2018-04-14T00:58:06Z

Do we need anything more than the following to implement a mutable thread-local singleton?

template <typename T>
struct mutable_thread_local_singleton {
  static thread_local T instance_;  // declaration
  static T& get() { return instance_; }
};

bar mutable_thread_local_singleton<bar>::instance_ = bar();  // out of class definition

I can't imagine that using foo::instance() would be slower than `foo::instance_

seantalts · 2018-04-14T03:06:55Z

I think that's it - we actually don't even want the get() method, right? This is what I was talking about having verified above as having the right linkage characteristics, but need to finish performance testing.

seantalts · 2018-04-14T03:29:03Z

Check out #840

seantalts added 2 commits April 12, 2018 13:48

Try using a simple global static chainable stack

3b49166

Shorter ifdef

46fb849

[Jenkins] auto-formatting by clang-format version 5.0.1-svn319952-1~e…

4139d94

…xp1 (branches/release_50)

seantalts closed this Apr 13, 2018

seantalts reopened this Apr 13, 2018

seantalts closed this Apr 14, 2018

seantalts mentioned this pull request Apr 14, 2018

Fix performance regression due to #809 #840

Merged

Uh oh!

Fix performance regression due to #809 #836

Fix performance regression due to #809 #836

Uh oh!

Conversation

seantalts commented Apr 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Copyright and Licensing

Uh oh!

seantalts commented Apr 12, 2018

Uh oh!

wds15 commented Apr 12, 2018

Uh oh!

bob-carpenter commented Apr 12, 2018 via email

Uh oh!

seantalts commented Apr 13, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wds15 commented Apr 13, 2018

Uh oh!

bob-carpenter commented Apr 13, 2018 via email

Uh oh!

seantalts commented Apr 13, 2018 via email

Uh oh!

seantalts commented Apr 13, 2018

Uh oh!

bob-carpenter commented Apr 13, 2018 via email

Uh oh!

seantalts commented Apr 13, 2018

Uh oh!

bob-carpenter commented Apr 13, 2018 via email

Uh oh!

seantalts commented Apr 13, 2018

Uh oh!

wds15 commented Apr 13, 2018

Uh oh!

seantalts commented Apr 13, 2018 via email

Uh oh!

bob-carpenter commented Apr 14, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seantalts commented Apr 14, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seantalts commented Apr 14, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

seantalts commented Apr 12, 2018 •

edited

Loading

seantalts commented Apr 13, 2018 •

edited

Loading

bob-carpenter commented Apr 14, 2018 •

edited

Loading

seantalts commented Apr 14, 2018 •

edited

Loading