Skip to content

Conversation

@seantalts
Copy link
Member

@seantalts seantalts commented Apr 12, 2018

I'm hoping to get comments on this, because it seems too simple but has the performance characteristics we want, and I don't think we have to worry about any of the thread contention issues that cause people to use most of the singleton patterns we've seen flying around. Those are all trying to avoid one thread accessing a static global before it has been initialized, but we are either running in single-threaded mode or we are running with a thread_local static global variable instead. So there's never an opportunity for contention.

Fixes #824.

Develop vs this PR:

('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.06)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 0.97)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.08)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 1.0)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 1.18)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 1.19)
('compilation', 1.02)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.21)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.1)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.07)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.19)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.09)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.17)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 0.84)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.15)
1.08304631135

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company):

Columbia University

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@seantalts
Copy link
Member Author

Particularly wondering what @bob-carpenter and @wds15 think of this one. I wonder if I'm overlooking something.

@wds15
Copy link
Contributor

wds15 commented Apr 12, 2018

This will work and will be very fast, I think. However: You have declared the global stack to be static at the namespace level. This specifies that the stack will have internal linkage in c++ within the translational unit. This should work fine given the way we build and compile our models (as a single translational unit). It should actually also work for rstanarm if rstanarm works as I think. That is, rstanarm compiles each model as a translational unit and links all of these together. This will give each compiled model its own autodiff stack. That should work for rstanarm arm. Where this approach does not work is if stan-math is used in a bigger project where many translational units make up the whole thing. In this last setting I am not sure if this will work, but this last case may not be relevant (and I could be wrong).

@bob-carpenter
Copy link
Member

bob-carpenter commented Apr 12, 2018 via email

@seantalts
Copy link
Member Author

seantalts commented Apr 13, 2018

Yeah, I think "namespace-level" and "global" are basically the same.

It means each translation unit that includes this chainablestack.hpp will have its own copy of the AD stack. I actually think this is most likely fine for the reasons Sebastian outlined above (and we had some previous discussion).

For this not to work (and as far as I can tell), we would need someone to compile some Stan functions into one translation unit, and then link it with some other Stan functions in another translation unit, and for this user to expect that both would access the same AD graph. This seems like a weird use-case, no? Plus, I imagine the way that we inline every function/method (thus generating a separate copy of each method for each translation unit) would have the same effect as it is?

@wds15
Copy link
Contributor

wds15 commented Apr 13, 2018

No, i don't think that the inlining leads to the same effect. The ad stack is now part of the translational unit whereas in the standard c++ singleton design which is now in, it is guaranteed that in a given program (no matter how many translational units) there will only ever be a single stack instance.

So our ad stack is not any more a true singleton now, but it is a translational unit/namespace instance. Since each model is its own translational unit everything is fine given how stan works...though people on the net advise against this pattern as it causes confusion. Unless someone else has good reasons not to merge this, we should probably go ahead with this for now until we find a fast singleton pattern.

@bob-carpenter
Copy link
Member

bob-carpenter commented Apr 13, 2018 via email

@seantalts
Copy link
Member Author

seantalts commented Apr 13, 2018 via email

@seantalts seantalts closed this Apr 13, 2018
@seantalts seantalts reopened this Apr 13, 2018
@seantalts
Copy link
Member Author

Wait, I'm confused - I don't think we could do that as is! As far as I can see, a symbol either has external or internal linkage, meaning either the symbol is exported during link time and visible from other translation units, or it is not. If it is, then you would get conflicts for defining the same symbol in different translation units. Classes have something special in C++ if they are exactly the same class, but static class members suffer the same problem.

I propose that this PR maintains the current state of affairs (before the threaded AD PR), solves the performance issues, and lets us use threading. I started a discourse thread on how to add new libraries easily, which I believe is the proper solution (compiling a separate library with just the static ChainableStack, and linking that in at the end just once).

@bob-carpenter
Copy link
Member

bob-carpenter commented Apr 13, 2018 via email

@seantalts
Copy link
Member Author

Yep, sounds good. But where does the ChainableStack live? There must be two copies, or else we wouldn't be able to link two translation units together that both include chainablestack.hpp.

My hypothesis is that this preserves the status quo here, as static globals have the same linkage as static class members (what we had before these PRs).

I think it's best if I make a proof-of-concept...

@bob-carpenter
Copy link
Member

bob-carpenter commented Apr 13, 2018 via email

@seantalts
Copy link
Member Author

I made a POC and things do not operate like I have been reading / thinking they should. The version we used to have has a single AD stack across multiple translation units and no link-time errors about multiple definitions (maybe due to the weird thing C++ does for classes in these cases). The version in this PR behaves as expected and has a different AD stack for each TU. So let's not do this version... Any ideas for doing this in a performant way? @wds15 I didn't really understand what that guy was proposing, exactly... Basically if you wrapped chainable_stack in this PR in a class as a static variable instead of as a namespace level global static?

@wds15
Copy link
Contributor

wds15 commented Apr 13, 2018

A possible quick solution for now could be to declare the global ad stack as external in the header files and always require that an additional cpp is linked with the program using stan-math. This should give us a single AD stack in any program, but requires to pass into the linker this additional file.

I did not get the solution I linked immediately up and running. I need a moment to go through that and will do that early next week the latest. But basically, I think you are right in that we are looking for a static class member (with fast access).

@seantalts
Copy link
Member Author

seantalts commented Apr 13, 2018 via email

@bob-carpenter
Copy link
Member

bob-carpenter commented Apr 14, 2018

Do we need anything more than the following to implement a mutable thread-local singleton?

template <typename T>
struct mutable_thread_local_singleton {
  static thread_local T instance_;  // declaration
  static T& get() { return instance_; }
};

bar mutable_thread_local_singleton<bar>::instance_ = bar();  // out of class definition

I can't imagine that using foo::instance() would be slower than `foo::instance_

@seantalts
Copy link
Member Author

seantalts commented Apr 14, 2018

I think that's it - we actually don't even want the get() method, right? This is what I was talking about having verified above as having the right linkage characteristics, but need to finish performance testing.

@seantalts
Copy link
Member Author

Check out #840

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants