@wds15
Contributor

@wds15 wds15 commented Apr 8, 2018

Submission Checklist

  • Run unit tests: ./runTests.py test/unit
  • Run cpplint: make cpplint
  • Declare copyright holder and open-source license: see below

Summary:

Addresses the performance regression introduced with the threaded AD stack change.

It turns out that a static variable declared inside a static function is hard for the compiler to optimize. This PR changes this so that a global instance of ChainableStack is declared directly. Whenever threads are to be used, the instance is declared with the thread_local keyword.
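The two access patterns described above can be sketched roughly as follows. This is a minimal illustration, not the actual Stan Math code: `ToyStack`, `LazyHolder`, `global_stack`, `push_lazy`, `push_global` and the `TOY_THREADS` switch are all hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the autodiff stack; not the real ChainableStack.
struct ToyStack {
  std::vector<double> values;
};

// Pattern A: function-local static inside a static member function.
// The object is initialized on first call, so each call may have to pass
// through an "already initialized?" guard before it can use the object.
struct LazyHolder {
  static ToyStack& instance() {
    static ToyStack stack;
    return stack;
  }
};

// Pattern B: instance declared directly at namespace scope.
// TOY_THREADS is a made-up switch; define it to get one instance per thread.
#ifdef TOY_THREADS
static thread_local ToyStack global_stack;
#else
static ToyStack global_stack;
#endif

// Push through the accessor function (pattern A); returns the new size.
std::size_t push_lazy(double v) {
  ToyStack& s = LazyHolder::instance();
  s.values.push_back(v);
  assert(!s.values.empty());  // postcondition
  return s.values.size();
}

// Push directly onto the namespace-scope instance (pattern B).
std::size_t push_global(double v) {
  global_stack.values.push_back(v);
  assert(!global_stack.values.empty());  // postcondition
  return global_stack.values.size();
}
```

From the caller's side both patterns behave the same; the difference is only in what the compiler has to emit for each access, which is what the benchmark numbers below are about.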

This change did solve the performance regression problems:

  • performance prior to the threaded AD pull (cmdstan hash 8f218b6c0584af995ed7e48faa8408d03cb040ee):
stat_comp_benchmarks/benchmarks/arK/arK.stan,2.10290490389
  • performance after this change:
stat_comp_benchmarks/benchmarks/arK/arK.stan,1.90999811888
  • for reference, performance with the threaded AD changes which caused the slow down:
stat_comp_benchmarks/benchmarks/arK/arK.stan,2.34019263983

All of the above are without threading enabled.

Intended Effect:

Recover the speed lost in the regression.

How to Verify:

Run the performance regression framework. Please use cmdstan hash 8f218b6c0584af995ed7e48faa8408d03cb040ee as the reference.

Side Effects:

It looks like things get faster: on arK, the result is about 9% below the figure from before the threaded AD pull.

Documentation:

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Sebastian Weber

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@seantalts seantalts self-requested a review April 8, 2018 16:55
@seantalts
Member

I'm looking into this usage of static for a global variable, and I don't think it will work for us in the multiple translation unit case (like rstanarm needs). SO link.

@bgoodri
Contributor

bgoodri commented Apr 8, 2018 via email

@wds15
Contributor Author

wds15 commented Apr 8, 2018

Hmm... the multiple translation unit thing could be an issue, let's see. I need to run these benchmarks with thread_local, as this could solve the translation unit issue should we have a problem here.

@bgoodri
Contributor

bgoodri commented Apr 8, 2018

Can we have a thing like this inside the model's namespace?
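A rough sketch of what "one instance per model namespace" could look like, purely for illustration. The namespaces `model_a` and `model_b`, the `ToyStack` type and the `record` functions are assumptions invented here and do not correspond to generated Stan model code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the autodiff stack.
struct ToyStack {
  std::vector<double> values;
};

// Each (hypothetical) model namespace owns its own stack instance.
namespace model_a {
static ToyStack stack;

// Push a value onto model_a's stack; returns the new size.
std::size_t record(double v) {
  stack.values.push_back(v);
  assert(!stack.values.empty());  // postcondition
  return stack.values.size();
}
}  // namespace model_a

namespace model_b {
static ToyStack stack;

// Push a value onto model_b's stack; returns the new size.
std::size_t record(double v) {
  stack.values.push_back(v);
  assert(!stack.values.empty());  // postcondition
  return stack.values.size();
}
}  // namespace model_b
```

Values recorded through `model_a::record` never show up in `model_b::stack`, which is the isolation the question is after. The cost is that every function touching the stack has to know which namespace's instance to use.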

@bgoodri
Contributor

bgoodri commented Apr 9, 2018

Still working with Debian stable

@wds15
Contributor Author

wds15 commented Apr 9, 2018

@bgoodri one such thing per namespace may work, but that is probably not nice.

@seantalts Is there a test which should break in the multiple translation unit case? I need to read a bit more about this, but I think you are right. If I understood your Stack Overflow link correctly on a quick read, then the pattern we have right now follows its recommendations.

Turning on thread_local now causes quite a slowdown, to 3, with this code (a bigger hit than what we had before). If we could make thread_local variables fast, this should solve the multiple translation unit issue. One possibility would be to hold thread_local pointers in the functions which need to access the stack. This is a bit tricky given all the constraints we have.
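One loose reading of the pointer idea is to resolve the thread-local stack once per function call and then work through a plain pointer, so the hot loop does not repeat the thread-local lookup. The sketch below only illustrates that reading; `ToyStack`, `thread_stack`, `push_many` and `run_in_threads` are hypothetical names, and whether this helps in practice would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the autodiff stack.
struct ToyStack {
  std::vector<double> values;
};

// One stack per thread, created on first use in that thread.
ToyStack& thread_stack() {
  thread_local ToyStack stack;
  return stack;
}

// Look up the thread-local stack once, then use a plain pointer in the loop.
// Returns the size of the calling thread's stack after the pushes.
std::size_t push_many(std::size_t n) {
  ToyStack* stack = &thread_stack();  // single thread_local lookup
  for (std::size_t i = 0; i < n; ++i) {
    stack->values.push_back(static_cast<double>(i));
  }
  assert(stack->values.size() >= n);  // postcondition
  return stack->values.size();
}

// Run push_many(n) in n_threads fresh threads and collect the sizes
// each thread observed for its own stack.
std::vector<std::size_t> run_in_threads(std::size_t n_threads, std::size_t n) {
  std::vector<std::size_t> sizes(n_threads, 0);
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < n_threads; ++t) {
    // Each worker writes only to its own slot of sizes.
    workers.emplace_back([&sizes, t, n] { sizes[t] = push_many(n); });
  }
  for (auto& w : workers) {
    w.join();
  }
  return sizes;
}
```

Because every new thread starts with its own empty stack, each entry returned by `run_in_threads(k, n)` equals `n`, while repeated calls to `push_many` on the same thread keep accumulating.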

@seantalts
Member

@bgoodri I think you have more experience with the multiple translation unit stuff from rstanarm. Let me outline what I think happens, and hopefully you can correct me as needed. Each model gets built into a shared library, and then they are all linked together into a single binary. That binary can execute only one model on any given run. If this is true, maybe it's okay that each model compiled into its own translation unit has its own autodiff stack, since each translation unit / model has basically the entire math library coded into it. I'm just not sure what happens during linking. Suppose there is a link-time optimization phase that normally notices that the math library is the same across models and eliminates that redundancy. Now it either 1) might not be able to because of this new global static variable, 2) thinks it still can, and something weird potentially happens where some functions use one autodiff stack and others use another, or 3) can perfectly eliminate the redundancy, so only one autodiff stack is left at the end of link-time optimization. 3 would be nice :)

Our multiple translation unit tests are not really prepared to answer questions like these. I'm hoping @bgoodri knows, or we just need to try building and testing rstanarm with these changes. Also open to other ideas if people have them.

@wds15
Contributor Author

wds15 commented Apr 9, 2018

Compilers are really weird. I think I found a solution, which I put into another pull.

@wds15 wds15 closed this Apr 9, 2018
@wds15 wds15 deleted the feature/issue-824-fix-speed branch April 17, 2018 19:00