-
-
Notifications
You must be signed in to change notification settings - Fork 198
fix performance regression bug, fixes #824 #825
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I'm looking into this usage of static for a global variable, and I don't think it will work for us in the multiple translation unit case (like rstarnarm needs). SO link. |
|
For rstanarm, each model is a single translation unit and then they are
linked together. We haven't had any linker issues since we inlined
everything, but I haven't built rstanarm yet with the new threading stuff.
…On Sun, Apr 8, 2018 at 1:17 PM, seantalts ***@***.***> wrote:
I'm looking into this usage of static for a global variable, and I don't
think it will work for us in the multiple translation unit case (like
rstarnarm needs). SO link
<https://stackoverflow.com/questions/15235526/the-static-keyword-and-its-various-uses-in-c>
.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#825 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ADOrqpsuP4ujOqUyS98N28j6nxe4re6Eks5tmkYhgaJpZM4TLm1->
.
|
|
Hmm...the multiple translation thing could be an issue, let’s see. I need to run these benchmarks with thread local as this could solve this translation unit issue should we have a problem here. |
|
Can we have a thing like this inside the model's namepsace? |
|
Still working with Debian stable |
|
@bgoodri one such thing per namespace may work.. but that is probably not nice. @seantalts Is there a test which should break for this multiple translation unit? I need to read a bit more about this, but I think you are right. If I got on a quick read your stackoverflow article right, then the pattern which we have right now follows their recommendations. Now, turning on the thread_local thing causes now quite a slow down to 3 with this code (a bigger hit than what we had before). If we could make thread_local's work fast this should solve the multiple translation unit. A possibility for that could be to hold thread_local pointers in the functions which need to access the stack. This is a bit tricky given all the constraints we have. |
|
@bgoodri I think you have more experience with the multiple translation unit stuff from rstanarm. Let me outline what I think happens, and hopefully you can correct me as needed. Each model gets built into a shared library and then they are all linked together into a single binary. That binary can execute only one model on any given run. If this is true, maybe it's okay that each model compiled into each translation unit has its own autodiff stack? Since each translation unit / model has basically the entire math library coded into it. I'm just not sure what happens during linking - if there's a link-time optimization phase and it normally notices that all of the math library is the same between models and it can eliminate that redundancy, now it either 1) might not be able to because of this new global static variable or 2) thinks it still can, and potentially something weird happens where some functions are using one autodiff stack and others are using another? or 3) it can perfectly eliminate the redundancy and there is only one autodiff stack left at the end of link time optimization. 3 would be nice :) Our multiple translation unit tests are not really prepared to answer questions like these. I'm hoping @bgoodri knows, or we just need to try building and testing rstanarm with these changes. Also open to other ideas if people have them. |
|
compilers are really weird. I think I found a solution which I put into another pull. |
Submission Checklist
./runTests.py test/unitmake cpplintSummary:
Adresses performance regression bug introduced with threaded AD stack change.
It turns out that using a
staticvariables declared as member of astaticfunction is causing problems for the compiler to optimize. This PR changes this such that directly a global instance ofChainableStackis declared. Whenever threads are to be used, thethread_localkeyword is used.This change did solve the performance regression problems:
8f218b6c0584af995ed7e48faa8408d03cb040ee:all of the above are without threading enabled.
Intended Effect:
recover speed.
How to Verify:
Run the performance regression framework. Please use as reference for the cmdstan hash
8f218b6c0584af995ed7e48faa8408d03cb040ee.Side Effects:
Looks like that things get faster.
Documentation:
Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Sebastian Weber
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: