
Revert "Merge pull request #1212 from stan-dev/feature/faster-ad-tls-v5" #1244

Merged
merged 1 commit into develop from perf-regression on May 18, 2019

@seantalts (Member) commented May 16, 2019

This reverts commit 5e76d67, reversing
changes made to ec52b8b.

Seems like this caused a 14% performance regression in example-models/bugs_examples/vol2/schools/schools-{2-4}.stan.
Ref: https://discourse.mc-stan.org/t/possible-performance-regression/8835

@wds15 (Contributor) commented May 16, 2019

This reintroduces a bug that breaks threading on Windows.

In this regard, I am not sure the slight performance regression, which we have found in only a single model so far, is enough to trump that bug.

Could we first get an overview of which models are affected?

Moreover, there are two very hot candidates for getting rid of the performance regression in the non-threaded case.

If it's our standard procedure to revert things like this, then sure, go ahead.

@seantalts (Member, Author) commented May 16, 2019

Yeah, just talked about it at the Math meeting. SOP is to revert since we didn't know about this before, and then study it in more detail / try to fix it / analyze tradeoffs while develop is "clean."

I hear what you're saying about the tradeoffs of the revert and breaking things again for Windows, but given that our performance testing coverage is woefully inadequate, it seems like we should study this more to understand it better before we can actually decide to make that trade-off.

@wds15 (Contributor) commented May 16, 2019

Ok... I am getting a bit tired of working on this, but I will look for some motivation on the street...

Most likely it's the all-pointer design. Changing that will make things a bit inconsistent, but it will be manageable. With a bit of luck we can turn the instance() method into a raw pointer and that solves it (hopefully).

EDIT: BTW, if this is our SOP, then we absolutely need an easy way to generate an overview of which models are affected and by how much.

EDIT2: And does this mean that the gold standard is non-threaded performance? We never spelled that out, but reverting introduces a major performance regression for any threaded application. So what if I submitted a model which uses threading... you would not be able to make the same argument again (other than declaring non-threaded performance the gold reference).
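For readers following along, the distinction being drawn here could look roughly like the sketch below. This is a hedged illustration, not the actual Stan Math code: `ChainableStack`, `instance()`, and `hot_loop` are hypothetical stand-ins for the real autodiff stack machinery.

```cpp
struct ChainableStack {
  int size = 0;  // stand-in for the real AD tape contents

  // (a) Current design: access goes through a method on every use.
  // Each call re-reads the thread_local pointer, which some compilers
  // fail to hoist out of hot loops.
  static ChainableStack& instance() {
    static thread_local ChainableStack* stack = new ChainableStack();
    return *stack;
  }
};

// (b) Proposed alternative: a raw thread_local pointer that a caller
// can load once and then reuse, avoiding repeated TLS lookups.
thread_local ChainableStack* chainable_stack = new ChainableStack();

int hot_loop(int n) {
  ChainableStack* local = chainable_stack;  // single TLS read
  for (int i = 0; i < n; ++i)
    local->size += 1;
  return local->size;
}
```

Whether (b) actually helps depends on how well a given compiler optimizes the per-call TLS access in (a), which is exactly the gcc-vs-clang question discussed later in this thread.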

@seantalts (Member, Author) commented May 16, 2019

I feel ya. @rok-cesnovar and @SteveBronder (and maybe others) had previously volunteered to help out with parallelism efforts, maybe you guys could work together on analyzing this and potentially fixing it or making the case that it's worth it?

@rok-cesnovar (Collaborator) commented May 16, 2019

I am working on pinpointing where exactly on the branch this happens. I will see if I can find the fix or seek some help here or on discourse.

@rok-cesnovar (Collaborator) commented May 16, 2019

Reverting 49d3583 and ace8424 has no effect on the schools-4 model.

As @wds15 pointed out on the Discourse thread, it could be due to the pointer access to the AD instance. And he is correct.

I tried my best (or worst) with an improvised, hand-wavy, just-make-this-compile way of ifdef-ing autodiffstackstorage.hpp (the result is here). I had never dealt with the autodiffstackstorage code before and just wanted to see if this was the cause. With this we are back to:

 Elapsed Time: 287.646 seconds (Warm-up)
               182.023 seconds (Sampling)
               469.669 seconds (Total)
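For context, the idea being tested is to compile the stack storage differently depending on whether threading is enabled. A minimal hedged sketch follows; the real autodiffstackstorage.hpp is more involved, and `AutodiffStackStorage`, `ad_stack()`, and this particular use of the `STAN_THREADS` flag are simplified assumptions for illustration.

```cpp
struct AutodiffStackStorage {
  int nested_level = 0;  // stand-in for the real AD tape contents
};

#ifdef STAN_THREADS
// Threaded build: each thread owns its stack behind a thread_local
// pointer, so threads never share autodiff state.
inline AutodiffStackStorage& ad_stack() {
  static thread_local AutodiffStackStorage* stack
      = new AutodiffStackStorage();  // intentionally leaked singleton
  return *stack;
}
#else
// Single-threaded build: a plain static object with no pointer
// indirection, which (per this thread) gcc optimizes much better.
inline AutodiffStackStorage& ad_stack() {
  static AutodiffStackStorage stack;
  return stack;
}
#endif
```

The cost of this approach is the inconsistency discussed below: threaded and non-threaded builds use two different storage schemes inside one header.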
@rok-cesnovar (Collaborator) commented May 16, 2019

Is making a PR with these ifdefs an option, instead of reverting and then reapplying? Either way, I can make that happen tomorrow to avoid the regression for threading on Windows.

@wds15 (Contributor) commented May 16, 2019

@rok-cesnovar Thanks a lot for pinpointing this. We should be able to wait for this, if you ask me, but this is probably a matter of SOP and easier git magic; so let's have @seantalts and/or @syclik have a say here.

In any case... handling things with pointers in the threading case and without them in the non-threading case is not too nice. Since this only affects a single file, it is probably fine given the constraints we are under. I hope that this is a common understanding here. The art will be to write it as cleanly as possible.

@rok-cesnovar (Collaborator) commented May 16, 2019

Yeah, with reverting and then reapplying we will also be able to double-check that the fix passes all the performance tests in the batch, so we are clear for good.

@wds15 (Contributor) commented May 16, 2019

Reverting 49d3583 and ace8424 does have an effect on my system! See the forum.

(It's annoying to have to deal with this again... but it looks like it is useful... yuck.)

@rok-cesnovar (Collaborator) commented May 16, 2019

I only tried the revert with g++; that is my bad. I will double-check with clang, and on Windows, but not before tomorrow.

@wds15 (Contributor) commented May 16, 2019

No worries. This stuff can drive you crazy and is almost unmanageable without heavy automation. It sounds as if we want to avoid the by-method access, and we may even need to get rid of the pointer for the non-threaded case. Getting rid of the pointer in the non-threaded case (and using two different schemes for threading / non-threading) is not too nice.

Windows should work with the pointer for threading. I have never tested speed on Windows, though (I just don't like to spend too much time under Windows).

@wds15 (Contributor) commented May 17, 2019

Shouldn’t we now expect to see speedups in the performance tests which are kicked off automatically? Can someone point me to these, please?

@rok-cesnovar (Collaborator) commented May 17, 2019

Finally ran the tests with clang++

clang++ on develop
 Elapsed Time: 324.398 seconds (Warm-up)
               207.511 seconds (Sampling)
               531.909 seconds (Total)

clang++ with `49d3583` and `ace8424` reverted (made 2 runs)
 Elapsed Time: 296.09 seconds (Warm-up)
               187.73 seconds (Sampling)
               483.82 seconds (Total)
 Elapsed Time: 297.669 seconds (Warm-up)
               188.23 seconds (Sampling)
               485.9 seconds (Total)

So it seems that clang handles this better and doesn't need the ifdef solution. Running on Windows now.

@wds15 (Contributor) commented May 17, 2019

Thanks for confirming that. The crux is that reverting those two commits is a solution which cannot easily be combined with the optimisation of using a non-pointer solution for the non-threaded case.

From the evidence we have so far, it seems that gcc runs faster with a non-pointer solution while clang does not care about that; but for clang the method access should be avoided. At least this is how I recall it. Let's see if we can find a solution that is good in most situations...

@seantalts (Member, Author) commented May 17, 2019

So the github commenting mechanism isn't working yet, but Jenkins did run the stat_comp_benchmarks for performance and correctness in this PR against develop in this build: http://d1m1s1b1.stat.columbia.edu:8080/blue/organizations/jenkins/CmdStan%20Performance%20Tests/detail/downstream_tests/62/pipeline/18

Unfortunately the model with the problem is not in the stat_comp_benchmarks suite, but those results are here:

+ ./comparePerformance.py develop_performance.csv performance.csv
('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.0)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 0.94)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.0)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 0.97)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 1.01)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 0.99)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.04)
('performance.compilation', 1.0)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.0)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.01)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.06)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.02)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.04)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 0.98)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.21)
1.01635012423

I'd be prone to ignore anything within plus or minus 4 percent or so. It looks like it sped up the arma model significantly, but that model takes under a second to run anyway. I'll run a test locally where I run it many times and see what it says about that model.
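For readers unfamiliar with the output above, here is a minimal sketch of what a comparison script like comparePerformance.py might do. Assumptions (the real script may differ): two-column CSVs of model name and runtime, ratios computed as old time over new time so that values above 1.0 read as speedups (matching how the arma 1.21 result is interpreted in this thread), and an overall ratio over the total runtimes.

```python
import csv

def read_times(path):
    """Read a two-column CSV mapping model name -> runtime in seconds."""
    with open(path) as f:
        return {name: float(secs) for name, secs in csv.reader(f)}

def relative_performance(base, new):
    """Per-model ratio base/new (>1.0 means the new code is faster),
    plus an overall ratio over the models present in both runs."""
    shared = sorted(base.keys() & new.keys())
    ratios = {m: round(base[m] / new[m], 2) for m in shared}
    overall = sum(base[m] for m in shared) / sum(new[m] for m in shared)
    return ratios, overall
```

A run comparable to the output above would then be `relative_performance(read_times('develop_performance.csv'), read_times('performance.csv'))`.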

@seantalts (Member, Author) commented May 17, 2019

I ran the arma model locally,

Seans-MBP ~/scm/performance-tests-cmdstan (master) $ ./compare-git-hashes.sh "stat_comp_benchmarks/benchmarks/arma --runs 30" develop develop develop PR-1244

and it looks like the PR actually makes performance slightly worse on my machine:

+ ./comparePerformance.py develop_performance.csv performance.csv
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 0.92)
('performance.compilation', 1.03)
0.972083995854

So I guess my conclusion here is not to worry about the arma model and focus on the eight schools models. I'll test those against this PR locally to make sure this PR addresses the issue we spotted.

@wds15 (Contributor) commented May 17, 2019

It seems as if the compiler used matters for this PR... What is being used in the Jenkins tests, and would it be possible to run the performance tests on Jenkins with a specific compiler as an option, maybe?

@seantalts (Member, Author) commented May 17, 2019

I didn't see the compiler mattering at all in Rok's testing; did that pop up at some point? Jenkins is using clang++-6.0 on Mac and, I believe, g++ 5 on Linux; the relative performance tests for the PR can execute on either.

@wds15 (Contributor) commented May 17, 2019

I am afraid that different compilers react differently to specific code patterns. Here is what I think right now, but I could be wrong:

> And from the evidence we have so far it seems that gcc runs faster with a non-pointer solution while clang does not care about that, but for clang the method access should be avoided. At least this is how I recall this. Let's see if we can get a good solution under most situations...

@seantalts (Member, Author) commented May 17, 2019

:(

@wds15 (Contributor) commented May 17, 2019

Currently we have a performance regression under Linux with gcc 5 on the schools model. If we revert 49d3583 and ace8424, then we even get a speedup with clang under Linux and under macOS; this gets us almost everything we want. I would therefore suggest that we test whether newer gcc versions manage to properly optimize the schools model with the revert mentioned. If newer gcc versions handle this (and do not show a performance regression), then we should revert those two commits, and we end up with a version which only has a performance regression with older gcc versions.

Does that sound like an option?

@rok-cesnovar (Collaborator) commented May 17, 2019

On Windows the regression is a bit less obvious (<3%):

g++ Develop:
 Elapsed Time: 320.069 seconds (Warm-up)
               165.872 seconds (Sampling)
               485.941 seconds (Total)
g++ Reverted `49d3583` and `ace8424`
 Elapsed Time: 322.308 seconds (Warm-up)
               162.52 seconds (Sampling)
               484.828 seconds (Total)
g++ reverted tlsv5
 Elapsed Time: 316.418 seconds (Warm-up)
               157.7 seconds (Sampling)
               474.118 seconds (Total)

clang++ develop
 Elapsed Time: 319.805 seconds (Warm-up)
               161.913 seconds (Sampling)
               481.718 seconds (Total)
clang++ reverted `49d3583` and `ace8424`
 Elapsed Time: 324.069 seconds (Warm-up)
               163.576 seconds (Sampling)
               487.645 seconds (Total)
clang++ reverted tlsv5
 Elapsed Time: 317.134 seconds (Warm-up)
               157.861 seconds (Sampling)
               474.995 seconds (Total)

So yeah, 8schools on the combination of Linux+gcc is the one affected.

@rok-cesnovar (Collaborator) commented May 17, 2019

> Currently we have a performance regression under Linux with gcc 5 with the schools model.

I have gcc 7.3 on my Ubuntu system and it shows the same regression :/

@wds15 (Contributor) commented May 17, 2019

I just finished a run on a branch where the two hashes quoted are reverted, under gcc 8. And then I get a fast run:

 Elapsed Time: 228.18 seconds (Warm-up)
               143.031 seconds (Sampling)
               371.211 seconds (Total)

which is just as fast as the respective run with clang, which took 376s with this.

So I think this would be a sensible thing to do.

In fact, the original TLSv4, which was also approved, did have these two hashes reverted.

EDIT: gcc 8 on macOS.

@rok-cesnovar (Collaborator) commented May 17, 2019

g++-8 on Linux suffers from the same regression. Just tried it.

I guess we are left with either ifdef-ing the pointer for the non-threading case (which, as you said, is on the not-nice side) or reverting the 2 commits, knowing that we have a regression for g++ on Ubuntu for a few models.

@seantalts (Member, Author) commented May 17, 2019

@syclik already asked me to just push this revert through, I've just been waiting for some really long running tests to show that it does indeed fix the issue.

I think if you have a code proposal you want to put forward the first step is to create a PR with it so we can reference it. Does it supersede this revert PR or build on top of it?

@syclik (Member) commented May 17, 2019 (comment body not captured)

@seantalts (Member, Author) commented May 17, 2019

Sure, the question now is "what is releasable" - this revert PR breaks threading on Windows but restores performance on (at least) a few models, with no perf regression on many more.

I don't want to interpret the policy incorrectly, so maybe @syclik, do you want to either hit the merge button or suggest a path for rolling forward? My thought for rolling forward would still be to submit a PR with the proposed changes and let people test it; it's just a question of whether that PR is built on top of the revert in the meantime or not. It shouldn't matter too much, though since the best route doesn't seem immediately obvious, merging the revert PR first is attractive to maintain a sort of monotonically improving math library.

@syclik (Member) commented May 17, 2019 (comment body not captured)

@seantalts (Member, Author) commented May 18, 2019

Yes. And in general now that we're measuring performance automatically across a much wider range of tests (than we had in Daily Stan Performance) we're going to have to start making these difficult decisions.

If you made a decision in your last response I didn't catch it - what do you want us to do?

@syclik (Member) commented May 18, 2019 (comment body not captured)

@seantalts (Member, Author) commented May 18, 2019

Okay. This revert PR is ready to merge whenever.

@stan-dev stan-dev deleted a comment from stan-buildbot May 18, 2019

@rok-cesnovar rok-cesnovar merged commit f49f224 into develop May 18, 2019

1 check passed: continuous-integration/jenkins/pr-merge (This commit looks good)
@wds15 (Contributor) commented May 18, 2019

The statement "does not slow down end-to-end performance" always referred to our performance benchmark set of models. On that set there is no slow-down. This was my understanding, and as such this is to me another instance of changing requirements, which are very hard (and frustrating) to work with. If others understood it differently, fine, then have another PR.

Moreover, reverting this PR reintroduces a major bug under Windows with threading. It's unclear how that plays with our guidelines. We should probably have a discussion on Discourse about the place of threading: either threading is deemed a side product (which makes Stan a useless toy for many of my applications), or Stan is a serious option to consider for real problems. But let's have that discussion on Discourse.

@seantalts (Member, Author) commented May 18, 2019

Oh man, my long-running tests finally finished (10 runs each of all the schools models), and on my MacBook Pro it looks like this revert PR didn't do anything vs develop...

@rok-cesnovar (Collaborator) commented May 18, 2019

Macbook with clang?

@wds15 (Contributor) commented May 18, 2019

It seems like this is a Linux + gcc problem only. And what is really frightening to me, in how we read our own SOPs, is that single-threaded performance trumps a bug in Windows threading.

BTW, clang speeds up with the changes in the v6 PR from @rok-cesnovar (thanks for making that one!).

@seantalts (Member, Author) commented May 18, 2019

Yes, MacBook Pro with clang.

> It seems like this is a Linux + gcc problem only.

No, the performance drop was first measured on a Mac with clang. Something else is going on: either this is the wrong PR, or there were multiple OS-dependent performance drops, or the Mac we run the performance tests on is weird in some way (this is probably true, but I'm not sure it's the culprit here)...

> And what is really frightening to me, in how we read our own SOPs, is that single-threaded performance trumps a bug in Windows threading.

If you think of it as maintaining 100% support and only offering improvements, then it kind of makes sense. But yeah, you might want to make that trade-off a different way sometime if you absolutely have to.
[edit] Also, Daniel's paradigm that you revert something you found had unintended consequences until you can better analyze them seems right to me. It's not that we don't want to make that trade-off; it's that we want to know we're making that trade-off.

@seantalts seantalts deleted the perf-regression branch May 18, 2019

@wds15 (Contributor) commented May 18, 2019

Oh, I forgot: clang does not like the indirect access via a method. That is reverted in the v6 branch which @rok-cesnovar filed.

The SOPs are interpreted in ways I don't understand, and this is certainly telling me I should work a lot less on Stan!

In this case:

  1. Why can we let Windows threading slide? I don't get it.
  2. If performance is set in stone with our reference set of stat_comp_performance (for which everything seems OK), then how can this schools model trigger a revert?

I am just observing here. If this schools model can trigger a revert, it should by all means be part of what we test regularly.

@seantalts (Member, Author) commented May 18, 2019

> Why can we let Windows threading slide? I don't get it.

It was never released.

> If performance is set in stone with our reference set of stat_comp_performance (for which everything seems OK), then how can this schools model trigger a revert?

It's expensive to compile and run every model we have twice for the relative benchmarks. We don't do it for every PR commit, just on merge to develop.

@wds15 (Contributor) commented May 18, 2019

That helps a little.

@seantalts (Member, Author) commented May 18, 2019

And we're still working these things out - maybe it would be worth doing the full set on an EC2 node for every PR push. But they're also pretty noisy, so it's not like the automated test triggered the revert; it did alert us to look at it more carefully, and then the regression was verified independently by a couple of people for this model.

The SOP you're referring to is just that develop be releasable, which I think @syclik is reasonably interpreting to mean, among other things, "in a state where we understand the performance tradeoffs." So if we discover something new on merge to develop on a specific PR, we revert and test that PR more thoroughly to try to outline the performance tradeoffs we're making. Only a very small percentage of PRs will need this - you happen to be doing lots of transformative work to the very core of the autodiff system, which is both exciting and needs lots of analysis.

@seantalts (Member, Author) commented May 18, 2019

And in this case it's really annoying because we were making large changes to the performance suite at the same time that a few PRs of this transformative sort were merged in, haha. Normally we'd have much clearer records of which PR caused the performance regression.

@wds15 (Contributor) commented May 18, 2019

Yeah... ideally I should just code up new functions where there is no reference at all and everything is completely compartmentalised... but this stuff is needed very badly... I think. I have already developed a lot of patience with these things, but by now it is almost funny how difficult it is to get this TLS stuff in.

@seantalts (Member, Author) commented May 18, 2019

Yeah, sorry about that. The work we've been doing on the performance regression suite is supposed to make this an easier, more transparent process, but it's also the first time on the project that we've systematically measured performance across a breadth of models. So we're for the first time having to make hard choices where a year ago we wouldn't have noticed the impact until a user reported a bug (or possibly never).

I'm definitely open to suggestions on how to make this better, too!
