Feature/faster ad tls v6 #1245
Conversation
@seantalts will Jenkins run the entire performance tests suite on this PR? |
Not until it’s merged - we have PRs just running on the stat comp
benchmarks suite since it’s shorter and more accepted as meaningful.
We’ll have to run it ourselves, but the scripts should make that relatively
easy other than finding the idle machines.
|
Just to clarify. This means:
Or something else? |
Running on a Linux RHEL 7.5 box with gcc 6.3.1, I got these timings:
So it looks good to me! EDIT: The CPU there is |
Rok did that work? The math hash should go in the 5th slot there:
it's |
Haven't tried yet, just copy-pasted from your post on Discourse (and mixed up the arguments). Thanks for the info on the arguments. Going to run this on Windows & Ubuntu systems. |
Honestly I'm still worried we reverted the wrong PR, or at least that there's another PR out there that is also causing a dip in performance. I'll try to run some git bisect on that soon... Were there any other candidates we were worried about? I know it was suspicious the ODE one seemed to be the first one to cause the failure... |
You mentioned the ODE one, I am not aware of any other. I have only done the git bisect with the schools-4 model as that was the model mentioned. Doing git bisect with more models at a time seemed like a hassle. It seems that the arma model is also returning weird performance results, right? At least if I am judging from #1233 EDIT: If I am even reading that Stan-bot post right :) |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 0.96) |
You are, though I'm not sure why that PR is showing that issue... it doesn't have anything to do with the TLS stuff right? |
Nope. At least as far as I know. |
None of our Macs are on Mojave 😅 |
My work mac is Mojave, if you send me a link to what you want tested I can do it tmrw |
I am a bit confused as to how this PR can move forward.
BTW, if I recall right then clang compilers speed up with this changes set a little bit for the single-core case if I am not mistaken... and one more BTW... the performance reports are great to have, but would be better to also have them say what compiler on what OS with what CPU it ran on. I am not saying to expand the tests, just adding that doc helps, I think. |
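A sketch of the provenance header such a report could carry, along the lines suggested above. The probes are generic commands; the CPU lookup is Linux-only (`/proc/cpuinfo`), and anything unavailable degrades to `unknown`:

```shell
# Print the OS / compiler / CPU / git-hash header a performance report
# could carry; falls back to "unknown" where a probe is unavailable.
os=$(uname -srm)
compiler=$("${CXX:-g++}" --version 2>/dev/null | head -n 1)
cpu=$(awk -F: '/model name/ { gsub(/^ +/, "", $2); print $2; exit }' \
      /proc/cpuinfo 2>/dev/null)
githash=$(git rev-parse --short HEAD 2>/dev/null || echo unknown)
printf 'os:       %s\n' "$os"
printf 'compiler: %s\n' "${compiler:-unknown}"
printf 'cpu:      %s\n' "${cpu:-unknown}"
printf 'git hash: %s\n' "${githash:-unknown}"
```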
@wds15, I can review it. I agree with all your points below. They were the
same points I made when you originally submitted the PR.
Between the two of us, we can be the "honest broker" and collect the
results accurately and request the right things to be measured. We'll need
to look at: hardware, compiler version, os version, git hash, and relative
computational results.
One other thing we can do is try to simplify the test case so we can
observe these results without needing to go all the way through CmdStan. It
still wouldn't surprise me if there was an interaction with linking that we
didn't know about until now.
Does that sound like it'd work? We'd first define what needs to be measured.
…On Fri, May 24, 2019 at 7:27 AM wds15 ***@***.***> wrote:
I am a bit confused as to how this PR can move forward.
- Who is going to review it? If I were to do that then all is fine (I
explain below)
- Why do we still need Mojave? I have Mojave, but why is it needed for
what?
- The PR as it is right now addresses the performance regression which
was found for the Linux+gcc+schools model constellation. For this setup I
have shown above that the performance regression goes away with the code in
this PR.
|
@syclik Thanks for reviewing. So what do you think about my suggestion about what to measure? I am pasting the results I posted above here again: running on a Linux RHEL 7.5 box with gcc 6.3.1 I got these timings, running things a bit shorter (this started from the performance cmdstan repo):
The reported time is the Total as reported by CmdStan. So it looks good to me! EDIT: The CPU there is Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz. So what I did was run the schools model on this Linux+gcc system (this is exactly the problematic case).
Do you think this is sufficient? Do we need more? I mean relative to what we already know, we are good, I think. EDIT: Mark configuration and what processes to run and how time was measured. |
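For reference, this is roughly how the Total time can be scraped from a run and averaged. The `extract_total` helper is illustrative; the sample output just mimics CmdStan's usual `Elapsed Time` footer with made-up numbers:

```shell
# Pull the "Total" wall time out of CmdStan-style output.
extract_total() {
  awk '/\(Total\)/ { print $1; exit }'
}

# stand-in for the tail of one real run
run_output='
 Elapsed Time: 95.2 seconds (Warm-up)
               104.8 seconds (Sampling)
               200.0 seconds (Total)
'
total=$(printf '%s\n' "$run_output" | extract_total)
echo "total: $total seconds"
```

Looping this over repeated runs and averaging the extracted values is all the scripting the benchmark needs.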
@wds15: can you be explicit about what you think we need to measure? I'm trying to follow along and what I haven't seen is:
I think you're trying to tell me, but it's hidden in comments and I want to make sure I understand what you're proposing. At the very least, it should cover all 3 OSes and have some reasonable test that flexes the autodiff stack. |
Btw, I'm expecting the key points from our discussions to be summarized and included at the PR description. |
I have marked in bold the info you are looking for and added information on how time was measured in the comment above. I have only reported this exercise on Linux+gcc, since this is the only setup where the performance regression was seen. On macOS we did not have a problem as I recall, but I can verify if you think we need that. If you think it is valuable to also do this on macOS, I can do that with a gcc version which I can grab from MacPorts (I would try to get a g++ 6ish which is similar to the Linux one). For macOS+clang I have posted on Discourse that there never was a problem. For Windows I am not sure I can manage these tests (if at all, I would have to use a virtual box, which adds another complication). So Linux+gcc caused all of this trouble and this is why I restricted my benchmarks above to that combination. Which other combinations are needed from your perspective?
So I will add the Linux+gcc bit to the PR description and what other combination should we document? I can add macOS bits, but for Windows someone else would need to provide them (this time we can actually do the Windows stuff since threading is not used in the example). (I know that the compiler versions I quote are not the "vanilla" ones, but that is a restriction due to the systems I use; not much I can do about that) |
Thanks for the info.
I absolutely think we need to reverify every time we change anything. We're not guaranteed that new changes maintain old behavior. It looks like you're only reporting one single run. Is that the one benchmark we're going to use to evaluate this? Is that sufficient? I think we only need to compare to current develop now (we should pick a hash) vs the latest version of this branch with that develop merged in (the exact git hash). Here's what I think our table should look like:
If you think 1, 2, 4 threads is overkill or not enough, tell me what we should test. Are there any other configurations we should be testing? g++ on mac? g++-8? g++-7? I know you believe that these other things should just be ignored, but they shouldn't. I think we should add a few more columns meaning a few more benchmarks, but I don't know what flexes the autodiff stack in the right way. What we're looking for is a better performance for all configurations. We'll accept things that are within noise. Do you agree? If any of these configurations shows that develop is better, we should not merge. |
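To make the thread-count rows of such a table concrete, here is a throwaway sketch that just prints the per-thread-count invocations for one (OS, compiler) row. The binary and data file names are placeholders; `STAN_NUM_THREADS` is the environment variable the math library actually reads:

```shell
# Enumerate the 1/2/4-thread runs for one row of the benchmark table.
# ./schools and schools.data.R are hypothetical names.
cmds=$(for threads in 1 2 4; do
  echo "STAN_NUM_THREADS=$threads ./schools sample data file=schools.data.R"
done)
echo "$cmds"
```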
This PR got reverted because of a performance regression under the non-threading case. We have already seen performance improvements under threading and nothing has changed for the threading case. Why do we need to retest this? I mean, this change in this PR takes away abstraction and I have seen speed improvements for clang when doing so. Thus, the threading performance should not degrade. We are a project with limited resources and should act accordingly. Looking at the table I want to run away!!! Really... this is a huge amount of work and requires resources which are out of reason from my perspective. I certainly do not have the time for this - so if you suggest such a huge thing where we don't have automation then please also suggest how we achieve this. It is your call on this, no challenge, but I would do much more focused testing and leave some residual risk of a small performance regression... which we found a little bit later in this case using the big performance suite run on the entire example model suite. This somewhat leaner approach will also find the performance regressions, but using our automation and without our sweat. As I say - you are the reviewer here - so one more comment: why would we need to test more than 1 thread? The program in question does not use threading. And yes, I think a single run should do, given that the end-to-end test is large enough and runs for 200s walltime. This thing needs scripting anyway, so if we see unreasonable results and suspect run-to-run variation then replications can be done. |
Sorry, I was off for a few days, catching up now. I agree with @wds15 that only the non-threaded tests are needed, the model that shows the regression does not use threading anyways. And also the Windows benchmark - develop for threading cases is N/A. I can run the non-threaded tests on Win+gcc/clang and Linux+gcc/clang. If someone provides the mac ones, that means we can cover the non-threading case which should be enough if @syclik agrees. |
Thanks. The table was what I was expecting. It sounds like both of you
think otherwise.
Here’s why I think it’s important: when we change this code (autodiff stack
internals), any small change may have an unintended performance effect
that’s unintuitive and can be verified, but not reasoned about. I think
that we could miss a performance regression by only testing a subset of
that.
Question for both of you: why do you think only testing non-threading will
provide us with enough evidence that these changes improve threading? If
there is a good reason, then I’m happy to limit what we test.
(My belief here is that results may change drastically with a little bit of
change to the code, so.... any change in the autodiff stack code should be
evaluated. Let me know why this is incorrect.)
|
And so we’re on the same page: the purpose of this PR is to improve
performance under threading. To me that means: without performance
degradation for non threading and we’re seeing performance benefits for
threading across the supported compilers and wherever we don’t see it, we
can point users to a workaround or we have an understanding.
Since develop doesn’t have this PR in, I think we do have to test both
threaded and non threaded cases. Am I missing the point of the PR?
|
As far as I understand, the threading cases were already covered before in v1-v5. I wasn't involved in those PRs much, but I believe those tests were done. Maybe the largest point is that the schools8-4 model, which was the only one with the performance regression, does not use threading. Or do we want to check if there is a regression if someone enables threading but runs a non-threaded model? If that is the reasoning, then yeah, we should check the entire table. Apart from Windows threading, as there is no baseline on develop to test it with. |
The key thing why threading got faster is the use of a Thus, the relative benefits for threading in this PR have not changed and don't need re-evaluation from my understanding. |
Ok, I understand your reasoning, but I don’t think results hold. In this
PR, the autodiff implementation changed, so those performance measurements
may be similar, but aren’t guaranteed to hold. That's why we missed the
performance regression for non-threaded use; I prefer not to make the same
mistake for the threaded case. (I have no belief that this PR would not
have similar performance, but I wouldn't just merge it assuming that it
holds.)
I think you make a very good point. We should be running a model with
map_rect in addition to this one. For Windows, we don't need to run
multiple times: just once without threading and once with threading.
Does what I'm thinking make sense? I might be overthinking this.
|
@rok-cesnovar, thanks! |
@wds15, thank you for that information. I think this has uncovered a different issue. If I run the warfarin model a few times in a row, I see something different:
I don't think it should terminate the first time it hits max iterations, right? (I don't even know where that's coming from off the top of my head) |
Looks to me as if the initial values lead to a log-lik value which is NaN. Therefore, Stan does not reach the part of the program where autodiff is needed, since the log-lik is evaluated in double-only mode before sampling starts. It's a bit weird that the problem happens under Windows now. The behaviour is the same under Here is one more seed which I generated a while ago for this model (maybe it just runs with that):
|
Here is the time.txt file from analyze.sh for Windows. This includes no-threads for both develop and TLSv6 and threads=1,2,4 for TLSv6. Bottom line is that this is good to go. |
Awesome! I took the mean.
…On Fri, Jun 21, 2019 at 2:51 PM Rok Češnovar ***@***.***> wrote:
Here is the time.txt file from analyze.sh for windows.
time.txt <https://github.com/stan-dev/math/files/3315703/time.txt>
this include no-threads for both develop and TLSv6 and threads=1,2,4 for
TLSv6.
Bottom line is that this is good to go.
@syclik <https://github.com/syclik> do you have the time to populate the
table with the results? I am not sure if you took the mean or median of the
10 runs to put in the table. I can do it if you let me know. I will update
the benchmark repo with the windows script once I clean things up.
|
And I can populate it. In a little bit. Unless you’re doing it already.
|
I can do it, no problem. |
It is interesting to see the execution times go up considerably when going from no threading on Windows to threading. I remember reading in the TBB docs that Windows threads are terribly expensive (which is why a thread pool, as implemented in the TBB, should speed this up). Other than that it all looks fine to my eyes. |
Great! Thank you. Yeah, that's good to know. We should really suggest people not use threading with 1 thread. I'm really glad there is an improvement at 2 threads. |
Awesome!
What's the deal with Jenkins? Anyone know why it started a new job? |
No idea...had the same thought... |
I think mean's better for testing unless you expect high outliers due to something like other system demand.
|
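To illustrate that outlier point with made-up numbers — one run polluted by background load drags the mean far more than the median:

```shell
# Ten fake timings (seconds); one run hit by background load (350.0).
times="201.3 199.8 200.4 202.1 198.9 200.0 199.5 350.0 200.7 199.9"
out=$(echo "$times" | tr ' ' '\n' | sort -n | awk '
  { v[NR] = $1; sum += $1 }
  END {
    printf "mean:   %.1f\n", sum / NR
    printf "median: %.1f\n", (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
  }')
echo "$out"
```

The median stays near 200 s while the mean is pulled up by 15 s, which is why the mean is only the better summary when such contamination is absent.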
@serban-nicusor-toptal hey, when you have the time please take a look at what is going on here. The last push to this PR's branch was 11 days ago and it seems that the tests keep restarting. |
I did manually restart the last one since Jenkins hung up on Windows. I think Windows testing is somewhat fragile right now. |
Now I see what is possibly the problem: on Windows the command run is
So the |
To me this looks like a config error of Jenkins on the Windows instance we are getting. On that machine |
Yeah, that looks right - seems like a config problem with the new EC2 Windows instances Nic has been setting up. @serban-nicusor-toptal for now I turned off the 'windows' label on those nodes so they shouldn't start up, and I'll re-run this job. |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 0.99) |
Investigated and it's an issue from the Windows AMI. |
Summary
This PR reapplies the changes of the faster TLS v5 PR, which was reverted from develop after performance test issues. See #1244
In short, this PR:
First, let's see what the perf tests show now. If this isn't fine, we should try #ifdef-ing autodiffstackstorage.hpp with a cleaner version of something like this
Tests
/
Side Effects
The side effects to TLSv5 are kind of why we are doing this.
Performance Results