Feature/faster ad tls v6 #1245
Conversation
@seantalts will Jenkins run the entire performance tests suite on this PR? |
Not until it’s merged - we have PRs just running on the stat comp
benchmarks suite since it’s shorter and more accepted as meaningful.
We’ll have to run it ourselves, but the scripts should make that relatively
easy other than finding the idle machines.
|
Just to clarify. This means:
Or something else? |
Running on a Linux RHEL 7.5 box with gcc 6.3.1, I got these timings:
So it looks good to me! EDIT: The CPU there is |
Rok did that work? The math hash should go in the 5th slot there:
it's |
Haven't tried yet, just copy-pasted from your post on Discourse (and mixed up the arguments). Thanks for the info on the arguments. Going to run this on Windows & Ubuntu systems. |
Honestly I'm still worried we reverted the wrong PR, or at least that there's another PR out there that is also causing a dip in performance. I'll try to run some git bisect on that soon... Were there any other candidates we were worried about? I know it was suspicious the ODE one seemed to be the first one to cause the failure... |
You mentioned the ODE one, I am not aware of any other. I have only done the git bisect with the schools-4 model as that was the model mentioned. Doing git bisect with more models at a time seemed like a hassle. It seems that the arma model is also returning weird performance results, right? At least if I am judging from #1233 EDIT: If I am even reading that Stan-bot post right :) |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 0.96) |
You are, though I'm not sure why that PR is showing that issue... it doesn't have anything to do with the TLS stuff right? |
Nope. At least as far as I know. |
None of our Macs are on Mojave 😅 |
My work mac is Mojave, if you send me a link to what you want tested I can do it tmrw |
I am a bit confused as to how this PR can move forward.
BTW, if I recall right then clang compilers speed up with this changes set a little bit for the single-core case if I am not mistaken... and one more BTW... the performance reports are great to have, but would be better to also have them say what compiler on what OS with what CPU it ran on. I am not saying to expand the tests, just adding that doc helps, I think. |
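A sketch of the provenance header such a report could carry, along the lines suggested above. The probes are generic commands; the CPU lookup is Linux-only (`/proc/cpuinfo`), and anything unavailable degrades to `unknown`:

```shell
# Print the OS / compiler / CPU / git-hash header a performance report
# could carry; falls back to "unknown" where a probe is unavailable.
os=$(uname -srm)
compiler=$("${CXX:-g++}" --version 2>/dev/null | head -n 1)
cpu=$(awk -F: '/model name/ { gsub(/^ +/, "", $2); print $2; exit }' \
      /proc/cpuinfo 2>/dev/null)
githash=$(git rev-parse --short HEAD 2>/dev/null || echo unknown)
printf 'os:       %s\n' "$os"
printf 'compiler: %s\n' "${compiler:-unknown}"
printf 'cpu:      %s\n' "${cpu:-unknown}"
printf 'git hash: %s\n' "${githash:-unknown}"
```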
@wds15, I can review it. I agree with all your points below. They were the
same points I made when you originally submitted the PR.
Between the two of us, we can be the "honest broker" and collect the
results accurately and request the right things to be measured. We'll need
to look at: hardware, compiler version, os version, git hash, and relative
computational results.
One other thing we can do is try to simplify the test case so we can
observe these results without needing to go all the way through CmdStan. It
still wouldn't surprise me if there was an interaction with linking that we
didn't know about until now.
Does that sound like it'd work? We'd first define what needs to be measured.
…On Fri, May 24, 2019 at 7:27 AM wds15 ***@***.***> wrote:
I am a bit confused as to how this PR can move forward.
- Who is going to review it? If I were to do that then all is fine (I
explain below)
- Why do we still need Mojave? I have Mojave, but why is it needed for
what?
- The PR as it is right now addresses the performance regression which
was found for the Linux+gcc+schools model constellation. For this setup I
have shown above that the performance regression goes away with the code in
this PR.
|
@syclik Thanks for reviewing. So what do you think about my suggestion about what to measure? I am pasting the results I posted above here again: running on a Linux RHEL 7.5 box with gcc 6.3.1 I got these timings, running things a bit shorter (this started from the performance cmdstan repo):
The reported time is the Total as reported by CmdStan. So it looks good to me! EDIT: The CPU there is Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz. So what I did was run the schools model on this Linux+gcc system (this is exactly the problematic case).
Do you think this is sufficient? Do we need more? I mean relative to what we already know, we are good, I think. EDIT: Mark configuration and what processes to run and how time was measured. |
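For reference, this is roughly how the Total time can be scraped from a run and averaged. The `extract_total` helper is illustrative; the sample output just mimics CmdStan's usual `Elapsed Time` footer with made-up numbers:

```shell
# Pull the "Total" wall time out of CmdStan-style output.
extract_total() {
  awk '/\(Total\)/ { print $1; exit }'
}

# stand-in for the tail of one real run
run_output='
 Elapsed Time: 95.2 seconds (Warm-up)
               104.8 seconds (Sampling)
               200.0 seconds (Total)
'
total=$(printf '%s\n' "$run_output" | extract_total)
echo "total: $total seconds"
```

Looping this over repeated runs and averaging the extracted values is all the scripting the benchmark needs.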
@wds15: can you be explicit about what you think we need to measure? I'm trying to follow along and what I haven't seen is:
I think you're trying to tell me, but it's hidden in comments and I want to make sure I understand what you're proposing. At the very least, it should cover all 3 OSes and have some reasonable test that flexes the autodiff stack. |
Btw, I'm expecting the key points from our discussions to be summarized and included at the PR description. |
I have marked in bold the info you are looking for and added information on how time was measured in the comment above. I have only reported this exercise on Linux+gcc, since this is the only setup where the performance regression was seen. On macOS we did not have a problem as I recall, but I can verify if you think we need that. If you think it is valuable to also do this on macOS, I can do that with a gcc version which I can grab from MacPorts (I would try to get a g++ 6ish which is similar to the Linux one). For macOS+clang I have posted on Discourse that there never was a problem. For Windows I am not sure I can manage these tests (if at all, I would have to use a virtual box, which adds another complication). So Linux+gcc caused all of this trouble and this is why I restricted my benchmarks above to that combination. Which other combinations are needed from your perspective?
So I will add the Linux+gcc bit to the PR description and what other combination should we document? I can add macOS bits, but for Windows someone else would need to provide them (this time we can actually do the Windows stuff since threading is not used in the example). (I know that the compiler versions I quote are not the "vanilla" ones, but that is a restriction due to the systems I use; not much I can do about that) |
Thanks for the info.
I absolutely think we need to reverify every time we change anything. We're not guaranteed that new changes maintain old behavior. It looks like you're only reporting one single run. Is that the one benchmark we're going to use to evaluate this? Is that sufficient? I think we only need to compare to current develop now (we should pick a hash) vs the latest version of this branch with that develop merged in (the exact git hash). Here's what I think our table should look like:
If you think 1, 2, 4 threads is overkill or not enough, tell me what we should test. Are there any other configurations we should be testing? g++ on mac? g++-8? g++-7? I know you believe that these other things should just be ignored, but they shouldn't. I think we should add a few more columns meaning a few more benchmarks, but I don't know what flexes the autodiff stack in the right way. What we're looking for is a better performance for all configurations. We'll accept things that are within noise. Do you agree? If any of these configurations shows that develop is better, we should not merge. |
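To make the thread-count rows of such a table concrete, here is a throwaway sketch that just prints the per-thread-count invocations for one (OS, compiler) row. The binary and data file names are placeholders; `STAN_NUM_THREADS` is the environment variable the math library actually reads:

```shell
# Enumerate the 1/2/4-thread runs for one row of the benchmark table.
# ./schools and schools.data.R are hypothetical names.
cmds=$(for threads in 1 2 4; do
  echo "STAN_NUM_THREADS=$threads ./schools sample data file=schools.data.R"
done)
echo "$cmds"
```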
This PR got reverted because of a performance regression under the non-threading case. We have already seen performance improvements under threading and nothing has changed for the threading case. Why do we need to retest this? I mean, this change in this PR takes away abstraction and I have seen speed improvements for clang when doing so. Thus, the threading performance should not degrade. We are a project with limited resources and should act accordingly. Looking at the table I want to run away!!! Really... this is a huge amount of work and requires resources which are out of reason from my perspective. I certainly do not have the time for this - so if you suggest such a huge thing where we don't have automation then please also suggest how we achieve this. It is your call on this, no challenge, but I would do much more focused testing and leave some residual risk of a small performance regression... which we found a little bit later in this case using the big performance suite run on the entire example model suite. This somewhat leaner approach will also find the performance regressions, but using our automation and without our sweat. As I say - you are the reviewer here - so one more comment: why would we need to test more than 1 thread? The program in question does not use threading. And yes, I think a single run should do, given that the end-to-end test is large enough and runs for 200s walltime. This thing needs scripting anyway, so if we see unreasonable results and suspect run-to-run variation then replications can be done. |
Sorry, I was off for a few days, catching up now. I agree with @wds15 that only the non-threaded tests are needed, the model that shows the regression does not use threading anyways. And also the Windows benchmark - develop for threading cases is N/A. I can run the non-threaded tests on Win+gcc/clang and Linux+gcc/clang. If someone provides the mac ones, that means we can cover the non-threading case which should be enough if @syclik agrees. |
Thanks. The table was what I was expecting. It sounds like both of you
think otherwise.
Here’s why I think it’s important: when we change this code (autodiff stack
internals), any small change may have an unintended performance effect
that’s unintuitive and can be verified, but not reasoned about. I think
that we could miss a performance regression by only testing a subset of
that.
Question for both of you: why do you think only testing non-threading will
provide us with enough evidence that these changes improve threading? If
there is a good reason, then I’m happy to limit what we test.
(My belief here is that results may change drastically with a little bit of
change to the code, so.... any change in the autodiff stack code should be
evaluated. Let me know why this is incorrect.)
|
And so we’re on the same page: the purpose of this PR is to improve
performance under threading. To me that means: without performance
degradation for non threading and we’re seeing performance benefits for
threading across the supported compilers and wherever we don’t see it, we
can point users to a workaround or we have an understanding.
Since develop doesn’t have this PR in, I think we do have to test both
threaded and non threaded cases. Am I missing the point of the PR?
|
As far as I understand, the threading cases were already covered before in v1-v5. I wasn't involved in those PRs much, but I believe those tests were done. Maybe the largest point is that the schools8-4 model, which was the only one with the performance regression, does not use threading. Or do we want to check if there is a regression if someone enables threading but runs a non-threaded model? If that is the reasoning, then yeah, we should check the entire table. Apart from Windows threading, as there is no baseline on develop to test it with. |
The key thing why threading got faster is the use of a Thus, the relative benefits for threading in this PR have not changed and don't need re-evaluation from my understanding. |
Ok, I understand your reasoning, but I don’t think results hold. In this
PR, the autodiff implementation changed, so those performance measurements
may be similar, but aren’t guaranteed to hold. That's why we missed the
performance regression for non-threaded use; I prefer not to make the same
mistake for the threaded case. (I have no belief that this PR would not
have similar performance, but I wouldn't just merge it assuming that it
holds.)
I think you make a very good point. We should be running a model with
map_rect in addition to this one. For Windows, we don't need to run
multiple times: just once without threading and once with threading.
Does what I'm thinking make sense? I might be overthinking this.
|
@rok-cesnovar, thanks! |
@wds15, thank you for that information. I think this has uncovered a different issue. If I run the warfarin model a few times in a row, I see something different:
I don't think it should terminate the first time it hits max iterations, right? (I don't even know where that's coming from off the top of my head) |
Looks to me as if the initial values lead to a log-lik value which is NaN. Therefore, Stan does not reach the part of the program where autodiff is needed, since the log-lik is evaluated in double-only mode before sampling starts. It's a bit weird that the problem happens under Windows now. The behaviour is the same under Here is one more seed which I generated a while ago for this model (maybe it just runs with that):
|
Here is the time.txt file from analyze.sh for Windows. This includes no-threads for both develop and TLSv6 and threads=1,2,4 for TLSv6. Bottom line is that this is good to go. |
Awesome! I took the mean.
…On Fri, Jun 21, 2019 at 2:51 PM Rok Češnovar ***@***.***> wrote:
Here is the time.txt file from analyze.sh for windows.
time.txt <https://github.com/stan-dev/math/files/3315703/time.txt>
this include no-threads for both develop and TLSv6 and threads=1,2,4 for
TLSv6.
Bottom line is that this is good to go.
@syclik <https://github.com/syclik> do you have the time to populate the
table with the results? I am not sure if you took the mean or median of the
10 runs to put in the table. I can do it if you let me know. I will update
the benchmark repo with the windows script once I clean things up.
|
And I can populate it. In a little bit. Unless you’re doing it already.
|
I can do it, no problem. |
It is interesting to see the execution times go up considerably when going from no threading on Windows to threading. I remember reading in the TBB docs that Windows threads are terribly expensive (which is why a thread pool, as implemented in the TBB, should speed this up). Other than that it all looks fine to my eyes. |
Great! Thank you. Yeah, that's good to know. We should really suggest people not use threading with 1 thread. I'm really glad there is an improvement at 2 threads. |
Awesome!
What's the deal with Jenkins? Anyone know why it started a new job? |
No idea...had the same thought... |
I think mean's better for testing unless you expect high outliers due to something like other system demand.
|
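To illustrate that outlier point with made-up numbers — one run polluted by background load drags the mean far more than the median:

```shell
# Ten fake timings (seconds); one run hit by background load (350.0).
times="201.3 199.8 200.4 202.1 198.9 200.0 199.5 350.0 200.7 199.9"
out=$(echo "$times" | tr ' ' '\n' | sort -n | awk '
  { v[NR] = $1; sum += $1 }
  END {
    printf "mean:   %.1f\n", sum / NR
    printf "median: %.1f\n", (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
  }')
echo "$out"
```

The median stays near 200 s while the mean is pulled up by 15 s, which is why the mean is only the better summary when such contamination is absent.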
@serban-nicusor-toptal hey, when you have the time please take a look at what is going on here. The last push to this PR's branch was 11 days ago and it seems that the tests keep restarting. |
I did manually restart the last one since Jenkins hung up on Windows. I think Windows testing is somewhat fragile right now. |
Now I see what is possibly the problem: on Windows the command run is
So the |
To me this looks like a config error of Jenkins on the Windows instance we are getting. On that machine |
Yeah, that looks right - seems like a config problem with the new EC2 Windows instances Nic has been setting up. @serban-nicusor-toptal for now I turned off the 'windows' label on those nodes so they shouldn't start up, and I'll re-run this job. |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 0.99) |
Investigated and it's an issue from the Windows AMI. |
Summary
This PR reapplies the changes of the faster TLS v5 PR, which was reverted from develop after performance test issues. See #1244
In short, this PR:
First, let's see what the perf tests show now. If this isn't fine, we should try #ifdef-ing autodiffstackstorage.hpp with a cleaner version of something like this
Tests
/
Side Effects
The side effects to TLSv5 are kind of why we are doing this.
Performance Results