
Parallel Run Adaptive Sampler #3028

Closed
wants to merge 14 commits into from

Conversation

SteveBronder
Collaborator

Submission Checklist

  • Run unit tests: ./runTests.py src/test/unit
  • Run cpplint: make cpplint
  • Declare copyright holder and open-source license: see below

Summary

This is a WIP to figure out some of the odds and ends of the parallel version of adaptive sampling. The main issues are:

  1. The sample_writers and diagnostic_writers come in as a vector of streams, whereas the current impl takes a writer for each stream. Is that fine? At the cmdstan level I suppose we can just check that if n_chain > 1 we spin up writers to output_{chain_#}?

  2. wrt RNGs, do we need to have a special PRNG or is it safe to just have multiple RNGs? Should we use a different RNG with a longer cycle and set each seed to the user-given initial + 2^chain_id or something?

  3. For what @wds15 brought up here, should the isolate() thing happen before we put this PR in or after? I think as long as it's not exposed to users yet it's fine to work on this, as long as we do that before we make this available via cmdstan.

Till we sort out 1-3 I just brought in the adaptive_sampler from cmdstan#987. Once we have those 3 sorted we will have a version that just blandly runs each chain in parallel, but it will be very easy to make new methods like auto warmup across chains, since we'd just have some run_auto_warmup_adaptive_sampler() with parallel_fors for the warmup and sampling, where the warmup is just something like

for (size_t i = 0; i < max_warmup; i += n_warmup_checkpoints) {
  // advance every chain one checkpoint's worth of warmup iterations
  for (size_t iter = i; iter < i + n_warmup_checkpoints; ++iter) {
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, n_chain, 1),
        [&](const tbb::blocked_range<size_t>& r) {
          // cool stuff! (one warmup transition per chain in r)
        });
  }
  // smart warmup stuff here
}

Intended Effect

Run Chains in Parallel

How to Verify

Side Effects

Documentation

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company):

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@SteveBronder
Collaborator Author

Tagging interested parties @wds15 @yizhang-yiz @mitzimorris @bbbales2. Once we have the API sorted out this should be decently easy peasy.

@wds15
Contributor

wds15 commented Mar 21, 2021

The isolation thing is easy to implement. Let's just open an issue for that and fix it. That's good to have anyway, to make things safe whenever Stan chains run in parallel environments (within a single arena).

@wds15
Contributor

wds15 commented Mar 21, 2021

One point for consideration is thread allocation / requesting. So far we have (by design of one chain per process) a per-chain thread allocation controlled by the environment variable STAN_NUM_THREADS. The way the current PR is structured, there will no longer be any control over how many threads are used per chain. I would expect that to be a good thing: it leaves resource allocation to the Intel TBB scheduler, which is what the thing is built for.

If we stay with such a design, then we can leave the setup of the size of the used TBB arena within cmdstan, and we do not need to deal with setting up an arena within the services. To me that makes sense. @SteveBronder, @rok-cesnovar & others? Note that the meaning of STAN_NUM_THREADS will change (depending on how you see it) to become "total number of threads to be used for all chains" if we don't do anything. The best solution, I think, is to add a program option to cmdstan which controls the number of threads being used. All of these are then shared by the n_chains requested from the services.

@SteveBronder
Collaborator Author

Yeah @wds15 I think what you are saying makes sense. What we could do is finally add the parallel option to cmdstan so users would call something like

./my_model sample data file=./my_data.json chains n_chains=8 parallel n_threads=16

Making sure I read your comment right: I think we are both on the same page that the user should just tell Stan the total number of available threads, and on the backend we can sort out how to use them. I think Stan's preference would always be to use a thread for each chain, since the overall chain is the more expensive thing to run. Then any other available threads can be used in the model.
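To make that concrete, here's a minimal sketch of what the cmdstan-level setup could look like, assuming we cap the scheduler with tbb::global_control (the function and option names are illustrative, not settled API):

#include <tbb/global_control.h>

// Cap TBB at the user-supplied total; the scheduler then splits those
// threads between chains and within-chain work like reduce_sum.
void run_with_thread_budget(int n_threads) {
  tbb::global_control thread_limit(
      tbb::global_control::max_allowed_parallelism, n_threads);
  // ... call the n_chains service under this limit ...
}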

@bbbales2
Member

we do not need to deal with setting up an arena within the services. To me that makes sense. @SteveBronder , @rok-cesnovar & others ?

Other than segfaults, I guess the only thing I worry about here is I expect work done outside of reduce_sum to scale more efficiently than work in reduce_sum (because reduce_sum has the shared argument copy overhead). If there's lots of outside-reduce_sum work and lots of inside-reduce_sum work, I'd hope that the outside work gets priority cause the inside work scales worse.

I assume that would show up quickly with experiments though so probably just easier to see if it happens than worry about it too much ahead of time.

The sample_writers and diagnostic_writers come in as a vector of streams, whereas the current impl takes a writer for each stream. Is that fine?

So there's a separate output file for each chain? Question there is what would the interface look like.

It seems awkward to do:

output file="file1.csv,file2.csv,..."

The other option would be to do some sort of basename or automatic name-rewriting. So if the user specified:

output file="myfile.csv"

They would get myfile.csv.0, myfile.csv.1, etc. or something.

My guess is we should just have a meeting with @mitzimorris and @rok-cesnovar and talk that one through. I like the name re-writing which looks like what is being done in the prototyping. There's probably also some extra output that should be written in the not-quite-csv output headers.

It looks like this is something that can remain compatible with the single chain interface which is neat.

or is it safe to just have multiple RNGs?

Seems simpler to think about to have multiple RNGs unless they aren't threadsafe or something. Like, if you wanted to reproduce a calculation and you only wanted to run one thread, then no need to reproduce in parallel.

If seed isn't specified then I assume picking a seed however we currently do is the way to go. We'd probably need to be able to specify the seed as a list or something now. There are probably other arguments that may make sense to be optionally-lists? Inits? I'm not sure. That's a question to coordinate with cmdstanr and cmdstanpy I guess.

For what @wds15 brought up here should the isolate() thing happen before we put this PR in or after?

Sounds like it shouldn't negatively affect behavior so yeah def if it makes this easier.

What is the file_stream_writer here doing? I searched for it but it didn't show up in the run_adaptive thing for me. Could have missed something.

Looks like the basic coding is under control. I am happy to review, help write the tests, or help with the cmdstanr/cmdstanpy stuff for this.

@wds15
Contributor

wds15 commented Mar 21, 2021

If there's lots of outside-reduce_sum work and lots of inside-reduce_sum work, I'd hope that the outside work gets priority cause the inside work scales worse.

The work needs to get done anyway.

... the only risk there is that the TBB scheduler cuts the nested reduce_sum work into more slices than is healthy when given more cores, but I don't think that this is the case.

All that said... let's just do a basic benchmark, sure.

Seems simpler to think about to have multiple RNGs unless they aren't threadsafe or something.

The rng topic needs to be handled carefully. It should not make a difference how many threads are being used as to what the numerical result is; ideally, the numerical result should always be the same.

@SteveBronder
Collaborator Author

we do not need to deal with setting up an arena within the services. To me that makes sense. @SteveBronder , @rok-cesnovar & others ?

Other than segfaults, I guess the only thing I worry about here is I expect work done outside of reduce_sum to scale more efficiently than work in reduce_sum (because reduce_sum has the shared argument copy overhead). If there's lots of outside-reduce_sum work and lots of inside-reduce_sum work, I'd hope that the outside work gets priority cause the inside work scales worse.

I assume that would show up quickly with experiments though so probably just easier to see if it happens than worry about it too much ahead of time.

Yeah I think we would notice if we are on a bad beat here

The sample_writers and diagnostic_writers come in as a vector of streams, whereas the current impl takes a writer for each stream. Is that fine?

So there's a separate output file for each chain? Question there is what would the interface look like.

The other option would be to do some sort of basename or automatic name-rewriting. So if the user specified:

I like this option

output file="myfile.csv"

They would get myfile.csv.0, myfile.csv.1, etc. or something.

Though I think I'd just prefer myfile_0.csv, just so things that import csvs still know it's a csv.
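A hypothetical helper for that naming scheme (chain_output_name is made up for illustration):

#include <cstddef>
#include <string>

// "myfile.csv", 0 -> "myfile_0.csv"; appends the id if there is no extension
std::string chain_output_name(const std::string& base, std::size_t chain_id) {
  const std::size_t dot = base.rfind('.');
  if (dot == std::string::npos) {
    return base + "_" + std::to_string(chain_id);
  }
  return base.substr(0, dot) + "_" + std::to_string(chain_id)
         + base.substr(dot);
}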

My guess is we should just have a meeting with @mitzimorris and @rok-cesnovar and talk that one through. I like the name re-writing which looks like what is being done in the prototyping. There's probably also some extra output that should be written in the not-quite-csv output headers.

Agree!

It looks like this is something that can remain compatible with the single chain interface which is neat.

Yep!

Seems simpler to think about to have multiple RNGs unless they aren't threadsafe or something. Like, if you wanted to reproduce a calculation and you only wanted to run one thread, then no need to reproduce in parallel.

So the PRNG thing: I think we will be fine if we set the seed to be just user_seed + chain_num for each chain. I might be overreacting, but I've read in the past that in parallel systems you can have issues where, for certain PRNGs, you can end up with overlap between the streams of random numbers (since they are pseudo RNGs). From here it seems the period of the ecuyer1988 prng is about 2^61. It would be nice to not worry about this, but it's just an area I'm uncertain about. Tagging @betanalpha as he might know more about whether we should worry about this or not.

What is the file_stream_writer here doing? I searched for it but it didn't show up in the run_adaptive thing for me. Could have missed something.

So that file_stream_writer holds unique pointers to each of the file streams. I had some issues using stream_writer where somewhere a reference was being dereferenced or something was falling out of scope (stream_writer just holds its ostream by reference), so I thought the easiest solution was to wrap that ostream in a std::unique_ptr. Then we also get promises that no one ever copies that thing.
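For reference, a simplified sketch of the shape of that writer (not the exact class in this PR):

#include <memory>
#include <ostream>
#include <utility>

class file_stream_writer {
 public:
  explicit file_stream_writer(std::unique_ptr<std::ostream>&& output)
      : output_(std::move(output)) {}
  file_stream_writer(file_stream_writer& other) = delete;    // no copies, ever
  file_stream_writer(file_stream_writer&& other) = default;  // moves are fine

 private:
  std::unique_ptr<std::ostream> output_;  // owns the stream, freed exactly once
};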

Looks like the basic coding is under control. I am happy to review, help write the tests, or help with the cmdstanr/cmdstanpy stuff for this.

Much appreciated! I think at this point I still need to figure out some patterns etc. that we can probs hash out over a call. But besides that I think the testing etc. I can handle for now and can holler otherwise

The rng topic needs to be handled carefully. It should not make a difference how many threads are being used as to what the numerical result is; ideally, the numerical result should always be the same.

We can def have reproducibility, no sweat. The main issue is just making sure we have good random numbers (imagine spinning up 8 chains where each chain has correlated random numbers :-x).

I got a test up and running, but running the address sanitizer over it I'm seeing leaks happening inside of AutodiffStackSingleton after the test finishes. Full output below, but it seems like it's not destroying the autodiff stack at the end of the program. But do we even want it to delete that? It just gets tossed once the program is done. Should we do something that makes sure that once the global ad_tape_observer destructor is called at the end of the program those ChainableStacks are also removed? Or is this some other issue?

https://gist.github.com/SteveBronder/7c2b2c371d4ce805fbd301fca32c088f

@rok-cesnovar
Member

I agree that the user should just set the max number of cores to use (threads to spawn) and let us deal with how that is used.

@rok-cesnovar
Member

I am a bit confused here. The name of the PR is "Parallel Adaptive Sampler". Does that mean the first step of multi-chain in services will already include the campfire-communication-between-chains-in-warmup things? I think step 1 should be just multiple chains in services. I am probably overthinking the title :)

@SteveBronder
Collaborator Author

No, sorry, to be clear: this is just running n adaptation samplers in parallel with no communication. This does the "simpler" thing of multiple chains in services.

@SteveBronder
Collaborator Author

Hmm, running 3 chains with two threads also gives some nasty stuff

https://gist.github.com/SteveBronder/b414e37cb081ba2be7cb5da512557c7c

@SteveBronder
Collaborator Author

Actually, the above may be related to stan-dev/math#2428

@SteveBronder
Collaborator Author

Okay, more confusion: if I compile that test with

STAN_THREADS=true
TBB_LIBRARIES=tbb tbbmalloc tbbmalloc_proxy

Then run the test with

STAN_NUM_THREADS=2 ./runTests.py -j30 ./src/test/unit/services/util/run_adaptive_sampler_parallel_test.cpp

I no longer get the vptr undefined error and just get the error about the end of the application not cleaning up memory.

@wds15
Contributor

wds15 commented Mar 21, 2021

Can you use an external, up-to-date version of oneTBB? That should be possible with our makefiles and hopefully resolves the cleanup issue.

@SteveBronder
Collaborator Author

I'll try that later today or tomorrow. But even if that solves it, folks will still not be able to use this because RcppParallel is on the old version, right? Tbh I wonder if we could just submit oneTBB to CRAN as a package that just gives C++ access to an updated version of oneTBB.

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.42 3.44 0.99 -0.72% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.99 -0.89% slower
eight_schools/eight_schools.stan 0.11 0.11 0.97 -2.88% slower
gp_regr/gp_regr.stan 0.16 0.16 1.03 2.98% faster
irt_2pl/irt_2pl.stan 5.32 5.31 1.0 0.17% faster
performance.compilation 90.53 88.83 1.02 1.88% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.9 8.97 0.99 -0.88% slower
pkpd/one_comp_mm_elim_abs.stan 29.89 30.01 1.0 -0.41% slower
sir/sir.stan 129.01 140.28 0.92 -8.74% slower
gp_regr/gen_gp_data.stan 0.04 0.04 1.03 2.94% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 3.09 3.1 1.0 -0.02% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.38 0.38 0.99 -0.64% slower
arK/arK.stan 1.93 1.93 1.0 -0.03% slower
arma/arma.stan 0.64 0.8 0.8 -24.28% slower
garch/garch.stan 0.51 0.65 0.78 -28.29% slower
Mean result: 0.968134079415

Jenkins Console Log
Blue Ocean
Commit hash: 3ebbc5c


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@SteveBronder
Collaborator Author

Ugh. So I'd like to just bump up if we can; then, if that's possible, we can talk to the RcppParallel folks about whether they should update, whether we should make a new library, etc. But it looks like the new oneTBB must use cmake, so we will have that as a dependency as well. It feels onerous to ask users to supply their own version of tbb, though I'm not sure what's normal here. I have a branch of math below with the new tbb version, but I need to sort out how to hook it up to our makefile stuff. Maybe it's just time to use cmake?

https://github.com/stan-dev/math/compare/update/onetbb

@SteveBronder
Collaborator Author

Pinging @hsbadr as I feel like he might have a better idea of what route to take here

@hsbadr
Member

hsbadr commented Mar 22, 2021

@SteveBronder RcppParallel does support an external TBB library using the TBB_LIB and TBB_INC environment variables (https://github.com/RcppCore/RcppParallel#intel-tbb), just like Stan's math (https://github.com/stan-dev/math#intel-tbb). You need to (force) reinstall RcppParallel after correctly setting the environment variables.

So, we may remove the TBB source code from math (make it an external dependency by default). This will ultimately fix the conflicting TBB headers (with system libtbb-dev, RcppParallel's TBB, ... etc) where Stan can explicitly declare the supported versions of TBB that can also be adopted by RcppParallel.

We could use external oneTBB binaries (from https://github.com/oneapi-src/oneTBB/releases), ship oneTBB binaries in math to avoid cmake for now, update the internal TBB source code and the build system in https://github.com/stan-dev/math/tree/update/onetbb, or create a new oneTBB R package. Then, we can give instructions to reinstall RcppParallel with TBB_LIB and TBB_INC set. I haven't tested external oneTBB on non-Unix targets yet.

If/when needed, we can maintain code compatibility via TBB_INTERFACE_VERSION (from oneapi/tbb/version.h in oneTBB, or tbb/tbb_stddef.h in the old versions/interface of TBB) which is currently controlled by TBB_INTERFACE_NEW since we have only two conflicting (old and new) TBB interfaces.
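For example, a sketch of such a compatibility check (the >= 12000 cutoff for the oneTBB interface is an assumption here, not a tested value):

#if __has_include(<oneapi/tbb/version.h>)
#include <oneapi/tbb/version.h>  // oneTBB
#else
#include <tbb/tbb_stddef.h>      // old TBB, also defines TBB_INTERFACE_VERSION
#endif

#if TBB_INTERFACE_VERSION >= 12000
// oneTBB interface: use the task_arena-based APIs
#else
// classic TBB interface: task_scheduler_init and friends still exist
#endif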

I haven't reviewed the code in this PR yet, but I'll take a look ASAP.

@wds15
Contributor

wds15 commented Mar 22, 2021

So, we may remove the TBB source code from math (make it an external dependency by default). This will ultimately fix the conflicting TBB headers (with system libtbb-dev, RcppParallel's TBB, ... etc) where Stan can explicitly declare the supported versions of TBB that can also be adopted by RcppParallel.

So far we have always kept together all the libraries required to get Stan running. I don't think we would want to move away from that.

@SteveBronder Maybe things are not so desperate. Either we can work around the issue, or maybe this problem is not really an issue as things get cleaned up anyway. Let's see.

To correct what I said earlier: if we start 4 chains with 4 threads, then any nested reduce_sum should not be given priority. We then still want the 4 chains to run in parallel instead of running 1 chain at a time with reduce_sum using 4 cores simultaneously. In order to test this, I suggest that we code up a version of this thing where the different chains are seeded with independent RNGs but run in sync. This way we fire up 4 copies of the same work and we should get the results back in basically the same time. For the final release it should not be possible to fire up 4 exactly equal chains, of course.

@t4c1
Collaborator

t4c1 commented Mar 22, 2021

We'd probably need to be able to specify the seed as a list or something now.

We can use a single seed if we are smart about it: initialize one PRNG with the user-provided seed and use that to generate seeds for the PRNGs in the threads. Obviously this kind of initialization must use a different PRNG than the threads do.
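A sketch of that idea with the Boost generators Stan already ships (make_chain_rngs is a made-up name):

#include <boost/random/additive_combine.hpp>  // boost::ecuyer1988
#include <boost/random/mersenne_twister.hpp>
#include <cstddef>
#include <vector>

// Seed one master PRNG from a *different* family with the user seed,
// then draw one seed per chain from it.
std::vector<boost::ecuyer1988> make_chain_rngs(unsigned int user_seed,
                                               std::size_t n_chains) {
  boost::random::mt19937 seed_gen(user_seed);
  std::vector<boost::ecuyer1988> rngs;
  rngs.reserve(n_chains);
  for (std::size_t i = 0; i < n_chains; ++i) {
    rngs.emplace_back(seed_gen());
  }
  return rngs;
}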

wrt RNGs, do we need to have a special PRNG or is it safe to just have multiple RNGs? Should we use a different RNG with a longer cycle and set each seed to the user-given initial + 2^chain_id or something?

I might be overreacting, but I've read in the past that in parallel systems you can have issues where, for certain PRNGs, you can end up with overlap between the streams of random numbers (since they are pseudo RNGs). From here it seems the period of the ecuyer1988 prng is about 2^61.

For usage on a single CPU, a period of 2^61 seems adequate. For example, if we run 200 threads and generate 10^7 random numbers per thread, that gives a probability of about 10^-6 that any streams overlap. With more threads (thinking about GPUs or clusters) we should use a PRNG with a longer period.

I already did some testing of PRNGs in parallel (with focus on GPUs, but it should be applicable here): https://link.springer.com/article/10.1007/s11227-019-02756-2

@rok-cesnovar
Member

  • re RNGs:

    Is starting 4 chains in TBB any different than starting 4 chains in separate processes, which is what the interfaces do now? Based on my understanding it is not. The services layer takes in the seed and the chain id and then advances the RNG by a discard stride of 2^50 times the chain id (see https://github.com/stan-dev/stan/blob/develop/src/stan/services/util/create_rng.hpp#L27 and the sketch after this list). So either we have been doing this wrong all along and we need to fix it, or we just use this.

  • re downstream libraries and supplying them:

    I agree with @wds15: we should not move away from supplying the base libraries ourselves. We have to remember that most of our users are not C++ developers or even programmers, and they do not want to deal with package managers, paths and whatnot. Controlling the versions of downstream libraries also vastly simplifies our testing. We could move them out of git and just supply them in releases, but I doubt that makes a huge difference.

    We should enable (or at least make it somewhat easy for) users to plug in their own downstream library installation (like @hsbadr did for TBB) if they want to, but at that point they are on their own in terms of whether things will work, as we do not (and can't) guarantee that it will.

    It's fine if we don't supply libraries that only target power users or advanced users, but the libraries that are needed for the basic stuff we should supply.
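For reference, a simplified sketch of that create_rng approach (see the linked file for the exact stride arithmetic):

#include <boost/random/additive_combine.hpp>
#include <cstdint>

// One seed for everyone; chain i skips ahead by i * 2^50 draws so the
// per-chain streams (practically) never overlap.
boost::ecuyer1988 create_rng(unsigned int seed, unsigned int chain) {
  static constexpr std::uintmax_t DISCARD_STRIDE
      = static_cast<std::uintmax_t>(1) << 50;
  boost::ecuyer1988 rng(seed);
  rng.discard(DISCARD_STRIDE * chain);
  return rng;
}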

Q for @hsbadr: What is the last version of TBB/oneTBB that we can just plug into Math (so just replace the sources and expect things to work with our current makefiles)? Maybe we should move to that and help RcppParallel move to that as well, so we have a bit of extra time to figure out what to do next.

@SteveBronder
Collaborator Author

And for replication purposes I have in my make/local

CXX=clang++-11
CXXFLAGS+= -DTBB_INTERFACE_NEW -ggdb3 -fsanitize=address,undefined,leak -fno-omit-frame-pointer
O=g
STAN_THREADS=true

and am doing

cd lib/stan_math
git pull
git checkout feature/update-tbb2020-3
cd  ../../
make clean-all
# cool and good
STAN_NUM_THREADS=3 ./runTests.py -j25 ./src/test/unit/services/util/run_adaptive_sampler_parallel_test.cpp
# bad :-(
STAN_NUM_THREADS=4 ./runTests.py -j25 ./src/test/unit/services/util/run_adaptive_sampler_parallel_test.cpp

@SteveBronder
Collaborator Author

Alright, so after looking around I found some docs and this thread which describe the behavior here, and I think there's enough evidence to say that, for the version that uses the new tbb, this is a leak but not a dangerous one.

Cleanup for the ad tape in each thread normally happens from the global ad_tape_observer's on_scheduler_exit() method. But the tbb docs and the above thread state that

A process does not wait for the worker threads to clean up, and can terminate before on_scheduler_exit is invoked.

So what's happening is that before on_scheduler_exit() can be called for a thread, the program is ending, and the leak detector gets confused because it sees memory that exists at the end of the program but is never cleaned up. Because the threads are never cleaned up it throws these leak warnings, but since the main thread has terminated there's no reason to care whether the other threads are cleaned up. Since the tests run fine and the leak only happens after the tests end, I feel pretty confident that this is not a true leak.

But the other memory error with our current version of the tbb, about the vptr being incorrect, I think is an actual bug inside that version of the tbb. So I think if we bump the tbb version to 2020.3 (which still uses the makefiles, so no cmake) and use TBB_INTERFACE_NEW by default, then everything here can be okie dokie! I have a PR to do that which I can open up tomorrow.

@SteveBronder
Collaborator Author

One issue with the model PRNG generators is that they control both the transformed data and the generated quantities blocks. While there are cases where a common generator for the transformed data makes sense, there are fewer where a common generator for generated quantities makes sense. Indeed, I would rather have different generators across the model replications than a common one: it's safer, more appropriate to simulation calibration methods, and if anyone does want common data it's easy enough to split the process up into two steps, generating common data with the fixed_param sampler and then passing that to the chains.

So the good news is that for the generated quantities the signature actually supplies an RNG, so for that the current impl I have with multiple PRNGs should be alright. Multiple PRNGs for the transformed data might have to come some other time. One nice benefit here is that the single model is shared across all the processes, which cuts out memory overhead, but if we want multiple PRNGs for the transformed data then we'll also have to make multiple models to have different transformed data. Maybe we could have an option at the cmdstan level to support SBC which would then give each thread its own model? Though I think we should do that another time, as the stuff here so far seems like a nice layup. (I need to read the stuff you and @t4c1 posted, but so far I think my concerns were overblown and we should be okay.)

@hsbadr
Member

hsbadr commented Mar 24, 2021

is it pretty reasonable to apply the windows patches for RcppParallel?

I think so. Just need to test it on Windows.

If so it might be nice to bump up the version. Then in the future we can talk about how to do a hard deprecation wrt task_scheduler_init and newer versions of tbb

That sounds like a great transition to the new TBB interface, with 2020.3.

@SteveBronder
Collaborator Author

Awesome! Just put up a pr with the new tbb version in math

@hsbadr
Member

hsbadr commented Mar 24, 2021

Awesome! Just put up a pr with the new tbb version in math

Great! It'd be optimal to update TBB in RcppParallel to 2020.3 as well.

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.43 3.34 1.03 2.76% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.99 -0.68% slower
eight_schools/eight_schools.stan 0.11 0.11 1.01 0.94% faster
gp_regr/gp_regr.stan 0.16 0.16 1.01 0.71% faster
irt_2pl/irt_2pl.stan 5.38 5.42 0.99 -0.84% slower
performance.compilation 91.62 89.29 1.03 2.54% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.59 8.72 0.99 -1.46% slower
pkpd/one_comp_mm_elim_abs.stan 29.38 31.48 0.93 -7.15% slower
sir/sir.stan 121.37 130.26 0.93 -7.33% slower
gp_regr/gen_gp_data.stan 0.03 0.03 0.99 -1.33% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 3.2 3.01 1.06 5.9% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.38 0.4 0.96 -4.66% slower
arK/arK.stan 1.99 2.57 0.77 -29.09% slower
arma/arma.stan 0.94 0.64 1.46 31.56% faster
garch/garch.stan 0.51 0.56 0.91 -10.08% slower
Mean result: 1.00371713763

Jenkins Console Log
Blue Ocean
Commit hash: 5c7b0bd


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@SteveBronder SteveBronder changed the title [WIP] Parallel Adaptive Sampler Parallel Run Adaptive Sampler Mar 24, 2021
@SteveBronder SteveBronder self-assigned this Mar 24, 2021
@SteveBronder SteveBronder marked this pull request as ready for review March 24, 2021 22:35
@SteveBronder
Collaborator Author

Alright I think this is ready for review! I'm going to open up another PR on top of this one soon that has the signature for nuts with a diagonal metric

@bbbales2
Member

bbbales2 commented Mar 24, 2021

another PR on top

Is this a cmdstan PR or is there something else in stan that needs to be changed?

Edit: Also does this need to wait on the TBB math thing: stan-dev/math#2447 or can it go ahead (sorry all the TBB discussion in this thread was a whirlwind and I did not keep up lol)

@SteveBronder
Collaborator Author

SteveBronder commented Mar 24, 2021

This is just a Stan PR, once #3033 goes in we can do the cmdstan stuff.

Also does this need to wait on the TBB math thing: stan-dev/math#2447 or can it go ahead (sorry all the TBB discussion in this thread was a whirlwind and I did not keep up lol)

It was a whirlwind lol. For review, no, but it would be a good idea to wait to merge until the tbb PR is in.

@bbbales2
Member

This is just a Stan PR, once #3033 goes in

Okay so this is a services function, and then we'll do a samplers pull, and then we'll move on to cmdstan?

@SteveBronder
Collaborator Author

Yep! #3033 is the sampler pull

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.39 3.37 1.01 0.86% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 1.03 2.58% faster
eight_schools/eight_schools.stan 0.11 0.11 1.02 2.13% faster
gp_regr/gp_regr.stan 0.16 0.16 1.0 0.32% faster
irt_2pl/irt_2pl.stan 5.38 5.35 1.01 0.58% faster
performance.compilation 90.53 88.96 1.02 1.73% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.6 8.72 0.99 -1.36% slower
pkpd/one_comp_mm_elim_abs.stan 30.05 29.84 1.01 0.7% faster
sir/sir.stan 120.86 128.03 0.94 -5.94% slower
gp_regr/gen_gp_data.stan 0.03 0.04 0.97 -3.25% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 3.24 3.01 1.08 6.98% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.38 0.39 0.98 -2.42% slower
arK/arK.stan 1.99 2.56 0.78 -28.77% slower
arma/arma.stan 0.94 0.63 1.49 32.72% faster
garch/garch.stan 0.51 0.57 0.9 -11.37% slower
Mean result: 1.01346860104

Jenkins Console Log
Blue Ocean
Commit hash: 028bb51


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@bbbales2 (Member) left a comment

Left some comments. I'm gonna go read the sampler PR. Gotta piece more of this together.

: output_(std::move(output)), comment_prefix_(comment_prefix) {}

file_stream_writer();
file_stream_writer(file_stream_writer& other) = delete;
Member

What do these two lines mean?:

file_stream_writer();
file_stream_writer(file_stream_writer& other) = delete;

Because the output_ is a unique ptr, we don't allow copy constructors, only move constructors, or something?

Member

Oh okay I looked back at what you previously wrote:

So that file_stream_writer holds unique pointers to each of the file streams. I had some issues using stream_writer where somewhere a reference was being dereferenced or something was falling out of scope (stream_writer just holds its ostream by reference), so I thought the easiest solution was to wrap that ostream in a std::unique_ptr. Then we also get promises that no one ever copies that thing.

It seems cleaner to replace stream_writer with something that behaves better. Presumably the other use cases of stream_writer can still work.

Collaborator Author

I'd like to replace stream_writer, but I was having trouble with backwards compatibility. I think once cmdstan uses file_stream_writer we can remove stream_writer.

Member

Is it difficult because it's one of these cross-repo things where we'd need to change cmdstan and stan at the same time?

Collaborator Author

Yeah that's the main issue

[num_warmup, num_samples, num_thin, refresh, save_warmup, &samplers,
 &model, &rngs, &interrupt, &logger, &sample_writers, &cont_vectors,
 &diagnostic_writers](const tbb::blocked_range<size_t>& r) {
  for (size_t i = r.begin(); i != r.end(); ++i) {
Member

For now can we just call run_adaptive_sampler n_chains times in parallel like:

run_adaptive_sampler(samplers[i], model, cont_vectors[i], num_warmup,
                         num_samples, num_thin, refresh, save_warmup, rngs[i],
                         interrupt, logger, sample_writers[i],
                         diagnostic_writers[i]);

Long term these could behave differently (if we did custom adaptation or something) but they seem the same for now? Or is there some difference I'm missing?

Collaborator Author

So, I didn't realize this until I started working on #3033 lol. Yeah, I think that would work? Let me try to do that in #3033. Though I will say I kind of would like to keep this code, if only as an example of how to do parallel_for() at this level, which is what the auto warmup size thing will need to do. For the auto warmup stuff you mostly just need to take this code, break it up into two parallel_for loops, then add another loop around the warmup parallel_for() to do the warmup in chunk sizes.

Collaborator Author

Alright, I just ran this locally and it gives back good answers. If we are going to cut this code out, then what I can actually do is remove the changes to run_adaptive_sampler in #3033, just do the simpler parallel loop, and then close this PR.

for (int m = 0; m < num_iterations; ++m) {
  callback();

  if (refresh > 0
      && (start + m + 1 == finish || m == 0 || (m + 1) % refresh == 0)) {
    int it_print_width = std::ceil(std::log10(static_cast<double>(finish)));
    std::stringstream message;
    if (n_chain > 0) {
      message << "Chain [" << (n_chain + 1) << "]";
    }
Member

This makes sense but we can run it by interfaces.

Collaborator Author

Yeah, anyone that is depending on the current behavior shouldn't be affected by this, so I think it's fine.

callbacks::logger& logger,
std::vector<SampT>& sample_writers,
std::vector<DiagnoseT>& diagnostic_writers,
size_t n_chain) {
Member

So the model, interrupt, and logger bits need to be threadsafe.

Is every function in the model (other than constructors/destructors) thread safe? I think this threading stuff will mess up the profiles, so we gotta figure out what to do there.

For the interrupts and loggers, I see that the implementations in the test had to change but shouldn't there be changes in the main source too? Or were these already threadsafe or something?

I see that interrupt is for the interfaces to capture keystrokes. Are the interrupts signals generated at the interfaces and transmitted in here -- or are they signals generated in the samplers and transmitted back to the interfaces?

Collaborator Author

So the logger goes to std::cout which is threadsafe (see here)

Unless sync_with_stdio(false) has been issued, it is safe to concurrently access these objects from multiple threads for both formatted and unformatted output.

The interrupts, at least in cmdstan, are actually just the base impl stan::callbacks::interrupt, which doesn't actually do anything. I'm looking at the R interrupt and it seems like that would be threadsafe.

Is every function in the model (other than constructors/destructors) thread safe?

Yep! I've looked the model class over. The model is essentially immutable after construction so it's safe to pass around.

I think this threading stuff will mess up the profiles, so we gotta figure out what to do there.

Oooh hm, idk how this will affect profiling. @rok-cesnovar, looking at the profiling gen code

      {
        profile<local_scalar_t__> profile__("testerrr", const_cast<profile_map&>(profiles__));
        
        current_statement__ = 3;
        assign(transf_testp, stan::math::exp(testp),
          "assigning variable transf_testp");
      }

I think we might need a mutex in the profile constructor here

  profile(std::string name, profile_map& profiles)
      : key_({name, std::this_thread::get_id()}) {
    {
      std::lock_guard<std::mutex> map_lock(profile_mutex);
      // idt search is thread safe
      profile_map::iterator p = profiles.find(key_);
      if (p == profiles.end()) {
        // idt insertion is thread safe either
        profiles[key_] = profile_info();
      }
    }
    // rest business as usual
  }

I see that interrupt is for the interfaces to capture keystrokes. Are the interrupts signals generated at the interfaces and transmitted in here -- or are they signals generated in the samplers and transmitted back to the interfaces?

They are called in generate_transitions, which is for the keystroke thing for R / Python / higher-level systems that need to tell the C++ to stop. It's not used in cmdstan since you can just hit Ctrl+C on a binary to tell the system to stop whatever program is running atm.
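For illustration, a hypothetical threadsafe interrupt an interface could pass in (not code from this PR):

#include <atomic>
#include <stan/callbacks/interrupt.hpp>

// The interface thread flips `stop` on a keystroke; each sampler thread
// polls it once per transition via operator().
class atomic_interrupt : public stan::callbacks::interrupt {
 public:
  void operator()() override {
    if (stop.load(std::memory_order_relaxed)) {
      // hand control back to the interface however it handles cancellation
    }
  }
  std::atomic<bool> stop{false};
};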

Contributor

Is it an option to run the profiler in only one thread, or to avoid the need for a mutex by using multiple model copies? We do not want a mutex unless we have to.

Member

I think we want separate profiles for each thread if we want profiling to act like it currently does (a separate profile for each chain). Is Stan data actually copied into the model? Or is it references? I guess if it's references then we could actually have lots of models? But maybe that is undesirable.

Member

Do we need a mutex? The thread id is a key in the storage map and each thread is thus guaranteed to always access only its "own" profiles. std::map is supposed to be fine in that case: https://stackoverflow.com/questions/15067160/stdmap-thread-safety and other resources if you google "std::map thread safety".

We could switch to using tbb::concurrent_hash_map, though I think that is unnecessary.

Collaborator Author

The profiler itself runs in one thread; the problem is that in the above constructor the profile_map is a global std::map, and idt there's any reasonable way to get around that.

imo idt a mutex is that bad here. I think we can actually put it inside the if (p == profiles.end()), since find should be threadsafe given that it's searching for a key that also has the thread_id in it. But for insertion with profiles[key_] we do not want to be adding keys simultaneously, as we could end up in a situation where the map reallocs memory.

One solution here is to prefill the profile_map keys. Since each key is determined by a unique name and thread combination, no two threads can cause a race condition, because they write to different parts of the map. So we only need to worry about the map reallocating for new keys. For an example model like

model {
  profile("ayyy") {
    theta ~ beta(1,1);  // uniform prior on interval 0,1
  }
  profile("yooo") {
    y ~ bernoulli(theta);    
  }
}

We can have a function in the model like

void init_profile_map() {
  const auto max_threads = get_max_threads();
  for (auto&& name : std::array<const char*, 2>{"ayyy", "yooo"}) {
    // sketch: we'd really need each worker thread's id here, not the
    // calling thread's id repeated max_threads times
    for (int i = 0; i < max_threads; ++i) {
      profile_map[{name, std::this_thread::get_id()}] = empty_profile_info();
    }
  }
}

Or something like that to fill up the keys before we do any other insertions. Then we don't need to worry about data reallocation since all the keys are there.

But still, I think a mutex here is not that bad. It's short-lived and only causes locks whenever we are doing profiling (which imo is a usage pattern that is not about raw performance anyway). I think what I would like is to just put a mutex in the if for now, and then later we can do the lock-free pattern.

Collaborator Author

Oh shoot, I typed that out before Rok posted his reply. Yeah, from there it looks like it is safe to insert since each key is distinct for each thread. I'll look into this more to make sure it's fine. If I can't sort out anything conclusive, then I think moving to a tbb::concurrent_hash_map is a reasonable solution.

@wds15 (Contributor) commented Mar 25, 2021

Global variables are problematic for obvious reasons. The TBB data structures should be a good alternative... How often is this code being run? I mean, is it performance critical?

EDIT: If the std::map operations needed are supposed to be safe in this usage, then no need to switch to the TBB thing, of course.

Member

Insert into the map happens the first time a profile is registered, so once in the lifetime of a model. After that it's just read access to get the reference to the profile info. And those are separated for separate threads; no race conditions there.

If locking the insert in a mutex is something that would make us sleep more calmly, that's not a big price to pay. It does seem it's not needed.

Collaborator Author

Yeah, idt it's needed. Let's not worry about this for now; we should notice it with a pretty simple test once the cmdstan stuff is up.

@bbbales2
Member

Computer crashed mid-review here. I came back and comments were gone, so I added them again. It seems like in the end some of the comments got doubled. Sorry for the noise. I cleaned up duplicates.

@SteveBronder
Collaborator Author

fyi I'm going to close this PR in favor of #3033: because of @bbbales2's comment here I realized we actually don't need to do parallelism at this level and can just wrap run_adaptive_sampler in a parallel_for. Much simpler!
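i.e. roughly this, reusing the names from the review thread above (a sketch, not the final code):

tbb::parallel_for(
    tbb::blocked_range<size_t>(0, n_chain, 1),
    [&](const tbb::blocked_range<size_t>& r) {
      for (size_t i = r.begin(); i != r.end(); ++i) {
        // each chain is just an independent single-chain run
        run_adaptive_sampler(samplers[i], model, cont_vectors[i], num_warmup,
                             num_samples, num_thin, refresh, save_warmup,
                             rngs[i], interrupt, logger, sample_writers[i],
                             diagnostic_writers[i]);
      }
    });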
