
Add run time options for threads and gpu stuff #825

Closed
2 tasks done
SteveBronder opened this issue Feb 19, 2020 · 13 comments
Comments

@SteveBronder
Contributor

SteveBronder commented Feb 19, 2020

Summary:

It would be nice to have a parallel option so users could do something like parallel threads=4 opencl_device=0 opencl_platform=0 or something of that ilk. What should it look like?

Description:

Right now we need to set threading info as an environment variable, and the OpenCL stuff has to be known at compile time. I think at the math level we need:

  • a method to set the number of threads dynamically
  • a method in opencl_context to set the platform and device dynamically.

Then cmdstan users (and the other upstreams) can pass in the options at runtime.

The OpenCL kernels are compiled dynamically, so we'd need to set the device/platform before we do any of the kernel compilation. I think that will just happen if we do it at the cmdstan level before any of the kernels are called.
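
For context, the compile-time configuration this would replace looks roughly like this in make/local (a sketch using flag names from the GPU install docs of this era; the values are illustrative):

    # current approach: OpenCL device/platform fixed at compile time in make/local
    STAN_OPENCL=true
    OPENCL_DEVICE_ID=0
    OPENCL_PLATFORM_ID=0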

Current Version:

v2.22.0

@rok-cesnovar
Member

That would be fantastic. Let me know if you need any help or need a review.

@rok-cesnovar
Member

At least the threads part of this would be great to do for 2.24. For backward compatibility we still need to support the environment variable.

@wds15
Contributor

wds15 commented Jun 10, 2020

But we should deprecate the environment variable stuff once we have this feature.

@rok-cesnovar
Member

The OpenCL side of this was fixed, so this just requires the threading runtime option, which is blocked by stan-dev/math#1949.

@mitzimorris
Member

mitzimorris commented Dec 17, 2020

users could do something like parallel threads=4 opencl_device=0 opencl_platform=0 or something of that ilk.
What should it look like?

what exactly do you mean by "users could do something"? via CmdStan's extremely clunky argument parser?
or would users set environment variable at runtime?

@rok-cesnovar
Member

rok-cesnovar commented Dec 17, 2020

This was merged already.

“opencl device=1 platform=0” will select the device if the model was compiled appropriately.

Once we prepare things for threading downstream in Math, it should be “threads=4” instead of the clunkier environment variable.
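
A minimal sketch of that flow, assuming the usual STAN_OPENCL flag in make/local and the example bernoulli model (paths and device numbers are illustrative):

    # compile time: enable OpenCL support
    echo "STAN_OPENCL=true" >> make/local
    make examples/bernoulli/bernoulli
    # run time: pick the platform/device via the opencl argument
    ./examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.json \
        opencl device=1 platform=0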

@mitzimorris
Member

mitzimorris commented Dec 17, 2020

sorry, I'm confused here - I'm a user, I have a Stan model, I want to take advantage of threading.
what do I do at compile time?
what do I do at run time?

this needs to be explained clearly in the CmdStan manual for similarly confused and naive users like me.

@rok-cesnovar
Member

rok-cesnovar commented Dec 17, 2020

Currently (it's been this way since threading was introduced):

  • compile with STAN_THREADS
  • set an environment variable at runtime to select the number of threads (not exactly at runtime, it's just before runtime)

This has been documented in the cmdstan guide for some time and I would say it's quite clear: https://mc-stan.org/docs/2_25/cmdstan-guide/parallelization.html
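
Concretely, a minimal sketch of that current workflow (model and data paths are illustrative):

    # compile time: enable threading support
    echo "STAN_THREADS=true" >> make/local
    make examples/bernoulli/bernoulli
    # run time: select the number of threads via the environment variable
    export STAN_NUM_THREADS=4   # -1 uses all available cores
    ./examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.json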

Once we figure out a way to close this issue:

  • compile with STAN_THREADS
  • select the number of threads at runtime via cmdstan argument:
    ./bernoulli sample data file=... threads=4

So instead of an environment variable it's a cmdstan argument. The environment variable approach will be deprecated, not removed.

We haven't gotten to the stage where this is doable, but yes, this will be documented once it's closed.

For OpenCL (GPU stuff), a chapter/section for at least the Cmdstan guide is in the works; it is part of the checklist here and will be done for the release.

@mitzimorris
Member

This has been documented in the cmdstan guide for some time and I would say it's quite clear: https://mc-stan.org/docs/2_25/cmdstan-guide/parallelization.html

alas, not clear enough - that's a documentation issue - what's needed are a few more examples on a case-by-case basis, cases being:

  • user has GPU processing available, wants speedup
  • user has GPU processing available, wants to use reduce_sum
  • user doesn't have GPU processing, wants to use reduce_sum

is the threads argument only relevant for models which use reduce_sum? MPI?

when editing together the CmdStan manual, I thought I understood parallelization - when Bob wanted to run CmdStan to use reduce_sum, I said "just look in the manual" - and he found it confusing, as did I, having not used this feature and therefore having forgotten whatever I understood back when I put this in the manual. mea culpa, as usual, but I don't think I'm alone.

@rok-cesnovar
Member

what's needed are a few more examples

I don't think that is something that should be in the Cmdstan Guide. In my opinion, the role of the Cmdstan Guide in this case is to answer the question: "I have a model that can use parallelization. How do I use that with Cmdstan?" And I think the link answers that concisely and clearly.

Answers to questions like:

  • what functions support parallelization?
  • how do I parallelize my model?
  • is it worth it to parallelize my model given X, Y and Z?
  • what speedups can I expect?
  • what type of backend support does Stan have for parallelization?
  • what is reduce_sum and how do I use it?
  • what is map_rect and how do I use it?

should be part of Stan's User guide and Functions reference. These are not limited to Cmdstan. Some of the answers to these questions are already there, some are scattered in other places like case studies, and some are missing.

The answers for the OpenCL side of things will be added in the form of a section or a chapter before the release and will hopefully address all of these questions. GPU and reduce_sum are not really related. More will be explained in the docs.

For now I can just give a short description for anyone coming here at any later point via Google/Github search:

  • do you have a multi-core CPU and want to parallelize your model?

Use reduce_sum or map_rect in your model. The former is recommended as it's easier to use. Then compile your model with STAN_THREADS and set the number of threads via the environment variable STAN_NUM_THREADS.

  • do you have a cluster and want to run your model parallel on that cluster?

Use map_rect and MPI, which can be enabled with STAN_MPI. The number of MPI processes is set with the -n flag of the MPI launcher (mpirun, mpiexec); a minimal command sketch follows at the end of this list.
There are additional installation requirements for MPI.

  • do you want to use a GPU to parallelize your model?

For now, refer to the provisional and unofficial install documentation: https://github.com/bstatcomp/stan_gpu_install_docs
Also see the GLM function signatures documentation. New docs will be available for 2.26.
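
A minimal sketch of the MPI route from the second item above (launcher, process count, and paths are illustrative; see the MPI installation instructions for prerequisites):

    # compile time: enable MPI support (requires a working MPI installation)
    echo "STAN_MPI=true" >> make/local
    make examples/bernoulli/bernoulli
    # run time: the launcher's -n flag sets the number of MPI processes
    mpirun -n 4 ./examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.json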

@mitzimorris
Member

And I think the link answers that concisely and clearly.

there is always a conflict between making things crystal clear and being concise.
because the only make example showed how to use both reduce_sum plus GPUs, it led to the question: do you have to have both?
3 examples instead of one would clear up that confusion.

also - looking at that link again now - https://mc-stan.org/docs/2_25/cmdstan-guide/parallelization.html - unfinished edit? - the sentence that ends with:

in which case, the call to Make

the next paragraph says the same thing?

@rok-cesnovar
Member

because the only make example showed how to use both reduce_sum plus GPUs

Ok yeah, re-read it again. We are maybe trying to do too much on that page. Not sure why we refer to both threading and OpenCL at the same time before we even introduce basic threading. It's unlikely anyone will use both. This page should just be about threading. Will fix that once we have a separate GPU section.

@rok-cesnovar
Member

We can close this now with the new num_threads cmdstan argument added.
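
For reference, with that argument the runtime selection looks something like this (assuming the model was compiled with STAN_THREADS; paths are illustrative):

    ./bernoulli sample data file=bernoulli.data.json num_threads=4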
