
Add run time options for threads and gpu stuff #825

Closed
2 tasks done
SteveBronder opened this issue Feb 19, 2020 · 13 comments
Comments

@SteveBronder
Contributor

SteveBronder commented Feb 19, 2020

Summary:

It would be nice to have a parallel option so users could do something like parallel threads=4 opencl_device=0 opencl_platform=0 or something of that ilk. What should it look like?

Description:

Right now we need to set threading info as an environment variable, and the OpenCL stuff has to be known at compile time. I think at the math level we need:

  • a method to set the number of threads dynamically
  • a method in opencl_context to set the platform and device dynamically.

Then cmdstan users (and the other upstreams) can pass in the options at runtime.

The OpenCL kernels are compiled dynamically, so we'd need to set the device/platform before we do any of the kernel compilation. I think that will just happen if we do it at the cmdstan level before any of the kernels are called.
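
For context, the compile-time configuration this would replace looks roughly like this in make/local (a sketch using flag names from the GPU install docs of this era; the values are illustrative):

    # current approach: OpenCL device/platform fixed at compile time in make/local
    STAN_OPENCL=true
    OPENCL_DEVICE_ID=0
    OPENCL_PLATFORM_ID=0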

Current Version:

v2.22.0

@rok-cesnovar
Member

That would be fantastic. Let me know if you need any help or need a review.

@rok-cesnovar
Member

At least the threads part of this would be great to do for 2.24. For backward compatibility we still need to support the environment variable.

@wds15
Contributor

wds15 commented Jun 10, 2020

But we should deprecate the environment variable stuff once we have this feature.

@rok-cesnovar
Member

The OpenCL side of this was fixed, so this just requires the threading runtime option, which is blocked by stan-dev/math#1949.

@mitzimorris
Member

mitzimorris commented Dec 17, 2020

users could do something like parallel threads=4 opencl_device=0 opencl_platform=0 or something of that ilk.
What should it look like?

what exactly do you mean by "users could do something"? via CmdStan's extremely clunky argument parser?
or would users set environment variable at runtime?

@rok-cesnovar
Member

rok-cesnovar commented Dec 17, 2020

This was merged already.

“opencl device=1 platform=0” will select the device if the model was compiled appropriately.

Once we prepare things for threading downstream in Math, it should be “threads=4” instead of the clunkier environment variable.
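
A minimal sketch of that flow, assuming the usual STAN_OPENCL flag in make/local and the example bernoulli model (paths and device numbers are illustrative):

    # compile time: enable OpenCL support
    echo "STAN_OPENCL=true" >> make/local
    make examples/bernoulli/bernoulli
    # run time: pick the platform/device via the opencl argument
    ./examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.json \
        opencl device=1 platform=0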

@mitzimorris
Member

mitzimorris commented Dec 17, 2020

sorry, I'm confused here - I'm a user, I have a Stan model, I want to take advantage of threading.
what do I do at compile time?
what do I do at run time?

this needs to be explained clearly in the CmdStan manual for similarly confused and naive users like me.

@rok-cesnovar
Member

rok-cesnovar commented Dec 17, 2020

Currently (it's been this way since threading was introduced):

  • compile with STAN_THREADS
  • set an environment variable at runtime to select the number of threads (not exactly at runtime, it's just before runtime)

This has been documented in the cmdstan guide for some time and I would say it's quite clear: https://mc-stan.org/docs/2_25/cmdstan-guide/parallelization.html
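
Concretely, a minimal sketch of that current workflow (model and data paths are illustrative):

    # compile time: enable threading support
    echo "STAN_THREADS=true" >> make/local
    make examples/bernoulli/bernoulli
    # run time: select the number of threads via the environment variable
    export STAN_NUM_THREADS=4   # -1 uses all available cores
    ./examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.json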

Once we figure out a way to close this issue:

  • compile with STAN_THREADS
  • select the number of threads at runtime via cmdstan argument:
    ./bernoulli sample data file=... threads=4

So instead of an environment variable it's a cmdstan argument. The environment variable approach will be deprecated, not removed.

We haven't gotten to the stage where this is doable, but yes, this will be documented once it's closed.

For OpenCL (GPU stuff), a chapter/section for at least the Cmdstan guide is in the works; it is part of the checklist here and will be done for the release.

@mitzimorris
Member

This has been documented in the cmdstan guide for some time and I would say it's quite clear: https://mc-stan.org/docs/2_25/cmdstan-guide/parallelization.html

alas, not clear enough - that's a documentation issue - what's needed are a few more examples on a case-by-case basis, cases being:

  • user has GPU processing available, wants speedup
  • user has GPU processing available, wants to use reduce_sum
  • user doesn't have GPU processing, wants to use reduce_sum

is the threads argument only relevant for models which use reduce_sum? MPI?

when editing together the CmdStan manual, I thought I understood parallelization - when Bob wanted to run CmdStan to use reduce_sum, I said "just look in the manual" - and he found it confusing, as did I, having not used this feature and therefore having forgotten whatever I understood back when I put this in the manual. mea culpa, as usual, but I don't think I'm alone.

@rok-cesnovar
Member

what's needed are a few more examples

I don't think that is something that should be in the Cmdstan Guide. In my opinion, the role of the Cmdstan Guide in this case is to answer the question: "I have a model that can use parallelization. How do I use that with Cmdstan?" And I think the link answers that concisely and clearly.

Answers to questions like:

  • what functions support parallelization?
  • how do I parallelize my model?
  • is it worth it to parallelize my model given X, Y and Z?
  • what speedups can I expect?
  • what type of backend support does Stan have for parallelization?
  • what is reduce_sum and how do I use it?
  • what is map_rect and how do I use it?

should be part of Stan's User guide and Functions reference. These are not limited to Cmdstan. Some of the answers to these questions are already there, some are scattered in other places like case studies, and some are missing.

The answers for the OpenCL side of things will be added in the form of a section or a chapter before the release and will hopefully address all of these questions. GPU and reduce_sum are not really related. More will be explained in the docs.

For now I can just give a short description for anyone coming here at any later point via Google/Github search:

  • do you have a multi-core CPU and want to parallelize your model?

Use reduce_sum or map_rect in your model. The former is recommended as it's easier to use. Then compile your model with STAN_THREADS and set the number of threads via the environment variable STAN_NUM_THREADS.

  • do you have a cluster and want to run your model parallel on that cluster?

Use map_rect and MPI, which can be enabled with STAN_MPI. The number of MPI processes is set with the -n flag of the MPI launcher (mpirun, mpiexec); a minimal command sketch follows at the end of this list.
There are additional installation requirements for MPI.

  • do you want to use a GPU to parallelize your model?

For now, refer to the provisional and unofficial install documentation: https://github.com/bstatcomp/stan_gpu_install_docs
Also see the GLM function signatures documentation. New docs will be available for 2.26.
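
A minimal sketch of the MPI route from the second item above (launcher, process count, and paths are illustrative; see the MPI installation instructions for prerequisites):

    # compile time: enable MPI support (requires a working MPI installation)
    echo "STAN_MPI=true" >> make/local
    make examples/bernoulli/bernoulli
    # run time: the launcher's -n flag sets the number of MPI processes
    mpirun -n 4 ./examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.json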

@mitzimorris
Member

And I think the link answers that concisely and clearly.

there is always a conflict between making things crystal clear and being concise.
because the only make example showed how to use both reduce_sum plus GPUs, it led to the question: do you have to have both?
3 examples instead of one would clear up that confusion.

also - looking at that link again now - https://mc-stan.org/docs/2_25/cmdstan-guide/parallelization.html - unfinished edit? - the sentence that ends with:

in which case, the call to Make

the next paragraph says the same thing?

@rok-cesnovar
Member

because the only make example showed how to use both reduce_sum plus GPUs

Ok yeah, re-read it again. We are maybe trying to do too much on that page. Not sure why we refer to both threading and OpenCL at the same time before we even introduce basic threading. It's unlikely anyone will use both. This page should just be about threading. Will fix that once we have a separate GPU section.

@rok-cesnovar
Member

We can close this now with the new num_threads cmdstan argument added.
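
For reference, with that argument the runtime selection looks something like this (assuming the model was compiled with STAN_THREADS; paths are illustrative):

    ./bernoulli sample data file=bernoulli.data.json num_threads=4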
