# Parallelization {#parallelization}
Stan provides three ways of parallelizing execution of a Stan model:
- multi-threading with Intel Threading Building Blocks (TBB),
- multi-processing with Message Passing Interface (MPI) and
- manycore processing with OpenCL.
## Multi-threading with TBB
In order to exploit multi-threading in a Stan model, the model must be
rewritten to use the `reduce_sum` or `map_rect` functions. For instructions
on how to rewrite Stan models to use these functions, see [the Stan User's Guide chapter on parallelization](https://mc-stan.org/docs/stan-users-guide/parallelization-chapter.html), [the reduce_sum case study](https://mc-stan.org/users/documentation/case-studies/reduce_sum_tutorial.html), or the [Multithreading and Map-Reduce tutorial](https://github.com/rmcelreath/cmdstan_map_rect_tutorial).
### Compiling
Once a model is rewritten to use the above-mentioned functions, the model
must be compiled with the `STAN_THREADS` makefile flag. The flag can be
supplied in the `make` call but we recommend writing the flag to the
`make/local` file.
An example of the contents of `make/local` to enable threading with TBB:
```
STAN_THREADS=true
```
The model is then compiled as normal:
```
make path/to/model
```
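The recommendation to keep the flag in `make/local` can be scripted. The following is a minimal sketch, not part of CmdStan: `add_make_flag` is a hypothetical helper that appends a flag line to a `make/local`-style file only if that exact line is not already present, so running it repeatedly does not duplicate the flag. It assumes a POSIX shell and that it is run from the CmdStan directory.

```shell
# Hypothetical helper (not part of CmdStan): append a flag line to a
# make/local-style file unless that exact line is already present.
add_make_flag() {
  flag="$1"
  file="${2:-make/local}"          # defaults to make/local in the current directory
  mkdir -p "$(dirname "$file")"    # create the parent directory if it is missing
  touch "$file"
  # -x matches whole lines, -F treats the flag as a fixed string
  grep -qxF -- "$flag" "$file" || printf '%s\n' "$flag" >> "$file"
}

# Example usage, run from the CmdStan directory:
#   add_make_flag "STAN_THREADS=true"
#   make path/to/model
```

The helper assumes the existing file, if any, ends with a newline; otherwise the new flag would be glued to the last line.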
### Running
Before running a multi-threaded model, we need to specify the maximum number of threads
the program can use (the total across all chains). This is done by setting the `num_threads`
argument. Valid values for `num_threads` are positive integers and -1. If `num_threads` is set
to -1, all available cores will be used.
For best performance, this number should generally not exceed the number of available cores.
Example:
```
./model sample data file=data.json num_threads=4 ...
```
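Because the thread count should generally stay at or below the number of available cores, one way to choose a value is to cap a requested count at the core count. This is a minimal sketch under assumptions: it relies on `getconf _NPROCESSORS_ONLN`, which works on most Linux and macOS systems but is not guaranteed everywhere, and `requested` is a hypothetical value. The command is only printed, not run.

```shell
# Number of online cores; fall back to 1 if getconf cannot report it.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
cores=${cores:-1}

requested=4   # hypothetical: the thread count we would like to use

# Cap the requested thread count at the number of available cores.
if [ "$requested" -gt "$cores" ]; then
  threads=$cores
else
  threads=$requested
fi

# Print the resulting command rather than running it.
echo "./model sample data file=data.json num_threads=${threads}"
```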
When the model is compiled with `STAN_THREADS`, we can sample multiple chains with a single
executable (see the section on [running multiple chains](#multi-chain-sampling) for cases when this is
available). When running multiple chains, `num_threads` is the maximum number of threads that can
be used by all the chains combined. The exact number of threads used by each chain
at a given point in time is determined by the TBB scheduler. The following example starts 2 chains
with 8 total threads available:
```
./model sample num_chains=2 data file=data.json num_threads=8 ...
```
## Multi-processing with MPI
In order to use multi-processing with MPI in a Stan model, the model must be
rewritten to use the [`map_rect` function](https://mc-stan.org/docs/2_26/functions-reference/functions-map.html). By using MPI, the model can be parallelized across multiple cores or a cluster. MPI with Stan is supported on macOS and Linux.
### Dependencies
Compiling and running Stan models with MPI requires that the system
has an MPI implementation installed. For Unix systems the most commonly used
implementations are [MPICH](https://www.mpich.org/) and [OpenMPI](https://www.open-mpi.org/).
### Compiling
Once a model is rewritten to use `map_rect`, additional makefile flags
must be written to the `make/local` file. These are:
- `STAN_MPI`: Enables the use of MPI with Stan if `true`.
- `CXX`: The name of the MPI C++ compiler wrapper. Typically `mpicxx`.
- `TBB_CXX_TYPE`: The C++ compiler the MPI wrapper wraps. Typically `gcc` on Linux and `clang` on macOS.
An example of `make/local` on Linux:
```
STAN_MPI=true
CXX=mpicxx
TBB_CXX_TYPE=gcc
```
The model is then compiled as normal:
```
make path/to/model
```
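The Linux example differs from a macOS setup only in `TBB_CXX_TYPE`. The sketch below prints candidate `make/local` contents for the current system. It assumes that `uname -s` reports `Darwin` on macOS, that the MPI wrapper is named `mpicxx`, and that the typical compilers named above apply; check what your wrapper actually wraps before using the output.

```shell
# Pick the typical compiler family for the current operating system.
case "$(uname -s)" in
  Darwin) tbb_cxx_type=clang ;;   # macOS
  *)      tbb_cxx_type=gcc ;;     # assume Linux otherwise
esac

# Build and print the flags; redirect them to make/local once they look right.
mpi_flags=$(printf 'STAN_MPI=true\nCXX=mpicxx\nTBB_CXX_TYPE=%s' "$tbb_cxx_type")
echo "$mpi_flags"
```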
### Running
A Stan model compiled with `STAN_MPI` is run using an MPI launcher. The MPI standard
suggests using `mpiexec`, but a vendor wrapper for the launcher such as `mpirun` can also be used.
The launcher is given the path to the built executable and the number of processes to start:
`-n X` for `mpiexec` or `-np X` for `mpirun`, where `X` is the number of processes.
Example for running a model with six processes:
```
mpiexec -n 6 path/to/model sample data file=data.json ...
```
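Because the process-count flag differs between launchers, a wrapper script has to choose it per launcher. The following sketch uses a hypothetical helper, `mpi_launch_cmd`, that assembles the launch command as described above and prints it without running anything.

```shell
# Hypothetical helper: print the launch command for a given launcher,
# process count, and model command line.
mpi_launch_cmd() {
  launcher="$1"
  procs="$2"
  shift 2   # the remaining arguments are the model command line
  case "$launcher" in
    mpiexec) echo "mpiexec -n $procs $*" ;;
    mpirun)  echo "mpirun -np $procs $*" ;;
    *)       echo "unknown launcher: $launcher" >&2; return 1 ;;
  esac
}

mpi_launch_cmd mpiexec 6 path/to/model sample data file=data.json
# prints: mpiexec -n 6 path/to/model sample data file=data.json
```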
## OpenCL
### Dependencies
OpenCL is supported on most modern CPUs and GPUs. In order to run OpenCL-enabled Stan models,
an OpenCL runtime for the target device must be installed. This subsection lists installation
instructions for OpenCL runtimes of the commonly-found devices.
To check whether an OpenCL-enabled device and its runtime are already present, use the
`clinfo` tool. On Linux, `clinfo` can typically be installed with the default package manager
(for example `sudo apt-get install clinfo` on Ubuntu). For Windows, a pre-built `clinfo` binary
can be found [here](https://github.com/Oblomov/clinfo#windows-support).
Also use `clinfo` to verify that an OpenCL runtime was installed successfully.
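A quick check along these lines can be scripted. The sketch below assumes the `clinfo` implementation linked above, whose `-l` option lists platforms and devices by name; other `clinfo` builds may not support that option.

```shell
# Check whether clinfo is on the PATH before trying to list devices.
if command -v clinfo >/dev/null 2>&1; then
  clinfo_status=found
  # List platforms and devices by name; keep going if clinfo reports an error.
  clinfo -l || echo "clinfo ran but reported an error"
else
  clinfo_status=missing
  echo "clinfo not found; install it with your package manager first"
fi
```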
#### NVIDIA GPU
- Linux:
Install the NVIDIA GPU driver and the NVIDIA CUDA Toolkit.
On Ubuntu, the commands to install both are:
```
sudo apt update
sudo apt install nvidia-driver-460 nvidia-cuda-toolkit
```
Replace the driver version (`460` in the above case) with the latest version number available at the time of installation.
- Windows:
Install the [NVIDIA GPU Driver](https://www.nvidia.com/Download/index.aspx) and [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit).
#### AMD GPU
- Linux:
Install `Radeon Software for Linux` available [here](https://www.amd.com/en/support/kb/release-notes/rn-amdgpu-unified-linux-20-40).
- Windows:
We recommend installing the open source [OCL-SDK](https://github.com/GPUOpen-LibrariesAndSDKs/OCL-SDK/releases).
#### AMD CPU
Install the open source [PoCL](http://portablecl.org/download.html).
#### Intel CPU/GPU
Follow Intel's install instructions given [here](https://software.intel.com/content/www/us/en/develop/articles/opencl-drivers.html) (requires registration).
### Compiling
In order to enable the OpenCL backend, the model
must be compiled with the `STAN_OPENCL` makefile flag. The flag can be
supplied in the `make` call, but we recommend writing the flag to the
`make/local` file.
An example of the contents of `make/local` to enable parallelization
with OpenCL:
```
STAN_OPENCL=true
```
The model is then compiled as normal:
```
make path/to/model
```
### Running
A Stan model compiled with `STAN_OPENCL` can also be supplied the OpenCL platform and device IDs
of the target device. These IDs determine the device on which the OpenCL-supported functions run.
You can list the devices on your system using the `clinfo` program. If the system has one GPU and
no OpenCL CPU runtime, the platform and device IDs of the GPU are typically `0`. In that case
you can omit the OpenCL IDs, since both default to `0`.
We supply these IDs when starting the executable as shown below:
```
path/to/model sample data file=data.json opencl platform=0 device=1
```
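To avoid hard-coding the IDs, a launch script can take them from variables and fall back to the default `0`. In this sketch, `opencl_run_cmd`, `STAN_CL_PLATFORM`, and `STAN_CL_DEVICE` are hypothetical names invented for illustration, not settings that CmdStan reads; the command is printed rather than run.

```shell
# Hypothetical helper: print the OpenCL run command for the given IDs,
# defaulting both to 0 when they are unset or empty.
opencl_run_cmd() {
  platform="${1:-0}"
  device="${2:-0}"
  echo "path/to/model sample data file=data.json opencl platform=${platform} device=${device}"
}

# Use the hypothetical variables if they are set, otherwise the defaults.
opencl_run_cmd "${STAN_CL_PLATFORM:-}" "${STAN_CL_DEVICE:-}"
```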