The benchmark here is 8 schools but with varying J to make things slow enough
that we can make measurements: 800 schools and 8000 schools. (The remaining
data is simply repeated the appropriate number of times.) In both cases the
models are compiled with `-DSTAN_THREADS`. Here are the results:
| J    | thin | cmdstan 2.18 | httpstan 0.6.1 | % slower |
|------|------|--------------|----------------|----------|
| 800  | 1    | 4.5s         | 17s            | 270%     |
| 8000 | 50   | 1m31s        | 1m45s          | 15%      |
TODO: add cmdstan without `-DSTAN_THREADS`
Encoding draws into protobuf format is the bottleneck. Since draws come out of Stan C++ much more slowly in the J = 8000 case, the encoding bottleneck largely disappears there. Two changes would improve things, I suspect:
- Moving the protobuf encoding into Cython/C++. If the protobuf_writer emerges (/cc @sakrejda), this would solve the problem.
- Writing Stan C++ output to disk or to a cache in Cython rather than via the
async for loop. Python loops are slow.
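To illustrate the second point, here is a minimal stdlib-only sketch (with a hypothetical `draws()` generator standing in for the stream of draws from Stan C++) showing the cost of a per-draw Python-level loop versus letting a single C-level call do the iteration:

```python
import time

N_DRAWS, N_PARAMS = 200_000, 10

def draws():
    # Hypothetical stand-in for the stream of draws coming out of
    # Stan C++; each draw is a vector of parameter values.
    for _ in range(N_DRAWS):
        yield [0.0] * N_PARAMS

# Per-draw Python loop (analogous to the `async for` loop in
# httpstan): every draw pays Python bytecode overhead for the loop
# step, attribute lookup, and append.
t0 = time.perf_counter()
out = []
for draw in draws():
    out.append(draw)
loop_s = time.perf_counter() - t0

# Single C-level call: the same iteration runs inside the
# interpreter's C implementation of list(), amortizing the
# per-item overhead.
t0 = time.perf_counter()
out2 = list(draws())
builtin_s = time.perf_counter() - t0

print(f"python loop: {loop_s:.3f}s  list(): {builtin_s:.3f}s")
```

On CPython the explicit loop is consistently the slower of the two, which is the same effect that makes moving the draw handling into Cython/C++ attractive.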
I'm happy with the progress so far. Things were far slower a week ago.