Switch to using rayon exclusively #21

ebfull · 2018-03-22T01:55:55Z

Right now we use futures-cpupool for one purpose (work stealing) and crossbeam for another (scoped threads). I can get both of these with rayon, which is also more mature, but in the past when I switched to rayon it was slower.

I think there are tweaks and new features in its API which would allow me to adopt it entirely.
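
For readers unfamiliar with the "scoped threads" half of this, here is a minimal sketch of what that feature provides: worker threads that borrow data from the caller's stack and are guaranteed to finish before the scope returns. It uses the standard library's `std::thread::scope` rather than crossbeam or rayon, so it illustrates the pattern only and is not bellman's code. The function name `parallel_sum` and the chunking strategy are assumptions made for this example.

```rust
use std::thread;

/// Hypothetical helper: sums `values` by splitting them into at most
/// `workers` chunks, each handled by one scoped thread.
fn parallel_sum(values: &[u64], workers: usize) -> u64 {
    if values.is_empty() {
        return 0;
    }
    let workers = workers.max(1);
    // Ceiling division so every element lands in some chunk.
    let chunk_len = (values.len() + workers - 1) / workers;

    // Threads spawned inside the scope may borrow `values` directly,
    // because the scope joins all of them before it returns.
    thread::scope(|s| {
        let handles: Vec<_> = values
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .sum::<u64>()
    })
}
```

Note that this sketch assigns one fixed chunk per thread and does no work stealing; a work-stealing pool would instead let idle workers take pending chunks from busy ones.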

Pratyush · 2019-08-16T09:35:09Z

PS @ebfull, did you have any notes on why Rayon was slower? In Zexe we switched to using only rayon, and it has made the MSM ~10-15% slower (even when the new code is structured almost identically to the old code).

hdevalence · 2019-08-16T20:05:08Z

I don't know about bellman, but in a very early version of Bulletproofs we tried using Rayon for the parallel parts of the inner product proof and got minimal speedup even for a very parallel task. I don't think we dug super far into it but from perf counters it seemed that it was doing a ton of context switches. Perhaps the work-stealing has more overhead than expected?

Pratyush · 2019-08-16T21:35:22Z

Hmm, how large were those inner product proofs? Perhaps on small instance sizes the overhead is too large? In our case, an MSM over large inputs does noticeably speed up when parallelized, but is still slower than bellman's MSM on the same input size.

Moreover, futures_cpupool also does work-stealing. It would be quite interesting to investigate if, in some cases, futures_cpupool achieves lower overhead for work-stealing than rayon; maybe the lessons learnt could be used to improve rayon as well.

hdevalence · 2019-08-16T22:59:55Z

I don't remember the size but I believe the timings were in the range of 1-20 ms.

Pratyush · 2019-10-14T00:57:32Z

OK so I did some investigation, and came up with the following hypothesis: using futures_cpupool in the Groth16 prover gives better performance than using rayon because somehow futures_cpupool::CpuPool schedules tasks for the MSM better than rayon does when there are multiple MSMs happening in quick succession.

To test this hypothesis, I performed the following experiment: I modified the multiexp code to create a new CpuPool for each invocation (so CpuPools are not shared by different MSMs). The resulting code has roughly the same performance as the rayon-ized version. When used in the Groth16 prover, it results in worse performance than what's currently in master.
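
The shape of the experiment described above can be sketched as follows. This is a toy model under loud assumptions: `ToyPool` is a hypothetical single-queue thread pool built on the standard library (it is neither `futures_cpupool::CpuPool` nor rayon, and it does no work stealing), and `spawn_multiexp` is a chunked wrapping dot product standing in for a real elliptic-curve MSM. It only shows the structural difference between several in-flight multiexps sharing one pool and each multiexp getting its own pool; it makes no claim about which is faster.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

type Job = Box<dyn FnOnce() + Send + 'static>;
type Input = (Arc<Vec<u64>>, Arc<Vec<u64>>);

/// Hypothetical stand-in for a CPU pool: fixed workers pulling from one queue.
struct ToyPool {
    sender: Option<Sender<Job>>,
    workers: Vec<JoinHandle<()>>,
}

impl ToyPool {
    fn new(threads: usize) -> Self {
        let (sender, receiver) = channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers = (0..threads.max(1))
            .map(|_| {
                let receiver = Arc::clone(&receiver);
                thread::spawn(move || loop {
                    // The lock is held only while dequeuing, not while running.
                    let job = receiver.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // queue closed: the pool was dropped
                    }
                })
            })
            .collect();
        ToyPool { sender: Some(sender), workers }
    }

    fn execute(&self, job: impl FnOnce() + Send + 'static) {
        self.sender.as_ref().unwrap().send(Box::new(job)).unwrap();
    }
}

impl Drop for ToyPool {
    fn drop(&mut self) {
        drop(self.sender.take()); // close the queue so workers exit
        for worker in self.workers.drain(..) {
            let _ = worker.join();
        }
    }
}

/// Toy "multiexp": queues chunked partial dot products and returns a handle.
fn spawn_multiexp(pool: &ToyPool, scalars: Arc<Vec<u64>>, bases: Arc<Vec<u64>>, chunks: usize) -> Receiver<u64> {
    let n = scalars.len().min(bases.len());
    let chunks = chunks.max(1);
    let chunk_len = ((n + chunks - 1) / chunks).max(1);
    let (tx, rx) = channel();
    let mut start = 0;
    while start < n {
        let end = (start + chunk_len).min(n);
        let (s, b, tx) = (Arc::clone(&scalars), Arc::clone(&bases), tx.clone());
        pool.execute(move || {
            let partial = s[start..end]
                .iter()
                .zip(&b[start..end])
                .fold(0u64, |acc, (x, y)| acc.wrapping_add(x.wrapping_mul(*y)));
            let _ = tx.send(partial);
        });
        start = end;
    }
    rx // the original `tx` is dropped here, so `rx` ends once all jobs report
}

/// Blocks until every chunk of one multiexp has reported, then combines them.
fn wait(rx: Receiver<u64>) -> u64 {
    rx.iter().fold(0u64, |acc, p| acc.wrapping_add(p))
}

/// Variant A: all multiexps are queued on one shared pool, then awaited.
fn run_shared(inputs: &[Input], threads: usize) -> Vec<u64> {
    let pool = ToyPool::new(threads);
    let in_flight: Vec<Receiver<u64>> = inputs
        .iter()
        .map(|(s, b)| spawn_multiexp(&pool, Arc::clone(s), Arc::clone(b), threads))
        .collect();
    in_flight.into_iter().map(wait).collect()
}

/// Variant B: each multiexp creates its own pool, so pools coexist.
fn run_fresh(inputs: &[Input], threads: usize) -> Vec<u64> {
    let in_flight: Vec<(ToyPool, Receiver<u64>)> = inputs
        .iter()
        .map(|(s, b)| {
            let pool = ToyPool::new(threads);
            let rx = spawn_multiexp(&pool, Arc::clone(s), Arc::clone(b), threads);
            (pool, rx)
        })
        .collect();
    in_flight.into_iter().map(|(_pool, rx)| wait(rx)).collect()
}
```

In variant B, `threads` workers are created per multiexp, so several multiexps in quick succession can oversubscribe the CPU and pay thread start-up costs each time. Whether that is the actual cause of the slowdown reported above is not established here. To compare the two, one could wrap `run_shared` and `run_fresh` in `std::time::Instant` measurements over realistic input sizes.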

str4d pushed a commit that referenced this issue Aug 25, 2020

Auto merge of #21 - ebfull:bump-again, r=ebfull

aa5d634

Version bump to 0.10.1

str4d pushed a commit that referenced this issue Aug 25, 2020

Merge pull request #21 from ebfull/gh-revisions

c8cc190

Edwards scalar multiplication inside the circuit

str4d mentioned this issue Jun 1, 2021

Multicore improvements #69

Merged

str4d closed this as completed in #69 Sep 2, 2021
