Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to using rayon exclusively #21

Closed
ebfull opened this issue Mar 22, 2018 · 5 comments · Fixed by #69
Closed

Switch to using rayon exclusively #21

ebfull opened this issue Mar 22, 2018 · 5 comments · Fixed by #69

Comments

@ebfull
Copy link
Collaborator

ebfull commented Mar 22, 2018

Right now we use futures-cpupool for one purpose (work stealing) and crossbeam for another (scoped threads). I can get both of these with rayon, which is also more mature, but in the past when I switched to rayon it was slower.

I think there are tweaks and new features in its API which would allow me to adopt it entirely.

@Pratyush
Copy link

Pratyush commented Aug 16, 2019

PS @ebfull, did you have any notes on why Rayon was slower? In Zexe we switched to using only rayon, and it has made the MSM ~10-15% slower. (Even when the new code is structured almost identically to the old code).

@hdevalence
Copy link
Contributor

I don't know about bellman, but in a very early version of Bulletproofs we tried using Rayon for the parallel parts of the inner product proof and got minimal speedup even for a very parallel task. I don't think we dug super far into it but from perf counters it seemed that it was doing a ton of context switches. Perhaps the work-stealing has more overhead than expected?

@Pratyush
Copy link

Hmm how large were those inner product proofs? Perhaps on small instance sizes the overhead is too large? In our case, an MSM over large inputs does noticeably speed up when parallelized, but is still slower than bellman's MSM on the same input size.

Moreover, futures_cpupool also does work-stealing. It would be quite interesting to investigate if, in some cases, futures_cpupool achieves lower overhead for work-stealing than rayon; maybe the lessons learnt could be used to improve rayon as well.

@hdevalence
Copy link
Contributor

I don't remember the size but I believe the timings were in the range of 1-20 ms.

@Pratyush
Copy link

OK so I did some investigation, and came up with the following hypothesis: using futures_cpupool in the Groth16 prover gives better performance than using rayon because somehow futures_cpupool::CpuPool schedules tasks for the MSM better than rayon does when there are multiple MSMs happening in quick succession.

To justify this hypothesis, I performed the following test: I modified the multiexp code to create a new CpuPool for each invocation (so CpuPools are not shared by different MSMs). The resulting code has the ~same performance as the rayon-ized version. When used in the Groth16 prover, it results in worse performance than what's currently in master.

str4d pushed a commit that referenced this issue Aug 25, 2020
str4d pushed a commit that referenced this issue Aug 25, 2020
Edwards scalar multiplication inside the circuit
@str4d str4d closed this as completed in #69 Sep 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants