
Parallelization #6

Closed
pgrinaway opened this issue Dec 6, 2019 · 3 comments

Comments

@pgrinaway

Hi,

I have a quick question. I noticed that the benchmarks in the Fractal paper are done single-threaded. When I ran a benchmark on my machine, I saw that multiplicative_FFT_wrapper often took the most time, so I looked at the code. It seems to be taken from libfqfft; is there any reason libfqfft wasn't used directly, to take advantage of its OpenMP support? Would it be (somewhat) easy to drop in a call to libfqfft?

Thanks!

PS: thanks for sharing all this. It's really awesome stuff.

@ValarDragon (Member)

Sorry for the late response on this!

The FFTs can be completely parallelized, exactly like libfqfft's. The core loop that does all the work is basically the same, so the OpenMP parallelization can be done identically. You could also switch back to libfqfft rather easily. (In fact, I think the version of the library in the first commit already did that, so you could just copy-paste its code.)

The reason I reimplemented multiplicative FFTs was to implement two significant optimizations:

  1. In libfqfft the FFT takes n log(n) multiplications, but it only needs n log(d), where d is the polynomial's degree bound and n the domain size. This matters in our case, since we take FFTs of a polynomial of degree |H| over a much larger domain L, where |L| is typically 32|H|. At circuits of size 2^20, this is a 25% speed improvement.

  2. The second thing we do differently in our FFTs is caching terms that typically get recalculated via data-dependent multiplications. These are the so-called FFT 'twiddle factors', which are particular group elements of the evaluation domain. Since the evaluation domain is known in advance, they can be computed ahead of time. This reduced FFT time by another 30%; the FFT now does n log(d)/2 multiplications and n log(d) additions, with most of the remaining time presumably spent shuffling data around.
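To make the two points above concrete, here is a toy sketch over the small prime field F_257 (none of these function names come from libiop or libfqfft; this is purely an illustration). A degree < d polynomial is evaluated on a size-n multiplicative domain by splitting the domain into n/d cosets of a size-d subgroup and running one size-d FFT per coset, which costs n log(d) multiplications rather than n log(n); and all twiddle factors are tabulated once per domain and looked up rather than recomputed:

```python
p = 257  # toy prime field; 2^8 divides p - 1, so power-of-two roots of unity exist

def precompute_twiddles(omega, m):
    # optimization 2: tabulate omega^0 .. omega^(m-1) once per evaluation domain
    tw = [1] * m
    for i in range(1, m):
        tw[i] = tw[i - 1] * omega % p
    return tw

def fft(coeffs, tw, stride=1):
    # radix-2 FFT over F_p; tw[i * stride] = omega^i for the current subproblem
    m = len(coeffs)
    if m == 1:
        return coeffs[:]
    even = fft(coeffs[0::2], tw, 2 * stride)
    odd = fft(coeffs[1::2], tw, 2 * stride)
    out = [0] * m
    for i in range(m // 2):
        t = tw[i * stride] * odd[i] % p   # twiddle looked up, never recomputed
        out[i] = (even[i] + t) % p
        out[i + m // 2] = (even[i] - t) % p
    return out

def evaluate_on_domain(coeffs, n, omega_n):
    # optimization 1: evaluate a degree < d polynomial on the size-n domain
    # <omega_n> as n/d independent size-d coset FFTs: n log(d) multiplications
    d = len(coeffs)
    k = n // d
    omega_d = pow(omega_n, k, p)           # generates the size-d subgroup H
    tw = precompute_twiddles(omega_d, d)   # shared by all k coset FFTs
    evals = [0] * n
    for j in range(k):                     # coset omega_n^j * H
        g = pow(omega_n, j, p)
        shifted = [c * pow(g, i, p) % p for i, c in enumerate(coeffs)]  # f(g*x)
        sub = fft(shifted, tw)
        for i in range(d):
            evals[i * k + j] = sub[i]      # omega_n^(i*k + j) = g * omega_d^i
    return evals
```

The per-coset scaling by powers of g adds n extra multiplications, but each of the k coset FFTs only costs d log(d), so the total stays at roughly n log(d) rather than n log(n).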

@pgrinaway (Author)

Thanks so much for the answer, this is super helpful!

@ValarDragon (Member)

ValarDragon commented Jan 13, 2020

No problem! One thing I probably should have said is that these FFTs can also be parallelized better than libfqfft's.

You can divide the FFT into n/d parts that require no cross-communication. Since FFTs are hard to parallelize well in practice due to memory issues, this should let libiop's parallel implementation perform much better. Taking advantage of it will require some refactoring of the code. If you (or anyone else) is interested in this, I'm happy to provide guidance on how to do so.
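The structure described above can be sketched as follows, again over a toy prime field (this is an assumption about the decomposition being described, not libiop's code): each of the k = n/d cosets is an independent unit of work that touches no shared mutable state, so the cosets can simply be mapped over a worker pool.

```python
from concurrent.futures import ThreadPoolExecutor

p = 257  # toy prime field with power-of-two roots of unity

def fft(coeffs, omega):
    # classic radix-2 recursive FFT over F_p; len(coeffs) is a power of two
    n = len(coeffs)
    if n == 1:
        return coeffs[:]
    even = fft(coeffs[0::2], omega * omega % p)
    odd = fft(coeffs[1::2], omega * omega % p)
    out = [0] * n
    w = 1
    for i in range(n // 2):
        t = w * odd[i] % p
        out[i] = (even[i] + t) % p
        out[i + n // 2] = (even[i] - t) % p
        w = w * omega % p
    return out

def coset_fft(coeffs, g, omega_d):
    # one independent chunk: evaluate f on the coset g * <omega_d>,
    # reading only its own shifted copy of the coefficients
    shifted = [c * pow(g, i, p) % p for i, c in enumerate(coeffs)]
    return fft(shifted, omega_d)

def parallel_evaluate(coeffs, n, omega_n):
    d = len(coeffs)
    k = n // d
    omega_d = pow(omega_n, k, p)
    offsets = [pow(omega_n, j, p) for j in range(k)]
    # the k coset FFTs share nothing, so they can run on separate workers
    with ThreadPoolExecutor(max_workers=k) as pool:
        subs = list(pool.map(lambda g: coset_fft(coeffs, g, omega_d), offsets))
    evals = [0] * n
    for j, sub in enumerate(subs):
        for i in range(d):
            evals[i * k + j] = sub[i]      # interleave cosets back into the domain
    return evals
```

(In CPython, threads only buy real speedup here if the field arithmetic releases the GIL; in C++ the same decomposition maps directly onto an OpenMP parallel-for over the cosets.)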
