
Parallelization #6

Closed
pgrinaway opened this issue Dec 6, 2019 · 3 comments

Comments

@pgrinaway

Hi,

I have a quick question. I noticed that the benchmarks in the Fractal paper are done single-threaded. When I ran a benchmark on my machine, I saw that multiplicative_FFT_wrapper often took the most time, so I looked at the code. It seems to be taken from libfqfft; is there any reason libfqfft wasn't used directly, to take advantage of its OpenMP support? Would it be (somewhat) easy to drop in a call to libfqfft?

Thanks!

PS: thanks for sharing all this. It's really awesome stuff.

@ValarDragon (Member)

Sorry for the late response on this!

The FFTs can be completely parallelized, exactly like libfqfft's. The core loop that does all the work is basically the same, so the OpenMP parallelization can be done identically. You could also switch back to libfqfft rather easily. (In fact, I think the version of the library in the first commit already did that, so you could just copy-paste its code.)

The reason I reimplemented multiplicative FFTs was to implement two significant optimizations:

  1. In libfqfft the FFT takes n log(n) multiplications, but it only needs n log(d), where d is the polynomial's degree bound and n the domain size. This matters in our case, since we take FFTs of a polynomial of degree |H| over a much larger domain L, where |L| is typically 32|H|. At circuits of size 2^20, this is a 25% speed improvement.

  2. The second thing we do differently in our FFTs is caching terms that typically get recalculated via data-dependent multiplications. These are the so-called FFT 'twiddle factors', which are particular group elements of the evaluation domain. Since the evaluation domain is known in advance, they can be computed ahead of time. This reduced FFT time by another 30%; the FFT now does n log(d)/2 multiplications and n log(d) additions, with most of the remaining time presumably spent shuffling data around.
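To make the two points above concrete, here is a toy sketch over the small prime field F_257 (none of these function names come from libiop or libfqfft; this is purely an illustration). A degree < d polynomial is evaluated on a size-n multiplicative domain by splitting the domain into n/d cosets of a size-d subgroup and running one size-d FFT per coset, which costs n log(d) multiplications rather than n log(n); and all twiddle factors are tabulated once per domain and looked up rather than recomputed:

```python
p = 257  # toy prime field; 2^8 divides p - 1, so power-of-two roots of unity exist

def precompute_twiddles(omega, m):
    # optimization 2: tabulate omega^0 .. omega^(m-1) once per evaluation domain
    tw = [1] * m
    for i in range(1, m):
        tw[i] = tw[i - 1] * omega % p
    return tw

def fft(coeffs, tw, stride=1):
    # radix-2 FFT over F_p; tw[i * stride] = omega^i for the current subproblem
    m = len(coeffs)
    if m == 1:
        return coeffs[:]
    even = fft(coeffs[0::2], tw, 2 * stride)
    odd = fft(coeffs[1::2], tw, 2 * stride)
    out = [0] * m
    for i in range(m // 2):
        t = tw[i * stride] * odd[i] % p   # twiddle looked up, never recomputed
        out[i] = (even[i] + t) % p
        out[i + m // 2] = (even[i] - t) % p
    return out

def evaluate_on_domain(coeffs, n, omega_n):
    # optimization 1: evaluate a degree < d polynomial on the size-n domain
    # <omega_n> as n/d independent size-d coset FFTs: n log(d) multiplications
    d = len(coeffs)
    k = n // d
    omega_d = pow(omega_n, k, p)           # generates the size-d subgroup H
    tw = precompute_twiddles(omega_d, d)   # shared by all k coset FFTs
    evals = [0] * n
    for j in range(k):                     # coset omega_n^j * H
        g = pow(omega_n, j, p)
        shifted = [c * pow(g, i, p) % p for i, c in enumerate(coeffs)]  # f(g*x)
        sub = fft(shifted, tw)
        for i in range(d):
            evals[i * k + j] = sub[i]      # omega_n^(i*k + j) = g * omega_d^i
    return evals
```

The per-coset scaling by powers of g adds n extra multiplications, but each of the k coset FFTs only costs d log(d), so the total stays at roughly n log(d) rather than n log(n).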

@pgrinaway (Author)

Thanks so much for the answer, this is super helpful!

@ValarDragon (Member)

ValarDragon commented Jan 13, 2020

No problem! One thing I probably should have said is that these FFTs can also be parallelized better than libfqfft's.

You can divide the FFT into n/d parts that require no cross-communication. Since FFTs are hard to parallelize well in practice due to memory issues, this should let libiop's parallel implementation perform much better. Taking advantage of it will require some refactoring of the code. If you (or anyone else) is interested in this, I'm happy to provide guidance on how to do so.
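The structure described above can be sketched as follows, again over a toy prime field (this is an assumption about the decomposition being described, not libiop's code): each of the k = n/d cosets is an independent unit of work that touches no shared mutable state, so the cosets can simply be mapped over a worker pool.

```python
from concurrent.futures import ThreadPoolExecutor

p = 257  # toy prime field with power-of-two roots of unity

def fft(coeffs, omega):
    # classic radix-2 recursive FFT over F_p; len(coeffs) is a power of two
    n = len(coeffs)
    if n == 1:
        return coeffs[:]
    even = fft(coeffs[0::2], omega * omega % p)
    odd = fft(coeffs[1::2], omega * omega % p)
    out = [0] * n
    w = 1
    for i in range(n // 2):
        t = w * odd[i] % p
        out[i] = (even[i] + t) % p
        out[i + n // 2] = (even[i] - t) % p
        w = w * omega % p
    return out

def coset_fft(coeffs, g, omega_d):
    # one independent chunk: evaluate f on the coset g * <omega_d>,
    # reading only its own shifted copy of the coefficients
    shifted = [c * pow(g, i, p) % p for i, c in enumerate(coeffs)]
    return fft(shifted, omega_d)

def parallel_evaluate(coeffs, n, omega_n):
    d = len(coeffs)
    k = n // d
    omega_d = pow(omega_n, k, p)
    offsets = [pow(omega_n, j, p) for j in range(k)]
    # the k coset FFTs share nothing, so they can run on separate workers
    with ThreadPoolExecutor(max_workers=k) as pool:
        subs = list(pool.map(lambda g: coset_fft(coeffs, g, omega_d), offsets))
    evals = [0] * n
    for j, sub in enumerate(subs):
        for i in range(d):
            evals[i * k + j] = sub[i]      # interleave cosets back into the domain
    return evals
```

(In CPython, threads only buy real speedup here if the field arithmetic releases the GIL; in C++ the same decomposition maps directly onto an OpenMP parallel-for over the cosets.)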
