
ENH: Add parallel computation to scipy.fft #10614

Merged
merged 15 commits into scipy:master from peterbell10:fft-threading on Sep 5, 2019

Conversation

@peterbell10
Member

peterbell10 commented Aug 8, 2019

What does this implement/fix?

I've moved pypocketfft from OpenMP to a custom thread pool written in plain C++11. This implementation uses pthread_atfork to cleanly handle fork on POSIX systems.

With this, I can enable multithreading for all of the transforms in scipy.fft via a workers argument.

Example

In [2]: x = np.random.randn(1024, 1000).astype(complex)

In [3]: %timeit scipy.fft.fft(x)  # Defaults to 1 thread, as before
8.25 ms ± 80.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit scipy.fft.fft(x, workers=-1)  # Threads chosen automatically
3.61 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@larsoner

Member

larsoner commented Aug 8, 2019

Just a couple of quick comments right off the bat, since it might take me a bit to look at the C++ code.

IIRC in joblib they use n_jobs=-1 to mean "use all CPUs", similar to negative indexing in Python. It might be nice to mimic this interface.

It's important to say how the number of workers is leveraged. One might assume that even for a 1D FFT with shape (n_samples,) they should see some speedup. But if the parallelization is done over transforms (rather than within each transform) then they won't see any.

@larsoner

Member

larsoner commented Aug 8, 2019

Also marking as "needs-decision" since we'll probably need to discuss (on the mailing list?) whether we want a custom thread pool. It might be worth emailing the same OpenMP thread to mention that this is an alternative and see what people think. I am not that experienced with multithreading in C++, so it would be good to solicit other opinions about maintainability, extensibility, etc.

Also it's possible that if this is general enough we could perhaps use it elsewhere in SciPy. But it's important to list the caveats about what can make use of it, and what couldn't. (Just pocketfft? Any pure C/C++ code? Any C/C++ code plus any nogil Cython code?)

@peterbell10

Member Author

peterbell10 commented Aug 8, 2019

For C++ reviewers, it's best to look at pypocketfft!23 and pypocketfft!24, which contain just the thread pool changes. It's actually quite small, <200 lines of code. The diff here also contains some refactoring and other tweaks to other parts of the code.

@insertinterestingnamehere

Member

insertinterestingnamehere commented Aug 8, 2019

My two cents: this seems like a great idea to me. Users are often already calling into threaded BLAS, so it makes sense to allow threading of some kind in the FFT library as well.

Brief follow-up question, should this do anything special in cases where OMP_NUM_THREADS is set? As I understand it, setting OMP_NUM_THREADS=1 is the current standard practice for avoiding oversubscription when doing something requiring various distinct BLAS calls to be run in parallel.

@peterbell10

Member Author

peterbell10 commented Aug 8, 2019

Also it's possible that if this is general enough we could perhaps use it elsewhere in SciPy. But it's important to list the caveats about what can make use of it, and what couldn't. (Just pocketfft? Any pure C/C++ code? Any C/C++ code plus any nogil Cython code?)

The biggest thing to bear in mind is that this only implements a tiny subset of OpenMP's functionality. Basically just enough to get pypocketfft working.

However, the thread pool is completely general, so any C++ code should be able to use it. A C and/or Cython wrapper should also be possible, although wrapping the function object that is submitted to the pool would be more challenging than in C++. f2py can even traffic Python callbacks into C function pointers, so anything should be possible really.

@peterbell10 peterbell10 force-pushed the peterbell10:fft-threading branch from fd830be to e28f186 Aug 8, 2019
@rgommers

Member

rgommers commented Aug 8, 2019

IIRC in joblib they use n_jobs=-1 to mean "use all CPUs", similar to negative indexing in Python. It might be nice to mimic this interface.

We have a workers interface in a few other places, mainly the optimize global optimizers, and perhaps cKDTree. IIRC the optimize interface is what we wanted: it allows -1 (as in scikit-learn), as well as a multiprocessing.Pool-type input.

@peterbell10

Member Author

peterbell10 commented Aug 8, 2019

I've updated it to use -1 as the special value. Accepting multiprocessing.Pool here would probably be both significantly more work and not actually any faster.
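For readers following along, a minimal sketch of the -1 convention being discussed, assuming a hypothetical helper named _normalize_workers (the PR's actual helper may be named and structured differently):

import os

def _normalize_workers(workers):
    # Hypothetical helper: interpret negative values like joblib's n_jobs,
    # so workers=-1 means "use all CPUs", workers=-2 "all but one", and so on.
    if workers is None:
        return 1  # default stays single-threaded
    if workers < 0:
        workers += os.cpu_count() + 1
    if workers <= 0:
        raise ValueError("workers must be a non-zero integer")
    return workers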

@peterbell10 peterbell10 force-pushed the peterbell10:fft-threading branch from 5275009 to d5dde59 Aug 8, 2019
@grlee77

Contributor

grlee77 commented Aug 8, 2019

I've updated it to use -1 as the special value.

nice, this is consistent with pyFFTW as well.

peterbell10 added 2 commits Aug 8, 2019
When capturing a constexpr variable, MSVC doesn't recognise it as constexpr
within the lambda's scope.
@peterbell10 peterbell10 force-pushed the peterbell10:fft-threading branch from d5dde59 to efc6991 Aug 8, 2019
Review comment on scipy/fft/_basic.py (outdated, resolved)
@peterbell10 peterbell10 force-pushed the peterbell10:fft-threading branch from 2204250 to 3498f92 Aug 8, 2019
@peterbell10 peterbell10 force-pushed the peterbell10:fft-threading branch from 3498f92 to a620d98 Aug 8, 2019
@peterbell10

Member Author

peterbell10 commented Aug 11, 2019

Brief follow-up question, should this do anything special in cases where OMP_NUM_THREADS is set? As I understand it, setting OMP_NUM_THREADS=1 is the current standard practice for avoiding oversubscription when doing something requiring various distinct BLAS calls to be run in parallel.

@larsoner, @grlee77 and I briefly discussed this on Thursday. Some of the points brought up were:

  • Unlike BLAS, this only uses multithreading if the caller requests it (by passing workers=n).
  • OMP_NUM_THREADS is only used because there isn't a better way to control it.
  • We could respect it for the automatic option (workers=-1), but as of the latest commit that now follows os.cpu_count() instead.

If the goal is to have a global configuration option then I could add a context manager. e.g.

with scipy.fft.workers(4):
    scipy.signal.fftconvolve(x, y)

This would probably be more useful than an environment variable.
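For illustration, here is a rough sketch of how such a context manager could be built on a thread-local default; the names workers and get_workers are only illustrative and not the API added by this PR:

import contextlib
import threading

_default_workers = threading.local()

def get_workers():
    return getattr(_default_workers, "value", 1)

@contextlib.contextmanager
def workers(n):
    # Temporarily override the default worker count for the current thread.
    previous = get_workers()
    _default_workers.value = n
    try:
        yield
    finally:
        _default_workers.value = previous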

@insertinterestingnamehere

Member

insertinterestingnamehere commented Aug 12, 2019

[quoted @peterbell10's reply above in full]

Yah, the context manager design is great and definitely better than the environment variable. You're right that it might make sense to prioritize the environment variable in the "workers=-1" case, but given that the default is still sequential, it doesn't matter as much. This isn't going to suddenly oversubscribe codes that expect FFTs to run sequentially. This all makes sense to me. Thanks for the follow up.

@peterbell10 peterbell10 force-pushed the peterbell10:fft-threading branch from 6cb73c5 to a385b19 Aug 12, 2019
@peterbell10 peterbell10 force-pushed the peterbell10:fft-threading branch from 5044cca to 0b5db50 Aug 14, 2019
Member

larsoner left a comment

Other than a couple of minor things, LGTM. Probably worth adding a benchmark for it, though. Also might be worth having a comparison to the ThreadPoolExecutor version like what's done in benchmarks/signal_filtering.py.

This also makes me think that it might be worth saying in the Notes what advantages there are over using something like ThreadPoolExecutor. It might just be simplicity of usage pattern (no need to determine how to parallelize and no need to patch back together the output), which is probably okay.

Review comment on scipy/fft/_basic.py (outdated, resolved)
Review comment on scipy/fft/_pocketfft/helper.py (outdated, resolved)
@peterbell10

Member Author

peterbell10 commented Aug 15, 2019

I've added a benchmark comparing against ThreadPoolExecutor. If my Python version is correct, then using the workers argument is pretty consistently faster than doing it in Python, in some cases by a very significant margin (>4x faster).

(My CPU has 4 cores, 8 hyper-threads)
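For context, the "workers" and "threading" columns correspond roughly to the two approaches sketched below (a simplified approximation, not the actual code in benchmarks/fft_basic.py; the sizes and the chunking strategy are assumptions):

from concurrent.futures import ThreadPoolExecutor
import numpy as np
import scipy.fft

workers = 8
num_transforms, size = 8, (100, 100)
x = np.random.randn(num_transforms, *size)

def fft_workers(x):
    # "workers" column: let pocketfft parallelise internally
    return scipy.fft.fft(x, workers=workers)

def fft_threadpool(x):
    # "threading" column: split the batch in Python and farm chunks out to a thread pool
    with ThreadPoolExecutor(workers) as pool:
        chunks = np.array_split(x, workers)
        return np.concatenate(list(pool.map(scipy.fft.fft, chunks)))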

fft_basic.FftThreading.time_fft
========== ================ ============= =============
--                                     method
--------------------------- ---------------------------
   size     num_transforms     workers      threading
========== ================ ============= =============
 100x100          1           80.4±10μs      212±30μs
 100x100          8            549±30μs      687±10μs
 100x100          32          2.37±0.2ms   2.40±0.08ms
 100x100         100          7.07±0.2ms    15.1±0.8ms
 1000x100         1            231±9μs      1.04±0.1ms
 1000x100         8          1.98±0.05ms    2.87±0.5ms
 1000x100         32          8.05±0.4ms     36.4±6ms
 1000x100        100           23.9±1ms      127±20ms
 256x256          1            158±5μs       736±10μs
 256x256          8          1.46±0.05ms   1.73±0.09ms
 256x256          32          6.34±0.3ms     23.6±2ms
 256x256         100           17.1±1ms      104±10ms
 512x512          1            496±10μs     2.77±0.1ms
 512x512          8           4.13±0.2ms    7.16±0.7ms
 512x512          32          16.2±0.5ms    76.1±10ms
 512x512         100           51.6±2ms      416±30ms
========== ================ ============= =============


fft_basic.FftThreading.time_fftn
========== ================ ============= =============
--                                     method
--------------------------- ---------------------------
   size     num_transforms     workers      threading
========== ================ ============= =============
 100x100          1            176±3μs       327±20μs
 100x100          8          1.12±0.03ms     870±50μs
 100x100          32          4.41±0.2ms    3.36±0.1ms
 100x100         100          13.5±0.1ms     17.9±1ms
 1000x100         1            569±90μs    2.02±0.07ms
 1000x100         8          4.65±0.05ms    5.58±0.1ms
 1000x100         32          18.8±0.4ms     41.7±6ms
 1000x100        100           56.0±1ms      155±10ms
 256x256          1            409±40μs    1.45±0.09ms
 256x256          8           3.61±0.3ms    3.48±0.2ms
 256x256          32          13.5±0.6ms     22.9±2ms
 256x256         100           40.2±3ms     91.8±10ms
 512x512          1          1.46±0.06ms     6.69±1ms
 512x512          8           12.6±0.4ms     15.7±1ms
 512x512          32           48.3±1ms      103±10ms
 512x512         100           148±20ms      433±20ms
========== ================ ============= ============= 
@larsoner

Member

larsoner commented Aug 16, 2019

@peterbell10 I have four physical cores (eight hyper-threads) as well but I see this:

In [1]: import numpy as np                                                                                                                                                          
In [2]: from scipy.fft import rfft                                                                                                                                                  
In [3]: x = np.random.RandomState(0).randn(100, 100000)                                                                                                                             
In [4]: timeit rfft(x, axis=-1, workers=1)                                                                                                                                          
48.7 ms ± 903 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: timeit rfft(x, axis=-1, workers=2)                                                                                                                                          
38.1 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: timeit rfft(x, axis=-1, workers=3)                                                                                                                                          
53.8 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: timeit rfft(x, axis=-1, workers=4)                                                                                                                                          
70.6 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I am a bit surprised by this. Any ideas why 2 is optimal here?

For comparison here is the futures way:

In [22]: from concurrent import futures
In [23]: pool = futures.ThreadPoolExecutor(1)
In [24]: timeit futures.wait([pool.submit(rfft, xx) for xx in x])                                                                                                                   
77 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [25]: pool = futures.ThreadPoolExecutor(2)                                                                                                                                       
In [26]: timeit futures.wait([pool.submit(rfft, xx) for xx in x])                                                                                                                   
38.3 ms ± 682 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [27]: pool = futures.ThreadPoolExecutor(3)                                                                                                                                       
In [28]: timeit futures.wait([pool.submit(rfft, xx) for xx in x])                                                                                                                   
31.1 ms ± 548 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [29]: pool = futures.ThreadPoolExecutor(4)                                                                                                                                       
In [30]: timeit futures.wait([pool.submit(rfft, xx) for xx in x])                                                                                                                   
34.2 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

FYI my benchmarking results show some improvement, but not quite as drastic as yours.

Benchmark
[ 75.00%] ··· fft_basic.FftThreading.time_fft                                                                                                                                     ok
[ 75.00%] ··· ========== ================ ============ =============
              --                                    method          
              --------------------------- --------------------------
                 size     num_transforms    workers      threading  
              ========== ================ ============ =============
               100x100          1          28.5±0.6μs     87.2±1μs  
               100x100          8           310±40μs      245±2μs   
               100x100          32          949±20μs     1.02±0.1ms 
               100x100         100          3.21±1ms     3.36±0.4ms 
               1000x100         1           174±70μs     663±100μs  
               1000x100         8          1.69±0.6ms    2.19±0.4ms 
               1000x100         32          5.60±1ms     7.81±0.8ms 
               1000x100        100          17.7±5ms      22.1±2ms  
               256x256          1           105±10μs     843±500μs  
               256x256          8           753±70μs    1.11±0.04ms 
               256x256          32         3.33±0.5ms    5.00±0.5ms 
               256x256         100          11.9±2ms      16.1±2ms  
               512x512          1           457±60μs     1.41±0.1ms 
               512x512          8          3.92±0.6ms     5.45±1ms  
               512x512          32          16.2±7ms      22.2±9ms  
               512x512         100         49.8±20ms      58.0±3ms  
              ========== ================ ============ =============

[100.00%] ··· fft_basic.FftThreading.time_fftn                                                                                                                                    ok
[100.00%] ··· ========== ================ ============ ============
              --                                    method         
              --------------------------- -------------------------
                 size     num_transforms    workers     threading  
              ========== ================ ============ ============
               100x100          1           56.9±2μs    178±200μs  
               100x100          8           457±7μs      366±3μs   
               100x100          32         2.44±0.9ms   2.07±0.1ms 
               100x100         100          8.34±1ms     5.56±1ms  
               1000x100         1          469±100μs    1.56±0.6ms 
               1000x100         8           4.00±1ms    4.70±0.8ms 
               1000x100         32          16.5±5ms     19.6±3ms  
               1000x100        100         47.2±20ms     58.4±9ms  
               256x256          1           261±90μs    1.03±0.3ms 
               256x256          8          2.33±0.7ms   2.89±0.4ms 
               256x256          32          9.34±3ms     10.9±2ms  
               256x256         100         28.9±10ms     34.3±6ms  
               512x512          1          1.17±0.3ms   3.94±0.4ms 
               512x512          8           9.36±2ms    11.5±0.6ms 
               512x512          32         36.4±10ms    43.8±0.8ms 
               512x512         100          114±5ms      141±2ms   
              ========== ================ ============ ============
@peterbell10

Member Author

peterbell10 commented Aug 16, 2019

In [3]: x = np.random.RandomState(0).randn(100, 100000)                                                                                                                             
In [4]: timeit rfft(x, axis=-1, workers=1)

Any ideas why 2 is optimal here?

This is doing FFTs of length 100,000 so I would expect this to be hugely cache dependent. I think 2 workers would require ~3 MB of memory; is one of your CPU caches ~4 MB or similar by any chance?
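As a back-of-envelope check of that ~3 MB figure, assuming double-precision input and a full-size complex output per transform:

n = 100_000
in_bytes = n * 8               # one float64 input row:   ~0.8 MB
out_bytes = (n // 2 + 1) * 16  # rfft output, complex128:  ~0.8 MB
per_transform = in_bytes + out_bytes
print(2 * per_transform / 1e6)  # ~3.2 MB working set for two concurrent transforms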

Member

larsoner left a comment

LGTM +1 for merge.

@pv it would be good to get +1 from you if possible in case you have thoughts/concerns about the expansion into parallel processing.

@larsoner larsoner removed the needs-work label Aug 16, 2019
@larsoner

Member

larsoner commented Aug 16, 2019

(FYI my CPU has 8 MB of cache, so that seems like a reasonable culprit for these performance differences.)

@larsoner larsoner added this to the 1.4.0 milestone Aug 24, 2019
@mreineck

mreineck commented Sep 2, 2019

Is this waiting for anything in particular? If there is anything I can do, please let me know.

@larsoner

Member

larsoner commented Sep 2, 2019

It would be good to have another +1 from a SciPy maintainer before merging. So I think we're just waiting for another review at this point, no changes currently required.

@grlee77
grlee77 approved these changes Sep 3, 2019
Contributor

grlee77 left a comment

I did not review all of the pypocketfft C++ threading code in full detail, but the Python side looks good to me and the tests look sound.

I tried some 3D FFTs on a 10-core Skylake-X and saw a 6-7x speedup with workers=-1. pyFFTW configured to use the same number of threads and FFTW_MEASURE planning effort was only about 20% faster, so this is quite good!

As expected, trying to set workers when the pyFFTW backend is selected does not currently work (TypeError: fftn() got an unexpected keyword argument 'workers'), but that can be fixed by a corresponding PR to pyFFTW that adds the workers argument to its backend implementation after this is merged.
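As a rough way to reproduce that kind of measurement (the array size and timing method here are illustrative, not what grlee77 used):

import time
import numpy as np
import scipy.fft

x = np.random.randn(128, 128, 128) + 0j  # made-up 3D size

for workers in (1, -1):
    start = time.perf_counter()
    scipy.fft.fftn(x, workers=workers)
    print(workers, time.perf_counter() - start)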

@peterbell10

Member Author

peterbell10 commented Sep 3, 2019

Would it be worth sending another email to the mailing list? My first email was sent as a reply to the OpenMP discussion, so some people might have missed it.

@larsoner

Member

larsoner commented Sep 3, 2019

I'd worry about bugging people too much. It's already marked for 1.4.0 so should make it into that release. Around release time more reviewers usually come out of the woodwork, and if nobody objects (or reviews) by then, we can take it as implicit acceptance and merge to get it in.

This is of course assuming it's not problematic in some way for this to sit around for another month or two (e.g., preventing other follow-up PRs). If it is holding something up, then we can try to move faster.

@peterbell10

Member Author

peterbell10 commented Sep 3, 2019

I'd worry about bugging people too much. It's already marked for 1.4.0 so should make it into that release.

Okay, then let's just wait.

This is of course assuming it's not problematic in some way for this to sit around for another month or two (e.g., preventing other follow-up PRs).

The only issue I can see is that, as @grlee77 has said, this will need similar changes in pyFFTW's API.

@ev-br

Member

ev-br commented Sep 3, 2019

I'd suggest waiting for a little bit (say, a week or so) and then merging unless there are issues or concerns. One reason being, it gets more testing (or at least a chance for it) before it's released. If problems are found that can't be easily fixed, the PR can be reverted for 1.4.0.

@rgommers rgommers removed the needs-decision label Sep 3, 2019
@rgommers

Member

rgommers commented Sep 3, 2019

Everyone here seems happy, and this PR looks good to me too. We already have a pure C++ threadpool in cKDTree, so I don't think this is controversial. I'd say let's give it another 1-2 days and hit the green button.

@grlee77

Contributor

grlee77 commented Sep 4, 2019

@peterbell10: a quick heads up to avoid duplicated effort:
I started a pyFFTW branch with corresponding changes to match this PR. I will open a PR based on that once this has been merged and new weekly development wheels are available.
https://github.com/grlee77/pyFFTW/commits/scipy_fft_workers

@larsoner larsoner merged commit fd1afd3 into scipy:master Sep 5, 2019
9 checks passed:
ci/circleci: build_docs (Your tests passed on CircleCI!)
continuous-integration/appveyor/pr (AppVeyor build succeeded)
continuous-integration/travis-ci/pr (The Travis CI build passed)
scipy.scipy Build #20190815.23 succeeded
scipy.scipy (Linux_Python_36_32bit_full) succeeded
scipy.scipy (Windows Python35-64bit-full) succeeded
scipy.scipy (Windows Python36-32bit-full) succeeded
scipy.scipy (Windows Python36-64bit-full) succeeded
scipy.scipy (Windows Python37-64bit-full) succeeded
@larsoner

Member

larsoner commented Sep 5, 2019

Okay it's in, thanks @peterbell10 !

@peterbell10 peterbell10 deleted the peterbell10:fft-threading branch Sep 5, 2019
@peterbell10

Member Author

peterbell10 commented Sep 7, 2019

I've updated the release notes and added this item to the list of pypocketfft benefits:

  • optional multithreading of multi-dimensional transforms via the workers argument