
scipy.fft: Can OpenMP be used portably #10239

Closed
peterbell10 opened this issue May 31, 2019 · 16 comments

@peterbell10
Member

Related to #10238

Can we build and use OpenMP in a portable way?

This will require compiler-dependent build arguments, which I can't see explicitly mentioned in the numpy distutils guide. It also needs to link against the OpenMP runtime, which may be an issue.

@ilayn added the Build issues and scipy.fft labels May 31, 2019
@larsoner
Member

Can we start without it and add it as an enhancement later?

If it's an integral / non-optional part of the software we'll have to figure this out now. But if not, I'd rather wait. As soon as you talk about multithreading the C code, it can be very useful, but you have to think about things like whether you can set the number of threads for each call, what the interface for doing so looks like, etc. Even on a multicore machine, you might know that you are going to use joblib to run N tasks over N CPUs, each of which calls scipy.fft, so really you want each of those calls to do a single-threaded computation. Things like this increase the complexity level.
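To make the interface question concrete, here is a minimal sketch of per-call thread control in C++/OpenMP; the fft_inplace name and signature are made up for illustration, not a proposed SciPy API:

```cpp
// Hypothetical sketch: the caller passes nthreads explicitly, so e.g. a joblib
// worker can force single-threaded execution without touching global state.
#include <complex>
#include <cstddef>

void fft_inplace(std::complex<double>* data, std::size_t n, int nthreads)
{
    const int nt = nthreads > 0 ? nthreads : 1;   // guard against nonsense input
    // num_threads() binds the thread count to this one parallel region only
    #pragma omp parallel for num_threads(nt)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i) {
        data[i] *= 1.0;   // placeholder for the real per-chunk transform work
    }
}
```

A joblib-driven workflow would then call it with nthreads=1, while an interactive user on their own machine could pass the core count.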

You'll probably notice a theme in these responses: trying to start with something small that works and build up from there, rather than trying to include everything at once. It allows for easier review and also easier division of labor (e.g., someone else could do OpenMP once the basics of scipy.fft are in place if they are motivated to do so).

@peterbell10
Member Author

Okay, I'll remove the worker argument for now and build without OpenMP on all platforms.

@rgommers
Member

I am interested in figuring out the status of OpenMP though. I've looked for an up-to-date description before, but didn't find one. Does anyone know of one? It would need to cover:

  • portability
  • compiler support
  • playing well with other libraries that may use OpenMP (same or other version), multiprocessing, or threading.

@grlee77
Contributor

grlee77 commented May 31, 2019

  • compiler support

Here is detailed OpenMP compiler support info

  • playing well with other libraries that may use OpenMP (same or other version), multiprocessing, or threading.

Calls to omp_set_num_threads take precedence over the OMP_NUM_THREADS environment variable. There is also a num_threads clause that can be used to override either of these. This is explained in detail here.
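As a small illustration of that precedence (a sketch, assuming a compiler building with OpenMP enabled, e.g. -fopenmp):

```cpp
#include <cstdio>
#include <omp.h>

int main()
{
    // OMP_NUM_THREADS only sets the initial default; this call overrides it.
    omp_set_num_threads(4);

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("default region: %d threads\n", omp_get_num_threads());
    }

    // A num_threads() clause overrides both the env var and omp_set_num_threads,
    // but only for this one region.
    #pragma omp parallel num_threads(2)
    {
        #pragma omp single
        std::printf("num_threads(2) region: %d threads\n", omp_get_num_threads());
    }
    return 0;
}
```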

@grlee77
Contributor

grlee77 commented May 31, 2019

For reference, here is a concrete example from the DiPy project which uses OpenMP within Cython code (probably a bit more complex than what would be needed here, as we probably don't need to call omp_set_num_threads, etc., explicitly if pypocketfft does that for us):

At build time, setup.py attempts to compile a simple source that includes omp.h, and if it succeeds, OpenMP support is enabled:
https://github.com/nipy/dipy/blob/4b24d30ba6e4cb959d2cc788362241361c80f024/setup.py#L123-L141
where the add_flag_checking class is defined in setup_helpers.py:
https://github.com/nipy/dipy/blob/4b24d30ba6e4cb959d2cc788362241361c80f024/setup_helpers.py#L81
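For context, the probe such a check compiles is roughly the following (a sketch, not DiPy's actual test source): if it builds and links with the candidate flag (e.g. -fopenmp), OpenMP support is switched on.

```cpp
#include <omp.h>

int main()
{
    int n = 0;
    #pragma omp parallel
    {
        #pragma omp atomic
        ++n;
    }
    // Exit code 0 means the OpenMP runtime actually ran some threads.
    return n > 0 ? 0 : 1;
}
```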

Calls to omp_set_num_threads, omp_get_num_threads etc can be made at the Cython level via:
https://github.com/nipy/dipy/blob/master/dipy/utils/omp.pyx
which relies on:
https://github.com/nipy/dipy/blob/master/src/safe_openmp.pxd
https://github.com/nipy/dipy/blob/master/src/conditional_omp.h

Finally, an example of dynamically setting and then restoring the number of threads based on user input from within a Cython function:
https://github.com/nipy/dipy/blob/4b24d30ba6e4cb959d2cc788362241361c80f024/dipy/denoise/denspeed.pyx#L96-L111
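The set-then-restore pattern used there looks roughly like this when written directly against the OpenMP C API (a sketch of the pattern, not DiPy's code):

```cpp
#include <omp.h>

void run_with_nthreads(int requested)
{
    int original = omp_get_max_threads();   // remember the current setting
    if (requested > 0)
        omp_set_num_threads(requested);

    #pragma omp parallel
    {
        // ... the actual parallel work goes here ...
    }

    omp_set_num_threads(original);          // restore the previous setting
}
```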

@andyfaff
Contributor

andyfaff commented Jun 1, 2019

The openmp discussion is getting a bit fragmented now because it's spread over #10242, #10238, and here. This makes it harder for other devs to follow.
In my experience I just couldn't iron out all the issues with openmp. I would be very, very cautious about making it an integral part of the code from the outset (I was scarred by the entire experience). The following points are from that experience:

  1. The default clang compiler on macOS doesn't support openmp
    This means that macOS users wouldn't get the performant behaviour. I tried to get around this by installing the llvm-openmp and clang conda packages, then setting the CC and CXX environment variables to choose the new compiler. However, you then run into the issue of distributing wheels: the openmp library needs to be distributed as well.
  2. On macOS the openmp version crashes (if MKL-based numpy is installed) because two openmp libraries get initialised (the one that MKL uses and the llvm-openmp version from point 1). This can be circumvented by setting the KMP_DUPLICATE_LIB_OK='True' environment variable, but I'm not sure how to guarantee the clash wouldn't occur, and it's a general problem: several different openmp runtimes can coexist, e.g. libiomp, libomp, libgomp. I'm not sure if this issue also applies to other OSes; I develop on macOS.
  3. openmp doesn't play nicely with multiprocessing.Pool
    See A plan to safely support OpenMP in our Cython code base scikit-learn/scikit-learn#7650 and also https://joblib.readthedocs.io/en/latest/parallel.html#bad-interaction-of-multiprocessing-and-third-party-libraries

At the moment it is not safe to use OpenMP for low overhead thread based parallelism in our Cython code base because of a bad interaction between multiprocessing.Pool (worker processes are necessarily created by fork without exec under Python 2) and the openmp runtime library libgomp used in GCC. This can cause the program to silently freeze.

I also experienced this when running some Linux tests through a Docker image. To get around it one can use joblib with the Loky backend. joblib is a nice project, but I found (for my particular code) that it wasn't as performant as multiprocessing.
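For anyone wanting to reproduce point 3 outside of Python, the libgomp/fork interaction boils down to something like this minimal POSIX-only C++ sketch (whether it actually hangs depends on which OpenMP runtime ends up loaded):

```cpp
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    #pragma omp parallel
    { }                     // initialise the OpenMP runtime in the parent

    pid_t pid = fork();     // multiprocessing.Pool forks like this on Linux
    if (pid == 0) {
        #pragma omp parallel
        { }                 // with libgomp this region may never return
        std::printf("child finished\n");
        _exit(0);
    }
    waitpid(pid, nullptr, 0);
    return 0;
}
```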

One also has to consider the overall picture of how the parallelisation is carried out; it's going to run in a heterogeneous and nested environment. Here are some cases:

  • user does computation (FFT?) on their own machine --> wants to use as many processors as possible to parallelise the computation.
  • user is doing a parallelised task [e.g. minimisation (differential_evolution) using a Pool of workers] --> wants to use as many workers as there are cores. However, each distributed computation (using FFT) should only be single threaded.
  • user is doing a distributed task on a cluster, they want to parallelise over nodes, each node having several cores. Perhaps they want to parallelise over the nodes and cores on each node using MPI, with each discrete (FFT) computation being single threaded.
  • alternatively, the cluster computation could be MPI parallelised over the nodes, but on each node the (FFT) computation is parallelised over the cores using openmp.

The bottom line is that precise control is needed over how parallelisation is achieved, with no recourse to environment variables (see point 2), global state, etc.

I went down this path because I was using pthreads for macOS/posix, and Windows parallelisation calls (_beginthreadex, _endthreadex, WaitForMultipleObjects, etc). Using openmp would have cleaned up the code slightly, but in the end it just wasn't worth it; there were too many difficulties.

Would it be possible to continue discussions on this kind of thing on scipy-dev?

@larsoner
Member

larsoner commented Jun 1, 2019

Would it be possible to continue discussions on this kind of thing on scipy-dev?

Yep, sounds good. Perhaps copy-paste what you wrote above to move the discussion there?

For this specific issue, I think this implies we should not include it right now and add it later (if ever).

@mreineck
Contributor

The proposed parallelization for scipy.fft now uses standard multithreading instead of OpenMP (#10614), so this can perhaps be closed.

@larsoner
Member

Agreed, thanks @mreineck

@rgommers
Member

rgommers commented Jan 9, 2021

Linking the related mailing list thread which has quite a bit of relevant content, like how scikit-learn does manage to ship wheels with OpenMP: https://mail.python.org/pipermail/scipy-dev/2019-August/023635.html

@rgommers
Member

This is now the canonical issue for "why is OpenMP not allowed", and I wanted a shorter summary to reference. So here's an attempt:

  • Most important reason: because it plays badly with multiprocessing.
  • The different runtimes (e.g. libgomp and libiomp) don't play too well together.
  • Also, because nested parallelism is hard and OpenMP is very primitive UX-wise. The OpenMP model assumes that you build the whole stack with the same toolchain and use it for all your parallelism needs. With the "everyone builds separate wheels on PyPI and locally, and may employ multiple ways of achieving parallelism" model, it just doesn't work well.

@h-vetinari
Member

Also adding the following 2019 discussion from the ML for reference for anyone interested.

(Not my sleuthing, mentioned by @jb-leger in #13576)

@ogrisel
Contributor

ogrisel commented Nov 22, 2021

Most important reason: because it plays badly with multiprocessing.

Only libgomp (the default OpenMP runtime implementation that comes with GCC) is not fork-safe. The Microsoft OpenMP runtime (used by default when building with MSVC) and the LLVM OpenMP runtime (used by default when building with clang) are both fork-safe.

On Linux, it's possible to build with GCC and then replace libgomp by llvm-openmp. It should also be possible to build with clang directly. More details in the conda-forge documentation: https://conda-forge.org/docs/maintainer/knowledge_base.html#openmp

@mreineck
Contributor

If the OpenMP standard itself does not mandate this form of fork safety, I'd personally not rely on added guarantees that are made by specific implementations.

At least for C++, the current approach with a thread pool implemented with standard language features is very convenient to use, and (although I was reluctant to switch away from OpenMP myself at first) I'm now using it in my other C++ projects as well and am really happy with its reliability and portability.
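For illustration, the standard-library approach amounts to something like the following rough sketch (not pocketfft's actual thread pool):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Split the index range [0, n) across nthreads std::thread workers;
// no OpenMP runtime is involved at any point.
void parallel_for(std::size_t n, std::size_t nthreads,
                  const std::function<void(std::size_t, std::size_t)>& body)
{
    if (nthreads < 2 || n < 2) { body(0, n); return; }   // serial fallback
    std::vector<std::thread> workers;
    std::size_t chunk = (n + nthreads - 1) / nthreads;
    for (std::size_t t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(n, lo + chunk);
        if (lo >= hi) break;
        workers.emplace_back(body, lo, hi);               // worker handles [lo, hi)
    }
    for (auto& w : workers) w.join();
}
```

The thread count is just an ordinary function argument, which sidesteps the runtime-mixing and fork problems discussed above.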

However I admit that this is probably not a great help when multithreading is needed in extensions written in C or Fortran...

@ogrisel
Contributor

ogrisel commented Nov 23, 2021

However I admit that this is probably not a great help when multithreading is needed in extensions written in C or Fortran...

or Cython, pythran, or numba, which all have OpenMP integration.

@peterbell10
Member Author

peterbell10 commented Nov 25, 2021

On Linux, it's possible to build with GCC and then replace libgomp by llvm-openmp. It should also be possible to build with clang directly. More details in the conda-forge documentation: https://conda-forge.org/docs/maintainer/knowledge_base.html#openmp

The problem isn't that you can't use a fork-safe OpenMP; it's that you need to be able to guarantee that no user will end up loading libgomp at runtime. On PyTorch, for example, we've had bug reports whose outcome depends on whether import numpy or import torch happens first, because in some environments you end up using a different version of OpenMP.
