
scipy.fft: Can OpenMP be used portably #10239

Closed
peterbell10 opened this issue May 31, 2019 · 16 comments

@peterbell10
Member

Related to #10238

Can we build and use OpenMP in a portable way?

This will require compiler-dependent build arguments, which I can't see explicitly mentioned in the numpy distutils guide. It also needs to link against the OpenMP runtime, which may be an issue.

@ilayn added the Build issues and scipy.fft labels May 31, 2019
@larsoner
Member

Can we start without it and add it as an enhancement later?

If it's an integral / non-optional part of the software we'll have to figure this out now. But if not, I'd rather wait. As soon as you talk about multithreading the C code, it can be very useful, but you have to think about things like whether you can set the number of threads for each call, what the interface for doing so looks like, etc. Even on a multicore machine, you might know that you are going to use joblib to run N tasks over N CPUs, each of which calls scipy.fft, so really you want each of those calls to do a single-threaded computation. Things like this increase the complexity level.
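To make the interface question concrete, here is a minimal sketch of per-call thread control in C++/OpenMP; the fft_inplace name and signature are made up for illustration, not a proposed SciPy API:

```cpp
// Hypothetical sketch: the caller passes nthreads explicitly, so e.g. a joblib
// worker can force single-threaded execution without touching global state.
#include <complex>
#include <cstddef>

void fft_inplace(std::complex<double>* data, std::size_t n, int nthreads)
{
    const int nt = nthreads > 0 ? nthreads : 1;   // guard against nonsense input
    // num_threads() binds the thread count to this one parallel region only
    #pragma omp parallel for num_threads(nt)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i) {
        data[i] *= 1.0;   // placeholder for the real per-chunk transform work
    }
}
```

A joblib-driven workflow would then call it with nthreads=1, while an interactive user on their own machine could pass the core count.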

You'll probably notice a theme in these responses: trying to start with something small that works and build up from there, rather than trying to include everything at once. It allows for easier review and also easier division of labor (e.g., someone else could do OpenMP once the basics of scipy.fft are in place if they are motivated to do so).

@peterbell10
Member Author

Okay, I'll remove the worker argument for now and build without OpenMP on all platforms.

@rgommers
Member

I am interested in figuring out the status of OpenMP though. I've looked for an up-to-date description before, but didn't find one. Does anyone know of one? It would need to cover:

  • portability
  • compiler support
  • playing well with other libraries that may use OpenMP (same or other version), multiprocessing, or threading.

@grlee77
Contributor

grlee77 commented May 31, 2019

  • compiler support

Here is detailed OpenMP compiler support info

  • playing well with other libraries that may use OpenMP (same or other version), multiprocessing, or threading.

Calls to omp_set_num_threads take precedence over the OMP_NUM_THREADS environment variable. There is also a num_threads clause that can be used to override either of these. This is explained in detail here.
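As a small illustration of that precedence (a sketch, assuming a compiler building with OpenMP enabled, e.g. -fopenmp):

```cpp
#include <cstdio>
#include <omp.h>

int main()
{
    // OMP_NUM_THREADS only sets the initial default; this call overrides it.
    omp_set_num_threads(4);

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("default region: %d threads\n", omp_get_num_threads());
    }

    // A num_threads() clause overrides both the env var and omp_set_num_threads,
    // but only for this one region.
    #pragma omp parallel num_threads(2)
    {
        #pragma omp single
        std::printf("num_threads(2) region: %d threads\n", omp_get_num_threads());
    }
    return 0;
}
```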

@grlee77
Contributor

grlee77 commented May 31, 2019

For reference, here is a concrete example from the DiPy project which uses OpenMP within Cython code (probably a bit more complex than what would be needed here, as we probably don't need to call omp_set_num_threads, etc., explicitly if pypocketfft does that for us):

At build time, setup.py attempts to compile a simple source that includes omp.h, and if it succeeds, OpenMP support is enabled:
https://github.com/nipy/dipy/blob/4b24d30ba6e4cb959d2cc788362241361c80f024/setup.py#L123-L141
where the add_flag_checking class is defined in setup_helpers.py:
https://github.com/nipy/dipy/blob/4b24d30ba6e4cb959d2cc788362241361c80f024/setup_helpers.py#L81
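For context, the probe such a check compiles is roughly the following (a sketch, not DiPy's actual test source): if it builds and links with the candidate flag (e.g. -fopenmp), OpenMP support is switched on.

```cpp
#include <omp.h>

int main()
{
    int n = 0;
    #pragma omp parallel
    {
        #pragma omp atomic
        ++n;
    }
    // Exit code 0 means the OpenMP runtime actually ran some threads.
    return n > 0 ? 0 : 1;
}
```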

Calls to omp_set_num_threads, omp_get_num_threads etc can be made at the Cython level via:
https://github.com/nipy/dipy/blob/master/dipy/utils/omp.pyx
which relies on:
https://github.com/nipy/dipy/blob/master/src/safe_openmp.pxd
https://github.com/nipy/dipy/blob/master/src/conditional_omp.h

Finally, an example of dynamically setting and then restoring the number of threads based on user input from within a Cython function:
https://github.com/nipy/dipy/blob/4b24d30ba6e4cb959d2cc788362241361c80f024/dipy/denoise/denspeed.pyx#L96-L111
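The set-then-restore pattern used there looks roughly like this when written directly against the OpenMP C API (a sketch of the pattern, not DiPy's code):

```cpp
#include <omp.h>

void run_with_nthreads(int requested)
{
    int original = omp_get_max_threads();   // remember the current setting
    if (requested > 0)
        omp_set_num_threads(requested);

    #pragma omp parallel
    {
        // ... the actual parallel work goes here ...
    }

    omp_set_num_threads(original);          // restore the previous setting
}
```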

@andyfaff
Contributor

andyfaff commented Jun 1, 2019

The openmp discussion is getting a bit fragmented now because it's spread over #10242, #10238, and here. This makes it harder for other devs to follow.
In my experience I just couldn't iron out all the issues with openmp. I would be very, very cautious about making it an integral part of the code from the outset (I was scarred by the entire experience). The following points are from that experience:

  1. The default clang compiler on macOS doesn't support openmp
    This means that macOS users wouldn't get the performant behaviour. I tried to get around this by installing the llvm-openmp and clang conda packages, then setting the CC and CXX environment variables to choose the new compiler. However, you then run into the issue of distributing wheels: the openmp library needs to be distributed as well.
  2. On macOS the openmp version crashes (if MKL-based numpy is installed) because two openmp libraries get initialised (the one that MKL uses and the llvm-openmp version from point 1). This can be circumvented by setting the KMP_DUPLICATE_LIB_OK='True' environment variable, but I'm not sure how to guarantee the clash wouldn't occur, and it's a general problem: several different openmp runtimes can coexist, e.g. libiomp, libomp, libgomp. I'm not sure if this issue also applies to other OSes; I develop on macOS.
  3. openmp doesn't play nicely with multiprocessing.Pool
    See A plan to safely support OpenMP in our Cython code base scikit-learn/scikit-learn#7650 and also https://joblib.readthedocs.io/en/latest/parallel.html#bad-interaction-of-multiprocessing-and-third-party-libraries

At the moment it is not safe to use OpenMP for low overhead thread based parallelism in our Cython code base because of a bad interaction between multiprocessing.Pool (worker processes are necessarily created by fork without exec under Python 2) and the openmp runtime library libgomp used in GCC. This can cause the program to silently freeze.

I also experienced this when running some Linux tests through a Docker image. To get around it one can use joblib with the Loky backend. joblib is a nice project, but I found (for my particular code) that it wasn't as performant as multiprocessing.
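For anyone wanting to reproduce point 3 outside of Python, the libgomp/fork interaction boils down to something like this minimal POSIX-only C++ sketch (whether it actually hangs depends on which OpenMP runtime ends up loaded):

```cpp
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    #pragma omp parallel
    { }                     // initialise the OpenMP runtime in the parent

    pid_t pid = fork();     // multiprocessing.Pool forks like this on Linux
    if (pid == 0) {
        #pragma omp parallel
        { }                 // with libgomp this region may never return
        std::printf("child finished\n");
        _exit(0);
    }
    waitpid(pid, nullptr, 0);
    return 0;
}
```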

One also has to consider the overall picture of how the parallelisation is carried out; it's going to run in a heterogeneous and nested environment. Here are some cases:

  • user does computation (FFT?) on their own machine --> wants to use as many processors as possible to parallelise the computation.
  • user is doing a parallelised task [e.g. minimisation (differential_evolution) using a Pool of workers] --> wants to use as many workers as there are cores. However, each distributed computation (using FFT) should only be single threaded.
  • user is doing a distributed task on a cluster, they want to parallelise over nodes, each node having several cores. Perhaps they want to parallelise over the nodes and cores on each node using MPI, with each discrete (FFT) computation being single threaded.
  • alternatively, the cluster computation could be MPI parallelised over the nodes, but on each node the (FFT) computation is parallelised over the cores using openmp.

The bottom line is that precise control is needed over how parallelisation is achieved, with no recourse to environment variables (see point 2), global state, etc.

I went down this path because I was using pthreads for macOS/posix, and Windows parallelisation calls (_beginthreadex, _endthreadex, WaitForMultipleObjects, etc). Using openmp would have cleaned up the code slightly, but in the end it just wasn't worth it; there were too many difficulties.

Would it be possible to continue discussions on this kind of thing on scipy-dev?

@larsoner
Member

larsoner commented Jun 1, 2019

Would it be possible to continue discussions on this kind of thing on scipy-dev?

Yep, sounds good. Perhaps copy-paste what you wrote above to move the discussion there?

For this specific issue, I think this implies we should not include it right now and add it later (if ever).

@mreineck
Contributor

The proposed parallelization for scipy.fft now uses standard multithreading instead of OpenMP (#10614), so this can perhaps be closed.

@larsoner
Member

Agreed, thanks @mreineck

@rgommers
Member

rgommers commented Jan 9, 2021

Linking the related mailing list thread which has quite a bit of relevant content, like how scikit-learn does manage to ship wheels with OpenMP: https://mail.python.org/pipermail/scipy-dev/2019-August/023635.html

@rgommers
Member

This is now the canonical issue for "why is OpenMP not allowed", and I wanted a shorter summary to reference. So here's an attempt:

  • Most important reason: because it plays badly with multiprocessing.
  • The different runtimes (e.g. libgomp and libiomp) don't play too well together.
  • Also, because nested parallelism is hard and OpenMP is very primitive UX-wise. The OpenMP model assumes that you build the whole stack with the same toolchain and use it for all your parallelism needs. With the "everyone builds separate wheels on PyPI and locally, and may employ multiple ways of achieving parallelism" model, it just doesn't work well.

@h-vetinari
Member

Also adding the following 2019 discussion from the ML for reference for anyone interested.

(Not my sleuthing, mentioned by @jb-leger in #13576)

@ogrisel
Contributor

ogrisel commented Nov 22, 2021

Most important reason: because it plays badly with multiprocessing.

Only libgomp (the default OpenMP runtime implementation that comes with GCC) is not fork-safe. The Microsoft OpenMP runtime (used by default when building with MSVC) and the LLVM OpenMP runtime (used by default when building with clang) are both fork-safe.

On Linux, it's possible to build with GCC and then replace libgomp by llvm-openmp. It should also be possible to build with clang directly. More details in the conda-forge documentation: https://conda-forge.org/docs/maintainer/knowledge_base.html#openmp

@mreineck
Contributor

If the OpenMP standard itself does not mandate this form of fork safety, I'd personally not rely on added guarantees that are made by specific implementations.

At least for C++, the current approach with a thread pool implemented with standard language features is very convenient to use, and (although I was reluctant to switch away from OpenMP myself at first) I'm now using it in my other C++ projects as well and am really happy with its reliability and portability.
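For illustration, the standard-library approach amounts to something like the following rough sketch (not pocketfft's actual thread pool):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Split the index range [0, n) across nthreads std::thread workers;
// no OpenMP runtime is involved at any point.
void parallel_for(std::size_t n, std::size_t nthreads,
                  const std::function<void(std::size_t, std::size_t)>& body)
{
    if (nthreads < 2 || n < 2) { body(0, n); return; }   // serial fallback
    std::vector<std::thread> workers;
    std::size_t chunk = (n + nthreads - 1) / nthreads;
    for (std::size_t t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(n, lo + chunk);
        if (lo >= hi) break;
        workers.emplace_back(body, lo, hi);               // worker handles [lo, hi)
    }
    for (auto& w : workers) w.join();
}
```

The thread count is just an ordinary function argument, which sidesteps the runtime-mixing and fork problems discussed above.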

However I admit that this is probably not a great help when multithreading is needed in extensions written in C or Fortran...

@ogrisel
Contributor

ogrisel commented Nov 23, 2021

However I admit that this is probably not a great help when multithreading is needed in extensions written in C or Fortran...

or Cython, pythran, or numba, which all have OpenMP integration.

@peterbell10
Member Author

peterbell10 commented Nov 25, 2021

On Linux, it's possible to build with GCC and then replace libgomp by llvm-openmp. It should also be possible to build with clang directly. More details in the conda-forge documentation: https://conda-forge.org/docs/maintainer/knowledge_base.html#openmp

The problem isn't that you can't use a fork-safe OpenMP; it's that you need to be able to guarantee that no user will end up loading libgomp at runtime. On PyTorch, for example, we've had bug reports whose outcome depends on whether import numpy or import torch happens first, because in some environments you end up using a different version of OpenMP.
