OpenBLAS hangs when calling DGEMM after a Unix fork #294

ogrisel · 2013-09-04T18:10:49Z

Here is a sample program that illustrate a problem when I want to use OpenBLAS with Unix fork (without exec):

https://gist.github.com/ogrisel/6440446

$ wget https://gist.github.com/ogrisel/6440446/raw/openblas_fork.c
$ gcc -lopenblas -L/opt/OpenBLAS/lib -I/opt/OpenBLAS/include \
           openblas_fork.c -o openblas_fork

Then calling GEMM with a 7x7 squared matrix in parent and child process works without issue:

$ ./openblas_fork 7
computing for size 7
c[0][0] is 7.000000
parent d[0][0] is 7.000000
child d[0][0] is 7.000000

Doing the same for a 8x8 matrix will cause the child process to never completes (while burning 100% of a CPU) until I kill it.

Might this be caused by some static initialization step in the OpenBLAS runtime? If so would it be possible during init to store the pid of the process at init time and then later on check that the current process pid still the same and if not reinit the runtime?

Atlas, Apple vecLib and Intel MKL have no trouble with Unix forks. Hence this limitation (bug?) prevents OpenBLAS to be used as a drop-in replacement for those guys. This is especially annoying when used with Python that uses Unix forks in its standard library via the multiprocessing module. As Python threads suffer from a locking issue, many project use multiprocessing as a work around.

The text was updated successfully, but these errors were encountered:

xianyi · 2013-09-05T00:53:55Z

This is same to #240

ogrisel · 2013-09-05T09:00:08Z

Thanks, sorry I did not find it priori to reporting. I will have a deeper look.

While this is not the problem I reported here: we might want to be able to do blas calls in the parent process before and after the fork, so enabling openmp for openblas would be detrimental as this pattern is officially not supported by openmp.

ogrisel · 2013-09-05T10:11:26Z

To make myself clearer, here is a new version of the program:

$ wget https://gist.github.com/ogrisel/6440446/raw/openblas_fork_parent_init.c
$ gcc -lopenblas -L/opt/OpenBLAS/lib -I/opt/OpenBLAS/include \
           openblas_fork_parent_init.c -o openblas_fork_parent_init

If I build OpenBLAS with:

$ CC=gcc-4.7 make TARGET="NEHALEM" NO_AFFINITY=1 USE_OPENMP=1 USE_SIMPLE_THREADED_LEVEL3=1 NO_WARMUP=1
$ make install

Then this programs hangs forever because of libgomp not supporting OpenMP to be used in the parent process before a fork and in the child process (after the fork).

There is a bug on GCC here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52303 unfortunately it was deemed invalid by the GCC developers some time ago. This shortcoming of OpenMP is also documented in this OpenMP howto: http://bisqwit.iki.fi/story/howto/openmp/#OpenmpAndFork

It is my understanding that this OpenMP shortcoming is not implied by the standard itself but results from the specific implementation in libgomp.

In the mean time would it be possible to "fix" OpenBLAS built with USE_OPENMP=0 so as to have the openblas_fork_parent_init not crash nor hang? What change would be involved?

Both R and Python rely on forking for handling parallelization at their level due to limitations of their interpreter design. It would be great if imported libraries such as OpenBLAS would not crash in such a case.

ogrisel · 2013-09-05T11:50:09Z

If transparent fork-child detection is too complicated to implement in OpenBLAS itself, at least it would help to be able to have the functions blas_thread_shutdown and blas_thread_init exposed in the openblas dynlib, maybe under names such as openblas_thread_shutdown and openblas_thread_init so that the program doing the fork could call them around the fork.

Compiling statically is not really feasible in my use case: I have many libraries in my ecosystem who would need to compile statically against OpenBLAS: numpy, scipy, scikit-learn and maybe others. The library doing the fork would not to know about all the Python compiled extensions that could have been compiled statically against OpenBLAS. In scikit-learn alone there are at least several compiled extensions (.so files) that would be impacted. The library handling the parallelization using Python multiprocessing module would need to be aware about all of them. This is really not practical as they are considered private API.

larsmans · 2013-09-05T11:55:46Z

I can't reproduce this (current head of the develop branch, Linux x86-64 running on Intel Core i7 (Sandy Bridge), clean build) -- I get a different bug instead where the child process seems to silently die during the DGEMM.

(Btw., dynamic linking is even more important because it should be possible to just drop in OpenBLAS without even recompiling. This is what the Debian/Ubuntu packages for BLAS, ATLAS and OpenBLAS do and it's a very easy way to get a speedup in NumPy and SciPy.)

ogrisel · 2013-09-05T11:59:46Z

@larsmans I have never observed a child crash silently: either it hangs with 100% CPU usage (when USE_OPENMP=0) or with 0% CPU usage (when USE_OPENMP=1). In both cases I have to kill the child process manually with pkill...

larsmans · 2013-09-05T12:05:17Z

Correction, you're right, the child processes keep running in the background. I noticed when I added a waitpid to the program.

ogrisel · 2013-09-05T12:10:45Z

Alright, then it's probably that under Linux the default compiler is GCC and OpenMP is enabled by default by the OpenBLAS Makefile. This is not the case under OSX where the default C compiler is clang which does not support OpenMP yet.

larsmans · 2013-09-05T12:38:22Z

@ogrisel: @xianyi's suggestion of compiling with

make NO_LAPACK=1 NO_AFFINITY=1 USE_OPENMP=1 USE_SIMPLE_THREADED_LEVEL3=1 NO_WARMUP=1

seems to work. An additional

cd /opt/OpenBLAS/lib
ln -s libopenblas_sandybridgep-r0.2.8.so libblas.so.3

still gives a significant speedup in np.dot. With some care (avoiding the segfaulting error handling in OpenBLAS), I can even compute matrix products in a joblib.Parallel.

ogrisel · 2013-09-05T12:42:25Z

Even when you call a np.dot in the parent process prior to forking with joblib.Parallel? I doubt it. This is the situation I emulate in C with the openblas_fork_parent_init.c program.

larsmans · 2013-09-05T12:56:19Z

Yes, that seems to work too.

ogrisel · 2013-09-05T12:59:29Z

But the openblas_fork_parent_init.c program is able to trigger the crash no?

larsmans · 2013-09-05T13:17:17Z

Not with those build flags.

ogrisel · 2013-09-05T14:00:55Z

I just recompiled OpenBLAS with:

CC=gcc-4.7 make NO_AFFINITY=1 USE_OPENMP=1 USE_SIMPLE_THREADED_LEVEL3=1 NO_WARMUP=1
make install PREFIX=/opt/OpenBLAS_with_OpenMP

Note that I did not disable LAPACK with NO_LAPACK=1 as numpy.dot does not work well otherwise (I do not understand why, but anyway LAPACK is required for scipy and scikit-learn so we have to have it enabled).
I then built numpy with this branch and the following site.cfg:

[openblas]
library_dirs = /opt/OpenBLAS_with_OpenMP/lib
include_dirs = /opt/OpenBLAS_with_OpenMP/include

Then when run this numpy_fork.py script I get a hung child process:

$ python numpy_fork.py
{'language': 'f77',
 'libraries': ['openblas'],
 'library_dirs': ['/opt/OpenBLAS_with_OpenMP/lib']}
{'language': 'f77',
 'libraries': ['openblas'],
 'library_dirs': ['/opt/OpenBLAS_with_OpenMP/lib']}
[13887] Calling dot before fork
[13887] Result: 0.900
[13887] Forking
[13887] Calling dot after fork
[13891] Calling dot after fork
[13887] Result: 0.900

I have to kill the hung process with pkill -f numpy_fork.py.

I have also made another version version of the script numpy_fork_omp_aware.py to make it possible to call omp_set_num_threads(1) on the OpenBLAS lib prior to forking assuming that libgomp was statically linked into it (which is apparently the case by default):

$ python openblask_fork_gist/numpy_fork_omp_aware.py /opt/OpenBLAS_with_OpenMP/lib/libopenblas_nehalemp-r0.2.8.dylib
{'language': 'f77',
 'libraries': ['openblas'],
 'library_dirs': ['/opt/OpenBLAS_with_OpenMP/lib']}
{'language': 'f77',
 'libraries': ['openblas'],
 'library_dirs': ['/opt/OpenBLAS_with_OpenMP/lib']}
[13857] Calling dot before fork
[13857] Result: 1.080
Loading shared lib: /opt/OpenBLAS_with_OpenMP/lib/libopenblas_nehalemp-r0.2.8.dylib
omp_get_num_threads() = 1
Calling omp_set_num_threads(1)
omp_get_num_threads() = 1
[13857] Forking
[13857] Calling dot after fork
[13861] Calling dot after fork
[13857] Result: 1.080
[13861] Result: 1.080

That works, but then OpenBLAS cannot used with multiple threads in the children. That might be a good work around for joblib though as one could expect the size of the joblib process pool to match the number of physical cores of the host machine.

What I don't understand is why omp_get_num_threads() = 1 before I call omp_set_num_threads(1)...

ogrisel · 2013-09-05T14:07:46Z

Not with those build flags.

I had not seen this. This is weird. The only difference is the NO_LAPACK=1 flag. It might be the culprit but we actually need LAPACK for other stuff in our case.

xianyi · 2013-09-05T15:34:37Z

Sorry for the delay.

The easy way is exporting names such as openblas_thread_shutdown and openblas_thread_init. I think I can implement it tomorrow.

@ogrisel , it's a good idea to store the pid. However, I need to evaluate the overhead.

@ larsmans, I think you need try bigger matrices to trigger the multi-threading.

ogrisel · 2013-09-05T15:45:28Z

Thanks for the feedback @xianyi . It would be great if you could implement a fork-safe mode for OpenBLAS that stores the pid of the thread pool initializer. I just had a quick look and I think a similar strategy can be implemented in gcc/libgomp/parallel.c to make it fork-safe. Unfortunately I probably won't have the time to experiment with that my-self in the short term.

ogrisel · 2013-09-10T09:00:59Z

@xianyi I will submit a patch to libgomp. I think the problem in OpenBLAS can be fixed similarly by using pthreads_atfork (assuming you are using pthreads to implement the non-OpenMP case):

https://github.com/ogrisel/gcc/compare/forksafe-omp-pthread_atfork

I think this is cleaner than using a pid and more robust to vfork / clone and sutff like this as @larsmans told me.

ogrisel · 2013-09-10T14:40:03Z

BTW my tentative fix for the libgomp implementation of OpenMP in GCC has been rejected as invalid and it's unlikely that libgomp will ever be made robust to raw forks.

The good news it that multiprocessing will have options (the spawn and forkserver modes) to mitigate the issue as soon as Python 3.4 is out:

http://bugs.python.org/issue8713

larsmans · 2013-10-28T15:47:05Z

I guess this is fixed now, right? It's not OpenBLAS' fault, CPython violates the POSIX standard on this one.

ogrisel · 2013-10-28T16:46:35Z

I agree, let's close this issue.

mattcbro · 2013-12-19T06:25:35Z

This bug is now a disaster on ubuntu 13.10 and thus linux mint 16. Using numpy with openblas causes linear algebra to slow to a crawl. Moreover a sufficiently complex application pushes all the cores to 100% utilization prior to freezing.

As a result we are back to atlas which is a pain to tune and link to the right libraries on ubuntu. While you might have closed this because it's python's fault, it's probably a lot harder to change python than openblas, and numpy has a huge application base that needs a good blas library.

xianyi · 2013-12-19T07:18:53Z

Hi @mattcbro ,
Thank you for the feedback. I will reopen this issue.
Could you provide the test code with numpy?

ogrisel · 2013-12-19T08:36:43Z

@mattcbro what do you mean by "Using numpy with openblas causes linear algebra to slow to a crawl"? This issue is about multiprocessing's fork causing a numpy program to completely hang (stop and never complete) not just slowing down. Are you sure this the same issue?

Also unless OpenBLAS stops using OpenMP and only uses a manually managed thread pool, I think there is no way to fix the issue on OpenBLAS' side.

larsmans · 2013-12-19T09:05:50Z

I don't think not using OpenMP would solve the problem. You still can't reliably fork a worker process in the context of POSIX threads.

njsmith · 2014-02-19T03:37:24Z

The flag trick definitely won't work for disabling affinity, because the
point of that trick is to avoid doing anything risky in the fork+exec case.
But we especially want to disable affinity in the fork+exec case.

In pretty sure sched_setaffinity is on the list of functions that you are
allowed to call in a forked child, though, so we can just do it.

Possibly I was being silly to bring it up here though; it is a bit of a
distraction from the current bug.

I have add a quick look at the affinity issue. The code looks quite
complicated but it might be possible to set a boolean flag in the child
after the fork as @njsmith https://github.com/njsmith did in his libgomp
patch and introspect that flag in exec_blas_* and call in turn:

disable_mapping = 1;gotoblas_affinity_init();

although the code in driver/others/init.c is rather complicated and writing
a test for that sounds painful.

—
Reply to this email directly or view it on
GitHubhttps://github.com//issues/294#issuecomment-35395868
.

ogrisel · 2014-02-19T09:10:11Z

@njsmith agreed, let us postpone the affinity issue for now and open a new issue once #294 is fixed.

@xianyi I think that __builtin_expect is supported by most compilers but I am not sure about Microsoft's compiler so I just removed them.

I'll try to have a look at the USE_THREAD=0 case soon.

xianyi · 2014-02-19T13:23:15Z

@ogrisel , I like lazy init in exec_blas_*.
I prefer unlikely macro in https://gist.github.com/xianyi/9091495 .

For affinity, in init.c, you can use sched_setaffinity(0, sizeof(cpu_orig_mask), &cpu_orig_mask[0]); to restore the default affinity setting.

…pthread_atfork

… a Unix fork." This reverts commit 3617c22.

ogrisel · 2014-02-19T18:15:37Z

Sorry for the rebase noise... I have just opened PR #343 to submit a fix for this issue. I tested it with gcc 4.7 with and without OpenMP (to check that the fork test is skipped with OpenMP) and clang-500.2.79 with and without USE_THREAD=0 (to check that it still builds in sequential mode).

I have not tested it under Windows but the fork test should not be built under windows (and the problem does not exist there as Windows does not implement fork).

I leave the affinity stuff of another PR as I would like to write a test for this case as well but I don't have the time to do it now.

FIX #294: fork-safe pthread mode

xianyi · 2014-02-19T22:59:00Z

@ogrisel ,
Thank you for the patch.

cartazio · 2014-02-20T02:17:38Z

ummm, now my haskell bindings don't built any more!
https://travis-ci.org/wellposed/hOpenBLAS/jobs/19229988#L604

ogrisel · 2014-02-20T09:01:20Z

@cartazio sorry for that. I don't really understand how your build works but that might be caused buy the default OMP build that does not require the link to the pthread library. Here is a new PR to try and fix that: #345

ogrisel · 2014-02-20T10:35:39Z

@cartazio please feel free to comment on #345 if it fixes your build issue or not.

This was referenced Sep 4, 2013

Parallel K-Means hangs on Mac OS X Lion scikit-learn/scikit-learn#636

Closed

BLD: Allow easy use of openblas. numpy/numpy#3642

Merged

ogrisel mentioned this issue Sep 5, 2013

[WIP] nogil decls in SGD and WeightVector code scikit-learn/scikit-learn#2419

Closed

ogrisel closed this as completed Oct 28, 2013

xianyi reopened this Dec 19, 2013

ogrisel added a commit to ogrisel/OpenBLAS that referenced this issue Feb 19, 2014

FIX OpenMathLib#294: make OpenBLAS thread-pool resilient to fork via …

c0a7b12

…pthread_atfork

ogrisel added a commit to ogrisel/OpenBLAS that referenced this issue Feb 19, 2014

Revert "Refs OpenMathLib#294. Used pthread_atfork to avoid hang after…

046e401

… a Unix fork." This reverts commit 3617c22.

ogrisel mentioned this issue Feb 19, 2014

FIX #294: fork-safe pthread mode #343

Merged

xianyi closed this as completed in 138a841 Feb 19, 2014

xianyi added a commit that referenced this issue Feb 19, 2014

Merge pull request #343 from ogrisel/fix-294-fork-safe-pthread

b20ee69

FIX #294: fork-safe pthread mode

ogrisel mentioned this issue Feb 20, 2014

Do not reference pthread_atfork in non-SMP_SERVER mode #345

Merged

jakirkham mentioned this issue Aug 14, 2014

OpenBLAS hangs on forks nanshe-org/nanshe#139

Closed

mattcbro mentioned this issue Sep 30, 2014

Fork Bug Not Fixed #455

Closed

juliantaylor mentioned this issue Jan 5, 2015

Drotmg failure on Ubuntu 14.04 #484

Closed

traversaro mentioned this issue Mar 24, 2015

[YARP_math] Problems when linking YARP_math with OpenBLAS on Debian Wheezy robotology/yarp#456

Closed

ogrisel mentioned this issue Apr 7, 2015

NumPy "dot" hangs when used with multiprocessing (potentially Apple Accelerate related?) numpy/numpy#5752

Closed

jseabold mentioned this issue May 9, 2015

ENH: Pickle xgbooster enhancments. dmlc/xgboost#299

Merged

bwlewis mentioned this issue Dec 19, 2015

Allow forks on doRedis workers bwlewis/doRedis#31

Closed

jseabold mentioned this issue Feb 15, 2016

Consider using forkserver on Python 3 for multiprocessing backend by default stan-dev/pystan2#210

Closed

martin-frbg mentioned this issue Sep 9, 2016

Segmentation fault on aarch64 #956

Closed

martin-frbg mentioned this issue Feb 11, 2018

Compilation hangs on fork safety test #1456

Closed

ywwry66 mentioned this issue Apr 19, 2021

R parallel computing deadlock with openblas Homebrew/homebrew-core#75506

Closed

2 tasks

orenbenkiki mentioned this issue Jul 7, 2023

Thread explosion tanaylab/metacells#50

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OpenBLAS hangs when calling DGEMM after a Unix fork #294

OpenBLAS hangs when calling DGEMM after a Unix fork #294

ogrisel commented Sep 4, 2013

xianyi commented Sep 5, 2013

ogrisel commented Sep 5, 2013

ogrisel commented Sep 5, 2013

ogrisel commented Sep 5, 2013

larsmans commented Sep 5, 2013

ogrisel commented Sep 5, 2013

larsmans commented Sep 5, 2013

ogrisel commented Sep 5, 2013

larsmans commented Sep 5, 2013

ogrisel commented Sep 5, 2013

larsmans commented Sep 5, 2013

ogrisel commented Sep 5, 2013

larsmans commented Sep 5, 2013

ogrisel commented Sep 5, 2013

ogrisel commented Sep 5, 2013

xianyi commented Sep 5, 2013

ogrisel commented Sep 5, 2013

ogrisel commented Sep 10, 2013

ogrisel commented Sep 10, 2013

larsmans commented Oct 28, 2013

ogrisel commented Oct 28, 2013

mattcbro commented Dec 19, 2013

xianyi commented Dec 19, 2013

ogrisel commented Dec 19, 2013

larsmans commented Dec 19, 2013

njsmith commented Feb 19, 2014

ogrisel commented Feb 19, 2014

xianyi commented Feb 19, 2014

ogrisel commented Feb 19, 2014

xianyi commented Feb 19, 2014

cartazio commented Feb 20, 2014

ogrisel commented Feb 20, 2014

ogrisel commented Feb 20, 2014

OpenBLAS hangs when calling DGEMM after a Unix fork #294

OpenBLAS hangs when calling DGEMM after a Unix fork #294

Comments

ogrisel commented Sep 4, 2013

xianyi commented Sep 5, 2013

ogrisel commented Sep 5, 2013

ogrisel commented Sep 5, 2013

ogrisel commented Sep 5, 2013

larsmans commented Sep 5, 2013

ogrisel commented Sep 5, 2013

larsmans commented Sep 5, 2013

ogrisel commented Sep 5, 2013

larsmans commented Sep 5, 2013

ogrisel commented Sep 5, 2013

larsmans commented Sep 5, 2013

ogrisel commented Sep 5, 2013

larsmans commented Sep 5, 2013

ogrisel commented Sep 5, 2013

ogrisel commented Sep 5, 2013

xianyi commented Sep 5, 2013

ogrisel commented Sep 5, 2013

ogrisel commented Sep 10, 2013

ogrisel commented Sep 10, 2013

larsmans commented Oct 28, 2013

ogrisel commented Oct 28, 2013

mattcbro commented Dec 19, 2013

xianyi commented Dec 19, 2013

ogrisel commented Dec 19, 2013

larsmans commented Dec 19, 2013

njsmith commented Feb 19, 2014

ogrisel commented Feb 19, 2014

xianyi commented Feb 19, 2014

ogrisel commented Feb 19, 2014

xianyi commented Feb 19, 2014

cartazio commented Feb 20, 2014

ogrisel commented Feb 20, 2014

ogrisel commented Feb 20, 2014