Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenBLAS hangs when calling DGEMM after a Unix fork #294

Closed
ogrisel opened this issue Sep 4, 2013 · 88 comments
Closed

OpenBLAS hangs when calling DGEMM after a Unix fork #294

ogrisel opened this issue Sep 4, 2013 · 88 comments

Comments

@ogrisel
Copy link
Contributor

ogrisel commented Sep 4, 2013

Here is a sample program that illustrate a problem when I want to use OpenBLAS with Unix fork (without exec):

https://gist.github.com/ogrisel/6440446

$ wget https://gist.github.com/ogrisel/6440446/raw/openblas_fork.c
$ gcc -lopenblas -L/opt/OpenBLAS/lib -I/opt/OpenBLAS/include \
           openblas_fork.c -o openblas_fork

Then calling GEMM with a 7x7 squared matrix in parent and child process works without issue:

$ ./openblas_fork 7
computing for size 7
c[0][0] is 7.000000
parent d[0][0] is 7.000000
child d[0][0] is 7.000000

Doing the same for a 8x8 matrix will cause the child process to never completes (while burning 100% of a CPU) until I kill it.

Might this be caused by some static initialization step in the OpenBLAS runtime? If so would it be possible during init to store the pid of the process at init time and then later on check that the current process pid still the same and if not reinit the runtime?

Atlas, Apple vecLib and Intel MKL have no trouble with Unix forks. Hence this limitation (bug?) prevents OpenBLAS to be used as a drop-in replacement for those guys. This is especially annoying when used with Python that uses Unix forks in its standard library via the multiprocessing module. As Python threads suffer from a locking issue, many project use multiprocessing as a work around.

@xianyi
Copy link
Collaborator

xianyi commented Sep 5, 2013

This is same to #240

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 5, 2013

Thanks, sorry I did not find it priori to reporting. I will have a deeper look.

While this is not the problem I reported here: we might want to be able to do blas calls in the parent process before and after the fork, so enabling openmp for openblas would be detrimental as this pattern is officially not supported by openmp.

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 5, 2013

To make myself clearer, here is a new version of the program:

$ wget https://gist.github.com/ogrisel/6440446/raw/openblas_fork_parent_init.c
$ gcc -lopenblas -L/opt/OpenBLAS/lib -I/opt/OpenBLAS/include \
           openblas_fork_parent_init.c -o openblas_fork_parent_init

If I build OpenBLAS with:

$ CC=gcc-4.7 make TARGET="NEHALEM" NO_AFFINITY=1 USE_OPENMP=1 USE_SIMPLE_THREADED_LEVEL3=1 NO_WARMUP=1
$ make install

Then this programs hangs forever because of libgomp not supporting OpenMP to be used in the parent process before a fork and in the child process (after the fork).

There is a bug on GCC here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52303 unfortunately it was deemed invalid by the GCC developers some time ago. This shortcoming of OpenMP is also documented in this OpenMP howto: http://bisqwit.iki.fi/story/howto/openmp/#OpenmpAndFork

It is my understanding that this OpenMP shortcoming is not implied by the standard itself but results from the specific implementation in libgomp.

In the mean time would it be possible to "fix" OpenBLAS built with USE_OPENMP=0 so as to have the openblas_fork_parent_init not crash nor hang? What change would be involved?

Both R and Python rely on forking for handling parallelization at their level due to limitations of their interpreter design. It would be great if imported libraries such as OpenBLAS would not crash in such a case.

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 5, 2013

If transparent fork-child detection is too complicated to implement in OpenBLAS itself, at least it would help to be able to have the functions blas_thread_shutdown and blas_thread_init exposed in the openblas dynlib, maybe under names such as openblas_thread_shutdown and openblas_thread_init so that the program doing the fork could call them around the fork.

Compiling statically is not really feasible in my use case: I have many libraries in my ecosystem who would need to compile statically against OpenBLAS: numpy, scipy, scikit-learn and maybe others. The library doing the fork would not to know about all the Python compiled extensions that could have been compiled statically against OpenBLAS. In scikit-learn alone there are at least several compiled extensions (.so files) that would be impacted. The library handling the parallelization using Python multiprocessing module would need to be aware about all of them. This is really not practical as they are considered private API.

@larsmans
Copy link
Contributor

larsmans commented Sep 5, 2013

I can't reproduce this (current head of the develop branch, Linux x86-64 running on Intel Core i7 (Sandy Bridge), clean build) -- I get a different bug instead where the child process seems to silently die during the DGEMM.

(Btw., dynamic linking is even more important because it should be possible to just drop in OpenBLAS without even recompiling. This is what the Debian/Ubuntu packages for BLAS, ATLAS and OpenBLAS do and it's a very easy way to get a speedup in NumPy and SciPy.)

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 5, 2013

@larsmans I have never observed a child crash silently: either it hangs with 100% CPU usage (when USE_OPENMP=0) or with 0% CPU usage (when USE_OPENMP=1). In both cases I have to kill the child process manually with pkill...

@larsmans
Copy link
Contributor

larsmans commented Sep 5, 2013

Correction, you're right, the child processes keep running in the background. I noticed when I added a waitpid to the program.

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 5, 2013

Alright, then it's probably that under Linux the default compiler is GCC and OpenMP is enabled by default by the OpenBLAS Makefile. This is not the case under OSX where the default C compiler is clang which does not support OpenMP yet.

@larsmans
Copy link
Contributor

larsmans commented Sep 5, 2013

@ogrisel: @xianyi's suggestion of compiling with

make NO_LAPACK=1 NO_AFFINITY=1 USE_OPENMP=1 USE_SIMPLE_THREADED_LEVEL3=1 NO_WARMUP=1

seems to work. An additional

cd /opt/OpenBLAS/lib
ln -s libopenblas_sandybridgep-r0.2.8.so libblas.so.3

still gives a significant speedup in np.dot. With some care (avoiding the segfaulting error handling in OpenBLAS), I can even compute matrix products in a joblib.Parallel.

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 5, 2013

Even when you call a np.dot in the parent process prior to forking with joblib.Parallel? I doubt it. This is the situation I emulate in C with the openblas_fork_parent_init.c program.

@larsmans
Copy link
Contributor

larsmans commented Sep 5, 2013

Yes, that seems to work too.

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 5, 2013

But the openblas_fork_parent_init.c program is able to trigger the crash no?

@larsmans
Copy link
Contributor

larsmans commented Sep 5, 2013

Not with those build flags.

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 5, 2013

I just recompiled OpenBLAS with:

CC=gcc-4.7 make NO_AFFINITY=1 USE_OPENMP=1 USE_SIMPLE_THREADED_LEVEL3=1 NO_WARMUP=1
make install PREFIX=/opt/OpenBLAS_with_OpenMP

Note that I did not disable LAPACK with NO_LAPACK=1 as numpy.dot does not work well otherwise (I do not understand why, but anyway LAPACK is required for scipy and scikit-learn so we have to have it enabled).
I then built numpy with this branch and the following site.cfg:

[openblas]
library_dirs = /opt/OpenBLAS_with_OpenMP/lib
include_dirs = /opt/OpenBLAS_with_OpenMP/include

Then when run this numpy_fork.py script I get a hung child process:

$ python numpy_fork.py
{'language': 'f77',
 'libraries': ['openblas'],
 'library_dirs': ['/opt/OpenBLAS_with_OpenMP/lib']}
{'language': 'f77',
 'libraries': ['openblas'],
 'library_dirs': ['/opt/OpenBLAS_with_OpenMP/lib']}
[13887] Calling dot before fork
[13887] Result: 0.900
[13887] Forking
[13887] Calling dot after fork
[13891] Calling dot after fork
[13887] Result: 0.900

I have to kill the hung process with pkill -f numpy_fork.py.

I have also made another version version of the script numpy_fork_omp_aware.py to make it possible to call omp_set_num_threads(1) on the OpenBLAS lib prior to forking assuming that libgomp was statically linked into it (which is apparently the case by default):

$ python openblask_fork_gist/numpy_fork_omp_aware.py /opt/OpenBLAS_with_OpenMP/lib/libopenblas_nehalemp-r0.2.8.dylib
{'language': 'f77',
 'libraries': ['openblas'],
 'library_dirs': ['/opt/OpenBLAS_with_OpenMP/lib']}
{'language': 'f77',
 'libraries': ['openblas'],
 'library_dirs': ['/opt/OpenBLAS_with_OpenMP/lib']}
[13857] Calling dot before fork
[13857] Result: 1.080
Loading shared lib: /opt/OpenBLAS_with_OpenMP/lib/libopenblas_nehalemp-r0.2.8.dylib
omp_get_num_threads() = 1
Calling omp_set_num_threads(1)
omp_get_num_threads() = 1
[13857] Forking
[13857] Calling dot after fork
[13861] Calling dot after fork
[13857] Result: 1.080
[13861] Result: 1.080

That works, but then OpenBLAS cannot used with multiple threads in the children. That might be a good work around for joblib though as one could expect the size of the joblib process pool to match the number of physical cores of the host machine.

What I don't understand is why omp_get_num_threads() = 1 before I call omp_set_num_threads(1)...

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 5, 2013

Not with those build flags.

I had not seen this. This is weird. The only difference is the NO_LAPACK=1 flag. It might be the culprit but we actually need LAPACK for other stuff in our case.

@xianyi
Copy link
Collaborator

xianyi commented Sep 5, 2013

Sorry for the delay.

The easy way is exporting names such as openblas_thread_shutdown and openblas_thread_init. I think I can implement it tomorrow.

@ogrisel , it's a good idea to store the pid. However, I need to evaluate the overhead.

@ larsmans, I think you need try bigger matrices to trigger the multi-threading.

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 5, 2013

Thanks for the feedback @xianyi . It would be great if you could implement a fork-safe mode for OpenBLAS that stores the pid of the thread pool initializer. I just had a quick look and I think a similar strategy can be implemented in gcc/libgomp/parallel.c to make it fork-safe. Unfortunately I probably won't have the time to experiment with that my-self in the short term.

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 10, 2013

@xianyi I will submit a patch to libgomp. I think the problem in OpenBLAS can be fixed similarly by using pthreads_atfork (assuming you are using pthreads to implement the non-OpenMP case):

https://github.com/ogrisel/gcc/compare/forksafe-omp-pthread_atfork

I think this is cleaner than using a pid and more robust to vfork / clone and sutff like this as @larsmans told me.

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 10, 2013

BTW my tentative fix for the libgomp implementation of OpenMP in GCC has been rejected as invalid and it's unlikely that libgomp will ever be made robust to raw forks.

The good news it that multiprocessing will have options (the spawn and forkserver modes) to mitigate the issue as soon as Python 3.4 is out:

http://bugs.python.org/issue8713

@larsmans
Copy link
Contributor

I guess this is fixed now, right? It's not OpenBLAS' fault, CPython violates the POSIX standard on this one.

@ogrisel
Copy link
Contributor Author

ogrisel commented Oct 28, 2013

I agree, let's close this issue.

@ogrisel ogrisel closed this as completed Oct 28, 2013
@mattcbro
Copy link

This bug is now a disaster on ubuntu 13.10 and thus linux mint 16. Using numpy with openblas causes linear algebra to slow to a crawl. Moreover a sufficiently complex application pushes all the cores to 100% utilization prior to freezing.

As a result we are back to atlas which is a pain to tune and link to the right libraries on ubuntu. While you might have closed this because it's python's fault, it's probably a lot harder to change python than openblas, and numpy has a huge application base that needs a good blas library.

@xianyi
Copy link
Collaborator

xianyi commented Dec 19, 2013

Hi @mattcbro ,
Thank you for the feedback. I will reopen this issue.
Could you provide the test code with numpy?

@xianyi xianyi reopened this Dec 19, 2013
@ogrisel
Copy link
Contributor Author

ogrisel commented Dec 19, 2013

@mattcbro what do you mean by "Using numpy with openblas causes linear algebra to slow to a crawl"? This issue is about multiprocessing's fork causing a numpy program to completely hang (stop and never complete) not just slowing down. Are you sure this the same issue?

Also unless OpenBLAS stops using OpenMP and only uses a manually managed thread pool, I think there is no way to fix the issue on OpenBLAS' side.

@larsmans
Copy link
Contributor

I don't think not using OpenMP would solve the problem. You still can't reliably fork a worker process in the context of POSIX threads.

@njsmith
Copy link

njsmith commented Feb 19, 2014

The flag trick definitely won't work for disabling affinity, because the
point of that trick is to avoid doing anything risky in the fork+exec case.
But we especially want to disable affinity in the fork+exec case.

In pretty sure sched_setaffinity is on the list of functions that you are
allowed to call in a forked child, though, so we can just do it.

Possibly I was being silly to bring it up here though; it is a bit of a
distraction from the current bug.

I have add a quick look at the affinity issue. The code looks quite
complicated but it might be possible to set a boolean flag in the child
after the fork as @njsmith https://github.com/njsmith did in his libgomp
patch and introspect that flag in exec_blas_* and call in turn:

disable_mapping = 1;gotoblas_affinity_init();

although the code in driver/others/init.c is rather complicated and writing
a test for that sounds painful.


Reply to this email directly or view it on
GitHubhttps://github.com//issues/294#issuecomment-35395868
.

@ogrisel
Copy link
Contributor Author

ogrisel commented Feb 19, 2014

@njsmith agreed, let us postpone the affinity issue for now and open a new issue once #294 is fixed.

@xianyi I think that __builtin_expect is supported by most compilers but I am not sure about Microsoft's compiler so I just removed them.

I'll try to have a look at the USE_THREAD=0 case soon.

@xianyi
Copy link
Collaborator

xianyi commented Feb 19, 2014

@ogrisel , I like lazy init in exec_blas_*.
I prefer unlikely macro in https://gist.github.com/xianyi/9091495 .

For affinity, in init.c, you can use sched_setaffinity(0, sizeof(cpu_orig_mask), &cpu_orig_mask[0]); to restore the default affinity setting.

ogrisel added a commit to ogrisel/OpenBLAS that referenced this issue Feb 19, 2014
ogrisel added a commit to ogrisel/OpenBLAS that referenced this issue Feb 19, 2014
@ogrisel
Copy link
Contributor Author

ogrisel commented Feb 19, 2014

Sorry for the rebase noise... I have just opened PR #343 to submit a fix for this issue. I tested it with gcc 4.7 with and without OpenMP (to check that the fork test is skipped with OpenMP) and clang-500.2.79 with and without USE_THREAD=0 (to check that it still builds in sequential mode).

I have not tested it under Windows but the fork test should not be built under windows (and the problem does not exist there as Windows does not implement fork).

I leave the affinity stuff of another PR as I would like to write a test for this case as well but I don't have the time to do it now.

@xianyi xianyi closed this as completed in 138a841 Feb 19, 2014
xianyi added a commit that referenced this issue Feb 19, 2014
@xianyi
Copy link
Collaborator

xianyi commented Feb 19, 2014

@ogrisel ,
Thank you for the patch.

@cartazio
Copy link

ummm, now my haskell bindings don't built any more!
https://travis-ci.org/wellposed/hOpenBLAS/jobs/19229988#L604

@ogrisel
Copy link
Contributor Author

ogrisel commented Feb 20, 2014

@cartazio sorry for that. I don't really understand how your build works but that might be caused buy the default OMP build that does not require the link to the pthread library. Here is a new PR to try and fix that: #345

@ogrisel
Copy link
Contributor Author

ogrisel commented Feb 20, 2014

@cartazio please feel free to comment on #345 if it fixes your build issue or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants