OpenBLAS hangs when calling DGEMM after a Unix fork #294
Comments
This is the same as #240.
Thanks, sorry I did not find it prior to reporting. I will have a deeper look. That said, beyond the problem I reported here, we might want to be able to do BLAS calls in the parent process both before and after the fork, so enabling OpenMP for OpenBLAS would be detrimental, as this pattern is officially not supported by OpenMP.
To make myself clearer, here is a new version of the program:
If I build OpenBLAS with:
Then this program hangs forever, because libgomp does not support using OpenMP in the parent process before a fork and then again in the child process after the fork. There is a GCC bug about this: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52303 Unfortunately, it was deemed invalid by the GCC developers some time ago. This shortcoming of OpenMP is also documented in this OpenMP howto: http://bisqwit.iki.fi/story/howto/openmp/#OpenmpAndFork

It is my understanding that this shortcoming is not implied by the standard itself but results from the specific implementation in libgomp. In the meantime, would it be possible to "fix" OpenBLAS when built this way? Both R and Python rely on forking to handle parallelization at their level, due to limitations of their interpreter designs. It would be great if imported libraries such as OpenBLAS did not hang in such a case.
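The underlying POSIX behavior can be illustrated with plain Python threads (a sketch; libgomp's pool lives at the C level, but the failure mode is the same): fork() duplicates only the calling thread, so any worker pool created before the fork simply does not exist in the child, and a runtime that still expects those workers will block forever.

```python
import os
import threading

def show_fork_thread_loss():
    # Start a background "worker", as an OpenMP runtime would.
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()
    parent_threads = threading.active_count()  # main + worker: at least 2

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread survives; the worker is gone.
        # A runtime still waiting on that worker would hang, as libgomp does.
        os._exit(0 if threading.active_count() == 1 else 1)

    _, status = os.waitpid(pid, 0)
    stop.set()
    worker.join()
    return parent_threads, os.WEXITSTATUS(status)
```

In the child, the threading module reports a single live thread even though the parent had two; this is exactly why a pre-fork thread pool cannot be reused in the child.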
If transparent fork-child detection is too complicated to implement in OpenBLAS itself, at least it would help to be able to call the relevant functions explicitly. Compiling statically is not really feasible in my use case: I have many libraries in my ecosystem that would need to be compiled statically against OpenBLAS: numpy, scipy, scikit-learn and maybe others. The library doing the fork would not know about all the Python compiled extensions that could have been compiled statically against OpenBLAS. In scikit-learn alone there are several compiled extensions.
I can't reproduce this with the current head. (Btw., dynamic linking is even more important because it should be possible to just drop in OpenBLAS without even recompiling. This is what the Debian/Ubuntu packages for BLAS, ATLAS and OpenBLAS do, and it's a very easy way to get a speedup in NumPy and SciPy.)
@larsmans I have never observed a child crash silently: it hangs, with 100% CPU usage in some configurations.
Correction, you're right, the child processes keep running in the background. I only noticed when I looked more closely.
Alright, then it's probably because under Linux the default compiler is GCC and OpenMP is enabled by default by the OpenBLAS Makefile. This is not the case under OS X, where the default C compiler is clang, which does not support OpenMP yet.
@ogrisel: @xianyi's suggestion of compiling with those flags seems to work. An additional flag still gives a significant speedup in my benchmarks.
Even when you call a BLAS function in the parent process before forking?
Yes, that seems to work too. |
But the problem can still occur?
Not with those build flags. |
I just recompiled OpenBLAS with:
Note that I did not disable LAPACK this time.
Then, when I run this numpy_fork.py script, I get a hung child process:
I have to kill the hung process manually. I have also made another version of the script, numpy_fork_omp_aware.py, to make it possible to call into the OpenMP runtime before forking:
That works, but then OpenBLAS cannot be used with multiple threads in the children. That might still be a good workaround for now. What I don't understand is why the behavior differs from the earlier builds.
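A pragmatic variant of this workaround (a sketch, not the numpy_fork_omp_aware.py script itself) is to pin the runtimes to a single thread in the children via environment variables, set before the BLAS library initializes its pool there; OMP_NUM_THREADS is standard OpenMP, and OPENBLAS_NUM_THREADS is honored by OpenBLAS builds:

```python
import os
import multiprocessing as mp

def report_limits():
    # Runs in the child: show the thread-count limits it inherited.
    return {k: os.environ.get(k)
            for k in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS")}

def run_child_single_threaded():
    # Set the limits in the parent before forking the workers, so any
    # OpenMP/OpenBLAS runtime loaded in the children stays sequential.
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["OPENBLAS_NUM_THREADS"] = "1"
    ctx = mp.get_context("fork")  # the raw-fork case discussed here
    with ctx.Pool(1) as pool:
        return pool.apply(report_limits)
```

This avoids the multi-threaded code path in the children entirely, at the cost of sequential BLAS there.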
I had not seen this. This is weird. The only difference I can see is in the build configuration.
Sorry for the delay. The easy way is exporting names such as openblas_thread_shutdown and openblas_thread_init. I think I can implement it tomorrow. @ogrisel, it's a good idea to store the pid. However, I need to evaluate the overhead. @larsmans, I think you need to try bigger matrices to trigger the multi-threading.
Thanks for the feedback @xianyi. It would be great if you could implement a fork-safe mode for OpenBLAS that stores the pid of the thread pool initializer. I just had a quick look and I think a similar strategy could be implemented in gcc/libgomp/parallel.c to make it fork-safe. Unfortunately I probably won't have the time to experiment with that myself in the short term.
@xianyi I will submit a patch to libgomp. I think the problem in OpenBLAS can be fixed similarly, using a pthread_atfork handler: https://github.com/ogrisel/gcc/compare/forksafe-omp-pthread_atfork I think this is cleaner than using a pid check and more robust.
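For reference, the pthread_atfork idea can be sketched in Python, whose os.register_at_fork (3.7+) wraps the same mechanism; the FakeThreadPool below is hypothetical, standing in for a BLAS-style worker pool:

```python
import os

class FakeThreadPool:
    """Hypothetical stand-in for a BLAS-style worker pool."""
    def __init__(self):
        self.alive = False
    def init(self):
        self.alive = True
    def shutdown(self):
        self.alive = False

pool = FakeThreadPool()
pool.init()

# After a fork, mark the pool as dead in the child so the next call
# rebuilds it, instead of blocking on workers that no longer exist.
# os.register_at_fork is the Python analogue of C's pthread_atfork().
os.register_at_fork(after_in_child=pool.shutdown)
```

Compared with a pid check on every call, the handler runs exactly once per fork and needs no per-call overhead.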
BTW, my tentative fix for the libgomp implementation of OpenMP in GCC has been rejected as invalid, and it's unlikely that libgomp will ever be made robust to raw forks. The good news is that multiprocessing will gain options to avoid raw forks in future Python versions.
I guess this is fixed now, right? It's not OpenBLAS's fault; CPython violates the POSIX standard on this one.
I agree, let's close this issue. |
This bug is now a disaster on Ubuntu 13.10, and thus Linux Mint 16. Using numpy with OpenBLAS causes linear algebra to slow to a crawl. Moreover, a sufficiently complex application pushes all the cores to 100% utilization prior to freezing. As a result we are back to ATLAS, which is a pain to tune and link to the right libraries on Ubuntu. While you might have closed this because it's Python's fault, it's probably a lot harder to change Python than OpenBLAS, and numpy has a huge application base that needs a good BLAS library.
Hi @mattcbro,
@mattcbro what do you mean by "Using numpy with openblas causes linear algebra to slow to a crawl"? This issue is about multiprocessing's fork causing a numpy program to completely hang (stop and never complete), not just slow down. Are you sure this is the same issue? Also, unless OpenBLAS stops using OpenMP and only uses a manually managed thread pool, I think there is no way to fix the issue on OpenBLAS' side.
I don't think not using OpenMP would solve the problem. You still can't reliably fork a worker process in the context of POSIX threads. |
The flag trick definitely won't work for disabling affinity. I'm pretty sure sched_setaffinity is on the list of functions that you are not supposed to rely on across a fork. Possibly I was being silly to bring it up here, though; it is a bit of a separate issue. I have had a quick look at the affinity issue. It looks like it could be handled with something like `disable_mapping = 1; gotoblas_affinity_init();`, although the code in driver/others/init.c is rather complicated, and writing a test for it would take some care.
@ogrisel, I like the lazy init approach. For affinity, I need to look at the init code more closely.
(Referenced a commit: "… a Unix fork." This reverts commit 3617c22.)
Sorry for the rebase noise... I have just opened PR #343 to submit a fix for this issue. I tested it with gcc 4.7 with and without OpenMP (to check that the fork test is skipped with OpenMP) and clang-500.2.79 with and without USE_THREAD=0 (to check that it still builds in sequential mode). I have not tested it under Windows, but the fork test should not be built there (and the problem does not exist there, as Windows does not implement fork). I leave the affinity stuff for another PR, as I would like to write a test for that case as well but I don't have the time to do it now.
@ogrisel,
ummm, now my haskell bindings don't build any more!
Here is a sample program that illustrates a problem when I want to use OpenBLAS with Unix fork (without exec):
https://gist.github.com/ogrisel/6440446
Then calling GEMM with a 7x7 square matrix in the parent and child processes works without issue:
Doing the same with an 8x8 matrix causes the child process to never complete (while burning 100% of a CPU) until I kill it.
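The structure of the test program can be sketched as follows (a pure-Python stand-in for the linked gist; the real script calls numpy.dot, which dispatches to OpenBLAS's DGEMM, so with an affected build the 8x8 child would time out here):

```python
import multiprocessing as mp

def matmul(n):
    # Plain-Python stand-in for the numpy.dot call in the gist; with
    # OpenBLAS, the 8x8 case is what triggered the threaded code path.
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i * j) for j in range(n)] for i in range(n)]
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def child(n):
    matmul(n)

def fork_then_matmul(n, timeout=30):
    matmul(n)  # warm up the (hypothetical) BLAS runtime in the parent
    ctx = mp.get_context("fork")  # plain Unix fork, no exec
    p = ctx.Process(target=child, args=(n,))
    p.start()
    p.join(timeout)
    hung = p.is_alive()
    if hung:
        p.terminate()
        p.join()
    return not hung  # True when the child completed in time
```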
Might this be caused by some static initialization step in the OpenBLAS runtime? If so, would it be possible to store the pid of the process at init time, and then later check that the current process pid is still the same, re-initializing the runtime if not?
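The pid-based re-initialization suggested above could look like this (a hypothetical sketch, not OpenBLAS's actual code): the runtime records the pid of the process that initialized it and lazily re-initializes whenever it is entered from a different pid, i.e. from a forked child.

```python
import os

class ForkAwareRuntime:
    """Hypothetical runtime that re-initializes itself after a fork."""

    def __init__(self):
        self._owner_pid = None  # pid that performed the last init
        self.init_count = 0

    def _init_thread_pool(self):
        self._owner_pid = os.getpid()
        self.init_count += 1

    def gemm(self):
        # Lazy check on every entry point: a pid mismatch means we are
        # in a forked child, so rebuild state instead of hanging.
        if self._owner_pid != os.getpid():
            self._init_thread_pool()
        return "ok"
```

The check costs one getpid() call per BLAS entry point, which is the kind of overhead that would need to be measured.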
ATLAS, Apple vecLib and Intel MKL have no trouble with Unix forks. Hence this limitation (bug?) prevents OpenBLAS from being used as a drop-in replacement for them. This is especially annoying with Python, which uses Unix forks in its standard library via the multiprocessing module. As Python threads suffer from a locking issue, many projects use multiprocessing as a workaround.