Unable to build in parallel with Python 3 #7112
Comments
Does it depend on the number of cores you're using, and is the error always the same?
@rgommers I'm not sure. I tried building with 16 cores again and it failed. Then I tried 1, 2, 4, 8, and 16 core builds and they all worked. It appears to be non-deterministic, as is the case with most parallel build failures. I realize that you may not have access to a 16-core CPU, so let me know if there are any debugging steps I can take and I can provide you with more information. I can also test on our 32-core CPU nodes.
@adamjstewart sorry for the slow reply; this issue fell through the cracks. Thinking about it again, I'm unsure how to debug it without being able to reproduce it. What would be useful, though, is providing a full build log (in a gist) and checking whether it fails in the same place each time. In the snippet above it failed on
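One way to check whether the failure lands in the same place each time is to repeat the build and tally the last line of stderr from each failing run. The following is only a minimal sketch: `failure_signatures` is a hypothetical helper, and `fake_build` is a stand-in command that always fails, used here so the example runs anywhere. For a real check you would pass your actual build command and clean the build tree between runs.

```python
import subprocess
import sys
from collections import Counter


def failure_signatures(cmd, runs=5):
    """Run `cmd` `runs` times and count the last stderr line of each failing run."""
    signatures = Counter()
    for _ in range(runs):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            # Keep only non-empty lines; the last one is usually the error itself.
            lines = [ln for ln in proc.stderr.splitlines() if ln.strip()]
            signatures[lines[-1] if lines else "<no stderr output>"] += 1
    return signatures


# Stand-in for a real build command (an assumption for illustration only):
# it writes "link failed" to stderr and exits with status 1 every time.
fake_build = [sys.executable, "-c", "import sys; sys.exit('link failed')"]
print(failure_signatures(fake_build, runs=3))
```

If a real build produces several different signatures across runs, that supports the idea that the failure point is random rather than tied to one source file.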
We've been seeing a similar failure for a while in Gentoo Linux. I actually have a build log in Bugzilla that you should be able to look at: https://bugs.gentoo.org/attachment.cgi?id=473460. This is a different error from the one above, but it also appears to be the result of a race condition between multiple build threads. So far I haven't tried to figure it out. Any clues as to where scipy deals with build dependencies?
All three examples are failures when gfortran is used for linking. The original report is strange, as it seems to concern pure Fortran sources, which are not supposed to be parallelized. So is the case for I am thinking that something is building in parallel when it shouldn't.
Looking hard at the problem and doing repeated builds, the place where it breaks appears to be random. I thought for a while that the threading of object builds in numpy's distutils was to blame. But even when I patch numpy's distutils to enforce a serial build, I still get build failures. So it must be triggered from elsewhere. It is definitely a race condition between linking and finishing the compilation of objects. When you dig through the logs to see the actual error, it is something like
It also appears that various scipy components (separate folders) are built in parallel, so there is a higher level of build threading at play.
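The kind of race described here, linking before all objects have been written, can be modeled with a toy example. This is a sketch under stated assumptions, not code from Python's or numpy's distutils: `compile_source`, `link`, and `demo` are hypothetical stand-ins, and a `threading.Event` is used to hold the "compiler" back so that the bad ordering is reproduced deterministically.

```python
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor, wait
from pathlib import Path


def compile_source(name, outdir, gate):
    """Stand-in for a compiler call: writes <name>.o once `gate` is set."""
    gate.wait()
    obj = outdir / f"{name}.o"
    obj.write_text("object code")
    return obj


def link(objects):
    """Stand-in for the linker: fails if any object file is missing."""
    missing = [o.name for o in objects if not o.exists()]
    if missing:
        raise FileNotFoundError(f"cannot link, missing: {missing}")
    return "linked"


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        outdir = Path(tmp)
        names = ["a", "b"]
        expected = [outdir / f"{n}.o" for n in names]
        gate = threading.Event()
        with ThreadPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(compile_source, n, outdir, gate) for n in names]
            # Racy ordering: the link step runs while compile jobs are still pending.
            try:
                racy = link(expected)
            except FileNotFoundError:
                racy = "failed"
            # Safe ordering: let the compile jobs finish, wait for them, then link.
            gate.set()
            wait(futures)
            safe = link(expected)
        return racy, safe


print(demo())
```

In a real build there is no gate, so whether the link step wins or loses depends on timing, which would be consistent with the failures appearing at random places and mostly at high job counts.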
I did some more experiments. If I shut down the parallelism in numpy's distutils, I still get failures. I now think Python 3.5+'s distutils parallelism cannot deal with scipy. Interestingly, if you do the reverse (shut down Python's distutils parallelism and enable numpy's), you also get failures. So the implementation of parallelism in both Python and numpy seems to be lacking. I don't see this as a scipy issue anymore. Both Python and numpy are to blame, in my opinion.
Can confirm the issues reported by @kiwifb with Gentoo and scipy-0.19.1 and scipy-1.0.0. Any luck with a fix?
After a lot of experiments, the problem lies specifically in the Python 3.5+ version of distutils. There is no way to stop parallelism in Python's distutils while enabling it in numpy's. The numpy developers have aligned their options for enabling parallel builds with Python's, so disabling one or the other by setting an option is not possible. All in all, to me it is a Python bug, not a scipy bug.
Just encountered this on Gentoo with Python 3.6 when I've disabled python3_5:

    distutils-r1_python_compile \
        $(usex python_targets_python3_5 "" "-j $(makeopts_jobs)") \
        ${SCIPY_FCONFIG}

For those with the same problem who are looking for a quick workaround, just do:
After 4 years discussion and debugging, we conclude that Python 3 is deeply broken in parallel builds for anything involving compiling of C/C++/fortran code. The problem is universal, regardless how dev-python/numpy is built. Numpy and scipy upstream cannot do anything about this.

We bite the bullet and disable parallel build of scipy completely.

Thanks to all who have contributed to this heroic marathon debugging. We regret that only a workaround can be provided at this moment.

Credit: Andrés Becerra Sandoval, Hendrik v. Raven, younky.yang@yahoo.com
Credit: matoro, Denis Descheneaux, Mathy Vanvoorden, email200202@yahoo.com
Credit: jon R-B, Anton Kochkov, Jonas Stein, edes, David Duchesne
Credit: thulle, Mathy Vanvoorden, Sasha Medvedev, rtgiskard@gmail.com
Credit: Lukasz Ligowski, Zentoo, Jouni Kosonen, Neil, Harris Landgarten
Credit: Markus Oehme, Andreas Proteus
Suggested-By: François Bissey, Arfrever Frehtes Taifersar Arahesis
Reference: numpy/numpy#13080
Reference: scipy/scipy#7112
Closes: https://bugs.gentoo.org/614464
Package-Manager: Portage-2.3.79, Repoman-2.3.18
Signed-off-by: Benda Xu <heroxbd@gentoo.org>
Building in parallel is disabled in
This needs a fix in
This is still the state as of today; it won't be fixed anymore because You can build SciPy in parallel with Meson now.
I'm trying to build scipy 0.18.1 with Python 3.6.0, Numpy 1.12.0, GCC 6.1.0, and OpenBLAS 0.2.19 on CentOS 6.8. When I build in serial, things work just fine. But when I use -j 16, I'm seeing build errors:

I can successfully build in parallel with Python 2.7.13, so this problem may be specific to Python 3.