Unable to build in parallel with Python 3 #7112

Closed
adamjstewart opened this issue Feb 28, 2017 · 12 comments
Labels
Build issues (Issues with building from source, including different choices of architecture, compilers and OS); wontfix (Not actionable, rejected or unplanned changes)

Comments

@adamjstewart
Contributor

I'm trying to build scipy 0.18.1 with Python 3.6.0, Numpy 1.12.0, GCC 6.1.0, and OpenBLAS 0.2.19 on CentOS 6.8. When I build in serial, things work just fine. But when I use -j 16, I'm seeing build errors:

/blues/gpfs/home/software/spack-0.10.0/lib/spack/env/gcc/gfortran -Wall -g -Wall -g -shared build/temp.linux-x86_64-3.6/build/src.linux-x86_64-3.6/scipy/integrate/_dopmodule.o build/temp.linux-x86_64-3.6/build/src.linux-x86_64-3.6/fortranobject.o -L/blues/gpfs/home/software/spack-0.10.0/opt/spack/linux-centos6-x86_64/gcc-6.1.0/python-3.6.0-prk6gk3ufbfetjc2bthqokmkjtjnce3j/lib -Lbuild/temp.linux-x86_64-3.6 -ldop -lpython3.6m -lgfortran -o build/lib.linux-x86_64-3.6/scipy/integrate/_dop.cpython-36m-x86_64-linux-gnu.so
build/temp.linux-x86_64-3.6/build/src.linux-x86_64-3.6/fortranobject.o: file not recognized: File truncated
collect2: error: ld returned 1 exit status
build/temp.linux-x86_64-3.6/build/src.linux-x86_64-3.6/fortranobject.o: file not recognized: File truncated
collect2: error: ld returned 1 exit status

I can successfully build in parallel with Python 2.7.13, so this problem may be specific to Python 3.

@rgommers rgommers added the Build issues Issues with building from source, including different choices of architecture, compilers and OS label Mar 15, 2017
@rgommers
Member

Does it depend on the number of cores you're using, and is the error always the same?

@adamjstewart
Contributor Author

@rgommers I'm not sure. I tried building with 16 cores again and it failed. Then I tried 1, 2, 4, 8, and 16 core builds and they worked. It appears to be non-deterministic, as is the case with most parallel build failures. I realize that you may not have access to a 16-core CPU, so let me know if there are any debugging steps I can take and I can provide you with more information. I can also test on our 32-core CPU nodes.

@rgommers
Member

@adamjstewart sorry for the slow reply, this issue fell through the cracks. Thinking about it again, I'm unsure how to debug it without being able to reproduce it. What would be useful though is providing a full build log (in a gist) and checking whether it fails in the same place each time. In the snippet above it failed on integrate._dop, but there's nothing unusual in the build config in integrate/setup.py as far as I can see.
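
To check whether the build fails in the same place each time, one option is to tally the object files that the linker reports as truncated across several build logs. The following is a minimal sketch, assuming the logs contain the same `file not recognized: File truncated` line quoted in this issue; the function name `truncated_objects` and the regular expression are illustrative, not part of SciPy or numpy.distutils.

```python
import re
from collections import Counter

# Matches the linker message quoted in this issue, for example:
#   build/.../fortranobject.o: file not recognized: File truncated
# The pattern is an assumption based on that one message format.
TRUNCATED = re.compile(
    r"^(?P<path>\S+\.o): file not recognized: File truncated\s*$"
)


def truncated_objects(log_text):
    """Count how often each object file is reported as truncated."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TRUNCATED.match(line.strip())
        if match:
            counts[match.group("path")] += 1
    return counts
```

Running this over the log of each failed build and comparing the resulting counters shows whether the same object file is involved every time or whether the failure moves around.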

@kiwifb

kiwifb commented Nov 20, 2017

We've been seeing a similar problem for a while on Gentoo Linux. I have a build log in bugzilla that you should be able to look at (https://bugs.gentoo.org/attachment.cgi?id=473460). It shows a different error from the one above, but it also appears to be the result of a race condition between multiple build threads.
I am usually building with 8 threads. We have at least one report of a third place where the build can fail (https://bugs.gentoo.org/attachment.cgi?id=499348).

So far I haven't tried to figure it out. Any clues as to where scipy deals with build dependencies?

@kiwifb

kiwifb commented Nov 20, 2017

All three examples are failures when gfortran is used for linking. The original report is strange, as it seems to concern pure Fortran sources, which are not supposed to be built in parallel. The same is true for _fblas. My build log, on the other hand, has mixed Fortran and C sources, so there you could expect a problem. Either way, this is strange.

I am thinking something does parallel building when it shouldn't.

@kiwifb

kiwifb commented Nov 21, 2017

After looking hard at the problem and doing repeated builds, the place where it breaks appears to be random. I thought for a while that the threaded building of objects in numpy's distutils was to blame. But even when I patch numpy's distutils to enforce a serial build, I still get build failures. So it must be triggered from elsewhere.

It is definitely a race condition between linking and finishing the compilation of objects. When you dig through the logs to see the actual error, it is something like:

/dev/shm/portage/sci-libs/scipy-0.19.1/work/scipy-0.19.1-python3_6/build/temp.linux-x86_64-3.6/scipy/special/cdf_wrappers.o: file not recognized: File truncated
collect2: error: ld returned 1 exit status
/dev/shm/portage/sci-libs/scipy-0.19.1/work/scipy-0.19.1-python3_6/build/temp.linux-x86_64-3.6/scipy/special/cdf_wrappers.o: file not recognized: File truncated
collect2: error: ld returned 1 exit status
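
To illustrate what "File truncated" means in a race like this, here is a minimal sketch in which a stand-in "compile" thread writes a file in two halves and a stand-in "link" step inspects it before the second half arrives. This is only an illustration of the general failure mode, not the actual code path in distutils or numpy.distutils; the function name `demo_truncated_read`, the file name `example.o`, and the use of events to force the ordering are all assumptions made for the example.

```python
import os
import tempfile
import threading


def demo_truncated_read(payload=b"x" * 1024):
    """Return (size seen mid-write, size seen after the writer finished)."""
    half = len(payload) // 2
    half_written = threading.Event()
    reader_done = threading.Event()

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.o")  # stand-in for an object file

        def compile_step():
            with open(path, "wb") as out:
                out.write(payload[:half])
                out.flush()
                half_written.set()   # the "compiler" is only halfway through
                reader_done.wait()   # pause so the race is reproducible
                out.write(payload[half:])

        writer = threading.Thread(target=compile_step)
        writer.start()

        half_written.wait()
        early = os.path.getsize(path)  # what a too-early "linker" would see
        reader_done.set()

        writer.join()                  # the synchronisation point that was missing
        final = os.path.getsize(path)
    return early, final
```

The first value is smaller than the second because the file was inspected before the writer finished. Waiting for the writer (the `join()` call) before reading is the kind of ordering a build system has to guarantee between compiling and linking.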

It also appears that various scipy components (separate folders) are built in parallel so there is a higher level of threading of the build at play.

@kiwifb

kiwifb commented Nov 22, 2017

I did some more experiments. If I shut down the parallelism in numpy's distutils, I still get failures. I now think the parallelism in Python 3.5+'s distutils cannot deal with scipy.

Interestingly, if you do the reverse (shut down the Python distutils parallelism and enable numpy's), you also get failures.

So the implementation of parallelism in both Python and numpy seems to be lacking. I don't see this as a scipy issue anymore. Both Python and numpy are to blame, in my opinion.

@iffsid

iffsid commented Dec 6, 2017

Can confirm issues as reported by @kiwifb with gentoo and scipy-0.19.1 and scipy-1.0.0. Any luck with a fix?

@kiwifb

kiwifb commented Dec 6, 2017

After a lot of experiments, the problem lies specifically in the Python 3.5+ version of distutils. There is no way to stop parallelism in Python's distutils while enabling it in numpy's. The numpy developers have aligned their option for parallel building with the Python one, so it is not possible to disable one without the other by setting an option.

All in all, to me it is a python bug, not a scipy bug.

@Kagami

Kagami commented Mar 2, 2018

Just encountered this on Gentoo with Python 3.6 after I disabled the python3_5 PYTHON_TARGET for the scipy ebuild. This is because of:

        distutils-r1_python_compile \
                $(usex python_targets_python3_5 "" "-j $(makeopts_jobs)") \
                ${SCIPY_FCONFIG}

For anyone with the same problem who is looking for a quick workaround, just do:

MAKEOPTS=-j1 emerge scipy

gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Jan 27, 2020
  After 4 years discussion and debugging, we conclude that Python 3 is
  deeply broken in parallel builds for anything involving compiling of
  C/C++/fortran code.  The problem is universal, regardless how
  dev-python/numpy is built.

  Numpy and scipy upstream cannot do anything about this.  We bite the
  bullet and disable parallel build of scipy completely.

  Thanks to all who have contributed to this heroic marathon
  debugging.  We regret that only a workaround can be provided at this
  moment.

Credit: Andrés Becerra Sandoval, Hendrik v. Raven, younky.yang@yahoo.com
Credit: matoro, Denis Descheneaux, Mathy Vanvoorden, email200202@yahoo.com
Credit: jon R-B, Anton Kochkov, Jonas Stein, edes, David Duchesne
Credit: thulle, Mathy Vanvoorden, Sasha Medvedev, rtgiskard@gmail.com
Credit: Lukasz Ligowski, Zentoo, Jouni Kosonen, Neil, Harris Landgarten
Credit: Markus Oehme, Andreas Proteus
Suggested-By: François Bissey,  Arfrever Frehtes Taifersar Arahesis
Reference: numpy/numpy#13080
Reference: scipy/scipy#7112
Closes: https://bugs.gentoo.org/614464
Package-Manager: Portage-2.3.79, Repoman-2.3.18
Signed-off-by: Benda Xu <heroxbd@gentoo.org>
@rgommers
Member

rgommers commented Feb 6, 2021

Building in parallel is disabled in setup.py:

    class build_ext(BaseBuildExt):
        def finalize_options(self):
            super().finalize_options()

            # Disable distutils parallel build, due to race conditions
            # in numpy.distutils (Numpy issue gh-15957)
            if self.parallel:
                print("NOTE: -j build option not supported. Set NPY_NUM_BUILD_JOBS=4 "
                      "for parallel build.")
            self.parallel = None

This needs a fix in numpy.distutils first.

@rgommers
Member

rgommers commented Apr 1, 2022

This is still the state as of today; it won't be fixed anymore because numpy.distutils and distutils are both deprecated. So I'll close this issue.

You can build SciPy in parallel with Meson now.

@rgommers rgommers closed this as completed Apr 1, 2022
@rgommers rgommers added the wontfix Not actionable, rejected or unplanned changes label Apr 1, 2022