
Under certain (unclear) conditions, use_multiprocessing causes calculation to hang #23

Closed
mperrin opened this issue Aug 23, 2018 · 14 comments


mperrin commented Aug 23, 2018

Issue by josePhoenix
Friday Dec 12, 2014 at 16:17 GMT
Originally opened as mperrin/poppy#23


For some reason, this sample code (using WebbPSF) causes a hang on my Mac Pro but not the science5 server or my personal machine (MacBook Air). Needs a better test case that uses only POPPY.

(To avoid the sometimes unusual behavior of multiprocessing + the interactive interpreter, I saved this as a separate script and ran it from the command line.)

import webbpsf
import poppy
nc = webbpsf.NIRCam()
nc.filter = 'F150W2'
poppy.conf.use_fftw = False
poppy.conf.use_multiprocessing = True
poppy.conf.n_processes = 8
nc.calcPSF()

Running this prints out logging output through "Propagating wavelength..." lines for the first 20 of 40 wavelengths, and then nothing happens.

If I interrupt the script with Ctrl-C, I get tracebacks from all the running workers interleaved (to be expected, but hard to interpret). It looks like they are all waiting on "task = get()", which in turn is waiting on "racquire()", which seems to acquire a lock on a pipe.


mperrin commented Aug 23, 2018

Comment by mperrin
Wednesday Jan 28, 2015 at 22:47 GMT


I am now having this same problem on my laptop (MacBook Pro), while the exact same code works on my desktop (Mac Pro). Both are running OS X 10.9.5.

Here's webbpsf.system_diagnostic() for the one that works:

OS: Darwin-13.4.0-x86_64-i386-64bit
Python version: 2.7.8 (default, Oct  7 2014, 15:36:11)  [GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)]
numpy version: 1.8.2
poppy version: 0.3.3.dev407
webbpsf version: 0.3rc4

tkinter version: 0.3.1
wxpython version: 3.0.0.0

astropy version: 0.4.1
pysynphot version: 0.9.5
pyFFTW version: not found

Floating point type information for numpy.float:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15   resolution= 1.0000000000000001e-15
machep=   -52   eps=        2.2204460492503131e-16
negep =   -53   epsneg=     1.1102230246251565e-16
minexp= -1022   tiny=       2.2250738585072014e-308
maxexp=  1024   max=        1.7976931348623157e+308
nexp  =    11   min=        -max
---------------------------------------------------------------------

Floating point type information for numpy.complex:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15   resolution= 1.0000000000000001e-15
machep=   -52   eps=        2.2204460492503131e-16
negep =   -53   epsneg=     1.1102230246251565e-16
minexp= -1022   tiny=       2.2250738585072014e-308
maxexp=  1024   max=        1.7976931348623157e+308
nexp  =    11   min=        -max
---------------------------------------------------------------------

And here's the one that doesn't work:

OS: Darwin-13.4.0-x86_64-i386-64bit
Python version: 2.7.8 (default, Oct  7 2014, 15:56:22)  [GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))]
numpy version: 1.8.1
poppy version: 0.3.3.dev407
webbpsf version: 0.3rc4

tkinter version: 0.3.1
wxpython version: 3.0.0.0

astropy version: 0.4.dev9488
pysynphot version: 0.9.5
pyFFTW version: 0.9.2

Floating point type information for numpy.float:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15   resolution= 1.0000000000000001e-15
machep=   -52   eps=        2.2204460492503131e-16
negep =   -53   epsneg=     1.1102230246251565e-16
minexp= -1022   tiny=       2.2250738585072014e-308
maxexp=  1024   max=        1.7976931348623157e+308
nexp  =    11   min=        -max
---------------------------------------------------------------------

Floating point type information for numpy.complex:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15   resolution= 1.0000000000000001e-15
machep=   -52   eps=        2.2204460492503131e-16
negep =   -53   epsneg=     1.1102230246251565e-16
minexp= -1022   tiny=       2.2250738585072014e-308
maxexp=  1024   max=        1.7976931348623157e+308
nexp  =    11   min=        -max
---------------------------------------------------------------------

mperrin added this to the 0.3.5 milestone Aug 23, 2018
mperrin added the bug label Aug 23, 2018

mperrin commented Aug 23, 2018

Comment by mperrin
Wednesday Jan 28, 2015 at 22:54 GMT


It hangs the same way in both plain Python and IPython, so that's not the problem.

This looks possibly relevant: http://stackoverflow.com/questions/12485322/python-2-6-7-multiprocessing-pool-fails-to-shut-down


mperrin commented Aug 23, 2018

Comment by josePhoenix
Monday Apr 06, 2015 at 19:32 GMT


I've done some more investigation, and it appears that the hang happens on a line in matrixDFT.py where it computes a dot product. So I Googled "numpy dot multiprocessing" and, whaddya know, someone else has had this issue... http://stackoverflow.com/questions/23963997/python-child-process-crashes-on-numpy-dot-if-pyside-is-imported

The takeaway seems to be:

this is a general issue with some BLAS libraries used by numpy for dot.

Apple Accelerate and OpenBLAS built with GNU OpenMP are known not to be safe to use on both sides of a fork (the parent and the child process multiprocessing creates). They will deadlock.

This cannot be fixed by numpy, but there are three workarounds:

  • use netlib BLAS, ATLAS, or git-master OpenBLAS based on pthreads (2.8.0 does not work)
  • use Python 3.4 and its new multiprocessing spawn or forkserver start methods
  • use threading instead of multiprocessing; numpy releases the GIL for most expensive operations, so you can achieve decent threading speedups on typical desktop machines

I can't believe it!
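
For the record, here is a minimal, poppy-independent sketch of the failure mode described in that answer (an illustration only: whether it actually deadlocks depends on the numpy BLAS build; with Accelerate or an OpenMP OpenBLAS under fork it can hang, with a fork-safe BLAS it completes normally):

import multiprocessing
import numpy as np

def worker(seed):
    # Each forked worker calls the BLAS-backed dot again.
    rng = np.random.RandomState(seed)
    a = rng.rand(500, 500)
    return float(np.dot(a, a).sum())

if __name__ == '__main__':
    # Touch BLAS in the parent before forking; this is what sets up the deadlock.
    a = np.random.rand(500, 500)
    np.dot(a, a)

    pool = multiprocessing.Pool(processes=4)   # uses fork() on Unix / Python 2
    print(pool.map(worker, range(8)))
    pool.close()
    pool.join()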


mperrin commented Aug 23, 2018

Comment by josePhoenix
Monday Apr 06, 2015 at 20:48 GMT


I threw a bug report into the ether: numpy/numpy#5752

It may be time to rewrite the Slow Fourier Transform in Cython / C / OpenCL (depending on how adventurous I'm feeling).


mperrin commented Aug 23, 2018

Comment by mperrin
Monday Apr 06, 2015 at 23:49 GMT


Oh good grief. I wasn't expecting this to be a fundamental and intractable issue with the Accelerate framework, of all things. I bet the fact that this has generally worked more reliably for me is because I use MacPorts, which can build its own copy of ATLAS.

Yeah, I agree with your comment on the numpy bug discussion that we should just test for this and warn users if they're hitting this case. Looks like we can test against numpy.__config__.blas_opt_info to see whether a given numpy instance is linked against Accelerate.
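
Something along these lines might work as the check (a sketch only; the exact keys present in blas_opt_info vary between numpy builds, as comes up later in this thread):

import numpy as np

blas_info = getattr(np.__config__, 'blas_opt_info', {})
extra_link_args = blas_info.get('extra_link_args', [])
linked_against_accelerate = any('Accelerate' in arg for arg in extra_link_args)
print("numpy linked against Accelerate:", linked_against_accelerate)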

I'd be fully in support of developing a more accelerated version of the MFT, definitely. There are some GPU experts in the building if you want to go that route. I'm tempted to think that the Cython approach could have better return on time invested than the complexities of dealing with GPUs, except the matrix_dft function is already mostly just calls to numpy ufuncs that are presumably written in C internally. So Cython might not give much speedup, but I suppose would at least work around the numpy.dot/Accelerate bug.

Possibly useful notebook on linear algebra in Cython: http://nbviewer.ipython.org/github/carljv/cython_testing/blob/master/cython_linalg.ipynb


mperrin commented Aug 23, 2018

Comment by josePhoenix
Tuesday Apr 07, 2015 at 02:24 GMT


"Good grief": my sentiments exactly.

Further discussion on the GitHub issue at numpy/numpy suggests there's a possibility of building wheels with ATLAS instead of Accelerate, perhaps even as the default going forward. I checked our internal ssbx numpy, and it's currently built with Accelerate for Mac users 😞


mperrin commented Aug 23, 2018

Comment by josePhoenix
Wednesday Apr 08, 2015 at 20:37 GMT


A translation of the NumPy-based matrixDFT code into Cython, with a Cython matrix-product function to replace numpy.dot, is about 23x slower. I've gotten the low-hanging optimization fruit already, I think, though I'm not really experienced with Cython.

If it had been only 4x slower, I would have been happy... The hot path (the matrix product code) is essentially C at this point, so I'm not sure how much additional benefit a C extension will have.


mperrin commented Aug 23, 2018

Comment by mperrin
Wednesday Apr 08, 2015 at 22:41 GMT


Bleh. 23x bleh.

I wonder why it's that much slower. There must be some deep SSE magic going on in the assembly code for numpy.dot. Yeah OK, no wins from Cython.

Do you want to give pyCUDA a try? If so, maybe talk to Perry or Tim in SSB; I think they have some experience with it or can direct you to people here who do. (FYI, there was very mixed experience here a couple of years ago trying GPUs for several projects, like the CTE correction.) Of course, given a goal of "easy distributability", anything GPU-ish is likely even more of a headache than this Accelerate mess, so I'm wary.

Right now we can at least be fairly confident that the existing code should be robust for multiprocessing on Linux, yes? In the long run it may be most efficient to just get things up and running on Python 3.4 since it sounds like that avoids the underlying issue, and we want to go that way anyhow.
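
A minimal sketch of what the Python 3.4 route could look like (assumes Python >= 3.4, where multiprocessing.get_context() and the forkserver/spawn start methods exist; the worker here is just a stand-in for the real per-wavelength propagation):

import multiprocessing
import numpy as np

def propagate_one(wavelength):
    # Stand-in for the real per-wavelength PSF calculation.
    a = np.random.rand(500, 500)
    return wavelength, float(np.dot(a, a).sum())

if __name__ == '__main__':
    # 'forkserver' (or 'spawn') gives the workers a clean interpreter,
    # avoiding the inherited Accelerate state that deadlocks after fork().
    ctx = multiprocessing.get_context('forkserver')
    with ctx.Pool(processes=4) as pool:
        results = pool.map(propagate_one, [1.0e-6, 1.5e-6, 2.0e-6, 2.5e-6])
    print(results)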


mperrin commented Aug 23, 2018

Comment by josePhoenix
Thursday Apr 09, 2015 at 00:35 GMT


Yeah, getting things running on 3.4 is definitely something we should do so we're on a supported version of the language. We can just warn users on 2.7.x if they enable multiprocessing on an affected numpy build.

I was looking at PyOpenCL for broader compatibility than PyCUDA, but they're both lower level than I'd like (have to separate our real and imaginary components, etc.).

Re: Linux: it's hard to say, since there are a few different BLAS libs NumPy can be built with and no standard binary build for Linux as far as I know. It sounds like a similar issue has existed in other libraries and been worked around. I would want to check against our ssb* builds at least to confirm it's not an issue there.
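
For checking any particular build, numpy can report its build-time BLAS/LAPACK configuration directly; a quick sketch:

import numpy as np

# Prints which BLAS/LAPACK this numpy was built against
# (ATLAS, OpenBLAS, MKL, Accelerate, ...).
np.__config__.show()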


mperrin commented Aug 23, 2018

Comment by josePhoenix
Thursday Apr 09, 2015 at 02:43 GMT


For future reference:

  • a (slightly) higher-level Python API over OpenCL and CUDA: reikna
  • an apparently very thorough discussion of FFTs with OpenCL from AMD: part 1, part 2 (with some formatting weirdness)


mperrin commented Aug 23, 2018

Comment by mperrin
Monday Apr 20, 2015 at 04:53 GMT


Some thoughts on performance optimization.

I've just done some benchmarking with line_profiler. For a relatively simple case (webbpsf NIRCam F212N direct imaging), the execution time breaks down as follows. Each level of indentation is what's inside the function call above, but I've (manually) converted percentages to be relative to the total execution time.

~99% of time is spent inside OpticalSystem.propagate_mono()
    78% is inside wavefront.propagateTo
        77% is inside matrix_dft
             1% is the two np.outer() calls
            11% is setting up the expXU and expYV arrays
            64% is the two np.dot() calls
             1% is multiplying by norm_coeff
    11% is wavefront *= optic
     5% is wavefront.normalize()
     6% is return wavefront.asFITS()
        5% is unnecessary array copies which we could easily avoid 

So, no real surprises. Nearly 80% of the execution time is spent in calls to np.dot, np.outer, and np.exp. There is not likely to be much/anything we can do to optimize those beyond BLAS. (Setting aside cases of "numpy is not compiled to use BLAS so is wastefully slow" as an anti-optimization. Argh.) Extensive web searching has turned up some interesting tidbits but nothing that looks particularly likely to yield any substantial speedups.
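
For anyone wanting to reproduce this kind of breakdown, here is a rough line_profiler usage sketch (profiling a toy stand-in function; in practice one would wrap poppy's propagate_mono or matrix_dft instead):

import numpy as np
from line_profiler import LineProfiler

def toy_mft(n):
    # Toy stand-in: build a complex exponential matrix and take a matrix
    # product, roughly the shape of the work inside matrix_dft.
    x = np.exp(2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n)
    return np.dot(x, x)

profiler = LineProfiler()
wrapped = profiler(toy_mft)   # wrap the function to record per-line timings
wrapped(512)
profiler.print_stats()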

The only additional avenue I can think of to pursue would be GPU computing. In particular, Continuum Analytics has CUDA integrated into their Python JIT compiler toolchain, NumbaPro. Here's an example notebook. Seems very slick. CUDA includes its own version of BLAS called CUBLAS. Under the right circumstances it sounds like this can provide significant speedups, and we should probably try it out. On the other hand, the overhead of transferring data in and out of GPU memory could potentially be significant.

(In the back of my mind there's long been an idea that it would be possible to rewrite the code such that the entire propagation takes place on the GPU and only the final results get sent back to main memory, i.e. all the OpticalElement phasor creation code could be compiled to run on the GPUs too. But that is clearly way outside the scope of current efforts. That said, this seems like exactly the sort of thing that scikit-CUDA is intended for.)

Bottom line I think right now we should:

  1. Add error-checking code for Accelerate in numpy.__config__.blas_opt_info and warn users it is not compatible with multiprocessing on Python 2.7
  2. Rework the multiprocessing code so that it can use the forkserver method on Python 3.4 which supposedly avoids that problem, and encourage users to migrate to 3.4 if they want to use multiprocessing.
  3. Investigate GPU options over the longer term, probably most efficiently through NumbaPro.


mperrin commented Aug 23, 2018

Comment by mperrin
Tuesday Apr 21, 2015 at 04:40 GMT


Should add at least the error-check and warning before release 0.3.5:

  • Add error-checking code for Accelerate in numpy.__config__.blas_opt_info and warn users it is not compatible with multiprocessing on Python 2.7
  • Rework the multiprocessing code so that it can use the forkserver method on Python 3.4 which supposedly avoids that problem, and encourage users to migrate to 3.4 if they want to use multiprocessing.


mperrin commented Aug 23, 2018

Comment by JarronL
Monday Aug 01, 2016 at 20:14 GMT


I recently did a clean install of Mac OS X (10.10) on my MacPro along with the latest Anaconda (Python 2.7) and pyFFTW with the corresponding FFTW libraries. After doing some extensive benchmarking, I discovered that the numpy FFT isn't that much slower than pyFFTW, and came to the conclusion that multiprocessing the numpy FFT is a much better use of my machine's resources. Fortunately, my numpy uses MKL, so I don't have the forking issue under Apple's Accelerate library.

Unfortunately, the error-checking statement in Poppy (line 1560) assumes that np.__config__.blas_opt_info['extra_link_args'] exists, which causes an error for me since that key is not present in my blas_opt_info dict. For reference, my dictionary looks like:

{'define_macros': [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)],
 'include_dirs': ['/Users/jarron/anaconda2/include'],
 'libraries': ['mkl_intel_lp64',
  'mkl_intel_thread',
  'mkl_core',
  'iomp5',
  'pthread'],
 'library_dirs': ['/Users/jarron/anaconda2/lib']}

Until there is a fix, I modified my copy to also check if that key exists:

import sys
import platform
import numpy as np

d = np.__config__.blas_opt_info
accel_bool = ('extra_link_args' in d) and ('-Wl,Accelerate' in d['extra_link_args'])

if (sys.version_info < (3, 4, 0)) and (platform.system() == 'Darwin') and accel_bool:
    # _log is poppy's module-level logger
    _log.error("Multiprocessing not compatible with Apple Accelerate library on Python < 3.4")
    _log.error(" See https://github.com/mperrin/poppy/issues/23 ")
    _log.error(" Either disable multiprocessing, or recompile your numpy without Accelerate.")
    raise NotImplementedError("Multiprocessing not compatible with Apple Accelerate framework.")


mperrin commented Aug 23, 2018

Comment by mandarup
Tuesday Dec 05, 2017 at 17:45 GMT


Just commenting for reference in case this is useful to someone:

A few years later, still using Python 2.7 due to some dependencies lagging behind, I realized np.dot was hanging joblib.Parallel with backend='multiprocessing'.

I worked around it using a simple jitted dot implementation:

import numba as nb
import numpy as np
@nb.jit(nb.f8[:, :](nb.f8[:, :], nb.f8[:, :]), nopython=True)
def dot_product(a, b):
    size_a = a.shape
    size_b = b.shape
    size_out = (size_a[0], size_b[1])
    out = np.zeros(size_out)
    for i in range(size_a[0]):
        for j in range(size_b[1]):
            for k in range(size_b[0]):
                out[i,j] += a[i,k] * b[k, j]
    return out
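
For what it's worth, a quick single-process sanity check of the jitted replacement against numpy (this does not by itself exercise the fork deadlock):

a = np.random.rand(200, 300)
b = np.random.rand(300, 100)
assert np.allclose(dot_product(a, b), np.dot(a, b))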

mperrin closed this as completed Aug 23, 2018