
Under certain (unclear) conditions, use_multiprocessing causes calculation to hang #23

Closed
mperrin opened this issue Aug 23, 2018 · 14 comments


mperrin commented Aug 23, 2018

Issue by josePhoenix
Friday Dec 12, 2014 at 16:17 GMT
Originally opened as mperrin/poppy#23


For some reason, this sample code (using WebbPSF) causes a hang on my Mac Pro but not the science5 server or my personal machine (MacBook Air). Needs a better test case that uses only POPPY.

(To avoid the sometimes unusual behavior of multiprocessing + the interactive interpreter, I saved this as a separate script and ran it from the command line.)

import webbpsf
import poppy
nc = webbpsf.NIRCam()
nc.filter = 'F150W2'
poppy.conf.use_fftw = False
poppy.conf.use_multiprocessing = True
poppy.conf.n_processes = 8
nc.calcPSF()

Running this prints out logging output through "Propagating wavelength..." lines for the first 20 of 40 wavelengths, and then nothing happens.

If I interrupt the script with Ctrl-C, I get tracebacks from all the running workers interleaved (to be expected, but hard to interpret). It looks like they are all waiting on "task = get()", which in turn is waiting on "racquire()", which seems to acquire a lock on a pipe.


mperrin commented Aug 23, 2018

Comment by mperrin
Wednesday Jan 28, 2015 at 22:47 GMT


I am now having this same problem on my laptop (MacBook Pro), while the exact same code works on my desktop (Mac Pro). Both are running OS X 10.9.5.

Here's webbpsf.system_diagnostic() for the one that works:

OS: Darwin-13.4.0-x86_64-i386-64bit
Python version: 2.7.8 (default, Oct  7 2014, 15:36:11)  [GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)]
numpy version: 1.8.2
poppy version: 0.3.3.dev407
webbpsf version: 0.3rc4

tkinter version: 0.3.1
wxpython version: 3.0.0.0

astropy version: 0.4.1
pysynphot version: 0.9.5
pyFFTW version: not found

Floating point type information for numpy.float:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15   resolution= 1.0000000000000001e-15
machep=   -52   eps=        2.2204460492503131e-16
negep =   -53   epsneg=     1.1102230246251565e-16
minexp= -1022   tiny=       2.2250738585072014e-308
maxexp=  1024   max=        1.7976931348623157e+308
nexp  =    11   min=        -max
---------------------------------------------------------------------

Floating point type information for numpy.complex:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15   resolution= 1.0000000000000001e-15
machep=   -52   eps=        2.2204460492503131e-16
negep =   -53   epsneg=     1.1102230246251565e-16
minexp= -1022   tiny=       2.2250738585072014e-308
maxexp=  1024   max=        1.7976931348623157e+308
nexp  =    11   min=        -max
---------------------------------------------------------------------

And here's the one that doesn't work:

OS: Darwin-13.4.0-x86_64-i386-64bit
Python version: 2.7.8 (default, Oct  7 2014, 15:56:22)  [GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))]
numpy version: 1.8.1
poppy version: 0.3.3.dev407
webbpsf version: 0.3rc4

tkinter version: 0.3.1
wxpython version: 3.0.0.0

astropy version: 0.4.dev9488
pysynphot version: 0.9.5
pyFFTW version: 0.9.2

Floating point type information for numpy.float:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15   resolution= 1.0000000000000001e-15
machep=   -52   eps=        2.2204460492503131e-16
negep =   -53   epsneg=     1.1102230246251565e-16
minexp= -1022   tiny=       2.2250738585072014e-308
maxexp=  1024   max=        1.7976931348623157e+308
nexp  =    11   min=        -max
---------------------------------------------------------------------

Floating point type information for numpy.complex:
Machine parameters for float64
---------------------------------------------------------------------
precision= 15   resolution= 1.0000000000000001e-15
machep=   -52   eps=        2.2204460492503131e-16
negep =   -53   epsneg=     1.1102230246251565e-16
minexp= -1022   tiny=       2.2250738585072014e-308
maxexp=  1024   max=        1.7976931348623157e+308
nexp  =    11   min=        -max
---------------------------------------------------------------------

mperrin added this to the 0.3.5 milestone Aug 23, 2018
mperrin added the bug label Aug 23, 2018

mperrin commented Aug 23, 2018

Comment by mperrin
Wednesday Jan 28, 2015 at 22:54 GMT


It hangs the same way in both plain Python and IPython, so that's not the problem.

This looks possibly relevant: http://stackoverflow.com/questions/12485322/python-2-6-7-multiprocessing-pool-fails-to-shut-down


mperrin commented Aug 23, 2018

Comment by josePhoenix
Monday Apr 06, 2015 at 19:32 GMT


I've done some more investigation, and it appears that the hang happens on a line in matrixDFT.py where it computes a dot product. So I Googled "numpy dot multiprocessing" and, whaddya know, someone else has had this issue... http://stackoverflow.com/questions/23963997/python-child-process-crashes-on-numpy-dot-if-pyside-is-imported

The takeaway seems to be:

this is a general issue with some BLAS libraries used by numpy for dot.

Apple Accelerate and OpenBLAS built with GNU OpenMP are known not to be safe to use on both sides of a fork (the parent and the child process multiprocessing creates). They will deadlock.

This cannot be fixed by numpy, but there are three workarounds:

  • use netlib BLAS, ATLAS, or git-master OpenBLAS based on pthreads (2.8.0 does not work)
  • use Python 3.4 and its new multiprocessing spawn or forkserver start methods
  • use threading instead of multiprocessing; numpy releases the GIL for most expensive operations, so you can achieve decent threading speedups on typical desktop machines

I can't believe it!
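
For the record, here is a minimal, poppy-independent sketch of the failure mode described in that answer (an illustration only: whether it actually deadlocks depends on the numpy BLAS build; with Accelerate or an OpenMP OpenBLAS under fork it can hang, with a fork-safe BLAS it completes normally):

import multiprocessing
import numpy as np

def worker(seed):
    # Each forked worker calls the BLAS-backed dot again.
    rng = np.random.RandomState(seed)
    a = rng.rand(500, 500)
    return float(np.dot(a, a).sum())

if __name__ == '__main__':
    # Touch BLAS in the parent before forking; this is what sets up the deadlock.
    a = np.random.rand(500, 500)
    np.dot(a, a)

    pool = multiprocessing.Pool(processes=4)   # uses fork() on Unix / Python 2
    print(pool.map(worker, range(8)))
    pool.close()
    pool.join()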


mperrin commented Aug 23, 2018

Comment by josePhoenix
Monday Apr 06, 2015 at 20:48 GMT


I threw a bug report into the ether: numpy/numpy#5752

It may be time to rewrite the Slow Fourier Transform in Cython / C / OpenCL (depending on how adventurous I'm feeling).


mperrin commented Aug 23, 2018

Comment by mperrin
Monday Apr 06, 2015 at 23:49 GMT


Oh good grief. I wasn't expecting this to be a fundamental and intractable issue with the Accelerate framework, of all things. I bet the fact that this has generally worked more reliably for me is because I use MacPorts, which can build its own copy of ATLAS.

Yeah, I agree with your comment on the numpy bug discussion that we should just test for this and warn users if they're hitting this case. Looks like we can test against numpy.__config__.blas_opt_info to see whether a given numpy instance is linked against Accelerate.
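
Something along these lines might work as the check (a sketch only; the exact keys present in blas_opt_info vary between numpy builds, as comes up later in this thread):

import numpy as np

blas_info = getattr(np.__config__, 'blas_opt_info', {})
extra_link_args = blas_info.get('extra_link_args', [])
linked_against_accelerate = any('Accelerate' in arg for arg in extra_link_args)
print("numpy linked against Accelerate:", linked_against_accelerate)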

I'd be fully in support of developing a more accelerated version of the MFT, definitely. There are some GPU experts in the building if you want to go that route. I'm tempted to think that the Cython approach could have better return on time invested than the complexities of dealing with GPUs, except the matrix_dft function is already mostly just calls to numpy ufuncs that are presumably written in C internally. So Cython might not give much speedup, but I suppose would at least work around the numpy.dot/Accelerate bug.

Possibly useful notebook on linear algebra in Cython: http://nbviewer.ipython.org/github/carljv/cython_testing/blob/master/cython_linalg.ipynb


mperrin commented Aug 23, 2018

Comment by josePhoenix
Tuesday Apr 07, 2015 at 02:24 GMT


"Good grief": my sentiments exactly.

Further discussion on the GitHub issue at numpy/numpy suggests there's a possibility of building wheels with ATLAS instead of Accelerate, perhaps even as the default going forward. I checked our internal ssbx numpy, and it's currently built with Accelerate for Mac users 😞


mperrin commented Aug 23, 2018

Comment by josePhoenix
Wednesday Apr 08, 2015 at 20:37 GMT


A translation of the NumPy-based matrixDFT code into Cython, with a Cython matrix-product function to replace numpy.dot, is about 23x slower. I've gotten the low-hanging optimization fruit already, I think, though I'm not really experienced with Cython.

If it had been only 4x slower, I would have been happy... The hot path (the matrix product code) is essentially C at this point, so I'm not sure how much additional benefit a C extension will have.


mperrin commented Aug 23, 2018

Comment by mperrin
Wednesday Apr 08, 2015 at 22:41 GMT


Bleh. 23x bleh.

I wonder why it's that much slower. There must be some deep SSE magic going on in the assembly code for numpy.dot. Yeah OK, no wins from Cython.

Do you want to give pyCUDA a try? If so, maybe talk to Perry or Tim in SSB; I think they have some experience with it or can direct you to people here who do. (FYI, there was very mixed experience here a couple of years ago trying GPUs for several projects, like the CTE correction.) Of course, given a goal of "easy distributability", anything GPU-ish is likely even more of a headache than this Accelerate mess, so I'm wary.

Right now we can at least be fairly confident that the existing code should be robust for multiprocessing on Linux, yes? In the long run it may be most efficient to just get things up and running on Python 3.4 since it sounds like that avoids the underlying issue, and we want to go that way anyhow.
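
A minimal sketch of what the Python 3.4 route could look like (assumes Python >= 3.4, where multiprocessing.get_context() and the forkserver/spawn start methods exist; the worker here is just a stand-in for the real per-wavelength propagation):

import multiprocessing
import numpy as np

def propagate_one(wavelength):
    # Stand-in for the real per-wavelength PSF calculation.
    a = np.random.rand(500, 500)
    return wavelength, float(np.dot(a, a).sum())

if __name__ == '__main__':
    # 'forkserver' (or 'spawn') gives the workers a clean interpreter,
    # avoiding the inherited Accelerate state that deadlocks after fork().
    ctx = multiprocessing.get_context('forkserver')
    with ctx.Pool(processes=4) as pool:
        results = pool.map(propagate_one, [1.0e-6, 1.5e-6, 2.0e-6, 2.5e-6])
    print(results)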


mperrin commented Aug 23, 2018

Comment by josePhoenix
Thursday Apr 09, 2015 at 00:35 GMT


Yeah, getting things running on 3.4 is definitely something we should do so we're on a supported version of the language. We can just warn users on 2.7.x if they enable multiprocessing on an affected numpy build.

I was looking at PyOpenCL for broader compatibility than PyCUDA, but they're both lower level than I'd like (have to separate our real and imaginary components, etc.).

Re: Linux: it's hard to say, since there are a few different BLAS libs NumPy can be built with and no standard binary build for Linux as far as I know. It sounds like a similar issue has existed in other libraries and been worked around. I would want to check against our ssb* builds at least to confirm it's not an issue there.
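
For checking any particular build, numpy can report its build-time BLAS/LAPACK configuration directly; a quick sketch:

import numpy as np

# Prints which BLAS/LAPACK this numpy was built against
# (ATLAS, OpenBLAS, MKL, Accelerate, ...).
np.__config__.show()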


mperrin commented Aug 23, 2018

Comment by josePhoenix
Thursday Apr 09, 2015 at 02:43 GMT


For future reference:

  • a (slightly) higher-level Python API over OpenCL and CUDA: reikna
  • an apparently very thorough discussion of FFTs with OpenCL from AMD: part 1, part 2 (with some formatting weirdness)


mperrin commented Aug 23, 2018

Comment by mperrin
Monday Apr 20, 2015 at 04:53 GMT


Some thoughts on performance optimization.

I've just done some benchmarking with line_profiler. For a relatively simple case (webbpsf NIRCam F212N direct imaging), the execution time breaks down as follows. Each level of indentation is what's inside the function call above, but I've (manually) converted percentages to be relative to the total execution time.

~99% of time is spent inside OpticalSystem.propagate_mono()
    78% is inside wavefront.propagateTo
        77% is inside matrix_dft
             1% is the two np.outer() calls
            11% is setting up the expXU and expYV arrays
            64% is the two np.dot() calls
             1% is multiplying by norm_coeff
    11% is wavefront *= optic
     5% is wavefront.normalize()
     6% is return wavefront.asFITS()
        5% is unnecessary array copies which we could easily avoid 

So, no real surprises. Nearly 80% of the execution time is spent in calls to np.dot, np.outer, and np.exp. There is not likely to be much/anything we can do to optimize those beyond BLAS. (Setting aside cases of "numpy is not compiled to use BLAS so is wastefully slow" as an anti-optimization. Argh.) Extensive web searching has turned up some interesting tidbits but nothing that looks particularly likely to yield any substantial speedups.
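
For anyone wanting to reproduce this kind of breakdown, here is a rough line_profiler usage sketch (profiling a toy stand-in function; in practice one would wrap poppy's propagate_mono or matrix_dft instead):

import numpy as np
from line_profiler import LineProfiler

def toy_mft(n):
    # Toy stand-in: build a complex exponential matrix and take a matrix
    # product, roughly the shape of the work inside matrix_dft.
    x = np.exp(2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n)
    return np.dot(x, x)

profiler = LineProfiler()
wrapped = profiler(toy_mft)   # wrap the function to record per-line timings
wrapped(512)
profiler.print_stats()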

The only additional avenue I can think of to pursue would be GPU computing. In particular, Continuum Analytics has CUDA integrated into their Python JIT compiler toolchain, NumbaPro. Here's an example notebook. Seems very slick. CUDA includes its own version of BLAS called CUBLAS. Under the right circumstances it sounds like this can provide significant speedups, and we should probably try it out. On the other hand, the overhead of transferring data in and out of GPU memory could potentially be significant.

(In the back of my mind there's long been an idea that it would be possible to rewrite the code such that the entire propagation takes place on the GPU and only the final results get sent back to main memory, i.e. all the OpticalElement phasor creation code could be compiled to run on the GPUs too. But that is clearly way outside the scope of current efforts. That said, this seems like exactly the sort of thing that scikit-CUDA is intended for.)

Bottom line I think right now we should:

  1. Add error-checking code for Accelerate in numpy.__config__.blas_opt_info and warn users it is not compatible with multiprocessing on Python 2.7
  2. Rework the multiprocessing code so that it can use the forkserver method on Python 3.4 which supposedly avoids that problem, and encourage users to migrate to 3.4 if they want to use multiprocessing.
  3. Investigate GPU options over the longer term, probably most efficiently through NumbaPro.


mperrin commented Aug 23, 2018

Comment by mperrin
Tuesday Apr 21, 2015 at 04:40 GMT


Should add at least the error-check and warning before release 0.3.5:

  • Add error-checking code for Accelerate in numpy.__config__.blas_opt_info and warn users it is not compatible with multiprocessing on Python 2.7
  • Rework the multiprocessing code so that it can use the forkserver method on Python 3.4 which supposedly avoids that problem, and encourage users to migrate to 3.4 if they want to use multiprocessing.


mperrin commented Aug 23, 2018

Comment by JarronL
Monday Aug 01, 2016 at 20:14 GMT


I recently did a clean install of Mac OS X (10.10) on my MacPro along with the latest Anaconda (Python 2.7) and pyFFTW with the corresponding FFTW libraries. After doing some extensive benchmarking, I discovered that the numpy FFT isn't that much slower than pyFFTW, and came to the conclusion that multiprocessing the numpy FFT is a much better use of my machine's resources. Fortunately, my numpy uses MKL, so I don't have the forking issue under Apple's Accelerate library.

Unfortunately, the error-checking statement in Poppy (line 1560) assumes that np.__config__.blas_opt_info['extra_link_args'] exists, which causes an error for me since that key is not present in my blas_opt_info dict. For reference, my dictionary looks like:

{'define_macros': [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)],
 'include_dirs': ['/Users/jarron/anaconda2/include'],
 'libraries': ['mkl_intel_lp64',
  'mkl_intel_thread',
  'mkl_core',
  'iomp5',
  'pthread'],
 'library_dirs': ['/Users/jarron/anaconda2/lib']}

Until there is a fix, I modified my copy to also check if that key exists:

import sys
import platform
import numpy as np

d = np.__config__.blas_opt_info
accel_bool = ('extra_link_args' in d) and ('-Wl,Accelerate' in d['extra_link_args'])

if (sys.version_info < (3, 4, 0)) and (platform.system() == 'Darwin') and accel_bool:
    # _log is poppy's module-level logger
    _log.error("Multiprocessing not compatible with Apple Accelerate library on Python < 3.4")
    _log.error(" See https://github.com/mperrin/poppy/issues/23 ")
    _log.error(" Either disable multiprocessing, or recompile your numpy without Accelerate.")
    raise NotImplementedError("Multiprocessing not compatible with Apple Accelerate framework.")


mperrin commented Aug 23, 2018

Comment by mandarup
Tuesday Dec 05, 2017 at 17:45 GMT


Just commenting for reference in case this is useful to someone:

A few years later, still using Python 2.7 due to some dependencies lagging behind, I realized np.dot was hanging joblib.Parallel with backend='multiprocessing'.

I worked around it using a simple jitted dot implementation:

import numba as nb
import numpy as np
@nb.jit(nb.f8[:, :](nb.f8[:, :], nb.f8[:, :]), nopython=True)
def dot_product(a, b):
    size_a = a.shape
    size_b = b.shape
    size_out = (size_a[0], size_b[1])
    out = np.zeros(size_out)
    for i in range(size_a[0]):
        for j in range(size_b[1]):
            for k in range(size_b[0]):
                out[i,j] += a[i,k] * b[k, j]
    return out
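
For what it's worth, a quick single-process sanity check of the jitted replacement against numpy (this does not by itself exercise the fork deadlock):

a = np.random.rand(200, 300)
b = np.random.rand(300, 100)
assert np.allclose(dot_product(a, b), np.dot(a, b))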

mperrin closed this as completed Aug 23, 2018