Under certain (unclear) conditions, use_multiprocessing causes calculation to hang #23
Comments
Comment by mperrin I am now having this same problem on my laptop (Macbook Pro), while the exact same code works on my desktop (Mac Pro). Both are OS X 10.9.5. Here's webbpsf.system_diagnostic() for the one that works:
And here's the one that doesn't work:
Comment by mperrin Hangs the same way in both python and IPython, so that's not the problem. This looks possibly relevant: http://stackoverflow.com/questions/12485322/python-2-6-7-multiprocessing-pool-fails-to-shut-down
Comment by josePhoenix I've done some more investigation, and it appears that the hang happens on one particular line. The takeaway seems to be:
I can't believe it!
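For later readers, the failure pattern described upstream boils down to something like the following sketch (not code from this thread; the array size is arbitrary). BLAS gets used in the parent process, and then a forked child calls numpy.dot. On an Accelerate-linked NumPy the child's call can deadlock; on ATLAS/MKL/OpenBLAS builds it completes normally:

```python
import os
import numpy as np

a = np.ones((200, 200))
np.dot(a, a)  # exercise the BLAS library in the parent before forking

pid = os.fork()
if pid == 0:
    # Child: with an Accelerate-linked NumPy this call can deadlock,
    # because Accelerate's internal worker threads do not survive fork().
    np.dot(a, a)
    os._exit(0)

_, status = os.waitpid(pid, 0)
print("child exit status:", status)
```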
Comment by josePhoenix I threw a bug report into the ether: numpy/numpy#5752 It may be time to rewrite the Slow Fourier Transform in Cython / C / OpenCL (depending on how adventurous I'm feeling).
Comment by mperrin Oh good grief. I wasn't expecting this to be a fundamental and intractable issue with the Accelerate framework, of all things. I bet the fact that this has generally worked more reliably for me is due to my using MacPorts, which can build its own copy of ATLAS. Yeah, I agree with your comment on the numpy bug discussion that we should just test for this and warn users if they're hitting this case. Looks like we can test for that. I'd be fully in support of developing a more accelerated version of the MFT, definitely. There are some GPU experts in the building if you want to go that route. I'm tempted to think that the Cython approach could have a better return on time invested than the complexities of dealing with GPUs, except that the matrix_dft function is already mostly just calls to numpy ufuncs that are presumably written in C internally. So Cython might not give much speedup, but I suppose it would at least work around the numpy.dot/Accelerate bug. Possibly useful notebook on linear algebra in Cython: http://nbviewer.ipython.org/github/carljv/cython_testing/blob/master/cython_linalg.ipynb
Comment by josePhoenix "Good grief": my sentiments exactly. Follow-up discussion on the numpy/numpy issue suggests there's a possibility of building wheels with ATLAS instead of Accelerate, perhaps even as the default going forward. I checked our internal
Comment by josePhoenix A translation of the NumPy-based matrixDFT code into Cython, with a Cython matrix product function as a replacement. If it had been only 4x slower, I would have been happy... The hot path (the matrix product code) is essentially C at this point, so I'm not sure how much additional benefit a C extension would have.
Comment by mperrin Bleh. 23x bleh. I wonder why it's that much slower. There must be some deep SSE magic going on in the assembly code. Do you want to give PyCUDA a try? If so, maybe talk to Perry or Tim in SSB; I think they have some experience with it or can direct you to people here who do. (FYI, there was very mixed experience here a couple of years ago trying GPUs for several projects, like the CTE correction.) Of course, given a goal of "easy distributability", anything GPU-ish is likely even more of a headache than this Accelerate mess, so I'm wary. Right now we can at least be fairly confident that the existing code should be robust for multiprocessing on Linux, yes? In the long run it may be most efficient to just get things up and running on Python 3.4, since it sounds like that avoids the underlying issue, and we want to go that way anyhow.
Comment by josePhoenix Yeah, getting things running on 3.4 is definitely something we should do, so we're on a supported version of the language. We can just warn users on 2.7.x if they enable multiprocessing on an affected numpy build. I was looking at PyOpenCL for broader compatibility than PyCUDA, but they're both lower-level than I'd like (you have to separate the real and imaginary components, etc.). Re: Linux: it's hard to say, since there are a few different BLAS libraries NumPy can be built with, and no standard binary build for Linux as far as I know. It sounds like a similar issue has existed in other libraries and been worked around. I would want to check against our
Comment by josePhoenix For future reference: |
Comment by mperrin Some thoughts on performance optimization. I've just done some benchmarking with line_profiler. For a relatively simple case (webbpsf NIRCam F212N direct imaging), the execution time breaks down as follows. Each level of indentation is what's inside the function call above, but I've (manually) converted percentages to be relative to the total execution time.
So, no real surprises. Nearly 80% of the execution time is spent in calls to
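For readers who want to reproduce this kind of breakdown without installing line_profiler, a rough sketch of the same idea using only the standard library's cProfile (the function below is an invented stand-in for POPPY's matrix DFT hot path, not its actual code):

```python
import cProfile
import io
import pstats

import numpy as np

def matrix_dft_like(n=256):
    # Stand-in for the matrix DFT hot path: build a Fourier kernel
    # and apply it as a pair of matrix products.
    exp_mat = np.exp(2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n)
    plane = np.ones((n, n))
    return np.dot(np.dot(exp_mat, plane), exp_mat)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(5):
    matrix_dft_like()
profiler.disable()

# Print the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

line_profiler adds per-line timings within each function, which is what produced the finer-grained numbers discussed above; cProfile only attributes time per function.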
The only additional avenue I can think of to pursue would be GPU computing. In particular, Continuum Analytics has CUDA integrated into their Python JIT compiler toolchain, NumbaPro. Here's an example notebook. Seems very slick. CUDA includes its own version of BLAS, called CUBLAS. Under the right circumstances it sounds like this can provide significant speedups, and we should probably try it out. On the other hand, the overhead of transferring data in and out of GPU memory could be significant. (In the back of my mind there's long been an idea that it would be possible to rewrite the code such that the entire propagation takes place on the GPU and only the final results get sent back to main memory, i.e. all the OpticalElement phasor creation code could be compiled to run on the GPU too. But that is clearly way outside the scope of current efforts. That said, this seems like exactly the sort of thing that scikit-cuda is intended for.) Bottom line, I think right now we should:
Comment by mperrin Should add at least the error-check and warning before release 0.3.5:
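A sketch of what such an error-check and warning could look like (not the code POPPY shipped; the `blas_opt_info` / `extra_link_args` introspection here is an assumption based on this thread, and the exact keys may differ between numpy versions):

```python
import warnings

import numpy as np

def warn_if_accelerate():
    """Warn if NumPy appears to be linked against Apple's Accelerate
    framework, whose BLAS is not fork-safe (see numpy/numpy#5752)."""
    # Older numpy exposes build info as attributes on np.__config__;
    # fall back to an empty dict where that attribute does not exist.
    blas_info = getattr(np.__config__, "blas_opt_info", {})
    link_args = blas_info.get("extra_link_args", [])
    uses_accelerate = any("Accelerate" in str(arg) or "vecLib" in str(arg)
                          for arg in link_args)
    if uses_accelerate:
        warnings.warn("NumPy is linked against Accelerate: enabling "
                      "multiprocessing may hang in numpy.dot after fork().")
    return uses_accelerate

print(warn_if_accelerate())
```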
Comment by JarronL I recently did a clean install of Mac OS X (10.10) on my Mac Pro, along with the latest Anaconda (Python 2.7) and pyFFTW with the corresponding FFTW libraries. After doing some extensive benchmarking, I discovered that numpy's FFT isn't that much slower than pyFFTW, and came to the conclusion that multiprocessing with numpy's FFT is a much better use of my machine's resources. Fortunately, my numpy uses MKL, so I don't have the forking issue under Apple's Accelerate library. Unfortunately, the error-checking statement in POPPY (line 1560) assumes that
Until there is a fix, I modified my copy to also check if that key exists:
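The shape of that fix, sketched with an illustrative dict standing in for an MKL-style `blas_opt_info` (the dict contents are invented for illustration; MKL builds simply describe their link step with different keys):

```python
# An MKL-flavoured blas_opt_info typically has no 'extra_link_args' entry.
mkl_style_info = {"libraries": ["mkl_rt"], "library_dirs": ["/opt/intel/lib"]}

# Fragile: mkl_style_info["extra_link_args"] raises KeyError on MKL builds.
# Defensive: check for the key, or use .get() with a default.
link_args = mkl_style_info.get("extra_link_args", [])
uses_accelerate = any("Accelerate" in arg for arg in link_args)
print(uses_accelerate)
```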
Comment by mandarup Just commenting for reference in case this is useful to someone: a few years later, still on Python 2.7 because some dependencies are lagging behind, I ran into this and worked around it using a simple jitted dot implementation.
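The workaround amounts to replacing numpy.dot with a hand-rolled matrix product inside the multiprocessing workers. A sketch of the idea, shown in pure Python/NumPy so it is self-contained; in practice the function would be compiled with numba (e.g. wrapped with numba's njit) to make the loops fast:

```python
import numpy as np

def jitted_dot(a, b):
    # Naive triple-loop matrix product. It never calls into the BLAS
    # library, so it cannot hit the Accelerate fork deadlock.
    # In practice: from numba import njit; jitted_dot = njit(jitted_dot)
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m), dtype=np.result_type(a, b))
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i, p] * b[p, j]
            out[i, j] = acc
    return out
```

Without the JIT this is orders of magnitude slower than numpy.dot, so it only makes sense as a compiled kernel, but it sidesteps the fork-safety problem entirely.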
Issue by josePhoenix
Friday Dec 12, 2014 at 16:17 GMT
Originally opened as mperrin/poppy#23
For some reason, this sample code (using WebbPSF) causes a hang on my Mac Pro but not the science5 server or my personal machine (MacBook Air). Needs a better test case that uses only POPPY.
(To avoid the sometimes unusual behavior of multiprocessing + the interactive interpreter, I saved this as a separate script and ran it from the command line.)
Running this prints out logging output through "Propagating wavelength..." lines for the first 20 of 40 wavelengths, and then nothing happens.
If I interrupt the script with Ctrl-C, I get tracebacks from all the running workers interleaved (to be expected, but hard to interpret), and it looks like they're all waiting on "task = get()", which in turn is waiting on "racquire()", which seems to be acquiring a lock on a pipe.
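A POPPY-free test case along those lines might look like the following sketch (sizes, the wavelength count, and the function name are invented for illustration). On an affected Accelerate build, the pool.map call is where the workers stall waiting on "task = get()":

```python
import multiprocessing

import numpy as np

def propagate(wavelength):
    # Stand-in for a per-wavelength propagation: POPPY's matrix DFT
    # is ultimately a pair of numpy.dot calls like these.
    n = 64
    kernel = np.exp(2j * np.pi * wavelength *
                    np.outer(np.arange(n), np.arange(n)) / n)
    pupil = np.ones((n, n))
    return abs(np.dot(np.dot(kernel, pupil), kernel)).sum()

if __name__ == "__main__":
    np.dot(np.ones((8, 8)), np.ones((8, 8)))  # touch BLAS in the parent
    pool = multiprocessing.Pool(4)
    results = pool.map(propagate, range(1, 41))  # 40 "wavelengths"
    pool.close()
    pool.join()
    print(len(results))
```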