CUDA accelerated propagation #239
Comments
Comment by mperrin This is spectacular! I've been wanting to dig into all these performance optimization toolkits but just haven't had time. In addition to CUDA and numexpr, did you do any tests with numba? As I recall, @josePhoenix had done some tests with Cython but wasn't able to get performance better than a well-optimized numpy for the array operations. That was in the Fraunhofer code rather than the Fresnel code, though. I'm excited to see you got a 3x speedup using CUDA!
Comment by joseph-long Awesome stuff, @douglase! I remember last time I looked at this I was trying to target OpenCL, which alas seems to be dead in the water, and numba's CUDA support was still proprietary/paid-license only. Exciting to see that support has matured so much!
Comment by douglase (Much of the benchmarking for this project is recreated in this notebook: https://gist.github.com/douglase/3846e8f105cd9baec96706681d0b8ee5)
Comment by mperrin Wow, those benchmarks are really striking. Numexpr is way out in front, and for much less programming effort than the more manual numba vectorize and CUDA versions. I've done a bit of benchmarking here too and added numexpr calls to parts of the Fraunhofer pathway, and gotten some nice speedups there. But the real heavy lifting is still the calls to the numpy dot product, which doesn't fit easily into what numexpr can help with.
Comment by mperrin @douglase I'm assuming you're around at the AAS meeting this week? I'm going to try restarting the build of this PR on Travis now to see if it works (it was never clear why it died before). The actual code worked fine in my local tests. If necessary, maybe we can meet up for half an hour sometime this week to figure out how to get this running in the Travis CI setup?
Comment by mperrin I ended up just merging this from the command line manually — trying to do anything clever with pushing the conflict fix onto your branch first just led me down into an unproductive morass of inscrutable git error messages. It was much simpler to merge into master manually instead.
Issue by douglase
Monday Nov 13, 2017 at 00:03 GMT
Originally opened as mperrin/poppy#239
This is a proof-of-concept addition to see if CUDA FFTs would significantly accelerate the angular spectrum code. While that is the case, in the course of implementation and benchmarking (profiling with ipython %prun) I discovered that significant bottlenecks were np.fft.fftshift() and np.exp(). These are also addressed in this branch, the former with a numba GPU implementation of fftshift and the latter using the NumExpr library.
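To illustrate the NumExpr substitution for np.exp() mentioned above, here is a minimal sketch (the function name and parameters are hypothetical, not poppy's actual API): NumExpr compiles the whole elementwise expression and evaluates it in parallel chunks, avoiding the large intermediate arrays numpy would allocate.

```python
import numpy as np

# Fall back to numpy when NumExpr is not installed.
try:
    import numexpr as ne
    HAVE_NUMEXPR = True
except ImportError:
    HAVE_NUMEXPR = False

def quadratic_phase(rho_sq, k, z):
    """Evaluate exp(1j * k * rho_sq / (2 * z)) elementwise.

    Illustrative only: the kind of complex-exponential factor a
    Fresnel propagator applies at each surface.
    """
    if HAVE_NUMEXPR:
        # NumExpr evaluates the expression in one multithreaded pass,
        # without numpy's intermediate temporaries.
        return ne.evaluate("exp(1j * k * rho_sq / (2 * z))")
    return np.exp(1j * k * rho_sq / (2 * z))

rho_sq = np.linspace(0, 1, 1024) ** 2
phase = quadratic_phase(rho_sq, k=2 * np.pi / 1e-6, z=1.0)
```

Since the argument is purely imaginary, every element of the result has unit magnitude, which makes a quick sanity check easy.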
This obviously adds a lot of optional dependencies, but my philosophy was to follow the FFTW approach and gracefully revert to numpy everywhere, so each addition gives a partial improvement on its own. On my machine, tests pass both with and without numba or CUDA installed.
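The graceful-fallback philosophy described above can be sketched as a try-import guard plus a small dispatch function (this is an illustration of the pattern, not poppy's actual accelerated-math module; pyFFTW stands in for any of the optional backends):

```python
import numpy as np

# Detect the optional accelerated backend once at import time.
try:
    import pyfftw
    _USE_FFTW = True
except ImportError:
    _USE_FFTW = False

def fft2(wavefront):
    """2-D FFT of a wavefront array.

    Uses pyFFTW's numpy-compatible interface when the library is
    installed, and reverts to numpy's FFT otherwise, so callers get
    identical results either way and a speedup when possible.
    """
    if _USE_FFTW:
        return pyfftw.interfaces.numpy_fft.fft2(wavefront)
    return np.fft.fft2(wavefront)
```

Because each backend is gated independently, installing only some of the optional dependencies still yields a partial speedup, and a machine with none of them installed behaves exactly like plain numpy.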
While further optimization is possible, for a toy case with 3 optical surfaces (below), generating a PSF with the Fresnel wavefront class is accelerated by as much as a factor of 3:
Benchmark Optical system:
Benchmark computer:
Google Compute Engine:
douglase included the following code: https://github.com/mperrin/poppy/pull/239/commits