
CUDA accelerated propagation #239

Closed
mperrin opened this issue Aug 27, 2018 · 7 comments
mperrin commented Aug 27, 2018

Issue by douglase
Monday Nov 13, 2017 at 00:03 GMT
Originally opened as mperrin/poppy#239


This is a proof-of-concept addition to see if CUDA FFTs would significantly accelerate the angular spectrum code. They do, but in the course of implementation and benchmarking (profiling with IPython %prun) I discovered that np.fft.fftshift() and np.exp() were also significant bottlenecks. These are addressed in this branch as well, the former with a Numba GPU implementation of fftshift and the latter using the NumExpr library.
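
For concreteness, here is a minimal sketch (not this branch's actual code) of the kind of NumExpr substitution described above, replacing a call like np.exp(1j * phase) with a single compiled, multithreaded expression:

import numpy as np
import numexpr as ne

def expi_numexpr(phase):
    # Evaluate exp(1j*phase) in one multithreaded NumExpr pass,
    # avoiding the large temporaries the pure-NumPy version allocates.
    return ne.evaluate("exp(1j * phase)")

phase = np.random.uniform(-np.pi, np.pi, (2048, 2048))
np.testing.assert_allclose(expi_numexpr(phase), np.exp(1j * phase))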

This obviously adds a lot of optional dependencies, but my philosophy was to follow the FFTW approach and gracefully revert to NumPy everywhere, so each addition gives a partial improvement on its own. On my machine the tests pass both with and without Numba or CUDA installed.
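
As a rough illustration of that philosophy (a sketch of the pattern, not this branch's actual module layout), each accelerated path can be guarded by an import check so the optional dependency only ever adds speed:

import numpy as np

try:
    import numexpr as ne
    HAVE_NUMEXPR = True
except ImportError:
    HAVE_NUMEXPR = False

def fast_exp(x):
    # Use NumExpr when available, otherwise fall back transparently to
    # np.exp so results are identical with or without the dependency.
    if HAVE_NUMEXPR:
        return ne.evaluate("exp(x)")
    return np.exp(x)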

While further optimization is possible, for a toy case with 3 optical surfaces (below), generating a PSF with the Fresnel wavefront class is accelerated by as much as a factor of 3:

[benchmark timing figure]

Benchmark optical system:


import astropy.units as u
import poppy

wavelen = 770e-9  # wavelength in meters

D_prim = 2.37 * u.m
fr_pri = 7.8
fl_pri = D_prim * fr_pri

D_relay = 20 * u.mm
fl_m2 = fl_pri * D_relay / D_prim
fr_m3 = 20.
fl_m3 = fr_m3 * D_relay

def wfirst_sys(npix=1024, ratio=0.25):
    wfirst_optsys = poppy.FresnelOpticalSystem(pupil_diameter=D_prim, npix=npix, beam_ratio=ratio)

    m1 = poppy.QuadraticLens(fl_pri, name='Primary')
    m2 = poppy.QuadraticLens(fl_m2, name='M2')
    m3 = poppy.QuadraticLens(fl_m3, name='M3')
    m4 = poppy.QuadraticLens(fl_m3, name='M4')  # defined but not added to this simplified system

    wfirst_optsys.add_optic(poppy.CircularAperture(radius=D_prim.value/2))
    wfirst_optsys.add_optic(m1)
    wfirst_optsys.add_optic(m2, distance=fl_pri + fl_m2)
    wfirst_optsys.add_optic(m3, distance=fl_m2 + fl_m3)

    # Alternative: treat the focus as an image plane instead of an intermediate plane
    #wfirst_optsys.add_optic(poppy.ScalarTransmission(planetype=poppy.poppy_core.PlaneType.image, name='focus'),
    #                        distance=fl_m3)

    wfirst_optsys.add_optic(poppy.ScalarTransmission(planetype=poppy.poppy_core.PlaneType.intermediate,
                                                     name='focus'), distance=fl_m3)
    return wfirst_optsys

wfirst_optsys4096 = wfirst_sys(npix=1024, ratio=0.25)
%timeit wfirst_optsys4096.calcPSF(wavelength=wavelen, display_intermediates=False, return_intermediates=False)

Benchmark computer: Google Compute Engine

Machine type: n1-standard-8 (8 vCPUs, 30 GB memory)
CPU platform: Intel Haswell
GPUs: 1 x NVIDIA Tesla K80


douglase included the following code: https://github.com/mperrin/poppy/pull/239/commits

mperrin commented Aug 27, 2018

Comment by douglase
Monday Nov 13, 2017 at 00:20 GMT


Also, h/t to @neilzim for the benchmark prescription.

I will investigate the Travis build failure.

mperrin commented Aug 27, 2018

Comment by mperrin
Tuesday Nov 14, 2017 at 01:53 GMT


This is spectacular! I've been wanting to dig into all these performance optimization toolkits but just haven't had time. In addition to CUDA and NumExpr, did you do any tests with Numba? As I recall, @josePhoenix had done some tests with Cython but wasn't able to beat a well-optimized NumPy for the array operations. That was in the Fraunhofer code rather than the Fresnel code, though. I'm excited to see you got a 3x speedup using CUDA!
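
(For reference, a numba.vectorize version of the elementwise complex exponential might look roughly like the sketch below; this is illustrative only, not code from the PR.)

import numpy as np
from numba import vectorize

@vectorize(["complex128(float64)"], target="parallel")
def expi_numba(phase):
    # Compiled elementwise exp(1j*phase); numba broadcasts this over arrays
    # using multiple threads (or target="cuda" for the GPU, if available).
    return np.cos(phase) + 1j * np.sin(phase)

phase = np.linspace(-np.pi, np.pi, 2**22)
result = expi_numba(phase)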

mperrin commented Aug 27, 2018

Comment by joseph-long
Tuesday Nov 14, 2017 at 17:54 GMT


Awesome stuff, @douglase! I remember that last time I looked at this, I was trying to target OpenCL, which alas seems to be dead in the water, and numba's CUDA support was still proprietary/paid-license only. Exciting to see that support has matured so much!

mperrin commented Aug 27, 2018

Comment by douglase
Friday Nov 17, 2017 at 17:05 GMT


(much of the benchmarking for this project is recreated in this notebook: https://gist.github.com/douglase/3846e8f105cd9baec96706681d0b8ee5)

mperrin commented Aug 27, 2018

Comment by mperrin
Friday Nov 17, 2017 at 22:47 GMT


Wow, those benchmarks are really striking. NumExpr is way out in front, and for much less programming effort than the more manual Numba vectorize and CUDA versions.

I've done some benchmarking here as well and added numexpr calls to parts of the Fraunhofer pathway too, and gotten some nice speedups there. But the real heavy lifting in that pathway is still the calls to the NumPy dot product, which doesn't fit easily into what numexpr can help with.
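
(For context, the dot products in question are the matrix-DFT style transforms, roughly as in the simplified sketch below; this is not POPPY's actual MFT implementation. The work is a pair of dense complex matrix multiplications, which BLAS already handles well and which NumExpr's elementwise model cannot express.)

import numpy as np

def matrix_dft(field, u_out, x_in, v_out, y_in):
    # Simplified matrix Fourier transform: the transform is applied as two
    # dense matrix products, E_ux @ field @ E_yv, so the runtime is
    # dominated by np.dot (BLAS) rather than by elementwise math.
    E_ux = np.exp(-2j * np.pi * np.outer(u_out, x_in))   # shape (M_u, N_x)
    E_yv = np.exp(-2j * np.pi * np.outer(y_in, v_out))   # shape (N_y, M_v)
    return np.dot(np.dot(E_ux, field), E_yv)             # field shape (N_x, N_y)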

mperrin commented Aug 27, 2018

Comment by mperrin
Sunday Jan 07, 2018 at 21:39 GMT


@douglase I'm assuming you're at AAS this week? I'm going to try restarting this PR's Travis build now to see if it works (it was never clear why it died before). The actual code worked fine in my local tests. If necessary, maybe we can meet up for half an hour sometime this week to figure out how to get this running in the Travis CI setup?

mperrin commented Aug 27, 2018

Comment by mperrin
Monday Jan 08, 2018 at 19:05 GMT


I ended up just merging this manually from the command line - trying to do anything clever with pushing the conflict fix onto your branch first just led me into an unproductive morass of inscrutable git error messages. It was much simpler to just merge into master manually instead.
