ENH: Make special C++ implementations work on CUDA (and beyond!) #19601
Conversation
We'll need to figure out if / how to test this. The long-term goal is to propose a special extension for the array API, but we need to be a bit of the change we want to see. Here is a script I've been using that shows how to use CuPy to turn this into a CuPy ufunc and run it:

import cupy as cp
import numpy as np
import scipy as sp
import scipy.special
preamble = '#include <special/lambertw.h>'
kern = cp.ElementwiseKernel(
    'complex128 z, int64 k, float64 tol', 'complex128 out',
    'out = special::lambertw(z, k, tol)', 'lambertw',
    preamble=preamble,
    options=(f'--include-path={sp.special._get_include()}', '-std=c++14'))
# Reference result on the host with NumPy inputs and SciPy
z = np.arange(10, dtype=complex)
k = np.asarray(10, dtype=np.int64)
tol = np.asarray(0.01, dtype=float)
out_sp = sp.special.lambertw(z, k, tol)

# Same computation on the GPU with CuPy inputs and the generated kernel
z = cp.arange(10, dtype=complex)
k = cp.asarray(10, dtype=cp.int64)
tol = cp.asarray(0.01, dtype=float)
out_cp = kern(z, k, tol)
print()
print(out_sp)
print(out_cp)
A huge thank you for this! (I recently had to copy-paste some Cephes code into https://github.com/cupy/cupy/tree/main/cupyx/scipy/special --- it would be great to avoid copy-pasting going forward.)
That is the plan! I believe @steppi has his 👀 on all of Cephes, possibly quite soon.
Very nice @izaid! One small thing: can you put the formatting changes in a separate commit so it's easier to look at the diffs? For the test, I think we can just add a test in … I'll give this a closer look tomorrow.
I'll do my best to have a PR ready this coming week.
@steppi I have reverted the formatting and will save that for a follow-up PR. I have some thoughts on how to test sensibly, but I'd actually like to leave that for a follow-up PR as well. It will require some effort, and I'd like to get these C++ changes in so as not to conflict with your and others' work. Of course, there is already testing in this PR by virtue of the host code being the default SciPy implementation.
Ok. I can accept adding the CuPy tests later as long as this PR doesn't break anything in SciPy.
I'm seeing relevant test failures due to redefinitions of constants.
Nice. Looks like that did it.
Yeah, I think something unusual was going on in the build system for special, but I sort of guessed how to sidestep it.
I had a synchronous conversation with @izaid, who explained how all of this works, since I don't have experience with CUDA programming.
In summary, when run on CUDA, we need to add __host__ and __device__ to the function signatures to tell the compiler we want them to be able to run on the CPU and GPU respectively. There are also constants and standard library functions that are missing on CUDA, so if the preprocessor detects these are missing, the code defines them. Thrust is used for missing standard library functions, and only those actually needed for lambertw are included. We'll add more as they are needed. It will be up to downstream users of this library to ensure the needed dependencies are available and linked.
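To make the annotation pattern concrete, here is a minimal sketch; the macro and function names below are illustrative, not necessarily the ones used in this PR. The idea is that a macro expands to __host__ __device__ only when a CUDA compiler is in use, so the same header builds unchanged for plain CPU SciPy:

#ifdef __CUDACC__
// Compiling with nvcc / NVRTC: mark functions callable from host and device.
#define SPECIAL_HOST_DEVICE __host__ __device__
#else
// Plain C++ compiler: the annotation disappears.
#define SPECIAL_HOST_DEVICE
#endif

namespace special {

// The same source compiles for the CPU and, under CUDA, for the GPU as well.
SPECIAL_HOST_DEVICE inline double square(double x) { return x * x; }

}  // namespace special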
This PR has no impact on the behavior of this code on the SciPy side, so I think it's fine for us not to add new tests. izaid has verified that this code will work with CuPy locally.
Amazing work izaid, really happy to see this!
Howdy, I lead the team that maintains https://github.com/NVIDIA/cccl and I wanted to follow up on deferring to Thrust for the missing standard library functionality. I'd highly recommend using libcudacxx instead of Thrust, as libcudacxx is intended specifically for this kind of use case, where you want a heterogeneous C++ standard library implementation that works with NVRTC. So instead of pulling in Thrust, you'd include the equivalent libcudacxx headers. I'd also strongly recommend against injecting things into the std namespace.
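For illustration, the libcudacxx route looks roughly like the sketch below. Which headers lambertw would actually need is an assumption here, but the idea is that the cuda/std/ headers from https://github.com/NVIDIA/cccl provide std-equivalent functionality that also works under NVRTC:

#ifdef __CUDACC__
// Heterogeneous standard-library headers from libcudacxx (NVIDIA/cccl);
// which ones lambertw actually needs is an assumption in this sketch.
#include <cuda/std/cmath>
#include <cuda/std/complex>
namespace cstd = cuda::std;
#else
#include <cmath>
#include <complex>
namespace cstd = std;
#endif

// Downstream code can then write cstd::complex<double>, cstd::abs, ... and
// have it work on both host and device without touching namespace std.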
Hi @jrhemstad! Thanks for the helpful thoughts. Indeed, I'm keeping an eye on libcudacxx. Of course, noted about the std namespace. Would certainly be great to be in touch! Am sure a lot of the work you are doing will be quite useful here.
cc @leofang (who may know)
Also noting that I do intend to add tests to SciPy as soon as I can, essentially by generating CuPy ufuncs with ElementwiseKernel. I think this is non-trivial, as it means testing across multiple array APIs without a specific extension existing. I have some thoughts on how to do this reasonably.
Welcome @jrhemstad! I'm so happy you found your way here. We'd been meaning to try to reach out to someone at NVIDIA about the work we're beginning. I now recall @izaid had told me about …
Happy to help any way I can! Let me know if you want to set up a time to chat more in depth and I'd be happy to answer any questions or explain where we're going.
Reference issue
RFC 19404
What does this implement/fix?
This is the next step in the C++ migration of sp.special. In particular, it updates the new C++ lambertw to work on CUDA. This required a bit of extra work and boilerplate, but once that was done, the code looks almost the same. By doing this now, we'll make more special functions CUDA-compatible (and array-API compatible more generally) as we go.
What you will note is that NVRTC lacks some of the standard library. People are aware of that, and there are many ways to work around it. What I've done, and will advocate for here, is a simple config.h file that checks whether certain standard include headers exist. If they don't, we either alias existing functions or types, or simply add them ourselves. A small amount of work can take us a long way here. I have chosen to define them directly in std, rather than declare some other namespace, and we can talk more about that if people want. NVIDIA is working on this more generally, see https://github.com/nvidia/cccl
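To make the config.h idea concrete, here is a minimal sketch of what such a header could look like; the header and function choices are illustrative, not the exact contents of the file in this PR:

#pragma once

// If the real <cmath> exists (any normal host build), just use it.
#if __has_include(<cmath>)
#include <cmath>
#else
// Under NVRTC there is no <cmath>, but the CUDA math builtins (::exp, ::log,
// ::fabs, ...) are available, so alias the ones we need into namespace std
// so that the special function code can stay unchanged.
namespace std {
using ::exp;
using ::log;
using ::fabs;
}  // namespace std
#endif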