
ENH: Make special C++ implementations work on CUDA (and beyond!) #19601

Merged: 6 commits merged into scipy:main on Nov 30, 2023

Conversation

@izaid (Contributor) commented Nov 26, 2023

Reference issue

RFC 19404

What does this implement/fix?

This is the next step in the C++ migration of sp.special. In particular, it updates the new C++ lambertw to work on CUDA. That required a bit of extra work and boilerplate, but once that was done the code looks almost the same. By doing this now, we can make more special functions CUDA-compatible (and array API compatible more generally) as we go.

What you will note is that NVRTC lacks parts of the standard library. People are aware of that, and there are many ways to work around it. What I've done, and will advocate for here, is a simple config.h file that checks whether certain std headers exist. If they don't, we either alias existing functions or types, or simply add them ourselves. A small amount of work takes us a long way here. I have chosen to define them directly in std, rather than declare some other namespace, and we can talk more about that if people want.

NVIDIA is working on this more generally; see https://github.com/nvidia/cccl.
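
For concreteness, here is a minimal sketch of the header-detection pattern described above. It is illustrative only, not the exact contents of this PR's config.h, and the Thrust fallback is just one possible substitute:

// Illustrative sketch only; the actual config.h may differ in detail.
// Test for a standard header and, when it is missing (as under NVRTC),
// alias a Thrust substitute into std instead.
#if __has_include(<complex>)
#include <complex>
#else
#include <thrust/complex.h>
namespace std {
    // Note: adding names to std is undefined behavior in standard C++;
    // this mirrors the deliberate, temporary choice described above.
    template <typename T>
    using complex = thrust::complex<T>;

    template <typename T>
    __host__ __device__ T abs(const complex<T> &z) {
        return thrust::abs(z);
    }
} // namespace std
#endif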

@izaid (Contributor, Author) commented Nov 26, 2023

We'll need to figure out if and how to test this. The long-term goal is to propose a special extension for the array API, but for now we need to be a bit of the change we want to see.

Here is a script I've been using that shows how to use CuPy's ElementwiseKernel to turn this into a CuPy ufunc and run it:

import cupy as cp
import numpy as np
import scipy as sp
import scipy.special

# Compile the C++ lambertw header into an elementwise CUDA kernel.
preamble = '#include <special/lambertw.h>'

kern = cp.ElementwiseKernel(
    'complex128 z, int64 k, float64 tol', 'complex128 out',
    'out = special::lambertw(z, k, tol)', 'lambertw',
    preamble=preamble,
    options=(f'--include-path={sp.special._get_include()}', '-std=c++14'))

# Reference result computed on the host by SciPy.
z = np.arange(10, dtype=complex)
k = np.asarray(10, dtype=np.int64)
tol = np.asarray(0.01, dtype=float)

out_sp = sp.special.lambertw(z, k, tol)

# Same computation on the GPU through the CuPy kernel.
z = cp.arange(10, dtype=complex)
k = cp.asarray(10, dtype=cp.int64)
tol = cp.asarray(0.01, dtype=float)

out_cp = kern(z, k, tol)

print(out_sp)
print(out_cp)


@lucascolley added the scipy.special, enhancement (A new feature or improvement), and C/C++ (Items related to the internal C/C++ code base) labels on Nov 26, 2023
@ev-br (Member) commented Nov 26, 2023

A huge thank you for this!

(I recently had to copy-paste some Cephes code into https://github.com/cupy/cupy/tree/main/cupyx/scipy/special --- it would be great to avoid copy-pasting going forward)

@izaid (Contributor, Author) commented Nov 26, 2023

> A huge thank you for this!
>
> (I recently had to copy-paste some Cephes code into https://github.com/cupy/cupy/tree/main/cupyx/scipy/special --- it would be great to avoid copy-pasting going forward)

That is the plan! I believe @steppi has his 👀 on all of Cephes, possibly quite soon.

@steppi (Contributor) commented Nov 26, 2023

Very nice @izaid!

One small thing: can you put the formatting changes in a separate commit so it's easier to look at the diffs?

For the test, I think we can just add a test in scipy/special/tests that gets skipped if CuPy isn’t available.

I’ll give this a closer look tomorrow.

@steppi (Contributor) commented Nov 26, 2023

> That is the plan! I believe @steppi has his 👀 on all of Cephes, possibly quite soon.

I’ll do my best to have a PR ready this coming week.

@izaid (Contributor, Author) commented Nov 26, 2023

@steppi I have reverted the formatting, and will save that for a follow-up PR.

I have some thoughts on how to test sensibly, but I'd actually like to leave that for a follow-up PR as well. It will require some effort, and I'd like to get these C++ changes in so as not to conflict with your and others' work. Of course, there is already testing in this PR by virtue of the host code being the default SciPy implementation.

@steppi (Contributor) commented Nov 26, 2023

> @steppi I have reverted the formatting, and will save that for a follow-up PR.
>
> I have some thoughts on how to test sensibly, but I'd actually like to leave that for a follow-up PR as well. It will require some effort, and I'd like to get these C++ changes in so as not to conflict with your and others' work. Of course, there is already testing in this PR by virtue of the host code being the default SciPy implementation.

Ok. I can accept adding the CuPy tests later as long as this PR doesn’t break anything in SciPy.

@steppi (Contributor) commented Nov 26, 2023

I’m seeing relevant test failures due to redefinitions of constants.

@steppi (Contributor) commented Nov 26, 2023

Nice. Looks like that did it.

@izaid (Contributor, Author) commented Nov 26, 2023

Yeah, I think something unusual was going on in the build system for special, but I sort of guessed how to sidestep it.

@steppi (Contributor) left a review

I had a synchronous conversation with @izaid, who explained how all of this works, since I don't have experience with CUDA programming.

In summary, to run on CUDA we need to add __host__ and __device__ to the function signatures, to tell the compiler that the functions should be able to run on the CPU and GPU respectively. There are also constants and standard library functions that are missing on CUDA; if the preprocessor detects that these are missing, the code defines them itself. Thrust is used for missing standard library functions, and only those actually needed for lambertw are included. We'll add more as they are needed. It will be up to downstream users of this library to ensure the needed dependencies are available and linked.
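
For readers without CUDA experience, here is a minimal sketch of the annotation pattern described above; the macro name is an assumption for illustration, not necessarily the one used in this PR:

// Illustrative sketch: one function definition compiles for both host (CPU)
// and device (GPU). Under a CUDA compiler the macro expands to the CUDA
// qualifiers; on an ordinary host compiler it expands to nothing.
// The name SPECFUN_HOST_DEVICE is assumed here for illustration.
#ifdef __CUDACC__
#define SPECFUN_HOST_DEVICE __host__ __device__
#else
#define SPECFUN_HOST_DEVICE
#endif

namespace special {

SPECFUN_HOST_DEVICE inline double square(double x) { return x * x; }

} // namespace special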

This PR has no impact on the behavior of this code on the SciPy side, so I think it's fine for us not to add new tests. izaid has verified that this code will work with CuPy locally.

Amazing work izaid, really happy to see this!

@steppi merged commit 9a43b5c into scipy:main on Nov 30, 2023 (26 of 27 checks passed)
@steppi added this to the 1.12.0 milestone on Nov 30, 2023
@jrhemstad commented Nov 30, 2023

Howdy, I lead the team that maintains https://github.com/NVIDIA/cccl.

I wanted to follow up on deferring to Thrust for the missing standard library functionality.

I'd highly recommend using libcudacxx instead of Thrust, as libcudacxx is intended specifically for this kind of use case, where you want a heterogeneous C++ standard library implementation that works with NVRTC.

So instead of thrust::complex you use cuda::std::complex, and instead of std::numeric_limits, cuda::std::numeric_limits.

I'd also strongly recommend against injecting things into the std:: namespace, as that is undefined behavior and can lead to all sorts of nasty problems.
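
For example, here is a minimal sketch of the suggested cuda::std usage, assuming CCCL's libcudacxx headers are on the include path (the function itself is just an illustrative example):

// Illustrative sketch: cuda::std types from libcudacxx are usable in both
// host and device code, with nothing injected into the std:: namespace.
#include <cuda/std/complex>
#include <cuda/std/limits>

__host__ __device__ inline bool near_zero(cuda::std::complex<double> z) {
    return cuda::std::abs(z) < cuda::std::numeric_limits<double>::epsilon();
}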

@izaid (Contributor, Author) commented Nov 30, 2023

Hi @jrhemstad! Thanks for the helpful thoughts. Indeed, I'm keeping an eye on cccl, and we'll be looking to use cuda::std in general as soon as we can. Right now, the priority is making things work with CuPy, so we need to explore whether cuda::std is available there.

Of course, noted about std injection; I'm well aware of that. It's a choice we're making just for now, in this particular set of circumstances, and we may well change it as we go.

It would certainly be great to be in touch! I'm sure a lot of the work you're doing will be quite useful here.

@jakirkham (Contributor) commented:

> Right now, the priority is making things work with CuPy, so we need to explore whether cuda::std is available there.

cc @leofang (who may know)

@izaid (Contributor, Author) commented Nov 30, 2023

Also noting that I do intend to add tests to SciPy as soon as I can, essentially by generating CuPy ufuncs with ElementwiseKernel. I think this is non-trivial, as it means testing over multiple array APIs without a specific extension existing yet. I have some thoughts on how to do this reasonably.

@steppi (Contributor) commented Dec 1, 2023

Welcome @jrhemstad! I'm so happy you found your way here. We'd been meaning to reach out to someone at NVIDIA about the work we're beginning. I now recall @izaid told me about libcudacxx when we spoke on Monday. He told me he chose Thrust because he knew it was vendored by CuPy. He also explained the issues with injecting into std, and that it is just a temporary expedient. Our work is still tentative, and you should think of this PR as a proof of concept. We really appreciate the advice, and as izaid said, it would be great if you could be in touch while we carry out this project.

@jrhemstad commented:

Happy to help any way I can! Let me know if you want to set up a time to chat more in depth and I'd be happy to answer any questions or explain where we're going.
