chore: trying atomics and tree reduction for CUDA reducer kernels #3123
Conversation
@ManasviGoyal I'm working on implementing https://github.com/CoffeaTeam/coffea-benchmarks/blob/master/coffea-adl-benchmarks.ipynb as much as possible in cuda kernels to start benchmarking realistic throughput. We've already put together cupy-based histograms that conform to HEP expectations, so we can nominally do full analysis workflows on the GPU. @nsmith- will be working on a first try at uproot-on-GPU using DMA over PCI-express. I'll be working on a mock-up using parquet and cudf so we can understand the full workload's performance. The first thing we're missing is the ability to slice arrays, which I understand from talking to Jim is intertwined with the reducer implementation. I'm happy to help test things in realistic use cases when you have implementations ready. Keep us in the loop and we'll be responsive!
Sure. I'll keep you updated. I am still figuring out how to handle some cases for reducers. Are there any specific kernels that you need first for slicing? I can prioritize them. The best way to test would be to write the test with arrays in the cuda backend and see what error message it gives you. It would give you the name of the missing kernel that is needed for the function.
I only have access to virtualized GPUs (they are MIG-partitioned A100s at Fermilab), and for some reason, instead of giving me an error, it hangs forever! So that's a bit of a show-stopper on my side. As highest priority we would need boolean slicing, and as next highest priority, index-based slicing. After that we'll need argmin and argmax on the reducer side!
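For reference, the semantics these requested kernels have to reproduce can be sketched in plain Python (no GPU or awkward-array required). The helper names below are illustrative, not part of the awkward-array API:

```python
# Plain-Python sketch of ragged-array boolean slicing, index slicing, and
# argmin -- the operations requested above. Runs anywhere; the real CUDA
# kernels compute the same results on flattened offset/content buffers.

def boolean_slice(ragged, masks):
    """Keep elements of each sublist where the matching mask entry is True."""
    return [[x for x, keep in zip(sub, mask) if keep]
            for sub, mask in zip(ragged, masks)]

def index_slice(ragged, indices):
    """Pick elements of each sublist by per-sublist integer index."""
    return [[sub[i] for i in idx] for sub, idx in zip(ragged, indices)]

def argmin_per_sublist(ragged):
    """Index of the minimum in each sublist; None for empty sublists."""
    return [min(range(len(sub)), key=sub.__getitem__) if sub else None
            for sub in ragged]

array = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
mask = [[x > 3 for x in sub] for sub in array]   # like `array > 3`

print(boolean_slice(array, mask))             # [[3.3], [], [4.4, 5.5]]
print(index_slice(array, [[2, 0], [], [1]]))  # [[3.3, 1.1], [], [5.5]]
print(argmin_per_sublist(array))              # [0, None, 0]
```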
If you have a FNAL computing account, I can help you reproduce the failure mode I am seeing.
I don't have a FNAL computing account. But in the current state, it should give you a "kernel not implemented" error. If you get any other error, then it might be because of a different reason. Maybe you can open an issue explaining the steps to reproduce and the error, and I can check it on my GPU.
The major problem blocking a simple reproducer is that it involves setting up kubernetes and mounting a MIG-partitioned virtualized GPU into a container in order to get the faulty behavior. Some of these configuration options are not possible with a consumer GPU (particularly MIG partitioning), and I have no idea which component is causing the problem. Do you have access to a cluster with such a setup through other means?
I thought the error we were talking about was just slicing ragged arrays:

>>> import awkward as ak
>>> array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]], backend="cuda")
>>> array > 3
<Array [[False, False, True], [], [True, True]] type='3 * var * bool'>
>>> array[array > 3]

although this does give the expected "kernel not found" error:

I can ssh into
@lgray I have started working on slicing kernels along with reducers so that you can start testing. I tested the example @jpivarski gave in #3140 and it works. You can try this simple example and see if it works on your GPU. There is one more slicing kernel left now (I still need to test the ones I have added more extensively). I will add the rest of the kernels you mentioned as soon as possible.

>>> import awkward as ak
>>> array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]], backend="cuda")
>>> array > 3
<Array [[False, False, True], [], [True, True]] type='3 * var * bool'>
>>> array[array > 3]
<Array [[3.3], [], [4.4, 5.5]] type='3 * var * float64'>
awesome!
Ah - also - combinations / argcombinations are relatively high priority as well.
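The per-sublist behavior of combinations / argcombinations can be sketched in plain Python with `itertools.combinations` (pairs drawn without replacement within each sublist). This is an illustration of the expected output shape, not the awkward-array implementation:

```python
# Plain-Python sketch of ak.combinations / ak.argcombinations semantics:
# every n-element combination within each sublist, as values or as indices.
from itertools import combinations

def ragged_combinations(ragged, n=2):
    """All n-element combinations of values within each sublist."""
    return [list(combinations(sub, n)) for sub in ragged]

def ragged_argcombinations(ragged, n=2):
    """Same combinations, but returning index tuples instead of values."""
    return [list(combinations(range(len(sub)), n)) for sub in ragged]

array = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
print(ragged_combinations(array))
# [[(1.1, 2.2), (1.1, 3.3), (2.2, 3.3)], [], [(4.4, 5.5)]]
print(ragged_argcombinations(array))
# [[(0, 1), (0, 2), (1, 2)], [], [(0, 1)]]
```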
@jpivarski the error I was talking about should be that error, but instead the process hangs indefinitely with no error emitted.
@ManasviGoyal to make the prioritization a little more clear, you can use this set of analysis functionality benchmarks: These are what I am currently using to see what's possible on GPU. Since this PR isn't merged yet, and we found some other issues today, I'm currently only finished through Query 3. Query 4 requires this PR, since it contains a reduction that you've already implemented. You can more or less look for the various awkward operations in these functionality tests and prioritize what is needed by that ordering!
Thanks! This helps a lot in prioritizing the kernels. I'll finish up with all the reducers soon and start combinations. |
I was also trying to check just this PR against the sum memory usage I brought up over in #3136, but it seems it's not actually implemented here yet. Looking in the files, I can't really proceed with checking due to that.
@ManasviGoyal - what is the status of this PR? Are you working on it? Thanks!
This just includes some studies I did for reducers. It can either be merged into
I'll merge this into main. It only adds files to the studies directory, which would be easier to find in the future (as historical information) than this PR number, if it's closed without merging.
I'll wait until you're done with that. It has to be merged by me, using "merge without waiting for requirements to be met": the tests don't run if there is no change to the code, which is the case here because it only affects the studies directory. Meanwhile, I'll bring this branch up to the present, though.
It looks like you didn't have any updates on Monday (and if there are any, they can be a new PR), so I'll merge this into main.
🎉 🚀 |
Yes. There were no other commits to be pushed. Thanks!
Kernels tested for different block sizes
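The tree-reduction pattern named in this PR's title can be sketched on the CPU in plain Python: each block of `block_size` elements is summed by repeatedly halving the stride, mirroring a shared-memory reduction on the GPU, and the per-block partial sums are then combined (on a GPU, typically by a second pass or atomics). This is an illustrative sketch assuming power-of-two block sizes, not the PR's actual CUDA code:

```python
# CPU simulation of a blockwise tree (pairwise) reduction for sum.
# Assumes block_size is a power of two, as in typical CUDA reduction kernels.

def block_tree_reduce(data, block_size):
    """Sum `data` blockwise with a pairwise tree, then combine block sums."""
    block_sums = []
    for start in range(0, len(data), block_size):
        buf = data[start:start + block_size]
        buf = buf + [0.0] * (block_size - len(buf))  # pad the last block
        stride = block_size // 2
        while stride > 0:
            for i in range(stride):  # on a GPU, these run in parallel threads
                buf[i] += buf[i + stride]
            stride //= 2
        block_sums.append(buf[0])
    # a second pass (or atomic adds) combines the per-block partial sums
    return sum(block_sums)

data = [float(i) for i in range(10)]
assert block_tree_reduce(data, block_size=4) == sum(data)  # 45.0
```

Varying `block_size` here is the CPU analogue of the block-size sweep in the study: the result is identical, but on a GPU the block size changes occupancy and shared-memory use.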