-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Direct nvlink support #406
Conversation
I double checked the DGX1 test and I'm seeing different performance for the
|
How does this compare to your workaround @pentschev? |
Interesting, what do you get with infiniband enabled: |
with
|
I get the same performance with UCX core when NVLink is enabled:
|
Well that's exciting. If anyone ends up generating a performance report of a merge or set_index computation I would love to see it. |
Btw, none of the UCX results on the description reflect just UCX with NVLink (those use |
This PR implements our own nvlink support that doesn't use UCX. It started as an experiment to debug performance issue with nvlink and UCX but I ended up implementing direct nvlink support in ucxpy :)
Hopefully, UCX and how we use it gets to a point where this isn't necessary but until then I think this is useful; both to get a baseline of what we can expect of UCX and also for UCX performance debugging.
In order to use this new nvlink implementation, set
UCXPY_OWN_CUDA_IPC=1
and make sure to use RMM memory allocations.Notice, this PR is not ready for review -- I still need to do a lot of code cleaning e.g right now must of it is implemented in
public_api.py
and should be moved intocore.pyx
Running the
local-send-recv.py
benchmark, I get 3 times speedup on large messages (1GB) and - 26 times speedup on small messages (1MB).Beside better performance, this implementation is not sensitive to other UCX options. For instance, it doesn't degrade performance to have infiniband enabled while using nvlink.
On DGX-1 with new nvlink implementation (1GB)
On DGX-1 with old UCX implementation (1GB)
On DGX-1 with new nvlink implementation (1MB)
On DGX-1 with old UCX implementation (1MB)