[WIP] Direct nvlink support #406

Closed · wants to merge 2 commits

Conversation

@madsbk (Member) commented Feb 5, 2020

This PR implements our own NVLink support that doesn't use UCX. It started as an experiment to debug a performance issue with NVLink and UCX, but I ended up implementing direct NVLink support in ucx-py :)

Hopefully, UCX and how we use it get to a point where this isn't necessary, but until then I think this is useful, both to get a baseline of what we can expect from UCX and as a tool for UCX performance debugging.
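The direct path appears to be built on CUDA IPC (hence the UCXPY_OWN_CUDA_IPC variable below). As background only, here is a minimal sketch of the general CUDA IPC handle-exchange technique using Numba's CUDA bindings; this illustrates the mechanism, it is not code from this PR (the PR presumably uses the CUDA runtime directly):

import pickle
import multiprocessing as mp
import numpy as np
from numba import cuda


def importer(handle_bytes):
    # Importing process: open the handle to map the exporter's allocation,
    # then copy it device-to-device. When the two processes use GPUs that
    # are NVLink-connected, that copy goes over NVLink.
    peer_handle = pickle.loads(handle_bytes)
    with peer_handle as peer_arr:
        local = cuda.device_array_like(peer_arr)
        local.copy_to_device(peer_arr)
        print("received", local.copy_to_host()[:4])


if __name__ == "__main__":
    # Exporting process: allocate a device buffer and export an IPC handle.
    # The handle is small and picklable, so it can travel over any control
    # channel (here, the process arguments; in ucx-py, the TCP connection).
    arr = cuda.to_device(np.arange(1 << 20, dtype=np.uint8))
    handle_bytes = pickle.dumps(arr.get_ipc_handle())
    p = mp.get_context("spawn").Process(target=importer, args=(handle_bytes,))
    p.start()
    p.join()

In a real two-GPU setup each process would first select its own device (e.g. with cuda.select_device); this sketch keeps everything on the default device for brevity.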

To use this new NVLink implementation, set UCXPY_OWN_CUDA_IPC=1 and make sure to use RMM memory allocations.
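For example, a minimal sketch of that setup from Python (the exact allocation calls an application uses may differ, and the assumption that the flag must be set before ucx-py is initialized is mine, not stated in the PR):

import os

# Opt in to the direct CUDA IPC path from this PR; set it before ucx-py
# is imported/initialized so the flag is picked up (assumption).
os.environ["UCXPY_OWN_CUDA_IPC"] = "1"

import rmm

# Make sure the buffers being transferred are RMM allocations,
# e.g. a 1 MB DeviceBuffer.
buf = rmm.DeviceBuffer(size=1_000_000)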

Note, this PR is not ready for review -- I still need to do a lot of code cleanup, e.g. right now most of it is implemented in public_api.py and should be moved into core.pyx.

Running the local-send-recv.py benchmark, I get a 3x speedup on large messages (1GB) and a 26x speedup on small messages (1MB).

Besides better performance, this implementation is not sensitive to other UCX options. For instance, having InfiniBand enabled does not degrade performance while using NVLink.

On DGX-1 with the new NVLink implementation (1GB)

$ UCXPY_OWN_CUDA_IPC=1 python local-send-recv.py -n "1GB" --server-dev 1 --client-dev 2 --object_type rmm 
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 1000.00 MB
object      | rmm
reuse alloc | False
==========================
Device(s)   | 1, 2
Average     | 37.40 GB/s
--------------------------
Iterations
--------------------------
000         | 12.94 GB/s
001         | 46.83 GB/s
002         | 47.59 GB/s
003         | 47.58 GB/s
004         | 47.51 GB/s

On DGX-1 with the old UCX implementation (1GB)

$ python local-send-recv.py -n "1GB" --server-dev 1 --client-dev 2 --object_type rmm 
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 1000.00 MB
object      | rmm
reuse alloc | False
==========================
Device(s)   | 1, 2
Average     | 14.88 GB/s
--------------------------
Iterations
--------------------------
000         | 10.01 GB/s
001         | 15.50 GB/s
002         | 15.35 GB/s
003         | 15.74 GB/s
004         | 15.78 GB/s

On DGX-1 with the new NVLink implementation (1MB)

$ UCXPY_OWN_CUDA_IPC=1 python local-send-recv.py -n "1MB" --server-dev 1 --client-dev 2 --object_type rmm 
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 1000.00 kB
object      | rmm
reuse alloc | False
==========================
Device(s)   | 1, 2
Average     | 1.81 GB/s
--------------------------
Iterations
--------------------------
000         |389.90 MB/s
001         |  3.11 GB/s
002         |  3.15 GB/s
003         |  3.01 GB/s
004         |  3.09 GB/s
005         |  2.37 GB/s

On DGX-1 with the old UCX implementation (1MB)

$ python local-send-recv.py -n "1MB" --server-dev 1 --client-dev 2 --object_type rmm 
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 1000.00 kB
object      | rmm
reuse alloc | False
==========================
Device(s)   | 1, 2
Average     | 132.03 MB/s
--------------------------
Iterations
--------------------------
000         |121.56 MB/s
001         |108.79 MB/s
002         |137.85 MB/s
003         |138.23 MB/s
004         |136.94 MB/s

@quasiben (Member) commented Feb 5, 2020

I double checked the DGX1 test and I'm seeing different performance for the 1GB test:

(rapidsai-latest) bzaitlen@dgx13:~/GitRepos/ucx-py/benchmarks$ UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=tcp,cuda_copy,sockcm,cuda_ipc python local-send-recv.py -n "1GB" --server-dev 1 --client-dev 2 --object_type rmm
[1580919224.224261] [dgx13:37751:0]          mpool.c:43   UCX  WARN  object 0x5561411aae40 was not returned to mpool ucp_am_bufs
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 1000.00 MB
object      | rmm
reuse alloc | False
==========================
Device(s)   | 1, 2
Average     | 47.03 GB/s
--------------------------
Iterations
--------------------------
000         | 42.58 GB/s
001         | 47.87 GB/s
002         | 47.87 GB/s
003         | 47.95 GB/s
004         | 47.94 GB/s
005         | 47.92 GB/s
006         | 48.02 GB/s
007         | 47.90 GB/s
008         | 46.49 GB/s
009         | 46.33 GB/s

@jakirkham (Member)

How does this compare to your workaround, @pentschev?

@madsbk (Member, Author) commented Feb 5, 2020

Quoting @quasiben: "I double checked the DGX1 test and I'm seeing different performance for the 1GB test."

Interesting. What do you get with InfiniBand enabled: UCX_TLS=tcp,cuda_copy,sockcm,cuda_ipc,rc?

@quasiben (Member) commented Feb 5, 2020

With rc, see below. Note that devices 1 and 2 are on different IB devices:

(rapidsai-latest) bzaitlen@dgx13:~/GitRepos/ucx-py/benchmarks$ UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=tcp,cuda_copy,sockcm,cuda_ipc,rc python local-send-recv.py -n "1GB" --server-dev 1 --client-dev 2 --object_type rmm
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 1000.00 MB
object      | rmm
reuse alloc | False
==========================
Device(s)   | 1, 2
Average     | 15.98 GB/s
--------------------------
Iterations
--------------------------
000         | 15.30 GB/s
001         | 16.31 GB/s
002         | 16.28 GB/s
003         | 16.15 GB/s
004         | 16.33 GB/s
005         | 16.07 GB/s
006         | 16.02 GB/s
007         | 16.04 GB/s
008         | 15.67 GB/s
009         | 15.70 GB/s

@pentschev (Member)

I get the same performance with UCX core when NVLink is enabled:

$ UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm python local-send-recv.py -n "1GB" --server-dev 1 --client-dev 2 --object_type rmm
[1580923095.059815] [dgx11:52043:0]          mpool.c:43   UCX  WARN  object 0x558516fa1200 was not returned to mpool ucp_am_bufs
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 1000.00 MB
object      | rmm
reuse alloc | False
==========================
Device(s)   | 1, 2
Average     | 45.93 GB/s
--------------------------
Iterations
--------------------------
000         | 41.93 GB/s
001         | 46.87 GB/s
002         | 47.06 GB/s
003         | 46.92 GB/s
004         | 46.72 GB/s
005         | 46.40 GB/s
006         | 46.69 GB/s
007         | 46.72 GB/s
008         | 45.24 GB/s
009         | 45.27 GB/s

@mrocklin (Collaborator) commented Feb 5, 2020

Well, that's exciting. If anyone ends up generating a performance report of a merge or set_index computation, I would love to see it.

@pentschev (Member)

Btw, none of the UCX results in the description reflect just UCX with NVLink (those use UCX_TLS=all), which is why we're seeing discrepancies: some (or perhaps all) of the transfers are using different transports.
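For reference, a sketch of pinning the transport list from Python so a benchmark exercises only the NVLink path rather than whatever UCX_TLS=all selects (transport names taken from the commands above; UCX reads these variables when its context is created, so they must be set before that happens):

import os

# Restrict UCX to the NVLink-capable transports used in the commands above,
# instead of UCX_TLS=all, so transfers cannot fall back to e.g. rc (InfiniBand).
os.environ["UCX_TLS"] = "tcp,sockcm,cuda_copy,cuda_ipc"
os.environ["UCX_SOCKADDR_TLS_PRIORITY"] = "sockcm"

import ucp  # the settings take effect once the UCX context is created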
