[WIP] Direct nvlink support #406

Closed · wants to merge 2 commits

Conversation

@madsbk (Member) commented Feb 5, 2020

This PR implements our own NVLink support that doesn't use UCX. It started as an experiment to debug a performance issue with NVLink and UCX, but I ended up implementing direct NVLink support in ucx-py :)

Hopefully, UCX and how we use it get to a point where this isn't necessary, but until then I think this is useful, both to get a baseline of what we can expect from UCX and as a tool for UCX performance debugging.
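The direct path appears to be built on CUDA IPC (hence the UCXPY_OWN_CUDA_IPC variable below). As background only, here is a minimal sketch of the general CUDA IPC handle-exchange technique using Numba's CUDA bindings; this illustrates the mechanism, it is not code from this PR (the PR presumably uses the CUDA runtime directly):

import pickle
import multiprocessing as mp
import numpy as np
from numba import cuda


def importer(handle_bytes):
    # Importing process: open the handle to map the exporter's allocation,
    # then copy it device-to-device. When the two processes use GPUs that
    # are NVLink-connected, that copy goes over NVLink.
    peer_handle = pickle.loads(handle_bytes)
    with peer_handle as peer_arr:
        local = cuda.device_array_like(peer_arr)
        local.copy_to_device(peer_arr)
        print("received", local.copy_to_host()[:4])


if __name__ == "__main__":
    # Exporting process: allocate a device buffer and export an IPC handle.
    # The handle is small and picklable, so it can travel over any control
    # channel (here, the process arguments; in ucx-py, the TCP connection).
    arr = cuda.to_device(np.arange(1 << 20, dtype=np.uint8))
    handle_bytes = pickle.dumps(arr.get_ipc_handle())
    p = mp.get_context("spawn").Process(target=importer, args=(handle_bytes,))
    p.start()
    p.join()

In a real two-GPU setup each process would first select its own device (e.g. with cuda.select_device); this sketch keeps everything on the default device for brevity.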

To use this new NVLink implementation, set UCXPY_OWN_CUDA_IPC=1 and make sure to use RMM memory allocations.
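For example, a minimal sketch of that setup from Python (the exact allocation calls an application uses may differ, and the assumption that the flag must be set before ucx-py is initialized is mine, not stated in the PR):

import os

# Opt in to the direct CUDA IPC path from this PR; set it before ucx-py
# is imported/initialized so the flag is picked up (assumption).
os.environ["UCXPY_OWN_CUDA_IPC"] = "1"

import rmm

# Make sure the buffers being transferred are RMM allocations,
# e.g. a 1 MB DeviceBuffer.
buf = rmm.DeviceBuffer(size=1_000_000)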

Note, this PR is not ready for review -- I still need to do a lot of code cleanup, e.g. right now most of it is implemented in public_api.py and should be moved into core.pyx.

Running the local-send-recv.py benchmark, I get a 3x speedup on large messages (1GB) and a 26x speedup on small messages (1MB).

Besides better performance, this implementation is not sensitive to other UCX options. For instance, having InfiniBand enabled does not degrade performance while using NVLink.

On DGX-1 with the new NVLink implementation (1GB)

$ UCXPY_OWN_CUDA_IPC=1 python local-send-recv.py -n "1GB" --server-dev 1 --client-dev 2 --object_type rmm 
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 1000.00 MB
object      | rmm
reuse alloc | False
==========================
Device(s)   | 1, 2
Average     | 37.40 GB/s
--------------------------
Iterations
--------------------------
000         | 12.94 GB/s
001         | 46.83 GB/s
002         | 47.59 GB/s
003         | 47.58 GB/s
004         | 47.51 GB/s

On DGX-1 with the old UCX implementation (1GB)

$ python local-send-recv.py -n "1GB" --server-dev 1 --client-dev 2 --object_type rmm 
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 1000.00 MB
object      | rmm
reuse alloc | False
==========================
Device(s)   | 1, 2
Average     | 14.88 GB/s
--------------------------
Iterations
--------------------------
000         | 10.01 GB/s
001         | 15.50 GB/s
002         | 15.35 GB/s
003         | 15.74 GB/s
004         | 15.78 GB/s

On DGX-1 with the new NVLink implementation (1MB)

$ UCXPY_OWN_CUDA_IPC=1 python local-send-recv.py -n "1MB" --server-dev 1 --client-dev 2 --object_type rmm 
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 1000.00 kB
object      | rmm
reuse alloc | False
==========================
Device(s)   | 1, 2
Average     | 1.81 GB/s
--------------------------
Iterations
--------------------------
000         |389.90 MB/s
001         |  3.11 GB/s
002         |  3.15 GB/s
003         |  3.01 GB/s
004         |  3.09 GB/s
005         |  2.37 GB/s

On DGX-1 with the old UCX implementation (1MB)

$ python local-send-recv.py -n "1MB" --server-dev 1 --client-dev 2 --object_type rmm 
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 1000.00 kB
object      | rmm
reuse alloc | False
==========================
Device(s)   | 1, 2
Average     | 132.03 MB/s
--------------------------
Iterations
--------------------------
000         |121.56 MB/s
001         |108.79 MB/s
002         |137.85 MB/s
003         |138.23 MB/s
004         |136.94 MB/s

@quasiben (Member) commented Feb 5, 2020

I double checked the DGX1 test and I'm seeing different performance for the 1GB test:

(rapidsai-latest) bzaitlen@dgx13:~/GitRepos/ucx-py/benchmarks$ UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=tcp,cuda_copy,sockcm,cuda_ipc python local-send-recv.py -n "1GB" --server-dev 1 --client-dev 2 --object_type rmm
[1580919224.224261] [dgx13:37751:0]          mpool.c:43   UCX  WARN  object 0x5561411aae40 was not returned to mpool ucp_am_bufs
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 1000.00 MB
object      | rmm
reuse alloc | False
==========================
Device(s)   | 1, 2
Average     | 47.03 GB/s
--------------------------
Iterations
--------------------------
000         | 42.58 GB/s
001         | 47.87 GB/s
002         | 47.87 GB/s
003         | 47.95 GB/s
004         | 47.94 GB/s
005         | 47.92 GB/s
006         | 48.02 GB/s
007         | 47.90 GB/s
008         | 46.49 GB/s
009         | 46.33 GB/s

@jakirkham (Member)

How does this compare to your workaround, @pentschev?

@madsbk (Member, Author) commented Feb 5, 2020

Quoting @quasiben: "I double checked the DGX1 test and I'm seeing different performance for the 1GB test."

Interesting. What do you get with InfiniBand enabled: UCX_TLS=tcp,cuda_copy,sockcm,cuda_ipc,rc?

@quasiben (Member) commented Feb 5, 2020

With rc, see below. Note that devices 1 and 2 are on different IB devices:

(rapidsai-latest) bzaitlen@dgx13:~/GitRepos/ucx-py/benchmarks$ UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=tcp,cuda_copy,sockcm,cuda_ipc,rc python local-send-recv.py -n "1GB" --server-dev 1 --client-dev 2 --object_type rmm
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 1000.00 MB
object      | rmm
reuse alloc | False
==========================
Device(s)   | 1, 2
Average     | 15.98 GB/s
--------------------------
Iterations
--------------------------
000         | 15.30 GB/s
001         | 16.31 GB/s
002         | 16.28 GB/s
003         | 16.15 GB/s
004         | 16.33 GB/s
005         | 16.07 GB/s
006         | 16.02 GB/s
007         | 16.04 GB/s
008         | 15.67 GB/s
009         | 15.70 GB/s

@pentschev (Member)

I get the same performance with UCX core when NVLink is enabled:

$ UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm python local-send-recv.py -n "1GB" --server-dev 1 --client-dev 2 --object_type rmm
[1580923095.059815] [dgx11:52043:0]          mpool.c:43   UCX  WARN  object 0x558516fa1200 was not returned to mpool ucp_am_bufs
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 1000.00 MB
object      | rmm
reuse alloc | False
==========================
Device(s)   | 1, 2
Average     | 45.93 GB/s
--------------------------
Iterations
--------------------------
000         | 41.93 GB/s
001         | 46.87 GB/s
002         | 47.06 GB/s
003         | 46.92 GB/s
004         | 46.72 GB/s
005         | 46.40 GB/s
006         | 46.69 GB/s
007         | 46.72 GB/s
008         | 45.24 GB/s
009         | 45.27 GB/s

@mrocklin (Collaborator) commented Feb 5, 2020

Well, that's exciting. If anyone ends up generating a performance report of a merge or set_index computation, I would love to see it.

@pentschev (Member)

Btw, none of the UCX results in the description reflect just UCX with NVLink (those use UCX_TLS=all), which is why we're seeing discrepancies: some (or perhaps all) of the transfers are using different transports.
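For reference, a sketch of pinning the transport list from Python so a benchmark exercises only the NVLink path rather than whatever UCX_TLS=all selects (transport names taken from the commands above; UCX reads these variables when its context is created, so they must be set before that happens):

import os

# Restrict UCX to the NVLink-capable transports used in the commands above,
# instead of UCX_TLS=all, so transfers cannot fall back to e.g. rc (InfiniBand).
os.environ["UCX_TLS"] = "tcp,sockcm,cuda_copy,cuda_ipc"
os.environ["UCX_SOCKADDR_TLS_PRIORITY"] = "sockcm"

import ucp  # the settings take effect once the UCX context is created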
