SUMMIT IB Configuration #616
For building ucx-py, I have to use:
We are now going through a set of benchmark tests to confirm proper IB setup. For example, we can test at a high level with a dask-cuda benchmark of merging dataframes. We can also test point-to-point comms with ucx-py. With the merge benchmark we are not seeing good performance; we are expecting ~11 GB/s for comms over IB.
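For context, a sketch of how that dask-cuda merge benchmark is typically invoked (the module path and flags are from memory and may vary between dask-cuda versions, so treat this as illustrative):

```bash
# Illustrative invocation of the dask-cuda cuDF merge benchmark over UCX
# (module path and flags assumed; adjust the device list to the node)
python -m dask_cuda.benchmarks.local_cudf_merge -d 0,1,2,3 -p ucx -c 50000000
```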
For point-to-point IB benchmark comms on Summit we observe ~11 GB/s (though there is a suspicion something is incorrect about this test).
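As a cross-check below the Python layer, raw IB bandwidth for CUDA memory can also be measured with UCX's own `ucx_perftest`; a sketch, where the HCA name and message size are assumptions rather than the exact commands used here:

```bash
# Server side, on one node (requires a CUDA-enabled UCX build)
UCX_NET_DEVICES=mlx5_0:1 ucx_perftest -t tag_bw -m cuda -s 100000000 -n 100

# Client side, on the other node, pointing at the server's hostname
UCX_NET_DEVICES=mlx5_0:1 ucx_perftest -t tag_bw -m cuda -s 100000000 -n 100 server-hostname
```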
I should also mention that @MattBBaker joined us for the IB testing session. Thanks @MattBBaker!
I've also been running UCX-Py benchmarks on a DGX-1 and I see around 10GB/s for 100MB transfers:
However, I see only 300MB/s using
Pinging @shamisp
Can you try regular `rc` and not `rc_x`?
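For example (a minimal sketch; the CUDA transports listed are an assumption about what else is needed alongside RC):

```bash
# Select the plain RC transport instead of the accelerated rc_x verbs transport
export UCX_TLS=rc,cuda_copy,cuda_ipc
```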
Originally we were trying with:
@pentschev what is the difference in #616 (comment) between the 10 GB/s and 300 MB/s command lines?
Sorry @yosefe, I forgot to write that this is on Summit; I updated the comment now. The first report is on a DGX-1, the remaining two are both on Summit.
What's really bugging me now are the errors below:
Those errors don't happen on a DGX-1, where we see 10 GB/s bandwidth. They also don't happen on Summit for host memory, where I can likewise measure 10 GB/s. They only happen for CUDA memory, and then we see very low bandwidth. If you look closely at the timestamps, the errors occur 500 ms apart, which probably means some sort of timeout is happening and that's what makes it so slow. Any ideas what could be the reason for those errors?
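For anyone trying to reproduce this, the registration errors come from the UCX layer, and UCX's own logging can be turned up to capture them in more detail (a generic setting, not necessarily what was used for the logs above):

```bash
# Generic way to surface UCX memory-registration errors with more context
export UCX_LOG_LEVEL=debug   # or "trace" for even more verbosity
```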
Hi all,

The next plot is what I got by running the local-send-recv.py benchmark on a Summit node, using ucx-py 0.16. Note that performance approaches 50 GB/s as message size increases, which I would say is expected behavior. These results follow the same trend reported in Pritchard et al., "Getting It Right with Open MPI: Best Practices for Deployment and Tuning of Open MPI" (slide 46), Exascale Computing Project Annual Meeting 2020, 2020-02-03 (Houston, Texas, United States).

Feel free to review the compilation flags I used for UCX in
and the LSF script with UCX env. variables I used to execute the local-send-recv.py benchmark in
I didn't observe any intra-node issues with ucx-py either, @benjha. However, we're mainly worried about internode performance over IB.
For those without access to @benjha's files, the test was executed with the following config:
UCX was built with:
And the test was between GPUs 0,1:
As @jglaser notes, I think we either want internode testing or testing just IB between GPUs:
The important part in the flags above is that you have
According to the message, I think our Summit build is really missing some configuration that gets picked up automatically when building on a DGX-1, and we probably need to specify something else at build time.
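For comparison, a CUDA-capable UCX build is usually configured with something along these lines (a sketch only; the paths are placeholders and this is not the actual Summit or DGX-1 recipe):

```bash
# Hypothetical configure line for a CUDA + gdrcopy enabled UCX build
./contrib/configure-release --prefix=$HOME/ucx \
    --with-cuda=$CUDA_HOME --with-gdrcopy=$GDRCOPY_HOME --enable-mt
make -j install
```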
@hoopoepg and I spent a couple of hours looking into various configurations on Summit, trying to replicate the environment that we see works on a DGX-1, without success. We tried enabling/disabling knem and gdrcopy, as well as checking that the required CUDA capabilities are correctly built into UCX's Summit builds, but both cases still give us the same registration errors as before, with low performance. We also verified that the libraries we're using on Summit (RMM, CuPy) link dynamically to CUDA, as @hoopoepg pointed out that static linking may cause issues when UCX tries to capture

Right now we don't have a solution for the errors above and the low performance. Also pinging @Akshay-Venkatesh and @spotluri in case they have ideas.
cc @bureddy in case you have any ideas as well.
I just tried running with
And looking at UCX debug info I see the same errors that I see on Summit:
It looks like this is some configuration issue on Summit's compute nodes. I remember we had this kind of problem on our DGX-1s in the past, and it was resolved at the system level by our devops team with some software updates and configuration changes. @jglaser @benjha is this something you can check with the Summit admins? @Akshay-Venkatesh I remember you helped our devops team figure out those configurations, have you ever tested GPUDirect RDMA on Summit?
I've tested GPUDirect RDMA in the past and I just ran again to double check. Seems like performance is as expected:
The OpenMPI build used to get these results doesn't use the wakeup feature, so that may change things, but I'm not sure if UCX-Py uses wakeup or not.
UCX-Py uses the wakeup feature by default, but I tried disabling it and running in non-blocking mode to see if that would change anything, and I still see the same errors. The registration errors that we see come from the UCX layer, not from UCX-Py, though. It may still be the case that we're misconfiguring something, but I don't see any hints as to what's causing it, except for what I wrote in #616 (comment). If anyone knows how we could identify what's causing this, or has suggestions for something we should be doing differently on Summit that we don't need to do on a DGX-1, that's very welcome.
To reiterate what @pentschev said, we have successfully tested UCX-Py with very large workloads on many systems. When we have seen errors in the past, they have generally pointed to system configuration issues, but we don't know how to identify them easily. For example, in the past we found some machines without
We just discussed this offline with @yosefe and he pointed out that we need to
@jglaser @benjha can you try that as well and see how it performs?
Can anyone list the pieces needed, so we can verify with HPC ops whether they are set up?
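A rough starting point for such a checklist might be the following commands (module and tool names are assumptions; the exact set required on Summit may differ):

```bash
# GPUDirect RDMA / gdrcopy kernel modules loaded?
lsmod | grep nv_peer_mem
lsmod | grep gdrdrv
# IB ports present and active?
ibstat
# UCX version, build flags, and whether it sees CUDA-capable transports
ucx_info -v
ucx_info -d | grep -i cuda
```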
It seems that
Result of
adding
Btw, all my runs got this error:
@benjha could you give us more information on how you're setting things up when you see the OOM errors? I think the OOM is not directly related to the issue we're discussing here, so to avoid ending up with an endless thread, I would suggest starting a new issue in this repo to discuss it.
I can confirm @benjha's errors. With
for TPCx-BB queries that ran fine previously (but slowly). Without that argument, I see
On a positive note, without the

Environment variables for the workers:
Command line for the workers:
I haven't tested the 32 GB GPUs yet.
Can you try using a pool size that's very close to the total amount of GPU memory? Those are 16 GB GPUs, so I'd recommend 15 GB, or 14 GB if 15 GB is still too much. The
No luck yet with either of these pool sizes. Will try on the 32 GB GPUs as soon as I get access.
When doing the RAPIDS performance evaluation, we found that in some cases fat workers (e.g. 1 worker with 6 GPUs, 1 worker per node) worked better than thin workers (1 worker per GPU, 6 workers per node); in particular, SVD with CuPy performed better with fat workers, while cuDF worked better with thin workers. It might be something worth exploring with BSQL @jglaser
How do you address the other GPUs then? CuPy, for example, is always going to address GPU 0, which is fine if you have multiple workers each addressing a different GPU, so that each worker is always working on GPU 0 relative to the
jsrun allows the isolation of resources as you describe. On the other hand, I thought Dask distributed the load across GPUs of the same worker; is this the way it works with CuPy? Anyway, for some reason I ended up using 1 GPU per worker...
That's correct, but when you isolate resources via jsrun, you'll effectively be creating a worker per resource, a resource in this case being a GPU.
Mainline Dask will do no addressing of GPUs at all, so libraries such as CuPy and cuDF will run by default on GPU 0, meaning all other GPUs are idle. On the other hand, Dask-CUDA was specifically written to support a one-process(worker)-per-GPU model, in which we set
As I mentioned above, this is the only case supported by Dask-CUDA today, so it seems natural that you'd end up using it. However, if you are certain you used a single Dask worker with multiple GPUs, I'd be interested in knowing how it was done; it's not technically impossible, but likely very challenging.
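To make the one-worker-per-GPU model concrete, a minimal sketch of such a launch (the scheduler address is a placeholder):

```bash
# dask-cuda-worker starts one worker process per visible GPU;
# each process then sees its own GPU as device 0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 dask-cuda-worker ucx://scheduler-host:8786 \
    --enable-infiniband --enable-nvlink
```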
Here's a datapoint with 4MB message size and UCX master (
I have yet to run the benchmark again... hopefully I won't see the OOM errors on the 32 GB GPUs.
I'm happy to see we're doing better. I remember it was very challenging for folks to get memory utilization right for TPCx-BB, and indeed adding UCX to the workflow changes the requirements a bit, but we shouldn't double the memory utilization or anything of that sort. Keep in mind that we can't use managed memory with CUDA IPC, so we lose that ability and increase the perceived memory utilization. It's also important to use
On a side note, what exactly is the limitation of managed memory with regard to IPC/NVLink?
It's a CUDA IPC limitation in itself, see #409 for some discussion.
I think we can close this now. @jglaser are you ok with that?
On Summit, the nodes have the following configuration:
Each node has 6 GPUs and 4 MLNX devices. I'm not sure what the optimal pairing of GPU and MLNX device should be. Normally, I would rely on `hwloc` to figure this out; however, on Summit I get errors like the following (when using `--net-devices='auto'` with `dask-cuda`):

Still, I can set up the worker manually with something like:

GPU 0

GPU 1

And so on. What should `--net-devices` and `--interface` be set to for each of the six GPUs?

cc @MattBBaker in case he has thoughts
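For illustration, a manual per-GPU launch along the lines described above might look roughly like this (the GPU-to-HCA pairing, interface name, and scheduler address are guesses, not verified Summit topology):

```bash
# Hypothetical manual launch, one worker per GPU, with a guessed GPU-to-HCA pairing
CUDA_VISIBLE_DEVICES=0 dask-cuda-worker ucx://scheduler:8786 \
    --enable-infiniband --net-devices mlx5_0:1 --interface ib0 &
CUDA_VISIBLE_DEVICES=1 dask-cuda-worker ucx://scheduler:8786 \
    --enable-infiniband --net-devices mlx5_0:1 --interface ib0 &
```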