Open
Description
I am currently using mscclpp and find that we have to explicitly name the device for each GPU, while NCCL can find the best device for each GPU automatically. Most of the time we have to set MSCCLPP_HCA_DEVICES
differently on different nodes to achieve the best performance. A feature that parses the topology and chooses the device automatically would help a lot.
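To make the request concrete, here is a minimal sketch of one possible heuristic: pick, for each GPU, the HCA whose PCI sysfs path shares the deepest common ancestor with the GPU's path. This is not mscclpp's or NCCL's actual algorithm. The function names (`shared_depth`, `pick_nearest_hca`, `read_hca_pci_paths`) are hypothetical, the example paths are made up, and the sketch assumes a Linux host where `/sys/class/infiniband/<name>/device` resolves to the PCI device and where the GPU's PCI path has been resolved separately (for example from its bus ID under `/sys/bus/pci/devices`).

```python
import os


def shared_depth(path_a, path_b):
    """Count the leading path components two sysfs PCI paths have in common."""
    depth = 0
    parts_a = path_a.strip("/").split("/")
    parts_b = path_b.strip("/").split("/")
    for a, b in zip(parts_a, parts_b):
        if a != b:
            break
        depth += 1
    return depth


def pick_nearest_hca(gpu_pci_path, hca_pci_paths):
    """Return the HCA name sharing the deepest PCI ancestor with the GPU.

    hca_pci_paths maps an HCA name to its resolved PCI sysfs path.
    Ties go to the alphabetically first name; returns None if the map is empty.
    """
    if not hca_pci_paths:
        return None
    return max(
        sorted(hca_pci_paths),
        key=lambda name: shared_depth(gpu_pci_path, hca_pci_paths[name]),
    )


def read_hca_pci_paths(root="/sys/class/infiniband"):
    """Map each InfiniBand device name to its resolved PCI sysfs path.

    Assumes the Linux sysfs layout; returns an empty dict if root is missing.
    """
    if not os.path.isdir(root):
        return {}
    return {
        name: os.path.realpath(os.path.join(root, name, "device"))
        for name in os.listdir(root)
    }


# Made-up example paths: the GPU and mlx5_0 sit behind the same PCIe switch.
gpu = "/sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0/0000:19:00.0"
hcas = {
    "mlx5_0": "/sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0/0000:1a:00.0",
    "mlx5_1": "/sys/devices/pci0000:85/0000:85:00.0/0000:86:00.0",
}
print(pick_nearest_hca(gpu, hcas))
```

On a real node, `read_hca_pci_paths()` would supply the map, and the chosen names could then be used to fill in MSCCLPP_HCA_DEVICES instead of setting it by hand. A full implementation would probably also need to consider NUMA affinity and link speed, which this sketch ignores.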
In addition, sometimes "libibverbs.so" cannot be found and only "libibverbs.so.1" is present. For now, I work around this with ln -sv /usr/lib/x86_64-linux-gnu/libibverbs.so.1 /usr/lib/x86_64-linux-gnu/libibverbs.so
but it would help a lot if the mscclpp code also searched for "libibverbs.so.1".
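The fallback I have in mind looks roughly like the sketch below: try each candidate library name in order and keep the first one that loads. It is written in Python with `ctypes` only to illustrate the logic and is not how mscclpp currently loads the library. The helper name `load_first_available` and its `loader` parameter are hypothetical.

```python
import ctypes


def load_first_available(candidates, loader=ctypes.CDLL):
    """Try each shared-library name in order.

    Returns (name, handle) for the first candidate that loads,
    or (None, None) if none of them can be loaded.
    """
    for name in candidates:
        try:
            return name, loader(name)
        except OSError:
            # This name is not available; fall through to the next one.
            continue
    return None, None


# Prefer the unversioned name, then fall back to the versioned soname.
name, handle = load_first_available(["libibverbs.so", "libibverbs.so.1"])
if handle is None:
    print("libibverbs could not be loaded under any candidate name")
else:
    print("loaded", name)
```

With a fallback like this, the manual symlink would no longer be needed on systems that only ship the versioned "libibverbs.so.1" file.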
Related issue: sgl-project/sglang#6834