issues Search Results · repo:microsoft/mscclpp language:C++
128 results
I found that NCCL has supported Blackwell GPUs with multi-node NVLink since NCCL 2.25. Is this feature supported in
mscclpp? This is a related blog: https://docs.nvidia.com/multi-node-nvlink-systems/multi-node-tuning-guide/nccl.html ...
zyksir
- 5
- Opened 3 days ago
- #548
I am currently using mscclpp, and I find that we have to explicitly name the devices for each GPU, while NCCL can find the
best device for each GPU automatically. Most of the time we have to set MSCCLPP_HCA_DEVICES ...
zyksir
- 2
- Opened 9 days ago
- #542
Hi, the current implementation of the proxy service keeps ownership of RegisteredMemory:
https://github.com/microsoft/mscclpp/blob/c184485808aeaaec5625bc97905c819db1514184/src/port_channel.cc#L41 Thus, any ...
liangyuRain
- 5
- Opened 10 days ago
- #540
https://github.com/microsoft/mscclpp/blob/2b9b18d562fc9eea4574927e3ceb7f40b0b20d63/python/mscclpp/gpu_utils_py.cpp#L27
The correct string representations of torch.float and torch.int seem to be torch.float32 ...
liangyuRain
- Opened 16 days ago
- #536
Hi, we observe a strange behavior with NvlsConnection. When we bind two memory buffers to the same NvlsConnection, the
DeviceMulticastPointer returned for the second buffer actually points to the first ...
liangyuRain
- 2
- Opened 17 days ago
- #535
Hi, the following code causes GPU OOM on Hopper with NVLS enabled. I am using the latest main branch.
from mscclpp import Transport, TcpBootstrap, Communicator
from mscclpp._mscclpp import Context, RawGpuBuffer ...
liangyuRain
- 1
- Opened 18 days ago
- #533
I am trying to implement a one-shot allreduce in sglang. You can see my code in this PR.
I want to use the algorithm used in allreduce_bench.py. To fit everything in sgl-kernel, I rewrote the API in C++. ...
zyksir
- 3
- Opened 20 days ago
- #531
Hi, we observe that the implementation of memChan.put may have a bug, e.g. memChan.put(0, 0, nElem * sizeof(int), threadIdx.x,
blockDim.x); if (threadIdx.x == 0) memChan.signal();
The put operation does not include ...
qishilu
- 2
- Opened on Apr 29
- #517
Hi, we observe that when many kernels are pushed to the launch queue, the cudaMemcpyAsync used in flushing the FIFO can get
blocked due to the full launch queue, which in turn blocks the whole proxy thread. This can ...
liangyuRain
- 8
- Opened on Apr 27
- #516
When issuing multiple sends to a PortChannel, the memcpy kernel launch overhead may lead to poor performance for small
message sizes. For example, using MemChannel is much faster than PortChannel for small ...
cubele
- 3
- Opened on Apr 16
- #504
