Issue search results · repo:microsoft/mscclpp language:C++

128 results

I found that NCCL has supported Blackwell GPUs with multi-node NVLink since NCCL 2.25; is this feature supported in mscclpp? Here is a related blog: https://docs.nvidia.com/multi-node-nvlink-systems/multi-node-tuning-guide/nccl.html ...
  • zyksir
  • 5 comments
  • Opened 3 days ago
  • #548

For now I am using mscclpp, and I find that we have to explicitly name a device for each GPU, while NCCL can find the best device for each GPU automatically. Most of the time we have to set MSCCLPP_HCA_DEVICES ... (see the sketch after this entry)
  • zyksir
  • 2 comments
  • Opened 9 days ago
  • #542

Hi, the current implementation of the proxy service keeps ownership of RegisteredMemory: https://github.com/microsoft/mscclpp/blob/c184485808aeaaec5625bc97905c819db1514184/src/port_channel.cc#L41 Thus, any ... (see the sketch after this entry)
  • liangyuRain
  • 5 comments
  • Opened 10 days ago
  • #540

https://github.com/microsoft/mscclpp/blob/2b9b18d562fc9eea4574927e3ceb7f40b0b20d63/python/mscclpp/gpu_utils_py.cpp#L27 The correct string representations of torch.float and torch.int seem to be torch.float32 ... (see the sketch after this entry)
  • liangyuRain
  • Opened 16 days ago
  • #536

Hi, we observe a strange behavior with NvlsConnection. When we bind two memory buffers to the same NvlsConnection, the DeviceMulticastPointer returned for the second buffer actually points to the first ...
  • liangyuRain
  • 2 comments
  • Opened 17 days ago
  • #535

Hi, the following code causes a GPU OOM on Hopper with NVLS enabled. I am using the latest main branch. from mscclpp import Transport, TcpBootstrap, Communicator from mscclpp._mscclpp import Context, RawGpuBuffer ...
  • liangyuRain
  • 1 comment
  • Opened 18 days ago
  • #533

I am trying to implement a one-shot allreduce in sglang; you can see my code in this PR. I want to use the algorithm from allreduce_bench.py. To fit everything in sgl-kernel, I rewrote the API in C++. ...
  • zyksir
  • 3 comments
  • Opened 20 days ago
  • #531

Hi, we observe that the implementation of memChan.put may have a bug, e.g. memChan.put(0, 0, nElem * sizeof(int), threadIdx.x, blockDim.x); if (threadIdx.x == 0) memChan.signal(); This put operation does not include ... (see the sketch after this entry)
  • qishilu
  • 2 comments
  • Opened on Apr 29
  • #517

Hi, we observe that when many kernels are pushed to the launch queue, the cudaMemcpyAsync used to flush the FIFO can get blocked by the full launch queue, which in turn blocks the whole proxy thread. This can ... (see the sketch after this entry)
  • liangyuRain
  • 8 comments
  • Opened on Apr 27
  • #516

When issuing multiple sends to a PortChannel, the memcpy kernel launch overhead may lead to poor performance for small message sizes. For example, using MemChannel is much faster than PortChannel for small ...
  • cubele
  • 3 comments
  • Opened on Apr 16
  • #504