[distributed] calling nccl reduce with inconsistent dst hangs #39706
Labels
module: deadlock
Problems related to deadlocks (hang without exiting)
module: nccl
Problems related to nccl support
oncall: distributed
Add this issue/PR to distributed oncall triage queue
triaged
This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Not sure if this is intended or avoidable, but if `dst` is inconsistent across ranks, `reduce` finishes, but future kernels seem to hang. E.g.,

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar
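The snippet that originally followed "E.g.," did not survive on this page. The following is only a rough sketch of the condition described above, not the reporter's code: it assumes a single node with at least two CUDA devices and a PyTorch build with NCCL. `choose_dst`, `worker`, and `launch` are hypothetical names introduced for illustration, and the address and port are placeholder values.

```python
import os


def choose_dst(rank: int, consistent: bool = True) -> int:
    """Return the destination rank that `rank` will pass to reduce.

    Hypothetical helper: with consistent=False every rank names itself,
    which gives the mismatched `dst` described in the issue.
    """
    return 0 if consistent else rank


def worker(rank: int, world_size: int, consistent: bool) -> None:
    # Imported lazily so choose_dst can be used without a CUDA/NCCL build.
    import torch
    import torch.distributed as dist

    # Placeholder rendezvous settings (assumption: single-node run).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    x = torch.ones(4, device=f"cuda:{rank}")
    dist.reduce(x, dst=choose_dst(rank, consistent))

    # Per the report, the call above returns even when dst differs across
    # ranks; a hang, if any, would show up once later GPU work is awaited.
    torch.cuda.synchronize()
    dist.destroy_process_group()


def launch(world_size: int = 2, consistent: bool = False) -> None:
    # Spawns one process per rank; spawn passes the rank as first argument.
    import torch.multiprocessing as mp

    mp.spawn(worker, args=(world_size, consistent), nprocs=world_size)
```

Calling `launch(2, consistent=False)` would exercise the mismatched case, and `launch(2, consistent=True)` the usual one where every rank passes `dst=0`. Whether the mismatched run actually hangs may depend on the PyTorch and NCCL versions in use.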