Cannot init, destroy, and then re-init process groups #55967
Labels: oncall: distributed, triaged
🐛 Bug
Calling `dist.init_process_group` followed by `c10d.destroy_process_group` and then re-initializing the process group does not appear to work: it hangs in `_store_based_barrier`. See the following repro:

The recently added Python tracebacks indicate the issue is in `_store_based_barrier`: P402636997.

Specifically, it looks like `destroy_process_group` resets `_group_count`. This means that when we call `init_process_group` a second time with the same store, it queries for a group count of 1 and finds it already in the store, which changes the behavior of the worker counting logic in `_store_based_barrier`.

A workaround for now is to use a new store each time. However, I thought that we already wrap stores with `PrefixStore` as needed to avoid these sorts of collision issues.

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @gcramer23
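
To make the suspected key collision concrete, here is a toy model in plain Python. It is not PyTorch code: `ToyStore`, `ToyPrefixStore`, `arrive_all`, and the `toy_barrier:` key format are hypothetical names made up for illustration, and the real barrier may differ in its details. The sketch assumes only what is described above: each worker increments a counter keyed by the group count, and the barrier expects that counter to reach the world size.

```python
class ToyStore:
    """Hypothetical in-memory stand-in for a shared key-value store."""

    def __init__(self):
        self._counters = {}

    def add(self, key, amount):
        # Increment the counter under `key` and return its new value.
        self._counters[key] = self._counters.get(key, 0) + amount
        return self._counters[key]


class ToyPrefixStore:
    """Hypothetical wrapper that namespaces every key with a prefix."""

    def __init__(self, prefix, store):
        self._prefix = prefix
        self._store = store

    def add(self, key, amount):
        return self._store.add(f"{self._prefix}/{key}", amount)


def arrive_all(store, group_count, world_size):
    """Simulate every worker arriving at the barrier once.

    Returns the counter value seen by the last worker. A barrier that
    waits for this value to equal world_size only passes if no earlier
    arrivals were recorded under the same key.
    """
    key = f"toy_barrier:{group_count}"  # assumed key format, illustration only
    count = 0
    for _ in range(world_size):
        count = store.add(key, 1)
    return count


WORLD_SIZE = 2
shared = ToyStore()

# First init: the counter reaches WORLD_SIZE, so the barrier would pass.
first_init = arrive_all(shared, 1, WORLD_SIZE)

# After destroy, the group count is back to 1, so a second init on the
# same store reuses the key and the counter overshoots WORLD_SIZE.
second_init_same_store = arrive_all(shared, 1, WORLD_SIZE)

# Workaround from the issue: a new store each time starts from zero.
second_init_new_store = arrive_all(ToyStore(), 1, WORLD_SIZE)

# A distinct prefix per init also avoids the collision on a shared store.
second_init_prefixed = arrive_all(ToyPrefixStore("init_2", shared), 1, WORLD_SIZE)

print(first_init, second_init_same_store, second_init_new_store, second_init_prefixed)
```

In this model the second init on the shared store never observes a count equal to the world size, which would be consistent with the hang. Either a fresh store or a per-init prefix restores the expected count.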