Cannot init, destroy, and then re-init process groups #55967

rohan-varma · 2021-04-13T22:37:53Z

🐛 Bug

Calling dist.init_process_group, then c10d.destroy_process_group, and then re-initializing the process group does not appear to work: the second init hangs in _store_based_barrier. See the following repro:

    # Test-case method; `dist` and `c10d` presumably both alias torch.distributed.
    @requires_nccl()
    @skip_if_lt_x_gpu(2)
    def test_init_then_reinit_pg(self):
        import time

        store = c10d.FileStore(self.file_name, self.world_size)
        # process_group = c10d.ProcessGroupNCCL(store, self.rank, self.world_size)
        print("initializing pg")
        dist.init_process_group(backend="nccl", world_size=self.world_size, rank=self.rank, store=store)
        print("Done init pg")
        c10d.destroy_process_group()
        time.sleep(5)
        # Re-init with the same store: this is the call that hangs.
        dist.init_process_group(backend="nccl", world_size=self.world_size, rank=self.rank, store=store)

The recently added Python tracebacks indicate that the issue is in _store_based_barrier: P402636997.

Specifically, it looks like destroy_process_group resets _group_count. This means that when we call init_process_group a second time with the same store, the barrier uses a group count of 1 again and finds that key already in the store, which changes the behavior of the worker-counting logic in _store_based_barrier.
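To make the suspected collision concrete, here is a minimal pure-Python sketch. It is a toy model, not the actual torch.distributed code: ToyStore, barrier_completes, and the key format are hypothetical, and it assumes the barrier waits until the counter stored under a key derived from the group count equals the world size.

```python
from collections import defaultdict


class ToyStore:
    """In-memory stand-in for a c10d store; only models an add() counter."""

    def __init__(self):
        self._counters = defaultdict(int)

    def add(self, key, amount):
        # Increment the counter under `key` and return the new value.
        self._counters[key] += amount
        return self._counters[key]


def barrier_completes(store, group_count, world_size):
    """Simulate every worker arriving at a store-based barrier.

    Returns True if the final counter equals world_size (the barrier would
    be released) and False otherwise (the workers would keep waiting).
    """
    key = f"barrier_key:{group_count}"  # hypothetical key format
    for _ in range(world_size):
        store.add(key, 1)  # each worker checks in once
    return store.add(key, 0) == world_size  # add(key, 0) reads the counter


store = ToyStore()
world_size = 2

# First init: the group count is 1 and the counter reaches world_size.
first_init = barrier_completes(store, group_count=1, world_size=world_size)

# After destroy, the group count is assumed to reset, so the second init
# reuses group_count=1 on the same store and the counter overshoots.
second_init = barrier_completes(store, group_count=1, world_size=world_size)

print(first_init, second_init)  # True False
```

Under these assumptions the second barrier sees a count of 2 * world_size, never matches world_size, and so never releases, which would be consistent with the hang in the repro.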

A workaround for now is to use a new store each time. However, I thought that we already wrapped stores with PrefixStore as needed to avoid this sort of collision.
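The namespacing idea can be sketched in the same toy setting. This is only an illustration of why a fresh store, or a distinct prefix per init, would avoid the collision. ToyStore, ToyPrefixStore, and the init_N prefix scheme are hypothetical stand-ins, not torch.distributed.PrefixStore or what init_process_group does internally.

```python
from collections import defaultdict


class ToyStore:
    """In-memory stand-in for a c10d store; only models an add() counter."""

    def __init__(self):
        self._counters = defaultdict(int)

    def add(self, key, amount):
        self._counters[key] += amount
        return self._counters[key]


class ToyPrefixStore:
    """Wraps a store and prepends a prefix to every key (hypothetical)."""

    def __init__(self, prefix, store):
        self._prefix = prefix
        self._store = store

    def add(self, key, amount):
        return self._store.add(f"{self._prefix}/{key}", amount)


def barrier_completes(store, world_size, key="barrier_key:1"):
    """Return True if the counter equals world_size after all workers check in."""
    for _ in range(world_size):
        store.add(key, 1)
    return store.add(key, 0) == world_size


shared = ToyStore()
results = []
for generation in range(3):
    # A distinct prefix per init keeps each barrier's counter separate,
    # even though the barrier key itself is identical every time.
    scoped = ToyPrefixStore(f"init_{generation}", shared)
    results.append(barrier_completes(scoped, world_size=2))

print(results)  # [True, True, True]
```

Creating a brand-new store for each init has the same effect in this model, since the counters start from zero each time.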

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @gcramer23


rohan-varma · 2021-05-07T18:46:53Z

cc @pritamdamania87 for store based barrier issue

kit1980 · 2022-12-02T00:01:18Z

I just saw an old TODO that depends on this. Has there been any progress on this issue?

rohan-varma added the "oncall: distributed" label Apr 13, 2021

rohan-varma changed the title ~~Cannot init, destroy, and re-init process groups~~ Cannot init, destroy, and then re-init process groups Apr 13, 2021

rohan-varma added the "triaged" label May 7, 2021

pritamdamania87 self-assigned this May 10, 2021
