Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot init, destroy, and then re-init process groups #55967

Open
rohan-varma opened this issue Apr 13, 2021 · 2 comments
Open

Cannot init, destroy, and then re-init process groups #55967

rohan-varma opened this issue Apr 13, 2021 · 2 comments
Assignees
Labels
oncall: distributed Add this issue/PR to distributed oncall triage queue triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@rohan-varma
Copy link
Member

rohan-varma commented Apr 13, 2021

馃悰 Bug

Calling dist.init_process_group followed by c10d.destroy_process_group and then re-initing pg does not appear to work, hanging in _store_based_barrier. See the following repro:

@requires_nccl()
    @skip_if_lt_x_gpu(2)
    def test_init_then_reinit_pg(self):
        store = c10d.FileStore(self.file_name, self.world_size)
        # process_group = c10d.ProcessGroupNCCL(store, self.rank, self.world_size)
        print("initializing pg")
        dist.init_process_group(backend="nccl", world_size=self.world_size, rank=self.rank, store=store)
        print("Done init pg")
        c10d.destroy_process_group()
        import time ; time.sleep(5)
        dist.init_process_group(backend="nccl", world_size=self.world_size, rank=self.rank, store=store)

The recently added python tracebacks indicate the issue is in _store_based_barrier: P402636997.

Specifically, it looks like destroy_process_group resets _group_count. This means that when we call init_process_group a 2nd time with the same store, it queries for a group count of 1, and finds it already in the store, changing the behavior of the worker counting logic in _store_based_barrier.

A workaround for now is using a new store each time. However, I thought that we already wrap stores with PrefixStore as needed to avoid these sort of collision issues.

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @gcramer23

@rohan-varma rohan-varma added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Apr 13, 2021
@rohan-varma rohan-varma changed the title Cannot init, destroy, and re-init process groups Cannot init, destroy, and then re-init process groups Apr 13, 2021
@rohan-varma
Copy link
Member Author

cc @pritamdamania87 for store based barrier issue

@rohan-varma rohan-varma added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 7, 2021
@pritamdamania87 pritamdamania87 self-assigned this May 10, 2021
@kit1980
Copy link
Member

kit1980 commented Dec 2, 2022

I've just saw a old TODO that depends on this. Was there any progress with the issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
oncall: distributed Add this issue/PR to distributed oncall triage queue triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

No branches or pull requests

3 participants