
Sync before and after deleting #2268

Merged
merged 8 commits into speechbrain:develop on Nov 30, 2023

Conversation

pplantinga
Collaborator

To prevent the error in #2250, this PR adds a barrier before and after checkpoint deletion so that no process can write at the same time as the deletion.
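
For readers skimming the diff, here is a minimal sketch of the intent, assuming the speechbrain.utils.distributed helpers discussed in this thread (delete_checkpoint_synced is an illustrative name, not the actual checkpointer method):

import shutil

from speechbrain.utils.distributed import ddp_barrier, if_main_process

def delete_checkpoint_synced(ckpt_dir):
    """Illustrative only: delete a checkpoint directory with all ranks synchronized."""
    # Sync before deleting so no other process is still saving
    # (see https://github.com/speechbrain/speechbrain/issues/2250).
    ddp_barrier()
    if if_main_process():
        shutil.rmtree(ckpt_dir, ignore_errors=True)
    # Sync after deleting so no rank proceeds while files are disappearing.
    ddp_barrier()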

@pplantinga pplantinga self-assigned this Nov 26, 2023
@pplantinga pplantinga mentioned this pull request Nov 26, 2023
@TParcollet
Collaborator

TParcollet commented Nov 26, 2023

@pplantinga I checked and it's not solving my issue (the PR made by Adel is). As you can see, it hangs at the end of an epoch, once validation is done and it wants to go into the next one:
[screenshot of the hang]

@pplantinga
Collaborator Author

Looks like the issue here is an if_main_process() in the recipe which doesn't play nice with torch.distributed.barrier(). Although this could technically be said to be an issue with the recipe, not the core code, we should improve the situation so that these two functions work together better. I briefly looked into converting if_main_process() into a context manager that sets an environment variable or similar to check when running torch.distributed.barrier(), but unfortunately this doesn't allow skipping the execution of the body of the with statement without some very hacky workarounds. If anybody has a bright idea about a solution to this, I'm all figurative ears (even if I'm literally deaf).
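
For context, a hedged sketch (raw torch.distributed rather than the SpeechBrain wrappers) of the interaction being described: if only the main process enters a branch and a barrier sits inside it, the other ranks never reach the barrier and rank 0 waits forever.

import torch.distributed as dist

def branch_with_barrier():
    """Hypothetical illustration, not library code."""
    if dist.get_rank() == 0:   # roughly what if_main_process() checks
        dist.barrier()         # only rank 0 ever reaches this -> deadlock
    # ranks 1..N-1 skip the branch and never call dist.barrier()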

@pplantinga
Collaborator Author

Okay, this is ready for review again @TParcollet @Adel-Moumen. Basically, I propose replacing instances of if_main_process with run_on_main, which will signal that it is executing in single-threaded mode and skip the barriers. You can see an example in the updated TIMIT recipe.
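
As a rough sketch of the kind of recipe change being proposed (the exact TIMIT diff may differ; prepare_timit and the folder paths are placeholders):

from speechbrain.utils.distributed import run_on_main

def prepare_timit(data_folder, save_folder):
    """Stand-in for the recipe's data-preparation step (illustrative only)."""
    ...

data_folder, save_folder = "data/TIMIT", "results/prepared"

# Before (problematic if anything inside reaches a ddp_barrier):
# if if_main_process():
#     prepare_timit(data_folder, save_folder)

# After: run_on_main executes the function on the main process only, marks the
# section as main-process-only so barriers inside it are skipped, and syncs
# all ranks when it returns.
run_on_main(
    prepare_timit,
    kwargs={"data_folder": data_folder, "save_folder": save_folder},
)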

# Sync before deleting to avoid another process saving at the same time.
# This has led to errors as documented here:
# https://github.com/speechbrain/speechbrain/issues/2250
ddp_barrier()
Collaborator

@pplantinga what will happen if torch_recovery is called outside of a run_on_main? These barriers would be hit and MAIN_PROC_ENV wouldn't be 1?

Collaborator Author

Outside of run_on_main the program should be operating with multiple processes, so all should hit the barrier together. The only scenario where it would still freeze is if you are inside an if if_main_process(): block, which we should discourage use of.

Collaborator

Sounds good.

Collaborator

@Gastron Gastron Nov 29, 2023

I think a solution could be developed to catch those bugs where you branch based on the main_process, but inside that branch you call some code which should hit a DDP barrier. So this will not automatically solve problems, but should help catch bugs. This would replace the if_main_process() (almost drop-in, just adds indentation).

import os
import torch

BARRIER_PROTECTOR = "SPEECHBRAIN_DDP_BARRIER_PROTECTOR"
os.environ[BARRIER_PROTECTOR] = "0"  # environment variables must be strings

class DDPProtector(object):
    """Protects from running into a DDP barrier in a code block that has already branched."""

    def __enter__(self):
        # Increment so that we can support nested protectors
        os.environ[BARRIER_PROTECTOR] = str(int(os.environ[BARRIER_PROTECTOR]) + 1)
        return self

    def on_main_process(self):
        # Roughly what if_main_process() checks: rank 0, or not distributed at all
        if torch.distributed.is_initialized():
            return torch.distributed.get_rank() == 0
        return True

    def __exit__(self, exception_type, exception_value, traceback):
        # Decrement even if an exception was raised; let exceptions propagate
        os.environ[BARRIER_PROTECTOR] = str(int(os.environ[BARRIER_PROTECTOR]) - 1)
        return False

def ddp_barrier():
    """In DDP mode, this function will synchronize all processes.
    torch.distributed.barrier() will block processes until the whole
    group enters this function.
    """
    if int(os.environ[BARRIER_PROTECTOR]) > 0:
        raise RuntimeError(
            "DDP barrier inside a main-process-only branch; this will create a deadlock or a subtle bug."
        )
    # Check if we're in a single-threaded section, skip barrier
    # (MAIN_PROC_ENV is the module-level constant "MAIN_PROC_ONLY" discussed below)
    elif os.environ.get(MAIN_PROC_ENV, "0") == "1":
        return
    elif torch.distributed.is_initialized():
        torch.distributed.barrier()

This would simply be used to mark that you don't intend to run into DDP barriers in this part of the code:

with DDPProtector() as protector:
    if protector.on_main_process():
        ...

So when if_main_process() is replaced by this, we should catch some bugs more easily.

@TParcollet
Collaborator

It fixes my bug now. We still need Ryan to verify whether it solves issue #2250 as well.

@mravanelli mravanelli marked this pull request as ready for review November 29, 2023 15:58
Collaborator

@Gastron Gastron left a comment

I think at least the post_func logic should be checked. I also left a suggestion, which could help catch some bugs.

else:
    # But main comes here
    main_process_only(post_func)(*post_args, **post_kwargs)
Collaborator

I think the logic is now inverted: post_func is meant to run on everything except main (e.g. to load a tokenizer that was just created). With run_post_on_main, post_func is also run on main.
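
A hedged sketch of the semantics being described (simplified; the real run_on_main has more arguments and checks):

from speechbrain.utils.distributed import ddp_barrier, if_main_process

def run_on_main_sketch(func, post_func=None, run_post_on_main=False):
    """Simplified illustration of the intended func/post_func split."""
    if if_main_process():
        func()                # e.g. train a tokenizer on the main process
    ddp_barrier()             # everyone waits for main to finish
    if post_func is not None and (run_post_on_main or not if_main_process()):
        post_func()           # e.g. the other ranks load the tokenizer just created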

Collaborator Author

Aha, you are totally right about this... I'll go ahead and fix this.

Collaborator Author

Should be fixed in the latest commit.

"""
import datetime
import os
import torch
from functools import wraps

MAIN_PROC_ENV = "MAIN_PROC_ONLY"
Collaborator

Perhaps this should have a SPEECHBRAIN_ prefix just in case.

Collaborator Author

Changed to a module-level variable, which makes this unnecessary.


@TParcollet
Collaborator

@pplantinga what do you think of @Gastron's comments?

@TParcollet
Collaborator

My take is that the point of this PR is to fix bugs. We should merge it with the fixes so that unstable can be merged into develop, and then PR @Gastron's idea into dev?

@@ -103,8 +94,13 @@ def main_process_only(function):
    @wraps(function)
    def main_proc_wrapped_func(*args, **kwargs):
        """This decorated function runs only if this is the main process."""
        os.environ[MAIN_PROC_ENV] = "1"
Collaborator

@Gastron Gastron Nov 29, 2023

Additionally, I wonder whether environment variables (like MAIN_PROC_ENV here) are the right way to do this sort of process-wide communication. I think something like a variable in a module (Python modules are singletons) should be enough here. So instead of this, I think we could just have:

MAIN_PROC_FLAG = 0

def main_proc_wrapped_func(*args, **kwargs):
    # Must be declared global because it is assigned here
    global MAIN_PROC_FLAG
    MAIN_PROC_FLAG = 1
    ...  # run the wrapped function on the main process only
    MAIN_PROC_FLAG = 0


def ddp_barrier():
    # Note: as long as this doesn't locally assign MAIN_PROC_FLAG,
    # it doesn't need to be marked as global, since it is only read.
    if MAIN_PROC_FLAG == 1:
        ...  # skip the barrier in main-process-only sections

Collaborator Author

Ah yes, a module-level flag is better here.

@pplantinga
Collaborator Author

You have some nice suggestions here @Gastron. You're right that the current setup would fail if there's a run_on_main inside another run_on_main which perhaps an increment/decrement could handle.

I was trying to avoid having double-indentation for this scenario, and the run_on_main already exists as a construct so I was hoping we could just extend its use a little. I'm not too crazy about the extra context manager, but could be convinced if we aren't able to accomplish the same thing using run_on_main.
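
One way the increment/decrement idea could look, as a sketch only (a module-level counter instead of a boolean; not the merged implementation):

from functools import wraps

import torch
from speechbrain.utils.distributed import if_main_process

MAIN_PROC_DEPTH = 0  # how deeply nested we are in main-process-only sections

def main_process_only(function):
    @wraps(function)
    def wrapped(*args, **kwargs):
        global MAIN_PROC_DEPTH
        MAIN_PROC_DEPTH += 1
        try:
            if if_main_process():
                return function(*args, **kwargs)
        finally:
            MAIN_PROC_DEPTH -= 1
    return wrapped

def ddp_barrier():
    # Skip the barrier anywhere inside a main-process-only section,
    # however deeply nested the run_on_main calls are.
    if MAIN_PROC_DEPTH > 0:
        return
    if torch.distributed.is_initialized():
        torch.distributed.barrier()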

@Gastron
Collaborator

Gastron commented Nov 30, 2023

> You have some nice suggestions here @Gastron. You're right that the current setup would fail if there's a run_on_main inside another run_on_main which perhaps an increment/decrement could handle.
>
> I was trying to avoid having double-indentation for this scenario, and the run_on_main already exists as a construct so I was hoping we could just extend its use a little. I'm not too crazy about the extra context manager, but could be convinced if we aren't able to accomplish the same thing using run_on_main.

Perhaps we could encourage all user code to use run_on_main; then any branching flags, counters, or such things can be implemented behind the scenes in library code.

In the short term I think forcing run_on_main everywhere (getting rid of if_main_process) would mean a bigger refactor, since the local code (in the if if_main_process: block) would need to be moved into a new function or other callable.

@TParcollet
Collaborator

@Gastron and @pplantinga, just to clarify one thing: if we keep using if_main_process (and we should, I agree with Aku), it will still work with this PR, right? I don't see any reason why we must replace them all with run_on_main. So basically, Mirco wants to release unstable into dev this week, so we need to settle this PR. We either revert with Adel's PR or we move forward with this one. @Gastron, what is your opinion on merging this code and opening a new PR to develop your idea? I like the context manager if we can have something as simple as if_main_process. @pplantinga, if we merge, could you confirm whether if_main_process will still work, or whether it will break and we must change all of them?

@pplantinga
Collaborator Author

pplantinga commented Nov 30, 2023

In this PR, if_main_process is incompatible with ddp_barrier, meaning you can't call delete_checkpoint inside if_main_process without freezing. As far as I'm aware, none of the recipes are doing this so none of them would currently crash. But it would be a surprising behavior if someone were to try this, so eventually we should either convert to run_on_main or to Aku's suggestion. I'm fine to merge now to fix bugs but we need to address this before any releases.
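
To make the "surprising behavior" concrete, a hypothetical recipe fragment (checkpointer and old_ckpt are placeholders, and the method name follows the discussion above rather than any specific recipe):

def cleanup(checkpointer, old_ckpt):
    """Hypothetical recipe code, not an actual recipe."""
    # Problematic after this PR: deletion now hits ddp_barrier() internally,
    # so only the main process reaches the barrier and the other ranks never do.
    # if if_main_process():
    #     checkpointer.delete_checkpoint(old_ckpt)

    # Safe: call it on every rank; the barriers inside keep the ranks in sync.
    checkpointer.delete_checkpoint(old_ckpt)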

@TParcollet
Collaborator

@pplantinga could we briefly assess what is impacted by this deadly combination? If it's minor, we could easily fix it on dev.

@TParcollet
Collaborator

I don't understand why this is a problem introduced by this PR, though. if_main_process was always incompatible with the DDP barrier, no? So we should not see it anywhere?

@pplantinga
Collaborator Author

pplantinga commented Nov 30, 2023

Yes, if_main_process was always incompatible with ddp_barrier. This PR added ddp_barrier inside delete_checkpoint, which was called inside if_main_process in some recipes. I deleted the if_main_process from those recipes, so we should be good to go in terms of no more bugs.

Collaborator

@TParcollet TParcollet left a comment

I think we can merge into dev and then move forward with a proper discussion, among the people interested, on better handling of the DDP barrier.

@TParcollet
Collaborator

@Gastron do you agree? If so, we merge and plan a meeting next week (if you guys are available) to solve this design issue properly.

@Gastron
Collaborator

Gastron commented Nov 30, 2023

I think improving this incrementally makes sense; I guess we can indeed merge this and make further improvements soonish.

@mravanelli
Collaborator

Thank you all for working on this. So, based on what is discussed here, I'm going to merge it. After that, you can discuss better solutions and implement them in another PR.

@mravanelli mravanelli self-requested a review November 30, 2023 20:14
@mravanelli mravanelli merged commit 3fcbbba into speechbrain:develop Nov 30, 2023
5 checks passed
@pplantinga pplantinga deleted the feature/sync-deletion branch December 5, 2023 14:56