
Extending pretrained interface for different fetching modes #1817

Merged 26 commits into speechbrain:develop on Apr 11, 2023

Conversation

@anautsch (Collaborator) commented Jan 25, 2023

SpeechBrain's pretrained & fetching functions rely heavily on symbolic links, even though these are not always necessary. This PR disambiguates between the different use cases & eases the use of symlinks for model & audio fetching.

A testing template is provided (at tests/templates/fetching_ddp_dynbatch_finetuning/finetune.py) that demonstrates a fine-tuning use case with frozen model parts, where the pretrainer (class Pretrainer in speechbrain/utils/parameter_transfer.py) relies on pretrained tokenizer, LM & ASR models, each fetched from a different location type (def fetch in speechbrain/pretrained/fetching.py). Therefore, a new enum is introduced:

from enum import Enum

class FetchFrom(Enum):
    LOCAL = 1  # in the example, for the template-trained ASR
    HUGGING_FACE = 2  # in the example, for the pretrained LM
    ONLINE = 3  # in the example, for the pretrained tokenizer

This enum makes it possible to distinguish whether or not symlinks are necessary: FetchFrom.LOCAL loads directly from the local source, whether model or audio.
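For illustration, a minimal sketch of how a fetch helper could branch on this enum; the helper name and body are hypothetical, the actual logic lives in def fetch in speechbrain/pretrained/fetching.py:

from pathlib import Path

def fetch_sketch(filename, source, savedir, fetch_from=FetchFrom.HUGGING_FACE):
    """Hypothetical helper reusing the FetchFrom enum defined above."""
    if fetch_from is FetchFrom.LOCAL:
        # Local data is addressed in place: no download, no symlink.
        return Path(source) / filename
    # HUGGING_FACE / ONLINE sources would be downloaded into savedir and
    # symlinked there, as the existing fetch() already does.
    destination = Path(savedir) / filename
    # ... download & symlink logic elided ...
    return destination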


Since there is also a class Pretrained in speechbrain/pretrained/interfaces.py (easily confused with the PretraineR above), the example continues there. Three different ways of running eval() are demonstrated: the usual asr_brain.evaluate; a custom eval_test_use_recipe_dataio that bypasses the Pretrained.load_audio function (see #1804 for concepts); and another custom eval_test_batch_from_scratch that uses the Pretrained.load_audio function to create batches.

The test template also showcases how DDP & dynamic batching can be used here (btw, the verbosity level of some dynamic-batching logs is lowered). As a dataset, minilibrispeech is used (as in the SpeechBrain templates), since the goal is to demonstrate the method, not any particular outcome. The DDP test case helps to show a nuance in usage for hparams["pretrainer_ASR"].collect_files(fetch_from=FetchFrom.LOCAL) => it is used w/o run_on_main (no fetching, since all is local; also, no subtleties around symlink overhead).


Related issue & PR complex: #1804 #1797 #1303 #1070 #1177 #1253 #1278 #1291 #1341 #1352 #1358 #1254 #1268
Testing whether this PR adequately helps with each of them is outside its scope. I'm aware of the situation; this is a step toward settling these topics.

Why now?
We made extensive progress in extending SpeechBrain's testing capabilities over summer and winter, which will soon be merged into develop. This PR will need all of these tests. As mentioned somewhere in the xref:ed issues & PRs: the pretrained interface has more than one use case and is used throughout different compartments of the SpeechBrain ecosystem. Ideally, test coverage will soon be adequate - but we shall see what breaks 😋

Another feature: somewhere in these xref topics, demands were raised for better configurability when loading audio. I extended the function signature in speechbrain/pretrained/interfaces.Pretrained:

def load_audio(self, path, savedir="audio_cache", **kwargs)

which separates kwargs into fetch-related arguments and the rest, which is passed on to torchaudio.load; note:

channels_first = kwargs.pop("channels_first", False)  # False as default value: SB consistent tensor format
...
signal, sr = torchaudio.load(str(path), channels_first=channels_first, **kwargs)
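A hedged usage sketch: the asr_model object and the overwrite fetch-side argument are illustrative assumptions, while channels_first, frame_offset and num_frames are standard torchaudio.load arguments:

signal = asr_model.load_audio(
    "speech.wav",
    savedir="audio_cache",
    overwrite=False,       # illustrative fetch-side kwarg, consumed before torchaudio.load
    channels_first=False,  # popped explicitly; SB-consistent tensor format by default
    frame_offset=16000,    # forwarded to torchaudio.load: skip the first second at 16 kHz
    num_frames=32000,      # forwarded to torchaudio.load: read two seconds
)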

To summarise:

  • FetchFrom enum & making it available through the interfaces
    i.e. avoid symlinks for data that is already local
  • Split up yaml files & use params: !include:ASR.yaml to simplify reading
  • Create a symlinked testing folder structure that works when the ASR template is run and fills in path gaps for later parts of the testing example
  • Test case that demonstrates usage as well
    new folder tests/templates as a continuation of the SpeechBrain templates
  • DDP & dynamic batching for a fine-tuning example
  • DDP-capable valid & eval functions
    preliminary study done, testing code provided (this implies a refactoring beyond this PR)
  • DDP-capable custom testing with the pretrained interface
    (better to use the pretrainer here, but sometimes the model card w/ hparams is what one has most readily at hand)
  • Ensure all Pretrained interfaces can take audio paths as well as tensor data as input
    pretrained interfaces either have a forward() function &/or a function to process batched tensors

@anautsch anautsch added enhancement New feature or request work in progress Not ready for merge labels Jan 25, 2023
@anautsch (Collaborator Author)

The changes to speechbrain/lobes/augment.py should be legacy compatible & allow for more flexibility in the future when it comes to relocating OpenRIRs (or addressing them from another path/nesting level within one's folder structure). Before this change, OpenRIRs were addressable only by recipes which were either located in the same directory or at the same level in the folder structure. Since re-downloading a fixed dataset like OpenRIR for every other dataset is needless overhead, this change is contributed.

With this PR, I wanted to demonstrate & contribute a testing tool in the form of a template through which one can test how SpeechBrain works from finetuning to releasing a model - a process which can draw pretrained models from different sources. Yet, making DDP available to the valid & eval stages of the core.Brain class will require a larger refactoring (beyond the scope of this PR). For the time being, valid & eval stages are run_on_main only.

For pretrained interfaces, I provide a DDP-gathering of WER results from multiple GPU nodes into the main one. This seems to produce the same results as running everything on a single node, provided the testing batch size is 1. Also, for the time being, this is a standing suggestion for properly estimating performance on eval sets. Please see the notes section of the provided README file for details. (There's a major refactoring due.)

A major contribution of this PR is the ability to compare ways of using the pretrained interface where it is alike & where it differs from the core Brain class (e.g. speechbrain.pretrained.Pretrained.load_audio & speechbrain.dataio.dataio.read_audio); the provided code snippets make use of either one and demonstrate that the same performance is achieved. Thereby, DDP-gathering is provided for testing datasets.

In short, this testing template also showcases how to use pretrained models to estimate performance using DDP (with testing batch size = 1). This cuts down testing time, e.g., when gathering speaker recognition scores for later stages of score calibration, fusion & normalisation. The WER-related code snippet:

import os
import torch.distributed

# `reports` maps each logged metric to a dict holding (among others) its metric
# tracker; per-utterance scores are gathered from all DDP ranks and merged here.
for log_metric, specs in reports.items():
    result_list = [None for _ in range(int(os.environ["WORLD_SIZE"]))]
    # WARNING: https://pytorch.org/docs/stable/distributed.html - underlying `pickle` module is known to be insecure
    torch.distributed.all_gather_object(result_list, specs["tracker"].scores)
    specs["tracker"].scores = list()
    for r in result_list:
        specs["tracker"].scores.extend(r)
    summary = specs["tracker"].summarize()

@anautsch anautsch added ready to review Waiting on reviewer to provide feedback and removed work in progress Not ready for merge labels Jan 31, 2023
@Gastron Gastron self-requested a review February 2, 2023 11:41
hparams["pretrainer_tokenizer"].load_collected(run_opts["device"])
hparams["pretrainer_LM"].load_collected(run_opts["device"])
# LOCAL fetching takes sources directly from their location
hparams["pretrainer_ASR"].collect_files(fetch_from=FetchFrom.LOCAL)
Collaborator

As seen here, the same code cannot now handle local and external files.

Collaborator Author

We avoid symlink handling through this PR - symlinks are handled in this sub-setting, which no longer needs to be called when data is local to begin with. That's the benefit, i.e. when one directly specifies FetchFrom.LOCAL.

Yet, I reckon you see a potential compatibility problem with recipes. However, when FetchFrom.LOCAL is not specified, as in existing recipes, there should not be an issue.

@Gastron (Collaborator) left a comment

I really like that we are getting tests for this.

I can see that FetchFrom makes the source type explicit, which can help with unintended effects like HuggingFace path and local path getting confused.

I wish we could use the same code for online and offline sources, so that only a change in YAML would be necessary to change the pretraining source. Should we bake the run_on_main wrapping into the collect_files method? Basically, collect_files could use run_on_main when fetch_from == FetchFrom.HUGGING_FACE (or ONLINE), and for local sources it would run without run_on_main.
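For illustration, a minimal sketch of this idea (the _collect_files_impl split is hypothetical; run_on_main is the existing speechbrain.utils.distributed helper, FetchFrom the enum introduced by this PR):

from speechbrain.utils.distributed import run_on_main

class Pretrainer:
    # ... existing attributes and methods ...

    def collect_files(self, fetch_from=None):
        if fetch_from is FetchFrom.LOCAL:
            # Local paths: every DDP rank resolves them itself, nothing to download.
            self._collect_files_impl(fetch_from=fetch_from)
        else:
            # Remote sources: only the main process downloads & symlinks;
            # the other ranks wait at the barrier inside run_on_main.
            run_on_main(self._collect_files_impl, kwargs={"fetch_from": fetch_from})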

hparams["pretrainer_tokenizer"].load_collected(run_opts["device"])
hparams["pretrainer_LM"].load_collected(run_opts["device"])
# LOCAL fetching takes sources directly from their location
hparams["pretrainer_ASR"].collect_files(fetch_from=FetchFrom.LOCAL)
Collaborator

Here, the same code cannot handle both local and online sources in distributed setup. Local must be called without run_on_main, online must be called with it.

Collaborator Author

The run_on_main wrapping might easily get overused and end up stacked, one wrapper on top of another.

I'd agree with moving such wrappers to the places where they are needed, rather than making them a general thing to do in every recipe. It must look somewhat absurd to users at some point to always have to put this in. Or they just do and move on. Implementing that change is a major refactoring. cc @TParcollet @mravanelli

@Gastron is this comment good/bad or an observation? ^^"

Collaborator

I think there are some conflicting ideals in this:

  1. Like you mentioned, adding this wrapper is an additional burden on the user.
  2. However, something getting run on the main and not on other processes can be surprising,
  3. and library code should ideally deal with a single responsibility. Adding multi-process orchestration to that is another extra responsibility IMO.

Perhaps we could come up with some name suffix (like collect_files_ddp_compliant) that would signal a version of a method that takes care of multi-process issues - that would address the surprise by signaling it in the name, and this special version of the method would take care of the orchestration, using the non-wrapped code as the implementation, so it would again be single-responsibility-principle-good.
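For instance, a minimal sketch of such a method on Pretrainer (hypothetical; run_on_main is the existing speechbrain.utils.distributed helper):

from speechbrain.utils.distributed import run_on_main

def collect_files_ddp_compliant(self, **kwargs):
    # Same collection logic as collect_files, but the multi-process orchestration
    # is handled here: only the main process downloads & symlinks, the other
    # ranks wait at the barrier inside run_on_main.
    run_on_main(self.collect_files, kwargs=kwargs)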

Collaborator

But yeah, changing it everywhere would be a bigger task. I am also OK with not going through with a major refactor; perhaps we can just put run_on_main into collect_files in this one case and leave it there for now.

Collaborator Author

ah btw - for hparams["pretrainer_ASR"].collect_files(fetch_from=FetchFrom.LOCAL)
I tried running this with run_on_main - funny side effect ;)
It only loads the model on main, but on none of the other DDP nodes, so they are stuck &/or crash.

With FetchFrom.LOCAL, files are loaded directly onto each DDP node. When run_on_main is used for the other FetchFrom values, its sole purpose is downloading & symlinking. That way, SB avoids all DDP nodes trying to download & symlink while their neighbouring DDP rank is doing the same - they would run into a conflict. With local files, there is no download & no symlink. Thus, all DDP nodes can load data directly from disk. A separation between collecting & loading is no longer necessary.

Maybe this is an issue of function naming & purpose scoping. Semantically, for FetchFrom.LOCAL, collect_files and load_collected are the same thing. But for the current SB implementation, this way is more "coherent".

Good catch... question is what to do about it.

About splitting into sub-functions depending on whether or not DDP is used - this is also a larger paradigm shift. Yet, in general, some refactoring is due for specific recurring code structures that can only be used in one way ... there need to be helper functions which bundle all of that together.

@Gastron (Collaborator) commented Feb 7, 2023

Hey, I've just realized that this fetch_from argument to Pretrainer also makes it impossible to use local and Hugging Face sources at the same time.

So to recap: this FetchFrom solution makes it possible to explicitly set a source, it tackles the problem that local and HuggingFace paths could be confused, and it avoids unnecessary local symlinks.
However, it also makes two things less flexible:

  1. Mixed sources for Pretrainer no longer work
  2. Pretrainer.collect_files wrapping in run_on_main is inconsistent: for local sources it must not be wrapped, and for online sources it must be wrapped (for DDP).

If those can be fixed, I'd be really happy.

I wonder how you feel about still revisiting the FetchFrom enumeration idea. Perhaps a similar, but case-by-case solution would be to have the sources explicit in the string, in the URL type format:

hf://
https://
file://

Or similar. Out of these, the local file:// could be implicitly assumed if none is given. Finally, with this local source information, the Pretrainer could simply avoid collecting the paths into its internal storage during collect_files and instead just load (since it already has the local path).
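A hypothetical sketch of parsing such prefixes (this approach was not adopted in the PR; the function and return values are illustrative):

def split_source(source: str):
    """Map a scheme-prefixed source string to a (source_type, path) pair."""
    if source.startswith("hf://"):
        return "huggingface", source[len("hf://"):]
    if source.startswith(("http://", "https://")):
        return "online", source  # keep the full URL for downloading
    if source.startswith("file://"):
        return "local", source[len("file://"):]
    return "local", source  # no scheme given: assume a local path

# split_source("hf://speechbrain/spkrec-ecapa-voxceleb") -> ("huggingface", "speechbrain/spkrec-ecapa-voxceleb")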

@anautsch (Collaborator Author) commented Feb 7, 2023

Hey, I've just realized that this fetch_from argument to Pretrainer also makes it impossible to use local and Hugging Face sources at the same time.

Yep. The "ideal thing" would be a large rewrite requiring to develop some design concepts. Yet, there are more than one issues that depend on getting this done better.

So to recap: this FetchFrom solution makes it possible to explicitly set a source, it tackles the problem that local and HuggingFace paths could be confused, and it avoids unnecessary local symlinks.

Yes.

However, it also makes two things less flexible:

  1. Mixed sources for Pretrainer no longer work
  2. Pretrainer.collect_files wrapping in run_on_main is inconsistent: for local sources it must not be wrapped, and for online sources it must be wrapped (for DDP).

If those can be fixed, I'd be really happy.

@Gastron what do you mean by 1.? The example shows how to mix them. Or was it before that everything could be dynamically loaded from different sources in one go? (Please point me to an example, I may have lost sight of it.)

About 2. - well, there could be a wrapper for that, so that future recipes handle it uniformly & legacy recipes still work, too. Other ways are possible too, lmk.

I wonder how you feel about still revisiting the FetchFrom enumeration idea.

The magic way is hard to debug - it needs to work for many cases, as pointed out by the community in the xref:ed issues & PRs. I'm open to alternatives, but beyond speculation ;-)
For this PR, the main goal is to have a testing template for integrating a whole bunch of features at once. As for how the fetching is improved underneath, we need to find a robust solution.

Perhaps a similar, but case-by-case solution would be to have the sources explicit in the string, in the URL type format:

hf://
https://
file://

Here's the RFC to be implemented if we opt to go that way:
https://datatracker.ietf.org/doc/html/rfc3986
... and I won't be doing that 😋

For comparison, a more digestible list:
https://en.wikipedia.org/wiki/List_of_URI_schemes

That way leads to way too many future difficulties.

Or similar. Out of these, the local file:// could be implicitly assumed if none is given.

Do you expect all users to bear with you on these assumptions?
One could, as demonstrated here, also assume a local speechbrain folder which then has subfolders identical to our HF structure - but with their own models. In this sort of special case, there's no way around some explicit indication of how to handle it, since the source is otherwise indistinguishable. Enums in Python are untypical, I get that. Another way would be different functions - which bloats the fetching interfaces.

Finally, with this local source information, the Pretrainer could simply avoid collecting the paths in its internal storage during collect_files, and instead, it could just load (since it anyway has the local path).

That's the goal also achieved by the current implementation of this PR.
The main discrepancy is, as you point out, that collecting & loading data are both handled as a kind of 'fetching'.

Do you have a currently working example that features your concerns 1. & 2.?

@Gastron (Collaborator) commented Feb 10, 2023

Or was it before that everything could be dynamically loaded from different sources in one go? (Please point me to an example, I may have lost sight of it.)

Yes, in the original implementation, one Pretrainer could be used to load from many different sources. Here is an example that uses HuggingFace for most sources, but a direct HTTPS link for the LM:

# Models
asr_model: !apply:speechbrain.pretrained.EncoderDecoderASR.from_hparams
    source: speechbrain/asr-crdnn-rnnlm-librispeech
    run_opts: {"device":"cuda:0"}
    overrides:
        beam_size: !ref <asr_beam_size>
        lm_model:
            output_neurons: !ref <num_asr_tokens>
            embedding_dim: 128
            dropout: 0.
            rnn_layers: 2
            rnn_neurons: 2048
            dnn_blocks: 1
            dnn_neurons: 512
            return_hidden: True
        pretrainer:
            paths:
                lm: "https://www.dropbox.com/s/h2nigdmx55o9rjx/timers-and-such-lm.ckpt?dl=1"

About 2. - well, there could be a wrapper for that, so that future recipes handle it uniformly & legacy recipes still work, too. Other ways are possible too, lmk.

I think this current proposed change affects legacy recipes, where collect_files is wrapped with run_on_main. For instance, this comment notes that though HuggingFace is the default, other sources (local) could be used as well:

# We download the tokenizer from HuggingFace (or elsewhere depending on
# the path given in the YAML file).
run_on_main(hparams["pretrainer"].collect_files)
hparams["pretrainer"].load_collected(device=run_opts["device"])

The LibriSpeech ASR recipe consists of training a Tokenizer, then loading that for training an LM, then putting those together and training the ASR+LM system. The recipe by default offers the pretrained ones, sure. With the proposed change, the recipe training scripts would need to be changed to build the whole system again from scratch.

However, I still like the idea of a separate collect_files_ddp_compliant that handles the run_on_main wrapping internally. Alternatively, a method that handles the whole collect-and-load operation could be a good implementation. Both of these would require changes to existing recipes, though.

Here's the RFC to be implemented if we opt to go that way:
https://datatracker.ietf.org/doc/html/rfc3986
... and I won't be doing that 😋

Yes, the full set is of course far too large. I just meant those three schemes, borrowing the syntax from URIs, like hf:// for HuggingFace, https:// or http:// for HTTP sources, and file:// or some such for local sources. (Maybe local:// is better?) Or a similar addition to the sources, maybe just hf: without the two slashes? A similar, though slightly heavier, solution would be a namedtuple with two fields, source_type and path.

What I mean is that making the source type explicit is very important. However, here it is achieved by an additional argument, and it takes a fair bit of effort for the user to keep these two different arguments in sync - changing from HuggingFace to local requires changing the path, and then also changing the FetchFrom. If they were connected, the user would only have to make a change in one place.

@anautsch (Collaborator Author) commented Feb 10, 2023

Thanks @Gastron for the pointer to the working example with multiple fetch sources!
... I was looking for something like this when I started this PR, to have it as a testing case; now I can put it in and include it in the solution space. (I didn't have it, so what you see in the test template is my attempt at recreating such a setup. Let's see how to get both working.)

I have a test case in place for fetching HuggingFace pretrained models, based on one of their model cards (checking that it is not violated). As mentioned in the opening post, it is only now that we can perform a regression test for legacy recipes, just as you mentioned. Before merging that PR earlier this week, SpeechBrain didn't have this capacity.


Also thank you for clarifying your concerns.
The URI magic you would like to have might bring in more trouble from that domain than what needs to be done here. For example, I would not bet on every OS handling some file://../relative/path in the same way a regular Linux user would expect it to be handled. Subsequent discussions would then be about imposing restrictions on how it can be used - and at that point, it is an enum (just with more overhead).

I do not intend the use of FetchFrom to be a must - quite the opposite, it is intended as an additional feature, to be used when necessary. For example, in the demonstration here, there is a subfolder "speechbrain" with local pretrained data. This cannot be fetched in this way with the current method, as it conflicts with HuggingFace (one way or the other, one of these fetches needs an explicit definition of where to get the data from).

As a straightforward next step, I'll look into adding the test case for existing recipes that you pointed me to. If that one works, then the functionality is there & additional ways of addressing this can come in future PRs (it will be the same interception points, just with a different flavour).


The collect_files_ddp_compliant needs to be designed so that one can use such a helper function but also write all of it directly in the recipe. In general, for many recipes, lots of such helper functions summarising the usual copy/paste snippets could now be gathered in some helpers package. Especially since copy/paste errors keep recurring, and all of SpeechBrain's flexibility will be scaling up in the near future to the point of being less maintainable (not only for the DDP-related concerns; some use it more than others & the code structure shouldn't be "the discrimination" here).

@Gastron (Collaborator) commented Feb 13, 2023

I do not intend that using the FetchFrom is a must - the opposite, it is intended as an additional feature; to be used when necessary. For example, in the case demonstration here, there is a subfolder "speechbrain" with local pretrained data. This cannot be fetched in this way with the current method, as it conflicts with HuggingFace (one way or the other, one of these fetchings will need an explicit definition of where to get it from).

Thanks, this clears up a bunch of confusion! I was under the impression that FetchFrom would be always needed. But I think I understand now: FetchFrom is an additional argument that is meant to solve reported problems, while keeping the interface backwards-compatible. I believe the Pretrainer troubles remain, but this makes sense to me.

The core of my thoughts on the URI-scheme-like approach above was just about tying the source type and the source string itself together. I think that would make it possible to treat each fetched model/file/whatever separately, which I think is needed for collecting files from multiple sources. I understand that you don't like the URI-like approach, and that is fine, but there could be some other implementation, e.g. a namedtuple, maybe one that contains the FetchFrom Enum too:

from collections import namedtuple

# "from" is a reserved keyword in Python, so a different field name is used here
FetchSource = namedtuple("FetchSource", ["fetch_from", "path"])
fetch_path = FetchSource(FetchFrom.HUGGING_FACE, "speechbrain/spkrec-ecapa-voxceleb")

But I am ok with other solutions too, just wanted to clarify this part a little.

@anautsch (Collaborator Author)

Thanks, this clears up a bunch of confusion! I was under the impression that FetchFrom would be always needed. But I think I understand now: FetchFrom is an additional argument that is meant to solve reported problems, while keeping the interface backwards-compatible.

Finally :)
Yep—I should have asked for that testing case before...

I believe the Pretrainer troubles remain, but this makes sense to me.

We need a minimal or a more extensive testing case, so that we know what's going on. I'm migrating the recipe you mentioned into a test case here, but it takes more time than I expected.

The core of my thoughts on the URI-scheme-like approach above was just about tying the source type and the source string itself together. I think that would make it possible to treat each fetched model/file/whatever separately, which I think is needed for collecting files from multiple sources. I understand that you don't like the URI-like approach, and that is fine, but there could be some other implementation, e.g. a namedtuple, maybe one that contains the FetchFrom Enum too:

from collections import namedtuple

# "from" is a reserved keyword in Python, so a different field name is used here
FetchSource = namedtuple("FetchSource", ["fetch_from", "path"])
fetch_path = FetchSource(FetchFrom.HUGGING_FACE, "speechbrain/spkrec-ecapa-voxceleb")

But I am ok with other solutions too, just wanted to clarify this part a little.

Now, we are talking!
This looks elegant - instead of an optional fetch_from parameter, "source" could also be a namedtuple which we would expect to have two fields. It does the same job & is more appealing to both of our perspectives. Looking into it...
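A rough sketch of how a fetch routine could accept either a plain string (legacy behaviour) or such a pair; the field names follow the namedtuple sketch above, and the default branch is an assumption, not the merged implementation:

def resolve_source(source):
    """Split a source argument into (FetchFrom, path)."""
    if isinstance(source, FetchSource):
        return source.fetch_from, source.path
    # Plain string: keep the pre-PR behaviour (e.g. HuggingFace repo id or URL).
    return FetchFrom.HUGGING_FACE, source

# e.g. a local "speechbrain/..." subfolder that mirrors the HF layout (see above)
fetch_from, path = resolve_source(
    FetchSource(FetchFrom.LOCAL, "speechbrain/asr-crdnn-rnnlm-librispeech")
)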

@Gastron Gastron self-requested a review February 20, 2023 13:45
@Gastron (Collaborator) left a comment

I think this PR solves many pain points, and the changes look good to me.

This PR introduces this tests/templates directory, which is fine by me, but I urge you, @anautsch, to think about whether it could fit better in the top-level templates directory, perhaps under something like templates/utility or templates/patterns. I am fine with any choice you make though.

Similarly I urge you to think if the separate fetch_from argument in fetch and related functions/methods is necessary, or if maybe the FetchSource approach alone is enough. This current approach seems to create two parallel interface-patterns. But I will leave this decision to you.

In future PRs (someday) we should create some solution for the difficult case of distributed training plus mixed sources in one single Pretrainer. I suggest an additional method on Pretrainer which does everything: collect, load, and deal with run_on_main wrapping for non-local fetch sources.

@Gastron (Collaborator) commented Feb 20, 2023

When resolving the module docstring conflict on speechbrain/pretrained/interfaces.py, I suggest fixing the year typo on:

Pooneh Mousavi 20023

@anautsch (Collaborator Author)

I think this PR solves many pain points, and the changes look good to me.

Thanks @Gastron !

This PR introduces this tests/templates directory, which is fine by me, but I urge you, @anautsch, to think about whether it could fit better in the top-level templates directory, perhaps under something like templates/utility or templates/patterns. I am fine with any choice you make though.

Happy to discuss; that's a first choice. Right now, the example is not "worthwhile" enough to be considered a "template", because the existing templates fulfill a proven function. This one uses a totally mocked-up experiment, with no control over its internals, that is not meant to make sense as ML theory goes. It's just "something" - the main point of this one is entirely in the "use" domain, and there it is a template. Therefore, templates/utility or templates/utils (as for other folders) also apply as alternatives. Indeed, having a template for testing something is "new" ;-)

We should discuss with the others. As with the PR 1600, it needed to be put "somewhere" in an initial draft; it's a new concept, so final placement is not only my decision.

Similarly I urge you to think if the separate fetch_from argument in fetch and related functions/methods is necessary, or if maybe the FetchSource approach alone is enough. This current approach seems to create two parallel interface-patterns. But I will leave this decision to you.

I removed it and substituted it with the FetchSource construction that you recommended throughout (looking at the files, I missed removing one docstring & one code comment; will be improved).

In speechbrain/pretrained/fetching.py & speechbrain/utils/parameter_transfer.py, I use a local parameter with that name, which is necessary to dispatch the FetchSource information into what is needed for handling its internals properly. Is this a discussion about "names", here?

In future PRs (someday) we should create some solution for the difficult case of distributed training plus mixed sources in one single Pretrainer. I suggest an additional method on Pretrainer which does everything: collect, load, and deal with run_on_main wrapping for non-local fetch sources.

I provided three code examples that demonstrate the various ways this can be done (including one that you valued highly); please elaborate here. With SpeechBrain, the main objective is to give users full flexibility in how they choose to tackle their tasks - with the Python & YAML dualism, different ways are possible (as demonstrated).

@Gastron (Collaborator) commented Feb 21, 2023

We should discuss with the others. As with the PR 1600, it needed to be put "somewhere" in an initial draft; it's a new concept, so final placement is not only my decision.

Sure, perhaps worth a quick discussion in today's core team call.

I removed it and substituted it with the FetchSource construction that you recommended throughout (looking at the files, I missed removing one docstring & one code comment; will be improved).

Ah, I see - I think I just looked at that fetch docstring and thought it was still there. Sorry, I misunderstood, that's all.

I provided three code examples that demonstrate the various ways this [distributed training plus mixed sources in one single Pretrainer] can be done (including one that you valued highly); please elaborate here. With SpeechBrain, the main objective is to give users full flexibility in how they choose to tackle their tasks - with the Python & YAML dualism, different ways are possible (as demonstrated).

Perhaps I missed something. I believe that with the current implementation, local sources still require collect_files to be called on all processes (otherwise the local path will not be in Pretrainer.paths), but external sources require collect_files to be wrapped in run_on_main. Can a single Pretrainer work for mixed sources on distributed setups somehow?

@anautsch (Collaborator Author) commented Feb 21, 2023

Light at the end of the tunnel, eh ;-)

As written in the README in the new folder, there are three ways demonstrated (assuming the terminal is cd'ed into this folder).

a) the core contribution of this PR

CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../.. python3 -m torch.distributed.launch --nproc_per_node=2 finetune.py finetune.yaml --distributed_launch --distributed_backend='nccl'

b) HuggingFace model card code snippet as a script

PYTHONPATH=../../.. python single_node_pretrained.py

c) What we are talking about as a final point - this is your previous concern of multi-source called once in Python - with multi-source

cd ../../.. && PYTHONPATH=. python tests/templates/fetching_ddp_dynbatch_finetuning/multisource_mini_recipe.py tests/templates/fetching_ddp_dynbatch_finetuning/multisource_mini_recipe.yaml --debug --debug_persistently; cd -

NOTE: these entry points are what I also see as primarily 'testing'-like, rather than primarily 'template'-like.

see https://github.com/anautsch/speechbrain/blob/23a1414d8493f79ff3c484b01b099514411576fb/tests/templates/fetching_ddp_dynbatch_finetuning/multisource_mini_recipe.py#L357

# We download and pretrain the tokenizer
run_on_main(hparams["pretrainer"].collect_files)
hparams["pretrainer"].load_collected(device=run_opts["device"])

and https://github.com/anautsch/speechbrain/blob/23a1414d8493f79ff3c484b01b099514411576fb/tests/templates/fetching_ddp_dynbatch_finetuning/multisource_mini_recipe.yaml#L67

# Models
asr_model: !apply:speechbrain.pretrained.EncoderDecoderASR.from_hparams
    # source: speechbrain/asr-crdnn-rnnlm-librispeech  # could create a local path issue; specific to this testing folder
    source: speechbrain/asr-crdnn-transformerlm-librispeech
    run_opts: {"device":"cuda:0"}
    overrides:
        beam_size: !ref <asr_beam_size>
        lm_model:
            vocab: !ref <num_asr_tokens>
            d_model: 768
            nhead: 12
            num_encoder_layers: 12
            num_decoder_layers: 0
            d_ffn: 3072
            dropout: 0.0
            # activation: !name:torch.nn.GELU  # seems to cause issues within ruaml
            normalize_before: False
        pretrainer:
            paths:
                lm: "https://huggingface.co/speechbrain/asr-transformer-transformerlm-librispeech/resolve/main/lm.ckpt"

... here, my assumption would be: either it already worked with DDP before, or it didn't.

So, you are asking for a case d) which now combines the namespace approach with c), right?
Why would this (after the work demonstrated here) qualify as «In future PRs (someday)» when it is an obvious next step here? 😋

Can get to it.

@Gastron (Collaborator) commented Feb 21, 2023

Yeah, it can be done here as well, sure! But before this PR, collect_files could be wrapped in run_on_main and it would always just work, because we used symlinks for everything. Now, local files need their location known. Still, it would make sense to combine collect_files and load_collected into one method, which deals with the mixed sources as well.

@anautsch (Collaborator Author)

How about this for develop?

... and one follow-up PR for another level of tidiness for unstable-v0.6?

@Gastron this would be the time to brainstorm a general re-drafting of the speechbrain.pretrained structure, as you indicated. Right now, with more and more interfaces coming in, it's prone to become even larger (less manageable with future revisions).

For a pretrained interface refactoring, there are different ways possible, e.g., by the larger use case task (and general interfaces in a "generics.py" module, or so).

@anautsch (Collaborator Author)

@Gastron regarding your earlier concern - it's a relevant test case, btw.

speechbrain.utils.parameter_transfer - Loading pretrained files for: tokenizer, lm, model
speechbrain.utils.parameter_transfer - Redirecting (loading from local path): results/CRDNN_BPE_960h_LM/2602/save/model.ckpt -> tests/templates/fetching_ddp_dynbatch_finetuning/speechbrain/asr-crdnn-rnnlm-librispeech/model.ckpt
Traceback (most recent call last):
  File "tests/templates/fetching_ddp_dynbatch_finetuning/finetune_fetch_once.py", line 241, in <module>
    hparams["pretrainer"].load_collected(run_opts["device"])
  File "speechbrain/utils/parameter_transfer.py", line 281, in load_collected
    self._call_load_hooks(paramfiles, device)
  File "speechbrain/utils/parameter_transfer.py", line 298, in _call_load_hooks
    default_hook(obj, loadpath, device=device)
  File "speechbrain/utils/checkpoints.py", line 144, in torch_parameter_transfer
    torch.load(path, map_location=device), strict=False
  File "torch/serialization.py", line 771, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "torch/serialization.py", line 270, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "torch/serialization.py", line 251, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'results/CRDNN_BPE_960h_LM/2602/save/model.ckpt'

Now, this on its own is illogical, since speechbrain/utils/parameter_transfer.py was changed to cope with that. BUT the ASR model is the one multi-fetch source that needs special treatment, as you also remarked earlier - something is going on with it that is different from the others. When using DDP, the error continues:

torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune_fetch_once.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-22_10:24:42
  host      : 
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 20973)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

DDP issues rank 0 (main+child) & rank 1 (child only), i.e. it works only with run_on_main in the way it's currently implemented (so I need to fix that).

When I run it w/o DDP, so there is no rank=1, it just works out of the box - and the yaml definition is made a lot simpler by your namespace suggestion!

just fyi

@anautsch (Collaborator Author)

@Gastron please take a look at the readme file to run the test cases.

The file naming etc. can be improved, as can the documentation - please lmk.

About the run_on_main issue. I put in a switch internal_ddp_handling=False, so there is no legacy conflict. Then, I re-ran the four test cases. They seem to work; but please check.

If we target a follow-up PR to unstable/0.6 to address the run_on_main better - we would need to touch all recipes.

@TParcollet (Collaborator)

@Gastron if you are happy with this PR, could you please merge it once we have released the new minor version? Most likely next week? I think we should keep this change in develop for a bit of time to make sure that everything works as intended.

@Gastron (Collaborator) commented Mar 7, 2023

Yeah, this looks good to me now (even better than before!). I'm happy to merge, but @TParcollet I am actually not sure whether the minor version you mention has already been released or not?

@TParcollet (Collaborator)

No... we will do the minor when PyTorch 2.0 is out.

@Gastron (Collaborator) commented Mar 31, 2023

Now that the minor version is out, it's time to merge! However, let's re-run the checks. I will do some dummy changes to re-trigger.

@Gastron Gastron changed the base branch from develop to unstable-v0.6 March 31, 2023 12:38
@Gastron Gastron changed the base branch from unstable-v0.6 to develop March 31, 2023 12:38
@Gastron Gastron self-requested a review March 31, 2023 12:40
@Gastron (Collaborator) commented Mar 31, 2023

Actually, it seems I cannot easily re-run the CI checks. @anautsch, could you e.g. push some trivial change (even an empty commit should be possible)?

@Adel-Moumen (Collaborator)

Hi @Gastron, I think you will need to do it.

@anautsch left Avignon University and therefore SpeechBrain, and he is not (I think) planning to be active again any time soon, so we will have to handle the ongoing PRs/issues.

@Gastron (Collaborator) left a comment

Reran the CI checks in another PR; they passed.

@Gastron Gastron merged commit e353e95 into speechbrain:develop Apr 11, 2023