Extending pretrained interface for different fetching modes #1817
Conversation
The changes to speechbrain/lobes/augment.py should be backward compatible & allow more flexibility when it comes to relocating OpenRIRs (or addressing them from another path/nestedness level within one's folder structure). Before this change, OpenRIRs were addressable only by recipes that were either located in the same directory or at the same level of the folder structure. Since re-downloading a fixed dataset like OpenRIR for every other dataset is overhead, this change is contributed.

With this PR, I wanted to demonstrate & contribute a testing tool in the form of a template through which one can test how SpeechBrain works from finetuning to releasing a model - a process which can draw pretrained models from different sources. Yet, making DDP available to the valid & eval stages of the core.Brain class would require a larger refactoring (beyond the scope of this PR); for the time being, valid & eval stages remain as they are. For pretrained interfaces, I provide a DDP gathering of WER results from multiple GPU nodes into the main one. This seems to produce the same results as if everything were run on a single node, provided the testing batch size = 1. For the time being, this is a standing suggestion on how to properly estimate performance on eval sets. Please see the notes section of the provided README file for details. (There's a major refactoring due.)

A major contribution of this PR is the ability to compare ways to use the pretrained interface - where it is alike & how it differs from the core Brain class. In short, this testing template also showcases how to use pretrained models to estimate performance using DDP (with testing batch size = 1). This cuts down testing time, e.g., to gather speaker recognition scores for later stages of score calibration, fusion & normalisation.

The WER-related code snippet:

```python
import os
import torch.distributed

for log_metric, specs in reports.items():
    # one result slot per DDP process (WORLD_SIZE is set by the DDP launcher)
    result_list = [None for _ in range(int(os.environ["WORLD_SIZE"]))]
    # WARNING: https://pytorch.org/docs/stable/distributed.html - underlying `pickle` module is known to be insecure
    torch.distributed.all_gather_object(result_list, specs["tracker"].scores)
    # merge the per-rank score lists into one tracker before summarising
    specs["tracker"].scores = list()
    for r in result_list:
        specs["tracker"].scores.extend(r)
    summary = specs["tracker"].summarize()
```
hparams["pretrainer_tokenizer"].load_collected(run_opts["device"]) | ||
hparams["pretrainer_LM"].load_collected(run_opts["device"]) | ||
# LOCAL fetching takes sources directly from their location | ||
hparams["pretrainer_ASR"].collect_files(fetch_from=FetchFrom.LOCAL) |
As seen here, the same code can now no longer handle both local and external files.
We avoid symlink handling through this PR - symlinks are handled in this sub-setting, which no longer needs to be called when the data is local to begin with, i.e. when one directly specifies FetchFrom.LOCAL. It's a benefit.

Yet, I reckon you see a potential compatibility problem with recipes. However, when FetchFrom.LOCAL is not specified, as in existing recipes, there should not be an issue.
I really like that we are getting tests for this.

I can see that FetchFrom makes the source type explicit, which can help with unintended effects like a HuggingFace path and a local path getting confused.

I wish we could use the same code for online and offline sources, so that only a change in the YAML would be necessary to change the pretraining source. Should we bake the run_on_main wrapping into the collect_files method? Basically, collect_files could use run_on_main when fetch_from == FetchFrom.HUGGING_FACE (or ONLINE), and for local, it would run without run_on_main.
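A rough sketch of that idea, just to make it concrete (the _collect helper is made up here to stand for the current unwrapped fetching logic; this is not the actual implementation):

```python
from speechbrain.pretrained.fetching import FetchFrom  # enum added in this PR
from speechbrain.utils.distributed import run_on_main


class Pretrainer:
    ...

    def collect_files(self, fetch_from=FetchFrom.HUGGING_FACE):
        """Fetch all declared sources; wrap the download in run_on_main only when needed."""
        if fetch_from == FetchFrom.LOCAL:
            # local files: every DDP rank can read them directly - no download, no symlink
            self._collect(fetch_from)
        else:
            # online sources: only the main process downloads & symlinks,
            # the other ranks wait until it has finished
            run_on_main(self._collect, kwargs={"fetch_from": fetch_from})
```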
hparams["pretrainer_tokenizer"].load_collected(run_opts["device"]) | ||
hparams["pretrainer_LM"].load_collected(run_opts["device"]) | ||
# LOCAL fetching takes sources directly from their location | ||
hparams["pretrainer_ASR"].collect_files(fetch_from=FetchFrom.LOCAL) |
Here, the same code cannot handle both local and online sources in a distributed setup. Local must be called without run_on_main, online must be called with it.
The run_on_main wrapping might easily get overused and applied stacked, one wrapper on top of another. I'd agree with moving such wrappers to the instances where they are needed, rather than making them a general thing to do in every recipe. At some point it must look somewhat absurd to users to always have to put this in - or they just do it and move on. Implementing that change is a major refactoring. cc @TParcollet @mravanelli

@Gastron is this comment good/bad or an observation? ^^"
I think there are some conflicting ideals in this:
- Like you mentioned, adding this wrapper is an additional burden on the user.
- However, something getting run on the main process and not on other processes can be surprising,
- and library code should ideally deal with a single responsibility. Adding multi-process orchestration to that is another extra responsibility IMO.

Perhaps we could come up with some name suffix (like collect_files_ddp_compliant) that would signal a version of a method that takes care of multi-process issues - that would deal with the surprise by signaling it in the name, and this special version of the method would take care of the orchestration, using the non-wrapped code as the implementation, so it would again respect the single-responsibility principle.
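Spelled out, such a suffixed method could be as thin as this (a sketch only; the plain collect_files is assumed to keep today's unwrapped behaviour):

```python
from speechbrain.utils.distributed import run_on_main


class Pretrainer:
    ...

    def collect_files_ddp_compliant(self, **kwargs):
        """Multi-process-aware wrapper: the orchestration lives here, not in collect_files."""
        # collect_files stays single-responsibility (fetching only);
        # this variant only adds the DDP orchestration around it
        run_on_main(self.collect_files, kwargs=kwargs)
```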
But yeah, changing it everywhere would be a bigger task. I am also OK with not going through with a major refactor; perhaps we can just put run_on_main into collect_files in this one case and leave it there for now.
Ah btw - for hparams["pretrainer_ASR"].collect_files(fetch_from=FetchFrom.LOCAL) I tried running this with run_on_main - funny side effect ;) It only loads the model on the main process, but on none of the other DDP nodes, so they are stuck &/or crash.

With FetchFrom.LOCAL, files are loaded directly onto each DDP node. When run_on_main is used for the other FetchFrom values, its sole purpose is downloading & symlinking. That way, SB avoids all DDP nodes trying to download & symlink while their neighbouring rank is doing the same - they would run into a conflict. With local files, there is no download & no symlink, so all DDP nodes can load data directly from disk, and a separation between collecting & loading is no longer necessary.

Maybe this is a function naming & purpose-scoping issue. Semantically, for FetchFrom.LOCAL, collect_files and load_collected are the same thing. But for the current SB implementation, this way is more "coherent".
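To make the two call patterns explicit (a sketch against this PR's interface; the hparams keys and run_opts come from the recipe excerpt above):

```python
from speechbrain.pretrained.fetching import FetchFrom
from speechbrain.utils.distributed import run_on_main

# Online source (e.g. HuggingFace): only rank 0 downloads & creates symlinks,
# then every rank loads the collected files.
run_on_main(hparams["pretrainer_LM"].collect_files)
hparams["pretrainer_LM"].load_collected(run_opts["device"])

# Local source: nothing to download or symlink, so every DDP rank collects &
# loads directly from disk. Wrapping this call in run_on_main would leave the
# non-main ranks without a model - they hang &/or crash.
hparams["pretrainer_ASR"].collect_files(fetch_from=FetchFrom.LOCAL)
hparams["pretrainer_ASR"].load_collected(run_opts["device"])
```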
Good catch... the question is what to do about it.

About splitting into sub-functions depending on whether/not DDP is used - this is also a larger paradigm shift. Yet, in general, for specific re-appearing code structures that can only be used in one way, there need to be helper functions that bundle all of that together.
Hey, I've just realized that this … So to recap: this FetchFrom solution makes it possible to explicitly set a source, it tackles the problem that local and HuggingFace paths could be confused, and it avoids unnecessary local symlinks.

If those can be fixed, I'd be really happy. I wonder how you feel about still revisiting the FetchFrom enumeration idea. Perhaps a similar, but case-by-case solution would be to have the source type explicit in the string, in a URL-type format. Or similar. Out of these, the local …
Yep. The "ideal thing" would be a large rewrite requiring to develop some design concepts. Yet, there are more than one issues that depend on getting this done better.
Yes.
@Gastron How do you mean 1.? The example shows how to mix them. Or was it before that everything could be dynamically loaded from different sources in one go? (Please point me to an example, I may have lost sight of it.) About 2. - well, it could be a wrapper to that, so that future recipes handle it uniformly & legacy recipes still work, too. Other ways are possible too, lmk.

The magic way is hard to debug - it needs to work for many cases, as pointed out by the community in the xref:ed issues & PRs. Open for alternatives, but beyond speculation ;-)

Here's the RFC to be implemented if we opt to go that way: … For comparison, a more digestible list: … That way leads to way too many future difficulties.
Do you expect all users to bear with you on these assumptions?
That's the goal also achieved by the current implementation of this PR. Do you have an example that currently works and features your 1. & 2. concerns?
Yes, in the original implementation the Pretrainer could be used to load from many different sources. Here is an example that uses HuggingFace for most sources, but a direct HTTPS link for the LM: …

I think this currently proposed change affects legacy recipes, where collect_files is wrapped with run_on_main. For instance, this comment notes that though HuggingFace is the default, other sources (local) could be used as well: speechbrain/recipes/LibriSpeech/LM/train.py, lines 187 to 190 (at ceaecf7).

The LibriSpeech ASR recipe consists of training a Tokenizer, then loading that for training an LM, then putting those together and training the ASR+LM system. The recipe by default offers the pretrained ones, sure. With the proposed change, the recipe training scripts would need to be changed to build the whole system again from scratch. However, I still like the idea of a separate …
Yes, the full set is of course all too large. I just meant those three schemes, just borrowing the syntax from URIs, like … What I mean is that making the source type explicit is very important. However, here it is achieved by an additional argument, and it takes a bunch of effort for the user to keep these two different arguments in sync - changing from HuggingFace to local requires changing the path and then changing the FetchFrom. If they were connected, the user would only have to make a change in one place.
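For instance (purely illustrative - the scheme prefixes and the parse_source helper are not part of this PR, and the URI member name is an assumption about the enum):

```python
from speechbrain.pretrained.fetching import FetchFrom  # enum from this PR


def parse_source(source: str):
    """Map a URI-style source string to (fetch mode, usable path) - illustrative only."""
    if source.startswith("hf://"):
        return FetchFrom.HUGGING_FACE, source[len("hf://"):]
    if source.startswith("file://"):
        return FetchFrom.LOCAL, source[len("file://"):]
    if source.startswith(("http://", "https://")):
        return FetchFrom.URI, source  # keep the full URL for direct downloads
    # no scheme given: fall back to today's behaviour (repo id or plain local path)
    return FetchFrom.HUGGING_FACE, source


print(parse_source("hf://speechbrain/spkrec-ecapa-voxceleb"))
print(parse_source("file://some/local/checkpoint/dir"))
```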
Thanks @Gastron for the pointer to the working example with multiple fetch sources! I have a test case in place for fetching HuggingFace pretrained models, as for one of their model cards (checking that it is not violated). As mentioned in the opening post, we can now perform a regression test for legacy recipes, just as you mentioned - before merging that PR earlier this week, SpeechBrain didn't have this capacity. Also, thank you for clarifying your concerns. I do not intend that using the … As a straightforward way, I'll look into getting the test case for existing recipes in, to which you pointed me. If that one works, then the functionality is there & additional ways to address this can come in future PRs (it will be the same interception points, just with a different colour to it). The …
Thanks, this clears up a bunch of confusion! I was under the impression that FetchFrom would always be needed. But I think I understand now: FetchFrom is an additional argument that is meant to solve the reported problems while keeping the interface backwards-compatible. I believe the … The core of my thoughts in the URI-scheme-like approach above was just about tying the source type and the source string itself together. I think that would make it possible to treat each fetched model/file/whatever separately, which I think is needed for collecting files from multiple sources. I understand that you don't like the URI-like approach, and that is fine, but there could be some other implementation, e.g. a namedtuple, maybe one that contains the FetchFrom Enum too:

```python
from collections import namedtuple

# field renamed from "from" (a reserved word in Python) to "fetch_from"
FetchSource = namedtuple("FetchSource", ["fetch_from", "path"])
fetch_path = FetchSource(FetchFrom.HUGGING_FACE, "speechbrain/spkrec-ecapa-voxceleb")
```

But I am ok with other solutions too, just wanted to clarify this part a little.
Finally :)
We need a minimal or a more extensive test case, so we then know what's going on. I'm migrating the recipe you mentioned into a test case here, but it takes more time than I expected.
Now, we are talking!
I think this PR solves many pain points, and the changes look good to me.
This PR introduces the tests/templates directory, which is fine by me, but I urge you, @anautsch, to think about whether it could better fit in the top-level templates directory, perhaps under something like templates/utility or templates/patterns. I am fine with any choice you make, though.
Similarly, I urge you to think about whether the separate fetch_from argument in fetch and related functions/methods is necessary, or if maybe the FetchSource approach alone is enough. The current approach seems to create two parallel interface patterns. But I will leave this decision to you.
In future PRs (someday) we should create some solution for the difficult case of distributed training plus mixed sources in one single Pretrainer. I suggest an additional method on Pretrainer that does everything: collect, load, and deal with run_on_main wrapping for non-local fetch sources.
When resolving the module docstring conflict on speechbrain/pretrained/interfaces.py, I suggest fixing the year typo on: …
Thanks @Gastron !
Happy to discuss; that's a first choice. Right now, the example is not "worthwhile" enough to be considered a "template", because the existing templates fulfil a proven function. This one uses a totally mocked-up experiment with no control over its internals to embody anything that makes sense as far as ML theory goes. It's just "something" - the main point of this one is entirely in the "use" domain, and there it is a template. Therefore, we should discuss it with the others. As with PR 1600, it needed to be put "somewhere" in an initial draft; it's a new concept, so the final placement is not only my decision.

I removed it and substituted it with the …

In …, I provided three code examples that demonstrate the various ways this can be done (including one that you valued much); please elaborate here. With SpeechBrain, the main objective is to provide users full flexibility in how they dare to tackle their tasks - with the Python & YAML dualism, different ways are possible (as demonstrated).
Sure, perhaps worth a quick discussion in today's core team call.
Ah, I see, I think I just looked at that …

Perhaps I missed something. I believe that with the current implementation, local sources still require …
Light at the end of the tunnel, eh ;-) As written in the README in the new folder, there are three ways demonstrated (assuming the terminal path is cd:ed into this folder).

a) the core contribution of this PR

```bash
CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../.. python3 -m torch.distributed.launch --nproc_per_node=2 finetune.py finetune.yaml --distributed_launch --distributed_backend='nccl'
```

b) HuggingFace model card code snippet as a script

```bash
PYTHONPATH=../../.. python single_node_pretrained.py
```

c) What we are talking about as a final point - this is your previous concern of multi-source, called once in Python - with multi-source

```bash
cd ../../.. && PYTHONPATH=. python tests/templates/fetching_ddp_dynbatch_finetuning/multisource_mini_recipe.py tests/templates/fetching_ddp_dynbatch_finetuning/multisource_mini_recipe.yaml --debug --debug_persistently; cd -
```

NOTE: these entry calling points are what I also see as primarily 'testing'-like, not primarily 'template'-like.

```python
# We download and pretrain the tokenizer
run_on_main(hparams["pretrainer"].collect_files)
hparams["pretrainer"].load_collected(device=run_opts["device"])
```

```yaml
# Models
asr_model: !apply:speechbrain.pretrained.EncoderDecoderASR.from_hparams
    # source: speechbrain/asr-crdnn-rnnlm-librispeech # could create a local path issue; specific to this testing folder
    source: speechbrain/asr-crdnn-transformerlm-librispeech
    run_opts: {"device": "cuda:0"}
    overrides:
        beam_size: !ref <asr_beam_size>
        lm_model:
            vocab: !ref <num_asr_tokens>
            d_model: 768
            nhead: 12
            num_encoder_layers: 12
            num_decoder_layers: 0
            d_ffn: 3072
            dropout: 0.0
            # activation: !name:torch.nn.GELU # seems to cause issues within ruamel
            normalize_before: False
        pretrainer:
            paths:
                lm: "https://huggingface.co/speechbrain/asr-transformer-transformerlm-librispeech/resolve/main/lm.ckpt"
```

... here, my assumption would be: either it did work with DDP already before, or not. So, you are asking for a case d) which now combines the namespace with c), right? Can get to it.
Yeah, it can be done here as well, sure! But yeah, before this PR, …
How about this for … and one follow-up PR for another tidiness level for … @Gastron this would be the time to brainstorm a general re-drafting of the … For a pretrained interface refactoring, different ways are possible, e.g., by the larger use-case task (and general interfaces in a "generics.py" module, or so).
@Gastron regarding your earlier concern - it's a relevant test case, btw.

Now, this on its own is illogical, since …

DDP issues rank 0 (main+child) & rank 1 (child only). I.e. it works only with run_on_main the way it's currently implemented (so I need to fix that). When I run it w/o DDP, so there is no rank=1, it just works out of the box - which is made a lot simpler to define through YAML because of your namespace suggestion! Just fyi.

@Gastron please take a look at the README file to run the test cases. The file naming etc. can be improved, as documentation -> please lmk. About the run_on_main issue: I put in a switch … If we target a follow-up PR to unstable/0.6 to address run_on_main better, we would need to touch all recipes.
@Gastron if you are happy with this PR, could you please merge it once we have released the new minor version? Most likely next week? I think we should keep this change in develop for a bit of time to make sure that everything works as intended.
Yeah, this looks good to me now (even better than before!). I'm happy to merge, but @TParcollet I am actually not sure if that minor version you mention has already been released or not?
No... we will do the minor when PyTorch 2.0 is out.
Now that the minor version is out, it's time to merge! However, let's re-run the checks. I will do some dummy changes to re-trigger.
Actually, it seems I cannot easily re-run the CI checks. @anautsch could you e.g. push some trivial change (even an empty commit should be possible)?
reran the CI checks in another PR, they passed.
SpeechBrain's pretrained & fetching functions rely heavily on symbolic links, even though these are not necessary all the time. This PR disambiguates between the different use cases & eases the use of sym:links for model & audio fetching.
A testing template is provided (at tests/templates/fetching_ddp_dynbatch_finetuning/finetune.py) that demonstrates a fine-tuning use case with frozen model parts, where the pretrainer (class Pretrainer in speechbrain/utils/parameter_transfer.py) relies on pretrained tokenizer, LM & ASR models, each fetched from a different location type (def fetch in speechbrain/pretrained/fetching.py). Therefore, a new enum is introduced. This enum allows distinguishing whether/not sym:links are necessary: FetchFrom.LOCAL will load directly from a local source, whether model or audio.
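For orientation, a minimal sketch of such an enum (the member names besides LOCAL and HUGGING_FACE are assumptions based on this discussion, not the verbatim definition):

```python
from enum import Enum, auto


class FetchFrom(Enum):
    """Where a pretrained source is fetched from (sketch)."""
    LOCAL = auto()         # file already on disk: no download, no symlink
    HUGGING_FACE = auto()  # HuggingFace Hub repo: download & symlink on the main process
    URI = auto()           # direct link (e.g. HTTPS): download & symlink on the main process
```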
Since there is also a class Pretrained in speechbrain/pretrained/interfaces.py (a close miss from "PretraineR"), the example continues. Three different ways of running eval() are demonstrated: the usual asr_brain.evaluate; a custom eval_test_use_recipe_dataio that bypasses the Pretrained.load_audio function (see #1804 for concepts); and another custom eval_test_batch_from_scratch that uses the Pretrained.load_audio function to create batches.

The test template also showcases how DDP & dynamic batching can be used here (btw, the verbosity level of some dyn:batch logs is lowered). As dataset, MiniLibriSpeech is used (as in the SpeechBrain templates), since the goal is to demonstrate the method, not any particular outcome. The DDP test case helps to show a nuance in the usage of
hparams["pretrainer_ASR"].collect_files(fetch_from=FetchFrom.LOCAL)
=> it is used w/orun_on_main
(no fetching, since all is local; also, no delicatesses as for symlink overheads).Related issue & PR complex: #1804 #1797 #1303 #1070 #1177 #1253 #1278 #1291 #1341 #1352 #1358 #1254 #1268
Testing whether this adequately aids each of them is outside the scope of this PR. I'm aware of the situation, and this is a step toward pacifying these topics.
Why now?
We made extensive progress in extending SpeechBrain's testing capabilities over summer and winter, which will soon be merged into develop. This PR will need all of these tests. As mentioned somewhere in the xref:ed issues & PRs: the pretrained interface has more than one use case and is used throughout different compartments of the SpeechBrain ecosystem. Ideally, testing coverage will soon be appropriate - but we shall see what breaks 😋
Another feature: somewhere in these xref topics, demands were raised for better configurability when loading audios. I extended the function signature in speechbrain/pretrained/interfaces.Pretrained, which separates kwargs into fetch_kwargs and others for torchaudio.load.
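As an illustration of that split (a sketch only - the exact signature, the set of fetch-related keys, and the load_audio parameters in the PR may differ; fetch_from is the new argument from this PR, and audio_normalizer is the existing Pretrained attribute):

```python
import torchaudio
from speechbrain.pretrained.fetching import fetch


class Pretrained:
    ...

    def load_audio(self, source, filename, savedir=".", **kwargs):
        """Fetch an audio file, then load it; kwargs are split between fetching and loading."""
        # illustrative split: keys meant for fetch() go there, everything else to torchaudio.load()
        fetch_keys = {"overwrite", "save_filename", "use_auth_token", "fetch_from"}
        fetch_kwargs = {k: v for k, v in kwargs.items() if k in fetch_keys}
        load_kwargs = {k: v for k, v in kwargs.items() if k not in fetch_keys}
        path = fetch(filename, source=source, savedir=savedir, **fetch_kwargs)
        signal, sample_rate = torchaudio.load(str(path), channels_first=False, **load_kwargs)
        return self.audio_normalizer(signal, sample_rate)
```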
To summarise:
- FetchFrom (new enum), i.e. avoid sym:links for data that's local already
- params: !include:ASR.yaml to simplify reading
- new folder tests/templates as a continuation of SpeechBrain templates
- DDP:able valid & eval functions - preliminary study done, testing code provided (this implies a refactoring beyond this PR)
- (better to use pretrainer here, but sometimes one has the model card w/ hparams most ready)
- Ensure all Pretrained interfaces can take audio paths as well as tensor data for input - pretrained interfaces either have a forward() function &/or a function to process batched tensors