
Streamable Conformer-Transducer ASR model for LibriSpeech #2140

Merged
merged 84 commits into speechbrain:unstable-v0.6 on Dec 18, 2023

Conversation

asumagic
Collaborator

@asumagic asumagic commented Aug 21, 2023

Contribution in a nutshell

This PR implements a streaming ASR model for SpeechBrain.

Resolves: #1970

This PR will (probably) not implement the high-level interfaces (think interfaces.py) to perform inference, as my work ends at the end of August, but I'm hoping to get everything into a functional state with a streaming inference proof of concept done. The high-level interfaces will be implemented separately.

Scope

  • Implement the Dynamic Chunk Training method to mask transformer keys according to chunks of time, and vary the chunk size across batches (see the mask sketch after this list).
  • Implement AWS' Dynamic Chunk Convolution technique for the Conformer.
  • Allow training a model with fully streaming compliant constraints and emulate streaming at inference through masking.
  • Provide a tool to empirically analyze dependencies between output tokens and input tokens, in order to validate whether the intended time constraints are respected.
  • Write a LibriSpeech Conformer-Transducer recipe.
  • Provide a fully trained streaming model.
  • Implement a live microphone demo (using PyAudio)
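
To illustrate the Dynamic Chunk Training masking from the first bullet, here is a minimal sketch of how such a chunk-wise attention mask can be built. This is not the actual SpeechBrain implementation, and `chunk_size`/`left_context_chunks` are illustrative names:

```python
import torch


def build_chunked_attention_mask(seq_len, chunk_size, left_context_chunks):
    """Boolean mask where True marks key positions a query frame may NOT attend to.

    Each frame may attend to every frame of its own chunk plus a limited
    number of past chunks, but never to future chunks.
    """
    chunk_idx = torch.arange(seq_len) // chunk_size
    q_chunk = chunk_idx.unsqueeze(1)  # chunk index of each query frame (rows)
    k_chunk = chunk_idx.unsqueeze(0)  # chunk index of each key frame (columns)
    visible = (k_chunk <= q_chunk) & (k_chunk >= q_chunk - left_context_chunks)
    return ~visible


# Example: 12 frames, chunks of 4 frames, 1 chunk of left context.
print(build_chunked_attention_mask(12, chunk_size=4, left_context_chunks=1).int())
```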

Additionally, still TODO for me:

  • Upload the model somewhere: https://drive.google.com/drive/folders/1vL0Wr15eQZWS8SHp6RJQ_acj__7iri6J?usp=sharing to be transferred to Dropbox
  • Update README
  • Solve essential TODOs
  • Solve pre-commit issues
  • Change the experiment seed
  • Document new things heavily
  • Ensure the code isn't breaking other things
  • Probably figure out what to do with custom_tgt_module again -- it probably needs undoing and patching the model in the same way as the other Conformer-Transducer
  • Implement a fully functional proof-of-concept of chunked streaming (WIP)
  • Implement transducer greedy search

Out-of-scope for this PR (will be done separately):

  • Implementing high level interfaces (for HF demos etc.)

Notes for reviewing (optional)

See thread; there was an accuracy issue with streaming decoding at the time, which has since been solved.

Explanations and technical details

This goes into quite a bit of detail for the big picture. It could probably eventually be turned into general documentation on streaming once SB's support matures.

https://gist.github.com/asumagic/aaf875929d1c03a96fd08a897599aaa8

Modifying the existing recipe?

Streaming has required very few hparams changes so far. Considering that the only current difference between a non-streaming model and a streaming model is a single hyperparameter, I figured it made more sense to keep a single file for now.

Additionally, as per the above technical doc, the model is also trained with a context extremely close to non-streaming for ~40% of the batches. The result is that the streaming model can operate in a non-streaming fashion, and for LibriSpeech, the non-streaming performance is very close between a streaming: False and a streaming: True model, at least in my testing.
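
For intuition, the per-batch configuration described above can be pictured as the following sketch. The probabilities and sizes are placeholders rather than the recipe's exact values:

```python
import random


def sample_dynchunktrain_config(full_context_prob=0.4,
                                chunk_sizes=(8, 16, 24, 32),
                                left_context_chunks=(2, 4, 8, None)):
    """Return None for a full-context batch, otherwise a per-batch
    (chunk_size, left_context_chunks) pair; None left context means
    unlimited left context."""
    if random.random() < full_context_prob:
        return None  # ~40% of batches see (close to) full context
    return random.choice(chunk_sizes), random.choice(left_context_chunks)


for _ in range(5):
    print(sample_dynchunktrain_config())
```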

Self-attention logic changes

The attention.py change, which moves the scaling earlier in order to reduce the likelihood of float16 overflow, should be reviewed thoroughly, as that code is used extensively. Even though it looks mathematically correct, the rel_shift trick scares me a bit, and someone with a better understanding of the subject than me should double-check it.

Performing the softmax in fp32 space did not seem to affect performance for me and it seems more correct.

Ultimately I trained the model using bf16. I am not sure whether training in fp16 is fully stable yet, but I know that these changes significantly reduced the number of training failures in fp16 mode.
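
For reference, the numerical idea under review is roughly the following simplified scaled dot-product sketch (not the relative positional attention code in attention.py):

```python
import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    # Scale the query *before* the q @ k^T product so intermediate values
    # stay small, reducing the risk of float16 overflow.
    q = q / (d_k ** 0.5)
    scores = q @ k.transpose(-2, -1)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    # Perform the softmax in float32 for numerical stability, then cast back.
    attn = torch.softmax(scores.float(), dim=-1).to(v.dtype)
    return attn @ v


q, k, v = (torch.randn(2, 4, 10, 64) for _ in range(3))
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 4, 10, 64])
```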

Pre-review

  • N/A (if applicable) add an extra_requirements.txt file
  • N/A (if applicable) add database preparation scripts & use symlinks for nested folders (to the level of task READMEs)
  • (if applicable) add a recipe test entry in the depending CSV file under: tests/recipes
  • create a fresh testing environment (install SpeechBrain from cloned repo branch of this PR)
  • (if applicable) run a recipe test for each yaml/your recipe dataset
  • check function comments: are there docstrings w/ arguments & returns? If you're not the verbose type, put a comment every three lines of code (better: every line)
  • use CI locally: pre-commit run -a to check linters; run pytest tests/consistency
  • (optional) run tests/.run-doctests.sh & tests/.run-unittests.sh
  • exhausted patience before clicking « Ready for review » in the merge box 🍄

@asumagic
Collaborator Author

asumagic commented Aug 29, 2023

I wrote a relatively extensive description of the whole streaming approach.

This should hopefully clear up a lot of how and why the approach works, in a more understandable/intuitive way than some of the literature.

Additionally, I got a simple greedy search prototype working yesterday and could successfully transcribe several minutes of speech with it.
Currently, some of the code lives in a notebook, as I needed some external glue code for streaming; I'll clean it up and share it.
I even got a basic local microphone demo working, which bruised my ego a bit considering how hard a time it apparently has with French accents. I blame LibriSpeech 😉
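
For the curious, the microphone demo boils down to something like this sketch: PyAudio capture feeding a hypothetical `transcribe_chunk` callback. The chunk length and the callback are placeholders and not part of this PR:

```python
import numpy as np
import pyaudio

SAMPLE_RATE = 16000
CHUNK_FRAMES = int(SAMPLE_RATE * 0.64)  # placeholder: match the model's chunk stride


def transcribe_chunk(audio: np.ndarray) -> str:
    """Placeholder for the streaming model's per-chunk decoding."""
    return ""


pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                 input=True, frames_per_buffer=CHUNK_FRAMES)
try:
    while True:
        raw = stream.read(CHUNK_FRAMES)
        audio = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
        print(transcribe_chunk(audio), end="", flush=True)
except KeyboardInterrupt:
    pass
finally:
    stream.stop_stream()
    stream.close()
    pa.terminate()
```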

@asumagic
Collaborator Author

asumagic commented Aug 31, 2023

I probably won't be able to spend much more time on this since my work ends today, and I'll share my notebook with the WIP streaming code today. The PR and code are generally cleaned up.

There is, however, an issue... The WER when using the streaming code is bad (whereas it is fairly decent when emulating streaming by masking), so something is broken somewhere, and I am not sure what. I edited the technical document with a section on what I have attempted so far to debug the issue, and on things left to try.

So unless my priorities change, someone will have to pick up the work from here. Hopefully a fresh set of eyes will spot the issues that I couldn't find in time.
With this issue solved (and I don't know how much effort that would be; hopefully it wasn't improper masking in training), there technically isn't a lot remaining to get streaming working: some cleanups, proper generic interfaces that could be reused for other models, and the bits needed to actually push it to HF.

I can still answer questions, clarify things and provide some help/guidance if needed of course.

So with all that said, I will switch the PR to "ready for review" once I upload the notebook somewhere, but it is up to you to figure out the next best course of action.

@asumagic
Collaborator Author

asumagic commented Aug 31, 2023

Archive containing the streaming test Jupyter notebook. It should be relatively straightforward to run (check out this branch in a new environment, edit the paths, get the model directory from Google Drive, and place it in the proper results subdirectory for the Conformer-Transducer recipe). (Now outdated.)

The WIP code to actually do streaming inference lives there (though this PR includes a bunch of model methods to help with it), alongside some demo and test code.

@asumagic asumagic marked this pull request as ready for review August 31, 2023 12:49
@mhn226 mhn226 self-requested a review September 26, 2023 07:43
@TParcollet
Collaborator

Hey @asumagic please ping me when it's ready to review again :)

@asumagic
Collaborator Author

Trying to fix fp16 training NaN loss issues so that I can retrain a model after fixing a mismatch in the new convolution code, and rebasing to be up to date with the core changes.

@asumagic asumagic changed the base branch from develop to unstable-v0.6 November 23, 2023 14:17
@asumagic
Collaborator Author

> Trying to fix fp16 training NaN loss issues so that I can retrain a model after fixing a mismatch in the new convolution code, and rebasing to be up to date with the core changes.

Status update:

Not sure when the training will be done, especially as Narval will go under maintenance tomorrow, but at least I finally got the ball rolling there.

@TParcollet
Collaborator

Did you try removing the loss computation from the mixed-precision training + the last softmax after the joiner network?

One thing to check as well: is the RNN-T loss a sum or an average? If it's a sum, dynamic batching might be slightly less stable.
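
(As a toy illustration of the point about sums: with a summed loss, the gradient magnitude grows with the number of elements in the batch, so dynamically sized batches produce widely varying gradient scales.)

```python
import torch

w = torch.ones(1, requires_grad=True)
for n in (100, 1000):  # two "dynamic" batch sizes
    x = torch.randn(n)
    loss = (w * x).pow(2).sum()  # sum reduction: grows with batch size
    loss.backward()
    print(f"batch of {n:4d} elements -> |grad| = {w.grad.abs().item():.1f}")
    w.grad = None
```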

@asumagic
Collaborator Author

> Did you try removing the loss computation from the mixed-precision training + the last softmax after the joiner network?
>
> One thing to check as well: is the RNN-T loss a sum or an average? If it's a sum, dynamic batching might be slightly less stable.

I think I did try forcing the loss computation to fp32, to no avail. I didn't try touching the joiner network/softmax, though. For now I'm tempted to keep training with bf16, as I've started one (... once Compute Canada comes out of maintenance), and to investigate the fp16 problem after that so that the PR is ready ASAP.

The RNN-T loss uses a sum here.

@TParcollet
Collaborator

TParcollet commented Nov 28, 2023

The reason could be the sum here, then. Could you please try to initialise the grad scaler with a much smaller value? Like 10 instead of 65536? Or even 2.0? The rationale behind this is that torch initialises this value to a crazy high number that is OK when you have averaged losses -- see here. You can certainly also try to play with growth_factor=2.0 and backoff_factor=0.5. Another good thing to do, to see exactly what is happening and adapt accordingly, would be to print the value of the scaling factor and see how it behaves; it may either vanish or explode due to the fact that we use a sum... @Adel-Moumen might be interested in this as well. But I am pretty sure that it comes from here.
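
Concretely, the suggestion amounts to something like the following; the exact values are things to experiment with, not recommendations:

```python
from torch.cuda.amp import GradScaler

# Much smaller initial scale than the default 65536, with the growth and
# backoff behaviour made explicit so they can be tuned.
scaler = GradScaler(init_scale=10.0, growth_factor=2.0, backoff_factor=0.5)

# Printing the scaler state during training shows whether the scale
# vanishes or explodes over time.
print(scaler.state_dict())
```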

@asumagic
Collaborator Author

cc @TParcollet @Adel-Moumen: never mind, a mean is actually used in the transducer loss itself; I was looking at the wrong thing (the joiner network).

I've tried reducing the initial gradscaler scale very significantly, but it did not help. Printing its state_dict to monitor what's going on shows that the scale is vanishing, so it seems to be endlessly trying to reduce the scale.

I'll retry forcing different parts to float32 to diagnose the issue, since it's possible I did it wrong the last time I tried. I think disabling autocast with enabled=False might not actually suffice, as the inputs may need to be converted back to float explicitly; see: https://pytorch.org/docs/stable/amp.html#autocasting
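
In other words, the pattern to double-check is roughly the following, with a stand-in cross-entropy loss (the real code uses the transducer loss):

```python
import torch
import torch.nn.functional as F


def loss_in_fp32(logits, targets):
    # Disabling autocast alone is not enough: tensors produced inside the
    # surrounding autocast region may still be half precision, so they need
    # an explicit cast back to float32 before the sensitive computation.
    with torch.autocast(device_type=logits.device.type, enabled=False):
        return F.cross_entropy(logits.float(), targets)


logits = torch.randn(4, 10, dtype=torch.bfloat16)
targets = torch.randint(0, 10, (4,))
print(loss_in_fp32(logits, targets))
```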

Looking into this now.

@TParcollet
Collaborator

TParcollet commented Nov 30, 2023

@asumagic right ... what about the rest of the recipe, is it ready? If so, I'll review one last time and try it.

@asumagic
Collaborator Author

asumagic commented Nov 30, 2023

> @asumagic right ... what about the rest of the recipe, is it ready? If so, I'll review one last time and try it.

The recipe itself should™ be ready, other than that fp16 is still not fixed. The new training (with bf16) should end tomorrow or on Monday. Once done, I will test accuracy and everything once again.

Notes and reminders:

  • This PR depends on "Sync before and after deleting" (#2268) for training with DDP. Rebased against unstable-v0.6, so this is now in.
  • In TransformerASR, I still have to launch training with the custom_tgt_module line commented out. Because that field is not used when only encoding, DDP raises an error during backpropagation. To work around it, when the model is done training, I manually add it back to the model to avoid missing key errors. This issue was mentioned back with the other Conformer PR but we postponed a fix.
  • The interfaces to conveniently do chunked streaming are not currently part of the PR. This PR only includes the recipe changes and the model changes. I'll update and reupload my test notebook when it's done.

The high-level interfaces shouldn't take too long to implement (comparatively), but I kind of wanted to make sure everything works before trying to design something that may also end up being reused for other streaming models (or for adapting non-streaming models to streaming contexts, like whisper.cpp etc. do, IIRC).

@TParcollet
Collaborator

TParcollet commented Nov 30, 2023

@asumagic as soon as you are confident in the results, ping me and I'll review. For the three notes: 1. OK. 2. Who decided to postpone this issue? It sounds like something that would make this recipe unusable for non-expert people. 3. OK, this is totally fine, and I actually think that the interface should be another PR. Could you point to an issue showing 2? I will fix this.

@asumagic
Collaborator Author

The custom_tgt_module issue was brought up in the original PR starting with this comment: #1782 (comment)

@TParcollet
Collaborator

TParcollet commented Nov 30, 2023

Wouldn't --find_unused_parameters solve the issue without commenting out the code? "The problem is that simply adding this check will not cut it, as it will break every existing trained model, since trained models will include keys that will not exist (as the custom_tgt_module will not have been initialized)." -- right, this is a problem. Any ideas from our checkpoint experts @Gastron and @pplantinga?

@asumagic
Collaborator Author

asumagic commented Dec 1, 2023

> Wouldn't --find_unused_parameters solve the issue without commenting out the code?

No, all this does is print which parameters were unused, but you still get an error. If you don't specify it, it doesn't tell you which parameter was the problem.

(I imagine the default just checks a counter, whereas this option also tracks and sends the list of parameter names over DDP, which is probably why it's not the default.)

@TParcollet
Collaborator

TParcollet commented Dec 15, 2023

@asumagic also, it looks like when streaming = True, training slows down seriously -- on 8 GPUs it goes from 10 min per epoch to 15-20. Is this expected? I guess it's due to the per-chunk processing with all the for loops? Should we try to JIT this? GPU utilisation also seems to be lower. Once we have merged this, we might want to have a look at optimising the code, if that is of interest to you.

@asumagic
Collaborator Author

asumagic commented Dec 15, 2023

I'll try to profile it. It's possible that some of the list comprehensions inside the convolution module are what harms performance, especially with longer utterances and smaller chunk sizes.
I might have underestimated their effect on performance, but I think most of it could be mitigated if that turns out to be the cause.
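
(For reference, a minimal profiling setup along these lines; the model and batch below are trivial placeholders for the streaming Conformer encoder and a real feature batch:)

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(80, 256), torch.nn.ReLU())
batch = torch.randn(8, 1000, 80)

# Add ProfilerActivity.CUDA to the activities when profiling on GPU.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(batch)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```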

Should this be solved as part of this PR or another? I imagine the change would not be too extensive, but it might take a bit to properly implement and test. Never mind -- I didn't read it through and didn't realize you suggested doing this afterwards.

@asumagic
Collaborator Author

asumagic commented Dec 15, 2023

On a side note, I did try poking at torch.compile a few days ago and, impressively enough, it succeeds in compiling the entire compute_forward. However, dynamic shape support is still too lacking: compilation crashes hard with it enabled (and it's pretty much a requirement to avoid extremely slow constant recompiles), and the beam search fails to compile with it (though the greedy search does). So right now it's a dead end, but it's promising.
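
(For context, the dynamic-shape mode referred to is the one below; this is generic PyTorch usage on a placeholder model, not something wired into the recipe:)

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(80, 256), torch.nn.ReLU())

# dynamic=True asks the compiler for shape-polymorphic kernels instead of
# recompiling for every new sequence length; without it, variable-length
# batches trigger constant (and very slow) recompilations.
compiled = torch.compile(model, dynamic=True)

for t in (100, 137, 512):  # varying "sequence lengths"
    print(compiled(torch.randn(4, t, 80)).shape)
```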

@asumagic
Collaborator Author

Yet again updated the notebook due to the renaming refactor:
streaming-tests.zip

@asumagic
Collaborator Author

asumagic commented Dec 18, 2023

@TParcollet Ok, I think it should be all done now. There are some conversations I didn't resolve which were mostly answers to some remarks/questions.

With the very latest code, I have checked that:

  • training runs for a few iterations without errors
  • inference still works with the expected accuracy with the real streaming code
  • the integration test recipe is actually able to output tokens after a much larger number of epochs than is defined in the code
  • the masking and streaming encoder code paths produce almost exactly the same results when chunking after feature extraction (the feature extraction may behave slightly differently in streaming, which does not cause issues) -- see the sketch below
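
The equivalence check in the last bullet amounts to the kind of comparison sketched below. The real check runs the full (masked) encoder pass against the chunk-by-chunk streaming pass with carried-over context; here a trivially chunkable framewise module stands in for the encoder so the snippet is self-contained:

```python
import torch

torch.manual_seed(0)
frame_wise = torch.nn.Linear(80, 256)  # stand-in for the streaming encoder
feats = torch.randn(1, 160, 80)        # (batch, time, features)
chunk_frames = 16

ref = frame_wise(feats)  # full-utterance ("masked") pass
streamed = torch.cat(
    [frame_wise(chunk) for chunk in feats.split(chunk_frames, dim=1)], dim=1
)  # chunk-by-chunk ("streaming") pass

print(torch.allclose(ref, streamed, atol=1e-6))  # expected: True
```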

Collaborator

@TParcollet TParcollet left a comment

LGTM. Awesome work @asumagic !

@TParcollet TParcollet merged commit b01944f into speechbrain:unstable-v0.6 Dec 18, 2023
5 checks passed
mravanelli added a commit that referenced this pull request Jan 7, 2024
* Refactor HF interface, adapt recipes

* rename HF's files

* fix docstrings

* fix args docstrings

* fix docstrings

* change classes' names

* Refactor HF interface, adapt recipes

* Fix docstrings

* commonvoice

* switchboard

* update readme

* update readme

* update lionk in test file

* remove unused space token

* update torchaudio

* remove deprecated language model path

* fix merge

* fix vocab

* fix switchboard

* commit

* fix test

* fix style

* remove unsued hparam

* fix consistancy blank_skip_threshold

* text frames

* CTCPrefixBeamSearcher timestamps

* pre-commit

* test

* test 2

* fix prints

* update ctcprefixbeamsearch timestamps

* remove frames from prefix bs

* Revert "remove frames from prefix bs"

This reverts commit 30900d9.

* remove prefix bs

* Revert "remove prefix bs"

This reverts commit 2f0c3cd.

* Revert "update ctcprefixbeamsearch timestamps"

This reverts commit ce09e19.

* Revert "fix prints"

This reverts commit bf36037.

* Revert "test 2"

This reverts commit 84cda94.

* Revert "test"

This reverts commit f17349f.

* Revert "pre-commit"

This reverts commit 4e1cf0d.

* Revert "CTCPrefixBeamSearcher timestamps"

This reverts commit c3d3cf7.

* Revert "text frames"

This reverts commit e67c761.

* Revert "fix consistancy blank_skip_threshold"

This reverts commit f97a391.

* Update ctc.py

* arg / timestamps

* precommit

* timesteps -> text_frames

* ls seq2seq

* transformer ls

* fix naming

* librispeech

* aishell

* fix linter

* precommit

* switchboard

* timit

* Dynamic batching fixed

* authors

* fix conformer large

* indent

* Revert "Fix dynamic batching" (#2173)

* update doctest skip

* Fix dynamic batching (#2174)

* Revert "Revert "Fix dynamic batching" (#2173)"

This reverts commit faa5e76.

* Update interfaces.py

* Update interfaces.py

* Update text_to_sequence.py

* fix w2v

* aishell

* cv

* ls transformer

* ls ssl

* switchboard

* timit

* precommit

* fix indent

* fix arg

* unit test sorting

* unittests

* remove if main

* Small fixes in averaging checkpoints (#2181)

* add ckpt avg unittest

* avoid hard-coding number of averages

* last fixes

* fix recipe test

* fix recipe test

* convert print into logger

* fix transducer recipe

* remove typing

* fix merge

* precommit

* Update LibriSpeech.csv

* update to new dynamic batching args

* Update unstable branch with new commits  (#2196)

* hyper branch/conf -former fixes

* remove ctc.py from doctest

* get back ctc.py

* remove doctest for torchaudio

* adapt gpt recipe

* adapt gpt recipe

* small follow up fix on openrir

* remove doc test (for now)

* fix issue greedy search

* docstring

* pre-commit

* Fix issues unstable (#2216)

Thank you @Adel-Moumen! I did the tests again and everything works now. As for your points on the recipe tests, I agree. We can eventually do that in another PR.

* Fix missing file / import in huggingface_transformers (#2224)

* init/imports

* comment

* add partial import

* wav2vec -> wav2vec2

* fix ci

* Text based HF (#2214)

* add mbart

* Add tristage scheduler

* Add mbart beam search

* Add IWLST recipes

* Add new models' inteference interface

* Add info of new models

* Add nllb scores

* Add new models' info

* Add test info IWSLT recipe

* Add test info IWSLT recipe

* add docstrings for S2STransformerBeamSearcher

* Update IWSLT recipes

* Update IWSLT recipes

* fix doctest

* add requirements

* add protobuf

* fix doctest

* small fixes

* Add protobuf install

* Minor reform

* Remove protobuf

* Fix docstings

* Fix docstrings

* minor reform

* remove labse

* change authorship

* remove comments

* minor changes

* change authorship

* Fix recipe test

* add info

* Update README.md

* Update README.md

* change recipe structure

---------

Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>
Co-authored-by: Adel Moumen <88119391+Adel-Moumen@users.noreply.github.com>

* Neural LM Rescoring (#2187)

* baserescorerinterface

* add rescorers

* first attempt

* update code

* 1.57 wer

* update

* update code

* update code

* docstring example rnn

* updata loader

* docstring example

* tests

* docstring example

* update

* tmpdir

* change path

* update doc

* docstring

* docstring args

* doctest

* fix docstring example

* unnittest

* interface

* yamls update

* full_infernece tests

* model link

* readme

* yaml/inference tests

* update res

* fix wav2vec with wav2vec2

---------

Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>

* Add wrappers for Encodec and Vocos vocoders (#2231)

* Add wrappers for Encodec and Vocos from Huggingface

* Encodec: Add a comment

* Encodec/Vocos: Add examples, restructure, fix masks

* Vocos: Add a comment about the open pull request

* Encodec/Vocos: Add the ability to customize save_path, fix a log message

* Encodec/Vocos: Cosmetic changes

* Vocos: Cosmetic changes

* Encodec/Vocos: Remove the mandatory Vocos requirement

* Vocos: Remove vocos from __init__.py

* fix init

* Vocos: Add a check for vocos in conftest.py

* Vocos/Encodec: Update documentation, add bandwidth control

* Fix old path in conftest.py

* Cosmetic changes

* Encodec/Vocos: Add support for embedding vectors

* Encodec: Update example

* Encodec/Vocodec: Add automatic reshaping, minor cosmetic changes

---------

Co-authored-by: flexthink <flexthink@users.noreply.github.com>
Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>

* Semantically-Aligned Multimodal Utterance-level (SAMU) pre-training (#2223)

* add mbart

* Add tristage scheduler

* Add mbart beam search

* Add IWLST recipes

* Add new models' inteference interface

* Add info of new models

* Add nllb scores

* Add new models' info

* Add test info IWSLT recipe

* Add test info IWSLT recipe

* add docstrings for S2STransformerBeamSearcher

* Update IWSLT recipes

* Update IWSLT recipes

* fix doctest

* add requirements

* add protobuf

* fix doctest

* small fixes

* Add protobuf install

* Minor reform

* Remove protobuf

* Fix docstings

* Fix docstrings

* minor reform

* remove labse

* Add attention pooling

* Add labse

* Add info about SAMU

* add iwslt recipes with samu

* fix recipe test

* fix comments

* fix recipe test

* change recipe structure

* fix test recipe

* Add new recipes

* minor doctest change

* minor doctest change

* small changes

* add dropbox links

---------

Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>

* fix norm (#2237)

* Discrete SSL (#2233)

* clustering training recipies for LibriSpeech for different SSL model

* add Discrete Hubert Model

* load from HF, fix minor issues

* fix hyper-param value

* fix precommit

* fix flake8

* fix batch_size and n_clus values in hyperparams

* fix typos

* fix typo and some cleaning

* fix precommit

* fix device incompatibility and memroty issue

* use fit instead of partial fit

* add README file

* add test recipies

* remove unused fields from hparams

* fix precommmit-yamllint - extra whitespace

* add docstring for load_kmeans for Discrete_hubert.py

* add discrete wavlm, wav2vec

* avoid docstring testing for discrete_ssl models

* fix docstring failed issue

* add discrete_interface to conftest.py

* fix precommit

* Fixes for Encodec (#2240)

* Add wrappers for Encodec and Vocos from Huggingface

* Encodec: Add a comment

* Encodec/Vocos: Add examples, restructure, fix masks

* Vocos: Add a comment about the open pull request

* Encodec/Vocos: Add the ability to customize save_path, fix a log message

* Encodec/Vocos: Cosmetic changes

* Vocos: Cosmetic changes

* Encodec/Vocos: Remove the mandatory Vocos requirement

* Vocos: Remove vocos from __init__.py

* fix init

* Vocos: Add a check for vocos in conftest.py

* Vocos/Encodec: Update documentation, add bandwidth control

* Fix old path in conftest.py

* Cosmetic changes

* Encodec/Vocos: Add support for embedding vectors

* Encodec: Update example

* Encodec/Vocodec: Add automatic reshaping, minor cosmetic changes

* Encodec: Decoupled token extraction, fixed CPU/GPU issues

* Encodec: Add renormalization

---------

Co-authored-by: flexthink <flexthink@users.noreply.github.com>
Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>

* Refactoring of the 'fit_batch' function (#2010)

* add dataclass

* turn False

* remove valid_step

* update core.py

* update core.py

* update core.py

* precommit

* self.autocast + GradScaler enabled

* freeze opt

* naming

* update core.py

* comments

* example transducer conformer

* update core.py

* small changes

* naming + skip_grad_nans

* doc

* check

* support cpu training

* precision + doctrsting

* name

* change w2v

* restore ckpt

* remove file

* remove casting

* tests

* whisper + fix tests

* seq2seq ls

* update transducer / transformer

* remove on_optimizers_step_end + comments

* update check yaml

* remove default arg

* add precision in yamls

* add precision inside of the yamls

* ckpt and scaler

* run_opt outside brain + test

* several recipe updates

* improve w2v fit_batch fn

* add arg

* update name

* timit

* context manager

* on_fit_batch_start

* update CV

* should_step with noam

* add flag precision

* naming

* aishell

* aishell

* update recipes

* so many recipes 0.0

* update recipes

* last recipes

* zero_grad

* fix grad_accumulation_factor

* update recipes

* update auto_mix_prec flag

* remove opt flag test

* librispeech

* cv ssl

* audio mnist / realm

* voicebank

* fix rescuespeech

* fix lr annealing

* libritts

* multiwoz

* slurp nlu

* should_step

* update yamls

* update yaml

* update batch smpler tedlium

* remove fit batch

* precision flag

* update sampler

* add precision inside of the yamls

* run_opt outside brain + test

* fix auto_mix_prec flag

* docstring

* grad acc

* failing test

* update unittests

* update jarod's pr

* fix removed avg_checkpoint param

* update path

* fix some recipe tests

* update samu recipe

* fix hifigan/IWSLT

* tedlium

---------

Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>

* Refactor Augmentation (#2206)

* update

* update

* change folder

* remove unnecesary file

* update folder structure

* add noise, add rev

* augmenter refactor

* refactor augment + example in templace

* fix tests + linters

* address comments

* supporting variable-length augmentations in augmenter (e.g., speed change)

* lib refactor (splitting time and freq augmentations)

* fine tune freq drop

* refactor of specaugment (freq-domain) - part 1

* converted specaument (freq domain)

* refactor random shift

* implemented cutcat, swap, and random selection

* extended unittests + small fixes

* improvements and fixes in augment

* plugged feature augmentation + various fixes and improvements

* add sum_batch noise (similat to babble) + various fixes

* add drop bit resolution

* added coded augmentation

* added more unittests

* restore all augmentations

* making AddReveb more similar to AddNoise

* fix device mismatch + fix last batch management

* add workes to speed up AddNoise and AddRev

* improve comments in template yaml

* speed up template (sorting dev and test)

* extend augmenter by adding activation provability

* implemented enable augmentation flag (useful of hparam tuning) + other improvements

* plugged coded augment

* fixed coded augment

* remove old files

* fix integration test

* remove knowledge distill TIMIT reicpes. Too many yaml files to maintain

* convert TIMIT

* fix recipe

* converted templates using EnvCorr

* converted voxceleb

* converted GSC + fixes on voxceleb

* convrted UrbanSound8k

* converted voicebank

* converted other recipes

* converted CommonLanguage, VoxLingua, timers-and-such

* converted all recipes using envcorr

* CommonVoice

* REAL-M

* Aishell1Mix

* LibriMix

* converted all recipes!

* fix linters - part1

* fix linters - part2

* add a note in the template regarding augmentation

* fix docstring tests

* fix yamls

* remove coded tests from docstring

* revised coded tests

* fix identation in codec.py

* try to fix doc issue

* revise lib header in codec.oy

* fix doc

* fix doc attempt

* rename sections

* fix doc

* fix (most) recipe tests

* fix other recipe tests

* address comments

* fix yaml

* fix

* convert recipe

* fix recipes

* fix aug in rescoring recipes

* Delete tmpdir_vocoder directory

* Refactor Inference (files and folders) (#2252)

* refactor inference files and folders

* fix some tests

* fix some tests

* fix doctest

* import lib

* small fixes

* Fix beam search (#2253)

* fix starting pos prefix_length

* block path ctc + fix default value to the old one

* fix issue with score being -inf

* remoev print

* precommit

* Fix ctc beam search (#2263)

* fix logprobs / space_token / warnings

* fix space_token

* pre-commit

* space_token

* simplify parameters

* simplify yamls

* remove comma

* update beam search

* fix vocab/str (#2265)

* Fix blank index ctc (#2266)

* update blank_index

* whisper

* revert change

* mistake

* Cv unstable merge (#2254)

* add fr preproccesing to Common_voice_prepare.py

* add CV , CTC, new languages

* fix precommit and test

* add transducer recipie

* add transformer recipies

* update augmentation of CTC recipies

* update seq-to-seq recipies

* fix whisper HF interface bug. (return str insted of list)

* fix recipe tests

* add fr preproccesing to Common_voice_prepare.py

* add CV , CTC, new languages

* fix precommit and test

* add transducer recipie

* add transformer recipies

* update augmentation of CTC recipies

* update seq-to-seq recipies

* fix whisper HF interface bug. (return str insted of list)

* fix recipe tests

* modify beamsearch for CTC: ar.es.pt and zh-CN

* fix interface conflict

* fix transducer interface bug

---------

Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>

* Add warnings and fix numba (#2271)

* upperbound torch/tochaudio + remove opt dependancy

* add back automix/bf flags

* linters

* oops

* transformers back

* test requirements

* Fix Bug: CommonVoice Transformer Bug loading correct optimizer (#2278)

* fix trnsfrm bug to load correct opt:adam vs sgd

* add  data_root to the path of common_voice_prepare.py

* add epoch/_counter pretrainer to fr and it recepie

* revert releative path change

* fix opt bug without the need to add epoch_ckpt

* add log and delete launch file

* update the log message

* update WeightedSSLModel (#2272)

* update WeightedSSLModel

* requirements.txt

* fix pre-commit

* Sg/dac (#2246)

* introducing DAC

* lint errors

* black

* documenttion

* remove unused init file

* Fixing tests

* More doc strings

* More doc strings

* PR review

* PR review

* PR review

* Update dac.py

* Update dac.py

* Update dac.py

* make doctests smaller to avoid memory issues in CI

* even smaller tests

---------

Co-authored-by: Shubham Gupta <shubhamgupta@Shubhams-MacBook-Pro-2.local>
Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>

* add quantization recipies fro IEMCAP, CV, LibriSpeech and LJSpeech (#2255)

* add quantization recipies fro IEMCAP, CV, LibriSpeech and LJSpeech

* update discrete_ssl models

* add iemocap_prepare to main folder + add test

* ix test for iemocap

* fik typos

* fix test recepies,  minor dormat editting

* fix typo in coomonvoice.csv

* fix typo in yaml file

* fix doctests (those that we do not run in the CI)

---------

Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>

* change emdedding type from long to float to vaoid getting al zeros embedding (#2292)

* Update CVSS (#2285)

* Update CVSS

* Update train_fr-en.yaml

* Update train_fr-en.yaml

* Update HF interface (#2293)

* RNN Tranducer Numba Loss: Add FP16 and BF16 support (code from Samsung AI Cambridge) (#2296)

* Make lobes use fp32 when AMP is active (#2295)

* Added utils.autocast with a fwd_default_precision function

* Decorate all lobes to require float32 precision in AMP

* Fix trailing space in docstring

* Less confusing doc for fwd_default_precision

* Be explicit that only fp inputs are affected by fwd_default_precision

* Typo in docstring

* Remove dtype annotation that is broken for some reason

* Precommit checks will be the end of me

* Fix tests

* Add docstring to precision wrapper function

* Fix style check again..

* adding support for fp16 transducer loss numba

* adding support for fp16 transducer loss numba

* fix fp16 transducer recipe

* add note on half precision

---------

Co-authored-by: asu <sdelang@sdelang.fr>
Co-authored-by: Titouan Parcollet/Embedded AI /SRUK/Engineer/Samsung Electronics <t.parcollet@sruk-ccn4.eu.corp.samsungelectronics.net>
Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>

* Fix recipe tests for TransformerASR (#2282)

* fix position embedding (#2283)

* fix position embedding

* use speechbrain internal postional encoding and generate mask from sequence lengths

* call mask function from core for tacotron

* minor fix

* fix device

* reduce training epochs

* update links

---------

Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>

* Gradscaler flags (#2281)

* add flags for gradscaler

* add check_loss_isfinite

* update dict

* typo

* remove default

* better message

* fix pre-commit

* remove checks

* remove new arguments

---------

Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>

* add llama2 recipies (#2299)

* add llama2 recipies

* fix symbolic links

* fix  bug

* remove unneccary input in docstring

* fix typo

* cleaning llama2 recepies

* update readme

* update interface and add licence to readme

* fic doc string

* fix precommit

* fix extra-dependency

* remove  commented lines

* inter epoch checkpoint

* minor fixes

* add extra req info in llama.py

* fix linters

---------

Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>

* small fixes

* make all recipes cpu-compliant + make recipe tests passing on both cpu and gpu

* fix some broken links

* remove link to private HF repo

* remove link to private HF repo

* fix libritts recipe test

* fix ljspeech recipe test

* Streamable Conformer-Transducer ASR model for LibriSpeech (#2140)

* Introduce DCT+DCConv logic

* DDP fix?

* Batch of changes and things brought back

* Streaming fixes (successfully trains)

* WIP streaming code

* WIP functional streaming code

* Fix left context

* Fix formatting

* Cleanups and docs in streaming utils

* Better comment hparams, change seed back to orig, improve naming

* uncomment averaging stuff; it was some ipython issue

* Remove pin_memory as it was not beneficial

* More cleanups, comments on context stuff

* More comments and TODOs

* encode_streaming docstring

* Dirty TransducerBeamSearcher change for streaming GS

* Fix precommit

* Fix encoders that do not support chunk_size

* Pre-commit again

* Make chunk_size type consistent

* Fix formatting of doctest in split_wav_lens

* Remove outdated TODO

* Add hasattr streaming to retain model backcompat

* Cleanup doc and naming for transducer_greedy_decode

* Cite paper for chunked attention

* Remove lost comment

* Update comment in self-attention

* Don't apply masked fill fix in the non-bool mask case

* Added TODO README update

* Revert change to custom_tgt_module; patching model instead

* Remove added entry in README

* Fix streaming conformer conv mismatch

* More conformer conv adjustments

* Adjust context size

* Remove outdated comment

* Fixed causal conformer decoder

* Fix linting

* Gate `custom_tgt_module` creation behind the presence of decoder layers

* Re-enable checkpoint averaging

* Change averaged ckpt count to 10

* Add new model results to README

* WIP refactor: Introduce DCTConfig dataclass

* Improved notice in README

* Formatting and linting fixes

* Attempt at fixing circular import?

* utils can't depend on core it seems; move dct

* Whoops, missed file

* Add DCT test, fix issues

* Remove now obsolete yaml variables for streaming

* Formatting

* Add dummy dct_config parameter to keep unsupported encoders working

* Linting fix

* Fix typo

* Add note on runtime autocast accuracy

* Fix very bad typo from refactor in YAML

* Fix hasattr streaming check

* Remove legacy comment

* Fix left context size calculation in new mask code

* Fix causal models in TransformerASR

* Remove comment on high-level inference code

* YAML formatting + commenting dynchunktrain stuff

* Remove outdated comment about DCConv left contexts

* Remove commented out debug prints from TransformerASR

* Move DCT into utils again

* Rename all(?) mentions of DCT to explicit dynamic chunk training

* Clarify padding logic

* Remove now-useless _do_conv, fix horrible formatting

* Slightly fix formatting further

* Add docstrings to forward_streaming methods

* Add a reference on Dynamic Chunk Training

* Rework conformer docstring docs

* Update conformer author list, fix doc formatting for authors

* Fix trailing whitespace in conformer

* Improved comments in Conformer.forward

* Added random dynchunktrain sampler example

* More explicit names for mask functions in TransformerASR

* Added docstring example on encode_streaming

* Pre-commit fix

* Fix typo in conformer

* Initial streaming integration test

* Precommit fix

* Fix indent in YAML

* More consistent spelling in streaming integration test

* Update CommonVoice.csv

* Add KenLM n-gram training recepie (#2304)

* add kenlm training

* fix precommit

* update readmefile with new result

* fix pre-commit

* fix typo

* fix commit reviews

* fix bug in testing

* add docstring and fix indentation

* fix bug in ASR interface

* change encoderasr interface to support ctc beam

* add suppourt fro kenlm in enoderasr interface

* fix typo

* little changes in REAMDE files to improve clarity)

* use binaries sources in bashrc

* fix trailing-whitespace

---------

Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>

* Create Performance file (automatically) (#2314)

* add performance readme builder

* update recipe csv files

* update README files

* add not in prerelease test

* added performance.md

* fix linters

* update info in README

* Llama2 interface bug (#2318)

* fix llama2 interface bug

* fix minor bug

* update multiwox.csv with correct db and HF link

* New README file (#2315)

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Optimize masked Dynamic Chunk Convolution (#2308)

* Reorganized some conformer convolution module to be faster

* Completely get rid of the list of slices in the conformer conv module

* Fix linter check

* Remove unused variable

* More unused variables..

* Remove unused import

* Add conformer streaming code path test

* Fix test formatting

* small fixes in tests

* Update RNNLM.yaml

* BayesSpeech (#2326)

* Create train_bayesspeech.py

* Create bayesspeech.yaml

* Update README.md

* Update LibriSpeech.csv

* add extra-req

---------

Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>

* adding new controllable exp scheduler

* adding new controllable exp scheduler

* update performance file

* Update PERFORMANCE.md

* Update README.md

---------

Co-authored-by: mhn226 <mhn.22692@gmail.com>
Co-authored-by: Adel Moumen <88119391+Adel-Moumen@users.noreply.github.com>
Co-authored-by: Adel Moumen <adelmoumen.pro@gmail.com>
Co-authored-by: Ha Nguyen <43038599+mhn226@users.noreply.github.com>
Co-authored-by: flexthink <1496671+flexthink@users.noreply.github.com>
Co-authored-by: flexthink <flexthink@users.noreply.github.com>
Co-authored-by: Pooneh Mousavi <moosavi.pooneh@gmail.com>
Co-authored-by: shubham-gupta-30 <127571426+shubham-gupta-30@users.noreply.github.com>
Co-authored-by: Shubham Gupta <shubhamgupta@Shubhams-MacBook-Pro-2.local>
Co-authored-by: Parcollet Titouan <parcollet.titouan@gmail.com>
Co-authored-by: asu <sdelang@sdelang.fr>
Co-authored-by: Titouan Parcollet/Embedded AI /SRUK/Engineer/Samsung Electronics <t.parcollet@sruk-ccn4.eu.corp.samsungelectronics.net>
Co-authored-by: Luca Della Libera <34525085+lucadellalib@users.noreply.github.com>
Co-authored-by: Yingzhi WANG <41187612+BenoitWang@users.noreply.github.com>
Co-authored-by: BenoitWang <wangyingzhi666@gmail.com>
@asumagic asumagic mentioned this pull request Jan 31, 2024

Successfully merging this pull request may close these issues.

[Feature Request]: Streaming ASR (WIP)