Streamable Conformer-Transducer ASR model for LibriSpeech #2140
Conversation
I wrote a relatively extensive description of the whole streaming approach. This should hopefully clear up a lot of how and why the approach works, hopefully in a more understandable/intuitive way than some of the literature. Additionally, I got a simple greedy search prototype working yesterday and could successfully encode several minutes of spoken audio correctly.
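To illustrate the core idea behind streaming encoding (this is a toy sketch with a plain GRU, not the actual Conformer code from the PR): a causal, stateful encoder can consume audio chunk by chunk, carrying its state forward, and produce the same outputs as one full-utterance pass.

```python
import torch

# Toy illustration: chunk-by-chunk inference with carried state matches
# the offline full-sequence pass for a causal recurrent encoder.
torch.manual_seed(0)
gru = torch.nn.GRU(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(1, 20, 8)           # (batch, time, features)
full_out, _ = gru(x)                # offline: whole utterance at once

h = None                            # streaming state carried across chunks
outs = []
for chunk in x.split(5, dim=1):     # feed 5-frame chunks
    out, h = gru(chunk, h)
    outs.append(out)
stream_out = torch.cat(outs, dim=1)

# Tiny tolerance: batched matmul shapes differ, values are the same maths.
assert torch.allclose(full_out, stream_out, atol=1e-5)
```

A chunked-attention Conformer needs extra machinery (chunked masks, per-layer left-context caches), but the state-carrying principle is the same.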
The WIP code to actually do streaming inference lives there (though this PR includes a bunch of model methods to help with it), alongside some demo and test code.
Hey @asumagic please ping me when it's ready to review again :)
force-pushed the branch from 6675d86 to 5780653
Trying to fix fp16 training NaN loss issues so that I can retrain a model after fixing a mismatch in the new convolution code, and rebasing to be up to date with the core changes.
force-pushed the branch from 5780653 to aab7365
Status update:
Not sure when the training will be done, especially as narval will go under maintenance tomorrow, but at least I finally got the ball rolling there.
Did you try removing the loss computation from the mixed-precision training, plus the last softmax after the joiner network? One thing to check as well: is the RNNT loss a sum or an average? If it's a sum, dynamic batching might be slightly less stable.
I think I did try forcing the loss computation to fp32, to no avail. I didn't try touching the joiner network/softmax though. For now I'm tempted to keep training with bf16, as I started one (… once Compute Canada is out of maintenance), and to investigate the fp16 problem after that so that the PR is ready ASAP. The RNN-T loss uses a sum here.
The reason could be the sum, then. Could you please try to initialise the grad scaler with a much smaller value? Like 10 instead of 65536? Or even 2.0? The rationale behind this is that torch initialises this value to a crazy high number that is OK when you have averaged losses -- see here. You can certainly also try to play with growth_factor=2.0 and backoff_factor=0.5. Another good thing to do, to see exactly what is happening and adapt accordingly, would be to print the value of the scaling factor and see how it behaves; it may either vanish or explode due to the fact that we use a sum... @Adel-Moumen might be interested by this as well. But I am pretty sure that it comes from here.
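To make the scaler dynamics concrete, here is a minimal pure-Python simulation of the update rule used by torch's dynamic loss scaler (its documented defaults are init_scale=65536, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000). With a summed loss, gradients are much larger, so overflows keep firing at a high scale and the scale gets backed off repeatedly:

```python
# Minimal simulation of GradScaler's scale update rule (not torch code).
def run_scaler(overflow_steps, init_scale=65536.0, growth_factor=2.0,
               backoff_factor=0.5, growth_interval=2000):
    scale, good_steps, history = init_scale, 0, []
    for step_overflows in overflow_steps:
        if step_overflows:              # inf/nan found: shrink scale, skip step
            scale *= backoff_factor
            good_steps = 0
        else:                           # clean step: grow after a long streak
            good_steps += 1
            if good_steps == growth_interval:
                scale *= growth_factor
                good_steps = 0
        history.append(scale)
    return history

# If nearly every step overflows (summed loss at a huge scale),
# the scale halves every step and vanishes: 32768, 16384, ...
history = run_scaler([True] * 20)
```

Printing `scaler.state_dict()` every few steps in the actual recipe is the direct way to observe this behaviour, as suggested above.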
force-pushed the branch from 0d94f9d to 096fb1b
cc @TParcollet @Adel-Moumen: never mind, a mean is actually used in the transducer loss itself; I was looking at the wrong thing (the joiner network). I've tried reducing the initial gradscaler scale very significantly, but it did not help. Printing its state_dict to monitor what's going on shows that the scale is vanishing, so it seems to be endlessly trying to reduce the scale. I'll be retrying to force different parts to float32 to diagnose the issue, since it's possible I had done it wrong the last time I tried. I think disabling autocast with … Looking into this now.
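One way to force a sub-computation to fp32 while the rest of the forward runs under autocast is an enabled=False autocast region. A sketch (the `loss_fp32` helper is hypothetical, and the example uses CPU bf16 autocast just to be runnable anywhere; the recipe's situation is CUDA fp16):

```python
import torch

def loss_fp32(logits, targets):
    # Hypothetical helper: compute the loss outside autocast, in fp32,
    # even though the surrounding forward ran in reduced precision.
    with torch.autocast(device_type="cpu", enabled=False):
        return torch.nn.functional.cross_entropy(logits.float(), targets)

lin = torch.nn.Linear(8, 4)
x, y = torch.randn(2, 8), torch.tensor([0, 3])

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = lin(x)            # runs in bf16 under autocast
    loss = loss_fp32(logits, y)

assert logits.dtype == torch.bfloat16
assert loss.dtype == torch.float32
```

Wrapping one suspect region at a time like this is a cheap way to bisect which computation is responsible for the vanishing scale.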
@asumagic right ... what about the rest of the recipe, is it ready? If so, I'll review one last time and try it. |
The recipe itself should™ be ready, other than fp16 still not being fixed. The new training (with bf16) should end tomorrow or on Monday. Once done, I will test accuracy and everything once again. Notes and reminders:
The high-level interfaces shouldn't take too long to implement (comparatively) but I kind of wanted to make sure everything works before trying to design something that may also end up being reused for other streaming models (or for adapting non-streaming models to streaming contexts like whispercpp etc. do IIRC). |
@asumagic as soon as you are confident in the results, ping me and I'll review. For the three notes: 1. OK. 2. Who decided to postpone this issue? It sounds like something that would make this recipe unusable for non-expert people; 3. OK, this is totally fine, and I actually think that the interface should be another PR. Could you link to an issue showing 2? I will fix this.
Wouldn't --find_unused_parameters solve the issue without commenting out the code? "The problem is that simply adding this check will not cut it, as it will break every existing trained model, since trained models will include keys that will not exist (as the custom_tgt_module will not have been initialized)." -- right. This is a problem. Any idea from our checkpoint experts @Gastron and @pplantinga?
No, all this does is print which parameters were unused -- you still get an error. If you don't specify it, it does not tell you which parameter was the problem. (I imagine the default just checks a counter, whereas this option also tracks and sends the list of parameter names over DDP, which is probably why it's not the default.)
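For debugging on a single process, there is a cheaper way to find the offenders: after one backward pass, any parameter whose grad is still None took no part in the loss, and those are exactly the parameters DDP's reducer will wait on. A sketch (the module names are made up for the example):

```python
import torch

# Sketch: detect parameters that receive no gradient. These are the ones
# DDP would hang/error on; find_unused_parameters only adds their names
# to the report, as discussed above.
model = torch.nn.ModuleDict({
    "used":   torch.nn.Linear(4, 4),
    "unused": torch.nn.Linear(4, 4),   # never called in the forward below
})
loss = model["used"](torch.randn(2, 4)).sum()
loss.backward()

unused = [name for name, p in model.named_parameters() if p.grad is None]
# -> ['unused.weight', 'unused.bias']
```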
@asumagic also, it looks like when streaming = True, training slows down seriously -- on 8 GPUs it goes from 10 min per epoch to 15-20. Is this expected? I guess it's due to the per-chunk processing with all the for loops? Should we try to jit this? GPU utilisation also seems to be lower. Once we have merged this, we might want to have a look at optimising the code, if this is of interest to you.
I'll try to profile it. It's possible that some of the list comprehensions inside of the convolution module are what harm performance, especially with longer utterances and smaller chunk sizes.
On a side note, I did try poking with …
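One candidate optimisation for the per-chunk Python loops mentioned above is to replace list-comprehension slicing with a single Tensor.unfold call, which batches all (non-overlapping) chunk views without a Python-level loop. A toy sketch of the equivalence, not the actual convolution-module code:

```python
import torch

x = torch.randn(2, 12, 8)   # (batch, time, features); 12 frames = 3 chunks of 4

# Per-chunk Python loop (the pattern suspected of hurting performance):
looped = torch.stack([x[:, i:i + 4] for i in range(0, 12, 4)], dim=1)

# Same values with one unfold call; unfold puts the window dim last,
# so transpose it back to (batch, n_chunks, chunk_len, features).
vectorised = x.unfold(1, 4, 4).transpose(-1, -2)

assert torch.equal(looped, vectorised)
```

Whether this actually wins in the recipe depends on profiling, as noted; it mainly removes kernel-launch and Python overhead for small chunk sizes.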
Yet again updated the notebook due to the renaming refactor: |
@TParcollet Ok, I think it should be all done now. There are some conversations I didn't resolve which were mostly answers to some remarks/questions. With the very latest code, I have checked that:
LGTM. Awesome work @asumagic !
Contribution in a nutshell
This PR implements a streaming ASR model for SpeechBrain.
Resolves: #1970
This PR will (probably) not be implementing the high-level interfaces (think
interfaces.py
) to perform inference, as my work ends at the end of August, but I'm hoping to get everything in a functional state with a streaming inference proof-of-concept done. The high-level interfaces will be implemented separately.
Scope
Additionally, still TODO for me:
custom_tgt_module
, again -- probably needs undoing and patching the model, same as the other conformer-transducer
Out-of-scope for this PR (will be done separately):
Notes for reviewing (optional)
See thread, there is an accuracy issue for streaming decoding ATM. Solved.
Explanations and technical details
This goes into quite a bit of detail for the big picture. It could probably eventually be turned into general documentation on streaming once SB's support matures.
https://gist.github.com/asumagic/aaf875929d1c03a96fd08a897599aaa8
Modifying the existing recipe?
Streaming has required very few hparams changes so far. Considering that the only current difference between a non-streaming model and a streaming model is one hyperparameter, it made more sense to keep it as one file for now.
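The per-batch configuration used by dynamic chunk training (most batches trained with chunked attention, a fraction with near-full context, as the recipe description notes) can be sketched as a random sampler. All names and values here are hypothetical, for illustration only:

```python
import random

def sample_dynchunktrain_config(full_context_prob=0.4,
                                chunk_sizes=(8, 16, 24, 32)):
    """Hypothetical per-batch sampler for dynamic chunk training.

    Returns None for a full-context (offline-style) batch, otherwise a
    (chunk_size, left_context_chunks) pair used to build chunked
    attention masks for that batch.
    """
    if random.random() < full_context_prob:
        return None                          # train this batch full-context
    chunk_size = random.choice(chunk_sizes)
    left_context = random.choice((1, 2, 4))  # left context, in chunks
    return chunk_size, left_context

random.seed(0)
configs = [sample_dynchunktrain_config() for _ in range(1000)]
full_ratio = sum(c is None for c in configs) / len(configs)  # close to 0.4
```

Varying the chunk size per batch is what lets a single trained model serve many latency budgets at inference time.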
Additionally, as per the above technical doc, the model is also trained with a context extremely close to non-streaming for ~40% of the batches. The result is that the streaming model can operate in a non-streaming fashion, and for LibriSpeech, the non-streaming performance is very close between a
streaming: False
and astreaming: True
model, at least in my testing.
Self-attention logic changes
The
attention.py
change, which moves the scaling before the matmul in order to reduce the likelihood of float16 overflow, should be reviewed thoroughly as it is extensively used. Namely, even though it looks mathematically correct, the rel_shift
trick scares me a bit, and someone with a better understanding of the subject than me should double-check it. Performing the softmax in fp32 space did not seem to affect performance for me, and it seems more correct.
Ultimately I trained the model using bf16. I am not sure if training in fp16 is fully stable yet, but I know that these changes reduced the amount of training failures significantly in fp16 mode.
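The idea behind both changes can be sketched in isolation (this is a simplified sketch, not the actual attention.py code, which also handles the rel_shift path): scaling q before the matmul keeps the intermediate dot products smaller, and the softmax runs in fp32 before casting back.

```python
import math
import torch

def attention_weights(q, k):
    # Scale q BEFORE the matmul: intermediate products stay smaller,
    # reducing the chance of float16 overflow.
    q = q * (1.0 / math.sqrt(q.size(-1)))
    scores = q @ k.transpose(-2, -1)
    # Softmax in fp32 space for stability, cast back to compute dtype.
    return torch.softmax(scores.float(), dim=-1).to(q.dtype)

q = torch.randn(2, 4, 8)
k = torch.randn(2, 4, 8)
w = attention_weights(q, k)

# Mathematically identical to the usual scale-after-matmul form:
ref = torch.softmax((q @ k.transpose(-2, -1)) / math.sqrt(8), dim=-1)
assert torch.allclose(w, ref, atol=1e-6)
```

In fp32 both orderings agree to rounding error; the benefit only shows up in fp16, where the unscaled q·k products can exceed the fp16 range.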
Pre-review
- (if applicable) add an extra_requirements.txt file
- (if applicable) add database preparation scripts & use symlinks for nested folders (to the level of task READMEs)
- run pre-commit run -a to check linters
- run pytest tests/consistency
- run tests/.run-doctests.sh & tests/.run-unittests.sh