Transformer-based Text Normalization Models (NVIDIA#2415)
* Add notebook with recommendations for 8 kHz speech (NVIDIA#2326)

* Added a notebook with best practices for telephony speech

* Added dataset details

* Added training recommendations

* Emptied out cells with results

* Added tutorial to docs

Signed-off-by: jbalam <jbalam@nvidia.com>

* Addressed review comments

Signed-off-by: jbalam <jbalam@nvidia.com>

* Added a line to note original sampling rate of an4

Signed-off-by: jbalam <jbalam@nvidia.com>

* Made changes suggested in review

Signed-off-by: jbalam <jbalam@nvidia.com>
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add FastEmit support for RNNT Losses (NVIDIA#2374)

* Temp commit

Signed-off-by: smajumdar <titu1994@gmail.com>

* Initial code for fastemit forward pass

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct return reg value

Signed-off-by: smajumdar <titu1994@gmail.com>

* Initial cpu impl

Signed-off-by: smajumdar <titu1994@gmail.com>

* Try gpu impl

Signed-off-by: smajumdar <titu1994@gmail.com>

* Try gpu impl

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct few impl

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update fastemit scaling

Signed-off-by: smajumdar <titu1994@gmail.com>

* Cleanup fastemit

Signed-off-by: smajumdar <titu1994@gmail.com>

* Finalize FastEmit regularization PR

Signed-off-by: smajumdar <titu1994@gmail.com>

* Refactor code to support fastemit regularization

Signed-off-by: smajumdar <titu1994@gmail.com>

Co-authored-by: Samuel Kriman <samuelkriman@gmail.com>
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Implement inference functions of TN models

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Minor Fix

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* fix bugs in hifigan code (NVIDIA#2392)

Signed-off-by: Oktai Tatanov <oktai.tatanov@gmail.com>
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Update setup.py (NVIDIA#2394)

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* update checkpointing (NVIDIA#2396)

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* byt5 unicode implementation (NVIDIA#2365)

* Audio Norm (NVIDIA#2285)

* add jenkins test, refactoring

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* update test

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix new test

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* add serial to the default normalizer, add tests

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* manifest test added

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* expose more params, new test cases

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix jenkins, serial clean, exclude range from cardinal

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins dollar sign format

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins dollar sign format

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* addressed review comments

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix decimal in measure

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* move serial in cardinal

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* clean up

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* update for SH zero -> oh

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* change n_tagger default

Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* bumping version to 1.0.1

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Add check for numba regardless of device

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* upper bound for webdataset

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Correct Dockerfile

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* update readmes

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* update README (NVIDIA#2332)

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* ddp translate GPU allocation fix (NVIDIA#2312)

* fixed branch in IR tutorial

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

* ddp translate GPU allocation fix

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

* map_location instead of set_device

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Shallow fusion (NVIDIA#2315)

* fixed branch in IR tutorial

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

* shallow fusion init commit

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

* debug info removed

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* [BUGFIX] Add upper bound to hydra for 1.0.x (NVIDIA#2337)

* upper bound hydra

Signed-off-by: ericharper <complex451@gmail.com>

* upper bound hydra

Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* update version number

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* update package version

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* sparrowhawk tests + punctuation post processing for pynini TN (NVIDIA#2320)

* add jenkins test, refactoring

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* update test

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix new test

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* add serial to the default normalizer, add tests

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* manifest test added

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* expose more params, new test cases

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix jenkins, serial clean, exclude range from cardinal

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins dollar sign format

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins dollar sign format

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* addressed review comments

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix decimal in measure

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* move serial in cardinal

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* sh tests init

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* sparrowhawk container tests support added

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* add post process to normalize.py, update tests

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* remove duplication

Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update notebooks to 1.0.2 release (NVIDIA#2338)

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update ranges for omegaconf and hydra (NVIDIA#2336)

* Update ranges

Signed-off-by: smajumdar <titu1994@gmail.com>

* Updates for Hydra and OmegaConf updates

Signed-off-by: smajumdar <titu1994@gmail.com>

* Style fixes

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct tests and revert patch for model utils

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct docstring

Signed-off-by: smajumdar <titu1994@gmail.com>

* Revert unnecessary change

Signed-off-by: smajumdar <titu1994@gmail.com>

* Revert unnecessary change

Signed-off-by: smajumdar <titu1994@gmail.com>

* Guard scheduler for None

Signed-off-by: smajumdar <titu1994@gmail.com>

* default to 0.0 if bpe_dropout is None

Signed-off-by: ericharper <complex451@gmail.com>

* Correctly log class that was restored

Signed-off-by: smajumdar <titu1994@gmail.com>

* Root patch *bpe_dropout

Signed-off-by: smajumdar <titu1994@gmail.com>

Co-authored-by: ericharper <complex451@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update FastPitch Export (NVIDIA#2355)

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* byt5 unicode implementation, first cut

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* add bytelevel tokenizer

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* update out_dir to not collide (NVIDIA#2358)

Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update container version to 21.05 (NVIDIA#2309)

* Update container version

Signed-off-by: smajumdar <titu1994@gmail.com>

* Temporarily change export format of waveglow

Signed-off-by: smajumdar <titu1994@gmail.com>

* Add conda update for numba

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update numba compat via global flag for strictness level `--relax_numba_compat`, remove pytorchlightning.metrics, refactor out numba utils to core, update tests

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct order of numba minimum version, remove wrong flag from test

Signed-off-by: smajumdar <titu1994@gmail.com>

* Double test of cuda numba

Signed-off-by: smajumdar <titu1994@gmail.com>

* Double test of cuda numba

Signed-off-by: smajumdar <titu1994@gmail.com>

* Enable RNNT tests

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Text Normalization Update (NVIDIA#2356)

* upper cased date support

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* update whitelist, change roman weights

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* docstrings, space fix, init file

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* lgtm

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fraction with measure class

Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* address comment

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Add ASR CTC tutorial on fine-tuning on another language (NVIDIA#2346)

* Add ASR CTC Language finetuning notebook

Signed-off-by: smajumdar <titu1994@gmail.com>

* Add to documentation

Signed-off-by: smajumdar <titu1994@gmail.com>

* Improve documentation

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct name of the dataset

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Correct colab link to notebook (NVIDIA#2366)

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* sgdqa update data directories for testing (NVIDIA#2323)

* sgdqa update data directories for testing

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix syntax

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* check if data dir exists

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* adding pretrained model

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Added documentation for export() (NVIDIA#2330)

* Added export document

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* Addressed review comments

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update Citrinet model card info (NVIDIA#2369)

* Update model card info

Signed-off-by: smajumdar <titu1994@gmail.com>

* Cleanup Docs

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* [NMT] Model Parallel Megatron Encoders (NVIDIA#2238)

* add megatron encoder

Signed-off-by: ericharper <complex451@gmail.com>

* added megatron to get_nmt_tokenizer

Signed-off-by: ericharper <complex451@gmail.com>

* add vocab_size and hidden_size to megatron bert

Signed-off-by: ericharper <complex451@gmail.com>

* add megatron encoder module

Signed-off-by: ericharper <complex451@gmail.com>

* fixed horrible typo

Signed-off-by: ericharper <complex451@gmail.com>

* fix typo and add default

Signed-off-by: ericharper <complex451@gmail.com>

* updating nlp overrides for mp nmt

Signed-off-by: ericharper <complex451@gmail.com>

* move some logic back to nlpmodel from overrides

Signed-off-by: ericharper <complex451@gmail.com>

* add checkpoint_file property

Signed-off-by: ericharper <complex451@gmail.com>

* fix property

Signed-off-by: ericharper <complex451@gmail.com>

* num_tokentypes=0

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* find_unused_parameters=True

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* style

Signed-off-by: ericharper <complex451@gmail.com>

* get instead of pop

Signed-off-by: ericharper <complex451@gmail.com>

* remove token type ids from megatron input example

Signed-off-by: ericharper <complex451@gmail.com>

* pop vocab_size

Signed-off-by: ericharper <complex451@gmail.com>

* fix checkpointing for model parallel

Signed-off-by: ericharper <complex451@gmail.com>

* fix bug in non model parallel

Signed-off-by: ericharper <complex451@gmail.com>

* convert cfg.trainer to dict

Signed-off-by: ericharper <complex451@gmail.com>

* make num_tokentypes configurable for nmt

Signed-off-by: ericharper <complex451@gmail.com>

* update checkpoint_file when using named megatron model in nemo

Signed-off-by: ericharper <complex451@gmail.com>

* make vocab_file configurable

Signed-off-by: ericharper <complex451@gmail.com>

* dataclass can't have mutable default

Signed-off-by: ericharper <complex451@gmail.com>

* style

Signed-off-by: ericharper <complex451@gmail.com>

* unused imports

Signed-off-by: ericharper <complex451@gmail.com>

* revert input example

Signed-off-by: ericharper <complex451@gmail.com>

* check that checkpoint version is not None

Signed-off-by: ericharper <complex451@gmail.com>

* add mp jenkins test

Signed-off-by: ericharper <complex451@gmail.com>

* update docstring

Signed-off-by: ericharper <complex451@gmail.com>

* add docs for pretrained encoders with nemo nmt

Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Add notebook with recommendations for 8 kHz speech (NVIDIA#2326)

* Added a notebook with best practices for telephony speech

* Added dataset details

* Added training recommendations

* Emptied out cells with results

* Added tutorial to docs

Signed-off-by: jbalam <jbalam@nvidia.com>

* Addressed review comments

Signed-off-by: jbalam <jbalam@nvidia.com>

* Added a line to note original sampling rate of an4

Signed-off-by: jbalam <jbalam@nvidia.com>

* Made changes suggested in review

Signed-off-by: jbalam <jbalam@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Add FastEmit support for RNNT Losses (NVIDIA#2374)

* Temp commit

Signed-off-by: smajumdar <titu1994@gmail.com>

* Initial code for fastemit forward pass

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct return reg value

Signed-off-by: smajumdar <titu1994@gmail.com>

* Initial cpu impl

Signed-off-by: smajumdar <titu1994@gmail.com>

* Try gpu impl

Signed-off-by: smajumdar <titu1994@gmail.com>

* Try gpu impl

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct few impl

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update fastemit scaling

Signed-off-by: smajumdar <titu1994@gmail.com>

* Cleanup fastemit

Signed-off-by: smajumdar <titu1994@gmail.com>

* Finalize FastEmit regularization PR

Signed-off-by: smajumdar <titu1994@gmail.com>

* Refactor code to support fastemit regularization

Signed-off-by: smajumdar <titu1994@gmail.com>

Co-authored-by: Samuel Kriman <samuelkriman@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* byt5 unicode implementation, first cut

Signed-off-by: Mike Chrzanowski <mchrzanowski@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* add bytelevel tokenizer

Signed-off-by: Mike Chrzanowski <mchrzanowski@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* update styling

Signed-off-by: Mike Chrzanowski <mchrzanowski@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* avoid circular import

Signed-off-by: Mike Chrzanowski <mchrzanowski@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* fix bugs in hifigan code (NVIDIA#2392)

Signed-off-by: Oktai Tatanov <oktai.tatanov@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update setup.py (NVIDIA#2394)

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update bytelevel_tokenizer.py

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* Update bytelevel_tokenizer.py

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* typo

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* missed one

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* bug fixes

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* style fix

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* bytelevelprocessor is now generic.

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* style fix

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* update checkpointing (NVIDIA#2396)

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* style

Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* whoops, didn't merge jenkinsfile the right way

* add newline

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* undo changes to enja processor

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* processor selection decision fix

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

* newline fix

Signed-off-by: mchrzanowski <mchrzanowski@nvidia.com>

Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com>
Co-authored-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Eric Harper <complex451@gmail.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Co-authored-by: mchrzanowski <mchrzanowski@nvidia.com>
Co-authored-by: Yang Zhang <yzhang123@users.noreply.github.com>
Co-authored-by: Boris Fomitchev <borisfom@users.noreply.github.com>
Co-authored-by: Jagadeesh Balam <4916480+jbalam-nv@users.noreply.github.com>
Co-authored-by: Samuel Kriman <samuelkriman@gmail.com>
Co-authored-by: Oktai Tatanov <oktai.tatanov@gmail.com>
Co-authored-by: root <root@dgx0026.nsv.rno1.nvmetal.net>
Co-authored-by: root <root@dgx0079.nsv.rno1.nvmetal.net>
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Minor Fix

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Minor Fixes

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add TextNormalizationTestDataset and testing/evaluation code

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add TextNormalizationTaggerDataset and training code for tagger

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Restore from local nemo ckpts

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add TextNormalizationDecoderDataset

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add interactive mode for neural_text_normalization_test.py

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add options to do training or not for tagger/decoder

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Renamed

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Implemented setup dataloader for decoder

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Implemented training and validation for decoder

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Data augmentation for decoder training

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Config change

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* add blossom-ci.yml (NVIDIA#2401)

Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Merge r1.1 bugfixes into main (NVIDIA#2407)

* Update notebook branch and Jenkinsfile for 1.1.0 testing (NVIDIA#2378)

* update branch

Signed-off-by: ericharper <complex451@gmail.com>

* update jenkinsfile

Signed-off-by: ericharper <complex451@gmail.com>

* [BUGFIX] NMT Multi-node was incorrectly computing num_replicas (NVIDIA#2380)

* fix property when not using model parallel

Signed-off-by: ericharper <complex451@gmail.com>

* fix property when not using model parallel

Signed-off-by: ericharper <complex451@gmail.com>

* add debug statement

Signed-off-by: ericharper <complex451@gmail.com>

* add debug statement

Signed-off-by: ericharper <complex451@gmail.com>

* instantiate with NLPDDPPlugin with num_nodes from trainer config

Signed-off-by: ericharper <complex451@gmail.com>

* Update ASR scripts for tokenizer building and tarred dataset building (NVIDIA#2381)

* Update ASR scripts for tokenizer building and tarred dataset building

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update container

Signed-off-by: smajumdar <titu1994@gmail.com>

* Add STT Zh Citrinet 1024 Gamma 0.25 model

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update notebook (NVIDIA#2391)

Signed-off-by: smajumdar <titu1994@gmail.com>

* ASR Notebooks fix for 1.1.0 (NVIDIA#2395)

* nb fix for spring clean

Signed-off-by: fayejf <fayejf07@gmail.com>

* remove outdated instruction

Signed-off-by: fayejf <fayejf07@gmail.com>

* Mean normalization (NVIDIA#2397)

* norm embeddings

Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>

* move to utils

Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>

* Bugfix adaptive spec augment time masking (NVIDIA#2398)

* bugfix adaptive spec augment

Signed-off-by: smajumdar <titu1994@gmail.com>

* Revert freq mask guard

Signed-off-by: smajumdar <titu1994@gmail.com>

* Revert freq mask guard

Signed-off-by: smajumdar <titu1994@gmail.com>

* Remove static time width clamping

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct typos and issues with notebooks (NVIDIA#2402)

* Fix Primer notebook

Signed-off-by: smajumdar <titu1994@gmail.com>

* Typo

Signed-off-by: smajumdar <titu1994@gmail.com>

* remove accelerator=DDP in tutorial notebooks to avoid errors. (NVIDIA#2403)

Signed-off-by: Hoo Chang Shin <hshin@nvidia.com>

Co-authored-by: Hoo Chang Shin <hshin@nvidia.com>

* style

Signed-off-by: ericharper <complex451@gmail.com>

* update jenkins branch

Signed-off-by: ericharper <complex451@gmail.com>

* update notebook branch to main

Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com>
Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com>
Co-authored-by: khcs <khcs@users.noreply.github.com>
Co-authored-by: Hoo Chang Shin <hshin@nvidia.com>
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Remove unused imports

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add initial doc for text_normalization

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Fixed imports warnings

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Minor Fix

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Renamed

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Allowed duplex modes

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Minor Fix

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add docs for duplex_text_normalization_train and duplex_text_normalization_test

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* docstrings for model codes + minor fix

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add more comments and doc strings

Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add doc for datasets + Use time.perf_counter()
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add code for preprocessing Google TN data
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add more docs and comments + Minor Fixes
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add more licenses + Fixed comments + Minors
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Moved evaluation logic to DuplexTextNormalizationModel
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add logging errors
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Updated validation code of tagger + Minors
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Also write tag preds to log file
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add data augmentation for tagger dataset
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Added experimental decorators
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Updated docs
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Updated duplex_tn_config.yaml
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Compute token precision of tagger using NeMo metrics
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Fixed saving issue when using ddp accelerator
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Refactoring
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Add option to keep punctuations in TextNormalizationTestDataset
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Changes to input preprocessing + decoder's postprocessing
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Fixed styles + Add references
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

* Renamed examples/nlp/duplex_text_normalization/utils.py to helpers.py
Signed-off-by: Tuan Lai <tuanl@nvidia.com>

Co-authored-by: Jagadeesh Balam <4916480+jbalam-nv@users.noreply.github.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: Samuel Kriman <samuelkriman@gmail.com>
Co-authored-by: Oktai Tatanov <oktai.tatanov@gmail.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Co-authored-by: Mike Chrzanowski <mike.chrzanowski0@gmail.com>
Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com>
Co-authored-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Eric Harper <complex451@gmail.com>
Co-authored-by: mchrzanowski <mchrzanowski@nvidia.com>
Co-authored-by: Yang Zhang <yzhang123@users.noreply.github.com>
Co-authored-by: Boris Fomitchev <borisfom@users.noreply.github.com>
Co-authored-by: root <root@dgx0026.nsv.rno1.nvmetal.net>
Co-authored-by: root <root@dgx0079.nsv.rno1.nvmetal.net>
Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com>
Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com>
Co-authored-by: khcs <khcs@users.noreply.github.com>
Co-authored-by: Hoo Chang Shin <hshin@nvidia.com>
22 people committed Jul 20, 2021
1 parent a7a587d commit 728dbea
Showing 21 changed files with 2,266 additions and 2 deletions.
1 change: 1 addition & 0 deletions docs/source/nlp/models.rst
@@ -20,3 +20,4 @@ NeMo's NLP collection supports the following models:
information_retrieval
nlp_model
machine_translation
text_normalization
20 changes: 18 additions & 2 deletions docs/source/nlp/nlp_all.bib
@@ -100,10 +100,26 @@ @article{post2018call
}

@misc{zhang2021sgdqa,
title={SGD-QA: Fast Schema-Guided Dialogue State Tracking for Unseen Services},
author={Yang Zhang and Vahid Noroozi and Evelina Bakhturina and Boris Ginsburg},
year={2021},
eprint={2105.08049},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@article{Sproat2016RNNAT,
title={RNN Approaches to Text Normalization: A Challenge},
author={R. Sproat and Navdeep Jaitly},
journal={ArXiv},
year={2016},
volume={abs/1611.00068}
}

@article{Zhang2019NeuralMO,
title={Neural Models of Text Normalization for Speech Applications},
author={Hao Zhang and R. Sproat and Axel H. Ng and Felix Stahlberg and Xiaochang Peng and Kyle Gorman and B. Roark},
journal={Computational Linguistics},
year={2019},
pages={293-338}
}
159 changes: 159 additions & 0 deletions docs/source/nlp/text_normalization.rst
@@ -0,0 +1,159 @@
.. _text_normalization:

Text Normalization Models
==========================
Text normalization is the task of converting written text into its spoken form. For example,
``$123`` should be verbalized as ``one hundred twenty three dollars``, while ``123 King Ave``
should be verbalized as ``one twenty three King Avenue``. The inverse problem is converting a
spoken sequence (e.g., an ASR output) into its written form.

NeMo provides an implementation for building a neural-based system that can do
both text normalization (TN) and inverse text normalization (ITN). At a high level, the
system consists of two individual components:

- `DuplexTaggerModel <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/duplex_text_normalization/duplex_tagger.py/>`__ - a Transformer-based tagger for identifying "semiotic" spans in the input (e.g., spans that are about times, dates, or monetary amounts).
- `DuplexDecoderModel <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/duplex_text_normalization/duplex_decoder.py/>`__ - a Transformer-based seq2seq model for decoding the semiotic spans into their appropriate forms (e.g., spoken forms for TN and written forms for ITN).

The typical workflow is to first train a DuplexTaggerModel and a DuplexDecoderModel. An example training script
is provided: `duplex_text_normalization_train.py <https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/duplex_text_normalization/duplex_text_normalization_train.py>`__.
After that, the two trained models can be used to initialize a `DuplexTextNormalizationModel <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/duplex_text_normalization/duplex_tn.py/>`__ that can be used for end-to-end inference.
An example script for evaluation and inference is provided: `duplex_text_normalization_test.py <https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/duplex_text_normalization/duplex_text_normalization_test.py>`__. The term
*duplex* refers to the fact that our system can be trained to do both TN and ITN. However, you can also specifically train the system for only one of the tasks.
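
Once both models are trained, end-to-end inference can be wired up roughly as follows. This is
a minimal sketch, assuming the three classes are importable from ``nemo.collections.nlp.models``
and that the wrapper accepts the two restored models; check ``duplex_tn.py`` for the exact
constructor signature before relying on it:

.. code::

    # Hedged sketch: the import paths and the wrapper's constructor arguments
    # are assumptions -- verify against duplex_tn.py before use.
    from nemo.collections.nlp.models import (
        DuplexTaggerModel,
        DuplexDecoderModel,
        DuplexTextNormalizationModel,
    )

    # restore_from() is the standard NeMo mechanism for loading .nemo checkpoints
    tagger = DuplexTaggerModel.restore_from("exps/tagger_model.nemo")
    decoder = DuplexDecoderModel.restore_from("exps/decoder_model.nemo")

    # Wrap the two trained models for end-to-end TN/ITN inference
    tn_model = DuplexTextNormalizationModel(tagger, decoder)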

NeMo Data Format
----------------
Both the DuplexTaggerModel and the DuplexDecoderModel use the same simple text format
for the dataset. The data needs to be stored in tab-separated files (``.tsv``) with three columns:
the first is the "semiotic class" (e.g., numbers, times, dates), the second is the token
in written form, and the third is the spoken form. An example sentence in the dataset is shown below.
In the example, ``sil`` denotes that a token is a punctuation mark, while ``<self>`` denotes that the spoken form is the
same as the written form. A complete dataset is expected to contain three files: ``train.tsv``, ``dev.tsv``,
and ``test.tsv``.

.. code::

    PLAIN The <self>
    PLAIN company 's <self>
    PLAIN revenues <self>
    PLAIN grew <self>
    PLAIN four <self>
    PLAIN fold <self>
    PLAIN between <self>
    DATE 2005 two thousand five
    PLAIN and <self>
    DATE 2008 two thousand eight
    PUNCT . sil
    <eos> <eos>
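
To make the format concrete, the following standalone sketch (our own illustration, not part of
NeMo) parses such a ``.tsv`` file into sentences of (semiotic class, written, spoken) triples,
treating ``<eos>`` rows as sentence boundaries:

.. code::

    import csv

    def read_tn_tsv(path):
        """Parse a TN .tsv file into sentences of (class, written, spoken) triples."""
        sentences, current = [], []
        with open(path, encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                if row and row[0] == "<eos>":  # sentence boundary
                    if current:
                        sentences.append(current)
                        current = []
                elif len(row) >= 3:
                    # columns: semiotic class, written form, spoken form
                    current.append((row[0], row[1], row[2]))
        if current:
            sentences.append(current)
        return sentences
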
An example script for generating a dataset in this format from the `Google text normalization dataset <https://www.kaggle.com/google-nlu/text-normalization>`_
can be found at `NeMo/examples/nlp/duplex_text_normalization/google_data_preprocessing.py <https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/duplex_text_normalization/google_data_preprocessing.py>`__.
Note that the script also does some preprocessing on the spoken forms of the URLs. For example,
given the URL "Zimbio.com", the original expected spoken form in the Google dataset is
"z_letter i_letter m_letter b_letter i_letter o_letter dot c_letter o_letter m_letter".
However, our script returns a more concise output, "zim bio dot com".

More information about the Google text normalization dataset can be found in the paper `RNN Approaches to Text Normalization: A Challenge <https://arxiv.org/ftp/arxiv/papers/1611/1611.00068.pdf>`__ :cite:`nlp-textnorm-Sproat2016RNNAT`.


Model Training
--------------

An example training script is provided: `duplex_text_normalization_train.py <https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/duplex_text_normalization/duplex_text_normalization_train.py>`__.
The config file used for the example is at `duplex_tn_config.yaml <https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/duplex_text_normalization/conf/duplex_tn_config.yaml>`__.
You can change any of these parameters directly from the config file or update them with the command-line arguments.

The config file contains three main sections. The first section contains the configs for the tagger, the second section is about the decoder,
and the last section is about the dataset. Most arguments in the example config file are quite self-explanatory (e.g.,
*decoder_model.optim.lr* refers to the learning rate for training the decoder). We have set most of the hyper-parameters to
values that we found to be effective. Some arguments that you may want to modify are:

- *data.base_dir*: The path to the dataset directory. It is expected that the directory contains three files: train.tsv, dev.tsv, and test.tsv.

- *tagger_model.nemo_path*: This is the path where the final trained tagger model will be saved to.

- *decoder_model.nemo_path*: This is the path where the final trained decoder model will be saved to.

Example of a training command:

.. code::

    python examples/nlp/duplex_text_normalization/duplex_text_normalization_train.py \
        data.base_dir=<PATH_TO_DATASET_DIR> \
        mode={tn,itn,joint}

There are three different modes: ``tn`` trains a system for TN only, ``itn`` trains a system
for ITN only, and ``joint`` trains a system that can do both TN and ITN at the same time. Note
that the above command will first train a tagger and then train a decoder sequentially.

You can also train only a tagger (without training a decoder) by running the
following command:

.. code::

    python examples/nlp/duplex_text_normalization/duplex_text_normalization_train.py \
        data.base_dir=<PATH_TO_DATASET_DIR> \
        mode={tn,itn,joint} \
        decoder_model.do_training=false

Similarly, you can train only a decoder (without training a tagger):

.. code::

    python examples/nlp/duplex_text_normalization/duplex_text_normalization_train.py \
        data.base_dir=<PATH_TO_DATASET_DIR> \
        mode={tn,itn,joint} \
        tagger_model.do_training=false

Model Architecture
------------------

The tagger model first uses a Transformer encoder (e.g., DistilRoBERTa) to build a
contextualized representation for each input token. It then uses a classification head
to predict the tag for each token (e.g., if a token should stay the same, its tag should
be ``SAME``). The decoder model then takes the semiotic spans identified by the tagger and
transforms them into the appropriate forms (e.g., spoken forms for TN and written forms for ITN).
The decoder model is essentially a Transformer-based encoder-decoder seq2seq model (e.g., the example
training script uses the T5-base model by default). Overall, our design is partly inspired by the
RNN-based sliding window model proposed in the paper
`Neural Models of Text Normalization for Speech Applications <https://research.fb.com/wp-content/uploads/2019/03/Neural-Models-of-Text-Normalization-for-Speech-Applications.pdf>`__ :cite:`nlp-textnorm-Zhang2019NeuralMO`.

We introduce a simple but effective technique to allow our model to be duplex. Depending on the
task the model is handling, we prepend the appropriate prefix to the input. For example, suppose
we want to transform the text ``I live in 123 King Ave`` to its spoken form (i.e., the TN problem);
we simply prepend the prefix ``tn``, and so the final input to our models will actually
be ``tn I live in tn 123 King Ave``. Similarly, for the ITN problem, we just prepend the prefix ``itn``
to the input.
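
To illustrate the prefixing scheme (the helper below is our own illustration, not a NeMo API),
the task prefix is added both at the start of the sentence and in front of the semiotic span,
matching the example above:

.. code::

    def add_task_prefix(tokens, span_start, span_end, task="tn"):
        """Illustrative only: prepend the task prefix to the sentence and the span."""
        out = [task] + tokens[:span_start] + [task] + tokens[span_start:span_end]
        out += tokens[span_end:]
        return " ".join(out)

    # -> 'tn I live in tn 123 King Ave'
    print(add_task_prefix(["I", "live", "in", "123", "King", "Ave"], 3, 6))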

To improve the effectiveness and robustness of our models, we also apply some simple data
augmentation techniques during training.

Data Augmentation for Training DuplexTaggerModel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In the Google English TN training data, about 93% of the tokens are not in any semiotic span. In other words, the ground-truth tags of most tokens are of trivial types (i.e., ``SAME`` and ``PUNCT``). To alleviate this class imbalance problem,
for each original instance with several semiotic spans, we create a new instance by simply concatenating all the semiotic spans together. For example, consider the following ITN instance:

Original instance: ``[The|SAME] [revenues|SAME] [grew|SAME] [a|SAME] [lot|SAME] [between|SAME] [two|B-TRANSFORM] [thousand|I-TRANSFORM] [two|I-TRANSFORM] [and|SAME] [two|B-TRANSFORM] [thousand|I-TRANSFORM] [five|I-TRANSFORM] [.|PUNCT]``

Augmented instance: ``[two|B-TRANSFORM] [thousand|I-TRANSFORM] [two|I-TRANSFORM] [two|B-TRANSFORM] [thousand|I-TRANSFORM] [five|I-TRANSFORM]``

The argument ``data.train_ds.tagger_data_augmentation`` in the config file controls whether this data augmentation will be enabled or not.
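
A rough sketch of this augmentation (our own illustration; the actual logic lives in the NeMo
dataset code) keeps only the tokens whose tags mark a semiotic transform:

.. code::

    def augment_tagger_instance(tokens, tags):
        """Build a new instance from the tokens inside semiotic spans."""
        kept = [(tok, tag) for tok, tag in zip(tokens, tags)
                if tag in ("B-TRANSFORM", "I-TRANSFORM")]
        return [t for t, _ in kept], [g for _, g in kept]

    tokens = ["The", "revenues", "grew", "a", "lot", "between", "two", "thousand",
              "two", "and", "two", "thousand", "five", "."]
    tags = ["SAME"] * 6 + ["B-TRANSFORM", "I-TRANSFORM", "I-TRANSFORM", "SAME",
                           "B-TRANSFORM", "I-TRANSFORM", "I-TRANSFORM", "PUNCT"]
    # -> ['two', 'thousand', 'two', 'two', 'thousand', 'five']
    print(augment_tagger_instance(tokens, tags)[0])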


Data Augmentation for Training DuplexDecoderModel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Since the tagger may not be perfect, the inputs to the decoder may not all be semiotic spans. Therefore, to make the decoder more robust to the tagger's potential errors,
we train the decoder not only on semiotic spans but also on some other, more "noisy" spans. This way, even if the tagger makes some errors, there is still a chance that the
final output will be correct.

The argument ``data.train_ds.decoder_data_augmentation`` in the config file controls whether this data augmentation will be enabled or not.
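
The exact noising scheme lives in the NeMo dataset code; one plausible variant (an assumption on
our part, not the confirmed implementation) is to randomly widen a gold span's boundaries by a
few neighboring tokens:

.. code::

    import random

    def noisy_span(tokens, start, end, max_extra=2):
        """Simulate tagger errors by randomly extending a gold span's boundaries."""
        new_start = max(0, start - random.randint(0, max_extra))
        new_end = min(len(tokens), end + random.randint(0, max_extra))
        return tokens[new_start:new_end]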

References
----------

.. bibliography:: nlp_all.bib
:style: plain
:labelprefix: NLP-TEXTNORM
:keyprefix: nlp-textnorm-
136 changes: 136 additions & 0 deletions examples/nlp/duplex_text_normalization/conf/duplex_tn_config.yaml
@@ -0,0 +1,136 @@
name: &name DuplexTextNormalization
mode: joint # Three possible choices ['tn', 'itn', 'joint']

# Pretrained Nemo Models
tagger_pretrained_model: null
decoder_pretrained_model: null

# Tagger
tagger_trainer:
  gpus: 1 # the number of gpus, 0 for CPU
  num_nodes: 1
  max_epochs: 5 # the number of training epochs
  checkpoint_callback: false # provided by exp_manager
  logger: false # provided by exp_manager
  accumulate_grad_batches: 1 # accumulates grads every k batches
  gradient_clip_val: 0.0
  amp_level: O0 # O1/O2 for mixed precision
  precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
  accelerator: ddp

tagger_model:
  do_training: true
  transformer: distilroberta-base
  tokenizer: ${tagger_model.transformer}
  nemo_path: ${tagger_exp_manager.exp_dir}/tagger_model.nemo # exported .nemo path

  optim:
    name: adamw
    lr: 5e-5
    weight_decay: 0.01

    sched:
      name: WarmupAnnealing

      # pytorch lightning args
      monitor: val_token_precision
      reduce_on_plateau: false

      # scheduler config override
      warmup_steps: null
      warmup_ratio: 0.1
      last_epoch: -1

tagger_exp_manager:
  exp_dir: exps # where to store logs and checkpoints
  name: tagger_training # name of experiment
  create_tensorboard_logger: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    save_top_k: 3
    monitor: "val_token_precision"
    mode: "max"
    save_best_model: true
    always_save_nemo: true

# Decoder
decoder_trainer:
  gpus: 1 # the number of gpus, 0 for CPU
  num_nodes: 1
  max_epochs: 3 # the number of training epochs
  checkpoint_callback: false # provided by exp_manager
  logger: false # provided by exp_manager
  accumulate_grad_batches: 1 # accumulates grads every k batches
  gradient_clip_val: 0.0
  amp_level: O0 # O1/O2 for mixed precision
  precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
  accelerator: ddp

decoder_model:
  do_training: true
  transformer: t5-base
  tokenizer: ${decoder_model.transformer}
  nemo_path: ${decoder_exp_manager.exp_dir}/decoder_model.nemo # exported .nemo path

  optim:
    name: adamw
    lr: 2e-4
    weight_decay: 0.01

    sched:
      name: WarmupAnnealing

      # pytorch lightning args
      monitor: val_loss
      reduce_on_plateau: false

      # scheduler config override
      warmup_steps: null
      warmup_ratio: 0.0
      last_epoch: -1

decoder_exp_manager:
  exp_dir: exps # where to store logs and checkpoints
  name: decoder_training # name of experiment
  create_tensorboard_logger: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    save_top_k: 3
    monitor: "val_loss"
    mode: "min"
    save_best_model: true
    always_save_nemo: true

# Data
data:
  base_dir: ??? # /path/to/data

  train_ds:
    data_path: ${data.base_dir}/train.tsv
    batch_size: 64
    shuffle: true
    do_basic_tokenize: false
    max_decoder_len: 80
    mode: ${mode}
    # Refer to the text_normalization doc for more information about data augmentation
    tagger_data_augmentation: true
    decoder_data_augmentation: true

  validation_ds:
    data_path: ${data.base_dir}/dev.tsv
    batch_size: 64
    shuffle: false
    do_basic_tokenize: false
    max_decoder_len: 80
    mode: ${mode}

  test_ds:
    data_path: ${data.base_dir}/test.tsv
    batch_size: 64
    shuffle: false
    mode: ${mode}

# Inference
inference:
  interactive: false # Set to true if you want to enable the interactive mode when running duplex_text_normalization_test.py
  errors_log_fp: errors.txt # Path to the file for logging the errors