sync after 6915 (#14) (#15) · zhehuaichen/NeMo@206af78

sync after 6915 (#14) (#15)

* Fixed small bug with NoisePerturbationWithNormalization (NVIDIA#7118)



* Fix import guard checks (NVIDIA#7124)



* Revert "Fix import guard checks (NVIDIA#7124)" (NVIDIA#7125)

This reverts commit a46e325.

* Fix import guard checks (NVIDIA#7126)

* Fix import guard checks



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------




* Add updated fc ctc and rnnt xxl models (NVIDIA#7128) (NVIDIA#7130)

* [TTS] Create EnCodec training recipe (NVIDIA#6852)

* [TTS] Create EnCodec training recipe



* [TTS] Update encodec recipe



* [TTS] Rename EnCodec to AudioCodec



* [TTS] Add EnCodec unit tests



* [TTS] Add copyright header to distributed.py



---------



* Fix rank where torch.distributed may not be initialized yet and would not wait for tokenizer file caching (NVIDIA#7061)




* fix default attention size (NVIDIA#7141) (NVIDIA#7143)

* fix evaluator.py for various exceptions by ast (NVIDIA#7150)



* [TTS][ZH] add Chinese TTS recipes based on IPA symbol sets. (NVIDIA#6893)

* [TTS] add Chinese TTS recipe based on IPA.
* add new pinyin and ipa dictionaries with 36 finals.
* add yaml configs for 24-final pinyin and ipa.
* add copyright header
* add a directory level 24finals to discriminate from 36 finals.



* unify configs into a single one and add detailed comments providing supported candidates.



* choose 36-final IPA as default phoneme dict



---------



* [TTS] Add output audio format to preprocessing (NVIDIA#6889)

* [TTS] Add output audio format to preprocessing



* [TTS] Add format validation



* [TTS] Fix data tutorial



---------



* freeze (NVIDIA#7152)



* make sure any empty segments are removed (NVIDIA#7155)



* Update RIR generation scripts (NVIDIA#6547)

- fix: reduce room size if evaluation of params fails
- added randomized mic placement
- added diffuse noise generation
- added an option to specify the format and subtype for saved audio



* A quickstart speech enhancement tutorial (NVIDIA#6492)

A simple example of training a model for speech enhancement task



* NFA subtitle file config - specify colors and vertical alignment (NVIDIA#7160)

* allow specifying colors of text in ASS subtitle file



* specify vertical_alignment instead of marginv in ass_file_config



* add documentation of CTMFileConfig and ASSFileConfig to NFA README



---------



* Eagerly accumulate embedding grads into fp32 buffer (NVIDIA#6958) (NVIDIA#7153)




* TE bug fix (NVIDIA#7027) (NVIDIA#7036)




* [TTS] Remove nested TTS configs (NVIDIA#7154)

* [TTS] Remove nested TTS configs



* [TTS] Modify tutorial to support multiple sampling rates



* [TTS] Clarify min_duration unit



* [TTS] Default 22.05kHz highfreq to null



---------



* Merge release r1.20.0 to main (NVIDIA#7167)

* update package info



* Add ASR with TTS Tutorial. Fix enhancer usage. (NVIDIA#6955)

* Add ASR with TTS Tutorial
* Fix enhancer usage



* install_bs (NVIDIA#7019)



* Fix typo and branch in tutorial (NVIDIA#7048)



* fix syntax error introduced in PR-7079 (NVIDIA#7102)

* fix syntax error introduced in PR-7079



* fixes for pr review



---------



* fix links for TN (NVIDIA#7117)



* update branch (NVIDIA#7135)



* Fixed main and merging this to r1.20 (NVIDIA#7127)

* Fixed main and merging this to r1.20



* Update vad_utils.py



---------





* update branch



* fix version



* resolve conflict the other way



* keep both



* revert keep both



---------















* Upgrade to pytorch lightning 2.0 (NVIDIA#6433)

* Upgrade pytorch lightning version in requirements



* Initial fixes for PTL2.0



* Add further fixes to support lightning 2.0



* Add replacements for replace_sampler_ddp, resume_from_checkpoint_fit_path and a few occurrences of validation_epoch_end



* Replace all occurrences of validation_epoch_end with on_validation_epoch_end



* Replace training_epoch_end, test_epoch_end with on_train_epoch_end and on_test_epoch_end respectively



* Change logger=None to logger=False in Trainer object



* Remove PTL2.0 deprecated Trainer args from TrainerConfig dataclass



* Modify trainer.precision check and other small edits



* Replace logger=None with logger=False in test_ptl_stateless_timer.py Trainer



* Add default values for args to fix Attribute Error



* Add the following modifications

1) Remove outputs arg from on_validation_epoch_end, on_test_epoch_end and make it an arg of the class
2) Replace resume_from_checkpoint with ckpt_path as needed
3) Explicitly add accelerator as 'CPU' in UTs being run on CPU
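
The hook change in item 1 can be sketched as follows. This is a minimal illustration, not the NeMo code: `ValidationOutputsSketch` is a hypothetical plain-Python stand-in for a LightningModule, the "loss" is a placeholder mean, and the epoch-end hook returns the average only so the sketch is easy to check (the real hook would log it instead).

```python
class ValidationOutputsSketch:
    """Hypothetical stand-in for a LightningModule; only the hook names mirror PTL 2.0."""

    def __init__(self):
        # Outputs live on the instance because on_validation_epoch_end
        # no longer receives an `outputs` argument.
        self.validation_step_outputs = []

    def validation_step(self, batch, batch_idx):
        loss = sum(batch) / len(batch)  # placeholder "loss" for illustration
        self.validation_step_outputs.append({"val_loss": loss})
        return loss

    def on_validation_epoch_end(self):
        losses = [out["val_loss"] for out in self.validation_step_outputs]
        avg = sum(losses) / len(losses) if losses else float("nan")
        # Clear the buffer so outputs do not accumulate across epochs.
        self.validation_step_outputs.clear()
        return avg  # returned here only for illustration
```

The same pattern would apply to `test_step` / `on_test_epoch_end` with a separate `test_step_outputs` list.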



* Remove outputs arg from on_validation_epoch_end, on_test_epoch_end



* Remove outputs arg in on_validation_epoch_end in MultiBinaryAccuracy docstrings



* Add val, test outputs as instance vars in PunctuationCapitalizationModel and TokenClassificationModel



* Replace trainer.fit_loop.max_steps with trainer.fit_loop.epoch_loop.max_steps in test_optimizers_schedulers.py



* Revert an extra space that was mistakenly added



* Use self.validation_step_outputs and self.test_step_outputs in test_ema.py for uniformity



* Use self.validation_step_outputs and self.test_step_outputs in test_ptl_stateless_timer.py and check_for_ranks.py for uniformity



* Add self.validation_step_outputs.clear() and self.test_step_outputs.clear() wherever missing



* Remove outputs arg from on_train_epoch_end



* Remove outputs from on_validation_epoch_end in multi_binary_acc.py



* Remove output args from on_validation_epoch_end in the docstrings of some ASR files



* Remove output args from on_validation_epoch_end and clear memory from validation_step_outputs



* Add on_validation_epoch_end and remove outputs args for nlp models



* Append output of validation_step to validation_step_outputs in EncDecClassificationModel



* Add the following changes

1) Index self.validation_step_outputs and self.test_step_outputs with dataloader_idx wherever needed
2) Initialize self.validation_step_outputs and self.test_step_outputs as empty lists and add support for multi dataloaders if they exist
3) Remove self.pre_configure_ddp from NLPDDPStrategy class as it's removed in PTL 2.0
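
One way to picture items 1 and 2 is a buffer that is a flat list for a single dataloader and a list of lists otherwise. The `StepOutputs` class below is a hypothetical helper written for illustration; the commit itself stores these lists directly on the model.

```python
class StepOutputs:
    """Hypothetical buffer for step outputs, indexed by dataloader_idx when needed."""

    def __init__(self, num_dataloaders=1):
        self.multi = num_dataloaders > 1
        # One list per dataloader when there are several, otherwise a flat list.
        self.data = [[] for _ in range(num_dataloaders)] if self.multi else []

    def _target(self, dataloader_idx):
        return self.data[dataloader_idx] if self.multi else self.data

    def append(self, output, dataloader_idx=0):
        self._target(dataloader_idx).append(output)

    def clear(self, dataloader_idx=0):
        # Clear only the buffer that belongs to the given dataloader.
        self._target(dataloader_idx).clear()
```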



* Add default value dataloader_idx=0 for on_validation_batch_end() in megatron_base_model.py



* TypeCast precision to str in attention.py and utils_funcs.py to avoid TypeError



* Add if condition check for multiple dataloaders when appending to validation outputs



* Separate validation pass to be used with both validation_step and test_step



* Add if condition check for multiple dataloader while appending to test_step_outputs in punctuation_capitalization_model.py



* Add condition check for multiple dataloaders based on type of trainer.val/test_dataloaders or self._validation/test_dl instead of len



* Comment Megatron T5 IA3 PP=2 in CI pipeline due to dataloader_iter issue with PTL 2.0



* Modify precision checks to account for 16-mixed and bf16-mixed
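
A minimal sketch of such a check, assuming precision may arrive as an int or a string. The helper names `is_fp16` and `is_bf16` are hypothetical and not taken from the NeMo code base.

```python
def is_fp16(precision):
    """True for half-precision settings such as 16, '16' or '16-mixed'."""
    # Cast to str first so integer values like 16 compare cleanly.
    return str(precision) in ("16", "16-mixed")


def is_bf16(precision):
    """True for bfloat16 settings such as 'bf16' or 'bf16-mixed'."""
    return str(precision) in ("bf16", "bf16-mixed")
```

Comparing against an explicit set of strings avoids indexing into the precision value, which fails when it is an int.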



* Append output of validation/test_step to self.validation/test_step_outputs in CTCG2PModel



* Modify find_unused_parameters=True in g2p_heteronym model

1) Add find_unused_parameters=True for DDP strategy in g2p_heteronym_classification_train_and_evaluate.py
2) Remove args output in validation/test_step and add instance variables instead for heteronym_classification.py



* Remove outputs from on_test_epoch_end in DialogueGPTClassificationModel



* Add validation/test outputs in sgdqa_model and modify dialogue_config.yaml



* Add split arg self.test_step_outputs to TextClassificationModel



* Add test_step_outputs to dialogue and text classification models



* Change condition check for multiple dataloaders:

1) Replace ds_item as list in dialogue_config.yaml
2) Check for len of val/test_dataloaders or validation/test_dl along with type check of list in sgdqa_model.py while appending outputs of validation/test_step
3) Check for len of _validation/test_dl for creating self.validation/test_step_outputs in ModelPT and punctuation_capitalization_model.py



* Add additional condition for multi dataloaders

Check len(self.trainer.val/test_dataloaders) > 1 along with type(self.trainer.val/test_dataloaders) == list for multi dataloaders in validation/test_step
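
The condition described above can be sketched as a small predicate. The function name is hypothetical, and `isinstance` is used here in place of the `type(...) == list` comparison mentioned in the message.

```python
def has_multiple_dataloaders(dataloaders):
    """True only when `dataloaders` is a list holding more than one dataloader."""
    # A single dataloader object (or None) is not a list, so it is treated
    # as the single-dataloader case even if it defines __len__.
    return isinstance(dataloaders, list) and len(dataloaders) > 1
```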



* Add val step outputs and default val for dataloader_idx

1) Append validation_step output to self.validation_step_outputs in MultiLabelIntentSlotClassificationModel
2) Add default val for dataloader_idx for on_test_batch_start/end in TimingCallback
3) Add self.validation/test_step_outputs in BERTQAModel and remove outputs arg



* Add val/test_step_outputs to S2SQAModel and GPTQAModel



* Edit Jenkinsfile for bert_pretraining.py

Edit Jenkinsfile for this test to disable validation as a workaround for trainer.val_dataloader None error



* Modify precision to support 16-mixed, bf16-mixed in megatron_gpt_pretraining.py



* Add ddp_find_unused_parameters_true and remove output args

1) Add ddp_find_unused_parameters_true for trainer.strategy in self_alignment_pretraining.py as it has unused parameters
2) Remove output args and add self.validation/test_step_outputs to validation/test_step in mt_enc_dec_model.py
3) Comment tests in JenkinsFile that need to be fixed



* Precision fix in megatron_nmt_training.py for 16-mixed, bf16-mixed



* Precision fix for megatron_bert_pretraining.py and megatron_bert_model.py



* Precision fix and validation/test_step_outputs

1) Add fix to account for 16-mixed and bf16-mixed in megatron_retro_mutransfer_pretrain.py, megatron_retro_pretraining.py
2) Reset ckpt_path for test in enc_dec_nmt.py
3) Remove outputs args and add validation/test_step_outputs in megatron_retrieval_model.py
4) Comment Megatron Bert Pretraining and Resume Training with Pipeline Parallelism and add back NMT Training Post-LN



* Precision fix and skip few failing tests



* Add missing comment lines in JenkinsFile



* Comment Jenkins tests and super().on_validation_epoch_end() in megatron_gpt_sft_model.py



* Minor edit JenkinsFile



* Minor edit in jenkins file



* Edit in Jenkins file



* Comment missed lines in Jenkins file



* Fix precision and validation/test outputs

1) Add precision fix to account for 16-mixed and bf16-mixed in megatron_t5_pretraining.py
2) Remove outputs args and add append loss to self.validation/test_step_outputs in megatron_lm_encoder_decoder_model.py
3) Add back resume_from_checkpoint in the megatron_t5_config.yaml
4) Comment out certain tests in Jenkins file



* Fix precision and validation/test/predict errors in megatron_t5_prompt_learning.py



* Precision fix and edit precision typo in all files

1) Account for 16-mixed and bf16-mixed in megatron_bart_pretraining.py and megatron_t5_seq2seq_finetune.py
2) Fix precision typo in all files



* Fix all CI TTS tests and comment few Jenkins tests



* Combine xx_epoch_end and on_xx_epoch_end

Add on_inference_epoch_end to inference_epoch_end function and have a single on_validation/test_epoch_end in megatron_finetune_model.py and megatron_gpt_sft_model.py



* Add a missing comment in JenkinsFile



* Add try except StopIteration in validation_step for models with dataloader_iter
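
A minimal sketch of the guard, assuming the step pulls its own batch from an iterator. The function below is a hypothetical stand-in, and the mean is a placeholder for the real forward pass.

```python
def validation_step(dataloader_iter):
    """Hypothetical step that fetches its own batch from an iterator."""
    try:
        batch = next(dataloader_iter)
    except StopIteration:
        # The iterator is exhausted: return quietly instead of
        # letting the exception escape from the step.
        return None
    return sum(batch) / len(batch)  # placeholder for the real computation
```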



* Remove pyyaml from requirements



* Add try except for inference_step in megatron_finetune_model.py



* Remove limit_val_batches for mockGPTDataset test



* Add new self.validation_step_outputs for MegatronGPTSFTModel



* Minor edit Jenkinsfile



* Initialize self.validation/test_step_outputs in megatron_gpt_sft_model.py

Initialize self.validation/test_step_outputs in setup of MegatronGPTSFTModel to handle cases where dataloaders are not set up in ModelPT, for example while restoring the model.



* Remove resume_from_checkpoint if trainer arg in conf yaml files



* Remove resume_from_checkpoint as trainer arg in GPT, T5 configs



* Remove resume_from_checkpoint in duplex_tn_config.yaml



* Fix typos, unused imports and refactor code to remove redundant funcs



* Remove commented code in megatron_nmt_model.py



* Fix overridden functions to match parent class functions



* Prefetch dataloader_iter to prevent hang for PP>1



* Override setup() in NLPDDPStrategy to avoid hang during predict with PP>1



* Uncomment tests in JenkinsFile



* Add '16' to precision checks and other minor fixes



* Clear validation/test_step_outputs with dataloader_idx for multi dataloaders



* Minor edits



* Modify precision checks to avoid indexing



* Remove self.validation_step_outputs_sft and add dataloader_idx to clear outputs



* Reference checkpoint with trainer.ckpt_path



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add _prefetch to NLPModel and minor fixes



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add limit_val_batches in JenkinsFile for NMT

1) Add trainer.limit_val_batches in Megatron NMT Training TP=2
2) Remove unused import in ModelPT



---------




* Include the scripts for preprocessing OAST and unit tests for chat sft datasets (NVIDIA#7112)

* scripts for sft



* fix style



* added special token only for huggingface model



* change default name



* print out error datapoint content



* show error id



* annotation script working



* try to be compatible with huggingface tokenizer



* added examples



* added lang



* added lang



* text to value special case



* configure the slider



* annotation handles lang



* added the unit test for chat sft dataset



* used the file in the test dir



* fix json error



* load local tokenizer



* remove mask count check



* added HF dataset backend



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------




* add paths to labeler. (NVIDIA#7087)



* T5 metrics fix (NVIDIA#7037)

* Fix race condition when executing with multi-node where some ranks do not wait for setup (NVIDIA#7016)




* Added bool types to neural_types export (NVIDIA#7032)




* rnnt and char utils (NVIDIA#6971)

* rnnt_ngram_merge



* char level bug



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------






* fix tab text gen (NVIDIA#7022) (NVIDIA#7031)





* Fixed kwargs for metric instance init



* Fixed kwargs for metric instance init



* removed kwargs



* Updated config desc



* ASR Confidence update and tutorial (NVIDIA#6810)

* small fixes and tests



* various fixes for the tutorial



* tutorial added



* fix for a little oops after rebase



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix tests



* unused import removed



* fix review comments



* deprecated parameters for greedy configs



* move re-assigning to configs



* fix comments 2



* fix config tests



* fix ece test (my env was bugged apparently)



* renamings for confidence ensemble



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix comments 3



* return dropped tutorial



* CI flips back and forth, increasing tolerance



---------





* install_bs (NVIDIA#7019) (NVIDIA#7028)





* fixes for spellmapper (NVIDIA#6994) (NVIDIA#7000)






* added back the retro documents (NVIDIA#7033)




* Remove pyyaml (NVIDIA#7052) (NVIDIA#7054)





* st standalone model (NVIDIA#6969)

* st standalone model



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* style fix



* sacrebleu import fix, unused imports removed



* import guard for nlp inside asr transformer bpe model



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* codeql fixes



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comments answered



* import ordering fix



* yttm for asr removed



* logging added



* added inference and translate method



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------





* remove pos emb from state dict for old models (NVIDIA#7068)

* remove pos emb from state dict



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move to nlp_model



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update comment



* fix nmt test



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix nmt test



---------





* Fix typo in ASR-TTS tutorial (NVIDIA#7049)




* Fixed tutorial's name (NVIDIA#7047)





* Fix documentation for Numba (NVIDIA#7065) (NVIDIA#7077)

* Fix documentation for Numba



* Update force float32 flag dynamically



* Update force float32 flag dynamically



* Fix nemo version



---------






* Update Frame-VAD doc and fix onnx export (NVIDIA#7076)

* update fvad doc



* fix typo



* update fvad example



* update



* fix onnx export



* update test



* refactor



* update doc



* update



---------





* memmap worker arg (NVIDIA#7062)

* memmap worker arg



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update



* update



---------





* Fix caching bug in causal convolutions for cache-aware ASR models (NVIDIA#7034) (NVIDIA#7082)




* Fast Conformer global token fix (NVIDIA#7085)

* old way



* fix



* fix



* fix



* remove extra



* clean



* clean



* clean



* fix



* fix



* fix



* fix



* fix



* fix



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------





* Refined export_config (NVIDIA#7053) (NVIDIA#7066)

* Refined export_config
* Rolling back hierarchy change
---------





* small Bugfix (NVIDIA#7081)

* small Bugfix (NVIDIA#7079)

* fix branch



* fix typo



* fix link



---------



* Update tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb



* Update tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb



---------







* Added script to extract ASR CTC and RNNT models from ASR hybrid models (NVIDIA#7092)

* Added script to extract ctc and rnnt models from hybrid models



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Updated hybrid extraction script for review request 1



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Updated hybrid convert script to remove --cuda flag



---------






* Adding docs and models for multiple lookahead cache-aware ASR (NVIDIA#7067) (NVIDIA#7094)



* update TTS readme (NVIDIA#7088)

* update TTS readme



---------




* Fix absolute path in path join call (NVIDIA#7099)




* Disable distopt contiguous param buffer by default (NVIDIA#7095)




* microphone demo (NVIDIA#7110)





* [Fix] load_state_dict in nlp_model.py (NVIDIA#7086)

* Fix load_state_dict in nlp_model.py



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------





* Fix plot function in vad_utils.py (NVIDIA#7113)

Fix plot function in vad_utils.py




* Fixed small bug with NoisePerturbationWithNormalization (NVIDIA#7118)




* Fix import guard checks (NVIDIA#7124)




* Revert "Fix import guard checks (NVIDIA#7124)" (NVIDIA#7125)

This reverts commit a46e325.



* Fix import guard checks (NVIDIA#7126)

* Fix import guard checks



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------





* Add updated fc ctc and rnnt xxl models (NVIDIA#7128) (NVIDIA#7130)



* [TTS] Create EnCodec training recipe (NVIDIA#6852)

* [TTS] Create EnCodec training recipe



* [TTS] Update encodec recipe



* [TTS] Rename EnCodec to AudioCodec



* [TTS] Add EnCodec unit tests



* [TTS] Add copyright header to distributed.py



---------




* Fix rank where torch.distributed may not be initialized yet and would not wait for tokenizer file caching (NVIDIA#7061)





* fix default attention size (NVIDIA#7141) (NVIDIA#7143)



* fix evaluator.py for various exceptions by ast (NVIDIA#7150)




* [TTS][ZH] add Chinese TTS recipes based on IPA symbol sets. (NVIDIA#6893)

* [TTS] add Chinese TTS recipe based on IPA.
* add new pinyin and ipa dictionaries with 36 finals.
* add yaml configs for 24-final pinyin and ipa.
* add copyright header
* add a directory level 24finals to discriminate from 36 finals.



* unify configs into a single one and add detailed comments providing supported candidates.



* choose 36-final IPA as default phoneme dict



---------




* [TTS] Add output audio format to preprocessing (NVIDIA#6889)

* [TTS] Add output audio format to preprocessing



* [TTS] Add format validation



* [TTS] Fix data tutorial



---------




* freeze (NVIDIA#7152)




* make sure any empty segments are removed (NVIDIA#7155)




* Update RIR generation scripts (NVIDIA#6547)

- fix: reduce room size if evaluation of params fails
- added randomized mic placement
- added diffuse noise generation
- added an option to specify the format and subtype for saved audio




* A quickstart speech enhancement tutorial (NVIDIA#6492)

A simple example of training a model for speech enhancement task




* NFA subtitle file config - specify colors and vertical alignment (NVIDIA#7160)

* allow specifying colors of text in ASS subtitle file



* specify vertical_alignment instead of marginv in ass_file_config



* add documentation of CTMFileConfig and ASSFileConfig to NFA README



---------




* Eagerly accumulate embedding grads into fp32 buffer (NVIDIA#6958) (NVIDIA#7153)





* TE bug fix (NVIDIA#7027) (NVIDIA#7036)





* [TTS] Remove nested TTS configs (NVIDIA#7154)

* [TTS] Remove nested TTS configs



* [TTS] Modify tutorial to support multiple sampling rates



* [TTS] Clarify min_duration unit



* [TTS] Default 22.05kHz highfreq to null



---------




* Merge release r1.20.0 to main (NVIDIA#7167)

* update package info



* Add ASR with TTS Tutorial. Fix enhancer usage. (NVIDIA#6955)

* Add ASR with TTS Tutorial
* Fix enhancer usage



* install_bs (NVIDIA#7019)



* Fix typo and branch in tutorial (NVIDIA#7048)



* fix syntax error introduced in PR-7079 (NVIDIA#7102)

* fix syntax error introduced in PR-7079



* fixes for pr review



---------



* fix links for TN (NVIDIA#7117)



* update branch (NVIDIA#7135)



* Fixed main and merging this to r1.20 (NVIDIA#7127)

* Fixed main and merging this to r1.20



* Update vad_utils.py



---------





* update branch



* fix version



* resolve conflict the other way



* keep both



* revert keep both



---------
















* Upgrade to pytorch lightning 2.0 (NVIDIA#6433)

* Upgrade pytorch lightning version in requirements



* Initial fixes for PTL2.0



* Add further fixes to support lightning 2.0



* Add replacements for replace_sampler_ddp, resume_from_checkpoint_fit_path and a few occurrences of validation_epoch_end



* Replace all occurrences of validation_epoch_end with on_validation_epoch_end



* Replace training_epoch_end, test_epoch_end with on_train_epoch_end and on_test_epoch_end respectively



* Change logger=None to logger=False in Trainer object



* Remove PTL2.0 deprecated Trainer args from TrainerConfig dataclass



* Modify trainer.precision check and other small edits



* Replace logger=None with logger=False in test_ptl_stateless_timer.py Trainer



* Add default values for args to fix Attribute Error



* Add the following modifications

1) Remove outputs arg from on_validation_epoch_end, on_test_epoch_end and make it an arg of the class
2) Replace resume_from_checkpoint with ckpt_path as needed
3) Explicitly add accelerator as 'CPU' in UTs being run on CPU



* Remove outputs arg from on_validation_epoch_end, on_test_epoch_end



* Remove outputs arg in on_validation_epoch_end in MultiBinaryAccuracy docstrings



* Add val, test outputs as instance vars in PunctuationCapitalizationModel and TokenClassificationModel



* Replace trainer.fit_loop.max_steps with trainer.fit_loop.epoch_loop.max_steps in test_optimizers_schedulers.py



* Revert an extra space that was mistakenly added



* Use self.validation_step_outputs and self.test_step_outputs in test_ema.py for uniformity



* Use self.validation_step_outputs and self.test_step_outputs in test_ptl_stateless_timer.py and check_for_ranks.py for uniformity



* Add self.validation_step_outputs.clear() and self.test_step_outputs.clear() wherever missing



* Remove outputs arg from on_train_epoch_end



* Remove outputs from on_validation_epoch_end in multi_binary_acc.py



* Remove output args from on_validation_epoch_end in the docstrings of some ASR files



* Remove output args from on_validation_epoch_end and clear memory from validation_step_outputs



* Add on_validation_epoch_end and remove outputs args for nlp models



* Append output of validation_step to validation_step_outputs in EncDecClassificationModel



* Add the following changes

1) Index self.validation_step_outputs and self.test_step_outputs with dataloader_idx wherever needed
2) Initialize self.validation_step_outputs and self.test_step_outputs as empty lists and add support for multi dataloaders if they exist
3) Remove self.pre_configure_ddp from NLPDDPStrategy class as it's removed in PTL 2.0



* Add default value dataloader_idx=0 for on_validation_batch_end() in megatron_base_model.py



* TypeCast precision to str in attention.py and utils_funcs.py to avoid TypeError



* Add if condition check for multiple dataloaders when appending to validation outputs



* Separate validation pass to be used with both validation_step and test_step



* Add if condition check for multiple dataloader while appending to test_step_outputs in punctuation_capitalization_model.py



* Add condition check for multiple dataloaders based on type of trainer.val/test_dataloaders or self._validation/test_dl instead of len



* Comment Megatron T5 IA3 PP=2 in CI pipeline due to dataloader_iter issue with PTL 2.0



* Modify precision checks to account for 16-mixed and bf16-mixed



* Append output of validation/test_step to self.validation/test_step_outputs in CTCG2PModel



* Modify find_unused_parameters=True in g2p_heteronym model

1) Add find_unused_parameters=True for DDP strategy in g2p_heteronym_classification_train_and_evaluate.py
2) Remove args output in validation/test_step and add instance variables instead for heteronym_classification.py



* Remove outputs from on_test_epoch_end in DialogueGPTClassificationModel



* Add validation/test outputs in sgdqa_model and modify dialogue_config.yaml



* Add split arg self.test_step_outputs to TextClassificationModel



* Add test_step_outputs to dialogue and text classification models



* Change condition check for multiple dataloaders:

1) Replace ds_item as list in dialogue_config.yaml
2) Check for len of val/test_dataloaders or validation/test_dl along with type check of list in sgdqa_model.py while appending outputs of validation/test_step
3) Check for len of _validation/test_dl for creating self.validation/test_step_outputs in ModelPT and punctuation_capitalization_model.py



* Add additional condition for multi dataloaders

Check len(self.trainer.val/test_dataloaders) > 1 along with type(self.trainer.val/test_dataloaders) == list for multi dataloaders in validation/test_step



* Add val step outputs and default val for dataloader_idx

1) Append validation_step output to self.validation_step_outputs in MultiLabelIntentSlotClassificationModel
2) Add default val for dataloader_idx for on_test_batch_start/end in TimingCallback
3) Add self.validation/test_step_outputs in BERTQAModel and remove outputs arg



* Add val/test_step_outputs to S2SQAModel and GPTQAModel



* Edit Jenkinsfile for bert_pretraining.py

Edit Jenkinsfile for this test to disable validation as a workaround for trainer.val_dataloader None error



* Modify precision to support 16-mixed, bf16-mixed in megatron_gpt_pretraining.py



* Add ddp_find_unused_parameters_true and remove output args

1) Add ddp_find_unused_parameters_true for trainer.strategy in self_alignment_pretraining.py as it has unused parameters
2) Remove output args and add self.validation/test_step_outputs to validation/test_step in mt_enc_dec_model.py
3) Comment tests in JenkinsFile that need to be fixed



* Precision fix in megatron_nmt_training.py for 16-mixed, bf16-mixed



* Precision fix for megatron_bert_pretraining.py and megatron_bert_model.py



* Precision fix and validation/test_step_outputs

1) Add fix to account for 16-mixed and bf16-mixed in megatron_retro_mutransfer_pretrain.py, megatron_retro_pretraining.py
2) Reset ckpt_path for test in enc_dec_nmt.py
3) Remove outputs args and add validation/test_step_outputs in megatron_retrieval_model.py
4) Comment Megatron Bert Pretraining and Resume Training with Pipeline Parallelism and add back NMT Training Post-LN



* Precision fix and skip few failing tests



* Add missing comment lines in JenkinsFile



* Comment Jenkins tests and super().on_validation_epoch_end() in megatron_gpt_sft_model.py



* Minor edit JenkinsFile



* Minor edit in jenkins file



* Edit in Jenkins file



* Comment missed lines in Jenkins file



* Fix precision and validation/test outputs

1) Add precision fix to account for 16-mixed and bf16-mixed in megatron_t5_pretraining.py
2) Remove outputs args and add append loss to self.validation/test_step_outputs in megatron_lm_encoder_decoder_model.py
3) Add back resume_from_checkpoint in the megatron_t5_config.yaml
4) Comment out certain tests in Jenkins file



* Fix precision and validation/test/predict errors in megatron_t5_prompt_learning.py



* Precision fix and edit precision typo in all files

1) Account for 16-mixed and bf16-mixed in megatron_bart_pretraining.py and megatron_t5_seq2seq_finetune.py
2) Fix precision typo in all files



* Fix all CI TTS tests and comment few Jenkins tests



* Combine xx_epoch_end and on_xx_epoch_end

Add on_inference_epoch_end to inference_epoch_end function and have a single on_validation/test_epoch_end in megatron_finetune_model.py and megatron_gpt_sft_model.py



* Add a missing comment in JenkinsFile



* Add try except StopIteration in validation_step for models with dataloader_iter



* Remove pyyaml from requirements



* Add try except for inference_step in megatron_finetune_model.py



* Remove limit_val_batches for mockGPTDataset test



* Add new self.validation_step_outputs for MegatronGPTSFTModel



* Minor edit Jenkinsfile



* Initialize self.validation/test_step_outputs in megatron_gpt_sft_model.py

Initialize self.validation/test_step_outputs in setup of MegatronGPTSFTModel to handle cases where dataloaders are not set up in ModelPT, for example while restoring the model.



* Remove resume_from_checkpoint if trainer arg in conf yaml files



* Remove resume_from_checkpoint as trainer arg in GPT, T5 configs



* Remove resume_from_checkpoint in duplex_tn_config.yaml



* Fix typos, unused imports and refactor code to remove redundant funcs



* Remove commented code in megatron_nmt_model.py



* Fix overridden functions to match parent class functions



* Prefetch dataloader_iter to prevent hang for PP>1



* Override setup() in NLPDDPStrategy to avoid hang during predict with PP>1



* Uncomment tests in JenkinsFile



* Add '16' to precision checks and other minor fixes



* Clear validation/test_step_outputs with dataloader_idx for multi dataloaders



* Minor edits



* Modify precision checks to avoid indexing



* Remove self.validation_step_outputs_sft and add dataloader_idx to clear outputs



* Reference checkpoint with trainer.ckpt_path



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add _prefetch to NLPModel and minor fixes



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add limit_val_batches in JenkinsFile for NMT

1) Add trainer.limit_val_batches in Megatron NMT Training TP=2
2) Remove unused import in ModelPT



---------
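
Several items above touch precision checks (accounting for 16-mixed and bf16-mixed, adding '16', and avoiding indexing). A minimal sketch of that kind of check follows. The function name `uses_half_precision` and the accepted values `16` (int) and `"bf16"` are assumptions made for illustration; this is not the code from the commit.

```python
# Hypothetical helper, not taken from NeMo: illustrates a precision check
# that compares against whole values instead of indexing into the setting.
HALF_PRECISION_VALUES = (16, "16", "16-mixed", "bf16", "bf16-mixed")


def uses_half_precision(precision):
    """Return True if the trainer precision setting denotes fp16 or bf16.

    The setting may arrive as an int (16) or a string ("16-mixed"), so a
    membership test is used; indexing such as precision[0:2] would fail
    on an int.
    """
    return precision in HALF_PRECISION_VALUES


if __name__ == "__main__":
    for value in (16, "16", "16-mixed", "bf16-mixed", 32, "32-true"):
        print(value, uses_half_precision(value))
```

A membership test keeps the old-style and new-style precision names in one place, so adding another accepted value is a one-line change.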





* Include the scripts for preprocessing OAST and unit tests for chat sft datasets (NVIDIA#7112)

* scripts for sft



* fix style



* added special token only for huggingface model



* change default name



* print out error datapoint content



* show error id



* annotation script working



* try to be compatible with huggingface tokenizer



* added examples



* added lang



* added lang



* text to value special case



* configure the slider



* annotation handles lang



* added the unit test for chat sft dataset



* used the file in the test dir



* fix json error



* load local tokenizer



* remove mask count check



* added HF dataset backend



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------





* add paths to labeler. (NVIDIA#7087)




* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Adi Renduchintala <adithyar…

Signed-off-by: Daniel Egert <degert@nvidia.com>
Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: Ryan <rlangman@nvidia.com>
Signed-off-by: Kim Ngo <6362111+findkim@users.noreply.github.com>
Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: arendu <adithya.r@gmail.com>
Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
Signed-off-by: Ante Jukić <ajukic@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Dmytro Pykhtar <dpykhtar@nvidia.com>
Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>
Signed-off-by: Nikolay Karpov <karpnv@gmail.com>
Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
Signed-off-by: Evelina <ebakhturina@nvidia.com>
Signed-off-by: Taejin Park <tango4j@gmail.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: jubick1337 <mattyson.so@gmail.com>
Signed-off-by: tbartley94 <tbartley@nvidia.com>
Signed-off-by: Aleksandr Laptev <alaptev@nvidia.com>
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Vitaly Lavrukhin <vlavrukhin@nvidia.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: sam1373 <samuelkriman@gmail.com>
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
Signed-off-by: fayejf <fayejf07@gmail.com>
Signed-off-by: Somshubra Majumdar <titu1994@gmail.com>
Signed-off-by: Jan Beckmann <king-jan1999@hotmail.de>
Signed-off-by: Linnea Pari Leaver <lleaver@lleaver-mlt.client.nvidia.com>
Signed-off-by: Xin Yao <yaox12@outlook.com>
Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com>
Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>
Signed-off-by: smajumdar <smajumdar@nvidia.com>
Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>
Signed-off-by: Virginia Adams <vadams@nvidia.com>
Signed-off-by: Vahid <vnoroozi@nvidia.com>
Signed-off-by: David Mosallanezhad <dmosallanezh@nvidia.com>
Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: ekmb <ebakhturina@nvidia.com>
Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>
Signed-off-by: Abhinav Khattar <aklife97@gmail.com>
Signed-off-by: Micha Livne <mlivne@nvidia.com>
Signed-off-by: Dima Rekesh <bmwshop@gmail.com>
Signed-off-by: Jim O’Regan <jaoregan@tcd.ie>
Signed-off-by: Mostafa Ghorbandoost <mos.ghorbandoost@gmail.com>
Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Signed-off-by: Kunal Dhawan <kunaldhawan97@gmail.com>
Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>
Signed-off-by: Andrei Andrusenko <52885736+andrusenkoau@users.noreply.github.com>
Signed-off-by: KunalDhawan <kunaldhawan97@gmail.com>
Signed-off-by: Greg Clark <grclark@nvidia.com>
Signed-off-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Jan Baczek <jbaczek@nvidia.com>
Signed-off-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com>
Signed-off-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Signed-off-by: eharper <eharper@nvidia.com>
Signed-off-by: jasonwan <jasonwan@nvidia.com>
Signed-off-by: Maanu Grover <maanug@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Siddharth Tyagi <styagi130@gmail.com>
Signed-off-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>
Signed-off-by: Jason Wang <jasonwan@nvidia.com>
Signed-off-by: arendu <adithyare@nvidia.com>
Signed-off-by: Alireza Morsali <alireza.mors@gmail.com>
Signed-off-by: Siddharth Tyagi <siddhartht@nvidia.com>
Signed-off-by: dorotat <dorotat@nvidia.com>
Signed-off-by: mburchi <maxime.burchi@gmail.com>
Signed-off-by: Maxime Burchi <60737204+burchim@users.noreply.github.com>
Signed-off-by: Adi Renduchintala <adithya.r@gmail.com>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: Xin Yao <xiny@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Alexander Jipa <azzhipa@amazon.com>
Signed-off-by: omahs <73983677+omahs@users.noreply.github.com>
Signed-off-by: lhb8125 <lhb8125@gmail.com>
Signed-off-by: Robin Dong <robin.k.dong@gmail.com>
Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>
Signed-off-by: Sangkug Lym <slym@nvidia.com>
Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
Signed-off-by: Anton Peganov <apeganov@nvidia.com>
Signed-off-by: Samuele Cornell <cornellsamuele@gmail.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Signed-off-by: Tamerlan Tabolov <tktabolov@gmail.com>
Signed-off-by: zhehuaichen <139396994+zhehuaichen@users.noreply.github.com>
Co-authored-by: trias702 <25867060+trias702@users.noreply.github.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ryan Langman <rlangman@nvidia.com>
Co-authored-by: Kim Ngo <6362111+findkim@users.noreply.github.com>
Co-authored-by: David <amosalla@asu.edu>
Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Co-authored-by: Adi Renduchintala <adithyare@nvidia.com>
Co-authored-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com>
Co-authored-by: anteju <108555623+anteju@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Co-authored-by: Vladimir Bataev <vbataev@nvidia.com>
Co-authored-by: Nikolay Karpov <karpnv@gmail.com>
Co-authored-by: bene-ges <antonova_sasha@list.ru>
Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Co-authored-by: Taejin Park <tango4j@gmail.com>
Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>
Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com>
Co-authored-by: Matvei Novikov <mattyson.so@gmail.com>
Co-authored-by: tbartley94 <90423858+tbartley94@users.noreply.github.com>
Co-authored-by: Aleksandr Laptev <alaptev@nvidia.com>
Co-authored-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Co-authored-by: Vitaly Lavrukhin <vlavrukhin@nvidia.com>
Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com>
Co-authored-by: Vahid Noroozi <VahidooX@users.noreply.github.com>
Co-authored-by: Samuel Kriman <samuelkriman@gmail.com>
Co-authored-by: Boris Fomitchev <borisfom@users.noreply.github.com>
Co-authored-by: Jan Beckmann <king-jan1999@hotmail.de>
Co-authored-by: lleaver <137942999+lleaver@users.noreply.github.com>
Co-authored-by: Linnea Pari Leaver <lleaver@lleaver-mlt.client.nvidia.com>
Co-authored-by: Xin Yao <yaox12@outlook.com>
Co-authored-by: anmolgupt <14880251+anmolgupt@users.noreply.github.com>
Co-authored-by: ANMOL GUPTA <anmolg@nvidia.com>
Co-authored-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com>
Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Co-authored-by: Jocelyn <jocelynh@nvidia.com>
Co-authored-by: bene-ges <61418381+bene-ges@users.noreply.github.com>
Co-authored-by: Alexandra Antonova <aleksandraa@nvidia.com>
Co-authored-by: Virginia Adams <78445382+vadam5@users.noreply.github.com>
Co-authored-by: Zhilin Wang <wangzhilin12061996@hotmail.com>
Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com>
Co-authored-by: Ante Jukić <ajukic@nvidia.com>
Co-authored-by: David Mosallanezhad <dmosallanezh@nvidia.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Yang Zhang <yzhang123@users.noreply.github.com>
Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
Co-authored-by: Neha Tadimeti <ntadimeti@nvidia.com>
Co-authored-by: Abhinav Khattar <aklife97@gmail.com>
Co-authored-by: Dima Rekesh <bmwshop@gmail.com>
Co-authored-by: Jim O’Regan <jaoregan@tcd.ie>
Co-authored-by: Mostafa Ghorbandoost <mos.ghorbandoost@gmail.com>
Co-authored-by: Dmytro Pykhtar <dpykhtar@nvidia.com>
Co-authored-by: Kunal Dhawan <kunaldhawan97@gmail.com>
Co-authored-by: Andrei Andrusenko <52885736+andrusenkoau@users.noreply.github.com>
Co-authored-by: Greg Clark <grclark@nvidia.com>
Co-authored-by: jbaczek <45043825+jbaczek@users.noreply.github.com>
Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Co-authored-by: Jason Wang <jasonwan@nvidia.com>
Co-authored-by: Maanu Grover <109391026+maanug-nv@users.noreply.github.com>
Co-authored-by: guyueh1 <140554423+guyueh1@users.noreply.github.com>
Co-authored-by: Mariana <47233618+mgrafu@users.noreply.github.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: styagi130 <siddharth.tyagi2015@vit.ac.in>
Co-authored-by: Siddharth Tyagi <siddhartht@nvidia.com>
Co-authored-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Co-authored-by: Alireza Morsali <32244795+AlirezaMorsali@users.noreply.github.com>
Co-authored-by: styagi130 <styagi130@gmail.com>
Co-authored-by: dorotat-nv <115542912+dorotat-nv@users.noreply.github.com>
Co-authored-by: Maxime Burchi <60737204+burchim@users.noreply.github.com>
Co-authored-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Co-authored-by: eharper <eharper@nvidia.com>
Co-authored-by: Hongbin Liu <hongbinl@nvidia.com>
Co-authored-by: Kelvin Liu <lhb8125@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com>
Co-authored-by: Alexander Jipa <alexander.jipa@gmail.com>
Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
Co-authored-by: omahs <73983677+omahs@users.noreply.github.com>
Co-authored-by: Robin Dong <robin.k.dong@gmail.com>
Co-authored-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com>
Co-authored-by: Jimmy Zhang <jiemingz@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: PeganovAnton <apeganov@nvidia.com>
Co-authored-by: Samuele Cornell <cornellsamuele@gmail.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>
Co-authored-by: Jan Lasek <janek.lasek@gmail.com>
Co-authored-by: Tamerlan Tabolov <nektonikto999@gmail.com>


96 people committed Sep 22, 2023

1 parent 0f44a33 commit 206af78

.github/labeler.yml

@@ -3,25 +3,33 @@ ASR:
     - examples/asr/**/*
     - tutorials/asr/**/*
     - docs/source/asr/**/*
+    - tests/collections/asr/**
     NLP:
     - nemo/collections/nlp/**/*
     - examples/nlp/**/*
     - tutorials/nlp/**/*
     - docs/source/nlp/**/*
+    - tests/collections/nlp/**
     Speaker Tasks:
     - examples/speaker_tasks/**/*
     - tutorials/speaker_tasks/**/*
     TTS:
     - nemo/collections/tts/**/*
+    - nemo/collections/common/tokenizers/text_to_speech/**
     - examples/tts/**/*
     - tutorials/tts/**/*
     - docs/source/tts/**/*
+    - scripts/dataset_processing/tts/**
+    - scripts/tts_dataset_files/**
+    - tests/collections/tts/**
+    - tests/collections/common/tokenizers/text_to_speech/**
     core:
     - nemo/core/**/*
+    - tests/core/**
     common:
     - nemo/collections/common/**/*

Dockerfile

            
@@ -14,7 +14,7 @@
     # See the License for the specific language governing permissions and
     # limitations under the License.
-    ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:23.06-py3
+    ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:23.08-py3
     # build an image that includes only the nemo dependencies, ensures that dependencies
     # are included first for optimal caching, and useful for building a development
@@ -45,12 +45,18 @@ RUN apt-get update && \
     WORKDIR /workspace/
     WORKDIR /tmp/
-    # TODO: Remove once this Apex commit (5/12/23) is included in PyTorch
-    # container
+    # Distributed Adam support for multiple dtypes
     RUN git clone https://github.com/NVIDIA/apex.git && \
       cd apex && \
-      git checkout 8b7a1ff183741dd8f9b87e7bafd04cfde99cea28 && \
-      pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./
+      git checkout 52e18c894223800cb611682dce27d88050edf1de && \
+      pip3 install -v --no-build-isolation --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./
+    # install megatron core, this can be removed once 0.3 pip package is released
+    RUN git clone https://github.com/NVIDIA/Megatron-LM.git && \
+      cd Megatron-LM && \
+      git checkout ab0336a5c8eab77aa74ae604ba1e73decbf6d560 && \
+      pip install -e .
     # uninstall stuff from base container
     RUN pip3 uninstall -y sacrebleu torchtext
@@ -76,6 +82,8 @@ RUN for f in $(ls requirements*.txt); do pip3 install --disable-pip-version-chec
     RUN pip install flash-attn
     # pinned triton version for flash-attention https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py#L3
     RUN pip install triton==2.0.0.dev20221202
+    # install numba for latest containers
+    RUN pip install numba>=0.57.1
     # install k2, skip if installation fails
     COPY scripts /tmp/nemo/scripts/
@@ -94,7 +102,7 @@ COPY . .
     # start building the final container
     FROM nemo-deps as nemo
-    ARG NEMO_VERSION=1.20.0
+    ARG NEMO_VERSION=1.21.0
     # Check that NEMO_VERSION is set. Build will fail without this. Expose NEMO and base container
     # version information as runtime environment variable for introspection purposes
