In [1]:
!pip install "torchaudio>=0.10.0"

In [2]:
import nemo
nemo.__version__

'1.16.0'

# Nemo Collections

NeMo is subdivided into a few fundamental collections based on their domains: `asr`, `nlp`, and `tts`. NeMo allows partial imports of one or more collections.

In [5]:
from torch import inf
import nemo.collections.asr as nemo_asr
# import nemo.collections.nlp as nemo_nlp  # problem with torch._six and torch2.0
import nemo.collections.tts as nemo_tts

[NeMo W 2023-03-19 14:05:58 experimental:27] Module <class 'nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.IPATokenizer'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-03-19 14:05:58 experimental:27] Module <class 'nemo.collections.tts.models.radtts.RadTTSModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-03-19 14:05:58 experimental:27] Module <class 'nemo.collections.tts.models.vits.VitsModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.


# NeMo models in Collections

NeMo contains several models for each of its collections, covering common tasks in conversational AI.

In [6]:
asr_models = [model for model in dir(nemo_asr.models) if model.endswith("Model")]
asr_models

['ASRModel',
 'AudioToAudioModel',
 'EncDecCTCModel',
 'EncDecClassificationModel',
 'EncDecDiarLabelModel',
 'EncDecHybridRNNTCTCBPEModel',
 'EncDecHybridRNNTCTCModel',
 'EncDecK2SeqModel',
 'EncDecRNNTBPEModel',
 'EncDecRNNTModel',
 'EncDecSpeakerLabelModel',
 'EncMaskDecAudioToAudioModel',
 'SLUIntentSlotBPEModel',
 'SpeechEncDecSelfSupervisedModel']

In [7]:
tts_models = [model for model in dir(nemo_tts.models) if model.endswith("Model")]
tts_models

['AlignerModel',
 'FastPitchModel',
 'GriffinLimModel',
 'HifiGanModel',
 'MelPsuedoInverseModel',
 'MixerTTSModel',
 'RadTTSModel',
 'SpectrogramEnhancerModel',
 'Tacotron2Model',
 'TwoStagesModel',
 'UnivNetModel',
 'VitsModel',
 'WaveGlowModel']

# The Nemo Model

There are several ways to create these models: we can call the constructor and pass in a config, we can instantiate the model from a pre-trained checkpoint, or we can simply pass a pre-trained model name and instantiate the model directly from the cloud.

Let's try working with an ASR model, [Citrinet](https://arxiv.org/abs/2104.01721).

In [8]:
citrinet = nemo_asr.models.EncDecCTCModelBPE.from_pretrained('stt_en_citrinet_512')

[NeMo I 2023-03-19 14:27:26 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_citrinet_512/versions/1.0.0rc1/files/stt_en_citrinet_512.nemo to /home/mat/.cache/torch/NeMo/NeMo_1.16.0/stt_en_citrinet_512/3262321355385bb7cf5a583146117d77/stt_en_citrinet_512.nemo
[NeMo I 2023-03-19 14:27:59 common:913] Instantiating model from pre-trained checkpoint
[NeMo I 2023-03-19 14:28:01 mixins:170] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2023-03-19 14:28:02 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 32
    trim_silence: true
    max_duration: 16.7
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    
[NeMo W 2023-03-19 14:28:02 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 32
    shuffle: false
    
[NeMo W 2023-03-19 14:28:02 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    T

[NeMo I 2023-03-19 14:28:02 features:286] PADDING: 16
[NeMo I 2023-03-19 14:28:03 save_restore_connector:247] Model EncDecCTCModelBPE was successfully restored from /home/mat/.cache/torch/NeMo/NeMo_1.16.0/stt_en_citrinet_512/3262321355385bb7cf5a583146117d77/stt_en_citrinet_512.nemo.


In [9]:
citrinet.summarize()

  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConvASREncoder                    | 36.3 M
2 | decoder           | ConvASRDecoder                    | 657 K 
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | _wer              | WERBPE                            | 0     
------------------------------------------------------------------------
37.0 M    Trainable params
0         Non-trainable params
37.0 M    Total params
147.977   Total estimated model params size (MB)

# Model Config using OmegaConf

---

First we import [`OmegaConf`](https://omegaconf.readthedocs.io/en/latest/). It is an excellent library used throughout NeMo to make YAML configuration management easier. It also works well with `Hydra`, another library that NeMo uses to perform on-the-fly config edits from the command line, which makes the config files much easier to use.


In [10]:
from omegaconf import OmegaConf

All NeMo models come packaged with their model configuration in the `cfg` attribute. While technically it is meant to be a declaration of the model as it is currently constructed, `cfg` is also an essential tool for modifying the behaviour of the model after it has been constructed. It can be safely used to simplify many essential tasks inside models.

To be on the safe side, we generally work on a copy of the config until we are ready to edit it inside the model.

In [11]:
import copy

In [12]:
cfg = copy.deepcopy(citrinet.cfg)
print(OmegaConf.to_yaml(cfg))

sample_rate: 16000
train_ds:
  manifest_filepath: null
  sample_rate: 16000
  batch_size: 32
  trim_silence: true
  max_duration: 16.7
  shuffle: true
  is_tarred: false
  tarred_audio_filepaths: null
validation_ds:
  manifest_filepath: null
  sample_rate: 16000
  batch_size: 32
  shuffle: false
test_ds:
  manifest_filepath:
  - /home/smajumdar/PycharmProjects/nemo-eval/nemo_eval/librispeech/manifests/dev_other.json
  sample_rate: 16000
  batch_size: 32
  shuffle: false
  num_workers: 12
  pin_memory: true
model_defaults:
  repeat: 5
  dropout: 0.0
  separable: true
  se: true
  se_context_size: -1
tokenizer:
  dir: /home/smajumdar/PycharmProjects/nemo-eval/nemo_beta_eval/asrset/manifests/asrset_1.4/tokenizers/no_appen/tokenizer_spe_unigram_v1024/
  type: bpe
preprocessor:
  _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
  sample_rate: 16000
  normalize: per_feature
  window_size: 0.025
  window_stride: 0.01
  window: hann
  features: 80
  n_fft: 512
  frame_s

## Analysing the contents of the Model config
---

NeMo models contain the entire definition of the neural network, as well as most of the surrounding infrastructure needed to support that model.

Citrinet's config contains:
- `preprocessor` - MelSpectrogram preprocessing layer
- `encoder` - The acoustic encoder model
- `decoder` - The CTC decoder model
- `optim` (and potentially `sched`) - Optimizer configuration, which can optionally include scheduler info
- `spec_augment` - Spectrogram Augmentation support
- `train_ds`,`validation_ds`, and `test_ds` - Dataset and data loader construction information

## Modifying the contents of the Model config
---
Say we want to experiment with a different preprocessor, or we want to add a scheduler to this model during training. \
OmegaConf makes this a very simple task.

In [13]:
# In struct mode OmegaConf won't allow adding new config items, so we temporarily disable this safeguard
OmegaConf.set_struct(cfg, False)

# Let's see the old optim config 
print("Old config: ")
print(OmegaConf.to_yaml(cfg.optim))

Old config: 
name: novograd
lr: 0.05
betas:
- 0.8
- 0.25
weight_decay: 0.001
sched:
  name: CosineAnnealing
  warmup_steps: 1000
  warmup_ratio: null
  min_lr: 1.0e-09
  last_epoch: -1



In [14]:
sched = {'name': 'CosineAnnealing', 'warmup_steps': 1000, 'min_lr': 1e-6}
sched = OmegaConf.create(sched)

# Assign it to cfg.optim.sched namespace
cfg.optim.sched = sched

# Let's see the new optim config
print("New Config: ")
print(OmegaConf.to_yaml(cfg.optim))

# Here, we restore the safeguards so no more additions can be made to the config
OmegaConf.set_struct(cfg, True)

New Config: 
name: novograd
lr: 0.05
betas:
- 0.8
- 0.25
weight_decay: 0.001
sched:
  name: CosineAnnealing
  warmup_steps: 1000
  min_lr: 1.0e-06



## Updating the model from the config
---
NeMo models can be updated in a few ways, but we follow similar patterns within each collection to maintain consistency.

Here, we will show the two most common ways to modify core components of the model: using the `from_config_dict` method, and updating a few special parts of the model.

## Update model using `from_config_dict`

In certain config files, you will notice the following pattern:

```yaml
preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    normalize: per_feature
    window_size: 0.02
    sample_rate: 16000
    window_stride: 0.01
    window: hann
    features: 64
    n_fft: 512
    frame_splicing: 1
    dither: 1.0e-05
    stft_conv: false
```

You might ask why we are using `_target_`. The preprocessor, encoder, decoder, and perhaps a few other components are rarely changed from the command line when experimenting. To keep these settings stable, we enforce that the preprocessor for this model is always of type `AudioToMelSpectrogramPreprocessor` by setting the `_target_` attribute in the config. To pass parameters to the class constructor, we simply add them after `_target_`.


---
Note that we can still change all of the parameters of this `AudioToMelSpectrogramPreprocessor` class from the CLI using Hydra, so we don't lose any flexibility once we decide what type of preprocessing class we want.
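
To make the role of `_target_` concrete, here is a minimal standard-library sketch of the idea: import the class named by `_target_` and pass the remaining keys to its constructor. The helper `instantiate_from_target` is hypothetical and is not NeMo's or Hydra's implementation; `fractions.Fraction` stands in for a NeMo module so the example runs without NeMo installed.

```python
import importlib


def instantiate_from_target(config):
    """Build an object from a dict that has a dotted `_target_` class path."""
    kwargs = dict(config)  # copy so the caller's config is not modified
    module_path, _, class_name = kwargs.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**kwargs)


# `_target_` fixes the class; the other keys become constructor arguments.
demo_config = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
obj = instantiate_from_target(demo_config)
print(type(obj).__name__, obj)
```

Changing `numerator` or `denominator` changes the constructed object, while the class stays pinned by `_target_`, which mirrors how the preprocessor parameters stay editable.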


In [16]:
new_preprocessor_config = copy.deepcopy(cfg.preprocessor)
new_preprocessor = citrinet.from_config_dict(new_preprocessor_config)
print(new_preprocessor)

[NeMo I 2023-03-19 16:14:37 features:286] PADDING: 16
AudioToMelSpectrogramPreprocessor(
  (featurizer): FilterbankFeatures()
)


In [17]:
citrinet.preprocessor = new_preprocessor
citrinet.summarize()

  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConvASREncoder                    | 36.3 M
2 | decoder           | ConvASRDecoder                    | 657 K 
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | _wer              | WERBPE                            | 0     
------------------------------------------------------------------------
37.0 M    Trainable params
0         Non-trainable params
37.0 M    Total params
147.977   Total estimated model params size (MB)

## Preserving the new config

We need to perform one more crucial step: **preserving the updated config**.
NeMo has many ways of saving and restoring its models. All of them depend on having an up-to-date config that defines the model in its entirety, so if we modify anything, we should also update the corresponding part of the config to safely save and restore models.

In [None]:
# update the config copy
cfg.preprocessor = new_preprocessor_config
# update the model config
citrinet.cfg = cfg

## Update a few special components of the Model

---

While the above approach is good for most major components of the model, NeMo has special utilities for a few components. They are:

- `setup_training_data`
- `setup_validation_data` and `setup_multiple_validation_data`
- `setup_test_data` and `setup_multiple_test_data`
- `setup_optimization`

One of the major tasks for all conversational AI models is fine-tuning on new datasets: new languages, new text corpora, new voices, and so on. A pre-trained model alone is often insufficient, so these setup methods are provided to let users adapt models after they have been trained or handed over.

Let's see how to update the scheduler in the model below (the pre-trained config already includes a `CosineAnnealing` schedule, which we modified in our config copy above).

In [18]:
# Let's print out the current optimizer
print(OmegaConf.to_yaml(citrinet.cfg.optim))

name: novograd
lr: 0.05
betas:
- 0.8
- 0.25
weight_decay: 0.001
sched:
  name: CosineAnnealing
  warmup_steps: 1000
  warmup_ratio: null
  min_lr: 1.0e-09
  last_epoch: -1



In [None]:
# Now let's set up the optimizer and scheduler from the modified config
citrinet.setup_optimization(cfg.optim)