References: 

NeMo:


*   https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/asr/01_ASR_with_NeMo.ipynb#scrollTo=7mP4r1Gx_Ilt
*   https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/tools/CTC_Segmentation_Tutorial.ipynb#scrollTo=hRFAl0gO92bp
*   https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/intro.html

Please note that our work with these notebooks has already led to improvements in NeMo:
*   https://github.com/NVIDIA/NeMo/issues/2217#issuecomment-841738358
*   https://github.com/NVIDIA/NeMo/issues/2208







### Loading dependencies and libraries

We load all dependencies from NeMo

In [None]:
BRANCH = 'r1.0.0rc1'
"""
You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell.
# install NeMo
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]
import json
import os
import wget

from IPython.display import Audio
import numpy as np
import scipy.io.wavfile as wav

! pip install pandas

# optional
! pip install plotly
from plotly import graph_objects as go

In [None]:
# If you're running the notebook locally, update the TOOLS_DIR path below
# In Colab, a few required scripts will be downloaded from NeMo github

import wget
TOOLS_DIR = '<UPDATE_PATH_TO_NeMo_root>/tools/ctc_segmentation/scripts'

if 'google.colab' in str(get_ipython()):
    TOOLS_DIR = 'scripts/'
    os.makedirs(TOOLS_DIR, exist_ok=True)

    required_files = ['prepare_data.py',
                    'normalization_helpers.py',
                    'run_ctc_segmentation.py',
                    'verify_segments.py',
                    'cut_audio.py',
                    'process_manifests.py',
                    'utils.py']
    for file in required_files:
        if not os.path.exists(os.path.join(TOOLS_DIR, file)):
            file_path = 'https://raw.githubusercontent.com/NVIDIA/NeMo/' + BRANCH + '/tools/ctc_segmentation/' + TOOLS_DIR + file
            print(file_path)
            wget.download(file_path, TOOLS_DIR)
elif not os.path.exists(TOOLS_DIR):
    raise ValueError('Update TOOLS_DIR to point to your NeMo root directory')

https://raw.githubusercontent.com/NVIDIA/NeMo/r1.0.0rc1/tools/ctc_segmentation/scripts/prepare_data.py
https://raw.githubusercontent.com/NVIDIA/NeMo/r1.0.0rc1/tools/ctc_segmentation/scripts/normalization_helpers.py
https://raw.githubusercontent.com/NVIDIA/NeMo/r1.0.0rc1/tools/ctc_segmentation/scripts/run_ctc_segmentation.py
https://raw.githubusercontent.com/NVIDIA/NeMo/r1.0.0rc1/tools/ctc_segmentation/scripts/verify_segments.py
https://raw.githubusercontent.com/NVIDIA/NeMo/r1.0.0rc1/tools/ctc_segmentation/scripts/cut_audio.py
https://raw.githubusercontent.com/NVIDIA/NeMo/r1.0.0rc1/tools/ctc_segmentation/scripts/process_manifests.py
https://raw.githubusercontent.com/NVIDIA/NeMo/r1.0.0rc1/tools/ctc_segmentation/scripts/utils.py


We download the configuration for the pre-trained model. Note that I then manually alter the file to change the learning rate to .0001 and the batch size to 8. The weight decay can be altered the same way. A full discussion of all our experiments is in our report.

In [None]:
## Grab the config we'll use in this example
!mkdir configs
!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/config.yaml

--2021-06-03 23:10:50--  https://raw.githubusercontent.com/NVIDIA/NeMo/r1.0.0rc1/examples/asr/conf/config.yaml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4040 (3.9K) [text/plain]
Saving to: ‘configs/config.yaml’


2021-06-03 23:10:51 (45.2 MB/s) - ‘configs/config.yaml’ saved [4040/4040]



We load the config into a dictionary:

In [None]:
# --- Config Information ---#
try:
    from ruamel.yaml import YAML
except ModuleNotFoundError:
    from ruamel_yaml import YAML
config_path = './configs/config.yaml'

yaml = YAML(typ='safe')
with open(config_path) as f:
    params = yaml.load(f)
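
The manual edits to the config file described earlier (learning rate, batch size, weight decay) could also be applied in code once the config is loaded. The following is a minimal sketch, not the method we used: it assumes the config keeps the optimizer settings under `model.optim` and the training batch size under `model.train_ds`, and the `apply_overrides` helper and `example_params` dictionary are hypothetical stand-ins for illustration.

```python
import copy


def apply_overrides(params, lr=1e-4, batch_size=8, weight_decay=1e-3):
    """Return a copy of the config dict with fine-tuning overrides applied."""
    updated = copy.deepcopy(params)  # leave the loaded config untouched
    updated['model']['optim']['lr'] = lr
    updated['model']['optim']['weight_decay'] = weight_decay
    updated['model']['train_ds']['batch_size'] = batch_size
    return updated


# Hypothetical stand-in for the dictionary loaded from configs/config.yaml
# (assumed key layout; check it against the real file before relying on it).
example_params = {
    'model': {
        'optim': {'lr': 0.01, 'weight_decay': 0.001},
        'train_ds': {'batch_size': 32},
    }
}

tuned = apply_overrides(example_params)
print(tuned['model']['optim']['lr'], tuned['model']['train_ds']['batch_size'])
```

Working on a copy keeps the original values available for comparison between experiments.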

We load the pre-trained model and point it at our train and test sets

In [None]:
# NeMo's "core" package
import nemo
# NeMo's ASR collection - this collections contains complete ASR models and
# building blocks (modules) for ASR
import nemo.collections.asr as nemo_asr
from omegaconf import DictConfig
quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...


[NeMo W 2021-06-03 23:16:00 experimental:28] Module <class 'nemo.collections.asr.data.audio_to_text_dali.AudioToCharDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
################################################################################
###          (please add 'export KALDI_ROOT=<your_path>' in your $HOME/.profile)
###          (or run as: KALDI_ROOT=<your_path> python <your_script>.py)
################################################################################

    
    "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
    
    


[nltk_data]   Unzipping corpora/cmudict.zip.
[NeMo I 2021-06-03 23:16:00 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemospeechmodels/versions/1.0.0a5/files/QuartzNet15x5Base-En.nemo to /root/.cache/torch/NeMo/NeMo_1.0.0rc1/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo
[NeMo I 2021-06-03 23:16:07 common:654] Instantiating model from pre-trained checkpoint
[NeMo I 2021-06-03 23:16:08 features:240] PADDING: 16
[NeMo I 2021-06-03 23:16:08 features:256] STFT using torch
[NeMo I 2021-06-03 23:16:12 modelPT:376] Model EncDecCTCModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.0.0rc1/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo.


**Legacy code**

The code below changes the dropout of the pre-trained model from 0.0 to 0.2. We found that this degraded rather than improved performance, so do not run it.

In [None]:
import copy
from omegaconf import OmegaConf
cfg = copy.deepcopy(quartznet.cfg)
print(len(cfg['encoder']['jasper']))
for i in range(len(cfg['encoder']['jasper'])):  # one entry per encoder block
  cfg['encoder']['jasper'][i]['dropout'] = 0.2
print(OmegaConf.to_yaml(cfg))
quartznet2 = quartznet.from_config_dict(cfg)

### Training

The training and test manifests are created in the "FINAL - OyezDataPrep" notebook. Here we are training on two years of hot text and one year of paired real text. This roughly corresponds to the 30% data mix recommended by the Amazon researchers cited in our report. We use our validation set here to validate every few epochs.

In [None]:
params['model']['train_ds']['manifest_filepath'] = '/content/drive/MyDrive/Colab Notebooks/TTS_manifests/train_for_test_with_paired.json'
params['model']['validation_ds']['manifest_filepath'] = '/content/drive/MyDrive/Colab Notebooks/paired_full/dev_manifest_final.json'

Google Colab can throw unanticipated CUDA memory errors, but shrinking the batch size usually solves this. We also filed a bug report with NVIDIA after noticing a memory leak in the example code. They appreciated the report and have since corrected the leak, which reduces CUDA memory issues.
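
The "shrink the batch size" remedy can be automated with a small retry loop. This is only a sketch under stated assumptions: `fit_with_backoff` and `fake_fit` are hypothetical names, and the check relies on PyTorch reporting out-of-memory conditions as a `RuntimeError` whose message contains "out of memory". In the notebook, the callback would need to rebuild the training data loader with the new batch size before calling `trainer.fit`.

```python
def fit_with_backoff(fit_fn, batch_size, min_batch_size=1):
    """Call fit_fn(batch_size), halving the batch size after each CUDA OOM error."""
    while True:
        try:
            fit_fn(batch_size)
            return batch_size  # the batch size that finally worked
        except RuntimeError as err:
            is_oom = 'out of memory' in str(err).lower()
            if not is_oom or batch_size <= min_batch_size:
                raise  # unrelated error, or nothing left to shrink
            batch_size = max(min_batch_size, batch_size // 2)
            print(f'CUDA out of memory; retrying with batch_size={batch_size}')


# Stand-in for a real training call: pretends anything above 8 does not fit.
def fake_fit(batch_size):
    if batch_size > 8:
        raise RuntimeError('CUDA out of memory. Tried to allocate ...')


chosen = fit_with_backoff(fake_fit, batch_size=32)
print(chosen)
```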

Now we train the model and save it. Right now the configuration is:

*   Learning rate of .0001 (smaller than the NVIDIA-recommended .001)
*   Dropout of 0.0
*   5 or 10 epochs of training
*   Batch size of 8 with 16-bit AMP mixed-precision training (avoids the CUDA error and speeds up training)
*   Weight decay at the default of .001 (we experimented with values as high as .01 and as low as .0005; our TTS-only model improved at .003, but all other models performed best with .001)

Note that we use our params dictionary to load our config into the quartznet model

In [None]:
# Point to the data we'll use for fine-tuning as the training set
import pytorch_lightning as pl
quartznet.setup_optimization(optim_config=params['model']['optim'])
quartznet.setup_training_data(train_data_config=params['model']['train_ds'])

# Point to the new validation data for fine-tuning
quartznet.setup_validation_data(val_data_config=params['model']['validation_ds'])
#quartznet2.setup_finetune_model(params['model'])
# And now we can create a PyTorch Lightning trainer and call `fit` again.
trainer = pl.Trainer(gpus=1, amp_level='O1',precision=16, max_epochs=5)
trainer.fit(quartznet)

We store the lightning training logs in Google Drive

In [None]:
!mv '/content/lightning_logs' '/content/drive/MyDrive/Colab Notebooks/lightning_logs/save_final'

We save the trained model to Google Drive

In [None]:
quartznet.save_to('/content/drive/MyDrive/Colab Notebooks/save_final_model_here.nemo')

### Traditional evaluation: Word Error Rate


We first save the pretrained model to Drive for our comparisons

In [None]:
pretrained = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")
pretrained.save_to('/content/drive/MyDrive/Colab Notebooks/pretrained_here.nemo')

[NeMo I 2021-06-03 23:27:30 cloud:56] Found existing object /root/.cache/torch/NeMo/NeMo_1.0.0rc1/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo.
[NeMo I 2021-06-03 23:27:30 cloud:62] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.0.0rc1/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo
[NeMo I 2021-06-03 23:27:30 common:654] Instantiating model from pre-trained checkpoint
[NeMo I 2021-06-03 23:27:31 features:240] PADDING: 16
[NeMo I 2021-06-03 23:27:31 features:256] STFT using torch
[NeMo I 2021-06-03 23:27:32 modelPT:376] Model EncDecCTCModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.0.0rc1/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo.


In [None]:
!git clone https://github.com/NVIDIA/NeMo -b "$BRANCH"

Cloning into 'NeMo'...
remote: Enumerating objects: 51068, done.
remote: Counting objects: 100% (1688/1688), done.
remote: Compressing objects: 100% (728/728), done.
remote: Total 51068 (delta 1082), reused 1434 (delta 957), pack-reused 49380
Receiving objects: 100% (51068/51068), 131.86 MiB | 26.57 MiB/s, done.
Resolving deltas: 100% (35657/35657), done.


We can compare performance of the working model (in WER) against the pre-trained model, which has an error rate of .10 on our test set

In [None]:
!python /content/NeMo/examples/asr/speech_to_text_infer.py \
--asr_model='/content/drive/MyDrive/Colab Notebooks/pretrained_here.nemo' \
--dataset='/content/drive/MyDrive/Colab Notebooks/paired_full/test_manifest_final.json' \

2021-06-03 23:27:42.170293: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Package cmudict is already up-to-date!
[NeMo W 2021-06-03 23:27:43 experimental:28] Module <class 'nemo.collections.asr.data.audio_to_text_dali.AudioToCharDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
################################################################################
###          (please add 'export KALDI_ROOT=<your_path>' in your $HOME/.profile)
###          (or run as: KALDI_ROOT=<your_path> python <your_script>.py)
################################################################################

      '"sox"

We get the Word Error Rate of the current model on the test set



In [None]:
!python /content/NeMo/examples/asr/speech_to_text_infer.py \
--asr_model='/content/drive/MyDrive/Colab Notebooks/TTS_final.nemo' \
--dataset='/content/drive/MyDrive/Colab Notebooks/paired_full/test_manifest_final.json' \

2021-06-03 23:59:04.463296: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Package cmudict is already up-to-date!
[NeMo W 2021-06-03 23:59:06 experimental:28] Module <class 'nemo.collections.asr.data.audio_to_text_dali.AudioToCharDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
################################################################################
###          (please add 'export KALDI_ROOT=<your_path>' in your $HOME/.profile)
###          (or run as: KALDI_ROOT=<your_path> python <your_script>.py)
################################################################################

      '"sox"

We compute the word error rate on the training data to check for overfitting (technically, this is the "test set" run, so it's too late to change any hyperparameters, but it is still worth showing). Interestingly, we appear to be slightly underfitting the data (lower error on the test set), so more epochs may have been helpful, but we are considering the hyperparameters locked at this point. Finally, all the models seemed to perform best on our test set, so it may simply have had easier examples, as Professor Ng discussed in lecture.

In [None]:
!python /content/NeMo/examples/asr/speech_to_text_infer.py \
--asr_model='/content/drive/MyDrive/Colab Notebooks/TTS_final.nemo' \
--dataset='/content/drive/MyDrive/Colab Notebooks/TTS_manifests/train_for_test_with_paired.json' \

2021-06-04 00:00:56.234672: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Package cmudict is already up-to-date!
[NeMo W 2021-06-04 00:00:57 experimental:28] Module <class 'nemo.collections.asr.data.audio_to_text_dali.AudioToCharDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
################################################################################
###          (please add 'export KALDI_ROOT=<your_path>' in your $HOME/.profile)
###          (or run as: KALDI_ROOT=<your_path> python <your_script>.py)
################################################################################

      '"sox"

### Custom evaluation: Hot Text WER, Intersection over Union, Percentage of Target Vocabulary, and transcript analysis

Finally, we modified NVIDIA's inference script to add two new metrics we created: Hot Text WER and Cold Text WER. These split our test set in two. Hot Text WER is computed over the examples in which every word appears in the "hot text". Cold Text WER is computed over the examples containing at least one word that the model trained on "hot text" did not see in the TTS data. The Hot Text WER is the "mirror image" of the Amazon paper's "OOV WER", which looks at examples the pre-trained model wouldn't have seen. Since we are focusing on how hot text improves performance, we created this mirror image to use instead.
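
To make the split concrete, here is a toy, self-contained illustration of the hot/cold partition and of a corpus-level WER (total word edits divided by total reference words). It is a sketch only: the helper names and the two example sentences are made up for illustration, and the edit distance is a plain Levenshtein implementation rather than NeMo's `word_error_rate`.

```python
def word_edit_distance(hyp_words, ref_words):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(ref_words) + 1))
    for i, h in enumerate(hyp_words, start=1):
        cur = [i]
        for j, r in enumerate(ref_words, start=1):
            cost = 0 if h == r else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def split_hot_cold(hypotheses, references, hot_vocab):
    """Hot: every reference word is in hot_vocab. Cold: at least one is not."""
    hot, cold = [], []
    for hyp, ref in zip(hypotheses, references):
        bucket = hot if all(w in hot_vocab for w in ref.split()) else cold
        bucket.append((hyp, ref))
    return hot, cold


def corpus_wer(pairs):
    """Total word edits divided by total reference words."""
    edits = sum(word_edit_distance(h.split(), r.split()) for h, r in pairs)
    words = sum(len(r.split()) for _, r in pairs)
    return edits / words if words else float('nan')


# Made-up example data (not from our manifests)
references = ['the court is in session', 'justice sotomayor dissents']
hypotheses = ['the court is in session', 'justice todmayor dissents']
hot_vocab = {'the', 'court', 'is', 'in', 'session', 'justice', 'dissents'}

hot, cold = split_hot_cold(hypotheses, references, hot_vocab)
hot_wer, cold_wer = corpus_wer(hot), corpus_wer(cold)
print(hot_wer, cold_wer)
```

Here the second reference lands in the cold bucket because "sotomayor" is missing from the toy hot vocabulary.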

Please note that, to load dependencies properly, we upload the test_to_infer_JS_corrected.py script into the NeMo folder, so it does not work automatically. With that in mind, the full text of the script is below, with our additions marked by comments:



```
# Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This script serves three goals:
    (1) Demonstrate how to use NeMo Models outside of PytorchLightning
    (2) Shows example of batch ASR inference
    (3) Serves as CI test for pre-trained checkpoint
"""

from argparse import ArgumentParser

import torch
import json

from nemo.collections.asr.metrics.wer import WER, word_error_rate
from nemo.collections.asr.models import EncDecCTCModel
from nemo.utils import logging

try:
    from torch.cuda.amp import autocast
except ImportError:
    from contextlib import contextmanager

    @contextmanager
    def autocast(enabled=None):
        yield


can_gpu = torch.cuda.is_available()


def main():
    parser = ArgumentParser()
    parser.add_argument(
        "--asr_model", type=str, default="QuartzNet15x5Base-En", required=True, help="Pass: 'QuartzNet15x5Base-En'",
    )
    parser.add_argument("--dataset", type=str, required=True, help="path to evaluation data")
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--wer_tolerance", type=float, default=1.0, help="used by test")
    parser.add_argument(
        "--dont_normalize_text",
        default=False,
        action='store_true',
        help="Turn off trasnscript normalization. Recommended for non-English.",
    )
    parser.add_argument(
        "--use_cer", default=False, action='store_true', help="Use Character Error Rate as the evaluation metric"
    )
    # We add a "ref vocab" argument: the path to the "hot text" manifest, used to get its vocabulary
    parser.add_argument(
        "--ref_vocab", type=str, required=True
    )
    args = parser.parse_args()
    torch.set_grad_enabled(False)

    if args.asr_model.endswith('.nemo'):
        logging.info(f"Using local ASR model from {args.asr_model}")
        asr_model = EncDecCTCModel.restore_from(restore_path=args.asr_model)
    else:
        logging.info(f"Using NGC cloud ASR model {args.asr_model}")
        asr_model = EncDecCTCModel.from_pretrained(model_name=args.asr_model)
    asr_model.setup_test_data(
        test_data_config={
            'sample_rate': 16000,
            'manifest_filepath': args.dataset,
            'labels': asr_model.decoder.vocabulary,
            'batch_size': args.batch_size,
            'normalize_transcripts': not args.dont_normalize_text,
        }
    )
    if can_gpu:
        asr_model = asr_model.cuda()
    asr_model.eval()
    labels_map = dict([(i, asr_model.decoder.vocabulary[i]) for i in range(len(asr_model.decoder.vocabulary))])
    wer = WER(vocabulary=asr_model.decoder.vocabulary,log_prediction=True)
    hypotheses = []
    references = []
    for test_batch in asr_model.test_dataloader():
        if can_gpu:
            test_batch = [x.cuda() for x in test_batch]
        with autocast():
            log_probs, encoded_len, greedy_predictions = asr_model(
                input_signal=test_batch[0], input_signal_length=test_batch[1]
            )
        hypotheses += wer.ctc_decoder_predictions_tensor(greedy_predictions)
        for batch_ind in range(greedy_predictions.shape[0]):
            seq_len = test_batch[3][batch_ind].cpu().detach().numpy()
            seq_ids = test_batch[2][batch_ind].cpu().detach().numpy()
            reference = ''.join([labels_map[c] for c in seq_ids[0:seq_len]])
            references.append(reference)
        del test_batch
    ref_vocab_set = set()
    with open(args.ref_vocab, 'r') as f_hot:
      for line in f_hot:
        json_line = json.loads(line)
        line_vocab = json_line['text']
        ref_vocab_set.update(line_vocab.split())

    # We added these lists to split the test set into in-vocabulary (hot text) and OOV (cold text) examples
    hypotheses_vocab = []
    references_vocab = []
    hypotheses_OOV = []
    references_OOV = []
    for i in range(len(references)):
      reference_words=references[i].split()
      ref_num = 0
      ref_denom = 0
      for reference_word in reference_words:
        if reference_word in ref_vocab_set:
          ref_num+=1
        ref_denom=len(reference_words)
      if ref_num==ref_denom:
        references_vocab.append(references[i])
        hypotheses_vocab.append(hypotheses[i])
      else:
        hypotheses_OOV.append(hypotheses[i])
        references_OOV.append(references[i])
    wer_value_total = word_error_rate(hypotheses=hypotheses, references=references, use_cer=args.use_cer)
   
    # we calculate and print the full WER, the Hot Text WER and the Cold Text WER

    if len(references_vocab) > 0:
      wer_value_vocab = word_error_rate(hypotheses=hypotheses_vocab, references=references_vocab, use_cer=args.use_cer)
    if len(references_OOV) >0:
      wer_value_OOV = word_error_rate(hypotheses=hypotheses_OOV, references=references_OOV, use_cer=args.use_cer)
    if not args.use_cer:
        if wer_value_total > args.wer_tolerance:
            raise ValueError(f"got wer of {wer_value_total}. it was higher than {args.wer_tolerance}")
        logging.info(f'Got WER of {wer_value_total}. Tolerance was {args.wer_tolerance}')

        if len(references_vocab) > 0:
          logging.info(f'Got hot text WER of {wer_value_vocab}.')
        if len(references_OOV) >0:
          logging.info(f'Got cold text WER of {wer_value_OOV}.')
        
    else:
        logging.info(f'Got CER of {wer_value_total}')


if __name__ == '__main__':
    main()  # noqa pylint: disable=no-value-for-parameter

```





For comparison, NVIDIA's original code is below:



```
# Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This script serves three goals:
    (1) Demonstrate how to use NeMo Models outside of PytorchLightning
    (2) Shows example of batch ASR inference
    (3) Serves as CI test for pre-trained checkpoint
"""

from argparse import ArgumentParser

import torch

from nemo.collections.asr.metrics.wer import WER, word_error_rate
from nemo.collections.asr.models import EncDecCTCModel
from nemo.utils import logging

try:
    from torch.cuda.amp import autocast
except ImportError:
    from contextlib import contextmanager

    @contextmanager
    def autocast(enabled=None):
        yield


can_gpu = torch.cuda.is_available()


def main():
    parser = ArgumentParser()
    parser.add_argument(
        "--asr_model", type=str, default="QuartzNet15x5Base-En", required=True, help="Pass: 'QuartzNet15x5Base-En'",
    )
    parser.add_argument("--dataset", type=str, required=True, help="path to evaluation data")
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--wer_tolerance", type=float, default=1.0, help="used by test")
    parser.add_argument(
        "--dont_normalize_text",
        default=False,
        action='store_true',
        help="Turn off trasnscript normalization. Recommended for non-English.",
    )
    parser.add_argument(
        "--use_cer", default=False, action='store_true', help="Use Character Error Rate as the evaluation metric"
    )
    args = parser.parse_args()
    torch.set_grad_enabled(False)

    if args.asr_model.endswith('.nemo'):
        logging.info(f"Using local ASR model from {args.asr_model}")
        asr_model = EncDecCTCModel.restore_from(restore_path=args.asr_model)
    else:
        logging.info(f"Using NGC cloud ASR model {args.asr_model}")
        asr_model = EncDecCTCModel.from_pretrained(model_name=args.asr_model)
    asr_model.setup_test_data(
        test_data_config={
            'sample_rate': 16000,
            'manifest_filepath': args.dataset,
            'labels': asr_model.decoder.vocabulary,
            'batch_size': args.batch_size,
            'normalize_transcripts': not args.dont_normalize_text,
        }
    )
    if can_gpu:
        asr_model = asr_model.cuda()
    asr_model.eval()
    labels_map = dict([(i, asr_model.decoder.vocabulary[i]) for i in range(len(asr_model.decoder.vocabulary))])
    wer = WER(vocabulary=asr_model.decoder.vocabulary)
    hypotheses = []
    references = []
    for test_batch in asr_model.test_dataloader():
        if can_gpu:
            test_batch = [x.cuda() for x in test_batch]
        with autocast():
            log_probs, encoded_len, greedy_predictions = asr_model(
                input_signal=test_batch[0], input_signal_length=test_batch[1]
            )
        hypotheses += wer.ctc_decoder_predictions_tensor(greedy_predictions)
        for batch_ind in range(greedy_predictions.shape[0]):
            seq_len = test_batch[3][batch_ind].cpu().detach().numpy()
            seq_ids = test_batch[2][batch_ind].cpu().detach().numpy()
            reference = ''.join([labels_map[c] for c in seq_ids[0:seq_len]])
            references.append(reference)
        del test_batch

    wer_value = word_error_rate(hypotheses=hypotheses, references=references, use_cer=args.use_cer)
    if not args.use_cer:
        if wer_value > args.wer_tolerance:
            raise ValueError(f"got wer of {wer_value}. it was higher than {args.wer_tolerance}")
        logging.info(f'Got WER of {wer_value}. Tolerance was {args.wer_tolerance}')
    else:
        logging.info(f'Got CER of {wer_value}')


if __name__ == '__main__':
    main()  # noqa pylint: disable=no-value-for-parameter
```



We manually move our script into the NeMo folder to avoid dependency errors

In [None]:
!mv '/content/drive/MyDrive/Colab Notebooks/speech_to_text_infer_JS_corrected_final.py' '/content/NeMo/examples/asr'

We get Hot Text and Cold Text WER for the pretrained model. Note that we use the dev set rather than the test set here because it lets us compare more models (only some moved on to the test set).

In [None]:
!python /content/NeMo/examples/asr/speech_to_text_infer_JS_corrected_final.py \
--asr_model='/content/drive/MyDrive/Colab Notebooks/pretrained_here.nemo' \
--dataset='/content/drive/MyDrive/Colab Notebooks/paired_full/dev_manifest_final.json' \
--ref_vocab='/content/drive/MyDrive/Colab Notebooks/TTS_manifests/2_TTS_corrected.json'

2021-06-04 01:56:32.855211: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Package cmudict is already up-to-date!
[NeMo W 2021-06-04 01:56:34 experimental:28] Module <class 'nemo.collections.asr.data.audio_to_text_dali.AudioToCharDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
################################################################################
###          (please add 'export KALDI_ROOT=<your_path>' in your $HOME/.profile)
###          (or run as: KALDI_ROOT=<your_path> python <your_script>.py)
################################################################################

      '"sox"

Let's compare that to our experimental model with 2 years of "hot text" and 1 year of paired data

In [None]:
!python /content/NeMo/examples/asr/speech_to_text_infer_JS_corrected_final.py \
--asr_model='/content/drive/MyDrive/Colab Notebooks/2_TTS_1_paired_10_0001_001.nemo' \
--dataset='/content/drive/MyDrive/Colab Notebooks/paired_full/dev_manifest_final.json' \
--ref_vocab='/content/drive/MyDrive/Colab Notebooks/TTS_manifests/2_TTS_corrected.json'

2021-06-04 02:03:00.846680: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Package cmudict is already up-to-date!
[NeMo W 2021-06-04 02:03:02 experimental:28] Module <class 'nemo.collections.asr.data.audio_to_text_dali.AudioToCharDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
################################################################################
###          (please add 'export KALDI_ROOT=<your_path>' in your $HOME/.profile)
###          (or run as: KALDI_ROOT=<your_path> python <your_script>.py)
################################################################################

      '"sox"

Finally, we compare against our "control" model that trained on just that same year of paired data

In [None]:
!python /content/NeMo/examples/asr/speech_to_text_infer_JS_corrected_final.py \
--asr_model='/content/drive/MyDrive/Colab Notebooks/1_paired_10_0001_001.nemo' \
--dataset='/content/drive/MyDrive/Colab Notebooks/paired_full/dev_manifest_final.json' \
--ref_vocab='/content/drive/MyDrive/Colab Notebooks/TTS_manifests/2_TTS_corrected.json' \

2021-06-04 02:07:13.058306: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Package cmudict is already up-to-date!
[NeMo W 2021-06-04 02:07:14 experimental:28] Module <class 'nemo.collections.asr.data.audio_to_text_dali.AudioToCharDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
################################################################################
###          (please add 'export KALDI_ROOT=<your_path>' in your $HOME/.profile)
###          (or run as: KALDI_ROOT=<your_path> python <your_script>.py)
################################################################################

      '"sox"

We compute our Intersection over Union and Percent of Target Vocab stats to understand how closely our "hot text" matches our test set text.

Note that we delete stop words, as Professor Ng suggested in the Sequence Models courses, to make the comparisons more interesting. The list of stop words comes from: 

https://gist.github.com/sebleier/554280

Reference for removing elements from a set:
https://thispointer.com/python-set-remove-single-or-multiple-elements-from-a-set/


In [None]:
stop_words = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

def load_vocab(manifest_path):
  """Return the set of non-stop-word tokens found in the 'text' field of a manifest."""
  vocab = set()
  with open(manifest_path, 'r') as f:
    for line in f:
      vocab.update(json.loads(line)['text'].split())
  # Remove stop words once, after the whole manifest has been read.
  vocab.difference_update(stop_words)
  return vocab

def get_hot_text_stats(hot_text_manifest, test_manifest):
  """Print the vocabulary overlap between a hot-text manifest and a test manifest."""
  hot_vocab = load_vocab(hot_text_manifest)
  test_vocab = load_vocab(test_manifest)

  intersection_vocab = hot_vocab.intersection(test_vocab)
  union_vocab = hot_vocab.union(test_vocab)
  # Both values are fractions in [0, 1], not percentages.
  print("intersection/union: "+str(len(intersection_vocab)/len(union_vocab)))
  print("percent of target vocab: "+str(len(intersection_vocab)/len(test_vocab)))


get_hot_text_stats('/content/drive/MyDrive/Colab Notebooks/TTS_manifests/train_for_test_with_paired.json','/content/drive/MyDrive/Colab Notebooks/paired_full/test_manifest_final.json')
get_hot_text_stats('/content/drive/MyDrive/Colab Notebooks/TTS_manifests/2_TTS_1_paired.json','/content/drive/MyDrive/Colab Notebooks/paired_full/dev_manifest_final.json')
get_hot_text_stats('/content/drive/MyDrive/Colab Notebooks/TTS_manifests/17_TTS_corrected.json','/content/drive/MyDrive/Colab Notebooks/paired_full/dev_manifest_final.json')


intersection/union: 0.40639008106819263
percent of target vocab: 0.71964195237291
intersection/union: 0.4243538444901049
percent of target vocab: 0.7241188411145968
intersection/union: 0.24201360544217687
percent of target vocab: 0.8206311127514302
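
In the output above, the third call has the lowest intersection/union but the highest share of target vocabulary. One way such a pattern can arise is that a larger hot-text vocabulary covers more target words while also inflating the union. The sketch below illustrates that arithmetic on toy sets; the function name `overlap_stats` and all words in the sets are invented for illustration and are not taken from the manifests.

```python
def overlap_stats(hot_vocab, target_vocab):
    """Return (IoU, target coverage) for two vocabularies given as sets."""
    shared = hot_vocab & target_vocab
    iou = len(shared) / len(hot_vocab | target_vocab)
    coverage = len(shared) / len(target_vocab)
    return iou, coverage

# Toy vocabularies, invented purely for illustration.
target = {"court", "immunity", "sovereign", "railway"}
small_hot = {"court", "immunity", "statute"}
large_hot = small_hot | {"sovereign", "tort", "appeal", "brief", "remand", "writ"}

small_iou, small_cov = overlap_stats(small_hot, target)
large_iou, large_cov = overlap_stats(large_hot, target)

# The larger hot vocabulary covers more of the target (coverage goes up),
# but its many extra words enlarge the union (IoU goes down).
print("small hot text:", small_iou, small_cov)
print("large hot text:", large_iou, large_cov)
```

Read this way, the share of target vocabulary is the more direct indicator of whether the hot text contains the words the test set needs, while IoU also penalizes hot-text words that never occur in the target.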


### Understanding our transcripts

Finally, we load the pre-trained model and have it transcribe a selection of clips so that we can analyze the results. Note that the pretrained model struggles with names (e.g., "todmayor") and with legal terminology (e.g., "federal lor" instead of "federal law").

In [None]:
import json
from itertools import islice

pretrained = nemo_asr.models.EncDecCTCModel.restore_from('/content/drive/MyDrive/Colab Notebooks/pretrained.nemo')
audio_paths = []
with open('/content/drive/MyDrive/Colab Notebooks/paired_full/test_manifest_err_analysis.json','r') as f:
  # Take every second manifest entry (the 2nd, 4th, ...); iterate over `f` directly to use every entry instead.
  for line in islice(f, 1, None, 2):
    line_dict = json.loads(line)
    audio_paths.append(line_dict["audio_filepath"])
print(audio_paths)
pretrained.transcribe(paths2audio_files=audio_paths)

[NeMo I 2021-06-03 22:00:38 features:240] PADDING: 16
[NeMo I 2021-06-03 22:00:38 features:256] STFT using torch
[NeMo I 2021-06-03 22:00:39 modelPT:376] Model EncDecCTCModel was successfully restored from /content/drive/MyDrive/Colab Notebooks/pretrained.nemo.
['/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2015/output_multiple_files/high_score_clips/13-1067_20151201-opinion.delivery_0002.wav', '/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2015/output_multiple_files/high_score_clips/13-1067_20151201-opinion.delivery_0004.wav', '/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2015/output_multiple_files/high_score_clips/13-1067_20151201-opinion.delivery_0006.wav', '/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2015/output_multiple_files/high_score_clips/13-1067_20151201-opinion.delivery_0008.wav', '/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2015/output_multiple_files/high_score_clips/13-1067_20151201-opinion.delivery_0010.wav', '/content/drive/MyDrive/Colab Notebooks/WORK_DI

['she planned to use that urrail pass two among other things ride on the austrian state owed railway known as o b b from insbrook austria toprague',
 'when she returned home to the united states miss sax sued o b b in federal district court',
 'that question arises because o bb is owned by a sovereign nation austria',
 'there are however several specific exceptions to that general rule',
 'based upon a commercial activity carried on in the united states',
 'it adopted what is known as the one element test',
 'the claim in the suit such as giving rise to a duty of care the suit can be said to be based upon that activity and the commercial activity exception to the bar of sovereign immunity applies',
 'and we rejectd the one element test',
 'in that case we explained that a suit should be considered to be based upon the wrongful conduct that makes up the core or the grovemen of the complaint',
 'though the sale of the eur rail pass in the united states was connected to the tragic inciden
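
To put a number on errors like these, a word error rate (WER) can be computed per utterance. The following is a minimal sketch using a plain word-level edit distance. The function name `word_error_rate` is chosen here for illustration, and the reference string is an assumption (the ground-truth text is not shown in this notebook); only the hypothesis is copied from the output above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Assumed reference text (hypothetical); the hypothesis comes from the pretrained output.
reference = "that question arises because obb is owned by a sovereign nation austria"
hypothesis = "that question arises because o bb is owned by a sovereign nation austria"
print(word_error_rate(reference, hypothesis))
```

The sketch assumes a non-empty reference. Splitting "obb" into "o bb" counts as two word errors here (one substitution and one insertion), which shows how spacing mistakes in acronyms can weigh on WER.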

Let's compare this to a model that was trained solely on synthetic data. It does better on some names (e.g., "ginsburg") but not on others (e.g., "sodmayor").

In [None]:
import json
from itertools import islice

TTS_2 = nemo_asr.models.EncDecCTCModel.restore_from('/content/drive/MyDrive/Colab Notebooks/2_TTS_5_0001_003.nemo')
audio_paths = []
with open('/content/drive/MyDrive/Colab Notebooks/paired_full/test_manifest_err_analysis.json','r') as f:
  # Take every second manifest entry (the 2nd, 4th, ...); iterate over `f` directly to use every entry instead.
  for line in islice(f, 1, None, 2):
    line_dict = json.loads(line)
    audio_paths.append(line_dict["audio_filepath"])
print(audio_paths)
TTS_2.transcribe(paths2audio_files=audio_paths)

[NeMo W 2021-06-03 22:01:04 modelPT:133] Please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /content/drive/MyDrive/Colab Notebooks/TTS_manifests/2_TTS_corrected.json
    sample_rate: 16000
    labels:
    - ' '
    - a
    - b
    - c
    - d
    - e
    - f
    - g
    - h
    - i
    - j
    - k
    - l
    - m
    - 'n'
    - o
    - p
    - q
    - r
    - s
    - t
    - u
    - v
    - w
    - x
    - 'y'
    - z
    - ''''
    batch_size: 8
    trim_silence: true
    max_duration: 16.7
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    
[NeMo W 2021-06-03 22:01:04 modelPT:140] Please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /conte

[NeMo I 2021-06-03 22:01:04 features:240] PADDING: 16
[NeMo I 2021-06-03 22:01:04 features:256] STFT using torch
[NeMo I 2021-06-03 22:01:05 modelPT:376] Model EncDecCTCModel was successfully restored from /content/drive/MyDrive/Colab Notebooks/2_TTS_5_0001_003.nemo.
['/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2015/output_multiple_files/high_score_clips/13-1067_20151201-opinion.delivery_0002.wav', '/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2015/output_multiple_files/high_score_clips/13-1067_20151201-opinion.delivery_0004.wav', '/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2015/output_multiple_files/high_score_clips/13-1067_20151201-opinion.delivery_0006.wav', '/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2015/output_multiple_files/high_score_clips/13-1067_20151201-opinion.delivery_0008.wav', '/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2015/output_multiple_files/high_score_clips/13-1067_20151201-opinion.delivery_0010.wav', '/content/drive/MyDrive/Colab Notebooks/W

['she planned to use that eurail pass to among other things ride on the austrian state owned railway known as obb from insbruck austria to praqu',
 'when she returned home to the united states missac sued obb in federal district court',
 'that question arises because obb is owned by a sovereign nation austria',
 'there are however several specific exceptions to that general rule',
 'based upon a commercial activity carried on in the united states',
 'it adopted what is known as the one element test',
 'of the claim in the suit such as giving rise to a duty of care the suit can be said to be based upon that activity and the commercial activity exception to the bar of sovereign immunity applies',
 'and we rejectd the one element test',
 'in that case we explained that a suit should be considered to be based upon the wrongful conduct that makes up the core or the grovemen of the complaint',
 'though the sale of the eurai pass in the united states was connected to the tragic incident in in
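
Differences between two models' transcripts are easier to spot with a word-level diff than by reading the lists side by side. The sketch below uses Python's standard `difflib` on the second utterance as transcribed by the pretrained model and by the synthetic-data model (both strings are copied from the outputs above). The helper name `word_diff` and the variable names are illustrative.

```python
from difflib import SequenceMatcher

def word_diff(a, b):
    """Return (op, words_from_a, words_from_b) for every span where a and b differ."""
    a_words, b_words = a.split(), b.split()
    matcher = SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    return [(op, a_words[i1:i2], b_words[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes()
            if op != "equal"]

pretrained_out = "when she returned home to the united states miss sax sued o b b in federal district court"
tts_out = "when she returned home to the united states missac sued obb in federal district court"

for op, left, right in word_diff(pretrained_out, tts_out):
    print(op, left, "->", right)
```

For this pair the diff isolates the two spans where the models disagree: the name before "sued" and the spelling of the acronym.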

The model trained on the combination of 1 year of real data and 2 years of hot text does best: it transcribes names and more legal terminology correctly, even "certiorari"!

In [None]:
import json
from itertools import islice

TTS_2_1_paired = nemo_asr.models.EncDecCTCModel.restore_from('/content/drive/MyDrive/Colab Notebooks/2_TTS_1_paired_10_0001_001.nemo')
audio_paths = []
with open('/content/drive/MyDrive/Colab Notebooks/paired_full/test_manifest_err_analysis.json','r') as f:
  # Take every second manifest entry (the 2nd, 4th, ...); iterate over `f` directly to use every entry instead.
  for line in islice(f, 1, None, 2):
    line_dict = json.loads(line)
    audio_paths.append(line_dict["audio_filepath"])
print(audio_paths)
TTS_2_1_paired.transcribe(paths2audio_files=audio_paths)

[NeMo W 2021-06-03 22:00:13 modelPT:133] Please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /content/drive/MyDrive/Colab Notebooks/TTS_manifests/2_TTS_1_paired.json
    sample_rate: 16000
    labels:
    - ' '
    - a
    - b
    - c
    - d
    - e
    - f
    - g
    - h
    - i
    - j
    - k
    - l
    - m
    - 'n'
    - o
    - p
    - q
    - r
    - s
    - t
    - u
    - v
    - w
    - x
    - 'y'
    - z
    - ''''
    batch_size: 8
    trim_silence: true
    max_duration: 16.7
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    
[NeMo W 2021-06-03 22:00:13 modelPT:140] Please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /conten

[NeMo I 2021-06-03 22:00:13 features:240] PADDING: 16
[NeMo I 2021-06-03 22:00:13 features:256] STFT using torch
[NeMo I 2021-06-03 22:00:14 modelPT:376] Model EncDecCTCModel was successfully restored from /content/drive/MyDrive/Colab Notebooks/2_TTS_1_paired_10_0001_001.nemo.
['/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2015/output_multiple_files/high_score_clips/13-1067_20151201-opinion.delivery_0002.wav', '/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2015/output_multiple_files/high_score_clips/13-1067_20151201-opinion.delivery_0004.wav', '/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2015/output_multiple_files/high_score_clips/13-1067_20151201-opinion.delivery_0006.wav', '/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2015/output_multiple_files/high_score_clips/13-1067_20151201-opinion.delivery_0008.wav', '/content/drive/MyDrive/Colab Notebooks/WORK_DIR_2015/output_multiple_files/high_score_clips/13-1067_20151201-opinion.delivery_0010.wav', '/content/drive/MyDrive/Colab N

['she planed to use that eurrail pass two among other things ride on the austrian state owned railway known as obb from insbrick austria toprague',
 'when she returned home to the united states mss sac sued obb in federal district court',
 'that question arises because obb is owned by a sovereign nation austria',
 'there are however several specific exceptions to that general rule',
 'based upon a commercial activity carried on in the united states',
 'it adopted what is known as the one element test',
 'of the claim in the suit such as giving rise to a duty of care the suit can be said to be based upon that activity and the commercial activity exception to the bar of sovereign immunity applies',
 'and we rejectd the one element test',
 'in that case we explaine that a suit should be considered to be based upon the wrongful conduct that makes up the core or the grovermen of the complaint',
 'though the sale of the eurral pass in the united states was connected to the tragic incident in
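
The three cells above repeat the same manifest-reading and transcription steps. A possible consolidation is sketched below. The helper names (`read_audio_paths`, `transcribe_all`, `print_side_by_side`) are hypothetical, the reader collects every manifest entry, and the only assumption about a model is that it exposes `transcribe(paths2audio_files=...)` as used in the cells above.

```python
import json

def read_audio_paths(manifest_path):
    """Collect the "audio_filepath" field from every non-empty manifest line."""
    paths = []
    with open(manifest_path, "r") as f:
        for line in f:
            if line.strip():
                paths.append(json.loads(line)["audio_filepath"])
    return paths

def transcribe_all(models, audio_paths):
    """Map each model name to its transcripts for the same list of audio files."""
    return {name: model.transcribe(paths2audio_files=audio_paths)
            for name, model in models.items()}

def print_side_by_side(results):
    """Print the transcripts of all models grouped by utterance."""
    names = list(results)
    for idx, group in enumerate(zip(*(results[name] for name in names))):
        print(f"utterance {idx}")
        for name, text in zip(names, group):
            print(f"  {name:>16}: {text}")

# Possible usage with the models restored above (requires NeMo and the audio files):
# paths = read_audio_paths('/content/drive/MyDrive/Colab Notebooks/paired_full/test_manifest_err_analysis.json')
# results = transcribe_all({"pretrained": pretrained, "TTS_2": TTS_2,
#                           "TTS_2_1_paired": TTS_2_1_paired}, paths)
# print_side_by_side(results)
```

Grouping the output by utterance rather than by model puts the competing transcriptions of each name or legal term next to each other, which is the comparison this section makes by eye.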