In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell

# install NeMo
BRANCH = 'text_normalization'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]

In [None]:
import json
import os
import wget
import numpy as np
import inspect
import regex as re


# Introduction
Text normalization for TTS converts text into its verbalized form. That is, tokens belonging to special semiotic classes, e.g. numbers/abbreviations, will be converted into their spoken form. For example, "10:00" -> "ten o'clock", "10:00 a.m." -> "ten a m", "10kg" -> "ten kilograms".

This tutorial shows how to use the NeMo rule-based text normalization system.
Similar to the [Google Kestrel TTS text normalization system](https://www.researchgate.net/profile/Richard_Sproat/publication/277932107_The_Kestrel_TTS_text_normalization_system/links/57308b1108aeaae23f5cc8c4/The-Kestrel-TTS-text-normalization-system.pdf), the NeMo rule-based system is divided into a tagger and a verbalizer: the tagger detects and classifies semiotic classes in the input text, and the verbalizer takes the tagger's output and carries out the normalization. The system is designed to be easy to debug and to extend with more rules.
We provide a set of rules that covers the majority of cases found in the [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish) for English. As with every language, there is a long tail of special cases.

This tutorial shows how to run prediction on regular text data and how to run evaluation on a labeled text normalization dataset that follows the format of the [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish).
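
To make the tagger/verbalizer split concrete, here is a minimal toy sketch in plain Python. It is not the NeMo implementation: the function names `tag` and `verbalize`, the lookup tables, and the single regex rule are all hypothetical and cover only a token like "10kg".

```python
import re

# Hypothetical toy lookup tables, invented for this illustration only.
UNITS = {"kg": "kilograms", "km": "kilometers"}
NUMBERS = {"1": "one", "2": "two", "3": "three", "10": "ten"}

# One toy rule: digits immediately followed by a lowercase unit, e.g. "10kg".
MEASURE = re.compile(r"^(\d+)([a-z]+)$")


def tag(text):
    """Tagger: detect and classify tokens, without changing them yet."""
    tagged = []
    for token in text.split():
        match = MEASURE.match(token)
        if match and match.group(1) in NUMBERS and match.group(2) in UNITS:
            tagged.append({"class": "measure", "number": match.group(1), "unit": match.group(2)})
        else:
            tagged.append({"class": "plain", "text": token})
    return tagged


def verbalize(tagged):
    """Verbalizer: turn the tagger output into the spoken form."""
    words = []
    for item in tagged:
        if item["class"] == "measure":
            words.append(f'{NUMBERS[item["number"]]} {UNITS[item["unit"]]}')
        else:
            words.append(item["text"])
    return " ".join(words)


print(verbalize(tag("the bag weighs 10kg")))  # the bag weighs ten kilograms
```

Keeping classification and verbalization separate means a wrong output can be traced to one stage: either the token was tagged with the wrong class, or the class was spoken incorrectly.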


In [None]:
# If you're running the notebook locally, update the TOOLS_DIR path below
# In Colab, a few required scripts will be downloaded from NeMo github

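TOOLS_DIR = '<UPDATE_PATH_TO_NeMo_root>/tools/text_normalization/'
# Also define the data directory for local runs; the Colab branch below overrides both paths
TOOLS_DATA_DIR = TOOLS_DIR + 'data/'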

if 'google.colab' in str(get_ipython()):
    TOOLS_DIR = 'tools/text_normalization/'
    TOOLS_DATA_DIR = TOOLS_DIR + "data/"
    os.makedirs(TOOLS_DIR, exist_ok=True)
    os.makedirs(TOOLS_DATA_DIR, exist_ok=True)

    required_files = [
      'normalize.py',
      'tagger.py',
      'utils.py',
      'run_evaluate.py',
      'run_predict.py',
      'verbalizer.py',
    ]
    required_data_file = [
      'currency.tsv',
      'magnitudes.tsv',
      'measurements.tsv',
      'months.tsv',
      'whitelist.tsv'
    ]
    for file in required_files:
        if not os.path.exists(os.path.join(TOOLS_DIR, file)):
            file_path = f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/' + TOOLS_DIR + file
            print(file_path)
            wget.download(file_path, TOOLS_DIR)
    for file in required_data_file:
        if not os.path.exists(os.path.join(TOOLS_DATA_DIR, file)):
            file_path = f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/' + TOOLS_DATA_DIR + file
            print(file_path)
            wget.download(file_path, TOOLS_DATA_DIR)
elif not os.path.exists(TOOLS_DIR):
    raise ValueError('TOOLS_DIR not found: update the path to the NeMo root directory')

`TOOLS_DIR` should now contain the scripts that we are going to need in the next steps. All necessary scripts can be found [here](https://github.com/NVIDIA/NeMo/tree/main/tools/text_normalization).
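
If you prefer a programmatic check over reading a directory listing, the following sketch reports which expected files are absent. The helper name `missing_files` is hypothetical, and the demonstration uses a throwaway temporary directory rather than the real `TOOLS_DIR`.

```python
import os
import tempfile


def missing_files(directory, expected_names):
    """Return the expected file names that do not exist in `directory`."""
    return [
        name for name in expected_names
        if not os.path.isfile(os.path.join(directory, name))
    ]


# Demonstration on a temporary directory that holds only one of two files.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "normalize.py"), "w").close()
    print(missing_files(demo_dir, ["normalize.py", "tagger.py"]))  # ['tagger.py']
```

In the notebook, the same call could be pointed at `TOOLS_DIR` with the list of script names from the previous cell; an empty list means nothing is missing.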

In [None]:
print(TOOLS_DIR)
! ls -l $TOOLS_DIR
! ls -l $TOOLS_DATA_DIR

# Data Preparation and Download


## Data for Prediction
For prediction, let's download a text file from [http://www.gutenberg.org/files/48874/48874-0.txt](http://www.gutenberg.org/files/48874/48874-0.txt).

In [None]:
# Create the data directory and download a text file
WORK_DIR = 'WORK_DIR'
DATA_DIR = WORK_DIR + '/DATA'
os.makedirs(DATA_DIR, exist_ok=True)
text_file = '48874-0.txt'
if not os.path.exists(os.path.join(DATA_DIR, text_file)):
    print('Downloading text file')
    wget.download('http://www.gutenberg.org/files/48874/' + text_file, DATA_DIR)

The `DATA_DIR` should now contain the text file.

In [None]:
!ls -l $DATA_DIR

The first 10 lines of the file:

In [None]:
! head -n 10 $DATA_DIR/$text_file

## Data for Evaluation
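
As a rough illustration of what labeled evaluation data can look like, the sketch below parses rows under an assumed layout: three tab-separated columns (semiotic class, written form, spoken form), an `<eos>` row that closes each sentence, and `<self>` in the third column meaning the token is spoken as written. This layout is an assumption about the Google text normalization data, the helper name `parse_labeled_lines` is hypothetical, and the sample rows are made up for the example.

```python
def parse_labeled_lines(lines):
    """Group (class, written, spoken) rows into sentences, split at <eos> rows."""
    sentences, current = [], []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if parts[0] == "<eos>":
            # Assumed sentence boundary marker
            if current:
                sentences.append(current)
            current = []
        elif len(parts) == 3:
            semiotic_class, written, spoken = parts
            # Assumption: "<self>" means the spoken form equals the written form
            if spoken == "<self>":
                spoken = written
            current.append((semiotic_class, written, spoken))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample rows in the assumed layout
sample = [
    "PLAIN\tIt\t<self>",
    "PLAIN\tweighs\t<self>",
    "MEASURE\t10kg\tten kilograms",
    "<eos>\t<eos>",
]
for sentence in parse_labeled_lines(sample):
    print(" ".join(written for _, written, _ in sentence))
    print(" ".join(spoken for _, _, spoken in sentence))
```

Pairing each written sentence with its spoken reference like this is what an evaluation step would compare the normalizer's output against.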



In [None]:
# TODO

# Prediction


In [None]:
# TODO ! ls -l $OUTPUT_DIR/processed

# Evaluation


In [None]:
# TODO


# Next Steps

Check out [NeMo Speech Data Explorer tool](https://github.com/NVIDIA/NeMo/tree/main/tools/speech_data_explorer#speech-data-explorer) to interactively evaluate the aligned segments.
