# Introduction

The goal of this tutorial is to demonstrate the basic steps required to set up and train a simple single-channel speech enhancement model in NeMo using online augmentation with noise and room impulse responses (RIRs). Online augmentation is performed using a dataloader based on the Lhotse speech data processing toolkit [1].


This notebook covers the following steps:

* Download speech, noise and RIR data
* Prepare Lhotse manifests for speech, noise and RIR data
* Prepare a fixed validation set by mixing speech, noise and RIR data
* Configure and train a simple single-output model

Note that this tutorial is only for demonstration purposes.
To achieve best performance for a particular use case, carefully prepared data and more advanced models should be used.

*Disclaimer:*
The user is responsible for checking the content of the datasets and the applicable licenses, and for determining whether they are suitable for the intended use.

In [None]:
"""
You can run this notebook either locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect
"""

GIT_USER = 'NVIDIA'
BRANCH = 'main'

if 'google.colab' in str(get_ipython()):

    # Install dependencies
    !pip install wget
    !apt-get install sox libsndfile1 ffmpeg
    !pip install text-unidecode
    !pip install "matplotlib>=3.3.2"

    ## Install NeMo
    !python -m pip install git+https://github.com/{GIT_USER}/NeMo.git@{BRANCH}#egg=nemo_toolkit[all]

    ## Install TorchAudio
    !pip install "torchaudio>=0.13.0" -f https://download.pytorch.org/whl/torch_stable.html

The following cell will take care of the necessary imports and prepare utility functions used throughout the notebook.

In [None]:
import glob
import librosa
import librosa.display
import os
import torch
import tqdm
from itertools import islice

import IPython.display as ipd
import matplotlib.pyplot as plt
import numpy as np
import lightning.pytorch as pl
import soundfile as sf
from pathlib import Path
from omegaconf import OmegaConf, open_dict
from sklearn.model_selection import train_test_split
from torchmetrics.functional.audio import signal_distortion_ratio, scale_invariant_signal_distortion_ratio
from lhotse import CutSet, RecordingSet, Recording, MonoCut
from lhotse.recipes import (
    download_rir_noise,
    prepare_rir_noise,
    download_librispeech,
    prepare_librispeech
)

from nemo.collections.common.data.lhotse import get_lhotse_dataloader_from_config
from nemo.collections.audio.data.audio_to_audio_lhotse import LhotseAudioToTargetDataset

Utility functions for displaying signals and metrics

In [None]:
def show_signal(signal: np.ndarray, sample_rate: int = 16000, tag: str = 'Signal'):
    """Show the time-domain signal and its spectrogram.
    """
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 2.5))

    # show waveform
    t = np.arange(0, len(signal)) / sample_rate

    ax[0].plot(t, signal)
    ax[0].set_xlim(0, t.max())
    ax[0].grid()
    ax[0].set_xlabel('time / s')
    ax[0].set_ylabel('amplitude')
    ax[0].set_title(tag)

    n_fft = 1024
    hop_length = 256

    D = librosa.amplitude_to_db(np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)), ref=np.max)
    img = librosa.display.specshow(D, y_axis='linear', x_axis='time', sr=sample_rate, n_fft=n_fft, hop_length=hop_length, ax=ax[1])
    ax[1].set_title(tag)

    plt.tight_layout()
    plt.colorbar(img, format="%+2.f dB", ax=ax)

def show_metrics(signal: np.ndarray, reference: np.ndarray, sample_rate: int = 16000, tag: str = 'Signal'):
    """Show metrics for the time-domain signal and the reference signal.
    """
    sdr = signal_distortion_ratio(preds=torch.tensor(signal), target=torch.tensor(reference))
    sisdr = scale_invariant_signal_distortion_ratio(preds=torch.tensor(signal), target=torch.tensor(reference))
    print(tag)
    print('\tsdr:  ', sdr.item())
    print('\tsisdr:', sisdr.item())
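
To make the second metric reported by `show_metrics` more concrete, the following is a minimal pure-Python sketch of the usual definition of scale-invariant SDR: the estimate is split into a scaled copy of the reference and a residual, and the ratio of their energies is reported in dB. The function name `si_sdr_db` and the toy signals are assumptions made for illustration; the notebook itself relies on the TorchMetrics implementation.

```python
import math


def si_sdr_db(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length sequences of floats."""
    # Scaling of the target that best explains the estimate (least squares)
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    alpha = dot / (target_energy + eps)

    # Split the estimate into a scaled-target part and a residual
    scaled_target = [alpha * t for t in target]
    residual = [e - s for e, s in zip(estimate, scaled_target)]

    signal_energy = sum(s * s for s in scaled_target)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (residual_energy + eps))


# Toy example: the estimate is the target plus a small error orthogonal to it
target = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 0.9, -1.1]
louder = [2.0 * e for e in estimate]

print('si-sdr:', si_sdr_db(estimate, target))
print('si-sdr (estimate scaled by 2):', si_sdr_db(louder, target))
```

Both calls print approximately 20 dB, which illustrates that rescaling the estimate does not change the scale-invariant metric.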

### Data preparation

In this notebook, it is assumed that all audio will be resampled to 16 kHz and that the data and configuration will be stored under `root_dir` as defined below.

In [None]:
# sample rate used throughout the notebook
sample_rate = 16000

# root directory for data preparation, configurations, etc
root_dir = Path('./')

# data directory
data_dir = root_dir / 'data'
data_dir.mkdir(exist_ok=True)

# scripts directory
scripts_dir = root_dir / 'scripts'
scripts_dir.mkdir(exist_ok=True)

Create a dictionary with paths for all of the manifest files, which will be stored under `data_dir`.

In [None]:
dataset_manifest = {
    'speech_train': data_dir / 'libri_cuts_train.jsonl.gz',
    'speech_val': data_dir / 'libri_cuts_val.jsonl.gz',
    'noise_train': data_dir / 'demand_cuts_train.jsonl.gz',
    'noise_val': data_dir / 'demand_cuts_val.jsonl.gz',
    'rir_train': data_dir / 'rir_recordings_train.jsonl.gz',
    'rir_val': data_dir / 'rir_recordings_val.jsonl.gz',
    'noisy_val': data_dir / 'noisy_cuts_val.jsonl.gz'
}
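
The `.jsonl.gz` suffix used for the manifests indicates gzip-compressed JSON Lines, with one JSON object per line. As a rough sketch of that storage format, the snippet below writes and reads a toy file using only the standard library. The helper names and the record fields are hypothetical and heavily simplified; real Lhotse cuts carry more information and are normally read and written through Lhotse itself.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_jsonl_gz(path, records):
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(path, 'wt', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')


def read_jsonl_gz(path):
    """Read all JSON objects from a gzip-compressed JSON Lines file."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical, simplified records used only to illustrate the file format
toy_records = [
    {'id': 'utt-0', 'start': 0.0, 'duration': 2.5},
    {'id': 'utt-1', 'start': 0.0, 'duration': 4.0},
]

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / 'toy_cuts.jsonl.gz'
    write_jsonl_gz(toy_path, toy_records)
    loaded = read_jsonl_gz(toy_path)

print(len(loaded), 'records, first id:', loaded[0]['id'])
```

The same kind of reader can be handy for quickly peeking at a manifest produced later in this notebook without loading any audio.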

In this tutorial, a subset of the LibriSpeech dataset [2] will be downloaded and used as the speech material.

To use a dataset with the Lhotse dataloader, we need to create manifest files from Lhotse cuts (refer to [3] for details). In this cell, we first download and prepare the LibriSpeech dataset in Lhotse format and then save it as manifest files for the training and validation sets. Note that the target recording in the speech enhancement task is the original (unchanged) clean speech signal, which is defined under the custom field "target_recording" in the cuts.

In [None]:
libri_variant = 'mini_librispeech'
speech_dir = data_dir / 'speech'

libri_root = download_librispeech(speech_dir, dataset_parts=libri_variant)

# Use the recipe from Lhotse to prepare the LibriSpeech dataset in Lhotse format
libri = prepare_librispeech(
    libri_root, dataset_parts=libri_variant,
)
cuts_train = CutSet.from_manifests(**libri["train-clean-5"]).trim_to_supervisions()
cuts_val = CutSet.from_manifests(**libri["dev-clean-2"]).trim_to_supervisions()

# Save the manifest with a custom "target_recording"
with CutSet.open_writer(dataset_manifest['speech_train']) as writer:
    for cut in cuts_train:
        cut.target_recording = cut.recording
        writer.write(cut)

with CutSet.open_writer(dataset_manifest['speech_val']) as writer:
    for cut in cuts_val:
        cut.target_recording = cut.recording
        writer.write(cut)

During the training phase, noise data will be used for online augmentation by mixing it with the downloaded speech on-the-fly. During the validation and test phases, the noise will be used to create fixed sets.
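
Mixing noise with speech at a prescribed signal-to-noise ratio (SNR) boils down to rescaling the noise before adding it. The following is a minimal pure-Python sketch of that idea, not the code path used by the Lhotse dataloader; the function names `mix_at_snr` and `power` and the synthetic signals are assumptions made for illustration.

```python
import math


def power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`, then add it."""
    # Gain g chosen such that power(speech) / (g**2 * power(noise)) == 10**(snr_db / 10)
    gain = math.sqrt(power(speech) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


# Synthetic stand-ins for speech and noise (100 samples each)
speech = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
noise = [0.3 * (-1.0) ** n for n in range(100)]

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=10.0)
achieved_snr_db = 10.0 * math.log10(power(speech) / power(scaled_noise))
print('achieved SNR in dB:', round(achieved_snr_db, 3))
```

Lower SNR values make the noise more prominent, so an SNR range such as `(0, 20)` spans conditions from equally loud speech and noise to fairly clean speech.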

The following cell will download and prepare the noise data using a subset of the DEMAND dataset [4].

In [None]:
noise_dir = data_dir / 'noise'
noise_data_set = 'STRAFFIC,PSTATION'

# Copy script
get_demand_script = os.path.join(scripts_dir, 'get_demand_data.py')
if not os.path.exists(get_demand_script):
    !wget -P $scripts_dir https://raw.githubusercontent.com/{GIT_USER}/NeMo/{BRANCH}/scripts/dataset_processing/get_demand_data.py

if not noise_dir.is_dir():
    noise_dir.mkdir(exist_ok=True)
    !python {get_demand_script} --data_root={noise_dir} --data_sets={noise_data_set}
else:
    print('Noise directory already exists in:', noise_dir)

noise_dir = data_dir / 'noise'
demand_recordings = RecordingSet.from_dir(noise_dir, pattern='*.wav')

demand_cuts = CutSet.from_manifests(recordings=demand_recordings)
shuffled_demand_cuts = demand_cuts.shuffle()

demand_cuts_train = shuffled_demand_cuts.subset(last=len(shuffled_demand_cuts)-3)
demand_cuts_val = shuffled_demand_cuts.subset(first=3)

demand_cuts_train.to_file(dataset_manifest['noise_train'])
demand_cuts_val.to_file(dataset_manifest['noise_val'])
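
The split above takes the first three shuffled items for validation and the remaining ones for training, so the two subsets do not overlap. The same logic is sketched below on a plain list of hypothetical identifiers, with a seeded random generator so that the split is repeatable across runs; the helper `split_items`, the identifiers and the number of items are assumptions made for illustration.

```python
import random


def split_items(items, num_val, seed=0):
    """Shuffle with a fixed seed, then use the first `num_val` items for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    val = shuffled[:num_val]
    train = shuffled[num_val:]
    return train, val


# Hypothetical identifiers standing in for noise recordings
noise_ids = [f'noise_{i:02d}' for i in range(32)]
train_ids, val_ids = split_items(noise_ids, num_val=3)

print(len(train_ids), 'training items,', len(val_ids), 'validation items')
```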

The following cell will download and prepare the simulated subset of the room impulse response dataset described in [5].

In [None]:
rir_recordings = RecordingSet()
rir_raw_dir = download_rir_noise(data_dir)
rirs = prepare_rir_noise(rir_raw_dir, parts=["sim_rir"])
rir_recordings = rirs["sim_rir"]["recordings"]
shuffled_rir_recordings = rir_recordings.shuffle()

rir_val_part = int(len(rir_recordings) * 0.1)
rir_train_part = len(rir_recordings) - rir_val_part

rir_recordings_train = shuffled_rir_recordings.subset(last=rir_train_part)
rir_recordings_val = shuffled_rir_recordings.subset(first=rir_val_part)

rir_recordings_train.to_file(dataset_manifest['rir_train'])
rir_recordings_val.to_file(dataset_manifest['rir_val'])
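
Applying a room impulse response to a dry signal amounts to convolving the two. The snippet below is a small pure-Python sketch of that operation with a toy three-tap impulse response; the function `convolve` and the toy signals are assumptions made for illustration and say nothing about how the dataloader implements reverberation internally.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


dry = [1.0, 0.5, 0.0, 0.0]
# Toy RIR: direct path plus a single reflection delayed by two samples
toy_rir = [1.0, 0.0, 0.25]

wet = convolve(dry, toy_rir)
print(wet)  # [1.0, 0.5, 0.25, 0.125, 0.0, 0.0]
```

The reflection shows up as an attenuated copy of the dry signal two samples later, which is the basic mechanism behind reverberation.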

For this tutorial, a single-channel noisy validation set is constructed by applying RIRs to the speech signals and adding noise.

The following block uses the Lhotse-based data loader from NeMo to create a fixed noisy validation set and save it to the `data/val` folder.

In [None]:
# Create the config for the Lhotse data loader
val_noise_config = {
    'cuts_path': dataset_manifest['speech_val'].as_posix(),  # path to Lhotse cuts manifest with speech signals for augmentation
    'sample_rate': 16000,
    'batch_size': 1,
    'rir_enabled': True,  # enable room impulse response augmentation
    'rir_path': dataset_manifest['rir_val'].as_posix(),  # path to Lhotse recordings manifest with room impulse response signals
    'rir_prob': 1.0,  # probability of applying RIR augmentation
    'noise_path': dataset_manifest['noise_val'].as_posix(),  # path to Lhotse cuts manifest with noise signals
    'noise_mix_prob': 1.0,  # probability of applying noise augmentation
    'noise_snr': (0, 20),  # range of signal-to-noise ratio for the noise augmentation
    'shuffle': False
}

# Instantiate the data loader
dl = get_lhotse_dataloader_from_config(
    OmegaConf.create(val_noise_config), global_rank=0, world_size=1, dataset=LhotseAudioToTargetDataset()
)

# Define number of samples for the validation set
num_examples = 100
print(f'Get {num_examples} samples for the validation set')
samples = [sample for sample in islice(dl, num_examples)]


# Create folders for saving noisy (input) and clean (target) samples
val_dir = data_dir / 'val'
val_noisy_dir = val_dir / 'noisy'
val_clean_dir = val_dir / 'clean'

val_dir.mkdir(exist_ok=True)
val_noisy_dir.mkdir(exist_ok=True)
val_clean_dir.mkdir(exist_ok=True)

val_noisy_basename = 'val_noisy_fileid'
val_clean_basename = 'val_clean_fileid'

with CutSet.open_writer(dataset_manifest['noisy_val']) as writer:
    for n, sample in enumerate(samples):
        noisy, clean = sample['input_signal'].numpy()[0], sample['target_signal'].numpy()[0]
        # Save the noisy and clean signals and load them back as Lhotse recordings
        sf.write(val_noisy_dir / f'{val_noisy_basename}_{str(n)}.wav', noisy, samplerate=val_noise_config['sample_rate'])
        sf.write(val_clean_dir / f'{val_clean_basename}_{str(n)}.wav', clean, samplerate=val_noise_config['sample_rate'])
        noisy_rec = Recording.from_file(val_noisy_dir / f'{val_noisy_basename}_{str(n)}.wav')
        clean_rec = Recording.from_file(val_clean_dir / f'{val_clean_basename}_{str(n)}.wav')

        val_cut = MonoCut(id=noisy_rec.id,
                start=0,
                duration=noisy_rec.duration,
                channel=0,
                recording=noisy_rec)
        val_cut.target_recording = clean_rec
        writer.write(val_cut)

### Model configuration

Here, a simple encoder-mask-decoder model will be used to process the noisy input signal and produce an enhanced output signal.

In general, an encoder-mask-decoder model can be configured using the `EncMaskDecAudioToAudioModel` class, which is depicted in the following block diagram.

<img src="https://github.com/NVIDIA/NeMo/releases/download/v1.18.0/encmaskdecoder_model.png" alt="encmaskdecoder_model" style="width: 800px;"/>

The model structure can briefly be described as follows:
* Input to the model is a time-domain signal.
* Encoder transforms the input signal to the analysis domain.
* Mask estimator estimates a mask used to generate the output signal.
* Mask processor combines the estimated mask and the encoded input to produce the encoded output.
* Decoder transforms the encoded output into a time-domain signal.
* Output is a time-domain signal.
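
The data flow described in the list above can be sketched on a single frame in pure Python. This is only a toy illustration and not NeMo's implementation: the encoder and decoder are a plain DFT and its inverse, and the mask estimator is a hypothetical magnitude threshold standing in for a trained network. All function names are assumptions made for this sketch.

```python
import cmath
import math


def encoder(frame):
    """Toy analysis transform: DFT of a single frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) for k in range(n)]


def mask_estimator(encoded, threshold):
    """Hypothetical stand-in for a neural estimator: keep bins above a magnitude threshold."""
    return [1.0 if abs(c) > threshold else 0.0 for c in encoded]


def mask_processor(encoded, mask):
    """Combine the mask and the encoded input to obtain the encoded output."""
    return [m * c for m, c in zip(mask, encoded)]


def decoder(encoded):
    """Toy synthesis transform: inverse DFT, keeping the real part."""
    n = len(encoded)
    return [(sum(encoded[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real for t in range(n)]


def enhance(frame, threshold):
    """Time-domain input -> encoder -> mask -> decoder -> time-domain output."""
    encoded = encoder(frame)
    mask = mask_estimator(encoded, threshold)
    return decoder(mask_processor(encoded, mask))


# A strong tone (the "speech") plus a weak tone (the "noise") in one 16-sample frame
n = 16
strong = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]
frame = [s + 0.05 * math.cos(2 * math.pi * 5 * t / n) for t, s in enumerate(strong)]

passthrough = enhance(frame, threshold=-1.0)  # every bin is kept, so the input is reconstructed
cleaned = enhance(frame, threshold=1.0)       # the weak tone falls below the threshold

print('max error, passthrough vs input:', max(abs(a - b) for a, b in zip(passthrough, frame)))
print('max error, cleaned vs strong tone:', max(abs(a - b) for a, b in zip(cleaned, strong)))
```

Both printed errors are close to zero: an all-ones mask reproduces the input, while the thresholded mask removes the weak component and leaves the strong tone.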

For this example, the model will be configured to use a fixed short-time Fourier transform-based encoder and decoder, and the mask will be estimated using a recurrent neural network. The model used here is depicted in the following block diagram.

<img src="https://github.com/NVIDIA/NeMo/releases/download/v1.18.0/single_output_example_model.png" alt="single_output_example_model" style="width: 1000px;"/>

In this particular configuration, the model structure can be described as follows:
* `AudioToSpectrogram` implements the analysis STFT transform.
* `MaskEstimatorRNN` is a mask estimator using RNNs.
* `MaskReferenceChannel` is a simple processor which applies the estimated mask on the reference channel. In this tutorial, the input signal has only a single channel, so the reference channel will be set to `0`.
* `SpectrogramToAudio` implements the synthesis STFT transform.

The following cell will load and show the default configuration for the model depicted above.

In [None]:
config_dir = root_dir / 'conf'
config_dir.mkdir(exist_ok=True)

config_path = config_dir / 'masking_online_aug.yaml'

if not config_path.is_file():
    !wget https://raw.githubusercontent.com/{GIT_USER}/NeMo/{BRANCH}/examples/audio/conf/masking_online_aug.yaml -P {config_dir.as_posix()}

config = OmegaConf.load(config_path)
config = OmegaConf.to_container(config, resolve=True)
config = OmegaConf.create(config)

print('Loaded config')
print(OmegaConf.to_yaml(config))

The training dataset is configured with the following parameters:
* `cuts_path` points to a Lhotse manifest file containing speech samples
* `noise_path` points to a Lhotse manifest file containing noise samples
* `noise_mix_prob` defines the probability with which noise will be added during training
* `noise_snr` defines an SNR range for mixing noise samples
* `rir_enabled` enables room impulse response augmentation
* `rir_path` points to a Lhotse manifest file containing RIR samples
* `rir_prob` defines the probability with which an RIR will be applied during training

For the validation and test sets, only the `cuts_path` parameter is used, since the `val` manifest already contains noisy and clean samples.
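
To illustrate how the probability and range parameters interact, the sketch below draws per-example augmentation decisions. It assumes that each augmentation is applied independently with its configured probability and that the SNR is drawn uniformly from the given range; both are assumptions made for illustration rather than a description of the dataloader internals, and `sample_augmentation` is a hypothetical helper.

```python
import random


def sample_augmentation(rng, noise_mix_prob, noise_snr, rir_prob):
    """Decide which augmentations to apply to one example (illustrative only)."""
    apply_rir = rng.random() < rir_prob
    apply_noise = rng.random() < noise_mix_prob
    # Assumption: the SNR is drawn uniformly from the configured range
    snr_db = rng.uniform(*noise_snr) if apply_noise else None
    return {'apply_rir': apply_rir, 'apply_noise': apply_noise, 'snr_db': snr_db}


rng = random.Random(0)
decisions = [sample_augmentation(rng, noise_mix_prob=1.0, noise_snr=(0, 20), rir_prob=1.0) for _ in range(5)]
for decision in decisions:
    print(decision)
```

With both probabilities set to `1.0`, as in this tutorial, every training example is reverberated and mixed with noise; lowering them would leave a share of the examples clean or dry.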

In [None]:
# Setup training dataset
config.model.train_ds.cuts_path = dataset_manifest['speech_train'].as_posix() # path to Lhotse cuts manifest with speech signals for augmentation
config.model.train_ds.noise_path =  dataset_manifest['noise_train'].as_posix() # path to Lhotse cuts manifest with noise signals
config.model.train_ds.noise_mix_prob =  1.0 # probability of applying noise augmentation
config.model.train_ds.noise_snr =  (0, 20) # range of speech-to-noise ratio for the noise augmentation
config.model.train_ds.rir_enabled = True # enable room impulse response augmentation
config.model.train_ds.rir_path =  dataset_manifest['rir_train'].as_posix() # path to Lhotse recordings manifest with room impulse response signals
config.model.train_ds.rir_prob = 1.0 # probability of applying RIR augmentation

config.model.validation_ds.cuts_path = dataset_manifest['noisy_val'].as_posix() # fixed noisy validation set

config.model.test_ds.cuts_path = dataset_manifest['noisy_val'].as_posix() # fixed noisy test set


print("Train dataset config:")
print(OmegaConf.to_yaml(config.model.train_ds))

Metrics for validation and test set are configured in the following cell.

In this tutorial, signal-to-distortion ratio (SDR) and scale-invariant SDR (SI-SDR) from TorchMetrics [6] are used.

In [None]:
# Setup metrics to compute on validation and test sets
metrics = OmegaConf.create({
    'sisdr': {
        '_target_': 'torchmetrics.audio.ScaleInvariantSignalDistortionRatio',
    },
    'sdr': {
        '_target_': 'torchmetrics.audio.SignalDistortionRatio',
    }
})
config.model.metrics.val = metrics
config.model.metrics.test = metrics

print("Metrics config:")
print(OmegaConf.to_yaml(config.model.metrics))

### Trainer configuration
NeMo models are primarily PyTorch Lightning modules and therefore are entirely compatible with the PyTorch Lightning ecosystem.

In [None]:
print("Trainer config:")
print(OmegaConf.to_yaml(config.trainer))

We can modify some trainer configs for this tutorial.
Most importantly, the number of epochs is set to a small value, to limit the runtime for the purpose of this example.

In [None]:
# Checks if we have GPU available and uses it
accelerator = 'gpu' if torch.cuda.is_available() else 'cpu'
config.trainer.devices = 1
config.trainer.accelerator = accelerator

# Reduces maximum number of epochs for quick demonstration
config.trainer.max_epochs = 30

# Remove distributed training flags
config.trainer.strategy = 'auto'

# Instantiate the trainer
trainer = pl.Trainer(**config.trainer)

### Experiment manager

NeMo has an experiment manager that handles logging and checkpointing.

In [None]:
from nemo.utils.exp_manager import exp_manager

exp_dir = exp_manager(trainer, config.get("exp_manager", None))
# The exp_dir provides a path to the current experiment for easy access

print("Experiment directory:")
print(exp_dir)

### Model instantiation

In [None]:
from nemo.collections import audio as nemo_audio

enhancement_model = nemo_audio.models.EncMaskDecAudioToAudioModel(cfg=config.model, trainer=trainer)

### Training
Create a TensorBoard visualization to monitor progress.

In [None]:
try:
    from google import colab
    COLAB_ENV = True
except (ImportError, ModuleNotFoundError):
    COLAB_ENV = False

# Load the TensorBoard notebook extension
if COLAB_ENV:
    %load_ext tensorboard
    %tensorboard --logdir {exp_dir}
else:
    print("To use tensorboard, please use this notebook in a Google Colab environment.")

Training can be started using `trainer.fit`:

In [None]:
trainer.fit(enhancement_model)

After the training is completed, the configured metrics can be easily computed on the test set as follows:

In [None]:
trainer.test(enhancement_model, ckpt_path=None)

### Inference

The following cell provides an example of inference on a single audio file.
For simplicity, the audio file information is taken from the test dataset.

In [None]:
# Load 10 samples from test_dataloader
samples = [sample for sample in islice(enhancement_model.test_dataloader(), 10)]

# Different sample can be used via list index
sample = samples[0]

noisy_tensor = sample['input_signal']
speech_tensor = sample['target_signal']

# Get the one-dimensional numpy signals for plotting, playback and metrics calculation
noisy_signal = noisy_tensor.squeeze(0).numpy()
speech_signal = speech_tensor.squeeze(0).numpy()


# Move to device
device = 'cuda' if accelerator == 'gpu' else 'cpu'
enhancement_model = enhancement_model.to(device)

# Process using the model
with torch.no_grad():
    output_tensor, _ = enhancement_model(input_signal=noisy_tensor.unsqueeze(1).to(device))
output_signal = output_tensor[0][0].detach().cpu().numpy()

Signals can be easily plotted and signal metrics can be calculated for the given example.

In [None]:
# Show metrics for the noisy and the output signals
show_metrics(signal=noisy_signal, reference=speech_signal, tag='Noisy signal', sample_rate=sample_rate)
show_metrics(signal=output_signal, reference=speech_signal, tag='Output signal', sample_rate=sample_rate)

# Show signals
show_signal(speech_signal, tag='Speech signal')
show_signal(noisy_signal, tag='Noisy signal')
show_signal(output_signal, tag='Output signal')

# Play audio
print('Speech signal')
ipd.display(ipd.Audio(speech_signal, rate=sample_rate))

print('Noisy signal')
ipd.display(ipd.Audio(noisy_signal, rate=sample_rate))

print('Output signal')
ipd.display(ipd.Audio(output_signal, rate=sample_rate))

## Next steps
This is a simple tutorial which can serve as a starting point for prototyping and experimentation with audio-to-audio models.
A processed audio output can be used, for example, for ASR or TTS.

For more details about NeMo models and applications in ASR and TTS, we recommend checking out the following tutorials next:

* [NeMo fundamentals](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/00_NeMo_Primer.ipynb)
* [NeMo models](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb)
* [Speech Recognition](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)
* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)

## References

[1] Żelasko, P., Povey, D., Trmal, J., & Khudanpur, S. (2021). Lhotse: a speech data representation library for the modern deep learning ecosystem. https://arxiv.org/abs/2110.12561

[2] V. Panayotov, G. Chen, D. Povey, S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," ICASSP 2015

[3] Lhotse documentation, https://lhotse.readthedocs.io/

[4] J. Thiemann, N. Ito, E. Vincent, "DEMAND: collection of multi-channel recordings of acoustic noise in diverse environments," ICA 2013

[5] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," ICASSP 2017

[6] TorchMetrics, https://github.com/Lightning-AI/torchmetrics