Audio is a multifaceted data modality that lets us solve a variety of tasks. We focus on speech processing and cover the following categories of speech-related audio tasks:

# Recognition

Automatic Speech Recognition (ASR), also called speech-to-text (STT). The name STT is convenient because it makes explicit that we are solving the inverse of the text-to-speech (TTS) problem.

Formal task definition:

Given a dataset $D = \{ (x^{(i)}, y^{(i)}) \}_{i=1}^{N}$, where $x^{(i)}$ is an audio clip and $y^{(i)}$ is the corresponding text, and a neural network with parameters $\theta$, we want to predict the text tokens given the input audio, i.e. to model $P(y|x;\theta)$.
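The definition above does not fix a training objective. As a sketch, assuming standard maximum-likelihood training, the parameters would be chosen as:

$$\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{N} \log P\left(y^{(i)} \mid x^{(i)}; \theta\right)$$

For CTC-based models, such as the one loaded later in this notebook (its summary lists a `CTCLoss`), the probability of a transcript is obtained by summing over all frame-level alignments $\pi$ of length $T$ that collapse to $y$:

$$P(y \mid x; \theta) = \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} P(\pi_t \mid x; \theta)$$

Here $\mathcal{B}$ is the CTC collapsing function that merges repeated symbols and removes blanks; the symbols $\pi$, $T$ and $\mathcal{B}$ are notation introduced for this sketch.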


# Speaker
- Speaker verification (SV)
- Speaker diarization (SD)
# Generation
- Speech synthesis (TTS)
- Speech enhancement (SE)

In [97]:
import orjson
import torch
import torch.nn as nn
import soundfile as sf
from pathlib import Path
import random
import numpy as np
from IPython.display import display, Audio
import torchaudio

In [2]:
def read_manifest(manifest_path):
    """Read a JSON-lines manifest: one JSON object per line."""
    res = []
    with open(manifest_path, 'rb') as f:
        for line in f:
            data = orjson.loads(line)
            res.append(data)
    return res

def get_num_params(model: nn.Module):
    """Count all parameters of the model, trainable or not."""
    n_params = sum(p.numel() for p in model.parameters())
    return n_params

# STT example. Citrinet

In [3]:
from nemo.collections.asr.models import EncDecCTCModel

In [4]:
asr_model = EncDecCTCModel.from_pretrained("nvidia/stt_uk_citrinet_1024_gamma_0_25")

[NeMo I 2024-05-17 19:15:49 mixins:172] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2024-05-17 19:15:49 modelPT:165] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /data/all_train_data_new.json
    sample_rate: 16000
    batch_size: 16
    trim_silence: false
    max_duration: 20.0
    shuffle: true
    use_start_end_token: false
    num_workers: 8
    pin_memory: true
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    bucketing_strategy: synced_randomized
    bucketing_batch_size: null
    
[NeMo W 2024-05-17 19:15:49 modelPT:172] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /data/dev_new.json
    sample_rate: 16000
    batch_size: 16
    shuffle: false
    use

[NeMo I 2024-05-17 19:15:49 features:289] PADDING: 16
[NeMo I 2024-05-17 19:15:52 save_restore_connector:263] Model EncDecCTCModelBPE was successfully restored from /home/taras/.cache/huggingface/hub/models--nvidia--stt_uk_citrinet_1024_gamma_0_25/snapshots/eb85a22c72ccc5b16e893aece05b55ae466165ad/stt_uk_citrinet_1024_gamma_0_25.nemo.


In [5]:
n_params = get_num_params(asr_model)
print(f"{type(asr_model).__name__} has {n_params=}")

EncDecCTCModelBPE has n_params=141224337


In [6]:
asr_model.summarize()

  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConvASREncoder                    | 140 M 
2 | decoder           | ConvASRDecoder                    | 1.1 M 
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | wer               | WER                               | 0     
------------------------------------------------------------------------
141 M     Trainable params
0         Non-trainable params
141 M     Total params
564.897   Total estimated model params size (MB)
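
The estimated size in the summary can be reproduced from the parameter count. This is a minimal check, assuming every parameter is stored as a 4-byte float32 and that "MB" means 10^6 bytes; both are assumptions about how the summary is computed, not something the output states.

```python
# Parameter count reported by get_num_params above
n_params = 141_224_337
bytes_per_param = 4  # assumption: float32 weights

size_mb = n_params * bytes_per_param / 1e6
print(f"{size_mb:.3f} MB")  # prints "564.897 MB"
```

Under the same assumptions, storing the weights in a 2-byte format would halve this figure.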

In [7]:
manifest_path = '/home/taras/data/ua-corpus/cv-corpus-17.0-2024-03-15/uk/dev.json'
data = read_manifest(manifest_path)
data[0]

{'audio_filepath': '/home/taras/data/ua-corpus/cv-corpus-17.0-2024-03-15/uk/clips-resampled/common_voice_uk_38203705.wav',
 'text': 'не знаю чому але мені тутешні люди незвичайно симпатичні',
 'duration': 6.372,
 'raw_text': 'Не знаю, чому, але мені тутешні люди незвичайно симпатичні.',
 'up_votes': 2,
 'down_votes': 0,
 'age': None,
 'gender': None,
 'accents': None,
 'client_id': '5c4278d26d0d8ee095b4fa8a8787b7c2919010e7bc087ccea770de6f4cd3d8e370f8a6cc756250ad4bb9f31a9e673597c5382f17b6495d5011ec739d20ac46cb'}

# GPU Inference

In [8]:
%%time
res = asr_model.transcribe([p['audio_filepath'] for p in data[:8]], batch_size=8)

Transcribing: 100%|████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.53it/s]

CPU times: user 1.37 s, sys: 178 ms, total: 1.55 s
Wall time: 436 ms





# CPU Inference

In [9]:
%%time
asr_model.cpu()
res = asr_model.transcribe([p['audio_filepath'] for p in data[:8]], batch_size=8)

Transcribing: 100%|████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.12it/s]

CPU times: user 27.5 s, sys: 1.35 s, total: 28.8 s
Wall time: 1.27 s
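
To compare the two runs, one can take the ratio of the wall times reported above. The helper `speedup` is a hypothetical name introduced here. The result is only a rough indication, since each number comes from a single run on 8 clips and the first GPU call may include one-off warm-up overhead.

```python
def speedup(baseline_s: float, improved_s: float) -> float:
    """How many times faster the improved run is than the baseline."""
    return baseline_s / improved_s

cpu_wall_s = 1.27    # CPU wall time reported above
gpu_wall_s = 0.436   # GPU wall time reported above

ratio = speedup(cpu_wall_s, gpu_wall_s)
print(f"GPU is about {ratio:.1f}x faster on this batch")
```

A more reliable comparison would average several runs over a larger part of the manifest.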





In [69]:
res

['не знаю чому але мені тутешні люди надзвичайно симпатичні',
 'я побачив жіночу постать перед собою люто та страшну',
 'а я накликала жхів і для себе і для вас',
 'холодний піт виступив у його на лобі',
 'з чого ви взяли',
 'ні крепли ни крові чуєш',
 'і вона буде їх любити',
 'ну що його діяти']
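
The first prediction differs from the manifest text of `data[0]` in one word ("надзвичайно" instead of "незвичайно"). A common way to quantify such differences is the word error rate (WER). The function below is a minimal pure-Python sketch of WER as word-level edit distance divided by the reference length; it is not the metric implementation used inside NeMo, and `word_error_rate` is a name introduced here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

reference = "не знаю чому але мені тутешні люди незвичайно симпатичні"
hypothesis = "не знаю чому але мені тутешні люди надзвичайно симпатичні"
print(f"WER = {word_error_rate(reference, hypothesis):.3f}")  # prints "WER = 0.111"
```

One substituted word out of nine reference words gives 1/9. The sketch assumes a non-empty reference; an empty one would raise a division by zero.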

# Inference on noisy data

In [115]:
def add_noise(signal_path, noise_path, snr=0):
    """Mix a noise clip into a speech clip at the given SNR (in dB)."""
    signal, sr = sf.read(signal_path, dtype="float32", always_2d=True)
    # Note: the noise is assumed to share the signal's sample rate
    noise, _ = sf.read(noise_path, dtype="float32", always_2d=True)
    # Downmix multi-channel noise to mono
    if noise.shape[-1] > 1:
        noise = noise.mean(axis=1, keepdims=True)
    # Crop the noise to the signal length, or zero-pad it if it is shorter
    aug = np.zeros_like(signal)
    if len(noise) > len(signal):
        aug = noise[:len(signal)]
    else:
        aug[:len(noise)] = noise
    snr = torch.tensor([snr])
    signal = torch.tensor(signal).reshape(1, -1)
    aug = torch.tensor(aug).reshape(1, -1)
    res = torchaudio.functional.add_noise(signal, aug, snr)
    
    return res.squeeze()
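
To make the role of the `snr` argument concrete, here is a NumPy sketch of mixing at a target SNR using the usual definition, SNR_dB = 10 * log10(P_signal / P_noise). It illustrates the idea only and is not torchaudio's implementation; `mix_at_snr` and the synthetic tone and white noise are stand-ins introduced for this example.

```python
import numpy as np

def mix_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that signal power / noise power matches `snr_db`, then add."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Target noise power follows from SNR_dB = 10 * log10(P_signal / P_noise)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return signal + scale * noise

rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s synthetic tone
noise = rng.normal(size=16000)                                # synthetic white noise

mixed = mix_at_snr(signal, noise, snr_db=-15)
# Recover the SNR from the mixture to check the scaling
achieved = 10 * np.log10(np.mean(signal ** 2) / np.mean((mixed - signal) ** 2))
print(f"achieved SNR: {achieved:.2f} dB")
```

At -15 dB the noise carries roughly 31.6 times the power of the speech (10^1.5), so the speech is largely buried under the noise.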

In [112]:
noise_data_dir = Path("/home/ubuntu/data/noise-data/wham_noise")
noise_audio_paths = list(noise_data_dir.glob("**/*.wav"))

In [109]:
noise_sample_path = random.choice(noise_audio_paths)
print(noise_sample_path)
Audio(filename=noise_sample_path)

/home/ubuntu/data/noise-data/wham_noise/tr/01lc0217_1.9762_023o0319_-1.9762.wav


# SNR = -15 dB

In [141]:
aug_signals = [add_noise(itm['audio_filepath'], noise_sample_path, snr=-15) for itm in data[:8]]    
res = asr_model.transcribe(aug_signals, batch_size=8)

Transcribing: 100%|████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.32it/s]


In [142]:
for sig, txt in zip(aug_signals, res):
    print(f"pred transcript: {txt}")
    display(Audio(data=sig, rate=16000))


pred transcript: стснязт


pred transcript: я побачивши нас гои перес люто та страшно


pred transcript: а я і так і дав


pred transcript: ну то розстлостля


pred transcript: чого вам взяли


pred transcript: нетрап кров каза


pred transcript: ну то розстрілять


pred transcript: огоді
