<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/nlp/blob/main/srs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/nlp/blob/main/srs.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>

# Automatic Speech Recognition and Text-to-Speech

📝 SALP chapter 16

## Introduction
- An early known `automatic speech recognition (ASR)` example is [Radio Rex](https://en.wikipedia.org/wiki/Virtual_assistant), which appeared in the 1920s, 
  - a toy responding to specific sounds like the `vowel [eh]` in "Rex."  
- Despite limitations in diverse environments, **modern ASR** `converts speech waveforms into text` and is widely used in
  - smart appliances, personal assistants, call routing, transcription, and assisting individuals with disabilities.  
- [Wolfgang von Kempelen’s late 18th-century speech synthesizer](https://en.wikipedia.org/wiki/Wolfgang_von_Kempelen%27s_speaking_machine) marked the first `text-to-speech (TTS)` system using mechanical components.  
  - **Modern TTS** maps `text to audio waveforms`, aiding communication for visually impaired users and individuals with neurological disorders.  
- ASR and TTS share core algorithmic principles:
  - encoder-decoder models, 
  - Connectionist Temporal Classification (CTC) loss functions, 
  - word error rate evaluation, 
  - and acoustic feature extraction.  

🔭 Explore 
- [ASR models](https://www.gladia.io/blog/best-open-source-speech-to-text-models)
- [TTS models](https://www.bentoml.com/blog/exploring-the-world-of-open-source-text-to-speech-models)

🏃 Play
- [Whisper: a general-purpose speech recognition model](https://github.com/openai/whisper)
- [OpenVoice: versatile instant voice cloning](https://github.com/myshell-ai/OpenVoice)
- [UltraVox: a fast multimodal LLM for real-time voice](https://github.com/fixie-ai/ultravox)

## The Automatic Speech Recognition Task
- **Dimensions of Variation in ASR:**
  - Vocabulary size:
    - Small vocabularies (e.g., yes/no, digits) are highly accurate.
    - Large vocabularies (up to 60,000 words) in open-ended tasks are more difficult.
  - Speaker and context:
    - Human-to-machine speech (dictation/dialogue systems) is easier.
    - Read speech (e.g., audiobooks) is relatively easy.
    - Conversational speech between humans is the hardest due to faster, less clear speech.
  - Channel and noise:
    - Quiet environments with head-mounted microphones are ideal.
    - Noisy settings (e.g., streets, open car windows) complicate recognition.
  - Accent and speaker characteristics:
    - Recognition is better for accents/dialects similar to the training data.
    - Regional/ethnic dialects and children's speech are more challenging.

- **Key ASR Corpora:**
  - [LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech):
    - Open-source, read-speech dataset with 1,000+ hours of audiobooks.
    - Divided into "clean" (high quality) and "other" (lower quality) portions.
  - [Switchboard Corpus](https://huggingface.co/datasets/cgpotts/swda):
    - Prompted telephone conversations between strangers (~240 hours).
    - Extensive linguistic labeling (e.g., dialogue acts, prosody).
  - [CALLHOME Corpus](https://huggingface.co/datasets/talkbank/callhome):
    - Unscripted 30-minute telephone conversations (friends/family).
    - Focus on natural, casual speech.
  - [Santa Barbara Corpus](https://www.linguistics.ucsb.edu/research/santa-barbara-corpus):
    - Everyday spoken interactions across the US (e.g., conversations, town halls).
    - Anonymized transcripts.
  - [CORAAL](https://huggingface.co/datasets/Padomin/coraal-asr):
    - Sociolinguistic interviews with African American speakers.
    - Focus on African American Language (AAL).
  - [CHiME Challenge](https://www.chimechallenge.org/):
    - Datasets for robust ASR tasks in noisy, real environments (e.g., dinner parties).
  - [AISHELL-1 Corpus](https://paperswithcode.com/dataset/aishell-1):
    - 170 hours of Mandarin read speech from various domains.

### 🏃 Practice [Hugging Face Audio course](https://huggingface.co/learn/audio-course)
- 📖 [Unit 2. A gentle introduction to audio applications](https://huggingface.co/learn/audio-course/chapter2/introduction)
  - 📝 1. Investigate audio applications

In [None]:
!pip install transformers datasets librosa soundfile

# 1. Audio classification with a pipeline
# 1.1 Load the dataset minds14
# It contains recordings of people asking an e-banking system questions 
#   in several languages and dialects
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

# 1.2 Create a classifier
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
    device='cuda'
)

In [None]:
# look at an example
example = minds[0]

In [None]:
# classifying
classifier(example["audio"]["array"])

In [None]:
# the actual label for this example is:
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])

In [None]:
# 2. Automatic speech recognition with a pipeline
# using the same MINDS-14 dataset as before

# transcribe an audio recording using the automatic-speech-recognition pipeline
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", device='cuda')

In [None]:
# try an example
example = minds[0]
asr(example["audio"]["array"])

In [None]:
# compare it to the actual transcription
example["english_transcription"]

In [None]:
# 2.2 Try it on Chinese
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="zh-CN", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

In [None]:
# look at an example
example = minds[0]
example["transcription"]

In [None]:
# Find a pre-trained ASR model for Chinese language on the 🤗 Hub
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn", device='cuda')
asr(example["audio"]["array"])


In [None]:
# 3. Audio generation with a pipeline
# upgrade transformers
!pip install --upgrade transformers

In [None]:
# 3.1 Generating speech
from transformers import pipeline

# https://huggingface.co/suno/bark-small
pipe = pipeline("text-to-speech", model="suno/bark-small", device='cuda')

# try a text
text = "Ladybugs have had important roles in culture and religion, being associated with luck, love, fertility and prophecy. "
output = pipe(text)

# play the speech
from IPython.display import Audio

Audio(output["audio"], rate=output["sampling_rate"])

In [None]:
# Try a different language
zh_text = "月落乌啼霜满天，江枫渔火对愁眠。 姑苏城外寒山寺，夜半钟声到客船。"
output = pipe(zh_text)
Audio(output["audio"], rate=output["sampling_rate"])

In [None]:
# generate audio with non-verbal communications and singing.
song = "♪ In the jungle, the mighty jungle, the ladybug was seen. ♪ "
output = pipe(song)
Audio(output["audio"], rate=output["sampling_rate"])

In [None]:
# 3.2 Generating music
# https://huggingface.co/facebook/musicgen-small
music_pipe = pipeline("text-to-audio", model="facebook/musicgen-small", device='cuda')

prompt = "90s rock song with electric guitar and heavy drums"

# generate the music
forward_params = {"max_new_tokens": 512}

output = music_pipe(prompt, forward_params=forward_params)
Audio(output["audio"][0], rate=output["sampling_rate"])

## Feature Extraction for ASR: Log Mel Spectrum
- ASR converts waveforms into [log mel spectrum](https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53) feature vectors.
  - representing information from small time windows of the signal.

### Sampling and Quantization
- Speech recognizers process air pressure changes caused by vocalizations, visualized as waveforms showing air pressure over time. 
  - ![A waveform of an instance of the vowel iy](./images/stts/wave.png)
    - A waveform of an instance of the vowel `i` in `ˈbeɪbi`
- Analog sound waves are digitized through 
  - **sampling**: measuring wave amplitude at intervals,
  - **quantization**: converting measurements into integers.  
- The **Nyquist frequency**, half the sampling rate, is the maximum frequency that can be measured at that rate; 
  - an 8 kHz sampling rate is sufficient for telephone speech, which carries only frequencies below 4 kHz, 
  - while a 16 kHz sampling rate is used for microphone speech.  
- Different sampling rates (e.g., 8 kHz vs. 16 kHz) cannot be mixed in ASR training/testing, 
  - requiring downsampling for consistency.  
- Quantization stores amplitudes as integers (e.g., 8-bit or 16-bit), 
  - with log compression (e.g., µ-law) optimizing for human auditory sensitivity to smaller intensities.  
- Audio files vary by sample rate, sample size, number of channels (mono/stereo), and storage type (linear vs. compressed).  
- Common file formats include 
  - [.wav](https://ccrma.stanford.edu/courses/422-winter-2014/projects/WaveFormat/), a subset of Microsoft [Resource Interchange File Format (RIFF)](https://en.wikipedia.org/wiki/Resource_Interchange_File_Format-based), 
  - Apple’s [Audio Interchange File Format (AIFF)](https://en.wikipedia.org/wiki/Audio_Interchange_File_Format), 
  - and [raw headerless formats](https://en.wikipedia.org/wiki/Raw_audio_format).  
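
The sampling, quantization, and µ-law steps above can be sketched in plain Python. This is a minimal illustration rather than a production codec: the helper names (`sample_sine`, `quantize`, `mu_law_compress`) and the 440 Hz test tone are assumptions made for the example, and µ = 255 is the value commonly used for 8-bit telephone speech.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, n_samples):
    """Sampling: measure a unit-amplitude sine wave at evenly spaced times."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

def quantize(samples, bits=16):
    """Quantization: map amplitudes in [-1, 1] to signed integers."""
    max_int = 2 ** (bits - 1) - 1          # 32767 for 16-bit audio
    return [round(max(-1.0, min(1.0, s)) * max_int) for s in samples]

def mu_law_compress(x, mu=255):
    """Log compression: F(x) = sgn(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

sample_rate = 8000                 # telephone-bandwidth sampling rate (Hz)
nyquist = sample_rate / 2          # highest frequency this rate can represent
samples = sample_sine(440, sample_rate, 16)
pcm16 = quantize(samples, bits=16)

compressed_small = mu_law_compress(0.01)   # quiet sample
compressed_large = mu_law_compress(0.5)    # loud sample
print(nyquist, pcm16[:4], compressed_small, compressed_large)
```

With µ = 255, an amplitude of 0.01 is mapped to roughly 0.23, so quiet samples occupy a much larger share of the integer range than they would under linear quantization.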

### Windowing
- `Spectral features` are extracted from small windows of speech, 
  - treating the signal as stationary within each window, 
  - despite speech being non-stationary overall.  
- A **frame** represents the speech within each window, determined by three parameters:
  - window size, or frame size, with width in ms, 
  - frame stride, offset, or shift between successive windows, 
  - and window shape.
  - ![Windowing, showing a 25 ms rectangular window with a 10ms stride](./images/stts/win.png)  
- `Windowing` is performed by multiplying the signal $s[n]$ at time $n$ by the window function $w[n]$, producing a windowed waveform $y[n]$.
  - $y[n]=w[n]s[n]$
- Common window shapes include 
  - **rectangular**: simple but causes boundary issues in Fourier analysis
  - **Hamming**: tapers the signal toward zero at the window edges, smoothing boundary discontinuities for better feature extraction.
- ![Windowing a sine wave with the rectangular or Hamming windows](./images/stts/winshape.png)
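
The framing procedure described above (25 ms window, 10 ms stride, Hamming shape) can be sketched as follows. The function names and the 200 Hz test signal are assumptions for illustration, and the window uses the common symmetric form with $L-1$ in the denominator, which differs slightly from variants that divide by $L$.

```python
import math

def hamming(length):
    """Symmetric Hamming window: 0.54 - 0.46*cos(2*pi*n/(L-1)), n = 0..L-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frame_signal(signal, sample_rate, frame_ms=25, stride_ms=10):
    """Cut a signal into overlapping frames and window each: y[n] = w[n] * s[n]."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per window
    stride = int(sample_rate * stride_ms / 1000)     # samples between window starts
    window = hamming(frame_len)
    frames = []
    # Only full-length frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(signal) - frame_len + 1, stride):
        frame = signal[start:start + frame_len]
        frames.append([w * s for w, s in zip(window, frame)])
    return frames

sample_rate = 16000
# One second of a 200 Hz sine wave as a stand-in for speech.
signal = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(sample_rate)]
frames = frame_signal(signal, sample_rate)
print(len(frames), len(frames[0]))
```

At 16 kHz a 25 ms window holds 400 samples and a 10 ms stride advances 160 samples, so one second of audio yields 98 full frames.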

### Discrete Fourier Transform
- The **Discrete Fourier Transform (DFT)** is used to extract spectral information from a windowed discrete-time signal $x[n]$. 
  - The output $X[k]$ represents the magnitude and phase of $N$ discrete frequency components, enabling visualization of the signal spectrum.
- ![dft](./images/stts/dft.png)
  - Left: A 25 ms Hamming-windowed portion of a signal from the vowel `i` in `ˈbeɪbi`
  - Right: its spectrum computed by a DFT.

- The DFT is defined as:
  - $X[k] = \sum_{n=0}^{N-1} x[n]e^{-j\frac{2\pi kn}{N}}$
- The **Fast Fourier Transform (FFT)** is an efficient algorithm for computing the DFT, optimized for $N$ values that are powers of 2.
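
The DFT sum can be evaluated directly in a few lines, which makes the formula concrete even though real systems use an FFT. This is a deliberately naive $O(N^2)$ sketch; the sampling rate, window length, and 100 Hz tone are assumptions chosen so that the tone falls exactly on one frequency bin.

```python
import cmath
import math

def dft(x):
    """Direct evaluation of X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

sample_rate = 800    # Hz (illustrative)
N = 64               # number of samples in the window
tone_hz = 100        # lands on bin k = tone_hz * N / sample_rate = 8

x = [math.sin(2 * math.pi * tone_hz * n / sample_rate) for n in range(N)]
X = dft(x)

# For a real signal only bins 0..N/2 carry independent information.
magnitudes = [abs(v) for v in X[: N // 2 + 1]]
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_hz = peak_bin * sample_rate / N
print(peak_bin, peak_hz)
```

Bin $k$ corresponds to the frequency $k \cdot \text{sample rate} / N$, so the magnitude peak at bin 8 recovers the 100 Hz tone.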

### Mel Filter Bank and Log
- The **FFT results** provide the energy at each frequency band, 
  - but human hearing is biased toward low frequencies, 
    - aiding recognition of critical low-frequency features (e.g., vowels or nasals) over high-frequency features (e.g., fricatives). 
  - Incorporating this bias enhances speech recognition.

- The **mel scale**, an auditory frequency scale, models human perception of pitch. 
  - It spaces sounds perceptually equidistant in pitch,
  - with the mel frequency $m$ calculated from the raw frequency $f$ as:  
    - $\text{mel}(f) = 1127 \ln(1 + \dfrac{f}{700})$

- A **mel filter bank**, composed of logarithmically spaced triangular filters, 
  - collects energy with fine resolution at low frequencies and coarse resolution at high frequencies. 
  - This approach creates a **mel spectrum**, representing the perceptual energy distribution.
- ![The mel filter bank](./images/stts/melbank.png)

- Applying a **logarithmic transformation** to the mel spectrum values mirrors the human logarithmic response to signal levels. 
  - This reduces sensitivity to variations in input power, such as changes in the speaker's distance from the microphone, stabilizing feature estimates.
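
The mel formula and the effect of the log can be checked numerically. The sketch below only places filter center frequencies evenly on the mel scale and does not build the triangular filters themselves; the function names, the choice of 10 filters, and the toy energy values are assumptions for illustration.

```python
import math

def hz_to_mel(f):
    """mel(f) = 1127 * ln(1 + f / 700)."""
    return 1127 * math.log(1 + f / 700)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700 * (math.exp(m / 1127) - 1)

def mel_center_frequencies(n_filters, f_min, f_max):
    """Center frequencies (Hz) of filters spaced evenly on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_filters)]

centers = mel_center_frequencies(10, 0, 8000)
gaps = [b - a for a, b in zip(centers, centers[1:])]   # spacing in Hz

# Log compression: scaling the input power by 10 only adds a constant offset.
quiet = [1.0, 4.0, 2.0]                 # toy mel-band energies
loud = [10 * e for e in quiet]          # same sound, 10x the power
offsets = [math.log(l) - math.log(q) for l, q in zip(loud, quiet)]
print([round(c) for c in centers], offsets)
```

Because the centers are evenly spaced in mel, the gaps between neighbours grow with frequency, giving fine resolution at low frequencies and coarse resolution at high ones. After the log, a uniform change in input power appears as the same additive shift in every band.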

### 🏃 Practice [Hugging Face Audio course](https://huggingface.co/learn/audio-course)
- [Unit 1. Working with audio data](https://huggingface.co/learn/audio-course/chapter1/introduction)
  - 📝 1. Visualize audio data

In [None]:
# 1. install librosa, used here to load audio and plot its waveform
!pip install librosa

# 2. load audio data
import librosa

array, sampling_rate = librosa.load(librosa.ex("trumpet"))

# 3. plot audio waveform
import matplotlib.pyplot as plt
import librosa.display

plt.figure().set_figwidth(12)
librosa.display.waveshow(array, sr=sampling_rate)

In [None]:
# 4. plot frequency spectrum, i.e. frequency domain representation
# this cell plots the amplitude spectrum in decibels;
#   a power spectrum, which measures energy rather than amplitude,
#   is simply a spectrum with the amplitude values squared.
import numpy as np

dft_input = array[:4096]

# calculate the DFT
window = np.hanning(len(dft_input))
windowed_input = dft_input * window
dft = np.fft.rfft(windowed_input)

# get the amplitude spectrum in decibels
amplitude = np.abs(dft)
amplitude_db = librosa.amplitude_to_db(amplitude, ref=np.max)

# get the frequency bins
frequency = librosa.fft_frequencies(sr=sampling_rate, n_fft=len(dft_input))

plt.figure().set_figwidth(12)
plt.plot(frequency, amplitude_db)
plt.xlabel("Frequency (Hz)")
plt.ylabel("Amplitude (dB)")
plt.xscale("log")

In [None]:
# 5. Plot audio spectrogram
# A spectrogram plots the frequency content of an audio signal as it changes over time. 
# It allows you to see time, frequency, and amplitude all on one graph. 

import numpy as np

D = librosa.stft(array)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

plt.figure().set_figwidth(12)
librosa.display.specshow(S_db, x_axis="time", y_axis="hz")
plt.colorbar()

In [None]:
# 6. Plot Mel spectrogram
#  it shows the frequency content of an audio signal over time, 
# but on a different frequency axis.
S = librosa.feature.melspectrogram(y=array, sr=sampling_rate, n_mels=128, fmax=8000)
S_dB = librosa.power_to_db(S, ref=np.max)

plt.figure().set_figwidth(12)
librosa.display.specshow(S_dB, x_axis="time", y_axis="mel", sr=sampling_rate, fmax=8000)
plt.colorbar()

- 📝 2. Load and explore an audio dataset

In [None]:
# 1. Install the Datasets library
!pip install datasets[audio]

# 2. load and explore an audio dataset called MINDS-14
# https://huggingface.co/datasets/PolyAI/minds14
# It contains recordings of people asking an e-banking system questions 
# in several languages and dialects.
from datasets import load_dataset

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds

In [None]:
# look at an example
example = minds[0]
example

In [None]:
# The intent_class is a classification category of the audio recording.
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])

In [None]:
# 3. remove unused features
columns_to_remove = ["lang_id", "english_transcription"]
minds = minds.remove_columns(columns_to_remove)
minds

In [None]:
# 4. listen to a few examples
import gradio as gr


def generate_audio():
    example = minds.shuffle()[0]
    audio = example["audio"]
    return (
        audio["sampling_rate"],
        audio["array"],
    ), id2label(example["intent_class"])


with gr.Blocks() as demo:
    with gr.Column():
        for _ in range(4):
            audio, label = generate_audio()
            output = gr.Audio(audio, label=label)

demo.launch(debug=True)

In [None]:
# 5. visualize some examples
import librosa
import matplotlib.pyplot as plt
import librosa.display

array = example["audio"]["array"]
sampling_rate = example["audio"]["sampling_rate"]

plt.figure().set_figwidth(12)
librosa.display.waveshow(array, sr=sampling_rate)

- 📝 3. Preprocessing an audio dataset

In [None]:
# 1. Resampling the audio data to the model’s expected sampling rate.
from datasets import Audio

minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

# check it is resampled to the desired sampling rate:
minds[0]

In [None]:
# 2. Filtering the dataset
# e.g. limiting the audio examples to a certain duration
MAX_DURATION_IN_SECONDS = 20.0


def is_audio_length_in_range(input_length):
    return input_length < MAX_DURATION_IN_SECONDS

In [None]:
# apply the filter
# use librosa to get example's duration from the audio file
new_column = [librosa.get_duration(path=x) for x in minds["path"]]
minds = minds.add_column("duration", new_column)

# use 🤗 Datasets' `filter` method to apply the filtering function
minds = minds.filter(is_audio_length_in_range, input_columns=["duration"])

# remove the temporary helper column
minds = minds.remove_columns(["duration"])
minds

In [None]:
# 3. Pre-processing audio data
# preparing the data in the right format for model training.
# convert the raw data into input features.
# e.g. Whisper feature extractor: https://huggingface.co/papers/2212.04356

from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

# pre-process a single audio example by passing it through the feature_extractor.
def prepare_dataset(example):
    audio = example["audio"]
    features = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"], padding=True
    )
    return features

In [None]:
# apply the data preparation function to all of our training examples
minds = minds.map(prepare_dataset)
minds

In [None]:
# we now have log-mel spectrograms as input_features in the dataset.
# Let’s visualize it for one of the examples in the minds dataset:
import numpy as np

example = minds[0]
input_features = example["input_features"]

plt.figure().set_figwidth(12)
librosa.display.specshow(
    np.asarray(input_features[0]),
    x_axis="time",
    y_axis="mel",
    sr=feature_extractor.sampling_rate,
    hop_length=feature_extractor.hop_length,
)
plt.colorbar()

In [None]:
# 4. for simplicity, load the feature extractor and tokenizer for Whisper 
# via the so-called processor.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-small")

- 📝 4. Streaming large audio dataset

In [None]:
# 1. enable streaming mode
gigaspeech = load_dataset("speechcolab/gigaspeech", "xs", streaming=True)

# you can no longer access individual samples using Python indexing
# you have to iterate over the dataset.
next(iter(gigaspeech["train"]))

In [None]:
# 2. preview several examples from a large dataset
gigaspeech_head = gigaspeech["train"].take(2)
list(gigaspeech_head)

## Speech Recognition Architecture
- Similar to MT architectures, **ASR** uses an `encoder-decoder framework (RNNs or Transformers)`,
  - mapping log mel spectral features to letters or wordpieces.  
- ![Schematic architecture for an encoder-decoder speech recognizer.](./images/stts/aed.png)
- **Attention-Based Encoder-Decoder (AED)**, also known as `listen attend and spell (LAS)`,
  - maps acoustic feature sequences $F = (f_1, f_2, \cdots, f_t)$ into output sequences like letters or wordpieces $Y=(⟨\text{SOS}⟩, y_1, \cdots, y_m, ⟨\text{EOS}⟩)$.
- To shorten long acoustic sequences (e.g., 200 frames for a 2-second word) to match much shorter text sequences (5 letters), a subsampling step, **length compression**, is applied.
  - $F=(f_1, f_2, \cdots, f_t) ↦ X=(x_1,⋯, x_n),\; n ≪ t$
  - The simplest algorithm, **low frame rate** compression, 
    - stacks acoustic vectors (e.g., concatenating 3 frames into one) to reduce sequence length, 
    - creating longer vectors at coarser intervals.  
  - After compression, the architecture uses `RNNs (LSTMs) or Transformers`, with possible beam search integration for decoding.
    - $\displaystyle p(y_1, \cdots, y_m) = ∏_{i=1}^{m} p(y_i | y_1, ⋯, y_{i-1}, X)$
    - Greedy decoding: $\displaystyle \hat{y}_i = \underset{\text{char} ∈ \text{Alphabet}}{\text{argmax}}\; p(\text{char}|y_1, ⋯, y_{i-1}, X)$
- ASR models can improve by rescoring hypotheses $Y$ using a larger external language model and interpolating its score with the encoder-decoder score.  
  - Use beam search to generate an **n-best list** of sentence hypotheses, then rescore each using a **language model**. 
    - Combine the encoder-decoder score and language model score with a tunable weight λ. 
    - $\text{score}(Y|X) = \dfrac{\log P(Y|X)}{|Y|_c} + \lambda \log P_{\text{LM}}(Y)$ 
    - The sentence length bias is addressed by normalizing probabilities by the number of characters $|Y|_c$.
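
Two steps from this section are easy to illustrate in isolation: low frame rate stacking and n-best rescoring with the interpolated score. The sketch below uses toy data throughout. The hypotheses, their log probabilities, the dictionary standing in for a language model, and the choice to count spaces in $|Y|_c$ are all assumptions made for the example.

```python
import math

def stack_frames(frames, factor=3):
    """Low frame rate: concatenate every `factor` consecutive vectors into one.

    Leftover frames that do not fill a complete group are dropped.
    """
    stacked = []
    for start in range(0, len(frames) - factor + 1, factor):
        merged = []
        for frame in frames[start:start + factor]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked

def rescore(nbest, lm_logprob, lam=0.5):
    """Rank hypotheses by log P(Y|X) / |Y|_c + lam * log P_LM(Y)."""
    def score(item):
        text, model_logprob = item
        return model_logprob / len(text) + lam * lm_logprob(text)
    return sorted(nbest, key=score, reverse=True)

# 200 frames of 4-dimensional features (2 s of speech at a 10 ms stride).
frames = [[float(t)] * 4 for t in range(200)]
compressed = stack_frames(frames)          # 66 frames of 12 dimensions

# Toy n-best list: (hypothesis, encoder-decoder log probability).
nbest = [("recognize speech", -4.0), ("wreck a nice beach", -3.8)]
toy_lm = {"recognize speech": math.log(0.01),
          "wreck a nice beach": math.log(0.0001)}
ranked = rescore(nbest, toy_lm.get, lam=0.5)
print(len(compressed), len(compressed[0]), ranked[0][0])
```

With `lam=0` the acoustically preferred "wreck a nice beach" ranks first; adding the language model term moves "recognize speech" to the top.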

### Learning
- Encoder-decoders are trained with `cross-entropy loss`, calculated at each decoding step as:  
   - $L_{\text{CE}} = -\log p(y_i | y_1, \ldots, y_{i-1}, X)$ 
   - The total sentence loss is the sum over all tokens:  
     - $L_{\text{CE}} = -\sum_{i=1}^{m} \log p(y_i | y_1, \ldots, y_{i-1}, X)$

- **Teacher Forcing**: 
  - Training typically conditions the decoder on the gold token history $y_1, \ldots, y_{i-1}$, 
  - but can mix gold tokens with the decoder's own predictions $\hat{y}_i$, e.g., 
    - feeding the gold token 90% of the time and the decoder output 10% of the time.
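
The sentence-level loss is just the sum of per-step negative log probabilities of the gold tokens. A minimal numeric sketch follows, where the per-step distributions are made-up numbers standing in for decoder softmax outputs under teacher forcing.

```python
import math

def sentence_cross_entropy(step_distributions, gold_tokens):
    """L_CE = -sum_i log p(y_i | y_1..y_{i-1}, X) over the gold tokens."""
    return -sum(math.log(dist[token])
                for dist, token in zip(step_distributions, gold_tokens))

# Hypothetical decoder outputs, one distribution per decoding step,
# each conditioned on the gold history (teacher forcing).
steps = [
    {"c": 0.7, "k": 0.2, "a": 0.1},
    {"a": 0.9, "c": 0.05, "t": 0.05},
    {"t": 0.6, "d": 0.4},
]
loss = sentence_cross_entropy(steps, ["c", "a", "t"])
print(loss)
```

Summing log probabilities is equivalent to taking the log of the product $0.7 \times 0.9 \times 0.6$, which is the chain-rule probability of the whole gold sequence.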

## Connectionist Temporal Classification (CTC) 
- **CTC** offers an alternative approach by `outputting a character for each input frame`,
  - ensuring the output matches the input length, 
  - then collapsing sequences of identical letters into a shorter sequence.  
- A naive collapsing function removes consecutive duplicate letters, 
  - but this can misrepresent words (e.g., "dinner" transcribed as "diner") 
  - and struggles with aligning silence in the input.  
- ![A naive algorithm for collapsing an alignment between input and letters.](./images/stts/collapse.png)
- CTC resolves these issues by introducing a **blank symbol ␣** to the transcription alphabet,
  - which helps handle silences and prevents incorrect letter collapsing across blanks. 
- ![The CTC collapsing function B](./images/stts/ctc.png) 
  - The collapsing function $B: A ↦ Y$ maps alignments $A$ to outputs $Y$ 
    - by removing blanks and collapsing repeated letters, 
    - enabling more accurate transcription.  
  - The function $B$ is **many-to-one**, meaning multiple alignments can produce the same output. 
    - For example, several alignments can yield the word "dinner." 
    - ![Three other legitimate alignments producing the transcript dinner](./images/stts/align.png) 
  - The inverse function $B^{-1}(Y)$ represents all possible alignments that can generate a given output $Y$.
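
The collapsing function $B$ is short enough to write out. In this sketch the underscore `_` stands in for the blank symbol ␣, and the example alignments are made up for illustration.

```python
BLANK = "_"   # stands in for the blank symbol

def ctc_collapse(alignment, blank=BLANK):
    """B: merge runs of identical symbols, then drop blanks."""
    output = []
    previous = None
    for symbol in alignment:
        # Emit a symbol only when it starts a new run and is not a blank.
        if symbol != previous and symbol != blank:
            output.append(symbol)
        previous = symbol
    return "".join(output)

print(ctc_collapse("dii_nn_nerr"))       # blank between the two n-runs keeps both
print(ctc_collapse("dd_ii_nn_n_ee_r"))   # a different alignment, same output
print(ctc_collapse("diinnnerr"))         # no blank: the double n is lost
```

The first two alignments both collapse to "dinner", which shows that $B$ is many-to-one. The third has no blank between the n's and collapses to "diner", the failure of the naive scheme.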

### CTC Inference
- CTC assumes conditional independence at each time step, calculating the **CTC Alignment Probability** $P_{\text{CTC}}(A|X)$ of an alignment $A=\{a_1,\cdots, a_T\}$ as:  
   - $P_{\text{CTC}}(A|X) = \prod_{t=1}^T p(a_t|X)$  
   - The best alignment $\hat{A}=\{\hat{a}_1,\cdots, \hat{a}_T\}$ is chosen greedily for each time step $t$ as:  
     - $\displaystyle \hat{a}_t = \arg\max_{c \in C} p_t(c|X)$  

- CTC uses an encoder-only model, generating a hidden state $h_t$ for each time step and decoding via a softmax over the character vocabulary. 
  - The sequence $A$ is then passed to the collapsing function $B$ to produce the output sequence $Y$.
  - ![Inference with CTC](./images/stts/infer.png)

- The most probable alignment may not correspond to the most probable collapsed output $Y$, 
  - as multiple alignments can lead to the same $Y$. 
  - To find the most probable $Y$, sum over the probabilities of all possible alignments:  
     - $\displaystyle P_{\text{CTC}}(Y|X) = \sum_{A \in B^{-1}(Y)} P(A|X)$ 
     - $\displaystyle =\sum_{A \in B^{-1}(Y)} ∏_{t=1}^T p(a_t | h_t)$
     - $\displaystyle \hat{Y} = \arg\max_{Y} P_{\text{CTC}}(Y|X)$

  - Summing over all alignments is computationally expensive. 
    - Instead, an approximate sum is achieved using a Viterbi beam search, 
    - focusing on high-probability alignments mapping to the same output.  

- Due to the independence assumption, CTC does not learn a language model. 
  - To enhance predictions, interpolate a language model score $P_{\text{LM}}(Y)$ and a length factor $L(Y)$ with trained weights:  
   - $\text{score}_{\text{CTC}}(Y|X) = \log P_{\text{CTC}}(Y|X) + \lambda_1 \log P_{\text{LM}}(Y) + \lambda_2 L(Y)$  
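
A two-frame toy example shows why the most probable alignment need not give the most probable output. The sketch enumerates every alignment by brute force, which is only feasible for tiny inputs. The per-frame distributions are made up, and `_` stands in for the blank symbol.

```python
from itertools import product

BLANK = "_"   # stands in for the blank symbol

def collapse(alignment, blank=BLANK):
    """CTC collapsing function B: merge repeats, then drop blanks."""
    out, prev = [], None
    for sym in alignment:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

def greedy_decode(probs):
    """Pick the best symbol at each frame independently, then collapse."""
    return collapse([max(dist, key=dist.get) for dist in probs])

def output_probabilities(probs):
    """Sum P(A|X) over all alignments A, grouped by collapsed output Y."""
    totals = {}
    for alignment in product(list(probs[0]), repeat=len(probs)):
        p = 1.0
        for dist, sym in zip(probs, alignment):
            p *= dist[sym]            # conditional independence across frames
        y = collapse(alignment)
        totals[y] = totals.get(y, 0.0) + p
    return totals

# Hypothetical softmax outputs for two frames.
probs = [{"a": 0.4, BLANK: 0.6}, {"a": 0.4, BLANK: 0.6}]
greedy = greedy_decode(probs)
totals = output_probabilities(probs)
best_output = max(totals, key=totals.get)
print(repr(greedy), totals, repr(best_output))
```

The single best alignment is blank-blank (probability 0.36), so greedy decoding returns the empty string. The three alignments `a_`, `_a`, and `aa` all collapse to "a" and together carry probability 0.64, making "a" the most probable output.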

### CTC Training
- The CTC-based ASR system uses negative log-likelihood loss over a dataset $D$, defined as:  
   - $\displaystyle L_{\text{CTC}} = -\sum_{(X, Y) \in D} \log P_{\text{CTC}}(Y|X)$  

- Computing $P_{\text{CTC}}(Y|X)$ requires summing probabilities over all alignments that collapse to $Y$:  
   - $\displaystyle P_{\text{CTC}}(Y|X) = \sum_{A \in B^{-1}(Y)} \prod_{t=1}^T p(a_t|h_t)$  
   - This can be efficiently computed using [dynamic programming and a forward-backward algorithm](https://distill.pub/2017/ctc/). 
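
The forward half of that dynamic program fits in a short function. The sketch below computes $P_{\text{CTC}}(Y|X)$ over the blank-extended label sequence and checks it against brute-force enumeration on made-up frame distributions. It shows only the forward pass and omits the backward pass needed for gradients, and `_` stands in for the blank symbol.

```python
import math
from itertools import product

BLANK = "_"   # stands in for the blank symbol

def collapse(alignment, blank=BLANK):
    out, prev = [], None
    for sym in alignment:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

def ctc_forward(probs, target, blank=BLANK):
    """P_CTC(Y|X) via the forward recursion over z = [_, y1, _, y2, ..., _]."""
    z = [blank]
    for ch in target:
        z += [ch, blank]
    S = len(z)
    alpha = [0.0] * S
    alpha[0] = probs[0][blank]            # start on the leading blank ...
    if S > 1:
        alpha[1] = probs[0][z[1]]         # ... or on the first label
    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]              # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]     # advance by one position
            # Skip over a blank only between two different labels.
            if s >= 2 and z[s] != blank and z[s] != z[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][z[s]]
        alpha = new
    # Valid alignments end on the last label or the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)

def ctc_brute_force(probs, target):
    """Reference value: enumerate all alignments (exponential in T)."""
    total = 0.0
    for alignment in product(list(probs[0]), repeat=len(probs)):
        if collapse(alignment) == target:
            p = 1.0
            for dist, sym in zip(probs, alignment):
                p *= dist[sym]
            total += p
    return total

probs = [
    {"a": 0.6, "b": 0.1, BLANK: 0.3},
    {"a": 0.2, "b": 0.5, BLANK: 0.3},
    {"a": 0.1, "b": 0.6, BLANK: 0.3},
]
p_dp = ctc_forward(probs, "ab")
p_brute = ctc_brute_force(probs, "ab")
loss = -math.log(p_dp)                    # per-example CTC loss
print(p_dp, p_brute, loss)
```

The recursion touches each (time step, position) pair once, so its cost grows as $T \times |Y|$ rather than with the number of alignments.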

### Combining CTC and Encoder-Decoder
- The encoder-decoder cross-entropy loss and CTC loss can be combined during training, weighted by a tunable parameter $λ$:  
   - $\displaystyle L = -\lambda \log P_{\text{enc-dec}}(Y|X) - (1 - \lambda) \log P_{\text{CTC}}(Y|X)$  
- ![Combining the CTC and encoder-decoder loss functions.](./images/stts/ctclm.png)
- For inference, the two model scores are combined with a language model (or length penalty), yielding the output sequence:  
   - $\displaystyle \hat{Y} = \arg\max_Y [\lambda \log P_{\text{enc-dec}}(Y|X) + (1 - \lambda) \log P_{\text{CTC}}(Y|X) + \gamma \log P_{\text{LM}}(Y)]$  

### Streaming Models: RNN-T for improving CTC
- **CTC Limitations and Streaming Advantage**: 
  - Due to its strong independence assumption, CTC models are less accurate than attention-based encoder-decoder models. 
  - However, CTC supports streaming, allowing word recognition as the user speaks, unlike attention models that require the entire input sequence to compute attention context.  

- To overcome the conditional independence limitation of CTC and incorporate output history,
  - the [RNN-Transducer (RNN-T)](https://lorenlugosch.github.io/posts/2020/11/transducer/) integrates a CTC acoustic model with a language model (predictor) that conditions on previous outputs.  
- ![The RNN-T model computing the output token distribution at time t](./images/stts/rnnt.png)

- At each time step $t$:  
  - The CTC encoder computes a hidden state $h_t^{\text{enc}}$ from the input sequence $x_1, \dots, x_t$.  
  - The predictor (language model) processes the output history $y_{<u_t}$ (excluding blanks) to produce $h_u^{\text{pred}}$.  
  - These hidden states are combined and passed through a softmax layer to predict the next character.    
     - $\displaystyle P_{\text{RNN-T}}(Y|X) = \sum_{A \in B^{-1}(Y)} P(A|X) = \sum_{A \in B^{-1}(Y)} \prod_{t=1}^T p(a_t | h_t, y_{<u_t})$

### ASR Evaluation: Word Error Rate
- **Word Error Rate (WER)** measures how much the `hypothesized word string` from a speech recognizer differs from a `reference transcription`:  
   - $\text{Word Error Rate (WER)} = 100 \times \dfrac{\text{Insertions} + \text{Substitutions} + \text{Deletions}}{\text{Total Words in Correct Transcript}}$
     - WER can exceed 100% due to insertions.

- **WER Calculation Process**:  
  - Compute the **minimum edit distance** (substitutions, insertions, deletions) between the hypothesized and reference strings.  
  - 🍎 An alignment with 6 substitutions, 3 insertions, and 1 deletion out of 13 reference words results in a WER of $\frac{6+3+1}{13} \times 100 = 76.9\%$.

- **Sentence Error Rate** computes the percentage of sentences with at least one word error, providing an additional perspective on recognition performance.

- **Evaluation Tool - [Score Lite (Sclite)](https://github.com/usnistgov/SCTK)**:  a script from NIST, automates WER computation.  
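
The WER definition above can be computed with a standard word-level minimum edit distance. This is a minimal sketch with unit cost for every edit and whitespace tokenization, both assumptions of the example. It is not a substitute for sclite, which also reports the alignment itself.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: minimum word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = fewest edits turning the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                    # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j                    # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 100 * dist[-1][-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sit on the")
print(f"{wer:.1f}%")
```

Here one substitution (sat → sit) and one deletion (mat) over six reference words give a WER of 33.3%. A hypothesis with many insertions, such as "a b c" against the reference "a", scores 200%, illustrating how WER can exceed 100%.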


### 🏃 Practice [Hugging Face Audio course](https://huggingface.co/learn/audio-course)
- [Unit 5. Automatic Speech Recognition](https://huggingface.co/learn/audio-course/chapter5/introduction)
  - 📝 1. Pre-trained models for speech recognition

In [None]:
# 1. Probing CTC Models
# CTC models such as Wav2Vec2, HuBERT and XLSR are small and fast
# but prone to phonetic spelling errors. 

# 1.1 load a small excerpt of the LibriSpeech ASR dataset
# to demonstrate Wav2Vec2’s speech transcription capabilities:
# 
from datasets import load_dataset

dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
dataset

In [None]:
# 1.2 explore one of the 73 audio samples 
from IPython.display import Audio

sample = dataset[2]

print(sample["text"])
Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])

In [None]:
# 1.3 use the official Wav2Vec2 base checkpoint fine-tuned on 100 hours of LibriSpeech data:

from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-100h", device='cuda')

In [None]:
# transcribe a sample
# find the wrong words due to the shortcoming of a CTC model:
# it is prone to phonetic spelling errors
# because it bases its prediction almost entirely on the acoustic input
pipe(sample["audio"].copy())

In [None]:
# 2. Graduation to Seq2Seq
# which support casing and punctuation
# load the Whisper Base checkpoint

import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    "automatic-speech-recognition", model="openai/whisper-base", device=device
)

In [None]:
# Transcribe the previous sample
pipe(sample["audio"], max_new_tokens=256)

In [None]:
# Try it on the Multilingual LibriSpeech (MLS) dataset
dataset = load_dataset(
    "facebook/multilingual_librispeech", "spanish", split="test", streaming=True
)
sample = next(iter(dataset))

In [None]:
# inspect the text transcription and take a listen to the audio segment:
print(sample["transcript"])
Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])

In [None]:
# pass a copy of the audio sample, so that we can re-use the same audio sample
pipe(sample["audio"].copy(), max_new_tokens=256, generate_kwargs={"task": "transcribe"})

In [None]:
# Whisper can also do translation
pipe(sample["audio"], max_new_tokens=256, generate_kwargs={"task": "translate"})

In [None]:
# 3. Long-Form Transcription and Timestamps
# Whisper is inherently designed to work with 30 second samples
# padding shorter and truncating longer

# 3.1 concatenate audio sample to 5 minutes

import numpy as np

target_length_in_m = 5

# convert from minutes to seconds (* 60) to num samples (* sampling rate)
sampling_rate = pipe.feature_extractor.sampling_rate
target_length_in_samples = target_length_in_m * 60 * sampling_rate

# iterate over our streaming dataset, concatenating samples until we hit our target
long_audio = []
for sample in dataset:
    long_audio.extend(sample["audio"]["array"])
    if len(long_audio) > target_length_in_samples:
        break

long_audio = np.asarray(long_audio)

# how did we do?
seconds = len(long_audio) / sampling_rate
minutes, seconds = divmod(seconds, 60)
print(f"Length of audio sample is {minutes} minutes {seconds:.2f} seconds")

In [None]:
# 3.2 transcribe the long audio sample with chunking and batching 
pipe(
    long_audio,
    max_new_tokens=256,
    generate_kwargs={"task": "transcribe"},
    chunk_length_s=30,
    batch_size=8,
)

In [None]:
# 3.3 Predict segment-level timestamps for the audio data
pipe(
    long_audio,
    max_new_tokens=256,
    generate_kwargs={"task": "transcribe"},
    chunk_length_s=30,
    batch_size=8,
    return_timestamps=True,
)["chunks"]

- 📝 2. [Choosing a dataset](https://huggingface.co/blog/audio-datasets#a-tour-of-audio-datasets-on-the-hub)

- 📝 3. Evaluation and metrics for speech recognition

In [None]:
# 3.1 Word Error Rate

reference = "the cat sat on the mat"
prediction = "the cat sit on the"

# WER = (S+I+D)/N = (1+0+1)/6=1/3

from evaluate import load

wer_metric = load("wer")
wer = wer_metric.compute(references=[reference], predictions=[prediction])
print(wer)

In [None]:
# 3.2 Word Accuracy
# WAcc = 1 - WER

# 3.3 Character Error Rate (CER)
# For the example in 3.1, counting letters only (spaces ignored):
# CER = (S+I+D)/N = (1+0+3)/17 = 4/17 ≈ 0.235
# If spaces are counted as characters: (1+0+4)/22 = 5/22 ≈ 0.227

# WER penalises a whole word for any error inside it, so it requires systems
# to have a greater understanding of the context of their predictions.
# It is the usual choice for word-based languages such as English.
# For character-based languages such as Chinese, CER is preferred.
# ⚠️ A Chinese character is roughly equivalent to an English word
# e.g. 木 ∽ tree，石 ∽ stone
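
Both WER and CER are edit distances divided by the reference length; they differ only in whether the units are words or characters. The following is a minimal sketch in plain Python, not the implementation used by the `evaluate` library; the helper names `edit_distance` and `error_rate` are made up for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (S + I + D)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalised by the number of reference units."""
    return edit_distance(ref_units, hyp_units) / len(ref_units)


reference = "the cat sat on the mat"
prediction = "the cat sit on the"

# word-level units give the WER
wer = error_rate(reference.split(), prediction.split())

# character-level units give the CER, with or without spaces
cer_with_spaces = error_rate(list(reference), list(prediction))
cer_no_spaces = error_rate(
    list(reference.replace(" ", "")), list(prediction.replace(" ", ""))
)

print(wer, cer_with_spaces, cer_no_spaces)
```

Whether spaces count as characters changes the CER, so it is worth checking which convention a given tool follows before comparing numbers.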

In [None]:
reference = "东方出现了第一缕曙光。"
prediction = "东方出现了弟一搂阳光"

from evaluate import load

cer_metric = load("cer")
cer = cer_metric.compute(references=[reference], predictions=[prediction])
print(cer)

In [None]:
# 3.3 orthography and normalization
# orthography - train with and predict casing and punctuation
# normalization - remove any casing and punctuation
# Wav2Vec2:  HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAUS AND ROSE BEEF LOOMING BEFORE US SIMALYIS DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
# Whisper:   He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similarly is drawn from eating and its results occur most readily to the mind.


In [None]:
# 3.4 the normaliser of Whisper
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()

prediction = " He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similarly is drawn from eating and its results occur most readily to the mind."
normalized_prediction = normalizer(prediction)

normalized_prediction

In [None]:
# Normalized WER is usually lower than orthographic WER
# It is recommended to train on orthographic text and 
#   evaluate on normalised text to get the best of both worlds.
reference = "HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND"
normalized_reference = normalizer(reference)

wer = wer_metric.compute(
    references=[normalized_reference], predictions=[normalized_prediction]
)
wer

In [None]:
# 3.5 Putting it all together
# pre-trained models, dataset selection and evaluation.

# 1) Load whisper-small
from transformers import pipeline
import torch

if torch.cuda.is_available():
    device = "cuda:0"
    torch_dtype = torch.float16
else:
    device = "cpu"
    torch_dtype = torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    torch_dtype=torch_dtype,
    device=device,
)

In [None]:
# 2) log in to Hugging Face to download the Common Voice dataset
from huggingface_hub import notebook_login
notebook_login()
# or from the command line:
# huggingface-cli login
# or set the access token in the environment:
# export HF_TOKEN="your_token_here"

# pip install soundfile librosa

from datasets import load_dataset
common_voice_test = load_dataset(
    "mozilla-foundation/common_voice_13_0", "zh-CN", split="test"
)

In [None]:
# 3) run inference over the "audio" column of the test set
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

all_predictions = []

# run streamed inference
for prediction in tqdm(
    pipe(
        KeyDataset(common_voice_test, "audio"),
        max_new_tokens=128,
        generate_kwargs={"task": "transcribe"},
        batch_size=32,
    ),
    total=len(common_voice_test),
):
    all_predictions.append(prediction["text"])

In [None]:
# 4) compute the baseline CER without normalization
from evaluate import load

cer_metric = load("cer")

cer_ortho = 100 * cer_metric.compute(
    references=common_voice_test["sentence"], predictions=all_predictions
)
cer_ortho

In [None]:
# 5) compute the normalized CER
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()

# compute normalised CER
all_predictions_norm = [normalizer(pred) for pred in all_predictions]
all_references_norm = [normalizer(label) for label in common_voice_test["sentence"]]

# filtering step to only evaluate the samples that correspond to non-zero references
all_predictions_norm = [
    all_predictions_norm[i]
    for i in range(len(all_predictions_norm))
    if len(all_references_norm[i]) > 0
]
all_references_norm = [
    all_references_norm[i]
    for i in range(len(all_references_norm))
    if len(all_references_norm[i]) > 0
]

cer = 100 * cer_metric.compute(
    references=all_references_norm, predictions=all_predictions_norm
)

cer

- 📝 4. How to fine-tune an ASR system with the Trainer API

In [None]:
# 1. Linking the notebook to the Hub
from huggingface_hub import notebook_login

notebook_login()
# or in a terminal
# export HF_TOKEN="your_token_here"

In [None]:
# 2. Load Dataset
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset(
    "mozilla-foundation/common_voice_13_0", "zh-CN", split="train+validation"
)
common_voice["test"] = load_dataset(
    "mozilla-foundation/common_voice_13_0", "zh-CN", split="test"
)

print(common_voice)

In [None]:
# select needed columns
common_voice = common_voice.select_columns(["audio", "sentence"])

In [None]:
# 3. Feature Extractor, Tokenizer and Processor
# the Whisper model has an associated feature extractor and tokenizer, 
# called WhisperFeatureExtractor and WhisperTokenizer respectively.
# these two objects are wrapped under a single class, called the WhisperProcessor

# 1) see all possible languages supported by Whisper
from transformers.models.whisper.tokenization_whisper import TO_LANGUAGE_CODE

TO_LANGUAGE_CODE

In [None]:
# 2) load our processor from the pre-trained checkpoint
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="chinese", task="transcribe"
)

In [None]:
# 3) Pre-Process the Data
# Pay particular attention to the "audio" column

common_voice["train"].features

In [None]:
# resample audio samples on-the-fly
# change the sampling rate to 16kHz expected by the Whisper model.
from datasets import Audio

sampling_rate = processor.feature_extractor.sampling_rate
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=sampling_rate))

In [None]:
# 4. write a function to prepare our data ready for the model:
# a. We load and resample the audio data on a sample-by-sample basis 
#   by calling sample["audio"]. As explained above, 🤗 Datasets performs 
#   any necessary resampling operations on the fly.
# b. We use the feature extractor to compute the log-mel spectrogram 
#   input features from our 1-dimensional audio array.
# c. We encode the transcriptions to label ids through the use of the tokenizer.

def prepare_dataset(example):
    audio = example["audio"]

    example = processor(
        audio=audio["array"],
        sampling_rate=audio["sampling_rate"],
        text=example["sentence"],
    )

    # compute input length of audio sample in seconds
    example["input_length"] = len(audio["array"]) / audio["sampling_rate"]

    return example

# 4.2 apply the data preparation function to all of our training examples
common_voice = common_voice.map(
    prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1
)

In [None]:
# 4.3 filter any training data with audio samples longer than 30s
max_input_length = 30.0

def is_audio_in_length_range(length):
    return length < max_input_length

# apply our filter function to all samples of our training dataset 
common_voice["train"] = common_voice["train"].filter(
    is_audio_in_length_range,
    input_columns=["input_length"],
)

In [None]:
# check how much training data remains after filtering
common_voice["train"]

In [None]:
# 5. Training and Evaluation
# 5.1 Define a data collator
# it batches the examples by padding the audio features with the feature extractor
# and the label ids with the tokenizer:

import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [
            {"input_features": feature["input_features"][0]} for feature in features
        ]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # if a bos token was prepended in the previous tokenization step,
        # cut it here since it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch
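
The label handling in the collator can be illustrated without any tensors. The sketch below pads label sequences to a common length and then replaces the padded positions with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so that padding does not contribute to the loss. The function `pad_and_mask` and the pad id `0` are assumptions made for this illustration only.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def pad_and_mask(label_seqs, pad_id):
    """Pad label id lists to equal length and build a loss-masked copy."""
    max_len = max(len(seq) for seq in label_seqs)
    padded, masked = [], []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)        # what the tokenizer's pad() yields
        masked.append(seq + [IGNORE_INDEX] * n_pad)  # what is passed to the loss
    return padded, masked


padded, masked = pad_and_mask([[5, 6, 7], [8]], pad_id=0)
print(padded)
print(masked)
```

The collator above reaches the same result with `masked_fill` on the attention mask, which also works when the pad id coincides with a real token id.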

In [None]:
#  initialise the data collator 
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

In [None]:
# 5.2 Evaluation metrics
import evaluate

metric = evaluate.load("cer")

In [None]:
# define a function that takes our model predictions and returns the CER metric.
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()


def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)

    # compute orthographic cer
    cer_ortho = 100 * metric.compute(predictions=pred_str, references=label_str)

    # compute normalised CER
    pred_str_norm = [normalizer(pred) for pred in pred_str]
    label_str_norm = [normalizer(label) for label in label_str]
    # filtering step to only evaluate the samples that correspond to non-zero references:
    pred_str_norm = [
        pred_str_norm[i] for i in range(len(pred_str_norm)) if len(label_str_norm[i]) > 0
    ]
    label_str_norm = [
        label_str_norm[i]
        for i in range(len(label_str_norm))
        if len(label_str_norm[i]) > 0
    ]

    cer = 100 * metric.compute(predictions=pred_str_norm, references=label_str_norm)

    return {"cer_ortho": cer_ortho, "cer": cer}

In [None]:
# 5.3 Load a pre-trained checkpoint
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

In [None]:
# adjust model parameters
from functools import partial

# disable cache during training since it's incompatible with gradient checkpointing
model.config.use_cache = False

# set language and task for generation and re-enable cache
model.generate = partial(
    model.generate, language="chinese", task="transcribe", use_cache=True
)

In [None]:
# 5.4 Define the training arguments
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./",  # save locally
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=50,
    max_steps=500,  # increase to 4000 if you have your own GPU or a Colab paid plan
    gradient_checkpointing=True,
    fp16=True,
    fp16_full_eval=True,
    eval_strategy="steps",
    per_device_eval_batch_size=16,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=500,
    eval_steps=500,
    logging_steps=25,
    # report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="cer",
    greater_is_better=False,
    push_to_hub=False,
)
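
The comment on `gradient_accumulation_steps` follows from how the effective batch size is formed: gradients from several small batches are summed before one optimizer step. A small sketch of that arithmetic, assuming a single device by default (`num_devices` is a parameter added for this illustration):

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# halving the per-device batch size while doubling the accumulation steps
# leaves the effective batch size unchanged
print(effective_batch_size(16, 1))
print(effective_batch_size(8, 2))
print(effective_batch_size(4, 4))
```

If a batch size of 16 does not fit in GPU memory, the smaller settings trade memory for more forward and backward passes per update.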

In [None]:
# 5.5 forward the training arguments to the 🤗 Trainer along with 
# our model, dataset, data collator and compute_metrics function:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor,
)

In [None]:
# launch training
trainer.train()

In [None]:
# use the model
from transformers import pipeline
# update with your model id
pipe = pipeline("automatic-speech-recognition", model="./", device='cuda')

In [None]:
# 6. Build a demo with Gradio
# 6.1 load the model
from transformers import pipeline

model_id = "./"  # update with your model id
pipe = pipeline("automatic-speech-recognition", model=model_id, device='cuda')

# 6.2 define the serving function
def transcribe_speech(filepath):
    output = pipe(
        filepath,
        max_new_tokens=256,
        generate_kwargs={
            "task": "transcribe",
            "language": "chinese",
        },  # update with the language you've fine-tuned on
        chunk_length_s=30,
        batch_size=8,
    )
    return output["text"]

In [None]:
# 6.3 use the Gradio blocks feature to launch two tabs on our demo:
#  one for microphone transcription, and the other for file upload.

import gradio as gr

demo = gr.Blocks()

mic_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs=gr.Textbox(),  # gr.outputs.* is not available in recent Gradio releases
)

file_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources=["upload"], type="filepath"),
    outputs=gr.Textbox(),
)


In [None]:
# 6.4 launch the Gradio demo 
with demo:
    gr.TabbedInterface(
        [mic_transcribe, file_transcribe],
        ["Transcribe Microphone", "Transcribe Audio File"],
    )

demo.launch(debug=True)

## TTS

- `Text-to-speech (TTS)` systems convert text into audio waveforms, 
  - useful for dialogue systems, games, and education.  
- Unlike automatic speech recognition (ASR) systems, TTS systems are typically `speaker-dependent`: 
  - they are trained on less data, but all of it comes from `a single voice`, 
  - such as the LJ speech corpus, which contains 24 hours of one speaker (Linda Johnson).  
- TTS involves 
  - (1) an `encoder-decoder` model to map text to mel spectrograms 
  - (2) a `vocoder` to convert mel spectrograms into waveforms.  
- TTS algorithms are computationally intensive, driving research on optimization and acceleration.  

### TTS Preprocessing: Text normalization
- TTS systems must verbalize `non-standard words` differently depending on context. 
  - The `verbalization (spoken form)` has to match the meaning, which depends on the token's `semiotic class`,
  - e.g., numbers, dates, monetary amounts, abbreviations:
    - "1750" is read as *seventeen fifty* in a year 
    - but as *one seven five zero* in a password.
    - "$3.2 billion": *three point two billion dollars*
    - "N.Y.": *New York*
  - Grammatical properties in some languages further affect normalization rules.
    - e.g., gender in French or case in German

- **Common Semiotic Classes:** 
  - cardinal/ordinal numbers, dates, times, monetary values, percentages, abbreviations, and acronyms, 
  - each requiring specific `verbalization strategies`.  

- **Two Normalization Methods:** 
  - `Rule-based systems` use tokenization and verbalization rules, like [Kestrel](https://doi.org/10.1017/S1351324914000175), which classifies and then verbalizes the input, 
    - but are brittle and require maintenance.  
  - `Encoder-Decoder Models` treat normalization as a translation task, mapping text to verbalized output, 
    - but require labeled training data for accurate performance. 
    - While effective, these models can produce erratic errors, 
      - such as misinterpreting "45 minutes" as "forty-five meters."  
    - `Lightweight covering grammars` may be used to constrain decoding and reduce normalization errors.  
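
The rule-based approach can be sketched as a verbalization function selected by semiotic class. This toy example handles only four-digit tokens and two classes; the function names, the class labels `"year"` and `"digits"`, and the rules themselves are assumptions for illustration and are far simpler than a production system such as Kestrel.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def two_digits(n):
    """Verbalize an integer in 0..99 as a cardinal number."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def as_year(token):
    """Read a four-digit year as two pairs, e.g. 1750 -> seventeen fifty."""
    first, second = int(token[:2]), int(token[2:])
    if second == 0:
        if first % 10 == 0:
            return f"{ONES[first // 10]} thousand"
        return f"{two_digits(first)} hundred"
    if second < 10:
        return f"{two_digits(first)} oh {ONES[second]}"
    return f"{two_digits(first)} {two_digits(second)}"


def as_digits(token):
    """Read a token digit by digit, e.g. in a password or PIN."""
    return " ".join(ONES[int(ch)] for ch in token)


def verbalize(token, semiotic_class):
    """Pick the verbalization rule that matches the semiotic class."""
    if not re.fullmatch(r"\d{4}", token):
        raise ValueError("this sketch only handles four-digit tokens")
    if semiotic_class == "year":
        return as_year(token)
    if semiotic_class == "digits":
        return as_digits(token)
    raise ValueError(f"unsupported semiotic class: {semiotic_class}")


print(verbalize("1750", "year"))
print(verbalize("1750", "digits"))
```

Even this small example shows why rule-based systems need maintenance: every new class or exception (for instance, years such as 2005) requires another hand-written rule.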

### TTS: Spectrogram prediction
- [Tacotron2](https://huggingface.co/papers/1712.05884) is a TTS architecture that builds on [Tacotron](https://huggingface.co/papers/1703.10135) and [WaveNet](https://paperswithcode.com/paper/wavenet-a-generative-model-for-raw-audio).
- ![The Tacotron2 architecture](./images/stts/taco.png)
- It uses an `encoder-decoder` with attention to map graphemes to mel spectrograms, followed by a `vocoder` to generate waveforms.  
  - The `encoder` maps input graphemes to hidden representations using 512-dimensional embeddings, 
    - convolutional layers that model letter context, and a biLSTM that produces the final encoding.  
  - The `decoder` takes its mel spectrum prediction from the previous step, 
    - passes it through a pre-net, 
    - combines it with the attention context, 
    - and processes the result through LSTM layers and a linear layer to predict an 80-dimensional log-mel filterbank vector;
    - the LSTM output also goes through another linear layer and a sigmoid to make a “stop token prediction” decision.
- Tacotron2 is trained with `teacher forcing`: at each step the decoder is fed the gold-standard mel features rather than its own previous prediction.   

### TTS: Vocoding
- The `vocoder`, based on WaveNet, converts log-mel spectrograms into time-domain waveforms,
  - producing 8-bit mu-law compressed audio samples.  
- `WaveNet` uses an autoregressive model with dilated convolutions, 
  - enabling predictions based only on past inputs while increasing the receptive field exponentially with depth.  
  - ![Dilated convolutions](./images/stts/dicon.png)
  - Tacotron2 employs 12 convolutional layers in two cycles, 
    - with dilation values of 1, 2, 4, 8, 16, and 32, 
    - to model long-range dependencies effectively.  
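
The effect of the dilation schedule can be checked with a little arithmetic: each causal layer with kernel size k and dilation d extends the receptive field by (k - 1) * d samples. The sketch below assumes a kernel size of 2, which is an assumption made here for illustration and is not stated in the notes above.

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field (in samples) of a stack of dilated causal convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)


one_cycle = [1, 2, 4, 8, 16, 32]
two_cycles = one_cycle * 2  # 12 layers in two cycles

print(receptive_field(one_cycle))
print(receptive_field(two_cycles))
print(receptive_field([1] * 12))  # same depth without dilation
```

With dilation doubling at each layer, the receptive field of a cycle grows exponentially with its depth, whereas an undilated stack of the same depth grows only linearly.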

- WaveNet predicts audio samples as 8-bit values using a 256-way categorical classifier, 
  - with outputs processed via softmax for each sample.  
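
Mu-law companding is what makes 256 classes workable: it compresses each sample logarithmically, F(x) = sgn(x) ln(1 + mu|x|) / ln(1 + mu) with mu = 255, so that quiet samples get finer resolution than loud ones. A minimal sketch with the standard library, assuming samples already scaled to [-1, 1]; the function names are chosen for this example.

```python
import math

MU = 255  # 8-bit mu-law: 256 quantization levels


def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((compressed + 1) / 2 * mu))


def mu_law_decode(index, mu=MU):
    """Map an integer class in [0, mu] back to a sample in [-1, 1]."""
    compressed = 2 * index / mu - 1
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)


for sample in (-1.0, -0.01, 0.01, 0.5, 1.0):
    index = mu_law_encode(sample)
    print(sample, index, round(mu_law_decode(index), 4))
```

The integer class index is the target of the 256-way classifier; decoding the predicted index recovers an approximate sample value.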

- The spectrogram predictor and vocoder are `trained separately`, 
  - with the vocoder trained using ground truth-aligned spectral features and audio output.  

- **Challenges in WaveNet:** 
  - Predicting 8-bit audio values does not work as well as predicting 16-bit values, 
  - but a simple softmax is insufficient for 16-bit output, so decoders use more advanced final layers such as mixtures of distributions to generate higher-quality audio.  

- **Efficiency Improvements:** 
  - Non-autoregressive generation techniques are explored to reduce latency, 
  - enabling faster and more practical audio synthesis.  

- **Additional Features:** 
  - WaveNet also uses gated activation functions 
  - together with residual and skip connections.  

### TTS Evaluation
- Speech synthesis systems are primarily `evaluated by human listeners`, 
  - since an effective automatic metric for evaluation remains an open research challenge.  
- Listeners rate synthesized utterances on a scale (typically 1–5), and the ratings are averaged into a `Mean Opinion Score (MOS)`; 
  - differences between systems are often tested for significance using statistical methods like [paired t-tests](https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/paired-sample-t-test/).  
- Systems can also be compared with [A/B tests](https://en.wikipedia.org/wiki/A/B_testing), where listeners choose their preferred version of the same sentence, with results aggregated across multiple samples.  
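
The MOS and the paired t-test can be computed with the standard library alone. In the sketch below the ratings are invented purely for illustration (the same eight hypothetical sentences rated for two hypothetical systems), and the helper names are made up for this example.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Average listener rating, typically on a 1-5 scale."""
    return statistics.mean(ratings)


def paired_t_statistic(ratings_a, ratings_b):
    """t statistic and degrees of freedom for paired ratings of two systems."""
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# hypothetical ratings of the same 8 sentences, one rating per sentence
ratings_a = [4, 5, 4, 3, 5, 4, 4, 5]
ratings_b = [3, 4, 4, 3, 4, 3, 4, 4]

mos_a = mean_opinion_score(ratings_a)
mos_b = mean_opinion_score(ratings_b)
t_stat, dof = paired_t_statistic(ratings_a, ratings_b)
print(mos_a, mos_b, round(t_stat, 3), dof)
```

The t statistic is then compared against the t distribution with the given degrees of freedom to decide whether the MOS difference is significant.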

## Other Speech Tasks
- **Wake Word Detection:** Identifies specific words/phrases to activate voice assistants while ensuring privacy by minimizing transmitted speech; 
  - uses ASR-like feature extraction and operates on embedded devices for efficiency.  

- **Speaker Diarization:** Segments audio to determine "who spoke when," 
  - employing voice activity detection (VAD), speaker embeddings, and clustering; 
  - useful for meeting transcription and medical interactions, with recent advancements focusing on end-to-end approaches.  

- **Speaker Recognition:** Identifies individuals through their voice, encompassing 
  - speaker verification: binary decision for authentication,
  - speaker identification: matching against a database.  

- **Language Identification:** Determines the spoken language in audio files, 
  - aiding in tasks like routing callers to language-specific operators or services.  
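
The difference between speaker verification and speaker identification can be sketched with cosine similarity over speaker embeddings. The three-dimensional vectors, the speaker names, and the threshold of 0.7 below are all hypothetical; real systems use embeddings such as x-vectors with hundreds of dimensions and a threshold tuned on held-out data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def verify(embedding, claimed_embedding, threshold=0.7):
    """Speaker verification: binary accept/reject for a claimed identity."""
    return cosine_similarity(embedding, claimed_embedding) >= threshold


def identify(embedding, enrolled):
    """Speaker identification: best match among enrolled speakers."""
    return max(enrolled, key=lambda name: cosine_similarity(embedding, enrolled[name]))


# hypothetical enrolled speakers and an incoming utterance embedding
enrolled = {"alice": [1.0, 0.0, 0.2], "bob": [0.0, 1.0, 0.1]}
probe = [0.9, 0.1, 0.2]

print(identify(probe, enrolled))
print(verify(probe, enrolled["alice"]), verify(probe, enrolled["bob"]))
```

Verification compares against a single claimed identity, while identification searches the whole enrolled database.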

### 🏃 [Unit 6. From text to speech](https://huggingface.co/learn/audio-course/chapter6/introduction)
- 📝 [Text-to-speech datasets](https://huggingface.co/learn/audio-course/chapter6/tts_datasets)

- 📝 [Pre-trained models for text-to-speech](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models)

In [None]:
# 1. SpeechT5
# 1.1 load the processor and model
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# 1.2 tokenize the input text
inputs = processor(text="Don't count the days, make the days count.", return_tensors="pt")

# 1.3 load X-vector of speaker embeddings
from datasets import load_dataset

embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")

import torch

speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

In [None]:
# 1.4 generate a log mel spectrogram
spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)

In [None]:
# 1.5 convert spectrogram to speech wave with a vocoder
from transformers import SpeechT5HifiGan

vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# play the speech
from IPython.display import Audio

Audio(speech, rate=16000)

In [None]:
# 2. Bark
# 2.1 load the model and its processor.
from transformers import BarkModel, BarkProcessor

model = BarkModel.from_pretrained("suno/bark-small")
sampling_rate = model.generation_config.sample_rate

processor = BarkProcessor.from_pretrained("suno/bark-small")

# add a speaker embedding
inputs = processor("This is a test!", voice_preset="v2/en_speaker_3")
speech_output = model.generate(**inputs).cpu().numpy()

Audio(speech_output, rate=sampling_rate)

In [None]:
# generate ready-to-use multilingual speeches
# try it in Chinese, let's also add a Chinese speaker embedding
inputs = processor("真是太美了!", voice_preset="v2/zh_speaker_1")

speech_output = model.generate(**inputs).cpu().numpy()
Audio(speech_output, rate=sampling_rate)

In [None]:
# generate non-verbal communications such as laughing, sighing and crying. 
inputs = processor(
    "[clears throat] This is a test ... and I just took a long pause.",
    voice_preset="v2/zh_speaker_1",
)

speech_output = model.generate(**inputs).cpu().numpy()
Audio(speech_output, rate=sampling_rate)

In [None]:
# generate music by adding ♪ musical notes ♪ around your words.
inputs = processor(
    "♪ In the mighty jungle, I'm trying to generate barks.",
)

speech_output = model.generate(**inputs).cpu().numpy()
Audio(speech_output, rate=sampling_rate)

In [None]:
# Bark supports batch processing,
input_list = [
    "[clears throat] Hello uh ..., my dog is cute [laughter]",
    "Let's try generating speech, with Bark, a text-to-speech model",
    "♪ In the jungle, the mighty jungle, the lion barks tonight ♪",
]

# also add a speaker embedding
inputs = processor(input_list, voice_preset="v2/en_speaker_3")

speech_output = model.generate(**inputs).cpu().numpy()

In [None]:
# listen to the outputs one by one.
# a cell only renders its last expression, so call display() for each output
from IPython.display import Audio, display

sampling_rate = model.generation_config.sample_rate

# first one
display(Audio(speech_output[0], rate=sampling_rate))

# second
display(Audio(speech_output[1], rate=sampling_rate))

# third
display(Audio(speech_output[2], rate=sampling_rate))

In [None]:
# 3. Massive Multilingual Speech (MMS)
# it can synthesize speech in over 1,100 languages.
!pip install git+https://github.com/huggingface/transformers.git

# 3.1 load the model and tokenizer
from transformers import VitsModel, VitsTokenizer

model = VitsModel.from_pretrained("facebook/mms-tts-deu")
tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-deu")

text_example = (
    "Ich bin Schnappi das kleine Krokodil, komm aus Ägypten das liegt direkt am Nil."
)

In [None]:
# 3.2 generate waveform output
import torch

inputs = tokenizer(text_example, return_tensors="pt")
input_ids = inputs["input_ids"]


with torch.no_grad():
    outputs = model(input_ids)

speech = outputs["waveform"]

In [None]:
# play
from IPython.display import Audio

Audio(speech, rate=16000)

- 📝 [Fine-tuning SpeechT5](https://huggingface.co/learn/audio-course/chapter6/fine-tuning)

In [None]:
# this part is very time consuming, GPU is required
!nvidia-smi

# install required libraries
!pip install transformers datasets soundfile speechbrain accelerate

In [None]:
# 1. load the dataset
#  VoxPopuli is a large-scale multilingual speech corpus consisting of data 
# sourced from 2009-2020 European Parliament event recordings. 
# It contains labelled audio-transcription data for 15 European languages. 
# We will be using the Dutch language subset.
from datasets import load_dataset, Audio

dataset = load_dataset("facebook/voxpopuli", "nl", split="train")
len(dataset)

In [None]:
# SpeechT5 expects audio data to have a sampling rate of 16 kHz
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

In [None]:
# 2. Preprocessing the data
# 2.1 loading the appropriate processor that contains both tokenizer and feature extractor
from transformers import SpeechT5Processor

checkpoint = "microsoft/speecht5_tts"
processor = SpeechT5Processor.from_pretrained(checkpoint)

In [None]:
# 2.2 get the tokenizer
tokenizer = processor.tokenizer

# investigate a sample
# the SpeechT5 tokenizer doesn’t have any tokens for numbers,
# so we use `normalized_text`, which writes the numbers out as text.
dataset[0]

In [None]:
# 2.3 Extract all characters
# SpeechT5 was trained on the English language, 
# it may not recognize certain characters in the Dutch dataset
def extract_all_chars(batch):
    all_text = " ".join(batch["normalized_text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}


vocabs = dataset.map(
    extract_all_chars,
    batched=True,
    batch_size=-1,
    keep_in_memory=True,
    remove_columns=dataset.column_names,
)

dataset_vocab = set(vocabs["vocab"][0])
tokenizer_vocab = {k for k, _ in tokenizer.get_vocab().items()}

In [None]:
# characters specific to Dutch
dataset_vocab - tokenizer_vocab

In [None]:
# 2.4 define a function that maps these characters to valid tokens
replacements = [
    ("à", "a"),
    ("ç", "c"),
    ("è", "e"),
    ("ë", "e"),
    ("í", "i"),
    ("ï", "i"),
    ("ö", "o"),
    ("ü", "u"),
]


def cleanup_text(inputs):
    for src, dst in replacements:
        inputs["normalized_text"] = inputs["normalized_text"].replace(src, dst)
    return inputs


dataset = dataset.map(cleanup_text)

In [None]:
# 3. Speakers
# The VoxPopuli dataset includes speech from multiple speakers
# 3.1 count the number of unique speakers 
# and the number of examples each speaker contributes to the dataset.
from collections import defaultdict

speaker_counts = defaultdict(int)

for speaker_id in dataset["speaker_id"]:
    speaker_counts[speaker_id] += 1

In [None]:
# total number of speakers, speakers with fewer than 100 samples, speakers with at least 500 samples
(
    len(speaker_counts),
    sum(count < 100 for count in speaker_counts.values()),
    sum(count >= 500 for count in speaker_counts.values()),
)

In [None]:
# the distribution of speakers and examples in the data.
import matplotlib.pyplot as plt

plt.figure()
plt.hist(speaker_counts.values(), bins=20)
plt.ylabel("Speakers")
plt.xlabel("Examples")
plt.show()

In [None]:
# To improve training efficiency and balance the dataset, 
# we can limit the data to speakers with between 100 and 400 examples.
def select_speaker(speaker_id):
    return 100 <= speaker_counts[speaker_id] <= 400


dataset = dataset.filter(select_speaker, input_columns=["speaker_id"])

In [None]:
# number of remaining speakers, number of remaining samples
len(set(dataset["speaker_id"])), len(dataset)

In [None]:
# 4. Speaker embeddings
# To enable the TTS model to differentiate between multiple speakers, 
# you’ll need to create a speaker embedding for each example. 
# e.g. use the pre-trained spkrec-xvect-voxceleb model from SpeechBrain.
# For optimal results, train an X-vector model on the target speech.
import os
import torch
from speechbrain.pretrained import EncoderClassifier

spk_model_name = "speechbrain/spkrec-xvect-voxceleb"

device = "cuda" if torch.cuda.is_available() else "cpu"
speaker_model = EncoderClassifier.from_hparams(
    source=spk_model_name,
    run_opts={"device": device},
    savedir=os.path.join("/tmp", spk_model_name),
)


def create_speaker_embedding(waveform):
    with torch.no_grad():
        speaker_embeddings = speaker_model.encode_batch(torch.tensor(waveform))
        speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)
        speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy()
    return speaker_embeddings

In [None]:
# 5. Processing the dataset
# process the data into the format the model expects
def prepare_dataset(example):
    audio = example["audio"]

    example = processor(
        text=example["normalized_text"],
        audio_target=audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_attention_mask=False,
    )

    # strip off the batch dimension
    example["labels"] = example["labels"][0]

    # use SpeechBrain to obtain x-vector
    example["speaker_embeddings"] = create_speaker_embedding(audio["array"])

    return example

In [None]:
# test the processing on a single example
processed_example = prepare_dataset(dataset[0])
list(processed_example.keys())

In [None]:
# Speaker embeddings should be a 512-element vector:
processed_example["speaker_embeddings"].shape

In [None]:
# The labels should be a log-mel spectrogram with 80 mel bins.
import matplotlib.pyplot as plt

plt.figure()
plt.imshow(processed_example["labels"].T)
plt.show()

In [None]:
# 5.2 apply the processing function to the entire dataset.
dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)

In [None]:
# 5.3 remove examples longer than the maximum input length the model can handle (600 tokens)
# to allow larger batch sizes, we remove anything over 200 tokens here
def is_not_too_long(input_ids):
    input_length = len(input_ids)
    return input_length < 200


dataset = dataset.filter(is_not_too_long, input_columns=["input_ids"])
len(dataset)

In [None]:
# create a basic train/test split:
dataset = dataset.train_test_split(test_size=0.1)

In [None]:
# 6. Data collator
# combine multiple examples into a batch with a custom data collator.
# This collator will pad shorter sequences with padding tokens
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class TTSDataCollatorWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        input_ids = [{"input_ids": feature["input_ids"]} for feature in features]
        label_features = [{"input_values": feature["labels"]} for feature in features]
        speaker_features = [feature["speaker_embeddings"] for feature in features]

        # collate the inputs and targets into a batch
        # (use the collator's own processor rather than a global variable)
        batch = self.processor.pad(
            input_ids=input_ids, labels=label_features, return_tensors="pt"
        )

        # replace padding with -100 to ignore loss correctly
        batch["labels"] = batch["labels"].masked_fill(
            batch.decoder_attention_mask.unsqueeze(-1).ne(1), -100
        )

        # not used during fine-tuning
        del batch["decoder_attention_mask"]

        # round down target lengths to multiple of reduction factor
        if model.config.reduction_factor > 1:
            target_lengths = torch.tensor(
                [len(feature["input_values"]) for feature in label_features]
            )
            target_lengths = target_lengths.new(
                [
                    length - length % model.config.reduction_factor
                    for length in target_lengths
                ]
            )
            max_length = max(target_lengths)
            batch["labels"] = batch["labels"][:, :max_length]

        # also add in the speaker embeddings
        batch["speaker_embeddings"] = torch.tensor(speaker_features)

        return batch
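
The two label-handling steps in the collator (masking padded frames with -100, then trimming to a multiple of the reduction factor) can be sketched in plain Python. This is an illustration under simplifying assumptions, not the collator itself: `round_down` and `pad_and_trim` are hypothetical helpers, and a single float stands in for each 80-bin mel frame.

```python
PAD_LABEL = -100  # value the loss ignores, as in the collator above


def round_down(length, reduction_factor):
    """Largest multiple of reduction_factor that is <= length."""
    return length - length % reduction_factor


def pad_and_trim(frame_counts, reduction_factor):
    """Pad every label row to the longest one, then trim to the rounded maximum.

    Hypothetical helper: each frame is a single float here instead of an
    80-bin mel vector.
    """
    rounded = [round_down(n, reduction_factor) for n in frame_counts]
    max_length = max(rounded)
    longest = max(frame_counts)
    batch = []
    for n in frame_counts:
        row = [1.0] * n + [PAD_LABEL] * (longest - n)  # pad with the ignore value
        batch.append(row[:max_length])  # same slicing as batch["labels"][:, :max_length]
    return rounded, batch


# Three spectrograms with 7, 9 and 5 frames and a reduction factor of 2.
rounded, batch = pad_and_trim([7, 9, 5], reduction_factor=2)
print(rounded)
print([len(row) for row in batch])
```

With these inputs the longest target has 9 frames, so every row is cut to 8 frames and the last frame of the longest example is discarded.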

In [None]:
# instantiate a data collator
data_collator = TTSDataCollatorWithPadding(processor=processor)

In [None]:
# 7. Train the model
# 7.1 load the pretrained model freshly for fine-tuning
from transformers import SpeechT5ForTextToSpeech

model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)

In [None]:
from functools import partial

# disable cache during training since it's incompatible with gradient checkpointing
model.config.use_cache = False

# re-enable the cache for generation, where gradient checkpointing is not used
model.generate = partial(model.generate, use_cache=True)
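
The `partial` trick above can be seen in isolation with a stand-in object. This is only a sketch: `ToyModel` is a hypothetical class, and its `use_cache` attribute plays the role of `model.config.use_cache`.

```python
from functools import partial


class ToyModel:
    """Hypothetical stand-in for a model whose generate() reads a config flag."""

    def __init__(self):
        self.use_cache = False  # plays the role of model.config.use_cache

    def generate(self, prompt, use_cache=None):
        # Fall back to the config value when the caller does not pass one.
        if use_cache is None:
            use_cache = self.use_cache
        return {"prompt": prompt, "use_cache": use_cache}


toy = ToyModel()
before = toy.generate("hi")  # uses the config value

# Pre-bind use_cache=True; the instance attribute shadows the class method.
toy.generate = partial(toy.generate, use_cache=True)
after = toy.generate("hi")  # the pre-bound keyword applies

# A keyword passed at call time still overrides the pre-bound one.
overridden = toy.generate("hi", use_cache=False)

print(before["use_cache"], after["use_cache"], overridden["use_cache"])
```

The config flag stays `False` for training, while calls to `generate` default to using the cache.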

In [None]:
# 7.2 Define the training arguments.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./",  # change to a repo name of your choice
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",  # renamed to eval_strategy in newer transformers releases
    per_device_eval_batch_size=2,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    # report_to=["tensorboard"],
    load_best_model_at_end=True,
    greater_is_better=False,
    label_names=["labels"],
    push_to_hub=False,
)
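
The arithmetic implied by these arguments can be worked out directly. The sketch below copies the values from the cell above; `num_devices = 1` is an assumption (a single GPU), and the variable names are illustrative.

```python
# Values copied from the training arguments above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
max_steps = 4000
warmup_steps = 500
eval_steps = 1000

num_devices = 1  # assumption: training on a single GPU

# Gradients are accumulated over 8 mini-batches before each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Number of training examples processed over the whole run.
examples_seen = effective_batch_size * max_steps

# Share of the run spent warming up the learning rate.
warmup_fraction = warmup_steps / max_steps

# Evaluation (and checkpointing, since save_steps == eval_steps) happens this often.
num_evaluations = max_steps // eval_steps

print(effective_batch_size, examples_seen, warmup_fraction, num_evaluations)
```

If memory is tight, halving `per_device_train_batch_size` and doubling `gradient_accumulation_steps` keeps the effective batch size unchanged.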

In [None]:
# 7.3 Instantiate the Trainer object and pass the model, dataset, and data collator to it.
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    tokenizer=processor,  # newer transformers releases name this argument processing_class
)

In [None]:
# 7.4 launch the training
trainer.train()

In [None]:
# 7.5 Inference
model = SpeechT5ForTextToSpeech.from_pretrained(
    "./"  # replace with the path to your local checkpoint directory
)

# pick an example
example = dataset["test"][304]
speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)

# Define some input text and tokenize it.
text = "hallo allemaal, ik praat nederlands. groetjes aan iedereen!"

# Preprocess the input text:
inputs = processor(text=text, return_tensors="pt")

# Instantiate a vocoder and generate speech:
from transformers import SpeechT5HifiGan

vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# listen to the result
from IPython.display import Audio
Audio(speech.numpy(), rate=16000)
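
To keep the generated audio outside the notebook, the waveform can be written to a 16 kHz WAV file. The sketch below uses only the standard library: `save_wav` is a hypothetical helper, and a short sine tone stands in for the model output (with the real result, something like `speech.numpy().tolist()` would be passed instead).

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16000  # same rate as passed to Audio(...) above


def save_wav(samples, path, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1, 1] as 16-bit mono PCM (hypothetical helper)."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overflow
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Placeholder signal: half a second of a 440 Hz tone instead of real speech.
samples = [
    0.3 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
    for t in range(SAMPLE_RATE // 2)
]

wav_path = os.path.join(tempfile.gettempdir(), "tts_output.wav")
save_wav(samples, wav_path)

# Read the file back to check its duration.
with wave.open(wav_path, "rb") as wav_file:
    duration_seconds = wav_file.getnframes() / wav_file.getframerate()
print(duration_seconds)
```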

- 📝 [Evaluating text-to-speech models](https://huggingface.co/learn/audio-course/chapter6/evaluation)