<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/nlp/blob/main/srs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/nlp/blob/main/srs.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>

# Automatic Speech Recognition and Text-to-Speech

📝 SALP chapter 16

## Introduction
- An early known `automatic speech recognition (ASR)` example is [Radio Rex](https://en.wikipedia.org/wiki/Virtual_assistant), which appeared in the 1920s, 
  - a toy responding to specific sounds like the `vowel [eh]` in "Rex."  
- Despite limitations in diverse environments, **modern ASR** `converts speech waveforms into text` and is widely used in
  - smart appliances, personal assistants, call routing, transcription, and assisting individuals with disabilities.  
- [Wolfgang von Kempelen’s late 18th-century speech synthesizer](https://en.wikipedia.org/wiki/Wolfgang_von_Kempelen%27s_speaking_machine), built from mechanical components, was an early forerunner of `text-to-speech (TTS)` systems.  
  - **Modern TTS** maps `text to audio waveforms`, aiding communication for visually impaired users and individuals with neurological disorders.  
- ASR and TTS share core algorithmic principles:
  - encoder-decoder models, 
  - Connectionist Temporal Classification (CTC) loss functions, 
  - word error rate evaluation, 
  - and acoustic feature extraction.  
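
Of the shared ingredients listed above, word error rate (WER) is the easiest to illustrate in a few lines: it is the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a minimal illustration, assuming plain whitespace tokenization and no text normalization; the function name `word_error_rate` is hypothetical and not taken from any toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.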

🔭 Explore 
- [ASR models](https://www.gladia.io/blog/best-open-source-speech-to-text-models)
- [TTS models](https://www.bentoml.com/blog/exploring-the-world-of-open-source-text-to-speech-models)

🏃 Play
- [Whisper: a general-purpose speech recognition model](https://github.com/openai/whisper)
- [OpenVoice: versatile instant voice cloning](https://github.com/myshell-ai/OpenVoice)
- [UltraVox: a fast multimodal LLM for real-time voice](https://github.com/fixie-ai/ultravox)

## The Automatic Speech Recognition Task
- **Dimensions of Variation in ASR:**
  - Vocabulary size:
    - Small-vocabulary tasks (e.g., yes/no, digits) can be recognized very accurately.
    - Open-ended tasks with large vocabularies (up to 60,000 words) are much harder.
  - Speaker and context:
    - Human-to-machine speech (dictation/dialogue systems) is easier.
    - Read speech (e.g., audiobooks) is relatively easy.
    - Conversational speech between humans is the hardest due to faster, less clear speech.
  - Channel and noise:
    - Quiet environments with head-mounted microphones are ideal.
    - Noisy settings (e.g., streets, open car windows) complicate recognition.
  - Accent and speaker characteristics:
    - Recognition is better for accents/dialects similar to the training data.
    - Regional/ethnic dialects and children's speech are more challenging.

- **Key ASR Corpora:**
  - [LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech):
    - Open-source, read-speech dataset with 1,000+ hours of audiobooks.
    - Divided into "clean" (high quality) and "other" (lower quality) portions.
  - [Switchboard Corpus](https://huggingface.co/datasets/cgpotts/swda):
    - Prompted telephone conversations between strangers (~240 hours).
    - Extensive linguistic labeling (e.g., dialogue acts, prosody).
  - [CALLHOME Corpus](https://huggingface.co/datasets/talkbank/callhome):
    - Unscripted 30-minute telephone conversations (friends/family).
    - Focus on natural, casual speech.
  - [Santa Barbara Corpus](https://www.linguistics.ucsb.edu/research/santa-barbara-corpus):
    - Everyday spoken interactions across the US (e.g., conversations, town halls).
    - Anonymized transcripts.
  - [CORAAL](https://huggingface.co/datasets/Padomin/coraal-asr):
    - Sociolinguistic interviews with African American speakers.
    - Focus on African American Language (AAL).
  - [CHiME Challenge](https://www.chimechallenge.org/):
    - Datasets for robust ASR tasks in noisy, real environments (e.g., dinner parties).
  - [AISHELL-1 Corpus](https://paperswithcode.com/dataset/aishell-1):
    - 170 hours of Mandarin read speech from various domains.

## Feature Extraction for ASR: Log Mel Spectrum
- ASR first converts the input waveform into a sequence of [log mel spectrum](https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53) feature vectors,
  - each representing the information in a small time window of the signal.

### Sampling and Quantization
- Speech recognizers process air pressure changes caused by vocalizations, visualized as waveforms showing air pressure over time. 
  - ![A waveform of an instance of the vowel iy](./images/stts/wave.png)
- Analog sound waves are digitized through 
  - **sampling**: measuring wave amplitude at intervals,
  - **quantization**: converting measurements into integers.  
- The **Nyquist frequency**, half the sampling rate, is the highest frequency that a given sampling rate can capture; 
  - an 8 kHz sampling rate (4 kHz Nyquist frequency) is sufficient for telephone-bandwidth speech, 
  - while 16 kHz is typically used for microphone speech.  
- Audio sampled at different rates (e.g., 8 kHz vs. 16 kHz) cannot be mixed in ASR training/testing, 
  - so the higher-rate audio must be downsampled to a common rate.  
- Quantization stores amplitudes as integers (e.g., 8-bit or 16-bit), 
  - often with log compression (e.g., µ-law), which gives small amplitudes finer resolution because human hearing is more sensitive at low intensities.  
- Audio files vary by sample rate, sample size, number of channels (mono/stereo), and storage type (linear vs. compressed).  
- Common file formats include 
  - `.wav`, a subset of Microsoft [Resource Interchange File Format (RIFF)](https://en.wikipedia.org/wiki/Resource_Interchange_File_Format-based), 
  - Apple’s [Audio Interchange File Format (AIFF)](https://en.wikipedia.org/wiki/Audio_Interchange_File_Format), 
  - and [raw headerless formats](https://en.wikipedia.org/wiki/Raw_audio_format).  
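
The sampling and quantization steps above can be sketched in plain Python. This is a minimal illustration rather than a production front end: the helper names are hypothetical, amplitudes are assumed to be normalized to the range [-1, 1], and `mu=255` is assumed as the µ-law parameter (the value used in 8-bit telephony). The standard library is used so the cell runs without extra dependencies.

```python
import math


def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2


def sample_sine(freq_hz: float, sample_rate_hz: int, duration_s: float = 0.01) -> list:
    """Sampling: measure a unit-amplitude sine wave at evenly spaced instants."""
    n_samples = int(round(sample_rate_hz * duration_s))
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz) for n in range(n_samples)]


def quantize_linear(x: float, bits: int = 16) -> int:
    """Quantization: map an amplitude in [-1, 1] to a signed integer code."""
    max_int = 2 ** (bits - 1) - 1
    x = max(-1.0, min(1.0, x))  # clip out-of-range amplitudes
    return int(round(x * max_int))


def mu_law_compress(x: float, mu: int = 255) -> float:
    """Log compression: sgn(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y: float, mu: int = 255) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)


print(nyquist_hz(8000), nyquist_hz(16000))  # 4000.0 8000.0
print(len(sample_sine(440, 16000)))         # 160 samples in 10 ms at 16 kHz

# Compare 8-bit linear and 8-bit mu-law quantization on a quiet sample.
quiet = 0.01
max_int_8bit = 2 ** 7 - 1
linear_recon = quantize_linear(quiet, bits=8) / max_int_8bit
mu_code = quantize_linear(mu_law_compress(quiet), bits=8)
mu_recon = mu_law_expand(mu_code / max_int_8bit)
print(abs(linear_recon - quiet), abs(mu_recon - quiet))
```

For the quiet sample, the µ-law reconstruction error should come out smaller than the linear one, which is the point of log compression: more of the 8-bit code space is spent on small amplitudes. Downsampling is left out on purpose, since doing it properly needs a low-pass (anti-aliasing) filter before samples are dropped.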

### Windowing
- `Spectral features` are extracted from small windows of speech, 
  - treating the signal as stationary within each window, 
  - despite speech being non-stationary overall.  
- A **frame** represents the speech within each window, determined by three parameters:
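  - window size (duration in ms), 
  - frame stride (also called shift or offset: the time between the starts of successive windows; a stride shorter than the window makes frames overlap), 
  - and window shape.  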
- `Windowing` multiplies the signal value $s[n]$ at time $n$ by the window value $w[n]$, producing the windowed waveform $y[n]$:
  - $y[n]=w[n]s[n]$
- Common window shapes include 
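  - **rectangular**: simple, but it cuts the signal off abruptly at the window boundaries, and these discontinuities cause problems in Fourier analysis;
  - **Hamming**: shrinks the signal toward zero at the window boundaries, smoothing the discontinuities for better feature extraction.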
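
The framing and windowing steps can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the 25 ms window and 10 ms stride are commonly used values chosen here as defaults (they are not fixed by the text above), the function names are hypothetical, and the Hamming window uses the symmetric form $w[n]=0.54-0.46\cos\left(\frac{2\pi n}{L-1}\right)$, which is also the definition used by `numpy.hamming`.

```python
import math


def rectangular(length: int) -> list:
    """Rectangular window: every sample is kept unchanged."""
    return [1.0] * length


def hamming(length: int) -> list:
    """Symmetric Hamming window: about 0.08 at the edges, rising to 1.0 in the middle."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]


def frame_signal(signal, sample_rate_hz, window_ms=25.0, stride_ms=10.0, window_fn=hamming):
    """Cut the signal into overlapping frames and apply y[n] = w[n] * s[n] to each."""
    win_len = int(round(sample_rate_hz * window_ms / 1000))
    stride = int(round(sample_rate_hz * stride_ms / 1000))
    w = window_fn(win_len)
    frames = []
    # Only full windows are kept; a trailing partial window is dropped.
    for start in range(0, len(signal) - win_len + 1, stride):
        frames.append([w[n] * signal[start + n] for n in range(win_len)])
    return frames


# One second of a 440 Hz sine sampled at 16 kHz.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
frames = frame_signal(signal, sample_rate)
print(len(frames), len(frames[0]))  # 98 400
```

At 16 kHz, a 25 ms window covers 400 samples and a 10 ms stride moves 160 samples per step, so consecutive frames overlap by 15 ms. Passing `window_fn=rectangular` gives the unweighted frames for comparison.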