```{contents}
```


## Text-to-Speech (TTS)

Text-to-Speech converts **written text** into **natural-sounding speech**.

Example:

```
Input: "Hello, how are you?"
Output: Audio waveform of a human saying it
```

TTS is used in:

* Voice assistants (Alexa, Siri, Google Assistant)
* IVR / customer service bots
* Audiobooks
* Accessibility (screen readers)
* Multimodal LLMs (GPT-4o, Gemini) and dedicated speech platforms and toolkits (Deepgram, Coqui)

---

### Why TTS Is Hard

Speech is extremely complex:

* tone
* rhythm
* pitch
* prosody (natural expression)
* emotion
* accent
* pausing and timing
* context awareness

TTS must generate audio that sounds **human**, not robotic.

---

### **How Text-to-Speech Works: 3-Stage Architecture**

Classic neural TTS systems use a **three-stage pipeline** (the end-to-end models covered later merge some of these stages):

---

#### 1. Text Processing (Linguistic Frontend)

The model converts text into linguistic units.

Tasks:

* Normalization (“$100” → “one hundred dollars”)
* Expand abbreviations (“Dr.” → “doctor”)
* Tokenization
* Convert text → phonemes
* Prosody prediction (where to pause)

This produces a sequence of **phoneme IDs** or **linguistic tokens**.

Example:

```
"Hello" → ["HH", "AH", "L", "OW"]
```
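The frontend tasks above can be sketched in a few lines of Python. This is a toy illustration, not how any production frontend is built: the abbreviation table, the one-word lexicon, and the helper names (`number_to_words`, `normalize`, `to_phonemes`) are all assumptions made for this example, and real systems rely on much larger pronunciation dictionaries plus learned grapheme-to-phoneme models.

```python
import re

# Hypothetical lookup tables, kept tiny on purpose
ABBREVIATIONS = {"Dr.": "doctor", "Mr.": "mister"}
LEXICON = {"hello": ["HH", "AH", "L", "OW"]}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def _spell_dollars(match: re.Match) -> str:
    amount = int(match.group(1))
    unit = "dollar" if amount == 1 else "dollars"
    return f"{number_to_words(amount)} {unit}"


def normalize(text: str) -> str:
    """Expand known abbreviations and small dollar amounts into words."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Only amounts up to three digits are handled in this sketch
    return re.sub(r"\$(\d{1,3})\b", _spell_dollars, text)


def to_phonemes(word: str) -> list[str]:
    """Look a word up in the toy lexicon; unknown words map to <UNK>."""
    return LEXICON.get(word.lower(), ["<UNK>"])


print(normalize("Dr. Lee paid $100"))  # doctor Lee paid one hundred dollars
print(to_phonemes("Hello"))            # ['HH', 'AH', 'L', 'OW']
```

A dictionary lookup like this fails on any word outside the lexicon, which is why real frontends fall back to a trained grapheme-to-phoneme model.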

---

#### 2. Acoustic Model

The model predicts **acoustic features** from text/phonemes.
Typically:

* **Mel spectrograms** (image-like 2D sound representation)

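Earlier models:

* Tacotron, Tacotron 2
* FastSpeech, FastSpeech 2

Newer models (these often merge or replace the separate acoustic-model stage rather than only predicting mel spectrograms):

* VITS (end-to-end)
* SpeechLM
* GPT-4o TTS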

Output example (mel spectrogram):

```
(80 mel bins × 400 frames)
```

This spectrogram describes:

* pitch
* loudness
* pronunciation
* duration
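To make the `80 × 400` shape more concrete, here is a small sketch of what the two axes mean. It assumes the common HTK-style mel formula, a sample rate of 22,050 Hz, and a hop length of 256 samples per frame; none of these values come from the text above, and real models pick their own settings.

```python
import math


def hz_to_mel(hz: float) -> float:
    """HTK-style conversion from frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, sample_rate: int) -> list[float]:
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    top_mel = hz_to_mel(sample_rate / 2)  # Nyquist frequency
    step = top_mel / (n_mels + 1)
    return [mel_to_hz(step * (i + 1)) for i in range(n_mels)]


def spectrogram_duration(n_frames: int, hop_length: int, sample_rate: int) -> float:
    """Approximate audio length in seconds covered by n_frames frames."""
    return n_frames * hop_length / sample_rate


# Assumed settings: 22,050 Hz audio, 256 samples between frames
centers = mel_band_centers(n_mels=80, sample_rate=22050)
seconds = spectrogram_duration(n_frames=400, hop_length=256, sample_rate=22050)

print(f"lowest band center:  {centers[0]:.1f} Hz")
print(f"highest band center: {centers[-1]:.1f} Hz")
print(f"400 frames cover about {seconds:.2f} seconds")
```

The bands are packed densely at low frequencies and spread out at high ones, roughly matching how human hearing resolves pitch, which is the main reason mel spectrograms are preferred over linear ones.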

---

#### 3. Vocoder

The vocoder converts the spectrogram → raw audio waveform.

Popular vocoders:

* **WaveGlow**
* **WaveNet (Google)**
* **HiFi-GAN** (among the most widely used today)
* **RefineGAN**
* **WaveRNN**

Input:

```
Mel spectrogram
```

Output:

```
Raw PCM waveform, e.g. 16,000 samples per second (22,050 and 24,000 are also common)
```
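To show what a "PCM waveform" is at the byte level, the sketch below builds one by hand and stores it as a WAV file using only the standard library. A sine tone stands in for the vocoder output, since running a real vocoder needs a trained model; the helper names `sine_pcm16` and `write_wav` are made up for this example.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000  # samples per second


def sine_pcm16(freq_hz: float, seconds: float, amplitude: float = 0.5) -> list[int]:
    """Generate a sine tone as signed 16-bit integer samples."""
    n_samples = int(seconds * SAMPLE_RATE)
    return [
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n_samples)
    ]


def write_wav(target, samples: list[int]) -> None:
    """Write mono 16-bit PCM samples to a file path or file-like object."""
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(struct.pack(f"<{len(samples)}h", *samples))


samples = sine_pcm16(freq_hz=440.0, seconds=0.5)

# An in-memory buffer is used here; a path such as "tone.wav" works the same way
buffer = io.BytesIO()
write_wav(buffer, samples)
print(len(samples), "samples for half a second of audio")
```

A vocoder's job is to produce exactly this kind of sample list, except that the values come from the mel spectrogram rather than from a formula.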

---

### Modern End-to-End TTS Models

#### **1. VITS (end-to-end, very high quality)**

* Combines spectrogram generation + vocoder in one network
* Long a leading open-source option
* High realism, natural prosody

#### **2. FastSpeech 2 + HiFi-GAN**

* Extremely fast
* Production-friendly
* Good quality

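#### **3. Neural Codec Language Models (e.g. VALL-E, Google AudioLM)**

* Use discrete audio tokens from a neural audio codec instead of spectrograms
* Model speech like text, predicting one token after another
* Can imitate a voice, emotion, or accent from a short audio prompt

Meta's Voicebox is a related large generative speech model, although it is based on flow matching rather than autoregressive token prediction.

Token-based modeling of this kind is a likely reason recent systems such as GPT-4o sound so natural, though its architecture has not been published in detail.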
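The idea of turning audio into tokens can be illustrated with a much simpler technique than a neural codec. The sketch below uses 8-bit mu-law quantization, which maps each sample to one of 256 integer values. This is a deliberate substitution: real neural codecs learn their token vocabulary and compress many samples into each token, so treat this only as an intuition for "audio as a sequence of integers".

```python
import math

MU = 255  # 8-bit mu-law gives 256 possible token values (0..255)


def encode_sample(x: float) -> int:
    """Map a sample in [-1, 1] to an integer token using mu-law companding."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int((compressed + 1.0) / 2.0 * MU + 0.5)


def decode_token(token: int) -> float:
    """Map an integer token back to an approximate sample value."""
    compressed = 2.0 * token / MU - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(MU)) / MU, compressed)


# A few samples of a 220 Hz tone at 16 kHz stand in for real speech
wave_samples = [0.3 * math.sin(2 * math.pi * 220 * i / 16000) for i in range(8)]
tokens = [encode_sample(s) for s in wave_samples]
restored = [decode_token(t) for t in tokens]

print(tokens)
print([round(value, 3) for value in restored])
```

Once audio is a token sequence, the same next-token prediction used for text can be applied to speech, which is the core idea behind codec language models.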

---

### Intuition Behind TTS

Think of TTS as animating a human voice:

Text → plan how a human would say it → generate acoustic patterns → convert to audio.

TTS must decide:

* how long each phoneme lasts
* how pitch rises/falls
* where pauses go
* how loud each syllable should be

Good TTS mimics human speech patterns.
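The "how long each phoneme lasts" decision can be sketched as a length regulator, the step that non-autoregressive models in the FastSpeech family use to stretch a phoneme sequence to the length of the output spectrogram. The durations below are invented for illustration; in a real model they are predicted by a trained duration predictor, and the expansion is applied to hidden vectors rather than to phoneme strings.

```python
def length_regulate(phonemes: list[str], durations: list[int]) -> list[str]:
    """Repeat each phoneme once per spectrogram frame it should occupy."""
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: list[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames


phonemes = ["HH", "AH", "L", "OW"]
durations = [3, 5, 4, 8]  # hypothetical frame counts, not model output

frame_sequence = length_regulate(phonemes, durations)
print(len(frame_sequence))  # 20
print(frame_sequence[:8])   # ['HH', 'HH', 'HH', 'AH', 'AH', 'AH', 'AH', 'AH']
```

Scaling every duration by a constant is a simple way to speed speech up or slow it down without retraining.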

---

### Demo: Text to Speech Using Python (Easy)

#### Install TTS (Coqui)

```bash
pip install TTS
```

#### Simple Example

```python
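from TTS.api import TTS

# Load a pretrained English Tacotron 2 model (downloaded on first use)
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")

# Synthesize the sentence and write the audio to a WAV file
tts.tts_to_file(text="Hello! This is a text to speech demo.", file_path="output.wav")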
```

---

### Demo: Text to Speech Using Transformers + SpeechT5

```bash
pip install transformers datasets soundfile sentencepiece torch
```

```python
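import torch
import soundfile as sf
from datasets import load_dataset
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
# Without a vocoder, generate_speech returns a mel spectrogram, not audio
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello world!", return_tensors="pt")

# SpeechT5 requires a speaker embedding (x-vector) that sets the voice;
# this one comes from the CMU ARCTIC x-vector dataset
embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings[7306]["xvector"]).unsqueeze(0)

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("speech.wav", speech.numpy(), samplerate=16000)  # SpeechT5 outputs 16 kHz audio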
```

---

### What Makes a TTS Model Good?

#### 1. **Natural prosody**

Sounds human, not flat or monotone.

#### 2. **Emotion control**

Happy, sad, excited.

#### 3. **Contextual awareness**

Uses the surrounding sentence to choose pronunciation and emphasis (for example, "read" in past vs. present tense).

#### 4. **Speaker identity**

Consistent voice across long paragraphs.

#### 5. **Noise robustness**

#### 6. **Fast inference**
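Inference speed is commonly reported as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below shows the arithmetic and a small timing helper; `fake_synthesize` is a placeholder standing in for a real model call and is purely hypothetical.

```python
import time
from typing import Callable


def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int) -> float:
    """RTF = time spent synthesizing / duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize: Callable[[str], list], text: str, sample_rate: int) -> float:
    """Time one call to a synthesis function and return its real-time factor."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples), sample_rate)


def fake_synthesize(text: str) -> list:
    """Placeholder model: returns one second of silence at 16 kHz."""
    return [0.0] * 16000


# 0.5 s of compute for 2 s of audio (32,000 samples at 16 kHz)
print(real_time_factor(0.5, 32000, 16000))  # 0.25
print(measure_rtf(fake_synthesize, "Hello", 16000) < 1.0)
```

For interactive uses such as voice assistants, time to the first audio chunk often matters as much as the overall RTF.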

---

**One-Sentence Summary**

**Text-to-Speech converts text → phonemes → mel spectrogram → audio waveform using neural networks like Tacotron, FastSpeech, VITS, or neural codec language models for human-like speech.**
