```{contents}
```


## Audio Spectrogram

A **spectrogram** is a **visual representation of sound** that shows:

* **Time** → horizontal axis
* **Frequency** → vertical axis
* **Energy/Intensity** → color/brightness

In simple terms:

> A spectrogram shows **which frequencies exist** in the audio and **how their strength changes over time**.

It converts a **1D audio waveform** into a **2D image-like representation**.

---

### **Why Do We Need Spectrograms?**

A raw audio waveform plots amplitude against time:

```
Amplitude vs time
```

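But a waveform **does not show frequency content directly**, and speech and music are largely characterized by frequency patterns.

Many neural networks (CNNs, Transformers, Conformers) are easier to train on spectrogram features than on raw samples, although some models (for example wav2vec 2.0) learn directly from the waveform.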

Spectrograms:

* reveal phonemes
* capture pitch
* show harmonics
* detect noise
* encode timbre

Thus, **many modern AI audio systems use spectrograms**, including:

* Speech-to-Text (Whisper)
* Text-to-Speech (VITS, FastSpeech)
* Music analysis
* Audio classification
* Voice cloning

---

### **How Spectrograms Are Created**

A spectrogram is computed using the **STFT (Short-Time Fourier Transform)**.

#### Steps:

#### **1. Break audio into small windows (frames)**

Example:

* window size = 25 ms
* hop size = 10 ms
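
As a quick sanity check on these numbers, the sketch below converts the window and hop sizes into samples and counts how many complete windows fit. The 16 kHz sample rate and 1 second duration are assumptions chosen for illustration, and no padding is applied.

```python
# Assumed values for illustration: 16 kHz audio, 1 second long.
sample_rate = 16_000
duration_s = 1.0
window_ms = 25
hop_ms = 10

num_samples = int(sample_rate * duration_s)    # 16000 samples in total
win_length = sample_rate * window_ms // 1000   # 400 samples per window
hop_length = sample_rate * hop_ms // 1000      # 160 samples between window starts

# Number of complete windows that fit when the signal is not padded.
num_frames = 1 + (num_samples - win_length) // hop_length

print(win_length, hop_length, num_frames)      # 400 160 98
```

Because the hop is shorter than the window, consecutive windows overlap (here by 15 ms).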

#### **2. Apply FFT (Fast Fourier Transform)**

This converts each window to the frequency domain, giving the magnitude (and phase) of each frequency bin.

#### **3. Stack all windows over time**

You get a 2D matrix:

```
time frames × frequency bins
```
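
Steps 1 to 3 can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production STFT: the Hann window, the lack of padding, and the 1 kHz test tone are all assumptions made for the example.

```python
import numpy as np

def stft_magnitude(signal, win_length=400, hop_length=160):
    """Return a magnitude spectrogram of shape (time frames, frequency bins)."""
    window = np.hanning(win_length)
    num_frames = 1 + (len(signal) - win_length) // hop_length
    # Step 1: cut the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop_length : i * hop_length + win_length] * window
        for i in range(num_frames)
    ])
    # Steps 2 and 3: FFT of every frame, stacked along the time axis.
    # rfft keeps only the non-negative frequencies: win_length // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames, axis=1))

sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)    # 1 second of a 1 kHz sine wave

spec = stft_magnitude(tone)
print(spec.shape)                       # (98, 201)

# Each bin spans sample_rate / win_length = 40 Hz, so the tone peaks in bin 25.
peak_bin = int(spec[0].argmax())
print(peak_bin * sample_rate / 400)     # 1000.0
```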

#### **4. Apply log scale or Mel scale**

This brings the representation closer to human perception of loudness (log) and pitch (Mel).

---

### ⭐ Types of Spectrograms

#### **1. Linear Spectrogram**

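Computed directly from the FFT, with frequency bins spaced evenly in Hz.

* Preserves full frequency detail
* Used in audio engineering

But the even spacing does not match how humans perceive pitch.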

---

#### **2. Log Spectrogram**

Applies log() to the magnitude (or power) values, not to the frequency axis.

* reduces dynamic range
* easier for models to learn
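
To make the dynamic-range point concrete, here is a small sketch of a power-to-decibel conversion. The two power values and the `floor` clamp are hypothetical choices for illustration.

```python
import math

def power_to_db(power, floor=1e-10):
    """Convert a power value to decibels, clamping to avoid log(0)."""
    return 10.0 * math.log10(max(power, floor))

quiet, loud = 1e-6, 1.0   # hypothetical power values, a 1,000,000x ratio
print(power_to_db(quiet), power_to_db(loud))
```

A ratio of one million in power becomes a difference of about 60 dB, which is a far more manageable range of values for a model to work with.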

---

#### **3. Mel Spectrogram (Most Used in AI)**

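Maps frequencies onto the **Mel scale**, which approximates human pitch perception:

Humans distinguish small frequency differences more easily at low frequencies than at high ones.

Thus:

* 100–500 Hz → high resolution (many narrow Mel bands)
* above 4000 Hz → low resolution (few wide Mel bands)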
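
The difference in resolution can be checked numerically. The sketch below assumes the HTK-style Mel formula, which is one of several conventions in use, and compares how many Mels a 400 Hz span covers at low and high frequencies.

```python
import math

def hz_to_mel(freq_hz):
    """HTK-style Mel formula (an assumed convention; others exist)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

# The same 400 Hz span, measured in Mels at two places on the frequency axis.
low_band = hz_to_mel(500) - hz_to_mel(100)
high_band = hz_to_mel(4400) - hz_to_mel(4000)
print(round(low_band), round(high_band))
```

The low band covers several times more Mels than the high band, so equally spaced Mel bins spend more of their resolution on low frequencies.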

#### Used in:

* Whisper
* Tacotron2
* VITS
* FastSpeech2
* SpeechT5
* Music generation models

---

### **Mel Spectrogram Formula**

1. Apply STFT
2. Apply Mel filter banks
3. Apply log transform

$$
M = \log(\text{MelFilterBank} \times |\text{STFT}(x)|^2)
$$
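
The formula can be sketched directly in NumPy. Several things here are assumptions for illustration: the HTK-style Mel formula, unnormalized triangular filters, a small epsilon to avoid log(0), and random numbers standing in for a real power spectrogram. Note the orientation is (frequency bins, frames), so the filter bank multiplies from the left.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_mels, n_fft, sample_rate):
    """Triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    # n_mels + 2 points equally spaced on the Mel scale give each filter's
    # left edge, centre, and right edge.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2))
    bank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        left, centre, right = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

n_fft, n_mels, sample_rate = 1024, 80, 16_000
rng = np.random.default_rng(0)
# Stand-in for |STFT(x)|^2 with shape (frequency bins, frames).
power_spec = rng.random((n_fft // 2 + 1, 50))

bank = mel_filter_bank(n_mels, n_fft, sample_rate)
log_mel = np.log(bank @ power_spec + 1e-10)    # M = log(MelFilterBank x |STFT|^2)
print(bank.shape, log_mel.shape)               # (80, 513) (80, 50)
```

In practice a library routine would normally build the filter bank; the loop is written out here only to show what the matrix contains.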

---

### Visual Example

#### Waveform → Spectrogram

**Waveform**

```
Fast oscillating line, not visually meaningful for patterns.
```

**Spectrogram**

```
Frequencies on Y  
Time on X  
Colors = energy
```

Patterns become visible:

* vowels (steady harmonic bands)
* consonants (noise bursts)
* silence (dark area)

---

### Why Spectrograms Are Like Images for AI

A spectrogram is a **2D grid**, so:

* CNNs detect patterns
* Transformers process patches
* Vision models can understand audio

This is why:

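* Whisper feeds a log-Mel spectrogram through convolutional layers into a Transformer encoder, while models such as the Audio Spectrogram Transformer (AST) split the spectrogram into patches like a Vision Transformer
* Many TTS models generate a spectrogram, then a vocoder converts it to a waveform
* Audio language models such as AudioLM work on discrete audio tokens produced by learned encoders, some of which take spectrogram features as input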
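
To illustrate the image analogy, the sketch below cuts a spectrogram into non-overlapping square patches, the way a Vision Transformer style model would before embedding them. The 16×16 patch size, the 80×100 input, and the helper name `to_patches` are hypothetical choices, not taken from any particular model.

```python
import numpy as np

def to_patches(spec, patch=16):
    """Split a (freq, time) array into flattened non-overlapping patches."""
    n_freq, n_time = spec.shape
    # Drop any remainder so both axes divide evenly by the patch size.
    n_freq, n_time = n_freq - n_freq % patch, n_time - n_time % patch
    spec = spec[:n_freq, :n_time]
    grid = spec.reshape(n_freq // patch, patch, n_time // patch, patch)
    # Bring the two patch-index axes together, then flatten each patch.
    return grid.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

spec = np.arange(80 * 100, dtype=np.float32).reshape(80, 100)  # placeholder values
patches = to_patches(spec)
print(patches.shape)   # (30, 256)
```

Each row is one patch of 256 values, which can then be treated like a token in a sequence.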

---

### PyTorch Demo — Create a Mel Spectrogram

Install:

```bash
pip install torchaudio
```

### Code:

```python
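import torchaudio
import torchaudio.transforms as T

# Load audio: waveform has shape [channels, samples], sr is the sample rate in Hz
waveform, sr = torchaudio.load("audio.wav")

# Create the Mel spectrogram transform
mel_spectrogram = T.MelSpectrogram(
    sample_rate=sr,
    n_fft=1024,      # FFT size per window
    hop_length=256,  # samples between consecutive windows
    n_mels=80,       # number of Mel bands
)

# MelSpectrogram returns power values; the log step is applied separately
mel = mel_spectrogram(waveform)  # Shape: [channels, n_mels, time]
log_mel = T.AmplitudeToDB(stype="power", top_db=80)(mel)  # decibel (log) scale

print("Mel spectrogram shape:", mel.shape)
print("Log-Mel spectrogram shape:", log_mel.shape)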
```

---

### Interpreting Values

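A Mel spectrogram gives:

* rows = Mel frequency bins (80 is a common choice)
* columns = time steps (one per hop, so the count depends on hop size)
* pixel values = energy in that Mel band at that time (in decibels once the log is applied)

This, usually after log scaling and normalization, is the input that many speech models consume.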
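
The number of columns can be predicted from the clip length. The sketch below assumes centred framing, which is the default (`center=True`) for the torchaudio transform used in the demo, and a hypothetical 3 second clip at 16 kHz; `expected_mel_shape` is a helper name introduced here for illustration.

```python
def expected_mel_shape(num_samples, n_mels=80, hop_length=256, channels=1):
    # With centred framing the signal is padded at both ends, so the number
    # of frames is 1 + num_samples // hop_length.
    return (channels, n_mels, 1 + num_samples // hop_length)

print(expected_mel_shape(16_000 * 3))   # (1, 80, 188)
```

Halving the hop length roughly doubles the number of time steps, at the cost of a larger input.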

---

**Summary**

| Concept     | Meaning                                      |
| ----------- | -------------------------------------------- |
| Spectrogram | Time–frequency representation of audio       |
| Why         | Makes speech/music patterns visible for AI   |
| How         | STFT → Mel filter bank → log transform       |
| Used in     | STT, TTS, music AI, sound classification     |
| Models      | Whisper, VITS, FastSpeech, SpeechT5          |