# Audio Features

Audio features are compact, meaningful measurements extracted from short speech frames that make learning easier.  

Intuitively:  
“What properties of this sound matter for my task?”  
Some examples:  

| Task       | What matters              |
| ---------- | ------------------------- |
| Speaker ID | Vocal tract shape, timbre |
| Emotion    | Pitch, energy, rhythm     |
| Health     | Voice stability, noise    |
| Age/Gender | Pitch range, formants     |

# Time domain features (No FFT)

As we saw earlier, the FFT (fast Fourier transform) converts time-domain data to the frequency domain. The features in this section skip that step and are computed directly on the waveform samples of each frame.  

## Energy (Loudness)

What:  
How strong the signal is in the frame.  

Intuition:  
Loud vs quiet speech.  

Used for:
- Voice activity detection (speech vs silence)
- Emotion (anger vs calm)
- Stress detection
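
As a rough illustration of frame energy, here is a minimal NumPy sketch. The 25 ms / 10 ms framing, the synthetic quiet/loud signal, and the `1e-3` threshold are assumptions chosen for the example, not values from this document; a real voice activity detector would tune or learn its threshold.

```python
import numpy as np

def frame_signal(signal, frame_length, hop_length):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    idx = np.arange(frame_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    return signal[idx]

def short_time_energy(frames):
    """Mean squared amplitude of each frame."""
    return np.mean(frames ** 2, axis=1)

# Synthetic signal: 0.5 s of a very quiet tone followed by 0.5 s of a loud one.
sr = 16000
t = np.arange(sr // 2) / sr
quiet = 0.01 * np.sin(2 * np.pi * 220 * t)
loud = 0.5 * np.sin(2 * np.pi * 220 * t)
signal = np.concatenate([quiet, loud])

frames = frame_signal(signal, frame_length=400, hop_length=160)  # 25 ms frames, 10 ms hop
energy = short_time_energy(frames)
is_active = energy > 1e-3  # illustrative threshold (assumption)
```

Thresholding the per-frame energy like this is the simplest form of speech vs silence detection: the quiet frames fall below the threshold and the loud frames rise above it.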

## Zero Crossing Rate (ZCR)

What:  
How often the waveform crosses zero.  

Intuition:
- Noisy sounds → high ZCR
- Voiced sounds → low ZCR

Used for:
- Voiced vs unvoiced detection
- Rough speech characterization

Rarely used alone in modern systems.  
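
A minimal sketch of ZCR on two synthetic frames. The 150 Hz tone stands in for a voiced sound and white noise stands in for an unvoiced one; both are assumptions for illustration, not real speech.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.mean(signs[:-1] != signs[1:])

sr = 16000
t = np.arange(400) / sr                      # one 25 ms frame
voiced_like = np.sin(2 * np.pi * 150 * t)    # low-frequency periodic tone
rng = np.random.default_rng(0)
noise_like = rng.standard_normal(400)        # white noise

zcr_voiced = zero_crossing_rate(voiced_like)
zcr_noise = zero_crossing_rate(noise_like)
```

The tone crosses zero only a handful of times in the frame, while the noise changes sign on roughly half of its samples, which matches the intuition above.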

# Frequency domain and spectral features (after FFT)

## Spectral Centroid

What:  
“Center of mass” of frequencies.

Intuition:  
- Low centroid → dark / deep voice
- High centroid → bright / sharp sound

Used for:
- Emotion
- Timbre analysis
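
The "center of mass" idea can be sketched as a magnitude-weighted average of the FFT bin frequencies. The Hann window, the frame size, and the two synthetic test tones are assumptions made for this example.

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency of one frame, in Hz."""
    window = np.hanning(len(frame))
    magnitude = np.abs(np.fft.rfft(frame * window))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * magnitude) / (np.sum(magnitude) + 1e-12)

sr = 16000
t = np.arange(1024) / sr
dark = np.sin(2 * np.pi * 200 * t)                                   # only low-frequency content
bright = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 4000 * t)  # added high-frequency content

c_dark = spectral_centroid(dark, sr)
c_bright = spectral_centroid(bright, sr)
```

Adding the 4000 Hz component pulls the centroid upward, roughly to the midpoint between the two tones, which is the "brighter" end of the intuition above.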

## Spectral Bandwidth, Roll-off, Flux

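| Feature   | Intuition                                                          |
| --------- | ------------------------------------------------------------------ |
| Bandwidth | How spread out the frequencies are around the centroid             |
| Roll-off  | Frequency below which most of the spectral energy lies (e.g. 85%)  |
| Flux      | How fast the spectrum changes from frame to frame                  |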

These are common in music/audio, less dominant in speech ML today.
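
One possible way to compute these three statistics from a magnitude spectrum is sketched below. Exact definitions vary between toolkits (power vs magnitude, the roll-off fraction, how flux is normalized), so the choices here, including the 0.85 roll-off fraction, are assumptions.

```python
import numpy as np

def magnitude_spectrum(frame):
    return np.abs(np.fft.rfft(frame * np.hanning(len(frame))))

def spectral_bandwidth(mag, freqs):
    """Magnitude-weighted standard deviation of frequency around the centroid."""
    weights = mag / (np.sum(mag) + 1e-12)
    centroid = np.sum(freqs * weights)
    return np.sqrt(np.sum(weights * (freqs - centroid) ** 2))

def spectral_rolloff(mag, freqs, fraction=0.85):
    """Lowest bin frequency at which the cumulative magnitude reaches `fraction` of the total."""
    cumulative = np.cumsum(mag)
    idx = np.searchsorted(cumulative, fraction * cumulative[-1])
    return freqs[min(idx, len(freqs) - 1)]

def spectral_flux(mag_prev, mag_curr):
    """Euclidean distance between two consecutive magnitude spectra."""
    return np.sqrt(np.sum((mag_curr - mag_prev) ** 2))

sr, n = 16000, 1024
t = np.arange(n) / sr
freqs = np.fft.rfftfreq(n, d=1.0 / sr)
rng = np.random.default_rng(0)

tone_mag = magnitude_spectrum(np.sin(2 * np.pi * 300 * t))   # narrow spectrum
noise_mag = magnitude_spectrum(rng.standard_normal(n))       # broad spectrum

bw_tone, bw_noise = spectral_bandwidth(tone_mag, freqs), spectral_bandwidth(noise_mag, freqs)
ro_tone, ro_noise = spectral_rolloff(tone_mag, freqs), spectral_rolloff(noise_mag, freqs)
flux_same = spectral_flux(tone_mag, tone_mag)      # identical frames: no change
flux_change = spectral_flux(tone_mag, noise_mag)   # tone followed by noise: large change
```

A pure tone gives a small bandwidth and a low roll-off, while white noise gives large values for both; flux is zero between identical frames and grows when the spectrum changes.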

## Cepstral Features — MFCCs (Very Important)

In speech processing and machine learning, Cepstral Features are the specific numbers (coefficients) extracted during Cepstrum analysis. They are essentially a "summary" of the sound's texture and characteristics.

### The cepstrum

To understand the Cepstrum, think about how you speak:

- The Source (Vocal Cords): Your vocal cords vibrate at a certain speed. This creates the pitch of your voice.
- The Filter (Vocal Tract): Your mouth and tongue change shape to turn those vibrations into specific sounds like "Ah" or "Ee." This is the timbre or "texture" of the sound.

When you record a sound, the Source and Filter are "tangled" together. The Spectrum shows them as one messy line. The Cepstrum is the mathematical tool that reaches in and pulls them apart so you can analyze the pitch and the texture separately.  

In standard signal processing, the Fourier Transform (FFT) moves you from Time to Frequency. To get back, you use the Inverse Fourier Transform (IFFT).  

In a Cepstrum, you do something unusual:
1. Forward: You take the FFT (Time → Frequency).
2. Processing: You take the Logarithm of the magnitude of that spectrum.
3. "Backwards": You take the IFFT (Frequency → "New Time").

Because you took the logarithm in the middle, the IFFT doesn't bring you back to the original sound wave. Instead, it lands you in a "pseudo-time" domain called Quefrency. It feels like you turned around and walked back toward the starting line, but ended up in a parallel dimension.  
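
The three steps above can be sketched in a few lines of NumPy. This is the "real cepstrum" variant (log of the magnitude only, phase discarded). The synthetic harmonic frame, the Hann window, and the 60 to 400 Hz search range are assumptions made for the example.

```python
import numpy as np

def real_cepstrum(frame):
    """FFT -> log magnitude -> inverse FFT."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_magnitude = np.log(np.abs(spectrum) + 1e-10)  # small constant avoids log(0)
    return np.fft.irfft(log_magnitude)

# Synthetic "voiced" frame: harmonics of a 200 Hz fundamental with decaying amplitude.
sr = 16000
f0 = 200.0
t = np.arange(2048) / sr
frame = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 40))

cepstrum = real_cepstrum(frame)

# Search for a peak only at quefrencies that correspond to plausible pitch (60-400 Hz).
q_min, q_max = sr // 400, sr // 60
peak_q = q_min + np.argmax(cepstrum[q_min:q_max + 1])
estimated_f0 = sr / peak_q
```

The low-quefrency part of `cepstrum` describes the smooth spectral envelope (the filter), while the peak found at a higher quefrency corresponds to the pitch period (the source). Here the peak lands near 80 samples, i.e. close to the 200 Hz fundamental.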

The scientists who created this (Bogert, Healy, and Tukey) realized they were doing everything in reverse. To signal this to other engineers, they flipped the names of everything:
- SPECtrum -> CEPStrum
- FREQuency -> QUEFrency
- FILTer -> LIFTer
- PHASe -> SAPHe
- HARMonic -> RAHMonic

### Mel-Frequency Cepstral Coefficients

The MFCC is actually a specific type of Cepstrum. The word "Cepstral" in MFCC tells you that it uses the same core "backwards" logic (taking the transform of a log-spectrum).

The Pipeline Connection:
- Cepstrum: Signal → FFT → Log → IFFT
- MFCC: Signal → FFT → Mel-Filterbank → Log → DCT (discrete cosine transform, a close relative of the Fourier transform that takes the place of the IFFT)

#### Importance of the Mel Filterbank

Humans can easily tell 500 Hz from 600 Hz, but we are much worse at telling 10,000 Hz from 10,100 Hz, even though the gap (100 Hz) is the same.

- The plain cepstrum doesn't care about this; it treats the frequency axis linearly.
- The MFCC "warps" the frequency axis using the Mel scale to approximate how our ears resolve frequency. This is why MFCCs have long been a standard input feature for speech systems that need to "hear" like people (e.g., voice assistants such as Alexa and Siri).
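
To make the warping concrete, the sketch below converts the two 100 Hz gaps from the example above into mels. Several mel formulas are in use; the `2595 * log10(1 + f / 700)` variant used here is a common one and is an assumption, since this document does not specify which formula it has in mind.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Hz -> mel, using the common 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

low_gap = hz_to_mel(600) - hz_to_mel(500)         # 100 Hz gap at low frequency
high_gap = hz_to_mel(10100) - hz_to_mel(10000)    # 100 Hz gap at high frequency
```

With this formula the low-frequency gap is roughly 90 mels while the high-frequency gap is only about 10 mels, so the same 100 Hz step counts for far less at 10 kHz than at 500 Hz.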

### Why MFCCs Exist

Key problem:
- Spectrum is too detailed
- Humans care about spectral envelope, not fine harmonics

MFCCs capture:
- The shape of the vocal tract, not pitch harmonics

### Intuition
- Mel filters ≈ how humans hear frequency
- Log ≈ loudness perception
- DCT ≈ compression & decorrelation

MFCCs ≈ compressed vocal tract signature
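
Putting the pieces together, here is a minimal single-frame sketch of the FFT → Mel-Filterbank → Log → DCT pipeline. It is an illustration under stated assumptions, not a reference implementation: the mel formula, 26 triangular filters, 13 coefficients, the Hamming window, and the unnormalized DCT-II are all choices made for this example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale; shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_filters, len(fft_freqs)))
    for i in range(n_filters):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def dct_ii(x, n_coeffs):
    """Unnormalized DCT-II of a 1-D array, keeping the first n_coeffs outputs."""
    n = len(x)
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ x

def mfcc_frame(frame, sr, n_filters=26, n_coeffs=13):
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2   # FFT -> power spectrum
    mel_energies = mel_filterbank(n_filters, n_fft, sr) @ power   # mel filterbank
    log_mel = np.log(mel_energies + 1e-10)                        # log compression
    return dct_ii(log_mel, n_coeffs)                              # DCT -> cepstral coefficients

# Synthetic voiced-like frame (assumption): a few harmonics of 150 Hz.
sr = 16000
t = np.arange(512) / sr
frame = sum(np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 20))

bank = mel_filterbank(26, 512, sr)
coeffs = mfcc_frame(frame, sr)
```

Keeping only the first 13 DCT outputs is what throws away the fine harmonic detail and retains the smooth envelope. In practice a library routine such as `librosa.feature.mfcc` would normally be used; its defaults (mel formula, filter normalization, DCT scaling) differ from this sketch, so the numbers will not match.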

### What MFCCs capture

| Captures          | Ignores        |
| ----------------- | -------------- |
| Vocal tract shape | Exact pitch    |
| Timbre            | Phase          |
| Speaker identity  | Fine harmonics |

- useful notebook - [mfcc tutorial](https://www.kaggle.com/code/ilyamich/mfcc-implementation-and-tutorial)
- useful video - [mfcc explained](https://youtu.be/SJo7vPgRlBQ?si=PEnwwwayWV1Rstza)

## Prosodic features

Prosodic features are not concerned with what you said; they are concerned with how you said it.  
We can roughly summarize prosody as: `Prosody = rhythm + intonation`  

#### Rhythm

Rhythm is the timing and pattern of sounds. In English, we don't say every syllable with the same length or strength.  

- Pacing: Speaking fast when excited or slow when explaining something complex.
- Pauses: A pause can change the meaning of a sentence entirely.
    - Example: "Let’s eat, Grandma!" vs. "Let’s eat Grandma!" (The pause—or comma—saves Grandma's life).

- Stress: We "punch" certain syllables or words to make them stand out.

    - Example: Say "I didn't steal the money." Now say it again, but stress a different word each time (e.g., "**I** didn't steal it" vs "I didn't **steal** it"). The rhythm changes the accusation.

#### Intonation

Intonation is the "rise and fall" of your pitch. It tells the listener about your emotions and your intentions.

- Rising Pitch: Usually signals a question or uncertainty. Your voice goes up at the end: "You're coming tonight?"
- Falling Pitch: Usually signals a statement, a command, or finality: "You're coming tonight."
- Tone of Voice: This is how we detect sarcasm, anger, or joy. If you say "Great job" with a high, bouncy pitch, it’s a compliment. If you say it with a flat, low pitch, it's sarcasm.

Common prosodic features include pitch (F0) contours, energy contours, and speaking rate.
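
As an illustration of a pitch contour, the sketch below estimates F0 per frame with a basic autocorrelation method and applies it to four synthetic frames whose pitch rises, as it would at the end of a question. The autocorrelation approach, the 60 to 400 Hz search range, and the pure-tone frames are assumptions for the example; production pitch trackers add voicing decisions and smoothing.

```python
import numpy as np

def estimate_f0_autocorr(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency of one voiced frame via autocorrelation."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0 .. N-1
    lag_min = int(sr / fmax)   # shortest period considered
    lag_max = int(sr / fmin)   # longest period considered
    best_lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    return sr / best_lag

sr = 16000
frame_len = 800                              # 50 ms frames
t = np.arange(frame_len) / sr
target_f0 = [120.0, 140.0, 160.0, 180.0]     # rising pitch across four frames

contour = [estimate_f0_autocorr(np.sin(2 * np.pi * f * t), sr) for f in target_f0]
is_rising = contour[-1] > contour[0]
```

The resulting `contour` is the kind of per-frame sequence that intonation features are built from, for example its slope, range, or mean.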

## Voice Quality Features

These relate to micro-irregularities in voice.

- Jitter - Cycle-to-cycle variation in pitch period.
- Shimmer - Cycle-to-cycle variation in amplitude.
- Harmonics-to-Noise Ratio (HNR) - How “clean” the voice is: the ratio of periodic (harmonic) energy to noise energy.

In speech analysis, a "cycle" is one glottal cycle: the vocal folds opening and closing exactly once.
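
Given per-cycle measurements, jitter and shimmer reduce to simple statistics. The sketch below uses the common "local" definitions (mean absolute difference between consecutive cycles, divided by the mean). It assumes the glottal cycle periods and peak amplitudes have already been extracted by some other tool, and the numbers are made up for illustration.

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference between consecutive periods, relative to the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """Mean absolute difference between consecutive peak amplitudes, relative to the mean amplitude."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Hypothetical per-cycle measurements (seconds and linear amplitude).
steady_periods = [0.0100, 0.0100, 0.0100, 0.0100, 0.0100]
irregular_periods = [0.0100, 0.0104, 0.0098, 0.0103, 0.0099]
steady_amps = [0.80, 0.80, 0.80, 0.80, 0.80]
irregular_amps = [0.80, 0.70, 0.85, 0.72, 0.83]

jitter_steady = local_jitter(steady_periods)
jitter_irregular = local_jitter(irregular_periods)
shimmer_steady = local_shimmer(steady_amps)
shimmer_irregular = local_shimmer(irregular_amps)
```

A perfectly regular voice gives zero for both measures; the irregular example gives a jitter of a few percent. HNR is not covered here because it needs the waveform itself rather than per-cycle summaries.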

## Feature task mapping

| Feature Type   | Best For          |
| -------------- | ----------------- |
| Energy, ZCR    | VAD, emotion      |
| Spectral stats | Timbre            |
| MFCCs          | Speaker traits    |
| Pitch          | Emotion, gender   |
| Jitter/Shimmer | Health            |
| Embeddings     | Everything modern |


Modern systems generally prefer learned embeddings over hand-crafted features.


