# The big picture

![flowchart](images/The%20Big%20Picture.png)

## What is raw audio?

### Waveform (time-domain signal)

- Waveform and raw audio are not exactly the same thing, but in speech ML tasks the terms are used interchangeably.

A waveform (raw audio) is a 1-D time series.  
Each value is the air-pressure deviation at one moment in time.  
(For intuition, think of it as a really long 1-D array; in practice, ML libraries store it as a tensor.)

### Sampling rate

A natural (analog) waveform is continuous, so it contains infinitely many points.  
To represent it on a computer, we take discrete values from it at regular time intervals.  
The rate at which we take these **discrete values**, or **samples**, is called the sampling rate.  

Each value in that tensor we talked about earlier is 1 sample.  
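
To make the idea of samples concrete, here is a minimal sketch that builds half a second of audio at 16 kHz. It assumes NumPy and uses a NumPy array as a stand-in for the tensor; the 440 Hz tone and the 0.5 s duration are made-up values chosen only for illustration.

```python
import numpy as np

sample_rate = 16_000   # samples per second (16 kHz)
duration_s = 0.5       # clip length in seconds (illustrative value)
freq_hz = 440.0        # illustrative test tone, not from the notes above

# Time stamp of every sample: 0, 1/16000, 2/16000, ...
t = np.arange(int(sample_rate * duration_s)) / sample_rate

# The "waveform": one amplitude value per sample
waveform = np.sin(2 * np.pi * freq_hz * t).astype(np.float32)

print(waveform.ndim)    # 1
print(waveform.shape)   # (8000,) = sample_rate * duration_s
```

The number of samples is simply the sampling rate multiplied by the duration, which is why the same clip takes about three times as many values at 48 kHz as at 16 kHz.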

| Sampling Rate | Samples / second | Typical use        |
| ------------- | ---------------- | ------------------ |
| 16 kHz        | 16,000           | Speech ML          |
| 44.1 kHz      | 44,100           | CD quality         |
| 48 kHz        | 48,000           | Professional audio |

Speech ML standard: 16 kHz

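Why?
- Most of the information in human speech lies below 8 kHz.
- Nyquist rule: the sampling rate must be at least 2× the highest frequency you want to capture, so 8 kHz of content needs a 16 kHz sampling rate.
- Most SpeechBrain recipes and pretrained models expect 16 kHz input.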
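
The Nyquist rule can be checked numerically. The sketch below (assuming NumPy; the 10 kHz and 6 kHz tones are illustrative choices) samples a tone above the 8 kHz limit and shows that its samples are identical to those of a lower tone. This effect is called aliasing.

```python
import numpy as np

sample_rate = 16_000
nyquist_hz = sample_rate / 2     # highest frequency a 16 kHz rate can represent

n = np.arange(64)                # sample indices
tone_10k = np.cos(2 * np.pi * 10_000 * n / sample_rate)  # above the Nyquist limit
tone_6k = np.cos(2 * np.pi * 6_000 * n / sample_rate)    # below the Nyquist limit

# After sampling, the 10 kHz tone cannot be told apart from a 6 kHz tone
print(nyquist_hz)                      # 8000.0
print(np.allclose(tone_10k, tone_6k))  # True
```

Anything above half the sampling rate folds back onto a lower frequency, so the content above 8 kHz must be removed (or simply absent) before sampling at 16 kHz.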

## Framing & Windowing

Why do we need framing?

The problem we encounter while processing speech is that it is **quasi-stationary**.  
That means that over short intervals (~20-30 ms) speech properties are stable, but over longer durations they change.

To solve this problem, we frame the signal and analyze each short chunk separately.

### Framing

Framing means splitting the waveform into short chunks.

| Parameter         | Typical Value |
| ----------------- | ------------- |
| Frame length      | 25 ms         |
| Frame shift (hop) | 10 ms         |

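At 16 kHz:  
25 ms → 400 samples  
10 ms → 160 samples  

Because the hop (10 ms) is shorter than the frame (25 ms), consecutive frames overlap by 15 ms (240 samples). The diagram below leaves out the overlap for simplicity.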

Waveform:  |-----------------------------|  
Frames:    [====] [====] [====] [====]  
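
As a rough sketch of framing in code, the function below slices a waveform into 25 ms frames with a 10 ms hop. It assumes NumPy; `frame_signal` is a hypothetical helper written for this note, and dropping the incomplete last frame is a design choice (some libraries pad instead).

```python
import numpy as np

def frame_signal(waveform, sample_rate=16_000, frame_ms=25, hop_ms=10):
    """Split a 1-D waveform into overlapping frames (incomplete tail is dropped)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
    if len(waveform) < frame_len:
        return np.empty((0, frame_len), dtype=waveform.dtype)
    num_frames = 1 + (len(waveform) - frame_len) // hop_len
    # Row i holds samples [i * hop_len, i * hop_len + frame_len)
    starts = hop_len * np.arange(num_frames)
    index = starts[:, None] + np.arange(frame_len)[None, :]
    return waveform[index]

# 1 second of placeholder noise standing in for real audio
waveform = np.random.default_rng(0).standard_normal(16_000)
frames = frame_signal(waveform)
print(frames.shape)  # (98, 400)
```

One second of audio gives 98 frames, and each frame shares its last 240 samples with the start of the next one.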

### Windowing

When we split the audio into frames, the edges of each frame are cut off abruptly. This leads to a problem called **spectral leakage**.  

There are two ways to look at it: we visualize things in the time domain, whereas the computations happen in the frequency domain.  

1. The Intuition (Time Domain): Picture how frames sit next to each other in a sequence. If the end of Frame A doesn't match the start of Frame B, playing them back to back would produce a "click" or "pop", because the speaker diaphragm would have to jump instantly from one position to another.  

2. The Math (Frequency Domain): The Fast Fourier Transform (FFT), the algorithm used to see frequencies, doesn't know about "other" frames. It only looks at one frame at a time and implicitly treats that frame as one period of a repeating signal, as if the frame were wrapped around a cylinder so that its end touches its beginning. Any jump at that seam shows up as energy smeared across frequencies that are not really in the signal.  

Applying a window (like a Hamming window) tapers both the beginning and the end of the frame toward zero. A Hann window reaches exactly zero at the edges; a Hamming window stops at about 0.08.

Now, no matter what the next frame looks like, or how you "wrap" the signal into a circle, the transition at the edges is smooth. The "jump" is almost gone, so spectral leakage is greatly reduced, although it never disappears completely.  


Raw frame:     |████████████|  
Windowed:       ▁▂▅████▅▂▁  

Purpose:
- Reduce edge artifacts
- Make frequency analysis cleaner

You rarely implement this manually; libraries do it for you.
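
To see the effect of a window, the sketch below compares the spectrum of one frame with and without a Hamming window. It assumes NumPy; the 1020 Hz tone, the bin range 35 to 59, and the helper `leakage_db` are illustrative choices made for this note, not a standard measure.

```python
import numpy as np

sample_rate = 16_000
frame_len = 400                                  # 25 ms at 16 kHz
t = np.arange(frame_len) / sample_rate

# 1020 Hz completes 25.5 cycles in 25 ms, so the frame edges do not line up
frame = np.sin(2 * np.pi * 1020 * t)

window = np.hamming(frame_len)                   # tapers to about 0.08 at both ends
windowed = frame * window

def leakage_db(x, lo=35, hi=60):
    """Strongest component in bins lo..hi-1, in dB relative to the spectral peak."""
    mag = np.abs(np.fft.rfft(x))
    return 20 * np.log10(mag[lo:hi].max() / mag.max())

print(f"window edges: {window[0]:.2f}, {window[-1]:.2f}")
print(f"leakage without window: {leakage_db(frame):.1f} dB")
print(f"leakage with Hamming:   {leakage_db(windowed):.1f} dB")
```

The tone sits near bin 25 (each bin is 40 Hz wide), so bins 35 and above should be almost empty. The windowed frame shows a noticeably lower value there, which is the leakage reduction described above.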

## Time Domain vs Frequency Domain

### Time Domain (What we started with)

The waveform is in the time domain: it is a graph of amplitude over time.  
It captures energy and temporal patterns.  
Examples: loudness, silence, speaking rate, etc.
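
As an example of a time-domain feature, the sketch below computes the root-mean-square (RMS) energy of each frame and uses it to flag silence. It assumes NumPy; the test signal (half a second of silence, then a 200 Hz tone) and the 0.01 threshold are arbitrary values picked for illustration.

```python
import numpy as np

sample_rate = 16_000
frame_len, hop_len = 400, 160                    # 25 ms frames, 10 ms hop

# Placeholder signal: 0.5 s of silence followed by 0.5 s of a 200 Hz tone
t = np.arange(sample_rate // 2) / sample_rate
waveform = np.concatenate([np.zeros(sample_rate // 2),
                           0.5 * np.sin(2 * np.pi * 200 * t)])

# RMS energy of each frame: a simple loudness measure
num_frames = 1 + (len(waveform) - frame_len) // hop_len
rms = np.array([
    np.sqrt(np.mean(waveform[i * hop_len : i * hop_len + frame_len] ** 2))
    for i in range(num_frames)
])

# Frames below an arbitrary threshold are treated as silence
is_silent = rms < 0.01
print(num_frames, int(is_silent.sum()))  # 98 48
```

No FFT is involved here: loudness and silence can be read directly from the sample values.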

### Frequency domain (how humans hear)

Apply FFT per frame:  

Waveform → Frames → FFT → Spectrum  
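
The pipeline above can be sketched end to end as follows. This assumes NumPy and a Hamming window; the 1 kHz input tone is a placeholder, and real toolkits add further steps (pre-emphasis, padding, mel filters) that are left out here.

```python
import numpy as np

sample_rate = 16_000
frame_len, hop_len = 400, 160                    # 25 ms frames, 10 ms hop

# Placeholder input: 1 s of a 1 kHz tone
t = np.arange(sample_rate) / sample_rate
waveform = np.sin(2 * np.pi * 1000 * t)

# Waveform -> frames (one row per frame)
num_frames = 1 + (len(waveform) - frame_len) // hop_len
frames = np.stack([waveform[i * hop_len : i * hop_len + frame_len]
                   for i in range(num_frames)])

# Frames -> windowed frames -> FFT -> magnitude spectrum per frame
window = np.hamming(frame_len)
spectrum = np.abs(np.fft.rfft(frames * window, axis=1))

# Frequency (in Hz) represented by each FFT bin
freqs = np.fft.rfftfreq(frame_len, d=1 / sample_rate)

print(spectrum.shape)                # (98, 201): frames x frequency bins
print(freqs[spectrum[0].argmax()])   # strongest frequency in frame 0, about 1000 Hz
```

Each row of `spectrum` is the spectrum of one frame. Stacking the rows over time and plotting the magnitudes gives a spectrogram.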

![flowchart](images/signalspectrogram.png)

The image above shows a signal converted from the time domain to the frequency domain.  
In this spectrogram, darker regions mean a larger magnitude (loudness) at that frequency, and white regions mean that frequency is absent.