```{contents}
```

## Speech-to-Text (STT)

Speech-to-Text (also called **Automatic Speech Recognition**, ASR) converts **audio (speech)** into **written text**.

Example:

```
Audio: "Hello, how are you?"
↓
Text: "Hello, how are you?"
```

STT systems are used in:

* Voice assistants (Siri, Alexa, Google Assistant)
* Call-center analytics
* Transcription tools
* Meeting summarizers
* Voice-controlled devices
* Multimodal LLMs with audio input (GPT-4o, Gemini, etc.)

---

### Why Speech-to-Text Is Challenging

Speech is complex:

* Noisy environments
* Different accents
* Fast speech
* Co-articulation (words blend together)
* Background sounds
* Microphone quality

An STT model must:

* Understand acoustics
* Decode phonemes
* Recognize words
* Produce well-formed text (punctuation, casing, grammar)
* Handle ambiguity, such as homophones ("two" vs. "too")

---

### How Speech-to-Text Works (Processing Pipeline)

#### **1️⃣ Audio Input (Waveform)**

The input is a raw audio waveform, a sequence of amplitude samples:

```
16 kHz sampling → 16000 values per second
```
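As a rough illustration of what "16000 values per second" means, the sketch below synthesizes half a second of a pure tone using only the standard library. The tone frequency and duration are arbitrary choices for the example; real input would come from a microphone or an audio file.

```python
import math

SAMPLE_RATE = 16_000   # samples per second (16 kHz)
DURATION_S = 0.5       # assumed clip length for this example
FREQ_HZ = 440.0        # assumed tone frequency for this example

# One amplitude value per sample, in the range [-1.0, 1.0].
n_samples = int(SAMPLE_RATE * DURATION_S)
waveform = [
    math.sin(2 * math.pi * FREQ_HZ * n / SAMPLE_RATE)
    for n in range(n_samples)
]

print(len(waveform))                 # 8000 samples
print(len(waveform) / SAMPLE_RATE)   # 0.5 seconds
```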

---

#### **2️⃣ Feature Extraction (Mel Spectrograms)**

The waveform is converted into **spectrograms**: 2D time-frequency representations that can be treated much like images.

Most STT models use:

* **Mel Spectrograms**
* Log-mel features

Why?

* Speech patterns become clearer
* Easier for neural networks to learn
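The feature-extraction step can be sketched in plain NumPy: slice the waveform into overlapping windows, take the power spectrum of each, and pool it with triangular filters spaced on the mel scale. This is a simplified illustration, not any particular model's front end. The HTK-style mel formula and the parameter values (`n_fft=512`, `hop=160`, `n_mels=40`) are assumptions chosen for the example; in practice a library such as librosa or torchaudio would normally be used.

```python
import numpy as np

def hz_to_mel(hz):
    # HTK-style mel formula (one of several conventions in use).
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        left, center, right = hz_points[m:m + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_spectrogram(waveform, sample_rate=16000, n_fft=512, hop=160, n_mels=40):
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack([
        waveform[i * hop:i * hop + n_fft] * window for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2             # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T   # (n_frames, n_mels)
    return np.log10(np.maximum(mel, 1e-10)).T                    # (n_mels, n_frames)

# One second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000
tone = np.sin(2 * np.pi * 440.0 * t)
features = log_mel_spectrogram(tone)
print(features.shape)  # (40, 97): 40 mel bands, 97 time frames
```

For a pure tone, nearly all the energy lands in the few mel bands around the tone's frequency, which is the kind of clear pattern a neural network can pick up.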

---

#### **3️⃣ Acoustic Modeling**

Neural networks convert sound patterns into **phoneme probabilities** or **token embeddings**.

Modern models use:

* CNNs
* RNNs / LSTMs
* Transformers
* Conformer blocks
* Encoder–decoder Transformers such as Whisper
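To make "phoneme probabilities" concrete, here is a toy sketch of the last step of an acoustic model: turning per-frame scores (logits) into a probability distribution with a softmax. The phoneme inventory and the logit values are made up for illustration; a real network would produce the logits from the spectrogram.

```python
import math

PHONEMES = ["h", "e", "l", "o", "<blank>"]  # hypothetical toy inventory

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three consecutive audio frames.
frame_logits = [
    [4.0, 0.5, 0.1, 0.2, 1.0],
    [0.3, 3.5, 0.2, 0.1, 1.0],
    [0.1, 0.2, 3.8, 0.4, 1.0],
]
frame_probs = [softmax(row) for row in frame_logits]

for probs in frame_probs:
    best = max(range(len(probs)), key=probs.__getitem__)
    print(PHONEMES[best], round(probs[best], 2))
```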

---

#### **4️⃣ Sequence Modeling**

Speech is time-dependent; models must learn:

* sequence order
* pronunciation patterns
* long-range dependencies

Common approaches:

* RNNs (older)
* CTC-trained models
* Encoder–Decoder Transformers
* Conformers (among the strongest current architectures)

---

#### **5️⃣ Decoding (Language Modeling)**

Converts the model's output probabilities → text.

Decoding strategies:

* Greedy decoding
* Beam search
* CTC beam search
* Token prediction (autoregressive like Whisper)

A language model helps fix:

* spelling
* grammar
* missing words
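The difference between greedy decoding and beam search can be shown with a toy example. The `TOY_MODEL` table below is entirely hypothetical: it stands in for a real decoder by listing the probability of each next token given the prefix so far. Greedy decoding is simply beam search with a beam width of 1.

```python
import math

# Hypothetical toy model: next-token probabilities given the prefix.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, steps, beam_width):
    """Keep the `beam_width` best prefixes at every step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

greedy_tokens, greedy_score = beam_search(TOY_MODEL, steps=2, beam_width=1)
beam_tokens, beam_score = beam_search(TOY_MODEL, steps=2, beam_width=2)

print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

Greedy decoding commits to "A" because it looks best at the first step, while a beam of width 2 keeps "B" alive and finds the sequence with the higher overall probability.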

---

#### **6️⃣ Output Text**

Final transcript after post-processing:

* punctuation insertion
* casing
* removing filler words
* grammar correction
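A very small rule-based sketch of this post-processing step is shown below. The filler list and the rules are assumptions made for the example; production systems usually rely on trained punctuation and casing models rather than hand-written rules.

```python
FILLER_WORDS = {"um", "uh", "erm"}  # hypothetical filler list for illustration

def postprocess(raw_transcript):
    """Remove fillers, capitalize the first letter, and add a final period."""
    words = [w for w in raw_transcript.lower().split() if w not in FILLER_WORDS]
    if not words:
        return ""
    text = " ".join(words)
    text = text[0].upper() + text[1:]
    if text[-1] not in ".?!":
        text += "."
    return text

print(postprocess("um hello uh how are you"))  # Hello how are you.
```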

---

### Modern Architectures for Speech-to-Text

#### CTC-Based Models (classical)

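CTC stands for Connectionist Temporal Classification.
The model emits one label distribution per audio frame (characters plus a special blank symbol) and treats frames as conditionally independent; repeated labels are then merged and blanks removed to form the transcript.

Pros: fast, non-autoregressive decoding
Cons: typically lower accuracy unless paired with an external language model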
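The CTC output rule (merge repeated labels, then drop blanks) is easy to sketch. The blank symbol `_` and the frame probabilities below are made up for the example; only the collapse rule itself comes from CTC.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol for this example

def ctc_collapse(path, blank=BLANK):
    """Merge consecutive repeats, then remove blanks."""
    merged = [label for label, _ in groupby(path)]
    return "".join(label for label in merged if label != blank)

def ctc_greedy_decode(frame_probs, blank=BLANK):
    """Pick the most likely label in each frame, then collapse the path."""
    best_path = [max(frame, key=frame.get) for frame in frame_probs]
    return ctc_collapse(best_path, blank)

# A blank between the two "l" groups keeps the double letter.
print(ctc_collapse("hh_e_ll_lo"))  # hello

# Made-up per-frame probabilities over the labels "a", "b", and blank.
frames = [
    {"a": 0.7, "b": 0.1, "_": 0.2},
    {"a": 0.6, "b": 0.1, "_": 0.3},
    {"a": 0.1, "b": 0.1, "_": 0.8},
    {"a": 0.2, "b": 0.7, "_": 0.1},
]
print(ctc_greedy_decode(frames))  # ab
```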

---

#### Encoder-Decoder Models (seq2seq)

Speech → encoder → decoder predicts text tokens

Examples:

* LAS (Listen-Attend-Spell)
* RNN-T (Recurrent Neural Network Transducer), a related transducer design that pairs the audio encoder with a label prediction network

---

#### Conformer Models

Combination of:

* Convolution (local features)
* Transformer (global long-term dependencies)

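Used in:

* Google’s ASR

Whisper, by contrast, uses a standard Transformer encoder-decoder rather than Conformer blocks.

Conformer-based models are among the strongest current architectures for speech.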
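The "local vs. global" distinction can be illustrated with a NumPy sketch of the two ingredients in isolation. This is not a Conformer block: a real one adds feed-forward modules, multi-head attention with learned projections, normalization, and more. The kernel, feature sizes, and random input here are assumptions for the example.

```python
import numpy as np

def local_conv(frames, kernel):
    """1D convolution over time: each output frame mixes only its neighbours."""
    pad = len(kernel) // 2
    padded = np.pad(frames, ((pad, pad), (0, 0)))
    return np.stack([
        (padded[t:t + len(kernel)] * kernel[:, None]).sum(axis=0)
        for t in range(frames.shape[0])
    ])

def self_attention(frames):
    """Single-head scaled dot-product attention with identity projections."""
    scores = frames @ frames.T / np.sqrt(frames.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ frames, weights

rng = np.random.default_rng(0)
frames = 0.1 * rng.normal(size=(50, 8))   # 50 time steps, 8 features each
kernel = np.array([0.25, 0.5, 0.25])      # looks one frame left and right

conv_out = local_conv(frames, kernel)
attn_out, attn_weights = self_attention(frames)

# Change a distant frame and see which operation notices at time step 0.
perturbed = frames.copy()
perturbed[40] += 1.0
conv_changed = not np.allclose(local_conv(perturbed, kernel)[0], conv_out[0])
attn_changed = not np.allclose(self_attention(perturbed)[0][0], attn_out[0])
print(conv_changed, attn_changed)
```

The convolution output at step 0 is unaffected by a change 40 frames away, while the attention output changes, because every frame attends to every other frame.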

---

#### Whisper (OpenAI)

Whisper is one of the most widely used open-source STT models:

* trained on 680,000 hours of multilingual, weakly supervised audio
* robust to noise, accents, and background sounds
* can run locally, without a network connection

Pipeline:

```
Audio → Log-Mel → Encoder → Decoder → Text
```

Whisper is used as the foundation for many current STT products.
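The tensor sizes along the Whisper pipeline follow from a little arithmetic. The constants below reflect the published Whisper description as I understand it (30-second windows, 10 ms frame stride, 80 mel channels, a stride-2 convolution in the encoder stem); treat them as assumptions and check the configuration of the specific checkpoint you use.

```python
SAMPLE_RATE = 16_000   # Hz
CHUNK_SECONDS = 30     # audio is padded or trimmed to 30-second windows
HOP_LENGTH = 160       # samples between spectrogram frames (10 ms)
N_MELS = 80            # mel channels (the large-v3 checkpoint uses 128)

samples_per_chunk = SAMPLE_RATE * CHUNK_SECONDS   # 480000 samples
mel_frames = samples_per_chunk // HOP_LENGTH      # 3000 frames
encoder_positions = mel_frames // 2               # 1500 after the stride-2 conv

print((N_MELS, mel_frames))   # log-mel input shape: (80, 3000)
print(encoder_positions)      # 1500
```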

---

### Intuition Behind Speech-to-Text

#### Think of speech as a changing picture over time.

Each short slice of audio becomes one column of pixels.

Spectrogram → a 2D “image of sound” (time on one axis, frequency on the other).

Transformers and CNNs can “read” this image to identify:

* pitch
* phonemes
* words

---

### Example Using Whisper (Python Demo)

```bash
pip install transformers torch librosa soundfile
```

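```python
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

audio_path = "sample.wav"
# Whisper expects 16 kHz mono audio; librosa resamples and downmixes on load.
audio, sr = librosa.load(audio_path, sr=16000, mono=True)

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Convert the waveform into log-mel input features.
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# Generate token IDs without tracking gradients (inference only).
with torch.no_grad():
    predicted = model.generate(inputs.input_features)

# Decode token IDs to text, dropping special tokens such as <|endoftext|>.
text = processor.decode(predicted[0], skip_special_tokens=True)
print(text)
```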

---

### Why Large Transformer-Based STT Models (Whisper, GPT-4o) Are So Good

* Trained on huge multilingual datasets
* Use transformers (excellent at sequence modeling)
* Language model improves grammar
* Noise robustness
* Punctuation and casing included
* Zero-shot adaptability

---

**One-Sentence Summary**

**Speech-to-Text converts audio into text using feature extraction (spectrograms), acoustic modeling, transformers/conformers, and language decoding to produce accurate transcripts.**