```{contents}
```

## Embeddings

### **1. Text Embeddings (Token-Level Embeddings)**

#### **What they represent**

**Text embeddings represent individual tokens (words, subwords, characters) as vectors.**

These embeddings come from a learned **embedding matrix** inside the LLM, with one row per token in the vocabulary.

Example:

```
"apple" → [0.12, -0.21, 0.59, ...]    (4096-dimensional vector)
"apples" → [0.11, -0.20, 0.58, ...]
```

Even punctuation has embeddings:

```
"," → [0.03, 0.8, ...]
```

Tokens are typically produced by a subword tokenizer, such as:

* BPE (Byte-Pair Encoding)
* WordPiece
* SentencePiece

Token embeddings of this kind are used inside **GPT, LLaMA, Mistral, T5**, etc.

---

#### **Purpose**

Text embeddings allow models to:

* encode **semantics**
* encode **syntax**
* encode **word relationships**
* reason over token sequences

---

#### Example (GPT internal text embeddings)

Before any transformer layer:

```
token_id → embedding vector
embedding + position_embed → transformer → prediction
```

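These embeddings are **context-free**: each one represents only the token itself, so the same token always maps to the same vector.
Context enters as the sequence passes through the transformer layers, where self-attention mixes information across tokens.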
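
The lookup step can be sketched in a few lines of NumPy. This is a minimal illustration, not any model's actual code: the sizes, the random matrices, and the token ids are all made up, and the additive learned position embedding follows the GPT-2 style (other models, such as LLaMA, inject position differently).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, far smaller than a real LLM (assumption for illustration)
vocab_size, max_len, d_model = 100, 16, 8

# Learned parameters in a real model; random stand-ins here
token_embed = rng.normal(size=(vocab_size, d_model))   # one row per token id
position_embed = rng.normal(size=(max_len, d_model))   # one row per position

def embed(token_ids):
    """Look up token vectors and add position vectors (GPT-2 style)."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))
    return token_embed[token_ids] + position_embed[positions]

ids_a = [5, 42, 7]   # hypothetical token ids
ids_b = [42, 9, 9]

# The raw lookup for token 42 is identical in both sequences: it is context-free
same_lookup = np.array_equal(token_embed[ids_a][1], token_embed[ids_b][0])

# Shape (3, 8): this is what the first transformer layer receives
x = embed(ids_a)
```

Only the position term differs when the same token appears at a different index; nothing about the neighboring tokens is visible at this stage.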

---

**Summary of Text Embeddings**

* represent *tokens*
* learned during LLM training
* used internally by GPT/LLMs
* foundation for contextual understanding

---

### **2. Sentence Embeddings (Semantic Embeddings)**

#### **What they represent**

**Sentence embeddings represent the meaning of an entire sentence, paragraph, or document.**

Example:

```
"Apple is releasing a new phone." → vector representing the whole meaning
```

Sentence embeddings are not the same as token embeddings.
They come from models trained specifically to produce one vector per input text, such as:

* Sentence-BERT
* OpenAI `text-embedding-3-small`
* E5
* Instructor XL
* Jina Embeddings

---

#### **Purpose**

Sentence embeddings enable:

* search
* clustering
* recommendations
* Retrieval-Augmented Generation (RAG)
* semantic similarity

Example:

```
"What is AI?"  ↔  "Explain artificial intelligence"
```

These sentences use different words, but their **embeddings are close** because they mean the same thing.
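
"Close" is usually measured with cosine similarity. The sketch below uses hand-written 4-dimensional vectors purely for illustration; real sentence embeddings come from a trained model and have hundreds or thousands of dimensions. The third sentence is a made-up distractor.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, hand-written for illustration only
embeddings = {
    "What is AI?": [0.9, 0.1, 0.0, 0.2],
    "Explain artificial intelligence": [0.8, 0.2, 0.1, 0.3],
    "Best pizza in town": [0.0, 0.9, 0.4, 0.0],
}

query = "What is AI?"
scores = {
    text: cosine_similarity(embeddings[query], vec)
    for text, vec in embeddings.items()
    if text != query
}

# The highest-scoring text is the nearest neighbor of the query
best_match = max(scores, key=scores.get)
```

The same nearest-neighbor lookup, run over many stored vectors, is the core of semantic search and the retrieval step in RAG.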

---

#### How they are trained

Most sentence embedding models use **contrastive learning**:

* positive pair → pull together
* negative pair → push apart

Example pairs:

```
("What is AI?", "Definition of artificial intelligence") → similar
("Cat", "Airplane") → dissimilar
```

This creates an embedding **optimized for meaning**, not next-token prediction.
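
One common way to express "pull together, push apart" is an InfoNCE-style loss with in-batch negatives. The sketch below is an assumption about the general approach, not the exact objective of any model listed above; the temperature value and the random vectors are illustrative.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive loss.

    Row i of `positives` is the match for row i of `anchors`;
    every other row in the batch acts as a negative.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                # (batch, batch) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # cross-entropy with the diagonal as target

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 16))
aligned = anchors + 0.01 * rng.normal(size=(4, 16))   # positives close to their anchors
shuffled = aligned[::-1]                               # positives paired with the wrong anchors

loss_aligned = info_nce_loss(anchors, aligned)
loss_shuffled = info_nce_loss(anchors, shuffled)
```

The loss is small when each anchor is most similar to its own positive and large when the pairing is wrong, which is the gradient signal that shapes the embedding space during training.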

---

**Summary of Sentence Embeddings**

* represent entire **sentences or documents**
* optimized for **semantic similarity**
* used in **RAG**, search, clustering, retrieval
* trained via **contrastive learning**

---

### **3. Image and Audio Embeddings (Multimodal Embeddings)**

These convert non-text data (images, audio) into embedding vectors that models can process and compare.

---

#### **A. Image Embeddings**

##### **What they represent**

Image embeddings represent **objects, scenes, textures, colors, and concepts** inside an image.

Generated by vision models:

* CLIP ViT
* Vision Transformer (ViT)
* ConvNeXt
* EfficientNet

Example:

```
dog.jpg → [0.11, 1.02, -0.55, ...]   (512-dimensional vector)
```

---

##### How image embeddings work

Pipeline:

1. Image is resized and normalized
2. Image is split into fixed-size patches (ViT)
3. Patch tokens pass through the Transformer
4. The output at the CLS token becomes the image-level embedding

This embedding captures:

* object categories
* scene context
* visual semantics

Used in:

* visual search
* CLIP similarity
* VLMs like LLaVA, BLIP-2
* generating captions
* grounding vision in LLMs
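
The first steps of the ViT pipeline above (normalize, split into patches, form the token sequence) can be sketched in NumPy. This is a simplified illustration: the image is random noise, the normalization constants and the width `d_model` are arbitrary, the projection and CLS token are random stand-ins for learned weights, and position embeddings and the Transformer itself are omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (gh, gw, patch, patch, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # stand-in for a resized RGB image in [0, 1]
image = (image - 0.5) / 0.5         # illustrative normalization to [-1, 1]

patch_size, d_model = 16, 64        # d_model is a toy width
patches = patchify(image, patch_size)   # (196, 768): 14 x 14 patches of 16*16*3 values

# Learned in a real ViT; random stand-ins here
projection = rng.normal(size=(patches.shape[1], d_model)) * 0.02
cls_token = rng.normal(size=(1, d_model)) * 0.02

# Token sequence fed to the Transformer: CLS token first, then one token per patch
tokens = np.concatenate([cls_token, patches @ projection], axis=0)   # (197, 64)
```

After the Transformer layers, the vector at position 0 (the CLS slot) would be read out as the image embedding.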

---

#### **B. Audio Embeddings**

##### **What they represent**

Audio embeddings represent:

* speech content
* speaker identity
* tone
* pitch
* emotion
* noise patterns

Generated by models like:

* Whisper
* Wav2Vec2
* HuBERT
* AudioMAE

---

##### How audio embeddings work

Pipeline:

1. Audio waveform → Mel spectrogram (Whisper, AudioMAE); Wav2Vec2 and HuBERT consume the raw waveform directly
2. Spectrogram or waveform → encoder (CNN/Transformer)
3. Output = audio embedding vector

Example:

```
voice.wav → [0.33, -0.12, 0.98, ...]
```

These embeddings can be used for:

* speech recognition
* speaker verification
* audio classification
* sound event detection
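
The spectrogram step of the pipeline above can be sketched with a short-time Fourier transform in NumPy. This is a simplified illustration under several assumptions: the input is a synthetic tone, the frame length and hop (25 ms and 10 ms at 16 kHz) are common but arbitrary choices, the Mel filterbank is omitted, and averaging over time is only a crude stand-in for a learned encoder.

```python
import numpy as np

def log_spectrogram(waveform, frame_len=400, hop=160):
    """Log-power spectrogram via a Hann-windowed short-time Fourier transform."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    # Index matrix: row i selects the samples of frame i
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = waveform[idx] * np.hanning(frame_len)     # (n_frames, frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, frame_len // 2 + 1)
    return np.log(power + 1e-10)

sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate       # one second of audio
waveform = np.sin(2 * np.pi * 440.0 * t)       # synthetic 440 Hz tone

spec = log_spectrogram(waveform)               # (98, 201): frames x frequency bins

# Strongest frequency bin, converted back to Hz (bin width = sample_rate / frame_len)
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 400

# Crude stand-in for an encoder: average over time to get one fixed-size vector
embedding = spec.mean(axis=0)                  # (201,)
```

A real model replaces the final averaging with a trained CNN or Transformer, which is what lets the vector capture speaker identity or emotion rather than just the average spectrum.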

---

### **Why These Embeddings Matter in Generative AI**

#### Text embeddings

→ foundation of LLM understanding

#### Sentence embeddings

→ backbone of RAG and enterprise search

#### Image/audio embeddings

→ enable multimodal LLMs:

* GPT-4o
* Gemini
* LLaVA
* CLIP-based VLMs

These embeddings allow models to handle:

* image→text
* text→image
* speech→text
* video understanding
* multimodal reasoning

---

**FINAL CHEAT SHEET**

| Embedding Type          | Represents            | Used For                | Examples                     |
| ----------------------- | --------------------- | ----------------------- | ---------------------------- |
| **Text Embeddings**     | Tokens                | LLM internal processing | GPT, LLaMA                   |
| **Sentence Embeddings** | Meaning of whole text | RAG, search, clustering | SBERT, E5, OpenAI embeddings |
| **Image Embeddings**    | Visual meaning        | VLMs, classification    | CLIP ViT, ViT                |
| **Audio Embeddings**    | Speech, speaker, tone | STT, speaker ID         | Whisper, Wav2Vec2            |