```{contents}
```

## Embeddings

An **embedding** is a vector representation of data.

For text:

```
The cat → [0.23, -0.11, 0.89, ...]
```

For images:

```
Image pixels → meaningful feature vector
```

For videos:

```
Sequence of frames → sequence-level embedding
```

Embeddings capture:

* meaning
* semantics
* relationships
* similarity
* context

They let computers **compare and reason about high-dimensional raw data** using **dense numerical vectors**.
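Similarity between embeddings is commonly measured with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors purely for illustration; real embeddings have hundreds of dimensions and come from a trained model, and the names `cat`, `kitten`, and `car` are assumptions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors chosen by hand for illustration; not produced by any model.
cat = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.85, 0.75, 0.2])
car = np.array([0.1, 0.2, 0.95])

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_car = cosine_similarity(cat, car)
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```

Related concepts point in nearly the same direction and score close to 1, while unrelated ones score much lower.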

---

### **Image Embeddings**

### 📌 What are Image Embeddings?

Image embeddings are **high-level feature vectors** extracted from images using a pretrained model (CNN, ViT, CLIP).

Instead of raw pixels (millions of numbers), embeddings compress the visual meaning into a vector such as:

```
Shape: (1, 512) or (1, 1024) or (1, 4096)
```
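To put "millions of numbers" in perspective, here is the arithmetic for one image, assuming a 1920×1080 RGB photo (an assumed resolution) and a 512-dimensional embedding:

```python
# Raw size of one Full HD RGB image versus a 512-dimensional embedding.
height, width, channels = 1080, 1920, 3
raw_values = height * width * channels      # one number per pixel per channel
embedding_dim = 512                         # matches the (1, 512) shape above

compression_ratio = raw_values / embedding_dim
print(raw_values, compression_ratio)
```

Under these assumptions the embedding is several thousand times smaller than the raw pixel array.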

---

### Why Image Embeddings?

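Because raw pixel values are a poor basis for comparing images: two photos of the same object can differ in almost every pixel, so pixel-level distance says little about what an image shows.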

Image embeddings allow:

* Image search
* Similarity detection
* Image classification
* Captioning
* Multi-modal tasks (vision+text)
* Clustering of images
* Face recognition
* Visual feature extraction for vision-language models such as LLaVA (which uses a CLIP vision encoder)
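Image search, the first use case above, reduces to a nearest-neighbour lookup over stored embeddings. A minimal sketch, assuming a tiny in-memory collection with made-up 4-dimensional vectors; `top_k_similar` and the image names are hypothetical:

```python
import numpy as np

def top_k_similar(query: np.ndarray, index: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k rows of `index` most similar to `query`."""
    # L2-normalise so that the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    index = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = index @ query
    # argsort is ascending, so reverse it and keep the first k.
    return np.argsort(scores)[::-1][:k]

# Hypothetical 4-dimensional embeddings for a tiny image collection.
image_names = ["dog_1", "dog_2", "car_1", "tree_1"]
index = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.8, 0.2, 0.1, 0.1],
    [0.0, 0.9, 0.3, 0.0],
    [0.1, 0.0, 0.9, 0.4],
])
query = np.array([0.85, 0.15, 0.05, 0.15])  # stands in for a new dog photo

best = top_k_similar(query, index, k=2)
print([image_names[i] for i in best])
```

For large collections a brute-force scan like this is usually replaced by an approximate nearest-neighbour index, but the principle is the same.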

---

### How Image Embeddings Are Computed

Use any pretrained **vision encoder**:

1. **CNN (ResNet, VGG)**: the older approach
2. **Vision Transformer (ViT)**: modern
3. **CLIP ViT**: well suited to multimodal tasks
4. **ConvNeXt**: a modernized CNN
5. **EfficientNet**: a CNN optimized for efficiency

### Pipeline:

```
Image → Preprocess (resize/normalize)
      → Vision model
      → Extract feature vector (embedding)
```
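The preprocessing step of this pipeline can be sketched in plain NumPy. This is an illustration under stated assumptions: the mean and standard deviation are the widely used ImageNet statistics, the resize is nearest-neighbour, and `preprocess` is a hypothetical helper. In practice each pretrained model ships its own preprocessing configuration.

```python
import numpy as np

# Assumed normalisation constants (the widely used ImageNet statistics);
# each pretrained model defines its own values in its preprocessing config.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image: np.ndarray, size: int = 224) -> np.ndarray:
    """Turn an (H, W, 3) uint8 image into a (3, size, size) float32 array."""
    h, w, _ = image.shape
    # Nearest-neighbour resize via index sampling (real pipelines usually
    # use bilinear or bicubic interpolation).
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image[rows[:, None], cols[None, :]]    # (size, size, 3)
    scaled = resized.astype(np.float32) / 255.0      # pixel values to [0, 1]
    normalized = (scaled - MEAN) / STD               # per-channel normalisation
    return normalized.transpose(2, 0, 1)             # channels first

# Random pixels stand in for a decoded photo.
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
model_input = preprocess(fake_image)
print(model_input.shape, model_input.dtype)
```

The resulting array is what a vision model would typically receive, with a batch dimension added in front.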

Example shape:

```
(768,) for ViT-B/16
(512,) for CLIP ViT-B/32
(1536,) for ConvNeXt-Large
```

---

### Image Embeddings with CLIP (Python Demo)

Install:

```bash
pip install transformers pillow torch
```

Code:

```python
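from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

# Load the pretrained CLIP model and its matching preprocessor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Force 3-channel RGB so grayscale or RGBA files also work
image = Image.open("dog.jpg").convert("RGB")

# Resize, crop, and normalize the image into a batch of one tensor
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():  # no gradients needed for feature extraction
    img_emb = model.get_image_features(**inputs)

print(img_emb.shape)   # torch.Size([1, 512])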
```

---

### Intuition: What Does the Embedding Meaningfully Represent?

For an image of a **dog**:

The embedding captures attributes like:

* it’s an animal
* has fur
* looks similar to “dog” images
* is not a car, human, or tree

Thus, embeddings of similar images end up close together in the vector space.

---

### **Video Embeddings**

A video is not a single image; it is a sequence of frames **over time**.
Therefore, video embeddings must capture:

* **Spatial features** (what is in the frame)
* **Temporal features** (how things move)
* **Sequence dynamics** (actions/events)

---

### ⭐ How Video Embeddings Are Computed

There are 3 main strategies:

---

### **Strategy 1: Frame-Level Embeddings + Pooling**

1. Extract CLIP/ViT embeddings from each frame
2. Average (mean pool) them

Example:

```
Video with 10 frames → 10 image embeddings (512-dim each)
Final video embedding = average of embeddings
```

Used in:

* Quick video search
* Fast retrieval
* Lightweight systems
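The pooling step of Strategy 1 can be sketched with NumPy. Random vectors stand in for real per-frame embeddings, and normalising each frame before averaging is an optional design choice assumed here, not something the strategy requires; `mean_pool` is a hypothetical helper.

```python
import numpy as np

def mean_pool(frame_embeddings: np.ndarray) -> np.ndarray:
    """Average (num_frames, dim) frame embeddings into one unit-length (dim,) vector."""
    # Normalise each frame first so no single frame dominates the average.
    norms = np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    pooled = (frame_embeddings / norms).mean(axis=0)
    # Re-normalise so the result can be compared with cosine similarity.
    return pooled / np.linalg.norm(pooled)

# Random vectors stand in for 10 per-frame embeddings of 512 dimensions.
rng = np.random.default_rng(0)
frame_embeddings = rng.normal(size=(10, 512))
video_embedding = mean_pool(frame_embeddings)
print(video_embedding.shape)
```

The result has the same dimensionality as a single frame embedding, which is why this strategy can reuse an existing image search index.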

---

### **Strategy 2: 3D CNN Models (old but fast)**

Models:

* C3D
* I3D (DeepMind)
* R(2+1)D

These operate directly on **(T × H × W)** tensors.

They learn motion patterns:

* walking
* jumping
* running
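To illustrate how a kernel spanning time as well as space can respond to motion, here is a naive single-channel 3D convolution in NumPy. It is a sketch, not how C3D or I3D are implemented: the kernel is hand-written rather than learned, there is no channel dimension, and `conv3d_valid` is a hypothetical helper. It relies on `sliding_window_view`, available in NumPy 1.20 and later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d_valid(video: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive single-channel 3D convolution (cross-correlation, no padding)."""
    # windows has shape (T', H', W', kt, kh, kw): one block per output position.
    windows = sliding_window_view(video, kernel.shape)
    return np.einsum("thwijk,ijk->thw", windows, kernel)

# Hand-written kernel: later frame minus earlier frame, averaged over 3x3 pixels.
# A trained 3D CNN learns many such kernels instead of having them fixed.
kernel = np.zeros((2, 3, 3))
kernel[0] = -1.0 / 9.0
kernel[1] = 1.0 / 9.0

# Static clip: the same frame repeated 4 times.
frame = np.zeros((8, 8))
frame[2:5, 2:5] = 1.0
static_clip = np.stack([frame] * 4)                     # (T, H, W) = (4, 8, 8)

# Moving clip: the bright square shifts one pixel to the right per frame.
moving_clip = np.stack([np.roll(frame, t, axis=1) for t in range(4)])

static_response = conv3d_valid(static_clip, kernel)
moving_response = conv3d_valid(moving_clip, kernel)
print(static_response.shape)
print(np.abs(static_response).max(), np.abs(moving_response).max())
```

The static clip produces essentially no response, while the moving clip does, which is the kind of signal a 2D convolution applied frame by frame cannot see.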

---

### **Strategy 3: Video Transformers (state-of-the-art)**

Models:

* **ViViT**
* **TimeSformer**
* **VideoMAE**
* **X-CLIP**
* **LLaVA-Video / GPT-4V video models**

These operate on:

* frame patches
* with temporal and spatial attention

They produce embeddings whose size depends on the model, for example:

```
(1024,) or (2048,)
```

Best modern approach for:

* action recognition
* video understanding
* VLMs
* summarization
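The frame-patch input that video transformers attend over can be sketched as a reshape. This example assumes an 8-frame 224×224 RGB clip and 16×16 per-frame patches; `video_to_patch_tokens` is a hypothetical helper. Real models differ in detail (ViViT, for instance, can use tubelets spanning several frames) and follow this step with a learned projection, positional embeddings, and attention layers.

```python
import numpy as np

def video_to_patch_tokens(video: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split a (T, H, W, C) clip into flattened patches of shape
    (T * H/patch * W/patch, patch * patch * C)."""
    t, h, w, c = video.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    gh, gw = h // patch, w // patch
    # (T, gh, patch, gw, patch, C) -> (T, gh, gw, patch, patch, C)
    blocks = video.reshape(t, gh, patch, gw, patch, c).transpose(0, 1, 3, 2, 4, 5)
    return blocks.reshape(t * gh * gw, patch * patch * c)

# Assumed clip size: 8 frames of 224x224 RGB; zeros stand in for pixels.
clip = np.zeros((8, 224, 224, 3), dtype=np.float32)
tokens = video_to_patch_tokens(clip)
print(tokens.shape)
```

Each row is one token. Temporal attention relates tokens at the same position across frames, and spatial attention relates tokens within a frame.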

---

### Video Embedding Example (Frames + CLIP)

```bash
pip install opencv-python transformers torch
```

```python
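import cv2
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

cap = cv2.VideoCapture("video.mp4")
if not cap.isOpened():
    raise IOError("Could not open video.mp4")

embeddings = []

# Embed every frame; for long videos this is slow, so consider sampling frames
while True:
    ret, frame = cap.read()
    if not ret:  # end of video or read error
        break

    # OpenCV decodes frames as BGR; CLIP expects RGB
    image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)  # (1, 512)
    embeddings.append(emb)

cap.release()  # free the video file handle

if not embeddings:
    raise ValueError("No frames were read from the video")

# Stack to (num_frames, 1, 512) and average over the frame axis
video_embedding = torch.mean(torch.stack(embeddings), dim=0)

print(video_embedding.shape)  # torch.Size([1, 512])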
```
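Embedding every frame is expensive for long videos, and neighbouring frames are often nearly identical. A common alternative is to embed only a fixed number of evenly spaced frames. The helper below is a hypothetical sketch of choosing those frame indices; the frame count and sample count are assumed values.

```python
import numpy as np

def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick up to `num_samples` frame indices spread evenly over the clip."""
    if total_frames <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    # linspace includes both endpoints, so the first and last frame are kept.
    return np.linspace(0, total_frames - 1, num_samples).round().astype(int).tolist()

# A 10-second clip at 30 fps has 300 frames; keep 8 of them.
indices = sample_frame_indices(total_frames=300, num_samples=8)
print(indices)
```

With OpenCV, the total can be read via `cap.get(cv2.CAP_PROP_FRAME_COUNT)` and a specific frame requested with `cap.set(cv2.CAP_PROP_POS_FRAMES, i)`, though seeking accuracy depends on the video codec.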

---

### Why Video Embeddings Are Harder Than Image Embeddings

Videos require:

* long temporal context
* understanding motion
* memory across frames
* action recognition

Example:

```
Frame 1: man holding ball  
Frame 2: man moves arm  
Frame 3: ball leaves hand  
```

Action = “throwing a ball”.

A single image can’t capture that.
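The same example shows why simple mean pooling of frame embeddings falls short: averaging discards frame order. The sketch below uses made-up 4-dimensional frame embeddings (the vectors and variable names are assumptions) and compares a clip with its reversed version.

```python
import numpy as np

# Hypothetical 4-dimensional frame embeddings for the three frames above.
holding = np.array([1.0, 0.0, 0.0, 0.2])
arm_moves = np.array([0.6, 0.7, 0.0, 0.2])
ball_released = np.array([0.1, 0.6, 0.9, 0.2])

throwing = np.stack([holding, arm_moves, ball_released])
reversed_clip = throwing[::-1]            # same frames, played backwards

# Mean pooling ignores frame order, so both clips get the same embedding.
same_after_pooling = np.allclose(throwing.mean(axis=0), reversed_clip.mean(axis=0))

# Frame-to-frame differences keep the direction of change, so they differ.
forward_motion = np.diff(throwing, axis=0).mean(axis=0)
backward_motion = np.diff(reversed_clip, axis=0).mean(axis=0)
motion_differs = not np.allclose(forward_motion, backward_motion)

print(same_after_pooling, motion_differs)
```

Distinguishing "throwing" from its reverse therefore needs a representation that is sensitive to order, which is what 3D CNNs and video transformers provide.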

---

**Summary Table**

| Feature    | Image Embeddings            | Video Embeddings                  |
| ---------- | --------------------------- | --------------------------------- |
| Input      | Single image                | Multiple frames over time         |
| Model      | CNN / ViT / CLIP            | Video Transformer / 3D CNN        |
| Captures   | Objects / scenes            | Motion + actions                  |
| Used for   | Search, VLM, classification | Action recognition, summarization |
| Difficulty | Easy                        | Hard                              |

---

**Final One-Sentence Summary**

**Image embeddings capture visual features from a single image, while video embeddings capture both visual features and temporal motion dynamics across multiple frames, enabling deep understanding of scenes, actions, and events.**