
# Self-Supervised Learning Lab (2h): DINOv2 & Wav2Vec2‑BERT Embeddings + PCA
**Runtime:** Jupyter with GPU • **Libraries:** PyTorch, torchvision, HuggingFace transformers, datasets, scikit-learn, matplotlib
**Learning goals**
- Understand the intuition behind self-supervised pretraining (contrastive, self-distillation).
- Extract embeddings from a **vision SSL model** (DINOv2) and an **audio SSL model** (Wav2Vec2‑BERT).
- Reduce embeddings with **PCA** and visualize the **first 3 components**.
- Interpret what principal components capture for images (patch tokens) vs. audio (frame features).




## 0) Setup & Installs

In [None]:

# On a fresh environment, run the installs below; comment them out once the packages are present.
!pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip install --upgrade transformers datasets accelerate soundfile librosa scikit-learn matplotlib pillow timm

# Pin transformers to the version used in the teaching environment.
!pip install --upgrade "transformers==4.43.3"




In [None]:
import torch, torchvision, transformers, datasets, sklearn, matplotlib, platform

In [None]:
print(torch.__version__, torch.version.cuda, platform.platform())

### GPU Sanity Check

In [None]:

# Quick CUDA test
x = torch.randn(8192, 8192, device='cuda' if torch.cuda.is_available() else 'cpu')
y = x @ x.T
print("Matmul OK on", y.device)
del x, y
torch.cuda.empty_cache()



## 1) SSL Intuition
- **DINOv2**: self-distillation with no labels. A student network is trained to match the outputs of a slowly updated teacher network across multiple augmented views of the same image.
- **Wav2Vec2‑BERT**: contrastive and masked-prediction pretraining on unlabeled speech; Transformer layers turn local frame features into contextualized ones.
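
To make the self-distillation idea concrete, here is a minimal NumPy sketch of an image-level, DINO-style objective: the teacher output is centered and sharpened, the student is trained to match it with a cross-entropy loss, and the teacher weights follow the student as an exponential moving average. The temperatures, the momentum, and the function names are illustrative assumptions, and DINOv2 itself combines this kind of term with further losses, so treat it as intuition rather than the actual training code.

```python
import numpy as np

def softmax(logits, temperature):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distillation_loss(student_logits, teacher_logits, center,
                           student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between a centered, sharpened teacher and the student."""
    # Teacher probabilities act as fixed targets (no gradient flows through them).
    teacher_probs = softmax(teacher_logits - center, teacher_temp)
    student_log_probs = np.log(softmax(student_logits, student_temp) + 1e-12)
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())

def ema_update(teacher_weights, student_weights, momentum=0.996):
    """Teacher weights track the student as an exponential moving average."""
    return momentum * teacher_weights + (1.0 - momentum) * student_weights

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 16))  # teacher outputs for one view of 4 images
student_logits = rng.normal(size=(4, 16))  # unrelated student outputs for another view
center = teacher_logits.mean(axis=0)       # running mean of teacher outputs in practice

loss_mismatch = self_distillation_loss(student_logits, teacher_logits, center)
loss_match = self_distillation_loss(teacher_logits - center, teacher_logits, center)
print("student disagrees with teacher:", round(loss_mismatch, 3))
print("student agrees with teacher:   ", round(loss_match, 3))
```

The loss is much lower when the student ranks the output dimensions the same way as the teacher, which is the signal that pulls the two views of an image toward a shared representation.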

**Checkpoint Q1 (short answer, 2–3 sentences):**  
Why do SSL models often yield features that transfer well with small labeled datasets?


## 2) Vision: DINOv2 Embeddings & PCA (35 min)


### 2.1 Load Images
You can either:
- Use the provided sample URLs, or
- Replace with your own images (local paths).

> **Note:** The sample set includes a **cityscape**, a **human portrait**, and an **animal** photo (dog). Replace any of them with your own URLs or local files.


In [None]:

from pathlib import Path
from PIL import Image
import requests, io

# Choose a few images (feel free to replace)
IMAGE_URLS = [
    "https://images.unsplash.com/photo-1504196606672-aef5c9cefc92",  # city
    "https://images.unsplash.com/photo-1529626455594-4ff0802cfb7e",  # human
    "https://images.unsplash.com/photo-1517841905240-472988babdf9",  # animal
]


def load_image_from_url(url):
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    return Image.open(io.BytesIO(r.content)).convert("RGB")

images = [load_image_from_url(u) for u in IMAGE_URLS]
len(images), images[0].size


### 2.2 Load DINOv2 model & processor

In [None]:

import torch
from transformers import AutoImageProcessor, AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Other variants: facebook/dinov2-base, facebook/dinov2-large, facebook/dinov2-giant
MODEL_ID_VISION = "facebook/dinov2-small"

image_processor = AutoImageProcessor.from_pretrained(MODEL_ID_VISION)
vision_model = AutoModel.from_pretrained(MODEL_ID_VISION).to(device).eval()



### 2.3 Extract Embeddings (CLS & Patch Tokens)

In [None]:

import torch
import numpy as np

@torch.no_grad()
def get_dino_features(pil_img):
    inputs = image_processor(images=pil_img, return_tensors="pt").to(device)
    outputs = vision_model(**inputs)
    # ViT-style models return last_hidden_state with shape [B, tokens, dim];
    # the CLS token is at index 0 and the patch tokens follow.
    tokens = outputs.last_hidden_state[0].cpu().numpy()  # [tokens, dim]
    cls = tokens[0]       # [dim]
    patches = tokens[1:]  # [num_patches, dim]
    return cls, patches

vision_cls = []
vision_patches = []
for img in images:
    c, p = get_dino_features(img)
    vision_cls.append(c)
    vision_patches.append(p)

vision_cls = np.stack(vision_cls)   # [N_images, dim]
print("CLS shape:", vision_cls.shape)
print("Patch tokens for img0:", vision_patches[0].shape)
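
A quick way to sanity-check the shapes printed above is to work out how many tokens a ViT should produce for a given input size. The sketch below assumes a square 224×224 crop and a patch size of 14; both values are assumptions here, so compare them against your processor and model configuration (for example `image_processor.crop_size`) before relying on the numbers. The helper name is hypothetical.

```python
def expected_token_count(crop_size=224, patch_size=14, has_cls=True):
    """Return (grid_side, num_patches, num_tokens) for a square input crop."""
    if crop_size % patch_size != 0:
        raise ValueError("crop_size must be a multiple of patch_size")
    grid_side = crop_size // patch_size      # patches along each image side
    num_patches = grid_side * grid_side      # total patch tokens
    num_tokens = num_patches + (1 if has_cls else 0)
    return grid_side, num_patches, num_tokens

grid_side, num_patches, num_tokens = expected_token_count()
print(grid_side, num_patches, num_tokens)  # 16 256 257
```

If the patch count you observe is not a perfect square, the input was probably not cropped to a square, and the grid shape should be passed explicitly when reshaping tokens into an image.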


### 2.4 PCA → First 3 Components as RGB (per image)

In [None]:

import math
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

def pca_to_rgb(features_2d):
    # features_2d: [num_tokens, dim]
    pca = PCA(n_components=3, random_state=0)
    comps = pca.fit_transform(features_2d)  # [num_tokens, 3]
    # normalize to [0,1] for display
    # np.ptp is used because the ndarray.ptp method was removed in NumPy 2.0
    comps = (comps - comps.min(0)) / (np.ptp(comps, axis=0) + 1e-8)
    return comps  # as "RGB" per token

def tokens_to_grid(tokens, grid_hw=None):
    # Infer a square-ish grid if not provided
    if grid_hw is None:
        L = tokens.shape[0]
        h = w = int(math.sqrt(L))
        if h * w != L:
            # fallback to nearest rectangle (e.g., 14x16 for L=224)
            for hh in range(h, h+10):
                if L % hh == 0:
                    h = hh
                    w = L // hh
                    break
        return tokens.reshape(h, w, -1), (h, w)
    else:
        h, w = grid_hw
        return tokens.reshape(h, w, -1), (h, w)

for idx, (img, patch_feats) in enumerate(zip(images, vision_patches)):
    rgb = pca_to_rgb(patch_feats)                 # [num_patches, 3]
    grid, (H, W) = tokens_to_grid(rgb)
    plt.figure(figsize=(4,4))
    plt.title(f"DINOv2 Patch PCA RGB (Image {idx})")
    plt.imshow(grid)
    plt.axis("off")
    plt.show()
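
To see what the PCA step computes, here is a NumPy-only sketch that performs the same projection with a singular value decomposition and then applies the min-max scaling used for display. It runs on random stand-in features rather than real DINOv2 tokens, and the function name is hypothetical. The sign of each component is arbitrary in this version, so colors may be flipped relative to scikit-learn's output.

```python
import numpy as np

def pca_rgb_numpy(features, n_components=3):
    """Project [num_tokens, dim] features onto their top principal components
    and min-max scale each component to [0, 1]."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    comps = centered @ vt[:n_components].T
    # Fraction of total variance captured by each component.
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    comps = (comps - comps.min(axis=0)) / (np.ptp(comps, axis=0) + 1e-8)
    return comps, explained[:n_components]

rng = np.random.default_rng(0)
fake_patches = rng.normal(size=(256, 384))  # stand-in for one image's patch tokens
rgb, explained = pca_rgb_numpy(fake_patches)
rgb_grid = rgb.reshape(16, 16, 3)           # 16x16 patch grid, one color per patch
print(rgb_grid.shape, explained)
```

The explained-variance fractions are worth printing for real features too: if the first three components carry only a small share of the variance, the RGB map shows a correspondingly small slice of what the tokens encode.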


### 2.5 2D Scatter of Patch Tokens (PCA 2D)

In [None]:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

for idx, patch_feats in enumerate(vision_patches):
    pca2 = PCA(n_components=2, random_state=0).fit_transform(patch_feats)
    plt.figure(figsize=(4,4))
    plt.title(f"DINOv2 Patch Tokens PCA 2D (Image {idx})")
    plt.scatter(pca2[:,0], pca2[:,1], s=10, alpha=0.7)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.show()



**Checkpoint Q2 (short answer):**  
In your PCA‑RGB patch maps, what visual regions (e.g., edges, textures, objects) appear grouped or contrasted along the first 1–2 components? Why might DINOv2 emphasize those?


## 3) Audio: Wav2Vec2‑BERT Embeddings & PCA (35 min)


### 3.1 Load Audio
Use a short audio clip, ideally mono at 16 kHz (the loader below downmixes and resamples otherwise). You can:
- Use the example URL below, or
- Replace with your own `.wav` file.


In [None]:

import librosa, soundfile as sf, numpy as np, requests, io, os

# Example sample (replace with your own if preferred)
AUDIO_URL = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac?download=true"

def load_audio_from_url(url, target_sr=16000):
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    data, sr = sf.read(io.BytesIO(r.content))
    if data.ndim > 1:
        data = np.mean(data, axis=1)
    if sr != target_sr:
        data = librosa.resample(data, orig_sr=sr, target_sr=target_sr)
        sr = target_sr
    return data.astype(np.float32), sr

audio, sr = load_audio_from_url(AUDIO_URL)
print("Audio length (s):", len(audio)/sr, "| sr:", sr)


### 3.2 Load Wav2Vec2‑BERT model & processor

In [None]:

import torch
from transformers import AutoFeatureExtractor, AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load "facebook/wav2vec2-base-960h" and assign the model to audio_model.
# Note: this checkpoint is a Wav2Vec2 model; Wav2Vec2-BERT is a separate architecture.
MODEL_ID_AUDIO = "facebook/wav2vec2-base-960h"  # you can try XLSR for multilingual
feat_extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID_AUDIO)
audio_model = AutoModel.from_pretrained(MODEL_ID_AUDIO).to(device).eval()



### 3.3 Extract Frame-Level Embeddings

In [None]:

@torch.no_grad()
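def get_wav2vec2_features(wav):
    # The feature extractor prepares the waveform and adds a batch dimension.
    inputs = feat_extractor(wav, sampling_rate=16000, return_tensors="pt").to(device)
    outputs = audio_model(**inputs)
    last_hidden = outputs.last_hidden_state[0].cpu().numpy()
    return last_hidden  # [frames, dim]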

audio_feats = get_wav2vec2_features(audio)
audio_feats.shape
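
To check that the `[frames, dim]` shape is plausible, you can estimate the number of frames from the number of audio samples. The sketch below assumes the kernel sizes and strides of the published wav2vec 2.0 base feature encoder; they are hard-coded assumptions here, so compare them with `audio_model.config.conv_kernel` and `audio_model.config.conv_stride` for the checkpoint you actually load. The helper name is hypothetical.

```python
# Kernel sizes and strides of the convolutional feature encoder
# (assumed from the published wav2vec 2.0 base configuration).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def expected_num_frames(num_samples):
    """Output length after a stack of unpadded 1D convolutions."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length

print(expected_num_frames(16000))  # 49
```

The strides multiply to 320 samples, so at 16 kHz each frame advances by 20 ms and one second of audio yields about 49 frames.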


### 3.4 PCA → 2D/3D & Time Plot

In [None]:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

pca2 = PCA(n_components=2, random_state=0).fit_transform(audio_feats)
pca3 = PCA(n_components=3, random_state=0).fit_transform(audio_feats)

# 2D scatter colored by time
t = np.arange(len(pca2))
plt.figure(figsize=(5,4))
plt.title("Wav2Vec2 Frame PCA (2D) — colored by time")
plt.scatter(pca2[:,0], pca2[:,1], c=t, s=5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.colorbar(label="frame index")
plt.show()

# Each PC over time
plt.figure(figsize=(6,3))
plt.title("First 3 PCs over time")
for i in range(3):
    plt.plot(pca3[:, i], label=f"PC{i+1}")
plt.legend()
plt.xlabel("Frame index")
plt.ylabel("Component value")
plt.show()
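
The plots above use the frame index on the time axis. If you would rather read them in seconds, a small helper can convert indices to approximate timestamps. This sketch assumes the frames are evenly spaced over the clip, which is only approximately true at the edges; the function name is hypothetical.

```python
import numpy as np

def frame_times(num_frames, num_samples, sample_rate=16000):
    """Approximate center time in seconds of each frame, assuming even spacing."""
    duration = num_samples / sample_rate
    hop = duration / num_frames                 # seconds covered by one frame
    return (np.arange(num_frames) + 0.5) * hop  # center of each frame

times = frame_times(num_frames=49, num_samples=16000)
print(times[:3], times[-1])
```

In the notebook, `frame_times(len(pca3), len(audio), sr)` can be passed as the x values to `plt.plot` so that peaks in a component line up with what you hear in the recording.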



## 4) Compare & Reflect
**Checkpoint Q3:**  
Compare PC1–PC3 between vision (patch tokens) and audio (frame features).  
- What kinds of structure do you see (e.g., edges/regions vs. phonetic/energy changes)?  
- How do augmentations in SSL pretraining encourage such separations?

**Stretch Thought:**  
If you swapped PCA for t‑SNE or UMAP, what tradeoffs would you expect? When would PCA be preferable?



## 5) Troubleshooting & Tips
- **Slow downloads**: Hugging Face models cache after first use. Use a stable internet connection.
- **CUDA OOM**: Try smaller models (e.g., `facebook/dinov2-small`) or reduce batch sizes / image count.
- **Different sampling rates**: Always resample audio to 16 kHz for Wav2Vec2.
- **Reproducibility**: Set `random_state=0` in PCA; seed torch/np if needed.
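
For the reproducibility tip, a small seeding helper along these lines is usually enough for this lab. It is a sketch: the function name is hypothetical, torch is seeded only if it can be imported, and some GPU kernels can still produce tiny run-to-run differences.

```python
import random
import numpy as np

def seed_everything(seed=0):
    """Seed Python, NumPy and (if installed) PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return  # torch is optional for this helper
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

seed_everything(0)
first = np.random.rand(3)
seed_everything(0)
second = np.random.rand(3)
print(np.allclose(first, second))  # True
```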



## 6) Deliverables (submit with this notebook)
1. A PCA‑RGB patch map for at least **two** images.
2. One **2D patch-token scatter** for one image.
3. A **2D/3D PCA visualization** for audio frames and a **PCs-over-time** plot.
4. Brief answers to **Q1–Q3** (2–3 sentences each).



## 7) (Bonus) Experiments if you have time
- Try another DINOv2 size (`small`, `large`) and note differences.
- Swap the audio model for `facebook/wav2vec2-base` or `facebook/wav2vec2-large-robust` and compare PCA spread.
- Pooling strategies: CLS vs. mean of patches/frames; layer-wise comparisons.
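
For the pooling experiment, the comparison can be sketched as follows: mean-pool the patch (or frame) tokens into one vector and compare it with another summary vector by cosine similarity. The arrays here are random stand-ins with assumed shapes and the helper names are hypothetical; substitute your own `vision_cls` and `vision_patches` entries.

```python
import numpy as np

def mean_pool(tokens):
    """Average a [num_tokens, dim] array into a single [dim] vector."""
    return tokens.mean(axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
cls_vec = rng.normal(size=384)              # stand-in for one image's CLS embedding
patch_tokens = rng.normal(size=(256, 384))  # stand-in for the same image's patch tokens

pooled = mean_pool(patch_tokens)
sim = cosine_similarity(cls_vec, pooled)
print(pooled.shape, round(sim, 3))
```

With real features, a useful comparison is whether images of the same kind are closer under CLS embeddings or under mean-pooled patch embeddings.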
