# 📊 Group 35 – Cosine Gap Analysis

- Muhammad Bazaf Shakeel (26100146)  
- Sulaiman Ahmad (26100350)  

Welcome to the **Cosine Gap Evaluation Notebook** for **Group 35**. In this notebook, we perform a focused analysis on how well pretrained ViCLIP embeddings separate matching and non-matching video-caption pairs—measured using the **cosine similarity gap**.

---

### Objective

While other notebooks in this project implement architectural improvements (e.g., cross-attention, fusion transformers), this notebook **does not modify the base ViCLIP structure**.  
Instead, it evaluates representational quality using a **cosine-based metric**, helping us understand:

- How well the pretrained model performs **after fine-tuning**
- The impact of different **loss functions** (InfoNCE vs. HNAC) on embedding space separation

---

### In This Notebook, We:

1. **Fine-tune the pretrained ViCLIP model** on a small, custom dataset (~50 video-caption pairs)
2. Evaluate representation quality via the **Cosine Similarity Gap**:
   - Defined as: `Mean(Pos Pair Similarity) − Mean(Neg Pair Similarity)`
   - Higher values indicate better semantic separation
3. Compare the effect of **standard InfoNCE loss** vs. **Hard Negative-Aware Contrastive (HNAC) loss** on this metric

---

This notebook sets the foundation for deeper architectural exploration (e.g., cross-modal fusion, sampling strategies) conducted in later phases of the project.

### Initial Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import sys
sys.path.append("/content/drive/MyDrive/InternVid")

In [None]:
%cd /content/drive/MyDrive/InternVid

/content/drive/MyDrive/InternVid


Installing Dependencies

In [None]:
!pip install ftfy



Importing Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import os
import cv2
import pandas as pd
import torch
import torch.nn.functional as F
import random
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
import gc
import torch
import warnings

warnings.filterwarnings("ignore", category=UserWarning)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

try:
    from viclip import get_viclip, retrieve_text, _frame_from_video
except:
    from .viclip import get_viclip, retrieve_text, _frame_from_video



###  Model Configuration

We define the configuration for the pretrained ViCLIP model, specifying model size and checkpoint path.


In [None]:
model_cfgs = {
    'viclip-b-internvid-10m-flt': {
        'size': 'l',
        'pretrained': 'viclip/ViClip-InternVid-10M-FLT.pth',
    }
}

### VideoCaptionDataset Class

We define a custom PyTorch `Dataset` to:
- Load video clips using OpenCV
- Extract frames using a chosen strategy
- Pair them with their corresponding captions from the dataset


In [None]:
class VideoCaptionDataset(Dataset):
    def __init__(self, df, video_dir, frame_extractor):
        """
        df          : DataFrame with columns ['YoutubeID','Caption']
        video_dir   : path where <YoutubeID>.mp4 clips live
        frame_extractor: function to turn cv2.VideoCapture -> list of frames
        """
        self.df = df.reset_index(drop=True)
        self.video_dir = video_dir
        self.extract = frame_extractor

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        vid = row["YoutubeID"]
        cap = row["Caption"]

        # load frames
        path = os.path.join(self.video_dir, f"{vid}.mp4")
        video = cv2.VideoCapture(path)
        frames = [f for f in self.extract(video)]
        video.release()

        return frames, cap

## Cross-Attention Module

To improve modality interaction, we introduce a **Cross-Attention** mechanism between the video and text embeddings.  
This module uses a multi-head attention layer where one modality (e.g., vision) queries the other (e.g., text), allowing each to adaptively attend to features in the other.

Key features:
- Uses `nn.MultiheadAttention` for rich inter-modal interactions.
- Applies residual connection followed by Layer Normalization.
- Can be used symmetrically (video → text and text → video) during training.


In [None]:
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, embed_dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln = nn.LayerNorm(embed_dim)

    def forward(self, query, key_value):
        # Add sequence dimension if needed
        if query.dim() == 2:
            query = query.unsqueeze(1)
            key_value = key_value.unsqueeze(1)

        attn_output, _ = self.attn(query, key_value, key_value)
        return self.ln(query + attn_output).squeeze(1)

## Hard Negative-Aware Contrastive Loss (HNAC)

Standard contrastive losses treat all non-matching pairs equally as negatives. However, in video-text retrieval tasks, **false negatives** (semantically similar but unmatched captions) are common.

We address this by introducing **Hard Negative-Aware Contrastive Loss**, which:
- Applies a **soft weighting** to negative pairs based on similarity (harder negatives are down-weighted).
- Uses a decayed sigmoid function to modulate contrastive strength.
- Improves generalization by reducing over-penalization of potentially valid but unpaired samples.

This approach is especially useful in **low-data or noisy datasets** like ours.


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class HardNegativeAwareContrastiveLoss(nn.Module):
    def __init__(self, temperature=0.07, reduction='mean', hard_negative_weight=0.5):
        super().__init__()
        self.temperature = temperature
        self.reduction = reduction
        self.hard_negative_weight = hard_negative_weight

    def forward(self, video_embeddings, text_embeddings):
        """
        video_embeddings: (B, D)
        text_embeddings: (B, D)
        """
        batch_size = video_embeddings.size(0)


        video_norm = F.normalize(video_embeddings, dim=-1)
        text_norm = F.normalize(text_embeddings, dim=-1)

        sim_matrix = torch.matmul(video_norm, text_norm.T) / self.temperature

        pos_sim = torch.diag(sim_matrix)

        exp_sim = torch.exp(sim_matrix)

        weights_v2t = self._compute_negative_weights(video_norm, text_norm)

        mask = torch.eye(batch_size, device=sim_matrix.device).bool()
        exp_sim = exp_sim.masked_fill(mask, 0.0)
        weights_v2t = weights_v2t.masked_fill(mask, 0.0)


        denom_v2t = (exp_sim * weights_v2t + 1e-8).sum(dim=1)
        loss_v2t = -pos_sim + torch.log(denom_v2t + torch.exp(pos_sim))


        sim_matrix_t2v = sim_matrix.T
        pos_sim_t2v = torch.diag(sim_matrix_t2v)
        exp_sim_t2v = torch.exp(sim_matrix_t2v)
        weights_t2v = self._compute_negative_weights(text_norm, video_norm)
        weights_t2v = weights_t2v.masked_fill(mask, 0.0)
        exp_sim_t2v = exp_sim_t2v.masked_fill(mask, 0.0)
        denom_t2v = (exp_sim_t2v * weights_t2v + 1e-8).sum(dim=1)
        loss_t2v = -pos_sim_t2v + torch.log(denom_t2v + torch.exp(pos_sim_t2v))

        loss = (loss_v2t + loss_t2v) / 2

        if self.reduction == 'sum':
            return loss.sum()
        else:
            return loss.mean()

    def _compute_negative_weights(self, anchor, candidates):
        """
        Down-weight false negatives by applying a decay function on similarity.
        """
        sim_matrix = torch.matmul(anchor, candidates.T)

        weights = 1.0 - self.hard_negative_weight * torch.sigmoid(sim_matrix * 5)
        return weights.detach()

### Helper Functions

This section includes key utility functions used throughout training:

- **`normalize`**: Applies ImageNet-style normalization to image pixels.
- **`framestotensor`**: Converts a list of raw video frames to a properly shaped tensor `[1, T, C, H, W]` for ViCLIP input. Handles grayscale, RGBA, and missing frames robustly.
- **`clear_cuda`**: Frees GPU memory to avoid out-of-memory issues between runs.
- **`clip_loss`**: Computes a CLIP-style contrastive loss between video and text embeddings.
- **`custom_collate`**: A custom `collate_fn` for batching variable-length video frame sequences.


In [None]:
v_mean = np.array([0.485, 0.456, 0.406]).reshape(1,1,3)
v_std = np.array([0.229, 0.224, 0.225]).reshape(1,1,3)

def normalize(data):
    return (data/255.0-v_mean)/v_std

def frames_to_tensor(vid_list, fnum=8, target_size=(224, 224), device=torch.device('cuda')):
    assert len(vid_list) >= fnum
    step = len(vid_list) // fnum
    vid_list = vid_list[::step][:fnum]

    fixed_list = []
    for x in vid_list:
        if x is None:
            x = np.zeros((target_size[1], target_size[0], 3), dtype=np.uint8)
        elif len(x.shape) == 2:
            x = cv2.cvtColor(x, cv2.COLOR_GRAY2RGB)
        elif x.shape[2] == 1:
            x = cv2.cvtColor(x, cv2.COLOR_GRAY2RGB)
        elif x.shape[2] == 4:
            x = cv2.cvtColor(x, cv2.COLOR_RGBA2RGB)
        fixed_list.append(cv2.resize(x[:, :, ::-1], target_size))

    vid_tube = [np.expand_dims(normalize(x), axis=(0, 1)) for x in fixed_list]
    vid_tube = np.concatenate(vid_tube, axis=1)
    vid_tube = np.transpose(vid_tube, (0, 1, 4, 2, 3))
    vid_tube = torch.from_numpy(vid_tube).to(device, non_blocking=True).float()
    return vid_tube

def clear_cuda():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()

def clip_loss(vision_embeds, text_embeds, temperature=0.07):
    if vision_embeds.ndim == 2 and text_embeds.ndim == 2:
        vision_embeds = F.normalize(vision_embeds, dim=-1)
        text_embeds = F.normalize(text_embeds, dim=-1)
        logits = (vision_embeds @ text_embeds.T) / temperature
        labels = torch.arange(len(logits)).to(logits.device)
        loss_i2t = F.cross_entropy(logits, labels)
        loss_t2i = F.cross_entropy(logits.T, labels)
        return (loss_i2t + loss_t2i) / 2
    else:
        raise ValueError("Embeddings must be 2D for contrastive loss.")

def custom_collate(batch):
    frames, captions = zip(*batch)
    return list(frames), list(captions)

### Loading DataFrames and Creating DataLoaders

We begin by reading the `aes.csv` file, which contains video-caption pairs.  
The dataset is split into training (80%), validation (10%), and test (10%) subsets using `train_test_split`.

We then initialize instances of the custom `VideoCaptionDataset`, which loads videos and extracts frames using `_frame_from_video`.

Finally, PyTorch `DataLoader`s are created for each dataset split, with a custom `collate_fn` to handle variable-length video inputs.


In [None]:
aes_df = pd.read_csv("/content/drive/MyDrive/InternVid/aes.csv")

train_df, tmp_df = train_test_split(aes_df, test_size=0.2, random_state=42, shuffle=True)
val_df, test_df = train_test_split(tmp_df, test_size=0.5, random_state=42, shuffle=True)

In [None]:
video_dir = "/content/drive/MyDrive/InternVid/Aes_InternVid_Clips"
train_ds = VideoCaptionDataset(train_df, video_dir, _frame_from_video)
val_ds   = VideoCaptionDataset(val_df,  video_dir, _frame_from_video)
test_ds  = VideoCaptionDataset(test_df,  video_dir, _frame_from_video)

print(f"Sizes of Datasets → Train: {len(train_ds)}, Validation: {len(val_ds)}, Test: {len(test_ds)}")

Sizes of Datasets → Train: 40, Validation: 5, Test: 5


In [None]:
train_loader = DataLoader(train_ds, batch_size=4, shuffle=True,  num_workers=0, collate_fn=custom_collate)
val_loader   = DataLoader(val_ds,   batch_size=4, shuffle=False, num_workers=0, collate_fn=custom_collate)
test_loader  = DataLoader(test_ds,  batch_size=4, shuffle=False, num_workers=0, collate_fn=custom_collate)

### Training and Evaluation Loops

We define the core training and evaluation routines that support all architectural variants and loss functions explored in this notebook.

In [None]:
# Training Loop
def train_epoch(model, loader, optimizer, loss_fn, cross_attn=None):
    model.train()

    for frames_batch, captions in loader:
        optimizer.zero_grad()

        processed_batch = []
        for frames in frames_batch:
            processed = frames_to_tensor(frames, device=device)
            processed_batch.append(processed)

        vid_tensor = torch.cat(processed_batch, dim=0)
        vision_features = model.encode_vision(vid_tensor)
        text_features = model.encode_text(captions)

        if cross_attn:
          vision_features = cross_attn(vision_features, text_features)
          text_features = cross_attn(text_features, vision_features)

        loss = loss_fn(vision_features, text_features)
        loss.backward()
        optimizer.step()
        

# Evaluating Loop
@torch.no_grad()
def evaluate(model, loader, loss_fn, cross_attn=None):
    model.eval()
    total_loss = 0

    all_vision_embeds = []
    all_text_embeds = []

    for frames_batch, captions in loader:
        processed_batch = []
        for frames in frames_batch:
            processed = frames_to_tensor(frames, device=device)
            processed_batch.append(processed)

        vid_tensor = torch.cat(processed_batch, dim=0)
        vision_features = model.encode_vision(vid_tensor)
        text_features = model.encode_text(captions)

        if cross_attn:
            vision_features = cross_attn(vision_features, text_features)
            text_features = cross_attn(text_features, vision_features)

        loss = loss_fn(vision_features, text_features)
        total_loss += loss.item()

        all_vision_embeds.append(vision_features)
        all_text_embeds.append(text_features)

    all_vision_embeds = torch.cat(all_vision_embeds, dim=0)
    all_text_embeds = torch.cat(all_text_embeds, dim=0)

    all_vision_embeds = F.normalize(all_vision_embeds, dim=-1)
    all_text_embeds = F.normalize(all_text_embeds, dim=-1)

    sim_matrix = all_vision_embeds @ all_text_embeds.T
    N = sim_matrix.size(0)

    positive_sims = sim_matrix.diag()
    mean_pos_sim = positive_sims.mean().item()
    mask = ~torch.eye(N, dtype=torch.bool, device=sim_matrix.device)
    mean_neg_sim = sim_matrix[mask].mean().item()
    cosine_gap = mean_pos_sim - mean_neg_sim

    return cosine_gap

### Training and Validation Runner

Handles the end-to-end training and validation loop over multiple epochs.

This function:
- Trains the model on the training set and evaluates it on the validation set for each epoch.
- Supports optional **Cross-Attention** modules during both training and validation.
- Uses the passed optimizer and loss function (InfoNCE or HNAC).
- Logs training and validation loss per epoch.
- Returns two lists capturing the loss trajectories, enabling comparison and visualization.

This is the main loop used by all experimental configurations.


In [None]:
def train_and_evaluate_model(clip_model, optimizer, loss_fn, num_epochs, train_loader=train_loader, val_loader=val_loader, cross_attn=None):
  for _ in range(num_epochs):
      train_epoch(clip_model, train_loader, optimizer, loss_fn=loss_fn, cross_attn=cross_attn)
      evaluate(clip_model, val_loader, loss_fn=loss_fn, cross_attn=cross_attn)

### Model Instantiation & Training Wrapper

We define a flexible function `train_and_evaluate_cosine_gap()` to streamline the experimentation process.

This function:
- **Loads the pretrained ViCLIP model** and moves it to the appropriate device.
- Optionally adds a **Cross-Attention module** based on `use_cross_attn`.
- Selects the appropriate **loss function**: standard InfoNCE or Hard Negative-Aware Contrastive Loss (HNAC).
- Unfreezes the vision encoder to allow full model fine-tuning.
- Trains the model using `train_and_evaluate_model()` for a specified number of epochs.
- Returns training and validation losses to facilitate performance comparison.

This modular wrapper makes it easy to toggle different configurations and directly compare their effectiveness.


In [None]:
def train_and_evaluate_cosine_gap(num_epochs, use_cross_attn, use_hnac_loss):
  
  clear_cuda()
  cfg = model_cfgs['viclip-b-internvid-10m-flt']
  model_dict = get_viclip(cfg['size'], cfg['pretrained'])
  clip_model = model_dict['viclip'].to(device)

  if use_cross_attn:
    cross_attn = CrossAttention(clip_model.embed_dim).to(device)
  else:
    cross_attn = None

  for p in clip_model.vision_encoder.parameters():
      p.requires_grad = True

  optimizer = torch.optim.AdamW(
      clip_model.parameters(),
      lr=2e-5,
      weight_decay=0.02
  )

  if use_hnac_loss:
    loss_fn = HardNegativeAwareContrastiveLoss(temperature=0.07, hard_negative_weight=0.5)
  else:
    loss_fn = clip_loss

  train_and_evaluate_model(clip_model, optimizer, loss_fn, num_epochs=num_epochs,
                                                      train_loader=train_loader, val_loader=val_loader, 
                                                      cross_attn=cross_attn)
  
  cosine_gap = evaluate(clip_model, val_loader, loss_fn, cross_attn)

  return cosine_gap

In [None]:
cosine_gap_with_infonce = train_and_evaluate_cosine_gap(num_epochs=20, use_cross_attn=True, use_hnac_loss=False)
cosine_gap_with_hnac = train_and_evaluate_cosine_gap(num_epochs=20, use_cross_attn=True, use_hnac_loss=True)

print(f"Cosine Gap with InfoNCE Loss: {cosine_gap_with_infonce:.4f}")
print(f"Cosine Gap with H-NAC Loss: {cosine_gap_with_hnac:.4f}")

Cosine Gap with InfoNCE Loss: 0.1854
Cosine Gap with H-NAC Loss: 0.2036


## Evaluation: Cosine Gap Analysis

This notebook serves as a focused evaluation of **ViCLIP’s representation quality** after fine-tuning, using the **Cosine Similarity Gap** as the core metric. While other notebooks implement architectural changes (e.g., fusion transformers or sampling strategies), here we retain the pretrained ViCLIP backbone and evaluate its performance under two contrastive loss functions.

---

### 📏 What is the Cosine Gap?

The **Cosine Similarity Gap** provides a simple yet insightful measure of **semantic alignment**. It is defined as:

**Cosine Gap = Mean Similarity (Positive Pairs) − Mean Similarity (Negative Pairs)**

- A **larger gap** indicates clearer separation between matched and unmatched video-caption pairs in the learned embedding space.
- This metric is particularly useful for evaluating **embedding quality** without relying solely on loss curves.

---

### 🧪 Experimental Setup

1. The ViCLIP model is **fully fine-tuned** on our 50-pair dataset.
2. Two loss functions are compared:
   - **InfoNCE**: Standard symmetric contrastive loss, treating all non-matching pairs as equally negative.
   - **Hard Negative-Aware Contrastive (HNAC)**: Down-weights hard negatives by adjusting penalties based on semantic closeness.
3. No cross-modal fusion or sampling strategies are applied in this notebook; the focus is strictly on **fine-tuning and cosine separation**.

---

### 📊 Results

| Configuration             | Cosine Similarity Gap |
|---------------------------|------------------------|
| ViCLIP + InfoNCE Loss     | 0.1854                 |
| ViCLIP + HNAC Loss        | **0.2036**             |

---

### ✅ Key Insights

- **HNAC outperforms InfoNCE**, even without architectural modifications, supporting its use in semantically dense caption datasets.
- The **gap improvement is substantial** over zero-shot ViCLIP (~0.10 or lower), confirming that fine-tuning benefits low-resource domains.
- While **cosine gap** does not directly reflect downstream retrieval accuracy, it **captures the model’s internal ability to semantically cluster** aligned modalities.
- This analysis informed later decisions to explore **fusion mechanisms** and **frame selection strategies**, where cosine-based improvements were replaced by loss-driven training objectives.

---

### Conclusion

This phase demonstrated that **embedding quality can be significantly improved through fine-tuning alone**, especially when guided by **loss functions that account for semantic nuance**. The findings in this notebook establish a strong foundation for more complex architectural explorations in subsequent stages.
