# In Class Project






## Problem formulation

The aim of this final lab is to let you work on a sample project. This will give you a better grasp of how the final project will be conducted and more coding practice than the previous labs.

The sample project is inspired by the paper [Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning](https://arxiv.org/pdf/2203.02053.pdf). During the course, we discussed how Vision-Language Models (VLMs) can align representations from language and vision. This alignment can be achieved using contrastive learning on image/caption pairs. However, the paper shows that applying a dimensionality reduction technique to embedded image/caption pairs reveals that the two modalities occupy disjoint regions of the embedding space. The following image illustrates this:

<img src="https://modalitygap.readthedocs.io/en/latest/_images/Figure1.png" width="600">

As shown in the image, the "gap" between the two modalities is present for a randomly initialized network and persists even after the pretraining phase. Moreover, this modality gap is present not only for image/text pairs but also when text is aligned with other modalities (videos, medical images, amino-acid sequences).

Geometrically, the authors describe a "cone effect": as the number of dimensions grows, embeddings tend to occupy an ever smaller, cone-shaped region of the space.
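
One way to probe a cone-like geometry is to look at the average pairwise cosine similarity of a set of embeddings: values near 0 suggest vectors spread in all directions, while values near 1 suggest vectors packed into a narrow cone. The snippet below is a minimal sketch on synthetic data, not the paper's exact protocol. The helper names (`cosine`, `mean_pairwise_cosine`) and the toy vectors are assumptions made for illustration.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors."""
    total, count = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += cosine(vectors[i], vectors[j])
            count += 1
    return total / count

rng = random.Random(0)  # fixed seed so the toy data is reproducible
dim, n = 64, 50

# toy "spread out" embeddings: zero-mean Gaussian noise in every direction
spread = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
# toy "cone" embeddings: the same kind of noise around a shared offset vector
cone = [[3.0 + rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

print("spread:", round(mean_pairwise_cosine(spread), 3))
print("cone:  ", round(mean_pairwise_cosine(cone), 3))
```

The same function can be applied to real embeddings once they are converted to lists, for instance with `.tolist()` on a NumPy array.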

Your task will consist of testing whether the "modality gap" and "cone effect" exist using video/caption pairs and a state-of-the-art VLM, [COCA](https://arxiv.org/pdf/2205.01917.pdf).

## Overview of the project

The project is divided into three main phases, explained in detail in the following sections. These phases are sequential, and you can perform all of them here on Colab using the available T4 GPU.

### Step 1: Video Captioning

Your first task is to perform Video Captioning, which is the automatic captioning of a video by understanding the actions and events in it.

We will provide you with short videos and no captions. Then, you'll have to extract a number of frames from each video and generate an independent caption for each frame using COCA.

You are free to choose how many frames to extract from each video, how many of these frames will be used to generate captions, how many captions to retain for each video, and which strategy to use to generate captions.

As you can imagine, many frames of a single video are repetitive and lead to the same caption. Conversely, a few key frames may contain the actions needed to understand the dynamics of the event shown in the video. For example, think about a tennis player serving: the initial frames will probably show the player in a static position, focusing and preparing to serve, whereas the serve itself will appear in only a few frames. There are multiple solutions to this problem, and you are free to choose one. Some examples are:

* Subsample only a smaller number of frames and generate captions only for those (faster).
* Filter captions based on their diversity. An approach is to cluster similar captions using the text encoder of COCA and take only one caption for each cluster (slower).

### Step 2: Caption Aggregation

At the end of step 1, you'll have a collection of captions for each single video, with these captions describing only some frames but not the video overall.

Your second task is to obtain a single summary describing the content of a video by aggregating the content of your captions. To this end, you have to choose an LLM and prompt it to generate an overall description of a video given a list of captions.

As an LLM you can use a model from the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) family. These models have good instruction-following capabilities and work well out-of-the-box. If you have any other preference, you are free to choose other LLMs.

P.S. After some experiments, I found the [Phi-1.5](https://huggingface.co/microsoft/phi-1_5) model to be a good compromise between memory requirements and performance. This model was not trained to follow instructions, so to make it do what you want you will have to provide in-context examples and see how it performs in a few-shot manner. See the image below:

<img src="https://ai.stanford.edu/blog/assets/img/posts/2022-08-01-understanding-incontext/images/image11.gif" width="600">
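
A few-shot prompt is just a string that shows the model one or more solved examples before the new input. The sketch below assembles such a prompt from an instruction, a list of (captions, summary) examples and the captions of a new video. The function name `build_few_shot_prompt` and the section labels (`Instruction:`, `Context:`, `Summary:`) are illustrative assumptions, not a format required by any model.

```python
def build_few_shot_prompt(instruction, examples, captions):
    """Assemble an instruction, solved examples and a new context into one prompt."""
    parts = [f"Instruction:\n{instruction}\n"]
    for example_captions, example_summary in examples:
        context = "\n".join(example_captions)
        parts.append(f"Context:\n{context}\n\nSummary:\n{example_summary}\n")
    # the prompt ends with an open "Summary:" that the model is expected to complete
    parts.append("Context:\n" + "\n".join(captions) + "\n\nSummary:\n")
    return "\n".join(parts)

examples = [
    (["Two kids are playing with the cat .", "A woman is holding a black cat ."],
     "Two kids play with a black cat until a woman picks it up."),
]
prompt = build_few_shot_prompt(
    "Summarize the sentences in the context in one sentence.",
    examples,
    ["a man holds a tennis racket .", "a man serves the ball ."],
)
print(prompt)
```

Adding more solved examples usually makes the expected output format clearer to the model, at the cost of a longer prompt.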


### Step 3: Dimensionality Reduction

Once you have paired videos and captions, you are ready to see if the "modality gap" is present for these data points.

Your task is now to encode both videos and captions and use UMAP dimensionality reduction to project the embeddings into a 2-D space. Since videos are composed of multiple frames, you'll have to use a fusion strategy to aggregate the embeddings of all frames. For example, you can take the average of the frame embeddings to represent a whole video.
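
The averaging strategy can be sketched as follows. The example is written in plain Python so that it runs without dependencies; with NumPy the averaging step is `np.mean(frames, axis=0)`. Re-normalizing the averaged vector to unit length is an assumption of this sketch, made so that video embeddings live on the same unit sphere as normalized caption embeddings. The helper names `l2_normalize` and `mean_pool` are hypothetical.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length (assumes the vector is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def mean_pool(frame_embeddings):
    """Fuse per-frame embeddings into one video embedding by averaging them."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    mean = [sum(frame[d] for frame in frame_embeddings) / n for d in range(dim)]
    return l2_normalize(mean)

# two toy 2-D "frame embeddings" pointing in orthogonal directions
video_embedding = mean_pool([[1.0, 0.0], [0.0, 1.0]])
print(video_embedding)
```

Other fusion strategies, such as max pooling or weighting frames by how distinctive their captions are, fit the same interface: a list of frame vectors in, one video vector out.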

If the results are as expected, at this point, you should see two different clusters of data points, each representing one modality.
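
A 2-D plot can be complemented by a single number. One simple choice, in the spirit of the paper, is the vector between the centroid of the video embeddings and the centroid of the caption embeddings, together with its length. The snippet below is a minimal sketch on toy vectors. `centroid` and `modality_gap` are hypothetical helper names.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def modality_gap(video_embeddings, caption_embeddings):
    """Return the centroid difference vector and its Euclidean length."""
    video_center = centroid(video_embeddings)
    caption_center = centroid(caption_embeddings)
    gap = [a - b for a, b in zip(video_center, caption_center)]
    return gap, math.sqrt(sum(g * g for g in gap))

# toy data: every "video" points along x, every "caption" along y
gap_vector, gap_norm = modality_gap([[1.0, 0.0], [1.0, 0.0]],
                                    [[0.0, 1.0], [0.0, 1.0]])
print(gap_vector, gap_norm)
```

A gap length close to 0 would mean the two modalities share the same region on average. A clearly positive value is consistent with two separate clusters.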

## Getting started

All the tools needed for these tasks were introduced in previous labs.

They are also listed below for reference.

* **FFMPEG TO EXTRACT FRAMES FROM VIDEOS**

  This is the command for extracting frames from a video specifying the frame rate using FFmpeg:

  ```
  ffmpeg -i input.mp4 -vf fps=1 %04d.png
  ```

  `%04d.png` is a sequence pattern: FFmpeg names the output files with zero-padded sequential numbers, e.g. 0001.png, 0002.png, 0003.png, etc.

* **OPEN CLIP TO ACCESS THE COCA MODEL**

  This is the [link](https://github.com/mlfoundations/open_clip) to the Open CLIP repository.

* **HUGGINGFACE'S TRANSFORMERS TO USE FLAN-T5 OR PHI-1.5**

  This is the [link](https://huggingface.co/docs/transformers/model_doc/flan-t5) to the FLAN-T5 models documentation on HuggingFace. This is the [link](https://huggingface.co/microsoft/phi-1_5) to Phi-1.5.

* **UMAP FOR DIMENSIONALITY REDUCTION**

  This is the official [documentation](https://umap-learn.readthedocs.io/en/latest/basic_usage.html) of the Python implementation of UMAP.
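
The FFmpeg command listed above can also be assembled from Python, which is convenient when looping over many videos. The sketch below only builds the argument list and shows how the `%04d` pattern expands. `build_ffmpeg_command` is a hypothetical helper, and the file and folder names are placeholders.

```python
import os

def build_ffmpeg_command(video_path, frames_dir, fps=1, pattern="%04d.png"):
    """Build the argument list for extracting `fps` frames per second."""
    return [
        "ffmpeg",
        "-i", video_path,
        "-vf", f"fps={fps}",
        os.path.join(frames_dir, pattern),
    ]

command = build_ffmpeg_command("input.mp4", "frames")
print(command)

# the pattern is printf-style, so Python's % operator expands it the same way
print("%04d.png" % 1)
print("%04d.png" % 12)
```

To actually run the command, pass the list to `subprocess.run(command, check=True)` after creating the output folder.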




# Step 1.0: Get the data

In [30]:
from datasets import load_dataset
from huggingface_hub import snapshot_download

### Get raw videos

In [31]:
from os import path, rename, listdir

In [37]:
video_folder = path.join('data', 'videos')
if not path.exists(video_folder):
    
    # Download the files from Hugging Face
    snapshot_download(repo_id="friedrichor/ActivityNet_Captions", repo_type="dataset", allow_patterns=["*.tar.part-00*"], local_dir='./data/raw')
    
    # Combine the parts into a single tar file and extract it
    !cat ./data/raw/ActivityNet_Videos.tar.part-* | tar -vxf - -C ./data
    
    # Rename the extracted folder to a simpler name
    rename(path.join('data', 'Activity_Videos'), path.join(video_folder))
    
    # Delete the parts to save space
    !rm -r -f ./data/raw

# Print the number of videos downloaded
print('Videos:', len(listdir(video_folder)))

Videos: 14950


In [33]:
df = load_dataset("friedrichor/ActivityNet_Captions")
print(df['train'][0])

{'video_id': 'v_QOlSCBRmfWY', 'video': 'v_QOlSCBRmfWY.mp4', 'caption': 'A young woman is seen standing in a room and leads into her dancing. The girl dances around the room while the camera captures her movements. She continues dancing around the room and ends by laying on the floor.', 'source': 'ActivityNet_Captions', 'duration': 82.73, 'timestamps': [[0.8300000000000001, 19.86], [17.37, 60.81], [56.26, 79.42]], 'sentences': ['A young woman is seen standing in a room and leads into her dancing.', 'The girl dances around the room while the camera captures her movements.', 'She continues dancing around the room and ends by laying on the floor.']}
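
Each record pairs a list of `timestamps` with a list of `sentences` of the same length. The sketch below zips them into segments and looks up which ground-truth sentences cover a given moment, which can help when comparing a frame caption with the annotation. It assumes the timestamps are in seconds, like `duration`. The first timestamp is rounded to 0.83 for readability, and `segments` and `sentences_at` are hypothetical helpers.

```python
record = {
    "duration": 82.73,
    "timestamps": [[0.83, 19.86], [17.37, 60.81], [56.26, 79.42]],
    "sentences": [
        "A young woman is seen standing in a room and leads into her dancing.",
        "The girl dances around the room while the camera captures her movements.",
        "She continues dancing around the room and ends by laying on the floor.",
    ],
}

def segments(record):
    """Pair each [start, end] interval with its sentence."""
    return [
        {"start": start, "end": end, "sentence": sentence}
        for (start, end), sentence in zip(record["timestamps"], record["sentences"])
    ]

def sentences_at(record, second):
    """Return the sentences whose interval contains the given time."""
    return [s["sentence"] for s in segments(record) if s["start"] <= second <= s["end"]]

# intervals overlap, so one moment can be covered by more than one sentence
print(sentences_at(record, 18.0))
print(sentences_at(record, 70.0))
```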


### Get the video-caption pairs

# Step 1.1: Extract frames from videos

In [34]:
!apt install ffmpeg

E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?


In [35]:
import os
import re
#import gdown
import subprocess
from glob import glob

In [36]:
# define a directory to store video frames and create it
video_frames_dir = "video_frames"
if not os.path.exists(video_frames_dir):
    os.makedirs(video_frames_dir)

# iterate through the downloaded videos (data/videos is a flat folder of video files)
for i, video in enumerate(sorted(glob("data/videos/*"))):
    video_name = os.path.basename(video).split(".")[0]

    # create a new directory to store the frames of this video.
    # NOTE: "activitynet" is a placeholder category (an assumption); it keeps the
    # <category>/video_<i>/ layout that the following steps iterate over
    frames_dir = os.path.join(video_frames_dir, "activitynet", f"video_{i}")
    os.makedirs(frames_dir, exist_ok=True)

    print(f"Extracting frames from video {video_name} into folder {frames_dir}")

    # define ffmpeg command (one frame per second, zero-padded file names)
    ffmpeg_command = [
        'ffmpeg',
        '-i', video,
        '-vf', 'fps=1',
        f'{frames_dir}/%04d.png'
    ]

    # run command
    subprocess.call(ffmpeg_command)

# Step 1.2: Caption video frames

In [None]:
!pip install -q open_clip_torch

In [None]:
from collections import defaultdict
from PIL import Image
import open_clip
import torch
from tqdm import tqdm

In [None]:
device = "cuda"

# instantiate model
model, _, preprocess = open_clip.create_model_and_transforms(
  model_name="coca_ViT-L-14",
  pretrained="mscoco_finetuned_laion2B-s13B-b90k",
  device=device
)

In [None]:
captions_dict = {}

# iterate through types of videos e.g. kitesurfing, bowling...
for video_dir in tqdm(glob(f"{video_frames_dir}/*")):
    video_type = os.path.basename(video_dir)
    captions_dict[video_type] = defaultdict(list)
    # iterate through single videos
    for video in sorted(glob(f"{video_dir}/*")):
        # key the captions by folder name (e.g. "video_0") so they match the frames on disk
        video_key = os.path.basename(video)
        # iterate through single frames, sorted by file name
        for image in sorted(glob(f"{video}/*")):

            # preprocess frame
            im = Image.open(image).convert("RGB")
            im = preprocess(im).unsqueeze(0).to(device)

            # generate caption for frame
            with torch.no_grad(), torch.autocast("cuda"):
                generated = model.generate(im)

            # decode the caption, strip the special tokens and add it to the dictionary
            caption = open_clip.decode(generated[0]).replace("<start_of_text>", "").replace(" <end_of_text>", "")
            captions_dict[video_type][video_key].append(caption)

In [None]:
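# inspect the frame captions of one video from the first available category
first_type = next(iter(captions_dict))
captions_dict[first_type]["video_0"]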

# Step 2: Use an LLM to generate a single caption

In [None]:
!pip install -q transformers einops

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
torch.set_default_device("cuda")

# initialize llm
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype=torch.float16)

In [None]:
aggregated_captions_dict = defaultdict(lambda : defaultdict(str))

# iterate through types of videos e.g. kitesurfing, bowling...
for video_type in tqdm(captions_dict):
    # iterate through single videos
    for video in captions_dict[video_type]:
        sentences = ""
        # iterate through single frame captions and generate a context
        for caption in captions_dict[video_type][video]:
            sentences += f"\n{caption}"

        prompt = f"""
        Instruction:
        Create a summary sentence that aggregates the meaning of all the sentences provided in the context. The sentences in the context are in chronological order. Provide a concise summary using only the information provided in the context.

        Context:
        The two kids are playing with the cat
        The cat is running in the living room .
        The cat is running in the living room .
        Two kids are trying to catch a black cat
        One child is running in the living room .
        The cat is running in the living room .
        A woman is holding a black cat .
        The children and a woman are petting the cat .

        Summary:
        Two children run into the living room trying to catch a black cat. After a woman catches the cat, they all pet it together .

        Context:{sentences}

        Summary:
        """
        # generate aggregate captions
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

        outputs = model.generate(input_ids, repetition_penalty=1.2, max_new_tokens=50)
        result = tokenizer.decode(outputs[0]).split("Summary:")[2].strip().split("\n")[0]

        aggregated_captions_dict[video_type][video] = result

In [None]:
# inspect generated aggregate captions
for video_type in aggregated_captions_dict:
    print(video_type.upper())
    for video in aggregated_captions_dict[video_type]:
        print(video.upper())
        print("ORIGINAL CAPTIONS :")
        for caption in captions_dict[video_type][video]:
            print(caption)
        print("AGGREGATED :")
        print(aggregated_captions_dict[video_type][video])
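
Because a causal LLM returns the prompt followed by its continuation, the summary has to be cut out of the decoded text. The sketch below shows one defensive way to do that: it returns the first non-empty line after the last prompt `Summary:` marker and an empty string when the marker is missing. `extract_summary` and its `n_examples` parameter are hypothetical names for this example.

```python
def extract_summary(generated_text, n_examples=1):
    """Return the first line after the (n_examples + 1)-th "Summary:" marker."""
    chunks = generated_text.split("Summary:")
    # chunk 0 precedes the first marker, chunks 1..n_examples hold the solved examples
    if len(chunks) <= n_examples + 1:
        return ""
    lines = chunks[n_examples + 1].strip().splitlines()
    return lines[0].strip() if lines else ""

decoded = (
    "Instruction:\nSummarize.\n\n"
    "Context:\nThe cat is running .\n\nSummary:\nA cat runs around.\n\n"
    "Context:\na man holds a bowling ball .\n\nSummary:\n  A man bowls.\nContext: extra text"
)
print(extract_summary(decoded))
```

Returning an empty string instead of raising makes it easy to spot and retry the videos for which the model produced no usable summary.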

# Step 3: Project video/caption embeddings into a low-dimensional space

In [None]:
!pip install -q umap-learn

In [None]:
import numpy as np
from open_clip.factory import get_tokenizer
from umap import UMAP
import matplotlib.pyplot as plt

In [None]:
# re-instantiate COCA (the name `model` was reused for the LLM in Step 2)
model, _, preprocess = open_clip.create_model_and_transforms(
  model_name="coca_ViT-L-14",
  pretrained="mscoco_finetuned_laion2B-s13B-b90k",
  device=device
)

In [None]:
video_embeddings = []

for video_type in aggregated_captions_dict:
    for video in range(3):
        video_features = []
        # use the loop variable `video` to select the folder of frames
        for frame_path in glob(f"{video_frames_dir}/{video_type}/video_{video}/*"):
            im = Image.open(frame_path).convert("RGB")
            im = preprocess(im).unsqueeze(0).to(device)
            with torch.no_grad():
                image_features = model.encode_image(im).float()
            image_features /= image_features.norm(dim=-1, keepdim=True)
            video_features.append(image_features.cpu().numpy())
        # a video embedding is the average of the frames embeddings
        video_embeddings.append(np.mean(np.asarray(video_features), axis=0))

video_embeddings = np.asarray(video_embeddings).squeeze()
print(video_embeddings.shape)

In [None]:
tokenizer = get_tokenizer("coca_ViT-L-14")

caption_embeddings = []

for video_type in aggregated_captions_dict:
    for video in range(3):
        text_tokens = tokenizer(aggregated_captions_dict[video_type][f"video_{video}"])
        # workaround: keep only the tokens before the padding (id 0), also dropping the last token before it
        text_tokens = text_tokens[:, :torch.where(text_tokens == 0)[1][0] - 1]
        with torch.no_grad():
            text_features = model.encode_text(text_tokens).float()
        text_features /= text_features.norm(dim=-1, keepdim=True)
        caption_embeddings.append(text_features.cpu().numpy())

caption_embeddings = np.asarray(caption_embeddings).squeeze()
print(caption_embeddings.shape)

In [None]:
# instantiate umap
reducer = UMAP()

# obtain 2d features
features_2d = reducer.fit_transform(np.concatenate([video_embeddings, caption_embeddings], 0))

# plot 2d features
plt.scatter(features_2d[:-len(video_embeddings), 0], features_2d[:-len(video_embeddings), 1], c='tab:blue', label="video")
plt.scatter(features_2d[-len(video_embeddings):, 0], features_2d[-len(video_embeddings):, 1], c='tab:red', label="text")
# plot lines
for i in range(len(video_embeddings)):
    plt.plot([features_2d[i, 0], features_2d[len(video_embeddings)+i, 0]], [features_2d[i, 1], features_2d[len(video_embeddings)+i, 1]], c='black', alpha=0.1)

plt.xlabel('umap 1')
plt.ylabel('umap 2')
plt.legend()
plt.show()

In [None]:
# define svd
def svd(X, n_components=2):
    U, S, Vt = np.linalg.svd(X)
    return U[:, :n_components] * S[:n_components]

# obtain 2d features
features_2d = svd(np.concatenate([video_embeddings, caption_embeddings], 0))

# plot 2d features
plt.scatter(features_2d[:-len(video_embeddings), 0], features_2d[:-len(video_embeddings), 1], c='tab:blue', label="video")
plt.scatter(features_2d[-len(video_embeddings):, 0], features_2d[-len(video_embeddings):, 1], c='tab:red', label="text")
# plot lines
for i in range(len(video_embeddings)):
    plt.plot([features_2d[i, 0], features_2d[len(video_embeddings)+i, 0]], [features_2d[i, 1], features_2d[len(video_embeddings)+i, 1]], c='black', alpha=0.1)

plt.xlabel('svd 1')
plt.ylabel('svd 2')
plt.legend()
plt.show()