# 📊 Group 35 Baseline Model

- Muhammad Bazaf Shakeel 26100146
- Sulaiman Ahmad 26100350

Welcome to the Baseline Model and Results notebook for **Group 35**. We select ViCLIP as our baseline model because of its strong performance on video-text retrieval tasks. In this notebook, we run a forward pass on a sample video from the InternVid repository and on a video from our own dataset, then evaluate the model using captions from our dataset.

# Initial Setup

Setting up the relevant paths

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import sys
sys.path.append("/content/drive/MyDrive/InternVideo-main/InternVideo-main/Data/InternVid")

In [44]:
%cd /content/drive/MyDrive/InternVideo-main/InternVideo-main/Data/InternVid

/content/drive/MyDrive/InternVideo-main/InternVideo-main/Data/InternVid


Authenticating with Hugging Face and Downloading the Model Weights

In [None]:
import os
from dotenv import load_dotenv

load_dotenv('.env')

from huggingface_hub import login
login(token=os.getenv('HUGGING_FACE_API'))

In [43]:
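from huggingface_hub import hf_hub_download

# Download the ViCLIP checkpoint into the local "viclip" directory
hf_hub_download(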
    repo_id="OpenGVLab/ViCLIP",
    filename="ViClip-InternVid-10M-FLT.pth",
    local_dir="viclip",
    local_dir_use_symlinks=False
)

'viclip/ViClip-InternVid-10M-FLT.pth'

Importing Libraries

In [8]:
import numpy as np
import os
import cv2
import pandas as pd

try:
    from viclip import get_viclip, retrieve_text, _frame_from_video
except ImportError:
    # Fall back to a relative import when this code runs inside the package
    from .viclip import get_viclip, retrieve_text, _frame_from_video

# Baseline Model Forward Pass (from InternVid GitHub Repository)

- We first run a forward pass on a sample video from the InternVid GitHub repository.
- After verifying its results, we run a forward pass on a video from our selected dataset.

In [9]:
video = cv2.VideoCapture('example1.mp4')
frames = list(_frame_from_video(video))  # read every frame into memory
video.release()  # free the underlying capture handle

In [34]:
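# Note: the key name contains "b", while 'size' is set to 'l';
# get_viclip receives the 'size' value, not the key name.
model_cfgs = {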
    'viclip-b-internvid-10m-flt': {
        'size': 'l',
        'pretrained': 'viclip/ViClip-InternVid-10M-FLT.pth',
    }
}

Setting up the Baseline Model

In [35]:
text_candidates = ["A playful dog and its owner wrestle in the snowy yard, chasing each other with joyous abandon.",
                   "A man in a gray coat walks through the snowy landscape, pulling a sleigh loaded with toys.",
                   "A person dressed in a blue jacket shovels the snow-covered pavement outside their house.",
                   "A pet dog excitedly runs through the snowy yard, chasing a toy thrown by its owner.",
                   "A person stands on the snowy floor, pushing a sled loaded with blankets, preparing for a fun-filled ride.",
                   "A man in a gray hat and coat walks through the snowy yard, carefully navigating around the trees.",
                   "A playful dog slides down a snowy hill, wagging its tail with delight.",
                   "A person in a blue jacket walks their pet on a leash, enjoying a peaceful winter walk among the trees.",
                   "A man in a gray sweater plays fetch with his dog in the snowy yard, throwing a toy and watching it run.",
                   "A person bundled up in a blanket walks through the snowy landscape, enjoying the serene winter scenery."]

cfg = model_cfgs['viclip-b-internvid-10m-flt']
model_s = get_viclip(cfg['size'], cfg['pretrained'])

Function for the Model's Forward Pass

In [36]:
def run_viclip_retrieval(video_path, model, text_candidates, topk=5):
    """Rank text_candidates against one video and return (text, prob) pairs."""
    video = cv2.VideoCapture(video_path)

    if not video.isOpened():
        raise ValueError(f"Could not open video file: {video_path}")

    # Read all frames, then release the capture handle
    frames = list(_frame_from_video(video))
    video.release()

    # retrieve_text returns the top-k captions and their probabilities
    texts, probs = retrieve_text(frames, text_candidates, models=model, topk=topk)

    results = list(zip(texts, probs))
    for t, p in results:
        print(f'text: {t} ~ prob: {p:.4f}')

    return results

Forward Pass on a sample video from the GitHub Repository

In [37]:
run_viclip_retrieval('example1.mp4', model_s, text_candidates)

text: A man in a gray sweater plays fetch with his dog in the snowy yard, throwing a toy and watching it run. ~ prob: 0.8264
text: A playful dog and its owner wrestle in the snowy yard, chasing each other with joyous abandon. ~ prob: 0.1587
text: A pet dog excitedly runs through the snowy yard, chasing a toy thrown by its owner. ~ prob: 0.0141
text: A person dressed in a blue jacket shovels the snow-covered pavement outside their house. ~ prob: 0.0006
text: A playful dog slides down a snowy hill, wagging its tail with delight. ~ prob: 0.0002


Loading our dataset

In [38]:
aes_df = pd.read_csv("aes.csv")
aes_df.head()

| Unnamed: 0 | YoutubeID | Caption |
|---|---|---|
| 0 | KlAAQ4TzqdA | The video clip shows a group of men dressed in... |
| 1 | INsDaKTFXsM | The video clip shows a painting on the ceiling... |
| 2 | cNKOC6I1SPI | The video clip shows a man wearing a white sui... |
| 3 | WTO7-CQPjdY | The video clip shows a man wearing an orange s... |
| 4 | xreclu1ibdU | The video clip shows a woman standing in front... |


Forward Pass on a video from our dataset

In [41]:
video_path = f"Aes_InternVid_Clips/{aes_df.iloc[0]['YoutubeID']}.mp4"
run_viclip_retrieval(video_path, model_s, list(aes_df["Caption"]), topk=5)

text: The video clip shows a group of men dressed in blue shirts singing and dancing in front of a classic car. They seem to be having a good time and enjoying themselves. ~ prob: 0.6278
text: The video clip shows a group of men dressed in blue jackets standing in front of a row of classic cars. They seem to be posing for a photo and appear to be having a good time. ~ prob: 0.3708
text: The video clip shows a group of men standing in front of a classic car. They appear to be posing for a photo and seem to be enjoying each other's company. ~ prob: 0.0014
text: The video clip shows a man wearing an orange jacket and holding a plate of food. He is standing next to a plant and seems to be preparing a meal. ~ prob: 0.0000
text: The video clip shows a group of men dressed in suits and ties posing for a photo. They are all wearing hats and appear to be having a good time. The setting seems to be outdoors, possibly in a garden or park. It appears to be a group of friends or colleagues enjoying
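
The output above lists the candidates in descending order of probability. To check programmatically where the caption paired with the video lands in that list, a small helper along the following lines could be used. This is only a sketch: `rank_of_caption` is a hypothetical name that is not part of the InternVid code, and the toy data is made up. It assumes results in the `(text, prob)` format returned by `run_viclip_retrieval`.

```python
def rank_of_caption(results, target_caption):
    """Return the 1-based rank of target_caption in results, or None if absent.

    `results` is assumed to be a list of (text, prob) pairs sorted from
    most to least probable, as returned by run_viclip_retrieval.
    """
    for rank, (text, _prob) in enumerate(results, start=1):
        if text == target_caption:
            return rank
    return None


# Made-up results in the same (text, prob) format, for illustration only.
toy_results = [("caption A", 0.62), ("caption B", 0.37), ("caption C", 0.01)]
print(rank_of_caption(toy_results, "caption B"))
print(rank_of_caption(toy_results, "caption Z"))
```

In this notebook, the call would look like `rank_of_caption(results, aes_df.iloc[0]["Caption"])`, where `results` is the return value of `run_viclip_retrieval`. A rank of 1 means the paired caption was retrieved first; `None` means it fell outside the top-k list. If the dataset contains duplicate captions, the helper reports the first match.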

# Evaluation

- Based on the results, the model ranked the correct caption first, with a probability of 0.6278. The runner-up (0.3708) describes a closely related scene, which explains why the two captions share almost all of the probability mass.

- To evaluate the model, we passed the captions of all videos in our dataset as text candidates, so the correct caption competes against the caption of every other video.

- The ViCLIP model encodes the video frames with a pretrained vision encoder and the candidate captions with a text encoder, then computes similarity scores between the visual and textual features.

- It converts these scores into probabilities over the candidates and returns the 5 highest-ranked captions.
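
The ranking step described above can be sketched with NumPy as follows. This is an illustrative approximation, not the ViCLIP implementation: `rank_captions` is a hypothetical helper, the embeddings are toy values rather than real encoder outputs, and the scale factor of 100 is an assumption borrowed from common CLIP-style code.

```python
import numpy as np


def rank_captions(video_feat, text_feats, captions, topk=5, temperature=100.0):
    """Rank captions against one video embedding (illustrative sketch)."""
    # L2-normalise so that the dot product is a cosine similarity.
    v = video_feat / np.linalg.norm(video_feat)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    # Scaled cosine similarity between the video and every caption.
    logits = temperature * (t @ v)

    # Numerically stable softmax over the whole candidate set.
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()

    # Indices of the k most probable captions, best first.
    k = min(topk, len(captions))
    order = np.argsort(-probs)[:k]
    return [(captions[i], float(probs[i])) for i in order]


# Toy embeddings, made up for illustration (not real ViCLIP features).
captions = ["a dog in the snow", "a man cooking", "a car on a road"]
text_feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
video_feat = np.array([0.9, 0.1])

for caption, prob in rank_captions(video_feat, text_feats, captions, topk=2):
    print(f"text: {caption} ~ prob: {prob:.4f}")
```

Because the softmax is taken over the full candidate set, the probabilities depend on which captions are included: adding or removing candidates changes every score, so a value such as 0.6278 is relative to this dataset's captions rather than an absolute confidence.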