# **Notebook Summary**

This notebook compares the video captioning output of two VisionEncoderDecoder models: a pretrained model (Neleac/timesformer-gpt2-video-captioning) and a custom-trained model hosted on the Hugging Face Hub (NourFakih/TimeSformer-GPT2-UCF-8400). It first clones the custom model's repository, then loads both models together with a shared image processor (MCG-NJU/videomae-base) and tokenizer (gpt2). Using the av library, it extracts 16 evenly spaced frames from each video clip and passes them to each model to generate a caption. The generated captions are added to a DataFrame of video paths and saved as a new CSV file for further analysis or evaluation. The last cell shows generation settings such as num_beams, temperature, and repetition_penalty that can be adjusted to control the diversity and fluency of the outputs.

In [1]:
!pip install av

Collecting av
  Downloading av-14.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.7 kB)
Downloading av-14.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (35.2 MB)
Installing collected packages: av
Successfully installed av-14.3.0


In [2]:
# from huggingface_hub import hf_hub_download

# # Define your model repository name and checkpoint name
# model_repo_name = "NourFakih/TimeSformer-GPT2-UCF-UCA-01"
# checkpoint_name = "checkpoint-100"
# output_dir="./TimeSformer-GPT2-UCF-UCA"
# # Download the checkpoint files to a local directory
# checkpoint_dir = "./TimeSformer-GPT2-UCF-UCA"
# hf_hub_download(repo_id=model_repo_name, filename=f"{checkpoint_name}/generation_config.json", local_dir=checkpoint_dir)
# hf_hub_download(repo_id=model_repo_name, filename=f"{checkpoint_name}/config.json", local_dir=checkpoint_dir)
# hf_hub_download(repo_id=model_repo_name, filename=f"{checkpoint_name}/model.safetensors" , local_dir=checkpoint_dir)
# hf_hub_download(repo_id=model_repo_name, filename=f"{checkpoint_name}/optimizer.pt" , local_dir=checkpoint_dir)
# hf_hub_download(repo_id=model_repo_name, filename=f"{checkpoint_name}/rng_state.pth", local_dir=checkpoint_dir)
# hf_hub_download(repo_id=model_repo_name, filename=f"{checkpoint_name}/scheduler.pt", local_dir=checkpoint_dir)
# hf_hub_download(repo_id=model_repo_name, filename=f"{checkpoint_name}/trainer_state.json", local_dir=checkpoint_dir)
# hf_hub_download(repo_id=model_repo_name, filename=f"{checkpoint_name}/training_args.bin", local_dir=checkpoint_dir)


In [3]:
from huggingface_hub import HfApi, Repository  # install via pip install huggingface_hub

# # 1) Create an empty model repo on HF (won’t error if it already exists)
# api = HfApi()
# api.create_repo(
#     repo_id="NourFakih/TimeSformer-GPT2-UCF-8400",
#     repo_type="model",
#     exist_ok=True
# )

# 2) Clone the repo into a local folder
#    (Repository is git-based and deprecated in recent huggingface_hub releases)
repo = Repository(
    local_dir="training-ucf",                          # where on disk to put it
    clone_from="NourFakih/TimeSformer-GPT2-UCF-8400",   # the HF namespace/repo
    use_auth_token=True
)


For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/NourFakih/TimeSformer-GPT2-UCF-8400 into local empty directory.


Download file checkpoint_2/model.safetensors:   0%|          | 8.00k/1.02G [00:00<?, ?B/s]

Clean file checkpoint_2/model.safetensors:   0%|          | 1.00k/1.02G [00:00<?, ?B/s]

Clean file model.safetensors:   0%|          | 1.00k/1.02G [00:00<?, ?B/s]
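
The git-based `Repository` class used in this cell is deprecated in recent `huggingface_hub` releases, which is what the `git_vs_http` link in the output refers to. A minimal sketch of the HTTP-based alternative, `snapshot_download`, follows. The helper name `download_model_repo` is hypothetical, and the call assumes you are already authenticated if the repository is private.

```python
def download_model_repo(repo_id, local_dir):
    """Download every file of a Hub model repo into local_dir and return the local path."""
    # Imported lazily so the helper can be defined without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, repo_type="model", local_dir=local_dir)


# Example call (commented out because it needs network access):
# local_path = download_model_repo("NourFakih/TimeSformer-GPT2-UCF-8400", "training-ucf")
```

`snapshot_download` also accepts an `allow_patterns` argument, which may be useful here to skip intermediate checkpoint folders and fetch only the top-level model files.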

In [12]:
import os
import av
import numpy as np
import pandas as pd
import torch
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

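# Load the models
def load_model_and_processor(model_name):
    """Load a captioning model together with its image processor and tokenizer."""
    # Both models use the same VideoMAE image processor and GPT-2 tokenizer
    processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = VisionEncoderDecoderModel.from_pretrained(model_name).to(device)
    return processor, tokenizer, model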

processor1, tokenizer1, model1 = load_model_and_processor("Neleac/timesformer-gpt2-video-captioning")
processor2, tokenizer2, model2 = load_model_and_processor("/kaggle/working/training-ucf")




Device: cuda


Config of the encoder: <class 'transformers.models.timesformer.modeling_timesformer.TimesformerModel'> is overwritten by shared encoder config: TimesformerConfig {
  "architectures": [
    "TimesformerForVideoClassification"
  ],
  "attention_probs_dropout_prob": 0.0,
  "attention_type": "divided_space_time",
  "drop_path_rate": 0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "id2label": {
    "0": "abseiling",
    "1": "acting in play",
    "2": "adjusting glasses",
    "3": "air drumming",
    "4": "alligator wrestling",
    "5": "answering questions",
    "6": "applauding",
    "7": "applying cream",
    "8": "archaeological excavation",
    "9": "archery",
    "10": "arguing",
    "11": "arm wrestling",
    "12": "arranging flowers",
    "13": "assembling bicycle",
    "14": "assembling computer",
    "15": "attending conference",
    "16": "auctioning",
    "17": "backflip (human)",
    "18": "baking cookies",
    "19": "bandaging",
    "20": "barb

In [15]:
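# Caption generation function
def generate_caption(video_path, processor, tokenizer, model, num_frames=16, max_length=30):
    """Sample num_frames evenly spaced frames from a video and return one generated caption."""
    try:
        frames = []
        # The context manager closes the container even if decoding fails
        with av.open(video_path) as container:
            # Note: the reported frame count can be 0 or inaccurate for some containers
            total_frames = container.streams.video[0].frames
            indices = set(np.linspace(0, total_frames, num=num_frames, endpoint=False).astype(np.int64).tolist())
            last_index = max(indices)
            container.seek(0)
            for i, frame in enumerate(container.decode(video=0)):
                if i in indices:
                    frames.append(frame.to_ndarray(format="rgb24"))
                if i >= last_index:
                    break  # no need to decode past the last sampled frame
        pixel_values = processor(frames, return_tensors="pt").pixel_values.to(device)
        tokens = model.generate(pixel_values, min_length=10, max_length=max_length, num_beams=8)
        return tokenizer.batch_decode(tokens, skip_special_tokens=True)[0]
    except Exception as e:
        # Return a sentinel so one bad clip does not stop the whole run
        print(f"Error processing {video_path}: {e}")
        return "ERROR"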
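
To see which frames the function keeps, the index selection can be reproduced in isolation. The sketch below uses a hypothetical helper, `sample_frame_indices`, written with integer arithmetic rather than `np.linspace`; it is meant to mirror the evenly spaced selection in `generate_caption`, not to replace it. It also illustrates one edge case: when a clip has fewer frames than `num_frames`, indices repeat, and because the function stores them in a `set`, fewer than 16 frames reach the model.

```python
def sample_frame_indices(total_frames, num_frames=16):
    """Return num_frames evenly spaced frame indices in [0, total_frames)."""
    return [(i * total_frames) // num_frames for i in range(num_frames)]


long_clip = sample_frame_indices(160)   # plenty of frames: all indices are distinct
short_clip = sample_frame_indices(10)   # fewer frames than requested: indices repeat

print(long_clip)
print(short_clip)
print(len(set(short_clip)), "distinct frames would be kept for the short clip")
```

If short clips matter for your data, one option is to keep the indices as a list and repeat frames so that the model always receives `num_frames` inputs.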

In [18]:
# Load CSV
csv_path = "/kaggle/input/splitted-clips-ucf/splitted_clips_mapping.csv"
video_folder = "/kaggle/input/splitted-clips-ucf/"

df = pd.read_csv(csv_path)
df['video_path'] = df['video_path'].str.replace(r'^\.\/splitted_clips\/', '', regex=True)
df = df.head(3).copy()  # quick test on the first 3 clips; copy() avoids SettingWithCopyWarning
# Process each row
captions_1 = []
captions_2 = []

# Count with enumerate so the progress counter does not depend on the DataFrame index
for n, (_, row) in enumerate(df.iterrows(), start=1):
    video_file = os.path.join(video_folder, row['video_path'])
    print(f"[{n}/{len(df)}] Processing {video_file}")

    cap1 = generate_caption(video_file, processor1, tokenizer1, model1)
    cap2 = generate_caption(video_file, processor2, tokenizer2, model2)

    captions_1.append(cap1)
    captions_2.append(cap2)

# Add new columns
df['caption_Neleac'] = captions_1
df['caption_NourFakih'] = captions_2

# Save to new CSV
output_csv_path = "video_captions_augmented.csv"
df.to_csv(output_csv_path, index=False)
print(f"\n✅ Captions saved to {output_csv_path}")

[1/3] Processing /kaggle/input/splitted-clips-ucf/Abuse009_x264_clip0.mp4
[2/3] Processing /kaggle/input/splitted-clips-ucf/Abuse009_x264_clip1.mp4
[3/3] Processing /kaggle/input/splitted-clips-ucf/Abuse009_x264_clip2.mp4

✅ Captions saved to video_captions_augmented.csv


In [19]:
df

| Unnamed: 0 | video_path | caption | caption_Neleac | caption_NourFakih |
|---|---|---|---|---|
| 0 | Abuse009_x264_clip0.mp4 | At night, there were a bunch of men standing a... | A man is standing in front of a crowd of peopl... | man white to left the and to right the and to... |
| 1 | Abuse009_x264_clip1.mp4 | The man standing on the other side slammed the... | A man is standing in front of a crowd of peopl... | man white to left the and man black to right the |
| 2 | Abuse009_x264_clip2.mp4 | The man picked up the child and ran towards th... | A group of people are working together to fix ... | man black to left the and to right the |
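
For a quick numeric comparison of the two generated caption columns against the reference `caption` column, a token-overlap score is one simple starting point. The sketch below defines a hypothetical `token_f1` helper using only the standard library; the example strings are invented for illustration and are not taken from the dataset. Established captioning metrics such as BLEU, METEOR, or CIDEr would be the usual choice for a proper evaluation.

```python
from collections import Counter


def token_f1(reference, candidate):
    """Unigram F1 between two captions after lowercasing and whitespace tokenization."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Multiset intersection: a shared token counts as often as it appears in both captions
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Invented example captions (illustration only)
reference = "a man picks up a child and runs"
fluent = "a man picks up a child"

print(round(token_f1(reference, fluent), 3))
```

Applied to the DataFrame, this could look like `df["f1_Neleac"] = [token_f1(r, c) for r, c in zip(df["caption"], df["caption_Neleac"])]`. Because unigram overlap ignores word order, a scrambled caption can still score well, so the number should be read alongside the captions themselves.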


In [None]:
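# Alternative settings to try in place of the model.generate(...) call inside generate_caption.
# `model` and `pixel_values` are placeholders for a loaded model and its processed frames.
model.generate(
    pixel_values,
    min_length=10,
    max_length=50,
    num_beams=4,             # fewer beams than the 8 used in generate_caption
    do_sample=True,          # temperature only takes effect when sampling is enabled
    temperature=0.9,         # below 1.0 sharpens the distribution; raise it for more diversity
    repetition_penalty=1.2,  # penalize tokens that were already generated
    no_repeat_ngram_size=3   # never repeat the same 3-gram
)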
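
The settings in this cell fall into two groups: deterministic beam search and sampling. The sketch below collects them into plain dictionaries that could be unpacked as `model.generate(pixel_values, **kwargs)`. The helper name `build_generation_kwargs` and the specific values are illustrative assumptions, while the keyword names are standard `generate` arguments. Note that `temperature` is only used when `do_sample=True`.

```python
def build_generation_kwargs(sample=False, max_length=50):
    """Return keyword arguments for generate(): beam search by default, sampling if requested."""
    kwargs = {
        "min_length": 10,
        "max_length": max_length,
        "repetition_penalty": 1.2,   # discourage tokens that were already generated
        "no_repeat_ngram_size": 3,   # forbid repeating any 3-gram
    }
    if sample:
        # Sampling: temperature and top_p shape the distribution that tokens are drawn from
        kwargs.update({"do_sample": True, "temperature": 0.9, "top_p": 0.95})
    else:
        # Deterministic beam search: temperature would be ignored here, so it is left out
        kwargs.update({"do_sample": False, "num_beams": 4})
    return kwargs


beam_kwargs = build_generation_kwargs()
sample_kwargs = build_generation_kwargs(sample=True)
print(beam_kwargs)
print(sample_kwargs)
```

Keeping the two presets separate makes it easier to compare captions from each decoding strategy on the same clips.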
