<a href="https://colab.research.google.com/github/merveenoyan/smollm/blob/main/vision/finetuning/SmolVLM2_Video_FT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune SmolVLM2 on Video Captioning
In this notebook we will fine-tune SmolVLM2-500M-Video-Instruct on the VideoFeedback dataset. It was run on a Colab A100 for full fine-tuning, but you can squeeze it onto an L4 with QLoRA.

In [1]:
!pip install -q accelerate datasets peft bitsandbytes tensorboard pyav num2words



In [None]:
!pip install -q git+https://github.com/huggingface/transformers.git

In [2]:
!pip install -q flash-attn --no-build-isolation



We will push our model to the Hub, so we need to authenticate first.

In [1]:
from huggingface_hub import notebook_login

notebook_login()


In this notebook we will do full fine-tuning on the 500M variant. You can also apply LoRA or QLoRA to the 2.2B variant; QLoRA attaches an adapter to a quantized version of the model, which saves memory. For full fine-tuning, set both `USE_LORA` and `USE_QLORA` to `False`. For LoRA, set `USE_LORA` to `True` and `USE_QLORA` to `False`.

A small model should learn more when all of its weights are updated, so we suggest disabling both LoRA and QLoRA when fine-tuning the 500M variant.
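As a quick illustration of how the two flags interact, here is a minimal sketch. The helper name `fine_tuning_mode` is hypothetical and not part of any library; it only mirrors the branching used in the next cell, where QLoRA takes precedence whenever `USE_QLORA` is set.

```python
def fine_tuning_mode(use_lora: bool, use_qlora: bool) -> str:
    """Return which training mode a pair of flags selects (hypothetical helper)."""
    if use_qlora:
        # QLoRA: adapter on top of a quantized base model
        return "qlora"
    if use_lora:
        # LoRA: adapter on top of the unquantized base model
        return "lora"
    # neither flag set: all (unfrozen) weights are trained
    return "full"


for flags in [(False, False), (True, False), (False, True)]:
    print(flags, "->", fine_tuning_mode(*flags))
```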

In [2]:
import torch
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoProcessor, BitsAndBytesConfig, AutoModelForImageTextToText
import os


USE_LORA = False
USE_QLORA = False
SMOL = True

model_id = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct" if SMOL else "HuggingFaceTB/SmolVLM2-2.2B-Instruct"

processor = AutoProcessor.from_pretrained(
    model_id
)

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
        use_dora=not USE_QLORA,  # DoRA is enabled for plain LoRA only
        init_lora_weights="gaussian"
    )
    lora_config.inference_mode = False
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )

    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        quantization_config=bnb_config if USE_QLORA else None,
        _attn_implementation="flash_attention_2",
        device_map="auto"
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, lora_config)
    print(model.get_nb_trainable_parameters())
else:
    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        # _attn_implementation="flash_attention_2",
    ).to("cuda")

    # freeze the vision encoder so that only the LLM is fine-tuned;
    # remove this loop if you also want to train the vision model
    for param in model.model.vision_model.parameters():
        param.requires_grad = False

peak_mem = torch.cuda.max_memory_allocated()
print(f"The model as is is holding: {peak_mem / 1024**3:.2f} of GPU RAM")

You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.


The model as is is holding: 0.97 of GPU RAM
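The figure printed above is in GiB, since the byte count is divided by `1024**3`. As a rough sanity check, the sketch below estimates the memory taken by the weights alone. The parameter count of 500 million is an assumption read off the model name, not a measured value, so expect the real number to differ slightly.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the weights only, in GiB."""
    return num_params * bytes_per_param / 1024**3


approx_params = 500_000_000  # assumption: nominal size taken from the model name

print(f"bf16 weights: {weight_memory_gib(approx_params, 2):.2f} GiB")
print(f"fp32 weights: {weight_memory_gib(approx_params, 4):.2f} GiB")
```

The bf16 estimate lands in the same ballpark as the value reported by `torch.cuda.max_memory_allocated()`. Gradients, optimizer state and activations come on top of this during training.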


## Loading the dataset and Preprocessing

We will load a dataset of 4k videos paired with very short captions. We train on only a small chunk of it.

In [3]:
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/VideoFeedback", "real")

In [6]:
print(ds)

DatasetDict({
    test: Dataset({
        features: ['id', 'images', 'text prompt', 'video link', 'visual quality', 'temporal consistency', 'dynamic degree', 'text-to-video alignment', 'factual consistency', 'conversations'],
        num_rows: 80
    })
    train: Dataset({
        features: ['id', 'images', 'text prompt', 'video link', 'visual quality', 'temporal consistency', 'dynamic degree', 'text-to-video alignment', 'factual consistency', 'conversations'],
        num_rows: 4000
    })
})


In [10]:
print(ds["train"][0])

{'id': 'p100263', 'images': ['p100263_00.jpg', 'p100263_01.jpg', 'p100263_02.jpg', 'p100263_03.jpg', 'p100263_04.jpg', 'p100263_05.jpg', 'p100263_06.jpg', 'p100263_07.jpg', 'p100263_08.jpg', 'p100263_09.jpg', 'p100263_10.jpg', 'p100263_11.jpg', 'p100263_12.jpg', 'p100263_13.jpg', 'p100263_14.jpg', 'p100263_15.jpg', 'p100263_16.jpg', 'p100263_17.jpg', 'p100263_18.jpg', 'p100263_19.jpg', 'p100263_20.jpg', 'p100263_21.jpg', 'p100263_22.jpg', 'p100263_23.jpg'], 'text prompt': 'There are posters of various sizes and colors on the wall of a restaurant.', 'video link': 'https://huggingface.co/datasets/hexuan21/VideoFeedback-videos-mp4/resolve/main/p/p100263.mp4', 'visual quality': 4, 'temporal consistency': 4, 'dynamic degree': 4, 'text-to-video alignment': 4, 'factual consistency': 4, 'conversations': [{'from': 'human', 'value': '\nSuppose you are an expert in judging and evaluating the quality of AI-generated videos, \nplease watch the following frames of a given video and see the text prom

In [4]:
print(ds["train"])
print(ds["train"].features.keys())
print(len(ds['train']['video link']))
print(len(ds['train']['text prompt']))

Dataset({
    features: ['id', 'images', 'text prompt', 'video link', 'visual quality', 'temporal consistency', 'dynamic degree', 'text-to-video alignment', 'factual consistency', 'conversations'],
    num_rows: 4000
})
dict_keys(['id', 'images', 'text prompt', 'video link', 'visual quality', 'temporal consistency', 'dynamic degree', 'text-to-video alignment', 'factual consistency', 'conversations'])
4000
4000


In [4]:
# split_ds = ds["train"].train_test_split(test_size=0.5)
split_ds = ds["train"].train_test_split(test_size=0.98)
train_ds = split_ds["train"]

In [5]:
train_ds

Dataset({
    features: ['id', 'images', 'text prompt', 'video link', 'visual quality', 'temporal consistency', 'dynamic degree', 'text-to-video alignment', 'factual consistency', 'conversations'],
    num_rows: 80
})
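The 80 rows follow directly from `test_size=0.98` applied to the 4000-row train split. The sketch below reproduces that arithmetic with exact fractions; `split_sizes` is a hypothetical helper, and the rounding used inside `datasets` may differ when the sizes do not divide evenly.

```python
from fractions import Fraction


def split_sizes(num_rows: int, test_size: str) -> tuple[int, int]:
    """Return (train_rows, test_rows) for a fractional test_size given as a string."""
    test_rows = int(num_rows * Fraction(test_size))
    return num_rows - test_rows, test_rows


print(split_sizes(4000, "0.98"))
```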

In [5]:
del split_ds, ds

In [3]:
from datasets import load_dataset

# smolvlm2 = load_dataset(path = "C:\\Python_workspace\\TAISC\\dataset\\freeway_video" , name = "smolvlm2" , data_dir="smolvlm2")

# freeway_video
# smolvlm2 = load_dataset(
#     path = "C:\\Python_workspace\\TAISC\\dataset\\freeway_video\\smolvlm2\\dataset_script.py" , 
#     name = "smolvlm2" , 
#     data_dir="C:\\Python_workspace\\TAISC\\dataset\\freeway_video\\smolvlm2" , 
#     trust_remote_code = True
# )

# freeway
# smolvlm2 = load_dataset(
#     path = "C:\\Python_workspace\\TAISC\\dataset\\freeway\\dataset_script.py" , 
#     name = "smolvlm2" , 
#     data_dir="C:\\Python_workspace\\TAISC\\dataset\\freeway" , 
#     trust_remote_code = True
# )

# freeway_sample_video
smolvlm2 = load_dataset(
    path = "C:\\Python_workspace\\TAISC\\dataset\\freeway_sample_video\\dataset_script.py" , 
    name = "smolvlm2" , 
    data_dir="C:\\Python_workspace\\TAISC\\dataset\\freeway_sample_video\\" , 
    trust_remote_code = True
)

In [4]:
split_smolvlm2 = smolvlm2["train"].train_test_split(test_size=0.5)
train_smolvlm2 = split_smolvlm2["train"]

In [5]:
del split_smolvlm2, smolvlm2

Take a sneak peek.

In [21]:
print(f"prompt:  {train_ds[0]['text prompt']}, video: {train_ds[0]['video link']}")

prompt:  A person is holding a sunflower plush toy., video: https://huggingface.co/datasets/hexuan21/VideoFeedback-videos-mp4/resolve/main/p/p111718.mp4


Let's write our data collating function. We apply the chat template so that each video is paired with its caption, which lets the model learn to caption. We then pass the formatted prompts and videos to the processor, which processes both.
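To make the message layout concrete before it goes through the processor, here is a minimal sketch of the structure for one example. `build_messages` is a hypothetical helper that only builds the nested dictionaries; it does not call the processor.

```python
def build_messages(caption: str, video_path: str,
                   instruction: str = "Caption the video.") -> list:
    """Build a two-turn conversation: user (text + video) and assistant (caption)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "video", "path": video_path},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": caption}],
        },
    ]


messages = build_messages(
    "A person is holding a sunflower plush toy.",
    "https://huggingface.co/datasets/hexuan21/VideoFeedback-videos-mp4/resolve/main/p/p111718.mp4",
)
print(messages[1]["content"][0]["text"])
```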

In [29]:
from torch.nn.utils.rnn import pad_sequence

image_token_id = processor.tokenizer.additional_special_tokens_ids[
    processor.tokenizer.additional_special_tokens.index("<image>")
]

def collate_fn(examples):
    instances = []
    for example in examples:
        prompt = example["text prompt"]

        user_content = [{"type": "text", "text": "Caption the video."}]
        user_content.append({"type": "video", "path": example["video link"]})

        messages = [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": [{"type": "text", "text": f"{prompt}"}]}
        ]

        instance = processor.apply_chat_template(
            messages,
            add_generation_prompt=False,
            tokenize=True,
            return_dict=True,
            return_tensors="pt"
        ).to("cuda").to(model.dtype)
        instances.append(instance)

    input_ids = pad_sequence(
        [inst["input_ids"].squeeze(0) for inst in instances],
        batch_first=True,
        padding_value=processor.tokenizer.pad_token_id
    )
    attention_mask = pad_sequence(
        [inst["attention_mask"].squeeze(0) for inst in instances],
        batch_first=True,
        padding_value=0
    )
    # labels are a copy of input_ids; padding and image tokens are set to -100
    # so that the loss ignores them
    labels = pad_sequence(
        [inst["input_ids"].squeeze(0).clone() for inst in instances],
        batch_first=True,
        padding_value=-100
    )
    labels[labels == image_token_id] = -100

    out = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }


    # Step 1: figure out maximum frames, height, width across the batch
    pvs = [inst["pixel_values"].squeeze(0) for inst in instances if "pixel_values" in inst]
    if pvs:  # there is at least one non-None pixel_values
        max_frames = max(pv.shape[0] for pv in pvs)
        max_h = max(pv.shape[-2] for pv in pvs)
        max_w = max(pv.shape[-1] for pv in pvs)
    else:
        max_h = max_w = processor.video_size['longest_edge']
        max_frames = 1

    padded_pixel_values_list = []
    for ex in instances:
        pv = ex.get("pixel_values", None)

        if pv is None:
            # text-only => fill pixel data with zeros
            shape_pv = (max_frames, 3, max_h, max_w)
            padded_pv = torch.zeros(shape_pv, dtype=torch.float32)
        else:
            # squeeze only after the None check, otherwise text-only examples crash
            pv = pv.squeeze(0)
            f, c, h, w = pv.shape
            # Prepare final storage
            padded_pv = torch.zeros(
                (max_frames, c, max_h, max_w),
                dtype=pv.dtype,
                device=pv.device
            )
            padded_pv[:f, :, :h, :w] = pv
        padded_pixel_values_list.append(padded_pv)

    out["pixel_values"] = torch.stack(padded_pixel_values_list, dim=0)
    return out
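
The padding and label masking in `collate_fn` can be sketched without tensors. The function below is a plain-Python illustration under simplifying assumptions (token ids as lists, made-up sequences); `pad_and_mask` is a hypothetical name, while the ids 2 (`<|im_end|>`, the pad token) and 49190 (`<image>`) are the ones reported by the tokenizer in this notebook.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def pad_and_mask(sequences, pad_id, image_token_id):
    """Right-pad token id lists and build the attention mask and labels."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # image placeholder tokens and padding do not contribute to the loss
        masked = [IGNORE_INDEX if tok == image_token_id else tok for tok in seq]
        labels.append(masked + [IGNORE_INDEX] * n_pad)
    return input_ids, attention_mask, labels


ids, mask, labels = pad_and_mask([[5, 49190, 7], [5, 8]], pad_id=2, image_token_id=49190)
print(ids)
print(mask)
print(labels)
```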

In [14]:
print("special tokens:", processor.tokenizer.additional_special_tokens)
print("special token ids:", processor.tokenizer.additional_special_tokens_ids)

special tokens: ['<fake_token_around_image>', '<image>', '<end_of_utterance>']
special token ids: [49189, 49190, 49279]


#### TAISC: VQA fine-tune

In [6]:
# for taisc

from torch.nn.utils.rnn import pad_sequence

# video_token_id = processor.tokenizer.additional_special_tokens_ids[
#     processor.tokenizer.additional_special_tokens.index("<video>")
# ]

image_token_id = processor.tokenizer.additional_special_tokens_ids[
    processor.tokenizer.additional_special_tokens.index("<image>")
]

def collate_fn(examples):
    instances = []

    for example in examples:
        prompt = example["text"]  # the "text" field of your jsonl
        # freeway
        # video_path = os.path.join("C:\\Python_workspace\\TAISC\\dataset\\freeway\\train" , example["video"])  # the "video" field of your jsonl
        # freeway_video
        video_path = os.path.join("C:\\Python_workspace\\TAISC\\dataset\\freeway_video\\smolvlm2\\train" , example["video"])  # the "video" field of your jsonl
        
        # print(video_path)

        user_content = [
            {"type": "text", "text": prompt},
            {"type": "video", "path": video_path}
        ]

        messages = [
            {"role": "user", "content": user_content},
            # the chat template expects text, so the label is converted to a string
            {"role": "assistant", "content": [{"type": "text", "text": str(example["label"])}]}
        ]

        instance = processor.apply_chat_template(
            messages,
            add_generation_prompt=False,
            tokenize=True,
            return_dict=True,
            return_tensors="pt"
        ).to("cuda").to(model.dtype)

        # add the label (torch.tensor requires example["label"] to be numeric, e.g. an integer class id)
        instance["label"] = torch.tensor(example["label"], dtype=torch.long)
        instances.append(instance)

    input_ids = pad_sequence(
        [inst["input_ids"].squeeze(0) for inst in instances],
        batch_first=True,
        padding_value=processor.tokenizer.pad_token_id
    )

    attention_mask = pad_sequence(
        [inst["attention_mask"].squeeze(0) for inst in instances],
        batch_first=True,
        padding_value=0
    )

    labels = pad_sequence(
        [inst["input_ids"].squeeze(0).clone() for inst in instances],
        batch_first=True,
        padding_value=-100
    )
    labels[labels == image_token_id] = -100

    out = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,  # token-level labels, used when training with the text-generation objective
        "classification_labels": torch.stack([inst["label"] for inst in instances])
    }

    # handle video features
    pvs = [inst["pixel_values"].squeeze(0) for inst in instances if "pixel_values" in inst]
    if pvs:
        max_frames = max(pv.shape[0] for pv in pvs)
        max_h = max(pv.shape[-2] for pv in pvs)
        max_w = max(pv.shape[-1] for pv in pvs)
    else:
        max_h = max_w = processor.video_size['longest_edge']
        max_frames = 1

    padded_pixel_values_list = []
    for ex in instances:
        pv = ex.get("pixel_values", None)
        if pv is None:
            shape_pv = (max_frames, 3, max_h, max_w)
            padded_pv = torch.zeros(shape_pv, dtype=torch.float32)
        else:
            pv = pv.squeeze(0)
            f, c, h, w = pv.shape
            padded_pv = torch.zeros(
                (max_frames, c, max_h, max_w),
                dtype=pv.dtype,
                device=pv.device
            )
            padded_pv[:f, :, :h, :w] = pv
        padded_pixel_values_list.append(padded_pv)

    out["pixel_values"] = torch.stack(padded_pixel_values_list, dim=0)

    return out
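
The video padding step pads every clip to the largest number of frames, height and width found in the batch. A minimal sketch of the resulting batch shape follows; `padded_batch_shape` is a hypothetical helper and the example shapes are made up for illustration.

```python
def padded_batch_shape(shapes):
    """shapes: list of (frames, channels, height, width) tuples, one per example."""
    max_frames = max(s[0] for s in shapes)
    channels = shapes[0][1]
    max_h = max(s[2] for s in shapes)
    max_w = max(s[3] for s in shapes)
    # every example is zero-padded up to these maxima before stacking
    return (len(shapes), max_frames, channels, max_h, max_w)


print(padded_batch_shape([(8, 3, 384, 384), (12, 3, 384, 512)]))
```

Because the whole batch is padded to its largest clip, a single long or high-resolution video increases the memory used by every example in that batch.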




In [9]:
print(processor.tokenizer.set_truncation_and_padding)

<bound method PreTrainedTokenizerFast.set_truncation_and_padding of GPT2TokenizerFast(name_or_path='HuggingFaceTB/SmolVLM2-500M-Video-Instruct', vocab_size=49152, model_max_length=8192, is_fast=True, padding_side='right', truncation_side='left', special_tokens={'bos_token': '<|im_start|>', 'eos_token': '<end_of_utterance>', 'unk_token': '<|endoftext|>', 'pad_token': '<|im_end|>', 'additional_special_tokens': ['<fake_token_around_image>', '<image>', '<end_of_utterance>'], 'end_of_utterance_token': '<end_of_utterance>', 'fake_image_token': '<fake_token_around_image>', 'global_image_token': '<global-img>', 'image_token': '<image>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, norm

## Training

We can now initialize `TrainingArguments` and pass them to `Trainer`.

Some notes:
- If you use 8-bit QLoRA with the setup below, training uses around 16.4 GB of VRAM (beautiful, fits comfortably on an L4).
- We use gradient accumulation to simulate a larger batch size.
- We also save memory on intermediate activations by using gradient checkpointing.

**Disclaimer:**
The techniques here aren't a free lunch. The latter two add extra compute to training and therefore slow it down a bit (for reference, on two A100s with a batch size of 16, training took 2 hr 43 min with 4 gradient accumulation steps; disabling it reduced that to 2 hr 35 min).
If you want to speed things up, you can experiment with reducing to 4-bit precision and using a larger batch size. Note that 4-bit quantization might result in the model learning less.
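
To see what gradient accumulation buys you, the sketch below computes the effective batch size, that is, the number of samples contributing to each optimizer update. `effective_batch_size` is a hypothetical helper; the argument names mirror the `TrainingArguments` fields.

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Samples that contribute to one optimizer update."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


# per-step memory depends on the per-device batch size only
print(effective_batch_size(2, 1))  # no accumulation
print(effective_batch_size(2, 8))  # same per-step batch, larger effective batch
```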

In [31]:
from transformers import TrainingArguments, Trainer

model_name = model_id.split("/")[-1]

training_args = TrainingArguments(
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=25,
    save_strategy="steps",
    save_steps=250,
    save_total_limit=1,
    # optim="adamw_hf", # for 8-bit, keep paged_adamw_8bit, else adamw_hf
    optim="adamw_torch",
    bf16=True,
    # output_dir=f"./{model_name}-video-feedback",
    # hub_model_id=f"{model_name}-video-feedback",
    output_dir=f"./{model_name}-taisc",
    hub_model_id=f"{model_name}-taisc",
    remove_unused_columns=False,
    report_to="tensorboard",
    dataloader_pin_memory=False
)
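
With a small dataset it is worth checking how many optimizer steps these arguments actually produce. The sketch below is an approximation of the step count, not the exact `Trainer` logic, and `optimizer_steps` is a hypothetical helper. It assumes the 80-example `train_ds` from earlier; the size of `train_smolvlm2` is not shown in this notebook.

```python
import math


def optimizer_steps(num_examples: int, per_device_batch_size: int,
                    grad_accum_steps: int, epochs: int = 1,
                    num_devices: int = 1) -> int:
    """Approximate number of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return math.ceil(batches_per_epoch / grad_accum_steps) * epochs


total = optimizer_steps(num_examples=80, per_device_batch_size=2, grad_accum_steps=1)
print("total steps:", total)
print("finishes before warmup_steps=50 ends:", total < 50)
print("finishes before save_steps=250 triggers:", total < 250)
```

Under these assumptions the run ends while the learning rate is still warming up and before any intermediate checkpoint is written, so you may want to lower `warmup_steps` and `save_steps` for very small datasets.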


In [32]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    # train_dataset=train_ds,
    train_dataset=train_smolvlm2
)

In [33]:
trainer.train()

You have used fast image processor with LANCZOS resample which not yet supported for torch.Tensor. BICUBIC resample will be used as an alternative. Please fall back to image processor if you want full consistency with the original model.
Token indices sequence length is longer than the specified maximum sequence length for this model (10080 > 8192). Running this sequence through the model will result in indexing errors


RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [None]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/merve/SmolVLM2-500M-Video-Instruct-video-feedback/commit/2f33b0685d991475ac091593e224f3e5e7b7cac7', commit_message='End of training', commit_description='', oid='2f33b0685d991475ac091593e224f3e5e7b7cac7', pr_url=None, repo_url=RepoUrl('https://huggingface.co/merve/SmolVLM2-500M-Video-Instruct-video-feedback', endpoint='https://huggingface.co', repo_type='model', repo_id='merve/SmolVLM2-500M-Video-Instruct-video-feedback'), pr_revision=None, pr_num=None)

The test example is a video of a woman walking by; you can download it [here](https://huggingface.co/datasets/hexuan21/VideoFeedback-videos-mp4/blob/main/p/p000304.mp4) to check.

In [None]:
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Caption the video."},
        {"type": "video", "path": "https://huggingface.co/datasets/hexuan21/VideoFeedback-videos-mp4/resolve/main/p/p000304.mp4"},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to("cuda").to(model.dtype)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])

User: Caption the video.You are provided the following series of three frames from a 0:00:03 [H:MM:SS] video.

Frame from 00:00:
Frame from 00:01:
Frame from 00:02:


Assistant: woman in white shirt walks by
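
The decoded text contains the whole conversation, including the prompt. If you only need the caption, one simple option is to keep everything after the last `Assistant:` marker. This is a minimal sketch that assumes the output format shown above; `extract_assistant_reply` is a hypothetical helper.

```python
def extract_assistant_reply(generated_text: str, marker: str = "Assistant:") -> str:
    """Return the text after the last marker, or the whole text if it is absent."""
    _, found, reply = generated_text.rpartition(marker)
    return reply.strip() if found else generated_text.strip()


sample = "User: Caption the video.\n\nAssistant: woman in white shirt walks by"
print(extract_assistant_reply(sample))
```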
