<a href="https://colab.research.google.com/github/wayneotemah/A2A/blob/main/Llama_Vision_Finetune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Tiny sign-language video dataset (LLaVA JSON)

In [1]:
DATA_DIR = "/content/drive/MyDrive/Added_to_dictionary"

In [2]:
import json
import os
import pathlib

In [3]:
os.listdir(DATA_DIR)

['My_name_is_leila.MP4',
 'me_hopital_back_when.MP4',
 'How_are_you.MP4',
 'nice_meeting_you.MP4',
 'Goodbye.MP4',
 'Welcome.MP4',
 'brother.MP4',
 'sister.MP4',
 'son.MP4',
 'husband.MP4',
 'child.MP4',
 'baby.MP4',
 'grandmother.MP4',
 'family.MP4',
 'happy valentine.mp4',
 'my_risk_reduce_how.MP4',
 'where_toilet.MP4',
 'Hello.MP4',
 'difficulty_breathing_when_lying_down.MP4',
 'Feeling_overeating_Small.MP4']

In [4]:
# Build a minimal training JSON:
# Prompt uses <video> token; response is the English meaning (or gloss).
items = []
for vid in sorted(os.listdir(DATA_DIR)):
    if not vid.lower().endswith(".mp4"):
        continue

    # Use the file stem (name without its extension) as the label,
    # replacing underscores with spaces.
    translation = pathlib.Path(vid).stem.replace("_", " ")

    items.append({
        "id": pathlib.Path(vid).stem,
        "video": vid,
        "conversations": [
            {"from": "human", "value": "<video>\nWhat does this sign mean?"},
            {"from": "gpt",   "value": translation.lower()}
        ]
    })



In [5]:
items[0]

{'id': 'Feeling_overeating_Small',
 'video': 'Feeling_overeating_Small.MP4',
 'conversations': [{'from': 'human',
   'value': '<video>\nWhat does this sign mean?'},
  {'from': 'gpt', 'value': 'feeling overeating small'}]}

In [6]:

with open("/content/train_video.json", "w", encoding="utf-8") as f:
    json.dump(items, f, ensure_ascii=False, indent=2)

print(f"Wrote {len(items)} examples to /content/train_video.json")

Wrote 20 examples to /content/train_video.json
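
Before training, it can help to sanity-check the records that were just written. The sketch below is not part of the fine-tuning repo: `validate_items` is a hypothetical helper that assumes the LLaVA-style schema shown above (`id`, `video`, `conversations`) and reports records whose video file is missing, whose id is repeated, or whose turns do not follow the human/gpt pattern.

```python
import os


def validate_items(items, video_dir):
    """Return a list of (id, problem) tuples for records that look malformed."""
    problems = []
    seen_ids = set()
    for item in items:
        item_id = item.get("id", "<missing id>")
        if item_id in seen_ids:
            problems.append((item_id, "duplicate id"))
        seen_ids.add(item_id)

        # The video path is resolved relative to the folder passed to training.
        video = item.get("video")
        if not video or not os.path.isfile(os.path.join(video_dir, video)):
            problems.append((item_id, "video file not found"))

        turns = item.get("conversations", [])
        if len(turns) < 2:
            problems.append((item_id, "expected a human turn and a gpt turn"))
            continue
        if turns[0].get("from") != "human" or "<video>" not in turns[0].get("value", ""):
            problems.append((item_id, "first turn must be human and contain <video>"))
        if turns[1].get("from") != "gpt" or not turns[1].get("value", "").strip():
            problems.append((item_id, "second turn must be a non-empty gpt reply"))
    return problems
```

Calling `validate_items(items, DATA_DIR)` returns an empty list when every record passes these checks; anything else is worth fixing before launching a training run.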


## Clone the repository

In [7]:
%cd /content
!git clone https://github.com/2U1/Llama3.2-Vision-Finetune.git
%cd Llama3.2-Vision-Finetune
!ls -la

/content
Cloning into 'Llama3.2-Vision-Finetune'...
remote: Enumerating objects: 363, done.
remote: Counting objects: 100% (130/130), done.
remote: Compressing objects: 100% (81/81), done.
remote: Total 363 (delta 89), reused 89 (delta 49), pack-reused 233 (from 1)
Receiving objects: 100% (363/363), 87.91 KiB | 14.65 MiB/s, done.
Resolving deltas: 100% (252/252), done.
/content/Llama3.2-Vision-Finetune
total 64
drwxr-xr-x 5 root root  4096 Aug 16 21:05 .
drwxr-xr-x 1 root root  4096 Aug 16 21:05 ..
-rw-r--r-- 1 root root  4676 Aug 16 21:05 environment.yaml
drwxr-xr-x 8 root root  4096 Aug 16 21:05 .git
-rw-r--r-- 1 root root  3240 Aug 16 21:05 .gitignore
-rw-r--r-- 1 root root 11357 Aug 16 21:05 LICENSE
-rw-r--r-- 1 root root 12993 Aug 16 21:05 README.md
-rw-r--r-- 1 root root  1964 Aug 16 21:05 requirements.txt
drwxr-xr-x 2 root root  4096 Aug 16 21:05 scripts
drwxr-xr-x 3 root root  4096 Aug 16 21:05 src


## Install dependencies

In [15]:
!pip -q install --upgrade pip
# Core libs
!pip -q install "transformers>=4.43" "accelerate>=0.30" "peft>=0.11.0" bitsandbytes==0.43.1 ujson decord tensorboard trl
# Training extras
!pip -q install deepspeed==0.14.4
# Liger kernel (optional speedup used by the repo; if it errors you can skip it)
!pip -q install liger-kernel
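
After the installs finish, it can be useful to confirm which versions actually landed, since the commands above mix exact pins with lower bounds. This is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list simply mirrors the install commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(
    ["transformers", "accelerate", "peft", "bitsandbytes", "deepspeed", "liger-kernel"]
).items():
    print(f"{name:>14}: {version or 'NOT INSTALLED'}")
```

A `NOT INSTALLED` entry for `liger-kernel` is tolerable if that optional install was skipped, provided the training command does not request it.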


## Log in to Hugging Face & request model access

In [19]:
from huggingface_hub import login
login()  # opens a prompt; paste your HF token into it, not into the code


## Low-VRAM training recipe (LoRA + 8-bit + ZeRO-3 offload)

We’ll start with the most Colab-friendly configuration:

* Load the base model in 8-bit (--bits 8)

* Enable LoRA on the language model, with the base LLM weights frozen per the repo’s rules (--lora_enable True, --freeze_llm True)

* Freeze the vision tower to save VRAM

* Keep the image projector trainable (small, but important for adapting to the new visuals)

* Limit each clip to 12 frames (--max_num_frames 12)

* Fall back to ZeRO-3 offload if you run out of VRAM (the repo suggests this)
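
To make the frame cap concrete: one common way to limit a clip to a fixed number of frames is to sample indices evenly across its length. The helper below is a hypothetical illustration of that idea, using the value from the `--max_num_frames 12` flag in the command that follows. It is not the repo's actual sampling code, which may differ.

```python
def uniform_frame_indices(total_frames, max_frames):
    """Pick up to max_frames indices spread evenly over [0, total_frames - 1]."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    if max_frames == 1:
        # Single frame requested: take the middle of the clip.
        return [total_frames // 2]
    # Evenly spaced positions that always include the first and last frame.
    step = (total_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]


print(uniform_frame_indices(100, 12))
```

Fewer frames lower the memory cost per example, at the risk of skipping part of a fast sign, so the cap is a trade-off rather than a free saving.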

In [20]:
%%bash
cd /content/Llama3.2-Vision-Finetune
export PYTHONPATH=src:$PYTHONPATH

python -u src/train/train_sft.py \
  --use_liger True \
  --model_id meta-llama/Llama-3.2-11B-Vision-Instruct \
  --data_path /content/train_video.json \
  --image_folder /content/drive/MyDrive/Added_to_dictionary \
  --lora_enable True \
  --vision_lora False \
  --freeze_llm True \
  --freeze_vision_tower True \
  --freeze_img_projector False \
  --bits 8 \
  --fp16 True --bf16 False \
  --max_num_frames 12 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --learning_rate 1e-4 \
  --projector_lr 1e-4 \
  --logging_steps 10 \
  --gradient_checkpointing True \
  --report_to tensorboard \
  --output_dir /content/out_lora_video


2025-08-16 21:33:57.158145: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1755380037.177272    8526 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1755380037.183181    8526 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1755380037.197803    8526 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1755380037.197828    8526 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1755380037.197831    8526 computation_placer.cc:177] computation placer alr

CalledProcessError: Command 'b'cd /content/Llama3.2-Vision-Finetune\nexport PYTHONPATH=src:$PYTHONPATH\n\npython -u src/train/train_sft.py \\\n  --use_liger True \\\n  --model_id meta-llama/Llama-3.2-11B-Vision-Instruct \\\n  --data_path /content/train_video.json \\\n  --image_folder /content/drive/MyDrive/Added_to_dictionary \\\n  --lora_enable True \\\n  --vision_lora False \\\n  --freeze_llm True \\\n  --freeze_vision_tower True \\\n  --freeze_img_projector False \\\n  --bits 8 \\\n  --fp16 True --bf16 False \\\n  --max_num_frames 12 \\\n  --num_train_epochs 1 \\\n  --per_device_train_batch_size 1 \\\n  --gradient_accumulation_steps 8 \\\n  --learning_rate 1e-4 \\\n  --projector_lr 1e-4 \\\n  --logging_steps 10 \\\n  --gradient_checkpointing True \\\n  --report_to tensorboard \\\n  --output_dir /content/out_lora_video\n'' returned non-zero exit status 1.
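
The captured output above shows only XLA/CUDA plugin registration messages before the cell fails, so the Python traceback that explains the non-zero exit status is not visible. One way to surface it, sketched below, is to rerun the command from Python, capture stderr, and print only its tail. `run_and_tail` is a hypothetical helper, not part of the repo, and the demonstration uses a command that fails on purpose rather than the training script.

```python
import subprocess
import sys


def run_and_tail(cmd, tail=40, cwd=None, env=None):
    """Run cmd and return (exit_code, last `tail` lines of its stderr)."""
    result = subprocess.run(cmd, cwd=cwd, env=env, capture_output=True, text=True)
    stderr_lines = result.stderr.splitlines()
    return result.returncode, (stderr_lines[-tail:] if tail > 0 else [])


# Demonstration with a deliberately failing command; swap in the training
# command (as a list of arguments) to see the end of its traceback instead.
code, lines = run_and_tail([sys.executable, "-c", "raise SystemExit('boom')"])
print("exit code:", code)
print("\n".join(lines))
```

When applying this to the training run, pass `cwd="/content/Llama3.2-Vision-Finetune"` and an `env` that includes `PYTHONPATH=src`, matching the `cd` and `export` lines in the bash cell.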