<a href="https://colab.research.google.com/github/xiaodongeast/multimodal/blob/main/Smol_VLM_FT_GRPO_step1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Adapted from the official notebook for SFT
Fine-tune SmolVLM on Visual Question Answering using Consumer GPU with QLoRA

In this notebook we will fine-tune SmolVLM on the VQAv2 dataset. You can also use this notebook to fine-tune Idefics3, since both models share the same model class and architecture.

We will use some techniques in this notebook that let you fine-tune the model on an L4 with a batch size of 4 using only around 16.4 GB of VRAM. We tested the notebook in that setup, but because we were able to afford an A100, it was last run on an A100.

In [None]:
!pip install -q accelerate datasets peft bitsandbytes tensorboard
!pip install -q flash-attn --no-build-isolation

We will push our model to the Hub, so we need to authenticate ourselves.

In [None]:
from google.colab import drive
drive.mount('/content/drive6')

In [None]:
from datasets import load_dataset
from datasets import concatenate_datasets, DatasetDict
from huggingface_hub import notebook_login
import torch
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics3ForConditionalGeneration

In this notebook we will not do full fine-tuning but use the QLoRA method, which attaches an adapter to a quantized version of the model, saving memory. If you want to do full fine-tuning, set `USE_LORA` and `USE_QLORA` to False. If you want to do LoRA, set `USE_QLORA` to False and `USE_LORA` to True.
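To get a feel for why adapters save memory, the sketch below counts the trainable parameters LoRA adds to a single linear layer. The layer size is an illustrative assumption, not SmolVLM's actual dimensions, and `lora_param_count` is a hypothetical helper; `r=8` matches the rank used in this notebook's `LoraConfig`.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer of shape (d_out, d_in).

    LoRA learns two low-rank matrices: A with shape (r, d_in) and B with shape (d_out, r).
    """
    return r * d_in + d_out * r


# Hypothetical layer size, chosen only for illustration.
d_in, d_out, r = 2048, 2048, 8

full = d_in * d_out                       # weights updated by full fine-tuning
lora = lora_param_count(d_in, d_out, r)   # weights updated by LoRA

print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (r={r}): {lora:,} parameters ({lora / full:.2%} of full)")
```

DoRA, which the configuration enables when QLoRA is off, adds a small magnitude vector per layer on top of this; that is not counted here.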

In [None]:
USE_LORA = True
USE_QLORA = False
SMOL = True

model_id = "HuggingFaceTB/SmolVLM-instruct"

processor = AutoProcessor.from_pretrained(
    model_id
)

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
        use_dora=False if USE_QLORA else True,
        init_lora_weights="gaussian"
    )
    lora_config.inference_mode = False
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )

    model = Idefics3ForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=bnb_config if USE_QLORA else None,
       # _attn_implementation="flash_attention_2",
        device_map="cuda",
        dtype=torch.bfloat16
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, lora_config)
    print(model.get_nb_trainable_parameters())
else:
    model = Idefics3ForConditionalGeneration.from_pretrained(
        model_id,
        _attn_implementation="flash_attention_2",
        dtype=torch.bfloat16,  # same dtype argument as the LoRA branch
    ).to("cuda")  # same device as the LoRA branch

    # if you'd like to only fine-tune LLM
    for param in model.model.vision_model.parameters():
        param.requires_grad = False

The model as-is holds 2.7 GB of GPU RAM 💗

## Loading the dataset and preprocessing

We will load a small portion of the VQAv2 dataset. We are loading only a small portion of the dataset for educational purposes.

In [None]:
from datasets import load_dataset


In [None]:
#split_ds = ds["validation"].train_test_split(test_size=0.9)
#train_ds = split_ds["train"]


#notebook_login()

ds = load_dataset('bugkiller2025/train')


# merge all splits into one big dataset
full = concatenate_datasets([split for split in ds.values()])

# shuffle once
full = full.shuffle(seed=42)

# now split: 40% train vs 60% remaining
train_test = full.train_test_split(test_size=0.6, seed=42)
train_ds = train_test["train"]
remaining = train_test["test"]

# keep 40% of the remaining data (24% of total) and split it evenly into eval and test
intermediate = remaining.train_test_split(test_size=0.6, seed=42)
eval_test = intermediate["train"].train_test_split(test_size=0.5, seed=42)

eval_ds = eval_test["train"]   # ~12% of total
test_ds = eval_test["test"]    # ~12% of total

dataset_dict = DatasetDict({
    "train": train_ds,
    "eval": eval_ds,
    "test": test_ds
})
print(dataset_dict)

dataset_dict.push_to_hub("bugkiller2025/vqa_reasoning")
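
The chained `train_test_split` calls above compound, so the final fractions are products of the individual ratios. A quick way to check them, using only the `test_size` values from the cell above (`split_fractions` is a hypothetical helper, and the real splits round to whole examples):

```python
def split_fractions(first_test: float, second_test: float, third_test: float) -> dict:
    """Fractions of the full dataset that end up in each split.

    Mirrors the three chained splits: full -> (train, remaining),
    remaining -> (kept, discarded), kept -> (eval, test).
    """
    train = 1.0 - first_test
    remaining = first_test
    kept = remaining * (1.0 - second_test)
    return {
        "train": train,
        "eval": kept * (1.0 - third_test),
        "test": kept * third_test,
        "unused": remaining * second_test,
    }


fractions = split_fractions(0.6, 0.6, 0.5)
for name, frac in fractions.items():
    print(f"{name}: {frac:.0%}")
```

With these ratios roughly a third of the data lands in the discarded part of the second split and is not pushed to the Hub.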

In [None]:
from datasets import load_dataset
from datasets import concatenate_datasets, DatasetDict
from huggingface_hub import notebook_login

#notebook_login()

ds = load_dataset("bugkiller2025/vqa_reasoning")


In [None]:
notebook_login()

In [None]:
train_ds = ds['train']
eval_ds = ds['eval']
test_ds = ds['test']

In [None]:
print(train_ds[0])
train_ds[0]['image']

In [None]:
from transformers import Idefics3ForConditionalGeneration,AutoProcessor
import torch

#model_id =   "HuggingFaceTB/SmolVLM-instruct"
#model = Idefics3ForConditionalGeneration.from_pretrained(
 #       model_id,
  #  ).to('cuda')

#processor = AutoProcessor.from_pretrained(
 #   "HuggingFaceTB/SmolVLM-instruct"
#)

# Use a consistent instruction; put it in the system message.
instruct ="""Answer question about the image with your reasoning. Follow this format:
<think>
[Your detailed chain-of-thought goes here]
</think>
<answer>
[Your final answer goes here]
</answer>
"""

import matplotlib.pyplot as plt

from os import system

from typing import Dict, Any, List
from PIL import Image

print(processor.tokenizer.SPECIAL_TOKENS_ATTRIBUTES)
print(processor.tokenizer.additional_special_tokens)

def _normalize_example(example: Dict[str, Any]) -> Dict[str, Any]:
    """
    Converts your record into: question(str), answer_text(str), image(PIL), solution(str or "")
    Supports:
      - answer as index into choices
      - or 'multiple_choice_answer' already as string
      - solution may be missing (defaults to "")
    """
    # --- image ---
    img = example["image"]
    if isinstance(img, dict) and "path" in img:
        img = Image.open(img["path"]).convert("RGB")
    elif hasattr(img, "mode"):
        if img.mode != "RGB":
            img = img.convert("RGB")
    else:
        # last resort: if it's a path string
        img = Image.open(str(img)).convert("RGB")
    # --- question ---
    question = example.get("question", "").strip()

    # --- answer text ---
    if "multiple_choice_answer" in example and example["multiple_choice_answer"] is not None:
        answer_text = str(example["multiple_choice_answer"]).strip()
    else:
        # Your format: choices + integer answer index
        choices = example.get("choices")
        ans_idx = example.get("answer")
        if isinstance(ans_idx, int) and isinstance(choices, list) and 0 <= ans_idx < len(choices):
            answer_text = str(choices[ans_idx]).strip()
        else:
            # fallback: maybe "answer" is already a string
            answer_text = str(example.get("answer", "")).strip()

    # --- chain-of-thought / solution ---
    thought = str(example.get("solution", "") or "").strip()

    return {
        "image": img,
        "question": question,
        "answer_text": answer_text,
        "thought": thought,
        "choices": example.get("choices")  # passthrough (optional)
    }

def get_response(example, processor, model):
  device ='cuda'
  image = example["image"]
  # if decode=False, you'll get {'path': '...', 'bytes': None}

  question = example["question"]

  messages = [

        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text":instruct + '\n' + 'Querstion:\n' + question}

            ],
        },
    ]

  formatted_query = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  print(formatted_query)

  # Tokenize the query
  model_inputs = processor(
      images=image,
      text=formatted_query,
      return_tensors="pt"
  ).to(device, dtype=torch.bfloat16) # Add dtype here

  # Generate predictions
  with torch.no_grad():
      outputs = model.generate(**model_inputs,max_new_tokens =200)
  trimmed_generated_ids = [out_ids[len(in_ids) :] for in_ids, out_ids in zip( model_inputs.input_ids, outputs)]


  # Decode the prediction
  prediction = processor.batch_decode(  trimmed_generated_ids, skip_special_tokens=True)

  # Display the result
  print(f"Query: {question}")

  print(f"Expected Answer: {example['answer_text']}")
  print("="*20)
  print(f"Model Prediction: {prediction}")


test_sample = _normalize_example(ds['train'][1])
print(test_sample)
get_response(test_sample, processor, model)

In [None]:

test_sample = _normalize_example(ds['test'][110])
print(test_sample)
get_response(test_sample, processor, model)
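
Since the prompt asks for a `<think>`/`<answer>` layout, it can help to split a generated string back into its two parts before comparing it with the expected answer. A minimal sketch with the standard `re` module; `parse_response` is a hypothetical helper, not part of this notebook or of any library:

```python
import re
from typing import Optional, Tuple

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_response(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (thought, answer); each is None if its tag pair is missing."""
    think = _THINK_RE.search(text)
    answer = _ANSWER_RE.search(text)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )


# Made-up model output, used only to exercise the parser.
sample = "<think>\nThe sky is clear.\n</think>\n<answer>\nyes\n</answer>"
thought, answer = parse_response(sample)
print(thought)
print(answer)
```

Returning `None` for a missing tag pair makes it easy to count how often the model fails to follow the format at all.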

In [None]:
image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")]
print(processor.tokenizer.SPECIAL_TOKENS_ATTRIBUTES)
print(processor.tokenizer.additional_special_tokens)


from os import system
from pprint import pprint

from typing import Dict, Any, List
from PIL import Image

def format_example(exn: Dict[str, Any]) -> List[Dict[str, Any]]:
    """
    Builds chat-style messages for SmolVLM where the assistant replies with:
        <think> ... </think>
        <answer> ... </answer>
    """

    # Comment this block out if you don't want choices shown to the model.
    q = exn["question"]
    #if isinstance(exn.get("choices"), list) and exn["choices"]:
        # e.g., "Question ...\nOptions: yes | no"
    #    q = f"{q}\nOptions: " + " | ".join(map(str, exn["choices"]))

    # Desired assistant output:
    #   <think> ...</think>
    #   <answer> ...</answer>
    question = exn["question"]


    thoughts = exn['thought'] if exn['thought'] else ""
    # Build the target without leading indentation so it matches the layout in `instruct`.
    assistant_text = (
        f"<think>\n{thoughts.strip()}\n</think>\n"
        f"<answer>\n{exn['answer_text'].strip()}\n</answer>"
    )

    messages = [
      {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text":instruct + '\n' + 'Querstion:\n' + question}

            ],
        },

        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": assistant_text}
            ]
        }
      ]
    return messages

pprint(_normalize_example(ds['train'][0]))
#print("=" * 100)
exn = _normalize_example(ds['train'][0])
pprint(format_example(exn))


def collate_fn(examples):
  texts = []
  images = []
  for example in examples:
      exn = _normalize_example(example)
      image = exn["image"]
      messages = format_example(exn)
      #print(messages)
      text = processor.apply_chat_template(messages, add_generation_prompt=False)
      #print(text)
      texts.append(text)
      images.append([image])

  batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
  labels = batch["input_ids"].clone()
  labels[labels == processor.tokenizer.pad_token_id] = -100
  labels[labels == image_token_id] = -100
  batch["labels"] = labels

  return batch

_ = collate_fn([train_ds[2]])


Let's write our data collating function. We apply the chat template so that each question and its answer end up in a single sequence the model can learn from. We then pass the formatted prompts and the images to the processor, which tokenizes the text and preprocesses the images.

In [None]:
image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")]
print(processor.tokenizer.SPECIAL_TOKENS_ATTRIBUTES)
print(processor.tokenizer.additional_special_tokens)
def collate_fn(examples):
  texts = []
  images = []
  for example in examples:
      example = _normalize_example(example)

      messages = format_example(example)
      image = example["image"]
      text = processor.apply_chat_template(messages, add_generation_prompt=False)
      texts.append(text.strip())
      images.append([image])

  batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
  labels = batch["input_ids"].clone()
  labels[labels == processor.tokenizer.pad_token_id] = -100
  labels[labels == image_token_id] = -100
  batch["labels"] = labels

  return batch

_ = collate_fn([train_ds[0]])
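
The masking step in `collate_fn` relies on the loss skipping positions labelled `-100`. The sketch below shows the same idea on plain Python lists; the token ids are made up for illustration and do not correspond to SmolVLM's vocabulary, and `mask_labels` is a hypothetical helper:

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_labels(input_ids, ignore_ids):
    """Copy input_ids into labels, replacing every id in ignore_ids with IGNORE_INDEX."""
    return [[IGNORE_INDEX if tok in ignore_ids else tok for tok in row] for row in input_ids]


# Hypothetical ids: 0 = padding, 9 = image placeholder, everything else = text.
PAD_ID, IMAGE_ID = 0, 9
batch_input_ids = [
    [9, 9, 15, 27, 31],
    [9, 9, 42, 0, 0],
]

labels = mask_labels(batch_input_ids, {PAD_ID, IMAGE_ID})
print(labels)  # [[-100, -100, 15, 27, 31], [-100, -100, 42, -100, -100]]
```

As in `collate_fn`, the question tokens stay unmasked here, so the prompt text also contributes to the loss.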

## Training

We can now initialize `TrainingArguments` and pass them to `Trainer`.

Some notes:
- If you use 8-bit QLoRA with the below setup it uses around 16.4 GB VRAM (beautiful, fits comfortably inside L4, Colab free tier)
- We use gradient accumulation to simulate a larger batch size.
- We also save up on memory from intermediate activations by using gradient checkpointing.
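
As a worked example of the gradient-accumulation note, the effective batch size is the per-device batch size times the accumulation steps times the number of devices. The batch size and accumulation steps below are the values used in this notebook's `TrainingArguments`; the dataset size and the single device are assumptions for illustration, and both helpers are hypothetical:

```python
import math


def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to one optimizer step."""
    return per_device * accumulation_steps * num_devices


def optimizer_steps_per_epoch(num_examples: int, per_device: int,
                              accumulation_steps: int, num_devices: int = 1) -> int:
    """Approximate optimizer updates per epoch, rounding up a partial final batch."""
    batch = effective_batch_size(per_device, accumulation_steps, num_devices)
    return math.ceil(num_examples / batch)


batch = effective_batch_size(per_device=8, accumulation_steps=8)
steps = optimizer_steps_per_epoch(num_examples=10_000, per_device=8, accumulation_steps=8)
print(batch)  # 64
print(steps)  # 157
```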

**Disclaimer:**
The techniques here aren't a free lunch. The latter two add extra compute to training and therefore slow it down a bit (for reference, on two A100s with a batch size of 16, training took 2 hrs 43 mins with 4 gradient accumulation steps; disabling accumulation reduced that to 2 hrs 35 mins).
If you want to speed things up, you can experiment with reducing to 4-bit precision and using a larger batch size. Note that 4-bit precision might result in the model learning less.

In [None]:
from transformers import TrainingArguments, Trainer

model_name = model_id.split("/")[-1]

training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,
    warmup_steps=10,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=25,
    save_strategy="steps",
    save_steps=20,
    save_total_limit=1,
    optim="paged_adamw_8bit", # for 8-bit, keep this, else adamw_hf
    bf16=True, # underlying precision for 8bit
    output_dir="/content/drive6/MyDrive/smolvlm-instruct-s4",
    hub_model_id="smolvlm-instruct-s4",
    report_to="tensorboard",
    remove_unused_columns=False,
    gradient_checkpointing=True
)


In [None]:

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=train_ds,
)

In [None]:
trainer.train()
trainer.save_model("/content/drive6/MyDrive/smolvlm-instruct-s4")
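# The first argument of push_to_hub is the commit message; the repo name comes from hub_model_id.
trainer.push_to_hub(commit_message="smolvlm-instruct-s4")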

In [None]:
model

In [None]:
trainer.push_to_hub()

In [None]:
load_id = f'bugkiller2025/{model_name}-vqav2'

In [None]:
trainer.save_model(f"{model_name}-vqav2b")

In [None]:
model_no_train = Idefics3ForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        _attn_implementation="flash_attention_2",
    ).to('cuda')


In [None]:
model = Idefics3ForConditionalGeneration.from_pretrained(
        load_id,
        torch_dtype=torch.bfloat16,
        _attn_implementation="flash_attention_2",
    ).to('cuda')

processor = AutoProcessor.from_pretrained(
    model_id
)

In [None]:

test_sample = _normalize_example(test_ds[1])
print(test_sample)
get_response(test_sample, processor, model)



In [None]:

test_sample = _normalize_example(test_ds[1])
print(test_sample)
get_response(test_sample, processor, model_no_train)