Multimodal models such as Meta's Llama-3.2-11B-Vision, Mistral's Pixtral-12B, Qwen's Qwen2-VL-7B, and Allen AI's Molmo-7B-D-0924 can handle both visual and language inputs.

In this tutorial, we finetune Qwen's Qwen2-VL-7B using TRL.

## Setup Environment

In [6]:
# Install Pytorch & other libraries
%pip install "torch==2.8.0" tensorboard pillow

# Install Hugging Face libraries
%pip install  --upgrade \
  "transformers==4.45.1" \
  "datasets==3.0.1" \
  "accelerate==0.34.2" \
  "evaluate==0.4.3" \
  "bitsandbytes==0.45.3" \
  "trl==0.11.1" \
  "peft==0.13.0" \
  "qwen-vl-utils"

Collecting bitsandbytes==0.45.3
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl (76.1 MB)
Installing collected packages: bitsandbytes
  Attempting uninstall: bitsandbytes
    Found existing installation: bitsandbytes 0.44.0
    Uninstalling bitsandbytes-0.44.0:
      Successfully uninstalled bitsandbytes-0.44.0
Successfully installed bitsandbytes-0.45.3
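
Because the later cells depend on these pinned versions, it can help to confirm what actually got installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the install cell above.

```python
from importlib import metadata

# Packages named in the install cell above
PACKAGES = ["torch", "transformers", "datasets", "accelerate",
            "evaluate", "bitsandbytes", "trl", "peft", "qwen-vl-utils"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

If a version differs from the pin, restarting the runtime after reinstalling is usually needed before the new version is picked up.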


### Create and Prepare Dataset

For this tutorial, we prepare the dataset in a format that the model can understand. We use [philschmid/amazon-product-descriptions-vlm](https://huggingface.co/datasets/philschmid/amazon-product-descriptions-vlm), which contains about 1,350 Amazon products (1,345 in the train split) with titles, images, descriptions, and metadata. We will finetune the model to generate product descriptions from the image, title, and metadata.

In [3]:
# Note: the image is not part of the text prompt; it is passed in separately through the processor
prompt= """Create a Short Product description based on the provided ##PRODUCT NAME## and ##CATEGORY## and image.
Only return description. The description should be SEO optimized and for a better mobile search experience.

##PRODUCT NAME##: {product_name}
##CATEGORY##: {category}"""

system_message = "You are an expert product description writer for Amazon."
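
To see what text the model will actually receive, the template can be filled in by hand. This is only an illustrative sketch: it repeats the `prompt` string from the cell above so that it runs on its own, and it uses the Barbie product that appears later in this notebook's output.

```python
prompt = """Create a Short Product description based on the provided ##PRODUCT NAME## and ##CATEGORY## and image.
Only return description. The description should be SEO optimized and for a better mobile search experience.

##PRODUCT NAME##: {product_name}
##CATEGORY##: {category}"""

# Fill the two placeholders with one product's fields
filled = prompt.format(
    product_name="Barbie Fashionistas Doll Wear Your Heart",
    category="Toys & Games | Dolls & Accessories | Dolls",
)
print(filled)
```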

Format the dataset

In [4]:
from datasets import load_dataset

# Format as prompt messages
def format_data(sample):
    return {"messages": [
            {
                "role": "system",
                "content": [{'type': 'text', 'text': system_message}]
            },
            {
                "role": "user",
                "content": [{
                    'type': 'text',
                    'text': prompt.format(product_name=sample['Product Name'], category=sample['Category']),
                },{
                    'type': 'image',
                    'image': sample['image'],
                    }
                ],
            },
            {
                "role": "assistant",
                "content": [{'type': 'text', 'text': sample['description']}],
            },
        ],
    }

# Load dataset
dataset_id = 'philschmid/amazon-product-descriptions-vlm'
dataset = load_dataset(dataset_id, split='train')

# Apply the format template
dataset = [format_data(sample) for sample in dataset]
print(dataset[125]['messages'])


[{'role': 'system', 'content': [{'type': 'text', 'text': 'You are an expert product description writer for Amazon.'}]}, {'role': 'user', 'content': [{'type': 'text', 'text': 'Create a Short Product description based on the provided ##PRODUCT NAME## and ##CATEGORY## and image. \nOnly return description. The description should be SEO optimized and for a better mobile search experience.\n\n##PRODUCT NAME##: Calico Critters, Doll House Furniture and Décor, Bed & Comforter Set\n##CATEGORY##: None'}, {'type': 'image', 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x500 at 0x7B5C81416360>}]}, {'role': 'assistant', 'content': [{'type': 'text', 'text': 'Adorable Calico Critters bed & comforter set!  Perfect for enhancing your dollhouse, this charming furniture adds cozy detail.  High-quality, durable design for hours of imaginative play.  Shop now!'}]}]
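
Before training, a quick structural check on the formatted samples can catch a missing role or image early. The helper below is a hypothetical sketch, not part of TRL or `datasets`; it assumes the exact message layout produced by `format_data` above and is demonstrated on a stand-in sample with a placeholder image value.

```python
def check_sample(sample):
    """Raise ValueError if a formatted sample lacks the expected structure."""
    messages = sample["messages"]
    roles = [m["role"] for m in messages]
    if roles != ["system", "user", "assistant"]:
        raise ValueError(f"unexpected roles: {roles}")
    user_types = [part["type"] for part in messages[1]["content"]]
    if user_types.count("image") != 1:
        raise ValueError("user turn should contain exactly one image")
    if not messages[2]["content"][0]["text"].strip():
        raise ValueError("assistant description is empty")
    return True

# Stand-in sample; in the notebook you would loop over `dataset` instead
demo = {"messages": [
    {"role": "system", "content": [{"type": "text", "text": "You are an expert."}]},
    {"role": "user", "content": [
        {"type": "text", "text": "Describe the product."},
        {"type": "image", "image": "<PIL image placeholder>"},
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": "A short description."}]},
]}
print(check_sample(demo))
```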


### Load Model

In [9]:
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

model_id = 'Qwen/Qwen2-VL-7B-Instruct'

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16
    )
processor = AutoProcessor.from_pretrained(model_id)

Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}


RuntimeError: CUDA is required but not available for bitsandbytes. Please consider installing the multi-platform enabled version of bitsandbytes, which is currently a work in progress. Please check currently supported platforms and installation instructions at https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend
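
The `RuntimeError` above means bitsandbytes could not see a CUDA device, which this 4-bit loading path requires. A minimal sketch for checking the runtime before loading the model follows; the helper name `cuda_status` is hypothetical, and the fix (for example, switching to a GPU runtime) depends on your environment.

```python
import importlib.util

def cuda_status():
    """Return a short string describing whether a CUDA device is visible to PyTorch."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    if torch.cuda.is_available():
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "CUDA not available"

print(cuda_status())
```

If this reports that CUDA is not available, reinstalling packages alone is unlikely to help; the session itself needs access to a GPU.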

In [8]:
!pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128

Looking in indexes: https://download.pytorch.org/whl/cu128


In [None]:
!pip install "bitsandbytes==0.45.3"

In [None]:
!pip install "triton>=1.1.1"

In [6]:
processor = AutoProcessor.from_pretrained(model_id)


In [12]:
# Preview the chat-template text for one training sample
text = processor.apply_chat_template(
    dataset[2]['messages'], tokenize=False, add_generation_prompt=False
)
print(text)

<|im_start|>system
You are an expert product description writer for Amazon.<|im_end|>
<|im_start|>user
Create a Short Product description based on the provided ##PRODUCT NAME## and ##CATEGORY## and image. 
Only return description. The description should be SEO optimized and for a better mobile search experience.

##PRODUCT NAME##: Barbie Fashionistas Doll Wear Your Heart
##CATEGORY##: Toys & Games | Dolls & Accessories | Dolls<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant
Express your style with Barbie Fashionistas Doll Wear Your Heart! This fashionable doll boasts a unique outfit and accessories, perfect for imaginative play.  A great gift for kids aged 3+.  Collect them all! #Barbie #Fashionistas #Doll #Toys #GirlsToys #FashionDoll #Play<|im_end|>
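
The printed text follows the ChatML-style layout used by Qwen2-VL: each turn is wrapped in `<|im_start|>` and `<|im_end|>`, and each image becomes a `<|vision_start|><|image_pad|><|vision_end|>` placeholder. The function below is a simplified, hypothetical re-implementation that reproduces the layout seen in the output above; it is not the processor's actual chat template and ignores videos and other edge cases.

```python
def render_chatml(messages):
    """Simplified rendering of text/image messages into the layout shown above."""
    chunks = []
    for message in messages:
        body = ""
        for part in message["content"]:
            if part["type"] == "text":
                body += part["text"]
            elif part["type"] == "image":
                # The real pixel data is handled by the processor; the text only holds a placeholder
                body += "<|vision_start|><|image_pad|><|vision_end|>"
        chunks.append(f"<|im_start|>{message['role']}\n{body}<|im_end|>\n")
    return "".join(chunks)

demo_messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are an expert product description writer for Amazon."}]},
    {"role": "user", "content": [
        {"type": "text", "text": "##PRODUCT NAME##: Barbie Fashionistas Doll Wear Your Heart"},
        {"type": "image", "image": None},
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": "Express your style!"}]},
]
print(render_chatml(demo_messages))
```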



In [13]:
from peft import LoraConfig

lora_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    bias="none",
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
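
To get a feel for what this configuration trains, the arithmetic can be done by hand: LoRA adds two small matrices per targeted layer, and the update is scaled by `lora_alpha / r`. The sketch below is a rough estimate only. The layer dimensions (`hidden_size`, `kv_dim`, `num_layers`) are assumptions made for illustration and should be read from `model.config` rather than trusted here.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

r, lora_alpha = 8, 16          # values from lora_config above
scaling = lora_alpha / r       # factor applied to the low-rank update

# ASSUMED dimensions for illustration only; check model.config for the real values
hidden_size = 3584
kv_dim = 512
num_layers = 28

per_layer = (lora_param_count(hidden_size, hidden_size, r)   # q_proj
             + lora_param_count(hidden_size, kv_dim, r))     # v_proj
total = per_layer * num_layers
print(f"scaling={scaling}, per layer={per_layer:,}, total={total:,}")
```

Under these assumptions the adapter is a tiny fraction of a 7B-parameter model, which is why it fits alongside a 4-bit quantized base model.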

### Setup Training Configuration for SFT

In [16]:
from trl import SFTConfig
from transformers import Qwen2VLProcessor
from qwen_vl_utils import process_vision_info

training_args = SFTConfig(
    output_dir='finetune_qwen2-7b-instruct-amazon-description',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
    save_strategy="epoch",
    logging_steps=10,
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    # disable_tqdm=True,
    push_to_hub=False,
    gradient_checkpointing_kwargs = {'use_reentrant': False},
    dataset_text_field="",
    dataset_kwargs = {'skip_prepare_dataset': True}
)

training_args.remove_unused_columns=False

# Create a data collator to encode text and image pairs
def data_collator(examples):
    texts = [processor.apply_chat_template(example['messages'], tokenize=False) for example in examples]
    image_input = [process_vision_info(example['messages'])[0] for example in examples]

    # Tokenize the text and process the images with the full processor
    # (the bare tokenizer does not accept an `images` argument)
    batch = processor(
        text=texts,
        images=image_input,
        padding=True,
        return_tensors='pt'
    )

    # The labels are the input_ids; padding tokens are masked out of the loss
    labels = batch['input_ids'].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    # Ignore the image token indices in the loss computation (model specific)
    if isinstance(processor, Qwen2VLProcessor):
        image_tokens = [151652, 151653, 151655]
    else:
        image_tokens = [processor.tokenizer.convert_tokens_to_ids(processor.image_token)]
    for image_token_id in image_tokens:
        labels[labels == image_token_id] = -100
    batch['labels'] = labels
    return batch
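
The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so positions set to it contribute nothing to the loss. Below is a toy, framework-free sketch of the masking done in the collator. The pad id `0` and the text token ids are made-up placeholders, while the three image-related ids are the ones listed in the collator above.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_labels(input_ids, ignore_ids):
    """Copy input_ids into labels, replacing ignored token ids with IGNORE_INDEX."""
    return [[IGNORE_INDEX if tok in ignore_ids else tok for tok in row]
            for row in input_ids]

PAD_ID = 0                                # hypothetical pad id for this toy example
IMAGE_IDS = {151652, 151653, 151655}      # ids from the collator above

batch_input_ids = [
    [11, 12, 151652, 151655, 151655, 151653, 13, 14],
    [21, 22, 23, PAD_ID, PAD_ID, PAD_ID, PAD_ID, PAD_ID],
]
labels = mask_labels(batch_input_ids, IMAGE_IDS | {PAD_ID})
for row in labels:
    print(row)
```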

Create SFT Trainer

In [17]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=lora_config,
    data_collator=data_collator,
    dataset_text_field="", # needs dummy value
    tokenizer=processor.tokenizer,
    # packing=True,
)

NameError: name 'model' is not defined

Start the training loop by calling the `train()` method on the trainer instance.

In [None]:
# start training
trainer.train()

# save model
trainer.save_model(training_args.output_dir)
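
For a rough sense of how many optimizer steps this run performs, the batch settings from `SFTConfig` can be combined by hand. This is a back-of-the-envelope sketch assuming a single GPU and about 1,345 training examples; the Trainer's own rounding of partial accumulation steps may differ slightly.

```python
import math

num_examples = 1345                  # assumed size of the train split
per_device_train_batch_size = 4      # from SFTConfig above
gradient_accumulation_steps = 8      # from SFTConfig above
num_devices = 1                      # assumption: single GPU
num_train_epochs = 3                 # from SFTConfig above

# Samples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
approx_total_steps = steps_per_epoch * num_train_epochs
print(effective_batch, steps_per_epoch, approx_total_steps)
```

With `logging_steps=10`, this estimate also indicates roughly how many loss values to expect in the logs.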

In [None]:
# free the memory again
del model
del trainer
torch.cuda.empty_cache()

### Test Model and run inference

In [None]:
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Must match the `output_dir` used during training
adapter_path = "./finetune_qwen2-7b-instruct-amazon-description"

# Load base model
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

In [None]:
from qwen_vl_utils import process_vision_info

sample = {
    "product_name": "Hasbro Marvel Avengers-Serie Marvel Assemble Titan-Held, Iron Man, 30,5 cm Actionfigur",
    "category": "Toys & Games | Toy Figures & Playsets | Action Figures",
    "image": "https://m.media-amazon.com/images/I/81+7Up7IWyL._AC_SY300_SX300_.jpg"
}

# prepare message
messages = [{
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": sample["image"],
            },
            {"type": "text", "text": prompt.format(product_name=sample["product_name"], category=sample["category"])},
        ],
    }
]

# Run inference to generate product description
def generate_description(sample, model, processor):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_message}]},
        {"role": "user", "content": [
            {"type": "image","image": sample["image"]},
            {"type": "text", "text": prompt.format(product_name=sample["product_name"], category=sample["category"])},
        ]},
    ]
    # Preparation for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)
    # Inference: Generation of the output
    generated_ids = model.generate(**inputs, max_new_tokens=256, top_p=1.0, do_sample=True, temperature=0.8)
    generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    return output_text[0]


# let's generate the description
base_description = generate_description(sample, model, processor)
print(base_description)
# you can disable the active adapter if you want to rerun it with
# model.disable_adapters()
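
The list comprehension that builds `generated_ids_trimmed` relies on `generate` returning the prompt tokens followed by the new tokens, so slicing off `len(in_ids)` leaves only the continuation. A toy sketch with plain lists and made-up token ids:

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]

prompt_ids = [[101, 102, 103]]                 # made-up prompt token ids
full_output = [[101, 102, 103, 7, 8, 9]]       # prompt ids followed by newly generated ids
print(trim_prompt(prompt_ids, full_output))
```

Without this step, the decoded string would repeat the whole system and user prompt before the description.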


Load the adapter and compare with base model

In [None]:
model.load_adapter(adapter_path) # load the adapter and activate

ft_description = generate_description(sample, model, processor)
print(ft_description)

Compare the two outputs side by side in a styled table

In [None]:
import pandas as pd
from IPython.display import display, HTML

def compare_generations(base_gen, ft_gen):
    # Create a DataFrame
    df = pd.DataFrame({
        "Base Model": [base_gen],
        "Finetuned Model": [ft_gen]
    })
    # Style the DataFrame
    styled_df = df.style.set_properties(**{
        'text-align': 'left',
        'white-space': 'pre-wrap',
        'border': '1px solid black',
        'padding': '10px',
        'width': '250px',
        'overflow-wrap': 'break-word'})

    # Display the styled DataFrame
    display(HTML(styled_df.to_html(escape=False)))

compare_generations(base_description, ft_description)

Save Merged model

In [None]:
from peft import PeftModel
from transformers import AutoProcessor, AutoModelForVision2Seq

adapter_path = "./finetune_qwen2-7b-instruct-amazon-description"  # must match the training output_dir
base_model_id = "Qwen/Qwen2-VL-7B-Instruct"
merged_path = "merged"  # path to save the merged model

# Load the base model
model = AutoModelForVision2Seq.from_pretrained(base_model_id, low_cpu_mem_usage=True)

# Merge LoRA and base model and save
peft_model = PeftModel.from_pretrained(model, adapter_path)
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained(merged_path,safe_serialization=True, max_shard_size="2GB")

processor = AutoProcessor.from_pretrained(base_model_id)
processor.save_pretrained(merged_path)
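
Conceptually, merging folds each adapter into its base weight as W' = W + (lora_alpha / r) · B·A, after which the adapter is no longer needed at inference time. The pure-Python sketch below checks this identity on tiny made-up matrices. It illustrates the idea only and is not how PEFT implements `merge_and_unload()`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_scaled(w, delta, scale):
    """Return w + scale * delta, element by element."""
    return [[wv + scale * dv for wv, dv in zip(wr, dr)] for wr, dr in zip(w, delta)]

scale = 16 / 8                      # lora_alpha / r from the LoRA config
W = [[1.0, 2.0], [3.0, 4.0]]        # made-up base weight (d_out x d_in)
A = [[1.0, 0.0]]                    # made-up LoRA A, rank 1 for brevity (r x d_in)
B = [[0.5], [-1.0]]                 # made-up LoRA B (d_out x r)
x = [[2.0], [1.0]]                  # input column vector

W_merged = add_scaled(W, matmul(B, A), scale)
# Unmerged path: base output plus the scaled low-rank correction
with_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)
print(matmul(W_merged, x) == with_adapter)
```

Because the merged weight has the same shape as the original, the saved model loads like any regular checkpoint without PEFT installed.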