
# 🚀 **Fine-Tuning a Model with ms-swift for ImageCLEFmed-MEDVQA-GI-2025 Subtask 1**

🔍 Subtask 1: Algorithm Development for Question Interpretation and Response

This notebook demonstrates the **fine-tuning process** for a model (deepseek-ai/Janus-Pro-1B) using the **[ms-swift](https://swift.readthedocs.io/en/latest/GetStarted/Quick-start.html)** package on the **Kvasir-VQA dataset**, which contains **medical images 🏥** with related **questions ❓** and **answers ✅**. **ms-swift** makes it easy to experiment with many models, parameters, and strategies.

🔗 **Competition Repository:** [ImageCLEFmed-MEDVQA-GI-2025](https://github.com/simula/ImageCLEFmed-MEDVQA-GI-2025) 🌐  

---

## 🔧 **Steps Included**:  
- ⚙️ **Environment setup** and installation of requirements 📦  
- 🧩 **Dataset prep** 🗂️  
- 🧹 **Data preprocessing** 🧼  
- 🤖 **Model fine-tuning** with the `swift sft` CLI 🏃‍♂️  

In [None]:
# Install required libraries
!!pip install datasets ms-swift flash_attn deepspeed mpi4py

# Import necessary modules
from datasets import load_dataset
import json, os

# Comment out the next line (recommended) if you want to log the run to Weights & Biases
os.environ["WANDB_MODE"] = "offline"

## Load the Dataset and Convert It to the VLM Fine-Tuning Format


In [None]:
ds_ = load_dataset("SimulaMet-HOST/Kvasir-VQA")['raw']

In [None]:
img_path="images"
os.makedirs(img_path,exist_ok=True)
existing_files=set(os.listdir(img_path))

# Let's prepare a JSON-style record for each row
def to_swift_row(e):
    """Save the row's image once and return the record in ms-swift's message format."""
    fname = f"{e['img_id']}.jpg"
    # Several questions share one image, so write each file only once
    if fname not in existing_files:
        e['image'].save(f"{img_path}/{fname}")
        existing_files.add(fname)
    return {"msg": {
        "messages": [
            {"role": "user", "content": f"<image> {e['question']}"},
            {"role": "assistant", "content": e['answer']}
        ],
        "images": [f"{img_path}/{fname}"]
    }}

ds = ds_.map(to_swift_row)
ds[0]['msg']  # Let's see what each row contains

In [None]:
# It turns out some answers are missing (stored as the string 'nan'), so let's filter them out
jsonl_data = [entry for entry in ds['msg'] if entry['messages'][1]["content"] != 'nan']
# For a faster demo, keep a random 10% of the dataset
import random
jsonl_data = random.sample(jsonl_data, int(len(jsonl_data) * 0.10))

# Write one JSON object per line; the context manager closes (and flushes) the file
with open("kvasir-vqa.jsonl", "w", encoding="utf-8") as f:
    f.writelines(json.dumps(entry) + "\n" for entry in jsonl_data)
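Before launching training, it can help to confirm that the exported file has the shape built above. The following is a minimal sketch and not part of ms-swift: `validate_jsonl` is a hypothetical helper that checks that each line parses as JSON, holds a user/assistant message pair with an `<image>` placeholder, and (optionally) points at image files that exist on disk.

```python
import json
import os

def validate_jsonl(path, check_images=True):
    """Return a list of (line_number, problem) tuples for malformed rows.

    Hypothetical helper for this notebook; it only checks the fields
    written in the cells above ("messages" and "images").
    """
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_no, f"invalid JSON: {exc}"))
                continue
            if not isinstance(row, dict):
                problems.append((line_no, "row is not a JSON object"))
                continue
            messages = row.get("messages", [])
            roles = [m.get("role") for m in messages]
            if roles != ["user", "assistant"]:
                problems.append((line_no, f"unexpected roles: {roles}"))
            elif "<image>" not in messages[0].get("content", ""):
                problems.append((line_no, "user turn has no <image> placeholder"))
            if check_images:
                for img in row.get("images", []):
                    if not os.path.isfile(img):
                        problems.append((line_no, f"missing image: {img}"))
    return problems

# Only run the check if the file from the previous cell exists
if os.path.exists("kvasir-vqa.jsonl"):
    issues = validate_jsonl("kvasir-vqa.jsonl")
    print(f"{len(issues)} problem(s) found")
```

An empty result means every row matches the format written above; it does not guarantee that a particular model template will accept the data.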

## Train the Model Using the `swift sft` CLI

Define the training arguments for model optimization.
For all available arguments and options, see: https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html

To use a different VLM, take its HF Model ID from the list of supported models:
https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html#multimodal-large-models


In [None]:
!swift sft \
    --model "deepseek-ai/Janus-Pro-1B" \
    --dataset "kvasir-vqa.jsonl" \
    --use_hf true \
    --train_type lora \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --learning_rate 5e-5 \
    --split_dataset_ratio 0.05 \
    --torch_dtype float16 \
    --lora_rank 4 \
    --lora_alpha 16 \
    --target_modules all-linear \
    --freeze_vit true \
    --gradient_accumulation_steps 16 \
    --eval_steps 200 \
    --save_steps 500 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --max_length 1024 \
    --output_dir output \
    --warmup_ratio 0.03 \
    --attn_impl eager \
    --gradient_checkpointing true \
    --dataloader_num_workers 1 \
    --deepspeed zero2_offload

Note: Depending on your GPU configuration, you need to tune the parameters for optimal performance. The current parameters are optimized for Colab's free T4 GPU.
If your GPU is Ampere or newer, use 'sdpa' or 'flash_attn' for **--attn_impl**.
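
When tuning these parameters, it helps to estimate how many optimizer updates a run performs, because in Hugging Face-style trainers step-based settings such as `--eval_steps` and `--save_steps` count optimizer updates rather than individual batches. The sketch below is rough arithmetic only. It assumes a single GPU, that the validation split is taken off the top of the dataset, and that a final partial accumulation window still counts as one update. `optimizer_steps` is a hypothetical helper, and the sample count of 5000 is illustrative, not the real dataset size.

```python
import math

def optimizer_steps(num_samples, per_device_batch_size, grad_accum_steps,
                    num_devices=1, split_dataset_ratio=0.0, epochs=1):
    """Approximate (effective batch size, total optimizer updates) for a run."""
    # Samples left for training after holding out the validation split (assumed rounding)
    train_samples = num_samples - int(num_samples * split_dataset_ratio)
    # Samples consumed per optimizer update
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # Assumption: a trailing partial window still triggers one update
    steps_per_epoch = math.ceil(train_samples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

# Values from the command above, with a hypothetical 5000-sample dataset
effective_batch, steps = optimizer_steps(
    num_samples=5000,
    per_device_batch_size=8,
    grad_accum_steps=16,
    split_dataset_ratio=0.05,
)
print(effective_batch, steps)  # 128 38
```

Under these assumptions the run performs about 38 updates, fewer than `--eval_steps 200`, so no intermediate evaluation or checkpoint would be triggered. On a small subset, lowering `--eval_steps` and `--save_steps` or `--gradient_accumulation_steps` may be worth considering.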

# Good luck!