
# 🚀 **Fine-Tuning a Model with diffusers for ImageCLEFmed-MEDVQA-GI-2025 Subtask 2**

🔍 Subtask 2: Creation of High-Fidelity Synthetic GI Images

This notebook demonstrates the **fine-tuning process** for a model with the **diffusers** package on the **Kvasir-VQA dataset**, which contains **medical images 🏥** with related **questions ❓**, **answers ✅** and captions. With **diffusers** you can easily experiment with many models, parameters and strategies.

🔗 **Competition Repository:** [ImageCLEFmed-MEDVQA-GI-2025](https://github.com/simula/ImageCLEFmed-MEDVQA-GI-2025) 🌐  

---

## 🔧 **Steps Included**:  
- ⚙️ **Environment setup** and installation of requirements 📦  
- 🧩 **Dataset prep** 🗂️  
- 🧹 **Data preprocessing** 🧼  
- 🤖 **Model fine-tuning** using Hugging Face's `diffusers` 🏃‍♂️   

In [None]:
# Install required libraries
!pip install diffusers accelerate transformers datasets

# Import necessary modules
from datasets import load_dataset
import json, os
import pandas as pd

# Comment out the next line (recommended) if you want to log the run to Weights & Biases online
os.environ["WANDB_MODE"] = "offline"

# Clone diffusers to get the example training scripts, then install it from source
!git clone https://github.com/huggingface/diffusers
!cd diffusers && pip install .

## Load the dataset and prepare it in Hugging Face dataset format


In [None]:
ds_ = load_dataset("SimulaMet-HOST/Kvasir-VQA")['raw']
captions = pd.read_json("https://raw.githubusercontent.com/simula/ImageCLEFmed-MEDVQA-GI-2025/refs/heads/main/kvasir-captions.json").to_dict()
## These captions are derived from Kvasir-VQA using the QwQ LLM

In [None]:
img_path="kvasir-vqa/images"
os.makedirs(img_path,exist_ok=True)
existing_files=set(os.listdir(img_path))

## Save each image locally once, skipping files that are already on disk
def save_image(e):
    fname = f"{e['img_id']}.jpg"
    if fname not in existing_files:
        e['image'].save(f"{img_path}/{fname}")
        existing_files.add(fname)
    # The new column holds the local file path (the original cell stored it under "captions")
    return {"image_path": f"{img_path}/{fname}"}

ds = ds_.map(save_image)
ds[0]  ## let's see what is in each row
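The cell above writes an image only when its file name is not yet in `existing_files`, so rerunning the notebook does not save every image again. The same skip-if-present idea is sketched below as a standalone loop. The names `save_missing` and `write_bytes` are hypothetical helpers introduced for illustration, and raw bytes stand in for PIL images so the sketch runs without the dataset.

```python
import os
import tempfile

def save_missing(items, out_dir, write_fn):
    """Write each (name, payload) pair to out_dir unless the file already exists.

    Returns the list of file names that were actually written.
    """
    os.makedirs(out_dir, exist_ok=True)
    existing = set(os.listdir(out_dir))  # one directory scan instead of one check per file
    written = []
    for name, payload in items:
        if name in existing:
            continue  # already on disk, e.g. from a previous run
        write_fn(os.path.join(out_dir, name), payload)
        existing.add(name)
        written.append(name)
    return written

def write_bytes(path, payload):
    # Stand-in for PIL's image.save(path); assumption made so no image data is needed.
    with open(path, "wb") as fh:
        fh.write(payload)

demo_dir = tempfile.mkdtemp()
first_run = save_missing([("a.jpg", b"x"), ("b.jpg", b"y")], demo_dir, write_bytes)
second_run = save_missing([("a.jpg", b"x"), ("c.jpg", b"z")], demo_dir, write_bytes)
print(first_run)   # ['a.jpg', 'b.jpg']
print(second_run)  # ['c.jpg']
```

Keeping the set of known file names in memory matters here because the dataset repeats an image across many rows, and a set lookup is cheaper than touching the file system for every row.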

In [None]:
# Make sure you are logged in to Hugging Face
!git config --global credential.helper store
!huggingface-cli login

In [40]:
from datasets import Dataset, Image

# Flatten the nested captions dictionary into two parallel columns
datax = {
    "image": [],
    "caption": []
}

for img, captions_group in captions.items():
    for caption in captions_group.values():
        for cap in caption:  # one row per caption
            datax["image"].append({"path": f"{img_path}/{img}.jpg"})
            datax["caption"].append(cap)

dataset = Dataset.from_dict(datax).cast_column("image", Image())

dataset = dataset.shuffle(seed=42)  # Set seed for reproducibility
subset_size = int(len(dataset) * 0.08) # Take 8% of the samples for DEMO
dataset_subset = dataset.select(range(subset_size))

dataset_subset.push_to_hub('vqa_caption.dataset-test')  # uploaded as a dataset repo under your Hugging Face account

Saving the dataset (0/1 shards):   0%|          | 0/4753 [00:00<?, ? examples/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Map:   0%|          | 0/4753 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/48 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/waitwhoami/vqa_caption.dataset-test/commit/322c2514a955830ad97db62f2ff9078c5ad35a3d', commit_message='Upload dataset', commit_description='', oid='322c2514a955830ad97db62f2ff9078c5ad35a3d', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/waitwhoami/vqa_caption.dataset-test', endpoint='https://huggingface.co', repo_type='dataset', repo_id='waitwhoami/vqa_caption.dataset-test'), pr_revision=None, pr_num=None)
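The flattening loop in the cell above turns a nested captions dictionary into one row per caption. A minimal sketch of that step on toy data follows. The nesting shown in `toy` (image ID, then an inner key, then a list of caption strings) is an assumption inferred from the loop, not a documented schema of `kvasir-captions.json`, and `flatten_captions` is a hypothetical helper name.

```python
def flatten_captions(captions, img_dir):
    """Return one {"image", "caption"} row per caption string."""
    rows = []
    for img_id, groups in captions.items():
        for caption_list in groups.values():
            for cap in caption_list:
                rows.append({"image": f"{img_dir}/{img_id}.jpg", "caption": cap})
    return rows

# Toy input; the inner key "0" and the caption texts are made up for illustration.
toy = {
    "img_001": {"0": ["one polyp in the centre", "a single pink polyp"]},
    "img_002": {"0": ["no abnormality visible"]},
}

rows = flatten_captions(toy, "kvasir-vqa/images")
for row in rows:
    print(row["image"], "|", row["caption"])
```

An image with several captions appears in several rows, which is why the uploaded dataset has more rows than there are images.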

# Now start fine-tuning

In [None]:
### change dataset_name to your uploaded dataset repo from HuggingFace
!CUDA_VISIBLE_DEVICES=0 accelerate launch /content/diffusers/examples/text_to_image/train_text_to_image_lora.py \
  --pretrained_model_name_or_path=stable-diffusion-v1-5/stable-diffusion-v1-5 \
  --dataset_name=waitwhoami/vqa_caption.dataset-test \
  --caption_column="caption" \
  --resolution=512 \
  --train_batch_size=4 \
  --num_train_epochs=5 \
  --checkpointing_steps=500 \
  --checkpoints_total_limit=3 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --output_dir="sd-kvasir-imagen-demo" \
  --validation_epochs=1 \
  --validation_prompt="The colonoscopy image contains a single, moderate-sized polyp that has not been removed, appearing in red and pink tones in the center and lower areas" \
  --report_to="wandb" \
  --push_to_hub

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
2025-03-04 09:02:23.162632: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1741078943.189648   22009 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1741078943.198446   22009 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-04 09:02:23.231861: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in perf
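As a rough sanity check on the training command, the number of optimizer steps and the checkpoints left on disk can be estimated from the settings above. The sketch assumes a single process, no gradient accumulation, a final partial batch that is kept rather than dropped, and that `--checkpoints_total_limit` keeps only the most recent checkpoints. `training_schedule` is a hypothetical helper and not part of the training script.

```python
import math

def training_schedule(num_examples, batch_size, epochs, checkpoint_every, keep_last, grad_accum=1):
    """Estimate total optimizer steps and which checkpoint steps remain at the end."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    total_steps = steps_per_epoch * epochs
    # A checkpoint is written every `checkpoint_every` steps; older ones are assumed pruned.
    checkpoints = list(range(checkpoint_every, total_steps + 1, checkpoint_every))
    return total_steps, checkpoints[-keep_last:]

# 4753 examples is the size of the demo subset shown in the upload output.
total_steps, kept = training_schedule(
    num_examples=4753, batch_size=4, epochs=5, checkpoint_every=500, keep_last=3
)
print(total_steps)  # 5945
print(kept)         # [4500, 5000, 5500]
```

If the estimate is too long for your hardware budget, lowering `--num_train_epochs` or taking a smaller subset of the dataset shortens the run proportionally.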

**After training completes, check your Hugging Face account for the uploaded model repo.**

# Test the model from Hugging Face through DiffusionPipeline

In [None]:
from diffusers import DiffusionPipeline
from matplotlib.pyplot import imshow

pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5").to("cuda") # Base model
pipe.load_lora_weights("waitwhoami/sd-kvasir-imagen-demo") ## REPLACE with your HuggingFace repo name

prompt = "Colonoscopy image shows one polyp, 11-20mm in size, in the lower-right quadrant."
image = pipe(prompt).images[0]
imshow(image)


# Good luck!