## Local Inference on GPU
Dataset page: https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1

GitHub page: https://github.com/simula/Kvasir-VQA-x1

ArXiv: https://arxiv.org/abs/2506.09958



In [8]:
!pip install ms-swift==3.8.0 bitsandbytes decord qwen_vl_utils==0.0.11

In [2]:
import os
import torch
from swift.llm import PtEngine, RequestConfig, InferRequest
from transformers import BitsAndBytesConfig

In [3]:
# 4-bit NF4 quantization so the model fits on the free Colab T4 GPU;
# the paper reports performance for the unquantized model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

# Upper bound on image size in total pixels; adjust it to the GPU memory you have.
os.environ["MAX_PIXELS"] = "640000"
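
To see why 4-bit loading matters on a 16 GB T4, a back-of-the-envelope estimate of weight memory may help. The sketch below assumes a nominal 7 billion parameters (read off the "7B" in the model name, not measured) and counts weights only. It ignores layers that stay unquantized, quantization metadata, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NOMINAL_PARAMS = 7e9  # assumption: nominal size implied by "7B", not a measured count

fp16_gib = weight_memory_gib(NOMINAL_PARAMS, 16)  # float16 weights
nf4_gib = weight_memory_gib(NOMINAL_PARAMS, 4)    # 4-bit quantized weights

print(f"float16 weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:   ~{nf4_gib:.1f} GiB")
```

Under these assumptions the float16 weights alone would take most of the T4's memory, while the 4-bit weights leave room for image tokens and generation.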

In [5]:

engine = PtEngine(
    model_id_or_path="Qwen/Qwen2.5-VL-7B-Instruct",  # base model
    adapters=["SimulaMet/Qwen2.5-VL-KvasirVQA-x1-ft"],  # fine-tuned weights, loaded as an adapter
    quantization_config=bnb_config,
    attn_impl="sdpa",
    max_batch_size=1,
    use_hf=True,  # download from the Hugging Face Hub
)

[INFO:swift] Downloading the model from HuggingFace Hub, model_id: Qwen/Qwen2.5-VL-7B-Instruct


[INFO:swift] Setting image_factor: 28. You can adjust this hyperparameter through the environment variable: `IMAGE_FACTOR`.
[INFO:swift] Setting min_pixels: 3136. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Using environment variable `MAX_PIXELS`, Setting max_pixels: 640000.
[INFO:swift] Setting max_ratio: 200. You can adjust this hyperparameter through the environment variable: `MAX_RATIO`.
[INFO:swift] Setting video_min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `VIDEO_MIN_PIXELS`.
[INFO:swift] Setting video_max_pixels: 602112. You can adjust this hyperparameter through the environment variable: `VIDEO_MAX_PIXELS`.
[INFO:swift] Setting video_total_pixels: 90316800. You can adjust this hyperparameter through the environment variable: `VIDEO_TOTAL_PIXELS`.
[INFO:swift] Setting frame_factor: 2. You can adjust this hyperparameter through the environment variable: `FRAME_FACTOR`.
[INFO:swift] Set

[INFO:swift] Loading the model using model_dir: /root/.cache/huggingface/hub/models--SimulaMet--Qwen2.5-VL-KvasirVQA-x1-ft/snapshots/ba247995a6453b90112e5e0805e0c11bad4dee29
[INFO:swift] Create the default_template for the infer_engine
[INFO:swift] default_system: 'You are a helpful assistant.'
[INFO:swift] max_length: 128000
[INFO:swift] response_prefix: ''
[INFO:swift] agent_template: hermes
[INFO:swift] norm_bbox: none
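
The log above reports `image_factor: 28`, `min_pixels: 3136`, and `max_pixels: 640000`. As a rough illustration of how such limits usually turn into a resized image, the sketch below rounds each side to a multiple of the factor and rescales the image when it falls outside the pixel budget. It is modeled on the resizing commonly used in Qwen-VL style preprocessing and is an assumption, not code taken from ms-swift. The function name `approx_resize` and the 1280x1024 example input are hypothetical.

```python
import math


def approx_resize(height, width, factor=28, min_pixels=3136, max_pixels=640000):
    """Return (height, width) as multiples of `factor` that respect the pixel budget."""
    # Round each side to the nearest multiple of `factor` (at least one factor).
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = math.floor(height / scale / factor) * factor
        w = math.floor(width / scale / factor) * factor
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


new_h, new_w = approx_resize(1024, 1280)  # hypothetical 1280x1024 input image
cells = (new_h // 28) * (new_w // 28)     # number of 28x28 cells after resizing
print(new_h, new_w, cells)
```

The number of 28x28 cells is a reasonable proxy for how many visual tokens the image contributes, so lowering `MAX_PIXELS` is the main lever for reducing memory use per image.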


In [8]:
from IPython.display import Image, display
display(Image(
    url="https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1/resolve/main/images/clb0kvxvm90y4074yf50vf5nq.jpg",
    width=300  # adjust width (in pixels)
))

In [7]:
# Sampling settings for generation
req_cfg = RequestConfig(
    max_tokens=512,
    temperature=0.3,
    top_k=20,
    top_p=0.7,
    repetition_penalty=1.05,
)

# One request: a single user turn with an image followed by a text question
infer_requests = [
    InferRequest(messages=[{
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1/resolve/main/images/clb0kvxvm90y4074yf50vf5nq.jpg"},
            {"type": "text", "text": "What is shown in the image?"},
        ],
    }])
]
resp = engine.infer(infer_requests, req_cfg)
print(resp[0].choices[0].message.content)

[INFO:swift] Successfully registered `/usr/local/lib/python3.12/dist-packages/swift/llm/dataset/data/dataset_info.json`.


The image appears to be a gastroscopy view of the stomach, showing the mucosa of the upper gastrointestinal tract. The presence of multiple gastric folds and the lack of visible instruments suggest that this is a direct visualization of the stomach during a gastroscopic procedure.
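
When asking several questions about the same image, it can be convenient to build the message structure with a small helper instead of repeating the nested dictionaries. The sketch below uses only plain Python and mirrors the message layout from the request cell. The helper name `build_vqa_messages` and the second question are hypothetical and not part of the original notebook.

```python
def build_vqa_messages(image_url: str, question: str) -> list[dict]:
    """Build a single-turn message list with one image and one text question."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": question},
        ],
    }]


IMAGE_URL = (
    "https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1"
    "/resolve/main/images/clb0kvxvm90y4074yf50vf5nq.jpg"
)
questions = [
    "What is shown in the image?",
    "Are there any abnormalities visible?",  # hypothetical follow-up question
]

# One message list per question, all pointing at the same image.
batch = [build_vqa_messages(IMAGE_URL, q) for q in questions]
print(len(batch), batch[0][0]["content"][1]["text"])
```

Each entry of `batch` can then be wrapped as `InferRequest(messages=m)` and passed to `engine.infer` together with `req_cfg`, in the same way as the single request.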
