<a href="https://www.kaggle.com/code/shravankumar147/fork-of-qwen2-vl-2b-instruct-multi-gpu?scriptVersionId=207909929" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
!pip install -q git+https://github.com/huggingface/transformers qwen-vl-utils quanto accelerate
# flash-attn (optional; not installed here)

In [4]:
import os
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from accelerate import init_empty_weights, infer_auto_device_map
from qwen_vl_utils import process_vision_info

In [5]:
# Step 1: Set Environment Variables for Better CUDA Memory Management
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
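
`PYTORCH_CUDA_ALLOC_CONF` takes a comma-separated list of `option:value` pairs, and it should be in place before PyTorch first allocates GPU memory. A minimal sketch of building that string from a dict, which may help if more allocator options are added later; the helper name `build_alloc_conf` is hypothetical and not part of PyTorch.

```python
import os


def build_alloc_conf(options):
    """Join allocator options into the 'option:value,option:value' form."""
    return ",".join(f"{key}:{value}" for key, value in options.items())


# Same setting as the cell above, expressed through the (hypothetical) helper.
conf = build_alloc_conf({"expandable_segments": True})
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = conf
print(conf)
```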

In [6]:
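# Step 2: Load Model with Accelerated Multi-GPU Device Mapping
# First pass: instantiate the model under init_empty_weights() so that
# infer_auto_device_map (next cell) can plan layer placement from it.
# Note: from_pretrained still downloads the checkpoint shards here.
print("Loading model...")
with init_empty_weights():
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-2B-Instruct",
        torch_dtype="auto"  # Use the dtype the checkpoint was saved in
    )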

Loading model...

`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [7]:
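# Plan placement within per-device memory caps, keeping each block that the
# model marks as non-splittable (model._no_split_modules) on a single device.
device_map = infer_auto_device_map(
    model,
    max_memory={0: "7GiB", 1: "7GiB", "cpu": "15GiB"},
    no_split_module_classes=model._no_split_modules,
)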
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    device_map=device_map,
    torch_dtype=torch.bfloat16,  # Use memory-efficient bfloat16 precision
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
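
The `max_memory` mapping caps what `infer_auto_device_map` may place on each device: integer keys are GPU indices and `"cpu"` is the offload budget. Capping each GPU at 7 GiB presumably leaves headroom for activations and image inputs during generation. A small sketch of building the same mapping for any GPU count; `build_max_memory` is a hypothetical helper, not part of accelerate.

```python
def build_max_memory(gpu_count, per_gpu_gib, cpu_gib):
    """Return a max_memory mapping: integer GPU indices plus a "cpu" entry."""
    budget = {index: f"{per_gpu_gib}GiB" for index in range(gpu_count)}
    budget["cpu"] = f"{cpu_gib}GiB"
    return budget


# Reproduces the budget used above: two GPUs capped at 7 GiB each.
max_memory = build_max_memory(gpu_count=2, per_gpu_gib=7, cpu_gib=15)
print(max_memory)
```

The result can be passed as `max_memory=` in place of the hand-written dict.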

In [8]:
# Step 3: Load Processor and Configure Image Token Range
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=256*28*28, max_pixels=1024*28*28)
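
`min_pixels` and `max_pixels` bound the resized image area. In Qwen2-VL's convention each visual token covers roughly a 28×28 pixel area, so these settings correspond to about 256 to 1024 visual tokens per image. The sketch below estimates the token count for a given image size. Its resizing rule is a simplified approximation written for illustration, and the function names are hypothetical; the processor's own resizing logic is authoritative.

```python
import math

PATCH = 28  # assumed pixels per side covered by one visual token


def token_bounds(min_pixels, max_pixels, patch=PATCH):
    """Convert the pixel budget into a (min, max) visual-token budget."""
    return min_pixels // (patch * patch), max_pixels // (patch * patch)


def estimate_visual_tokens(height, width, min_pixels, max_pixels, patch=PATCH):
    """Approximate the token count after resizing into the pixel budget."""
    # Snap both sides to multiples of the patch size.
    h = max(patch, round(height / patch) * patch)
    w = max(patch, round(width / patch) * patch)
    if h * w > max_pixels:
        # Too large: shrink, rounding down so the area stays within budget.
        scale = math.sqrt(height * width / max_pixels)
        h = math.floor(height / scale / patch) * patch
        w = math.floor(width / scale / patch) * patch
    elif h * w < min_pixels:
        # Too small: enlarge, rounding up so the area reaches the minimum.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / patch) * patch
        w = math.ceil(width * scale / patch) * patch
    return (h // patch) * (w // patch)


MIN_PIXELS = 256 * 28 * 28
MAX_PIXELS = 1024 * 28 * 28
print(token_bounds(MIN_PIXELS, MAX_PIXELS))
print(estimate_visual_tokens(750, 1000, MIN_PIXELS, MAX_PIXELS))
```

Lowering `max_pixels` reduces the number of image tokens and therefore the memory needed during generation, at the cost of image detail.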

In [11]:
# Step 4: Prepare Inputs
# Alternative test images (uncomment one to try it):
# input_image_url = "https://www.mercedes-benz.co.in/content/india/en/passengercars/models/suv/x296-24-1/overview/_jcr_content/root/responsivegrid/tabs/tabitem/hotspot_module/hotspot_simple_image.component.damq5.3432630554423.jpg/mercedes-benz-eqs-suv-x296-exterior-hotspot-start-3302x1858-02-2024.jpg"
# input_image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
input_image_url = "https://t4.ftcdn.net/jpg/09/22/14/31/360_F_922143188_z2PQRcdhGUUGogaEPPVGqUu6Qz90KYlA.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": input_image_url,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Process text and images
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
)

# Send inputs to primary GPU
inputs = inputs.to("cuda:0")

# Step 5: Generate Outputs
print("Generating outputs...")
try:
    generated_ids = model.generate(**inputs, max_new_tokens=128)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print("Output:", output_text)
except torch.cuda.OutOfMemoryError as e:
    print("CUDA Out of Memory Error:", e)

Generating outputs...
Output: ['The image depicts a serene and tranquil scene of a person sitting on a wooden dock at the edge of a calm lake. The dock extends into the water, creating a connection between the man and the lake. The lake is surrounded by a misty, mountainous landscape with autumn-colored trees on the shore. The sky above is a gradient of blue hues, suggesting either dawn or dusk. The overall atmosphere is peaceful and contemplative, evoking a sense of solitude and introspection.']


![alt-text](https://t4.ftcdn.net/jpg/09/22/14/31/360_F_922143188_z2PQRcdhGUUGogaEPPVGqUu6Qz90KYlA.jpg)
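
`generate` returns each sequence with the prompt tokens at the front, which is why the cell above slices off `len(in_ids)` tokens before decoding. A minimal sketch of that trimming step on plain lists; the token IDs are made up for illustration and `trim_prompt` is a hypothetical name.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the echoed prompt from the front of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Hypothetical token IDs: each output starts with a copy of its prompt.
prompts = [[101, 7, 8], [101, 9]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 44]]
print(trim_prompt(prompts, outputs))
```

Without this step, the decoded text would begin with the chat template and the user prompt instead of the model's answer.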

In [12]:
# Step 6: Monitor GPU Usage (Optional)
print("GPU Usage:")
os.system("nvidia-smi")

GPU Usage:
Sun Nov 17 08:38:28 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   77C    P0             32W /   70W |    5579MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                      

0
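
The full `nvidia-smi` table is awkward to read programmatically. A sketch of an alternative, assuming `nvidia-smi` is on the PATH: use its CSV query mode and parse the rows into dicts. The helper names are hypothetical, and the sample row reuses the GPU 0 figures from the table above.

```python
import subprocess

QUERY = "index,name,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse 'index, name, used, total' rows into dicts (memory in MiB)."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, used, total = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "used_mib": int(used),
            "total_mib": int(total),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi in CSV mode; requires an NVIDIA driver on the machine."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


# Sample row built from the GPU 0 line in the output above.
sample = "0, Tesla T4, 5579, 15360\n"
print(parse_gpu_csv(sample))
```

On a GPU machine, `query_gpus()` returns one entry per device, which makes it easy to check how the device map actually distributed the weights.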