<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/Run_Qwen2_VL_on_Your_Computer_with_Text%2C_Images%2C_and_Video.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*All the details in this article: [Run Qwen2-VL on Your Computer with Text, Images, and Video, Step by Step](https://newsletter.kaitchup.com/p/run-qwen2-vl-on-your-computer-with)*

This notebook shows how to use Qwen2-VL to discuss an image, multiple images, or a video.

It works on a 12 GB GPU. If you have an 8 GB GPU, you can use the GPTQ version of the model. If you have a 24 GB GPU, consider using the 7B version for better performance.
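
The rule of thumb above can be written as a small helper. This is only a sketch: the thresholds restate the numbers given in this notebook rather than measured limits, the function name `pick_checkpoint` is hypothetical, and the repository IDs for the GPTQ and 7B variants are assumptions to verify on the Hugging Face Hub.

```python
def pick_checkpoint(vram_gb: float) -> str:
    """Map available GPU memory (in GB) to a Qwen2-VL checkpoint name.

    The thresholds restate the guidance in this notebook; they are not
    measured memory limits.
    """
    if vram_gb >= 24:
        return "Qwen/Qwen2-VL-7B-Instruct"  # assumed repository ID
    if vram_gb >= 12:
        return "Qwen/Qwen2-VL-2B-Instruct"  # model used in this notebook
    return "Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4"  # assumed repository ID


if __name__ == "__main__":
    for gb in (8, 12, 24):
        print(f"{gb} GB -> {pick_checkpoint(gb)}")
```

The returned string can be used as `model_name` in the cells below.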

Alibaba provides a library, "qwen_vl_utils", which I recommend installing to simplify the processing of multimodal input. The "av" package is also required if you want to process videos.

#Setup

In [None]:
!pip install git+https://github.com/huggingface/transformers accelerate flash_attn
!pip install qwen_vl_utils av

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-t1tzb3jy
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-t1tzb3jy
  Resolved https://github.com/huggingface/transformers to commit 51e6526b3896285f898cc52989e005999bf1c2a3
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting av
  Downloading av-12.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.6 kB)
Downloading av-12.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (33.5 MB)
Installing collected packages: av
Successfully installed av-12.3.0


If you plan to use the GPTQ model:

In [None]:
!pip install auto-gptq optimum
!pip install git+https://github.com/huggingface/transformers

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-lc6027j0
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-lc6027j0
  Resolved https://github.com/huggingface/transformers to commit c409cd81777fb27aadc043ed3d8339dbc020fb3b
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.45.0.dev0-py3-none-any.whl size=9613684 sha256=19217d9dfb05284aa60fcf9c27fcaaeb70a5c1ab54a1459c6c274ee62255babb
  Stored in directory: /tmp/pip-ephem-wheel-cache-bnr3jlp4/wheels/c0/14/d6/6c9a5582d2ac191ec0a483be151a4495fe1eb2a6706ca49f1b
Successfully built transformers

If you plan to use the AWQ model:

In [None]:
!pip install autoawq optimum
!pip install git+https://github.com/huggingface/transformers

#Run Qwen2-VL with a single image

I use the same code provided in the model card.
Let's run it step by step.

It loads Qwen2-VL with the class Qwen2VLForConditionalGeneration. I set torch_dtype to torch.bfloat16; otherwise, the model would be loaded with float32 parameters and consume much more memory. If your GPU doesn't support bfloat16, replace it with torch.float16.

I use FlashAttention. If your GPU is not recent, replace "flash_attention_2" with "sdpa". SDPA is PyTorch's implementation of scaled dot-product attention. Then, I set device_map="auto". "auto" means that the model will be split over several devices if your GPU doesn't have enough memory. To understand how the device map works, check this article:

[Device Map: Avoid Out-of-Memory Errors When Running Large Language Models](https://newsletter.kaitchup.com/p/device-map-avoid-out-of-memory-errors-when-running-large-language-models-af7de5076f9d)
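
The dtype and attention fallbacks described above can be selected programmatically. The sketch below is one possible way to do it, not part of the model card: `choose_load_options` and `flash_attn_is_installed` are hypothetical helper names, and checking that the `flash_attn` package is importable is only a proxy, since it does not prove that the GPU itself supports FlashAttention.

```python
import importlib.util


def flash_attn_is_installed() -> bool:
    """Return True if the flash_attn package can be found (a proxy only)."""
    return importlib.util.find_spec("flash_attn") is not None


def choose_load_options(bf16_supported: bool, use_flash_attention: bool) -> dict:
    """Pick the dtype name and attention backend for from_pretrained.

    Falls back to float16 without bfloat16 support, and to PyTorch's
    SDPA attention when FlashAttention is not used.
    """
    return {
        "dtype_name": "bfloat16" if bf16_supported else "float16",
        "attn_implementation": "flash_attention_2" if use_flash_attention else "sdpa",
    }


# Possible usage with the loading cell below (requires torch and a GPU):
# import torch
# opts = choose_load_options(torch.cuda.is_bf16_supported(), flash_attn_is_installed())
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     model_name,
#     torch_dtype=getattr(torch, opts["dtype_name"]),
#     attn_implementation=opts["attn_implementation"],
#     device_map="auto",
# )
```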

In [None]:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model_name = "Qwen/Qwen2-VL-2B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2", device_map="auto"
)

processor = AutoProcessor.from_pretrained(model_name)

You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour



The prompt can be formatted as follows, as a JSON-like list of messages (a Python list of dictionaries):

In [None]:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://about.benjaminmarie.com/data/visual_samples/image0.jpg",
            },
            {"type": "text", "text": "How many dogs do you see? What are they doing?"},
        ],
    }
]
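
Since the message structure is the same for every example in this notebook, it can be generated by a small function. This is a minimal sketch: `build_user_message` is a hypothetical helper, not part of `qwen_vl_utils`, and it only covers the plain "image", "video", and "text" entries used here.

```python
def build_user_message(text, images=(), videos=()):
    """Build a single-turn user message in the format used in this notebook.

    images and videos are sequences of URLs or local paths; the text
    prompt is placed after the visual inputs.
    """
    content = [{"type": "image", "image": src} for src in images]
    content += [{"type": "video", "video": src} for src in videos]
    content.append({"type": "text", "text": text})
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    messages = build_user_message(
        "How many dogs do you see? What are they doing?",
        images=["https://about.benjaminmarie.com/data/visual_samples/image0.jpg"],
    )
    print(messages)
```

Passing several entries in `images` produces the multi-image prompt format used later in this notebook.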


Next, we need to process it, i.e., tokenize it and retrieve the image features:

In [None]:
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

The final part is (almost) regular inference code with Transformers:

In [None]:
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text)

['In the image, there are two dogs running on a dirt path. The dog on the left is a Corgi, and the one on the right is a Yorkie. They appear to be enjoying a walk together.']
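
The "almost" in "(almost) regular inference code" refers to the trimming step: `generate` returns the prompt tokens followed by the newly generated ones, so the prompt is sliced off before decoding. The sketch below reproduces that slicing on plain Python lists; the token IDs are made up for illustration and `trim_prompt` is a hypothetical name.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the prompt tokens from each generated sequence in a batch."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


if __name__ == "__main__":
    # Made-up token IDs: each generated row starts with its prompt.
    prompt_batch = [[101, 7, 8], [101, 9, 10]]
    generated_batch = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]
    print(trim_prompt(prompt_batch, generated_batch))  # [[42, 43], [44, 45]]
```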


#Run Qwen2-VL with multiple images

To run the model with several images, the code is the same. We only modify the inputs to include the images in the prompt. For this example, I chose two images showing cars:

In [None]:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model_name = "Qwen/Qwen2-VL-2B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2", device_map="auto"
)

processor = AutoProcessor.from_pretrained(model_name)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://about.benjaminmarie.com/data/visual_samples/image1.jpg",
            },
            {
                "type": "image",
                "image": "https://about.benjaminmarie.com/data/visual_samples/image2.jpg",
            },
            {"type": "text", "text": "Which car is better?"},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text)

['The image shows two different types of cars: a classic Mercury sedan and a Tesla Model S.\n\n1. **Classic Mercury Sedan**:\n   - This is a classic Mercury sedan from the 1950s or 1960s.\n   - It has a traditional design with a rounded front and a boxy body.\n   - The car is parked on the side of a road, and there is a person standing next to it.\n\n2. **Tesla Model S**:\n   - This is a modern electric car designed by Tesla.\n   - It has a sleek, aerodynamic design with a long hood and a low profile.\n   - The car is parked in front of a concrete wall with the Tesla logo on it.\n\n**Comparison**:\n- **Design**: The classic Mercury sedan has a more traditional and boxy design, while the Tesla Model S is sleek and modern.\n- **Economy**: The Tesla Model S is an electric car, which means it does not produce emissions and is environmentally friendly.\n- **Range**: The Tesla Model S has a range of up to 480 miles on a single charge, making it a long-range vehicle.\n- **Technology**: The Te

#Run Qwen2-VL with a video

We can provide a video to Qwen2-VL the same way we provided it with images. We only have to change the "type" to "video" and pass the file under the "video" key. As for accepted formats, I have only confirmed that MP4 works.

Let's download a video of a traffic jam:

In [None]:
!wget https://about.benjaminmarie.com/data/visual_samples/cars.mp4 -O cars.mp4

--2024-08-30 18:51:51--  https://about.benjaminmarie.com/data/visual_samples/cars.mp4
Resolving about.benjaminmarie.com (about.benjaminmarie.com)... 192.95.30.6
Connecting to about.benjaminmarie.com (about.benjaminmarie.com)|192.95.30.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10279858 (9.8M) [video/mp4]
Saving to: ‘cars.mp4’


2024-08-30 18:51:53 (7.67 MB/s) - ‘cars.mp4’ saved [10279858/10279858]



In [None]:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model_name = "Qwen/Qwen2-VL-2B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2", device_map="auto"
)

processor = AutoProcessor.from_pretrained(model_name)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "/content/cars.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text)

['The image depicts a busy urban street scene with numerous vehicles, including cars and motorcycles, moving in both directions. The traffic appears to be congested, with some vehicles waiting at intersections or in slow-moving lanes. The street is flanked by tall buildings on either side, and there is a bridge or overpass visible in the background. The sky is overcast, suggesting it might be a cloudy day. The overall atmosphere is typical of a bustling city with heavy traffic.']
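
The "fps" and "max_pixels" entries in the video message set how many frames are sampled per second and cap the resolution of each frame, so together they bound how much visual input the model receives. The sketch below gives a rough estimate of that budget. It is an approximation under stated assumptions: `video_budget` is a hypothetical helper, the 30-second duration is an illustrative value and not the measured length of cars.mp4, and the actual library may round or clamp the number of frames.

```python
import math


def video_budget(duration_s: float, fps: float, max_pixels: int) -> dict:
    """Roughly estimate sampled frames and the total pixel budget for a video."""
    frames = max(1, math.floor(duration_s * fps))  # at least one frame
    return {
        "frames": frames,
        "max_pixels_per_frame": max_pixels,
        "max_total_pixels": frames * max_pixels,
    }


if __name__ == "__main__":
    # Hypothetical 30-second clip with the settings used in this notebook.
    print(video_budget(duration_s=30, fps=1.0, max_pixels=360 * 420))
```

Lowering "fps" or "max_pixels" reduces this budget, which can help if you run out of memory on longer videos.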
