## Qwen3-VL Instruct for Image Analysis

Launch the endpoint with vLLM:

```
vllm serve Qwen/Qwen3-VL-8B-Instruct-FP8 \
  --tensor-parallel-size 1 \
  --limit-mm-per-prompt.video 0 \
  --async-scheduling \
  --gpu-memory-utilization 0.8 \
  --max-num-seqs 128 \
  --max-model-len 32768
```
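
Loading the model can take a while, so requests sent right after launch may be refused. The following is a minimal sketch of a readiness check, assuming the server listens on `http://localhost:8000` (the address used by the request code in this notebook) and exposes the OpenAI-compatible `/v1/models` route. The names `models_url`, `wait_for_server`, and `probe` are hypothetical and introduced here only for illustration.

```python
import time
import urllib.error
import urllib.request


def models_url(base_url):
    """Build the model-listing URL for an OpenAI-compatible server."""
    return base_url.rstrip("/") + "/v1/models"


def wait_for_server(base_url="http://localhost:8000", timeout_s=300, poll_s=5, probe=None):
    """Poll the server until it answers or the timeout expires.

    `probe` is a hypothetical hook (url -> bool) so the loop can be
    exercised without a running server; by default it issues an HTTP GET.
    """
    if probe is None:
        def probe(url):
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200

    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe(models_url(base_url)):
                return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: the server is not ready yet
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Calling `wait_for_server()` before the first request returns `True` once the endpoint answers and `False` if the timeout expires first.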

In [1]:
import requests
import base64
import json
from PIL import Image, ImageDraw, ImageFont
import io
import random


def infer_prompt(image_path, api_url="http://localhost:8000/v1/chat/completions"):
    """Send an image to the vLLM chat completions API and return the model's text reply."""
    
    # Load the image; it is normalized to RGB below
    image = Image.open(image_path)
    
    # Convert RGBA/P/L to RGB
    if image.mode in ('RGBA', 'LA', 'P', 'L'):
        if image.mode == 'RGBA' or image.mode == 'LA':
            # Create white background for transparent images
            background = Image.new('RGB', image.size, (255, 255, 255))
            if image.mode == 'RGBA':
                background.paste(image, mask=image.split()[3])
            else:
                background.paste(image, mask=image.split()[1])
            image = background
        else:
            image = image.convert('RGB')
    
    # Save to bytes buffer and encode
    buffer = io.BytesIO()
    image.save(buffer, format='JPEG')
    base64_image = base64.b64encode(buffer.getvalue()).decode('utf-8')
    
    
    # Ask the model to compare the left and right sections of the image
    response = requests.post(
        api_url,
        headers={"Content-Type": "application/json"},
        json={
            "model": "Qwen/Qwen3-VL-8B-Instruct-FP8",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{base64_image}"
                            }
                        },
                        {
                            "type": "text",
                            "text": "Write in bullet points the difference between left section and right section."
                        }
                    ]
                }
            ],
            "max_tokens": 4096
        },
        timeout=300,  # arbitrary upper bound so a stalled server cannot hang the notebook
    )
    
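    # Surface HTTP errors directly instead of failing with a KeyError below
    response.raise_for_status()
    result = response.json()
    return result['choices'][0]['message']['content']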
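
`infer_prompt` indexes straight into `result['choices'][0]`, so a response body without `choices` surfaces as a bare `KeyError`. The following is a minimal sketch of a more defensive parser, assuming the OpenAI-style response shape in which each choice carries a `finish_reason` and the value `"length"` means the token limit cut the reply short. `extract_content` is a hypothetical helper, not part of vLLM.

```python
def extract_content(result):
    """Return (text, truncated) from a chat-completions response dict."""
    choices = result.get("choices")
    if not choices:
        # Include the raw payload so the server's error message is visible
        raise ValueError(f"No choices in response: {result!r}")

    choice = choices[0]
    text = choice["message"]["content"]
    # "length" signals that generation stopped at the token limit
    truncated = choice.get("finish_reason") == "length"
    return text, truncated
```

With this helper, the last line of `infer_prompt` could return `extract_content(result)` instead, and the caller can warn when `truncated` is `True`.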



In [2]:
image_path = "../data/raw/comparison.png"

In [3]:
response_text = infer_prompt(image_path)

In [4]:
print(response_text)

- Left Section:
  - Shows a real-world, unrendered photo of the basement.
  - Appears to be a construction or renovation site with visible debris, cardboard boxes, and exposed concrete.
  - Includes a purple circular marker on the floor and an “Architectural Sheet” overlay with a red arrow.
  - Date stamp at the bottom: “07 Apr 2025.”
  - Overall look is raw, unpolished, and documentary-style.

- Right Section:
  - Displays a 3D digital model or BIM (Building Information Modeling) visualization of the same basement.
  - Shows clean, color-coded infrastructure: green pipes, red pipes, yellow conduits, and blue ducts.
  - Walls and columns are rendered with smooth, clean surfaces and minimal clutter.
  - Includes UI elements for 3D navigation (zoom, pan, rotate) on the right side.
  - Overall look is precise, technical, and designed for planning or analysis.

- Key Difference:
  - The left is a physical, unedited photo of the space as it exists.
  - The right is a digital, annotated, and