# Spatial Understanding with Qwen3-VL (Together AI)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/Spatial_Understanding.ipynb)


## Introduction

In this notebook, we'll explore Qwen3-VL's spatial reasoning capabilities using Together AI's API. We'll cover:

1. Understanding spatial relationships between objects
2. Perceiving object affordances (what actions are possible)
3. Integrating spatial reasoning with action planning

These capabilities underpin embodied AI applications such as robotics and navigation.


### Install required libraries


In [None]:
!pip install openai pillow


In [None]:
import os
import json
import base64
import openai
from PIL import Image, ImageDraw, ImageFont, ImageColor
from IPython.display import display

# Together AI Configuration
client = openai.OpenAI(
    api_key=os.environ.get("TOGETHER_API_KEY"),
    base_url="https://api.together.xyz/v1",
)

MODEL_ID = "Qwen/Qwen3-VL-32B-Instruct"

print(f"Using model: {MODEL_ID}")
print(f"API Key configured: {bool(os.environ.get('TOGETHER_API_KEY'))}")


In [None]:
# Utility functions
# Every named color Pillow knows about, used once the base palette runs out
additional_colors = list(ImageColor.colormap)

def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

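def get_mime_type(image_path):
    # Returns the image MIME subtype (the part after "image/"), e.g. "jpg" -> "jpeg".
    ext = os.path.splitext(image_path)[1].lstrip(".").lower()
    return "jpeg" if ext in ("jpg", "jpeg") else ext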

def inference_with_api(image_path, prompt, max_tokens=4096):
    base64_image = encode_image(image_path)
    mime_type = get_mime_type(image_path)
    
    response = client.chat.completions.create(
        model=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/{mime_type};base64,{base64_image}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content

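def parse_json(text):
    # Keep only the contents of the first ```json fence, if the reply has one;
    # otherwise return the text unchanged.
    if "```json" in text:
        text = text.split("```json")[1].split("```")[0]
    return text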

def plot_points(im, text):
    img = im.copy()
    width, height = img.size
    draw = ImageDraw.Draw(img)
    colors = ['red', 'green', 'blue', 'yellow', 'orange', 'pink', 'purple'] + additional_colors
    
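    try:
        font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", size=14)
    except OSError:
        # Font file not available on this system; fall back to Pillow's built-in font.
        font = ImageFont.load_default()

    try:
        data = json.loads(parse_json(text))
    except (ValueError, TypeError):
        # Covers malformed JSON (json.JSONDecodeError is a ValueError) and a None reply.
        print("Could not parse JSON")
        display(img)
        return

    # The model may return a single object instead of a list of objects.
    if isinstance(data, dict):
        data = [data]

    for i, item in enumerate(data):
        if isinstance(item, dict) and "point_2d" in item: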
            point = item["point_2d"]
            label = item.get("label", f"point_{i}")
            color = colors[i % len(colors)]
            # point_2d is treated as normalized to a 0-1000 grid; rescale to pixels.
            x, y = int(point[0] / 1000 * width), int(point[1] / 1000 * height)
            radius = 5
            draw.ellipse([(x - radius, y - radius), (x + radius, y + radius)], fill=color)
            draw.text((x + 2*radius, y + 2*radius), label, fill=color, font=font)
    
    display(img)
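
The `parse_json` helper above only recognizes replies wrapped in a fence tagged `json`. As a minimal sketch, assuming the model may also return a bare JSON array or an untagged fence, a more tolerant extractor could look like the following. `extract_json` and `FENCE` are hypothetical names and are not part of the original notebook.

```python
import json
import re

# Three backticks, built programmatically so this example contains no literal fence.
FENCE = "`" * 3

def extract_json(text):
    """Return the parsed JSON payload from a model reply, or None if none is found."""
    # Prefer the contents of the first fenced block, tagged "json" or untagged.
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    candidate = match.group(1) if match else text
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None

# Illustrative reply, written by hand rather than produced by the model.
reply = "Here is the point:\n" + FENCE + 'json\n[{"point_2d": [512, 640], "label": "free space"}]\n' + FENCE
print(extract_json(reply))
print(extract_json("no JSON in this reply"))
```

Returning `None` instead of raising keeps the caller's error handling to a single check, at the cost of hiding why parsing failed.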


## 1. Understand Spatial Relationships Between Objects

Beyond identifying objects, the model can reason about where they sit relative to one another and to the camera viewpoint.


In [None]:
image_path = "../assets/spatial_understanding/spatio_case1.jpg"
prompt = """Which object, in relation to your current position, holds the farthest placement in the image?
Answer options:
A. chair
B. plant
C. window
D. tv stand."""

response = inference_with_api(image_path, prompt)
print("Prompt:", prompt)
print("\nAnswer:", response)

img = Image.open(image_path)
display(img)
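
The reply to a multiple-choice prompt is free-form text, so downstream code often needs just the option letter. The helper below is a rough sketch and not part of the original notebook: `extract_choice` is a hypothetical name, and the regex assumes the model writes the letter followed by punctuation or at the very end of its reply.

```python
import re

def extract_choice(answer, options="ABCD"):
    """Return the first option letter that looks like an answer, or None."""
    # Match a standalone letter followed by ".", ")", ":" or ",", or ending the reply.
    # This skips the article "A" in phrases such as "A chair is ...".
    pattern = rf"\b([{re.escape(options)}])\b(?=[.):,]|\s*$)"
    match = re.search(pattern, answer)
    return match.group(1) if match else None

# Illustrative replies, written by hand rather than produced by the model.
print(extract_choice("C. window"))
print(extract_choice("A chair is closest, but the answer is B."))
print(extract_choice("I cannot tell."))
```

A heuristic like this can still misfire on unusual phrasing. Asking the model in the prompt to reply with the letter only is a simpler way to make parsing reliable.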


## 2. Perceive Object Affordances

The model can identify which actions a specific object part or a patch of empty space makes possible, for example where something can be placed or whether it will fit.


In [None]:
image_path = "../assets/spatial_understanding/spatio_case2_aff.png"
prompt = "Locate the free space on the white table on the right in this image. Output the point coordinates in JSON format."

response = inference_with_api(image_path, prompt)
print("Prompt:", prompt)
print("\nAnswer:", response)

plot_points(Image.open(image_path), response)
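
If the point is meant for downstream use (say, as a placement target) rather than for plotting, it helps to convert it to pixel coordinates explicitly. This sketch assumes, as `plot_points` does, that `point_2d` values lie on a 0 to 1000 grid relative to the image. `to_pixels` is a hypothetical helper name, and the image size in the example is made up.

```python
def to_pixels(point_2d, width, height, scale=1000):
    """Map an [x, y] pair on a 0..scale grid to pixel coordinates inside the image."""
    x = int(point_2d[0] / scale * width)
    y = int(point_2d[1] / scale * height)
    # Clamp so that a value equal to `scale` still indexes a valid pixel.
    return min(max(x, 0), width - 1), min(max(y, 0), height - 1)

print(to_pixels([500, 250], width=1280, height=720))    # (640, 180)
print(to_pixels([1000, 1000], width=1280, height=720))  # (1279, 719)
```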


In [None]:
image_path = "../assets/spatial_understanding/spatio_case2_aff2.png"
prompt = "Can the speaker fit behind the guitar?"

response = inference_with_api(image_path, prompt)
print("Prompt:", prompt)
print("\nAnswer:", response)

img = Image.open(image_path)
# Downscale to a quarter of the original size for display
img = img.resize((img.width // 4, img.height // 4))
display(img)


## 3. Integrate Spatial Reasoning and Action Planning

The model can combine spatial relationships and affordances to choose an action, much as an embodied agent would.


In [None]:
image_path = "../assets/spatial_understanding/spatio_case2_plan.png"
prompt = "What color arrow should the robot follow to move the apple in between the green can and the orange? Choices: A. Red. B. Blue. C. Green. D. Orange."

response = inference_with_api(image_path, prompt)
print("Prompt:", prompt)
print("\nAnswer:", response)

img = Image.open(image_path)
display(img)


In [None]:
image_path = "../assets/spatial_understanding/spatio_case2_plan2.png"
prompt = "Which motion can help change the coffee pod? Choices: A. A. B. B. C. C. D. D."

response = inference_with_api(image_path, prompt)
print("Prompt:", prompt)
print("\nAnswer:", response)

img = Image.open(image_path)
display(img)
