
# Qwen2.5‑VL Forward Pass Demo (Cosmic‑AI Style, **No `pipeline`**)

This notebook shows end‑to‑end usage of **Qwen/Qwen2.5‑VL‑7B‑Instruct** with **direct model calls** (no `transformers.pipeline`).  
You'll learn how to:
- Load `AutoProcessor` and `Qwen2_5_VLForConditionalGeneration`,
- Build a multimodal **conversation** (images + text) by hand,
- Run a **pure forward pass** to inspect logits,
- Run **generation** without `pipeline`,
- Try astronomy‑style tasks (nebulae, galaxy morphology, HR diagram, light curves),
- Compute a rough lexical-overlap score against a reference caption.

> **Note:** You need local images (see the `astro/` folder hints below).


In [1]:

# 0) Install dependencies (run once; comment out afterwards)
!pip install -U transformers accelerate torch torchvision pillow
# Optional: flash-attn for supported GPUs
# !pip install flash-attn --no-build-isolation



In [2]:

# 1) Imports, device, and model choice
import importlib.util

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"

device = "cuda" if torch.cuda.is_available() else "cpu"
# Prefer bfloat16 on GPUs that support it, float16 on other GPUs, float32 on CPU
dtype = (
    torch.bfloat16 if (device == "cuda" and torch.cuda.is_bf16_supported())
    else torch.float16 if device == "cuda"
    else torch.float32
)

# FlashAttention 2 needs the optional flash-attn package (see step 0); otherwise use PyTorch SDPA
has_flash_attn = importlib.util.find_spec("flash_attn") is not None
attn_impl = "flash_attention_2" if (device == "cuda" and has_flash_attn) else "sdpa"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    dtype=dtype,  # `torch_dtype` is deprecated in recent transformers releases
    device_map="auto" if device == "cuda" else None,
    attn_implementation=attn_impl,
)
model.eval()

print(f"Device: {device}  |  DType: {dtype}")




## 2) Astronomy images to try
Place one or more images in a local folder, e.g. `./astro/`:
- `sdss_spiral.jpg` (spiral galaxy)
- `orion_nebula.jpg` (nebula / star‑forming region)
- `jwst_mirror.jpg` (instrument photo)
- `hr_diagram.png` (scientific chart)
- `sn_lightcurve.png` (supernova light curve)

Then adjust the file paths below.
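
Before running the cells that follow, it can help to confirm which of the suggested files are actually present. The sketch below is only an illustration: `EXPECTED_IMAGES` and `find_missing_images` are hypothetical names introduced here, and the folder is assumed to be `./astro` as suggested above.

```python
from pathlib import Path

# Filenames suggested in the list above; adjust to match your own folder.
EXPECTED_IMAGES = [
    "sdss_spiral.jpg",
    "orion_nebula.jpg",
    "jwst_mirror.jpg",
    "hr_diagram.png",
    "sn_lightcurve.png",
]

def find_missing_images(folder, names=EXPECTED_IMAGES):
    """Return the names from `names` that are not regular files in `folder`."""
    root = Path(folder)
    return [name for name in names if not (root / name).is_file()]

missing = find_missing_images("./astro")
if missing:
    print("Missing images:", ", ".join(missing))
else:
    print("All suggested images found.")
```

Only the images referenced by the cells you plan to run need to exist; the rest of the list is optional.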


In [None]:

# 3) Build a chat turn (image + text) manually
conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "path": "./astro/orion_nebula.jpg"},
        {
            "type": "text",
            "text": "Identify the main regions (e.g., ionization fronts, dark dust lanes). "
                    "What physical processes are likely happening here?"
        }
    ]
}]
print(conversation)
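
The same one-image, one-question structure recurs in the later tasks, so it could be wrapped in a small helper. This is a minimal sketch; `build_conversation` is a hypothetical name introduced for illustration, and the light-curve question is an example prompt, not one from the cells in this notebook.

```python
def build_conversation(image_path, question):
    """Return a one-turn user conversation holding one image and one text prompt."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "path": image_path},
            {"type": "text", "text": question},
        ],
    }]

# Example: a prompt for the supernova light curve listed in section 2.
example = build_conversation(
    "./astro/sn_lightcurve.png",
    "Describe the rise and decline of this light curve.",
)
print(example[0]["content"][0])
```

The returned list has the same shape as the hand-written `conversation` above, so it can be passed to `processor.apply_chat_template` in the same way.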


In [None]:

# 4) Convert raw inputs -> tensors via the processor (no pipeline)
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    add_generation_prompt=True,   # adds assistant role for decoding to continue
    return_tensors="pt",
)

# Move tensors to the model device (avoid CPU/GPU mismatch)
inputs = {k: (v.to(model.device) if torch.is_tensor(v) else v) for k, v in inputs.items()}

# Peek at the tensor shapes/keys
shape_summary = {k: (tuple(v.shape) if torch.is_tensor(v) else str(type(v))) for k, v in inputs.items()}
shape_summary


In [None]:

# 5) Pure forward pass (logits only) — no decoding here
with torch.inference_mode():
    out = model(**inputs)     # calls forward(...)
    logits = out.logits       # [batch, seq_len, vocab_size]

print("Logits shape:", tuple(logits.shape))

# Inspect the next-token distribution for the last position
last_step = logits[0, -1]
probs = last_step.softmax(dim=-1)
topk = torch.topk(probs, k=10)
{
    "top10_token_ids": topk.indices.tolist(),
    "top10_probs": [float(x) for x in topk.values]
}
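
To make the softmax and top-k step concrete without loading the model, here is a pure-Python sketch on a toy logit vector. The five-entry "vocabulary" and the helper names `softmax` and `top_k` are assumptions made for illustration; real logits have one entry per vocabulary token.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return (index, probability) pairs for the k largest probabilities."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy logits for a hypothetical 5-token vocabulary (not real model output).
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]
toy_probs = softmax(toy_logits)
for token_id, p in top_k(toy_probs, k=3):
    print(f"token {token_id}: p={p:.3f}")
```

The index with the largest logit also has the largest probability, which is why greedy decoding can pick the next token directly from the logits.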


In [None]:

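# 6) Generation (still no pipeline): greedy decoding for determinism
with torch.inference_mode():
    gen_ids = model.generate(
        **inputs,
        max_new_tokens=192,
        do_sample=False,       # greedy; temperature has no effect without sampling
    )

# generate() returns the prompt followed by the continuation; keep only the new tokens
new_ids = gen_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(new_ids, skip_special_tokens=True)[0]
print(response)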


In [None]:

# 7) Tiny rough lexical comparison to a reference caption (toy metric)
reference = (
    "This is the Orion Nebula (M42), a star-forming region with prominent ionization fronts, "
    "bright emission from hydrogen, and dark dust lanes. Young stellar objects and proplyds are common."
)

import re
def rough_compare(pred: str, ref: str):
    tok = lambda s: set(re.findall(r"[a-zA-Z]+", s.lower()))
    a, b = tok(pred), tok(ref)
    precision = len(a & b) / (len(a) + 1e-9)
    recall    = len(a & b) / (len(b) + 1e-9)
    f1 = 2*precision*recall/(precision+recall+1e-9)
    return {"precision": float(precision), "recall": float(recall), "f1": float(f1)}

rough_compare(response, reference)
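
A hand-checkable example may make the set-based scores easier to read. The two strings below are hypothetical and chosen so the overlap can be counted by eye; `token_set` is an illustrative helper that mirrors the tokenization used above.

```python
import re

def token_set(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-zA-Z]+", text.lower()))

# Hypothetical strings chosen so the overlap is easy to count by hand.
pred = "A bright spiral galaxy"
ref = "A barred spiral galaxy with dust"

pred_tokens, ref_tokens = token_set(pred), token_set(ref)
shared = pred_tokens & ref_tokens            # {"a", "spiral", "galaxy"}

precision = len(shared) / len(pred_tokens)   # 3 shared out of 4 predicted tokens
recall = len(shared) / len(ref_tokens)       # 3 shared out of 6 reference tokens
f1 = 2 * precision * recall / (precision + recall)

print(sorted(shared), precision, recall, round(f1, 3))
```

Because the comparison uses sets, repeated words count once and word order is ignored, so the score is best read as a rough sanity check rather than a caption-quality metric.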


In [None]:

# 8) Visual Question Answering (VQA) example — galaxy morphology
conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "path": "./astro/sdss_spiral.jpg"},
        {"type": "text",  "text": "Is this galaxy spiral, barred spiral, elliptical, or irregular? Justify your answer."}
    ]
}]
inputs = processor.apply_chat_template(conversation, tokenize=True, return_dict=True, add_generation_prompt=True, return_tensors="pt")
inputs = {k: (v.to(model.device) if torch.is_tensor(v) else v) for k, v in inputs.items()}
with torch.inference_mode():
    answer_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Drop the echoed prompt tokens so only the model's answer is decoded
answer_ids = answer_ids[:, inputs["input_ids"].shape[1]:]
processor.batch_decode(answer_ids, skip_special_tokens=True)[0]


In [None]:

# 9) Scientific chart reasoning — HR diagram or light curve
conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "path": "./astro/hr_diagram.png"},
        {"type": "text",  "text": "Explain what this diagram represents and how color relates to temperature and luminosity."}
    ]
}]
inputs = processor.apply_chat_template(conversation, tokenize=True, return_dict=True, add_generation_prompt=True, return_tensors="pt")
inputs = {k: (v.to(model.device) if torch.is_tensor(v) else v) for k, v in inputs.items()}
with torch.inference_mode():
    chart_ids = model.generate(**inputs, max_new_tokens=160, do_sample=False)
# Drop the echoed prompt tokens so only the model's explanation is decoded
chart_ids = chart_ids[:, inputs["input_ids"].shape[1]:]
processor.batch_decode(chart_ids, skip_special_tokens=True)[0]
