In [6]:
import torch
import time
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM

In [2]:
MID = "apple/FastVLM-1.5B"
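IMAGE_TOKEN_INDEX = -200  # sentinel id the model's remote code (llava_qwen.py) looks for as the image placeholder
# Load the tokenizer and model: fp16 on GPU, fp32 on CPU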
tok = AutoTokenizer.from_pretrained(MID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MID,
    dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    trust_remote_code=True,
)

A new version of the following files was downloaded from https://huggingface.co/apple/FastVLM-1.5B:
- llava_qwen.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`torch_dtype` is deprecated! Use `dtype` instead!
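
The warning above suggests pinning a revision so that the remote code (`llava_qwen.py`) cannot change between runs. A minimal sketch of that idea follows. `REVISION` is a hypothetical placeholder, not a commit hash taken from the repository, and `load_pinned` is an illustrative helper name; `revision` itself is a standard `from_pretrained` argument.

```python
MID = "apple/FastVLM-1.5B"

# Hypothetical placeholder: replace with a commit hash or tag you have reviewed.
# "main" tracks the latest commit, so it does not actually pin anything.
REVISION = "main"


def load_pinned(model_id=MID, revision=REVISION):
    """Load tokenizer and model from the same repository snapshot."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(
        model_id, trust_remote_code=True, revision=revision
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True, revision=revision, device_map="auto"
    )
    return tok, model
```

Passing the same `revision` to both calls keeps the tokenizer files and the model code in step with each other.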



In [3]:
# Build chat -> render to string (not tokens) so we can place <image> exactly
messages = [
    {"role": "user", "content": "<image>\nDescribe this image in detail."}
]
rendered = tok.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
pre, post = rendered.split("<image>", 1)

# Tokenize the text *around* the image token (no extra specials!)
pre_ids  = tok(pre,  return_tensors="pt", add_special_tokens=False).input_ids
post_ids = tok(post, return_tensors="pt", add_special_tokens=False).input_ids

# Splice in the IMAGE token id (-200) at the placeholder position
img_tok = torch.tensor([[IMAGE_TOKEN_INDEX]], dtype=pre_ids.dtype)
input_ids = torch.cat([pre_ids, img_tok, post_ids], dim=1).to(model.device)
attention_mask = torch.ones_like(input_ids, device=model.device)
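
The splice in the cell above can be sketched without a model or tokenizer, which makes the logic easier to check in isolation. In this sketch `splice_image_token` and `toy_encode` are hypothetical names, and the toy encoder (one id per character) only stands in for the real tokenizer.

```python
IMAGE_TOKEN_INDEX = -200  # same sentinel value as in the notebook


def splice_image_token(rendered, encode, placeholder="<image>",
                       image_id=IMAGE_TOKEN_INDEX):
    """Encode the text on each side of the placeholder and put image_id between."""
    count = rendered.count(placeholder)
    if count != 1:
        raise ValueError(f"expected exactly one {placeholder!r}, found {count}")
    pre, post = rendered.split(placeholder, 1)
    return encode(pre) + [image_id] + encode(post)


def toy_encode(text):
    """Illustration only: one id per character, NOT the model's tokenizer."""
    return [ord(ch) for ch in text]


ids = splice_image_token("hi <image>\nok", toy_encode)
print(ids)  # [104, 105, 32, -200, 10, 111, 107]
```

With the real tokenizer, `encode` could be something like `lambda s: tok(s, add_special_tokens=False).input_ids`, which mirrors the `add_special_tokens=False` choice in the cell above.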

### Check Maximum Context Length

In [8]:
print(model.config.max_position_embeddings)

32768
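
The 32768 printed above is an upper bound on the total sequence length, so the prompt, the visual tokens that replace the single `<image>` placeholder, and the generated tokens all have to fit inside it. A rough budget check might look like the sketch below. `fits_context` is a hypothetical helper, and the value used for `n_image_tokens` is an assumed placeholder, not FastVLM's actual visual token count.

```python
def fits_context(n_prompt_tokens, n_image_tokens, max_new_tokens,
                 max_positions=32768):
    """Return (fits, total) for a prompt + image + generation budget."""
    total = n_prompt_tokens + n_image_tokens + max_new_tokens
    return total <= max_positions, total


# Assumed numbers for illustration: 30 text tokens, 256 visual tokens,
# and the 1024-token generation budget used later in the notebook.
ok, total = fits_context(n_prompt_tokens=30, n_image_tokens=256,
                         max_new_tokens=1024)
print(ok, total)  # True 1310
```

In the notebook, `n_prompt_tokens` would be `input_ids.shape[1] - 1`, since the `-200` placeholder occupies one position in `input_ids` but is expanded into image features inside the model.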


### Generate Result

In [9]:
# Preprocess image via the model's own processor
img = Image.open("../tasks/bird.jpg").convert("RGB")
px = model.get_vision_tower().image_processor(images=img, return_tensors="pt")["pixel_values"]
px = px.to(model.device, dtype=model.dtype)
# Generate
t1 = time.time()
with torch.no_grad():
    out = model.generate(
        inputs=input_ids,
        attention_mask=attention_mask,
        images=px,
        max_new_tokens=1024,
    )
t2 = time.time()
print(tok.decode(out[0], skip_special_tokens=True))

print(f"Generation Time: {round(t2-t1,2)}")

The image captures a close-up of a striking hummingbird, poised on the right side, just about to land on a vibrant flowering plant. The bird has a long, slender, needle-like beak, characteristic of hummingbirds, suggesting its role as a nectar feeder. Its head is a deep, dark color, almost black, and transitions to a greenish-brown chest adorned with subtle spotting. The bird's tail is not fully visible but hints at a striking combination of dark and green shades. Its white-tipped wings, slightly blurred to indicate motion, are held back in preparation for landing. 

The plant on the left side of the image showcases a multi-tiered stalk resembling a flowering cactus. The top part of the plant sports long, tubular flowers rendered in a vivid orange-red hue, exuding an iridescent sheen that catches the light beautifully. As the flowers descend, they fade into lighter shades, culminating in petals that transition from white to yellow and finally to a muted, almost brownish-white. Each tie

In [10]:
# Same pipeline as the previous cell, but with max_new_tokens=128 to cap the description length
img = Image.open("../tasks/bird.jpg").convert("RGB")
px = model.get_vision_tower().image_processor(images=img, return_tensors="pt")["pixel_values"]
px = px.to(model.device, dtype=model.dtype)
# Generate
t1 = time.time()
with torch.no_grad():
    out = model.generate(
        inputs=input_ids,
        attention_mask=attention_mask,
        images=px,
        max_new_tokens=128,
    )
t2 = time.time()
print(tok.decode(out[0], skip_special_tokens=True))

print(f"Generation Time: {round(t2-t1,2)}")

A vibrant outdoor photograph captures a striking scene dominated by a bright red spike of flowers at the top of the image, contrasting sharply against a yellowish-white background. The focal point of the right side of the image is a blue-tinted hummingbird mid-flight against this backdrop. The bird, with its distinct black eye and long black beak, gracefully hovers directly in front of the flower stalk, as if it’s about to land on the blooms. Its body is a beautiful mosaic of colors, with greenish speckles on the side, and gray streaks along its body.

The flower cluster itself is long and cylindrical,
Generation Time: 13.86
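
To compare runs with different `max_new_tokens` values, it can help to convert wall-clock time into a throughput figure. The sketch below uses hypothetical helpers `timed` and `tokens_per_second`, and it assumes the run above used its full 128-token budget, which the truncated output suggests but does not prove. The figure also includes prompt and image processing time, so it understates pure decoding speed.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def tokens_per_second(n_new_tokens, seconds):
    """Average generation throughput over one call."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return n_new_tokens / seconds


# Assumption: all 128 new tokens were generated in the 13.86 s reported above.
rate = tokens_per_second(128, 13.86)
print(f"{rate:.2f} tokens/s")  # 9.24 tokens/s
```

In the notebook, `timed(model.generate, inputs=input_ids, attention_mask=attention_mask, images=px, max_new_tokens=128)` would replace the manual `t1`/`t2` bookkeeping. On a GPU, calling `torch.cuda.synchronize()` before reading the clock is a common precaution against measuring only the kernel launch time.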
