SmolVLM
=====
**SmolVLM: Redefining small and efficient multimodal models**

 * Article: https://huggingface.co/blog/smolvlm
 * Paper: (SmolVLM) https://arxiv.org/pdf/2504.05299v1
 * Paper: (Idefics3) https://arxiv.org/pdf/2408.12637

![SmolVLM / Idefics3 Architecture Overview](../assets/smolvlm_arch_overview.jpg)

```bash
pip install torch torchvision
pip install transformers
```

In [None]:
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

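# Load the processor (tokenizer + image preprocessing) and the model weights
processor = AutoProcessor.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16,  # half-precision weights to reduce memory use
    # Optional: uncomment to use FlashAttention 2 on GPU (requires the flash-attn package)
    #_attn_implementation="flash_attention_2" if device == "cuda" else "eager"
)

# Switch to inference mode and move to the selected device;
# the trailing semicolon suppresses the notebook's cell output
model.eval().to(device);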

In [9]:
image_path = "../samples/plants.jpg"
image = Image.open(image_path).convert("RGB")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe the image?"}
        ]
    },
]

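# Prepare inputs: render the chat template to a prompt string, then tokenize
# the text and preprocess the image in a single processor call
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(device)
# input_ids: (batch, sequence length)
# pixel_values: (batch, sub-images, channels, height, width)
print(inputs["input_ids"].shape, inputs["pixel_values"].shape)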

torch.Size([1, 1562]) torch.Size([1, 17, 3, 384, 384])
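
The two shapes printed above can be sanity-checked with a little arithmetic. The sketch below is not part of the original notebook. It assumes, following the SmolVLM blog post, that the image is split into 384×384 sub-images (tiles plus one downscaled global view), that the vision encoder uses 14-pixel patches, and that pixel shuffle with a scale factor of 3 compresses each sub-image to 81 visual tokens. The remaining positions in `input_ids` would then be the text prompt and special separator tokens; that split is an inference from the numbers, not something the notebook prints.

```python
# Values read from the shapes printed by the cell above.
seq_len = 1562         # inputs["input_ids"].shape[1]
num_sub_images = 17    # inputs["pixel_values"].shape[1]
sub_image_px = 384     # inputs["pixel_values"].shape[-1]

# Assumptions (from the SmolVLM blog post, not from this notebook):
vit_patch_px = 14          # patch size of the vision encoder
pixel_shuffle_scale = 3    # pixel shuffle merges 3x3 patches into one token

patches_per_side = sub_image_px // vit_patch_px                       # 27
tokens_per_sub_image = (patches_per_side // pixel_shuffle_scale) ** 2  # 81

visual_tokens = num_sub_images * tokens_per_sub_image
other_tokens = seq_len - visual_tokens  # text prompt + special tokens

print(f"visual tokens per sub-image: {tokens_per_sub_image}")
print(f"total visual tokens:         {visual_tokens}")
print(f"text and special tokens:     {other_tokens}")
```

Under these assumptions, 1377 of the 1562 positions are visual tokens, so the image accounts for most of the prompt length.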


In [12]:
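import textwrap

# Generate outputs (up to 500 new tokens)
generated_ids = model.generate(**inputs, max_new_tokens=500)
# Note: the generated sequence starts with the prompt tokens, so the decoded
# text contains the prompt followed by the model's reply
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

response = generated_texts[0]

# Wrap the response at 80 columns for better readability
print(textwrap.fill(response, width=80))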

User:<image>Can you describe the image? Assistant: The image features two plants
in golden-colored pots, placed on a wooden floor against a white wall. The
plants are positioned in front of the wall, with the larger plant on the right
side of the image and the smaller plant on the left side. Both plants have green
leaves and appear to be healthy and well-maintained. The pots are cylindrical
and made of metal, with a golden finish that contrasts with the natural green of
the plants. The pots are placed on individual stands, which are also made of
metal and have a spiral design, allowing the plants to be placed at different
heights. The wooden floor is a light brown color, with a natural wood grain
pattern, and it appears to be well-maintained. The white wall in the background
provides a neutral backdrop for the plants and pots, making them the focal point
of the image. The lighting in the image is soft and natural, with no harsh
shadows, which helps to highlight the plants and their sur
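
The decoded text above begins with the prompt (`User:<image>Can you describe the image? Assistant:`) because the generated sequence includes the input tokens. A common way to keep only the reply is to drop the first `prompt_len` token ids before decoding. The helper below is a minimal sketch of that idea; `strip_prompt` is a hypothetical name introduced here for illustration, and the example uses made-up token ids in plain lists so that it runs without downloading the model.

```python
def strip_prompt(sequences, prompt_len):
    """Drop the first `prompt_len` token ids from every generated sequence."""
    return [seq[prompt_len:] for seq in sequences]


# Toy example: made-up token ids standing in for real model output.
prompt_ids = [101, 7, 8, 9]
generated = [prompt_ids + [42, 43, 2]]  # generation = prompt + new tokens

new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)  # [[42, 43, 2]]
```

With the tensors from the cells above, the equivalent would be the slice `generated_ids[:, inputs["input_ids"].shape[1]:]`, passed to `processor.batch_decode` in place of `generated_ids`.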