# Vision Agents with smolagents


## Let's install the dependencies and log in to our HF account to access the Inference API

If you haven't installed `smolagents` yet, you can do so by running the following command:

In [1]:
!pip install smolagents

Collecting smolagents
  Downloading smolagents-1.22.0-py3-none-any.whl.metadata (16 kB)
Downloading smolagents-1.22.0-py3-none-any.whl (149 kB)
Installing collected packages: smolagents
Successfully installed smolagents-1.22.0


Let's also log in to the Hugging Face Hub to have access to the Inference API.

In [2]:
from huggingface_hub import notebook_login

notebook_login()


## Providing Images at the Start of the Agent's Execution

In this approach, images are passed to the agent at the start and stored as `task_images` alongside the task prompt. The agent then processes these images throughout its execution.  

Consider the case where Alfred wants to verify the identities of the superheroes attending the party. He already has a dataset of images from previous parties with the names of the guests. Given a new visitor's image, the agent can compare it with the existing dataset and make a decision about letting them in.  

In this case, a guest is trying to enter, and Alfred suspects that this visitor might be The Joker impersonating Wonder Woman. Alfred needs to verify their identity to prevent anyone unwanted from entering.  

Let’s build the example. First, the images are loaded. In this case, we use images from Wikipedia to keep the example minimal, but imagine the possible use-cases!

In [3]:
from PIL import Image
import requests
from io import BytesIO

image_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e8/The_Joker_at_Wax_Museum_Plus.jpg",
    "https://upload.wikimedia.org/wikipedia/en/9/98/Joker_%28DC_Comics_character%29.jpg"
]

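# Wikimedia may reject requests that lack a browser-like User-Agent.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}

images = []
for url in image_urls:
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # Fail early instead of handing an error page to PIL
    image = Image.open(BytesIO(response.content)).convert("RGB")
    images.append(image)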

Now that we have the images, the agent will tell us whether the guest is actually a superhero (Wonder Woman) or a villain (The Joker).

In [None]:
from google.colab import userdata
import os
os.environ["OPENAI_API_KEY"] = ""

In [13]:
!pip install smolagents[telemetry] opentelemetry-sdk opentelemetry-exporter-otlp openinference-instrumentation-smolagents

Collecting opentelemetry-exporter-otlp
  Downloading opentelemetry_exporter_otlp-1.38.0-py3-none-any.whl.metadata (2.4 kB)
Collecting openinference-instrumentation-smolagents
  Downloading openinference_instrumentation_smolagents-0.1.19-py3-none-any.whl.metadata (4.5 kB)
Collecting arize-phoenix (from smolagents[telemetry])
  Downloading arize_phoenix-12.7.0-py3-none-any.whl.metadata (35 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc==1.38.0 (from opentelemetry-exporter-otlp)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.38.0-py3-none-any.whl.metadata (2.4 kB)
Collecting opentelemetry-exporter-otlp-proto-http==1.38.0 (from opentelemetry-exporter-otlp)
  Downloading opentelemetry_exporter_otlp_proto_http-1.38.0-py3-none-any.whl.metadata (2.3 kB)
Collecting opentelemetry-exporter-otlp-proto-common==1.38.0 (from opentelemetry-exporter-otlp-proto-grpc==1.38.0->opentelemetry-exporter-otlp)
  Downloading opentelemetry_exporter_otlp_proto_common-1.38.0-py3-none-any.whl.metadat

In [None]:
import base64
import os

os.environ["LANGFUSE_PUBLIC_KEY"] = ""
os.environ["LANGFUSE_SECRET_KEY"] = ""

LANGFUSE_PUBLIC_KEY = os.environ.get("LANGFUSE_PUBLIC_KEY")
LANGFUSE_SECRET_KEY = os.environ.get("LANGFUSE_SECRET_KEY")

if not LANGFUSE_PUBLIC_KEY or not LANGFUSE_SECRET_KEY:
    raise ValueError("Langfuse public or secret keys are missing!")

LANGFUSE_AUTH = base64.b64encode(f"{LANGFUSE_PUBLIC_KEY}:{LANGFUSE_SECRET_KEY}".encode()).decode()

#os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://cloud.langfuse.com/api/public/otel" # EU data region
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://us.cloud.langfuse.com/api/public/otel" # US data region
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {LANGFUSE_AUTH}"

print("Langfuse environment variables set successfully!")


Langfuse environment variables set successfully!
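The `OTEL_EXPORTER_OTLP_HEADERS` value set above is an HTTP Basic credential: the public and secret keys joined by a colon and then base64-encoded. The sketch below rebuilds that string with placeholder keys and decodes it again, which can help when debugging a rejected header. The helper name `build_basic_auth` and the `pk-lf-example` / `sk-lf-example` values are made up for illustration.

```python
import base64


def build_basic_auth(public_key: str, secret_key: str) -> str:
    """Return the base64 token that follows 'Basic ' in the Authorization header."""
    raw = f"{public_key}:{secret_key}".encode()
    return base64.b64encode(raw).decode()


# Placeholder keys, not real credentials.
token = build_basic_auth("pk-lf-example", "sk-lf-example")
header = f"Authorization=Basic {token}"

# Decoding the token recovers the original "public:secret" pair.
decoded = base64.b64decode(token).decode()
print(header)
print(decoded)
```

If the decoded string does not show both keys separated by a single colon, the header was most likely built from empty or swapped variables.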


In [20]:
from opentelemetry.sdk.trace import TracerProvider

from openinference.instrumentation.smolagents import SmolagentsInstrumentor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

trace_provider = TracerProvider()
trace_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter()))

SmolagentsInstrumentor().instrument(tracer_provider=trace_provider)



In [4]:
!pip install smolagents[litellm]

Collecting litellm>=1.60.2 (from smolagents[litellm])
  Downloading litellm-1.78.7-py3-none-any.whl.metadata (42 kB)
Collecting fastuuid>=0.13.0 (from litellm>=1.60.2->smolagents[litellm])
  Downloading fastuuid-0.14.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.1 kB)
Downloading litellm-1.78.7-py3-none-any.whl (9.9 MB)
Downloading fastuuid-0.14.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (278 kB)
Installing collected packages: fastuuid, litellm
Successfully installed fastuuid-0.14.0 litellm-1.78.7


In [None]:
import os
# Assuming you have installed smolagents with the litellm extra: pip install smolagents[litellm]
from smolagents import CodeAgent, LiteLLMModel
# You would set your API key as an environment variable:
os.environ["GEMINI_API_KEY"] = ""

# 1. Initialize the model using LiteLLMModel for Gemini
# 'gemini/gemini-2.5-flash' is a fast, multimodal model.
# LiteLLM automatically uses the GEMINI_API_KEY environment variable.
model = LiteLLMModel(
    model_id="gemini/gemini-2.5-flash",
    temperature=0.1 # Lower temperature for a factual description
)

# 2. Instantiate the agent
agent = CodeAgent(
    tools=[],
    model=model,
    max_steps=20,
    verbosity_level=2
)

# 3. Run the agent; `images` is the list of PIL images loaded above
response = agent.run(
    """
    Describe the costume and makeup that the comic character in these photos is wearing and return the description.
    Tell me if the guest is The Joker or Wonder Woman.
    """,
    images=images
)

print(response)

The character is wearing white face makeup, a wide red smile, and blue eyeshadow. The hair is green. The costume includes a purple coat, a yellow/gold shirt, and a purple cravat/tie. The character is The Joker.


In [8]:
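# `HuggingFaceServerModel` is not one of the smolagents extras, so pip only
# warns about the unknown extra and (re)installs the base package.
!pip install smolagents[HuggingFaceServerModel]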



In [12]:
response

'The character is wearing white face makeup, a wide red smile, and blue eyeshadow. The hair is green. The costume includes a purple coat, a yellow/gold shirt, and a purple cravat/tie. The character is The Joker.'

In this case, the output identifies the guest as The Joker rather than Wonder Woman, so we can prevent The Joker from entering the party!
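To turn the agent's free-text answer into a door decision, one option is a small post-processing step. This is only a sketch: `decide_entry` and its `banned` keyword list are hypothetical and not part of smolagents, and naive keyword matching on model output can misfire (for example on "this is not The Joker"), so treat it as a starting point.

```python
def decide_entry(description: str, banned=("joker",)) -> bool:
    """Return True if the guest may enter, False if a banned name is mentioned."""
    text = description.lower()
    return not any(name in text for name in banned)


# Abbreviated version of the agent's answer from the run above.
answer = "The hair is green. The character is The Joker."
print("admit" if decide_entry(answer) else "deny")
```

A sturdier variant would ask the agent to return a structured verdict instead of parsing prose.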

## Providing Images with Dynamic Retrieval

This example is provided as a `.py` file because it browses the web and therefore needs to be run locally. See the [Hugging Face Agents Course](https://www.hf.co/learn/agents-course) for more details.
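In the dynamic approach the images are not known up front. They are attached to the agent's memory while it runs, typically from a callback that fires after each step. A rough sketch of that pattern follows, assuming that memory steps expose an `observations_images` list and that the agent accepts callbacks through a `step_callbacks` argument, as in smolagents' web-browser example. `make_screenshot_callback`, `capture`, and the stand-in objects are illustrative names, not library API.

```python
from types import SimpleNamespace


def make_screenshot_callback(capture):
    """Build a step callback that attaches a fresh image to each step.

    `capture` is any zero-argument function returning an image,
    for example a PIL image of the current browser window.
    """

    def callback(memory_step, agent=None):
        # Attach the latest image so the model can see it on the next step.
        memory_step.observations_images = [capture()]

    return callback


# Stand-in objects so the sketch runs without a browser or a model.
fake_step = SimpleNamespace(step_number=1, observations_images=None)
callback = make_screenshot_callback(lambda: "fake-image")
callback(fake_step)
print(fake_step.observations_images)
```

With a real agent, the callback would be passed as `step_callbacks=[callback]` when constructing the `CodeAgent`, and `capture` would return an actual screenshot.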