<a href="https://colab.research.google.com/github/yetessam/huggingface/blob/main/notebooks/unit2/smolagents/vision_agents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vision Agents with smolagents


This notebook is part of the [Hugging Face Agents Course](https://www.hf.co/learn/agents-course), a free course that takes you from beginner to expert in building agents.

<img src="https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/communication/share.png"  height="100">



## Let's install the dependencies and log in to our HF account to access the Inference API

If you haven't installed `smolagents` yet, you can do so by running the following command:

In [1]:
!pip install smolagents


Let's also log in to the Hugging Face Hub to have access to the Inference API.

In [7]:
from huggingface_hub import notebook_login

notebook_login()


## Providing Images at the Start of the Agent's Execution

In this approach, images are passed to the agent at the start and stored as `task_images` alongside the task prompt. The agent then processes these images throughout its execution.  

Consider the case where Alfred wants to verify the identities of the superheroes attending the party. He already has a dataset of images from previous parties with the names of the guests. Given a new visitor's image, the agent can compare it with the existing dataset and make a decision about letting them in.  

In this case, a guest is trying to enter, and Alfred suspects that this visitor might be The Joker impersonating Wonder Woman. Alfred needs to verify their identity to prevent anyone unwanted from entering.  

Let’s build the example. First, the images are loaded. In this case, we use images from Wikipedia to keep the example minimal, but imagine the possible use-cases!

In [3]:
from PIL import Image
import requests
from io import BytesIO

image_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e8/The_Joker_at_Wax_Museum_Plus.jpg",
    "https://upload.wikimedia.org/wikipedia/en/9/98/Joker_%28DC_Comics_character%29.jpg"
]

images = []
for url in image_urls:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early if the download failed
    image = Image.open(BytesIO(response.content)).convert("RGB")
    images.append(image)
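
A failed download can return an HTML error page instead of image bytes, in which case `Image.open` raises an error that says little about the cause. The sketch below checks the first bytes of a payload against a few well-known file signatures before decoding. `sniff_image_format` and `IMAGE_SIGNATURES` are hypothetical names introduced for illustration; they are not part of PIL or `requests`.

```python
# Hypothetical helper: identify common image formats by their leading bytes.
IMAGE_SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
}


def sniff_image_format(data: bytes):
    """Return a format name if data starts with a known signature, else None."""
    for name, signature in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None
```

In the loop above, this could be called as `sniff_image_format(response.content)` before `Image.open`, skipping or reporting any URL for which it returns `None`.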

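Now that we have the images, the agent will tell us whether the guest is actually a superhero (Wonder Woman) or a villain (The Joker). The cells below first configure an OpenAI API key, since this version of the notebook uses `OpenAIServerModel`.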

In [8]:
from google.colab import userdata
import os
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
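
The cell above only works inside Google Colab, because `google.colab` is not importable elsewhere. As a minimal sketch, assuming you may also run the notebook locally, the helper below looks for the key in the environment first and falls back to Colab secrets only when they are available. `get_secret` is a hypothetical name, not part of any library.

```python
import os


def get_secret(name: str):
    """Return a secret from the environment, then from Colab secrets, else None."""
    value = os.environ.get(name)
    if value:
        return value
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return None
    try:
        return userdata.get(name)
    except Exception:
        # The secret is missing or the notebook has no access to it.
        return None
```

Outside Colab, you would export `OPENAI_API_KEY` in your shell before starting the notebook; inside Colab, the helper falls back to the same `userdata.get` call used above.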



In [23]:
import openai

client = None
try:
    client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.models.list()  # fails if the key is invalid
    print("Your API key is valid")
    #print("You have access to the following models:")
    #for model in response.data:
    #    print(model.id)

    print(" ")

    # Send a small test request to confirm that chat completions work
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "user", "content": "write a haiku about ai"}
        ]
    )

    print(completion.choices[0].message.content)

except openai.OpenAIError as e:
    print(f"Error: {e}")
finally:
    # Close the client only if it was created successfully
    if client is not None:
        client.close()







In [24]:
from smolagents import CodeAgent, OpenAIServerModel

model = OpenAIServerModel(model_id="gpt-4o")
#model=HfApiModel(model_id='https://pflgm2locj2t89co.us-east-1.aws.endpoints.huggingface.cloud/')

# Instantiate the agent
agent = CodeAgent(
    tools=[],
    model=model,
    max_steps=20,
    verbosity_level=2
)

# Resize the images loaded earlier to keep the request payload small.
# Note: a fixed (512, 512) target does not preserve the aspect ratio.
resized_images = [image.resize((512, 512)) for image in images]


response = agent.run(
    """
    Describe the costume and makeup that the comic character in these photos is wearing and return the description.
    Tell me if the guest is The Joker or Wonder Woman.
    """,
    images=resized_images
)
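
Resizing every image to a fixed 512×512 square stretches any image that is not already square, which may distort the costume details the agent is asked to describe. As a sketch of one alternative, the function below computes a target size that fits within a maximum side length while keeping the aspect ratio. `fit_within` is a hypothetical helper introduced here; PIL's own `Image.thumbnail` offers similar behavior in place.

```python
def fit_within(width: int, height: int, max_side: int = 512):
    """Return (width, height) scaled down to fit max_side, keeping the aspect ratio."""
    # Never scale up: cap the factor at 1.0 for images that already fit.
    scale = min(max_side / width, max_side / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With a PIL image, this could be used as `image.resize(fit_within(*image.size))`, since `image.size` is a `(width, height)` tuple.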

In [6]:
response

"Error in generating final LLM output:\n'image'"

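When the run succeeds, the agent's description reveals whether the guest is impersonating someone else, so we can prevent The Joker from entering the party! Note that the run recorded above did not succeed: it returned the error string `Error in generating final LLM output: 'image'` instead of a description.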
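
The run recorded in this notebook returned an error message as a plain string, which is easy to mistake for a verdict. The sketch below flags such responses before they are acted on. It assumes that failures surface as strings starting with the prefix seen in the recorded output; that assumption comes only from this notebook's output, not from documented `smolagents` behavior, and `is_agent_error` is a hypothetical name.

```python
# Prefix copied from the output recorded in this notebook (an assumption,
# not a documented contract of the library).
ERROR_PREFIX = "Error in generating final LLM output"


def is_agent_error(response) -> bool:
    """Return True if the agent's response looks like the recorded error string."""
    return isinstance(response, str) and response.startswith(ERROR_PREFIX)
```

A guard such as `if is_agent_error(response): ...` lets Alfred retry or escalate instead of admitting a guest on the basis of an error message.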

## Providing Images with Dynamic Retrieval

This example is provided as a `.py` file because it browses the web and therefore needs to be run locally. See the [Hugging Face Agents Course](https://www.hf.co/learn/agents-course) for more details.