# [Beta] Multi-modal ReAct Agent

In this tutorial we show you how to construct a multi-modal ReAct agent.

The agent accepts both text and images as its task definition, then uses chain-of-thought reasoning and tool calls to try to solve the task.

This is implemented with our lower-level Agent API, allowing us to explicitly step through the ReAct loop to show you what's happening in each step.

We show two use cases:
1. **RAG Agent**: Given text/images, can query a RAG pipeline to look up answers (here, given a screenshot from OpenAI Dev Day 2023).
2. **Web Agent**: Given text/images, can query a web tool to look up relevant information from the web (here, given a picture of shoes).

**NOTE**: This is explicitly a beta feature; the abstractions will likely change over time!

**NOTE**: This currently only works with GPT-4V.
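
To make the ReAct loop concrete before bringing in the real agent, here is a minimal sketch of the Thought → Action → Observation cycle in plain Python. It does not use llama_index at all: the scripted replies stand in for the multi-modal LLM, and the `lookup` tool, `SCRIPTED_REPLIES`, and `react_loop` are hypothetical names chosen for illustration. The reply format mirrors the verbose logs shown later in this notebook.

```python
import ast
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the multi-modal LLM: scripted replies are
# replayed in order instead of calling a real model.
SCRIPTED_REPLIES = [
    "Thought: I need to look this up.\n"
    "Action: lookup\n"
    "Action Input: {'input': 'new features'}",
    "Thought: I can answer now.\n"
    "Response: The tool reported: GPT-4 Turbo.",
]


def lookup(input: str) -> str:
    """Toy tool; a real agent would query an index or the web here."""
    return "GPT-4 Turbo"


def react_loop(
    replies: List[str], tools: Dict[str, Callable[..., str]]
) -> Tuple[str, List[str]]:
    """Process one reply per step until a reply contains a final Response."""
    observations: List[str] = []
    for reply in replies:
        final = re.search(r"^Response:\s*(.*)", reply, re.MULTILINE | re.DOTALL)
        if final:
            return final.group(1).strip(), observations
        action = re.search(r"^Action:\s*(\w+)", reply, re.MULTILINE).group(1)
        raw_args = re.search(
            r"^Action Input:\s*(\{.*\})", reply, re.MULTILINE
        ).group(1)
        # The logged inputs are Python-style dicts, so literal_eval parses them.
        kwargs = ast.literal_eval(raw_args)
        # A real agent would feed this observation back into the next prompt.
        observations.append(tools[action](**kwargs))
    raise RuntimeError("the scripted model never produced a Response")


answer, observations = react_loop(SCRIPTED_REPLIES, {"lookup": lookup})
print(answer)
```

Each iteration corresponds to one agent step: either the model asks for a tool call, or it emits a final response and the loop ends.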

## Augment Image Analysis with a RAG Pipeline

In this section we create a multimodal agent equipped with a RAG Tool.

### Setup Data

In [1]:
!mkdir -p other_images/openai
!wget "https://images.openai.com/blob/a2e49de2-ba5b-4869-9c2d-db3b4b5dcc19/new-models-and-developer-products-announced-at-devday.jpg?width=2000" -O other_images/openai/dev_day.png

--2024-01-02 20:25:25--  https://images.openai.com/blob/a2e49de2-ba5b-4869-9c2d-db3b4b5dcc19/new-models-and-developer-products-announced-at-devday.jpg?width=2000
Resolving images.openai.com (images.openai.com)... 2606:4700:4400::6812:28cd, 2606:4700:4400::ac40:9333, 172.64.147.51, ...
Connecting to images.openai.com (images.openai.com)|2606:4700:4400::6812:28cd|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 300894 (294K) [image/jpeg]
Saving to: ‘other_images/openai/dev_day.png’


2024-01-02 20:25:25 (13.8 MB/s) - ‘other_images/openai/dev_day.png’ saved [300894/300894]
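
Note that the server reports `image/jpeg` while the file is saved with a `.png` extension. If you want to confirm what was actually downloaded, a minimal sketch that inspects the leading magic bytes is shown below. The helper name `sniff_image_format` is hypothetical, and it only distinguishes JPEG from PNG.

```python
def sniff_image_format(header: bytes) -> str:
    """Guess an image format from the first bytes of a file."""
    if header.startswith(b"\xff\xd8\xff"):  # JPEG start-of-image marker
        return "jpeg"
    if header.startswith(b"\x89PNG\r\n\x1a\n"):  # 8-byte PNG signature
        return "png"
    return "unknown"


# Usage (after the download above has run):
# with open("other_images/openai/dev_day.png", "rb") as f:
#     print(sniff_image_format(f.read(8)))
```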



In [2]:
from llama_hub.web.simple_web.base import SimpleWebPageReader

url = "https://openai.com/blog/new-models-and-developer-products-announced-at-devday"
reader = SimpleWebPageReader(html_to_text=True)
documents = reader.load_data(urls=[url])

### Setup Tools

In [3]:
from llama_index.llms import OpenAI
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.tools import QueryEngineTool, ToolMetadata

In [4]:
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)

In [5]:
vector_index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

In [6]:
query_tool = QueryEngineTool(
    query_engine=vector_index.as_query_engine(),
    metadata=ToolMetadata(
        name="vector_tool",
        description=(
            "Useful to lookup new features announced by OpenAI"
            # "Useful to lookup any information regarding the image"
        ),
    ),
)

### Setup Agent

In [7]:
from llama_index.agent.react_multimodal.step import MultimodalReActAgentWorker
from llama_index.agent import AgentRunner
from llama_index.multi_modal_llms import MultiModalLLM, OpenAIMultiModal
from llama_index.agent import Task

mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=1000)

# Initialize AgentRunner with a MultimodalReActAgentWorker
react_step_engine = MultimodalReActAgentWorker.from_tools(
    [query_tool],
    # [],
    multi_modal_llm=mm_llm,
    verbose=True,
)
agent = AgentRunner(react_step_engine)

In [8]:
from llama_index.schema import ImageDocument

query_str = "The photo shows some new features released by OpenAI. Can you pinpoint the features in the photo and give more details using relevant tools?"
# query_str = "Tell me more about code_interpreter and how it's used"
# image document
image_document = ImageDocument(image_path="other_images/openai/dev_day.png")

task = agent.create_task(
    query_str,
    extra_state={"image_docs": [image_document]},
)

In [9]:
def execute_step(agent: AgentRunner, task: Task):
    step_output = agent.run_step(task.task_id)
    if step_output.is_last:
        response = agent.finalize_response(task.task_id)
        print(f"> Agent finished: {str(response)}")
        return response
    else:
        return None


def execute_steps(agent: AgentRunner, task: Task):
    response = execute_step(agent, task)
    while response is None:
        response = execute_step(agent, task)
    return response
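
The `execute_steps` helper above keeps looping until the agent reports its last step. As a minimal sketch, assuming you would rather cap the number of steps (for example to limit API calls while experimenting), a bounded variant could look like the following. The name `execute_steps_bounded` and the `max_steps` parameter are hypothetical and not part of llama_index; the function only relies on the `run_step`, `is_last`, and `finalize_response` calls already used above.

```python
from typing import Any, Optional


def execute_steps_bounded(agent: Any, task: Any, max_steps: int = 10) -> Optional[Any]:
    """Step the agent until it finishes or the step budget runs out."""
    for _ in range(max_steps):
        step_output = agent.run_step(task.task_id)
        if step_output.is_last:
            return agent.finalize_response(task.task_id)
    # Budget exhausted without a final response.
    return None
```

A `None` return value then signals that the task is still unfinished, so you can inspect the verbose logs before deciding whether to continue stepping.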

In [10]:
# Run this and not the below if you just want to run everything at once.
# response = execute_steps(agent, task)

In [11]:
response = execute_step(agent, task)

Thought: I need to use a tool to help me identify the new features released by OpenAI as shown in the photo.
Action: vector_tool
Action Input: {'input': 'new features released by OpenAI'}
Observation: OpenAI has released several new features, including the GPT-4 Turbo model, the Assistants API, and multimodal capabilities. The GPT-4 Turbo model is more capable, cheaper, and supports a 128K context window. The Assistants API makes it easier for developers to build their own assistive AI apps with goals and the ability to call models and tools. The multimodal capabilities include vision, image creation (DALL·E 3), and text-to-speech (TTS). These new features are being rolled out to OpenAI customers starting at 1pm PT today.

In [12]:
response = execute_step(agent, task)

Thought: The observation provided information about the new features released by OpenAI, which I can now relate to the image provided.
Response: The photo shows a user interface with a section titled "Playground" and several options such as "GPT-4.0-turbo," "Code Interpreter," "Translate," and "Chat." Based on the observation from the tool, these features are part of the new releases by OpenAI. Specifically, "GPT-4.0-turbo" likely refers to the GPT-4 Turbo model, which is a more capable and cost-effective version of the language model with a larger context window. The "Code Interpreter" could be related to the Assistants API, which allows developers to build AI apps that can interpret and execute code. The "Translate" and "Chat" options might be part of the multimodal capabilities, with "Translate" possibly involving text-to-text language translation and "Chat" involving conversational AI capabilities. The multimodal capabilities also include vision and image creation, w

In [13]:
print(str(response))

The photo shows a user interface with a section titled "Playground" and several options such as "GPT-4.0-turbo," "Code Interpreter," "Translate," and "Chat." Based on the observation from the tool, these features are part of the new releases by OpenAI. Specifically, "GPT-4.0-turbo" likely refers to the GPT-4 Turbo model, which is a more capable and cost-effective version of the language model with a larger context window. The "Code Interpreter" could be related to the Assistants API, which allows developers to build AI apps that can interpret and execute code. The "Translate" and "Chat" options might be part of the multimodal capabilities, with "Translate" possibly involving text-to-text language translation and "Chat" involving conversational AI capabilities. The multimodal capabilities also include vision and image creation, which could be represented in the Playground interface but are not visible in the provided section of the photo.


## Augment Image Analysis with Web Search

In this example we show you how to set up a GPT-4V powered agent that looks up information on the web to help better explain a given image.

In [14]:
import os

from llama_hub.tools.metaphor.base import MetaphorToolSpec
from llama_index.agent.react_multimodal.step import MultimodalReActAgentWorker
from llama_index.agent import AgentRunner
from llama_index.multi_modal_llms import MultiModalLLM, OpenAIMultiModal
from llama_index.agent import Task

# Read the API key from the environment rather than hard-coding it.
# The variable name METAPHOR_API_KEY is an assumption; use whichever name you export.
metaphor_tool_spec = MetaphorToolSpec(
    api_key=os.environ["METAPHOR_API_KEY"],
)
metaphor_tools = metaphor_tool_spec.to_tool_list()

In [15]:
mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=1000)

# Initialize AgentRunner with a MultimodalReActAgentWorker
react_step_engine = MultimodalReActAgentWorker.from_tools(
    metaphor_tools,
    # [],
    multi_modal_llm=mm_llm,
    verbose=True,
)
agent = AgentRunner(react_step_engine)

In [16]:
from llama_index.schema import ImageDocument

query_str = "Look up some reviews regarding these shoes."
image_document = ImageDocument(image_path="amazon_images_test_img.png")

task = agent.create_task(
    query_str, extra_state={"image_docs": [image_document]}
)

In [17]:
response = execute_step(agent, task)

Thought: The image shows a pair of shoes that appear to be from a sportswear brand, possibly Adidas, given the three-stripe logo. I need to use a tool to search for reviews of these shoes.
Action: search
Action Input: {'query': 'Adidas Ultraboost 1.0 reviews', 'num_results': 5}
[Metaphor Tool] Autoprompt: "Here's a review of the Adidas Ultraboost 1.0:
Observation: [{'title': 'Still Not a Speed Shoe, Adidas’s Ultraboost Light Feels Faster', 'url': 'https://www.runnersworld.com/gear/a43167849/adidas-ultraboost-light-review/?utm_medium=social-media&utm_source=twitter&utm_campaign=socialflowTWRW', 'id': 'mWv--EVfp6DpIrMj8TzynA'}, {'title': 'Adidas ULTRABOOST LIGHT : Detailed First Impressions', 'url': 'https://www.youtube.com/watch?v=780bVc3yjsk', 'id': 'LOUPO-1rU2tZJd31YFYYXA'}, {'title': 'The Ultraboost 21 Has Arrived & These Are Its Best Colorways', 'url': 'https://www.highsnobiety.com/p/adidas-ultraboost-21-buy/?utm_medium=Social&utm_source=Twitter#Echobox=1

In [18]:
response = execute_step(agent, task)

Thought: I have received a list of URLs that contain reviews for the Adidas Ultraboost shoes. I can now provide a summary of these reviews to the user.
Response: Here are some reviews for the Adidas Ultraboost shoes:

1. "Still Not a Speed Shoe, Adidas’s Ultraboost Light Feels Faster" - A review from Runner's World discussing the performance of the Ultraboost Light, suggesting it feels faster than previous models but is not specifically designed for speed. [Read more](https://www.runnersworld.com/gear/a43167849/adidas-ultraboost-light-review/?utm_medium=social-media&utm_source=twitter&utm_campaign=socialflowTWRW)

2. "Adidas ULTRABOOST LIGHT: Detailed First Impressions" - A YouTube video providing first impressions on the Ultraboost Light, which may include details on comfort, design, and performance. [Watch here](https://www.youtube.com/watch?v=780bVc3yjsk)

3. "The Ultraboost 21 Has Arrived & These Are Its Best Colorways" - An article by Highsnobiety that introduces th

In [19]:
print(str(response))

Here are some reviews for the Adidas Ultraboost shoes:

1. "Still Not a Speed Shoe, Adidas’s Ultraboost Light Feels Faster" - A review from Runner's World discussing the performance of the Ultraboost Light, suggesting it feels faster than previous models but is not specifically designed for speed. [Read more](https://www.runnersworld.com/gear/a43167849/adidas-ultraboost-light-review/?utm_medium=social-media&utm_source=twitter&utm_campaign=socialflowTWRW)

2. "Adidas ULTRABOOST LIGHT: Detailed First Impressions" - A YouTube video providing first impressions on the Ultraboost Light, which may include details on comfort, design, and performance. [Watch here](https://www.youtube.com/watch?v=780bVc3yjsk)

3. "The Ultraboost 21 Has Arrived & These Are Its Best Colorways" - An article by Highsnobiety that introduces the Ultraboost 21 and highlights some of its best colorways. [Read more](https://www.highsnobiety.com/p/adidas-ultraboost-21-buy/?utm_medium=Social&utm_source=Twitter#Echobox=1611