# [Beta] Multi-modal ReAct Agent

In this tutorial we show you how to construct a multi-modal ReAct agent.

This agent takes both text and images as its task input, then works through chain-of-thought reasoning and tool use to try to solve the task.

This is implemented with our lower-level Agent API, allowing us to explicitly step through the ReAct loop to show you what's happening in each step.
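
To make the loop concrete, here is a minimal, library-free sketch of a ReAct cycle. It does not use the Agent API or any LLM: `scripted_llm`, `lookup`, `TOOLS`, and the history format are all hypothetical stand-ins, chosen only to illustrate the Thought, Action, Observation pattern that each agent step follows.

```python
from typing import Callable, Dict, List, Optional, Tuple


def lookup(query: str) -> str:
    """Hypothetical tool; stands in for something like a RAG query engine."""
    facts = {"capital of France": "Paris"}
    return facts.get(query, "unknown")


# Hypothetical tool registry: tool name -> callable taking a string input.
TOOLS: Dict[str, Callable[[str], str]] = {"lookup": lookup}


def scripted_llm(history: List[str]) -> Tuple[str, Optional[str], str]:
    """Stand-in for an LLM call. Returns (thought, action, payload).

    An action of None means the payload is the final answer.
    """
    prefix = "Observation: "
    observations = [h for h in history if h.startswith(prefix)]
    if not observations:
        return ("I need to look this up.", "lookup", "capital of France")
    return (
        "I can answer without using any more tools.",
        None,
        observations[-1][len(prefix):],
    )


def react_loop(task: str, max_steps: int = 5) -> str:
    """Alternate reasoning and tool calls until a final answer is produced."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, action, payload = scripted_llm(history)
        history.append(f"Thought: {thought}")
        if action is None:
            return payload  # final answer
        observation = TOOLS[action](payload)
        history.append(f"Action: {action}")
        history.append(f"Observation: {observation}")
    raise RuntimeError("no final answer within max_steps")


print(react_loop("What is the capital of France?"))
```

In the real agent, the scripted function is replaced by a multi-modal LLM call and the registry by the tools you pass in; the control flow is the part this sketch is meant to show.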

We show two use cases:
1. **RAG Agent**: Given text and images, the agent can query a RAG pipeline to look up answers (here, given a screenshot from OpenAI Dev Day 2023).
2. **Web Agent**: Given text and images, the agent can query a web tool to look up relevant information from the web (here, given a picture of shoes).

**NOTE**: This is explicitly a beta feature; the abstractions will likely change over time!

**NOTE**: This currently only works with GPT-4V.

## Augment Image Analysis with a RAG Pipeline

In this section we create a multimodal agent equipped with a RAG Tool.

### Setup Data

In [61]:
!wget "https://images.openai.com/blob/a2e49de2-ba5b-4869-9c2d-db3b4b5dcc19/new-models-and-developer-products-announced-at-devday.jpg?width=2000" -O other_images/openai/dev_day.png

--2024-01-02 20:24:33--  https://images.openai.com/blob/a2e49de2-ba5b-4869-9c2d-db3b4b5dcc19/new-models-and-developer-products-announced-at-devday.jpg?width=2000
Resolving images.openai.com (images.openai.com)... 2606:4700:4400::6812:28cd, 2606:4700:4400::ac40:9333, 172.64.147.51, ...
Connecting to images.openai.com (images.openai.com)|2606:4700:4400::6812:28cd|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 300894 (294K) [image/jpeg]
Saving to: ‘other_images/openai/dev_day.png’


2024-01-02 20:24:33 (7.38 MB/s) - ‘other_images/openai/dev_day.png’ saved [300894/300894]



In [62]:
from llama_hub.web.simple_web.base import SimpleWebPageReader

url = "https://openai.com/blog/new-models-and-developer-products-announced-at-devday"
reader = SimpleWebPageReader(html_to_text=True)
documents = reader.load_data(urls=[url])

### Setup Tools

In [None]:
from llama_index.llms import OpenAI
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.tools import QueryEngineTool, ToolMetadata

In [None]:
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)

In [None]:
vector_index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

In [None]:
query_tool = QueryEngineTool(
    query_engine=vector_index.as_query_engine(),
    metadata=ToolMetadata(
        name="vector_tool",
        description=(
            "Useful to lookup new features announced by OpenAI"
            # "Useful to lookup any information regarding the image"
        ),
    ),
)

### Setup Agent

In [None]:
from llama_index.agent.react_multimodal.step import MultimodalReActAgentWorker
from llama_index.agent import AgentRunner
from llama_index.multi_modal_llms import MultiModalLLM, OpenAIMultiModal
from llama_index.agent import Task

mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=1000)

# Initialize an AgentRunner with a MultimodalReActAgentWorker
react_step_engine = MultimodalReActAgentWorker.from_tools(
    [query_tool],
    # [],
    multi_modal_llm=mm_llm,
    verbose=True,
)
agent = AgentRunner(react_step_engine)

In [None]:
from llama_index.schema import ImageDocument

query_str = "The photo shows some new features released by OpenAI. Can you pinpoint the features in the photo and give more details using relevant tools?"
# query_str = "Tell me more about code_interpreter and how it's used"
# image document
image_document = ImageDocument(image_path="other_images/openai/dev_day.png")

task = agent.create_task(
    query_str,
    extra_state={"image_docs": [image_document]},
)

In [None]:
def execute_step(agent: AgentRunner, task: Task):
    step_output = agent.run_step(task.task_id)
    if step_output.is_last:
        response = agent.finalize_response(task.task_id)
        print(f"> Agent finished: {str(response)}")
        return response
    else:
        return None


def execute_steps(agent: AgentRunner, task: Task):
    response = execute_step(agent, task)
    while response is None:
        response = execute_step(agent, task)
    return response
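
`execute_steps` keeps looping until a step reports `is_last`. Below is a sketch of a bounded variant that gives up after a fixed number of steps. `StubRunner`, `StubStepOutput`, `execute_steps_bounded`, and `max_steps` are hypothetical names introduced here so the example runs without an API key; the stub only mimics the `run_step` and `finalize_response` calls used above and is not the real `AgentRunner`.

```python
from dataclasses import dataclass


@dataclass
class StubStepOutput:
    """Hypothetical stand-in exposing only the `is_last` flag."""

    is_last: bool


class StubRunner:
    """Hypothetical stand-in for AgentRunner; finishes after a fixed number of steps."""

    def __init__(self, steps_needed: int) -> None:
        self.steps_needed = steps_needed
        self.steps_run = 0

    def run_step(self, task_id: str) -> StubStepOutput:
        self.steps_run += 1
        return StubStepOutput(is_last=self.steps_run >= self.steps_needed)

    def finalize_response(self, task_id: str) -> str:
        return f"finished {task_id} after {self.steps_run} steps"


def execute_steps_bounded(agent, task_id: str, max_steps: int = 10):
    """Run steps until the agent finishes or max_steps is reached."""
    for _ in range(max_steps):
        step_output = agent.run_step(task_id)
        if step_output.is_last:
            return agent.finalize_response(task_id)
    raise RuntimeError(f"task {task_id} did not finish within {max_steps} steps")


print(execute_steps_bounded(StubRunner(steps_needed=3), "task-1"))
```

A cap like this is one way to avoid an unbounded loop (and unbounded API spend) if the agent never reaches a final answer.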

In [None]:
# Run this and not the below if you just want to run everything at once.
# response = execute_steps(agent, task)

In [None]:
response = execute_step(agent, task)

In [None]:
response = execute_step(agent, task)

In [None]:
print(str(response))

user: Observation: The latest features released by OpenAI include the GPT-4 Turbo model, the Assistants API, and multimodal capabilities such as vision, image creation (DALL·E 3), and text-to-speech (TTS). These features were announced in a recent blog post and will be rolled out to OpenAI customers starting at 1pm PT today.


## Augment Image Analysis with Web Search

In this example we show you how to set up a GPT-4V-powered agent that looks up information on the web to help explain a given image.

In [None]:
import os

from llama_hub.tools.metaphor.base import MetaphorToolSpec
from llama_index.agent.react_multimodal.step import MultimodalReActAgentWorker
from llama_index.agent import AgentRunner
from llama_index.multi_modal_llms import MultiModalLLM, OpenAIMultiModal
from llama_index.agent import Task

# Read the Metaphor API key from the environment rather than hardcoding it.
# METAPHOR_API_KEY is an assumed variable name; use whatever your setup defines.
metaphor_tool_spec = MetaphorToolSpec(
    api_key=os.environ["METAPHOR_API_KEY"],
)
metaphor_tools = metaphor_tool_spec.to_tool_list()

In [None]:
mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=1000)

# Initialize an AgentRunner with a MultimodalReActAgentWorker
react_step_engine = MultimodalReActAgentWorker.from_tools(
    metaphor_tools,
    # [],
    multi_modal_llm=mm_llm,
    verbose=True,
)
agent = AgentRunner(react_step_engine)

In [None]:
from llama_index.schema import ImageDocument

query_str = "Look up some reviews regarding these shoes."
image_document = ImageDocument(image_path="amazon_images_test_img.png")

task = agent.create_task(
    query_str, extra_state={"image_docs": [image_document]}
)

In [None]:
response = execute_step(agent, task)

Thought: The image shows a pair of shoes from a website that appears to be selling them. The user is asking for reviews of these shoes, so I will use the search tool to find reviews.
Action: search
Action Input: {'query': 'reviews for Adidas Ultraboost 1.0 shoes'}
[Metaphor Tool] Autoprompt: Here are some reviews for Adidas Ultraboost 1.0 shoes:
Observation: [{'title': 'Adidas Ultraboost Review : 7 pros, 2 cons (2023)', 'url': 'https://runrepeat.com/adidas-ultra-boost', 'id': 'Xqa5dR7IR24En7uL5BCTEg'}, {'title': 'Adidas Ultraboost 22 Review : 11 pros, 3 cons (2023)', 'url': 'https://runrepeat.com/adidas-ultraboost-22', 'id': 'k0iu2fLqLw4KNSh0tVxFHA'}, {'title': 'Adidas UltraBoost 2020 Review', 'url': 'https://www.runningshoesguru.com/2020/02/adidas-ultraboost-2020-review/', 'id': 'dws0o3TLyDIhRwdNndrP2g'}, {'title': 'Adidas Ultraboost Uncaged Parley Review 2023, Facts, Deals (£130)', 'url': 'https://runrepeat.com/uk/adidas-ultraboost-uncaged-parley', 'id': '

In [None]:
response = execute_step(agent, task)

Thought: I have a list of URLs that contain reviews for Adidas Ultraboost shoes. I will retrieve the document summaries for the first few links that seem most relevant to provide the user with a summary of the reviews.
Action: retrieve_documents
Action Input: {'ids': ['Xqa5dR7IR24En7uL5BCTEg', 'k0iu2fLqLw4KNSh0tVxFHA', 'dws0o3TLyDIhRwdNndrP2g']}
Observation: [Document(id_='22d786e1-4da4-4063-af87-668b3a134401', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='c64c77e35afc558770887313cea7654f21f44e6a0c64b6dda48165b7fd9ae5cb', text='<div><section><h2>\n Comparison to similar running shoes\n </h2> <table><tbody><tr><th> Same brand only</th> <th><p> + + Add a product</p></th><th><p> + + Add a product</p></th><th><p> + + Add a product</p></th><th><p> + + Add a product</p></th><th><p> + + Add a product</p></th><th><p> + + Add a product</p></th><th><p> + + Add a product</p></th></tr><tr><th>CoreSco

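The second step passes IDs from the search observation into `retrieve_documents`. Here is a minimal sketch of that hand-off, using the first three results from the search output above as plain data. The agent itself decides which IDs to pass; `top_ids` is a hypothetical helper for illustration, not part of the Metaphor tool spec.

```python
# First three search results, copied from the search observation above.
search_results = [
    {
        "title": "Adidas Ultraboost Review : 7 pros, 2 cons (2023)",
        "url": "https://runrepeat.com/adidas-ultra-boost",
        "id": "Xqa5dR7IR24En7uL5BCTEg",
    },
    {
        "title": "Adidas Ultraboost 22 Review : 11 pros, 3 cons (2023)",
        "url": "https://runrepeat.com/adidas-ultraboost-22",
        "id": "k0iu2fLqLw4KNSh0tVxFHA",
    },
    {
        "title": "Adidas UltraBoost 2020 Review",
        "url": "https://www.runningshoesguru.com/2020/02/adidas-ultraboost-2020-review/",
        "id": "dws0o3TLyDIhRwdNndrP2g",
    },
]


def top_ids(results, k=3):
    """Hypothetical helper: collect the `id` field of the first k results."""
    return [result["id"] for result in results[:k]]


# Same shape as the Action Input the agent produced for retrieve_documents.
action_input = {"ids": top_ids(search_results)}
print(action_input)
```
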
In [None]:
response = execute_step(agent, task)

Thought: I can answer without using any more tools.
Response: The reviews for the Adidas Ultraboost shoes highlight several key points:

1. The Adidas Ultraboost is praised for being versatile, suitable for both running and casual wear. It is considered expensive but offers good value due to its durability and the number of miles one can get out of them. The shoes are described as lightweight, breathable, and comfortable enough to wear without socks. However, they are not recommended for wet conditions as they do not perform well in the rain.

2. The Adidas Ultraboost 22 is noted for its popularity as a sneaker rather than just a running shoe. It is described as super comfortable and plush, with a design that has received a lot of care and thought. The shoe is recommended for its comfort but may not be suitable for those looking for a tighter upper or a shoe specifically optimized for speed.

3. The Adidas Ultraboost 2020 is recognized as a good daily trainer and a fashi

AgentChatResponse(response='The reviews for the Adidas Ultraboost shoes highlight several key points:\n\n1. The Adidas Ultraboost is praised for being versatile, suitable for both running and casual wear. It is considered expensive but offers good value due to its durability and the number of miles one can get out of them. The shoes are described as lightweight, breathable, and comfortable enough to wear without socks. However, they are not recommended for wet conditions as they do not perform well in the rain.\n\n2. The Adidas Ultraboost 22 is noted for its popularity as a sneaker rather than just a running shoe. It is described as super comfortable and plush, with a design that has received a lot of care and thought. The shoe is recommended for its comfort but may not be suitable for those looking for a tighter upper or a shoe specifically optimized for speed.\n\n3. The Adidas Ultraboost 2020 is recognized as a good daily trainer and a fashionable shoe. It competes with other premium

In [None]:
print(str(response))