# LlamaParse Agent

This demo walks through using an OpenAI Agent with [LlamaParse](https://cloud.llamaindex.ai).

Status:
| Last Executed | Version | State      |
|---------------|---------|------------|
| Aug-19-2025   | 0.6.61  | Maintained |

## Setup

In [None]:
!pip install llama-cloud-services "llama-index>=0.13.0,<0.14.0"

In [None]:
import os

os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."
os.environ["OPENAI_API_KEY"] = "sk-..."
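
Since the keys above are placeholders, a quick check before any API call can save a confusing authentication error later. The following is a minimal sketch; `missing_keys` is a hypothetical helper, not part of any library, and it assumes a value ending in `...` is an unfilled placeholder.

```python
import os

REQUIRED_KEYS = ("LLAMA_CLOUD_API_KEY", "OPENAI_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required names that are unset, empty, or still placeholders."""
    missing = []
    for name in required:
        value = env.get(name, "")
        # Assumption: a value ending in "..." is the notebook's placeholder.
        if not value or value.endswith("..."):
            missing.append(name)
    return missing


missing = missing_keys(os.environ)
if missing:
    print(f"Set these environment variables before continuing: {missing}")
```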

In [None]:
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-5-mini")

## Parsing

For parsing, let's use a [recent paper](https://huggingface.co/papers/2403.09611) on multi-modal pretraining.

In [None]:
!wget https://arxiv.org/pdf/2403.09611.pdf -O paper.pdf

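Below, we configure the parser to use an agent-based parse mode with high-resolution OCR and HTML table output. The parser can also be told to skip content we don't want (here, the references section would just add noise to a RAG system), although the configuration shown does not set such an instruction.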

In [None]:
from llama_cloud_services import LlamaParse

parser = LlamaParse(
    parse_mode="parse_page_with_agent",
    model="openai-gpt-4-1-mini",
    high_res_ocr=True,
    adaptive_long_table=True,
    outlined_table_extraction=True,
    output_tables_as_HTML=True,
)

In [None]:
result = await parser.aparse("paper.pdf")
documents = result.get_markdown_documents(split_by_page=False)

Started parsing the file under job_id cd1958b0-b260-4a63-aa74-bf829a0c125f
..
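
If a references section still ends up in the parsed markdown, one option is to drop it after parsing rather than in the parser. The sketch below is plain Python and makes no claim about LlamaParse's own options. `drop_references` is a hypothetical helper, and it assumes the bibliography starts at a markdown heading named "References" or "Bibliography".

```python
import re

# Assumption: the bibliography starts at a heading with one of these names.
_REFERENCE_HEADING = re.compile(
    r"^(#{1,6})\s+(?:references|bibliography)\s*$", re.IGNORECASE
)
_ANY_HEADING = re.compile(r"^(#{1,6})\s+\S")


def drop_references(markdown: str) -> str:
    """Remove the references section, up to the next heading of the same or a higher level."""
    kept = []
    skip_level = None  # heading level of the section currently being skipped
    for line in markdown.splitlines():
        heading = _ANY_HEADING.match(line)
        if skip_level is not None:
            if heading and len(heading.group(1)) <= skip_level:
                skip_level = None  # a new section starts, stop skipping
            else:
                continue
        ref = _REFERENCE_HEADING.match(line.strip())
        if ref:
            skip_level = len(ref.group(1))
            continue
        kept.append(line)
    return "\n".join(kept)


sample = "# Paper\n\nBody text.\n\n## References\n\n[1] A. Author.\n\n## Appendix\n\nExtra."
cleaned = drop_references(sample)
print(cleaned)
```

A function like this could be applied to the text of each parsed document before chunking. Whether the heading pattern matches depends on how the parser renders the paper's section titles.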

In [None]:
from llama_index.core.node_parser import SentenceSplitter

# Split the parsed documents into chunks of up to 2048 tokens with 256 tokens of overlap
nodes = SentenceSplitter(chunk_size=2048, chunk_overlap=256).get_nodes_from_documents(
    documents
)
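
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping windows. It is not how `SentenceSplitter` is implemented: the real splitter counts tokens with a tokenizer and tries to respect sentence boundaries, while this hypothetical `chunk_with_overlap` works on whitespace-separated words.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of at most chunk_size items, where each
    window shares chunk_overlap items with the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


words = "the quick brown fox jumps over the lazy dog".split()
for chunk in chunk_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

With the values used in this notebook, each chunk would repeat roughly the last 256 tokens of the previous one, so a passage cut at a chunk boundary still appears with its surrounding context in one of the two chunks.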

## Chat over the paper: let's find out what it is about!

In [None]:
from llama_index.core import VectorStoreIndex, SummaryIndex

vector_index = VectorStoreIndex(nodes=nodes)
summary_index = SummaryIndex(nodes=nodes)

In [None]:
from llama_index.core.agent import FunctionAgent
from llama_index.core.tools import QueryEngineTool

tools = [
    QueryEngineTool.from_defaults(
        vector_index.as_query_engine(
            similarity_top_k=4,
        ),
        name="query",
        description="Send a query that requires only a subset of the top-k documents to be considered",
    ),
    QueryEngineTool.from_defaults(
        summary_index.as_query_engine(),
        name="query_all_docs",
        description="Send a query that requires all documents to be considered",
    ),
]

agent = FunctionAgent(
    tools=tools,
    llm=Settings.llm,
    system_prompt="You are a helpful assistant that can answer questions about the paper.",
)
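
The two tools differ in how much of the paper reaches the LLM: the vector index retrieves only the `similarity_top_k` chunks closest to the query, while the summary index considers every chunk. The toy sketch below illustrates that difference with made-up 2-D vectors standing in for embeddings. `cosine` and `top_k` are hypothetical helpers, not LlamaIndex APIs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k):
    """Return the indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D "embeddings", invented for illustration only.
chunks = {"intro": [1.0, 0.0], "methods": [0.7, 0.7], "results": [0.0, 1.0]}
names = list(chunks)
query = [0.9, 0.1]

selected = [names[i] for i in top_k(query, list(chunks.values()), k=2)]
print("query tool would read:", selected)
print("query_all_docs tool would read:", names)
```

This is why a broad request such as a full summary tends to be routed to `query_all_docs`, whereas a narrow factual question is usually cheaper to answer from the top-k chunks.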

In [None]:
from llama_index.core.workflow import Context

# Context to persist the agent session
ctx = Context(agent)

In [None]:
from llama_index.core.agent import ToolCall, ToolCallResult

handler = agent.run(
    "What is the summary of the paper that you have access to?", ctx=ctx
)
async for ev in handler.stream_events():
    if isinstance(ev, ToolCall):
        print(f"Calling tool {ev.tool_name} with args {ev.tool_kwargs}")
    elif isinstance(ev, ToolCallResult):
        print(f"Tool call {ev.tool_name}({ev.tool_kwargs}) returned {ev.tool_output}")

print("\n================\n")

resp = await handler
print(resp)

Calling tool query_all_docs with args {'input': 'Provide the summary of the paper (concise abstract-like summary).'}
Tool call query_all_docs({'input': 'Provide the summary of the paper (concise abstract-like summary).'}) returned This paper presents a practical recipe and empirical analysis for building high-performing multimodal large language models (MLLMs). Through systematic ablations of image encoders, vision–language connectors, and pre-training data mixtures, the work identifies key design lessons: image resolution and the number of image tokens drive the largest gains, followed by encoder capacity and pre-training data; architectural choices for the vision–language connector matter far less. Data-wise, a careful mixture of captioned images, interleaved image–text documents, and some text-only data is critical — caption data boosts zero-shot captioning, interleaved documents enable strong few-shot and text performance, and text-only data preserves language capabilities. The aut

In [None]:
handler = agent.run("How do the authors evaluate their work?", ctx=ctx)
async for ev in handler.stream_events():
    if isinstance(ev, ToolCall):
        print(f"Calling tool {ev.tool_name} with args {ev.tool_kwargs}")
    elif isinstance(ev, ToolCallResult):
        print(f"Tool call {ev.tool_name}({ev.tool_kwargs}) returned {ev.tool_output}")


print("\n================\n")

resp = await handler
print(resp)

Calling tool query_all_docs with args {'input': 'Describe in detail how the authors evaluate their work: which benchmarks and tasks they use (pretraining metrics, few-shot evaluation, supervised fine-tuning, multimodal benchmarks, in-context learning, chain-of-thought, multi-image reasoning), the metrics reported, baselines compared, and ablation studies conducted. Include mentions of training steps, model sizes, and any special evaluation setups (e.g., positional interpolation, sub-image decomposition, synthetic caption data).'}
Tool call query_all_docs({'input': 'Describe in detail how the authors evaluate their work: which benchmarks and tasks they use (pretraining metrics, few-shot evaluation, supervised fine-tuning, multimodal benchmarks, in-context learning, chain-of-thought, multi-image reasoning), the metrics reported, baselines compared, and ablation studies conducted. Include mentions of training steps, model sizes, and any special evaluation setups (e.g., positional interpol
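
Both cells above repeat the same event loop and print very long tool outputs. One possible refactoring is a small formatter that shortens the output before printing. The sketch below is duck-typed: it only assumes events expose `tool_name`, `tool_kwargs`, and (for results) `tool_output`, as the loops above do. `describe_event` and the `SimpleNamespace` stand-ins are hypothetical and used for demonstration only.

```python
from types import SimpleNamespace


def describe_event(ev, max_chars=200):
    """Return a one-line description of a tool event, or None for other events."""
    name = getattr(ev, "tool_name", None)
    if name is None:
        return None  # not a tool-related event
    kwargs = getattr(ev, "tool_kwargs", {})
    if hasattr(ev, "tool_output"):
        output = str(ev.tool_output)
        if len(output) > max_chars:
            output = output[:max_chars] + "..."
        return f"Tool call {name}({kwargs}) returned {output}"
    return f"Calling tool {name} with args {kwargs}"


# Stand-in events, not real agent events.
call = SimpleNamespace(tool_name="query", tool_kwargs={"input": "hi"})
result = SimpleNamespace(
    tool_name="query", tool_kwargs={"input": "hi"}, tool_output="x" * 500
)
print(describe_event(call))
print(describe_event(result, max_chars=10))
```

Inside the `async for` loop, the two `isinstance` branches could then be replaced by calling `describe_event(ev)` and printing the result when it is not `None`.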