# LlamaParse Agent

This demo walks through using an OpenAI Agent with [LlamaParse](https://cloud.llamaindex.ai).

## Setup

In [None]:
!pip install llama-parse llama-index llama-index-postprocessor-colbert-rerank

In [None]:
import os

os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."
os.environ["OPENAI_API_KEY"] = "sk-..."
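
Pasting real keys into a notebook makes them easy to leak. As an alternative, here is a minimal sketch that assumes the keys were already exported in the shell that launched the notebook. `require_env` is a hypothetical helper written for this demo, not part of LlamaParse or LlamaIndex.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set {name} before running this notebook")
    return value


# Example usage (commented out so the cell does not fail when a key is missing):
# llama_cloud_key = require_env("LLAMA_CLOUD_API_KEY")
# openai_key = require_env("OPENAI_API_KEY")
```

Failing early here gives a clearer error than an authentication failure halfway through parsing.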

In [None]:
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.2)

## Parsing 

For parsing, let's use a [recent paper](https://huggingface.co/papers/2403.09611) on multimodal pre-training.

In [None]:
!wget https://arxiv.org/pdf/2403.09611.pdf -O paper.pdf

Below, we configure the parser to return markdown. LlamaParse can also be told to skip content we don't want (for example, the references section, which would only add noise to a RAG system), but this cell leaves that option unset.

In [None]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
)

In [None]:
documents = await parser.aload_data("paper.pdf")

Started parsing the file under job_id 81251f39-01be-434e-99e8-1c1b83b82098


In [None]:
import nest_asyncio

nest_asyncio.apply()

from llama_index.core.node_parser import (
    MarkdownElementNodeParser,
    SentenceSplitter,
)

# explicitly extract tables with the MarkdownElementNodeParser
node_parser = MarkdownElementNodeParser(num_workers=8)
nodes = node_parser.get_nodes_from_documents(documents)
nodes, objects = node_parser.get_nodes_and_objects(nodes)

# Chain splitters to ensure chunk size requirements are met
nodes = SentenceSplitter(chunk_size=512, chunk_overlap=20).get_nodes_from_documents(
    nodes
)

Embeddings have been explicitly disabled. Using MockEmbedding.


41it [00:00, 26765.21it/s]
100%|██████████| 41/41 [00:13<00:00,  2.98it/s]
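
To build some intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of overlapping windows. It is an assumption-laden toy: it splits on whitespace and uses fixed windows, whereas `SentenceSplitter` counts tokens with a tokenizer and prefers sentence boundaries. The function name `sliding_window_chunks` is made up for this example.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


words = "a b c d e f g h i j".split()
chunks = sliding_window_chunks(words, chunk_size=4, chunk_overlap=1)
# Each chunk starts with the last word of the previous one.
```

A small overlap like the 20 tokens used above keeps a sentence that straddles a boundary visible in both neighbouring chunks.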


## Chat over the paper: let's find out what it is about!

In [None]:
from llama_index.core import VectorStoreIndex, SummaryIndex

vector_index = VectorStoreIndex(nodes=nodes)
summary_index = SummaryIndex(nodes=nodes)

In [None]:
from llama_index.agent.openai import OpenAIAgent
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.postprocessor.colbert_rerank import ColbertRerank

tools = [
    QueryEngineTool(
        vector_index.as_query_engine(
            similarity_top_k=8, node_postprocessors=[ColbertRerank(top_n=3)]
        ),
        metadata=ToolMetadata(
            name="search",
            description="Search the document, pass the entire user message in the query",
        ),
    ),
    QueryEngineTool(
        summary_index.as_query_engine(),
        metadata=ToolMetadata(
            name="summarize",
            description="Summarize the document using the user message",
        ),
    ),
]

agent = OpenAIAgent.from_tools(tools=tools, verbose=True)
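
The `search` tool above is a two-stage pipeline: fetch `similarity_top_k=8` candidates cheaply, then let the reranker keep the best `top_n=3`. The sketch below shows only that shape. The scoring functions are toy stand-ins chosen for illustration (word overlap instead of embeddings, Jaccard similarity instead of a ColBERT model), and every name in it is hypothetical.

```python
def retrieve_then_rerank(query, docs, cheap_score, precise_score, top_k, top_n):
    """Keep the top_k docs by a cheap score, then reorder them by a
    costlier score and keep the top_n."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:top_k]
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:top_n]


def word_overlap(query, doc):
    """Toy first-stage score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def jaccard(query, doc):
    """Toy second-stage score: shared words divided by all distinct words."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


docs = [
    "image encoder resolution matters",
    "the image encoder and the connector design were ablated in many long experiments",
    "text only data helps few-shot performance",
    "unrelated sentence about cooking",
]
top = retrieve_then_rerank(
    "image encoder design", docs, word_overlap, jaccard, top_k=3, top_n=2
)
# The long document wins the first stage on raw overlap, but the second
# stage moves the short, focused document to the front.
```

The point of the split is cost: the expensive scorer only ever sees `top_k` candidates instead of the whole index.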

In [None]:
# Note: this can take a while, especially with local LLMs, because the summarize tool sends every node in the document to the LLM
resp = agent.chat("What is the summary of the paper?")

Added user message to memory: What is the summary of the paper?
=== Calling Function ===
Calling function: summarize with args: {"input":"summary"}
Got output: The research focuses on developing Multimodal Large Language Models (MLLMs) by incorporating image-caption, interleaved image-text, and text-only data for pre-training. It highlights the importance of factors like the image encoder, resolution, and token count, while downplaying the design of the vision-language connector. With models scaling up to 30B parameters, the MM1 family demonstrates impressive performance in pre-training metrics and competitive outcomes on diverse multimodal benchmarks. It demonstrates abilities such as in-context learning and multi-image reasoning, aiming to provide valuable insights for creating MLLMs that benefit the research community.
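
The log above shows the pattern: the model picks a tool name and supplies JSON arguments, and the agent runs the matching tool. Here is a rough sketch of that dispatch step, assuming tools are plain callables keyed by name. This is not how `OpenAIAgent` is implemented internally; `dispatch_tool_call` and the two lambdas are hypothetical stand-ins.

```python
import json


def dispatch_tool_call(tools, name, raw_args):
    """Look up a tool by name, decode its JSON arguments, and call it."""
    if name not in tools:
        raise KeyError(f"Unknown tool: {name}")
    kwargs = json.loads(raw_args)
    return tools[name](**kwargs)


# Hypothetical stand-ins for the two query engines defined earlier.
tools = {
    "search": lambda input: f"search results for {input!r}",
    "summarize": lambda input: f"summary for {input!r}",
}

# Same name and arguments as the logged call.
output = dispatch_tool_call(tools, "summarize", '{"input":"summary"}')
```

Because the model only sees each tool's name and description, the `ToolMetadata` descriptions are what steer it toward `search` or `summarize`.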



In [None]:
print(str(resp))

The summary of the paper highlights the development of Multimodal Large Language Models (MLLMs) by incorporating image-caption, interleaved image-text, and text-only data for pre-training. The research emphasizes factors like the image encoder, resolution, and token count, while de-emphasizing the design of the vision-language connector. The MM1 family of models, scaling up to 30B parameters, shows impressive performance in pre-training metrics and competitive outcomes on various multimodal benchmarks. These models demonstrate capabilities such as in-context learning and multi-image reasoning, aiming to provide valuable insights for creating MLLMs that benefit the research community.


In [None]:
resp = agent.chat("How do the authors evaluate their work?")

Added user message to memory: How do the authors evaluate their work?
=== Calling Function ===
Calling function: search with args: {"input":"evaluation methods"}
Got output: The evaluation methods involve synthesizing all benchmark results into a single meta-average number to simplify comparisons. This is achieved by normalizing the evaluation metrics with respect to a baseline configuration, standardizing the results for each task, adjusting every metric by dividing it by its respective baseline, and then averaging across all metrics.



In [None]:
print(str(resp))

The authors evaluate their work by synthesizing all benchmark results into a single meta-average number to simplify comparisons. They normalize the evaluation metrics with respect to a baseline configuration, standardize the results for each task, adjust every metric by dividing it by its respective baseline, and then average across all metrics for evaluation.
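
The meta-average described in this answer is simple arithmetic: divide each metric by its baseline value, then average the ratios. The sketch below follows the agent's description only (it has not been checked against the paper), and the task names and numbers are made up for illustration.

```python
def meta_average(results, baseline):
    """Divide each metric by its baseline value, then average the ratios."""
    if not results:
        raise ValueError("results must not be empty")
    ratios = [results[name] / baseline[name] for name in results]
    return sum(ratios) / len(ratios)


# Made-up numbers for illustration only; these are not results from the paper.
baseline = {"task_a": 50.0, "task_b": 20.0}
candidate = {"task_a": 55.0, "task_b": 19.0}
score = meta_average(candidate, baseline)
# Ratios are 1.10 and 0.95, so the score is about 1.025.
```

A score above 1 means the candidate beats the baseline on average, even if it loses on an individual task, as `task_b` does here.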
