In [1]:
import os

from dotenv import load_dotenv

# Load API keys (e.g. OPENAI_API_KEY, LLAMA_CLOUD_API_KEY) from a local .env file
load_dotenv()

In [2]:
!pip install llama-parse llama-index llama-index-postprocessor-sbert-rerank llama-index-postprocessor-flag-embedding-reranker FlagEmbedding


Collecting llama-index-postprocessor-sbert-rerank
  Downloading llama_index_postprocessor_sbert_rerank-0.2.0-py3-none-any.whl.metadata (681 bytes)
Downloading llama_index_postprocessor_sbert_rerank-0.2.0-py3-none-any.whl (3.0 kB)
Installing collected packages: llama-index-postprocessor-sbert-rerank
Successfully installed llama-index-postprocessor-sbert-rerank-0.2.0


In [4]:
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.2)

In [5]:
!wget https://arxiv.org/pdf/2403.09611.pdf -O paper.pdf

'wget' is not recognized as an internal or external command,
operable program or batch file.


Parsing
For parsing, let's use a recent [paper](https://huggingface.co/papers/2403.09611) on multimodal pretraining.

In [6]:
import requests

url = "https://arxiv.org/pdf/2403.09611.pdf"
response = requests.get(url, timeout=60)
response.raise_for_status()  # fail early instead of saving an error page as paper.pdf

with open("paper.pdf", "wb") as file:
    file.write(response.content)

The parser can be told to skip content we don't want; in a RAG system, for example, the references section mostly adds noise. The parser below only sets the output format to markdown and does not filter anything.
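
Independently of the parser, the references can also be dropped by post-processing the parsed markdown. The sketch below is not LlamaParse functionality: `drop_references` is a hypothetical helper, and it assumes the section title comes out as a markdown heading such as `## References`. Everything after that heading is cut, so an appendix placed after the references would be lost as well.

```python
import re


def drop_references(markdown: str) -> str:
    """Return the text up to (not including) a 'References' heading.

    Assumption: the section title is emitted as a markdown heading,
    e.g. '# References' or '## REFERENCES'. If no such heading is
    found, the text is returned unchanged.
    """
    match = re.search(
        r"^#{1,6}\s*references\s*$",
        markdown,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    if match is None:
        return markdown
    # Keep everything before the heading and normalise trailing whitespace.
    return markdown[: match.start()].rstrip() + "\n"


sample = "# Intro\nSome text.\n\n## References\n[1] A. Author. Title.\n"
print(drop_references(sample))
```

In this notebook the helper would be applied to the text of each parsed document before building nodes, provided the heading assumption holds for the parser output.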



In [8]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
)

In [9]:
documents = await parser.aload_data("paper.pdf")

Started parsing the file under job_id 761b889d-60fd-41bc-91a5-34f908c25293
...

In [10]:
import nest_asyncio

nest_asyncio.apply()

from llama_index.core.node_parser import (
    MarkdownElementNodeParser,
    SentenceSplitter,
)

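# Explicitly extract tables with the MarkdownElementNodeParser
node_parser = MarkdownElementNodeParser(num_workers=8)
nodes = node_parser.get_nodes_from_documents(documents)
# Separate the base nodes from the table objects (IndexNodes);
# note that `objects` is not used further in this notebook
nodes, objects = node_parser.get_nodes_and_objects(nodes)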

# Chain splitters to ensure chunk size requirements are met
nodes = SentenceSplitter(chunk_size=512, chunk_overlap=20).get_nodes_from_documents(
    nodes
)
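
To make the `chunk_size` and `chunk_overlap` parameters more concrete, here is a toy sketch of overlapping windows. It is not how `SentenceSplitter` is implemented: the real splitter counts tokens and prefers sentence boundaries, while the hypothetical `chunk_words` helper below simply counts whitespace-separated words.

```python
def chunk_words(text: str, chunk_size: int = 8, chunk_overlap: int = 2) -> list[str]:
    """Split text into windows of at most chunk_size words.

    Consecutive windows share chunk_overlap words, so content cut at a
    boundary still appears with some context in the next chunk.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


text = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(text, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

With `chunk_size=4` and `chunk_overlap=1`, each window starts three words after the previous one, so the last word of one chunk is repeated as the first word of the next.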



# Chat over the paper: let's find out what it is about!


In [11]:
from llama_index.core import VectorStoreIndex, SummaryIndex

vector_index = VectorStoreIndex(nodes=nodes)
summary_index = SummaryIndex(nodes=nodes)

In [15]:
from llama_index.agent.openai import OpenAIAgent
from llama_index.core.tools import QueryEngineTool, ToolMetadata
# from llama_index.postprocessor.colbert_rerank import ColbertRerank
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

tools = [
    QueryEngineTool(
        vector_index.as_query_engine(
            similarity_top_k=8, node_postprocessors=[FlagEmbeddingReranker(top_n=3)]
        ),
        metadata=ToolMetadata(
            name="search",
            description="Search the document, pass the entire user message in the query",
        ),
    ),
    QueryEngineTool(
        summary_index.as_query_engine(),
        metadata=ToolMetadata(
            name="summarize",
            description="Summarize the document using the user message",
        ),
    ),
]

agent = OpenAIAgent.from_tools(tools=tools, verbose=True)
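
The `search` tool above follows a two-stage pattern: fetch `similarity_top_k=8` candidates with a cheap similarity score, then keep the `top_n=3` that a more precise reranker scores highest. A minimal sketch of that pattern follows; the two scoring functions are hypothetical stand-ins (plain word overlap) for the embedding model and the reranker, not what `VectorStoreIndex` or `FlagEmbeddingReranker` actually compute.

```python
def word_overlap(query: str, chunk: str) -> int:
    """Cheap first-stage score: number of distinct words shared with the query."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def overlap_ratio(query: str, chunk: str) -> float:
    """Stand-in 'precise' score: shared words relative to the chunk length."""
    words = chunk.lower().split()
    return word_overlap(query, chunk) / len(words) if words else 0.0


def retrieve_then_rerank(query, chunks, cheap_score, precise_score,
                         similarity_top_k=8, top_n=3):
    """Stage 1 keeps similarity_top_k candidates; stage 2 reorders them and keeps top_n."""
    candidates = sorted(
        chunks, key=lambda c: cheap_score(query, c), reverse=True
    )[:similarity_top_k]
    return sorted(
        candidates, key=lambda c: precise_score(query, c), reverse=True
    )[:top_n]


chunks = [
    "the model is evaluated on captioning benchmarks",
    "the image encoder resolution matters",
    "evaluated benchmarks",
    "unrelated text about data",
]
result = retrieve_then_rerank(
    "how is the model evaluated", chunks, word_overlap, overlap_ratio,
    similarity_top_k=3, top_n=2,
)
print(result)
```

The reranker only sees the candidates that survive the first stage, so a relevant chunk missed by the cheap score cannot be recovered later; that is why `similarity_top_k` is set larger than `top_n`.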



In [16]:
# Note: this can take a while, especially with local LLMs, because it sends every node in the document to the LLM
resp = agent.chat("What is the summary of the paper?")

Added user message to memory: What is the summary of the paper?
=== Calling Function ===
Calling function: summarize with args: {"input":"What is the summary of the paper?"}
Got output: The paper discusses the development of Multimodal Large Language Models (MLLMs), focusing on the design and training processes that lead to high performance. It emphasizes the significance of various architectural components and data choices in building effective models. Key findings include the importance of a balanced mix of image-caption, interleaved image-text, and text-only data for achieving state-of-the-art few-shot results across multiple benchmarks. The study reveals that the image encoder's resolution and token count significantly impact performance, while the design of the vision-language connector is less critical. The authors present a family of models, MM1, with sizes ranging from 3B to 64B parameters, demonstrating competitive performance after supervised fine-tuning on established multim

In [17]:
print(str(resp))


The paper discusses the development of Multimodal Large Language Models (MLLMs), focusing on their design and training processes for high performance. It highlights the importance of various architectural components and data choices, emphasizing a balanced mix of image-caption, interleaved image-text, and text-only data for achieving state-of-the-art few-shot results across multiple benchmarks. Key findings include the significant impact of the image encoder's resolution and token count on performance, while the design of the vision-language connector is less critical. The authors introduce a family of models, MM1, ranging from 3B to 64B parameters, which demonstrate competitive performance after supervised fine-tuning on established multimodal benchmarks. The paper aims to provide insights and principles for building robust MLLMs, emphasizing the empirical nature of the model-building process and lessons learned from extensive ablations.


In [18]:
resp = agent.chat("How do the authors evaluate their work?")


Added user message to memory: How do the authors evaluate their work?
=== Calling Function ===
Calling function: search with args: {"input":"How do the authors evaluate their work?"}
Got output: The authors evaluate their work through the design and implementation of multimodal evaluation infrastructure, which includes model evaluations and experimentation. They also utilize text-based evaluation methods alongside multimodal evaluation approaches to assess the performance of their models.



In [19]:
print(str(resp))


The authors evaluate their work by designing and implementing multimodal evaluation infrastructure, which encompasses model evaluations and experimentation. They employ both text-based evaluation methods and multimodal evaluation approaches to assess the performance of their models.
