# Auto Merging Retriever

https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_merging_retriever/

In [1]:
# !mkdir -p 'data/'
# !wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"

In [1]:
from pathlib import Path

from llama_index.readers.file import PDFReader
from llama_index.readers.file import PyMuPDFReader

In [3]:
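loader = PyMuPDFReader()
# docs0 = loader.load_data(file=Path("./data/llama2.pdf"))
# Adjust the path to wherever llama2.pdf was saved (the download command above writes to data/llama2.pdf).
docs0 = loader.load(file_path=Path("../../data/llama2.pdf"))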

In [4]:
from llama_index.core import Document

doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]

## Parse Chunk Hierarchy from Text, Load into Storage

In this section we use the HierarchicalNodeParser. It outputs a hierarchy of nodes, from top-level nodes with larger chunk sizes down to child nodes with smaller chunk sizes; each child node has a parent node with a larger chunk size.

By default, the hierarchy is:

- 1st level: chunk size 2048
- 2nd level: chunk size 512
- 3rd level: chunk size 128  

We then load these nodes into storage. The leaf nodes are indexed in a vector store; these are the nodes that similarity search retrieves first. The other (parent) nodes are looked up in a docstore.
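To make the parent/child structure concrete, here is a minimal pure-Python sketch of hierarchical chunking with the three default sizes. It is not the HierarchicalNodeParser implementation: it counts whitespace-separated words instead of tokens, uses no chunk overlap, and the names `Chunk`, `split_words`, and `build_hierarchy` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chunk:
    """One node in the hierarchy; parent/children are indices into a flat list."""
    text: str
    level: int
    parent: Optional[int] = None
    children: List[int] = field(default_factory=list)


def split_words(text: str, size: int) -> List[str]:
    """Split text into consecutive pieces of at most `size` words (no overlap)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_hierarchy(text: str, chunk_sizes=(2048, 512, 128)) -> List[Chunk]:
    """Recursively split text and link every chunk to the larger chunk it came from."""
    nodes: List[Chunk] = []

    def recurse(piece: str, level: int, parent: Optional[int]) -> None:
        for sub in split_words(piece, chunk_sizes[level]):
            idx = len(nodes)
            nodes.append(Chunk(text=sub, level=level, parent=parent))
            if parent is not None:
                nodes[parent].children.append(idx)
            # Keep splitting until the smallest chunk size has been applied.
            if level + 1 < len(chunk_sizes):
                recurse(sub, level + 1, idx)

    recurse(text, 0, None)
    return nodes


# Toy input (assumption): 4096 dummy words, so every level divides evenly.
toy_text = " ".join(f"w{i}" for i in range(4096))
toy_nodes = build_hierarchy(toy_text)
print(len(toy_nodes))
```

With this toy input, each 2048-word chunk has four 512-word children, and each of those has four 128-word children. The real parser's node counts differ because it splits on tokens and sentence boundaries.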

In [5]:
from llama_index.core.node_parser import (
    HierarchicalNodeParser,
    SentenceSplitter,
)

In [7]:
node_parser = HierarchicalNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(docs)

In [8]:
len(nodes)

1001

In [16]:
from llama_index.core.node_parser import get_leaf_nodes, get_root_nodes

# Leaf nodes are the smallest chunks; root nodes are the largest, top-level chunks.
leaf_nodes = get_leaf_nodes(nodes)
root_nodes = get_root_nodes(nodes)

In [17]:
len(root_nodes), len(leaf_nodes)

(38, 780)
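Assuming every node sits on exactly one of the three levels, the counts above imply 1001 - 38 - 780 = 183 intermediate nodes. The sketch below shows one plausible way to separate roots from leaves: a root has no parent and a leaf has no children. It uses a hypothetical toy dictionary instead of real node objects, and is an illustration of the idea, not the library's `get_root_nodes` / `get_leaf_nodes` code.

```python
from typing import Dict, List

# Hypothetical toy hierarchy: node id -> parent id (or None) and child ids.
toy_hierarchy: Dict[str, dict] = {
    "root": {"parent": None, "children": ["mid-0", "mid-1"]},
    "mid-0": {"parent": "root", "children": ["leaf-0", "leaf-1"]},
    "mid-1": {"parent": "root", "children": ["leaf-2"]},
    "leaf-0": {"parent": "mid-0", "children": []},
    "leaf-1": {"parent": "mid-0", "children": []},
    "leaf-2": {"parent": "mid-1", "children": []},
}


def toy_root_ids(hierarchy: Dict[str, dict]) -> List[str]:
    """Roots are the nodes that have no parent."""
    return sorted(k for k, v in hierarchy.items() if v["parent"] is None)


def toy_leaf_ids(hierarchy: Dict[str, dict]) -> List[str]:
    """Leaves are the nodes that have no children."""
    return sorted(k for k, v in hierarchy.items() if not v["children"])


root_ids = toy_root_ids(toy_hierarchy)
leaf_ids = toy_leaf_ids(toy_hierarchy)
# Whatever is neither root nor leaf belongs to the intermediate level.
intermediate_count = len(toy_hierarchy) - len(root_ids) - len(leaf_ids)
print(root_ids, leaf_ids, intermediate_count)
```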

In [32]:
leaf_nodes[0].relationships

{<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='4d2bde5d-5056-4d90-9b71-1d6478fb1aa6', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='5cba396f01cb0bf33088b66871a56aea6f72c037c7f5607ed618001f953cd256'),
 <NodeRelationship.PARENT: '4'>: RelatedNodeInfo(node_id='9e8c3318-f0ab-43e4-ad6d-47f667a4af80', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='e9ebffa94aeecdf6d7bb45d9a73a3f4ff168f62ad844c4e11268fdfc7bbced81')}

In [30]:
leaf_nodes[1].relationships

{<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='4d2bde5d-5056-4d90-9b71-1d6478fb1aa6', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='5cba396f01cb0bf33088b66871a56aea6f72c037c7f5607ed618001f953cd256'),
 <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='490788cb-176c-4d0a-bf13-8a369bcb8cb2', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='f14d2893826a5e7632c9e4a6c7736672919c2fab69a68a7df2ad9e9cd0cca027'),
 <NodeRelationship.PARENT: '4'>: RelatedNodeInfo(node_id='9e8c3318-f0ab-43e4-ad6d-47f667a4af80', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='e9ebffa94aeecdf6d7bb45d9a73a3f4ff168f62ad844c4e11268fdfc7bbced81')}

In [33]:
root_nodes[0].relationships

{<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='4d2bde5d-5056-4d90-9b71-1d6478fb1aa6', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='5cba396f01cb0bf33088b66871a56aea6f72c037c7f5607ed618001f953cd256'),
 <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='51cee19b-5bec-4921-97e2-586c3fe95cae', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='d222b0db8ef51f9a528029fb1614870448e97bcad084cb2c564867c6c6391792'),
 <NodeRelationship.CHILD: '5'>: [RelatedNodeInfo(node_id='9e8c3318-f0ab-43e4-ad6d-47f667a4af80', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='e9ebffa94aeecdf6d7bb45d9a73a3f4ff168f62ad844c4e11268fdfc7bbced81'),
  RelatedNodeInfo(node_id='bb54d9f1-ddf6-4f1e-a8fb-442b7b62faa6', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='251f7264bc4a59c17a0d1c24b997229241f2c0bb2b0c36aac5ec4339ad521137'),
  RelatedNodeInfo(node_id='8e5ea6a1-0747-4e19-857c-08ea5f572786', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='9cd4362d52854b8607e34429120c33af24598e

In [34]:
root_nodes[0]

TextNode(id_='dcc149cf-8db1-4e00-a406-76979f92ca3d', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='4d2bde5d-5056-4d90-9b71-1d6478fb1aa6', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='5cba396f01cb0bf33088b66871a56aea6f72c037c7f5607ed618001f953cd256'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='51cee19b-5bec-4921-97e2-586c3fe95cae', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='d222b0db8ef51f9a528029fb1614870448e97bcad084cb2c564867c6c6391792'), <NodeRelationship.CHILD: '5'>: [RelatedNodeInfo(node_id='9e8c3318-f0ab-43e4-ad6d-47f667a4af80', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='e9ebffa94aeecdf6d7bb45d9a73a3f4ff168f62ad844c4e11268fdfc7bbced81'), RelatedNodeInfo(node_id='bb54d9f1-ddf6-4f1e-a8fb-442b7b62faa6', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='251f7264bc4a59c17a0d1c24b997229241f2c0bb2b0c36aac5ec4339ad521137'), R

## Load into Storage

We define a docstore and load all nodes (every level of the hierarchy) into it.

We then define a VectorStoreIndex containing just the leaf-level nodes.

In [20]:
# define storage context
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core import StorageContext
from llama_index.llms.openai import OpenAI

docstore = SimpleDocumentStore()

# insert nodes into docstore
docstore.add_documents(nodes)

# define storage context (will include vector store by default too)
storage_context = StorageContext.from_defaults(docstore=docstore)

llm = OpenAI(model="gpt-4o-mini")

In [21]:
# Build a vector index over the leaf nodes only
from llama_index.core import VectorStoreIndex

base_index = VectorStoreIndex(
    leaf_nodes,
    storage_context=storage_context,
)

## Define Retriever

In [22]:
from llama_index.core.retrievers import AutoMergingRetriever

base_retriever = base_index.as_retriever(similarity_top_k=6)
retriever = AutoMergingRetriever(base_retriever, storage_context, verbose=True)

In [23]:
# query_str = "What were some lessons learned from red-teaming?"
# query_str = "Can you tell me about the key concepts for safety finetuning"
query_str = (
    "What could be the potential outcomes of adjusting the amount of safety"
    " data used in the RLHF stage?"
)

nodes = retriever.retrieve(query_str)
base_nodes = base_retriever.retrieve(query_str)

> Merging 4 nodes into parent node.
> Parent node id: bfa3a8a5-fe73-478f-a4ba-73337a9a6fc7.
> Parent node text: Therefore, after gathering only a few thousand supervised demonstrations, we switched entirely to...



In [25]:
len(nodes), len(base_nodes)

(3, 6)
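The counts are consistent with the verbose log: the base retriever returns 6 leaf nodes, 4 of them are replaced by a single parent, and 6 - 4 + 1 = 3 nodes remain. The sketch below illustrates one merging pass under stated assumptions: sibling hits are swapped for their parent when the retrieved fraction of that parent's children exceeds a threshold (0.5 here, which I believe matches the retriever's default `simple_ratio_thresh`, but treat that as an assumption). The toy docstore, the scores, and the number of children per parent are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical toy docstore: parent id -> ids of all its children.
children_of: Dict[str, List[str]] = {
    "P1": ["a", "b", "c", "d", "e"],
    "P2": ["f", "g", "h", "i"],
    "P3": ["j", "k", "l", "m"],
}
parent_of: Dict[str, str] = {
    child: parent for parent, kids in children_of.items() for child in kids
}


def merge_once(
    hits: List[Tuple[str, float]], ratio_thresh: float = 0.5
) -> List[Tuple[str, float]]:
    """Replace groups of sibling hits with their parent when enough siblings were hit."""
    by_parent: Dict[Optional[str], List[Tuple[str, float]]] = defaultdict(list)
    for node_id, score in hits:
        by_parent[parent_of.get(node_id)].append((node_id, score))

    merged: List[Tuple[str, float]] = []
    for parent, group in by_parent.items():
        if parent is not None and len(group) / len(children_of[parent]) > ratio_thresh:
            # Enough siblings retrieved: keep the parent, scored by the group's mean.
            avg_score = sum(score for _, score in group) / len(group)
            merged.append((parent, avg_score))
        else:
            merged.extend(group)
    return sorted(merged, key=lambda item: item[1], reverse=True)


# Six made-up leaf hits: four share parent P1, the other two are isolated.
hits = [("a", 0.88), ("b", 0.87), ("c", 0.86), ("d", 0.85), ("f", 0.89), ("j", 0.84)]
result = merge_once(hits)
print(result)
```

In this toy run, 4 of P1's 5 children are retrieved (ratio 0.8), so they collapse into P1, while `f` and `j` stay as they are. This is also why the docstore must hold the parent nodes: only the leaves are in the vector index, so a parent has to be fetched by id.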

In [27]:
from llama_index.core.response.notebook_utils import display_source_node

for node in nodes:
    display_source_node(node, source_length=10000)

**Node ID:** c4063709-dc8a-4e37-a88e-90cbe8b94e00<br>**Similarity:** 0.8834604249334524<br>**Text:** To better understand how the addition of safety training data affects
general model performance, especially helpfulness, we investigate the trends in safety data scaling by
adjusting the amount of safety data used in the RLHF stage. In this ablation experiment, we keep the amount
of helpfulness training data unchanged (∼0.9M samples) and gradually increase the amount of safety data
used in model tuning, ranging from 0% to 100% (∼0.1M samples).<br>

**Node ID:** bfa3a8a5-fe73-478f-a4ba-73337a9a6fc7<br>**Similarity:** 0.8563435780861142<br>**Text:** Therefore, after gathering only a few thousand supervised demonstrations, we switched entirely to RLHF to
teach the model how to write more nuanced responses. Comprehensive tuning with RLHF has the added
benefit that it may make the model more robust to jailbreak attempts (Bai et al., 2022a).
We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: annotators
write a prompt that they believe can elicit unsafe behavior, and then compare multiple model responses to
the prompts, selecting the response that is safest according to a set of guidelines. We then use the human
preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to
sample from the model during the RLHF stage.
Better Long-Tail Safety Robustness without Hurting Helpfulness
Safety is inherently a long-tail problem,
where the challenge comes from a small number of very specific cases. We investigate the impact of Safety
RLHF by taking two intermediate Llama 2-Chat checkpoints—one without adversarial prompts in the RLHF
stage and one with them—and score their responses on our test sets using our safety and helpfulness reward
models. In Figure 14, we plot the score distribution shift of the safety RM on the safety test set (left) and that
of the helpfulness RM on the helpfulness test set (right). In the left hand side of the figure, we observe that
the distribution of safety RM scores on the safety set shifts to higher reward scores after safety tuning with
RLHF, and that the long tail of the distribution near zero thins out. A clear cluster appears on the top-left
corner suggesting the improvements of model safety. On the right side, we do not observe any gathering
pattern below the y = x line on the right hand side of Figure 14, which indicates that the helpfulness score
distribution is preserved after safety tuning with RLHF. Put another way, given sufficient helpfulness training
data, the addition of an additional stage of safety mitigation does not negatively impact model performance
on helpfulness to any notable degradation. A qualitative example is shown in Table 12.
Impact of Safety Data Scaling.
A tension between helpfulness and safety of LLMs has been observed in
previous studies (Bai et al., 2022a).<br>

**Node ID:** bad9fb67-acd1-4348-babd-9f79b821df19<br>**Similarity:** 0.8468802780723181<br>**Text:** Bai
et al. (2022b) partially automates this fine-tuning-plus-RLHF approach by replacing the human-labeled
fine-tuning data with the model’s own self-critiques and revisions, and by replacing human raters with a
model when ranking model outputs in RLHF, a process known as “RL from AI Feedback” (RLAIF).
Known LLM Safety Challenges.
Recent literature has extensively explored the risks and challenges linked
with Large Language Models. Bender et al. (2021b) and Weidinger et al.<br>