# Dynamic Section Retrieval with LlamaParse

<a href="https://colab.research.google.com/github/run-llama/llama_cloud_services-demo/blob/main/examples/parse/advanced_rag/dynamic_section_retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook showcases a concept called "dynamic section retrieval".

A common problem with naive RAG approaches is that documents are organized hierarchically by section, while standard chunking/retrieval returns isolated chunks. A retrieved chunk may be only a fragment of its section, so relevant context from the rest of the section is missed.

Dynamic section retrieval tags each chunk with the section it belongs to and uses that metadata during retrieval to return entire contiguous sections, avoiding the problem of retrieving section fragments. It works in two steps:
1. First, tag chunks of a long document with the sections they correspond to, through structured extraction.
2. Do two-pass retrieval. After initial semantic search, dynamically pull in the entire section through metadata filtering.

![](dynamic_section_retrieval_img.png)

This addresses the common chunking problem where the retrieved chunks cover only a subset of the section you're meant to retrieve.
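
To make the two-pass idea concrete before bringing in any libraries, here is a minimal, self-contained sketch. It is not the notebook's implementation: the toy `chunks` list, the word-overlap scoring (a stand-in for semantic search), and the function names `first_pass` and `second_pass` are all assumptions made for illustration.

```python
from typing import Dict, List

# Toy corpus: one chunk per page, each tagged with the section it belongs to.
chunks: List[Dict] = [
    {"page": 1, "section": "1: Introduction", "text": "we introduce a benchmark"},
    {"page": 2, "section": "2: Method", "text": "the method builds a retriever"},
    {"page": 3, "section": "2: Method", "text": "training details and loss"},
    {"page": 4, "section": "3: Results", "text": "accuracy drops with long context"},
]


def first_pass(query: str, top_k: int = 1) -> List[Dict]:
    """Stand-in for semantic search: rank chunks by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(
        chunks, key=lambda c: len(q & set(c["text"].split())), reverse=True
    )
    return ranked[:top_k]


def second_pass(hits: List[Dict]) -> List[Dict]:
    """Expand each hit to every chunk that shares its section, in page order."""
    wanted = {h["section"] for h in hits}
    return sorted(
        (c for c in chunks if c["section"] in wanted), key=lambda c: c["page"]
    )


hits = first_pass("how is the retriever built")
expanded = second_pass(hits)
print([c["page"] for c in hits])      # [2]
print([c["page"] for c in expanded])  # [2, 3]
```

The first pass finds only page 2, but the second pass also pulls in page 3 because it carries the same section tag.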

Status:

| Last Executed | Version | State      |
|---------------|---------|------------|
| Aug-19-2025   | 0.6.61  | Maintained |

## Setup

Install core packages and download relevant files. Here we load some popular ICLR 2024 papers.

In [None]:
!pip install "llama-index>=0.13.0,<0.14.0" "llama-index-vector-stores-chroma>=0.5.1,<0.6.0"
!pip install llama-cloud-services

In [None]:
# NOTE: uncomment more papers if you want to do research over a larger subset of docs

urls = [
    # "https://openreview.net/pdf?id=VtmBAGCN7o",
    # "https://openreview.net/pdf?id=6PmJoRfdaK",
    # "https://openreview.net/pdf?id=LzPWWPAdY4",
    "https://openreview.net/pdf?id=VTF8yNQM66",
    "https://openreview.net/pdf?id=hSyW5go0v8",
    # "https://openreview.net/pdf?id=9WD9KwssyT",
    # "https://openreview.net/pdf?id=yV6fD7LYkF",
    # "https://openreview.net/pdf?id=hnrB5YHoYu",
    # "https://openreview.net/pdf?id=WbWtOYIzIK",
    "https://openreview.net/pdf?id=c5pwL0Soay",
    # "https://openreview.net/pdf?id=TpD2aG1h0D",
]

papers = [
    # "metagpt.pdf",
    # "longlora.pdf",
    # "loftq.pdf",
    "swebench.pdf",
    "selfrag.pdf",
    # "zipformer.pdf",
    # "values.pdf",
    # "finetune_fair_diffusion.pdf",
    # "knowledge_card.pdf",
    "metra.pdf",
    # "vr_mcl.pdf",
]

data_dir = "iclr_docs"

In [None]:
!mkdir -p "{data_dir}"
for url, paper in zip(urls, papers):
    !wget "{url}" -O "{data_dir}/{paper}"

#### Define LLM and Embedding Model

In [None]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-large", api_key="sk-...")
llm = OpenAI(model="gpt-5-mini", api_key="sk-...")

Settings.embed_model = embed_model
Settings.llm = llm

#### Parse Documents

In [None]:
from llama_cloud_services import LlamaParse

parser = LlamaParse(
    parse_mode="parse_page_with_agent",
    model="openai-gpt-4-1-mini",
    high_res_ocr=True,
    adaptive_long_table=True,
    outlined_table_extraction=True,
    output_tables_as_HTML=True,
    api_key="llx-...",
)

In [None]:
from pathlib import Path

# build the full path to each downloaded paper
paths_to_parse = []
for paper_path in papers:
    full_paper_path = str(Path(data_dir) / paper_path)
    paths_to_parse.append(full_paper_path)


results = await parser.aparse(paths_to_parse)

Getting job results:   0%|          | 0/3 [00:00<?, ?it/s]

Started parsing the file under job_id d8f0df2d-5b55-4e4f-bbe9-81cf4b8a4782
Started parsing the file under job_id 6aef247f-f548-43f5-9ddb-cf8ba8373130
Started parsing the file under job_id 5c1c4baf-fa43-4ed4-b671-16c45f99461c
...

Getting job results:  67%|██████▋   | 2/3 [01:40<00:46, 46.97s/it]

.....

Getting job results: 100%|██████████| 3/3 [05:49<00:00, 116.59s/it]


#### Get Text Nodes

Using each result object, we can create a list of text nodes with metadata attached.

In [None]:
from llama_index.core.schema import TextNode


# attach page-level metadata (page number, source file) to the text nodes
def get_text_nodes(result):
    """Create one text node per parsed page."""
    nodes = []

    md_texts = [page.md for page in result.pages]

    for idx, md_text in enumerate(md_texts):
        chunk_metadata = {
            "page_num": idx + 1,
            "paper_path": result.file_name,
        }
        node = TextNode(
            text=md_text,
            metadata=chunk_metadata,
        )
        nodes.append(node)

    return nodes

In [None]:
# this will combine all nodes from all papers into a single list
all_text_nodes = []
text_nodes_dict = {}
for result in results:
    text_nodes = get_text_nodes(result)
    all_text_nodes.extend(text_nodes)
    text_nodes_dict[result.file_name] = text_nodes

In [None]:
print(len(all_text_nodes))

106


## Add Section Metadata

The first step is to extract a map of all sections from the text of each document. We create a workflow that checks each page for section headings that begin on it, then merges the per-page results into a combined list. We then run a reflection step that reviews the combined list and drops sections that were extracted incorrectly.

Once we have a map of all the sections and the page numbers they start at, we can add the appropriate section ID as metadata to each chunk.
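
The merge and review steps are simple list operations once the per-page results exist. The sketch below illustrates them with plain dataclasses and hard-coded data. The `Section` class, the sample headings, and the `valid_indexes` values are hypothetical; in the notebook, both the per-page extraction and the list of valid indexes come from LLM calls.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    name: str        # section number, e.g. "3.2"
    title: str
    start_page: int


def merge_page_results(per_page: List[List[Section]]) -> List[Section]:
    """Flatten per-page results; pages with no headings contribute nothing."""
    return [s for page in per_page for s in page]


def keep_valid(sections: List[Section], valid_indexes: List[int]) -> List[Section]:
    """Keep only the sections whose index the reviewer marked as valid."""
    valid = set(valid_indexes)
    return [s for i, s in enumerate(sections) if i in valid]


per_page = [
    [Section("1", "Introduction", 1)],
    [],  # no section starts on page 2
    [Section("2", "Method", 3), Section("7", "Table of results", 3)],  # false positive
    [Section("3", "Experiments", 4)],
]

merged = merge_page_results(per_page)
refined = keep_valid(merged, valid_indexes=[0, 1, 3])  # what a reviewer might return
print([s.name for s in refined])  # ['1', '2', '3']
```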

#### Define Section Schema to Extract Into

Here we define the output schema used to extract section metadata from the document. Collecting these records gives us a full table of contents for each document.

In [None]:
from pydantic import BaseModel, Field
from typing import List, Optional


class SectionOutput(BaseModel):
    """The metadata for a given section. Includes the section name, title, page that it starts on, and more."""

    section_name: str = Field(
        ..., description="The current section number (e.g. section_name='3.2')"
    )
    section_title: str = Field(
        ...,
        description="The current section title associated with the number (e.g. section_title='Experimental Results')",
    )

    start_page_number: int = Field(..., description="The start page number.")
    is_subsection: bool = Field(
        ...,
        description="True if it's a subsection (e.g. Section 3.2). False if it's not a subsection (e.g. Section 3)",
    )
    description: Optional[str] = Field(
        None,
        description="The extracted line from the source text that indicates this is a relevant section.",
    )

    def get_section_id(self):
        """Get section id."""
        return f"{self.section_name}: {self.section_title}"


class SectionsOutput(BaseModel):
    """A list of all sections."""

    sections: List[SectionOutput]


class ValidSections(BaseModel):
    """A list of indexes, each corresponding to a valid section."""

    valid_indexes: List[int] = Field(
        ...,
        description="List of valid section indexes. Do NOT include sections to remove.",
    )

#### Extract into Section Outputs

Use LlamaIndex structured output capabilities to iterate through each page and extract out relevant section metadata. Note: some pages may contain no section metadata (there are no sections that begin on that page).
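
Each page is processed independently, so the per-page calls can run concurrently with a cap on how many are in flight (the code below passes `workers=8` to `run_jobs`). As a rough standard-library analogue of that pattern, and not LlamaIndex's actual implementation, the sketch below bounds concurrency with an `asyncio.Semaphore`. `fake_extract` and `run_bounded` are hypothetical names, and the "extraction" is simulated with a short sleep.

```python
import asyncio
from typing import Awaitable, List

peak = 0      # highest number of jobs observed running at once
_active = 0   # jobs currently running


async def fake_extract(page: int) -> List[str]:
    """Hypothetical stand-in for a per-page LLM call; odd pages 'contain' a heading."""
    global peak, _active
    _active += 1
    peak = max(peak, _active)
    await asyncio.sleep(0.01)  # simulate network latency
    _active -= 1
    return [f"heading on page {page}"] if page % 2 else []


async def run_bounded(jobs: List[Awaitable], workers: int) -> list:
    """Await all jobs with at most `workers` running at once; results keep input order."""
    sem = asyncio.Semaphore(workers)

    async def guarded(job: Awaitable):
        async with sem:
            return await job

    return await asyncio.gather(*(guarded(j) for j in jobs))


async def main() -> list:
    return await run_bounded([fake_extract(p) for p in range(1, 7)], workers=2)


# In a notebook, use `results = await main()` instead of asyncio.run().
results = asyncio.run(main())
flat = [h for page_result in results for h in page_result]
print(len(results), len(flat), peak)  # 6 3 2
```

Pages that return an empty list simply disappear when the per-page results are flattened, which mirrors how pages without section headings are handled.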

In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.core.prompts import ChatPromptTemplate, ChatMessage
from llama_index.core.llms import LLM
from llama_index.core.async_utils import run_jobs, asyncio_run
import json


async def aget_sections(
    doc_text: str, llm: Optional[LLM] = None
) -> List[SectionOutput]:
    """Get extracted sections from a provided text."""

    system_prompt = """\
    You are an AI document assistant tasked with extracting out section metadata from a document text. 
    
- You should ONLY extract out metadata if the document text contains the beginning of a section.
- The metadata schema is listed below - you should extract out the section_name, section_title, start page number, description.
- A valid section MUST begin with a hashtag (#) and have a number (e.g. "1 Introduction" or "Section 1 Introduction"). \
Note: Not all hashtag (#) lines are valid sections. 

- You can extract out multiple section metadata if there are multiple sections on the page. 
- If there are no sections that begin in this document text, do NOT extract out any sections. 
- A valid section MUST be clearly delineated in the document text. Do NOT extract out a section if it is mentioned, \
but is not actually the start of a section in the document text.
- A Figure or Table does NOT count as a section.
    
    The user will give the document text below.
    
    """
    llm = llm or OpenAI(model="gpt-5-mini", api_key="sk-...")
    sllm = llm.as_structured_llm(SectionsOutput)

    messages = [
        ChatMessage(content=system_prompt, role="system"),
        ChatMessage(content=f"Document text: {doc_text}", role="user"),
    ]
    result = await sllm.achat(messages)
    return result.raw.sections


async def arefine_sections(
    sections: List[SectionOutput], llm: Optional[LLM] = None
) -> List[SectionOutput]:
    """Refine sections based on extracted text."""

    system_prompt = """\
    You are an AI review assistant tasked with reviewing and correcting another agent's work in extracting sections from a document.

    Below is the list of sections with indexes. The sections may be incorrect in the following manner:
    - There may be false positive sections - some sections may be wrongly extracted - you can tell by the sequential order of the rest of the sections
    - Some sections may be incorrectly marked as subsections and vice-versa
    - You can use the description which contains extracted text from the source document to see if it actually qualifies as a section.

    Given this, return the list of indexes that are valid. Do NOT include the indexes to be removed.
    
    """
    llm = llm or OpenAI(model="gpt-5-mini", api_key="sk-...")
    sllm = llm.as_structured_llm(ValidSections)

    section_texts = "\n".join(
        [f"{idx}: {json.dumps(s.model_dump())}" for idx, s in enumerate(sections)]
    )

    messages = [
        ChatMessage(content=system_prompt, role="system"),
        ChatMessage(content=f"Sections in text:\n\n{section_texts}", role="user"),
    ]

    result = await sllm.achat(messages)
    valid_indexes = result.raw.valid_indexes

    new_sections = [s for idx, s in enumerate(sections) if idx in valid_indexes]
    return new_sections


async def acreate_sections(text_nodes_dict):
    sections_dict = {}
    for paper_path, text_nodes in text_nodes_dict.items():
        all_sections = []

        tasks = [aget_sections(n.get_content(metadata_mode="all")) for n in text_nodes]

        async_results = await run_jobs(tasks, workers=8, show_progress=True)
        all_sections = [s for r in async_results for s in r]

        all_sections = await arefine_sections(all_sections)
        sections_dict[paper_path] = all_sections
    return sections_dict

In [None]:
sections_dict = asyncio_run(acreate_sections(text_nodes_dict))

In [None]:
sections_dict["iclr_docs/swebench.pdf"]

[SectionOutput(section_name='1', section_title='Introduction', start_page_number=1, is_subsection=False, description='## 1 Introduction'),
 SectionOutput(section_name='2.2', section_title='TASK FORMULATION', start_page_number=3, is_subsection=True, description='## 2.2 TASK FORMULATION'),
 SectionOutput(section_name='2.3', section_title='FEATURES OF SWE-BENCH', start_page_number=3, is_subsection=True, description='## 2.3 FEATURES OF SWE-BENCH'),
 SectionOutput(section_name='3', section_title='SWE-LLAMA: FINE-TUNING CODELLAMA FOR SWE-BENCH', start_page_number=3, is_subsection=False, description='## 3 SWE-LLAMA: FINE-TUNING CODELLAMA FOR SWE-BENCH'),
 SectionOutput(section_name='4', section_title='EXPERIMENTAL SETUP', start_page_number=4, is_subsection=False, description='# 4 EXPERIMENTAL SETUP'),
 SectionOutput(section_name='4.1', section_title='RETRIEVAL-BASED APPROACH', start_page_number=4, is_subsection=True, description='## 4.1 RETRIEVAL-BASED APPROACH'),
 SectionOutput(section_name=

In [None]:
# [Optional] SAVE
import pickle

with open("sections_dict.pkl", "wb") as f:
    pickle.dump(sections_dict, f)

In [None]:
# [Optional] LOAD
import pickle
with open("sections_dict.pkl", "rb") as f:
    sections_dict = pickle.load(f)

#### Annotate each chunk with the section metadata

In the section above we extracted a TOC of all sections/subsections and their start pages. Given this, we can do one forward pass through all the chunks and annotate each with the section it corresponds to (i.e. the section/subsection with the highest start page that is less than or equal to the chunk's page number).
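
The page-based rule can be illustrated in isolation. The sketch below is an alternative formulation using `bisect` rather than the forward pass used in this notebook, and the table of contents in `toc` is made up for the example.

```python
from bisect import bisect_right
from typing import List, Tuple

# (start_page, section_id) pairs sorted by start page; values are illustrative.
toc: List[Tuple[int, str]] = [
    (1, "1: Introduction"),
    (3, "2: Method"),
    (6, "3: Experiments"),
]
start_pages = [page for page, _ in toc]


def section_for_page(page: int) -> str:
    """Return the last section whose start page is <= the given page."""
    idx = bisect_right(start_pages, page) - 1
    # Pages before the first heading fall back to the first section.
    return toc[max(idx, 0)][1]


labels = {page: section_for_page(page) for page in range(1, 8)}
print(labels[2])  # 1: Introduction
print(labels[3])  # 2: Method
print(labels[7])  # 3: Experiments
```

Because assignment happens at page granularity, a page that contains the end of one section and the start of the next is labeled only with the later section.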

In [None]:
def annotate_chunks_with_sections(chunks, sections):
    main_sections = [s for s in sections if not s.is_subsection]
    # subsections include the main sections too (some sections have no subsections etc.)
    sub_sections = sections

    main_section_idx, sub_section_idx = 0, 0
    for idx, c in enumerate(chunks):
        cur_page = c.metadata["page_num"]
        while (
            main_section_idx + 1 < len(main_sections)
            and main_sections[main_section_idx + 1].start_page_number <= cur_page
        ):
            main_section_idx += 1
        while (
            sub_section_idx + 1 < len(sub_sections)
            and sub_sections[sub_section_idx + 1].start_page_number <= cur_page
        ):
            sub_section_idx += 1

        cur_main_section = main_sections[main_section_idx]
        cur_sub_section = sub_sections[sub_section_idx]

        c.metadata["section_id"] = cur_main_section.get_section_id()
        c.metadata["sub_section_id"] = cur_sub_section.get_section_id()

In [None]:
for paper_path, text_nodes in text_nodes_dict.items():
    sections = sections_dict[paper_path]
    annotate_chunks_with_sections(text_nodes, sections)

You can choose to save these nodes if you'd like.

In [None]:
# SAVE
import pickle

with open("iclr_text_nodes.pkl", "wb") as f:
    pickle.dump(text_nodes_dict, f)

**LOAD**: If you've already saved nodes, run the below cell to load from an existing file.

In [None]:
# LOAD
import pickle

with open("iclr_text_nodes.pkl", "rb") as f:
    text_nodes_dict = pickle.load(f)

In [None]:
all_text_nodes = []
for paper_path, text_nodes in text_nodes_dict.items():
    all_text_nodes.extend(text_nodes)

In [None]:
len(all_text_nodes)

106

### Build Indexes

Once the text nodes are ready, we feed them into our vector store index abstraction, which indexes these nodes into a Chroma vector store persisted to the local `chroma_storage` directory (of course, you should definitely check out our 40+ vector store integrations!)

Besides vector indexing, each node **also** carries `paper_path` and `section_id` metadata. This allows us to perform filtered lookups later - for example, fetching all chunks that belong to a given paper or section.

In [None]:
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex

persist_dir = "chroma_storage"

vector_store = ChromaVectorStore.from_params(
    collection_name="text_nodes", persist_dir=persist_dir
)
index = VectorStoreIndex.from_vector_store(vector_store)

**NOTE**: Run the block below only the first time. Skip it if you've already inserted the nodes into the persisted collection.

In [None]:
index.insert_nodes(all_text_nodes)

## Setup Dynamic, Section-Level Retrieval

We now set up a retriever that returns an entire contiguous section of a document instead of a single chunk from it. This is useful for preserving the entire context within a doc.

- Step 1: Do chunk-level retrieval to find the relevant chunks.
- Step 2: For each chunk, identify the section that it corresponds to.
- Step 3: Do a second retrieval pass using metadata filters to find every chunk in the section that matches the chunk, and return those chunks in page order so the section reads contiguously.
- Step 4: Feed the contiguous sections into the LLM.
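
The retriever defined below returns a section's page-level nodes individually, sorted by page. If a single block of text per section is preferable, the pages could be merged first. The sketch below shows one way to do that with plain dictionaries standing in for nodes. The `merge_sections` helper and the sample data are assumptions for illustration and are not part of the notebook.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def merge_sections(page_nodes: List[Dict]) -> Dict[Tuple[str, str], str]:
    """Group page-level nodes by (paper_path, section_id) and join them in page order."""
    grouped = defaultdict(list)
    for node in page_nodes:
        grouped[(node["paper_path"], node["section_id"])].append(node)
    return {
        key: "\n\n".join(
            n["text"] for n in sorted(nodes, key=lambda n: n["page_num"])
        )
        for key, nodes in grouped.items()
    }


# Illustrative nodes, deliberately out of page order.
page_nodes = [
    {"paper_path": "a.pdf", "section_id": "2: Method", "page_num": 4, "text": "part two"},
    {"paper_path": "a.pdf", "section_id": "2: Method", "page_num": 3, "text": "part one"},
    {"paper_path": "a.pdf", "section_id": "3: Results", "page_num": 5, "text": "results"},
]

merged_sections = merge_sections(page_nodes)
print(merged_sections[("a.pdf", "2: Method")])
```

Keying on both the paper path and the section ID matters because different papers often share section IDs such as "1: Introduction".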

In [None]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-5-mini", api_key="sk-...")

In [None]:
chunk_retriever = index.as_retriever(similarity_top_k=3)

In [None]:
from llama_index.core.vector_stores.types import (
    VectorStoreInfo,
    VectorStoreQuerySpec,
    MetadataInfo,
    MetadataFilters,
    FilterCondition,
)
from llama_index.core.schema import NodeWithScore
from typing import List


def section_retrieve(query: str, verbose: bool = False) -> List[NodeWithScore]:
    """Retrieve sections."""
    if verbose:
        print(">> Identifying the right sections to retrieve")
    chunk_nodes = chunk_retriever.retrieve(query)

    all_section_nodes = {}
    for node in chunk_nodes:
        section_id = node.node.metadata["section_id"]
        if verbose:
            print(f">> Retrieving section: {section_id}")
        filters = MetadataFilters.from_dicts(
            [
                {"key": "section_id", "value": section_id, "operator": "=="},
                {
                    "key": "paper_path",
                    "value": node.node.metadata["paper_path"],
                    "operator": "==",
                },
            ],
            condition=FilterCondition.AND,
        )

        # TODO: make node_ids not positional
        section_nodes_raw = index.vector_store.get_nodes(node_ids=None, filters=filters)
        section_nodes = [NodeWithScore(node=n) for n in section_nodes_raw]
        # order and consolidate nodes
        section_nodes_sorted = sorted(
            section_nodes, key=lambda x: x.metadata["page_num"]
        )

        all_section_nodes.update({n.id_: n for n in section_nodes_sorted})
    return list(all_section_nodes.values())

In [None]:
nodes = section_retrieve(
    "Give me details of all additional experimental results in the Metra paper",
    verbose=True,
)

>> Identifying the right sections to retrieve
>> Retrieving section: 6: Conclusion
>> Retrieving section: 5: EXPERIMENTS
>> Retrieving section: 5: EXPERIMENTS


In [None]:
for n in nodes:
    print(n.node.metadata)

{'page_num': 9, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}
{'page_num': 10, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}
{'page_num': 11, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}
{'page_num': 12, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}
{'page_num': 13, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}
{'page_num': 14, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}
{'page_num': 15, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}
{'page_num': 16, 'paper_path': 'iclr_docs/metra.pdf', 'section_id': '6: Conclusion', 'sub_section_id': '6: Conclusion'}
{'page_num': 17, 'paper_path': 'iclr_docs

### Try out Section-Level Retrieval as a Full RAG Pipeline

Now that we've defined the retriever, we can plug the retrieved results into an LLM to create a full RAG pipeline! 

Our response synthesizers handle packing the retrieved context into the LLM prompt while accounting for context window limitations. Here we use `TreeSummarize`.
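
Whole sections can easily exceed a single prompt, which is why a hierarchical strategy helps. The sketch below illustrates the general idea of tree-style summarization: pack texts into batches that fit a budget, summarize each batch, and repeat on the summaries until one remains. It is a conceptual sketch only, not LlamaIndex's implementation. It uses character counts as a stand-in for tokens, and `pack`, `tree_summarize`, and `fake_summarize` are hypothetical names.

```python
from typing import Callable, List


def pack(texts: List[str], max_chars: int) -> List[str]:
    """Greedily join consecutive texts into batches of at most max_chars characters."""
    batches: List[str] = []
    current = ""
    for text in texts:
        candidate = f"{current}\n{text}" if current else text
        if current and len(candidate) > max_chars:
            batches.append(current)
            current = text
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


def tree_summarize(
    texts: List[str], summarize: Callable[[str], str], max_chars: int
) -> str:
    """Summarize batches level by level until a single summary remains."""
    if not texts:
        return ""
    batches = pack(texts, max_chars)
    while len(batches) > 1:
        packed = pack([summarize(b) for b in batches], max_chars)
        if len(packed) >= len(batches):
            raise ValueError("summaries are not shrinking; increase max_chars")
        batches = packed
    return summarize(batches[0])


calls: List[int] = []  # input length of each summarize call, for inspection


def fake_summarize(text: str) -> str:
    """Stand-in for an LLM call: 'summarize' by truncating to 20 characters."""
    calls.append(len(text))
    return text[:20]


pages = ["a" * 30, "b" * 30, "c" * 30, "d" * 30]
summary = tree_summarize(pages, fake_summarize, max_chars=64)
print(len(calls), summary)
```

With four 30-character pages and a 64-character budget, the pages are packed into two batches, each batch is summarized, and the two summaries are combined in one final call.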

In [None]:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.response_synthesizers import TreeSummarize, BaseSynthesizer


class SectionRetrieverRAGEngine(CustomQueryEngine):
    """RAG Query Engine."""

    synthesizer: BaseSynthesizer
    verbose: bool = True

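    def __init__(self, **kwargs):
        # default to tree summarization, but let callers override fields such as `verbose`
        kwargs.setdefault("synthesizer", TreeSummarize(llm=llm))
        super().__init__(**kwargs)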

    def custom_query(self, query_str: str):
        nodes = section_retrieve(query_str, verbose=self.verbose)
        response_obj = self.synthesizer.synthesize(query_str, nodes)
        return response_obj

In [None]:
query_engine = SectionRetrieverRAGEngine()

In [None]:
response = query_engine.query(
    "Tell me more about how difficulty correlates with context length in SWEBench"
)
print(str(response))

>> Identifying the right sections to retrieve
>> Retrieving section: 5: RESULTS
>> Retrieving section: 3: SWE-LLAMA: FINE-TUNING CODELLAMA FOR SWE-BENCH
>> Retrieving section: 4: EXPERIMENTAL SETUP
Key findings about how difficulty correlates with context length

- Performance falls as total input/context size grows. As the amount of code and other context provided to models increases, their ability to localize and produce correct edits drops noticeably (this behavior was observed across multiple models, e.g., Claude 2 and others).

- Extra (irrelevant) context distracts models. When models are given a lot of code that is unrelated to the actual edit, they frequently struggle to find the problematic lines that need changing. This sensitivity includes the relative location of the target code within the larger context.

- Increasing retriever recall doesn't fix it. Expanding retrieval windows (to include more files and therefore raise oracle recall) can actually hurt end-to-end performan

In [None]:
response = query_engine.query(
    "Give me a full overview of the benchmark details in SWE Bench"
)
print(str(response))

>> Identifying the right sections to retrieve
>> Retrieving section: 10: ACKNOWLEDGEMENTS
>> Retrieving section: 1: Introduction
>> Retrieving section: 3: SWE-LLAMA: FINE-TUNING CODELLAMA FOR SWE-BENCH
High-level summary
- SWE-bench is a repository-scale, execution-validated benchmark of real GitHub issues paired with merged pull-request solutions. Each task gives a snapshot of a real codebase plus an issue description; the model must produce a patch that, when applied, makes the repository pass the tests that verify the issue was addressed.
- The benchmark emphasizes realistic, hard software-engineering problems: large codebases, multi-file edits, long issue descriptions, and unit tests used for automatic verification.

Data sources and collection
- Candidate PRs are sourced from popular Python projects (selected from highly downloaded PyPI packages and mapped to their GitHub repositories). Repositories are filtered to ensure permissible licenses.
- Pull requests are collected via the

In [None]:
response = query_engine.query(
    "Give me details of all additional experimental results in the Metra paper"
)
print(str(response))

>> Identifying the right sections to retrieve
>> Retrieving section: 6: Conclusion
>> Retrieving section: 5: EXPERIMENTS
>> Retrieving section: 5: EXPERIMENTS
Here are the additional experimental results and analyses reported.

1) Full qualitative results (complete skill behaviors, 8 seeds)
- Environments: state-based Ant and HalfCheetah; pixel-based Quadruped and Humanoid.
- Skill parameterizations used in these visualizations: 2-D continuous skills for Ant and Humanoid, 4-D continuous skills for Quadruped, 16 discrete skills for HalfCheetah.
- Main finding: across 8 random seeds METRA consistently discovers diverse locomotion behaviors (radial/x-y coverage, different locomotion modes) regardless of seed. The paper shows multiple sample trajectories per seed to illustrate robustness and diversity.

2) Latent-space visualization
- Setup: METRA trained with 2-D continuous latent space on Ant (state inputs) and Humanoid (pixel inputs).
- Observation: the learned representation φ(s) captu