# Multi-Document RAG

Doing RAG well over multiple documents is hard. A general framework is: given a user query, first select the relevant documents, then select the content within them.

But selecting the documents can be tough: how can we dynamically select documents based on different properties, depending on the user query?

In this notebook we show you our multi-document RAG architecture:

- Extract a Pydantic **metadata** dictionary from each document (using our Pydantic programs).
- Store this metadata dictionary as filters within a vector database.
- Given a user query, first do **auto-retrieval**: infer both the relevant semantic query and the set of metadata filters, then use them together to query the data (effectively combining text-to-SQL and semantic search).
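
To make the auto-retrieval step concrete, here is a minimal, library-free sketch of the idea: apply exact-match metadata filters first, then rank the surviving documents by a similarity score. Everything in it is an assumption for illustration. `ToyDoc`, `RetrievalSpec`, and the `section` metadata field are hypothetical names, the spec is written by hand instead of being inferred by an LLM, and word overlap stands in for embedding similarity.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a stored document with metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class RetrievalSpec:
    """What auto-retrieval would infer from a user query (hand-written here)."""
    query: str
    filters: dict


def matches(doc: ToyDoc, filters: dict) -> bool:
    # Exact-match filtering, playing the role of a SQL WHERE clause.
    return all(doc.metadata.get(key) == value for key, value in filters.items())


def score(doc: ToyDoc, query: str) -> int:
    # Word overlap as a crude stand-in for embedding similarity.
    return len(set(query.lower().split()) & set(doc.text.lower().split()))


def retrieve(docs: list, spec: RetrievalSpec, top_k: int = 1) -> list:
    # Step 1: narrow the candidate set with the metadata filters.
    candidates = [d for d in docs if matches(d, spec.filters)]
    # Step 2: rank the remaining candidates by semantic score.
    ranked = sorted(candidates, key=lambda d: score(d, spec.query), reverse=True)
    return ranked[:top_k]


toy_docs = [
    ToyDoc("how to build a query engine", {"section": "guides"}),
    ToyDoc("query engine api reference", {"section": "api"}),
    ToyDoc("changelog for recent releases", {"section": "changes"}),
]
spec = RetrievalSpec(query="query engine reference", filters={"section": "api"})
results = retrieve(toy_docs, spec)
```

In the real pipeline an LLM would produce the equivalent of `spec` from the user query, and a vector database would handle both the filtering and the similarity search.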

In [None]:
!pip install llama-index

## Setup and Download Data

In this section, we'll load in the LlamaIndex documentation.

In [1]:
domain = "docs.llamaindex.ai"
docs_url = "https://docs.llamaindex.ai/en/latest/"
!wget -e robots=off --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains {domain} --no-parent {docs_url}

Both --no-clobber and --convert-links were specified, only --convert-links will be used.
--2023-12-18 21:29:30--  https://docs.llamaindex.ai/en/latest/
Resolving docs.llamaindex.ai (docs.llamaindex.ai)... 2606:4700::6812:a3, 2606:4700::6812:1a3, 104.18.0.163, ...
Connecting to docs.llamaindex.ai (docs.llamaindex.ai)|2606:4700::6812:a3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘docs.llamaindex.ai/en/latest/index.html’

docs.llamaindex.ai/     [ <=>                ] 189.78K  --.-KB/s    in 0.01s   

2023-12-18 21:29:30 (14.8 MB/s) - ‘docs.llamaindex.ai/en/latest/index.html’ saved [194334]

--2023-12-18 21:29:30--  https://docs.llamaindex.ai/en/latest/genindex.html
Reusing existing connection to [docs.llamaindex.ai]:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘docs.llamaindex.ai/en/latest/genindex.html’

docs.llamaindex.ai/     [ <=>                ] 874.07K  --.-KB/s    i

In [2]:
from llama_hub.file.unstructured.base import UnstructuredReader
from pathlib import Path
from llama_index.llms import OpenAI
from llama_index import ServiceContext

In [3]:
reader = UnstructuredReader()

[nltk_data] Downloading package punkt to /Users/jerryliu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jerryliu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [4]:
all_files_gen = Path("./docs.llamaindex.ai/").rglob("*")
all_files = [f.resolve() for f in all_files_gen]

In [5]:
all_html_files = [f for f in all_files if f.suffix.lower() == ".html"]

In [6]:
len(all_html_files)

648

In [8]:
from llama_index import Document

# TODO: set to higher value if you want more docs
doc_limit = 10
# Hardcoded index: everything before this is the ToC shared by all pages
start_idx = 72

docs = []
for idx, f in enumerate(all_html_files):
    # Stop once doc_limit documents have been loaded
    if idx >= doc_limit:
        break
    print(f"Idx {idx}/{len(all_html_files)}")
    loaded_docs = reader.load_data(file=f, split_documents=True)
    # Join the elements after the ToC into a single Document per page
    loaded_doc = Document(
        text="\n\n".join([d.get_content() for d in loaded_docs[start_idx:]]),
        metadata={"path": str(f)},
    )
    print(loaded_doc.metadata["path"])
    docs.append(loaded_doc)

Idx 0/648
/Users/jerryliu/Programming/gpt_index/docs/examples/query_engine/multi_doc_auto_retrieval/docs.llamaindex.ai/en/latest/index.html
Idx 1/648
/Users/jerryliu/Programming/gpt_index/docs/examples/query_engine/multi_doc_auto_retrieval/docs.llamaindex.ai/en/latest/genindex.html
Idx 2/648
/Users/jerryliu/Programming/gpt_index/docs/examples/query_engine/multi_doc_auto_retrieval/docs.llamaindex.ai/en/latest/search.html
Idx 3/648
/Users/jerryliu/Programming/gpt_index/docs/examples/query_engine/multi_doc_auto_retrieval/docs.llamaindex.ai/en/latest/contributing/documentation.html
Idx 4/648
/Users/jerryliu/Programming/gpt_index/docs/examples/query_engine/multi_doc_auto_retrieval/docs.llamaindex.ai/en/latest/contributing/contributing.html
Idx 5/648
/Users/jerryliu/Programming/gpt_index/docs/examples/query_engine/multi_doc_auto_retrieval/docs.llamaindex.ai/en/latest/changes/changelog.html
Idx 6/648
/Users/jerryliu/Programming/gpt_index/docs/examples/query_engine/multi_doc_auto_retrieval/doc

In [9]:
docs[0].get_content()

"Data indexes structure your data in intermediate representations that are easy and performant for LLMs to consume.\n\nEngines provide natural language access to your data. For example:\n- Query engines are powerful retrieval interfaces for knowledge-augmented output.\n- Chat engines are conversational interfaces for multi-message, “back and forth” interactions with your data.\n\nData agents are LLM-powered knowledge workers augmented by tools, from simple helper functions to API integrations and more.\n\nApplication integrations tie LlamaIndex back into the rest of your ecosystem. This could be LangChain, Flask, Docker, ChatGPT, or… anything else!\n\n👨\u200d👩\u200d👧\u200d👦 Who is LlamaIndex for?\uf0c1\n\nLlamaIndex provides tools for beginners, advanced users, and everyone in between.\n\nOur high-level API allows beginner users to use LlamaIndex to ingest and query their data in 5 lines of code.\n\nFor more complex applications, our lower-level APIs allow advanced users to customize a

In [10]:
docs[0].metadata

{'path': '/Users/jerryliu/Programming/gpt_index/docs/examples/query_engine/multi_doc_auto_retrieval/docs.llamaindex.ai/en/latest/index.html'}
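
The loading loop above skips a hardcoded number of leading elements because every page starts with the same table of contents. As a hedged alternative, the sketch below estimates that offset by counting how many leading elements are identical across all pages. `shared_prefix_len` is a hypothetical helper, the toy `pages` lists stand in for the element texts of each parsed page, and the approach assumes the ToC elements really are identical on every page, which may not hold if the current page is highlighted differently.

```python
def shared_prefix_len(pages):
    """Return the number of leading elements that are identical in every page."""
    if not pages:
        return 0
    count = 0
    # zip stops at the shortest page, so the prefix never overruns a page.
    for column in zip(*pages):
        if any(element != column[0] for element in column):
            break
        count += 1
    return count


# Toy data: each inner list plays the role of one page's element texts.
pages = [
    ["Home", "Guides", "API", "Welcome to the docs"],
    ["Home", "Guides", "API", "Changelog", "v0.9"],
    ["Home", "Guides", "API", "Contributing"],
]

toc_len = shared_prefix_len(pages)
# Drop the shared prefix and join what remains, mirroring the loop above.
bodies = ["\n\n".join(page[toc_len:]) for page in pages]
```

With a single page the whole page counts as shared prefix, so this only makes sense when comparing at least two pages.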

## Extract Metadata from Each Document

We use our Pydantic programs to extract metadata from each document.

In [11]:
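from pydantic import BaseModel, Field

# The full prompt is passed in at call time through `query_str`
prompt_template_str = "{query_str}"


# Schema for the metadata extracted from each document
class DocMetadata(BaseModel):
    summary: str = Field(..., description="Summary of the document")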


In [None]:
from llama_index.program import OpenAIPydanticProgram


program = OpenAIPydanticProgram.from_defaults(
    output_cls=DocMetadata,
    llm=OpenAI(model="gpt-3.5-turbo"),
    prompt_template_str=prompt_template_str,
    verbose=True
)
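
Once a program like this returns a structured object per document, its fields can be merged into the document's metadata so they are available as filters later. The sketch below shows only that merge step and runs without an API key. `DocMetadataStub` is a dataclass stand-in for the `DocMetadata` model, `fake_extract` is a hypothetical placeholder for the LLM call (it just takes the first sentence), and the example path is illustrative.

```python
from dataclasses import asdict, dataclass


@dataclass
class DocMetadataStub:
    """Stand-in for the DocMetadata Pydantic model (same single field)."""
    summary: str


def fake_extract(text: str) -> DocMetadataStub:
    # Hypothetical placeholder for the LLM-backed program:
    # use the first sentence of the text as the "summary".
    first_sentence = text.split(".")[0].strip()
    return DocMetadataStub(summary=first_sentence)


def enrich(metadata: dict, text: str) -> dict:
    """Return a new metadata dict with the extracted fields merged in."""
    extracted = asdict(fake_extract(text))
    # Extracted fields are added next to existing keys such as "path".
    return {**metadata, **extracted}


enriched = enrich(
    {"path": "en/latest/index.html"},
    "Data indexes structure your data. Engines provide access.",
)
```

Returning a new dict rather than mutating the input keeps the original metadata intact if extraction needs to be rerun.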