# Document Summary Index

This demo showcases the document summary index over Wikipedia articles on different cities.

The document summary index will extract a summary from each document and store that summary, as well as all nodes corresponding to the document.

Retrieval can be performed through the LLM or through embeddings. We first select the documents relevant to the query based on their summaries, then return all nodes belonging to the selected documents.
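
To make the two-step idea concrete, here is a minimal, library-free sketch. `ToySummaryIndex` and `word_overlap` are hypothetical names invented for illustration and are not part of llama_index; the word-overlap score only stands in for the LLM or embedding relevance judgment.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToySummaryIndex:
    """Hypothetical stand-in for a document summary index (not the llama_index class)."""

    summaries: Dict[str, str] = field(default_factory=dict)  # doc_id -> summary
    doc_nodes: Dict[str, List[str]] = field(default_factory=dict)  # doc_id -> text chunks

    def add_document(self, doc_id: str, summary: str, nodes: List[str]) -> None:
        self.summaries[doc_id] = summary
        self.doc_nodes[doc_id] = list(nodes)

    def retrieve(
        self, query: str, score_fn: Callable[[str, str], float], top_k: int = 1
    ) -> List[str]:
        # Step 1: rank documents by how well their *summary* matches the query.
        ranked = sorted(
            self.summaries,
            key=lambda doc_id: score_fn(query, self.summaries[doc_id]),
            reverse=True,
        )
        # Step 2: return every node of the selected documents.
        nodes: List[str] = []
        for doc_id in ranked[:top_k]:
            nodes.extend(self.doc_nodes[doc_id])
        return nodes


def word_overlap(query: str, summary: str) -> float:
    """Crude relevance score: the number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(summary.lower().split())))


toy_index = ToySummaryIndex()
toy_index.add_document(
    "Toronto", "history and sports teams of toronto", ["toronto chunk 1", "toronto chunk 2"]
)
toy_index.add_document("Boston", "history and economy of boston", ["boston chunk 1"])

toy_nodes = toy_index.retrieve("sports teams in toronto", word_overlap)
print(toy_nodes)  # -> ['toronto chunk 1', 'toronto chunk 2']
```

Note that the query is matched against the summaries only; the node texts are never scored individually.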

In [1]:
import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]

In [2]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
# Uncomment the two lines below to temporarily disable the logger
# logger = logging.getLogger()
# logger.disabled = True
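
A note on this logging setup: `basicConfig` already attaches a stream handler to the root logger, so the additional `addHandler` call means every record is written twice, once in the `LEVEL:name:message` format and once as the bare message. The sketch below reproduces that behavior on a private logger writing to an in-memory buffer; the logger name `handler_demo` is made up for the demonstration.

```python
import io
import logging

# Use a private logger so the demonstration does not touch the root logger.
demo_logger = logging.getLogger("handler_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False

buffer = io.StringIO()

# First handler: the same format that logging.basicConfig installs by default.
formatted_handler = logging.StreamHandler(buffer)
formatted_handler.setFormatter(logging.Formatter(logging.BASIC_FORMAT))
demo_logger.addHandler(formatted_handler)

# Second handler on the same stream, without a formatter (message only).
demo_logger.addHandler(logging.StreamHandler(buffer))

demo_logger.info("hello")
emitted_lines = buffer.getvalue().splitlines()
print(emitted_lines)  # -> ['INFO:handler_demo:hello', 'hello']
```

If the doubled lines are distracting, dropping the `addHandler` call should leave a single formatted line per record.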

In [3]:
import nest_asyncio

nest_asyncio.apply()

In [4]:
from llama_index import (
    SimpleDirectoryReader,
    ServiceContext,
    get_response_synthesizer,
)
from llama_index.indices.document_summary import DocumentSummaryIndex
from llama_index.llms import OpenAI

INFO:numexpr.utils:Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
NumExpr defaulting to 8 threads.


### Load Datasets

Load Wikipedia pages on different cities.

In [5]:
wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]

In [6]:
from pathlib import Path

import requests

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

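    # Create the output directory if needed (no error if it already exists)
    data_path = Path("data")
    data_path.mkdir(exist_ok=True)

    # Write as UTF-8 explicitly: the articles contain non-ASCII characters
    with open(data_path / f"{title}.txt", "w", encoding="utf-8") as fp:
        fp.write(wiki_text)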
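
The download loop assumes every title resolves to a page with an `extract` field. A slightly more defensive variant of the extraction step might look like the following sketch; `extract_page_text` is a hypothetical helper, and the two payloads are hand-written to mimic the shape of the API response rather than real data.

```python
from typing import Any, Dict


def extract_page_text(payload: Dict[str, Any]) -> str:
    """Return the plain-text extract of the first page in a MediaWiki query response."""
    pages = payload.get("query", {}).get("pages", {})
    if not pages:
        raise ValueError("response contains no pages")
    page = next(iter(pages.values()))
    if "extract" not in page:
        # Unknown titles come back with a "missing" marker instead of an extract.
        raise ValueError(f"no extract for page {page.get('title', '<unknown>')!r}")
    return page["extract"]


# Hand-written payloads shaped like the API response (illustrative only).
sample_ok = {"query": {"pages": {"123": {"title": "Toronto", "extract": "Toronto is a city."}}}}
sample_missing = {"query": {"pages": {"-1": {"title": "Nowhereville", "missing": ""}}}}

text_ok = extract_page_text(sample_ok)

try:
    extract_page_text(sample_missing)
    missing_raised = False
except ValueError:
    missing_raised = True
```

Failing early on a missing page avoids a bare `KeyError` in the middle of the loop and makes a misspelled title easier to spot.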

In [7]:
# Load all wiki documents
city_docs = []
for wiki_title in wiki_titles:
    docs = SimpleDirectoryReader(input_files=[f"data/{wiki_title}.txt"]).load_data()
    docs[0].doc_id = wiki_title
    city_docs.extend(docs)

### Build Document Summary Index

We show two ways of building the index:
- default mode of building the document summary index
- customizing the summary query


In [8]:
# LLM (gpt-3.5-turbo)
chatgpt = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=chatgpt, chunk_size=1024)

In [9]:
# default mode of building the index
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", use_async=True
)
doc_summary_index = DocumentSummaryIndex.from_documents(
    city_docs,
    service_context=service_context,
    response_synthesizer=response_synthesizer,
)

current doc id: Toronto
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=463 request_id=d6eb8fc8301bbb70e5ed906913ea4b42 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=463 request_id=d6eb8fc8301bbb70e5ed906913ea4b42 response_code=200
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=547 request_id=066ff477ea0931dabd06411b34ee1bc7 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=547 request_id=066ff477ea0931dabd06411b34ee1bc7 response_code=200
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=453 request_id=44708e4b96149d11b88569b7766e796d response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=453 request_id=44708e4b96149d11b88569b7766e796d response_c
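
The `tree_summarize` response mode used above builds each summary hierarchically. As a rough mental model only, the sketch below merges chunks bottom-up until one text remains. The fixed `fan_in` and the `fake_summarize` stand-in are assumptions made for illustration; the library's own implementation decides how much text to combine per LLM call based on the prompt size, and none of these helper names come from llama_index.

```python
from typing import Callable, List


def tree_summarize(
    chunks: List[str], summarize: Callable[[List[str]], str], fan_in: int = 2
) -> str:
    """Repeatedly merge groups of `fan_in` texts until a single summary remains."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(chunks)
    while len(level) > 1:
        # Each pass shrinks the list by roughly a factor of `fan_in`.
        level = [summarize(level[i : i + fan_in]) for i in range(0, len(level), fan_in)]
    # Note: a single input chunk is returned unchanged in this simplified version.
    return level[0]


def fake_summarize(texts: List[str]) -> str:
    """Stand-in for an LLM call: keep the first word of each text."""
    return " ".join(text.split()[0] for text in texts)


summary = tree_summarize(["alpha one", "beta two", "gamma three"], fake_summarize)
print(summary)  # -> alpha gamma
```

With three chunks and `fan_in=2`, the first pass produces two intermediate summaries and the second pass merges them, which is why several API calls are logged per document.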

In [10]:
doc_summary_index.get_document_summary("Boston")

"The provided text contains information about the city of Boston, including its history, geography, climate, neighborhoods, demographics, economy, education system, healthcare facilities, public safety, culture, environment, and sports. It discusses various aspects of the city such as important institutions, mergers and acquisitions, gentrification, significant events like the Boston Marathon bombing, and the city's bid for the 2024 Summer Olympics. The text also mentions Boston's tourism, financial services, printing and publishing industry, convention centers, universities, colleges, medical centers, public schools, private schools, and cultural institutions. It provides details about Boston's air quality, water purity, climate change initiatives, and sports teams.\n\nBased on this information, the text can answer questions such as:\n- What are some major industries in Boston's economy?\n- How many international tourists visited Boston in a specific year?\n- What are some renowned un

In [11]:
doc_summary_index.storage_context.persist("index")

In [12]:
from llama_index.indices.loading import load_index_from_storage
from llama_index import StorageContext

# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="index")
doc_summary_index = load_index_from_storage(storage_context)

INFO:llama_index.indices.loading:Loading all indices.
Loading all indices.


### Perform Retrieval from Document Summary Index

We show how to execute queries at a high level. We also show how to perform retrieval at a lower level so that you can see which parameters are available. We show both LLM-based and embedding-based retrieval using the document summaries.

#### High-level Querying

Note: this uses the default, LLM-based form of retrieval.

In [14]:
query_engine = doc_summary_index.as_query_engine(
    response_mode="tree_summarize", use_async=True
)

In [15]:
response = query_engine.query("What are the sports teams in Toronto?")

INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=513 request_id=ed88efab61c1ac7da2306701020c85d3 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=513 request_id=ed88efab61c1ac7da2306701020c85d3 response_code=200
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=505 request_id=e97056dfb2275b9ff0e710847aa845db response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=505 request_id=e97056dfb2275b9ff0e710847aa845db response_code=200
INFO:openai:message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=569 request_id=b8ec270016c44bc0301c3ee1ac926733 response_code=200
message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=569 request_id=b8ec270016c44bc0301c3ee1ac926733 response_code=200
INFO:openai:mess

In [16]:
print(response)

Toronto is represented in five major league sports: the National Hockey League (NHL) with the Toronto Maple Leafs, Major League Baseball (MLB) with the Toronto Blue Jays, the National Basketball Association (NBA) with the Toronto Raptors, the Canadian Football League (CFL) with the Toronto Argonauts, and Major League Soccer (MLS) with the Toronto FC. Additionally, Toronto has the Toronto Rock in the National Lacrosse League (NLL) and the Toronto Wolfpack in the Rugby Football League (RFL).


#### LLM-based Retrieval

In [17]:
from llama_index.indices.document_summary import DocumentSummaryIndexRetriever

In [18]:
retriever = DocumentSummaryIndexRetriever(
    doc_summary_index,
    # choice_select_prompt=choice_select_prompt,
    # choice_batch_size=choice_batch_size,
    # format_node_batch_fn=format_node_batch_fn,
    # parse_choice_select_answer_fn=parse_choice_select_answer_fn,
    # service_context=service_context
)

In [19]:
retrieved_nodes = retriever.retrieve("What are the sports teams in Toronto?")

In [20]:
print(retrieved_nodes[0].score)
print(retrieved_nodes[0].node.get_text())

10.0
Toronto ( (listen) tə-RON-toh; locally [təˈɹɒɾ̃ə] or [ˈtɹɒɾ̃ə]) is the capital city of the Canadian province of Ontario. With a recorded population of 2,794,356 in 2021, it is the most populous city in Canada and the fourth most populous city in North America. The city is the anchor of the Golden Horseshoe, an urban agglomeration of 9,765,188 people (as of 2021) surrounding the western end of Lake Ontario, while the Greater Toronto Area proper had a 2021 population of 6,712,341. Toronto is an international centre of business, finance, arts, sports and culture, and is recognized as one of the most multicultural and cosmopolitan cities in the world.Indigenous peoples have travelled through and inhabited the Toronto area, located on a broad sloping plateau interspersed with rivers, deep ravines, and urban forest, for more than 10,000 years. After the broadly disputed Toronto Purchase, when the Mississauga surrendered the area to the British Crown, the British established the town of 
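
The score of `10.0` suggests that the LLM assigns a relevance rating to each summary it selects. The sketch below shows how such an answer could be parsed, assuming the model replies with one `Doc: <number>, Relevance: <score>` entry per line. Both that answer format and the helper `parse_choices` are assumptions for illustration, not the library's own parsing function.

```python
import re
from typing import List, Tuple

# Assumed answer shape: one "Doc: <number>, Relevance: <score>" entry per line.
_CHOICE_RE = re.compile(r"Doc:\s*(\d+)\s*,\s*Relevance:\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_choices(answer: str, num_choices: int) -> List[Tuple[int, float]]:
    """Return (zero-based summary index, relevance) pairs, highest relevance first."""
    picks: List[Tuple[int, float]] = []
    for line in answer.splitlines():
        match = _CHOICE_RE.search(line)
        if match is None:
            continue  # ignore any extra text the model adds around the list
        doc_number = int(match.group(1))
        if 1 <= doc_number <= num_choices:  # drop document numbers that were never offered
            picks.append((doc_number - 1, float(match.group(2))))
    return sorted(picks, key=lambda pick: pick[1], reverse=True)


sample_answer = "Doc: 3, Relevance: 4\nDoc: 1, Relevance: 10\nDoc: 9, Relevance: 7"
choices = parse_choices(sample_answer, num_choices=5)
print(choices)  # -> [(0, 10.0), (2, 4.0)]
```

The bounds check matters because the summaries are shown to the model in batches, so a number outside the current batch cannot be mapped back to a document.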

In [26]:
# use retriever as part of a query engine
from llama_index.query_engine import RetrieverQueryEngine

# configure response synthesizer
response_synthesizer = get_response_synthesizer()

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
response = query_engine.query("What are the sports teams in Toronto?")
print(response)

Toronto is home to several major league sports teams, including the Toronto Maple Leafs in the NHL, the Toronto Blue Jays in MLB, the Toronto Raptors in the NBA, the Toronto Argonauts in the CFL, and the Toronto FC in MLS. The city also has a professional lacrosse team called the Toronto Rock and a rugby league team called the Toronto Wolfpack. Additionally, Toronto is home to the Toronto Rush, a semi-professional ultimate team that competes in the American Ultimate Disc League (AUDL). The University of Toronto, located downtown, has a rich sports history and was the site of the first recorded college football game in November 1861.


#### Embedding-based Retrieval

In [27]:
from llama_index.indices.document_summary import DocumentSummaryIndexEmbeddingRetriever

In [28]:
retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    # similarity_top_k=1,
)

In [29]:
retrieved_nodes = retriever.retrieve("What are the sports teams in Toronto?")

In [30]:
len(retrieved_nodes)

20