# Test Complex Queries over Multiple Documents (with and without Query Decomposition)

Query decomposition: the ability to rewrite a complex query as a simpler query that a given index can answer, based on the content of that index.

Use OpenAI as the LLM model and embedding model.

In [5]:
import logging
import sys

# logging.basicConfig(stream=sys.stdout, level=logging.INFO)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# Disable the logger; comment out the next two lines (and uncomment the lines above) to see log output
logger = logging.getLogger()
logger.disabled = True

In [6]:
from gpt_index import (
    GPTSimpleVectorIndex, 
    GPTSimpleKeywordTableIndex, 
    GPTListIndex, 
    SimpleDirectoryReader,
    LLMPredictor,
    ServiceContext
)
import requests

### Load Datasets

Load the Wikipedia pages for the cities listed below.

In [7]:
wiki_titles = ["Toronto", "Seattle", "San Francisco", "Chicago", "Boston", "Washington, D.C.", "Cambridge, Massachusetts", "Houston"]

In [8]:
from pathlib import Path
import requests

data_path = Path('data_wiki')

for title in wiki_titles:
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params={
            'action': 'query',
            'format': 'json',
            'titles': title,
            'prop': 'extracts',
            # 'exintro': True,
            'explaintext': True,
        }
    ).json()
    page = next(iter(response['query']['pages'].values()))
    wiki_text = page['extract']

    # create the output directory if it does not exist yet
    data_path.mkdir(exist_ok=True)

    with open(data_path / f"{title}.txt", 'w') as fp:
        fp.write(wiki_text)
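
The download loop above assumes every title resolves to a page that carries an `extract` field. Below is a minimal sketch of a more defensive extraction step, assuming the same response shape as above (`query` → `pages` → page entry → `extract`). The helper name `extract_page_text` is hypothetical and not part of the original notebook; the sample response is hand-written so the sketch runs offline.

```python
from typing import Any, Dict


def extract_page_text(response: Dict[str, Any], title: str) -> str:
    """Return the plain-text extract from a MediaWiki `action=query` response.

    Hypothetical helper: raises instead of silently writing an empty file.
    """
    pages = response.get("query", {}).get("pages", {})
    if not pages:
        raise ValueError(f"No pages returned for {title!r}")
    page = next(iter(pages.values()))
    # Titles that do not exist come back as a page entry carrying a "missing" key.
    if "missing" in page or not page.get("extract"):
        raise ValueError(f"No extract found for {title!r}")
    return page["extract"]


# Offline usage example with a hand-written response of the same shape.
sample = {"query": {"pages": {"123": {"title": "Toronto", "extract": "Toronto is ..."}}}}
print(extract_page_text(sample, "Toronto"))
```

In the loop, `wiki_text = extract_page_text(response, title)` could stand in for the two lines that read `page` and `page['extract']`.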


In [9]:
# Load all wiki documents
city_docs = {}
all_docs = []
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(input_files=[data_path / f"{wiki_title}.txt"]).load_data()
    all_docs.extend(city_docs[wiki_title])


In [11]:
# define service context
service_context = ServiceContext.from_defaults(
    chunk_size_limit=512, 
)

### Building the document indices
Build a separate vector index for each city's Wikipedia page.

We also build a "global" vector index, which ingests the documents for *all* cities.

This allows us to test different types of data structures on the same queries!

In [12]:
# Build index for each city document
city_indices = {}
index_summaries = {}
for wiki_title in wiki_titles:
    print(f"Building index for {wiki_title}")
    city_indices[wiki_title] = GPTSimpleVectorIndex.from_documents(city_docs[wiki_title], service_context=service_context)
    # set summary text for city
    index_summaries[wiki_title] = f"Wikipedia articles about {wiki_title}"
    city_indices[wiki_title].save_to_disk(f'index_{wiki_title}.json')

Building index for Toronto


INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 27286 tokens


Building index for Seattle


INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 22263 tokens


Building index for San Francisco


INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 30709 tokens


Building index for Chicago


INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 34330 tokens


Building index for Boston


INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 24499 tokens


Building index for Washington, D.C.


INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 28343 tokens


Building index for Cambridge, Massachusetts


INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 17036 tokens


Building index for Houston


INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 28795 tokens


In [35]:
# also setup a global vector index 
global_index = GPTSimpleVectorIndex.from_documents(all_docs, service_context=service_context)
global_index.save_to_disk('index_cities_global.json')

INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 213201 tokens


### Loading the indices

If the indices are already built, run these cells to load them from disk instead of rebuilding them.

In [10]:
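# If indices already saved, try loading
city_indices = {}
index_summaries = {}
for wiki_title in wiki_titles:
    city_indices[wiki_title] = GPTSimpleVectorIndex.load_from_disk(
      f'index_{wiki_title}.json', service_context=service_context
    )
    # set summary text for city (needed later when composing the graph)
    index_summaries[wiki_title] = f"Wikipedia articles about {wiki_title}"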

In [36]:
global_index = GPTSimpleVectorIndex.load_from_disk('index_cities_global.json', service_context=service_context)

### Creating the right structure to run compare/contrast queries

Our key goal in this notebook is to run compare/contrast queries between different cities.

We currently have a separate vector index for every city document. We want to set up a "graph" structure that routes each query to the right indices, so that the relevant text sections are retrieved for each city.

We compose a keyword table index on top of all the vector indices.
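
To make the routing idea concrete, here is a toy sketch of how a keyword table can map a query to sub-indices. It is an illustration only, not gpt_index's actual keyword extraction: the table, the `route_query` helper, and the shortened single-word city list are all assumptions made for this example.

```python
import re
from typing import Dict, List, Set

# Toy stand-in for the keyword table: keyword -> names of the sub-indices it points to.
# Each city contributes only its own lowercased name (single-word titles only).
toy_titles = ["Toronto", "Seattle", "Houston", "Boston"]
keyword_table: Dict[str, Set[str]] = {t.lower(): {t} for t in toy_titles}


def route_query(query: str, table: Dict[str, Set[str]]) -> List[str]:
    """Return the sub-indices whose keywords appear in the query, sorted by name."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    matched: Set[str] = set()
    for keyword, targets in table.items():
        if keyword in tokens:
            matched |= targets
    return sorted(matched)


print(route_query(
    "Compare and contrast the demographics in Seattle, Houston, and Toronto.",
    keyword_table,
))
# → ['Houston', 'Seattle', 'Toronto']
```

Words such as "compare" or "demographics" are not in the table, so only the city names select sub-indices.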

In [51]:
from gpt_index.indices.composability import ComposableGraph

In [12]:
graph = ComposableGraph.from_indices(
    GPTSimpleKeywordTableIndex,
    [index for _, index in city_indices.items()], 
    [summary for _, summary in index_summaries.items()],
    max_keywords_per_chunk=50
)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jerryliu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 0 tokens


In [13]:
# [optional] save to disk
graph.save_to_disk("index_multi_doc_graph.json")

In [14]:
# [optional] load from disk
graph = ComposableGraph.load_from_disk("index_multi_doc_graph.json")

### Define Query Transformation + Query Configs

We also define a "query decomposition" transform. Since we have a graph structure over multiple indices, query decomposition
allows us to break a complex question into a simpler question targeted at a given index.

This works well in comparing/contrasting different cities because it allows us to ask questions specific to each city.
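
The flow described above can be sketched without an LLM or an index: one simplified sub-question per city, each answered by that city's index, with the answers collected for later synthesis. This is a minimal sketch under stated assumptions; `answer_with_decomposition`, `toy_decompose`, and `toy_answer` are hypothetical names, and the two toy functions stand in for the LLM call and the per-city index query.

```python
from typing import Callable, Dict, List


def answer_with_decomposition(
    query: str,
    cities: List[str],
    decompose: Callable[[str, str], str],
    answer: Callable[[str, str], str],
) -> Dict[str, str]:
    """Ask one simplified sub-question per city and collect the answers."""
    results: Dict[str, str] = {}
    for city in cities:
        sub_query = decompose(query, city)       # an LLM call in a real pipeline
        results[city] = answer(city, sub_query)  # a query against that city's index
    return results


# Toy stand-ins so the sketch runs on its own.
def toy_decompose(query: str, city: str) -> str:
    return f"What is the population of {city}?"


def toy_answer(city: str, sub_query: str) -> str:
    return f"[{city} index] answer to: {sub_query}"


answers = answer_with_decomposition(
    "Compare and contrast the demographics in Seattle, Houston, and Toronto.",
    ["Houston", "Toronto", "Seattle"],
    toy_decompose,
    toy_answer,
)
for city, text in answers.items():
    print(city, "->", text)
```

In the notebook, the collected per-city answers are what the parent keyword table index then summarizes into a single response.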

**Query Transform**

In [50]:
from gpt_index.indices.query.query_transform.base import DecomposeQueryTransform

# use the LLM predictor held by the service context defined above
decompose_transform = DecomposeQueryTransform(
    service_context.llm_predictor, verbose=True
)

In [18]:
# set query config
query_configs = [
    {
        # config for the vector index
        "index_struct_type": "simple_dict",
        "query_mode": "default",
        "query_kwargs": {
            "similarity_top_k": 1,
            "verbose": True
        },
        # NOTE: set query transform for subindices
        "query_transform": decompose_transform
    },
    {
        # config for the keyword table index 
        "index_struct_type": "keyword_table",
        "query_mode": "simple",
        "query_kwargs": {
            "response_mode": "tree_summarize",
            "verbose": True
        },
    },
]

### Let's Run Some Queries! 

We run queries over the graph and analyze the results.

We also compare the results against the baseline global vector index. In most cases, the global vector index gives incomplete answers.

**Complex Query 1**

In [54]:
# with query decomposition in subindices
query_str = (
    "Compare and contrast the demographics in Seattle, Houston, and Toronto. "
)

In [24]:
response = graph.query(
    query_str, 
    query_configs=query_configs, 
    service_context=service_context,
)

INFO:gpt_index.indices.query.keyword_table.query:> Starting query: Compare and contrast the demographics in Seattle, Houston, and Toronto. 
INFO:gpt_index.indices.query.keyword_table.query:query keywords: ['houston', 'demographics', 'compare', 'toronto', 'seattle', 'contrast']
INFO:gpt_index.indices.query.keyword_table.query:> Extracted keywords: ['houston', 'toronto', 'seattle']
Your text contains a trailing whitespace, which has been trimmed to ensure high quality generations.


> Current query: Compare and contrast the demographics in Seattle, Houston, and Toronto. 
> New query: What is the population of Houston?
> Got node text: to build or not build via land use controls such as a zoning ordinance, and instead can only impose general floodplain regulations for enforcement during subdivision approvals and building permit a...
response:  
The population of Houston is 2,304,580.


Your text contains a trailing whitespace, which has been trimmed to ensure high quality generations.


> Current query: Compare and contrast the demographics in Seattle, Houston, and Toronto. 
> New query: What is the population of Toronto?
> Got node text: of its 2,394,205 total private dwellings, a change of 4.6% from its 2016 population of 5,928,040. With a land area of 5,902.75 km2 (2,279.06 sq mi), it had a population density of 1,050.7/km2 (2,72...
response:  
The population of Toronto is 2,394,205.


Your text contains a trailing whitespace, which has been trimmed to ensure high quality generations.


> Current query: Compare and contrast the demographics in Seattle, Houston, and Toronto. 
> New query: What is the population of Seattle?
> Got node text: shift of funding from homeless shelter beds to permanent housing.In recent years, the city has experienced steady population growth, and has been faced with the issue of accommodating more resident...
response:  
The population of Seattle is 745,000.
> Got node text: 
The population of Houston is 2,304,580....
> Got node text: 
The population of Toronto is 2,394,205....
> Got node text: 
The population of Seattle is 745,000....

In [25]:
print(str(response))


Seattle has a population of 745,000, Houston has a population of 2,304,580, and Toronto has a population of 2,394,205. Seattle has the lowest population out of the three cities mentioned.


In [55]:
response = global_index.query(query_str, similarity_top_k=3, response_mode="tree_summarize")

INFO:gpt_index.token_counter.token_counter:> [query] Total LLM token usage: 1605 tokens
INFO:gpt_index.token_counter.token_counter:> [query] Total embedding token usage: 14 tokens


In [56]:
# NOTE: the global vector index seems to provide the right results....
# BUT see below! 
print(str(response))


Seattle, Houston, and Toronto are all large cities in North America with a population of over 1 million. Seattle is the largest city in Washington with 724,745 residents. Houston is the largest city in Texas with 2,312,947 residents. Toronto is the largest city in Canada with 2,394,205 residents.

The most noticeable difference between the three cities is their population density. Toronto has the highest population density of the three with 1,050.7 people per square mile, while Seattle has 324.9 people per square mile and Houston has 277.5 people per square mile. Toronto's population is also the most diverse of the three cities. In 2021, Toronto's population was 46.6% immigrants, while Seattle's was 71.1% native-born and Houston's was 63.8% native-born.

Seattle and Houston are both located in the southern United States. Seattle is in the state of Washington on the Pacific coast, while Houston is in the state of Texas on the Gulf of Mexico. Toronto is located in southern Ontario, Cana

In [57]:
# NOTE: parts of this answer are hallucinated! The retrieved sources only reference Toronto
print(response.source_nodes[0].source_text)
print(response.source_nodes[1].source_text)

of its 2,394,205 total private dwellings, a change of 4.6% from its 2016 population of 5,928,040. With a land area of 5,902.75 km2 (2,279.06 sq mi), it had a population density of 1,050.7/km2 (2,721.4/sq mi) in 2021.In 2016, persons aged 14 years and under made up 14.5 per cent of the population, and those aged 65 years and over made up 15.6 per cent. The median age was 39.3 years. The city's gender population is 48 per cent male and 52 per cent female. Women outnumber men in all age groups 15 and older.The 2021 census reported that immigrants (individuals born outside Canada) comprise 1,286,145 persons or 46.6% of the total population of Toronto. Of the total immigrant population, the top countries of origin were Philippines (132,980 persons or 10.3%), China (129,750 persons or 10.1%), India (102,155 persons or 7.9%), Sri Lanka (47,895 persons or 3.7%), Jamaica (42,655 persons or 3.3%), Italy (37,705 persons or 2.9%), Iran (37,185 persons or 2.9%), Hong Kong (36,855 persons or 2.9%), 
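
One way to make this check less manual is to test which city names actually occur in the retrieved source texts. The sketch below is a simple substring check, assuming the source texts are available as plain strings (in the notebook they could be collected with `[n.source_text for n in response.source_nodes]`). The helper name `cities_in_sources` is hypothetical, and the sample text is a shortened excerpt of the source printed above.

```python
from typing import Dict, Iterable, List


def cities_in_sources(source_texts: Iterable[str], cities: List[str]) -> Dict[str, bool]:
    """Report, per city, whether its name appears in any retrieved source text."""
    texts = [t.lower() for t in source_texts]
    return {city: any(city.lower() in t for t in texts) for city in cities}


# Shortened excerpt of the source text printed above.
sources = [
    "The 2021 census reported that immigrants (individuals born outside Canada) "
    "comprise 1,286,145 persons or 46.6% of the total population of Toronto."
]
coverage = cities_in_sources(sources, ["Seattle", "Houston", "Toronto"])
print(coverage)
# → {'Seattle': False, 'Houston': False, 'Toronto': True}
```

A city that the answer discusses but that is marked `False` here is a sign that the claim is not grounded in the retrieved text.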

**Complex Query 2**

In [44]:
# with query decomposition
query_str = (
    "What are the basketball teams in Houston and Boston?"
)

In [23]:
response = graph.query(
    query_str, 
    query_configs=query_configs, 
    service_context=service_context,
)

INFO:gpt_index.indices.query.keyword_table.query:> Starting query: Give details about the basketball teams in Houston and Boston.
INFO:gpt_index.indices.query.keyword_table.query:query keywords: ['details', 'houston', 'give', 'teams', 'boston', 'basketball']
INFO:gpt_index.indices.query.keyword_table.query:> Extracted keywords: ['houston', 'boston']
Your text contains a trailing whitespace, which has been trimmed to ensure high quality generations.


> Current query: Give details about the basketball teams in Houston and Boston.
> New query: What is the name of the basketball team in Houston?
> Got node text: of a 30,000-ft2 (2,800 m2)in-ground facility.
The Gerald D. Hines Waterwall Park—in the Uptown District of the city—serves as a popular tourist attraction and for weddings and various celebrations....
response:  
The Houston Rockets


Your text contains a trailing whitespace, which has been trimmed to ensure high quality generations.


> Current query: Give details about the basketball teams in Houston and Boston.
> New query: What is the name of the basketball team in Boston?
> Got node text: Boston
Boston City League (high-school athletic conference)
Boston Citgo Sign
Boston nicknames
Boston–Halifax relations
List of diplomatic missions in Boston
List of people from Boston
National Reg...
response:  Boston Celtics
> Got node text: 
The Houston Rockets...
> Got node text: Boston Celtics...

In [None]:
print(str(response))

In [45]:
response = global_index.query(query_str, similarity_top_k=2, response_mode="tree_summarize")

INFO:gpt_index.token_counter.token_counter:> [query] Total LLM token usage: 928 tokens
INFO:gpt_index.token_counter.token_counter:> [query] Total embedding token usage: 10 tokens


In [46]:
print(str(response))


The answer is the Houston Rockets.


**Complex Query 3**

In [47]:
# with query decomposition
query_str = (
    "Compare and contrast the climate of Houston and Boston "
)

In [31]:
response = graph.query(
    query_str, 
    query_configs=query_configs, 
    service_context=service_context,
)

INFO:gpt_index.indices.query.keyword_table.query:> Starting query: Compare and contrast the climate of Houston and Boston 
INFO:gpt_index.indices.query.keyword_table.query:query keywords: ['houston', 'compare', 'boston', 'contrast', 'climate']
INFO:gpt_index.indices.query.keyword_table.query:> Extracted keywords: ['houston', 'boston']
Your text contains a trailing whitespace, which has been trimmed to ensure high quality generations.


> Current query: Compare and contrast the climate of Houston and Boston 
> New query: What is the average annual temperature in Houston?
> Got node text: and Galveston Bay.During the summer, temperatures reach or exceed 90 °F (32 °C) an average of 106.5 days per year, including a majority of days from June to September. Additionally, an average of 4...
response:  
The average annual temperature in Houston is 70.5 degrees Fahrenheit.


Your text contains a trailing whitespace, which has been trimmed to ensure high quality generations.


> Current query: Compare and contrast the climate of Houston and Boston 
> New query: What is the average annual temperature in Boston?
> Got node text: 2022, when the temperature reached 100 °F (38 °C). The city's average window for freezing temperatures is November 9 through April 5. Official temperature records have ranged from −18 °F (−28 °C) o...
response:  The answer is 43.6 in (1,110 mm).
The reasoning isThe relevant information to answer the above question is: The city averages 43.6 in (1,110 mm) of precipitation a year, with 49.2 in (125 cm) of snowfall per season.
So, the answer is 43.6 in ( 1,110 mm ).
> Got node text: 
The average annual temperature in Houston is 70.5 degrees Fahrenheit....
> Got node text: The answer is 43.6 in (1,110 mm).
The reasoning isThe relevant information to answer the above question is: The city averages 43.6 in (1,110 mm) of precipitation a year, with 49.2 in (12

In [32]:
print(response)


Houston has an average annual temperature of 70.5 degrees Fahrenheit and 49.2 inches of snowfall per season. Boston has an average annual temperature of 44 degrees Fahrenheit and 36.5 inches of snowfall per season.


In [48]:
response = global_index.query(query_str, similarity_top_k=2, response_mode="tree_summarize")

INFO:gpt_index.token_counter.token_counter:> [query] Total LLM token usage: 1183 tokens
INFO:gpt_index.token_counter.token_counter:> [query] Total embedding token usage: 10 tokens


In [49]:
print(str(response))


Boston has either a humid subtropical climate (Köppen Cfa) under the −3 °C (26.6 °F) isotherm or a humid continental climate under the 0 °C isotherm (Köppen Dfa). The city is best described as being in a transitional zone between the two climates. Summers are typically warm and humid, while winters are cold and stormy, with occasional periods of heavy snow. Spring and fall are usually cool to mild, with varying conditions dependent on wind direction and jet stream positioning. Prevailing wind patterns that blow offshore minimize the influence of the Atlantic Ocean. However, in winter areas near the immediate coast will often see more rain than snow as warm air is drawn off the Atlantic at times. The city lies at the transition between USDA plant hardiness zones 6b (most of the city) and 7a (Downtown, South Boston, and East Boston neighborhoods).The hottest month is July, with a mean temperature of 74.1 °F (23.4 °C). The coldest month is January, with a mean temperature of 29.9 °F (−1.