# Composable Indices Demo

In [1]:
import logging
import sys
import weaviate
from pprint import pprint

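# basicConfig installs one stdout handler and the StreamHandler below adds a second,
# so each log message in the outputs appears twice (with and without the "INFO:root:" prefix)
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))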

from gpt_index import (
    GPTSimpleVectorIndex, 
    GPTSimpleKeywordTableIndex, 
    GPTListIndex, 
    GPTWeaviateIndex,
    SimpleDirectoryReader
)
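
The log outputs later in this notebook show every message twice, once with an `INFO:root:` prefix and once without. The sketch below reproduces that effect with the same two logging calls, using in-memory streams in place of `sys.stdout` so the two handlers can be inspected separately. The names `formatted` and `plain` are illustrative only.

```python
import io
import logging

root = logging.getLogger()
# Start from a clean root logger so the sketch behaves the same on re-runs.
for handler in list(root.handlers):
    root.removeHandler(handler)

formatted = io.StringIO()  # stands in for the stream passed to basicConfig
plain = io.StringIO()      # stands in for the extra StreamHandler's stream

# Handler 1: added by basicConfig, uses the default "LEVEL:name:message" format.
logging.basicConfig(stream=formatted, level=logging.INFO)
# Handler 2: added by hand, has no formatter, so it emits the bare message.
root.addHandler(logging.StreamHandler(stream=plain))

logging.info("> Starting query")

print(repr(formatted.getvalue()))
print(repr(plain.getvalue()))
```

Dropping either of the two calls leaves a single handler and therefore a single copy of each message.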

In [2]:
resource_owner_config = weaviate.AuthClientPassword(
  username = "<username>", 
  password = "<password>", 
)

In [3]:
client = weaviate.Client("https://test-weaviate-cluster.semi.network/", auth_client_secret=resource_owner_config)

In [4]:
# [optional] set batch
client.batch.configure(batch_size=10)

<weaviate.batch.crud_batch.Batch at 0x135b98310>

### Load Datasets

Load the NYC Wikipedia page and Paul Graham's essay "What I Worked On".

In [7]:
# fetch "New York City" page from Wikipedia
from pathlib import Path

import requests
response = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'query',
        'format': 'json',
        'titles': 'New York City',
        'prop': 'extracts',
        # 'exintro': True,
        'explaintext': True,
    }
).json()
page = next(iter(response['query']['pages'].values()))
nyc_text = page['extract']

# write the page text into the directory that the next cell reads from
data_path = Path('../test_wiki/data')
data_path.mkdir(parents=True, exist_ok=True)

with open(data_path / 'nyc_text.txt', 'w', encoding='utf-8') as fp:
    fp.write(nyc_text)
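
The cell above takes the first value of `response['query']['pages']` because the MediaWiki JSON response keys pages by page ID, which is not known before the request is made. The sketch below shows that lookup on a hand-written sample payload. The helper name `first_extract`, the page ID, and the extract text are placeholders for illustration, not real API output.

```python
def first_extract(payload: dict) -> str:
    """Return the plain-text extract of the first page in a MediaWiki query payload."""
    pages = payload["query"]["pages"]
    # Pages are keyed by page ID, so take the first (here, the only) value.
    page = next(iter(pages.values()))
    if "extract" not in page:
        # A missing page has no extract; fail loudly instead of writing an empty file.
        raise KeyError(f"no extract returned for page {page.get('title')!r}")
    return page["extract"]


# Hand-written sample in the shape the notebook cell indexes into.
sample = {
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "title": "New York City",
                "extract": "placeholder text",
            }
        }
    }
}

print(first_extract(sample))
```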

In [8]:
# load NYC dataset
nyc_documents = SimpleDirectoryReader('../test_wiki/data/').load_data()

In [9]:
# load PG's essay
essay_documents = SimpleDirectoryReader('../paul_graham_essay/data/').load_data()

### Building the document indices
Build a Weaviate-backed vector index for the NYC wiki page and another for the PG essay.

In [10]:
# build NYC index
nyc_index = GPTWeaviateIndex(nyc_documents, weaviate_client=client, class_prefix="Nyc_docs")

INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 28228 tokens
> [build_index_from_documents] Total embedding token usage: 28228 tokens


In [11]:
nyc_index.save_to_disk('index_nyc.json')

In [12]:
# build essay index
essay_index = GPTWeaviateIndex(essay_documents, weaviate_client=client, class_prefix="Essay_docs")

INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 17598 tokens
> [build_index_from_documents] Total embedding token usage: 17598 tokens


In [13]:
essay_index.save_to_disk('index_pg.json')

### Loading the indices
Reload both indices from the JSON files saved above, reusing the same Weaviate client.

In [14]:
# try loading
nyc_index = GPTWeaviateIndex.load_from_disk('index_nyc.json', weaviate_client=client)
essay_index = GPTWeaviateIndex.load_from_disk('index_pg.json', weaviate_client=client)

### Set summaries for the indices

Add a text summary to each index so that other indices can be composed on top of them.

In [15]:
nyc_index.set_text("""
    New York, often called New York City or NYC, 
    is the most populous city in the United States. 
    With a 2020 population of 8,804,190 distributed over 300.46 square miles (778.2 km2), 
    New York City is also the most densely populated major city in the United States, 
    and is more than twice as populous as second-place Los Angeles. 
    New York City lies at the southern tip of New York State, and 
    constitutes the geographical and demographic center of both the 
    Northeast megalopolis and the New York metropolitan area, the 
    largest metropolitan area in the world by urban landmass. With over 
    20.1 million people in its metropolitan statistical area and 23.5 million 
    in its combined statistical area as of 2020, New York is one of the world's 
    most populous megacities, and over 58 million people live within 250 mi (400 km) of 
    the city. New York City is a global cultural, financial, and media center with 
    a significant influence on commerce, health care and life sciences, entertainment, 
    research, technology, education, politics, tourism, dining, art, fashion, and sports. 
    Home to the headquarters of the United Nations, 
    New York is an important center for international diplomacy,
    an established safe haven for global investors, and is sometimes described as the capital of the world.
""") 
nyc_index.set_doc_id("nyc_index")
essay_index.set_text("""
    Author: Paul Graham. 
    The author grew up painting and writing essays. 
    He wrote a book on Lisp and did freelance Lisp hacking work to support himself. 
    He also became the de facto studio assistant for Idelle Weber, an early photorealist painter. 
    He eventually had the idea to start a company to put art galleries online, but the idea was unsuccessful. 
    He then had the idea to write software to build online stores, which became the basis for his successful company, Viaweb. 
    After Viaweb was acquired by Yahoo!, the author returned to painting and started writing essays online. 
    He wrote a book of essays, Hackers & Painters, and worked on spam filters. 
    He also bought a building in Cambridge to use as an office. 
    He then had the idea to start Y Combinator, an investment firm that would 
    make a larger number of smaller investments and help founders remain as CEO. 
    He and his partner Jessica Livingston ran Y Combinator and funded a batch of startups twice a year. 
    He also continued to write essays, cook for groups of friends, and explore the concept of invented vs discovered in software. 

""")
essay_index.set_doc_id("essay_index")

### Build Keyword Table Index on top of vector indices! 

We set a summary for each of the NYC and essay indices above; now we compose a keyword table index on top of them.
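
To make the role of the summaries concrete, here is a toy sketch of keyword-based routing: keywords are extracted from each sub-index summary, and a query is sent to the sub-indices whose summaries share keywords with it. This is not gpt_index's implementation. The stopword list, the shortened summaries, and the helpers `extract_keywords`, `build_keyword_table`, and `route` are all assumptions made for illustration.

```python
from __future__ import annotations

import re
from collections import defaultdict

# Tiny illustrative stopword list; a real extractor would use a much larger one.
STOPWORDS = {"the", "is", "of", "a", "an", "and", "in", "it", "what", "how", "to", "or"}


def extract_keywords(text: str) -> set[str]:
    """Lowercase word tokens minus stopwords (a crude stand-in for a real extractor)."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def build_keyword_table(summaries: dict[str, str]) -> dict[str, set[str]]:
    """Map each keyword to the IDs of the sub-indices whose summary mentions it."""
    table = defaultdict(set)
    for index_id, summary in summaries.items():
        for keyword in extract_keywords(summary):
            table[keyword].add(index_id)
    return dict(table)


def route(query: str, table: dict[str, set[str]]) -> list[str]:
    """Return sub-index IDs ordered by how many query keywords hit their summary."""
    hits = defaultdict(int)
    for keyword in extract_keywords(query):
        for index_id in table.get(keyword, ()):
            hits[index_id] += 1
    return sorted(hits, key=lambda index_id: (-hits[index_id], index_id))


# Shortened stand-ins for the summaries set on the two indices.
summaries = {
    "nyc_index": "New York City is the most populous city in the United States.",
    "essay_index": "The author grew up painting and writing essays, then started Y Combinator.",
}
table = build_keyword_table(summaries)

print(route("What is the weather of New York City like?", table))
print(route("What did the author do before Y Combinator?", table))
```

In this sketch a query that shares no keyword with any summary is routed nowhere, which is one reason the summaries need to mention the terms users are likely to ask about.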

In [16]:
# set query config
query_configs = [
    {
        "index_struct_id": "nyc_index",
        "index_struct_type": "dict",
        "query_mode": "default",
        "query_kwargs": {
            "similarity_top_k": 1,
            "weaviate_client": client,
            "class_prefix": "Nyc_docs"
        }
    },
    {
        "index_struct_id": "essay_index",
        "index_struct_type": "dict",
        "query_mode": "default",
        "query_kwargs": {
            "similarity_top_k": 1,
            "weaviate_client": client,
            "class_prefix": "Essay_docs"
        }
    },
    {
        "index_struct_type": "keyword_table",
        "query_mode": "simple",
        "query_kwargs": {}
    },
]
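
Two of the entries above carry an `index_struct_id` and the third carries only an `index_struct_type`. Both Weaviate indices share the type `dict` but need different `class_prefix` values, so the ID is what tells them apart. The sketch below shows one plausible lookup order, an exact ID match first and a type match as fallback. That order is an assumption for illustration, not taken from the gpt_index source, and `resolve_config` is a hypothetical helper.

```python
from typing import Optional


def resolve_config(configs: list, struct_id: Optional[str], struct_type: str) -> dict:
    """Pick the config for one index: exact ID match first, then type match.

    Assumed lookup order, for illustration only.
    """
    for config in configs:
        if struct_id is not None and config.get("index_struct_id") == struct_id:
            return config
    for config in configs:
        # Only configs without an ID act as type-wide defaults.
        if "index_struct_id" not in config and config["index_struct_type"] == struct_type:
            return config
    raise KeyError(f"no query config for id={struct_id!r}, type={struct_type!r}")


# Trimmed copies of the configs above (the Weaviate client objects are omitted).
configs = [
    {"index_struct_id": "nyc_index", "index_struct_type": "dict",
     "query_mode": "default",
     "query_kwargs": {"similarity_top_k": 1, "class_prefix": "Nyc_docs"}},
    {"index_struct_id": "essay_index", "index_struct_type": "dict",
     "query_mode": "default",
     "query_kwargs": {"similarity_top_k": 1, "class_prefix": "Essay_docs"}},
    {"index_struct_type": "keyword_table", "query_mode": "simple", "query_kwargs": {}},
]

print(resolve_config(configs, "essay_index", "dict")["query_kwargs"]["class_prefix"])
print(resolve_config(configs, None, "keyword_table")["query_mode"])
```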

In [17]:
keyword_table = GPTSimpleKeywordTableIndex([nyc_index, essay_index], max_keywords_per_chunk=50)

INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 0 tokens
> [build_index_from_documents] Total embedding token usage: 0 tokens


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jerryliu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Define Graph

In [18]:
from gpt_index.composability import ComposableGraph

In [19]:
graph = ComposableGraph.build_from_index(keyword_table)

In [20]:
# [optional] save to disk
graph.save_to_disk("index_graph.json")

In [21]:
# [optional] load from disk
graph = ComposableGraph.load_from_disk("index_graph.json")

In [28]:
# set the logging level to DEBUG for more detailed output
# ask a question about NYC
response = graph.query(
    "What is the weather of New York City like? How cold is it during the winter?", 
    query_configs=query_configs
)

INFO:root:> Starting query: What is the weather of New York City like? How cold is it during the winter?
> Starting query: What is the weather of New York City like? How cold is it during the winter?
INFO:root:query keywords: ['cold', 'winter', 'new', 'weather', 'york', 'like', 'city']
query keywords: ['cold', 'winter', 'new', 'weather', 'york', 'like', 'city']
INFO:root:> Extracted keywords: ['new', 'york', 'city']
> Extracted keywords: ['new', 'york', 'city']
INFO:root:> [query] Total LLM token usage: 3852 tokens
> [query] Total LLM token usage: 3852 tokens
INFO:root:> [query] Total embedding token usage: 18 tokens
> [query] Total embedding token usage: 18 tokens
INFO:root:> [query] Total LLM token usage: 3852 tokens
> [query] Total LLM token usage: 3852 tokens
INFO:root:> [query] Total embedding token usage: 18 tokens
> [query] Total embedding token usage: 18 tokens


In [29]:
print(str(response))



New York City has a humid subtropical climate, with hot and humid summers and cool, damp winters. The daily mean temperature in January, the area's coldest month, is 33.3 °F (0.7 °C). Temperatures usually drop to 10 °F (−12 °C) several times per winter, yet can also reach 60 °F (16 °C) for several days even in the coldest winter month. The city of New York has a complex park system, with various lands operated by the National Park Service, the New York State Office of Parks, Recreation and Historic Preservation, and the New York City Department of Parks and Recreation. In 2021, the New York City Council banned the use of synthetic pesticides by city agencies and instead required organic lawn management.


In [30]:
# Get source of response
print(response.get_formatted_sources())

> Source (Doc id: nyc_index): 
    New York, often called New York City or NYC, 
    is the most populous city in the United St...

> Source (Doc id: 9bc9fc8c-79a8-42d2-8cde-c6033ef2f8ac): of the city is land and 165.841 sq mi (429.53 km2) of this is water. The highest point in the cit...


In [25]:
# ask it a question about PG's essay
response = graph.query(
    "What did the author do growing up, before his time at Y Combinator?", 
    query_configs=query_configs
)

INFO:root:> Starting query: What did the author do growing up, before his time at Y Combinator?
> Starting query: What did the author do growing up, before his time at Y Combinator?
INFO:root:query keywords: ['combinator', 'growing', 'author', 'time']
query keywords: ['combinator', 'growing', 'author', 'time']
INFO:root:> Extracted keywords: ['combinator', 'author']
> Extracted keywords: ['combinator', 'author']
INFO:root:> [query] Total LLM token usage: 3879 tokens
> [query] Total LLM token usage: 3879 tokens
INFO:root:> [query] Total embedding token usage: 17 tokens
> [query] Total embedding token usage: 17 tokens
INFO:root:> [query] Total LLM token usage: 3879 tokens
> [query] Total LLM token usage: 3879 tokens
INFO:root:> [query] Total embedding token usage: 17 tokens
> [query] Total embedding token usage: 17 tokens


In [26]:
print(str(response))



The author grew up in England and attended college in the United States. He studied computer science and wrote software in Lisp. He also painted and wrote essays, which he published online. After college, he worked at a software company called Interleaf and then co-founded a startup called Viaweb. He also wrote essays and worked on a project to make a programming language in itself. Through this experience, he learned the importance of customs and how they can continue to constrain you even after the restrictions that caused them have disappeared.


In [27]:
# Get source of response
print(response.get_formatted_sources())

> Source (Doc id: essay_index): 
    Author: Paul Graham. 
    The author grew up painting and writing essays. 
    He wrote a bo...

> Source (Doc id: b5964813-f793-4771-b06f-731ad293440f): chance it had to do with HN, and a 40% chance it had do with everything else combined. [17]

As w...
