<a href="https://colab.research.google.com/github/yenlow/howsmybaby/blob/main/llama_index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook needs a `.txt` file to index. You can find the data here: https://github.com/jerryjliu/llama_index/tree/main/examples/paul_graham_essay/data

Save the files in a local directory named `data`, or the code below will not find them.

I load my OpenAI API key from a `.env` file; you can also set `OPENAI_API_KEY` directly in your environment if you'd like.
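
One way to keep the key out of the notebook is to read it from the environment and fail early when it is missing. The sketch below assumes the key is stored under `OPENAI_API_KEY`; the helper name `require_api_key` is hypothetical and not part of any library.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is unset or blank.

    `require_api_key` is a hypothetical helper for this notebook, not a library call.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell")
    return key


# Usage with a stand-in mapping, so the example runs without a real key.
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))
```

Calling `require_api_key()` with no arguments reads from `os.environ`, which is where `load_dotenv()` places the values from `.env`.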


# Set up

In [None]:
! pip install llama-index python-dotenv
from dotenv import load_dotenv
load_dotenv()

In [None]:
import logging
import os, sys
import openai
# Read the key from the environment (populated by load_dotenv above) instead of hard-coding it
openai.api_key = os.getenv("OPENAI_API_KEY")

In [None]:
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
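
In a fresh Python process, `basicConfig(stream=...)` attaches its own handler to the root logger, so adding a second `StreamHandler` on the same stream prints every record twice (notebook environments that pre-configure logging may behave differently). A small sketch on separate, hypothetical loggers that write to an in-memory buffer rather than stdout shows the effect:

```python
import io
import logging


def count_lines_logged(num_handlers):
    """Log one message through `num_handlers` handlers that share one buffer."""
    buffer = io.StringIO()
    logger = logging.getLogger(f"demo_{num_handlers}")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the root logger out of the experiment
    handlers = [logging.StreamHandler(buffer) for _ in range(num_handlers)]
    for handler in handlers:
        logger.addHandler(handler)
    logger.debug("hello")
    for handler in handlers:
        logger.removeHandler(handler)
    return len(buffer.getvalue().splitlines())


print(count_lines_logged(1), count_lines_logged(2))
```

If the duplicated DEBUG output gets noisy, keeping only one of the two logging lines is usually enough.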

# Load text documents for OpenAI to do Q&A

In [None]:
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader('data').load_data()
index = GPTVectorStoreIndex.from_documents(documents)

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
query_engine = index.as_query_engine()
response=query_engine.query("What did the author do growing up?")

In [None]:
print(response)

The author worked on writing and programming outside of school before college. They wrote short stories and tried writing programs on an IBM 1401 computer. They also built a microcomputer kit and started programming on it, writing simple games and a word processor.


In [None]:
index.storage_context.persist()

In [None]:
from llama_index import StorageContext, load_index_from_storage

# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="./storage")
# load index
index = load_index_from_storage(storage_context)

# Setup: tables and DB for Q&A

In [None]:
from llama_index.indices.struct_store.sql_query import NLSQLTableQueryEngine, SQLTableRetrieverQueryEngine
from sqlalchemy import create_engine, MetaData, Table, Column, String, Integer, select, column
from llama_index import SQLDatabase

In [None]:
engine = create_engine("sqlite:///:memory:")
metadata_obj = MetaData()

table_name = "city_stats"
city_stats_table = Table(
    table_name,
    metadata_obj,
    Column("city_name", String(16), primary_key=True, nullable=True),
    Column("population", Integer),
    Column("country", String(16), nullable=True),
)
metadata_obj.create_all(engine)

In [None]:
sql_database = SQLDatabase(engine, include_tables=[table_name])
sql_database.table_info

'\nCREATE TABLE city_stats (\n\tcity_name VARCHAR(16), \n\tpopulation INTEGER, \n\tcountry VARCHAR(16), \n\tPRIMARY KEY (city_name)\n)\n\n/*\n3 rows from city_stats table:\ncity_name\tpopulation\tcountry\n\n*/'

### Fill tables with numbers

In [None]:
from sqlalchemy import insert
rows = [
    {"city_name": "Toronto", "population": 2731571, "country": "Canada"},
    {"city_name": "Tokyo", "population": 13929286, "country": "Japan"},
    {"city_name": "Berlin", "population": 600000, "country": "Germany"},
]
# Insert all rows in one transaction; engine.begin() commits on exit
with engine.begin() as connection:
    for row in rows:
        connection.execute(insert(city_stats_table).values(**row))

with engine.connect() as connection:
    cursor = connection.exec_driver_sql("SELECT * FROM city_stats")
    print(cursor.fetchall())

[('Toronto', 2731571, 'Canada'), ('Tokyo', 13929286, 'Japan'), ('Berlin', 600000, 'Germany')]


In [None]:
query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=[table_name],
)
query_str = (
    "Which city has the highest population? List its population"
)
response = query_engine.query(query_str)
print(response)

The city with the highest population is Tokyo, with a population of 13,929,286.


In [None]:
response

Response(response='The city with the highest population is Tokyo, with a population of 13,929,286.', source_nodes=[], metadata={'result': [('Tokyo', 13929286)], 'sql_query': 'SELECT city_name, population\nFROM city_stats\nORDER BY population DESC\nLIMIT 1;'})

In [None]:
response.metadata

{'result': [('Tokyo', 13929286)],
 'sql_query': 'SELECT city_name, population\nFROM city_stats\nORDER BY population DESC\nLIMIT 1;'}
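
Because the generated SQL is returned in `response.metadata['sql_query']`, the answer can be checked without the LLM. The sketch below uses the standard-library `sqlite3` module on a separate in-memory database (not the SQLAlchemy engine above), recreates the three rows, and runs the query text shown in the output. The helper name `run_generated_sql` is hypothetical.

```python
import sqlite3

ROWS = [
    ("Toronto", 2731571, "Canada"),
    ("Tokyo", 13929286, "Japan"),
    ("Berlin", 600000, "Germany"),
]
# Query text copied from response.metadata["sql_query"] above.
GENERATED_SQL = (
    "SELECT city_name, population\n"
    "FROM city_stats\n"
    "ORDER BY population DESC\n"
    "LIMIT 1;"
)


def run_generated_sql(sql, rows=ROWS):
    """Execute `sql` against a throwaway copy of city_stats and return all rows."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute(
            "CREATE TABLE city_stats ("
            "city_name VARCHAR(16) PRIMARY KEY, "
            "population INTEGER, "
            "country VARCHAR(16))"
        )
        connection.executemany("INSERT INTO city_stats VALUES (?, ?, ?)", rows)
        return connection.execute(sql).fetchall()
    finally:
        connection.close()


print(run_generated_sql(GENERATED_SQL))  # [('Tokyo', 13929286)]
```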

### Fill table with Wikipedia info

In [None]:
from llama_index import download_loader

PubmedReader = download_loader("PubmedReader")

loader = PubmedReader()
documents = loader.load_data(search_query='bariatric glp')

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10432867&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10432813&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10425229&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10421789&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10421457&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10421342&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10420088&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10418921&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10415875&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10413159&db=pmc


In [None]:
#https://gpt-index.readthedocs.io/en/v0.6.34/guides/tutorials/sql_guide.html
#!pip install wikipedia
from llama_index import download_loader

WikipediaReader = download_loader("WikipediaReader")
wiki_docs = WikipediaReader().load_data(pages=['Singapore', 'San Francisco', 'London'])

In [None]:
from llama_index import SQLStructStoreIndex, SQLDatabase, ServiceContext
from llama_index.llms import OpenAI

# Pass the LLM object itself as `llm` (an LLMPredictor passed as `llm` raised a ValueError).
# The API key is read from the OPENAI_API_KEY environment variable.
chatgpt = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=chatgpt)

In [None]:
index = SQLStructStoreIndex.from_documents(
    wiki_docs,
    sql_database=sql_database,
    table_name=table_name
)

In [None]:
index

<llama_index.indices.struct_store.sql.SQLStructStoreIndex at 0x7ae80c6ab340>

In [None]:
stmt = select(
    city_stats_table.c["city_name", "population", "country"]
).select_from(city_stats_table)

with engine.connect() as connection:
    results = connection.execute(stmt).fetchall()
    print(results)

[(None, None, None), (None, 873965, None), (None, None, None)]
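
The rows with `None` in `city_name` are possible because SQLite, unlike most databases, accepts NULL in a non-integer `PRIMARY KEY` column unless `NOT NULL` is declared, and the `CREATE TABLE` statement printed earlier has no `NOT NULL` on `city_name`. A minimal sketch with the standard-library `sqlite3` module, using hypothetical table names `city_loose` and `city_checked`:

```python
import sqlite3


def null_primary_key_demo():
    """Return (NULL-key rows accepted without NOT NULL, whether NOT NULL rejected one)."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute(
            "CREATE TABLE city_loose (city_name VARCHAR(16) PRIMARY KEY, population INTEGER)"
        )
        connection.execute(
            "CREATE TABLE city_checked "
            "(city_name VARCHAR(16) PRIMARY KEY NOT NULL, population INTEGER)"
        )
        # NULL keys count as distinct values, so several NULL rows are accepted.
        connection.executemany(
            "INSERT INTO city_loose VALUES (?, ?)", [(None, None), (None, 873965)]
        )
        accepted = connection.execute("SELECT COUNT(*) FROM city_loose").fetchone()[0]
        try:
            connection.execute("INSERT INTO city_checked VALUES (NULL, 873965)")
            rejected = False
        except sqlite3.IntegrityError:
            rejected = True
        return accepted, rejected
    finally:
        connection.close()


print(null_primary_key_demo())  # (2, True)
```

Declaring the key column as non-nullable turns silent empty rows into an error at insert time.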


In [None]:
table_name = "city_stats2"
city_stats_table = Table(
    table_name,
    metadata_obj,
    Column("city_name", String(16), primary_key=True),
    Column("population", Integer),
    Column("country", String(16), nullable=False),
)
# create_all needs the engine (or a connection) to bind to
metadata_obj.create_all(engine)

# Scrape Wikipedia into text documents and load into Vector Store for Q&A
https://gpt-index.readthedocs.io/en/latest/examples/composable_indices/city_analysis/City_Analysis-Unified-Query.html

In [None]:
from pathlib import Path
import requests
from llama_index import (
    VectorStoreIndex,
    SimpleKeywordTableIndex,
    SimpleDirectoryReader,
    ServiceContext,
)

In [None]:
wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]
for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

    data_path = Path("data")
    data_path.mkdir(exist_ok=True)

    # Write UTF-8 explicitly so non-ASCII article text does not depend on the platform default
    with open(data_path / f"{title}.txt", "w", encoding="utf-8") as fp:
        fp.write(wiki_text)
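
The loop above assumes every title resolves to a page with an `extract` field. A defensive variant of the parsing step might look like the sketch below; `extract_page_text` is a hypothetical helper, and the sample payloads are hand-written stand-ins shaped like the response the loop indexes into, not real API output.

```python
def extract_page_text(payload):
    """Return the plain-text extract of the first page in `payload`, or None.

    Mirrors `next(iter(response["query"]["pages"].values()))["extract"]`
    but tolerates a missing page or a missing extract.
    """
    pages = payload.get("query", {}).get("pages", {})
    page = next(iter(pages.values()), None)
    if page is None:
        return None
    return page.get("extract")


# Hand-written stand-in payloads (page ids and text are illustrative only).
found = {"query": {"pages": {"123": {"title": "Toronto", "extract": "Toronto is a city."}}}}
not_found = {"query": {"pages": {"-1": {"title": "Nowhereville", "missing": ""}}}}

print(extract_page_text(found))
print(extract_page_text(not_found))
```

Skipping titles for which the helper returns `None` avoids a `KeyError` partway through the loop.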

In [None]:
wiki_text

'Houston ( (listen); HEW-stən) is the most populous city in Texas and in the Southern United States. It is the fourth-most populous city in the United States after New York City, Los Angeles, and Chicago, and the sixth-most populous city in North America. With a population of 2,304,580 in 2020, Houston is located in Southeast Texas near Galveston Bay and the Gulf of Mexico; it is the seat and largest city of Harris County and the largest principal city of the Greater Houston metropolitan area, which is the fifth-most populous metropolitan statistical area in the United States and the second-most populous in Texas after Dallas–Fort Worth. Houston is the southeast anchor of the greater megaregion known as the Texas Triangle.Comprising a land area of 640.4 square miles (1,659 km2), Houston is the ninth-most expansive city in the United States (including consolidated city-counties). It is the largest city in the United States by total area whose government is not consolidated with a county

In [None]:
city_docs = {}
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()

In [None]:
from llama_index.llms import OpenAI

chatgpt = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=chatgpt, chunk_size=1024)

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Q&A for individual docs: vector index

In [None]:
vector_indices = {}
for wiki_title in wiki_titles:
    # build vector index
    vector_indices[wiki_title] = VectorStoreIndex.from_documents(city_docs[wiki_title],
                                                                 service_context=service_context)
    # set id for vector index
    vector_indices[wiki_title].set_index_id(wiki_title)

index_summaries = {
    wiki_title: (
        f"This content contains Wikipedia articles about {wiki_title}. "
        f"Use this index if you need to lookup specific facts about {wiki_title}.\n"
        "Do not use this index if you want to analyze multiple cities."
    )
    for wiki_title in wiki_titles
}

In [None]:
query_engine = vector_indices["Toronto"].as_query_engine()
response = query_engine.query("What are the sports teams in Toronto?")
print(str(response))

The sports teams in Toronto include the Toronto Maple Leafs (NHL), Toronto Blue Jays (MLB), Toronto Raptors (NBA), Toronto Argonauts (CFL), Toronto FC (MLS), Toronto Rock (National Lacrosse League), Toronto Wolfpack (Rugby Football League), and Toronto Rush (American Ultimate Disc League).


## Q&A over multiple docs: graph index
We build a graph by composing a keyword table index on top of all the vector indices, and use it for compare/contrast queries that span several cities.
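
As a rough intuition for what the composition does (a toy stand-in, not LlamaIndex's implementation): the root level maps keywords taken from each child's summary to that child, and a query is sent to every child whose keywords it mentions. All names below are hypothetical.

```python
import re


def build_keyword_table(summaries):
    """Map each lowercase word in a child's summary to the set of child ids using it."""
    table = {}
    for child_id, summary in summaries.items():
        for word in set(re.findall(r"[a-z]+", summary.lower())):
            table.setdefault(word, set()).add(child_id)
    return table


def route(query, table):
    """Return the sorted child ids whose keywords appear in the query."""
    matched = set()
    for word in re.findall(r"[a-z]+", query.lower()):
        matched |= table.get(word, set())
    return sorted(matched)


# Toy summaries: only the city name, so shared filler words do not match every child.
summaries = {"Houston": "Houston", "Boston": "Boston", "Toronto": "Toronto"}
table = build_keyword_table(summaries)
print(route("Compare the arts and culture of Houston and Boston.", table))
```

In the real graph, each matched child answers its own sub-question and the root index summarizes the results.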


In [None]:
from llama_index.indices.composability import ComposableGraph
graph = ComposableGraph.from_indices(
    SimpleKeywordTableIndex,
    [index for _, index in vector_indices.items()],
    [summary for _, summary in index_summaries.items()],
    max_keywords_per_chunk=50,
)
# get root index
root_index = graph.get_index(graph.root_id)
# set id of root index
root_index.set_index_id("compare_contrast")

# define decompose_transform
from llama_index.indices.query.query_transform.base import DecomposeQueryTransform
from llama_index import LLMPredictor
decompose_transform = DecomposeQueryTransform(LLMPredictor(llm=chatgpt), verbose=True)

# define custom retrievers
from llama_index.query_engine.transform_query_engine import TransformQueryEngine
custom_query_engines = {}
for index in vector_indices.values():
    query_engine = index.as_query_engine(service_context=service_context)
    query_engine = TransformQueryEngine(
        query_engine,
        query_transform=decompose_transform,
        transform_metadata={"index_summary": index.index_struct.summary},
    )
    custom_query_engines[index.index_id] = query_engine
custom_query_engines[graph.root_id] = graph.root_index.as_query_engine(
    retriever_mode="simple",
    response_mode="tree_summarize",
    service_context=service_context,
    verbose=True,
)

# define graph
graph_query_engine = graph.as_query_engine(custom_query_engines=custom_query_engines)

[nltk_data] Downloading package stopwords to /tmp/llama_index...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
query_str = "Compare and contrast the arts and culture of Houston and Boston. "
response = graph_query_engine.query(query_str)
print(response)

[33;1m[1;3m> Current query: Compare and contrast the arts and culture of Houston and Boston. 
[0m[38;5;200m[1;3m> New query: What are some notable cultural institutions in Houston and Boston?
[0m[33;1m[1;3m> Current query: Compare and contrast the arts and culture of Houston and Boston. 
[0m[38;5;200m[1;3m> New query: What are some notable cultural institutions in Houston and Boston?
[0m[33;1m[1;3m> Current query: Compare and contrast the arts and culture of Houston and Boston. 
[0m[38;5;200m[1;3m> New query: What are some notable cultural institutions in Houston?
[0m[33;1m[1;3m> Current query: Compare and contrast the arts and culture of Houston and Boston. 
[0m[38;5;200m[1;3m> New query: What are some notable cultural institutions in Houston?
[0mHouston and Boston both have vibrant arts and cultural scenes. In Houston, notable cultural institutions include The Museum of Fine Arts, the Houston Museum of Natural Science, the Contemporary Arts Museum Houston, the

## Add a router that switches between the vector indices (single doc) and the graph index (multiple docs)

In [None]:
from llama_index.tools.query_engine import QueryEngineTool

query_engine_tools = []
# add vector index tools
for wiki_title in wiki_titles:
    index = vector_indices[wiki_title]
    summary = index_summaries[wiki_title]
    query_engine = index.as_query_engine(service_context=service_context)
    vector_tool = QueryEngineTool.from_defaults(query_engine, description=summary)
    query_engine_tools.append(vector_tool)

# add graph tool
graph_description = (
    "This tool contains Wikipedia articles about multiple cities. "
    "Use this tool if you want to compare multiple cities. "
)
graph_tool = QueryEngineTool.from_defaults(
    graph_query_engine,
    description=graph_description
)
query_engine_tools.append(graph_tool)

# Add router to query_engine_tools (i.e. either vector or graph indices)
from llama_index.query_engine.router_query_engine import RouterQueryEngine
#from llama_index.selectors.llm_selectors import LLMSingleSelector
# Bug in LLMSingleSelector,  use pydantic selector to avoid parsing from JSON
from llama_index.selectors.pydantic_selectors import PydanticMultiSelector, PydanticSingleSelector
router_query_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=query_engine_tools,
)
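
Conceptually, the selector reads each tool's description and picks one tool per query. In the notebook that choice is made by the LLM through `PydanticSingleSelector`; the sketch below swaps in a plain heuristic (count the city names in the question) purely to illustrate the routing decision. `pick_tool` and the tool labels are hypothetical.

```python
def pick_tool(query, city_names):
    """Pick a tool label for `query`: a heuristic stand-in for the LLM-based selector.

    Returns "graph" when the query names two or more cities, the city name
    when it names exactly one, and None when it names none.
    """
    lowered = query.lower()
    mentioned = [city for city in city_names if city.lower() in lowered]
    if len(mentioned) >= 2:
        return "graph"
    if len(mentioned) == 1:
        return mentioned[0]
    return None


cities = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]
print(pick_tool("Compare and contrast the arts and culture of Houston and Boston.", cities))
print(pick_tool("What are the sports teams in Toronto?", cities))
```

The LLM-based selector can handle phrasing this heuristic cannot, such as a comparison that never names the cities explicitly.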

### Q&A about multiple cities

In [None]:
response = router_query_engine.query("Compare and contrast the arts and culture of Houston and Boston.")
print(response)

[33;1m[1;3m> Current query: Compare and contrast the arts and culture of Houston and Boston.
[0m[38;5;200m[1;3m> New query: What are some notable cultural institutions in Houston and Boston?
[0m[33;1m[1;3m> Current query: Compare and contrast the arts and culture of Houston and Boston.
[0m[38;5;200m[1;3m> New query: What are some notable cultural institutions in Houston and Boston?
[0m[33;1m[1;3m> Current query: Compare and contrast the arts and culture of Houston and Boston.
[0m[38;5;200m[1;3m> New query: What are some notable cultural institutions in Houston?
[0m[33;1m[1;3m> Current query: Compare and contrast the arts and culture of Houston and Boston.
[0m[38;5;200m[1;3m> New query: What are some notable cultural institutions in Houston?
[0mHouston and Boston both have vibrant arts and cultural scenes. In Houston, notable cultural institutions include The Museum of Fine Arts, the Houston Museum of Natural Science, the Contemporary Arts Museum Houston, the Sta

### Q&A about single city

In [None]:
response = router_query_engine.query("What are the sports teams in Toronto?")
print(response)

The sports teams in Toronto include the Toronto Maple Leafs (NHL), Toronto Blue Jays (MLB), Toronto Raptors (NBA), Toronto Argonauts (CFL), Toronto FC (MLS), Toronto Rock (National Lacrosse League), Toronto Wolfpack (Rugby Football League), and Toronto Rush (American Ultimate Disc League).
