<a href="https://colab.research.google.com/github/smytjf11/BMIER_project/blob/main/Llama_Index_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Some packages come pre-installed; others we need to install ourselves.

Source: https://gpt-index.readthedocs.io/en/latest/guides/sql_guide.html

In [None]:
!pip install llama_index

This prompts for an OpenAI API key, since Llama Index uses GPT-3 by default. A key is easy to get, but repeated experimentation can be costly. There are free alternatives we can explore for evaluation purposes.

In [3]:
import getpass
import os
os.environ['OPENAI_API_KEY'] = getpass.getpass(prompt='Enter your API Key: ')


Enter your API Key: ··········
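
To avoid typing the key on every run, one option is to reuse a key that is already in the environment and prompt only when it is missing. This is a minimal sketch: `ensure_api_key` is a hypothetical helper written for this notebook, not part of Llama Index or the OpenAI client.

```python
import getpass
import os


def ensure_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass.getpass, env=None):
    """Return the key stored under var_name, prompting only if it is missing.

    Hypothetical helper: env defaults to os.environ and prompt_fn to
    getpass.getpass, but both can be swapped out (for example in tests).
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        # Nothing usable in the environment, so ask without echoing the input.
        key = prompt_fn(prompt="Enter your API Key: ").strip()
        if not key:
            raise ValueError(f"{var_name} was not provided")
        env[var_name] = key
    return key
```

Calling `ensure_api_key()` in place of the direct assignment above keeps the same behavior on a fresh runtime and skips the prompt when the variable is already set.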


Llama Index has some built-in scraping, so we will want to explore which sources we can take advantage of natively. We can also provide our own document store if we need to.

Notes on Wikipedia Reader: 

In [22]:
from llama_index import download_loader

WikipediaReader = download_loader("WikipediaReader")
wiki_docs = WikipediaReader().load_data(pages=['Toronto', 'Berlin', 'Tokyo', 'Indianapolis', 'Fort Wayne, Indiana', 'Kosciusko County, Indiana'])

In [34]:
wiki_docs

[Document(text='Toronto ( (listen) tə-RON-toh; locally [təˈɹɒɾ̃ə] or [ˈtɹɒɾ̃ə]) is the capital city of the Canadian province of Ontario. With a recorded population of 2,794,356 in 2021, it is the most populous city in Canada and the fourth most populous city in North America. The city is the anchor of the Golden Horseshoe, an urban agglomeration of 9,765,188 people (as of 2021) surrounding the western end of Lake Ontario, while the Greater Toronto Area proper had a 2021 population of 6,712,341. Toronto is an international centre of business, finance, arts, sports and culture, and is recognized as one of the most multicultural and cosmopolitan cities in the world.Indigenous peoples have travelled through and inhabited the Toronto area, located on a broad sloping plateau interspersed with rivers, deep ravines, and urban forest, for more than 10,000 years. After the broadly disputed Toronto Purchase, when the Mississauga surrendered the area to the British Crown, the British established t

Here we instantiate an in-memory database (for the purposes of this example). We will want to start from here and construct our own database with the relevant information we have from the Declarations of Disaster dataset. That will involve connecting to the GCP instance and importing that data as the foundation of the database.

In [35]:
from sqlalchemy import create_engine, MetaData, Table, Column, String, Integer, select, column

# in-memory SQLite database; nothing persists once the kernel stops
engine = create_engine("sqlite:///:memory:")
# keep the metadata unbound and pass the engine explicitly to create_all();
# MetaData(bind=...) was removed in SQLAlchemy 2.0
metadata_obj = MetaData()

# create city SQL table
table_name = "city_stats"
city_stats_table = Table(
    table_name,
    metadata_obj,
    Column("city_name", String(50), primary_key=True),
    Column("population", Integer),
    Column("country", String(50), nullable=False),
    Column("date_founded", String(16), nullable=True),
    Column("current_ruler", String(50), nullable=True),
    Column("original_name", String(50), nullable=True),
)
# emit CREATE TABLE for every table registered on metadata_obj
metadata_obj.create_all(engine)

Once this is done, we can check on what we created:

* A Table called "city_stats"
* The appropriate Metadata attached to it
* No other data
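
The three points above can be checked directly. The sketch below is a stand-in rather than the notebook's own code: it uses the standard library `sqlite3` module instead of the SQLAlchemy engine, so it recreates the same `city_stats` schema in a fresh in-memory database and then inspects it.

```python
import sqlite3

# Stand-in for the notebook's engine: a separate in-memory SQLite database
# with the same city_stats schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE city_stats (
        city_name VARCHAR(50) PRIMARY KEY,
        population INTEGER,
        country VARCHAR(50) NOT NULL,
        date_founded VARCHAR(16),
        current_ruler VARCHAR(50),
        original_name VARCHAR(50)
    )
    """
)

# sqlite_master lists every schema object; keep only the tables.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# PRAGMA table_info yields one row per column:
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(city_stats)")]

# A freshly created table should hold no rows.
row_count = conn.execute("SELECT COUNT(*) FROM city_stats").fetchone()[0]

print(tables)
print(columns)
print(row_count)
```

Against the notebook's own engine, SQLAlchemy's `inspect(engine)` offers the same kind of information through `get_table_names()` and `get_columns()`.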

In [24]:
from llama_index import GPTSQLStructStoreIndex, SQLDatabase

sql_database = SQLDatabase(engine, include_tables=["city_stats"])
# NOTE: the table_name specified here is the table that you
# want to extract into from unstructured documents.
index = GPTSQLStructStoreIndex(
    wiki_docs, 
    sql_database=sql_database, 
    table_name="city_stats",
)

Now that we've populated the table, let's take a look at the results:

In [25]:
import pandas as pd

# view current table
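# pass the columns positionally; the list form select([...]) is legacy
# and no longer accepted by SQLAlchemy 2.0
stmt = select(
    column("city_name"), column("population"), column("country"), column("date_founded"), column("current_ruler")
).select_from(city_stats_table)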

with engine.connect() as connection:
    results = connection.execute(stmt).fetchall()
    print(pd.DataFrame(results))

      city_name  population  country date_founded     current_ruler
0       Toronto   2794356.0   Canada         1793          JohnTory
1        Berlin    600000.0  Germany         1237                NA
2         Tokyo         NaN    Japan         1889  ShintaroIshihara
3  Indianapolis    332199.0       US         1855        JoeHogsett
4     FortWayne    180637.0      USA         1905         EricLahey
5        Warsaw     80240.0       US         1836              None


So the extracted details clearly aren't great (Tokyo's population is missing, and the country appears as both "US" and "USA"), but we can work on that with context and prompt engineering. What's really impressive is its ability to intuit what you want from a natural language query.
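
Separately from prompt engineering, some of the inconsistencies in the extracted rows can be smoothed over after the fact. The sketch below is plain post-processing, not a Llama Index feature; the alias map, the missing-value markers, and the helper names are all assumptions based only on the six rows shown above.

```python
import re

# Assumed mappings, derived only from the rows printed above.
COUNTRY_ALIASES = {"US": "USA"}
MISSING_MARKERS = {"", "NA", "N/A", "None"}


def clean_text(value):
    """Map placeholder strings such as 'NA' to None; strip whitespace otherwise."""
    if value is None:
        return None
    value = str(value).strip()
    return None if value in MISSING_MARKERS else value


def split_camel_case(value):
    """Insert a space at each lowercase-to-uppercase boundary."""
    if value is None:
        return None
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", value)


def normalize_row(row):
    """Return a cleaned copy of one city_stats row given as a dict."""
    cleaned = dict(row)
    country = clean_text(row.get("country"))
    cleaned["country"] = COUNTRY_ALIASES.get(country, country)
    cleaned["city_name"] = split_camel_case(clean_text(row.get("city_name")))
    cleaned["current_ruler"] = split_camel_case(clean_text(row.get("current_ruler")))
    return cleaned


print(normalize_row({"city_name": "FortWayne", "country": "US", "current_ruler": "EricLahey"}))
```

The camel-case split is a blunt heuristic: it would also break a name like "McDonald" into two words, so it suits a quick evaluation rather than production cleanup.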

In [30]:
# set Logging to DEBUG for more detailed outputs
response = index.query("Which city has the highest population?", mode="default")
print(response)

[('Toronto', 2794356)]
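
The shape of that result (a list of row tuples) suggests the question was turned into a SQL statement and run against `city_stats`. The notebook does not show the generated SQL, so the statement below is an assumed translation, run with the standard library `sqlite3` module on rows copied from the table printed earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_stats (city_name TEXT PRIMARY KEY, population INTEGER)")

# Rows copied from the DataFrame output above; Tokyo's population was missing.
rows = [
    ("Toronto", 2794356),
    ("Berlin", 600000),
    ("Tokyo", None),
    ("Indianapolis", 332199),
    ("FortWayne", 180637),
    ("Warsaw", 80240),
]
conn.executemany("INSERT INTO city_stats VALUES (?, ?)", rows)

# Assumed SQL for "Which city has the highest population?".
# SQLite sorts NULL below every number, so with DESC Tokyo ends up last.
highest = conn.execute(
    "SELECT city_name, population FROM city_stats "
    "ORDER BY population DESC LIMIT 1"
).fetchall()
print(highest)
```

Note that the answer depends entirely on what was extracted: because Tokyo's population is NULL, it can never win this query, whatever its real population.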


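It can't infer *too* much, but it's good enough for the moment. The table holds no geographic data, so the answer to the next query is not grounded in anything we stored.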

In [31]:
# set Logging to DEBUG for more detailed outputs
response = index.query("Which city is closest to London?", mode="default")
print(response)

[('Toronto',)]


In [32]:
# set Logging to DEBUG for more detailed outputs
response = index.query("Which city is oldest?", mode="default")
print(response)

[('Berlin', '1237')]


In [33]:
# set Logging to DEBUG for more detailed outputs
response = index.query("How many cities are more than 200 years old?", mode="default")
print(response)

[(2,)]


In [36]:
# set Logging to DEBUG for more detailed outputs
response = index.query("Which cities are more than 200 years old?", mode="default")
print(response)

[('Toronto', '1793'), ('Berlin', '1237')]
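
The two age questions differ only in whether they count rows or list them. As a rough sketch of what equivalent SQL could look like (the statements actually generated are not shown in the notebook, and `REFERENCE_YEAR` is an assumption), the following uses `sqlite3` with the `date_founded` values from the table above. Because that column is stored as a string, it is cast to an integer before comparing.

```python
import sqlite3

REFERENCE_YEAR = 2023  # assumption: the year in which "200 years old" is evaluated

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_stats (city_name TEXT PRIMARY KEY, date_founded TEXT)")

# Founding years copied from the DataFrame output above.
rows = [
    ("Toronto", "1793"),
    ("Berlin", "1237"),
    ("Tokyo", "1889"),
    ("Indianapolis", "1855"),
    ("FortWayne", "1905"),
    ("Warsaw", "1836"),
]
conn.executemany("INSERT INTO city_stats VALUES (?, ?)", rows)

# A city is more than 200 years old if it was founded before this year.
cutoff = REFERENCE_YEAR - 200

# "How many cities ..." maps to an aggregate.
count = conn.execute(
    "SELECT COUNT(*) FROM city_stats WHERE CAST(date_founded AS INTEGER) < ?",
    (cutoff,),
).fetchall()

# "Which cities ..." maps to the same filter without the aggregate.
names = conn.execute(
    "SELECT city_name, date_founded FROM city_stats "
    "WHERE CAST(date_founded AS INTEGER) < ?",
    (cutoff,),
).fetchall()

print(count)
print(names)
```

With these rows, both statements select the same two cities as the responses above, which is a quick sanity check that the natural language answers are consistent with the stored data.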
