# SQL Index Guide (Core)

This is a basic guide to LlamaIndex's SQL index capabilities. We first show how to "build" a SQL index by extracting structured city/population statistics from unstructured Wikipedia articles on cities. We then show how to run text-to-SQL queries over these population statistics.

In [None]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

In [19]:
from llama_index import SimpleDirectoryReader, WikipediaReader
from IPython.display import Markdown, display

### Load Wikipedia Data

We use our `WikipediaReader` to load the Wikipedia pages for several cities.

In [None]:
# install wikipedia python package
!pip install wikipedia

In [2]:
wiki_docs = WikipediaReader().load_data(pages=['Toronto', 'Berlin', 'Tokyo'])

### Create Database Schema

We use `sqlalchemy`, a popular SQL database toolkit, to create an empty `city_stats` table.

In [3]:
from sqlalchemy import create_engine, MetaData, Table, Column, String, Integer, select, column

In [4]:
engine = create_engine("sqlite:///:memory:")
metadata_obj = MetaData()

In [5]:
# create city SQL table
table_name = "city_stats"
city_stats_table = Table(
    table_name,
    metadata_obj,
    Column("city_name", String(16), primary_key=True),
    Column("population", Integer),
    Column("country", String(16), nullable=False),
)
metadata_obj.create_all(engine)

### Build Index

Next, we build the SQL index (`GPTSQLStructStoreIndex`). Before doing so, we define the LLM to use and our `SQLDatabase` abstraction (a light wrapper around SQLAlchemy).

In [None]:
from llama_index import GPTSQLStructStoreIndex, SQLDatabase, ServiceContext
from langchain import OpenAI
from llama_index import LLMPredictor

In [None]:
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-002"))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

In [9]:
sql_database = SQLDatabase(engine, include_tables=["city_stats"])

In [10]:
sql_database.table_info

"Table 'city_stats' has columns: city_name (VARCHAR(16)), population (INTEGER), country (VARCHAR(16))."
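To make the `table_info` string less opaque, here is a minimal sketch of how a similar one-line summary could be derived from database metadata. It uses the standard-library `sqlite3` module rather than SQLAlchemy, and the DDL and the `describe_table` helper are illustrative assumptions, not the library's actual implementation.

```python
import sqlite3

# Plain-SQL approximation of the city_stats table defined earlier (assumed DDL).
DDL = """
CREATE TABLE city_stats (
    city_name VARCHAR(16) NOT NULL PRIMARY KEY,
    population INTEGER,
    country VARCHAR(16) NOT NULL
)
"""


def describe_table(conn, table_name):
    """Hypothetical helper: build a one-line schema summary for a table."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    # table_name is interpolated directly, so pass only trusted identifiers.
    rows = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    columns = ", ".join(f"{row[1]} ({row[2]})" for row in rows)
    return f"Table '{table_name}' has columns: {columns}."


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
summary = describe_table(conn, "city_stats")
print(summary)
```

A schema summary of this kind is what the text-to-SQL step can hand to the LLM so that it knows which tables and columns exist.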

In [None]:
# NOTE: table_name is the table that the structured data
# extracted from the unstructured documents is written to.
index = GPTSQLStructStoreIndex.from_documents(
    wiki_docs, 
    sql_database=sql_database, 
    table_name="city_stats",
    service_context=service_context
)

In [12]:
# view current table
stmt = select(
    city_stats_table.c["city_name", "population", "country"]
).select_from(city_stats_table)

with engine.connect() as connection:
    results = connection.execute(stmt).fetchall()
    print(results)


[('Toronto', 2731571, 'Canada'), ('Berlin', 600000, 'Germany'), ('Tokyo', 13929286, 'Japan')]
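Conceptually, building the index turns free text into typed rows. The sketch below illustrates that idea under stated assumptions: the `llm_output` string is a hypothetical stand-in for what an LLM might return for one document, and `parse_fields` is an illustrative helper. The prompt and parsing logic that `GPTSQLStructStoreIndex` actually uses are not shown in this guide and may differ.

```python
import sqlite3

# Hypothetical LLM output for one document (assumed "key: value" format).
llm_output = """city_name: Toronto
population: 2731571
country: Canada"""

# Python types used to coerce each extracted value to its column type.
COLUMN_TYPES = {"city_name": str, "population": int, "country": str}


def parse_fields(text, column_types):
    """Parse 'key: value' lines, keeping known columns and coercing types."""
    row = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or key not in column_types:
            continue  # skip lines that do not map to a column
        row[key] = column_types[key](value.strip())
    return row


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_stats ("
    "city_name VARCHAR(16) PRIMARY KEY, population INTEGER, "
    "country VARCHAR(16) NOT NULL)"
)

# Insert the parsed row using named parameters.
row = parse_fields(llm_output, COLUMN_TYPES)
conn.execute(
    "INSERT INTO city_stats (city_name, population, country) "
    "VALUES (:city_name, :population, :country)",
    row,
)
rows = conn.execute(
    "SELECT city_name, population, country FROM city_stats"
).fetchall()
print(rows)
```

Because the values come from an LLM, they are worth checking against the source documents before relying on them.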


### Query Index

We first show how we can execute a raw SQL query, which directly executes over the table.

In [15]:
query_engine = index.as_query_engine(
    query_mode="sql"
)
response = query_engine.query("SELECT city_name from city_stats")

> [query] Total LLM token usage: 0 tokens
> [query] Total embedding token usage: 0 tokens


In [14]:
display(Markdown(f"<b>{response}</b>"))

<b>[('Berlin',), ('Tokyo',), ('Toronto',)]</b>

We then show a natural language query, which is translated to a SQL query under the hood with our text-to-SQL prompt.

In [16]:
# set logging to DEBUG for more detailed output
query_engine = index.as_query_engine(
    query_mode="nl"
)
response = query_engine.query("Which city has the highest population?")

> Predicted SQL query: SELECT city_name, population
FROM city_stats
ORDER BY population DESC
LIMIT 1
> [query] Total LLM token usage: 144 tokens
> [query] Total embedding token usage: 0 tokens


In [23]:
display(Markdown(f"<b>{response}</b>"))

<b>[('Tokyo', 13929286)]</b>

In [18]:
# you can also fetch the raw result from SQLAlchemy! 
response.extra_info["result"]

[('Tokyo', 13929286)]
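The natural-language path can be pictured as three steps: build a prompt from the schema and the question, let the LLM predict a SQL query, then execute that query. The sketch below walks through those steps with the standard-library `sqlite3` module. The prompt template in `build_prompt` is hypothetical, and `fake_llm` is a stub that simply returns the predicted query from the log above, so no model is called.

```python
import sqlite3


def build_prompt(schema, question):
    """Hypothetical prompt template; the library's actual prompt may differ."""
    return (
        "Given the schema below, write a SQL query that answers the question.\n"
        f"Schema: {schema}\n"
        f"Question: {question}\n"
        "SQL:"
    )


def fake_llm(prompt):
    """Stub for the LLM call; returns the predicted query from the log above."""
    return (
        "SELECT city_name, population\n"
        "FROM city_stats\n"
        "ORDER BY population DESC\n"
        "LIMIT 1"
    )


# Recreate the table with the rows printed earlier in this guide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_stats ("
    "city_name VARCHAR(16) PRIMARY KEY, population INTEGER, "
    "country VARCHAR(16) NOT NULL)"
)
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?, ?)",
    [
        ("Toronto", 2731571, "Canada"),
        ("Berlin", 600000, "Germany"),
        ("Tokyo", 13929286, "Japan"),
    ],
)

schema = (
    "Table 'city_stats' has columns: city_name (VARCHAR(16)), "
    "population (INTEGER), country (VARCHAR(16))."
)
prompt = build_prompt(schema, "Which city has the highest population?")
predicted_sql = fake_llm(prompt)
result = conn.execute(predicted_sql).fetchall()
print(result)  # [('Tokyo', 13929286)]
```

Since the generated SQL runs directly against the database, a read-only connection or restricted database user is a sensible precaution when the query comes from a real model.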

### Using LangChain for Querying

Since our `SQLDatabase` inherits from LangChain's, you can also use LangChain itself to query the same database.

In [22]:
from langchain import OpenAI, SQLDatabase, SQLDatabaseChain

In [24]:
llm = OpenAI(temperature=0)

In [26]:
# set logging to DEBUG for more detailed output
db_chain = SQLDatabaseChain(llm=llm, database=sql_database)

In [27]:
db_chain.run("Which city has the highest population?")



> Entering new SQLDatabaseChain chain...
Which city has the highest population? 
SQLQuery: SELECT city_name FROM city_stats ORDER BY population DESC LIMIT 1;
SQLResult: [('Tokyo',)]
Answer: Tokyo has the highest population.
> Finished chain.


' Tokyo has the highest population.'