
# Combining Text-to-SQL with Semantic Search for `RAG` [LlamaIndex Website](https://www.llamaindex.ai/)

In this tutorial, we show you how to use our `SQLAutoVectorQueryEngine`. More info in this [blog post](https://blog.llamaindex.ai/combining-text-to-sql-with-semantic-search-for-retrieval-augmented-generation-c60af30ec3b).

- This query engine combines insights from your structured tables with your unstructured data.
- It can use both a SQL database and a vector store to answer complex natural language queries that span structured and unstructured data.
- It first decides whether to query your structured tables. If it does, it can then infer a follow-up query to the vector store to fetch the corresponding documents.
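
The decision flow described in the list above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LlamaIndex's implementation: the function `route_query` and all five callables are hypothetical stand-ins for LLM-backed steps.

```python
from typing import Callable, Optional


def route_query(
    question: str,
    needs_sql: Callable[[str], bool],
    run_sql: Callable[[str], str],
    transform: Callable[[str, str], Optional[str]],
    run_vector: Callable[[str], str],
    synthesize: Callable[[str, str, str], str],
) -> str:
    """Hypothetical control flow; each callable stands in for an LLM-backed step."""
    # Step 1: decide whether the structured table is relevant at all.
    if not needs_sql(question):
        return run_vector(question)

    # Step 2: answer the structured part of the question with SQL.
    sql_answer = run_sql(question)

    # Step 3: derive a follow-up question for the vector store, if one is needed.
    follow_up = transform(question, sql_answer)
    if follow_up is None:
        return sql_answer

    # Step 4: fetch unstructured context and combine both answers.
    vector_answer = run_vector(follow_up)
    return synthesize(question, sql_answer, vector_answer)


# Stubbed usage: no LLM, database, or vector store is involved.
demo = route_query(
    "Tell me about the arts and culture of the city with the highest population",
    needs_sql=lambda q: True,
    run_sql=lambda q: "Tokyo has the highest population.",
    transform=lambda q, sql: "Tell me about the arts and culture of Tokyo",
    run_vector=lambda q: f"[documents retrieved for: {q}]",
    synthesize=lambda q, sql, vec: f"{sql} {vec}",
)
print(demo)
```

Passing the steps in as callables keeps the sketch runnable without API keys; in the real engine each step is driven by the configured LLM.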



## Setup

In [1]:
%%capture
!pip install llama-index openai pinecone-client

In [58]:
import openai
import os

# find API key in console at https://platform.openai.com/account/api-keys

os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"
openai.api_key = os.environ["OPENAI_API_KEY"]

In [3]:
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    ServiceContext,
    StorageContext,
    SQLDatabase,
    WikipediaReader,
)

## Create Common Objects
This includes a `ServiceContext` object containing abstractions such as the LLM and chunk size. This also includes a `StorageContext` object containing our vector store abstractions.

In [47]:
# define pinecone index
import pinecone
import os

# find API key in console at https://app.pinecone.io/
os.environ['PINECONE_API_KEY'] = 'PINECONE_API_KEY'
# environment is found next to API key in the console
os.environ['PINECONE_ENVIRONMENT'] = 'asia-southeast1-gcp'

# initialize connection to pinecone
pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],
    environment=os.environ['PINECONE_ENVIRONMENT']
)

# dimensions are for text-embedding-ada-002
pinecone.create_index("quickstart", dimension=1536, metric="euclidean", pod_type="p1")

In [61]:
# list indexes
pinecone.list_indexes()

['quickstart']

In [62]:
# describe index
pinecone.describe_index("quickstart")

IndexDescription(name='quickstart', metric='euclidean', replicas=1, dimension=1536.0, shards=1, pods=1, pod_type='p1.x1', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')

In [63]:
# connect to the index
pinecone_index = pinecone.Index('quickstart')

In [64]:
from llama_index.node_parser.simple import SimpleNodeParser
from llama_index import ServiceContext
from llama_index.storage import StorageContext
from llama_index.vector_stores import PineconeVectorStore
from llama_index.text_splitter import TokenTextSplitter
from llama_index.llms import OpenAI

# define node parser and LLM
chunk_size = 1024
llm = OpenAI(temperature=0, model="gpt-3.5-turbo", streaming=True)
service_context = ServiceContext.from_defaults(chunk_size=chunk_size, llm=llm)
text_splitter = TokenTextSplitter(chunk_size=chunk_size)
node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)

# define pinecone vector index
vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index, namespace="wiki_cities"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex([], storage_context=storage_context)
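
The `chunk_size` set above controls how much text ends up in each node, which in turn determines what the retriever can return. The following is a simplified sketch of fixed-size chunking, assuming whitespace-separated words in place of real model tokens; `chunk_words` and its `overlap` parameter are hypothetical and do not reproduce `TokenTextSplitter`.

```python
def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list:
    """Split text into chunks of at most `chunk_size` words (an assumption:
    real splitters count tokenizer tokens, not whitespace words)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

Smaller chunks give more, narrower nodes; larger chunks give fewer nodes that each carry more context.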

## Create Database Schema + Test Data

In [65]:
from sqlalchemy import (
    create_engine,
    MetaData,
    Table,
    Column,
    String,
    Integer,
    select,
    column,
)

In [66]:
engine = create_engine("sqlite:///:memory:", future=True)
metadata_obj = MetaData()

In [67]:
# create city SQL table
table_name = "city_stats"
city_stats_table = Table(
    table_name,
    metadata_obj,
    Column("city_name", String(16), primary_key=True),
    Column("population", Integer),
    Column("country", String(16), nullable=False),
)

metadata_obj.create_all(engine)

In [68]:
# print tables
metadata_obj.tables.keys()

dict_keys(['city_stats'])

We insert some test data into the `city_stats` table.

In [69]:
from sqlalchemy import insert

rows = [
    {"city_name": "Toronto", "population": 2930000, "country": "Canada"},
    {"city_name": "Tokyo", "population": 13960000, "country": "Japan"},
    {"city_name": "Berlin", "population": 3645000, "country": "Germany"},
]
# insert all rows in a single transaction; engine.begin() commits on exit
with engine.begin() as connection:
    connection.execute(insert(city_stats_table), rows)

In [70]:
with engine.connect() as connection:
    cursor = connection.exec_driver_sql("SELECT * FROM city_stats")
    print(cursor.fetchall())

[('Toronto', 2930000, 'Canada'), ('Tokyo', 13960000, 'Japan'), ('Berlin', 3645000, 'Germany')]


## Load Data
- Let's use the [Wikipedia Loader](https://llamahub.ai/l/wikipedia) from LlamaHub.

In [71]:
%%capture
!pip install wikipedia

In [72]:
from llama_index import download_loader

WikipediaReader = download_loader("WikipediaReader")

loader = WikipediaReader()
cities = ["Toronto", "Berlin", "Tokyo"]
wiki_docs = loader.load_data(pages=cities)

In [73]:
len(wiki_docs)

3

## Build SQL Index

In [74]:
sql_database = SQLDatabase(engine, include_tables=["city_stats"])

In [75]:
from llama_index.indices.struct_store.sql_query import NLSQLTableQueryEngine

In [76]:
sql_query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["city_stats"],
)

## Build Vector Index

In [77]:
# Insert documents into vector index
# Each document has metadata of the city attached
for city, wiki_doc in zip(cities, wiki_docs):
    nodes = node_parser.get_nodes_from_documents([wiki_doc])
    # add metadata to each node
    for node in nodes:
        node.metadata = {"title": city}
    vector_index.insert_nodes(nodes)


## Define Query Engines, Set as Tools

In [78]:
from llama_index.query_engine import SQLAutoVectorQueryEngine, RetrieverQueryEngine
from llama_index.tools.query_engine import QueryEngineTool
from llama_index.indices.vector_store import VectorIndexAutoRetriever
from llama_index.vector_stores.types import MetadataInfo, VectorStoreInfo

In [79]:
vector_store_info = VectorStoreInfo(
    content_info="articles about different cities",
    metadata_info=[
        MetadataInfo(name="title", type="str", description="The name of the city"),
    ],
)
vector_auto_retriever = VectorIndexAutoRetriever(
    vector_index, vector_store_info=vector_store_info
)

retriever_query_engine = RetrieverQueryEngine.from_args(
    vector_auto_retriever, service_context=service_context
)

In [80]:
sql_tool = QueryEngineTool.from_defaults(
    query_engine=sql_query_engine,
    description=(
        "Useful for translating a natural language query into a SQL query over a table containing: "
        "city_stats, containing the population/country of each city"
    ),
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=retriever_query_engine,
    description="Useful for answering semantic questions about different cities",
)

## Define `SQLAutoVectorQueryEngine`

In [None]:
query_engine = SQLAutoVectorQueryEngine(
    sql_tool, vector_tool, service_context=service_context
)

## Query
**The original question, SQL query, SQL response, vector store query, and vector store response are combined into a prompt to synthesize the final answer.**
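
As a rough illustration of that synthesis step, the five pieces can be joined into a single prompt string. The helper `build_synthesis_prompt`, the labels, and the vector store values in the usage below are assumptions made for this sketch; the actual prompt template used by the library may differ.

```python
def build_synthesis_prompt(
    question: str,
    sql_query: str,
    sql_response: str,
    vector_query: str,
    vector_response: str,
) -> str:
    """Join the five inputs into one prompt (hypothetical template)."""
    sections = [
        ("Original question", question),
        ("SQL query", sql_query),
        ("SQL response", sql_response),
        ("Vector store query", vector_query),
        ("Vector store response", vector_response),
    ]
    body = "\n".join(f"{label}: {value}" for label, value in sections)
    # The trailing cue asks the LLM to write the combined answer.
    return f"{body}\nFinal answer:"


prompt = build_synthesis_prompt(
    question="Tell me about the arts and culture of the city with the highest population",
    sql_query="SELECT city_name FROM city_stats ORDER BY population DESC LIMIT 1",
    sql_response="Tokyo",
    vector_query="Tell me about the arts and culture of Tokyo",
    vector_response="(retrieved passages would go here)",
)
print(prompt)
```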

In [29]:
response = query_engine.query("Can you give me the country corresponding to each city?")
response.response

Querying SQL database: The first choice is most relevant because it mentions translating a natural language query into a SQL query over a table containing city_stats, which likely includes information about the country of each city.
SQL query: SELECT city_name, country
FROM city_stats
SQL response: The country corresponding to each city is Canada for Toronto, Japan for Tokyo, and Germany for Berlin.
Transformed query given SQL response: None

'The country corresponding to each city is Canada for Toronto, Japan for Tokyo, and Germany for Berlin.'

In [29]:
response.metadata

{'result': [('Toronto', 'Canada'), ('Tokyo', 'Japan'), ('Berlin', 'Germany')],
 'sql_query': 'SELECT city_name, country\nFROM city_stats'}

In [30]:
response = query_engine.query("Tell me about the history of Berlin")
response.response

Querying other query engine: The question is asking for information about the history of a city, which is related to answering semantic questions about different cities.
Query Engine response: Berlin has a rich and diverse history. It was first settled by Germanic tribes around 500 BC, and later became a center of Slavic settlements and castles in the 8th century AD. The earliest written records of towns in the area date back to the late 12th century, with Spandau mentioned in 1197 and Köpenick in 1209. The central part of Berlin can be traced back to two towns, Cölln and Berlin, which formed close economic and social ties. In 1307, Berlin-Cölln became the capital of the Margraviate of Brandenburg, and the Hohenzollern family ruled in Berlin until 1918. The city experienced significant devastation during the Thirty Years' War in the 17th century, but rebounded under the rule of Frederick William, known as the "Great Elector". He initiated policies to promote

'Berlin has a rich and diverse history. It was first settled by Germanic tribes around 500 BC, and later became a center of Slavic settlements and castles in the 8th century AD. The earliest written records of towns in the area date back to the late 12th century, with Spandau mentioned in 1197 and Köpenick in 1209. The central part of Berlin can be traced back to two towns, Cölln and Berlin, which formed close economic and social ties. In 1307, Berlin-Cölln became the capital of the Margraviate of Brandenburg, and the Hohenzollern family ruled in Berlin until 1918. The city experienced significant devastation during the Thirty Years\' War in the 17th century, but rebounded under the rule of Frederick William, known as the "Great Elector". He initiated policies to promote immigration and religious tolerance, leading to an influx of French Huguenots and immigrants from other regions. Berlin became the capital of the Kingdom of Prussia in 1701 and saw significant growth during the 19th ce

In [31]:
response.metadata

{'f86372a2-5739-403b-8934-5d5e65b24ff2': {'title': 'Berlin'},
 '5cb431cf-dd52-49c6-99c5-4b67e1102726': {'title': 'Berlin'}}
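
Both source nodes above carry the title `Berlin`, which is what one would expect if the auto-retriever inferred a `title` filter from the question. The sketch below shows exact-match metadata filtering in plain Python; the `Node` class and `filter_by_metadata` helper are hypothetical and only illustrate the idea, not the vector store's actual filtering.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an indexed chunk with metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(nodes, filters):
    """Keep nodes whose metadata matches every key/value pair in `filters`."""
    return [
        node for node in nodes
        if all(node.metadata.get(key) == value for key, value in filters.items())
    ]


nodes = [
    Node("placeholder text about Berlin's history", {"title": "Berlin"}),
    Node("placeholder text about Tokyo's culture", {"title": "Tokyo"}),
    Node("placeholder text about Berlin's economy", {"title": "Berlin"}),
]
# A filter an auto-retriever might infer from "Tell me about the history of Berlin".
berlin_nodes = filter_by_metadata(nodes, {"title": "Berlin"})
print([node.metadata for node in berlin_nodes])
```

In a real vector store the filter is applied alongside the similarity search, so only chunks from the matching city compete for the top results.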

In [32]:
response = query_engine.query(
    "Tell me about the arts and culture of the city with the highest population"
)

Querying SQL database: The first choice is relevant because it mentions translating a natural language query into a SQL query over a table containing city_stats, which could be used to find the city with the highest population.
SQL query: SELECT city_name, population, country
FROM city_stats
WHERE population = (SELECT MAX(population) FROM city_stats)
SQL response: The city with the highest population is Tokyo, Japan. Tokyo is known for its vibrant arts and culture scene. It is home to numerous art galleries, museums, and theaters. The city offers a wide range of cultural experiences, including traditional Japanese arts such as tea ceremonies, calligraphy, and kabuki theater. Additionally, Tokyo is a hub for contemporary art and hosts various international art festivals and exhibitions. The city's rich cultural heritage and modern artistic expressions make it a must-visit destination for art and culture enthusiasts.
Transformed query given

In [33]:
response.response

"The city with the highest population is Tokyo, Japan. Tokyo is known for its vibrant arts and culture scene. It is home to numerous art galleries, museums, and theaters. The city offers a wide range of cultural experiences, including traditional Japanese arts such as tea ceremonies, calligraphy, and kabuki theater. Additionally, Tokyo is a hub for contemporary art and hosts various international art festivals and exhibitions. The city's rich cultural heritage and modern artistic expressions make it a must-visit destination for art and culture enthusiasts."

### GPT-4 Is Needed to Query Both Tools

In [41]:
llm = OpenAI(temperature=0.7, model="gpt-4", streaming=True)
service_context = ServiceContext.from_defaults(chunk_size=chunk_size, llm=llm)

query_engine = SQLAutoVectorQueryEngine(
    sql_tool, vector_tool, service_context=service_context
)

In [42]:
response = query_engine.query(
    "Tell me about the arts and culture of the city with the highest population"
)

Querying SQL database: This choice is about city statistics, which might have information about the city with the highest population. Although it doesn't directly mention arts and culture, it's the closest match between the two choices.
SQL query: SELECT city_name, population, country
FROM city_stats
WHERE population = (SELECT MAX(population) FROM city_stats)
SQL response: The city with the highest population is Tokyo, Japan. Tokyo is known for its vibrant arts and culture scene. It is home to numerous art galleries, museums, and theaters. The city offers a wide range of cultural experiences, including traditional Japanese arts such as tea ceremonies, calligraphy, and kabuki theater. Additionally, Tokyo hosts various international art festivals and events throughout the year, attracting artists and art enthusiasts from around the world.
Transformed query given SQL response: Can you provide more information on the art festivals and events 

In [43]:
response.response

'The city with the highest population is Tokyo, Japan which is renowned for its dynamic arts and culture scene. Tokyo is home to various art galleries, museums, and theaters and offers a broad array of cultural experiences. These include traditional Japanese arts like tea ceremonies, calligraphy, and kabuki theater. Moreover, Tokyo holds different international art festivals and events annually, drawing artists and art lovers from around the globe.\n\nTokyo hosts numerous festivals and events throughout the year, including several centered on art. These events, showcasing both traditional and contemporary art forms, attract both locals and tourists. Some notable art festivals and events in Tokyo are the Sannō Festival at Hie Shrine, the Sanja Festival at Asakusa Shrine, and the biennial Kanda Festivals. The Kanda Festivals, in particular, feature a parade with elaborately decorated floats and thousands of participants. Tokyo is also famous for its cherry blossom viewing parties. These 

In [44]:
response.metadata

{'result': [('Tokyo', 13960000, 'Japan')],
 'sql_query': 'SELECT city_name, population, country\nFROM city_stats\nWHERE population = (SELECT MAX(population) FROM city_stats)',
 '6033237d-19e2-4b1f-806e-485811e1df87': {'title': 'Tokyo'},
 'b9226539-64a3-44ef-9acf-809bd3ec0249': {'title': 'Tokyo'}}
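
The `sql_query` reported in the metadata can be checked without any LLM by running it against the same three rows. This sketch uses the standard-library `sqlite3` module rather than SQLAlchemy, and recreates the `city_stats` table with the test data from earlier in the notebook.

```python
import sqlite3

# Recreate the city_stats table in a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_stats ("
    "city_name VARCHAR(16) PRIMARY KEY, "
    "population INTEGER, "
    "country VARCHAR(16) NOT NULL)"
)
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?, ?)",
    [
        ("Toronto", 2930000, "Canada"),
        ("Tokyo", 13960000, "Japan"),
        ("Berlin", 3645000, "Germany"),
    ],
)

# The query text the engine generated, copied from response.metadata.
sql_query = (
    "SELECT city_name, population, country\n"
    "FROM city_stats\n"
    "WHERE population = (SELECT MAX(population) FROM city_stats)"
)
result = conn.execute(sql_query).fetchall()
print(result)  # [('Tokyo', 13960000, 'Japan')]
conn.close()
```

The result matches the `result` entry in the metadata, so the structured half of the answer is grounded in the table rather than in the model's prior knowledge.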

In [45]:
# delete the pinecone index
pinecone.delete_index("quickstart")

## Conclusion
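- Chunking plays a vital role in retrieval quality.
- You may need to try different models to get the right answer: in this run, `gpt-3.5-turbo` returned only the SQL step's answer for the combined question, while `gpt-4` also queried the vector store.
- Working with LLMs takes trial and error.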