Ask your database questions in plain English and get accurate answers – no SQL knowledge required. This project converts natural language questions into executable SQL queries using LlamaIndex and GPT-4o-mini, demonstrating two production-ready approaches for any database scale.
- What is this project?
- Why Text-to-SQL?
- Two Approaches – Simple vs Scalable
- How it works – Full Pipeline
- Flowcharts
- Database Schema
- Project Structure
- Tech Stack & Tools
- Component Deep Dive
- SQLAlchemy Fundamentals
- Setup & Installation
- Running on Google Colab
- Configuration & Customization
- Sample Outputs
- Limitations
- How Limitations Can Be Resolved
- Key Concepts for Beginners
- Text-to-SQL vs Manual SQL vs ORM
- Real-World Use Cases
- What to Build Next
- Contributing
This project builds a Text-to-SQL system using LlamaIndex – an AI pipeline that takes a natural language question, automatically generates the correct SQL query, executes it against a real database, and returns a human-readable answer.
It uses a SQLite employee database as the demonstration data, with 5 employees across various roles and salaries. The project demonstrates two distinct approaches:
- `NLSQLTableQueryEngine` – Simple, direct approach where the full table schema is injected into the LLM's context. Best for small databases with few tables.
- `SQLTableRetrieverQueryEngine` – Scalable approach where table schemas are stored in a `VectorStoreIndex` and retrieved semantically at query time. Best for large databases with many tables.
```
"How many employees have salary greater than 70000?"
                        │
                        ▼
LLM generates: SELECT COUNT(*) FROM employee WHERE salary > 70000
                        │
                        ▼
SQLAlchemy executes the query on SQLite
                        │
                        ▼
"There are 3 employees with a salary greater than $70,000."
```
No SQL expertise required from the user. The entire translation is handled by the AI.
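The four-step flow above can be sketched end to end with Python's built-in `sqlite3` and a stub in place of the model call – `fake_llm_sql` below is a hypothetical stand-in for the real LLM, not part of the project:

```python
import sqlite3

def fake_llm_sql(question: str) -> str:
    """Hypothetical stand-in for the LLM's SQL-generation call."""
    return "SELECT COUNT(*) FROM employee WHERE salary > 70000"

# In-memory database with the same sample data as the project
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (first_name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Alice", 85000), ("Bob", 120000), ("Charlie", 95000),
     ("Diana", 70000), ("Evan", 65000)],
)

# Generate -> execute -> synthesize
sql = fake_llm_sql("How many employees have salary greater than 70000?")
count = conn.execute(sql).fetchone()[0]
answer = f"There are {count} employees with a salary greater than $70,000."
print(answer)  # There are 3 employees with a salary greater than $70,000.
```

In the real system the stub is replaced by a gpt-4o-mini call that receives the table schema in its prompt; everything else is the same shape.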
Every time a non-technical user needs data from a database, they must:
- Find a developer or data analyst
- Wait for the developer to write and run the query
- Get results hours or days later
This bottleneck is extremely common in business – managers, HR teams, executives, and analysts constantly need data but cannot write SQL.
| Without Text-to-SQL | With Text-to-SQL |
|---|---|
| Must know SQL syntax | Ask in plain English |
| Depends on developers | Self-service data access |
| Hours or days for data | Seconds |
| Technical barrier | Anyone can query |
| Manual, error-prone | Automated, consistent |
| Expensive analyst time | Immediate answers |
SQL is the language of nearly every business database. Vast numbers of business users rely on data from these databases, but only a fraction can write SQL themselves. Text-to-SQL bridges this gap at scale.
This project demonstrates both approaches, explaining when to use each:
```
Approach 1: NLSQLTableQueryEngine
─────────────────────────────────
  Schema injected directly into LLM prompt
  ✅ Simple setup – 5 lines of code
  ✅ Fast – no retrieval step
  ✅ Best for: 1–10 tables with small schemas
  ❌ Context window overflow with many tables
  ❌ LLM sees ALL schemas even for simple queries

Approach 2: SQLTableRetrieverQueryEngine
────────────────────────────────────────
  Table schemas stored in VectorStoreIndex
  Semantically retrieves only the relevant schema(s)
  ✅ Scales to 100+ tables without context overflow
  ✅ Only relevant schemas sent to LLM
  ✅ Best for: Large enterprise databases
  ❌ Slightly more setup
  ❌ Retrieval step adds small latency
```
```
SQLAlchemy Schema Definition
  │  Table, Column, primary key, constraints
  ▼
create_engine("sqlite:///:memory:")
  │  Creates in-memory SQLite database
  ▼
metadata_obj.create_all(engine)
  │  Executes CREATE TABLE SQL
  ▼
insert(employee_table).values(**row)
  │  Inserts 5 employee records
  ▼
SQLDatabase(engine, include_tables=["employee"])
  │  LlamaIndex wrapper – knows schema + can execute queries
  ▼
Database Ready for Querying
```
```
User Natural Language Question
  │  "How many employees have salary greater than 70000?"
  ▼
NLSQLTableQueryEngine
  │
  ├─ Step 1: Schema Injection
  │    LLM receives full table schema:
  │    "Table: employee, Columns: employee_id (INT, PK),
  │     first_name (VARCHAR), salary (FLOAT), ..."
  │
  ├─ Step 2: SQL Generation
  │    LLM generates:
  │    SELECT COUNT(*) FROM employee WHERE salary > 70000
  │
  ├─ Step 3: SQL Execution
  │    SQLAlchemy executes query on SQLite
  │    Raw result: [(3,)]
  │
  └─ Step 4: Response Synthesis
       LLM converts: [(3,)] → "There are 3 employees..."
       Returns: response.response + response.metadata["result"]
```
```
User Natural Language Question
  │  "Who are the employees with salary greater than 70000?"
  ▼
SQLTableRetrieverQueryEngine
  │
  ├─ Step 1: Schema Retrieval (NEW step vs Approach 1)
  │    Question → embedding vector
  │    Search VectorStoreIndex of table schemas
  │    Find top-1 most relevant table schema
  │    Returns: "employee" table schema + context_str
  │
  ├─ Step 2: SQL Generation
  │    LLM receives only the RETRIEVED schema (not all schemas)
  │    LLM generates:
  │    SELECT * FROM employee WHERE salary > 70000
  │
  ├─ Step 3: SQL Execution
  │    SQLAlchemy executes → returns matching rows
  │
  └─ Step 4: Response Synthesis
       LLM formats results as human-readable answer
```
```
Text-to-SQL System – Full Architecture

DATABASE LAYER
  SQLAlchemy Schema
    employee table
      ├── employee_id   INTEGER PRIMARY KEY
      ├── first_name    VARCHAR(50) NOT NULL
      ├── last_name     VARCHAR(50) NOT NULL
      ├── email         VARCHAR(100) UNIQUE
      ├── phone_number  VARCHAR(15)
      ├── hire_date     DATE NOT NULL
      ├── job_title     VARCHAR(50)
      ├── salary        FLOAT
      └── is_manager    BOOLEAN DEFAULT False
        │
        ▼
  SQLite in-memory database
  (create_engine("sqlite:///:memory:"))
        │
        ▼
  SQLDatabase (LlamaIndex wrapper)
        │
   ┌────┴──────────────────────────┐
   ▼                               ▼
APPROACH 1                      APPROACH 2
NLSQLTableQueryEngine           SQLTableRetrieverQueryEngine
Schema → LLM prompt             Schema → VectorStoreIndex
(direct injection)              (semantic retrieval first)
Best: few tables                Best: many tables
   └────┬──────────────────────────┘
        ▼
SHARED OUTPUT LAYER
  response.response           → Natural language answer
  response.metadata["result"] → Raw SQL query result
```
```
User: "How many employees have salary greater than 70000?"
        │
        ▼
NLSQLTableQueryEngine

Step 1 – Build Schema Context
  "Table 'employee' has columns:
   employee_id (INTEGER), first_name (VARCHAR(50)),
   last_name (VARCHAR(50)), email (VARCHAR(100)),
   phone_number (VARCHAR(15)), hire_date (DATE),
   job_title (VARCHAR(50)), salary (FLOAT),
   is_manager (BOOLEAN)"
        │  injected into LLM system prompt
        ▼
Step 2 – SQL Generation (gpt-4o-mini)
  LLM Input:  schema + "How many employees..."
  LLM Output: SELECT COUNT(*)
              FROM employee
              WHERE salary > 70000
        │
        ▼
Step 3 – SQL Execution (SQLAlchemy → SQLite)
  Raw SQL executed against employee table
  Employees with salary > 70000:
    Alice   → $85,000   ✓
    Bob     → $120,000  ✓
    Charlie → $95,000   ✓
    Diana   → $70,000   ✗ (not > 70000, equal)
    Evan    → $65,000   ✗
  Result: [(3,)]
        │
        ▼
Step 4 – Response Synthesis (gpt-4o-mini)
  LLM converts [(3,)] → natural language
  "There are 3 employees with a salary greater than $70,000."
```
```
User: "Who are the employees with salary greater than 70000?"
        │
        ▼
SQLTableRetrieverQueryEngine

Step 1 – Schema Retrieval (NEW – not in Approach 1)
  Question → text-embedding-3-small → query vector
  VectorStoreIndex of table schemas:
    Indexed node:
      "employee" table
      context: "This table gives information regarding
                the employees of the organization"
      + full schema definition
      vector: [0.23, -0.11, 0.87, ...]
  Cosine similarity: query_vector ↔ schema_vector
  similarity_top_k=1 → "employee" table selected
        │
        ▼
Step 2 – SQL Generation (same as Approach 1)
  LLM receives ONLY the retrieved "employee" schema
  (not all schemas – critical for large databases)
  LLM generates:
    SELECT first_name, last_name, salary, job_title
    FROM employee
    WHERE salary > 70000
        │
        ▼
Step 3 – SQL Execution
  Returns rows:
    ("Alice",   "Johnson", 85000.0,  "Software Developer")
    ("Bob",     "Smith",   120000.0, "Project Manager")
    ("Charlie", "Brown",   95000.0,  "Data Scientist")
        │
        ▼
Step 4 – Response Synthesis
  "The employees with salaries greater than $70,000
   are Alice Johnson (Software Developer, $85K),
   Bob Smith (Project Manager, $120K), and
   Charlie Brown (Data Scientist, $95K)."
```
```
Approach 1 (NLSQLTableQueryEngine):
───────────────────────────────────
Database has 20 tables
  │
  │  ALL schemas injected into every LLM prompt
  ▼
LLM context: [schema_1][schema_2]...[schema_20][user_query]

19 irrelevant schemas wasted!
Risk: Context window overflow
Cost: Paying for tokens from all 20 schemas every query

Approach 2 (SQLTableRetrieverQueryEngine):
──────────────────────────────────────────
Database has 20 tables
  │
  │  All 20 schemas indexed in VectorStoreIndex
  ▼
User query → embedding → similarity search
  │
  │  Only top-1 relevant schema retrieved
  ▼
LLM context: [schema_7_employee][user_query]

Only 1 relevant schema – 19 others not sent!
Benefit: No context overflow, lower cost, faster
```
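The savings are easy to quantify. Assuming, purely for illustration, around 150 prompt tokens per table schema (a made-up figure – real schemas vary widely), the per-query difference looks like this:

```python
TOKENS_PER_SCHEMA = 150  # illustrative assumption, not a measured value
NUM_TABLES = 20

# Approach 1: every schema rides along on every query
approach_1_tokens = NUM_TABLES * TOKENS_PER_SCHEMA
# Approach 2: only the single retrieved schema is sent
approach_2_tokens = 1 * TOKENS_PER_SCHEMA

print(approach_1_tokens)  # 3000
print(approach_2_tokens)  # 150
print(f"{approach_1_tokens // approach_2_tokens}x fewer schema tokens per query")
```

The ratio grows linearly with the table count, which is why the retriever approach is the one that scales.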
```
query_engine.query("How many employees have salary > 70000?")
        │
        ▼
Response Object

response.response
  "There are 3 employees with a salary greater than $70,000."
  (Natural language answer synthesized by LLM)

response.metadata["result"]
  [(3,)]
  (Raw result from SQL execution – list of tuples)

response.metadata["sql_query"]  (available in some modes)
  "SELECT COUNT(*) FROM employee WHERE salary > 70000"
  (The actual SQL that was generated and executed)
```
```
Python Code                                SQL Equivalent
───────────────────────────────            ─────────────────────────────────────
create_engine(                             [database connection]
    "sqlite:///:memory:")                  -- SQLite in-memory

MetaData()                                 [schema registry]

Table("employee",                          CREATE TABLE employee (
    Column("employee_id",                      employee_id INTEGER PRIMARY KEY,
        Integer, primary_key=True),
    Column("first_name",                       first_name VARCHAR(50) NOT NULL,
        String(50), nullable=False),
    Column("last_name",                        last_name VARCHAR(50) NOT NULL,
        String(50), nullable=False),
    Column("email",                            email VARCHAR(100) UNIQUE,
        String(100), unique=True),
    Column("phone_number",                     phone_number VARCHAR(15),
        String(15)),
    Column("hire_date",                        hire_date DATE NOT NULL,
        Date, nullable=False),
    Column("job_title",                        job_title VARCHAR(50),
        String(50)),
    Column("salary",                           salary FLOAT,
        Float),
    Column("is_manager",                       is_manager BOOLEAN DEFAULT FALSE
        Boolean, default=False),           );
)

metadata_obj.create_all(engine)            [executes CREATE TABLE on database]
```
The project uses a single employee table with the following structure and sample data:
| employee_id | first_name | last_name | email | phone_number | hire_date | job_title | salary | is_manager |
|---|---|---|---|---|---|---|---|---|
| 1 | Alice | Johnson | alice@example.com | 123-456-7890 | 2021-06-15 | Software Developer | 85,000 | False |
| 2 | Bob | Smith | bob@example.com | 987-654-3210 | 2020-03-10 | Project Manager | 120,000 | True |
| 3 | Charlie | Brown | charlie@example.com | 555-123-4567 | 2019-11-01 | Data Scientist | 95,000 | False |
| 4 | Diana | Prince | diana@example.com | 333-444-5555 | 2022-07-20 | HR Specialist | 70,000 | False |
| 5 | Evan | Taylor | evan@example.com | 222-333-4444 | 2023-01-05 | Marketing Analyst | 65,000 | False |
```
text-to-sql-llamaindex/
│
├── Text_to_SQL_LlamaIndex.ipynb   # Main Colab notebook
├── requirements.txt               # All Python dependencies
└── README.md                      # This file
```
This project uses an in-memory SQLite database – no external database files are created. All data is defined and populated within the notebook itself.
| Tool / Library | Version | Purpose |
|---|---|---|
| LlamaIndex Core | ≥ 0.11.0 | Text-to-SQL framework – NLSQLTableQueryEngine, SQLTableRetrieverQueryEngine |
| llama-index-llms-openai | ≥ 0.2.0 | OpenAI LLM integration (gpt-4o-mini) |
| OpenAI | ≥ 1.30.0 | OpenAI API client – LLM + embeddings |
| SQLAlchemy | ≥ 2.0.0 | Python SQL toolkit – schema definition, ORM, query execution |
| SQLite | built-in | In-memory relational database (via Python sqlite3) |
| Python | 3.10+ | Runtime |
| Google Colab | – | Cloud notebook execution environment |
LlamaIndex offers one of the most complete, production-ready Text-to-SQL pipelines among open-source frameworks:
| Feature | LlamaIndex | LangChain | Raw OpenAI | Manual |
|---|---|---|---|---|
| Schema auto-injection | ✅ | ✅ | ❌ Manual | ❌ Manual |
| Multi-table retrieval | ✅ VectorIndex | ❌ | ❌ | ❌ |
| ObjectIndex for schemas | ✅ Native | ❌ | ❌ | ❌ |
| context_str per table | ✅ | ❌ | ❌ | ❌ |
| Response + metadata | ✅ | ✅ | ❌ | ❌ |
| SQLAlchemy integration | ✅ Native | ✅ | ❌ | ❌ |
| Production-ready | ✅ | ✅ | ❌ | ❌ |
SQLAlchemy is Python's most widely used SQL toolkit. It provides:

- Schema definition in Python – no raw SQL `CREATE TABLE` statements needed
- Database agnosticism – switch from SQLite to PostgreSQL/MySQL by changing one line (the `create_engine` URL)
- Connection management – automatic commit/rollback, connection pooling
- Type safety – Python types map to SQL types (Integer, String, Float, Boolean, Date)
- Both ORM and Core API – use Python objects or raw SQL as needed
`create_engine("sqlite:///:memory:")` creates a temporary, in-memory database:

- ✅ Zero setup – no files, no server, no configuration
- ✅ Perfect for demos, testing, and development
- ✅ Extremely fast – data lives in RAM
- ❌ Lost when the Python process ends – not for production data
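The ephemerality is easy to see with Python's built-in `sqlite3`: every `:memory:` connection is its own private database, so nothing created in one connection exists in another, and everything vanishes when the connection closes:

```python
import sqlite3

# First in-memory database: create a table and insert a row
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
assert conn.execute("SELECT COUNT(*) FROM t").fetchone()[0] == 1

# A second :memory: connection is a completely separate database --
# the table created above does not exist here.
other = sqlite3.connect(":memory:")
try:
    other.execute("SELECT * FROM t")
except sqlite3.OperationalError as e:
    print(e)  # no such table: t
```

The same applies to the notebook: restarting the kernel discards the engine's in-memory database, so the schema-creation and insert cells must be re-run.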
For production, just change the connection string:

```python
engine = create_engine("postgresql://user:pass@host:5432/dbname")  # PostgreSQL
engine = create_engine("mysql://user:pass@host:3306/dbname")       # MySQL
engine = create_engine("sqlite:///my_database.db")                 # Persistent SQLite file
```

SQL generation requires precision and consistency:

- `temperature=0.1` – near-deterministic output; the LLM consistently picks the most likely SQL tokens rather than exploring creative alternatives
- Higher temperature risks generating syntactically varied but semantically incorrect SQL
- `gpt-4o-mini` – sufficient intelligence for SQL generation at minimal cost
```python
sql_database = SQLDatabase(engine, include_tables=["employee"])
```

`SQLDatabase` wraps a SQLAlchemy engine and provides LlamaIndex with:

- Schema introspection – automatically reads column names, types, and constraints
- Query execution – runs generated SQL and returns results
- Table filtering – `include_tables` limits which tables are exposed to the LLM

Why `include_tables` matters:

```python
# Without include_tables – ALL tables in the database are exposed
sql_database = SQLDatabase(engine)
# Risk: If database has 50 tables, all 50 schemas go into context – overflow

# With include_tables – only specified tables are exposed
sql_database = SQLDatabase(engine, include_tables=["employee"])
# LLM only sees the "employee" schema – clean and efficient
```

```python
query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,  # database connection
    tables=["employee"],        # tables to include in schema context
    llm=llm                     # LLM for SQL generation and synthesis
)
```

Internal flow:

- Reads schema of all specified tables from `sql_database`
- Builds a prompt: `[system: you are a SQL expert] + [schema] + [user query]`
- LLM generates a SQL query string
- LlamaIndex executes the SQL via SQLAlchemy
- LLM synthesizes the raw result into a natural language answer
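The first two steps can be mimicked without LlamaIndex: introspect the schema (here via SQLite's `PRAGMA table_info`, standing in for SQLAlchemy's reflection) and assemble the text that would be handed to the LLM. This is a simplified sketch of the idea, not LlamaIndex's actual prompt template:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "employee_id INTEGER PRIMARY KEY, first_name VARCHAR(50), salary FLOAT)"
)

# Step 1: schema introspection -- PRAGMA rows are (cid, name, type, ...)
cols = conn.execute("PRAGMA table_info(employee)").fetchall()
schema = "Table: employee, Columns: " + ", ".join(
    f"{name} ({ctype})" for _, name, ctype, *_ in cols
)

# Step 2: assemble the prompt sent to the LLM for SQL generation
question = "How many employees have salary greater than 70000?"
prompt = f"You are a SQL expert.\n{schema}\nQuestion: {question}\nSQL:"
print(schema)
# Table: employee, Columns: employee_id (INTEGER), first_name (VARCHAR(50)), salary (FLOAT)
```

Because the schema is read from the live database, the prompt automatically stays in sync when columns are added or renamed.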
```python
context_str = "This table gives information regarding the employees of the organization"

table_node_mapping = SQLTableNodeMapping(sql_database)

table_schema_objs = [
    SQLTableSchema(table_name="employee", context_str=context_str)
]

obj_index = ObjectIndex.from_objects(
    table_schema_objs,    # list of table schema objects
    table_node_mapping,   # maps schemas to indexable nodes
    VectorStoreIndex,     # index type for semantic search
)
```

What each piece does:

| Component | Role |
|---|---|
| `SQLTableSchema` | Represents one table – `table_name` + optional `context_str` |
| `context_str` | Human-written description of what the table contains – helps semantic matching |
| `SQLTableNodeMapping` | Converts `SQLTableSchema` objects to LlamaIndex `Node` objects (indexable units) |
| `ObjectIndex` | Creates a `VectorStoreIndex` from the table nodes – enables semantic search |
| `obj_index.as_retriever(similarity_top_k=1)` | Retriever that finds the top-1 most relevant table for any query |

Why `context_str` is powerful:

```python
# Without context_str – LLM only has schema column names
SQLTableSchema(table_name="emp_master")
# Query "find employees" – vector similarity to "emp_master" schema is weak

# With context_str – rich semantic context for matching
SQLTableSchema(
    table_name="emp_master",
    context_str="Contains all employee personal details, salaries, job titles, and manager information"
)
# Query "find employees" – high similarity to context_str → correct table selected
```

```python
query_engine = SQLTableRetrieverQueryEngine(
    sql_database,                               # database connection
    obj_index.as_retriever(similarity_top_k=1)  # retrieves top-1 relevant table schema
)
```

This engine:

- Embeds the user query (via `text-embedding-3-small` by default)
- Searches the `ObjectIndex` for the most semantically similar table schema
- Retrieves only that table's schema (not all schemas)
- Passes only the relevant schema to the LLM for SQL generation
- Executes and synthesizes the answer
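A toy version of the retrieval step, with word-overlap (Jaccard) similarity standing in for real embedding cosine similarity – the `employee` description is this project's `context_str`, the `department` entry is a hypothetical second table added to make retrieval meaningful:

```python
def jaccard(a: str, b: str) -> float:
    """Crude stand-in for embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# table_name -> context_str, as passed to SQLTableSchema
schemas = {
    "employee": "This table gives information regarding the employees of the organization",
    "department": "Department names budgets and department heads",
}

query = "Who are the employees with salary greater than 70000?"

# similarity_top_k=1 -> keep only the single best-matching table
best_table = max(schemas, key=lambda t: jaccard(query, schemas[t]))
print(best_table)  # employee
```

Real embeddings capture synonyms and paraphrases that word overlap misses (e.g. "staff" matching "employees"), but the selection mechanics – score every table description, keep the top-k – are the same.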
Understanding SQLAlchemy is key to extending this project. Here's a quick reference:

Define tables:

```python
from sqlalchemy import Table, Column, String, Integer, Float, Boolean, Date

my_table = Table(
    "table_name",   # SQL table name
    metadata_obj,   # schema registry
    Column("id", Integer, primary_key=True),
    Column("name", String(100), nullable=False),
    Column("value", Float),
    Column("active", Boolean, default=True),
    Column("date", Date),
)
metadata_obj.create_all(engine)  # executes CREATE TABLE
```

Insert data:

```python
from sqlalchemy import insert

with engine.begin() as conn:
    conn.execute(insert(my_table).values(id=1, name="Alice", value=42.5))
```

Query with the expression API:

```python
from sqlalchemy import select

query = select(my_table).where(my_table.c.value > 40)
with engine.connect() as conn:
    results = conn.execute(query).fetchall()
```

Run raw SQL:

```python
from sqlalchemy import text

with engine.connect() as conn:
    results = conn.execute(text("SELECT * FROM my_table WHERE value > 40"))
    for row in results:
        print(row)
```

Connect to other databases:

```python
# SQLite file (development/small apps)
engine = create_engine("sqlite:///my_database.db")

# PostgreSQL (production)
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# MySQL (production)
engine = create_engine("mysql+pymysql://user:password@localhost:3306/mydb")

# SQL Server
engine = create_engine("mssql+pyodbc://user:password@server/db?driver=ODBC+Driver+17+for+SQL+Server")
```

Prerequisites:

- Python 3.10 or higher
- An OpenAI API key
- pip

Clone the repository:

```shell
git clone https://github.com/your-username/text-to-sql-llamaindex.git
cd text-to-sql-llamaindex
```

Install dependencies:

```shell
pip install -r requirements.txt
```

Set your OpenAI API key:

```shell
echo "OPENAI_API_KEY=sk-your-key-here" > .env
```

Then in the notebook:

```python
from dotenv import load_dotenv
load_dotenv()
```

Launch the notebook:

```shell
jupyter notebook Text_to_SQL_LlamaIndex.ipynb
```

Go to colab.research.google.com → File → Upload Notebook

- Click the 🔑 Secrets icon in the left sidebar
- Add secret: Name = `OpenAI`, Value = your OpenAI API key
- Toggle Notebook access ON

Install dependencies in the first cell:

```shell
!pip install llama-index llama-index-llms-openai openai sqlalchemy
```

Runtime → Run all (Ctrl+F9)
⚡ This project runs fast – no PDF indexing, no large downloads. Typical full run: under 30 seconds.
Add more employees:

```python
new_employees = [
    {
        "employee_id": 6,
        "first_name": "Fiona",
        "last_name": "Green",
        "email": "fiona.green@example.com",
        "phone_number": "111-222-3333",
        "hire_date": date(2021, 9, 1),
        "job_title": "DevOps Engineer",
        "salary": 105000.0,
        "is_manager": False,
    }
]

for row in new_employees:
    with engine.begin() as connection:
        connection.execute(insert(employee_table).values(**row))
```

Add more tables:

```python
# Define a department table
department_table = Table(
    "department", metadata_obj,
    Column("dept_id", Integer, primary_key=True),
    Column("dept_name", String(100), nullable=False),
    Column("budget", Float),
    Column("head_employee_id", Integer),
)
metadata_obj.create_all(engine)

# Update SQLDatabase to include both tables
sql_database = SQLDatabase(engine, include_tables=["employee", "department"])

# Add both to ObjectIndex for Approach 2
table_schema_objs = [
    SQLTableSchema(table_name="employee",
                   context_str="Contains employee personal details, salaries, and roles"),
    SQLTableSchema(table_name="department",
                   context_str="Contains department names, budgets, and department heads"),
]
```

Use a different database:

```python
# PostgreSQL example
from sqlalchemy import create_engine

engine = create_engine("postgresql://username:password@host:5432/database_name")
sql_database = SQLDatabase(engine, include_tables=["your_table"])
# All other code remains exactly the same!
```

Run a batch of queries:

```python
queries = [
    "Who is the highest-paid employee?",
    "How many employees were hired after 2021?",
    "List all managers in the company.",
    "What is the average salary by job title?",
    "Which employees have been with the company for more than 3 years?",
]

for q in queries:
    response = query_engine.query(q)
    print(f"Q: {q}")
    print(f"A: {response.response}\n")
```

Query: "How many employees have salary greater than 70000?"
SQL Generated:
SELECT COUNT(*) FROM employee WHERE salary > 70000
Raw Result:
[(3,)]
Natural Language Answer:
"There are 3 employees with a salary greater than $70,000."
Query: "Who are the employees with salary greater than 70000?"
SQL Generated:
SELECT first_name, last_name, salary, job_title
FROM employee
WHERE salary > 70000
Raw Result:
[('Alice', 'Johnson', 85000.0, 'Software Developer'),
('Bob', 'Smith', 120000.0, 'Project Manager'),
('Charlie', 'Brown', 95000.0, 'Data Scientist')]
Natural Language Answer:
"The employees with salaries greater than $70,000 are:
Alice Johnson (Software Developer, $85,000),
Bob Smith (Project Manager, $120,000), and
Charlie Brown (Data Scientist, $95,000)."
Query: "Who is the manager in the company?"
SQL Generated:
SELECT first_name, last_name, job_title
FROM employee
WHERE is_manager = 1
Raw Result:
[('Bob', 'Smith', 'Project Manager')]
Natural Language Answer:
"The manager in the company is Bob Smith, who holds
the position of Project Manager."
What happens: create_engine("sqlite:///:memory:") creates a database that exists only in RAM. Every time the notebook restarts or the kernel resets, all table definitions and data are permanently deleted. The entire setup (create tables + insert rows) must be re-run from scratch.
Why it matters: In a real application, you connect to a persistent database. The in-memory setup is only appropriate for demos and testing.
What happens: The LLM generates SQL that is executed directly against the database without sanitization. A carefully crafted natural language input could theoretically prompt the LLM to generate destructive SQL (DROP TABLE, DELETE FROM, UPDATE without WHERE clause).
Why it matters: For any user-facing application, raw LLM-generated SQL must be validated before execution.
What happens: The LLM is not a SQL compiler. For complex queries involving multiple JOINs, subqueries, window functions, or database-specific syntax, the generated SQL may be syntactically incorrect or logically wrong β and the engine will execute whatever SQL was generated without pre-validation.
What happens: Approach 1 injects the full schema of all specified tables into the LLM prompt. With 20+ large tables, the combined schema text can exceed gpt-4o-mini's 128K context window, causing a context_length_exceeded error.
What happens: Neither query engine has built-in guardrails preventing the LLM from generating INSERT, UPDATE, DELETE, or DROP statements. A prompt like "Delete all employees with salary less than 50000" could trigger a destructive query.
What happens: The current setup has only one table. Real database queries often need JOINs across multiple related tables. While LlamaIndex supports multi-table queries, the demo doesn't demonstrate this capability.
What happens: By default, you only see response.response (natural language) and response.metadata["result"] (raw tuples). The actual SQL query generated is not prominently displayed β making debugging SQL errors difficult.
What happens: temperature=0.1 means there is still a small probability of variability in SQL generation. The same question asked twice might occasionally produce slightly different SQL queries, leading to inconsistent results.
What happens: In Approach 2, similarity_top_k=1 retrieves only the single most similar table. If a query spans two tables and only one is retrieved, the generated SQL will miss the second table entirely and produce an incomplete or incorrect answer.
What happens: Every call to query_engine.query() makes at minimum 2 LLM API calls: one for SQL generation and one for response synthesis. Identical questions asked repeatedly each incur the full API cost.
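A rough per-query cost estimate makes the caching argument concrete. The token counts and per-token prices below are illustrative assumptions only – check current OpenAI pricing before relying on them:

```python
# Illustrative assumptions -- not current OpenAI prices.
INPUT_PRICE_PER_1M = 0.15   # $ per 1M input tokens (assumed)
OUTPUT_PRICE_PER_1M = 0.60  # $ per 1M output tokens (assumed)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one LLM call under the assumed prices."""
    return (input_tokens * INPUT_PRICE_PER_1M
            + output_tokens * OUTPUT_PRICE_PER_1M) / 1_000_000

# Two LLM calls per query: SQL generation + response synthesis
sql_gen = call_cost(input_tokens=600, output_tokens=30)    # schema + question
synthesis = call_cost(input_tokens=200, output_tokens=60)  # result + question
per_query = sql_gen + synthesis
print(f"${per_query:.6f} per query")  # $0.000174 per query
print(f"${per_query * 100_000:.2f} per 100k identical, uncached queries")
```

Each query is fractions of a cent, but repeated identical questions multiply the bill for no new information – which is exactly what the caching fix below eliminates.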
```python
# Option A: Persistent SQLite file (simple, single-user)
engine = create_engine("sqlite:///employees.db")

# Option B: PostgreSQL (production, multi-user)
engine = create_engine("postgresql://user:password@localhost:5432/hr_db")

# Option C: Connect to an existing database (no schema creation needed)
engine = create_engine("postgresql://user:pass@host/existing_db")
sql_database = SQLDatabase(engine, include_tables=["employees", "departments"])
```

Validate SQL before executing it:

```python
import re
from sqlalchemy import text

def safe_execute(sql: str, engine) -> list:
    """Block any SQL that modifies data – only allow SELECT."""
    sql_upper = sql.strip().upper()
    dangerous_keywords = ["INSERT", "UPDATE", "DELETE", "DROP", "ALTER",
                          "CREATE", "TRUNCATE", "REPLACE"]
    for keyword in dangerous_keywords:
        if re.search(r'\b' + keyword + r'\b', sql_upper):
            raise ValueError(f"Blocked dangerous SQL operation: {keyword}")
    with engine.connect() as conn:
        return conn.execute(text(sql)).fetchall()
```

Inspect the generated SQL:

```python
response = query_engine.query("How many employees have salary > 70000?")

# Print all available metadata
print("Answer:", response.response)
print("Raw result:", response.metadata.get("result"))
print("SQL query:", response.metadata.get("sql_query", "Not available"))
```

Retrieve more than one table:

```python
query_engine = SQLTableRetrieverQueryEngine(
    sql_database,
    obj_index.as_retriever(similarity_top_k=2)  # retrieve top-2 tables
)
# Now cross-table queries can find both relevant tables
```

Make SQL generation fully deterministic:

```python
llm = OpenAI(temperature=0, model="gpt-4o-mini")
# temperature=0 – most deterministic SQL generation
# Same question always produces identical SQL
```

Cache repeated queries:

```python
import hashlib

query_cache = {}

def cached_query(query_engine, question: str):
    """Cache query results to avoid redundant LLM API calls."""
    cache_key = hashlib.md5(question.encode()).hexdigest()
    if cache_key in query_cache:
        print("[Cache hit] Returning cached result")
        return query_cache[cache_key]
    result = query_engine.query(question)
    query_cache[cache_key] = result
    return result

response = cached_query(query_engine, "How many employees earn over 70000?")
```

Model related tables for JOINs:

```python
# Define related tables
department_table = Table("department", metadata_obj,
    Column("dept_id", Integer, primary_key=True),
    Column("dept_name", String(100)),
    Column("manager_id", Integer),  # FK to employee.employee_id
)

# Add both to ObjectIndex with descriptive context
table_schema_objs = [
    SQLTableSchema(
        table_name="employee",
        context_str="Employee details including salary, title, and department assignment"
    ),
    SQLTableSchema(
        table_name="department",
        context_str="Department names and their assigned manager employee IDs"
    ),
]

# Query engine will now generate JOINs when needed
response = query_engine.query(
    "Which department does Alice Johnson work in?"
)
```

Build a simple UI:

```python
import gradio as gr

def query_database(question: str) -> str:
    """Natural language database query interface."""
    try:
        response = query_engine.query(question)
        sql = response.metadata.get("sql_query", "N/A")
        return f"**Answer:** {response.response}\n\n**SQL Used:** `{sql}`"
    except Exception as e:
        return f"Error: {str(e)}"

gr.Interface(
    fn=query_database,
    inputs=gr.Textbox(placeholder="Ask anything about employees...",
                      label="Your Question"),
    outputs=gr.Markdown(label="Answer"),
    title="Employee Database Assistant",
    examples=[
        "How many employees have salary greater than 70000?",
        "Who is the highest paid employee?",
        "List all managers in the company",
    ]
).launch()
```

Text-to-SQL is the task of converting a natural language question ("Who earns the most?") into a valid SQL query (`SELECT first_name, salary FROM employee ORDER BY salary DESC LIMIT 1`). LLMs are very good at this because they were trained on vast amounts of SQL code.
SQLAlchemy is Python's most popular database library. It lets you define database tables as Python classes/objects, insert and query data using Python syntax, and connect to almost any SQL database by just changing the connection string – no need to write raw SQL for basic operations.
An in-memory database like sqlite:///:memory: stores all data in RAM instead of on disk. It is extremely fast but all data is lost when the program ends. It is ideal for testing, demos, and temporary data β not for production systems.
A database schema is the structure definition of a table β its name, column names, column types, and constraints (primary key, NOT NULL, UNIQUE). LlamaIndex reads the schema automatically and sends it to the LLM so the LLM knows what columns it can filter, sort, and aggregate.
context_str is a plain English description you write for each table, explaining what data it contains. This description is embedded into the VectorStoreIndex alongside the schema. When a user asks a question, the semantic similarity between the question and the context_str determines which table is retrieved.
response.metadata["result"] contains the raw output of the SQL query β a list of tuples exactly as returned by the database engine. For example, [(3,)] means one row was returned with one column containing the value 3. The LLM then converts this raw result into the natural language response.response.
Each query makes two LLM calls:
- SQL Generation – LLM receives the schema + user question, outputs SQL
- Response Synthesis – LLM receives the SQL result + original question, outputs natural language
This two-step process ensures the answer is both accurate (grounded in real SQL results) and readable (expressed in natural language).
| Approach | Who writes queries | Speed | Accuracy | Technical skill |
|---|---|---|---|---|
| Text-to-SQL (this project) | AI from natural language | Fast | High for simple queries | None |
| Manual SQL | Developer writes SQL | Slow | Exact | High |
| ORM (SQLAlchemy Python API) | Developer writes Python | Medium | Exact | Medium |
| Spreadsheet filter | Anyone (UI) | Instant | Exact | None |
| BI Tool (Tableau, Power BI) | Anyone (UI) | Fast | Exact | Low |
Text-to-SQL fills the gap between BI tools (limited to pre-built reports) and manual SQL (requires developer time) – enabling ad-hoc queries from anyone, instantly.
| Use Case | Natural Language Query Example |
|---|---|
| 👥 HR Management | "How many employees were hired this year?" |
| 💰 Finance | "What is the total salary budget by department?" |
| 📊 Executive reporting | "Who are our top 5 highest-paid employees?" |
| 🏥 Healthcare | "How many patients were admitted last month?" |
| 🛒 E-commerce | "Which products sold more than 100 units today?" |
| 📦 Inventory | "List all items with stock below reorder level" |
| 🎓 Education | "How many students passed the final exam?" |
```python
# Add 50+ employees to stress-test query quality
# Add date-range queries, aggregations by department
```

```python
# Create department, project, performance_review tables
# Test queries like "Which projects are managed by employees earning > 100K?"
```

```python
engine = create_engine("postgresql://user:pass@localhost/company_db")
# All LlamaIndex code remains the same
```

```python
import gradio as gr
# Non-technical users interact via browser
# Show both natural language answer AND generated SQL
```

```python
# Validate generated SQL before execution
# Block INSERT/UPDATE/DELETE
# Add query result caching
```
- 🔒 Add a read-only guard that blocks non-SELECT SQL before execution
- 🗄️ Connect to a real PostgreSQL or MySQL database
- 🔗 Add more tables and demonstrate multi-table JOIN queries
- 🖥️ Build a Gradio UI where anyone can query the database via browser
- 💾 Add response caching to avoid redundant LLM calls
- 🛡️ Set `temperature=0` and benchmark SQL correctness
- 📝 Log all generated SQL queries to a file for auditing
- 📊 Add chart generation – query results plotted with matplotlib
To contribute:
```shell
git checkout -b feature/add-gradio-ui
git commit -m "Add Gradio interface for non-technical database querying"
git push origin feature/add-gradio-ui
# Then open a Pull Request on GitHub
```

This project is open-source and available under the MIT License.
- LlamaIndex – for the comprehensive Text-to-SQL pipeline
- OpenAI – for the gpt-4o-mini model
- SQLAlchemy – for the Python SQL toolkit
- SQLite – for the lightweight embedded database engine
Built with ❤️ using LlamaIndex, SQLAlchemy, and OpenAI