Ask your database questions in plain English and get accurate answers – no SQL knowledge required. This project converts natural language questions into executable SQL queries using LlamaIndex and GPT-4o-mini, demonstrating two production-ready approaches for any database scale.
- What is this project?
- Why Text-to-SQL?
- Two Approaches – Simple vs Scalable
- How it works – Full Pipeline
- Flowcharts
- Database Schema
- Project Structure
- Tech Stack & Tools
- Component Deep Dive
- SQLAlchemy Fundamentals
- Setup & Installation
- Running on Google Colab
- Configuration & Customization
- Sample Outputs
- Limitations
- How Limitations Can Be Resolved
- Key Concepts for Beginners
- Text-to-SQL vs Manual SQL vs ORM
- Real-World Use Cases
- What to Build Next
- Contributing
This project builds a Text-to-SQL system using LlamaIndex – an AI pipeline that takes a natural language question, automatically generates the correct SQL query, executes it against a real database, and returns a human-readable answer.
It uses a SQLite employee database as the demonstration data, with 5 employees across various roles and salaries. The project demonstrates two distinct approaches:
- `NLSQLTableQueryEngine` – Simple, direct approach where the full table schema is injected into the LLM's context. Best for small databases with few tables.
- `SQLTableRetrieverQueryEngine` – Scalable approach where table schemas are stored in a `VectorStoreIndex` and retrieved semantically at query time. Best for large databases with many tables.
```
"How many employees have salary greater than 70000?"
                        │
                        ▼
LLM generates: SELECT COUNT(*) FROM employee WHERE salary > 70000
                        │
                        ▼
SQLAlchemy executes the query on SQLite
                        │
                        ▼
"There are 3 employees with a salary greater than $70,000."
```
No SQL expertise required from the user. The entire translation is handled by the AI.
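The four-step flow above can be sketched end to end with Python's built-in `sqlite3` and a stub in place of the model call – `fake_llm_sql` below is a hypothetical stand-in for the real LLM, not part of the project:

```python
import sqlite3

def fake_llm_sql(question: str) -> str:
    """Hypothetical stand-in for the LLM's SQL-generation call."""
    return "SELECT COUNT(*) FROM employee WHERE salary > 70000"

# In-memory database with the same sample data as the project
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (first_name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Alice", 85000), ("Bob", 120000), ("Charlie", 95000),
     ("Diana", 70000), ("Evan", 65000)],
)

# Generate -> execute -> synthesize
sql = fake_llm_sql("How many employees have salary greater than 70000?")
count = conn.execute(sql).fetchone()[0]
answer = f"There are {count} employees with a salary greater than $70,000."
print(answer)  # There are 3 employees with a salary greater than $70,000.
```

In the real system the stub is replaced by a gpt-4o-mini call that receives the table schema in its prompt; everything else is the same shape.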
Every time a non-technical user needs data from a database, they must:
- Find a developer or data analyst
- Wait for the developer to write and run the query
- Get results hours or days later
This bottleneck is extremely common in business – managers, HR teams, executives, and analysts constantly need data but cannot write SQL.
| Without Text-to-SQL | With Text-to-SQL |
|---|---|
| Must know SQL syntax | Ask in plain English |
| Depends on developers | Self-service data access |
| Hours or days for data | Seconds |
| Technical barrier | Anyone can query |
| Manual, error-prone | Automated, consistent |
| Expensive analyst time | Immediate answers |
SQL is the language of nearly every business database. Vast numbers of business users rely on data from these databases, but only a fraction can write SQL themselves. Text-to-SQL bridges this gap at scale.
This project demonstrates both approaches, explaining when to use each:
```
Approach 1: NLSQLTableQueryEngine
─────────────────────────────────
  Schema injected directly into LLM prompt
  ✅ Simple setup – 5 lines of code
  ✅ Fast – no retrieval step
  ✅ Best for: 1–10 tables with small schemas
  ❌ Context window overflow with many tables
  ❌ LLM sees ALL schemas even for simple queries

Approach 2: SQLTableRetrieverQueryEngine
────────────────────────────────────────
  Table schemas stored in VectorStoreIndex
  Semantically retrieves only the relevant schema(s)
  ✅ Scales to 100+ tables without context overflow
  ✅ Only relevant schemas sent to LLM
  ✅ Best for: Large enterprise databases
  ❌ Slightly more setup
  ❌ Retrieval step adds small latency
```
```
SQLAlchemy Schema Definition
  │  Table, Column, primary key, constraints
  ▼
create_engine("sqlite:///:memory:")
  │  Creates in-memory SQLite database
  ▼
metadata_obj.create_all(engine)
  │  Executes CREATE TABLE SQL
  ▼
insert(employee_table).values(**row)
  │  Inserts 5 employee records
  ▼
SQLDatabase(engine, include_tables=["employee"])
  │  LlamaIndex wrapper – knows schema + can execute queries
  ▼
Database Ready for Querying
```
```
User Natural Language Question
  │  "How many employees have salary greater than 70000?"
  ▼
NLSQLTableQueryEngine
  │
  ├─ Step 1: Schema Injection
  │    LLM receives full table schema:
  │    "Table: employee, Columns: employee_id (INT, PK),
  │     first_name (VARCHAR), salary (FLOAT), ..."
  │
  ├─ Step 2: SQL Generation
  │    LLM generates:
  │    SELECT COUNT(*) FROM employee WHERE salary > 70000
  │
  ├─ Step 3: SQL Execution
  │    SQLAlchemy executes query on SQLite
  │    Raw result: [(3,)]
  │
  └─ Step 4: Response Synthesis
       LLM converts: [(3,)] → "There are 3 employees..."
       Returns: response.response + response.metadata["result"]
```
```
User Natural Language Question
  │  "Who are the employees with salary greater than 70000?"
  ▼
SQLTableRetrieverQueryEngine
  │
  ├─ Step 1: Schema Retrieval (NEW step vs Approach 1)
  │    Question → embedding vector
  │    Search VectorStoreIndex of table schemas
  │    Find top-1 most relevant table schema
  │    Returns: "employee" table schema + context_str
  │
  ├─ Step 2: SQL Generation
  │    LLM receives only the RETRIEVED schema (not all schemas)
  │    LLM generates:
  │    SELECT * FROM employee WHERE salary > 70000
  │
  ├─ Step 3: SQL Execution
  │    SQLAlchemy executes → returns matching rows
  │
  └─ Step 4: Response Synthesis
       LLM formats results as human-readable answer
```
```
Text-to-SQL System – Full Architecture

DATABASE LAYER
  SQLAlchemy Schema
    employee table
      ├── employee_id   INTEGER PRIMARY KEY
      ├── first_name    VARCHAR(50) NOT NULL
      ├── last_name     VARCHAR(50) NOT NULL
      ├── email         VARCHAR(100) UNIQUE
      ├── phone_number  VARCHAR(15)
      ├── hire_date     DATE NOT NULL
      ├── job_title     VARCHAR(50)
      ├── salary        FLOAT
      └── is_manager    BOOLEAN DEFAULT False
        │
        ▼
  SQLite in-memory database
  (create_engine("sqlite:///:memory:"))
        │
        ▼
  SQLDatabase (LlamaIndex wrapper)
        │
   ┌────┴──────────────────────────┐
   ▼                               ▼
APPROACH 1                      APPROACH 2
NLSQLTableQueryEngine           SQLTableRetrieverQueryEngine
Schema → LLM prompt             Schema → VectorStoreIndex
(direct injection)              (semantic retrieval first)
Best: few tables                Best: many tables
   └────┬──────────────────────────┘
        ▼
SHARED OUTPUT LAYER
  response.response           → Natural language answer
  response.metadata["result"] → Raw SQL query result
```
```
User: "How many employees have salary greater than 70000?"
        │
        ▼
NLSQLTableQueryEngine

Step 1 – Build Schema Context
  "Table 'employee' has columns:
   employee_id (INTEGER), first_name (VARCHAR(50)),
   last_name (VARCHAR(50)), email (VARCHAR(100)),
   phone_number (VARCHAR(15)), hire_date (DATE),
   job_title (VARCHAR(50)), salary (FLOAT),
   is_manager (BOOLEAN)"
        │  injected into LLM system prompt
        ▼
Step 2 – SQL Generation (gpt-4o-mini)
  LLM Input:  schema + "How many employees..."
  LLM Output: SELECT COUNT(*)
              FROM employee
              WHERE salary > 70000
        │
        ▼
Step 3 – SQL Execution (SQLAlchemy → SQLite)
  Raw SQL executed against employee table
  Employees with salary > 70000:
    Alice   → $85,000   ✓
    Bob     → $120,000  ✓
    Charlie → $95,000   ✓
    Diana   → $70,000   ✗ (not > 70000, equal)
    Evan    → $65,000   ✗
  Result: [(3,)]
        │
        ▼
Step 4 – Response Synthesis (gpt-4o-mini)
  LLM converts [(3,)] → natural language
  "There are 3 employees with a salary greater than $70,000."
```
```
User: "Who are the employees with salary greater than 70000?"
        │
        ▼
SQLTableRetrieverQueryEngine

Step 1 – Schema Retrieval (NEW – not in Approach 1)
  Question → text-embedding-3-small → query vector
  VectorStoreIndex of table schemas:
    Indexed node:
      "employee" table
      context: "This table gives information regarding
                the employees of the organization"
      + full schema definition
      vector: [0.23, -0.11, 0.87, ...]
  Cosine similarity: query_vector ↔ schema_vector
  similarity_top_k=1 → "employee" table selected
        │
        ▼
Step 2 – SQL Generation (same as Approach 1)
  LLM receives ONLY the retrieved "employee" schema
  (not all schemas – critical for large databases)
  LLM generates:
    SELECT first_name, last_name, salary, job_title
    FROM employee
    WHERE salary > 70000
        │
        ▼
Step 3 – SQL Execution
  Returns rows:
    ("Alice",   "Johnson", 85000.0,  "Software Developer")
    ("Bob",     "Smith",   120000.0, "Project Manager")
    ("Charlie", "Brown",   95000.0,  "Data Scientist")
        │
        ▼
Step 4 – Response Synthesis
  "The employees with salaries greater than $70,000
   are Alice Johnson (Software Developer, $85K),
   Bob Smith (Project Manager, $120K), and
   Charlie Brown (Data Scientist, $95K)."
```
```
Approach 1 (NLSQLTableQueryEngine):
───────────────────────────────────
Database has 20 tables
  │
  │  ALL schemas injected into every LLM prompt
  ▼
LLM context: [schema_1][schema_2]...[schema_20][user_query]

19 irrelevant schemas wasted!
Risk: Context window overflow
Cost: Paying for tokens from all 20 schemas every query

Approach 2 (SQLTableRetrieverQueryEngine):
──────────────────────────────────────────
Database has 20 tables
  │
  │  All 20 schemas indexed in VectorStoreIndex
  ▼
User query → embedding → similarity search
  │
  │  Only top-1 relevant schema retrieved
  ▼
LLM context: [schema_7_employee][user_query]

Only 1 relevant schema – 19 others not sent!
Benefit: No context overflow, lower cost, faster
```
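The savings are easy to quantify. Assuming, purely for illustration, around 150 prompt tokens per table schema (a made-up figure – real schemas vary widely), the per-query difference looks like this:

```python
TOKENS_PER_SCHEMA = 150  # illustrative assumption, not a measured value
NUM_TABLES = 20

# Approach 1: every schema rides along on every query
approach_1_tokens = NUM_TABLES * TOKENS_PER_SCHEMA
# Approach 2: only the single retrieved schema is sent
approach_2_tokens = 1 * TOKENS_PER_SCHEMA

print(approach_1_tokens)  # 3000
print(approach_2_tokens)  # 150
print(f"{approach_1_tokens // approach_2_tokens}x fewer schema tokens per query")
```

The ratio grows linearly with the table count, which is why the retriever approach is the one that scales.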
```
query_engine.query("How many employees have salary > 70000?")
        │
        ▼
Response Object

response.response
  "There are 3 employees with a salary greater than $70,000."
  (Natural language answer synthesized by LLM)

response.metadata["result"]
  [(3,)]
  (Raw result from SQL execution – list of tuples)

response.metadata["sql_query"]  (available in some modes)
  "SELECT COUNT(*) FROM employee WHERE salary > 70000"
  (The actual SQL that was generated and executed)
```
```
Python Code                                SQL Equivalent
───────────────────────────────            ─────────────────────────────────────
create_engine(                             [database connection]
    "sqlite:///:memory:")                  -- SQLite in-memory

MetaData()                                 [schema registry]

Table("employee",                          CREATE TABLE employee (
    Column("employee_id",                      employee_id INTEGER PRIMARY KEY,
        Integer, primary_key=True),
    Column("first_name",                       first_name VARCHAR(50) NOT NULL,
        String(50), nullable=False),
    Column("last_name",                        last_name VARCHAR(50) NOT NULL,
        String(50), nullable=False),
    Column("email",                            email VARCHAR(100) UNIQUE,
        String(100), unique=True),
    Column("phone_number",                     phone_number VARCHAR(15),
        String(15)),
    Column("hire_date",                        hire_date DATE NOT NULL,
        Date, nullable=False),
    Column("job_title",                        job_title VARCHAR(50),
        String(50)),
    Column("salary",                           salary FLOAT,
        Float),
    Column("is_manager",                       is_manager BOOLEAN DEFAULT FALSE
        Boolean, default=False),           );
)

metadata_obj.create_all(engine)            [executes CREATE TABLE on database]
```
The project uses a single employee table with the following structure and sample data:
| employee_id | first_name | last_name | email | phone_number | hire_date | job_title | salary | is_manager |
|---|---|---|---|---|---|---|---|---|
| 1 | Alice | Johnson | alice@example.com | 123-456-7890 | 2021-06-15 | Software Developer | 85,000 | False |
| 2 | Bob | Smith | bob@example.com | 987-654-3210 | 2020-03-10 | Project Manager | 120,000 | True |
| 3 | Charlie | Brown | charlie@example.com | 555-123-4567 | 2019-11-01 | Data Scientist | 95,000 | False |
| 4 | Diana | Prince | diana@example.com | 333-444-5555 | 2022-07-20 | HR Specialist | 70,000 | False |
| 5 | Evan | Taylor | evan@example.com | 222-333-4444 | 2023-01-05 | Marketing Analyst | 65,000 | False |
```
text-to-sql-llamaindex/
│
├── Text_to_SQL_LlamaIndex.ipynb   # Main Colab notebook
├── requirements.txt               # All Python dependencies
└── README.md                      # This file
```
This project uses an in-memory SQLite database – no external database files are created. All data is defined and populated within the notebook itself.
| Tool / Library | Version | Purpose |
|---|---|---|
| LlamaIndex Core | ≥ 0.11.0 | Text-to-SQL framework – NLSQLTableQueryEngine, SQLTableRetrieverQueryEngine |
| llama-index-llms-openai | ≥ 0.2.0 | OpenAI LLM integration (gpt-4o-mini) |
| OpenAI | ≥ 1.30.0 | OpenAI API client – LLM + embeddings |
| SQLAlchemy | ≥ 2.0.0 | Python SQL toolkit – schema definition, ORM, query execution |
| SQLite | built-in | In-memory relational database (via Python sqlite3) |
| Python | 3.10+ | Runtime |
| Google Colab | – | Cloud notebook execution environment |
LlamaIndex offers one of the most complete, production-ready Text-to-SQL pipelines among open-source frameworks:
| Feature | LlamaIndex | LangChain | Raw OpenAI | Manual |
|---|---|---|---|---|
| Schema auto-injection | ✅ | ✅ | ❌ Manual | ❌ Manual |
| Multi-table retrieval | ✅ VectorIndex | ❌ | ❌ | ❌ |
| ObjectIndex for schemas | ✅ Native | ❌ | ❌ | ❌ |
| context_str per table | ✅ | ❌ | ❌ | ❌ |
| Response + metadata | ✅ | ✅ | ❌ | ❌ |
| SQLAlchemy integration | ✅ Native | ✅ | ❌ | ❌ |
| Production-ready | ✅ | ✅ | ❌ | ❌ |
SQLAlchemy is Python's most widely used SQL toolkit. It provides:

- Schema definition in Python – no raw SQL `CREATE TABLE` statements needed
- Database agnosticism – switch from SQLite to PostgreSQL/MySQL by changing one line (the `create_engine` URL)
- Connection management – automatic commit/rollback, connection pooling
- Type safety – Python types map to SQL types (Integer, String, Float, Boolean, Date)
- Both ORM and Core API – use Python objects or raw SQL as needed
`create_engine("sqlite:///:memory:")` creates a temporary, in-memory database:

- ✅ Zero setup – no files, no server, no configuration
- ✅ Perfect for demos, testing, and development
- ✅ Extremely fast – data lives in RAM
- ❌ Lost when the Python process ends – not for production data
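The ephemerality is easy to see with Python's built-in `sqlite3`: every `:memory:` connection is its own private database, so nothing created in one connection exists in another, and everything vanishes when the connection closes:

```python
import sqlite3

# First in-memory database: create a table and insert a row
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
assert conn.execute("SELECT COUNT(*) FROM t").fetchone()[0] == 1

# A second :memory: connection is a completely separate database --
# the table created above does not exist here.
other = sqlite3.connect(":memory:")
try:
    other.execute("SELECT * FROM t")
except sqlite3.OperationalError as e:
    print(e)  # no such table: t
```

The same applies to the notebook: restarting the kernel discards the engine's in-memory database, so the schema-creation and insert cells must be re-run.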
For production, just change the connection string:

```python
engine = create_engine("postgresql://user:pass@host:5432/dbname")  # PostgreSQL
engine = create_engine("mysql://user:pass@host:3306/dbname")       # MySQL
engine = create_engine("sqlite:///my_database.db")                 # Persistent SQLite file
```

SQL generation requires precision and consistency:

- `temperature=0.1` – near-deterministic output; the LLM consistently picks the most likely SQL tokens rather than exploring creative alternatives
- Higher temperature risks generating syntactically varied but semantically incorrect SQL
- `gpt-4o-mini` – sufficient intelligence for SQL generation at minimal cost
```python
sql_database = SQLDatabase(engine, include_tables=["employee"])
```

`SQLDatabase` wraps a SQLAlchemy engine and provides LlamaIndex with:

- Schema introspection – automatically reads column names, types, and constraints
- Query execution – runs generated SQL and returns results
- Table filtering – `include_tables` limits which tables are exposed to the LLM

Why `include_tables` matters:

```python
# Without include_tables – ALL tables in the database are exposed
sql_database = SQLDatabase(engine)
# Risk: If database has 50 tables, all 50 schemas go into context – overflow

# With include_tables – only specified tables are exposed
sql_database = SQLDatabase(engine, include_tables=["employee"])
# LLM only sees the "employee" schema – clean and efficient
```

```python
query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,  # database connection
    tables=["employee"],        # tables to include in schema context
    llm=llm                     # LLM for SQL generation and synthesis
)
```

Internal flow:

- Reads schema of all specified tables from `sql_database`
- Builds a prompt: `[system: you are a SQL expert] + [schema] + [user query]`
- LLM generates a SQL query string
- LlamaIndex executes the SQL via SQLAlchemy
- LLM synthesizes the raw result into a natural language answer
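The first two steps can be mimicked without LlamaIndex: introspect the schema (here via SQLite's `PRAGMA table_info`, standing in for SQLAlchemy's reflection) and assemble the text that would be handed to the LLM. This is a simplified sketch of the idea, not LlamaIndex's actual prompt template:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "employee_id INTEGER PRIMARY KEY, first_name VARCHAR(50), salary FLOAT)"
)

# Step 1: schema introspection -- PRAGMA rows are (cid, name, type, ...)
cols = conn.execute("PRAGMA table_info(employee)").fetchall()
schema = "Table: employee, Columns: " + ", ".join(
    f"{name} ({ctype})" for _, name, ctype, *_ in cols
)

# Step 2: assemble the prompt sent to the LLM for SQL generation
question = "How many employees have salary greater than 70000?"
prompt = f"You are a SQL expert.\n{schema}\nQuestion: {question}\nSQL:"
print(schema)
# Table: employee, Columns: employee_id (INTEGER), first_name (VARCHAR(50)), salary (FLOAT)
```

Because the schema is read from the live database, the prompt automatically stays in sync when columns are added or renamed.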
```python
context_str = "This table gives information regarding the employees of the organization"

table_node_mapping = SQLTableNodeMapping(sql_database)

table_schema_objs = [
    SQLTableSchema(table_name="employee", context_str=context_str)
]

obj_index = ObjectIndex.from_objects(
    table_schema_objs,    # list of table schema objects
    table_node_mapping,   # maps schemas to indexable nodes
    VectorStoreIndex,     # index type for semantic search
)
```

What each piece does:

| Component | Role |
|---|---|
| `SQLTableSchema` | Represents one table – `table_name` + optional `context_str` |
| `context_str` | Human-written description of what the table contains – helps semantic matching |
| `SQLTableNodeMapping` | Converts `SQLTableSchema` objects to LlamaIndex `Node` objects (indexable units) |
| `ObjectIndex` | Creates a `VectorStoreIndex` from the table nodes – enables semantic search |
| `obj_index.as_retriever(similarity_top_k=1)` | Retriever that finds the top-1 most relevant table for any query |

Why `context_str` is powerful:

```python
# Without context_str – LLM only has schema column names
SQLTableSchema(table_name="emp_master")
# Query "find employees" – vector similarity to "emp_master" schema is weak

# With context_str – rich semantic context for matching
SQLTableSchema(
    table_name="emp_master",
    context_str="Contains all employee personal details, salaries, job titles, and manager information"
)
# Query "find employees" – high similarity to context_str → correct table selected
```

```python
query_engine = SQLTableRetrieverQueryEngine(
    sql_database,                               # database connection
    obj_index.as_retriever(similarity_top_k=1)  # retrieves top-1 relevant table schema
)
```

This engine:

- Embeds the user query (via `text-embedding-3-small` by default)
- Searches the `ObjectIndex` for the most semantically similar table schema
- Retrieves only that table's schema (not all schemas)
- Passes only the relevant schema to the LLM for SQL generation
- Executes and synthesizes the answer
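A toy version of the retrieval step, with word-overlap (Jaccard) similarity standing in for real embedding cosine similarity – the `employee` description is this project's `context_str`, the `department` entry is a hypothetical second table added to make retrieval meaningful:

```python
def jaccard(a: str, b: str) -> float:
    """Crude stand-in for embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# table_name -> context_str, as passed to SQLTableSchema
schemas = {
    "employee": "This table gives information regarding the employees of the organization",
    "department": "Department names budgets and department heads",
}

query = "Who are the employees with salary greater than 70000?"

# similarity_top_k=1 -> keep only the single best-matching table
best_table = max(schemas, key=lambda t: jaccard(query, schemas[t]))
print(best_table)  # employee
```

Real embeddings capture synonyms and paraphrases that word overlap misses (e.g. "staff" matching "employees"), but the selection mechanics – score every table description, keep the top-k – are the same.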
Understanding SQLAlchemy is key to extending this project. Here's a quick reference:

Define tables:

```python
from sqlalchemy import Table, Column, String, Integer, Float, Boolean, Date

my_table = Table(
    "table_name",   # SQL table name
    metadata_obj,   # schema registry
    Column("id", Integer, primary_key=True),
    Column("name", String(100), nullable=False),
    Column("value", Float),
    Column("active", Boolean, default=True),
    Column("date", Date),
)
metadata_obj.create_all(engine)  # executes CREATE TABLE
```

Insert data:

```python
from sqlalchemy import insert

with engine.begin() as conn:
    conn.execute(insert(my_table).values(id=1, name="Alice", value=42.5))
```

Query with the expression API:

```python
from sqlalchemy import select

query = select(my_table).where(my_table.c.value > 40)
with engine.connect() as conn:
    results = conn.execute(query).fetchall()
```

Run raw SQL:

```python
from sqlalchemy import text

with engine.connect() as conn:
    results = conn.execute(text("SELECT * FROM my_table WHERE value > 40"))
    for row in results:
        print(row)
```

Connect to other databases:

```python
# SQLite file (development/small apps)
engine = create_engine("sqlite:///my_database.db")

# PostgreSQL (production)
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# MySQL (production)
engine = create_engine("mysql+pymysql://user:password@localhost:3306/mydb")

# SQL Server
engine = create_engine("mssql+pyodbc://user:password@server/db?driver=ODBC+Driver+17+for+SQL+Server")
```

Prerequisites:

- Python 3.10 or higher
- An OpenAI API key
- pip

Clone the repository:

```shell
git clone https://github.com/your-username/text-to-sql-llamaindex.git
cd text-to-sql-llamaindex
```

Install dependencies:

```shell
pip install -r requirements.txt
```

Set your OpenAI API key:

```shell
echo "OPENAI_API_KEY=sk-your-key-here" > .env
```

Then in the notebook:

```python
from dotenv import load_dotenv
load_dotenv()
```

Launch the notebook:

```shell
jupyter notebook Text_to_SQL_LlamaIndex.ipynb
```

Go to colab.research.google.com → File → Upload Notebook

- Click the 🔑 Secrets icon in the left sidebar
- Add secret: Name = `OpenAI`, Value = your OpenAI API key
- Toggle Notebook access ON

Install dependencies in the first cell:

```shell
!pip install llama-index llama-index-llms-openai openai sqlalchemy
```

Runtime → Run all (Ctrl+F9)
⚡ This project runs fast – no PDF indexing, no large downloads. Typical full run: under 30 seconds.
Add more employees:

```python
new_employees = [
    {
        "employee_id": 6,
        "first_name": "Fiona",
        "last_name": "Green",
        "email": "fiona.green@example.com",
        "phone_number": "111-222-3333",
        "hire_date": date(2021, 9, 1),
        "job_title": "DevOps Engineer",
        "salary": 105000.0,
        "is_manager": False,
    }
]

for row in new_employees:
    with engine.begin() as connection:
        connection.execute(insert(employee_table).values(**row))
```

Add more tables:

```python
# Define a department table
department_table = Table(
    "department", metadata_obj,
    Column("dept_id", Integer, primary_key=True),
    Column("dept_name", String(100), nullable=False),
    Column("budget", Float),
    Column("head_employee_id", Integer),
)
metadata_obj.create_all(engine)

# Update SQLDatabase to include both tables
sql_database = SQLDatabase(engine, include_tables=["employee", "department"])

# Add both to ObjectIndex for Approach 2
table_schema_objs = [
    SQLTableSchema(table_name="employee",
                   context_str="Contains employee personal details, salaries, and roles"),
    SQLTableSchema(table_name="department",
                   context_str="Contains department names, budgets, and department heads"),
]
```

Use a different database:

```python
# PostgreSQL example
from sqlalchemy import create_engine

engine = create_engine("postgresql://username:password@host:5432/database_name")
sql_database = SQLDatabase(engine, include_tables=["your_table"])
# All other code remains exactly the same!
```

Run a batch of queries:

```python
queries = [
    "Who is the highest-paid employee?",
    "How many employees were hired after 2021?",
    "List all managers in the company.",
    "What is the average salary by job title?",
    "Which employees have been with the company for more than 3 years?",
]

for q in queries:
    response = query_engine.query(q)
    print(f"Q: {q}")
    print(f"A: {response.response}\n")
```

Query: "How many employees have salary greater than 70000?"
SQL Generated:
SELECT COUNT(*) FROM employee WHERE salary > 70000
Raw Result:
[(3,)]
Natural Language Answer:
"There are 3 employees with a salary greater than $70,000."
Query: "Who are the employees with salary greater than 70000?"
SQL Generated:
SELECT first_name, last_name, salary, job_title
FROM employee
WHERE salary > 70000
Raw Result:
[('Alice', 'Johnson', 85000.0, 'Software Developer'),
('Bob', 'Smith', 120000.0, 'Project Manager'),
('Charlie', 'Brown', 95000.0, 'Data Scientist')]
Natural Language Answer:
"The employees with salaries greater than $70,000 are:
Alice Johnson (Software Developer, $85,000),
Bob Smith (Project Manager, $120,000), and
Charlie Brown (Data Scientist, $95,000)."
Query: "Who is the manager in the company?"
SQL Generated:
SELECT first_name, last_name, job_title
FROM employee
WHERE is_manager = 1
Raw Result:
[('Bob', 'Smith', 'Project Manager')]
Natural Language Answer:
"The manager in the company is Bob Smith, who holds
the position of Project Manager."
What happens: create_engine("sqlite:///:memory:") creates a database that exists only in RAM. Every time the notebook restarts or the kernel resets, all table definitions and data are permanently deleted. The entire setup (create tables + insert rows) must be re-run from scratch.
Why it matters: In a real application, you connect to a persistent database. The in-memory setup is only appropriate for demos and testing.
What happens: The LLM generates SQL that is executed directly against the database without sanitization. A carefully crafted natural language input could theoretically prompt the LLM to generate destructive SQL (DROP TABLE, DELETE FROM, UPDATE without WHERE clause).
Why it matters: For any user-facing application, raw LLM-generated SQL must be validated before execution.
What happens: The LLM is not a SQL compiler. For complex queries involving multiple JOINs, subqueries, window functions, or database-specific syntax, the generated SQL may be syntactically incorrect or logically wrong β and the engine will execute whatever SQL was generated without pre-validation.
What happens: Approach 1 injects the full schema of all specified tables into the LLM prompt. With 20+ large tables, the combined schema text can exceed gpt-4o-mini's 128K context window, causing a context_length_exceeded error.
What happens: Neither query engine has built-in guardrails preventing the LLM from generating INSERT, UPDATE, DELETE, or DROP statements. A prompt like "Delete all employees with salary less than 50000" could trigger a destructive query.
What happens: The current setup has only one table. Real database queries often need JOINs across multiple related tables. While LlamaIndex supports multi-table queries, the demo doesn't demonstrate this capability.
What happens: By default, you only see response.response (natural language) and response.metadata["result"] (raw tuples). The actual SQL query generated is not prominently displayed β making debugging SQL errors difficult.
What happens: temperature=0.1 means there is still a small probability of variability in SQL generation. The same question asked twice might occasionally produce slightly different SQL queries, leading to inconsistent results.
What happens: In Approach 2, similarity_top_k=1 retrieves only the single most similar table. If a query spans two tables and only one is retrieved, the generated SQL will miss the second table entirely and produce an incomplete or incorrect answer.
What happens: Every call to query_engine.query() makes at minimum 2 LLM API calls: one for SQL generation and one for response synthesis. Identical questions asked repeatedly each incur the full API cost.
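A rough per-query cost estimate makes the caching argument concrete. The token counts and per-token prices below are illustrative assumptions only – check current OpenAI pricing before relying on them:

```python
# Illustrative assumptions -- not current OpenAI prices.
INPUT_PRICE_PER_1M = 0.15   # $ per 1M input tokens (assumed)
OUTPUT_PRICE_PER_1M = 0.60  # $ per 1M output tokens (assumed)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one LLM call under the assumed prices."""
    return (input_tokens * INPUT_PRICE_PER_1M
            + output_tokens * OUTPUT_PRICE_PER_1M) / 1_000_000

# Two LLM calls per query: SQL generation + response synthesis
sql_gen = call_cost(input_tokens=600, output_tokens=30)    # schema + question
synthesis = call_cost(input_tokens=200, output_tokens=60)  # result + question
per_query = sql_gen + synthesis
print(f"${per_query:.6f} per query")  # $0.000174 per query
print(f"${per_query * 100_000:.2f} per 100k identical, uncached queries")
```

Each query is fractions of a cent, but repeated identical questions multiply the bill for no new information – which is exactly what the caching fix below eliminates.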
```python
# Option A: Persistent SQLite file (simple, single-user)
engine = create_engine("sqlite:///employees.db")

# Option B: PostgreSQL (production, multi-user)
engine = create_engine("postgresql://user:password@localhost:5432/hr_db")

# Option C: Connect to an existing database (no schema creation needed)
engine = create_engine("postgresql://user:pass@host/existing_db")
sql_database = SQLDatabase(engine, include_tables=["employees", "departments"])
```

Validate SQL before executing it:

```python
import re
from sqlalchemy import text

def safe_execute(sql: str, engine) -> list:
    """Block any SQL that modifies data – only allow SELECT."""
    sql_upper = sql.strip().upper()
    dangerous_keywords = ["INSERT", "UPDATE", "DELETE", "DROP", "ALTER",
                          "CREATE", "TRUNCATE", "REPLACE"]
    for keyword in dangerous_keywords:
        if re.search(r'\b' + keyword + r'\b', sql_upper):
            raise ValueError(f"Blocked dangerous SQL operation: {keyword}")
    with engine.connect() as conn:
        return conn.execute(text(sql)).fetchall()
```

Inspect the generated SQL:

```python
response = query_engine.query("How many employees have salary > 70000?")

# Print all available metadata
print("Answer:", response.response)
print("Raw result:", response.metadata.get("result"))
print("SQL query:", response.metadata.get("sql_query", "Not available"))
```

Retrieve more than one table:

```python
query_engine = SQLTableRetrieverQueryEngine(
    sql_database,
    obj_index.as_retriever(similarity_top_k=2)  # retrieve top-2 tables
)
# Now cross-table queries can find both relevant tables
```

Make SQL generation fully deterministic:

```python
llm = OpenAI(temperature=0, model="gpt-4o-mini")
# temperature=0 – most deterministic SQL generation
# Same question always produces identical SQL
```

Cache repeated queries:

```python
import hashlib

query_cache = {}

def cached_query(query_engine, question: str):
    """Cache query results to avoid redundant LLM API calls."""
    cache_key = hashlib.md5(question.encode()).hexdigest()
    if cache_key in query_cache:
        print("[Cache hit] Returning cached result")
        return query_cache[cache_key]
    result = query_engine.query(question)
    query_cache[cache_key] = result
    return result

response = cached_query(query_engine, "How many employees earn over 70000?")
```

Model related tables for JOINs:

```python
# Define related tables
department_table = Table("department", metadata_obj,
    Column("dept_id", Integer, primary_key=True),
    Column("dept_name", String(100)),
    Column("manager_id", Integer),  # FK to employee.employee_id
)

# Add both to ObjectIndex with descriptive context
table_schema_objs = [
    SQLTableSchema(
        table_name="employee",
        context_str="Employee details including salary, title, and department assignment"
    ),
    SQLTableSchema(
        table_name="department",
        context_str="Department names and their assigned manager employee IDs"
    ),
]

# Query engine will now generate JOINs when needed
response = query_engine.query(
    "Which department does Alice Johnson work in?"
)
```

Build a simple UI:

```python
import gradio as gr

def query_database(question: str) -> str:
    """Natural language database query interface."""
    try:
        response = query_engine.query(question)
        sql = response.metadata.get("sql_query", "N/A")
        return f"**Answer:** {response.response}\n\n**SQL Used:** `{sql}`"
    except Exception as e:
        return f"Error: {str(e)}"

gr.Interface(
    fn=query_database,
    inputs=gr.Textbox(placeholder="Ask anything about employees...",
                      label="Your Question"),
    outputs=gr.Markdown(label="Answer"),
    title="Employee Database Assistant",
    examples=[
        "How many employees have salary greater than 70000?",
        "Who is the highest paid employee?",
        "List all managers in the company",
    ]
).launch()
```

Text-to-SQL is the task of converting a natural language question ("Who earns the most?") into a valid SQL query (`SELECT first_name, salary FROM employee ORDER BY salary DESC LIMIT 1`). LLMs are very good at this because they were trained on vast amounts of SQL code.
SQLAlchemy is Python's most popular database library. It lets you define database tables as Python classes/objects, insert and query data using Python syntax, and connect to almost any SQL database by just changing the connection string – no need to write raw SQL for basic operations.
An in-memory database like sqlite:///:memory: stores all data in RAM instead of on disk. It is extremely fast but all data is lost when the program ends. It is ideal for testing, demos, and temporary data β not for production systems.
A database schema is the structure definition of a table β its name, column names, column types, and constraints (primary key, NOT NULL, UNIQUE). LlamaIndex reads the schema automatically and sends it to the LLM so the LLM knows what columns it can filter, sort, and aggregate.
context_str is a plain English description you write for each table, explaining what data it contains. This description is embedded into the VectorStoreIndex alongside the schema. When a user asks a question, the semantic similarity between the question and the context_str determines which table is retrieved.
response.metadata["result"] contains the raw output of the SQL query β a list of tuples exactly as returned by the database engine. For example, [(3,)] means one row was returned with one column containing the value 3. The LLM then converts this raw result into the natural language response.response.
Each query makes two LLM calls:
- SQL Generation – LLM receives the schema + user question, outputs SQL
- Response Synthesis – LLM receives the SQL result + original question, outputs natural language
This two-step process ensures the answer is both accurate (grounded in real SQL results) and readable (expressed in natural language).
| Approach | Who writes queries | Speed | Accuracy | Technical skill |
|---|---|---|---|---|
| Text-to-SQL (this project) | AI from natural language | Fast | High for simple queries | None |
| Manual SQL | Developer writes SQL | Slow | Exact | High |
| ORM (SQLAlchemy Python API) | Developer writes Python | Medium | Exact | Medium |
| Spreadsheet filter | Anyone (UI) | Instant | Exact | None |
| BI Tool (Tableau, Power BI) | Anyone (UI) | Fast | Exact | Low |
Text-to-SQL fills the gap between BI tools (limited to pre-built reports) and manual SQL (requires developer time) – enabling ad-hoc queries from anyone, instantly.
| Use Case | Natural Language Query Example |
|---|---|
| 👥 HR Management | "How many employees were hired this year?" |
| 💰 Finance | "What is the total salary budget by department?" |
| 📊 Executive reporting | "Who are our top 5 highest-paid employees?" |
| 🏥 Healthcare | "How many patients were admitted last month?" |
| 🛒 E-commerce | "Which products sold more than 100 units today?" |
| 📦 Inventory | "List all items with stock below reorder level" |
| 🎓 Education | "How many students passed the final exam?" |
```python
# Add 50+ employees to stress-test query quality
# Add date-range queries, aggregations by department
```

```python
# Create department, project, performance_review tables
# Test queries like "Which projects are managed by employees earning > 100K?"
```

```python
engine = create_engine("postgresql://user:pass@localhost/company_db")
# All LlamaIndex code remains the same
```

```python
import gradio as gr
# Non-technical users interact via browser
# Show both natural language answer AND generated SQL
```

```python
# Validate generated SQL before execution
# Block INSERT/UPDATE/DELETE
# Add query result caching
```
- 🔒 Add a read-only guard that blocks non-SELECT SQL before execution
- 🗄️ Connect to a real PostgreSQL or MySQL database
- 🔗 Add more tables and demonstrate multi-table JOIN queries
- 🖥️ Build a Gradio UI where anyone can query the database via browser
- 💾 Add response caching to avoid redundant LLM calls
- 🛡️ Set `temperature=0` and benchmark SQL correctness
- 📝 Log all generated SQL queries to a file for auditing
- 📊 Add chart generation – query results plotted with matplotlib
To contribute:
```shell
git checkout -b feature/add-gradio-ui
git commit -m "Add Gradio interface for non-technical database querying"
git push origin feature/add-gradio-ui
# Then open a Pull Request on GitHub
```

This project is open-source and available under the MIT License.
- LlamaIndex – for the comprehensive Text-to-SQL pipeline
- OpenAI – for the gpt-4o-mini model
- SQLAlchemy – for the Python SQL toolkit
- SQLite – for the lightweight embedded database engine
Built with ❤️ using LlamaIndex, SQLAlchemy, and OpenAI