A proof of concept demonstrating that, given a data dictionary and an agentic AI, you can answer business questions in natural language, even when your schema uses cryptic column names.
Business data is often stored with technical or obscure column names (e.g., geo_c, chn_x, st_f). A traditional BI tool or analyst needs to know these mappings. This POC shows that an LLM agent, equipped with:
- A data dictionary — plain-language definitions of tables and columns
- Agentic tools — ability to inspect schema, read the dictionary, and run SQL in a loop
can interpret natural-language questions, figure out the right mappings, write correct SQL, and return clear answers—without hardcoding queries or mappings.
Maintaining good metadata — clear definitions, valid values, and examples for each column — is what makes this POC possible. Without it, the agent would be guessing at column meanings and likely produce wrong results or hallucinate mappings.
Data governance — consistent definitions, ownership, and change control — ensures the data dictionary stays accurate as schemas evolve. Poor or outdated metadata leads to AI that confidently gives wrong answers.
This POC illustrates that metadata and governance are the foundation for effective AI use. Invest in that foundation first; then agentic AI, natural-language querying, and self-service analytics become much more reliable and scalable.
- You ask a natural-language question (e.g., "What's the best-selling product?").
- The LLM receives the question and decides which tools to call.
- Tools the agent can use:
  - `get_table_columns` — list columns and types for a table
  - `get_data_dictionary` — fetch plain-language definitions for a table's columns
  - `run_sql` — execute a SQL statement and return results
- Agent loop — the model calls tools, receives results, and can call more tools until it has enough to answer (up to 15 turns).
- Output — the agent returns an explanation, the SQL it ran, and the final answer.
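The three tools above can be sketched over SQLite roughly as follows. This is a minimal illustration, not the demo's actual implementation; the table name `sales` and the `data_dictionary` column names are assumptions based on the schema and dictionary descriptions below.

```python
import sqlite3

def get_table_columns(conn, table_name):
    """List column names and types for a table via PRAGMA table_info."""
    rows = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
    return [{"name": r[1], "type": r[2]} for r in rows]

def get_data_dictionary(conn, table_name):
    """Fetch plain-language definitions for a table's columns."""
    rows = conn.execute(
        "SELECT column_name, data_type, definition, example "
        "FROM data_dictionary WHERE table_name = ?",
        (table_name,),
    ).fetchall()
    keys = ("column_name", "data_type", "definition", "example")
    return [dict(zip(keys, r)) for r in rows]

def run_sql(conn, query):
    """Execute a SQL statement and return the result rows."""
    return conn.execute(query).fetchall()
```

In the real agent loop, each function's result is serialized (e.g., as JSON) and fed back to the model as a tool-call response so it can decide what to call next.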
The column names in the schema are intentionally cryptic (e.g., st_f, geo_c, chn_x). The agent must use the data dictionary to map your business question to the right columns and values.
The demo loads two tables into SQLite (agent_demo.db): a sales table and a data dictionary. The sales table's columns:
| Column | Type | Description |
|---|---|---|
| `tx_id` | INTEGER | Transaction ID (unique key) |
| `d_k` | TEXT | Transaction date (YYYY-MM-DD) |
| `geo_c` | TEXT | Region code: NE, SE, MW, W |
| `chn_x` | TEXT | Channel code: W=Web, S=Store, M=Mobile |
| `sku_z` | TEXT | Product SKU (e.g., SKU-001) |
| `cst_r` | TEXT | Customer reference ID |
| `u_n` | INTEGER | Units sold |
| `v_n` | REAL | Net sales (USD, after discounts) |
| `v_g` | REAL | Gross sales (USD, before discounts) |
| `st_f` | TEXT | Status: P=Paid, R=Refunded |
~135 rows of synthetic sales data across 4 regions, 3 channels, 4 SKUs, and 5 customers, spanning about 45 days.
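A hypothetical sketch of how such synthetic rows could be generated; the demo's actual generation logic, date range, and random seed are not shown in this README and are assumed here for illustration.

```python
import random
from datetime import date, timedelta

REGIONS = ["NE", "SE", "MW", "W"]          # geo_c
CHANNELS = ["W", "S", "M"]                 # chn_x
SKUS = [f"SKU-{i:03d}" for i in range(1, 5)]       # sku_z
CUSTOMERS = [f"CUST-{i:03d}" for i in range(1, 6)]  # cst_r (IDs assumed)

def make_sales_rows(n=135, start=date(2024, 2, 1), days=45, seed=42):
    """Generate n synthetic sales rows matching the cryptic schema above."""
    rng = random.Random(seed)
    rows = []
    for tx_id in range(1, n + 1):
        d = start + timedelta(days=rng.randrange(days))
        units = rng.randint(1, 10)                       # u_n
        gross = round(units * rng.uniform(10, 50), 2)    # v_g
        net = round(gross * rng.uniform(0.85, 1.0), 2)   # v_n (after discounts)
        status = "R" if rng.random() < 0.1 else "P"      # st_f, ~10% refunds
        rows.append((tx_id, d.isoformat(), rng.choice(REGIONS),
                     rng.choice(CHANNELS), rng.choice(SKUS),
                     rng.choice(CUSTOMERS), units, net, gross, status))
    return rows
```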
The data dictionary table stores business definitions for each column: table_name, column_name, data_type, definition, and example. The agent uses this to understand what columns mean without prior knowledge.
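The dictionary table can be sketched directly from the column list above. The DDL and sample rows below are an assumed reconstruction (the README does not show the demo's exact DDL), with definitions taken verbatim from the schema table:

```python
import sqlite3

# Assumed DDL for the data dictionary table described above.
DDL = """
CREATE TABLE IF NOT EXISTS data_dictionary (
    table_name  TEXT,
    column_name TEXT,
    data_type   TEXT,
    definition  TEXT,
    example     TEXT
)
"""

# Two sample rows; definitions match the sales schema table.
SAMPLE_ROWS = [
    ("sales", "geo_c", "TEXT", "Region code: NE, SE, MW, W", "NE"),
    ("sales", "st_f",  "TEXT", "Status: P=Paid, R=Refunded", "P"),
]

conn = sqlite3.connect(":memory:")  # the demo uses agent_demo.db
conn.execute(DDL)
conn.executemany("INSERT INTO data_dictionary VALUES (?, ?, ?, ?, ?)", SAMPLE_ROWS)
```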
Ask in plain language:
- "What's the best-selling product? Exclude refunds."
- "Who is the best customer by revenue?"
- "Which region had the highest net sales last week?"
- "Compare sales by channel: Web vs Store vs Mobile."
- "What was total revenue in February?"
- "List top 3 products by units sold."
- "How many refunded transactions are there?"
- "Which customer bought the most units?"
The agent will call the data dictionary, build the right SQL, and return the answer.
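For instance, for "What's the best-selling product? Exclude refunds," the agent might map "product" to sku_z and "refunds" to st_f = 'R' via the dictionary and emit SQL along these lines. This is an illustrative sketch with made-up rows, not the agent's literal output:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sku_z TEXT, u_n INTEGER, st_f TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("SKU-001", 5, "P"), ("SKU-002", 9, "P"),
     ("SKU-002", 3, "R"), ("SKU-001", 2, "P")],
)

# The kind of query the agent could derive from the data dictionary.
query = """
SELECT sku_z, SUM(u_n) AS total_units
FROM sales
WHERE st_f = 'P'            -- dictionary: P = Paid, R = Refunded
GROUP BY sku_z
ORDER BY total_units DESC
LIMIT 1
"""
best = conn.execute(query).fetchone()
print(best)  # ('SKU-002', 9)
```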
- Python 3.9+
- OpenAI API key in `.env`
- Packages in `requirements.txt`
- Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment variables:

  ```bash
  cp .env.example .env
  ```

  Then edit `.env` and set your `OPENAI_API_KEY`.

- Run the Gradio app:

  ```bash
  python sql_agent_gradio.py
  ```

  Or run the original notebook:

  ```bash
  jupyter lab
  ```

  Open `5_extra copy.ipynb` and run the cells.
- `sql_agent_gradio.py` — Gradio app; prompts come from the UI instead of being hardcoded
- `5_extra copy.ipynb` — Original notebook with the agent loop and trace output
- `agent_demo.db` — Created at runtime; sample sales data with a data dictionary
- The demo creates `agent_demo.db` automatically on startup.
- The Gradio app lets you edit the system and user prompts to test different questions and behaviors.