A robust system for converting natural language queries into precise SQL statements using a multi-stage pipeline that combines Retrieval-Augmented Generation (RAG) and Large Language Model (LLM) processing.
This project implements an advanced natural language to SQL conversion system that enhances query generation through a specialized pipeline:
- Table Selection: Identifies the most relevant tables needed for a given query
- Column Pruning: Determines the essential columns from those tables
- SQL Generation: Constructs accurate SQL queries using the filtered schema and similar examples
The system combines vector database similarity search (RAG) with strategic LLM prompting to achieve improved results compared to single-step conversion approaches.
The pipeline consists of the following components:
- RAG-Based Table Selection: Uses vector similarity to identify potentially relevant tables
- LLM-Based Table Refinement: Determines which tables are truly necessary for the query (a hedged sketch of this step follows the list)
- Column Pruning: Filters columns to include only those needed for the query
- Similar Query Retrieval: Finds analogous queries from the training data to serve as examples
- Final SQL Generation: Creates the SQL query using the pruned schema and example queries
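As a hedged illustration of how one of these LLM stages might be prompted (the actual prompts live in single_pipeline.py and may differ), a simplified table-refinement call against the Groq API could look like this:

```python
# Illustrative sketch only: the prompt wording, model name, and simplified
# return type are assumptions, not the project's exact implementation
# (the real table_agent also returns a token count).
from groq import Groq

client = Groq(api_key="your_groq_api_key")

def refine_tables(nl_query: str, candidate_tables: list[str]) -> list[str]:
    prompt = (
        "Given the question and candidate tables below, return only the "
        "tables strictly required to answer it, as a comma-separated list.\n"
        f"Question: {nl_query}\n"
        f"Candidate tables: {', '.join(candidate_tables)}"
    )
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
    )
    return [t.strip() for t in resp.choices[0].message.content.split(",")]
```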
- Vector database for storing embeddings of table schemas and example queries
- PostgreSQL integration for storing and retrieving data
- Embedding generation using text-embedding-ada-002
- LLM-powered processing via the Groq API with Qwen and Llama models
- Multi-stage prompting strategy for improved query accuracy
- Schema embedding and retrieval (see the retrieval sketch after this list)
- Similar query identification based on vector similarity
- Integration of retrieved context into LLM prompting
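For intuition, here is a minimal sketch of that retrieval step; the connection string, the schema_embeddings table, and the function shapes are invented for illustration, and the project's actual logic lives in RAG/embedding_creator.py:

```python
# Minimal sketch: embed the query with text-embedding-ada-002 and rank table
# schemas by cosine distance in pgvector. All names here are illustrative.
from openai import AzureOpenAI
import psycopg2

client = AzureOpenAI(
    api_key="your_azure_openai_key",
    api_version="2023-05-15",
    azure_endpoint="https://your-resource.openai.azure.com",
)

def embed(text: str) -> list[float]:
    # text-embedding-ada-002 produces a 1536-dimensional vector
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

def search_tables(nl_query: str, top_k: int = 5) -> list[str]:
    query_vec = embed(nl_query)
    with psycopg2.connect("dbname=textsql user=postgres") as conn:
        with conn.cursor() as cur:
            # "<=>" is pgvector's cosine-distance operator
            cur.execute(
                "SELECT table_name FROM schema_embeddings "
                "ORDER BY embedding <=> %s::vector LIMIT %s",
                (str(query_vec), top_k),
            )
            return [row[0] for row in cur.fetchall()]
```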
- main.py: Orchestrates the complete pipeline and evaluates query accuracy
- single_pipeline.py: Implements the individual pipeline components (table selection, column pruning, SQL generation)
- RAG/embedding_creator.py: Handles embedding generation and vector search for tables and example queries
- RAG/db_conf.py: Database connection configuration
- test_schema_embedding.py: Utilities for schema embedding and vector database operations
| Metric Type | Component | Performance |
|---|---|---|
| Tokens | Table Selection | ~500 |
| Tokens | Column Pruning | ~800 |
| Tokens | SQL Generation | ~1200 |
| Time | Table Selection | 0.8 s |
| Time | Column Pruning | 1.2 s |
| Time | SQL Generation | 1.5 s |
| Total | Tokens/Query | ~2500 |
| Total | Time/Query | 3.5 s |
The project features an interactive web interface built with Streamlit, allowing users to easily:
- Input natural language queries
- Visualize each stage of the Text-to-SQL pipeline
- See generated SQL queries and performance metrics in real-time
The interface is implemented in view.py, which:
- Provides a clean dashboard for query input
- Shows step-by-step processing with visual cues
- Displays performance metrics for each stage
- Presents the final SQL query with syntax highlighting
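A minimal sketch of such a dashboard, wired to the pipeline functions documented below, might look like this (illustrative only; the shipped view.py adds visual cues and per-stage metrics and may be structured differently):

```python
# Illustrative Streamlit dashboard built on the pipeline's public functions;
# the actual view.py may differ.
import streamlit as st
from single_pipeline import table_agent, prune_agent, final_sql_query_generator
import RAG.embedding_creator as search_api

st.title("Text-to-SQL Pipeline")
nl_query = st.text_input("Enter a natural language query")

if nl_query:
    with st.spinner("Selecting tables..."):
        candidates = search_api.search_tables(nl_query)
        tables, _ = table_agent(nl_query, candidates)
    st.write("Selected tables:", tables)

    with st.spinner("Pruning columns..."):
        schema, _ = prune_agent(nl_query, tables)
    st.write("Pruned schema:", schema)

    with st.spinner("Generating SQL..."):
        examples = search_api.search_similar_query(nl_query)
        sql, _ = final_sql_query_generator(nl_query, schema, examples)
    st.code(sql, language="sql")
```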
- Python 3.10+
- PostgreSQL database with pgvector extension
- Required Python packages (see requirements.txt)
- API keys for LLM services (Groq, Azure OpenAI)
- Clone the repository:

  ```bash
  git clone https://github.com/dash-dash-org/Adobe.git
  cd Adobe
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables:

  ```bash
  # Create a .env file with the necessary API keys
  GROQ_API_KEY_RAG=your_api_key
  AZURE_OPENAI_API_KEY_GPT4=your_api_key
  AZURE_OPENAI_ENDPOINT=your_endpoint
  ```
Launch the user-friendly web interface:

```bash
streamlit run view.py
```

This opens an interactive dashboard where you can:
- Enter natural language queries in plain English
- Watch as the system processes each stage of the pipeline
- View the selected tables and columns
- Get the final SQL query with syntax highlighting
- See detailed performance metrics for each step
Run the main script with your test data:

```bash
python main.py
```

Or use the single pipeline directly in your own code:
```python
from single_pipeline import table_agent, prune_agent, final_sql_query_generator
import RAG.embedding_creator as search_api

nl_query = "Your natural language question here"

# Get relevant tables through RAG
relevant_tables = search_api.search_tables(nl_query)

# Refine tables through LLM
refined_tables, tokens = table_agent(nl_query, relevant_tables)

# Prune columns
relevant_schema, tokens = prune_agent(nl_query, refined_tables)

# Get similar queries as examples
similar_queries = search_api.search_similar_query(nl_query)

# Generate SQL query
sql_query, tokens = final_sql_query_generator(nl_query, relevant_schema, similar_queries)
```

The system processes natural language queries through multiple stages:
- Initial RAG Search: Converts the query to an embedding and finds similar tables in the vector database
- Table Agent: Uses an LLM to analyze which tables are actually required
- Column Pruning: Determines which columns from the selected tables are necessary
- Similar Query Search: Finds examples of similar queries and their SQL translations
- SQL Generation: Combines the pruned schema and examples to generate the final SQL
This multi-step process achieves better results than single-step conversion by breaking down the complex task into manageable components.
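For intuition, here is how a query might flow through those stages; the table names, columns, and resulting SQL below are invented for illustration:

```python
nl_query = "How many orders did each customer place in 2023?"

# Stages 1-2: RAG search plus the table agent narrow the schema down to, say:
refined_tables = ["customers", "orders"]

# Stage 3: column pruning keeps only the fields the question needs:
relevant_schema = {
    "customers": ["customer_id", "name"],
    "orders": ["order_id", "customer_id", "order_date"],
}

# Stage 5: the generator combines the pruned schema with retrieved examples:
sql_query = """
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_date >= '2023-01-01' AND o.order_date < '2024-01-01'
GROUP BY c.name;
"""
```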
Contributions are welcome! Please feel free to submit a Pull Request.
- Groq for providing fast LLM inference
- Azure OpenAI for embedding generation
- pgvector for efficient vector similarity search


