NL2SQL translates natural language queries into executable SQL statements using a hybrid retrieval-augmented generation (RAG) pipeline. It combines semantic and keyword-based retrieval to pull schema context from the database, and uses an LLM (OpenAI GPT-4o-mini) to generate SQL queries. The project also executes the generated SQL and visualizes the results, making it a complete NL-to-Insight pipeline.
```mermaid
graph TD
    A[Natural Language Query] --> B{Check Cache}
    B -- Hit --> C[Return Cached SQL + Result]
    B -- Miss --> D[Retrieve Schema Context]
    D --> E["Hybrid Retriever (FAISS + BM25)"]
    E --> F[Generate SQL with OpenAI GPT-4o-mini]
    F --> G[Execute SQL on SQLite DB]
    G --> H[Return Result]
    H --> I[Store in TinyDB Cache]
```
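The cache check, execution, and storage stages in the diagram amount to a small glue loop. Here is a minimal, illustrative sketch of it; the TinyDB and sqlite3 calls are real APIs, but `answer`, `generate_sql`, and the record fields are hypothetical names, not the project's actual code.

```python
import sqlite3

from tinydb import TinyDB, Query

cache = TinyDB("query_cache.json")  # persistent NL -> SQL/result cache
Q = Query()

def answer(question: str, generate_sql) -> dict:
    # Cache hit: return the stored SQL and result without touching the LLM.
    hits = cache.search(Q.question == question)
    if hits:
        return hits[0]

    # Cache miss: retrieval + generation happen inside generate_sql(...).
    sql = generate_sql(question)

    # Execute the generated SQL against the sample Chinook database.
    with sqlite3.connect("Data/Chinook.sqlite") as conn:
        rows = conn.execute(sql).fetchall()

    # Store the pair so the next identical question is a cache hit.
    record = {"question": question, "sql": sql, "result": rows}
    cache.insert(record)
    return record
```

The pipeline is built from the following stack: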
- LangChain v0.2+: For LLM pipelines and hybrid retrieval
- OpenAI GPT-4o-mini: For natural language to SQL generation
- HuggingFace Embeddings: For semantic search
- FAISS: Vector-based semantic retrieval
- BM25: Keyword-based retrieval
- TinyDB: Local caching of user queries and results
- SQLite: Target database for SQL execution
- Streamlit (optional): Interactive frontend
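To make the hybrid retriever concrete, the sketch below wires FAISS and BM25 together with LangChain's `EnsembleRetriever`. The schema documents, embedding model, `k` values, and 50/50 weights are placeholder assumptions; the project's actual chunking and weighting may differ (`BM25Retriever` also requires the `rank_bm25` package).

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings

# Placeholder schema docs; the real ones come from SQLite schema extraction.
schema_docs = [
    Document(page_content="Table Customer(CustomerId, FirstName, LastName, Country, SupportRepId)"),
    Document(page_content="Table Employee(EmployeeId, FirstName, LastName, Title)"),
]

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
semantic = FAISS.from_documents(schema_docs, embeddings).as_retriever(search_kwargs={"k": 4})

keyword = BM25Retriever.from_documents(schema_docs)
keyword.k = 4

# Blend semantic and keyword rankings into a single ranked list of documents.
hybrid = EnsembleRetriever(retrievers=[semantic, keyword], weights=[0.5, 0.5])
docs = hybrid.invoke("Which employees have helped the most customers in Canada?")
```

Example questions the pipeline can handle: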
- Which employees have helped the most customers in Canada?
- List all customers from USA.
- Show invoices issued in the year 2010.
- How many albums were sold by each employee in Germany?
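As an illustration of the generation step, the second question above could be turned into SQL roughly like this. The prompt wording is an assumption; the `ChatOpenAI` call matches LangChain's documented API, and the expected output reflects the Chinook `Customer` table.

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# In the real pipeline this context comes from the hybrid retriever.
schema_context = "Table Customer(CustomerId, FirstName, LastName, Country)"
prompt = (
    f"Given this schema:\n{schema_context}\n"
    "Write a single SQLite query for: List all customers from USA.\n"
    "Return only the SQL."
)

sql = llm.invoke(prompt).content
# Expected output along the lines of:
#   SELECT * FROM Customer WHERE Country = 'USA';
```

To set the project up: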
- Clone the Repository
  ```bash
  git clone https://github.com/your-username/nl2sql.git
  cd nl2sql
  ```
- Create a Virtual Environment
  ```bash
  conda create -n nl2sqlenv python=3.10
  conda activate nl2sqlenv
  ```
- Install Dependencies
  ```bash
  pip install -r requirements.txt
  ```
- Set the Environment Variable
  Create a `.env` file in the root directory with your OpenAI key (a loading snippet is sketched after these steps):
  ```
  OPENAI_API_KEY=your_openai_api_key
  ```
  Or set it manually:
  ```bash
  export OPENAI_API_KEY=your_openai_api_key
  ```
- Run the Main Script
  ```bash
  python query_engine.py
  ```
- (Optional) Run the Streamlit App
  ```bash
  streamlit run app.py
  ```
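How the key gets from `.env` into the process is not shown above; a common approach, assuming `python-dotenv` is listed in `requirements.txt`, is a couple of lines at the top of `query_engine.py`:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY from .env into the environment
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
```

Project layout: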
```
├── app.py              # (Optional) Streamlit frontend
├── query_engine.py     # Main logic for NL to SQL conversion
├── rag_utils.py        # Schema extraction + hybrid retriever
├── db_utils.py         # SQLite schema parsing
├── cache_utils.py      # TinyDB cache management
├── vectorstore_dir/    # FAISS index + BM25 docs
├── Data/
│   └── Chinook.sqlite  # Sample SQLite database
└── query_cache.json    # Cached NL → SQL + results
```
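`db_utils.py` is described as handling SQLite schema parsing; a minimal version of that idea, reading each table's DDL straight from `sqlite_master`, might look like this (the function name is hypothetical):

```python
import sqlite3

def extract_schema(db_path: str = "Data/Chinook.sqlite") -> list[str]:
    """Return one CREATE TABLE statement per table, to use as retrieval docs."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
        ).fetchall()
    return [ddl for (ddl,) in rows]
```

Current feature status: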
- ✅ Hybrid Retriever (semantic + keyword)
- ✅ Caching of queries and results
- ❌ Multi-turn interaction (can be added later)
- ❌ User authentication / query history per user
- ❌ Support for multiple databases
For questions or suggestions, feel free to reach out via LinkedIn or raise an issue in the repo.
💡 This project is a great portfolio piece to showcase LLMOps, retrieval-augmented generation, and data engineering capabilities.