Requires Python 3.10. Install the required dependencies using the provided requirements file:
pip install -r requirements.txt
Download the required models from HuggingFace to the models/ directory:
mkdir -p models/generator
huggingface-cli download Qwen/Qwen2.5-Coder-32B-Instruct \
--local-dir models/generator
mkdir -p models/reward
huggingface-cli download sheshansh-ctx/ctx-bird-reward-250121 \
--local-dir models/reward
Download and preprocess the BIRD benchmark dataset:
python src/prep_data.py
This will:
- Download the BIRD development dataset
- Extract and organize files in the data/ directory
- Generate database schemas and processed JSONL output
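To sanity-check the preprocessing step, it can help to look at the first record of the processed JSONL file. The snippet below is a minimal sketch, not part of the repository: `read_jsonl` is a hypothetical helper, and the path `data/test_all.jsonl` is assumed from the pipeline commands in this README. It makes no assumptions about field names and only prints whichever keys are present.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


if __name__ == "__main__":
    # Assumed location of the processed output; adjust if prep_data.py
    # writes elsewhere in your checkout.
    path = Path("data/test_all.jsonl")
    if path.exists():
        first = next(read_jsonl(path), None)
        if first is not None:
            # Field names depend on prep_data.py, so just list them.
            print(sorted(first.keys()))
```

If the file is missing or empty, the script prints nothing, which usually means the preprocessing step did not complete.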
Requires at least 2 GPUs with 80 GB of memory each. The pipeline consists of four main stages:
- Candidate Generation: Generate multiple SQL query candidates
python src/generate.py --input_file data/test_all.jsonl --output_dir output/generations/ --num_gpus 2
- SQL Execution: Execute the SQL candidates
python src/process_sqls.py --input_file data/test_all.jsonl --generations_dir output/generations/ --output_dir output/with_results/ --compare_against_gt --sql_timeout 30.0
Note: The --sql_timeout parameter should be tuned based on your database performance and CPU capabilities. SQL queries that exceed the timeout are filtered in subsequent steps. For slower hardware, consider increasing the timeout value to avoid excessive query filtering.
Note: The script is written so that SQL execution can run in parallel with generation; it waits for the generation job to produce more candidates.
- Reward-Based Scoring: Score candidates using the reward model
VLLM_USE_V1=0 time python src/reward.py --input_file output/with_results/data_with_results.jsonl --output_dir output/with_rewards --num_gpus 2
- Analysis: Choose the highest-scoring SQL for final output
python src/analysis.py --rewards_dir output/with_rewards --gt_sql_file data/test_gold_sqls.txt --output_dir output/analysis --num_cpus 100
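The idea behind the execution and analysis stages can be illustrated with a small, self-contained sketch: run each candidate with a time limit, drop candidates that time out or fail, and keep the one with the highest reward. This is not the code in src/process_sqls.py or src/analysis.py. It assumes the databases are SQLite files, and the helper names (`run_with_timeout`, `pick_best`), the record fields (`sql`, `reward`, `rows`), and the reward values are all hypothetical.

```python
import sqlite3
import time
from typing import Optional


def run_with_timeout(db_path: str, sql: str, timeout_s: float) -> Optional[list]:
    """Run `sql` on a SQLite database; return rows, or None on timeout or error."""
    conn = sqlite3.connect(db_path)
    deadline = time.monotonic() + timeout_s
    # The handler is invoked every 1000 SQLite VM instructions; returning a
    # non-zero value aborts the running statement.
    conn.set_progress_handler(
        lambda: 1 if time.monotonic() > deadline else 0, 1000
    )
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        # Covers both interrupted (timed-out) and invalid queries.
        return None
    finally:
        conn.close()


def pick_best(candidates: list) -> Optional[dict]:
    """Return the executable candidate with the highest reward, if any."""
    executable = [c for c in candidates if c["rows"] is not None]
    return max(executable, key=lambda c: c["reward"], default=None)


if __name__ == "__main__":
    # Made-up candidates and rewards, purely for illustration.
    scored = {
        "SELECT 1": 0.2,
        "SELECT 1 + 1": 0.9,
        "SELECT * FROM missing": 0.95,  # fails: no such table
    }
    candidates = [
        {"sql": sql, "reward": reward,
         "rows": run_with_timeout(":memory:", sql, 30.0)}
        for sql, reward in scored.items()
    ]
    best = pick_best(candidates)
    print(best["sql"] if best else "no executable candidate")
```

The sketch also shows why the timeout matters: a candidate with a high reward is still discarded if it cannot finish in time, so a timeout that is too short for your hardware can remove otherwise correct queries.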
@misc{agrawal2025text2sql,
author = {Sheshansh Agrawal and Thien Nguyen},
title = {Open-Sourcing the Best Local Text-to-SQL System},
year = {2025},
url = {https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system/}
}