NL2SQL simplifies database interactions by enabling non-experts to convert natural language (NL) questions into Structured Query Language (SQL) queries. While recent advances in large language models (LLMs) have improved the zero/few-shot NL2SQL paradigm, existing methods face scalability challenges when dealing with massive databases. This paper introduces DBCopilot, a framework that addresses these challenges by employing a compact and flexible copilot model for routing over massive databases. Specifically, DBCopilot decouples schema-agnostic NL2SQL into domain-specific schema routing and generic SQL generation. The framework uses a lightweight differentiable search index to construct semantic mappings for massive database schemas and routes natural language questions to their target databases and tables in a relation-aware, end-to-end manner. The routed schemas and questions are then fed into LLMs for effective SQL generation. Furthermore, DBCopilot introduces a reverse schema-to-question generation paradigm that automatically learns and adapts the router over massive databases without manual intervention. Experimental results demonstrate that DBCopilot is a scalable and effective solution for schema-agnostic NL2SQL, providing a significant advance in handling large-scale schemas in real-world scenarios.
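To make the decoupling concrete, here is a minimal, hypothetical sketch of the two-stage pipeline the abstract describes: a router first selects the target schema, then the routed schema and question are serialized into a prompt for an LLM SQL generator. The toy lexical-overlap router, the `SCHEMAS` catalog, and the function names are illustrative assumptions only; DBCopilot's actual router is a learned differentiable search index.

```python
# Illustrative sketch only: DBCopilot's real router is a differentiable
# search index; this toy version routes by lexical overlap.

# Hypothetical catalog of database schemas: {db_name: {table: [columns]}}
SCHEMAS = {
    "concerts": {"singer": ["name", "age"], "concert": ["venue", "year"]},
    "flights": {"flight": ["origin", "destination"], "airport": ["code", "city"]},
}

def route(question: str) -> str:
    """Stage 1 (toy): pick the database whose schema vocabulary
    overlaps most with the question words."""
    words = set(question.lower().replace("?", "").split())
    def score(item):
        db, tables = item
        vocab = {db} | set(tables) | {c for cols in tables.values() for c in cols}
        return len(words & vocab)
    return max(SCHEMAS.items(), key=score)[0]

def build_prompt(question: str, db: str) -> str:
    """Stage 2: serialize the routed schema and question into an LLM prompt."""
    schema = "; ".join(f"{t}({', '.join(cols)})" for t, cols in SCHEMAS[db].items())
    return f"Database: {db}\nSchema: {schema}\nQuestion: {question}\nSQL:"

question = "Which singer performed at each concert venue?"
db = route(question)              # routes to "concerts"
prompt = build_prompt(question, db)
```

In DBCopilot the two stages stay decoupled in exactly this way: only the routing stage needs to know about the full database catalog, so the SQL generator remains a generic LLM.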
First, install the dependencies:
```shell
# clone project
git clone https://github.com/XXXX/DBCopilot
cd DBCopilot

# [SUGGESTED] use conda environment
conda env create -f environment.yaml
conda activate DBCopilot

# [ALTERNATIVE] install requirements directly
pip install -r requirements.txt
```
Then, run the experiments with the following commands:
```shell
# Train the schema questioning models:
./scripts/sweep --config configs/sweep_fit_schema_questioning.yaml

# Synthesize training data:
python scripts/synthesize_data.py

# Train the schema routers:
./scripts/sweep --config configs/sweep_fit_schema_routing.yaml

# Run end-to-end text-to-SQL evaluation:
python scripts/evaluate_text2sql.py
```
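The data synthesis step above reflects the reverse schema-to-question paradigm: training pairs are generated from schemas rather than annotated by hand. The template-based generator below is a simplified, hypothetical illustration of that direction; the actual `scripts/synthesize_data.py` relies on the learned schema questioning models, not fixed templates.

```python
# Hypothetical illustration of reverse schema-to-question synthesis:
# walk a schema and emit (question, SQL) pairs from fixed templates.
# DBCopilot's real pipeline uses learned schema questioning models instead.

def synthesize(tables: dict[str, list[str]]) -> list[tuple[str, str]]:
    pairs = []
    for table, cols in tables.items():
        for col in cols:
            question = f"What is the {col} of each {table}?"
            sql = f"SELECT {col} FROM {table}"
            pairs.append((question, sql))
    return pairs

pairs = synthesize({"singer": ["name", "age"]})
# pairs[0] == ("What is the name of each singer?", "SELECT name FROM singer")
```

Because the pairs are derived directly from the schema, the router can be trained for any new database collection without manual question annotation.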
You can also train and evaluate a single model with the run script:
```shell
# fit with the XXX config
./run fit --config configs/XXX.yaml

# or with specific command-line arguments
./run fit --model Model --data DataModule --data.batch_size 32 --trainer.gpus 0,

# evaluate with a checkpoint
./run test --config configs/XXX.yaml --ckpt_path ckpt_path

# get the script help
./run --help
./run fit --help
```