NL2SQL simplifies database interactions by enabling non-experts to convert natural language (NL) questions into Structured Query Language (SQL) queries. While recent advances in large language models (LLMs) have improved the zero/few-shot NL2SQL paradigm, existing methods face scalability challenges when dealing with massive databases. This paper introduces DBCopilot, a framework that addresses these challenges by employing a compact and flexible copilot model for routing over massive databases. Specifically, DBCopilot decouples schema-agnostic NL2SQL into domain-specific schema routing and generic SQL generation. The framework uses a lightweight differentiable search index to construct semantic mappings for massive database schemas and routes natural language questions to their target databases and tables in a relation-aware, end-to-end manner. The routed schemas and questions are then fed into LLMs for effective SQL generation. Furthermore, DBCopilot introduces a reverse schema-to-question generation paradigm that automatically learns and adapts the router over massive databases without manual intervention. Experimental results demonstrate that DBCopilot is a scalable and effective solution for schema-agnostic NL2SQL, providing a significant advance in handling large-scale schemas in real-world scenarios.
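To make the decoupling concrete, here is a minimal, hypothetical sketch of the two-stage pipeline the abstract describes: a router first selects the target schema, then the routed schema and question are serialized into a prompt for an LLM SQL generator. The toy lexical-overlap router, the `SCHEMAS` catalog, and the function names are illustrative assumptions only; DBCopilot's actual router is a learned differentiable search index.

```python
# Illustrative sketch only: DBCopilot's real router is a differentiable
# search index; this toy version routes by lexical overlap.

# Hypothetical catalog of database schemas: {db_name: {table: [columns]}}
SCHEMAS = {
    "concerts": {"singer": ["name", "age"], "concert": ["venue", "year"]},
    "flights": {"flight": ["origin", "destination"], "airport": ["code", "city"]},
}

def route(question: str) -> str:
    """Stage 1 (toy): pick the database whose schema vocabulary
    overlaps most with the question words."""
    words = set(question.lower().replace("?", "").split())
    def score(item):
        db, tables = item
        vocab = {db} | set(tables) | {c for cols in tables.values() for c in cols}
        return len(words & vocab)
    return max(SCHEMAS.items(), key=score)[0]

def build_prompt(question: str, db: str) -> str:
    """Stage 2: serialize the routed schema and question into an LLM prompt."""
    schema = "; ".join(f"{t}({', '.join(cols)})" for t, cols in SCHEMAS[db].items())
    return f"Database: {db}\nSchema: {schema}\nQuestion: {question}\nSQL:"

question = "Which singer performed at each concert venue?"
db = route(question)              # routes to "concerts"
prompt = build_prompt(question, db)
```

In DBCopilot the two stages stay decoupled in exactly this way: only the routing stage needs to know about the full database catalog, so the SQL generator remains a generic LLM.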
First, install the dependencies:
```shell
# clone project
git clone https://github.com/XXXX/DBCopilot
cd DBCopilot

# [SUGGESTED] use conda environment
conda env create -f environment.yaml
conda activate DBCopilot

# [ALTERNATIVE] install requirements directly
pip install -r requirements.txt
```
Then, run the experiments with the following commands:
```shell
# Train the schema questioning models:
./scripts/sweep --config configs/sweep_fit_schema_questioning.yaml

# Synthesize training data:
python scripts/synthesize_data.py

# Train the schema routers:
./scripts/sweep --config configs/sweep_fit_schema_routing.yaml

# Run end-to-end text-to-SQL evaluation:
python scripts/evaluate_text2sql.py
```
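The data synthesis step above reflects the reverse schema-to-question paradigm: training pairs are generated from schemas rather than annotated by hand. The template-based generator below is a simplified, hypothetical illustration of that direction; the actual `scripts/synthesize_data.py` relies on the learned schema questioning models, not fixed templates.

```python
# Hypothetical illustration of reverse schema-to-question synthesis:
# walk a schema and emit (question, SQL) pairs from fixed templates.
# DBCopilot's real pipeline uses learned schema questioning models instead.

def synthesize(tables: dict[str, list[str]]) -> list[tuple[str, str]]:
    pairs = []
    for table, cols in tables.items():
        for col in cols:
            question = f"What is the {col} of each {table}?"
            sql = f"SELECT {col} FROM {table}"
            pairs.append((question, sql))
    return pairs

pairs = synthesize({"singer": ["name", "age"]})
# pairs[0] == ("What is the name of each singer?", "SELECT name FROM singer")
```

Because the pairs are derived directly from the schema, the router can be trained for any new database collection without manual question annotation.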
You can also train and evaluate a single model with the run script:
```shell
# fit with the XXX config
./run fit --config configs/XXX.yaml

# or with specific command-line arguments
./run fit --model Model --data DataModule --data.batch_size 32 --trainer.gpus 0,

# evaluate with a checkpoint
./run test --config configs/XXX.yaml --ckpt_path ckpt_path

# get the script help
./run --help
./run fit --help
```