ZeroNL2SQL

💭 Introduction

This repository contains the code for the paper ``Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation''.

📂 Data Preparation

Train data

Spider: Put it under src/datasets/spider.

Test data

KaggleDBQA: Put it under src/datasets/kaggledbqa.
Dr.Spider: Put it under src/datasets/drspider.

mkdir data/
unzip src/datasets/kaggledbqa/kaggledbqa.zip -d data/
unzip src/datasets/drspider/drspider.zip -d data/
# Don't delete the original .zip file

💻 Environment Preparation

Please refer to requirements.txt to download the relevant toolkits.

Prepare the following folders:

cd ZeroNL2SQL
mkdir logs
mkdir experimental_outputs/train/template_generator
mkdir experimental_outputs/train/aligner

⚡ Quick Start

Download models

Template Generator: Put it under experimental_outputs/train/template_generator.
Aligner: Put it under experimental_outputs/train/aligner.

Text-to-SQL inference

Use the following script to directly infer on the text-to-sql test set. This script will take four steps: 1. generate SQL template; 2. align (SELECT, STRUCTURE) with the user question; 3. prepare data for LLM inference; 4. text2sql using LLM.

CUDA_VISIBLE_DEVICES={gpu_id} bash scripts/infer_LLM_with_template.sh test_set_name your_openai_key

The first argument is the name of the test set, which can be selected from kaggledbqa, DB_DBcontent_equivalence, DB_schema_abbreviation, DB_schema_synonym, NLQ_keyword_synonym, NLQ_keyword_carrier, NLQ_column_synonym, NLQ_column_carrier, NLQ_column_attribute, NLQ_column_value, NLQ_value_synonym, NLQ_multitype, NLQ_others, SQL_comparison, SQL_sort_order, SQL_NonDB_number, SQL_DB_text, SQL_DB_number.
The second argument is your openai key, which you can obtain from the official website.

Note that we evaluate the text-to-SQL results using the test_suite_evaluation, and the evaluation results are presented in eval.output.

👐 Train From Scratch

Train template generator

CUDA_VISIBLE_DEVICES={gpu_id} bash -c "python src/run.py configs/train_template_generator.json"

The best model will be saved at experimental_outputs/train/template_generator/BEST_MODEL/.

Train aligner

CUDA_VISIBLE_DEVICES={gpu_id} bash -c "python src/run_aligner.py configs/train_aligner.json"

The best model will be saved at experimental_outputs/train/aligner/checkpoint_best.pkl.

💬Citation

If our code is helpful to you, please cite our work:

@misc{gu2023interleaving,
      title={Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation}, 
      author={Zihui Gu and Ju Fan and Nan Tang and Songyue Zhang and Yuxin Zhang and Zui Chen and Lei Cao and Guoliang Li and Sam Madden and Xiaoyong Du},
      year={2023},
      eprint={2306.08891},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
aligner_data		aligner_data
configs		configs
experimental_outputs		experimental_outputs
scripts		scripts
src		src
test-suite-sql-eval		test-suite-sql-eval
LICENSE		LICENSE
LLM_text2sql.py		LLM_text2sql.py
README.md		README.md
get_colval_map.py		get_colval_map.py
requirements.txt		requirements.txt
value_match.py		value_match.py

License

ruc-datalab/ZeroNL2SQL

Folders and files

Latest commit

History

Repository files navigation