Improving Generalization in Semantic Parsing by Increasing Natural Language Variation

saparina/Text2SQL-NLVariation

This repo is the implementation of the following paper:

Improving Generalization in Semantic Parsing by Increasing Natural Language Variation
Irina Saparina and Mirella Lapata
EACL'24

License

This dataset is released under the CC BY-SA 4.0 license, meaning you must credit the original source and share any derivative works under the same license, even for commercial use.

Data and Checkpoints

You can download the augmented Spider and evaluation datasets from Google Drive.

Preprocess Dr.Spider:

cd data/diagnostic-robustness-text-to-sql
python data_preprocess.py

Preprocess KaggleDBQA:

cd data/kaggle-dbqa
python preprocess.py
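
Both preprocessing steps above start with a `cd` relative to the repository root, so running them back to back in one shell requires returning to the root in between. A minimal sketch of one way to avoid that, assuming a POSIX shell; the helper name `run_in_dir` is hypothetical and not part of this repo:

```shell
# Hypothetical helper (not part of this repo): run a command inside a
# directory without changing the caller's working directory. The
# parentheses start a subshell, so the cd does not leak out.
run_in_dir() {
    dir=$1
    shift
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    ( cd "$dir" && "$@" )
}

# Intended usage from the repository root (paths taken from the steps above):
#   run_in_dir data/diagnostic-robustness-text-to-sql python data_preprocess.py
#   run_in_dir data/kaggle-dbqa python preprocess.py
```

The helper returns a non-zero status when the data directory is absent, which usually means the Google Drive archive has not been unpacked yet.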

The T5 checkpoint is available on the HuggingFace Hub. RESDSQL checkpoints are available on Google Drive. Download them and unzip the files into models/RESDSQL.

Dependencies

Create conda env:

conda env create -n nlvariation_env -f enviroment.yaml
conda activate nlvariation_env

Install RESDSQL dependencies:

cd RESDSQL
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
python nltk_downloader.py

Clone evaluation scripts:

mkdir picard/third_party
cd picard/third_party
git clone https://github.com/facebookincubator/hsthrift
git clone https://github.com/facebook/zstd
git clone https://github.com/facebook/wangle
git clone https://github.com/facebook/folly
git clone https://github.com/elementai/spider
git clone https://github.com/elementai/test-suite-sql-eval
git clone https://github.com/hasktorch/tokenizers
git clone https://github.com/facebook/fbthrift
git clone https://github.com/fmtlib/fmt
git clone https://github.com/rsocket/rsocket-cpp
git clone https://github.com/facebookincubator/fizz
cd ../../

mkdir RESDSQL/third_party
cd RESDSQL/third_party
git clone https://github.com/ElementAI/spider.git
git clone https://github.com/ElementAI/test-suite-sql-eval.git
mv ./test-suite-sql-eval ./test_suite
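
With this many clones, one that fails silently is easy to miss. The following is a small sketch for checking that the expected directories exist, assuming a POSIX shell; the function name `check_checkouts` is hypothetical, and the directory names are the defaults that `git clone` derives from the URLs above:

```shell
# Hypothetical helper (not part of this repo): print every expected
# subdirectory that is missing under a parent directory.
# Returns non-zero if anything is missing.
check_checkouts() {
    parent=$1
    shift
    missing=0
    for name in "$@"; do
        if [ ! -d "$parent/$name" ]; then
            echo "missing: $parent/$name"
            missing=1
        fi
    done
    return "$missing"
}

# Intended usage from the repository root:
#   check_checkouts picard/third_party hsthrift zstd wangle folly spider \
#       test-suite-sql-eval tokenizers fbthrift fmt rsocket-cpp fizz
#   check_checkouts RESDSQL/third_party spider test_suite
```

Note that the RESDSQL check looks for `test_suite`, since the last step above renames `test-suite-sql-eval`.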

T5 and PICARD

The code used for experiments with T5 and PICARD is a fork of the official PICARD implementation:

cd picard

You can run T5 evaluation with:

sh ./configs/dr_spider/eval_dr_spider_t5-spider-augs.sh # Dr.Spider
sh ./configs/kaggle/eval_kaggle_t5-spider-augs.sh # KaggleDBQA
sh ./configs/geoquery/eval_geoquery_t5-spider-augs.sh # GeoQuery
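
To run the three evaluations unattended, a wrapper that stops at the first failing script can help. This is a sketch only, assuming a POSIX shell and that each script signals failure through its exit status; `run_evals` is a hypothetical name, not something this repo provides:

```shell
# Hypothetical helper (not part of this repo): run each script with sh,
# in order, and stop at the first one that exits with a non-zero status.
run_evals() {
    for script in "$@"; do
        echo "running: $script"
        sh "$script" || { echo "failed: $script" >&2; return 1; }
    done
}

# Intended usage from the picard directory (script paths from the list above):
#   run_evals ./configs/dr_spider/eval_dr_spider_t5-spider-augs.sh \
#             ./configs/kaggle/eval_kaggle_t5-spider-augs.sh \
#             ./configs/geoquery/eval_geoquery_t5-spider-augs.sh
```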

You need to use Docker (see more info) to run PICARD. You can run evaluation with:

sh ./configs/dr_spider/eval_dr_spider_t5-spider-augs.sh # Dr.Spider
sh ./configs/kaggle/eval_kaggle_t5-spider-augs.sh # KaggleDBQA
sh ./configs/geoquery/eval_geoquery_t5-spider-augs.sh # GeoQuery

You can run training on the augmented dataset with:

python seq2seq/run_seq2seq.py configs/train_augs.json

RESDSQL

The code used for experiments with RESDSQL is a fork of the official RESDSQL implementation:

cd RESDSQL

You can run RESDSQL evaluation with:

sh ./configs/dr_spider/eval_dr_spider_t5-spider-augs.sh # Dr.Spider
sh ./configs/kaggle/eval_kaggle_t5-spider-augs.sh # KaggleDBQA
sh ./configs/geoquery/eval_geoquery_t5-spider-augs.sh # GeoQuery

You can run training on the augmented dataset with:

sh ./configs/train_augs.sh

Acknowledgements

We used the following datasets: Spider, Dr.Spider, KaggleDBQA, and GeoQuery. The code is based on the official PICARD implementation and the official RESDSQL implementation (which includes NatSQL). We thank all authors for their work.
