GitHub - theSage21/syntaxSQL: SyntaxSQLNet: Syntax Tree Networks for Complex and Cross Domain Text-to-SQL Task

SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task

Source code of our EMNLP 2018 paper: SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task.

Citation

@InProceedings{Yu&al.18.emnlp.syntax,
  author =  {Tao Yu and Michihiro Yasunaga and Kai Yang and Rui Zhang and Dongxu Wang and Zifan Li and Dragomir Radev},
  title =   {SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task},
  year =    {2018},  
  booktitle =   {Proceedings of EMNLP},  
  publisher =   {Association for Computational Linguistics},
}

Environment Setup

The code uses Python 2.7 and PyTorch 0.2.0 (GPU version).
Install the Python dependencies: pip install -r requirements.txt

Download Data, Embeddings, Scripts, and Pretrained Models

Download the dataset from the Spider task website (to be updated), and put tables.json, train.json, and dev.json under the data/ directory.
Download the pretrained GloVe embeddings, and save them as glove/glove.%dB.%dd.txt
Download evaluation.py and process_sql.py from the Spider GitHub page.
Download the preprocessed train/dev datasets and pretrained models from here. The download contains generated_datasets/, which holds:
- generated_data for the original Spider training datasets; pretrained models can be found at generated_data/saved_models
- generated_data_augment for the original Spider + augmented training datasets; pretrained models can be found at generated_data_augment/saved_models
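
The GloVe path above is a format template with two integer slots. The sketch below shows how such a template expands and how a GloVe text file (one token per line, followed by its vector components) can be read. The helper names glove_path and load_glove are hypothetical, and the values 42 and 300 are only an assumed example of a GloVe release; this is not the repository's word_embedding.py.

```python
import io


def glove_path(billions, dims, template="glove/glove.%dB.%dd.txt"):
    """Fill the two integer slots of the README's path template."""
    return template % (billions, dims)


def load_glove(path, limit=None):
    """Read a GloVe text file into a dict of token -> list of floats.

    Each line holds a token followed by space-separated vector components.
    `limit` (hypothetical convenience) stops after that many lines.
    """
    vectors = {}
    with io.open(path, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if limit is not None and index >= limit:
                break
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Assumed example values: the 42B-token, 300-dimensional release.
print(glove_path(42, 300))
```

Loading with a small limit first is a cheap way to confirm the file is in the expected place and format before starting a long training run.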

Generating Train/dev Data for Modules

You can find the preprocessed train/dev data in generated_datasets/.

To generate them yourself, update the directories marked TODO in preprocess_train_dev_data.py, and run the following command (with either train or dev as the argument) to generate the training files for each module:

python preprocess_train_dev_data.py train|dev
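
Here train|dev means the script takes one of the two split names, so it has to be run once per split. A minimal sketch of building both invocations follows; the helper preprocess_commands is hypothetical and only assembles argument lists, it does not run anything.

```python
def preprocess_commands(splits=("train", "dev"),
                        script="preprocess_train_dev_data.py"):
    """Return one argument list per split, in the order given."""
    for split in splits:
        if split not in ("train", "dev"):
            raise ValueError("split must be 'train' or 'dev', got %r" % split)
    return [["python", script, split] for split in splits]


# Dry run: show the commands instead of executing them.
for command in preprocess_commands():
    print(" ".join(command))
```

Each returned list can be passed to subprocess.check_call from the repository root once the TODO directories in the script have been updated.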

Folder/File Description

data/ contains the raw train/dev/test data and the table file.
generated_datasets/ is described above.
models/ contains the code for each module.
evaluation.py is for evaluation. It uses process_sql.py.
train.py is the main file for training. Use train_all.sh to train all the modules (see below).
test.py is the main file for testing. It uses supermodel.py to call the trained modules and generate SQL queries. In practice, use test_gen.sh to generate SQL queries.
generate_wikisql_augment.py is for cross-domain data augmentation.
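
Before preprocessing or training, it can help to confirm that data/ holds the three files named in the download step. The check below is a sketch under that assumption; missing_data_files is a hypothetical helper, not part of the repository.

```python
import os

# File names taken from the download instructions in this README.
REQUIRED_DATA_FILES = ("tables.json", "train.json", "dev.json")


def missing_data_files(data_dir="data"):
    """Return the required file names that are absent from data_dir."""
    return [name for name in REQUIRED_DATA_FILES
            if not os.path.isfile(os.path.join(data_dir, name))]


missing = missing_data_files()
if missing:
    print("Missing under data/: " + ", ".join(missing))
```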

Training

Run train_all.sh to train all the modules. It looks like:

python train.py \
    --data_root       path/to/generated_data \
    --save_dir        path/to/save/trained/module \
    --history_type    full|no \
    --table_type      std|no \
    --train_component <module_name> \
    --epoch           <num_of_epochs>
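
Since each module is trained by a separate train.py call, the loop over modules can be sketched as below. The flags are the ones listed above; the module names module_a and module_b, the epoch count, and the helper train_command are placeholders for illustration, so check train_all.sh for the real values.

```python
def train_command(data_root, save_dir, component, epochs,
                  history_type="full", table_type="std"):
    """Build the train.py argument list for one module."""
    if history_type not in ("full", "no"):
        raise ValueError("history_type must be 'full' or 'no'")
    if table_type not in ("std", "no"):
        raise ValueError("table_type must be 'std' or 'no'")
    return [
        "python", "train.py",
        "--data_root", data_root,
        "--save_dir", save_dir,
        "--history_type", history_type,
        "--table_type", table_type,
        "--train_component", component,
        "--epoch", str(epochs),
    ]


# Placeholder module names and epoch count; dry run only.
for module in ("module_a", "module_b"):
    print(" ".join(train_command("path/to/generated_data",
                                 "path/to/save/trained/module",
                                 module, 100)))
```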

Testing

Run test_gen.sh to generate SQL queries. test_gen.sh looks like:

SAVE_PATH=generated_datasets/generated_data/saved_models_hs=full_tbl=std
python test.py \
    --test_data_path  path/to/raw/test/data \
    --models          path/to/trained/module \
    --output_path     path/to/print/generated/SQL \
    --history_type    full|no \
    --table_type      std|no

Evaluation

Follow the general evaluation process described on the Spider GitHub page.

Cross-Domain Data Augmentation

You can find the preprocessed augmented data at generated_datasets/generated_data_augment.

To run the data augmentation yourself, first download wikisql_tables.json and train_patterns.json from here, and run python generate_wikisql_augment.py to generate more training data. Second, run get_data_wikisql.py to generate the WikiSQL augmentation JSON file. Finally, use merge_jsons.py to generate the final Spider + WikiSQL + WikiSQL-augment dataset.
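
The final merge step can be pictured as concatenating example lists. The sketch below assumes each input file holds a JSON array of examples, as Spider's train.json does; merge_json_lists is a hypothetical helper and is not the repository's merge_jsons.py, which may filter or reformat entries.

```python
import io
import json


def merge_json_lists(paths):
    """Concatenate the JSON arrays stored in the given files, in order."""
    merged = []
    for path in paths:
        with io.open(path, encoding="utf-8") as handle:
            examples = json.load(handle)
        if not isinstance(examples, list):
            raise ValueError("%s does not contain a JSON array" % path)
        merged.extend(examples)
    return merged
```

Keeping the inputs in a fixed order (Spider first, then the WikiSQL-derived files) makes the merged file reproducible across runs.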

Acknowledgement

The implementation is based on SQLNet. Please cite it too if you use this code.
