Data and Code Release for "On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries"

What is included

  • Squall dataset
  • Seq2seq-based models with attention and copy mechanisms (LSTM/BERT encoder)
  • Supervised attention and column prediction using manually-annotated alignments

Licenses

Our code is released under the MIT license. The evaluator contains modified code from mistic-sql-parser by Damien "Mistic" Sorel and Andrew Kent.

The Squall dataset is released under the CC BY-SA 4.0 license and is built upon WikiTableQuestions by Panupong Pasupat and Percy Liang.

Requirements

  • Python 3.x
  • Node.js (for the evaluator, details below)

Setting Up

From the scripts directory (cd scripts), run python make_splits.py to generate the train-dev splits used in our experiments, and ./download_corenlp.sh to download and unzip the corresponding CoreNLP version.

To set up the evaluator, cd eval, and then run npm install file:sql-parser and npm install express.

To set up the python dependencies, run pip install -r requirements.txt.

Model Training and Testing

Make sure the evaluator service is running before performing any model training or testing. To start it, cd eval and run node evaluator.js. This spawns a local service (default port 3000) that the Python model code communicates with to convert the (slightly) underspecified SQL queries into queries that are fully executable on our pre-processed databases.
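
For illustration only, the sketch below shows one way a Python client could query such a local service using the requests library. The endpoint path and payload fields are assumptions made for this example, not the documented interface; see model/main.py and eval/evaluator.js for how the communication is actually done.

    # Hypothetical client sketch. The endpoint path and payload fields below
    # are illustrative assumptions; consult model/main.py and eval/evaluator.js
    # for the actual request format used by the released code.
    import requests

    response = requests.post(
        "http://localhost:3000/",  # default port spawned by `node evaluator.js`
        json={"sql": "select c1 from w", "table_id": "some-table-id"},
    )
    print(response.json())  # e.g., the executable query and/or its execution result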

Next, cd model and then run python main.py to train a baseline model with an LSTM encoder. Additional options enable our model variations:

  • --bert for BERT encoder
  • --enc-loss for encoder supervised attention
  • --dec-loss for decoder supervised attention
  • --aux-col for supervised column prediction

Once the model is trained, run python main.py --test to make predictions on the WTQ test set.

See model/main.py for command-line arguments to specify training file, dev file, test file, model saving location, etc.

Squall Dataset Format

The dataset is located at data/squall.json as a single JSON file. The file is a list of dictionaries, each corresponding to one annotated data instance with the following fields:

  • nt: question ID
  • tbl: table ID
  • columns: a list of processed table columns with the format of [raw header text, tokenized header text, available column suffixes (ways to interpret this column beyond raw texts), column data type]
  • nl: tokenized English question
  • tgt: target execution result
  • nl_pos: automatically-analyzed POS tags for nl
  • nl_ner: automatically-analyzed NER tags for nl
  • nl_ralign: automatically-generated field that includes information about what type of SQL fragments each question token aligns to, used in the auxiliary task of column prediction.
  • nl_incolumns: Boolean values of whether the token matches any of the column tokens
  • nl_incells: Boolean values of whether the token matches any of the table cells
  • columns_innl: Boolean values of whether the column header appears in the question
  • sql: the tokenized SQL query. Each token has the format [SQL type, value, span indices], where SQL type is one of Keyword, Column, Literal.Number, or Literal.String. If the token is a literal, the span indices give the beginning and end indices for extracting the literal from nl.
  • align: manual alignments between nl and sql. Each record in this list has the format [indices into nl, indices into sql] (see the loading sketch after this list).
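
To make the format concrete, here is a small loading sketch in Python. The field names follow the list above; the exact layout of the literal span indices (assumed to be an end-exclusive [begin, end] pair here) and of the alignment records (assumed to be lists of indices) should be checked against the data itself.

    import json

    # squall.json is a list of annotated instances (see the field list above).
    with open("data/squall.json") as f:
        squall = json.load(f)

    ex = squall[0]
    print(ex["nt"], ex["tbl"])   # question ID and table ID
    print(" ".join(ex["nl"]))    # tokenized English question

    # Each SQL token is [SQL type, value, span indices]; for literals, the span
    # indices point back into nl (assumed end-exclusive in this sketch).
    for sql_type, value, span in ex["sql"]:
        if sql_type.startswith("Literal") and span:
            begin, end = span
            print(value, "<-", " ".join(ex["nl"][begin:end]))

    # Manual alignments pair question-token indices with SQL-token indices.
    for nl_indices, sql_indices in ex["align"]:
        print([ex["nl"][i] for i in nl_indices],
              "~", [ex["sql"][j][1] for j in sql_indices])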

Release History

  • 0.1.0 (2020-10-20): initial release

Reference

If you make use of our code or data for research purposes, we would appreciate it if you cite the following:

@inproceedings{Shi:Zhao:Boyd-Graber:Daume-III:Lee-2020,
	Title = {On the Potential of Lexico-logical Alignments for Semantic Parsing to {SQL} Queries},
	Author = {Tianze Shi and Chen Zhao and Jordan Boyd-Graber and Hal {Daum\'{e} III} and Lillian Lee},
	Booktitle = {Findings of EMNLP},
	Year = {2020},
}
