
Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect

News in Text-to-SQL

2023: BIRD, a new cross-domain text-to-SQL benchmark in the era of LLMs. Paper

Abstract

Text-to-SQL has attracted attention from both the natural language processing and database communities because of its ability to convert the semantics in natural language into SQL queries and its practical application in building natural language interfaces to database systems.
The major challenges in text-to-SQL lie in encoding the meaning of natural utterances, decoding to SQL queries, and translating the semantics between these two forms.
These challenges have been addressed to different extents by the recent advances.
However, there is still a lack of comprehensive surveys for this task.
To this end, we review recent progress on text-to-SQL for datasets, methods, and evaluation and provide this systematic survey, addressing the aforementioned challenges and discussing potential future directions.
We hope this survey can serve as quick access to existing work and motivate future research.

Citation

If you would like to cite our work, please use the following citation:

Naihao Deng, Yulong Chen, and Yue Zhang. 2022. Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect. In COLING 2022, Gyeongju, Republic of Korea.

@inproceedings{deng-etal-2022-recent-advances-in-text-to-sql,
    title = "Recent Advances in Text-to-{SQL}: A Survey of What We Have and What We Expect",
    author = "Deng, Naihao  and
      Chen, Yulong  and
      Zhang, Yue",
    booktitle = "COLING",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/events/coling-2022/",
}

Survey Sketch

Concept Diagram

The framework for text-to-SQL systems. Given the database schema and user utterance, the system outputs a corresponding SQL query to query the database system for the result. Appendix B gives more text-to-SQL examples.
Topology for text-to-SQL; format adapted from Liu et al. (2021a).
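To make the input/output contract concrete, here is a minimal, hypothetical sketch of that pipeline in Python: `generate_sql` is a stub standing in for any of the surveyed models, and the returned query is executed against a toy SQLite database (the table and data are invented for illustration).

```python
import sqlite3

def generate_sql(question: str, schema: str) -> str:
    # Hypothetical stand-in for any surveyed text-to-SQL model:
    # maps (question, serialized schema) -> SQL string.
    # Hard-coded here so the sketch stays runnable.
    return "SELECT name FROM restaurant WHERE city_name = 'alameda';"

# Toy in-memory DB mirroring the Restaurants example in the appendix.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurant (id INTEGER, name TEXT, city_name TEXT)")
conn.execute("INSERT INTO restaurant VALUES (1, 'Cafe Alameda', 'alameda')")

sql = generate_sql("Give me some restaurants in alameda?",
                   "restaurant(id, name, city_name)")
print(conn.execute(sql).fetchall())  # [('Cafe Alameda',)]
```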

Datasets

Summarization of text-to-SQL datasets. #Size, #D, #DB, and #T/DB denote the number of question-SQL pairs, domains, databases, and tables per database, respectively. We put “-” in the #D column where we do not know how many domains are in the Spider dev set, and “-” in the Issues Addressed column where the dataset addresses no specific issue. The first nine datasets (Spider through KaggleDBQA) are cross-domain; the rest are single-domain.
| Datasets | #Size | #D | #DB | #T/DB | Issues addressed | Sources for data |
| --- | --- | --- | --- | --- | --- | --- |
| Spider (Yu et al., 2018c) | 10,181 | 138 | 200 | 5.1 | Domain generalization | College courses, DatabaseAnswers, WikiSQL |
| Spider-DK (Gan et al., 2021b) | 535 | - | 10 | 4.8 | Domain knowledge | Spider dev set |
| Spider-tran (Zeng et al., 2020) | 15,023 | 138 | 200 | 5.1 | Untranslatable questions | Spider + 5,330 untranslatable questions |
| Spider-L (Lei et al., 2020) | 8,034 | - | 160 | 5.1 | Schema linking | Spider train/dev |
| SpiderSL (Taniguchi et al., 2021) | 1,034 | - | 10 | 4.8 | Schema linking | Spider dev set |
| Spider-Syn (Gan et al., 2021a) | 8,034 | - | 160 | 5.1 | Robustness | Spider train/dev |
| WikiSQL (Zhong et al., 2017) | 80,654 | - | 26,521 | 1 | Data size | Wikipedia |
| Squall (Shi et al., 2020b) | 11,468 | - | 1,679 | 1 | Lexicon-level supervision | WikiTableQuestions (Pasupat and Liang, 2015) |
| KaggleDBQA (Lee et al., 2021) | 272 | 8 | 8 | 2.3 | Domain generalization | Real web databases |
| ATIS (Price, 1990; Dahl et al., 1994) | 5,280 | 1 | 1 | 32 | - | Flight booking |
| GeoQuery (Zelle and Mooney, 1996) | 877 | 1 | 1 | 6 | - | US geography |
| Scholar (Iyer et al., 2017) | 817 | 1 | 1 | 7 | - | Academic publications |
| Academic (Li and Jagadish, 2014) | 196 | 1 | 1 | 15 | - | Microsoft Academic Search (MAS) database |
| IMDB (Yaghmazadeh et al., 2017) | 131 | 1 | 1 | 16 | - | Internet Movie Database |
| Yelp (Yaghmazadeh et al., 2017) | 128 | 1 | 1 | 7 | - | Yelp website |
| Advising (Finegan-Dollak et al., 2018) | 3,898 | 1 | 1 | 10 | - | University of Michigan course information |
| Restaurants (Tang and Mooney, 2000; Popescu et al., 2003) | 378 | 1 | 1 | 3 | - | Restaurants |
| MIMICSQL (Wang et al., 2020d) | 10,000 | 1 | 1 | 5 | - | Healthcare domain |
| SEDE (Hazoom et al., 2021) | 12,023 | 1 | 1 | 29 | SQL template diversity | Stack Exchange |

Other relevant datasets

Resources

  • This Github Repo holds data for many of the single-domain datasets, including Academic, Advising, ATIS, Geography, IMDB, Restaurants, Scholar, and Yelp, as well as for cross-domain datasets including Spider and WikiSQL. It is the official repo for This paper, and their data-processing methods are described in the paper as well.

  • The official Spider leaderboard shows the performance of models on Spider dev and the hidden test set.

  • The official WikiSQL Github Repo holds the leaderboard for models on the WikiSQL dataset.

Methods

Data Augmentation

Encoding

Methods used for encoding in text-to-SQL.
| Methods | Adopted by | Applied datasets | Addressed challenges |
| --- | --- | --- | --- |
| Encode token type | TypeSQL (Yu et al., 2018a) | WikiSQL | Representing question meaning |
| Graph-based | GNN (Bogin et al., 2019a) | Spider | (1) Representing question and DB schemas in a structured way; (2) Schema linking |
| | Global-GCN (Bogin et al., 2019b) | Spider | |
| | IGSQL (Cai and Wan, 2020) | SParC, CoSQL | |
| | RAT-SQL (Wang et al., 2020a) | Spider | |
| | LEGSQL (Cao et al., 2021) | Spider | |
| | SADGA (Cai et al., 2021) | Spider | |
| | ShadowGNN (Chen et al., 2021b) | Spider | |
| | S2SQL (Hui et al., 2022) | Spider, Spider-Syn | |
| Self-attention | X-SQL (He et al., 2019) | WikiSQL | |
| | SQLova (Hwang et al., 2019) | WikiSQL | |
| | RAT-SQL (Wang et al., 2020a) | Spider | |
| | DuoRAT (Scholak et al., 2021a) | Spider | |
| | UnifiedSKG (Xie et al., 2022) | WikiSQL, Spider | |
| Adapt PLM | X-SQL (He et al., 2019) | WikiSQL | Leveraging external data to represent question and DB schemas |
| | SQLova (Hwang et al., 2019) | WikiSQL | |
| | (Guo and Gao, 2019) | WikiSQL | |
| | HydraNet (Lyu et al., 2020) | WikiSQL | |
| | (Liu et al., 2021b), etc. | Spider-L, SQUALL | |
| Pre-training | TaBERT (Yin et al., 2020) | Spider | |
| | GraPPa (Yu et al., 2021) | Spider | |
| | GAP (Shi et al., 2020a) | Spider | |
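For the “Adapt PLM” line of work, a common first step is flattening the question and the DB schema into a single input sequence for the encoder. The sketch below illustrates that idea; the `[T]`/`[C]`/`[SEP]` markers are invented for this example, since each model (SQLova, X-SQL, TaBERT, etc.) defines its own serialization scheme.

```python
def serialize_input(question, tables):
    # Flatten question + schema into one encoder input string.
    # Marker tokens here are illustrative, not any model's actual format.
    parts = [question]
    for table, columns in tables.items():
        parts.append(f"[T] {table} " + " ".join(f"[C] {col}" for col in columns))
    return " [SEP] ".join(parts)

schema = {
    "restaurant": ["id", "name", "city_name"],
    "location": ["restaurant_id", "house_number", "city_name"],
}
print(serialize_input("Give me some restaurants in alameda?", schema))
# Give me some restaurants in alameda? [SEP] [T] restaurant [C] id [C] name ...
```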

Decoding

Methods used for decoding in text-to-SQL. ♠: Academic, Advising, ATIS, GeoQuery, Yelp, IMDB, Scholar, Restaurants; ♡: TableQA, DuSQL, CoSQL, SParC, Chase.
| Methods | Adopted by | Applied datasets | Addressed challenges |
| --- | --- | --- | --- |
| Tree-based | Seq2Tree (Dong and Lapata, 2016) | - | Hierarchical decoding |
| | Seq2AST (Yin and Neubig, 2017) | - | |
| | SyntaxSQLNet (Yu et al., 2018b) | Spider | |
| Sketch-based | SQLNet (Xu et al., 2017) | WikiSQL | |
| | (Dong and Lapata, 2018) | WikiSQL | |
| | IRNet (Guo et al., 2019) | Spider | |
| | RYANSQL (Choi et al., 2021) | Spider | |
| Bottom-up | SmBoP (Rubin and Berant, 2021) | Spider | |
| Attention mechanism | Attention Seq2Tree (Dong and Lapata, 2016) | - | Synthesizing information for decoding |
| | Seq2SQL (Zhong et al., 2017) | WikiSQL | |
| | Bi-attention (Guo and Gao, 2018) | WikiSQL | |
| | Structured attention (Wang et al., 2019) | WikiSQL | |
| Relation-aware self-attention | DuoRAT (Scholak et al., 2021a) | Spider | |
| Copy mechanism | Seq2AST (Yin and Neubig, 2017) | - | |
| | Seq2SQL (Zhong et al., 2017) | WikiSQL | |
| | (Wang et al., 2018a) | WikiSQL | |
| | SeqGenSQL (Li et al., 2020a) | WikiSQL | |
| Intermediate representation | IncSQL (Shi et al., 2018) | WikiSQL | Bridging the gap between natural language and SQL query |
| | IRNet (Guo et al., 2019) | Spider | |
| | (Suhr et al., 2020) | Spider and others♠ | |
| | (Herzig et al., 2021) | GeoQuery, ATIS, Scholar | |
| | (Gan et al., 2021c) | Spider | |
| | (Brunner and Stockinger, 2021) | Spider | |
| Others: constrained decoding | UniSAr (Dou et al., 2022) | WikiSQL, Spider and others♡ | Fine-grained decoding |
| | PICARD (Scholak et al., 2021b) | Spider, CoSQL | |
| Others: execution-guided | SQLova (Hwang et al., 2019) | WikiSQL | |
| | (Wang et al., 2018b) | WikiSQL | |
| Others: discriminative re-ranking | Global-GCN (Bogin et al., 2019b) | Spider | SQL ranking |
| | (Kelkar et al., 2020) | Spider | |
| Others: separate submodule | SQLNet (Xu et al., 2017) | WikiSQL | Easier decoding |
| | (Guo and Gao, 2018) | WikiSQL | |
| | (Lee, 2019) | Spider | |
| | BPE (Muller and Vlachos, 2019) | Advising, ATIS, GeoQuery | |
| Others: link gating | (Chen et al., 2020b) | Spider | Synthesizing information for decoding |
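As an example of the “execution-guided” row above, the following sketch filters a ranked candidate list by actually executing each query and keeping the first one that runs without error and returns rows. This is a simplified form of the strategy in SQLova and Wang et al. (2018b), which guide partial decodes as well; the SQLite table here is invented for the demo.

```python
import sqlite3

def execution_guided_pick(candidates, conn):
    # Keep the highest-ranked candidate that executes without error
    # and returns a non-empty result -- a simplified version of
    # execution-guided decoding.
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # malformed or schema-violating query: prune it
        if rows:
            return sql
    return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars_data (id INTEGER, cylinders INTEGER)")
conn.executemany("INSERT INTO cars_data VALUES (?, ?)", [(1, 6), (2, 4)])

print(execution_guided_pick(
    ["SELECT COUNT(*) FROM cars WHERE cylinders > 4;",        # bad table name
     "SELECT COUNT(*) FROM cars_data WHERE cylinders > 4;"],  # executes
    conn))
```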

Learning Techniques

Miscellaneous

For context-dependent text-to-SQL:

Evaluation

Metrics

Summary of metrics, the datasets that use them, and their potential error cases.

| Metrics | Datasets | Errors |
| --- | --- | --- |
| Naive Execution Accuracy | GeoQuery, IMDB, Yelp, WikiSQL, etc. | False positive |
| Exact String Match | Advising, WikiSQL, etc. | False negative |
| Exact Set Match | Spider | False negative |
| Test Suite Accuracy (execution accuracy with generated databases) | Spider, GeoQuery, etc. | False positive |
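The error cases above can be reproduced in a few lines. Below is a minimal sketch (toy schema invented for illustration) contrasting naive execution accuracy with exact string match: a semantically wrong query that happens to return the gold result on one DB instance yields a false positive for execution accuracy.

```python
import sqlite3

def execution_match(pred, gold, conn):
    # Naive execution accuracy: compare result sets on a single DB.
    try:
        return set(conn.execute(pred).fetchall()) == set(conn.execute(gold).fetchall())
    except sqlite3.Error:
        return False

def string_match(pred, gold):
    # Exact string match after whitespace/case normalization:
    # rejects semantically equivalent but textually different queries.
    return " ".join(pred.split()).lower() == " ".join(gold.split()).lower()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, county TEXT)")
conn.execute("INSERT INTO city VALUES ('Alameda', 'Alameda County')")

gold = "SELECT name FROM city WHERE county = 'Alameda County'"
pred = "SELECT name FROM city"  # semantically wrong query

# On this one-row DB both queries return the same result:
print(execution_match(pred, gold, conn))  # True  -> false positive
print(string_match(pred, gold))           # False
```

Test suite accuracy repeats the execution comparison over many generated databases, so a coincidental match like the one above is unlikely to survive all of them.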

Evaluation Setup

Discussion and Future Directions

  • Cross-domain text-to-SQL: incorporate domain knowledge into models trained on existing datasets, and deploy such models efficiently across different domains.

  • Real-world use cases:

    • Handle corrupted tables or scenarios where no table is provided.
    • Handle user inputs different from the existing datasets.
    • Facilitate DB administrators in managing DB schemas and updating DB content.
    • Multi-lingual text-to-SQL.
    • DB interface for the disabled.
  • Integrated into a larger scope of research:

    • QA system for DB.
    • Dialogue system with knowledge from DB.
    • Explore the inter-relation between SQL and other logical forms.
    • Generalized semantic parsing.
  • Others:

    • Apply prompt learning to text-to-SQL: how to make the system robust.
    • Evaluation of the existing text-to-SQL systems.

Text-to-SQL examples

Example of the table in the database

The domain of the Restaurants dataset is restaurant information, where questions are typically about food type, restaurant location, etc.

Databases differ widely in how many tables they contain: the Restaurants database has 3 tables, while ATIS has 32 (Suhr et al., 2020).

Geography, one of the 3 tables in the Restaurants database. * denotes the primary key of this table. We include only 3 rows for demonstration purposes.

| CITY_NAME* | COUNTY | REGION |
| --- | --- | --- |
| VARCHAR(255) | VARCHAR(255) | VARCHAR(255) |
| Alameda | Alameda County | Bay Area |
| Alamo | Contra Costa County | Bay Area |
| Albany | Alameda County | Bay Area |
| ... | ... | ... |

Domain Knowledge

Question:

Will undergrads be okay to take 581 ?

SQL query:

SELECT DISTINCT T1.ADVISORY_REQUIREMENT, T1.ENFORCED_REQUIREMENT, T1.NAME
FROM COURSE AS T1
WHERE T1.DEPARTMENT = "EECS" AND T1.NUMBER = 581 ;

In the Advising dataset, the department “EECS” is domain knowledge: “581” in the utterance refers to the course with number 581 in the EECS department.

Dataset Convention

Question:

Give me some restaurants in alameda ?

SQL query:

SELECT T1.HOUSE_NUMBER, T2.NAME
FROM LOCATION AS T1, RESTAURANT AS T2
WHERE T1.CITY_NAME = "alameda" AND T2.ID = T1.RESTAURANT_ID ;

In the Restaurants dataset, when the user asks for “restaurants”, the corresponding SQL query by dataset convention returns the columns “HOUSE_NUMBER” and “NAME”.

Text-to-SQL Templates

An example of the question/SQL template pair used by Iyer et al. (2017) is as follows:

Question template:

Get all <ENT1>.<NAME> having <ENT2>.<COL1>.<NAME> as <ENT2>.<COL1>.<TYPE>

SQL query template:

SELECT <ENT1>.<DEF>
FROM JOIN_FROM(<ENT1>, <ENT2>)
WHERE JOIN_WHERE(<ENT1>, <ENT2>)
AND <ENT2>.<COL1> = <ENT2>.<COL1>.<TYPE> ;

Generated question:

Get all author having dataset as DATASET_TYPE

Generated SQL query:

SELECT author.authorId
FROM author, writes, paper, paperDataset, dataset
WHERE author.authorId = writes.authorId
AND writes.paperId = paper.paperId
AND paper.paperId = paperDataset.paperId
AND paperDataset.datasetId = dataset.datasetId
AND dataset.datasetName = DATASET_TYPE ;

They populate the slots in the templates with table and column names from the database schema and join the corresponding tables accordingly.
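A minimal sketch of that slot-filling step, using Python string templates rather than Iyer et al.'s actual template grammar. In their system the join path is resolved from the schema's foreign keys (the JOIN_FROM/JOIN_WHERE macros above); it is hard-coded here to keep the sketch short.

```python
# Hypothetical slot-filling over the templates above; the join path
# between author and dataset is hard-coded for brevity.
question_template = "Get all {ent1} having {ent2} as {value}"
sql_template = (
    "SELECT author.authorId FROM author, writes, paper, paperDataset, dataset "
    "WHERE author.authorId = writes.authorId "
    "AND writes.paperId = paper.paperId "
    "AND paper.paperId = paperDataset.paperId "
    "AND paperDataset.datasetId = dataset.datasetId "
    "AND dataset.datasetName = {value};"
)

slots = {"ent1": "author", "ent2": "dataset", "value": "DATASET_TYPE"}
print(question_template.format(**slots))  # Get all author having dataset as DATASET_TYPE
print(sql_template.format(**slots))
```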

Iyer et al. (2017) also use PPDB (Ganitkevitch et al., 2013) to paraphrase the NL questions; for example, PPDB paraphrases “thrown into jail” as “imprisoned”.

This Github Repo also holds NL-SQL templates for Academic, Advising, ATIS, Geography, IMDB, Restaurants, Scholar, and Yelp, in which the corresponding entities in the NL-SQL pairs are masked.

Complexity of NL-SQL pairs

Yu et al. (2018c) define SQL hardness by the number of SQL components: a query is considered harder when it contains more SQL keywords (such as GROUP BY) and nested subqueries. Here are some examples from the original paper:

Easy:

SELECT COUNT(*)
FROM cars_data
WHERE cylinders > 4 ;

Medium:

SELECT T2.name, COUNT(*)
FROM concert AS T1
JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id
GROUP BY T1.stadium_id ;

Hard:

SELECT T1.country_name
FROM countries AS T1
JOIN continents AS T2 ON T1.continent = T2.cont_id
JOIN car_makers AS T3 ON T1.country_id = T3.country
WHERE T2.continent = 'Europe'
GROUP BY T1.country_name
HAVING COUNT(*) >= 3 ;

Extra Hard:

SELECT AVG(life_expectancy)
FROM country
WHERE name NOT IN (
    SELECT T1.name
    FROM country AS T1
    JOIN country_language AS T2 ON T1.code = T2.country_code
    WHERE T2.language = "English" AND T2.is_official = "T"
) ;
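A rough, keyword-counting approximation of this hardness criterion is sketched below. The official Spider evaluation script parses the SQL rather than matching keywords, so the keyword list and thresholds here are illustrative only.

```python
import re

# Components counted toward hardness; an illustrative subset.
KEYWORDS = ["JOIN", "GROUP BY", "HAVING", "ORDER BY", "LIMIT",
            "INTERSECT", "UNION", "EXCEPT", "NOT IN"]

def hardness(sql):
    # Rough component count in the spirit of the Spider criteria;
    # the official evaluation script parses the query instead.
    s = " ".join(sql.upper().split())  # normalize whitespace
    count = sum(len(re.findall(r"\b" + kw + r"\b", s)) for kw in KEYWORDS)
    count += s.count("(SELECT")  # nested subqueries
    if count == 0:
        return "easy"
    elif count <= 2:
        return "medium"
    elif count <= 4:
        return "hard"
    return "extra hard"

print(hardness("SELECT COUNT(*) FROM cars_data WHERE cylinders > 4;"))  # easy
print(hardness("SELECT T2.name, COUNT(*) FROM concert AS T1 "
               "JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id "
               "GROUP BY T1.stadium_id;"))  # medium
```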

There is no established measure of how hard the NL question is. Intuitively, model performance can decrease on longer user questions. However, longer sentences can convey more complete information, while shorter ones can be ambiguous. Moreover, domain-specific phrases can confuse the model in both short and long utterances (Suhr et al., 2020). Thus, researchers need to consider multiple perspectives to determine the complexity of a natural utterance.

Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.
