Official repository of EHR-SeqSQL : A Sequential Text-to-SQL Dataset For Interactively Exploring Electronic Health Records (ACL 2024 Findings)
We introduce EHR-SeqSQL, a novel sequential text-to-SQL dataset for Electronic Health Record (EHR) databases. EHR-SeqSQL is designed to address critical yet underexplored aspects in text-to-SQL parsing: interactivity, compositionality, and efficiency. To the best of our knowledge, EHR-SeqSQL is not only the largest but also the first medical text-to-SQL dataset benchmark to include sequential and contextual questions. We provide a data split and the new test set designed to assess compositional generalization ability. With EHR-SeqSQL, we aim to bridge the gap between practical needs and academic research in the text-to-SQL domain.
The data.json file contains the following fields:
seed_question: The original question from EHRSQL dataset.value: Sampled values from the database.question: Paraphrased version of the question sequences.question_template: The original template question sequences.seqsql: Our version of SQL query sequences with special tokens.sql: Executable SQL query sequences without special tokens.random_split: Whether the sample is for 'train' or 'test' in the random split.compositional_split: Whether the sample is for 'train' or 'test' in the compositional split.
id, department, and importance fields keeps the same values as in the corresponding EHRSQL data sample.
When you use the this dataset, we would appreciate it if you cite our paper:
@article{ryu2024ehr,
title={EHR-SeqSQL: A Sequential Text-to-SQL Dataset For Interactively Exploring Electronic Health Records},
author={Ryu, Jaehee and Cho, Seonhee and Lee, Gyubok and Choi, Edward},
journal={arXiv preprint arXiv:2406.00019},
year={2024}
}