Competition Code for "Reliable Text-to-SQL on Electronic Health Records" - Clinical NLP Workshop @ NAACL 2024

snoop2head/EHRSQL

Description

We protect patient privacy by using only white-box models such as flan-t5, avoiding the potential data leakage of sending data to black-box models through APIs. To handle the computational constraints posed by lengthy SQL queries, we use a T5 encoder-decoder architecture, which allows separate attention mechanisms for different parts of the input. We also use the data efficiently: an end-to-end architecture with a linear projection layer lets us retain and use every part of the dataset, including the unanswerable questions that are traditionally discarded. For validation, we use stratified K-fold cross-validation so that no data split goes unused. We track metrics rigorously during training, which gives insight into model performance and helped us identify significant discrepancies in performance metrics and pinpoint where post-processing was needed. Overall, the project demonstrates practical strategies for protecting data privacy, using the full dataset, and refining model performance through continuous evaluation.
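As a rough illustration of the stratified K-fold idea, the sketch below splits a toy dataset so that every fold keeps about the same share of answerable and unanswerable questions, and every example lands in exactly one validation fold. It is a minimal pure-Python stand-in, not the code in this repository: the function name `stratified_kfold`, the 80/20 label mix, and the seed are all assumptions made for the example. In practice, a library implementation such as scikit-learn's `StratifiedKFold` would be the usual choice.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs that roughly preserve class proportions.

    Hypothetical helper written for illustration only.
    """
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin across folds
    # so each fold receives a near-equal share of every class.
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)

    # Each fold serves as the validation set once; the rest form the training set.
    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, val_idx


# Assumed toy labels: 1 = answerable question, 0 = unanswerable question.
labels = [1] * 80 + [0] * 20
splits = list(stratified_kfold(labels, n_splits=5))

for fold, (train_idx, val_idx) in enumerate(splits):
    n_unanswerable = sum(1 for i in val_idx if labels[i] == 0)
    print(
        f"fold {fold}: train={len(train_idx)} val={len(val_idx)} "
        f"unanswerable_in_val={n_unanswerable}"
    )
```

Because each index appears in exactly one validation fold, training one model per fold means every example is used for both training and validation across the run, which is the sense in which no split goes unused.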

Install

source /home/ubuntu/miniconda3/bin/activate base
conda create -n ehr python=3.11
conda activate ehr
pip install -r requirements.txt

Train

source /home/ubuntu/miniconda3/bin/activate base
conda activate ehr
python train.py config/flan-t5-base.yaml

Or

bash run.sh

Predict

python predict.py config/flan-t5-base.yaml
