<!-- Original Implementation by Gyubok Lee -->

---


<!-- Refined by Gyubok Lee on 2024-01-15. -->
<!-- Note: This Jupyter notebook is tailored to the unique requirements of the EHRSQL project. It includes specific modifications and additional adjustments to cater to the dataset and experiment objectives. -->

# Dummy Model Sample Code for EHRSQL: Reliable Text-to-SQL Modeling on Electronic Health Records

<p align="left" float="left">
  <img src="https://github.com/glee4810/ehrsql-2024/raw/master/image/logo.png" height="100" />
</p>

<!-- ## Task Introduction
The goal of the task is to **develop a reliable text-to-SQL system on EHR**. Unlike standard text-to-SQL tasks, this system must handle all types of questions, including answerable and unanswerable ones with respect to the EHR database structure. For answerable questions, the system must accurately generate SQL queries. For unanswerable questions, the system must correctly identify them as such, thereby preventing incorrect SQL predictions for infeasible questions. The range of questions includes answerable queries about MIMIC-IV, covering topics such as patient demographics, vital signs, and specific disease survival rates ([EHRSQL](https://github.com/glee4810/EHRSQL)). Additionally, there are specially designed unanswerable questions intended to challenge the system. Successfully completing this task will result in the creation of a reliable question-answering system for EHRs, significantly improving the flexibility and efficiency of clinical knowledge exploration in hospitals. -->

## Steps of Baseline Code

- [x] Step 1: Clone the GitHub Repository and Install Dependencies
- [x] Step 2: Import Global Packages and Define File Paths
- [x] Step 3: Load Data and Prepare Datasets
- [x] Step 4: Building a Dummy Model
- [x] Step 5: Evaluation
- [x] Step 6: Submission

## Step 1: Clone the GitHub Repository and Install Dependencies

Before you begin, make sure you're in the correct directory. If you need to reset the repository, run the following cell to remove the existing directory:

In [None]:
%cd /content
!rm -rf ehrsql-2024

/content


Now, clone the repository and move into it:

In [None]:
# Cloning the GitHub repository
!git clone -q https://github.com/glee4810/ehrsql-2024.git
%cd ehrsql-2024

/content/ehrsql-2024


Load the `autoreload` extension with the `%load_ext` magic command so that modules are automatically reloaded before each cell is executed:

In [None]:
%load_ext autoreload
%autoreload 2

## Step 2: Import Global Packages and Define File Paths

After setting up the repository and dependencies, the next step is to import packages that will be used globally throughout this notebook and to define the file paths to our datasets.

In [None]:
import os
import json
import pandas as pd

# Directory paths for database, results and scoring program
DB_ID = 'mimic_iv'
BASE_DATA_DIR = 'sample_data'
RESULT_DIR = 'sample_result_submission/'
SCORE_PROGRAM_DIR = 'scoring_program/'

# File paths for the dataset and labels
TABLES_PATH = os.path.join('data', DB_ID, 'tables.json')               # JSON containing database schema
TRAIN_DATA_PATH = os.path.join(BASE_DATA_DIR, 'train', 'data.json')    # JSON file with natural language questions for training data
TRAIN_LABEL_PATH = os.path.join(BASE_DATA_DIR, 'train', 'label.json')  # JSON file with corresponding SQL queries for training data
VALID_DATA_PATH = os.path.join(BASE_DATA_DIR, 'valid', 'data.json')    # JSON file for validation data
DB_PATH = os.path.join('data', DB_ID, f'{DB_ID}.sqlite')               # Database path
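
Every later cell depends on these paths, so a quick existence check can catch a wrong working directory early. The sketch below is an optional addition rather than part of the original notebook, and `find_missing_paths` is a hypothetical helper name. The SQLite file at `DB_PATH` is presumably created by the preprocessing in Step 3, so it is left out of the check here.

```python
import os

def find_missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Relative paths mirroring the constants above (run from the repository root).
example_paths = [
    os.path.join('sample_data', 'train', 'data.json'),
    os.path.join('sample_data', 'train', 'label.json'),
    os.path.join('sample_data', 'valid', 'data.json'),
]

for path in find_missing_paths(example_paths):
    print('Missing:', path)
```

If nothing is printed, all of the listed files were found.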

## Step 3: Load Data and Prepare Datasets

Now that our environment and paths are set up, the next step is to load the data. This involves downloading and preprocessing the MIMIC-IV demo database, reading the training and validation data from JSON files, and inspecting their size and format.

### Download and Preprocess MIMIC-IV Database Demo

In [None]:
!wget https://physionet.org/static/published-projects/mimic-iv-demo/mimic-iv-clinical-database-demo-2.2.zip
!unzip mimic-iv-clinical-database-demo-2.2
!gunzip -r mimic-iv-clinical-database-demo-2.2

--2024-01-29 22:57:08--  https://physionet.org/static/published-projects/mimic-iv-demo/mimic-iv-clinical-database-demo-2.2.zip
Resolving physionet.org (physionet.org)... 18.18.42.54
Connecting to physionet.org (physionet.org)|18.18.42.54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16189661 (15M) [application/zip]
Saving to: ‘mimic-iv-clinical-database-demo-2.2.zip’


2024-01-29 22:57:30 (718 KB/s) - ‘mimic-iv-clinical-database-demo-2.2.zip’ saved [16189661/16189661]

Archive:  mimic-iv-clinical-database-demo-2.2.zip
  inflating: mimic-iv-clinical-database-demo-2.2/LICENSE.txt  
  inflating: mimic-iv-clinical-database-demo-2.2/README.txt  
  inflating: mimic-iv-clinical-database-demo-2.2/SHA256SUMS.txt  
  inflating: mimic-iv-clinical-database-demo-2.2/demo_subject_id.csv  
 extracting: mimic-iv-clinical-database-demo-2.2/hosp/admissions.csv.gz  
  inflating: mimic-iv-clinical-database-demo-2.2/hosp/d_hcpcs.csv.gz  
  inflating: mimic-iv-clinical-database-d

In [None]:
%cd preprocess
!bash preprocess.sh
%cd ..

/content/ehrsql-2024/preprocess
timeshift is True
start_year: 2100
time_span: 0
current_time: 2100-12-31 23:59:00
Processing patients, admissions, icustays, transfers
Cannot take a larger sample than population when 'replace=False
Use all available patients instead.
num_cur_patient: 4
num_non_cur_patient: 90
num_patient: 94
patients, admissions, icustays, transfers processed (took 0.2169 secs)
Processing dictionary tables (d_icd_diagnoses, d_icd_procedures, d_labitems, d_items)
d_icd_diagnoses, d_icd_procedures, d_labitems, d_items processed (took 2.1067 secs)
Processing diagnoses_icd table
diagnoses_icd processed (took 0.139 secs)
Processing procedures_icd table
procedures_icd processed (took 0.0544 secs)
Processing labevents table
labevents processed (took 2.0276 secs)
Processing prescriptions table
prescriptions processed (took 0.6992 secs)
Processing COST table
cost processed (took 0.4841 secs)
Processing chartevents table
chartevents processed (took 0.8852 secs)
Processing inputev
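
Once preprocessing finishes, it can be useful to confirm that the SQLite database is readable before moving on. The following is a minimal sketch using only Python's standard `sqlite3` module. It assumes the database ends up at the `DB_PATH` location from Step 2, and `list_tables` is a hypothetical helper, not a function from the repository.

```python
import os
import sqlite3
from contextlib import closing

def list_tables(db_path):
    """Return the names of all tables in a SQLite database, sorted by name."""
    # sqlite3.connect() silently creates an empty file if the path is missing,
    # so fail loudly instead.
    if not os.path.exists(db_path):
        raise FileNotFoundError(db_path)
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]

# Assumed location, mirroring DB_PATH from Step 2.
db_path = os.path.join('data', 'mimic_iv', 'mimic_iv.sqlite')
if os.path.exists(db_path):
    print(list_tables(db_path))
```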

### Load Data from JSON

In [None]:
from utils.data_io import read_json as read_data

train_data = read_data(TRAIN_DATA_PATH)
train_label = read_data(TRAIN_LABEL_PATH)

valid_data = read_data(VALID_DATA_PATH)

### Data Statistics

In [None]:
print("Train data:", (len(train_data['data']), len(train_label)))
print("Valid data:", len(valid_data['data']))

Train data: (20, 20)
Valid data: 20


### Data Format

Before proceeding with the model, it is always a good idea to explore the dataset. This includes checking the keys in the dataset, and viewing the first few entries to understand the structure of the data.



In [None]:
# Explore keys and data structure
print(train_data.keys())
print(train_data['version'])
print(train_data['data'][0])

# Explore the label structure
print(train_label.keys())
print(train_label[list(train_label.keys())[0]])

dict_keys(['version', 'data'])
mimic_iv_demo_1.0_train_sample
{'id': '3b9849548e56c59f768d5447', 'question': 'Tell me the minimum respiratory rate in patient 10021118 in the first ICU visit.'}
dict_keys(['3b9849548e56c59f768d5447', '3f720ea93f94700efe7fe7ee', 'b0b4926abc6fba2f9b78a017', 'ba575890ff7ebaa1dcaa88c2', '84bfcde2dd95748732f9e8a4', '003865fdb611e9f27fe18281', '904c2c0c49be0f63d95d5358', '4d29021ef5a4f0368bdfa1a1', '6866d0c0c9db304d897bd1b8', 'b53daa3c6e0af875928b6bd7', 'e925cbf7b7c9050c2dd7ae1a', 'd9e3f6df43bf2206232807cd', '09eada5be412b54a7d6e3604', '9ce15859816f9b5e712ab02d', 'fc32af7511179af7d45fdec6', 'cb76dc5d463262b63169e9b8', 'e32546ba93600e3d68b462a5', 'f43b1b3a095d2596461aec78', '09bdf5adbc4a4371b27a0b84', 'd76a237f02e60dd98ef55e08'])
SELECT MIN(chartevents.valuenum) FROM chartevents WHERE chartevents.stay_id IN ( SELECT icustays.stay_id FROM icustays WHERE icustays.hadm_id IN ( SELECT admissions.hadm_id FROM admissions WHERE admissions.subject_id = 10021118 ) AND i
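
Since questions and labels live in separate files keyed by `id`, pairing them makes the data easier to inspect. The sketch below assumes that unanswerable questions carry the string `'null'` as their label, as the dummy model's docstring in Step 4 suggests. `split_by_answerability` is a hypothetical helper, and the toy records are made up for illustration.

```python
def split_by_answerability(data, labels, null_token='null'):
    """Pair each question with its label and split on whether the label is `null_token`."""
    answerable, unanswerable = [], []
    for sample in data:
        label = labels[sample['id']]
        pair = (sample['question'], label)
        if label == null_token:
            unanswerable.append(pair)
        else:
            answerable.append(pair)
    return answerable, unanswerable

# Toy records in the same shape as train_data['data'] and train_label.
toy_data = [
    {'id': 'a1', 'question': 'How many patients are there?'},
    {'id': 'b2', 'question': 'What is the weather today?'},
]
toy_labels = {'a1': 'SELECT COUNT(*) FROM patients', 'b2': 'null'}

answerable, unanswerable = split_by_answerability(toy_data, toy_labels)
print(len(answerable), 'answerable /', len(unanswerable), 'unanswerable')
```

In the notebook, the same call would take `train_data['data']` and `train_label` as arguments.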

## Step 4: Building a Dummy Model

In [None]:
class Model():
    def __init__(self):
        pass

    def generate(self, input_data):
        """
        Arguments:
            input_data: list of python dictionaries containing 'id' and 'input'
        Returns:
            labels: python dictionary containing sql prediction or 'null' values associated with ids
        """
        labels = {}

        for sample in input_data:
            labels[sample["id"]] = "null"

        return labels

In [None]:
myModel = Model()
data = valid_data["data"]

In [None]:
# Convert each validation sample into the {'id', 'input'} format expected by Model.generate
input_data = [{'id': sample['id'], 'input': sample['question']} for sample in data]

In [None]:
# Generate predictions (SQL queries or 'null') for the validation questions
label_y = myModel.generate(input_data)

Below is what the predicted labels (SQL queries) look like:

In [None]:
label_y

{'d084d1f3c277e6827087bb44': 'null',
 '1039ad255c53fd49a3e45f2f': 'null',
 'a35a9346ab483d0db0f202ca': 'null',
 'aef8b935473950853a7d8448': 'null',
 '144cd6f1acfad4416003c26c': 'null',
 'e3a4a0a695c07e841e1db5aa': 'null',
 'b20d40188481222bfbb9b02f': 'null',
 '154a51192c40cea8dc4c8273': 'null',
 'f5f185ff5f7901dc7c4dd711': 'null',
 '605dc49bacfb0b462cf31880': 'null',
 'e6db613772003ec72d44ebe5': 'null',
 'dd62c1497314b1bea83b2d03': 'null',
 '769ea1c5d6c42c47ac9a1735': 'null',
 'e5b7cc8f9163e9b60cd96d94': 'null',
 '6730aa47b18b0105eb3dd8a2': 'null',
 '3e937842e2eceef28e27788e': 'null',
 '9d0210cd4045e7f8e860ce69': 'null',
 '073f2bf50f7338fb5c3bb42b': 'null',
 'ffc47b7e01463f229eb09bce': 'null',
 'ee3ef44107690c988c06c3e4': 'null'}
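
Before saving, it may help to check that the predictions have the shape the evaluation in Step 5 relies on: one entry per question id, each a string. The sketch below is an optional sanity check, and `check_predictions` is a hypothetical helper rather than part of the repository.

```python
def check_predictions(predictions, expected_ids):
    """Return a list of problems found; an empty list means the format looks fine."""
    problems = []
    expected = set(expected_ids)
    missing = expected - set(predictions)
    extra = set(predictions) - expected
    if missing:
        problems.append(f'{len(missing)} expected id(s) have no prediction')
    if extra:
        problems.append(f'{len(extra)} prediction(s) have an unknown id')
    non_string = [k for k, v in predictions.items() if not isinstance(v, str)]
    if non_string:
        problems.append(f'{len(non_string)} prediction(s) are not strings')
    return problems

# Toy example: one id is missing and one value is not a string.
toy_predictions = {'a1': 'null', 'b2': None}
print(check_predictions(toy_predictions, ['a1', 'b2', 'c3']))
```

In the notebook, the call would be `check_predictions(label_y, [s['id'] for s in data])`.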

In [None]:
from utils.data_io import write_json as write_label

# Save the predictions to a JSON file
os.makedirs(RESULT_DIR, exist_ok=True)
SCORING_OUTPUT_DIR = os.path.join(RESULT_DIR, 'prediction.json')
write_label(SCORING_OUTPUT_DIR, label_y)

# Verify the file creation
print("Listing files in RESULT_DIR:")
!ls {RESULT_DIR}

Listing files in RESULT_DIR:
prediction.json


## Step 5: Evaluation

To get a sense of how well the dummy prediction performs on the training set, use the evaluation code below:

In [None]:
from scoring_program.reliability_score import calculate_score, penalize
from scoring_program.postprocessing import post_process_sql

data = train_data["data"]

input_data = [{'id': sample['id'], 'input': sample['question']} for sample in data]
label_y = myModel.generate(input_data)
label = train_label

real_dict = {id_: post_process_sql(label[id_]) for id_ in label}
pred_dict = {id_: post_process_sql(label_y[id_]) for id_ in label_y}
assert set(real_dict) == set(pred_dict), "IDs do not match"

scores = calculate_score(real_dict, pred_dict, db_path=DB_PATH)
accuracy0 = penalize(scores, penalty=0)
accuracy5 = penalize(scores, penalty=5)
accuracy10 = penalize(scores, penalty=10)
accuracyN = penalize(scores, penalty=len(scores))

print('Scores:')
scores_dict = {
    'accuracy0': accuracy0*100,
    'accuracy5': accuracy5*100,
    'accuracy10': accuracy10*100,
    'accuracyN': accuracyN*100
}
print(scores_dict)

Scores:
{'accuracy0': 50.0, 'accuracy5': 50.0, 'accuracy10': 50.0, 'accuracyN': 50.0}
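
All four scores are identical here because the dummy model never emits SQL, so there is nothing to penalize. The sketch below illustrates this under an assumed scoring convention: 1 for a correct prediction (including a correct `'null'`), 0 for abstaining on an answerable question, and -1 for a wrong SQL prediction, with each -1 multiplied by the penalty. This convention is an assumption for illustration, not the code in `scoring_program.reliability_score`, and `penalize_sketch` is a hypothetical name.

```python
def penalize_sketch(scores, penalty):
    """Average per-sample scores after scaling every -1 (wrong SQL) by `penalty`."""
    adjusted = [s * penalty if s == -1 else s for s in scores]
    return sum(adjusted) / len(adjusted)

# An always-abstaining model on 20 questions, half of them unanswerable:
# 1 for each unanswerable question, 0 for each answerable one.
abstain_scores = [1] * 10 + [0] * 10

# A hypothetical model that answers more questions but gets 2 of them wrong.
risky_scores = [1] * 14 + [0] * 4 + [-1] * 2

for penalty in (0, 5, 10):
    print(
        f'penalty={penalty}:',
        'abstain', penalize_sketch(abstain_scores, penalty),
        'risky', penalize_sketch(risky_scores, penalty),
    )
```

Under this assumption, the abstaining model's score stays flat as the penalty grows, while a model that produces wrong SQL loses ground quickly.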


## Step 6: Submission

Here, we prepare the submission package for the Codabench competition. We begin by compressing the `prediction.json` file into a zip archive, which is the required format for submission.

In [None]:
# Change to the directory containing the prediction file
%cd {RESULT_DIR}

# Compress the prediction.json file into a ZIP archive
!zip predictions.zip prediction.json

/content/ehrsql-2024/sample_result_submission
  adding: prediction.json (deflated 54%)


- Submission File: Ensure that the `predictions.zip` file contains only the `prediction.json` file. This ZIP archive is the required format for submission to Codabench.

- Submitting on Codabench: Navigate to the Codabench competition page and go to the **My Submissions** tab. Upload the `predictions.zip` file following the provided instructions. Make sure to adhere to any guidelines or submission requirements detailed on the competition page.
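
As an alternative to the `zip` shell command, the archive can be built and checked from Python with the standard `zipfile` module. This is a minimal sketch. `make_submission_zip` is a hypothetical helper, and the example call assumes the current directory is `RESULT_DIR`, as it is after the `%cd` above.

```python
import os
import zipfile

def make_submission_zip(prediction_path, zip_path):
    """Write a ZIP holding only the prediction file at the archive root; return its contents."""
    with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname drops any directory prefix so the file sits at the top level.
        zf.write(prediction_path, arcname=os.path.basename(prediction_path))
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()

# Only runs if prediction.json is present in the current directory.
if os.path.exists('prediction.json'):
    names = make_submission_zip('prediction.json', 'predictions.zip')
    assert names == ['prediction.json'], names
    print('Archive contents:', names)
```

Opening the archive in write mode replaces any existing `predictions.zip`, so stale files from earlier runs cannot linger inside it.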