# Creating a dataset with a simple format 
In this tutorial, we will walk through how to generate a dataset using a simple input format that will work with some parts of the FEMR pipeline.

First, we'll start with some setup to create temporary folders to store results.

In [1]:
import os
import csv

# Create directories to save the dataset and associated files 
TARGET_DIR = "trash/simple_femr"
os.makedirs(TARGET_DIR, exist_ok=True)

INPUT_DIR = os.path.join(TARGET_DIR, "simple_input")
os.makedirs(INPUT_DIR, exist_ok=True)

# 1. Basic input schema
The input schema is a folder of csv files, where each csv file has <u>at minimum</u> the following columns:

`patient_id`, `start`, `code`, and `value`

 - `patient_id` is the ID for the patient who has the event. `patient_id` must be a 64-bit unsigned integer.

 - `start` is the start timestamp for an event, ideally when the event is initially recorded in the database. `start` must be an ISO 8601 timestamp string.

 - `code` is a string that identifies what type of event occurred. It must consist of two parts, a vocabulary signifier and the code itself, separated by a "/" character. For example, `ICD10CM/E11.4` would indicate an E11.4 ICD10 code.

 - `value` is a value associated with the event. It can either be a numeric value, an arbitrary string, or nothing.

There are also some additional optional columns which apply to some EHR events but not others:
 
 - `visit_id` is a visit identifier indicating that the event is associated with a particular visit.

 - `end` is the end timestamp for an event. `end` must be an ISO 8601 timestamp string.

 - `dose` is a string indicating the dosage for a medication event.
 
 - `units` is a string indicating the units for a laboratory result event.


A single patient can be split across many files, with rows in any arbitrary order.
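
The constraints on the required columns can be checked before running any ETL. The following is a minimal sketch, not part of FEMR: `validate_row` and its messages are hypothetical names introduced here, and it only checks the rules stated above. Note that Python's `fromisoformat` accepts only a subset of ISO 8601 on older Python versions, so it may reject some valid timestamps.

```python
import datetime

REQUIRED_COLUMNS = ("patient_id", "start", "code", "value")
MAX_UINT64 = 2**64 - 1


def validate_row(row):
    """Return a list of problems found in one CSV row (given as a dict).

    Hypothetical helper for illustration; FEMR's own checks may differ.
    """
    problems = []

    # All four required columns must be present, even if `value` is empty.
    for column in REQUIRED_COLUMNS:
        if column not in row:
            problems.append(f"missing column: {column}")
    if problems:
        return problems

    # patient_id must fit in a 64-bit unsigned integer.
    try:
        patient_id = int(row["patient_id"])
        if not 0 <= patient_id <= MAX_UINT64:
            problems.append("patient_id out of range")
    except ValueError:
        problems.append("patient_id is not an integer")

    # start must parse as an ISO 8601 timestamp (date-only strings are accepted).
    try:
        datetime.datetime.fromisoformat(row["start"])
    except ValueError:
        problems.append("start is not ISO 8601")

    # code must look like "<vocabulary>/<code>".
    vocabulary, separator, code = row["code"].partition("/")
    if not (vocabulary and separator and code):
        problems.append("code is not of the form vocabulary/code")

    return problems
```

Rows read with `csv.DictReader` can be passed to this function directly, since each row is already a dict keyed by column name.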

## Exercise 1: Add demographic information
Now we manually create a dataset, "minimum.csv", that contains a single patient with three rows.

We use patients' date of birth as the very first visit time in our data format, so the first event is always `Birth`

Rows 2 and 3 include the demographic information, `Gender` and `Race`, respectively, of this patient with corresponding values. Demographics should generally be mapped as demographics codes assigned to the birth date of the patient (with no value assigned). For demographic rows, the vocabulary signifier and the code itself are the same.

In [2]:
with open(os.path.join(INPUT_DIR, "minimum.csv"), "w") as f:
    f.write("patient_id,start,code,value\n")              
    f.write("3,1990-01-07,Birth/Birth,\n")         # First event is always birth
    f.write("3,1990-01-07,Gender/Gender,Female\n") # demographics 
    f.write("3,1990-01-07,Race/Race,White\n")      # demographics

# 2. Expanding your dataset
You may now start adding rows for other events present in your EHR records.

## Exercise 2: Add diagnosis information
We now add more events/rows that capture patients' diagnosis information, e.g., ICD 9/10 codes

For diagnoses, the `code` column has two parts: the vocabulary signifier (e.g., ICD10CM) and the code itself (e.g., E11.4 or E10.1). The `value` column should be left empty. Procedure and diagnosis codes should generally be given the timestamp of when the event happened.

Note that different diagnoses may be given at different visits, so the corresponding `start` timestamps may be different.

In [3]:
with open(os.path.join(INPUT_DIR, "diagnosis.csv"), "w") as f:
    f.write("patient_id,start,code,value\n")              
    f.write("3,2022-05-03,ICD10CM/E11.4,\n")  # diabetes
    f.write("3,2022-06-05,ICD10CM/E10.1,\n")  # type 1 diabetes with ketoacidosis

## Exercise 3: Add lab test information
We now add more events/rows that capture patients' lab values, e.g., Vitals/Blood Pressure

For vitals, the `code` column has two parts: the vocabulary signifier (e.g., Vitals) and the code itself (e.g., Blood Pressure, HbA1c, etc.). The `value` column should contain the numeric value when possible, and a text value otherwise.

We also recommend adding another column, `units`, to record the units of each test result.

Note that different lab tests may be given at different visits, so the corresponding `start` timestamps may differ.

In [4]:
with open(os.path.join(INPUT_DIR, "lab_tests.csv"), "w") as f:
    f.write("patient_id,start,code,value,units\n")              
    f.write("3,2000-07-09,Vitals/Blood Pressure,160,mmHg\n") 
    f.write("3,2000-08-09,Vitals/HbA1c,7,%\n") 
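
The rule that `value` is numeric when possible and text otherwise can be sketched as a small parser. This is an illustration only: `parse_value` is a hypothetical helper, and FEMR's actual parsing may differ in its details (for example, in how it treats strings such as "nan").

```python
def parse_value(raw):
    """Interpret a raw `value` cell as None, a float, or a string.

    Hypothetical helper for illustration, not FEMR's implementation.
    """
    text = raw.strip()
    if text == "":
        return None          # nothing was recorded for this event
    try:
        return float(text)   # numeric result, e.g. "160" becomes 160.0
    except ValueError:
        return text          # anything else stays as text, e.g. "Female"
```

Under this sketch, the three kinds of `value` described in the schema (numeric, arbitrary string, or nothing) map to `float`, `str`, and `None`.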

## Exercise 4: Add medication information
We now add more events/rows that capture patients' medication information, e.g., Drug/Atorvastatin

For medications, the `code` column has two parts, the vocabulary signifier (e.g., Drug) and the code itself (e.g., Atorvastatin, Heparin Lock Flush, Multivitamins, etc.) The `value` column should be empty.

We add another column, `dosage`, to record the dose of the prescribed medication. For medications, the `units` column indicates the unit of the medication dose.

Note that different medications may be prescribed at different visits, so the corresponding `start` timestamps may differ.

In [5]:
with open(os.path.join(INPUT_DIR, "drug.csv"), "w") as f:
    f.write("patient_id,start,code,value,units,dosage\n")              
    f.write("3,2022-06-05,Drug/Atorvastatin,,mg,50\n") 
    f.write("3,2022-07-06,Drug/Multivitamins,,ml,5\n")
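
The cells above build each row by hand with `f.write`, which works for these values but would produce a malformed file if a field contained a comma or a newline. A minimal sketch using the standard library's `csv.DictWriter` is shown here; `write_events` is a hypothetical helper and `Note/Free Text` is a made-up code used only to demonstrate quoting.

```python
import csv
import io


def write_events(f, rows, extra_columns=()):
    """Write event dicts as CSV; columns missing from a row are left blank."""
    fieldnames = ["patient_id", "start", "code", "value", *extra_columns]
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained. For a real file,
# use open(path, "w", newline="") as recommended by the csv module.
buffer = io.StringIO()
write_events(
    buffer,
    [
        {"patient_id": 3, "start": "2022-06-05", "code": "Drug/Atorvastatin", "units": "mg", "dosage": 50},
        # Hypothetical free-text event whose value contains a comma
        {"patient_id": 3, "start": "2022-07-06", "code": "Note/Free Text", "value": "stable, no change"},
    ],
    extra_columns=("units", "dosage"),
)
print(buffer.getvalue())
```

The writer quotes the second row's value automatically, so the comma inside it is not mistaken for a column separator.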

# 3. Additional considerations
When adding columns, the goal is to add as many columns as needed to maximize the clarity and completeness of the raw/original data.

For simplicity, we only included one patient in the above datasets, but an arbitrary number of patients can be captured in the same file by adding more rows with different `patient_id`.


## Exercise 5: Generate synthetic examples
We now create a dataset of 200 synthetic patients, split across two files of 100 patients each.

In [6]:
import random
import datetime
import string

def add_random_patient(patient_id, f):
    epoch = datetime.date(1990, 1, 1)
    birth = epoch + datetime.timedelta(days=random.randint(100, 1000))
    current_date = birth
    
    code_cat = ["Birth","Gender", "Race","ICD10CM","CPT", "Drug", "Vitals"]
    gender_values = ["F","M"]
    race_values = ["White","Non-White"]
    index = random.randint(0,1)
    for code_type in code_cat:
        if code_type == "Birth":
            clarity = "PATIENT"
            visit_id = 1
            value = ''
            dosage = ''
            unit = ''
            f.write(f"{patient_id},{birth.isoformat()},{code_type}/{code_type},{value},{dosage},{visit_id},{unit},{clarity}\n")
        elif code_type == "Gender":
            clarity = "PATIENT"
            visit_id = 1
            value = gender_values[index]
            dosage = ''
            unit = ''
            f.write(f"{patient_id},{birth.isoformat()},{code_type}/{code_type},{value},{dosage},{visit_id},{unit},{clarity}\n")
        elif code_type == "Race":
            clarity = "PATIENT"
            visit_id = 1
            value = race_values[index]
            dosage = ''
            unit = ''
            f.write(f"{patient_id},{birth.isoformat()},{code_type}/{code_type},{value},{dosage},{visit_id},{unit},{clarity}\n") 
        else:
            for _ in range(random.randint(1, 10)):
                code = ''.join(random.choices(string.ascii_uppercase + string.digits, k=8))
                current_date = current_date +  datetime.timedelta(days=random.randint(0, 100))
                visit_id = random.randint(0,20)
                if code_type == "ICD10CM":
                    clarity = "DIAGNOSIS"
                    value = ''
                    dosage = ''
                    unit = ''
                elif code_type == "CPT":
                    clarity = "PROCEDURES"
                    value = ''
                    dosage = ''
                    unit = ''
                elif code_type == "Drug":
                    clarity = "MED_ORDER"
                    value = ''
                    dosage = random.randint(10,50)
                    unit = "mg"
                elif code_type == "Vitals":
                    clarity = "LAB_RESULT"
                    value = random.randint(80,200)
                    dosage = ''
                    unit = "mmHg"
                f.write(f"{patient_id},{current_date.isoformat()},{code_type}/{code},{value},{dosage},{visit_id},{unit},{clarity}\n") 


In [7]:
for file_no in range(1, 3):
    with open(os.path.join(INPUT_DIR, f"{file_no}.csv"), "w") as f:
        # Header: the four required columns plus four extra columns
        f.write("patient_id,start,code,value,dosage,visit_ids,lab_units,clarity_source\n")
        for patient_id in range(file_no * 100, (file_no + 1) * 100):
            add_random_patient(patient_id, f)

# Preview the first 10 lines of the first file, once it has been closed and flushed
with open(os.path.join(INPUT_DIR, "1.csv"), newline='') as csvfile:
    reader = csv.reader(csvfile)
    for i, row in enumerate(reader):
        print(','.join(row))
        if i >= 9:
            break

patient_id,start,code,value,dosage,visit_ids,lab_units,clarity_source
100,1990-07-03,Birth/Birth,,,1,,PATIENT
100,1990-07-03,Gender/Gender,F,,1,,PATIENT
100,1990-07-03,Race/Race,White,,1,,PATIENT
100,1990-10-01,ICD10CM/98L0QGO4,,,3,,DIAGNOSIS
100,1990-11-01,ICD10CM/N3V2INYQ,,,7,,DIAGNOSIS
100,1991-02-03,ICD10CM/K1H9MXMG,,,20,,DIAGNOSIS
100,1991-02-18,ICD10CM/9S5L6DSO,,,10,,DIAGNOSIS
100,1991-03-11,ICD10CM/JHEA5N22,,,17,,DIAGNOSIS
100,1991-05-19,CPT/IBKSF5WN,,,16,,PROCEDURES


# Additional notes
The ordering of rows for each patient, and whether a patient's rows are split across files, does not matter. Everything will be re-sorted and joined as part of the ETL process.
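
To make the idea concrete, the following sketch gathers rows from every csv file in a folder and orders them by patient and time. It is an illustration of the concept only, not FEMR's ETL code: `load_sorted_events` is a hypothetical helper, and it assumes all `start` values use the same ISO 8601 layout so that plain string comparison matches chronological order.

```python
import csv
import glob
import os


def load_sorted_events(input_dir):
    """Read all csv files in `input_dir` and return rows sorted per patient.

    Hypothetical helper for illustration; it loads everything into memory.
    """
    rows = []
    for path in sorted(glob.glob(os.path.join(input_dir, "*.csv"))):
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))

    # Sort by numeric patient_id, then by the start timestamp string.
    rows.sort(key=lambda row: (int(row["patient_id"]), row["start"]))
    return rows
```

Calling `load_sorted_events(INPUT_DIR)` on the folder built in this tutorial would return one list with each patient's rows grouped together, regardless of which file they came from.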

Other atypical data types, such as flowsheets, can be added as needed, with either string or numeric values, whichever is more natural.

# 4. Convert the directory to an extract
We now convert the dataset we created above to an extract using the function [etl_simple_femr](https://github.com/som-shahlab/femr/blob/main/src/femr/etl_pipelines/simple.py#L66) from the femr repo

The output extract is a femr [PatientDatabase](https://github.com/som-shahlab/femr/blob/Miking98-patch-1/tutorials/0_How%20FEMR%20Works%20%2B%20Toy%20Example.ipynb) that can be directly used by the femr pipeline

In [8]:
# Define the paths where the extract and the extract log will be stored
LOG_DIR = os.path.join(TARGET_DIR, "logs")
EXTRACT_DIR = os.path.join(TARGET_DIR, "extract")

import femr
import femr.etl_pipelines.simple
os.system(f"etl_simple_femr {INPUT_DIR} {EXTRACT_DIR} {LOG_DIR} --num_threads 2")

2023-04-19 21:26:15,949 [MainThread  ] [INFO ]  Extracting from OMOP with arguments Namespace(simple_source='trash/simple_femr/simple_input', target_location='/local-scratch/nigam/projects/ethanid/femr_develop/femr/tutorials/trash/simple_femr/extract', temp_location='/local-scratch/nigam/projects/ethanid/femr_develop/femr/tutorials/trash/simple_femr/logs', num_threads=2)
2023-04-19 21:26:15,952 [MainThread  ] [INFO ]  Already converted to events, skipping
2023-04-19 21:26:15,952 [MainThread  ] [INFO ]  Already converted to patients, skipping
2023-04-19 21:26:15,952 [MainThread  ] [INFO ]  Already converted to extract, skipping


0
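
`os.system` returns the command's exit status (the `0` above) but does not raise an error when the command fails. As an alternative sketch, the same call can be made with `subprocess.run` and `check=True`, which raises on a non-zero exit status. The helper names `build_etl_command` and `run_etl` are hypothetical; the argument order and the `--num_threads` flag are copied from the command used in this tutorial.

```python
import subprocess


def build_etl_command(input_dir, extract_dir, log_dir, num_threads=2):
    """Build the argument list for the `etl_simple_femr` call used in this tutorial."""
    return [
        "etl_simple_femr",
        str(input_dir),
        str(extract_dir),
        str(log_dir),
        "--num_threads",
        str(num_threads),
    ]


def run_etl(input_dir, extract_dir, log_dir, num_threads=2):
    """Run the ETL; raises subprocess.CalledProcessError if it exits non-zero."""
    command = build_etl_command(input_dir, extract_dir, log_dir, num_threads)
    subprocess.run(command, check=True)
```

Passing the arguments as a list also avoids shell quoting problems if any of the paths contain spaces.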

# 5. Open and view the data
We now open and take a look at the femr extract we generated in the last step using the [PatientDatabase](https://github.com/som-shahlab/femr/blob/main/src/femr/extension/datasets.pyi#L24) class from the femr repo.

In [9]:
import femr.datasets

database = femr.datasets.PatientDatabase(EXTRACT_DIR)

# Number of patients
print("Num patients", len(database))

# Print out patient_id 3 (the first example patient we created)
patient = database[3]
print(patient)

# Also note that concepts have been mapped to integers
print("What code 3 means", database.get_code_dictionary()[3])  # Returns what code=3 means in the database

# You can pull things like dosage by looking at the event
for event in patient.events:
    print(event, event.dosage)

Num patients 201
Patient(patient_id=3, events=(Event(start=1990-01-07 00:00:00, code=0), Event(start=1990-01-07 00:00:00, code=1, value=Female), Event(start=1990-01-07 00:00:00, code=2, value=White), Event(start=2000-07-09 00:00:00, code=202, value=160.0, units=mmHg), Event(start=2000-08-09 00:00:00, code=1548, value=7.0, units=%), Event(start=2022-05-03 00:00:00, code=2650), Event(start=2022-06-05 00:00:00, code=1035), Event(start=2022-06-05 00:00:00, code=3264, units=mg, dosage=50), Event(start=2022-07-06 00:00:00, code=6, units=ml, dosage=5)))
What code 3 means Drug/T8XRC4NU
Event(start=1990-01-07 00:00:00, code=0) None
Event(start=1990-01-07 00:00:00, code=1, value=Female) None
Event(start=1990-01-07 00:00:00, code=2, value=White) None
Event(start=2000-07-09 00:00:00, code=202, value=160.0, units=mmHg) None
Event(start=2000-08-09 00:00:00, code=1548, value=7.0, units=%) None
Event(start=2022-05-03 00:00:00, code=2650) None
Event(start=2022-06-05 00:00:00, code=1035) None
Event(star