## Objectives 
FEMR is primarily designed to work with OMOP data sources, but it also accepts a simpler input format that works with some parts of the FEMR pipeline.

This tutorial documents that simple schema and shows how to use it.

The input schema is a folder of csv files, where each csv file has at minimum the following columns:

patient_id, start, code, value

Each row in a file corresponds to an event.

 - patient_id is an ID for the patient who has the event. patient_id must be a 64-bit unsigned integer.

 - start is the start timestamp for that event, ideally when the event is initially recorded in the database. start must be an ISO 8601 timestamp string.

 - code is a string identifying what the event is. It must consist of a vocabulary signifier and the code itself, separated by a "/" character, for example ICD10CM/E11.4.

 - value is a value associated with the event. It can either be a numeric value, an arbitrary string, or an empty string (indicating no value).
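The column rules above can be checked before running the ETL. The following is a minimal sketch, not part of FEMR: `validate_row` is a hypothetical helper, and it assumes `datetime.datetime.fromisoformat` accepts the timestamps in use (it handles plain dates such as 1994-01-03, but older Python versions reject some ISO 8601 variants). FEMR's own ETL may apply different or stricter checks.

```python
import datetime

UINT64_MAX = 2**64 - 1


def validate_row(row):
    """Return a list of problems found in one event row (a dict of strings).

    Hypothetical helper for illustration only.
    """
    problems = []

    # patient_id must fit in a 64-bit unsigned integer
    pid = row.get("patient_id", "")
    if not (pid.isascii() and pid.isdigit() and int(pid) <= UINT64_MAX):
        problems.append("patient_id is not a 64-bit unsigned integer")

    # start must parse as an ISO 8601 timestamp
    try:
        datetime.datetime.fromisoformat(row.get("start", ""))
    except ValueError:
        problems.append("start is not an ISO 8601 timestamp")

    # code must look like VOCAB/CODE, with both parts non-empty
    vocab, sep, code = row.get("code", "").partition("/")
    if not (sep and vocab and code):
        problems.append("code is not of the form VOCAB/CODE")

    # value may be numeric, any string, or empty, so there is nothing to check
    return problems


good = {"patient_id": "2", "start": "1994-01-03", "code": "ICD10CM/E11.4", "value": ""}
bad = {"patient_id": "-5", "start": "03/01/1994", "code": "E11.4", "value": ""}
print(validate_row(good))  # no problems reported
print(validate_row(bad))   # one problem per broken column
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a file in a single pass.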

You may also add arbitrary columns to any CSV file. Those will be added to each event. The columns can vary between CSV files. We recommend adding columns to note dosage, visit IDs, lab units, source Clarity tables, and so on.
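As an illustration of per-file optional columns, here is a small sketch that writes two CSV files sharing the four core columns but carrying different extras. The file names, the `write_events` and `read_events` helpers, and the temporary directory are all assumptions made for this example; only the core column names come from the schema above.

```python
import csv
import os
import tempfile

CORE_COLUMNS = ["patient_id", "start", "code", "value"]


def write_events(path, extra_columns, rows):
    """Write event rows to a CSV with the core columns plus extra_columns."""
    with open(path, "w", newline="") as f:
        # restval="" leaves any column missing from a row empty
        writer = csv.DictWriter(f, fieldnames=CORE_COLUMNS + extra_columns, restval="")
        writer.writeheader()
        writer.writerows(rows)


def read_events(path):
    """Read a CSV back as a list of dicts, for inspection."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


demo_dir = tempfile.mkdtemp()  # throwaway directory for this sketch

# Medication file: carries a dosage column
write_events(
    os.path.join(demo_dir, "meds.csv"),
    ["dosage"],
    [{"patient_id": 2, "start": "1994-03-06", "code": "Drug/Atorvastatin", "dosage": "50mg"}],
)

# Lab file: carries a lab_units column instead
write_events(
    os.path.join(demo_dir, "labs.csv"),
    ["lab_units"],
    [{"patient_id": 2, "start": "1994-07-09", "code": "Vitals/Blood Pressure",
      "value": 150, "lab_units": "mm Hg"}],
)

print(read_events(os.path.join(demo_dir, "meds.csv")))
print(read_events(os.path.join(demo_dir, "labs.csv")))
```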

The first row (in time) for each patient is considered their birth event.

The ordering of rows for each patient does not matter, nor does it matter if a patient's rows are split across files. Everything will be re-sorted and joined as part of the ETL process.
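To make the re-sorting and the birth-event rule concrete, the sketch below groups unordered rows by patient and sorts each group by timestamp. This is only an illustration of the idea and not FEMR's actual ETL code; `group_and_sort` is a hypothetical helper, and how FEMR orders events that share a timestamp is not specified here (this sketch falls back to sorting by code string).

```python
import collections
import datetime

# Rows as (patient_id, start, code), deliberately out of order
rows = [
    (2, "1994-03-06", "Drug/Atorvastatin"),
    (3, "1990-01-07", "Birth/Birth"),
    (2, "1994-01-03", "Birth/Birth"),
    (2, "1994-02-03", "ICD10CM/E11.4"),
]


def group_and_sort(rows):
    """Group rows by patient and sort each patient's events by start time."""
    by_patient = collections.defaultdict(list)
    for patient_id, start, code in rows:
        by_patient[patient_id].append((datetime.datetime.fromisoformat(start), code))
    return {pid: sorted(events) for pid, events in by_patient.items()}


timelines = group_and_sort(rows)
for pid, events in sorted(timelines.items()):
    birth_time, birth_code = events[0]  # the earliest event is treated as birth
    print(pid, birth_time.date(), birth_code, len(events))
# prints:
# 2 1994-01-03 Birth/Birth 3
# 3 1990-01-07 Birth/Birth 1
```

Because the earliest row is taken as the birth event, a patient whose input lacks an explicit birth row would have some other event treated as birth, so it is worth checking that each patient's first row is the intended one.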

All types of EMR data can be mapped to those four core columns. Here are some general tips for different types of fields:

Demographics should generally be mapped as demographics codes assigned to the birth date of the patient (with no value assigned).

Labs should be assigned to when the lab result is available, with a numeric value if possible and a text value otherwise.

Procedures and diagnosis codes should generally be mapped to when the event happened, with no value attached.

Other unusual data types, such as flowsheets, can be added as needed, with either string or numeric values, whichever is more natural.
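The mapping tips above might be sketched as three small helpers, one per kind of field. The function names and argument layout are assumptions made for illustration; they are not part of FEMR, and a real source system will need its own extraction logic.

```python
def demographic_event(patient_id, birth_date, vocab, label):
    """Demographics: a code placed at the birth date, with no value."""
    return {"patient_id": patient_id, "start": birth_date,
            "code": f"{vocab}/{label}", "value": ""}


def lab_event(patient_id, result_time, vocab, lab_name, result):
    """Labs: timestamped when the result is available; numeric if it parses."""
    try:
        value = float(result)
    except ValueError:
        value = result  # keep non-numeric results as text
    return {"patient_id": patient_id, "start": result_time,
            "code": f"{vocab}/{lab_name}", "value": value}


def coded_event(patient_id, event_time, vocab, code):
    """Diagnoses and procedures: timestamped when they happened, no value."""
    return {"patient_id": patient_id, "start": event_time,
            "code": f"{vocab}/{code}", "value": ""}


events = [
    demographic_event(3, "1990-01-07", "Gender", "Female"),
    lab_event(5, "2000-04-10", "Vitals", "HbA1c", "7"),
    lab_event(5, "2000-04-11", "Vitals", "Urine Culture", "no growth"),
    coded_event(8, "2010-05-03", "CPT", "10060"),
]
for event in events:
    print(event)
```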

In [None]:
import os

cwd = os.getcwd()
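TARGET_DIR = os.path.join(cwd, "trash/simple_femr")
#TARGET_DIR = "/trash/simple_femr"
# exist_ok=True lets this cell be re-run without raising FileExistsError
os.makedirs(TARGET_DIR, exist_ok=True)

INPUT_DIR = os.path.join(TARGET_DIR, "simple_input")
os.makedirs(INPUT_DIR, exist_ok=True)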

In [None]:
# Write an example file (Manual)
# In this example, we include the mandatory fields, as well as dosage, visit_ids, lab_units, and clarity_source

# Four optional columns and ten patients

with open(os.path.join(INPUT_DIR, "example.csv"), "w") as f:
    f.write("patient_id,start,code,value,dosage,visit_ids,lab_units,clarity_source\n")
    f.write("2,1994-01-03,Birth/Birth,,,1,,PATIENT\n")  # First event is always birth
    f.write("2,1994-03-06,Drug/Atorvastatin,,50,3,mg,MED_ORDER\n")  # Example usage of dosage
    f.write("2,1994-02-03,ICD10CM/E11.4,,,2,,DIAGNOSIS\n")  # Note how events can be out of order
    f.write("2,1994-07-09,Vitals/Blood Pressure,150,,4,mm Hg,LAB_RESULT\n")  # Example use of a numeric value
    
    f.write("3,1990-01-07,Birth/Birth,,,1,,PATIENT\n")
    f.write("3,1990-01-07,Gender/Female,,,1,,PATIENT\n") # demographics 
    f.write("3,1990-01-07,Race/White,,,1,,PATIENT\n")    # demographics
    f.write("3,2000-03-09,Drug/Atorvastatin,,50,2,mg,MED_ORDER\n")  
    f.write("3,2000-05-03,ICD10CM/E11.4,,,3,,DIAGNOSIS\n")  
    f.write("3,2000-07-09,Vitals/Blood Pressure,160,,4,mm Hg,LAB_RESULT\n")  

    f.write("5,1980-05-03,Birth/Birth,,,1,,PATIENT\n")  
    f.write("5,2000-03-06,Drug/Atorvastatin,,50,2,mg,MED_ORDER\n")  
    f.write("5,2000-05-03,ICD10CM/E11.4,,,4,,DIAGNOSIS\n")  
    f.write("5,2000-09-09,Vitals/Blood Pressure,160,,5,mm Hg,LAB_RESULT\n") 
    f.write("5,2000-04-10,Vitals/HbA1c,7,,3,%,LAB_RESULT\n") # add HbA1c lab result
    
    f.write("7,1995-01-03,Birth/Birth,,,1,,PATIENT\n")  
    f.write("7,2012-03-06,Drug/Atorvastatin,,50,2,mg,MED_ORDER\n")  
    f.write("7,2012-05-03,ICD10CM/E11.4,,,3,,DIAGNOSIS\n")  
    f.write("7,2012-06-05,ICD10CM/E10.1,,,4,,DIAGNOSIS\n")  # add a second diagnosis (E10.1 is a type 1 diabetes code)
    f.write("7,2012-07-09,Vitals/Blood Pressure,160,,5,mm Hg,LAB_RESULT\n")
    
    f.write("8,1970-01-03,Birth/Birth,,,1,,PATIENT\n")  
    f.write("8,2010-03-06,Drug/Fentanyl Citrate,,100,2,mcg,MED_ORDER\n")  
    f.write("8,2010-05-03,ICD10CM/E10.1,,,2,,DIAGNOSIS\n")  
    f.write("8,2010-05-03,CPT/10060,,,2,,DIAGNOSIS\n") # add a CPT event  
    f.write("8,2010-07-09,Vitals/Blood Pressure,160,,2,mm Hg,LAB_RESULT\n")
    
    f.write("11,1996-01-03,Birth/Birth,,,1,,PATIENT\n")  
    f.write("11,2015-03-06,Drug/Heparin Lock Flush,,5,2,ml,MED_ORDER\n")  
    f.write("11,2015-05-03,ICD10CM/E10.1,,,2,,DIAGNOSIS\n")  
    f.write("11,2015-07-09,Vitals/Blood Pressure,160,,2,mm Hg,LAB_RESULT\n")
    
    f.write("13,1989-01-03,Birth/Birth,,,1,,PATIENT\n")  
    f.write("13,2020-03-06,Drug/Multivitamins,,5,2,ml,MED_ORDER\n")  
    f.write("13,2020-05-03,ICD10CM/E10.1,,,2,,DIAGNOSIS\n")  
    f.write("13,2020-07-09,Vitals/Blood Pressure,160,,2,mm Hg,LAB_RESULT\n")
    
    f.write("17,1992-01-03,Birth/Birth,,,1,,PATIENT\n")  
    f.write("17,2018-03-06,Drug/Dexamethasone Sod Phosphate,,4,2,mg,MED_ORDER\n")  
    f.write("17,2018-05-03,ICD10CM/E10.1,,,2,,DIAGNOSIS\n")  
    f.write("17,2018-07-09,Vitals/Blood Pressure,160,,2,mm Hg,LAB_RESULT\n")
    
    f.write("18,1988-01-03,Birth/Birth,,,1,,PATIENT\n")  
    f.write("18,2019-03-06,Drug/Atorvastatin,,50,2,mg,MED_ORDER\n")  
    f.write("18,2019-05-03,ICD10CM/E10.1,,,2,,DIAGNOSIS\n")  
    f.write("18,2019-07-09,Vitals/Blood Pressure,160,,2,mm Hg,LAB_RESULT\n")
    
    f.write("20,1994-01-03,Birth/Birth,,,1,,PATIENT\n")  
    f.write("20,2022-03-06,Drug/Atorvastatin,,50,2,mg,MED_ORDER\n")  
    f.write("20,2022-05-03,ICD10CM/E10.1,,,2,,DIAGNOSIS\n")  
    f.write("20,2022-07-09,Vitals/Blood Pressure,160,,2,mm Hg,LAB_RESULT\n")

In [1]:
import os
cwd = os.getcwd()
TARGET_DIR = os.path.join(cwd,"trash/simple_femr")
INPUT_DIR = os.path.join(TARGET_DIR, "simple_input")
LOG_DIR = os.path.join(TARGET_DIR, "logs")
EXTRACT_DIR = os.path.join(TARGET_DIR, "extract")

In [36]:
# Automate the manual generation of patient data above
# Generate many more random synthetic examples
import random
import datetime
import string

def add_random_patient(patient_id, f):
    """Write one randomly generated patient's events to the open CSV file f."""
    epoch = datetime.date(1990, 1, 1)
    birth = epoch + datetime.timedelta(days=random.randint(100, 1000))
    current_date = birth

    code_cat = ["Birth", "Gender", "Race", "ICD10CM", "CPT", "Drug", "Vitals"]
    gender_values = ["F", "M"]
    race_values = ["White", "Non-White"]
    for code_type in code_cat:
        if code_type in ("Birth", "Gender", "Race"):
            # Birth and demographics rows share the birth date, visit 1, and the PATIENT table
            clarity = "PATIENT"
            visit_id = 1
            dosage = ''
            unit = ''
            if code_type == "Gender":
                value = random.choice(gender_values)
            elif code_type == "Race":
                # Drawn independently of gender so the two fields are not correlated
                value = random.choice(race_values)
            else:
                value = ''
            f.write(f"{patient_id},{birth.isoformat()},{code_type}/{code_type},{value},{dosage},{visit_id},{unit},{clarity}\n")
        else:
            # Clinical events: 1 to 10 per category, each 0 to 100 days after the previous one
            for _ in range(random.randint(1, 10)):
                # Random 8-character placeholder, not a real vocabulary code
                code = ''.join(random.choices(string.ascii_uppercase + string.digits, k=8))
                current_date = current_date + datetime.timedelta(days=random.randint(0, 100))
                visit_id = random.randint(0, 20)
                value = ''
                dosage = ''
                unit = ''
                if code_type == "ICD10CM":
                    clarity = "DIAGNOSIS"
                elif code_type == "CPT":
                    clarity = "PROCEDURES"
                elif code_type == "Drug":
                    clarity = "MED_ORDER"
                    dosage = random.randint(10, 50)
                    unit = "mg"
                elif code_type == "Vitals":
                    clarity = "LAB_RESULT"
                    value = random.randint(80, 200)
                    unit = "mmHg"
                f.write(f"{patient_id},{current_date.isoformat()},{code_type}/{code},{value},{dosage},{visit_id},{unit},{clarity}\n")


In [37]:
for file_no in range(1, 2):
    with open(os.path.join(INPUT_DIR, f"{file_no}.csv"), "w") as f:    
        # Header: the four mandatory columns plus the optional columns written by add_random_patient
        f.write("patient_id,start,code,value,dosage,visit_ids,lab_units,clarity_source\n")
        for patient_id in range(file_no * 100, (file_no + 1) * 100):
            add_random_patient(patient_id, f)

In [3]:

# Convert the directory to an extract

LOG_DIR = os.path.join(TARGET_DIR, "logs")
EXTRACT_DIR = os.path.join(TARGET_DIR, "extract")

import femr
import femr.etl_pipelines.simple
os.system(f"etl_simple_femr {INPUT_DIR} {EXTRACT_DIR} {LOG_DIR} --num_threads 2")


2023-04-17 15:50:10,078 [MainThread  ] [INFO ]  Extracting from OMOP with arguments Namespace(simple_source='/local-scratch/nigam/projects/yizhex/trash/simple_femr/simple_input', target_location='/local-scratch/nigam/projects/yizhex/trash/simple_femr/extract', temp_location='/local-scratch/nigam/projects/yizhex/trash/simple_femr/logs', num_threads=2)
2023-04-17 15:50:10,108 [MainThread  ] [INFO ]  Already converted to events, skipping
2023-04-17 15:50:10,108 [MainThread  ] [INFO ]  Already converted to patients, skipping
2023-04-17 15:50:10,108 [MainThread  ] [INFO ]  Already converted to extract, skipping


0

In [4]:
# Open and look at the data.

import femr.datasets

database = femr.datasets.PatientDatabase(EXTRACT_DIR)

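# Count the patients in the extract. The ETL log above shows every step was skipped
# because its output already existed, so this reflects whatever extract is already in EXTRACT_DIR.
print("Num patients", len(database))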


Num patients 1


In [5]:

# Open and look at the data.

LOG_DIR = os.path.join(TARGET_DIR, "logs")
EXTRACT_DIR = os.path.join(TARGET_DIR, "extract")

import femr.datasets

database = femr.datasets.PatientDatabase(EXTRACT_DIR)

# Count the patients in the extract
print("Num patients", len(database))

# Print out the first patient
patient = database[0]
print(patient)

# Note that the patient ids get remapped, you can unmap with the database
original_id = database.get_original_patient_id(0)
print("Original id for patient 0", original_id)

# Also note that concepts have been mapped to integers
print("What code 3 means", database.get_code_dictionary()[3])  # Returns what code=3 means in the database

# You can pull things like dosage by looking at the event
for event in patient.events:
    print(event, event.dosage)


Num patients 1
Patient(patient_id=0, events=(Event(start=1994-01-03 00:00:00, code=3, dosage=), Event(start=1994-02-03 00:00:00, code=1, dosage=), Event(start=1994-03-06 00:00:00, code=0, dosage=50mg), Event(start=1994-07-09 00:00:00, code=2, value=150.0, dosage=)))
Original id for patient 0 2
What code 3 means Birth/Birth
Event(start=1994-01-03 00:00:00, code=3, dosage=) 
Event(start=1994-02-03 00:00:00, code=1, dosage=) 
Event(start=1994-03-06 00:00:00, code=0, dosage=50mg) 50mg
Event(start=1994-07-09 00:00:00, code=2, value=150.0, dosage=) 
