FEMR is primarily designed to work with OMOP data sources, but it also accepts
a simpler input format that works with some parts of the FEMR pipeline.

This tutorial documents that simple schema and shows how to use it.

The input schema is a folder of CSV files, where each CSV file has at minimum the following columns:

patient_id, start, code, value

Each row in a file corresponds to an event.

- patient_id is an identifier for the patient that has the event.
  patient_id must be a 64-bit unsigned integer.

- start is the start timestamp for that event, ideally the time when the event was first recorded in the database.
  start must be an ISO 8601 timestamp string.

- code is a string identifying what the event is. It must consist of a vocabulary signifier and the code itself, separated by a "/" character.
  For example, ICD10CM/E11.4

- value is a value associated with the event. It can be a numeric value, an arbitrary string, or an empty string (indicating no value).
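
The column rules above can be checked before running the ETL. The following is a minimal sketch, not FEMR's own validation: `validate_row` is a hypothetical helper, and Python's `fromisoformat` accepts only a subset of ISO 8601 on versions before 3.11, so FEMR may accept or reject timestamps that this check does not.

```python
import datetime

UINT64_MAX = 2**64 - 1


def validate_row(row):
    """Return a list of problems found in one event row (a dict of strings).

    Hypothetical helper for illustration; it is not part of FEMR.
    """
    problems = []

    # patient_id must fit in a 64-bit unsigned integer
    try:
        patient_id = int(row["patient_id"])
        if not 0 <= patient_id <= UINT64_MAX:
            problems.append("patient_id out of range")
    except ValueError:
        problems.append("patient_id is not an integer")

    # start must be an ISO 8601 timestamp string
    try:
        datetime.datetime.fromisoformat(row["start"])
    except ValueError:
        problems.append("start is not ISO 8601")

    # code must look like VOCAB/CODE, with both parts non-empty
    vocab, sep, code = row["code"].partition("/")
    if not (vocab and sep and code):
        problems.append("code is not of the form VOCAB/CODE")

    return problems


good = {"patient_id": "2", "start": "1994-01-03", "code": "ICD10CM/E11.4", "value": ""}
bad = {"patient_id": "-1", "start": "yesterday", "code": "E11.4", "value": ""}
print(validate_row(good))
print(validate_row(bad))
```

The value column is left unchecked because any string, including the empty string, is allowed there.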

You may also add arbitrary columns to any CSV file. Those columns will be added to each event in that file, and they can vary between CSV files.
We recommend adding columns to record dosage, visit ids, lab units, source Clarity tables, and so on.
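
Because value and the extra columns can hold arbitrary strings, writing rows with Python's `csv` module avoids broken files when a string contains a comma. This is a sketch under two assumptions: the `Note/Comment` code is made up for illustration, and it presumes the ETL reads standard quoted CSV, which this tutorial does not confirm.

```python
import csv
import io

# Four core columns plus one optional column
fieldnames = ["patient_id", "start", "code", "value", "lab_unit"]

events = [
    # Missing keys (value, lab_unit) are filled with the empty string via restval
    {"patient_id": 2, "start": "1994-01-03", "code": "Birth/Birth"},
    # "Note/Comment" is a hypothetical code; the value contains a comma
    {"patient_id": 2, "start": "1994-07-09", "code": "Note/Comment", "value": "stable, no change"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(events)

csv_text = buffer.getvalue()
print(csv_text)
```

The writer quotes only the fields that need it, so rows without special characters look the same as hand-written ones.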

The first row (in time) for each patient is considered their birth event.

The ordering of rows for each patient does not matter, nor does it matter if a patient's rows are split across files.
Everything is re-sorted and joined as part of the ETL process.
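
To make the sorting and birth-event rules concrete, here is a small illustration of the idea. It is not FEMR's actual ETL code: it only groups rows by patient, sorts them by start, and treats the earliest row as the birth event.

```python
from collections import defaultdict

# Rows as (patient_id, start, code); deliberately out of order and interleaved
rows = [
    (2, "1994-03-06", "Drug/Atorvastatin"),
    (2, "1994-01-03", "Birth/Birth"),
    (5, "1990-05-01", "Birth/Birth"),
    (2, "1994-02-03", "ICD10CM/E11.4"),
]

events_by_patient = defaultdict(list)
for patient_id, start, code in rows:
    events_by_patient[patient_id].append((start, code))

for events in events_by_patient.values():
    # ISO 8601 strings in the same format sort chronologically as plain text
    events.sort()

# The first event in time for each patient is treated as the birth event
birth_events = {pid: events[0] for pid, events in events_by_patient.items()}
print(birth_events)
```

A practical consequence is that no event for a patient should be dated before that patient's birth row, or the wrong row would be taken as the birth event.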

All types of EMR data can be mapped to those four core columns. Here are some tips for common types of fields:

Demographics should generally be mapped as demographics codes assigned to the birth date of the patient (with no value assigned).

Labs should be assigned to the time when the lab result becomes available, with a numeric value if possible and a text value otherwise.

Procedures and diagnosis codes should generally be mapped to when the event happened, with no value attached.

Other, less standard data types, such as flowsheets, can be added as needed, with either string or numeric values, whichever is more natural.
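
The mapping tips above could be sketched as a few small helpers. All three function names are hypothetical, and the `Gender/F` and `LOINC/4548-4` codes are only illustrative; which vocabularies you use depends on your source data.

```python
def demographic_row(patient_id, birth_date, code):
    # Demographics: attached to the patient's birth date, with no value
    return {"patient_id": patient_id, "start": birth_date, "code": code, "value": ""}


def lab_row(patient_id, result_time, code, raw_value):
    # Labs: timestamped when the result is available;
    # numeric when the raw text parses as a number, text otherwise
    text = raw_value.strip()
    try:
        value = float(text)
    except ValueError:
        value = text
    return {"patient_id": patient_id, "start": result_time, "code": code, "value": value}


def diagnosis_row(patient_id, event_time, code):
    # Procedures and diagnoses: timestamped when the event happened, with no value
    return {"patient_id": patient_id, "start": event_time, "code": code, "value": ""}


rows = [
    demographic_row(2, "1994-01-03", "Gender/F"),
    lab_row(2, "1994-07-09T08:30:00", "LOINC/4548-4", " 7.2 "),
    lab_row(2, "1994-07-09T08:30:00", "LOINC/4548-4", "positive"),
    diagnosis_row(2, "1994-02-03", "ICD10CM/E11.4"),
]
for row in rows:
    print(row)
```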

In [1]:
# Create some folders

import os

TARGET_DIR = "trash/simple_femr"
os.makedirs(TARGET_DIR)  # makedirs also creates the parent "trash" folder if it is missing

INPUT_DIR = os.path.join(TARGET_DIR, "simple_input")
os.mkdir(INPUT_DIR)


In [2]:
# Write an example file according to that schema

# In this example, we include the mandatory fields, as well as dosage, visit_id, lab_unit, and clarity_table

with open(os.path.join(INPUT_DIR, "example.csv"), "w") as f:
    f.write("patient_id,start,code,value,dosage,visit_id,lab_unit,clarity_table\n")
    f.write("2,1994-01-03,Birth/Birth,,,,,\n")  # First event is always birth

    f.write("2,1994-03-06,Drug/Atorvastatin,,50mg,,,\n")  # Example usage of dosage

    f.write("2,1994-02-03,ICD10/E11.4,,,,,some_table\n")  # Note how events can be out of order

    f.write("2,1994-07-09,Vitals/Blood Pressure,150,,2,mmhg,\n")  # Example use of a numeric value

In [3]:
# Generate many more random synthetic examples
import random
import datetime
import string

def add_random_patient(patient_id, f):
    epoch = datetime.date(1990, 1, 1)
    birth = epoch + datetime.timedelta(days=random.randint(100, 1000))
    current_date = birth
    
    f.write(f"{patient_id},{birth.isoformat()},Birth/Birth,\n")  # First event is always birth
    for _ in range(random.randint(1, 10)):
        code = ''.join(random.choices(string.ascii_uppercase + string.digits, k=8))
        current_date = current_date + datetime.timedelta(days=random.randint(0, 100))
        f.write(f"{patient_id},{current_date.isoformat()},FAKE_CODE/{code},\n")  # Later events get a random fake code and no value

for file_no in range(1, 11):
    with open(os.path.join(INPUT_DIR, f"{file_no}.csv"), "w") as f:    
        # Use the simpler 4 column minimum format, with no optional columns
        f.write("patient_id,start,code,value\n")
        for patient_id in range(file_no * 100, (file_no + 1) * 100):
            add_random_patient(patient_id, f)
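
Before running the ETL, it can help to confirm how many distinct patients the input folder contains. The sketch below uses a hypothetical `count_patients` helper and demonstrates it on a temporary folder so that it runs on its own.

```python
import csv
import glob
import os
import tempfile


def count_patients(input_dir):
    """Count distinct patient_id values across every CSV file in input_dir."""
    patient_ids = set()
    for path in sorted(glob.glob(os.path.join(input_dir, "*.csv"))):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                patient_ids.add(int(row["patient_id"]))
    return len(patient_ids)


# Small self-contained demo: patient 2 is split across two files
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "a.csv"), "w") as f:
        f.write("patient_id,start,code,value\n")
        f.write("2,1994-01-03,Birth/Birth,\n")
    with open(os.path.join(demo_dir, "b.csv"), "w") as f:
        f.write("patient_id,start,code,value\n")
        f.write("2,1994-02-03,ICD10CM/E11.4,\n")
        f.write("7,1990-05-01,Birth/Birth,\n")
    demo_count = count_patients(demo_dir)

print(demo_count)
```

For the files written in this tutorial, `count_patients(INPUT_DIR)` would be expected to give 1001: ten files of 100 synthetic patients each, plus patient 2 from example.csv.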
        

In [None]:
# Convert the directory to an extract

LOG_DIR = os.path.join(TARGET_DIR, "logs")
EXTRACT_DIR = os.path.join(TARGET_DIR, "extract")

os.system(f"etl_simple_femr {INPUT_DIR} {EXTRACT_DIR} {LOG_DIR} --num_threads 2 --athena_download ~/vocab")


2023-04-28 01:39:46,957 [MainThread  ] [INFO ]  Extracting from OMOP with arguments Namespace(simple_source='trash/simple_femr/simple_input', target_location='/local-scratch/nigam/projects/ethanid/femr_develop/femr/tutorials/trash/simple_femr/extract', temp_location='/local-scratch/nigam/projects/ethanid/femr_develop/femr/tutorials/trash/simple_femr/logs', num_threads=2, athena_download='/home/ethanid/vocab')
2023-04-28 01:39:46,957 [MainThread  ] [INFO ]  Converting to events


In [None]:
# Open and look at the data.


import femr.datasets

database = femr.datasets.PatientDatabase(EXTRACT_DIR)

# We have 1001 patients
print("Num patients", len(database))

# Note that FEMR remaps patient ids, we have to use a mapping to get back our original patient 2

# Print out that patient
patient = database[2]
print(patient)


# You can pull things like dosage by looking at the event
for event in patient.events:
    print(event, event.dosage, database.get_ontology().get_all_parents(event.code))
  
all_patient_ids = list(database)
print(len(all_patient_ids))