# Creating a PatientDatabase from simple FEMR format data

In this tutorial, we will walk through how to generate a FEMR PatientDatabase using the simple FEMR format.

The simple FEMR format is a flexible custom CSV format designed for working with non-OMOP data sources.

The idea is that you transform your data into the simple FEMR format, and FEMR then runs an ETL from that format to a PatientDatabase.

In [1]:
import os
import csv
import pandas as pd
import shutil

INPUT_DIR = 'input/simple_femr'

# Import the example dataset 
example_dat = pd.read_csv(os.path.join(INPUT_DIR, "example.csv"), sep=',')

# 1. Basic input schema
The input is a CSV file (or a folder of CSV files), where each CSV file has <u>at minimum</u> the following columns:

`patient_id`, `start`, `code`, and `value`

 - `patient_id` is the ID for the patient who has the event. `patient_id` must be a 64-bit unsigned integer.

 - `start` is the start timestamp for an event, ideally when the event is initially recorded in the database. `start` must be an ISO 8601 timestamp string.

 - `code` is a string that identifies what type of event occurred. It must consist of two parts, a vocabulary signifier and the code itself, separated by a "/" character. For example, ICD10CM/E11.4 would indicate an E11.4 ICD10 code.

 - `value` is a value associated with the event. It can either be a numeric value, an arbitrary string, or nothing.
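
The column rules above can be checked before running the ETL. The following is a minimal sketch, not part of FEMR: `validate_row` is a hypothetical helper, and it assumes each row has already been read into a dictionary of strings (for example with `csv.DictReader`).

```python
from datetime import datetime

UINT64_MAX = 2**64 - 1


def validate_row(row):
    """Return a list of problems found in one row (an empty list means none were found)."""
    problems = []

    # patient_id must fit in a 64-bit unsigned integer
    try:
        patient_id = int(row["patient_id"])
        if not 0 <= patient_id <= UINT64_MAX:
            problems.append("patient_id is out of range for a 64-bit unsigned integer")
    except (KeyError, TypeError, ValueError):
        problems.append("patient_id is missing or not an integer")

    # start must be an ISO 8601 timestamp string
    # (before Python 3.11, fromisoformat only accepts a subset of ISO 8601)
    try:
        datetime.fromisoformat(row["start"])
    except (KeyError, TypeError, ValueError):
        problems.append("start is missing or not an ISO 8601 timestamp")

    # code must look like "<vocabulary>/<code>", with both parts non-empty
    vocabulary, separator, code = (row.get("code") or "").partition("/")
    if not (vocabulary and separator and code):
        problems.append("code is not of the form vocabulary/code")

    return problems


good = {"patient_id": "3", "start": "1970-01-07", "code": "Birth/Birth", "value": ""}
bad = {"patient_id": "-1", "start": "yesterday", "code": "E11.4", "value": ""}
print(validate_row(good))
print(validate_row(bad))
```

The `value` column is not checked because it may hold a number, arbitrary text, or nothing.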


You may also add arbitrary columns to any CSV file. Those will be added to each event. The columns can vary between CSV files.
We recommend adding columns to record dosage, visit IDs, lab units, source Clarity tables, and so on.

The first row (in time) for each patient is considered their birth event.

The ordering of rows for each patient does not matter, nor does it matter if a patient's rows are split across files.
Everything will be re-sorted and joined as part of the ETL process (i.e., creating a `PatientDatabase` involves sorting by patient and events).
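
To illustrate why input order does not matter, here is a small sketch of the kind of sorting described above. It is not FEMR's actual implementation, and the rows are made up for illustration. It sorts by patient and then by `start`, and picks the earliest row per patient, which is the row that would be treated as the birth event.

```python
import csv
import io
from datetime import datetime

# Hypothetical rows, deliberately out of order and interleaved across patients
raw = """patient_id,start,code,value
3,2022-05-03,ICD10CM/E11.4,
7,1985-02-11,Birth/Birth,
3,1970-01-07,Birth/Birth,
7,2001-09-30,ICD10CM/E10.1,
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Sort by patient first, then by event time
rows.sort(key=lambda r: (int(r["patient_id"]), datetime.fromisoformat(r["start"])))

# After sorting, the first row seen for each patient is the earliest one
first_event = {}
for row in rows:
    first_event.setdefault(int(row["patient_id"]), row)

for patient_id, row in first_event.items():
    print(patient_id, row["start"], row["code"])
```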

All types of EMR data can be mapped to these four core columns. Here are some common tips for different types of fields:

Demographics should generally be mapped as demographics codes assigned to the birth date of the patient (with no value assigned).

Labs should be assigned to the time when the lab result becomes available, with a numeric value if possible and a text value otherwise.

Procedures and diagnosis codes should generally be mapped to when the event happened, with no value attached.

Other atypical datatypes, such as flowsheets, can be added as needed, with either string or numeric values, whichever is more natural.
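
The mapping tips above can be sketched with a small helper that builds rows with the four core columns. `make_row` is a hypothetical helper written for this illustration, and the demographic code `Gender/F` is an assumed naming choice rather than something FEMR prescribes.

```python
def make_row(patient_id, start, vocabulary, code, value=None):
    """Build one row with the four core columns; an empty string stands for 'no value'."""
    return {
        "patient_id": patient_id,
        "start": start,
        "code": f"{vocabulary}/{code}",
        "value": "" if value is None else value,
    }


birth_date = "1970-01-07"

rows = [
    # Birth event: the earliest row for the patient
    make_row(3, birth_date, "Birth", "Birth"),
    # Demographics: a code on the birth date, with no value
    make_row(3, birth_date, "Gender", "F"),
    # Lab: timestamped when the result is available, numeric value
    make_row(3, "2020-08-09", "Vitals", "HbA1c", 7),
    # Diagnosis: timestamped when the event happened, no value
    make_row(3, "2022-05-03", "ICD10CM", "E11.4"),
]

for row in rows:
    print(row)
```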

## Exercise 1: Add demographic information
We now display the first three rows of `example_dat`, which contains a single patient.

We use the patient's date of birth as the very first event time in our data format, so the first event is always `Birth`.

Rows 2 and 3 include the demographic information, `Gender` and `Race`, respectively, of this patient with corresponding values. Demographics should generally be mapped as demographics codes assigned to the birth date of the patient (with no value assigned). For demographic rows, the vocabulary signifier and the code itself are the same.

In [2]:
example_dat.loc[0:2]

Unnamed: 0,patient_id,start,code,value,units,dosage
0,3,1970-01-07,Birth/Birth,,,
1,3,1990-01-07,Gender/Gender,Female,,
2,3,1990-01-07,Race/Race,White,,


# 2. Expanding your dataset
We now show additional rows for other events present in the EHR records.

## Exercise 2: Add diagnosis information
We now add more events/rows that capture patients' diagnosis information, e.g., ICD 9/10 codes.

For diagnoses, the `code` column has two parts: the vocabulary signifier (e.g., ICD10CM) and the code itself (e.g., E11.4, E10.1, etc.). The `value` column should be left empty. Procedures and diagnosis codes should generally be mapped to when the event happened.

Note that different diagnoses may be given at different visits, so the corresponding `start` timestamps may be different.

In [3]:
example_dat.loc[3:4]

Unnamed: 0,patient_id,start,code,value,units,dosage
3,3,2022-05-03,ICD10CM/E11.4,,,
4,3,2022-06-05,ICD10CM/E10.1,,,


## Exercise 3: Add lab test information
We now add more events/rows that capture patients' lab values, e.g., Vitals/Blood Pressure.

For vitals, the `code` column has two parts: the vocabulary signifier (e.g., Vitals) and the code itself (e.g., Blood Pressure, HbA1c, etc.). The `value` column should contain the corresponding numeric value when possible, and a text value otherwise.

We also recommend adding another column, `units`, to record the units of each test result.

Note that different lab tests may be given at different visits, so the corresponding `start` timestamps may differ.
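
When preparing lab rows, one way to apply the "numeric when possible, text otherwise" rule is sketched below. `coerce_lab_value` is a hypothetical helper for your own preprocessing. It does not describe how FEMR itself parses the `value` column.

```python
import math


def coerce_lab_value(raw):
    """Return a float for numeric results, the stripped text otherwise, or None if empty."""
    if raw is None:
        return None
    text = str(raw).strip()
    if not text:
        return None
    try:
        number = float(text)
    except ValueError:
        return text
    # float() also accepts "nan" and "inf"; keep those as text rather than numbers
    return number if math.isfinite(number) else text


for raw in ["160", " 7 ", "positive", "", "nan"]:
    print(repr(raw), "->", repr(coerce_lab_value(raw)))
```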

In [4]:
example_dat.loc[5:6]

Unnamed: 0,patient_id,start,code,value,units,dosage
5,3,2020-07-09,Vitals/Blood Pressure,160,mmHg,
6,3,2020-08-09,Vitals/HbA1c,7,,%


## Exercise 4: Add medication information
We now add more events/rows that capture patients' medication information, e.g., Drug/Atorvastatin.

For medications, the `code` column has two parts, the vocabulary signifier (e.g., Drug) and the code itself (e.g., Atorvastatin, Heparin Lock Flush, Multivitamins, etc.) The `value` column should be empty.

We add another column, `dosage`, to record the dose of the prescribed medication. For medications, the `units` column indicates the unit of the medication dose.

Note that different medications may be prescribed at different visits, so the corresponding `start` timestamps may differ.

In [5]:
example_dat.loc[7:8]

Unnamed: 0,patient_id,start,code,value,units,dosage
7,3,2022-06-05,Drug/Atorvastatin,,mg,50
8,3,2022-07-06,Drug/Multivitamins,,ml,5


## Exercise 5: Add note information
We now add more events/rows that capture notes written about patients.

For notes, the main consideration is that you often need quoting and escaping in order to process notes with quote characters, commas and newlines. 

We follow the [RFC 4180 spec](https://www.loc.gov/preservation/digital/formats/fdd/fdd000323.shtml#:~:text=RFC%204180%20stipulates%20the%20use,double%20quotes%20(Hex%2022).) for escaping, which is the default format for the Python csv library.
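
As a minimal sketch of what this quoting looks like in practice, the example below writes a made-up note containing commas, double quotes, and a newline with Python's `csv` module and reads it back unchanged. The note text is hypothetical.

```python
import csv
import io

# A hypothetical note with commas, double quotes, and an embedded newline
note = 'Complicated note: commas, "quotes",\nand a newline'

buffer = io.StringIO()
writer = csv.writer(buffer)  # the default dialect quotes fields only when needed
writer.writerow(["patient_id", "start", "code", "value"])
writer.writerow([3, "2022-06-06", "Note/ProgressNote", note])

encoded = buffer.getvalue()
print(encoded)

# Reading the text back recovers the original note, including the newline
rows = list(csv.reader(io.StringIO(encoded, newline="")))
print(rows[1][3] == note)
```

In the encoded text, the note is wrapped in double quotes and each embedded double quote is doubled.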

In [6]:
example_dat.loc[9:10]

Unnamed: 0,patient_id,start,code,value,units,dosage
9,3,2022-06-05,Note/ProgressNote,Patient Bob came to the clinic today,,
10,3,2022-06-06,Note/ProgressNote,"Complicated notes generally need escaping , ""\...",,


# 3. Scaling up to many more patients
For simplicity, we only included one patient in the above dataset, but an arbitrary number of patients can be added.

You can add more patients in two ways: either append them to the same file, or create additional CSV files.

We do this with two additional files in our example, `many_examples_1.csv` and `many_examples_2.csv`.
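
The sketch below shows one way to spread patients across several CSV files whose extra columns differ. The file names and rows are made up, and a temporary directory is used so that nothing in `INPUT_DIR` is touched.

```python
import csv
import os
import tempfile

# Hypothetical files: the second one has an extra `units` column,
# and patient 200 has rows in both files
file_contents = {
    "patients_a.csv": (
        ["patient_id", "start", "code", "value"],
        [
            [200, "1990-11-30", "Birth/Birth", ""],
            [200, "1990-12-28", "ICD10CM/E11.4", ""],
        ],
    ),
    "patients_b.csv": (
        ["patient_id", "start", "code", "value", "units"],
        [
            [201, "1991-04-15", "Birth/Birth", "", ""],
            [200, "1992-10-18", "Vitals/Blood Pressure", 174, "mmHg"],
        ],
    ),
}

input_dir = tempfile.mkdtemp()
for filename, (header, rows) in file_contents.items():
    with open(os.path.join(input_dir, filename), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

# Read every file back and count the rows per patient across the whole folder
rows_per_patient = {}
for filename in sorted(os.listdir(input_dir)):
    with open(os.path.join(input_dir, filename), newline="") as f:
        for row in csv.DictReader(f):
            patient_id = row["patient_id"]
            rows_per_patient[patient_id] = rows_per_patient.get(patient_id, 0) + 1

print(rows_per_patient)
```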

In [7]:
with open(os.path.join(INPUT_DIR, 'many_examples_1.csv')) as f:
    for _, line in zip(range(20), f):
        print(line.strip())

patient_id,start,code,value,dosage,visit_ids,lab_units,clarity_source
100,1990-11-30,Birth/Birth,,,1,,PATIENT
100,1990-11-30,Gender/Gender,M,,1,,PATIENT
100,1990-11-30,Race/Race,Non-White,,1,,PATIENT
100,1990-12-28,ICD10CM/WLNYRRJR,,,15,,DIAGNOSIS
100,1990-12-29,ICD10CM/AQ5CDLKT,,,14,,DIAGNOSIS
100,1991-03-03,CPT/S3XW86UW,,,16,,PROCEDURES
100,1991-04-07,CPT/1SKRBSJ6,,,18,,PROCEDURES
100,1991-05-30,CPT/GGIA8RIA,,,6,,PROCEDURES
100,1991-07-31,CPT/06RO6RNS,,,8,,PROCEDURES
100,1991-10-24,CPT/AN6KSH7X,,,15,,PROCEDURES
100,1991-12-20,CPT/J225K010,,,14,,PROCEDURES
100,1992-02-13,Drug/HRVT01O1,,46,7,mg,MED_ORDER
100,1992-05-01,Drug/NAI1E4K3,,46,11,mg,MED_ORDER
100,1992-05-26,Drug/T8F38A5J,,41,7,mg,MED_ORDER
100,1992-07-12,Drug/96O4KD7B,,37,15,mg,MED_ORDER
100,1992-10-18,Vitals/2WF52DX6,174,,6,mmHg,LAB_RESULT
101,1991-04-15,Birth/Birth,,,1,,PATIENT
101,1991-04-15,Gender/Gender,F,,1,,PATIENT
101,1991-04-15,Race/Race,White,,1,,PATIENT


# Additional notes
The ordering of rows for each patient does not matter, nor does it matter if a patient's rows are split across files. Everything will be re-sorted and joined as part of the ETL process.

Atypical datatypes, such as flowsheets, can be added as needed, with either string or numeric values, whichever is more natural.

# 4. Convert the directory to an extract
We now convert the dataset we created above to an extract using the function [etl_simple_femr](https://github.com/som-shahlab/femr/blob/main/src/femr/etl_pipelines/simple.py#L66) from the femr repo.

We need to first create folders to save the dataset and associated files 

In [8]:
import shutil
import os

TARGET_DIR = 'trash/tutorial_2b'

if os.path.exists(TARGET_DIR):
    shutil.rmtree(TARGET_DIR)

# makedirs also creates the parent folder ('trash') if it does not exist yet
os.makedirs(TARGET_DIR)

The example CSV files are already in the `INPUT_DIR` folder, so we can run the ETL on that folder directly.

The output extract is a femr [PatientDatabase](https://github.com/som-shahlab/femr/blob/Miking98-patch-1/tutorials/0_How%20FEMR%20Works%20%2B%20Toy%20Example.ipynb) that can be used directly by the femr pipeline.

In [9]:
# Paths for storing the extract and the extract log
LOG_DIR = os.path.join(TARGET_DIR, "logs")
EXTRACT_DIR = os.path.join(TARGET_DIR, "extract")

import femr
import femr.etl_pipelines.simple
os.system(f"etl_simple_femr {INPUT_DIR} {EXTRACT_DIR} {LOG_DIR} --num_threads 2")

Done with main 2023-07-08T12:32:19.663202997+00:00
Done with meta 2023-07-08T12:32:19.663352584+00:00
Converting to extract 2023-07-08 12:32:19.624892


2023-07-08 12:32:19,506 [MainThread  ] [INFO ]  Extracting from OMOP with arguments Namespace(simple_source='input/simple_femr', target_location='/home/ethan/femr/tutorials/trash/tutorial_2b/extract', temp_location='/home/ethan/femr/tutorials/trash/tutorial_2b/logs', num_threads=2, athena_download=None)
2023-07-08 12:32:19,506 [MainThread  ] [INFO ]  Converting to events
2023-07-08 12:32:19,597 [MainThread  ] [INFO ]  Converting to patients
2023-07-08 12:32:19,624 [MainThread  ] [INFO ]  Converting to extract


0
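
As an alternative to `os.system`, the same command line can be assembled as an argument list and run with `subprocess`, which raises an error on a non-zero exit code instead of returning it silently. This is a sketch only: `build_etl_command` and `run_etl` are hypothetical helpers, and the arguments are the ones used in the cell above.

```python
import shutil
import subprocess


def build_etl_command(input_dir, extract_dir, log_dir, num_threads=2):
    """Assemble the etl_simple_femr call used above as an argument list."""
    return [
        "etl_simple_femr",
        str(input_dir),
        str(extract_dir),
        str(log_dir),
        "--num_threads",
        str(num_threads),
    ]


def run_etl(command):
    """Run the command, failing loudly if the CLI is missing or exits with an error."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} was not found on PATH")
    subprocess.run(command, check=True)


command = build_etl_command(
    "input/simple_femr",
    "trash/tutorial_2b/extract",
    "trash/tutorial_2b/logs",
)
print(" ".join(command))
```

Calling `run_etl(command)` would then start the ETL. It is left out of the sketch so that the block can be run without femr installed.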

# 5. Open and view the data
We now open the femr extract generated in the previous step using the [PatientDatabase](https://github.com/som-shahlab/femr/blob/main/src/femr/extension/datasets.pyi#L24) class and take a look at its contents.

In [10]:
import femr.datasets

database = femr.datasets.PatientDatabase(EXTRACT_DIR)

# Number of patients
print("Num patients", len(database))

# Print out patient_id 3 (the first example patient we created)
patient = database[3]
print(patient)

# You can pull things like dosage by looking at the event
for event in patient.events:
    print(event, 'dosage is', event.dosage)

Num patients 201
Patient(patient_id=3, events=(Event(start=1970-01-07 00:00:00, code=Birth/Birth, value=None), Event(start=1990-01-07 00:00:00, code=Gender/Gender, value=Female), Event(start=1990-01-07 00:00:00, code=Race/Race, value=White), Event(start=2020-07-09 00:00:00, code=Vitals/Blood Pressure, value=160.0, units=mmHg), Event(start=2020-08-09 00:00:00, code=Vitals/HbA1c, value=7.0, dosage=%), Event(start=2022-05-03 00:00:00, code=ICD10CM/E11.4, value=None), Event(start=2022-06-05 00:00:00, code=ICD10CM/E10.1, value=None), Event(start=2022-06-05 00:00:00, code=Note/ProgressNote, value=Patient Bob came to the clinic today), Event(start=2022-06-05 00:00:00, code=Drug/Atorvastatin, value=None, units=mg, dosage=50), Event(start=2022-06-06 00:00:00, code=Note/ProgressNote, value=Complicated notes generally need escaping , "
 example), Event(start=2022-07-06 00:00:00, code=Drug/Multivitamins, value=None, units=ml, dosage=5)))
Event(start=1970-01-07 00:00:00, code=Birth/Birth, value=Non