# Loading EHR Data with Piton

In this notebook, we go through a toy example of using Piton on a custom dataset. 

We will define a set of patients who have a unique set of associated clinical events, and then initialize a Piton dataset with this data. 

Before going through this tutorial, please make sure you've installed **piton** as detailed in the [README.md](https://github.com/som-shahlab/piton).

#### Learning Goals
1. Learn how to use the two main Piton datatypes: `Patient` and `Event`.
2. Initialize a Piton dataset from scratch with custom data.
3. Understand the three steps of the Piton data pipeline: `EventCollection` => `PatientCollection` => `PatientDatabase`

In [7]:
import datetime
import os
import contextlib
import piton
import piton.datasets

## 0. Overview of **Piton**

The **piton** workflow of creating patient timelines from an EHR dataset is as follows:

1. Extract clinical events from the source EHR database. This could be OMOP-CDM, MIMIC-III, eICU, etc.
    * This code is specific to the source EHR database, and is not covered in this tutorial.
    * We have already written extractors for some popular EHR databases in the `src/piton/extractors` folder.
2. Write these extracted events to disk as an `EventCollection`. 
    * In an abstract sense, an `EventCollection` is simply an unordered list of `(Patient ID, Event)` tuples. 
    * Internally, **piton** will shard these tuples across a set of files on disk that will be stored in a single folder, but you don't need to worry about that.
3. Transform the `EventCollection` into a `PatientCollection`. 
    * In an abstract sense, a `PatientCollection` is a **sorted** list of `(Patient ID, Event)` tuples, where the sorting is done first by `Patient ID`, second by the `start` time of each `Event`. 
    * Internally, **piton** will shard these consecutively arranged tuples across a set of files on disk that will be stored in a single folder, but you don't need to worry about that.
4. Apply any transformations you want to the patient timelines in your `PatientCollection` (e.g. move coding assignments to the end of their visits, move all events before a patient's birthdate to after their birthdate, etc.). Each transformation generates a new `PatientCollection`.
5. Index and save your `PatientCollection` to disk as a `PatientDatabase`.
    * In an abstract sense, a `PatientDatabase` lets you quickly retrieve all the events for a given patient ID. 
    * Internally, **piton** will create another set of indexes and files on disk that will be stored in a single folder, but you don't need to worry about that.

Note that everything is done on disk (i.e. not in memory), so you can work with arbitrarily large datasets. 

However, this means that you must specify the location of the folders where you want to store your `EventCollection`, `PatientCollection`, and `PatientDatabase` on disk.
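
The abstract view of steps 2 and 3 can be mimicked in plain Python. The sketch below is only an in-memory analogy and says nothing about how Piton stores data on disk; `ToyEvent` and `to_patient_timelines` are hypothetical names introduced for illustration.

```python
import datetime
from dataclasses import dataclass
from itertools import groupby
from operator import itemgetter


@dataclass(frozen=True)
class ToyEvent:
    # Hypothetical stand-in for an event; this is not piton.Event.
    start: datetime.datetime
    code: int


def to_patient_timelines(pairs):
    """Sort (patient_id, event) pairs by patient ID, then by event start,
    and group them into one timeline per patient."""
    ordered = sorted(pairs, key=lambda pair: (pair[0], pair[1].start))
    return {
        patient_id: [event for _, event in group]
        for patient_id, group in groupby(ordered, key=itemgetter(0))
    }


# An unordered list of (patient ID, event) tuples, like an EventCollection.
unordered_pairs = [
    (1, ToyEvent(datetime.datetime(2010, 1, 3, 10, 45), code=2)),
    (0, ToyEvent(datetime.datetime(2010, 1, 5), code=2)),
    (1, ToyEvent(datetime.datetime(2010, 1, 3, 10, 30), code=0)),
    (0, ToyEvent(datetime.datetime(2010, 1, 3, 10, 45), code=2)),
]

timelines = to_patient_timelines(unordered_pairs)
for patient_id, events in timelines.items():
    print(patient_id, [event.start.isoformat() for event in events])
```

The sort key `(patient ID, start)` is what turns an unordered bag of events into per-patient timelines in chronological order.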

## 1. Define our dataset with `Patient` and `Event` objects

#### Patients

Piton represents every patient with the `Patient` class. 

A `Patient` object contains the following two attributes:

* `patient_id` (int): a unique identifier for the patient
* `events` (list): a list of `Event` objects associated with that patient

The definition of the `Patient` class can be [found here](https://github.com/som-shahlab/piton/blob/main/src/piton/__init__.py#L11)

#### Events

Piton represents clinical events with the `Event` object.

An `Event` object can contain any number of arbitrary attributes, but it must have at least the following two attributes:

* `start` (datetime.datetime): the start time of the event
* `code` (int): the code that Piton associates with events of the same type

The definition of the `Event` class can be [found here](https://github.com/som-shahlab/piton/blob/main/src/piton/__init__.py#L22)


Below, we'll create some events for some fictional patients.

In [2]:
events = [
    # This event contains the bare minimum attributes -- start and code.
    piton.Event(
        start=datetime.datetime(2010, 1, 5),
        code=2,
    ),
    # This event contains a couple custom attributes -- value and source_table.
    piton.Event(
        start=datetime.datetime(2010, 1, 3, hour=10, minute=45),
        code=2,
        value="test_value",
        source_table=None,
    ),
    # This event contains even more attributes.
    piton.Event(
        start=datetime.datetime(2010, 1, 3, hour=10, minute=30),
        code=0,
        value=34.0,
        source_table='visit',
        extra_attr=True,
    ),
]

Now that we have our events, let's create a few fictional patients.

In [3]:
patients = [
    # Let's assign each patient a subset of our events for example purposes.
    piton.Patient(patient_id=0, events=events[:2]),
    piton.Patient(patient_id=1, events=events[1:]),
    piton.Patient(patient_id=10, events=events),
]

# Let's print out the events of one patient to see what they look like.
# Note that attributes associated with `None` (i.e. `source_table` in the 
# second event) won't be printed out, but they are still tracked internally by Piton.
patients[0].events

[Event(start=2010-01-05 00:00:00, code=2),
 Event(start=2010-01-03 10:45:00, code=2, value=test_value)]

We can access a specific event's attributes as follows. Note that accessing an attribute that wasn't defined on an event will return `None` by default (rather than raise an exception).

In [4]:
for patient in patients:
    print(f"Patient {patient.patient_id}:")
    for idx, event in enumerate(patient.events):
        print(f"    Event #{idx}:")
        print(f"       Start = {event.start} | Code = {event.code} | Value = {event.value} | Source Table = {event.source_table} | Extra Attr = {event.extra_attr}")

Patient 0:
    Event #0:
       Start = 2010-01-05 00:00:00 | Code = 2 | Value = None | Source Table = None | Extra Attr = None
    Event #1:
       Start = 2010-01-03 10:45:00 | Code = 2 | Value = test_value | Source Table = None | Extra Attr = None
Patient 1:
    Event #0:
       Start = 2010-01-03 10:45:00 | Code = 2 | Value = test_value | Source Table = None | Extra Attr = None
    Event #1:
       Start = 2010-01-03 10:30:00 | Code = 0 | Value = 34.0 | Source Table = visit | Extra Attr = True
Patient 10:
    Event #0:
       Start = 2010-01-05 00:00:00 | Code = 2 | Value = None | Source Table = None | Extra Attr = None
    Event #1:
       Start = 2010-01-03 10:45:00 | Code = 2 | Value = test_value | Source Table = None | Extra Attr = None
    Event #2:
       Start = 2010-01-03 10:30:00 | Code = 0 | Value = 34.0 | Source Table = visit | Extra Attr = True
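
The `None`-for-missing-attributes behavior shown above can be reproduced with Python's `__getattr__` hook, which is only called when normal attribute lookup fails. The `LenientEvent` class below is a hypothetical illustration of that pattern and makes no claim about how `piton.Event` is actually implemented.

```python
import datetime


class LenientEvent:
    """Hypothetical event-like object that returns None for unknown attributes."""

    def __init__(self, start, code, **extra):
        self.start = start
        self.code = code
        # Store any additional keyword arguments as regular attributes.
        for name, value in extra.items():
            setattr(self, name, value)

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. the attribute was never set.
        # Dunder names still raise so that copy/pickle protocols behave normally.
        if name.startswith("__") and name.endswith("__"):
            raise AttributeError(name)
        return None


event = LenientEvent(start=datetime.datetime(2010, 1, 5), code=2, value=34.0)
print(event.value)         # attribute that was set
print(event.source_table)  # attribute that was never set
```

The trade-off of this design is that a misspelled attribute name silently yields `None` instead of raising an error, so it pays to double-check attribute names when reading events.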


## 2. Create an `EventCollection` from our events

The first step in generating a **piton** database is to create an `EventCollection`.

You can think of an `EventCollection` as simply an unordered list of all of our events, where each event is associated with a specific patient ID.

Because it is internally represented by **piton** as a folder containing multiple files across which events will be sharded, we must specify a **target directory** where **piton** can store our `EventCollection`.
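
To make the idea of a folder of shard files concrete, here is a minimal sketch that spreads `(patient ID, event)` records across several JSON Lines files in one directory. The file naming, the JSON format, and the helper names `write_sharded` and `read_sharded` are assumptions made purely for illustration; Piton's actual on-disk format is not described in this tutorial and may look nothing like this.

```python
import json
import os
import tempfile


def write_sharded(records, directory, num_shards=2):
    """Write (patient_id, event_dict) records round-robin into shard files."""
    os.makedirs(directory, exist_ok=True)
    paths = [os.path.join(directory, f"shard_{i}.jsonl") for i in range(num_shards)]
    files = [open(path, "w", encoding="utf-8") for path in paths]
    try:
        for index, (patient_id, event) in enumerate(records):
            line = json.dumps({"patient_id": patient_id, "event": event})
            files[index % num_shards].write(line + "\n")
    finally:
        for handle in files:
            handle.close()
    return paths


def read_sharded(directory):
    """Yield every record from every shard; no ordering is promised."""
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), encoding="utf-8") as handle:
            for line in handle:
                row = json.loads(line)
                yield row["patient_id"], row["event"]


records = [
    (0, {"start": "2010-01-05T00:00:00", "code": 2}),
    (1, {"start": "2010-01-03T10:45:00", "code": 2}),
    (10, {"start": "2010-01-03T10:30:00", "code": 0}),
]

with tempfile.TemporaryDirectory() as tmp:
    shard_paths = write_sharded(records, os.path.join(tmp, "events"))
    restored = list(read_sharded(os.path.join(tmp, "events")))

print(len(shard_paths), "shards,", len(restored), "records")
```

Because records are spread across files, the order in which they come back generally differs from the order in which they were written, which is why an unordered collection is a natural fit for this layout.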

We can create an `EventCollection` from a list of events as follows:

In [9]:
# Create directory to store EventCollection
target_directory = "../ignore/dataset_tutorial_target/"
os.makedirs(target_directory, exist_ok=True)

# Create EventCollection
event_collection = piton.datasets.EventCollection(
    os.path.join(target_directory, "events")
)

Now, we'll actually add `Event`s into our `EventCollection`.

First, we must call `EventCollection.create_writer()` in order to create a writer object that will be used to write events to the `EventCollection` on disk (remember that everything in **piton** lives on disk). 

Then, we can call `add_event()` to add each individual event to our `EventCollection`.

In [10]:
# Add events to the EventCollection
with contextlib.closing(event_collection.create_writer()) as writer:
    # NOTE: We need to use the `create_writer()` handler to create
    # a writer object for adding events to the EventCollection
    # This will automatically create a new file for these events
    for patient in patients:
        for event in patient.events:
            # Note that these are getting written to disk as part of the EventCollection
            writer.add_event(patient_id=patient.patient_id, 
                             event=event)
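
`contextlib.closing` is part of the Python standard library: it wraps any object that has a `close()` method so that `close()` is called when the `with` block exits, even if an exception is raised inside it. The `ToyWriter` class below is a hypothetical stand-in used only to demonstrate that behavior.

```python
import contextlib


class ToyWriter:
    """Hypothetical writer that buffers items and records when it is closed."""

    def __init__(self):
        self.items = []
        self.closed = False

    def add_event(self, patient_id, event):
        self.items.append((patient_id, event))

    def close(self):
        self.closed = True


toy_writer = ToyWriter()
with contextlib.closing(toy_writer) as writer:
    writer.add_event(patient_id=0, event="example event")
    print("closed inside the block:", writer.closed)

print("closed after the block:", toy_writer.closed)
```

Using `closing` this way avoids having to remember an explicit `close()` call after the loop that writes events.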

To read events from an `EventCollection`, we use the `EventCollection.reader()` method to ingest events from disk.

In [12]:
# We need to create a reader object as an EventCollection will be natively stored on disk
# Note that (Patient ID, Event) tuples can be returned in any order -- piton makes no guarantees
with event_collection.reader() as reader:
    for event in reader:
        print(event)

(0, Event(start=2010-01-05 00:00:00, code=2))
(0, Event(start=2010-01-03 10:45:00, code=2, value=test_value))
(1, Event(start=2010-01-03 10:45:00, code=2, value=test_value))
(1, Event(start=2010-01-03 10:30:00, code=0, value=34.0, source_table=visit, extra_attr=True))
(10, Event(start=2010-01-05 00:00:00, code=2))
(10, Event(start=2010-01-03 10:45:00, code=2, value=test_value))
(10, Event(start=2010-01-03 10:30:00, code=0, value=34.0, source_table=visit, extra_attr=True))


## 3. Create a `PatientCollection` from our `EventCollection`

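A `PatientCollection` contains the same events as the `EventCollection` it was created from, but sorted: first by patient ID, then by the `start` time of each event. Like an `EventCollection`, it lives on disk, so we must give it its own target directory.

Note in the output below that each patient's events now appear in chronological order.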

In [14]:
# Piton stores patients within PatientCollections
# These are simply generated right from EventCollections
patients = event_collection.to_patient_collection(
    os.path.join(target_directory, "patients")
)

# We can iterate over patients
with patients.reader() as reader:
    for patient in reader:
        print(patient)

Patient(patient_id=0, events=[Event(start=2010-01-03 10:45:00, code=2, value=test_value), Event(start=2010-01-05 00:00:00, code=2)])
Patient(patient_id=1, events=[Event(start=2010-01-03 10:30:00, code=0, value=34.0, source_table=visit, extra_attr=True), Event(start=2010-01-03 10:45:00, code=2, value=test_value)])
Patient(patient_id=10, events=[Event(start=2010-01-03 10:30:00, code=0, value=34.0, source_table=visit, extra_attr=True), Event(start=2010-01-03 10:45:00, code=2, value=test_value), Event(start=2010-01-05 00:00:00, code=2)])


## 4. Apply transformations to our `PatientCollection`

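A transformation is a function that takes a `Patient` and returns a new `Patient`. Passing that function to `PatientCollection.transform()`, together with a target directory, generates a new `PatientCollection` on disk.

The example below defines a transformation that filters events by their `value` attribute.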

In [15]:
def transform(patient: piton.Patient) -> piton.Patient:
    # Keep only the events whose value is not the placeholder `test_value`.
    # Note that text values are stored as bytes, hence the bytes literal.
    return piton.Patient(
        patient_id=patient.patient_id,
        events=[
            event
            for event in patient.events
            if event.value != b"test_value"
        ],
    )

transformed_patients: piton.datasets.PatientCollection = patients.transform(
    os.path.join(target_directory, "transformed_patients"), transform
)

with transformed_patients.reader() as reader:
    for patient in reader:
        print(patient)

Patient(patient_id=0, events=[Event(start=2010-01-03 10:45:00, code=2, value=test_value), Event(start=2010-01-05 00:00:00, code=2)])
Patient(patient_id=1, events=[Event(start=2010-01-03 10:30:00, code=0, value=34.0, source_table=visit, extra_attr=True), Event(start=2010-01-03 10:45:00, code=2, value=test_value)])
Patient(patient_id=10, events=[Event(start=2010-01-03 10:30:00, code=0, value=34.0, source_table=visit, extra_attr=True), Event(start=2010-01-03 10:45:00, code=2, value=test_value), Event(start=2010-01-05 00:00:00, code=2)])
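
A transformation in this style is just a function from one patient to a new patient. The sketch below applies the same kind of filter to plain dataclasses; `ToyPatient`, `ToyEvent`, and `drop_value` are hypothetical names, not Piton APIs. It also illustrates a Python detail worth keeping in mind when writing such filters: a `str` never compares equal to a `bytes` object, so the literal used in the filter has to match the type of the stored value.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class ToyEvent:
    # Hypothetical stand-in for an event; this is not piton.Event.
    code: int
    value: Any = None


@dataclass
class ToyPatient:
    # Hypothetical stand-in for a patient; this is not piton.Patient.
    patient_id: int
    events: List[ToyEvent] = field(default_factory=list)


def drop_value(patient, unwanted):
    """Return a new patient without the events whose value equals `unwanted`."""
    kept = [event for event in patient.events if event.value != unwanted]
    return ToyPatient(patient_id=patient.patient_id, events=kept)


toy_patient = ToyPatient(
    patient_id=0,
    events=[ToyEvent(code=2, value="test_value"), ToyEvent(code=2)],
)

# The str literal matches the stored str, so one event is removed.
filtered_with_str = drop_value(toy_patient, "test_value")
# The bytes literal does not equal the stored str, so nothing is removed.
filtered_with_bytes = drop_value(toy_patient, b"test_value")

print(len(filtered_with_str.events), len(filtered_with_bytes.events))
```

Returning a new object rather than mutating the input keeps the original patient intact, which mirrors the way a transformation produces a new collection instead of modifying the existing one.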


## 5. Create a `PatientDatabase` from our `PatientCollection`

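The final step is to index our `PatientCollection` and save it to disk as a `PatientDatabase`, which lets us quickly retrieve all the events for a given patient ID. As with the previous steps, the `PatientDatabase` is stored in its own folder on disk.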
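
The idea of indexing patients for fast lookup can be sketched with a byte-offset index over a single file: each patient's timeline is written as one line, and a dictionary remembers where that line starts so it can be read back with a single `seek`. The file layout and the names `build_index` and `lookup` are assumptions for illustration only; this is not Piton's `PatientDatabase` format or API.

```python
import json
import os
import tempfile


def build_index(timelines, path):
    """Write one JSON line per patient and return {patient_id: byte offset}."""
    offsets = {}
    with open(path, "wb") as handle:
        for patient_id, events in timelines.items():
            offsets[patient_id] = handle.tell()
            line = json.dumps({"patient_id": patient_id, "events": events})
            handle.write(line.encode("utf-8") + b"\n")
    return offsets


def lookup(path, offsets, patient_id):
    """Jump straight to one patient's line instead of scanning the file."""
    with open(path, "rb") as handle:
        handle.seek(offsets[patient_id])
        return json.loads(handle.readline())["events"]


timelines = {
    0: [{"start": "2010-01-03T10:45:00", "code": 2}],
    1: [{"start": "2010-01-03T10:30:00", "code": 0}],
    10: [{"start": "2010-01-05T00:00:00", "code": 2}],
}

with tempfile.TemporaryDirectory() as tmp:
    database_path = os.path.join(tmp, "patients.jsonl")
    offsets = build_index(timelines, database_path)
    events_for_10 = lookup(database_path, offsets, 10)

print(events_for_10)
```

With an index like this, retrieving one patient costs a single seek and one line read, regardless of how many other patients are stored in the file.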