# Labeling

A core component of FEMR is labeling subjects.

Labels within FEMR follow the [label schema within MEDS](https://github.com/Medical-Event-Data-Standard/meds/blob/e93f63a2f9642123c49a31ecffcdb84d877dc54a/src/meds/__init__.py#L70).

Per MEDS, each label consists of three attributes:

* `subject_id` (int64): The identifier for the subject to predict on
* `prediction_time` (datetime.datetime): The time at which the prediction is made. This determines which data may be used as features: nothing recorded after this time is allowed.
* `boolean_value` (bool): The target to predict

Additional types of labels will be added to MEDS over time, and then supported here.
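
To make the role of `prediction_time` concrete, here is a minimal sketch of how a feature pipeline might restrict itself to data available at prediction time. The event dictionaries, the helper `events_available_at`, and the choice to include events that occur exactly at `prediction_time` are assumptions made for illustration; they are not part of MEDS or FEMR.

```python
import datetime

# Hypothetical events for one subject; plain dicts stand in for real MEDS events.
events = [
    {"time": datetime.datetime(1994, 1, 10), "code": "Visit/Outpatient"},
    {"time": datetime.datetime(1994, 3, 1), "code": "Lab/Glucose"},
    {"time": datetime.datetime(1994, 6, 15), "code": "Visit/Inpatient"},
]

label = {
    "subject_id": 100,
    "prediction_time": datetime.datetime(1994, 3, 2),
    "boolean_value": False,
}


def events_available_at(events, prediction_time):
    """Keep only events at or before prediction_time (inclusive cutoff is an assumption)."""
    return [event for event in events if event["time"] <= prediction_time]


usable = events_available_at(events, label["prediction_time"])
print([event["code"] for event in usable])  # ['Visit/Outpatient', 'Lab/Glucose']
```

The June 1994 visit is dropped because it happens after the prediction time, so a model using it would be looking into the future.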

In [1]:
import shutil
import os

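# Scratch directory for this tutorial's outputs
TARGET_DIR = 'trash/tutorial_2'

# Start from a clean directory on every run
if os.path.exists(TARGET_DIR):
    shutil.rmtree(TARGET_DIR)

# makedirs also creates the parent 'trash' directory if it is missing
os.makedirs(TARGET_DIR)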

# Demonstration of some example labels

In [2]:
# We can construct these labels manually

import femr.labelers
import datetime
import meds

# Predict False on March 2nd, 1994
example_label = {'subject_id': 100, 'prediction_time': datetime.datetime(1994, 3, 2), 'boolean_value': False}

# Predict True on March 2nd, 2009
example_label2 = {'subject_id': 100, 'prediction_time': datetime.datetime(2009, 3, 2), 'boolean_value': True}


# Multiple labels are stored using a list
labels = [example_label, example_label2]
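
Because labels are plain dictionaries, it is easy to introduce a wrong key or type by accident. The sketch below shows one way to sanity-check a label against the three attributes listed earlier. The helper `check_label` is hypothetical and not part of FEMR or MEDS, and the two sample labels are defined again so the block runs on its own.

```python
import datetime

# Expected attribute names and Python types, following the three label attributes
EXPECTED_FIELDS = {
    "subject_id": int,
    "prediction_time": datetime.datetime,
    "boolean_value": bool,
}


def check_label(label: dict) -> list:
    """Return a list of problems found in one label (empty if it looks valid)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in label:
            problems.append(f"missing field: {name}")
            continue
        value = label[name]
        # bool is a subclass of int in Python, so reject True/False as a subject_id
        if expected_type is int and isinstance(value, bool):
            problems.append(f"{name} should be an int, got bool")
        elif not isinstance(value, expected_type):
            problems.append(
                f"{name} should be {expected_type.__name__}, got {type(value).__name__}"
            )
    return problems


good = {"subject_id": 100, "prediction_time": datetime.datetime(1994, 3, 2), "boolean_value": False}
bad = {"subject_id": 100, "prediction_time": "1994-03-02", "boolean_value": False}

print(check_label(good))  # []
print(check_label(bad))   # ['prediction_time should be datetime, got str']
```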

# Generating labels programmatically within FEMR

One core feature of FEMR is the ability to generate labels algorithmically using labeler classes.

The core of FEMR's labeling code is the abstract base class [Labeler](https://github.com/som-shahlab/femr/blob/main/src/femr/labelers/core.py#L40).

`Labeler` has one abstract method:

```python
def label(self, subject: meds_reader.Subject) -> List[meds.Label]:
    """Generate a list of labels for a subject."""
```

Note that the subject is assumed to follow the [MEDS Subject schema](https://github.com/Medical-Event-Data-Standard/meds/blob/e93f63a2f9642123c49a31ecffcdb84d877dc54a/src/meds/__init__.py#L18).

Once this method is implemented, the `apply` method becomes available for generating labels across a whole subject database.
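
The relationship between `label` and `apply` can be illustrated with a small, self-contained sketch. This is not FEMR's implementation: `SketchLabeler`, `ManyEventsLabeler`, the dictionary-based subjects, and the plain list returned by `apply` are simplifications assumed here purely to show the pattern of implementing one per-subject method and getting a dataset-wide method in return.

```python
import abc
import datetime
from typing import Any, Dict, Iterable, List


class SketchLabeler(abc.ABC):
    """Illustrative stand-in for a labeler base class (not FEMR's actual code)."""

    @abc.abstractmethod
    def label(self, subject: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Return the labels for a single subject."""

    def apply(self, subjects: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Call label() on every subject and collect the results."""
        labels: List[Dict[str, Any]] = []
        for subject in subjects:
            labels.extend(self.label(subject))
        return labels


class ManyEventsLabeler(SketchLabeler):
    """Toy labeler: does the subject have more than one event?"""

    def label(self, subject):
        events = subject["events"]
        if not events:
            return []  # No events, so there is no time to predict at
        return [{
            "subject_id": subject["subject_id"],
            "prediction_time": events[-1]["time"],
            "boolean_value": len(events) > 1,
        }]


# Hypothetical subjects; real code would iterate over a subject database instead
subjects = [
    {"subject_id": 0, "events": [{"time": datetime.datetime(1993, 1, 31)}]},
    {"subject_id": 1, "events": [
        {"time": datetime.datetime(1990, 5, 1)},
        {"time": datetime.datetime(1991, 8, 31)},
    ]},
    {"subject_id": 2, "events": []},
]

results = ManyEventsLabeler().apply(subjects)
for row in results:
    print(row["subject_id"], row["prediction_time"].date(), row["boolean_value"])
```

Because `label` is marked abstract, Python refuses to instantiate `SketchLabeler` directly; only subclasses that implement `label` can call `apply`.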

In [12]:
from typing import List
import meds_reader
import meds
import femr.labelers


class IsMaleLabeler(femr.labelers.Labeler):
    # Dummy labeler to predict gender at birth
    
    def label(self, subject: meds_reader.Subject) -> List[meds.Label]:
        # True if any event in the subject's record carries the male gender code
        is_male = any('Gender/M' == event.code for event in subject.events)
        return [{
            'subject_id': subject.subject_id,
            # Predict at the time of the subject's last event
            'prediction_time': subject.events[-1].time,
            'boolean_value': is_male,
        }]
    
database = meds_reader.SubjectDatabase("input/synthetic_meds")

labeler = IsMaleLabeler()
labeled_subjects = labeler.apply(database)


print(labeled_subjects)



     subject_id prediction_time  boolean_value
0             0      1993-01-31          False
1             1      1991-08-31           True
2             2      1992-08-05           True
3             3      1991-01-11           True
4             4      1994-04-05           True
..          ...             ...            ...
195         195      1995-10-07          False
196         196      1995-08-31          False
197         197      1992-05-29           True
198         198      1992-10-06           True
199         199      1993-05-02           True

[200 rows x 3 columns]


In [13]:
# We can save these labels to a Parquet file

labeled_subjects.to_parquet("trash/tutorial_2/labels.parquet", index=False)