# Noisy Labeling of Clinical Notes

This notebook allows you to assign "noisy" labels to clinical notes using heuristics known as labelling functions (LFs).

Because this is a largely exploratory process, it may be useful to run the following cell, which allows you to modify the `NoisyLabeler` code without restarting the kernel.

In [None]:
%load_ext autoreload
%autoreload 2

## Load the Data

First, you must load some text to label. You will want to have some source of "gold" labels to determine the accuracy of your labelling functions. Your labels should be `1`, indicating the presence of a disease, or `0`, indicating its absence. The following code assumes your data is in a [JSON Lines](https://jsonlines.org/) format, with the fields `"text"` and `"label"`, but you can load the data any way you like.

In [None]:
gold_data_filepath = "../data/MIMIC-III-HEART-DISEASE/valid.jsonl"

In [None]:
import json
from pathlib import Path

import numpy as np

valid = [json.loads(line) for line in Path(gold_data_filepath).read_text().strip().split("\n")]
texts = [example["text"] for example in valid]
labels = np.asarray([example["label"] for example in valid])
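For reference, the expected file layout can be sketched as follows. This is a minimal, self-contained illustration of the JSON Lines format described above; the two records are made up, and the file name `example.jsonl` is a placeholder rather than part of the dataset.

```python
import json
import tempfile
from pathlib import Path

# Two made-up records in the expected shape; the text is illustrative only.
records = [
    {"text": "Patient has a history of coronary artery disease.", "label": 1},
    {"text": "No acute cardiopulmonary process.", "label": 0},
]

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "example.jsonl"  # placeholder file name
    # JSON Lines: one JSON object per line, with no enclosing list.
    path.write_text("\n".join(json.dumps(record) for record in records) + "\n")

    # Read the file back, skipping any blank lines.
    loaded = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]

example_texts = [record["text"] for record in loaded]
example_labels = [record["label"] for record in loaded]
print(example_labels)
```

Any loader works as long as it produces a list of strings and a matching array of `0`/`1` labels.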

## (Noisy) Label the Data

Next, initialize the labeler.

> Note, this can take a few minutes as it loads the language model and resources into memory.

In [None]:
from deep_patient_cohorts import NoisyLabeler

labeler = NoisyLabeler()

Although optional, it makes sense to preprocess the text with spaCy only once. We can do this like so:

> Note, this will take a few minutes per 1000 documents.

In [None]:
processed_texts = labeler.preprocess(texts)

Then, label the data and check the accuracy of each labelling function:

In [None]:
noisy_labels = labeler.fit_lfs(processed_texts)
_ = labeler.accuracy(noisy_labels=noisy_labels, gold_labels=labels)

In [None]:
LF 0: Accuracy 56%, Abstain rate 45%
LF 0: Accuracy 61%, Abstain rate 67%

neg
LF 0: Accuracy 61%, Abstain rate 0%
LF 0: Accuracy 65%, Abstain rate 22%
LF 0: Accuracy 66%, Abstain rate 35%
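To make the two reported numbers concrete, here is a minimal sketch of how per-LF accuracy and abstain rate could be computed. It is not the `NoisyLabeler` implementation. It assumes the noisy labels form a matrix with one row per document and one column per LF, that accuracy is measured only over documents where the LF voted, and that abstentions are encoded as `-1`; the name `lf_summary` and the constant `ABSTAIN_VALUE` are hypothetical.

```python
import numpy as np

ABSTAIN_VALUE = -1  # hypothetical encoding; the library's ABSTAIN may differ


def lf_summary(noisy, gold, abstain=ABSTAIN_VALUE):
    """Return a list of (accuracy, abstain_rate) tuples, one per LF column."""
    noisy = np.asarray(noisy)
    gold = np.asarray(gold)
    summary = []
    for j in range(noisy.shape[1]):
        votes = noisy[:, j]
        voted = votes != abstain  # documents where this LF did not abstain
        abstain_rate = 1.0 - float(voted.mean())
        # Accuracy over voted documents only; undefined if the LF never votes.
        if voted.any():
            accuracy = float((votes[voted] == gold[voted]).mean())
        else:
            accuracy = float("nan")
        summary.append((accuracy, abstain_rate))
    return summary


# Toy data: 4 documents, 2 LFs.
toy_noisy = [[1, -1], [1, 0], [-1, 0], [0, 1]]
toy_gold = [1, 0, 0, 0]
for j, (acc, abst) in enumerate(lf_summary(toy_noisy, toy_gold)):
    print(f"LF {j}: Accuracy {acc:.0%}, Abstain rate {abst:.0%}")
```

Under this definition, an LF that abstains often can still look accurate, so the two numbers are best read together.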

### Adding New LFs

You may need to continually modify your LFs until they reach acceptable accuracy. The following example demonstrates how to add a new LF to the existing `labeler`, and evaluate its accuracy.

In [None]:
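from typing import List

from spacy.tokens import Doc

from deep_patient_cohorts import POSITIVE, NEGATIVE, ABSTAIN


def heart_disease(self, texts: List[Doc]) -> List[int]:
    # `texts` are the preprocessed spaCy documents, hence `text.text`.
    # Vote POSITIVE when the phrase appears in the note; otherwise abstain.
    return [POSITIVE if "heart disease" in text.text.lower() else ABSTAIN for text in texts]


# Register the new LF, then re-fit and re-evaluate all LFs.
labeler.add(heart_disease)

noisy_labels = labeler.fit_lfs(processed_texts)
_ = labeler.accuracy(noisy_labels=noisy_labels, gold_labels=labels)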

Of course, you can also modify the `NoisyLabeler` code directly.
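An LF can also vote `NEGATIVE` rather than only abstaining. The sketch below shows one possible keyword heuristic that checks a few negated phrases before the positive match. It is a standalone illustration on plain strings: the constants are local stand-ins with assumed values (the real ones come from `deep_patient_cohorts`), and the phrase list and the function name `heart_disease_with_negation` are hypothetical.

```python
from typing import List

# Hypothetical stand-ins so the example runs on its own; in the notebook,
# import POSITIVE, NEGATIVE and ABSTAIN from deep_patient_cohorts instead.
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

# Illustrative negated phrases; a real list would need clinical review.
NEGATED_PHRASES = ("no heart disease", "denies heart disease", "without heart disease")


def heart_disease_with_negation(texts: List[str]) -> List[int]:
    votes = []
    for text in texts:
        lowered = text.lower()
        # Check negations first, since they contain the positive phrase.
        if any(phrase in lowered for phrase in NEGATED_PHRASES):
            votes.append(NEGATIVE)
        elif "heart disease" in lowered:
            votes.append(POSITIVE)
        else:
            votes.append(ABSTAIN)
    return votes


print(
    heart_disease_with_negation(
        [
            "History of ischemic heart disease.",
            "Patient denies heart disease.",
            "Presents with ankle sprain.",
        ]
    )
)
```

To use a function like this with `labeler.add`, it would need the same signature as `heart_disease` above and would read `text.text` from each preprocessed document.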

### Training a Label Model

Using [FlyingSquid](https://github.com/HazyResearch/flyingsquid), we can train a probabilistic model to combine our LFs (assuming we have at least 3!).

In [None]:
labeler.fit_lm(noisy_labels=noisy_labels, gold_labels=labels)

Alternatively, you can fit both the labelling functions and the label model in one step with:

```python
labeler.fit(noisy_labels=noisy_labels, gold_labels=labels)
```
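As a sanity check on the label model, it can help to compare its output against a simple majority vote over the LFs. The sketch below is that baseline only, not FlyingSquid and not part of `NoisyLabeler`. It assumes a documents-by-LFs matrix with votes encoded as `1` (positive), `0` (negative), and `-1` (abstain); these encodings and the name `majority_vote` are assumptions for illustration.

```python
import numpy as np


def majority_vote(noisy, positive=1, negative=0, abstain=-1):
    """Combine LF votes per document; ties and all-abstain rows abstain."""
    noisy = np.asarray(noisy)
    pos_counts = (noisy == positive).sum(axis=1)
    neg_counts = (noisy == negative).sum(axis=1)
    combined = np.full(noisy.shape[0], abstain)
    combined[pos_counts > neg_counts] = positive
    combined[neg_counts > pos_counts] = negative
    return combined


# Toy data: 4 documents, 3 LFs.
toy_votes = [
    [1, 1, -1],    # two positive votes
    [0, 1, 0],     # two negative votes against one positive
    [-1, -1, -1],  # every LF abstains
    [1, 0, -1],    # tie
]
print(majority_vote(toy_votes))
```

Unlike a learned label model, a majority vote weights every LF equally, so a single inaccurate LF can sway the result as much as an accurate one.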

### Removing LFs

If it turns out our new LF performs poorly, we can remove it and try again:

In [None]:
del labeler.lfs[-1]