# Noisy Labeling of Clinical Notes

This notebook allows you to assign "noisy" labels to clinical notes using heuristics known as labelling functions (LFs).

Because this is a largely exploratory process, it may be useful to run the following cell, which allows you to modify the `NoisyLabeler` code without restarting the kernel.

In [1]:
%load_ext autoreload
%autoreload 2

## Load the Data

First, you must load some text to label. You will want to have some source of "gold" labels to determine the accuracy of your labelling functions. Your labels should be `1`, indicating the presence of a disease, or `0`, indicating its absence. The following code assumes your data is in a [JSON Lines](https://jsonlines.org/) format, with the fields `"text"` and `"label"`, but you can load the data any way you like.

In [2]:
gold_data_filepath = "../data/MIMIC-III-HEART-DISEASE/valid.jsonl"

In [3]:
import json
from pathlib import Path

import numpy as np

valid = [json.loads(line) for line in Path(gold_data_filepath).read_text().strip().split("\n")]
texts = [example["text"] for example in valid]
labels = np.asarray([example["label"] for example in valid])

## (Noisy) Label the Data

First, initialize the labeller

In [4]:
from deep_patient_cohorts import NoisyLabeler

labeler = NoisyLabeler()



Although optional, it makes sense to preprocess the text with spaCy only one. We can do this easily like so

> note, this will take a few minutes per 1000 documents

In [5]:
texts = labeler.preprocess(texts)

  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]
5272it [1:26:15,  1.02it/s]


Finally, we can label the data and check the accuracy of each labelling function

In [6]:
noisy_labels = labeler(texts)

labeler.accuracy(noisy_labels=noisy_labels, gold_labels=labels)

100%|██████████| 2/2 [00:28<00:00, 14.05s/it]

LF 0: Accuracy 57%, Abstain rate 56%
LF 1: Accuracy 91%, Abstain rate 93%





### Adding New LFs

You may need to continually modify your LFs until they reach acceptable accuracy. The following example demonstrates how to add a new LF to the existing `labeler`, and evaluate its accuracy.

In [7]:
from typing import List
from deep_patient_cohorts import POSITIVE, NEGATIVE, ABSTAIN

# heart disease LF
def heart_disease(self, texts: List[str]) -> List[int]:
    return [POSITIVE if "heart disease" in text.text.lower() else ABSTAIN for text in texts]

labeler.add(heart_disease)

# st elevation LF
def st_elevation(self, texts: List[str]) -> List[int]:
    search_list = ["STEMI", "ST elevation", "ST elevation MI"]
    return [POSITIVE if any([x in text.text.lower() for x in search_list]) else ABSTAIN for text in texts]

# st elevation LF
def st_elevation(self, texts: List[str]) -> List[int]:
    search_list = ["STEMI", "ST elevation", "ST elevation MI"]
    return [POSITIVE if any([x in text.text.lower() for x in search_list]) else ABSTAIN for text in texts]

# atherosclerosis
def atherosclerosis(self, texts: List[str]) -> List[int]:
    search_list = ["atherosclerosis", "arteriosclerosis", "atherosclerotic", "arterial sclerosis", "artherosclerosis", "atherosclerotic disease"]
    return [POSITIVE if any([x in text.text.lower() for x in search_list]) else ABSTAIN for text in texts]

# heart_attack
def heart_attack(self, texts: List[str]) -> List[int]:
    search_list = ["myocardial infarcation", "MI", "ischemic heart disease", "cardiac arrest", "coronary infarction", "asystole", "cardiopulmonary arrest", "coronary thrombosis", "heart arrest", "heart attack", "heart stoppage"]
    return [POSITIVE if any([x in text.text.lower() for x in search_list]) else ABSTAIN for text in texts]

# heart_failure
def heart_failure(self, texts: List[str]) -> List[int]:
    search_list = ["congestive heart failure", "decomensated heart failure", "CHF", "left-side heart failure", "right-sided heart failure"]
    return [POSITIVE if any([x in text.text.lower() for x in search_list]) else ABSTAIN for text in texts]

labeler.add(heart_disease)
labeler.add(st_elevation)
labeler.add(atherosclerosis)
labeler.add(heart_attack)
labeler.add(heart_failure)

noisy_labels = labeler(texts)
labeler.accuracy(noisy_labels=noisy_labels, gold_labels=labels)

100%|██████████| 3/3 [00:25<00:00,  8.46s/it]

LF 0: Accuracy 57%, Abstain rate 56%
LF 1: Accuracy 91%, Abstain rate 93%
LF 2: Accuracy 58%, Abstain rate 97%





Of course, you can also modify the `NoisyLabeler` code directly.

### Training a Label Model

Using [FlyingSquid](https://github.com/HazyResearch/flyingsquid), we can train a probablistic model to combine our LFs (assuming we have at least 3!)

In [None]:
from flyingsquid.label_model import LabelModel

m = noisy_labels.shape[1]
label_model = LabelModel(m)

label_model.fit(noisy_labels)

preds = label_model.predict(noisy_labels).reshape(labels.shape)
accuracy = np.sum(preds == labels) / labels.shape[0]

print(f"Label model accuracy: {int(100 * accuracy)}%")

### Removing LFs

If it turns out our new LF performs poorly, we can remove it and try again

In [5]:
del labeler.lfs[-1]

## LF backlog

- ~~ST elevation (common sign of heart attack)~~
    - ~~STEMI|ST elevation~~
- obstruction of heart vessels (occurs in 15-20% of people with heart disease)
    - xx 
    - %|percent 
    - blockage|obstruction|occlusion|narrowed 
    - of|in|around 
    - coronary arteries(or arteries) | Left coronary artery | LCA | Left anterior descending artery | LAD | Left circumflex artery | Posterior descending artery | Right coronary artery | RCA | Right marginal artery | Posterior descending artery | PDA
- swelling/edema (commmon symptom of heart failure)
    - swelling|edema|puffiness
    - in 
    - left | right | l | r (optional)
    - ankle(s), leg(s), feet | foot
- angina (common symptom of coronary artery disease)
    - stable|unstable|variant (optional)
    - angina|chest pain|angina pectoris
- abnormal diagnostic test results
    - abnormal|concerning 
    - ECG|echo|echocardiogram
- ~~atherosclerosis~~
    - ~~atherosclerosis|arteriosclerosis|atherosclerotic|arterial sclerosis|artherosclerosis|atherosclerotic disease~~
- ~~heart attack~~
    - ~~myocardial infarcation|MI|ischemic heart disease|cardiac arrest|coronary infacrtion|asystole|cardiopulmonary arrest|coronary thrombosis|heart arrest|heart attack|heart stoppage~~
- ~~heart failure~~
    - ~~congestive heart failure|decomensated heart failure|CHF|left-side heart failure|right-sided heart failure~~
- correlated procedures: 
    - coronary|cardiac cath|catheter|catheterization
    - coronary|cardiac stent|stenting|angioplasty
    - Percutaneous coronary intervention|PCI
- correlated drugs (get the individual drug names from SNOMED for the following SNOMED categories, and exact/synonym match for these in the text. At least 1 hit should be a POSITIVE: 
    - Cardiovascular Agents (all of them)
    - Hematologic agent
    - Thrombolytic
    - Anticoagulant
- correlated diseases (get the individual drug names from SNOMED for the following SNOMED categories, and exact/synonym match for these in the text. At least 1 hit should be a POSITIVE: 
    - Anything with ICD10CM chapter 9 (get all of the code names/synonyms and exact match)
- correlated cardiac markers - regex, similar to ejection fraction, to search for cardiac markers with abnormal ranges:
    - see this: https://en.wikipedia.org/wiki/Cardiac_marker




   
   