In [1]:
import numpy as np
import pandas as pd
import seaborn as sns  # used for the histograms further down
import utils

# Silence future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

___
# Get to know the CHIFIR dataset

_The corpus of Cytology and Histopathology Invasive Fungal Infection Reports (CHIFIR) is available at [PhysioNet](https://physionet.org/content/corpus-fungal-infections/1.0.0/). Since these are medical reports and can contain sensitive information, the dataset can only be accessed by credentialed users who have signed the Data Use Agreement._

___
## Background

Cytology and histopathology reports are a common type of clinical documentation. These are pathologist-produced free-text reports outlining the macroscopic and microscopic structure of a specimen. Depending on the sample and what it contains, a report might describe its overall structure, which types of cells or tissue can be seen, and any pathological findings. In other words, the information contained in a report can vary a lot and directly depends on the patient's medical condition.

CHIFIR was created to support the development of an automated tool for the detection of invasive fungal infection (IFI). IFIs are rare but serious infections most commonly affecting immunocompromised and critically ill patients. Traditionally, surveillance of IFI is a laborious process which requires a physician to perform a detailed review of a patient's medical history. Histopathology reports play a key role as they provide, albeit not with 100% certainty, evidence for the presence or absence of IFI.

___
## Aim

As mentioned above, the ultimate goal is to build a tool that can accurately detect IFI based on a patient's medical history. Part of this is to be able to tell if any associated histopathology reports contain any evidence for IFI. This can be done in two steps:
- By extracting any relevant information from a report, e.g., phrases describing fungal organisms.
- Based on this information, classifying a report as positive or negative for IFI.

In this tutorial, we will be focussing on the task of information extraction, specifically, named-entity recognition (NER). This means we would like to **detect words or phrases in the text that describe a particular concept**. Since the reports are free-text, we might need to use text analytics and natural language processing (NLP) methods. But first let's take a look at the data...

___
# Explore the CHIFIR dataset

___
## Metadata

In [None]:
# Load the csv file with report metadata
path = "../../../Data/CHIFIR/"
df = pd.read_csv(path + "chifir_metadata.csv")
print(df.shape)
df.head()

In [None]:
# How many reports?
df.histopathology_id.nunique()

In [None]:
# Number of patients
df.patient_id.nunique()

In [None]:
# Number of reports per patient
df.groupby('patient_id').size().aggregate(['min', 'max'])

In [None]:
# Report-level annotations
df.y_report.value_counts()

In [None]:
# Proportion of positive reports
df.y_report.value_counts(normalize=True).round(2)

In [None]:
# Recommended data split: development and test sets
df.dataset.value_counts()

In [None]:
# Recommended data split for 10-fold cross-validation --
# ensures reports from the same patient are allocated to the same fold.
df[df.dataset=='development'].val_fold.value_counts().sort_index()

<div class="alert alert-block alert-info">
<b>Tip:</b> Use the same cross-validation split when comparing different models or approaches; otherwise, your results might not be reliable. Here, we appended fold numbers to the dataset. You can also reproduce the splitting strategy each time, but make sure to initialise the random number generator with the same value.
</div>
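
To make the tip concrete, here is a minimal sketch of a reproducible, patient-grouped split. It is not the procedure used to create `val_fold` in the dataset. The function name `assign_folds`, the toy patient IDs, and the seed value are all made up for illustration. The idea is simply to shuffle the patients (not the reports) with a seeded random number generator and deal them out across the folds.

```python
import random

def assign_folds(patient_ids, n_folds=10, seed=42):
    """Map each patient to a fold so that all of a patient's reports stay together."""
    unique_patients = sorted(set(patient_ids))  # sorted for a deterministic starting order
    rng = random.Random(seed)                   # fixed seed: same split on every run
    rng.shuffle(unique_patients)
    # Deal the shuffled patients out round-robin across the folds
    return {pid: i % n_folds for i, pid in enumerate(unique_patients)}

# Toy data (hypothetical): one entry per report, patients can appear more than once
report_patients = ["p1", "p2", "p1", "p3", "p4", "p2", "p5", "p6"]
patient_to_fold = assign_folds(report_patients, n_folds=3, seed=42)
report_folds = [patient_to_fold[pid] for pid in report_patients]
```

Note that this balances the number of patients per fold, not the number of reports or the proportion of positive reports. A stratified variant would need extra care.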

___
## Reports

In [None]:
# Add free-text reports to the dataframe
df['report'] = df.apply(utils.read_report, path=path + "reports/", axis=1)

In [None]:
# What does a report look like? Let's look at an example
print(df.report.iloc[20])

<div class="alert alert-block alert-info">
<li>Personal identifying information is replaced with a string of Xs of equal length.
    
<code>"Reported by Dr XXXXXXXXXX with Dr XXXXXXXXX, XXXXXXXXXXXXXXXXXXXXXXXXXXXXX, validated XXXXXXXXXXX "</code></li>
<li>Report sections are separated by newline characters and headers in caps lock: 
    
<code>REQUEST DETAILS</code>, <code>MACROSCOPIC DESCRIPTION</code>, <code>MICROSCOPIC DESCRIPTION</code>, <code>OPINION</code></li>
<li>The report uses some abbreviations and specific terminology, and it is characterised by short and sometimes incomplete sentences:

<code>"Stage IV FL- on bispecific Ab."</code>
</li>
</div>

In [None]:
# Are all reports structured in the same way?
print(df.report.iloc[0])

<div class="alert alert-block alert-info">
Not quite. Some formatting may have been lost during data transfer. Section headers also vary; for example, the concluding section can be titled <code>OPINION</code> or <code>DIAGNOSIS</code>.
</div>

In [None]:
# Let's calculate the character length of reports
df['report_length'] = df.report.apply(len)

sns.histplot(x='report_length', data=df);

In [None]:
# Btw, is there any correlation between report length and its IFI label?
sns.histplot(x='report_length', data=df, hue='y_report');

___
# The annotation process

___
## What to annotate?

Now let's take a look at another report.

Assuming no prior knowledge, which words or phrases would you identify as __related to fungal infection__?

In [None]:
print(df.report.iloc[2])

<div class="alert alert-block alert-info">
<li>The doctor suspects fungal infection: <code>?fungus</code></li>
<br/>
<li>Explicit negation of fungal infection: <code>no ... fungal elements are identified</code>, <code>no evidence of ... fungal elements</code></li>
<br/>
<li>What else? <code>Pneumocystis</code> is a type of fungus. <code>Grocott</code> is a dye that stains fungi.</li>
</div>

___
## The annotation schema

From what we have seen above, what information in the report is **relevant to detecting fungal infection**?

- Did the referring doctor suspect fungal infection?
- What stains were used to examine the sample?
- What organisms/species were mentioned?
- Were these explicitly negated?

How can we **categorise this information**? Let's define several concept categories: 

| **Concept category** | **Definition**                                                 |
|:----------------------:|:----------------------------------------------------------------|
| _ClinicalQuery_      | Queries about IFI                                              |
| _FungalDescriptor_   | Generic descriptors of fungal elements                         |
| _Fungus_             | Specific fungal organisms or syndromes                         |
| _Invasiveness_       | Depth and degree of fungal invasion into tissues               |
| _Stain_              | Names of histological stains used to visualise fungal elements |
| _SampleType_         | Names of the sampled organ, site, or tissue source             |
| _Positive_           | Affirmative expression                                         |
| _Equivocal_          | Expression of uncertainty                                      |
| _Negative_           | Negating expression                                            |

The phrases we have identified in our example would fall into the following categories:
- `?fungal` is a _ClinicalQuery_
- `fungal elements` is a _FungalDescriptor_ (**note** that both instances are negated)
- `Pneumocystis` is a _Fungus_
- `Grocott` is a _Stain_
- `no` and `no evidence of` are both _Negative_

Now, if we were to encounter a **new report**, we would be able to identify and categorise these phrases. Take a look at the report below. Do you think it is positive or negative for fungal infection?

In [None]:
print(df.loc[25].report)

<div class="alert alert-block alert-info">
Already familiar to us are the phrases <code>Pneumocystis</code>, which is a type of <i>Fungus</i>, and <code>fungal elements</code> categorised as <i>FungalDescriptor</i>. The report clearly states that <code>there is no evidence of Pneumocystis</code>.
</div>

If at this point we were to ask a medical professional to **help us find other mentions of fungal infection**, they would point us to phrases:
- `fungal hyphae`
- `thick, non-septate and branch at 90 degrees`
- `Aspergillus` 

The first two would be tagged as _FungalDescriptor_ and the last one as a type of _Fungus_.

<div class="alert alert-block alert-info">
With this <b>additional knowledge</b> we can now confidently say that the report is positive for fungal infection: <code>the smear contains scattered fungal hyphae</code> and <code>fungal elements resembling Aspergillus identified</code> both suggest there is something going on. 
</div>

___
## Annotations in the CHIFIR dataset

The example above showed us why it is important to look through multiple reports to **collect as much information as possible**.

Luckily, some kind medical professionals agreed to annotate all of our 283 reports, tagging words and phrases that belong to one of the concept categories. We refer to these annotations as the **gold standard**. Let's take a look at what was annotated:

In [None]:
# Parse annotation files and load gold standard annotations
concepts = utils.read_annotations(df, path + "annotations/")

Below is a table summarising how common the concept categories are and how much the language used varies within each category.

| **Concept category** | **Examples** | **Total occurrences** | **Number of reports with at least one occurrence**| **Number of unique phrases** | **Lexical diversity** |
|:----------------------:|:--------------------------------------------------------|:---:|:---:|:---:|:---:|
| _ClinicalQuery_      | ?cryptococcus, ?fungal infection|65|53|36|0.55|
| _FungalDescriptor_   | budding yeasts, fungal hyphae, pathogenic organisms|282|128|67|0.24|
| _Fungus_             | aspergillus, candida, cryptococcal organisms|106|60|15|0.14|
| _Invasiveness_       | angioinvasion, infiltration, intravascular spaces|37|12|25|0.68|
| _Stain_              | alcian blue, d/pas, grocott, mucicarmine|172|100|13|0.08|
| _SampleType_         | abdomen, cheek, lung, lymph node, skin|198|179|55|0.28|
| _Positive_           | containing, favouring, resembling, suggestive|118|42|37|0.31|
| _Equivocal_          | atypical, possibility, possible|7|5|5|0.71|
| _Negative_           | do not feature, failed to identify, no evidence, not seen|152|104|11|0.07|
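
The lexical diversity values in the table are consistent with the number of unique phrases divided by the total number of occurrences (e.g., 36 / 65 ≈ 0.55 for _ClinicalQuery_). Below is a minimal sketch of that calculation on made-up annotations. The `(category, phrase)` input format and the lowercasing step are assumptions and do not necessarily match what `utils.read_annotations` returns.

```python
from collections import defaultdict

def lexical_diversity(annotations):
    """Unique phrases / total occurrences per concept category.

    annotations: iterable of (concept_category, phrase) pairs (hypothetical format).
    """
    phrases = defaultdict(list)
    for category, phrase in annotations:
        phrases[category].append(phrase.lower())  # assumption: case-insensitive comparison
    return {cat: round(len(set(p)) / len(p), 2) for cat, p in phrases.items()}

# Toy annotations, not taken from the dataset
toy = [
    ("Stain", "Grocott"), ("Stain", "grocott"), ("Stain", "PAS"), ("Stain", "Grocott"),
    ("Equivocal", "possible"), ("Equivocal", "possibility"),
]
diversity = lexical_diversity(toy)
```

A value close to 0 means the same few phrases are repeated over and over. A value close to 1 means almost every occurrence is worded differently.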










<div class="alert alert-block alert-success">    
<h2>Considerations when doing NER</h2>
<br/>
<li><b>The size of the dataset:</b> We only have 283 reports and some concept categories (e.g., <i>Equivocal</i>) are very rare.</li>
<br/>
<li><b>Lexical diversity:</b> There may be more than one way to say the same thing. <i>Stain</i> and <i>Negative</i> are looking good, <i>ClinicalQuery</i> and <i>Invasiveness</i> are going to be tricky.</li>
<br/>
<li><b>Very specific/narrow subject:</b> Pre-trained models and terminology sets are likely to be too generic and therefore of limited utility.</li>
</div>

___
# NER

___
## Dictionary-based approach

### How does it work?

Think back to our example. Let's construct a dictionary of possible phrases for each concept category. Here is what our dictionary would look like after parsing the first report:
-  _ClinicalQuery_: `?fungal`
-  _FungalDescriptor_: `fungal elements`
-  _Fungus_: `Pneumocystis`
-  _Stain_: `Grocott`

After reading through the second report, we can add a few more phrases to our dictionary:
-  _ClinicalQuery_: `?fungal`
-  _FungalDescriptor_: `fungal elements`, `fungal hyphae`, `thick, non-septate and branch at 90 degrees`
-  _Fungus_: `Pneumocystis`, `Aspergillus`
-  _Stain_: `Grocott`

<div class="alert alert-block alert-info">
Now, if we were to stop here, these two reports would constitute the <b>training set</b> and the updated dictionary would be our <b>learned dictionary</b>. The remaining unseen 281 reports would be the <b>test set</b>. Alternatively, we can continue updating the dictionary by parsing additional reports.
</div>
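
As a rough sketch, learning the dictionary amounts to collecting every annotated phrase under its concept category. The function name `learn_dictionary` and the `(category, phrase)` annotation format are made up for illustration. The phrases are the ones from the two example reports above.

```python
from collections import defaultdict

def learn_dictionary(annotated_reports):
    """Collect the lowercased phrases seen for each concept category."""
    dictionary = defaultdict(set)
    for annotations in annotated_reports:
        for category, phrase in annotations:
            dictionary[category].add(phrase.lower())
    return dictionary

# Annotations from the two example reports, in a hypothetical (category, phrase) format
first_report = [
    ("ClinicalQuery", "?fungal"),
    ("FungalDescriptor", "fungal elements"),
    ("Fungus", "Pneumocystis"),
    ("Stain", "Grocott"),
]
second_report = [
    ("FungalDescriptor", "fungal elements"),
    ("FungalDescriptor", "fungal hyphae"),
    ("FungalDescriptor", "thick, non-septate and branch at 90 degrees"),
    ("Fungus", "Pneumocystis"),
    ("Fungus", "Aspergillus"),
]
learned = learn_dictionary([first_report, second_report])
```

Using sets means that a phrase seen in several reports is stored only once.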

### Once we have learned the dictionary

Here are the steps to apply it to an **unseen report**:
- Tokenise the report in the same manner as the reports in the training set.
- Scan the report to find tokens that match the learned dictionary.
- If there is a match, record the start and the end character positions and the concept category. 
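
The steps above can be sketched as follows. This is only one possible implementation, with a simple regex tokeniser and a longest-match-first scan. The tokeniser, the function names, and the example sentence are all assumptions made for illustration.

```python
import re

TOKEN = re.compile(r"\w+|[^\w\s]")  # words, or single punctuation marks

def tokenise(text):
    """Return (lowercased token, start, end) triples."""
    return [(m.group().lower(), m.start(), m.end()) for m in TOKEN.finditer(text)]

def build_index(dictionary):
    """Map each tokenised dictionary phrase to its concept category."""
    index = {}
    for category, phrases in dictionary.items():
        for phrase in phrases:
            key = tuple(tok for tok, _, _ in tokenise(phrase))
            index[key] = category
    return index

def find_matches(text, dictionary):
    """Return (start, end, category) for every dictionary phrase found in text."""
    index = build_index(dictionary)
    max_len = max(len(key) for key in index)
    tokens = tokenise(text)
    matches, i = [], 0
    while i < len(tokens):
        # Prefer the longest phrase that starts at token i
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in index:
                matches.append((tokens[i][1], tokens[i + n - 1][2], index[key]))
                i += n
                break
        else:
            i += 1  # no phrase starts here, move on
    return matches

# Toy dictionary and sentence (not taken from the dataset)
dictionary = {
    "FungalDescriptor": {"fungal elements", "fungal hyphae"},
    "Fungus": {"pneumocystis", "aspergillus"},
    "Stain": {"grocott"},
}
text = "Grocott stain: no fungal elements. No evidence of Pneumocystis."
matches = find_matches(text, dictionary)
found = [(text[start:end], category) for start, end, category in matches]
```

Because the phrases and the report go through the same `tokenise` function, multi-word phrases are matched token by token, and the character positions come straight from the tokeniser.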

### What to expect?
- Poor performance if the **training set is too small**.
- Poor performance if **lexical diversity is high**. This includes **spelling mistakes!**
- Good performance if the **language is consistent** and if **context does not play a big part**.
- **Very easy to interpret**. Good if you want to be able to visualise the workings of your NER and present it to non-tech audiences. 

___
## Conditional random fields (CRF) model

### How does it work?

Let's think a bit about the **context**. 

Some phrases are unambiguous, such as scientific names of fungal species. Others are open to more than one interpretation, for example, the word `organisms` might refer to a fungal, bacterial, or possibly viral infection.
- While the dictionary-based approach might know to recognise `fungal organisms`, a slightly more complex phrase `organisms of fungal origin`, if not seen before, would present a challenge. 
- Same goes for false positives: we would not want to pick up `organisms` if the context it is mentioned in has nothing to do with fungal infection. 

<div class="alert alert-block alert-info">
A CRF model allows us to incorporate <b>contextual features</b> to address this polysemy. The goal here is to augment a given word with attributes describing its position in the text and its spatial (contextual) relation to other words in the document.
</div>

### Which features to include?

In this exercise we included the following word attributes:
- The start and end character positions
- Capitalisation patterns (i.e., if the word starts with a capital letter, is uppercased, lowercased, or has alternating casing)
- Morphologic patterns (i.e., word prefixes and suffixes)
- Numeric and punctuation patterns (i.e., if the word contains any digits, hyphens, etc.)

Other common attributes are: **part-of-speech tags**, **sentence-level position**, **preceding and following words**.
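
To give a feel for what such attributes look like in practice, here is a minimal sketch of a per-token feature function. The feature names, the three-character prefix/suffix length, and the tokeniser are assumptions, not the exact set used in this exercise. The neighbouring-word features come from the list of other common attributes. A list of such dictionaries, one per token, is the kind of input that CRF libraries such as sklearn-crfsuite accept.

```python
import re

def token_features(tokens, i):
    """Attributes for the i-th token; tokens are (text, start, end) triples."""
    word, start, end = tokens[i]
    is_title, is_upper, is_lower = word.istitle(), word.isupper(), word.islower()
    return {
        # Character positions
        "start": start,
        "end": end,
        # Capitalisation patterns
        "is_title": is_title,
        "is_upper": is_upper,
        "is_lower": is_lower,
        "is_mixed_case": word.isalpha() and not (is_title or is_upper or is_lower),
        # Morphological patterns (3 characters is an arbitrary choice)
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        # Numeric and punctuation patterns
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
        "is_punct": not any(ch.isalnum() for ch in word),
        # Neighbouring words
        "prev_word": tokens[i - 1][0].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1][0].lower() if i < len(tokens) - 1 else "<END>",
    }

# Made-up example sentence
text = "GROCOTT stain shows non-septate hyphae, 90 degrees."
tokens = [(m.group(), m.start(), m.end())
          for m in re.finditer(r"[\w-]+|[^\w\s]", text)]
features = [token_features(tokens, i) for i in range(len(tokens))]
```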

### How to apply to new data?
Here are the key steps to apply CRF NER to an **unseen report**:
- Tokenise the report in the same manner as the reports in the training set.
- For each token, compute word attributes, where possible.
- Apply a trained CRF model to make predictions based on the attributes and the predicted label of the preceding word.

### What to expect?

- Good performance when **linguistic structure is consistent**
- Poor performance if the included **attributes are not predictive**
- Poor performance if **attributes are not computable**
- Still quite **easy to interpret**

...Before we go on, let's consider the major levels of linguistic structure:
<div>
<img src="Major levels of linguistic structure.svg" width="500"/>
</div>

[Source: Wikipedia](https://en.wikipedia.org/wiki/Linguistics)

___
## Transformer models

### How does it work?

- Text is split into tokens and subtokens which are mapped (encoded) to numeric vectors.
- Each encoding depends on the word itself and all its neighbours (broader context).
- Attention mechanism helps to focus on the more important words.
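
To illustrate the last two points, here is a toy version of scaled dot-product self-attention in plain Python. It is heavily simplified: real transformer models use learned projection matrices for queries, keys, and values, multiple attention heads, and much larger vectors. The two-dimensional embeddings below are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    for query in vectors:
        # How much should this token attend to every token, itself included?
        weights = softmax([dot(query, key) / scale for key in vectors])
        # The new encoding is a weighted average of all token vectors
        context = [sum(w * v[d] for w, v in zip(weights, vectors))
                   for d in range(len(query))]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["no", "fungal", "hyphae"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.9]]  # made-up 2-d vectors
contextual, weights = self_attention(embeddings)
```

Each row of `weights` says how strongly one token attends to every other token. Here `fungal` puts more weight on `hyphae` than on `no`, simply because their made-up vectors point in similar directions.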

<div class="alert alert-block alert-info">   
Intuitively:
<li>The dictionary-based approach operates on the level of individual words.</li>
<br/>
<li>CRFs start to dip into the syntax by taking into account the closest neighbours.</li>
<br/>
<li>A transformer model allows us to scale out and start inferring the semantics by handling long-range dependencies between words.</li>
</div>

### How to apply to ~~new~~ your data?

When there is a limited amount of annotated data, it is common to fine-tune a pre-trained transformer model rather than training it from scratch.

BERT is one of the most popular architectures, with many flavours pre-trained on different datasets:
- BioBERT (large-scale biomedical corpora)
- PubMedBERT (PubMed articles)
- ClinicalBERT (EHR notes)
- DischargeSummaryBERT (EHR discharge summaries)

<div class="alert alert-block alert-info">   
You'll have to decide which flavours fit your problem best. It might be worth talking to clinicians to narrow down the number of models before stepping into training.
</div>

For both fine-tuning and testing, the key step is to prepare your data in **exactly the same way** as it was done for the initial training.

### What to expect?

- Good performance if the annotated dataset is **generally similar to the data used in pre-training**
- Poor performance if the amount of data is **not enough even for fine-tuning**
- **Non-trivial to explain** the inner workings of a transformer model

___
## Comparison

<div>
<img src="precision on test.png"/>
</div>

<div>
<img src="recall on test.png"/>
</div>

<div class="alert alert-block alert-success">    
<h2>Other tips, considerations, and food for thought</h2>
    <br/>
    <li><b>Negation</b> detection (e.g., <a href="https://www.sciencedirect.com/science/article/pii/S1532046401910299">NegEx by Wendy Chapman</a> and its spaCy implementation <a href="https://spacy.io/universe/project/negspacy">negspacy</a>)</li>
    <br/>
    <li>Equivocal/positive terms: can we implement <b>uncertainty/affirmation</b> detection?</li>
    <br/>
    <li>Other tools for clinical NER:
        <ul>
            <li>Using a knowledge base (e.g., <a href="https://lhncbc.nlm.nih.gov/ii/tools/MetaMap.html">MetaMap</a>)</li>
            <li>Pre-trained CRF/transition-based models (e.g., <a href="https://spacy.io/api/entityrecognizer">spaCy's NER</a> and its derivatives <a href="https://github.com/allenai/scispacy">ScispaCy</a>, <a href="https://spacy.io/universe/project/medspacy">medspaCy</a>)</li>
            <li>Combined rule- and machine learning-based models (e.g., <a href="https://medcat.rosalind.kcl.ac.uk">MedCAT</a>)</li>
        </ul>
    </li>
    <br/>
    <li>How to <b>pre-annotate the data</b>? (e.g., running MetaMap before passing the reports to clinicians)</li>
    <br/>
    <li><b>Relation extraction, concept linking</b></li>
</div>
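
To give a flavour of rule-based negation detection, here is a much-simplified sketch loosely inspired by the idea behind NegEx. It is not the NegEx algorithm itself, which uses a far richer trigger list, handles triggers that follow the concept, and has termination terms. The trigger list, the 40-character window, and the example text are all assumptions.

```python
import re

# Hypothetical trigger list; real systems use many more phrases
NEGATION_TRIGGERS = ["no evidence of", "no"]

def is_negated(text, start, window=40):
    """True if a trigger occurs in the same sentence, within `window` characters before `start`."""
    # Do not look past the previous full stop or line break
    sentence_start = max(text.rfind(".", 0, start), text.rfind("\n", 0, start)) + 1
    left = text[max(sentence_start, start - window):start].lower()
    return any(re.search(r"\b" + re.escape(trigger) + r"\b", left)
               for trigger in NEGATION_TRIGGERS)

# Made-up example text
text = "No evidence of Pneumocystis. The smear contains scattered fungal hyphae."
negated = {phrase: is_negated(text, text.index(phrase))
           for phrase in ["Pneumocystis", "fungal hyphae"]}
```

Combined with the character positions returned by an NER step, a rule like this could flag which detected concepts are negated before a report-level decision is made.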