In [1]:
import numpy as np
import pandas as pd
import utils

___
# Get to know the CHIFIR dataset

The corpus of Cytology and Histopathology Invasive Fungal Infection Reports (CHIFIR) is available at [PhysioNet](https://physionet.org/content/corpus-fungal-infections/1.0.0/). Since these are medical reports and can contain sensitive information, the dataset can only be accessed by credentialed users who have signed the Data Use Agreement.


**Background**

Cytology and histopathology reports are a common type of clinical documentation. These are free-text reports, written by pathologists, that outline the macroscopic and microscopic structure of a specimen. Depending on the sample and what it contains, a report might describe its overall structure, which types of cells or tissue can be seen, and any pathological observations. In other words, the information contained in a report can vary a lot and depends directly on the patient's medical condition.

In the case of CHIFIR, the corpus was created to support the development of an automated tool for detecting invasive fungal infection (IFI). IFIs are rare but serious infections that most commonly affect immunocompromised and critically ill patients. Traditionally, IFI surveillance is a laborious process that requires a physician to review a patient's medical history in detail. Histopathology reports play a key role because they provide evidence, albeit not with 100% certainty, for the presence or absence of IFI.


**Aim**

As mentioned above, the final goal is to build a tool that can accurately detect IFI based on a patient's medical history. Part of this is to be able to tell if any associated histopathology reports contain any evidence for IFI. This can be done in two steps:
- Extract any relevant information from a report, e.g., phrases describing fungal organisms.
- Based on this information, classify a report into positive or negative for IFI.
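
The two steps can be sketched in a few lines of Python. This is a toy illustration only, not the method used for CHIFIR: the term list `FUNGAL_TERMS`, the cue list `NEGATION_CUES`, and the helpers `extract_concepts` and `classify_report` are hypothetical names, and the example sentences are made up rather than taken from the corpus.

```python
import re

# Hypothetical mini-dictionary of fungal terms (illustration only)
FUNGAL_TERMS = ["fungal hyphae", "hyphae", "yeast", "aspergillus", "candida"]
# Hypothetical negation cues (illustration only)
NEGATION_CUES = ["no", "not", "without", "negative for"]

# Longest terms first, so that "fungal hyphae" wins over "hyphae"
_TERM_PATTERN = re.compile(
    r"\b(?:%s)\b"
    % "|".join(re.escape(t) for t in sorted(FUNGAL_TERMS, key=len, reverse=True))
)


def extract_concepts(text):
    """Step 1: return (term, start, end, negated) tuples found in the text."""
    lowered = text.lower()
    concepts = []
    for match in _TERM_PATTERN.finditer(lowered):
        # Look for a negation cue in a short window before the term
        window = lowered[max(0, match.start() - 30):match.start()]
        negated = any(
            re.search(r"\b%s\b" % re.escape(cue), window) for cue in NEGATION_CUES
        )
        concepts.append((match.group(), match.start(), match.end(), negated))
    return concepts


def classify_report(text):
    """Step 2: positive if at least one non-negated fungal term was extracted."""
    concepts = extract_concepts(text)
    return "Positive" if any(not negated for *_, negated in concepts) else "Negative"


print(classify_report("Septate fungal hyphae are seen invading the tissue."))
print(classify_report("No fungal elements or hyphae identified."))
```

The fixed 30-character window is a crude stand-in for proper negation handling: it ignores sentence boundaries and can miss cues that sit further away from the term.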

Since the reports are free-text, we might need to use text analytics and natural language processing (NLP) methods. But first let's take a look at the data...

___
# Explore the CHIFIR dataset
### Metadata

In [2]:
# Load a csv file with report metadata
df = pd.read_csv("../../../Data/CHIFIR/chifir_metadata.csv")
print(df.shape)
df.head()

(283, 6)


   histopathology_id  patient_id  report_no  y_report      dataset  val_fold
0                658          13          1  Positive  development      10.0
1                189          14          1  Positive  development       7.0
2                529          28          1  Negative  development       8.0
3                325          28          2  Positive  development       8.0
4                559          28          3  Negative  development       8.0


In [3]:
# What does this file contain?
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 283 entries, 0 to 282
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   histopathology_id  283 non-null    int64  
 1   patient_id         283 non-null    int64  
 2   report_no          283 non-null    int64  
 3   y_report           283 non-null    object 
 4   dataset            283 non-null    object 
 5   val_fold           231 non-null    float64
dtypes: float64(1), int64(3), object(2)
memory usage: 13.4+ KB


In [4]:
# Number of patients
df.patient_id.nunique()

201

In [5]:
# Number of reports per patient
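# Use string names for the aggregations; passing the builtins min/max is deprecated in recent pandas
df.groupby('patient_id').size().aggregate(['min', 'max'])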

min    1
max    6
dtype: int64

In [6]:
# Report-level annotations
df.y_report.value_counts()

Negative    243
Positive     40
Name: y_report, dtype: int64

In [7]:
# Proportion of positive reports
df.y_report.value_counts(normalize=True).round(2)

Negative    0.86
Positive    0.14
Name: y_report, dtype: float64
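
With only 14% positive reports, plain accuracy can be misleading. As a quick worked example using the counts printed above, a trivial classifier that labels every report as negative would already look accurate while finding no positive report at all. The variable names below are illustrative.

```python
# Counts taken from the y_report value_counts() output above
negative, positive = 243, 40
total = negative + positive

# A trivial baseline that predicts "Negative" for every report
baseline_accuracy = negative / total   # share of reports it labels correctly
baseline_recall = 0 / positive         # share of positive reports it finds

print(round(baseline_accuracy, 2), baseline_recall)
```

For this reason, metrics that focus on the positive class, such as precision and recall, are likely to be more informative here than accuracy alone.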

In [8]:
# Recommended data split: development and test sets
df.dataset.value_counts()

development    231
test            52
Name: dataset, dtype: int64

In [9]:
# Recommended data split: 10-fold cross-validation
df[df.dataset=='development'].val_fold.value_counts().sort_index()

1.0     21
2.0     19
3.0     19
4.0     30
5.0     19
6.0     26
7.0     29
8.0     29
9.0     18
10.0    21
Name: val_fold, dtype: int64
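
One way to use the predefined `val_fold` column is to hold out one fold at a time. The sketch below runs on a small made-up frame that reuses the column names of the metadata file; `iter_folds` is a hypothetical helper, not part of the dataset tooling. The first rows shown earlier place all three reports from patient 28 in fold 8, which suggests that folds were assigned per patient, so the loop also checks that no patient appears on both sides of a split.

```python
import pandas as pd


def iter_folds(df):
    """Yield (fold, train, valid) frames based on the predefined val_fold column."""
    dev = df[df["dataset"] == "development"]
    for fold in sorted(dev["val_fold"].unique()):
        valid = dev[dev["val_fold"] == fold]
        train = dev[dev["val_fold"] != fold]
        yield int(fold), train, valid


# Toy data with the same column names as the metadata file (values are made up)
toy = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 4, 5],
    "y_report": ["Positive", "Negative", "Negative", "Negative", "Positive", "Negative"],
    "dataset": ["development"] * 5 + ["test"],
    "val_fold": [1.0, 1.0, 2.0, 2.0, 3.0, None],  # test reports have no fold
})

for fold, train, valid in iter_folds(toy):
    # Patients whose reports ended up in both the training and validation part
    overlap = set(train["patient_id"]) & set(valid["patient_id"])
    print(fold, len(train), len(valid), overlap)
```

On the real metadata, the same loop can be run with `df` in place of `toy`; a non-empty `overlap` would indicate that reports from one patient are split across folds.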

### Reports

### Annotated concepts

### Considerations when doing NER
- Small dataset
- High lexical diversity
- Very specific/narrow subject

___
# NER

### Dictionary-based
- How does it work? What to expect?
- Results
- Pros & Cons

### CRF model
- How does it work? What to expect?
- Results
- Pros & Cons

### Transformer-based
- How does it work? What to expect?
- Results
- Pros & Cons