# NBME Data Exploration

This notebook is an in-depth data exploration of the [NBME - Score Clinical Patient Notes](https://www.kaggle.com/c/nbme-score-clinical-patient-notes/overview) competition. I go through all data sources step by step to get a better understanding of the data available in this competition as well as the prediction problem we are trying to solve.

## Setup

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
import seaborn as sns

from IPython.display import display
from pathlib import Path

In [None]:
input_path = Path("/kaggle/input/nbme-score-clinical-patient-notes")
os.listdir(input_path)

Define helper function for plotting

In [None]:
def plot_case_numbers(df, title, top_y_lim, ylabel="Count", xlabel="Case number"):
    
    fig, ax = plt.subplots()

    bars = ax.bar(range(len(df)), df, align='center')
    ax.set_xticks(range(len(df)))
    ax.set_title(title, fontsize=14)
    ax.set_xlabel(xlabel, fontsize=12)
    ax.set_ylabel(ylabel, fontsize=12)
    ax.bar_label(bars, padding=3)
    ax.set_ylim(top=top_y_lim)
    
    plt.show()

## Load data

In [None]:
train = pd.read_csv(input_path/"train.csv")
test = pd.read_csv(input_path/"test.csv")
ftr = pd.read_csv(input_path/"features.csv")
notes = pd.read_csv(input_path/"patient_notes.csv")
submit = pd.read_csv(input_path/"sample_submission.csv")

## EDA

Look at each dataset to understand what we are dealing with. Dataset descriptions are copied from the competition host: https://www.kaggle.com/c/nbme-score-clinical-patient-notes/data

### Patient notes

Let's start with patient notes, since these are the documents that will be our model's text features in this competition

**patient_notes.csv** - A collection of about 40,000 Patient Note history portions. Only a subset of these have features annotated. You may wish to apply unsupervised learning techniques on the notes without annotations. The patient notes in the test set are not included in the public version of this file.
* `pn_num` - A unique identifier for each patient note.
* `case_num` - A unique identifier for the clinical case a patient note represents.
* `pn_history` - The text of the encounter as recorded by the test taker.

In [None]:
print(f"Patient notes has {notes.shape[0]} rows and {notes.shape[1]} columns")

In [None]:
notes.info()

In [None]:
notes.head()

There are no missing values. The dataset has two identifier columns and a text field `pn_history`

In [None]:
# Default figure size for all plots below
plt.rcParams["figure.figsize"] = (10,6)

In [None]:
plot_case_numbers(notes.case_num.value_counts().sort_index(), "Patient notes by case number", 10500)

There are different numbers of patient notes for each of the ten cases, ranging from 808 to 9753 notes per case number

Let's get an approximate measure of text length by splitting on spaces

In [None]:
notes["text_len"] = notes.pn_history.apply(lambda x: len(x.split(" ")))

In [None]:
notes.text_len.describe()

There are no empty texts (minimum length is 8 words) and the documents are generally quite short (maximum is 223 words). Of course the total number of tokens will differ based on the tokenizer we will use and likely increase, since most tokenizers use sub-word tokenization. But overall this is a good sign, since transformer models such as BERT have a max length of 512 tokens.
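One caveat about this measure: `split(" ")` splits on single space characters only, so words separated by a line break are counted as one and repeated spaces yield empty strings. A quick sketch on two made-up strings (not taken from the dataset) compares it with `split()`, which splits on any run of whitespace:

```python
# Made-up strings for illustration
line_break = "HPI: 17yo M\r\nno chest pain"
double_space = "no  chest pain"

for text in (line_break, double_space):
    by_space = text.split(" ")      # single-space delimiter
    by_whitespace = text.split()    # any run of whitespace
    print(len(by_space), by_space)
    print(len(by_whitespace), by_whitespace)
```

For a rough length estimate the difference is small, but it is worth keeping in mind when comparing these counts to tokenizer output.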

In [None]:
sns.kdeplot(notes.text_len, hue=notes.case_num, palette="tab10")
plt.title("Distribution of text length (number of words)");

The plot above shows the distribution of text lengths by case numbers. There are no strong differences visible except case number 9 (and maybe also 6), which appears to be less skewed than the other case numbers.

Now print a few example texts

In [None]:
for i in range(3):
    # idx is a row position; look up the actual patient note id for the printout
    idx = np.random.randint(len(notes))
    print(f"Patient note {notes.loc[idx, 'pn_num']} for case number {notes.loc[idx, 'case_num']}:\n")
    display(notes.loc[idx, "pn_history"])
    print("\n")

We can see immediately that there are some formatting strings like "\r\n" in the data that we may want to remove. There are also technical terms like "HPI" and "G2P2002" that might be difficult for generally pretrained models to understand.

To sum up, this dataset serves two purposes:
1. input texts for *supervised learning*: we have only 1000 annotated patient notes for the prediction task we want to solve in this competition
2. input texts for *self-supervised pre-training*: we can use the other 41146 patient notes to pre-train or fine-tune an existing model using a self-supervised training strategy, e.g. masked language modeling to give the model a better understanding of medical terms and the writing style of patient notes

Before we move on to train, let's take a quick look at features

### Features

**features.csv** - The rubric of features (or key concepts) for each clinical case.
* `feature_num` - A unique identifier for each feature.
* `case_num` - A unique identifier for each case.
* `feature_text` - A description of the feature.

In [None]:
print(f"Features has {ftr.shape[0]} rows and {ftr.shape[1]} columns")

In [None]:
ftr.info()

There are no missing values. `feature_num` is just an id column, and each `case_num` has its own set of `feature_text` values

In [None]:
ftr[ftr.case_num==0]

In [None]:
plot_case_numbers(ftr.case_num.value_counts().sort_index(), "Features by case number", 20)

As we can see, the number of features varies by case number, ranging from 9 to 18

Similar to above, we will also have a quick look at feature text length (this time split by "-")

In [None]:
ftr["feature_text_len"] = ftr.feature_text.apply(lambda x: len(x.split("-")))

In [None]:
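# String aggregation names avoid the deprecation warning that newer pandas versions raise for np.min / np.max / np.mean
ftr.groupby("case_num").agg({"feature_text_len": ["min", "max", "mean"]}).T.style.background_gradient(axis=1)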

Feature text lengths are short and on average similar.

It is not yet clear to me how we can use these feature descriptions, but maybe we will find a way that can help the model.


In [None]:
ftr.drop("feature_text_len", axis=1, inplace=True)

### Train

**train.csv** - Feature annotations for 1000 of the patient notes, 100 for each of ten cases.
* `id` - Unique identifier for each patient note / feature pair.
* `pn_num` - The patient note annotated in this row.
* `feature_num` - The feature annotated in this row.
* `case_num` - The case to which this patient note belongs.
* `annotation` - The text(s) within a patient note indicating a feature. A feature may be indicated multiple times within a single note.
* `location` - Character spans indicating the location of each annotation within the note. Multiple spans may be needed to represent an annotation, in which case the spans are delimited by a semicolon ;

In [None]:
print(f"Train has {train.shape[0]} rows and {train.shape[1]} columns")

In [None]:
train.info()

There are no missing values in train

In [None]:
train.head()

Column `annotation` contains our annotations in text form while column `location` contains the spans that mark the annotation in the patient note

In [None]:
notes[notes.pn_num == 16].pn_history[16]

In [None]:
notes[notes.pn_num == 16].pn_history[16][696:724]

As expected, applying the span from `location` to the text of the entire patient note yields the same text as in `annotation`. The typo in "attcak" is also present in the full patient note.
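The `location` column is stored as a string rather than a list. A minimal sketch of turning it into character spans, assuming the format seen in `train.head()` (a stringified Python list whose elements are `"start end"` pairs, with `;` separating the pieces of a multi-span annotation). `parse_location` is a hypothetical helper name and the note text is made up:

```python
import ast

def parse_location(location):
    """Parse a string like "['10 14;20 25', '30 32']" into spans.

    Returns one list of (start, end) tuples per annotation.
    """
    spans_per_annotation = []
    for item in ast.literal_eval(location):
        spans = []
        for piece in item.split(";"):
            start, end = piece.split()
            spans.append((int(start), int(end)))
        spans_per_annotation.append(spans)
    return spans_per_annotation

# Toy note text and location (not taken from the dataset)
note = "17yo M with palpitations. Dad with recent heart attack."
location = "['26 54']"

for spans in parse_location(location):
    print([note[start:end] for start, end in spans])
```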

Since we're dealing with spans that are aligned with the original texts we need to be careful when applying modifications to the text, e.g. cleaning formatting strings.


In [None]:
notes[notes.pn_num == 16].pn_history[16].replace("\r\n","")[696:724]

As we can see above, simply replacing "\r\n" will corrupt the annotations, which are no longer aligned with the text
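One possible way around this, sketched here on a made-up note rather than the real data, is to replace each formatting character with a space so that the text keeps its length and every character offset stays valid:

```python
# Toy example (text and span are made up for illustration)
note = "CC: chest pain\r\nHPI: dad with recent heart attack"
target = "dad with recent heart attack"
start = note.index(target)
end = start + len(target)

# Removing "\r\n" shortens the text, so the span no longer lines up
removed = note.replace("\r\n", "")

# Replacing each character with a space keeps the length and the offsets intact
padded = note.replace("\r", " ").replace("\n", " ")

print(repr(removed[start:end]))
print(repr(padded[start:end]))
```

Whether padding with spaces is acceptable depends on the tokenizer; the alternative is to leave the text untouched and rely on the tokenizer's offset mapping.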

Since one patient note has multiple rows, i.e. annotations in `train`, we should look at all annotations for a single patient note

In [None]:
train[train.pn_num == 16]

There are several important observations to make. Each patient note corresponds to a single case number. For each case number we have a number of features (see `ftr` dataframe above). Every row in train describes the annotations for a single feature in a given patient note. Each feature can have zero, one or multiple annotations. So in conclusion, each patient note is described by multiple annotations that appear across the different features in its case number.

In the case of patient note 16, we have 13 features that correspond to case number 0: 
* 3 features do not have any annotations (rows 5, 7 and 8)
* 7 features have a single annotation (rows 0, 1, 2, 4, 10, 11 and 12)
* 2 features have two annotations each (rows 3 and 9)
* 1 feature even has three annotations (row 6)

So this patient note has 14 annotations from 10 features in total.
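The tally for patient note 16 can be checked with a few lines. The per-row counts below are transcribed from the bullet list above (rows 0 to 12), not read from the data:

```python
from collections import Counter

# Annotations per feature row (rows 0..12) for patient note 16, as listed above
n_per_row = [1, 1, 1, 2, 1, 0, 3, 0, 0, 2, 1, 1, 1]

# How many features have 0, 1, 2 or 3 annotations
features_by_count = Counter(n_per_row)
non_empty_features = sum(n > 0 for n in n_per_row)
total_annotations = sum(n_per_row)

print(features_by_count)
print(non_empty_features, total_annotations)
```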

*Side note: One thing we should also be aware of is that one patient note can have multiple identical annotations. An example for this is row 6, which contains 3 instances of "adderall", spelled in two different ways: "adderall" and "adderrall".*

In order to better understand the distribution of annotations in `train` we need to carry out some transformations

How many empty features are there? 

In [None]:
print(f"There are {len(train[train.location == '[]'])} empty features")

Let's drop all empty features for now

In [None]:
train = train[train.location != '[]'].copy().reset_index(drop=True)

In [None]:
print(f"Train has {len(train)} rows after dropping empty features")

Check whether the number of annotated patient notes changed after dropping empty features (this would happen if a patient note had only empty features). There should be 100 per case.

In [None]:
train.drop_duplicates(subset=["case_num", "pn_num"]).case_num.value_counts()

Next, get number of annotations per row

In [None]:
train["n_annotation"] = train.annotation.apply(lambda x: len(x.split(",")))
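Splitting on commas can overcount when an annotation itself contains a comma. A more defensive sketch, assuming `annotation` holds a stringified Python list as shown in `train.head()`, parses the string with `ast.literal_eval` instead. `count_annotations` is a hypothetical helper and the example values are made up:

```python
import ast

def count_annotations(annotation):
    """Count the elements of a stringified list such as "['a', 'b']"."""
    return len(ast.literal_eval(annotation))

# Toy values, one of which contains a comma inside the annotation text
examples = ["['chest pain']", "['nausea, vomiting', 'diarrhea']"]
for value in examples:
    naive = len(value.split(","))
    parsed = count_annotations(value)
    print(f"{value!r}: split gives {naive}, parsing gives {parsed}")
```

I have not checked how many annotations in `train` actually contain commas, so the effect on the numbers below may well be small.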

In [None]:
train[train.pn_num == 16]

Now we can apply `groupby` twice to get the average number of non-empty features and annotations per patient note to see if there are any significant differences across case numbers

In [None]:
plot_case_numbers(pd.DataFrame(train.groupby(["case_num", "pn_num"])["n_annotation"].count()).groupby("case_num")["n_annotation"].mean(), 
                 "Average number of non-empty features per patient note by case number", 13, ylabel="Avg. # features")

The average number of non-empty features per patient note follows largely the same pattern as the total number of features in each case number. This implies that the share of empty features is similar across case numbers.

In [None]:
plot_case_numbers(pd.DataFrame(train.groupby(["case_num", "pn_num"])["n_annotation"].sum()).groupby("case_num")["n_annotation"].mean(), 
                 "Average number of annotations per patient note by case number", 17, ylabel="Avg. # annotations")

The average number of annotations per patient note (remember that one feature can have multiple annotations) follows a similar pattern as above, but the differences are smaller. This implies that while some case numbers (e.g. 4 and 7) have fewer features in total, they must have on average more annotations *per feature*.

Overall, while there is certainly some variation, the distribution of annotations across case numbers is not too imbalanced. 
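The two chained `groupby` calls used for the plots above are compact but easy to misread. A small sketch on made-up rows (the values are not from the competition data) shows the two stages separately: first aggregate per patient note, then average per case.

```python
import pandas as pd

# Made-up rows: one row per non-empty (patient note, feature) pair
toy = pd.DataFrame({
    "case_num":     [0, 0, 0, 0, 0, 1, 1],
    "pn_num":       [16, 16, 16, 41, 41, 500, 500],
    "n_annotation": [1, 2, 3, 1, 1, 2, 2],
})

# Stage 1: per patient note, count non-empty features and sum annotations
per_note = toy.groupby(["case_num", "pn_num"])["n_annotation"].agg(
    n_features="count", n_annotations="sum"
)

# Stage 2: average those per-note numbers within each case
per_case = per_note.groupby("case_num").mean()
print(per_case)
```

Here `count` gives the number of non-empty features because empty rows were dropped beforehand, while `sum` adds up the annotations across those features.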

Next, let's explore the distribution of annotations across features.

In [None]:
train_ftr = pd.DataFrame(train.groupby("feature_num")["n_annotation"].sum()).reset_index()
train_ftr = pd.merge(train_ftr, ftr, how="left", on="feature_num")
train_ftr.head()

In [None]:
train_ftr.n_annotation.describe()

On average there are 87 annotations per feature. Let's plot a histogram to see the distribution in more detail

In [None]:
sns.histplot(train_ftr.n_annotation, bins=30)
plt.title("Histogram of annotations by feature");

As we can see, there are quite a lot of features that don't have many annotations

In [None]:
print(f"Out of all {len(train_ftr)} features:" )
for i in [100, 50, 30, 20, 10]:
    print(f"* {len(train_ftr[train_ftr.n_annotation < i])} features have less than {i} annotations")

Print least common features

In [None]:
train_ftr[train_ftr.n_annotation < 30].sort_values(by="n_annotation")

Here are the most common features with more than 150 annotations each

In [None]:
train_ftr[train_ftr.n_annotation > 150].sort_values(by="n_annotation", ascending=False)

As opposed to the case level (which has a mild imbalance), the feature level is very imbalanced, with many features having fewer than 50 annotations. There are two features with a single observation and another feature with two observations. If we intend to make predictions on the feature level we need to think about how to handle features with such low numbers of annotations.

### Training patient notes

As a final step in EDA, before moving on to test and submission, I would like to plot text length for annotated vs not annotated patient notes, to see if there are any obvious differences. For this I need to go back to the `notes` dataframe and merge it with the ids in `train`

In [None]:
train_notes = pd.DataFrame(train.pn_num.unique(), columns=["pn_num"])
train_notes["in_train"] = True
train_notes.shape

In [None]:
train_notes.head()

In [None]:
notes = pd.merge(notes, train_notes, how="left", on="pn_num")
notes["in_train"] = notes["in_train"].fillna(False)
notes.in_train.value_counts()
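The same flag could also be built without a helper dataframe and a merge. A sketch on toy frames (the ids are made up), using `Series.isin`:

```python
import pandas as pd

# Toy stand-ins for `notes` and `train`
notes_toy = pd.DataFrame({"pn_num": [0, 1, 16, 41, 46]})
train_toy = pd.DataFrame({"pn_num": [16, 16, 41]})

# True where the note id appears at least once in the annotated set
notes_toy["in_train"] = notes_toy["pn_num"].isin(train_toy["pn_num"])
print(notes_toy["in_train"].value_counts())
```

This yields a boolean column directly, so no `fillna` step is needed.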

In [None]:
notes.head()

In [None]:
sns.kdeplot(notes[notes.in_train == True].text_len)
sns.kdeplot(notes[notes.in_train == False].text_len)
plt.legend(["train", "not train"])
plt.title("Distribution of text length (number of words)");

The distribution of text lengths is very similar for annotated and non-annotated patient notes. Now drop the previously added columns

In [None]:
notes = notes.drop(["text_len", "in_train"], axis=1)

### Test and submission

Finally, we need to understand how to submit predictions on the test set. 

From the competition description: *To help you author submission code, we include a few example instances selected from the training set. When your submitted notebook is scored, this example data will be replaced by the actual test data. The patient notes in the test set will be added to the patient_notes.csv file. These patient notes are from the same clinical cases as the patient notes in the training set. There are approximately 2000 patient notes in the test set.*

In [None]:
print(f"Test example has {test.shape[0]} rows and {test.shape[1]} columns")

In [None]:
test

So during scoring, the example test file is replaced with the actual test data, which contains columns `id`, `case_num`, `pn_num` and `feature_num` for "approximately 2000 patient notes". These new patient notes will also be added to `patient_notes.csv`, from where we need to get the text data.

Let's try it using the example above.

In [None]:
# Merge on the pn_num column; DataFrame.join(on=...) would match against the index of `notes` rather than its pn_num column
test = pd.merge(test, notes[["pn_num", "pn_history"]], how="left", on="pn_num")
test

From the test file it becomes obvious that we need to make separate predictions per feature. Hence we will make multiple predictions using the same patient note.

To continue with the example, let's just use the annotations from the training set as predictions

In [None]:
test = pd.merge(test, train.drop(["annotation", "n_annotation"], axis=1), how="left")
test

Check which exact format of spans we need to provide in the submission

In [None]:
submit

Convert spans in test to target format

In [None]:
test["location"] = test.location.apply(lambda x: x.replace("[","").replace("'","").replace("]","").replace(", ",";"))
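The chained `replace` calls work for the format seen here. An alternative sketch, assuming `location` is a stringified Python list of span strings, parses the list and joins its elements with `;`. `to_submission_format` is a hypothetical helper name and the example strings are made up:

```python
import ast

def to_submission_format(location):
    """Convert "['10 14', '20 25;30 32']" into "10 14;20 25;30 32"."""
    return ";".join(ast.literal_eval(location))

# Toy location strings
for location in ["[]", "['696 724']", "['10 14', '20 25;30 32']"]:
    print(repr(to_submission_format(location)))
```

Parsing the list first avoids depending on the exact quoting and spacing of the string representation.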

In [None]:
test = test[submit.columns]
test

Submit predictions

In [None]:
test.to_csv("submission.csv", index=False)

## Important takeaways

**Observations from the data:**
- There are no missing values or duplicates to take care of.
- Patient notes are rather short, so standard transformer models should handle their length easily.
- Text lengths are similar across case numbers and annotation status (= in train vs not in train).
- Patient notes might need some cleaning, but we need to be careful about messing up text-annotation alignment.
- Patient notes contain technical terms, which might be difficult for models pre-trained on more general corpora to handle.
- There are only 1000 annotated patient notes, but more than 40k additional notes without annotations.
- There is some variation in the number of annotations across case numbers, but a strong imbalance across features.

**Implications for modeling:**
- From the test example and sample submission it seems like we have to make predictions on the feature level.
- Since there are 143 features, with many features having fewer than 50 annotations, modeling will be challenging. There are even 4 features with fewer than 10 observations.
- Given the relatively small number of annotations, we can leverage the additional 40k patient notes without annotations for self-supervised pre-training or fine-tuning of an existing model. 
- Patient notes in the test set will be dynamically added to `patient_notes.csv` during submission, i.e. test patient notes won't be part of the pre-training corpus.
- This implies that we should also exclude those patient notes from the pre-training corpus that will be used as validation set. When using cross-validation on the entire training set, which seems like a good idea since there aren't many annotations, we would need to exclude all annotated patient notes from pre-training.
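As a rough sketch of the bookkeeping in the last point, assuming folds are assigned per patient note (the ids, the number of folds and the round-robin assignment are all made up for illustration):

```python
# Toy ids, made up for illustration
all_note_ids = set(range(12))
annotated_ids = [1, 4, 6, 9, 10, 11]

# Hypothetical 3-fold split of the annotated notes (round-robin assignment)
n_folds = 3
folds = [annotated_ids[i::n_folds] for i in range(n_folds)]

# Every annotated note is a validation note in exactly one fold, so a single
# pre-training corpus shared by all folds has to leave out all of them
validation_ids = set().union(*folds)
pretrain_ids = sorted(all_note_ids - validation_ids)
print(pretrain_ids)
```

With a single hold-out split instead of cross-validation, only the held-out notes would have to be removed.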

### Thanks for checking out my notebook. If you find it helpful, please consider upvoting :-)