# Problem Statement

1. Use ML to evaluate a candidate doctor's notes about a test patient (**an actor trained to portray a specific clinical case/disease**). 
2. Usually, experienced doctors (**examiners**) are given these notes along with a rubric (**a scoring guide used to evaluate a candidate's performance**) that outlines the important concepts or features of the clinical case portrayed by the actor/test patient. 

The main aim of this exercise is to see whether the candidate doctor can identify all the important/relevant medical concepts/features related to a specific clinical case. 

Put simply: the more features found in a candidate doctor's note, the higher the score they get. (**This suggests that the candidate has developed a good ability to reach a correct diagnosis**). 

This whole process is part of the **USMLE** exam, which measures a candidate's ability to recognize pertinent clinical facts (**features**) during encounters with standardized patients (**actors**), so that a correct assessment of the patient's possible disease, and thus a correct diagnosis, can be made.

# Challenge for ML :

1. Symptoms/features can be written in a spectrum of ways (e.g. the feature "diminished appetite" can be written as “eating less”, “clothes fit looser”, "lost a lot of weight", "having only 2 meals a day", ...; the feature "17 year old" can be written as "17 yo", "17 year-old", "17 year o M", ...).


2. A particular case's feature might only make sense semantically when non-contiguous pieces of text in the note are combined (e.g. the feature "Stress-due-to-caring-for-elderly-parents" can be written in a note as "Feels very overwhelmed" & "takes care of her mother"; the combination of both sentences conveys that the feature was captured by the candidate).


3. Capture the intent of sentences based on the context of a note in order to map it to a correct feature.


4. Capture medical keywords for correct feature identification.


5. Being robust to spelling mistakes (quite common human error).
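
To make challenge 1 concrete, here is a minimal sketch of a rule-based matcher for the "17 year old" variants listed above. The function name `normalize_age` and the regular expression are illustrative assumptions, not part of the competition data or of any library. A hand-written rule like this covers only one narrow feature, which is why a learned model is attractive for the rest.

```python
import re

# Hypothetical pattern (an assumption for illustration): a number, then a
# "year(s)/yr(s)/y" token, then an "old/o" token, with optional separators.
AGE_PATTERN = re.compile(
    r"\b(\d{1,3})\s*-?\s*(?:years?|yrs?|y)\s*[-/ ]?\s*(?:old|o)\b",
    re.IGNORECASE,
)

def normalize_age(text):
    """Return the age in years mentioned in `text`, or None if no variant matches."""
    match = AGE_PATTERN.search(text)
    return int(match.group(1)) if match else None

# The surface forms quoted in challenge 1.
variants = ["17 year old", "17 yo", "17 year-old", "17 year o M"]
ages = [normalize_age(v) for v in variants]
print(ages)
```

All four variants map to the same age, while a phrase such as "pain for 17 days" is ignored. Features like "diminished appetite" have no such regular surface pattern, so this approach does not generalise to them.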


# Basic AIM :

(**from the competition description**)

To develop an automated method to map clinical concepts/features from an exam rubric (e.g., “diminished appetite”) to various ways in which these concepts are expressed in candidate's notes (e.g., “eating less,” “clothes fit looser”). Great solutions will be both accurate and reliable. The model must be able to find useful information that is pertaining to a feature & ignore all noisy aspects.

In [None]:
# Basic imports
import re
import os
import pickle
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
from pylab import rcParams
import ast
from wordcloud import WordCloud, STOPWORDS
from tqdm import tqdm

**Paths & Constants**

In [None]:
data_dir = "../input/nbme-score-clinical-patient-notes/"

# For code reproducibility
seed = 100
random.seed(seed)
np.random.seed(seed)

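# Plotting constants
# Apply the style first: a style sheet overrides rcParams that were set before it.
# The 'seaborn' style was renamed to 'seaborn-v0_8' in matplotlib 3.6.
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')
rcParams['figure.figsize'] = 7,4
plt.rcParams['axes.grid'] = False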

# Facts about the dataset

1. **Patient Note** : After the conversation with an actor/test patient, the candidate documents all relevant facts of the encounter (the typical note-taking we observe whenever we visit a doctor). The notes contain the patient's complaints, historical facts or maybe the results of some tests taken before.

2. **Feature** : A clinically relevant concept, i.e. the item that each trained physician/examiner searches for in every candidate's patient note. A rubric of concepts describes the key features relevant to each case, so each case in this data has some unique features which are essential for a correct diagnosis. The goal for every candidate is to capture most of these features in their notes.

3. **Clinical Case** : The scenario (e.g., symptoms, complaints, concerns) the actor presents to the candidate. 10 cases are represented in this dataset.


## Items in the dataset :

- **patient_notes.csv** : 
    - About 40k documents, each containing a portion of a patient note (a portion, of course, because not every detail can be revealed!). 
    - Only a subset of these documents have features annotated in them. 
    - These notes could be used for unsupervised learning, e.g. to understand the medical lingo expected in the testing scenario. 
    - Columns in this data :  
        1. **pn_num**     : Unique ID of the patient note.
        2. **case_num**   : Unique ID of the case.
        3. **pn_history** : The note containing all information about the interaction.


- **features.csv** : 
    - The rubric that contains features to be searched by trained doctors for every clinical case. 
    - Columns in this file : 
        1. **feature_num**  : Unique id of the feature/clinical concept. 
        2. **case_num**     : Unique ID of the case.
        3. **feature_text** : Text describing the feature in detail.

- **train.csv** : 
    - Feature annotations for 1000 of the patient notes, 100 for each of the ten clinical cases.
    - Columns in this file : 
        1. **id**          : A unique ID for each patient note / feature pair.
        2. **pn_num**      : Unique ID of the patient note.
        3. **feature_num** : Unique id of the feature/clinical concept.
        4. **case_num**    : Unique ID of the case.
        5. **annotation**  : The text(s) within the patient note that indicate the feature. A note can contain repetitions of one feature.
        6. **location**    : Character spans (start & end index) giving the position of each annotation within the note.

# Focus & Intuitions from Patient Notes

**Facts Gained** : 

1. For every unique patient there is only one note. Therefore the data varies in style & slightly in facts, but has no missing entries. (**All the entries in this csv are unique, therefore we need not care about who the patient or the candidate doctor was.**)


2. Each actor/test patient is given the true scenario with detailed information about a specific case & its clinical concepts/features (**symptoms which the actor has to portray using his/her skills**), but the candidate has no clue about this. 
    - Thus each entry for a specific case in this csv is an estimate of the ground reality (**the complete information given to all actors portraying a specific case**) based on what the candidate doctor understands & how well the actor/test patient performs in front of them (**lol, if the actor forgets a fact, how will the candidate mention it in their note?**) 
    - Every note belonging to a particular case, irrespective of patient num would share quite a lot of similarities because the ground facts of the specific case are fixed. 
    - Also the same 10 clinical cases will be asked in the test set.


3. We don't need to know which candidate wrote which note, but clearly since the data was taken from an exam each candidate would have their own style of writing notes. 
    - Some would prefer mentioning facts in bullet points.
    - Some would prefer paragraphs.
    - Some would use a lot of short forms (**MLH, FH ...**).
    - Some would emphasise correct punctuation.
    - Some would give more weight to certain features. (**mind develops biases in mysterious ways !**)
    
    
4. The beauty of this problem statement is that, once a case is given, the set of features, i.e. **clinical concepts**, to be searched for in each note is fixed. What keeps changing is the **context/patient note** in which we have to search for that set of features. This is because each candidate has their own view of the ground truth & may see a different actor/test patient, who could portray the whole case in a slightly different way.

In [None]:
pat_notes_df = pd.read_csv(data_dir + "patient_notes.csv")

print("Number of patients    : ",pat_notes_df["pn_num"].nunique())
print("Total number of notes : ",len(pat_notes_df))
print("Total number of cases : ",pat_notes_df["case_num"].nunique())

pat_notes_df["text_len"] = pat_notes_df["pn_history"].map(lambda x : len(x))
print("Average note length (number of characters)   : ",pat_notes_df["text_len"].mean())
print("Min text length : %d | Max text length : %d"%(pat_notes_df["text_len"].min(),pat_notes_df["text_len"].max()))
print("Number of missing entries : ",pat_notes_df["pn_history"].isna().sum())

# Print patient notes....
pat_notes_df.head()

**Notes Distribution for each case**

- The labelled dataset is balanced (**100 annotated notes for each case**) but the total number of notes available is clearly skewed.
- One has to be careful when trying to leverage unsupervised learning, because the underlying skewed distribution could lead the model astray by making it learn more of the dominant cases' features/terminology, even though we perform an independent case-wise feature search!
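
One possible way to soften the skew mentioned above is to downsample every case to the size of the smallest one before unsupervised training. The sketch below uses made-up toy notes, and `balance_by_case` is a hypothetical helper name, not something taken from the competition; it is one option among several (per-case weighting would be another), not a recommendation.

```python
import random
from collections import defaultdict

def balance_by_case(notes, seed=100):
    """Downsample every case to the size of the smallest case.

    `notes` is a list of (case_num, note_text) pairs.
    """
    by_case = defaultdict(list)
    for case_num, text in notes:
        by_case[case_num].append(text)
    n_keep = min(len(texts) for texts in by_case.values())
    rng = random.Random(seed)  # local RNG so the global seed is untouched
    return {case: rng.sample(texts, n_keep) for case, texts in by_case.items()}

# Toy corpus (made-up): case 0 has 8 notes, case 1 has only 3.
toy_notes = [(0, f"note {i}") for i in range(8)] + [(1, f"note {i}") for i in range(8, 11)]
balanced = balance_by_case(toy_notes)
sizes = {case: len(texts) for case, texts in balanced.items()}
print(sizes)
```

The trade-off is that downsampling throws away notes from the dominant cases, so it is worth comparing against training on the full, skewed set.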

In [None]:
# Notes distribution per case
case_to_notes_len = {grp.iloc[0]["case_num"] : len(grp) for tmp, grp in pat_notes_df.groupby('case_num')}
plt.bar(case_to_notes_len.keys(), case_to_notes_len.values(), 1, color='r',edgecolor = "black")
plt.ylabel("Number of notes")
plt.yticks(np.arange(0,10000,500))
plt.xlabel("Clinical case number")
plt.xticks(np.arange(0,10))
plt.show()

**Popular Words in each Clinical Case**

 1. The goal of these word clouds, generated by merging all notes case-wise, is to get an idea of the most common words found in each case's notes. PS. in the test setup we will be given the case num, so we can search each case's set of features independently.
 
 2. This also gives us an idea about the ground truth: if most notes capture a term/token, it must carry some weight (since most candidates found it), which is reflected by the size of the term in the word cloud.
 
 3. But this strategy only attends to the frequency of exact words, so it cannot give us a true idea about tokens/words that have similar/same semantic meaning or that differ slightly due to spelling mistakes or short-form writing.
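
The limitation in point 3 can be seen on a toy example. The notes below are made up for illustration; the sketch counts exact tokens with `collections.Counter` and then uses the standard-library `difflib.get_close_matches` to show that a misspelling is counted as a separate word even though it is clearly the same term.

```python
from collections import Counter
import difflib

# Made-up toy notes; the second one misspells "epigastric".
notes = [
    "burning epigastric pain after meals",
    "epigastirc pain, burning in nature",
    "pt reports epigastric discomfort",
]

# Exact-match word frequencies, which is what a word cloud is built from.
counts = Counter(
    word.strip(",.") for note in notes for word in note.lower().split()
)
print(counts["epigastric"], counts["epigastirc"])

# Fuzzy lookup: vocabulary entries whose similarity ratio to the query is >= 0.8.
close = difflib.get_close_matches("epigastric", list(counts), n=5, cutoff=0.8)
print(close)
```

The frequency view splits the term across two vocabulary entries, while the fuzzy lookup groups them. Character-level similarity only helps with spelling variants; it does nothing for semantic paraphrases such as "eating less".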

In [None]:
word_cloud_list = []
for t, group in tqdm(pat_notes_df.groupby('case_num'), position = 0, leave = True):
    # merge all notes of this case into one lower-cased string
    combine_all_group_text = " ".join(group["pn_history"].str.lower())
    
    word_cloud_list.append(WordCloud(width = 800, height = 800,
                                     background_color ='black',
                                     stopwords = set(STOPWORDS),
                                     min_font_size = 10).generate(combine_all_group_text))
    
fig, ax = plt.subplots(nrows=2, ncols=5, figsize=(27, 12))
fig.subplots_adjust(hspace=0.05, wspace = 0.03)
ctr = 0
for r in range(2):
    for c in range(5):
        ax[r,c].imshow(word_cloud_list[ctr])
        ax[r,c].title.set_text('Case {}'.format(ctr))
        ax[r,c].axis("off")
        ctr+=1
plt.show()

**Text Length Distribution**
- Most of the notes are long (**this could be due to newlines, i.e. bullet point style, or the use of special characters**).

- While preprocessing we need to be careful, because the annotations provided in train_df are w.r.t. the original uncleaned text.

- This plot also suggests that we usually have complete information available & few cases of data ending abruptly due to faulty data collection.

- Also, I don't feel whitespace-based token count histograms would be beneficial, because the subword tokenisation schemes in today's transformers break words in their own ways; the character count gives us an upper-bound estimate of the maximum sequence length.
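
The caution about preprocessing can be illustrated on a made-up note: once whitespace is collapsed, character offsets computed on the raw text no longer point at the same words. This is a minimal sketch; the note text and the cleaning rule are assumptions chosen for illustration.

```python
import re

# Made-up raw note with a double space and a Windows-style newline.
raw_note = "HPI:  17 yo M\r\nc/o  chest pain"

# Character span of the annotation, measured on the raw text.
start = raw_note.index("chest pain")
end = start + len("chest pain")

# A typical cleaning step: collapse every whitespace run into one space.
cleaned_note = re.sub(r"\s+", " ", raw_note)

print(repr(raw_note[start:end]))      # span applied to the raw text
print(repr(cleaned_note[start:end]))  # same span applied to the cleaned text
```

On the raw text the span returns the annotation, while on the cleaned text it returns a shifted fragment. Any cleaning therefore has to either be skipped or be accompanied by a remapping of the `location` offsets.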

In [None]:
# note length distribution original
sns.displot(pat_notes_df["text_len"], kde = True, height = 6, aspect = 2)
plt.xticks(np.arange(0, 1000,50))
plt.show()

**Case wise text length distribution** :

- As we have seen, the number of notes differs for each case, so the sample size behind each box & violin plot is different; therefore comparing the plots across cases is futile.

- The median note length for every case is generally between 800 and 1000 characters.

- Most cases have notes of similar length.

- The box plot contains a few outliers, but if we plan to use supervised learning we should look at the plots for the labelled notes.

**Labelled Case** :
- The labelled notes have far fewer outliers when observed case-wise. (The training data is quite clean.)

- The distribution is more uniform, so more variation in length is observed within a specific case's notes.

- Also, comparison across violin & box plots makes sense for this data because the sample size is the same, i.e. 100 annotated notes for each case (balanced training data).

**Note**

The upper whisker is not the same length as the lower one because a whisker only extends to the most extreme data point within 1.5\*IQR of the box (for the upper whisker, the maximum data point between Q3 & Q3 + 1.5\*IQR). In this dataset, for each clinical case, there are no exceptionally long notes, but there certainly are notes that are short compared to the median or the mean!
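
The whisker rule can be checked by hand on a small example. The lengths below are made up to mimic the pattern described in the note (many long notes, one very short note); the sketch assumes the common 1.5\*IQR convention with linearly interpolated quartiles.

```python
import statistics

# Made-up note lengths: mostly long notes plus one very short note.
lengths = [200, 700, 800, 850, 900, 920, 940, 950]

# method="inclusive" interpolates linearly between data points
# (assumed here to match the quartiles a box plot would draw).
q1, median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
iqr = q3 - q1

# Each whisker ends at the most extreme data point inside the 1.5*IQR fence.
upper_whisker = max(x for x in lengths if x <= q3 + 1.5 * iqr)
lower_whisker = min(x for x in lengths if x >= q1 - 1.5 * iqr)
outliers = [x for x in lengths if x < lower_whisker or x > upper_whisker]

print("Q1, Q3, IQR :", q1, q3, iqr)
print("whiskers    :", lower_whisker, upper_whisker)
print("outliers    :", outliers)
```

Here the upper whisker stops at the longest note, only 25 characters above Q3, while the lower whisker reaches 75 characters below Q1 and the very short note falls outside it as an outlier.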

In [None]:
# case wise length distribution
print("")
plt.figure(figsize=(17,10))
plt.suptitle('FULL DATA')
plt.subplot(211)
sns.violinplot(x="case_num", y="text_len", data = pat_notes_df)

plt.subplot(212)
sns.boxplot(x="case_num", y="text_len", data = pat_notes_df)
plt.show()

# case wise length distribution (labelled notes only)
tmp_df = pd.read_csv(data_dir + "train.csv")
labelled_patient_cases = tmp_df["pn_num"].unique()
pat_notes_df["label_status"] = pat_notes_df["pn_num"].isin(labelled_patient_cases)

plt.figure(figsize=(17,10))
plt.suptitle('SUPERVISED DATA')
plt.subplot(211)
sns.violinplot(x="case_num", y="text_len", data = pat_notes_df[pat_notes_df["label_status"] == True])

plt.subplot(212)
sns.boxplot(x="case_num", y="text_len", data = pat_notes_df[pat_notes_df["label_status"] == True])
plt.show()

**As mentioned each candidate has a way of writing notes:**

- One candidate prefers to write the patient's name while others only focus on gender & age.

- Bullet points for terms like PMH (past medical history), FH (family history) etc. are commonly seen in this random sample (change the seed to explore different samples).

- Some candidates prefer to use heavy medical lingo like "burning epigastric pain" etc.

- As mentioned earlier, since all the text entries in this sample belong to the same case, they share the ground truth facts. This clinical case involves a 35 year old male who has some form of stomach pain that burns & does not improve throughout the day.

The thing that makes the task so challenging for NLP-ML is that there is no single way of interpreting the actor's clinical case/concepts/features correctly. We ought to build a robust model which can find each case-specific feature whenever it is present, in any context.

In [None]:
# for reproducibility & for my text box above to make sense
random.seed(15)
few_samples = 4
pick_case_num = random.sample(range(pat_notes_df["case_num"].nunique()),1)[0]
# Visualise few samples from a specific case ....
print("All patient notes belong to the same case ..")
print("Case number : %d"%pick_case_num)
print("----------------------------")

sub_df = pat_notes_df[pat_notes_df["case_num"] == pick_case_num]
random_samples = random.sample(range(len(sub_df)), few_samples)

for cnt,i in enumerate(random_samples):
    print("Text : ",cnt)
    print("----------------------------")
    print(sub_df.iloc[i]["pn_history"])
    print("----------------------------")

# Focus & Intuitions from features

This is the exam rubric which is only available to the expert physician & not to the candidate appearing for the exam. Thus this is the set of ground truth labels for each of the 10 cases which is going to stay constant throughout this whole problem statement even in the test setup. 

Our goal is to pick the set of features for a clinical case (which we receive in the test setup), search for them in the candidate's patient note & find every feature that is present in that note. 

The challenge is that each feature can be:
   - Written multiple times. ( Detecting the location of every such excerpt )
   
   - Written in many different ways. ( Plethora of ways to write the same sentence )
   
   - Expressed by non-contiguous pieces of text that together convey the feature/clinical concept. ( Joining the chunks to make a feature )

In [None]:
features_df  = pd.read_csv(data_dir + "features.csv")

print("Number of unique clinical cases : ",features_df["case_num"].nunique())
print("Number of unique features : ",features_df["feature_num"].nunique())
features_df.head()

**Case Feature distribution**

- This plot is just to get a feel for the number of rows per labelled patient note in train.csv. The labelling is done such that every feature of the case has a row for each labelled patient note: if the feature is present, all its character index (start/stop location) chunks are specified; if absent, the feature is still listed in the csv but its character index list is empty.

- There are at least 9 features for each case in this problem statement.

In [None]:
# exploring the number of features in each clinical case
case_to_features_cnt = {}
for t, case_fet in features_df.groupby('case_num'):
    case_to_features_cnt[case_fet["case_num"].iloc[0]] = case_fet["feature_num"].nunique()

plt.bar(case_to_features_cnt.keys(), case_to_features_cnt.values(), 1, color='r',edgecolor = "black")
plt.xticks(np.arange(0, 10))
plt.yticks(np.arange(0,19))
plt.ylabel("Number of features")
plt.xlabel("Clinical case number")
plt.show()

Just a plot to get the feel for label size. (not important)

In [None]:
# Word token distribution in features 
cleaned_features = features_df["feature_text"].map(lambda x : x.replace("-", " "))
sns.displot([len(i.split()) for i in cleaned_features])

1. Since we get the case num in the test setup, each case's features are searched independently. Some features are repeated across cases, but this should cause no trouble while building the model.

2. Also, no feature is repeated within a case, so there is no problem with the training data.

In [None]:
# just to check whether any feature/concept repeats or not in a particular case
rep_cnt_grp_wise = 0
tmp = features_df['feature_text'].map(lambda x: x.strip().lower())
for t, grp in features_df.groupby('case_num'):
    tmp_grp = grp['feature_text'].map(lambda x: x.strip().lower())
    rep_cnt_grp_wise += len(set(tmp_grp[tmp_grp.duplicated()]))
print("Number of features repeated in a group : ",rep_cnt_grp_wise)

# This is not a problem because once a clinical case is given we don't need to worry about the other features
print("Number of unique features overall : ",tmp.nunique())
print("Duplicate features are : ")
set(tmp[tmp.duplicated()])

# Focus & Intuitions from train file


As mentioned in the data section, we have 100 annotated notes for each of the 10 clinical cases, therefore we have a total of 1000 labelled notes for our overall problem statement. **While splitting the data into train & validation sets, keep in mind to use "*case_num*" for stratified splits so that every case is equally represented in both sets.**
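
A stratified split by case can be sketched in a few lines. The version below splits at the note level (so that all feature rows of one note land on the same side), which is an assumption on top of the text above; the helper name `stratified_note_split` and the toy ids are hypothetical.

```python
import random
from collections import defaultdict

def stratified_note_split(note_to_case, val_fraction=0.2, seed=100):
    """Split note ids into train/val so every case keeps the same validation share.

    `note_to_case` maps a note id (pn_num) to its case_num.
    """
    by_case = defaultdict(list)
    for pn_num, case_num in note_to_case.items():
        by_case[case_num].append(pn_num)

    rng = random.Random(seed)
    train_ids, val_ids = [], []
    for case_num in sorted(by_case):
        ids = sorted(by_case[case_num])
        rng.shuffle(ids)
        n_val = round(len(ids) * val_fraction)
        val_ids.extend(ids[:n_val])
        train_ids.extend(ids[n_val:])
    return train_ids, val_ids

# Toy data: 3 cases with 10 notes each (made-up ids 0..29).
toy_note_to_case = {pn_num: pn_num // 10 for pn_num in range(30)}
train_ids, val_ids = stratified_note_split(toy_note_to_case)
print(len(train_ids), len(val_ids))
```

With the real data, `note_to_case` could be built from the `pn_num` and `case_num` columns of train.csv; a library splitter such as scikit-learn's stratified utilities would be an alternative to this hand-rolled version.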

After observing the data, the 2 trickiest aspects are :
1. Multiple spans of the same feature in the patient note. The index for this case is specified as feature_1 -> [(start_1, end_1),(start_2, end_2),(start_3, end_3)]. Simple start/stop indices to show multiple excerpts.

2. Trying to link different, non-contiguous spans to one feature. The index for this case is specified as feature_1 -> [(start_1_1, end_1_1 ; start_1_2, end_1_2) ....]. Semicolons are used to separate the disjoint chunks that together map to a feature.

In [None]:
# if len of any tuple is > 2 then it is one annotation made of separated chunks, in (start_1, end_1, start_2, end_2, ...) format
def order_func(x):
    coll_list = []
    for i in x:
        i = i.strip()
        if ";" in i:
            tmp = []
            set_list = i.split(";")
            for j in set_list:
                j = j.strip()
                tmp.extend((int(j.split()[0]), int(j.split()[-1])))
            coll_list.append(tuple(tmp))
        else:
            coll_list.append((int(i.split()[0]), int(i.split()[-1])))
    
    return coll_list

- A helper function is used to read the list of indices, for both the multiple-repetitions & multiple-chunks cases, as tuples of start & end. 

- Also, each feature's text description is added to the annotation csv to get a true idea of the feature to search for.
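
As a sanity check of the span format, the sketch below decodes location strings on a made-up note and slices out the annotated text. `extract_spans` is a hypothetical stand-alone helper written for this illustration (it parses the strings directly rather than reusing the helper above), and the toy note reuses the "overwhelmed / takes care of her mother" example from the challenge list.

```python
def extract_spans(note, location):
    """Return the annotated text for each entry of `location`.

    Each entry is a string such as "54 62" (one span) or "0 22;28 52"
    (disjoint chunks of one annotation, separated by semicolons).
    """
    texts = []
    for entry in location:
        chunks = []
        for part in entry.split(";"):
            start, end = map(int, part.split())
            chunks.append(note[start:end])
        texts.append(" + ".join(chunks))  # join disjoint chunks for display
    return texts

# Made-up note and offsets, for illustration only.
toy_note = "Feels very overwhelmed. She takes care of her mother. No fever."
toy_location = ["0 22;28 52", "54 62"]

spans = extract_spans(toy_note, toy_location)
print(spans)
```

The first entry combines two non-contiguous chunks into one annotation, while the second is a plain single span. Note that the end index is exclusive, as in Python slicing.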

In [None]:
train_df = pd.read_csv(data_dir + "train.csv")
print("Number of patients for which we have annotations : ",len(train_df.groupby('pn_num')))

# also for every patient we exactly have the same number of entries as there are features belonging to that case.
mismatch = 0
for t, grp in train_df.groupby('pn_num'):
    if len(grp) != case_to_features_cnt[grp.iloc[0]["case_num"]]:
        mismatch+=1
        
print("Number of patients with missing features         :",mismatch)
# since the data is present in the df as string list
train_df["location"] = train_df["location"].map(ast.literal_eval).map(order_func)
train_df["annotation"] = train_df["annotation"].map(ast.literal_eval)
# look up each feature's text once instead of filtering features_df for every row
feature_num_to_text = features_df.drop_duplicates("feature_num").set_index("feature_num")["feature_text"]
train_df["feature_text"] = train_df["feature_num"].map(feature_num_to_text)
train_df

**Case-wise feature repetition distribution**

The goal of this set of plots is to get an idea of how often each single feature is annotated in our data.

1. For almost all the cases there is clearly a skew in feature presence. It could be that recognising some features was quite hard & only the best candidates caught them.
2. For cases 2, 4, 8 & 9 there is 1 feature whose presence in the labels is extremely small.

In [None]:
# case wise plot

fig, ax = plt.subplots(nrows=10, ncols=1, figsize=(18, 70))
fig.subplots_adjust(hspace=0.2)
i = 0
for t, grp in train_df.groupby('case_num'):
    feature_to_count = {}
    feature_num_to_text = {}
    for t_j, fet_grp in grp.groupby('feature_num'):
        feature_to_count[fet_grp.iloc[0]['feature_num']] = 0
        if len(fet_grp.iloc[0]['feature_text'].split("-")) > 5:
            feature_num_to_text[fet_grp.iloc[0]['feature_num']] = "-".join(fet_grp.iloc[0]['feature_text'].split("-")[0:5])
        else:
            feature_num_to_text[fet_grp.iloc[0]['feature_num']] = fet_grp.iloc[0]['feature_text']
        for ctr in range(len(fet_grp)):
            feature_to_count[fet_grp.iloc[0]['feature_num']] += len(fet_grp.iloc[ctr]['annotation'])
    

    ax[i].set_title('Case {}'.format(grp.iloc[0]['case_num']), fontweight ="bold")
    ax[i].bar(feature_to_count.keys(), feature_to_count.values(), 1, color='r',edgecolor = "black")
    ax[i].set_ylabel("Label Count")
    ax[i].set_xticklabels([])
    
    for index, value in enumerate(feature_to_count):
        ax[i].text(value, index,str(feature_num_to_text[value]) + " = ( " + str(feature_to_count[value]) + " )", rotation = 90)
 
    i+=1
plt.show()

Scatter plot to get an idea about the 

To get a feel for full text vs the present & absent features.

In [None]:
# Random Viz
random.seed(40)
unq_patients = random.sample(list(train_df["pn_num"].unique()),1)[0]
patient_note = pat_notes_df[pat_notes_df["pn_num"] == unq_patients]["pn_history"].iloc[0]
print("Case number :",pat_notes_df[pat_notes_df["pn_num"] == unq_patients]["case_num"].iloc[0])

print("--------------------------------------")
print("Original Text ........")
print(patient_note)
print("--------------------------------------")

print("Labelled features ...... {feature : text}")
tmp_df = train_df[train_df["pn_num"] == unq_patients]
for i in range(len(tmp_df)):
    location_list = tmp_df.iloc[i]["location"]
    print("--------------------------------------")
    print("Feature : ",tmp_df.iloc[i]["feature_text"])
    if len(location_list) == 0:
        print("Value   :  Not found")
        
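    for j in location_list:
        # j holds (start, end) for one span, or (start_1, end_1, start_2, end_2, ...) for disjoint chunks
        chunks = [patient_note[j[k]:j[k+1]] for k in range(0, len(j), 2)]
        # join with " + " so that no separator is left dangling after the last chunk
        print("Value   :  %s"%(" + ".join(chunks)))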
    print("--------------------------------------")

In [None]:
train_df[train_df["pn_num"] == unq_patients]