## NBME - Score Clinical Patient Notes

### Identify Key Phrases in Patient Notes from Medical Licensing Exams
![](https://storage.googleapis.com/kaggle-competitions/kaggle/33607/logos/header.png?t=2022-01-24-18-05-41)

## Data description:

### Important Terms
Clinical Case: The scenario (e.g., symptoms, complaints, concerns) the Standardized Patient presents to the test taker (medical student, resident or physician). Ten clinical cases are represented in this dataset.

Patient Note: Text detailing important information related by the patient during the encounter (physical exam and interview).

Feature: A clinically relevant concept. A rubric describes the key concepts relevant to each case.

### Training Data
>- patient_notes.csv - A collection of about 40,000 Patient Note history portions. Only a subset of these have features annotated. You may wish to apply unsupervised learning techniques on the notes without annotations. The patient notes in the test set are not included in the public version of this file.
>- pn_num - A unique identifier for each patient note.
>- case_num - A unique identifier for the clinical case a patient note represents.
>- pn_history - The text of the encounter as recorded by the test taker.

### features.csv - The rubric of features (or key concepts) for each clinical case.
>- feature_num - A unique identifier for each feature.
>- case_num - A unique identifier for each case.
>- feature_text - A description of the feature.

### train.csv - Feature annotations for 1000 of the patient notes, 100 for each of ten cases.
>- id - Unique identifier for each patient note / feature pair.
>- pn_num - The patient note annotated in this row.
>- feature_num - The feature annotated in this row.
>- case_num - The case to which this patient note belongs.
>- annotation - The text(s) within a patient note indicating a feature. A feature may be indicated multiple times within a single note.
>- location - Character spans indicating the location of each annotation within the note. Multiple spans may be needed to represent an annotation, in which case the spans are delimited by a semicolon ;.
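
The `location` format described above could be parsed along the following lines. This is a minimal sketch: `parse_location` is a hypothetical helper name, the example note is made up for illustration, and the code assumes each cell holds a Python-style list literal of strings such as `['0 2;13 22']`.

```python
import ast

def parse_location(location):
    """Turn a location cell such as "['0 2;13 22']" into [(0, 2), (13, 22)]."""
    spans = []
    for entry in ast.literal_eval(location):   # parse the list literal without eval
        for part in entry.split(';'):           # several spans may share one entry
            start, end = part.split()
            spans.append((int(start), int(end)))
    return spans

# Made-up note and location, for illustration only
note = "no weight or hair loss"
spans = parse_location("['0 2;13 22']")
print(spans)                                      # [(0, 2), (13, 22)]
print(' '.join(note[s:e] for s, e in spans))      # no hair loss
```

Joining the text of each span recovers an annotation even when its words are not adjacent in the note.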

## Competition evaluation metrics: 
This competition is evaluated by a micro-averaged F1 score.
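
To make the metric concrete, here is a rough sketch of a micro-averaged F1. It assumes the score is computed over character indices, with true positives, false positives and false negatives pooled across all rows before the F1 is taken. `micro_f1` and the toy spans are illustrative, not the official scoring code.

```python
def micro_f1(true_spans, pred_spans):
    """Micro-averaged F1 over character indices.

    Each argument holds one list of (start, end) spans per note/feature row.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_spans, pred_spans):
        true_chars = {i for s, e in truth for i in range(s, e)}
        pred_chars = {i for s, e in pred for i in range(s, e)}
        tp += len(true_chars & pred_chars)   # characters correctly predicted
        fp += len(pred_chars - true_chars)   # predicted but not annotated
        fn += len(true_chars - pred_chars)   # annotated but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy example: two rows, counts pooled before computing F1
truth = [[(0, 4)], [(10, 14), (20, 22)]]
pred = [[(0, 2)], [(10, 14), (20, 24)]]
print(micro_f1(truth, pred))  # tp=8, fp=2, fn=2 -> 0.8
```

Because the counts are pooled, rows with long annotations weigh more than rows with short ones.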

## Import packages

In [None]:
# Install packages

In [None]:
# Import packages
import pandas as pd
import numpy as np
from termcolor import colored
import ipywidgets as widgets 
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

## Load data

In [None]:
features_data = pd.read_csv('../input/nbme-score-clinical-patient-notes/features.csv')
patient_notes_data = pd.read_csv('../input/nbme-score-clinical-patient-notes/patient_notes.csv')
sample_submission = pd.read_csv('../input/nbme-score-clinical-patient-notes/sample_submission.csv')
test_data = pd.read_csv('../input/nbme-score-clinical-patient-notes/test.csv')
train_data = pd.read_csv('../input/nbme-score-clinical-patient-notes/train.csv')

## EDA

In [None]:
print(features_data.shape)
features_data.head()

In [None]:
i = 0
print(f'Feature {i}:', features_data[features_data['feature_num']==i]['feature_text'].iloc[0])

In [None]:
features_data.groupby('case_num').agg({'feature_num':'nunique'}).reset_index().rename(columns={'feature_num':'# Features'})

In [None]:
print(patient_notes_data.shape)
patient_notes_data.head()

In [None]:
import ast
from itertools import groupby

note_id = 1613 # Specify any patient note number in train data
print('\nFor note id:', note_id,'\n')

print('\nPatient note:\n')
temp = train_data[train_data['pn_num']==note_id]
text = patient_notes_data[patient_notes_data['pn_num']==note_id]['pn_history'].iloc[0]

# Mark every annotated character; one location entry may hold several spans separated by ';'
highlight = [False] * len(text)
for location in temp['location']:
    for entry in ast.literal_eval(location):  # safer than eval for list literals
        for span in entry.split(';'):
            start, end = (int(v) for v in span.split())
            for pos in range(start, min(end, len(text))):
                highlight[pos] = True

# Colour each run of marked characters once, so overlapping spans and repeated phrases stay intact
pieces, pos = [], 0
for marked, run in groupby(highlight):
    n = len(list(run))
    chunk = text[pos:pos + n]
    pieces.append(colored(chunk, 'red', 'on_white', attrs=['bold']) if marked else chunk)
    pos += n
formatted_text = ''.join(pieces)

print(formatted_text)

print('\nTrain data records:\n')
display(temp)

Some annotated phrases, such as "no hair loss", are made up of words that are not adjacent in the note.

In [None]:
patient_notes_data.groupby('case_num').agg({'pn_num':'nunique'}).reset_index().rename(columns={'pn_num':'# Patient notes'})

In [None]:
print(train_data.shape)
print('# Patient notes:', train_data['pn_num'].nunique())
train_data.head()

In [None]:
print('Max number of features in a patient note:', train_data.groupby(['pn_num']).agg({'feature_num':'nunique'}).reset_index()['feature_num'].max())

In [None]:
print(test_data.shape)
test_data.head()

The test set currently contains examples from the training set; these will be replaced when the submission is scored.

In [None]:
features_data

In [None]:
print(sample_submission.shape)
sample_submission.head()

### Distribution by length

In [None]:
patient_notes_data['pn_history length'] = patient_notes_data['pn_history'].apply(lambda x: len(x.split(' ')))

In [None]:
display(patient_notes_data['pn_history length'].describe().reset_index().T)
fig = px.histogram(patient_notes_data, x='pn_history length', title = 'Distribution of patient note length')
fig.show()

The word count of every patient note is below 250.

## Modelling

### NER

In [None]:
def tokenize_sentence(x):
    "takes in a string and returns a list of lowercased tokens split on single spaces, with empty tokens dropped"

    return [tok for tok in x.strip().lower().split(" ") if len(tok) > 0]

In [None]:
def get_tags(sent, ep):
    '''
    Input: sent is a sentence given as a list of tokens; ep is a list of phrase
           strings (not tokenized), in the order in which they occur in sent
    Output: one BIO tag per token ('B' begins a phrase, 'I' continues it, 'O' is outside)
    '''
    sent = [x.lower() for x in sent]
    # A non-list ep (e.g. NaN) means there is nothing to tag
    phrases = [tokenize_sentence(x) for x in ep] if isinstance(ep, list) else []
    i = 0
    tag = []
    while (i < len(sent)) and (len(phrases) > 0):
        phrase = phrases[0]
        if (len(phrase) > 0) and (phrase == sent[i:i + len(phrase)]):
            # The next phrase starts at token i: tag it, then move on to the following phrase
            tag = tag + ['B'] + ['I'] * (len(phrase) - 1)
            i = i + len(phrase)
            phrases = phrases[1:]
        else:
            tag.append('O')
            i = i + 1

    # Tokens after the last matched phrase are outside any phrase
    tag = tag + ['O'] * (len(sent) - len(tag))

    return tag
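
Since `train.csv` also provides character spans in `location`, the tags could instead be derived from those spans rather than by matching phrase strings. The sketch below is one possible approach under stated assumptions: `tags_from_spans` is a hypothetical helper, tokens are split on whitespace, and the example text and spans are made up.

```python
import re

def tags_from_spans(text, spans):
    """Whitespace-tokenize text and return (tokens, BIO tags) from character spans."""
    tokens, tags = [], []
    for match in re.finditer(r'\S+', text):
        tok_start, tok_end = match.span()
        tag = 'O'
        for start, end in spans:
            if tok_start < end and tok_end > start:   # token overlaps this span
                # The token covering the span start gets 'B'; later ones get 'I'
                tag = 'B' if tok_start <= start else 'I'
                break
        tokens.append(match.group().lower())
        tags.append(tag)
    return tokens, tags

# Made-up example: spans cover "No" and "hair loss"
tokens, tags = tags_from_spans("No weight or hair loss", [(0, 2), (13, 22)])
print(list(zip(tokens, tags)))
# [('no', 'B'), ('weight', 'O'), ('or', 'O'), ('hair', 'B'), ('loss', 'I')]
```

Working from character offsets avoids ambiguity when the same phrase occurs more than once in a note, at the cost of depending on how tokens line up with the annotated spans.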

<div align="center"><img src="https://www.pngall.com/wp-content/uploads/2018/04/Under-Construction-PNG-File.png" width=600>