# NBME - Score Clinical Patient Notes 

![](https://images.unsplash.com/photo-1576091160550-2173dba999ef?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=800&q=80)  
**Task**: Predict the location of the annotation (feature) in each patient note.



In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
from wordcloud import WordCloud, STOPWORDS

In [None]:
DIR = '/kaggle/input/nbme-score-clinical-patient-notes'

df_notes = pd.read_csv(f'{DIR}/patient_notes.csv')
df_features = pd.read_csv(f'{DIR}/features.csv')
df_train = pd.read_csv(f'{DIR}/train.csv')
df_test = pd.read_csv(f'{DIR}/test.csv')

# Important Terms

- **Clinical Case**: The scenario (e.g., symptoms, complaints, concerns) the Standardized Patient presents to the test taker (medical student, resident or physician). **Ten** clinical cases are represented in this dataset.
- **Patient Note**: Text detailing important information related by the patient during the encounter (physical exam and interview).
- **Feature**: A clinically relevant concept. A rubric describes the key concepts relevant to each case.

# Patient's Notes

- A collection of about 40,000 Patient Note history portions. 
- Only a subset of these have features annotated. 
- You may wish to apply unsupervised learning techniques on the notes without annotations. 
- The patient notes in the test set are not included in the public version of this file.
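
To set aside the notes without annotations (the candidates for unsupervised techniques), one option is to filter on `pn_num`. The following is a minimal sketch, assuming the annotated notes are exactly those whose `pn_num` appears in `train.csv`. The helper name `split_annotated` is hypothetical, and the toy frames are made-up stand-ins for `df_notes` and `df_train`.

```python
import pandas as pd

def split_annotated(notes: pd.DataFrame, train: pd.DataFrame):
    """Split notes into (annotated, unannotated) by whether pn_num occurs in train."""
    mask = notes['pn_num'].isin(train['pn_num'].unique())
    return notes[mask], notes[~mask]

# Toy stand-ins for df_notes and df_train (values are invented for illustration).
toy_notes = pd.DataFrame({
    'pn_num': [1, 2, 3, 4],
    'case_num': [0, 0, 1, 1],
    'pn_history': ['note a', 'note b', 'note c', 'note d'],
})
toy_train = pd.DataFrame({'pn_num': [2, 2, 3], 'feature_num': [0, 1, 100]})

annotated, unannotated = split_annotated(toy_notes, toy_train)
print(f'{len(annotated)} annotated, {len(unannotated)} unannotated')
```

With the real data, the same call would be `split_annotated(df_notes, df_train)`.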


- **pn_num** - A unique identifier for each patient note.  
- **case_num** - A unique identifier for the clinical case a patient note represents.  
- **pn_history** - The text of the encounter as recorded by the test taker.  


- Each clinical case (`case_num`) has multiple patient notes (`pn_num`)

In [None]:
case_number = 0

# Select the patient notes that belong to the chosen case
case_notes = df_notes.loc[df_notes['case_num'] == case_number]
print(f"Case Number {case_number} has {case_notes.shape[0]} patient notes")
print(f'Total Case Numbers: {df_notes.case_num.nunique()}')

In [None]:
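# Count patient notes per case; x and y are set explicitly so only the counts are drawn as bars
df_notes\
    .groupby('case_num', as_index = False)\
    .agg({'pn_num':'count'})\
    .plot(kind = 'bar', x = 'case_num', y = 'pn_num', xlabel = 'case_num', ylabel = 'Number of Patient Notes', title = 'Number of Patient Notes Per Case', legend = False)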

### Patient Notes

In [None]:
print (f'Sample of Patient Notes for Case Number: {case_number}')
df_notes.loc[df_notes['case_num'] == case_number, 'pn_history'].head(3).values

### Word Cloud

In [None]:
case_number = 0

stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, background_color="white")
text = ' '.join(df_notes.loc[df_notes['case_num'] == case_number, 'pn_history'].values.tolist())
wordcloud.generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
case_number = 1

stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, background_color="white")
text = ' '.join(df_notes.loc[df_notes['case_num'] == case_number, 'pn_history'].values.tolist())
wordcloud.generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
case_number = 2

stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, background_color="white")
text = ' '.join(df_notes.loc[df_notes['case_num'] == case_number, 'pn_history'].values.tolist())
wordcloud.generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

### Word Count

In [None]:
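# Split on whitespace and common separators (raw string avoids invalid-escape warnings);
# empty tokens produced by adjacent separators are not counted
df_notes['word_count'] = df_notes['pn_history'].apply(lambda x: len([w for w in re.split(r'\s|; |, |\*|\n|-', x) if w]))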

In [None]:
fig, ax = plt.subplots()

# Draw on the axes created above instead of rebinding ax
sns.boxplot(data = df_notes, x = 'case_num', y = 'word_count', ax = ax)
ax.set_ylabel('Number of Words')
ax.set_xlabel('Case Number')
ax.set_title('Word Count Per Patient Note')

# Features
The rubric of features (or key concepts) for each clinical case.


- **feature_num** - A unique identifier for each feature.
- **case_num** - A unique identifier for each case.
- **feature_text** - A description of the feature.

In [None]:
df_features.head()

# Train
Feature annotations for 1,000 of the patient notes, 100 for each of the ten cases.

- **id** - Unique identifier for each patient note / feature pair.
- **pn_num** - The patient note annotated in this row.
- **feature_num** - The feature annotated in this row.
- **case_num** - The case to which this patient note belongs.
- **annotation** - The text(s) within a patient note indicating a feature. A feature may be indicated multiple times within a single note.
- **location** - Character spans indicating the location of each annotation within the note. Multiple spans may be needed to represent an annotation, in which case the spans are delimited by a semicolon (`;`).
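
The `location` format described above can be turned into numeric offsets with a small parser. This is a rough sketch under several assumptions that the column description does not confirm: each span is written as `start end` separated by a space, the end offset is exclusive (so it can be used directly as a Python slice), and a cell may be stored as a stringified list such as `"['11 21;48 54']"`. The function name `parse_location` and the sample note are invented for illustration.

```python
import ast
from typing import List, Tuple

def parse_location(location: str) -> List[Tuple[int, int]]:
    """Turn a location string into a list of (start, end) character offsets."""
    text = location.strip()
    # Assumption: the cell may be a stringified list, e.g. "['85 99', '126 131']".
    entries = ast.literal_eval(text) if text.startswith('[') else [text]
    spans = []
    for entry in entries:
        # Several spans for one annotation are delimited by a semicolon.
        for part in entry.split(';'):
            part = part.strip()
            if part:
                start, end = part.split()
                spans.append((int(start), int(end)))
    return spans

# Invented note text, used only to show how spans map back to substrings.
note = 'Pt reports chest pain, worse at night; no chest trauma.'
spans = parse_location("['11 21;48 54']")
print([note[start:end] for start, end in spans])
```

On the real data, comparing the sliced substrings against the `annotation` column is a quick way to check whether these assumptions hold.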

In [None]:
df_train.head()

# Test

This is an example test set with records taken from the train set.  
Its purpose is to help with authoring and debugging the code.


**The task is to identify the location of each annotation (feature) within the patient notes listed in the test set.**


In [None]:
df_test

# Train + Features

In [None]:
df_train_features = df_train.merge(df_features, on = ['case_num', 'feature_num'], how = 'left')
df_train_features.loc[(df_train_features['case_num'] == 0) & (df_train_features['feature_num'] == 0)].head(10)
