# NBME Data Exploration

This notebook was created during a live coding session on Twitch.

Follow here for future streams: [twitch.tv/medallionstallion_](https://www.twitch.tv/medallionstallion_)

Initially I had a hard time understanding the dataset until I realized the data comes from an exam taken by future doctors. Each test taker writes a patient note for a case, and that note is checked against the case's list of features. This is why features and case numbers are repeated in the training set. This notebook explores some of the best and worst test takers and the hardest and easiest cases.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

In [None]:
train = pd.read_csv('../input/nbme-score-clinical-patient-notes/train.csv')
test = pd.read_csv('../input/nbme-score-clinical-patient-notes/test.csv')
ss = pd.read_csv('../input/nbme-score-clinical-patient-notes/sample_submission.csv')
pn = pd.read_csv('../input/nbme-score-clinical-patient-notes/patient_notes.csv')
features = pd.read_csv('../input/nbme-score-clinical-patient-notes/features.csv')
train = train.merge(features, on=['case_num','feature_num'], validate='m:1')
# Join each annotation row to its patient note text on the shared keys
train = train.merge(pn, on=['pn_num', 'case_num'], validate='m:1')
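
The `validate='m:1'` argument in the merges above is a cheap safety check: pandas raises an error if the right-hand table has duplicate keys, which would silently duplicate training rows. Here is a minimal sketch on made-up toy frames (the column values and the `toy_*` names are assumptions for illustration, not the real data):

```python
import pandas as pd

# Toy stand-ins for train.csv and features.csv (values are made up).
toy_train = pd.DataFrame({
    'case_num': [0, 0, 1],
    'feature_num': [0, 1, 100],
    'pn_num': [16, 16, 41],
})
toy_features = pd.DataFrame({
    'case_num': [0, 0, 1],
    'feature_num': [0, 1, 100],
    'feature_text': ['feature A', 'feature B', 'feature C'],
})

# Many train rows may point at one feature row, so 'm:1' passes.
merged = toy_train.merge(toy_features, on=['case_num', 'feature_num'], validate='m:1')
print(merged.shape)

# A duplicated key on the right side breaks the many-to-one assumption.
dup_features = pd.concat([toy_features, toy_features.iloc[[0]]], ignore_index=True)
try:
    toy_train.merge(dup_features, on=['case_num', 'feature_num'], validate='m:1')
    merge_failed = False
except pd.errors.MergeError:
    merge_failed = True
print(merge_failed)
```

If the check fails on real data, it usually means the key columns do not uniquely identify a row in the lookup table.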

In [None]:
# Print an example patient history note
print(pn.query('pn_num == 16 and case_num == 0')['pn_history'].values[0])

# Score the Test Takers
- Which test takers performed best/worst?
- Which cases were hard or easy to score?

These results may help you when designing a model to predict the annotations.

In [None]:
# Flag rows where the test taker's note captured the feature.
# A feature with no annotation is stored as the string "[]".
train['correct'] = train['annotation'] != "[]"

In [None]:
train.groupby('pn_num')['correct'].mean().sort_values() \
    .plot(kind='hist', bins=25, figsize=(12, 5),
          title='% of Features Correctly Noted by Doctor')
plt.show()
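
To pull out the notes at the two ends of a score distribution like this one, `idxmin` and `idxmax` on the per-note score are probably the simplest route. A small sketch on made-up rows (the `toy` frame and its values are assumptions, only the column names match the notebook):

```python
import pandas as pd

# Toy annotation rows: note 1 captures 1 of 4 features, note 2 captures all 4.
toy = pd.DataFrame({
    'pn_num': [1, 1, 1, 1, 2, 2, 2, 2],
    'annotation': ["['chest pain']", "[]", "[]", "[]",
                   "['cough']", "['fever']", "['fatigue']", "['nausea']"],
})
toy['correct'] = toy['annotation'] != "[]"

# Share of features captured per note, then the notes at each extreme.
scores = toy.groupby('pn_num')['correct'].mean()
worst_pn, best_pn = scores.idxmin(), scores.idxmax()
print(worst_pn, scores[worst_pn])
print(best_pn, scores[best_pn])
```

Note that `idxmin` and `idxmax` return only the first match, so ties (for example several perfect notes) need `scores[scores == scores.max()]` instead.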

## Bad Test Taker?
This is an example of a test taker whose note captured only 27.7% of the features for its case:

In [None]:
print(train.query('pn_num == 52923')['pn_history'].values[0])

## Perfect test taker!

This test taker captured 100% of the features.

In [None]:
print(train.query('pn_num == 71865')['pn_history'].values[0])

# Are any "cases" harder for test takers?
- We can aggregate the percent correct for each test taker.
- Take a look at scores per case.

In [None]:
test_taker_results = train.groupby(['pn_num','case_num'])['correct'] \
    .mean().reset_index()

In [None]:
fig, ax = plt.subplots(figsize=(12, 5))
sns.boxplot(data=test_taker_results, x='case_num', y='correct')
ax.set_title('% of Features Captured by Case Number')
ax.set_xlabel('Case Number')
ax.set_ylabel('% of Features Captured')
plt.show()
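
A boxplot is easy to read at a glance, but the exact quartiles are sometimes handy too. As a rough sketch, `describe()` on the grouped scores gives the same information as a table. The `toy_results` frame below is made up and only mimics the shape of `test_taker_results`:

```python
import pandas as pd

# Toy per-note scores with the same columns as test_taker_results.
toy_results = pd.DataFrame({
    'pn_num': [10, 11, 20, 21],
    'case_num': [0, 0, 1, 1],
    'correct': [1.0, 0.5, 0.75, 0.25],
})

# One row per case: count, mean, std, min, quartiles, max.
summary = toy_results.groupby('case_num')['correct'].describe()
print(summary[['count', '50%', 'max']])
```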

# What is the best score of a test taker for each case?

In [None]:
ax = test_taker_results.groupby('case_num')['correct'].max() \
    .plot(kind='bar', color='#F8766D', figsize=(12, 5),
         title='Best Score for Each Case', edgecolor='black')

In [None]:
# This case is the hardest to get correct.
# Even the best test taker only found 76% of the features.
test_taker_results.query('case_num == 2').sort_values('correct', ascending=False).head(1)

In [None]:
print(train.query('pn_num == 21325')['pn_history'].values[0])
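
Instead of reading the hardest case off the bar chart and hard-coding it, the lookup can be done in code. This is a sketch on a made-up `toy_results` frame shaped like `test_taker_results`, not the real scores:

```python
import pandas as pd

# Toy per-note scores with the same columns as test_taker_results.
toy_results = pd.DataFrame({
    'pn_num': [10, 11, 20, 21],
    'case_num': [0, 0, 1, 1],
    'correct': [1.0, 0.5, 0.75, 0.25],
})

# Best score per case, then the case whose best score is lowest.
best_per_case = toy_results.groupby('case_num')['correct'].max()
hardest_case = best_per_case.idxmin()

# The top-scoring note for every case (first one wins on ties).
top_rows = toy_results.loc[toy_results.groupby('case_num')['correct'].idxmax()]
print(hardest_case)
print(top_rows)
```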

## Test Taker Score by Feature

In [None]:
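# Keep the returned Axes so the x label is set on this plot,
# not on the `ax` left over from an earlier cell.
ax = train.groupby('feature_num')['correct'].mean() \
    .plot(kind='hist', bins=50, color='#00BFC4', figsize=(12, 5),
          title='% of Correct Annotations for Features', edgecolor='black')
ax.set_xlabel('% of Correct Annotations')
plt.show()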

## What are the least identified features?
- There were 2 features that were identified in only 1 of 100 notes!

In [None]:
train.groupby('feature_num')['correct'].mean().sort_values()
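
The mean alone hides how many notes each feature was checked against. A sketch using named aggregation shows the count, the number of hits, and the rate side by side. The rows below are made up (the feature numbers are borrowed from this notebook, the values are not):

```python
import pandas as pd

# Toy rows: one row per (note, feature) pair with the `correct` flag.
toy = pd.DataFrame({
    'feature_num': [807, 807, 807, 807, 209, 209, 5, 5],
    'correct': [True, False, False, False, False, True, True, True],
})

# Notes per feature, how many captured it, and the capture rate.
feature_stats = (
    toy.groupby('feature_num')['correct']
    .agg(n_notes='count', n_correct='sum', rate='mean')
    .sort_values('rate')
)
print(feature_stats)
```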

In [None]:
train.query('feature_num == 807').head()

In [None]:
# Notes for feature 807 that mention "hallucinations"
feature_807 = train.query('feature_num == 807')
feature_807.loc[
    feature_807['pn_history'].str.lower().str.contains('hallucinations')
]

In [None]:
# First note for feature 807 that mentions both "hallucinations" and "ambien"
feature_807 = train.query('feature_num == 807')
history_lower = feature_807['pn_history'].str.lower()
print(
    feature_807.loc[
        history_lower.str.contains('hallucinations') &
        history_lower.str.contains('ambien')
    ]['pn_history'].values[0]
)

In [None]:
# Notes for feature 209 that mention "stress"
feature_209 = train.query('feature_num == 209')
print(feature_209.loc[
    feature_209['pn_history'].str.lower().str.contains('stress')
]['pn_history'])

In [None]:
print(train.query('feature_num == 209 and pn_num == 21054')['pn_history'].values[0])

# Make colored annotations

In [None]:
# import spacy
# sample_text = train.query('pn_num == 16')['pn_history'].values[0]
# # vocab = spacy.vocab.Vocab()
# ?nlp = spacy.load(vocab)
# doc = nlp(sample_text)
# spacy.displacy.render(doc, style='ent', manual=True, jupyter=True)
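
One possible way to get colored annotations without running a trained spaCy pipeline is displaCy's manual mode, which takes a plain dict of text plus character spans. The sketch below only builds that dict. It assumes the training data has a `location` column holding strings like `"['11 21', '26 45']"` (with `;` separating the pieces of a split span), which is not shown anywhere above. The sample text, the helper names, and the `FEATURE` label are all made up for illustration.

```python
import ast

def location_to_spans(location):
    """Parse a location string (assumed format) into sorted (start, end) pairs."""
    spans = []
    for entry in ast.literal_eval(location):
        # A single entry may hold several pieces separated by ';'.
        for piece in entry.split(';'):
            start, end = piece.split()
            spans.append((int(start), int(end)))
    return sorted(spans)

def to_displacy_doc(text, location, label):
    """Build the dict that displaCy's manual 'ent' mode expects."""
    ents = [{'start': s, 'end': e, 'label': label}
            for s, e in location_to_spans(location)]
    return {'text': text, 'ents': ents, 'title': None}

# Made-up note text and location string.
sample_text = 'Pt reports chest pain and shortness of breath.'
sample_location = "['11 21', '26 45']"
doc = to_displacy_doc(sample_text, sample_location, 'FEATURE')
highlighted = [sample_text[e['start']:e['end']] for e in doc['ents']]
print(highlighted)

# With spaCy installed, the dict can be rendered inside a notebook:
# from spacy import displacy
# displacy.render(doc, style='ent', manual=True, jupyter=True)
```

The spans are sorted because displaCy expects entities in order of appearance.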