# Score Clinical Patient Notes: EDA

When you visit a doctor, how they interpret your symptoms can determine whether your diagnosis is accurate. By the time they’re licensed, physicians have had a lot of practice writing patient notes that document the history of the patient’s complaint, physical exam findings, possible diagnoses, and follow-up care. Learning and assessing the skill of writing patient notes requires feedback from other doctors, a time-intensive process that could be improved with the addition of machine learning.

![](https://2dilis1szzwz2ijnst3bl6kc-wpengine.netdna-ssl.com/wp-content/uploads/2018/09/ThinkstockPhotos-531739713.jpg)

In this competition, you’ll identify specific clinical concepts in patient notes. Specifically, you'll develop an automated method to map clinical concepts from an exam rubric (e.g., “diminished appetite”) to various ways in which these concepts are expressed in clinical patient notes written by medical students (e.g., “eating less,” “clothes fit looser”).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import string
import plotly.express as px
import plotly.graph_objects as go

from warnings import filterwarnings
filterwarnings('ignore')

colors = ['#3B43EE', '#181D7E', '#656AC4', '#3B3F79', '#060F8A']

# About the Data
The text data presented here is from the USMLE® Step 2 Clinical Skills examination, a medical licensure exam. This exam measures a trainee's ability to recognize pertinent clinical facts during encounters with standardized patients.

During this exam, each test taker sees a Standardized Patient, a person trained to portray a clinical case. After interacting with the patient, the test taker documents the relevant facts of the encounter in a patient note. Each patient note is scored by a trained physician who looks for the presence of certain key concepts or features relevant to the case as described in a rubric. The goal of this competition is to develop an automated way of identifying the relevant features within each patient note, with a special focus on the patient history portions of the notes where the information from the interview with the standardized patient is documented.

In [None]:
features = pd.read_csv('../input/nbme-score-clinical-patient-notes/features.csv')
patient_notes = pd.read_csv('../input/nbme-score-clinical-patient-notes/patient_notes.csv')
train = pd.read_csv('../input/nbme-score-clinical-patient-notes/train.csv')
test = pd.read_csv('../input/nbme-score-clinical-patient-notes/test.csv')
sample_submission = pd.read_csv('../input/nbme-score-clinical-patient-notes/sample_submission.csv')

**features.csv -** The rubric of features (or key concepts) for each clinical case.

* feature_num - A unique identifier for each feature.
* case_num - A unique identifier for each case.
* feature_text - A description of the feature.

In [None]:
features.head()

In [None]:
features.info()

**patient_notes.csv -** A collection of about 40,000 Patient Note history portions. Only a subset of these have features annotated. You may wish to apply unsupervised learning techniques on the notes without annotations. The patient notes in the test set are not included in the public version of this file.
* pn_num - A unique identifier for each patient note.
* case_num - A unique identifier for the clinical case a patient note represents.
* pn_history - The text of the encounter as recorded by the test taker.
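Since only a subset of the notes is annotated, it can help to separate the two groups before trying unsupervised techniques. The following is a minimal sketch, assuming a note counts as annotated when its `pn_num` appears in the training annotations. The `*_demo` frames are hypothetical stand-ins for the real files, not actual competition data.

```python
import pandas as pd

# Hypothetical miniature stand-ins for patient_notes.csv and train.csv.
patient_notes_demo = pd.DataFrame({
    "pn_num": [1, 2, 3, 4],
    "case_num": [0, 0, 1, 1],
    "pn_history": ["note a", "note b", "note c", "note d"],
})
train_demo = pd.DataFrame({
    "pn_num": [1, 1, 3],
    "feature_num": [0, 1, 100],
})

# Assumption: a note is "annotated" if its pn_num occurs in the training rows.
is_annotated = patient_notes_demo["pn_num"].isin(train_demo["pn_num"])

annotated_notes = patient_notes_demo[is_annotated]
unannotated_notes = patient_notes_demo[~is_annotated]

# Fraction of annotated notes within each case.
annotated_share = is_annotated.groupby(patient_notes_demo["case_num"]).mean()
print(annotated_share)
```

With the real data, the same pattern would use `patient_notes` and `train` in place of the demo frames.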

In [None]:
patient_notes.head()

In [None]:
patient_notes.info()

**train.csv -** Feature annotations for 1000 of the patient notes, 100 for each of ten cases.
* id - Unique identifier for each patient note / feature pair.
* pn_num - The patient note annotated in this row.
* feature_num - The feature annotated in this row.
* case_num - The case to which this patient note belongs.
* annotation - The text(s) within a patient note indicating a feature. A feature may be indicated multiple times within a single note.
* location - Character spans indicating the location of each annotation within the note. Multiple spans may be needed to represent a single annotation, in which case the spans are delimited by a semicolon (`;`).
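The span format described above can be turned into character offsets and used to slice the note text. This is a rough sketch under some assumptions: that each `location` cell is stored as a stringified Python list such as `"['12 22', '24 30;37 49']"`, that each span is a `start end` pair, and that the end offset is exclusive, as in Python slicing. The helper names `parse_spans` and `extract_text`, and the sample note, are hypothetical.

```python
import ast

def parse_spans(location):
    """Turn one location cell into a flat list of (start, end) offsets."""
    spans = []
    for entry in ast.literal_eval(location):  # one entry per annotation
        for piece in entry.split(";"):        # an annotation may use several spans
            start, end = piece.split()
            spans.append((int(start), int(end)))
    return spans

def extract_text(note, spans):
    """Slice the note at each span (end offset assumed to be exclusive)."""
    return [note[start:end] for start, end in spans]

# Hypothetical note and location value, for illustration only.
note = "17yo M with chest pain. Father had a heart attack last year."
location = "['12 22', '24 30;37 49']"

spans = parse_spans(location)
print(spans)
print(extract_text(note, spans))
```

Comparing the extracted slices against the `annotation` column is a quick sanity check that the offsets are being read correctly.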

In [None]:
train.head()

In [None]:
train.info()

**test.csv -** Example instances selected from the training set.

In [None]:
test.head()

In [None]:
test.info()
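To see the text behind each row, the identifier columns can be joined against the other two tables. The sketch below assumes that the test rows carry `pn_num`, `feature_num`, and `case_num` but no text, and that `feature_num` and `pn_num` are unique keys in their own tables. All values in the `*_demo` frames are hypothetical.

```python
import pandas as pd

# Hypothetical miniature stand-ins for test.csv, features.csv and patient_notes.csv.
test_demo = pd.DataFrame({
    "id": ["00016_000", "00016_001"],
    "case_num": [0, 0],
    "pn_num": [16, 16],
    "feature_num": [0, 1],
})
features_demo = pd.DataFrame({
    "feature_num": [0, 1],
    "case_num": [0, 0],
    "feature_text": ["Family-history-of-MI", "Chest-pressure"],
})
patient_notes_demo = pd.DataFrame({
    "pn_num": [16],
    "case_num": [0],
    "pn_history": ["17yo M presents with palpitations."],
})

# Left joins keep every test row; validate raises if a key is duplicated
# on the right-hand side, which would silently multiply rows otherwise.
merged = (
    test_demo
    .merge(features_demo, on=["feature_num", "case_num"],
           how="left", validate="many_to_one")
    .merge(patient_notes_demo, on=["pn_num", "case_num"],
           how="left", validate="many_to_one")
)
print(merged[["id", "feature_text", "pn_history"]])
```

The same two merges should apply to `train` as well, since it shares the identifier columns listed above.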

In [None]:
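# Whitespace-delimited word count of each note.
patient_notes['word_count'] = patient_notes['pn_history'].apply(lambda x: len(x.split()))
# feature_text is assumed to join its words with hyphens, so split on '-' instead of whitespace.
features['word_count'] = features['feature_text'].apply(lambda x: len(x.split('-')))
# If annotation is stored as a stringified list, strip the enclosing brackets first
# so that an empty annotation ("[]") counts as 0 words rather than 1.
train['word_count'] = train['annotation'].apply(lambda x: len(x.strip('[]').split()))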

fig = px.histogram(patient_notes,
                   x='word_count',
                   color_discrete_sequence=[colors[0]],
                   title='Patient History Word Count Distribution')
fig.show()

In [None]:
fig = px.histogram(features,
                   x='word_count',
                   color_discrete_sequence=[colors[1]],
                   title='Feature Text Word Count Distribution')
fig.show()

In [None]:
fig = px.histogram(train,
                   x='word_count',
                   color_discrete_sequence=[colors[2]],
                   title='Annotation Word Count Distribution')
fig.show()

In [None]:
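# Remove punctuation so that trailing commas and periods do not inflate word lengths.
patient_notes['cleaned_text'] = patient_notes['pn_history'].apply(lambda x: "".join(i for i in x if i not in string.punctuation))
# Mean number of characters per whitespace-delimited word.
patient_notes['avg_len'] = patient_notes['cleaned_text'].apply(lambda x: np.mean([len(i) for i in x.split()]))

# feature_text is split on '-' here, consistent with the word count computed earlier.
features['avg_len'] = features['feature_text'].apply(lambda x: np.mean([len(i) for i in x.split('-')]))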

fig = px.histogram(patient_notes,
                   x='avg_len',
                   color_discrete_sequence=[colors[3]],
                   title='Patient History Average Word Length')
fig.show()

In [None]:
fig = px.histogram(features,
                   x='avg_len',
                   color_discrete_sequence=[colors[4]],
                   title='Feature Text Average Word Length')
fig.show()

In [None]:
f_counts = features.groupby("case_num").count()

fig = px.bar(f_counts,
             x=f_counts.index,
             y='feature_num',
             color_discrete_sequence=[colors[0]],
             title="Number of Features per Case")

fig.show()

In [None]:
p_counts = patient_notes.groupby("case_num").count()
fig = px.bar(p_counts,
             x=p_counts.index,
             y='pn_num',
             color_discrete_sequence=[colors[1]],
             title="Number of Patient Notes per Case")

fig.show()

### Feature Text

In [None]:
features_text = ' '.join(features['feature_text'])

plt.subplots(figsize=(16,16))
wc = WordCloud(stopwords=STOPWORDS, background_color="white",
               contour_width=2, contour_color='blue',
               width=1500, height=750, max_font_size=256,
               max_words=150, random_state=1)
wc.generate(features_text)
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()

### Patient History

In [None]:
pn_history_text = ' '.join(patient_notes['pn_history'])

plt.subplots(figsize=(16,16))
wc = WordCloud(stopwords=STOPWORDS, background_color="white",
               contour_width=2, contour_color='blue',
               width=1500, height=750, max_font_size=256,
               max_words=150, random_state=1)
wc.generate(pn_history_text)
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()

### Annotation

In [None]:
annotation_text = ' '.join(train['annotation'])

plt.subplots(figsize=(16,16))
wc = WordCloud(stopwords=STOPWORDS, background_color="white",
               contour_width=2, contour_color='blue',
               width=1500, height=750, max_font_size=256,
               max_words=150, random_state=1)
wc.generate(annotation_text)
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()

## Thank you