**Created by Sanskar Hasija**

**📊 NBME Detailed - EDA 📊**

**2 February 2021**


  # <center> 📊 NBME DETAILED- EDA 📊 </center>
## <center>If you find this notebook useful, support with an upvote👍</center>

# Table of Contents

<a id="toc"></a>
- [1. Introduction](#1)
- [2. Imports](#2)
- [3. EDA](#3)
    - [3.1 Train Data](#3.1)
    - [3.2 Test Data](#3.2)
    - [3.3 Patient Notes Data](#3.3)
        - [3.3.1 Patient Notes Distribution ( Per case ) ](#3.3.1)
        - [3.3.2 Patient Notes Length Distribution ](#3.3.2)
    - [3.4 Features Data](#3.4)
        - [3.4.1 Feature Distribution (per Case)  ](#3.4.1)
        - [3.4.2 Feature  Length Distribution ](#3.4.2)
    - [3.5 Patient analysis ](#3.5)
    - [3.6 Annotation analysis ](#3.6)
        - [3.6.1 Annotation Count Distribution  ](#3.6.1)
        - [3.6.2 Annotation Length Distribution   ](#3.6.2)
- [4. Annotation Visualisation with spaCy](#4)
- [5. WORD Clouds](#5)
    - [5.1 WORDCLOUD for Patient history](#5.1)
    - [5.2 WORDCLOUD for Features](#5.2)
    - [5.3 WORDCLOUD for Annotations](#5.3)
    - [5.4 WORDCLOUD for two-character words in Patient notes](#5.4)
   

  

<a id="1"></a>
# Introduction

### <center>[NBME - Score Clinical Patient Notes](https://www.kaggle.com/c/nbme-score-clinical-patient-notes/overview)</center>

![](https://raw.githubusercontent.com/sanskar-hasija/kaggle/main/images/header.png)

<b>The text data presented here is from the USMLE® Step 2 Clinical Skills examination, a medical licensure exam. This exam measures a trainee's ability to recognize pertinent clinical facts during encounters with standardized patients.</b>

<b>During this exam, each test taker sees a Standardized Patient, a person trained to portray a clinical case. After interacting with the patient, the test taker documents the relevant facts of the encounter in a patient note. Each patient note is scored by a trained physician who looks for the presence of certain key concepts or features relevant to the case as described in a rubric. The goal of this competition is to develop an automated way of identifying the relevant features within each patient note, with a special focus on the patient history portions of the notes where the information from the interview with the standardized patient is documented.</b>

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

# <center>IMPORTS</center> 
<a id="2"></a>

In [None]:
import os
import spacy
import warnings
import wordcloud
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go

In [None]:
train = pd.read_csv("../input/nbme-score-clinical-patient-notes/train.csv")
test = pd.read_csv("../input/nbme-score-clinical-patient-notes/test.csv")
features = pd.read_csv("../input/nbme-score-clinical-patient-notes/features.csv")
patient_notes = pd.read_csv("../input/nbme-score-clinical-patient-notes/patient_notes.csv")
submission = pd.read_csv("../input/nbme-score-clinical-patient-notes/sample_submission.csv")

RANDOM_IDX = 12
warnings.filterwarnings('ignore')

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

# <center>EDA</center> 
<a id="3"></a>

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp;<b><u>Observations in Train Data:</u></b><br>
 
* <i> There are a total of <b><u>6</u></b> columns and <b><u>14300</u></b> rows in the <b><u>train</u></b> data.</i><br>
* <i> Train data contains <b><u>85800</u></b> values (6 × 14300) with <b><u>0</u></b> missing values.</i><br>
* <i> <b><u>10</u></b> unique cases and <b><u>1000</u></b> unique patient notes (<code>pn_num</code>) are present.</i><br>
* <i> Multiple annotations and locations can be present in a single row (discussed further in the sections below).</i><br>
</div>

<a id="3.1"></a>
## Train data

**Column Description :**
* `id` - Unique identifier for each patient note / feature pair.
* `pn_num` - The patient note annotated in this row.
* `feature_num` - The feature annotated in this row.
* `case_num` - The case to which this patient note belongs.
* `annotation` - The text(s) within a patient note indicating a feature. A feature may be indicated multiple times within a single note.
* `location` - Character spans indicating the location of each annotation within the note. Multiple spans may be needed to represent an annotation, in which case the spans are delimited by a semicolon ;.
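
The `location` format described above can be unpacked with a small helper. This is a minimal sketch, assuming each cell is a stringified Python list whose entries look like `"start end"`, with `;` separating several spans of one annotation. The function name `parse_location` and the example value are hypothetical.

```python
import ast

def parse_location(raw):
    """Turn a raw `location` cell into a list of (start, end) spans per annotation."""
    spans_per_annotation = []
    for entry in ast.literal_eval(raw):      # e.g. "126 131;143 151"
        spans = []
        for piece in entry.split(";"):       # ';' separates spans of one annotation
            start, end = piece.split()
            spans.append((int(start), int(end)))
        spans_per_annotation.append(spans)
    return spans_per_annotation

# Made-up value in the documented format (not taken from the dataset)
example = "['85 99', '126 131;143 151']"
print(parse_location(example))
```

An empty cell (`'[]'`) simply yields an empty list, which matches rows that carry no annotation.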

### Quick view of Train Data

In [None]:
print(f'\033[92mNumber of rows in train data: {train.shape[0]}')
print(f'\033[94mNumber of columns in train data: {train.shape[1]}')
print(f'\033[91mNumber of values in train data: {train.count().sum()}')
print(f'\033[91mNumber of missing values in train data: {sum(train.isna().sum())}')
train.head()

In [None]:
print(f"Non-null values per column in train:\n{train.count()}")
print(f"Unique case numbers are:  {train.case_num.unique()}") 
# Bug fix: this line previously printed case_num again instead of feature_num
print(f"Unique feature numbers (feature_num) are:  {train.feature_num.unique()}")

print(f"Unique pn_num are:  {len(train.pn_num.unique())}") 
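
The same four quantities (rows, columns, non-null values, missing values) are printed for every table in this notebook, so they could be gathered in one helper. The sketch below is only an illustration: `summarize` is a hypothetical name and the toy dataframe stands in for `train`, `test`, `features` or `patient_notes`.

```python
import pandas as pd

def summarize(df):
    """Return basic size statistics for a dataframe."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "values": int(df.count().sum()),        # non-null cells
        "missing": int(df.isna().sum().sum()),  # null cells
    }

# Toy data (made up) with one missing cell
toy = pd.DataFrame({"a": [1, 2, None], "b": ["x", "y", "z"]})
stats = summarize(toy)
print(stats)
# Non-null plus missing cells should always equal rows x columns
print(stats["values"] + stats["missing"] == stats["rows"] * stats["columns"])
```

The final check is the same arithmetic used in the observations: the number of values equals rows times columns when nothing is missing.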

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

## Test data
<a id="3.2"></a>

### Quick view of Test Data

In [None]:
print(f'\033[92mNumber of rows in test data: {test.shape[0]}')
print(f'\033[94mNumber of columns in test data: {test.shape[1]}')
print(f'\033[91mNumber of values in test data: {test.count().sum()}')
print(f'\033[91mNumber of missing values in test data: {sum(test.isna().sum())}')
test.head()

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp;<b><u>Observations in Patient Notes Data:</u></b><br>
 
* <i> There are a total of <b><u>3</u></b> columns and <b><u>42146</u></b> rows in the <b><u>Patient Notes</u></b> data.</i><br>
* <i> Patient Notes data contains <b><u>126438</u></b> values with <b><u>0</u></b> missing values.</i><br>
* <i> The number of patient notes per case is unevenly distributed, with <b><u>Case 3</u></b> having the most and <b><u>Case 1</u></b> the fewest.</i><br>
* <i> The average length of the <b><u>pn_history</u></b> column is <b><u>818.17</u></b> characters. </i><br>
</div>

## Patient Notes Data
<a id="3.3"></a>

**Column Description :**
* `pn_num` - A unique identifier for each patient note.
* `case_num` - A unique identifier for the clinical case a patient note represents.
* `pn_history` - The text of the encounter as recorded by the test taker.

### Quick view of Patient Notes Data

In [None]:
print(f'\033[92mNumber of rows in Patient Notes Data: {patient_notes.shape[0]}')
print(f'\033[94mNumber of columns in Patient Notes Data: {patient_notes.shape[1]}')
print(f'\033[91mNumber of values in Patient Notes Data: {patient_notes.count().sum()}')

patient_notes.head()

### Sample Patient Note 

In [None]:
# print(RANDOM_IDX)
# print(patient_notes.iloc[100, :])
print(patient_notes["pn_history"].iloc[RANDOM_IDX])

### Patient Notes Distribution ( Per case ) 
<a id="3.3.1"></a>

In [None]:
notes_counts = patient_notes.groupby("case_num").count()
fig = px.bar(data_frame =notes_counts, 
             x = notes_counts.index,
             y = 'pn_num' , 
             color = "pn_num",
             color_continuous_scale="Emrld") 
fig.update_layout(title = {
        'text': 'Distribution of patient notes for each case',
        'y':0.95,
        'x':0.48,
        'xanchor': 'center',
        'yanchor': 'top'} ,
                   xaxis = dict(
        tickmode = 'array',
        tickvals = [0, 1,2, 3, 4,5, 6,7,8,9],
        ticktext = ['Case 0', 'Case 1', 'Case 2', 'Case 3', 'Case 4', 'Case 5', 'Case 6', 'Case 7', 'Case 8', 'Case 9']),
                  template = "plotly_white")
fig.show()

### Patient Notes Length Distribution 
<a id="3.3.2"></a>

In [None]:
all_notes = []
all_notes_len = []
for notes in patient_notes['pn_history']:
    all_notes.append(notes)
    all_notes_len.append(len(notes))
print("Average length of Patient History - ",np.mean(all_notes_len))
fig = px.histogram(x = all_notes_len,  marginal="violin",nbins = 100)
fig.update_layout(template="plotly_white")
fig.update_xaxes(title = "Length of Patient Notes")
fig.show()
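
The overall average can hide differences between cases. As a rough sketch, the note length could also be summarised per case with a `groupby`. The dataframe below is toy data standing in for `patient_notes`, and the column name `note_len` is an assumption introduced for the example.

```python
import pandas as pd

# Toy stand-in for patient_notes (values are made up)
notes_toy = pd.DataFrame({
    "case_num": [0, 0, 1],
    "pn_history": ["abcd", "abcdef", "ab"],
})

# Character length of each note, computed without an explicit loop
notes_toy["note_len"] = notes_toy["pn_history"].str.len()

# Mean, minimum and maximum note length for each case
per_case = notes_toy.groupby("case_num")["note_len"].agg(["mean", "min", "max"])
print(per_case)
```

Applied to the real `patient_notes` frame, the same three lines would give one row per case.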

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp;<b><u>Observations in Features Data:</u></b><br>
 
* <i> There are a total of <b><u>3</u></b> columns and <b><u>143</u></b> rows in the <b><u>Features</u></b> data.</i><br>
* <i> Features data contains <b><u>429</u></b> values with <b><u>0</u></b> missing values.</i><br>
* <i> The number of features per case is unevenly distributed, with <b><u>Case 5</u></b> and <b><u>Case 8</u></b> having the most and <b><u>Case 7</u></b> the fewest.</i><br>
* <i> The average length of the <b><u>feature_text</u></b> column is <b><u>23.20</u></b> characters. </i><br>
</div>

## Features Data
<a id="3.4"></a>


**Column Description :**
* `feature_num` - A unique identifier for each feature.
* `case_num` - A unique identifier for each case.
* `feature_text` - A description of the feature.
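
Since `train`, `features` and `patient_notes` share the key columns listed in their descriptions, they can be joined into one table for inspection. The following is a hedged sketch on toy data; the `*_toy` frames and their values are made up, and only the column names come from the descriptions in this notebook.

```python
import pandas as pd

# Toy stand-ins for the three competition tables (values are made up)
train_toy = pd.DataFrame({
    "id": ["00016_000", "00016_001"],
    "case_num": [0, 0],
    "pn_num": [16, 16],
    "feature_num": [0, 1],
})
features_toy = pd.DataFrame({
    "feature_num": [0, 1],
    "case_num": [0, 0],
    "feature_text": ["feature A", "feature B"],
})
notes_toy = pd.DataFrame({
    "pn_num": [16],
    "case_num": [0],
    "pn_history": ["example note text"],
})

# Attach the feature description and the note text to every train row
merged = (
    train_toy
    .merge(features_toy, on=["feature_num", "case_num"], how="left")
    .merge(notes_toy, on=["pn_num", "case_num"], how="left")
)
print(merged[["id", "feature_text", "pn_history"]])
```

A left join keeps every train row even if a key were missing from one of the lookup tables.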

### Quick view of features Data

In [None]:
print(f'\033[92mNumber of rows in features data: {features.shape[0]}')
print(f'\033[94mNumber of columns in features data: {features.shape[1]}')
print(f'\033[91mNumber of values in features data: {features.count().sum()}')
features.head()

In [None]:
len(features.feature_num.unique())

### Sample Feature text

In [None]:
features["feature_text"].iloc[0]

### Feature Distribution (per Case) 
<a id="3.4.1"></a>

In [None]:
feature_counts = features.groupby("case_num").count()
fig = px.bar(data_frame =feature_counts, 
             x = feature_counts.index,
             y = 'feature_num' , 
             color = "feature_num",
             color_continuous_scale="Emrld") 
fig.update_layout(title = {
        'text': 'Distribution of Features for each case',
        'y':0.95,
        'x':0.48,
        'xanchor': 'center',
        'yanchor': 'top'} ,
                   xaxis = dict(
        tickmode = 'array',
        tickvals = [0, 1,2, 3, 4,5, 6,7,8,9],
        ticktext = ['Case 0', 'Case 1', 'Case 2', 'Case 3', 'Case 4', 'Case 5', 'Case 6', 'Case 7', 'Case 8', 'Case 9']),
                  template = "plotly_white")
fig.show()

### Feature Length Distribution 
<a id="3.4.2"></a>

In [None]:
all_feat = []
all_feat_len = []
for notes in features['feature_text']:
    all_feat.append(notes)
    all_feat_len.append(len(notes))
print("Average length of Feature text - ",np.mean(all_feat_len))
fig = px.histogram(x = all_feat_len,  marginal="violin",nbins = 200)
fig.update_layout(template="plotly_white")
fig.update_xaxes(title = "Length of Features")
fig.show()

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp;<b><u>Observations in Patient analysis:</u></b><br>
 
* <i> There are a total of <b><u>1000</u></b> unique patients in the train data.</i><br>
* <i> For every unique <b><u>pn_num</u></b> there are several rows, each depicting the annotations for one feature in the patient note.</i><br>
</div>

## Patient analysis 
<a id="3.5"></a>

### Unique Patient Count

In [None]:
print("Unique Patient Count in train data : ",len(train["pn_num"].value_counts()))

### Dataframe for a particular patient

In [None]:
PATIENT_IDX = 74087
patient_df = train[train["pn_num"] == PATIENT_IDX]
patient_df

### Patient Notes and Annotations 

In [None]:
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print('\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    📌 &nbsp;<b><u>Observations in Annotation analysis:</u></b><br>
    
 
* <i> There are a total of <b><u>12234</u></b> annotations present in the train data.</i><br>
* <i> A total of <b><u>4399</u></b> annotations are empty and their corresponding location is an <b><u>empty list</u></b>.</i><br>
* <i> Exactly <b><u>1</u></b> row has <b><u>7 annotations</u></b> and <b><u>1</u></b> row has <b><u>8 annotations</u></b>.</i><br>
* <i> The average length of an annotation is <b><u>16.52</u></b> characters. </i><br>
</div>

## Annotation Analysis 
<a id="3.6"></a>

### Empty Annotation count

In [None]:
print("Number of empty annotations and locations = ", sum(train["location"] == '[]'))

### Annotation Count Distribution
<a id="3.6.1"></a>

In [None]:
import ast

# Both columns hold stringified Python lists; literal_eval parses them without executing code
train["location"] = train["location"].apply(ast.literal_eval)
train["annotation"] = train["annotation"].apply(ast.literal_eval)
# Number of annotations in each row (avoids the chained assignment train["annot_count"][i] = ...)
train["annot_count"] = train["annotation"].apply(len)
total_annot = train["annot_count"].sum()
print('\033[92mTotal number of annotations in train data: ', total_annot)
print('\033[94mAnnotation count per row: ')
print('\033[94m', train["annot_count"].value_counts().sort_index())

In [None]:
# Compute the distribution once instead of repeating value_counts() for every argument
annot_dist = train["annot_count"].value_counts().sort_index()
fig = px.bar(x = annot_dist.index,
             y = annot_dist.values, 
             color = annot_dist.values,
             color_continuous_scale="Emrld") 
fig.update_xaxes(title ="Number of Annotations")
fig.update_yaxes(title ="Number of Rows")
fig.update_layout(title = {
        'text': 'Number of Annotations per row',
        'y':0.95,
        'x':0.48,
        'xanchor': 'center',
        'yanchor': 'top'} ,
                   
                  template = "plotly_white")
fig.show()

### Annotation Length Distribution 
<a id="3.6.2"></a>

In [None]:
annot_lengths = []
all_annot_words = []
for annot in train["annotation"]:
    for words in annot:
        annot_lengths.append(len(words))
        all_annot_words.append(words)
print("Average length of Annotations - ",np.mean(annot_lengths))
fig = px.histogram(x = annot_lengths,  marginal="violin",nbins = 300)
fig.update_layout(template="plotly_white")
fig.update_xaxes(title = "Length of Annotation")
fig.show()

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

# <center>Annotation Visualisation (SPACY)</center> 
<a id="4"></a>

In [None]:
patient_df = train[train["pn_num"] == PATIENT_IDX]
ents = []
for spans in patient_df["location"]:
    for span_group in spans:
        # One annotation can cover several spans separated by ';'
        for span in span_group.split(";"):
            start, end = span.split()
            ents.append({"start": int(start), "end": int(end), "label": "Annotation"})
# displacy's manual mode expects entities ordered by start position
ents.sort(key=lambda ent: ent["start"])
doc = {
    'text' : patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0],
    "ents" : ents
}
colors = {"Annotation" :"linear-gradient(90deg, #aa9cfc, #fc9ce7)" } 
options = {"colors": colors}
spacy.displacy.render(doc, style="ent", options = options , manual=True, jupyter=True);
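
A quick sanity check on the spans is to slice the note with them and compare the result with the `annotation` text. The sketch below assumes a location entry has the form `"start end"` with `;` between several spans, and that the pieces of a multi-span annotation are joined with a single space, which is an assumption about the data rather than something confirmed here. The helper name and the note are made up.

```python
def extract_span_text(note, location_entry):
    """Join the note slices named by one location entry such as '0 5;23 28'."""
    pieces = []
    for span in location_entry.split(";"):
        start, end = (int(x) for x in span.split())
        pieces.append(note[start:end])
    # Assumption: pieces of a multi-span annotation are separated by one space
    return " ".join(pieces)

# Made-up note text, not taken from the dataset
note = "chest pain for 2 days, worse with exertion"
print(extract_span_text(note, "0 10"))
print(extract_span_text(note, "0 5;23 28"))
```

Comparing the returned string with the stored annotation for each row would flag any span that does not line up with its text.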

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

# <center>WORD CLOUDS</center> 
<a id="5"></a>

## WORDCLOUD for Patient history
<a id="5.1"></a>

In [None]:
wordcloud_notes = wordcloud.WordCloud(stopwords=wordcloud.STOPWORDS, max_font_size=120, max_words=5000,
                      width = 600, height = 400,
                      background_color='white').generate(" ".join(all_notes))
fig, ax = plt.subplots(figsize=(14,10))
ax.imshow(wordcloud_notes, interpolation='bilinear')
ax.set_axis_off()
plt.show()

## WORDCLOUD for Features
<a id="5.2"></a>

In [None]:
wordcloud_feat = wordcloud.WordCloud(stopwords=wordcloud.STOPWORDS, max_font_size=120, max_words=5000,
                      width = 600, height = 400,
                      background_color='white').generate(" ".join(all_feat))
fig, ax = plt.subplots(figsize=(14,10))
ax.imshow(wordcloud_feat, interpolation='bilinear')
ax.set_axis_off()
plt.show()

## WORDCLOUD for Annotations
<a id="5.3"></a>

In [None]:
wordcloud_annot = wordcloud.WordCloud(stopwords=wordcloud.STOPWORDS, max_font_size=120, max_words=5000,
                      width = 600, height = 400,
                      background_color='white').generate(" ".join(all_annot_words))
fig, ax = plt.subplots(figsize=(14,10))
ax.imshow(wordcloud_annot, interpolation='bilinear')
ax.set_axis_off()
plt.show()

## WORDCLOUD for two-character words in Patient notes
<a id="5.4"></a>

In [None]:
two  = []
for note in all_notes:
    for word in note.split():
        if len(word)==2:
            two.append(word)
wordcloud_two_chars = wordcloud.WordCloud(stopwords=wordcloud.STOPWORDS, max_font_size=120, max_words=len(set(two)),
                      width = 600, height = 400,
                      background_color='white').generate(" ".join(two ))
fig, ax = plt.subplots(figsize=(14,10))
ax.imshow(wordcloud_two_chars, interpolation='bilinear')
ax.set_axis_off()
plt.show()
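
A word cloud shows relative frequency only visually. For exact counts of the two-character tokens, a `collections.Counter` is a possible complement. This is a small sketch on made-up notes; lower-casing the tokens is a choice made for this example and is not done in the cell above.

```python
from collections import Counter

# Made-up notes standing in for all_notes
notes = ["17 yo M c/o CP x 2 wk", "pt is 20 yo F no hx of CP"]

# Count whitespace-separated tokens that are exactly two characters long
two_char = Counter(
    word.lower()
    for note in notes
    for word in note.split()
    if len(word) == 2
)
print(two_char.most_common(3))
```

Replacing `notes` with `all_notes` would give the counts behind the cloud, although the cloud itself also drops stopwords.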

<a href="#toc" role="button" aria-pressed="true" >⬆️Back to Table of Contents ⬆️</a>

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em;">
    
    
### <center>Thank you for reading🙂</center>
### <center>If you have any feedback or find anything wrong, please let me know!</center>

</div>
