# NBME - Problem Statement

**If you like the notebook do give an upvote for support :) Thank You.**

**When you visit a doctor, how they interpret your symptoms can determine whether your diagnosis is accurate. By the time they’re licensed, physicians have had a lot of practice writing patient notes that document the history of the patient’s complaint, physical exam findings, possible diagnoses, and follow-up care. Learning and assessing the skill of writing patient notes requires feedback from other doctors, a time-intensive process that could be improved with the addition of machine learning. The goal of this competition is to develop an automated way of identifying the relevant features within each patient note, with a special focus on the patient history portions of the notes where the information from the interview with the standardized patient is documented.**

**In this competition, you’ll identify specific clinical concepts in patient notes. Specifically, you'll develop an automated method to map clinical concepts from an exam rubric (e.g., “diminished appetite”) to various ways in which these concepts are expressed in clinical patient notes written by medical students (e.g., “eating less,” “clothes fit looser”).**

# Importing Libraries

In [None]:
!pip install stylecloud

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt
from wordcloud import WordCloud,STOPWORDS
import string

import plotly.express as px
import plotly.graph_objects as go


import stylecloud
from IPython.display import Image

from warnings import filterwarnings
filterwarnings('ignore')

# About Dataset

**train.csv - Feature annotations for 1000 of the patient notes, 100 for each of ten cases.**

* **id** - Unique identifier for each patient note / feature pair.
* **pn_num** - The patient note annotated in this row.
* **feature_num** - The feature annotated in this row.
* **case_num** - The case to which this patient note belongs.
* **annotation** - The text(s) within a patient note indicating a feature. A feature may be indicated multiple times within a single note.
* **location** - Character spans indicating the location of each annotation within the note. Multiple spans may be needed to represent an annotation, in which case the spans are delimited by a semicolon (`;`).
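
The `location` format described above can be turned into integer pairs with a few lines of code. This is only a minimal sketch: `parse_location` is a hypothetical helper name, and it assumes each cell is the string form of a Python list of `"start end"` entries, such as `"['696 724']"`.

```python
import ast

def parse_location(cell):
    """Turn a raw `location` cell into a list of (start, end) integer pairs."""
    spans = []
    for entry in ast.literal_eval(cell):   # e.g. ['85 99;126 138']
        for piece in entry.split(';'):     # several spans may share one entry
            start, end = piece.split()
            spans.append((int(start), int(end)))
    return spans

print(parse_location("['696 724']"))                   # [(696, 724)]
print(parse_location("['85 99;126 138', '200 210']"))  # [(85, 99), (126, 138), (200, 210)]
print(parse_location("[]"))                            # []
```

Using `ast.literal_eval` instead of `eval` keeps the parsing limited to Python literals.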

**Reading Dataset**

In [None]:
train = pd.read_csv('../input/nbme-score-clinical-patient-notes/train.csv')

display(train.shape)


display(train.head())

In [None]:
train.info()  # info() prints its summary directly and returns None, so display() is not needed

**patient_notes.csv** - The full text of each patient note is stored in patient_notes.csv, while train.csv records which rubric features appear in a note. Given a note, we need an NER-style approach that finds where each feature is expressed in the text. That location is stored as a character span, `start end`, so the annotated phrase is `pn_history[start:end]`.

**patient_notes.csv description** - A collection of about 40,000 Patient Note history portions. Only a subset of these have features annotated. You may wish to apply unsupervised learning techniques on the notes without annotations. The patient notes in the test set are not included in the public version of this file.

* **pn_num** - A unique identifier for each patient note.
* **case_num** - A unique identifier for the clinical case a patient note represents.
* **pn_history** - The text of the encounter as recorded by the test taker.

In [None]:
notes = pd.read_csv('../input/nbme-score-clinical-patient-notes/patient_notes.csv')

display(notes.shape)

display(notes.head())

The first row of train.csv belongs to pn_num 16 and has location ['696 724'], so slicing that note's text from character 696 to 724 recovers the annotated phrase:

In [None]:
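# Select the note by its pn_num value rather than by row label, then slice the span.
note_16 = notes.loc[notes['pn_num'] == 16, 'pn_history'].iloc[0]
note_16[696:724]

In [None]:
note_16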
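
One common way to frame this as an NER-style problem is to convert each character span into per-token labels. The sketch below is an illustration rather than the method used in this notebook: it uses a made-up note, plain whitespace tokenization via `re.finditer`, and a hypothetical helper `label_tokens`. A real pipeline would more likely rely on a model tokenizer's offset mapping.

```python
import re

def label_tokens(text, spans):
    """Label each whitespace-delimited token 1 if it overlaps any (start, end) span, else 0."""
    labelled = []
    for match in re.finditer(r'\S+', text):
        tok_start, tok_end = match.span()
        # Two half-open intervals overlap when each one starts before the other ends.
        hit = any(tok_start < end and start < tok_end for start, end in spans)
        labelled.append((match.group(), int(hit)))
    return labelled

note = "17 yo male with chest pain, dad had a heart attack"  # made-up example text
spans = [(16, 26)]                                            # characters covering "chest pain"

print(note[16:26])                # chest pain
print(label_tokens(note, spans))  # only 'chest' and 'pain,' receive label 1
```

Note that the token `pain,` is labelled even though the comma lies outside the span; how to treat such partial overlaps is a design choice.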

In [None]:
notes.info()

**features.csv** - The rubric of features (or key concepts) for each clinical case.

* **feature_num** - A unique identifier for each feature.
* **case_num** - A unique identifier for each case.
* **feature_text** - A description of the feature.

In [None]:
features = pd.read_csv('../input/nbme-score-clinical-patient-notes/features.csv')

display(features.shape)
features.head()

In [None]:
features.info()
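
Because the three files share `pn_num`, `case_num`, and `feature_num`, they can be joined into one table that places each annotation row next to its note text and feature description. The following is a minimal sketch with tiny made-up frames standing in for the real CSVs; the `_demo` names and all values are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for train.csv, patient_notes.csv and features.csv.
train_demo = pd.DataFrame({
    'id': ['00016_000', '00016_001'],
    'case_num': [0, 0],
    'pn_num': [16, 16],
    'feature_num': [0, 1],
    'location': ["['4 14']", "[]"],
})
notes_demo = pd.DataFrame({
    'pn_num': [16],
    'case_num': [0],
    'pn_history': ['c/o chest pain since yesterday'],
})
features_demo = pd.DataFrame({
    'feature_num': [0, 1],
    'case_num': [0, 0],
    'feature_text': ['Chest-pain', 'Shortness-of-breath'],
})

# Left joins keep every annotation row, even if a lookup were to fail.
merged = (train_demo
          .merge(features_demo, on=['feature_num', 'case_num'], how='left')
          .merge(notes_demo, on=['pn_num', 'case_num'], how='left'))

print(merged[['id', 'feature_text', 'pn_history', 'location']])
```

The same two `merge` calls should apply to the `train`, `features`, and `notes` frames loaded above, assuming their column names match the descriptions.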

# EDA

**Patient Notes**

In [None]:
notes_count = notes.groupby('case_num').count()

fig = px.bar(data_frame=notes_count , x = notes_count.index , y='pn_num' , color='pn_num' , color_continuous_scale="Emrld"
             , title = 'Patient Notes per case')

fig.show()

In [None]:
notes['word_count'] = notes['pn_history'].apply(lambda x : len(x.split()))

fig = px.histogram(notes , x ='word_count' , title='Patient History Word Count')

fig.show()

**Features**

In [None]:
features_count = features.groupby('case_num').count()

fig = px.bar(data_frame=features_count , x = features_count.index , y='feature_num' , color='feature_num' , color_continuous_scale="Emrld"
             , title = 'Features Distribution per case')

fig.show()

In [None]:
features['word_count'] = features['feature_text'].apply(lambda x : len(x.split('-')))

fig = px.histogram(features , x ='word_count' , title='Features Word Count')
fig.update_layout(bargap=0.1)
fig.show()
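
The `split('-')` above relies on `feature_text` using hyphens where spaces would normally go. Assuming that convention, and further assuming that alternative phrasings within one feature are joined by `-OR-` (an assumption about the rubric text, not something stated above), a small hypothetical helper can turn a feature into readable phrases. The example strings are illustrative.

```python
def readable_feature(feature_text):
    """Split a rubric feature on '-OR-' and replace the remaining hyphens with spaces."""
    return [part.replace('-', ' ') for part in feature_text.split('-OR-')]

print(readable_feature('Family-history-of-MI-OR-Family-history-of-myocardial-infarction'))
# ['Family history of MI', 'Family history of myocardial infarction']
print(readable_feature('Chest-pressure'))
# ['Chest pressure']
```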

In [None]:
train.head()

**Patient Analysis**

In [None]:
print('Unique Patient Counts = ' , len(train['pn_num'].value_counts()))

In [None]:
print('Data Frame for a particular patient')
train[train['pn_num'] == 16]

**Annotation Analysis**

In [None]:
# Each row is one note/feature pair; a location of '[]' means the feature was not found in the note.
print('Total rows = ', len(train['location']))
print('Rows with empty annotation and location = ' , len(train[train['location'] == '[]']))

In [None]:
import ast

# Parse the stringified lists into real Python lists; otherwise len() would count characters.
train['location'] = train['location'].apply(ast.literal_eval)
train['annotation'] = train['annotation'].apply(ast.literal_eval)

# Number of annotations in each row (vectorised, avoids chained assignment in a loop).
train['annot_count'] = train['annotation'].apply(len)

print('Annotation counts:')
print(train['annot_count'].value_counts().sort_index())

In [None]:
# Compute the counts once and reuse them for the x, y and color arguments.
annot_counts = train['annot_count'].value_counts().sort_index()

fig = px.bar(x=annot_counts.index, y=annot_counts.values, color=annot_counts.values,
             color_continuous_scale='Emrld', title='Number of Annotations per row')

fig.update_xaxes(title='Number of Annotations')
fig.update_yaxes(title='Number of Rows')

fig.show()

**Inspired by the work of Sanskar Hasija. Do check out his work too: https://www.kaggle.com/odins0n/nbme-detailed-eda/notebook**

In [None]:
import spacy
from spacy import displacy

# Collect every annotated character span for patient note 16.
patient_df = train[train["pn_num"] == 16]
ents = []
for spans in patient_df["location"]:
    for entry in spans:
        # One entry can hold several "start end" spans separated by ';'.
        for span in entry.split(";"):
            start, end = span.split()
            ents.append({"start": int(start), "end": int(end), "label": "Annotation"})
ents.sort(key=lambda ent: ent["start"])  # keep the highlights in reading order

# Manual-mode input for displaCy: raw text plus the entity spans to highlight.
doc = {
    "text": notes[notes["pn_num"] == 16]["pn_history"].iloc[0],
    "ents": ents,
}
colors = {"Annotation": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"}
options = {"colors": colors}
displacy.render(doc, style="ent", options=options, manual=True, jupyter=True)

# WordClouds

**Inspired by the work of Marilia. Do check out their work too: https://www.kaggle.com/mpwolke/patient-s-key-phrases/comments**

**Word Cloud for Features**

In [None]:
concat_features = ' '.join([i for i in features.feature_text.astype(str)])
print(concat_features[:1000])

In [None]:

stylecloud.gen_stylecloud(text=concat_features, icon_name= "fab fa-twitter", 
                          palette="cartocolors.diverging.TealRose_7", background_color="black" , size=1024)


In [None]:
Image(filename="./stylecloud.png", width=1024, height=1024)

In [None]:
features.feature_text.value_counts()[:30]

**WordCloud for Patient History**

In [None]:
# Join with a space so the last word of one note does not fuse with the first word of the next.
concat_pnhist = ' '.join(notes.pn_history.astype(str))

concat_pnhist[:200]

In [None]:

stylecloud.gen_stylecloud(text=concat_pnhist, icon_name= "fas fa-dharmachakra", 
                          palette="cartocolors.diverging.TealRose_7", background_color="black" , size=1024)


In [None]:
Image(filename="./stylecloud.png", width=1024, height=1024)

**Word Cloud for Annotations**

In [None]:
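# Each row holds a list of annotation strings (parsed earlier), so flatten the lists before joining.
concat_annot = ' '.join(text for annots in train.annotation for text in annots)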
print(concat_annot[:500])

In [None]:
stylecloud.gen_stylecloud(text=concat_annot,
                          icon_name='fas fa-yin-yang',
                          palette='colorbrewer.sequential.BuGn_9',
                          background_color='black',
                          gradient='horizontal',
                          size=1024)

In [None]:
Image(filename="./stylecloud.png", width=1024, height=1024)