# NBME EDA and Starter

## CONTENTS
1. [INTRODUCTION](##1)
2. [IMPORTS](##2)
3. [Train data](##3)
4. [ANALYZING ANNOTATIONS NUMBER](##4)
5. [COUNTPLOTS](##5)
6. [Analyzing features](##6)
7. [Analyzing patient history](##7)
8. [Annotation features](##8)

## 1. INTRODUCTION
When you visit a doctor, how they interpret your symptoms can determine whether your diagnosis is accurate. By the time they’re licensed, physicians have had a lot of practice writing patient notes that document the history of the patient’s complaint, physical exam findings, possible diagnoses, and follow-up care. Learning and assessing the skill of writing patient notes requires feedback from other doctors, a time-intensive process that could be improved with the addition of machine learning.

Until recently, the Step 2 Clinical Skills examination was one component of the United States Medical Licensing Examination® (USMLE®). The exam required test-takers to interact with Standardized Patients (people trained to portray specific clinical cases) and write a patient note. Trained physician raters later scored patient notes with rubrics that outlined each case’s important concepts (referred to as features). The more such features found in a patient note, the higher the score (among other factors that contribute to the final score for the exam).

However, having physicians score patient note exams requires significant time, along with human and financial resources. Approaches using natural language processing have been created to address this problem, but patient notes can still be challenging to score computationally because features may be expressed in many ways. For example, the feature "loss of interest in activities" can be expressed as "no longer plays tennis." Other challenges include the need to map concepts by combining multiple text segments, or cases of ambiguous negation such as “no cold intolerance, hair loss, palpitations, or tremor” corresponding to the key essential “lack of other thyroid symptoms.”

## 2. IMPORTS

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ast
from wordcloud import WordCloud, STOPWORDS

## 3. Train data

## Important Terms
* Clinical Case: The scenario (e.g., symptoms, complaints, concerns) the Standardized Patient presents to the test taker (medical student, resident or physician). Ten clinical cases are represented in this dataset.
* Patient Note: Text detailing important information related by the patient during the encounter (physical exam and interview).
* Feature: A clinically relevant concept. A rubric describes the key concepts relevant to each case.

## Data  
* patient_notes.csv - A collection of about 40,000 Patient Note history portions. Only a subset of these have features annotated. You may wish to apply unsupervised learning techniques on the notes without annotations. The patient notes in the test set are not included in the public version of this file.
    1. pn_num - A unique identifier for each patient note.
    2. case_num - A unique identifier for the clinical case a patient note represents.
    3. pn_history - The text of the encounter as recorded by the test taker.

* features.csv - The rubric of features (or key concepts) for each clinical case.
    1. feature_num - A unique identifier for each feature.
    2. case_num - A unique identifier for each case.
    3. feature_text - A description of the feature.
* train.csv - Feature annotations for 1000 of the patient notes, 100 for each of ten cases.
    1. id - Unique identifier for each patient note / feature pair.
    2. pn_num - The patient note annotated in this row.
    3. feature_num - The feature annotated in this row.
    4. case_num - The case to which this patient note belongs.
    5. annotation - The text(s) within a patient note indicating a feature. A feature may be indicated multiple times within a single note.
    6. location - Character spans indicating the location of each annotation within the note. Multiple spans may be needed to represent an annotation, in which case the spans are delimited by a semicolon ;.
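The location format described above can be decoded in two steps: parse the list literal, then split each entry on the semicolon. The sketch below is only an illustration. The helper names `parse_location` and `extract_text`, the sample note, and the sample cell are all made up, and it assumes spans are end-exclusive (as in Python slicing) and that the pieces of a multi-span annotation are joined with a space.

```python
import ast

def parse_location(cell):
    """Turn a `location` cell into a list of annotations,
    each a list of (start, end) character spans."""
    annotations = []
    for entry in ast.literal_eval(cell):       # e.g. '0 7;18 21'
        spans = []
        for span in entry.split(';'):          # several spans may form one annotation
            start, end = span.split()
            spans.append((int(start), int(end)))
        annotations.append(spans)
    return annotations

def extract_text(note, spans):
    """Slice the note at each span and join the pieces (assumed: with a space)."""
    return " ".join(note[start:end] for start, end in spans)

# Hypothetical note and location cell, made up for illustration
note = "pain in chest and arm"
cell = "['8 13', '0 7;18 21']"

parsed = parse_location(cell)
texts = [extract_text(note, spans) for spans in parsed]
print(parsed)
print(texts)
```

An empty cell such as `"[]"` parses to an empty list, which corresponds to a feature that was not found in the note.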

In [None]:
patch_notes=pd.read_csv("../input/nbme-score-clinical-patient-notes/patient_notes.csv")
features=pd.read_csv("../input/nbme-score-clinical-patient-notes/features.csv")
train=pd.read_csv("../input/nbme-score-clinical-patient-notes/train.csv")

In [None]:
patch_notes

In [None]:
features

In [None]:
train

## 4. ANALYZING ANNOTATIONS NUMBER

In [None]:
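def return_length(data):
    # `data` is a string such as "['696 724']"; parse it into a list
    # and return the number of annotations it holds
    return len(ast.literal_eval(data))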

In [None]:
train['length_annotation']=train['location'].apply(return_length)

In [None]:
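# Pass the column as `x` explicitly; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x=train['length_annotation'])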

* We parse the location column of every row and count how many annotations it contains.
* The number of annotations per observation ranges from 0 to 8, with 1 being the most common.
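To make the counting step concrete, here is a small sketch on made-up location strings (not taken from train.csv). Note that an annotation made of several spans still counts once, because the semicolon-separated spans sit inside a single list element.

```python
import ast
from collections import Counter

# Hypothetical `location` cells, made up for illustration
locations = ["[]", "['10 15']", "['3 8', '20 31']", "['40 44;50 53']"]

# One list element per annotation, so len() gives the annotation count
counts = [len(ast.literal_eval(cell)) for cell in locations]

# How many rows have 0, 1, 2, ... annotations
distribution = Counter(counts)
print(counts)
print(sorted(distribution.items()))
```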

## 5. COUNTPLOTS

In [None]:
# Histogram of how many times each feature text occurs
sns.countplot(x=features['feature_text'].value_counts())

In [None]:
# Number of training rows per case
sns.countplot(x=train['case_num'])

1. Counting how often each feature text appears in features.csv shows that most feature texts occur only once.
2. Looking at the cases, the training data covers ten cases (matching the dataset description), and the number of training rows per case ranges from about 1000 to 1750.
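Passing `value_counts()` output to `countplot` draws a histogram of the counts themselves: how many feature texts occur once, how many occur twice, and so on. A minimal sketch of that "count of counts" idea, using made-up feature texts rather than the real features.csv:

```python
from collections import Counter

# Hypothetical feature texts, made up for illustration
feature_texts = ["Female", "Chest pain", "Female", "Nausea"]

# How often each text appears
occurrences = Counter(feature_texts)

# How many texts appear once, twice, ... (this is what the count plot shows)
count_of_counts = Counter(occurrences.values())
print(dict(occurrences))
print(dict(count_of_counts))
```

Here two texts occur once and one text occurs twice, so the bar at 1 would be the tallest.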

## 5.1. Analysis of patient note IDs

In [None]:
sns.kdeplot(train['pn_num'].value_counts())

In [None]:
len(train['pn_num'].unique())

In [None]:
patient_num = []
feature_count = []

# For each patient note, count how many distinct features were annotated
for pn_num, group in train.groupby(by='pn_num'):
    patient_num.append(pn_num)
    feature_count.append(group['feature_num'].nunique())

In [None]:
sns.kdeplot(feature_count)

In [None]:
case_num = []
patient_count = []

# For each case, count the distinct patient notes
for case, group in train.groupby(by='case_num'):
    case_num.append(case)
    patient_count.append(group['pn_num'].nunique())

In [None]:
sns.barplot(x=case_num,y=patient_count)

* The number of rows per patient note (pn_num) ranges from 8 to 20.
* Interestingly, counting the unique features per patient note gives the same distribution as counting rows per note. This is expected, since each row is one patient note / feature pair.
* Finally, each case has exactly 100 patient notes.
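The same three quantities can be computed without explicit loops. The sketch below uses a toy DataFrame with made-up values as a stand-in for train.csv; the variable names are illustrative only.

```python
import pandas as pd

# Toy stand-in for train.csv; values are made up for illustration
toy_train = pd.DataFrame({
    "case_num":    [0, 0, 0, 0, 1, 1],
    "pn_num":      [10, 10, 11, 11, 20, 20],
    "feature_num": [0, 1, 0, 1, 100, 101],
})

# Rows per patient note
rows_per_note = toy_train["pn_num"].value_counts().sort_index()

# Distinct features per patient note
features_per_note = toy_train.groupby("pn_num")["feature_num"].nunique()

# Distinct patient notes per case
notes_per_case = toy_train.groupby("case_num")["pn_num"].nunique()

# When every row is a distinct (note, feature) pair, the two per-note counts agree
same = rows_per_note.to_dict() == features_per_note.to_dict()
print(same)
print(notes_per_case.to_dict())
```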

## 6. Analyzing features

In [None]:
# Histogram of how many rows each feature has in the training data
sns.countplot(x=train['feature_num'].value_counts())

In [None]:
features['feature_num'].unique()

In [None]:
case_num = []
feature_count = []

# For each case, count the distinct features
for case, group in train.groupby(by='case_num'):
    case_num.append(case)
    feature_count.append(group['feature_num'].nunique())

In [None]:
sns.barplot(x=case_num,y=feature_count)

* Every feature appears exactly 100 times in the training data.
* There are 140 unique features in the entire data.
* The number of features per case ranges from 8 to 20.
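These observations can also be checked numerically rather than read off the plots. A minimal sketch on a toy DataFrame (made-up values, so the numbers differ from the real data):

```python
import pandas as pd

# Toy stand-in for train.csv; values are made up for illustration
toy_train = pd.DataFrame({
    "case_num":    [0, 0, 0, 0, 1, 1],
    "feature_num": [0, 1, 0, 1, 100, 100],
})

# Rows per feature; if all features repeat equally often there is one distinct count
rows_per_feature = toy_train["feature_num"].value_counts()
all_equal = rows_per_feature.nunique() == 1

# Total number of distinct features, and distinct features per case
n_features = toy_train["feature_num"].nunique()
features_per_case = toy_train.groupby("case_num")["feature_num"].nunique()

print(all_equal, n_features)
print(features_per_case.to_dict())
```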

## 7. Analyzing patient history

In [None]:
def create_word_cloud(data, title):
    # Cast to string, lowercase, and normalize whitespace
    text = " ".join(str(data).lower().split())

    # Build the word cloud, ignoring common English stop words
    wordcloud = WordCloud(width=800, height=800,
                          background_color='white',
                          stopwords=set(STOPWORDS),
                          min_font_size=10).generate(text)

    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.title(title, {'fontsize': 30})
    plt.show()

def consolidate_text(data):
    # Join all texts into one long string separated by spaces
    return ' '.join(data)

In [None]:
consolidated_data=consolidate_text(patch_notes['pn_history'])
create_word_cloud(consolidated_data,'cloud of all cases')

In [None]:
for df in patch_notes.groupby(by='case_num'):
    case_num ='casenum_'+str(df[0])
    consolidated_data=consolidate_text(df[1]['pn_history'])
    create_word_cloud(consolidated_data,case_num)

* We build a word cloud to find the most common words in the entire corpus.
* Similarly, we look for the common words in each case after splitting the notes by case number.
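A word cloud sizes each word by how often it occurs, so the same ranking can be read off numerically. A rough sketch using `collections.Counter`; the notes and the tiny stop-word set below are made up for illustration (the notebook itself uses `STOPWORDS` from `wordcloud`).

```python
from collections import Counter

# Hypothetical patient notes, made up for illustration
notes = ["Chest pain and nausea", "No chest pain", "Chest tightness"]

# Tiny illustrative stop-word set (an assumption, not the wordcloud list)
stopwords = {"and", "no"}

# Lowercase, split on whitespace, and drop stop words
tokens = [
    token
    for note in notes
    for token in note.lower().split()
    if token not in stopwords
]

top_words = Counter(tokens).most_common(2)
print(top_words)
```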

## 8. Annotation features

In [None]:
def consolidate_annotation(data):
    out_data=''
    for point in data:
        out_data=out_data+' '.join(ast.literal_eval(point))+' '
    return out_data

In [None]:
annotated=consolidate_annotation(train['annotation'])
create_word_cloud(annotated,'annotation')

In [None]:
for df in train.groupby(by='case_num'):
    case_num ='casenum_'+str(df[0])
    consolidated_data=consolidate_annotation(df[1]['annotation'])
    create_word_cloud(consolidated_data,case_num)

* We run a similar analysis on the annotation column of the training data.
* First we build a word cloud on the entire dataset.
* Then we build a word cloud for each case.
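For reference, this is roughly what the consolidation step does to the annotation column before the word cloud is built. The annotation strings below are made up for illustration; rows with an empty list contribute nothing to the corpus.

```python
import ast

# Hypothetical `annotation` cells, made up for illustration
annotations = ["['chest pain']", "[]", "['nausea', 'vomiting']"]

# Parse each cell into a list of phrases and flatten them
phrases = [phrase for cell in annotations for phrase in ast.literal_eval(cell)]

# Single string that would be passed to the word cloud
corpus = " ".join(phrases)
print(corpus)
```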