# NBME - Score Clinical Patient Notes
### Identify Key Phrases in Patient Notes from Medical Licensing Exams


**TASK DESCRIPTION:**
The goal is to develop an **automated method that maps clinical concepts from an exam rubric** (e.g., “diminished appetite”) to the various ways these concepts are expressed in clinical patient notes written by medical students (e.g., “eating less,” “clothes fit looser”).

**DATA:**

**Clinical Case:** The scenario (e.g., symptoms, complaints, concerns) the Standardized Patient presents to the test taker (medical student, resident or physician). Ten clinical cases are represented in this dataset.

**Patient Note:** Text detailing important information related by the patient during the encounter (physical exam and interview).

**Feature:** A clinically relevant concept. A rubric describes the key concepts relevant to each case.

**Training Data**

**patient_notes.csv - A collection of about 40,000 Patient Note history portions. Only a subset of these have features annotated. You may wish to apply unsupervised learning techniques on the notes without annotations. The patient notes in the test set are not included in the public version of this file.**

pn_num - A unique identifier for each patient note.

case_num - A unique identifier for the clinical case a patient note represents.

pn_history - The text of the encounter as recorded by the test taker.

**features.csv - The rubric of features (or key concepts) for each clinical case.**

feature_num - A unique identifier for each feature.

case_num - A unique identifier for each case.

feature_text - A description of the feature.

**train.csv - Feature annotations for 1000 of the patient notes, 100 for each of ten cases.**

id - Unique identifier for each patient note / feature pair.

pn_num - The patient note annotated in this row.

feature_num - The feature annotated in this row.

case_num - The case to which this patient note belongs.

annotation - The text(s) within a patient note indicating a feature. A feature may be indicated multiple times within a single note.

location - Character spans indicating the location of each annotation within the note. Multiple spans may be needed to represent a single annotation, in which case the spans are delimited by a semicolon (`;`).
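
The span format described above can be parsed with a small helper. This is a minimal sketch, not part of the original notebook: the function name `parse_location` is hypothetical, and it assumes each `location` cell is stored as the string form of a Python list (for example `"['85 99;126 138']"`), with each span written as `start end`.

```python
import ast

def parse_location(raw):
    """Turn a raw `location` cell into a list of (start, end) character spans."""
    # Assumption: the cell is the string form of a Python list,
    # e.g. "['85 99;126 138']". Already-parsed lists are accepted too.
    entries = ast.literal_eval(raw) if isinstance(raw, str) else list(raw)
    spans = []
    for entry in entries:
        # One annotation may be split into several spans separated by ';'.
        for piece in entry.split(';'):
            start, end = piece.split()
            spans.append((int(start), int(end)))
    return spans

print(parse_location("['696 724']"))
print(parse_location("['85 99;126 138', '200 210']"))
print(parse_location("[]"))
```

Once the training file is loaded, the helper could be applied to the whole column with something like `train['location'].apply(parse_location)`, provided the stored format matches the assumption above.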

### Initial Understanding of the data

In [None]:
# Importing necessary libraries
import numpy as np 
import pandas as pd
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
PATIENT_NOTES_PATH = '/kaggle/input/nbme-score-clinical-patient-notes/patient_notes.csv'
FEATURES_PATH = '/kaggle/input/nbme-score-clinical-patient-notes/features.csv'
TRAIN_PATH = '/kaggle/input/nbme-score-clinical-patient-notes/train.csv'
TEST_PATH = '/kaggle/input/nbme-score-clinical-patient-notes/test.csv'
SAMPLE_SUBMISSION_PATH = '/kaggle/input/nbme-score-clinical-patient-notes/sample_submission.csv'


##### Read all files and look at the data

In [None]:
# Show full cell contents so the long note text is not truncated
pd.set_option('display.max_colwidth', None)
patient_notes = pd.read_csv(PATIENT_NOTES_PATH)
patient_notes.head()

In [None]:
features = pd.read_csv(FEATURES_PATH)
features.head(10)

In [None]:
train = pd.read_csv(TRAIN_PATH)
train.head(10)

In [None]:
test = pd.read_csv(TEST_PATH)
test.head(10)

In [None]:
sample_sub = pd.read_csv(SAMPLE_SUBMISSION_PATH)
sample_sub.head(10)

#### From the data above, we can see that features.csv maps each feature_num to its feature_text for every case. The training data holds the annotations taken from the patient notes, and each annotation is linked to its feature through feature_num.

In [None]:
# Merge the feature text and the note text into the training annotations
train_merged = (
    train.merge(features, on=['case_num', 'feature_num'], how='left')
         .merge(patient_notes, on=['pn_num', 'case_num'], how='left')
)

# Keep the columns in a readable order
train_merged = train_merged[['id', 'case_num', 'pn_num', 'feature_num', 'feature_text', 'pn_history', 'annotation', 'location']]
train_merged.head()
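
With the note text and the spans in the same row, a natural sanity check is to slice `pn_history` at the `location` spans and compare the result with the `annotation` column. The snippet below is only a sketch on a made-up note: the helper name `text_at_spans` is hypothetical, and it assumes spans are `(start, end)` pairs with an exclusive end, as in Python slicing.

```python
def text_at_spans(note, spans):
    """Join the pieces of `note` covered by (start, end) spans with a single space."""
    # Assumption: `end` is exclusive, so note[start:end] is the annotated text.
    return ' '.join(note[start:end] for start, end in spans)

# Made-up note text for illustration; it is not taken from the dataset.
note = "Pt reports eating less; clothes fit looser since March."

# A single contiguous span
print(text_at_spans(note, [(11, 22)]))

# Two spans that together express one concept
print(text_at_spans(note, [(11, 22), (24, 42)]))
```

If the extracted text does not match the stored annotation for some rows, that would suggest the span convention assumed here (exclusive end, pieces joined by a space) needs adjusting.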