# Intro
Welcome to the [NBME - Score Clinical Patient Notes](https://www.kaggle.com/c/nbme-score-clinical-patient-notes) competition.

![](https://storage.googleapis.com/kaggle-competitions/kaggle/33607/logos/header.png)

In this competition, we will identify specific clinical concepts in patient notes. Specifically, we will develop an automated method to map clinical concepts from an exam rubric to various ways in which these concepts are expressed in clinical patient notes written by medical students (e.g., “eating less,” “clothes fit looser”). Great solutions will be both accurate and reliable.

In this notebook we analyse the dataset and give an example of working with the data using regular expressions.

**Table of contents:**
1. [Overview](#Overview)
2. [Look On A Patient](#LookOnAPatient)
3. [Use Regular Expression](#UseRE)
4. [Simple Approach For Feature 12 - Male](#SimpleApproach)
5. [Predict Test Data](#Predict)

<font size="4"><span style="color: royalblue;">Please upvote the notebook if it helps you. Feel free to leave a comment below the notebook. Thank you. </span></font>

# Libraries

In [None]:
import os
import pandas as pd
import re

# Path

In [None]:
path = '/kaggle/input/nbme-score-clinical-patient-notes/'
os.listdir(path)

# Load Data

In [None]:
train_data = pd.read_csv(path+'train.csv')
test_data = pd.read_csv(path+'test.csv')
samp_subm = pd.read_csv(path+'sample_submission.csv')
features = pd.read_csv(path+'features.csv')
notes = pd.read_csv(path+'patient_notes.csv')

# Overview <a name="Overview"></a>

**Important Terms**
* Clinical Case: The scenario (e.g., symptoms, complaints, concerns) the Standardized Patient presents to the test taker (medical student, resident or physician). Ten clinical cases are represented in this dataset.
* Patient Note: Text detailing important information related by the patient during the encounter (physical exam and interview).
* Feature: A clinically relevant concept. A rubric describes the key concepts relevant to each case.


**patient_notes.csv** - A collection of about 40,000 Patient Note history portions. Only a subset of these have features annotated. You may wish to apply unsupervised learning techniques on the notes without annotations. The patient notes in the test set are not included in the public version of this file.
* pn_num - A unique identifier for each patient note.
* case_num - A unique identifier for the clinical case a patient note represents.
* pn_history - The text of the encounter as recorded by the test taker.

In [None]:
print('Number note samples:', len(notes))
notes.head()

**features.csv** - The rubric of features (or key concepts) for each clinical case.
* feature_num - A unique identifier for each feature.
* case_num - A unique identifier for each case.
* feature_text - A description of the feature.

In [None]:
print('Number feature samples:', len(features))
features.head()

**train.csv** - Feature annotations for 1000 of the patient notes, 100 for each of ten cases.
* id - Unique identifier for each patient note / feature pair.
* pn_num - The patient note annotated in this row.
* feature_num - The feature annotated in this row.
* case_num - The case to which this patient note belongs.
* annotation - The text(s) within a patient note indicating a feature. A feature may be indicated multiple times within a single note.
* location - Character spans indicating the location of each annotation within the note. Multiple spans may be needed to represent an annotation, in which case the spans are delimited by a semicolon ;.
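
The semicolon-delimited span format can be decoded with a few lines of Python. The following is a minimal sketch: `parse_location` is a hypothetical helper (not part of the competition code), and it assumes the plain `start end;start end` form described above, without surrounding list brackets, and spans that behave like Python slices (end exclusive). The sample note is invented.

```python
def parse_location(location: str) -> list[tuple[int, int]]:
    """Turn a string such as '0 4;10 15' into [(0, 4), (10, 15)]."""
    spans = []
    for part in location.split(';'):
        part = part.strip()
        if not part:  # an empty string means "no annotation"
            continue
        start, end = part.split()
        spans.append((int(start), int(end)))
    return spans


# Invented sample note, only for illustration
note = 'HPI: 17yo M with chest pain'
print([note[s:e] for s, e in parse_location('5 9;10 11')])
```

Slicing the note with the parsed spans is a quick way to check that a predicted location really points at the intended text.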

In [None]:
print('Number train samples:', len(train_data))
train_data.head()

In [None]:
train_data[(train_data['case_num']==0)&(train_data['feature_num']==11)]['annotation'].value_counts()

To help you author submission code, we include a few example instances selected from the training set. When your submitted notebook is scored, this example data will be replaced by the actual test data. The patient notes in the test set will be added to the patient_notes.csv file.

**test.csv** - Example instances selected from the training set. <br>
**sample_submission.csv** - A sample submission file in the correct format.

In [None]:
print('Number test samples:', len(test_data))
test_data.head()

In [None]:
samp_subm.head()

# Look On A Patient <a name="LookOnAPatient"></a>
We focus on patient note number 16. First we look at the text of the patient note:

In [None]:
pn_num = 16
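# Take the first matching row by position; indexing by label only works if the index equals pn_num
notes[notes['pn_num']==pn_num]['pn_history'].values[0]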

There are 13 annotations for 13 features, where 3 are empty.

In [None]:
train_data[train_data['pn_num']==pn_num]

# Use Regular Expression <a name="UseRE"></a>
One possibility is to use regular expressions. We demonstrate this with patient note 16.

In [None]:
text = notes[notes['pn_num']==16]['pn_history'].values[0]
text

Search for the sex:

In [None]:
match = re.search('M', text)
print('Start position:', match.start())
print('End position:', match.end())

Search for the age:

In [None]:
# Raw string so that \w reaches the regex engine unchanged
match = re.search(r'17\w+', text)
print('Match:', match.group())
print('Start position:', match.start())
print('End position:', match.end())
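
Searching for a bare `'M'` returns the first capital M anywhere in the note, which is not necessarily the sex marker. A slightly stricter sketch uses word boundaries and `re.finditer` to collect every match together with its span. The sample text and the helper name `find_spans` are made up for illustration and are not taken from the dataset.

```python
import re


def find_spans(pattern: str, text: str) -> list[tuple[str, int, int]]:
    """Return (matched text, start, end) for every match of pattern in text."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]


# Invented sample text, only for illustration
sample = 'Mr. Smith is a 17yo M presenting with palpitations.'

# Standalone "M" only; the "M" in "Mr" is followed by a letter and is skipped
print(find_spans(r'\bM\b', sample))

# Age written as "<number>yo", with optional whitespace before "yo"
print(find_spans(r'\b\d{1,3}\s*yo\b', sample))
```

The start and end values have the same meaning as `match.start()` and `match.end()` in the cells above.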

# Simple Approach For Feature 12 - Male <a name="SimpleApproach"></a>
We can search for a pattern and also get its location in the text.

Now we focus on feature 12 (Male).

In [None]:
df_temp = train_data[train_data['feature_num']==12]
df_temp.head()

Here we can see the distribution of the values:

In [None]:
df_temp['annotation'].value_counts()

We define a pattern list based on the distribution and search these patterns in the notes:

In [None]:
keys = df_temp['annotation'].value_counts().keys()
pattern_list = [key.replace("[", '').replace("]", '').replace("'", '') for key in keys if len(key)>3]
pattern_list
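
Stripping brackets and quotes works for single-phrase annotations, but a row with several phrases collapses into one string such as `a, b`, which is unlikely to appear verbatim in a note. Assuming the `annotation` column stores the string form of a Python list (as the `[`, `]` and `'` characters removed above suggest), `ast.literal_eval` can recover the individual phrases. This is a sketch of an alternative, not the approach used in this notebook: `build_pattern_list` is a hypothetical helper and the sample values are invented.

```python
import ast
from collections import Counter


def build_pattern_list(annotations: list[str]) -> list[str]:
    """Collect distinct phrases, most frequent first, from list-like strings."""
    counts = Counter()
    for raw in annotations:
        for phrase in ast.literal_eval(raw):  # "['a', 'b']" -> ['a', 'b']
            counts[phrase] += 1
    return [phrase for phrase, _ in counts.most_common()]


# Invented sample values in the assumed "string form of a list" format
sample_annotations = ["['17yo M']", "[]", "['M']", "['17yo M']", "['male', 'M']"]
print(build_pattern_list(sample_annotations))
```

Empty annotations (`[]`) contribute nothing, so no length-based filter is needed in this variant.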

# Predict Test Data <a name="Predict"></a>

In [None]:
data = test_data.copy()

In [None]:
for row in data.index:
    case_num = data.loc[row, 'case_num']
    pn_num = data.loc[row, 'pn_num']
    feature_num = data.loc[row, 'feature_num']
    
    # Create Pattern List For The Feature
    df_temp = train_data[train_data['feature_num']==feature_num].copy()
    keys = df_temp['annotation'].value_counts().keys()
    pattern_list = [key.replace("[", '').replace("]", '').replace("'", '') for key in keys if len(key)>3]
    
    # Extract Patient Note
    text = notes[(notes['pn_num']==pn_num)&(notes['case_num']==case_num)]['pn_history'].values[0]
    # Same-length replacements keep character positions aligned with the original note
    text = text.replace('.', ' ').replace('+', 'P')
    pred = ''
    for pattern in pattern_list:
        pattern = pattern.replace('+', 'P').replace('(', ' ').replace(')', ' ')
        start = text.find(pattern)
        # find() returns -1 when there is no match; 0 is a valid match at the start of the note
        if start != -1:
            pred += str(start)+' '+str(start+len(pattern))+';'
    pred = pred[:-1]
    data.loc[row, 'location'] = pred
    samp_subm.loc[row, 'id'] = data.loc[row, 'id']
    samp_subm.loc[row, 'location'] = pred
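
The core of the loop, finding each pattern and formatting its span, can be isolated in a small function, which makes it easy to test without the competition files. This is a sketch rather than the loop itself: `locate_patterns` is a hypothetical name, it reports every occurrence of each pattern (the loop reports only the first), it removes duplicate spans, and it treats patterns as literal text. The sample text is invented.

```python
def locate_patterns(text: str, patterns: list[str]) -> str:
    """Return 'start end;start end' for all literal occurrences of patterns."""
    spans = set()
    for pattern in patterns:
        if not pattern:  # skip empty patterns, which would match everywhere
            continue
        start = text.find(pattern)
        while start != -1:  # find() returns -1 when there is no further match
            spans.add((start, start + len(pattern)))
            start = text.find(pattern, start + 1)
    # Sort spans so the output does not depend on pattern order
    return ';'.join(f'{s} {e}' for s, e in sorted(spans))


# Invented sample text, only for illustration
print(locate_patterns('17yo M, male athlete', ['17yo M', 'male']))
```

An empty return value corresponds to an empty `location`, meaning no pattern was found in the note.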


# Export

In [None]:
samp_subm

In [None]:
samp_subm.to_csv('submission.csv', index=False)