# MIMIC Notes Pre-Processing

This notebook pre-processes MIMIC notes for further use, loading the data directly from the filesystem.

Below is a list of redacted items, each with an example and its replacement token. The replacement tokens can be changed. Check `preprocess_notes.py` for more details.

Redacted items:
* [x] First Name: `[**First Name (Titles) 137**]`, `xxname`
* [x] Last Name: `[**Last Name (Titles) **]`, `xxln`
* [x] Initials: `[**Initials (NamePattern4) **]`, `xxinit`
* [x] Name: `[**Name (NI) **]`, `xxname`
* [x] Doctor First Name: `[**Doctor First Name 1266**]`, `xxdocfn`
* [x] Doctor Last Name: `[**Doctor Last Name 1266**]`, `xxdocln`
* [x] Known Last Name: `[**Known lastname 658**]`, `xxln`
* [x] Hospital: `[**Hospital1 **]`, `xxhosp`
* [x] Hospital Unit Name: `**Hospital Unit Name 10**`, `xxhosp`
* [x] Company: `[**Company 12924**]`, `xxwork`
* [x] University/College: `[**University/College **]`, `xxwork`
* [x] Date of format YYYY-M-DD: `[**2112-4-18**]`, `xxdate`
* [x] Year: `[**Year (4 digits) **]`, `xxyear`
* [x] Year YYYY format: `[**2119**]`, `xxyear` - I use the regex `\b\d{4}\b`, which matches **any** standalone 4-digit number. This can produce false positives, but for the most part a 4-digit number by itself seems to indicate a year.
* [x] Date of format M-DD: `[**6-12**]`, `[**12/2151**]`, `xxmmdd`
* [x] Month/Day: `[**Month/Day (2) 509**]`, `xxmmdd`
* [x] Month (only): `[**Month (only) 51**]`, `xxmonth`
* [x] Holiday: `[**Holiday 3470**]`, `xxhols`
* [x] Date Range: `[**Date range (1) 7610**]`, `xxdtrnge`
* [x] Country: `[**Country 9958**]`, `xxcntry`
* [x] State: `[**State 3283**]`, `xxstate`
* [x] Location: `**Location (un) 2432**`, `xxloc`
* [x] Telephone/Fax: `[**Telephone/Fax (3) 8049**]`, `xxph`
* [x] Clip Number: `[**Clip Number (Radiology) 29923**]`, `xxradclip`
* [x] Pager Numeric Identifier: `[**Numeric Identifier 6403**]`, `xxpager`
* [x] Pager Number: `[**Pager number 13866**]`, `xxpager`
* [x] Social Security Number: `[**Security Number 10198**]`, `xxssn`
* [x] Serial Number: `[**Serial Number 3567**]`, `xxsno`
* [x] Medical Record Number: `[**Medical Record Number **]`, `xxmrno`
* [x] Provider Number: `[**Provider Number 12521**]`, `xxpno`
* [x] Age over 90: `[**Age over 90 **]`, `xxage90`
* [x] Contact Info: `[**Contact Info **]`, `xxcontact`
* [x] Job Number: `[**Job Number **]`, `xxjobno`
* [x] Dictator Number: `[**Dictator Info **]`, `xxdict`
* [x] Pharmacy MD Number/MD number: `[**Pharmacy MD Number **]`, `xxmdno`
* [x] Time: `12:52 PM`, the day is split into 6 segments by hour and each time is replaced with the token for its segment: `midnight, dawn, forenoon, afternoon, dusk, night`
* 2-digit Numbers: `[** 84 **]`, `xx2digit`
* 3-digit Numbers: `[** 834 **]`, `xx3digit`
* Wardname
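
To make the mapping above concrete, here is a minimal sketch of how a few of these replacements could be done with regular expressions. It is not the code in `preprocess_notes.py`: the rule list, the function names `preprocess_sketch` and `time_to_token`, and in particular the six equal 4-hour time segments starting at midnight are assumptions made for illustration.

```python
import re

# Hypothetical subset of the redaction rules listed above; the real rules
# live in preprocess_notes.py and may differ.
REDACTION_RULES = [
    (re.compile(r"\[\*\*Doctor First Name[^\]]*\*\*\]"), "xxdocfn"),
    (re.compile(r"\[\*\*Hospital\d*[^\]]*\*\*\]"), "xxhosp"),
    (re.compile(r"\[\*\*\d{4}-\d{1,2}-\d{1,2}\*\*\]"), "xxdate"),
    (re.compile(r"\[\*\*\d{4}\*\*\]"), "xxyear"),
]

# Clock times such as "12:52 PM" or "08:15" (the AM/PM part is optional).
TIME_RE = re.compile(r"\b(\d{1,2}):(\d{2})(?:\s*([AaPp][Mm])\b)?")

# Assumption: six equal 4-hour segments, the first one starting at 00:00.
TIME_TOKENS = ["midnight", "dawn", "forenoon", "afternoon", "dusk", "night"]


def time_to_token(match):
    """Map a matched clock time to the token of its 4-hour segment."""
    hour = int(match.group(1))
    meridiem = (match.group(3) or "").lower()
    if meridiem == "pm" and hour != 12:
        hour += 12
    elif meridiem == "am" and hour == 12:
        hour = 0
    return TIME_TOKENS[(hour % 24) // 4]


def preprocess_sketch(text):
    """Replace redaction placeholders first, then clock times."""
    for pattern, token in REDACTION_RULES:
        text = pattern.sub(token, text)
    return TIME_RE.sub(time_to_token, text)


note = ("Seen by [**Doctor First Name 1266**] at [**Hospital1 **] "
        "on [**2112-4-18**] at 12:52 PM.")
print(preprocess_sketch(note))
```

Applying the redaction rules before the time rule keeps the time pattern from ever seeing the digits inside a `[** ... **]` placeholder.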

`886` notes are marked as incorrect, with the `iserror` flag set to 1. Thus, there are a total of `2,082,294` correct notes. I have set up a `view` called `correctnotes` in the database, which only includes the correct notes. All the data I grab is from that `view`.
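
When the notes are read from the CSV file rather than the database, there is no `correctnotes` view to rely on. A rough pandas equivalent is sketched below on a toy dataframe. The column name `ISERROR` is an assumption based on the upper-case column names used elsewhere in this notebook, and the toy rows are made up.

```python
import pandas as pd

# Toy stand-in for the notes table: the flag is 1 for erroneous notes and
# missing (NaN) otherwise. The column name ISERROR is an assumption.
toy = pd.DataFrame({
    "ROW_ID": [1, 2, 3],
    "ISERROR": [float("nan"), 1.0, float("nan")],
    "TEXT": ["note a", "note b (entered in error)", "note c"],
})

# Keep every row whose flag is not 1; NaN != 1 evaluates to True.
correct = toy[toy["ISERROR"] != 1]
print(len(toy), len(correct))

# The counts quoted above are consistent with each other.
assert 2_083_180 - 886 == 2_082_294
```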

## Imports and Inits

In [1]:
import pandas as pd
import numpy as np
import re
import random
import datetime
from pathlib import Path
import pickle

In [3]:
PATH = Path('/home/paperspace/data/mimic-iii/')

In [4]:
from preprocess_notes import *

## Grab Data from MIMIC Notes File

Here the data is grabbed from the gzipped CSV file.

In [5]:
%%time
df = pd.read_csv(PATH/'NOTEEVENTS.csv.gz')
print(df.shape)



(2083180, 11)
CPU times: user 1min 10s, sys: 3.18 s, total: 1min 13s
Wall time: 1min 13s


Confirm that the number of notes matches the actual number.

In [8]:
df[['CATEGORY', 'TEXT']].groupby(['CATEGORY']).agg(['count'])

CATEGORY,TEXT count
Case Management,967
Consult,98
Discharge summary,59652
ECG,209051
Echo,45794
General,8301
Nursing,223556
Nursing/other,822497
Nutrition,9418
Pharmacy,103


In [9]:
%%time
df['proc_text'] = df['TEXT'].apply(preprocess_note)

CPU times: user 20min 12s, sys: 1.9 s, total: 20min 14s
Wall time: 20min 15s


In [10]:
# Pass the path positionally: the `fname` keyword was renamed to `path` in later pandas releases
df.to_parquet(PATH/'preprocessed_noteevents.parquet')

## Create datasets for Language Modeling

To follow the FastAI language modeling lesson, I've created a subset of the original dataframe to sample from when building the datasets. In particular, I've included the `description` and `proc_text` (preprocessed text) fields in the datasets. The `description` column is free text and has `3840` unique values. I treat the description as a separate `field`, which will be marked as such during tokenization, as is done in the FastAI library.

In [12]:
# Keep only the columns needed for language modeling; `labels` is a dummy column of zeros
sub_df = pd.DataFrame({'proc_text': df['proc_text'], 'category': df['CATEGORY'], 'description': df['DESCRIPTION'], 'labels': [0]*len(df)},
                      columns=['labels', 'category', 'description', 'proc_text'])
sub_df.sample(5)

index,labels,category,description,proc_text
1900581,0,Nursing/other,Report,NPN\n\n\n#1Bili O- had single spot light shut...
1500411,0,Nursing/other,Report,respiratory care\npatient on the vent changes ...
1279908,0,Nursing/other,Report,Correction: Diamox administered for metabolic ...
445949,0,Physician,Physician Attending Progress Note,"Chief Complaint: Hypotension, hypoxemia\n I ..."
1407630,0,Nursing/other,Report,micu nsg d/c note- see d/c summary from xxmmdd...
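
To illustrate what marking each column as a separate field during tokenization could look like, here is a small sketch. It is written from memory of the FastAI lesson this notebook follows and is not taken from the library itself: the marker strings `xbos` and `xfld` and the helper name `mark_fields` are assumptions.

```python
import pandas as pd

BOS = "xbos"  # assumed beginning-of-text marker
FLD = "xfld"  # assumed field marker


def mark_fields(df, text_cols):
    """Join text columns into one string per row, prefixing each column
    with a numbered field marker."""
    texts = f"\n{BOS} {FLD} 1 " + df[text_cols[0]].astype(str)
    for i, col in enumerate(text_cols[1:], start=2):
        texts += f" {FLD} {i} " + df[col].astype(str)
    return texts


# Toy row shaped like sub_df.
toy = pd.DataFrame({
    "labels": [0],
    "description": ["Report"],
    "proc_text": ["respiratory care"],
})

marked = mark_fields(toy, ["description", "proc_text"])
print(repr(marked.iloc[0]))
```

With markers like these, a language model can learn that the text after field 1 is a short description and the text after field 2 is the note body.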


We could simply do a train/test split on the entire dataset to get a 90/10 training and testing dataset. However, I would like the train/test sets to have a 90%/10% split in **each category**. So I iterate over the unique values of the `category` column and create a random mask per category, so that roughly 10% of the texts in each category go to testing instead of a global 10%. Because every row is assigned independently with probability 0.9, the per-category split is approximately, not exactly, 90/10.

Set random seed for reproducible results.

In [14]:
np.random.seed(42)

# One sub-dataframe per category, each with its own random ~90/10 mask
dfs = [sub_df.loc[sub_df['category'] == c] for c in sub_df['category'].unique()]
msks = [np.random.rand(len(d)) < 0.9 for d in dfs]

# Rows where the mask is True go to training, the rest to validation
train_df = pd.concat([d[m] for d, m in zip(dfs, msks)])
val_df = pd.concat([d[~m] for d, m in zip(dfs, msks)])

# Actual size, size of an exact 90/10 split, and the difference between the two
print(len(train_df), (len(df) - len(df)//10), len(train_df)-(len(df) - len(df)//10))
print(len(val_df), (len(df)//10), len(val_df)-(len(df)//10))

1875066 1874862 204
208114 208318 -204
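
The difference of 204 rows comes from the random masks: each row lands in the training set independently with probability 0.9, so the split is only approximately 90/10. If an exact per-category fraction were wanted, one alternative is sketched below on toy data. It uses `DataFrameGroupBy.sample`, which needs pandas 1.1 or later. This is a different technique from the masks used in this notebook and would not reproduce the counts shown here.

```python
import pandas as pd

# Toy data: two categories of different sizes.
toy = pd.DataFrame({
    "category": ["a"] * 100 + ["b"] * 50,
    "proc_text": [f"note {i}" for i in range(150)],
})

# Draw 10% of the rows of every category for validation.
val = toy.groupby("category", group_keys=False).sample(frac=0.1, random_state=42)

# Everything that was not drawn becomes the training set.
train = toy.drop(val.index)

print(val["category"].value_counts().to_dict())
print(len(train), len(val))
```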


Sanity check the aggregate count for each category over the 3 dataframes. Then write the `train` and `val` dataframes to disk.

In [15]:
val_df[['category', 'proc_text']].groupby(['category']).agg(['count'])

category,proc_text count
Case Management,96
Consult,9
Discharge summary,5930
ECG,20970
Echo,4439
General,821
Nursing,22324
Nursing/other,82157
Nutrition,927
Pharmacy,7


In [16]:
train_df[['category', 'proc_text']].groupby(['category']).agg(['count'])

category,proc_text count
Case Management,871
Consult,89
Discharge summary,53722
ECG,188081
Echo,41355
General,7480
Nursing,201232
Nursing/other,740340
Nutrition,8491
Pharmacy,96


In [17]:
sub_df[['category', 'proc_text']].groupby(['category']).agg(['count'])

category,proc_text count
Case Management,967
Consult,98
Discharge summary,59652
ECG,209051
Echo,45794
General,8301
Nursing,223556
Nursing/other,822497
Nutrition,9418
Pharmacy,103
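
The three tables above can also be compared programmatically instead of by eye. The sketch below, on toy data, checks that the training and validation counts add up to the total in every category. The helper name `check_split` is hypothetical.

```python
import pandas as pd


def check_split(full, train, val, key="category"):
    """Return per-category counts and whether train + val equals the total."""
    counts = pd.DataFrame({
        "total": full[key].value_counts(),
        "train": train[key].value_counts(),
        "val": val[key].value_counts(),
    }).fillna(0).astype(int)  # a category missing from one split counts as 0
    counts["ok"] = counts["train"] + counts["val"] == counts["total"]
    return counts


# Toy data standing in for sub_df, train_df and val_df.
full = pd.DataFrame({"category": ["a"] * 5 + ["b"] * 3})
train = full.iloc[[0, 1, 2, 3, 5, 6]]
val = full.drop(train.index)

report = check_split(full, train, val)
print(report)
```

In the notebook this would be called as `check_split(sub_df, train_df, val_df)`, and `report['ok'].all()` should be `True`.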


In [19]:
train_df[['labels', 'description', 'proc_text']].to_parquet(PATH/'train.parquet')
val_df[['labels', 'description', 'proc_text']].to_parquet(PATH/'test.parquet')