# MIMIC Notes Pre-Processing

Pre-processing of MIMIC clinical notes for language modeling and CUIs (UMLS Concept Unique Identifiers).

Below is a list of redacted items with an example and the replacement token.

Redacted items:
* [x] First Name: `[**First Name (Titles) 137**]`, `xxname`
* [x] Last Name: `[**Last Name (Titles) **]`, `xxln`
* [x] Initials: `[**Initials (NamePattern4) **]`, `xxinit`
* [x] Name: `[**Name (NI) **]`, `xxname`
* [x] Doctor First Name: `[**Doctor First Name 1266**]`, `xxdocfn`
* [x] Doctor Last Name: `[**Doctor Last Name 1266**]`, `xxdocln`
* [x] Known Last Name: `[**Known lastname 658**]`, `xxln`
* [x] Hospital: `[**Hospital1 **]`, `xxhosp`
* [x] Hospital Unit Name: `[**Hospital Unit Name 10**]`, `xxhosp`
* [x] Company: `[**Company 12924**]`, `xxwork`
* [x] University/College: `[**University/College **]`, `xxwork`
* [x] Date of format YYYY-M-DD: `[**2112-4-18**]`, `xxdate`
* [x] Year: `[**Year (4 digits) **]`, `xxyear`
* [x] Year YYYY format: `[**2119**]`, `xxyear`. The regex `\b\d{4}\b` matches **any** standalone 4-digit number, so it can produce false positives, but for the most part a 4-digit number by itself seems to indicate a year.
* [x] Date of format M-DD or MM/YYYY: `[**6-12**]`, `[**12/2151**]`, `xxmmdd`
* [x] Month/Day: `[**Month/Day (2) 509**]`, `xxmmdd`
* [x] Month (only): `[**Month (only) 51**]`, `xxmonth`
* [x] Holiday: `[**Holiday 3470**]`, `xxhols`
* [x] Date Range: `[**Date range (1) 7610**]`, `xxdtrnge`
* [x] Country: `[**Country 9958**]`, `xxcntry`
* [x] State: `[**State 3283**]`, `xxstate`
* [x] Location: `[**Location (un) 2432**]`, `xxloc`
* [x] Telephone/Fax: `[**Telephone/Fax (3) 8049**]`, `xxph`
* [x] Clip Number: `[**Clip Number (Radiology) 29923**]`, `xxradclip`
* [x] Pager Numeric Identifier: `[**Numeric Identifier 6403**]`, `xxpager`
* [x] Pager Number: `[**Pager number 13866**]`, `xxpager`
* [x] Social Security Number: `[**Security Number 10198**]`, `xxssn`
* [x] Serial Number: `[**Serial Number 3567**]`, `xxsno`
* [x] Medical Record Number: `[**Medical Record Number **]`, `xxmrno`
* [x] Provider Number: `[**Provider Number 12521**]`, `xxpno`
* [x] Age over 90: `[**Age over 90 **]`, `xxage90`
* [x] Contact Info: `[**Contact Info **]`, `xxcontact`
* [x] Job Number: `[**Job Number **]`, `xxjobno`
* [x] Dictator Info: `[**Dictator Info **]`, `xxdict`
* [x] Pharmacy MD Number/MD number: `[**Pharmacy MD Number **]`, `xxmdno`
* [x] Time: `12:52 PM`. The day is split into 6 segments by hour, and each time is replaced with the token for its segment: `midnight`, `dawn`, `forenoon`, `afternoon`, `dusk`, `night`
* [ ] 2-digit Numbers: `[** 84 **]`, `xx2digit`
* [ ] 3-digit Numbers: `[** 834 **]`, `xx3digit`
* [ ] Wardname
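
As a rough illustration of how the mapping above could be applied, the sketch below handles a few of the listed patterns with regular expressions and buckets clock times into the six time-of-day tokens. The rule list, the function names `replace_redacted` and `time_to_segment`, and the equal 4-hour bucket boundaries are assumptions made for this example; the actual `process_notes` module may well differ.

```python
import re

# Hypothetical subset of the mapping above: (compiled pattern, token).
REDACTION_RULES = [
    (re.compile(r"\[\*\*Doctor First Name[^\]]*\*\*\]", re.I), "xxdocfn"),
    (re.compile(r"\[\*\*Hospital[^\]]*\*\*\]", re.I), "xxhosp"),
    (re.compile(r"\[\*\*\d{4}-\d{1,2}-\d{1,2}\*\*\]"), "xxdate"),
    (re.compile(r"\[\*\*\d{4}\*\*\]"), "xxyear"),
]

# Assumed boundaries: six equal 4-hour buckets starting at 00:00.
SEGMENTS = ["midnight", "dawn", "forenoon", "afternoon", "dusk", "night"]

# Clock times such as "12:52 PM" or "08:15" (AM/PM marker is optional).
TIME_RE = re.compile(r"\b(\d{1,2}):(\d{2})(?:\s*(am|pm))?\b", re.I)


def time_to_segment(match):
    """Map a matched clock time to its time-of-day token."""
    hour = int(match.group(1))
    meridiem = (match.group(3) or "").lower()
    if meridiem == "pm" and hour != 12:
        hour += 12
    elif meridiem == "am" and hour == 12:
        hour = 0
    if hour > 23:  # not a valid clock time; leave the text untouched
        return match.group(0)
    return SEGMENTS[hour // 4]


def replace_redacted(text):
    """Apply the redaction rules, then bucket any clock times."""
    for pattern, token in REDACTION_RULES:
        text = pattern.sub(token, text)
    return TIME_RE.sub(time_to_segment, text)


print(replace_redacted("Seen at [**Hospital1 **] on [**2112-4-18**] at 12:52 PM"))
# -> Seen at xxhosp on xxdate at afternoon
```

The rules are applied in list order, so a full implementation would need the more specific patterns (for example full dates) to run before generic ones such as the 4-digit year rule.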

## Imports and Inits

In [1]:
import pandas as pd
import psycopg2
import numpy as np
import re
import random
import datetime
from pathlib import Path
import pickle

In [2]:
path = Path('data')

In [3]:
from process_notes import *

## Grab sample data from MIMIC

In [None]:
cats = pd.read_csv('cats.csv')
max_limit = -1

queries = []
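for category, n_notes in zip(cats['category'], cats['number_of_notes']):
    # A non-positive max_limit disables sampling and selects every note.
    limit = min(max_limit, n_notes) if max_limit > 0 else n_notes
    if limit == max_limit:
        # The category has at least max_limit notes: draw a random sample.
        q = f"select * from correctnotes where category='{category}' order by random() limit {limit};"
    else:
        # Fewer notes than max_limit (or sampling disabled): take them all.
        q = f"select * from correctnotes where category='{category}';"
    queries.append(q)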
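
Interpolating `category` directly into the SQL text works for the category names seen here, but it would break on a name containing a quote. One hedged alternative is to keep the SQL and its parameters separate and let the database driver do the quoting. The helper name `build_queries` is hypothetical; the `%s` placeholder style is the one psycopg2 uses.

```python
def build_queries(categories, max_limit=-1):
    """Return (sql, params) pairs, one per (category, n_notes) entry.

    Mirrors the loop above: sample `max_limit` random notes from a category
    that has at least that many, otherwise select all of its notes.
    """
    queries = []
    for category, n_notes in categories:
        if 0 < max_limit <= n_notes:
            sql = ("select * from correctnotes where category = %s "
                   "order by random() limit %s;")
            params = (category, max_limit)
        else:
            sql = "select * from correctnotes where category = %s;"
            params = (category,)
        queries.append((sql, params))
    return queries


# Illustrative input; in the notebook the pairs would come from cats.csv.
for sql, params in build_queries([("ECG", 209051), ("Consult", 98)], max_limit=1000):
    print(sql, params)
```

Each pair could then be run with `pd.read_sql_query(sql, con, params=params)`.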

In [None]:
%%time
dfs = []

con = psycopg2.connect(dbname='mimic', user='sudarshan', host='/var/run/postgresql')
try:
    for q in queries:
        dfs.append(pd.read_sql_query(q, con))
finally:
    # Close the connection even if one of the queries fails.
    con.close()
    
df = pd.concat(dfs)
df.reset_index(inplace=True, drop=True)
# df.set_index('row_id', inplace=True)
print(df.shape)

In [None]:
df[['category', 'text']].groupby(['category']).agg(['count'])

In [None]:
%%time
df['text'] = df['text'].apply(process_note)

In [None]:
with open(path/'processed_noteevents.pkl', 'wb') as f:
    pickle.dump(df, f)

texts = df['text']
texts.to_csv(path/'texts.csv')

In [4]:
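# Use a context manager so the pickle file is closed after loading.
with open(path/'processed_noteevents.pkl', 'rb') as f:
    ori_df = pickle.load(f)

# Keep only the columns needed downstream; every note gets a placeholder label of 0.
sub_df = pd.DataFrame(
    {
        'text': ori_df['text'],
        'category': ori_df['category'],
        'description': ori_df['description'],
        'labels': [0]*len(ori_df),
    },
    columns=['labels', 'category', 'description', 'text'],
)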

In [8]:
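# One DataFrame per category, so that the split is done within each category.
dfs = [sub_df.loc[sub_df['category'] == c] for c in sub_df['category'].unique()]
# Each row goes to the training set with probability 0.9 (unseeded, so not reproducible).
msks = [np.random.rand(len(d)) < 0.9 for d in dfs]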

In [12]:
# Apply each category's mask: masked rows go to training, the rest to validation.
train_dfs = [d[m] for d, m in zip(dfs, msks)]
val_dfs = [d[~m] for d, m in zip(dfs, msks)]

In [13]:
train_df = pd.concat(train_dfs)
val_df = pd.concat(val_dfs)

print(len(train_df), (len(ori_df) - len(ori_df)//10), len(train_df)-(len(ori_df) - len(ori_df)//10))
print(len(val_df), (len(ori_df)//10), len(val_df)-(len(ori_df)//10))

1874688 1874065 623
207606 208229 -623
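
The counts deviate from an exact 90/10 split (by 623 notes here) because the mask assigns each note to the training set independently with probability 0.9. If exact proportions were wanted, one alternative (not what this notebook does) is to shuffle each category's row positions with a seeded generator and cut at the 10% mark. `exact_split` and its parameters are hypothetical names for this sketch.

```python
import random


def exact_split(n_rows, val_frac=0.1, seed=0):
    """Return (train_positions, val_positions) for a table of n_rows rows."""
    rng = random.Random(seed)          # seeded, so the split is reproducible
    positions = list(range(n_rows))
    rng.shuffle(positions)
    n_val = round(n_rows * val_frac)   # exact validation size
    return sorted(positions[n_val:]), sorted(positions[:n_val])


train_pos, val_pos = exact_split(953)  # e.g. a category with 953 notes
print(len(train_pos), len(val_pos))
# -> 858 95
```

With pandas, the returned positions could index each per-category frame via `df.iloc[train_pos]` and `df.iloc[val_pos]`.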


In [17]:
val_df[['labels', 'description', 'text']].to_csv(path/'test-texts.csv', header=False, index=False)

In [14]:
val_df[['category', 'text']].groupby(['category']).agg(['count'])

| category | text (count) |
|---|---|
| Case Management | 106 |
| Consult | 11 |
| Discharge summary | 6031 |
| ECG | 20688 |
| Echo | 4618 |
| General | 865 |
| Nursing | 22190 |
| Nursing/other | 81951 |
| Nutrition | 865 |
| Pharmacy | 6 |


In [None]:
train_df[['category', 'text']].groupby(['category']).agg(['count'])

In [None]:
ori_df[['category', 'text']].groupby(['category']).agg(['count'])

In [15]:
for i in range(len(dfs)):
    assert (len(dfs[i]) == len(train_dfs[i])+len(val_dfs[i]))
    print(len(dfs[i]))
    print(len(train_dfs[i]), len(val_dfs[i]))
    print('****')

953
847 106
****
98
87 11
****
59652
53621 6031
****
209051
188363 20688
****
45794
41176 4618
****
8236
7371 865
****
223182
200992 22190
****
822497
740546 81951
****
9400
8535 865
****
101
95 6
****
141281
127242 14039
****
522279
470036 52243
****
5408
4906 502
****
31701
28518 3183
****
2661
2353 308
****
