# MIMIC Notes Pre-Processing

Pre-processing MIMIC notes for language modeling and CUIs (UMLS Concept Unique Identifiers).

Each redacted item below is listed with an example from the notes and its replacement token.

Redacted items:
* [x] First Name: `[**First Name (Titles) 137**]`, `xxname`
* [x] Last Name: `[**Last Name (Titles) **]`, `xxln`
* [x] Initials: `[**Initials (NamePattern4) **]`, `xxinit`
* [x] Name: `[**Name (NI) **]`, `xxname`
* [x] Doctor First Name: `[**Doctor First Name 1266**]`, `xxdocfn`
* [x] Doctor Last Name: `[**Doctor Last Name 1266**]`, `xxdocln`
* [x] Known Last Name: `[**Known lastname 658**]`, `xxln`
* [x] Hospital: `[**Hospital1 **]`, `xxhosp`
* [x] Hospital Unit Name: `[**Hospital Unit Name 10**]`, `xxhosp`
* [x] Company: `[**Company 12924**]`, `xxwork`
* [x] University/College: `[**University/College **]`, `xxwork`
* [x] Date of format YYYY-M-DD: `[**2112-4-18**]`, `xxdate`
* [x] Year: `[**Year (4 digits) **]`, `xxyear`
* [x] Year YYYY format: `[**2119**]`, `xxyear`. The regex `\b\d{4}\b` matches **any** standalone 4-digit number, so it can produce false positives, but for the most part a 4-digit number by itself indicates a year.
* [x] Date of format M-DD or MM/YYYY: `[**6-12**]`, `[**12/2151**]`, `xxmmdd`
* [x] Month/Day: `[**Month/Day (2) 509**]`, `xxmmdd`
* [x] Month (only): `[**Month (only) 51**]`, `xxmonth`
* [x] Holiday: `[**Holiday 3470**]`, `xxhols`
* [x] Date Range: `[**Date range (1) 7610**]`, `xxdtrnge`
* [x] Country: `[**Country 9958**]`, `xxcntry`
* [x] State: `[**State 3283**]`, `xxstate`
* [x] Location: `[**Location (un) 2432**]`, `xxloc`
* [x] Telephone/Fax: `[**Telephone/Fax (3) 8049**]`, `xxph`
* [x] Clip Number: `[**Clip Number (Radiology) 29923**]`, `xxradclip`
* [x] Pager Numeric Identifier: `[**Numeric Identifier 6403**]`, `xxpager`
* [x] Pager Number: `[**Pager number 13866**]`, `xxpager`
* [x] Social Security Number: `[**Security Number 10198**]`, `xxssn`
* [x] Serial Number: `[**Serial Number 3567**]`, `xxsno`
* [x] Medical Record Number: `[**Medical Record Number **]`, `xxmrno`
* [x] Provider Number: `[**Provider Number 12521**]`, `xxpno`
* [x] Age over 90: `[**Age over 90 **]`, `xxage90`
* [x] Contact Info: `[**Contact Info **]`, `xxcontact`
* [x] Job Number: `[**Job Number **]`, `xxjobno`
* [x] Dictator Info: `[**Dictator Info **]`, `xxdict`
* [x] Pharmacy MD Number/MD number: `[**Pharmacy MD Number **]`, `xxmdno`
* [x] Time: `12:52 PM`. The day is split into 6 segments by hour, and each time is replaced with the token for its segment: `midnight`, `dawn`, `forenoon`, `afternoon`, `dusk`, `night`
* 2-digit Numbers: `[** 84 **]`, `xx2digit`
* 3-digit Numbers: `[** 834 **]`, `xx3digit`
* Wardname
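
The mapping above can be sketched as a small rule table applied to every `[** ... **]` marker. This is only an illustration, not the actual `process_note` implementation (which lives in `process_notes` and is not shown here): `TOKEN_RULES`, `replace_redactions`, and `time_segment` are hypothetical names, only a few of the listed rules are included, and the six time segments are assumed to be equal 4-hour windows because the exact boundaries are not stated.

```python
import re

# Hypothetical subset of the rules listed above: (pattern for the marker body, token).
# More specific patterns are listed before more general ones.
TOKEN_RULES = [
    (re.compile(r"doctor first name", re.IGNORECASE), "xxdocfn"),
    (re.compile(r"doctor last name", re.IGNORECASE), "xxdocln"),
    (re.compile(r"hospital", re.IGNORECASE), "xxhosp"),
    (re.compile(r"^\d{4}-\d{1,2}-\d{1,2}$"), "xxdate"),
    (re.compile(r"^\d{4}$"), "xxyear"),
]

# Matches one redaction marker and captures its body.
REDACTION = re.compile(r"\[\*\*(.*?)\*\*\]")


def replace_redactions(text):
    """Replace each known redaction marker with its token; leave unknown ones as-is."""
    def _sub(match):
        inner = match.group(1).strip()
        for pattern, token in TOKEN_RULES:
            if pattern.search(inner):
                return token
        return match.group(0)
    return REDACTION.sub(_sub, text)


# Assumption: six equal 4-hour windows starting at 00:00.
SEGMENTS = ["midnight", "dawn", "forenoon", "afternoon", "dusk", "night"]


def time_segment(hour):
    """Map an hour in 0-23 to one of the six time-of-day tokens."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {hour}")
    return SEGMENTS[hour // 4]
```

Under these assumptions, `replace_redactions("Seen at [**Hospital1 **] on [**2112-4-18**]")` yields `"Seen at xxhosp on xxdate"`, and `time_segment(12)` maps `12:52 PM` to `afternoon`.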

## Imports and Inits

In [1]:
import pandas as pd
import psycopg2
import numpy as np
import re
import random
import datetime

In [2]:
from process_notes import *

## Grab sample data from MIMIC

In [3]:
cats = pd.read_csv('cats.csv')
max_limit = -1  # max notes to sample per category; <= 0 means take every note

queries = []
for category, n_notes in zip(cats['category'], cats['number_of_notes']):
    # Sample randomly only when the category has at least max_limit notes
    limit = min(max_limit, n_notes) if max_limit > 0 else n_notes
    if limit == max_limit:
        q = f"""
        select category, text from correctnotes where category='{category}' order by random() limit {limit};
        """
    else:
        q = f"""
        select category, text from correctnotes where category='{category}';
        """
    queries.append(q)
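
The queries above interpolate the category name directly into the SQL string, which breaks if a category ever contains a quote. A minimal sketch of the same logic with placeholders follows; `build_query` is a hypothetical helper, and the `%s` placeholder style assumes the psycopg2 driver used in this notebook.

```python
def build_query(category, n_notes, max_limit=-1):
    """Return (sql, params) for one category.

    A random sample of max_limit notes is requested only when max_limit is
    positive and the category has at least that many notes; otherwise every
    note in the category is selected.
    """
    sql = "select category, text from correctnotes where category = %s"
    params = [category]
    if 0 < max_limit <= n_notes:
        sql += " order by random() limit %s"
        params.append(max_limit)
    return sql + ";", tuple(params)
```

Each pair could then be passed along as `pd.read_sql_query(sql, con, params=params)`, so the driver handles quoting.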

In [4]:
%%time
dfs = []

con = psycopg2.connect(dbname='mimic', user='sudarshan', host='/var/run/postgresql')
try:
    for q in queries:
        dfs.append(pd.read_sql_query(q, con))
finally:
    con.close()  # release the connection even if a query fails

# Stack the per-category frames into one frame with a fresh 0..n-1 index
df = pd.concat(dfs, ignore_index=True)
# df.set_index('row_id', inplace=True)
print(df.shape)

(2082294, 2)
CPU times: user 4 s, sys: 2.42 s, total: 6.43 s
Wall time: 14.5 s


In [5]:
# df[['category', 'text']].groupby(['category']).agg(['count'])

In [6]:
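# Any redaction marker, e.g. [**Hospital1 **]
pat = re.compile(r'\[\*\*(.*?)\*\*\]', re.IGNORECASE)
# Two-digit redaction followed by am/pm, e.g. [**11**] am
tpat = re.compile(r'\[\*\*(\d{2})\*\*\] \b[ap].?m.?\b', re.IGNORECASE)
# Age expressions such as "years old", "yr old", "y/o", "yo"
ypat = re.compile(r'-?\byears? ?-?old\b|\by(?:o|r)*[ ./-]*o(?:ld)?\b', re.IGNORECASE)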
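
To make the three patterns concrete, here is a small check on a made-up sentence (not MIMIC data). The patterns are copied from the cell above, with the am/pm character class written as `[ap]`.

```python
import re

pat = re.compile(r'\[\*\*(.*?)\*\*\]', re.IGNORECASE)
tpat = re.compile(r'\[\*\*(\d{2})\*\*\] \b[ap].?m.?\b', re.IGNORECASE)
ypat = re.compile(r'-?\byears? ?-?old\b|\by(?:o|r)*[ ./-]*o(?:ld)?\b', re.IGNORECASE)

# Invented example sentence for illustration only.
sample = "[**Known lastname 658**] is a 62 y/o male seen at [**11**] am; wife is 70 years old."

redactions = pat.findall(sample)                    # bodies of all [** **] markers
hours = tpat.findall(sample)                        # two-digit markers followed by am/pm
ages = [m.group(0) for m in ypat.finditer(sample)]  # age expressions

print(redactions)  # ['Known lastname 658', '11']
print(hours)       # ['11']
print(ages)        # ['y/o', 'years old']
```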

In [7]:
%%time
df['scrubbed'] = df['text'].apply(process_note)

CPU times: user 17min 49s, sys: 1.43 s, total: 17min 50s
Wall time: 17min 50s


In [None]:
test = df.iloc[np.random.randint(0, len(df))]['text']
# for m in tpat.finditer(test):
#     print(m)

print(test)

In [None]:
out = process_note(test)
for m in tpat.finditer(out):
    print(m)

print(out)

In [None]:
for i, row in df.iterrows():
    matches = tpat.findall(row['text'])
    if matches:
        print(i, matches)

In [None]:
for i, row in df.iterrows():
    matches = tpat.findall(row['scrubbed'])
    if matches:
        print(i, matches)

In [None]:
for i, row in df.iterrows():
    m = pat.findall(row['scrubbed'])
    if len(m) != 0 and m[0] != '' and m[0] != ' ':
        print(i, m)

1105725 ['URL ', 'URL ']
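
The loop above prints every note that still contains a redaction marker. When many notes are affected, tallying the leftover marker types may be easier to read. The following is a sketch only; `leftover_redactions` is a hypothetical helper that accepts any iterable of strings, such as `df['scrubbed']`.

```python
import re
from collections import Counter

pat = re.compile(r'\[\*\*(.*?)\*\*\]')


def leftover_redactions(texts):
    """Count non-blank redaction markers that remain in the given texts."""
    counts = Counter()
    for text in texts:
        for inner in pat.findall(text):
            label = inner.strip()
            if label:  # skip empty markers such as [** **]
                counts[label] += 1
    return counts


counts = leftover_redactions(["see [**URL **] and [**URL **]", "clean note", "[** **]"])
print(counts)  # Counter({'URL': 2})
```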
