# MIMIC Notes Pre-Processing

Pre-processing MIMIC notes for language modeling and CUIs (UMLS Concept Unique Identifiers).

The list below shows each type of redacted item with an example placeholder and its replacement token.

Redacted items:
* [x] First Name: `[**First Name (Titles) 137**]`, `t_firstname`
* [x] Last Name: `[**Last Name (Titles) **]`, `t_lastname`
* [x] Initials: `[**Initials (NamePattern4) **]`, `t_initials`
* [x] Name: `[**Name (NI) **]`, `t_name`
* [x] Doctor First Name: `[**Doctor First Name 1266**]`, `t_doctor_firstname`
* [x] Doctor Last Name: `[**Doctor Last Name 1266**]`, `t_doctor_lastname`
* [x] Known Last Name: `[**Known lastname 658**]`, `t_lastname`
* [x] Hospital: `[**Hospital1 **]`, `t_hospital`
* [x] Hospital Unit Name: `[**Hospital Unit Name 10**]`, `t_hospital`
* [x] Company: `[**Company 12924**]`, `t_workplace`
* [x] University/College: `[**University/College **]`, `t_workplace`
* [x] Date of format YYYY-M-DD: `[**2112-4-18**]`, `t_fulldate`
* [x] Year: `[**Year (4 digits) **]`, `t_year`
* [x] Year YYYY format: `[**2119**]`, `t_year`. I use the regex `\b\d{4}\b`, which matches **any** standalone 4-digit number. This can produce false positives, but for the most part a 4-digit number by itself indicates a year.
* [x] Date of format M-DD or MM/YYYY: `[**6-12**]`, `[**12/2151**]`, `t_monthday`
* [x] Month/Day: `[**Month/Day (2) 509**]`, `t_monthday`
* [x] Month (only): `[**Month (only) 51**]`, `t_month`
* [x] Holiday: `[**Holiday 3470**]`, `t_month`
* [x] Date Range: `[**Date range (1) 7610**]`, `t_daterange`
* [x] Country: `[**Country 9958**]`, `t_country`
* [x] State: `[**State 3283**]`, `t_state`
* [x] Location: `[**Location (un) 2432**]`, `t_location`
* [x] Telephone/Fax: `[**Telephone/Fax (3) 8049**]`, `t_phone`
* [x] Clip Number: `[**Clip Number (Radiology) 29923**]`, `t_radclip_id`
* [x] Pager Numeric Identifier: `[**Numeric Identifier 6403**]`, `t_pager_id`
* [x] Pager Number: `[**Pager number 13866**]`, `t_pager_id`
* [x] Social Security Number: `[**Security Number 10198**]`, `t_ssn`
* [x] Serial Number: `[**Serial Number 3567**]`, `t_sn`
* [x] Medical Record Number: `[**Medical Record Number **]`, `t_mrn`
* [x] Provider Number: `[**Provider Number 12521**]`, `t_provider_no`
* [x] Age over 90: `[**Age over 90 **]`, `t_oldage`
* [x] Time: `12:52 PM`. The day is split into 6 segments by hour, and each time is replaced with the token for its segment: `midnight`, `dawn`, `forenoon`, `afternoon`, `dusk`, `night`
* Just numbers: `[** 7901**]`
* Wardname
* Pharmacy MD Number
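
The mapping in the list above can be sketched as a small lookup from placeholder label to replacement token. This is only a minimal sketch, not the actual `process_note` implementation: `TOKEN_MAP`, `label_to_token`, and `replace_placeholders` are hypothetical names, only a handful of entries from the list are included, and unknown placeholders are deliberately left untouched.

```python
import re

# Matches one MIMIC de-identification placeholder, e.g. [**Hospital1 **].
PLACEHOLDER = re.compile(r'\[\*\*(.*?)\*\*\]')

# Hypothetical subset of the list above: (label prefix, replacement token).
# Order matters: more specific prefixes come before more general ones.
TOKEN_MAP = [
    ('doctor first name', 't_doctor_firstname'),
    ('doctor last name', 't_doctor_lastname'),
    ('first name', 't_firstname'),
    ('last name', 't_lastname'),
    ('known lastname', 't_lastname'),
    ('hospital', 't_hospital'),
    ('telephone/fax', 't_phone'),
    ('age over 90', 't_oldage'),
]

FULL_DATE = re.compile(r'\d{4}-\d{1,2}-\d{1,2}')  # e.g. 2112-4-18


def label_to_token(label):
    """Return the token for a placeholder label, or None if it is unknown."""
    label = label.strip().lower()
    if FULL_DATE.fullmatch(label):
        return 't_fulldate'
    for prefix, token in TOKEN_MAP:
        if label.startswith(prefix):
            return token
    return None


def replace_placeholders(text):
    """Swap known placeholders for tokens; leave unknown ones as-is."""
    def _sub(match):
        token = label_to_token(match.group(1))
        return token if token is not None else match.group(0)
    return PLACEHOLDER.sub(_sub, text)


print(replace_placeholders(
    'Seen at [**Hospital1 **] on [**2112-4-18**] by Dr. [**Last Name (Titles) **].'
))
```

Leaving unknown placeholders in place (rather than dropping them) makes it possible to search the scrubbed text afterwards for anything the rules missed.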

## Imports and Inits

In [1]:
import pandas as pd
import psycopg2
import numpy as np
import re
import random
import datetime

In [2]:
from process_notes import *

## Grab sample data from MIMIC

In [3]:
cats = pd.read_csv('cats.csv')
max_limit = 100

queries = []
for category, n_notes in zip(cats['category'], cats['number_of_notes']):
    limit = min(max_limit, n_notes) if max_limit > 0 else n_notes
    if limit == max_limit:
        q = f"""
        select category, text from correctnotes where category='{category}' order by random() limit {limit};
        """
    else:
        q = f"""
        select category, text from correctnotes where category='{category}';
        """
    queries.append(q)
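
Interpolating `category` directly into the SQL string breaks if a category name ever contains a single quote. One alternative, sketched here under assumptions, is to build `(sql, params)` pairs with `%s` placeholders (the psycopg2 parameter style) and pass the parameters separately. `build_queries` is a hypothetical helper, not part of `process_notes`; the table and column names are taken from the cell above, and the sampling condition is meant to mirror that cell for positive `max_limit`.

```python
def build_queries(categories, counts, max_limit=100):
    """Build (sql, params) pairs: sample large categories, take small ones whole."""
    queries = []
    for category, n_notes in zip(categories, counts):
        if 0 < max_limit <= n_notes:
            # Enough notes available: draw a random sample of max_limit rows.
            sql = ("select category, text from correctnotes "
                   "where category = %s order by random() limit %s;")
            params = (category, max_limit)
        else:
            # Fewer notes than the cap (or no cap): take every note.
            sql = "select category, text from correctnotes where category = %s;"
            params = (category,)
        queries.append((sql, params))
    return queries


for sql, params in build_queries(["Nursing", "O'Brien"], [500, 20]):
    print(params, '|', sql)
```

Each pair could then be run with `pd.read_sql_query(sql, con, params=params)`, which leaves quoting to the database driver.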

In [4]:
# limit = 50
# queries = [
#     f"""
#     select category, text from correctnotes where category=\'{cats.iloc[7]['category']}\' order by random() limit {limit}
#     """
# ]

In [5]:
%%time
dfs = []

con = psycopg2.connect(dbname='mimic', user='sudarshan', host='/var/run/postgresql')
try:
    for q in queries:
        df = pd.read_sql_query(q, con)
        dfs.append(df)
finally:
    con.close()  # release the connection even if a query fails
    
df = pd.concat(dfs)
df.reset_index(inplace=True, drop=True)
# df.set_index('row_id', inplace=True)
print(df.shape)

(1398, 2)
CPU times: user 33.3 ms, sys: 3.75 ms, total: 37.1 ms
Wall time: 3.82 s


In [6]:
pat = re.compile(r'\[\*\*(.*?)\*\*\]', re.IGNORECASE)
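# [ap] rather than [a|p] (inside a character class, | is a literal pipe);
# \. rather than . so that only an optional period is allowed around the "m"
tpat = re.compile(r'\[\*\*(\d{2})\*\*\] \b[ap]\.?m\.?\b', re.IGNORECASE)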
ypat = re.compile(r'-?\byears? ?-?old\b|\by(?:o|r)*[ ./-]*o(?:ld)?\b', re.IGNORECASE)
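
A quick way to see what `pat` and `ypat` pick up is to run them on a few short strings. The check below copies the two patterns verbatim from the cell above; the sample strings are made up for illustration and only imitate the style of MIMIC notes.

```python
import re

# Same patterns as in the cell above.
pat = re.compile(r'\[\*\*(.*?)\*\*\]', re.IGNORECASE)
ypat = re.compile(r'-?\byears? ?-?old\b|\by(?:o|r)*[ ./-]*o(?:ld)?\b', re.IGNORECASE)

# pat captures the label inside each [** ... **] placeholder (lazy match,
# so two placeholders on one line are captured separately).
line = "Referral date: [**2167-12-17**], ESRD s/p transplant '[**62**]"
print(pat.findall(line))  # ['2167-12-17', '62']

# ypat finds the "years old" marker in several common spellings.
for phrase in ['56 yo M', '56 y.o. male', '56-year-old male', '56 years old']:
    print(repr(phrase), '->', repr(ypat.search(phrase).group(0)))
```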

In [7]:
%%time
df['scrubbed'] = df['text'].apply(process_note)

CPU times: user 968 ms, sys: 0 ns, total: 968 ms
Wall time: 967 ms
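
The `Time` item in the redacted-items list describes splitting the day into six segments by hour. The source does not state where the segment boundaries lie, so the sketch below assumes six equal 4-hour windows starting at 00:00. `SEGMENTS`, `TIME_12H`, `hour_to_segment`, and `replace_times` are hypothetical names and may differ from what `process_note` actually does.

```python
import re

SEGMENTS = ['midnight', 'dawn', 'forenoon', 'afternoon', 'dusk', 'night']

# Matches 12-hour clock times such as "12:52 PM" or "7:05 am".
TIME_12H = re.compile(r'\b(1[0-2]|0?[1-9]):([0-5]\d) ?([ap])\.?m\b', re.IGNORECASE)


def hour_to_segment(hour24):
    """Map an hour in 0..23 to one of six 4-hour segments (assumed split)."""
    return SEGMENTS[hour24 // 4]


def replace_times(text):
    """Replace each 12-hour clock time with the token for its segment."""
    def _sub(match):
        hour = int(match.group(1)) % 12      # 12 AM -> 0, 12 PM -> 0 (+12 below)
        if match.group(3).lower() == 'p':
            hour += 12                       # convert PM hours to 24-hour clock
        return hour_to_segment(hour)
    return TIME_12H.sub(_sub, text)


print(replace_times('Pt seen at 12:52 PM, labs drawn 7:05 am.'))
```

Under the assumed split, `12:52 PM` falls in the 12:00 to 15:59 window and becomes `afternoon`, while `7:05 am` becomes `dawn`.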


In [23]:
test = df.iloc[1183]['text']
# for m in tpat.finditer(test):
#     print(m)

print(test)

Attending Physician: [**Name10 (NameIs) 33**]
   Referral date: [**2167-12-17**]
   Medical Diagnosis / ICD 9: hyperglycemia / 790.29
   Reason of referral: Eval & treat
   History of Present Illness / Subjective Complaint: 56 yo M admitted
   from clinic with c/o LLE weakness and back pain as well as h/o 2 falls
   over the past week.  Found to have glucose of 1300 and was started on
   insulin gtt.
   Past Medical / Surgical History: DM2 with neuropathy, ESRD s/p
   transplant '[**62**], HTN, hypercholesterolemia, GERD, obesity, h/o R charcot
   foot, s/p ccy, chronic back pain s/p L4-5 lami '[**65**]
   Medications: aspirin, percocet, ultram, insulin
   Radiology: CXR [**12-17**]- Lungs clear. Heart size normal. No pleural
   abnormality or evidence of central adenopathy
   Labs:
   45.1
   14.6
   168
   2.6
         [image002.jpg]
   Other labs:
   Activity Orders: OK for OOB per micu team
   Social / Occupational History: lives with his wife, wife is out of
   state until [**Mont

In [24]:
out = process_note(test)
for m in ypat.finditer(out):
    print(m)

print(out)

Attending Physician: xxname
   Referral date: xxdate
   Medical Diagnosis / ICD 9: hyperglycemia / 790.29
   Reason of referral: Eval & treat
   History of Present Illness / Subjective Complaint: 56 xxage M admitted
   from clinic with c/o LLE weakness and back pain as well as h/o 2 falls
   over the past week.  Found to have glucose of 1300 and was started on
   insulin gtt.
   Past Medical / Surgical History: DM2 with neuropathy, ESRD s/p
   transplant '[**62**], HTN, hypercholesterolemia, GERD, obesity, h/o R charcot
   foot, s/p ccy, chronic back pain s/p L4-5 lami '[**65**]
   Medications: aspirin, percocet, ultram, insulin
   Radiology: CXR xxmmdd- Lungs clear. Heart size normal. No pleural
   abnormality or evidence of central adenopathy
   Labs:
   45.1
   14.6
   168
   2.6
         [image002.jpg]
   Other labs:
   Activity Orders: OK for OOB per micu team
   Social / Occupational History: lives with his wife, wife is out of
   state until xxyear
   Living Environment: lives i

In [None]:
for i, text in df['text'].items():
    matches = tpat.findall(text)  # run the regex once per note
    if matches:
        print(i, matches)

In [None]:
for i, text in df['scrubbed'].items():
    matches = tpat.findall(text)  # run the regex once per note
    if matches:
        print(i, matches)

In [8]:
for i, text in df['scrubbed'].items():
    matches = pat.findall(text)  # placeholders that survived scrubbing
    if matches:
        print(i, matches)

439 ['90']
618 [' 193']
629 [' 25']
632 [' 34']
634 [' 502']
640 [' 103']
661 [' 8', ' 8']
675 [' 8']
693 [' 18', ' 29']
792 ['62']
914 ['26']
960 ['99']
1098 ['71', '81', '88']
1107 ['23']
1108 ['47']
1160 ['17', '98']
1168 ['23', '91']
1182 ['38']
1183 ['62', '65']
1193 ['67', '68', '69', '72', '73']
1335 ['83']
1347 ['91']
1351 ['47']
1356 ['47']
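
All of the leftovers printed above are bare numbers, some with a leading space. To see how they split by shape, they can be counted with a `Counter`. This is a minimal sketch: `classify_leftover` and `summarize_leftovers` are hypothetical helpers, and the two buckets are an assumption based on the output above (labels with a leading space resemble the unchecked "Just numbers" item, and two-digit labels may be abbreviated years).

```python
import re
from collections import Counter

pat = re.compile(r'\[\*\*(.*?)\*\*\]')


def classify_leftover(label):
    """Bucket an unreplaced placeholder label by its shape (assumed buckets)."""
    if re.fullmatch(r'\d{2}', label):
        return 'two_digit'      # e.g. '[**62**], possibly an abbreviated year
    if re.fullmatch(r'\s*\d+', label):
        return 'bare_number'    # e.g. [** 193**]
    return 'other'


def summarize_leftovers(texts):
    """Count leftover placeholder shapes across an iterable of scrubbed notes."""
    counts = Counter()
    for text in texts:
        counts.update(classify_leftover(label) for label in pat.findall(text))
    return counts


sample = ["transplant '[**62**], lami '[**65**]", "bed [** 193**]", "no placeholders"]
print(summarize_leftovers(sample))
```

On the notebook's data this would be called as `summarize_leftovers(df['scrubbed'])`.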
