# MIMIC Notes Pre-Processing

Pre-processing of MIMIC notes for language modeling and CUIs.

## Imports and Inits

In [1]:
import pandas as pd
import psycopg2
import numpy as np

from pathlib import Path

In [2]:
PATH = Path('data')

## Functions

## Grab sample data from MIMIC

In [3]:
cats = pd.read_csv('cats.csv')
cats.sort_values('number_of_notes', ascending=False)

In [11]:
cats = pd.read_csv('cats.csv')
max_limit = 4  # notes sampled per category; a value <= 0 disables sampling

queries = []
for category, n_notes in zip(cats['category'], cats['number_of_notes']):
    # Sample at random only when the category has at least max_limit notes;
    # otherwise fetch every note in the category.
    if 0 < max_limit <= n_notes:
        q = f"""
        select category, text from correctnotes where category='{category}' order by random() limit {max_limit};
        """
    else:
        q = f"""
        select category, text from correctnotes where category='{category}';
        """
    queries.append(q)
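The cell above interpolates the category name directly into the SQL string. A minimal sketch of an alternative, assuming the same `correctnotes` table, keeps the values out of the SQL text and passes them as parameters instead. `build_query` is a hypothetical helper, not part of the original notebook, and the category names in the usage comment are example values only.

```python
def build_query(category, n_notes, max_limit=4):
    """Return (sql, params) for one category (hypothetical helper).

    Uses psycopg2-style %s placeholders so the driver handles quoting.
    """
    sql = "select category, text from correctnotes where category = %s"
    params = [category]
    # Same rule as the loop above: sample only when there are enough notes.
    if 0 < max_limit <= n_notes:
        sql += " order by random() limit %s"
        params.append(max_limit)
    return sql, tuple(params)


# Usage, assuming an open psycopg2 connection `con`:
# sql, params = build_query('Nursing', 1000)
# df = pd.read_sql_query(sql, con, params=params)
```

This keeps the sampling rule in one place and avoids breaking the query if a category name ever contains a quote character.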

In [None]:
# limit = 50
# queries = [
#     f"""
#     select category, text from correctnotes where category=\'{cats.iloc[7]['category']}\' order by random() limit {limit}
#     """
# ]

In [12]:
dfs = []

con = psycopg2.connect(dbname='mimic', user='sudarshan', host='/var/run/postgresql')
try:
    for q in queries:
        dfs.append(pd.read_sql_query(q, con))
finally:
    con.close()  # release the connection even if a query fails

df = pd.concat(dfs)
# df.set_index('row_id', inplace=True)
df.shape

(56, 2)

1. Get the list of redacted types using `re`.
2. Replace each one with an appropriate placeholder token.

Redacted items:
* ~~First Name~~
* ~~Last Name~~
* Hospital
* Date of format M-DD
* Date of format YYYY-M-DD
* ~~Name~~
* NULL
* ~~Known lastname~~
* ~~Doctor First Name~~
* Doctor Last Name
* Month (only)
* Just numbers
* Location
* Month/Day
* Telephone/Fax
* Wardname
* Numeric Identifier
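Step 1 above could be sketched roughly as follows: collect every `[** ... **]` tag in a note, drop the trailing numeric id, and count what remains. The helper name `redaction_types` and the `<non-text>` label for tags without letters (dates, bare numbers, empty tags) are assumptions made for illustration.

```python
import re
from collections import Counter

TAG = re.compile(r'\[\*\*(.*?)\*\*\]')


def redaction_types(text):
    """Count redaction tag types in a note (hypothetical helper)."""
    counts = Counter()
    for inner in TAG.findall(text):
        if not re.search(r'[A-Za-z]', inner):
            # Dates, bare numbers and empty tags carry no textual label.
            label = '<non-text>'
        else:
            # "Last Name (un) 550" -> "Last Name (un)"; "Hospital1 " -> "Hospital1"
            label = re.sub(r'\s+\d+\s*$', '', inner).strip()
        counts[label] += 1
    return counts
```

Summing these counters over all sampled notes would give a frequency table of tag types to check against the list above.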

In [13]:
import re
import random

In [31]:
test = df.iloc[random.randrange(len(df))]['text']  # pick any row, not just the first 51
print(test)

Pharmacy Note
 Sedation
   Assessment:
   Day 19 of continuous sedation; Currently on midazolam 3 mg/hr (low dose
   compared to previous doses this stay); currently no pain medications;
   currently no antipsychotic medications; [**Last Name (un) 550**] 5 (agitated);  Has been
   on continuous pentobarbital, fentanyl, propofol, midazolam and
   intermittent methadone this stay.
   Recommendation:
          Continue midazolam as is and wean as tolerated 25 to 50% per
   day.  Given the extended duration of exposure to sedatives, a schedule
   of lorazepam 2 mg q6h may be required to avoid benzodiazepine
   withdrawal once the versed drip is completely discontinued; would
   schedule the lorazepam and taper to off over a 5 to 7 day period (2 mg
   q6h x 2 days, 2 mg q8h x 3 days, then 2 mg [**Hospital1 **] x 2 days then only as
   needed.  Supplemental pain medication may also be required to avoid
   agitation when removing the benzodiazepine.
          Consider supplementing haloperido

In [32]:
pat_all = re.compile(r'\[\*\*(.*?)\*\*\]', re.IGNORECASE)

for m in pat_all.finditer(test):
    print(m)

<_sre.SRE_Match object; span=(230, 254), match='[**Last Name (un) 550**]'>
<_sre.SRE_Match object; span=(800, 816), match='[**Hospital1 **]'>
<_sre.SRE_Match object; span=(1465, 1495), match='[**Initials (NamePattern4) **]'>
<_sre.SRE_Match object; span=(1496, 1527), match='[**Last Name (NamePattern4) **]'>
<_sre.SRE_Match object; span=(1545, 1573), match='[**Numeric Identifier 499**]'>


In [33]:
pat_name = re.compile(r'\[\*\*(.*?Name.*?)\*\*\]', re.IGNORECASE)

for m in pat_name.finditer(test):
    print(m)

<_sre.SRE_Match object; span=(230, 254), match='[**Last Name (un) 550**]'>
<_sre.SRE_Match object; span=(1465, 1495), match='[**Initials (NamePattern4) **]'>
<_sre.SRE_Match object; span=(1496, 1527), match='[**Last Name (NamePattern4) **]'>
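One caveat with `pat_name`: because `.*?` may run past a closing `**]`, a match can start at an unrelated tag earlier on the same line and extend to a later name tag. The sketch below reproduces this on a made-up line built from tags seen in the note, and compares it with a tighter variant. `pat_name_strict` is a hypothetical alternative that assumes tag content never contains square brackets.

```python
import re

# Pattern from the cell above.
pat_name = re.compile(r'\[\*\*(.*?Name.*?)\*\*\]', re.IGNORECASE)
# Hypothetical tighter variant: tag content may not contain '[' or ']'.
pat_name_strict = re.compile(r'\[\*\*([^\[\]]*?Name[^\[\]]*?)\*\*\]', re.IGNORECASE)

# Made-up line with a non-name tag followed by a name tag.
line = '2 mg [**Hospital1 **] per [**Last Name (un) 550**]'

loose = pat_name.search(line).group()
strict = pat_name_strict.search(line).group()
print(loose)   # spans both tags, so the hospital tag would be lost on substitution
print(strict)  # only the name tag
```

If the bracket assumption holds for the notes, the stricter pattern could be swapped in without changing the matches shown above.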


In [34]:
def replace_name(m):
    """Map a matched name tag to a placeholder token."""
    tag = m.group().lower()
    is_doctor = 'doctor' in tag
    if 'last' in tag:
        return 't_doc_ln' if is_doctor else 't_ln'
    if 'first' in tag:
        return 't_doc_fn' if is_doctor else 't_fn'
    return 't_name'  # any other name tag, e.g. initials
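The same callback approach might extend to the redaction types in the list that are not yet handled. The following is a sketch only: the token names (`t_hosp`, `t_phone`, and so on), the rule table, and the date formats are all assumptions, and name tags are deliberately left untouched so `replace_name` can handle them.

```python
import re

# Assumes tag content never contains square brackets.
TAG = re.compile(r'\[\*\*([^\[\]]*?)\*\*\]')

# (pattern on tag content, placeholder token); token names are hypothetical.
RULES = [
    (re.compile(r'hospital', re.I), 't_hosp'),
    (re.compile(r'telephone|fax', re.I), 't_phone'),
    (re.compile(r'wardname', re.I), 't_ward'),
    (re.compile(r'location', re.I), 't_loc'),
    (re.compile(r'numeric identifier', re.I), 't_id'),
    (re.compile(r'^\d{4}-\d{1,2}-\d{1,2}$'), 't_date'),   # YYYY-M-DD
    (re.compile(r'^\d{1,2}-\d{1,2}$'), 't_monthday'),     # M-DD
]


def replace_other(m):
    """Return the token of the first matching rule, else the tag unchanged."""
    inner = m.group(1).strip()
    for pat, token in RULES:
        if pat.search(inner):
            return token
    return m.group()


sample = ('2 mg [**Hospital1 **] x 2 days, id [**Numeric Identifier 499**] '
          'on [**2101-10-20**] and [**Last Name (un) 550**]')
result = TAG.sub(replace_other, sample)
print(result)
```

Rule order matters here, since the first matching rule wins; a table like this is easy to reorder or extend as new tag types turn up.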

In [35]:
out = pat_name.sub(replace_name, test)

In [36]:
for m in pat_name.finditer(out):
    print(m)

print(out)

Pharmacy Note
 Sedation
   Assessment:
   Day 19 of continuous sedation; Currently on midazolam 3 mg/hr (low dose
   compared to previous doses this stay); currently no pain medications;
   currently no antipsychotic medications; t_ln 5 (agitated);  Has been
   on continuous pentobarbital, fentanyl, propofol, midazolam and
   intermittent methadone this stay.
   Recommendation:
          Continue midazolam as is and wean as tolerated 25 to 50% per
   day.  Given the extended duration of exposure to sedatives, a schedule
   of lorazepam 2 mg q6h may be required to avoid benzodiazepine
   withdrawal once the versed drip is completely discontinued; would
   schedule the lorazepam and taper to off over a 5 to 7 day period (2 mg
   q6h x 2 days, 2 mg q8h x 3 days, then 2 mg [**Hospital1 **] x 2 days then only as
   needed.  Supplemental pain medication may also be required to avoid
   agitation when removing the benzodiazepine.
          Consider supplementing haloperidol 5 mg q6h or q4h fo