# MIMIC Notes Pre-Processing

Pre-processing MIMIC notes for further use.

Below is a list of redacted items, each with an example from the notes and the token that replaces it. The replacement tokens can be changed. Check `preprocess_notes.py` for more details.

Redacted items:
* [x] First Name: `[**First Name (Titles) 137**]`, `xxname`
* [x] Last Name: `[**Last Name (Titles) **]`, `xxln`
* [x] Initials: `[**Initials (NamePattern4) **]`, `xxinit`
* [x] Name: `[**Name (NI) **]`, `xxname`
* [x] Doctor First Name: `[**Doctor First Name 1266**]`, `xxdocfn`
* [x] Doctor Last Name: `[**Doctor Last Name 1266**]`, `xxdocln`
* [x] Known Last Name: `[**Known lastname 658**]`, `xxln`
* [x] Hospital: `[**Hospital1 **]`, `xxhosp`
* [x] Hospital Unit Name: `[**Hospital Unit Name 10**]`, `xxhosp`
* [x] Company: `[**Company 12924**]`, `xxwork`
* [x] University/College: `[**University/College **]`, `xxwork`
* [x] Date of format YYYY-M-DD: `[**2112-4-18**]`, `xxdate`
* [x] Year: `[**Year (4 digits) **]`, `xxyear`
* [x] Year YYYY format: `[**2119**]`, `xxyear` - I use the regex `\b\d{4}\b`, which matches **any** standalone 4-digit number. This can produce false positives, but for the most part a 4-digit number by itself indicates a year.
* [x] Date of format M-DD or MM/YYYY: `[**6-12**]`, `[**12/2151**]`, `xxmmdd`
* [x] Month/Day: `[**Month/Day (2) 509**]`, `xxmmdd`
* [x] Month (only): `[**Month (only) 51**]`, `xxmonth`
* [x] Holiday: `[**Holiday 3470**]`, `xxhols`
* [x] Date Range: `[**Date range (1) 7610**]`, `xxdtrnge`
* [x] Country: `[**Country 9958**]`, `xxcntry`
* [x] State: `[**State 3283**]`, `xxstate`
* [x] Location: `[**Location (un) 2432**]`, `xxloc`
* [x] Telephone/Fax: `[**Telephone/Fax (3) 8049**]`, `xxph`
* [x] Clip Number: `[**Clip Number (Radiology) 29923**]`, `xxradclip`
* [x] Pager Numeric Identifier: `[**Numeric Identifier 6403**]`, `xxpager`
* [x] Pager Number: `[**Pager number 13866**]`, `xxpager`
* [x] Social Security Number: `[**Security Number 10198**]`, `xxssn`
* [x] Serial Number: `[**Serial Number 3567**]`, `xxsno`
* [x] Medical Record Number: `[**Medical Record Number **]`, `xxmrno`
* [x] Provider Number: `[**Provider Number 12521**]`, `xxpno`
* [x] Age over 90: `[**Age over 90 **]`, `xxage90`
* [x] Contact Info: `[**Contact Info **]`, `xxcontact`
* [x] Job Number: `[**Job Number **]`, `xxjobno`
* [x] Dictator Number: `[**Dictator Info **]`, `xxdict`
* [x] Pharmacy MD Number/MD number: `[**Pharmacy MD Number **]`, `xxmdno`
* [x] Time: `12:52 PM`, the day is split into 6 segments by hour and each time is replaced with the token of its segment: `midnight, dawn, forenoon, afternoon, dusk, night`
* [ ] 2-digit Numbers: `[** 84 **]`, `xx2digit`
* [ ] 3-digit Numbers: `[** 834 **]`, `xx3digit`
* [ ] Wardname
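
As an illustration of how the list above could translate into code, here is a minimal sketch that replaces a few redaction patterns with their tokens and buckets times into the six segments. It is not the code in `preprocess_notes.py`: the names `REDACTION_TOKENS`, `TIME_SEGMENTS`, `time_token`, and `redact` are hypothetical, only four of the patterns are covered, and the 4-hour segment boundaries (midnight = 00:00-03:59, dawn = 04:00-07:59, and so on) are an assumption.

```python
import re

# Hypothetical subset of the mapping; the real rules live in preprocess_notes.py.
REDACTION_TOKENS = [
    (re.compile(r'\[\*\*Doctor First Name[^\]]*\*\*\]', re.IGNORECASE), 'xxdocfn'),
    (re.compile(r'\[\*\*Hospital[^\]]*\*\*\]', re.IGNORECASE), 'xxhosp'),
    (re.compile(r'\[\*\*\d{4}-\d{1,2}-\d{1,2}\*\*\]'), 'xxdate'),
    (re.compile(r'\[\*\*\d{4}\*\*\]'), 'xxyear'),
]

# Assumption: six equal 4-hour segments starting at 00:00.
TIME_SEGMENTS = ['midnight', 'dawn', 'forenoon', 'afternoon', 'dusk', 'night']
TIME_PAT = re.compile(r'\b(\d{1,2}):(\d{2})\s*(AM|PM)\b', re.IGNORECASE)


def time_token(match):
    """Convert a 12-hour clock match to a 24-hour hour and pick its segment."""
    hour = int(match.group(1)) % 12
    if match.group(3).upper() == 'PM':
        hour += 12
    return TIME_SEGMENTS[hour // 4]


def redact(text):
    """Apply every redaction pattern, then replace times with segment tokens."""
    for pattern, token in REDACTION_TOKENS:
        text = pattern.sub(token, text)
    return TIME_PAT.sub(time_token, text)


example = 'Seen by [**Doctor First Name 1266**] at [**Hospital1 **] on [**2112-4-18**] at 12:52 PM'
print(redact(example))
```

Keeping the patterns in an ordered list makes it easy to put specific patterns (such as `Doctor First Name`) before more general ones (such as `First Name`) if both are added.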

`886` notes are marked as incorrect, with the `iserror` flag set to 1. Excluding them leaves a total of `2,082,294` notes. I have set up a `view` called `correctnotes` in the database, which only includes the correct notes. All the data I grab is from that `view`.
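
The view definition itself is not shown here, so the following is only a sketch of what such a view might look like. It uses an in-memory SQLite database as a stand-in for the PostgreSQL instance, and the table name `noteevents`, the toy rows, and the exact `WHERE` clause are assumptions.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL MIMIC database (assumption).
con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE noteevents (
        row_id INTEGER PRIMARY KEY,
        category TEXT,
        iserror INTEGER,
        text TEXT
    );
    INSERT INTO noteevents VALUES
        (1, 'Nursing', NULL, 'ok note'),
        (2, 'Nursing', 1, 'note flagged as an error'),
        (3, 'Radiology', NULL, 'ok note');
    -- Assumed definition: keep every note whose iserror flag is not 1.
    CREATE VIEW correctnotes AS
        SELECT * FROM noteevents WHERE iserror IS NULL OR iserror <> 1;
""")
n_correct = con.execute('SELECT COUNT(*) FROM correctnotes').fetchone()[0]
con.close()
print(n_correct)
```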

## Imports and Inits

In [1]:
import datetime
import pickle
import random
import re
from pathlib import Path

import numpy as np
import pandas as pd
import psycopg2

Create a symbolic link (`ln -s`) named `data` in the current folder that points to your data directory. That way the path in the notebook never needs to change.

In [2]:
PATH = Path('data')

In [3]:
from preprocess_notes import *

## Grab Data from MIMIC

### From Database

Here the data is grabbed from the MIMIC database. Data can also be grabbed from other sources, such as the notes file used in the next subsection.

In [4]:
cats = pd.read_csv('note_categories.csv')
max_limit = 10

queries = []
for category, n_notes in zip(cats['category'], cats['number_of_notes']):
    limit = min(max_limit, n_notes) if max_limit > 0 else n_notes
    if limit == max_limit:
        q = f"""
        select * from correctnotes where category=\'{category}\' order by random() limit {limit};
        """
    else:
        q = f"""
        select * from correctnotes where category=\'{category}\';
        """
    queries.append(q)
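
The queries above interpolate the category name directly into the SQL string. A possible alternative, sketched under the assumption that the same sampling rule is wanted, is a small helper that returns the SQL with `%s` placeholders plus a parameter tuple, so the driver does the quoting. The name `build_query` is hypothetical.

```python
def build_query(category, n_notes, max_limit):
    """Return (sql, params) for one category.

    Sample max_limit random notes when the category has at least that many;
    otherwise (or when max_limit <= 0) select every note in the category.
    """
    if 0 < max_limit <= n_notes:
        sql = ("select * from correctnotes where category = %s "
               "order by random() limit %s;")
        return sql, (category, max_limit)
    return "select * from correctnotes where category = %s;", (category,)


sql, params = build_query('Nursing', n_notes=100, max_limit=10)
print(sql, params)
```

Each pair can then be passed to pandas as `pd.read_sql_query(sql, con, params=params)`.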

In [5]:
%%time
dfs = []

con = psycopg2.connect(dbname='mimic', user='sudarshan', host='/var/run/postgresql')
for q in queries:
    df = pd.read_sql_query(q, con)
    dfs.append(df)
con.close()
    
df = pd.concat(dfs)
print(df.shape)

(150, 10)
CPU times: user 57.3 ms, sys: 0 ns, total: 57.3 ms
Wall time: 2.26 s


### From Notes File

In [None]:
%%time
df = pd.read_csv(PATH/'NOTEEVENTS.csv.gz')
print(df.shape)

## Preprocess

In [6]:
df.columns = df.columns.str.lower()
df.set_index('row_id', inplace=True)
print(df.shape)

(150, 9)


In [None]:
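# Exploratory patterns for age mentions such as "66 yo" or "66-year-old".
# pat1 matches the "years old" / "y.o." part on its own.
pat1 = re.compile(r'-?\byears? ?-?old\b|\by(?:o|r)*[ ./-]*o(?:ld)?\b')
# pat2 captures the number and the unit; note that the dots in `y.\s*o.` are unescaped and match any character.
pat2 = re.compile(r'(\d+)\s*(year\s*old|y.\s*o.|yo|year-old|-year-old|-year old)')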

In [None]:
# Exploratory patterns for gender mentions; both are case-insensitive and
# only match when followed by whitespace or the end of the text.
patm = re.compile(r'\b(male|man|m)(?!\S)', re.IGNORECASE)
patw = re.compile(r'\b(female|woman|f)(?!\S)', re.IGNORECASE)

In [None]:
# randrange excludes len(df); randint(0, len(df)) could index one past the end.
x = df.iloc[random.randrange(len(df))]['text']
print(x)

In [None]:
patm.findall(x)

In [None]:
patw.findall(x)

In [None]:
pat1.findall(x)

In [None]:
pat2.findall(x)

Confirm that the number of notes in each category matches the expected number.

In [None]:
df[['category', 'text']].groupby(['category']).agg(['count'])

In [7]:
%%time
df['proc_text'] = df['text'].apply(preprocess_note)

CPU times: user 170 ms, sys: 0 ns, total: 170 ms
Wall time: 169 ms


In [33]:
print(df.iloc[random.randrange(len(df))]['proc_text'])

TITLE:
BEDSIDE SWALLOWING EVALUATION:
HISTORY:
Thank you for consulting on this 66 yo female with hx of prior
strokes and PD admitted on xxdate from OSH with confusion and
weakness. CT scan showed bilateral hypodensities. patient s daughter
reported recent hospitalization ~1 week PTA for possible seizure
w/ d/c home. patient was transferred from Far 11 to the ICU on xxmmdd
for vomiting x 2, left sided weakness and seizure activity. Head
CT showed new right sided infarct. patient was found with moderate
stenosis and is awaiting CEA. We were consulted to evaluate for
oral and pharyngeal dysphagia.
Other PMH includes asthma, COPD, CAD, HTN and MI
EVALUATION:
The examination was performed while the patient was seated
upright in the bed in the SICU.
Cognition, language, speech, voice:
patient was lethargic, but did open her eyes on command. patient was seen
following PT in attempts to have her most awake for the
evaluation. Language could not be assessed, as her spontaneous
output was minim

In [None]:
with open(PATH/'preprocessed_noteevents.pkl', 'wb') as f:
    pickle.dump(df, f)

## Create datasets for Language Modeling

To follow the FastAI language modeling lesson, I created a subset of the original dataframe from which to sample the datasets. It keeps the `description` and `proc_text` columns. The `description` column is free text and has `3840` unique values. I treat it as a separate `field`, which is marked as such during tokenization, as done in the FastAI library.
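
To make the idea of a marked field concrete, here is a rough sketch of how two text columns could be joined into one string with field markers before tokenization. The marker strings `xbos` and `xfld`, the field numbering, and the helper name `mark_fields` are assumptions for illustration and may not match the FastAI version in use.

```python
# Assumed marker strings; check the FastAI version you use for the real ones.
BOS, FLD = 'xbos', 'xfld'


def mark_fields(description, proc_text):
    """Join the two columns into one string with a numbered field marker each."""
    return f'\n{BOS} {FLD} 1 {description} {FLD} 2 {proc_text}'


marked = mark_fields('Nursing Progress Note', 'patient was lethargic')
print(marked)
```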

In [None]:
sub_df = pd.DataFrame(
    {'proc_text': df['proc_text'], 'category': df['category'],
     'description': df['description'], 'labels': [0] * len(df)},
    columns=['labels', 'category', 'description', 'proc_text'])
sub_df.sample(5)

A plain train/test split over the entire dataset would give a global 90/10 split. However, I want a 90%/10% split within **each category**. So I iterate over the unique values of the `category` column and build a random mask per category, so that roughly 10% of the texts in each category go to the test set rather than a global 10%.

Set random seed for reproducible results.

In [None]:
np.random.seed(42)

# One dataframe per category, each with its own random ~90/10 mask.
dfs = [sub_df.loc[sub_df['category'] == c] for c in sub_df['category'].unique()]
msks = [np.random.rand(len(d)) < 0.9 for d in dfs]

# Rows where the mask is True go to training, the rest to validation.
train_df = pd.concat([d[m] for d, m in zip(dfs, msks)])
val_df = pd.concat([d[~m] for d, m in zip(dfs, msks)])

print(len(train_df), (len(df) - len(df)//10), len(train_df)-(len(df) - len(df)//10))
print(len(val_df), (len(df)//10), len(val_df)-(len(df)//10))    
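
The random masks give an approximate 90/10 split per category, which is why the printed differences are usually not zero. If an exact 10% per category (up to rounding) is preferred, one alternative is `groupby(...).sample`, available in pandas 1.1 and later. The sketch below uses a toy dataframe with made-up categories and is not the approach used in this notebook.

```python
import pandas as pd

# Toy stand-in for sub_df: 20 notes in one category, 10 in another.
toy = pd.DataFrame({
    'category': ['Nursing'] * 20 + ['Radiology'] * 10,
    'proc_text': [f'note {i}' for i in range(30)],
})

# Draw 10% of each category for validation; everything else is training.
val_toy = toy.groupby('category').sample(frac=0.1, random_state=42)
train_toy = toy.drop(val_toy.index)

val_counts = val_toy['category'].value_counts().to_dict()
print(val_counts, len(train_toy))
```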

Sanity check the aggregate count for each category over the 3 dataframes. Then write the `train` and `val` dataframes to disk.

In [None]:
val_df[['category', 'proc_text']].groupby(['category']).agg(['count'])

In [None]:
train_df[['category', 'proc_text']].groupby(['category']).agg(['count'])

In [None]:
sub_df[['category', 'proc_text']].groupby(['category']).agg(['count'])

In [None]:
train_df[['labels', 'description', 'proc_text']].to_csv(PATH/'train.csv', header=False, index=False)
val_df[['labels', 'description', 'proc_text']].to_csv(PATH/'test.csv', header=False, index=False)