## Import Data

In [1]:
import pandas as pd
from keras.preprocessing.text import Tokenizer
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
import pickle

In [2]:
notes_database = pd.read_csv("/home/sanjaycollege15/ekg_notes.csv")

In [3]:
notes_database.head()

| Unnamed: 0 | SUBJECT_ID | ICD9_CODE | TEXT |
|---|---|---|---|
| 0 | 87424 | 4019 | Sinus rhythm with 2:1 A-V block. Right bundle... |
| 1 | 14211 | 4019 | Sinus tachycardia\nConsider old septal myocard... |
| 2 | 88174 | 4019 | Sinus rhythm. Low limb lead voltage. Compare... |
| 3 | 30927 | 4019 | Sinus rhythm. Early precordial QRS transition... |
| 4 | 30927 | 4019 | Sinus rhythm. Early precordial QRS transition... |


---

## Pre-process Text

### Drop Duplicates

In [4]:
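# "Unnamed: 0" looks like a saved row index (unique per row), so compare only the content columns
notes_database.drop_duplicates(subset=["SUBJECT_ID", "ICD9_CODE", "TEXT"], inplace=True)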

In [5]:
notes_database.drop(columns="SUBJECT_ID", inplace=True)

### Lowercase text

In [6]:
notes_database['lower_text']=notes_database.TEXT.str.lower()

In [7]:
notes_database.drop(columns='TEXT', inplace=True)

### Remove Identifiers

Privacy identifiers are in the form [\*\*2157-1-7\*\*], as shown in the example below. We can remove them with a simple regex.

In [8]:
notes_database.lower_text[0]

'sinus rhythm with 2:1 a-v block.  right bundle-branch block.  inferolateral\nst-t wave abnormalities.  compared to the previous tracing of [**2157-1-7**]\n2:1 a-v block is new.  inferolateral st-t wave abnormalities are more\nmarked.  cannot rule out myocardial ischemia or subendocardial infarction.\nsuggest clinical correlation and repeat tracing.\n\n'

In [9]:
# Non-greedy `.*?` stops at the first closing **], so text between two identifiers on one line is kept
notes_database.lower_text.replace(r'\[\*\*.*?\*\*\]', '', regex=True)[0]

'sinus rhythm with 2:1 a-v block.  right bundle-branch block.  inferolateral\nst-t wave abnormalities.  compared to the previous tracing of \n2:1 a-v block is new.  inferolateral st-t wave abnormalities are more\nmarked.  cannot rule out myocardial ischemia or subendocardial infarction.\nsuggest clinical correlation and repeat tracing.\n\n'

In [10]:
# Raw string avoids invalid-escape warnings; non-greedy `.*?` removes each identifier separately
notes_database['removedIdentifiers'] = notes_database.lower_text.replace(r'\[\*\*.*?\*\*\]', '', regex=True)
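
One detail of the pattern is worth checking: whether the wildcard between `[**` and `**]` is greedy. The sketch below uses Python's `re` module on a made-up note (the text and dates are hypothetical, not taken from the dataset) to compare a greedy `.*` with a non-greedy `.*?` when two identifiers share a line. pandas applies the same regex semantics when `regex=True`.

```python
import re

# Hypothetical note with two de-identification tags on the same line.
note = "compared to tracing of [**2157-1-7**] and [**2157-1-9**] no change."

# Greedy: matches from the first "[**" to the LAST "**]" on the line.
greedy = re.sub(r"\[\*\*.*\*\*\]", "", note)

# Non-greedy: each tag is matched and removed on its own.
lazy = re.sub(r"\[\*\*.*?\*\*\]", "", note)

print(repr(greedy))
print(repr(lazy))
```

The greedy version also deletes the word "and" that sits between the two tags, which is why the non-greedy form is the safer choice for notes that reference several dates.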

In [11]:
notes_database.drop(columns="lower_text", inplace=True)

In [12]:
notes_database.to_pickle('intermediate_ekg.pkl')

### Dropping stop words

I did not end up dropping stop words because they seemed important for interpreting the notes. Stop words are often dropped for tasks like sentiment analysis, but in clinical notes negations like "not" or "nor" can change the meaning of a finding. The models I'm going to use will need the context that stop words provide.
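
As a quick illustration of the concern, here is a minimal sketch with a tiny hand-picked stop word set. The set and the example note are assumptions made for illustration; NLTK's English list is much longer, but it does include "no", "not", and "nor".

```python
# Illustrative subset only (hypothetical); the real NLTK list is much longer.
stop_words = {"no", "not", "nor", "of", "the", "is"}

# Made-up note fragment, not taken from the dataset.
note = "no evidence of acute ischemia"

# Same filtering idea as the commented-out cell below: keep words not in the list.
filtered = " ".join(word for word in note.split() if word not in stop_words)

print(filtered)
```

After filtering, the negation is gone and the fragment reads as a positive finding, which is the opposite of what the note says.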

In [13]:
# nltk.download('stopwords')
# stop_words = stopwords.words('english')

In [14]:
#notes_database['notes_without_stopwords'] = notes_database['removedIdentifiers'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

In [15]:
#notes_database.notes_without_stopwords[0]

### Noise removal, sentence splitting

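Remove most punctuation, but leave in '-' and ':' since they're commonly used in medical terms (for example "a-v block" and "2:1"). The filter list also leaves out '.', ',', ';', '?' and the apostrophe, so those characters stay in the text and sentence boundaries survive for the sentence-splitting step.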

In [16]:
notes_database['removedIdentifiers'][0]

'sinus rhythm with 2:1 a-v block.  right bundle-branch block.  inferolateral\nst-t wave abnormalities.  compared to the previous tracing of \n2:1 a-v block is new.  inferolateral st-t wave abnormalities are more\nmarked.  cannot rule out myocardial ischemia or subendocardial infarction.\nsuggest clinical correlation and repeat tracing.\n\n'

In [17]:
tokenizer = Tokenizer(
    #num_words = 150,
    filters='!"#$%&()*+/<=>@[\\]^_`{|}~\t\n',
    split = ' ', 
    char_level = False)
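
To see what this filter string keeps and drops without loading Keras, the plain-Python sketch below approximates the word-level cleaning the `Tokenizer` applies: lowercase the text, replace every filtered character with the split character, then split and discard empty tokens. The helper name `denoise` and the sample note are my own, and this is an approximation for inspection, not the Keras implementation itself.

```python
# Same filter string as the Tokenizer above.
FILTERS = '!"#$%&()*+/<=>@[\\]^_`{|}~\t\n'


def denoise(text, filters=FILTERS, split=" "):
    """Approximate the Tokenizer's cleaning: lowercase, blank out filtered chars, re-join."""
    # Map every filtered character to the split character.
    table = str.maketrans(filters, split * len(filters))
    words = text.lower().translate(table).split(split)
    # Drop the empty strings produced by runs of spaces.
    return " ".join(w for w in words if w)


# Made-up sample in the style of the notes.
sample = "Sinus rhythm with 2:1 A-V block.  Right bundle-branch block (new)\n"
cleaned = denoise(sample)
print(cleaned)
```

Parentheses and the newline disappear, while "2:1", "a-v", "bundle-branch", and the period after "block" are left intact.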

In [18]:
tokenizer.fit_on_texts(notes_database['removedIdentifiers'])

In [19]:
sequences = tokenizer.texts_to_sequences(notes_database['removedIdentifiers'])


In [20]:
# Map each integer sequence back to words; characters listed in `filters` are now gone
denoised_sentences = [
    ' '.join(tokenizer.index_word[w] for w in seq)
    for seq in sequences
]

In [21]:
with open("denoised_sentences.txt", "wb") as fp:
    pickle.dump(denoised_sentences, fp)
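
Note that the file is written with `pickle` in binary mode, so the `.txt` extension is only a name and the file is not readable as plain text. A minimal sketch of the round trip, assuming the next notebook reads it back with `pickle.load` (the stand-in list and the temporary directory are just for illustration):

```python
import os
import pickle
import tempfile

# Stand-in for the real list of denoised notes.
sentences = ["sinus rhythm.", "2:1 a-v block is new."]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "denoised_sentences.txt")

    # Write in binary mode, as in the cell above.
    with open(path, "wb") as fp:
        pickle.dump(sentences, fp)

    # Read back in binary mode; opening with "r" would fail or return garbage.
    with open(path, "rb") as fp:
        restored = pickle.load(fp)

print(restored == sentences)
```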


---

### Tokenize into Sentences

In [25]:
# Split each note into sentences (sent_tokenize needs NLTK's "punkt" data to be downloaded)
tokenized_sentences = [sent_tokenize(note) for note in denoised_sentences]

In [26]:
with open("tokenized_sentences.txt", "wb") as fp:
    pickle.dump(tokenized_sentences, fp)

Switch to Preprocess Text - Notebook 2 to continue constructing the dataframe; this notebook ran out of memory.

### Stemming