# Generate a word frequency list

This notebook loads the vocabulary learned from the MIMIC-III free-text notes and uses it as the starting point for a custom word frequency list. We extend the vocabulary with the names of common drugs (generic, brand, and slang names) and of local mental health organisations.
The word frequency list is then built by parsing the whole dataset and collecting, in an initially empty list, every token that the extended vocabulary recognises.
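
The idea can be illustrated on toy data without any of the notebook's dependencies. This is a simplified sketch only: `build_word_list`, `known_vocab` and `comments` are hypothetical names, and whitespace splitting stands in for the scispaCy tokenizer used later in the notebook.

```python
def build_word_list(comments, known_vocab):
    """Return every token that is in known_vocab, in order of appearance."""
    words = []
    for comment in comments:
        for token in comment.lower().split():
            if token in known_vocab:
                words.append(token)
    return words

# Toy vocabulary and comments (made up for illustration)
known_vocab = {"pain", "chest", "abdo", "headache"}
comments = [
    "Chest pain since 2/7",
    "abdo pain and headache",
    "pt reports chest tightness",
]

words = build_word_list(comments, known_vocab)
print("%d unique words (%d words in total)" % (len(set(words)), len(words)))
```

Tokens that are not in the vocabulary, such as `tightness` here, are simply dropped, so the resulting list only contains spellings the vocabulary already trusts.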

In [1]:
import pandas as pd
import re
import spacy
from spellchecker import SpellChecker
import pickle
import time
from nlp_utils import preprocess, find_pattern
from custom_tokenizer import combined_rule_tokenizer

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/vrozova/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Retrieve MIMIC vocabulary**

In [2]:
# Load the vocab retrieved from the Med7 model
with open('../data/spelling_correction/med7_vocab.txt', 'rb') as f:
    vocab = pickle.load(f)

# Create an empty spellchecker object and initialise it with MIMIC vocab retrieved from the Med7 model
spell = SpellChecker(language=None)
spell.word_frequency.load_words(vocab)

print("MIMIC vocabulary contains %d unique words (%d words in total)." % 
      (spell.word_frequency.unique_words, spell.word_frequency.total_words))

MIMIC vocabulary contains 790195 unique words (1509550 words in total).
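
The unique count is the number of distinct entries in the frequency table, and the total is the sum of their occurrence counts. The gap between the two figures above suggests that the loaded list contains repeated entries, possibly case variants that collapse to one lowercase form, although the notebook does not confirm the cause. A minimal sketch of that bookkeeping, with `collections.Counter` as a stand-in for the spellchecker's frequency table and a made-up `words` list:

```python
from collections import Counter

# Toy word list with case variants and repeats (hypothetical data)
words = ["Aspirin", "aspirin", "ASPIRIN", "warfarin", "Warfarin", "heparin"]

# Lowercase before counting, so case variants share one entry
table = Counter(word.lower() for word in words)

unique_words = len(table)           # number of distinct entries
total_words = sum(table.values())   # sum of occurrence counts

print("%d unique words (%d words in total)" % (unique_words, total_words))
```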


**Add names of drugs and local mental health organisations**

In [3]:
df_drugs = pd.read_csv("../data/spelling_correction/medication_names.csv")

generic_names = [
    word
    for line in df_drugs.generic_name.dropna().str.strip().str.lower().str.replace("&", " ").tolist() 
    for word in line.split()
]

brand_names = [
    word 
    for line in df_drugs.brand_name.dropna().str.strip().str.lower().str.replace("&", " ").str.replace("\n", " ").tolist()
    for word in line.split()
]

slang_names = df_drugs.slang.dropna().str.strip().str.lower().unique().tolist()

drug_names = set(generic_names + brand_names + slang_names)
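
The comprehensions above turn multi-word drug names into single-word vocabulary entries. A plain-Python sketch of the same normalisation on a few sample strings follows; the helper `split_names` and the sample values are hypothetical and are not taken from `medication_names.csv`.

```python
def split_names(names):
    """Lowercase each name, treat '&' and newlines as separators, split into words."""
    words = []
    for name in names:
        if name is None:  # skip missing values, mirroring dropna()
            continue
        cleaned = name.strip().lower().replace("&", " ").replace("\n", " ")
        words.extend(cleaned.split())
    return words

# Illustrative inputs: a combination product, a missing value, a name with a line break
sample = ["Amoxicillin & Clavulanic Acid", None, " Panadol\nOsteo "]
print(sorted(set(split_names(sample))))
```

Note that the notebook keeps slang names as whole strings, so a multi-word slang term enters the vocabulary as a single entry rather than as separate words.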

In [4]:
spell.word_frequency.load_words(drug_names)

# Add names of local mental health organisations and other domain-specific terms
spell.word_frequency.load_words(["ecatt", "orygen", "saapu",
                                 "unrousable", "batcall", "acopia",
                                 "daswest", "neurovasc", "vasc", "bibp"])

print("Extended vocabulary contains %d unique words (%d words in total)." % 
      (spell.word_frequency.unique_words, spell.word_frequency.total_words))

Extended vocabulary contains 790509 unique words (1510106 words in total).
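
Comparing the two printouts shows how much of the added material was new. The small calculation below assumes that each loaded word raises the total count by exactly one, which is plausible here because `drug_names` is a set, but it remains an assumption about how the spellchecker counts.

```python
# Figures copied from the two printouts in this notebook
unique_before, total_before = 790195, 1509550
unique_after, total_after = 790509, 1510106

words_loaded = total_after - total_before     # words passed to load_words
new_entries = unique_after - unique_before    # words not previously in the vocabulary
already_known = words_loaded - new_entries    # words MIMIC already covered

print(words_loaded, new_entries, already_known)
```

Under that assumption, a sizeable share of the drug names and local terms was already present in the MIMIC vocabulary.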


**Load RMH data**

In [5]:
df = pd.read_csv("../data/rmh_raw.csv")
print(df.shape)
df.head()

(486458, 4)


|   | SH | SI | length | text |
|---|----|----|--------|------|
| 0 | 0.0 |  | 140 | SOB for 5/7, been to GP given prednisolone, co... |
| 1 | 0.0 |  | 107 | pt has lac down right forehead, to eyebrow, wi... |
| 2 | 0.0 |  | 74 | pt expect MBA, trapped for 45mins, #right femu... |
| 3 | 0.0 |  | 167 | L) sided flank pain same as previous renal col... |
| 4 | 0.0 |  | 193 | generalised abdo pain and associated headache ... |


**Preprocess and tokenize**

In [6]:
%%time
# Preprocess comments
df['text_clean'] = df.text.apply(preprocess)

CPU times: user 52 s, sys: 96.9 ms, total: 52.1 s
Wall time: 52.1 s


In [7]:
# Load scispacy model for tokenization
nlp = spacy.load("en_core_sci_sm", disable=['tagger', 'attribute_ruler', 'lemmatizer', 'parser', 'ner'])
nlp.tokenizer = combined_rule_tokenizer(nlp)

In [8]:
%%time
df['text_clean'] = list(nlp.pipe(df.text_clean))

CPU times: user 9min 7s, sys: 29.7 s, total: 9min 37s
Wall time: 9min 37s


**Select a subset of tokens present in the dataset**

In [9]:
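# Collect the tokens of a comment that are known to the spellchecker.
# Note: spell.known() returns a set, so a word repeated within one comment is added once.
def add_to_vocab(text):
    vocab.extend(spell.known([token.text for token in text]))
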
# Apply the function to each triage comment
vocab = []
df.text_clean.apply(add_to_vocab)

print("Domain-specific vocabulary contains %d unique words (%d words in total)." % 
      (len(set(vocab)), len(vocab)))
      
with open('../data/spelling_correction/rmh_custom_vocab.txt', 'wb') as f:
    pickle.dump(vocab, f)

Domain-specific vocabulary contains 36506 unique words (9127336 words in total).
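
Because the saved object is a flat list with repeats, word frequencies can be recovered later by counting it. A minimal sketch of that round trip, using a temporary file and a toy list as hypothetical stand-ins for `rmh_custom_vocab.txt` and its contents:

```python
import os
import pickle
import tempfile
from collections import Counter

# Toy word list with repeats (made up for illustration)
toy_vocab = ["pain", "chest", "pain", "abdo", "pain", "chest"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vocab.txt")
    # Save the list in the same way as the notebook does
    with open(path, "wb") as f:
        pickle.dump(toy_vocab, f)
    # Load it back, as a downstream notebook would
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Count occurrences to obtain a word -> frequency mapping
frequencies = Counter(loaded)
print("%d unique words (%d words in total)" % (len(frequencies), len(loaded)))
```

A list in this form can also be passed directly to `spell.word_frequency.load_words`, as was done with the MIMIC vocabulary at the start of this notebook.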
