## An Introduction to Clinical Text Processing using Spacy

### Written by: Robert Thombley, UCSF (July 19th, 2018)


In [157]:
import os
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings
warnings.filterwarnings('ignore')

This is a very brief example of how to use Spacy to process text. Unlike NLTK, which requires a lot of hands-on plumbing to connect the NLP pieces, Spacy does most of the work behind the scenes. That lets you write code quickly and relatively simply: in the best case you solve your problem sooner, and in the worst case you quickly find out which parts of your task are the complicated ones.

What follows is only a tiny fraction of what Spacy is capable of. For more information, check out: https://spacy.io/usage/

In [48]:
# Local directory of clinical notes (in this case, fake de-identified notes from the web), one note per file
notes_dir = '/PATH/TO/NLP/CLINICAL_DATA' 

nlp = spacy.load('en_core_web_sm')

In [108]:
note_toks = []
for filename in os.listdir(notes_dir):
    with open(os.path.join(notes_dir, filename), 'r') as f:
        # Remove all of the \n (newline) characters 
        list_of_lines = [x.strip() for x in f.readlines() if x != '\n']
        note_txt = '. '.join(list_of_lines)
        doc = nlp(note_txt)
        toks = []
        for token in doc:
            # The beauty of the Spacy approach is that much of the NLP work has been done for you behind the scenes.
            # If we want to remove problematic or uninteresting parts of speech, that tagging has already been
            # done and is stored on each token, ready to go. So I can just say: I don't want to see any proper
            # nouns, numbers, punctuation or symbols in my output tokens, and it's a one-line operation:
            if token.pos_ not in ('PROPN', 'NUM', 'PUNCT', 'SYM'):
                # Let's say we also only want to store nouns, because that feels like an appropriate decision.
                # (This stricter check makes the exclusion above redundant; both are kept for illustration.)
                if token.pos_ == 'NOUN':
                    toks.append(token.text.lower())
        note_toks.append(toks)
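
The line cleanup in the loop above (drop bare newline lines, strip each line, join with `'. '`) is easier to see on a tiny input. The sketch below uses a made-up note string (`raw_note` is hypothetical, not one of the notes in `notes_dir`) and a hypothetical helper `join_lines`. It also shows one edge case: a line containing only spaces passes the `x != '\n'` check and leaves a stray `. . ` in the joined text, whereas testing `x.strip()` would drop it.

```python
import io

# Hypothetical note text; the empty line and the whitespace-only line are deliberate.
raw_note = "HISTORY OF PRESENT ILLNESS\n\nPt reports low back pain\n   \nMEDS\n"


def join_lines(lines, skip_blank):
    """Strip each line and join with '. ' so every line reads as its own sentence."""
    if skip_blank:
        # Stricter variant (an assumption, not the notebook's code): drop any blank line.
        kept = [x.strip() for x in lines if x.strip()]
    else:
        # Same check as the notebook: only lines that are exactly '\n' are dropped.
        kept = [x.strip() for x in lines if x != '\n']
    return '. '.join(kept)


lines = io.StringIO(raw_note).readlines()
as_written = join_lines(lines, skip_blank=False)
stricter = join_lines(lines, skip_blank=True)

print(as_written)  # HISTORY OF PRESENT ILLNESS. Pt reports low back pain. . MEDS
print(stricter)    # HISTORY OF PRESENT ILLNESS. Pt reports low back pain. MEDS
```

Joining with `'. '` is what turns section headers such as `MEDS` into their own short "sentences" before the text reaches `nlp`.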
        
        

In [109]:
note_toks

[['history',
  'illness',
  'gentleman',
  'hypertension',
  'hypercholesterolemia',
  'pt',
  'reports',
  'trouble',
  'medications',
  'pain',
  'mo',
  'side',
  'day',
  'event',
  'radiation',
  'legs',
  'fevers',
  'night',
  'sweats',
  'weight',
  'loss',
  'bowel',
  'bladder',
  'problems',
  'note',
  'radiation',
  'area',
  'years',
  'treatment',
  'skin',
  'cancer',
  'exercise',
  'history',
  'note',
  'review',
  'cardiovascular',
  'chest',
  'pain',
  'palpitations',
  'orthopnea',
  'edema',
  'syncope',
  'pulmonary',
  'shortness',
  'breath',
  'cough',
  'pain',
  'changes',
  'bowel',
  'habits',
  'physical',
  'signs',
  'weight',
  'pounds',
  'temp',
  'heent',
  'pink',
  'lungs',
  'heart',
  'rate',
  'rhythm',
  'murmur',
  'extremities',
  'edema',
  'neurologic',
  'sensation',
  'dtrs',
  'intact',
  'musculoskeletal',
  'tenderness',
  'spine',
  'leg',
  'hypertension',
  'control',
  'enalapril',
  'day',
  'day',
  'diet',
  'exercise',
  'ch

In [110]:
# Spacy can also do many more advanced things.
# For example, it can extract noun chunks: it uses the dependency parse and part-of-speech tags
# to identify base noun phrases, i.e. a noun plus the words that describe it.
note_chunks = []
for filename in os.listdir(notes_dir):
    with open(os.path.join(notes_dir, filename), 'r') as f:
        # Remove all of the \n (newline) characters 
        list_of_lines = [x.strip() for x in f.readlines() if x != '\n']
        note_txt = '. '.join(list_of_lines)
        doc = nlp(note_txt)
        # Build a list of noun chunks (each one is a Spacy Span)
        chunks = list(doc.noun_chunks)
        note_chunks.append(chunks)

In [100]:
note_chunks

[[Phillip D. Smith,
  HISTORY,
  PRESENT ILLNESS,
  Mr. Smith,
  a 66-year-old gentleman,
  hypertension,
  hypercholesterolemia,
  pt reports,
  he,
  He,
  any trouble,
  his medications,
  His bp,
  He,
  low back pain,
  the past 6 mo,
  It,
  the right side,
  the day,
  No known initial precipitating event,
  He,
  any radiation,
  his legs,
  any fevers,
  night sweats,
  weight loss,
  bowel or bladder problems,
  note,
  he,
  significant radiation,
  this area,
  treatment,
  skin cancer,
  He,
  much exercise,
  PMH,
  MEDS,
  ALLERGIES,
  SOCIAL HISTORY,
  July 2005 note,
  REVIEW,
  SYSTEMS,
  CARDIOVASCULAR,
  no chest pain,
  palpitations,
  PND,
  orthopnea,
  edema,
  syncope,
  PULMONARY,
  no shortness,
  breath,
  cough,
  GI,
  no abdominal pain,
  changes,
  bowel habits,
  PHYSICAL EXAMINATION,
  VITAL SIGNS,
  210.7 pounds,
  temp,
  HEENT,
  Conjunctivae,
  Sclerae,
  NECK,
  LUNGS,
  HEART,
  Regular rate,
  rhythm,
  murmur,
  EXTREMITIES,
  No edema,
  NEURO

In [82]:
toks

[Nursing Progress,
 Problem,
 y/o M w/ extensive past cardiac history,
 (refer,
 chart,
 details,
 EW c/o polyuria,
 3 weeks,
 slight R,
 weakness,
 Serum,
 pt,
 non ketotic insulin gtt,
 head,
 M/SICU,
 management,
 hyperglycemia,
 admission,
 Foley cath,
 central access,
 pt,
 cool/diaphoretic and lethargic w/ FSBS,
 Action Foley,
 insulin gtt,
 glycemic control,
 increased fluid,
 100cc/hr mult attempts,
 PICC,
 bedside,
 placed L IJ TLCL,
 CXR,
 placement,
 No AM PO meds,
 pt,
 aspiration,
 Plan Cont,
 heme/resp status,
 placement,
 L IJ,
 use,
 volume repletion,
 Cycle enzymes,
 insulin gtt,
 tight gylcemic control,
 Administer all evening PO meds,
 pt,
 Update family,
 POC,
 it]

Now, let's demonstrate a very simple way to use Spacy to create a bag-of-words TF-IDF matrix for use in modeling.

In [143]:
simple_toks = []
sents = ['This is a sample document.',
        'This document is written by Robert.',
         'Robert is a document creator.']
for sent in sents:
    doc = nlp(sent)
    toks = []
    for token in doc:
        # Ignore any punctuation or symbols.
        if token.pos_ not in ('PUNCT', 'SYM'):
            toks.append(token.text.lower())
    simple_toks.append(toks)
        
        

In [144]:
simple_toks

[['this', 'is', 'a', 'sample', 'document'],
 ['this', 'document', 'is', 'written', 'by', 'robert'],
 ['robert', 'is', 'a', 'document', 'creator']]

In [145]:
def dummy_fun(doc):
    # TfidfVectorizer expects to do its own tokenizing and pre-processing,
    # but you can tell it to use a custom tokenizer/pre-processor. This
    # function returns whatever is passed in, allowing us to vectorize,
    # token for token, exactly the tokens we input. No further processing is needed.
    return doc

# Instantiate the TfidfVectorizer
tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None) 

In [153]:
# Now, we'll build our vocabulary
tfidf.fit(simple_toks)
tfidf.vocabulary_


{'a': 0,
 'by': 1,
 'creator': 2,
 'document': 3,
 'is': 4,
 'robert': 5,
 'sample': 6,
 'this': 7,
 'written': 8}
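
The mapping above can be reproduced by hand, which makes it clearer what "building the vocabulary" means here. This is only a sketch of the outcome in plain Python, not scikit-learn's internal code: collect the unique tokens, sort them, and number them.

```python
# Same tokens as simple_toks in the cells above, repeated so the block runs on its own.
simple_toks = [
    ['this', 'is', 'a', 'sample', 'document'],
    ['this', 'document', 'is', 'written', 'by', 'robert'],
    ['robert', 'is', 'a', 'document', 'creator'],
]

# Unique tokens across all documents, in alphabetical order.
unique_tokens = sorted({tok for toks in simple_toks for tok in toks})

# Each token's position in that sorted list becomes its column index.
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}
print(vocabulary)
```

The index assigned to each token is the column that token occupies in the TF-IDF matrix.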

You'll notice that the vocabulary maps each unique token in our corpus to a column index, assigned in alphabetical order.
Next, we'll look at the inverse document frequency of each token.

In [154]:
# Now we can print out the inverse document frequencies. With the default settings this is
# ln((1 + n_docs) / (1 + doc_freq)) + 1, so words that appear in fewer documents get larger weights.
print(tfidf.idf_)

[1.28768207 1.69314718 1.69314718 1.         1.         1.28768207
 1.69314718 1.28768207 1.69314718]
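
To see where these numbers come from, the values can be recomputed by hand. The sketch below assumes the vectorizer's default settings (in particular `smooth_idf=True`, which the notebook does not override), under which scikit-learn documents the weight as `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the token.

```python
import math

# Same tokens as simple_toks in the cells above, repeated so the block runs on its own.
simple_toks = [
    ['this', 'is', 'a', 'sample', 'document'],
    ['this', 'document', 'is', 'written', 'by', 'robert'],
    ['robert', 'is', 'a', 'document', 'creator'],
]

n_docs = len(simple_toks)
vocab = sorted({tok for toks in simple_toks for tok in toks})

# Document frequency: in how many documents does each token appear?
doc_freq = {tok: sum(tok in toks for toks in simple_toks) for tok in vocab}

# Smoothed inverse document frequency (assumes smooth_idf=True, the default).
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

for tok in vocab:
    print(f"{tok:10s} df={doc_freq[tok]}  idf={idf[tok]:.8f}")
```

Tokens that appear in all three documents (`document`, `is`) get the minimum weight of 1, while tokens that appear in only one document (`by`, `creator`, `sample`, `written`) get the largest.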


In [175]:
# Finally, we compute the TF-IDF weights: one row per document, one column per vocabulary entry.
tf = []
for simp_tok in simple_toks:
    tf.append(tfidf.transform([simp_tok]).toarray())

In [171]:
tf

[array([[0.45014501, 0.        , 0.        , 0.34957775, 0.34957775,
         0.        , 0.59188659, 0.45014501, 0.        ]]),
 array([[0.        , 0.50935267, 0.        , 0.30083189, 0.30083189,
         0.38737583, 0.        , 0.38737583, 0.50935267]]),
 array([[0.45014501, 0.        , 0.59188659, 0.34957775, 0.34957775,
         0.45014501, 0.        , 0.        , 0.        ]])]
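
The rows above can also be checked by hand. The sketch below assumes the vectorizer's defaults (`norm='l2'`, `smooth_idf=True`, `sublinear_tf=False`): each raw weight is the token count times its IDF, and each document row is then divided by its Euclidean length. The helper name `tfidf_row` is hypothetical.

```python
import math
from collections import Counter

# Same tokens as simple_toks in the cells above, repeated so the block runs on its own.
simple_toks = [
    ['this', 'is', 'a', 'sample', 'document'],
    ['this', 'document', 'is', 'written', 'by', 'robert'],
    ['robert', 'is', 'a', 'document', 'creator'],
]

n_docs = len(simple_toks)
vocab = sorted({tok for toks in simple_toks for tok in toks})
doc_freq = {tok: sum(tok in toks for toks in simple_toks) for tok in vocab}
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}


def tfidf_row(tokens):
    """Return one L2-normalized TF-IDF row, with columns in vocabulary order."""
    counts = Counter(tokens)
    raw = [counts[tok] * idf[tok] for tok in vocab]  # term count * idf
    norm = math.sqrt(sum(w * w for w in raw))        # Euclidean length of the row
    return [w / norm for w in raw]


rows = [tfidf_row(toks) for toks in simple_toks]
for row in rows:
    print([round(w, 8) for w in row])
```

Because of the normalization, the same token can receive different weights in different documents: `document` scores about 0.35 in the first note but about 0.30 in the second, which contains more rare tokens.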