## Code Samples for NLP Beginners

### Written by: Robert Thombley, UCSF (July 19th, 2018)

Step 1: Extraction

The pipeline begins by extracting the relevant data and storing it in a data structure that is easy to work with. In this example, we will pretend that our data lives in a fictitious database.

In [None]:
import pyodbc # For step 1
import re # For step 2
import nltk # Python Natural Language Toolkit
import spacy # Alternative NLP library that is fast, modern and easier to use
import string # Includes some functions and constants useful for working with text data
from nltk.corpus import stopwords # May require downloading the nltk corpora to your local machine.

In [None]:
# Connect to the database we are interested in
connection = pyodbc.connect(r'DRIVER={ODBC Driver 11 for SQL Server};'
                            r'SERVER=testServer.ucsf.edu;'
                            r'DATABASE=clinical_notes;'
                            r'UID=user123;'
                            r'PWD=p@ssword@bC'
                            )

# Instantiate a cursor object on our database of interest.
cursor = connection.cursor()

In [None]:
# Define a simple SQL query for extracting a single note from the table CLINICAL_NOTE_TEXT from the
# fictitious database clinical_notes

sql_string = 'SELECT NOTE_ID, AUTHOR_ID, EDIT_TIMESTAMP_UTC, NOTE_TEXT FROM CLINICAL_NOTE_TEXT where NOTE_ID = 12345'

In [None]:
# Execute our SQL query
query_results_obj = cursor.execute(sql_string)

# For each row in the query results object, store the information we care about: note_id, author_id, and note text
notes = []
for row in query_results_obj:
    # pyodbc rows expose columns as attributes (or by integer index), not by string key
    notes.append(row.NOTE_ID)
    notes.append(row.AUTHOR_ID)
    notes.append(row.NOTE_TEXT)

Example output for note_id = 12345
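
Because the database in this example is fictitious, the cells above cannot be run as written. The sketch below is one way to try the same extraction pattern locally. It assumes an in-memory SQLite database (from the Python standard library) as a stand-in for the SQL Server instance; the row contents and the helper name `fetch_notes` are made up for illustration. It also stores each note as a dictionary rather than a flat list, which is one possible choice of "easy to work with" data structure.

```python
import sqlite3

# Stand-in for the fictitious clinical_notes database: an in-memory SQLite
# database holding one made-up row. Table and column names mirror the query above.
connection = sqlite3.connect(":memory:")
connection.row_factory = sqlite3.Row  # allows row["NOTE_TEXT"]-style access
connection.execute(
    "CREATE TABLE CLINICAL_NOTE_TEXT ("
    "NOTE_ID INTEGER, AUTHOR_ID INTEGER, EDIT_TIMESTAMP_UTC TEXT, NOTE_TEXT TEXT)"
)
connection.execute(
    "INSERT INTO CLINICAL_NOTE_TEXT VALUES (?, ?, ?, ?)",
    (12345, 678, "2018-07-19 12:00:00",
     "History of present illness: pt is a 54 y/o male."),
)


def fetch_notes(conn, note_id):
    """Return the matching notes as a list of dicts, one dict per row."""
    sql = (
        "SELECT NOTE_ID, AUTHOR_ID, NOTE_TEXT "
        "FROM CLINICAL_NOTE_TEXT WHERE NOTE_ID = ?"
    )
    # The ? placeholder passes note_id as a parameter instead of
    # pasting it into the SQL string.
    rows = conn.execute(sql, (note_id,))
    return [
        {
            "note_id": row["NOTE_ID"],
            "author_id": row["AUTHOR_ID"],
            "note_text": row["NOTE_TEXT"],
        }
        for row in rows
    ]


note_records = fetch_notes(connection, 12345)
print(note_records[0]["note_text"])
```

With dictionaries, later steps can refer to `note_records[0]["note_text"]` instead of remembering that the text lives at position 2 of a list.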

Step 2: Formatting and Cleaning

Here we will introduce some basic data cleaning concepts, such as regular expressions (regexes).

In [None]:
# Remove stray unicode mark at the start of the note text
notes[2] = re.sub(r'⌋\s*', '', notes[2])

# Replace [**Name (NI) **] with #name#
notes[2] = re.sub(r'\[\*\*Name \(NI\) \*\*\]', '#name#', notes[2])

# Decide to remove all headings (e.g.: 'History of present illness: ') from text
lines = notes[2].split('\n')
newnote = []
heading_re = re.compile(r'.*:\s*(.*)') # Compile a regex for finding text that precedes a colon and any whitespace after it

for line in lines:
    match_group = heading_re.match(line)
    if match_group:
        # if the current line matches the heading_re regex pattern, then only add the text after the colon
        # to our output
        newnote.append(match_group.group(1))
    else:
        # otherwise, add the whole line
        newnote.append(line)

# Join note lines using the newline character and overwrite the old note data
notes[2] = '\n'.join(newnote)
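
To experiment with these cleaning steps without a database, the sketch below applies the same two ideas (placeholder replacement and heading removal) to a made-up note. The note text and the function name `clean_note` are assumptions for illustration. One deliberate difference from the cell above: the heading pattern here stops at the first colon on a line, so a value such as "10:30" later in the line is not cut off.

```python
import re

# Made-up note text, for illustration only
raw_note = (
    "History of present illness: [**Name (NI) **] is a 54 y/o male.\n"
    "Pt reports chest pain.\n"
    "Plan: continue aspirin."
)

# De-identification placeholder, as in the cell above
NAME_RE = re.compile(r"\[\*\*Name \(NI\) \*\*\]")
# Everything up to the first colon is treated as a heading
HEADING_RE = re.compile(r"^[^:]*:\s*(.*)$")


def clean_note(text):
    """Replace name placeholders and drop 'Heading: ' prefixes line by line."""
    text = NAME_RE.sub("#name#", text)
    cleaned_lines = []
    for line in text.split("\n"):
        match = HEADING_RE.match(line)
        # Keep only the text after the heading, or the whole line if none
        cleaned_lines.append(match.group(1) if match else line)
    return "\n".join(cleaned_lines)


cleaned_note = clean_note(raw_note)
print(cleaned_note)
```

Treating every colon as a heading marker is a blunt rule; on real notes you would likely want to check the results and tighten the pattern.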


Step 3: Tokenization Steps
After extraction and cleaning, the remaining steps can vary in their ordering. Here we present some tokenization techniques.

Sentence tokenization/segmentation is deceptively simple to accomplish using NLTK or spaCy. Unfortunately, many assumptions go into splitting a document into sentences. You may have to iterate between the data cleaning and sentence tokenization/segmentation steps until the output looks the way you want.

In [None]:
# Using NLTK
sentences = nltk.tokenize.sent_tokenize(notes[2])
for sent_num, sentence in enumerate(sentences):
    print('{} => {}'.format(sent_num, sentence))

Note that NLTK's sentence tokenizer chose not to treat a newline as a sentence break. This makes sense, because sentences often span multiple lines. In our case, we don't want this. We have two options: try a different sentence tokenizer, or correct the formatting using regexes (i.e., iterate back to data cleaning). Let's try spaCy's sentence tokenizer:

In [None]:
# Using SPACY
nlp = spacy.load('en_core_web_sm') # load the small English language model (install with: python -m spacy download en_core_web_sm)

doc = nlp(notes[2]) # Build the document model

#list the segmented sentences
for sent_num, sentence in enumerate(doc.sents):
    print('{} => {}'.format(sent_num, sentence.text.strip()))

Looks like this worked out how we wanted. For a lot of this, you will have to play around with options to see what works best for your particular use case.
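
The other option mentioned earlier, handling the formatting yourself with regexes, can be sketched roughly as follows. This is a naive, standard-library-only splitter and not what NLTK or spaCy do internally: it assumes every newline ends a sentence and that '.', '!' or '?' followed by whitespace also ends one, so abbreviations such as "Dr." will be split incorrectly. The sample text and the function name `split_sentences` are made up for illustration.

```python
import re

# Split after '.', '!' or '?' when followed by whitespace.
# Naive: this also splits after abbreviations such as "Dr."
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Treat each newline as a sentence break, then split on end punctuation."""
    sentences = []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sentences.extend(s for s in SENTENCE_END_RE.split(line) if s)
    return sentences


sample = "Pt reports chest pain\nDenies fever. No cough.\n\nFollow up in 2 weeks"
for sent_num, sentence in enumerate(split_sentences(sample)):
    print('{} => {}'.format(sent_num, sentence))
```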

Tokenizing our Sentences
Most NLP tools assume your document is a collection of sentences, which are themselves collections of tokens. These tokens can be unigrams, bi-grams, tri-grams or many other token types.

In [None]:
# Unigram tokenization using NLTK
unigrams = nltk.tokenize.word_tokenize(notes[2])
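
Once you have a list of unigrams, bi-grams and tri-grams are just sliding windows over that list. The sketch below shows one plain-Python way to build them; the token list and the helper name `ngrams` are illustrative. NLTK also ships its own `nltk.util.ngrams` helper if you prefer not to write this yourself.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    # zip stops at the shortest slice, which yields exactly the full windows
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["patient", "denies", "chest", "pain"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```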

Additional Steps, as needed:

Complexity reduction is used to reduce repetition and redundancy in your text data. For example, we want to treat 'pt' and 'patient' as the same token.


In [None]:
# Build a set of stop words (commonly used words that carry little information)
stop_words = set(stopwords.words('english'))

# Build a set of punctuation characters (for our purposes, we've decided punctuation isn't very important)
punct = set(string.punctuation)

# Build a dictionary mapping common clinical abbreviations to their expansions
med_abbrev = {'y/o': 'year old', 'w/': 'with', 'h/o': 'history of', 'EtOH': 'alcohol', 'pt': 'patient'}

tok_reduced = []
for tok in unigrams:
    if tok not in stop_words and tok not in punct:
        if tok in med_abbrev:
            tok = med_abbrev[tok] # expand the abbreviation
        tok_reduced.append(tok)
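
One detail worth watching in the cell above is letter case: NLTK's English stop words are lowercase, so a token like 'The' or 'Pt' will not match unless it is lowercased first. The sketch below shows the same reduction with case folding added. It is self-contained, so it assumes a tiny hand-picked stop list in place of NLTK's, and the sample tokens and the function name `reduce_tokens` are made up for illustration.

```python
import string

# Tiny hand-picked stop list for illustration; the cell above uses NLTK's list
STOP_WORDS = {"a", "is", "of", "the", "with"}
PUNCT = set(string.punctuation)
# Keys are lowercase because tokens are lowercased before lookup
MED_ABBREV = {"y/o": "year old", "w/": "with", "h/o": "history of",
              "etoh": "alcohol", "pt": "patient"}


def reduce_tokens(tokens):
    """Lowercase tokens, drop stop words and punctuation, expand abbreviations."""
    reduced = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS or tok in PUNCT:
            continue
        # Expand the abbreviation if we know it, otherwise keep the token
        reduced.append(MED_ABBREV.get(tok, tok))
    return reduced


sample_tokens = ["Pt", "is", "a", "54", "y/o", "male", "w/", "h/o", "EtOH", "use", "."]
reduced_tokens = reduce_tokens(sample_tokens)
print(reduced_tokens)
```

As in the original cell, expansions are not re-checked against the stop list, so 'w/' becomes 'with' and is kept. Whether that is desirable depends on your use case.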

POS tagging is used to tag the part of speech of a particular word, usually based on a statistical model of word use (i.e., what is the most likely part of speech for this word, given what we've seen in a huge amount of training data). Getting POS tagging to work well is a challenge, especially for jargon-heavy data like clinical text, because most of the freely available statistical models haven't encountered domain-specific semantics. The basic NLTK POS tagging works reasonably well, but domain-specific taggers exist (like the MedPost SKR Tagger used by MetaMap: https://metamap.nlm.nih.gov/MedPostSKRTagger.shtml).


In [None]:
pos_tok = nltk.pos_tag(tok_reduced)
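
To make the "most likely part of speech given the training data" idea concrete, here is a toy most-frequent-tag baseline. It is not how `nltk.pos_tag` works internally; it simply counts which tag each word received most often in a handful of hand-labeled, made-up sentences (Penn Treebank style tags) and falls back to a default tag for unseen words. The training data and the names `train_unigram_tagger` and `tag` are all assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hand-labeled toy examples (hypothetical), using Penn Treebank style tags
TRAINING_DATA = [
    [("patient", "NN"), ("reports", "VBZ"), ("chest", "NN"), ("pain", "NN")],
    [("patient", "NN"), ("denies", "VBZ"), ("fever", "NN")],
    [("radiology", "NN"), ("reports", "NNS"), ("reviewed", "VBN")],
    [("patient", "NN"), ("reports", "VBZ"), ("nausea", "NN")],
]


def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it received most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag_label in sentence:
            counts[word.lower()][tag_label] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag(tokens, model, default_tag="NN"):
    """Tag each token with its most frequent tag; unseen words get default_tag."""
    return [(tok, model.get(tok.lower(), default_tag)) for tok in tokens]


model = train_unigram_tagger(TRAINING_DATA)
tagged = tag(["Patient", "reports", "dyspnea"], model)
print(tagged)
```

Here 'reports' is tagged as a verb because that is how it appeared most often in the toy data, and 'dyspnea' falls back to the default because the model never saw it. The second case is exactly the weakness general-purpose taggers show on clinical jargon.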