# Lab 4.1: Solutions

## TA: Suraj Yerramilli

## Date: February 18th, 2019


In [1]:
import re
text = "Each year, millions of Americans walk out of a doctor's office with a misdiagnosis. Physicians try to be systematic when identifying illness and disease, but bias creeps in. Alternatives are overlooked."
print(text)

## Exercise 1
Implement `sent_tokenize` and `word_tokenize` using regex. Demonstrate them on the `text` variable.

### sent_tokenize 

The simplest approach is to split on sentence-ending punctuation without capturing it. The pattern must also consume any whitespace that follows the punctuation; otherwise every sentence after the first starts with a leading space.

You may also have noticed that `re.split` returns an empty string at the end of the list, because the text ends with a delimiter and nothing follows it. You can remove it with the `filter` function.
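
As a minimal illustration of that trailing empty string, the sketch below splits a short made-up string (`sample` is an assumption for this example, not part of the lab text) and then filters the result.

```python
import re

sample = "Bias creeps in. Alternatives are overlooked."

# The final period is followed by nothing, so re.split emits an empty
# string for the (empty) text after the last delimiter.
raw_parts = re.split(r'[.?!]\s*', sample)
print(raw_parts)    # ['Bias creeps in', 'Alternatives are overlooked', '']

# filter(None, ...) keeps only truthy items, which drops the empty string.
clean_parts = list(filter(None, raw_parts))
print(clean_parts)  # ['Bias creeps in', 'Alternatives are overlooked']
```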

In [2]:
# splitting with a non-capturing pattern drops the punctuation
# the pattern must also consume the whitespace after the punctuation
# known failure: abbreviations such as U.S.A. are split at every period
def sent_tokenize_split(text):
    sentences =  re.split(r'[\.\?!]\s*',text)
    # solution to remove the empty character at the end of the list
    return list(filter(None,sentences))

sent_tokenize_split(text)

If you want to keep the punctuation with each sentence (which would be ideal), put only the punctuation in a capturing group. `re.split` then returns the captured punctuation marks as separate items, interleaved with the text between them. You just need to join each sentence with the mark that follows it.
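
To see what there is to combine, here is a small sketch of the raw list that `re.split` returns when the pattern contains a capturing group. The `sample` string is made up for illustration, and pairing with `zip` is just one possible way to rejoin the pieces.

```python
import re

sample = "Bias creeps in. Are alternatives overlooked?"

# Sentence text sits at even indices, punctuation marks at odd indices.
parts = re.split(r'(\.|\?|!)\s*', sample)
print(parts)
# ['Bias creeps in', '.', 'Are alternatives overlooked', '?', '']

# zip stops at the shorter slice, so the trailing '' is ignored.
sentences = [body + mark for body, mark in zip(parts[0::2], parts[1::2])]
print(sentences)
# ['Bias creeps in.', 'Are alternatives overlooked?']
```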

In [3]:
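def sent_tokenize_split2(text):
    # the capturing group makes re.split return the punctuation marks too
    parts = re.split(r'(\.|\?|!)\s*',text)
    # remove the empty string at the end of the list
    parts = list(filter(None,parts))
    
    sentences = []
    # parts alternates between sentence text and its punctuation mark
    for i in range(0,len(parts),2):
        # guard: the last piece has no mark if the text lacks final punctuation
        mark = parts[i+1] if i+1 < len(parts) else ''
        sentences.append(parts[i]+mark)
        
    return sentences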

sent_tokenize_split2(text)

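Both solutions above break on sentences that contain abbreviations, because every period is treated as a sentence boundary. This is usually not a deal-breaker: abbreviations are often normalized during preprocessing by dropping their periods, so that they become continuous runs of letters (for example, `U.S.A.` becomes `USA`).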
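
The sketch below shows the failure on a made-up sentence and one possible normalization step. The helper `collapse_dotted_abbreviations` is hypothetical (it is not part of the lab or of any library), and it only handles abbreviations written as single letters followed by periods.

```python
import re

sample = "He is on his U.S.A. trip. He likes it."

# Naive splitting treats every period as a sentence boundary.
naive = list(filter(None, re.split(r'[.?!]\s*', sample)))
print(naive)  # ['He is on his U', 'S', 'A', 'trip', 'He likes it']

def collapse_dotted_abbreviations(text):
    # Hypothetical helper: turn runs of two or more "letter + period"
    # pairs (e.g. "U.S.A.") into continuous letters ("USA").
    return re.sub(r'\b(?:[A-Za-z]\.){2,}',
                  lambda m: m.group(0).replace('.', ''),
                  text)

normalized = collapse_dotted_abbreviations(sample)
print(normalized)  # He is on his USA trip. He likes it.

fixed = list(filter(None, re.split(r'[.?!]\s*', normalized)))
print(fixed)       # ['He is on his USA trip', 'He likes it']
```

One caveat of this sketch: if an abbreviation ends a sentence, its final period is removed as well, so the sentence boundary is lost.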

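There is also a regex solution that handles abbreviations. It relies on "look-behind" assertions, which check the text immediately before the current position without consuming it. (The "look-ahead" counterparts, `(?=...)` and `(?!...)`, check the text that follows instead.)

- `(?<!...)` negative look-behind: matches only if the text before the current position does not match `...`
- `(?<=...)` positive look-behind: matches only if the text before the current position matches `...`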
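
A small sketch of how these assertions behave in Python's `re` module, using made-up strings. Because the assertions consume no characters, the punctuation stays attached to each sentence.

```python
import re

s = "Dr. Lee left. Ann stayed."

# Positive look-behind: split on whitespace only if a period precedes it.
after_period = re.split(r'(?<=\.)\s+', s)
print(after_period)   # ['Dr.', 'Lee left.', 'Ann stayed.']

# Add a negative look-behind: do not split if "Dr." precedes the whitespace.
skip_title = re.split(r'(?<!Dr\.)(?<=\.)\s+', s)
print(skip_title)     # ['Dr. Lee left.', 'Ann stayed.']

# Positive look-ahead: split on whitespace only if an uppercase letter follows.
before_upper = re.split(r'\s+(?=[A-Z])', "one two. Three four")
print(before_upper)   # ['one two.', 'Three four']
```

Note that Python requires look-behind patterns to have a fixed width, which is why each abbreviation shape needs its own look-behind group.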


In [4]:
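sentence = "Mr. Smith is planning to visit New York on his U.S.A. trip. He holds a B.Tech from Oxford."
# split on whitespace after ., ? or !, unless it follows a dotted abbreviation (S.A.) or a title (Mr.)
re.split(r'(?<!\w\.\w\.)(?<![A-Z][a-z]\.)(?<=[.?!])\s+',sentence)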

### word_tokenize

In [9]:
def word_tokenize_regex(text):
    # need to deal with abbreviations ahead of calling this function
    return re.findall(r'\w+|[;\.,!\?\:]|\'\w+',text)

print(word_tokenize_regex(text))

['Each', 'year', ',', 'millions', 'of', 'Americans', 'walk', 'out', 'of', 'a', 'doctor', "'s", 'office', 'with', 'a', 'misdiagnosis', '.', 'Physicians', 'try', 'to', 'be', 'systematic', 'when', 'identifying', 'illness', 'and', 'disease', ',', 'but', 'bias', 'creeps', 'in', '.', 'Alternatives', 'are', 'overlooked', '.']


In [6]:
# output from word_tokenize
from nltk.tokenize import word_tokenize
print(word_tokenize(text))

['Each', 'year', ',', 'millions', 'of', 'Americans', 'walk', 'out', 'of', 'a', 'doctor', "'s", 'office', 'with', 'a', 'misdiagnosis', '.', 'Physicians', 'try', 'to', 'be', 'systematic', 'when', 'identifying', 'illness', 'and', 'disease', ',', 'but', 'bias', 'creeps', 'in', '.', 'Alternatives', 'are', 'overlooked', '.']
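
The two outputs agree on this text, but the simple pattern is not equivalent to NLTK's tokenizer in general. NLTK's `word_tokenize` follows Penn Treebank conventions and splits a contraction such as `don't` into `do` and `n't`, whereas the pattern above yields `don` and `'t`. The sketch below shows the difference and one possible extension; the function name `word_tokenize_regex2` and the extra `n't` alternatives are assumptions made for this example, not part of the lab solution.

```python
import re

sample = "I don't know."

# The pattern from the solution splits the contraction after "don".
simple = re.findall(r"\w+|[;.,!?:]|'\w+", sample)
print(simple)    # ['I', 'don', "'t", 'know', '.']

def word_tokenize_regex2(text):
    # Hypothetical variant: stop a word just before "n't", then emit
    # "n't" as its own token (Treebank-style contractions).
    return re.findall(r"\w+(?=n't)|n't|\w+|[;.,!?:]|'\w+", text)

extended = word_tokenize_regex2(sample)
print(extended)  # ['I', 'do', "n't", 'know', '.']
```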


## Exercise 2

Given a sentence, how would you correctly lemmatize words according to their parts of speech?

The idea is to apply POS tagging first and then lemmatize accordingly. The first letter of the tag will identify the POS.
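
The first-letter rule can be sketched as a lookup table from Penn Treebank tag prefixes to the POS codes that the WordNet lemmatizer accepts (`n`, `v`, `a`, `r`). The names `PENN_TO_WORDNET` and `wordnet_pos` are made up for this illustration, which needs no NLTK data to run.

```python
# WordNet POS codes keyed by the first letter of a Penn Treebank tag.
PENN_TO_WORDNET = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}

def wordnet_pos(tag):
    """Return the WordNet POS code for a Penn tag, or None if unmapped."""
    return PENN_TO_WORDNET.get(tag[:1])

for tag in ['NNS', 'VBG', 'JJR', 'RB', 'DT']:
    print(tag, '->', wordnet_pos(tag))
# NNS -> n
# VBG -> v
# JJR -> a
# RB -> r
# DT -> None
```

The rule is only approximate: a particle tag such as `RP` also starts with `R` and is therefore treated as an adverb.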

In [7]:
# solution: POS-tag the tokens, then lemmatize each one with its tag
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

def lemmatize_pos(lemmatizer,tokens):
    # lemmatize each token according to its part of speech:
    # the first character of the Penn Treebank tag usually
    # indicates the part of speech (N, V, J or R)
    out = []
    for token,tag in pos_tag(tokens):
        if tag[0] == 'N':
            # noun
            tmp = lemmatizer.lemmatize(token,"n")
        elif tag[0] == "V":
            # verb
            tmp = lemmatizer.lemmatize(token,"v")
        elif tag[0] == "J":
            # adjective
            tmp = lemmatizer.lemmatize(token,"a")
        elif tag[0] == "R":
            # adverb
            tmp = lemmatizer.lemmatize(token,"r")
        else:
            tmp = token
            
        out.append(tmp)
        
    return out

lemmatizer = WordNetLemmatizer()
sent = "The striped bats are hanging on their feet for best."
tokens = word_tokenize(sent)

print("POS tags:")
print(pos_tag(tokens))
print("\nLemmatized sentence:")
print(lemmatize_pos(lemmatizer,tokens))

POS tags:
[('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'), ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'), ('for', 'IN'), ('best', 'JJS'), ('.', '.')]

Lemmatized sentence:
['The', 'striped', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best', '.']


In [8]:
# another example
sent = "The bats with stripes are better off hanging on their feet."
tokens = word_tokenize(sent)

print("POS tags:")
print(pos_tag(tokens))
print("\nLemmatized sentence:")
print(lemmatize_pos(lemmatizer,tokens))

POS tags:
[('The', 'DT'), ('bats', 'NNS'), ('with', 'IN'), ('stripes', 'NNS'), ('are', 'VBP'), ('better', 'JJR'), ('off', 'RP'), ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'), ('.', '.')]

Lemmatized sentence:
['The', 'bat', 'with', 'stripe', 'be', 'good', 'off', 'hang', 'on', 'their', 'foot', '.']
