# Reading Text from Files, Stemming and Lemmatization

In [1]:
import nltk
import re
from nltk.collocations import *
from nltk.tokenize import TweetTokenizer

## Reading Text from a File

In [3]:
f = open("data/crimeandpunishment.txt")
rawtext = f.read()
f.close()

In [4]:
rawtext[:100]

'Produced by John Bickers; and Dagny\n\nCRIME AND PUNISHMENT\n\nBy Fyodor Dostoevsky\n\n\nTranslated By Cons'
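The open/read/close sequence above can also be written with a context manager, which closes the file even if reading raises an exception. The sketch below is not part of the original lab: the helper name `read_text` and the choice of UTF-8 are assumptions, so substitute whatever encoding the file was actually saved in.

```python
import os
import tempfile

def read_text(path, encoding="utf-8"):
    """Return the whole file as one string (hypothetical helper)."""
    # The with-block closes the file automatically, even on error.
    with open(path, encoding=encoding) as f:
        return f.read()

# Demonstrate on a small temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmpdir:
    sample_path = os.path.join(tmpdir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("CRIME AND PUNISHMENT\n\nBy Fyodor Dostoevsky\n")
    sample_text = read_text(sample_path)

print(repr(sample_text[:20]))
```

Stating the encoding explicitly avoids depending on the platform default, which differs between operating systems.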

## Tokenize and Get Concordance of the word "pass"

In [5]:
crimetokens = nltk.word_tokenize(rawtext)
text = nltk.Text(crimetokens)
text.concordance("pass")

Displaying 25 of 42 matches:
y time he went out he was obliged to pass her kitchen , the door of which inva
at would it be if it somehow came to pass that I were really going to do it ? 
oom , she said , letting her visitor pass in front of her : '' Step in , my go
d Katerina Ivanovna would not let it pass , she stood up for her ... and so th
ng into the next room , as he had to pass through hers to get there . Taking n
this scandal , and it came to such a pass that Dounia and I dared not even go 
for you . Oh , if only this comes to pass ! This would be such a benefit that 
irst place , because it will come to pass of itself , later on , and he will n
hen the hour struck , it all came to pass quite differently , as it were accid
g in the doorway not allowing him to pass , he advanced straight upon her . Sh
im -- all was lost ; if they let him pass -- all was lost too ; they would rem
t once that it would be loathsome to pass that seat on which after the girl wa
h , but it 's nothing m

## Stemming and Lemmatization

In [8]:
# inspect the tokens
print(crimetokens[:100])
print("\nThere are {:d} tokens".format(len(crimetokens)))

['Produced', 'by', 'John', 'Bickers', ';', 'and', 'Dagny', 'CRIME', 'AND', 'PUNISHMENT', 'By', 'Fyodor', 'Dostoevsky', 'Translated', 'By', 'Constance', 'Garnett', 'TRANSLATOR', "'S", 'PREFACE', 'A', 'few', 'words', 'about', 'Dostoevsky', 'himself', 'may', 'help', 'the', 'English', 'reader', 'to', 'understand', 'his', 'work', '.', 'Dostoevsky', 'was', 'the', 'son', 'of', 'a', 'doctor', '.', 'His', 'parents', 'were', 'very', 'hard-working', 'and', 'deeply', 'religious', 'people', ',', 'but', 'so', 'poor', 'that', 'they', 'lived', 'with', 'their', 'five', 'children', 'in', 'only', 'two', 'rooms', '.', 'The', 'father', 'and', 'mother', 'spent', 'their', 'evenings', 'in', 'reading', 'aloud', 'to', 'their', 'children', ',', 'generally', 'from', 'books', 'of', 'a', 'serious', 'character', '.', 'Though', 'always', 'sickly', 'and', 'delicate', 'Dostoevsky', 'came', 'out', 'third']

There are 250325 tokens


## Stemming

NLTK has two stemmers, Porter and Lancaster, described in section 3.6 of the NLTK book.  To use these stemmers, you first create them.

In [9]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

In [10]:
# stem all tokens in crimetokens
crimePorterStem = [porter.stem(t) for t in crimetokens]
print("Porter stem words in text:\n")
print(crimePorterStem[:100])
crimeLancStem = [lancaster.stem(t) for t in crimetokens]
print("\nLancaster stem words in text:\n")
print(crimeLancStem[:100])

Porter stem words in text:

['produc', 'by', 'john', 'bicker', ';', 'and', 'dagni', 'crime', 'and', 'punish', 'by', 'fyodor', 'dostoevski', 'translat', 'by', 'constanc', 'garnett', 'translat', "'s", 'prefac', 'a', 'few', 'word', 'about', 'dostoevski', 'himself', 'may', 'help', 'the', 'english', 'reader', 'to', 'understand', 'hi', 'work', '.', 'dostoevski', 'wa', 'the', 'son', 'of', 'a', 'doctor', '.', 'hi', 'parent', 'were', 'veri', 'hard-work', 'and', 'deepli', 'religi', 'peopl', ',', 'but', 'so', 'poor', 'that', 'they', 'live', 'with', 'their', 'five', 'children', 'in', 'onli', 'two', 'room', '.', 'the', 'father', 'and', 'mother', 'spent', 'their', 'even', 'in', 'read', 'aloud', 'to', 'their', 'children', ',', 'gener', 'from', 'book', 'of', 'a', 'seriou', 'charact', '.', 'though', 'alway', 'sickli', 'and', 'delic', 'dostoevski', 'came', 'out', 'third']

Lancaster stem words in text:

['produc', 'by', 'john', 'bick', ';', 'and', 'dagny', 'crim', 'and', 'pun', 'by', 'fyod', 'dostoevsky

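Note that both stemmers lower-case every token. The Lancaster stemmer is usually more severe in removing word endings ('crim' and 'pun' where Porter gives 'crime' and 'punish'), but not always: it leaves 'dagny' and 'dostoevsky' intact where Porter produces 'dagni' and 'dostoevski'.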

## Lemmatization

NLTK also has a lemmatizer that uses WordNet, an online lexical database, as a dictionary: it looks up each word and returns its base form (lemma) when one is found.

In [17]:
wnl = nltk.WordNetLemmatizer()
crimeLemma = [wnl.lemmatize(t) for t in crimetokens]
print("Original words in text:\n")
print(crimetokens[:100])
print("Lemma words in text:\n")
print(crimeLemma[:100])

Original words in text:

['Produced', 'by', 'John', 'Bickers', ';', 'and', 'Dagny', 'CRIME', 'AND', 'PUNISHMENT', 'By', 'Fyodor', 'Dostoevsky', 'Translated', 'By', 'Constance', 'Garnett', 'TRANSLATOR', "'S", 'PREFACE', 'A', 'few', 'words', 'about', 'Dostoevsky', 'himself', 'may', 'help', 'the', 'English', 'reader', 'to', 'understand', 'his', 'work', '.', 'Dostoevsky', 'was', 'the', 'son', 'of', 'a', 'doctor', '.', 'His', 'parents', 'were', 'very', 'hard-working', 'and', 'deeply', 'religious', 'people', ',', 'but', 'so', 'poor', 'that', 'they', 'lived', 'with', 'their', 'five', 'children', 'in', 'only', 'two', 'rooms', '.', 'The', 'father', 'and', 'mother', 'spent', 'their', 'evenings', 'in', 'reading', 'aloud', 'to', 'their', 'children', ',', 'generally', 'from', 'books', 'of', 'a', 'serious', 'character', '.', 'Though', 'always', 'sickly', 'and', 'delicate', 'Dostoevsky', 'came', 'out', 'third']
Lemma words in text:

['Produced', 'by', 'John', 'Bickers', ';', 'and', 'Dagny', 'CRIME', 

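Note that the WordNetLemmatizer, called as above without a part-of-speech argument, treats every token as a noun, so verb forms such as 'Produced' are left unchanged. In general it changes far fewer tokens than either stemmer.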

## Another example: desert.txt
* Tokenize the text
* Apply and compare the Porter and Lancaster stemming methods

In [29]:
f = open("data/desert.txt")
desertRawText = f.read()
f.close()
desertTokens = nltk.word_tokenize(desertRawText)
print("There are {:d} tokens".format(len(desertTokens)))
desertStemPorter = [porter.stem(t) for t in desertTokens]
desertStemLancaster = [lancaster.stem(t) for t in desertTokens]
desertLemma = [wnl.lemmatize(t) for t in desertTokens]

There are 1364 tokens


In [30]:
# compare the original token, both stems, and the lemma at 10 random positions
from random import sample

i_list = sample(range(len(desertTokens)), 10)
for i in i_list:
    a = desertTokens[i]
    b = desertStemLancaster[i]
    c = desertStemPorter[i]
    d = desertLemma[i]
    print('Original: {:s} | StemLancaster: {:s} | StemPorter: {:s} | Lemma: {:s}'.format(a,b,c,d))

Original: understood | StemLancaster: understood | StemPorter: understood | Lemma: understood
Original: to | StemLancaster: to | StemPorter: to | Lemma: to
Original: the | StemLancaster: the | StemPorter: the | Lemma: the
Original: set | StemLancaster: set | StemPorter: set | Lemma: set
Original: the | StemLancaster: the | StemPorter: the | Lemma: the
Original: stubborn | StemLancaster: stubborn | StemPorter: stubborn | Lemma: stubborn
Original: finish | StemLancaster: fin | StemPorter: finish | Lemma: finish
Original: 1,200 | StemLancaster: 1,200 | StemPorter: 1,200 | Lemma: 1,200
Original: using | StemLancaster: us | StemPorter: use | Lemma: using
Original: the | StemLancaster: the | StemPorter: the | Lemma: the


## Processing a Text File End-to-End

In [31]:
# put the full path to the file here (or can use relative path from the directory of the program)
#filepath = '/Users/njmccrac/NLPfall2016/labs/LabExamplesWeek4/CrimeAndPunishment.txt'
#filepath = 'H:\NLPclass\LabExamplesWeek4\CrimeAndPunishment.txt'
filepath = 'data/CrimeAndPunishment.txt'

def alpha_filter(w):
    # True for tokens that contain no lowercase letter at all
    # (punctuation, numbers); apply it to lower-cased tokens
    pattern = re.compile('^[^a-z]+$')
    return bool(pattern.match(w))

# open the file, read the text and close it
f = open(filepath, 'r')
filetext = f.read()
f.close()

# tokenize by the regular word tokenizer
filetokens = nltk.word_tokenize(filetext)

# choose to treat upper and lower case the same
#    by putting all tokens in lower case
filewords = [w.lower() for w in filetokens]

# display the first words
print ("Display first 50 words from file:")
print (filewords[:50])

Display first 50 words from file:
['produced', 'by', 'john', 'bickers', ';', 'and', 'dagny', 'crime', 'and', 'punishment', 'by', 'fyodor', 'dostoevsky', 'translated', 'by', 'constance', 'garnett', 'translator', "'s", 'preface', 'a', 'few', 'words', 'about', 'dostoevsky', 'himself', 'may', 'help', 'the', 'english', 'reader', 'to', 'understand', 'his', 'work', '.', 'dostoevsky', 'was', 'the', 'son', 'of', 'a', 'doctor', '.', 'his', 'parents', 'were', 'very', 'hard-working', 'and']
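To see what `alpha_filter` removes, the pattern can be tried on a few tokens. This is a small illustrative re-implementation (the names `NON_ALPHA` and `is_non_alpha` are not from the lab). Because the pattern only looks for lowercase letters, an all-uppercase token would also be filtered, which is one reason the tokens are lower-cased first. Tokens such as "'s" and "n't" contain a letter and are therefore kept.

```python
import re

# Same pattern as alpha_filter above: the whole token has no lowercase a-z.
NON_ALPHA = re.compile(r'^[^a-z]+$')

def is_non_alpha(token):
    """True when the token contains no lowercase letter (illustrative re-implementation)."""
    return bool(NON_ALPHA.match(token))

sample_tokens = ['produced', ';', '1,200', "'s", "n't", 'hard-working', '.', '--']
kept = [t for t in sample_tokens if not is_non_alpha(t)]
dropped = [t for t in sample_tokens if is_non_alpha(t)]
print("kept:   ", kept)
print("dropped:", dropped)
```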


In [33]:
# read a stop word file
fstop = open('data/Smart.English.stop', 'r')
stoptext = fstop.read()
fstop.close()

stopwords = nltk.word_tokenize(stoptext)
print ("Display first 50 of {:d} Stopwords:".format(len(stopwords)))
print (stopwords[:50])

Display first 50 of 573 Stopwords:
['â€™s', 'a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across', 'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear', 'appreciate', 'appropriate', 'are', "aren't", 'around', 'as', 'aside', 'ask', 'asking']
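When a stop word list is used for membership tests, as in the word filter applied to the bigram finder in this notebook, every test against a Python list scans the list from the start. Converting the list to a set is a common alternative. The sketch below uses a short stand-in list with illustrative values rather than the real stop word file.

```python
# A short stand-in for the stop word list read from the file (illustrative values).
stopword_list = ['a', 'about', 'above', 'after', 'again', 'all', 'the', 'and']

# A set gives average constant-time membership tests; a list is scanned linearly.
stopword_set = set(stopword_list)

words = ['the', 'young', 'man', 'and', 'the', 'old', 'woman']
content_words = [w for w in words if w not in stopword_set]
print(content_words)
```

With a few hundred stop words and a few hundred thousand tokens the difference in running time can be noticeable, though the filtered result is the same either way.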


In [34]:
# setup to process bigrams
bigram_measures = nltk.collocations.BigramAssocMeasures()
       
finder = BigramCollocationFinder.from_words(filewords)
# choose to use both the non-alpha word filter and a stopwords filter
finder.apply_word_filter(alpha_filter)
finder.apply_word_filter(lambda w: w in stopwords)

# score by frequency and display the top 20 bigrams
scored = finder.score_ngrams(bigram_measures.raw_freq)
print ("Bigrams from file with top 20 frequencies")
for item in scored[:20]:
    print (item)

Bigrams from file with top 20 frequencies
(('katerina', 'ivanovna'), 0.0008468990312593629)
(('pyotr', 'petrovitch'), 0.000683111954459203)
(('wo', "n't"), 0.0004913612304004793)
(('ca', "n't"), 0.00048736642364925596)
(('pulcheria', 'alexandrovna'), 0.00048337161689803257)
(('avdotya', 'romanovna'), 0.0004594027763906921)
(('rodion', 'romanovitch'), 0.0003435533806052132)
(('porfiry', 'petrovitch'), 0.00032357934684909616)
(('marfa', 'petrovna'), 0.00030760011984420254)
(('sofya', 'semyonovna'), 0.0002836312793368621)
(('raskolnikov', "'s"), 0.00021971437131728752)
(('amalia', 'ivanovna'), 0.0002157195645660641)
(('young', 'man'), 0.0002077299510636173)
(('great', 'deal'), 0.00018775591730750026)
(("n't", 'understand'), 0.00013981823629281934)
(('ilya', 'petrovitch'), 0.0001318286227903725)
(('ivanovna', "'s"), 0.0001238390092879257)
(('sonia', "'s"), 0.00011584939578547888)
(('make', 'haste'), 0.00010785978228303205)
(('good', 'heavens'), 0.00010386497553180865)
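The scores above are relative frequencies: how often a bigram occurs, divided by a corpus total. A minimal sketch of that idea with `collections.Counter` on a toy token list follows. Normalising by the number of tokens is an assumption made for this illustration, and NLTK's own bookkeeping and filtering may differ in detail.

```python
from collections import Counter

tokens = ['katerina', 'ivanovna', 'said', 'that', 'katerina', 'ivanovna', 'left']

# Adjacent pairs: (w0, w1), (w1, w2), ...
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Assumption: normalise each count by the number of tokens in the text.
total = len(tokens)
raw_freq = {bg: n / total for bg, n in bigram_counts.items()}

top_bigram, top_score = max(raw_freq.items(), key=lambda kv: kv[1])
print(top_bigram, round(top_score, 4))
```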


In [35]:
# score by PMI and display the top 20 bigrams
# only use frequently occurring words in mutual information
finder.apply_freq_filter(5)
scored = finder.score_ngrams(bigram_measures.pmi)

print ("\nBigrams from file with top 20 mutual information scores")
for item in scored[:20]:
    print (item)


Bigrams from file with top 20 mutual information scores
(('praskovya', 'pavlovna'), 14.763517853413212)
(('palais', 'de'), 14.34848035413437)
(('de', 'cristal'), 14.348480354134367)
(('explosive', 'lieutenant'), 14.248944680583456)
(('semyon', 'zaharovitch'), 14.248944680583456)
(('assistant', 'superintendent'), 13.91107504182707)
(('arkady', 'ivanovitch'), 13.763517853413212)
(('madame', 'resslich'), 13.567120640609708)
(('afanasy', 'ivanovitch'), 13.34848035413437)
(('nikodim', 'fomitch'), 13.34848035413437)
(('andrey', 'semyonovitch'), 13.348480354134368)
(('madame', 'lippevechsel'), 13.348480354134367)
(('examining', 'lawyer'), 13.026552259247005)
(('flushed', 'crimson'), 12.915520946858262)
(('hay', 'market'), 12.91107504182707)
(('chapter', 'iii'), 12.389122338631715)
(('chapter', 'iv'), 12.389122338631715)
(('dmitri', 'prokofitch'), 12.34848035413437)
(('chapter', 'vi'), 12.348480354134367)
(('canal', 'bank'), 12.322008142773177)
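Pointwise mutual information compares how often two words occur together with how often they would be expected to co-occur if they were independent: PMI(x, y) = log2(P(x, y) / (P(x) P(y))). Pairs of rare words that only ever appear together get very high scores, which is why a frequency filter is applied first. The sketch below computes the score from raw counts on a toy token list. The helper names `pmi` and `score` are illustrative, and this is a hand-rolled version rather than NLTK's implementation.

```python
import math
from collections import Counter

def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information from raw counts, in bits."""
    return math.log2(n_xy * n_total / (n_x * n_y))

tokens = ['hay', 'market', 'the', 'man', 'the', 'hay', 'market', 'the', 'day', 'the']
word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_total = len(tokens)

def score(w1, w2):
    """PMI of an observed bigram (w1, w2) in the toy token list."""
    return pmi(bigram_counts[(w1, w2)], word_counts[w1], word_counts[w2], n_total)

print(round(score('hay', 'market'), 3))   # words that always occur together
print(round(score('the', 'man'), 3))      # a pair involving a very common word
```

The pair whose words always occur together scores higher than the pair that includes a very frequent word, mirroring the proper names and fixed phrases at the top of the list above.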


## Regular Expressions and Tokenization

In [36]:
# This text is a Python string
file0 = nltk.corpus.gutenberg.fileids()[0]
emmatext = nltk.corpus.gutenberg.raw(file0)

In [37]:
print("The type of this is: \n", type(emmatext))
print("The number of characters in the book is {:d}\n".format(len(emmatext)))
print("The first 150 characters of the string:\n")
print(emmatext[:150])

The type of this is: 
 <class 'str'>
The number of characters in the book is 887071

The first 150 characters of the string:

[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to


There are several newline characters. Replace them with a space.

In [38]:
newemmatext = emmatext.replace('\n', ' ')
print(newemmatext[:150])

[Emma by Jane Austen 1816]  VOLUME I  CHAPTER I   Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to
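Replacing each newline with one space leaves runs of spaces wherever the text had blank lines, as the output above shows. If single spacing is wanted, one option is to collapse every run of whitespace with `re.sub`. The sketch below works on a short sample string rather than the full book.

```python
import re

sample = "[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome"

# \s+ matches any run of spaces, tabs, and newlines; replace each run with one space.
collapsed = re.sub(r'\s+', ' ', sample).strip()
print(collapsed)
```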


## Exploration of Regular Expressions for Tokenization

Simple text with no punctuation:

In [39]:
shorttext = "That book is interesting"
pattern = re.compile(r"\w+") # find runs of word characters (letters, digits, underscore)
print(re.findall(pattern, shorttext))

['That', 'book', 'is', 'interesting']


Text with some special characters:

In [40]:
specialtext = "That U.S.A. poster-print costs $12.40, but with 10% off."
print(re.findall(pattern, specialtext))

['That', 'U', 'S', 'A', 'poster', 'print', 'costs', '12', '40', 'but', 'with', '10', 'off']


Matching only runs of word characters leaves much of the string unmatched and splits tokens such as U.S.A. and $12.40.  Let’s start building a more general regular expression for tokens, beginning with words that can have internal hyphens.  We need parentheses around the part of the pattern that can be repeated 0 or more times.  However, when a pattern contains a group, findall reports only what the group matched, so we put an extra pair of parentheses around the whole pattern to see the full match as well.

In [41]:
ptoken = re.compile(r'(\w+(-\w+)*)')
print(re.findall(ptoken, specialtext))

[('That', ''), ('U', ''), ('S', ''), ('A', ''), ('poster-print', '-print'), ('costs', ''), ('12', ''), ('40', ''), ('but', ''), ('with', ''), ('10', ''), ('off', '')]


re.findall has reported both the whole matched text and the inner group; for a group that repeats, it reports only the last repetition.  We could keep the capturing groups and read only the outer match from each match object, for example with its group method.  But let’s assume that we only want the whole match and not any of the inner ones.  We can instead turn the inner parentheses into non-capturing groups, written (?:...).  This regular expression matches the same strings, but findall no longer reports the subgroups.

In [42]:
# Now let’s check our pattern on a word with two internal hyphens.
ptoken = re.compile(r'(\w+(?:-\w+)*)')
re.findall(ptoken, 'end-of-line character')

['end-of-line', 'character']
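The alternative mentioned above, keeping the capturing groups but reading only the whole match, can be sketched with `re.finditer`, which yields match objects. On a match object, `group(0)` is always the entire matched text, whatever groups the pattern contains. The variable names here are illustrative.

```python
import re

specialtext = "That U.S.A. poster-print costs $12.40, but with 10% off."

# Capturing groups are kept here; group(0) is always the entire match.
ptoken_capturing = re.compile(r'\w+(-\w+)*')
whole_matches = [m.group(0) for m in ptoken_capturing.finditer(specialtext)]
print(whole_matches)
```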

Now we try to make a pattern to match abbreviations that might have a “.” inside, like U.S.A.  We only allow capitalized letters, and we make a simple pattern that matches alternating capital letters and dots.

In [43]:
pabbrev = re.compile(r'(([A-Z]\.)+)')
re.findall(pabbrev, specialtext)

[('U.S.A.', 'A.')]

This worked well, so let’s combine it with the words pattern to match either words or abbreviations.

In [44]:
ptoken = re.compile(r'(\w+(-\w+)*|([A-Z]\.)+)')
print(re.findall(ptoken, specialtext))

[('That', '', ''), ('U', '', ''), ('S', '', ''), ('A', '', ''), ('poster-print', '-print', ''), ('costs', '', ''), ('12', '', ''), ('40', '', ''), ('but', '', ''), ('with', '', ''), ('10', '', ''), ('off', '', '')]


Well, that didn’t work: the word alternative comes first, so it matched ‘U’, ‘S’ and ‘A’ as separate words before the abbreviation alternative was ever tried.  The order of the alternatives matters whenever an earlier one can match part of what a later one should match.  We can switch the order so that abbreviations are tried first and words second.

In [45]:
ptoken = re.compile(r'(([A-Z]\.)+|\w+(-\w+)*)')
print(re.findall(ptoken, specialtext))


[('That', '', ''), ('U.S.A.', 'A.', ''), ('poster-print', '', '-print'), ('costs', '', ''), ('12', '', ''), ('40', '', ''), ('but', '', ''), ('with', '', ''), ('10', '', ''), ('off', '', '')]


That worked much better.  Now we’ll add an expression to match the currency, with an optional $ so that we can also match numbers with optional decimal parts.

In [46]:
ptoken = re.compile(r'(([A-Z]\.)+|\w+(-\w+)*|\$?\d+(\.\d+)?)')
print(re.findall(ptoken, specialtext))

[('That', '', '', ''), ('U.S.A.', 'A.', '', ''), ('poster-print', '', '-print', ''), ('costs', '', '', ''), ('$12.40', '', '', '.40'), ('but', '', '', ''), ('with', '', '', ''), ('10', '', '', ''), ('off', '', '', '')]


We can keep adding alternatives, but the notation is getting awkward.  A more readable equivalent can be written with Python’s triple quotes (either """ or '''), which let a string span multiple lines, together with the “r” prefix for a “raw” string and the regular expression verbose flag, which makes the re compiler ignore whitespace and end-of-line comments inside the pattern.  The cell below takes a simpler route to the same goal: each alternative is written as its own commented string, wrapped in its own pair of parentheses, and the pieces are joined with |.  Because each alternative is then a separate capturing group, findall returns one tuple per token, with the matched text in the slot of the alternative that matched.

In [47]:
p1 = r'((?:[A-Z]\.)+)' # abbreviations, e.g. U.S.A.
p2 = r'(\w+(?:-\w+)*)' # words with internal hyphens
p3 = r'(\$?\d+(?:\.\d+)?)' # currency, like $12.40
p4 = '|'.join([p1,p2,p3]) # combine
ptoken = re.compile(p4)
print(re.findall(ptoken, specialtext))

[('', 'That', ''), ('U.S.A.', '', ''), ('', 'poster-print', ''), ('', 'costs', ''), ('', '', '$12.40'), ('', 'but', ''), ('', 'with', ''), ('', '10', ''), ('', 'off', '')]
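The triple-quoted, verbose form described above might look like the following sketch. Every group is non-capturing, so `findall` returns the whole tokens as plain strings instead of tuples. The currency alternative is placed before the word alternative here so that a number with a decimal part is tried as one token first; that ordering is a choice made for this illustration.

```python
import re

specialtext = "That U.S.A. poster-print costs $12.40, but with 10% off."

ptoken_verbose = re.compile(r'''
      (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?    # currency or plain numbers, e.g. $12.40
    | \w+(?:-\w+)*        # words with internal hyphens
    ''', re.VERBOSE)

print(ptoken_verbose.findall(specialtext))
```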


## More about the function findall()

Before we go on to finding tokens, we will look more at how to use the regular expression function findall().  Suppose that we have text with email addresses.

In [48]:
email_text = "For more information, send a request to info@ischool.syr.edu. Or you can directly contact our information staff at HelpfulHenry@syr.edu and SageSue@syr.edu."

And suppose that we want to find the email addresses in two parts:  first the user name and then the domain.  We can define a regular expression with just the two inner parentheses to match these two parts.

In [49]:
pemail = re.compile(r'([A-Za-z]+)@([a-z]+(?:\.[a-z]+)*)')
matches = re.findall(pemail, email_text)
for m in matches:
    # format function puts each argument into the output string where the {} is
    email = 'User: {}, Domain:{}'.format(m[0],m[1])
    print(email)

User: info, Domain:ischool.syr.edu
User: HelpfulHenry, Domain:syr.edu
User: SageSue, Domain:syr.edu
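A variant of the same idea names the two parts instead of indexing the result by position. The sketch below uses named groups with `re.finditer`. In its domain pattern, each dot must be followed by more letters, so a sentence-final period is left out of the match. It is a deliberately simplified pattern for this text, not a general email validator, and the names `pemail_named` and `addresses` are illustrative.

```python
import re

email_text = ("For more information, send a request to info@ischool.syr.edu. "
              "Or you can directly contact our information staff at "
              "HelpfulHenry@syr.edu and SageSue@syr.edu.")

# Named groups; each dot in the domain must be followed by more letters,
# so a sentence-final period is left out of the match.
pemail_named = re.compile(r'(?P<user>[A-Za-z]+)@(?P<domain>[a-z]+(?:\.[a-z]+)+)')

addresses = [(m.group('user'), m.group('domain')) for m in pemail_named.finditer(email_text)]
for user, domain in addresses:
    print('User: {}, Domain: {}'.format(user, domain))
```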


## Regular Expression Tokenizer using NLTK Tokenizer

NLTK provides a function, regexp_tokenize, that turns a regular expression into a tokenizer: you give it the text and the pattern.  The pattern can be written in “verbose” form, using the (?x) flag, which allows the alternatives to go on separate lines with comments.  If every group inside the alternatives is non-capturing, no extra parentheses are needed around them.

In [51]:
pattern = r'''(?x)            # set flag to allow verbose regexps (must come first in the pattern)
        (?:[A-Z]\.)+    	# abbreviations, e.g. U.S.A.
        | \$?\d+(?:\.\d+)?%?  # currency and percentages, $12.40, 50%
        | \w+(?:-\w+)*  	# words with internal hyphens
        | \.\.\.        	# ellipsis
        | [][.,;"'?():_%#-]   # separate tokens; the hyphen goes last so it is literal, not a range
        '''
print("Tokenize text1:\n")
print(nltk.regexp_tokenize(shorttext, pattern))
print("\nTokenize text2:\n")
print(nltk.regexp_tokenize(specialtext, pattern))


Tokenize text1:

['That', 'book', 'is', 'interesting']

Tokenize text2:

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', ',', 'but', 'with', '10%', 'off', '.']


We can compare the regular expression tokenizer with NLTK’s built-in word tokenizer:

In [52]:
print(nltk.word_tokenize(specialtext))

['That', 'U.S.A.', 'poster-print', 'costs', '$', '12.40', ',', 'but', 'with', '10', '%', 'off', '.']


## Regular expression tokenizer appropriate for tweet text or other social media text

In [53]:
tweetPattern = r'''(?x)	# set flag to allow verbose regexps (must come first in the pattern)
      (?:https?://|www)\S+    # simple URLs
      | (?::-\)|;-\))		# small list of emoticons
      | &(?:amp|lt|gt|quot);  # XML or HTML entity
      | \#\w+                 # hashtags
      | @\w+                  # mentions   
      | \d+:\d+               # timelike pattern
      | \d+\.\d+              # number with a decimal
      | (?:\d+,)+?\d{3}(?=(?:[^,]|$))   # number with a comma
      | (?:[A-Z]\.)+                    # simple abbreviations
      | (?:--+)               # multiple dashes
      | \w+(?:-\w+)*          # words with internal hyphens
      | ['\".?!,:;/]+         # special characters
      '''

In [54]:
tweet1 = "@natalieohayre I agree #hc09 needs reform- but not by crooked politicians who r clueless about healthcare! #tcot #fishy NO GOV'T TAKEOVER!"

tweet2 = "To Sen. Roland Burris: Affordable, quality health insurance can't wait http://bit.ly/j63je #hc09 #IL #60660"

tweet3 = "RT @karoli: RT @Seriou: .@whitehouse I will stand w/ Obama on #healthcare,  I trust him. #p2 #tlot"

In [55]:
print("Tokenization of tweet1 using custom regex pattern:\n")
print(nltk.regexp_tokenize(tweet1,tweetPattern))

Tokenization of tweet1 using custom regex pattern:

['@natalieohayre', 'I', 'agree', '#hc09', 'needs', 'reform', 'but', 'not', 'by', 'crooked', 'politicians', 'who', 'r', 'clueless', 'about', 'healthcare', '!', '#tcot', '#fishy', 'NO', 'GOV', "'", 'T', 'TAKEOVER', '!']


In [56]:
# using built-in tweet tokenizer in nltk
tknzr = TweetTokenizer()
print("Tokenization of tweet1 using built-in Tweet Tokenizer:\n")
print(tknzr.tokenize(tweet1))

Tokenization of tweet1 using built-in Tweet Tokenizer:

['@natalieohayre', 'I', 'agree', '#hc09', 'needs', 'reform', '-', 'but', 'not', 'by', 'crooked', 'politicians', 'who', 'r', 'clueless', 'about', 'healthcare', '!', '#tcot', '#fishy', 'NO', "GOV'T", 'TAKEOVER', '!']
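The custom pattern splits GOV'T into three tokens, while the built-in tweet tokenizer keeps it whole. One possible adjustment is to allow apostrophes, as well as hyphens, between word characters. The sketch below tests that idea with plain `re.findall` on a reduced version of the pattern (only four of the alternatives). The name `tweet_pattern_apos` is illustrative, and the full pattern would need the same one-line change.

```python
import re

tweet1 = ("@natalieohayre I agree #hc09 needs reform- but not by crooked politicians "
          "who r clueless about healthcare! #tcot #fishy NO GOV'T TAKEOVER!")

# Reduced version of the tweet pattern, with apostrophes allowed inside words.
tweet_pattern_apos = re.compile(r'''
      \#\w+                   # hashtags
    | @\w+                    # mentions
    | \w+(?:[-']\w+)*         # words with internal hyphens or apostrophes
    | ['".?!,:;/]+            # special characters
    ''', re.VERBOSE)

tweet_tokens = tweet_pattern_apos.findall(tweet1)
print(tweet_tokens)
```

As with the custom pattern above, the lone hyphen after "reform" is still dropped, because no alternative matches a single dash.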


## Custom Regex Pattern Added to Built-in regex tokenizer
Run the regexp tokenizer with the regular pattern on the sentence “Mr. Black and Mrs. Brown attended the lecture by Dr. Gray, but Gov. White wasn’t there.”  
* Design and add a line to the pattern of this tokenizer so that titles like “Mr.” are tokenized as having the dot inside the token.  Test and add some other titles to your list of titles.
* Design and add a line to the pattern of this tokenizer so that words with a single apostrophe, such as “wasn’t”, are taken as a single token.


In [57]:
sentence = r'''Mr. Black and Mrs. Brown attended the lecture by Dr. Gray, but Gov. White wasn't there.'''
print("My text is:\n", sentence)

My text is:
 Mr. Black and Mrs. Brown attended the lecture by Dr. Gray, but Gov. White wasn't there.


In [58]:
pattern = re.compile('|'.join([
    r"(?:[A-Z][a-z]+\.)+", # titles: a capitalized word followed by a dot, e.g. Gov., Dr., Mr., Mrs.
    r"'?\w[\w']*(?:-\w+)*'?", # words with hyphens or '
]))
print(nltk.regexp_tokenize(sentence, pattern))

['Mr.', 'Black', 'and', 'Mrs.', 'Brown', 'attended', 'the', 'lecture', 'by', 'Dr.', 'Gray', 'but', 'Gov.', 'White', "wasn't", 'there']
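Another way to approach the exercise is to list the titles explicitly, so that an ordinary capitalized word at the end of a sentence is not mistaken for a title, and to add an alternative that keeps punctuation as separate tokens. The sketch below uses plain `re.findall`. The `titles` list and the name `title_pattern` are hypothetical choices for this illustration, and the list can be extended as needed.

```python
import re

sentence = "Mr. Black and Mrs. Brown attended the lecture by Dr. Gray, but Gov. White wasn't there."

# Hypothetical title list; extend it as needed.
titles = ['Mr', 'Mrs', 'Ms', 'Dr', 'Gov', 'Prof']

title_pattern = re.compile(r'''
      \b(?:%s)\.              # listed titles, keeping the trailing dot
    | \w+(?:[-']\w+)*         # words with internal hyphens or apostrophes
    | [.,;:?!]                # punctuation as separate tokens
    ''' % '|'.join(titles), re.VERBOSE)

title_tokens = title_pattern.findall(sentence)
print(title_tokens)
```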
