# Sentence and Word Segmentation

The first step in most NLP pipelines is cutting text into its constituents, namely sentences and words. Let's see how well we can perform this task in base Python.

**DO NOT worry about writing efficient code.** We're just practicing NLP principles.

It will be useful to know the string methods! They are among the most useful features of Python for text processing.

https://docs.python.org/2/library/stdtypes.html#string-methods

## Sentence segmentation

Let's start with sentence segmentation. English sentences typically end with a period, exclamation mark, or question mark. Let's start easy.

In [1]:
# Run this cell for a HINT:
import base64
base64.decodestring('S2VlcCBhIGxpc3Qgb2Ygc2VudGVuY2VzLCBhbmQgYSB0ZW1wIHN0cmluZyB3aXRoIHRoZSBjdXJy\nZW50IHNlbnRlbmNlLiBBcHBlbmQgd2hlbiB5b3UgaGl0IHRoZSByaWdodCBjaGFyYWN0ZXJz\n')

'Keep a list of sentences, and a temp string with the current sentence. Append when you hit the right characters'

In [26]:
# defining the input text and what the output should be.

easy_text = "I went to the zoo today. What do you think of that? I bet you hate it! Or maybe you don't"
easy_split_text = ["I went to the zoo today.",
                   "What do you think of that?",
                   "I bet you hate it!",
                   "Or maybe you don't"]

In [11]:
# define a function to split a string into sentences.
# If you're familiar with regexes, feel free to use the re module

def sentencer(text):
    '''take a string called `text` and return a list of strings, each containing a sentence'''
    sentences = []
    substring = ''
    for c in text:
        substring += c
        # a terminator closes the current sentence
        if c in '.?!':
            sentences.append(substring.strip())
            substring = ''
    # keep trailing text that has no terminator, but never an empty string
    if substring.strip():
        sentences.append(substring.strip())
    return sentences

In [12]:
# test your function by running this cell

if map(str.strip, sentencer(easy_text)) == easy_split_text:
    print 'Congratulations!'
else:
    print 'Sorry, try again!'
    print
    print 'Your version:'
    print sentencer(easy_text)
    print
    print 'Desired output:'
    print easy_split_text

Congratulations!
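
For readers who took up the invitation to use the `re` module, here is a minimal sketch of the same split written as a single pattern. The name `sentencer_re` is ours and purely illustrative, and the pattern treats every period, question mark, or exclamation mark as a sentence boundary, so it is no smarter than the loop above.

```python
import re

def sentencer_re(text):
    '''Hypothetical regex variant of sentencer: split after '.', '?' or '!'.'''
    # Each match is a run of non-terminators plus an optional closing terminator.
    pieces = re.findall(r'[^.?!]+[.?!]?', text)
    # Strip surrounding whitespace and drop pieces that were only whitespace.
    return [piece.strip() for piece in pieces if piece.strip()]

easy_text = "I went to the zoo today. What do you think of that? I bet you hate it! Or maybe you don't"
result = sentencer_re(easy_text)
```

Because the final terminator is optional in the pattern, the last sentence is kept even though it has no closing punctuation.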


### Sentence segmentation continued

What about cases where periods denote abbreviations? This time, try to do the same splits, but accommodate 'Dr.', 'Mrs.', 'Mr.', and 'Ms.'.

In [27]:
# defining the input text and what the output should be.

med_text = "My name is Dr. Lee. There is also a Mrs. Lee. Actually, there are tons! They're other people's wives."
med_split_text = ["My name is Dr. Lee.",
                  "There is also a Mrs. Lee.",
                  "Actually, there are tons!",
                  "They're other people's wives."]

In [19]:
# modify your last sentencer to account for these new patterns.


def sentencer2(text):
    '''take a string called `text` and return a list of strings, each containing a sentence'''
    abbreviations = ('Dr.', 'Mrs.', 'Mr.', 'Ms.')
    sentences = []
    substring = ''
    for c in text:
        substring += c
        if c in '?!':
            sentences.append(substring.strip())
            substring = ''
        elif c == '.':
            # a period ends the sentence unless the last word is a known abbreviation
            if substring.split()[-1] not in abbreviations:
                sentences.append(substring.strip())
                substring = ''
    # keep trailing text that has no terminator, but never an empty string
    if substring.strip():
        sentences.append(substring.strip())
    return sentences

In [20]:
# test your function by running this cell

if map(str.strip, sentencer2(med_text)) == med_split_text:
    print 'Congratulations!'
else:
    print 'Sorry, try again!'
    print
    print 'Your version:'
    print sentencer2(med_text)
    print
    print 'Desired output:'
    print med_split_text

Congratulations!
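
The abbreviation rule can also be expressed as a regular expression. The sketch below is one possible formulation, not the only one: it splits on whitespace that follows a terminator, and uses negative lookbehinds to refuse the split when the terminator belongs to one of the four titles. The names `BOUNDARY` and `sentencer2_re` are hypothetical.

```python
import re

# Whitespace after '.', '?' or '!' is a boundary, unless the text just
# before it is one of the titles 'Dr.', 'Mrs.', 'Mr.' or 'Ms.'.
BOUNDARY = re.compile(r'(?<!Dr\.)(?<!Mrs\.)(?<!Mr\.)(?<!Ms\.)(?<=[.?!])\s+')

def sentencer2_re(text):
    '''Hypothetical regex variant of sentencer2.'''
    # Drop empty strings in case the text is empty or only whitespace.
    return [sentence for sentence in BOUNDARY.split(text.strip()) if sentence]

med_text = "My name is Dr. Lee. There is also a Mrs. Lee. Actually, there are tons! They're other people's wives."
result = sentencer2_re(med_text)
```

One known limitation of this sketch: a sentence that genuinely ends with a title (for example "Please call me Dr.") will be glued to the sentence that follows it.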

### Take Home Exercise: sentence segmentation continued

Abbreviations like 'a.k.a.' are harder to accommodate. This one is quite challenging, so you can skip it if you want to move on.

In [None]:
# Run this cell for a HINT:
import base64
base64.decodestring('VHJ5IGFsbG93aW5nIHRoZSBzcGxpdHMgb24gdGhlIHBlcmlvZHMsIGJ1dCB0aGVuIHJlYXR0YWNo\naW5nIGlmIHRoZSBuZXh0IHNlbnRlbmNlIGlzIG9ubHkgb25lIGNoYXJhY3RlciBsb25n\n')

In [28]:
# defining the input text and what the output should be.

hard_text = "I know an M.D., i.e. a doctor. Like Dr. Smith, a.k.a. Docsmith."
hard_split_text = ["I know an M.D., i.e. a doctor.",
                   "Like Dr. Smith, a.k.a. Docsmith."]

In [None]:
# take home exercise:
# modify your last sentencer to account for these new patterns.

def sentencer3(text):
    '''take a string called `text` and return a list of strings, each containing a sentence'''
    # FILL IN CODE

    return sentences

In [None]:
# test your function by running this cell

if map(str.strip, sentencer3(hard_text)) == hard_split_text:
    print 'Congratulations!'
else:
    print 'Sorry, try again!'
    print
    print 'Your version:'
    print sentencer3(hard_text)
    print
    print 'Desired output:'
    print hard_split_text
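
If you get stuck, here is one possible heuristic, offered as a sketch rather than a reference solution. It does not follow the hint exactly: instead of reattaching short fragments, it walks over whitespace-separated words and treats a trailing period as a sentence end only when the word is not a known title and has no other period inside it. The names `KNOWN_TITLES` and `sentencer3_sketch` are hypothetical, and the function normalizes runs of whitespace to single spaces.

```python
KNOWN_TITLES = ('Dr.', 'Mrs.', 'Mr.', 'Ms.')

def sentencer3_sketch(text):
    '''Hypothetical word-based splitter for dotted abbreviations.'''
    sentences = []
    current = []
    for word in text.split():
        current.append(word)
        if word[-1] in '?!':
            ends_sentence = True
        elif word[-1] == '.':
            # 'i.e.' and 'a.k.a.' contain a period before the final one
            dotted = '.' in word[:-1]
            ends_sentence = word not in KNOWN_TITLES and not dotted
        else:
            # covers plain words and cases like 'M.D.,' that end in a comma
            ends_sentence = False
        if ends_sentence:
            sentences.append(' '.join(current))
            current = []
    if current:
        sentences.append(' '.join(current))
    return sentences

hard_text = "I know an M.D., i.e. a doctor. Like Dr. Smith, a.k.a. Docsmith."
result = sentencer3_sketch(hard_text)
```

The heuristic is easy to fool: a sentence that really ends in 'a.k.a.' or 'M.D.' will not be split, which is part of why this problem is hard.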

### Sentence segmentation continued

Sentence segmentation is harder than it seems! Let's take a look at how a modern system does it. [NLTK](http://www.nltk.org) is one of the most widely used NLP libraries in Python. Its sentence splitter [relies on a statistical model](http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt), the Punkt tokenizer, which learns abbreviations and sentence starters from text in order to decide when to split sentences. You'll notice that even this model can't handle our hard sentences.

To get started, you have to download the right dataset. **DO NOT** download everything. It will take forever. When the download window pops up (probably behind your other windows, annoyingly) click on the 'Models' tab, choose the 'punkt' dataset, and just download that.

In [23]:
# download the Punkt Tokenizer Models.
# DON'T DOWNLOAD EVERYTHING!
# The download window will probably pop up behind your other windows.
# uncomment the download command and comment out the print statement when you've understood these instructions.

import nltk
nltk.download()
print "did you read the instructions?"

showing info http://www.nltk.org/nltk_data/



In [24]:
import nltk.data
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

In [29]:
print sent_detector.sentences_from_text(easy_text)
print sent_detector.sentences_from_text(med_text)
print sent_detector.sentences_from_text(hard_text)

['I went to the zoo today.', 'What do you think of that?', 'I bet you hate it!', "Or maybe you don't"]
['My name is Dr. Lee.', 'There is also a Mrs. Lee.', 'Actually, there are tons!', "They're other people's wives."]
['I know an M.D., i.e.', 'a doctor.', 'Like Dr. Smith, a.k.a.', 'Docsmith.']


## Word tokenization

A more common task is to ignore sentence boundaries and just split text into words. We call this tokenization. Try your hand at it. This task should be much easier now that you're familiar with the string methods, right? You should be able to write a fairly simple function that can tokenize all of our texts from before.

In [30]:
# Define our objective tokenizations. Note that we've removed some punctuation.
easy_words = ['I', 'went', 'to', 'the', 'zoo', 'today',
              'What', 'do', 'you', 'think', 'of', 'that',
              'I', 'bet', 'you', 'hate', 'it',
              'Or', 'maybe', 'you', "don't"]
med_words = ['My', 'name', 'is', 'Dr', 'Lee',
             'There', 'is', 'also', 'a', 'Mrs', 'Lee',
             'Actually,', 'there', 'are', 'tons',
             "They're", 'other', "people's", 'wives']
hard_words = ['I', 'know', 'an', 'MD,', 'ie', 'a', 'doctor',
              'Like', 'Dr', 'Smith,', 'aka', 'Docsmith']

In [38]:
# define a function to split a string into words.
# If you're familiar with regexes, feel free to use the re module

def tokenizer(text):
    '''take a string called `text` and return a list of strings, each containing a WORD'''
    cleaned_text = text.replace('.',' ').replace('?', ' ').replace('!', ' ')
    words = cleaned_text.split()

    return words

In [39]:
# test your function by running this cell

if tokenizer(easy_text) == easy_words:
    print 'Congratulations!'
else:
    print 'Sorry, try again!'
    print
    print 'Your version:'
    print tokenizer(easy_text)
    print
    print 'Desired output:'
    print easy_words

Congratulations!


In [40]:
# test your function by running this cell

if tokenizer(med_text) == med_words:
    print 'Congratulations!'
else:
    print 'Sorry, try again!'
    print
    print 'Your version:'
    print tokenizer(med_text)
    print
    print 'Desired output:'
    print med_words

Congratulations!


In [41]:
# test your function by running this cell

if tokenizer(hard_text) == hard_words:
    print 'Congratulations!'
else:
    print 'Sorry, try again!'
    print
    print 'Your version:'

    print tokenizer(hard_text)
    print
    print 'Desired output:'
    print hard_words

Sorry, try again!

Your version:
['I', 'know', 'an', 'M', 'D', ',', 'i', 'e', 'a', 'doctor', 'Like', 'Dr', 'Smith,', 'a', 'k', 'a', 'Docsmith']

Desired output:
['I', 'know', 'an', 'MD,', 'ie', 'a', 'doctor', 'Like', 'Dr', 'Smith,', 'aka', 'Docsmith']
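
The mismatch comes from replacing each period with a space, which breaks 'M.D.' into separate letters. One way to reach the desired output, sketched below under the assumption that sentence-ending punctuation is always followed by whitespace (as it is in our three texts), is to delete the marks instead of replacing them. The name `tokenizer_sketch` is hypothetical.

```python
def tokenizer_sketch(text):
    '''Hypothetical variant of tokenizer: delete '.', '?' and '!', then split.'''
    for mark in '.?!':
        # removing the mark keeps 'M.D.' together as 'MD'
        text = text.replace(mark, '')
    return text.split()

hard_text = "I know an M.D., i.e. a doctor. Like Dr. Smith, a.k.a. Docsmith."
result = tokenizer_sketch(hard_text)
```

The trade-off is that text with no space after a terminator, such as "end.Next", would be merged into a single token.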


### Tokenization continued

Let's see how NLTK [tokenizes text into words](http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize).

In [42]:
from nltk.tokenize import word_tokenize

In [43]:
print word_tokenize(easy_text)
print word_tokenize(med_text)
print word_tokenize(hard_text)

['I', 'went', 'to', 'the', 'zoo', 'today', '.', 'What', 'do', 'you', 'think', 'of', 'that', '?', 'I', 'bet', 'you', 'hate', 'it', '!', 'Or', 'maybe', 'you', 'do', "n't"]
['My', 'name', 'is', 'Dr.', 'Lee', '.', 'There', 'is', 'also', 'a', 'Mrs.', 'Lee', '.', 'Actually', ',', 'there', 'are', 'tons', '!', 'They', "'re", 'other', 'people', "'s", 'wives', '.']
['I', 'know', 'an', 'M.D.', ',', 'i.e', '.', 'a', 'doctor', '.', 'Like', 'Dr.', 'Smith', ',', 'a.k.a', '.', 'Docsmith', '.']


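It behaves a little differently: punctuation marks become separate tokens, contractions are split ("do" and "n't"), and abbreviation periods are handled inconsistently ("M.D." stays whole, but "i.e." becomes "i.e" followed by "."). Let's try a version based on pattern matching.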

In [44]:
from nltk.tokenize import wordpunct_tokenize

In [45]:
print wordpunct_tokenize(easy_text)
print wordpunct_tokenize(med_text)
print wordpunct_tokenize(hard_text)

['I', 'went', 'to', 'the', 'zoo', 'today', '.', 'What', 'do', 'you', 'think', 'of', 'that', '?', 'I', 'bet', 'you', 'hate', 'it', '!', 'Or', 'maybe', 'you', 'don', "'", 't']
['My', 'name', 'is', 'Dr', '.', 'Lee', '.', 'There', 'is', 'also', 'a', 'Mrs', '.', 'Lee', '.', 'Actually', ',', 'there', 'are', 'tons', '!', 'They', "'", 're', 'other', 'people', "'", 's', 'wives', '.']
['I', 'know', 'an', 'M', '.', 'D', '.,', 'i', '.', 'e', '.', 'a', 'doctor', '.', 'Like', 'Dr', '.', 'Smith', ',', 'a', '.', 'k', '.', 'a', '.', 'Docsmith', '.']
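
NLTK's documentation describes this tokenizer as a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`: a token is either a run of word characters or a run of punctuation. The sketch below applies that pattern directly with the `re` module, so you can see where the splits come from without NLTK installed. The name `wordpunct_sketch` is hypothetical.

```python
import re

def wordpunct_sketch(text):
    '''Hypothetical stand-in: runs of word characters, or runs of punctuation.'''
    return re.findall(r"\w+|[^\w\s]+", text)

hard_text = "I know an M.D., i.e. a doctor. Like Dr. Smith, a.k.a. Docsmith."
result = wordpunct_sketch(hard_text)
```

This explains the '.,' token in the output above: the period and comma after 'D' are adjacent punctuation, so they match as a single run.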


In the end, there isn't really a single right way to tokenize text. Sometimes punctuation carries valuable semantic content; sometimes you want to strip it all away.

As a final thought, what do you suppose the following functions do? Go ahead and play with them.

In [46]:
from nltk import bigrams, trigrams, ngrams

In [47]:
for bigram in bigrams(word_tokenize(easy_text)):
    print bigram

('I', 'went')
('went', 'to')
('to', 'the')
('the', 'zoo')
('zoo', 'today')
('today', '.')
('.', 'What')
('What', 'do')
('do', 'you')
('you', 'think')
('think', 'of')
('of', 'that')
('that', '?')
('?', 'I')
('I', 'bet')
('bet', 'you')
('you', 'hate')
('hate', 'it')
('it', '!')
('!', 'Or')
('Or', 'maybe')
('maybe', 'you')
('you', 'do')
('do', "n't")
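
Judging from the output above, `bigrams` yields every pair of adjacent tokens; `trigrams` and `ngrams` extend the same idea to runs of three and of n tokens. A minimal pure-Python sketch of that idea follows. The helper name `ngrams_sketch` is hypothetical, and it returns a list, whereas the NLTK functions may return a lazy iterator.

```python
def ngrams_sketch(tokens, n):
    '''Hypothetical helper: return every run of n consecutive tokens as a tuple.'''
    tokens = list(tokens)
    # Build n copies of the list, each shifted one position further;
    # zip stops at the shortest copy, so only complete n-grams are produced.
    shifted = [tokens[i:] for i in range(n)]
    return list(zip(*shifted))

words = ['I', 'went', 'to', 'the', 'zoo']
bigram_list = ngrams_sketch(words, 2)
trigram_list = ngrams_sketch(words, 3)
```

A list of k tokens produces k - n + 1 n-grams, which is why the 25 tokens of `easy_text` give 24 bigrams.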
