## What is tokenization?
* turning a string or document into **tokens** (smaller chunks)
* one step in preparing a text for NLP
* many different theories and rules
* you can create your own rules using regular expressions
* some examples:
    * breaking out words or sentences
    * separating punctuation
    * separating all hashtags in a tweet
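The three examples above can be sketched with nothing but the standard `re` module. This is a minimal illustration, not a full tokenizer: the sample tweet and the variable names are made up for the example, and the patterns are deliberately simple.

```python
import re

# Hypothetical sample text, invented for this illustration
tweet = "Loving this #NLP course! Regex makes #tokenization fun."

# Breaking out words: runs of letters, digits, or underscores
words = re.findall(r"\w+", tweet)

# Separating punctuation: each non-space, non-word character becomes its own token
words_and_punct = re.findall(r"\w+|[^\w\s]", tweet)

# Separating all hashtags: a '#' followed by word characters
hashtags = re.findall(r"#\w+", tweet)

print(words)
print(words_and_punct)
print(hashtags)  # ['#NLP', '#tokenization']
```

Note that the second pattern splits `#` away from the word that follows it, which is why hashtags usually need their own rule.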

### NLTK library
* `nltk`: Natural Language Toolkit

In [1]:
from nltk.tokenize import word_tokenize
word_tokenize("Hi there!")

['Hi', 'there', '!']

### Why tokenize?
* easier to map part of speech
* matching common words
* removing unwanted tokens
* example: `"I don't like Sam's shoes."` is tokenized as
  `"I", "do", "n't", "like", "Sam", "'s", "shoes", "."`

### Other nltk tokenizers
* `sent_tokenize`: tokenize a document into sentences
* `regexp_tokenize`: tokenize a string or document based on a regular expression pattern
* `TweetTokenizer`: special class just for tweet tokenization, allowing you to separate hashtags, mentions and lots of exclamation points!!!
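A short sketch of two of these tokenizers, assuming `nltk` is installed. The sample tweet is invented for illustration. `sent_tokenize` is left out here only because it needs the Punkt model data to be downloaded first, whereas `regexp_tokenize` and `TweetTokenizer` are rule-based.

```python
from nltk.tokenize import regexp_tokenize, TweetTokenizer

# Hypothetical sample tweet, invented for this illustration
tweet = "This is the best #nlp exercise ive found online! #python"

# regexp_tokenize keeps only the pieces that match the pattern (here: hashtags)
hashtags = regexp_tokenize(tweet, r"#\w+")

# TweetTokenizer keeps each hashtag together as a single token
tweet_tokens = TweetTokenizer().tokenize(tweet)

print(hashtags)
print(tweet_tokens)
```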

### More regex practice
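* difference between `re.search()` and `re.match()`: `re.match()` only matches at the start of the string, while `re.search()` scans the whole string and returns the first match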

In [3]:
import re
re.match('abc', 'abcde')

<re.Match object; span=(0, 3), match='abc'>

In [4]:
re.search('abc', 'abcde')

<re.Match object; span=(0, 3), match='abc'>

In [5]:
re.match('cd', 'abcde')  # re.match only tries the start of the string, so this returns None (no output)

In [6]:
re.search('cd', 'abcde')

<re.Match object; span=(2, 4), match='cd'>
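Because both functions return `None` when nothing matches, it is usually worth checking the result before using it. The helper below is a hypothetical name introduced only for this sketch; it compares the two calls on the same input.

```python
import re

def match_vs_search(pattern, text):
    """Return the matched text from re.match and re.search (None where there is no match)."""
    anchored = re.match(pattern, text)   # only tries position 0
    anywhere = re.search(pattern, text)  # scans the whole string
    return (
        anchored.group() if anchored else None,
        anywhere.group() if anywhere else None,
    )

print(match_vs_search("abc", "abcde"))  # ('abc', 'abc')
print(match_vs_search("cd", "abcde"))   # (None, 'cd')
```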

In [8]:
scene_one = "SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!  [clop clop clop] \nSOLDIER #1: Halt!  Who goes there?\nARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!\nSOLDIER #1: Pull the other one!\nARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.\nSOLDIER #1: What?  Ridden on a horse?\nARTHUR: Yes!\nSOLDIER #1: You're using coconuts!\nARTHUR: What?\nSOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.\nARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?\nARTHUR: We found them.\nSOLDIER #1: Found them?  In Mercea?  The coconut's tropical!\nARTHUR: What do you mean?\nSOLDIER #1: Well, this is a temperate zone.\nARTHUR: The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land?\nSOLDIER #1: Are you suggesting coconuts migrate?\nARTHUR: Not at all.  They could be carried.\nSOLDIER #1: What?  A swallow carrying a coconut?\nARTHUR: It could grip it by the husk!\nSOLDIER #1: It's not a question of where he grips it!  It's a simple question of weight ratios!  A five ounce bird could not carry a one pound coconut.\nARTHUR: Well, it doesn't matter.  Will you go and tell your master that Arthur from the Court of Camelot is here.\nSOLDIER #1: Listen.  In order to maintain air-speed velocity, a swallow needs to beat its wings forty-three times every second, right?\nARTHUR: Please!\nSOLDIER #1: Am I right?\nARTHUR: I'm not interested!\nSOLDIER #2: It could be carried by an African swallow!\nSOLDIER #1: Oh, yeah, an African swallow maybe, but not a European swallow.  That's my point.\nSOLDIER #2: Oh, yeah, I agree with that.\nARTHUR: Will you ask your master if he wants to join my court at Camelot?!\nSOLDIER #1: But then of course a-- African swallows are non-migratory.\nSOLDIER #2: Oh, yeah...\nSOLDIER #1: So they couldn't bring a coconut back anyway...  [clop clop clop] \nSOLDIER #2: Wait a minute!  Supposing two swallows carried it together?\nSOLDIER #1: No, they'd have to have it on a line.\nSOLDIER #2: Well, simple!  They'd just use a strand of creeper!\nSOLDIER #1: What, held under the dorsal guiding feathers?\nSOLDIER #2: Well, why not?\n"

In [9]:
# Import necessary modules
from nltk.tokenize import word_tokenize, sent_tokenize
import re

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)


{'I', 'court', 'who', 'with', 'maintain', "'re", 'beat', 'halves', 'get', 'or', 'husk', 'these', 'question', 'speak', 'wind', 'son', 'What', 'horse', 'Please', 'other', 'back', 'snows', 'line', 'carrying', "'d", 'Wait', 'breadth', 'We', 'Pendragon', 'lord', 'why', 'The', '!', 'Uther', 'be', 'defeator', 'my', 'sovereign', 'to', 'do', 'Am', 'African', 'bangin', 'all', '--', 'second', 'Oh', 'could', 'Camelot', 'under', 'me', 'land', 'wings', 'Found', 'But', 'King', 'seek', 'empty', 'martin', 'bring', 'on', 'Ridden', 'two', ':', 'the', 'interested', 'must', 'bird', 'minute', 'trusty', 'coconut', 'anyway', 'Arthur', 'your', 'house', 'It', '.', 'south', 'fly', 'swallows', 'he', 'since', 'That', 'grips', 'course', 'kingdom', 'In', 'will', 'Supposing', 'Yes', 'right', 'but', '...', 'our', 'times', '2', 'clop', 'knights', 'every', 'and', 'they', 'Where', 'coconuts', 'England', 'Court', 'search', 'found', 'them', 'use', '1', 'Saxons', 'go', "'m", 'tell', 'castle', 'matter', 'does', 'wants', 'in'

In [10]:
# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

580 588
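The start and end indexes are ordinary string offsets, so they can be used to slice the matched text back out. A small sketch on a single line of the scene (a shorter string than `scene_one`, so the offsets differ from the output above):

```python
import re

# One line taken from the scene, used here as a short stand-in for scene_one
line = "SOLDIER #1: You're using coconuts!"

match = re.search("coconuts", line)
if match:
    # span() returns (start, end); end is exclusive, like a slice bound
    start, end = match.span()
    found = line[start:end]
    print(start, end, found)
```

Slicing with `line[start:end]` always gives the same text as `match.group()`.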


In [11]:
# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))

<re.Match object; span=(9, 32), match='[wind] [clop clop clop]'>
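The match above covers both bracketed notes because `.*` is greedy: it runs to the last `]` on the line. If the goal is only the first bracketed note, a non-greedy `.*?` is one option. A minimal sketch on the opening of the scene (the variable names are chosen for this example):

```python
import re

# Opening of the scene, enough to show the difference
text = "SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!"

# Greedy: .* extends to the last ']' before the newline
greedy = re.search(r"\[.*\]", text)

# Non-greedy: .*? stops at the first ']' it can
lazy = re.search(r"\[.*?\]", text)

# findall with the non-greedy pattern returns each bracketed note separately
all_notes = re.findall(r"\[.*?\]", text)

print(greedy.group())  # [wind] [clop clop clop]
print(lazy.group())    # [wind]
print(all_notes)       # ['[wind]', '[clop clop clop]']
```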


In [12]:
sentences = ['SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!',
 '[clop clop clop] \nSOLDIER #1: Halt!',
 'Who goes there?',
 'ARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.',
 'King of the Britons, defeator of the Saxons, sovereign of all England!',
 'SOLDIER #1: Pull the other one!',
 'ARTHUR: I am, ...  and this is my trusty servant Patsy.',
 'We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.',
 'I must speak with your lord and master.',
 'SOLDIER #1: What?',
 'Ridden on a horse?',
 'ARTHUR: Yes!',
 "SOLDIER #1: You're using coconuts!",
 'ARTHUR: What?',
 "SOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.",
 'ARTHUR: So?',
 "We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?",
 'ARTHUR: We found them.',
 'SOLDIER #1: Found them?',
 'In Mercea?',
 "The coconut's tropical!",
 'ARTHUR: What do you mean?',
 'SOLDIER #1: Well, this is a temperate zone.',
 'ARTHUR: The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land?',
 'SOLDIER #1: Are you suggesting coconuts migrate?',
 'ARTHUR: Not at all.',
 'They could be carried.',
 'SOLDIER #1: What?',
 'A swallow carrying a coconut?',
 'ARTHUR: It could grip it by the husk!',
 "SOLDIER #1: It's not a question of where he grips it!",
 "It's a simple question of weight ratios!",
 'A five ounce bird could not carry a one pound coconut.',
 "ARTHUR: Well, it doesn't matter.",
 'Will you go and tell your master that Arthur from the Court of Camelot is here.',
 'SOLDIER #1: Listen.',
 'In order to maintain air-speed velocity, a swallow needs to beat its wings forty-three times every second, right?',
 'ARTHUR: Please!',
 'SOLDIER #1: Am I right?',
 "ARTHUR: I'm not interested!",
 'SOLDIER #2: It could be carried by an African swallow!',
 'SOLDIER #1: Oh, yeah, an African swallow maybe, but not a European swallow.',
 "That's my point.",
 'SOLDIER #2: Oh, yeah, I agree with that.',
 'ARTHUR: Will you ask your master if he wants to join my court at Camelot?!',
 'SOLDIER #1: But then of course a-- African swallows are non-migratory.',
 'SOLDIER #2: Oh, yeah...',
 "SOLDIER #1: So they couldn't bring a coconut back anyway...  [clop clop clop] \nSOLDIER #2: Wait a minute!",
 'Supposing two swallows carried it together?',
 "SOLDIER #1: No, they'd have to have it on a line.",
 'SOLDIER #2: Well, simple!',
 "They'd just use a strand of creeper!",
 'SOLDIER #1: What, held under the dorsal guiding feathers?',
 'SOLDIER #2: Well, why not?']

In [13]:
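# Find the script notation (speaker label) at the beginning of the fourth sentence and print it
# Note: inside a character class, + is a literal plus sign, so \w+: is the intended pattern
pattern2 = r"\w+:"
print(re.match(pattern2, sentences[3]))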

<re.Match object; span=(0, 7), match='ARTHUR:'>
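A pattern built only from `\w` works for `ARTHUR:` but would not match labels such as `KING ARTHUR:` or `SOLDIER #1:`, which contain a space or `#`. One possible broader pattern is sketched below. It assumes speaker labels are written in capitals, which holds for this scene but is an assumption, not a general rule.

```python
import re

# A few lines in the style of the scene
lines = ["KING ARTHUR: Whoa there!", "SOLDIER #1: Halt!", "Who goes there?"]

# Assumed label shape: a capital letter, then capitals, spaces, '#' or digits, then ':'
speaker_pattern = r"[A-Z][A-Z #\d]*:"

labels = []
for line in lines:
    m = re.match(speaker_pattern, line)
    # Store the label text, or None when the line does not start with one
    labels.append(m.group() if m else None)

print(labels)  # ['KING ARTHUR:', 'SOLDIER #1:', None]
```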
