### Tokenization

The corpora bundled with nltk can be loaded already tokenized. When working with our own text, we usually need to create the tokens ourselves.

Tokenization - the act of breaking a text document into individual words or sentences

Sentence tokens - Tokens divided at sentence boundaries, which are usually marked by end punctuation such as `.`, `!` or `?`

Word tokens - Tokens divided on spaces or punctuation (every item is a word or a punctuation symbol)

Working with tokens makes analyzing text much easier.
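To make these definitions concrete, here is a minimal sketch that approximates both kinds of tokens with regular expressions from the standard library. The helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the rules are deliberately naive (an abbreviation such as "Dr." would be treated as a sentence end); nltk's tokenizers handle many more cases.

```python
import re


def simple_sent_tokenize(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def simple_word_tokenize(text):
    """Naive word split: a token is a run of word characters or one punctuation symbol."""
    return re.findall(r"\w+|[^\w\s]", text)


phrase = "I am learning Natural Language Processing. I am learning how to tokenize!"

sentences = simple_sent_tokenize(phrase)
words = simple_word_tokenize(sentences[0])

print("Sentence tokens:", sentences)
print("Word tokens:", words)
```

On this simple phrase the sketch gives two sentence tokens, and the first sentence splits into six words plus the final period.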

In [1]:
import nltk

In [2]:
my_string = "I am learning Natural Language Processing."

tokens = nltk.word_tokenize(my_string)
print("Tokens: ", tokens)

Tokens:  ['I', 'am', 'learning', 'Natural', 'Language', 'Processing', '.']


In [3]:
#Original string is now a list of tokens
num_tokens = len(tokens)
print("Number of tokens:", num_tokens)

Number of tokens: 7


In [4]:
phrase = "I am learning Natural Language Processing. I am learning how to tokenize!"
tokens_sent = nltk.sent_tokenize(phrase)

print("Sentence tokens: ", tokens_sent)

Sentence tokens:  ['I am learning Natural Language Processing.', 'I am learning how to tokenize!']


In [5]:
#Tokenize sentence tokens

for item in tokens_sent:
    print("\nSentence: ", item)
    tokens_from_sent_tokens = nltk.word_tokenize(item)
    print("Sentence tokenized: ", tokens_from_sent_tokens)


Sentence:  I am learning Natural Language Processing.
Sentence tokenized:  ['I', 'am', 'learning', 'Natural', 'Language', 'Processing', '.']

Sentence:  I am learning how to tokenize!
Sentence tokenized:  ['I', 'am', 'learning', 'how', 'to', 'tokenize', '!']


#### Normalizing
Normalizing - the act of cleaning our text data to make it more uniform

In [6]:
md = nltk.corpus.gutenberg.words("melville-moby_dick.txt")

md_22 = md[:22]
print("First 22 words: ", md_22)

First 22 words:  ['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'by', 'a', 'Late', 'Consumptive', 'Usher', 'to', 'a', 'Grammar', 'School', ')']


In [7]:
for word in md_22:
    # str.isalpha() returns True only if every character in the string is alphabetic
    if word.isalpha():
        print(word)

Moby
Dick
by
Herman
Melville
ETYMOLOGY
Supplied
by
a
Late
Consumptive
Usher
to
a
Grammar
School


In [8]:
#Make all our text the same case
md_22_lower = [word.lower() for word in md_22]
print("First 22 words in lower case: ", md_22_lower)

First 22 words in lower case:  ['[', 'moby', 'dick', 'by', 'herman', 'melville', '1851', ']', 'etymology', '.', '(', 'supplied', 'by', 'a', 'late', 'consumptive', 'usher', 'to', 'a', 'grammar', 'school', ')']


In [9]:
md_22_norm = [word.lower() for word in md_22 if word.isalpha()]
print("First 22 words in lower case and only alpha: ", md_22_norm)

First 22 words in lower case and only alpha:  ['moby', 'dick', 'by', 'herman', 'melville', 'etymology', 'supplied', 'by', 'a', 'late', 'consumptive', 'usher', 'to', 'a', 'grammar', 'school']
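One caveat worth checking when normalizing our own text: `str.isalpha()` rejects any token that contains a non-letter, so digits, hyphenated words and contractions are dropped along with punctuation. A small sketch, using made-up sample tokens:

```python
# Made-up sample tokens; only "Moby" consists purely of letters
samples = ["Moby", "1851", "sea-captain", "don't", "["]

# Record what isalpha() says about each token
alpha_flags = {token: token.isalpha() for token in samples}

# Same normalization as above: keep alphabetic tokens and lowercase them
normalized = [token.lower() for token in samples if token.isalpha()]

for token, is_alpha in alpha_flags.items():
    print(f"{token!r}: {is_alpha}")
print("Normalized:", normalized)
```

If hyphenated words or contractions matter for the analysis, a looser filter than `isalpha()` may be the better choice.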


There are several ways to normalize even further.

Suppose we want to remove affixes from our words. We can accomplish this with stemmers, which strip affixes according to a fixed set of rules.
Stemming helps when we want to know whether `cats` is referenced as well as `cat`.

In [10]:
porter = nltk.PorterStemmer()

plurals_list = ["cat", "cats", "lie", "lying", "run", "running", "city", "cities", "month", "monthly", "woman", "women"]

for word in plurals_list:
    print(porter.stem(word))

cat
cat
lie
lie
run
run
citi
citi
month
monthli
woman
women
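Stems such as `citi` appear because a stemmer applies suffix-rewriting rules without checking that the result is a real word. The sketch below is a toy illustration, not the Porter algorithm: `toy_stem` is a hypothetical helper whose four rules are loosely modelled on Porter's first plural-handling step.

```python
# Ordered (suffix, replacement) rules; the first matching suffix wins.
# Illustrative only: the real Porter stemmer applies several further steps.
TOY_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def toy_stem(word):
    """Rewrite the first matching suffix; leave the word unchanged otherwise."""
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for word in ["cat", "cats", "cities", "glass", "women"]:
    print(word, "->", toy_stem(word))
```

Like the Porter output above, the toy maps `cities` to `citi` and leaves the irregular plural `women` alone, because no suffix rule matches it.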


In [11]:
lancaster = nltk.LancasterStemmer()

for word in plurals_list:
    print(lancaster.stem(word))

cat
cat
lie
lying
run
run
city
city
mon
month
wom
wom


Lemmatization is one more method for solving the normalization problem.

We can use the WordNet resource provided by nltk to map each word to its dictionary form (its lemma).
**Note**: Lemmatization is more computationally expensive than stemming.

In [12]:
wordnet_lem = nltk.WordNetLemmatizer()

for word in plurals_list:
    print(wordnet_lem.lemmatize(word))

cat
cat
lie
lying
run
running
city
city
month
monthly
woman
woman
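In the output above, `lying` and `running` come back unchanged. A likely reason is that `lemmatize` treats every word as a noun unless told otherwise through its `pos` argument. The sketch below imitates that lookup with a hypothetical four-entry lexicon (`TOY_LEXICON`) standing in for WordNet, so it runs without any corpus download.

```python
# Hypothetical miniature lexicon standing in for WordNet: (word, pos) -> lemma
TOY_LEXICON = {
    ("women", "n"): "woman",
    ("cities", "n"): "city",
    ("running", "v"): "run",
    ("lying", "v"): "lie",
}


def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for a known (word, pos) pair, else the word itself."""
    return TOY_LEXICON.get((word, pos), word)


as_noun = toy_lemmatize("running")           # looked up as a noun: no entry found
as_verb = toy_lemmatize("running", pos="v")  # looked up as a verb: entry found

print("running as noun:", as_noun)
print("running as verb:", as_verb)
print("women as noun:", toy_lemmatize("women"))
```

The same idea should carry over to the real lemmatizer: supplying the part of speech (for example from a tagger) gives it a chance to reduce verb forms as well.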


### Part of Speech Tagging

We may discover that knowing the part of speech of each word is required to solve a given NLP problem.

In [13]:
tag_phrase = "I walked to the cafe to buy coffee before work."

tag_tokens = nltk.word_tokenize(tag_phrase)

tagged = nltk.pos_tag(tag_tokens)

print("Tagged tokens: ", tagged)

Tagged tokens:  [('I', 'PRP'), ('walked', 'VBD'), ('to', 'TO'), ('the', 'DT'), ('cafe', 'NN'), ('to', 'TO'), ('buy', 'VB'), ('coffee', 'NN'), ('before', 'IN'), ('work', 'NN'), ('.', '.')]


Now we have a part of speech assigned to every token in our phrase.

Run the code block below for a description of what each tag means.

In [14]:
# upenn_tagset() prints the tag descriptions itself and returns None, so no print() is needed
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [15]:
example_phrase = "I will have dessert."
tagged_example_phrase = nltk.pos_tag(nltk.word_tokenize(example_phrase))

print("Tagged example phrase: ", tagged_example_phrase)

Tagged example phrase:  [('I', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('dessert', 'NN'), ('.', '.')]


In [16]:
#Find most common nouns in moby_dick
md = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
md_norm = [word.lower() for word in md if word.isalpha()]

#We do not need to distinguish between types of nouns, so we ask the tagger for the coarser universal tagset.

md_tags = nltk.pos_tag(md_norm, tagset="universal")

print("Number of words tagged: ", len(md_norm))
print("Tagged words: ", md_tags[:20])

Number of words tagged:  218361
Tagged words:  [('moby', 'NOUN'), ('dick', 'NOUN'), ('by', 'ADP'), ('herman', 'NOUN'), ('melville', 'NOUN'), ('etymology', 'NOUN'), ('supplied', 'VERB'), ('by', 'ADP'), ('a', 'DET'), ('late', 'ADJ'), ('consumptive', 'NOUN'), ('usher', 'NOUN'), ('to', 'PRT'), ('a', 'DET'), ('grammar', 'NOUN'), ('school', 'NOUN'), ('the', 'DET'), ('pale', 'NOUN'), ('usher', 'NOUN'), ('threadbare', 'NOUN')]
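Passing `tagset="universal"` collapses the fine-grained Penn Treebank tags into a small set of coarse categories. As a rough illustration, the sketch below applies a hand-written partial mapping (`PENN_TO_UNIVERSAL`, a hypothetical name covering only the tags in the cafe example) to the tagged tokens printed earlier; nltk ships its own complete mapping.

```python
# Partial, hand-written mapping: only the Penn tags that occur in the cafe example
PENN_TO_UNIVERSAL = {
    "PRP": "PRON",
    "VBD": "VERB",
    "VB": "VERB",
    "TO": "PRT",
    "DT": "DET",
    "NN": "NOUN",
    "IN": "ADP",
    ".": ".",
}

# Tagged tokens copied from the pos_tag output for the cafe phrase
tagged_penn = [
    ("I", "PRP"), ("walked", "VBD"), ("to", "TO"), ("the", "DT"),
    ("cafe", "NN"), ("to", "TO"), ("buy", "VB"), ("coffee", "NN"),
    ("before", "IN"), ("work", "NN"), (".", "."),
]

# Tags missing from the partial mapping fall back to "X" (the catch-all category)
tagged_universal = [(word, PENN_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged_penn]

nouns = [word for word, tag in tagged_universal if tag == "NOUN"]
print("Universal tags:", tagged_universal)
print("Nouns:", nouns)
```

With the coarse tags, a single comparison against `"NOUN"` replaces checks for `NN`, `NNS`, `NNP` and `NNPS`.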


In [17]:
md_nouns = [word for word, tag in md_tags if tag == "NOUN"]

print("MD nouns: ", md_nouns[:20])

MD nouns:  ['moby', 'dick', 'herman', 'melville', 'etymology', 'consumptive', 'usher', 'grammar', 'school', 'pale', 'usher', 'threadbare', 'heart', 'body', 'brain', 'i', 'lexicons', 'grammars', 'queer', 'handkerchief']


In [18]:
#Frequency distribution to find most common nouns
md_nouns_fd = nltk.FreqDist(md_nouns)

print("Most common nouns: ", md_nouns_fd.most_common(20))

Most common nouns:  [('i', 1182), ('whale', 909), ('s', 774), ('man', 527), ('ship', 498), ('sea', 435), ('head', 337), ('time', 334), ('boat', 332), ('ahab', 278), ('captain', 275), ('way', 271), ('whales', 256), ('men', 244), ('ye', 220), ('hand', 214), ('side', 197), ('water', 190), ('deck', 189), ('thing', 188)]
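`nltk.FreqDist` is built on Python's `collections.Counter`, so the same counting can be sketched with the standard library alone. The sample list below is made up, and the length filter is an assumption added here: it is one simple way to drop one-letter tokens like the `i` and `s` in the output above, which are probably artifacts of lowercasing and of possessives being split off.

```python
from collections import Counter

# Made-up sample standing in for the list of tagged nouns
sample_nouns = ["whale", "i", "ship", "whale", "s", "sea", "ship", "whale", "i"]

# Assumption: one-letter tokens are tagging artifacts, so filter them out
filtered_nouns = [word for word in sample_nouns if len(word) > 1]

noun_counts = Counter(filtered_nouns)
top_two = noun_counts.most_common(2)

print("Most common nouns:", top_two)
```

Whether a length filter is appropriate depends on the text; a stop-word list would be a more careful alternative.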


#### Example: Multiple Parts of Speech

In this example, we will count how often a word is tagged with each part of speech, since the same word can take different parts of speech depending on usage.

For the Alice in Wonderland text, find all parts of speech used for the words 'over', 'spoke', and 'answer'.

In [19]:
alice = nltk.corpus.gutenberg.words("carroll-alice.txt")

alice_norm = [word.lower() for word in alice if word.isalpha()]

alice_tags = nltk.pos_tag(alice_norm, tagset="universal")

alice_word_tags = [(word, tag) for word, tag in alice_tags if word in ["over", "spoke", "answer"]]

alice_tags_cd = nltk.ConditionalFreqDist(alice_word_tags)

print("Alice word tags conditional frequency distribution: ", alice_tags_cd.items())

Alice word tags conditional frequency distribution:  dict_items([('over', FreqDist({'ADP': 31, 'PRT': 5, 'ADV': 4})), ('spoke', FreqDist({'VERB': 16, 'NOUN': 1})), ('answer', FreqDist({'NOUN': 5, 'VERB': 3, 'ADP': 1}))])
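A conditional frequency distribution is essentially one counter per condition, here one tag counter per word. A minimal standard-library sketch of the same bookkeeping, using a made-up list of `(word, tag)` pairs:

```python
from collections import Counter, defaultdict

# Made-up (word, tag) pairs standing in for the filtered tagger output
word_tags = [
    ("over", "ADP"), ("over", "ADV"), ("over", "ADP"),
    ("answer", "NOUN"), ("answer", "VERB"), ("answer", "NOUN"),
]

# One Counter per word: the word is the condition, the tag is the counted sample
tag_counts = defaultdict(Counter)
for word, tag in word_tags:
    tag_counts[word][tag] += 1

for word, counts in tag_counts.items():
    print(word, dict(counts))
```

Indexing by word (`tag_counts["over"]`) then gives the tag distribution for that word.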


#### Example: Choices
Let's say we want to find all cases in a given text where there is a choice between two options.
This would be of the form \<noun\> or \<noun\>.

Task: Load bryant-stories.txt and find every place where a choice is presented.

In [20]:
bryant_stories = nltk.corpus.gutenberg.words("bryant-stories.txt")

tags = nltk.pos_tag(bryant_stories, tagset="universal")

print("tags", tags[:10])

tags [('[', 'NOUN'), ('Stories', 'NOUN'), ('to', 'PRT'), ('Tell', 'VERB'), ('to', 'PRT'), ('Children', 'NOUN'), ('by', 'ADP'), ('Sara', 'NOUN'), ('Cone', 'NOUN'), ('Bryant', 'NOUN')]


In [21]:
choices = []
for ((word1, tag1), (word2, tag2), (word3, tag3)) in nltk.trigrams(tags):
    if tag1 == "NOUN" and word2 == "or" and tag3 == "NOUN":
        choices.append((word1, word3))
        print(word1 + " " + word2 + " " + word3)

ship or part
food or water
queens or princesses
rank or wealth


### Chunking

Chunking addresses the case where words that belong together are split into separate tokens, e.g. 'New' 'York'. It groups adjacent tokens back into a single unit based on their tags.

In [22]:
chunking_phrase = "I will go to the coffee shop in New York after I get off the jet plane."

chunking_tags = nltk.pos_tag(nltk.word_tokenize(chunking_phrase))

print("chunking_tags", chunking_tags[:10])

chunking_tags [('I', 'PRP'), ('will', 'MD'), ('go', 'VB'), ('to', 'TO'), ('the', 'DT'), ('coffee', 'NN'), ('shop', 'NN'), ('in', 'IN'), ('New', 'NNP'), ('York', 'NNP')]


In [23]:
sequence = '''
    Chunk:
    {<NNPS>+}
    {<NNP>+}
    {<NN>+}'''

In [24]:
NPChunker = nltk.RegexpParser(sequence)
result = NPChunker.parse(chunking_tags)
print(result)

(S
  I/PRP
  will/MD
  go/VB
  to/TO
  the/DT
  (Chunk coffee/NN shop/NN)
  in/IN
  (Chunk New/NNP York/NNP)
  after/IN
  I/PRP
  get/VBP
  off/IN
  the/DT
  (Chunk jet/NN plane/NN)
  ./.)
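Each rule in the grammar above, such as `{<NN>+}`, matches a run of one or more consecutive tokens carrying the same tag. That idea can be sketched without nltk by grouping adjacent tokens by tag. This is only an approximation of what `RegexpParser` does; `chunk_runs` and `CHUNK_TAGS` are hypothetical names, and the tagged tokens are copied from the output above.

```python
from itertools import groupby

# Tags whose consecutive runs should be merged into one chunk
CHUNK_TAGS = {"NN", "NNP", "NNPS"}


def chunk_runs(tagged_tokens):
    """Join each run of adjacent tokens that share a chunkable tag."""
    chunks = []
    # groupby yields one group per maximal run of identical tags
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag in CHUNK_TAGS:
            chunks.append(" ".join(words))
    return chunks


tagged_sample = [
    ("the", "DT"), ("coffee", "NN"), ("shop", "NN"), ("in", "IN"),
    ("New", "NNP"), ("York", "NNP"), ("after", "IN"), ("I", "PRP"),
]

found_chunks = chunk_runs(tagged_sample)
print(found_chunks)
```

As with the grammar, a common noun directly followed by a proper noun would end up in two separate chunks, because the tags differ.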


#### Named Entity Recognition

We can also use chunking to find named entities (people, locations, etc.).

In [25]:
# Read the file as UTF-8 so that non-ASCII characters (such as dashes) decode correctly,
# and use a context manager so the file is closed afterwards
with open("example.txt", encoding="utf-8") as f:
    text = f.read()
print("Text: ", text)

Text:  World War II (WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, though related conflicts began earlier. It involved the vast majority of the world's nations—including all of the great powers—eventually forming two opposing military alliances: the Allies and the Axis. It was the most widespread war in history, and directly involved more than 100 million people from over 30 countries. In a state of "total war", the major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, erasing the distinction between civilian and military resources. Marked by mass deaths of civilians, including the Holocaust (in which approximately 11 million people were killed) and the strategic bombing of industrial and population centres (in which approximately one million were killed, and which included the atomic bombings of Hiroshima and Nagasaki), it resulted in an estimated 50 million to 85 million f

In [26]:
text_tag = nltk.pos_tag(nltk.word_tokenize(text))

text_chunks = nltk.ne_chunk(text_tag)

for chunk in text_chunks:
    if hasattr(chunk, 'label'):
        print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))

ORGANIZATION WWII
ORGANIZATION WW2
ORGANIZATION Second
ORGANIZATION Axis
ORGANIZATION Hiroshima
GPE Nagasaki
ORGANIZATION Empire of Japan
GPE Asia
ORGANIZATION Pacific
ORGANIZATION Republic
GPE China
GPE Poland
GPE Germany
GPE Germany
GPE France
ORGANIZATION United Kingdom
GPE Germany
GPE Europe
ORGANIZATION Axis
GPE Italy
GPE Japan
GPE Germany
GPE Soviet Union
GPE European
GPE Poland
GPE Finland
GPE Romania
GPE Baltic
ORGANIZATION United Kingdom
GPE British
ORGANIZATION European Axis
GPE North Africa
ORGANIZATION Horn
GPE Africa
GPE Britain
GPE Blitz
ORGANIZATION Atlantic
ORGANIZATION European Axis
GPE Soviet Union
ORGANIZATION Axis
GPE Japan
GPE United States
GPE European
ORGANIZATION Pacific Ocean
LOCATION Western Pacific
ORGANIZATION Axis
PERSON Japan
GPE Midway
GPE Hawaii
GPE Germany
GPE North Africa
FACILITY Stalingrad
GPE Soviet Union
GPE German
LOCATION Eastern Front
GPE Italy
GPE Italian
GPE Allied
ORGANIZATION Pacific
ORGANIZATION Axis
LOCATION Western
GPE France
GPE Soviet U