# Tokenisation and Tagging of Text
To classify and analyze a body of text at a finer granularity, we first need to break it into individual sentences and words, or "tokens". Broadly, there are two tasks:
* Sentence Tokenization
* Word Tokenization

To go beyond counting the frequency or occurrence of individual words, we need to classify words into general categories that signify their role in the structure of the sentence, for instance noun, verb or adjective. This is generally known as
* Part of Speech or POS Tagging

# Token
A token is each "entity" produced when text is split up according to a set of rules. For example, each word is a token when a sentence is tokenized into words. Each sentence can also be a token, if you tokenize the sentences out of a paragraph.
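As a rough illustration of both levels of token, the sketch below splits a short paragraph using only Python's `re` module. The sample paragraph and the two splitting rules are assumptions made for this example, and they are far cruder than the NLTK tokenizers used in the rest of this notebook.

```python
import re

paragraph = "Tokens are units of text. A token can be a word. It can also be a sentence."

# Naive sentence tokens: split on whitespace that follows '.', '!' or '?'.
sentences = re.split(r"(?<=[.!?])\s+", paragraph)

# Naive word tokens: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", sentences[0])

print(sentences)
print(words)
```

The same text therefore yields three sentence tokens, and the first sentence yields six word tokens (five words plus the full stop).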

# Sentence Tokenisation
The default Sentence Tokenizer is the `PunktSentenceTokenizer` from the `nltk.tokenize.punkt` module.


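In the example below (taken from James Joyce's Ulysses), we load the nltk library and split a block of text into sentences. The backslash line continuations join the source lines without inserting a space, which is why `Hanlon'smilkman` appears as one word in the output and why the sentence ending in `floor.` runs straight into the next line of dialogue.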

In [1]:
import nltk

ulysses = "Mrkgnao! the cat said loudly. She blinked up out of her avid shameclosing eyes, mewing \
plaintively and long, showing him her milkwhite teeth. He watched the dark eyeslits narrowing \
with greed till her eyes were green stones. Then he went to the dresser, took the jug Hanlon's\
milkman had just filled for him, poured warmbubbled milk on a saucer and set it slowly on the floor.\
— Gurrhr! she cried, running to lap."

doc = nltk.sent_tokenize(ulysses)
for s in doc:
    print(">",s)

> Mrkgnao!
> the cat said loudly.
> She blinked up out of her avid shameclosing eyes, mewing plaintively and long, showing him her milkwhite teeth.
> He watched the dark eyeslits narrowing with greed till her eyes were green stones.
> Then he went to the dresser, took the jug Hanlon'smilkman had just filled for him, poured warmbubbled milk on a saucer and set it slowly on the floor.— Gurrhr!
> she cried, running to lap.
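One reason sentence tokenisation needs more than a split on punctuation is abbreviations such as "Mr.". The sketch below contrasts a naive regular-expression split with a version that re-joins pieces ending in a known abbreviation. The `ABBREVIATIONS` set and both helper functions are hypothetical and exist only for this illustration; it is not how Punkt works internally, since Punkt learns such cases from text rather than relying on a fixed list.

```python
import re

# Hypothetical, hand-picked abbreviation list used only for this sketch.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof."}


def naive_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def abbreviation_aware_sentences(text):
    """Re-join pieces that were wrongly split after a known abbreviation."""
    merged = []
    for piece in naive_sentences(text):
        previous_words = merged[-1].split() if merged else []
        if previous_words and previous_words[-1].lower() in ABBREVIATIONS:
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    return merged


text = "Hello Mr. Smith, how are you doing today? The weather is great."
print(naive_sentences(text))
print(abbreviation_aware_sentences(text))
```

The naive version breaks the greeting after "Mr.", while the abbreviation-aware version returns the two sentences a reader would expect.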



# Word Tokenisation
There are many methods for tokenising text into words. The default, `word_tokenize`, follows the tokenisation conventions of the Penn Treebank corpus. A few tokenisers that give different results are listed below:
* TreebankWordTokenizer
* WordPunctTokenizer
* WhitespaceTokenizer (the example below uses the closely related SpaceTokenizer)

We can see a simple illustration of the impact of choosing a different tokenisation method by looking at the different results we get for a simple sentence:

In [2]:
from nltk import word_tokenize
sentence = "Mary had a little lamb It's fleece was white as snow."

# Default tokenization
tree_tokens = word_tokenize(sentence)

# Other tokenizers
punct_tokenizer = nltk.tokenize.WordPunctTokenizer()
punct_tokens = punct_tokenizer.tokenize(sentence)

space_tokenizer = nltk.tokenize.SpaceTokenizer()
space_tokens = space_tokenizer.tokenize(sentence)

print("DEFAULT: ", tree_tokens)
print()
print("PUNCT  : ", punct_tokens)
print()
print("SPACE  : ", space_tokens)

DEFAULT:  ['Mary', 'had', 'a', 'little', 'lamb', 'It', "'s", 'fleece', 'was', 'white', 'as', 'snow', '.']

PUNCT  :  ['Mary', 'had', 'a', 'little', 'lamb', 'It', "'", 's', 'fleece', 'was', 'white', 'as', 'snow', '.']

SPACE  :  ['Mary', 'had', 'a', 'little', 'lamb', "It's", 'fleece', 'was', 'white', 'as', 'snow.']
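The difference between the last two results can be approximated with the standard library alone. This is a sketch rather than NLTK's own code: splitting on single spaces mimics the `SPACE` result, and the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation) mimics the `PUNCT` result for this particular sentence.

```python
import re

sentence = "Mary had a little lamb It's fleece was white as snow."

# Split on single spaces only: punctuation stays attached to the words.
space_like = sentence.split(" ")

# Alternate between runs of word characters and runs of punctuation.
punct_like = re.findall(r"\w+|[^\w\s]+", sentence)

print("SPACE-like:", space_like)
print("PUNCT-like:", punct_like)
```

Whether "It's" should be one token, two, or three depends on what is done with the tokens next, which is why the choice of tokeniser matters for the tagging step later on.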


# Lexicons and Corpora
- Corpus (plural: corpora): a body of text on a common subject, e.g. medical journals or presidential speeches
- Lexicon: words and their meanings, which can differ between domains, e.g. investor-speak vs. regular English
    - investor-speak: a 'bull' is someone who is positive about the market
    - regular English: a 'bull' is a scary animal you do not want running at you
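One simple way to picture a lexicon is as a dictionary per domain. The sketch below is purely illustrative: the `LEXICONS` structure and the `meaning` helper are hypothetical names, not part of NLTK.

```python
# Hypothetical mini-lexicons: the same word maps to a different meaning per domain.
LEXICONS = {
    "investor": {"bull": "someone who is positive about the market"},
    "english": {"bull": "a scary animal you do not want running at you"},
}


def meaning(word, domain):
    """Look up a word in the lexicon for the given domain."""
    return LEXICONS[domain].get(word.lower(), "unknown")


print(meaning("bull", "investor"))
print(meaning("Bull", "english"))
```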

In [3]:
from nltk.tokenize import sent_tokenize , word_tokenize
example_text = "Hello Mr. Smith, how are you doing today? The weather is grate and Python is awesome.\
The sky is pinkish-blue. you shouldn't eat cardboard" 

print(sent_tokenize(example_text))
print(word_tokenize(example_text))

['Hello Mr. Smith, how are you doing today?', 'The weather is grate and Python is awesome.The sky is pinkish-blue.', "you shouldn't eat cardboard"]
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'grate', 'and', 'Python', 'is', 'awesome.The', 'sky', 'is', 'pinkish-blue', '.', 'you', 'should', "n't", 'eat', 'cardboard']


There are a few things to note here. First, punctuation is treated as a separate token. Second, the word "shouldn't" is split into "should" and "n't". Third, "pinkish-blue" is kept as the single word it was meant to be. Finally, the line continuation in the string leaves no space after "awesome.", so "awesome.The" is treated as one token and the sentence tokenizer does not split there.

Looking at these tokens, we can start to think about how to derive meaning from them. Many of the words clearly carry value, but a few carry almost none. These are a form of "stop words", which we can also filter out.
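A minimal sketch of stop word filtering is shown here, using tokens copied from the output above. The `STOP_WORDS` set is a tiny hand-written list assumed for this example; NLTK ships fuller lists in `nltk.corpus.stopwords`, which need a separate data download.

```python
# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "how", "is", "the", "you"}

tokens = ["Hello", "Mr.", "Smith", ",", "how", "are", "you", "doing", "today", "?"]

# Keep tokens that are not stop words and contain at least one letter or digit,
# which also drops pure punctuation tokens such as ',' and '?'.
content_tokens = [
    t for t in tokens
    if t.lower() not in STOP_WORDS and any(c.isalnum() for c in t)
]

print(content_tokens)
```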

# Part of Speech Tagging


One of the more powerful features of NLTK is its Part of Speech tagging, which labels the words in a sentence as nouns, adjectives, verbs and so on. The tags are fine-grained: verbs, for example, are further distinguished by tense and form.

For each word token, the nltk `pos_tag` function assigns a Part of Speech (POS) tag.


The outcome depends on how the sentence has been split into individual tokens, and on which tokenizer and corpus the POS tagger was trained against. The tags used below follow the Penn Treebank tag set:

http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [3]:
'''
POS tag list:

CC	coordinating conjunction
CD	cardinal number
DT	determiner
EX	existential there (like: "there is" ... think of it like "there exists")
FW	foreign word
IN	preposition/subordinating conjunction
JJ	adjective	'big'
JJR	adjective, comparative	'bigger'
JJS	adjective, superlative	'biggest'
LS	list marker	1)
MD	modal	could, will
NN	noun, singular	'desk'
NNS	noun plural	'desks'
NNP	proper noun, singular	'Harrison'
NNPS	proper noun, plural	'Americans'
PDT	predeterminer	'all the kids'
POS	possessive ending	parent's
PRP	personal pronoun	I, he, she
PRP$	possessive pronoun	my, his, hers
RB	adverb	very, silently,
RBR	adverb, comparative	better
RBS	adverb, superlative	best
RP	particle	give up
TO	to	go 'to' the store.
UH	interjection	errrrrrrrm
VB	verb, base form	take
VBD	verb, past tense	took
VBG	verb, gerund/present participle	taking
VBN	verb, past participle	taken
VBP	verb, non-3rd person singular present	take
VBZ	verb, 3rd person sing. present	takes
WDT	wh-determiner	which
WP	wh-pronoun	who, what
WP$	possessive wh-pronoun	whose
WRB	wh-adverb	where, when
'''


In [4]:
pos = nltk.pos_tag(tree_tokens)
print(pos)
pos_space = nltk.pos_tag(space_tokens)
print(pos_space)

[('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'), ('lamb', 'NN'), ('It', 'PRP'), ("'s", 'VBZ'), ('fleece', 'NN'), ('was', 'VBD'), ('white', 'JJ'), ('as', 'IN'), ('snow', 'NN'), ('.', '.')]
[('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'), ('lamb', 'JJ'), ("It's", 'NNP'), ('fleece', 'NN'), ('was', 'VBD'), ('white', 'JJ'), ('as', 'IN'), ('snow.', 'NN')]


The naming convention of the PoS tags makes it easy to use regular expressions to extract classes of word type (e.g. all the nouns or all the verbs):


In [5]:
import re

# All Penn Treebank noun tags start with "N" (NN, NNS, NNP, NNPS)
noun_tag = re.compile(r"^N")
nouns = [word for word, tag in pos if noun_tag.match(tag)]
print("Nouns:", nouns)

Nouns: ['Mary', 'lamb', 'fleece', 'snow']
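The same idea extends from nouns to several word classes at once. The sketch below groups words by the first two letters of their tag, using the tagged tokens copied from the `pos_tag` output above. The `COARSE_CLASSES` mapping is an assumption made for this example and covers only four tag families.

```python
from collections import defaultdict

# Tagged tokens copied from the pos_tag output shown earlier.
tagged = [
    ("Mary", "NNP"), ("had", "VBD"), ("a", "DT"), ("little", "JJ"),
    ("lamb", "NN"), ("It", "PRP"), ("'s", "VBZ"), ("fleece", "NN"),
    ("was", "VBD"), ("white", "JJ"), ("as", "IN"), ("snow", "NN"), (".", "."),
]

# Hypothetical mapping from two-letter tag prefix to a coarse word class.
COARSE_CLASSES = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

groups = defaultdict(list)
for word, tag in tagged:
    word_class = COARSE_CLASSES.get(tag[:2])
    if word_class is not None:
        groups[word_class].append(word)

print(dict(groups))
```

Matching on two letters rather than one avoids confusing `RP` (particle) with the `RB` adverb tags.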


In [6]:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

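# Punkt is trained on raw, unlabelled text. Note that the same speech is
# used here both for training and as the text to tokenize.
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2005-GWBush.txt")

# Train a sentence tokenizer on train_text, then split sample_text into sentences
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)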


def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))


process_content()

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('February', 'NNP'), ('2', 'CD'), (',', ','), ('2005', 'CD'), ('9:10', 'CD'), ('P.M.', 'NNP'), ('EST', 'NNP'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('As', 'IN'), ('a', 'DT'), ('new', 'JJ'), ('Congress', 'NNP'), ('gathers', 'NNS'), (',', ','), ('all', 'DT'), ('of', 'IN'), ('us', 'PRP'), ('in', 'IN'), ('the', 'DT'), ('elected', 'JJ'), ('branches', 'NNS'), ('of', 'IN'), ('government', 'NN'), ('share', 'NN'), ('a', 'DT'), ('
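Once tokens are tagged, tallying the tags gives a quick summary of a text. The sketch below counts tags with `collections.Counter` over a short run of tagged tokens copied from the output above; on the full speech the same two lines would apply to every tagged sentence.

```python
from collections import Counter

# A short run of (word, tag) pairs copied from the tagged output above.
tagged_sample = [
    ("Mr.", "NNP"), ("Speaker", "NNP"), (",", ","), ("Vice", "NNP"),
    ("President", "NNP"), ("Cheney", "NNP"), (",", ","), ("members", "NNS"),
    ("of", "IN"), ("Congress", "NNP"), (",", ","), ("fellow", "JJ"),
    ("citizens", "NNS"), (":", ":"),
]

# Count how often each tag occurs, ignoring the words themselves.
tag_counts = Counter(tag for _, tag in tagged_sample)

print(tag_counts.most_common(2))
```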

# Stemming and Lemmatizing

Stripping the suffixes off words is known as stemming.
Mapping a word to a known dictionary word is known as lemmatization.

There are multiple stemming methods available, and the NLTK book references a few in particular:
* The Porter Stemmer - see https://tartarus.org/martin/PorterStemmer/
* The Lancaster Stemmer - (Chris Paice, University of Lancaster)
* Additionally, the Snowball Stemmer - "Porter 2", developed by Martin Porter, and often regarded as an improvement on the original Porter Stemmer

A list of other stemming methods can be found here: http://www.nltk.org/api/nltk.stem.html. Both stemming and lemmatization remain inexact processes as things currently stand.

### Stemming Example

Stemming is a form of normalization. Many variations of a word carry the same meaning, differing only in tense or grammatical form.

We stem in order to shorten lookups and to normalize sentences.

Consider:

- I was taking a ride in the car.
- I was riding in the car.

These two sentences mean the same thing.

One of the most popular stemming algorithms is the Porter stemmer, which has been around since 1979.

In [7]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
snowball = nltk.stem.snowball.SnowballStemmer('english')

print([porter.stem(t) for t in tree_tokens])
print()
print([lancaster.stem(t) for t in tree_tokens])
print()
print([snowball.stem(t) for t in tree_tokens])

sentence2 = "When I was going into the woods I saw a bear lying asleep on the forest floor"
tokens2 = word_tokenize(sentence2)

print("\n",sentence2)
for stemmer in [porter, lancaster, snowball]:
    print([stemmer.stem(t) for t in tokens2])

['Mari', 'had', 'a', 'littl', 'lamb', 'It', "'s", 'fleec', 'wa', 'white', 'as', 'snow', '.']

['mary', 'had', 'a', 'littl', 'lamb', 'it', "'s", 'fleec', 'was', 'whit', 'as', 'snow', '.']

['mari', 'had', 'a', 'littl', 'lamb', 'it', "'s", 'fleec', 'was', 'white', 'as', 'snow', '.']

 When I was going into the woods I saw a bear lying asleep on the forest floor
['When', 'I', 'wa', 'go', 'into', 'the', 'wood', 'I', 'saw', 'a', 'bear', 'lie', 'asleep', 'on', 'the', 'forest', 'floor']
['when', 'i', 'was', 'going', 'into', 'the', 'wood', 'i', 'saw', 'a', 'bear', 'lying', 'asleep', 'on', 'the', 'forest', 'flo']
['when', 'i', 'was', 'go', 'into', 'the', 'wood', 'i', 'saw', 'a', 'bear', 'lie', 'asleep', 'on', 'the', 'forest', 'floor']
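To see why stems such as 'wa' appear in the output above, it helps to look at suffix stripping in its crudest form. The sketch below is a toy stemmer written for this illustration only: the `SUFFIXES` tuple and the `toy_stem` function are hypothetical, and they are not the Porter, Lancaster or Snowball algorithms, which apply many more rules and conditions.

```python
# Hypothetical suffix list, checked in order so that "es" is tried before "s".
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word, min_stem_length=2):
    """Lower-case the word and strip the first matching suffix, if any."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem_length:
            return w[: -len(suffix)]
    return w


words = ["riding", "rides", "woods", "going", "was", "bear"]
print([toy_stem(w) for w in words])
```

Even this toy version maps "riding" and "rides" to the same stem, and it also shows the inexactness mentioned earlier: "was" loses its final "s" although it is not a plural.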


### Lemmatizing Example

Lemmatizing is a very similar operation to stemming: it also reduces a word to a base form, but it aims to derive the genuine dictionary root word (the lemma), not just a truncated version of the word.

The major difference, as you saw earlier, is that stemming can often create non-existent words, whereas lemmas are actual words. A stem is therefore not necessarily something you can look up in a dictionary, but a lemma is.

Sometimes you will wind up with a very similar word, and sometimes with a completely different one. Let's see some examples.

The default lemmatization method with the Python NLTK is the WordNet lemmatizer. The examples below need two NLTK data packages, which can be fetched with `nltk.download("averaged_perceptron_tagger")` and `nltk.download("wordnet")`.

In [8]:
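wnl = nltk.WordNetLemmatizer()
# POS tags for tokens2 (needs the "averaged_perceptron_tagger" data); not used by the calls below
tokens2_pos = nltk.pos_tag(tokens2)

# lemmatize() is called without a pos argument, so every token is treated as a noun
print([wnl.lemmatize(t) for t in tree_tokens])
print([wnl.lemmatize(t) for t in tokens2])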

['Mary', 'had', 'a', 'little', 'lamb', 'It', "'s", 'fleece', 'wa', 'white', 'a', 'snow', '.']
['When', 'I', 'wa', 'going', 'into', 'the', 'wood', 'I', 'saw', 'a', 'bear', 'lying', 'asleep', 'on', 'the', 'forest', 'floor']


In [9]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))

cat
cactus
goose
rock
python
good
best
run
run


Here we have a handful of examples of the lemmas for various words. The main thing to note is that `lemmatize` takes a part of speech parameter, `pos`. If it is not supplied, the default is noun, so the word is looked up as a noun, which can cause trouble: it is why "was" became "wa" and "as" became "a" in the earlier output, and why "better" is returned unchanged below unless `pos="a"` is given. Keep this in mind if you use lemmatizing!


In [10]:
print(lemmatizer.lemmatize("better"))


better
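Since `pos_tag` produces Penn Treebank tags while `lemmatize` expects a single-letter `pos` code, one common approach is to translate between the two. The sketch below shows only that translation step, so it runs without any NLTK data. The `penn_to_wordnet` helper is a hypothetical name, and the tags in the sample list are hand-assigned for illustration rather than tagger output.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter pos code lemmatize() accepts."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also lemmatize()'s default


# Hand-assigned (word, tag) pairs, used only to exercise the mapping.
sample = [("was", "VBD"), ("lying", "VBG"), ("woods", "NNS"), ("better", "JJR")]

for word, tag in sample:
    print(word, tag, "->", penn_to_wordnet(tag))

# With the WordNet data installed, each pair could then be passed on as
# lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)).
```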


# Summary
By tokenising text into sentences and words, we can go beyond counting the frequency or occurrence of individual words and instead classify words by type (e.g. part of speech), which lets us identify common features in the text.

## Further Investigation
Optional further work and experimentation:

* **Regular Expressions and POS patterns**

Consider how to extract "phrase-chunks" based on regular expressions.

See this Stack Overflow thread for one idea: http://stackoverflow.com/questions/34090734/how-to-use-nltk-regex-pattern-to-extract-a-specific-phrase-chunk

Consider the pitfalls and complexities for different sentence constructs with essentially the same meaning.
* **Experiment with the POS Tags**

Try tokenising and tagging the sentence:
"When I was going into the woods I saw a bear lying asleep on the forest floor"
and note any inaccuracies in the PoS classifications.
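As a starting point for the phrase-chunk idea under "Regular Expressions and POS patterns", the sketch below extracts simple noun phrases (an optional determiner, any adjectives, then one or more nouns) from the tagged "Mary had a little lamb" tokens shown earlier. It is a simplified stand-in written with the standard `re` module: the tag-string encoding and the `NOUN_PHRASE` pattern are assumptions made for this example. NLTK provides `nltk.RegexpParser` for chunking with tag patterns, which is the tool to reach for in practice.

```python
import re

# Tagged tokens copied from the pos_tag output shown earlier.
tagged = [
    ("Mary", "NNP"), ("had", "VBD"), ("a", "DT"), ("little", "JJ"),
    ("lamb", "NN"), ("It", "PRP"), ("'s", "VBZ"), ("fleece", "NN"),
    ("was", "VBD"), ("white", "JJ"), ("as", "IN"), ("snow", "NN"), (".", "."),
]

# Encode the tag sequence as one string, e.g. "<NNP><VBD><DT>...", and
# remember which token each tag's start offset belongs to.
tag_string = ""
token_at_offset = {}
for index, (_, tag) in enumerate(tagged):
    token_at_offset[len(tag_string)] = index
    tag_string += f"<{tag}>"

# Optional determiner, any number of adjectives, then one or more noun tags.
NOUN_PHRASE = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN[A-Z]*>)+")

chunks = []
for match in NOUN_PHRASE.finditer(tag_string):
    start = token_at_offset[match.start()]
    length = match.group().count("<")  # one '<' per tag in the match
    chunks.append(" ".join(word for word, _ in tagged[start:start + length]))

print(chunks)
```

Note that "white" is not chunked because no noun follows it, which hints at the pitfalls mentioned above: sentences with the same meaning but a different construction can need different patterns.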