## Tokenizer

Tokenizing text is not difficult, but it is worth knowing why we do it. The smallest
unit of processing in a language processing task is the token. The approach is very much like
a divide-and-conquer strategy: we make sense of the smallest units at a
granular level first, then combine them to understand the semantics of the sentence, paragraph,
document, and corpus (if any) as we move up the levels of detail.

In [19]:
from nltk.tokenize import LineTokenizer, SpaceTokenizer, TweetTokenizer
from nltk import word_tokenize
import nltk
# download the Punkt tokenizer models used by word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Raj
[nltk_data]     Patel\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

## Line Tokenizer

In [20]:
lTokenizer = LineTokenizer()
text = """My name is Maximus Decimus Meridius, commander of the Armies of the North,General of the Felix Legions and loyal servant to the true emperor,Marcus Aurelius. 
         \nFather to a murdered son, husband to a murdered wife. 
         \nAnd I will have my vengeance, in this life or the next."""
print("Line tokenizer output : \n")
for sentence in lTokenizer.tokenize(text):
    print(sentence)

Line tokenizer output : 

My name is Maximus Decimus Meridius, commander of the Armies of the North,General of the Felix Legions and loyal servant to the true emperor,Marcus Aurelius. 
Father to a murdered son, husband to a murdered wife. 
And I will have my vengeance, in this life or the next.


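As you can see, it has returned a list of three strings, meaning the given input has
been divided into three lines on the basis of where the newlines are.
LineTokenizer simply splits the given input string at newline characters. By default it
also discards blank lines, which is why the whitespace-only lines in the input do not
appear in the output.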
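The behaviour described above can be sketched in plain Python. This is only a rough stand-in, assuming LineTokenizer's default of discarding blank lines; `split_lines` and its `blanklines` argument are hypothetical names for this illustration, not NLTK code.

```python
def split_lines(text, blanklines="discard"):
    """Rough stand-in for a line tokenizer: split on newlines,
    optionally dropping lines that are empty or whitespace-only."""
    lines = text.splitlines()
    if blanklines == "discard":
        lines = [line for line in lines if line.strip()]
    return lines


sample = "first line\n   \nsecond line\n\nthird line"

discarded = split_lines(sample)                 # blank lines dropped
kept = split_lines(sample, blanklines="keep")   # blank lines preserved

print(discarded)  # ['first line', 'second line', 'third line']
print(kept)       # ['first line', '   ', 'second line', '', 'third line']
```

Keeping blank lines can matter when they carry meaning, for example as paragraph separators.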

## Space Tokenizer

As the name implies, it splits the input on space characters.

In [23]:
raw_text = "By 11 o'clock on Sunday, the doctor shall open the dispensary."
sTokenizer = SpaceTokenizer()
print("Space Tokenizer output: ")
print(sTokenizer.tokenize(raw_text))

Space Tokenizer output: 
['By', '11', "o'clock", 'on', 'Sunday,', 'the', 'doctor', 'shall', 'open', 'the', 'dispensary.']
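One detail worth knowing is that splitting on the space character is not the same as splitting on any whitespace. The sketch below uses plain `str.split`, on the assumption that SpaceTokenizer behaves like `s.split(' ')`; the sample string is made up for illustration.

```python
# Sample text with two consecutive spaces and a tab (hypothetical input).
text = "open  the\tdispensary"

# Splitting on the single space character, as a space tokenizer does:
# the double space yields an empty token and the tab is not a separator.
on_single_space = text.split(" ")

# Splitting on any run of whitespace, for comparison.
on_any_whitespace = text.split()

print(on_single_space)    # ['open', '', 'the\tdispensary']
print(on_any_whitespace)  # ['open', 'the', 'dispensary']
```

If the input may contain tabs or repeated spaces, normalizing whitespace first avoids empty tokens.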


## Word Tokenizer

In [22]:
raw_text = "By 11 o'clock on Sunday, the doctor shall open the dispensary."
print(word_tokenize(raw_text))


['By', '11', "o'clock", 'on', 'Sunday', ',', 'the', 'doctor', 'shall', 'open', 'the', 'dispensary', '.']


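Comparing the two outputs shows the difference between SpaceTokenizer and word_tokenize():
SpaceTokenizer leaves punctuation attached to the neighboring word ('Sunday,' and
'dispensary.'), whereas word_tokenize() returns the comma and the full stop as separate tokens.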
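To make the contrast concrete without NLTK, here is a minimal regex sketch that splits punctuation off into separate tokens. It is only a rough approximation for this one sentence and is not how word_tokenize() is implemented; `TOKEN_PATTERN` is a hypothetical name.

```python
import re

# Rough approximation only: a word (optionally with one internal apostrophe)
# or a single punctuation character. word_tokenize() uses more elaborate rules.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

raw_text = "By 11 o'clock on Sunday, the doctor shall open the dispensary."

space_tokens = raw_text.split(" ")              # punctuation stays attached
regex_tokens = TOKEN_PATTERN.findall(raw_text)  # punctuation becomes its own token

print(space_tokens)
print(regex_tokens)
```

For this sentence the regex happens to give the same tokens as the word_tokenize() output shown above; on contractions, abbreviations, and quotes the two would diverge.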

## Tweet Tokenizer
Now, on to the last one. There's a special TweetTokenizer that we can use
when dealing with special-case strings such as tweets:

In [24]:
tTokenizer = TweetTokenizer()
tweet = "This is a cooool #dummysmiley: :-) :-P <3"
print(tTokenizer.tokenize(tweet))

['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3']


Tweets contain special words, special characters, hashtags, and smileys that we want to keep intact. 

As we can see, the tokenizer kept the hashtag intact and didn't break it up; the
smileys are also kept intact and are not lost. This is one special little class that can
be used when the application demands it.
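TweetTokenizer also accepts a few normalization options. The sketch below assumes an installed NLTK and uses the `preserve_case` and `reduce_len` constructor parameters; the variable names are illustrative.

```python
from nltk.tokenize import TweetTokenizer

tweet = "This is a cooool #dummysmiley: :-) :-P <3"

# preserve_case=False lowercases ordinary words but leaves emoticons alone;
# reduce_len=True shortens runs of a repeated character to three.
normalizing_tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
tweet_tokens = normalizing_tokenizer.tokenize(tweet)

print(tweet_tokens)
```

With these settings 'cooool' is shortened to 'coool', while the hashtag and the smileys still come through as single tokens.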

# Stemming

In [25]:
from nltk import PorterStemmer, LancasterStemmer, word_tokenize

In [28]:
raw = """My name is Maximus Decimus Meridius, commander of the Armies
of the North, General of the Felix Legions and loyal servant to the
true emperor, Marcus Aurelius. Father to a murdered son, husband to
a murdered wife. And I will have my vengeance, in this life or the
next."""
tokens = word_tokenize(raw)
print(tokens)

['My', 'name', 'is', 'Maximus', 'Decimus', 'Meridius', ',', 'commander', 'of', 'the', 'Armies', 'of', 'the', 'North', ',', 'General', 'of', 'the', 'Felix', 'Legions', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'Marcus', 'Aurelius', '.', 'Father', 'to', 'a', 'murdered', 'son', ',', 'husband', 'to', 'a', 'murdered', 'wife', '.', 'And', 'I', 'will', 'have', 'my', 'vengeance', ',', 'in', 'this', 'life', 'or', 'the', 'next', '.']


In [29]:
porter = PorterStemmer()
pStems = [porter.stem(t) for t in tokens]
print(pStems)

['my', 'name', 'is', 'maximu', 'decimu', 'meridiu', ',', 'command', 'of', 'the', 'armi', 'of', 'the', 'north', ',', 'gener', 'of', 'the', 'felix', 'legion', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'marcu', 'aureliu', '.', 'father', 'to', 'a', 'murder', 'son', ',', 'husband', 'to', 'a', 'murder', 'wife', '.', 'and', 'i', 'will', 'have', 'my', 'vengeanc', ',', 'in', 'thi', 'life', 'or', 'the', 'next', '.']


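As you can see in the output, the stemmer has lowercased every token and stripped trailing suffixes such as 's', 'es', 'e', 'ed', and 'al' from many of them (for example, 'Armies' becomes 'armi' and 'General' becomes 'gener').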

In [30]:
lancaster = LancasterStemmer()
lStems = [lancaster.stem(t) for t in tokens]
print(lStems)

['my', 'nam', 'is', 'maxim', 'decim', 'meridi', ',', 'command', 'of', 'the', 'army', 'of', 'the', 'nor', ',', 'gen', 'of', 'the', 'felix', 'leg', 'and', 'loy', 'serv', 'to', 'the', 'tru', 'emp', ',', 'marc', 'aureli', '.', 'fath', 'to', 'a', 'murd', 'son', ',', 'husband', 'to', 'a', 'murd', 'wif', '.', 'and', 'i', 'wil', 'hav', 'my', 'veng', ',', 'in', 'thi', 'lif', 'or', 'the', 'next', '.']


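The Lancaster stemmer drops even more: trailing 'us', 'e', 'th', 'eral', 'ered', and many other endings are removed (for example, 'Maximus' becomes 'maxim', 'North' becomes 'nor', and 'General' becomes 'gen').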


Comparing the output of the two stemmers, we see that Lancaster is clearly the
greedier one when dropping suffixes. It tries to remove as many characters from the end as
possible, whereas Porter is more conservative and removes as little as possible.
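A side-by-side view makes the difference easier to see. This sketch assumes an installed NLTK and picks four lowercase words from the passage above; `stem_pairs` is a hypothetical name.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

words = ["general", "vengeance", "murdered", "armies"]

# Map each word to its (Porter stem, Lancaster stem) pair.
stem_pairs = {w: (porter.stem(w), lancaster.stem(w)) for w in words}

for word, (p_stem, l_stem) in stem_pairs.items():
    print(f"{word:10} porter={p_stem:10} lancaster={l_stem}")
```

Which stemmer suits a task is a judgment call: more aggressive stemming merges more word forms together, at the cost of more unrelated words collapsing to the same stem.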

# Lemmatization

WordNetLemmatizer removes affixes only if it can find the resulting word in the
WordNet dictionary. This lookup makes lemmatization slower than stemming. Also, it
does not process capitalized words: they do not match the dictionary entries, so they are
returned as is. To work around this, you may want to convert your input
string to lowercase and then run lemmatization on it.
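The idea of "strip a suffix only if the result is a dictionary word" can be sketched with a toy example. Everything here is hypothetical: `DICTIONARY` is a tiny stand-in for WordNet, `SUFFIX_RULES` is a made-up handful of noun rules, and `toy_lemmatize` is not the WordNetLemmatizer algorithm.

```python
# Hypothetical mini-dictionary standing in for WordNet (lowercase entries only).
DICTIONARY = {"army", "legion", "wife", "vengeance", "emperor"}

# A few illustrative noun suffix rewrites: (suffix, replacement).
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", "")]


def toy_lemmatize(word):
    """Return a dictionary form of `word` if a suffix rule produces one,
    otherwise return the word unchanged."""
    if word in DICTIONARY:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in DICTIONARY:
                return candidate
    return word


print(toy_lemmatize("armies"))          # army
print(toy_lemmatize("legions"))         # legion
print(toy_lemmatize("Armies"))          # Armies (no lowercase match)
print(toy_lemmatize("Armies".lower()))  # army
```

Because a candidate is accepted only when it is in the dictionary, this approach never invents a non-word, which is the key contrast with a stemmer.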

In [31]:
from nltk import word_tokenize, PorterStemmer, WordNetLemmatizer

In [32]:
raw = """My name is Maximus Decimus Meridius, commander of the Armies
of the North, General of the Felix Legions and loyal servant to the
true emperor, Marcus Aurelius. Father to a murdered son, husband to
a murdered wife. And I will have my vengeance, in this life or the
next."""
tokens = word_tokenize(raw)
print(tokens)

['My', 'name', 'is', 'Maximus', 'Decimus', 'Meridius', ',', 'commander', 'of', 'the', 'Armies', 'of', 'the', 'North', ',', 'General', 'of', 'the', 'Felix', 'Legions', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'Marcus', 'Aurelius', '.', 'Father', 'to', 'a', 'murdered', 'son', ',', 'husband', 'to', 'a', 'murdered', 'wife', '.', 'And', 'I', 'will', 'have', 'my', 'vengeance', ',', 'in', 'this', 'life', 'or', 'the', 'next', '.']


In [33]:
porter = PorterStemmer()
stems = [porter.stem(t) for t in tokens]
print(stems)

['my', 'name', 'is', 'maximu', 'decimu', 'meridiu', ',', 'command', 'of', 'the', 'armi', 'of', 'the', 'north', ',', 'gener', 'of', 'the', 'felix', 'legion', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'marcu', 'aureliu', '.', 'father', 'to', 'a', 'murder', 'son', ',', 'husband', 'to', 'a', 'murder', 'wife', '.', 'and', 'i', 'will', 'have', 'my', 'vengeanc', ',', 'in', 'thi', 'life', 'or', 'the', 'next', '.']


In [34]:
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]
print(lemmas)

['My', 'name', 'is', 'Maximus', 'Decimus', 'Meridius', ',', 'commander', 'of', 'the', 'Armies', 'of', 'the', 'North', ',', 'General', 'of', 'the', 'Felix', 'Legions', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'Marcus', 'Aurelius', '.', 'Father', 'to', 'a', 'murdered', 'son', ',', 'husband', 'to', 'a', 'murdered', 'wife', '.', 'And', 'I', 'will', 'have', 'my', 'vengeance', ',', 'in', 'this', 'life', 'or', 'the', 'next', '.']


As you can see, the lemmatizer returned every token in this input unchanged. Capitalized
words such as 'Armies' and 'Legions' are returned as is; in lowercase, the same nouns
would be reduced to 'army' and 'legion', with the suffix not just removed but replaced
where needed. For the remaining tokens, the noun lookup that lemmatize() performs by
default finds nothing to change. What it is essentially doing is a dictionary match.


Comparing the output of the stemmer and the lemmatizer, we see that the stemmer
produces many non-words ('maximu', 'armi', 'vengeanc', 'thi'), whereas the lemmatizer never
produces one. However, the lemmatizer leaves 'murdered' untouched, and that is an error here:
lemmatize() treats every word as a noun unless told otherwise, and passing pos='v' would
yield 'murder'. Even so, as an end product, the lemmatizer does a far better job of getting
us a valid base form than the stemmer.

# Stop words

In [37]:
import nltk
from nltk.corpus import gutenberg, stopwords
nltk.download('gutenberg')
print(gutenberg.fileids())

[nltk_data] Downloading package gutenberg to C:\Users\Raj
[nltk_data]     Patel\AppData\Roaming\nltk_data...


['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


[nltk_data]   Unzipping corpora\gutenberg.zip.


In [40]:
gb_words = gutenberg.words('bible-kjv.txt')
print(gb_words)
print(len(gb_words))

['[', 'The', 'King', 'James', 'Bible', ']', 'The', ...]
1010654


In [41]:
words_filtered = [word for word in gb_words if len(word) >=3]

In this step we iterate over the entire list of words from the Gutenberg file, discarding all
the words/tokens whose length is two characters or less.

In [44]:
len(words_filtered)

642004

In [47]:
# removing the stop words 
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
words = [w for w in words_filtered if w.lower() not in stopwords]

[nltk_data] Downloading package stopwords to C:\Users\Raj
[nltk_data]     Patel\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


After the download, the call to stopwords.words('english') loads the English stop
words from the stopwords corpus into the stopwords variable. The last line filters
those stop words out of the filtered word list we built in the previous
step, lowercasing each word before the comparison.
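The two filtering steps can be tried on a small scale without downloading anything. In this sketch `STOP_WORDS` is a hypothetical four-word stand-in for the NLTK list, and the token list is made up for illustration.

```python
# Hypothetical stand-in for the NLTK English stop word list.
# A set gives fast membership tests, which matters on a large corpus.
STOP_WORDS = {"the", "and", "of", "to"}

tokens = ["The", "LORD", "said", "unto", "Moses", ",", "and", "to", "the", "people"]

# Step 1: drop tokens shorter than three characters (punctuation, "to", ...).
long_enough = [t for t in tokens if len(t) >= 3]

# Step 2: drop stop words, comparing in lowercase so "The" is caught too.
content_words = [t for t in long_enough if t.lower() not in STOP_WORDS]

# Without lowercasing, capitalized stop words slip through.
case_sensitive = [t for t in long_enough if t not in STOP_WORDS]

print(content_words)   # ['LORD', 'said', 'unto', 'Moses', 'people']
print(case_sensitive)  # ['The', 'LORD', 'said', 'unto', 'Moses', 'people']
```

Note that the original casing of the surviving words is kept; only the comparison is done in lowercase.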

In [49]:
len(words)

368614

### Frequency count

In [50]:
# fdistPlain counts the raw tokens; fdist counts the tokens left after filtering
fdistPlain = nltk.FreqDist(gb_words)
fdist = nltk.FreqDist(words)

In [54]:
print('Following are the most common 10 words in the bag')
print(fdistPlain.most_common(10))
print('Following are the most common 10 words in the bag minus the stopwords')
print(fdist.most_common(10))

Following are the most common 10 words in the bag
[(',', 70509), ('the', 62103), (':', 43766), ('and', 38847), ('of', 34480), ('.', 26160), ('to', 13396), ('And', 12846), ('that', 12576), ('in', 12331)]
Following are the most common 10 words in the bag minus the stopwords
[('shall', 9760), ('unto', 8940), ('LORD', 6651), ('thou', 4890), ('thy', 4450), ('God', 4115), ('said', 3995), ('thee', 3827), ('upon', 2730), ('man', 2721)]


If you look carefully at the output, the 10 most common words in the unprocessed, plain
list of words don't tell us much: they are punctuation marks and function words such as
'the', 'and', and 'of'. In the preprocessed bag of words, by contrast, the 10 most common
words include 'LORD', 'God', and 'man', which gives us a quick understanding that we
are dealing with a text related to faith or religion.
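The same before-and-after comparison can be reproduced on a toy list with `collections.Counter`, which offers the same `most_common()` method (in NLTK 3, FreqDist is built on Counter). The token list and the one-word stop list are made up for illustration.

```python
from collections import Counter

# Made-up token list: "the" x4, "," x3, "LORD" x2, "God" x1.
tokens = ["the", "LORD", ",", "the", "LORD", ",", "the", "God", ",", "the"]
stop_words = {"the"}  # hypothetical one-word stop list

# Counts over the raw tokens are dominated by function words and punctuation.
raw_counts = Counter(tokens)

# Counts after the length filter and stop word removal surface content words.
filtered = [t for t in tokens if len(t) >= 3 and t.lower() not in stop_words]
filtered_counts = Counter(filtered)

print(raw_counts.most_common(2))       # [('the', 4), (',', 3)]
print(filtered_counts.most_common(2))  # [('LORD', 2), ('God', 1)]
```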