### What is Tokenization

Tokenization is the process of breaking an input sequence into smaller parts so that we get useful units for semantic processing. A token can be a word, a sentence, a paragraph, etc., but in most real-life applications tokens are words.

The `nltk` library provides many tokenizers. Below we look at three of them and how they work - 
<ol>
    <li><b>WhitespaceTokenizer:</b> It tokenizes the string on whitespace.</li>
    <li><b>WordPunctTokenizer:</b> It tokenizes a text into a sequence of alphabetic and non-alphabetic characters, using the regexp ``\w+|[^\w\s]+``.</li>
    <li><b>TreebankWordTokenizer:</b> It uses regular expressions to tokenize text following the conventions of the Penn Treebank.</li>
</ol>
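
To get a feel for the regular expression quoted for `WordPunctTokenizer`, the same pattern can be applied directly with Python's `re` module. This is only a sketch using the standard library, not NLTK's implementation, and `word_punct_tokenize` is a name made up for this example.

```python
import re

# Pattern quoted in the list above: a run of word characters, or a run of
# characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Hypothetical helper: return every non-overlapping match of the pattern."""
    return WORD_PUNCT.findall(text)

print(word_punct_tokenize("Don't stop, please!"))
# ['Don', "'", 't', 'stop', ',', 'please', '!']
```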

In [1]:
sentence = "Today is Rahul's first class, isn't it?"

In [2]:
from nltk.tokenize import WhitespaceTokenizer
wst = WhitespaceTokenizer()
[wst.tokenize(token) for token in sentence.split(" ")]

[['Today'], ['is'], ["Rahul's"], ['first'], ['class,'], ["isn't"], ['it?']]

<font color='red'><b>Disadvantage:</b> punctuation stays attached to the words, so `it` and `it?` become different tokens even though they are the same word</font>

In [3]:
from nltk.tokenize import WordPunctTokenizer
wpt = WordPunctTokenizer()
[wpt.tokenize(token) for token in sentence.split(" ")]

[['Today'],
 ['is'],
 ['Rahul', "'", 's'],
 ['first'],
 ['class', ','],
 ['isn', "'", 't'],
 ['it', '?']]

<font color='red'><b>Disadvantage:</b> it produces some tokens that are not meaningful on their own, such as `s`, `isn` and `t`</font>

In [4]:
from nltk.tokenize import TreebankWordTokenizer
tbt = TreebankWordTokenizer()
[tbt.tokenize(token) for token in sentence.split(" ")]

[['Today'],
 ['is'],
 ['Rahul', "'s"],
 ['first'],
 ['class', ','],
 ['is', "n't"],
 ['it', '?']]

<font color='blue'><b>Advantage:</b> it does better than the two tokenizers above: here `'s` and `n't` are kept as units, which are more meaningful for processing</font>
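
The idea of keeping `'s` and `n't` together can be imitated with two small regular expressions. The sketch below is a simplified stand-in written for this notebook, not the actual Treebank rule set; `split_clitics` is a hypothetical helper name.

```python
import re

def split_clitics(token):
    """Hypothetical helper: split a trailing "n't" or "'s" off a single token."""
    # "isn't" -> "is n't"
    token = re.sub(r"(?i)(\w)(n't)$", r"\1 \2", token)
    # "Rahul's" -> "Rahul 's"
    token = re.sub(r"(?i)(\w)('s)$", r"\1 \2", token)
    return token.split()

for t in ["isn't", "Rahul's", "class"]:
    print(t, "->", split_clitics(t))
```

Unlike the `WordPunctTokenizer` output, the apostrophe stays attached to the clitic, so no stray `s` or `t` tokens appear.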

### What is text normalization

While doing text analysis, we may want to convert different variants of a word that have similar meanings into a single entity. For example - 

<font color='green'>This `computer` is taking too much `computation` time while `computing` on this data set</font>

In the above example, do the words `computer`, `computation` and `computing` have very different meanings? They are closely related, but a computer will treat them as three different tokens. Normalization maps such variants to one form - 

<ol>
    <li>computer, computation, computing -> compute</li>
    <li>run, runs, running -> run</li>
</ol>    
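
The effect of this mapping on word counts can be sketched with a small dictionary. The table below is hand-written from the two examples above purely for illustration; real normalizers (stemmers, lemmatizers) derive the mapping by rules or lookups instead.

```python
from collections import Counter

# Hypothetical hand-written mapping based on the two list items above.
NORMALIZED = {
    "computer": "compute", "computation": "compute", "computing": "compute",
    "runs": "run", "running": "run",
}

def normalize(tokens):
    """Replace each token by its normalized form, if the table has one."""
    return [NORMALIZED.get(t, t) for t in tokens]

tokens = "this computer is taking too much computation time while computing".split()
print(Counter(tokens)["compute"])             # 0: the raw tokens never match
print(Counter(normalize(tokens))["compute"])  # 3: all variants are counted together
```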

### Types of text normalization

<font color='blue'><b>Stemming:</b> It is the process of removing and replacing suffixes to get to the root form of a word, which is known as the `stem`. Popular English stemmers include - 
<ol>
    <li>Porter</li>
    <li>Lancaster</li>
</ol>    
Stemming is generally a set of heuristics that chop off suffixes. The Porter stemmer has `5 heuristic phases` of word reduction that are applied sequentially. A stemming algorithm often does not produce a real word after removing the suffix, which is fine: the purpose of stemming is to bring variant forms of a word together, not to map a word onto its ‘paradigm’ form.</font>

Step 1a

    SSES -> SS                         caresses  ->  caress
    IES  -> I                          ponies    ->  poni
                                       ties      ->  ti
    SS   -> SS                         caress    ->  caress
    S    ->                            cats      ->  cat
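
The Step 1a table can be read as "apply the first rule whose suffix matches, trying longer suffixes first". A minimal sketch of just this step, assuming lowercase input and not taken from NLTK's implementation (`porter_step_1a` is a name made up here):

```python
def porter_step_1a(word):
    """Sketch of Step 1a only: apply the first matching suffix rule."""
    # (suffix, replacement) pairs in the order of the table above.
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The `SS -> SS` rule looks like a no-op, but it stops the plain `S` rule from turning `caress` into `cares`.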

Step 1b

    (m>0) EED -> EE                    feed      ->  feed
                                       agreed    ->  agree
    (*v*) ED  ->                       plastered ->  plaster
                                       bled      ->  bled
    (*v*) ING ->                       motoring  ->  motor
                                       sing      ->  sing

If the second or third of the rules in Step 1b is successful, the following
is done:

    AT -> ATE                       conflat(ed)  ->  conflate
    BL -> BLE                       troubl(ed)   ->  trouble
    IZ -> IZE                       siz(ed)      ->  size
    (*d and not (*L or *S or *Z))
       -> single letter
                                    hopp(ing)    ->  hop
                                    tann(ed)     ->  tan
                                    fall(ing)    ->  fall
                                    hiss(ing)    ->  hiss
                                    fizz(ed)     ->  fizz
    (m=1 and *o) -> E               fail(ing)    ->  fail
                                    fil(ing)     ->  file

The rule to map to a single letter causes the removal of one of the double
letter pair. The -E is put back on -AT, -BL and -IZ, so that the suffixes
-ATE, -BLE and -IZE can be recognised later. This E may be removed in step
4.

Step 1c

    (*v*) Y -> I                    happy        ->  happi
                                    sky          ->  sky
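
The condition `(*v*)` means "the stem contains a vowel". A rough sketch of Step 1c follows; it assumes lowercase input and uses the vowel definition from Porter's paper (A, E, I, O, U, plus Y when it follows a consonant), which is not spelled out in this notebook. The function names are hypothetical.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y counts as a consonant at the start of a word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True

def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def porter_step_1c(word):
    """Sketch of Step 1c: (*v*) Y -> I."""
    if word.endswith("y") and contains_vowel(word[:-1]):
        return word[:-1] + "i"
    return word

print(porter_step_1c("happy"))  # happi
print(porter_step_1c("sky"))    # sky (the stem "sk" has no vowel)
```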

Step 1 deals with plurals and past participles. The subsequent steps are
much more straightforward.

<font color='blue'>You can go through the remaining rules of the Porter stemming algorithm at this [link](https://tartarus.org/martin/PorterStemmer/def.txt)</font>

In [5]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [6]:
#stemmer.stem('Germany')

In [7]:
sentence = "This computer is taking too much computation time while computing on this data set"
tokens = tokenizer.tokenize(sentence)
" ".join([stemmer.stem(token) for token in tokens])

'thi comput is take too much comput time while comput on thi data set'

<font color='red'><b>Disadvantage:</b> It often produces stems that are not real words, and it does not handle irregular forms. In the output above you can see non-words such as `thi` and `comput`</font>

<font color='blue'><b>Lemmatization:</b> It is the process of reducing the different inflected forms of a word to a common base form with the use of a vocabulary and morphological analysis. It returns the base or dictionary form of a word, which is known as the `lemma`. The lemmatizer used here is - 
<ol>
    <li>WordNet Lemmatizer</li>
</ol>    
WordNet lemmatization uses the `WordNet` database to look up lemmas. You can browse the database using this [link](http://wordnetweb.princeton.edu/perl/webwn); for use with `nltk`, a local copy of the corpus has to be downloaded once, for example with `nltk.download('wordnet')`.</font>

In [None]:
# !pip install nltk

In [8]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [9]:
sentence = "The wolves are at the door"
tokens = tokenizer.tokenize(sentence)
" ".join([lemmatizer.lemmatize(token) for token in tokens])

'The wolf are at the door'

In [10]:
sentence = "the presentation could have been better"
tokens = tokenizer.tokenize(sentence)
" ".join([lemmatizer.lemmatize(token) for token in tokens])

'the presentation could have been better'

In [11]:
sentence = "he talked to the person"
tokens = tokenizer.tokenize(sentence)
" ".join([lemmatizer.lemmatize(token) for token in tokens])

'he talked to the person'

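<font color='red'><b>Disadvantage:</b> Not all forms are reduced. As you can see above, `talked` remains `talked`. This happens because `lemmatize` treats every word as a noun unless a part-of-speech tag is passed through its `pos` argument. Let's see what happens if we use the stemmer in place of the lemmatizer on this sentence</font>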

In [12]:
sentence = "he talked to the person"
tokens = tokenizer.tokenize(sentence)
" ".join([stemmer.stem(token) for token in tokens])

'he talk to the person'

<font color='red'>Did you notice the difference?</font>

<font color='red'>Q1. So when to use stemming and when to use lemmatization?</font>
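
One way to see why the lemmatizer left `talked` untouched is that a dictionary lookup depends on the part of speech assumed for each word. The sketch below uses a tiny hand-written table (hypothetical data, not WordNet) and defaults to noun, which mirrors the behaviour observed above.

```python
# Toy lemma table keyed by (word, part of speech); illustration only.
TOY_LEMMAS = {
    ("wolves", "n"): "wolf",
    ("talked", "v"): "talk",
    ("are", "v"): "be",
}

def toy_lemmatize(word, pos="n"):
    """Return the table entry for (word, pos), or the word unchanged."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("wolves"))           # wolf
print(toy_lemmatize("talked"))           # talked (looked up as a noun)
print(toy_lemmatize("talked", pos="v"))  # talk
```

NLTK's `WordNetLemmatizer.lemmatize` also accepts a `pos` argument, so passing `pos="v"` for verbs is worth trying on the sentences above.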

### More on Normalization

<font color='blue'>Consider the examples below - 
<ol>
    <li>Us, us -> Are they the same?</li>
    <li>us, US -> Are they the same?</li>
</ol></font>   

<font color='blue'>To solve these kinds of problems, we can - 
<ol>
    <li>lowercase the word at the beginning of a sentence</li>
    <li>lowercase words in titles</li>
    <li>leave mid-sentence words as they are</li>
</ol></font>    
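
The first and third heuristics can be sketched with one regular expression. This is a rough illustration with a made-up helper name; it treats the start of the text and anything after `.`, `!` or `?` as a sentence start, which is an assumption that will not hold for abbreviations and similar cases.

```python
import re

# Start of text, or sentence-ending punctuation followed by whitespace,
# then the first word of the sentence.
_SENTENCE_START = re.compile(r"(^|[.!?]\s+)(\w+)")

def lowercase_sentence_starts(text):
    """Lowercase the first word of each sentence; leave other words as-is."""
    return _SENTENCE_START.sub(lambda m: m.group(1) + m.group(2).lower(), text)

print(lowercase_sentence_starts("Us and them. The US team met us."))
# us and them. the US team met us.
```

Mid-sentence `US` keeps its capitals, so it stays distinct from `us`. A sentence that begins with `US` would still be lowercased, which is a limitation of this simple heuristic.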

### Stopwords

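<font color='blue'>Stopwords are very common words that do not add much meaning to a sentence, for example `the`, `he` and `have`. They can usually be ignored without losing the meaning of the sentence, although the NLTK English list shown below also contains negations such as `no` and `not`, whose removal can change the meaning.</font>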
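
As a quick illustration of both the idea and its main pitfall, here is a sketch with a handful of stop words. The set below is a small sample of entries that appear in NLTK's English list, chosen for this example; it is not the full list.

```python
# Small sample of NLTK's English stop words (not the full list).
SAMPLE_STOPWORDS = {"the", "is", "a", "this", "not", "no"}

def remove_stopwords(tokens, stop_words=SAMPLE_STOPWORDS):
    """Drop every token whose lowercase form is in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stopwords("the movie is not good".split()))
# ['movie', 'good']
```

The result reads as the opposite of the original sentence, so for tasks such as sentiment analysis it may be worth keeping negations.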

In [13]:
# List stop words
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [14]:
text = "The first time - you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2.\
It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"

# normalizing the text
import re
text = re.sub(r"[^a-zA-Z]", " ", text.lower())

# tokenizing it
words = text.split()
print(words)

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']


In [15]:
# removing stop words (build the set once instead of reloading the list for every word)
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]
print(words)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'started', 'war', 'ai', 'bad', 'thing']


In [16]:
# complete sentence after removing stop words (`words` was already filtered above)
sentence = " ".join(words)
print(sentence)

first time see second renaissance may look boring look least twice definitely watch part change view matrix human people ones started war ai bad thing


In [17]:
stemmer.stem('thank')

'thank'

In [18]:
stemmer.stem('thanks')

'thank'