
# NLTK Library

The [NLTK](https://www.nltk.org/) (Natural Language Toolkit) Python library is widely used as an education and research tool. This notebook covers tokenization, token counts, punctuation and stopword removal, stemming, lemmatization, and n-gram language models.

**Installing NLTK** <br>
NLTK can be installed with the pip or conda package managers.
The commands below install the library with pip and then open the NLTK data downloader:
```bash
pip3 install nltk
python3 -c "import nltk; nltk.download()"
```

# Tokenizing Words and Sentences using NLTK

**Tokenization** is the process of dividing a large body of text into smaller parts called tokens. <br>Most NLP tasks depend on finding patterns in text, and tokens are the units in which those patterns are found.<br>

NLTK's `tokenize` module provides two functions that are used here:

1. `word_tokenize`, which splits text into words and punctuation marks
2. `sent_tokenize`, which splits text into sentences
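
To make the idea of a token concrete without depending on NLTK's downloaded data, here is a minimal sketch that uses only regular expressions. It is not how NLTK tokenizes: NLTK's tokenizers handle contractions, abbreviations, and other cases that this sketch ignores. The names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical.

```python
import re

def simple_sent_tokenize(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def simple_word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Alice was tired. She saw a rabbit!"
sentences = simple_sent_tokenize(text)
print(sentences)
print([simple_word_tokenize(s) for s in sentences])
```

The naive sentence splitter would break on abbreviations such as "Mr. Rabbit", which is one reason to prefer a dedicated tokenizer.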

In [63]:
# Importing modules
import nltk
nltk.download('punkt') # For tokenizers
nltk.download('inaugural') # For dataset
nltk.download('stopwords')
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [64]:
word_tokenize("shouldn't they'll")

['should', "n't", 'they', "'ll"]

In [65]:
# Sample corpus.
from nltk.corpus import inaugural
corpus = inaugural.raw('1789-Washington.txt')
print(corpus)

Fellow-Citizens of the Senate and of the House of Representatives:

Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years -- a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not bu

In [69]:

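from nltk.corpus import stopwords

sentences = sent_tokenize(corpus)
num_sentences = len(sentences)
# Flatten the per-sentence token lists into a single list
tokens = [token for sentence in sentences for token in word_tokenize(sentence)]
# print(tokens)
num_tokens = len(tokens)
# Use a separate name so the imported stopwords module is not overwritten
stop_words = set(stopwords.words('english'))
tokens_after_rem = [i for i in tokens if i not in stop_words]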

print("Number of sentences : ", num_sentences)
print("Number of tokens : ", num_tokens)
print("Average number of tokens per sentence : ", num_tokens/num_sentences)
print("Number of unique tokens : ", len(set(tokens)))
print("Number of tokens after stopword removal : ", len(tokens_after_rem))
print("Number of unique tokens after stopword removal : ", len(set(tokens_after_rem)))

Number of sentences :  23
Number of tokens :  1537
Average number of tokens per sentence :  66.82608695652173
Number of unique tokens :  626
Number of tokens after stopword removal :  800
Number of unique tokens after stopword removal :  543
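
The stopword check above compares tokens exactly as they appear, and NLTK's English stopword list is lowercase, so capitalized forms such as `The` pass through the filter. The sketch below illustrates the effect with a tiny hypothetical stopword set (`toy_stopwords`), not the real NLTK list.

```python
# Hypothetical miniature stopword set standing in for stopwords.words('english').
toy_stopwords = {"the", "of", "and"}
tokens = ["The", "voice", "of", "the", "Country", "and", "the", "Senate"]

# Exact comparison: 'The' is not equal to 'the', so it is kept.
case_sensitive = [t for t in tokens if t not in toy_stopwords]
# Lowercasing only for the comparison removes it while preserving the other tokens.
case_insensitive = [t for t in tokens if t.lower() not in toy_stopwords]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase before filtering is a design choice; it changes the counts reported above, so the two variants should not be mixed within one analysis.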


# Stemming and Lemmatization with NLTK

**What is Stemming?** <br>
Stemming is a form of word normalization: inflected or derived variants of a word are reduced to a common stem so that they can be treated as the same term. The stem is produced by stripping suffixes with rules, so it need not be a dictionary word (for example, `troubling` becomes `troubl`).<br>
Stemming is therefore a way to map the variations of a word to a shared root form.

NLTK provides several stemmers, including **PorterStemmer**, **SnowballStemmer**, and **LancasterStemmer**.<br>
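
As a rough illustration of what suffix stripping means, the sketch below removes a few common endings. It is a deliberately naive, hypothetical `toy_stem` function, not the Porter or Snowball algorithm, both of which apply ordered rule sets with conditions on the remaining stem.

```python
def toy_stem(word, suffixes=("ing", "ly", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["grows", "cats", "troubling", "fairly", "leaves"]:
    print(w, "->", toy_stem(w))
```

Even this crude version shows why stems such as `troubl` or `leav` are not real words: the rules operate on character endings, not on meaning.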


In [85]:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer # Note that SnowballStemmer has language as parameter.

words = ["grows","leaves","fairly","cats","trouble","troubling","misunderstanding","friendships","easily", "rational", "relational"]

porter = PorterStemmer()
snowball = SnowballStemmer('english')

print(f"{'PorterStemmer':<32}SnowballStemmer")
for word in words:
    # Left-align the first column so the second column always lines up
    print(f"{porter.stem(word):<32}{snowball.stem(word)}")



PorterStemmer                   SnowballStemmer
grow                            grow
leav                            leav
fairli                          fair
cat                             cat
troubl                          troubl
troubl                          troubl
misunderstand                   misunderstand
friendship                      friendship
easili                          easili
ration                          ration
relat                           relat


**What is Lemmatization?** <br>
Lemmatization is the process of finding the lemma of a word based on its meaning and part of speech. It relies on morphological analysis to remove inflectional endings and return the base or dictionary form of the word, which is known as the lemma.<br>

*NLTK's `WordNetLemmatizer` is based on WordNet's built-in morphy function.*
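
Because a lemma depends on the part of speech, the same surface form can map to different lemmas. The sketch below uses a small hand-written lookup table (`toy_lemmas`, hypothetical and far smaller than WordNet) to show why a `pos` argument matters; it is not NLTK's implementation.

```python
# Hypothetical (word, part-of-speech) -> lemma table; 'n' = noun, 'v' = verb.
toy_lemmas = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("was", "v"): "be",
    ("cats", "n"): "cat",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return toy_lemmas.get((word, pos), word)

print(toy_lemmatize("leaves", pos="n"))
print(toy_lemmatize("leaves", pos="v"))
print(toy_lemmatize("was", pos="n"))  # not in the table as a noun, so unchanged
```

Unlike a stemmer, a lemmatizer returns dictionary words, but only if it is told (or can infer) the correct part of speech.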

In [107]:
#imports
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet') # WordNetLemmatizer is based on WordNet's built-in morphy function.

words = ["grows","leaves","fairly","cats","trouble","troubling","running","friendships","easily", "was", "relational","has"]

lw = WordNetLemmatizer()
print(f"{'Word':<24}LemmatizedWord")
for word in words:
    # pos='v' treats every word as a verb; left-align so the columns line up
    print(f"{word:<24}{lw.lemmatize(word, pos='v')}")


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Word                    LemmatizedWord
grows                   grow
leaves                  leave
fairly                  fairly
cats                    cat
trouble                 trouble
troubling               trouble
running                 run
friendships             friendships
easily                  easily
was                     be
relational              relational
has                     have


In [96]:
import pandas as pd
df = pd.DataFrame({
    "Words" : words,
    "PorterStemmer" : [porter.stem(word) for word in words],
    "SnowballStemmer" : [snowball.stem(word) for word in words],
    "WordNetLemmatizer" : [lw.lemmatize(word,pos='v') for word in words]
})

df

| | Words | PorterStemmer | SnowballStemmer | WordNetLemmatizer |
|---|---|---|---|---|
| 0 | grows | grow | grow | grow |
| 1 | leaves | leav | leav | leave |
| 2 | fairly | fairli | fair | fairly |
| 3 | cats | cat | cat | cat |
| 4 | trouble | troubl | troubl | trouble |
| 5 | troubling | troubl | troubl | trouble |
| 6 | running | run | run | run |
| 7 | friendships | friendship | friendship | friendships |
| 8 | easily | easili | easili | easily |
| 9 | was | wa | was | be |


# NGram Model

In [108]:
# Load the dataset: Alice's Adventures in Wonderland
with open('corpus.txt', 'r') as f:
    corpus = f.read()

In [109]:
corpus



## Preprocess the corpus

In [110]:
len(corpus)

142477

In [111]:
import string, nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [112]:
# turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens
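
One side effect of deleting punctuation is that hyphenated words are joined rather than split, which is why tokens such as `rabbithole` appear in the output further down. A minimal sketch of the same translate-and-filter steps on a short made-up string:

```python
import string

sample = "Down the Rabbit-Hole -- Alice's 2 sisters!"
table = str.maketrans('', '', string.punctuation)

# Same order of operations as clean_doc: split, strip punctuation, filter, lowercase.
tokens = sample.replace('--', ' ').split()
stripped = [w.translate(table) for w in tokens]
cleaned = [w.lower() for w in stripped if w.isalpha()]

print(stripped)
print(cleaned)
```

The possessive `Alice's` collapses to `alices` and the number `2` is dropped, so this cleaning is lossy; that is acceptable for frequency counts but worth keeping in mind when reading the n-gram tables.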

In [113]:
doc = clean_doc(corpus)

In [114]:
len(doc)

23482

In [115]:
doc

['chapter',
 'i',
 'down',
 'the',
 'rabbithole',
 'alice',
 'was',
 'beginning',
 'to',
 'get',
 'very',
 'tired',
 'of',
 'sitting',
 'by',
 'her',
 'sister',
 'on',
 'the',
 'bank',
 'and',
 'of',
 'having',
 'nothing',
 'to',
 'do',
 'once',
 'or',
 'twice',
 'she',
 'had',
 'peeped',
 'into',
 'the',
 'book',
 'her',
 'sister',
 'was',
 'reading',
 'but',
 'it',
 'had',
 'no',
 'pictures',
 'or',
 'conversations',
 'in',
 'it',
 'what',
 'is',
 'the',
 'use',
 'of',
 'a',
 'thought',
 'alice',
 'pictures',
 'or',
 'so',
 'she',
 'was',
 'considering',
 'in',
 'her',
 'own',
 'mind',
 'as',
 'well',
 'as',
 'she',
 'could',
 'for',
 'the',
 'hot',
 'day',
 'made',
 'her',
 'feel',
 'very',
 'sleepy',
 'and',
 'stupid',
 'whether',
 'the',
 'pleasure',
 'of',
 'making',
 'a',
 'daisychain',
 'would',
 'be',
 'worth',
 'the',
 'trouble',
 'of',
 'getting',
 'up',
 'and',
 'picking',
 'the',
 'daisies',
 'when',
 'suddenly',
 'a',
 'white',
 'rabbit',
 'with',
 'pink',
 'eyes',
 'ran'

### Creating Language Models:

1. **Create the following language models** on the training corpus: <br>
    i.   Unigram <br>
    ii.  Bigram <br>
    iii. Trigram <br>
    iv.  Fourgram <br>

2. **List the top 5 bigrams, trigrams, and four-grams (with and without Add-1 smoothing).**
(Note: n-grams that consist only of stopwords such as articles, prepositions, and determiners are removed, for example “of the” and “in a”.)
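
An n-gram is simply a sliding window of n consecutive tokens. The sketch below builds them with `zip` in plain Python to show what is being counted; `make_ngrams` is a hypothetical helper, and the notebook itself relies on `nltk.util.ngrams`.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["said", "the", "king", "said", "the", "queen"]
bigrams_demo = make_ngrams(tokens, 2)
print(bigrams_demo)
print(Counter(bigrams_demo).most_common(1))
print(len(make_ngrams(tokens, 3)))  # m tokens yield m - n + 1 n-grams
```

The length relation in the last line is the same one visible in the counts reported for the full corpus, where each higher-order list is one element shorter than the previous one.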

In [116]:
from nltk.util import ngrams
unigrams=[]
bigrams=[]
trigrams=[]
fourgrams=[]

docs = [doc]

for content in docs:
    unigrams.extend(content)
    bigrams.extend(ngrams(content,2))
    trigrams.extend(ngrams(content, 3))
    fourgrams.extend(ngrams(content, 4))

In [118]:
len(unigrams),len(bigrams),len(trigrams),len(fourgrams)

(23482, 23481, 23480, 23479)

In [120]:
# Load the English stopword list through nltk (a set makes membership tests fast)
stopwords = set(nltk.corpus.stopwords.words('english'))

# Print the top 10 unigrams after removing stopwords
uni_processed = [p for p in unigrams if p not in stopwords]

uni_fdist = nltk.FreqDist(uni_processed)
print(uni_fdist.most_common(10))

# Keep the bigrams, trigrams, and fourgrams that contain at least one non-stopword,
# i.e. drop only the n-grams made up entirely of stopwords
bi_processed = [p for p in bigrams if p[0] not in stopwords or p[1] not in stopwords]
tri_processed = [p for p in trigrams if p[0] not in stopwords or p[1] not in stopwords or p[2] not in stopwords]
four_processed = [p for p in fourgrams if p[0] not in stopwords or p[1] not in stopwords or p[2] not in stopwords or p[3] not in stopwords]

[('said', 457), ('alice', 383), ('little', 125), ('one', 90), ('went', 83), ('like', 78), ('could', 77), ('would', 77), ('thought', 74), ('queen', 65)]


In [121]:
bi_fdist = nltk.FreqDist(bi_processed)
print(list(bi_fdist.most_common())[:10])

[(('said', 'the'), 208), (('said', 'alice'), 114), (('the', 'king'), 60), (('the', 'queen'), 59), (('a', 'little'), 58), (('mock', 'turtle'), 54), (('the', 'mock'), 53), (('the', 'gryphon'), 53), (('the', 'hatter'), 51), (('went', 'on'), 48)]


In [122]:
tri_fdist = nltk.FreqDist(tri_processed)
print(list(tri_fdist.most_common())[:10])

[(('the', 'mock', 'turtle'), 51), (('the', 'march', 'hare'), 29), (('said', 'the', 'king'), 29), (('the', 'white', 'rabbit'), 21), (('said', 'the', 'hatter'), 20), (('said', 'to', 'herself'), 19), (('said', 'the', 'mock'), 19), (('said', 'the', 'caterpillar'), 18), (('she', 'went', 'on'), 17), (('she', 'said', 'to'), 17)]


In [123]:
four_fdist = nltk.FreqDist(four_processed)
print(list(four_fdist.most_common())[:10])

[(('said', 'the', 'mock', 'turtle'), 19), (('she', 'said', 'to', 'herself'), 16), (('a', 'minute', 'or', 'two'), 11), (('said', 'the', 'march', 'hare'), 8), (('said', 'alice', 'in', 'a'), 7), (('as', 'well', 'as', 'she'), 6), (('well', 'as', 'she', 'could'), 6), (('in', 'a', 'great', 'hurry'), 6), (('in', 'a', 'tone', 'of'), 6), (('the', 'moral', 'of', 'that'), 6)]


# Applying Smoothing

With add-1 (Laplace) smoothing, we pretend to have additional training data in which every possible n-gram occurs exactly once, and adjust the estimates accordingly.

$$ P(\text{ngram}) = \frac{\text{Count}(\text{ngram}) + 1}{N + V} $$

N: total number of n-grams <br>
V: number of unique n-grams
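
As a worked example of the formula, take the unigram `said`. The counts below are the ones reported in this notebook's outputs: `said` occurs 457 times, there are 23482 unigram tokens in total, and 2290 distinct unigrams remain after stopword removal.

```python
count_said = 457   # frequency of 'said' among the unigrams
N = 23482          # total number of unigram tokens
V = 2290           # number of distinct unigrams after stopword removal

unsmoothed = count_said / N
smoothed = (count_said + 1) / (N + V)

print(unsmoothed)
print(smoothed)
```

The smoothed value is slightly lower than the unsmoothed one, because part of the probability mass is reserved for n-grams that were never observed.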


In [124]:
len(unigrams), len(uni_fdist)

(23482, 2290)

In [125]:
uni_prob_ws = {} 
print(uni_fdist)
for key,value in dict(uni_fdist).items():
    uni_prob_ws[key] = (value+1)/(len(unigrams)+len(uni_fdist))
prob_distr_unigram = sorted(uni_prob_ws.items(), key=lambda x:x[1], reverse=True)    

bigram_count = len(bigrams)
unique_bigram_count = len(set(bigrams))
bi_prob_ws = {}
print(bi_fdist)
for key, value in dict(bi_fdist).items():
    bi_prob_ws[key] = (value+1)/(bigram_count+unique_bigram_count)
prob_distr_bigram = sorted(bi_prob_ws.items(), key=lambda x:x[1], reverse=True)


trigram_count = len(trigrams)
unique_trigram_count = len(set(trigrams))
tri_prob_ws={}
print(tri_fdist)
for key, value in dict(tri_fdist).items():
    tri_prob_ws[key]=(value+1)/(trigram_count+unique_trigram_count)
prob_distr_trigram = sorted(tri_prob_ws.items(), key=lambda x:x[1], reverse=True)


fourgram_count = len(fourgrams)
unique_fourgram_count = len(set(fourgrams))
print(four_fdist)
four_prob_ws = {}
for key, value in dict(four_fdist).items():
    four_prob_ws[key]=(value+1)/(fourgram_count+unique_fourgram_count)
prob_distr_fourgram = sorted(four_prob_ws.items(), key=lambda x:x[1], reverse=True)

#Print top 10 unigram, bigram, trigram, fourgram after smoothing
print('Unigrams')
for i in range(10):
    print(prob_distr_unigram[i])
print('Bigrams')
for i in range(10):
    print(prob_distr_bigram[i])
print('Trigrams')
for i in range(10):
    print(prob_distr_trigram[i])
print('Fourgrams')
for i in range(10):
    print(prob_distr_fourgram[i])
    

<FreqDist with 2290 samples and 10837 outcomes>
<FreqDist with 11382 samples and 17789 outcomes>
<FreqDist with 18660 samples and 21283 outcomes>
<FreqDist with 21830 samples and 22618 outcomes>
Unigrams
('said', 0.017771224584820736)
('alice', 0.01489989135495887)
('little', 0.004889026850845879)
('one', 0.003530963836722024)
('went', 0.003259351233897253)
('like', 0.0030653422318795594)
('could', 0.0030265404314760206)
('would', 0.0030265404314760206)
('thought', 0.0029101350302654042)
('queen', 0.002560918826633556)
Bigrams
(('said', 'the'), 0.005721481562594104)
(('said', 'alice'), 0.0031481836349202003)
(('the', 'king'), 0.0016699061020011498)
(('the', 'queen'), 0.0016425305921322784)
(('a', 'little'), 0.001615155082263407)
(('mock', 'turtle'), 0.0015056530427879219)
(('the', 'mock'), 0.0014782775329190505)
(('the', 'gryphon'), 0.0014782775329190505)
(('the', 'hatter'), 0.001423526513181308)
(('went', 'on'), 0.001341399983574694)
Trigrams
(('the', 'mock', 'turtle'), 0.001184672164

### Predict the next word using statistical language modelling

We use the bigram, trigram, and fourgram models above to **predict the next word (top 5 most probable candidates) from the preceding n - 1 words, for n = 2, 3, 4**, for the sentences below.
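
For next-word prediction, a common textbook formulation of add-1 smoothing works with conditional probabilities, $P(w \mid h) = \frac{Count(h, w) + 1}{Count(h) + V}$, where $h$ is the preceding word and $V$ is the vocabulary size. This differs from the joint n-gram probabilities computed above. The sketch below applies it to a bigram model over a tiny made-up corpus; `top_k_next`, `corpus_demo`, and the other names are hypothetical.

```python
from collections import Counter

corpus_demo = "said the king said the queen said the king said alice".split()
vocab = sorted(set(corpus_demo))
V = len(vocab)

# Count(h): how often each word appears as a history (i.e. is followed by a word).
history_counts = Counter(corpus_demo[:-1])
# Count(h, w): how often word w directly follows history h.
bigram_counts = Counter(zip(corpus_demo, corpus_demo[1:]))

def top_k_next(history, k=5):
    """Rank every vocabulary word by its add-1 smoothed P(word | history)."""
    scored = [
        (w, (bigram_counts[(history, w)] + 1) / (history_counts[history] + V))
        for w in vocab
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

for word, prob in top_k_next("the", k=3):
    print(word, round(prob, 3))
```

Because every vocabulary word receives a nonzero score, the probabilities for a given history sum to 1, and unseen continuations can still be ranked.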


In [126]:
str1 = 'after that alice said the'
str2 = 'alice felt so desperate that she was'

In [127]:
def predict(tokens):
    print('\nBigram model Predictions')
    for k, v in dict(prob_distr_bigram).items():
        if(k[0]==tokens[-1]):
            print(k, v)
    print('\nTrigram model Predictions')
    for k, v in dict(prob_distr_trigram).items():
        if(k[0]==tokens[-2] and k[1]==tokens[-1]):
            print(k,v) 
    print('\nFourgram model Predictions')
    for k, v in dict(prob_distr_fourgram).items():
        if(k[0]==tokens[-3] and k[1]==tokens[-2] and k[2]==tokens[-1]):
            print(k, v) 

In [128]:
tokens = nltk.word_tokenize(str1)
predict(tokens)


Bigram model Predictions
('the', 'king') 0.0016699061020011498
('the', 'queen') 0.0016425305921322784
('the', 'mock') 0.0014782775329190505
('the', 'gryphon') 0.0014782775329190505
('the', 'hatter') 0.001423526513181308
('the', 'duchess') 0.0010128938651482384
('the', 'dormouse') 0.0009033918256727532
('the', 'march') 0.0008212652960661392
('the', 'mouse') 0.0007665142763283966
('the', 'caterpillar') 0.000711763256590654
('the', 'white') 0.0006296367269840401
('the', 'cat') 0.0006022612171151688
('the', 'little') 0.0005475101973774261
('the', 'door') 0.0005475101973774261
('the', 'rabbit') 0.0005201346875085549
('the', 'said') 0.0004927591776396835
('the', 'first') 0.0004380081579019409
('the', 'jury') 0.0004106326480330696
('the', 'next') 0.0003832571381641983
('the', 'whole') 0.0003832571381641983
('the', 'cook') 0.0003832571381641983
('the', 'time') 0.000355881628295327
('the', 'dodo') 0.000355881628295327
('the', 'court') 0.000355881628295327
('the', 'right') 0.0003285061184264557

In [43]:
tokens = nltk.word_tokenize(str2)
predict(tokens)


Bigram model Predictions
('was', 'going') 0.0003285061184264557
('was', 'quite') 0.00024637958881984176
('was', 'looking') 0.00019162856908209916
('was', 'sitting') 0.00019162856908209916
('was', 'beginning') 0.00016425305921322785
('was', 'nothing') 0.00013687754934435653
('was', 'gone') 0.00013687754934435653
('was', 'coming') 0.00010950203947548523
('was', 'talking') 0.00010950203947548523
('was', 'getting') 0.00010950203947548523
('was', 'exactly') 0.00010950203947548523
('was', 'certainly') 0.00010950203947548523
('was', 'reading') 8.212652960661393e-05
('was', 'good') 8.212652960661393e-05
('was', 'rather') 8.212652960661393e-05
('was', 'walking') 8.212652960661393e-05
('was', 'another') 8.212652960661393e-05
('was', 'close') 8.212652960661393e-05
('was', 'lying') 8.212652960661393e-05
('was', 'surprised') 8.212652960661393e-05
('was', 'soon') 8.212652960661393e-05
('was', 'speaking') 8.212652960661393e-05
('was', 'pressed') 8.212652960661393e-05
('was', 'bill') 8.21265296066139

In [129]:
def predict_next_token(input_string, nmodel = 2):
    tokens = nltk.word_tokenize(input_string)
    if nmodel == 2 :
        for k, v in dict(prob_distr_bigram).items():
            if(k[0]==tokens[-1]):
                return k[-1]
    elif nmodel == 3:
        for k, v in dict(prob_distr_trigram).items():
            if(k[0]==tokens[-2] and k[1]==tokens[-1]):
                return k[-1]
    elif nmodel == 4:
        for k, v in dict(prob_distr_fourgram).items():
            if(k[0]==tokens[-3] and k[1]==tokens[-2] and k[2]==tokens[-1]):
                return k[-1] 

In [130]:
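from time import sleep
def predict_sequence(input_string, nmodel = 2):
    print(f'\n{nmodel} gram model Predictions')
    while True:
        print(f"\r{input_string}", end = "")
        # Pass nmodel through so the requested model is actually used
        word = predict_next_token(input_string, nmodel)
        if word is None:
            # No matching n-gram for the current context, so stop generating
            break
        input_string += f" {word}"
        sleep(2)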

In [141]:
str1 = 'after that alice said the'
str2 = 'alice felt so desperate that she was'
str3 = "I said that"
predict_sequence(str3, 4)


4 gram model Predictions
I said that   alice said the

KeyboardInterrupt: ignored