# **Natural Language Processing**
# **Ruthu S Sanketh**


# NLTK Library

# Tokenizing Words and Sentences Using NLTK

**Tokenization** is the process of dividing a large body of text into smaller parts called tokens. <br>Many NLP tasks depend on recognizing patterns in text, and tokens are the units in which those patterns are found.<br>

The Natural Language Toolkit (NLTK) provides the `tokenize` module, which includes the two functions used here:

1. `word_tokenize`, which splits text into word and punctuation tokens
2. `sent_tokenize`, which splits text into sentences
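
To make the two levels of tokenization concrete, here is a minimal sketch that uses only Python's `re` module. The helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the regular expressions are a deliberate simplification: they are not the algorithms NLTK uses and will mis-split abbreviations such as "Dr.".

```python
import re

def simple_sent_tokenize(text):
    # Hypothetical helper: split after '.', '!' or '?' when whitespace follows.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    # Hypothetical helper: a token is either a word (optionally with internal
    # hyphens or apostrophes) or a single punctuation character.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

sample = "Tokenization splits text. Each piece is a token!"
sentences = simple_sent_tokenize(sample)
tokens = simple_word_tokenize(sample)

print(sentences)  # two sentences
print(tokens)     # words and punctuation marks as separate tokens
```

Note that punctuation marks count as tokens here, which is also why token counts are usually higher than word counts.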

In [None]:
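# Importing modules
import nltk
nltk.download('punkt')      # Tokenizer models (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('inaugural')  # Inaugural address corpus used as the dataset
nltk.download('stopwords')  # Stopword lists needed for the stopword-removal step
from nltk.tokenize import word_tokenize, sent_tokenize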

In [1]:
# Sample corpus.
from nltk.corpus import inaugural
corpus = inaugural.raw('1789-Washington.txt')
print(corpus)

Fellow-Citizens of the Senate and of the House of Representatives:

Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years -- a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not bu

For the given corpus:
1. Print the number of sentences and tokens.
2. Print the average number of tokens per sentence.
3. Print the number of unique tokens.
4. Print the number of tokens after stopword removal, using the stopwords from NLTK.


In [3]:
sents = nltk.sent_tokenize(corpus)
print("The number of sentences is", len(sents)) #prints the number of sentences

words = nltk.word_tokenize(corpus)
print("The number of tokens is", len(words)) #prints the number of tokens

average_tokens = round(len(words)/len(sents))
print("The average number of tokens per sentence is", average_tokens) #prints the average number of tokens per sentence

unique_tokens = set(words)
print("The number of unique tokens are", len(unique_tokens)) #prints the number of unique tokens

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not in the stopword list (the comparison is case-sensitive)
final_tokens = [each for each in words if each not in stop_words]

print("The number of total tokens after removing stopwords are", len(final_tokens))  # prints number of tokens after removing stopwords


The number of sentences is 23
The number of tokens is 1537
The average number of tokens per sentence is 67
The number of unique tokens are 626
The number of total tokens after removing stopwords are 800
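
NLTK's English stopword list is lowercase, so the membership test above is case-sensitive: a capitalized stopword such as "The" survives the filter, and punctuation tokens are kept as well. The following is a minimal sketch of a case-insensitive variant. The small `stop_words` set and the `tokens` list are hypothetical stand-ins for the NLTK data, chosen so the example runs without any downloads.

```python
# Hypothetical stand-ins for stopwords.words('english') and word_tokenize(corpus)
stop_words = {"the", "of", "and", "to", "in"}
tokens = ["The", "voice", "of", "my", "Country", ",", "and", "the", "trust"]

# Case-sensitive filter, as in the cell above: "The" is kept
case_sensitive = [t for t in tokens if t not in stop_words]

# Case-insensitive filter: lowercase each token only for the comparison
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

# Optionally drop punctuation too, keeping alphabetic tokens only
content_only = [t for t in case_insensitive if t.isalpha()]

print(len(case_sensitive), len(case_insensitive), len(content_only))
```

Whether to lowercase or to drop punctuation depends on the task, so the counts reported above are correct for the filter as written; this variant would simply give smaller numbers.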


# Stemming and Lemmatization with NLTK

**What is Stemming?** <br>
Stemming is a form of word normalization: words that share a meaning but vary in form with the context or sentence (for example, *grow* and *grows*) are reduced to one common form, which shortens lookup.<br>
Hence stemming is a way to find the root of a word from any of its variations. Because it works by stripping word endings with heuristic rules, the resulting stem need not be a dictionary word.
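
The idea of rule-based suffix stripping can be sketched in a few lines of plain Python. This is a toy illustration only: the `SUFFIX_RULES` table and the `toy_stem` function are hypothetical, and real stemmers such as Porter's apply many more rules and conditions.

```python
# Hypothetical (suffix, replacement) rules, checked in order
SUFFIX_RULES = [("ing", ""), ("ies", "y"), ("ly", ""), ("es", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word

words = ["grows", "cats", "running", "easily", "friendships"]
stems = [toy_stem(w) for w in words]
print(stems)  # → ['grow', 'cat', 'runn', 'easi', 'friendship']
```

Stems such as `runn` and `easi` show why a stemmer's output is best treated as a lookup key rather than as a readable word.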

NLTK provides several stemmers, including **PorterStemmer**, **SnowballStemmer**, and **LancasterStemmer**.<br>

We will look at the differences between `PorterStemmer` and `SnowballStemmer`.

In [4]:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer # Note that SnowballStemmer has language as parameter.

words = ["grows","leaves","fairly","cats","trouble","misunderstanding","friendships","easily", "rational", "relational"]


#Create instances of both stemmers, and stem the words using them.

stemmer_ps = PorterStemmer()   #an instance of porter stemmer
stemmed_words_ps = [stemmer_ps.stem(word) for word in words]
print("Porter stemmed words: ", stemmed_words_ps)

stemmer_ss = SnowballStemmer("english")   #an instance of snowball stemmer
stemmed_words_ss = [stemmer_ss.stem(word) for word in words]
print("Snowball stemmed words: ", stemmed_words_ss)


# A function which takes a sentence/corpus and gets its stemmed version.
def stemSentence(sentence):
    token_words=word_tokenize(sentence) #we need to tokenize the sentence or else stemming will return the entire sentence as is.
    stem_sentence=[]
    for word in token_words:
        stem_sentence.append(stemmer_ps.stem(word))
        stem_sentence.append(" ") #adding a space so that we can join all the words at the end to form the sentence again.
    return "".join(stem_sentence)

stemmed_sentence = stemSentence("The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given.")
print("The Porter stemmed sentence is: ", stemmed_sentence)


Porter stemmed words:  ['grow', 'leav', 'fairli', 'cat', 'troubl', 'misunderstand', 'friendship', 'easili', 'ration', 'relat']
Snowball stemmed words:  ['grow', 'leav', 'fair', 'cat', 'troubl', 'misunderstand', 'friendship', 'easili', 'ration', 'relat']
The Porter stemmed sentence is:  the circumst under which I now meet you will acquit me from enter into that subject further than to refer to the great constitut charter under which you are assembl , and which , in defin your power , design the object to which your attent is to be given . 


**What is Lemmatization?** <br>
Lemmatization is the algorithmic process of finding the lemma of a word based on its meaning. It usually relies on morphological analysis of the word to remove inflectional endings and return the base or dictionary form, which is known as the lemma.<br>

*The NLTK lemmatization method is based on WordNet's built-in `morphy` function.*
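
Unlike a stemmer, a lemmatizer looks words up, and the answer can depend on the part of speech. The sketch below illustrates that idea with a tiny hand-written table. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical and stand in for WordNet's far larger dictionary and rules; this is not how NLTK implements it.

```python
# Hypothetical lookup table keyed by (word, part of speech); "n" = noun, "v" = verb
LEMMA_TABLE = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("cats", "n"): "cat",
    ("was", "v"): "be",
    ("has", "v"): "have",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word itself if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("leaves"))           # noun reading → leaf
print(toy_lemmatize("leaves", pos="v"))  # verb reading → leave
print(toy_lemmatize("was"))              # no noun entry, returned unchanged → was
print(toy_lemmatize("was", pos="v"))     # verb reading → be
```

The same surface form (`leaves`) maps to two different lemmas, which is why the part of speech matters when lemmatizing.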

In [5]:
#imports
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet') # The lemmatizer is based on WordNet's built-in morphy function.

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ruthu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
words = ["grows","leaves","fairly","cats","trouble","running","friendships","easily", "was", "relational","has"]

# Create an instance of the Lemmatizer and perform Lemmatization on above words

lemmatizer = WordNetLemmatizer()   #an instance of Word Net Lemmatizer
lemmatized_words = [lemmatizer.lemmatize(word) for word in words] 
print("The lemmatized words: ", lemmatized_words) #prints the lemmatized words

# Lemmatize again, this time treating every word as a verb (pos="v")
lemmatized_words_pos = [lemmatizer.lemmatize(word, pos="v") for word in words]
print("The lemmatized words using a POS tag: ", lemmatized_words_pos) #prints POS tagged lemmatized words

# A function which takes a sentence/corpus and gets its lemmatized version.
def lemmatizeSentence(sentence):
    token_words=word_tokenize(sentence) #we need to tokenize the sentence or else lemmatizing will return the entire sentence as is.
    lemma_sentence=[]
    for word in token_words:
        lemma_sentence.append(lemmatizer.lemmatize(word))
        lemma_sentence.append(" ")
    return "".join(lemma_sentence)

lemma_sentence = lemmatizeSentence("The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given.")
print("The lemmatized sentence is: ", lemma_sentence)

The lemmatized words:  ['grows', 'leaf', 'fairly', 'cat', 'trouble', 'running', 'friendship', 'easily', 'wa', 'relational', 'ha']
The lemmatized words using a POS tag:  ['grow', 'leave', 'fairly', 'cat', 'trouble', 'run', 'friendships', 'easily', 'be', 'relational', 'have']
The lemmatized sentence is:  The circumstance under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled , and which , in defining your power , designates the object to which your attention is to be given . 
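
In the outputs above, `lemmatize` treats every word as a noun unless a `pos` argument is given, which is why verbs such as "entering" and "assembled" are left unchanged in the sentence. One common approach is to tag each token first and translate the tag into one of WordNet's single-letter categories. The sketch below shows only that translation step. `penn_to_wordnet` is a hypothetical helper, and the tags in `tagged` are written by hand for illustration; with NLTK they would typically come from `nltk.pos_tag`, which needs a tagger model download.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun, matching the lemmatizer's own default

# Hand-written (token, tag) pairs, assumed for illustration
tagged = [("circumstances", "NNS"), ("are", "VBP"), ("assembled", "VBN"),
          ("great", "JJ"), ("further", "RBR")]

pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)

# Each pair could then be passed to the lemmatizer created earlier, e.g.
# lemmatizer.lemmatize(word, pos=pos)
```

Defaulting unknown tags to noun keeps the behavior identical to a plain `lemmatize(word)` call for anything the mapping does not cover.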
