# **Assignment 1 on Natural Language Processing**

### Date : 4th Sept, 2020

#### Instructor : Prof. Sudeshna Sarkar

#### Teaching Assistants : Alapan Kuila, Aniruddha Roy, Anusha Potnuru, Uppada Vishnu

#### Submitted by : Shrey Shrivastava (17CS30034)

# NLTK Library

The [NLTK](https://www.nltk.org/) Python library is widely used as an education and research tool. Tokenization, stemming, lemmatization, punctuation handling, character counts, and word counts are some of the features it provides; several of them are discussed in this tutorial.

**Installing NLTK** <br>
NLTK can be installed using the pip or conda package managers. For detailed installation instructions, follow this [link](https://www.nltk.org/install.html).

To ensure we are all on the same page, the coding environment will be **Python 3**. We suggest downloading Anaconda3 and creating a separate environment for this assignment. 
Anaconda3 installers for Windows and Linux are available at https://docs.anaconda.com/anaconda/install/. 
NLTK can then be installed, and its data downloaded, as follows: 
```bash
pip3 install nltk
python3 -c "import nltk; nltk.download()"
```

**Note for Question and answers:**

Write your answers to the point in the text box below labelled as **Answer here**.

# Tokenizing words and Sentences using Nltk

**Tokenization** is the process of dividing a large body of text into smaller units called tokens. <br>Many NLP tasks depend on recognizing patterns in text, and tokens are the units in which such patterns are found.<br>

NLTK's `tokenize` module provides, among others, two functions used in this tutorial:

1. `word_tokenize`, which splits text into word and punctuation tokens
2. `sent_tokenize`, which splits text into sentences
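
To make the idea of word-level and sentence-level tokens concrete, here is a minimal sketch that uses only Python's standard `re` module. The helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the rules are deliberately naive; NLTK's own tokenizers handle abbreviations, contractions, and other cases that this sketch ignores.

```python
import re

def simple_sent_tokenize(text):
    """Naive rule: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def simple_word_tokenize(text):
    """Return runs of word characters, plus each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Tokens are useful. They reveal patterns in text!"
sentences = simple_sent_tokenize(sample)
tokens = simple_word_tokenize(sample)

print(sentences)
print(tokens)
```

The sentence list has one string per sentence, while the token list treats punctuation marks as separate tokens, which is also why punctuation shows up in the token counts later in this tutorial.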

In [2]:
# Importing modules
import nltk
nltk.download('punkt') # For tokenizers
nltk.download('inaugural') # For dataset
nltk.download('stopwords')
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
# Sample corpus.
from nltk.corpus import inaugural
corpus = inaugural.raw('1789-Washington.txt')
print(corpus)

Fellow-Citizens of the Senate and of the House of Representatives:

Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years -- a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not bu

### **TASK**:

For the given corpus, 
1. Print the number of sentences and tokens. 
2. Print the average number of tokens per sentence.
3. Print the number of unique tokens
4. Print the number of tokens after stopword removal using the stopwords from nltk.


In [4]:
# Print the number of sentences and tokens
sent_list = sent_tokenize(corpus, language='english')
word_list = word_tokenize(corpus, language='english')
print(f'The number of sentences in corpus is {len(sent_list)}.')
print(f'The number of tokens in corpus is {len(word_list)}.')

# Print the average number of tokens per sentence
# (number of tokens / number of sentences)
print(f'The average number of tokens per sentence is {len(word_list)/len(sent_list)}.')

# Print the number of unique tokens
print(f'The number of unique tokens in corpus is {len(set(word_list))}.')

# Print the number of tokens after removing stopwords
# Note: the comparison is case-sensitive and NLTK's stopwords are lowercase,
# so capitalized forms such as 'The' are not removed.
stopword = set(stopwords.words('english'))
nostop_word_list = [word for word in word_list if word not in stopword]
print(f'The number of tokens and unique tokens after removing stopwords are {len(nostop_word_list)} and {len(set(nostop_word_list))}.')

The number of sentences in corpus is 23.
The number of tokens in corpus is 1537.
The average number of tokens per sentence is 66.82608695652173.
The number of unique tokens in corpus is 626.
The number of tokens and unique tokens after removing stopwords are 800 and 543.
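
The stopword filter in the cell above compares tokens case-sensitively, while NLTK's English stopword list is lowercase. The sketch below illustrates the effect on a fragment of the corpus. It uses a small hypothetical stopword set (`toy_stopwords`) as a stand-in for `stopwords.words('english')`, so the counts are illustrative only and say nothing about the full corpus.

```python
# Hypothetical mini stopword set, standing in for stopwords.words('english').
toy_stopwords = {"the", "of", "and", "on", "i", "was", "by", "my"}

# Tokens from the phrase "On the one hand, I was summoned by my Country".
tokens = ["On", "the", "one", "hand", ",", "I", "was", "summoned", "by", "my", "Country"]

# Case-sensitive test: 'On' and 'I' survive because only 'on' and 'i' are listed.
case_sensitive = [t for t in tokens if t not in toy_stopwords]

# Case-insensitive test: lowercase each token before the membership check.
case_insensitive = [t for t in tokens if t.lower() not in toy_stopwords]

print(len(case_sensitive), case_sensitive)
print(len(case_insensitive), case_insensitive)
```

Whether to lowercase before filtering is a design choice: it removes more stopwords, but it also treats sentence-initial words and proper nouns the same as their lowercase forms.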


# Stemming and Lemmatization with NLTK

**What is Stemming?** <br>
Stemming is a kind of word normalization. Normalization maps the different surface forms of a word onto a single representative form, so that they can be looked up and counted together. Words that share a meaning but vary with the context or sentence (for example, *grow*, *grows*, and *growing*) are reduced to the same form.<br>
Hence, stemming is a way to find the root form of a word from any of its variations.
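
As a rough illustration of rule-based stemming, the sketch below strips a handful of suffixes. The suffix list and the `toy_stem` helper are assumptions made for this example; it is not the Porter or Snowball algorithm, both of which apply many more rules and conditions.

```python
# Illustrative suffix rules only; this is NOT the Porter or Snowball algorithm.
SUFFIXES = ["ing", "ly", "es", "ed", "s"]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["grows", "growing", "cats", "fairly", "leaves"]:
    print(w, "->", toy_stem(w))
```

Note that `leaves` becomes `leav`, which is not a dictionary word. Stems only need to be consistent across the variants of a word, not readable.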

NLTK provides several stemmers, such as **PorterStemmer**, **SnowballStemmer**, and **LancasterStemmer**.<br>

We will compare the outputs of PorterStemmer and SnowballStemmer.

In [6]:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer # Note that SnowballStemmer has language as parameter.

words = ["grows","leaves","fairly","cats","trouble","misunderstanding","friendships","easily", "rational", "relational"]

# TODO
# create an instance of both the stemmers and perform stemming on above words
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer('english')

ps_wordlist = [porter_stemmer.stem(word) for word in words]
ss_wordlist = [snowball_stemmer.stem(word) for word in words]

print(ps_wordlist)
print(ss_wordlist)

# TODO
# Complete the function which takes a sentence/corpus and gets its stemmed version.
def stemSentence(sentence=None):
    """Return (Porter stems, Snowball stems) for every token in the text."""
    ps_wordlist = []
    ss_wordlist = []
    for sent in sent_tokenize(sentence):
        for word in word_tokenize(sent):
            ps_wordlist.append(porter_stemmer.stem(word))
            ss_wordlist.append(snowball_stemmer.stem(word))

    return ps_wordlist, ss_wordlist

print(stemSentence("Hello my name is Shrey."))
print(stemSentence(corpus))


['grow', 'leav', 'fairli', 'cat', 'troubl', 'misunderstand', 'friendship', 'easili', 'ration', 'relat']
['grow', 'leav', 'fair', 'cat', 'troubl', 'misunderstand', 'friendship', 'easili', 'ration', 'relat']
(['hello', 'my', 'name', 'is', 'shrey', '.'], ['hello', 'my', 'name', 'is', 'shrey', '.'])
(['fellow-citizen', 'of', 'the', 'senat', 'and', 'of', 'the', 'hous', 'of', 'repres', ':', 'among', 'the', 'vicissitud', 'incid', 'to', 'life', 'no', 'event', 'could', 'have', 'fill', 'me', 'with', 'greater', 'anxieti', 'than', 'that', 'of', 'which', 'the', 'notif', 'wa', 'transmit', 'by', 'your', 'order', ',', 'and', 'receiv', 'on', 'the', '14th', 'day', 'of', 'the', 'present', 'month', '.', 'On', 'the', 'one', 'hand', ',', 'I', 'wa', 'summon', 'by', 'my', 'countri', ',', 'whose', 'voic', 'I', 'can', 'never', 'hear', 'but', 'with', 'vener', 'and', 'love', ',', 'from', 'a', 'retreat', 'which', 'I', 'had', 'chosen', 'with', 'the', 'fondest', 'predilect', ',', 'and', ',', 'in', 'my', 'flatter', '
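
The two stemmers agree on most of the sample words. A short sketch like the following, with the word list and both result lists copied from the cell and output above, picks out the entries where they differ; the variable names `porter_out`, `snowball_out`, and `differences` are chosen for this example.

```python
words = ["grows", "leaves", "fairly", "cats", "trouble",
         "misunderstanding", "friendships", "easily", "rational", "relational"]

# Results copied from the notebook output above.
porter_out = ["grow", "leav", "fairli", "cat", "troubl",
              "misunderstand", "friendship", "easili", "ration", "relat"]
snowball_out = ["grow", "leav", "fair", "cat", "troubl",
                "misunderstand", "friendship", "easili", "ration", "relat"]

# Keep only the words on which the two stemmers disagree.
differences = [
    (word, p, s)
    for word, p, s in zip(words, porter_out, snowball_out)
    if p != s
]

for word, p, s in differences:
    print(f"{word}: Porter -> {p}, Snowball -> {s}")
```

On this sample the only disagreement is `fairly`, which Porter reduces to `fairli` and Snowball reduces to `fair`.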

**What is Lemmatization?** <br>
Lemmatization is the algorithmic process of finding the lemma of a word based on its meaning. It usually relies on morphological analysis of the word and aims to remove inflectional endings, returning the base or dictionary form of the word, which is known as the lemma.<br>

*The NLTK lemmatization method is based on WordNet's built-in `morphy` function.*
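
The role of the part of speech can be illustrated with a dictionary lookup. The sketch below is a toy stand-in, not WordNet: `TOY_LEXICON` and `toy_lemmatize` are hypothetical names, and the lexicon holds only a few hand-picked entries.

```python
# Hypothetical mini lexicon: (surface form, part of speech) -> lemma.
TOY_LEXICON = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("was", "v"): "be",
    ("better", "a"): "good",
    ("cats", "n"): "cat",
}

def toy_lemmatize(word, pos="n"):
    """Return the lexicon lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("leaves", "n"))
print(toy_lemmatize("leaves", "v"))
print(toy_lemmatize("was", "v"))
print(toy_lemmatize("was", "n"))
```

The same surface form `leaves` maps to different lemmas depending on the part of speech, and a word with no matching entry is returned unchanged, so passing the right `pos` matters.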

In [8]:
# Imports
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet') # The lemmatizer relies on WordNet's built-in morphy function.

words = ["grows","leaves","fairly","cats","trouble","running","friendships","easily", "was", "relational","has"]

#TODO
# Create an instance of the Lemmatizer and perform Lemmatization on above words
# You can also give Parts-of-speech(pos) to the Lemmatizer for example "v" (verb). Check the differences in the outputs.
lemmatizer = WordNetLemmatizer()
lemm_wordlist = [lemmatizer.lemmatize(word, pos='v') for word in words]
print(f'{lemm_wordlist}')

#TODO
# Complete the function which takes a sentence/corpus and gets its lemmatized version.
def lemmatizeSentence(sentence=None):
    """Return the lemma of every token, using the default part of speech (noun)."""
    lemmatizer = WordNetLemmatizer()
    lemm_wordlist = []
    for sent in sent_tokenize(sentence):
        for word in word_tokenize(sent):
            lemm_wordlist.append(lemmatizer.lemmatize(word))

    return lemm_wordlist

print(f'{lemmatizeSentence("What I talk about when I talk about running")}')
print(f'{lemmatizeSentence(corpus)}')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
['grow', 'leave', 'fairly', 'cat', 'trouble', 'run', 'friendships', 'easily', 'be', 'relational', 'have']
['What', 'I', 'talk', 'about', 'when', 'I', 'talk', 'about', 'running']
['Fellow-Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':', 'Among', 'the', 'vicissitude', 'incident', 'to', 'life', 'no', 'event', 'could', 'have', 'filled', 'me', 'with', 'greater', 'anxiety', 'than', 'that', 'of', 'which', 'the', 'notification', 'wa', 'transmitted', 'by', 'your', 'order', ',', 'and', 'received', 'on', 'the', '14th', 'day', 'of', 'the', 'present', 'month', '.', 'On', 'the', 'one', 'hand', ',', 'I', 'wa', 'summoned', 'by', 'my', 'Country', ',', 'whose', 'voice', 'I', 'can', 'never', 'hear', 'but', 'with', 'veneration', 'and', 'love', ',', 'from', 'a', 'retreat', 'which', 'I', 'had', 'chosen', 'with', 'the', 'fondest', 'predilection', ',', 

**Question:** Give an example of two words that have the same stem but different lemmas. Show the stem and lemma of both words in the code below.



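**Answer here:** The words *betters* and *better* share the Porter stem `better`, but when lemmatized as adjectives (`pos='a'`) their lemmas are `betters` and `good` respectively.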

In [9]:
#TODO
# Write code to print the stem and lemma of both your words
word_list = ['betters', 'better']

# Stemmed outputs
stem_wordlist = []
for word in word_list:
  stem_wordlist.append(porter_stemmer.stem(word))
print(f'{stem_wordlist}')

# Lemmatized outputs
lemm_wordlist = []
for word in word_list:
  lemm_wordlist.append(lemmatizer.lemmatize(word, pos='a'))
print(f'{lemm_wordlist}')


['better', 'better']
['betters', 'good']


**Question:** Write a comparison between stemming and lemmatization.

**Answer here:** Stemming converts different affixed forms of a word to a base form by removing the affix (like cutting the branches of a tree back to its stem), whereas lemmatization maps a group of words unified by meaning to their base form, taking the current part of speech into account. Lemmatization aims to return a dictionary word whose meaning is close to the original, and it needs more computation and memory because it relies on dictionary lookups. Stemming is a heuristic, rule-based removal of certain characters; it is much more efficient, but the resulting character sequence may not be a dictionary word. In NLTK, lemmatization is based on WordNet, while stemming is available through modules such as PorterStemmer and SnowballStemmer.