# NLTK Introduction

# About NLTK library

- NLTK is a leading platform for building Python programs to work with human language data
- https://www.nltk.org/index.html

# NLTK library installation

In [1]:
# !pip install nltk

In [2]:
import nltk as nlp
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer,SnowballStemmer,RegexpStemmer,WordNetLemmatizer
import matplotlib.pyplot as plt
import re

# NLTK version

In [3]:
nlp.__version__

'3.6.5'

# How to download NLTK corpus data

In [4]:
# nlp.download()           # opens the interactive downloader
# nlp.download("punkt")    # or fetch a single resource by name

# Sample Text

In [5]:
txt = """Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[6] German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (audio speaker iconlisten); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,[7] widely acknowledged to be one of the greatest physicists of all time. Einstein is best known for developing the theory of relativity, but he also made important contributions to the development of the theory of quantum mechanics. Relativity and quantum mechanics are together the two pillars of modern physics.[3][8] His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been dubbed "the world's most famous equation".[9] His work is also known for its influence on the philosophy of science.[10][11] He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect",[12] a pivotal step in the development of quantum theory. His intellectual achievements and originality resulted in "Einstein" becoming synonymous with "genius".[13]

In 1905, a year sometimes described as his annus mirabilis ('miracle year'), Einstein published four groundbreaking papers.[14] These outlined the theory of the photoelectric effect, explained Brownian motion, introduced special relativity, and demonstrated mass-energy equivalence. Einstein thought that the laws of classical mechanics could no longer be reconciled with those of the electromagnetic field, which led him to develop his special theory of relativity. He then extended the theory to gravitational fields; he published a paper on general relativity in 1916, introducing his theory of gravitation. In 1917, he applied the general theory of relativity to model the structure of the universe.[15][16] He continued to deal with problems of statistical mechanics and quantum theory, which led to his explanations of particle theory and the motion of molecules. He also investigated the thermal properties of light and the quantum theory of radiation, which laid the foundation of the photon theory of light.
"""

In [6]:
print(txt)

Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[6] German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (audio speaker iconlisten); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,[7] widely acknowledged to be one of the greatest physicists of all time. Einstein is best known for developing the theory of relativity, but he also made important contributions to the development of the theory of quantum mechanics. Relativity and quantum mechanics are together the two pillars of modern physics.[3][8] His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been dubbed "the world's most famous equation".[9] His work is also known for its influence on the philosophy of science.[10][11] He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect",[12] a pivotal step in the development of quantum theory. His intellectual achievements and originality resulted in "Einstein" becom

# Part-3

# What is a sentence tokenizer

In [7]:
st = sent_tokenize(txt)
print(len(st))
print(st)

14
['Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[6] German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (audio speaker iconlisten); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,[7] widely acknowledged to be one of the greatest physicists of all time.', 'Einstein is best known for developing the theory of relativity, but he also made important contributions to the development of the theory of quantum mechanics.', 'Relativity and quantum mechanics are together the two pillars of modern physics.', '[3][8] His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been dubbed "the world\'s most famous equation".', '[9] His work is also known for its influence on the philosophy of science.', '[10][11] He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect",[12] a pivotal step in the development of quantum theory.', 'His intellectual achievements and originality r

## Ways to do sentence tokenization

In [8]:
# using split()

In [9]:
st = txt.split(".")
print(len(st))
print(st)

15
['Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[6] German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (audio speaker iconlisten); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,[7] widely acknowledged to be one of the greatest physicists of all time', ' Einstein is best known for developing the theory of relativity, but he also made important contributions to the development of the theory of quantum mechanics', ' Relativity and quantum mechanics are together the two pillars of modern physics', '[3][8] His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been dubbed "the world\'s most famous equation"', '[9] His work is also known for its influence on the philosophy of science', '[10][11] He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect",[12] a pivotal step in the development of quantum theory', ' His intellectual achievements and originality resu
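Splitting on "." is only a rough approximation: it drops the periods, leaves leading spaces, and returns whatever follows the last period as an extra item, which is likely one reason its count differs from `sent_tokenize`. As a rough middle ground, the sketch below uses the standard `re` module to split on whitespace that follows sentence-ending punctuation. The sample string and the pattern are illustrative assumptions, and the pattern will still break on abbreviations such as "Dr.".

```python
import re

# Illustrative sample built from facts in the text above (an assumption, not the full txt).
sample = (
    "Einstein published four papers in 1905. "
    "He received the Nobel Prize in 1921! "
    "Relativity and quantum mechanics are the two pillars of modern physics."
)

# Naive approach: the periods are dropped and an empty string trails the last one.
naive = sample.split(".")

# Regex approach: split on whitespace preceded by ., ! or ? so the punctuation is kept.
sentences = re.split(r"(?<=[.!?])\s+", sample.strip())

print(len(naive), naive)
print(len(sentences), sentences)
```

The naive list merges the second and third sentences because "!" is not a split point, while the regex version keeps all three sentences with their punctuation.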

### How to add user-defined words back if they were removed as stop words

In [10]:
a = ["cleaned data after removing stop words"]
print(a)

['cleaned data after removing stop words']


In [11]:
a.append("ON")
a.append("OFF")
a.append("IT")
print(a)

['cleaned data after removing stop words', 'ON', 'OFF', 'IT']
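The cell above appends the words by hand. Another way to get a similar effect is to take the words you want to keep out of the stop word set before filtering. The sketch below is a minimal illustration: `stop_words` is a tiny hand-written stand-in for `stopwords.words("english")`, and `keep_words`, `tokens` and `cleaned` are hypothetical names.

```python
# Hypothetical, tiny stop word set standing in for stopwords.words("english").
stop_words = {"the", "is", "it", "on", "off", "and", "a"}

# User-defined words that should survive filtering even though they are stop words.
keep_words = {"on", "off", "it"}

# Remove the protected words from the stop word set before filtering.
effective_stop_words = stop_words - keep_words

tokens = "turn the light on and switch it off".split()
cleaned = [t for t in tokens if t.lower() not in effective_stop_words]

print(cleaned)  # ['turn', 'light', 'on', 'switch', 'it', 'off']
```

With NLTK's own list, the same subtraction would apply to `set(stopwords.words("english"))`, assuming the stopwords corpus has been downloaded.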


# Stemming

In [12]:
words = ["give","given","giving","gives","gave"]
print(words)

['give', 'given', 'giving', 'gives', 'gave']


## Using PorterStemmer

In [13]:
ps = PorterStemmer()

In [14]:
ps.stem("caught")

'caught'

In [15]:
ps.stem("gave")

'gave'

In [16]:
ps.stem("gives")

'give'

In [17]:
for w in words:
    print(w,ps.stem(w))

give give
given given
giving give
gives give
gave gave


## Using SnowballStemmer

In [18]:
sb = SnowballStemmer(language="english")

In [19]:
for w in words:
    print(w,sb.stem(w))

give give
given given
giving give
gives give
gave gave


## Using RegexpStemmer

In [20]:
# strip a trailing ing/ed/es/s; words shorter than 4 characters are left unchanged
rs = RegexpStemmer("ing$|ed$|es$|s$", min=4)

In [21]:
for w in words:
    print(w,rs.stem(w))

give give
given given
giving giv
gives giv
gave gave
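Conceptually, a regular-expression stemmer deletes whatever the pattern matches and skips words shorter than the minimum length. The sketch below mimics that behaviour with the standard `re` module, using a suffix pattern like the one above. `regex_stem` and `SUFFIX_PATTERN` are hypothetical names written for illustration, not NLTK's implementation.

```python
import re

# Suffixes to strip, each anchored at the end of the word.
SUFFIX_PATTERN = re.compile(r"ing$|ed$|es$|s$")

def regex_stem(word, min_length=4):
    """Strip a matching suffix unless the word is shorter than min_length."""
    if len(word) < min_length:
        return word
    return SUFFIX_PATTERN.sub("", word)

words = ["give", "given", "giving", "gives", "gave"]
stems = [regex_stem(w) for w in words]
print(stems)              # ['give', 'given', 'giv', 'giv', 'gave']
print(regex_stem("gas"))  # shorter than 4 characters, so left as 'gas'
```

Because the rule is purely textual, it produces non-words such as "giv" and cannot relate an irregular form such as "gave" to "give".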


# Lemmatization
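Lemmatization returns the dictionary base form (lemma) from which a word was derived, e.g. "dogs" is derived from "dog".
WordNetLemmatizer treats each word as a noun unless a `pos` argument says otherwise.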

In [22]:
lemmatizer = WordNetLemmatizer()

In [23]:
print(words)

['give', 'given', 'giving', 'gives', 'gave']


In [24]:
for w in words:
    print(w,lemmatizer.lemmatize(w))

give give
given given
giving giving
gives give
gave gave


In [25]:
for w in words:
    print(w,lemmatizer.lemmatize(w,pos="v"))

give give
given give
giving give
gives give
gave give


In [26]:
word_2 = ["Playing", "Play","plays","played"]

In [27]:
for w in word_2:
    print(w,lemmatizer.lemmatize(w.lower(),pos="v"))

Playing play
Play play
plays play
played play
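
Passing `pos="v"` by hand works when every word is known to be a verb. For mixed text, the tag usually comes from a part-of-speech tagger that emits Penn Treebank tags (for example `nltk.pos_tag`), which then need converting to the single-letter codes the WordNet lemmatizer accepts. A minimal sketch of such a conversion follows; `penn_to_wordnet` is a hypothetical helper, and falling back to noun is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r' or 'n')."""
    if tag.startswith("JJ"):  # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("VB"):  # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return "v"
    if tag.startswith("RB"):  # RB, RBR, RBS: adverbs
        return "r"
    return "n"                # everything else is treated as a noun

for tag in ["VBG", "NNS", "JJ", "RB"]:
    print(tag, penn_to_wordnet(tag))
```

The result could then be passed along as `lemmatizer.lemmatize(word.lower(), pos=penn_to_wordnet(tag))`.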
