
## **Stemming**
**Stemming** is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes) according to fixed rules. The resulting stem need not be a dictionary word, which is what distinguishes it from a **lemma**. For example, “likes”, “liked”, “likely” and “liking” are all reduced to “like” after stemming.

Two NLTK stemmers are used in this notebook:

1. PorterStemmer
2. SnowballStemmer

## **Lemmatization**
**Lemmatization** uses a vocabulary and morphological analysis of words to remove inflectional endings only and return the base or dictionary form of a word, which is known as the **lemma**.
- Stemming is faster because it cuts words by rule without considering context; lemmatization is slower because it consults a vocabulary and, in many implementations, the word's part of speech.
- For example, the Porter stemmer reduces "history" and "histories" to "histori", while lemmatization returns "history".
- A stem may or may not be a meaningful word, whereas a lemma is always a valid dictionary form.
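
The contrast in the list above can be sketched with two toy functions. This is an illustration under stated assumptions, not Porter's algorithm or WordNet: `SUFFIXES`, `LEMMAS`, `toy_stem` and `toy_lemmatize` are hypothetical names invented for this example.

```python
# Toy illustration only: neither function reproduces Porter's rules or WordNet.
SUFFIXES = ("ies", "ing", "ed", "ly", "s")  # hypothetical rule list

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, with no dictionary check."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
LEMMAS = {"studies": "study", "studying": "study", "feet": "foot"}

def toy_lemmatize(word: str) -> str:
    """Look the word up; fall back to the word itself when it is unknown."""
    return LEMMAS.get(word, word)

for w in ["studies", "studying", "feet"]:
    print(f"{w:10} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The rule-based function never consults a dictionary, so it can return a non-word such as "stud" and cannot handle an irregular form such as "feet"; the lookup can only return forms stored in its table.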



In [None]:
import spacy
# Load spaCy's small English pipeline; the cells below print one token per line
nlp = spacy.load('en_core_web_sm')
doc2 = nlp(u"We're here to help! Send snail-mail, email fahad@gmail.com or visit us at http://www.fahadhussaincs.blogspot.com!")
for t in doc2:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
fahad@gmail.com
or
visit
us
at
http://www.fahadhussaincs.blogspot.com
!


In [None]:
doc3 = nlp(u'A 5km NYC cab ride costs $10.30')
for t in doc3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
10.30


In [None]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")
for t in doc4:
    print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.


In [None]:
 # ----   Porter Stemmer   ------

## **Stemming**
The cells below apply the two NLTK stemmers to the same word lists so that their outputs can be compared:

1. PorterStemmer
2. SnowballStemmer

In [None]:
# Import NLTK and the Porter stemmer class
import nltk
from nltk.stem.porter import PorterStemmer

In [None]:
p_stemmer = PorterStemmer()
words = ['run','runner','running','ran','runs','easily','fairly']

In [None]:
for word in words:
    print(word+' --> '+p_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fairli


In [None]:
#SnowballStemmer
from nltk.stem.snowball import SnowballStemmer
# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')

In [None]:
words = ['run','runner','running','ran','runs','easily','fairly']
# words = ['generous','generation','generously','generate']

In [None]:
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair
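
To see at a glance where the two stemmers disagree, their results can be diffed. The sketch below is one way to do that; `stemmer_disagreements` is a hypothetical helper, and the two dictionaries are stand-ins copied from the outputs printed above so that the example runs without NLTK.

```python
from typing import Callable, List, Tuple

def stemmer_disagreements(
    words: List[str],
    stem_a: Callable[[str], str],
    stem_b: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (word, stem_a(word), stem_b(word)) for each word where the stems differ."""
    rows = []
    for word in words:
        a, b = stem_a(word), stem_b(word)
        if a != b:
            rows.append((word, a, b))
    return rows

# Stand-ins built from the Porter and Snowball outputs shown above.
porter_out = {'run': 'run', 'runner': 'runner', 'running': 'run', 'ran': 'ran',
              'runs': 'run', 'easily': 'easili', 'fairly': 'fairli'}
snowball_out = dict(porter_out, fairly='fair')

diffs = stemmer_disagreements(
    list(porter_out), porter_out.__getitem__, snowball_out.__getitem__
)
for word, a, b in diffs:
    print(f'{word}: Porter={a}  Snowball={b}')
```

With NLTK loaded, `p_stemmer.stem` and `s_stemmer.stem` could be passed in place of the dictionary lookups.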


In [None]:
# ----Do Some more practice -----

In [None]:
words = ['consolingly']

In [None]:
print('Porter Stemmer:')
for word in words:
    print(word+' --> '+p_stemmer.stem(word))

Porter Stemmer:
consolingly --> consolingli


In [None]:
print('Porter2 Stemmer:')
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

Porter2 Stemmer:
consolingly --> consol


In [None]:
# Stemming has its drawbacks. Given the token "saw", a stemmer might always return "saw", whereas a lemmatizer
# would likely return "see" or "saw" depending on whether the token is used as a verb or a noun.
# In the example below, the verb "meeting" and the noun "meeting" both become "meet":

phrase = 'I am meeting him tomorrow at the meeting'
for word in phrase.split():
    print(word+' --> '+p_stemmer.stem(word))

I --> I
am --> am
meeting --> meet
him --> him
tomorrow --> tomorrow
at --> at
the --> the
meeting --> meet
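
Both occurrences of "meeting" collapse to "meet" because the stemmer never sees the part of speech. A minimal sketch of what a POS-aware lookup does differently; the table `LEMMA_BY_POS` and the function `lemma_for` are hypothetical, and the POS tags are supplied by hand rather than by a tagger.

```python
# Hypothetical lookup table keyed by (surface form, part of speech).
LEMMA_BY_POS = {
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}

def lemma_for(word: str, pos: str) -> str:
    """Return the lemma for a (word, POS) pair, or the word itself if unknown."""
    return LEMMA_BY_POS.get((word.lower(), pos), word)

# Hand-tagged tokens; a real pipeline would obtain the tags from a POS tagger.
tagged = [("meeting", "VERB"), ("meeting", "NOUN"), ("saw", "VERB"), ("saw", "NOUN")]
for word, pos in tagged:
    print(f"{word} ({pos}) --> {lemma_for(word, pos)}")
```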


In [None]:
# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
var1 = nlp(u"John Adam is one the researcher who invent the direction of way towards success!")

# token.lemma is the integer hash of the lemma in spaCy's string store; token.lemma_ is its text
for token in var1:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)

John 	 PROPN 	 11174346320140919546 	 John
Adam 	 PROPN 	 14264057329400597350 	 Adam
is 	 AUX 	 10382539506755952630 	 be
one 	 NOUN 	 17454115351911680600 	 one
the 	 DET 	 7425985699627899538 	 the
researcher 	 NOUN 	 1317581537614213870 	 researcher
who 	 PRON 	 3876862883474502309 	 who
invent 	 VERB 	 5373681334090504585 	 invent
the 	 DET 	 7425985699627899538 	 the
direction 	 NOUN 	 895834437038626927 	 direction
of 	 ADP 	 886050111519832510 	 of
way 	 NOUN 	 6878210874361030284 	 way
towards 	 ADP 	 9315050841437086371 	 towards
success 	 NOUN 	 16089821935113899987 	 success
! 	 PUNCT 	 17494803046312582752 	 !


In [None]:
def show_lemmas(doc):
    """Print each token's text, POS tag, lemma hash and lemma string in aligned columns."""
    for token in doc:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

In [None]:
var2 = nlp(u"John Adam is one the researcher who invent the direction of way towards success!")
show_lemmas(var2)

John         PROPN  11174346320140919546   John
Adam         PROPN  14264057329400597350   Adam
is           AUX    10382539506755952630   be
one          NOUN   17454115351911680600   one
the          DET    7425985699627899538    the
researcher   NOUN   1317581537614213870    researcher
who          PRON   3876862883474502309    who
invent       VERB   5373681334090504585    invent
the          DET    7425985699627899538    the
direction    NOUN   895834437038626927     direction
of           ADP    886050111519832510     of
way          NOUN   6878210874361030284    way
towards      ADP    9315050841437086371    towards
success      NOUN   16089821935113899987   success
!            PUNCT  17494803046312582752   !


In [None]:
var3 = nlp(u"I am meeting him tomorrow at the meeting.")
show_lemmas(var3)

I            PRON   561228191312463089     -PRON-
am           AUX    10382539506755952630   be
meeting      VERB   6880656908171229526    meet
him          PRON   561228191312463089     -PRON-
tomorrow     NOUN   3573583789758258062    tomorrow
at           ADP    11667289587015813222   at
the          DET    7425985699627899538    the
meeting      NOUN   14798207169164081740   meeting
.            PUNCT  12646065887601541794   .


In [None]:
var4 = nlp(u"That's of the greate person in the world")
show_lemmas(var4)

That         DET    4380130941430378203    that
's           AUX    10382539506755952630   be
of           ADP    886050111519832510     of
the          DET    7425985699627899538    the
greate       ADJ    4429768169814447593    greate
person       NOUN   14800503047316267216   person
in           ADP    3002984154512732771    in
the          DET    7425985699627899538    the
world        NOUN   1703489418272052182    world


# **Stemming - KN**

In [None]:
import nltk
nltk.download('popular')

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [None]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""
               
               


In [None]:
sentences = nltk.sent_tokenize(paragraph)
stemmer = PorterStemmer()
# Build the stop-word set once, outside the loop
stop_words = set(stopwords.words('english'))

# Stem every token that is not a stop word and rebuild each sentence
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)
    

In [None]:
sentences

['I three vision india .',
 'In 3000 year histori , peopl world come invad us , captur land , conquer mind .',
 'from alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .',
 'yet done nation .',
 'We conquer anyon .',
 'We grab land , cultur , histori tri enforc way life .',
 'My good fortun work three great mind .',
 'dr. vikram sarabhai dept .',
 'space , professor satish dhawan , succeed dr. brahm prakash , father nuclear materi .',
 'I lucki work three close consid great opportun life .',
 'I see four mileston career']

**Problem:** The stems produced may not be meaningful words, for example "histori", "peopl" and "fortun" in the output above.
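
A separate detail of the stemming loop above: the stop-word test is case-sensitive, which is why capitalized tokens such as "I", "In", "We" and "My" survive in the output even though their lowercase forms are in NLTK's English stop-word list. A minimal sketch of a case-insensitive filter follows; `STOP_WORDS` is a small hypothetical subset used in place of `stopwords.words('english')` so that the example runs without NLTK data, and `drop_stop_words` is a hypothetical helper name.

```python
# Hypothetical subset of a stop-word list, all lowercase.
STOP_WORDS = {"i", "in", "we", "my", "have", "not", "of", "our", "the", "for"}

def drop_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["We", "have", "not", "conquered", "anyone", "."]
print(drop_stop_words(tokens))
```

Comparing the lowercase form leaves the original casing of the kept tokens untouched.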

# **Lemmatization - KN**

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

In [None]:
paragraph = """Thank you all so very much. Thank you to the Academy. 
               Thank you to all of you in this room. I have to congratulate 
               the other incredible nominees this year. The Revenant was 
               the product of the tireless efforts of an unbelievable cast
               support leaders around the world who do not speak for the 
               big polluters, but who speak for all of humanity, for the
               indigenous people of the world, for the billions and 
               billions of underprivileged people out there who would be
               most affected by this. For our children’s children, and 
               for those people out there whose voices have been drowned
               out by the politics of greed. I thank you all for this 
               amazing award tonight. Let us not take this planet for 
               granted. I do not take tonight for granted. Thank you so very much."""
               

In [None]:
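sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
# Build the stop-word set once, outside the loop
stop_words = set(stopwords.words('english'))

# Lemmatize every token that is not a stop word and rebuild each sentence.
# lemmatize() treats each word as a noun unless a pos argument is passed.
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)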
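
`WordNetLemmatizer.lemmatize` accepts an optional `pos` argument and treats words as nouns by default, so inflected verb forms are generally left unchanged by the loop above. A common workaround is to tag each token and translate the tag into WordNet's single-letter codes. The sketch below shows only that translation step, assuming Penn Treebank tags as input; `penn_to_wordnet` is a hypothetical helper name.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank POS tag to a WordNet POS code ('a', 'v', 'r' or 'n')."""
    if tag.startswith("J"):   # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return "v"
    if tag.startswith("RB"):  # RB, RBR, RBS: adverbs
        return "r"
    return "n"                # nouns and everything else, matching lemmatize()'s default

for tag in ["NNS", "VBG", "JJR", "RB"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With NLTK's tagger data installed, each `(word, tag)` pair from `nltk.pos_tag(words)` could then be lemmatized as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.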