<b>Here we will talk about Stemming and Lemmatization</b><br>
Have you noticed that so far, if a sample of text contains two words such as apple and apples, they are treated as different words? Usually this isn't what we want. To make apple and apples count as the same word, we convert every occurrence of apples in the text to apple. The same is done for pairs like proper and properly, and so on. In other words, we convert every word into its "base" word.

The first method to do this is called stemming. It is a set of rules for chopping off or modifying the suffix of a word to reduce it to its root form. The result does not have to be a real word. Stemming cannot deal with pairs like feet and foot, where there is no suffix that can be cut or modified to convert one into the other.<br>
It is quicker than lemmatization and is widely used, for example in search and information retrieval.
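
To make the "set of rules" idea concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy written for illustration only: the rule list `SUFFIX_RULES`, the function `toy_stem`, and the `min_stem_length` parameter are all made up here, and the real Porter and Snowball algorithms use many more rules and conditions.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter or Snowball algorithm.
# The rule list is hypothetical and far smaller than any real stemmer's.
SUFFIX_RULES = [
    ("ies", "y"),  # berries -> berry
    ("ing", ""),   # looking -> look
    ("ly", ""),    # properly -> proper
    ("ed", ""),    # looked -> look
    ("s", ""),     # apples -> apple
]


def toy_stem(word, min_stem_length=3):
    """Apply the first matching suffix rule, keeping at least min_stem_length characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= min_stem_length:
            return word[:stem_length] + replacement
    return word  # no rule matched, so the word is left as it is


for word in ["apple", "apples", "proper", "properly", "looked", "feet"]:
    print(word, "->", toy_stem(word))
```

Note that feet comes back unchanged: no suffix rule can turn it into foot, which is exactly the limitation described above.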

In [17]:
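import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

# word_tokenize and WordNetLemmatizer need NLTK data packages. If they are missing,
# download them once, e.g. nltk.download("punkt") and nltk.download("wordnet")
# (the exact package names can vary with the NLTK version).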


In [2]:
text = "Very orderly and methodical he looked, with a hand on each knee, and a loud watch ticking a sonorous sermon under his flapped newly bought waist-coat, as though it pitted its gravity and longevity against the levity and evanescence of the brisk fire."
print(text)

Very orderly and methodical he looked, with a hand on each knee, and a loud watch ticking a sonorous sermon under his flapped newly bought waist-coat, as though it pitted its gravity and longevity against the levity and evanescence of the brisk fire.


In [4]:
words = word_tokenize(text)
print(words)

['Very', 'orderly', 'and', 'methodical', 'he', 'looked', ',', 'with', 'a', 'hand', 'on', 'each', 'knee', ',', 'and', 'a', 'loud', 'watch', 'ticking', 'a', 'sonorous', 'sermon', 'under', 'his', 'flapped', 'newly', 'bought', 'waist-coat', ',', 'as', 'though', 'it', 'pitted', 'its', 'gravity', 'and', 'longevity', 'against', 'the', 'levity', 'and', 'evanescence', 'of', 'the', 'brisk', 'fire', '.']


There are two classes we can use to perform stemming: PorterStemmer and SnowballStemmer.<br>
SnowballStemmer supports several languages (the language is passed in when it is created), while PorterStemmer works only on English.

In [6]:
porter = PorterStemmer()
porter_stemmed = [porter.stem(word) for word in words]
print(porter_stemmed)

['veri', 'orderli', 'and', 'method', 'he', 'look', ',', 'with', 'a', 'hand', 'on', 'each', 'knee', ',', 'and', 'a', 'loud', 'watch', 'tick', 'a', 'sonor', 'sermon', 'under', 'hi', 'flap', 'newli', 'bought', 'waist-coat', ',', 'as', 'though', 'it', 'pit', 'it', 'graviti', 'and', 'longev', 'against', 'the', 'leviti', 'and', 'evanesc', 'of', 'the', 'brisk', 'fire', '.']


In [9]:
snowball = SnowballStemmer("english")
snowball_stemmed = [snowball.stem(word) for word in words]
print(snowball_stemmed)

['veri', 'order', 'and', 'method', 'he', 'look', ',', 'with', 'a', 'hand', 'on', 'each', 'knee', ',', 'and', 'a', 'loud', 'watch', 'tick', 'a', 'sonor', 'sermon', 'under', 'his', 'flap', 'newli', 'bought', 'waist-coat', ',', 'as', 'though', 'it', 'pit', 'it', 'graviti', 'and', 'longev', 'against', 'the', 'leviti', 'and', 'evanesc', 'of', 'the', 'brisk', 'fire', '.']


In [13]:
# put the original tokens and both sets of stems side by side to compare them
compare = pd.DataFrame({"original" : words , 'snowball' : snowball_stemmed, 'porter' : porter_stemmed} )
compare

Unnamed: 0,original,snowball,porter
0,Very,veri,veri
1,orderly,order,orderli
2,and,and,and
3,methodical,method,method
4,he,he,he
5,looked,look,look
6,",",",",","
7,with,with,with
8,a,a,a
9,hand,hand,hand
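
Since the two stemmers agree on most tokens, it can help to list only the ones where they differ. The sketch below does this in plain Python on a subset of the tokens and stems copied from the printed outputs above; the variable names are made up for this example.

```python
# Tokens and stems copied from the printed outputs above (a subset, for brevity).
original = ["Very", "orderly", "methodical", "looked", "ticking", "his", "newly", "its"]
porter_out = ["veri", "orderli", "method", "look", "tick", "hi", "newli", "it"]
snowball_out = ["veri", "order", "method", "look", "tick", "his", "newli", "it"]

# Keep only the tokens on which the two stemmers disagree.
disagreements = [
    (word, p, s)
    for word, p, s in zip(original, porter_out, snowball_out)
    if p != s
]

for word, p, s in disagreements:
    print(f"{word}: porter={p}, snowball={s}")
```

On these tokens the only disagreements are orderly (orderli vs order) and his (hi vs his). With the DataFrame built above, the same filter could be written as `compare[compare["porter"] != compare["snowball"]]`.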


<b>Now we look at lemmatization</b>

Instead of simply cutting off a suffix, lemmatization looks the word up in a dictionary (here WordNet) that maps it to its root, called the lemma. This lets it handle words like feet, which map to foot.
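
The lookup idea can be sketched with an ordinary Python dictionary. This is a toy for illustration only: `IRREGULAR_FORMS` and `toy_lemmatize` are made-up names, the table is hand-written, and this is not how WordNet itself is implemented.

```python
# Toy lookup-based lemmatizer: an illustration only, NOT how WordNet is implemented.
# IRREGULAR_FORMS is a hypothetical, hand-written table.
IRREGULAR_FORMS = {
    "feet": "foot",
    "geese": "goose",
    "mice": "mouse",
    "children": "child",
}


def toy_lemmatize(word):
    """Return the dictionary form if the word is in the table, else the word unchanged."""
    return IRREGULAR_FORMS.get(word.lower(), word)


for word in ["feet", "foot", "mice", "table"]:
    print(word, "->", toy_lemmatize(word))
```

Unlike a suffix rule, the table can map feet to foot, but it only knows the words that were put into it. That is why a real lemmatizer relies on a large dictionary rather than a handful of entries.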

In [19]:
lemma = WordNetLemmatizer()
lemmatized = [lemma.lemmatize(word) for word in words]
print(lemmatized)

['Very', 'orderly', 'and', 'methodical', 'he', 'looked', ',', 'with', 'a', 'hand', 'on', 'each', 'knee', ',', 'and', 'a', 'loud', 'watch', 'ticking', 'a', 'sonorous', 'sermon', 'under', 'his', 'flapped', 'newly', 'bought', 'waist-coat', ',', 'a', 'though', 'it', 'pitted', 'it', 'gravity', 'and', 'longevity', 'against', 'the', 'levity', 'and', 'evanescence', 'of', 'the', 'brisk', 'fire', '.']


In [20]:
compare['lemmatized'] = lemmatized
compare

Unnamed: 0,original,snowball,porter,lemmatized
0,Very,veri,veri,Very
1,orderly,order,orderli,orderly
2,and,and,and,and
3,methodical,method,method,methodical
4,he,he,he,he
5,looked,look,look,looked
6,",",",",",",","
7,with,with,with,with
8,a,a,a,a
9,hand,hand,hand,hand


For now, we will not talk about the part-of-speech argument that lemmatization takes; we will come back to it later (hopefully). It makes a big difference to how effective lemmatization is. Since we are using its default value for now, which treats every word as a noun, words like looked and ticking come back unchanged, so here lemmatization does less than stemming.