In [1]:
import pandas as pd

In [2]:
import nltk

In [3]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\RAHUL\AppData\Roaming\nltk_data...


True

PorterStemmer: One of the most commonly used stemmers. It is based on the Porter stemming algorithm.

LancasterStemmer: Based on the Lancaster stemming algorithm; it can sometimes stem more aggressively than PorterStemmer.

WordNetLemmatizer: Lemmatises using the WordNet lexical database. It returns the input word unchanged if the word cannot be found in WordNet.

In [4]:
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer
# Instantiate stemmers and lemmatiser
porter = PorterStemmer()
lancaster = LancasterStemmer()
lemmatiser = WordNetLemmatizer()
# Create function that normalises text using all three techniques
def normalise_text(words, pos='v'):
    """Stem and lemmatise each word in a list. Return output in a dataframe."""
    normalised_text = pd.DataFrame(index=words, columns=['Porter', 'Lancaster', 'Lemmatiser'])
    for word in words:
        normalised_text.loc[word,'Porter'] = porter.stem(word)
        normalised_text.loc[word,'Lancaster'] = lancaster.stem(word)
        normalised_text.loc[word,'Lemmatiser'] = lemmatiser.lemmatize(word, pos=pos)
    return normalised_text

In [5]:
normalise_text(['pie', 'globe', 'house', 'knee', 'angle', 'acetone', 'time', 'brownie', 'climate', 'independence'], pos='n')

| Word | Porter | Lancaster | Lemmatiser |
|---|---|---|---|
| pie | pie | pie | pie |
| globe | globe | glob | globe |
| house | hous | hous | house |
| knee | knee | kne | knee |
| angle | angl | angl | angle |
| acetone | aceton | aceton | acetone |
| time | time | tim | time |
| brownie | browni | browny | brownie |
| climate | climat | clim | climate |
| independence | independ | independ | independence |


In [6]:
normalise_text(['wrote', 'thinking', 'remembered', 'relies', 'ate', 'gone', 'won', 'ran', 'swimming', 'mistreated'], pos='v')


| Word | Porter | Lancaster | Lemmatiser |
|---|---|---|---|
| wrote | wrote | wrot | write |
| thinking | think | think | think |
| remembered | rememb | rememb | remember |
| relies | reli | rely | rely |
| ate | ate | at | eat |
| gone | gone | gon | go |
| won | won | won | win |
| ran | ran | ran | run |
| swimming | swim | swim | swim |
| mistreated | mistreat | mist | mistreat |


# Speed Comparison
When researching lemmatisation and stemming, I came across many resources stating that stemming is faster than lemmatisation. However, when I tested the three normalisers on sample data on my computer, I observed quite the opposite:

In [8]:
from nltk.corpus import movie_reviews
from nltk.tokenize import RegexpTokenizer

In [15]:
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\RAHUL\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\movie_reviews.zip.


True

In [16]:
# Import data (requires the movie_reviews corpus downloaded above)
reviews = []
for fileid in movie_reviews.fileids():
    tag, filename = fileid.split('/')
    reviews.append((tag, movie_reviews.raw(fileid)))
sample = pd.DataFrame(reviews, columns=['target', 'document'])

In [18]:
# Prepare one giant string 
sample_string = " ".join(sample['document'].values)
# Tokenise data
tokeniser = RegexpTokenizer(r'\w+')
tokens = tokeniser.tokenize(sample_string)


In [19]:
%%timeit 
lemmatiser = WordNetLemmatizer()
[lemmatiser.lemmatize(token, 'v') for token in tokens]

3.55 s ± 30.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [20]:
%%timeit 
porter = PorterStemmer()
[porter.stem(token) for token in tokens]

16.8 s ± 302 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [21]:
%%timeit 
lancaster = LancasterStemmer()
[lancaster.stem(token) for token in tokens]

13.5 s ± 272 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


As you can see from this quick assessment, the lemmatiser was in fact faster, even when we compare ranges of mean ± 3 standard deviations. On this sample, the lemmatiser therefore looks more favourable, since it normalises more sensibly and runs faster. I will share two tips for effective lemmatisation as a bonus in the next section.
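The three-standard-deviation comparison can be checked with a little arithmetic. The sketch below simply plugs in the means and standard deviations reported by `%%timeit` above (converted to seconds); the helper name `three_sigma_range` is made up for illustration.

```python
# Mean and standard deviation in seconds, copied from the %%timeit outputs above.
timings = {
    "WordNetLemmatizer": (3.55, 0.0309),
    "PorterStemmer": (16.8, 0.302),
    "LancasterStemmer": (13.5, 0.272),
}

def three_sigma_range(mean, std, k=3):
    """Return (mean - k*std, mean + k*std); a hypothetical helper for this check."""
    return (mean - k * std, mean + k * std)

ranges = {name: three_sigma_range(m, s) for name, (m, s) in timings.items()}

for name, (low, high) in ranges.items():
    print(f"{name}: {low:.2f} s to {high:.2f} s")

# The ranges do not overlap if the lemmatiser's upper bound sits below
# the lower bound of the faster of the two stemmers.
fastest_stemmer_low = min(ranges["PorterStemmer"][0], ranges["LancasterStemmer"][0])
print(ranges["WordNetLemmatizer"][1] < fastest_stemmer_low)
```

On these numbers the lemmatiser's range tops out at roughly 3.64 s, while the lower of the two stemmer ranges starts at roughly 12.68 s, so the ranges are well separated for this particular run and machine.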

5. Two tips for effective lemmatisation 💡

In [22]:
lemmatiser = WordNetLemmatizer()
print(f"Lemmatising 'remembered' with pos='v' results in: {lemmatiser.lemmatize('remembered', 'v')}")
print(f"Lemmatising 'remembered' with pos='n' results in: {lemmatiser.lemmatize('remembered', 'n')}\n")
print(f"Lemmatising 'universities' with pos='v' results in: {lemmatiser.lemmatize('universities', 'v')}")
print(f"Lemmatising 'universities' with pos='n' results in: {lemmatiser.lemmatize('universities', 'n')}")

Lemmatising 'remembered' with pos='v' results in: remember
Lemmatising 'remembered' with pos='n' results in: remembered

Lemmatising 'universities' with pos='v' results in: universities
Lemmatising 'universities' with pos='n' results in: university


As you can see, to effectively normalise words with WordNetLemmatizer, it's important to provide the correct pos argument for each word.
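One common way to obtain a pos argument per word is to run a part-of-speech tagger first and translate its tags into the single letters that WordNetLemmatizer expects ('n', 'v', 'a', 'r'). The sketch below assumes Penn Treebank style tags; the function name `penn_to_wordnet_pos` and the hand-written `tagged` list are illustrative only and are not part of the notebook above.

```python
def penn_to_wordnet_pos(tag: str, default: str = "n") -> str:
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet pos letter.

    Hypothetical helper: tags that match none of the rules fall back to
    `default`, which is 'n' here to mirror WordNetLemmatizer's default pos.
    """
    if tag.startswith("J"):   # JJ, JJR, JJS -> adjective
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, ... -> verb
        return "v"
    if tag.startswith("N"):   # NN, NNS, NNP, ... -> noun
        return "n"
    if tag.startswith("RB"):  # RB, RBR, RBS -> adverb
        return "r"
    return default

# Hand-written (word, tag) pairs standing in for real tagger output.
tagged = [("universities", "NNS"), ("remembered", "VBD"), ("quickly", "RB")]
pos_args = [(word, penn_to_wordnet_pos(tag)) for word, tag in tagged]
print(pos_args)
# [('universities', 'n'), ('remembered', 'v'), ('quickly', 'r')]
```

Each resulting pair could then be passed to `lemmatiser.lemmatize(word, pos)`. In practice the tags might come from a tagger such as `nltk.pos_tag`, which needs its own model download.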

In [23]:
print(f"Lemmatising 'Remembered' with pos='v' results in: {lemmatiser.lemmatize('Remembered', 'v')}")
print(f"Lemmatising 'Remembered' with pos='n' results in: {lemmatiser.lemmatize('Remembered', 'n')}\n")
print(f"Lemmatising 'Universities' with pos='v' results in: {lemmatiser.lemmatize('Universities', 'v')}")
print(f"Lemmatising 'Universities' with pos='n' results in: {lemmatiser.lemmatize('Universities', 'n')}")

Lemmatising 'Remembered' with pos='v' results in: Remembered
Lemmatising 'Remembered' with pos='n' results in: Remembered

Lemmatising 'Universities' with pos='v' results in: Universities
Lemmatising 'Universities' with pos='n' results in: Universities


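When capitalised, words remain unchanged even with the correct pos. This appears to be because the WordNet lookup is case-sensitive, so a capitalised form is not matched and is left as it is, much as a proper noun would be. To normalise capitalised words, convert them to lowercase before lemmatising.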
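A minimal sketch of working around this is to lowercase each word before handing it to the lemmatiser. The wrapper below takes the lemmatising function as an argument so it runs without WordNet data. `lemmatise_lowercased`, `toy_lemmatise` and `_TOY_LEMMAS` are made-up names, and the toy lookup only imitates the case-sensitive behaviour seen above; it is not how WordNetLemmatizer works internally.

```python
from typing import Callable

def lemmatise_lowercased(word: str,
                         lemmatise: Callable[[str, str], str],
                         pos: str = "n") -> str:
    """Lowercase `word`, then pass it to the supplied lemmatising callable."""
    return lemmatise(word.lower(), pos)

# Toy stand-in for a case-sensitive lemmatiser (assumption, for illustration).
_TOY_LEMMAS = {
    ("universities", "n"): "university",
    ("remembered", "v"): "remember",
}

def toy_lemmatise(word: str, pos: str) -> str:
    """Return the toy lemma if the exact (word, pos) pair is known, else the word."""
    return _TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatise("Universities", "n"))                        # Universities
print(lemmatise_lowercased("Universities", toy_lemmatise, "n"))  # university
print(lemmatise_lowercased("Remembered", toy_lemmatise, "v"))    # remember
```

With NLTK, the same wrapper could be called as `lemmatise_lowercased('Universities', lemmatiser.lemmatize, 'n')`. Note that lowercasing everything also lowercases genuine proper nouns, which may or may not be what you want.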