Stemming and Lemmatization tutorial

Stemming - Apply fixed rules, such as stripping suffixes like "able" or "ing", to derive a base form. The result is not always a real word.
Lemmatization - Use knowledge of the language (a.k.a. linguistic knowledge) to derive the base word, called the lemma.
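
To make the "fixed rules" idea concrete, here is a minimal sketch of a suffix-stripping stemmer. It is not the Porter algorithm used below: the suffix list, the minimum stem length, and the names SUFFIXES and naive_stem are all assumptions made for illustration.

```python
# Toy suffix-stripping stemmer: a sketch of the idea, NOT the Porter algorithm.
# SUFFIXES and naive_stem are hypothetical names chosen for this example.
SUFFIXES = ("able", "ing", "s")  # checked in this order


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for word in ["eating", "eats", "adjustable", "ate", "sing"]:
    print(f"{word} | {naive_stem(word)}")
```

Because the rules only look at spelling, an irregular form such as "ate" passes through unchanged: no suffix rule connects it to "eat". The minimum stem length is what keeps "sing" from being cut down to "s".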

In [27]:
import nltk
import spacy
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

Stemming with NLTK

In [28]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]
for word in words:
    print(f"{word} | {stemmer.stem(word)}")

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet


Lemmatization with spaCy

In [29]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("eating eats eat ate adjustable rafting ability meeting")
for token in doc:
    print(f"{token} | {token.lemma_}")

eating | eating
eats | eat
eat | eat
ate | eat
adjustable | adjustable
rafting | raft
ability | ability
meeting | meeting
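
A lemmatizer generally depends on the part of speech assigned to each word, which may be why a form like "meeting" is left unchanged above when it is read as a noun. The sketch below illustrates that idea with a hand-written lookup table. It is not how spaCy is implemented; LEMMA_TABLE, its entries, and lookup_lemma are assumptions made for this example.

```python
# Toy dictionary-based lemmatizer: a sketch, NOT spaCy's implementation.
# LEMMA_TABLE and lookup_lemma are hypothetical; the entries are hand-written.
LEMMA_TABLE = {
    ("ate", "VERB"): "eat",
    ("eats", "VERB"): "eat",
    ("eating", "VERB"): "eat",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def lookup_lemma(word: str, pos: str) -> str:
    """Return the table entry for (word, pos), or the lowercased word if none exists."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


for word, pos in [("ate", "VERB"), ("meeting", "NOUN"), ("meeting", "VERB"), ("ability", "NOUN")]:
    print(f"{word} ({pos}) | {lookup_lemma(word, pos)}")
```

Unlike a suffix rule, the table can map the irregular form "ate" to "eat", and it can give the same spelling different lemmas depending on its part of speech.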


In [30]:
# Repeat the experiment with spaCy's Dutch model on roughly equivalent Dutch words
nlp = spacy.load("nl_core_news_sm")
doc = nlp("eten eet eten aten verstelbare rafting vermogen vergadering")
for token in doc:
    print(f"{token} | {token.lemma_}")

eten | eten
eet | eten
eten | eten
aten | aten
verstelbare | verstelbaar
rafting | rafting
vermogen | vermogen
vergadering | vergadering


In [44]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Mando talked for 3 hours although talking isn't his thing he became talkative")
for token in doc:
    # token.lemma_ is the lemma as a string; token.lemma is its integer hash ID
    print(f"{token} | {token.lemma_}  | {token.lemma}")

Mando | mando  | 10991835832878170099
talked | talk  | 13939146775466599234
for | for  | 16037325823156266367
3 | 3  | 602994839685422785
hours | hour  | 9748623380567160636
although | although  | 343236316598008647
talking | talking  | 3577425109143670181
is | be  | 10382539506755952630
n't | not  | 447765159362469301
his | his  | 2661093235354845946
thing | thing  | 2473243759842082748
he | he  | 1655312771067108281
became | become  | 12558846041070486771
talkative | talkative  | 13364764166055324990
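
The third column above is token.lemma, the integer ID under which the lemma string is stored, while token.lemma_ is the string itself. A rough sketch of such a string-to-ID store follows. ToyStringStore is a hypothetical class, and hashlib.blake2b is a stand-in chosen for this example, so the IDs it produces will not match the numbers printed above.

```python
import hashlib


class ToyStringStore:
    """Maps strings to 64-bit integer IDs and back (illustration only)."""

    def __init__(self):
        self._by_id = {}  # integer ID -> original string

    def add(self, text):
        # Stand-in hash: 8 bytes of BLAKE2b, read as an unsigned 64-bit integer.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._by_id[key] = text
        return key

    def lookup(self, key):
        return self._by_id[key]


store = ToyStringStore()
talk_id = store.add("talk")
print(store.lookup(talk_id))           # the string comes back from its ID
print(store.add("talk") == talk_id)    # the same string always gets the same ID
```

Storing each distinct string once and passing integers around keeps comparisons cheap, which is one plausible reason for exposing both forms of the attribute.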


In [45]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [46]:
# Add a custom rule that maps the slang words "Bro" and "Brah" to the lemma "Brother"
ar = nlp.get_pipe("attribute_ruler")
ar.add([[{"TEXT": "Bro"}], [{"TEXT": "Brah"}]], {"LEMMA": "Brother"})
doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")
for token in doc:
    print(f"{token} | {token.lemma_}")

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust
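
Conceptually, the rule above overrides the lemma whenever a token's text matches exactly. The sketch below mimics that with a plain dictionary. It is not spaCy's AttributeRuler; LEMMA_OVERRIDES and apply_overrides are hypothetical names, and the default lemmas in the sample input are placeholders rather than real model output.

```python
# Toy lemma override: a sketch of the idea, NOT spaCy's AttributeRuler.
# LEMMA_OVERRIDES and apply_overrides are hypothetical names for this example.
LEMMA_OVERRIDES = {"Bro": "Brother", "Brah": "Brother"}


def apply_overrides(pairs):
    """Replace the lemma when the token text matches an override exactly (case-sensitive)."""
    return [(text, LEMMA_OVERRIDES.get(text, lemma)) for text, lemma in pairs]


# (token text, default lemma) pairs; the default lemmas here are assumed placeholders.
pairs = [("Bro", "bro"), ("bro", "bro"), ("exhausted", "exhaust")]
for text, lemma in apply_overrides(pairs):
    print(f"{text} | {lemma}")
```

The match is case-sensitive, so lowercase "bro" keeps its default lemma. In the spaCy rule, matching on the LOWER attribute instead of TEXT would presumably cover both spellings.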
