### Stemming

In [None]:
import spacy
import nltk

In [2]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [6]:
words = ['eating','eats','eat','ate','adjustable','rafting','ability','meeting']

for word in words:
    print(word,' | ',stemmer.stem(word))

eating  |  eat
eats  |  eat
eat  |  eat
ate  |  ate
adjustable  |  adjust
rafting  |  raft
ability  |  abil
meeting  |  meet


Here we see that the stemmer does not give the correct base word for 'ate', and it reduces 'ability' to 'abil', which is not a real word. This is because the stemmer has no knowledge of the language. It just applies a fixed set of rules to strip suffixes and arrive at a base form.
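To make the "fixed rule" idea concrete, here is a minimal sketch of a suffix-stripping stemmer. It is not the Porter algorithm (which has many more rules and conditions); the suffix list, the `toy_stem` name and the `min_stem_len` parameter are assumptions made up for this illustration.

```python
# Hypothetical rule list, chosen only to cover the example words above.
SUFFIXES = ("able", "ing", "ity", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the longest matching suffix, if enough of the word remains."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No rule applies, so the word is returned unchanged.
    return word

for word in ['eating', 'eats', 'eat', 'ate', 'ability']:
    print(word, ' | ', toy_stem(word))
```

Because the rules only look at the spelling, 'ate' matches no suffix and stays 'ate', while 'ability' loses 'ity' and becomes 'abil'.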

### Lemmatization

In [8]:
nlp = spacy.load('en_core_web_sm')

In [11]:
doc = nlp('eating eats eat ate adjustable rafting ability meeting better')

for token in doc:
    print(token,' | ',token.lemma_, ' | ',token.lemma)

eating  |  eat  |  9837207709914848172
eats  |  eat  |  9837207709914848172
eat  |  eat  |  9837207709914848172
ate  |  eat  |  9837207709914848172
adjustable  |  adjustable  |  6033511944150694480
rafting  |  raft  |  7154368781129989833
ability  |  ability  |  11565809527369121409
meeting  |  meeting  |  14798207169164081740
better  |  well  |  4525988469032889948


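token.lemma - this gives the hash of the lemma string, i.e. the integer ID under which that string is stored in the language model's vocabulary. token.lemma_ is the readable string for the same lemma, and the hash can be turned back into text with nlp.vocab.strings[token.lemma].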
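The relationship between a string and its integer ID can be sketched in plain Python. This is only an analogy: `ToyStringStore` is a hypothetical class, and its SHA-256 based hash is an assumption for the example, so the IDs it produces will not match the values spaCy prints above.

```python
import hashlib

class ToyStringStore:
    """Hypothetical two-way map between strings and 64-bit integer IDs."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        # Derive a stable 64-bit integer from the string's bytes.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        key = int.from_bytes(digest[:8], "big")
        self._by_hash[key] = text
        return key

    def lookup(self, key):
        # Resolve an integer ID back to the original string.
        return self._by_hash[key]

store = ToyStringStore()
eat_id = store.add("eat")
print(eat_id == store.add("eat"))  # the same string always maps to the same ID
print(store.lookup(eat_id))
```

This is why 'eating', 'eats', 'eat' and 'ate' all show the same number in the output above: they share one lemma string, so they share one ID.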

In [12]:
# Creating custom lemma attributes

In [15]:
doc = nlp("Bro, you wanna go ? Brah, don't say no , i am already exhausted")

for token in doc:
    print(token, " | ",token.lemma_)

Bro  |  bro
,  |  ,
you  |  you
wanna  |  wanna
go  |  go
?  |  ?
Brah  |  Brah
,  |  ,
do  |  do
n't  |  not
say  |  say
no  |  no
,  |  ,
i  |  I
am  |  be
already  |  already
exhausted  |  exhaust


Now suppose we want the base word for 'Bro' and 'Brah' to be shown as 'Brother'. We will have to add this customization to the pipeline, using the attribute_ruler component.

In [19]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [21]:
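# The attribute ruler overrides token attributes for tokens that match a pattern
ar = nlp.get_pipe('attribute_ruler')
# Each inner list is one single-token pattern; both are assigned the lemma 'Brother'
ar.add([[{'TEXT': 'Bro'}], [{'TEXT': 'Brah'}]], {'LEMMA': 'Brother'})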

doc = nlp("Bro, you wanna go ? Brah, don't say no , i am already exhausted")

for token in doc:
    print(token, ' | ',token.lemma_)

Bro  |  Brother
,  |  ,
you  |  you
wanna  |  wanna
go  |  go
?  |  ?
Brah  |  Brother
,  |  ,
do  |  do
n't  |  not
say  |  say
no  |  no
,  |  ,
i  |  I
am  |  be
already  |  already
exhausted  |  exhaust
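One way to picture the effect of the rule added above is an override table that is consulted before the default lemma lookup. The sketch below is a plain-Python analogy, not how spaCy implements its pipeline; `DEFAULT_LEMMAS`, `CUSTOM_LEMMAS` and `toy_lemma` are hypothetical names with made-up data.

```python
# Hypothetical lookup data; real lemmatizers rely on much larger resources.
DEFAULT_LEMMAS = {"ate": "eat", "eats": "eat", "eating": "eat", "am": "be"}
CUSTOM_LEMMAS = {"Bro": "Brother", "Brah": "Brother"}

def toy_lemma(token):
    # Custom rules match the exact token text and take priority.
    if token in CUSTOM_LEMMAS:
        return CUSTOM_LEMMAS[token]
    # Otherwise fall back to the default table, then to the token itself.
    return DEFAULT_LEMMAS.get(token.lower(), token)

for token in ["Bro", "Brah", "bro", "ate", "am"]:
    print(token, " | ", toy_lemma(token))
```

Note that lowercase 'bro' is not rewritten here, which mirrors the 'TEXT' key used in the patterns above: it matches the exact token text, so 'bro' would not match 'Bro'. In spaCy, a case-insensitive pattern would use the 'LOWER' key instead.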


Stemming and lemmatization are two popular techniques for reducing a given word to its base form. Stemming uses a fixed set of rules to strip affixes (the Porter stemmer used here removes suffixes), whereas lemmatization uses knowledge of the language to come up with the correct base word. In this notebook, stemming is demonstrated with NLTK (spaCy doesn't support stemming), whereas the code for lemmatization is written with spaCy.