## Stemming in NLTK

In [20]:
import nltk
import spacy

In [21]:
# Stemmer type 1: Porter stemmer
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [18]:
# Stemmer type 2: Snowball stemmer
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer(language='english')  # Specify the language for the stemmer, for example 'english'

In [22]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]

for word in words:
    print(word, "|", stemmer.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet


As you can see, stemmers do not use any linguistic knowledge in the stemming process; they strip suffixes with fixed rules. This causes problems for some words: 'ate' is not reduced to 'eat', and 'ability' is cut down to 'abil', which is not a real word.
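
To make the "fixed rules, no linguistic knowledge" point concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy written for illustration, not the Porter or Snowball algorithm; the `SUFFIXES` list and the `toy_stem` function are hypothetical names that do not come from NLTK.

```python
# Toy suffix stripper for illustration only; NOT the Porter algorithm.
SUFFIXES = ("ing", "able", "s")  # hypothetical rule list, checked in order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["eating", "eats", "eat", "ate", "adjustable", "rafting", "meeting"]:
    print(word, "|", toy_stem(word))
```

Because the rules only look at the end of the string, an irregular form such as "ate" passes through unchanged: relating it to "eat" requires vocabulary knowledge, which is what a lemmatizer adds.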

## Lemmatization in spaCy

spaCy does not support stemming; it provides lemmatization instead.

In [23]:
import spacy

In [35]:
nlp = spacy.load("en_core_web_sm")

doc1 = nlp("Mando talked for 3 hours although talking isn't his thing he became talktive")
doc2 = nlp("eating eats eat ate adjustable rafting ability meeting better")
for token in doc1:
    print(token, " | ", token.lemma_)

Mando  |  Mando
talked  |  talk
for  |  for
3  |  3
hours  |  hour
although  |  although
talking  |  talk
is  |  be
n't  |  not
his  |  his
thing  |  thing
he  |  he
became  |  become
talktive  |  talktive


In [36]:
for token in doc1:
    print(token, " | ", token.lemma_, "|" ,token.lemma)

# token.lemma_ is the lemma as a string; token.lemma is its integer hash.
# The hash is the unique ID of the lemma string in the model's vocabulary,
# so tokens that share a base word (talked, talking) share the same hash.

Mando  |  Mando | 7837215228004622142
talked  |  talk | 13939146775466599234
for  |  for | 16037325823156266367
3  |  3 | 602994839685422785
hours  |  hour | 9748623380567160636
although  |  although | 343236316598008647
talking  |  talk | 13939146775466599234
is  |  be | 10382539506755952630
n't  |  not | 447765159362469301
his  |  his | 2661093235354845946
thing  |  thing | 2473243759842082748
he  |  he | 1655312771067108281
became  |  become | 12558846041070486771
talktive  |  talktive | 11965990922604149741
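
The relationship between `token.lemma_` and `token.lemma` can be checked directly against the vocabulary's string store. The sketch below assumes a blank English pipeline (`spacy.blank("en")`) is sufficient, since only the string store is used and no trained model is needed; the variable names are illustrative.

```python
import spacy

# Assumption: a blank pipeline is enough here, only its StringStore is used.
nlp = spacy.blank("en")
strings = nlp.vocab.strings

lemma_hash = strings.add("talk")      # string -> 64-bit integer hash
print(lemma_hash)
print(strings[lemma_hash])            # hash -> string
print(strings["talk"] == lemma_hash)  # looking up the string returns the same hash
```

Since the hash is derived from the string itself, the value printed for "talk" should be the same one shown for `talked` and `talking` in the output above.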


In [37]:
for token in doc2:
    print(token, " | ", token.lemma_)

eating  |  eat
eats  |  eat
eat  |  eat
ate  |  eat
adjustable  |  adjustable
rafting  |  raft
ability  |  ability
meeting  |  meeting
better  |  well


In [38]:
for token in doc2:
    print(token, " | ", token.lemma_, "|" ,token.lemma)

# token.lemma is the hash that uniquely identifies each lemma string.

eating  |  eat | 9837207709914848172
eats  |  eat | 9837207709914848172
eat  |  eat | 9837207709914848172
ate  |  eat | 9837207709914848172
adjustable  |  adjustable | 6033511944150694480
rafting  |  raft | 7154368781129989833
ability  |  ability | 11565809527369121409
meeting  |  meeting | 14798207169164081740
better  |  well | 4525988469032889948


## Customizing the lemmatizer

In [25]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [39]:
doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")

for token in doc:
    print(token.text, "|", token.lemma_)

Bro | bro
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brah
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


The default language model does not understand slang: 'Bro' is lemmatized to 'bro' and 'Brah' is left unchanged. However, we can customize this behavior.

The `attribute_ruler` component in this pipeline can assign attributes, such as the lemma, to particular tokens.

In [40]:
doc[0]

Bro

In [48]:
doc[0].lemma_

'bro'

In [55]:
doc[6]

Brah

In [47]:
doc[6].lemma_

'Brah'

In [49]:
ar = nlp.get_pipe('attribute_ruler')
# get_pipe returns the 'attribute_ruler' component from the pipeline,
# which we can then customize by adding our own rule.

# Tokens whose text is exactly "Bro" or "Brah" get the lemma "Brother".
ar.add([[{"TEXT": "Bro"}], [{"TEXT": "Brah"}]], {"LEMMA": "Brother"})

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")

for token in doc:
    print(token.text, "|", token.lemma_)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


In [53]:
doc[0].lemma_ , doc[6].lemma_

('Brother', 'Brother')
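
The rule above matches on `TEXT`, so it only fires for the exact spellings "Bro" and "Brah". A possible variant, sketched here under the assumption that case-insensitive matching is wanted, uses the `LOWER` token attribute instead. To keep the example independent of a trained model it runs on a blank English pipeline with only an `attribute_ruler`, so all other tokens have no lemma; the sentence and variable names are made up for illustration.

```python
import spacy

# Assumption: a blank pipeline with only an attribute_ruler is enough to
# demonstrate the rule; it has no lemmatizer, so only the rule sets lemmas.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# "LOWER" compares against the lowercased token text, so any casing matches.
patterns = [[{"LOWER": "bro"}], [{"LOWER": "brah"}]]
ruler.add(patterns=patterns, attrs={"LEMMA": "Brother"})

doc = nlp("bro, you coming? BRAH, say something! Bro?")
matched = [(t.text, t.lemma_) for t in doc if t.lower_ in {"bro", "brah"}]
print(matched)
```

With the full `en_core_web_sm` pipeline the same `ruler.add(...)` call can be made on the component returned by `nlp.get_pipe('attribute_ruler')`, as in the cell above.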