## Stemming
- When we search Google for an inflected word such as "talking", some of the results contain the word "talk".
- **Definition**: removing a suffix to map a word to its base form
- Simple rules, such as removing -ing or -able, recover the base word in many cases.
- However, simple rules do not work in all cases: for example, "ability" should not be reduced to "abil".
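
The rule-based idea above can be sketched in a few lines of plain Python. This is only an illustration, not how `PorterStemmer` is implemented: the suffix list `SUFFIXES`, the helper `naive_stem`, and the `min_stem_len` parameter are all hypothetical names chosen for this example.

```python
# Hypothetical suffix list for illustration only; real stemmers such as
# Porter apply many more rules and conditions.
SUFFIXES = ("able", "ing", "ity", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for word in ["talking", "adjustable", "eats", "ability", "running"]:
    print(word, "|", naive_stem(word))
```

With this rule set, "talking" and "adjustable" reduce to "talk" and "adjust", but "ability" becomes "abil" and "running" becomes "runn", which shows why blind suffix removal is not enough.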

In [1]:
import nltk
import spacy

### NLTK

In [2]:
from nltk.stem import PorterStemmer

In [3]:
stemmer = PorterStemmer()

In [4]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]

for word in words:
    print(word, "|", stemmer.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet


Notice that ability was mapped to abil, which is not a real word, and that the irregular form ate was left unchanged.

## Lemmatization
- Uses linguistic knowledge to derive the base word (the lemma), such as ate to eat or drove to drive.
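
As a rough picture of what "linguistic knowledge" means here, the sketch below looks up irregular forms in a small hand-written table. It is an assumption-laden toy, not spaCy's lemmatizer: the table `IRREGULAR_LEMMAS` and the function `lookup_lemma` are hypothetical, and a real lemmatizer also uses part-of-speech information and a much larger vocabulary.

```python
# Hypothetical, hand-written lookup table for illustration only.
IRREGULAR_LEMMAS = {
    "ate": "eat",
    "eaten": "eat",
    "drove": "drive",
    "driven": "drive",
}


def lookup_lemma(word):
    """Return the lemma from the table, or the lowercased word if it is unknown."""
    key = word.lower()
    return IRREGULAR_LEMMAS.get(key, key)


for word in ["ate", "Drove", "talk"]:
    print(word, "|", lookup_lemma(word))
```

Unlike suffix stripping, a lookup can map "ate" to "eat", but it only knows the words that were put into the table.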

In [5]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("eating eats eat ate adjustable rafting ability meeting better")
for token in doc:
    print(token, " | ", token.lemma_)

eating  |  eat
eats  |  eat
eat  |  eat
ate  |  eat
adjustable  |  adjustable
rafting  |  raft
ability  |  ability
meeting  |  meet
better  |  well


Notice that better was mapped to well, ate was mapped to eat, and adjustable and ability were left unchanged.

In summary, the results of stemming and lemmatization depend on the stemmer or model that we use. If we change it, the results may change.
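
One way to see this dependence is to run the same words through several NLTK stemmers. The sketch below assumes NLTK is installed and uses `PorterStemmer`, `SnowballStemmer("english")`, and `LancasterStemmer`; the variable names `stemmers` and `results` are just for this example, and no outputs are listed because they differ between algorithms.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Three stemming algorithms shipped with NLTK.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

words = ["eating", "adjustable", "ability", "meeting"]

# Stem every word with every algorithm so the rows can be compared side by side.
results = {name: [s.stem(w) for w in words] for name, s in stemmers.items()}

for name, stems in results.items():
    print(name, "|", stems)
```

Comparing the printed rows shows where the algorithms agree and where they do not.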

### Customizing lemmatizer
- Useful for uncommon words, such as slang, that the model does not lemmatize the way we want.

In [8]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [6]:
doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")
for token in doc:
    print(token.text, "|", token.lemma_)

Bro | Bro
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brah
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


We want Bro and Brah to be lemmatized as Brother.

In [7]:
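# Get the attribute_ruler component and map the exact tokens "Bro" and "Brah" to the lemma "Brother"
ar = nlp.get_pipe("attribute_ruler")
ar.add([[{"TEXT": "Bro"}], [{"TEXT": "Brah"}]], {"LEMMA": "Brother"})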

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")
for token in doc:
    print(token.text, "|", token.lemma_)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust
