## Stemming & Lemmatization

Stemming and lemmatization are natural language processing techniques used to reduce words to their base or root form.

- **Stemming**: It involves removing suffixes from words to obtain their root form. For example, "running" becomes "run". Stemming may produce non-existent words as it uses heuristic rules.

- **Lemmatization**: It reduces words to their dictionary form (lemma) by considering the context and meaning. For example, "running" becomes "run", and "better" becomes "good". Lemmatization is more accurate than stemming but also more computationally intensive.


Stemming is cruder than lemmatization: it strips suffixes by rule without consulting a vocabulary.
For example, "ability" becomes "abil" under stemming, while lemmatization leaves it as "ability".
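
To make the contrast concrete, here is a minimal pure-Python sketch. It is a toy illustration, not the Porter algorithm or spaCy's lemmatizer: `SUFFIXES`, `LEMMAS`, `toy_stem` and `toy_lemmatize` are hypothetical names invented for this example, and the rule set and lookup table are deliberately tiny.

```python
# Toy illustration only: real stemmers (e.g. Porter) apply many more rules.
SUFFIXES = ("ing", "s")  # hypothetical, deliberately tiny rule set

def toy_stem(word):
    """Strip the first matching suffix, with no knowledge of the vocabulary."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lemma dictionary.
LEMMAS = {"eating": "eat", "eats": "eat", "ate": "eat", "better": "good"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMAS.get(word, word)

for w in ["eating", "eats", "ate", "better"]:
    print(w, " | ", toy_stem(w), " | ", toy_lemmatize(w))
```

The rule-based function handles regular suffixes but cannot map "ate" to "eat", because no suffix rule applies. The lookup-based function can, at the cost of needing a vocabulary.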

spaCy does not support stemming; it only provides lemmatization.
NLTK supports both.

In [1]:
import nltk
import spacy

In [2]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

In [3]:
words = ['eating', 'eats', 'eat', 'ate', 'adjustable', 'rafting', 'ability', 'meeting']

for word in words:
    print(word, " | ", stemmer.stem(word))

eating  |  eat
eats  |  eat
eat  |  eat
ate  |  ate
adjustable  |  adjust
rafting  |  raft
ability  |  abil
meeting  |  meet


Stemming is fast and good enough in many cases, but it is not always accurate. For example:

 ate -> ate (the irregular past tense of "eat" is not recognized)
 ability -> abil (not a real word)

Neither result is a correct base form.


In [5]:
nlp = spacy.load('en_core_web_sm') # load a trained model for en

doc = nlp("eating eats eat ate adjustable rafting, ability meeting better")

for token in doc:
    print(token, " | ", token.lemma_, " | ", token.lemma)

eating  |  eat  |  9837207709914848172
eats  |  eat  |  9837207709914848172
eat  |  eat  |  9837207709914848172
ate  |  eat  |  9837207709914848172
adjustable  |  adjustable  |  6033511944150694480
rafting  |  rafting  |  1196139325854331
,  |  ,  |  2593208677638477497
ability  |  ability  |  11565809527369121409
meeting  |  meet  |  6880656908171229526
better  |  well  |  4525988469032889948


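Here "ate" becomes "eat", "ability" stays "ability", and "better" becomes "well".

token.lemma_ is the lemma as a string. token.lemma is the integer hash that identifies that string in the pipeline's vocabulary.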
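
The relationship between the hash and the string can be sketched with spaCy's `StringStore`, which maps strings to 64-bit hashes and back. This example assumes spaCy is installed and builds a standalone store for illustration. In a loaded pipeline the same mapping should be reachable through `nlp.vocab.strings`.

```python
from spacy.strings import StringStore

# A standalone string store, created only for this illustration.
strings = StringStore()

# add() stores the string and returns its integer hash.
lemma_hash = strings.add("eat")
print(lemma_hash)

# Indexing works in both directions: hash -> string and string -> hash.
print(strings[lemma_hash])
print(strings["eat"] == lemma_hash)
```

Because the hash is computed from the string itself, the value printed for "eat" is expected to match the one shown in the token.lemma column above.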

In [None]:
nlp.pipe_names # lemmatizer is there in the pipeline

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

### Customize the lemma

In [10]:
ar = nlp.get_pipe('attribute_ruler')

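# Each ar.add() call takes a single attrs dict, so use one call per lemma.
# (A dict literal with two 'LEMMA' keys silently keeps only the last value.)
ar.add([[{'TEXT': 'bro'}]], {'LEMMA': 'Brother'})
ar.add([[{'TEXT': "Whats'up"}]], {'LEMMA': 'Whats Up'})

doc = nlp("Hi bro! Whats'up?")

for token in doc:
    print(token, " | ", token.lemma_)

Hi  |  hi
bro  |  Brother
!  |  !
Whats'up  |  Whats Up
?  |  ?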
