# Stemming
**Stemming** is a fundamental technique in natural language processing (NLP) that aims to reduce words to their root or base form. This process involves removing affixes from words to normalize them and improve text analysis and retrieval tasks. Stemming algorithms play a crucial role in various NLP applications:
* information retrieval
* sentiment analysis
* text mining.

There are quite a few stemming algorithms. The ones we'll be checking out are:
1. Porter Stemmer
2. Snowball Stemmer
3. Lancaster Stemmer
4. RegexpStemmer

# Porter Stemmer

In [1]:
from nltk.stem import PorterStemmer

In [4]:
words = ['run', 'runner', 'running', 'swim', 'swimmer', 'swimming', 'eat', 'eaten', 'eating', 'bouncer', 'bounce', 'bouncing']

In [5]:
porter = PorterStemmer()  # create the stemmer once and reuse it for every word
stemmed_words = [porter.stem(word) for word in words]
for original, stemmed in zip(words, stemmed_words):
  print(f'{original} -----> {stemmed}')

run -----> run
runner -----> runner
running -----> run
swim -----> swim
swimmer -----> swimmer
swimming -----> swim
eat -----> eat
eaten -----> eaten
eating -----> eat
bouncer -----> bouncer
bounce -----> bounc
bouncing -----> bounc


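Well, we can see that several words are successfully stemmed down to their roots (`running` to `run`, `eating` to `eat`). However, many aren't: `runner`, `swimmer`, `eaten` and `bouncer` come back unchanged, and `bounce` becomes the non-word `bounc`. This is one problem with the Porter stemmer: it gives no guarantee that every word will be reduced to its root.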

# Snowball Stemmer

In [6]:
from nltk.stem import SnowballStemmer

In [7]:
snowball = SnowballStemmer('english')  # create the stemmer once and reuse it
stemmed_words = [snowball.stem(word) for word in words]
for original, stemmed in zip(words, stemmed_words):
  print(f'{original} -----> {stemmed}')

run -----> run
runner -----> runner
running -----> run
swim -----> swim
swimmer -----> swimmer
swimming -----> swim
eat -----> eat
eaten -----> eaten
eating -----> eat
bouncer -----> bouncer
bounce -----> bounc
bouncing -----> bounc


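We can see the same problem with the Snowball stemmer. On this word list its output is identical to the Porter stemmer's. Snowball's English stemmer is a revised version of Porter's (often called Porter2), so the differences between the two only show up on other words.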

# Lancaster Stemmer

In [8]:
from nltk.stem import LancasterStemmer

In [9]:
lancaster = LancasterStemmer()  # create the stemmer once and reuse it
stemmed_words = [lancaster.stem(word) for word in words]
for original, stemmed in zip(words, stemmed_words):
  print(f'{original} -----> {stemmed}')

run -----> run
runner -----> run
running -----> run
swim -----> swim
swimmer -----> swim
swimming -----> swim
eat -----> eat
eaten -----> eat
eating -----> eat
bouncer -----> bount
bounce -----> bount
bouncing -----> bount


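Damn! Lancaster performed better than the other two on this list: it reduces `runner`, `swimmer` and `eaten` to their roots. However, it still struggles with the 'bounce' family, which all end up as `bount`.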

# RegexpStemmer
A stemmer that uses regular expressions to identify morphological affixes. Any substrings that match the regular expressions will be removed.

In [10]:
from nltk.stem import RegexpStemmer

In [12]:
regexp = RegexpStemmer('ing$|s$|e$|able$')  # suffix patterns to strip
stemmed_words = [regexp.stem(word) for word in words]
for original, stemmed in zip(words, stemmed_words):
  print(f'{original} -----> {stemmed}')

run -----> run
runner -----> runner
running -----> runn
swim -----> swim
swimmer -----> swimmer
swimming -----> swimm
eat -----> eat
eaten -----> eaten
eating -----> eat
bouncer -----> bouncer
bounce -----> bounc
bouncing -----> bounc


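As you can see, the RegexpStemmer performs the worst here: it only strips the exact patterns you give it, so `running` becomes `runn` and `swimming` becomes `swimm`. That's why it is generally not recommended as a general-purpose stemmer.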
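
To see where `runn` and `swimm` come from, here is a minimal sketch of what regex-based suffix stripping amounts to, using only Python's built-in `re` module. The helper `regexp_stem` and its `min_length` parameter are hypothetical names made up for illustration; this is not NLTK's implementation, just the same idea written out by hand.

```python
import re

# Same pattern as in the cell above: each alternative is anchored to the end of the word.
SUFFIXES = re.compile(r'ing$|s$|e$|able$')

def regexp_stem(word, min_length=0):
    """Delete whatever matches SUFFIXES; leave words shorter than min_length alone."""
    if len(word) < min_length:
        return word
    # The pattern knows nothing about doubled consonants, so 'running' keeps both n's.
    return SUFFIXES.sub('', word)

for w in ['running', 'swimming', 'eating', 'bounce', 'runner']:
    print(f'{w} -----> {regexp_stem(w)}')
```

The pattern has no alternative for `er`, so `runner` passes through untouched, and nothing in it knows that `runn` should collapse to `run`. Any smarter behaviour has to be written into the pattern by hand.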

# Lemmatization
**Lemmatization** is a more advanced and accurate form of text normalization compared to stemming. Instead of simply chopping off word endings, lemmatization considers the word’s meaning and part of speech to reduce it to its base or dictionary form (the *lemma*), ensuring the word retains its meaning.
For example it converts:
* running --> run
* better --> good (when treated as an adjective)

A crude stemmer, by contrast, can produce non-words: the RegexpStemmer above turned `running` into `runn`.

NLTK provides a lemmatizer class called `WordNetLemmatizer`, which uses the WordNet lexical database to find the correct base forms of words. You can specify the part of speech (POS) to get more accurate results.

The parts of speech (POS) are:
1. `n` - nouns
2. `v` - verbs
3. `a` - adjectives
4. `r` - adverbs
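
In practice the POS often comes from a tagger that emits Penn Treebank style tags (`NN`, `VBG`, `JJ`, `RB`, ...) rather than the single letters above. A minimal sketch of a mapping helper is shown here. `to_wordnet_pos` is a hypothetical name, and falling back to `n` for every other tag is an assumption chosen to mirror the lemmatizer's default.

```python
def to_wordnet_pos(tag):
    """Map a Penn Treebank style tag to one of the letters n, v, a, r."""
    if tag.startswith('J'):   # JJ, JJR, JJS: adjectives
        return 'a'
    if tag.startswith('V'):   # VB, VBD, VBG, ...: verbs
        return 'v'
    if tag.startswith('RB'):  # RB, RBR, RBS: adverbs
        return 'r'
    return 'n'                # assumption: treat everything else as a noun

for tag in ['NN', 'VBG', 'JJ', 'RB', 'DT']:
    print(f'{tag} -----> {to_wordnet_pos(tag)}')
```

The returned letter can then be passed as the POS argument when lemmatizing a tagged word.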

In [16]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

Let us see an example:

In [17]:
WordNetLemmatizer().lemmatize('going')

'going'

🤔 Well, according to everything I have said so far, 'going' should have been converted to 'go', right?

Not quite. The default POS in `lemmatize` is `n` (noun), but 'going' is an action word and therefore a verb. So let's set the POS to `v` and see the result.

In [18]:
WordNetLemmatizer().lemmatize('going', pos = 'v')

'go'

Yep, as expected!

In [19]:
lemmatizer = WordNetLemmatizer()  # create the lemmatizer once and reuse it
lemmatized_words = [lemmatizer.lemmatize(word, 'v') for word in words]
for original, lemmatized in zip(words, lemmatized_words):
  print(f'{original} -----> {lemmatized}')

run -----> run
runner -----> runner
running -----> run
swim -----> swim
swimmer -----> swimmer
swimming -----> swim
eat -----> eat
eaten -----> eat
eating -----> eat
bouncer -----> bouncer
bounce -----> bounce
bouncing -----> bounce


Whoa, it correctly lemmatized almost all the words here. The exceptions are `runner`, `swimmer` and `bouncer`, because those words are nouns, not verbs.

In [28]:
nouns = ['runner', 'swimmer', 'bouncer']
# These words are nouns, so pass pos='n' this time
lemmatized_words = [WordNetLemmatizer().lemmatize(word, 'n') for word in nouns]
for original, lemmatized in zip(nouns, lemmatized_words):
  print(f'{original} -----> {lemmatized}')

runner -----> runner
swimmer -----> swimmer
bouncer -----> bouncer


😯 They still haven't been reduced to `run`, `swim` and `bounce`!

The reason? `WordNetLemmatizer` didn't convert `runner`, `swimmer`, and `bouncer` to their root verbs because WordNet lists them as nouns in their own right, not as inflected forms of a verb. A lemmatizer only undoes inflection (such as `running` to `run`). It relies on WordNet's predefined lexicon and doesn't strip derivational suffixes like `-er`.
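
The lookup behaviour described above can be sketched with a toy lemmatizer. Everything here is hypothetical and hand-made for illustration (the tiny `LEXICON`, the `EXCEPTIONS` table, the suffix `RULES`, and the name `toy_lemmatize`). It is only loosely modelled on how WordNet lookup is usually described (irregular forms first, then suffix rules whose result must exist in the lexicon) and is not NLTK's code.

```python
# Hypothetical mini-lexicon: the only base forms this toy lemmatizer knows about.
LEXICON = {
    'v': {'run', 'swim', 'eat', 'go', 'bounce'},
    'n': {'run', 'runner', 'swimmer', 'bouncer'},
}
# Irregular forms are looked up directly.
EXCEPTIONS = {
    'v': {'running': 'run', 'swimming': 'swim', 'eaten': 'eat'},
    'n': {},
}
# (suffix to remove, text to add back) pairs, tried in order for each POS.
RULES = {
    'v': [('ing', 'e'), ('ing', ''), ('s', '')],
    'n': [('s', '')],
}

def toy_lemmatize(word, pos='n'):
    """Return a base form only if the lexicon for this POS confirms it."""
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    for suffix, replacement in RULES[pos]:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in LEXICON[pos]:
                return candidate
    return word  # nothing matched: hand the word back unchanged

for word, pos in [('going', 'n'), ('going', 'v'), ('bouncing', 'v'), ('runner', 'n'), ('runner', 'v')]:
    print(f'{word} ({pos}) -----> {toy_lemmatize(word, pos)}')
```

There is no rule that removes `-er`, and `runner` is already a base form in the noun lexicon, so it comes back unchanged under either POS. Getting from `runner` to `run` would need a stemmer or a derivational lookup, not a lemmatizer.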

