15/04/2025

🌱 What is Stemming?
Stemming is the process of cutting off prefixes or suffixes from words to get their root form (called a stem).

📌 The goal is to reduce different forms of a word to the same base form, e.g.:

Original Word | Stemmed Word
playing | play
played | play
plays | play
happily | happi
running | run

* Notice how "happily" becomes "happi" — it's not a real word, but it's a stem.
* Stemming doesn’t care about correct grammar — it just cuts things off quickly.
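
To make the "cutting off" idea concrete, here is a minimal sketch of a suffix stripper in plain Python. It is not the Porter algorithm; the `SUFFIXES` list and the `naive_stem` helper are made up for illustration and only cover the endings in the table above.

```python
# Toy suffix stripper: illustrative only, NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "ly", "s")  # hypothetical rule list, checked in order


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if at least min_stem_len letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["playing", "played", "plays", "happily", "running"]:
    print(w + "---->" + naive_stem(w))
```

This toy version turns "running" into "runn" rather than "run". Real stemmers add extra rules (such as undoubling a final consonant) on top of plain suffix removal.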

1) PorterStemmer

In [3]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
from nltk.stem import PorterStemmer

# Sample words covering several inflections of the same roots
words = ["eating", "eaten", "eats", "written", "writes", "writing",
         "programming", "programs", "history", "finally", "finalized"]
stemming = PorterStemmer()
for word in words:
    print(word + "---->" + stemming.stem(word))

eating---->eat
eaten---->eaten
eats---->eat
written---->written
writes---->write
writing---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


Stemming doesn't guarantee meaningful stems, e.g. "history" becomes "histori". Irregular forms such as "eaten" and "written" are left unchanged.

2) RegexpStemmer
* RegexpStemmer is a stemmer that removes patterns (like suffixes) from words using regular expressions (regex).
* Unlike other stemmers (like Porter), you define your own rules using regex.

In [12]:
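from nltk.stem import RegexpStemmer
# Strip a trailing "ing", "s", "e" or "able". Whitespace inside a regex is literal,
# so the alternatives are written without spaces. Words shorter than min=4 are left as-is.
reg_Stemmer = RegexpStemmer("ing$|s$|e$|able$", min=4)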

In [13]:
reg_Stemmer.stem("eating")

'eat'
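
The behaviour of a regex-based stemmer can be approximated with the standard `re` module. This is a rough sketch, not NLTK's implementation; `PATTERN` and `regex_stem` are illustrative names, and the minimum-length check is an assumption modelled on the `min` argument used above.

```python
import re

# Rough stand-in for a regex-based stemmer; names here are illustrative.
PATTERN = re.compile(r"ing$|s$|e$|able$")


def regex_stem(word, min_len=4):
    """Leave words shorter than min_len alone; otherwise delete pattern matches."""
    if len(word) < min_len:
        return word
    return PATTERN.sub("", word)


for w in ["eating", "cars", "was", "readable", "ingredient"]:
    print(w + "---->" + regex_stem(w))
```

Because every alternative ends with `$`, only the end of the word is touched: "ingredient" keeps its leading "ing", and "was" is skipped because it is shorter than the minimum length.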

3) SnowballStemmer
* The Snowball stemmer for English (also known as the Porter2 stemmer) is an improved version of the original Porter stemmer.
* It handles some suffixes better than Porter (for example, it reduces "fairly" to "fair" rather than "fairli") and supports multiple languages.

* It’s built into NLTK and is commonly used for English text.
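
As a quick way to see the multi-language support, the sketch below lists the language names NLTK's `SnowballStemmer` accepts and builds a stemmer for one of them. It assumes NLTK is installed; the German word is just an arbitrary sample, and the output is not shown here because it depends on the installed version.

```python
from nltk.stem import SnowballStemmer

# Language names accepted by NLTK's Snowball implementation.
print(SnowballStemmer.languages)

# One stemmer instance per language; "german" is one of the listed names.
german_stemmer = SnowballStemmer("german")
print(german_stemmer.stem("laufen"))
```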

In [14]:
from nltk.stem import SnowballStemmer
Snowballstemmer = SnowballStemmer("english")
for word in words:
    print(word + "---->" + Snowballstemmer.stem(word))

eating---->eat
eaten---->eaten
eats---->eat
written---->written
writes---->write
writing---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [15]:
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [17]:
Snowballstemmer.stem("fairly"),Snowballstemmer.stem("sportingly")

('fair', 'sport')

🌿 What is Lemmatization?
* Lemmatization is the process of reducing a word to its base or dictionary form, called the lemma, and the result is always a real word.

* Unlike stemming, lemmatization uses grammar rules and a vocabulary to find the correct form of the word.
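
The idea of "vocabulary plus grammar" can be sketched with a tiny lookup table keyed by word and part of speech. This is a toy illustration, not how a real lemmatizer is built; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical, and a library lemmatizer such as NLTK's `WordNetLemmatizer` relies on a full dictionary instead of a hand-written one.

```python
# Toy lemmatizer: a hand-written lookup table stands in for a real vocabulary.
LEMMA_TABLE = {  # hypothetical entries, keyed by (word, part of speech)
    ("eaten", "v"): "eat",
    ("written", "v"): "write",
    ("went", "v"): "go",
    ("better", "a"): "good",
    ("histories", "n"): "history",
}


def toy_lemmatize(word, pos="n"):
    """Return the dictionary form if the (word, pos) pair is known, else the word."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("eaten", pos="v"))   # eat
print(toy_lemmatize("better", pos="a"))  # good
print(toy_lemmatize("better", pos="n"))  # better
```

The part of speech matters: "better" maps to "good" only when treated as an adjective. A stemmer has no such lookup, which is why it left "eaten" and "written" untouched.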