## 🌿 Stemming vs Lemmatization

Both **stemming** and **lemmatization** are text normalization techniques that reduce inflected or derived words to a common base form, so that variants such as `plays` and `playing` are treated as the same token.

---

### 🔹 Stemming

- **Reduces words to their stem/root** by chopping off suffixes.
- Often produces **non-dictionary words**.
- Uses **heuristic rules** (e.g., remove "ing", "ed", "s").
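- Faster than lemmatization, but less accurate: related words can end up with different stems, and unrelated words can share one.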

**Example:**  
`playing`, `played`, `plays` → `play`

But:  
`studies` → `studi` ❌ (not a real word)

**Popular Algorithms:**

- PorterStemmer
- SnowballStemmer
- LancasterStemmer
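
To make the idea of heuristic suffix rules concrete, here is a minimal sketch of a toy stemmer. It is not the Porter algorithm: the rule list `SUFFIX_RULES`, the function `toy_stem`, and the `min_stem_len` parameter are hypothetical names chosen for illustration, and the rules cover only the examples above.

```python
# Toy suffix-stripping stemmer: a hypothetical illustration, NOT the Porter algorithm.
# Rules are tried in order; only the first matching suffix is rewritten.
SUFFIX_RULES = [
    ("ies", "i"),  # studies -> studi
    ("ing", ""),   # playing -> play
    ("ed", ""),    # played  -> play
    ("s", ""),     # plays   -> play
]


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least `min_stem_len` characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["playing", "played", "plays", "studies", "sing"]:
    print(f"{w} -> {toy_stem(w)}")
```

The length guard is what keeps a short word like `sing` intact. Real stemmers use more elaborate conditions for the same purpose, but the principle is the same: pattern rules with no dictionary lookup.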

---

### 🔹 Lemmatization

- Reduces words to their **base/dictionary form (lemma)**.
- **Considers part of speech (POS)** and word meaning.
- **More accurate** than stemming, but slower, since each word has to be looked up in a dictionary.

**Example:**  
`studies`, `studying` → `study`  
`better` → `good` ✅ (when tagged as an adjective)

**Requires:** a proper dictionary (e.g., WordNet).
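
The role of the POS tag can be sketched with a tiny lookup table. This is a toy under stated assumptions, not how WordNet is implemented: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the table holds only a handful of hand-picked entries. The default noun tag mirrors the default `pos="n"` of NLTK's `WordNetLemmatizer.lemmatize`.

```python
# Hypothetical mini-dictionary keyed by (word, POS tag); a real lemmatizer
# consults a full lexical resource such as WordNet instead.
# POS tags used here: "n" = noun, "v" = verb, "a" = adjective.
TOY_LEMMAS = {
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("studying", "v"): "study",
    ("better", "a"): "good",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("better", pos="a"))   # good
print(toy_lemmatize("better", pos="n"))   # better (no noun entry, so unchanged)
print(toy_lemmatize("meeting", pos="v"))  # meet
print(toy_lemmatize("meeting", pos="n"))  # meeting
```

The same surface form maps to different lemmas depending on its tag, which is why a lemmatizer given the wrong POS (or only the default) can silently return the word unchanged.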

---

## ✅ Comparison Table

| Feature         | Stemming                  | Lemmatization           |
| --------------- | ------------------------- | ----------------------- |
| Output          | Word stem (may not exist) | Dictionary word (lemma) |
| Speed           | Faster                    | Slower                  |
| Accuracy        | Lower                     | Higher                  |
| Uses POS info?  | ❌ No                     | ✅ Yes                  |
| Library Support | Porter, Snowball, etc.    | WordNetLemmatizer       |

> ✅ **Tip:** Use **lemmatization** for tasks needing linguistic accuracy (e.g., sentiment, NER). Use **stemming** when speed matters and slight errors are acceptable.


In [1]:
words = [
    "eating",
    "eats",
    "eaten",
    "writing",
    "writes",
    "programming",
    "programs",
    "history",
    "finally",
    "finalized",
]

### PorterStemmer


In [2]:
from nltk.stem import PorterStemmer

In [3]:
stemming = PorterStemmer()

In [4]:
for word in words:
    print(word + " -> " + stemming.stem(word))

eating -> eat
eats -> eat
eaten -> eaten
writing -> write
writes -> write
programming -> program
programs -> program
history -> histori
finally -> final
finalized -> final


In [5]:
stemming.stem("congratulations")

'congratul'

In [6]:
stemming.stem("sitting")

'sit'

### RegexpStemmer class

NLTK's `RegexpStemmer` class builds a stemmer from a single regular expression: it removes every substring that matches the expression. Anchoring an alternative with `$` restricts it to suffixes (`^` would restrict it to prefixes), and the `min` argument leaves words shorter than that length unchanged.


In [7]:
from nltk.stem import RegexpStemmer

In [8]:
reg_stemmer = RegexpStemmer("ing$|s$|e$|able$", min=4)

In [9]:
reg_stemmer.stem("eating")

'eat'

In [10]:
reg_stemmer.stem("ingeating")

'ingeat'
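
The two results above follow from how the pattern is anchored. As a rough sketch of the behaviour (an approximation written with the standard `re` module, not NLTK's actual source; `regexp_stem` and `min_len` are hypothetical names):

```python
import re


def regexp_stem(word: str, pattern: str, min_len: int = 0) -> str:
    """Delete every match of `pattern` from `word`; skip words shorter than `min_len`."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)


# Same pattern as above: each alternative is anchored to the end of the word.
PATTERN = r"ing$|s$|e$|able$"

for w in ["eating", "ingeating", "readable", "was"]:
    print(f"{w} -> {regexp_stem(w, PATTERN, min_len=4)}")
```

Because every alternative ends in `$`, the leading `ing` in `ingeating` is never touched, and `was` is returned as-is because it is shorter than the minimum length.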

### Snowball Stemmer

The Snowball stemmer for English is also known as the Porter2 algorithm. It is a revised version of the Porter stemmer that fixes several of the original's known issues.


In [11]:
from nltk.stem import SnowballStemmer

In [12]:
snowballsstemmer = SnowballStemmer("english")

In [13]:
for word in words:
    print(word + " -> " + snowballsstemmer.stem(word))

eating -> eat
eats -> eat
eaten -> eaten
writing -> write
writes -> write
programming -> program
programs -> program
history -> histori
finally -> final
finalized -> final

Porter and Snowball produce identical stems for every word in this list. They diverge on other inputs, for example adverbs ending in `-ly`:


In [14]:
stemming.stem("fairly"), stemming.stem("sportingly")

('fairli', 'sportingli')

In [15]:
snowballsstemmer.stem("fairly"), snowballsstemmer.stem("sportingly")

('fair', 'sport')
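
To compare stemmers side by side, including the `LancasterStemmer` listed earlier but not yet demonstrated, a small helper can run each one over the same words. This is a minimal sketch: `compare_stemmers` is a hypothetical helper name, and the word list is only an example. No expected output is shown for Lancaster, since its rules differ from Porter's and are best inspected by running the cell.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer


def compare_stemmers(words):
    """Return {word: {stemmer_name: stem}} for three NLTK stemmers."""
    stemmers = {
        "porter": PorterStemmer(),
        "snowball": SnowballStemmer("english"),
        "lancaster": LancasterStemmer(),
    }
    return {
        word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
        for word in words
    }


results = compare_stemmers(["fairly", "sportingly", "history"])
for word, stems in results.items():
    print(word, stems)
```

Printing the stems together makes it easier to judge which stemmer is too aggressive or too conservative for a given vocabulary before committing to one.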