
# 🔠 Most Common Stemmers Used in Industry

Stemming is a text normalization technique that reduces words to a base or root form, usually by stripping suffixes. It is still used in search engines and simple NLP pipelines, although **lemmatization** is often preferred in production systems where accuracy matters more than speed.

---

## 🔝 Top Stemmers in Use

### 1. ✅ Porter Stemmer
- **Most famous**, created by Martin Porter (1980).
- Example: `"running"` → `"run"`, `"flies"` → `"fli"`
- **Pros**: Fast, simple, widely supported.
- **Cons**: Often too aggressive, not context-aware.
- **Used in**: Lucene, classic search engines.

---

### 2. ✅ Snowball Stemmer (Porter2)
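- Porter2 is Martin Porter's revised version of his original English stemmer, written in the Snowball stemming language.
- Snowball also provides stemmers for **multiple languages** (English, Spanish, etc.).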
- Example: `"organization"` → `"organ"`
- **Pros**: Cleaner rules, more consistent.
- **Used in**: Elasticsearch, Solr, NLTK.
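
To make the multi-language point concrete, here is a minimal sketch using NLTK's `SnowballStemmer`, which exposes a `languages` tuple of the language names it accepts. The Spanish sample words are assumptions chosen for illustration, so no expected stems are shown for them.

```python
from nltk.stem import SnowballStemmer

# Language names accepted by NLTK's Snowball implementation.
supported = SnowballStemmer.languages
print(supported)

# One stemmer instance per language; the constructor takes the language name.
english = SnowballStemmer("english")
spanish = SnowballStemmer("spanish")

# The English example from the list above.
print(english.stem("organization"))

# Illustrative Spanish inflections of "correr" (sample words chosen for this sketch).
for word in ["corriendo", "corremos"]:
    print(word, "->", spanish.stem(word))
```

These stemmers are purely rule-based, so this sketch should not need any NLTK corpus download.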

---

### 3. ⚠️ Lancaster Stemmer
- Developed at Lancaster University.
- Very aggressive stemming.
- Example: `"maximum"` → `"maxim"`, `"running"` → `"run"`
- **Pros**: Strong normalization.
- **Cons**: Over-stemming and distortion.
- **Rare in production** due to harsh truncation.
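
A quick way to judge how aggressive Lancaster is on your own vocabulary is to print its stems next to Porter's. The sketch below uses NLTK's `LancasterStemmer` and `PorterStemmer`; the word list is an arbitrary sample, not a benchmark.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

lancaster = LancasterStemmer()
porter = PorterStemmer()

# Print both stems side by side to compare how far each algorithm truncates.
for word in ["maximum", "running", "presumably", "provision"]:
    print(f"{word:<12} lancaster={lancaster.stem(word):<10} porter={porter.stem(word)}")
```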

---

### 4. 🧩 Custom/Regex-Based Stemmers
- Designed for domain-specific language (e.g., finance, science).
- Example: `"geolocation"`, `"geolocate"` → `"geo"`
- **Pros**: Tuned to application needs.
- **Cons**: Requires maintenance, not generalizable.
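
A custom stemmer can be as small as an ordered list of regular-expression rules. The sketch below uses only the standard library; `DOMAIN_RULES` and `domain_stem` are hypothetical names, and the rules are illustrative assumptions rather than a recommended rule set.

```python
import re

# Hypothetical domain rules: (pattern, replacement), tried in order.
# Rule 1 collapses any longer "geo..." term to the shared prefix "geo".
# Rule 2 strips a few common suffixes, but only if at least 3 characters precede them.
DOMAIN_RULES = [
    (re.compile(r"^geo\w+$"), "geo"),
    (re.compile(r"(?<=\w{3})(ing|ed|s)$"), ""),
]


def domain_stem(word: str) -> str:
    """Return the result of the first rule that matches, else the lowercased word."""
    word = word.lower()
    for pattern, replacement in DOMAIN_RULES:
        stemmed, count = pattern.subn(replacement, word)
        if count:
            return stemmed
    return word


for w in ["geolocation", "geolocate", "trading", "hedged", "bonds", "gas"]:
    print(w, "->", domain_stem(w))
```

Because the first matching rule wins, rule order is part of the design: the domain-specific rule sits above the generic suffix rule so that it cannot be pre-empted.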

---

### 5. 🕰️ Lovins Stemmer
- One of the earliest stemmers (1968).
- Removes longest matching suffix.
- Example: `"connections"` → `"connect"`
- **Rarely used today**, but historically notable.

---

## ⚖️ Stemming vs Lemmatization

| Feature            | Stemming                | Lemmatization           |
|--------------------|--------------------------|---------------------------|
| Approach           | Rule-based suffix stripping | Dictionary + POS-based |
| Output             | `"running"` → `"run"`<br>`"flies"` → `"fli"` | `"ran"` → `"run"` |
| Speed              | Fast                    | Slower but accurate     |
| Context-awareness  | ❌                      | ✅                      |
| Use Case           | IR/search, basic NLP    | ML/NLP pipelines, chatbots |
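
The contrast in the table can be sketched without any library. The suffix stripper and the tiny lookup table below are hypothetical stand-ins, not real stemmer or lemmatizer implementations; a real lemmatizer relies on a full dictionary and a part-of-speech tagger.

```python
# Toy lemma dictionary keyed by (word, part of speech).
# Hypothetical data, for illustration only.
LEMMAS = {
    ("ran", "VERB"): "run",
    ("flies", "VERB"): "fly",
    ("flies", "NOUN"): "fly",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix; no dictionary and no context."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str, pos: str) -> str:
    """Look the word up together with its part of speech; fall back to the word."""
    return LEMMAS.get((word, pos), word)


samples = [("ran", "VERB"), ("flies", "NOUN"), ("meeting", "NOUN"), ("meeting", "VERB")]
for word, pos in samples:
    print(f"{word:<8} {pos:<5} stem={toy_stem(word):<8} lemma={toy_lemmatize(word, pos)}")
```

The stem for `"meeting"` is the same in both rows because the stripper never sees the part of speech, while the lookup returns a different lemma for the noun and the verb.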

---

## 💼 Industrial Use Cases

- **Search Engines (e.g., Elasticsearch, Solr)**: Porter or Snowball stemmers.
- **Machine Learning Pipelines (e.g., scikit-learn)**: Sometimes stemmers, more often lemmatization.
- **Deep Learning Models (e.g., BERT)**: Typically take **raw text** split by a subword tokenizer (WordPiece in BERT's case); no stemming.

---

> 🔎 Tip: Use **lemmatization** if you care about word meaning and grammar. Use **stemming** if you need speed and simplicity.



In [19]:

# Stemming

"""Stemming reduces a word to its stem by stripping affixes (suffixes and
prefixes). The stem approximates the root of the word but, unlike a lemma,
need not be a dictionary word. Stemming is widely used in natural language
understanding (NLU) and natural language processing (NLP)."""

In [20]:
# Classification problem
## Decide whether a product comment is a positive or a negative review
## Ideally, review words collapse to one form: [eating, eats, eaten] ---> eat, [going, gone, goes] ---> go
words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]


## PorterStemmer

In [21]:
from nltk.stem import PorterStemmer
stemming = PorterStemmer()


In [22]:
for word in words:
  print(word+"--->"+stemming.stem(word))

eating--->eat
eats--->eat
eaten--->eaten
writing--->write
writes--->write
programming--->program
programs--->program
history--->histori
finally--->final
finalized--->final
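
For a classifier, the practical effect of these stems is a smaller vocabulary: several surface forms collapse into one feature. The sketch below groups the same word list by Porter stem; `groups` is a helper name introduced here for illustration.

```python
from nltk.stem import PorterStemmer

words = ["eating", "eats", "eaten", "writing", "writes",
         "programming", "programs", "history", "finally", "finalized"]

stemmer = PorterStemmer()

# Group the surface forms under the stem they share.
groups = {}
for word in words:
    groups.setdefault(stemmer.stem(word), []).append(word)

for stem, forms in groups.items():
    print(f"{stem:<8} <- {forms}")

print(f"{len(words)} surface forms -> {len(groups)} distinct stems")
```

Note that `eaten` stays in its own group, so an irregular form like this still counts as a separate feature.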


In [23]:
stemming.stem('congratulations')

'congratul'

In [24]:
stemming.stem('sitting')

'sit'


## RegexpStemmer class

NLTK's `RegexpStemmer` class implements a regular-expression stemmer. It takes a single regular expression and removes every substring that matches it, so anchoring the alternatives with `$` (or `^`) restricts removal to suffixes (or prefixes). Words shorter than `min` are returned unchanged. For example:

In [25]:
from nltk.stem import RegexpStemmer
reg_stemmer=RegexpStemmer('ing$|s$|e$|able$', min=4)

In [26]:
reg_stemmer.stem('eating')

'eat'

In [27]:
reg_stemmer.stem('ingeating')

'ingeat'
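
As a rough mental model of the behavior seen above, the stemmer can be approximated with a single `re.sub` call guarded by a length check. `regexp_stem` is a hypothetical helper written for this sketch with the standard library only; it is not NLTK's code.

```python
import re


def regexp_stem(word: str, pattern: str, min_length: int = 0) -> str:
    """Delete every match of `pattern` from `word`, unless the word is shorter than `min_length`."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


anchored = r"ing$|s$|e$|able$"

# Suffix removed.
print(regexp_stem("eating", anchored, min_length=4))
# Leading "ing" kept, because "$" ties each alternative to the end of the word.
print(regexp_stem("ingeating", anchored, min_length=4))
# Shorter than min_length, so returned unchanged.
print(regexp_stem("was", anchored, min_length=4))
# Unanchored pattern: both occurrences of "ing" are stripped.
print(regexp_stem("ingeating", r"ing", min_length=4))
```

The last call shows why the anchors matter: without `$`, the pattern is removed wherever it occurs in the word.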

## Snowball Stemmer

The Snowball stemmer for English is also known as the Porter2 stemming algorithm. It revises the original Porter stemmer and fixes several of its known issues.

In [28]:
from nltk.stem import SnowballStemmer

In [29]:
snowballsstemmer = SnowballStemmer('english')


In [32]:
for word in words:
  print(word+"--->" + snowballsstemmer.stem(word))

eating--->eat
eats--->eat
eaten--->eaten
writing--->write
writes--->write
programming--->program
programs--->program
history--->histori
finally--->final
finalized--->final


In [33]:
stemming.stem("fairly"),stemming.stem("sportingly")


('fairli', 'sportingli')

In [34]:
snowballsstemmer.stem("fairly"),snowballsstemmer.stem("sportingly")


('fair', 'sport')

In [35]:
snowballsstemmer.stem('goes')


'goe'

In [36]:
stemming.stem('goes')


'goe'
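
The `'goe'` result is an instance of under-stemming: `goes` does not end up with the same stem as `go`. The opposite failure, over-stemming, merges words with different meanings. The sketch below shows both with the Porter stemmer; the word choices are illustrative assumptions.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Under-stemming: related forms that end up with different stems.
under = {w: stemmer.stem(w) for w in ["go", "goes"]}
print(under)

# Over-stemming: words with different meanings that share one stem.
over = {w: stemmer.stem(w) for w in ["universe", "university", "universal"]}
print(over)
```

Checking a handful of domain words this way before committing to a stemmer is a cheap test of whether its errors matter for your application.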