# Stemming

- Stemming is the process of reducing a word to its stem: the base form that remains after affixes (prefixes and suffixes) are stripped. A stem is not the same as a lemma, the dictionary form of a word, and it need not be a real word.
- This is particularly useful in text processing tasks like search engines, where you want different forms of the same word (e.g., "running," "runs," and "run") to be treated as equivalent, thus improving search results or data analysis.
- Stemming is a common preprocessing step in natural language processing (NLP) and natural language understanding (NLU).

## How Stemming Works:
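- It removes affixes, mostly suffixes (such as "ing," "ly," "es," and "s"), from words to reduce them to a base form.
- Example: "running" and "runs" are both reduced to "run." An irregular form such as "ran" is left unchanged, because a stemmer applies suffix rules and has no knowledge of irregular verbs.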

However, stemming doesn't always produce real words: a stem such as "histori" (from "history") is not a dictionary word. The main goal is consistency across word variations, not grammatical accuracy.
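To make the idea of suffix stripping concrete, here is a minimal sketch in plain Python. It is a toy, not the Porter algorithm: the suffix list, the `naive_stem` name, and the `min_stem_len` parameter are all assumptions made for illustration.

```python
# Toy suffix stripper: illustrates the basic idea only, NOT the Porter algorithm.
# The suffix list and minimum stem length are assumptions for this sketch.
SUFFIXES = ("ing", "ly", "es", "s")  # checked in this order


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied, e.g. for irregular forms


for w in ["running", "runs", "ran", "easily", "studies"]:
    print(w, "->", naive_stem(w))
```

Note that this toy turns "running" into "runn" rather than "run". Real stemmers add further rules (such as undoubling a final consonant) on top of plain suffix removal.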

## PorterStemmer

In [1]:
from nltk.stem import PorterStemmer
# Create a PorterStemmer object
stemmer = PorterStemmer()

In [7]:
# Sample words
words=["eating","eats","eaten","writing","writes","history","finally","finalized", "running", "runs", "ran", "easily", "fairly", "studies", "studying", "programming","programs"]
words

['eating',
 'eats',
 'eaten',
 'writing',
 'writes',
 'history',
 'finally',
 'finalized',
 'running',
 'runs',
 'ran',
 'easily',
 'fairly',
 'studies',
 'studying',
 'programming',
 'programs']

In [8]:
# Apply stemming to each word
stemmed_words = [f'{word} :: {stemmer.stem(word)}' for word in words]
stemmed_words

['eating :: eat',
 'eats :: eat',
 'eaten :: eaten',
 'writing :: write',
 'writes :: write',
 'history :: histori',
 'finally :: final',
 'finalized :: final',
 'running :: run',
 'runs :: run',
 'ran :: ran',
 'easily :: easili',
 'fairly :: fairli',
 'studies :: studi',
 'studying :: studi',
 'programming :: program',
 'programs :: program']

- "running," "runs" → "run": The stemmer reduces both words to the base form "run."
- "easily" → "easili": The stem is not always a real word but is a consistent root form.
- "studies" and "studying" → "studi": Both words are reduced to the same base form.

In [9]:
stemmer.stem('sitting')

'sit'

In [10]:
stemmer.stem('congratulations')

'congratul'

## RegexpStemmer
The RegexpStemmer in NLTK is a customizable stemmer: you supply a regular expression (regex), and every substring that matches it is removed from the word. The optional min argument leaves words shorter than that length untouched. Unlike algorithms such as Porter, which follow predefined rules, RegexpStemmer gives you full control over which patterns are removed.

In [15]:
from nltk.stem import RegexpStemmer
# Define a regular expression that strips common suffixes from the end of a word
# (min=4 leaves words shorter than four characters unchanged)
regexp_stemmer = RegexpStemmer('ing$|s$|e$|able$|ed$|ly$', min=4)

In [16]:
regexp_stemmer.stem('eating')

'eat'

In [17]:
regexp_stemmer.stem('ingeating')

'ingeat'

In [18]:
regexp_stemmer.stem('walked')

'walk'

In [19]:
regexp_stemmer.stem('happily')

'happi'
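The results above, including "ingeating" keeping its leading "ing", follow from how regex matching works. The sketch below mimics the behavior with Python's `re` module. It approximates what RegexpStemmer does and is not NLTK's own code; the `regexp_stem` helper and its `min_len` parameter are hypothetical names, and the pattern anchors every alternative with `$`.

```python
import re


def regexp_stem(word, pattern, min_len=0):
    """Delete every match of pattern from word, unless the word is shorter than min_len."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)


# Every alternative ends in "$", so only suffixes at the end of the word match.
pattern = r"ing$|s$|e$|able$|ed$|ly$"

for w in ["eating", "ingeating", "walked", "happily", "is"]:
    print(w, "->", regexp_stem(w, pattern, min_len=4))
```

Because of the `$` anchor, the leading "ing" in "ingeating" is never matched. The short word "is" is returned unchanged because it falls under the minimum length.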

## Snowball Stemmer (improved version of Porter Stemmer)
The English Snowball stemmer is also known as the Porter2 stemming algorithm. It is a revision of the original Porter Stemmer that fixes several of its known issues.

In [20]:
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer("english")

In [22]:
[f'{word} :: {snowball.stem(word)}' for word in words]

['eating :: eat',
 'eats :: eat',
 'eaten :: eaten',
 'writing :: write',
 'writes :: write',
 'history :: histori',
 'finally :: final',
 'finalized :: final',
 'running :: run',
 'runs :: run',
 'ran :: ran',
 'easily :: easili',
 'fairly :: fair',
 'studies :: studi',
 'studying :: studi',
 'programming :: program',
 'programs :: program']

In [23]:
stemmer.stem("fairly"),stemmer.stem("sportingly")

('fairli', 'sportingli')

In [24]:
snowball.stem("fairly"),snowball.stem("sportingly")

('fair', 'sport')
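One way to see where the two stemmers disagree is to run both over the same words and keep only the mismatches. This is a small sketch; the `sample` list and the `differences` name are chosen here for illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Illustrative sample: two adverbs plus two words on which both stemmers agree
sample = ["fairly", "sportingly", "running", "studies"]

# Keep (word, porter_stem, snowball_stem) only where the two results differ
differences = [
    (w, porter.stem(w), snowball.stem(w))
    for w in sample
    if porter.stem(w) != snowball.stem(w)
]

for word, p, s in differences:
    print(f"{word}: porter={p}, snowball={s}")
```

On this sample, only the "-ly" adverbs differ, which matches the outputs shown above.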

### Key Points:
- Porter Stemmer: A widely used rule-based stemmer for English. It is less aggressive than the Lancaster stemmer (another NLTK stemmer, not covered here), which tends to produce shorter stems.
- RegexpStemmer: Useful when you need a specific stemming rule, especially for domain-specific text processing tasks.
- Snowball Stemmer: A refinement of Porter (Porter2 for English) that also supports multiple languages.
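The multi-language support can be inspected through the `languages` attribute of `SnowballStemmer`. The sketch below lists the available languages and builds a German stemmer; the German sample word is an arbitrary choice for illustration.

```python
from nltk.stem import SnowballStemmer

# Tuple of language names accepted by the SnowballStemmer constructor
print(SnowballStemmer.languages)

# Build a stemmer for another language by passing its name
german = SnowballStemmer("german")
german_stem = german.stem("Autobahnen")  # arbitrary sample word
print(german_stem)
```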

### When to Use Stemming:
- Stemming is useful in *search engines, text mining,* and *information retrieval* where different word forms need to be treated as equivalent.
- It’s not always appropriate for tasks where *grammatical accuracy* is important (e.g., generating human-readable content).
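As a rough illustration of the search use case, the sketch below builds a tiny inverted index keyed by stems, so that a query in one word form finds documents containing another. The documents, the `index` structure, and the `search` helper are all made up for this example. A real search engine would also handle tokenization, punctuation, and ranking.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini-corpus: document id -> text
docs = {
    1: "she runs every morning",
    2: "running is fun",
    3: "he studies history",
}

# Inverted index: stem -> set of ids of documents containing a word with that stem
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[stemmer.stem(token)].add(doc_id)


def search(query):
    """Return sorted ids of documents that share a stem with the query word."""
    return sorted(index.get(stemmer.stem(query), set()))


print(search("running"))   # matches documents containing "runs" or "running"
print(search("studying"))  # matches the document containing "studies"
```

Stemming both the indexed tokens and the query with the same stemmer is what makes the different word forms meet at the same key.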

### Stemming vs Lemmatization:
- **Stemming:** Strips affixes by rule to reach a base form, even if the result is not a valid word (e.g., "studies" becomes "studi," and "ran" stays "ran").
- **Lemmatization:** Maps words to their dictionary form using vocabulary and morphological analysis (e.g., "studies" becomes "study," and "ran" becomes "run").
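The contrast can be sketched by placing Porter stems next to dictionary forms. To keep the example self-contained, the lemmas come from a small hand-written lookup table (`TOY_LEMMAS`, a hypothetical stand-in) and not from a real lemmatizer such as NLTK's WordNetLemmatizer, which needs extra corpus data.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical hand-written lemma table standing in for a real lemmatizer
TOY_LEMMAS = {
    "studies": "study",
    "studying": "study",
    "ran": "run",
    "running": "run",
}


def toy_lemmatize(word):
    """Look the word up in the toy table; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


for w in ["studies", "ran"]:
    print(f"{w}: stem={stemmer.stem(w)}, lemma={toy_lemmatize(w)}")
```

The stemmer applies suffix rules and yields "studi" and "ran". The lookup returns the dictionary forms "study" and "run", because it relies on vocabulary knowledge and not on string rules.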