## Stemming

Stemming is a fundamental technique in Natural Language Processing (NLP) that reduces words to a base or root form, called a “stem.” The goal is to treat different forms of a word as the same item, which helps in tasks such as text search, information retrieval, and text analysis.

### How Stemming Works

Stemming algorithms remove suffixes (and sometimes prefixes) from words. For example:
- “playing,” “played,” and “plays” are all reduced to “play.”
- “running” and “runs” become “run,” although a related form such as “runner” may be left unchanged, depending on the algorithm.

The resulting stem may not always be a valid word in the language, but it serves as a common representation for related words.
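
To make the idea concrete, here is a minimal sketch of suffix stripping in plain Python. It is a toy illustration, not the Porter algorithm or any NLTK stemmer: the function name `toy_stem`, the suffix list, and the minimum stem length are all assumptions chosen for this example.

```python
# Toy suffix-stripping stemmer: illustrative only, not the Porter algorithm.
# The suffix list and minimum stem length are assumptions made for this sketch.
SUFFIXES = ("ing", "ed", "s")
MIN_STEM_LENGTH = 3


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least MIN_STEM_LENGTH characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for word in ["playing", "played", "plays", "running"]:
    print(f"{word} -> {toy_stem(word)}")
```

The first three words all map to “play,” while “running” becomes “runn,” which is not a valid English word but still works as a shared key for related forms.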

### Porter Stemmer

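The Porter stemmer is one of the most widely used stemming algorithms for English. It applies a fixed sequence of suffix-stripping rules in several steps, so a single word can lose more than one suffix. It does not remove prefixes: in the output below, “unhappiness” becomes “unhappi,” and “runner,” “quicker,” and “quickest” are left unchanged.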

In [1]:
words = ["playing", "played", "plays", "running", "runner", "easily", "fairly", "cats", "catlike", "catty", "happiness", "happy", "happily", "unhappiness", "quickly", "quickness", "quick", "quicker", "quickest", "studies", "studying", "studied", "study", "studious"]

In [2]:
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()

In [3]:
porter_stems = [porter_stemmer.stem(word) for word in words]

In [4]:
porter_stems

['play',
 'play',
 'play',
 'run',
 'runner',
 'easili',
 'fairli',
 'cat',
 'catlik',
 'catti',
 'happi',
 'happi',
 'happili',
 'unhappi',
 'quickli',
 'quick',
 'quick',
 'quicker',
 'quickest',
 'studi',
 'studi',
 'studi',
 'studi',
 'studiou']

In [5]:
# Compare the original words with their stems
for original, stem in zip(words, porter_stems):
    print(f"Original: {original} -> Stem: {stem}")

Original: playing -> Stem: play
Original: played -> Stem: play
Original: plays -> Stem: play
Original: running -> Stem: run
Original: runner -> Stem: runner
Original: easily -> Stem: easili
Original: fairly -> Stem: fairli
Original: cats -> Stem: cat
Original: catlike -> Stem: catlik
Original: catty -> Stem: catti
Original: happiness -> Stem: happi
Original: happy -> Stem: happi
Original: happily -> Stem: happili
Original: unhappiness -> Stem: unhappi
Original: quickly -> Stem: quickli
Original: quickness -> Stem: quick
Original: quick -> Stem: quick
Original: quicker -> Stem: quicker
Original: quickest -> Stem: quickest
Original: studies -> Stem: studi
Original: studying -> Stem: studi
Original: studied -> Stem: studi
Original: study -> Stem: studi
Original: studious -> Stem: studiou


### RegexpStemmer

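RegexpStemmer is a simpler stemmer that uses a regular expression to identify morphological affixes and removes whatever matches. The pattern used here lists common suffixes, each anchored to the end of the word with `$`, and `min=4` leaves words shorter than four characters unchanged. The results depend entirely on the pattern: “running” becomes “runn,” and “catlike” is untouched because none of the listed suffixes match.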

In [7]:
from nltk.stem import RegexpStemmer
regexp_stemmer = RegexpStemmer('ing$|s$|ed$|ly$|ness$|er$|est$|y$|ious$|ies$', min=4)

In [8]:
regexp_stems = [regexp_stemmer.stem(word) for word in words]

In [9]:
regexp_stems

['play',
 'play',
 'play',
 'runn',
 'runn',
 'easi',
 'fair',
 'cat',
 'catlike',
 'catt',
 'happi',
 'happ',
 'happi',
 'unhappi',
 'quick',
 'quick',
 'quick',
 'quick',
 'quick',
 'stud',
 'study',
 'studi',
 'stud',
 'stud']

In [10]:
for original, stem in zip(words, regexp_stems):
    print(f"Original: {original} -> Stem: {stem}")


Original: playing -> Stem: play
Original: played -> Stem: play
Original: plays -> Stem: play
Original: running -> Stem: runn
Original: runner -> Stem: runn
Original: easily -> Stem: easi
Original: fairly -> Stem: fair
Original: cats -> Stem: cat
Original: catlike -> Stem: catlike
Original: catty -> Stem: catt
Original: happiness -> Stem: happi
Original: happy -> Stem: happ
Original: happily -> Stem: happi
Original: unhappiness -> Stem: unhappi
Original: quickly -> Stem: quick
Original: quickness -> Stem: quick
Original: quick -> Stem: quick
Original: quicker -> Stem: quick
Original: quickest -> Stem: quick
Original: studies -> Stem: stud
Original: studying -> Stem: study
Original: studied -> Stem: studi
Original: study -> Stem: stud
Original: studious -> Stem: stud
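
The behavior seen in this output can be approximated with the standard `re` module alone. The sketch below reuses the pattern and minimum length from the `RegexpStemmer` call above; the helper name `regexp_stem` is hypothetical, and the logic is an assumption about how such a stemmer can be written, not a copy of NLTK's implementation.

```python
import re

# Same pattern and minimum length as in the RegexpStemmer call above.
SUFFIX_PATTERN = re.compile(r"ing$|s$|ed$|ly$|ness$|er$|est$|y$|ious$|ies$")
MIN_LENGTH = 4


def regexp_stem(word: str) -> str:
    """Hypothetical helper: delete the suffix matched by SUFFIX_PATTERN, if any."""
    if len(word) < MIN_LENGTH:
        return word  # too short: returned unchanged
    return SUFFIX_PATTERN.sub("", word)


for word in ["cats", "was", "running", "catlike", "studying"]:
    print(f"{word} -> {regexp_stem(word)}")
```

Here “was” is returned unchanged because it is shorter than four characters, even though it ends in “s.” Only one suffix is removed per word, which is why “studying” becomes “study” rather than “stud.”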


### Snowball Stemmer

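The Snowball stemmer for English is an improved revision of the Porter stemmer, and NLTK's `SnowballStemmer` also supports other languages, selected by name when the stemmer is created. On the word list used here, its output differs from the Porter stemmer's for three words: “fairly” becomes “fair” instead of “fairli,” “quickly” becomes “quick” instead of “quickli,” and “studious” is left intact instead of being cut to “studiou.”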

In [11]:
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")

In [12]:
snowball_stems = [snowball_stemmer.stem(word) for word in words]

In [13]:
snowball_stems

['play',
 'play',
 'play',
 'run',
 'runner',
 'easili',
 'fair',
 'cat',
 'catlik',
 'catti',
 'happi',
 'happi',
 'happili',
 'unhappi',
 'quick',
 'quick',
 'quick',
 'quicker',
 'quickest',
 'studi',
 'studi',
 'studi',
 'studi',
 'studious']

In [14]:
for original, stem in zip(words, snowball_stems):
    print(f"Original: {original} -> Stem: {stem}")


Original: playing -> Stem: play
Original: played -> Stem: play
Original: plays -> Stem: play
Original: running -> Stem: run
Original: runner -> Stem: runner
Original: easily -> Stem: easili
Original: fairly -> Stem: fair
Original: cats -> Stem: cat
Original: catlike -> Stem: catlik
Original: catty -> Stem: catti
Original: happiness -> Stem: happi
Original: happy -> Stem: happi
Original: happily -> Stem: happili
Original: unhappiness -> Stem: unhappi
Original: quickly -> Stem: quick
Original: quickness -> Stem: quick
Original: quick -> Stem: quick
Original: quicker -> Stem: quicker
Original: quickest -> Stem: quickest
Original: studies -> Stem: studi
Original: studying -> Stem: studi
Original: studied -> Stem: studi
Original: study -> Stem: studi
Original: studious -> Stem: studious
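
To see where the Porter and Snowball stemmers disagree, the two result lists can be compared word by word. The sketch below copies the stems from the outputs shown in this section, so it runs without NLTK; the variable name `differences` is introduced only for this example.

```python
# Stems copied from the Porter and Snowball outputs shown in this section.
words = ["playing", "played", "plays", "running", "runner", "easily", "fairly",
         "cats", "catlike", "catty", "happiness", "happy", "happily",
         "unhappiness", "quickly", "quickness", "quick", "quicker", "quickest",
         "studies", "studying", "studied", "study", "studious"]
porter_stems = ["play", "play", "play", "run", "runner", "easili", "fairli",
                "cat", "catlik", "catti", "happi", "happi", "happili",
                "unhappi", "quickli", "quick", "quick", "quicker", "quickest",
                "studi", "studi", "studi", "studi", "studiou"]
snowball_stems = ["play", "play", "play", "run", "runner", "easili", "fair",
                  "cat", "catlik", "catti", "happi", "happi", "happili",
                  "unhappi", "quick", "quick", "quick", "quicker", "quickest",
                  "studi", "studi", "studi", "studi", "studious"]

# Keep only the words for which the two stemmers produce different stems.
differences = [
    (word, porter, snowball)
    for word, porter, snowball in zip(words, porter_stems, snowball_stems)
    if porter != snowball
]

for word, porter, snowball in differences:
    print(f"{word}: Porter -> {porter}, Snowball -> {snowball}")
```

On this word list the two stemmers differ only for “fairly,” “quickly,” and “studious.” With NLTK installed, the hard-coded lists could be replaced by the `porter_stems` and `snowball_stems` variables computed earlier.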
