# Stemming Techniques using NLTK

This notebook demonstrates three stemmers available in NLTK:
- Porter Stemmer
- Snowball Stemmer
- Regex Stemmer

Each stemmer is applied to a list of sample words, and the resulting stems are printed.

In [1]:
# Install NLTK if not already installed
!pip install nltk



In [2]:
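import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, RegexpStemmer

# The stemmers below are rule-based and work without extra data.
# 'punkt' is only needed if you tokenize raw text first (e.g. with nltk.word_tokenize).
nltk.download('punkt')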

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sujal\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
words = [
    "playing", "played", "plays",
    "running", "runner", "ran",
    "easily", "easier", "easiest",
    "flying", "flies", "flown",
    "studies", "studied", "studying",
    "happily", "happiness", "happy",
    "singing", "singer", "sang",
    "jumps", "jumping", "jumped",
    "argues", "argued", "arguing",
    "writing", "writes", "wrote", "written"
]

## Porter Stemmer

In [4]:
porter = PorterStemmer()
for word in words:
    print(word + " --> " + porter.stem(word))

playing --> play
played --> play
plays --> play
running --> run
runner --> runner
ran --> ran
easily --> easili
easier --> easier
easiest --> easiest
flying --> fli
flies --> fli
flown --> flown
studies --> studi
studied --> studi
studying --> studi
happily --> happili
happiness --> happi
happy --> happi
singing --> sing
singer --> singer
sang --> sang
jumps --> jump
jumping --> jump
jumped --> jump
argues --> argu
argued --> argu
arguing --> argu
writing --> write
writes --> write
wrote --> wrote
written --> written
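
The output above is easier to interpret when words are grouped by the stem they share. The following is a minimal sketch, not part of the original notebook: `group_by_stem` is a hypothetical helper, and `sample` is a subset of the word list above. It shows that regular inflections such as playing/played/plays collapse to one stem, while irregular forms such as "ran" or "wrote" stay in groups of their own.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words, stemmer):
    """Map each stem to the list of input words that reduce to it."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


# Subset of the notebook's word list (regular and irregular verb forms)
sample = [
    "playing", "played", "plays",
    "running", "runner", "ran",
    "writing", "writes", "wrote", "written",
]

groups = group_by_stem(sample, PorterStemmer())
for stem, members in groups.items():
    print(f"{stem}: {members}")
```

A stemmer only strips suffixes by rule, so it cannot relate "ran" to "run"; mapping irregular forms to a common base generally calls for a lemmatizer instead.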


## Snowball Stemmer

In [5]:
snowball = SnowballStemmer("english")
for word in words:
    print(word + " --> " + snowball.stem(word))

playing --> play
played --> play
plays --> play
running --> run
runner --> runner
ran --> ran
easily --> easili
easier --> easier
easiest --> easiest
flying --> fli
flies --> fli
flown --> flown
studies --> studi
studied --> studi
studying --> studi
happily --> happili
happiness --> happi
happy --> happi
singing --> sing
singer --> singer
sang --> sang
jumps --> jump
jumping --> jump
jumped --> jump
argues --> argu
argued --> argu
arguing --> argu
writing --> write
writes --> write
wrote --> wrote
written --> written
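
For the word list above, the Snowball output is identical to the Porter output, so the two runs alone do not show where the stemmers diverge. The sketch below, under the assumption that a few extra words are worth probing, prints only the words on which the two stemmers disagree. `differing_stems` is a hypothetical helper, and "fairly", "generously", and "sky" are illustrative additions that do not appear in the original list; the results may vary with the NLTK version.

```python
from nltk.stem import PorterStemmer, SnowballStemmer


def differing_stems(words):
    """Return (word, porter_stem, snowball_stem) for words the stemmers disagree on."""
    porter = PorterStemmer()
    snowball = SnowballStemmer("english")
    rows = []
    for word in words:
        p, s = porter.stem(word), snowball.stem(word)
        if p != s:
            rows.append((word, p, s))
    return rows


# The first three words come from the notebook; the rest are assumed probes.
candidates = ["playing", "flies", "happily", "fairly", "generously", "sky"]
for word, p, s in differing_stems(candidates):
    print(f"{word}: porter={p}, snowball={s}")
```

Words that do not appear in the printed result were stemmed identically by both algorithms.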


## Regex Stemmer
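`RegexpStemmer` removes whatever part of a word matches a regular expression. The pattern below strips a trailing -ing, -s, or -ed, and `min=4` leaves words shorter than four characters unchanged.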

In [6]:
regex_stemmer = RegexpStemmer('ing$|s$|ed$', min=4)
examples = ["playing", "jumps", "argued", "studying"]
for word in examples:
    print(word + " --> " + regex_stemmer.stem(word))

playing --> play
jumps --> jump
argued --> argu
studying --> study
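
To make the behaviour above concrete, here is a rough standard-library approximation of what the regex stemmer does. It is a sketch for illustration, not NLTK's implementation: `regex_stem` and its `min_length` parameter are hypothetical names that mirror the pattern and `min=4` used in the cell above.

```python
import re


def regex_stem(word, pattern=r"ing$|s$|ed$", min_length=4):
    """Strip the first suffix match from `word`, skipping words shorter than `min_length`."""
    if len(word) < min_length:
        return word  # too short to stem safely
    return re.sub(pattern, "", word)


for word in ["playing", "jumps", "argued", "studying", "is", "need", "class"]:
    print(word + " --> " + regex_stem(word))
```

Because the rule is purely textual, it also truncates words where the ending is not a suffix at all: "need" loses its "ed" and "class" loses its final "s". The length threshold only protects very short words such as "is".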
