### Introduction to Text Preprocessing and Stemming

In NLP, **text preprocessing** is a crucial step that prepares raw text for machine learning models. We've already covered **tokenization**, which breaks text into words. Now, let's explore **stemming**, a process that reduces words to their root form.

### What is Stemming?

**Stemming** is the process of reducing a word to its **word stem**, or **root form**, by removing affixes (in practice, mostly suffixes). This is an essential technique for text analysis because it helps consolidate different forms of the same word into a single representation. For instance, in a product review dataset, words like "eating," "eats," and "eaten" all convey the same core meaning. Ideally, stemming reduces all of these words to a common root like "eat," although rule-based stemmers can miss irregular forms such as "eaten." This reduces the number of unique words (the vocabulary) in the dataset, which simplifies the model's input and can make training more efficient.
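
To make the vocabulary reduction concrete, here is a minimal sketch that counts unique tokens before and after stemming. The token list is a made-up stand-in for tokenized product reviews, and it uses NLTK's `PorterStemmer`, which is introduced in more detail later in this section.

```python
from nltk.stem import PorterStemmer

# Hypothetical tokens, as if already tokenized from a few product reviews.
tokens = ["eating", "eats", "eat", "programs", "programming", "program"]

stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

vocabulary = set(tokens)      # unique surface forms
stem_vocabulary = set(stems)  # unique stems after consolidation

print(f"Unique tokens: {len(vocabulary)}")
print(f"Unique stems:  {len(stem_vocabulary)}")
print(sorted(stem_vocabulary))
```

Six distinct surface forms collapse into two stems here. On a real corpus the reduction is usually less dramatic, but the idea is the same.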

In [1]:
words = ["eating", "eats", "eaten", "writing", "writes", "programming", "programs", "history", "finally", "finalized"]

In [8]:
import nltk

### Stemming with NLTK

The NLTK library provides several stemming algorithms.

#### **Porter Stemmer**

The **Porter Stemmer** is one of the most widely used and oldest stemming algorithms. It applies a series of rules to remove common English suffixes.

In [2]:
from nltk.stem import PorterStemmer

# Initialize the Porter Stemmer
porter = PorterStemmer()


print("--- Porter Stemmer Output ---")
for word in words:
    stemmed_word = porter.stem(word)
    print(f"{word:<15} -> {stemmed_word}")

print("\n--- Additional Examples ---")
print(f"{'congratulations':<15} -> {porter.stem('congratulations')}")
print(f"{'sitting':<15} -> {porter.stem('sitting')}")
print(f"{'fairly':<15} -> {porter.stem('fairly')}")
print(f"{'sportingly':<15} -> {porter.stem('sportingly')}")

--- Porter Stemmer Output ---
eating          -> eat
eats            -> eat
eaten           -> eaten
writing         -> write
writes          -> write
programming     -> program
programs        -> program
history         -> histori
finally         -> final
finalized       -> final

--- Additional Examples ---
congratulations -> congratul
sitting         -> sit
fairly          -> fairli
sportingly      -> sportingli



**Observation:** The Porter Stemmer works well for words like "eating" and "eats" (both stem to "eat"), but it has limitations. It often produces stems that are not valid words, such as "congratul" from "congratulations" and "histori" from "history." It also leaves the irregular form "eaten" unchanged instead of reducing it to "eat."
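
To see why a rule-based stemmer can produce non-words, it may help to look at a single rule group in isolation. The sketch below reimplements only step 1a of Porter's published algorithm (the plural rules) in plain Python. The names `STEP_1A_RULES` and `apply_step_1a` are illustrative, not part of NLTK, and NLTK's own implementation has several more steps and may differ in details.

```python
# Step 1a of Porter's published algorithm: plural suffix rules, tried in order.
# Each pair is (suffix to match, replacement). The first matching rule wins.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (unchanged)
    ("s", ""),       # cats     -> cat
]


def apply_step_1a(word: str) -> str:
    """Apply the first matching plural rule; return the word unchanged if none match."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for word in ["caresses", "ponies", "caress", "cats"]:
    print(f"{word:<15} -> {apply_step_1a(word)}")
```

The rule that turns "ponies" into "poni" never checks whether "poni" is a real word, which is the same mechanism behind stems like "histori."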

#### **Snowball Stemmer**

The **Snowball Stemmer** is an improved successor to the Porter Stemmer; its English variant is also known as the **Porter2 Stemmer**. It supports several languages besides English and often produces cleaner stems.


In [4]:
from nltk.stem import SnowballStemmer

# Initialize the Snowball Stemmer for English
snowball = SnowballStemmer("english")

print("\n--- Snowball Stemmer Output ---")
for word in words:
    stemmed_word = snowball.stem(word)
    print(f"{word:<15} -> {stemmed_word}")

print("\n--- Comparing Stemmers ---")
print(f"{'fairly':<15} -> Porter: {porter.stem('fairly')}, Snowball: {snowball.stem('fairly')}")
print(f"{'sportingly':<15} -> Porter: {porter.stem('sportingly')}, Snowball: {snowball.stem('sportingly')}")


--- Snowball Stemmer Output ---
eating          -> eat
eats            -> eat
eaten           -> eaten
writing         -> write
writes          -> write
programming     -> program
programs        -> program
history         -> histori
finally         -> final
finalized       -> final

--- Comparing Stemmers ---
fairly          -> Porter: fairli, Snowball: fair
sportingly      -> Porter: sportingli, Snowball: sport


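**Observation:** The Snowball Stemmer generally produces better results. For example, it stems "fairly" to "fair" and "sportingly" to "sport," where the Porter Stemmer returned "fairli" and "sportingli." It is still rule-based, though: "eaten" is left unchanged and "history" still becomes "histori."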
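
The multi-language support can be checked directly. This is a small sketch that lists the languages NLTK's `SnowballStemmer` accepts and builds a stemmer for a few of them. The sample words are arbitrary illustrations, so no particular output is claimed for the non-English ones.

```python
from nltk.stem import SnowballStemmer

# Class attribute: a tuple of the language names the stemmer accepts.
print(SnowballStemmer.languages)

# Build one stemmer per language; the sample words are arbitrary illustrations.
samples = {"english": "running", "german": "laufen", "spanish": "corriendo"}
for language, word in samples.items():
    stemmer = SnowballStemmer(language)
    print(f"{language:<10} {word:<12} -> {stemmer.stem(word)}")
```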

#### **Regexp Stemmer**

The **Regexp Stemmer** allows you to define your own stemming rules using **regular expressions**. This offers more control but requires a good understanding of regex.

In [5]:
from nltk.stem import RegexpStemmer

# Initialize RegexpStemmer with a regex pattern to remove common suffixes
# The pattern '(ing|s|e|able)$' matches 'ing', 's', 'e', or 'able' at the end of a word.
regexp_stemmer = RegexpStemmer('(ing|s|e|able)$')
custom_words = ["eating", "writes", "disable", "finalize"]

print("\n--- Regexp Stemmer Output ---")
for word in custom_words:
    stemmed_word = regexp_stemmer.stem(word)
    print(f"{word:<15} -> {stemmed_word}")


--- Regexp Stemmer Output ---
eating          -> eat
writes          -> write
disable         -> dis
finalize        -> finaliz
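
One way to limit damage from an aggressive pattern is the `min` argument of `RegexpStemmer`, which leaves words shorter than the given length untouched. The sketch below uses a pattern and word list chosen for illustration; the variable name `guarded` is arbitrary.

```python
from nltk.stem import RegexpStemmer

# min=4: words shorter than four characters are returned unchanged.
guarded = RegexpStemmer("ing$|s$|e$|able$", min=4)

for word in ["cars", "was", "bee", "compute", "advisable"]:
    print(f"{word:<15} -> {guarded.stem(word)}")
```

Short words such as "was" and "bee" pass through unchanged, while longer words are still stripped purely by pattern, so "compute" becomes "comput" and "advisable" becomes "advis."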


### Limitations of Stemming

Despite its usefulness, **stemming** has a major drawback: it is a **rule-based process** that often creates stems that are not actual words. This happens because a stemmer strips suffixes by pattern, without consulting a dictionary or considering the word's part of speech. For example, both the Porter and Snowball stemmers reduce "goes" to "goe," which is not a valid English word.

In [7]:
print(f"goes -> Porter: {porter.stem('goes')}, Snowball: {snowball.stem('goes')}")

goes -> Porter: goe, Snowball: goe


This is where **lemmatization** comes in. Unlike stemming, **lemmatization** uses a dictionary and morphological analysis to return a valid base form (**lemma**) of a word, so a lemmatizer can map "goes" to "go" rather than "goe." It is more computationally intensive but gives more accurate results. For applications like chatbots or machine translation, where producing real words matters, lemmatization is often the preferred choice.
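
The dictionary-based idea can be sketched without any external data. The lookup table below is a hypothetical, hand-written stand-in for a real lexical resource, and `toy_lemmatize` is an illustrative name, not a library function. In practice you would use a real lemmatizer backed by a full dictionary, such as NLTK's `WordNetLemmatizer`.

```python
# Hypothetical, hand-written lookup table standing in for a real dictionary.
TOY_LEMMAS = {
    "goes": "go",
    "went": "go",
    "eaten": "eat",
    "ate": "eat",
    "histories": "history",
}


def toy_lemmatize(word: str) -> str:
    """Return the dictionary base form if known, otherwise the word itself."""
    return TOY_LEMMAS.get(word.lower(), word)


for word in ["goes", "went", "eaten", "history"]:
    print(f"{word:<10} -> {toy_lemmatize(word)}")
```

Because the result comes from a lookup rather than suffix stripping, irregular forms such as "went" and "eaten" resolve to real words, and a word that is already a base form, like "history," is left intact.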