#Demonstration: Evaluating Stemming and Lemmatization Methods for Text Preprocessing

##Scenario:

- David, a data analyst, aimed to preprocess mobile customer reviews for sentiment analysis and topic modeling. To ensure accurate text normalization, David compared four stemmers (Porter, Snowball, Lancaster, Krovetz) and four lemmatizers (spaCy, TextBlob, Stanza, WordNet). The comparison was used to identify the technique that best preserves context before downstream processing.

##Import Libraries

In [1]:
!pip install stanza
!pip install krovetzstemmer
!pip install textblob


In [2]:
import pandas as pd
import nltk
import spacy
import stanza
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet
from textblob import TextBlob
from krovetzstemmer import Stemmer
import warnings
warnings.filterwarnings("ignore")

# Downloads
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
spacy.cli.download("en_core_web_sm")
stanza.download('en')


##Load Dataset

In [3]:
df = pd.read_csv("mobile reviews.csv")
df = df[['Review_ID', 'Review_Text']]
df.head(3)

|  | Review_ID | Review_Text |
|---|---|---|
| 0 | 1 | The new device is sleek and fast. I love the c... |
| 1 | 2 | Amazing display and battery life, but the pric... |
| 2 | 3 | I had a few issues with the initial setup, but... |


##Porter Stemmer

In [4]:
porter = PorterStemmer()
df['Porter_Stem'] = df['Review_Text'].apply(
    lambda text: " ".join([porter.stem(word) for word in word_tokenize(text.lower())])
)
df[['Review_Text', 'Porter_Stem']].head(2)

|  | Review_Text | Porter_Stem |
|---|---|---|
| 0 | The new device is sleek and fast. I love the c... | the new devic is sleek and fast . i love the c... |
| 1 | Amazing display and battery life, but the pric... | amaz display and batteri life , but the price ... |


##Snowball Stemmer

In [5]:
snowball = SnowballStemmer("english")
df['Snowball_Stem'] = df['Review_Text'].apply(
    lambda text: " ".join([snowball.stem(word) for word in word_tokenize(text.lower())])
)
df[['Review_Text', 'Snowball_Stem']].head(2)

|  | Review_Text | Snowball_Stem |
|---|---|---|
| 0 | The new device is sleek and fast. I love the c... | the new devic is sleek and fast . i love the c... |
| 1 | Amazing display and battery life, but the pric... | amaz display and batteri life , but the price ... |


##Lancaster Stemmer

In [6]:
lancaster = LancasterStemmer()
df['Lancaster_Stem'] = df['Review_Text'].apply(
    lambda text: " ".join([lancaster.stem(word) for word in word_tokenize(text.lower())])
)
df[['Review_Text', 'Lancaster_Stem']].head(2)

|  | Review_Text | Lancaster_Stem |
|---|---|---|
| 0 | The new device is sleek and fast. I love the c... | the new dev is sleek and fast . i lov the came... |
| 1 | Amazing display and battery life, but the pric... | amaz display and battery lif , but the pric is... |


##Krovetz Stemmer

In [7]:
krovetz = Stemmer()
df['Krovetz_Stem'] = df['Review_Text'].apply(
    lambda text: " ".join([krovetz.stem(word) for word in word_tokenize(text.lower())])
)
df[['Review_Text', 'Krovetz_Stem']].head(2)

|  | Review_Text | Krovetz_Stem |
|---|---|---|
| 0 | The new device is sleek and fast. I love the c... | the new device is sleek and fast . i love the ... |
| 1 | Amazing display and battery life, but the pric... | amazing display and battery life , but the pri... |
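
The four stemmer outputs above differ in only a few tokens, which is easier to see position by position. The following is a minimal sketch, not part of the original notebook: `token_disagreements` is a hypothetical helper, and the sample strings are only the visible prefixes of row 0 from the truncated outputs above.

```python
from typing import Dict, List, Tuple


def token_disagreements(outputs: Dict[str, str]) -> List[Tuple[int, Dict[str, str]]]:
    """Return (position, {method: token}) for positions where methods differ."""
    tokenized = {name: text.split() for name, text in outputs.items()}
    # Compare only up to the shortest output so positions stay aligned.
    length = min(len(tokens) for tokens in tokenized.values())
    diffs = []
    for i in range(length):
        tokens_here = {name: tokens[i] for name, tokens in tokenized.items()}
        if len(set(tokens_here.values())) > 1:
            diffs.append((i, tokens_here))
    return diffs


# Visible prefixes of row 0 from the outputs above (the rest is truncated).
row0 = {
    "Porter": "the new devic is sleek and fast . i love the",
    "Snowball": "the new devic is sleek and fast . i love the",
    "Lancaster": "the new dev is sleek and fast . i lov the",
    "Krovetz": "the new device is sleek and fast . i love the",
}

for position, tokens in token_disagreements(row0):
    print(position, tokens)
```

On this prefix the methods disagree at two positions: "device" (Porter and Snowball give `devic`, Lancaster `dev`, Krovetz `device`) and "love" (only Lancaster shortens it to `lov`). In the notebook, the same helper could be applied per row to the `*_Stem` columns.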


##Spacy Lemmatization

In [8]:
nlp_spacy = spacy.load("en_core_web_sm")
df['Spacy_Lemma'] = df['Review_Text'].apply(
    lambda text: " ".join([token.lemma_ for token in nlp_spacy(text)])
)
df[['Review_Text', 'Spacy_Lemma']].head(2)

|  | Review_Text | Spacy_Lemma |
|---|---|---|
| 0 | The new device is sleek and fast. I love the c... | the new device be sleek and fast . I love the ... |
| 1 | Amazing display and battery life, but the pric... | amazing display and battery life , but the pri... |


##TextBlob Lemmatization

In [9]:
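# TextBlob(...).words drops punctuation, and Word.lemmatize() without a POS
# argument treats every word as a noun, so verbs such as "is" stay unchanged.
df['TextBlob_Lemma'] = df['Review_Text'].apply(
    lambda text: " ".join([word.lemmatize() for word in TextBlob(text).words])
)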
df[['Review_Text', 'TextBlob_Lemma']].head(2)

|  | Review_Text | TextBlob_Lemma |
|---|---|---|
| 0 | The new device is sleek and fast. I love the c... | The new device is sleek and fast I love the ca... |
| 1 | Amazing display and battery life, but the pric... | Amazing display and battery life but the price... |


##Stanza Lemmatization

In [10]:
nlp_stanza = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma')
df['Stanza_Lemma'] = df['Review_Text'].apply(
    lambda text: " ".join([word.lemma for sent in nlp_stanza(text).sentences for word in sent.words])
)
df[['Review_Text', 'Stanza_Lemma']].head(2)

|  | Review_Text | Stanza_Lemma |
|---|---|---|
| 0 | The new device is sleek and fast. I love the c... | the new device be sleek and fast . I love the ... |
| 1 | Amazing display and battery life, but the pric... | amazing display and battery life , but the pri... |


##WordNet Lemmatization (with POS Tagging)

In [11]:
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(tag):
    """Map a Penn Treebank tag to the matching WordNet POS constant."""
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('R'):
        return wordnet.ADV
    # Nouns and every other tag fall back to NOUN, the lemmatizer's default.
    return wordnet.NOUN

df['WordNet_Lemma'] = df['Review_Text'].apply(
    lambda text: " ".join([
        lemmatizer.lemmatize(word, get_wordnet_pos(pos))
        for word, pos in pos_tag(word_tokenize(text.lower()))
    ])
)
df[['Review_Text', 'WordNet_Lemma']].head(2)

|  | Review_Text | WordNet_Lemma |
|---|---|---|
| 0 | The new device is sleek and fast. I love the c... | the new device be sleek and fast . i love the ... |
| 1 | Amazing display and battery life, but the pric... | amazing display and battery life , but the pri... |
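
The lemmatizer outputs above are not directly comparable as strings: the TextBlob column keeps the original casing and has no punctuation tokens, spaCy and Stanza keep "I" capitalized, and the WordNet cell lowercases everything first. The sketch below, which is an illustration and not part of the original notebook, normalizes case and punctuation before comparing. `comparable_tokens` is a hypothetical helper, and the sample strings are the visible prefixes of row 0 only.

```python
import string
from typing import List


def comparable_tokens(text: str) -> List[str]:
    """Lowercase the text and drop tokens made only of punctuation."""
    tokens = text.lower().split()
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]


# Visible prefixes of row 0 from the outputs above (the rest is truncated).
row0 = {
    "SpaCy": "the new device be sleek and fast . I love the",
    "TextBlob": "The new device is sleek and fast I love the",
    "Stanza": "the new device be sleek and fast . I love the",
    "WordNet": "the new device be sleek and fast . i love the",
}

normalized = {name: comparable_tokens(text) for name, text in row0.items()}
reference = normalized["WordNet"]

for name, tokens in normalized.items():
    # List (position, reference token, this method's token) where they differ.
    mismatches = [
        (i, ref, tok) for i, (ref, tok) in enumerate(zip(reference, tokens)) if ref != tok
    ]
    print(name, mismatches)
```

After normalization, spaCy, Stanza, and WordNet agree on this prefix, and TextBlob differs only where it leaves "is" unlemmatized. That is consistent with `lemmatize()` being called without a part-of-speech argument in the TextBlob cell.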


##Combine and Save to CSV

In [12]:
df['Stemming'] = (
    "Porter: " + df['Porter_Stem'] + "\n" +
    "Snowball: " + df['Snowball_Stem'] + "\n" +
    "Lancaster: " + df['Lancaster_Stem'] + "\n" +
    "Krovetz: " + df['Krovetz_Stem']
)

df['Lemmatization'] = (
    "SpaCy: " + df['Spacy_Lemma'] + "\n" +
    "TextBlob: " + df['TextBlob_Lemma'] + "\n" +
    "Stanza: " + df['Stanza_Lemma'] + "\n" +
    "WordNet: " + df['WordNet_Lemma']
)

final_df = df[['Review_ID', 'Stemming', 'Lemmatization']]
final_df.to_csv("stemming_vs_lemmatization.csv", index=False)
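
The combined columns above pack four outputs into one cell separated by newlines, which reads well in a spreadsheet but is awkward to filter or aggregate. One alternative is a long layout with one row per review and method. The sketch below uses only the standard library and illustrative data: the `records` list, the `to_long_rows` helper, and the `Method` and `Normalized_Text` column names are assumptions, not part of the original notebook.

```python
import csv
import io

# Illustrative records; in the notebook these values would come from df columns.
records = [
    {"Review_ID": 1, "Porter": "the new devic is sleek", "WordNet": "the new device be sleek"},
]
methods = ["Porter", "WordNet"]


def to_long_rows(records, methods):
    """Yield one row per (review, method) pair."""
    for record in records:
        for method in methods:
            yield {
                "Review_ID": record["Review_ID"],
                "Method": method,
                "Normalized_Text": record[method],
            }


buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.DictWriter(buffer, fieldnames=["Review_ID", "Method", "Normalized_Text"])
writer.writeheader()
writer.writerows(to_long_rows(records, methods))
print(buffer.getvalue())
```

With this layout, each method's output can be selected with a simple filter on `Method`, and no cell contains embedded newlines.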