# Stemming and lemmatization (using NLTK vs. spaCy)

This notebook compares stemming and lemmatization in NLTK and spaCy, two ways of reducing words to a base form. Stemming strips suffixes with fixed rules, which is fast but can produce non-words. Lemmatization uses a vocabulary (and, ideally, part-of-speech information) to return valid dictionary forms. Each technique is run on the same sample text so you can compare the results side by side.



In [20]:
# Install required packages if not already installed
# !pip install nltk spacy
# !python -m nltk.downloader wordnet punkt
# (recent NLTK releases may also need the punkt_tab resource for word_tokenize)
# !python -m spacy download en_core_web_sm

In [21]:
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
import spacy

In [22]:

text = (
    "The striped bats were hanging on their feet for the best outcomes. "
    "They had been running, thinking, and eating quickly. "
)


## NLTK Stemming using Porter & Snowball stemmers

- *What it does*: Applies rule-based suffix stripping to reduce words to a stem.
- *Pros*: Fast and lightweight.
- *Cons*: Can produce non-words (e.g., "studies" → "studi") and does not handle irregular forms ("better" stays "better" rather than becoming "good").
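
To make "rule-based suffix stripping" concrete, here is a minimal toy stemmer. It is only a sketch: the rule list, the `toy_stem` name, and the minimum stem length are assumptions made for illustration, and the real Porter and Snowball algorithms use many more rules and conditions.

```python
# Toy suffix stripper: illustrative only, NOT the Porter or Snowball algorithm.
# The rule list and the minimum-stem-length guard are assumptions for this sketch.
TOY_RULES = [
    ("ies", "i"),  # "studies" -> "studi"
    ("ing", ""),   # "thinking" -> "think"
    ("ed", ""),    # "jumped" -> "jump"
    ("s", ""),     # "bats" -> "bat"
]


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Apply the first matching suffix rule; rules are ordered longest suffix first."""
    lowered = word.lower()
    for suffix, replacement in TOY_RULES:
        stem = lowered[: len(lowered) - len(suffix)]
        # Only strip when enough of the word is left to be a plausible stem.
        if lowered.endswith(suffix) and len(stem) >= min_stem_len:
            return stem + replacement
    return lowered  # no rule matched: return the lowercased word unchanged


for w in ["studies", "thinking", "bats", "running", "feet", "better"]:
    print(f"{w:<10} {toy_stem(w)}")
```

The toy version turns "running" into "runn" because it has no rule for doubled consonants, and it leaves "feet" and "better" alone because no suffix rule can recover an irregular form. Real stemmers add rules for the first problem but share the second limitation.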


In [14]:
print("=== NLTK Stemming (using two types of stemmers: Porter & Snowball) ===")
porter = PorterStemmer()
snowball = SnowballStemmer("english")

tokens = nltk.word_tokenize(text)

print(f"{'Token':<12} {'Porter':<12} {'Snowball':<12}")
print("-" * 40)
for token in tokens:
    print(f"{token:<12} {porter.stem(token):<12} {snowball.stem(token):<12}")

=== NLTK Stemming (using two types of stemmers: Porter & Snowball) ===
Token        Porter       Snowball    
----------------------------------------
The          the          the         
striped      stripe       stripe      
bats         bat          bat         
were         were         were        
hanging      hang         hang        
on           on           on          
their        their        their       
feet         feet         feet        
for          for          for         
the          the          the         
best         best         best        
outcomes     outcom       outcom      
.            .            .           
They         they         they        
had          had          had         
been         been         been        
running      run          run         
,            ,            ,           
thinking     think        think       
,            ,            ,           
and          and          and         
eating       eat          eat 

## NLTK Lemmatization using WordNet Lemmatizer

- *What it does*: Uses the WordNet dictionary to convert words to their base (lemma) form.
- *Pros*: More accurate than stemming; returns valid words.
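- *Cons*: Needs a part-of-speech (POS) tag for each word to be fully accurate. `lemmatize()` treats every word as a noun unless a `pos` argument is passed, which is why "running" and "were" come back unchanged in the output below.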
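
One way to supply the POS information is to tag the tokens first and translate each tag into the single-letter code that `lemmatize()` accepts through its `pos` argument. The sketch below assumes Penn Treebank tags as returned by `nltk.pos_tag`. The helper names `penn_to_wordnet` and `lemmatize_with_pos` are hypothetical, and the tagger needs its own NLTK data download, whose resource name varies between NLTK versions.

```python
# Sketch: map Penn Treebank tags (as produced by nltk.pos_tag) to the
# single-letter POS codes that WordNetLemmatizer.lemmatize() accepts.
# Both helper names are hypothetical; they are not part of NLTK.

def penn_to_wordnet(tag: str) -> str:
    """Return 'a', 'v', 'r', or 'n' (WordNet adjective, verb, adverb, noun)."""
    if tag.startswith("JJ"):
        return "a"
    if tag.startswith("VB"):
        return "v"
    if tag.startswith("RB"):
        return "r"
    return "n"  # everything else falls back to noun, the lemmatizer's default


def lemmatize_with_pos(tokens):
    """Lemmatize tokens using POS tags. Requires NLTK plus its tagger and WordNet data."""
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)  # list of (token, Penn Treebank tag) pairs
    return [lemmatizer.lemmatize(tok, pos=penn_to_wordnet(tag)) for tok, tag in tagged]
```

In this notebook, calling `lemmatize_with_pos(tokens)` would be the tagged counterpart of the plain `lemmatizer.lemmatize(token)` loop. The result depends on the tags the tagger assigns, so it can still differ from spaCy's output.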


In [15]:
print("=== NLTK Lemmatization (WordNet) ===")
lemmatizer = WordNetLemmatizer()

print(f"{'Token':<12} {'Lemma':<12}")
print("-" * 25)
for token in tokens:
    print(f"{token:<12} {lemmatizer.lemmatize(token):<12}")

=== NLTK Lemmatization (WordNet) ===
Token        Lemma       
-------------------------
The          The         
striped      striped     
bats         bat         
were         were        
hanging      hanging     
on           on          
their        their       
feet         foot        
for          for         
the          the         
best         best        
outcomes     outcome     
.            .           
They         They        
had          had         
been         been        
running      running     
,            ,           
thinking     thinking    
,            ,           
and          and         
eating       eating      
quickly      quickly     
.            .           


## spaCy lemmatization

-   *What it does*: Runs a trained pipeline that assigns POS tags in context, then uses those tags to choose each lemma.
-   *Pros*: Handles irregular forms ("were" → "be", "best" → "good") without manual POS tagging.
-   *Cons*: Slower than NLTK's lemmatizer because the whole pipeline runs on the text, and it requires downloading a model.
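
The reason POS tags matter is that the same surface form can have different lemmas depending on its role in the sentence. The toy lookup below illustrates the idea with a hand-written table keyed on (word, POS). The table contents and the `toy_lemma` name are assumptions for this sketch, not spaCy's data or implementation.

```python
# Toy illustration of POS-dependent lemmatization. The table is hand-written
# for this sketch and is not spaCy's data or implementation.
TOY_LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("best", "ADJ"): "good",
    ("best", "ADV"): "well",
    ("were", "AUX"): "be",
    ("feet", "NOUN"): "foot",
}


def toy_lemma(word: str, pos: str) -> str:
    """Look up (lowercased word, POS); fall back to the lowercased word if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word.lower())


print(toy_lemma("saw", "VERB"))  # see
print(toy_lemma("saw", "NOUN"))  # saw
```

In spaCy the predicted tag is available as `token.pos_`, so printing it next to `token.lemma_` in the loop below shows which tag led to each lemma.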


In [18]:
print("=== spaCy Lemmatization ===")
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

print(f"{'Token':<12} {'Lemma':<12} ")
print("-" * 35)
for token in doc:
    print(f"{token.text:<12} {token.lemma_:<12} ")

=== spaCy Lemmatization ===
Token        Lemma        
-----------------------------------
The          the          
striped      striped      
bats         bat          
were         be           
hanging      hang         
on           on           
their        their        
feet         foot         
for          for          
the          the          
best         good         
outcomes     outcome      
.            .            
They         they         
had          have         
been         be           
running      run          
,            ,            
thinking     think        
,            ,            
and          and          
eating       eat          
quickly      quickly      
.            .            


## Summary table

| Technique         | Tool      | What it does |
|-------------------|-----------|--------------|
| **Stemming**      | **NLTK**  | Strips word endings with fixed rules. Fast, but the output may not be a real word (e.g., "outcom"). |
| **Lemmatization** | **NLTK**  | Looks words up in WordNet to get their base form. Treats every word as a noun unless you pass a POS tag. |
| **Lemmatization** | **spaCy** | Tags each word's part of speech in context, then uses the tag to pick the lemma. Handles irregular forms such as "were" → "be". |
