
## Lemmatization
Lemmatization is a natural language processing (NLP) technique that reduces words to their base or dictionary form, known as a lemma. Unlike stemming, which strips word endings by rule, lemmatization uses vocabulary lookups and context (such as a word's part of speech), so the result is a valid dictionary word.

For example, when treated as verbs, "running," "ran," and "runs" all reduce to the lemma "run." This matters for tasks like text analysis and search because it consolidates the different forms of a word into one common representation.
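
To make the idea concrete before bringing in a library, here is a minimal sketch of lemmatization as a lookup keyed by word and part of speech. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical names invented for this illustration, and the handful of entries is an assumption; a real lemmatizer consults a full lexical database.

```python
# Toy lemma table keyed by (surface form, part-of-speech code).
# The entries are illustrative only, not taken from any real lexicon.
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("runs", "v"): "run",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return TOY_LEMMAS.get(key, word.lower())

for word in ["Running", "ran", "runs"]:
    print(word, "->", toy_lemmatize(word, pos="v"))

# The same surface form can resolve differently under different tags.
print(toy_lemmatize("better", pos="a"))  # adjective entry exists in the toy table
print(toy_lemmatize("better", pos="n"))  # no noun entry, so the word is returned as-is
```

The point of the sketch is the key: the lookup depends on the part of speech as well as the word, which is why the later steps tag tokens before lemmatizing them.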

### Step 1: Import libraries and read data

In [1]:
# Import required libraries
import nltk
import string
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [15]:
#!pip install nltk

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
Successfully installed nltk-3.9.1


In [2]:
# Download necessary NLTK data files (only needed once)
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [3]:
# Load Text Data (Replace with your text source if needed)
document = (
    "This is an example sentence, demonstrating the process of lemmatization. "
    "Running, ran, and runs are different forms of the verb run."
)

document

'This is an example sentence, demonstrating the process of lemmatization. Running, ran, and runs are different forms of the verb run.'

### Step 2: Tokenization

In [4]:
# Tokenization
tokens = word_tokenize(document)

print(tokens)

['This', 'is', 'an', 'example', 'sentence', ',', 'demonstrating', 'the', 'process', 'of', 'lemmatization', '.', 'Running', ',', 'ran', ',', 'and', 'runs', 'are', 'different', 'forms', 'of', 'the', 'verb', 'run', '.']
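
As a rough picture of what tokenization does, the sketch below splits text into runs of word characters and single punctuation marks with a regular expression. It is only an approximation and not how `word_tokenize` is implemented: NLTK's tokenizer also handles contractions, abbreviations, and other cases this pattern does not. The function name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Running, ran, and runs are different forms of the verb run."
print(simple_tokenize(sample))
```

On a plain sentence like this one, the pattern separates commas and the final period into their own tokens, which is the behavior the later punctuation filter relies on.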


In [5]:
tokens = [token.lower() for token in tokens]

print(tokens)

['this', 'is', 'an', 'example', 'sentence', ',', 'demonstrating', 'the', 'process', 'of', 'lemmatization', '.', 'running', ',', 'ran', ',', 'and', 'runs', 'are', 'different', 'forms', 'of', 'the', 'verb', 'run', '.']


### Step 3: Remove Stopwords & Punctuation

In [6]:
# Remove Stopwords & Punctuation
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token.lower() not in stop_words and token not in string.punctuation]

print(tokens)

['example', 'sentence', 'demonstrating', 'process', 'lemmatization', 'running', 'ran', 'runs', 'different', 'forms', 'verb', 'run']
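
One detail worth knowing about the filter above: `token not in string.punctuation` is a substring test against a single string of punctuation characters. It works for one-character tokens such as `','` and `'.'`, but multi-character punctuation tokens (for example `'...'` or `'--'`, which tokenizers can emit) would slip through. The sketch below contrasts the two checks; `is_punctuation` is a hypothetical helper, and the token list is made up for illustration.

```python
import string

def is_punctuation(token):
    """True when the token is non-empty and every character is ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

# Made-up token list that includes multi-character punctuation.
tokens = ["example", ",", "...", "--", "run", "."]

substring_check = [t for t in tokens if t not in string.punctuation]
per_char_check = [t for t in tokens if not is_punctuation(t)]

print(substring_check)  # '...' and '--' survive the substring test
print(per_char_check)   # only the word tokens remain
```

For the short example document in this notebook both checks give the same result, so this only matters on messier text.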


### Step 4: Lemmatization using nltk with correct POS tags

In [7]:
# Helper function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Lemmatization using nltk with correct POS tags
lemmatizer = WordNetLemmatizer()
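# First, get POS tags for the tokens.
# Note: the tagger sees the filtered, lowercased tokens here rather than full
# sentences, so some tags can be off (in the output below, 'ran' is left as-is).
pos_tags = nltk.pos_tag(tokens)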
# Then lemmatize each token with its corresponding POS tag
lemmatized_tokens = [
    lemmatizer.lemmatize(token, get_wordnet_pos(pos))
    for token, pos in pos_tags
]

print(lemmatized_tokens)

['example', 'sentence', 'demonstrate', 'process', 'lemmatization', 'run', 'ran', 'run', 'different', 'form', 'verb', 'run']
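
The helper in this step looks only at the first letter of each Penn Treebank tag. The sketch below restates that mapping as a lookup table using the single-letter WordNet codes (`'a'`, `'v'`, `'n'`, `'r'`), which are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV` in NLTK. The names `PREFIX_TO_WORDNET` and `penn_to_wordnet` are hypothetical.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech code.
PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS code, falling back to noun."""
    return PREFIX_TO_WORDNET.get(tag[:1], default)

for tag in ["VBD", "VBG", "NNS", "JJR", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

This also suggests why `'ran'` is unchanged in the output above: the lemmatizer maps it to `'run'` only when it is given the verb code, so the tagger most likely assigned it a non-verb tag once the surrounding stopwords had been removed.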


### Step 5: Reconstruct Processed Text

In [8]:
# Reconstruct Processed Text
processed_text_nltk = " ".join(lemmatized_tokens)
print("Processed Text (nltk):")
print(processed_text_nltk)

Processed Text (nltk):
example sentence demonstrate process lemmatization run ran run different form verb run


### Step 6: Save the Processed Data to a File

In [9]:
# Save the Processed Data to a File
with open("processed_text_nltk.txt", "w") as f:
    f.write(processed_text_nltk)

In [12]:
# ------------------------------
# Using spaCy for Text Processing
# ------------------------------

import spacy

# Load spaCy's English language model
# Make sure to run: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Process the text with spaCy
doc = nlp(document)

# Steps 3 & 4: remove stopwords and punctuation, then lemmatize using spaCy's token attributes
lemmatized_tokens_spacy = [
    token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct
]

# Extra step: dependency parsing, returning (token text, dependency label, head text) tuples
def dependency_parsing(text):
    doc = nlp(text)
    return [(token.text, token.dep_, token.head.text) for token in doc]

dep_test = dependency_parsing(document)
print(dep_test)

# Step 5: reconstruct processed text
processed_text_spacy = " ".join(lemmatized_tokens_spacy)
print("\nProcessed Text (spaCy):")
print(processed_text_spacy)

# Step 6: save the processed data to a file
with open("processed_text_spacy.txt", "w") as f:
    f.write(processed_text_spacy)

[('This', 'nsubj', 'is'), ('is', 'ROOT', 'is'), ('an', 'det', 'sentence'), ('example', 'compound', 'sentence'), ('sentence', 'attr', 'is'), (',', 'punct', 'is'), ('demonstrating', 'advcl', 'is'), ('the', 'det', 'process'), ('process', 'dobj', 'demonstrating'), ('of', 'prep', 'process'), ('lemmatization', 'pobj', 'of'), ('.', 'punct', 'is'), ('Running', 'nsubj', 'ran'), (',', 'punct', 'Running'), ('ran', 'ROOT', 'ran'), (',', 'punct', 'ran'), ('and', 'cc', 'ran'), ('runs', 'nsubj', 'are'), ('are', 'conj', 'ran'), ('different', 'amod', 'forms'), ('forms', 'attr', 'are'), ('of', 'prep', 'forms'), ('the', 'det', 'run'), ('verb', 'compound', 'run'), ('run', 'pobj', 'of'), ('.', 'punct', 'are')]

Processed Text (spaCy):
example sentence demonstrate process lemmatization running run run different form verb run
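
The two pipelines do not agree on every token. A quick way to see where they differ is to compare the processed strings position by position; the sketch below does this with the two outputs printed above, copied in as string literals.

```python
# Processed strings copied from the NLTK and spaCy outputs above.
nltk_text = "example sentence demonstrate process lemmatization run ran run different form verb run"
spacy_text = "example sentence demonstrate process lemmatization running run run different form verb run"

nltk_tokens = nltk_text.split()
spacy_tokens = spacy_text.split()

# Both pipelines kept the same number of tokens here, so positions line up.
assert len(nltk_tokens) == len(spacy_tokens)

differences = [
    (index, nltk_tok, spacy_tok)
    for index, (nltk_tok, spacy_tok) in enumerate(zip(nltk_tokens, spacy_tokens))
    if nltk_tok != spacy_tok
]
print(differences)
```

In this run the disagreements fall on the forms of "run": NLTK leaves `'ran'` unchanged while spaCy leaves `'running'` unchanged, which shows that lemmatizer output depends on how each tool tags the word in its context.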


## Stemming
Stemming is a natural language processing (NLP) technique that reduces words to a root form by stripping affixes, usually suffixes. The goal is to treat different variations of a word as the same for text analysis. For example:

    "running" → "run"
    "happiness" → "happi"
    "better" → "better" (some stemmers may not handle irregular forms well)

Stemmers typically apply simple rewrite rules with no knowledge of vocabulary or grammar, so they can produce stems that are not real words.

Lemmatization, by contrast, reduces words to their dictionary form (lemma) using vocabulary and part-of-speech context, for example mapping the adjective "better" to "good".
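
The suffix-stripping idea can be sketched in a few lines. This toy stemmer is an illustration only and is not the Porter, Lancaster, or Snowball algorithm; the suffix list, the minimum stem length, and the name `toy_stem` are all assumptions made for the example.

```python
# Suffixes are checked in this order; the list is deliberately tiny.
SUFFIXES = ("ing", "ly", "es", "ed", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for word in ["processing", "tasks", "reduces", "running", "better"]:
    print(word, "->", toy_stem(word))
```

Stems such as `reduc` and `runn` are not dictionary words, and `better` passes through untouched because no rule applies. Real stemmers add many more rules (for instance, handling doubled consonants), but they share the same blind, rule-driven character.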

### Step 1: Import libraries and read data

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
import string

In [None]:
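# Recent NLTK releases (such as the 3.9.1 installed above) also need 'punkt_tab'
# for word_tokenize; it was already downloaded in the lemmatization section.
nltk.download('punkt')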
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Read data
text = "This is a simple example to demonstrate stemming. Stemming reduces words to their root form, which helps in text processing tasks."

text

'This is a simple example to demonstrate stemming. Stemming reduces words to their root form, which helps in text processing tasks.'

### Step 2: Tokenization

In [None]:
# Tokenization
tokens = word_tokenize(text.lower())

print(tokens)

['this', 'is', 'a', 'simple', 'example', 'to', 'demonstrate', 'stemming', '.', 'stemming', 'reduces', 'words', 'to', 'their', 'root', 'form', ',', 'which', 'helps', 'in', 'text', 'processing', 'tasks', '.']


### Step 3: Remove stopwords and punctuation

In [None]:
# Remove stopwords and punctuation
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words and word not in string.punctuation]

print(filtered_tokens)

['simple', 'example', 'demonstrate', 'stemming', 'stemming', 'reduces', 'words', 'root', 'form', 'helps', 'text', 'processing', 'tasks']


### Step 4: Apply Stemming
- Use a stemming algorithm to reduce words to their stems:
  - Porter Stemmer (`PorterStemmer()`)
  - Lancaster Stemmer (`LancasterStemmer()`)
  - Snowball Stemmer (`SnowballStemmer('english')`, which requires a language name)
- Process tokens by applying each stemmer.

In [None]:
# Apply Stemming
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

# Process tokens with each stemmer
porter_stems = [porter.stem(word) for word in filtered_tokens]
lancaster_stems = [lancaster.stem(word) for word in filtered_tokens]
snowball_stems = [snowball.stem(word) for word in filtered_tokens]

print(porter_stems, "\n")
print(lancaster_stems, "\n")
print(snowball_stems)

['simpl', 'exampl', 'demonstr', 'stem', 'stem', 'reduc', 'word', 'root', 'form', 'help', 'text', 'process', 'task'] 

['simpl', 'exampl', 'demonst', 'stem', 'stem', 'reduc', 'word', 'root', 'form', 'help', 'text', 'process', 'task'] 

['simpl', 'exampl', 'demonstr', 'stem', 'stem', 'reduc', 'word', 'root', 'form', 'help', 'text', 'process', 'task']
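
The three lists above are nearly identical, so it helps to print only the positions where the stemmers disagree. The sketch below does that using the filtered tokens and stem lists from the outputs above, copied in as literals.

```python
# Token and stem lists copied from the printed outputs above.
filtered = ['simple', 'example', 'demonstrate', 'stemming', 'stemming', 'reduces',
            'words', 'root', 'form', 'helps', 'text', 'processing', 'tasks']
porter = ['simpl', 'exampl', 'demonstr', 'stem', 'stem', 'reduc',
          'word', 'root', 'form', 'help', 'text', 'process', 'task']
lancaster = ['simpl', 'exampl', 'demonst', 'stem', 'stem', 'reduc',
             'word', 'root', 'form', 'help', 'text', 'process', 'task']
snowball = ['simpl', 'exampl', 'demonstr', 'stem', 'stem', 'reduc',
            'word', 'root', 'form', 'help', 'text', 'process', 'task']

# Keep only the rows where the three stemmers do not all agree.
disagreements = [
    (word, p, l, s)
    for word, p, l, s in zip(filtered, porter, lancaster, snowball)
    if len({p, l, s}) > 1
]

for word, p, l, s in disagreements:
    print(f"{word}: porter={p} lancaster={l} snowball={s}")
```

On this sample the only difference is `demonstrate`, which Lancaster cuts one character shorter than Porter and Snowball. A longer or more varied text would be needed to see how the stemmers differ more broadly.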


### Step 5: Reconstruct processed text
- Rebuild text from stemmed tokens.

In [None]:
# Reconstruct processed text
porter_text = ' '.join(porter_stems)
lancaster_text = ' '.join(lancaster_stems)
snowball_text = ' '.join(snowball_stems)

print("Porter Stemmer Text:", porter_text, "\n")
print("Lancaster Stemmer Text:", lancaster_text, "\n")
print("Snowball Stemmer Text:", snowball_text)

Porter Stemmer Text: simpl exampl demonstr stem stem reduc word root form help text process task 

Lancaster Stemmer Text: simpl exampl demonst stem stem reduc word root form help text process task 

Snowball Stemmer Text: simpl exampl demonstr stem stem reduc word root form help text process task


### Step 6: Save or use the processed data

In [None]:
# Save or use the processed data
with open('processed_text_porter.txt', 'w') as f:
    f.write(porter_text)

with open('processed_text_lancaster.txt', 'w') as f:
    f.write(lancaster_text)

with open('processed_text_snowball.txt', 'w') as f:
    f.write(snowball_text)

print("Processed text saved successfully!")

Processed text saved successfully!
