## Lemmatization
Lemmatization is a natural language processing (NLP) technique that reduces words to their base or dictionary form, known as a lemma. Unlike stemming, which typically just strips word endings by rule, lemmatization uses a vocabulary and linguistic context (such as a word's part of speech) so that the result is a valid dictionary word.

For example, "running," "ran," and "runs" all reduce to the lemma "run" when they are treated as verbs. This consolidation is useful for tasks such as text analysis and search, because the different inflected forms of a word map to one common representation.
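The contrast between stemming and lemmatization can be sketched with a toy example. This is only an illustration under stated assumptions: `toy_stem`, `toy_lemmatize`, `SUFFIXES`, and `TOY_LEMMAS` are hypothetical names invented here, and neither function reproduces what NLTK or spaCy actually do internally.

```python
# Toy illustration only: a naive suffix stripper versus a lookup keyed by (word, POS).
SUFFIXES = ("ing", "s")  # hypothetical, deliberately simplistic rule set


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, without checking that the result is a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hand-written stand-in for a dictionary; a real lemmatizer relies on a full lexicon.
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("runs", "v"): "run",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)


words = ["running", "ran", "runs"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w, "v") for w in words]

print(stems)   # ['runn', 'ran', 'run']
print(lemmas)  # ['run', 'run', 'run']
```

The stemmer produces the non-word "runn" and cannot relate "ran" to "run", while the lookup succeeds only because it is given the part of speech. With a different tag, for instance `toy_lemmatize("ran", "n")`, the word comes back unchanged.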

### Step 1: Import libraries and read data

In [1]:
# Import required libraries
import nltk
import string
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [16]:
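# Download necessary NLTK data files (only needed once).
# Recent NLTK releases look for 'punkt_tab' and 'averaged_perceptron_tagger_eng';
# older releases use 'punkt' and 'averaged_perceptron_tagger'.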
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [9]:
# Load Text Data (Replace with your text source if needed)
document = (
    "This is an example sentence, demonstrating the process of lemmatization. "
    "Running, ran, and runs are different forms of the verb run."
)

document

'This is an example sentence, demonstrating the process of lemmatization. Running, ran, and runs are different forms of the verb run.'

### Step 2: Tokenization

In [11]:
# Tokenization
tokens = word_tokenize(document)

print(tokens)

['This', 'is', 'an', 'example', 'sentence', ',', 'demonstrating', 'the', 'process', 'of', 'lemmatization', '.', 'Running', ',', 'ran', ',', 'and', 'runs', 'are', 'different', 'forms', 'of', 'the', 'verb', 'run', '.']


In [23]:
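# Lowercase every token so that later stopword matching is case-insensitive
tokens = [token.lower() for token in tokens]

print(tokens)

['this', 'is', 'an', 'example', 'sentence', ',', 'demonstrating', 'the', 'process', 'of', 'lemmatization', '.', 'running', ',', 'ran', ',', 'and', 'runs', 'are', 'different', 'forms', 'of', 'the', 'verb', 'run', '.']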


### Step 3: Remove Stopwords & Punctuation

In [24]:
# Remove Stopwords & Punctuation
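stop_words = set(stopwords.words('english'))
# Use a set so the punctuation check is an exact membership test, not a substring test
punctuation = set(string.punctuation)
tokens = [
    token for token in tokens
    if token.lower() not in stop_words and token not in punctuation
]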

print(tokens)

['example', 'sentence', 'demonstrating', 'process', 'lemmatization', 'running', 'ran', 'runs', 'different', 'forms', 'verb', 'run']
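One detail of the punctuation filter is worth a quick check: `string.punctuation` is a plain string, so `token not in string.punctuation` is a substring test rather than a per-character test. The sketch below compares it with a character-wise check. `is_punctuation` and the `sample` list are hypothetical names and data made up for this illustration, not part of the pipeline above.

```python
import string

PUNCT = set(string.punctuation)


def is_punctuation(token: str) -> bool:
    """True when the token is non-empty and consists only of ASCII punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)


# Made-up sample covering single and multi-character punctuation tokens
sample = ["run", ",", "...", "--", "()", "state-of-the-art"]

# Substring test: only drops tokens that occur verbatim inside string.punctuation
substring_kept = [t for t in sample if t not in string.punctuation]

# Character-wise test: drops every token made up purely of punctuation
charwise_kept = [t for t in sample if not is_punctuation(t)]

print(substring_kept)  # ['run', '...', '--', 'state-of-the-art']
print(charwise_kept)   # ['run', 'state-of-the-art']
```

For the example document both approaches give the same result, because `word_tokenize` emitted only single-character punctuation tokens. The difference would matter for text containing ellipses or repeated dashes.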


### Step 4: Lemmatization using nltk with correct POS tags

In [25]:
# Helper function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Lemmatization using nltk with correct POS tags
lemmatizer = WordNetLemmatizer()
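# First, get POS tags for the tokens.
# Note: the tagger only sees the filtered token list here, not the full
# sentence, so it has less context and may assign different tags.
pos_tags = nltk.pos_tag(tokens)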
# Then lemmatize each token with its corresponding POS tag
lemmatized_tokens = [
    lemmatizer.lemmatize(token, get_wordnet_pos(pos))
    for token, pos in pos_tags
]

print(lemmatized_tokens)

['example', 'sentence', 'demonstrate', 'process', 'lemmatization', 'run', 'ran', 'run', 'different', 'form', 'verb', 'run']
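In the output above, 'ran' is left unchanged while 'running' and 'runs' become 'run'. A likely explanation (an assumption, since the tags are not printed) is that the tagger gave 'ran' a non-verb tag, so the helper fell back to the noun lookup. The sketch below restates the tag mapping as a dictionary so the fallback is easy to inspect. `treebank_to_wordnet` and `TAG_PREFIX_TO_WORDNET` are hypothetical names, and the single-letter strings are the values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV`.

```python
# Single-letter values used by WordNet for its parts of speech
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

# The first letter of a Penn Treebank tag determines the WordNet POS
TAG_PREFIX_TO_WORDNET = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}


def treebank_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    return TAG_PREFIX_TO_WORDNET.get(tag[:1], NOUN)


examples = ["VBD", "VBG", "NNS", "JJ", "RB", "IN"]
mapped = {tag: treebank_to_wordnet(tag) for tag in examples}

print(mapped)
# {'VBD': 'v', 'VBG': 'v', 'NNS': 'n', 'JJ': 'a', 'RB': 'r', 'IN': 'n'}
```

Any tag outside the four prefixes, such as `IN`, is treated as a noun. Printing `pos_tags` in the cell above is a simple way to confirm which tag each token actually received.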


### Step 5: Reconstruct Processed Text

In [27]:
# Reconstruct Processed Text
processed_text_nltk = " ".join(lemmatized_tokens)
print("Processed Text (nltk):")
print(processed_text_nltk)

Processed Text (nltk):
example sentence demonstrate process lemmatization run ran run different form verb run


### Step 6: Save the Processed Data to a File

In [28]:
# Save the Processed Data to a File
with open("processed_text_nltk.txt", "w", encoding="utf-8") as f:
    f.write(processed_text_nltk)
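To check that a saved file can be read back into the same tokens, a round trip like the following can help. It is a minimal sketch: the file name `processed_text_demo.txt` and the `text` placeholder are assumptions for illustration, and a temporary directory is used so nothing is left on disk.

```python
import tempfile
from pathlib import Path

# Placeholder standing in for the processed text produced earlier
text = "example sentence demonstrate process"

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "processed_text_demo.txt"  # hypothetical file name
    # An explicit encoding keeps the result identical across platforms
    out_path.write_text(text, encoding="utf-8")
    restored = out_path.read_text(encoding="utf-8")

# Splitting on whitespace recovers the token list that was joined with " "
restored_tokens = restored.split()
print(restored_tokens)  # ['example', 'sentence', 'demonstrate', 'process']
```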

### Using spaCy for Text Processing

In [29]:
# ------------------------------
# Using spaCy for Text Processing
# ------------------------------

import spacy

# Load spaCy's English language model
# Make sure to run: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Process the text with spaCy
doc = nlp(document)

# 3 & 4. Remove stopwords & punctuation, then lemmatize using spaCy's attributes
lemmatized_tokens_spacy = [
    token.lemma_ for token in doc if not token.is_stop and not token.is_punct
]

# 5. Reconstruct Processed Text
processed_text_spacy = " ".join(lemmatized_tokens_spacy)
print("\nProcessed Text (spaCy):")
print(processed_text_spacy)

# 6. Save the Processed Data to a File
with open("processed_text_spacy.txt", "w", encoding="utf-8") as f:
    f.write(processed_text_spacy)


Processed Text (spaCy):
example sentence demonstrate process lemmatization running run run different form verb run
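The two pipelines do not agree on every token. A position-by-position comparison of the two processed strings printed in this notebook makes the differences explicit. The strings are copied from the outputs above, and the variable names are chosen here for illustration.

```python
# Processed strings copied from the NLTK and spaCy outputs in this notebook
nltk_text = (
    "example sentence demonstrate process lemmatization "
    "run ran run different form verb run"
)
spacy_text = (
    "example sentence demonstrate process lemmatization "
    "running run run different form verb run"
)

nltk_tokens = nltk_text.split()
spacy_tokens = spacy_text.split()

# Both lists have the same length here, so zip pairs the tokens position by position
differences = [
    (index, nltk_lemma, spacy_lemma)
    for index, (nltk_lemma, spacy_lemma) in enumerate(zip(nltk_tokens, spacy_tokens))
    if nltk_lemma != spacy_lemma
]

for index, nltk_lemma, spacy_lemma in differences:
    print(f"position {index}: nltk={nltk_lemma!r} spaCy={spacy_lemma!r}")
# position 5: nltk='run' spaCy='running'
# position 6: nltk='ran' spaCy='run'
```

NLTK left 'ran' unchanged and spaCy left 'running' unchanged, which suggests that each library assigned a non-verb tag to a different token. Inspecting `token.pos_` in spaCy and `pos_tags` in NLTK would confirm this.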
