
# Lemmatization

* **Definition**: Lemmatization reduces a word to its **base form (lemma)**. Unlike stemming, it relies on a **vocabulary, morphological analysis, and POS (part-of-speech) tags**.
* It produces **real words** (not chopped forms).
* Example:

  * **Stemming**: *studies → studi*
  * **Lemmatization**: *studies → study*
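
The contrast above can be sketched in plain Python. This is a toy illustration only: `SUFFIXES`, `LEMMA_DICT`, `toy_stem`, and `toy_lemmatize` are hypothetical names invented here, and real stemmers and lemmatizers use far richer rules and vocabularies.

```python
# Toy illustration only; not the algorithm used by NLTK or spaCy.

# Hypothetical suffix list for the toy stemmer, checked in order.
SUFFIXES = ("es", "ing", "s")

# Hypothetical mini-dictionary mapping inflected forms to lemmas.
LEMMA_DICT = {"studies": "study", "studying": "study", "children": "child"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, without checking that the result is a word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)


for w in ("studies", "studying", "children"):
    print(f"{w}: stem={toy_stem(w)!r}, lemma={toy_lemmatize(w)!r}")
```

The stemmer has no way to handle an irregular form like *children*, because no suffix rule applies; the dictionary lookup handles it directly.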

---

## 🔹 Why is Lemmatization Better than Stemming?

1. **Valid Words**: Lemmas are dictionary words.
2. **POS-Aware**: A lemmatizer needs to know whether the word is a *noun, verb, adjective*, etc. Example:

   * "better" →

     * As an adjective: *good*
     * As a verb: *better*
3. **Context-Sensitive**: Relies on dictionary lookups and morphological rules rather than blind suffix stripping, so the result depends on how the word is used.
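
One way to picture POS-aware lookup is a dictionary keyed by the pair (word, POS). The sketch below is a minimal assumption-laden example: `LEXICON` and `lemmatize_with_pos` are hypothetical, the POS labels are informal, and the *saw* entries are an extra example not taken from the notes above.

```python
# Hypothetical mini-lexicon keyed by (word, POS). A real lemmatizer would
# consult a full dictionary such as WordNet instead.
LEXICON = {
    ("better", "ADJ"): "good",    # "a better plan"
    ("better", "VERB"): "better",  # "to better oneself"
    ("saw", "VERB"): "see",       # added example: past tense of "see"
    ("saw", "NOUN"): "saw",       # added example: the cutting tool
}


def lemmatize_with_pos(word: str, pos: str) -> str:
    """Return the lemma for (word, pos); unknown pairs come back unchanged."""
    return LEXICON.get((word.lower(), pos), word)


print(lemmatize_with_pos("better", "ADJ"))
print(lemmatize_with_pos("better", "VERB"))
```

The same surface form maps to different lemmas only because the POS is part of the lookup key.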

---

## 🔹 Process of Lemmatization

1. **Tokenization** → Split text into words/sentences.
2. **POS Tagging** → Assign part of speech to each word.
3. **Lemmatization** → Look up each word, together with its POS tag, in a dictionary (WordNet in NLTK, or spaCy's lexicon) to get the lemma.
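
The three steps can be sketched end to end without any library. Everything here is a stand-in: `TOY_POS` and `TOY_LEMMAS` are hypothetical hand-written tables covering one sentence, whereas a real pipeline would use a trained tagger and a full dictionary.

```python
import re

# Hypothetical toy resources that cover only the sample sentence below.
TOY_POS = {"the": "DET", "dogs": "NOUN", "were": "AUX",
           "barking": "VERB", "loudly": "ADV"}
TOY_LEMMAS = {("dogs", "NOUN"): "dog", ("were", "AUX"): "be",
              ("barking", "VERB"): "bark"}


def tokenize(text):
    """Step 1: split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def pos_tag(tokens):
    """Step 2: attach a POS tag to each token ('X' when unknown)."""
    return [(tok, TOY_POS.get(tok.lower(), "X")) for tok in tokens]


def lemmatize(tagged):
    """Step 3: look up (word, POS); fall back to the lowercased token."""
    return [TOY_LEMMAS.get((tok.lower(), pos), tok.lower()) for tok, pos in tagged]


text = "The dogs were barking loudly."
tokens = tokenize(text)
tagged = pos_tag(tokens)
lemmas = lemmatize(tagged)
print(tokens)
print(tagged)
print(lemmas)
```

Each step consumes the previous step's output, which is why tagging errors propagate into the lemmas.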

---

## 🔹 Example

Sentence:
👉 *“The cats are sitting outside, and the children were playing happily.”*

* *cats → cat*
* *are → be*
* *sitting → sit*
* *children → child*
* *were → be*
* *playing → play*
* *happily → happily* (adverbs often remain unchanged)

---

## 🔹 Tools for Lemmatization

* **NLTK WordNet Lemmatizer** (basic; treats every word as a noun unless you pass a POS tag).
* **spaCy Lemmatizer** (more powerful; runs inside a trained pipeline such as `en_core_web_sm`, which supplies POS tags automatically).

---

## 🔹 When to Use Lemmatization

✅ When meaning and grammar matter (chatbots, translation, search engines).
✅ For **semantic NLP tasks**: text classification, QA, summarization.
❌ Skip it for **speed-focused tasks** like keyword extraction, where stemming may be enough.

---

In short:

* **Stemming** → Fast but crude suffix chopping (*studies → studi*).
* **Lemmatization** → Slower but linguistically accurate (*better → good*).

In [9]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)


In [None]:
# Demonstration of Lemmatization using NLTK and spaCy

# Import necessary libraries
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import spacy

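# Download required resources for NLTK
nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # POS tagger (older NLTK releases)
nltk.download('wordnet')                     # dictionary used by WordNetLemmatizer
nltk.download('omw-1.4')                     # Open Multilingual WordNet data
nltk.download('averaged_perceptron_tagger_eng')  # POS tagger name used by newer NLTK releases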


# Map Penn Treebank tags (as returned by nltk.pos_tag) to WordNet POS constants.
# Unrecognized tags fall back to noun, which is also the lemmatizer's default.
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Sample sentence
sentence = "The cats are sitting outside, and the children were playing happily."

# Tokenize and POS tagging
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)

# Initialize WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Apply lemmatization with POS tags
lemmatized_words_nltk = [
    lemmatizer.lemmatize(word, get_wordnet_pos(pos))
    for word, pos in pos_tags
]

# Now using spaCy for comparison
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
lemmatized_words_spacy = [token.lemma_ for token in doc]

# Display the tokens alongside both sets of lemmas (shown as the cell output)
tokens, lemmatized_words_nltk, lemmatized_words_spacy


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sangouda\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\sangouda\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sangouda\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\sangouda\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\sangouda\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


(['The',
  'cats',
  'are',
  'sitting',
  'outside',
  ',',
  'and',
  'the',
  'children',
  'were',
  'playing',
  'happily',
  '.'],
 ['The',
  'cat',
  'be',
  'sit',
  'outside',
  ',',
  'and',
  'the',
  'child',
  'be',
  'play',
  'happily',
  '.'],
 ['the',
  'cat',
  'be',
  'sit',
  'outside',
  ',',
  'and',
  'the',
  'child',
  'be',
  'play',
  'happily',
  '.'])
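
The tag conversion used in the cell above can also be tried on its own, with no NLTK data downloaded. This is a minimal sketch under two assumptions: the single-letter codes are written out as literals (they match the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV`), and the tagged pairs are hand-written examples of the kind of tags `nltk.pos_tag` returns, not actual tagger output. `WORDNET_POS` and `penn_to_wordnet` are hypothetical names.

```python
# First letter of a Penn Treebank tag -> single-letter WordNet POS code.
WORDNET_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return WORDNET_POS.get(tag[:1], "n")


# Hand-written (word, tag) pairs for illustration.
tagged = [("cats", "NNS"), ("sitting", "VBG"), ("happily", "RB"), ("the", "DT")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

Tags with no WordNet counterpart, such as determiners, fall through to the noun default, which is why *the* passes through the NLTK lemmatizer unchanged.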