# Lemmatization

Lemmatization reduces a word to its dictionary form, known as a lemma. Unlike stemming, which strips affixes and may produce non-words, lemmatization returns a valid word in the language. This is particularly useful in natural language processing tasks where the meaning of words matters. The examples below use NLTK's `WordNetLemmatizer`, which looks words up in WordNet, a large lexical database of English, and takes a part-of-speech argument (`pos`) to select the appropriate lemma.
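
To make the contrast with stemming concrete, here is a minimal toy sketch. It is not NLTK's algorithm: `toy_stem`, `toy_lemmatize`, and the `TOY_LEMMAS` table are hypothetical names invented for illustration, with the small hand-written table standing in for a real lexicon such as WordNet.

```python
# Toy illustration only: neither function reflects how NLTK works internally.
# TOY_LEMMAS is a hypothetical hand-written table standing in for WordNet.
TOY_LEMMAS = {
    "ponies": "pony",
    "geese": "goose",
    "corpora": "corpus",
}


def toy_stem(word: str) -> str:
    """Strip a common plural suffix without checking any dictionary."""
    for suffix in ("ies", "es", "s"):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the toy lexicon; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)


for word in ("ponies", "geese", "corpora"):
    print(f"{word}: stem={toy_stem(word)!r}, lemma={toy_lemmatize(word)!r}")
```

The suffix stripper turns "ponies" into the non-word "pon" and leaves the irregular plurals "geese" and "corpora" untouched, while the lookup returns a valid dictionary form in each case.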

In [8]:
nouns = ["cats", "geese", "rocks", "corpora"]
verbs = ["running", "ate", "swimming", "danced", "studying", "studies"]
adjectives = ["better", "best", "worse", "worst", "faster", "fastest"]
adverbs = ["faster", "fastest", "better", "best", "worse", "worst"]

In [9]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/dhruvsmac/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [10]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [11]:
lemmas = {
    "nouns": [lemmatizer.lemmatize(word, pos='n') for word in nouns],
    "verbs": [lemmatizer.lemmatize(word, pos='v') for word in verbs],
    "adjectives": [lemmatizer.lemmatize(word, pos='a') for word in adjectives],
    "adverbs": [lemmatizer.lemmatize(word, pos='r') for word in adverbs]
}
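
The `pos` codes above are hard-coded because each list holds a single part of speech. In running text the tag would usually come from a tagger, so a small helper along these lines may be useful. This is a sketch under assumptions: `penn_to_wordnet_pos` is a hypothetical name, and it assumes the tagger emits Penn Treebank tags.

```python
def penn_to_wordnet_pos(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter.

    Falls back to 'n' (noun), which is also the default `pos`
    of WordNetLemmatizer.lemmatize.
    """
    if tag.startswith("JJ"):  # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("VB"):  # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return "v"
    if tag.startswith("RB"):  # RB, RBR, RBS: adverbs
        return "r"
    return "n"                # NN, NNS, NNP, ... and everything else


for tag in ("NNS", "VBG", "JJR", "RBR"):
    print(tag, "->", penn_to_wordnet_pos(tag))
```

The returned letter could then be passed as the `pos` argument of `lemmatizer.lemmatize`, for example with tags produced by `nltk.pos_tag`.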

In [13]:
lemmas["nouns"], lemmas["verbs"], lemmas["adjectives"], lemmas["adverbs"]

(['cat', 'goose', 'rock', 'corpus'],
 ['run', 'eat', 'swim', 'dance', 'study', 'study'],
 ['good', 'best', 'bad', 'bad', 'fast', 'fast'],
 ['faster', 'fastest', 'well', 'best', 'worse', 'worst'])

In [17]:
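# Compare each original word with its lemma
# (a dict lookup replaces eval(), which should not be used to resolve variable names)
originals = {"nouns": nouns, "verbs": verbs, "adjectives": adjectives, "adverbs": adverbs}
for pos, original_words in originals.items():
    print(f"{pos.upper()}:")
    for orig, lemma in zip(original_words, lemmas[pos]):
        print(f"  {orig} → {lemma}")
    print()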

NOUNS:
  cats → cat
  geese → goose
  rocks → rock
  corpora → corpus

VERBS:
  running → run
  ate → eat
  swimming → swim
  danced → dance
  studying → study
  studies → study

ADJECTIVES:
  better → good
  best → best
  worse → bad
  worst → bad
  faster → fast
  fastest → fast

ADVERBS:
  faster → faster
  fastest → fastest
  better → well
  best → best
  worse → worse
  worst → worst

