# Lemmatization

Lemmatization, like stemming, reduces a word to a base form. The difference is that its output, called a ‘lemma’, is the word's dictionary form rather than a truncated stem (the output of stemming). After lemmatization we therefore get a valid dictionary word that carries the same core meaning.

## Key Differences Between Lemmatization and Stemming:
### Lemmatization:
- Converts words into their dictionary form (lemma).
- Considers the part of speech (POS) to determine the correct base form.
- Results in real words (e.g., "ran" → "run", "better" → "good").

### Stemming:
- Simply cuts off prefixes or suffixes to generate a stem.
- Does not consider the grammatical correctness, which may result in non-real words.
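- For example, a suffix-stripping stemmer may reduce "running" to "run" or to a shortened form like "runn", while an irregular form such as "ran" has no suffix to strip and is typically left unchanged.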
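
The contrast between the two lists can be made concrete with a deliberately naive sketch. The `naive_stem` function, its `SUFFIXES` list, and the tiny `LEMMA_TABLE` are illustrative assumptions, not NLTK's algorithms: real stemmers such as Porter apply many ordered rules, and real lemmatizers consult a full lexicon such as WordNet.

```python
# Toy illustration only: neither function reproduces NLTK's behaviour.

SUFFIXES = ("ing", "ies", "es", "s", "ed")  # hypothetical rule list, checked in order

def naive_stem(word):
    """Strip the first matching suffix without checking that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-lexicon mapping (word, POS) to a dictionary form.
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("studies", "n"): "study",
    ("better", "a"): "good",
}

def lookup_lemma(word, pos):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get((word, pos), word)

for word, pos in [("running", "v"), ("ran", "v"), ("studies", "n"), ("better", "a")]:
    print(f"{word:8} stem={naive_stem(word):8} lemma={lookup_lemma(word, pos)}")
```

The stemmer never consults a vocabulary, so it can return fragments such as "runn" and cannot relate "ran" to "run"; the lookup returns only forms that are in its table.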


NLTK provides the WordNetLemmatizer class, a thin wrapper around the WordNet corpus. It uses the morphy() function of the WordNetCorpusReader class to find a lemma. Let us understand it with an example:

In [1]:
from nltk.stem import WordNetLemmatizer

# Create a lemmatizer object
lemmatizer = WordNetLemmatizer()

In [2]:
lemmatizer.lemmatize("going")

LookupError: 
**********************************************************************
  Resource wordnet not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('wordnet')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/wordnet

  Searched in:
    - 'C:\\Users\\satee/nltk_data'
    - 'C:\\Users\\satee\\anaconda3\\envs\\gen_ai\\nltk_data'
    - 'C:\\Users\\satee\\anaconda3\\envs\\gen_ai\\share\\nltk_data'
    - 'C:\\Users\\satee\\anaconda3\\envs\\gen_ai\\lib\\nltk_data'
    - 'C:\\Users\\satee\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


In [3]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\satee\AppData\Roaming\nltk_data...


True

In [4]:
lemmatizer.lemmatize("going")

'going'

In [5]:
# WordNet POS codes:
#   n = noun
#   v = verb
#   a = adjective
#   r = adverb
lemmatizer.lemmatize("going", pos='v')

'go'

In [6]:
lemmatizer.lemmatize("going", pos='n')

'going'

In [7]:
lemmatizer.lemmatize("going", pos='a')

'going'

In [8]:
lemmatizer.lemmatize("going", pos='r')

'going'
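
The results above depend on the `pos` argument because a morphy-style lookup applies suffix rules specific to each part of speech and accepts a candidate only if it is a known word for that part of speech. When nothing matches, the input comes back unchanged. The sketch below imitates that idea in simplified form; `toy_morphy`, `VOCAB`, `EXCEPTIONS`, and `RULES` are hypothetical names with made-up miniature data, not NLTK's API or the WordNet lexicon.

```python
# Toy sketch of a morphy-style lookup; all data below is hypothetical.

# Known base forms per part of speech (a stand-in for a real lexicon).
VOCAB = {
    "v": {"go", "eat", "study"},
    "n": {"going", "study", "leaf"},
}

# Irregular forms that no suffix rule can recover.
EXCEPTIONS = {
    "v": {"went": "go", "ate": "eat"},
    "n": {},
}

# (suffix, replacement) pairs tried in order for each part of speech.
RULES = {
    "v": [("ies", "y"), ("ing", "e"), ("ing", ""), ("s", "")],
    "n": [("ves", "f"), ("ies", "y"), ("s", "")],
}

def toy_morphy(word, pos="n"):
    """Return a base form of `word` for `pos`, or `word` itself if none is found."""
    if word in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][word]
    if word in VOCAB.get(pos, set()):
        return word
    for suffix, replacement in RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in VOCAB.get(pos, set()):
                return candidate
    return word  # nothing matched: fall back to the input

print(toy_morphy("going", pos="v"))   # verb rules strip "ing" and find "go"
print(toy_morphy("going", pos="n"))   # already a known noun, returned as-is
print(toy_morphy("going", pos="r"))   # no adverb rules or vocabulary, returned as-is
print(toy_morphy("went", pos="v"))    # irregular form resolved via the exception list
```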

In [9]:
# Sample words
words=["eating","eats","eaten","writing","writes","history","finally","finalized", "running", "runs", "ran", "easily", "fairly", "studies", "studying", "programming","programs"]
words

['eating',
 'eats',
 'eaten',
 'writing',
 'writes',
 'history',
 'finally',
 'finalized',
 'running',
 'runs',
 'ran',
 'easily',
 'fairly',
 'studies',
 'studying',
 'programming',
 'programs']

In [12]:
# Lemmatize words without specifying POS (default is noun)
[f'{word} :: {lemmatizer.lemmatize(word)}' for word in words]

['eating :: eating',
 'eats :: eats',
 'eaten :: eaten',
 'writing :: writing',
 'writes :: writes',
 'history :: history',
 'finally :: finally',
 'finalized :: finalized',
 'running :: running',
 'runs :: run',
 'ran :: ran',
 'easily :: easily',
 'fairly :: fairly',
 'studies :: study',
 'studying :: studying',
 'programming :: programming',
 'programs :: program']
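
Because the default part of speech is noun, only the plural nouns above ("runs", "studies", "programs") change. When no POS tag is available, one possible heuristic is to try several parts of speech and keep the first result that differs from the input. The helper below is a sketch of that heuristic, not an NLTK feature: `lemmatize_any_pos` is a hypothetical name, and it accepts any callable that takes a word and a `pos` keyword, as `lemmatizer.lemmatize` does. A small stand-in lemmatizer, limited to results shown in this notebook, keeps the example self-contained.

```python
def lemmatize_any_pos(word, lemmatize, pos_order=("n", "v", "a", "r")):
    """Try each POS in turn; return the first lemma that differs from `word`."""
    for pos in pos_order:
        lemma = lemmatize(word, pos=pos)
        if lemma != word:
            return lemma
    return word

# Stand-in for lemmatizer.lemmatize, limited to results shown in this notebook.
KNOWN = {
    ("going", "v"): "go",
    ("goes", "v"): "go",
    ("studies", "n"): "study",
    ("runs", "n"): "run",
}

def stub_lemmatize(word, pos="n"):
    return KNOWN.get((word, pos), word)

print([lemmatize_any_pos(w, stub_lemmatize) for w in ["going", "goes", "studies", "fairly"]])
# prints ['go', 'go', 'study', 'fairly']
```

With NLTK installed and the WordNet data downloaded, passing `lemmatizer.lemmatize` in place of `stub_lemmatize` should work the same way. The heuristic can pick the wrong reading for ambiguous words, so a POS tagger is usually the better choice when full sentences are available.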

In [13]:
lemmatizer.lemmatize("goes",pos='v')

'go'

In [14]:
lemmatizer.lemmatize("fairly",pos='v'),lemmatizer.lemmatize("sportingly")

('fairly', 'sportingly')

## Lemmatizing with POS Tagging

In [21]:
from nltk import pos_tag
from nltk.corpus import wordnet

# Function to convert an NLTK (Penn Treebank) POS tag to a WordNet POS tag
def get_wordnet_pos(word):
    tagged = pos_tag([word])  # Tag the word once, in isolation (no sentence context)
    print(f'{word} :: {tagged}')
    tag = tagged[0][1][0].upper()  # Get first letter of POS tag
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)  # Default to noun if not found


In [22]:
# Sample words
words = ["running", "ran", "better", "studies", "studying", "easily", "fairly", "leaves"]

In [18]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\satee\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.


True

In [23]:
# Lemmatize with POS tagging
[lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]

running :: [('running', 'VBG')]
ran :: [('ran', 'NN')]
better :: [('better', 'RBR')]
studies :: [('studies', 'NNS')]
studying :: [('studying', 'VBG')]
easily :: [('easily', 'RB')]
fairly :: [('fairly', 'RB')]
leaves :: [('leaves', 'NNS')]


['run', 'ran', 'well', 'study', 'study', 'easily', 'fairly', 'leaf']
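
Tagging words one at a time gives the tagger no context, which is why "ran" is tagged `NN` above and comes back unlemmatized. Tagging a whole sentence first and then lemmatizing each token usually works better. The sketch below covers only the mapping and lemmatizing step: `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helpers, the `(word, tag)` pairs are written by hand to stand in for what `nltk.pos_tag` might return for a tokenized sentence, and the stand-in lemmatizer is an assumption that mimics `lemmatizer.lemmatize` for a single word.

```python
# WordNet's single-letter POS codes (the values of wordnet.ADJ, NOUN, VERB, ADV).
ADJ, NOUN, VERB, ADV = "a", "n", "v", "r"

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag such as 'VBD' to a WordNet POS code (noun by default)."""
    return {"J": ADJ, "N": NOUN, "V": VERB, "R": ADV}.get(tag[:1], NOUN)

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs with any callable taking a word and a `pos` keyword."""
    return [lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged_tokens]

# Hand-written tags standing in for tagger output on the sentence "She ran faster".
tagged = [("She", "PRP"), ("ran", "VBD"), ("faster", "RBR")]

# Stand-in lemmatizer; swap in lemmatizer.lemmatize when NLTK and WordNet are available.
def stub_lemmatize(word, pos="n"):
    return {("ran", "v"): "run"}.get((word, pos), word)

print(lemmatize_tagged(tagged, stub_lemmatize))
# prints ['She', 'run', 'faster']
```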

### When to Use Lemmatization:
- Use lemmatization when the output must consist of valid words (e.g., text mining, natural language understanding, or information retrieval tasks).
- Use it when you want to keep the actual dictionary form of words rather than arbitrary stems.

### Practical Use Cases:
- **Search Engines:** To map different forms of a word to a common base for better search results.
- **Text Summarization:** To reduce words to their dictionary form before analysis.
- **Chatbots and NLP:** To standardize inputs for better language understanding.
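
As a sketch of the search-engine use case, the toy inverted index below stores documents under normalized terms, so a query for one word form matches documents containing another. `build_index`, `search`, `normalize`, and the small `LEMMAS` table are hypothetical; in practice `normalize` would presumably wrap a real lemmatizer such as WordNetLemmatizer.

```python
from collections import defaultdict

# Hypothetical lemma table standing in for a real lemmatizer.
LEMMAS = {
    "studies": "study",
    "studying": "study",
    "leaves": "leaf",
    "ran": "run",
    "running": "run",
}

def normalize(token):
    """Lowercase a token and map it to its base form when one is known."""
    token = token.lower()
    return LEMMAS.get(token, token)

def build_index(docs):
    """Map each normalized term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every normalized query term."""
    terms = [normalize(t) for t in query.split()]
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

docs = {1: "She studies autumn leaves", 2: "He ran home", 3: "Studying while running"}
index = build_index(docs)
print(sorted(search(index, "study")))    # prints [1, 3]
print(sorted(search(index, "running")))  # prints [2, 3]
```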