# Lemmatization

Based on the **Stats Wire** video: https://www.youtube.com/watch?v=gBXXuL4HfCo

## Concept

Lemmatization is a linguistic and NLP technique used to determine the base or dictionary form of a word, known as the **lemma**. Unlike stemming, which strips affixes (mostly suffixes) by rule to obtain a root form that may not be a real word, lemmatization takes into account the context and part of speech (POS) of the word to produce a valid word that exists in the language.

The goal of lemmatization is to reduce words to their canonical or base form, which can improve the accuracy and interpretability of NLP tasks such as information retrieval, text classification, and language modeling. By converting words to their lemmas, variations of the same word are grouped together, enhancing the analysis by treating them as the same entity.

Lemmatization algorithms use dictionaries, lexical databases, or morphological analysis to map words to their lemmas. They consider factors such as word morphology, POS tags, and contextual information to ensure an accurate transformation. For example, the lemma of "was" (as a verb) is "be", and the lemma of "better" (as an adjective) is "good".
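The dictionary-lookup idea can be sketched in a few lines of plain Python. This is a minimal illustration only, not how NLTK or any real lemmatizer is implemented: the table `LEMMA_TABLE`, the function `toy_lemmatize`, and the POS labels are hypothetical names invented for this example.

```python
# Toy lemma table keyed by (lowercased word, POS tag).
# The entries and the POS labels are illustrative assumptions, not real lexicon data.
LEMMA_TABLE = {
    ("was", "VERB"): "be",
    ("is", "VERB"): "be",
    ("has", "VERB"): "have",
    ("better", "ADJ"): "good",
    ("rights", "NOUN"): "right",
    ("members", "NOUN"): "member",
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word unchanged if there is no entry."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("was", "VERB"))     # be
print(toy_lemmatize("better", "ADJ"))   # good
print(toy_lemmatize("better", "NOUN"))  # better (no entry for this POS, so unchanged)
```

The same surface form gives different results depending on the POS tag, which is why a lemmatizer needs that information to choose the right lemma.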

Compared to stemming, lemmatization typically produces more linguistically accurate results, as it considers the word's part of speech and semantic meaning. However, lemmatization can be computationally more expensive than stemming due to the need for morphological analysis and access to dictionary resources.
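To make the contrast concrete, the sketch below sets a naive suffix stripper beside a lookup table. Both `toy_stem` and `TOY_LEMMAS` are hypothetical and deliberately tiny; the suffix list is not the Porter algorithm, only a stand-in for rule-based stripping.

```python
# Suffixes tried in order; an illustrative list, NOT the Porter rule set.
SUFFIXES = ("ation", "ing", "ity", "ed", "es", "s")

# Tiny hand-written lemma lookup (assumed entries, for illustration only).
TOY_LEMMAS = {"rights": "right", "dignity": "dignity", "was": "be"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


for word in ("rights", "dignity", "was"):
    print(word, toy_stem(word), TOY_LEMMAS.get(word, word))
# rights right right
# dignity dign dignity
# was was be
```

The rule-based version needs no external resources but produces the non-word "dign" and cannot relate "was" to "be". The lookup handles both, at the cost of building and consulting a dictionary.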

In summary, lemmatization is a technique used in NLP to transform words into their base forms (lemmas) by considering their part of speech and language context. It helps improve the accuracy and interpretability of NLP tasks by grouping related word variations together.

In [1]:
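import nltk
# The cells below need the 'udhr' and 'wordnet' NLTK data packages, e.g. nltk.download('udhr'); nltk.download('wordnet')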

In [2]:
# Load the English text of the Universal Declaration of Human Rights as a list of word tokens
udhr = nltk.corpus.udhr.words('English-Latin1')

In [3]:
print(udhr)

['Universal', 'Declaration', 'of', 'Human', 'Rights', ...]


In [4]:
udhr[:30]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'rights',
 'of',
 'all',
 'members',
 'of',
 'the',
 'human',
 'family',
 'is',
 'the',
 'foundation',
 'of']

In [5]:
porter = nltk.PorterStemmer()

In [11]:
# Stemming reduces words to their stems, which are often not real words (e.g. 'univers', 'declar').
# For more meaningful results that respect word meaning and context, we apply lemmatization afterwards.
[porter.stem(w) for w in udhr]

['univers',
 'declar',
 'of',
 'human',
 'right',
 'preambl',
 'wherea',
 'recognit',
 'of',
 'the',
 'inher',
 'digniti',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalien',
 'right',
 'of',
 'all',
 'member',
 'of',
 'the',
 'human',
 'famili',
 'is',
 'the',
 'foundat',
 'of',
 'freedom',
 ',',
 'justic',
 'and',
 'peac',
 'in',
 'the',
 'world',
 ',',
 'wherea',
 'disregard',
 'and',
 'contempt',
 'for',
 'human',
 'right',
 'have',
 'result',
 'in',
 'barbar',
 'act',
 'which',
 'have',
 'outrag',
 'the',
 'conscienc',
 'of',
 'mankind',
 ',',
 'and',
 'the',
 'advent',
 'of',
 'a',
 'world',
 'in',
 'which',
 'human',
 'be',
 'shall',
 'enjoy',
 'freedom',
 'of',
 'speech',
 'and',
 'belief',
 'and',
 'freedom',
 'from',
 'fear',
 'and',
 'want',
 'ha',
 'been',
 'proclaim',
 'as',
 'the',
 'highest',
 'aspir',
 'of',
 'the',
 'common',
 'peopl',
 ',',
 'wherea',
 'it',
 'is',
 'essenti',
 ',',
 'if',
 'man',
 'is',
 'not',
 'to',
 'be',
 'compel',
 'to',
 'have',
 'recours',
 '

In [12]:
wnl = nltk.WordNetLemmatizer()

In [13]:
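# Apply lemmatization so the results stay valid words.
# lemmatize() defaults to pos='n' (noun), which is why 'has' becomes 'ha' and 'as' becomes 'a' in the output.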
[wnl.lemmatize(w) for w in udhr]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'right',
 'of',
 'all',
 'member',
 'of',
 'the',
 'human',
 'family',
 'is',
 'the',
 'foundation',
 'of',
 'freedom',
 ',',
 'justice',
 'and',
 'peace',
 'in',
 'the',
 'world',
 ',',
 'Whereas',
 'disregard',
 'and',
 'contempt',
 'for',
 'human',
 'right',
 'have',
 'resulted',
 'in',
 'barbarous',
 'act',
 'which',
 'have',
 'outraged',
 'the',
 'conscience',
 'of',
 'mankind',
 ',',
 'and',
 'the',
 'advent',
 'of',
 'a',
 'world',
 'in',
 'which',
 'human',
 'being',
 'shall',
 'enjoy',
 'freedom',
 'of',
 'speech',
 'and',
 'belief',
 'and',
 'freedom',
 'from',
 'fear',
 'and',
 'want',
 'ha',
 'been',
 'proclaimed',
 'a',
 'the',
 'highest',
 'aspiration',
 'of',
 'the',
 'common',
 'people',
 ',',
 'Whereas',
 'it',
 'is',
 'essential',
 ',',
 'if',
 'man',
 'is',
 'not',
 'to',
 