# Text Normalization

Text normalization is the process of transforming text into a consistent, standardized format, which facilitates more effective text analysis and natural language processing (NLP). This involves various steps to clean and homogenize the text data, making it easier for algorithms to process and understand. The main goals of text normalization are to reduce noise, improve the quality of the text, and ensure that similar entities are represented in the same way.

## Lemmatization

Lemmatization is the process of reducing a word to its base or dictionary form, known as a lemma. Unlike stemming, it uses a vocabulary and morphological analysis, together with the word's part of speech, so that the result is a valid word in the language.

This notebook uses NLTK's `WordNetLemmatizer`, which relies on WordNet's built-in `morphy` function.
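To give a rough idea of what a `morphy`-style lookup does, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name `toy_lemmatize_verb`, the tiny vocabulary, the exception list, and the suffix rules are all hypothetical stand-ins chosen for illustration. The idea is that irregular forms are resolved through an exception list, and regular forms are handled by stripping a suffix and keeping the result only if it is a known word.

```python
# Toy sketch of a morphy-style lookup for verbs.
# The vocabulary, exceptions and rules are hypothetical stand-ins,
# far smaller than WordNet's real data.
VERB_VOCAB = {"eat", "run", "go"}
VERB_EXCEPTIONS = {"ran": "run", "running": "run", "gone": "go"}
# (suffix, replacement) pairs, tried in order.
VERB_RULES = [("ies", "y"), ("es", "e"), ("es", ""), ("s", ""),
              ("ed", "e"), ("ed", ""), ("ing", "e"), ("ing", "")]


def toy_lemmatize_verb(word):
    """Return a lemma for `word`, or `word` unchanged if none is found."""
    # 1. The word is already a known base form.
    if word in VERB_VOCAB:
        return word
    # 2. Irregular forms are resolved through the exception list.
    if word in VERB_EXCEPTIONS:
        return VERB_EXCEPTIONS[word]
    # 3. Strip a suffix and accept the candidate only if it is a known word.
    for suffix, replacement in VERB_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VERB_VOCAB:
                return candidate
    # 4. Nothing matched, so the input is returned as-is.
    return word


for w in ["eating", "ran", "goes", "history"]:
    print(w + "---->" + toy_lemmatize_verb(w))
```

Because a candidate is accepted only when it appears in the vocabulary, this kind of lookup never returns a truncated non-word; unknown inputs simply pass through unchanged.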

In [1]:
# Initializing words

words = ["Eat", "eating", "run", "ran", "running", "history", "fairly", "go", "goes", "gone"]

# WordNet Lemmatizer

In [2]:
from nltk.stem import WordNetLemmatizer

In [3]:
wn = WordNetLemmatizer()

In [6]:
import nltk
nltk.download('wordnet') # Downloading the wordnet database

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sharo\AppData\Roaming\nltk_data...


True

In [8]:
# pos is the part-of-speech tag. Valid options are "n" for nouns, "v" for verbs,
# "a" for adjectives, "r" for adverbs and "s" for satellite adjectives.
# Here every word is lemmatized as a verb.
for word in words:
    print(word + "---->" + wn.lemmatize(word=word, pos='v'))

Eat---->Eat
eating---->eat
run---->run
ran---->run
running---->run
history---->history
fairly---->fairly
go---->go
goes---->go
gone---->go


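In this example, lemmatization avoids the mistakes made by the Porter and Snowball stemmers in the previous notebook, because it maps each word to a valid dictionary form instead of a truncated stem. The result still depends on the input and on the part of speech supplied: with `pos='v'`, words that are not found as verbs, such as "history" and "fairly", are returned unchanged, and the capitalized "Eat" is also left as-is, which suggests lowercasing text before lemmatizing it.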
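Since the lemma depends on the part of speech passed in, real text usually needs a per-word tag rather than a fixed `pos='v'`. The sketch below shows one common way to bridge the two: a small helper that maps Penn Treebank tags (the tag set returned by `nltk.pos_tag` by default) to the letters `lemmatize` expects. The helper name `penn_to_wordnet_pos` is hypothetical, and falling back to `"n"` is an assumption that mirrors the default `pos` of `WordNetLemmatizer.lemmatize`.

```python
def penn_to_wordnet_pos(tag, default="n"):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet pos letter."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("NN"):
        return "n"  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    # Anything else (determiners, prepositions, ...) uses the fallback.
    return default


# Hand-written (word, tag) pairs standing in for a tagger's output.
tagged = [("goes", "VBZ"), ("history", "NN"), ("fairly", "RB")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet_pos(tag))
```

With the lemmatizer from this notebook, the mapped value would be passed along as `wn.lemmatize(word, pos=penn_to_wordnet_pos(tag))`.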