# Text Normalization

Text normalization is the process of transforming text into a consistent, standardized format, which facilitates more effective text analysis and natural language processing (NLP). This involves various steps to clean and homogenize the text data, making it easier for algorithms to process and understand. The main goals of text normalization are to reduce noise, improve the quality of the text, and ensure that similar entities are represented in the same way.
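To make the idea of "representing similar entities in the same way" concrete, here is a minimal sketch that uses only the standard library. The `normalize` helper and the sample strings are assumptions made for illustration; real pipelines usually combine many more steps.

```python
import re

def normalize(text: str) -> str:
    """Lowercase the text, collapse runs of whitespace, and trim the ends."""
    lowered = text.lower()
    return re.sub(r"\s+", " ", lowered).strip()

# Three surface variants of the same phrase (hypothetical sample data)
variants = ["Text  Normalization", "text normalization ", "TEXT\tNORMALIZATION"]

# After normalization, all variants collapse to a single representation
normalized = {normalize(v) for v in variants}
print(normalized)
```

All three variants end up as one string, which is what lets downstream code treat them as the same entity.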

## Stemming

Stemming reduces a word to a root form by stripping its affixes, most commonly suffixes. The resulting stem may not be a valid word in the language, but it serves as a common base form for related words. Stemming is typically rule-based and does not consider the context in which the word appears.
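The rule-based idea can be sketched with a toy suffix stripper. The suffix list and the `toy_stem` function below are hypothetical and far simpler than any real stemming algorithm; they only illustrate why a stem is not always a valid word.

```python
# Hypothetical suffix list chosen for illustration; real stemmers use much richer rules.
SUFFIXES = ("ing", "ly", "es", "s")

def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["eating", "fairly", "goes", "running", "history"]:
    print(w, "->", toy_stem(w))
```

Because the rules look only at the characters of the word, "running" becomes "runn" and "goes" becomes "goe": neither is a real word, and no context is consulted.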

### Common Algorithms

* Porter Stemmer: One of the most widely used stemming algorithms.
* Snowball Stemmer: An improved version of the Porter Stemmer.
* Lancaster Stemmer: An aggressive stemming algorithm.

We are going to take a look at Porter Stemmer, RegexpStemmer and Snowball Stemmer to see how they differ.

In [1]:
# Initializing words

words = ["Eat", "eating", "run", "ran", "running", "history", "fairly", "go", "goes", "gone"]

### Porter Stemmer

In [2]:
from nltk.stem import PorterStemmer

In [3]:
porter_stemmer = PorterStemmer()

In [4]:
for word in words:
    print(word + "---->" + porter_stemmer.stem(word=word, to_lowercase=True))

Eat---->eat
eating---->eat
run---->run
ran---->ran
running---->run
history---->histori
fairly---->fairli
go---->go
goes---->goe
gone---->gone


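The Porter Stemmer handles regular inflections well ("eating" and "running" reduce to "eat" and "run"), but it produces the non-word stems "histori", "fairli" and "goe" for "history", "fairly" and "goes". It also leaves the irregular forms "ran" and "gone" unchanged.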

### RegexpStemmer

In [6]:
from nltk.stem import RegexpStemmer

In [7]:
res = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [19]:
res.stem("bee")  # min=4, so words shorter than 4 characters are not stemmed

'bee'

In [21]:
for word in words:
    print(word + "---->" + res.stem(word=word))

Eat---->Eat
eating---->eat
run---->run
ran---->ran
running---->runn
history---->history
fairly---->fairly
go---->go
goes---->goe
gone---->gon


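The RegexpStemmer works in a crude, pre-defined way: it removes whatever matches the given pattern and applies no other rules, which is why "running" becomes "runn" and "gone" becomes "gon".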
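The outputs above are consistent with a plain regular-expression substitution guarded by a length check. The sketch below reproduces that behaviour with the standard `re` module; `regexp_stem` is a hypothetical helper written for illustration and is not NLTK's actual source.

```python
import re

def regexp_stem(word: str, pattern: str = r"ing$|s$|e$|able$", min_len: int = 4) -> str:
    """Remove anything matching `pattern`, unless the word is shorter than min_len."""
    if len(word) < min_len:
        return word  # too short: returned unchanged
    return re.sub(pattern, "", word)

for w in ["bee", "eating", "running", "goes", "gone", "history"]:
    print(w + "---->" + regexp_stem(w))
```

Nothing here knows about doubled consonants or irregular verbs, so the quality of the stems depends entirely on the pattern supplied.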

### Snowball Stemmer

The Snowball Stemmer, also known as the Porter2 stemming algorithm, is a revised version of the Porter Stemmer that fixes some of its issues.

In [25]:
from nltk.stem import SnowballStemmer

In [26]:
snowball_stemmer = SnowballStemmer(language='english', ignore_stopwords=False)

In [27]:
for word in words:
    print(word + "---->" + snowball_stemmer.stem(word=word))

Eat---->eat
eating---->eat
run---->run
ran---->ran
running---->run
history---->histori
fairly---->fair
go---->go
goes---->goe
gone---->gone


The Snowball Stemmer makes the same mistakes as the Porter Stemmer, except for "fairly", which it correctly reduces to "fair".
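To check that comparison programmatically, the two stemmers can be run side by side and only the disagreements collected. This is a small sketch that assumes NLTK is installed; the variable names are chosen here for illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

words = ["Eat", "eating", "run", "ran", "running", "history", "fairly", "go", "goes", "gone"]

porter = PorterStemmer()
snowball = SnowballStemmer(language="english")

# Collect only the words on which the two stemmers disagree
differences = {}
for word in words:
    porter_stem = porter.stem(word)
    snowball_stem = snowball.stem(word)
    if porter_stem != snowball_stem:
        differences[word] = (porter_stem, snowball_stem)

print(differences)
```

Given the outputs shown in the two sections above, "fairly" should be the only entry for this word list.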