<h1 align = "center">Text Normalization</h1>

---
In the world of Natural Language Processing (NLP), we work with human language. However, human language is inherently messy, varied, and full of nuances that can be confusing for computers. Text normalization is the foundational process of cleaning and standardizing raw text into a consistent, predictable format. Think of it as tidying up a chaotic room before you can find anything; we are tidying up language so a machine learning model can understand it.

The primary goal is to reduce variation in text by mapping different forms of a word or phrase to a single, canonical form. For example, to a computer, the strings "run," "Run," and "running" are three distinct items. Text normalization lets them be treated as the same core concept, which simplifies the data for NLP models. This preprocessing step matters for almost all major NLP applications, including search, sentiment analysis, and machine translation.

**Why is it so Important?**

  * **Improved Model Performance:** Clean, standardized data helps models learn more effectively, leading to higher accuracy.
  * **Reduced Complexity:** It significantly shrinks the vocabulary the model needs to learn, which reduces computational costs and memory usage.
  * **Enhanced Feature Extraction:** When different forms of a word are treated as a single feature, their counts are pooled, so the feature's statistics rest on more observations and are more reliable.
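
The vocabulary reduction described above can be illustrated with a minimal sketch that uses only the Python standard library. The token list is hypothetical and chosen purely for illustration; it does not come from any corpus used in this notebook.

```python
# Hypothetical token list; not taken from any real corpus.
raw_tokens = ["run", "Run", "RUN", "running", "Running"]

# Case folding maps every token to a single case-insensitive form.
folded_tokens = [token.casefold() for token in raw_tokens]

# A set keeps one entry per distinct token, i.e. the vocabulary.
raw_vocabulary = set(raw_tokens)
folded_vocabulary = set(folded_tokens)

print(f"Vocabulary before folding: {len(raw_vocabulary)} entries")
print(f"Vocabulary after folding:  {len(folded_vocabulary)} entries")
```

Case folding alone merges the differently cased forms but still leaves "run" and "running" apart. Mapping those two to one form requires stemming or lemmatization.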

In [1]:
import os  # miscellaneous os interfaces
import sys # configuring python runtime environment

## NLP Libraries

Python offers a rich ecosystem of libraries for Natural Language Processing (NLP), catering to various needs from foundational tasks to advanced deep learning models. Here are some of the most prominent ones:

  1. [NLTK](https://www.nltk.org/) Natural Language Toolkit - a comprehensive library for foundational NLP tasks like tokenization, stemming, lemmatization, etc.
  2. [spaCy](https://spacy.io/) Industrial-Strength NLP - designed for production-level applications, emphasizing speed and efficiency.

In [2]:
# import nltk

### NLPurify

NLPurify is a text cleaning and extraction engine that combines traditional techniques, such as Unicode translation and regular-expression cleaning, with modern tools such as natural language processing and large language models to detect and clean long texts and create word vectors. The library is intended as a one-stop solution that adapts and collects common utility functions in one place.

In [3]:
import nlpurify as nlpu

# general convention is to assign the short form ``nlpu`` to the library
# print the current version of the library - for debugging and documentation
print(f"Current Version: {nlpu.__version__}")

Current Version: v2.1.0.dev0


In [8]:
text = '''
    This is a   uncLeaneD text    with lots of
   extra WHITE 
space.
'''

In [20]:
model = nlpu.preprocessing.normalization.WhiteSpace()
print(f"Normalized White Space: `{model.apply(text)}`")

Normalized White Space: `This is a uncLeaneD text with lots of extra WHITE space.`
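
For readers without the library installed, the same effect can be approximated with the standard library. This is a sketch under the assumption that whitespace normalization means collapsing every run of whitespace into one space and trimming the ends; `normalize_whitespace` is a hypothetical helper and says nothing about how `WhiteSpace` is implemented internally.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs and newlines into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Same content as the `text` variable above, written with explicit escapes.
sample = "\n    This is a   uncLeaneD text    with lots of\n   extra WHITE \nspace.\n"
cleaned = normalize_whitespace(sample)
print(f"Normalized White Space: `{cleaned}`")
```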


In [26]:
model = nlpu.preprocessing.normalization.CaseFolding()
print(f"Uniform Case Folding: `{model.apply(text)}`")

Uniform Case Folding: `
    this is a   uncleaned text    with lots of
   extra white 
space.
`
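
Note that the output keeps the original whitespace: case folding changes only the case of the characters. Python offers two built-in methods for this, `str.lower()` and `str.casefold()`. The sketch below shows where they differ; which of the two `CaseFolding` uses is not stated in this notebook, so no assumption is made about it here.

```python
# ASCII text: both methods give the same result.
ascii_sample = "uncLeaneD WHITE space"
print(ascii_sample.lower())
print(ascii_sample.casefold())

# Non-ASCII text: casefold() is more aggressive and is intended
# for caseless comparison rather than display.
german = "Straße"
lowered = german.lower()      # keeps the sharp s
folded = german.casefold()    # expands the sharp s to "ss"
print(lowered, folded)

# Caseless comparison of two spellings of the same word.
print("STRASSE".casefold() == german.casefold())
```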


In [30]:
model = nlpu.preprocessing.normalization.StopWords()
print(f"Stop Words Removed: `{model.apply(text)}`")

Stop Words Removed: `This uncLeaneD text lots extra WHITE space .`
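
The output above keeps the word `This`, which is consistent with a case-sensitive comparison against a lowercase stop list. That is an inference from the printed output, not documented behavior. The sketch below illustrates the effect with a tiny, hypothetical stop list and a hypothetical `remove_stop_words` helper; it is not the library's implementation.

```python
import re

# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"this", "is", "a", "with", "of"}


def remove_stop_words(text, stop_words=STOP_WORDS, ignore_case=False):
    """Drop stop words; words and punctuation marks become separate tokens."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    kept = [
        token
        for token in tokens
        if (token.lower() if ignore_case else token) not in stop_words
    ]
    return " ".join(kept)


sample = "\n    This is a   uncLeaneD text    with lots of\n   extra WHITE \nspace.\n"
case_sensitive = remove_stop_words(sample)
case_insensitive = remove_stop_words(sample, ignore_case=True)
print(case_sensitive)
print(case_insensitive)
```

With the case-sensitive comparison, `This` survives because only the lowercase `this` is in the list. Folding the case first, or comparing case-insensitively, removes it as well.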


In [31]:
model = nlpu.preprocessing.utils.WordTokenize(vanilla = True, tokenizer = False, vanilla_getalnum = True)
print(f"Word Tokens: `{model.apply(text)}`")

Word Tokens: `['This', 'is', 'a', 'uncLeaneD', 'text', 'with', 'lots', 'extra', 'WHITE']`
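
A plain-Python approximation of whitespace tokenization with an alphanumeric filter is sketched below. `tokenize_alnum` is a hypothetical helper based on an assumed reading of the `vanilla` and `vanilla_getalnum` options, and it is not expected to reproduce the library output token for token.

```python
def tokenize_alnum(text: str) -> list:
    """Split on whitespace and keep only purely alphanumeric tokens."""
    return [token for token in text.split() if token.isalnum()]


sample = "\n    This is a   uncLeaneD text    with lots of\n   extra WHITE \nspace.\n"
tokens = tokenize_alnum(sample)
print(tokens)
```

In this sketch the last token, `space.`, is discarded because its trailing period makes `str.isalnum()` return `False`. A similar filter may explain why `space` is absent from the output above.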


In [32]:
print(nlpu.preprocessing.normalization.normalize(text, upper = True, stopwords_in_uppercase = True))

UNCLEANED TEXT LOTS EXTRA WHITE SPACE .
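
The individual steps can also be chained by hand. The following is a minimal standard-library sketch of such a pipeline, assuming the order whitespace collapse, case conversion, then stop-word removal against a stop list in the matching case. `normalize_sketch` and its tiny stop list are hypothetical and do not describe the internals of `normalize`.

```python
import re

# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"this", "is", "a", "with", "of"}


def normalize_sketch(text: str, upper: bool = False) -> str:
    """Collapse whitespace, unify case, then drop stop words."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    cased = collapsed.upper() if upper else collapsed.lower()
    # Match the stop list to the chosen case so the comparison still works.
    stop_words = {word.upper() for word in STOP_WORDS} if upper else STOP_WORDS
    tokens = re.findall(r"\w+|[^\w\s]", cased)
    return " ".join(token for token in tokens if token not in stop_words)


sample = "\n    This is a   uncLeaneD text    with lots of\n   extra WHITE \nspace.\n"
upper_result = normalize_sketch(sample, upper=True)
lower_result = normalize_sketch(sample)
print(upper_result)
print(lower_result)
```

Converting the stop list to the same case as the text avoids the situation where a capitalized stop word slips through a case-sensitive comparison.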
