```{contents}
```

# Text Normalization

Text normalization is the process of transforming text into a **standard form**, so that variations in text (like case differences, punctuation, or spelling) do not affect analysis. Essentially, it reduces the complexity and variability of textual data.

**Goal:** Make text uniform so NLP models can focus on the semantics, not on irrelevant variations.

---

## **Key Steps in Text Normalization**

1. **Lowercasing**

   * Convert all text to lowercase to avoid treating "Apple" and "apple" differently.
   * Example: `"NLP is Amazing!" → "nlp is amazing!"`

2. **Removing Punctuation**

   * For many NLP tasks, punctuation carries little meaning and can be dropped.
   * Example: `"Hello, world!" → "Hello world"`

3. **Removing Numbers (optional)**

   * Numbers may or may not be relevant depending on the task.
   * Example: `"I have 2 apples"` → `"I have apples"`

4. **Removing Stopwords**

   * Words like “is,” “the,” “a” are common but carry little meaning.
   * Example: `"This is a sample text"` → `"sample text"`

5. **Spelling Correction**

   * Correct typos to standardize words.
   * Example: `"loove"` → `"love"`

6. **Tokenization**

   * Split text into smaller units (tokens), such as sentences or words.
   * Example: `"I love NLP." → ["I", "love", "NLP", "."]` (most word tokenizers keep punctuation as separate tokens)

7. **Stemming**

   * Reduce words to a stem by stripping suffixes with heuristic rules. The stem is not always a dictionary word.
   * Example: `"running" → "run"`, `"amazing" → "amaz"`

8. **Lemmatization**

   * Convert words to their dictionary form (lemma) using vocabulary and part-of-speech context.
   * Example: `"better" → "good"` (when treated as an adjective)

9. **Handling Emojis and Emoticons**

   * Convert emojis to text or remove them.
   * Example: `"I am happy 😊"` → `"I am happy"`

10. **Handling Contractions**

    * Expand contractions to standard form.
    * Example: `"I’m happy"` → `"I am happy"`

11. **Removing Extra Whitespace**

    * Clean up unnecessary spaces.
    * Example: `"Hello   world"` → `"Hello world"`
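
Steps 9 to 11 can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not a complete solution: the `CONTRACTIONS` table and the code-point ranges in `EMOJI_PATTERN` are small hand-picked examples, and the three function names are hypothetical.

```python
import re

# Hypothetical, deliberately tiny lookup table; real text needs a far larger one.
# Replacements are lowercase, so the original casing is not preserved.
CONTRACTIONS = {
    "i'm": "i am",
    "it's": "it is",
    "don't": "do not",
    "can't": "cannot",
}

# Assumed subset of emoji code-point ranges (not exhaustive).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # includes pictographs, emoticons, transport symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)


def expand_contractions(text: str) -> str:
    """Replace known contractions; typographic apostrophes are normalized first."""
    text = text.replace("\u2019", "'")
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def remove_emojis(text: str) -> str:
    """Delete characters that fall in the assumed emoji ranges."""
    return EMOJI_PATTERN.sub("", text)


def collapse_whitespace(text: str) -> str:
    """Turn any run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


sample = "I\u2019m   happy 😊  and it's   sunny"
cleaned = collapse_whitespace(remove_emojis(expand_contractions(sample)))
print(cleaned)
```

Running this prints `i am happy and it is sunny`. Contractions are expanded before any punctuation removal, because stripping the apostrophe first would leave `im` and `its`, which the lookup table no longer matches.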

---

## **Python Demonstration: Text Normalization Pipeline**



```python
!pip install textblob
```



```python
# Install required libraries
# !pip install nltk textblob

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from textblob import TextBlob

# Download the stopword list and the WordNet data used by the lemmatizer
nltk.download('stopwords')
nltk.download('wordnet')

text = "I loove NLP! It's amazing 😊 and I have 2 apples."

# 1️⃣ Lowercase
text = text.lower()

# 2️⃣ Remove punctuation (this also strips the emoji and turns "it's" into "its")
text = re.sub(r'[^\w\s]', '', text)

# 3️⃣ Remove numbers
text = re.sub(r'\d+', '', text)

# 4️⃣ Tokenization (whitespace split; also discards the extra spaces left above)
words = text.split()

# 5️⃣ Remove stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]

# 6️⃣ Spelling correction (word by word; unknown terms such as "nlp" may be changed)
words = [str(TextBlob(w).correct()) for w in words]

# 7️⃣ Stemming
stemmer = PorterStemmer()
words = [stemmer.stem(w) for w in words]

# 8️⃣ Lemmatization (applied to stems here, so it rarely changes anything)
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(w) for w in words]

normalized_text = " ".join(words)
print("Normalized Text:")
print(normalized_text)
```


```
Normalized Text:
love nap amaz appl
```




---

### **✅ Output Example**

**Original Text:**
`"I loove NLP! It's amazing 😊 and I have 2 apples."`

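**Normalized Text:**
`"love nap amaz appl"`

> After normalization, the text is lowercase and free of punctuation, digits, emoji, and stopwords. Note two side effects: the spelling corrector changed `nlp` to `nap`, and stemming produced non-words such as `amaz` and `appl`, so inspect the output before passing it to a model.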
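
In the printed output above, `nlp` came back as `nap`, which shows how blind spelling correction can damage domain terms. One possible guard, sketched here, is to skip correction for tokens in a protected vocabulary. The helper `correct_tokens`, the `PROTECTED` set, and the stand-in `toy_correct` function are all hypothetical names introduced for illustration; in the pipeline above the corrector would be something like `lambda w: str(TextBlob(w).correct())`.

```python
from typing import AbstractSet, Callable, Iterable, List

# Hypothetical domain vocabulary that must never be "corrected".
PROTECTED: AbstractSet[str] = frozenset({"nlp", "nltk", "textblob"})


def correct_tokens(
    tokens: Iterable[str],
    correct: Callable[[str], str],
    protected: AbstractSet[str] = PROTECTED,
) -> List[str]:
    """Apply `correct` to each token unless the token is protected."""
    return [t if t in protected else correct(t) for t in tokens]


# Toy stand-in for a real spelling corrector, only for demonstration.
def toy_correct(word: str) -> str:
    return {"loove": "love", "nlp": "nap"}.get(word, word)


print(correct_tokens(["loove", "nlp", "amazing"], toy_correct))
```

With the protected set in place, `loove` is still corrected to `love` while `nlp` passes through unchanged.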

---

**Key Points**

* Normalization reduces **noise** in text.
* Can help in NLP tasks such as **text classification, sentiment analysis, and language modeling**.
* The steps to apply depend on the **task**. For example, numbers and emojis can carry signal in sentiment analysis, so removing them may hurt.
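
Because the right steps depend on the task, it can help to make each step a switch. The sketch below uses only the standard library; the `NormalizeConfig` dataclass, its field names, and the small `STOPWORDS` set are assumptions made for illustration and are not part of NLTK or TextBlob.

```python
import re
from dataclasses import dataclass
from typing import List

# Tiny illustrative stopword set (assumption); NLTK's English list is much longer.
STOPWORDS = frozenset({"i", "a", "an", "the", "is", "and", "have"})


@dataclass(frozen=True)
class NormalizeConfig:
    lowercase: bool = True
    remove_punctuation: bool = True
    remove_numbers: bool = True
    remove_stopwords: bool = True


def normalize(text: str, cfg: NormalizeConfig = NormalizeConfig()) -> List[str]:
    """Apply the enabled steps in a fixed order and return the tokens."""
    if cfg.lowercase:
        text = text.lower()
    if cfg.remove_punctuation:
        # \w keeps letters, digits, and underscores; other non-whitespace is dropped.
        text = re.sub(r"[^\w\s]", "", text)
    if cfg.remove_numbers:
        text = re.sub(r"\d+", "", text)
    tokens = text.split()
    if cfg.remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


sentence = "I have 2 apples!"
print(normalize(sentence))
print(normalize(sentence, NormalizeConfig(remove_numbers=False)))
```

The first call prints `['apples']`; the second keeps the digit and prints `['2', 'apples']`, which is the kind of per-task choice described above.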
