# Text Normalization in Natural Language Processing

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/nlp-learning-journey/blob/main/examples/normalization.ipynb)

## Overview

Text normalization is the process of transforming text into a canonical (standard) form. This preprocessing step reduces the variability of text data and can improve the performance of NLP models by giving similar words and phrases a consistent representation.

## What You'll Learn

- Case normalization
- Punctuation handling
- Stemming and lemmatization
- Handling special characters and unicode
- Stop word removal
- Text cleaning techniques
- Domain-specific normalization

## Prerequisites

Basic understanding of Python, regular expressions, and NLP concepts.

## Setup and Installation

Let's install the required libraries for this notebook.

In [None]:
# Install required libraries
!pip install nltk spacy textblob unidecode contractions
!python -m spacy download en_core_web_sm

In [None]:
# Import required libraries
import nltk
import spacy
import re
import string
import unicodedata
from textblob import TextBlob
from unidecode import unidecode
import contractions

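# Download NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data used by word_tokenize in newer NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')  # needed alongside WordNet by some NLTK versions
nltk.download('averaged_perceptron_tagger')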

# Import NLTK modules
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

## Sample Text

Let's use a sample text with various normalization challenges.

In [None]:
sample_text = """
Hey There!!! This is a SAMPLE text for NORMALIZATION. It contains UPPERCASE and lowercase words, 
contractions like don't, won't, can't, and it's. There are also numbers like 123, 4.56, and $100.00.
Special characters: @#$%^&*()!!! And some accented characters: café, naïve, résumé.
URLs: https://www.example.com and emails: user@domain.com.
Multiple    spaces   and\ttabs\nand newlines need cleaning too.
Words like running, runs, ran should be normalized. Also books, book's, and booking.
Stop words: the, is, at, which, on should often be removed.
"""

print("Original Text:")
print(repr(sample_text))
print("\nDisplayed Text:")
print(sample_text)

## 1. Case Normalization

Converting text to a consistent case (usually lowercase) to treat words like "The" and "the" as the same.

In [None]:
# Case normalization examples
print("Original:", "This IS a MIXED case TEXT")
print("Lowercase:", "This IS a MIXED case TEXT".lower())
print("Uppercase:", "This IS a MIXED case TEXT".upper())
print("Title case:", "This IS a MIXED case TEXT".title())
print("Capitalize:", "This IS a MIXED case TEXT".capitalize())

# Apply to sample text
text_lower = sample_text.lower()
print("\nSample text in lowercase:")
print(text_lower[:200] + "...")
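
For non-English text, `str.lower()` is not always enough to make equivalent spellings compare equal. The sketch below uses only the standard library and contrasts it with `str.casefold()`, which applies more aggressive Unicode case folding. The word list is an illustrative assumption, not part of the sample text.

```python
# Three spellings of the same German word (illustrative data)
words = ["Straße", "STRASSE", "strasse"]

# lower() keeps 'ß' as-is, so two distinct forms remain
lowered = {w.lower() for w in words}

# casefold() maps 'ß' to 'ss', so all three collapse to one form
folded = {w.casefold() for w in words}

print("lower():   ", sorted(lowered))
print("casefold():", sorted(folded))
```

For English-only corpora the two methods usually behave the same, so `lower()` as used above remains a reasonable default.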

## 2. Whitespace Normalization

Handling multiple spaces, tabs, and newlines.

In [None]:
def normalize_whitespace(text):
    """
    Normalize whitespace by replacing multiple spaces, tabs, and newlines with single spaces
    """
    # Replace multiple whitespace characters with single space
    text = re.sub(r'\s+', ' ', text)
    # Strip leading and trailing whitespace
    text = text.strip()
    return text

# Test whitespace normalization
test_whitespace = "This   has    multiple\t\tspaces\n\nand\nnewlines"
print("Original:", repr(test_whitespace))
print("Normalized:", repr(normalize_whitespace(test_whitespace)))

# Apply to sample text
text_whitespace_norm = normalize_whitespace(text_lower)
print("\nSample text after whitespace normalization:")
print(text_whitespace_norm[:200] + "...")
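
Collapsing every whitespace run to a single space also erases paragraph boundaries, which some tasks (for example sentence or document segmentation) may need. A minimal sketch of a paragraph-preserving variant follows; the function name `normalize_whitespace_keep_paragraphs` is hypothetical and not part of the notebook.

```python
import re

def normalize_whitespace_keep_paragraphs(text):
    """Collapse whitespace inside paragraphs but keep blank-line paragraph breaks."""
    # Normalize Windows and old Mac line endings to '\n'
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    # A paragraph break is a newline, optional whitespace, then another newline
    paragraphs = re.split(r'\n\s*\n', text)
    # Within each paragraph, collapse any whitespace run to one space
    cleaned = [re.sub(r'\s+', ' ', p).strip() for p in paragraphs]
    # Drop empty paragraphs and rejoin with exactly one blank line
    return '\n\n'.join(p for p in cleaned if p)

demo = "First   line\tof text\ncontinues here.\n\n\nSecond    paragraph."
print(repr(normalize_whitespace_keep_paragraphs(demo)))
```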

## 3. Punctuation Handling

Different strategies for dealing with punctuation marks.

In [None]:
def remove_punctuation(text):
    """
    Remove all punctuation from text
    """
    return text.translate(str.maketrans('', '', string.punctuation))

def replace_punctuation_with_space(text):
    """
    Replace punctuation with spaces
    """
    return text.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))

def keep_sentence_punctuation(text):
    """
    Keep only sentence-ending punctuation
    """
    sentence_punct = '.!?'
    other_punct = ''.join(c for c in string.punctuation if c not in sentence_punct)
    return text.translate(str.maketrans('', '', other_punct))

# Test different punctuation strategies
punct_test = "Hello, world! How are you? I'm fine... Thanks!"
print("Original:", punct_test)
print("Remove all:", remove_punctuation(punct_test))
print("Replace with space:", normalize_whitespace(replace_punctuation_with_space(punct_test)))
print("Keep sentence punct:", keep_sentence_punctuation(punct_test))

# Apply to sample text
text_no_punct = remove_punctuation(text_whitespace_norm)
text_no_punct = normalize_whitespace(text_no_punct)
print("\nSample text after punctuation removal:")
print(text_no_punct[:200] + "...")
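
`string.punctuation` only covers ASCII punctuation, so curly quotes, ellipsis characters, and similar marks survive the functions above. One possible alternative, sketched here with the standard library only, filters on the Unicode general category instead. The name `remove_unicode_punctuation` is an assumption for illustration.

```python
import string
import unicodedata

def remove_unicode_punctuation(text):
    """Drop every character whose Unicode category starts with 'P' (punctuation)."""
    # Note: symbols such as '$' or '+' are category 'S', not 'P', so they are kept
    return ''.join(ch for ch in text if not unicodedata.category(ch).startswith('P'))

text = "“Hello,” she said… ‘fine’!"

# ASCII-only removal leaves the curly quotes and the ellipsis character behind
ascii_only = text.translate(str.maketrans('', '', string.punctuation))
unicode_aware = remove_unicode_punctuation(text)

print("ASCII only:   ", ascii_only)
print("Unicode-aware:", unicode_aware)
```

Because currency and math symbols are classified as symbols rather than punctuation, the two approaches are complementary rather than interchangeable.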

## 4. Contraction Expansion

Expanding contractions like "don't" → "do not", "won't" → "will not".

In [None]:
# Using the contractions library
def expand_contractions(text):
    """
    Expand contractions in text
    """
    return contractions.fix(text)

# Manual contraction dictionary (alternative approach)
contraction_dict = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "couldn't": "could not",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'll": "he will",
    "he's": "he is",
    "i'd": "i would",
    "i'll": "i will",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'll": "it will",
    "it's": "it is",
    "let's": "let us",
    "shouldn't": "should not",
    "that's": "that is",
    "there's": "there is",
    "they'd": "they would",
    "they'll": "they will",
    "they're": "they are",
    "they've": "they have",
    "we'd": "we would",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what's": "what is",
    "where's": "where is",
    "who's": "who is",
    "won't": "will not",
    "wouldn't": "would not",
    "you'd": "you would",
    "you'll": "you will",
    "you're": "you are",
    "you've": "you have"
}

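def expand_contractions_manual(text, contraction_mapping=contraction_dict):
    """
    Expand contractions using manual dictionary
    """
    for contraction, expansion in contraction_mapping.items():
        # Word boundaries prevent matches inside longer words;
        # expansions are lowercase, so the original casing is not preserved
        pattern = r'\b' + re.escape(contraction) + r'\b'
        text = re.sub(pattern, expansion, text, flags=re.IGNORECASE)
    return text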

# Test contraction expansion
contraction_test = "I don't think it's working. Won't you help me? I can't do it."
print("Original:", contraction_test)
print("Expanded (library):", expand_contractions(contraction_test))
print("Expanded (manual):", expand_contractions_manual(contraction_test.lower()))

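# Apply to our working text. Expand on the text that still contains apostrophes,
# then strip punctuation: once apostrophes are removed, forms such as "its"
# (from "it's") can no longer be recognized reliably.
text_expanded = normalize_whitespace(remove_punctuation(expand_contractions(text_whitespace_norm)))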
print("\nSample text after contraction expansion:")
print(text_expanded[:200] + "...")
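
The dictionary approach can also be written as a single compiled pattern that preserves initial capitalization and accepts curly apostrophes. This is a minimal standard-library sketch: `SMALL_CONTRACTIONS` is a deliberately tiny, hypothetical subset, and the mapping of "it's" to "it is" is a simplification, since it can also mean "it has".

```python
import re

# Hypothetical subset for illustration; a real mapping would be much larger
SMALL_CONTRACTIONS = {
    "don't": "do not",
    "won't": "will not",
    "can't": "cannot",
    "it's": "it is",
    "i'm": "i am",
}

# One alternation, longest keys first, anchored on word boundaries
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(SMALL_CONTRACTIONS, key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_regex(text):
    """Expand known contractions, keeping a leading capital letter."""
    # Treat the typographic apostrophe like the ASCII one
    text = text.replace("\u2019", "'")

    def _replace(match):
        word = match.group(0)
        expansion = SMALL_CONTRACTIONS[word.lower()]
        if word[0].isupper():
            expansion = expansion[0].upper() + expansion[1:]
        return expansion

    return _CONTRACTION_RE.sub(_replace, text)

print(expand_contractions_regex("Won't you help? I can't, and it's late."))
```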

## 5. Unicode and Accent Normalization

Handling accented characters and special unicode characters.

In [None]:
# Different approaches to unicode normalization
def remove_accents_unidecode(text):
    """
    Remove accents using unidecode (converts to ASCII)
    """
    return unidecode(text)

def remove_accents_unicode(text):
    """
    Remove accents using unicode normalization
    """
    # Normalize to NFD (decomposed form)
    text = unicodedata.normalize('NFD', text)
    # Remove combining characters (accents)
    text = ''.join(char for char in text if unicodedata.category(char) != 'Mn')
    return text

def normalize_unicode(text):
    """
    Normalize unicode to NFC form (composed)
    """
    return unicodedata.normalize('NFC', text)

# Test unicode normalization
unicode_test = "café naïve résumé piñata Zürich"
print("Original:", unicode_test)
print("Unidecode:", remove_accents_unidecode(unicode_test))
print("Unicode decompose:", remove_accents_unicode(unicode_test))
print("Unicode normalize:", normalize_unicode(unicode_test))

# Apply to our text
text_unicode_norm = remove_accents_unidecode(text_expanded)
print("\nSample text after unicode normalization:")
print(text_unicode_norm[:200] + "...")
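
The same visible string can be stored as different code point sequences, which is why normalizing to one form matters before comparing or deduplicating text. The sketch below, using only `unicodedata`, shows canonical normalization (NFC) and compatibility normalization (NFKC); the example strings are illustrative.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The strings render identically but are not equal until normalized
print("Equal before NFC:", composed == decomposed)
nfc = unicodedata.normalize("NFC", decomposed)
print("Equal after NFC: ", composed == nfc)

# NFKC additionally folds compatibility characters:
# the 'fi' ligature (U+FB01) and fullwidth letters (U+FF21..U+FF23)
compat = "\ufb01le \uff21\uff22\uff23"
nfkc = unicodedata.normalize("NFKC", compat)
nfc_compat = unicodedata.normalize("NFC", compat)
print("NFKC:", nfkc)
```

NFKC is lossy by design, so it is usually applied for matching and search rather than for text that must be displayed unchanged.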

## 6. Number Normalization

Different strategies for handling numbers in text.

In [None]:
def remove_numbers(text):
    """
    Remove all numbers from text
    """
    return re.sub(r'\d+', '', text)

def replace_numbers_with_token(text, token='<NUM>'):
    """
    Replace numbers with a special token
    """
    return re.sub(r'\d+', token, text)

def normalize_numbers(text):
    """
    More sophisticated number normalization
    """
    # Replace currency
    text = re.sub(r'\$\d+(?:\.\d+)?', '<CURRENCY>', text)
    # Replace percentages
    text = re.sub(r'\d+(?:\.\d+)?%', '<PERCENT>', text)
    # Replace decimal numbers
    text = re.sub(r'\d+\.\d+', '<DECIMAL>', text)
    # Replace integers
    text = re.sub(r'\d+', '<INTEGER>', text)
    return text

# Test number normalization
number_test = "I have $100.50 and scored 95.5% on the test. There are 42 students."
print("Original:", number_test)
print("Remove numbers:", normalize_whitespace(remove_numbers(number_test)))
print("Replace with token:", replace_numbers_with_token(number_test))
print("Sophisticated norm:", normalize_numbers(number_test))

# Apply simple number removal
text_no_numbers = normalize_whitespace(remove_numbers(text_unicode_norm))
print("\nSample text after number removal:")
print(text_no_numbers[:200] + "...")
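
`normalize_numbers` depends on applying its substitutions in the right order, and it does not handle thousands separators. A single-pass alternative with named groups is sketched below; `NUM`, `NUMBER_RE`, and `tag_numbers` are hypothetical names, and the pattern assumes US-style formatting (comma for thousands, dot for decimals).

```python
import re

# Digits with optional thousands separators, e.g. 42 or 1,250
NUM = r"\d+(?:,\d{3})*"

# Alternatives are tried left to right, so the most specific come first
NUMBER_RE = re.compile(
    rf"""
    (?P<currency>\${NUM}(?:\.\d+)?)
  | (?P<percent>{NUM}(?:\.\d+)?%)
  | (?P<decimal>{NUM}\.\d+)
  | (?P<integer>{NUM})
    """,
    re.VERBOSE,
)

def tag_numbers(text):
    """Replace each number with a token named after the group that matched."""
    return NUMBER_RE.sub(lambda m: f"<{m.lastgroup.upper()}>", text)

demo = "I paid $1,250.75 for 3 items, a 12.5% discount on 1,000 units at 4.56 each."
print(tag_numbers(demo))
```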

## 7. Stemming

Reducing words to a common stem by stripping suffixes with heuristic rules. The resulting stem is not always a dictionary word.

In [None]:
# Initialize stemmers
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer('english')
lancaster_stemmer = LancasterStemmer()

def compare_stemmers(words):
    """
    Compare different stemming algorithms
    """
    print(f"{'Word':<15} {'Porter':<15} {'Snowball':<15} {'Lancaster':<15}")
    print("-" * 60)
    
    for word in words:
        porter = porter_stemmer.stem(word)
        snowball = snowball_stemmer.stem(word)
        lancaster = lancaster_stemmer.stem(word)
        print(f"{word:<15} {porter:<15} {snowball:<15} {lancaster:<15}")

# Test stemming with example words
test_words = ['running', 'runs', 'ran', 'runner', 'easily', 'fairly', 'books', 'booking', 'booked']
print("Stemming Comparison:")
compare_stemmers(test_words)

# Apply stemming to sample text
def stem_text(text, stemmer):
    """
    Apply stemming to all words in text
    """
    words = word_tokenize(text)
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)

text_stemmed = stem_text(text_no_numbers, porter_stemmer)
print("\nSample text after Porter stemming:")
print(text_stemmed[:200] + "...")
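
To make the idea of rule-based suffix stripping concrete, here is a deliberately naive toy stemmer. It is not the Porter, Snowball, or Lancaster algorithm; `TOY_SUFFIXES` and `toy_stem` are hypothetical names, and the rules are an assumption chosen only to show typical stemmer behavior.

```python
# Suffixes are tried in this order; the first one that applies is removed
TOY_SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip one known suffix if at least min_stem characters remain."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["running", "runs", "ran", "books", "booking", "booked", "sing"]:
    print(f"{w:<10} -> {toy_stem(w)}")
```

The toy version turns "running" into "runn" and leaves the irregular form "ran" untouched. Real stemmers add further rules (for example, undoing doubled consonants), but irregular forms generally remain out of reach for suffix stripping.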

## 8. Lemmatization

Reducing words to their base or dictionary form (lemma) using linguistic knowledge.

In [None]:
# Initialize lemmatizers
nltk_lemmatizer = WordNetLemmatizer()

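def compare_stemming_lemmatization(words):
    """
    Compare stemming vs lemmatization.

    WordNetLemmatizer.lemmatize() treats every word as a noun unless a
    part-of-speech tag is passed, so the verb lemma (pos='v') is shown too.
    """
    print(f"{'Word':<15} {'Stemmed':<15} {'Lemma (noun)':<15} {'Lemma (verb)':<15}")
    print("-" * 60)
    
    for word in words:
        stemmed = porter_stemmer.stem(word)
        lemma_noun = nltk_lemmatizer.lemmatize(word)  # default pos='n'
        lemma_verb = nltk_lemmatizer.lemmatize(word, pos='v')
        print(f"{word:<15} {stemmed:<15} {lemma_noun:<15} {lemma_verb:<15}")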

# Test comparison
test_words = ['running', 'runs', 'ran', 'better', 'good', 'feet', 'foot', 'geese', 'goose', 'mice', 'mouse']
print("Stemming vs Lemmatization:")
compare_stemming_lemmatization(test_words)

# NLTK lemmatization
def lemmatize_text_nltk(text):
    """
    Apply NLTK lemmatization to text
    """
    words = word_tokenize(text)
    lemmatized_words = [nltk_lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)

# spaCy lemmatization (more advanced)
def lemmatize_text_spacy(text):
    """
    Apply spaCy lemmatization to text
    """
    doc = nlp(text)
    lemmatized_words = [token.lemma_ for token in doc]
    return ' '.join(lemmatized_words)

# Apply lemmatization
text_lemmatized_nltk = lemmatize_text_nltk(text_no_numbers)
text_lemmatized_spacy = lemmatize_text_spacy(text_no_numbers)

print("\nSample text after NLTK lemmatization:")
print(text_lemmatized_nltk[:200] + "...")

print("\nSample text after spaCy lemmatization:")
print(text_lemmatized_spacy[:200] + "...")
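
Unlike a stemmer, a lemmatizer needs to know the part of speech, because the same surface form can have different lemmas. The toy lookup below illustrates that dependency with the standard library only. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical, and real lemmatizers rely on full dictionaries and morphological rules rather than a handful of entries.

```python
# (surface form, part of speech) -> lemma; tiny illustrative table
LEMMA_TABLE = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
    ("better", "ADJ"): "good",
    ("ran", "VERB"): "run",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemmatize("saw", "VERB"))     # verb reading
print(toy_lemmatize("saw", "NOUN"))     # noun reading
print(toy_lemmatize("Leaves", "NOUN"))
```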

## 9. Stop Word Removal

Removing common words that don't carry much semantic meaning.

In [None]:
# Get English stop words
stop_words_nltk = set(stopwords.words('english'))
stop_words_spacy = nlp.Defaults.stop_words

# Custom stop words
custom_stop_words = {'also', 'would', 'could', 'should'}

print("NLTK stop words (first 20, sorted):", sorted(stop_words_nltk)[:20])
print(f"\nTotal NLTK stop words: {len(stop_words_nltk)}")
print(f"Total spaCy stop words: {len(stop_words_spacy)}")

def remove_stopwords(text, stop_words=stop_words_nltk):
    """
    Remove stop words from text
    """
    words = word_tokenize(text.lower())
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)

def remove_stopwords_spacy(text):
    """
    Remove stop words using spaCy
    """
    doc = nlp(text)
    filtered_words = [token.text for token in doc if not token.is_stop]
    return ' '.join(filtered_words)

# Test stop word removal
stopword_test = "This is a sample sentence with many common stop words that should be removed."
print("\nOriginal:", stopword_test)
print("Without stop words (NLTK):", remove_stopwords(stopword_test))
print("Without stop words (spaCy):", remove_stopwords_spacy(stopword_test))

# Apply to our text
text_no_stopwords = remove_stopwords(text_lemmatized_spacy)
print("\nSample text after stop word removal:")
print(text_no_stopwords[:200] + "...")
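
General-purpose stop word lists often contain negations such as "not" and "no", and removing them can flip the meaning of a sentence. A minimal sketch of keeping negations while filtering follows. `DEMO_STOP_WORDS`, `NEGATIONS`, and `remove_stopwords_keep_negation` are hypothetical, and the small stop list is a stand-in rather than the NLTK or spaCy list.

```python
# Small stand-in stop list for illustration only
DEMO_STOP_WORDS = {"this", "is", "a", "the", "not", "no", "was", "it", "and"}

# Words to protect from removal regardless of the stop list
NEGATIONS = {"not", "no", "never", "nor"}

def remove_stopwords_keep_negation(tokens, stop_words=DEMO_STOP_WORDS, keep=NEGATIONS):
    """Filter stop words but leave negation words in place."""
    effective = stop_words - keep
    return [t for t in tokens if t.lower() not in effective]

tokens = "This movie is not good".split()
plain = [t for t in tokens if t.lower() not in DEMO_STOP_WORDS]
protected = remove_stopwords_keep_negation(tokens)

print("Plain filter:     ", plain)
print("Negation-aware:   ", protected)
```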

## 10. Comprehensive Normalization Pipeline

Combining all normalization techniques into a single pipeline.

In [None]:
class TextNormalizer:
    """
    A comprehensive text normalization pipeline
    """
    
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
        self.stemmer = PorterStemmer()
        self.stop_words = set(stopwords.words('english'))
        self.nlp = spacy.load('en_core_web_sm')
    
    def normalize(self, text, 
                  lowercase=True,
                  remove_punctuation=True,
                  remove_numbers=True,
                  expand_contractions=True,
                  remove_accents=True,
                  lemmatize=True,
                  stem=False,
                  remove_stopwords=True,
                  remove_extra_whitespace=True):
        """
        Apply normalization pipeline to text
        """
        result = text
        
        # Step 1: Expand contractions (before lowercasing)
        if expand_contractions:
            result = contractions.fix(result)
        
        # Step 2: Convert to lowercase
        if lowercase:
            result = result.lower()
        
        # Step 3: Remove accents
        if remove_accents:
            result = unidecode(result)
        
        # Step 4: Remove numbers
        if remove_numbers:
            result = re.sub(r'\d+', '', result)
        
        # Step 5: Remove punctuation
        if remove_punctuation:
            result = result.translate(str.maketrans('', '', string.punctuation))
        
        # Step 6: Normalize whitespace
        if remove_extra_whitespace:
            result = re.sub(r'\s+', ' ', result).strip()
        
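        # Step 7: Token-level steps (stop word removal, then lemmatization or stemming)
        if lemmatize or stem or remove_stopwords:
            words = word_tokenize(result)
            
            if remove_stopwords:
                # Compare in lowercase so the filter still works when lowercase=False
                words = [word for word in words if word.lower() not in self.stop_words]
            
            # If both flags are set, lemmatization takes precedence over stemming
            if lemmatize:
                words = [self.lemmatizer.lemmatize(word) for word in words]
            elif stem:
                words = [self.stemmer.stem(word) for word in words]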
            
            result = ' '.join(words)
        
        return result

# Test the comprehensive normalizer
normalizer = TextNormalizer()

print("Original text:")
print(sample_text)
print("\n" + "="*80)

# Apply different normalization levels
basic_norm = normalizer.normalize(sample_text, 
                                  lemmatize=False, 
                                  remove_stopwords=False)
print("\nBasic normalization:")
print(basic_norm)

full_norm = normalizer.normalize(sample_text)
print("\nFull normalization:")
print(full_norm)
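
The order of the steps in a pipeline changes the result. The sketch below treats a pipeline as a plain list of functions and runs the same input through two orderings. All names (`expand_dont`, `strip_punct`, `run_pipeline`) are hypothetical, and the single-contraction expander is a stand-in for a full one.

```python
import re
import string
from functools import reduce

def expand_dont(text):
    """Stand-in expander that only knows one contraction."""
    return re.sub(r"\bdon't\b", "do not", text, flags=re.IGNORECASE)

def strip_punct(text):
    """Remove ASCII punctuation, including apostrophes."""
    return text.translate(str.maketrans('', '', string.punctuation))

def run_pipeline(text, steps):
    """Apply each step to the output of the previous one."""
    return reduce(lambda acc, step: step(acc), steps, text)

sentence = "I don't know!"

expand_first = run_pipeline(sentence, [str.lower, expand_dont, strip_punct])
punct_first = run_pipeline(sentence, [str.lower, strip_punct, expand_dont])

print("Expand, then strip punctuation:", expand_first)
print("Strip punctuation, then expand:", punct_first)
```

In the second ordering the apostrophe is gone before the expander runs, so the contraction is never recognized.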

## 11. Domain-Specific Normalization

Different domains may require special normalization techniques.

In [None]:
def normalize_social_media_text(text):
    """
    Normalization specific to social media text
    """
    # Replace URLs with a placeholder token
    text = re.sub(r'https?://\S+|www\.\S+', '<URL>', text)
    
    # Replace user mentions
    text = re.sub(r'@\w+', '<USER>', text)
    
    # Replace hashtags but keep the content
    text = re.sub(r'#(\w+)', r'\1', text)
    
    # Replace multiple exclamation marks
    text = re.sub(r'!{2,}', '!', text)
    
    # Replace multiple question marks
    text = re.sub(r'\?{2,}', '?', text)
    
    # Shorten elongated words by capping repeated characters at two (e.g., "sooooo" -> "soo")
    text = re.sub(r'(\w)\1{2,}', r'\1\1', text)
    
    return text

def normalize_email_text(text):
    """
    Normalization specific to email text
    """
    # Remove email headers patterns
    text = re.sub(r'^From:.*$', '', text, flags=re.MULTILINE)
    text = re.sub(r'^To:.*$', '', text, flags=re.MULTILINE)
    text = re.sub(r'^Subject:.*$', '', text, flags=re.MULTILINE)
    text = re.sub(r'^Date:.*$', '', text, flags=re.MULTILINE)
    
    # Remove the signature: everything from the "--" delimiter line to the end of the text
    text = re.sub(r'^-- ?$.*', '', text, flags=re.MULTILINE | re.DOTALL)
    
    # Remove quoted text (lines starting with >)
    text = re.sub(r'^>.*$', '', text, flags=re.MULTILINE)
    
    return text

# Test domain-specific normalization
social_media_text = """
OMG!!! This is sooooo amazing!!! Check out @username and visit https://example.com 
#NLP #MachineLearning #AI Loooove this!!!
"""

email_text = """
From: sender@example.com
To: recipient@example.com
Subject: Meeting Tomorrow

Hi there,

Can we meet tomorrow at 3 PM?

> Previous message:
> Yes, that works for me.

Thanks!

--
Best regards,
John Doe
john@company.com
"""

print("Social Media Text (Original):")
print(social_media_text)
print("\nSocial Media Text (Normalized):")
print(normalize_social_media_text(social_media_text))

print("\n" + "="*50)
print("\nEmail Text (Original):")
print(email_text)
print("\nEmail Text (Normalized):")
print(normalize_email_text(email_text))
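
A pattern such as `@\w+` also matches the domain part of an email address, so mixed text needs the more specific patterns applied first. The sketch below shows one possible ordering. The regular expressions are simplified assumptions (they do not implement the full URL or email grammars), and `replace_entities` is a hypothetical name.

```python
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')
# Simplified email pattern: local part, '@', then a dotted domain
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
# A mention must not be preceded by a word character
MENTION_RE = re.compile(r'(?<!\w)@\w+')

def replace_entities(text):
    """Replace URLs, then emails, then user mentions with placeholder tokens."""
    text = URL_RE.sub('<URL>', text)
    text = EMAIL_RE.sub('<EMAIL>', text)
    text = MENTION_RE.sub('<USER>', text)
    return text

message = "Ping @alice or mail bob@example.com, see https://example.com/docs"

naive = re.sub(r'@\w+', '<USER>', message)
ordered = replace_entities(message)

print("Naive:  ", naive)
print("Ordered:", ordered)
```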

## 12. Evaluation and Comparison

How to evaluate the effectiveness of different normalization strategies.

In [None]:
def analyze_normalization_impact(original_text, normalized_text):
    """
    Analyze the impact of normalization on text
    """
    original_tokens = word_tokenize(original_text.lower())
    normalized_tokens = word_tokenize(normalized_text.lower())
    
    original_vocab = set(original_tokens)
    normalized_vocab = set(normalized_tokens)
    
    print("Normalization Impact Analysis:")
    print(f"Original tokens: {len(original_tokens)}")
    print(f"Normalized tokens: {len(normalized_tokens)}")
    print(f"Token reduction: {len(original_tokens) - len(normalized_tokens)} ({((len(original_tokens) - len(normalized_tokens)) / len(original_tokens) * 100):.1f}%)")
    
    print(f"\nOriginal vocabulary size: {len(original_vocab)}")
    print(f"Normalized vocabulary size: {len(normalized_vocab)}")
    print(f"Vocabulary reduction: {len(original_vocab) - len(normalized_vocab)} ({((len(original_vocab) - len(normalized_vocab)) / len(original_vocab) * 100):.1f}%)")
    
    # Show examples of removed tokens
    removed_tokens = original_vocab - normalized_vocab
    if removed_tokens:
        print(f"\nExamples of removed tokens: {list(removed_tokens)[:10]}")

# Compare different normalization strategies
print("Comparing normalization strategies on sample text:")
print("\n" + "="*60)

# Strategy 1: Minimal normalization
minimal_norm = normalizer.normalize(sample_text, 
                                    remove_punctuation=False,
                                    remove_numbers=False,
                                    lemmatize=False,
                                    remove_stopwords=False)
print("\n1. Minimal Normalization (contractions + lowercase + accents + whitespace):")
analyze_normalization_impact(sample_text, minimal_norm)

# Strategy 2: Medium normalization
medium_norm = normalizer.normalize(sample_text,
                                   lemmatize=False,
                                   remove_stopwords=False)
print("\n2. Medium Normalization (+ punctuation + numbers):")
analyze_normalization_impact(sample_text, medium_norm)

# Strategy 3: Full normalization
full_norm = normalizer.normalize(sample_text)
print("\n3. Full Normalization (+ lemmatization + stop words):")
analyze_normalization_impact(sample_text, full_norm)
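
Raw vocabulary size depends on text length, so a ratio can be easier to compare across strategies. The sketch below computes a type-token ratio (distinct tokens divided by total tokens) with the standard library. `vocab_stats` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def vocab_stats(tokens):
    """Return token count, distinct-type count, and type-token ratio."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        "tokens": total,
        "types": len(counts),
        # Guard against empty input to avoid division by zero
        "ttr": len(counts) / total if total else 0.0,
        "most_common": counts.most_common(2),
    }

raw_tokens = "The cat saw the Cat".split()
lower_tokens = [t.lower() for t in raw_tokens]

raw_stats = vocab_stats(raw_tokens)
lower_stats = vocab_stats(lower_tokens)

print("Raw:       ", raw_stats)
print("Lowercased:", lower_stats)
```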

## 13. Exercises

Practice text normalization with these exercises:

### Exercise 1: Custom Normalization Function

Create a normalization function for a specific domain (e.g., product reviews, scientific papers).

In [None]:
def normalize_product_reviews(text):
    """
    TODO: Implement normalization specific to product reviews
    Consider:
    - Handling star ratings (e.g., "5/5 stars", "★★★★★")
    - Price mentions (e.g., "$19.99", "twenty dollars")
    - Product model numbers
    - Common review phrases
    """
    # Your implementation here
    pass

# Test with sample product review
product_review = """
This product is AMAZING!!! I bought it for $29.99 and it's worth every penny.
Model XYZ-123 works perfectly. I'd give it 5/5 stars ★★★★★.
Definitely recommend to anyone. Fast shipping too!!!
"""

# Your test code here


### Exercise 2: Normalization Comparison

Compare the effect of different normalization strategies on a classification task.

In [None]:
# Sample dataset of movie reviews (simplified)
movie_reviews = [
    ("This movie is FANTASTIC!!! I loved it so much. Best film I've seen this year!", "positive"),
    ("Terrible movie. Don't waste your time. Acting was awful.", "negative"),
    ("It's an okay film. Not great, not terrible. Just average.", "neutral"),
    ("AMAZING storyline and great acting! Highly recommended!!!", "positive"),
    ("Boring and predictable. Couldn't wait for it to end.", "negative")
]

# TODO: Apply different normalization strategies and analyze:
# 1. Vocabulary size for each strategy
# 2. Most common words for each strategy
# 3. Which strategy might work better for sentiment classification?

# Your analysis code here:


## Key Takeaways

1. **Normalization is context-dependent**:
   - Social media text needs different handling than formal documents
   - Consider your downstream task when choosing normalization steps

2. **Order matters**:
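   - Expand contractions before lowercasing and before removing punctuation; once apostrophes are stripped, "it's" is indistinguishable from "its"
   - Replace URLs, mentions, and other special patterns before removing punctuation, which would otherwise break them apart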
   - Apply lemmatization/stemming after other preprocessing

3. **Trade-offs to consider**:
   - **Information loss**: Aggressive normalization removes information
   - **Computational cost**: Some methods are more expensive
   - **Domain specificity**: Generic approaches may not work for specialized text

4. **Stemming vs Lemmatization**:
   - Stemming is faster but can be inaccurate
   - Lemmatization is more accurate but slower
   - spaCy's lemmatizer uses part-of-speech tags inferred from context, whereas NLTK's `WordNetLemmatizer` assumes nouns unless a POS tag is passed

5. **Stop word removal**:
   - Useful for some tasks (topic modeling, search)
   - May hurt performance for others (sentiment analysis, machine translation)

## Best Practices

1. **Start simple**: Begin with basic normalization and add complexity as needed
2. **Evaluate impact**: Measure how normalization affects your specific task
3. **Keep originals**: Always preserve original text for reference
4. **Document choices**: Record what normalization steps you applied and why
5. **Consider alternatives**: Modern transformer models often work well with minimal preprocessing

## Next Steps

- Learn about advanced preprocessing techniques
- Explore language-specific normalization challenges
- Study the impact of normalization on different NLP tasks
- Practice with real-world datasets from your domain of interest

## Resources

- [NLTK Book - Text Preprocessing](https://www.nltk.org/book/ch03.html)
- [spaCy Linguistic Features](https://spacy.io/usage/linguistic-features)
- [Python String Formatting (Real Python)](https://realpython.com/python-string-formatting/)
- [Unicode in Python](https://docs.python.org/3/howto/unicode.html)
