# Data Normalization

## Overview

This notebook demonstrates how to normalize and clean data using Semantica's normalization modules. You'll learn to normalize text, entity names, dates, and numbers, to remove duplicate records, and to detect the language and encoding of your data.

### Learning Objectives

- Use `TextNormalizer` for text cleaning and normalization
- Use `EntityNormalizer` for entity name standardization
- Use `DateNormalizer` for date format normalization
- Use `NumberNormalizer` for number and quantity normalization
- Use `DataCleaner` for general data cleaning
- Use `LanguageDetector` and `EncodingHandler` to detect the language of a text and the encoding of a byte string

---

## Step 1: Text Normalization

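Normalize text content for consistency. The cell below calls `normalize_text` with `case="lower"` and `clean_text` with `remove_special_chars=False`, then prints both results next to the original string for comparison.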


In [None]:
from semantica.normalize import TextNormalizer

text_normalizer = TextNormalizer()

sample_text = "Hello   World!!!  This is a test."

normalized = text_normalizer.normalize_text(sample_text, case="lower")
cleaned = text_normalizer.clean_text(sample_text, remove_special_chars=False)

print(f"Original: {sample_text}")
print(f"Normalized: {normalized}")
print(f"Cleaned: {cleaned}")
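
To make the idea of text normalization concrete, here is a minimal standard-library sketch of the kind of steps a text normalizer typically performs: Unicode normalization, whitespace collapsing, and case folding. The function name `normalize_text_sketch` is hypothetical and this is not Semantica's implementation; `TextNormalizer` may apply different or additional rules.

```python
import re
import unicodedata


def normalize_text_sketch(text: str, case: str = "lower") -> str:
    """Hypothetical helper: NFKC-normalize, collapse whitespace, adjust case."""
    # NFKC folds compatibility characters (e.g. ligatures, non-breaking spaces)
    text = unicodedata.normalize("NFKC", text)
    # Collapse every run of whitespace into a single space and trim the ends
    text = re.sub(r"\s+", " ", text).strip()
    if case == "lower":
        return text.lower()
    if case == "upper":
        return text.upper()
    return text  # any other value leaves the case untouched


print(normalize_text_sketch("Hello   World!!!  This is a test."))
```

Punctuation such as `!!!` is left alone here; removing or collapsing it would be a separate cleaning step.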


## Step 2: Entity Normalization

Normalize entity names to canonical forms, so that spelling variants of the same organization can be compared against one another.


In [None]:
from semantica.normalize import EntityNormalizer

entity_normalizer = EntityNormalizer()

entity_variants = ["Apple Inc.", "Apple Inc", "Apple", "Apple Incorporated"]

normalized_entities = []
for entity in entity_variants:
    normalized = entity_normalizer.normalize_entity(entity, entity_type="Organization")
    normalized_entities.append(normalized)
    print(f"{entity} -> {normalized}")
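
One common way to canonicalize organization names is to strip trailing legal suffixes. The sketch below illustrates that technique with a regular expression; `canonical_org_name` and the suffix list are assumptions made for illustration, and `EntityNormalizer` may use a different strategy (for example alias tables or fuzzy matching).

```python
import re

# Hypothetical, deliberately short list of legal suffixes to strip
LEGAL_SUFFIX = re.compile(
    r"[,\s]+(incorporated|inc|corporation|corp|ltd|llc)\.?$",
    re.IGNORECASE,
)


def canonical_org_name(name: str) -> str:
    """Hypothetical helper: collapse whitespace and drop one trailing suffix."""
    name = " ".join(name.split())
    return LEGAL_SUFFIX.sub("", name)


variants = ["Apple Inc.", "Apple Inc", "Apple", "Apple Incorporated"]
for variant in variants:
    print(f"{variant} -> {canonical_org_name(variant)}")
```

With this rule all four variants reduce to the same string, which can then serve as a grouping key.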


## Step 3: Date Normalization

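Normalize date strings written in different formats to one standard representation. Each call is wrapped in `try`/`except` so that a string the normalizer cannot parse prints an error instead of stopping the loop.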


In [None]:
from semantica.normalize import DateNormalizer

date_normalizer = DateNormalizer()

date_formats = ["2023-12-25", "12/25/2023", "December 25, 2023", "25 Dec 2023"]

for date_str in date_formats:
    try:
        normalized = date_normalizer.normalize_date(date_str)
        print(f"{date_str} -> {normalized}")
    except Exception as e:
        print(f"{date_str} -> Error: {e}")
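
For comparison, the same four inputs can be converted to ISO 8601 with only the standard library by trying a list of known formats in order. This is a sketch, not how `DateNormalizer` works internally; `to_iso_date` and `CANDIDATE_FORMATS` are hypothetical names, and `12/25/2023` is assumed to be month-first. Month names are parsed according to the current locale, which is assumed to be English.

```python
from datetime import datetime
from typing import Optional

# Hypothetical format list; order matters when a string could match several
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %b %Y")


def to_iso_date(date_str: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no candidate format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(date_str.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


for raw in ["2023-12-25", "12/25/2023", "December 25, 2023", "25 Dec 2023"]:
    print(f"{raw} -> {to_iso_date(raw)}")
```

Returning `None` instead of raising keeps batch processing simple, at the cost of hiding why a value failed to parse.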


## Step 4: Number Normalization

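Normalize numbers and quantities. The sample inputs cover thousands separators, magnitude suffixes, currency symbols, percentages, and scientific notation.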


In [None]:
from semantica.normalize import NumberNormalizer

number_normalizer = NumberNormalizer()

numbers = ["1,000", "1.5M", "$100", "50%", "3.14e2"]

for num_str in numbers:
    try:
        normalized = number_normalizer.normalize_number(num_str)
        print(f"{num_str} -> {normalized}")
    except Exception as e:
        print(f"{num_str} -> Error: {e}")
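
The sketch below shows one way such strings could be turned into floats by hand. It rests on assumptions that may not match `NumberNormalizer`: commas are thousands separators, `$` is simply dropped, `K`/`M`/`B` mean thousand/million/billion, and a percentage becomes a fraction. `parse_number` and `MULTIPLIERS` are hypothetical names.

```python
from typing import Optional

# Hypothetical magnitude suffixes
MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_number(raw: str) -> Optional[float]:
    """Hypothetical helper: parse a formatted number, or return None."""
    s = raw.strip().replace(",", "").lstrip("$")
    try:
        if s.endswith("%"):
            return float(s[:-1]) / 100  # "50%" becomes 0.5
        if s and s[-1].upper() in MULTIPLIERS:
            return float(s[:-1]) * MULTIPLIERS[s[-1].upper()]
        return float(s)  # also handles scientific notation such as "3.14e2"
    except ValueError:
        return None


for raw in ["1,000", "1.5M", "$100", "50%", "3.14e2"]:
    print(f"{raw} -> {parse_number(raw)}")
```

A production normalizer would usually keep the unit (currency, percent) alongside the value rather than discarding it as this sketch does.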


## Step 5: Data Cleaning

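Clean a list of records using `DataCleaner`. Note that the first two sample records differ only by a trailing period in `name`, so whether `remove_duplicates=True` collapses them depends on how `DataCleaner` compares records; the printed counts show the outcome.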


In [None]:
from semantica.normalize import DataCleaner

data_cleaner = DataCleaner()

data = [
    {"name": "Apple Inc.", "value": 100},
    {"name": "Apple Inc", "value": 100},
    {"name": "Microsoft", "value": 200}
]

cleaned_data = data_cleaner.clean_data(data, remove_duplicates=True)

print(f"Original records: {len(data)}")
print(f"Cleaned records: {len(cleaned_data)}")
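
Near-duplicates such as `"Apple Inc."` and `"Apple Inc"` are only removed if records are compared on a normalized key rather than on their raw values. The following sketch illustrates that idea; `dedupe_records` and its normalization rule (lowercase, collapse whitespace, drop trailing periods) are assumptions for illustration, not a description of `DataCleaner`.

```python
def _normalize_value(value):
    """Hypothetical rule: make string values comparable, leave others as-is."""
    if isinstance(value, str):
        return " ".join(value.lower().split()).rstrip(".")
    return value


def dedupe_records(records, key_fields=("name", "value")):
    """Keep the first record for each normalized key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(_normalize_value(record.get(field)) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"name": "Apple Inc.", "value": 100},
    {"name": "Apple Inc", "value": 100},
    {"name": "Microsoft", "value": 200},
]
print(len(dedupe_records(records)))
```

The first occurrence wins, so the surviving record keeps its original spelling.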


## Step 6: Language Detection and Encoding

Detect the language of several text samples and the encoding of a byte string.


In [None]:
from semantica.normalize import LanguageDetector, EncodingHandler

language_detector = LanguageDetector()
encoding_handler = EncodingHandler()

text_samples = [
    "Hello, this is English text.",
    "Bonjour, ceci est du texte français.",
    "Hola, este es texto en español."
]

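for text in text_samples:
    detected_lang = language_detector.detect_language(text)
    # Append "..." only when the preview is actually truncated
    preview = text if len(text) <= 30 else f"{text[:30]}..."
    print(f"Text: {preview} -> Language: {detected_lang}")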

sample_bytes = "Hello World".encode('utf-8')
detected_encoding = encoding_handler.detect_encoding(sample_bytes)
print(f"\nDetected encoding: {detected_encoding}")
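
A simple fallback for encoding detection, when no detector is available, is to try decoding with a short list of candidate encodings and accept the first that succeeds. The sketch below shows that approach using only built-in codecs; `guess_encoding` and the candidate order are assumptions, and this is not how `EncodingHandler` is implemented.

```python
from typing import Optional, Sequence


def guess_encoding(
    data: bytes,
    candidates: Sequence[str] = ("utf-8", "cp1252", "latin-1"),
) -> Optional[str]:
    """Hypothetical helper: return the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    return None


print(guess_encoding("Hello World".encode("utf-8")))
print(guess_encoding("français".encode("latin-1")))
```

Order matters: pure ASCII bytes are valid in every candidate, and `latin-1` accepts any byte sequence, so it only makes sense as the last resort. Statistical detectors are more reliable for short or ambiguous inputs.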


## Summary

You've learned how to normalize and clean data:

- **TextNormalizer**: Text cleaning and normalization
- **EntityNormalizer**: Entity name standardization
- **DateNormalizer**: Date format normalization
- **NumberNormalizer**: Number and quantity normalization
- **DataCleaner**: General data cleaning
- **LanguageDetector**: Language detection
- **EncodingHandler**: Encoding detection

Next: Learn how to extract entities in the Entity_Extraction notebook.
