# Text Processing Pipeline

## Overview

This notebook walks through a four-step text processing pipeline: normalize the text, clean it, extract entities, and extract the relationships between those entities.

### Learning Objectives

- Normalize text for processing
- Clean and prepare text data
- Extract entities from text
- Extract relationships between entities

---

## Workflow

**Normalize → Clean → Extract Entities → Extract Relationships**

Each step prepares the text for the next stage of analysis.
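
The chaining described above can be sketched as a list of text-to-text stage functions applied in order, where a failing stage leaves the previous text in place (the same fallback the code cells in this notebook use). `run_pipeline` and the three toy stages are hypothetical names for illustration and are not part of semantica. The sketch covers only stages that return text, not the entity and relationship steps.

```python
from typing import Callable, List

Stage = Callable[[str], str]

def run_pipeline(text: str, stages: List[Stage]) -> str:
    """Apply each stage in order; keep the previous text if a stage fails."""
    current = text
    for stage in stages:
        try:
            result = stage(current)
        except Exception as exc:
            print(f"Stage {stage.__name__} failed: {exc}")
            continue
        # Treat an empty or None result as a failure and keep the prior text.
        if result:
            current = result
    return current

def strip_edges(text: str) -> str:
    return text.strip()

def collapse_spaces(text: str) -> str:
    return " ".join(text.split())

def broken_stage(text: str) -> str:
    # Simulates a stage that raises, to show the fallback behaviour.
    raise ValueError("simulated failure")

pipeline_output = run_pipeline(
    "  Google   LLC \n", [strip_edges, broken_stage, collapse_spaces]
)
print(pipeline_output)
```

Keeping each stage as a plain function makes it easy to reorder or drop steps without touching the others.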

---

## Step 1: Normalize Text

Normalize text to standardize formats, fix encoding issues, and prepare for further processing.
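
To make "standardize formats" concrete, here is a minimal standard-library sketch of the kind of work a normalizer typically does. It is an assumption about what normalization involves in general, not a description of how `TextNormalizer` is implemented. `normalize_text` is a hypothetical helper.

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """Illustrative normalization: Unicode NFKC, unified newlines, collapsed spaces."""
    # NFKC folds compatibility characters (non-breaking spaces, fullwidth
    # letters) into their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Unify Windows and old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs, then trim each line.
    text = re.sub(r"[ \t]+", " ", text)
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(lines).strip()

print(normalize_text("Google\u00a0LLC  was founded\r\nin\t1998. "))
```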


In [None]:
from semantica.normalize import TextNormalizer

sample_text = """
Google LLC is an American multinational technology company.
It was founded by Larry Page and Sergey Brin in 1998.
The company is headquartered in Mountain View, California.
Sundar Pichai is the current CEO of Google.
"""

normalizer = TextNormalizer()

try:
    normalized_text = normalizer.normalize(sample_text)
    print("✓ Text normalized successfully!")
    print(f"  Normalized length: {len(normalized_text) if normalized_text else 0} characters")
except Exception as e:
    print(f"✗ Error normalizing text: {e}")
    normalized_text = sample_text


## Step 2: Clean Data

Clean the normalized text to remove noise, fix formatting issues, and prepare for entity extraction.
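
As a rough illustration of "removing noise", the sketch below strips HTML-like tags, control characters, and blank lines using only the standard library. `clean_text` is a hypothetical helper, and the rules are assumptions chosen for the example. They do not describe what `DataCleaner` actually does.

```python
import re

def clean_text(text: str) -> str:
    """Illustrative cleaning: drop markup remnants, control characters, blank lines."""
    # Remove HTML-like tags such as <b> or </p>. This simple pattern would also
    # eat legitimate text between a literal "<" and ">".
    text = re.sub(r"<[^>]+>", "", text)
    # Remove control characters other than tab, newline, and carriage return.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Trim each line and drop the empty ones.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

print(clean_text("<p>Sundar Pichai is the current CEO of Google.</p>\n\n\x00\n"))
```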


In [None]:
from semantica.normalize import DataCleaner

cleaner = DataCleaner()

try:
    cleaned_text = cleaner.clean(normalized_text)
    print("✓ Text cleaned successfully!")
    print(f"  Cleaned length: {len(cleaned_text) if cleaned_text else 0} characters")
except Exception as e:
    print(f"✗ Error cleaning text: {e}")
    cleaned_text = normalized_text


## Step 3: Extract Entities

Extract named entities from the cleaned text using Named Entity Recognition (NER).
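
For intuition about what an entity record can carry, the sketch below locates known names in a sentence and reports their character offsets. This is a plain dictionary (gazetteer) lookup, which is far simpler than statistical NER and is not how `NERExtractor` works. `find_entity_spans` and the `start`/`end` keys are hypothetical additions, while the `text`/`type` keys follow the entity format used in this notebook.

```python
from typing import Dict, List

def find_entity_spans(text: str, gazetteer: Dict[str, str]) -> List[dict]:
    """Locate every occurrence of each known name, sorted by position."""
    spans = []
    for name, entity_type in gazetteer.items():
        if not name:
            continue  # an empty name would match everywhere
        start = text.find(name)
        while start != -1:
            spans.append({
                "text": name,
                "type": entity_type,
                "start": start,
                "end": start + len(name),
            })
            start = text.find(name, start + len(name))
    return sorted(spans, key=lambda span: span["start"])

sentence = "It was founded by Larry Page and Sergey Brin in 1998."
known_names = {"Larry Page": "Person", "Sergey Brin": "Person", "1998": "Date"}
entity_spans = find_entity_spans(sentence, known_names)
for span in entity_spans:
    print(span)
```

Character offsets let later steps map an entity back to the exact place it appeared, which a bare list of names cannot do.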


In [None]:
from semantica.semantic_extract import NERExtractor

extractor = NERExtractor()

# NOTE: this cell does not call `extractor`. The list below is the output
# expected for the sample text, hardcoded for illustration.
try:
    print("Extracting entities...")
    # `cleaned_text` may be None if cleaning returned nothing, so guard the slice.
    print(f"\nText: {(cleaned_text or '')[:100]}...")

    expected_entities = [
        {"text": "Google LLC", "type": "Organization"},
        {"text": "Larry Page", "type": "Person"},
        {"text": "Sergey Brin", "type": "Person"},
        {"text": "1998", "type": "Date"},
        {"text": "Mountain View, California", "type": "Location"},
        {"text": "Sundar Pichai", "type": "Person"},
    ]

    print(f"\n✓ Expected {len(expected_entities)} entities:")
    for entity in expected_entities:
        print(f"  - {entity['text']} ({entity['type']})")

    entities = expected_entities
    
except Exception as e:
    print(f"✗ Error extracting entities: {e}")
    entities = []


## Step 4: Extract Relationships

Extract relationships between the identified entities to understand how they connect.
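
One way to see how the entities connect is to group relationship triples by their source entity. The sketch below does that for triples in the `source`/`target`/`type` format used in this notebook. `build_adjacency` is a hypothetical helper, not a semantica function.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def build_adjacency(triples: List[dict]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (relation type, target) pairs under each source entity."""
    graph = defaultdict(list)
    for triple in triples:
        graph[triple["source"]].append((triple["type"], triple["target"]))
    return dict(graph)

sample_triples = [
    {"source": "Google LLC", "target": "Larry Page", "type": "founded_by"},
    {"source": "Google LLC", "target": "Mountain View, California", "type": "located_in"},
    {"source": "Sundar Pichai", "target": "Google LLC", "type": "ceo_of"},
]

graph = build_adjacency(sample_triples)
for source, edges in graph.items():
    for relation, target in edges:
        print(f"{source} --[{relation}]--> {target}")
```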


In [None]:
from semantica.semantic_extract import RelationExtractor

relation_extractor = RelationExtractor()

# NOTE: this cell does not call `relation_extractor`. The list below is the
# output expected for the sample text, hardcoded for illustration.
try:
    print("Extracting relationships...")

    expected_relationships = [
        {"source": "Google LLC", "target": "Larry Page", "type": "founded_by"},
        {"source": "Google LLC", "target": "Sergey Brin", "type": "founded_by"},
        {"source": "Google LLC", "target": "1998", "type": "founded_in"},
        {"source": "Google LLC", "target": "Mountain View, California", "type": "located_in"},
        {"source": "Sundar Pichai", "target": "Google LLC", "type": "ceo_of"},
    ]

    print(f"\n✓ Expected {len(expected_relationships)} relationships:")
    for rel in expected_relationships:
        print(f"  - {rel['source']} --[{rel['type']}]--> {rel['target']}")

    relationships = expected_relationships

    print("\n✓ Complete text analysis results:")
    print(f"  Entities: {len(entities)}")
    print(f"  Relationships: {len(relationships)}")
    
except Exception as e:
    print(f"✗ Error extracting relationships: {e}")
    relationships = []
