# Entity Resolution Workflow

## Overview

This notebook demonstrates the complete entity resolution workflow: extract entities, detect duplicates, resolve conflicts, merge entities, and validate the results.

### Learning Objectives

- Extract entities from documents
- Detect duplicate entities
- Resolve entity conflicts
- Merge duplicate entities
- Validate resolved entities

---

## Workflow

**Extract → Detect Duplicates → Resolve → Merge → Validate**

This workflow produces clean, deduplicated entities ready for knowledge graph construction.

---

## Step 1: Extract Entities

Start by extracting entities from your documents.


In [None]:
from semantica.semantic_extract import NERExtractor

sample_text = """
Amazon.com Inc. is an American technology company.
Amazon was founded by Jeff Bezos in 1994.
The company is based in Seattle, Washington.
Andy Jassy is the current CEO of Amazon.
"""

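extractor = NERExtractor()

# Note: the extractor is instantiated but not called in this cell. The entity
# list below is hand-written to stand in for its output, so the later steps
# run on predictable input.
try:
    print("Extracting entities...")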
    entities = [
        {"id": "e1", "text": "Amazon.com Inc.", "type": "Organization"},
        {"id": "e2", "text": "Amazon", "type": "Organization"},
        {"id": "e3", "text": "Jeff Bezos", "type": "Person"},
        {"id": "e4", "text": "1994", "type": "Date"},
        {"id": "e5", "text": "Seattle, Washington", "type": "Location"},
        {"id": "e6", "text": "Andy Jassy", "type": "Person"},
    ]
    
    print(f"✓ Extracted {len(entities)} entities")
    for entity in entities:
        print(f"  - {entity['text']} ({entity['type']})")
        
except Exception as e:
    print(f"✗ Error extracting entities: {e}")
    entities = []
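
Because the cell above uses a hand-written entity list, it may help to see how records of that shape could be produced from the text itself. The following is a minimal sketch using a dictionary lookup (a gazetteer), not the method `NERExtractor` uses; `GAZETTEER` and `extract_with_gazetteer` are hypothetical names introduced for illustration, and dates such as "1994" are deliberately left out.

```python
import re

# Hypothetical lookup table (an assumption for this sketch): surface form -> type.
GAZETTEER = {
    "Amazon.com Inc.": "Organization",
    "Amazon": "Organization",
    "Jeff Bezos": "Person",
    "Andy Jassy": "Person",
    "Seattle, Washington": "Location",
}

def extract_with_gazetteer(text):
    """Return one record per distinct surface form found in the text."""
    # Try longer names first so "Amazon.com Inc." wins over "Amazon".
    names = sorted(GAZETTEER, key=len, reverse=True)
    pattern = "|".join(re.escape(name) for name in names)
    entities = []
    seen = set()
    for match in re.finditer(pattern, text):
        surface = match.group(0)
        if surface in seen:
            continue  # keep only the first mention of each surface form
        seen.add(surface)
        entities.append({
            "id": f"e{len(entities) + 1}",
            "text": surface,
            "type": GAZETTEER[surface],
        })
    return entities

sample = (
    "Amazon.com Inc. is an American technology company.\n"
    "Amazon was founded by Jeff Bezos in 1994.\n"
    "The company is based in Seattle, Washington.\n"
    "Andy Jassy is the current CEO of Amazon.\n"
)
found = extract_with_gazetteer(sample)
for entity in found:
    print(f"{entity['id']}: {entity['text']} ({entity['type']})")
```

The sketch keeps "Amazon.com Inc." and "Amazon" as separate records on purpose: deciding that they refer to the same company is the job of the next step.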


## Step 2: Detect Duplicates

Detect duplicate entities that refer to the same real-world entity (e.g., "Amazon.com Inc." and "Amazon").


In [None]:
from semantica.deduplication import DuplicateDetector

detector = DuplicateDetector()

try:
    duplicates = detector.detect(entities)
    print(f"✓ Detected {len(duplicates) if duplicates else 0} duplicate groups")
    if duplicates:
        for dup_group in duplicates:
            print(f"  Duplicate group: {dup_group}")
    else:
        print("  Note: no groups returned; 'Amazon.com Inc.' and 'Amazon' are the expected duplicates")
        
except Exception as e:
    print(f"✗ Error detecting duplicates: {e}")
    duplicates = []
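
To make the idea of duplicate detection concrete, here is a minimal standard-library sketch that does not reflect how `DuplicateDetector` works internally. It assumes that two entities are duplicates when they share a type and their normalized names are nearly identical; `LEGAL_SUFFIXES`, `normalize`, `find_duplicate_groups`, and the `0.85` threshold are all hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher

# Assumed list of corporate suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "corp", "ltd", "llc", "co"}

def normalize(name):
    """Lowercase the name and drop '.com', punctuation, and legal suffixes."""
    name = name.lower().replace(".com", "")
    tokens = re.findall(r"[a-z0-9]+", name)
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)

def find_duplicate_groups(entities, threshold=0.85):
    """Group entities of the same type whose normalized names are similar."""
    groups = []
    for entity in entities:
        key = normalize(entity["text"])
        for group in groups:
            rep = group[0]  # compare against the first member of each group
            similarity = SequenceMatcher(None, normalize(rep["text"]), key).ratio()
            if rep["type"] == entity["type"] and similarity >= threshold:
                group.append(entity)
                break
        else:
            groups.append([entity])
    # Only groups with more than one member are duplicates.
    return [[e["id"] for e in group] for group in groups if len(group) > 1]

entities = [
    {"id": "e1", "text": "Amazon.com Inc.", "type": "Organization"},
    {"id": "e2", "text": "Amazon", "type": "Organization"},
    {"id": "e3", "text": "Jeff Bezos", "type": "Person"},
    {"id": "e4", "text": "1994", "type": "Date"},
    {"id": "e5", "text": "Seattle, Washington", "type": "Location"},
    {"id": "e6", "text": "Andy Jassy", "type": "Person"},
]
groups = find_duplicate_groups(entities)
print(groups)
```

On the sample entities this yields a single group containing `e1` and `e2`, since both names normalize to `amazon`. Comparing only against the first member of each group keeps the sketch short; a real detector would likely use blocking or pairwise clustering.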


## Step 3: Resolve Entities

Resolve entity conflicts and determine the canonical representation for each entity.


In [None]:
from semantica.kg import EntityResolver

resolver = EntityResolver()

try:
    # Fall back to the unresolved list if the resolver returns nothing,
    # so the next step always receives a list.
    resolved_entities = resolver.resolve(entities, duplicates) or entities
    print(f"✓ Resolved {len(resolved_entities)} entities")
    print("  Note: Duplicate entities are resolved to canonical forms")
    
except Exception as e:
    print(f"✗ Error resolving entities: {e}")
    resolved_entities = entities
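
One way to picture "determining the canonical representation" is as a mapping from every entity id to the id of the record that should represent it. The sketch below is an illustration under stated assumptions, not the behavior of `EntityResolver`: it assumes duplicate groups are lists of entity ids and that the longest surface form is the most specific one. `resolve_canonical` is a hypothetical helper.

```python
def resolve_canonical(entities, duplicate_groups):
    """Map every entity id to the id of its canonical entity."""
    by_id = {e["id"]: e for e in entities}
    # By default, each entity is its own canonical form.
    canonical = {e["id"]: e["id"] for e in entities}
    for group in duplicate_groups:
        # Assumed heuristic: prefer the longest surface form in the group.
        best = max(group, key=lambda eid: len(by_id[eid]["text"]))
        for eid in group:
            canonical[eid] = best
    return canonical

entities = [
    {"id": "e1", "text": "Amazon.com Inc.", "type": "Organization"},
    {"id": "e2", "text": "Amazon", "type": "Organization"},
    {"id": "e3", "text": "Jeff Bezos", "type": "Person"},
]
mapping = resolve_canonical(entities, [["e1", "e2"]])
print(mapping)
```

Here `e2` ("Amazon") resolves to `e1` ("Amazon.com Inc."), while `e3` maps to itself. Other heuristics, such as preferring the most frequent mention or the highest-confidence extraction, fit the same interface.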


## Step 4: Merge Entities

Merge duplicate entities into single canonical entities.


In [None]:
from semantica.deduplication import EntityMerger

merger = EntityMerger()

try:
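    # Fall back to the resolved list if the merger returns nothing, so the
    # len() calls in the validation step cannot fail on None.
    merged_entities = merger.merge(resolved_entities) or resolved_entities
    print(f"✓ Merged to {len(merged_entities)} unique entities")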
    print("  Note: Duplicates are merged into single entities")
    
except Exception as e:
    print(f"✗ Error merging entities: {e}")
    merged_entities = resolved_entities
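
As a rough illustration of what merging involves, the sketch below folds each duplicate into its canonical record and keeps the other surface forms as aliases. It is not the `EntityMerger` implementation; `merge_entities`, the `aliases` field, and the `merged_from` field are hypothetical, and the `canonical` argument is assumed to map each entity id directly to a canonical id (no chains).

```python
def merge_entities(entities, canonical):
    """Fold each entity into its canonical record, keeping other names as aliases."""
    # Start with one record per canonical entity.
    merged = {
        e["id"]: {**e, "aliases": [], "merged_from": [e["id"]]}
        for e in entities
        if canonical.get(e["id"], e["id"]) == e["id"]
    }
    # Attach every non-canonical entity to its canonical record.
    for e in entities:
        target = canonical.get(e["id"], e["id"])
        if target == e["id"]:
            continue
        record = merged[target]
        record["merged_from"].append(e["id"])
        if e["text"] != record["text"] and e["text"] not in record["aliases"]:
            record["aliases"].append(e["text"])
    return list(merged.values())

entities = [
    {"id": "e1", "text": "Amazon.com Inc.", "type": "Organization"},
    {"id": "e2", "text": "Amazon", "type": "Organization"},
    {"id": "e3", "text": "Jeff Bezos", "type": "Person"},
]
canonical = {"e1": "e1", "e2": "e1", "e3": "e3"}
merged = merge_entities(entities, canonical)
for record in merged:
    print(record["id"], record["text"], record["aliases"], record["merged_from"])
```

Three input records become two: the "Amazon.com Inc." record gains the alias "Amazon" and lists both source ids, which preserves provenance for later auditing.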


## Step 5: Validate

Validate the resolved and merged entities to ensure quality and consistency.


In [None]:
from semantica.kg import GraphValidator

validator = GraphValidator()

try:
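    validation_results = validator.validate(merged_entities)
    print("✓ Validation complete")
    # Show the returned results so they can be inspected.
    print(f"  Validation results: {validation_results}")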
    print(f"  Validated {len(merged_entities)} unique entities")
    print("  Entities are ready for knowledge graph construction")
    
except Exception as e:
    print(f"✗ Error validating entities: {e}")
    print(f"  Using {len(merged_entities)} entities")
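
The kinds of checks that "quality and consistency" might cover can be sketched with a few plain rules. The example below is an assumption-laden illustration rather than what `GraphValidator` checks: `REQUIRED_FIELDS`, `ALLOWED_TYPES`, and `validate_entities` are hypothetical, and the rules are limited to required fields, unique ids, and known types.

```python
# Assumed schema for this sketch; adjust to the entity types in your data.
REQUIRED_FIELDS = ("id", "text", "type")
ALLOWED_TYPES = {"Organization", "Person", "Date", "Location"}

def validate_entities(entities):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    seen_ids = set()
    for index, entity in enumerate(entities):
        # Required fields must be present and non-empty.
        missing = [field for field in REQUIRED_FIELDS if not entity.get(field)]
        if missing:
            problems.append(f"entity {index}: missing {', '.join(missing)}")
            continue
        # Ids must be unique after merging.
        if entity["id"] in seen_ids:
            problems.append(f"entity {index}: duplicate id {entity['id']}")
        seen_ids.add(entity["id"])
        # Types must come from the known set.
        if entity["type"] not in ALLOWED_TYPES:
            problems.append(f"entity {index}: unknown type {entity['type']}")
    return problems

clean = [
    {"id": "e1", "text": "Amazon.com Inc.", "type": "Organization"},
    {"id": "e3", "text": "Jeff Bezos", "type": "Person"},
]
broken = [
    {"id": "e1", "text": "Amazon.com Inc.", "type": "Organization"},
    {"id": "e1", "text": "Seattle", "type": "Place"},
    {"id": "e3", "text": "", "type": "Person"},
]
clean_problems = validate_entities(clean)
broken_problems = validate_entities(broken)
print(clean_problems)
print(broken_problems)
```

Returning a list of problems instead of raising on the first failure makes it easier to report every issue in one pass before the entities are loaded into a knowledge graph.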
