# Deduplication

## Overview

This notebook demonstrates how to detect and merge duplicate entities using Semantica's deduplication modules. You'll learn to use `DuplicateDetector`, `EntityMerger`, `SimilarityCalculator`, and `ClusterBuilder`.

### Learning Objectives

- Use `DuplicateDetector` to find duplicate entities
- Use `EntityMerger` to merge duplicates
- Use `SimilarityCalculator` to calculate similarity scores
- Use `ClusterBuilder` for batch deduplication

---

## Step 1: Duplicate Detection

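Create a `DuplicateDetector` with a similarity threshold of 0.8, then pass it a list of entity dictionaries. `detect_duplicates` returns groups of entities that it considers duplicates of one another.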


In [None]:
from semantica.deduplication import DuplicateDetector

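# Similarity threshold used when flagging entities as duplicates
duplicate_detector = DuplicateDetector(similarity_threshold=0.8)

# Sample entities: e1 and e2 differ only by a trailing period
entities = [
    {"id": "e1", "name": "Apple Inc.", "type": "Organization"},
    {"id": "e2", "name": "Apple Inc", "type": "Organization"},
    {"id": "e3", "name": "Microsoft", "type": "Organization"}
]

# Each returned group exposes its members through `.entities`
duplicates = duplicate_detector.detect_duplicates(entities)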

print(f"Detected {len(duplicates)} duplicate groups")
for group in duplicates[:3]:
    print(f"  Group: {[e.get('id') for e in group.entities]}")
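
To make the idea of threshold-based duplicate groups concrete, here is a minimal standalone sketch that does not use Semantica. It assumes name similarity is measured with the standard library's `difflib.SequenceMatcher` and that pairs at or above the threshold are linked transitively with a small union-find. The helpers `name_similarity` and `group_duplicates` are hypothetical names for illustration, and `DuplicateDetector` may work quite differently internally.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_duplicates(entities, threshold=0.8):
    """Return lists of entity ids whose names are linked by similarity >= threshold."""
    # Union-find: every entity starts as its own group root
    parent = {e["id"]: e["id"] for e in entities}

    def find(x):
        # Follow parent links to the root, shortening the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair whose name similarity meets the threshold
    for a, b in combinations(entities, 2):
        if name_similarity(a["name"], b["name"]) >= threshold:
            parent[find(a["id"])] = find(b["id"])

    # Collect ids by root and keep only groups with more than one member
    groups = {}
    for e in entities:
        groups.setdefault(find(e["id"]), []).append(e["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


sample_entities = [
    {"id": "e1", "name": "Apple Inc.", "type": "Organization"},
    {"id": "e2", "name": "Apple Inc", "type": "Organization"},
    {"id": "e3", "name": "Microsoft", "type": "Organization"},
]

print(group_duplicates(sample_entities))  # [['e1', 'e2']]
```

Comparing every pair is quadratic in the number of entities, which is fine for a demonstration but usually needs blocking or indexing on larger datasets.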


## Step 2: Entity Merging

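Pass the entity list to `EntityMerger.merge_duplicates` to collapse duplicates into single records, then compare the entity count before and after merging. This cell reuses the `entities` list defined in Step 1.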


In [None]:
from semantica.deduplication import EntityMerger

entity_merger = EntityMerger()

# Reuses the `entities` list from Step 1
merged_entities = entity_merger.merge_duplicates(entities)

print(f"Original entities: {len(entities)}")
print(f"Merged entities: {len(merged_entities)}")
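
As a rough illustration of what merging a duplicate group can involve, the sketch below combines a group of entity dictionaries into one record. It is not Semantica's merge strategy: the choice of the longest name as the canonical record, the `merged_from` key, and the function name `merge_group` are all assumptions made for this example.

```python
def merge_group(group):
    """Merge a list of duplicate entity dicts into a single record."""
    # Assumption: the entity with the longest name is the most complete one
    canonical = max(group, key=lambda e: len(e["name"]))
    merged = dict(canonical)

    # Fill in any attributes the canonical record is missing
    for entity in group:
        for key, value in entity.items():
            merged.setdefault(key, value)

    # Record which source ids were folded into this record
    merged["merged_from"] = [e["id"] for e in group]
    return merged


duplicate_group = [
    {"id": "e1", "name": "Apple Inc.", "type": "Organization"},
    {"id": "e2", "name": "Apple Inc", "type": "Organization"},
]

merged = merge_group(duplicate_group)
print(merged["name"], merged["merged_from"])
```

Keeping the source ids on the merged record makes the merge traceable, which helps when a false-positive match has to be undone later.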


## Step 3: Similarity Calculation

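Use `SimilarityCalculator.calculate_similarity` to compare two entities directly. The result exposes the similarity through its `score` attribute.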


In [None]:
from semantica.deduplication import SimilarityCalculator

similarity_calculator = SimilarityCalculator()

# Compare the two "Apple" entities from Step 1
similarity = similarity_calculator.calculate_similarity(entities[0], entities[1])

print(f"Similarity between '{entities[0]['name']}' and '{entities[1]['name']}': {similarity.score:.3f}")
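
For intuition about how a similarity score between two entities might be composed, here is a small standalone sketch. It assumes a weighted blend of Jaccard similarity over character bigrams of the name and an exact match on the `type` field. The weights and the function names are hypothetical, and this is not a description of how `SimilarityCalculator` computes its score.

```python
def char_ngrams(text, n=2):
    """Return the set of lowercase character n-grams in text."""
    normalized = text.lower().strip()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def entity_similarity(e1, e2, name_weight=0.8, type_weight=0.2):
    """Blend name similarity and type agreement into a single 0..1 score."""
    name_score = jaccard(char_ngrams(e1["name"]), char_ngrams(e2["name"]))
    type_score = 1.0 if e1.get("type") == e2.get("type") else 0.0
    return name_weight * name_score + type_weight * type_score


apple_a = {"id": "e1", "name": "Apple Inc.", "type": "Organization"}
apple_b = {"id": "e2", "name": "Apple Inc", "type": "Organization"}
microsoft = {"id": "e3", "name": "Microsoft", "type": "Organization"}

print(f"{entity_similarity(apple_a, apple_b):.3f}")
print(f"{entity_similarity(apple_a, microsoft):.3f}")
```

The two Apple variants score much higher than Apple against Microsoft, whose names share no bigrams and so receive only the type-match contribution.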


## Summary

You've learned how to deduplicate entities:

- **DuplicateDetector**: Detect duplicate entities
- **EntityMerger**: Merge duplicate entities
- **SimilarityCalculator**: Calculate similarity scores
- **ClusterBuilder**: Batch deduplication (listed in the objectives; no example cell is included in this notebook)

Next: Learn how to generate embeddings in the Embedding_Generation notebook.
