[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/introduction/18_Deduplication.ipynb)

# Deduplication in Semantica

Welcome to the **Deduplication** walkthrough! Knowledge graphs often combine data from multiple sources, which produces duplicate entities (e.g., "Apple Inc." vs. "Apple Inc").

Semantica provides a robust **Deduplication Module** to help you:
1. **Calculate Similarity**: Compare entities using strings, properties, and embeddings.
2. **Detect Duplicates**: Find pairs or groups of entities that represent the same real-world object.
3. **Cluster Entities**: Group similar entities together.
4. **Merge Entities**: Combine duplicates into a single, canonical entity while resolving conflicts.

This notebook will guide you through each step with clear examples.

In [1]:
# Install Semantica
!pip install -q semantica



## 1. Preparing Sample Data

Let's create a dataset with some intentional duplicates. We'll simulate data coming from different sources (e.g., a CRM and a public database).

**Our Entities:**
- **Apple**: Variations like "Apple Inc.", "Apple Inc", and just "Apple".
- **Microsoft**: Variations like "Microsoft Corp" and "Microsoft".
- **Google**: A unique entity for control.

In [2]:
entities = [
    # Apple Variations
    {
        "id": "e1",
        "name": "Apple Inc.",
        "type": "Company",
        "properties": {"industry": "Technology", "hq": "Cupertino", "founded": 1976},
        "relationships": [{"predicate": "founded_by", "object": "Steve Jobs"}]
    },
    {
        "id": "e2",
        "name": "Apple Inc",
        "type": "Company",
        "properties": {"industry": "Tech", "hq": "Cupertino, CA"}, # Slightly different properties
        "relationships": []
    },
    {
        "id": "e3",
        "name": "Apple",
        "type": "Company",
        "properties": {"industry": "Consumer Electronics"}, 
        "relationships": [{"predicate": "ceo", "object": "Tim Cook"}]
    },
    
    # Microsoft Variations
    {
        "id": "e4",
        "name": "Microsoft Corp",
        "type": "Company",
        "properties": {"industry": "Software", "hq": "Redmond"}
    },
    {
        "id": "e5",
        "name": "Microsoft",
        "type": "Company",
        "properties": {"industry": "Tech", "hq": "Redmond, WA"}
    },
    
    # Unique Entity
    {
        "id": "e6",
        "name": "Google LLC",
        "type": "Company",
        "properties": {"industry": "Internet"}
    }
]

print(f"Created {len(entities)} sample entities.")

Created 6 sample entities.


## 2. Similarity Calculation

The `SimilarityCalculator` is the core engine. It compares two entities and returns a result whose `score` lies between 0 and 1, together with a per-factor `components` breakdown. It looks at:
- **String Similarity**: Names and text fields.
- **Property Similarity**: Overlap in key-value pairs.
- **Relationship Similarity**: Connections to other entities.
- **Embeddings**: Semantic vector similarity (if available).

You can customize the weights for each factor.
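
To make the string factor concrete, here is a minimal sketch of one common name metric: Levenshtein edit distance normalized by the longer string's length. The helpers `levenshtein` and `string_similarity` are illustrative and not part of Semantica, and this notebook does not confirm which string metric the library actually uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Map edit distance to a 0..1 score (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# One missing character out of ten.
print(f"{string_similarity('Apple Inc.', 'Apple Inc'):.2f}")  # 0.90
```

A single dropped period costs one edit out of ten characters, so near-identical names score high under this metric while unrelated names score low.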

In [3]:
from semantica.deduplication import SimilarityCalculator, SimilarityResult
# Initialize calculator with custom weights
calculator = SimilarityCalculator(
    string_weight=0.5,      # High importance on name
    property_weight=0.3,    # Medium importance on properties
    relationship_weight=0.2 # Lower importance on relationships
)

# Compare "Apple Inc." (e1) vs "Apple Inc" (e2)
score_e1_e2 = calculator.calculate_similarity(entities[0], entities[1])

print(f"Similarity between '{entities[0]['name']}' and '{entities[1]['name']}':")
print(f"  Total Score: {score_e1_e2.score:.4f}")
print(f"  Breakdown:   {score_e1_e2.components}")

# Compare "Apple Inc." (e1) vs "Microsoft" (e5)
score_e1_e5 = calculator.calculate_similarity(entities[0], entities[4])

print(f"\nSimilarity between '{entities[0]['name']}' and '{entities[4]['name']}':")
print(f"  Total Score: {score_e1_e5.score:.4f}")

Status,Action,Module,Submodule,File,Time
✅,Semantica is deduplicating,🔄 deduplication,SimilarityCalculator,-,0.01s
✅,Semantica is deduplicating,🔄 deduplication,DuplicateDetector,-,0.17s
✅,Semantica is deduplicating,🔄 deduplication,EntityMerger,-,0.18s


Similarity between 'Apple Inc.' and 'Apple Inc':
  Total Score: 0.5592
  Breakdown:   {'string': 0.9, 'property': 0.36410256410256414, 'relationship': 0.0}

Similarity between 'Apple Inc.' and 'Microsoft':
  Total Score: 0.0400
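
The first total above is consistent with a plain weighted sum of the component scores. The sketch below reproduces it by hand. `combine_scores` is a hypothetical helper written for this check, and the weighted-sum formula is an assumption inferred from the printed numbers, not a documented guarantee of the library.

```python
def combine_scores(components, weights):
    """Weighted sum of per-factor similarity scores."""
    return sum(weights[name] * components[name] for name in weights)


# Weights passed to SimilarityCalculator in the cell above.
weights = {"string": 0.5, "property": 0.3, "relationship": 0.2}

# Breakdown printed for 'Apple Inc.' vs 'Apple Inc'.
components = {"string": 0.9, "property": 0.36410256410256414, "relationship": 0.0}

total = combine_scores(components, weights)
print(f"{total:.4f}")  # 0.5592
```

Even with a string score of 0.9, the differing properties and the missing relationships hold the total near 0.56, which matters once a threshold is applied.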


## 3. Duplicate Detection

The `DuplicateDetector` uses the similarity calculator to scan your dataset for duplicates. It can find:
- **Pairs**: Simple A matches B.
- **Groups**: A matches B, and B matches C.

It uses a `similarity_threshold` to decide what counts as a match.
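
As a rough illustration of threshold-based detection, the sketch below compares every pair of entities and keeps those at or above a cutoff. It swaps in a name-only similarity from the standard library (`difflib.SequenceMatcher`) in place of Semantica's calculator; `name_similarity` and `find_pairs` are illustrative names, not library functions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    """Name-only similarity in 0..1 (a stand-in for a full calculator)."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def find_pairs(entities, threshold, similarity=name_similarity):
    """Return (id1, id2, score) for every pair scoring at or above threshold."""
    pairs = []
    for a, b in combinations(entities, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a["id"], b["id"], score))
    return pairs


apple_variants = [
    {"id": "e1", "name": "Apple Inc."},
    {"id": "e2", "name": "Apple Inc"},
    {"id": "e3", "name": "Apple"},
]

pairs = find_pairs(apple_variants, threshold=0.7)
print([(a, b) for a, b, _ in pairs])  # [('e1', 'e2'), ('e2', 'e3')]
```

Here e1 and e3 do not match each other directly, yet both match e2. That is the situation the **Groups** mode describes: the three entities are linked through a shared neighbor.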

In [4]:
# Import specific classes for Duplicate Detection
from semantica.deduplication import DuplicateDetector, DuplicateCandidate, DuplicateGroup
from semantica.deduplication import DeduplicationConfig

detector = DuplicateDetector(
    similarity_threshold=0.7, 
    confidence_threshold=0.6  
)

# Detect pairs
candidates = detector.detect_duplicates(entities)

print(f"Found {len(candidates)} duplicate pairs:")
for c in candidates:
    print(f"  - {c.entity1['name']} <==> {c.entity2['name']} (Score: {c.similarity_score:.2f})")

Found 0 duplicate pairs:

No pairs are reported because none reaches the 0.7 threshold. In section 2, even the closest pair ("Apple Inc." vs "Apple Inc") scored only 0.5592 with the custom weights, pulled down by the property and relationship components. The detector here is constructed without those weights, so its scores may differ, but the empty result suggests they fall in a similar range. To surface the Apple and Microsoft variants, try a lower `similarity_threshold` or a configuration that weights names more heavily.


### Incremental Detection
If you have an existing database and ingest new data, you don't want to re-compare everything. Use `incremental_detect`.
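
The idea can be sketched as comparing each new entity only against the existing ones, rather than re-running all pairs. This is an assumption about the general technique, with a name-only `difflib` similarity standing in for the real calculator; `incremental_pairs` is a hypothetical helper, not Semantica's implementation.

```python
from difflib import SequenceMatcher
from itertools import product


def incremental_pairs(new_entities, existing_entities, threshold):
    """Compare new entities against existing ones only (new x existing)."""
    matches = []
    for new, old in product(new_entities, existing_entities):
        score = SequenceMatcher(None, new["name"].lower(), old["name"].lower()).ratio()
        if score >= threshold:
            matches.append((new["id"], old["id"]))
    return matches


existing = [
    {"id": "e1", "name": "Apple Inc."},
    {"id": "e3", "name": "Apple"},
]

# A near-duplicate of an existing record is caught...
print(incremental_pairs([{"id": "n1", "name": "Apple Inc"}], existing, 0.9))
# [('n1', 'e1')]

# ...while an unrelated company matches nothing.
print(incremental_pairs([{"id": "n2", "name": "Microsoft"}], existing, 0.9))
# []
```

With `m` new and `n` existing entities this performs `m * n` comparisons, instead of comparing every pair in the combined set.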

In [5]:
existing_db = entities[:3] # The Apple entities
new_data = [entities[4]]   # Microsoft
# Check if new data matches anything in existing DB
inc_candidates = detector.incremental_detect(new_data, existing_db)

print(f"New matches found: {len(inc_candidates)}")
# Expected: 0, because Microsoft is not Apple.

New matches found: 0


## 4. Clustering

Sometimes pairs aren't enough. `ClusterBuilder` groups related entities into clusters. This is useful for understanding the full scope of a duplicated entity.
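
One common way to turn matched pairs into clusters is to treat them as edges of a graph and take connected components. The union-find sketch below shows that approach. It is an assumption for illustration: `group_into_clusters` is a hypothetical helper, and the algorithm `ClusterBuilder` uses internally is not confirmed here.

```python
def group_into_clusters(ids, pairs):
    """Group IDs into connected components using union-find."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)

    # Keep only real clusters; singletons are not duplicates of anything.
    return [members for members in groups.values() if len(members) > 1]


ids = ["e1", "e2", "e3", "e4", "e5", "e6"]
pairs = [("e1", "e2"), ("e2", "e3"), ("e4", "e5")]  # hypothetical matches

print(group_into_clusters(ids, pairs))
# [['e1', 'e2', 'e3'], ['e4', 'e5']]
```

Note that e1 and e3 end up in the same cluster without a direct match between them, which is what distinguishes a cluster from a list of pairs.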

In [6]:
# Import specific classes for Clustering
from semantica.deduplication import ClusterBuilder, Cluster, ClusterResult
cluster_builder = ClusterBuilder(threshold=0.7)
result = cluster_builder.build_clusters(entities)

print(f"Found {len(result.clusters)} clusters:")
for i, cluster in enumerate(result.clusters):
    names = [e['name'] for e in cluster.entities]
    print(f"  Cluster {i+1}: {names}")

Found 0 clusters:


## 5. Entity Merging

Once duplicates are found, `EntityMerger` combines them. You need to choose a **Merge Strategy**:

- `KEEP_FIRST` / `KEEP_LAST`: Based on order.
- `KEEP_MOST_COMPLETE`: Keeps the entity with the most data (properties + relationships).
- `KEEP_HIGHEST_CONFIDENCE`: Uses internal confidence scores.
- `MERGE_ALL`: Combines everything (arrays are concatenated, conflicts resolved by voting).
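
To show how two of these strategies differ, the sketch below re-implements them in plain Python based only on the descriptions above. `completeness`, `keep_most_complete`, and `merge_all` are hypothetical helpers, and details such as tie-breaking are assumptions rather than Semantica's documented behavior.

```python
from collections import Counter


def completeness(entity):
    """Amount of data carried: property count plus relationship count."""
    return len(entity.get("properties", {})) + len(entity.get("relationships", []))


def keep_most_complete(group):
    """Pick the single richest entity and discard the others."""
    return max(group, key=completeness)


def merge_all(group):
    """Concatenate relationships; settle property conflicts by majority vote."""
    base = keep_most_complete(group)
    merged = {"id": base["id"], "name": base["name"], "properties": {}, "relationships": []}
    for entity in group:
        merged["relationships"].extend(entity.get("relationships", []))
    keys = sorted({k for e in group for k in e.get("properties", {})})
    for key in keys:
        votes = Counter(e["properties"][key] for e in group if key in e.get("properties", {}))
        # Assumption: on a tie, the value seen first wins.
        merged["properties"][key] = votes.most_common(1)[0][0]
    return merged


group = [
    {"id": "e1", "name": "Apple Inc.",
     "properties": {"industry": "Technology", "hq": "Cupertino", "founded": 1976},
     "relationships": [{"predicate": "founded_by", "object": "Steve Jobs"}]},
    {"id": "e2", "name": "Apple Inc",
     "properties": {"industry": "Tech", "hq": "Cupertino, CA"}, "relationships": []},
    {"id": "e3", "name": "Apple",
     "properties": {"industry": "Consumer Electronics"},
     "relationships": [{"predicate": "ceo", "object": "Tim Cook"}]},
]

print(keep_most_complete(group)["name"])   # Apple Inc.
merged = merge_all(group)
print(len(merged["relationships"]))        # 2
```

`keep_most_complete` would drop the "ceo" relationship that only e3 carries, while `merge_all` retains both relationships at the cost of having to resolve conflicting property values.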

In [7]:
# Import specific classes for Entity Merging
from semantica.deduplication import EntityMerger, MergeStrategy, MergeStrategyManager, MergeOperation, MergeResult

In [8]:
merger = EntityMerger()

# We will use the 'KEEP_MOST_COMPLETE' strategy
# This ensures we don't lose valuable information from richer entities
merge_ops = merger.merge_duplicates(
    entities, 
    strategy=MergeStrategy.KEEP_MOST_COMPLETE
)

print(f"Performed {len(merge_ops)} merge operations.")

print("\n--- Merged Results ---")
for op in merge_ops:
    final_ent = op.merged_entity
    original_count = len(op.source_entities)
    print(f"Merged {original_count} entities into: '{final_ent['name']}'")
    print(f"  - Final Properties: {final_ent['properties']}")
    print(f"  - Final Relationships: {len(final_ent.get('relationships', []))}")

Performed 0 merge operations.

--- Merged Results ---


## 6. Complete Workflow

Let's wrap this up into a clean function that takes dirty data and returns clean data.

In [9]:
def deduplicate_dataset(raw_entities):
    print("1. Detecting duplicates...")
    # Step 1: Detect. EntityMerger calls detection internally, so this step
    # is optional, but running it explicitly lets you inspect the candidates.
    detector = DuplicateDetector(similarity_threshold=0.75)
    candidates = detector.detect_duplicates(raw_entities)  # for inspection only

    print("2. Merging entities...")
    # Step 2: Merge
    merger = EntityMerger()
    ops = merger.merge_duplicates(raw_entities, strategy=MergeStrategy.KEEP_MOST_COMPLETE)

    # merge_duplicates returns one operation per merged group,
    # so take the canonical entity from each operation.
    merged_entities = [op.merged_entity for op in ops]

    # Entities that were not part of any merge (singletons) do not appear
    # in the operations, so collect them from the input by ID.
    merged_ids = {source['id'] for op in ops for source in op.source_entities}
    singletons = [e for e in raw_entities if e['id'] not in merged_ids]

    return merged_entities + singletons

# Run the workflow
clean_data = deduplicate_dataset(entities)

print(f"\nOriginal Size: {len(entities)}")
print(f"Cleaned Size:  {len(clean_data)}")
print("\nFinal Entity Names:")
for e in clean_data:
    print(f"  - {e['name']}")

1. Detecting duplicates...
2. Merging entities...

Original Size: 6
Cleaned Size:  6

Final Entity Names:
  - Apple Inc.
  - Apple Inc
  - Apple
  - Microsoft Corp
  - Microsoft
  - Google LLC

The cleaned size equals the original size because no merge operations were performed on this sample (see section 5), so all six entities are returned as singletons.


## Summary

You've learned how to:
1. **Import** the necessary Deduplication classes.
2. **Calculate Similarity** between entities.
3. **Detect Duplicates** using configurable thresholds.
4. **Cluster** similar entities.
5. **Merge** duplicates into a clean, canonical dataset.

This module is essential for maintaining high-quality Knowledge Graphs, especially when ingesting data from multiple, potentially messy sources.