# Tutorial: Fuzzy Data Deduplication

**Category**: String Handlers  
**Difficulty**: Intermediate  
**Time**: 15-25 minutes

## Problem Statement

When managing contact lists, customer databases, or user records, duplicate entries often appear with slight variations. "John Smith" and "Jon Smith" are likely the same person, but exact string matching won't catch this. Manual deduplication is time-consuming and error-prone, especially with thousands of records.

Exact matching misses near-duplicates caused by typos, nicknames, abbreviations, or inconsistent data entry. The result is duplicate contacts such as "Robert Johnson" and "Bob Johnson", which waste storage and cause confusion in day-to-day operations.

**Why This Matters**:
- **Data Quality**: Duplicate records inflate metrics, waste storage, and confuse analytics
- **User Experience**: Multiple entries for the same entity frustrate users and reduce trust
- **Operational Efficiency**: Automated fuzzy matching saves hours of manual review time

**What You'll Build**:
A production-ready fuzzy deduplication system using lionherd-core's `string_similarity` function that identifies and merges near-duplicate records based on configurable similarity thresholds.

## Prerequisites

**Prior Knowledge**:
- Python lists and dictionaries
- Basic understanding of string comparison
- Familiarity with data deduplication concepts

**Required Packages**:
```bash
pip install lionherd-core  # >=0.1.0
```

**Optional Reading**:
- [API Reference: string_similarity](../../../docs/api/libs/string_handlers/string_similarity.md)

In [1]:
# Standard library
from dataclasses import dataclass

# lionherd-core - string similarity for fuzzy matching
from lionherd_core.libs.string_handlers import string_similarity

## Solution Overview

We'll implement a fuzzy deduplication system that:

1. **Similarity Scoring**: Compare each record against all others using Levenshtein distance
2. **Threshold Matching**: Identify duplicates above a configurable similarity threshold (0.0-1.0)
3. **Merge Strategy**: Group similar records and keep the canonical version

**Key lionherd-core Components**:
- `string_similarity()`: Find similar strings using configurable algorithms (Jaro-Winkler, Levenshtein, etc.)
- `SimilarityAlgo`: Type-safe enum for available algorithms

**Flow**:
```
Records → Compare Pairs → Calculate Similarity → Threshold Filter → Group Duplicates → Merge
              ↓                ↓                      ↓                    ↓
        All vs All      Levenshtein Algo       Keep ≥ threshold    Deduplicated Output
```

**Expected Outcome**: A clean dataset with near-duplicates merged, preserving the most complete or canonical record from each duplicate group.

### Step 1: Define Data Model and Sample Data

First, we'll create a simple contact record structure and generate sample data with typical duplicates (typos, abbreviations, formatting variations).

**Why dataclass**: Lightweight, immutable-friendly, and easy to serialize for production use.

In [2]:
@dataclass
class Contact:
    """Contact record with name and metadata."""

    name: str
    email: str
    source: str  # Where the record came from (for tracking)


# Sample contact list with duplicates (real-world scenario)
contacts = [
    Contact("John Smith", "john.smith@example.com", "CRM"),
    Contact("Jon Smith", "jsmith@example.com", "Manual Entry"),  # Typo: John → Jon
    Contact("Jane Doe", "jane.doe@company.com", "Import"),
    Contact("Jane Do", "jane.d@company.com", "Manual Entry"),  # Typo: Doe → Do
    Contact("Robert Johnson", "robert.j@firm.com", "API"),
    Contact("Bob Johnson", "bob.johnson@firm.com", "CSV Upload"),  # Nickname: Robert → Bob
    Contact("Alice Williams", "alice.w@startup.com", "CRM"),
    Contact("Alise Williams", "alice@startup.com", "Manual Entry"),  # Typo: Alice → Alise
]

print(f"Contact List ({len(contacts)} records):")
print("=" * 60)
for i, contact in enumerate(contacts, 1):
    print(f"{i}. {contact.name:20} | {contact.email:30} | {contact.source}")
print("\nExpected duplicates:")
print("  - John Smith / Jon Smith")
print("  - Jane Doe / Jane Do")
print("  - Robert Johnson / Bob Johnson")
print("  - Alice Williams / Alise Williams")

Contact List (8 records):
1. John Smith           | john.smith@example.com         | CRM
2. Jon Smith            | jsmith@example.com             | Manual Entry
3. Jane Doe             | jane.doe@company.com           | Import
4. Jane Do              | jane.d@company.com             | Manual Entry
5. Robert Johnson       | robert.j@firm.com              | API
6. Bob Johnson          | bob.johnson@firm.com           | CSV Upload
7. Alice Williams       | alice.w@startup.com            | CRM
8. Alise Williams       | alice@startup.com              | Manual Entry

Expected duplicates:
  - John Smith / Jon Smith
  - Jane Doe / Jane Do
  - Robert Johnson / Bob Johnson
  - Alice Williams / Alise Williams


**Notes**:
- **Typo duplicates**: Single-character errors ("John" → "Jon", "Alice" → "Alise")
- **Abbreviation duplicates**: Nicknames or shortened names ("Robert" → "Bob")
- **Real-world data**: These patterns mirror actual data entry issues in production systems
- **Source tracking**: Keeps provenance for audit trails (which system created the record)
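
The "immutable-friendly" and "easy to serialize" points can be illustrated with a short sketch. `FrozenContact` is a hypothetical variant introduced here for illustration only; the rest of the tutorial keeps using the mutable `Contact` class. It relies on nothing beyond the standard library.

```python
from dataclasses import FrozenInstanceError, asdict, dataclass


@dataclass(frozen=True)
class FrozenContact:
    """Hypothetical immutable variant of Contact (illustration only)."""

    name: str
    email: str
    source: str


record = FrozenContact("John Smith", "john.smith@example.com", "CRM")

# Frozen instances reject attribute assignment
try:
    record.name = "Jon Smith"
    mutated = True
except FrozenInstanceError:
    mutated = False

# asdict() yields a plain dict that can be passed to json.dumps or a CSV writer
as_row = asdict(record)
print(mutated)
print(as_row)
```

A frozen dataclass is also hashable, so records could be placed in a `set` or used as dictionary keys if that suits your pipeline.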

### Step 2: Explore String Similarity Basics

Before building the deduplication system, let's understand how `string_similarity` works with different algorithms and thresholds.

**Algorithm Choice**: Levenshtein measures character-level edit distance (insertions, deletions, substitutions), which makes it well suited to typos and spelling variations.

In [3]:
# Compare two similar names
name1 = "John Smith"
name2 = "Jon Smith"

print("String Similarity Basics")
print("=" * 60)
print(f"Comparing: '{name1}' vs '{name2}'\n")

# Test different algorithms
algorithms = [
    ("levenshtein", "Edit distance (insertions/deletions/substitutions)"),
    ("jaro_winkler", "Prefix-weighted similarity (good for names)"),
    ("sequence_matcher", "Python's SequenceMatcher (longest common subsequence)"),
]

candidates = ["John Smith", "Jon Smith", "Jane Doe", "Robert Johnson"]

for algo, description in algorithms:
    print(f"\n{algo.upper()}")
    print(f"  Description: {description}")

    # Find matches with 0.8 threshold
    matches = string_similarity(
        word=name2,  # Search for "Jon Smith"
        correct_words=candidates,
        algorithm=algo,
        threshold=0.8,  # 80% similarity minimum
        return_most_similar=False,  # Return all matches above threshold
    )

    print(f"  Matches (≥0.8): {matches}")

print("\n" + "=" * 60)
print("Insight: Levenshtein correctly identifies 'John Smith' as similar to 'Jon Smith'")
print("         due to single-character edit (insertion of 'h')")

String Similarity Basics
Comparing: 'John Smith' vs 'Jon Smith'


LEVENSHTEIN
  Description: Edit distance (insertions/deletions/substitutions)
  Matches (≥0.8): ['Jon Smith', 'John Smith']

JARO_WINKLER
  Description: Prefix-weighted similarity (good for names)
  Matches (≥0.8): ['Jon Smith', 'John Smith']

SEQUENCE_MATCHER
  Description: Python's SequenceMatcher (longest common subsequence)
  Matches (≥0.8): ['Jon Smith', 'John Smith']

Insight: Levenshtein correctly identifies 'John Smith' as similar to 'Jon Smith'
         due to single-character edit (insertion of 'h')


**Notes**:
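- **Threshold tuning**: 0.8 (80%) is a reasonable starting point for name matching: strict enough to avoid most false positives, loose enough to catch single-character typos
- **Algorithm differences**: Jaro-Winkler favors prefix matches, while Levenshtein handles typos better. Python's `SequenceMatcher` scores longest contiguous matching blocks, which is not strictly a longest common subsequence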
- **return_most_similar**: `False` returns all matches (for dedup), `True` returns best match (for autocorrect)
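
To make the Levenshtein score concrete, here is a minimal pure-Python sketch. It assumes the similarity is normalized as `1 - distance / len(longer string)`. That normalization is an assumption: it reproduces the scores this tutorial prints (0.900, 0.875, 0.929), but lionherd-core's actual implementation may differ. The function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(a, b) / longest


for left, right in [("John Smith", "Jon Smith"), ("Jane Doe", "Jane Do")]:
    print(f"{left} vs {right}: {normalized_similarity(left, right):.3f}")
```

One edit over ten characters gives 0.900 for "John Smith" vs "Jon Smith", which is why a 0.8 threshold tolerates a single typo in names of typical length.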

### Step 3: Build Similarity Matrix

To identify all duplicates, we need to compare every record against every other record. This creates a similarity matrix showing which pairs are potential duplicates.

**Complexity**: O(n²) comparisons for n records (n(n-1)/2 pairs). This is workable up to roughly 10,000 records; beyond that, use blocking to cut the number of pairs compared.

In [4]:
from lionherd_core.libs.string_handlers._string_similarity import levenshtein_similarity


def build_similarity_matrix(
    contacts: list[Contact], threshold: float = 0.8
) -> dict[int, list[tuple[int, float]]]:
    """Build similarity matrix for all contact pairs.

    Args:
        contacts: List of contact records
        threshold: Minimum similarity score (0.0-1.0)

    Returns:
        Dict mapping contact index to list of (similar_index, score) tuples
    """
    matrix = {}

    for i, contact_a in enumerate(contacts):
        similar = []

        for j, contact_b in enumerate(contacts):
            if i >= j:  # Skip self-comparison and mirrored pairs: (j, i) repeats (i, j)
                continue

            # Calculate similarity
            score = levenshtein_similarity(contact_a.name, contact_b.name)

            if score >= threshold:
                similar.append((j, score))

        if similar:  # Only store if duplicates found
            matrix[i] = similar

    return matrix


# Build similarity matrix
similarity_matrix = build_similarity_matrix(contacts, threshold=0.8)

print("Similarity Matrix (threshold=0.8)")
print("=" * 60)
for idx, similar_contacts in similarity_matrix.items():
    print(f"\n{contacts[idx].name} (index {idx}):")
    for similar_idx, score in similar_contacts:
        print(f"  → {contacts[similar_idx].name} (index {similar_idx}): {score:.3f} similarity")

print(f"\n{'=' * 60}")
print(f"Found {len(similarity_matrix)} contacts with potential duplicates")

Similarity Matrix (threshold=0.8)

John Smith (index 0):
  → Jon Smith (index 1): 0.900 similarity

Jane Doe (index 2):
  → Jane Do (index 3): 0.875 similarity

Alice Williams (index 6):
  → Alise Williams (index 7): 0.929 similarity

Found 3 contacts with potential duplicates


**Notes**:
- **Upper triangle only**: We skip `i >= j` to avoid comparing (A,B) and (B,A) separately
- **Sparse matrix**: Only stores pairs above threshold, saving memory for large datasets
- **Score interpretation**: 1.0 = identical, 0.8-0.95 = likely duplicate, <0.8 = probably different
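
The upper-triangle idea can also be expressed with `itertools.combinations`, which yields each unordered pair exactly once. The sketch below is illustrative only and uses a four-name subset of the sample data; it confirms that the number of comparisons is n(n-1)/2.

```python
from itertools import combinations

names = ["John Smith", "Jon Smith", "Jane Doe", "Jane Do"]

# combinations() yields each unordered index pair once, always with i < j
pairs = list(combinations(range(len(names)), 2))

n = len(names)
expected_pairs = n * (n - 1) // 2

print(pairs)
print(f"{len(pairs)} comparisons for {n} records (expected {expected_pairs})")
```

For the tutorial's eight contacts this gives 28 comparisons, and for 10,000 records about 50 million, which is where the O(n²) cost starts to dominate.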

### Step 4: Group Duplicates into Clusters

The similarity matrix gives us pairwise matches, but we need to group transitive duplicates together. If A matches B and B matches C, all three should be in the same cluster.

**Algorithm**: Union-Find (Disjoint Set Union) efficiently groups connected components.

In [5]:
def group_duplicates(similarity_matrix: dict[int, list[tuple[int, float]]]) -> list[set[int]]:
    """Group duplicate contacts into clusters using union-find.

    Args:
        similarity_matrix: Pairwise similarity mapping

    Returns:
        List of sets, each set contains indices of duplicate contacts
    """
    # Union-Find data structure
    parent = {}

    def find(x: int) -> int:
        """Find root of x with path compression."""
        if x not in parent:
            parent[x] = x
        if parent[x] != x:
            parent[x] = find(parent[x])  # Path compression
        return parent[x]

    def union(x: int, y: int):
        """Merge sets containing x and y."""
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_x] = root_y

    # Build connected components
    for idx, similar_contacts in similarity_matrix.items():
        for similar_idx, _ in similar_contacts:
            union(idx, similar_idx)

    # Group by root parent
    clusters = {}
    for idx in parent:
        root = find(idx)
        if root not in clusters:
            clusters[root] = set()
        clusters[root].add(idx)

    return list(clusters.values())


# Group duplicates
duplicate_clusters = group_duplicates(similarity_matrix)

print("Duplicate Clusters")
print("=" * 60)
for cluster_num, cluster in enumerate(duplicate_clusters, 1):
    print(f"\nCluster {cluster_num}:")
    for idx in sorted(cluster):
        contact = contacts[idx]
        print(f"  [{idx}] {contact.name:20} | {contact.email:30} | {contact.source}")

print(f"\n{'=' * 60}")
print(f"Identified {len(duplicate_clusters)} duplicate clusters")

Duplicate Clusters

Cluster 1:
  [0] John Smith           | john.smith@example.com         | CRM
  [1] Jon Smith            | jsmith@example.com             | Manual Entry

Cluster 2:
  [2] Jane Doe             | jane.doe@company.com           | Import
  [3] Jane Do              | jane.d@company.com             | Manual Entry

Cluster 3:
  [6] Alice Williams       | alice.w@startup.com            | CRM
  [7] Alise Williams       | alice@startup.com              | Manual Entry

Identified 3 duplicate clusters


**Notes**:
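- **Union-Find efficiency**: With path compression alone (as implemented here), operations run in O(log n) amortized time; adding union by rank or size brings this down to O(α(n)), which is nearly constant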
- **Path compression**: Flattens tree structure for faster subsequent lookups
- **Transitive closure**: A~B and B~C → all in same cluster automatically
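
For reference, here is a self-contained sketch of union-find that combines path compression with union by size, the pairing that gives the near-constant O(α(n)) bound. The `DisjointSet` class is a hypothetical helper written for this illustration, not part of lionherd-core, and it uses an iterative `find` to avoid deep recursion on long chains.

```python
class DisjointSet:
    """Union-find with path compression and union by size."""

    def __init__(self) -> None:
        self.parent: dict[int, int] = {}
        self.size: dict[int, int] = {}

    def find(self, x: int) -> int:
        """Return the root of x, registering x on first sight."""
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x: int, y: int) -> None:
        """Merge the sets containing x and y."""
        root_x, root_y = self.find(x), self.find(y)
        if root_x == root_y:
            return
        # Union by size: attach the smaller tree under the larger one
        if self.size[root_x] < self.size[root_y]:
            root_x, root_y = root_y, root_x
        self.parent[root_y] = root_x
        self.size[root_x] += self.size[root_y]

    def groups(self) -> list[set[int]]:
        """Return all sets as a list of index sets."""
        clusters: dict[int, set[int]] = {}
        for item in list(self.parent):
            clusters.setdefault(self.find(item), set()).add(item)
        return list(clusters.values())


ds = DisjointSet()
ds.union(0, 1)  # A ~ B
ds.union(1, 2)  # B ~ C, so A, B and C end up together
ds.union(5, 6)  # unrelated pair
clusters = sorted(ds.groups(), key=min)
print(clusters)
```

Records 0 and 2 were never compared directly, yet they land in the same cluster through record 1, which is the transitive behavior described above.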

### Step 5: Implement Merge Strategy

Once duplicates are grouped, we need a strategy to pick the "canonical" record. Common strategies: earliest timestamp, most complete data, most recent update, or manual review flag.

**Strategy**: Pick the record from the most trusted source (CRM > API > Import > Manual Entry > CSV Upload).

In [6]:
def merge_duplicates(
    contacts: list[Contact], clusters: list[set[int]], source_priority: dict[str, int] | None = None
) -> list[Contact]:
    """Merge duplicate clusters, keeping highest-priority record.

    Args:
        contacts: Original contact list
        clusters: Duplicate clusters (from group_duplicates)
        source_priority: Dict mapping source to priority (lower = higher priority)

    Returns:
        Deduplicated contact list
    """
    if source_priority is None:
        # Default priority: CRM > API > Import > Manual Entry > CSV Upload
        source_priority = {"CRM": 1, "API": 2, "Import": 3, "Manual Entry": 4, "CSV Upload": 5}

    # Track which indices to remove
    to_remove = set()

    for cluster in clusters:
        # Find canonical record (highest priority source)
        cluster_list = list(cluster)
        canonical_idx = min(
            cluster_list, key=lambda idx: source_priority.get(contacts[idx].source, 99)
        )

        # Mark others for removal
        for idx in cluster:
            if idx != canonical_idx:
                to_remove.add(idx)

    # Build deduplicated list
    deduplicated = [contact for i, contact in enumerate(contacts) if i not in to_remove]

    return deduplicated


# Merge duplicates
deduplicated_contacts = merge_duplicates(contacts, duplicate_clusters)

print("Deduplication Results")
print("=" * 60)
print(f"Original: {len(contacts)} contacts")
print(f"Deduplicated: {len(deduplicated_contacts)} contacts")
print(f"Removed: {len(contacts) - len(deduplicated_contacts)} duplicates\n")

print("Final Contact List:")
print("=" * 60)
for i, contact in enumerate(deduplicated_contacts, 1):
    print(f"{i}. {contact.name:20} | {contact.email:30} | {contact.source}")

Deduplication Results
Original: 8 contacts
Deduplicated: 5 contacts
Removed: 3 duplicates

Final Contact List:
1. John Smith           | john.smith@example.com         | CRM
2. Jane Doe             | jane.doe@company.com           | Import
3. Robert Johnson       | robert.j@firm.com              | API
4. Bob Johnson          | bob.johnson@firm.com           | CSV Upload
5. Alice Williams       | alice.w@startup.com            | CRM


**Notes**:
- **Source priority**: Adjust based on data quality - CRM records typically most accurate
- **Alternative strategies**: Most recent timestamp, longest name (more complete), manual review flag
- **Audit trail**: Consider logging removed duplicates for manual review before permanent deletion
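
As one example of an alternative strategy, the sketch below picks the most complete record in a cluster and uses name length as a tie-break. The `completeness` scoring rule and the `pick_canonical` helper are hypothetical, and the sample record with an empty email is invented for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class Contact:
    """Contact record (same shape as the tutorial's Contact)."""

    name: str
    email: str
    source: str


def completeness(contact: Contact) -> int:
    """Count non-empty fields (hypothetical scoring rule)."""
    return sum(1 for f in fields(contact) if getattr(contact, f.name).strip())


def pick_canonical(cluster: list[Contact]) -> Contact:
    """Prefer the most complete record, then the longest name as tie-break."""
    return max(cluster, key=lambda c: (completeness(c), len(c.name)))


# Illustrative cluster: the second record has more fields filled in
cluster = [
    Contact("Jane Do", "", "Manual Entry"),
    Contact("Jane Doe", "jane.doe@company.com", "Import"),
]
canonical = pick_canonical(cluster)
print(canonical.name, "|", canonical.source)
```

Because the selection is a single `key` function, it can be combined with source priority by adding another element to the tuple.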

### Step 6: Tune Threshold for Precision/Recall Trade-off

The similarity threshold controls the trade-off between false positives (merging different people) and false negatives (missing duplicates).

**Experimentation**: Test different thresholds to find optimal balance for your data.

In [7]:
# Test different thresholds
thresholds = [0.7, 0.8, 0.85, 0.9]

print("Threshold Tuning Analysis")
print("=" * 60)

for threshold in thresholds:
    matrix = build_similarity_matrix(contacts, threshold=threshold)
    clusters = group_duplicates(matrix)
    deduped = merge_duplicates(contacts, clusters)

    removed = len(contacts) - len(deduped)

    print(f"\nThreshold: {threshold}")
    print(f"  Duplicate clusters found: {len(clusters)}")
    print(f"  Records removed: {removed}")
    print(f"  Final count: {len(deduped)}")

    # Show what was merged
    if clusters:
        print("  Merged groups:")
        for cluster in clusters:
            names = [contacts[idx].name for idx in cluster]
            print(f"    - {', '.join(names)}")

print("\n" + "=" * 60)
print("Recommendation:")
print("  - 0.7-0.75: Aggressive (more false positives)")
print("  - 0.8-0.85: Balanced (recommended for names)")
print("  - 0.9+: Conservative (fewer matches, higher confidence)")

Threshold Tuning Analysis

Threshold: 0.7
  Duplicate clusters found: 4
  Records removed: 4
  Final count: 4
  Merged groups:
    - John Smith, Jon Smith
    - Jane Doe, Jane Do
    - Robert Johnson, Bob Johnson
    - Alice Williams, Alise Williams

Threshold: 0.8
  Duplicate clusters found: 3
  Records removed: 3
  Final count: 5
  Merged groups:
    - John Smith, Jon Smith
    - Jane Doe, Jane Do
    - Alice Williams, Alise Williams

Threshold: 0.85
  Duplicate clusters found: 3
  Records removed: 3
  Final count: 5
  Merged groups:
    - John Smith, Jon Smith
    - Jane Doe, Jane Do
    - Alice Williams, Alise Williams

Threshold: 0.9
  Duplicate clusters found: 2
  Records removed: 2
  Final count: 6
  Merged groups:
    - John Smith, Jon Smith
    - Alice Williams, Alise Williams

Recommendation:
  - 0.7-0.75: Aggressive (more false positives)
  - 0.8-0.85: Balanced (recommended for names)
  - 0.9+: Conservative (fewer matches, higher confidence)


**Notes**:
- **Low threshold (<0.75)**: Catches more duplicates but risks merging different people
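- **High threshold (>0.9)**: Very safe but misses valid duplicates; at 0.9 the sample data already loses "Jane Doe" vs "Jane Do" (0.875), and the nickname pair "Bob" vs "Robert" drops out at 0.8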
- **Domain-specific**: Names tolerate lower thresholds than product codes or addresses
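
The trade-off can be quantified once you have labeled pairs. The sketch below treats the four "Expected duplicates" from Step 1 as ground truth and scores the three pairs found at threshold 0.8. The `pair_precision_recall` helper is hypothetical, and pairwise precision/recall is only one of several reasonable ways to evaluate deduplication.

```python
def pair(a: str, b: str) -> frozenset[str]:
    """Order-independent representation of a duplicate pair."""
    return frozenset((a, b))


def pair_precision_recall(
    predicted: set[frozenset[str]], labeled: set[frozenset[str]]
) -> tuple[float, float]:
    """Precision and recall over duplicate pairs."""
    true_positives = len(predicted & labeled)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(labeled) if labeled else 1.0
    return precision, recall


# Ground truth: the expected duplicates listed in Step 1
labeled = {
    pair("John Smith", "Jon Smith"),
    pair("Jane Doe", "Jane Do"),
    pair("Robert Johnson", "Bob Johnson"),
    pair("Alice Williams", "Alise Williams"),
}

# Pairs the tutorial's run reports at threshold 0.8
predicted = {
    pair("John Smith", "Jon Smith"),
    pair("Jane Doe", "Jane Do"),
    pair("Alice Williams", "Alise Williams"),
}

precision, recall = pair_precision_recall(predicted, labeled)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

At 0.8 nothing wrong is merged (precision 1.00), but the nickname pair is missed (recall 0.75). Repeating this for each candidate threshold turns the tuning exercise into a measurable comparison.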

## Complete Working Example

Here's a production-ready deduplication system combining all steps into a single function. Copy-paste this into your project for immediate use.

**Features**:
- ✅ Configurable similarity threshold
- ✅ Multiple algorithm support (Levenshtein, Jaro-Winkler, etc.)
- ✅ Customizable merge strategy (source priority)
- ✅ Audit trail (returns removed duplicates)
- ✅ Type-safe dataclass interface

In [8]:
"""Production-ready fuzzy deduplication system.

Copy this entire cell into your project for immediate use.
"""

from dataclasses import dataclass
from typing import Literal

from lionherd_core.libs.string_handlers._string_similarity import (
    jaro_winkler_similarity,
    levenshtein_similarity,
)


@dataclass
class Contact:
    """Contact record."""

    name: str
    email: str
    source: str


@dataclass
class DeduplicationResult:
    """Result of deduplication operation."""

    deduplicated: list[Contact]
    removed: list[Contact]
    clusters: list[set[int]]
    original_count: int
    final_count: int


def deduplicate_contacts(
    contacts: list[Contact],
    threshold: float = 0.8,
    algorithm: Literal["levenshtein", "jaro_winkler"] = "levenshtein",
    source_priority: dict[str, int] | None = None,
) -> DeduplicationResult:
    """Deduplicate contact list using fuzzy string matching.

    Args:
        contacts: List of contact records
        threshold: Similarity threshold (0.0-1.0), higher = stricter
        algorithm: Similarity algorithm to use
        source_priority: Dict mapping source to priority (lower = keep)

    Returns:
        DeduplicationResult with deduplicated contacts and audit info
    """
    # Select similarity function
    sim_func = {"levenshtein": levenshtein_similarity, "jaro_winkler": jaro_winkler_similarity}[
        algorithm
    ]

    # Default source priority
    if source_priority is None:
        source_priority = {"CRM": 1, "API": 2, "Import": 3, "Manual Entry": 4, "CSV Upload": 5}

    # Step 1: Build similarity matrix
    matrix = {}
    for i, contact_a in enumerate(contacts):
        similar = []
        for j, contact_b in enumerate(contacts):
            if i >= j:
                continue
            score = sim_func(contact_a.name, contact_b.name)
            if score >= threshold:
                similar.append((j, score))
        if similar:
            matrix[i] = similar

    # Step 2: Group duplicates (union-find)
    parent = {}

    def find(x):
        if x not in parent:
            parent[x] = x
        if parent[x] != x:
            parent[x] = find(parent[x])
        return parent[x]

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_x] = root_y

    for idx, similar_contacts in matrix.items():
        for similar_idx, _ in similar_contacts:
            union(idx, similar_idx)

    clusters_dict = {}
    for idx in parent:
        root = find(idx)
        if root not in clusters_dict:
            clusters_dict[root] = set()
        clusters_dict[root].add(idx)

    clusters = list(clusters_dict.values())

    # Step 3: Merge duplicates
    to_remove = set()
    for cluster in clusters:
        cluster_list = list(cluster)
        canonical_idx = min(
            cluster_list, key=lambda idx: source_priority.get(contacts[idx].source, 99)
        )
        for idx in cluster:
            if idx != canonical_idx:
                to_remove.add(idx)

    deduplicated = [contact for i, contact in enumerate(contacts) if i not in to_remove]

    removed = [contact for i, contact in enumerate(contacts) if i in to_remove]

    return DeduplicationResult(
        deduplicated=deduplicated,
        removed=removed,
        clusters=clusters,
        original_count=len(contacts),
        final_count=len(deduplicated),
    )


# Example usage
result = deduplicate_contacts(contacts, threshold=0.8, algorithm="levenshtein")

print("Deduplication Complete")
print("=" * 60)
print(f"Original contacts: {result.original_count}")
print(f"Deduplicated contacts: {result.final_count}")
print(f"Duplicates removed: {len(result.removed)}")
print(f"Duplicate clusters: {len(result.clusters)}")

print("\nRemoved Duplicates (for audit):")
for contact in result.removed:
    print(f"  - {contact.name} ({contact.source})")

print("\nFinal Contact List:")
for i, contact in enumerate(result.deduplicated, 1):
    print(f"{i}. {contact.name} | {contact.email} | {contact.source}")

Deduplication Complete
Original contacts: 8
Deduplicated contacts: 5
Duplicates removed: 3
Duplicate clusters: 3

Removed Duplicates (for audit):
  - Jon Smith (Manual Entry)
  - Jane Do (Manual Entry)
  - Alise Williams (Manual Entry)

Final Contact List:
1. John Smith | john.smith@example.com | CRM
2. Jane Doe | jane.doe@company.com | Import
3. Robert Johnson | robert.j@firm.com | API
4. Bob Johnson | bob.johnson@firm.com | CSV Upload
5. Alice Williams | alice.w@startup.com | CRM


## Production Considerations

### Error Handling

**What Can Go Wrong**:
1. **Empty contact list**: No records to deduplicate
2. **Invalid threshold**: Values outside 0.0-1.0 range
3. **Missing source priority**: Unknown source types in the data, which silently fall back to the lowest priority (99)
4. **Large datasets**: O(n²) complexity becomes slow for >10,000 records

**Handling**:
```python
def safe_deduplicate(
    contacts: list[Contact],
    threshold: float = 0.8
) -> DeduplicationResult | None:
    """Deduplicate with comprehensive error handling."""
    # Validate inputs
    if not contacts:
        print("Warning: Empty contact list")
        return None
    
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"Threshold must be 0.0-1.0, got {threshold}")
    
    # Warn about performance
    if len(contacts) > 10000:
        print(f"Warning: {len(contacts)} records may be slow (O(n²) algorithm)")
    
    try:
        return deduplicate_contacts(contacts, threshold)
    except Exception as e:
        print(f"Deduplication failed: {e}")
        return None
```
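
Unknown sources are not handled by the wrapper above; the base implementation quietly assigns them priority 99. A small pre-check can surface them before deduplication runs. This is a sketch: `find_unknown_sources` is a hypothetical helper, and the "Webform" source is invented for illustration.

```python
from collections import Counter

DEFAULT_PRIORITY = {"CRM": 1, "API": 2, "Import": 3, "Manual Entry": 4, "CSV Upload": 5}


def find_unknown_sources(sources: list[str], priority: dict[str, int]) -> Counter:
    """Count occurrences of sources that have no configured priority."""
    return Counter(source for source in sources if source not in priority)


# Illustrative input: "Webform" is unconfigured and "crm" differs only in case
sources = ["CRM", "Webform", "API", "Webform", "crm"]
unknown = find_unknown_sources(sources, DEFAULT_PRIORITY)

for source, count in unknown.most_common():
    print(f"Warning: source {source!r} has no priority ({count} records)")
```

The lookup is case-sensitive, so "crm" is reported as unknown; normalizing source names before the check may be worthwhile for your data.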

### Performance

**Scalability**:
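- **Similarity matrix**: O(n²) time; space is O(n²) in the worst case, but only pairs above the threshold are stored, so it is usually far smaller
- **Union-find**: O(log n) amortized per operation with path compression alone (O(α(n)), nearly constant, if union by rank or size is added)
- **Total**: O(n²), dominated by pairwise comparisons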

**Benchmarks** (approximate, single-threaded):
- 100 records: ~50ms
- 1,000 records: ~5s
- 10,000 records: ~8-10 minutes
- 100,000 records: Use blocking/clustering pre-processing

**Optimization for Large Datasets**:
```python
# Blocking: Group by first letter before comparing
def block_by_first_letter(contacts: list[Contact]) -> dict[str, list[Contact]]:
    """Group contacts by first letter to reduce comparisons."""
    blocks = {}
    for contact in contacts:
        key = contact.name[0].upper() if contact.name else "_"
        if key not in blocks:
            blocks[key] = []
        blocks[key].append(contact)
    return blocks

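# Deduplicate within each block (roughly O(n²/k) comparisons for k evenly sized blocks).
# Caveat: names that start with different letters ("Robert" vs "Bob") are never compared.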
blocks = block_by_first_letter(contacts)
all_deduplicated = []
for block_contacts in blocks.values():
    result = deduplicate_contacts(block_contacts, threshold=0.8)
    all_deduplicated.extend(result.deduplicated)
```
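
To see what blocking buys and what it costs, the following sketch counts comparisons for the tutorial's eight sample names with and without first-letter blocking. It is an illustration only and uses plain name strings rather than `Contact` objects.

```python
from collections import defaultdict

names = [
    "John Smith", "Jon Smith", "Jane Doe", "Jane Do",
    "Robert Johnson", "Bob Johnson", "Alice Williams", "Alise Williams",
]


def comparisons(n: int) -> int:
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2


# Same blocking key as above: first letter, upper-cased
blocks: dict[str, list[str]] = defaultdict(list)
for name in names:
    blocks[name[0].upper() if name else "_"].append(name)

full = comparisons(len(names))
blocked = sum(comparisons(len(block)) for block in blocks.values())
print(f"All pairs: {full}, within blocks: {blocked}")

# Pairs split across blocks are never compared
nickname_pair_compared = any(
    {"Robert Johnson", "Bob Johnson"} <= set(block) for block in blocks.values()
)
print(f"Robert/Bob compared: {nickname_pair_compared}")
```

Blocking cuts the work from 28 comparisons to 7 here, but the Robert/Bob pair can no longer be found at any threshold, so the blocking key should be chosen with the expected kinds of variation in mind.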

### Testing

**Unit Tests**:
```python
def test_deduplication():
    """Test basic deduplication functionality."""
    contacts = [
        Contact("John Smith", "j@example.com", "CRM"),
        Contact("Jon Smith", "jon@example.com", "Manual Entry"),
    ]
    
    result = deduplicate_contacts(contacts, threshold=0.8)
    
    # Should merge into 1 contact
    assert result.final_count == 1
    assert len(result.removed) == 1
    
    # Should keep CRM record (higher priority)
    assert result.deduplicated[0].name == "John Smith"
    assert result.deduplicated[0].source == "CRM"
```

**Integration Tests**:
- Test with real data samples (100-1000 records)
- Validate precision/recall against manual labels
- Test different thresholds (0.7, 0.8, 0.85, 0.9)
- Verify audit trail (removed duplicates logged correctly)

### Monitoring

**Key Metrics**:
- **Deduplication rate**: % of records removed (typical: 5-15% for CRM data)
- **Cluster size distribution**: Most clusters should be pairs, large clusters (>5) may indicate issues
- **Processing time**: Track time per 1000 records for capacity planning

**Observability**:
```python
import time

def deduplicate_with_metrics(contacts: list[Contact]) -> DeduplicationResult:
    """Deduplicate with metric emission."""
    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    
    result = deduplicate_contacts(contacts, threshold=0.8)
    
    duration = time.perf_counter() - start
    dedup_rate = len(result.removed) / result.original_count if result.original_count > 0 else 0
    # Average members per cluster (removed / clusters would undercount each cluster by one)
    avg_cluster_size = (
        sum(len(c) for c in result.clusters) / len(result.clusters) if result.clusters else 0
    )
    
    # Log metrics
    print("Deduplication metrics:")
    print(f"  Duration: {duration:.2f}s")
    print(f"  Records: {result.original_count} → {result.final_count}")
    print(f"  Dedup rate: {dedup_rate:.1%}")
    print(f"  Clusters: {len(result.clusters)}")
    print(f"  Avg cluster size: {avg_cluster_size:.1f}")
    
    return result
```
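
The cluster size distribution metric can be computed directly from the `clusters` list. This sketch uses hypothetical helper names and an invented cluster list; the limit of 5 mirrors the "large clusters (>5)" guideline above.

```python
from collections import Counter


def cluster_size_distribution(clusters: list[set[int]]) -> dict[int, int]:
    """Map cluster size to the number of clusters of that size."""
    return dict(sorted(Counter(len(cluster) for cluster in clusters).items()))


def oversized_clusters(clusters: list[set[int]], limit: int = 5) -> list[set[int]]:
    """Clusters larger than `limit`, which may deserve manual inspection."""
    return [cluster for cluster in clusters if len(cluster) > limit]


# Illustrative clusters: three pairs and one suspiciously large group
clusters = [{0, 1}, {2, 3}, {6, 7}, {10, 11, 12, 13, 14, 15}]

distribution = cluster_size_distribution(clusters)
suspicious = oversized_clusters(clusters)
print(distribution)
print(f"{len(suspicious)} cluster(s) above the size limit")
```

A large cluster often comes from chained matches (A~B, B~C, C~D) rather than from genuinely identical records, so it is a useful signal that the threshold may be too low.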

### Configuration Tuning

**threshold**:
- Too low (< 0.7): False positives become likely; around 0.5, even "Jane Doe" and "Jane Smith" would merge
- Too high (> 0.9): False negatives; in the sample data, "Jane Doe" vs "Jane Do" (0.875) is no longer matched
- Recommended: 0.8-0.85 for names, 0.9+ for product codes/IDs

**algorithm**:
- `levenshtein`: Best for typos and character-level edits ("John" vs "Jon")
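- `jaro_winkler`: Best for shared-prefix variations and truncated names ("Jonathan" vs "Jon")
- Recommended: Start with `levenshtein`; try `jaro_winkler` if prefix variations are being missed. Nicknames such as "Robert" vs "Bob" share little text in the given name, so they match only at low thresholds (0.7 in this tutorial's data)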

**source_priority**:
- Adjust based on data quality: CRM (most accurate) > API > Manual Entry (least reliable)
- Alternative: Use timestamp (most recent), completeness score (most fields filled), or manual review flag

## Variations

### 1. Multi-Field Matching

**When to Use**: Higher confidence deduplication by combining name + email + other fields

**Approach**:
```python
def multi_field_similarity(contact_a: Contact, contact_b: Contact) -> float:
    """Calculate weighted similarity across multiple fields."""
    name_sim = levenshtein_similarity(contact_a.name, contact_b.name)
    
    # Extract email username (before @)
    email_a = contact_a.email.split("@")[0]
    email_b = contact_b.email.split("@")[0]
    email_sim = levenshtein_similarity(email_a, email_b)
    
    # Weighted average: name 70%, email 30%
    return 0.7 * name_sim + 0.3 * email_sim

# Use in similarity matrix
for i, contact_a in enumerate(contacts):
    for j, contact_b in enumerate(contacts):
        if i >= j:
            continue
        score = multi_field_similarity(contact_a, contact_b)
        # ... rest of logic
```

**Trade-offs**:
- ✅ Higher precision - fewer false positives
- ✅ Can recover duplicates whose names differ moderately when the emails match (the email term adds at most 0.3 to the score)
- ❌ Slower - more comparisons per pair
- ❌ May miss duplicates with different emails
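
A little arithmetic shows how the 70/30 weighting interacts with a 0.8 threshold. The sketch below works with precomputed similarity values (the inputs are illustrative numbers, not measurements) and a hypothetical `weighted_score` helper.

```python
NAME_WEIGHT = 0.7
THRESHOLD = 0.8


def weighted_score(name_sim: float, email_sim: float, name_weight: float = NAME_WEIGHT) -> float:
    """Weighted average of name and email similarity."""
    return name_weight * name_sim + (1.0 - name_weight) * email_sim


# Identical email, quite different name
same_email_different_name = weighted_score(name_sim=0.5, email_sim=1.0)

# Near-identical name, very different email
similar_name_different_email = weighted_score(name_sim=0.9, email_sim=0.2)

# Lowest name similarity that still passes when the emails are identical
min_name_sim = (THRESHOLD - (1.0 - NAME_WEIGHT) * 1.0) / NAME_WEIGHT

print(f"same email, different name: {same_email_different_name:.2f}")
print(f"similar name, different email: {similar_name_different_email:.2f}")
print(f"minimum name similarity with identical emails: {min_name_sim:.3f}")
```

Both example pairs fall below 0.8, which suggests lowering the threshold or rebalancing the weights when switching from name-only to multi-field scoring.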

### 2. Incremental Deduplication

**When to Use**: New records added frequently, avoid re-processing entire dataset

**Approach**:
```python
def incremental_deduplicate(
    existing: list[Contact],
    new_records: list[Contact],
    threshold: float = 0.8
) -> tuple[list[Contact], list[Contact]]:
    """Deduplicate new records against existing database.
    
    Returns:
        (unique_new_records, duplicate_new_records)
    """
    unique = []
    duplicates = []
    
    for new_contact in new_records:
        # Check if similar to any existing contact
        is_duplicate = False
        for existing_contact in existing:
            score = levenshtein_similarity(new_contact.name, existing_contact.name)
            if score >= threshold:
                duplicates.append(new_contact)
                is_duplicate = True
                break
        
        if not is_duplicate:
            unique.append(new_contact)
    
    return unique, duplicates

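# Usage: only compare new records against existing ones (O(n×m) instead of O((n+m)²))
# `new_imports` is a hypothetical incoming batch, defined here so the snippet runs
new_imports = [Contact("Jon Smith", "jon.s@example.com", "Import")]
unique, dupes = incremental_deduplicate(deduplicated_contacts, new_imports)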
```

**Trade-offs**:
- ✅ Much faster for large existing datasets
- ✅ Suitable for real-time duplicate detection
- ❌ Doesn't detect duplicates within new batch
- ❌ Existing dataset must already be deduplicated
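
The first limitation can be addressed by adding each accepted record to the comparison pool as the batch is processed. The sketch below is one possible approach, not the tutorial's implementation. To stay self-contained it works on plain name strings and swaps in `difflib.SequenceMatcher` from the standard library as the similarity function; pass a different callable via the hypothetical `similarity` parameter to use Levenshtein instead.

```python
from collections.abc import Callable
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Stand-in similarity from the standard library (not Levenshtein)."""
    return SequenceMatcher(None, a, b).ratio()


def incremental_with_batch_check(
    existing: list[str],
    new_names: list[str],
    threshold: float = 0.8,
    similarity: Callable[[str, str], float] = ratio,
) -> tuple[list[str], list[str]]:
    """Check new names against existing ones and against earlier names in the batch."""
    pool = list(existing)  # copy so the caller's list is not modified
    unique: list[str] = []
    duplicates: list[str] = []
    for name in new_names:
        if any(similarity(name, seen) >= threshold for seen in pool):
            duplicates.append(name)
        else:
            unique.append(name)
            pool.append(name)  # later names in the batch are compared against this one
    return unique, duplicates


existing = ["John Smith", "Jane Doe"]
batch = ["Jon Smith", "Alice Williams", "Alise Williams"]
unique, dupes = incremental_with_batch_check(existing, batch)
print("unique:", unique)
print("duplicates:", dupes)
```

"Alise Williams" is caught even though "Alice Williams" arrived in the same batch. The outcome depends on batch order, since the first spelling seen is the one that is kept.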

### 3. Manual Review Queue

**When to Use**: High-stakes deduplication where errors are costly (financial, legal, medical)

**Approach**:
```python
@dataclass
class ReviewItem:
    """Potential duplicate for manual review."""
    contact_a: Contact
    contact_b: Contact
    similarity: float
    auto_merge: bool  # True if confidence high enough

def generate_review_queue(
    contacts: list[Contact],
    auto_threshold: float = 0.95,  # Auto-merge above this
    review_threshold: float = 0.75  # Manual review between thresholds
) -> tuple[list[ReviewItem], list[Contact]]:
    """Generate review queue for borderline duplicates.
    
    Returns:
        (review_queue, auto_merged_contacts)
    """
    review_queue = []
    
    # Build similarity matrix
    for i, contact_a in enumerate(contacts):
        for j, contact_b in enumerate(contacts):
            if i >= j:
                continue
            
            score = levenshtein_similarity(contact_a.name, contact_b.name)
            
            if score >= review_threshold:
                review_queue.append(ReviewItem(
                    contact_a=contact_a,
                    contact_b=contact_b,
                    similarity=score,
                    auto_merge=(score >= auto_threshold)
                ))
    
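    # Auto-merge high-confidence duplicates.
    # Placeholder merge (an assumption, not a full strategy): keep contact_a from
    # each high-confidence pair. Swap in source priority or similar as needed.
    auto_merge_items = [item for item in review_queue if item.auto_merge]
    auto_merged = [item.contact_a for item in auto_merge_items]
    
    # Return borderline cases for manual review
    manual_review = [item for item in review_queue if not item.auto_merge]
    
    return manual_review, auto_merged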
```

**Trade-offs**:
- ✅ Prevents costly false positives
- ✅ Builds confidence through human validation
- ❌ Requires manual effort
- ❌ Slower processing time
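
The two-threshold rule at the heart of this variation can be isolated into a small function, which makes it easy to test on its own. The `Decision` enum and `triage` function are hypothetical names; the default thresholds match the ones used above, and the scores 0.97 and 0.60 are illustrative.

```python
from enum import Enum


class Decision(Enum):
    AUTO_MERGE = "auto_merge"
    REVIEW = "review"
    DISTINCT = "distinct"


def triage(
    score: float, auto_threshold: float = 0.95, review_threshold: float = 0.75
) -> Decision:
    """Classify a similarity score using two thresholds."""
    if not 0.0 <= review_threshold <= auto_threshold <= 1.0:
        raise ValueError("Expected 0.0 <= review_threshold <= auto_threshold <= 1.0")
    if score >= auto_threshold:
        return Decision.AUTO_MERGE
    if score >= review_threshold:
        return Decision.REVIEW
    return Decision.DISTINCT


# 0.929 and 0.875 are scores from the tutorial's sample data; the others are illustrative
scores = [0.97, 0.929, 0.875, 0.60]
decisions = [triage(score) for score in scores]
for score, decision in zip(scores, decisions):
    print(f"{score:.3f} -> {decision.value}")
```

With these defaults, every duplicate pair in the tutorial's sample data would go to manual review rather than being merged automatically, which is the conservative behavior this variation is meant to provide.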

## Choosing the Right Variation

| Scenario | Recommended Variation |
|----------|----------------------|
| High data quality requirements (finance, medical) | Manual Review Queue |
| Large existing database + frequent new records | Incremental Deduplication |
| Multiple identifying fields available | Multi-Field Matching |
| Standard contact/customer deduplication | Base implementation (this tutorial) |

## Summary

**What You Accomplished**:
- ✅ Built a fuzzy deduplication system on lionherd-core's string similarity functions, using Levenshtein similarity for scoring
- ✅ Implemented similarity matrix construction for pairwise comparisons
- ✅ Used union-find algorithm to group transitive duplicates efficiently
- ✅ Created configurable merge strategy with source priority
- ✅ Learned threshold tuning for precision/recall trade-offs

**Key Takeaways**:
1. **Levenshtein distance is ideal for typo detection**: Handles single-character edits ("John" vs "Jon") effectively
2. **Threshold tuning is critical**: 0.8-0.85 works for most names, adjust based on false positive/negative analysis
3. **Union-find efficiently groups duplicates**: Path compression keeps each operation at O(log n) amortized (O(α(n)) with union by rank added), and transitive relationships are handled automatically
4. **Source priority matters**: CRM > API > Manual Entry - keep the most trusted record
5. **O(n²) complexity requires optimization for scale**: Use blocking/clustering for >10,000 records

**When to Use This Pattern**:
- ✅ Contact list deduplication (CRM systems, email marketing)
- ✅ Customer database cleanup (merging duplicate accounts)
- ✅ Product catalog deduplication (handling typos in product names)
- ✅ User account merging (same person, different spellings)
- ❌ Exact-match deduplication (use `set()` or SQL `DISTINCT` instead)
- ❌ Large-scale entity resolution (>100,000 records - use specialized tools)

## Related Resources

**lionherd-core API Reference**:
- [string_similarity](../../../docs/api/libs/string_handlers/string_similarity.md) - Complete API for similarity algorithms
- [SimilarityAlgo](../../../docs/api/libs/string_handlers/string_similarity.md#similarityalgo) - Available algorithm enum

**Related Tutorials**:
- [Fuzzy Validation](../ln_utilities/fuzzy_validation.ipynb) - Using fuzzy matching for data validation

**External Resources**:
- [Levenshtein Distance - Wikipedia](https://en.wikipedia.org/wiki/Levenshtein_distance) - Theory and algorithms
- [Dedupe.io Documentation](https://docs.dedupe.io/en/latest/) - Advanced deduplication techniques for large datasets