# PyDI Data Integration Tutorial

This tutorial walks through an end-to-end data integration workflow with PyDI. We'll work with movie datasets to showcase the pipeline from data loading and profiling through entity matching to data fusion.

### What You'll Learn

1. **Data Loading & Profiling**: Load and analyze movie datasets with provenance tracking
2. **Entity Matching**: 
   - Blocking strategies (Standard, Sorted Neighbourhood, Token-based, Embedding-based)
   - Multi-attribute similarity matching with custom comparators
   - Machine learning-based entity matching
3. **Data Fusion**: 
   - Conflict resolution with custom fusion rules
   - Quality assessment against test set
   - Provenance-based conflict resolution

### Datasets

We'll use three movie datasets:
- **Academy Awards**: Movies with Oscar information (4,580 records)
- **Actors**: Movies with actor details (151 records)
- **Golden Globes**: Movies with Golden Globe awards (2,279 records)

These datasets contain overlapping movie information but with different attributes, data quality issues, and conflicting values - perfect for demonstrating real-world data integration challenges.

In [1]:
from pathlib import Path

# Setup paths
def get_repo_root():
    """Get repository root directory."""
    current = Path.cwd()
    while current != current.parent:
        if (current / 'pyproject.toml').exists():
            return current
        current = current.parent
    return Path.cwd()

ROOT = get_repo_root()
OUTPUT_DIR = ROOT / "docs" / "tutorial" / "output" / "movies"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"PyDI Tutorial")
print(f"Repository root: {ROOT}")
print(f"Output directory: {OUTPUT_DIR}")
print(f"All systems ready! 🚀")

PyDI Tutorial
Repository root: /Users/aaronsteiner/Documents/GitHub/PyDI
Output directory: /Users/aaronsteiner/Documents/GitHub/PyDI/docs/tutorial/output/movies
All systems ready! 🚀


## Part 1: Data Loading and Profiling

PyDI provides provenance-aware data loading that automatically tracks dataset metadata and optionally adds unique identifiers to each record. Let's load our movie datasets and understand their characteristics.

In [2]:
from PyDI.io import load_xml

# Define dataset paths
DATA_DIR = ROOT / "docs" / "tutorial" / "input" / "movies"

# Load Academy Awards dataset
academy_awards = load_xml(
    DATA_DIR / "data" / "academy_awards.xml",
    name="academy_awards",
    nested_handling="aggregate"
)

# Load Actors dataset  
actors = load_xml(
    DATA_DIR / "data" / "actors.xml",
    name="actors", 
    nested_handling="aggregate"
)

# Load Golden Globes dataset
golden_globes = load_xml(
    DATA_DIR / "data" / "golden_globes.xml",
    name="golden_globes",
    nested_handling="aggregate"
)

# Display basic information
datasets = [academy_awards, actors, golden_globes]
names = ["Academy Awards", "Actors", "Golden Globes"]

for df, name in zip(datasets, names):
    print(f"{name}:")
    print(f"  Records: {len(df):,}")
    print(f"  Attributes: {len(df.columns)}")
    print(f"  Columns: {list(df.columns)}")
    print(f"  Dataset name: {df.attrs.get('dataset_name', 'unknown')}")
    print()

total_records = sum(len(df) for df in datasets)
print(f"Total records across all datasets: {total_records:,}")

Academy Awards:
  Records: 4,580
  Attributes: 6
  Columns: ['id', 'title', 'actors_actor_name', 'date', 'director_name', 'oscar']
  Dataset name: academy_awards

Actors:
  Records: 151
  Attributes: 6
  Columns: ['id', 'title', 'actors_actor_name', 'actors_actor_birthday', 'actors_actor_birthplace', 'date']
  Dataset name: actors

Golden Globes:
  Records: 2,279
  Attributes: 6
  Columns: ['id', 'title', 'actors_actor_name', 'date', 'director_name', 'globe']
  Dataset name: golden_globes

Total records across all datasets: 7,010


In [3]:
# Preview the data structure

print("\n📽️ Academy Awards Dataset:")
display(academy_awards.head(3))

print("\n🎭 Actors Dataset:")
display(actors.head(3))

print("\n🏆 Golden Globes Dataset:")
display(golden_globes.head(3))


📽️ Academy Awards Dataset:


| | id | title | actors_actor_name | date | director_name | oscar |
|---|---|---|---|---|---|---|
| 0 | academy_awards_1 | Biutiful | Javier Bardem | 2010-01-01 | | |
| 1 | academy_awards_2 | True Grit | [Jeff Bridges, Hailee Steinfeld] | 2010-01-01 | Joel Coen and Ethan Coen | |
| 2 | academy_awards_3 | The Social Network | Jesse Eisenberg | 2010-01-01 | David Fincher | yes |



🎭 Actors Dataset:


| | id | title | actors_actor_name | actors_actor_birthday | actors_actor_birthplace | date |
|---|---|---|---|---|---|---|
| 0 | actors_1 | 7th Heaven | Janet Gaynor | 1906-01-01 | Pennsylvania | 1929-01-01 |
| 1 | actors_2 | Coquette | Mary Pickford | 1892-01-01 | Canada | 1930-01-01 |
| 2 | actors_3 | The Divorcee | Norma Shearer | 1902-01-01 | Canada | 1931-01-01 |



🏆 Golden Globes Dataset:


| | id | title | actors_actor_name | date | director_name | globe |
|---|---|---|---|---|---|---|
| 0 | golden_globes_1 | Frankie and Alice | Halle Berry | 2011-01-01 | | |
| 1 | golden_globes_2 | Rabbit Hole | Nicole Kidman | 2011-01-01 | | |
| 2 | golden_globes_3 | Winter's Bone | Jennifer Lawrence | 2011-01-01 | | |


### Data Quality Analysis

Let's use PyDI's profiling capabilities to understand our data quality and identify the best attributes for matching.

### Basic Dataset Summary

First, let's use the DataProfiler's `summary()` method to get basic statistics for each dataset.

In [4]:
from PyDI.profiling import DataProfiler

# Initialize the DataProfiler
profiler = DataProfiler()

for df, name in zip(datasets, names):
    # summary() prints basic statistics and returns them as a dictionary
    profile = profiler.summary(df)

# `profile` now holds the summary of the last dataset in the loop (Golden Globes)
display(profile)

academy_awards:
  Rows: 4,580
  Columns: 6
  Total nulls: 11,028
  Null percentage: 40.1%
  Null counts per column:
    title: 12 (0.3%)
    actors_actor_name: 3,531 (77.1%)
    director_name: 4,172 (91.1%)
    oscar: 3,313 (72.3%)

actors:
  Rows: 151
  Columns: 6
  Total nulls: 0
  Null percentage: 0.0%

golden_globes:
  Rows: 2,279
  Columns: 6
  Total nulls: 3,677
  Null percentage: 26.9%
  Null counts per column:
    actors_actor_name: 54 (2.4%)
    director_name: 1,966 (86.3%)
    globe: 1,657 (72.7%)



{'rows': 2279,
 'columns': 6,
 'nulls_total': 3677,
 'nulls_per_column': {'id': 0,
  'title': 0,
  'actors_actor_name': 54,
  'date': 0,
  'director_name': 1966,
  'globe': 1657},
 'dtypes': {'id': 'object',
  'title': 'object',
  'actors_actor_name': 'object',
  'date': 'object',
  'director_name': 'object',
  'globe': 'object'}}

### Attribute Coverage Analysis

Next, let's use the `analyze_coverage()` method to understand how attributes overlap across datasets.

In [5]:
coverage = profiler.analyze_coverage(
    datasets=datasets,
    include_samples=True,
    sample_count=3  # Show 3 sample values per attribute
)

print("📊 Attribute coverage across datasets:")
display(coverage)

# Identify attributes suitable for entity matching
print("\n🔗 Attributes suitable for entity matching:")
matching_attrs = coverage[coverage['datasets_with_attribute'] >= 2]['attribute'].tolist()
print(f"Attributes available in 2+ datasets: {matching_attrs}")

📊 Attribute coverage across datasets:


| | attribute | academy_awards_count | academy_awards_pct | academy_awards_coverage | academy_awards_samples | actors_count | actors_pct | actors_coverage | actors_samples | golden_globes_count | golden_globes_pct | golden_globes_coverage | golden_globes_samples | avg_coverage | max_coverage | datasets_with_attribute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | actors_actor_birthday | 0/0 | 0% | 0.0 | | 151/151 | 100.0% | 1.0 | ['1906-01-01', '1892-01-01', '1902-01-01'] | 0/0 | 0% | 0.0 | | 0.333333 | 1.0 | 1 |
| 1 | actors_actor_birthplace | 0/0 | 0% | 0.0 | | 151/151 | 100.0% | 1.0 | ['Pennsylvania', 'Canada', 'Canada'] | 0/0 | 0% | 0.0 | | 0.333333 | 1.0 | 1 |
| 2 | actors_actor_name | 1049/4580 | 22.9% | 0.229039 | ['Javier Bardem', ['Jeff Bridges', 'Hailee Ste... | 151/151 | 100.0% | 1.0 | ['Janet Gaynor', 'Mary Pickford', 'Norma Shear... | 2225/2279 | 97.6% | 0.976305 | ['Halle Berry', 'Nicole Kidman', 'Jennifer Law... | 0.735115 | 1.0 | 3 |
| 3 | date | 4580/4580 | 100.0% | 1.0 | ['2010-01-01', '2010-01-01', '2010-01-01'] | 151/151 | 100.0% | 1.0 | ['1929-01-01', '1930-01-01', '1931-01-01'] | 2279/2279 | 100.0% | 1.0 | ['2011-01-01', '2011-01-01', '2011-01-01'] | 1.0 | 1.0 | 3 |
| 4 | director_name | 408/4580 | 8.9% | 0.089083 | ['Joel Coen and Ethan Coen', 'David Fincher', ... | 0/0 | 0% | 0.0 | | 313/2279 | 13.7% | 0.137341 | ['Darren Aronofsky', 'David Fincher', 'Tom Hoo... | 0.075475 | 0.137341 | 2 |
| 5 | globe | 0/0 | 0% | 0.0 | | 0/0 | 0% | 0.0 | | 622/2279 | 27.3% | 0.272927 | ['yes', 'yes', 'yes'] | 0.090976 | 0.272927 | 1 |
| 6 | id | 4580/4580 | 100.0% | 1.0 | ['academy_awards_1', 'academy_awards_2', 'acad... | 151/151 | 100.0% | 1.0 | ['actors_1', 'actors_2', 'actors_3'] | 2279/2279 | 100.0% | 1.0 | ['golden_globes_1', 'golden_globes_2', 'golden... | 1.0 | 1.0 | 3 |
| 7 | oscar | 1267/4580 | 27.7% | 0.276638 | ['yes', 'yes', 'yes'] | 0/0 | 0% | 0.0 | | 0/0 | 0% | 0.0 | | 0.092213 | 0.276638 | 1 |
| 8 | title | 4568/4580 | 99.7% | 0.99738 | ['Biutiful', 'True Grit', 'The Social Network'] | 151/151 | 100.0% | 1.0 | ['7th Heaven', 'Coquette', 'The Divorcee'] | 2279/2279 | 100.0% | 1.0 | ['Frankie and Alice', 'Rabbit Hole', "Winter's... | 0.999127 | 1.0 | 3 |



🔗 Attributes suitable for entity matching:
Attributes available in 2+ datasets: ['actors_actor_name', 'date', 'director_name', 'id', 'title']
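
For intuition, the coverage figures in the table can be reproduced by hand: for each attribute, count the non-null values in each dataset and divide by the number of rows. The sketch below does this with plain Python dictionaries on made-up toy records. The helper name `attribute_coverage` is hypothetical and not part of PyDI, and PyDI's own implementation may differ.

```python
def attribute_coverage(datasets):
    """Return {attribute: {dataset_name: fraction of non-null values}}.

    Hypothetical helper for illustration only; not a PyDI function.
    """
    attributes = sorted({key for rows in datasets.values() for row in rows for key in row})
    coverage = {}
    for attr in attributes:
        coverage[attr] = {}
        for name, rows in datasets.items():
            # A missing key and an explicit None both count as "no value"
            non_null = sum(1 for row in rows if row.get(attr) is not None)
            coverage[attr][name] = non_null / len(rows) if rows else 0.0
    return coverage


# Toy records (invented for this example)
toy = {
    "academy_awards": [
        {"title": "True Grit", "date": "2010-01-01", "oscar": None},
        {"title": None, "date": "2010-01-01", "oscar": "yes"},
    ],
    "actors": [
        {"title": "Coquette", "date": "1930-01-01"},
    ],
}

coverage = attribute_coverage(toy)

# Attributes with at least one value in two or more datasets are matching candidates
shared = [attr for attr, per_ds in coverage.items() if sum(v > 0 for v in per_ds.values()) >= 2]

print(coverage["title"])
print(shared)
```

Note that an attribute such as `id` is present in every dataset but is not useful for matching, so the list of shared attributes still needs a manual review.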


### Detailed Data Profiling

Now let's generate comprehensive HTML profiles for each dataset using the `profile()` method. These reports provide in-depth statistical analysis.

In [6]:
# Generate detailed HTML profiles for each dataset

profile_dir = OUTPUT_DIR / "dataset-profiles"
profile_dir.mkdir(parents=True, exist_ok=True)

profile_paths = []

for df, name in zip(datasets, names):
    print(f"📊 Profiling {name}...")
    
    profile_path = profiler.profile(df, str(profile_dir))
    profile_paths.append(profile_path)
    print(f"  ✅ Profile saved: {profile_path}")

print(f"\n🎯 Generated {len(profile_paths)} detailed HTML reports")
print(f"📁 Location: {profile_dir}")
print("\n💡 Open these HTML files in your browser for interactive exploration:")
for path in profile_paths:
    print(f"  • {Path(path).name}")


📊 Profiling Academy Awards...


  ✅ Profile saved: c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\dataset-profiles\academy_awards_profile.html
📊 Profiling Actors...


  ✅ Profile saved: c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\dataset-profiles\actors_profile.html
📊 Profiling Golden Globes...


  ✅ Profile saved: c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\dataset-profiles\golden_globes_profile.html

🎯 Generated 3 detailed HTML reports
📁 Location: c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\dataset-profiles

💡 Open these HTML files in your browser for interactive exploration:
  • academy_awards_profile.html
  • actors_profile.html
  • golden_globes_profile.html


## Part 2: Entity Matching

Entity Matching is the process of identifying records that refer to the same real-world entity. PyDI implements different blocking and matching methods.

### Step 1: Blocking

Blocking reduces the number of comparisons from O(n²) to a manageable subset. Let's explore different blocking strategies.
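
To make the idea concrete, here is a minimal sketch of key-based blocking in plain Python: records are grouped by a blocking key, and only records that share a key are paired. The helper `block_pairs`, the toy titles, and the two-character key are assumptions made for this illustration; they are not PyDI code.

```python
from collections import defaultdict
from itertools import product


def block_pairs(left, right, key):
    """Pair only records whose blocking key is identical (illustrative helper)."""
    buckets = defaultdict(list)
    for record in right:
        buckets[key(record)].append(record)
    # Each left record is compared only with the right records in its bucket
    return [(l, r) for l in left for r in buckets.get(key(l), [])]


left = ["Casablanca", "Gigi", "Ben-Hur"]
right = ["Casablanca (1942)", "Cabaret", "Gigi", "Giant", "Annie Hall"]

all_pairs = list(product(left, right))  # full cross product
blocked = block_pairs(left, right, key=lambda title: title[:2].upper())

print(f"Without blocking: {len(all_pairs)} pairs")
print(f"With blocking:    {len(blocked)} pairs")
```

The trade-off is visible even at this scale: a stricter key removes more comparisons but risks separating true matches whose keys differ slightly.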

In [7]:
# Let's set up logging first
import logging

import os
os.makedirs('output/logs', exist_ok=True)

# choose either default logging or debug logging

# Configure logging for INFO level
logging.basicConfig(
    level=logging.INFO,
    format='[%(levelname)-5s] %(name)s - %(message)s',
    handlers=[
          logging.FileHandler('output/logs/pydi.log'),  # Save to file
          logging.StreamHandler()                      # Display on console
      ],
    force=True
)

# # Configure logging for DEBUG level
# logging.basicConfig(
#     level=logging.DEBUG,
#     format='[%(levelname)-5s] %(name)s - %(message)s',
#     handlers=[
#           logging.FileHandler('output/logs/pydi.log'),  # Save to file
#           logging.StreamHandler()                      # Display on console
#       ],
#     force=True
# )

In [8]:
from PyDI.entitymatching import NoBlocker, StandardBlocker, SortedNeighbourhoodBlocker, TokenBlocker, EmbeddingBlocker

# We'll focus on Actors and Golden Globes for showcasing blocking strategies

max_pairs = len(actors) * len(golden_globes)
print(f"Without blocking: {max_pairs:,} comparisons required")
print("\n🎯 Goal: Reduce comparisons while maintaining high recall\n")

# No Blocking - compare all possible pairs
print("\n No Blocking")

no_blocker = NoBlocker(
    actors, golden_globes,
    batch_size=1000,
    id_column='id'  # specify the ID column for both datasets
)

# in an actual large-scale application, we do not build a list of all pairs but stream over them like this
for batch in no_blocker:
    # do something with the pairs
    continue

# but we can also generate the full set of pairs for smaller datasets
no_candidates = no_blocker.materialize()

print(f"  Generated: {len(no_candidates):,} candidates")

Without blocking: 344,129 comparisons required

🎯 Goal: Reduce comparisons while maintaining high recall


 No Blocking
  Generated: 344,129 candidates
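
The difference between streaming and materializing can be sketched with a generator that yields the cross product in fixed-size batches, so that only one batch is held in memory at a time. This is an illustration of the pattern under the assumption that batches are simple lists of ID pairs; the function `iter_pair_batches` is hypothetical and does not describe how `NoBlocker` is implemented.

```python
from itertools import islice, product


def iter_pair_batches(left_ids, right_ids, batch_size):
    """Yield lists of at most `batch_size` ID pairs, one batch at a time."""
    pairs = product(left_ids, right_ids)  # lazy iterator over all pairs
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch


left_ids = [f"actors_{i}" for i in range(1, 4)]
right_ids = [f"golden_globes_{i}" for i in range(1, 6)]

# Streaming: process each batch and discard it
batch_sizes = [len(batch) for batch in iter_pair_batches(left_ids, right_ids, batch_size=4)]
print(batch_sizes)

# Materializing: collect every pair into one list (fine for small inputs only)
all_pairs = [pair for batch in iter_pair_batches(left_ids, right_ids, batch_size=4) for pair in batch]
print(len(all_pairs))
```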


Now let's use an actual blocker. Note that when instantiating the blocker, it also writes out a corresponding debug file.

In [9]:
# 1. Standard Blocking - key built from the first 2 characters of each of the first three title tokens
print("\n1️⃣ Standard Blocking (Concatenation of first 2 characters of each of the first three tokens of title)")

# Add title_prefix directly to the original dataframes
actors['title_prefix'] = actors['title'].astype(str).apply(lambda x: ''.join([word[:2].upper() for word in x.split()[:3]]))
golden_globes['title_prefix'] = golden_globes['title'].astype(str).apply(lambda x: ''.join([word[:2].upper() for word in x.split()[:3]]))

standard_blocker_a2g = StandardBlocker(
    actors, golden_globes,
    on=['title_prefix'],
    batch_size=1000,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)

standard_candidates_a2g = standard_blocker_a2g.materialize()

print()
print(f"  Generated: {len(standard_candidates_a2g):,} candidates")

[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocker - created 145 blocking keys for first dataset
[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocker - created 1522 blocking keys for second dataset
[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocker - created 91 blocks from blocking keys
[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocker - Debug results written to file: c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\blocking-evaluation\debugResultsBlocking_StandardBlocker.csv



1️⃣ Standard Blocking (Concatenation of first 2 characters of each of the first three tokens of title)

  Generated: 277 candidates


In [10]:
# 2. Sorted Neighbourhood - Sequential similarity
print("\n2️⃣ Sorted Neighbourhood Blocking (Title-based, Window=20)")

sn_blocker_a2g = SortedNeighbourhoodBlocker(
    actors, golden_globes,
    key='title',  # Sort by title
    window=20,     # Compare with 20 neighbors
    batch_size=1000,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)

sn_candidates_a2g = sn_blocker_a2g.materialize()

print()
print(f"  Generated: {len(sn_candidates_a2g):,} candidates")

[INFO ] PyDI.entitymatching.blocking.sorted_neighbourhood.SortedNeighbourhoodBlocker - created sorted neighbourhood with window size 20
[INFO ] PyDI.entitymatching.blocking.sorted_neighbourhood.SortedNeighbourhoodBlocker - created 1 sorted sequence from 2430 records
[INFO ] PyDI.entitymatching.blocking.sorted_neighbourhood.SortedNeighbourhoodBlocker - Debug results written to file: c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\blocking-evaluation\debugResultsBlocking_SortedNeighbourhoodBlocker.csv



2️⃣ Sorted Neighbourhood Blocking (Title-based, Window=20)

  Generated: 4,899 candidates
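
For intuition, the sorted neighbourhood method can be sketched as follows: both datasets are merged into one sequence sorted by the key, a window slides over that sequence, and every pair of records from different datasets that falls into the same window becomes a candidate. The sketch assumes that a window of size `w` compares each record with the `w - 1` records that follow it; PyDI's exact window semantics may differ, and the function name is hypothetical.

```python
def sorted_neighbourhood_pairs(left, right, window):
    """Pair cross-dataset titles that lie within `window` positions
    of each other in the jointly sorted sequence (illustrative helper)."""
    tagged = [("L", title) for title in left] + [("R", title) for title in right]
    tagged.sort(key=lambda item: item[1].lower())

    pairs = set()
    for i, (side_i, title_i) in enumerate(tagged):
        # Look at the next (window - 1) records in sort order
        for side_j, title_j in tagged[i + 1 : i + window]:
            if side_i != side_j:
                # Always store pairs as (left title, right title)
                pairs.add((title_i, title_j) if side_i == "L" else (title_j, title_i))
    return pairs


left = ["Gigi", "Casablanca", "Marty"]
right = ["casablanca", "Giant", "Gigi!", "Rocky"]

pairs = sorted_neighbourhood_pairs(left, right, window=2)
for pair in sorted(pairs):
    print(pair)
```

Because candidates depend only on sort order, titles that differ at the beginning (for example a leading article) can end up far apart, which is one reason this method can miss true matches.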


In [11]:
# 3. Token Blocking - Token-based similarity
print("\n3️⃣ Token Blocking (Title Tokens, Min Length=3, 2-grams)")

token_blocker_a2g = TokenBlocker(
    actors, golden_globes,
    column='title',      # Tokenize titles
    batch_size=1000,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id',
    ngram_size=2,
    ngram_type='character'
)

token_candidates_a2g = token_blocker_a2g.materialize()

print()
print(f"  Generated: {len(token_candidates_a2g):,} candidates")

[INFO ] PyDI.entitymatching.blocking.token_blocking.TokenBlocker - created 330 token keys for first dataset
[INFO ] PyDI.entitymatching.blocking.token_blocking.TokenBlocker - created 572 token keys for second dataset
[INFO ] PyDI.entitymatching.blocking.token_blocking.TokenBlocker - created 325 blocks from token keys
[INFO ] PyDI.entitymatching.blocking.token_blocking.TokenBlocker - Debug results written to file: c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\blocking-evaluation\debugResultsBlocking_TokenBlocker.csv



3️⃣ Token Blocking (Title Tokens, Min Length=3, 2-grams)

  Generated: 166,834 candidates
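
The large candidate count (166,834 of 344,129 possible pairs) is typical for short character n-grams: almost any two titles share some 2-gram. The sketch below illustrates the mechanism on toy titles. It assumes that titles are lowercased and that whitespace is dropped before n-grams are extracted; these preprocessing choices and the helper names are assumptions of this example, not a description of `TokenBlocker`.

```python
from collections import defaultdict


def char_ngrams(text, n=2):
    """Return the set of lowercase character n-grams of `text`, ignoring whitespace."""
    compact = "".join(text.lower().split())
    return {compact[i : i + n] for i in range(len(compact) - n + 1)}


def ngram_block_pairs(left, right, n=2):
    """Pair titles that share at least one character n-gram (illustrative helper)."""
    index = defaultdict(set)  # n-gram -> right-hand titles containing it
    for title in right:
        for gram in char_ngrams(title, n):
            index[gram].add(title)
    return {(l, r) for l in left for gram in char_ngrams(l, n) for r in index.get(gram, ())}


left = ["Gigi", "Marty"]
right = ["Giant", "Rocky", "Martin Eden"]

pairs = ngram_block_pairs(left, right)
print(sorted(pairs))
```

Longer n-grams or word-level tokens produce fewer, more selective blocks at the cost of lower tolerance to typos.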


In [12]:
# 4. Embedding Blocking - Semantic similarity
print("\n4️⃣ Embedding Blocking (Semantic Similarity)")

embedding_blocker_a2g = EmbeddingBlocker(
    actors, golden_globes,
    text_cols=['title'],
    model="sentence-transformers/all-MiniLM-L6-v2",
    index_backend="sklearn",
    top_k=20,          # Top 20 most similar
    batch_size=500,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)
    
embedding_candidates_a2g = embedding_blocker_a2g.materialize()

print()
print(f"  Generated: {len(embedding_candidates_a2g):,} candidates")

[INFO ] PyDI.entitymatching.blocking.embedding.EmbeddingBlocker - Initialized EmbeddingBlocker with sklearn backend, top_k=20, threshold=0.3
[INFO ] sentence_transformers.SentenceTransformer - Use pytorch device_name: cpu
[INFO ] sentence_transformers.SentenceTransformer - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2



4️⃣ Embedding Blocking (Semantic Similarity)


[INFO ] PyDI.entitymatching.blocking.embedding.EmbeddingBlocker - Loaded sentence transformer model: sentence-transformers/all-MiniLM-L6-v2
[INFO ] PyDI.entitymatching.blocking.embedding.EmbeddingBlocker - created 384d embeddings for first dataset
[INFO ] PyDI.entitymatching.blocking.embedding.EmbeddingBlocker - created 384d embeddings for second dataset
[INFO ] PyDI.entitymatching.blocking.embedding.EmbeddingBlocker - created similarity index with 2279 vectors, metric=cosine
[INFO ] PyDI.entitymatching.blocking.embedding.EmbeddingBlocker - Debug results written to file: c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\blocking-evaluation\debugResultsBlocking_EmbeddingBlocker.csv



  Generated: 2,945 candidates
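
The log above reports a cosine metric, `top_k=20`, and a threshold of 0.3. A minimal sketch of what such a top-k search does is shown below, using tiny hand-written vectors in place of real sentence embeddings. It assumes that, for each left-hand record, the `k` most similar right-hand records are kept if their cosine similarity reaches the threshold; the function names and vectors are invented for this example, and the actual `EmbeddingBlocker` internals may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_candidates(left_vecs, right_vecs, k, threshold=0.0):
    """For each left ID, keep the k most similar right IDs above `threshold`."""
    candidates = []
    for left_id, u in left_vecs.items():
        scored = sorted(
            ((cosine(u, v), right_id) for right_id, v in right_vecs.items()),
            reverse=True,
        )
        candidates += [(left_id, right_id) for score, right_id in scored[:k] if score >= threshold]
    return candidates


# Toy 2-dimensional "embeddings" (real models produce e.g. 384 dimensions)
left_vecs = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}
right_vecs = {"g1": [0.9, 0.1], "g2": [0.1, 0.9], "g3": [0.7, 0.7]}

candidates = top_k_candidates(left_vecs, right_vecs, k=2, threshold=0.3)
print(candidates)
```

With `top_k` fixed, the number of candidates grows linearly with the size of the left dataset, which keeps the reduction ratio high regardless of how large the right dataset is.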


### Step 2: Evaluation Against Ground Truth

PyDI provides evaluation methods for blocking with pair completeness, pair quality, and reduction ratio:
- **`evaluate_blocking()`**: Evaluates blocking given an already materialized set of pairs.
- **`evaluate_blocking_batched()`**: Evaluates blocking by iterating over batches and accumulating the results, which is useful for very large datasets.
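
The three metrics follow their standard definitions, which can be written down in a few lines. The sketch below is a plain-Python illustration on toy pairs; `blocking_metrics` is a hypothetical helper, not PyDI's implementation.

```python
def blocking_metrics(candidates, true_matches, total_possible_pairs):
    """Compute standard blocking metrics from sets of ID pairs (illustrative helper)."""
    found = len(candidates & true_matches)
    return {
        # Share of true matches that survive blocking (recall of the blocker)
        "pair_completeness": found / len(true_matches),
        # Share of candidates that are true matches (precision of the blocker)
        "pair_quality": found / len(candidates),
        # Share of the full cross product that blocking avoids
        "reduction_ratio": 1 - len(candidates) / total_possible_pairs,
    }


candidates = {(1, "a"), (1, "b"), (2, "b"), (3, "c")}
true_matches = {(1, "a"), (2, "b"), (4, "d")}

metrics = blocking_metrics(candidates, true_matches, total_possible_pairs=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Plugging in the counts reported for standard blocking in this tutorial (9 of 26 true matches found, 277 candidates, 344,129 possible pairs) gives 9/26 ≈ 0.346, 9/277 ≈ 0.032, and 1 - 277/344,129 ≈ 0.999.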

Let's first evaluate materialized blocking results against a set of provided ground truth correspondences.

In [13]:
import pandas as pd
from PyDI.io import load_csv
from PyDI.entitymatching import EntityMatchingEvaluator
# Showcase EntityMatchingEvaluator.evaluate_blocking utility

# Load test set with proper column names
test_gt = load_csv(
    DATA_DIR / "entitymatching" / "actors_2_golden_globes_test.csv",
    name="test_set", header=None, names=['id1', 'id2', 'label'], add_index=False
)

# Use EntityMatchingEvaluator.evaluate_blocking on Standard Blocking
results = EntityMatchingEvaluator.evaluate_blocking(
    candidate_pairs=standard_candidates_a2g,
    blocker=standard_blocker_a2g,
    test_pairs=test_gt,
    out_dir=OUTPUT_DIR / "blocking-evaluation"
)

print(f"\n💡 Evaluating pair quality only makes sense if the test set contains all possible pairs, which is not the case in this example!")

display(results)

[INFO ] root -   Pair Completeness: 0.346
[INFO ] root -   Pair Quality:      0.032
[INFO ] root -   Reduction Ratio:   0.999
[INFO ] root -   True Matches Found: 9/26
[INFO ] root - Blocking evaluation complete!



💡 Evaluating pair quality only makes sense if the test set contains all possible pairs, which is not the case in this example!


{'pair_completeness': 0.34615384615384615,
 'pair_quality': 0.032490974729241874,
 'reduction_ratio': 0.9991950692908764,
 'total_candidates': 277,
 'total_possible_pairs': 344129,
 'true_positives_found': 9,
 'total_true_pairs': 26,
 'evaluation_timestamp': '2025-09-25T15:02:11.693712',
 'output_files': ['c:\\Users\\Ralph\\dev\\pydi\\docs\\tutorial\\output\\movies\\blocking-evaluation\\blocking_evaluation_summary.json',
  'c:\\Users\\Ralph\\dev\\pydi\\docs\\tutorial\\output\\movies\\blocking-evaluation\\blocking_detailed_results.csv']}

When datasets are very large, use `evaluate_blocking_batched()` instead to avoid materializing the full set of pairs.

In [14]:
results = EntityMatchingEvaluator.evaluate_blocking_batched(
    blocker=standard_blocker_a2g,
    test_pairs=test_gt,
    out_dir=OUTPUT_DIR / "blocking-evaluation"
)

display(results)

[INFO ] root - Starting batched blocking evaluation...
[INFO ] root -   Pair Completeness: 0.346
[INFO ] root -   Pair Quality:      0.032
[INFO ] root -   Reduction Ratio:   0.999
[INFO ] root -   True Matches Found: 9/26
[INFO ] root -   Batches Processed:  1
[INFO ] root - Blocking evaluation complete!


{'pair_completeness': 0.34615384615384615,
 'pair_quality': 0.032490974729241874,
 'reduction_ratio': 0.9991950692908764,
 'total_candidates': 277,
 'total_possible_pairs': 344129,
 'true_positives_found': 9,
 'total_true_pairs': 26,
 'batches_processed': 1,
 'evaluation_timestamp': '2025-09-25T15:02:11.723528',
 'output_files': ['c:\\Users\\Ralph\\dev\\pydi\\docs\\tutorial\\output\\movies\\blocking-evaluation\\blocking_evaluation_summary.json',
  'c:\\Users\\Ralph\\dev\\pydi\\docs\\tutorial\\output\\movies\\blocking-evaluation\\blocking_detailed_results.csv']}

Let's apply the same blocking strategies to the dataset combination Academy Awards <-> Actors:

In [15]:
# Add title_prefix to academy_awards (actors already received it in the earlier cell)
academy_awards['title_prefix'] = academy_awards['title'].astype(str).apply(lambda x: ''.join([word[:2].upper() for word in x.split()[:3]]))

standard_blocker_aa2a = StandardBlocker(
    academy_awards, actors,
    on=['title_prefix'],  # Block on the first 2 characters of each of the first three title tokens
    batch_size=1000,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)
standard_candidates_aa2a = standard_blocker_aa2a.materialize()

sn_blocker_aa2a = SortedNeighbourhoodBlocker(
    academy_awards, actors,
    key='title',  # Sort by title
    window=20,     # Compare with 20 neighbors
    batch_size=1000,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)
sn_candidates_aa2a = sn_blocker_aa2a.materialize()

token_blocker_aa2a = TokenBlocker(
    academy_awards, actors,
    column='title',      # Tokenize titles
    batch_size=1000,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id',
    ngram_size=2,
    ngram_type='character'
)
token_candidates_aa2a = token_blocker_aa2a.materialize()

embedding_blocker_aa2a = EmbeddingBlocker(
    academy_awards, actors,
    text_cols=['title'],
    model="sentence-transformers/all-MiniLM-L6-v2",
    index_backend="sklearn",
    top_k=20,          # Top 20 most similar
    batch_size=500,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)
embedding_candidates_aa2a = embedding_blocker_aa2a.materialize()

[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocker - created 3585 blocking keys for first dataset
[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocker - created 145 blocking keys for second dataset
[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocker - created 142 blocks from blocking keys
[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocker - Debug results written to file: c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\blocking-evaluation\debugResultsBlocking_StandardBlocker.csv
[INFO ] PyDI.entitymatching.blocking.sorted_neighbourhood.SortedNeighbourhoodBlocker - created sorted neighbourhood with window size 20
[INFO ] PyDI.entitymatching.blocking.sorted_neighbourhood.SortedNeighbourhoodBlocker - created 1 sorted sequence from 4731 records
[INFO ] PyDI.entitymatching.blocking.sorted_neighbourhood.SortedNeighbourhoodBlocker - Debug results written to file: c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\blocking-evaluation\debugResultsBlo

Now let's evaluate which blocking method we want to use for each dataset combination:

In [16]:
# Evaluate all blocking methods for both dataset combinations

evaluator = EntityMatchingEvaluator()

# Create dictionaries of candidates for both dataset combinations
a2g_blocking_candidates = {
    'StandardBlocking': [standard_candidates_a2g, standard_blocker_a2g],
    'SortedNeighbourhood': [sn_candidates_a2g, sn_blocker_a2g],
    'TokenBlocking': [token_candidates_a2g,token_blocker_a2g],
    'EmbeddingBlocking': [embedding_candidates_a2g,embedding_blocker_a2g]
}

aa2a_blocking_candidates = {
    'StandardBlocking': [standard_candidates_aa2a,standard_blocker_aa2a],
    'SortedNeighbourhood': [sn_candidates_aa2a, sn_blocker_aa2a],
    'TokenBlocking': [token_candidates_aa2a,token_blocker_aa2a],
    'EmbeddingBlocking': [embedding_candidates_aa2a,embedding_blocker_aa2a]
}

# Load correspondences for evaluation
a2g_correspondences = load_csv(
    DATA_DIR / "entitymatching" / "actors_2_golden_globes_test.csv",
    name="a2g_test", header=None, names=['id1', 'id2', 'label'], add_index=False
)

aa2a_correspondences = load_csv(
    DATA_DIR / "entitymatching" / "academy_awards_2_actors_test.csv",
    name="aa2a_test", header=None, names=['id1', 'id2', 'label'], add_index=False
)

# Evaluate blocking for a2g datasets
a2g_results = []
for method_name, candidates in a2g_blocking_candidates.items():
    result = evaluator.evaluate_blocking(
        candidate_pairs=candidates[0],
        blocker=candidates[1],
        test_pairs=a2g_correspondences,
        out_dir=OUTPUT_DIR / "blocking-evaluation"
    )
    result['method'] = method_name
    result['dataset'] = 'a2g'
    a2g_results.append(result)

# Evaluate blocking for aa2a datasets  
aa2a_results = []
for method_name, candidates in aa2a_blocking_candidates.items():
    result = evaluator.evaluate_blocking(
        candidate_pairs=candidates[0],
        blocker=candidates[1],
        test_pairs=aa2a_correspondences,
        out_dir=OUTPUT_DIR / "blocking-evaluation"
    )
    result['method'] = method_name
    result['dataset'] = 'aa2a'
    aa2a_results.append(result)

# Select best method for each dataset (highest pair_completeness, then highest reduction_ratio)
a2g_best = max(a2g_results, key=lambda x: (x['pair_completeness'], x['reduction_ratio']))
aa2a_best = max(aa2a_results, key=lambda x: (x['pair_completeness'], x['reduction_ratio']))

print(f"Best blocking for a2g: {a2g_best['method']} (PC: {a2g_best['pair_completeness']:.3f}, RR: {a2g_best['reduction_ratio']:.3f})")
print(f"Best blocking for aa2a: {aa2a_best['method']} (PC: {aa2a_best['pair_completeness']:.3f}, RR: {aa2a_best['reduction_ratio']:.3f})")

[INFO ] root -   Pair Completeness: 0.346
[INFO ] root -   Pair Quality:      0.032
[INFO ] root -   Reduction Ratio:   0.999
[INFO ] root -   True Matches Found: 9/26
[INFO ] root - Blocking evaluation complete!
[INFO ] root -   Pair Completeness: 0.462
[INFO ] root -   Pair Quality:      0.002
[INFO ] root -   Reduction Ratio:   0.986
[INFO ] root -   True Matches Found: 12/26
[INFO ] root - Blocking evaluation complete!
[INFO ] root -   Pair Completeness: 1.000
[INFO ] root -   Pair Quality:      0.000
[INFO ] root -   Reduction Ratio:   0.515
[INFO ] root -   True Matches Found: 26/26
[INFO ] root - Blocking evaluation complete!
[INFO ] root -   Pair Completeness: 1.000
[INFO ] root -   Pair Quality:      0.009
[INFO ] root -   Reduction Ratio:   0.991
[INFO ] root -   True Matches Found: 26/26
[INFO ] root - Blocking evaluation complete!
[INFO ] root -   Pair Completeness: 0.957
[INFO ] root -   Pair Quality:      0.113
[INFO ] root -   Reduction Ratio:   0.999
[INFO ] root -   Tr

Best blocking for a2g: EmbeddingBlocking (PC: 1.000, RR: 0.991)
Best blocking for aa2a: EmbeddingBlocking (PC: 1.000, RR: 0.943)


### Step 3: Entity Matching with Comparators

Now we'll use PyDI's rule-based matcher, which combines several attribute comparators in a weighted linear matching rule, to find records that refer to the same movie.
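
Conceptually, a linear matching rule computes one similarity per attribute, combines them as a weighted sum, and declares a match when the sum reaches a threshold. The sketch below shows this on two hand-written record pairs. The similarity functions (token-based Jaccard and a linear date decay), the weights, the threshold, and all names are assumptions chosen for illustration; they are not PyDI's comparator implementations.

```python
from datetime import date


def jaccard(a, b):
    """Jaccard similarity of the lowercase word sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def date_similarity(d1, d2, max_days_difference=365):
    """1.0 for identical dates, decreasing linearly to 0.0 at the maximum difference."""
    diff = abs((d1 - d2).days)
    return max(0.0, 1 - diff / max_days_difference)


def linear_score(left, right, weights=(0.7, 0.2, 0.1)):
    """Weighted sum of title, date, and actor similarity."""
    sims = (
        jaccard(left["title"], right["title"]),
        date_similarity(left["date"], right["date"]),
        jaccard(left["actor"], right["actor"]),
    )
    return sum(w * s for w, s in zip(weights, sims))


THRESHOLD = 0.7  # illustrative value

left = {"title": "The Divorcee", "date": date(1931, 1, 1), "actor": "Norma Shearer"}
same_film = {"title": "The Divorcee", "date": date(1930, 1, 1), "actor": "Norma Shearer"}
other_film = {"title": "The Divorce of Lady X", "date": date(1938, 1, 1), "actor": "Merle Oberon"}

score_same = linear_score(left, same_film)
score_other = linear_score(left, other_film)

print(f"same film:  {score_same:.3f} -> match={score_same >= THRESHOLD}")
print(f"other film: {score_other:.3f} -> match={score_other >= THRESHOLD}")
```

In the first pair the dates are a full year apart, so the date similarity contributes nothing, yet the title and actor agreement is still enough to pass the threshold. This shows how the weights decide which attributes can compensate for each other.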

First, we define some comparators for attributes relevant to matching:

In [17]:
from PyDI.entitymatching import StringComparator, DateComparator, NumericComparator

# Create comparators for different attributes
comparators = [
    # Title similarity - most important for movies
    StringComparator(
        column='title',
        similarity_function='jaccard',  # Good for movie titles
        preprocess=str.lower  # Case normalization
    ),
    
    # Date proximity - records with release dates close together likely refer to the same film
    DateComparator(
        column='date', 
        max_days_difference=365  # Allow 1 year difference
    ),
    
    # Actor name similarity - supporting evidence
    StringComparator(
        column='actors_actor_name',
        similarity_function='jaccard',  # Good for names
        preprocess=str.lower,
        list_strategy='concatenate' # Handle list attribute by concatenation
    )
]

Next, we set up the matcher and run the matching with the blocking method that performed best in the evaluation:

In [18]:
from PyDI.entitymatching import RuleBasedMatcher

# Initialize the blocker
embedding_blocker_a2g = EmbeddingBlocker(
    actors, golden_globes,
    text_cols=['title'],
    model="sentence-transformers/all-MiniLM-L6-v2",
    index_backend="sklearn",
    top_k=20,          # Top 20 most similar
    batch_size=500,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)

# Initialize Rule-Based Matcher
matcher = RuleBasedMatcher()

correspondences_a2g = matcher.match(
    df_left=actors,
    df_right=golden_globes, 
    candidates=embedding_blocker_a2g, # pass the blocker, which will internally generate candidate pairs using batching
    comparators=comparators,
    weights=[0.7, 0.2, 0.1],  # Title most important, then date, then actor
    threshold=0.7, # set a similarity threshold for a match
    id_column='id'
)

[INFO ] PyDI.entitymatching.blocking.embedding.EmbeddingBlocker - Initialized EmbeddingBlocker with sklearn backend, top_k=20, threshold=0.3
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Blocking 151 x 2279 elements
[INFO ] sentence_transformers.SentenceTransformer - Use pytorch device_name: cpu
[INFO ] sentence_transformers.SentenceTransformer - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
[INFO ] PyDI.entitymatching.blocking.embedding.EmbeddingBlocker - Loaded sentence transformer model: sentence-transformers/all-MiniLM-L6-v2
[INFO ] PyDI.entitymatching.blocking.embedding.EmbeddingBlocker - created 384d embeddings for first dataset
[INFO ] PyDI.entitymatching.blocking.embedding.EmbeddingBlocker - created 384d embeddings for second dataset
[INFO ] PyDI.entitymatching.blocking.embedding.EmbeddingBlocker - created similarity index with 2279 vectors, metric=cosine

### Step 4: Evaluation Against Ground Truth

We can evaluate the result of our entity matching with this method of the EntityMatchingEvaluator:
- **`evaluate_matching()`**: Evaluates matching given a test set and the predicted correspondences. 
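
The reported metrics follow directly from the confusion matrix. As a cross-check, the sketch below recomputes them from the counts reported for this tutorial's run (7 true positives, 0 false positives, 19 false negatives, 56 true negatives). The helper `matching_metrics` is hypothetical and only illustrates the standard formulas.

```python
def matching_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


metrics = matching_metrics(tp=7, fp=0, fn=19, tn=56)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A precision of 1.0 combined with a recall of about 0.27 indicates a rule that is too strict: every predicted match is correct, but most true matches fall below the threshold.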

In [19]:
gt_test = load_csv(
    DATA_DIR / "entitymatching" / "actors_2_golden_globes_test.csv", 
    name="test_entity_matching",
    header=None,
    names=['id1', 'id2', 'label'],
    add_index=False
)

debug_output_dir = OUTPUT_DIR / "debug_results_entity_matching"
debug_output_dir.mkdir(parents=True, exist_ok=True)

eval_results = EntityMatchingEvaluator.evaluate_matching(
    correspondences=correspondences_a2g,
    test_pairs=gt_test,
    out_dir=debug_output_dir
)

display(eval_results)

[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.768
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.269
[INFO ] root -   F1-Score:  0.424
[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  7
[INFO ] root -   True Negatives:  56
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 19
[INFO ] root - Matching evaluation complete: P=1.0000 R=0.2692 F1=0.4242


{'precision': 1.0,
 'recall': 0.2692307692307692,
 'f1': 0.42424242424242425,
 'accuracy': 0.7682926829268293,
 'true_positives': 7,
 'false_positives': 0,
 'false_negatives': 19,
 'true_negatives': 56,
 'threshold_used': 0.0,
 'total_correspondences': 86,
 'filtered_correspondences': 86,
 'evaluation_timestamp': '2025-09-25T15:02:35.090947',
 'output_files': ['c:\\Users\\Ralph\\dev\\pydi\\docs\\tutorial\\output\\movies\\debug_results_entity_matching\\matching_evaluation_summary.json',
  'c:\\Users\\Ralph\\dev\\pydi\\docs\\tutorial\\output\\movies\\debug_results_entity_matching\\matching_detailed_results.csv']}
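
The reported metrics follow directly from the confusion matrix. The sketch below recomputes them with the standard definitions, using the counts from the output above; the function name `matching_metrics` is only illustrative.

```python
def matching_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts taken from the evaluation output above
metrics = matching_metrics(tp=7, fp=0, fn=19, tn=56)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# precision: 1.000
# recall: 0.269
# f1: 0.424
# accuracy: 0.768
```

Perfect precision combined with low recall suggests the rule is strict: every pair it accepts in the test set is correct, but 19 of the 26 true matches are missed.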

If we need more detailed debugging results, we can set the `debug` flag during matching. The matcher then returns a debug info object in addition to the correspondences, and passing that object to `evaluate_matching()` writes detailed debug logs to a directory of our choice.

In [20]:
# Re-run the matcher with debug mode enabled to get detailed debug data
print("🔍 Re-running matcher with debug mode to capture detailed results:")

correspondences_a2g, debug_info = matcher.match(
    df_left=actors,
    df_right=golden_globes, 
    candidates=embedding_blocker_a2g, # pass the blocker, which will internally generate candidate pairs using batching
    comparators=comparators,
    weights=[0.7, 0.2, 0.1],  # one weight per comparator: title most important, then date, then actor
    threshold=0.7, # set a similarity threshold for a match
    id_column='id',
    debug=True  # This enables debug output capture
)

eval_results = EntityMatchingEvaluator.evaluate_matching(
    correspondences=correspondences_a2g,
    test_pairs=gt_test,
    out_dir=debug_output_dir,
    debug_info=debug_info, # add debug info
    matcher_instance=matcher # add matcher instance for context for debug files
)

[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Blocking 151 x 2279 elements
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Matching 151 x 2279 elements after 0:00:0.099; 2945 blocked pairs (reduction ratio: 0.9914421626773681)


🔍 Re-running matcher with debug mode to capture detailed results:


[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Entity Matching finished after 0:00:1.351; found 86 correspondences.
[INFO ] root - Debug results written to c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\debug_results_entity_matching\debugResultsMatchingRule.csv and c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\debug_results_entity_matching\debugResultsMatchingRule.csv_short
[INFO ] root - Debug results written to c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\debug_results_entity_matching\debugResultsMatchingRule.csv and c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\debug_results_entity_matching\debugResultsMatchingRule.csv_short
[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  0.768
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    0.269
[INFO ] root -   F1-Score:  0.424
[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  7
[INFO ] root -   True Negatives:  56
[INFO ] root -   False Positives: 0
[INFO ] root - 

Another helpful tool for assessing matching quality is the cluster size distribution, which shows how many clusters (groups of records that refer to the same entity) of each size exist after matching.

In [21]:
print("Analyzing cluster size distribution in our entity matching results...")

# Create cluster size distribution from our matches
cluster_distribution = EntityMatchingEvaluator.create_cluster_size_distribution(
    correspondences=correspondences_a2g,
    out_dir=str(OUTPUT_DIR / "cluster_analysis")
)

print(f"\n📊 Cluster Size Distribution Results:")
display(cluster_distribution)

[INFO ] root - Cluster Size Distribution of 80 clusters:
[INFO ] root - 	Cluster Size	| Frequency	| Percentage
[INFO ] root - 	──────────────────────────────────────────────────
[INFO ] root - 		2	|	78	|	97.50%
[INFO ] root - 		3	|	1	|	1.25%
[INFO ] root - 		7	|	1	|	1.25%
[INFO ] root - Cluster size distribution written to c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\cluster_analysis\cluster_size_distribution.csv


Analyzing cluster size distribution in our entity matching results...

📊 Cluster Size Distribution Results:


Unnamed: 0,cluster_size,frequency,percentage
0,2,78,97.5
1,3,1,1.25
2,7,1,1.25
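
Conceptually, a cluster is a connected component of the correspondence graph. The following sketch computes a cluster size distribution with a small union-find structure. It works on plain `(left_id, right_id)` tuples with made-up IDs and is not PyDI's implementation.

```python
from collections import Counter


def cluster_size_distribution(pairs):
    """Return {cluster_size: frequency} for the connected components of `pairs`."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    records_per_root = Counter(find(x) for x in list(parent))
    return dict(sorted(Counter(records_per_root.values()).items()))


# Hypothetical correspondences: a2 matches two Golden Globes records
toy_pairs = [("a1", "g1"), ("a2", "g2"), ("a2", "g3"), ("a3", "g4")]
distribution = cluster_size_distribution(toy_pairs)
print(distribution)  # {2: 2, 3: 1}
```

When matching two duplicate-free datasets, clusters larger than 2 mean that at least one record was matched to several records in the other dataset, so they are good candidates for manual inspection.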


If the cluster size distribution looks unusual (for example, the single cluster of size 7 above), we can investigate specific clusters further by writing out detailed cluster information:

In [22]:
# Write out detailed cluster information with all entity records for debugging purposes

# Use the matches we found earlier to demonstrate cluster details
cluster_details_path = OUTPUT_DIR / "cluster_analysis" / "detailed_cluster_info.json"

# Call write_cluster_details with our entity matches
output_path = EntityMatchingEvaluator.write_cluster_details(
    correspondences=correspondences_a2g,
    out_path=cluster_details_path
)

[INFO ] root - Cluster details written to c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\cluster_analysis\detailed_cluster_info.json
[INFO ] root - Exported 80 clusters with detailed record information


Additionally, PyDI offers 6 different post-clustering methods to "clean" clusters after entity matching. For example, if we want to enforce that each record in a dataset has at most one correspondence in the other dataset, we can apply greedy one-to-one matching, maximum bipartite matching, or stable marriage matching.

In [23]:
from PyDI.entitymatching import MaximumBipartiteMatching, StableMatching

# use Maximum Bipartite Matching to refine results to 1:1 matches
clusterer = MaximumBipartiteMatching()
mbm_correspondences_a2g = clusterer.cluster(correspondences_a2g)

# use Stable Matching to refine results to 1:1 matches
clusterer = StableMatching()
sm_correspondences_a2g = clusterer.cluster(correspondences_a2g)

[INFO ] root - Filtered correspondences: 86 -> 86 (threshold=0.0)
[INFO ] root - Maximum bipartite matching: 86 -> 80 
[INFO ] root - MaximumBipartiteMatching: 86 -> 80 correspondences
[INFO ] root - MaximumBipartiteMatching: 166 -> 160 entities
[INFO ] root - Filtered correspondences: 86 -> 86 (threshold=0.0)
[INFO ] root - Stable matching: 86 -> 80 correspondences (160 entities matched)
[INFO ] root - StableMatching: 86 -> 80 correspondences
[INFO ] root - StableMatching: 166 -> 160 entities
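
To illustrate the idea behind one-to-one refinement, here is a minimal sketch of the greedy variant: sort correspondences by score and keep a pair only if neither record has been used yet. This is a simplified stand-in written for this tutorial, not the algorithm behind `MaximumBipartiteMatching` or `StableMatching`, and the IDs and scores are made up.

```python
def greedy_one_to_one(scored_pairs):
    """Keep the highest-scoring pairs such that each ID appears at most once per side.

    scored_pairs: iterable of (left_id, right_id, score) tuples.
    """
    used_left, used_right, kept = set(), set(), []
    for left, right, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if left not in used_left and right not in used_right:
            kept.append((left, right, score))
            used_left.add(left)
            used_right.add(right)
    return kept


# Hypothetical correspondences in which a1 and a2 both match g1 and g2
toy = [
    ("a1", "g1", 0.95),
    ("a1", "g2", 0.90),
    ("a2", "g2", 0.85),
    ("a2", "g1", 0.80),
]
result = greedy_one_to_one(toy)
print(result)  # [('a1', 'g1', 0.95), ('a2', 'g2', 0.85)]
```

Greedy selection is simple and fast, but it can lock in a locally best pair that blocks a better overall assignment, which is the case that maximum bipartite matching is designed to handle.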


### Step 5: Machine Learning-based Matching Rules

Instead of using manually configured matching rules, we can also learn the weights and best comparators using machine learning if we have a labeled training set available.

Let's do this for the dataset combination Academy Awards <-> Actors.

First, we need to create the features for machine learning using PyDI's FeatureExtractor class:

In [24]:
from PyDI.entitymatching import FeatureExtractor

# Load ground truth correspondences
aa2a_train = load_csv(
    DATA_DIR / "entitymatching" / "academy_awards_2_actors_training.csv",
    name="ground_truth_train",
    header=None,
    names=['id1', 'id2', 'label'],
    add_index=False
)

aa2a_test = load_csv(
    DATA_DIR / "entitymatching" / "academy_awards_2_actors_test.csv",
    name="ground_truth_test",
    header=None,
    names=['id1', 'id2', 'label'],
    add_index=False
)

similarity_comparators = [
    # Title similarity features - most important for movie matching
    StringComparator("title", similarity_function="jaro_winkler", preprocess=str.lower),
    StringComparator("title", similarity_function="levenshtein", preprocess=str.lower),
    StringComparator("title", similarity_function="cosine", preprocess=str.lower),
    StringComparator("title", similarity_function="jaccard", preprocess=str.lower),
    
    # Date proximity features
    DateComparator("date", max_days_difference=365),  # 1 year tolerance
    
    # Actor name similarity
    StringComparator("actors_actor_name", similarity_function="jaccard", preprocess=str.lower, list_strategy='concatenate'),
    StringComparator("actors_actor_name", similarity_function="jaccard", preprocess=str.lower, list_strategy='best_match'),
]

feature_extractor = FeatureExtractor(similarity_comparators)

# Extract features using FeatureExtractor
train_features = feature_extractor.create_features(
    academy_awards, actors, aa2a_train[['id1', 'id2']], labels=aa2a_train['label'], id_column='id'
)

print(f"✅ Training features extracted!")
print(f"Feature columns: {[col for col in train_features.columns if col not in ['id1', 'id2', 'label']]}")

# Prepare data for ML training
feature_columns = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]

X_train = train_features[feature_columns]
y_train = train_features['label']

print(f"Training data: X={X_train.shape}, y={y_train.shape}")
print(f"Class distribution: {y_train.value_counts().to_dict()}")

[INFO ] root - Label distribution: 103 positive, 232 negative


✅ Training features extracted!
Feature columns: ['StringComparator(title, jaro_winkler, tokenization=char, list_strategy=None)', 'StringComparator(title, levenshtein, tokenization=char, list_strategy=None)', 'StringComparator(title, cosine, tokenization=word, list_strategy=None)', 'StringComparator(title, jaccard, tokenization=word, list_strategy=None)', 'DateComparator(date, list_strategy=None)', 'StringComparator(actors_actor_name, jaccard, tokenization=word, list_strategy=concatenate)', 'StringComparator(actors_actor_name, jaccard, tokenization=word, list_strategy=best_match)']
Training data: X=(335, 7), y=(335,)
Class distribution: {False: 232, True: 103}
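
Each feature column holds one similarity score per candidate pair. As an illustration of what such a score looks like, the sketch below computes a word-level Jaccard similarity for two titles and assembles a single labeled feature row. The helper `jaccard_tokens` and the column name `title_jaccard` are hypothetical; PyDI's comparators may tokenize and normalize differently.

```python
def jaccard_tokens(a, b):
    """Jaccard similarity of the lowercased word sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0  # convention chosen for this sketch: empty strings never match
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical labeled pairs: (left title, right title, is_match)
labeled_pairs = [
    ("Forrest Gump", "forrest gump", True),
    ("Training Day", "Training Day 2", False),
]

feature_rows = [
    {"title_jaccard": jaccard_tokens(left, right), "label": label}
    for left, right, label in labeled_pairs
]
for row in feature_rows:
    print(row)
```

A classifier trained on many such rows learns how to weigh the individual similarity scores instead of relying on manually chosen weights.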


#### Full Scikit-learn integration

From here on, the full scikit-learn library can be used with the features extracted by PyDI's feature extractor without any wrapping, because everything in PyDI is based on pandas DataFrames.

In [25]:
# Set up GridSearchCV with multiple models and hyperparameters
print(f"\n🔍 Setting up GridSearchCV...")

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import make_scorer, f1_score

# Define models and parameter grids
param_grids = {
    'RandomForest': {
        'model': RandomForestClassifier(random_state=42),
        'params': {
            'n_estimators': [50, 100, 200],
            'max_depth': [5, 10, None],
            'min_samples_split': [2, 5],
            'class_weight': ['balanced', None]
        }
    },
    'LogisticRegression': {
        'model': LogisticRegression(random_state=42, max_iter=1000),
        'params': {
            'C': [0.1, 1.0, 10.0],
            'penalty': ['l2'],
            'class_weight': ['balanced', None]
        }
    },
    'GradientBoosting': {
        'model': GradientBoostingClassifier(random_state=42),
        'params': {
            'n_estimators': [50, 100],
            'learning_rate': [0.1, 0.2],
            'max_depth': [3, 5],
        }
    },
    'SVM': {
        'model': SVC(random_state=42, probability=True),
        'params': {
            'C': [0.1, 1.0, 10.0],
            'kernel': ['rbf', 'linear'],
            'class_weight': ['balanced', None]
        }
    }
}

# Use F1 score as the scoring metric (good for imbalanced data)
scorer = make_scorer(f1_score)
cv_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print(f"GridSearch setup: {len(param_grids)} models, F1 scoring, 5-fold CV")

# Train models using GridSearchCV
print(f"\n🚀 Training Models with GridSearchCV...")

grid_search_results = {}
best_overall_score = -1
best_overall_model = None
best_model_name = None

for model_name, config in param_grids.items():
    print(f"\nTraining {model_name}...")
    

    # Create GridSearchCV
    grid_search = GridSearchCV(
        estimator=config['model'],
        param_grid=config['params'],
        scoring=scorer,
        cv=cv_folds,
        n_jobs=-1,  # Use all available cores
        verbose=0
    )
    
    # Fit GridSearchCV
    grid_search.fit(X_train, y_train)
    
    # Store results
    grid_search_results[model_name] = {
        'grid_search': grid_search,
        'best_score': grid_search.best_score_,
        'best_params': grid_search.best_params_,
        'best_estimator': grid_search.best_estimator_
    }
    
    print(f"  ✅ {model_name}: Best CV F1 = {grid_search.best_score_:.4f}")
    print(f"     Best params: {grid_search.best_params_}")
    
    # Track overall best model
    if grid_search.best_score_ > best_overall_score:
        best_overall_score = grid_search.best_score_
        best_overall_model = grid_search.best_estimator_
        best_model_name = model_name
            
print(f"\n🏆 Best Overall Model: {best_model_name} (CV F1: {best_overall_score:.4f})")


🔍 Setting up GridSearchCV...
GridSearch setup: 4 models, F1 scoring, 5-fold CV

🚀 Training Models with GridSearchCV...

Training RandomForest...
  ✅ RandomForest: Best CV F1 = 0.9856
     Best params: {'class_weight': 'balanced', 'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 200}

Training LogisticRegression...
  ✅ LogisticRegression: Best CV F1 = 0.9902
     Best params: {'C': 0.1, 'class_weight': None, 'penalty': 'l2'}

Training GradientBoosting...
  ✅ GradientBoosting: Best CV F1 = 0.9905
     Best params: {'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 50}

Training SVM...
  ✅ SVM: Best CV F1 = 0.9902
     Best params: {'C': 0.1, 'class_weight': 'balanced', 'kernel': 'rbf'}

🏆 Best Overall Model: GradientBoosting (CV F1: 0.9905)
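
To get a feeling for the cost of this search, we can count the parameter combinations per model. The sketch below copies the grids from the cell above and multiplies by the 5 cross-validation folds. It does not count the final refit that `GridSearchCV` performs for each best estimator by default.

```python
from math import prod

# Parameter grids copied from the GridSearchCV setup above
param_values = {
    "RandomForest": {
        "n_estimators": [50, 100, 200],
        "max_depth": [5, 10, None],
        "min_samples_split": [2, 5],
        "class_weight": ["balanced", None],
    },
    "LogisticRegression": {
        "C": [0.1, 1.0, 10.0],
        "penalty": ["l2"],
        "class_weight": ["balanced", None],
    },
    "GradientBoosting": {
        "n_estimators": [50, 100],
        "learning_rate": [0.1, 0.2],
        "max_depth": [3, 5],
    },
    "SVM": {
        "C": [0.1, 1.0, 10.0],
        "kernel": ["rbf", "linear"],
        "class_weight": ["balanced", None],
    },
}
N_FOLDS = 5

# Number of combinations is the product of the option counts per parameter
combos = {name: prod(len(v) for v in grid.values()) for name, grid in param_values.items()}
total_fits = sum(combos.values()) * N_FOLDS

print(combos)      # {'RandomForest': 36, 'LogisticRegression': 6, 'GradientBoosting': 8, 'SVM': 12}
print(total_fits)  # 310
```

With only 335 training pairs these 310 fits are cheap, but the count grows multiplicatively with every added parameter value. Note also that the best cross-validated F1 scores of the four models above differ by less than 0.005, so the ranking between them may not be stable.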


Now we can use the trained model directly with PyDI's MLBasedMatcher:

In [26]:
from PyDI.entitymatching import MLBasedMatcher

# Create MLBasedMatcher and apply trained model
ml_matcher = MLBasedMatcher(feature_extractor)

correspondences_aa2a = ml_matcher.match(
    academy_awards, actors, candidates=embedding_blocker_aa2a, id_column='id', trained_classifier=best_overall_model
)

[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Starting Entity Matching
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Blocking 4580 x 151 elements
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Matching 4580 x 151 elements after 0:00:0.850; 39558 blocked pairs (reduction ratio: 0.9428005436825819)
[INFO ] PyDI.entitymatching.ml_based.MLBasedMatcher - Entity Matching finished after 0:00:16.290; found 150 correspondences.
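
The reduction ratio in the log tells us how much of the full cross product the blocker pruned away. Assuming the usual definition, one minus the number of candidate pairs divided by the number of all possible pairs, the logged values can be reproduced as follows (the function name is illustrative):

```python
def reduction_ratio(n_left, n_right, n_candidates):
    """Share of the full cross product that blocking removed."""
    return 1 - n_candidates / (n_left * n_right)


# Academy Awards x Actors: 4580 x 151 records, 39558 blocked pairs
rr_aa2a = reduction_ratio(4580, 151, 39558)
# Actors x Golden Globes: 151 x 2279 records, 2945 blocked pairs
rr_a2g = reduction_ratio(151, 2279, 2945)

print(f"{rr_aa2a:.4f}")  # 0.9428
print(f"{rr_a2g:.4f}")   # 0.9914
```

Both values agree with the log output, so the blocker removed roughly 94% and 99% of all possible comparisons in the two runs.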


In [27]:
# Show feature importance if available
if hasattr(best_overall_model, 'feature_importances_'):
    print(f"\n🔍 Top Feature Importances:")
    importance_df = ml_matcher.get_feature_importance(best_overall_model, feature_columns)
    display(importance_df.head(8))


🔍 Top Feature Importances:


Unnamed: 0,feature,importance
6,"StringComparator(actors_actor_name, jaccard, t...",0.5782
5,"StringComparator(actors_actor_name, jaccard, t...",0.3943
1,"StringComparator(title, levenshtein, tokenizat...",0.0275
2,"StringComparator(title, cosine, tokenization=w...",0.0
0,"StringComparator(title, jaro_winkler, tokeniza...",0.0
4,"DateComparator(date, list_strategy=None)",0.0
3,"StringComparator(title, jaccard, tokenization=...",0.0


Let's evaluate the ML-based matching with the evaluator:

In [28]:
eval_results = EntityMatchingEvaluator.evaluate_matching(
    correspondences=correspondences_aa2a,
    test_pairs=aa2a_test,
    out_dir=debug_output_dir
)

display(eval_results)

# Create cluster size distribution from our matches
cluster_distribution = EntityMatchingEvaluator.create_cluster_size_distribution(
    correspondences=correspondences_aa2a,
    out_dir=OUTPUT_DIR / "cluster_analysis"
)

print(f"\n📊 Cluster Size Distribution Results:")
display(cluster_distribution)

[INFO ] root - Performance Metrics:
[INFO ] root -   Accuracy:  1.000
[INFO ] root -   Precision: 1.000
[INFO ] root -   Recall:    1.000
[INFO ] root -   F1-Score:  1.000
[INFO ] root - Confusion Matrix:
[INFO ] root -   True Positives:  47
[INFO ] root -   True Negatives:  3300
[INFO ] root -   False Positives: 0
[INFO ] root -   False Negatives: 0
[INFO ] root - Matching evaluation complete: P=1.0000 R=1.0000 F1=1.0000


{'precision': 1.0,
 'recall': 1.0,
 'f1': 1.0,
 'accuracy': 1.0,
 'true_positives': 47,
 'false_positives': 0,
 'false_negatives': 0,
 'true_negatives': 3300,
 'threshold_used': 0.0,
 'total_correspondences': 150,
 'filtered_correspondences': 150,
 'evaluation_timestamp': '2025-09-25T15:02:56.752762',
 'output_files': ['c:\\Users\\Ralph\\dev\\pydi\\docs\\tutorial\\output\\movies\\debug_results_entity_matching\\matching_evaluation_summary.json',
  'c:\\Users\\Ralph\\dev\\pydi\\docs\\tutorial\\output\\movies\\debug_results_entity_matching\\matching_detailed_results.csv']}

[INFO ] root - Cluster Size Distribution of 148 clusters:
[INFO ] root - 	Cluster Size	| Frequency	| Percentage
[INFO ] root - 	──────────────────────────────────────────────────
[INFO ] root - 		2	|	146	|	98.65%
[INFO ] root - 		3	|	2	|	1.35%
[INFO ] root - Cluster size distribution written to c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\cluster_analysis\cluster_size_distribution.csv



📊 Cluster Size Distribution Results:


Unnamed: 0,cluster_size,frequency,percentage
0,2,146,98.648649
1,3,2,1.351351


As an alternative to computing a similarity metric per attribute, PyDI's VectorFeatureExtractor can create embedding-based features using SentenceTransformers:

In [29]:
# VectorFeatureExtractor Examples

from PyDI.entitymatching import VectorFeatureExtractor

# SentenceTransformers embeddings using VectorFeatureExtractor
st_extractor = VectorFeatureExtractor(
    embedding_model='sentence-transformers/all-MiniLM-L6-v2',
    columns=['title', 'actors_actor_name', 'date'],
    distance_metrics=['cosine'],
    pooling_strategy='concatenate',
    list_strategies={'actors_actor_name': 'concatenate'}
)

# Extract features using VectorFeatureExtractor
train_features = st_extractor.create_features(
    academy_awards, actors, aa2a_train[['id1', 'id2']], labels=aa2a_train['label'], id_column='id'
)

# ready to train ML models with scikit-learn as before
# matching workflow is analogous to previous example with FeatureExtractor

[INFO ] sentence_transformers.SentenceTransformer - Use pytorch device_name: cpu
[INFO ] sentence_transformers.SentenceTransformer - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
[INFO ] root - Initialized VectorFeatureExtractor with model sentence-transformers/all-MiniLM-L6-v2
[INFO ] root - Computing vector features for 362 pairs
[INFO ] root - Computing embeddings for left dataset...
[INFO ] root - Computing embeddings for right dataset...
[INFO ] root - Vector feature extraction complete: 335 pairs embedded.
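
The `distance_metrics=['cosine']` setting refers to cosine similarity between embedding vectors. The sketch below shows the computation on tiny made-up 3-dimensional vectors; the real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and how the VectorFeatureExtractor combines them into features is not shown here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: zero vectors are similar to nothing
    return dot / (norm_u * norm_v)


# Made-up toy vectors standing in for sentence embeddings
same_direction = cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
partial = cosine_similarity([1.0, 1.0, 0.0], [1.0, 0.0, 0.0])

print(f"{same_direction:.3f}")  # 1.000
print(f"{orthogonal:.3f}")      # 0.000
print(f"{partial:.3f}")         # 0.707
```

Because cosine similarity ignores vector length, it compares only the direction of two embeddings.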


## Part 3: Data Fusion

In [30]:
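# Keep the original Academy Awards ID in its own column; it is used later to join fused records to the gold standard
academy_awards["academy_awards_id"] = academy_awards["id"]

# Trust score per dataset, read by prefer_higher_trust via trust_key="trust_score"
academy_awards.attrs["trust_score"] = 3
actors.attrs["trust_score"] = 2
golden_globes.attrs["trust_score"] = 1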

In [31]:
all_correspondences = pd.concat([correspondences_a2g, correspondences_aa2a], ignore_index=True)
print(f'Total correspondences: {len(all_correspondences):,}')

Total correspondences: 236


## Define Fusion Strategy 

In [32]:
from PyDI.fusion import DataFusionStrategy, longest_string, union, prefer_higher_trust

strategy = DataFusionStrategy('movie_fusion_strategy')

strategy.add_attribute_fuser('title', longest_string)
strategy.add_attribute_fuser('director_name', longest_string)
strategy.add_attribute_fuser('date', prefer_higher_trust, trust_key="trust_score")

strategy.add_attribute_fuser('actors_actor_name', union)

print('Strategy ready.')

[INFO ] PyDI.fusion.strategy - Registered fuser for attribute 'title' using rule 'longest_string'
[INFO ] PyDI.fusion.strategy - Registered fuser for attribute 'director_name' using rule 'longest_string'
[INFO ] PyDI.fusion.strategy - Registered fuser for attribute 'date' using rule 'prefer_higher_trust'
[INFO ] PyDI.fusion.strategy - Registered fuser for attribute 'actors_actor_name' using rule 'union'


Strategy ready.
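
To show what these conflict resolution rules do conceptually, here are simplified re-implementations in plain Python. The `*_rule` functions are hypothetical stand-ins written for this sketch; PyDI's `longest_string`, `union`, and `prefer_higher_trust` have their own signatures and return types, and the conflicting values below are made up.

```python
def longest_string_rule(values):
    """Pick the longest non-empty string among the conflicting values."""
    candidates = [v for v in values if isinstance(v, str) and v]
    return max(candidates, key=len) if candidates else None


def union_rule(list_values):
    """Merge several lists into one sorted list without duplicates."""
    merged = set()
    for items in list_values:
        merged.update(items or [])
    return sorted(merged)


def prefer_higher_trust_rule(values_with_trust):
    """Pick the non-null value that comes from the most trusted source."""
    candidates = [(trust, value) for value, trust in values_with_trust if value is not None]
    return max(candidates, key=lambda c: c[0])[1] if candidates else None


title = longest_string_rule(["Man for All Seasons", "Man For All Seasons, a", None])
actors_fused = union_rule([["Tom Hanks"], ["Gary Sinise", "Tom Hanks"], None])
date = prefer_higher_trust_rule([("1995-01-01", 1), ("1994-01-01", 3), (None, 2)])

print(title)         # Man For All Seasons, a
print(actors_fused)  # ['Gary Sinise', 'Tom Hanks']
print(date)          # 1994-01-01
```

Each rule encodes a different assumption: longer titles are more complete, actor lists complement each other, and for dates the most trusted source wins.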


## Run Fusion
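We build connected components (record groups) from the combined correspondences and fuse each group attribute by attribute using the rules above. With `include_singletons=False`, only groups that contain more than one record end up in the fused output.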

In [33]:
from PyDI.fusion import DataFusionEngine

engine = DataFusionEngine(
    strategy,
    debug=True,
    debug_format='json',
    debug_file=OUTPUT_DIR / "data_fusion" / "debug_fusion.jsonl",
)

fused = engine.run(
    datasets=[academy_awards, actors, golden_globes],
    correspondences=all_correspondences,
    id_column="id",
    include_singletons=False,
)
print(f'Fused rows: {len(fused):,}')
display(fused.head(5))

[INFO ] PyDI.fusion.engine - Fusion debug logging enabled; refer to c:\Users\Ralph\dev\pydi\docs\tutorial\output\movies\data_fusion\debug_fusion.jsonl for detailed traces.
[INFO ] PyDI.fusion.engine - Starting data fusion with strategy 'movie_fusion_strategy'
[INFO ] PyDI.fusion.engine - Correspondence ID coverage: matched 383 of 383 unique IDs
[INFO ] PyDI.fusion.engine - Created 6775 record groups from 236 correspondences
[INFO ] PyDI.fusion.engine - Group size distribution (size: count): 1: 6627, 2: 67, 3: 79, 4: 1, 8: 1
[INFO ] PyDI.fusion.engine - Fusion complete: 148 records from 148 groups
[INFO ] PyDI.fusion.engine - Fusion time: 0.62 seconds


Fused rows: 148


Unnamed: 0,_id,_fusion_group_id,_fusion_sources,id,title_prefix,actors_actor_birthplace,academy_awards_id,date,oscar,director_name,actors_actor_birthday,actors_actor_name,title,_fusion_confidence,_fusion_metadata,globe
0,actors_141,group_0,"[academy_awards, actors]",actors_141,FOGU,California,academy_awards_902,1994-01-01,yes,Robert Zemeckis,1956-01-01,"[Gary Sinise, Tom Hanks]",Forrest Gump,0.671296,"{'id_rule': 'first_non_null', 'title_prefix_ru...",
1,golden_globes_2004,group_1,"[academy_awards, golden_globes, actors]",golden_globes_2004,MAFOAL,England,academy_awards_2337,1966-01-01,yes,Fred Zinnemann,1922-01-01,"[Paul Scofield, Robert Shaw, Wendy Hiller]","Man For All Seasons, a",0.602273,"{'id_rule': 'first_non_null', 'title_prefix_ru...",yes
2,actors_126,group_2,"[academy_awards, actors]",actors_126,ONFLOV,New York,academy_awards_1880,1975-01-01,yes,Milos Forman,1937-01-01,"[Brad Dourif, Jack Nicholson, Louise Fletcher]",One Flew over the Cuckoo's Nest,0.668459,"{'id_rule': 'first_non_null', 'title_prefix_ru...",
3,actors_29,group_3,"[academy_awards, golden_globes, actors]",actors_29,AN,Sweden,academy_awards_2892,1956-01-01,yes,,1915-01-01,"[Helen Hayes, Ingrid Bergman]",Anastasia,0.5,"{'id_rule': 'first_non_null', 'title_prefix_ru...",
4,academy_awards_503,group_4,"[academy_awards, golden_globes, actors]",academy_awards_503,TRDA,New York,academy_awards_503,2001-01-01,yes,,1954-01-01,"[Denzel Washington, Ethan Hawke]",Training Day,0.5,"{'id_rule': 'first_non_null', 'title_prefix_ru...",


## Evaluate with Gold Standard
We register an evaluation function for each attribute, then load the gold standard and evaluate the accuracy of the fused values.

In [34]:
from PyDI.fusion import tokenized_match, year_only_match, boolean_match

strategy.add_evaluation_function("title", tokenized_match)
strategy.add_evaluation_function("director_name", tokenized_match)
strategy.add_evaluation_function("actors_actor_name", tokenized_match)
strategy.add_evaluation_function("date", year_only_match)
strategy.add_evaluation_function("oscar", boolean_match)

[INFO ] PyDI.fusion.strategy - Registered evaluation function for attribute 'title'
[INFO ] PyDI.fusion.strategy - Registered evaluation function for attribute 'director_name'
[INFO ] PyDI.fusion.strategy - Registered evaluation function for attribute 'actors_actor_name'
[INFO ] PyDI.fusion.strategy - Registered evaluation function for attribute 'date'
[INFO ] PyDI.fusion.strategy - Registered evaluation function for attribute 'oscar'
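
Evaluation functions decide when a fused value counts as equal to the gold value. As one plausible reading of two of the names above, the sketch below compares dates by year only and normalizes yes/no style flags before comparing them. Both helpers are hypothetical and only illustrate the idea; they are not PyDI's `year_only_match` or `boolean_match`.

```python
def year_only_equal(fused_value, gold_value):
    """Compare two ISO-style dates (YYYY-MM-DD) by their four-digit year only."""
    return str(fused_value)[:4] == str(gold_value)[:4]


_TRUE_TOKENS = {"yes", "true", "1", "y"}


def boolean_equal(fused_value, gold_value):
    """Compare two flags after mapping common spellings of 'true' to a bool."""
    def to_bool(value):
        return str(value).strip().lower() in _TRUE_TOKENS

    return to_bool(fused_value) == to_bool(gold_value)


print(year_only_equal("1994-01-01", "1994-07-06"))  # True
print(year_only_equal("1994-01-01", "1995-01-01"))  # False
print(boolean_equal("yes", True))                   # True
print(boolean_equal("yes", "no"))                   # False
```

A tolerant comparison like this is useful when sources store the same fact at different granularity, for example dates that only carry a reliable year, as in the fused sample above where every date falls on January 1.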


In [35]:
from PyDI.fusion import DataFusionEvaluator

fusion_test_set = load_xml(DATA_DIR / 'fusion' / 'test_set.xml', name='fusion_test_set', nested_handling='aggregate')

# Keep only the ID column and the attributes we evaluate
eval_cols = ['academy_awards_id','title','director_name','actors_actor_name','date','oscar']
fused_eval = fused[eval_cols].copy()

# Create evaluator with our fusion strategy
evaluator = DataFusionEvaluator(strategy)

# Evaluate the fused results against the gold standard
print("Evaluating fusion results against gold standard...")
evaluation_results = evaluator.evaluate(
    fused_df=fused_eval,
    fused_id_column='academy_awards_id',
    gold_df=fusion_test_set,
    gold_id_column='id',
)

# Display evaluation metrics
print("\nFusion Evaluation Results:")
print("=" * 40)
for metric, value in evaluation_results.items():
    if isinstance(value, float):
        print(f"  {metric}: {value:.3f}")
    else:
        print(f"  {metric}: {value}")
        
print(f"\nOverall Accuracy: {evaluation_results.get('overall_accuracy', 0):.1%}")

[INFO ] PyDI.fusion.evaluation - Starting fusion evaluation
[INFO ] PyDI.fusion.evaluation - Evaluation complete: 0.947 overall accuracy (90/95)


Evaluating fusion results against gold standard...

Fusion Evaluation Results:
  overall_accuracy: 0.947
  macro_accuracy: 0.950
  num_evaluated_records: 20
  num_evaluated_attributes: 5
  total_evaluations: 95
  total_correct: 90
  date_accuracy: 0.950
  date_count: 20
  oscar_accuracy: 1.000
  oscar_count: 20
  director_name_accuracy: 1.000
  director_name_count: 15
  actors_actor_name_accuracy: 0.850
  actors_actor_name_count: 20
  title_accuracy: 0.950
  title_count: 20

Overall Accuracy: 94.7%
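
The two summary metrics weigh attributes differently. The sketch below reproduces both from the per-attribute results above, assuming that `macro_accuracy` is the unweighted mean of the per-attribute accuracies. The correct counts are derived here as accuracy times count; they are not printed by the evaluator.

```python
# (correct, evaluated) per attribute, derived from the accuracies and counts above
per_attribute = {
    "date": (19, 20),               # 0.950 * 20
    "oscar": (20, 20),              # 1.000 * 20
    "director_name": (15, 15),      # 1.000 * 15
    "actors_actor_name": (17, 20),  # 0.850 * 20
    "title": (19, 20),              # 0.950 * 20
}

total_correct = sum(correct for correct, _ in per_attribute.values())
total_evaluations = sum(count for _, count in per_attribute.values())

# Overall accuracy pools all evaluations; macro accuracy averages per attribute
overall_accuracy = total_correct / total_evaluations
macro_accuracy = sum(c / n for c, n in per_attribute.values()) / len(per_attribute)

print(total_correct, total_evaluations)  # 90 95
print(f"{overall_accuracy:.3f}")         # 0.947
print(f"{macro_accuracy:.3f}")           # 0.950
```

The two numbers differ slightly because `director_name` was evaluated on only 15 records, so it counts less in the pooled figure than in the per-attribute average.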
