# PyDI Data Integration Tutorial

This tutorial walks through an end-to-end data integration pipeline with PyDI. Using three movie datasets, we'll cover data loading and profiling, identity resolution (entity matching), and data fusion.

### What You'll Learn

1. **Data Loading & Profiling**: Load and analyze movie datasets with provenance tracking
2. **Identity Resolution**: 
   - Blocking strategies (Standard, Sorted Neighbourhood, Token-based, Embedding-based)
   - Multi-attribute similarity matching with custom comparators
   - Machine learning-based entity matching
3. **Data Fusion**: 
   - Conflict resolution with custom fusion rules
   - Quality assessment against test set
   - Provenance-based conflict resolution

### Datasets

We'll use three movie datasets:
- **Academy Awards**: Movies with Oscar information (4,592 records)
- **Actors**: Movies with actor details (151 records)
- **Golden Globes**: Movies with Golden Globe awards (2,286 records)

These datasets contain overlapping movie information but with different attributes, data quality issues, and conflicting values - perfect for demonstrating real-world data integration challenges.

In [1]:
# # Core Python libraries
# import pandas as pd
# import numpy as np
# import logging
# import time
# import json
# from datetime import datetime

# # PyDI imports for data loading and profiling

# # PyDI imports for entity matching
# from PyDI.entitymatching import (
#     # Blocking strategies
#     NoBlocking, StandardBlocking, SortedNeighbourhood, 
#     TokenBlocking, EmbeddingBlocking,
#     # Matchers
#     RuleBasedMatcher, MLBasedMatcher,
#     # Feature extraction for ML
#     FeatureExtractor,
#     # Comparators
#     StringComparator, DateComparator, NumericComparator,
#     # Evaluation - NEW: Separate methods for blocking and matching evaluation
#     EntityMatchingEvaluator,
#     # Utilities
#     ensure_record_ids
# )

# # PyDI imports for data fusion
# from PyDI.fusion import (
#     DataFusionEngine, DataFusionStrategy, DataFusionEvaluator,
#     # Fusion rules
#     longest_string, shortest_string, most_recent, earliest,
#     average, median, maximum, minimum, most_complete,
#     union, intersection, voting,
#     # Convenient aliases
#     LONGEST, SHORTEST, LATEST, EARLIEST, AVG, MAX, MIN, VOTE, UNION,
#     # Analysis and reporting
#     FusionReport, FusionQualityMetrics, ProvenanceTracker,
#     build_record_groups_from_correspondences,
# )

### Setup the environment

In [2]:
# Install the PyDI package if not already installed
# First navigate to the root directory of the repository in your terminal, then run:
# !pip install -e .

In [3]:
from pathlib import Path

# Setup paths
def get_repo_root():
    """Get repository root directory."""
    current = Path.cwd()
    while current != current.parent:
        if (current / 'pyproject.toml').exists():
            return current
        current = current.parent
    return Path.cwd()

ROOT = get_repo_root()
OUTPUT_DIR = ROOT / "PyDI" / "tutorial" / "output" / "movies"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"PyDI Tutorial")
print(f"Repository root: {ROOT}")
print(f"Output directory: {OUTPUT_DIR}")
print(f"All systems ready! 🚀")

PyDI Tutorial
Repository root: c:\Users\Ralph\dev\pydi
Output directory: c:\Users\Ralph\dev\pydi\PyDI\tutorial\output\movies
All systems ready! 🚀


## Part 1: Data Loading and Profiling

PyDI provides provenance-aware data loading that automatically tracks dataset metadata and optionally adds unique identifiers to each record. Let's load our movie datasets and understand their characteristics.

In [4]:
from PyDI.io import load_xml

# Define dataset paths
DATA_DIR = ROOT / "PyDI" / "tutorial" / "input" / "movies"

# Load Academy Awards dataset
academy_awards = load_xml(
    DATA_DIR / "data" / "academy_awards.xml",
    name="academy_awards",
    record_tag="movie",
    add_index=True,
    index_column_name="_id"
)

# Load Actors dataset  
actors = load_xml(
    DATA_DIR / "data" / "actors.xml",
    name="actors", 
    record_tag="movie",
    add_index=True,
    index_column_name="_id"
)

# Load Golden Globes dataset
golden_globes = load_xml(
    DATA_DIR / "data" / "golden_globes.xml",
    name="golden_globes",
    record_tag="movie", 
    add_index=True,
    index_column_name="_id"
)

# Display basic information
datasets = [academy_awards, actors, golden_globes]
names = ["Academy Awards", "Actors", "Golden Globes"]

for df, name in zip(datasets, names):
    print(f"{name}:")
    print(f"  Records: {len(df):,}")
    print(f"  Attributes: {len(df.columns)}")
    print(f"  Columns: {list(df.columns)}")
    print(f"  Dataset name: {df.attrs.get('dataset_name', 'unknown')}")
    print()

total_records = sum(len(df) for df in datasets)
print(f"Total records across all datasets: {total_records:,}")

Academy Awards:
  Records: 4,592
  Attributes: 7
  Columns: ['_id', 'id', 'title', 'actor_name', 'date', 'director_name', 'oscar']
  Dataset name: academy_awards

Actors:
  Records: 151
  Attributes: 7
  Columns: ['_id', 'id', 'title', 'actor_name', 'actors_actor_birthday', 'actors_actor_birthplace', 'date']
  Dataset name: actors

Golden Globes:
  Records: 2,286
  Attributes: 7
  Columns: ['_id', 'id', 'title', 'actor_name', 'date', 'director_name', 'globe']
  Dataset name: golden_globes

Total records across all datasets: 7,029


In [5]:
# Preview the data structure

print("\n📽️ Academy Awards Dataset:")
display(academy_awards.head(3))

print("\n🎭 Actors Dataset:")
display(actors.head(3))

print("\n🏆 Golden Globes Dataset:")
display(golden_globes.head(3))


📽️ Academy Awards Dataset:


Unnamed: 0,_id,id,title,actor_name,date,director_name,oscar
0,academy_awards-0000,academy_awards_1,Biutiful,Javier Bardem,2010-01-01,,
1,academy_awards-0001,academy_awards_2,True Grit,Jeff Bridges,2010-01-01,Joel Coen,
2,academy_awards-0002,academy_awards_2,True Grit,Jeff Bridges,2010-01-01,Ethan Coen,



🎭 Actors Dataset:


Unnamed: 0,_id,id,title,actor_name,actors_actor_birthday,actors_actor_birthplace,date
0,actors-0000,actors_1,7th Heaven,Janet Gaynor,1906-01-01,Pennsylvania,1929-01-01
1,actors-0001,actors_2,Coquette,Mary Pickford,1892-01-01,Canada,1930-01-01
2,actors-0002,actors_3,The Divorcee,Norma Shearer,1902-01-01,Canada,1931-01-01



🏆 Golden Globes Dataset:


Unnamed: 0,_id,id,title,actor_name,date,director_name,globe
0,golden_globes-0000,golden_globes_1,Frankie and Alice,Halle Berry,2011-01-01,,
1,golden_globes-0001,golden_globes_2,Rabbit Hole,Nicole Kidman,2011-01-01,,
2,golden_globes-0002,golden_globes_3,Winter's Bone,Jennifer Lawrence,2011-01-01,,


### Data Quality Analysis

Let's use PyDI's profiling capabilities to understand our data quality and identify the best attributes for matching.

### Basic Dataset Summary

First, let's use the DataProfiler's `summary()` method to get basic statistics for each dataset.

In [6]:
from PyDI.profiling import DataProfiler

# Initialize the DataProfiler
profiler = DataProfiler()

for df, name in zip(datasets, names):
    # summary() prints key statistics and returns an object containing them
    profile = profiler.summary(df)

# `profile` now holds the summary of the last dataset in the loop (golden_globes)
display(profile)

academy_awards:
  Rows: 4,592
  Columns: 7
  Total nulls: 11,036
  Null percentage: 34.3%
  Null counts per column:
    title: 12 (0.3%)
    actor_name: 3,535 (77.0%)
    director_name: 4,172 (90.9%)
    oscar: 3,317 (72.2%)

actors:
  Rows: 151
  Columns: 7
  Total nulls: 0
  Null percentage: 0.0%

golden_globes:
  Rows: 2,286
  Columns: 7
  Total nulls: 3,681
  Null percentage: 23.0%
  Null counts per column:
    actor_name: 54 (2.4%)
    director_name: 1,966 (86.0%)
    globe: 1,661 (72.7%)



{'rows': 2286,
 'columns': 7,
 'nulls_total': 3681,
 'nulls_per_column': {'_id': 0,
  'id': 0,
  'title': 0,
  'actor_name': 54,
  'date': 0,
  'director_name': 1966,
  'globe': 1661},
 'dtypes': {'_id': 'string',
  'id': 'object',
  'title': 'object',
  'actor_name': 'object',
  'date': 'object',
  'director_name': 'object',
  'globe': 'object'}}
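
The "Null percentage" values above can be reproduced by hand. The sketch below assumes the percentage is the number of empty cells divided by rows × columns; that reading is inferred from the printed numbers, not taken from PyDI's source, and `null_percentage` is a helper name made up for this illustration.

```python
def null_percentage(total_nulls: int, rows: int, columns: int) -> float:
    """Share of empty cells among all rows x columns cells, in percent."""
    return 100.0 * total_nulls / (rows * columns)

# Per-column null counts printed for academy_awards above
academy_nulls = {"title": 12, "actor_name": 3535, "director_name": 4172, "oscar": 3317}
academy_total = sum(academy_nulls.values())

# Row and column counts are the ones printed by summary()
academy_pct = null_percentage(academy_total, rows=4592, columns=7)
globes_pct = null_percentage(3681, rows=2286, columns=7)

print(f"academy_awards: {academy_total:,} nulls ({academy_pct:.1f}%)")
print(f"golden_globes: 3,681 nulls ({globes_pct:.1f}%)")
```

Note that the `_id` column added at load time counts towards the seven columns, so it is part of the denominator.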

### Attribute Coverage Analysis

Next, let's use the `analyze_coverage()` method to understand how attributes overlap across datasets.

In [7]:
coverage = profiler.analyze_coverage(
    datasets=datasets,
    include_samples=True,
    sample_count=3  # Show 3 sample values per attribute
)

print("📊 Attribute coverage across datasets:")
display(coverage)

# Identify attributes suitable for entity matching
print("\n🔗 Attributes suitable for entity matching:")
matching_attrs = coverage[coverage['datasets_with_attribute'] >= 2]['attribute'].tolist()
print(f"Attributes available in 2+ datasets: {matching_attrs}")

📊 Attribute coverage across datasets:


Unnamed: 0,attribute,academy_awards_count,academy_awards_pct,academy_awards_coverage,academy_awards_samples,actors_count,actors_pct,actors_coverage,actors_samples,golden_globes_count,golden_globes_pct,golden_globes_coverage,golden_globes_samples,avg_coverage,max_coverage,datasets_with_attribute
0,_id,4592/4592,100.0%,1.0,"['academy_awards-0000', 'academy_awards-0001',...",151/151,100.0%,1.0,"['actors-0000', 'actors-0001', 'actors-0002']",2286/2286,100.0%,1.0,"['golden_globes-0000', 'golden_globes-0001', '...",1.0,1.0,3
1,actor_name,1057/4592,23.0%,0.230183,"['Javier Bardem', 'Jeff Bridges', 'Jeff Bridges']",151/151,100.0%,1.0,"['Janet Gaynor', 'Mary Pickford', 'Norma Shear...",2232/2286,97.6%,0.976378,"['Halle Berry', 'Nicole Kidman', 'Jennifer Law...",0.73552,1.0,3
2,actors_actor_birthday,0/0,0%,0.0,,151/151,100.0%,1.0,"['1906-01-01', '1892-01-01', '1902-01-01']",0/0,0%,0.0,,0.333333,1.0,1
3,actors_actor_birthplace,0/0,0%,0.0,,151/151,100.0%,1.0,"['Pennsylvania', 'Canada', 'Canada']",0/0,0%,0.0,,0.333333,1.0,1
4,date,4592/4592,100.0%,1.0,"['2010-01-01', '2010-01-01', '2010-01-01']",151/151,100.0%,1.0,"['1929-01-01', '1930-01-01', '1931-01-01']",2286/2286,100.0%,1.0,"['2011-01-01', '2011-01-01', '2011-01-01']",1.0,1.0,3
5,director_name,420/4592,9.1%,0.091463,"['Joel Coen', 'Ethan Coen', 'David Fincher']",0/0,0%,0.0,,320/2286,14.0%,0.139983,"['Darren Aronofsky', 'David Fincher', 'Tom Hoo...",0.077149,0.139983,2
6,globe,0/0,0%,0.0,,0/0,0%,0.0,,625/2286,27.3%,0.273403,"['yes', 'yes', 'yes']",0.091134,0.273403,1
7,id,4592/4592,100.0%,1.0,"['academy_awards_1', 'academy_awards_2', 'acad...",151/151,100.0%,1.0,"['actors_1', 'actors_2', 'actors_3']",2286/2286,100.0%,1.0,"['golden_globes_1', 'golden_globes_2', 'golden...",1.0,1.0,3
8,oscar,1275/4592,27.8%,0.277657,"['yes', 'yes', 'yes']",0/0,0%,0.0,,0/0,0%,0.0,,0.092552,0.277657,1
9,title,4580/4592,99.7%,0.997387,"['Biutiful', 'True Grit', 'True Grit']",151/151,100.0%,1.0,"['7th Heaven', 'Coquette', 'The Divorcee']",2286/2286,100.0%,1.0,"['Frankie and Alice', 'Rabbit Hole', ""Winter's...",0.999129,1.0,3



🔗 Attributes suitable for entity matching:
Attributes available in 2+ datasets: ['_id', 'actor_name', 'date', 'director_name', 'id', 'title']
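
The list above still contains `_id` and `id`, which are record identifiers rather than descriptive attributes. A minimal sketch of how one might narrow the list down, using plain Python and the coverage values from the table; the exclusion set and the helper name are assumptions for illustration, not part of PyDI.

```python
# Per-dataset coverage (academy_awards, actors, golden_globes) from the table above.
# None marks a dataset that does not have the attribute at all.
coverage_by_attribute = {
    "_id": [1.0, 1.0, 1.0],
    "actor_name": [0.230183, 1.0, 0.976378],
    "actors_actor_birthday": [None, 1.0, None],
    "actors_actor_birthplace": [None, 1.0, None],
    "date": [1.0, 1.0, 1.0],
    "director_name": [0.091463, None, 0.139983],
    "globe": [None, None, 0.273403],
    "id": [1.0, 1.0, 1.0],
    "oscar": [0.277657, None, None],
    "title": [0.997387, 1.0, 1.0],
}

# Hypothetical exclusion list: identifiers carry no matching signal
IDENTIFIER_COLUMNS = {"_id", "id"}

def shared_content_attributes(coverage, min_datasets=2):
    """Keep non-identifier attributes that exist in at least min_datasets datasets."""
    selected = []
    for attribute, values in coverage.items():
        present = [v for v in values if v is not None]
        if attribute in IDENTIFIER_COLUMNS or len(present) < min_datasets:
            continue
        selected.append(attribute)
    return selected

candidates = shared_content_attributes(coverage_by_attribute)
print(candidates)

# avg_coverage in the table is the mean over all three datasets
avg_actor = sum(coverage_by_attribute["actor_name"]) / 3
print(f"avg coverage of actor_name: {avg_actor:.5f}")
```

Even among the remaining attributes, coverage matters: `director_name` is present in two datasets but filled in for fewer than 15% of their records, so it is a weak matching signal on its own.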


### Detailed Data Profiling

Now let's generate comprehensive HTML profiles for each dataset using the `profile()` method. These reports provide in-depth statistical analysis.

In [8]:
# Generate detailed HTML profiles for each dataset

profile_dir = OUTPUT_DIR / "dataset-profiles"
profile_dir.mkdir(parents=True, exist_ok=True)

profile_paths = []

for df, name in zip(datasets, names):
    print(f"📊 Profiling {name}...")
    
    profile_path = profiler.profile(df, str(profile_dir))
    profile_paths.append(profile_path)
    print(f"  ✅ Profile saved: {profile_path}")

print(f"\n🎯 Generated {len(profile_paths)} detailed HTML reports")
print(f"📁 Location: {profile_dir}")
print("\n💡 Open these HTML files in your browser for interactive exploration:")
for path in profile_paths:
    print(f"  • {Path(path).name}")


📊 Profiling Academy Awards...



  ✅ Profile saved: c:\Users\Ralph\dev\pydi\PyDI\tutorial\output\movies\dataset-profiles\academy_awards_profile.html
📊 Profiling Actors...



  ✅ Profile saved: c:\Users\Ralph\dev\pydi\PyDI\tutorial\output\movies\dataset-profiles\actors_profile.html
📊 Profiling Golden Globes...



  ✅ Profile saved: c:\Users\Ralph\dev\pydi\PyDI\tutorial\output\movies\dataset-profiles\golden_globes_profile.html

🎯 Generated 3 detailed HTML reports
📁 Location: c:\Users\Ralph\dev\pydi\PyDI\tutorial\output\movies\dataset-profiles

💡 Open these HTML files in your browser for interactive exploration:
  • academy_awards_profile.html
  • actors_profile.html
  • golden_globes_profile.html


### Dataset Comparison

Finally, let's use the `compare()` method to create pairwise comparison reports that highlight differences and similarities between the datasets.

In [9]:
# Compare each pair of datasets

compare_dir = OUTPUT_DIR / "dataset-comparisons"
compare_dir.mkdir(parents=True, exist_ok=True)

print("🔍 Comparing Academy Awards vs Golden Globes datasets...")

comparison_path = profiler.compare(academy_awards, golden_globes, compare_dir)
print(f"✅ Comparison report saved: {comparison_path}")

print("🔍 Comparing Academy Awards vs Actors datasets...")

comparison_path = profiler.compare(academy_awards, actors, compare_dir)
print(f"✅ Comparison report saved: {comparison_path}")

print("🔍 Comparing Actors vs Golden Globes datasets...")

comparison_path = profiler.compare(actors, golden_globes, compare_dir)
print(f"✅ Comparison report saved: {comparison_path}")

🔍 Comparing Academy Awards vs Golden Globes datasets...

Report c:\Users\Ralph\dev\pydi\PyDI\tutorial\output\movies\dataset-comparisons\academy_awards_vs_golden_globes_compare.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
✅ Comparison report saved: c:\Users\Ralph\dev\pydi\PyDI\tutorial\output\movies\dataset-comparisons\academy_awards_vs_golden_globes_compare.html
🔍 Comparing Academy Awards vs Actors datasets...

Report c:\Users\Ralph\dev\pydi\PyDI\tutorial\output\movies\dataset-comparisons\academy_awards_vs_actors_compare.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
✅ Comparison report saved: c:\Users\Ralph\dev\pydi\PyDI\tutorial\output\movies\dataset-comparisons\academy_awards_vs_actors_compare.html
🔍 Comparing Actors vs Golden Globes datasets...

Report c:\Users\Ralph\dev\pydi\PyDI\tutorial\output\movies\dataset-comparisons\actors_vs_golden_globes_compare.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
✅ Comparison report saved: c:\Users\Ralph\dev\pydi\PyDI\tutorial\output\movies\dataset-comparisons\actors_vs_golden_globes_compare.html


## Part 2: Identity Resolution (Entity Matching)

Identity Resolution is the process of identifying records that refer to the same real-world entity. PyDI provides comprehensive blocking and matching capabilities.

### Step 1: Blocking Strategies

Blocking reduces the number of comparisons from O(n²) to a manageable subset. Let's explore different blocking strategies.

In [10]:
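# Let's set up logging first
import logging

# Make sure the log directory exists; logging.FileHandler does not create it
Path("output/logs").mkdir(parents=True, exist_ok=True)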

# # Configure logging for INFO level
# logging.basicConfig(
#     level=logging.INFO,
#     format='[%(levelname)-5s] %(name)s - %(message)s',
#     handlers=[
#           logging.FileHandler('output/logs/pydi.log'),  # Save to file
#           logging.StreamHandler()                      # Display on console
#       ],
#     force=True
# )

# Configure logging for DEBUG level
logging.basicConfig(
    level=logging.DEBUG,
    format='[%(levelname)-5s] %(name)s - %(message)s',
    handlers=[
          logging.FileHandler('output/logs/pydi.log'),  # Save to file
          logging.StreamHandler()                      # Display on console
      ],
    force=True
)

In [11]:
from PyDI.entitymatching import NoBlocking, StandardBlocking, SortedNeighbourhood, TokenBlocking, EmbeddingBlocking

# We'll focus on Actors and Golden Globes for showcasing blocking strategies

max_pairs = len(actors) * len(golden_globes)
print(f"Without blocking: {max_pairs:,} comparisons required")
print("\n🎯 Goal: Reduce comparisons while maintaining high recall\n")

# No Blocking - compare all possible pairs
print("\n No Blocking")

no_blocker = NoBlocking(
    actors, golden_globes,
    batch_size=1000
)

# in an actual large-scale application, we do not build a list of all pairs but stream over them like this
for batch in no_blocker:
    # do something with the pairs
    continue

# but we can also generate the full set of pairs for smaller datasets
no_candidates = no_blocker.materialize()

print(f"  Generated: {len(no_candidates):,} candidates")

Without blocking: 345,186 comparisons required

🎯 Goal: Reduce comparisons while maintaining high recall


 No Blocking
  Generated: 345,186 candidates


Now let's use an actual blocker. Note that when instantiating the blocker, it also writes out a corresponding debug file.

In [12]:
# 1. Standard Blocking - First 3 characters of title
print("\n1️⃣ Standard Blocking (First 3 Characters of Title)")

# Add title_prefix directly to the original dataframes
actors['title_prefix'] = actors['title'].astype(str).str[:3]
golden_globes['title_prefix'] = golden_globes['title'].astype(str).str[:3]

standard_blocker_a2g = StandardBlocking(
    actors, golden_globes,
    on=['title_prefix'],  # Block on first 3 characters of title
    batch_size=1000
)

standard_candidates_a2g = standard_blocker_a2g.materialize()

print()
print(f"  Generated: {len(standard_candidates_a2g):,} candidates")

[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocking - Creating blocking key values for dataset1: 151 records
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocking - Creating blocking key values for dataset2: 2286 records
[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocking - created 109 blocking keys for first dataset
[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocking - created 792 blocking keys for second dataset
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocking - Joining blocking key values: 109 x 792 blocks
[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocking - created 91 blocks from blocking keys
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocking - Block size distribution:
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocking - Frequency   Element
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocking - 19          1
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocking - 13      


1️⃣ Standard Blocking (First 3 Characters of Title)

  Generated: 736 candidates
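
To make the idea behind key-based blocking concrete, here is a minimal plain-Python sketch: records are grouped by a blocking key, and only records that share a key become candidate pairs. It illustrates the general technique only; it is not PyDI's implementation, and the toy records and helper names are invented for this example.

```python
from collections import defaultdict

def prefix_key(title, length=3):
    """Blocking key: the first `length` characters of the title."""
    return title[:length]

def standard_blocking_pairs(left, right, key_fn):
    """Pair every left record with the right records that share its blocking key."""
    blocks = defaultdict(list)
    for record_id, title in right:
        blocks[key_fn(title)].append(record_id)

    pairs = []
    for left_id, title in left:
        for right_id in blocks.get(key_fn(title), []):
            pairs.append((left_id, right_id))
    return pairs

# Toy records as (id, title); the ids are made up
left = [("a1", "7th Heaven"), ("a2", "Coquette")]
right = [("g1", "7th Heaven"), ("g2", "Coquette"), ("g3", "Cocoon"), ("g4", "Rabbit Hole")]

pairs = standard_blocking_pairs(left, right, prefix_key)
print(pairs)  # 2 candidate pairs instead of 2 x 4 = 8
```

The sketch also shows the weakness of an exact prefix key: a title that differs within its first three characters (for example a leading "The") lands in a different block and can never be matched.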


In [13]:
# 2. Sorted Neighbourhood - Sequential similarity
print("\n2️⃣ Sorted Neighbourhood Blocking (Title-based, Window=10)")

sn_blocker_a2g = SortedNeighbourhood(
    actors, golden_globes,
    key='title',  # Sort by title
    window=10,    # Sliding window size
    batch_size=1000
)

sn_candidates_a2g = sn_blocker_a2g.materialize()

print()
print(f"  Generated: {len(sn_candidates_a2g):,} candidates")

[DEBUG] PyDI.entitymatching.blocking.sorted_neighbourhood.SortedNeighbourhood - Creating sort keys for dataset1: 151 records
[DEBUG] PyDI.entitymatching.blocking.sorted_neighbourhood.SortedNeighbourhood - Creating sort keys for dataset2: 2286 records
[DEBUG] PyDI.entitymatching.blocking.sorted_neighbourhood.SortedNeighbourhood - Sorting combined dataset with 2437 records
[INFO ] PyDI.entitymatching.blocking.sorted_neighbourhood.SortedNeighbourhood - created sorted neighbourhood with window size 10
[INFO ] PyDI.entitymatching.blocking.sorted_neighbourhood.SortedNeighbourhood - created 1 sorted sequence from 2437 records
[INFO ] PyDI.entitymatching.blocking.sorted_neighbourhood.SortedNeighbourhood - Debug results written to file: output/debugResultsBlocking_SortedNeighbourhood.csv
[DEBUG] PyDI.entitymatching.blocking.sorted_neighbourhood.SortedNeighbourhood - Creating candidate record pairs from sorted neighbourhood with window 10



2️⃣ Sorted Neighbourhood Blocking (Title-based, Window=10)

  Generated: 2,360 candidates
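
The log above shows the two datasets being combined into one sorted sequence. A rough sketch of the sorted neighbourhood technique in plain Python follows: sort all records by key, slide a window over the sequence, and keep pairs that come from different datasets. How PyDI interprets `window` exactly is not shown here, so the semantics below (each record is compared with the next `window - 1` records) are an assumption, as are the toy records.

```python
def sorted_neighbourhood_pairs(left, right, window):
    """Candidate pairs of records that lie within `window` positions after sorting."""
    # Tag each record with its source so that only cross-dataset pairs are kept
    tagged = [(key.lower(), "L", rid) for rid, key in left]
    tagged += [(key.lower(), "R", rid) for rid, key in right]
    tagged.sort()

    pairs = set()
    for i, (_, side_i, id_i) in enumerate(tagged):
        # Compare with the next (window - 1) records in sort order
        for _, side_j, id_j in tagged[i + 1 : i + window]:
            if side_i != side_j:
                pairs.add((id_i, id_j) if side_i == "L" else (id_j, id_i))
    return pairs

# Toy records as (id, title); the ids are made up
left = [("a1", "Coquette"), ("a2", "The Divorcee")]
right = [("g1", "Coquette"), ("g2", "Rabbit Hole"), ("g3", "The Divorce")]

sn_pairs = sorted_neighbourhood_pairs(left, right, window=2)
print(sorted(sn_pairs))
```

Unlike an exact blocking key, sorting tolerates small differences at the end of the key ("The Divorce" vs "The Divorcee" end up adjacent), but differences at the start of the key still push true matches far apart.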


In [14]:
# 3. Token Blocking - Token-based similarity
print("\n3️⃣ Token Blocking (Title Tokens, Min Length=5)")

token_blocker_a2g = TokenBlocking(
    actors, golden_globes,
    column='title',      # Tokenize titles
    min_token_len=5,     # Ignore very short tokens
    batch_size=1000
)

token_candidates_a2g = token_blocker_a2g.materialize()

print()
print(f"  Generated: {len(token_candidates_a2g):,} candidates")

[DEBUG] PyDI.entitymatching.blocking.token_blocking.TokenBlocking - Creating token index for dataset1: 151 records
[DEBUG] PyDI.entitymatching.blocking.token_blocking.TokenBlocking - Creating token index for dataset2: 2286 records
[INFO ] PyDI.entitymatching.blocking.token_blocking.TokenBlocking - created 178 token keys for first dataset
[INFO ] PyDI.entitymatching.blocking.token_blocking.TokenBlocking - created 1776 token keys for second dataset
[DEBUG] PyDI.entitymatching.blocking.token_blocking.TokenBlocking - Joining token keys: 178 x 1776 tokens
[INFO ] PyDI.entitymatching.blocking.token_blocking.TokenBlocking - created 142 blocks from token keys
[DEBUG] PyDI.entitymatching.blocking.token_blocking.TokenBlocking - Token frequency distribution:
[DEBUG] PyDI.entitymatching.blocking.token_blocking.TokenBlocking - Frequency   Element
[DEBUG] PyDI.entitymatching.blocking.token_blocking.TokenBlocking - 74          1
[DEBUG] PyDI.entitymatching.blocking.token_blocking.TokenBlocking - 29  


3️⃣ Token Blocking (Title Tokens, Min Length=5)

  Generated: 431 candidates
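
Token blocking can be pictured as an inverted index from tokens to records: two records become a candidate pair as soon as they share one sufficiently long token. The following plain-Python sketch illustrates that idea; the tokenizer, the `>=` reading of the minimum token length, and the toy records are assumptions and may differ from what `TokenBlocking` does internally.

```python
import re
from collections import defaultdict

def tokens(text, min_token_len):
    """Lower-cased word tokens with at least min_token_len characters."""
    return {t for t in re.findall(r"\w+", text.lower()) if len(t) >= min_token_len}

def token_blocking_pairs(left, right, min_token_len=5):
    """Pair records that share at least one token of sufficient length."""
    index = defaultdict(set)  # token -> ids of right records containing it
    for right_id, title in right:
        for tok in tokens(title, min_token_len):
            index[tok].add(right_id)

    pairs = set()
    for left_id, title in left:
        for tok in tokens(title, min_token_len):
            for right_id in index.get(tok, ()):
                pairs.add((left_id, right_id))
    return pairs

# Toy records as (id, title); ids and titles are made up
left = [("a1", "The Divorcee")]
right = [("g1", "Divorcee Story"), ("g2", "The Artist"), ("g3", "Story of a Divorcee")]

token_pairs = token_blocking_pairs(left, right)
print(sorted(token_pairs))
```

Because the short token "the" is filtered out, "The Artist" is not paired with "The Divorcee"; without a minimum length, frequent short tokens would create very large blocks.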


In [15]:
# 4. Embedding Blocking - Semantic similarity
print("\n4️⃣ Embedding Blocking (Semantic Similarity)")

embedding_blocker_a2g = EmbeddingBlocking(
    actors, golden_globes,
    text_cols=['title'],
    model="sentence-transformers/all-MiniLM-L6-v2",
    index_backend="sklearn",
    top_k=10,          # Top 10 most similar
    batch_size=500
)
    
embedding_candidates_a2g = embedding_blocker_a2g.materialize()

print()
print(f"  Generated: {len(embedding_candidates_a2g):,} candidates")

[INFO ] PyDI.entitymatching.blocking.embedding.EmbeddingBlocking - Initialized EmbeddingBlocking with sklearn backend, top_k=10, threshold=0.3
[DEBUG] PyDI.entitymatching.blocking.embedding.EmbeddingBlocking - Computing embeddings for datasets...
[DEBUG] PyDI.entitymatching.blocking.embedding.EmbeddingBlocking - Creating embeddings for dataset1: 151 records



4️⃣ Embedding Blocking (Semantic Similarity)


[INFO ] sentence_transformers.SentenceTransformer - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
[DEBUG] urllib3.connectionpool - Starting new HTTPS connection (1): huggingface.co:443
[DEBUG] urllib3.connectionpool - https://huggingface.co:443 "HEAD /sentence-transformers/all-MiniLM-L6-v2/resolve/main/modules.json HTTP/1.1" 307 0
[DEBUG] urllib3.connectionpool - https://huggingface.co:443 "HEAD /api/resolve-cache/models/sentence-transformers/all-MiniLM-L6-v2/c9745ed1d9f207416be6d2e6f8de32d1f16199bf/modules.json HTTP/1.1" 200 0
[DEBUG] urllib3.connectionpool - https://huggingface.co:443 "HEAD /sentence-transformers/all-MiniLM-L6-v2/resolve/main/config_sentence_transformers.json HTTP/1.1" 307 0
[DEBUG] urllib3.connectionpool - https://huggingface.co:443 "HEAD /api/resolve-cache/models/sentence-transformers/all-MiniLM-L6-v2/c9745ed1d9f207416be6d2e6f8de32d1f16199bf/config_sentence_transformers.json HTTP/1.1" 200 0
[DEBUG] urllib3.connectionpool - https://hugg


  Generated: 1,497 candidates
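
Conceptually, embedding blocking keeps, for every record, its nearest neighbours in vector space. The sketch below shows a top-k selection with a similarity threshold on hand-made 2D vectors; it stands in for real sentence embeddings and a nearest-neighbour index. The use of cosine similarity and the way `top_k` and `threshold` interact are assumptions for illustration, not a description of `EmbeddingBlocking` internals.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_pairs(left, right, top_k, threshold):
    """For each left record, keep its top_k most similar right records above threshold."""
    pairs = []
    for left_id, left_vec in left.items():
        scored = sorted(
            ((cosine(left_vec, right_vec), right_id) for right_id, right_vec in right.items()),
            reverse=True,
        )
        for score, right_id in scored[:top_k]:
            if score >= threshold:
                pairs.append((left_id, right_id))
    return pairs

# Toy "embeddings"; real ones would come from a sentence-embedding model
left = {"a1": [1.0, 0.0]}
right = {"g1": [0.9, 0.1], "g2": [0.0, 1.0], "g3": [0.6, 0.8]}

embedding_pairs = top_k_pairs(left, right, top_k=2, threshold=0.3)
print(embedding_pairs)
```

With `top_k=10` and 151 records on the left, at most 1,510 candidates can result under this scheme; a threshold can only reduce that number.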


### Step 2: Evaluation Against Ground Truth

PyDI provides evaluation methods for blocking with pair completeness, pair quality, and reduction ratio:
- **`evaluate_blocking()`**: Evaluates blocking given an already materialized set of pairs.
- **`evaluate_blocking_batched()`**: Evaluates blocking by iterating over batches and accumulating the results. Useful for very large datasets.
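
The three metrics follow their standard definitions: pair completeness is the share of true matches that survive blocking, pair quality is the share of candidates that are true matches, and reduction ratio is the share of comparisons that blocking avoids. As a worked example, the sketch below plugs in the numbers from the standard-blocking run on Actors and Golden Globes (151 × 2,286 possible pairs, 736 candidates, 10 of the 26 labelled matches retained, as reported in the evaluation output in this section). `blocking_metrics` is a helper written for this illustration, not a PyDI function.

```python
def blocking_metrics(total_candidates, total_possible_pairs,
                     true_positives_found, total_true_pairs):
    """Standard blocking metrics computed from four counts."""
    return {
        "pair_completeness": true_positives_found / total_true_pairs,
        "pair_quality": true_positives_found / total_candidates,
        "reduction_ratio": 1 - total_candidates / total_possible_pairs,
    }

total_possible_pairs = 151 * 2286  # every Actors record against every Golden Globes record

metrics = blocking_metrics(
    total_candidates=736,
    total_possible_pairs=total_possible_pairs,
    true_positives_found=10,
    total_true_pairs=26,
)

for metric_name, value in metrics.items():
    print(f"{metric_name}: {value:.4f}")
```

A reduction ratio close to 1 is easy to achieve; the hard part is keeping pair completeness high at the same time, because a true match dropped during blocking can never be recovered by the matcher.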

Let's first evaluate materialized blocking results against a set of provided ground truth correspondences.

In [16]:
import pandas as pd
from PyDI.io import load_csv
from PyDI.entitymatching import EntityMatchingEvaluator
# Showcase EntityMatchingEvaluator.evaluate_blocking utility

# Load test set with proper column names
test_gt = load_csv(
    DATA_DIR / "entitymatching" / "actors_2_golden_globes_test.csv",
    name="test_set", header=None, names=['id1', 'id2', 'label'], add_index=False
)

# Use EntityMatchingEvaluator.evaluate_blocking on Standard Blocking
results = EntityMatchingEvaluator.evaluate_blocking(
    candidate_pairs=standard_candidates_a2g,
    blocker=standard_blocker_a2g,
    test_pairs=test_gt,
    out_dir=OUTPUT_DIR / "blocking-evaluation"
)

print(f"\n💡 Evaluating pair quality only makes sense if the test set contains all possible pairs, which is not the case in this example!")

display(results)

[INFO ] root -   Pair Completeness: 0.385
[INFO ] root -   Pair Quality:      0.014
[INFO ] root -   Reduction Ratio:   0.998
[INFO ] root -   True Matches Found: 10/26
[INFO ] root - Blocking evaluation complete: Completeness=0.3846 Quality=0.0136 Reduction=0.9979



💡 Evaluating pair quality only makes sense if the test set contains all possible pairs, which is not the case in this example!


{'pair_completeness': 0.38461538461538464,
 'pair_quality': 0.01358695652173913,
 'reduction_ratio': 0.9978678161918502,
 'total_candidates': 736,
 'total_possible_pairs': 345186,
 'true_positives_found': 10,
 'total_true_pairs': 26,
 'evaluation_timestamp': '2025-09-12T17:38:40.516562',
 'output_files': ['c:\\Users\\Ralph\\dev\\pydi\\PyDI\\tutorial\\output\\movies\\blocking-evaluation\\blocking_evaluation_summary.json',
  'c:\\Users\\Ralph\\dev\\pydi\\PyDI\\tutorial\\output\\movies\\blocking-evaluation\\blocking_detailed_results.csv']}

When datasets are very large, use `evaluate_blocking_batched()` to avoid materializing the full set of candidate pairs.

In [17]:
results = EntityMatchingEvaluator.evaluate_blocking_batched(
    blocker=standard_blocker_a2g,
    test_pairs=test_gt,
    out_dir=OUTPUT_DIR / "blocking-evaluation"
)

display(results)

[INFO ] root - Starting batched blocking evaluation...
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocking - Creating candidate record pairs from 91 blocks
[INFO ] root -   Pair Completeness: 0.385
[INFO ] root -   Pair Quality:      0.014
[INFO ] root -   Reduction Ratio:   0.998
[INFO ] root -   True Matches Found: 10/26
[INFO ] root -   Batches Processed:  1
[INFO ] root - Batched blocking evaluation complete: Completeness=0.3846 Quality=0.0136 Reduction=0.9979 Batches=1


{'pair_completeness': 0.38461538461538464,
 'pair_quality': 0.01358695652173913,
 'reduction_ratio': 0.9978678161918502,
 'total_candidates': 736,
 'total_possible_pairs': 345186,
 'true_positives_found': 10,
 'total_true_pairs': 26,
 'batches_processed': 1,
 'evaluation_timestamp': '2025-09-12T17:38:40.565197',
 'output_files': ['c:\\Users\\Ralph\\dev\\pydi\\PyDI\\tutorial\\output\\movies\\blocking-evaluation\\blocking_evaluation_summary.json',
  'c:\\Users\\Ralph\\dev\\pydi\\PyDI\\tutorial\\output\\movies\\blocking-evaluation\\blocking_detailed_results.csv']}
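
The batched variant returns the same metrics because they only depend on running counts. A minimal sketch of that idea in plain Python: stream over batches of candidate pairs, keep counters, and never hold all pairs in memory. The function and the toy batches are illustrative assumptions, not PyDI code, and the sketch assumes that no candidate pair appears in more than one batch.

```python
def evaluate_batches(batches, true_pairs, total_possible_pairs):
    """Accumulate blocking metrics over an iterable of candidate-pair batches."""
    true_pairs = set(true_pairs)
    total_candidates = 0
    found = set()
    batches_processed = 0

    for batch in batches:
        batches_processed += 1
        total_candidates += len(batch)
        # Only the (small) set of recovered true matches is kept in memory
        found.update(pair for pair in batch if pair in true_pairs)

    return {
        "pair_completeness": len(found) / len(true_pairs),
        "pair_quality": len(found) / total_candidates,
        "reduction_ratio": 1 - total_candidates / total_possible_pairs,
        "batches_processed": batches_processed,
    }

def toy_batches():
    """Generator standing in for a blocker that yields batches of (id1, id2) pairs."""
    yield [("a1", "g1"), ("a1", "g2")]
    yield [("a2", "g3")]

result = evaluate_batches(
    toy_batches(),
    true_pairs=[("a1", "g1"), ("a2", "g4")],
    total_possible_pairs=8,
)
print(result)
```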

Let's apply the same blocking strategies to the Academy Awards <-> Actors combination:

In [18]:
# Add title_prefix directly to the original dataframes
academy_awards['title_prefix'] = academy_awards['title'].astype(str).str[:3]

standard_blocker_aa2a = StandardBlocking(
    academy_awards, actors,
    on=['title_prefix'],  # Block on first 3 characters of title
    batch_size=1000
)
standard_candidates_aa2a = standard_blocker_aa2a.materialize()

sn_blocker_aa2a = SortedNeighbourhood(
    academy_awards, actors,
    key='title',  # Sort by title
    window=10,    # Sliding window size
    batch_size=1000
)
sn_candidates_aa2a = sn_blocker_aa2a.materialize()

token_blocker_aa2a = TokenBlocking(
    academy_awards, actors,
    column='title',      # Tokenize titles
    min_token_len=5,     # Ignore very short tokens
    batch_size=1000
)
token_candidates_aa2a = token_blocker_aa2a.materialize()

embedding_blocker_aa2a = EmbeddingBlocking(
    academy_awards, actors,
    text_cols=['title'],
    model="sentence-transformers/all-MiniLM-L6-v2",
    index_backend="sklearn",
    top_k=10,          # Top 10 most similar
    batch_size=500
)
embedding_candidates_aa2a = embedding_blocker_aa2a.materialize()

[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocking - Creating blocking key values for dataset1: 4592 records
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocking - Creating blocking key values for dataset2: 151 records
[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocking - created 1076 blocking keys for first dataset
[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocking - created 109 blocking keys for second dataset
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocking - Joining blocking key values: 1076 x 109 blocks
[INFO ] PyDI.entitymatching.blocking.standard.StandardBlocking - created 108 blocks from blocking keys
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocking - Block size distribution:
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocking - Frequency   Element
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocking - 13          2
[DEBUG] PyDI.entitymatching.blocking.standard.StandardBlocking - 13   

Now let's evaluate which blocking method we want to use for each dataset combination:

In [19]:
# Evaluate all blocking methods for both dataset combinations

evaluator = EntityMatchingEvaluator()

# Create dictionaries of candidates for both dataset combinations
a2g_blocking_candidates = {
    'StandardBlocking': [standard_candidates_a2g, standard_blocker_a2g],
    'SortedNeighbourhood': [sn_candidates_a2g, sn_blocker_a2g],
    'TokenBlocking': [token_candidates_a2g,token_blocker_a2g],
    'EmbeddingBlocking': [embedding_candidates_a2g,embedding_blocker_a2g]
}

aa2a_blocking_candidates = {
    'StandardBlocking': [standard_candidates_aa2a,standard_blocker_aa2a],
    'SortedNeighbourhood': [sn_candidates_aa2a, sn_blocker_aa2a],
    'TokenBlocking': [token_candidates_aa2a,token_blocker_aa2a],
    'EmbeddingBlocking': [embedding_candidates_aa2a,embedding_blocker_aa2a]
}

# Load correspondences for evaluation
a2g_correspondences = load_csv(
    DATA_DIR / "entitymatching" / "actors_2_golden_globes_test.csv",
    name="a2g_test", header=None, names=['id1', 'id2', 'label'], add_index=False
)

aa2a_correspondences = load_csv(
    DATA_DIR / "entitymatching" / "academy_awards_2_actors_test.csv",
    name="aa2a_test", header=None, names=['id1', 'id2', 'label'], add_index=False
)

# Evaluate blocking for a2g datasets
a2g_results = []
for method_name, (candidates, blocker) in a2g_blocking_candidates.items():
    result = evaluator.evaluate_blocking(
        candidate_pairs=candidates,
        blocker=blocker,
        test_pairs=a2g_correspondences,
        out_dir=OUTPUT_DIR / "blocking-evaluation"
    )
    result['method'] = method_name
    result['dataset'] = 'a2g'
    a2g_results.append(result)

# Evaluate blocking for aa2a datasets
aa2a_results = []
for method_name, (candidates, blocker) in aa2a_blocking_candidates.items():
    result = evaluator.evaluate_blocking(
        candidate_pairs=candidates,
        blocker=blocker,
        test_pairs=aa2a_correspondences,
        out_dir=OUTPUT_DIR / "blocking-evaluation"
    )
    result['method'] = method_name
    result['dataset'] = 'aa2a'
    aa2a_results.append(result)

# Select best method for each dataset (highest pair_completeness, then highest reduction_ratio)
a2g_best = max(a2g_results, key=lambda x: (x['pair_completeness'], x['reduction_ratio']))
aa2a_best = max(aa2a_results, key=lambda x: (x['pair_completeness'], x['reduction_ratio']))

print(f"Best blocking for a2g: {a2g_best['method']} (PC: {a2g_best['pair_completeness']:.3f}, RR: {a2g_best['reduction_ratio']:.3f})")
print(f"Best blocking for aa2a: {aa2a_best['method']} (PC: {aa2a_best['pair_completeness']:.3f}, RR: {aa2a_best['reduction_ratio']:.3f})")

[INFO ] root -   Pair Completeness: 0.385
[INFO ] root -   Pair Quality:      0.014
[INFO ] root -   Reduction Ratio:   0.998
[INFO ] root -   True Matches Found: 10/26
[INFO ] root - Blocking evaluation complete: Completeness=0.3846 Quality=0.0136 Reduction=0.9979
[INFO ] root -   Pair Completeness: 0.385
[INFO ] root -   Pair Quality:      0.004
[INFO ] root -   Reduction Ratio:   0.993
[INFO ] root -   True Matches Found: 10/26
[INFO ] root - Blocking evaluation complete: Completeness=0.3846 Quality=0.0042 Reduction=0.9932
[INFO ] root -   Pair Completeness: 0.846
[INFO ] root -   Pair Quality:      0.051
[INFO ] root -   Reduction Ratio:   0.999
[INFO ] root -   True Matches Found: 22/26
[INFO ] root - Blocking evaluation complete: Completeness=0.8462 Quality=0.0510 Reduction=0.9988
[INFO ] root -   Pair Completeness: 1.000
[INFO ] root -   Pair Quality:      0.017
[INFO ] root -   Reduction Ratio:   0.996
[INFO ] root -   True Matches Found: 26/26
[INFO ] root - Blocking evaluatio

Best blocking for a2g: EmbeddingBlocking (PC: 1.000, RR: 0.996)
Best blocking for aa2a: EmbeddingBlocking (PC: 0.894, RR: 0.956)
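
To make the three blocking metrics in the log above concrete, here is a minimal sketch of how they are commonly defined. The helper `blocking_metrics` and the toy IDs are hypothetical and only illustrate the formulas; PyDI's `evaluate_blocking` may compute them differently in detail.

```python
def blocking_metrics(candidates, true_matches, n_left, n_right):
    """Compute pair completeness, pair quality and reduction ratio.

    candidates and true_matches are collections of (left_id, right_id) tuples.
    """
    candidates = set(candidates)
    true_matches = set(true_matches)
    found = candidates & true_matches      # true matches that survived blocking
    total_pairs = n_left * n_right         # size of the full cross product
    return {
        # share of true matches that are still among the candidates
        "pair_completeness": len(found) / len(true_matches) if true_matches else 0.0,
        # share of candidates that are true matches
        "pair_quality": len(found) / len(candidates) if candidates else 0.0,
        # share of the cross product that blocking removed
        "reduction_ratio": 1.0 - len(candidates) / total_pairs if total_pairs else 0.0,
    }

# Toy example: 3 x 4 records, 2 true matches, 3 candidate pairs
toy_candidates = {("a1", "b1"), ("a2", "b3"), ("a3", "b4")}
toy_truth = {("a1", "b1"), ("a2", "b2")}
toy = blocking_metrics(toy_candidates, toy_truth, n_left=3, n_right=4)
print(toy)
```

The selection in the cell above sorts by the tuple `(pair_completeness, reduction_ratio)`, so reduction ratio only acts as a tie-breaker: a blocker that keeps more true matches always wins, even if it produces more candidates.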


### Step 2: Entity Matching with Comparators

Now we'll use PyDI's rule-based matcher, which combines several attribute comparisons into a weighted linear score, to find records that describe the same movie.

First, we define some comparators for attributes relevant to matching:

In [20]:
from PyDI.entitymatching import StringComparator, DateComparator, NumericComparator

# Create comparators for different attributes
comparators = [
    # Title similarity - most important for movies
    StringComparator(
        column='title',
        similarity_function='jaro_winkler',  # Good for movie titles
        preprocess=str.lower  # Case normalization
    ),
    
    # Date proximity - movies from same year likely same film
    DateComparator(
        column='date', 
        max_days_difference=365  # Allow 1 year difference
    ),
    
    # Actor name similarity - supporting evidence
    StringComparator(
        column='actor_name',
        similarity_function='cosine',  # Good for names
        preprocess=str.lower
    )
]

# Define attribute weights
weights = [0.6, 0.25, 0.15]  # Title most important, then date, then actor
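
The following sketch shows the idea behind a weighted linear matching rule: each comparator yields a similarity in [0, 1], the similarities are combined with the weights, and the result is compared to a threshold. All helpers (`string_sim`, `date_sim`, `linear_score`) and the toy records are hypothetical. `difflib` stands in for the `jaro_winkler` and `cosine` functions used above, and the linear date decay is an assumption, not necessarily how `DateComparator` scores.

```python
from datetime import date
from difflib import SequenceMatcher

def string_sim(a, b):
    # Stand-in string similarity in [0, 1] (not jaro_winkler or cosine)
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def date_sim(d1, d2, max_days_difference=365):
    # Assumed linear decay: 1.0 for identical dates, 0.0 at or beyond the limit
    days = abs((d1 - d2).days)
    return max(0.0, 1.0 - days / max_days_difference)

def linear_score(left, right, weights=(0.6, 0.25, 0.15)):
    # One similarity per attribute, in the same order as the weights
    sims = (
        string_sim(left["title"], right["title"]),
        date_sim(left["date"], right["date"]),
        string_sim(left["actor_name"], right["actor_name"]),
    )
    return sum(w * s for w, s in zip(weights, sims))

# Made-up records for illustration only
left = {"title": "Moonrise", "date": date(2001, 5, 1), "actor_name": "Jane Doe"}
same = {"title": "moonrise", "date": date(2001, 5, 1), "actor_name": "Jane Doe"}
other = {"title": "Attack", "date": date(1990, 1, 1), "actor_name": "John Roe"}

score_same = linear_score(left, same)
score_diff = linear_score(left, other)
print(score_same >= 0.7, score_diff >= 0.7)
```

Because the weights sum to 1.0, the combined score also stays in [0, 1], which makes a threshold such as 0.7 easy to interpret.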

Next, we set up the matcher and run the matching with the best blocking method we chose:

In [21]:
from PyDI.entitymatching import RuleBasedMatcher

# Initialize the blocker
embedding_blocker_a2g = EmbeddingBlocking(
    actors, golden_globes,
    text_cols=['title'],
    model="sentence-transformers/all-MiniLM-L6-v2",
    index_backend="sklearn",
    top_k=10,          # Top 10 most similar
    batch_size=500
)

# Initialize Rule-Based Matcher
matcher = RuleBasedMatcher()

correspondences_a2g = matcher.match(
    df_left=actors,
    df_right=golden_globes, 
    candidates=embedding_blocker_a2g, # pass the blocker, which will internally generate candidate pairs using batching
    comparators=comparators,
    weights=weights,
    threshold=0.7 # set a similarity threshold for a match
)

[INFO ] PyDI.entitymatching.blocking.embedding.EmbeddingBlocking - Initialized EmbeddingBlocking with sklearn backend, top_k=10, threshold=0.3
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Starting Identity Resolution
[INFO ] PyDI.entitymatching.rule_based.RuleBasedMatcher - Blocking 151 x 2286 elements
[DEBUG] PyDI.entitymatching.blocking.embedding.EmbeddingBlocking - Computing embeddings for datasets...
[DEBUG] PyDI.entitymatching.blocking.embedding.EmbeddingBlocking - Creating embeddings for dataset1: 151 records
[INFO ] sentence_transformers.SentenceTransformer - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
[DEBUG] urllib3.connectionpool - Resetting dropped connection: huggingface.co
[DEBUG] urllib3.connectionpool - https://huggingface.co:443 "HEAD /sentence-transformers/all-MiniLM-L6-v2/resolve/main/modules.json HTTP/1.1" 307 0
[DEBUG] urllib3.connectionpool - https://huggingface.co:443 "HEAD /api/resolve-cache/models/sentence-transformer

In [18]:
print("=== Evaluation Against Ground Truth ===")
print("Loading Winter framework's ground truth correspondences...\n")

# Load ground truth correspondences
gt_train = load_csv(
    DATA_DIR / "entitymatching" / "splits" / "academy_awards_2_actors_training.csv",
    name="ground_truth_train",
    header=None,
    names=['id1', 'id2', 'label'],
    add_index=False
)

gt_test = load_csv(
    DATA_DIR / "entitymatching" / "splits" / "academy_awards_2_actors_test.csv", 
    name="ground_truth_test",
    header=None,
    names=['id1', 'id2', 'label'],
    add_index=False
)

print(f"Training ground truth: {len(gt_train):,} pairs")
print(f"Test ground truth: {len(gt_test):,} pairs")

# Analyze label distribution
for name, gt in [('Training', gt_train), ('Test', gt_test)]:
    true_matches = (gt['label'] == 'TRUE').sum() if 'TRUE' in gt['label'].values else (gt['label'] == True).sum()
    total = len(gt)
    print(f"{name} set: {true_matches:,} positive matches out of {total:,} pairs ({true_matches/total*100:.1f}%)")

print(f"\n🎯 We'll evaluate against the test set ({len(gt_test):,} pairs)")

=== Evaluation Against Ground Truth ===
Loading Winter framework's ground truth correspondences...

Training ground truth: 335 pairs
Test ground truth: 3,347 pairs
Training set: 103 positive matches out of 335 pairs (30.7%)
Test set: 47 positive matches out of 3,347 pairs (1.4%)

🎯 We'll evaluate against the test set (3,347 pairs)


In [19]:
# Perform evaluation using PyDI's EntityMatchingEvaluator
print("\n=== Entity Matching Evaluation Results ===")

# Use the new evaluate_matching method for cleaner evaluation
eval_results = EntityMatchingEvaluator.evaluate_matching(
    correspondences=matches,
    test_pairs=gt_test,
    out_dir=str(OUTPUT_DIR)
)

display(eval_results)


=== Entity Matching Evaluation Results ===
Performance Metrics:
  Accuracy:  0.976
  Precision: 0.342
  Recall:    0.830
  F1-Score:  0.484
Confusion Matrix:
  True Positives:  39
  True Negatives:  3299
  False Positives: 75
  False Negatives: 8


{'precision': 0.34210526315789475,
 'recall': 0.8297872340425532,
 'f1': 0.484472049689441,
 'accuracy': 0.9757380882782812,
 'true_positives': 39,
 'false_positives': 75,
 'false_negatives': 8,
 'true_negatives': 3299,
 'threshold_used': 0.0,
 'total_correspondences': 114,
 'filtered_correspondences': 114,
 'evaluation_timestamp': '2025-09-09T15:12:55.279738',
 'output_files': ['c:\\Users\\Ralph\\dev\\pydi\\output\\tutorial\\matching_evaluation_summary.json',
  'c:\\Users\\Ralph\\dev\\pydi\\output\\tutorial\\matching_detailed_results.csv']}
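
As a sanity check, the reported metrics can be recomputed from the confusion matrix counts alone. This is a small sketch using the standard definitions; the counts are copied from the output above.

```python
# Confusion matrix counts as reported by the evaluator above
tp, fp, fn, tn = 39, 75, 8, 3299

precision = tp / (tp + fp)   # how many predicted matches are correct
recall = tp / (tp + fn)      # how many true matches were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

Note how accuracy is dominated by the many true negatives: it stays close to 1.0 even though two out of three predicted matches are wrong. For imbalanced matching tasks like this one, precision, recall, and F1 are the more informative numbers.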

In [20]:
# Re-run the matcher with debug mode enabled to get detailed debug data
print("🔍 Re-running matcher with debug mode to capture detailed results:")

# Use the same candidates and settings from before
candidates_df = pd.DataFrame(best_candidates)
print(f"  Using {len(candidates_df)} actual candidate pairs from {best_method} blocking")

# Re-run matching with debug enabled to capture detailed comparator results
start_time = time.time()

# Enable debug mode in the matcher to capture detailed results
matches, debug_info = matcher.match(
    df_left=left_df,
    df_right=right_df, 
    candidates=[candidates_df],
    comparators=comparators,
    weights=weights,
    threshold=0.7,
    debug=True  # This enables debug output capture
)

matching_time = time.time() - start_time
print(f"  Found {len(matches)} matches in {matching_time:.3f} seconds with debug enabled")

debug_output_dir = OUTPUT_DIR / "debug_results"
debug_output_dir.mkdir(parents=True, exist_ok=True)

# Call the write_debug_results function with actual results
full_debug_path, short_debug_path = EntityMatchingEvaluator.write_debug_results(
    correspondences=matches,
    debug_results=debug_info,
    out_dir=str(debug_output_dir),
    matcher_instance=matcher
)

print(f"  ✅ Full debug results: {Path(full_debug_path).name}")
print(f"  ✅ Short debug results: {Path(short_debug_path).name}")

print(f"📁 Debug files saved to: {debug_output_dir}")

🔍 Re-running matcher with debug mode to capture detailed results:
  Using 1030 actual candidate pairs from Embedding blocking
  Found 114 matches in 0.514 seconds with debug enabled
  ✅ Full debug results: debugResultsMatchingRule.csv
  ✅ Short debug results: debugResultsMatchingRule.csv_short
📁 Debug files saved to: c:\Users\Ralph\dev\pydi\output\tutorial\debug_results


In [21]:
print("=== Demonstrating Cluster Size Distribution Analysis ===")
print("Analyzing cluster size distribution in our entity matching results...")

# Create cluster size distribution from our matches
cluster_distribution = EntityMatchingEvaluator.create_cluster_size_distribution(
    correspondences=matches,
    out_dir=str(OUTPUT_DIR / "cluster_analysis")
)

print(f"\n📊 Cluster Size Distribution Results:")
display(cluster_distribution)


=== Demonstrating Cluster Size Distribution Analysis ===
Analyzing cluster size distribution in our entity matching results...

📊 Cluster Size Distribution Results:


Unnamed: 0,cluster_size,frequency,percentage
0,2,110,98.214286
1,3,2,1.785714
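
A cluster size distribution like the one above can be derived by treating correspondences as edges of a graph and counting the sizes of its connected components. The sketch below does this with a small union-find structure; `cluster_size_distribution` and the toy pairs are hypothetical, and we only assume that PyDI builds clusters in a comparable way.

```python
from collections import Counter

def cluster_size_distribution(pairs):
    """Count connected-component sizes in a list of (id1, id2) correspondences."""
    parent = {}

    def find(x):
        # Return the representative of x's cluster, registering x if unseen
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    # Number of records per cluster, then number of clusters per size
    component_sizes = Counter(find(node) for node in list(parent))
    return Counter(component_sizes.values())

toy_pairs = [("a1", "b1"), ("a2", "b2"), ("a2", "b3"), ("a4", "b4")]
distribution = cluster_size_distribution(toy_pairs)
print(dict(sorted(distribution.items())))
```

When matching two datasets that are each free of duplicates, clusters larger than 2 mean that one record was matched to several records on the other side. Such clusters are worth inspecting, since they often point to false positives or to duplicates within a source.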


In [22]:
# Write out detailed cluster information with all entity records for debugging purposes

# Use the matches we found earlier to demonstrate cluster details
cluster_details_path = OUTPUT_DIR / "cluster_analysis" / "detailed_cluster_info.json"
cluster_details_path.parent.mkdir(parents=True, exist_ok=True)

# Call write_cluster_details with our entity matches
output_path = EntityMatchingEvaluator.write_cluster_details(
    correspondences=matches,
    out_path=str(cluster_details_path)
)

### Step 4: Machine Learning-based Matching

In [23]:
print("=== ML-Based Matching with Similarity Features ===")
print("Demonstrating MLBasedMatcher with FeatureExtractor using GridSearchCV")
print("Training on gt_train and testing on gt_test\n")

# Convert string labels to numeric
gt_train['label'] = gt_train['label'].map({'TRUE': 1, 'FALSE': 0, True: 1, False: 0})
gt_test['label'] = gt_test['label'].map({'TRUE': 1, 'FALSE': 0, True: 1, False: 0})

# Create similarity-based FeatureExtractor 
print("\n🔧 Creating Similarity-Based FeatureExtractor...")

similarity_comparators = [
    # Title similarity features - most important for movie matching
    StringComparator("title", similarity_function="jaro_winkler", preprocess=str.lower),
    StringComparator("title", similarity_function="levenshtein", preprocess=str.lower),
    StringComparator("title", similarity_function="cosine", preprocess=str.lower),
    StringComparator("title", similarity_function="jaccard", preprocess=str.lower),
    
    # Date proximity features
    DateComparator("date", max_days_difference=730),  # 2 years tolerance
    
    # Actor name similarity
    StringComparator("actor_name", similarity_function="jaro_winkler", preprocess=str.lower),
    StringComparator("actor_name", similarity_function="cosine", preprocess=str.lower),
]

feature_extractor = FeatureExtractor(similarity_comparators)
print(f"✅ Created FeatureExtractor with {len(similarity_comparators)} similarity features")
print(f"Feature names: {feature_extractor.get_feature_names()}")

# Extract training features
print(f"\n⚙️ Extracting Features from Training Pairs...")

# Filter training pairs to ensure both records exist
valid_train_pairs = []
valid_train_labels = []

for _, row in gt_train.iterrows():
    id1, id2, label = row['id1'], row['id2'], row['label']
    if (id1 in left_df['_id'].values and id2 in right_df['_id'].values):
        valid_train_pairs.append({'id1': id1, 'id2': id2})
        valid_train_labels.append(label)

train_pairs_df = pd.DataFrame(valid_train_pairs)
train_labels_series = pd.Series(valid_train_labels)

print(f"Valid training pairs: {len(train_pairs_df)} out of {len(gt_train)}")

# Extract features using FeatureExtractor
train_features = feature_extractor.create_features(
    left_df, right_df, train_pairs_df, labels=train_labels_series
)

print(f"✅ Training features extracted: {train_features.shape}")
print(f"Feature columns: {[col for col in train_features.columns if col not in ['id1', 'id2', 'label']]}")

# Prepare data for ML training
feature_columns = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]

X_train = train_features[feature_columns]
y_train = train_features['label']

print(f"Training data: X={X_train.shape}, y={y_train.shape}")
print(f"Class distribution: {y_train.value_counts().to_dict()}")

=== ML-Based Matching with Similarity Features ===
Demonstrating MLBasedMatcher with FeatureExtractor using GridSearchCV
Training on gt_train and testing on gt_test


🔧 Creating Similarity-Based FeatureExtractor...
✅ Created FeatureExtractor with 7 similarity features
Feature names: ['StringComparator(title, jaro_winkler)', 'StringComparator(title, levenshtein)', 'StringComparator(title, cosine)', 'StringComparator(title, jaccard)', 'DateComparator(date)', 'StringComparator(actor_name, jaro_winkler)', 'StringComparator(actor_name, cosine)']

⚙️ Extracting Features from Training Pairs...
Valid training pairs: 335 out of 335
✅ Training features extracted: (335, 10)
Feature columns: ['StringComparator(title, jaro_winkler)', 'StringComparator(title, levenshtein)', 'StringComparator(title, cosine)', 'StringComparator(title, jaccard)', 'DateComparator(date)', 'StringComparator(actor_name, jaro_winkler)', 'StringComparator(actor_name, cosine)']
Training data: X=(335, 7), y=(335,)
Class distri

#### Full Scikit-learn integration

From here on, the full scikit-learn library can be used directly with the features produced by PyDI's feature extractor. No wrapping is needed, because everything in PyDI is based on pandas DataFrames.

In [24]:
# Set up GridSearchCV with multiple models and hyperparameters
print(f"\n🔍 Setting up GridSearchCV...")

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import make_scorer, f1_score

# Define models and parameter grids
param_grids = {
    'RandomForest': {
        'model': RandomForestClassifier(random_state=42),
        'params': {
            'n_estimators': [50, 100, 200],
            'max_depth': [5, 10, None],
            'min_samples_split': [2, 5],
            'class_weight': ['balanced', None]
        }
    },
    'LogisticRegression': {
        'model': LogisticRegression(random_state=42, max_iter=1000),
        'params': {
            'C': [0.1, 1.0, 10.0],
            'penalty': ['l2'],
            'class_weight': ['balanced', None]
        }
    },
    'GradientBoosting': {
        'model': GradientBoostingClassifier(random_state=42),
        'params': {
            'n_estimators': [50, 100],
            'learning_rate': [0.1, 0.2],
            'max_depth': [3, 5],
        }
    },
    'SVM': {
        'model': SVC(random_state=42, probability=True),
        'params': {
            'C': [0.1, 1.0, 10.0],
            'kernel': ['rbf', 'linear'],
            'class_weight': ['balanced', None]
        }
    }
}

# Use F1 score as the scoring metric (good for imbalanced data)
scorer = make_scorer(f1_score)
cv_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print(f"GridSearch setup: {len(param_grids)} models, F1 scoring, 5-fold CV")

# Train models using GridSearchCV
print(f"\n🚀 Training Models with GridSearchCV...")

grid_search_results = {}
best_overall_score = -1
best_overall_model = None
best_model_name = None

for model_name, config in param_grids.items():
    print(f"\nTraining {model_name}...")
    

    # Create GridSearchCV
    grid_search = GridSearchCV(
        estimator=config['model'],
        param_grid=config['params'],
        scoring=scorer,
        cv=cv_folds,
        n_jobs=-1,  # Use all available cores
        verbose=0
    )
    
    # Fit GridSearchCV
    grid_search.fit(X_train, y_train)
    
    # Store results
    grid_search_results[model_name] = {
        'grid_search': grid_search,
        'best_score': grid_search.best_score_,
        'best_params': grid_search.best_params_,
        'best_estimator': grid_search.best_estimator_
    }
    
    print(f"  ✅ {model_name}: Best CV F1 = {grid_search.best_score_:.4f}")
    print(f"     Best params: {grid_search.best_params_}")
    
    # Track overall best model
    if grid_search.best_score_ > best_overall_score:
        best_overall_score = grid_search.best_score_
        best_overall_model = grid_search.best_estimator_
        best_model_name = model_name
            
print(f"\n🏆 Best Overall Model: {best_model_name} (CV F1: {best_overall_score:.4f})")


🔍 Setting up GridSearchCV...
GridSearch setup: 4 models, F1 scoring, 5-fold CV

🚀 Training Models with GridSearchCV...

Training RandomForest...
  ✅ RandomForest: Best CV F1 = 0.9853
     Best params: {'class_weight': 'balanced', 'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 50}

Training LogisticRegression...
  ✅ LogisticRegression: Best CV F1 = 0.9953
     Best params: {'C': 1.0, 'class_weight': 'balanced', 'penalty': 'l2'}

Training GradientBoosting...
  ✅ GradientBoosting: Best CV F1 = 0.9953
     Best params: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}

Training SVM...
  ✅ SVM: Best CV F1 = 0.9953
     Best params: {'C': 0.1, 'class_weight': 'balanced', 'kernel': 'rbf'}

🏆 Best Overall Model: LogisticRegression (CV F1: 0.9953)
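
Three models report the same best CV F1 in the output above, so the "best overall" model is decided by iteration order. The sketch below replays the selection loop on the printed scores; it assumes the scores are exactly tied as displayed, which the 4-decimal rounding cannot confirm.

```python
# Best CV F1 per model, copied from the output above (rounded to 4 decimals)
cv_scores = {
    "RandomForest": 0.9853,
    "LogisticRegression": 0.9953,
    "GradientBoosting": 0.9953,
    "SVM": 0.9953,
}

best_name, best_score = None, -1.0
for name, score in cv_scores.items():
    if score > best_score:  # strict comparison: later ties do not replace the winner
        best_name, best_score = name, score

# All models that share the best score
tied = [name for name, score in cv_scores.items() if score == best_score]
print(best_name, tied)
```

With a strict `>` comparison, the first model to reach the top score wins. If the choice among tied models matters, a secondary criterion such as training time or model simplicity could be added.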


In [25]:
# Apply best trained model using MLBasedMatcher on test data
print(f"\n🎯 Testing Best Model on Test Set...")


# Prepare test pairs
valid_test_pairs = []
valid_test_labels = []

for _, row in gt_test.iterrows():
    id1, id2, label = row['id1'], row['id2'], row['label']
    if (id1 in left_df['_id'].values and id2 in right_df['_id'].values):
        valid_test_pairs.append({'id1': id1, 'id2': id2})
        valid_test_labels.append(label)

test_pairs_df = pd.DataFrame(valid_test_pairs)
test_labels_series = pd.Series(valid_test_labels)

print(f"Valid test pairs: {len(test_pairs_df)} out of {len(gt_test)}")


# Create MLBasedMatcher and apply trained model
ml_matcher = MLBasedMatcher(feature_extractor)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

matches = ml_matcher.match(
    left_df, right_df, [test_pairs_df], best_overall_model
)

# Show feature importance if available
if hasattr(best_overall_model, 'feature_importances_'):
    print(f"\n🔍 Top Feature Importances:")
    importance_df = ml_matcher.get_feature_importance(best_overall_model, feature_columns)
    display(importance_df.head(8))


🎯 Testing Best Model on Test Set...
Valid test pairs: 3347 out of 3347


Let's evaluate the ML-based matching with the evaluator:

In [26]:
# Perform evaluation using PyDI's EntityMatchingEvaluator
print("\n=== ML-based Entity Matching Evaluation Results ===")

# Use the new evaluate_matching method for cleaner evaluation
eval_results = EntityMatchingEvaluator.evaluate_matching(
    correspondences=matches,
    test_pairs=gt_test,
    out_dir=str(OUTPUT_DIR)
)

display(eval_results)

print("=== Cluster Size Distribution Analysis ===")

# Create cluster size distribution from our matches
cluster_distribution = EntityMatchingEvaluator.create_cluster_size_distribution(
    correspondences=matches,
    out_dir=str(OUTPUT_DIR / "cluster_analysis")
)

print(f"\n📊 Cluster Size Distribution Results:")
display(cluster_distribution)



=== ML-based Entity Matching Evaluation Results ===
Performance Metrics:
  Accuracy:  0.998
  Precision: 0.870
  Recall:    1.000
  F1-Score:  0.931
Confusion Matrix:
  True Positives:  47
  True Negatives:  3293
  False Positives: 7
  False Negatives: 0


{'precision': 0.8703703703703703,
 'recall': 1.0,
 'f1': 0.9306930693069307,
 'accuracy': 0.9979085748431431,
 'true_positives': 47,
 'false_positives': 7,
 'false_negatives': 0,
 'true_negatives': 3293,
 'threshold_used': 0.0,
 'total_correspondences': 54,
 'filtered_correspondences': 54,
 'evaluation_timestamp': '2025-09-09T15:13:01.824437',
 'output_files': ['c:\\Users\\Ralph\\dev\\pydi\\output\\tutorial\\matching_evaluation_summary.json',
  'c:\\Users\\Ralph\\dev\\pydi\\output\\tutorial\\matching_detailed_results.csv']}

=== Cluster Size Distribution Analysis ===

📊 Cluster Size Distribution Results:


Unnamed: 0,cluster_size,frequency,percentage
0,2,48,94.117647
1,3,3,5.882353


As an alternative to per-attribute similarity metrics, PyDI's VectorFeatureExtractor can create embedding-based features using SentenceTransformers:

In [27]:
# VectorFeatureExtractor Examples

from PyDI.entitymatching import VectorFeatureExtractor

# SentenceTransformers embeddings using VectorFeatureExtractor
st_extractor = VectorFeatureExtractor(
    embedding_model='sentence-transformers/all-MiniLM-L6-v2',
    columns=['title', 'actor_name', 'date'],
    distance_metrics=['cosine'],
    pooling_strategy='concatenate'
)

st_features = st_extractor.create_features(
    left_df, right_df, candidates_df
)
print(f"SentenceTransformer features shape: {st_features.shape}")

# Extract features using FeatureExtractor
train_features = feature_extractor.create_features(
    left_df, right_df, train_pairs_df, labels=train_labels_series
)

# ready to train ML models with scikit-learn as before


SentenceTransformer features shape: (1030, 3)
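
The `'cosine'` distance metric compares the direction of two embedding vectors. Here is a minimal, dependency-free sketch of cosine similarity on toy vectors. The function and the vectors are illustrative only; real sentence embeddings have hundreds of dimensions, and how `VectorFeatureExtractor` turns them into feature columns is not shown here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" standing in for model output
left_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]
orthogonal = [0.0, 1.0, 0.0]

sim_same = cosine_similarity(left_vec, same_direction)
sim_orth = cosine_similarity(left_vec, orthogonal)
print(round(sim_same, 6), round(sim_orth, 6))
```

Cosine similarity ignores vector length, so two texts with the same meaning but different lengths can still score close to 1.0.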


## Part 3: Data Fusion

In [28]:
print("📊 Fusion Input Datasets:")
for df, name in zip(datasets, names):
    print(f"  {name}: {len(df):,} records")

total_input_records = sum(len(df) for df in datasets)
print(f"  Total: {total_input_records:,} records")
print(f"\n🎯 Goal: Create single authoritative movie record per entity")

📊 Fusion Input Datasets:
  Academy Awards: 4,592 records
  Actors: 149 records
  Golden Globes: 2,286 records
  Total: 7,027 records

🎯 Goal: Create single authoritative movie record per entity


### Step 1: Loading Correspondence Files

Data fusion requires correspondence information to group records referring to the same entity. Let's load the pre-computed correspondences.

In [29]:
# Load pre-computed correspondences from the Winter framework
print("=== Loading Correspondences for Data Fusion ===")

CORR_DIR = ROOT / "input" / "movies" / "fusion" / "correspondences"

# Load correspondence files
academy_actors_corr = load_csv(
    CORR_DIR / "academy_awards_2_actors_correspondences.csv",
    name="academy_actors_correspondences",
    header=None,
    names=['id1', 'id2', 'score'],
    add_index=False
)

actors_globes_corr = load_csv(
    CORR_DIR / "actors_2_golden_globes_correspondences.csv", 
    name="actors_globes_correspondences",
    header=None,
    names=['id1', 'id2', 'score'],
    add_index=False
)

print(f"Academy Awards ↔ Actors correspondences: {len(academy_actors_corr):,}")
print(f"Actors ↔ Golden Globes correspondences: {len(actors_globes_corr):,}")

# Preview correspondence structure
print("\n📊 Correspondence Structure:")
print("Academy Awards ↔ Actors:")
display(academy_actors_corr.head())

print("Actors ↔ Golden Globes:")
display(actors_globes_corr.head())

=== Loading Correspondences for Data Fusion ===
Academy Awards ↔ Actors correspondences: 150
Actors ↔ Golden Globes correspondences: 107

📊 Correspondence Structure:
Academy Awards ↔ Actors:


Unnamed: 0,id1,id2,score
0,academy_awards_4557,actors_1,1.0
1,academy_awards_4529,actors_2,1.0
2,academy_awards_4500,actors_3,1.0
3,academy_awards_4475,actors_4,1.0
4,academy_awards_4446,actors_5,1.0


Actors ↔ Golden Globes:


Unnamed: 0,id1,id2,score
0,actors_16,golden_globes_2279,1.0
1,actors_22,golden_globes_2263,1.0
2,actors_23,golden_globes_2252,1.0
3,actors_24,golden_globes_2240,1.0
4,actors_25,golden_globes_2226,1.0


### Step 2: Running Fusion using correspondences to build record groups

In [30]:
# Combine all correspondences into a single DataFrame
# (Academy Awards ↔ Actors first, then Actors ↔ Golden Globes)
all_correspondences = pd.concat(
    [academy_actors_corr, actors_globes_corr],
    ignore_index=True
)[['id1', 'id2', 'score']]

print(f"Total correspondences: {len(all_correspondences):,}")

Total correspondences: 257


In [31]:
print("=== PyDI Data Fusion Framework Demonstration ===")

# Import additional fusion components needed
from PyDI.fusion import AttributeValueFuser

# Initialize the fusion strategy
fusion_strategy = DataFusionStrategy("movie_fusion")

# Title: Use longest string (often more complete)
fusion_strategy.add_attribute_fuser("title", AttributeValueFuser(longest_string))

# Date: Use most recent (latest data often more accurate)
fusion_strategy.add_attribute_fuser("date", AttributeValueFuser(most_recent))

# Actor name: Use most complete (non-null, longest)
fusion_strategy.add_attribute_fuser("actor_name", AttributeValueFuser(most_complete))

# Director name: Use longest string
fusion_strategy.add_attribute_fuser("director_name", AttributeValueFuser(longest_string))

# Awards: Union all award information
fusion_strategy.add_attribute_fuser("oscar", AttributeValueFuser(union))
fusion_strategy.add_attribute_fuser("globe", AttributeValueFuser(union))

print(f"\n✅ Strategy '{fusion_strategy.name}' configured with {len(fusion_strategy.get_registered_attributes())} rules")



=== PyDI Data Fusion Framework Demonstration ===

✅ Strategy 'movie_fusion' configured with 6 rules
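
To illustrate what conflict resolution functions do, the sketch below applies three simple resolvers to one group of matched records. The functions `pick_longest_string`, `pick_most_recent`, and `merge_union` are hypothetical stand-ins, not PyDI's `longest_string`, `most_recent`, or `union`. In particular, "most recent" is interpreted here as the latest date value, which is an assumption.

```python
from datetime import date

def pick_longest_string(values):
    # Ignore missing or empty values, then keep the longest string
    candidates = [v for v in values if v]
    return max(candidates, key=len) if candidates else None

def pick_most_recent(values):
    # Assumed interpretation: the latest non-missing date wins
    candidates = [v for v in values if v is not None]
    return max(candidates) if candidates else None

def merge_union(values):
    # Keep every distinct non-missing value, preserving first-seen order
    merged = []
    for v in values:
        if v is not None and v not in merged:
            merged.append(v)
    return merged

# One made-up record group: two source records describing the same movie
group = [
    {"title": "Moonrise", "date": date(2001, 5, 1), "oscar": "yes"},
    {"title": "Moonrise (Director's Cut)", "date": date(2001, 6, 1), "oscar": None},
]

fused = {
    "title": pick_longest_string([r["title"] for r in group]),
    "date": pick_most_recent([r["date"] for r in group]),
    "oscar": merge_union([r["oscar"] for r in group]),
}
print(fused)
```

Each attribute gets its own rule because the right way to resolve a conflict depends on the attribute: a longer title is often more complete, while award flags from different sources complement each other.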


In [32]:
# Create fusion engine with our strategy
fusion_engine = DataFusionEngine(fusion_strategy)

print(f"Input datasets: {len(datasets)}")
print(f"Input records: {total_input_records:,}")
print(f"Correspondences: {len(all_correspondences):,}")

# Execute fusion with timing
start_time = time.time()

fused_dataset, execution_time = fusion_engine.run(
    datasets=datasets,
    correspondences=all_correspondences, 
    id_column='id',  # Use original 'id' column for matching
    include_singletons=True  # Include unmatched records
)

total_time = time.time() - start_time

print(f"\n✅ Fusion Complete!")
print(f"  Total time: {total_time:.3f} seconds") 
print(f"  Output records: {len(fused_dataset):,}")
print(f"  Compression ratio: {len(fused_dataset)/total_input_records:.1%}")

Input datasets: 3
Input records: 7,027
Correspondences: 257

✅ Fusion Complete!
  Total time: 0.287 seconds
  Output records: 6,755
  Compression ratio: 96.1%
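
The "compression ratio" printed above is the output size divided by the input size, so a value near 100% means that few records were merged. A quick recomputation from the reported counts makes this explicit:

```python
input_records = 7027   # total records across the three datasets (from the output above)
output_records = 6755  # fused records, including singletons

ratio = output_records / input_records          # share of the input size that remains
merged_away = input_records - output_records    # records absorbed into fused entities

print(f"{ratio:.1%} of input size, {1 - ratio:.1%} reduction, {merged_away} records merged away")
```

This is expected here: `include_singletons=True` keeps every unmatched record, and the 257 correspondences only cover a small part of the 7,027 input records.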
