# PyDI Entity Matching with RuleBasedMatcher Example

This notebook demonstrates the comprehensive entity matching capabilities in PyDI using the RuleBasedMatcher.

What this shows:
- Load datasets with provenance tracking
- **Simple candidate generation**: Create candidate pairs for matching without full blocking
- **Rule-based entity matching**: Use multiple comparators to find duplicate records
- **Different similarity functions**: String, numeric, and date comparators
- **Evaluation**: Assess matching quality with precision, recall, and F1 scores
- **Threshold tuning**: Find optimal similarity thresholds
- **End-to-end workflow**: Complete entity matching pipeline

Run cells below in order. Adjust paths if running outside the repo root.

In [1]:
# PyDI imports
from PyDI.io import load_xml
from PyDI.entitymatching import (
    RuleBasedMatcher,
    StringComparator,
    NumericComparator,
    DateComparator,
    EntityMatchingEvaluator,
    ensure_record_ids
)

# Additional imports
import pandas as pd
import numpy as np
from pathlib import Path
import itertools
from datetime import datetime

def repo_root():
    """Return the repository root directory."""
    # For notebooks in PyDI/examples/, go up 2 levels to reach repo root
    if '__file__' in globals():
        return Path(__file__).parent.parent.parent
    else:
        # In Jupyter, find the pyproject.toml to locate repo root
        current = Path.cwd()
        while current != current.parent:
            if (current / 'pyproject.toml').exists():
                return current
            current = current.parent
        return Path.cwd()  # fallback

## Step 1: Load datasets with provenance

We'll use the movie datasets - Academy Awards and Actors data. These datasets contain movie information from different sources and are commonly used for entity matching research.

In [2]:
root = repo_root()
academy_path = root / "input" / "movies" / "entitymatching" / "data" / "academy_awards.xml"
actors_path = root / "input" / "movies" / "entitymatching" / "data" / "actors.xml"

print(f"Academy awards data: {academy_path}")
print(f"Actors data: {actors_path}")

# Load datasets using PyDI's provenance-aware XML loader
academy_df = load_xml(academy_path, name="academy_awards")
actors_df = load_xml(actors_path, name="actors")

print(f"\nAcademy Awards shape: {academy_df.shape}")
print(f"Academy Awards columns: {list(academy_df.columns)}")

print(f"\nActors shape: {actors_df.shape}")
print(f"Actors columns: {list(actors_df.columns)}")

Academy awards data: c:\Users\Ralph\dev\pydi\input\movies\entitymatching\data\academy_awards.xml
Actors data: c:\Users\Ralph\dev\pydi\input\movies\entitymatching\data\actors.xml

Academy Awards shape: (4592, 7)
Academy Awards columns: ['academy_awards_id', 'id', 'title', 'actor_name', 'date', 'director_name', 'oscar']

Actors shape: (149, 7)
Actors columns: ['actors_id', 'id', 'title', 'actor_name', 'actors_actor_birthday', 'actors_actor_birthplace', 'date']


In [3]:
# Preview the datasets to understand their structure
print("=== Academy Awards Dataset Sample ===")
display(academy_df.head(3))

print("\n=== Actors Dataset Sample ===") 
display(actors_df.head(3))

=== Academy Awards Dataset Sample ===


| | academy_awards_id | id | title | actor_name | date | director_name | oscar |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | academy_awards-0000 | academy_awards_1 | Biutiful | Javier Bardem | 2010-01-01 | | |
| 1 | academy_awards-0001 | academy_awards_2 | True Grit | Jeff Bridges | 2010-01-01 | Joel Coen | |
| 2 | academy_awards-0002 | academy_awards_2 | True Grit | Jeff Bridges | 2010-01-01 | Ethan Coen | |



=== Actors Dataset Sample ===


| | actors_id | id | title | actor_name | actors_actor_birthday | actors_actor_birthplace | date |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | actors-0000 | actors_1 | 7th Heaven | Janet Gaynor | 1906-01-01 | Pennsylvania | 1929-01-01 |
| 1 | actors-0001 | actors_2 | Coquette | Mary Pickford | 1892-01-01 | Canada | 1930-01-01 |
| 2 | actors-0002 | actors_3 | The Divorcee | Norma Shearer | 1902-01-01 | Canada | 1931-01-01 |


In [4]:
# Ensure datasets have record IDs for entity matching
academy_df = ensure_record_ids(academy_df)
actors_df = ensure_record_ids(actors_df)

print(f"Academy Awards dataset now has {len(academy_df)} records with _id column")
print(f"Actors dataset now has {len(actors_df)} records with _id column")

print(f"\nSample Academy Awards IDs: {academy_df['_id'].head(3).tolist()}")
print(f"Sample Actors IDs: {actors_df['_id'].head(3).tolist()}")

Academy Awards dataset now has 4592 records with _id column
Actors dataset now has 149 records with _id column

Sample Academy Awards IDs: ['academy_awards_000000', 'academy_awards_000001', 'academy_awards_000002']
Sample Actors IDs: ['actors_000000', 'actors_000001', 'actors_000002']
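
The `_id` values above look like the dataset name followed by a zero-padded row position. As a rough illustration of that format only, the sketch below uses a hypothetical helper, `add_positional_ids`. It is not part of PyDI and may differ from what `ensure_record_ids` actually does.

```python
import pandas as pd


def add_positional_ids(df: pd.DataFrame, name: str, id_column: str = "_id") -> pd.DataFrame:
    """Hypothetical stand-in: build IDs as '<name>_<zero-padded row position>'."""
    if id_column in df.columns:
        return df  # keep existing IDs untouched
    out = df.copy()
    out[id_column] = [f"{name}_{i:06d}" for i in range(len(out))]
    return out


toy = pd.DataFrame({"title": ["7th Heaven", "Coquette"]})
toy = add_positional_ids(toy, "actors")
print(toy["_id"].tolist())
```

Because the IDs depend on row position, reordering or filtering a dataset before assigning them would change which record each ID points to.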


## Step 2: Data Exploration and Understanding

Let's explore the datasets to understand what attributes we can use for matching and their data quality.

In [5]:
# Analyze data completeness and overlap
def analyze_dataset_quality(df, name):
    print(f"=== {name} Dataset Quality Analysis ===")
    print(f"Total records: {len(df)}")
    
    # Check key columns for completeness
    key_columns = ['title', 'actor_name', 'date', 'director_name']
    available_columns = [col for col in key_columns if col in df.columns]
    
    for col in available_columns:
        non_null = df[col].notna().sum()
        completeness = (non_null / len(df)) * 100
        print(f"  {col}: {non_null}/{len(df)} ({completeness:.1f}% complete)")
    
    # Show some sample values for key attributes
    if 'title' in df.columns:
        print(f"\nSample titles: {df['title'].dropna().head(5).tolist()}")
    
    if 'actor_name' in df.columns:
        print(f"Sample actors: {df['actor_name'].dropna().head(5).tolist()}")
        
    print()

analyze_dataset_quality(academy_df, "Academy Awards")
analyze_dataset_quality(actors_df, "Actors")

=== Academy Awards Dataset Quality Analysis ===
Total records: 4592
  title: 4580/4592 (99.7% complete)
  actor_name: 1057/4592 (23.0% complete)
  date: 4592/4592 (100.0% complete)
  director_name: 420/4592 (9.1% complete)

Sample titles: ['Biutiful', 'True Grit', 'True Grit', 'The Social Network', "The King's Speech"]
Sample actors: ['Javier Bardem', 'Jeff Bridges', 'Jeff Bridges', 'Jesse Eisenberg', 'Colin Firth']

=== Actors Dataset Quality Analysis ===
Total records: 149
  title: 149/149 (100.0% complete)
  actor_name: 149/149 (100.0% complete)
  date: 149/149 (100.0% complete)

Sample titles: ['7th Heaven', 'Coquette', 'The Divorcee', 'Min and Bill', 'The Sin of Madelon Claudet']
Sample actors: ['Janet Gaynor', 'Mary Pickford', 'Norma Shearer', 'Marie Dressler', 'Helen Hayes']



In [6]:
# Check date formats and ranges
def analyze_dates(df, name):
    if 'date' not in df.columns:
        print(f"{name}: No date column")
        return
        
    print(f"=== {name} Date Analysis ===")
    date_col = df['date'].dropna()
    if len(date_col) > 0:
        print(f"Date range: {date_col.min()} to {date_col.max()}")
        print(f"Sample dates: {date_col.head(3).tolist()}")
        
        # Try to parse dates to check format consistency
        try:
            parsed_dates = pd.to_datetime(date_col)
            print(f"Date parsing successful: {len(parsed_dates)} dates parsed")
        except Exception as e:
            print(f"Date parsing issues: {e}")
    print()

analyze_dates(academy_df, "Academy Awards")
analyze_dates(actors_df, "Actors")

=== Academy Awards Date Analysis ===
Date range: 1927-01-01 to 2010-01-01
Sample dates: ['2010-01-01', '2010-01-01', '2010-01-01']
Date parsing successful: 4592 dates parsed

=== Actors Date Analysis ===
Date range: 1929-01-01 to 2005-01-01
Sample dates: ['1929-01-01', '1930-01-01', '1931-01-01']
Date parsing successful: 149 dates parsed



## Step 3: Simple Candidate Generation

Since PyDI doesn't have blocking algorithms implemented yet, we'll create simple candidate generation strategies. In practice, you'd use more sophisticated blocking techniques to reduce the number of candidate pairs.

In [7]:
def create_sample_candidates(df_left, df_right, max_pairs=100, strategy="random"):
    """Create candidate pairs for entity matching.
    
    Parameters
    ----------
    df_left, df_right : pandas.DataFrame
        Source datasets with _id columns
    max_pairs : int
        Maximum number of candidate pairs to generate
    strategy : str
        Strategy for candidate generation: 'random', 'all', 'title_similarity'
    
    Returns
    -------
    pandas.DataFrame
        Candidate pairs with id1, id2 columns
    """
    left_ids = df_left['_id'].tolist()
    right_ids = df_right['_id'].tolist()
    
    if strategy == "all":
        # Cartesian product, generated lazily so only max_pairs tuples are ever built
        candidates = list(itertools.islice(itertools.product(left_ids, right_ids), max_pairs))
        
    elif strategy == "random":
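        # Random sampling with replacement; duplicates are dropped below,
        # so slightly fewer than max_pairs pairs may remain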
        np.random.seed(42)  # For reproducibility
        candidates = []
        for _ in range(min(max_pairs, len(left_ids) * len(right_ids))):
            left_id = np.random.choice(left_ids)
            right_id = np.random.choice(right_ids)
            candidates.append((left_id, right_id))
            
    elif strategy == "title_similarity":
        # Simple title-based blocking (first character match)
        candidates = []
        
        # Group by first character of title (simple blocking key)
        left_groups = df_left.groupby(df_left['title'].str[0].fillna(''))['_id'].apply(list).to_dict()
        right_groups = df_right.groupby(df_right['title'].str[0].fillna(''))['_id'].apply(list).to_dict()
        
        for key in left_groups:
            if key in right_groups:
                for left_id in left_groups[key]:
                    for right_id in right_groups[key]:
                        candidates.append((left_id, right_id))
                        if len(candidates) >= max_pairs:
                            break
                    if len(candidates) >= max_pairs:
                        break
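                if len(candidates) >= max_pairs:
                    break

    else:
        # Fail loudly instead of hitting a NameError on `candidates` below
        raise ValueError(f"Unknown strategy: {strategy!r}")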
    
    # Convert to DataFrame
    candidate_df = pd.DataFrame(candidates, columns=['id1', 'id2'])
    
    # Remove duplicates
    candidate_df = candidate_df.drop_duplicates()
    
    return candidate_df

# Create candidate pairs using different strategies
print("Generating candidate pairs...")

# Random sampling - good for initial exploration
random_candidates = create_sample_candidates(academy_df, actors_df, max_pairs=200, strategy="random")
print(f"Random candidates: {len(random_candidates)} pairs")

# Title-based simple blocking - more targeted
title_candidates = create_sample_candidates(academy_df, actors_df, max_pairs=500, strategy="title_similarity")
print(f"Title-based candidates: {len(title_candidates)} pairs")

print(f"\nSample candidate pairs:")
display(title_candidates.head())

Generating candidate pairs...
Random candidates: 200 pairs
Title-based candidates: 500 pairs

Sample candidate pairs:


| | id1 | id2 |
| --- | --- | --- |
| 0 | academy_awards_000393 | actors_000000 |
| 1 | academy_awards_002506 | actors_000000 |
| 2 | academy_awards_004567 | actors_000000 |
| 3 | academy_awards_000013 | actors_000023 |
| 4 | academy_awards_000013 | actors_000028 |
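
The first-character blocking in `create_sample_candidates` can also be expressed as a join on a blocking key, which avoids the nested loops. The sketch below is an alternative under stated assumptions, not PyDI functionality: `block_on_first_letter` and `block_key` are hypothetical names, the key is lower-cased, records with missing titles are dropped, and there is no `max_pairs` cap. For scale, the full Cartesian product of the two datasets would be 4592 x 149 = 684,208 pairs.

```python
import pandas as pd


def block_on_first_letter(df_left, df_right, column="title"):
    """Pair records whose `column` values share the same lower-cased first character."""

    def with_key(df):
        keyed = df[["_id", column]].dropna(subset=[column]).copy()
        keyed["block_key"] = keyed[column].str.strip().str[0].str.lower()
        return keyed.dropna(subset=["block_key"])  # empty strings yield no key

    # Inner join: only records that share a blocking key become candidates
    pairs = with_key(df_left).merge(
        with_key(df_right), on="block_key", suffixes=("_left", "_right")
    )
    return (
        pairs.rename(columns={"_id_left": "id1", "_id_right": "id2"})[["id1", "id2"]]
        .drop_duplicates()
        .reset_index(drop=True)
    )


left = pd.DataFrame({"_id": ["l0", "l1", "l2"], "title": ["Avatar", "ali", None]})
right = pd.DataFrame({"_id": ["r0", "r1"], "title": ["Anastasia", "Ben-Hur"]})
candidates = block_on_first_letter(left, right)
print(candidates)
```

A single-character key is a very coarse block. It keeps pairs such as 'Avatar' and 'Anastasia' and it misses true matches whose titles differ in the first character, so treat it as a demonstration rather than a production blocking scheme.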


## Step 4: RuleBasedMatcher with String Similarity

Let's start with simple string-based matching using movie titles. This is often the most discriminative attribute for movie data.

In [8]:
# Create string comparator for titles
title_comparator = StringComparator(
    column="title", 
    similarity_function="jaro_winkler",  # Good for names and titles
    preprocess=str.lower  # Normalize case
)

# Initialize rule-based matcher
matcher = RuleBasedMatcher()

# Perform title-only matching
title_matches = matcher.match(
    df_left=academy_df,
    df_right=actors_df,
    candidates=[title_candidates],  # Use title-based candidates
    comparators=[title_comparator],
    weights=[1.0],  # Single comparator, full weight
    threshold=0.7  # Require 70% similarity
)

print(f"Title-based matching found {len(title_matches)} matches above threshold 0.7")

if len(title_matches) > 0:
    print(f"\nTop matches by similarity score:")
    top_matches = title_matches.sort_values('score', ascending=False).head(10)
    
    # Show matches with actual titles for verification
    for _, match in top_matches.iterrows():
        id1, id2, score = match['id1'], match['id2'], match['score']
        
        # Get titles
        title1 = academy_df[academy_df['_id'] == id1]['title'].iloc[0]
        title2 = actors_df[actors_df['_id'] == id2]['title'].iloc[0]
        
        print(f"  Score {score:.3f}: '{title1}' <-> '{title2}'")
else:
    print("No matches found with current threshold. Try lowering the threshold.")

Title-based matching found 7 matches above threshold 0.7

Top matches by similarity score:
  Score 1.000: '7th Heaven' <-> '7th Heaven'
  Score 0.885: 'American Gangster' <-> 'American Beauty'
  Score 0.851: 'American Splendor' <-> 'American Beauty'
  Score 0.825: 'Alice in Wonderland' <-> 'Alice Doesn’t live Here Anymor'
  Score 0.790: 'Ali' <-> 'Alice Doesn’t live Here Anymor'
  Score 0.778: 'Ali' <-> 'All the King's Men'
  Score 0.733: 'Avatar' <-> 'Anastasia'


In [9]:
# Let's try different similarity functions and thresholds
similarity_functions = ["jaro_winkler", "levenshtein", "jaccard", "cosine"]
threshold = 0.5  # Lower threshold to see more results

print("Comparing different string similarity functions:")
print(f"Using threshold: {threshold}")
print()

for sim_func in similarity_functions:
    comparator = StringComparator(
        column="title", 
        similarity_function=sim_func,
        preprocess=str.lower
    )
    
    matches = matcher.match(
        df_left=academy_df,
        df_right=actors_df,
        candidates=[title_candidates[:100]],  # Limit for speed
        comparators=[comparator],
        threshold=threshold
    )
    
    print(f"{sim_func.upper():12}: {len(matches):3d} matches")
    
    if len(matches) > 0:
        best_match = matches.loc[matches['score'].idxmax()]
        id1, id2, score = best_match['id1'], best_match['id2'], best_match['score']
        title1 = academy_df[academy_df['_id'] == id1]['title'].iloc[0]
        title2 = actors_df[actors_df['_id'] == id2]['title'].iloc[0]
        print(f"{'':12}   Best: {score:.3f} - '{title1}' <-> '{title2}'")
    print()

Comparing different string similarity functions:
Using threshold: 0.5

JARO_WINKLER:  71 matches
               Best: 1.000 - '7th Heaven' <-> '7th Heaven'

LEVENSHTEIN :   1 matches
               Best: 1.000 - '7th Heaven' <-> '7th Heaven'

JACCARD     :   5 matches
               Best: 1.000 - '7th Heaven' <-> '7th Heaven'

COSINE      :  39 matches
               Best: 1.000 - '7th Heaven' <-> '7th Heaven'
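
The match counts differ so much because the measures reward different things. Jaro-Winkler favours shared prefixes, while edit-distance and token-overlap measures penalise any differing characters or words. The sketch below shows plain-Python versions of a normalised Levenshtein similarity and a word-level Jaccard similarity. These are textbook definitions written for illustration; PyDI's own implementations may tokenise or normalise differently.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale edit distance into [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def token_jaccard(a: str, b: str) -> float:
    """Share of distinct whitespace-separated words that both strings contain."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


for left, right in [("7th heaven", "7th heaven"), ("american gangster", "american beauty")]:
    print(left, "<->", right,
          round(levenshtein_similarity(left, right), 3),
          round(token_jaccard(left, right), 3))
```

For 'american gangster' and 'american beauty', the word-level Jaccard is 1/3 (one shared word out of three distinct words), which falls below the 0.5 threshold even though the shared prefix gives the pair a high Jaro-Winkler score.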



## Step 5: Multi-Attribute Matching

Now let's combine multiple attributes for more robust matching. We'll use title, date, and actor information where available.

In [10]:
# Create multiple comparators for different attributes
comparators = [
    StringComparator("title", similarity_function="jaro_winkler", preprocess=str.lower),
    DateComparator("date", max_days_difference=365),  # Allow 1 year difference
    StringComparator("actor_name", similarity_function="jaro_winkler", preprocess=str.lower)
]

# Different weight configurations to test
weight_configs = [
    ([0.6, 0.2, 0.2], "Title-focused"),
    ([0.4, 0.4, 0.2], "Title+Date balanced"),  
    ([0.33, 0.33, 0.34], "Equal weights"),
    ([0.5, 0.1, 0.4], "Title+Actor focused")
]

print("Multi-attribute matching with different weight configurations:")
print()

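# "Best" here simply means the most matches above the threshold, which is a rough
# proxy for quality; on ties the first configuration is kept
best_config = None
best_count = 0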

for weights, description in weight_configs:
    multi_matches = matcher.match(
        df_left=academy_df,
        df_right=actors_df,
        candidates=[title_candidates[:150]],  # Limit for performance
        comparators=comparators,
        weights=weights,
        threshold=0.6
    )
    
    print(f"{description:20}: {len(multi_matches):3d} matches (weights: {weights})")
    
    if len(multi_matches) > best_count:
        best_count = len(multi_matches)
        best_config = (weights, description, multi_matches)
        
    # Show best match for this configuration
    if len(multi_matches) > 0:
        best = multi_matches.loc[multi_matches['score'].idxmax()]
        print(f"{'':20}  Best score: {best['score']:.3f}")
    print()

if best_config:
    print(f"Best configuration: {best_config[1]} with {best_count} matches")
else:
    print("No configuration produced any matches.")

Multi-attribute matching with different weight configurations:

Title-focused       :   1 matches (weights: [0.6, 0.2, 0.2])
                      Best score: 0.800

Title+Date balanced :   1 matches (weights: [0.4, 0.4, 0.2])
                      Best score: 0.600

Equal weights       :   1 matches (weights: [0.33, 0.33, 0.34])
                      Best score: 0.670

Title+Actor focused :   1 matches (weights: [0.5, 0.1, 0.4])
                      Best score: 0.900

Best configuration: Title-focused with 1 matches
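
The four best scores above are consistent with a weighted average in which the title and actor similarities are 1.0 and the date similarity is 0.0. This is an inference from the printed output, not a statement about how `RuleBasedMatcher` is implemented. The per-attribute similarities in the sketch are assumed values, and `weighted_score` is a hypothetical helper.

```python
def weighted_score(similarities, weights):
    """Weighted average of per-attribute similarities."""
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(similarities, weights)) / total_weight


# Assumed similarities for the single match: title, date, actor_name
similarities = [1.0, 0.0, 1.0]

weight_configs = {
    "Title-focused": [0.6, 0.2, 0.2],
    "Title+Date balanced": [0.4, 0.4, 0.2],
    "Equal weights": [0.33, 0.33, 0.34],
    "Title+Actor focused": [0.5, 0.1, 0.4],
}

for description, weights in weight_configs.items():
    print(f"{description:20}: {weighted_score(similarities, weights):.3f}")
```

Under these assumptions the sketch reproduces 0.800, 0.600, 0.670, and 0.900. It also shows why the weight on the date comparator matters so much here: every bit of weight moved onto `date` is lost for this pair.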


In [11]:
# Use the best configuration for detailed analysis
if best_config:
    best_weights, best_desc, best_matches = best_config
    
    print(f"=== Detailed Analysis: {best_desc} Configuration ===")
    print(f"Weights: {best_weights}")
    print(f"Total matches: {len(best_matches)}")
    
    if len(best_matches) > 0:
        print(f"\nScore distribution:")
        print(f"  Mean: {best_matches['score'].mean():.3f}")
        print(f"  Std:  {best_matches['score'].std():.3f}")
        print(f"  Min:  {best_matches['score'].min():.3f}")
        print(f"  Max:  {best_matches['score'].max():.3f}")
        
        print(f"\nTop 5 matches:")
        top_5 = best_matches.sort_values('score', ascending=False).head(5)
        
        for i, (_, match) in enumerate(top_5.iterrows(), 1):
            id1, id2, score = match['id1'], match['id2'], match['score']
            
            # Get record details
            rec1 = academy_df[academy_df['_id'] == id1].iloc[0]
            rec2 = actors_df[actors_df['_id'] == id2].iloc[0]
            
            print(f"\n{i}. Score: {score:.3f}")
            print(f"   Academy: '{rec1.get('title', 'N/A')}' ({rec1.get('date', 'N/A')}) - {rec1.get('actor_name', 'N/A')}")
            print(f"   Actors:  '{rec2.get('title', 'N/A')}' ({rec2.get('date', 'N/A')}) - {rec2.get('actor_name', 'N/A')}")

=== Detailed Analysis: Title-focused Configuration ===
Weights: [0.6, 0.2, 0.2]
Total matches: 1

Score distribution:
  Mean: 0.800
  Std:  nan
  Min:  0.800
  Max:  0.800

Top 5 matches:

1. Score: 0.800
   Academy: '7th Heaven' (1927-01-01) - Janet Gaynor
   Actors:  '7th Heaven' (1929-01-01) - Janet Gaynor
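
The two records agree on title and actor but their dates are two years apart, which is more than the 365 days allowed by `max_days_difference`. One plausible way a date comparator could turn a day gap into a similarity is a linear decay that reaches 0 at the tolerance. The sketch below is an assumption for illustration only; `linear_date_similarity` is a hypothetical function and `DateComparator` may use a different formula.

```python
from datetime import date


def linear_date_similarity(d1: date, d2: date, max_days_difference: int) -> float:
    """1.0 for identical dates, falling linearly to 0.0 at the tolerance and beyond."""
    days_apart = abs((d1 - d2).days)
    if days_apart >= max_days_difference:
        return 0.0
    return 1.0 - days_apart / max_days_difference


academy_date = date(1927, 1, 1)
actors_date = date(1929, 1, 1)

print((actors_date - academy_date).days)
print(linear_date_similarity(academy_date, actors_date, max_days_difference=365))
print(linear_date_similarity(academy_date, actors_date, max_days_difference=730))
```

The gap is 731 days because 1928 is a leap year, so under this sketch even a 730-day tolerance would still score the pair at 0.0.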


## Step 6: Load Ground Truth and Evaluation

Now let's load the ground truth correspondences and evaluate our matching performance.

In [12]:
# Load ground truth correspondences
train_path = root / "input" / "movies" / "entitymatching" / "splits" / "gs_academy_awards_2_actors_training.csv"
test_path = root / "input" / "movies" / "entitymatching" / "splits" / "gs_academy_awards_2_actors_test.csv"

def load_correspondences(file_path):
    """Load correspondence file and convert to PyDI ID format."""
    if not file_path.exists():
        print(f"File not found: {file_path}")
        return pd.DataFrame()
    
    # Load raw correspondences
    corr = pd.read_csv(file_path, names=['id1', 'id2', 'label'])
    
    # Convert boolean labels to numeric
    corr['label'] = corr['label'].map({True: 1, 'TRUE': 1, False: 0, 'FALSE': 0})
    
    # Convert original XML IDs to PyDI format.
    # Original IDs like 'academy_awards_1' are mapped to 'academy_awards_000000'.
    # Caveat: this assumes one row per XML id, in file order. The preview above shows
    # two rows sharing id 'academy_awards_2', so verify the mapping for your data.
    def convert_id(original_id):
        if pd.isna(original_id):
            return original_id

        id_str = str(original_id)
        for prefix in ('academy_awards_', 'actors_'):
            if prefix in id_str:
                try:
                    num = int(id_str.split('_')[-1]) - 1  # Convert to 0-based index
                except ValueError:
                    return id_str  # Leave IDs without a numeric suffix unchanged
                return f"{prefix}{num:06d}"

        return id_str
    
    corr['id1'] = corr['id1'].apply(convert_id)
    corr['id2'] = corr['id2'].apply(convert_id)
    
    return corr

# Load training and test correspondences
train_corr = load_correspondences(train_path)
test_corr = load_correspondences(test_path)

print(f"Training correspondences: {len(train_corr)} pairs")
print(f"Test correspondences: {len(test_corr)} pairs")

if len(train_corr) > 0:
    print(f"\nTraining set label distribution:")
    print(train_corr['label'].value_counts())
    
    print(f"\nSample training correspondences:")
    display(train_corr.head())

if len(test_corr) > 0:
    print(f"\nTest set label distribution:")
    print(test_corr['label'].value_counts())

Training correspondences: 358 pairs
Test correspondences: 3347 pairs

Training set label distribution:
label
0    255
1    103
Name: count, dtype: int64

Sample training correspondences:


| | id1 | id2 | label |
| --- | --- | --- | --- |
| 0 | academy_awards_004556 | actors_000000 | 1 |
| 1 | academy_awards_004362 | actors_000006 | 1 |
| 2 | academy_awards_004319 | actors_000007 | 1 |
| 3 | academy_awards_004206 | actors_000009 | 1 |
| 4 | academy_awards_004145 | actors_000010 | 1 |



Test set label distribution:
label
0    3300
1      47
Name: count, dtype: int64


In [14]:
# Use training correspondences as candidates for evaluation
# This simulates having perfect recall from blocking
if len(train_corr) > 0:
    # Use training pairs as candidates
    train_candidates = train_corr[['id1', 'id2']].copy()
    
    print(f"Using {len(train_candidates)} training pairs as candidates")
    
    # Perform matching with our best configuration
    evaluation_matches = matcher.match(
        df_left=academy_df,
        df_right=actors_df,
        candidates=[train_candidates],
        comparators=comparators,
        weights=best_config[0] if best_config else [0.6, 0.2, 0.2],
        threshold=0.5  # Lower threshold for evaluation
    )
    
    print(f"Matching found {len(evaluation_matches)} matches above threshold 0.5")
    
    if len(evaluation_matches) > 0:
        # Evaluate against training ground truth
        evaluation_results = EntityMatchingEvaluator.evaluate(
            corr=evaluation_matches,
            test_pairs=train_corr
        )
        
        print(f"\n=== Evaluation Results ===")
        print(f"Precision: {evaluation_results['precision']:.3f}")
        print(f"Recall:    {evaluation_results['recall']:.3f}")
        print(f"F1 Score:  {evaluation_results['f1']:.3f}")
        print(f"\nCorrect matches: {evaluation_results['true_positives']}")
        print(f"False positives: {evaluation_results['false_positives']}")
        print(f"False negatives: {evaluation_results['false_negatives']}")
    else:
        print("No matches found for evaluation. Try lowering the threshold.")
else:
    print("No training correspondences available for evaluation.")



Using 358 training pairs as candidates
Matching found 40 matches above threshold 0.5

=== Evaluation Results ===
Precision: 0.750
Recall:    0.291
F1 Score:  0.420

Correct matches: 30
False positives: 10
False negatives: 73
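
The three metrics follow directly from the counts printed above. As a quick check, the sketch below recomputes them with the standard definitions; `precision_recall_f1` is a hypothetical helper, not part of `EntityMatchingEvaluator`.

```python
def precision_recall_f1(true_positives: int, false_positives: int, false_negatives: int):
    """Standard precision, recall, and F1 from confusion counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(
    true_positives=30, false_positives=10, false_negatives=73
)
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
```

Precision is 30/40 and recall is 30/103, so most of the reported matches are correct but over two thirds of the 103 true pairs in the training set are missed at this threshold.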


## Step 7: Threshold Analysis and Optimization

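Let's analyze how different thresholds affect our matching performance. The sweep below reuses the training pairs, so the threshold it selects should be confirmed on the held-out test pairs loaded in Step 6.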

In [15]:
# Perform threshold sweep analysis
if len(train_corr) > 0:
    print("=== Threshold Sweep Analysis ===")
    
    thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    results = []
    
    for threshold in thresholds:
        # Perform matching
        matches = matcher.match(
            df_left=academy_df,
            df_right=actors_df,
            candidates=[train_candidates],
            comparators=comparators,
            weights=best_config[0] if best_config else [0.6, 0.2, 0.2],
            threshold=threshold
        )
        
        if len(matches) > 0:
            # Evaluate
            eval_result = EntityMatchingEvaluator.evaluate(matches, train_corr)
            results.append({
                'threshold': threshold,
                'matches': len(matches),
                'precision': eval_result['precision'],
                'recall': eval_result['recall'],
                'f1': eval_result['f1']
            })
        else:
            results.append({
                'threshold': threshold,
                'matches': 0,
                'precision': 0.0,
                'recall': 0.0,
                'f1': 0.0
            })
    
    # Convert to DataFrame for analysis
    threshold_results = pd.DataFrame(results)
    
    print("\nThreshold Analysis Results:")
    display(threshold_results)
    
    # Find optimal threshold (best F1 score)
    if len(threshold_results) > 0:
        best_threshold_idx = threshold_results['f1'].idxmax()
        best_threshold_row = threshold_results.loc[best_threshold_idx]
        
        print(f"\n=== Optimal Threshold ===")
        print(f"Threshold: {best_threshold_row['threshold']}")
        print(f"F1 Score:  {best_threshold_row['f1']:.3f}")
        print(f"Precision: {best_threshold_row['precision']:.3f}")
        print(f"Recall:    {best_threshold_row['recall']:.3f}")
        print(f"Matches:   {int(best_threshold_row['matches'])}")

=== Threshold Sweep Analysis ===

Threshold Analysis Results:

| | threshold | matches | precision | recall | f1 |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.1 | 319 | 0.31348 | 0.970874 | 0.473934 |
| 1 | 0.2 | 316 | 0.316456 | 0.970874 | 0.477327 |
| 2 | 0.3 | 236 | 0.372881 | 0.854369 | 0.519174 |
| 3 | 0.4 | 115 | 0.582609 | 0.650485 | 0.614679 |
| 4 | 0.5 | 40 | 0.75 | 0.291262 | 0.41958 |
| 5 | 0.6 | 6 | 0.333333 | 0.019417 | 0.036697 |
| 6 | 0.7 | 0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.8 | 0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.9 | 0 | 0.0 | 0.0 | 0.0 |

=== Optimal Threshold ===
Threshold: 0.4
F1 Score:  0.615
Precision: 0.583
Recall:    0.650
Matches:   115
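
The sweep above re-runs the matcher once per threshold. Since a threshold only filters pairs by score, an alternative is to score every candidate once and then filter the scored table for each threshold. The sketch below shows the idea on made-up toy data. `sweep_thresholds` is a hypothetical helper, and it assumes a pair counts as a match when its score is greater than or equal to the threshold, which may not be the comparison PyDI uses.

```python
import pandas as pd


def sweep_thresholds(scored: pd.DataFrame, thresholds):
    """Evaluate each threshold on pairs that already carry a 'score' and a 0/1 'label'."""
    total_positives = int((scored["label"] == 1).sum())
    rows = []
    for threshold in thresholds:
        predicted = scored[scored["score"] >= threshold]
        tp = int((predicted["label"] == 1).sum())
        fp = len(predicted) - tp
        fn = total_positives - tp
        rows.append({
            "threshold": threshold,
            "matches": len(predicted),
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / total_positives if total_positives else 0.0,
            "f1": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
        })
    return pd.DataFrame(rows)


# Toy scored pairs (illustrative values, not taken from the movie datasets)
scored = pd.DataFrame({
    "score": [0.95, 0.80, 0.55, 0.45, 0.30, 0.10],
    "label": [1, 1, 0, 1, 0, 0],
})

table = sweep_thresholds(scored, [0.2, 0.4, 0.6, 0.8])
best = table.loc[table["f1"].idxmax()]
print(table)
print(f"Best threshold: {best['threshold']}")
```

With the scores computed once, a much finer threshold grid costs almost nothing, which makes it easier to see how sharply F1 changes around the optimum.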


## Step 8: Advanced Comparator Examples

Let's explore different types of comparators and their specific use cases.

In [16]:
# Dictionary format comparators with embedded weights
print("=== Advanced Comparator Configuration ===")

# Method 1: Dictionary format with embedded weights
dict_comparators = [
    {"comparator": StringComparator("title", "jaro_winkler", str.lower), "weight": 0.5},
    {"comparator": DateComparator("date", max_days_difference=730), "weight": 0.3},
    {"comparator": StringComparator("actor_name", "cosine", str.lower), "weight": 0.2}
]

dict_matches = matcher.match(
    df_left=academy_df,
    df_right=actors_df,
    candidates=[title_candidates[:100]],
    comparators=dict_comparators,
    # No weights parameter needed - they're embedded in comparators
    threshold=0.6
)

print(f"Dictionary format comparators: {len(dict_matches)} matches")

# Method 2: Custom comparator functions
def custom_title_comparator(record1, record2):
    """Custom comparator: word-level Jaccard similarity; missing titles score 0."""
    title1 = record1.get('title', '')
    title2 = record2.get('title', '')
    
    # Handle missing (None/NaN) or empty titles; NaN is truthy, so check it explicitly
    if pd.isna(title1) or pd.isna(title2) or not title1 or not title2:
        return 0.0
    
    # Simple word overlap approach
    words1 = set(str(title1).lower().split())
    words2 = set(str(title2).lower().split())
    
    if not words1 or not words2:
        return 0.0
    
    # Jaccard similarity
    intersection = len(words1 & words2)
    union = len(words1 | words2)
    
    return intersection / union if union > 0 else 0.0

def custom_year_comparator(record1, record2):
    """Custom year-based comparator with flexible tolerance."""
    try:
        date1 = pd.to_datetime(record1.get('date', ''))
        date2 = pd.to_datetime(record2.get('date', ''))
        
        year_diff = abs(date1.year - date2.year)
        
        # Exact year match = 1.0, 1 year diff = 0.8, 2 years = 0.6, 3-5 years = 0.4, otherwise 0.0
        if year_diff == 0:
            return 1.0
        elif year_diff == 1:
            return 0.8
        elif year_diff == 2:
            return 0.6
        elif year_diff <= 5:
            return 0.4
        else:
            return 0.0
            
    except (ValueError, TypeError, AttributeError):
        # Unparseable or missing dates count as no similarity
        return 0.0

# Use custom comparators
custom_matches = matcher.match(
    df_left=academy_df,
    df_right=actors_df,
    candidates=[title_candidates[:100]],
    comparators=[custom_title_comparator, custom_year_comparator],
    weights=[0.7, 0.3],
    threshold=0.5
)

print(f"Custom function comparators: {len(custom_matches)} matches")

# Show some examples of custom comparator results
if len(custom_matches) > 0:
    print(f"\nTop custom matches:")
    for _, match in custom_matches.sort_values('score', ascending=False).head(3).iterrows():
        id1, id2, score = match['id1'], match['id2'], match['score']
        rec1 = academy_df[academy_df['_id'] == id1].iloc[0]
        rec2 = actors_df[actors_df['_id'] == id2].iloc[0]
        print(f"  {score:.3f}: '{rec1.get('title')}' ({rec1.get('date')}) <-> '{rec2.get('title')}' ({rec2.get('date')})")

=== Advanced Comparator Configuration ===
Dictionary format comparators: 1 matches
Custom function comparators: 1 matches

Top custom matches:
  0.880: '7th Heaven' (1927-01-01) <-> '7th Heaven' (1929-01-01)


## Step 9: Complete End-to-End Workflow

Let's put everything together in a complete entity matching pipeline with output generation.

In [17]:
def complete_entity_matching_pipeline(df_left, df_right, ground_truth=None, output_dir=None):
    """Complete entity matching pipeline with evaluation and outputs."""
    
    print("=== Complete Entity Matching Pipeline ===")
    print(f"Left dataset: {df_left.attrs.get('dataset_name', 'unknown')} ({len(df_left)} records)")
    print(f"Right dataset: {df_right.attrs.get('dataset_name', 'unknown')} ({len(df_right)} records)")
    
    # Step 1: Candidate Generation
    print("\n1. Generating candidates...")
    candidates = create_sample_candidates(df_left, df_right, max_pairs=300, strategy="title_similarity")
    print(f"   Generated {len(candidates)} candidate pairs")
    
    # Step 2: Configure Comparators
    print("\n2. Setting up comparators...")
    comparators = [
        StringComparator("title", "jaro_winkler", str.lower),
        DateComparator("date", max_days_difference=365),
        StringComparator("actor_name", "jaro_winkler", str.lower)
    ]
    weights = [0.6, 0.25, 0.15]  # Title most important, then date, then actor
    print(f"   Using {len(comparators)} comparators with weights {weights}")
    
    # Step 3: Matching
    print("\n3. Performing entity matching...")
    matcher = RuleBasedMatcher()
    matches = matcher.match(
        df_left=df_left,
        df_right=df_right,
        candidates=[candidates],
        comparators=comparators,
        weights=weights,
        threshold=0.5
    )
    
    print(f"   Found {len(matches)} matches above threshold 0.5")
    
    # Step 4: Evaluation (if ground truth available)
    evaluation_results = None
    if ground_truth is not None and len(ground_truth) > 0 and len(matches) > 0:
        print("\n4. Evaluating results...")
        evaluation_results = EntityMatchingEvaluator.evaluate(matches, ground_truth)
        
        print(f"   Precision: {evaluation_results['precision']:.3f}")
        print(f"   Recall:    {evaluation_results['recall']:.3f}")
        print(f"   F1 Score:  {evaluation_results['f1']:.3f}")
    
    # Step 5: Output Generation
    if output_dir:
        print(f"\n5. Saving outputs to {output_dir}...")
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)
        
        # Save matches
        matches_file = output_path / "entity_matches.csv"
        matches.to_csv(matches_file, index=False)
        print(f"   Saved matches: {matches_file}")
        
        # Save detailed match information
        if len(matches) > 0:
            detailed_matches = []
            for _, match in matches.iterrows():
                id1, id2, score = match['id1'], match['id2'], match['score']
                rec1 = df_left[df_left['_id'] == id1].iloc[0]
                rec2 = df_right[df_right['_id'] == id2].iloc[0]
                
                detailed_matches.append({
                    'id1': id1,
                    'id2': id2,
                    'score': score,
                    'title1': rec1.get('title', ''),
                    'title2': rec2.get('title', ''),
                    'date1': rec1.get('date', ''),
                    'date2': rec2.get('date', ''),
                    'actor1': rec1.get('actor_name', ''),
                    'actor2': rec2.get('actor_name', '')
                })
            
            detailed_df = pd.DataFrame(detailed_matches)
            detailed_file = output_path / "detailed_matches.csv"
            detailed_df.to_csv(detailed_file, index=False)
            print(f"   Saved detailed matches: {detailed_file}")
        
        # Save evaluation results
        if evaluation_results:
            eval_file = output_path / "evaluation_results.json"
            import json
            with open(eval_file, 'w') as f:
                json.dump(evaluation_results, f, indent=2)
            print(f"   Saved evaluation: {eval_file}")
    
    return {
        'matches': matches,
        'candidates': candidates,
        'evaluation': evaluation_results,
        'comparators': comparators,
        'weights': weights
    }

# Run complete pipeline
output_dir = root / "output" / "examples" / "entitymatching"

pipeline_results = complete_entity_matching_pipeline(
    df_left=academy_df,
    df_right=actors_df,
    ground_truth=train_corr if len(train_corr) > 0 else None,
    output_dir=str(output_dir)
)

print("\n=== Pipeline Complete ===")
print(f"Check {output_dir} for outputs")

=== Complete Entity Matching Pipeline ===
Left dataset: academy_awards (4592 records)
Right dataset: actors (149 records)

1. Generating candidates...
   Generated 300 candidate pairs

2. Setting up comparators...
   Using 3 comparators with weights [0.6, 0.25, 0.15]

3. Performing entity matching...
   Found 2 matches above threshold 0.5

4. Evaluating results...
   Precision: 0.000
   Recall:    0.000
   F1 Score:  0.000

5. Saving outputs to c:\Users\Ralph\dev\pydi\output\examples\entitymatching...
   Saved matches: c:\Users\Ralph\dev\pydi\output\examples\entitymatching\entity_matches.csv
   Saved detailed matches: c:\Users\Ralph\dev\pydi\output\examples\entitymatching\detailed_matches.csv
   Saved evaluation: c:\Users\Ralph\dev\pydi\output\examples\entitymatching\evaluation_results.json

=== Pipeline Complete ===
Check c:\Users\Ralph\dev\pydi\output\examples\entitymatching for outputs


## Summary and Key Takeaways

This notebook demonstrated the complete entity matching workflow in PyDI using the RuleBasedMatcher:

### Key Features Demonstrated:

1. **Data Loading**: Provenance-aware loading of XML datasets with automatic ID generation
2. **Candidate Generation**: Simple candidate-pair generation (for example the `title_similarity` strategy) to reduce the comparison space without full blocking
3. **RuleBasedMatcher**: Weighted combination of multiple attribute comparators
4. **Multiple Comparator Types**:
   - StringComparator with different similarity functions (Jaro-Winkler, Levenshtein, etc.)
   - DateComparator with configurable tolerance
   - NumericComparator for numerical attributes
5. **Flexible Configuration**: Multiple ways to specify comparators and weights
6. **Evaluation**: Precision, recall, and F1 score calculation against ground truth
7. **Threshold Analysis**: Finding optimal similarity thresholds
8. **Output Generation**: Structured results saved to CSV and JSON files
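
To make the idea behind item 3 concrete, here is a minimal, library-independent sketch of a weighted combination of attribute similarities. It is not PyDI's implementation: `string_sim`, `year_sim`, and `weighted_score` are hypothetical helpers, `difflib.SequenceMatcher` stands in for Jaro-Winkler, a plain `year` field stands in for the date comparison, and the records are made-up toy data. Only the weights (0.6, 0.25, 0.15) are taken from the pipeline above.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1].

    Assumption: SequenceMatcher is a stand-in here, not Jaro-Winkler.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def year_sim(y1: int, y2: int, max_diff: int = 2) -> float:
    """Linear decay: 1.0 for equal years, 0.0 at max_diff years apart or more."""
    return max(0.0, 1.0 - abs(y1 - y2) / max_diff)


def weighted_score(rec1: dict, rec2: dict, weights=(0.6, 0.25, 0.15)) -> float:
    """Weighted average of per-attribute similarities (title, year, actor)."""
    sims = (
        string_sim(rec1["title"], rec2["title"]),
        year_sim(rec1["year"], rec2["year"]),
        string_sim(rec1["actor_name"], rec2["actor_name"]),
    )
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


# Toy records, invented for illustration only
left = {"title": "The Silent Harbor", "year": 1950, "actor_name": "Jane Doe"}
same = {"title": "the silent harbor", "year": 1950, "actor_name": "jane doe"}
other = {"title": "Blue Lantern Road", "year": 1961, "actor_name": "John Smith"}

score_same = weighted_score(left, same)
score_other = weighted_score(left, other)
print(f"same record:  {score_same:.3f}")
print(f"other record: {score_other:.3f}")
```

A pair is declared a match when its score reaches the chosen threshold, so the weights decide which attribute can carry a pair over that line on its own.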

### Best Practices:

- **Use domain knowledge** to weight attributes appropriately (for movies, the title is usually the most informative attribute)
- **Tune thresholds** based on precision/recall trade-offs for your use case
- **Combine multiple attributes** for more robust matching than single-attribute approaches
- **Evaluate systematically** using held-out ground truth data
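- **Generate good candidates** - blocking is crucial for scalability in real applications, and recall is capped by the candidate set: a true match that never becomes a candidate pair cannot be found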
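
The threshold-tuning advice can be sketched without any library support. The snippet below is an illustration under stated assumptions, not the evaluator used in this notebook: `prf` and `sweep_thresholds` are hypothetical helpers, and the scored pairs and ground truth are invented toy values.

```python
def prf(predicted: set, truth: set) -> tuple:
    """Precision, recall, and F1 for a set of predicted pairs."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def sweep_thresholds(scored_pairs, truth, thresholds):
    """Return one (threshold, precision, recall, f1) row per threshold."""
    rows = []
    for t in thresholds:
        # Keep every pair whose similarity score reaches the threshold
        predicted = {(a, b) for a, b, score in scored_pairs if score >= t}
        rows.append((t, *prf(predicted, truth)))
    return rows


# Toy scored pairs (id1, id2, score) and ground truth, invented for illustration
scored = [("a1", "b1", 0.92), ("a2", "b2", 0.71), ("a3", "b9", 0.64), ("a4", "b4", 0.40)]
truth = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}

results = sweep_thresholds(scored, truth, [0.3, 0.5, 0.7, 0.9])
for t, p, r, f1 in results:
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")

# Pick the threshold with the highest F1
best = max(results, key=lambda row: row[3])
print(f"best threshold by F1: {best[0]}")
```

Selecting by F1 is only one option. If false matches are costly, choose the lowest threshold that still meets a required precision instead.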

### Next Steps:

- Implement more sophisticated blocking algorithms (sorted neighborhood, LSH, etc.)
- Try machine learning-based approaches with the MLBasedMatcher
- Experiment with ensemble methods combining multiple matchers
- Apply to your own datasets with domain-specific comparators
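
As a starting point for the first item, here is a minimal sketch of sorted-neighborhood blocking in plain Python. It does not use or describe any PyDI blocker: `sorted_neighborhood` is a hypothetical helper, the records are toy data, and the lowercased title is just one possible sorting key.

```python
def sorted_neighborhood(left, right, key, window=3):
    """Generate cross-dataset candidate pairs with a sliding window.

    left, right: lists of (record_id, record_dict)
    key: function mapping a record_dict to a sortable blocking key
    window: number of consecutive sorted records compared with each other
    """
    tagged = [("L", rid, key(rec)) for rid, rec in left]
    tagged += [("R", rid, key(rec)) for rid, rec in right]
    tagged.sort(key=lambda item: item[2])

    pairs = set()
    for i, (side_i, id_i, _) in enumerate(tagged):
        # Compare each record with the next (window - 1) records in sort order
        for side_j, id_j, _ in tagged[i + 1 : i + window]:
            if side_i != side_j:  # only pair records from different datasets
                pairs.add((id_i, id_j) if side_i == "L" else (id_j, id_i))
    return pairs


# Toy records, invented for illustration only
left = [("a1", {"title": "Alpha"}), ("a2", {"title": "Zulu"})]
right = [
    ("b1", {"title": "alpha "}),
    ("b2", {"title": "Zulu!"}),
    ("b3", {"title": "Mike"}),
]

pairs = sorted_neighborhood(
    left, right, key=lambda rec: rec["title"].lower().strip(), window=2
)
print(sorted(pairs))
print(f"{len(pairs)} candidate pairs instead of {len(left) * len(right)}")
```

A larger window raises the chance that true matches end up as candidates, at the cost of more comparisons. The sorting key matters just as much, since records whose keys sort far apart are never compared.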