# PyDI Data Integration Tutorial

This tutorial demonstrates comprehensive data integration using PyDI. We'll work with movie datasets to showcase the complete data integration pipeline.

### What You'll Learn

1. **Data Loading & Profiling**: Load and analyze movie datasets with provenance tracking
2. **Identity Resolution**: 
   - Advanced blocking strategies (Standard, Sorted Neighbourhood, Token-based, Embedding-based)
   - Multi-attribute similarity matching with custom comparators
   - Machine learning-based entity matching
3. **Data Fusion**: 
   - Conflict resolution with custom fusion rules
   - Quality assessment against gold standards
   - Provenance tracking and trust management
4. **Advanced Techniques**: 
   - Semantic similarity with embeddings
   - Performance optimization and scalability
   - End-to-end pipeline integration

### Datasets

We'll use three movie datasets:
- **Academy Awards**: Movies with Oscar information (4,592 records)
- **Actors**: Movies with actor details (149 records) 
- **Golden Globes**: Movies with Golden Globe awards (2,286 records)

These datasets contain overlapping movie information but with different attributes, data quality issues, and conflicting values - perfect for demonstrating real-world data integration challenges.

### Setup the environment

In [1]:
# Install the PyDI package if not already installed
# First navigate to the root directory of the repository in your terminal, then run:
# !pip install -e .

In [2]:
# Core Python libraries
import pandas as pd
import numpy as np
from pathlib import Path
import logging
import time
import json
from datetime import datetime

# PyDI imports for data loading and profiling
from PyDI.io import load_xml, load_csv
from PyDI.profiling import DataProfiler

# PyDI imports for entity matching
from PyDI.entitymatching import (
    # Blocking strategies
    NoBlocking, StandardBlocking, SortedNeighbourhood, 
    TokenBlocking, EmbeddingBlocking,
    # Matchers
    RuleBasedMatcher, MLBasedMatcher,
    # Comparators
    StringComparator, DateComparator, NumericComparator,
    # Evaluation - NEW: Separate methods for blocking and matching evaluation
    EntityMatchingEvaluator,
    # Utilities
    ensure_record_ids
)

# PyDI imports for data fusion
from PyDI.fusion import (
    DataFusionEngine, DataFusionStrategy, DataFusionEvaluator,
    # Fusion rules
    longest_string, shortest_string, most_recent, earliest,
    average, median, maximum, minimum, most_complete,
    union, intersection, voting,
    # Convenient aliases
    LONGEST, SHORTEST, LATEST, EARLIEST, AVG, MAX, MIN, VOTE, UNION,
    # Analysis and reporting
    FusionReport, FusionQualityMetrics, ProvenanceTracker,
    build_record_groups_from_correspondences,
)

# Setup paths
def get_repo_root():
    """Get repository root directory."""
    current = Path.cwd()
    while current != current.parent:
        if (current / 'pyproject.toml').exists():
            return current
        current = current.parent
    return Path.cwd()

ROOT = get_repo_root()
OUTPUT_DIR = ROOT / "output" / "tutorial"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Check if embeddings are available
try:
    from sentence_transformers import SentenceTransformer
    use_embeddings = True
    print("🧠 Embedding models available")
except ImportError:
    use_embeddings = False
    print("⚠️  Embedding models not available (install sentence-transformers)")

print("PyDI Tutorial")
print(f"Repository root: {ROOT}")
print(f"Output directory: {OUTPUT_DIR}")
print("All systems ready! 🚀")

🧠 Embedding models available
PyDI Tutorial
Repository root: c:\Users\Ralph\dev\pydi
Output directory: c:\Users\Ralph\dev\pydi\output\tutorial
All systems ready! 🚀


## Part 1: Data Loading and Profiling

PyDI provides provenance-aware data loading that automatically tracks dataset metadata and adds unique identifiers. Let's load our movie datasets and understand their characteristics.

In [3]:
# Define dataset paths
DATA_DIR = ROOT / "input" / "movies"

print("=== Loading Movie Datasets ===")
print("PyDI provides provenance-aware loading with automatic ID generation.\n")

# Load Academy Awards dataset
academy_awards = load_xml(
    DATA_DIR / "entitymatching" / "data" / "academy_awards.xml",
    name="academy_awards",
    record_tag="movie",
    add_index=True,
    index_column_name="_id"
)

# Load Actors dataset  
actors = load_xml(
    DATA_DIR / "entitymatching" / "data" / "actors.xml",
    name="actors", 
    record_tag="movie",
    add_index=True,
    index_column_name="_id"
)

# Load Golden Globes dataset
golden_globes = load_xml(
    DATA_DIR / "fusion" / "data" / "golden_globes.xml",
    name="golden_globes",
    record_tag="movie", 
    add_index=True,
    index_column_name="_id"
)

# Display basic information
datasets = [academy_awards, actors, golden_globes]
names = ["Academy Awards", "Actors", "Golden Globes"]

for df, name in zip(datasets, names):
    print(f"{name}:")
    print(f"  Records: {len(df):,}")
    print(f"  Attributes: {len(df.columns)}")
    print(f"  Columns: {list(df.columns)}")
    print(f"  Dataset name: {df.attrs.get('dataset_name', 'unknown')}")
    print()

total_records = sum(len(df) for df in datasets)
print(f"Total records across all datasets: {total_records:,}")

=== Loading Movie Datasets ===
PyDI provides provenance-aware loading with automatic ID generation.

Academy Awards:
  Records: 4,592
  Attributes: 7
  Columns: ['_id', 'id', 'title', 'actor_name', 'date', 'director_name', 'oscar']
  Dataset name: academy_awards

Actors:
  Records: 149
  Attributes: 7
  Columns: ['_id', 'id', 'title', 'actor_name', 'actors_actor_birthday', 'actors_actor_birthplace', 'date']
  Dataset name: actors

Golden Globes:
  Records: 2,286
  Attributes: 7
  Columns: ['_id', 'id', 'title', 'actor_name', 'date', 'director_name', 'globe']
  Dataset name: golden_globes

Total records across all datasets: 7,027
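
To make the `_id` column and the `dataset_name` metadata in the output above more concrete, here is a minimal pandas sketch of the same idea. It is not PyDI's implementation: the helper `add_provenance` is hypothetical, and the four-digit zero padding is an assumption inferred from IDs such as `academy_awards-0000`.

```python
import pandas as pd

def add_provenance(df: pd.DataFrame, name: str, id_col: str = "_id") -> pd.DataFrame:
    """Hypothetical helper: tag a DataFrame with a dataset name and per-record IDs."""
    out = df.copy()
    # Build IDs of the form "<dataset>-0000", "<dataset>-0001", ...
    ids = [f"{name}-{i:04d}" for i in range(len(out))]
    out.insert(0, id_col, ids)
    # DataFrame.attrs is pandas' slot for dataset-level metadata
    out.attrs["dataset_name"] = name
    return out

toy = pd.DataFrame({"title": ["Biutiful", "True Grit"], "date": ["2010-01-01", "2010-01-01"]})
tagged = add_provenance(toy, "academy_awards")
print(tagged[["_id", "title"]])
print(tagged.attrs["dataset_name"])
```

Keeping the dataset name inside the ID means a record can still be traced to its source after several datasets have been combined.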


In [4]:
# Preview the data structure
print("=== Dataset Previews ===")

print("\n📽️ Academy Awards Dataset:")
display(academy_awards.head(3))

print("\n🎭 Actors Dataset:")
display(actors.head(3))

print("\n🏆 Golden Globes Dataset:")
display(golden_globes.head(3))

=== Dataset Previews ===

📽️ Academy Awards Dataset:

| | _id | id | title | actor_name | date | director_name | oscar |
|---|---|---|---|---|---|---|---|
| 0 | academy_awards-0000 | academy_awards_1 | Biutiful | Javier Bardem | 2010-01-01 | | |
| 1 | academy_awards-0001 | academy_awards_2 | True Grit | Jeff Bridges | 2010-01-01 | Joel Coen | |
| 2 | academy_awards-0002 | academy_awards_2 | True Grit | Jeff Bridges | 2010-01-01 | Ethan Coen | |

🎭 Actors Dataset:

| | _id | id | title | actor_name | actors_actor_birthday | actors_actor_birthplace | date |
|---|---|---|---|---|---|---|---|
| 0 | actors-0000 | actors_1 | 7th Heaven | Janet Gaynor | 1906-01-01 | Pennsylvania | 1929-01-01 |
| 1 | actors-0001 | actors_2 | Coquette | Mary Pickford | 1892-01-01 | Canada | 1930-01-01 |
| 2 | actors-0002 | actors_3 | The Divorcee | Norma Shearer | 1902-01-01 | Canada | 1931-01-01 |

🏆 Golden Globes Dataset:

| | _id | id | title | actor_name | date | director_name | globe |
|---|---|---|---|---|---|---|---|
| 0 | golden_globes-0000 | golden_globes_1 | Frankie and Alice | Halle Berry | 2011-01-01 | | |
| 1 | golden_globes-0001 | golden_globes_2 | Rabbit Hole | Nicole Kidman | 2011-01-01 | | |
| 2 | golden_globes-0002 | golden_globes_3 | Winter's Bone | Jennifer Lawrence | 2011-01-01 | | |


### Data Quality Analysis

Let's use PyDI's profiling capabilities to understand our data quality and identify the best attributes for matching.

### Basic Dataset Summary

First, let's use the DataProfiler's `summary()` method to get basic statistics for each dataset.

In [5]:
# Initialize the DataProfiler
profiler = DataProfiler()

print("=== Dataset Summary Statistics ===\n")

for df in datasets:
    # summary() prints the statistics and also returns them
    profile = profiler.summary(df)

# Only the last returned object (Golden Globes) is displayed here
display(profile)

=== Dataset Summary Statistics ===

academy_awards:
  Rows: 4,592
  Columns: 7
  Total nulls: 11,036
  Null percentage: 34.3%
  Null counts per column:
    title: 12 (0.3%)
    actor_name: 3,535 (77.0%)
    director_name: 4,172 (90.9%)
    oscar: 3,317 (72.2%)

actors:
  Rows: 149
  Columns: 7
  Total nulls: 0
  Null percentage: 0.0%

golden_globes:
  Rows: 2,286
  Columns: 7
  Total nulls: 3,681
  Null percentage: 23.0%
  Null counts per column:
    actor_name: 54 (2.4%)
    director_name: 1,966 (86.0%)
    globe: 1,661 (72.7%)



{'rows': 2286,
 'columns': 7,
 'nulls_total': 3681,
 'nulls_per_column': {'_id': 0,
  'id': 0,
  'title': 0,
  'actor_name': 54,
  'date': 0,
  'director_name': 1966,
  'globe': 1661},
 'dtypes': {'_id': 'string',
  'id': 'object',
  'title': 'object',
  'actor_name': 'object',
  'date': 'object',
  'director_name': 'object',
  'globe': 'object'}}
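
The reported null percentage is consistent with total nulls divided by the total number of cells: for Academy Awards, 11,036 / (4,592 × 7) ≈ 34.3%. The sketch below reproduces that calculation in plain pandas. It is an illustration of the arithmetic, not the `DataProfiler` source; the function name `null_summary` and the key `null_pct` are assumptions.

```python
import pandas as pd

def null_summary(df: pd.DataFrame) -> dict:
    """Compute row/column counts and null statistics for one DataFrame."""
    nulls_per_column = df.isna().sum()
    total_cells = df.shape[0] * df.shape[1]
    nulls_total = int(nulls_per_column.sum())
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "nulls_total": nulls_total,
        # Share of all cells that are null, as a percentage
        "null_pct": 100 * nulls_total / total_cells if total_cells else 0.0,
        "nulls_per_column": {c: int(n) for c, n in nulls_per_column.items()},
    }

toy = pd.DataFrame({"title": ["A", "B", None, "D"], "director_name": [None, None, "X", None]})
stats = null_summary(toy)
print(stats)
```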

### Attribute Coverage Analysis

Next, let's use the `analyze_coverage()` method to understand how attributes overlap across datasets.

In [6]:
# Analyze attribute coverage across all three datasets
print("=== Attribute Coverage Analysis ===\n")

coverage = profiler.analyze_coverage(
    datasets=datasets,
    include_samples=True,
    sample_count=3  # Show 3 sample values per attribute
)

print("📊 Attribute coverage across datasets:")
display(coverage)

# Identify attributes suitable for entity matching
# Note: the filter below also returns the identifier columns `_id` and `id`,
# which are present everywhere but carry no matching signal
print("\n🔗 Attributes suitable for entity matching:")
matching_attrs = coverage[coverage['datasets_with_attribute'] >= 2]['attribute'].tolist()
print(f"Available in 2+ datasets: {matching_attrs}")

=== Attribute Coverage Analysis ===

📊 Attribute coverage across datasets:

| | attribute | academy_awards_count | academy_awards_pct | academy_awards_coverage | academy_awards_samples | actors_count | actors_pct | actors_coverage | actors_samples | golden_globes_count | golden_globes_pct | golden_globes_coverage | golden_globes_samples | avg_coverage | max_coverage | datasets_with_attribute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | _id | 4592/4592 | 100.0% | 1.0 | ['academy_awards-0000', 'academy_awards-0001',... | 149/149 | 100.0% | 1.0 | ['actors-0000', 'actors-0001', 'actors-0002'] | 2286/2286 | 100.0% | 1.0 | ['golden_globes-0000', 'golden_globes-0001', '... | 1.0 | 1.0 | 3 |
| 1 | actor_name | 1057/4592 | 23.0% | 0.230183 | ['Javier Bardem', 'Jeff Bridges', 'Jeff Bridges'] | 149/149 | 100.0% | 1.0 | ['Janet Gaynor', 'Mary Pickford', 'Norma Shear... | 2232/2286 | 97.6% | 0.976378 | ['Halle Berry', 'Nicole Kidman', 'Jennifer Law... | 0.73552 | 1.0 | 3 |
| 2 | actors_actor_birthday | 0/0 | 0% | 0.0 | | 149/149 | 100.0% | 1.0 | ['1906-01-01', '1892-01-01', '1902-01-01'] | 0/0 | 0% | 0.0 | | 0.333333 | 1.0 | 1 |
| 3 | actors_actor_birthplace | 0/0 | 0% | 0.0 | | 149/149 | 100.0% | 1.0 | ['Pennsylvania', 'Canada', 'Canada'] | 0/0 | 0% | 0.0 | | 0.333333 | 1.0 | 1 |
| 4 | date | 4592/4592 | 100.0% | 1.0 | ['2010-01-01', '2010-01-01', '2010-01-01'] | 149/149 | 100.0% | 1.0 | ['1929-01-01', '1930-01-01', '1931-01-01'] | 2286/2286 | 100.0% | 1.0 | ['2011-01-01', '2011-01-01', '2011-01-01'] | 1.0 | 1.0 | 3 |
| 5 | director_name | 420/4592 | 9.1% | 0.091463 | ['Joel Coen', 'Ethan Coen', 'David Fincher'] | 0/0 | 0% | 0.0 | | 320/2286 | 14.0% | 0.139983 | ['Darren Aronofsky', 'David Fincher', 'Tom Hoo... | 0.077149 | 0.139983 | 2 |
| 6 | globe | 0/0 | 0% | 0.0 | | 0/0 | 0% | 0.0 | | 625/2286 | 27.3% | 0.273403 | ['yes', 'yes', 'yes'] | 0.091134 | 0.273403 | 1 |
| 7 | id | 4592/4592 | 100.0% | 1.0 | ['academy_awards_1', 'academy_awards_2', 'acad... | 149/149 | 100.0% | 1.0 | ['actors_1', 'actors_2', 'actors_3'] | 2286/2286 | 100.0% | 1.0 | ['golden_globes_1', 'golden_globes_2', 'golden... | 1.0 | 1.0 | 3 |
| 8 | oscar | 1275/4592 | 27.8% | 0.277657 | ['yes', 'yes', 'yes'] | 0/0 | 0% | 0.0 | | 0/0 | 0% | 0.0 | | 0.092552 | 0.277657 | 1 |
| 9 | title | 4580/4592 | 99.7% | 0.997387 | ['Biutiful', 'True Grit', 'True Grit'] | 149/149 | 100.0% | 1.0 | ['7th Heaven', 'Coquette', 'The Divorcee'] | 2286/2286 | 100.0% | 1.0 | ['Frankie and Alice', 'Rabbit Hole', "Winter's... | 0.999129 | 1.0 | 3 |

🔗 Attributes suitable for entity matching:
Available in 2+ datasets: ['_id', 'actor_name', 'date', 'director_name', 'id', 'title']
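
As a rough illustration of what a coverage table measures, the following sketch computes the non-null share of every attribute per dataset and counts how many datasets contain the column at all. The function `attribute_coverage` and its output column names are hypothetical and only loosely mirror the table above; `analyze_coverage()` may compute its figures differently.

```python
import pandas as pd

def attribute_coverage(datasets: dict) -> pd.DataFrame:
    """For each attribute, compute the non-null share per dataset."""
    attributes = sorted({col for df in datasets.values() for col in df.columns})
    rows = []
    for attr in attributes:
        row = {"attribute": attr}
        present = 0
        for name, df in datasets.items():
            if attr in df.columns and len(df) > 0:
                # Fraction of records with a non-null value for this attribute
                row[f"{name}_coverage"] = df[attr].notna().mean()
                present += 1
            else:
                row[f"{name}_coverage"] = 0.0
        row["datasets_with_attribute"] = present
        rows.append(row)
    return pd.DataFrame(rows)

a = pd.DataFrame({"title": ["X", None], "oscar": ["yes", None]})
b = pd.DataFrame({"title": ["X", "Y"], "globe": [None, "yes"]})
cov = attribute_coverage({"a": a, "b": b})
shared = cov.loc[cov["datasets_with_attribute"] >= 2, "attribute"].tolist()
print(cov)
print(shared)
```

When choosing match attributes, presence in two or more datasets is only a first filter; the per-dataset coverage matters as well, since a sparsely filled column such as `director_name` contributes little evidence.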


### Detailed Data Profiling

Now let's generate comprehensive HTML profiles for each dataset using the `profile()` method. These reports provide in-depth statistical analysis.

In [7]:
# Generate detailed HTML profiles for each dataset
print("=== Generating Detailed Dataset Profiles ===\n")

profile_dir = OUTPUT_DIR / "data_profiles"
profile_dir.mkdir(parents=True, exist_ok=True)

profile_paths = []

for df, name in zip(datasets, names):
    print(f"📊 Profiling {name}...")

    profile_path = profiler.profile(df, str(profile_dir))
    profile_paths.append(profile_path)
    print(f"  ✅ Profile saved: {profile_path}")

print(f"\n🎯 Generated {len(profile_paths)} detailed HTML reports")
print(f"📁 Location: {profile_dir}")
print("\n💡 Open these HTML files in your browser for interactive exploration:")
for path in profile_paths:
    print(f"  • {Path(path).name}")


=== Generating Detailed Dataset Profiles ===

📊 Profiling Academy Awards...



  ✅ Profile saved: c:\Users\Ralph\dev\pydi\output\tutorial\data_profiles\academy_awards_profile.html
📊 Profiling Actors...



  ✅ Profile saved: c:\Users\Ralph\dev\pydi\output\tutorial\data_profiles\actors_profile.html
📊 Profiling Golden Globes...



  ✅ Profile saved: c:\Users\Ralph\dev\pydi\output\tutorial\data_profiles\golden_globes_profile.html

🎯 Generated 3 detailed HTML reports
📁 Location: c:\Users\Ralph\dev\pydi\output\tutorial\data_profiles

💡 Open these HTML files in your browser for interactive exploration:
  • academy_awards_profile.html
  • actors_profile.html
  • golden_globes_profile.html


### Dataset Comparison

Finally, let's create a comparison report between two datasets that highlights their differences and similarities. The cell below calls Sweetviz's `compare()` function directly.

In [8]:
# Compare Academy Awards vs Golden Globes datasets
print("=== Dataset Comparison Analysis ===\n")

compare_dir = OUTPUT_DIR / "comparisons"
compare_dir.mkdir(parents=True, exist_ok=True)

print("🔍 Comparing Academy Awards vs Golden Globes datasets...")

# Call Sweetviz directly; each dataset is passed as a (DataFrame, label) pair
import sweetviz as sv
report = sv.compare((academy_awards, "Academy Awards"), (golden_globes, "Golden Globes"))
comparison_path = str(compare_dir / "academy_awards_vs_golden_globes_compare.html")
report.show_html(comparison_path)
print(f"✅ Comparison report saved: {comparison_path}")

print("\n🎯 Interactive comparison report generated")
print(f"📁 Location: {comparison_path}")
print("💡 Open in browser to explore:")
print("  • Attribute distributions")
print("  • Value frequency comparisons")
print("  • Missing data patterns")
print("  • Statistical differences")

=== Dataset Comparison Analysis ===

🔍 Comparing Academy Awards vs Golden Globes datasets...



Report c:\Users\Ralph\dev\pydi\output\tutorial\comparisons\academy_awards_vs_golden_globes_compare.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
✅ Comparison report saved: c:\Users\Ralph\dev\pydi\output\tutorial\comparisons\academy_awards_vs_golden_globes_compare.html

🎯 Interactive comparison report generated
📁 Location: c:\Users\Ralph\dev\pydi\output\tutorial\comparisons\academy_awards_vs_golden_globes_compare.html
💡 Open in browser to explore:
  • Attribute distributions
  • Value frequency comparisons
  • Missing data patterns
  • Statistical differences


## Part 2: Identity Resolution (Entity Matching)

Identity Resolution is the process of identifying records that refer to the same real-world entity. PyDI provides comprehensive blocking and matching capabilities.

### Step 1: Blocking Strategies

Blocking reduces the number of comparisons from O(n²) to a manageable subset. Let's explore different blocking strategies.

In [9]:
print("=== Identity Resolution: Blocking Strategies ===")
print("Blocking reduces comparisons from full Cartesian product to manageable candidates.\n")

# We'll focus on Academy Awards vs Actors for entity matching
left_df = academy_awards
right_df = actors

max_pairs = len(left_df) * len(right_df)
print(f"Without blocking: {max_pairs:,} comparisons required")
print("\n🎯 Goal: Reduce comparisons while maintaining high recall\n")

# Ensure datasets have proper IDs for matching
left_df = ensure_record_ids(left_df)
right_df = ensure_record_ids(right_df)

blocking_results = []

print("Testing different blocking strategies...")

=== Identity Resolution: Blocking Strategies ===
Blocking reduces comparisons from full Cartesian product to manageable candidates.

Without blocking: 684,208 comparisons required

🎯 Goal: Reduce comparisons while maintaining high recall

Testing different blocking strategies...


In [10]:
# 1. Standard Blocking - First 3 characters of title
print("\n1️⃣ Standard Blocking (First 3 Characters of Title)")

# Add title_prefix directly to the original dataframes
academy_awards['title_prefix'] = academy_awards['title'].astype(str).str[:3]
actors['title_prefix'] = actors['title'].astype(str).str[:3]

standard_blocker = StandardBlocking(
    academy_awards, actors,
    on=['title_prefix'],  # Block on first 3 characters of title
    batch_size=1000
)

start_time = time.time()
standard_candidates = []
for batch in standard_blocker:
    standard_candidates.extend(batch.to_dict('records'))
    
standard_time = time.time() - start_time
reduction_ratio = len(standard_candidates) / max_pairs

print(f"  Generated: {len(standard_candidates):,} candidates")
print(f"  Reduction: {(1-reduction_ratio)*100:.1f}% ({reduction_ratio:.4f} ratio)")
print(f"  Time: {standard_time:.3f} seconds")

blocking_results.append({
    'strategy': 'StandardBlocking',
    'candidates': len(standard_candidates),
    'reduction_ratio': reduction_ratio,
    'time_seconds': standard_time
})


1️⃣ Standard Blocking (First 3 Characters of Title)
  Generated: 34,457 candidates
  Reduction: 95.0% (0.0504 ratio)
  Time: 0.064 seconds
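
Conceptually, standard blocking is an equi-join on a blocking key: only records that share the key become candidate pairs. The toy sketch below shows this with a pandas merge on the first three title characters. It is not how `StandardBlocking` is implemented internally, and the data and column names are made up for illustration. The reduction figure follows the same arithmetic as the output above (34,457 / 684,208 ≈ 0.0504 of all pairs kept).

```python
import pandas as pd

left = pd.DataFrame({"_id": ["l0", "l1", "l2"], "title": ["True Grit", "Trumbo", "Biutiful"]})
right = pd.DataFrame({"_id": ["r0", "r1"], "title": ["True Lies", "Birdman"]})

# Blocking key: first three characters of the title
left["key"] = left["title"].astype(str).str[:3]
right["key"] = right["title"].astype(str).str[:3]

# An inner join on the key yields every cross-dataset pair that shares a block
pairs = left.merge(right, on="key", suffixes=("_l", "_r"))[["_id_l", "_id_r"]]
pairs.columns = ["id1", "id2"]

max_pairs = len(left) * len(right)
reduction = 1 - len(pairs) / max_pairs
print(pairs)
print(f"{len(pairs)} of {max_pairs} pairs kept, reduction {reduction:.3f}")
```

The weakness of a prefix key is visible even here: "Biutiful" and "Birdman" land in different blocks ("Biu" vs "Bir"), just as a typo in the first three characters would separate two records of the same movie.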


In [11]:
# 2. Sorted Neighbourhood - Sequential similarity
print("\n2️⃣ Sorted Neighbourhood Blocking (Title-based, Window=10)")

sn_blocker = SortedNeighbourhood(
    academy_awards, actors,
    key='title',  # Sort by title
    window=10,    # Sliding window size over the sorted records
    batch_size=1000
)

start_time = time.time()
sn_candidates = []
for batch in sn_blocker:
    sn_candidates.extend(batch.to_dict('records'))
    
sn_time = time.time() - start_time
reduction_ratio = len(sn_candidates) / max_pairs

print(f"  Generated: {len(sn_candidates):,} candidates")
print(f"  Reduction: {(1-reduction_ratio)*100:.1f}% ({reduction_ratio:.4f} ratio)")
print(f"  Time: {sn_time:.3f} seconds")

blocking_results.append({
    'strategy': 'SortedNeighbourhood', 
    'candidates': len(sn_candidates),
    'reduction_ratio': reduction_ratio,
    'time_seconds': sn_time
})


2️⃣ Sorted Neighbourhood Blocking (Title-based, Window=10)
  Generated: 2,906 candidates
  Reduction: 99.6% (0.0042 ratio)
  Time: 0.007 seconds
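
The sorted neighbourhood method can be sketched in a few lines of plain Python: merge both sources, sort by the key, and pair each record with the records that follow it inside a sliding window. The function below is a simplified illustration, not PyDI's `SortedNeighbourhood`. In particular, the exact window semantics (here: each record is compared with the next `window - 1` records) are an assumption.

```python
def sorted_neighbourhood(left, right, window):
    """left/right: lists of (record_id, sort_key). Returns cross-source candidate pairs."""
    # Merge both sources into one list and sort by the lower-cased key
    merged = sorted(
        [(key.lower(), rid, "L") for rid, key in left]
        + [(key.lower(), rid, "R") for rid, key in right]
    )
    pairs = set()
    for i, (_, rid_i, src_i) in enumerate(merged):
        # Compare each record with the next (window - 1) records in sort order
        for _, rid_j, src_j in merged[i + 1 : i + window]:
            if src_i != src_j:
                pairs.add((rid_i, rid_j) if src_i == "L" else (rid_j, rid_i))
    return pairs

left = [("l0", "True Grit"), ("l1", "Biutiful"), ("l2", "Zorba the Greek")]
right = [("r0", "True Lies"), ("r1", "Birdman")]
print(sorted(sorted_neighbourhood(left, right, window=2)))
```

Because the number of candidates grows roughly linearly with the window size rather than with block sizes, this method tends to produce far fewer pairs than key-based blocking, at the cost of missing matches whose keys sort far apart.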


In [12]:
# 3. Token Blocking - Token-based similarity
print("\n3️⃣ Token Blocking (Title Tokens, Min Length=2)")

token_blocker = TokenBlocking(
    academy_awards, actors,
    column='title',      # Tokenize titles
    min_token_len=2,     # Ignore very short tokens
    batch_size=1000
)

start_time = time.time()
token_candidates = []
batch_count = 0

# Token blocking can generate many candidates; this loop collects all of them
# (no limit is applied) and counts the batches as it goes
for batch in token_blocker:
    batch_count += 1
    token_candidates.extend(batch.to_dict('records'))
        
token_time = time.time() - start_time
reduction_ratio = len(token_candidates) / max_pairs

print(f"  Generated: {len(token_candidates):,} candidates")
print(f"  Reduction: {(1-reduction_ratio)*100:.1f}% ({reduction_ratio:.4f} ratio)")
print(f"  Time: {token_time:.3f} seconds")

blocking_results.append({
    'strategy': 'TokenBlocking',
    'candidates': len(token_candidates),
    'reduction_ratio': reduction_ratio, 
    'time_seconds': token_time
})


3️⃣ Token Blocking (Title Tokens, Min Length=2)
  Generated: 75,242 candidates
  Reduction: 89.0% (0.1100 ratio)
  Time: 0.147 seconds
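
Token blocking pairs two records whenever they share at least one token. A common way to do this is an inverted index from token to record IDs, sketched below. This is an illustrative version under simple assumptions (whitespace tokenisation, lower-casing); the real `TokenBlocking` tokeniser may differ.

```python
from collections import defaultdict

def token_blocking(left, right, min_token_len=2):
    """left/right: dicts of record_id -> text. Pairs share at least one token."""
    def tokens(text):
        return {t for t in text.lower().split() if len(t) >= min_token_len}

    # Inverted index: token -> right-hand record IDs containing it
    index = defaultdict(set)
    for rid, text in right.items():
        for tok in tokens(text):
            index[tok].add(rid)

    pairs = set()
    for lid, text in left.items():
        for tok in tokens(text):
            pairs.update((lid, rid) for rid in index.get(tok, ()))
    return pairs

left = {"l0": "The Divorcee", "l1": "True Grit"}
right = {"r0": "The Artist", "r1": "Grit and Glory"}
print(sorted(token_blocking(left, right)))
```

Note that "The Divorcee" and "The Artist" are paired only because of the token "the". Frequent tokens like this are a plausible reason why token blocking produces the largest candidate set of the strategies compared here.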


In [13]:
# 4. Embedding Blocking - Semantic similarity (Advanced)
print("\n4️⃣ Embedding Blocking (Semantic Similarity)")
print("Using neural embeddings for semantic movie matching...")

embedding_blocker = EmbeddingBlocking(
    academy_awards, actors,
    text_cols=['title'],
    model="sentence-transformers/all-MiniLM-L6-v2",
    index_backend="sklearn",
    top_k=10,          # Top 10 most similar
    threshold=0.5,     # Similarity threshold
    batch_size=500
)

start_time = time.time()
embedding_candidates = []
for batch in embedding_blocker:
    embedding_candidates.extend(batch.to_dict('records'))
    
embedding_time = time.time() - start_time
reduction_ratio = len(embedding_candidates) / max_pairs

print(f"  Generated: {len(embedding_candidates):,} candidates")
print(f"  Reduction: {(1-reduction_ratio)*100:.1f}% ({reduction_ratio:.4f} ratio)")
print(f"  Time: {embedding_time:.3f} seconds")
print("  🧠 Semantic matching can find similar movies with different titles!")

blocking_results.append({
    'strategy': 'EmbeddingBlocking',
    'candidates': len(embedding_candidates),
    'reduction_ratio': reduction_ratio,
    'time_seconds': embedding_time
})


4️⃣ Embedding Blocking (Semantic Similarity)
Using neural embeddings for semantic movie matching...
  Generated: 1,030 candidates
  Reduction: 99.8% (0.0015 ratio)
  Time: 3.393 seconds
  🧠 Semantic matching can find similar movies with different titles!
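
The `top_k` and `threshold` parameters can be understood with a small NumPy sketch: for each left record, keep the `top_k` most cosine-similar right records, then drop those below the threshold. The toy 2-d vectors stand in for sentence embeddings, and the function `top_k_candidates` is hypothetical; `EmbeddingBlocking` itself delegates the search to an index backend.

```python
import numpy as np

def top_k_candidates(left_vecs, right_vecs, top_k=2, threshold=0.5):
    """Return (left_idx, right_idx, cosine_sim) for each left row's nearest right rows."""
    # Normalise rows so that a dot product equals cosine similarity
    l = left_vecs / np.linalg.norm(left_vecs, axis=1, keepdims=True)
    r = right_vecs / np.linalg.norm(right_vecs, axis=1, keepdims=True)
    sims = l @ r.T
    pairs = []
    for i, row in enumerate(sims):
        # Indices of the top_k most similar right rows, best first
        for j in np.argsort(-row)[:top_k]:
            if row[j] >= threshold:
                pairs.append((i, int(j), float(row[j])))
    return pairs

# Toy 2-d vectors standing in for sentence embeddings
left_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
right_vecs = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]])
candidates = top_k_candidates(left_vecs, right_vecs)
for pair in candidates:
    print(pair)
```

With `top_k` fixed, the candidate count is bounded by `top_k` times the number of left records, which explains the small candidate set; the price is the embedding time visible in the timing above.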


In [14]:
# Showcase EntityMatchingEvaluator.evaluate_blocking utility

# Load test set with proper _id format
test_gt = load_csv(
    DATA_DIR / "entitymatching" / "splits" / "academy_awards_2_actors_test.csv",
    name="test_set", header=None, names=['id1', 'id2', 'label'], add_index=False
)

# Use EntityMatchingEvaluator.evaluate_blocking on Standard Blocking
candidates_df = pd.DataFrame(standard_candidates)
total_pairs = len(academy_awards) * len(actors)

results = EntityMatchingEvaluator.evaluate_blocking(
    candidate_pairs=candidates_df[['id1', 'id2']],
    test_pairs=test_gt,
    total_possible_pairs=total_pairs
)

print("\n💡 Evaluating pair quality only makes sense if the test set contains all possible pairs, which is not the case in this example!")

display(results)

  Pair Completeness: 0.979
  Pair Quality:      0.001
  Reduction Ratio:   0.950
  True Matches Found: 46/47

💡 Evaluating pair quality only makes sense if the test set contains all possible pairs, which is not the case in this example!


{'pair_completeness': 0.9787234042553191,
 'pair_quality': 0.0013349972429404766,
 'reduction_ratio': 0.9496395832846152,
 'total_candidates': 34457,
 'total_possible_pairs': 684208,
 'true_positives_found': 46,
 'total_true_pairs': 47,
 'evaluation_timestamp': '2025-09-08T17:45:50.888971'}
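
The three blocking metrics follow directly from the counts in the result dictionary above. The sketch below recomputes them from those counts; the function name `blocking_metrics` is made up, and the formulas are the standard definitions, which reproduce the reported values.

```python
def blocking_metrics(n_candidates, true_matches_found, total_true_matches, total_possible_pairs):
    """Standard blocking metrics computed from raw counts."""
    return {
        # Share of known true matches that survive blocking (recall of the blocker)
        "pair_completeness": true_matches_found / total_true_matches,
        # Share of candidates that are known true matches (precision of the blocker)
        "pair_quality": true_matches_found / n_candidates,
        # Share of the Cartesian product that blocking avoids
        "reduction_ratio": 1 - n_candidates / total_possible_pairs,
    }

# Counts taken from the Standard Blocking result above
m = blocking_metrics(34_457, 46, 47, 684_208)
print({k: round(v, 4) for k, v in m.items()})
```

Pair quality is computed against the labelled test pairs only, so candidates that are true matches but unlabelled count against it. That is why the value is only meaningful when the test set covers all pairs.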

In [15]:
# Evaluate all blocking methods and select the best one based on highest pair completeness, then highest reduction ratio (if tie)
print("=== Selecting Best Blocking Method ===")

# Evaluate all blocking strategies
blocking_methods = {
    'Standard': (standard_candidates, standard_time),
    'SortedNeighbourhood': (sn_candidates, sn_time), 
    'Token': (token_candidates, token_time),
    'Embedding': (embedding_candidates, embedding_time)
}

best_method = None
best_completeness = -1
best_reduction = -1
results_summary = []

for method, (candidates, time_taken) in blocking_methods.items():
    print(method)
    candidates_df = pd.DataFrame(candidates)
    eval_results = EntityMatchingEvaluator.evaluate_blocking(
        candidate_pairs=candidates_df[['id1', 'id2']],
        test_pairs=test_gt,
        total_possible_pairs=total_pairs
    )
    
    completeness = eval_results['pair_completeness']
    reduction = eval_results['reduction_ratio']
    
    results_summary.append({
        'Method': method,
        'Candidates': len(candidates),
        'Completeness': f"{completeness:.3f}",
        'Reduction': f"{reduction:.3f}",
        'Time (s)': f"{time_taken:.3f}"
    })
    
    # Select best: highest completeness, then highest reduction ratio (if tie)
    if (completeness > best_completeness or 
        (completeness == best_completeness and reduction > best_reduction)):
        best_completeness = completeness
        best_reduction = reduction
        best_method = method

# Display results
print("📊 Blocking Method Comparison:")
display(pd.DataFrame(results_summary))

# Select best candidates
best_candidates = blocking_methods[best_method][0]
print(f"\n🏆 Best Method: {best_method} (Completeness: {best_completeness:.3f}, Reduction: {best_reduction:.3f})")
print(f"✅ Using {len(best_candidates):,} candidate pairs for matching")

=== Selecting Best Blocking Method ===
Standard
  Pair Completeness: 0.979
  Pair Quality:      0.001
  Reduction Ratio:   0.950
  True Matches Found: 46/47
SortedNeighbourhood
  Pair Completeness: 0.979
  Pair Quality:      0.016
  Reduction Ratio:   0.996
  True Matches Found: 46/47
Token
  Pair Completeness: 1.000
  Pair Quality:      0.001
  Reduction Ratio:   0.890
  True Matches Found: 47/47
Embedding
  Pair Completeness: 1.000
  Pair Quality:      0.046
  Reduction Ratio:   0.998
  True Matches Found: 47/47
📊 Blocking Method Comparison:

| | Method | Candidates | Completeness | Reduction | Time (s) |
|---|---|---|---|---|---|
| 0 | Standard | 34457 | 0.979 | 0.95 | 0.064 |
| 1 | SortedNeighbourhood | 2906 | 0.979 | 0.996 | 0.007 |
| 2 | Token | 75242 | 1.0 | 0.89 | 0.147 |
| 3 | Embedding | 1030 | 1.0 | 0.998 | 3.393 |

🏆 Best Method: Embedding (Completeness: 1.000, Reduction: 0.998)
✅ Using 1,030 candidate pairs for matching


## TODO: Blocking log functionality. What should this look like as each method does blocking very differently? Print samples of the "blocks"?

### Step 2: Entity Matching with Comparators

Now we'll use PyDI's matching capabilities to find duplicate movies using multiple attribute comparisons.

In [16]:
# Create comparators for different attributes
comparators = [
    # Title similarity - most important for movies
    StringComparator(
        column='title',
        similarity_function='jaro_winkler',  # Good for movie titles
        preprocess=str.lower  # Case normalization
    ),
    
    # Date proximity - movies from same year likely same film
    DateComparator(
        column='date', 
        max_days_difference=365  # Allow 1 year difference
    ),
    
    # Actor name similarity - supporting evidence
    StringComparator(
        column='actor_name',
        similarity_function='cosine',  # Good for names
        preprocess=str.lower
    )
]

# Define attribute weights
weights = [0.6, 0.25, 0.15]  # Title most important, then date, then actor
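
To show how weights and a threshold interact, here is a minimal sketch of a weighted-average matching rule. It assumes each comparator returns a similarity in [0, 1] and that the final score is the weight-normalised sum; how `RuleBasedMatcher` actually aggregates scores or handles missing values is not shown in this tutorial, so treat `combine` and the example similarities as hypothetical.

```python
def combine(similarities, weights, threshold):
    """Weighted average of per-attribute similarities, compared with a threshold."""
    if len(similarities) != len(weights):
        raise ValueError("need one weight per similarity")
    score = sum(s * w for s, w in zip(similarities, weights)) / sum(weights)
    return score, score >= threshold

weights = [0.6, 0.25, 0.15]  # title, date, actor_name (same order as the comparators)

# Hypothetical per-attribute similarities for two candidate pairs
strong = combine([0.95, 1.0, 0.80], weights, threshold=0.7)
weak = combine([0.60, 1.0, 0.10], weights, threshold=0.7)
print(strong)
print(weak)
```

With these weights a perfect date match alone contributes only 0.25, so a pair needs a reasonably similar title to clear a threshold of 0.7.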

In [17]:
# Initialize Rule-Based Matcher
matcher = RuleBasedMatcher()

print("\n=== Performing Entity Matching ===")
print(f"Candidate pairs to evaluate: {len(best_candidates):,}")
print("Applying multi-attribute matching rules with threshold 0.7...\n")

candidates_df = pd.DataFrame(best_candidates)

# Perform matching with threshold 0.7
start_time = time.time()

matches = matcher.match(
    df_left=left_df,
    df_right=right_df, 
    candidates=[candidates_df],
    comparators=comparators,
    weights=weights,
    threshold=0.7
)

matching_time = time.time() - start_time

print(f"Found {len(matches):,} matches in {matching_time:.3f} seconds")


=== Performing Entity Matching ===
Candidate pairs to evaluate: 1,030
Applying multi-attribute matching rules with threshold 0.7...

Found 114 matches in 0.462 seconds


### Step 3: Evaluation Against Ground Truth

PyDI provides separate, focused evaluation methods for different aspects of entity matching:
- **`evaluate_blocking()`**: Evaluates blocking strategies with pair completeness, pair quality, and reduction ratio
- **`evaluate_matching()`**: Evaluates matching results with precision, recall, F1-score, and accuracy

Let's evaluate our matching results against the provided ground truth correspondences.

In [18]:
print("=== Evaluation Against Ground Truth ===")
print("Loading Winter framework's ground truth correspondences...\n")

# Load ground truth correspondences
gt_train = load_csv(
    DATA_DIR / "entitymatching" / "splits" / "academy_awards_2_actors_training.csv",
    name="ground_truth_train",
    header=None,
    names=['id1', 'id2', 'label'],
    add_index=False
)

gt_test = load_csv(
    DATA_DIR / "entitymatching" / "splits" / "academy_awards_2_actors_test.csv", 
    name="ground_truth_test",
    header=None,
    names=['id1', 'id2', 'label'],
    add_index=False
)

print(f"Training ground truth: {len(gt_train):,} pairs")
print(f"Test ground truth: {len(gt_test):,} pairs")

# Analyze label distribution
for name, gt in [('Training', gt_train), ('Test', gt_test)]:
    true_matches = (gt['label'] == 'TRUE').sum() if 'TRUE' in gt['label'].values else (gt['label'] == True).sum()
    total = len(gt)
    print(f"{name} set: {true_matches:,} positive matches out of {total:,} pairs ({true_matches/total*100:.1f}%)")

print(f"\n🎯 We'll evaluate against the test set ({len(gt_test):,} pairs)")

=== Evaluation Against Ground Truth ===
Loading Winter framework's ground truth correspondences...

Training ground truth: 335 pairs
Test ground truth: 3,347 pairs
Training set: 103 positive matches out of 335 pairs (30.7%)
Test set: 47 positive matches out of 3,347 pairs (1.4%)

🎯 We'll evaluate against the test set (3,347 pairs)


In [19]:
# Perform evaluation using PyDI's EntityMatchingEvaluator
print("\n=== Entity Matching Evaluation Results ===")

# Use the new evaluate_matching method for cleaner evaluation
eval_results = EntityMatchingEvaluator.evaluate_matching(
    correspondences=matches,
    test_pairs=gt_test,
    out_dir=str(OUTPUT_DIR)
)

display(eval_results)


=== Entity Matching Evaluation Results ===
Performance Metrics:
  Accuracy:  0.976
  Precision: 0.342
  Recall:    0.830
  F1-Score:  0.484
Confusion Matrix:
  True Positives:  39
  True Negatives:  3299
  False Positives: 75
  False Negatives: 8


{'precision': 0.34210526315789475,
 'recall': 0.8297872340425532,
 'f1': 0.484472049689441,
 'accuracy': 0.9757380882782812,
 'true_positives': 39,
 'false_positives': 75,
 'false_negatives': 8,
 'true_negatives': 3299,
 'threshold_used': 0.0,
 'total_correspondences': 114,
 'filtered_correspondences': 114,
 'evaluation_timestamp': '2025-09-08T17:45:52.835410',
 'output_files': ['c:\\Users\\Ralph\\dev\\pydi\\output\\tutorial\\matching_evaluation_summary.json',
  'c:\\Users\\Ralph\\dev\\pydi\\output\\tutorial\\matching_detailed_results.csv']}
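
The reported metrics can be recomputed from the confusion-matrix counts with the standard definitions, as the sketch below shows (the function name `matching_metrics` is an assumption). It also makes the gap between accuracy and F1 easier to see: accuracy is dominated by the 3,299 true negatives, while precision reflects that 75 of the 114 predicted matches are counted as false positives.

```python
def matching_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Counts taken from the evaluation output above
metrics = matching_metrics(tp=39, fp=75, fn=8, tn=3299)
print({k: round(v, 3) for k, v in metrics.items()})
```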

In [20]:
# Re-run the matcher with debug mode enabled to get detailed debug data
print("🔍 Re-running matcher with debug mode to capture detailed results:")

# Use the same candidates and settings from before
candidates_df = pd.DataFrame(best_candidates)
print(f"  Using {len(candidates_df)} actual candidate pairs from {best_method} blocking")

# Re-run matching with debug enabled to capture detailed comparator results
start_time = time.time()

# Enable debug mode in the matcher to capture detailed results
matches, debug_info = matcher.match(
    df_left=left_df,
    df_right=right_df, 
    candidates=[candidates_df],
    comparators=comparators,
    weights=weights,
    threshold=0.7,
    debug=True  # This enables debug output capture
)

matching_time = time.time() - start_time
print(f"  Found {len(matches)} matches in {matching_time:.3f} seconds with debug enabled")

debug_output_dir = OUTPUT_DIR / "debug_results"
debug_output_dir.mkdir(parents=True, exist_ok=True)

# Call the write_debug_results function with actual results
full_debug_path, short_debug_path = EntityMatchingEvaluator.write_debug_results(
    correspondences=matches,
    debug_results=debug_info,
    out_dir=str(debug_output_dir),
    matcher_instance=matcher
)

print(f"  ✅ Full debug results: {Path(full_debug_path).name}")
print(f"  ✅ Short debug results: {Path(short_debug_path).name}")

print(f"📁 Debug files saved to: {debug_output_dir}")

🔍 Re-running matcher with debug mode to capture detailed results:
  Using 1030 actual candidate pairs from Embedding blocking
  Found 114 matches in 0.480 seconds with debug enabled
  ✅ Full debug results: debugResultsMatchingRule.csv
  ✅ Short debug results: debugResultsMatchingRule.csv_short
📁 Debug files saved to: c:\Users\Ralph\dev\pydi\output\tutorial\debug_results


In [21]:
print("=== Demonstrating Cluster Size Distribution Analysis ===")
print("Analyzing cluster size distribution in our entity matching results...")

# Create cluster size distribution from our matches
cluster_distribution = EntityMatchingEvaluator.create_cluster_size_distribution(
    correspondences=matches,
    out_dir=str(OUTPUT_DIR / "cluster_analysis")
)

print("\n📊 Cluster Size Distribution Results:")
display(cluster_distribution)


=== Demonstrating Cluster Size Distribution Analysis ===
Analyzing cluster size distribution in our entity matching results...

📊 Cluster Size Distribution Results:

| | cluster_size | frequency | percentage |
|---|---|---|---|
| 0 | 2 | 110 | 98.214286 |
| 1 | 3 | 2 | 1.785714 |
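
A cluster here is a connected component of the match graph: if A matches B and B matches C, all three end up in one cluster. The sketch below derives a size distribution from a list of matched pairs with a small union-find structure. It illustrates the idea only; whether `create_cluster_size_distribution` uses union-find internally is an assumption, and the toy IDs are invented.

```python
from collections import Counter

def cluster_sizes(pairs):
    """Group matched pairs into connected components and count component sizes."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two components

    members = Counter(find(x) for x in list(parent))
    # Map: cluster size -> number of clusters of that size
    return Counter(members.values())

pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b2")]
print(sorted(cluster_sizes(pairs).items()))
```

Clusters larger than two are worth inspecting: in a one-to-one matching task they indicate that one record was matched to several records on the other side.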


In [24]:
# Write out detailed cluster information with all entity records for debugging purposes

# Use the matches we found earlier to demonstrate cluster details
cluster_details_path = OUTPUT_DIR / "cluster_analysis" / "detailed_cluster_info.json"
cluster_details_path.parent.mkdir(parents=True, exist_ok=True)

# Call write_cluster_details with our entity matches
output_path = EntityMatchingEvaluator.write_cluster_details(
    correspondences=matches,
    out_path=str(cluster_details_path)
)

### Step 4: Machine Learning-based Matching

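This step is not yet covered in the notebook. The functionality is already available in PyDI (see the `MLBasedMatcher` import in the setup cell).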

## Part 3: Data Fusion

In [26]:
print("=== Data Fusion: Resolving Conflicts ===")
print("Creating unified movie records from multiple sources...\n")

# Load all three datasets for fusion
print("📊 Fusion Input Datasets:")
for df, name in zip(datasets, names):
    print(f"  {name}: {len(df):,} records")

total_input_records = sum(len(df) for df in datasets)
print(f"  Total: {total_input_records:,} records")
print("\n🎯 Goal: Create single authoritative movie record per entity")

=== Data Fusion: Resolving Conflicts ===
Creating unified movie records from multiple sources...

📊 Fusion Input Datasets:
  Academy Awards: 4,592 records
  Actors: 149 records
  Golden Globes: 2,286 records
  Total: 7,027 records

🎯 Goal: Create single authoritative movie record per entity
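
To illustrate what "one authoritative record per entity" means in practice, the sketch below fuses a single group of matched records by applying one conflict-resolution rule per attribute. The two rules are local re-implementations written for this example and only share their names with the `longest_string` and `voting` rules imported from `PyDI.fusion`; the record group is toy data, and the real `DataFusionEngine` API is not shown here.

```python
from collections import Counter

def longest_string(values):
    """Keep the longest non-empty value."""
    candidates = [v for v in values if v]
    return max(candidates, key=len) if candidates else None

def voting(values):
    """Keep the most frequent non-empty value (first seen wins ties)."""
    candidates = [v for v in values if v]
    return Counter(candidates).most_common(1)[0][0] if candidates else None

# One record group: the same movie as seen by three sources (toy data)
group = [
    {"title": "True Grit", "date": "2010-01-01", "director_name": "Joel Coen"},
    {"title": "True Grit (2010)", "date": "2010-01-01", "director_name": None},
    {"title": "True Grit", "date": "2011-01-01", "director_name": None},
]
rules = {"title": longest_string, "date": voting, "director_name": longest_string}

# Apply each attribute's rule to the values that the group's records provide
fused = {attr: rule([rec[attr] for rec in group]) for attr, rule in rules.items()}
print(fused)
```

The choice of rule per attribute is the core design decision in fusion: voting suits values that sources should agree on (dates), while a completeness-oriented rule such as longest string suits sparsely populated text attributes.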
