# Entity Matching Blocking Demonstration

This notebook demonstrates end-to-end candidate generation using multiple blocking strategies in PyDI. We'll work with movie datasets (academy awards and actors) to showcase various blocking techniques without performing actual matching.

## Overview

**Blocking** is a critical preprocessing step in entity matching that reduces the number of candidate pairs from the full Cartesian product to a manageable subset. This notebook demonstrates:

- **NoBlocking**: Full Cartesian product with sampling (baseline)
- **StandardBlocking**: Equality-based blocking on shared attributes
- **SortedNeighbourhood**: Sorts records by a key and pairs neighbours within a sliding window
- **TokenBlocking**: Token-based blocking with deduplication
- **EmbeddingBlocking**: Nearest-neighbour blocking on sentence-embedding similarity

In [23]:
# Setup and imports
import os
import random
import numpy as np
import pandas as pd
from pathlib import Path
import logging
import time
import json
from datetime import datetime

# PyDI imports
from PyDI.io.loaders import load_xml, load_csv
from PyDI.profiling import DataProfiler
from PyDI.entitymatching.blocking import (
    NoBlocking,
    StandardBlocking, 
    SortedNeighbourhood,
    TokenBlocking,
    EmbeddingBlocking
)
from PyDI.entitymatching import BlockingEvaluator

# Set random seed for reproducibility
random.seed(42)
np.random.seed(42)

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Cross-platform path handling - works on both Windows and Mac
root = Path.cwd().parents[1]  # repo root fallback for notebooks

# Set up output directory using absolute paths from root
OUTPUT_DIR = root / "output" / "examples" / "entitymatching" / "blocking_demo"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Initialize profiler
profiler = DataProfiler()

print(f"Repository root: {root}")
print(f"Output directory: {OUTPUT_DIR.absolute()}")
print("Random seed set to 42 for reproducibility")

Repository root: /Users/aaronsteiner/Documents/GitHub/PyDI
Output directory: /Users/aaronsteiner/Documents/GitHub/PyDI/output/examples/entitymatching/blocking_demo
Random seed set to 42 for reproducibility


## Data Loading & Profiling

We'll load the movie datasets using PyDI's XML loader, which automatically adds unique identifiers and provenance metadata, then profile them using PyDI's DataProfiler.

In [24]:
# Define paths to the movie datasets using cross-platform paths
ACADEMY_AWARDS_PATH = root / "input" / "movies" / "entitymatching" / "data" / "academy_awards.xml"
ACTORS_PATH = root / "input" / "movies" / "entitymatching" / "data" / "actors.xml"

print(f"Academy Awards path: {ACADEMY_AWARDS_PATH}")
print(f"Actors path: {ACTORS_PATH}")
print(f"Academy Awards exists: {ACADEMY_AWARDS_PATH.exists()}")
print(f"Actors exists: {ACTORS_PATH.exists()}")

# Load the datasets
print("\nLoading academy awards dataset...")
df_awards = load_xml(
    ACADEMY_AWARDS_PATH, 
    name="academy_awards",
    add_index=True,
    index_column_name="_id"
)

print("Loading actors dataset...")
df_actors = load_xml(
    ACTORS_PATH, 
    name="actors",
    add_index=True,
    index_column_name="_id"
)

print(f"Academy Awards dataset shape: {df_awards.shape}")
print(f"Actors dataset shape: {df_actors.shape}")

# Preview the datasets
print("\n=== Academy Awards Dataset Preview ===")
print(df_awards.head())
print("\nColumns:", list(df_awards.columns))

print("\n=== Actors Dataset Preview ===")
print(df_actors.head())
print("\nColumns:", list(df_actors.columns))

2025-08-29 11:48:48,333 - PyDI.io.loaders - INFO - Loaded dataset 'academy_awards' via read_xml_flattened: shape=(4592, 7), source=/Users/aaronsteiner/Documents/GitHub/PyDI/input/movies/entitymatching/data/academy_awards.xml
2025-08-29 11:48:48,336 - PyDI.io.loaders - INFO - Loaded dataset 'actors' via read_xml_flattened: shape=(149, 7), source=/Users/aaronsteiner/Documents/GitHub/PyDI/input/movies/entitymatching/data/actors.xml


Academy Awards path: /Users/aaronsteiner/Documents/GitHub/PyDI/input/movies/entitymatching/data/academy_awards.xml
Actors path: /Users/aaronsteiner/Documents/GitHub/PyDI/input/movies/entitymatching/data/actors.xml
Academy Awards exists: True
Actors exists: True

Loading academy awards dataset...
Loading actors dataset...
Academy Awards dataset shape: (4592, 7)
Actors dataset shape: (149, 7)

=== Academy Awards Dataset Preview ===
                   _id                id               title       actor_name  \
0  academy_awards-0000  academy_awards_1            Biutiful    Javier Bardem   
1  academy_awards-0001  academy_awards_2           True Grit     Jeff Bridges   
2  academy_awards-0002  academy_awards_2           True Grit     Jeff Bridges   
3  academy_awards-0003  academy_awards_3  The Social Network  Jesse Eisenberg   
4  academy_awards-0004  academy_awards_4   The King's Speech      Colin Firth   

         date  director_name oscar  
0  2010-01-01            NaN   NaN  
1  20

## Data Profiling & Analysis

Let's use PyDI's DataProfiler to analyze our datasets comprehensively. The DataProfiler provides:

- **Quick summaries**: Basic statistics without heavy report generation
- **Detailed HTML reports**: Rich profiling using ydata-profiling (optional)
- **Dataset comparison**: Side-by-side analysis using sweetviz (optional)

This will help us understand data quality, column distributions, and identify the best columns for blocking.
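
As a rough illustration of what "best columns for blocking" can mean, the sketch below computes a null rate and a distinct-value ratio for one column. It is a plain-Python stand-in, not part of PyDI: the helper name `column_stats` and the toy records are assumptions made for this example. A column with few nulls and many distinct values tends to make a more useful blocking key.

```python
from typing import Any, Dict, List


def column_stats(records: List[Dict[str, Any]], column: str) -> Dict[str, float]:
    """Return the null rate and distinct-value ratio of one column.

    Hypothetical helper for illustration; it is not a PyDI function.
    """
    values = [r.get(column) for r in records]
    non_null = [v for v in values if v is not None]
    null_rate = 1 - len(non_null) / len(values) if values else 0.0
    distinct_ratio = len(set(non_null)) / len(non_null) if non_null else 0.0
    return {"null_rate": null_rate, "distinct_ratio": distinct_ratio}


# Toy records, made up for this example (not taken from the movie datasets)
toy = [
    {"title": "True Grit", "director_name": None},
    {"title": "True Grit", "director_name": None},
    {"title": "Biutiful", "director_name": "Some Director"},
    {"title": "Eskimo", "director_name": None},
]

title_stats = column_stats(toy, "title")
director_stats = column_stats(toy, "director_name")
print(title_stats)
print(director_stats)
```

In the toy data `title` has no nulls, while `director_name` is mostly empty, so records without a director could never be paired through that column.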

In [25]:
# Generate profiling reports and summaries using PyDI's DataProfiler
print("=== Dataset Profiling ===")

# Create profiling output directory
profiling_dir = OUTPUT_DIR / "profiling"

# Generate quick summaries for both datasets
def print_summary(title, summary):
    """Print a compact view of a DataProfiler summary dict."""
    print(f"\n--- {title} Summary ---")
    for key, value in summary.items():
        if key == 'nulls_per_column':
            print(f"{key}: {sum(value.values())} total nulls")
            null_cols = {k: v for k, v in value.items() if v > 0}
            if null_cols:
                print("  Columns with nulls:")
                for col, count in null_cols.items():
                    print(f"    {col}: {count} ({count/summary['rows']*100:.1f}%)")
        elif key == 'dtypes':
            print(f"Column types: {len(set(value.values()))} unique types")
            type_counts = {}
            for dtype in value.values():
                type_counts[dtype] = type_counts.get(dtype, 0) + 1
            for dtype, count in type_counts.items():
                print(f"  {dtype}: {count} columns")
        else:
            print(f"{key}: {value:,}" if isinstance(value, int) else f"{key}: {value}")

awards_summary = profiler.summary(df_awards)
print_summary("Academy Awards", awards_summary)

actors_summary = profiler.summary(df_actors)
print_summary("Actors", actors_summary)

# Find common columns for blocking
awards_cols = set(df_awards.columns)
actors_cols = set(df_actors.columns)
common_cols = awards_cols.intersection(actors_cols)
common_cols.discard('_id')  # Remove the ID column

print(f"\n--- Column Analysis ---")
print(f"Academy Awards columns: {sorted(awards_cols)}")
print(f"Actors columns: {sorted(actors_cols)}")
print(f"Common columns for blocking: {sorted(common_cols)}")

# Sample values from key columns for insight
print(f"\n--- Sample Values ---")
key_columns = ['title', 'actor_name', 'director_name'] if 'title' in common_cols else list(common_cols)[:3]
for col in key_columns:
    if col in df_awards.columns:
        awards_sample = df_awards[col].dropna().sample(min(3, df_awards[col].nunique()), random_state=42).tolist()
        print(f"Academy Awards {col}: {awards_sample}")
    if col in df_actors.columns:
        actors_sample = df_actors[col].dropna().sample(min(3, df_actors[col].nunique()), random_state=42).tolist()
        print(f"Actors {col}: {actors_sample}")

# Generate detailed HTML profiling reports (optional - requires ydata-profiling)
print(f"\n--- Generating Detailed Reports ---")
try:
    awards_profile_path = profiler.profile(df_awards, str(profiling_dir))
    print(f"✓ Academy Awards profile: {awards_profile_path}")
except ImportError:
    print("⚠️  ydata-profiling not installed - skipping detailed HTML reports")
    print("   Install with: pip install ydata-profiling")

try:
    actors_profile_path = profiler.profile(df_actors, str(profiling_dir))
    print(f"✓ Actors profile: {actors_profile_path}")
except ImportError:
    pass  # Already warned above

=== Dataset Profiling ===

--- Academy Awards Summary ---
rows: 4,592
columns: 7
nulls_total: 11,036
nulls_per_column: 11036 total nulls
  Columns with nulls:
    title: 12 (0.3%)
    actor_name: 3535 (77.0%)
    director_name: 4172 (90.9%)
    oscar: 3317 (72.2%)
Column types: 2 unique types
  string: 1 columns
  object: 6 columns

--- Actors Summary ---
rows: 149
columns: 7
nulls_total: 0
nulls_per_column: 0 total nulls
Column types: 2 unique types
  string: 1 columns
  object: 6 columns

--- Column Analysis ---
Academy Awards columns: ['_id', 'actor_name', 'date', 'director_name', 'id', 'oscar', 'title']
Actors columns: ['_id', 'actor_name', 'actors_actor_birthday', 'actors_actor_birthplace', 'date', 'id', 'title']
Common columns for blocking: ['actor_name', 'date', 'id', 'title']

--- Sample Values ---
Academy Awards title: ['Eskimo', 'That Hamilton Woman', 'Task Force']
Actors title: ['Erin Brockovich', 'To Each His Own', 'In the Heat of the Night']
Academy Awards actor_name: ['Al



✓ Academy Awards profile: /Users/aaronsteiner/Documents/GitHub/PyDI/output/examples/entitymatching/blocking_demo/profiling/academy_awards_profile.html



✓ Actors profile: /Users/aaronsteiner/Documents/GitHub/PyDI/output/examples/entitymatching/blocking_demo/profiling/actors_profile.html





## Candidate Generation Statistics

Before diving into blocking strategies, let's understand the scale of the problem.

In [26]:
# Calculate theoretical maximum pairs
max_pairs = len(df_awards) * len(df_actors)
print(f"Full Cartesian product: {max_pairs:,} pairs")
print(f"Memory estimate (assuming 16 bytes per pair): {max_pairs * 16 / 1024**2:.1f} MB")

if max_pairs > 100_000:
    print("⚠️  WARNING: Large dataset - blocking is essential!")
elif max_pairs > 10_000:
    print("⚠️  CAUTION: Medium dataset - blocking recommended")
else:
    print("✓ Small dataset - blocking optional but educational")

Full Cartesian product: 684,208 pairs
Memory estimate (assuming 16 bytes per pair): 10.4 MB


In [27]:
blocking_stats = []

## Blocking Strategies Overview

Each blocker follows PyDI's unified API pattern but uses different parameters for specifying columns:

- **NoBlocking**: No column parameters (generates full Cartesian product)
- **StandardBlocking**: `on=[column_list]` - accepts list of columns for exact matching
- **SortedNeighbourhood**: `key=column, window=size` - single column for sorting with window size
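- **TokenBlocking**: `column=column, min_token_len=N` - single column for tokenization
- **EmbeddingBlocking**: `text_cols=[column_list], top_k=N, threshold=T` - text columns to embed, with a neighbour count and a similarity cutoff

All blockers accept a `batch_size` parameter and yield `CandidateBatch` DataFrames with `id1` and `id2` columns.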
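
Because every blocker is consumed the same way, the batch loop can be written once. The sketch below shows that pattern with a hypothetical helper, `collect_pairs`, and a plain list of batches standing in for a blocker. In the notebook each batch is a DataFrame rather than a list of tuples, so this is an illustration of the control flow only.

```python
from itertools import islice
from typing import Iterable, List, Sequence, Tuple

Pair = Tuple[str, str]


def collect_pairs(batches: Iterable[Sequence[Pair]], max_batches: int) -> List[Pair]:
    """Consume at most `max_batches` batches and flatten them into one list.

    Hypothetical helper; `batches` can be any iterable of pair sequences.
    """
    pairs: List[Pair] = []
    for batch in islice(batches, max_batches):
        pairs.extend(batch)
    return pairs


# Stand-in for a blocker: three small batches of (id1, id2) pairs
fake_batches = [
    [("a-0", "b-0"), ("a-0", "b-1")],
    [("a-1", "b-0")],
    [("a-1", "b-1")],
]

limited = collect_pairs(fake_batches, max_batches=2)
print(limited)
```

Capping the number of batches keeps memory bounded, at the cost of seeing only a prefix of the candidate set.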

## 1. NoBlocking (Baseline)

NoBlocking generates the full Cartesian product in manageable batches. This serves as our baseline and is only practical for small datasets.
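
The idea of producing a Cartesian product in batches can be sketched with the standard library alone. This is not PyDI's implementation; the function name `cartesian_batches` and the toy IDs are assumptions made for illustration.

```python
from itertools import islice, product
from typing import Iterator, List, Sequence, Tuple


def cartesian_batches(
    left_ids: Sequence[str], right_ids: Sequence[str], batch_size: int
) -> Iterator[List[Tuple[str, str]]]:
    """Yield the full cross product of two ID lists in chunks of `batch_size`."""
    pairs = product(left_ids, right_ids)  # lazy: pairs are generated on demand
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch


batches = list(cartesian_batches(["a-0", "a-1"], ["b-0", "b-1", "b-2"], batch_size=4))
print([len(b) for b in batches])
```

Only one batch is held in memory at a time, which is why the full product stays tractable even though the total number of pairs grows with the product of the two dataset sizes.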

In [28]:
# Initialize NoBlocking
no_blocker = NoBlocking(df_awards, df_actors, batch_size=1000)

print("=== NoBlocking Strategy ===")
print(f"Estimated pairs: {no_blocker.estimate_pairs():,}")

# Sample some candidate pairs to avoid memory issues
print("\nSampling candidate pairs...")
sample_pairs = []
batch_count = 0
pair_count = 0

for batch in no_blocker:
    batch_count += 1
    pair_count += len(batch)
    
    # Sample from this batch
    if len(sample_pairs) < 100:  # Keep only first 100 for display
        sample_size = min(10, len(batch))
        sample_indices = np.random.choice(len(batch), sample_size, replace=False)
        for idx in sample_indices:
            sample_pairs.append({
                'id1': batch.iloc[idx]['id1'],
                'id2': batch.iloc[idx]['id2'],
                'batch': batch_count
            })
    
    # Stop early if too many pairs
    if batch_count >= 10:
        print(f"Stopping early after {batch_count} batches...")
        break

print(f"Processed {batch_count} batches")
print(f"Total pairs processed: {pair_count:,}")
print(f"Sample pairs collected: {len(sample_pairs)}")

# Display sample pairs
if sample_pairs:
    sample_df = pd.DataFrame(sample_pairs[:10])
    print("\nSample candidate pairs:")
    print(sample_df.to_string(index=False))

print(f"\n📊 NoBlocking: {pair_count:,} pairs processed in {batch_count} batches")

=== NoBlocking Strategy ===
Estimated pairs: 684,208

Sampling candidate pairs...
Stopping early after 10 batches...
Processed 10 batches
Total pairs processed: 8,940
Sample pairs collected: 100

Sample candidate pairs:
                id1         id2  batch
academy_awards-0004 actors-0115      1
academy_awards-0002 actors-0142      1
academy_awards-0003 actors-0078      1
academy_awards-0004 actors-0126      1
academy_awards-0000 actors-0039      1
academy_awards-0001 actors-0141      1
academy_awards-0002 actors-0002      1
academy_awards-0002 actors-0035      1
academy_awards-0001 actors-0059      1
academy_awards-0000 actors-0136      1

📊 NoBlocking: 8,940 pairs processed in 10 batches


## 2. StandardBlocking

StandardBlocking groups records by exact matches on a specified attribute. Records are only compared if they have identical values for the blocking key.
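
A minimal sketch of equality-based blocking, assuming records are plain dictionaries with an `_id` field. The function `standard_blocking` and the toy records are hypothetical and only illustrate the grouping idea, not PyDI's internals.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def standard_blocking(
    left: List[Dict[str, Any]], right: List[Dict[str, Any]], key: str
) -> List[Tuple[str, str]]:
    """Pair records whose `key` values are exactly equal (nulls never match)."""
    right_blocks = defaultdict(list)
    for rec in right:
        if rec.get(key) is not None:
            right_blocks[rec[key]].append(rec["_id"])

    pairs = []
    for rec in left:
        value = rec.get(key)
        if value is None:
            continue
        for right_id in right_blocks.get(value, []):
            pairs.append((rec["_id"], right_id))
    return pairs


# Toy records, made up for this example
left = [
    {"_id": "a-0", "title": "True Grit"},
    {"_id": "a-1", "title": "True Grit"},
    {"_id": "a-2", "title": "Biutiful"},
    {"_id": "a-3", "title": None},
]
right = [
    {"_id": "b-0", "title": "True Grit"},
    {"_id": "b-1", "title": "true grit"},
]

std_pairs = standard_blocking(left, right, key="title")
print(std_pairs)
```

Note that `"true grit"` is not paired with `"True Grit"`: exact equality is sensitive to casing and spelling variants, which is one plausible way for equality blocking to miss true matches.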

In [29]:
# Find the best blocking column (most common between datasets)
blocking_candidates = ['title'] if 'title' in common_cols else list(common_cols)[:1]

if not blocking_candidates:
    print("⚠️  No common columns found for StandardBlocking")
    standard_blocking_stats = None
else:
    blocking_column = blocking_candidates[0]
    print(f"=== StandardBlocking on '{blocking_column}' ===")
    
    # Initialize StandardBlocking 
    standard_blocker = StandardBlocking(
        df_awards, 
        df_actors, 
        on=[blocking_column],  
        batch_size=1000
    )
    
    print(f"Estimated pairs: {standard_blocker.estimate_pairs() or 'Unknown'}")
    
    # Process all batches
    all_pairs = []
    batch_count = 0
    
    for batch in standard_blocker:
        batch_count += 1
        all_pairs.extend(batch.to_dict('records'))
        
        if batch_count >= 50:  # Limit batches for performance
            print(f"Stopping after {batch_count} batches...")
            break
    
    pair_count = len(all_pairs)
    reduction_ratio = pair_count / max_pairs if max_pairs > 0 else 0
    
    print(f"Generated {pair_count:,} candidate pairs in {batch_count} batches")
    print(f"Reduction ratio: {reduction_ratio:.4f} ({100 * (1-reduction_ratio):.1f}% reduction)")
    
    # Sample pairs for display
    if all_pairs:
        sample_pairs = pd.DataFrame(all_pairs[:10])
        print("\nSample candidate pairs:")
        print(sample_pairs.to_string(index=False))
    
    # Analyze block sizes if available
    if hasattr(standard_blocker, '_common_keys'):
        print("\nBlocking statistics:")
        print(f"Number of blocks: {len(standard_blocker._common_keys)}")
        if standard_blocker._common_keys:
            block_sizes = [len(standard_blocker._left_blocks[k]) * len(standard_blocker._right_blocks[k]) 
                          for k in standard_blocker._common_keys[:10]]  # Sample first 10
            print(f"Average block size (sample): {np.mean(block_sizes):.1f}")
            print(f"Max block size (sample): {max(block_sizes)}")
    
    print(f"\n📊 StandardBlocking: {pair_count:,} pairs ({reduction_ratio:.4f} ratio)")

=== StandardBlocking on 'title' ===
Estimated pairs: 138
Generated 138 candidate pairs in 1 batches
Reduction ratio: 0.0002 (100.0% reduction)

Sample candidate pairs:
                id1         id2
academy_awards-0001 actors-0119
academy_awards-0002 actors-0119
academy_awards-2194 actors-0119
academy_awards-0347 actors-0077
academy_awards-0348 actors-0148
academy_awards-0405 actors-0147
academy_awards-0412 actors-0076
academy_awards-0452 actors-0146
academy_awards-0457 actors-0075
academy_awards-0507 actors-0145

Blocking statistics:
Number of blocks: 128
Average block size (sample): 1.2
Max block size (sample): 3

📊 StandardBlocking: 138 pairs (0.0002 ratio)


In [30]:
# Prepare the gold standard
gold = load_csv(
    root / "input" / "movies" / "entitymatching" / "splits" / "gs_academy_awards_2_actors_test.csv",
    name="gs_academy_awards_2_actors_test",
    header=None,
    names=["id1", "id2", "label"],
    add_index=False,
    index_col=False,
    dtype=str,
)
gold

2025-08-29 11:48:51,696 - PyDI.io.loaders - INFO - Loaded dataset 'gs_academy_awards_2_actors_test' via read_csv: shape=(3347, 3), source=/Users/aaronsteiner/Documents/GitHub/PyDI/input/movies/entitymatching/splits/gs_academy_awards_2_actors_test.csv


Unnamed: 0,id1,id2,label
0,academy_awards_4529,actors_2,TRUE
1,academy_awards_4500,actors_3,TRUE
2,academy_awards_4475,actors_4,TRUE
3,academy_awards_4446,actors_5,TRUE
4,academy_awards_4399,actors_6,TRUE
...,...,...,...
3342,academy_awards_3765,actors_15,FALSE
3343,academy_awards_1049,actors_65,FALSE
3344,academy_awards_1115,actors_101,FALSE
3345,academy_awards_3244,actors_101,TRUE


In [31]:
try:
    std_cands = standard_blocker.materialize()

    # Map internal _id to original id values 
    # Careful: the _id is an id we created, not the id in the gold standard
    left_map = df_awards.set_index('_id')['id']
    right_map = df_actors.set_index('_id')['id']
    std_cands['id1'] = std_cands['id1'].map(left_map)
    std_cands['id2'] = std_cands['id2'].map(right_map)

    std_eval = BlockingEvaluator.evaluate(
        std_cands,
        gold_pairs=gold,
        gold_label_col="label",
        total_possible_pairs=max_pairs,
        out_dir=str(OUTPUT_DIR / "standardblocking"),
    )

    blocking_stats.append({
        'strategy': f'StandardBlocking(on=[{blocking_column}])' if 'blocking_column' in locals() else 'StandardBlocking(on=[title])',
        'estimated_pairs': standard_blocker.estimate_pairs(),
        'actual_pairs': len(std_cands),
        'batches_processed': None,  
        'candidate_recall': std_eval.get('candidate_recall'),
        'reduction_ratio': len(std_cands) / max_pairs if max_pairs > 0 else 0,
        'processing_time_seconds': None  
    })
    
    print("StandardBlocking eval:", {k: std_eval.get(k) for k in ["unique_candidates", "candidate_recall", "pair_reduction"]})
except Exception as e:
    print("StandardBlocking evaluation failed:", e)


2025-08-29 11:48:51,727 - root - INFO - Blocking evaluation: candidates=138 unique=137 duplicates=1


StandardBlocking eval: {'unique_candidates': 137, 'candidate_recall': 0.723404255319149, 'pair_reduction': 0.9997997684914529}
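
The two headline metrics can be reproduced by hand. The sketch below assumes that `candidate_recall` is the share of gold matches found among the unique candidates and that `pair_reduction` is one minus the share of the Cartesian product that was kept. The second assumption is consistent with the value printed above, since 1 - 137/684,208 ≈ 0.9998. The function `blocking_metrics` is a hypothetical stand-in, not the `BlockingEvaluator` implementation.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def blocking_metrics(
    candidates: Iterable[Pair], gold_matches: Set[Pair], total_possible_pairs: int
) -> Dict[str, float]:
    """Compute unique candidates, recall against gold matches, and pair reduction."""
    unique = set(candidates)  # duplicates do not count twice
    recall = len(unique & gold_matches) / len(gold_matches) if gold_matches else 0.0
    reduction = 1 - len(unique) / total_possible_pairs if total_possible_pairs else 0.0
    return {
        "unique_candidates": len(unique),
        "candidate_recall": recall,
        "pair_reduction": reduction,
    }


# Toy data, made up for this example
toy_candidates = [("a1", "b1"), ("a1", "b1"), ("a2", "b2"), ("a3", "b9")]
toy_gold = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")}

toy_metrics = blocking_metrics(toy_candidates, toy_gold, total_possible_pairs=100)
print(toy_metrics)
```

The two numbers pull in opposite directions: a blocker that keeps fewer pairs scores higher on reduction but risks dropping true matches, so both should be read together.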


## 3. SortedNeighbourhood Blocking

SortedNeighbourhood sorts records by a key attribute and compares each record with its neighbors within a sliding window.
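
One common formulation of the sorted-neighbourhood idea is sketched below: records from both sources are merged, sorted by the key, and each record is paired with the next `window - 1` records that come from the other source. The lowercase sort key, the window semantics, and the function name `sorted_neighbourhood` are assumptions for this illustration and may differ from PyDI's implementation.

```python
from typing import Any, Dict, List, Set, Tuple


def sorted_neighbourhood(
    left: List[Dict[str, Any]], right: List[Dict[str, Any]], key: str, window: int
) -> Set[Tuple[str, str]]:
    """Pair cross-source records that fall within `window` positions after sorting."""
    tagged = [(str(r[key]).lower(), "L", r["_id"]) for r in left if r.get(key) is not None]
    tagged += [(str(r[key]).lower(), "R", r["_id"]) for r in right if r.get(key) is not None]
    tagged.sort()

    pairs: Set[Tuple[str, str]] = set()
    for i, (_, side_i, id_i) in enumerate(tagged):
        for _, side_j, id_j in tagged[i + 1 : i + window]:
            if side_i != side_j:
                # Always emit pairs as (left id, right id)
                pairs.add((id_i, id_j) if side_i == "L" else (id_j, id_i))
    return pairs


# Toy records, made up for this example
left = [
    {"_id": "a-0", "title": "42nd Street"},
    {"_id": "a-1", "title": "7 Faces of Dr. Lao"},
    {"_id": "a-2", "title": "Zorba the Greek"},
]
right = [{"_id": "b-0", "title": "7th Heaven"}]

sn_pairs = sorted_neighbourhood(left, right, key="title", window=2)
print(sorted(sn_pairs))
```

With `window=2` only direct neighbours in sort order are paired, so unrelated titles that happen to sort next to each other still become candidates. A larger window raises recall and the number of pairs together.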

In [32]:
# Use title for sorted neighbourhood if available
if 'title' in common_cols:
    sort_key = 'title'
else:
    sort_key = list(common_cols)[0] if common_cols else None

if not sort_key:
    print("⚠️  No suitable column found for SortedNeighbourhood blocking")
    sn_blocking_stats = None
else:
    print(f"=== SortedNeighbourhood Blocking on '{sort_key}' ===")
    
    # Initialize SortedNeighbourhood with unified 'key' parameter (single column) and 'window'
    sn_blocker = SortedNeighbourhood(
        df_awards,
        df_actors,
        key=sort_key,  
        window=5,      
        batch_size=1000
    )
    
    print(f"Window size: 5")
    print(f"Estimated pairs: {sn_blocker.estimate_pairs() or 'Unknown'}")
    
    
    all_pairs = []
    batch_count = 0
    
    for batch in sn_blocker:  
        batch_count += 1
        all_pairs.extend(batch.to_dict('records'))
        
        if batch_count >= 50:  # Limit batches
            print(f"Stopping after {batch_count} batches...")
            break
    
    pair_count = len(all_pairs)
    reduction_ratio = pair_count / max_pairs if max_pairs > 0 else 0
    
    print(f"Generated {pair_count:,} candidate pairs in {batch_count} batches")
    print(f"Reduction ratio: {reduction_ratio:.4f} ({100 * (1-reduction_ratio):.1f}% reduction)")
    
    # Sample pairs for display
    if all_pairs:
        sample_pairs = pd.DataFrame(all_pairs[:10])
        print("\nSample candidate pairs:")
        print(sample_pairs.to_string(index=False))
        
        # Show actual values for sample pairs
        print("\nSample pair details:")
        for i, pair in enumerate(sample_pairs[:3].to_dict('records')):
            awards_record = df_awards[df_awards['_id'] == pair['id1']]
            actors_record = df_actors[df_actors['_id'] == pair['id2']]
            if not awards_record.empty and not actors_record.empty:
                awards_val = awards_record[sort_key].iloc[0] if sort_key in awards_record.columns else 'N/A'
                actors_val = actors_record[sort_key].iloc[0] if sort_key in actors_record.columns else 'N/A'
                print(f"  {i+1}. {pair['id1']} ({awards_val}) <-> {pair['id2']} ({actors_val})")
    print(f"\n📊 SortedNeighbourhood: {pair_count:,} pairs ({reduction_ratio:.4f} ratio)")

=== SortedNeighbourhood Blocking on 'title' ===
Window size: 5
Estimated pairs: 11852
Generated 1,456 candidate pairs in 2 batches
Reduction ratio: 0.0021 (99.8% reduction)

Sample candidate pairs:
                id1         id2
academy_awards-4416 actors-0000
academy_awards-2548 actors-0000
academy_awards-2506 actors-0000
academy_awards-0393 actors-0000
academy_awards-4567 actors-0000
academy_awards-0487 actors-0000
academy_awards-0931 actors-0000
academy_awards-0333 actors-0000
academy_awards-0504 actors-0000
academy_awards-2177 actors-0000

Sample pair details:
  1. academy_awards-4416 (42nd Street) <-> actors-0000 (7th Heaven)
  2. academy_awards-2548 (55 Days at Peking) <-> actors-0000 (7th Heaven)
  3. academy_awards-2506 (7 Faces of Dr. Lao) <-> actors-0000 (7th Heaven)

📊 SortedNeighbourhood: 1,456 pairs (0.0021 ratio)


In [33]:
try:
    sn_cands = sn_blocker.materialize()

    # Map internal _id to original id values 
    # Careful: the _id is an id we created, not the id in the gold standard
    left_map = df_awards.set_index('_id')['id']
    right_map = df_actors.set_index('_id')['id']
    sn_cands['id1'] = sn_cands['id1'].map(left_map)
    sn_cands['id2'] = sn_cands['id2'].map(right_map)

    sn_eval = BlockingEvaluator.evaluate(
        sn_cands,
        gold_pairs=gold,
        gold_label_col="label",
        total_possible_pairs=max_pairs,
        out_dir=str(OUTPUT_DIR / "sortedneighbourhood"),
    )
    blocking_stats.append({
        'strategy': f'SortedNeighbourhood(key={sort_key}, window=5)',
        'estimated_pairs': sn_blocker.estimate_pairs(),
        'actual_pairs': len(sn_cands),
        'batches_processed': batch_count,  
        'candidate_recall': sn_eval.get('candidate_recall'),
        'reduction_ratio': len(sn_cands) / max_pairs if max_pairs > 0 else 0,
        'processing_time_seconds': None
    })
    print("SortedNeighbourhood eval:", {k: sn_eval.get(k) for k in ["unique_candidates", "candidate_recall", "pair_reduction"]})
except Exception as e:
    print("SortedNeighbourhood evaluation failed:", e)


2025-08-29 11:48:51,756 - root - INFO - Blocking evaluation: candidates=1456 unique=1453 duplicates=3


SortedNeighbourhood eval: {'unique_candidates': 1453, 'candidate_recall': 0.9787234042553191, 'pair_reduction': 0.9978763767743143}


## 4. TokenBlocking

TokenBlocking splits text attributes into tokens and creates blocks for each token. Records sharing any token are considered candidate pairs.
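
The token-based approach can be sketched with an inverted index from token to record IDs. Whitespace tokenization, lowercasing, and the function name `token_blocking` are assumptions for this illustration; PyDI's tokenizer may behave differently.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set, Tuple


def token_blocking(
    left: List[Dict[str, Any]],
    right: List[Dict[str, Any]],
    column: str,
    min_token_len: int = 2,
) -> Set[Tuple[str, str]]:
    """Pair records that share at least one token of length >= `min_token_len`."""

    def tokens(text: Any) -> Set[str]:
        return {t for t in str(text).lower().split() if len(t) >= min_token_len}

    # Inverted index: token -> IDs of right-hand records containing it
    index = defaultdict(set)
    for rec in right:
        if rec.get(column) is not None:
            for tok in tokens(rec[column]):
                index[tok].add(rec["_id"])

    pairs: Set[Tuple[str, str]] = set()  # a set removes duplicate pairs
    for rec in left:
        if rec.get(column) is None:
            continue
        for tok in tokens(rec[column]):
            for right_id in index.get(tok, ()):
                pairs.add((rec["_id"], right_id))
    return pairs


# Toy records, made up for this example
left = [
    {"_id": "a-0", "title": "In the Heat of the Night"},
    {"_id": "a-1", "title": "True Grit"},
]
right = [
    {"_id": "b-0", "title": "Heat"},
    {"_id": "b-1", "title": "The True Story"},
]

tok_pairs = token_blocking(left, right, column="title")
print(sorted(tok_pairs))
```

The pair `("a-0", "b-1")` exists only because both titles contain "the". Frequent tokens like this inflate the candidate set, which is the usual price of the high recall that token blocking offers.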

In [34]:
# Use title for token blocking if available
if 'title' in common_cols:
    token_key = 'title'
else:
    # Find a text column for token blocking
    text_cols = [col for col in common_cols 
                 if df_awards[col].dtype == 'object' or df_actors[col].dtype == 'object']
    token_key = text_cols[0] if text_cols else None

if not token_key:
    print("⚠️  No suitable text column found for TokenBlocking")
    token_blocking_stats = None
else:
    print(f"=== TokenBlocking on '{token_key}' ===")
    
   
    token_blocker = TokenBlocking(
        df_awards,
        df_actors,
        column=token_key,     
        min_token_len=2,      
        batch_size=1000
    )
    
    print(f"Min token length: 2")
    print(f"Estimated pairs: {token_blocker.estimate_pairs() or 'Unknown'}")
    
    # Process all batches
    all_pairs = []
    batch_count = 0
    unique_pairs = set()
    
    for batch in token_blocker:
        batch_count += 1
        
        # Deduplicate pairs (TokenBlocking can generate duplicates)
        for _, row in batch.iterrows():
            pair_key = (row['id1'], row['id2'])
            if pair_key not in unique_pairs:
                unique_pairs.add(pair_key)
                all_pairs.append(row.to_dict())
        
        if batch_count >= 50:  # Limit batches
            print(f"Stopping after {batch_count} batches...")
            break
    
    pair_count = len(all_pairs)
    reduction_ratio = pair_count / max_pairs if max_pairs > 0 else 0
    
    print(f"Generated {pair_count:,} unique candidate pairs in {batch_count} batches")
    print(f"Reduction ratio: {reduction_ratio:.4f} ({100 * (1-reduction_ratio):.1f}% reduction)")
    
    print(f"\n📊 TokenBlocking: {pair_count:,} pairs ({reduction_ratio:.4f} ratio)")

=== TokenBlocking on 'title' ===
Min token length: 2
Estimated pairs: 80850
Stopping after 50 batches...
Generated 50,000 unique candidate pairs in 50 batches
Reduction ratio: 0.0731 (92.7% reduction)

📊 TokenBlocking: 50,000 pairs (0.0731 ratio)


In [35]:
# Evaluate TokenBlocking
try:
    token_cands = token_blocker.materialize()
    
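    # Map internal _id to original id values (the gold standard uses the original ids)
    left_map = df_awards.set_index('_id')['id']
    right_map = df_actors.set_index('_id')['id']
    token_cands['id1'] = token_cands['id1'].map(left_map)
    token_cands['id2'] = token_cands['id2'].map(right_map)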

    token_eval = BlockingEvaluator.evaluate(
        token_cands,
        gold_pairs=gold,
        gold_label_col="label",
        total_possible_pairs=max_pairs,
        out_dir=str(OUTPUT_DIR / "tokenblocking"),
    )
    blocking_stats.append({
        'strategy': f"TokenBlocking(column={token_key})",
        'estimated_pairs': token_blocker.estimate_pairs(),
        'actual_pairs': len(token_cands),
        'batches_processed': batch_count,
        'candidate_recall': token_eval.get('candidate_recall'),
        'reduction_ratio': len(token_cands) / max_pairs if max_pairs > 0 else 0,
        'processing_time_seconds': None
    })
    print("TokenBlocking eval:", {k: token_eval.get(k) for k in ["unique_candidates", "candidate_recall", "pair_reduction"]})
except Exception as e:
    print("TokenBlocking evaluation failed:", e)


2025-08-29 11:48:52,490 - root - INFO - Blocking evaluation: candidates=75242 unique=75126 duplicates=116


TokenBlocking eval: {'unique_candidates': 75126, 'candidate_recall': 1.0, 'pair_reduction': 0.8902000561232841}


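## 5. EmbeddingBlocking

EmbeddingBlocking encodes the selected text columns with a sentence-transformer model and keeps, for each record, its `top_k` most similar records whose similarity reaches `threshold`.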
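
The retrieval step behind embedding-based blocking can be sketched without a model: given one vector per record, rank the records of the other dataset by cosine similarity, keep the best `top_k`, and drop those below `threshold`. The two-dimensional toy vectors and the function name `embedding_blocking` are assumptions for this illustration. In the notebook the vectors come from a sentence-transformer model and the search uses an sklearn index.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_blocking(
    left_vecs: Dict[str, Sequence[float]],
    right_vecs: Dict[str, Sequence[float]],
    top_k: int,
    threshold: float,
) -> List[Tuple[str, str, float]]:
    """For each left record, keep its top_k right neighbours above the threshold."""
    pairs = []
    for left_id, left_vec in left_vecs.items():
        scored = sorted(
            ((cosine(left_vec, right_vec), right_id) for right_id, right_vec in right_vecs.items()),
            reverse=True,
        )
        for score, right_id in scored[:top_k]:
            if score >= threshold:
                pairs.append((left_id, right_id, score))
    return pairs


# Toy embeddings, made up for this example
left_vecs = {"a-0": [1.0, 0.0], "a-1": [0.0, 1.0]}
right_vecs = {"b-0": [0.9, 0.1], "b-1": [0.1, 0.9], "b-2": [-1.0, 0.0]}

emb_pairs = embedding_blocking(left_vecs, right_vecs, top_k=2, threshold=0.5)
print([(l, r, round(s, 3)) for l, r, s in emb_pairs])
```

Here `top_k` bounds the number of candidates per record, while `threshold` removes neighbours that are close only in rank. This brute-force loop compares every pair of vectors; an index structure avoids that cost on larger datasets.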

In [36]:
# Use title and actor_name for embedding-based blocking
text_columns = ['title', 'actor_name']

print(f"=== EmbeddingBlocking on {text_columns} ===")

# Initialize EmbeddingBlocking with smaller parameters for demo
embedding_blocker = EmbeddingBlocking(
    df_awards,
    df_actors,
    text_cols=text_columns,
    model="sentence-transformers/all-MiniLM-L6-v2",
    index_backend="sklearn",
    metric="cosine",
    top_k=20,           
    threshold=0.5,      
    normalize=True,
    batch_size=1000,
    query_batch_size=100  
)

print(f"Text columns: {text_columns}")
print(f"Model: sentence-transformers/all-MiniLM-L6-v2")
print(f"Index backend: sklearn")
print(f"Top-k neighbors: 20")
print(f"Similarity threshold: {embedding_blocker.threshold}")
print(f"Estimated pairs: {embedding_blocker.estimate_pairs() or 'Computing...'}...")


start_time = time.time()

all_pairs = []
batch_count = 0

print("\nGenerating candidate pairs...")
for batch in embedding_blocker:
    batch_count += 1
    all_pairs.extend(batch.to_dict('records'))
    
    if batch_count >= 20:  # Limit batches for demo
        print(f"Stopping after {batch_count} batches for demo...")
        break

processing_time = time.time() - start_time
pair_count = len(all_pairs)
reduction_ratio = pair_count / max_pairs if max_pairs > 0 else 0

print(f"\nGenerated {pair_count:,} candidate pairs in {batch_count} batches")
print(f"Processing time: {processing_time:.2f} seconds")
print(f"Reduction ratio: {reduction_ratio:.4f} ({100 * (1-reduction_ratio):.1f}% reduction)")

# Sample pairs for display
if all_pairs:
    sample_pairs = pd.DataFrame(all_pairs[:10])
    print("\nSample candidate pairs:")
    print(sample_pairs.to_string(index=False))
    
    # Show actual text values for sample pairs
    print("\nSample pair details (showing combined text):")
    for i, pair in enumerate(sample_pairs[:3].to_dict('records')):
        awards_record = df_awards[df_awards['_id'] == pair['id1']]
        actors_record = df_actors[df_actors['_id'] == pair['id2']]
        if not awards_record.empty and not actors_record.empty:
            # Combine text columns
            awards_text = ' '.join([
                str(awards_record[col].iloc[0]) if col in awards_record.columns and pd.notna(awards_record[col].iloc[0]) 
                else '' for col in text_columns
            ]).strip()
            actors_text = ' '.join([
                str(actors_record[col].iloc[0]) if col in actors_record.columns and pd.notna(actors_record[col].iloc[0]) 
                else '' for col in text_columns
            ]).strip()
            print(f"  {i+1}. Awards: '{awards_text}'")
            print(f"     Actors:  '{actors_text}'")

print(f"\n📊 EmbeddingBlocking: {pair_count:,} pairs ({reduction_ratio:.4f} ratio) in {processing_time:.2f}s")

2025-08-29 11:48:52,502 - PyDI.entitymatching.blocking.embedding - INFO - Initialized EmbeddingBlocking with sklearn backend, top_k=20, threshold=0.5
2025-08-29 11:48:52,507 - PyDI.entitymatching.blocking.embedding - INFO - Computing embeddings for datasets...
2025-08-29 11:48:52,508 - PyDI.entitymatching.blocking.embedding - INFO - Computing embeddings for left dataset (4592 records)
2025-08-29 11:48:52,508 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2


=== EmbeddingBlocking on ['title', 'actor_name'] ===
Text columns: ['title', 'actor_name']
Model: sentence-transformers/all-MiniLM-L6-v2
Index backend: sklearn
Top-k neighbors: 20
Similarity threshold: 0.5


2025-08-29 11:48:54,057 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: mps
2025-08-29 11:48:54,368 - PyDI.entitymatching.blocking.embedding - INFO - Loaded sentence transformer model: sentence-transformers/all-MiniLM-L6-v2
2025-08-29 11:48:55,742 - PyDI.entitymatching.blocking.embedding - INFO - Computing embeddings for right dataset (149 records)
2025-08-29 11:48:55,784 - PyDI.entitymatching.blocking.embedding - INFO - Built sklearn index with 149 vectors, metric=cosine
2025-08-29 11:48:55,839 - PyDI.entitymatching.blocking.embedding - INFO - Estimated 743 candidate pairs from 1000 samples
2025-08-29 11:48:55,854 - PyDI.entitymatching.blocking.embedding - INFO - Starting embedding-based blocking...


Estimated pairs: 743...

Generating candidate pairs...


2025-08-29 11:48:57,432 - PyDI.entitymatching.blocking.embedding - INFO - Completed embedding-based blocking



Generated 762 candidate pairs in 1 batches
Processing time: 1.60 seconds
Reduction ratio: 0.0011 (99.9% reduction)

Sample candidate pairs:
                id1         id2
academy_awards-0001 actors-0119
academy_awards-0002 actors-0119
academy_awards-0010 actors-0075
academy_awards-0011 actors-0028
academy_awards-0011 actors-0077
academy_awards-0012 actors-0067
academy_awards-0017 actors-0047
academy_awards-0044 actors-0095
academy_awards-0071 actors-0055
academy_awards-0082 actors-0124

Sample pair details (showing combined text):
  1. Awards: 'True Grit Jeff Bridges'
     Actors:  'True Grit John Wayne'
  2. Awards: 'True Grit Jeff Bridges'
     Actors:  'True Grit John Wayne'
  3. Awards: 'Rabbit Hole Nicole Kidman'
     Actors:  'The Hours Nicole Kidman'

📊 EmbeddingBlocking: 762 pairs (0.0011 ratio) in 1.60s
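
The log above reports a cosine-metric index queried with `top_k=20` and `threshold=0.5`. As a rough illustration of that idea (not PyDI's actual implementation), the sketch below keeps, for every left record, its `top_k` most similar right records and drops those below the threshold. The function name `topk_cosine_pairs` and the toy 2-d vectors are hypothetical; real embeddings would come from the sentence-transformer model.

```python
import numpy as np

def topk_cosine_pairs(left_vecs, right_vecs, top_k=2, threshold=0.5):
    """Return (left_idx, right_idx, similarity) for each left row's top_k
    most similar right rows whose cosine similarity is >= threshold."""
    # L2-normalise so that a dot product equals cosine similarity
    left = left_vecs / np.linalg.norm(left_vecs, axis=1, keepdims=True)
    right = right_vecs / np.linalg.norm(right_vecs, axis=1, keepdims=True)
    sims = left @ right.T  # shape: (n_left, n_right)

    k = min(top_k, sims.shape[1])
    pairs = []
    for i, row in enumerate(sims):
        # Indices of the k largest similarities, best first
        best = np.argsort(-row)[:k]
        for j in best:
            if row[j] >= threshold:
                pairs.append((i, int(j), float(row[j])))
    return pairs

# Toy 2-d "embeddings" (hypothetical values, not model output)
left_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
right_vecs = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]])

pairs = topk_cosine_pairs(left_vecs, right_vecs, top_k=2, threshold=0.5)
for i, j, sim in pairs:
    print(f"left {i} -> right {j} (cosine {sim:.3f})")
```

Because each left record contributes at most `top_k` pairs, the candidate set is bounded by `len(left) * top_k` regardless of how large the right dataset is; the threshold then prunes weak neighbours from that set.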


In [37]:
# Evaluate EmbeddingBlocking
try:
    embedding_cands = embedding_blocker.materialize()
    
    # Map internal _id to original id values 
    embedding_cands['id1'] = embedding_cands['id1'].map(left_map)
    embedding_cands['id2'] = embedding_cands['id2'].map(right_map)

    embedding_eval = BlockingEvaluator.evaluate(
        embedding_cands,
        gold_pairs=gold,
        gold_label_col="label",
        total_possible_pairs=max_pairs,
        out_dir=str(OUTPUT_DIR / "embeddingblocking"),
    )
    
    # Store embedding blocking stats for comparison
    embedding_blocking_stats = {
        'strategy': f"EmbeddingBlocking(text_cols={text_columns})",
        'estimated_pairs': embedding_blocker.estimate_pairs() if hasattr(embedding_blocker, "estimate_pairs") else None,
        'actual_pairs': len(embedding_cands),
        'batches_processed': batch_count,
        'candidate_recall': embedding_eval.get('candidate_recall'),
        'reduction_ratio': len(embedding_cands) / max_pairs if max_pairs > 0 else 0,
        'processing_time_seconds': processing_time if 'processing_time' in locals() else None,
    }
    blocking_stats.append(embedding_blocking_stats)
    print("EmbeddingBlocking eval:", {k: embedding_eval.get(k) for k in ["unique_candidates", "candidate_recall", "pair_reduction"]})
    
except Exception as e:
    print("EmbeddingBlocking evaluation failed:", e)

2025-08-29 11:48:57,459 - PyDI.entitymatching.blocking.embedding - INFO - Starting embedding-based blocking...
2025-08-29 11:48:59,084 - PyDI.entitymatching.blocking.embedding - INFO - Completed embedding-based blocking
2025-08-29 11:48:59,096 - root - INFO - Blocking evaluation: candidates=762 unique=761 duplicates=1
2025-08-29 11:48:59,135 - PyDI.entitymatching.blocking.embedding - INFO - Estimated 794 candidate pairs from 1000 samples


EmbeddingBlocking eval: {'unique_candidates': 761, 'candidate_recall': 1.0, 'pair_reduction': 0.9988877651240559}
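
To make the reported metrics concrete, here is a minimal sketch of how they could be computed from a candidate list and a set of gold matches. It assumes `candidate_recall` is the share of gold matches present among the candidates and `pair_reduction` is `1 - unique_candidates / total_possible_pairs`; the function `blocking_metrics` is hypothetical and is not `BlockingEvaluator`'s code. The second assumption is at least consistent with the numbers above: `1 - 761 / 684208` gives the logged `pair_reduction`.

```python
def blocking_metrics(candidates, gold_matches, total_possible_pairs):
    """Compute simple blocking quality metrics.

    candidates: iterable of (id1, id2) tuples, possibly with duplicates
    gold_matches: set of (id1, id2) tuples known to be true matches
    total_possible_pairs: size of the full Cartesian product
    """
    unique = set(candidates)            # duplicates do not count twice
    found = unique & set(gold_matches)  # true matches that survived blocking
    recall = len(found) / len(gold_matches) if gold_matches else 0.0
    return {
        "unique_candidates": len(unique),
        "candidate_recall": recall,
        "pair_reduction": 1 - len(unique) / total_possible_pairs,
    }

# Toy data (hypothetical ids): 3 x 3 records, one duplicated candidate pair
candidates = [("a1", "b1"), ("a1", "b2"), ("a2", "b2"), ("a1", "b1")]
gold_matches = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

metrics = blocking_metrics(candidates, gold_matches, total_possible_pairs=9)
print(metrics)

# Sanity check against the run above: 761 unique candidates of 684,208 pairs
print(1 - 761 / 684208)
```

Recall and reduction pull in opposite directions: emitting fewer candidates improves `pair_reduction` but risks dropping true matches, so the two values are best read together.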


## Updated Comparison with EmbeddingBlocking

Let's update our comparison to include the new EmbeddingBlocking results.

In [41]:
comparison_df = pd.DataFrame(blocking_stats).sort_values("candidate_recall", ascending=False)
comparison_df

| | strategy | estimated_pairs | actual_pairs | batches_processed | candidate_recall | reduction_ratio | processing_time_seconds |
|---|---|---|---|---|---|---|---|
| 2 | TokenBlocking(column=title) | 80850 | 75242 | 50.0 | 1.0 | 0.109969 | NaN |
| 3 | EmbeddingBlocking(text_cols=['title', 'actor_n... | 794 | 762 | 1.0 | 1.0 | 0.001114 | 1.599513 |
| 1 | SortedNeighbourhood(key=title, window=5) | 11852 | 1456 | 2.0 | 0.978723 | 0.002128 | NaN |
| 0 | StandardBlocking(on=[title]) | 138 | 138 | NaN | 0.723404 | 0.000202 | NaN |
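
One way to read this comparison is to fix a minimum acceptable recall and then prefer the strategy that produces the fewest candidate pairs. The sketch below applies that rule to the figures from the table; the helper `pick_strategy`, the shortened strategy names, and the 0.95 recall floor are illustrative assumptions, not part of PyDI.

```python
# Figures copied from the comparison table (strategy names shortened)
stats = [
    {"strategy": "TokenBlocking", "actual_pairs": 75242, "candidate_recall": 1.0},
    {"strategy": "EmbeddingBlocking", "actual_pairs": 762, "candidate_recall": 1.0},
    {"strategy": "SortedNeighbourhood", "actual_pairs": 1456, "candidate_recall": 0.978723},
    {"strategy": "StandardBlocking", "actual_pairs": 138, "candidate_recall": 0.723404},
]

def pick_strategy(stats, min_recall=0.95):
    """Among strategies meeting the recall floor, return the one
    with the fewest candidate pairs (or None if none qualifies)."""
    eligible = [s for s in stats if s["candidate_recall"] >= min_recall]
    if not eligible:
        return None
    return min(eligible, key=lambda s: s["actual_pairs"])

best = pick_strategy(stats, min_recall=0.95)
print(best["strategy"], best["actual_pairs"])
```

Ranking by pair count alone would favour StandardBlocking, which misses more than a quarter of the gold matches here; the recall floor is what keeps that trade-off visible.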


## Final Artifact Generation

Save the run metadata and the per-strategy blocking statistics to the output directory.

In [44]:
results_updated = {
    'metadata': {
        'generated_at': datetime.now().isoformat(),
        'datasets': {
            'academy_awards': {
                'path': str(ACADEMY_AWARDS_PATH),
                'shape': df_awards.shape,
                'columns': list(df_awards.columns)
            },
            'actors': {
                'path': str(ACTORS_PATH),
                'shape': df_actors.shape,
                'columns': list(df_actors.columns)
            }
        },
        'max_possible_pairs': max_pairs,
        'common_columns': list(common_cols)
    },
    'blocking_results': blocking_stats 
}

# Save final results as JSON
results_path_final = OUTPUT_DIR / 'blocking_comparison_final.json'
with open(results_path_final, 'w', encoding='utf-8') as f:
    json.dump(results_updated, f, indent=2, ensure_ascii=False)
    
print(f"\n📁 Final detailed results saved to: {results_path_final}")

# Save final comparison DataFrame as CSV
if 'comparison_df' in locals():
    csv_path_final = OUTPUT_DIR / 'blocking_comparison_final.csv'
    comparison_df.to_csv(csv_path_final, index=False, encoding='utf-8')
    print(f"📊 Final comparison table saved to: {csv_path_final}")

# Updated summary with EmbeddingBlocking
print(f"\n✅ All final artifacts saved to: {OUTPUT_DIR.absolute()}")

print("\n=== Final Summary ===")
print(f"Datasets: Academy Awards ({len(df_awards)} records) × Actors ({len(df_actors)} records)")

print(f"Blocking strategies tested: {len(blocking_stats)}")
print(f"Maximum possible pairs: {max_pairs:,}")

# Find best performing strategies
if blocking_stats and len(blocking_stats) > 1:
    blocking_final_stats = [s for s in blocking_stats if s['strategy'] != 'NoBlocking' and s.get('reduction_ratio')]
    if blocking_final_stats:
        best_reduction = max((1-s['reduction_ratio']) for s in blocking_final_stats if s.get('reduction_ratio'))
        print(f"Best reduction achieved: {best_reduction*100:.1f}%")
        
        # Find best recall
        recall_final_stats = [s for s in blocking_stats if s.get('candidate_recall') is not None]
        if recall_final_stats:
            best_recall = max(s['candidate_recall'] for s in recall_final_stats)
            print(f"Best recall achieved: {best_recall*100:.1f}%")

print("\n🎯 Next steps: Use these candidate pairs for entity matching with similarity functions!")
print("🧠 EmbeddingBlocking provides excellent semantic matching capabilities for text-heavy datasets!")


📁 Final detailed results saved to: /Users/aaronsteiner/Documents/GitHub/PyDI/output/examples/entitymatching/blocking_demo/blocking_comparison_final.json

✅ All final artifacts saved to: /Users/aaronsteiner/Documents/GitHub/PyDI/output/examples/entitymatching/blocking_demo

=== Final Summary ===
Datasets: Academy Awards (4592 records) × Actors (149 records)
Blocking strategies tested: 4
Maximum possible pairs: 684,208
Best reduction achieved: 100.0%
Best recall achieved: 100.0%

🎯 Next steps: Use these candidate pairs for entity matching with similarity functions!
🧠 EmbeddingBlocking provides excellent semantic matching capabilities for text-heavy datasets!
