# PyDI Data Integration Tutorial

This tutorial walks through an end-to-end data integration pipeline with PyDI. Using three movie datasets, it covers data loading and profiling, entity matching, and data fusion.

### What You'll Learn

1. **Data Loading & Profiling**: Load and analyze movie datasets with provenance tracking
2. **Entity Matching**: 
   - Blocking strategies (Standard, Sorted Neighbourhood, Token-based, Embedding-based)
   - Multi-attribute similarity matching with custom comparators
   - Machine learning-based entity matching
3. **Data Fusion**: 
   - Conflict resolution with custom fusion rules
   - Quality assessment against a test set
   - Provenance-based conflict resolution

### Datasets

We'll use three movie datasets:
- **Academy Awards**: Movies with Oscar information (4,592 records)
- **Actors**: Movies with actor details (149 records) 
- **Golden Globes**: Movies with Golden Globe awards (2,286 records)

These datasets contain overlapping movie information but with different attributes, data quality issues, and conflicting values - perfect for demonstrating real-world data integration challenges.

In [None]:
from pathlib import Path

# Setup paths
def get_repo_root():
    """Get repository root directory."""
    current = Path.cwd()
    while current != current.parent:
        if (current / 'pyproject.toml').exists():
            return current
        current = current.parent
    return Path.cwd()

ROOT = get_repo_root()
OUTPUT_DIR = ROOT / "PyDI" / "tutorial" / "output" / "movies"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("PyDI Tutorial")
print(f"Repository root: {ROOT}")
print(f"Output directory: {OUTPUT_DIR}")
print("All systems ready! 🚀")

## Part 1: Data Loading and Profiling

PyDI provides provenance-aware data loading that automatically tracks dataset metadata and optionally adds unique identifiers to each record. Let's load our movie datasets and understand their characteristics.

In [None]:
from PyDI.io import load_xml

# Define dataset paths
DATA_DIR = ROOT / "docs" / "tutorial" / "input" / "movies"

# Load Academy Awards dataset
academy_awards = load_xml(
    DATA_DIR / "data" / "academy_awards.xml",
    name="academy_awards",
    nested_handling="aggregate"
)

# Load Actors dataset  
actors = load_xml(
    DATA_DIR / "data" / "actors.xml",
    name="actors", 
    nested_handling="aggregate"
)

# Load Golden Globes dataset
golden_globes = load_xml(
    DATA_DIR / "data" / "golden_globes.xml",
    name="golden_globes",
    nested_handling="aggregate"
)

# Display basic information
datasets = [academy_awards, actors, golden_globes]
names = ["Academy Awards", "Actors", "Golden Globes"]

for df, name in zip(datasets, names):
    print(f"{name}:")
    print(f"  Records: {len(df):,}")
    print(f"  Attributes: {len(df.columns)}")
    print(f"  Columns: {list(df.columns)}")
    print(f"  Dataset name: {df.attrs.get('dataset_name', 'unknown')}")
    print()

total_records = sum(len(df) for df in datasets)
print(f"Total records across all datasets: {total_records:,}")

In [None]:
# Preview the data structure

print("\n📽️ Academy Awards Dataset:")
display(academy_awards.head(3))

print("\n🎭 Actors Dataset:")
display(actors.head(3))

print("\n🏆 Golden Globes Dataset:")
display(golden_globes.head(3))

### Data Quality Analysis

Let's use PyDI's profiling capabilities to understand our data quality and identify the best attributes for matching.

### Basic Dataset Summary

First, let's use the DataProfiler's `summary()` method to get basic statistics for each dataset.

In [None]:
from PyDI.profiling import DataProfiler

# Initialize the DataProfiler
profiler = DataProfiler()

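summaries = {}
for df, name in zip(datasets, names):
    print(f"\n{name}")
    # summary() prints some statistics and returns an object containing them
    summaries[name] = profiler.summary(df)
    display(summaries[name])  # show every dataset's summary, not only the last one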

### Attribute Coverage Analysis

Next, let's use the `analyze_coverage()` method to understand how attributes overlap across datasets.
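
Conceptually, a coverage analysis counts, for each attribute, how many datasets contain it. The sketch below illustrates the idea in plain Python. The column lists and the `coverage_table` helper are hypothetical, and this is not how `analyze_coverage()` is implemented.

```python
from collections import defaultdict

# Hypothetical column lists; the real datasets may expose different attributes.
schemas = {
    "academy_awards": ["id", "title", "date", "director_name", "oscar"],
    "actors": ["id", "title", "date", "actors_actor_name"],
    "golden_globes": ["id", "title", "date", "director_name", "globe"],
}

def coverage_table(schemas):
    """Map each attribute to the sorted list of datasets that contain it."""
    table = defaultdict(list)
    for dataset, columns in schemas.items():
        for column in columns:
            table[column].append(dataset)
    return {attr: sorted(names) for attr, names in table.items()}

coverage_sketch = coverage_table(schemas)

# Attributes present in at least two datasets are candidates for matching.
shared = sorted(attr for attr, names in coverage_sketch.items() if len(names) >= 2)
print(shared)
```

Attributes that occur in only one dataset cannot be compared across sources, which is why the filter keeps those found in two or more.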

In [None]:
coverage = profiler.analyze_coverage(
    datasets=datasets,
    include_samples=True,
    sample_count=3  # Show 3 sample values per attribute
)

print("📊 Attribute coverage across datasets:")
display(coverage)

# Identify attributes suitable for entity matching
print("\n🔗 Attributes suitable for entity matching:")
matching_attrs = coverage[coverage['datasets_with_attribute'] >= 2]['attribute'].tolist()
print(f"Attributes available in 2+ datasets: {matching_attrs}")

### Detailed Data Profiling

Now let's generate comprehensive HTML profiles for each dataset using the `profile()` method. These reports provide in-depth statistical analysis.

In [None]:
# Generate detailed HTML profiles for each dataset

profile_dir = OUTPUT_DIR / "dataset-profiles"
profile_dir.mkdir(parents=True, exist_ok=True)

profile_paths = []

for df, name in zip(datasets, names):
    print(f"📊 Profiling {name}...")
    
    profile_path = profiler.profile(df, str(profile_dir))
    profile_paths.append(profile_path)
    print(f"  ✅ Profile saved: {profile_path}")

print(f"\n🎯 Generated {len(profile_paths)} detailed HTML reports")
print(f"📁 Location: {profile_dir}")
print("\n💡 Open these HTML files in your browser for interactive exploration:")
for path in profile_paths:
    print(f"  • {Path(path).name}")


## Part 2: Entity Matching

Entity Matching is the process of identifying records that refer to the same real-world entity. PyDI implements different blocking and matching methods.

### Step 1: Blocking

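Comparing every record of one dataset with every record of the other requires |A| × |B| comparisons (O(n²) for datasets of similar size). Blocking restricts the comparison to a much smaller set of candidate pairs that are likely to match. Let's explore different blocking strategies.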
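
As a rough illustration of the idea, independent of PyDI, the following sketch pairs only records that share a blocking key. The toy records and the `block_key` function (first letter of the title) are assumptions made for this example.

```python
from collections import defaultdict

# Toy records as (id, title); the values are made up for illustration.
left = [("a1", "The Godfather"), ("a2", "Casablanca"), ("a3", "Gandhi")]
right = [("g1", "Godfather, The"), ("g2", "Casablanca"),
         ("g3", "Gladiator"), ("g4", "Chicago")]

def block_key(title):
    """Hypothetical blocking key: first letter of the title, upper-cased."""
    return title[:1].upper()

def blocked_pairs(left, right):
    """Return only those (left_id, right_id) pairs whose blocking keys agree."""
    index = defaultdict(list)
    for rid, title in right:
        index[block_key(title)].append(rid)
    return [(lid, rid)
            for lid, title in left
            for rid in index.get(block_key(title), [])]

all_pairs = len(left) * len(right)
candidates = blocked_pairs(left, right)
print(f"{len(candidates)} of {all_pairs} pairs kept: {candidates}")
```

Note that the pair `("a1", "g1")` is lost because the two titles start with different letters. This is the trade-off every blocker faces: fewer comparisons at the risk of missing true matches.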

In [None]:
# Let's set up logging first
import logging

import os
os.makedirs('output/logs', exist_ok=True)

# choose either default logging or debug logging

# # Configure logging for INFO level
# logging.basicConfig(
#     level=logging.INFO,
#     format='[%(levelname)-5s] %(name)s - %(message)s',
#     handlers=[
#           logging.FileHandler('output/logs/pydi.log'),  # Save to file
#           logging.StreamHandler()                      # Display on console
#       ],
#     force=True
# )

# Configure logging for DEBUG level
logging.basicConfig(
    level=logging.DEBUG,
    format='[%(levelname)-5s] %(name)s - %(message)s',
    handlers=[
          logging.FileHandler('output/logs/pydi.log'),  # Save to file
          logging.StreamHandler()                      # Display on console
      ],
    force=True
)

In [None]:
from PyDI.entitymatching import NoBlocker, StandardBlocker, SortedNeighbourhoodBlocker, TokenBlocker, EmbeddingBlocker

# We'll focus on Actors and Golden Globes for showcasing blocking strategies

max_pairs = len(actors) * len(golden_globes)
print(f"Without blocking: {max_pairs:,} comparisons required")
print("\n🎯 Goal: Reduce comparisons while maintaining high recall\n")

# No Blocking - compare all possible pairs
print("\n No Blocking")

no_blocker = NoBlocker(
    actors, golden_globes,
    batch_size=1000,
    id_column='id'  # specify the ID column for both datasets
)

# in an actual large-scale application, we do not build a list of all pairs but stream over them like this
for batch in no_blocker:
    # do something with the pairs
    continue

# but we can also generate the full set of pairs for smaller datasets
no_candidates = no_blocker.materialize()

print(f"  Generated: {len(no_candidates):,} candidates")

Now let's use an actual blocker. Note that instantiating a blocker also writes a corresponding debug file to `output_dir`.

In [None]:
# 1. Standard Blocking - key built from the first 2 characters of each of the first three title tokens
print("\n1️⃣ Standard Blocking (Concatenation of first 2 characters of each of the first three tokens of title)")

# Add title_prefix directly to the original dataframes
actors['title_prefix'] = actors['title'].astype(str).apply(lambda x: ''.join([word[:2].upper() for word in x.split()[:3]]))
golden_globes['title_prefix'] = golden_globes['title'].astype(str).apply(lambda x: ''.join([word[:2].upper() for word in x.split()[:3]]))

standard_blocker_a2g = StandardBlocker(
    actors, golden_globes,
    on=['title_prefix'],
    batch_size=1000,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)

standard_candidates_a2g = standard_blocker_a2g.materialize()

print()
print(f"  Generated: {len(standard_candidates_a2g):,} candidates")

In [None]:
# 2. Sorted Neighbourhood - Sequential similarity
print("\n2️⃣ Sorted Neighbourhood Blocking (Title-based, Window=20)")

sn_blocker_a2g = SortedNeighbourhoodBlocker(
    actors, golden_globes,
    key='title',  # Sort by title
    window=20,     # Compare with 20 neighbors
    batch_size=1000,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)

sn_candidates_a2g = sn_blocker_a2g.materialize()

print()
print(f"  Generated: {len(sn_candidates_a2g):,} candidates")

In [None]:
# 3. Token Blocking - Token-based similarity
print("\n3️⃣ Token Blocking (Title Tokens, Min Length=3)")

token_blocker_a2g = TokenBlocker(
    actors, golden_globes,
    column='title',      # Tokenize titles
    min_token_len=3,     # Ignore very short tokens
    batch_size=1000,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)

token_candidates_a2g = token_blocker_a2g.materialize()

print()
print(f"  Generated: {len(token_candidates_a2g):,} candidates")

In [None]:
# 4. Embedding Blocking - Semantic similarity
print("\n4️⃣ Embedding Blocking (Semantic Similarity)")

embedding_blocker_a2g = EmbeddingBlocker(
    actors, golden_globes,
    text_cols=['title'],
    model="sentence-transformers/all-MiniLM-L6-v2",
    index_backend="sklearn",
    top_k=20,          # Top 20 most similar
    batch_size=500,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)
    
embedding_candidates_a2g = embedding_blocker_a2g.materialize()

print()
print(f"  Generated: {len(embedding_candidates_a2g):,} candidates")

### Step 2: Evaluation Against Ground Truth

PyDI provides evaluation methods for blocking with pair completeness, pair quality, and reduction ratio:
- **`evaluate_blocking()`**: Evaluates blocking given an already materialized set of pairs.
- **`evaluate_blocking_batched()`**: Evaluates blocking by iterating over batches and storing results. Useful for very large datasets.
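
For orientation, the three measures can be sketched in a few lines using their common textbook definitions. This is a minimal sketch with made-up pairs; PyDI's exact computation may differ, for example in how it treats labelled non-matches in the test set.

```python
def blocking_metrics(candidates, true_matches, n_left, n_right):
    """Compute pair completeness, pair quality and reduction ratio."""
    candidates = set(candidates)
    true_matches = set(true_matches)
    found = candidates & true_matches          # true matches that survived blocking
    total_pairs = n_left * n_right             # size of the full cross product
    return {
        # share of true matches that are still among the candidates (recall)
        "pair_completeness": len(found) / len(true_matches) if true_matches else 0.0,
        # share of candidates that are true matches (precision)
        "pair_quality": len(found) / len(candidates) if candidates else 0.0,
        # share of the cross product that blocking removed
        "reduction_ratio": 1 - len(candidates) / total_pairs if total_pairs else 0.0,
    }

metrics = blocking_metrics(
    candidates=[("a1", "g1"), ("a1", "g2"), ("a2", "g2"), ("a3", "g4")],
    true_matches=[("a1", "g1"), ("a2", "g2"), ("a3", "g3")],
    n_left=3,
    n_right=4,
)
print(metrics)
```

Pair completeness and reduction ratio pull in opposite directions: a blocker that keeps more candidates usually finds more true matches but removes fewer comparisons.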

Let's first evaluate materialized blocking results against a set of provided ground truth correspondences.

In [None]:
import pandas as pd
from PyDI.io import load_csv
from PyDI.entitymatching import EntityMatchingEvaluator
# Showcase EntityMatchingEvaluator.evaluate_blocking utility

# Load test set with proper column names
test_gt = load_csv(
    DATA_DIR / "entitymatching" / "actors_2_golden_globes_test.csv",
    name="test_set", header=None, names=['id1', 'id2', 'label'], add_index=False
)

# Use EntityMatchingEvaluator.evaluate_blocking on Standard Blocking
results = EntityMatchingEvaluator.evaluate_blocking(
    candidate_pairs=standard_candidates_a2g,
    blocker=standard_blocker_a2g,
    test_pairs=test_gt,
    out_dir=OUTPUT_DIR / "blocking-evaluation"
)

print(f"\n💡 Evaluating pair quality only makes sense if the test set contains all possible pairs, which is not the case in this example!")

display(results)

When the datasets are too large to materialize the full set of candidate pairs, use `evaluate_blocking_batched()` instead, which evaluates the blocker batch by batch.
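
The principle behind batched evaluation can be sketched as follows: stream over the candidate batches, keep only running counts and the true matches seen so far, and never hold all candidates in memory. The `candidate_batches` generator is a hypothetical stand-in for iterating over a blocker; this is not PyDI's implementation.

```python
def candidate_batches(pairs, batch_size):
    """Yield successive slices of `pairs`; stands in for iterating a blocker."""
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]

def pair_completeness_batched(batches, true_matches):
    """Return (pair completeness, number of candidates) from a stream of batches."""
    true_matches = set(true_matches)
    found = set()          # only the true matches encountered are stored
    n_candidates = 0
    for batch in batches:
        n_candidates += len(batch)
        found.update(pair for pair in batch if pair in true_matches)
    completeness = len(found) / len(true_matches) if true_matches else 0.0
    return completeness, n_candidates

pairs = [("a1", "g1"), ("a1", "g2"), ("a2", "g2"), ("a3", "g4"), ("a3", "g1")]
truth = [("a1", "g1"), ("a2", "g2"), ("a3", "g3")]

completeness, n_candidates = pair_completeness_batched(
    candidate_batches(pairs, batch_size=2), truth
)
print(f"pair completeness={completeness:.3f}, candidates={n_candidates}")
```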

In [None]:
results = EntityMatchingEvaluator.evaluate_blocking_batched(
    blocker=standard_blocker_a2g,
    test_pairs=test_gt,
    out_dir=OUTPUT_DIR / "blocking-evaluation"
)

display(results)

Let's do the same kind of blocking for the dataset combination Academy Awards <-> Actors:

In [None]:
# Add title_prefix directly to the original dataframes
academy_awards['title_prefix'] = academy_awards['title'].astype(str).apply(lambda x: ''.join([word[:2].upper() for word in x.split()[:3]]))

standard_blocker_aa2a = StandardBlocker(
    academy_awards, actors,
    on=['title_prefix'],  # Block on the prefix key built from the first three title tokens
    batch_size=1000,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)
standard_candidates_aa2a = standard_blocker_aa2a.materialize()

sn_blocker_aa2a = SortedNeighbourhoodBlocker(
    academy_awards, actors,
    key='title',  # Sort by title
    window=20,     # Compare with 20 neighbors
    batch_size=1000,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)
sn_candidates_aa2a = sn_blocker_aa2a.materialize()

token_blocker_aa2a = TokenBlocker(
    academy_awards, actors,
    column='title',      # Tokenize titles
    min_token_len=3,     # Ignore very short tokens
    batch_size=1000,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)
token_candidates_aa2a = token_blocker_aa2a.materialize()

embedding_blocker_aa2a = EmbeddingBlocker(
    academy_awards, actors,
    text_cols=['title'],
    model="sentence-transformers/all-MiniLM-L6-v2",
    index_backend="sklearn",
    top_k=20,          # Top 20 most similar
    batch_size=500,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)
embedding_candidates_aa2a = embedding_blocker_aa2a.materialize()

Now let's evaluate which blocking method we want to use for each dataset combination:

In [None]:
# Evaluate all blocking methods for both dataset combinations

evaluator = EntityMatchingEvaluator()

# Create dictionaries of candidates for both dataset combinations
a2g_blocking_candidates = {
    'StandardBlocking': [standard_candidates_a2g, standard_blocker_a2g],
    'SortedNeighbourhood': [sn_candidates_a2g, sn_blocker_a2g],
    'TokenBlocking': [token_candidates_a2g,token_blocker_a2g],
    'EmbeddingBlocking': [embedding_candidates_a2g,embedding_blocker_a2g]
}

aa2a_blocking_candidates = {
    'StandardBlocking': [standard_candidates_aa2a,standard_blocker_aa2a],
    'SortedNeighbourhood': [sn_candidates_aa2a, sn_blocker_aa2a],
    'TokenBlocking': [token_candidates_aa2a,token_blocker_aa2a],
    'EmbeddingBlocking': [embedding_candidates_aa2a,embedding_blocker_aa2a]
}

# Load correspondences for evaluation
a2g_correspondences = load_csv(
    DATA_DIR / "entitymatching" / "actors_2_golden_globes_test.csv",
    name="a2g_test", header=None, names=['id1', 'id2', 'label'], add_index=False
)

aa2a_correspondences = load_csv(
    DATA_DIR / "entitymatching" / "academy_awards_2_actors_test.csv",
    name="aa2a_test", header=None, names=['id1', 'id2', 'label'], add_index=False
)

# Evaluate blocking for a2g datasets
a2g_results = []
for method_name, candidates in a2g_blocking_candidates.items():
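    # Keyword arguments, matching the earlier evaluate_blocking() call, avoid any ambiguity in argument order
    result = evaluator.evaluate_blocking(
        candidate_pairs=candidates[0],
        blocker=candidates[1],
        test_pairs=a2g_correspondences,
        out_dir=OUTPUT_DIR / "blocking-evaluation"
    )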
    result['method'] = method_name
    result['dataset'] = 'a2g'
    a2g_results.append(result)

# Evaluate blocking for aa2a datasets  
aa2a_results = []
for method_name, candidates in aa2a_blocking_candidates.items():
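    # Keyword arguments, matching the earlier evaluate_blocking() call, avoid any ambiguity in argument order
    result = evaluator.evaluate_blocking(
        candidate_pairs=candidates[0],
        blocker=candidates[1],
        test_pairs=aa2a_correspondences,
        out_dir=OUTPUT_DIR / "blocking-evaluation"
    )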
    result['method'] = method_name
    result['dataset'] = 'aa2a'
    aa2a_results.append(result)

# Select best method for each dataset (highest pair_completeness, then highest reduction_ratio)
a2g_best = max(a2g_results, key=lambda x: (x['pair_completeness'], x['reduction_ratio']))
aa2a_best = max(aa2a_results, key=lambda x: (x['pair_completeness'], x['reduction_ratio']))

print(f"Best blocking for a2g: {a2g_best['method']} (PC: {a2g_best['pair_completeness']:.3f}, RR: {a2g_best['reduction_ratio']:.3f})")
print(f"Best blocking for aa2a: {aa2a_best['method']} (PC: {aa2a_best['pair_completeness']:.3f}, RR: {aa2a_best['reduction_ratio']:.3f})")

### Step 3: Entity Matching with Comparators

Now we'll use PyDI's linear matching rule capabilities to find duplicate movies using multiple attribute comparisons.
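
A linear matching rule combines per-attribute similarities into a weighted sum and declares a match when the sum reaches a threshold. The sketch below shows this with two hand-written similarity functions. The functions, weights and toy records are assumptions for illustration and do not reproduce how PyDI's comparators score values.

```python
def jaccard(a, b):
    """Jaccard similarity of the lower-cased whitespace tokens of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def year_similarity(y1, y2, max_diff=1):
    """1.0 for identical years, falling linearly to 0.0 at `max_diff` years apart."""
    return max(0.0, 1.0 - abs(y1 - y2) / max_diff)

def linear_rule(left, right, weights, threshold):
    """Return (score, is_match) for the weighted sum of the two similarities."""
    similarities = [
        jaccard(left["title"], right["title"]),
        year_similarity(left["year"], right["year"]),
    ]
    score = sum(w * s for w, s in zip(weights, similarities))
    return score, score >= threshold

weights, threshold = [0.8, 0.2], 0.7

score_a, match_a = linear_rule(
    {"title": "Casablanca", "year": 1942},
    {"title": "casablanca", "year": 1943},
    weights, threshold,
)
score_b, match_b = linear_rule(
    {"title": "Star Wars", "year": 1977},
    {"title": "Star Wars Episode IV", "year": 1977},
    weights, threshold,
)
print(score_a, match_a)
print(score_b, match_b)
```

The second pair fails despite the identical year because the title dominates the weighting. This is the kind of behaviour to keep in mind when choosing weights and a threshold.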

First, we define some comparators for attributes relevant to matching:

In [None]:
from PyDI.entitymatching import StringComparator, DateComparator, NumericComparator

# Create comparators for different attributes
comparators = [
    # Title similarity - most important for movies
    StringComparator(
        column='title',
        similarity_function='jaccard',  # Good for movie titles
        preprocess=str.lower  # Case normalization
    ),
    
    # Date proximity - movies from same year likely same film
    DateComparator(
        column='date', 
        max_days_difference=365  # Allow 1 year difference
    ),
    
    # Actor name similarity - supporting evidence
    StringComparator(
        column='actors_actor_name',
        similarity_function='jaccard',  # Good for names
        preprocess=str.lower,
        list_strategy='concatenate' # Handle list attribute by concatenation
    )
]

Next, we set up the matcher and run the matching with our chosen blocking method:

In [None]:
from PyDI.entitymatching import RuleBasedMatcher

# Initialize the blocker
embedding_blocker_a2g = EmbeddingBlocker(
    actors, golden_globes,
    text_cols=['title'],
    model="sentence-transformers/all-MiniLM-L6-v2",
    index_backend="sklearn",
    top_k=20,          # Top 20 most similar
    batch_size=500,
    output_dir=OUTPUT_DIR / "blocking-evaluation",
    id_column='id'
)

# Initialize Rule-Based Matcher
matcher = RuleBasedMatcher()

correspondences_a2g = matcher.match(
    df_left=actors,
    df_right=golden_globes, 
    candidates=embedding_blocker_a2g, # pass the blocker, which will internally generate candidate pairs using batching
    comparators=comparators,
    weights=[0.7, 0.2, 0.1],  # Title most important, then date, then actor,
    threshold=0.7, # set a similarity threshold for a match
    id_column='id'
)

### Step 4: Evaluation Against Ground Truth

We can evaluate the result of our entity matching with this method of the EntityMatchingEvaluator:
- **`evaluate_matching()`**: Evaluates matching given a test set and the predicted correspondences. 

In [None]:
gt_test = load_csv(
    DATA_DIR / "entitymatching" / "actors_2_golden_globes_test.csv", 
    name="test_entity_matching",
    header=None,
    names=['id1', 'id2', 'label'],
    add_index=False
)

debug_output_dir = OUTPUT_DIR / "debug_results_entity_matching"
debug_output_dir.mkdir(parents=True, exist_ok=True)

eval_results = EntityMatchingEvaluator.evaluate_matching(
    correspondences=correspondences_a2g,
    test_pairs=gt_test,
    out_dir=debug_output_dir
)

display(eval_results)

If we need more detailed debugging results, we can set the `debug` flag during matching and pass the resulting info object to `evaluate_matching()`, which then writes detailed debug logs to a directory of our choice.

In [None]:
# Re-run the matcher with debug mode enabled to get detailed debug data
print("🔍 Re-running matcher with debug mode to capture detailed results:")

correspondences_a2g, debug_info = matcher.match(
    df_left=actors,
    df_right=golden_globes, 
    candidates=embedding_blocker_a2g, # pass the blocker, which will internally generate candidate pairs using batching
    comparators=comparators,
    weights=[0.7, 0.2, 0.1],  # Title most important, then date, then actor,
    threshold=0.7, # set a similarity threshold for a match
    id_column='id',
    debug=True  # This enables debug output capture
)

eval_results = EntityMatchingEvaluator.evaluate_matching(
    correspondences=correspondences_a2g,
    test_pairs=gt_test,
    out_dir=debug_output_dir,
    debug_info=debug_info, # add debug info
    matcher_instance=matcher # add matcher instance for context for debug files
)

Another helpful tool for judging the quality of the matching is the cluster size distribution. A cluster is a group of records that, according to the correspondences, refer to the same entity; the distribution shows how many clusters of each size exist after matching.
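
To illustrate what such a distribution contains, the following sketch groups matched record IDs into connected components with a small union-find structure and counts the components by size. It is a simplified stand-in with made-up pairs, not the code behind `create_cluster_size_distribution()`.

```python
from collections import Counter

def cluster_sizes(pairs):
    """Return a Counter mapping cluster size -> number of clusters of that size."""
    parent = {}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b    # merge the two clusters

    members_per_root = Counter(find(x) for x in list(parent))
    return Counter(members_per_root.values())

# Toy correspondences: a2 is matched to both g2 and g3, and a4 also to g3.
toy_pairs = [("a1", "g1"), ("a2", "g2"), ("a2", "g3"), ("a4", "g3")]
distribution = cluster_sizes(toy_pairs)
print(dict(distribution))
```

Assuming each source is free of duplicates, matching two datasets should mostly yield clusters of size 2. Larger clusters, like the one of size 4 here, are a hint that some correspondences may be false positives.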

In [None]:
print("Analyzing cluster size distribution in our entity matching results...")

# Create cluster size distribution from our matches
cluster_distribution = EntityMatchingEvaluator.create_cluster_size_distribution(
    correspondences=correspondences_a2g,
    out_dir=str(OUTPUT_DIR / "cluster_analysis")
)

print(f"\n📊 Cluster Size Distribution Results:")
display(cluster_distribution)

If the distribution looks suspicious, for example because it contains unexpectedly large clusters, we can investigate specific clusters by writing out detailed cluster information:

In [None]:
# Write out detailed cluster information with all entity records for debugging purposes

# Use the matches we found earlier to demonstrate cluster details
cluster_details_path = OUTPUT_DIR / "cluster_analysis" / "detailed_cluster_info.json"

# Call write_cluster_details with our entity matches
output_path = EntityMatchingEvaluator.write_cluster_details(
    correspondences=correspondences_a2g,
    out_path=cluster_details_path
)

### Step 5: Machine Learning-based Matching Rules

Instead of using manually configured matching rules, we can also learn the weights and best comparators using machine learning if we have a labeled training set available.

Let's do this for the dataset combination Academy Awards <-> Actors.
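
Before a classifier can learn anything, each labelled pair has to be turned into a row of similarity scores, one column per comparator, plus the label. The sketch below shows that transformation in plain Python; the records, the two feature functions and `build_feature_rows` are hypothetical and only mirror the general idea.

```python
# Toy records keyed by ID; the values are made up for illustration.
left = {
    "aa1": {"title": "Gandhi", "year": 1982},
    "aa2": {"title": "Out of Africa", "year": 1985},
}
right = {
    "a1": {"title": "gandhi", "year": 1982},
    "a2": {"title": "Africa Screams", "year": 1949},
}
labelled_pairs = [("aa1", "a1", 1), ("aa2", "a2", 0)]  # (id1, id2, label)

def title_jaccard(l, r):
    """Jaccard similarity of lower-cased title tokens."""
    a, b = set(l["title"].lower().split()), set(r["title"].lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def same_year(l, r):
    """1.0 if both records carry the same year, otherwise 0.0."""
    return 1.0 if l["year"] == r["year"] else 0.0

feature_functions = {"title_jaccard": title_jaccard, "same_year": same_year}

def build_feature_rows(left, right, labelled_pairs, feature_functions):
    """Create one feature row (a dict) per labelled pair."""
    rows = []
    for id1, id2, label in labelled_pairs:
        row = {"id1": id1, "id2": id2}
        for name, function in feature_functions.items():
            row[name] = function(left[id1], right[id2])
        row["label"] = label
        rows.append(row)
    return rows

rows = build_feature_rows(left, right, labelled_pairs, feature_functions)
for row in rows:
    print(row)
```

A classifier trained on such rows effectively learns how much each similarity score should count, which replaces the hand-picked weights and threshold of the rule-based matcher.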

First, we need to create the features for machine learning using PyDI's `FeatureExtractor` class:

In [None]:
from PyDI.entitymatching import FeatureExtractor

# Load ground truth correspondences
aa2a_train = load_csv(
    DATA_DIR / "entitymatching" / "academy_awards_2_actors_training.csv",
    name="ground_truth_train",
    header=None,
    names=['id1', 'id2', 'label'],
    add_index=False
)

aa2a_test = load_csv(
    DATA_DIR / "entitymatching" / "academy_awards_2_actors_test.csv",
    name="ground_truth_test",
    header=None,
    names=['id1', 'id2', 'label'],
    add_index=False
)

similarity_comparators = [
    # Title similarity features - most important for movie matching
    StringComparator("title", similarity_function="jaro_winkler", preprocess=str.lower),
    StringComparator("title", similarity_function="levenshtein", preprocess=str.lower),
    StringComparator("title", similarity_function="cosine", preprocess=str.lower),
    StringComparator("title", similarity_function="jaccard", preprocess=str.lower),
    
    # Date proximity features
    DateComparator("date", max_days_difference=365),  # 1 year tolerance
    
    # Actor name similarity
    StringComparator("actors_actor_name", similarity_function="jaccard", preprocess=str.lower, list_strategy='concatenate'),
    StringComparator("actors_actor_name", similarity_function="jaccard", preprocess=str.lower, list_strategy='best_match'),
]

feature_extractor = FeatureExtractor(similarity_comparators)

# Extract features using FeatureExtractor
train_features = feature_extractor.create_features(
    academy_awards, actors, aa2a_train[['id1', 'id2']], labels=aa2a_train['label'], id_column='id'
)

print(f"✅ Training features extracted!")
print(f"Feature columns: {[col for col in train_features.columns if col not in ['id1', 'id2', 'label']]}")

# Prepare data for ML training
feature_columns = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]

X_train = train_features[feature_columns]
y_train = train_features['label']

print(f"Training data: X={X_train.shape}, y={y_train.shape}")
print(f"Class distribution: {y_train.value_counts().to_dict()}")

#### Full Scikit-learn integration

From here on, the full scikit-learn library can be used directly on the features produced by PyDI's feature extractor. No wrapping is needed because everything in PyDI is based on pandas DataFrames.

In [None]:
# Set up GridSearchCV with multiple models and hyperparameters
print(f"\n🔍 Setting up GridSearchCV...")

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import make_scorer, f1_score

# Define models and parameter grids
param_grids = {
    'RandomForest': {
        'model': RandomForestClassifier(random_state=42),
        'params': {
            'n_estimators': [50, 100, 200],
            'max_depth': [5, 10, None],
            'min_samples_split': [2, 5],
            'class_weight': ['balanced', None]
        }
    },
    'LogisticRegression': {
        'model': LogisticRegression(random_state=42, max_iter=1000),
        'params': {
            'C': [0.1, 1.0, 10.0],
            'penalty': ['l2'],
            'class_weight': ['balanced', None]
        }
    },
    'GradientBoosting': {
        'model': GradientBoostingClassifier(random_state=42),
        'params': {
            'n_estimators': [50, 100],
            'learning_rate': [0.1, 0.2],
            'max_depth': [3, 5],
        }
    },
    'SVM': {
        'model': SVC(random_state=42, probability=True),
        'params': {
            'C': [0.1, 1.0, 10.0],
            'kernel': ['rbf', 'linear'],
            'class_weight': ['balanced', None]
        }
    }
}

# Use F1 score as the scoring metric (good for imbalanced data)
scorer = make_scorer(f1_score)
cv_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print(f"GridSearch setup: {len(param_grids)} models, F1 scoring, 5-fold CV")

# Train models using GridSearchCV
print(f"\n🚀 Training Models with GridSearchCV...")

grid_search_results = {}
best_overall_score = -1
best_overall_model = None
best_model_name = None

for model_name, config in param_grids.items():
    print(f"\nTraining {model_name}...")
    

    # Create GridSearchCV
    grid_search = GridSearchCV(
        estimator=config['model'],
        param_grid=config['params'],
        scoring=scorer,
        cv=cv_folds,
        n_jobs=-1,  # Use all available cores
        verbose=0
    )
    
    # Fit GridSearchCV
    grid_search.fit(X_train, y_train)
    
    # Store results
    grid_search_results[model_name] = {
        'grid_search': grid_search,
        'best_score': grid_search.best_score_,
        'best_params': grid_search.best_params_,
        'best_estimator': grid_search.best_estimator_
    }
    
    print(f"  ✅ {model_name}: Best CV F1 = {grid_search.best_score_:.4f}")
    print(f"     Best params: {grid_search.best_params_}")
    
    # Track overall best model
    if grid_search.best_score_ > best_overall_score:
        best_overall_score = grid_search.best_score_
        best_overall_model = grid_search.best_estimator_
        best_model_name = model_name
            
print(f"\n🏆 Best Overall Model: {best_model_name} (CV F1: {best_overall_score:.4f})")

Now we can use the trained model directly with PyDI's `MLBasedMatcher`:

In [None]:
from PyDI.entitymatching import MLBasedMatcher

# Create MLBasedMatcher and apply trained model
ml_matcher = MLBasedMatcher(feature_extractor)

correspondences_aa2a = ml_matcher.match(
    academy_awards, actors, candidates=embedding_blocker_aa2a, id_column='id', trained_classifier=best_overall_model
)

In [None]:
# Show feature importance if available
if hasattr(best_overall_model, 'feature_importances_'):
    print(f"\n🔍 Top Feature Importances:")
    importance_df = ml_matcher.get_feature_importance(best_overall_model, feature_columns)
    display(importance_df.head(8))

Let's evaluate the ML-based matching with the evaluator:

In [None]:
eval_results = EntityMatchingEvaluator.evaluate_matching(
    correspondences=correspondences_aa2a,
    test_pairs=aa2a_test,
    out_dir=debug_output_dir
)

display(eval_results)

# Create cluster size distribution from our matches
cluster_distribution = EntityMatchingEvaluator.create_cluster_size_distribution(
    correspondences=correspondences_aa2a,
    out_dir=OUTPUT_DIR / "cluster_analysis"
)

print(f"\n📊 Cluster Size Distribution Results:")
display(cluster_distribution)

As an alternative to per-attribute similarity metrics, PyDI's `VectorFeatureExtractor` can create embedding-based features using SentenceTransformers:

In [None]:
# VectorFeatureExtractor Examples

from PyDI.entitymatching import VectorFeatureExtractor

# SentenceTransformers embeddings using VectorFeatureExtractor
st_extractor = VectorFeatureExtractor(
    embedding_model='sentence-transformers/all-MiniLM-L6-v2',
    columns=['title', 'actors_actor_name', 'date'],
    distance_metrics=['cosine'],
    pooling_strategy='concatenate',
    list_strategies={'actors_actor_name': 'concatenate'}
)

# Extract features using VectorFeatureExtractor
train_features = st_extractor.create_features(
    academy_awards, actors, aa2a_train[['id1', 'id2']], labels=aa2a_train['label'], id_column='id'
)

# ready to train ML models with scikit-learn as before
# matching workflow is analogous to previous example with FeatureExtractor

## Part 3: Data Fusion

In [None]:
academy_awards["academy_awards_id"] = academy_awards["id"]

academy_awards.attrs["trust_score"] = 3
actors.attrs["trust_score"] = 2
golden_globes.attrs["trust_score"] = 1

In [None]:
all_correspondences = pd.concat([correspondences_a2g, correspondences_aa2a], ignore_index=True)
print(f'Total correspondences: {len(all_correspondences):,}')

### Define Fusion Strategy

In [None]:
from PyDI.fusion import DataFusionStrategy, longest_string, union, prefer_higher_trust

strategy = DataFusionStrategy('movie_fusion_strategy')

strategy.add_attribute_fuser('title', longest_string)
strategy.add_attribute_fuser('director_name', longest_string)
strategy.add_attribute_fuser('date', prefer_higher_trust, trust_key="trust_score")

strategy.add_attribute_fuser('actors_actor_name', union)

print('Strategy ready.')

### Run Fusion
We build connected components from the combined correspondences and fuse each attribute using the rules above.
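
To make the per-attribute fusion concrete, here is a minimal sketch of three conflict resolution rules applied to one cluster of matched records. The rule functions are simplified stand-ins written for this example (they are not PyDI's `longest_string`, `union` or `prefer_higher_trust`), and the records are made up.

```python
def longest_string_rule(values):
    """Pick the longest non-empty string; on ties the first one seen wins."""
    strings = [value for value, _trust in values if value]
    return max(strings, key=len) if strings else None

def union_rule(values):
    """Merge list values, keeping the first occurrence of each item."""
    merged = []
    for items, _trust in values:
        for item in items or []:
            if item not in merged:
                merged.append(item)
    return merged

def highest_trust_rule(values):
    """Pick the value that comes from the source with the highest trust score."""
    present = [(value, trust) for value, trust in values if value is not None]
    return max(present, key=lambda pair: pair[1])[0] if present else None

# One cluster of records believed to describe the same movie (toy data).
cluster = [
    {"title": "The Example", "date": "1999-05-01",
     "actors": ["Actor One"], "trust": 3},
    {"title": "The Example (Restored)", "date": "2000-01-01",
     "actors": ["Actor One", "Actor Two"], "trust": 2},
]

rules = {"title": longest_string_rule, "date": highest_trust_rule, "actors": union_rule}

fused_record = {
    attr: rule([(record.get(attr), record["trust"]) for record in cluster])
    for attr, rule in rules.items()
}
print(fused_record)
```

Each attribute gets its own rule because the right way to resolve a conflict depends on the attribute: lists are best merged, while a single date should come from the most trusted source.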

In [None]:
from PyDI.fusion import DataFusionEngine

engine = DataFusionEngine(strategy, debug=True, debug_format='json')

fused = engine.run(
    datasets=[academy_awards, actors, golden_globes],
    correspondences=all_correspondences,
    id_column="id",
    include_singletons=False,
)
print(f'Fused rows: {len(fused):,}')
display(fused.head(5))

### Evaluate with Gold Standard
We load the gold standard and evaluate accuracy.
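
Evaluating a fused dataset means comparing each fused value with the gold value under an attribute-specific notion of equality. The following sketch computes per-attribute accuracy that way. The comparison functions `same_tokens` and `same_year` are hypothetical approximations of what a tokenized or year-only comparison might do, and the rows are made up.

```python
def same_tokens(fused, gold):
    """Hypothetical tokenized comparison: equal sets of lower-cased tokens."""
    return set(str(fused).lower().split()) == set(str(gold).lower().split())

def same_year(fused, gold):
    """Hypothetical year-only comparison on ISO-formatted date strings."""
    return str(fused)[:4] == str(gold)[:4]

def attribute_accuracy(fused_rows, gold_rows, checks):
    """Share of correct values per attribute, over gold records found in the fused data."""
    correct = {attr: 0 for attr in checks}
    total = {attr: 0 for attr in checks}
    for record_id, gold in gold_rows.items():
        fused = fused_rows.get(record_id)
        if fused is None:
            continue                       # gold record has no fused counterpart
        for attr, check in checks.items():
            total[attr] += 1
            correct[attr] += int(check(fused.get(attr), gold.get(attr)))
    return {attr: correct[attr] / total[attr] if total[attr] else 0.0 for attr in checks}

fused_rows = {
    "m1": {"title": "the example", "date": "1999-05-01"},
    "m2": {"title": "Another Film", "date": "2001-01-01"},
}
gold_rows = {
    "m1": {"title": "The Example", "date": "1999-12-31"},
    "m2": {"title": "Another Movie", "date": "2001-06-15"},
    "m3": {"title": "Unmatched", "date": "2010-01-01"},
}

accuracy = attribute_accuracy(
    fused_rows, gold_rows, {"title": same_tokens, "date": same_year}
)
print(accuracy)
```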

In [None]:
from PyDI.fusion import tokenized_match, year_only_match, boolean_match

strategy.add_evaluation_function("title", tokenized_match)
strategy.add_evaluation_function("director_name", tokenized_match)
strategy.add_evaluation_function("actors_actor_name", tokenized_match)
strategy.add_evaluation_function("date", year_only_match)
strategy.add_evaluation_function("oscar", boolean_match)

In [None]:
from PyDI.fusion import DataFusionEvaluator

fusion_test_set = load_xml(DATA_DIR / 'fusion' / 'test_set.xml', name='fusion_test_set', nested_handling='aggregate')

# Keep the core evaluation columns that are present in the fused output
eval_cols = ['academy_awards_id','title','director_name','actors_actor_name','date','oscar']
fused_eval = fused[[col for col in eval_cols if col in fused.columns]].copy()

# Create evaluator with our fusion strategy
evaluator = DataFusionEvaluator(strategy)

# Evaluate the fused results against the gold standard
print("Evaluating fusion results against gold standard...")
evaluation_results = evaluator.evaluate(
    fused_df=fused_eval,
    fused_id_column='academy_awards_id',
    gold_df=fusion_test_set,
    gold_id_column='id',
)

# Display evaluation metrics
print("\nFusion Evaluation Results:")
print("=" * 40)
for metric, value in evaluation_results.items():
    if isinstance(value, float):
        print(f"  {metric}: {value:.3f}")
    else:
        print(f"  {metric}: {value}")
        
print(f"\nOverall Accuracy: {evaluation_results.get('overall_accuracy', 0):.1%}")