# PyDI Machine Learning Entity Matching Example

This notebook demonstrates machine learning-based entity matching in PyDI, using `MLBasedMatcher` together with scikit-learn classifiers.

**What this shows:**
- Load datasets with provenance tracking
- **Traditional feature extraction**: Convert entity pairs to similarity features using comparators
- **Vector-based features** (optional): Use text embeddings and distance metrics as additional similarity features
- **ML model training**: Train various scikit-learn classifiers with proper validation
- **MLBasedMatcher usage**: Apply trained models to find entity correspondences
- **Model evaluation**: Compare performance across different ML approaches
- **Feature importance**: Understand which features contribute most to matching decisions
- **End-to-end ML pipeline**: Complete workflow from data loading to saved matches and model metadata

Run cells below in order. Adjust paths if running outside the repo root.

In [None]:
# PyDI imports
from PyDI.io import load_xml
from PyDI.entitymatching import (
    MLBasedMatcher,
    FeatureExtractor,
    VectorFeatureExtractor,
    StringComparator,
    NumericComparator,
    DateComparator,
    EntityMatchingEvaluator,
    ensure_record_ids
)

# ML and data processing imports
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
from datetime import datetime
import json
import logging

# Scikit-learn imports
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, roc_auc_score,
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore', category=UserWarning)
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

def repo_root():
    """Return the repository root directory."""
    # For scripts in PyDI/examples/: file -> examples/ -> PyDI/ -> repo root
    if '__file__' in globals():
        return Path(__file__).parent.parent.parent
    else:
        # In Jupyter, find the pyproject.toml to locate repo root
        current = Path.cwd()
        while current != current.parent:
            if (current / 'pyproject.toml').exists():
                return current
            current = current.parent
        return Path.cwd()  # fallback

print("✓ All imports successful")
print(f"✓ Repository root: {repo_root()}")

## Step 1: Load Datasets and Ground Truth

We'll use the movie datasets with ground truth labels for supervised learning.

In [None]:
root = repo_root()
academy_path = root / "input" / "movies" / "entitymatching" / "data" / "academy_awards.xml"
actors_path = root / "input" / "movies" / "entitymatching" / "data" / "actors.xml"

print(f"Academy awards data: {academy_path}")
print(f"Actors data: {actors_path}")

# Load datasets using PyDI's provenance-aware XML loader
academy_df = load_xml(academy_path, name="academy_awards")
actors_df = load_xml(actors_path, name="actors")

print(f"\nAcademy Awards shape: {academy_df.shape}")
print(f"Academy Awards columns: {list(academy_df.columns)}")

print(f"\nActors shape: {actors_df.shape}")
print(f"Actors columns: {list(actors_df.columns)}")

# Ensure record IDs
academy_df = ensure_record_ids(academy_df)
actors_df = ensure_record_ids(actors_df)

print(f"\n✓ Datasets loaded with record IDs")

In [None]:
# Load ground truth correspondences
train_path = root / "input" / "movies" / "entitymatching" / "splits" / "gs_academy_awards_2_actors_training.csv"
test_path = root / "input" / "movies" / "entitymatching" / "splits" / "gs_academy_awards_2_actors_test.csv"

def load_correspondences(file_path):
    """Load correspondence file and convert to PyDI ID format."""
    if not file_path.exists():
        print(f"File not found: {file_path}")
        return pd.DataFrame()
    
    # Load raw correspondences
    corr = pd.read_csv(file_path, names=['id1', 'id2', 'label'])
    
    # Convert boolean labels to numeric
    corr['label'] = corr['label'].map({True: 1, 'TRUE': 1, False: 0, 'FALSE': 0})
    
    # Convert original XML IDs to PyDI format
    def convert_id(original_id):
        if pd.isna(original_id):
            return original_id
        
        id_str = str(original_id)
        if 'academy_awards_' in id_str:
            # Extract number and reformat
            try:
                num = int(id_str.split('_')[-1]) - 1  # Convert to 0-based index
                return f"academy_awards_{num:06d}"
            except ValueError:  # suffix is not an integer, keep the ID unchanged
                return id_str
        elif 'actors_' in id_str:
            # Extract number and reformat
            try:
                num = int(id_str.split('_')[-1]) - 1  # Convert to 0-based index
                return f"actors_{num:06d}"
            except ValueError:  # suffix is not an integer, keep the ID unchanged
                return id_str
        
        return id_str
    
    corr['id1'] = corr['id1'].apply(convert_id)
    corr['id2'] = corr['id2'].apply(convert_id)
    
    return corr

# Load training and test correspondences
train_corr = load_correspondences(train_path)
test_corr = load_correspondences(test_path)

print(f"Training correspondences: {len(train_corr)} pairs")
if len(train_corr) > 0:
    print(f"Training label distribution:")
    print(train_corr['label'].value_counts())

print(f"\nTest correspondences: {len(test_corr)} pairs")
if len(test_corr) > 0:
    print(f"Test label distribution:")
    print(test_corr['label'].value_counts())
    
print(f"\n✓ Ground truth loaded successfully")
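
The ID conversion in `load_correspondences` can be checked in isolation. The sketch below restates the same rule (1-based numeric suffix to 0-based, zero-padded to six digits) as a standalone function. The raw ID shape `<dataset>_<number>` is an assumption inferred from `convert_id` above, and `normalize_id` is a hypothetical helper, not part of PyDI. It is stricter than `convert_id`, which matches the prefix anywhere in the string.

```python
import re

# Assumed raw ID shape, inferred from convert_id above: "<dataset>_<1-based number>"
_ID_PATTERN = re.compile(r"^(academy_awards|actors)_(\d+)$")


def normalize_id(raw_id: str) -> str:
    """Map a 1-based raw ID to a 0-based, zero-padded six-digit ID."""
    match = _ID_PATTERN.match(raw_id)
    if match is None:
        return raw_id  # leave unrecognised IDs untouched
    prefix, number = match.groups()
    return f"{prefix}_{int(number) - 1:06d}"


print(normalize_id("academy_awards_1"))  # academy_awards_000000
print(normalize_id("actors_104"))        # actors_000103
print(normalize_id("unknown_7"))         # unknown_7
```

If the "Valid training pairs" count printed in Step 2 is much lower than the number of loaded correspondences, a mismatch between this ID rule and the `_id` values in the loaded DataFrames is a likely cause worth checking first.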

## Step 2: Traditional Feature Extraction

We'll create similarity-based features using various comparators and convert labeled pairs into training data.
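
To make the idea of "one record pair becomes one feature row" concrete, here is a minimal sketch in plain Python. It does not use PyDI: the three functions are hypothetical stand-ins for comparators, and the linear decay used for dates is an assumption about how a tolerance such as `max_days_difference` could be turned into a score, not PyDI's actual formula.

```python
from datetime import date


def exact_title_match(left: dict, right: dict) -> float:
    """1.0 if the lowercased titles are identical, else 0.0."""
    return 1.0 if left["title"].lower() == right["title"].lower() else 0.0


def title_token_jaccard(left: dict, right: dict) -> float:
    """Share of title words the two records have in common."""
    a = set(left["title"].lower().split())
    b = set(right["title"].lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def date_proximity(left: dict, right: dict, max_days_difference: int = 730) -> float:
    """Assumed scoring: 1.0 for the same day, decaying linearly to 0.0 at the tolerance."""
    gap = abs((left["date"] - right["date"]).days)
    return max(0.0, 1.0 - gap / max_days_difference)


FEATURES = {
    "exact_title_match": exact_title_match,
    "title_token_jaccard": title_token_jaccard,
    "date_proximity": date_proximity,
}


def pair_features(left: dict, right: dict) -> dict:
    """Apply every feature function to one record pair."""
    return {name: fn(left, right) for name, fn in FEATURES.items()}


left = {"title": "The Godfather", "date": date(1972, 3, 24)}
right = {"title": "the godfather", "date": date(1973, 3, 24)}
features = pair_features(left, right)
print(features)
```

Each labeled pair yields one such row; stacking the rows gives the feature matrix that the classifiers in Step 4 are trained on.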

In [None]:
# Create comprehensive feature extractor with multiple comparators
traditional_comparators = [
    StringComparator("title", similarity_function="jaro_winkler", preprocess=str.lower),
    StringComparator("title", similarity_function="levenshtein", preprocess=str.lower), 
    StringComparator("title", similarity_function="cosine", preprocess=str.lower),
    DateComparator("date", max_days_difference=730),  # 2 years tolerance
    StringComparator("actor_name", similarity_function="jaro_winkler", preprocess=str.lower),
    # Custom features
    {
        "function": lambda r1, r2: 1.0 if str(r1.get('title', '')).lower() == str(r2.get('title', '')).lower() else 0.0,
        "name": "exact_title_match"
    },
    {
        "function": lambda r1, r2: len(set(str(r1.get('title', '')).lower().split()) & set(str(r2.get('title', '')).lower().split())),
        "name": "common_title_words"
    },
]

# Create feature extractor
traditional_extractor = FeatureExtractor(traditional_comparators)

print(f"Traditional feature extractor created with {len(traditional_comparators)} features")
print(f"Feature names: {traditional_extractor.get_feature_names()}")

In [None]:
# Extract features for training data
if len(train_corr) > 0:
    print(f"Extracting features for {len(train_corr)} training pairs...")
    
    # Filter out pairs where records might be missing
    valid_pairs = []
    valid_labels = []
    
    for idx, row in train_corr.iterrows():
        id1, id2, label = row['id1'], row['id2'], row['label']
        
        # Check if both records exist
        if (id1 in academy_df['_id'].values and 
            id2 in actors_df['_id'].values):
            valid_pairs.append({'id1': id1, 'id2': id2})
            valid_labels.append(label)
    
    valid_pairs_df = pd.DataFrame(valid_pairs)
    valid_labels = pd.Series(valid_labels)
    
    print(f"Valid training pairs: {len(valid_pairs_df)} out of {len(train_corr)}")
    
    if len(valid_pairs_df) > 0:
        # Extract features
        train_features = traditional_extractor.create_features(
            academy_df, actors_df, valid_pairs_df, labels=valid_labels
        )
        
        print(f"\n✓ Training features extracted: {train_features.shape}")
        print(f"Feature columns: {[col for col in train_features.columns if col not in ['id1', 'id2', 'label']]}")
        
        # Show sample features
        print(f"\nSample training features:")
        display(train_features.head(3))
        
        # Feature statistics
        feature_cols = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]
        print(f"\nFeature statistics:")
        display(train_features[feature_cols].describe())
    else:
        print("No valid training pairs found - cannot extract features")
        train_features = pd.DataFrame()
else:
    print("No training correspondences available")
    train_features = pd.DataFrame()

## Step 3: Vector-Based Feature Extraction (Optional)

If sentence-transformers is available, we'll demonstrate vector-based features using embeddings.
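
The two distance metrics requested below (`cosine` and `euclidean`) measure different things, which is why using both can give the classifier complementary signals. A minimal sketch on hand-made toy vectors (real embeddings from `all-MiniLM-L6-v2` have 384 dimensions; the three-dimensional vectors here are illustrative only):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy "embeddings": emb_right points in the same direction as emb_left but is twice as long
emb_left = [3.0, 4.0, 0.0]
emb_right = [6.0, 8.0, 0.0]
emb_other = [0.0, 0.0, 5.0]

print(cosine_similarity(emb_left, emb_right))   # same direction
print(euclidean_distance(emb_left, emb_right))  # but not the same point
print(cosine_similarity(emb_left, emb_other))   # orthogonal vectors
```

Cosine similarity ignores vector length, while Euclidean distance does not, so the first pair is a perfect match by one metric and clearly separated by the other.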

In [None]:
# Check if sentence-transformers is available
vector_features_available = False
vector_extractor = None
vector_train_features = pd.DataFrame()  # stays empty if extraction is skipped or fails

try:
    import sentence_transformers
    print("✓ sentence-transformers available - will demonstrate vector features")
    
    # Create vector feature extractor
    vector_extractor = VectorFeatureExtractor(
        embedding_model='all-MiniLM-L6-v2',  # Lightweight model
        columns=['title', 'actor_name'],  # Use title and actor name
        distance_metrics=['cosine', 'euclidean'],
        pooling_strategy='concatenate'
    )
    
    print(f"Vector feature extractor created with model: all-MiniLM-L6-v2")
    print(f"Vector feature names: {vector_extractor.get_feature_names()}")
    vector_features_available = True
    
    # Extract vector features for a subset of training data (for performance)
    if 'valid_pairs_df' in globals() and len(valid_pairs_df) > 0:
        # Use first 50 pairs for vector demo
        sample_pairs = valid_pairs_df.head(50).copy()
        sample_labels = valid_labels.head(50).copy()
        
        print(f"\nExtracting vector features for {len(sample_pairs)} sample pairs...")
        
        vector_train_features = vector_extractor.create_features(
            academy_df, actors_df, sample_pairs, labels=sample_labels
        )
        
        print(f"✓ Vector training features: {vector_train_features.shape}")
        display(vector_train_features.head(3))
        
except ImportError:
    print("⚠ sentence-transformers not available - skipping vector features")
    print("To install: pip install sentence-transformers")
    vector_train_features = pd.DataFrame()
except Exception as e:
    print(f"⚠ Error initializing vector features: {e}")
    vector_train_features = pd.DataFrame()

## Step 4: Train Machine Learning Models

We'll train multiple scikit-learn classifiers and compare their performance.
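
Models are ranked by F1 rather than accuracy because match labels are usually imbalanced. The sketch below computes the four metrics from scratch to show why; `binary_metrics` is a hypothetical helper that returns 0.0 for undefined ratios, mirroring the `zero_division=0` argument used in the next cells.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = match)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Ten candidate pairs, only two of which are true matches
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

always_no = [0] * 10                          # never predicts a match
one_hit = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]      # one correct match, one false alarm

print(binary_metrics(y_true, always_no))  # accuracy 0.8, f1 0.0
print(binary_metrics(y_true, one_hit))    # accuracy 0.8, f1 0.5
```

Both predictors reach the same accuracy, but only F1 separates the one that actually finds a match from the one that never does.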

In [None]:
# Prepare training data for ML models
trained_models = {}

if len(train_features) > 0:
    # Prepare feature matrix and labels
    feature_columns = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]
    X = train_features[feature_columns]
    y = train_features['label']
    
    print(f"Training data prepared:")
    print(f"  Features: {X.shape}")
    print(f"  Labels: {y.shape}")
    print(f"  Positive class: {sum(y)} ({sum(y)/len(y)*100:.1f}%)")
    print(f"  Negative class: {len(y) - sum(y)} ({(len(y) - sum(y))/len(y)*100:.1f}%)")
    
    # Split into train/validation sets
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    
    print(f"\n✓ Train/validation split: {len(X_train)}/{len(X_val)}")
else:
    print("No training features available - cannot train models")
    X_train = X_val = y_train = y_val = pd.DataFrame()
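
The `stratify=y` argument above keeps the share of matches the same in the training and validation sets. The following sketch shows the idea in plain Python. It is an illustration of stratification, not scikit-learn's implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_indices, val_indices) that preserve the class ratio."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # same fraction taken from every class
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


labels = [1] * 10 + [0] * 30  # 25% matches
train_idx, val_idx = stratified_split(labels)

val_positives = sum(labels[i] for i in val_idx)
train_positives = sum(labels[i] for i in train_idx)
print(f"validation: {val_positives}/{len(val_idx)} positives")
print(f"training:   {train_positives}/{len(train_idx)} positives")
```

Without stratification, a small validation set could by chance contain very few matches, which makes precision and recall estimates unstable.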

In [None]:
# Train multiple ML models
if len(X_train) > 0:
    # Define models to train
    models_to_train = {
        'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42),
        'LogisticRegression': LogisticRegression(random_state=42, max_iter=1000),
        'GradientBoosting': GradientBoostingClassifier(random_state=42),
        'DecisionTree': DecisionTreeClassifier(random_state=42, max_depth=10),
        'SVM': SVC(random_state=42, probability=True)  # probability=True for predict_proba
    }
    
    print("Training ML models...\n")
    
    model_results = []
    
    for model_name, model in models_to_train.items():
        print(f"Training {model_name}...")
        
        try:
            # Train model
            model.fit(X_train, y_train)
            
            # Validation predictions
            val_predictions = model.predict(X_val)
            val_probabilities = model.predict_proba(X_val)[:, 1] if hasattr(model, 'predict_proba') else val_predictions
            
            # Calculate metrics
            from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
            
            accuracy = accuracy_score(y_val, val_predictions)
            precision = precision_score(y_val, val_predictions, zero_division=0)
            recall = recall_score(y_val, val_predictions, zero_division=0)
            f1 = f1_score(y_val, val_predictions, zero_division=0)
            
            # Store results
            model_results.append({
                'Model': model_name,
                'Accuracy': accuracy,
                'Precision': precision,
                'Recall': recall,
                'F1': f1
            })
            
            # Store trained model
            trained_models[model_name] = model
            
            print(f"  ✓ {model_name}: F1={f1:.3f}, Precision={precision:.3f}, Recall={recall:.3f}")
            
        except Exception as e:
            print(f"  ✗ {model_name} training failed: {e}")
    
    # Display results table
    if model_results:
        results_df = pd.DataFrame(model_results).round(3)
        print(f"\n=== Model Comparison ===\n")
        display(results_df)
        
        # Find best model
        best_model_idx = results_df['F1'].idxmax()
        best_model_name = results_df.loc[best_model_idx, 'Model']
        best_f1 = results_df.loc[best_model_idx, 'F1']
        
        print(f"\n🏆 Best model: {best_model_name} (F1: {best_f1:.3f})")
        
    print(f"\n✓ {len(trained_models)} models trained successfully")

else:
    print("Cannot train models - no training data available")

## Step 5: Feature Importance Analysis

Let's analyze which features are most important for entity matching decisions.
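
The cell below relies on `feature_importances_` (tree models) and `coef_` (linear models), so models exposing neither are skipped. A model-agnostic alternative is permutation importance: shuffle one feature column and measure how much the score drops. The sketch below is a plain-Python illustration of that technique with a hand-written toy model; it is not what the notebook or PyDI computes, and every name in it is hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the given label."""
    correct = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return correct / len(labels)


def permutation_importance(model, rows, labels, feature_names, n_repeats=10, seed=42):
    """Mean drop in accuracy when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = {}
    for name in feature_names:
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[name] for row in rows]
            rng.shuffle(shuffled)
            # Copy each row, replacing only the feature under test
            permuted = [{**row, name: value} for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances[name] = sum(drops) / n_repeats
    return importances


def toy_model(row):
    """Toy classifier: predicts a match from title similarity only and ignores 'noise'."""
    return 1 if row["title_sim"] >= 0.5 else 0


rows = [{"title_sim": i / 10, "noise": ((i * 7) % 10) / 10} for i in range(10)]
labels = [toy_model(row) for row in rows]  # labels the toy model fits perfectly

importances = permutation_importance(toy_model, rows, labels, ["title_sim", "noise"])
print(importances)
```

Shuffling a feature the model ignores cannot change its predictions, so that feature's importance is exactly zero. scikit-learn ships a production version of this idea as `sklearn.inspection.permutation_importance`.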

In [None]:
# Analyze feature importance for models that support it
if trained_models and len(X_train) > 0:
    print("=== Feature Importance Analysis ===\n")
    
    for model_name, model in trained_models.items():
        if hasattr(model, 'feature_importances_'):
            print(f"\n{model_name} Feature Importance:")
            
            try:
                # Create MLBasedMatcher to get feature importance
                matcher = MLBasedMatcher(traditional_extractor)
                importance_df = matcher.get_feature_importance(model, feature_columns)
                
                # Display top features
                display(importance_df.head(8))
                
                # Create visualization
                plt.figure(figsize=(10, 6))
                top_features = importance_df.head(8)
                plt.barh(range(len(top_features)), top_features['importance'])
                plt.yticks(range(len(top_features)), top_features['feature'])
                plt.xlabel('Feature Importance')
                plt.title(f'{model_name} - Top Feature Importances')
                plt.gca().invert_yaxis()
                plt.tight_layout()
                plt.show()
                
            except Exception as e:
                print(f"Error analyzing feature importance: {e}")
        
        elif hasattr(model, 'coef_'):
            print(f"\n{model_name} Coefficients:")
            # For linear models, show coefficients sorted from most positive to most negative
            coefficients = model.coef_[0] if model.coef_.ndim > 1 else model.coef_
            coef_df = pd.DataFrame({
                'feature': feature_columns,
                'coefficient': coefficients
            }).sort_values('coefficient', ascending=False)
            
            display(coef_df.head(8))
        
        else:
            print(f"{model_name}: Feature importance not available for this model type")


## Step 6: MLBasedMatcher Usage

Now we'll use our trained models with the MLBasedMatcher to find entity correspondences.

In [None]:
# Create candidate pairs for testing MLBasedMatcher
def create_sample_candidates(df_left, df_right, max_pairs=200, strategy="random"):
    """Create candidate pairs for entity matching."""
    left_ids = df_left['_id'].tolist()
    right_ids = df_right['_id'].tolist()
    
    if strategy == "random":
        # Random sampling
        np.random.seed(42)  # For reproducibility
        candidates = []
        for _ in range(min(max_pairs, len(left_ids) * len(right_ids))):
            left_id = np.random.choice(left_ids)
            right_id = np.random.choice(right_ids)
            candidates.append((left_id, right_id))
    elif strategy == "title_similarity":
        # Simple title-based blocking (first character match)
        candidates = []
        
        # Group by first character of title
        left_groups = df_left.groupby(df_left['title'].str[0].fillna(''))['_id'].apply(list).to_dict()
        right_groups = df_right.groupby(df_right['title'].str[0].fillna(''))['_id'].apply(list).to_dict()
        
        for key in left_groups:
            if key in right_groups:
                for left_id in left_groups[key]:
                    for right_id in right_groups[key]:
                        candidates.append((left_id, right_id))
                        if len(candidates) >= max_pairs:
                            break
                    if len(candidates) >= max_pairs:
                        break
                if len(candidates) >= max_pairs:
                    break
    
    # Convert to DataFrame and remove duplicates
    candidate_df = pd.DataFrame(candidates, columns=['id1', 'id2']).drop_duplicates()
    return candidate_df

# Create test candidates
test_candidates = create_sample_candidates(academy_df, actors_df, max_pairs=150, strategy="title_similarity")
print(f"Created {len(test_candidates)} test candidate pairs")
display(test_candidates.head())
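
The `title_similarity` strategy above is a simple form of blocking: only records that share a blocking key are paired, which avoids scoring the full cross product. A minimal sketch with plain dictionaries follows. Unlike the cell above, it lowercases the key, and `blocked_candidates` is a hypothetical helper rather than a PyDI function.

```python
from collections import defaultdict


def block_key(title):
    """Blocking key: lowercased first character of the title ('' if missing)."""
    return title[:1].lower() if title else ""


def blocked_candidates(left, right):
    """Pair IDs whose titles share a blocking key. left/right map id -> title."""
    right_blocks = defaultdict(list)
    for right_id, title in right.items():
        right_blocks[block_key(title)].append(right_id)

    pairs = []
    for left_id, title in left.items():
        for right_id in right_blocks.get(block_key(title), []):
            pairs.append((left_id, right_id))
    return pairs


left = {"a1": "Casablanca", "a2": "Gandhi", "a3": "Amadeus"}
right = {"b1": "casablanca", "b2": "Cabaret", "b3": "Ghost", "b4": "Platoon"}

pairs = blocked_candidates(left, right)
print(pairs)
print(f"{len(pairs)} candidates instead of {len(left) * len(right)}")
```

The trade-off is recall: a true match whose titles start with different characters (for example a leading article in one source only) is never generated as a candidate, so no classifier can recover it later.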

In [None]:
# Use MLBasedMatcher with trained models
if trained_models and len(test_candidates) > 0:
    print("=== MLBasedMatcher Results ===\n")
    
    # Create MLBasedMatcher
    ml_matcher = MLBasedMatcher(traditional_extractor)
    
    # Test each trained model
    ml_results = {}
    
    for model_name, trained_model in trained_models.items():
        print(f"Testing {model_name}...")
        
        try:
            # Find matches using trained model
            matches = ml_matcher.match(
                df_left=academy_df,
                df_right=actors_df,
                candidates=[test_candidates],
                trained_classifier=trained_model,
                threshold=0.5,  # 50% confidence threshold
                use_probabilities=True
            )
            
            print(f"  ✓ Found {len(matches)} matches above threshold 0.5")
            
            # Store results
            ml_results[model_name] = matches
            
            # Show top matches
            if len(matches) > 0:
                print(f"  Top matches:")
                top_matches = matches.sort_values('score', ascending=False).head(3)
                
                for _, match in top_matches.iterrows():
                    id1, id2, score = match['id1'], match['id2'], match['score']
                    
                    # Get titles for display
                    title1 = academy_df[academy_df['_id'] == id1]['title'].iloc[0]
                    title2 = actors_df[actors_df['_id'] == id2]['title'].iloc[0]
                    
                    print(f"    {score:.3f}: '{title1}' <-> '{title2}'")
            
            print()  # New line
            
        except Exception as e:
            print(f"  ✗ Error with {model_name}: {e}\n")
    
    print(f"✓ MLBasedMatcher testing complete")

else:
    print("Cannot test MLBasedMatcher - no trained models or test candidates available")
    ml_results = {}

## Step 7: Model Evaluation on Test Set

Let's evaluate our models on the held-out test correspondences to estimate how well they perform on pairs they were not trained on.
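
The evaluation cell compares predictions with labels by position, which is only valid if both have the same length and order. Whether `predict_pairs` guarantees this is not stated here, so a defensive option is to align on the pair IDs first. The sketch below shows that alignment with plain tuples; `align_by_pair` is a hypothetical helper.

```python
def align_by_pair(labelled_pairs, scored_pairs):
    """Match labels to scores via (id1, id2).

    labelled_pairs: iterable of (id1, id2, label)
    scored_pairs:   iterable of (id1, id2, score)
    Returns (labels, scores) for pairs present in both, in labelled order.
    """
    score_lookup = {(id1, id2): score for id1, id2, score in scored_pairs}
    labels, scores = [], []
    for id1, id2, label in labelled_pairs:
        key = (id1, id2)
        if key in score_lookup:
            labels.append(label)
            scores.append(score_lookup[key])
    return labels, scores


labelled = [("a1", "b1", 1), ("a2", "b2", 0), ("a3", "b3", 1)]
scored = [("a3", "b3", 0.9), ("a1", "b1", 0.8)]  # different order, one pair missing

labels, scores = align_by_pair(labelled, scored)
print(labels)  # [1, 1]
print(scores)  # [0.8, 0.9]
```

With DataFrames, the same effect can be achieved by merging the labeled pairs and the predictions on `['id1', 'id2']` before computing metrics.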

In [None]:
# Evaluate on test set if available
if len(test_corr) > 0 and trained_models:
    print("=== Test Set Evaluation ===\n")
    
    # Prepare test pairs (filter valid ones)
    test_pairs = []
    test_labels = []
    
    for idx, row in test_corr.iterrows():
        id1, id2, label = row['id1'], row['id2'], row['label']
        
        # Check if both records exist
        if (id1 in academy_df['_id'].values and 
            id2 in actors_df['_id'].values):
            test_pairs.append({'id1': id1, 'id2': id2})
            test_labels.append(label)
    
    test_pairs_df = pd.DataFrame(test_pairs)
    test_labels = pd.Series(test_labels)
    
    print(f"Valid test pairs: {len(test_pairs_df)} out of {len(test_corr)}")
    
    if len(test_pairs_df) > 0:
        # Report the class balance only when there are pairs, to avoid dividing by zero
        print(f"Test positive class: {sum(test_labels)} ({sum(test_labels)/len(test_labels)*100:.1f}%)")
        # Create MLBasedMatcher
        ml_matcher = MLBasedMatcher(traditional_extractor)
        
        test_evaluation_results = []
        
        for model_name, trained_model in trained_models.items():
            print(f"\nEvaluating {model_name} on test set...")
            
            try:
                # Get predictions for test pairs
                predictions_df = ml_matcher.predict_pairs(
                    academy_df, actors_df, test_pairs_df, trained_model, use_probabilities=True
                )
                
                if len(predictions_df) > 0:
                    # Convert to binary predictions with different thresholds
                    thresholds = [0.3, 0.5, 0.7]
                    
                    for threshold in thresholds:
                        binary_predictions = (predictions_df['prediction'] >= threshold).astype(int)
                        
                        # Calculate metrics
                        accuracy = accuracy_score(test_labels, binary_predictions)
                        precision = precision_score(test_labels, binary_predictions, zero_division=0)
                        recall = recall_score(test_labels, binary_predictions, zero_division=0)
                        f1 = f1_score(test_labels, binary_predictions, zero_division=0)
                        
                        test_evaluation_results.append({
                            'Model': model_name,
                            'Threshold': threshold,
                            'Accuracy': accuracy,
                            'Precision': precision,
                            'Recall': recall,
                            'F1': f1
                        })
                        
                        print(f"  Threshold {threshold}: F1={f1:.3f}, P={precision:.3f}, R={recall:.3f}")
                
            except Exception as e:
                print(f"  Error evaluating {model_name}: {e}")
        
        # Display test results
        if test_evaluation_results:
            test_results_df = pd.DataFrame(test_evaluation_results).round(3)
            print(f"\n=== Test Set Results ===\n")
            display(test_results_df)
            
            # Find best configuration
            best_test_idx = test_results_df['F1'].idxmax()
            best_test = test_results_df.loc[best_test_idx]
            print(f"\n🏆 Best test performance: {best_test['Model']} @ threshold {best_test['Threshold']} (F1: {best_test['F1']:.3f})")
    
else:
    print("Test evaluation not available - no test correspondences or trained models")

## Step 8: Advanced MLBasedMatcher Features

Let's explore additional features of the MLBasedMatcher including threshold analysis and prediction utilities.
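
Threshold analysis amounts to converting match probabilities into binary decisions at several cut-offs and keeping the cut-off with the best F1. A minimal sketch on made-up scores (the labels and scores are illustrative, not results from this dataset):

```python
def f1_at_threshold(labels, scores, threshold):
    """F1 when every pair with score >= threshold is predicted as a match."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labels, scores, thresholds):
    """Threshold with the highest F1 (first one wins on ties)."""
    return max(thresholds, key=lambda t: f1_at_threshold(labels, scores, t))


labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
thresholds = [0.3, 0.5, 0.7]

for t in thresholds:
    print(f"threshold {t}: F1 = {f1_at_threshold(labels, scores, t):.3f}")
print("best:", best_threshold(labels, scores, thresholds))
```

A threshold tuned on the same pairs that are used for the final evaluation tends to look better than it will on new data; choosing it on the validation split from Step 4 and reporting test metrics once is the more conservative setup.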

In [None]:
# Threshold analysis for optimal decision boundary
if trained_models and 'test_pairs_df' in globals() and len(test_pairs_df) > 0:
    print("=== Threshold Analysis ===\n")
    
    # Use best performing model
    if 'best_model_name' in globals() and best_model_name in trained_models:
        best_model = trained_models[best_model_name]
        print(f"Using best model: {best_model_name}")
    else:
        # Use first available model
        best_model_name = list(trained_models.keys())[0]
        best_model = trained_models[best_model_name]
        print(f"Using model: {best_model_name}")
    
    # Get predictions
    ml_matcher = MLBasedMatcher(traditional_extractor)
    predictions_df = ml_matcher.predict_pairs(
        academy_df, actors_df, test_pairs_df, best_model, use_probabilities=True
    )
    
    if len(predictions_df) > 0:
        # Test different thresholds
        threshold_range = np.arange(0.1, 1.0, 0.1)
        threshold_analysis = []
        
        for threshold in threshold_range:
            binary_predictions = (predictions_df['prediction'] >= threshold).astype(int)
            
            precision = precision_score(test_labels, binary_predictions, zero_division=0)
            recall = recall_score(test_labels, binary_predictions, zero_division=0)
            f1 = f1_score(test_labels, binary_predictions, zero_division=0)
            
            threshold_analysis.append({
                'Threshold': threshold,
                'Precision': precision,
                'Recall': recall,
                'F1': f1,
                'Predictions': sum(binary_predictions)
            })
        
        threshold_df = pd.DataFrame(threshold_analysis).round(3)
        display(threshold_df)
        
        # Plot metrics against the threshold (left) and the precision-recall curve (right)
        plt.figure(figsize=(12, 4))
        
        plt.subplot(1, 2, 1)
        plt.plot(threshold_df['Threshold'], threshold_df['Precision'], 'o-', label='Precision')
        plt.plot(threshold_df['Threshold'], threshold_df['Recall'], 's-', label='Recall')
        plt.plot(threshold_df['Threshold'], threshold_df['F1'], '^-', label='F1')
        plt.xlabel('Threshold')
        plt.ylabel('Score')
        plt.title(f'{best_model_name} - Precision/Recall vs Threshold')
        plt.legend()
        plt.grid(True, alpha=0.3)
        
        plt.subplot(1, 2, 2)
        plt.plot(threshold_df['Recall'], threshold_df['Precision'], 'o-')
        plt.xlabel('Recall')
        plt.ylabel('Precision')
        plt.title(f'{best_model_name} - Precision-Recall Curve')
        plt.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        # Find optimal threshold (best F1)
        optimal_idx = threshold_df['F1'].idxmax()
        optimal_threshold = threshold_df.loc[optimal_idx, 'Threshold']
        optimal_f1 = threshold_df.loc[optimal_idx, 'F1']
        
        print(f"\n🎯 Optimal threshold: {optimal_threshold:.1f} (F1: {optimal_f1:.3f})")


## Step 9: Complete ML Pipeline

Let's put everything together in a complete machine learning entity matching pipeline.

In [None]:
def complete_ml_entity_matching_pipeline(
    df_left, df_right, 
    train_correspondences=None, 
    test_correspondences=None,
    output_dir=None
):
    """Complete ML entity matching pipeline."""
    
    print("=== Complete ML Entity Matching Pipeline ===")
    print(f"Left dataset: {df_left.attrs.get('dataset_name', 'unknown')} ({len(df_left)} records)")
    print(f"Right dataset: {df_right.attrs.get('dataset_name', 'unknown')} ({len(df_right)} records)")
    
    pipeline_results = {}
    
    # Step 1: Feature Engineering
    print("\n1. Feature Engineering...")
    
    # Create comprehensive feature extractor
    comparators = [
        StringComparator("title", similarity_function="jaro_winkler", preprocess=str.lower),
        StringComparator("title", similarity_function="cosine", preprocess=str.lower),
        DateComparator("date", max_days_difference=365),
        StringComparator("actor_name", similarity_function="jaro_winkler", preprocess=str.lower),
    ]
    feature_extractor = FeatureExtractor(comparators)
    
    print(f"   ✓ Created feature extractor with {len(comparators)} features")
    
    # Step 2: Training Data Preparation
    if train_correspondences is not None and len(train_correspondences) > 0:
        print("\n2. Preparing training data...")
        
        # Filter valid pairs
        train_pairs = []
        train_labels = []
        for _, row in train_correspondences.iterrows():
            id1, id2, label = row['id1'], row['id2'], row['label']
            if (id1 in df_left['_id'].values and id2 in df_right['_id'].values):
                train_pairs.append({'id1': id1, 'id2': id2})
                train_labels.append(label)
        
        if len(train_pairs) > 0:
            train_pairs_df = pd.DataFrame(train_pairs)
            train_labels = pd.Series(train_labels)
            
            # Extract features
            train_features = feature_extractor.create_features(
                df_left, df_right, train_pairs_df, labels=train_labels
            )
            
            print(f"   ✓ Training features: {train_features.shape}")
            print(f"   ✓ Positive examples: {sum(train_labels)} ({sum(train_labels)/len(train_labels)*100:.1f}%)")
            
            # Step 3: Model Training
            print("\n3. Training ML models...")
            
            feature_cols = [col for col in train_features.columns if col not in ['id1', 'id2', 'label']]
            X = train_features[feature_cols]
            y = train_features['label']
            
            # Train a Random Forest (fixed choice here; swap in the best model from Step 4 if it differs)
            best_model = RandomForestClassifier(n_estimators=100, random_state=42)
            best_model.fit(X, y)
            
            print(f"   ✓ Trained Random Forest classifier")
            pipeline_results['trained_model'] = best_model
            pipeline_results['feature_extractor'] = feature_extractor
            
            # Step 4: Model Application
            print("\n4. Applying ML model for matching...")
            
            # Create candidate pairs
            candidates = create_sample_candidates(df_left, df_right, max_pairs=200, strategy="title_similarity")
            print(f"   ✓ Generated {len(candidates)} candidate pairs")
            
            # Use MLBasedMatcher
            ml_matcher = MLBasedMatcher(feature_extractor)
            matches = ml_matcher.match(
                df_left, df_right, [candidates], best_model, threshold=0.5
            )
            
            print(f"   ✓ Found {len(matches)} matches above threshold 0.5")
            pipeline_results['matches'] = matches
            
            # Step 5: Evaluation (if test data available)
            if test_correspondences is not None and len(test_correspondences) > 0:
                print("\n5. Model evaluation...")
                
                # Prepare test data
                test_pairs = []
                test_labels = []
                for _, row in test_correspondences.iterrows():
                    id1, id2, label = row['id1'], row['id2'], row['label']
                    if (id1 in df_left['_id'].values and id2 in df_right['_id'].values):
                        test_pairs.append({'id1': id1, 'id2': id2})
                        test_labels.append(label)
                
                if len(test_pairs) > 0:
                    test_pairs_df = pd.DataFrame(test_pairs)
                    test_labels = pd.Series(test_labels)
                    
                    # Get predictions
                    predictions_df = ml_matcher.predict_pairs(
                        df_left, df_right, test_pairs_df, best_model
                    )
                    
                    if len(predictions_df) > 0:
                        binary_predictions = (predictions_df['prediction'] >= 0.5).astype(int)
                        
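                        # Calculate metrics (imported here so this step does not
                        # depend on an earlier cell having loaded them)
                        from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score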
                        eval_results = {
                            'accuracy': accuracy_score(test_labels, binary_predictions),
                            'precision': precision_score(test_labels, binary_predictions, zero_division=0),
                            'recall': recall_score(test_labels, binary_predictions, zero_division=0),
                            'f1': f1_score(test_labels, binary_predictions, zero_division=0)
                        }
                        
                        pipeline_results['evaluation'] = eval_results
                        print(f"   ✓ Test F1: {eval_results['f1']:.3f}, Precision: {eval_results['precision']:.3f}, Recall: {eval_results['recall']:.3f}")
            
            # Step 6: Output Generation
            if output_dir:
                print(f"\n6. Saving outputs to {output_dir}...")
                output_path = Path(output_dir)
                output_path.mkdir(parents=True, exist_ok=True)
                
                # Save matches
                if len(matches) > 0:
                    matches.to_csv(output_path / "ml_entity_matches.csv", index=False)
                    print(f"   ✓ Saved {len(matches)} matches")
                
                # Save model info
                model_info = {
                    'model_type': 'RandomForestClassifier',
                    'n_features': len(feature_cols),
                    'feature_names': list(feature_cols),  # plain list so json.dump can serialize it
                    'training_samples': len(train_features),
                    'threshold': 0.5
                }
                
                if 'evaluation' in pipeline_results:
                    model_info.update(pipeline_results['evaluation'])
                
                with open(output_path / "model_info.json", 'w', encoding='utf-8') as f:
                    json.dump(model_info, f, indent=2)
                
                print("   ✓ Saved model information")
        
        else:
            print("   ✗ No valid training pairs found")
    
    else:
        print("\n2. No training data provided - cannot train ML models")
    
    return pipeline_results

# Run complete pipeline
if len(train_corr) > 0:
    output_dir = root / "output" / "examples" / "ml_entitymatching"
    
    pipeline_results = complete_ml_entity_matching_pipeline(
        df_left=academy_df,
        df_right=actors_df,
        train_correspondences=train_corr,
        test_correspondences=test_corr if len(test_corr) > 0 else None,
        output_dir=str(output_dir)
    )
    
    print("\n✅ Complete ML pipeline finished!")
    print(f"📁 Check {output_dir} for outputs")
    
    if 'evaluation' in pipeline_results:
        eval_results = pipeline_results['evaluation']
        print(f"📊 Final Performance: F1={eval_results['f1']:.3f}, P={eval_results['precision']:.3f}, R={eval_results['recall']:.3f}")

else:
    print("Cannot run complete pipeline - no training correspondences available")

## Summary and Key Takeaways

This notebook demonstrated the complete machine learning entity matching workflow in PyDI using the MLBasedMatcher:

### Key Features Demonstrated:

1. **Traditional Feature Extraction**: Convert entity pairs to similarity features using comparators
   - String similarity (Jaro-Winkler, Levenshtein, Cosine)
   - Date/numeric comparisons
   - Custom feature functions
   
2. **Vector-Based Features** (optional): Deep feature representation using embeddings
   - Sentence transformer models
   - Multiple distance metrics (cosine, euclidean, manhattan)
   - Flexible pooling strategies

3. **ML Model Training**: Full scikit-learn integration
   - Multiple classifier types (Random Forest, Logistic Regression, SVM, etc.)
   - Proper train/validation splits
   - Ready for cross-validation and hyperparameter tuning

4. **MLBasedMatcher Usage**: Apply trained models to candidate pairs to find entity correspondences
   - Probabilistic and binary prediction modes
   - Batch processing of candidates
   - Flexible threshold configuration

5. **Model Evaluation**: Comprehensive performance analysis
   - Precision, recall, F1 score calculation
   - Threshold optimization
   - Feature importance analysis

6. **Production Pipeline**: End-to-end workflow with proper outputs
   - Structured result files
   - Model metadata and configuration
   - Reproducible execution
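
Threshold optimization (item 5 above) can be sketched with scikit-learn's `precision_recall_curve`. The labels and scores below are synthetic placeholders rather than outputs of this notebook, and `best_f1_threshold` is a hypothetical helper name, not part of PyDI.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def best_f1_threshold(y_true, y_score):
    """Return the score threshold with the highest F1, and that F1 value."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision and recall carry one extra trailing point with no threshold; drop it.
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])


# Synthetic validation labels and predicted match probabilities (illustrative only)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.10, 0.35, 0.40, 0.80, 0.45, 0.90, 0.20, 0.65])

threshold, f1 = best_f1_threshold(y_true, y_score)
print(f"Best threshold: {threshold:.2f} (F1 = {f1:.3f})")
```

In practice the scores would come from a held-out validation split, so that the chosen threshold is not tuned on the test correspondences.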

### Best Practices for ML Entity Matching:

1. **Feature Engineering is Crucial**: Combine multiple similarity measures for robust feature representation

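2. **Handle Class Imbalance**: Entity matching datasets typically have many more non-matches than matches, so use class weighting or resampling and judge models by precision, recall, and F1 rather than accuracy

3. **Threshold Tuning**: Different applications require different precision/recall trade-offs, so choose the decision threshold on validation data instead of defaulting to 0.5

4. **Cross-Validation**: Use stratified k-fold cross-validation to detect overfitting, and keep the test set untouched until the final evaluation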

5. **Feature Importance**: Understand which features drive matching decisions for interpretability

6. **Scalability**: Consider blocking/candidate generation strategies for large datasets
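
As a minimal sketch of the class-imbalance and cross-validation points, the example below trains a class-weighted Random Forest with stratified folds. The feature matrix is synthetic (three made-up similarity features with a 9:1 non-match to match ratio) and stands in for the features a `FeatureExtractor` would produce.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic similarity features: non-matches score low, matches score high,
# with some overlap between 0.5 and 0.6 (illustrative data only).
n_non_matches, n_matches = 450, 50
X = np.vstack([
    rng.uniform(0.0, 0.6, size=(n_non_matches, 3)),
    rng.uniform(0.5, 1.0, size=(n_matches, 3)),
])
y = np.concatenate([
    np.zeros(n_non_matches, dtype=int),
    np.ones(n_matches, dtype=int),
])

# class_weight="balanced" reweights samples inversely to class frequency.
clf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=42)

# Stratified folds keep the match/non-match ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")

print(f"F1 per fold: {np.round(scores, 3)}")
print(f"Mean F1: {scores.mean():.3f}")
```

Scoring with F1 rather than accuracy matters here: a classifier that predicts "non-match" for every pair would already reach 90% accuracy on this data.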

### Integration with Scikit-learn Ecosystem:

The MLBasedMatcher is a thin wrapper around scikit-learn, allowing you to:
- Use any scikit-learn classifier
- Apply hyperparameter tuning (GridSearchCV, RandomizedSearchCV)
- Use ensemble methods and pipelines
- Leverage feature selection and preprocessing utilities
- Apply cross-validation and model evaluation tools
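
The sketch below shows one way to combine a preprocessing pipeline with `GridSearchCV` on synthetic similarity features. It assumes, as the usage in this notebook suggests, that `MLBasedMatcher` accepts a fitted estimator exposing `predict_proba`; check the PyDI documentation before passing a `Pipeline` to it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for extracted pair features (illustrative data only)
X = np.vstack([
    rng.uniform(0.0, 0.6, size=(200, 4)),   # non-matches
    rng.uniform(0.4, 1.0, size=(40, 4)),    # matches
])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(40, dtype=int)])

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(X, y)

best_model = search.best_estimator_  # refit on all of X, y with the best parameters
match_probabilities = best_model.predict_proba(X)[:, 1]

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```

Keeping the scaler inside the pipeline means it is refit on each training fold, which avoids leaking validation statistics into the model selection.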

### Next Steps:

- **Advanced Blocking**: Implement sophisticated candidate generation strategies
- **Deep Learning**: Experiment with neural network architectures for entity matching
- **Active Learning**: Iteratively improve models with human feedback
- **Ensemble Methods**: Combine multiple matching approaches for better performance
- **Domain Adaptation**: Apply to your specific datasets with domain-specific comparators
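
As a starting point for the blocking item above, here is a minimal token-blocking sketch in plain pandas. It does not use PyDI's own blocking API; `token_blocking`, the `title` column, and the toy records are hypothetical, while the `_id`, `id1`, and `id2` column names follow the conventions used earlier in this notebook.

```python
import pandas as pd


def token_blocking(df_left, df_right, column, id_column="_id", min_token_length=3):
    """Return candidate pairs (id1, id2) that share at least one token in `column`."""

    def tokens(df, id_name):
        out = df[[id_column, column]].copy()
        # Lowercase the text and split it into alphanumeric tokens.
        out["token"] = (
            out[column].fillna("").astype(str).str.lower().str.findall(r"[a-z0-9]+")
        )
        # One row per (record, token); records without tokens are dropped.
        out = out.explode("token").dropna(subset=["token"])
        out = out[out["token"].str.len() >= min_token_length]
        return out.rename(columns={id_column: id_name})[[id_name, "token"]].drop_duplicates()

    # Joining on the token yields every pair of records that share it.
    pairs = tokens(df_left, "id1").merge(tokens(df_right, "id2"), on="token")
    return pairs[["id1", "id2"]].drop_duplicates().reset_index(drop=True)


# Toy records (illustrative only)
left = pd.DataFrame({
    "_id": ["a1", "a2", "a3"],
    "title": ["The Godfather", "Forrest Gump", "Titanic"],
})
right = pd.DataFrame({
    "_id": ["b1", "b2", "b3"],
    "title": ["Godfather, The", "Gump Forrest", "Casablanca"],
})

candidates = token_blocking(left, right, column="title")
print(candidates)
print(f"{len(candidates)} candidate pairs instead of {len(left) * len(right)}")
```

Very frequent tokens produce large blocks, so a real implementation would usually also filter stop words or cap the block size.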