# PyDI Data Fusion Framework Showcase

This notebook demonstrates the data fusion capabilities of PyDI. We'll show:

1. **Loading and preparing Winter movie datasets**
2. **Creating sophisticated fusion strategies**
3. **Running the fusion engine with connected components grouping**
4. **Evaluating fusion quality**
5. **Generating reports**
6. **Custom conflict resolution rules**
7. **Provenance tracking**


In [1]:
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET
from pathlib import Path
import logging
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Configure logging for better visibility
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

print("✅ Imports successful")

✅ Imports successful


In [2]:
# Import PyDI fusion components
from PyDI.fusion import (
    # Core engine and strategy
    DataFusionEngine,
    DataFusionStrategy,
    DataFusionEvaluator,
    
    # Conflict resolution rules
    LongestString,
    ShortestString,
    Average,
    Median,
    Maximum,
    MostRecent,
    Earliest,
    Union,
    Intersection,
    Voting,
    FavourSources,
    
    # Reporting and evaluation
    FusionReport,
    FusionQualityMetrics,
    ProvenanceTracker,
    
    # Base classes for custom rules
    ConflictResolutionFunction,
    AttributeValueFuser,
    FusionResult,
    
    # Convenience functions
    create_simple_strategy,
    build_record_groups_from_correspondences,
)

# Import entity matching components
from PyDI.entitymatching.base import ensure_record_ids

print("✅ PyDI fusion components imported successfully")

✅ PyDI fusion components imported successfully


## 1. Loading Winter Movie Datasets

We'll load the movie datasets from Winter's XML files and convert them to pandas DataFrames suitable for fusion.
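As a minimal sketch of the XML-to-rows step, the snippet below parses a tiny inline document using `Element.findtext`, which returns `None` when a child element is absent. The element names mirror those read by the parser in this notebook; the two sample records and the helper name `movies_to_rows` are illustrative assumptions, not part of the Winter data or of PyDI.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; the real Winter files are much larger.
SAMPLE_XML = """
<movies>
  <movie>
    <id>sample_1</id>
    <title>7th Heaven</title>
    <directors><director><name>Frank Borzage</name></director></directors>
  </movie>
  <movie>
    <id>sample_2</id>
    <title>Coquette</title>
  </movie>
</movies>
"""

def movies_to_rows(xml_text):
    """Turn <movie> elements into flat dicts, one per movie (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for movie in root.findall(".//movie"):
        # Collect nested director names, skipping empty or missing <name> elements
        names = [d.findtext("name") for d in movie.findall("directors/director")]
        names = [n for n in names if n]
        rows.append({
            "_id": movie.findtext("id"),        # None if <id> is absent
            "title": movie.findtext("title"),
            "date": movie.findtext("date"),
            "director": ", ".join(names) if names else None,
        })
    return rows

rows = movies_to_rows(SAMPLE_XML)
```

Passing `rows` to `pd.DataFrame` yields the tabular form used in the rest of the notebook, with missing attributes showing up as null values.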

In [None]:
def parse_winter_xml(xml_path, dataset_name):
    """Parse Winter XML format and convert to pandas DataFrame."""
    
    tree = ET.parse(xml_path)
    root = tree.getroot()
    
    movies = []
    
    for movie in root.findall('.//movie'):
        movie_data = {}
        
        # Basic movie information
        movie_data['_id'] = movie.find('id').text if movie.find('id') is not None else None
        movie_data['title'] = movie.find('title').text if movie.find('title') is not None else None
        movie_data['date'] = movie.find('date').text if movie.find('date') is not None else None
        
        # Handle directors
        directors = movie.find('directors')
        if directors is not None:
            director_names = [d.find('name').text for d in directors.findall('director') 
                            if d.find('name') is not None]
            movie_data['director'] = ', '.join(director_names) if director_names else None
        
        # Handle actors
        actors = movie.find('actors')
        if actors is not None:
            actor_names = [a.find('name').text for a in actors.findall('actor') 
                          if a.find('name') is not None]
            movie_data['actors'] = ', '.join(actor_names) if actor_names else None
        
        # Handle studios
        studios = movie.find('studios')
        if studios is not None:
            studio_names = [s.find('name').text for s in studios.findall('studio') 
                           if s.find('name') is not None]
            movie_data['studio'] = ', '.join(studio_names) if studio_names else None
        
        movies.append(movie_data)
    
    df = pd.DataFrame(movies)
    
    # Add dataset metadata
    df.attrs['dataset_name'] = dataset_name
    df.attrs['source_format'] = 'winter_xml'
    df.attrs['loaded_at'] = datetime.now().isoformat()
    
    return df

# Define paths to Winter datasets
data_dir = Path('/Users/aaronsteiner/Documents/GitHub/PyDI/input/movies/fusion/data')
correspondences_dir = Path('/Users/aaronsteiner/Documents/GitHub/PyDI/input/movies/fusion/correspondences')

print(f"Data directory: {data_dir}")
print(f"Correspondences directory: {correspondences_dir}")

In [None]:
# Load the three movie datasets
academy_awards_df = parse_winter_xml(data_dir / 'academy_awards.xml', 'academy_awards')
actors_df = parse_winter_xml(data_dir / 'actors.xml', 'actors')
golden_globes_df = parse_winter_xml(data_dir / 'golden_globes.xml', 'golden_globes')

print("📊 Dataset Overview:")
print(f"Academy Awards: {len(academy_awards_df)} movies")
print(f"Actors: {len(actors_df)} movies")
print(f"Golden Globes: {len(golden_globes_df)} movies")
print(f"\nTotal records: {len(academy_awards_df) + len(actors_df) + len(golden_globes_df)}")

In [None]:
# Display sample data from each dataset
print("🎬 Academy Awards Sample:")
display(academy_awards_df[['_id', 'title', 'director', 'date']].head())

print("\n🎭 Actors Sample:")
display(actors_df[['_id', 'title', 'actors', 'date']].head())

print("\n🏆 Golden Globes Sample:")
display(golden_globes_df[['_id', 'title', 'director', 'date']].head())

In [None]:
# Load correspondences (matches between datasets)
def load_correspondences(corr_path):
    """Load correspondences from Winter CSV format."""
    df = pd.read_csv(corr_path, header=None, names=['id1', 'id2', 'score'])
    return df

# Load all correspondence files
corr_aa_actors = load_correspondences(correspondences_dir / 'academy_awards_2_actors_correspondences.csv')
corr_actors_gg = load_correspondences(correspondences_dir / 'actors_2_golden_globes_correspondences.csv')

print("🔗 Correspondences Overview:")
print(f"Academy Awards ↔ Actors: {len(corr_aa_actors)} matches")
print(f"Actors ↔ Golden Globes: {len(corr_actors_gg)} matches")

# Combine correspondences for fusion
all_correspondences = pd.concat([corr_aa_actors, corr_actors_gg], ignore_index=True)
print(f"\nTotal correspondences: {len(all_correspondences)}")

display(all_correspondences.head(10))

## 2. Exploring Data Quality and Overlap

Before fusion, let's analyze the data quality and understand what conflicts we might encounter.

In [None]:
# Analyze attribute coverage across datasets
datasets = [academy_awards_df, actors_df, golden_globes_df]
dataset_names = ['Academy Awards', 'Actors', 'Golden Globes']

print("📊 Attribute Coverage Analysis:")
print("-" * 50)

all_attributes = set()
for df in datasets:
    all_attributes.update(df.columns)

coverage_df = pd.DataFrame(index=sorted(all_attributes), columns=dataset_names)

for name, df in zip(dataset_names, datasets):
    for attr in all_attributes:
        if attr in df.columns:
            non_null_count = df[attr].count()
            total_count = len(df)
            coverage = f"{non_null_count}/{total_count} ({non_null_count/total_count:.1%})"
        else:
            coverage = "0/0 (0%)"
        # Label-based assignment: row = attribute, column = dataset name
        coverage_df.loc[attr, name] = coverage

display(coverage_df.fillna("0/0 (0%)"))

In [None]:
# Analyze potential conflicts by examining matched records
def analyze_conflicts_preview(datasets, correspondences, sample_size=5):
    """Preview potential conflicts in matched records."""
    
    # Build lookup tables
    id_to_record = {}
    id_to_dataset = {}
    
    for df in datasets:
        dataset_name = df.attrs['dataset_name']
        for _, record in df.iterrows():
            record_id = record['_id']
            id_to_record[record_id] = record
            id_to_dataset[record_id] = dataset_name
    
    # Sample some correspondences to show conflicts
    sample_corr = correspondences.head(sample_size)
    
    print(f"🔍 Conflict Analysis Preview (First {sample_size} matches):")
    print("=" * 80)
    
    for i, (_, corr) in enumerate(sample_corr.iterrows(), 1):
        id1, id2, score = corr['id1'], corr['id2'], corr['score']
        
        record1 = id_to_record.get(id1)
        record2 = id_to_record.get(id2)
        
        if record1 is None or record2 is None:
            continue
            
        dataset1 = id_to_dataset[id1]
        dataset2 = id_to_dataset[id2]
        
        print(f"\nMatch {i}: {dataset1} ↔ {dataset2} (score: {score})")
        print(f"  {dataset1}: {record1.get('title', 'N/A')} | {record1.get('director', 'N/A')} | {record1.get('date', 'N/A')}")
        print(f"  {dataset2}: {record2.get('title', 'N/A')} | {record2.get('director', 'N/A')} | {record2.get('date', 'N/A')}")
        
        # Check for conflicts
        conflicts = []
        common_attrs = set(record1.index) & set(record2.index)
        for attr in ['title', 'director', 'date']:
            if attr in common_attrs:
                val1, val2 = record1.get(attr), record2.get(attr)
                if pd.notna(val1) and pd.notna(val2) and str(val1) != str(val2):
                    conflicts.append(f"{attr}: '{val1}' vs '{val2}'")
        
        if conflicts:
            print(f"  ⚠️  Conflicts: {'; '.join(conflicts)}")
        else:
            print(f"  ✅ No obvious conflicts")

analyze_conflicts_preview(datasets, all_correspondences)

## 3. Creating a Sophisticated Fusion Strategy

Now we'll build a fusion strategy that assigns each attribute its own conflict resolution rule: custom rules for titles and dates, `LongestString` for directors, and `Union` for actors and studios.
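To make the idea of a conflict resolution rule concrete, here is a plain-Python sketch of three common rules: longest string, union, and voting. These functions are illustrative stand-ins written for this explanation; they are not PyDI's implementations, and the behavior of PyDI's `LongestString`, `Union`, and `Voting` classes may differ in details such as tie-breaking or separators.

```python
from collections import Counter

def longest_string(values):
    """Pick the longest non-empty string among the candidates."""
    candidates = [str(v).strip() for v in values if v is not None and str(v).strip()]
    return max(candidates, key=len) if candidates else None

def union(values, sep=", "):
    """Merge delimited lists, keeping first-seen order and dropping duplicates."""
    seen = {}  # dicts preserve insertion order, so this acts as an ordered set
    for v in values:
        if v is None:
            continue
        for item in str(v).split(sep):
            item = item.strip()
            if item:
                seen.setdefault(item, None)
    return sep.join(seen) if seen else None

def voting(values):
    """Pick the most frequent value; ties go to the value seen first."""
    candidates = [v for v in values if v is not None]
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]
```

Each rule takes the conflicting values for one attribute and returns a single fused value, which is the same contract the custom `resolve` methods in this section follow, minus the confidence and metadata.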

In [None]:
# Create a custom conflict resolution function for movie titles
class SmartMovieTitle(ConflictResolutionFunction):
    """Intelligent movie title fusion that handles common variations."""
    
    @property
    def name(self) -> str:
        return "smart_movie_title"
    
    def resolve(self, values, context):
        if not values:
            return FusionResult(value=None, confidence=0.0, rule_used=self.name)
        
        # Remove duplicates and None values
        clean_values = [str(v).strip() for v in values if pd.notna(v) and str(v).strip()]
        
        if not clean_values:
            return FusionResult(value=None, confidence=0.0, rule_used=self.name)
        
        # If all values are the same, high confidence
        if len(set(clean_values)) == 1:
            return FusionResult(
                value=clean_values[0],
                confidence=1.0,
                rule_used=self.name,
                metadata={"reason": "unanimous"}
            )
        
        # Choose the longest title (often more complete)
        longest_title = max(clean_values, key=len)
        
        # Check if shorter titles are substrings of longer ones
        is_superset = all(short_title.lower() in longest_title.lower() 
                         for short_title in clean_values)
        
        confidence = 0.9 if is_superset else 0.7
        
        return FusionResult(
            value=longest_title,
            confidence=confidence,
            rule_used=self.name,
            metadata={
                "candidates": clean_values,
                "is_superset": is_superset
            }
        )

# Create a custom date fusion rule
class SmartDateFusion(ConflictResolutionFunction):
    """Choose the most precise date, using string length as a rough proxy for precision."""
    
    @property
    def name(self) -> str:
        return "smart_date"
    
    def resolve(self, values, context):
        if not values:
            return FusionResult(value=None, confidence=0.0, rule_used=self.name)
        
        clean_values = [v for v in values if pd.notna(v)]
        if not clean_values:
            return FusionResult(value=None, confidence=0.0, rule_used=self.name)
        
        # Score each candidate by string length; no actual date parsing is performed
        parsed_dates = []
        for val in clean_values:
            date_str = str(val)
            precision_score = 0
            
            # Score based on precision (more specific dates get higher scores)
            if len(date_str) >= 10:  # Full date YYYY-MM-DD
                precision_score = 3
            elif len(date_str) >= 7:   # Year-Month YYYY-MM
                precision_score = 2
            elif len(date_str) >= 4:   # Just year YYYY
                precision_score = 1
            
            parsed_dates.append((val, precision_score))
        
        # Choose date with highest precision
        best_date = max(parsed_dates, key=lambda x: x[1])
        
        # Calculate confidence
        max_score = best_date[1]
        confidence = min(1.0, 0.5 + max_score * 0.15)
        
        return FusionResult(
            value=best_date[0],
            confidence=confidence,
            rule_used=self.name,
            metadata={
                "precision_score": max_score,
                "candidates": clean_values
            }
        )

print("✅ Custom conflict resolution functions created")

In [None]:
# Create a comprehensive fusion strategy
movie_strategy = DataFusionStrategy("comprehensive_movie_fusion")

# Configure fusion rules for each attribute
movie_strategy.add_attribute_fuser_from_resolver("title", SmartMovieTitle())
movie_strategy.add_attribute_fuser_from_resolver("director", LongestString())  # Longer names often more complete
movie_strategy.add_attribute_fuser_from_resolver("actors", Union())           # Combine all actors
movie_strategy.add_attribute_fuser_from_resolver("studio", Union())           # Combine all studios
movie_strategy.add_attribute_fuser_from_resolver("date", SmartDateFusion())   # Smart date handling

print("🎯 Fusion Strategy Configuration:")
print(f"Strategy name: {movie_strategy.name}")
print(f"Registered attributes: {list(movie_strategy.get_registered_attributes())}")

# Show the rules for each attribute
for attr in movie_strategy.get_registered_attributes():
    fuser = movie_strategy.get_attribute_fuser(attr)
    print(f"  {attr}: {fuser.resolver.name}")

## 4. Running the Fusion Engine

Now we'll execute the fusion process using the DataFusionEngine with connected components grouping.
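Conceptually, once records are grouped, each group is reduced to one fused record by applying a rule per attribute. The sketch below shows that loop in plain Python. The function `fuse_group`, the rule mapping, and the sample group are assumptions made for illustration; `DataFusionEngine` may organize this step differently internally.

```python
def fuse_group(records, rules):
    """Fuse one group of matched records into a single record.

    rules maps an attribute name to a function that takes the list of
    non-missing values for that attribute and returns the fused value.
    """
    fused = {}
    for attr, resolve in rules.items():
        values = [r[attr] for r in records if r.get(attr) is not None]
        fused[attr] = resolve(values) if values else None
    return fused

# Hypothetical group of two matched records from different sources
group = [
    {"title": "7th Heaven", "director": "Frank Borzage"},
    {"title": "7th Heaven", "director": None, "date": "1927-01-01"},
]

# Simple illustrative rules: keep the longest candidate for every attribute
rules = {
    "title": lambda vs: max(vs, key=len),
    "director": lambda vs: max(vs, key=len),
    "date": lambda vs: max(vs, key=len),
}

fused = fuse_group(group, rules)
```

Attributes that only one source provides, such as `date` here, pass straight through, which is how fusion increases coverage as well as resolving conflicts.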

In [None]:
# Create and run the fusion engine
fusion_engine = DataFusionEngine(movie_strategy)

print("🚀 Starting Data Fusion Process...")
print("-" * 50)

# Run fusion with all datasets and correspondences
fused_movies = fusion_engine.run(
    datasets=[academy_awards_df, actors_df, golden_globes_df],
    correspondences=all_correspondences
)

print(f"\n✅ Fusion Complete!")
print(f"Input records: {len(academy_awards_df) + len(actors_df) + len(golden_globes_df)}")
print(f"Output records: {len(fused_movies)}")
print(f"Compression ratio: {len(fused_movies) / (len(academy_awards_df) + len(actors_df) + len(golden_globes_df)):.2%}")

In [None]:
# Examine the fusion results
print("🎬 Fusion Results Overview:")
print(f"Fused dataset shape: {fused_movies.shape}")
print(f"Columns: {list(fused_movies.columns)}")

# Show sample fused records
print("\n📋 Sample Fused Records:")
display_cols = ['_id', 'title', 'director', 'date', '_fusion_confidence', '_fusion_sources']
available_cols = [col for col in display_cols if col in fused_movies.columns]
display(fused_movies[available_cols].head(10))

In [None]:
# Analyze fusion confidence and sources
print("📊 Fusion Quality Analysis:")

if '_fusion_confidence' in fused_movies.columns:
    confidence_stats = fused_movies['_fusion_confidence'].describe()
    print("\nConfidence Statistics:")
    print(confidence_stats)
    
    # Show confidence distribution
    print("\nConfidence Distribution:")
    confidence_bins = pd.cut(fused_movies['_fusion_confidence'], 
                           bins=[0, 0.5, 0.7, 0.9, 1.0], 
                           labels=['Low (0-0.5)', 'Medium (0.5-0.7)', 'High (0.7-0.9)', 'Very High (0.9-1.0)'],
                           include_lowest=True)  # otherwise a confidence of exactly 0 falls outside all bins
    print(confidence_bins.value_counts())

if '_fusion_sources' in fused_movies.columns:
    # Analyze source combinations
    source_counts = fused_movies['_fusion_sources'].apply(len)
    print("\nSource Count Distribution:")
    print(source_counts.value_counts().sort_index())
    
    print(f"\nMulti-source records: {(source_counts > 1).sum()}")
    print(f"Single-source records: {(source_counts == 1).sum()}")

## 5. Detailed Analysis of High-Quality Matches

Let's examine some of the best fusion results to see how our strategy performed.

In [None]:
# Find and display high-confidence multi-source fusions
if '_fusion_confidence' in fused_movies.columns and '_fusion_sources' in fused_movies.columns:
    high_quality_fusions = fused_movies[
        (fused_movies['_fusion_confidence'] > 0.8) & 
        (fused_movies['_fusion_sources'].apply(len) > 1)
    ].copy()
    
    print(f"🌟 High-Quality Multi-Source Fusions ({len(high_quality_fusions)} records):")
    print("=" * 80)
    
    for i, (_, record) in enumerate(high_quality_fusions.head(5).iterrows(), 1):
        print(f"\n{i}. {record.get('title', 'N/A')}")
        print(f"   Director: {record.get('director', 'N/A')}")
        print(f"   Date: {record.get('date', 'N/A')}")
        print(f"   Confidence: {record.get('_fusion_confidence', 0):.3f}")
        print(f"   Sources: {', '.join(record.get('_fusion_sources', []))}")
        
        # Show actors and studios if available
        if pd.notna(record.get('actors')):
            actors = str(record.get('actors'))
            if len(actors) > 100:
                actors = actors[:100] + "..."
            print(f"   Actors: {actors}")
        
        if pd.notna(record.get('studio')):
            print(f"   Studio: {record.get('studio')}")
else:
    print("⚠️  Fusion metadata not available for detailed analysis")

## 6. Creating a Comprehensive Fusion Report

PyDI's reporting framework provides detailed analytics and diagnostics for fusion results.

In [None]:
# Create a comprehensive fusion report
fusion_report = FusionReport(
    fused_df=fused_movies,
    input_datasets=[academy_awards_df, actors_df, golden_globes_df],
    strategy_name=movie_strategy.name,
    correspondences=all_correspondences,
    # evaluation_results will be added later
)

# Display the comprehensive report
fusion_report.print_summary()

In [None]:
# Get detailed quality metrics
quality_metrics = FusionQualityMetrics.calculate_consistency_metrics(fused_movies)
coverage_metrics = FusionQualityMetrics.calculate_coverage_metrics(
    [academy_awards_df, actors_df, golden_globes_df], fused_movies
)

print("📈 Detailed Quality Metrics:")
print("=" * 50)

print("\n🎯 Quality Metrics:")
for key, value in quality_metrics.items():
    if isinstance(value, dict):
        print(f"  {key}:")
        for sub_key, sub_value in value.items():
            print(f"    {sub_key}: {sub_value}")
    else:
        if isinstance(value, float):
            print(f"  {key}: {value:.3f}")
        else:
            print(f"  {key}: {value}")

print("\n📊 Coverage Metrics:")
for key, value in coverage_metrics.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.3f}")
    else:
        print(f"  {key}: {value}")

## 7. Exploring Connected Components Grouping

Let's examine how PyDI's connected components algorithm grouped records for fusion.
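For intuition, connected components grouping can be sketched with a small union-find structure: every correspondence merges the groups of its two records, so records linked transitively (Academy Awards ↔ Actors ↔ Golden Globes) end up together. This is a generic textbook version under assumed record IDs, not PyDI's actual implementation of `build_record_groups_from_correspondences`.

```python
def connected_components(record_ids, pairs):
    """Group record IDs that are linked directly or transitively by pairs."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go (path halving)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        # Ignore correspondences that point at unknown records
        if a in parent and b in parent:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for rid in record_ids:
        groups.setdefault(find(rid), []).append(rid)
    return list(groups.values())

# Hypothetical IDs: one movie present in all three sources, one unmatched record
ids = ["aa_1", "act_1", "gg_1", "aa_2"]
pairs = [("aa_1", "act_1"), ("act_1", "gg_1")]
groups = connected_components(ids, pairs)
```

Note that `aa_1` and `gg_1` land in the same group even though no correspondence links them directly; unmatched records such as `aa_2` form singleton groups.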

In [None]:
# Manually examine the record grouping process
record_groups = build_record_groups_from_correspondences(
    [academy_awards_df, actors_df, golden_globes_df],
    all_correspondences
)

print(f"🔗 Connected Components Analysis:")
print(f"Total groups: {len(record_groups)}")

# Analyze group sizes
group_sizes = [len(group.records) for group in record_groups]
group_size_counts = pd.Series(group_sizes).value_counts().sort_index()

print("\nGroup Size Distribution:")
for size, count in group_size_counts.items():
    print(f"  {size} records: {count} groups")

# Show examples of multi-record groups
multi_record_groups = [g for g in record_groups if len(g.records) > 1]
print(f"\n🎭 Multi-Record Groups: {len(multi_record_groups)}")

for i, group in enumerate(multi_record_groups[:3], 1):
    print(f"\nGroup {i} ({group.group_id}): {len(group.records)} records")
    for record in group.records:
        dataset = group.source_datasets.get(record.get('_id', ''), 'unknown')
        title = record.get('title', 'N/A')
        print(f"  - [{dataset}] {title}")

## 8. Advanced: A Gold Standard for Evaluation

Let's build a small, manually curated gold standard that fused records can be checked against.
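One simple way to score fused output against a gold standard is per-attribute accuracy: the share of gold values that the fused record reproduces exactly. The sketch below assumes records can be aligned on `title` and uses hypothetical fused rows; it is not how PyDI's `DataFusionEvaluator` is necessarily implemented, and a real evaluation would typically align on record IDs and use tolerant comparisons.

```python
def attribute_accuracy(fused_rows, gold_rows, key, attrs):
    """Return {attribute: fraction of gold values matched exactly}."""
    gold_by_key = {g[key]: g for g in gold_rows}
    correct = {a: 0 for a in attrs}
    compared = {a: 0 for a in attrs}
    for row in fused_rows:
        gold = gold_by_key.get(row.get(key))
        if gold is None:
            continue  # no gold record for this fused record
        for a in attrs:
            if gold.get(a) is None:
                continue  # gold standard has no value to compare against
            compared[a] += 1
            if row.get(a) == gold[a]:
                correct[a] += 1
    return {a: (correct[a] / compared[a] if compared[a] else None) for a in attrs}

gold = [
    {"title": "7th Heaven", "date": "1927-01-01", "director": "Frank Borzage"},
    {"title": "Coquette", "date": "1929-01-01", "director": "Sam Taylor"},
]
# Hypothetical fusion output, with one abbreviated director name
fused = [
    {"title": "7th Heaven", "date": "1927-01-01", "director": "Frank Borzage"},
    {"title": "Coquette", "date": "1929-01-01", "director": "S. Taylor"},
]
scores = attribute_accuracy(fused, gold, key="title", attrs=["date", "director"])
```

Exact string equality is deliberately strict here; the abbreviated director name counts as an error, which is the kind of case where a similarity-based comparison would be more forgiving.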

In [None]:
# Create a gold standard for a few movies (manually curated)
gold_standard_data = {
    '_id': ['manual_1', 'manual_2', 'manual_3'],
    'title': ['7th Heaven', 'Coquette', 'The Broadway Melody'],
    'date': ['1927-01-01', '1929-01-01', '1929-01-01'],  # Known correct dates
    'director': ['Frank Borzage', 'Sam Taylor', 'Harry Beaumont']
}

gold_standard_df = pd.DataFrame(gold_standard_data)

print("🏆 Gold Standard Dataset:")
display(gold_standard_df)

# This demo only builds the gold standard; no evaluation is run against it here.
# In practice, you'd have a comprehensive gold standard
print("\n📝 Note: This is a simplified evaluation for demonstration.")
print("In practice, you would have a comprehensive gold standard dataset.")

## 9. Provenance Tracking Demonstration

PyDI's provenance tracking system keeps detailed lineage information.

In [None]:
# Demonstrate provenance tracking
provenance_tracker = ProvenanceTracker()

# Register datasets with trust scores
provenance_tracker.register_dataset_source('academy_awards', trust_score=0.9)
provenance_tracker.register_dataset_source('actors', trust_score=0.85)
provenance_tracker.register_dataset_source('golden_globes', trust_score=0.88)

# Track input data
for df in [academy_awards_df, actors_df, golden_globes_df]:
    provenance_tracker.track_input_data(df, df.attrs['dataset_name'])

print("🔍 Provenance Tracking Summary:")
source_stats = provenance_tracker.get_source_statistics()

for source, stats in source_stats.items():
    print(f"\n📊 {source}:")
    print(f"  Records: {stats['record_count']}")
    print(f"  Trust Score: {stats['trust_score']:.2f}")
    print(f"  Avg Confidence: {stats['average_confidence']:.3f}")
    print(f"  Contribution: {stats['contribution_ratio']:.2%}")

## 10. Comparison with Legacy API

PyDI maintains backward compatibility with the original DataFuser API.

In [None]:
# Demonstrate backward compatibility with legacy API
from PyDI.fusion import DataFuser, FusionRule

print("🔄 Legacy API Demonstration:")
print("(Maintaining backward compatibility)")

# Create legacy fusion rules
legacy_rules = {
    'title': FusionRule('longest'),
    'director': FusionRule('longest'),
    'date': FusionRule('most_recent')
}

# Use legacy DataFuser
legacy_fuser = DataFuser()

# For demo, use a small subset of data
small_correspondences = all_correspondences.head(5)

print(f"\nRunning legacy fusion with {len(small_correspondences)} correspondences...")

try:
    legacy_result = legacy_fuser.fuse(
        datasets=[academy_awards_df, actors_df, golden_globes_df],
        correspondences=small_correspondences,
        rules=legacy_rules
    )
    
    print(f"✅ Legacy fusion successful: {len(legacy_result)} records")
    if len(legacy_result) > 0:
        display_cols = ['_id', 'title', 'director', 'date']
        available_cols = [col for col in display_cols if col in legacy_result.columns]
        display(legacy_result[available_cols].head(3))
        
except Exception as e:
    print(f"⚠️  Legacy fusion encountered an issue: {e}")
    print("This may be due to data format differences between the legacy and the new API.")

## 11. Performance and Scalability Analysis

Let's time a single fusion run over the full dataset to get a rough throughput figure.
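A single timing can be noisy, so it often helps to repeat the call and report the best and median durations. The helper below is a generic sketch using `time.perf_counter`; the name `time_call` is an assumption for illustration, and the example times Python's built-in `sorted` rather than a PyDI call so that it runs on its own.

```python
import statistics
import time

def time_call(fn, *args, repeats=5, **kwargs):
    """Call fn several times and summarize wall-clock durations in seconds."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return {
        "best": min(durations),
        "median": statistics.median(durations),
        "result": result,
    }

# Stand-in workload: sort 10,000 integers given in descending order
stats = time_call(sorted, range(10_000, 0, -1), repeats=3)
```

The same helper could wrap a fusion run by passing a function that builds the engine and calls `run`, keeping in mind that repeated runs multiply the total execution time.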

In [None]:
import time

# Performance analysis
print("⚡ Performance Analysis:")
print("=" * 40)

# Measure wall-clock fusion time for a given set of inputs
def measure_fusion_time(datasets, correspondences, strategy):
    start_time = time.perf_counter()  # monotonic, high-resolution timer
    engine = DataFusionEngine(strategy)
    result = engine.run(datasets, correspondences)
    end_time = time.perf_counter()
    return end_time - start_time, len(result)

# Test with full dataset
fusion_time, result_count = measure_fusion_time(
    [academy_awards_df, actors_df, golden_globes_df],
    all_correspondences,
    movie_strategy
)

total_input_records = len(academy_awards_df) + len(actors_df) + len(golden_globes_df)

print(f"📊 Full Dataset Performance:")
print(f"  Input records: {total_input_records:,}")
print(f"  Output records: {result_count:,}")
print(f"  Correspondences: {len(all_correspondences):,}")
print(f"  Fusion time: {fusion_time:.3f} seconds")
print(f"  Records/second: {total_input_records/fusion_time:.0f}")
print(f"  Compression ratio: {result_count/total_input_records:.2%}")

## 12. Export and Reporting

Finally, let's export our results and generate comprehensive reports.

In [None]:
# Export fusion results to different formats
print("💾 Export Options:")
print("=" * 30)

# Show JSON report structure (without actually saving)
report_json = fusion_report.to_json()
print(f"📄 JSON Report Size: {len(report_json):,} characters")

# Show data export formats available
print(f"\n📊 Available Export Formats:")
print(f"  • CSV: {len(fused_movies)} records ready for export")
print(f"  • JSON: Full report with metadata")
print(f"  • HTML: Interactive report with visualizations")
print(f"  • Parquet: Efficient binary format with metadata")

# Sample export commands (commented to avoid file creation)
print(f"\n💡 Sample Export Commands:")
print(f"```python")
print(f"# Export fused data")
print(f"fused_movies.to_csv('fused_movies.csv', index=False)")
print(f"")
print(f"# Export comprehensive report")
print(f"fusion_report.export_detailed_results('output/movie_fusion/')")
print(f"```")

## Summary and Key Takeaways

🎉 **Congratulations!** You've successfully explored PyDI's comprehensive data fusion framework.

### What We Accomplished:

1. **✅ Loaded Winter movie datasets** - Parsed XML data and correspondences
2. **✅ Created sophisticated fusion strategies** - Custom conflict resolution rules
3. **✅ Executed connected components fusion** - Advanced record grouping
4. **✅ Generated quality reports** - Comprehensive analytics and diagnostics
5. **✅ Demonstrated provenance tracking** - Full lineage and trust management
6. **✅ Showed backward compatibility** - Legacy API still supported

### Key Features of PyDI Fusion:

- 🔧 **Modular Architecture**: Pluggable rules and configurable strategies
- 🐼 **Pandas-First**: Native DataFrame operations with metadata support
- 📊 **Rich Analytics**: Confidence scores, quality metrics, and detailed reporting
- 🔍 **Full Provenance**: Track data lineage and source contributions
- ⚡ **High Performance**: Efficient algorithms for large-scale fusion
- 🔄 **Backward Compatible**: Maintains existing API while adding new capabilities

### Next Steps:

- Explore custom conflict resolution functions for your domain
- Set up evaluation pipelines with gold standard data
- Integrate with PyDI's entity matching and schema matching components
- Scale to larger datasets and implement distributed processing

The PyDI data fusion framework covers the main steps of a fusion pipeline (record grouping, conflict resolution, reporting, and provenance tracking) through a pandas-based API.

In [None]:
# Final summary statistics
print("🏁 Final Summary:")
print("=" * 50)
print(f"📚 Datasets processed: 3 (Academy Awards, Actors, Golden Globes)")
print(f"📊 Total input records: {total_input_records:,}")
print(f"🎯 Fused output records: {len(fused_movies):,}")
print(f"🔗 Correspondences used: {len(all_correspondences):,}")
print(f"⚙️  Fusion rules applied: {len(movie_strategy.get_registered_attributes())}")
print(f"⏱️  Processing time: {fusion_time:.3f} seconds")
print(f"✨ Framework: PyDI Data Fusion v2.0")
print("\n🎉 Notebook execution complete!")