# Multi-Source Data Integration

## Overview

This notebook walks through a multi-source data integration pipeline: ingesting data through several ingestor types, resolving entities across sources, detecting conflicts, tracking provenance, and building a unified knowledge graph.

### Learning Objectives

- Ingest data from multiple sources (files, web, databases, streams, feeds)
- Resolve entities across sources using EntityResolver
- Detect conflicts using ConflictDetector
- Track provenance using ProvenanceTracker
- Integrate data into a unified knowledge graph

---

## Workflow: Multi-Source Ingestion → Entity Resolution → Conflict Detection → Provenance Tracking → Unified KG

## Step 1: Multi-Source Ingestion

Create one ingestor per source type, then ingest a sample text file with FileIngestor.


In [None]:
from semantica.ingest import FileIngestor, WebIngestor, DBIngestor, StreamIngestor, FeedIngestor
from semantica.parse import DocumentParser, StructuredDataParser
from semantica.kg import GraphBuilder, EntityResolver, ConflictDetector, ProvenanceTracker
import tempfile
import os
import json

file_ingestor = FileIngestor()
web_ingestor = WebIngestor()
db_ingestor = DBIngestor()
stream_ingestor = StreamIngestor()
feed_ingestor = FeedIngestor()

temp_dir = tempfile.mkdtemp()

file1 = os.path.join(temp_dir, "source1.txt")
with open(file1, 'w') as f:
    f.write("Apple Inc. is a technology company. Tim Cook is the CEO.")

file_objects = file_ingestor.ingest_file(file1, read_content=True)

# ingest_file handles a single path, so count one file whenever a result comes back
print(f"Ingested {1 if file_objects else 0} file(s)")
print("Multi-source ingestion initialized")
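
The cell above only exercises the file ingestor. As a rough, library-independent sketch of what "multi-source" means in practice, the snippet below wraps content from two different source types in a common record shape. The `make_record` helper, the `source_type` field, and the JSON payload are illustrative assumptions, not part of the semantica API.

```python
import json
import os
import tempfile


def make_record(source_id, source_type, content):
    """Wrap raw content in a common envelope (hypothetical shape)."""
    return {"source": source_id, "source_type": source_type, "content": content}


# File source: write and read back the same sample text used in this notebook.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "source1.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Apple Inc. is a technology company. Tim Cook is the CEO.")
    with open(path, encoding="utf-8") as f:
        file_record = make_record("file1", "file", f.read())

# Hypothetical JSON payload standing in for a web or database response.
payload = json.loads('{"name": "Apple Incorporated", "ceo": "Timothy Cook"}')
web_record = make_record("web", "web", payload)

records = [file_record, web_record]
for record in records:
    print(record["source"], record["source_type"])
```

Keeping a `source` label on every record from the start is what makes the later resolution, conflict, and provenance steps possible.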


## Step 2: Entity Resolution

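Resolve entities across multiple sources so that different spellings of the same thing, such as "Apple Inc." and "Apple Incorporated", map to a single entity.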


In [None]:
entity_resolver = EntityResolver()

entities_from_source1 = [
    {"id": "e1", "name": "Apple Inc.", "type": "Organization", "source": "file1"},
    {"id": "e2", "name": "Tim Cook", "type": "Person", "source": "file1"}
]

entities_from_source2 = [
    {"id": "e3", "name": "Apple Incorporated", "type": "Organization", "source": "web"},
    {"id": "e4", "name": "Timothy Cook", "type": "Person", "source": "web"}
]

all_entities = entities_from_source1 + entities_from_source2

resolved_entities = entity_resolver.resolve(all_entities)

print(f"Original entities: {len(all_entities)}")
print(f"Resolved entities: {len(resolved_entities)}")
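
To make the idea concrete without relying on EntityResolver internals, here is a minimal sketch of name-based resolution using only the standard library. It is not how EntityResolver works; the suffix list, the `0.75` threshold, and the greedy clustering strategy are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of corporate suffixes to ignore when comparing names.
CORPORATE_SUFFIXES = {"inc", "incorporated", "corp", "corporation", "ltd", "llc"}


def normalize_name(name):
    """Lowercase, drop punctuation, and remove corporate suffixes."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in CORPORATE_SUFFIXES)


def resolve_by_name(entities, threshold=0.75):
    """Greedily cluster same-type entities whose normalized names are similar."""
    clusters = []
    for entity in entities:
        name = normalize_name(entity["name"])
        for cluster in clusters:
            representative = cluster[0]
            if representative["type"] != entity["type"]:
                continue
            score = SequenceMatcher(
                None, normalize_name(representative["name"]), name
            ).ratio()
            if score >= threshold:
                cluster.append(entity)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([entity])
    return clusters


sample_entities = [
    {"id": "e1", "name": "Apple Inc.", "type": "Organization", "source": "file1"},
    {"id": "e2", "name": "Tim Cook", "type": "Person", "source": "file1"},
    {"id": "e3", "name": "Apple Incorporated", "type": "Organization", "source": "web"},
    {"id": "e4", "name": "Timothy Cook", "type": "Person", "source": "web"},
]

clusters = resolve_by_name(sample_entities)
for cluster in clusters:
    print([e["id"] for e in cluster])
```

Comparing only against the first member of each cluster keeps the sketch short; a production resolver would typically use blocking, richer features, and a more careful merge policy.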


## Step 3: Conflict Detection

Detect conflicting values for the same field across sources, here the name field.


In [None]:
conflict_detector = ConflictDetector()

conflicts = conflict_detector.detect_value_conflicts(all_entities, "name")

print(f"Detected {len(conflicts)} conflicts")
for conflict in conflicts[:3]:
    print(f"  Conflict: {conflict.entity_id} - {conflict.conflict_type}")
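
As an illustration of what a value conflict is, the sketch below groups records by a shared key and reports any field that carries more than one distinct value. The `find_value_conflicts` function and the `canonical_id` field are hypothetical; they stand in for whatever identifier links records after entity resolution, and are not ConflictDetector's actual logic.

```python
from collections import defaultdict


def find_value_conflicts(records, field, key="canonical_id"):
    """Return one conflict entry per key whose `field` has differing values."""
    values_by_key = defaultdict(dict)  # key -> {value: [sources reporting it]}
    for record in records:
        if field in record:
            values_by_key[record[key]].setdefault(record[field], []).append(
                record["source"]
            )
    return [
        {"entity": entity_key, "field": field, "values": values}
        for entity_key, values in values_by_key.items()
        if len(values) > 1
    ]


# The notebook's entities, with an assumed canonical_id added for grouping.
sample_records = [
    {"canonical_id": "apple", "name": "Apple Inc.", "type": "Organization", "source": "file1"},
    {"canonical_id": "apple", "name": "Apple Incorporated", "type": "Organization", "source": "web"},
    {"canonical_id": "tim_cook", "name": "Tim Cook", "type": "Person", "source": "file1"},
    {"canonical_id": "tim_cook", "name": "Timothy Cook", "type": "Person", "source": "web"},
]

name_conflicts = find_value_conflicts(sample_records, "name")
type_conflicts = find_value_conflicts(sample_records, "type")

for conflict in name_conflicts:
    print(conflict["entity"], conflict["values"])
print(f"Type conflicts: {len(type_conflicts)}")
```

Recording which sources reported each value matters because it lets a later step choose a winner by source reliability rather than by arrival order.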


## Step 4: Provenance Tracking

Record which source each entity and relationship came from.


In [None]:
provenance_tracker = ProvenanceTracker()

for entity in all_entities:
    provenance_tracker.track_entity(entity.get("id"), entity.get("source"), entity)

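# "source" and "target" are the relationship endpoints. Provenance is stored
# under a separate "origin" key so it does not overwrite the "source" endpoint.
relationships = [
    {"source": "e2", "target": "e1", "type": "CEO_of", "origin": "file1"}
]

for rel in relationships:
    provenance_tracker.track_relationship(rel.get("source"), rel.get("target"), rel.get("origin"), rel)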

print(f"Tracked provenance for {len(all_entities)} entities and {len(relationships)} relationships")
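
Conceptually, provenance tracking amounts to keeping an append-only log of where each item was seen. The sketch below shows one minimal way to do that with the standard library. `SimpleProvenanceLog` and its methods are hypothetical and do not mirror ProvenanceTracker's real interface.

```python
from collections import defaultdict
from datetime import datetime, timezone


class SimpleProvenanceLog:
    """Append-only record of which sources contributed each item (a sketch)."""

    def __init__(self):
        self._records = defaultdict(list)

    def track(self, item_id, source, metadata=None):
        """Record that `item_id` was observed in `source`."""
        self._records[item_id].append(
            {
                "source": source,
                "recorded_at": datetime.now(timezone.utc).isoformat(),
                "metadata": metadata or {},
            }
        )

    def sources_of(self, item_id):
        """Return the distinct sources seen for `item_id`, sorted by name."""
        return sorted({r["source"] for r in self._records.get(item_id, [])})


log = SimpleProvenanceLog()
# Assumes "Apple Incorporated" from the web source was merged into entity e1.
log.track("e1", "file1", {"name": "Apple Inc."})
log.track("e1", "web", {"name": "Apple Incorporated"})
log.track("e2", "file1", {"name": "Tim Cook"})

print(log.sources_of("e1"))
print(log.sources_of("e2"))
```

Storing the original metadata alongside each entry means a merged entity can still be traced back to the exact value each source supplied.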


## Step 5: Build Unified Knowledge Graph

Build a unified knowledge graph from integrated sources.


In [None]:
builder = GraphBuilder()

unified_kg = builder.build(resolved_entities, relationships)

print("Built unified knowledge graph")
print(f"  Entities: {len(unified_kg.get('entities', []))}")
print(f"  Relationships: {len(unified_kg.get('relationships', []))}")
print(f"  Sources integrated: {len(set(e.get('source', '') for e in resolved_entities))}")
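
For intuition about what "unified" means, the sketch below merges clusters of resolved entities into canonical nodes and rewrites relationship endpoints to point at them, dropping duplicates along the way. `build_unified_graph`, the cluster input format, and the second relationship are assumptions for illustration, and GraphBuilder may work quite differently.

```python
def build_unified_graph(clusters, relationships):
    """Merge entity clusters into canonical nodes and remap relationship endpoints."""
    id_map = {}
    nodes = []
    for cluster in clusters:
        canonical = cluster[0]  # assumption: the first member is the canonical one
        nodes.append(
            {
                "id": canonical["id"],
                "name": canonical["name"],
                "type": canonical["type"],
                "aliases": sorted({e["name"] for e in cluster[1:]} - {canonical["name"]}),
                "sources": sorted({e["source"] for e in cluster}),
            }
        )
        for entity in cluster:
            id_map[entity["id"]] = canonical["id"]

    edges = []
    seen = set()
    for rel in relationships:
        source = id_map.get(rel["source"], rel["source"])
        target = id_map.get(rel["target"], rel["target"])
        edge_key = (source, rel["type"], target)
        if edge_key not in seen:  # skip duplicates created by merging
            seen.add(edge_key)
            edges.append({"source": source, "target": target, "type": rel["type"]})
    return {"entities": nodes, "relationships": edges}


sample_clusters = [
    [
        {"id": "e1", "name": "Apple Inc.", "type": "Organization", "source": "file1"},
        {"id": "e3", "name": "Apple Incorporated", "type": "Organization", "source": "web"},
    ],
    [
        {"id": "e2", "name": "Tim Cook", "type": "Person", "source": "file1"},
        {"id": "e4", "name": "Timothy Cook", "type": "Person", "source": "web"},
    ],
]
sample_relationships = [
    {"source": "e2", "target": "e1", "type": "CEO_of"},
    # Hypothetical duplicate of the same fact as reported by the web source.
    {"source": "e4", "target": "e3", "type": "CEO_of"},
]

graph = build_unified_graph(sample_clusters, sample_relationships)
print(f"Entities: {len(graph['entities'])}")
print(f"Relationships: {len(graph['relationships'])}")
```

Remapping endpoints before deduplicating is the key step: the two CEO_of statements only collapse into one edge once both sides refer to canonical IDs.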


## Summary

You've worked through the main stages of multi-source data integration:

- **Multiple Ingestion Types**: FileIngestor, WebIngestor, DBIngestor, StreamIngestor, FeedIngestor
- **EntityResolver**: Resolve entities across sources
- **ConflictDetector**: Detect conflicts between sources
- **ProvenanceTracker**: Track data provenance
- **Unified Knowledge Graph**: Build integrated graph from multiple sources
