# Conflict Detection and Resolution

## Overview

In modern data pipelines, especially those building Knowledge Graphs, data is often ingested from multiple heterogeneous sources (e.g., internal databases, third-party APIs, web scrapes). Discrepancies are inevitable. 

The **Semantica Conflict Resolution Module** (`semantica.conflicts`) provides tools for detecting, analyzing, and resolving these inconsistencies, so that downstream applications consume reconciled data rather than contradictory raw records.

### What counts as a "conflict"?

- A conflict happens when **multiple records for the same entity** disagree on a field.
- Semantica typically assumes each record is a dictionary with:
  - `id` (or `entity_id`): stable identifier for the entity being described
  - one or more attributes (e.g., `name`, `birth_date`, `department`)
  - `source`: where the value came from (db, scrape, api, file)
  - optional `timestamp`: when the value was observed
- The module is source-aware: it can record **which sources contributed which values**, then resolve using strategies like voting or credibility.
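
To make the definition concrete, here is a minimal plain-Python sketch of what "multiple records for the same entity disagree on a field" means. The helper `find_value_conflicts` is hypothetical and is not part of `semantica.conflicts`; it only assumes the record shape described above (`id` or `entity_id`, attributes, `source`).

```python
from collections import defaultdict

def find_value_conflicts(records, field):
    """Return {entity_id: {value: [sources]}} for entities whose records disagree on `field`.

    Hypothetical helper for illustration; values must be hashable.
    """
    values_by_entity = defaultdict(lambda: defaultdict(list))
    for record in records:
        entity_id = record.get("id", record.get("entity_id"))
        if entity_id is None or field not in record:
            continue  # skip records that cannot be attributed or lack the field
        values_by_entity[entity_id][record[field]].append(record.get("source", "unknown"))
    # A conflict exists only when one entity has more than one distinct value.
    return {
        entity_id: dict(values)
        for entity_id, values in values_by_entity.items()
        if len(values) > 1
    }

sample = [
    {"id": "emp_001", "birth_date": "1980-05-15", "source": "hr_db"},
    {"id": "emp_001", "birth_date": "1982-05-15", "source": "public_dir"},
    {"id": "emp_002", "birth_date": "1975-03-02", "source": "hr_db"},
]
value_conflicts = find_value_conflicts(sample, "birth_date")
print(value_conflicts)
```

Only `emp_001` is reported, because `emp_002` has a single record and therefore nothing to disagree with. Keeping the source next to each value is what makes source-aware resolution possible later.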

### Practical API notes (to avoid common mismatches)

- Use `ConflictDetector.detect_value_conflicts(entities, property_name=...)` when you want to check one field.
- Use `ConflictDetector.detect_conflicts(entities)` when you want a broader scan (value/type/temporal/etc.).
- If you're starting from a Knowledge Graph dictionary (built via `GraphBuilder`), pull entities via `kg.get("entities", [])`.

### Key Capabilities

1.  **Multi-Dimensional Conflict Detection**
    *   **Value Conflicts**: Different values for the same property (e.g., `"Google"` vs `"Google Inc."`).
    *   **Type Conflicts**: Data type mismatches (e.g., string vs integer).
    *   **Temporal Conflicts**: Chronological inconsistencies (e.g., a `start_date` after an `end_date`).

2.  **Provenance & Source Tracking**
    *   **Granular Tracking**: Trace every property value back to its specific source document, page, or API call.
    *   **Credibility Scoring**: Assign trust scores to sources (e.g., `0.95` for internal HR DB vs `0.60` for web scrapes).

3.  **Automated Resolution Strategies**
    *   **Voting**: Majority rules (useful for multiple equal-weight sources).
    *   **Credibility Weighted**: Values from higher-trust sources override others.
    *   **Recency**: The most recent data point wins.
    *   **Expert Review**: Flag complex conflicts for human intervention.

4.  **Investigation & Auditing**
    *   **Investigation Guides**: Auto-generate step-by-step guides for human analysts to resolve sticky conflicts.
    *   **Audit Trails**: Keep a record of how every conflict was resolved for compliance.
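
The type and temporal conflict categories listed above can be sketched in a few lines of plain Python. Both functions below are hypothetical illustrations of the idea, not the checks Semantica runs internally, and they assume dates are ISO-8601 strings (`YYYY-MM-DD`).

```python
from datetime import date

def has_type_conflict(values):
    """True when the observed values for one property are not all the same Python type."""
    return len({type(v) for v in values}) > 1

def has_temporal_conflict(start_date, end_date):
    """True when an ISO-8601 start date falls after the end date."""
    return date.fromisoformat(start_date) > date.fromisoformat(end_date)

# A string and an integer for the same property: type conflict.
type_conflict = has_type_conflict(["42", 42])

# A start date later than the end date: temporal conflict.
temporal_conflict = has_temporal_conflict("2024-03-01", "2023-12-31")

print(type_conflict, temporal_conflict)
```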

## Installation

Ensure Semantica is installed in your environment:

```bash
pip install semantica
```

- If you're running this notebook inside the Semantica repo, prefer an editable install (so changes in code are reflected immediately):
  - `pip install -e .`
- If you're using a hosted notebook environment, `%pip install semantica` is often more reliable than `!pip install ...` because it installs into the active kernel.

In [None]:
%pip install -q semantica

In [None]:
import json
from datetime import datetime

## Step 1: Simulating Multi-Source Data

To demonstrate the framework, we will simulate a realistic scenario involving employee data.

**The Scenario:**
We have received records for **Employee 001** from three distinct sources:

- **HR Database**: Highly trusted internal source.
- **LinkedIn Scrape**: Less reliable external source.
- **Public Directory**: Outdated public API.

**The record shape (what Semantica expects):**

- Each record is a dictionary describing the same entity (`id`: `emp_001`).
- Each record includes a `source` key so conflicts can be attributed.
- A `timestamp` lets you apply time-based resolution strategies (e.g., most recent wins).
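
As a rough illustration of why the `timestamp` key matters, the sketch below applies a "most recent wins" rule in plain Python. The helper `most_recent` is hypothetical and assumes ISO-8601 timestamps; the sample values mirror this scenario's `department` field.

```python
from datetime import datetime

def most_recent(records, field):
    """Return the value of `field` from the record with the latest timestamp, or None."""
    candidates = [r for r in records if field in r and "timestamp" in r]
    if not candidates:
        return None
    latest = max(candidates, key=lambda r: datetime.fromisoformat(r["timestamp"]))
    return latest[field]

observations = [
    {"department": "Engineering", "source": "hr_db", "timestamp": "2023-01-01T10:00:00"},
    {"department": "Software Engineering", "source": "linkedin_scrape", "timestamp": "2023-06-15T14:30:00"},
    {"department": "Engineering", "source": "public_dir", "timestamp": "2022-12-01T09:00:00"},
]
latest_department = most_recent(observations, "department")
print(latest_department)
```

Recency alone ignores trust: here the newest observation comes from the scrape, not from the HR database. That is one reason to combine timestamps with source credibility rather than rely on either in isolation.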

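**The Conflicts:**
*   **`birth_date`**: The Public Directory lists a different year (1982 instead of 1980).
*   **`department`**: LinkedIn uses a more specific name ("Software Engineering") vs the generic "Engineering" in the HR DB.
*   **`name`**: LinkedIn lists "Jonathan Doe" while the other two sources list "John Doe". This walkthrough does not check this field.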

In [None]:
# 1. Define source metadata
sources_metadata = {
    "hr_db": {"credibility": 0.95, "type": "internal_database"},
    "linkedin_scrape": {"credibility": 0.60, "type": "web_scrape"},
    "public_dir": {"credibility": 0.40, "type": "public_api"}
}

# 2. Define entity records from these sources
entity_records = [
    {
        "id": "emp_001",
        "name": "John Doe",
        "birth_date": "1980-05-15",
        "department": "Engineering",
        "source": "hr_db",
        "timestamp": "2023-01-01T10:00:00"
    },
    {
        "id": "emp_001",
        "name": "Jonathan Doe",
        "birth_date": "1980-05-15",
        "department": "Software Engineering",
        "source": "linkedin_scrape",
        "timestamp": "2023-06-15T14:30:00"
    },
    {
        "id": "emp_001",
        "name": "John Doe",
        "birth_date": "1982-05-15",  # Conflict: Different year
        "department": "Engineering",
        "source": "public_dir",
        "timestamp": "2022-12-01T09:00:00"
    }
]

print(f"Loaded {len(entity_records)} records for Employee 001")

## Step 2: Registering and Tracking Sources

Before we can effectively resolve conflicts based on trust, we must register our sources with the `SourceTracker`.

The `SourceTracker` acts as a central registry for:

- **Credibility Scores**: How much you trust the source.
- **Metadata**: Helpful context (e.g., source type, system of record vs scrape).

**How to think about credibility scores:**

- Use scores as a *relative ordering* (the exact decimals matter less than the ranking).
- Start simple: `internal_db > vendor_api > web_scrape`.
- Revisit scores later using analytics (e.g., "which sources are frequently wrong?").
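
The point about relative ordering can be checked with a small sketch. It assumes a simple "trust the top-ranked source" rule, which is an illustration only and not necessarily how Semantica resolves conflicts; `pick_most_credible` and the alternative score set `scores_b` are hypothetical.

```python
def pick_most_credible(value_by_source, scores):
    """Return the value reported by the highest-scoring source (unknown sources score 0)."""
    best_source = max(value_by_source, key=lambda s: scores.get(s, 0.0))
    return value_by_source[best_source]

reported = {
    "hr_db": "1980-05-15",
    "linkedin_scrape": "1980-05-15",
    "public_dir": "1982-05-15",
}
scores_a = {"hr_db": 0.95, "linkedin_scrape": 0.60, "public_dir": 0.40}
scores_b = {"hr_db": 0.80, "linkedin_scrape": 0.50, "public_dir": 0.10}  # same ranking, different decimals

ranking_a = sorted(scores_a, key=scores_a.get, reverse=True)
ranking_b = sorted(scores_b, key=scores_b.get, reverse=True)

print(ranking_a == ranking_b)
print(pick_most_credible(reported, scores_a), pick_most_credible(reported, scores_b))
```

Under this rule, both score sets produce the same winner because the ranking is identical. Rules that add scores together are more sensitive to the exact decimals, so the ranking is a starting point rather than the whole story.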

We iterate through our simulated sources and register them.

In [None]:
from semantica.conflicts import SourceTracker

source_tracker = SourceTracker()

print("Registering sources...")
for source_id, metadata in sources_metadata.items():
    source_tracker.register_source(
        source_id=source_id,
        source_type=metadata["type"],
        credibility_score=metadata["credibility"]
    )
    print(f"  - Registered '{source_id}' with credibility {metadata['credibility']}")

## Step 3: Detecting Conflicts

We use the `ConflictDetector` to scan our records for discrepancies. 

The detector is flexible and can be configured to check:

- **Specific properties**: Check only critical fields like `birth_date`.
- **Entity-wide scans**: Scan many properties (or all) for an entity.

**What you get back:**

- A list of `Conflict` objects.
- Useful fields you'll typically inspect:
  - `conflict_type` (e.g., `value_conflict`)
  - `entity_id`, `property_name`
  - `conflicting_values` and `sources`
  - `severity` and `confidence`

Here, we explicitly check `birth_date` and `department`.

In [None]:
from semantica.conflicts import ConflictDetector

# Initialize detector with our populated source tracker
detector = ConflictDetector(source_tracker=source_tracker)

conflicts = []

# 1. Check birth_date
dob_conflicts = detector.detect_value_conflicts(entity_records, "birth_date")
conflicts.extend(dob_conflicts)

# 2. Check department
dept_conflicts = detector.detect_value_conflicts(entity_records, "department")
conflicts.extend(dept_conflicts)

print(f"Detected {len(conflicts)} conflicts:")
for conflict in conflicts:
    print(f"- {conflict.conflict_type.value}: {conflict.property_name} for {conflict.entity_id}")
    print(f"  Values: {conflict.conflicting_values}")
    print(f"  Severity: {conflict.severity}")
    print("--- ")

## Step 4: Analyzing Conflict Patterns

When dealing with large datasets, individual conflicts are less important than systemic patterns. The `ConflictAnalyzer` helps answer questions like:

- "Is one specific source responsible for most conflicts?"
- "Are conflicts concentrated in a specific entity type or property?"
- "What is the distribution of conflict severity and conflict types?"

**How to use this in a pipeline:**

- Run analysis to identify noisy sources.
- Use results to adjust credibility scores (Step 2) or refine ingestion/cleaning rules.
- Track trends over time to catch regressions in upstream systems.
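
One way to look for noisy sources, sketched in plain Python: count how often each source disagreed with the most common value. This is a hypothetical heuristic, separate from whatever `ConflictAnalyzer` computes, and it treats the majority value as a proxy for the truth, which is not always correct.

```python
from collections import Counter

def count_minority_reports(conflicts_by_source):
    """Count how often each source disagreed with the most common value.

    Each item maps source -> reported value for one conflict.
    On a tie, the first value encountered is treated as the majority.
    """
    dissent = Counter()
    for value_by_source in conflicts_by_source:
        majority_value, _ = Counter(value_by_source.values()).most_common(1)[0]
        for source, value in value_by_source.items():
            if value != majority_value:
                dissent[source] += 1
    return dissent

observed = [
    {"hr_db": "1980-05-15", "linkedin_scrape": "1980-05-15", "public_dir": "1982-05-15"},
    {"hr_db": "Engineering", "linkedin_scrape": "Software Engineering", "public_dir": "Engineering"},
]
dissent_counts = count_minority_reports(observed)
print(dissent_counts)
```

With only two conflicts the counts say little, but over thousands of conflicts a source that is consistently in the minority is a candidate for a lower credibility score or stricter cleaning rules.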

In [None]:
from semantica.conflicts import ConflictAnalyzer

analyzer = ConflictAnalyzer()
analysis = analyzer.analyze_conflicts(conflicts)

print("Conflict Analysis Summary:")
print(f"Total Conflicts: {analysis['total_conflicts']}")
print(f"By Type: {analysis.get('by_type', {}).get('counts')}")
print(f"By Severity: {analysis.get('by_severity', {}).get('counts')}")

## Step 5: Resolving Conflicts

This is the critical step where we decide which value to trust. Semantica offers flexible resolution strategies.

### Strategy A: Voting (Majority Rules)
This strategy selects the value that appears most frequently. It is simple but treats all sources as equal.

- Best when you have many independent sources of similar quality.
- Less suitable if you have a single system-of-record that should always dominate.
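
A minimal sketch of majority voting in plain Python, to show the idea behind this strategy. `resolve_by_voting` is a hypothetical helper; returning `None` on a tie is a design choice made for this example, and Semantica's own tie handling may differ.

```python
from collections import Counter

def resolve_by_voting(values):
    """Return the most frequent value, or None when there is no clear majority."""
    if not values:
        return None
    ranked = Counter(values).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: leave the conflict unresolved
    return ranked[0][0]

birth_date_vote = resolve_by_voting(["1980-05-15", "1980-05-15", "1982-05-15"])
tied_vote = resolve_by_voting(["Engineering", "Software Engineering"])
print(birth_date_vote, tied_vote)
```

Every source counts as exactly one vote here, which is why this approach is a poor fit when a single system of record should dominate.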

### Strategy B: Credibility Weighted
This strategy calculates a weighted score for each value based on the `credibility` of its source. 

**Example:**
*   `hr_db` (0.95) says "1980-05-15"
*   `public_dir` (0.40) says "1982-05-15"

Even if several low-credibility sources agree on the wrong date, the high-credibility source can still win, depending on how the resolver combines scores.
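
The sketch below shows one possible weighting scheme: sum the credibility of every source behind each value and pick the largest total. This is an assumption made for illustration; the source does not state whether Semantica sums, averages, or otherwise combines scores, and `resolve_by_credibility` is a hypothetical helper.

```python
from collections import defaultdict

def resolve_by_credibility(value_by_source, credibility):
    """Pick the value with the highest summed source credibility.

    Returns (winner, share_of_total_weight); unknown sources weigh 0.
    """
    totals = defaultdict(float)
    for source, value in value_by_source.items():
        totals[value] += credibility.get(source, 0.0)
    total_weight = sum(totals.values())
    if total_weight == 0:
        return None, 0.0  # nothing to weigh
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / total_weight

credibility = {"hr_db": 0.95, "linkedin_scrape": 0.60, "public_dir": 0.40}

# This scenario: hr_db and linkedin_scrape agree, public_dir dissents.
winner, confidence = resolve_by_credibility(
    {"hr_db": "1980-05-15", "linkedin_scrape": "1980-05-15", "public_dir": "1982-05-15"},
    credibility,
)

# Counter-scenario: both lower-trust sources agree against hr_db.
outvoted_winner, _ = resolve_by_credibility(
    {"hr_db": "1980-05-15", "linkedin_scrape": "1982-05-15", "public_dir": "1982-05-15"},
    credibility,
)
print(winner, round(confidence, 2), outvoted_winner)
```

Under summed weights, the two lower-trust sources together (0.60 + 0.40) narrowly outweigh `hr_db` (0.95) in the counter-scenario. If a system of record must always win, either widen the gap between scores or use a rule that takes the single most credible source.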

**What the resolver returns:**

- A list of resolution results where each item typically includes:
  - whether it was resolved
  - the chosen value (`resolved_value`)
  - a confidence score
  - metadata (like the property name) to support audit trails

Run the next cell to compare voting vs credibility-weighted outcomes.

In [None]:
from semantica.conflicts import ConflictResolver

resolver = ConflictResolver()

# CRITICAL: Link the source tracker to the resolver.
# This allows the resolver to look up the credibility scores we registered in Step 2.
resolver.set_source_tracker(source_tracker)

print("--- Resolution: Voting ---")
voting_results = resolver.resolve_conflicts(conflicts, strategy="voting")
for res in voting_results:
    # Fall back to 'unknown' so the width format spec does not fail on a missing key
    print(f"Property: {(res.metadata.get('property_name') or 'unknown'):<15} | Resolved Value: {res.resolved_value}")

print("\n--- Resolution: Credibility Weighted ---")
# Notice how the HR DB's value is preferred due to higher credibility
credibility_results = resolver.resolve_conflicts(conflicts, strategy="credibility_weighted")
for res in credibility_results:
    # Fall back to 'unknown' so the width format spec does not fail on a missing key
    print(f"Property: {(res.metadata.get('property_name') or 'unknown'):<15} | Resolved Value: {res.resolved_value} (Confidence: {res.confidence:.2f})")

## Step 6: Generating Investigation Guides

Not all conflicts can be resolved automatically. High-stakes or low-confidence resolutions require human review.

The `InvestigationGuideGenerator` produces a structured "flight plan" for an analyst, detailing:

- **What** is in conflict (entity + field + competing values).
- **Who** is involved (which sources produced which values).
- **How** to verify the correct data (actionable steps an analyst can follow).

**When to generate guides:**

- Low-confidence resolutions.
- Conflicts on critical fields (identity, legal names, compliance attributes).
- Any time you want a human-in-the-loop checkpoint before writing back to the graph.
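
These criteria can be turned into a simple triage rule. The sketch below is hypothetical: the `CRITICAL_FIELDS` set, the `0.75` threshold, and the `needs_human_review` helper are illustrative choices, not Semantica defaults.

```python
# Hypothetical list of fields that always warrant a human check.
CRITICAL_FIELDS = {"birth_date", "legal_name"}

def needs_human_review(property_name, confidence, threshold=0.75):
    """Flag a resolution for review if the field is critical or confidence is low."""
    return property_name in CRITICAL_FIELDS or confidence < threshold

review_flags = {
    "confident_department": needs_human_review("department", 0.90),
    "uncertain_department": needs_human_review("department", 0.50),
    "confident_birth_date": needs_human_review("birth_date", 0.99),
}
print(review_flags)
```

A rule like this can sit between the resolver and the write-back step, so that only flagged conflicts are routed to the guide generator.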

In [None]:
from semantica.conflicts import InvestigationGuideGenerator

guide_generator = InvestigationGuideGenerator()

# Generate a guide for the first conflict (birth_date)
guide = guide_generator.generate_guide(conflicts[0])

print(f"=== {guide.title} ===")
print(f"Summary: {guide.conflict_summary}\n")

print("Investigation Steps:")
for i, step in enumerate(guide.investigation_steps, 1):
    print(f"{i}. {step.description}")
    print(f"   Action: {step.action}")

print("\nRecommended Actions:")
for action in guide.recommended_actions:
    print(f"[ ] {action}")

## Conclusion

You have successfully built a conflict resolution pipeline using Semantica! 

**Recap of what we achieved:**

- **Simulated** multi-source entity records with realistic disagreements.
- **Registered** sources with credibility scores to create a trust hierarchy.
- **Detected** value conflicts for specific properties.
- **Analyzed** conflicts to understand distribution by type and severity.
- **Resolved** conflicts using voting and credibility-weighted strategies.
- **Generated** an investigation guide to support human review.

**Suggested next steps in a real project:**

- Integrate with your ingestion layer so each extracted value includes a `source` and (ideally) a `timestamp`.
- Expand detection beyond value conflicts using `ConflictDetector.detect_conflicts(...)`.
- Store resolutions and guide outputs to build an audit trail for downstream consumers.