<a href="https://colab.research.google.com/github/sargonxg/TACITUS-Knowledge-Pipeline/blob/main/notebooks/tacitus-knowledge-pipeline-v7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🏛️ TACITUS Knowledge Pipeline v7.0

---




**Production-grade pipeline for building *auditable* conflict intelligence graphs (Grounding → Context → Reasoning).**

---

## 🗺️ QUICK START GUIDE

### Phase 1: Setup (Run Once)
| Step | Block | What It Does | Time |
|------|-------|--------------|------|
| 1 | **BLOCK 1** | Install dependencies & configure | ~30–60s |
| 2 | **BLOCK 2** | Define core data schemas (Ontology) | Instant |
| 3 | **BLOCK 2A–2C** | Evidence/provenance + layer helpers + constraints | ~5–15s |
| 4 | **BLOCK 3** | Initialize Vector Engine & OntoRAG | ~5–15s |
| 5 | **BLOCK 4–4A** | Test connections + apply DB constraints/indexes | ~5–15s |

### Phase 2: Build Your Knowledge Base
| Step | Block | What It Does | Time |
|------|-------|--------------|------|
| 6 | **BLOCK 5** | Seed conflict theories (GND graph) | ~60–120s |
| 7 | **BLOCK 6 + 6A** | Upload/process docs + deterministic cleaning | Varies |
| 8 | **BLOCK 6B** | HITL dedup suggestions (Actors) | Varies |
| 9 | **BLOCK 6C–6D** | Evidence-first ingest + QA & run metrics | Varies |
| 10 | **BLOCK 6E** | Build Reasoning Graph (issues, contradictions, leverage scores) | ~10–60s |

### Phase 3: Query & Visualize
| Step | Block | What It Does | Time |
|------|-------|--------------|------|
| 11 | **BLOCK 7** | Semantic search (natural language) | Instant |
| 12 | **BLOCK 8** | Cypher Query Lab | Instant |
| 13 | **BLOCK 9** | Visualization Dashboard | Instant |

### Phase 4: Exports & Integrations
| Step | Block | What It Does | Time |
|------|-------|--------------|------|
| 14 | **BLOCK 10–10A** | Export + bundle audit artifacts | Instant |
| 15 | **BLOCK 11** | Transfer a case from Neo4j → FalkorDB | Varies |


---
# 📦 BLOCK 1: Installation & Configuration
**Run this first. Installs all dependencies and sets up connections.**
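
Block 1 needs connection settings for Neo4j. One way to keep such settings out of the notebook text is to read them from the environment. The following is a minimal sketch under that assumption: the variable names (`NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD`), the `Neo4jSettings` class, and the `load_neo4j_settings` helper are illustrative and not part of the pipeline.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Neo4jSettings:
    """Hypothetical container for connection settings (not part of the pipeline)."""
    uri: str
    user: str
    password: str


def load_neo4j_settings(env=None) -> Neo4jSettings:
    """Read settings from a mapping (defaults to os.environ); fail fast if any is missing."""
    env = os.environ if env is None else env
    required = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")  # assumed variable names
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    return Neo4jSettings(env["NEO4J_URI"], env["NEO4J_USER"], env["NEO4J_PASSWORD"])


# Example with an explicit mapping, so the sketch runs without real credentials.
settings = load_neo4j_settings({
    "NEO4J_URI": "neo4j+s://example.databases.neo4j.io",
    "NEO4J_USER": "neo4j",
    "NEO4J_PASSWORD": "not-a-real-password",
})
print(settings.uri)
```

Passing the mapping explicitly keeps the helper easy to test; calling `load_neo4j_settings()` with no argument would read from the real environment instead.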

In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 1: INSTALLATION & CONFIGURATION
# ══════════════════════════════════════════════════════════════════════════════
# 🎯 PURPOSE: Install all dependencies and configure the environment
# ⏱️ TIME: ~30–60 seconds
# ══════════════════════════════════════════════════════════════════════════════

print("📦 Installing dependencies...")
!pip install --quiet \
    neo4j \
    google-cloud-aiplatform==1.49.0 \
    "pydantic>=2.0" \
    tenacity \
    rapidfuzz \
    pyvis \
    plotly \
    pandas \
    numpy \
    tqdm \
    rich \
    nest_asyncio \
    faiss-cpu \
    python-dateutil \
    ipywidgets \
    scikit-learn \
    networkx \
    falkordb \
    redis

print("✅ Dependencies installed")

# ─────────────────────────────────────────────────
# IMPORTS
# ─────────────────────────────────────────────────
import os
import re
import json
import hashlib
import time
import uuid
import asyncio
import warnings
from datetime import datetime, timezone
from typing import List, Dict, Any, Optional, Tuple
from dataclasses import dataclass, field
from collections import defaultdict

import pandas as pd
import numpy as np
from dateutil import parser as date_parser

# Google & Vertex AI
from google.colab import auth, files, userdata
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig
from vertexai.language_models import TextEmbeddingModel, TextEmbeddingInput

# Graph & Data
from neo4j import GraphDatabase
from pydantic import BaseModel, Field
from rapidfuzz import fuzz
from tenacity import retry, stop_after_attempt, wait_exponential

# Vector Search
import faiss

# UI & Progress
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
import ipywidgets as widgets
from IPython.display import display, clear_output, HTML

# Libraries used in later blocks
import sklearn
import networkx as nx

# Async support in Colab
import nest_asyncio
nest_asyncio.apply()

# Suppress warnings
warnings.filterwarnings("ignore", category=UserWarning, module="vertexai")

console = Console()
print("✅ Imports complete")

# ─────────────────────────────────────────────────
# CONFIGURATION
# ─────────────────────────────────────────────────

@dataclass
class Config:
    """Central configuration for TACITUS v7.0"""

    # Google Cloud
    project_id: str = 'tacitus-63ba5'
    location: str = 'us-central1'

    # Neo4j Connection (Aura)
    neo4j_uri: str = 'neo4j+s://84c54653.databases.neo4j.io'
    neo4j_user: str = 'neo4j'
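    # Read from Colab Secrets rather than hard-coding the credential.
    # The secret name 'NEO4J_PASSWORD' is an assumption; use whatever name you stored it under.
    neo4j_password: str = field(default_factory=lambda: userdata.get('NEO4J_PASSWORD'))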

    # Models
    extraction_model: str = 'gemini-2.5-flash-lite'  # Fast LLM for extraction
    embedding_model: str = 'text-embedding-005'      # Embedding model
    embedding_dimensions: int = 768                   # Vector dimensions (768 is efficient)

    # Processing
    chunk_size: int = 20000      # Characters per chunk
    chunk_overlap: int = 1000    # Overlap between chunks
    er_threshold: float = 85.0   # Entity resolution similarity threshold

    # Defaults
    default_date: str = "2024-01-01"

    # Runtime
    run_id: str = field(default_factory=lambda: f"RUN_{datetime.now().strftime('%Y%m%d_%H%M%S')}")
    dry_run: bool = False

config = Config()
print(f"✅ Configuration loaded (Run ID: {config.run_id})")

# ─────────────────────────────────────────────────
# AUTHENTICATE & CONNECT
# ─────────────────────────────────────────────────

print("\n🔐 Authenticating with Google Cloud...")
auth.authenticate_user()
vertexai.init(project=config.project_id, location=config.location)
print(f"✅ Vertex AI initialized: {config.project_id}")

print("\n🔗 Connecting to Neo4j...")
try:
    driver = GraphDatabase.driver(
        config.neo4j_uri,
        auth=(config.neo4j_user, config.neo4j_password)
    )
    driver.verify_connectivity()
    print("✅ Neo4j connection verified")
except Exception as e:
    print(f"❌ Neo4j connection failed: {e}")
    raise

# Create indexes for performance
print("\n📄 Creating indexes...")
with driver.session() as session:
    indexes = [
        'CREATE INDEX IF NOT EXISTS FOR (a:Actor) ON (a.case_id)',
        'CREATE INDEX IF NOT EXISTS FOR (a:Actor) ON (a.entity_id)',
        'CREATE INDEX IF NOT EXISTS FOR (c:Case) ON (c.case_id)',
        'CREATE INDEX IF NOT EXISTS FOR (e:Event) ON (e.case_id)',
        'CREATE INDEX IF NOT EXISTS FOR (t:Theory) ON (t.entity_id)',
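        # Neo4j index patterns require a label, so graph_category is indexed on
        # Theory (the label that Block 3 filters by graph_category)
        'CREATE INDEX IF NOT EXISTS FOR (t:Theory) ON (t.graph_category)'
    ]
    for idx in indexes:
        try:
            session.run(idx).consume()  # consume() surfaces any server-side error here
        except Exception as e:
            # IF NOT EXISTS already covers existing indexes, so report anything else
            print(f"⚠️ Index skipped: {idx} ({e})")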
print("✅ Indexes ready")

print("\n" + "="*60)
print("🎉 BLOCK 1 COMPLETE - Environment Ready!")
print("="*60)

📦 Installing dependencies...
✅ Dependencies installed


ImportError: cannot import name 'TagValuesAsyncClient' from partially initialized module 'google.cloud.resourcemanager_v3.services.tag_values' (most likely due to a circular import) (/usr/local/lib/python3.12/dist-packages/google/cloud/resourcemanager_v3/services/tag_values/__init__.py)

---
# 📐 BLOCK 2: Data Ontology (Schemas)
**Defines the structure of all data we extract and store.**

In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 2: DATA ONTOLOGY (PYDANTIC SCHEMAS)
# ══════════════════════════════════════════════════════════════════════════════
# 🎯 PURPOSE: Define the structure of all extracted data
# ⏱️ TIME: Instant
# ══════════════════════════════════════════════════════════════════════════════

# ─────────────────────────────────────────────────────────────────────────────
# CTX (CASE CONTEXT) SCHEMAS - Specific case data
# ─────────────────────────────────────────────────────────────────────────────

class ExtractedActor(BaseModel):
    """A person, organization, or entity involved in the conflict."""
    name: str
    actor_type: str = "unknown"  # person, org, state, group
    role: Optional[str] = None
    description: Optional[str] = None
    is_primary: bool = False
    aliases: List[str] = Field(default_factory=list)
    confidence: float = 0.8

class ActorRelationship(BaseModel):
    """A relationship between two actors."""
    source: str
    target: str
    rel_type: str  # OPPOSES, ALLIES_WITH, INFLUENCES, DEPENDS_ON
    confidence: float = 0.8
    valid_from: Optional[str] = None
    valid_until: Optional[str] = None
    status: str = "active"

class ExtractedEvent(BaseModel):
    """A significant event in the conflict timeline."""
    description: str
    event_type: str = "action"  # action, escalation, negotiation, agreement
    date: Optional[str] = None
    actor_names: List[str] = Field(default_factory=list)
    impact_level: str = "moderate"  # critical, high, moderate, low
    confidence: float = 0.8

class ExtractedClaim(BaseModel):
    """A claim, demand, or statement made by an actor."""
    statement: str
    actor_name: str
    claim_type: str = "assertion"  # demand, threat, accusation, promise
    target_actor: Optional[str] = None
    valid_from: Optional[str] = None
    confidence: float = 0.8

class ExtractedInterest(BaseModel):
    """An interest or goal of an actor."""
    description: str
    actor_name: str
    interest_type: str = "substantive"  # substantive, procedural, psychological
    confidence: float = 0.8

class ExtractedLeverage(BaseModel):
    """Leverage or power that an actor holds."""
    description: str
    actor_name: str
    leverage_type: str = "structural"  # coercive, economic, informational, structural
    target_actors: List[str] = Field(default_factory=list)
    confidence: float = 0.8

class ExtractedConstraint(BaseModel):
    """A constraint limiting an actor's options."""
    description: str
    actor_name: Optional[str] = None
    constraint_type: str = "political"  # political, legal, economic, resource
    confidence: float = 0.8

class ExtractedCommitment(BaseModel):
    """A commitment or agreement between parties."""
    description: str
    parties: List[str] = Field(default_factory=list)
    status: str = "active"  # active, broken, fulfilled
    confidence: float = 0.8

class ExtractedNarrative(BaseModel):
    """A narrative frame promoted by an actor."""
    description: str
    actor_name: str
    frame_type: str = "neutral"  # victim, aggressor, mediator, defender
    confidence: float = 0.8

class FullExtraction(BaseModel):
    """Container for all extracted data from a chunk."""
    actors: List[ExtractedActor] = Field(default_factory=list)
    relationships: List[ActorRelationship] = Field(default_factory=list)
    events: List[ExtractedEvent] = Field(default_factory=list)
    claims: List[ExtractedClaim] = Field(default_factory=list)
    interests: List[ExtractedInterest] = Field(default_factory=list)
    leverage: List[ExtractedLeverage] = Field(default_factory=list)
    constraints: List[ExtractedConstraint] = Field(default_factory=list)
    commitments: List[ExtractedCommitment] = Field(default_factory=list)
    narratives: List[ExtractedNarrative] = Field(default_factory=list)

    def stats(self) -> Dict[str, int]:
        """Return counts for each entity type."""
        return {k: len(getattr(self, k)) for k in self.model_fields.keys()}

    def total_entities(self) -> int:
        """Total number of extracted entities."""
        return sum(self.stats().values())

# ─────────────────────────────────────────────────────────────────────────────
# GND (GROUNDING) SCHEMAS - Abstract theories
# ─────────────────────────────────────────────────────────────────────────────

class GNDConcept(BaseModel):
    """An abstract concept in conflict theory."""
    name: str
    definition: str
    concept_type: str = "phenomenon"
    indicators: List[str] = Field(default_factory=list)

class GNDTheory(BaseModel):
    """A named conflict theory or framework."""
    name: str
    description: str
    proponent: Optional[str] = "Academic"
    core_concepts: List[str] = Field(default_factory=list)

class GNDCausalLink(BaseModel):
    """A causal relationship between concepts."""
    source_concept: str
    target_concept: str
    relation_type: str  # causes, enables, prevents, correlates
    justification: str
    confidence: float = 0.8

class GNDPayload(BaseModel):
    """Container for grounding graph data."""
    concepts: List[GNDConcept] = Field(default_factory=list)
    theories: List[GNDTheory] = Field(default_factory=list)
    causal_links: List[GNDCausalLink] = Field(default_factory=list)

# ─────────────────────────────────────────────────────────────────────────────
# DOCUMENT STRUCTURES
# ─────────────────────────────────────────────────────────────────────────────

@dataclass
class Chunk:
    """A text chunk for processing."""
    chunk_id: str
    content: str
    index: int
    char_start: int
    char_end: int

@dataclass
class CaseDocument:
    """A processed case document."""
    case_id: str
    case_name: str
    filename: str
    full_text: str
    sha256: str
    chunks: List[Chunk]

print("✅ Ontology defined:")
print("   • CTX Schemas: Actor, Event, Claim, Interest, Leverage, Constraint, Commitment, Narrative")
print("   • GND Schemas: Concept, Theory, CausalLink")
print("   • Document: Chunk, CaseDocument")

✅ Ontology defined:
   • CTX Schemas: Actor, Event, Claim, Interest, Leverage, Constraint, Commitment, Narrative
   • GND Schemas: Concept, Theory, CausalLink
   • Document: Chunk, CaseDocument
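
Each `Chunk` carries `char_start` and `char_end` offsets into the source text, and Block 1 configures a `chunk_size` and `chunk_overlap`. The splitting routine is not shown in this section. As a rough sketch of how fixed-size chunks with overlap could produce those offsets (the `split_with_overlap` helper is hypothetical, and the sizes in the example are scaled down for readability):

```python
from typing import List, Tuple


def split_with_overlap(text: str, size: int, overlap: int) -> List[Tuple[int, int, str]]:
    """Return (char_start, char_end, content) triples covering the whole text."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    pieces = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        pieces.append((start, end, text[start:end]))
        if end == len(text):
            break
        start = end - overlap  # step back so consecutive chunks share `overlap` characters
    return pieces


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, size=10, overlap=3)
for start, end, content in pieces:
    print(start, end, content)
```

Because every triple keeps its absolute offsets, a position found inside a chunk can be mapped back to the document by adding `char_start`.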


---
# 🧾 BLOCK 2A: Evidence & Provenance Layer (Span-level auditability)
**Goal:** Make every extracted object *traceable back to the exact text span(s)* that support it.

This extends the pipeline with:
- `SourceDoc` → `SourceChunk` → `TextSpan` nodes
- `EVIDENCE_FOR` edges from spans to Actors/Claims/Events/etc.
- `ExtractionRun` nodes for run-level metrics, configs, and QA signals

> This is the “audit substrate”: you can always answer *“why does the graph believe X?”*
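
To make the audit question concrete, the sketch below assembles a Cypher query that walks from an actor back to its supporting spans. It uses the node labels and relationship types listed above, plus the `HAS_SPAN` relationship and `TextSpan` properties written by the helpers in this block. The `evidence_query_for_actor` function and the sample case ID and actor name are illustrative; running the query needs a live Neo4j session, which is left out here.

```python
from typing import Any, Dict, Tuple


def evidence_query_for_actor(case_id: str, actor_name: str, limit: int = 10) -> Tuple[str, Dict[str, Any]]:
    """Build a parameterized Cypher query plus its parameters (hypothetical helper)."""
    query = (
        "MATCH (k:SourceChunk)-[:HAS_SPAN]->(t:TextSpan)-[:EVIDENCE_FOR]->"
        "(a:Actor {name: $name, case_id: $case_id}) "
        "RETURN k.chunk_id AS chunk_id, t.doc_char_start AS span_start, "
        "t.doc_char_end AS span_end, t.excerpt AS excerpt, t.confidence AS confidence "
        "ORDER BY t.confidence DESC LIMIT $limit"
    )
    # Values travel as parameters, so names containing quotes cannot alter the query text.
    params = {"name": actor_name, "case_id": case_id, "limit": int(limit)}
    return query, params


query, params = evidence_query_for_actor("CASE_001", "Example Actor")
print(query)
print(params)
```

With a driver available, the pair could be passed to `session.run(query, params)`.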


In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 2A: EVIDENCE & PROVENANCE SCHEMAS + HELPERS
# ══════════════════════════════════════════════════════════════════════════════

from typing import Iterable
import math

# ─────────────────────────────────────────────────────────────────────────────
# IDs (stable, deterministic where possible)
# ─────────────────────────────────────────────────────────────────────────────

def mk_id(prefix: str, *parts: str, n: int = 12) -> str:
    payload = "|".join([p or "" for p in parts]).encode("utf-8")
    h = hashlib.sha256(payload).hexdigest()[:n]
    return f"{prefix}{h}"

def doc_id_from_sha(sha256: str) -> str:
    return f"EVD:DOC_{sha256[:12]}"

def span_id(chunk_id: str, start: int, end: int) -> str:
    return mk_id("EVD:SPN_", chunk_id, str(start), str(end), n=16)

def run_id() -> str:
    return f"RUN_{datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')}_{uuid.uuid4().hex[:8]}"

# ─────────────────────────────────────────────────────────────────────────────
# Span extraction (cheap, deterministic)
# ─────────────────────────────────────────────────────────────────────────────

def _find_all_occurrences(haystack: str, needle: str, max_hits: int = 3) -> List[Tuple[int,int]]:
    """Case-insensitive substring matches; returns [(start,end), ...]."""
    # Strip before the emptiness check so whitespace-only needles yield no zero-length spans
    nd = (needle or "").strip().lower()
    if not nd:
        return []
    hs = haystack.lower()
    out = []
    start = 0
    while len(out) < max_hits:
        i = hs.find(nd, start)
        if i == -1:
            break
        out.append((i, i + len(nd)))
        start = i + len(nd)
    return out

def _excerpt(text: str, start: int, end: int, window: int = 140) -> str:
    a = max(0, start - window)
    b = min(len(text), end + window)
    snippet = text[a:b].replace("\n", " ").strip()
    return snippet

def propose_spans_for_chunk(chunk_text: str, actors: List["ExtractedActor"], claims: List["ExtractedClaim"]) -> List[Dict[str, Any]]:
    """
    Produce lightweight candidate spans for a chunk.
    - Actors: match on canonical actor name (and aliases)
    - Claims: attempt exact match on a short prefix (first ~12 words)
    """
    spans = []

    # Actors
    for a in actors:
        needles = [a.name] + list(getattr(a, "aliases", []) or [])
        needles = [n for n in needles if n and len(n) >= 3]
        seen = set()
        for nd in needles:
            for (s,e) in _find_all_occurrences(chunk_text, nd, max_hits=2):
                key = (s,e, a.name)
                if key in seen:
                    continue
                seen.add(key)
                spans.append({
                    "entity_kind": "Actor",
                    "entity_key": a.name,
                    "start_in_chunk": s,
                    "end_in_chunk": e,
                    "method": "string_match",
                    "confidence": 0.85 if nd != a.name else 0.95,
                    "excerpt": _excerpt(chunk_text, s, e)
                })

    # Claims (best-effort)
    for c in claims:
        stmt = (c.statement or "").strip()
        if len(stmt) < 12:
            continue
        short = " ".join(stmt.split()[:12])
        for (s,e) in _find_all_occurrences(chunk_text, short, max_hits=1):
            spans.append({
                "entity_kind": "Claim",
                "entity_key": stmt,
                "start_in_chunk": s,
                "end_in_chunk": e,
                "method": "claim_prefix_match",
                "confidence": 0.75,
                "excerpt": _excerpt(chunk_text, s, e)
            })

    return spans

# ─────────────────────────────────────────────────────────────────────────────
# Neo4j write helpers (Evidence nodes)
# ─────────────────────────────────────────────────────────────────────────────

def write_source_doc_and_chunks(driver, doc: "CaseDocument") -> None:
    """
    Creates:
      (Case)-[:HAS_SOURCE]->(SourceDoc)-[:HAS_CHUNK]->(SourceChunk)
    and stores cleaning + lint metadata if present on doc.
    """
    docid = doc_id_from_sha(doc.sha256)
    meta = {
        "filename": doc.filename,
        "sha256": doc.sha256,
        "case_id": doc.case_id,
        "case_name": doc.case_name,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "graph_category": "EVD",
        "cleaning_audit": json.dumps([h.__dict__ for h in getattr(doc, "cleaning_audit", [])], default=str),
        "lint_flags": json.dumps(getattr(doc, "lint_flags", {}), default=str),
    }

    with driver.session() as s:
        s.run("""
            MERGE (c:Case {case_id: $case_id})
            SET c.case_name = $case_name,
                c.graph_category = COALESCE(c.graph_category,'OPS'),
                c.created_at = COALESCE(c.created_at, $created_at)
            MERGE (d:SourceDoc {doc_id: $doc_id})
            SET d += $meta
            MERGE (c)-[:HAS_SOURCE]->(d)
        """, case_id=doc.case_id, case_name=doc.case_name, created_at=meta["created_at"], doc_id=docid, meta=meta)

        # Chunks
        for ch in doc.chunks:
            s.run("""
                MERGE (k:SourceChunk {chunk_id: $chunk_id})
                SET k.case_id=$case_id,
                    k.doc_id=$doc_id,
                    k.index=$idx,
                    k.char_start=$cs,
                    k.char_end=$ce,
                    k.graph_category='EVD'
                MERGE (d:SourceDoc {doc_id: $doc_id})
                MERGE (d)-[:HAS_CHUNK]->(k)
            """, chunk_id=ch.chunk_id, case_id=doc.case_id, doc_id=docid, idx=ch.index, cs=ch.char_start, ce=ch.char_end)

def write_extraction_run(driver, rid: str, case_id: str, config_snapshot: Dict[str, Any]) -> None:
    with driver.session() as s:
        s.run("""
            MERGE (r:ExtractionRun {run_id: $rid})
            SET r.case_id=$cid,
                r.started_at=$ts,
                r.config=$cfg,
                r.graph_category='OPS'
            MERGE (c:Case {case_id: $cid})
            MERGE (c)-[:HAS_RUN]->(r)
        """, rid=rid, cid=case_id, ts=datetime.now(timezone.utc).isoformat(), cfg=json.dumps(config_snapshot, default=str))

def attach_run_source(driver, rid: str, doc: "CaseDocument") -> None:
    with driver.session() as s:
        s.run("""
            MATCH (r:ExtractionRun {run_id:$rid})
            MATCH (d:SourceDoc {doc_id:$docid})
            MERGE (r)-[:USED_SOURCE]->(d)
        """, rid=rid, docid=doc_id_from_sha(doc.sha256))

def write_spans_and_evidence(driver, doc: "CaseDocument", ext: "FullExtraction", rid: str) -> int:
    """
    Creates TextSpan nodes per chunk and links them to extracted entities.
    We match entities by (case_id + properties) instead of re-deriving entity_id hashes.
    """
    total = 0
    with driver.session() as s:
        for ch in doc.chunks:
            spans = propose_spans_for_chunk(ch.content, ext.actors, ext.claims)
            for sp in spans:
                sid = span_id(ch.chunk_id, sp["start_in_chunk"], sp["end_in_chunk"])
                # doc-level coordinates
                ds = int(ch.char_start + sp["start_in_chunk"])
                de = int(ch.char_start + sp["end_in_chunk"])
                s.run("""
                    MERGE (k:SourceChunk {chunk_id:$chunk_id})
                    MERGE (t:TextSpan {span_id:$sid})
                    SET t.case_id=$cid,
                        t.chunk_id=$chunk_id,
                        t.doc_char_start=$ds,
                        t.doc_char_end=$de,
                        t.excerpt=$ex,
                        t.method=$m,
                        t.confidence=$conf,
                        t.graph_category='EVD'
                    MERGE (k)-[:HAS_SPAN]->(t)
                    WITH t
                    MATCH (r:ExtractionRun {run_id:$rid})
                    MERGE (t)-[:FOUND_IN_RUN]->(r)
                """, chunk_id=ch.chunk_id, sid=sid, cid=doc.case_id, ds=ds, de=de,
                     ex=sp["excerpt"], m=sp["method"], conf=float(sp["confidence"]), rid=rid)

                # Evidence link
                if sp["entity_kind"] == "Actor":
                    s.run("""
                        MATCH (t:TextSpan {span_id:$sid, case_id:$cid})
                        MATCH (a:Actor {name:$name, case_id:$cid})
                        MERGE (t)-[:EVIDENCE_FOR]->(a)
                    """, sid=sid, cid=doc.case_id, name=sp["entity_key"])
                elif sp["entity_kind"] == "Claim":
                    s.run("""
                        MATCH (t:TextSpan {span_id:$sid, case_id:$cid})
                        MATCH (c:Claim {statement:$stmt, case_id:$cid})
                        MERGE (t)-[:EVIDENCE_FOR]->(c)
                    """, sid=sid, cid=doc.case_id, stmt=sp["entity_key"])

                total += 1
    return total

print("✅ Evidence/Provenance helpers ready (SourceDoc/Chunk/Span + ExtractionRun).")


---
# 🧱 BLOCK 2B: Layered Graph Conventions (GND / CTX / RZN / EVD)
**Goal:** Keep the graph interpretable as it grows.

We standardize:
- `graph_category` values
- helper utilities for “layer-aware” Cypher filtering
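
A layer-aware filter can also be written with a query parameter, so the layer name is never spliced into the Cypher text. This is only a sketch: `LAYER_CODES` is assumed to mirror the `graph_category` values used in this notebook, and `layer_filter` is a hypothetical helper rather than the function defined in the cell below.

```python
from typing import Any, Dict, Optional, Tuple

LAYER_CODES = {"GND", "CTX", "RZN", "EVD", "OPS"}  # assumed to match graph_category values


def layer_filter(layer: Optional[str] = None, var: str = "n") -> Tuple[str, Dict[str, Any]]:
    """Return a WHERE fragment and its parameters; no layer means no filtering."""
    if not var.isidentifier():
        raise ValueError(f"invalid Cypher variable name: {var!r}")
    if layer is None:
        return "TRUE", {}
    if layer not in LAYER_CODES:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"{var}.graph_category = $layer", {"layer": layer}


fragment, params = layer_filter("CTX")
query = f"MATCH (n) WHERE {fragment} RETURN count(n) AS total"
print(query, params)
```

Rejecting unknown codes up front means a typo such as `"CXT"` raises an error instead of silently matching nothing.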


In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 2B: LAYER CONVENTIONS
# ══════════════════════════════════════════════════════════════════════════════

GRAPH_LAYERS = {
    "GND": "Grounding Graph (theories, concepts, causal links)",
    "CTX": "Case Context Graph (actors, claims, events, leverage...)",
    "RZN": "Reasoning Graph (derived issues, contradictions, diagnostics)",
    "EVD": "Evidence Graph (source docs, chunks, spans, run traces)",
    "OPS": "Operational metadata (runs, QA metrics, pipeline state)"
}

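def cypher_layer_filter(layer: Optional[str] = None) -> str:
    """Return a Cypher predicate on `n.graph_category`, or "TRUE" when no layer is given."""
    if not layer:
        return "TRUE"
    # The value is interpolated into query text, so accept only known layer codes
    if layer not in GRAPH_LAYERS:
        raise ValueError(f"Unknown graph layer: {layer!r}")
    return f"n.graph_category = '{layer}'"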

print("✅ Layer conventions loaded:")
for k,v in GRAPH_LAYERS.items():
    print(f"   • {k}: {v}")


---
# 🧰 BLOCK 2C: Neo4j Constraints & Indexes (for scale + safety)
**Goal:** Prevent accidental duplicates and speed up large-case queries.

This block creates:
- uniqueness constraints (entity IDs, case IDs, run IDs)
- indexes on `case_id`, `name`, and provenance IDs


In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 2C: NEO4J CONSTRAINTS & INDEXES
# ══════════════════════════════════════════════════════════════════════════════

def _ddl(driver, statements: List[str]) -> List[Tuple[str, str]]:
    results = []
    with driver.session() as s:
        for st in statements:
            try:
                s.run(st)
                results.append((st, "ok"))
            except Exception as e:
                results.append((st, f"skip: {type(e).__name__}: {e}"))
    return results

def ensure_neo4j_constraints(driver) -> None:
    ddl = []

    ddl += [
        "CREATE CONSTRAINT case_id_unique IF NOT EXISTS FOR (c:Case) REQUIRE c.case_id IS UNIQUE",
        "CREATE CONSTRAINT run_id_unique IF NOT EXISTS FOR (r:ExtractionRun) REQUIRE r.run_id IS UNIQUE",
        "CREATE CONSTRAINT doc_id_unique IF NOT EXISTS FOR (d:SourceDoc) REQUIRE d.doc_id IS UNIQUE",
        "CREATE CONSTRAINT chunk_id_unique IF NOT EXISTS FOR (k:SourceChunk) REQUIRE k.chunk_id IS UNIQUE",
        "CREATE CONSTRAINT span_id_unique IF NOT EXISTS FOR (t:TextSpan) REQUIRE t.span_id IS UNIQUE",
    ]

    for label in ["Actor","Claim","Event","Interest","Leverage","Constraint","Commitment","Narrative",
                  "Concept","Theory"]:
        ddl.append(f"CREATE CONSTRAINT {label.lower()}_eid_unique IF NOT EXISTS FOR (n:{label}) REQUIRE n.entity_id IS UNIQUE")

    ddl += [
        "CREATE INDEX actor_case_name IF NOT EXISTS FOR (a:Actor) ON (a.case_id, a.name)",
        "CREATE INDEX claim_case_stmt IF NOT EXISTS FOR (c:Claim) ON (c.case_id, c.statement)",
        "CREATE INDEX evd_case_doc IF NOT EXISTS FOR (d:SourceDoc) ON (d.case_id, d.doc_id)",
        "CREATE INDEX run_case IF NOT EXISTS FOR (r:ExtractionRun) ON (r.case_id)",
    ]

    out = _ddl(driver, ddl)
    ok = sum(1 for _,st in out if st=="ok")
    print(f"✅ Neo4j DDL applied: {ok}/{len(out)} statements (others may be skipped depending on Neo4j version).")

print("ℹ️ Constraints/indexes helper ready. Run Block 4A next (it needs the `driver` from Block 1).")


---
# 🧠 BLOCK 3: Vector Engine & OntoRAG
**Powers semantic search and intelligent theory selection.**
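
Theory selection in this block ranks theories by cosine similarity, computed as an inner product over L2-normalized vectors. Below is a dependency-free sketch of that ranking step, using toy 3-dimensional vectors in place of real embeddings. The `l2_normalize` and `top_k_by_cosine` helpers and the sample data are illustrative; the block itself uses FAISS for the search.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_by_cosine(
    query: Sequence[float],
    items: Sequence[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Rank named vectors by inner product of unit vectors, which equals cosine similarity."""
    q = l2_normalize(query)
    scored = [(name, sum(a * b for a, b in zip(q, l2_normalize(vec)))) for name, vec in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Placeholder names and vectors, not real theories or embeddings.
theories = [
    ("Theory A", [1.0, 0.0, 0.0]),
    ("Theory B", [0.0, 1.0, 0.0]),
    ("Theory C", [1.0, 1.0, 0.0]),
]
ranked = top_k_by_cosine([2.0, 0.1, 0.0], theories, k=2)
print(ranked)
```

Normalizing both sides is what lets an inner-product index stand in for cosine similarity: vector length no longer affects the score, only direction does.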

In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 3: VECTOR ENGINE & ONTORAG
# ══════════════════════════════════════════════════════════════════════════════
# 🎯 PURPOSE: Generate embeddings for semantic search and smart extraction
# ⏱️ TIME: ~5 seconds to initialize
# ══════════════════════════════════════════════════════════════════════════════

class VectorEngine:
    """
    Generates vector embeddings using Vertex AI.
    These embeddings enable semantic (meaning-based) search.
    """

    def __init__(self, model_name: str = "text-embedding-005", dimensions: int = 768):
        """
        Initialize the vector engine.

        Args:
            model_name: Vertex AI embedding model
            dimensions: Output vector dimensions (768 recommended for Neo4j)
        """
        self.model = TextEmbeddingModel.from_pretrained(model_name)
        self.dimensions = dimensions
        self.model_name = model_name
        self._cache = {}  # Cache to avoid redundant API calls
        self._cache_hits = 0
        self._api_calls = 0

    def embed_text(self, text: str, task_type: str = "RETRIEVAL_DOCUMENT") -> List[float]:
        """
        Generate embedding for a single text.

        Args:
            text: Text to embed
            task_type: RETRIEVAL_DOCUMENT or RETRIEVAL_QUERY

        Returns:
            List of floats representing the embedding vector
        """
        if not text or not text.strip():
            return [0.0] * self.dimensions

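        # Truncate if too long (model limit ~2048 tokens ≈ 8000 chars)
        text = text[:7500]

        # Check cache. Key on a hash of the full (truncated) text so that two
        # texts sharing their first 200 characters do not collide.
        cache_key = f"{hashlib.sha256(text.encode('utf-8')).hexdigest()}_{task_type}"
        if cache_key in self._cache:
            self._cache_hits += 1
            return self._cache[cache_key]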

        try:
            self._api_calls += 1
            inputs = [TextEmbeddingInput(text, task_type)]
            embeddings = self.model.get_embeddings(
                inputs,
                output_dimensionality=self.dimensions
            )
            vector = embeddings[0].values
            self._cache[cache_key] = vector
            return vector
        except Exception as e:
            console.print(f"[yellow]⚠️ Embedding failed: {e}[/yellow]")
            return [0.0] * self.dimensions

    def embed_batch(self, texts: List[str], task_type: str = "RETRIEVAL_DOCUMENT") -> List[List[float]]:
        """
        Generate embeddings for multiple texts (batched for efficiency).

        Args:
            texts: List of texts to embed
            task_type: RETRIEVAL_DOCUMENT or RETRIEVAL_QUERY

        Returns:
            List of embedding vectors
        """
        results = []
        batch_size = 50  # Vertex AI batch limit

        for i in range(0, len(texts), batch_size):
            batch = texts[i:i+batch_size]
            batch_clean = [t[:7500] if t else "empty" for t in batch]

            try:
                self._api_calls += 1
                inputs = [TextEmbeddingInput(t, task_type) for t in batch_clean]
                embeddings = self.model.get_embeddings(
                    inputs,
                    output_dimensionality=self.dimensions
                )
                results.extend([e.values for e in embeddings])
            except Exception as e:
                console.print(f"[yellow]⚠️ Batch embedding failed: {e}[/yellow]")
                # Build a separate zero vector per text (list multiplication would alias one list)
                results.extend([[0.0] * self.dimensions for _ in batch])

        return results

    def stats(self) -> Dict[str, int]:
        """Return usage statistics."""
        return {
            "api_calls": self._api_calls,
            "cache_hits": self._cache_hits,
            "cache_size": len(self._cache)
        }

# ─────────────────────────────────────────────────────────────────────────────
# ONTORAG SELECTOR
# ─────────────────────────────────────────────────────────────────────────────

class OntoRAGSelector:
    """
    OntoRAG: Ontology-based Retrieval Augmented Generation.

    Selects only the most relevant theories for each text chunk,
    preventing the LLM from getting confused by irrelevant schemas.
    """

    def __init__(self, vector_engine: VectorEngine):
        self.vector_engine = vector_engine
        self.index = None
        self.theories = []  # List of (name, description, entity_id)
        self.is_initialized = False

    def build_index_from_neo4j(self, driver) -> bool:
        """
        Load all GND theories from Neo4j and build FAISS index.

        Returns:
            True if index was built successfully
        """
        console.print("[cyan]Building OntoRAG index from theories...[/cyan]")

        with driver.session() as s:
            result = s.run("""
                MATCH (t:Theory)
                WHERE t.graph_category = 'GND'
                RETURN t.name AS name, t.desc AS description, t.entity_id AS eid
            """)
            self.theories = [(r['name'], r['description'] or '', r['eid']) for r in result]

        if not self.theories:
            console.print("[yellow]⚠️ No theories found in GND graph. Run Block 5 to seed theories.[/yellow]")
            return False

        # Generate embeddings for all theories
        theory_texts = [f"{t[0]}: {t[1]}" for t in self.theories]
        console.print(f"[cyan]   Embedding {len(theory_texts)} theories...[/cyan]")
        embeddings = self.vector_engine.embed_batch(theory_texts)

        # Build FAISS index (Inner Product for cosine similarity)
        vectors = np.array(embeddings).astype('float32')
        faiss.normalize_L2(vectors)  # Normalize for cosine similarity

        self.index = faiss.IndexFlatIP(vectors.shape[1])
        self.index.add(vectors)

        self.is_initialized = True
        console.print(f"[green]✅ OntoRAG index ready: {len(self.theories)} theories indexed[/green]")
        return True

    def select_relevant_theories(self, text: str, top_k: int = 5) -> List[Dict]:
        """
        Find the most relevant theories for a given text.

        Args:
            text: Text to match against theories
            top_k: Number of theories to return

        Returns:
            List of dicts with theory info and relevance scores
        """
        if not self.is_initialized:
            return []

        # Embed the query
        query_vec = np.array([self.vector_engine.embed_text(text[:3000], "RETRIEVAL_QUERY")]).astype('float32')
        faiss.normalize_L2(query_vec)

        # Search
        k = min(top_k, len(self.theories))
        scores, indices = self.index.search(query_vec, k)

        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx >= 0 and idx < len(self.theories):
                name, desc, eid = self.theories[idx]
                results.append({
                    'name': name,
                    'description': desc,
                    'entity_id': eid,
                    'relevance_score': float(score)
                })

        return results

# ─────────────────────────────────────────────────────────────────────────────
# INITIALIZE
# ─────────────────────────────────────────────────────────────────────────────

print("🧠 Initializing Vector Engine...")
vector_engine = VectorEngine(
    model_name=config.embedding_model,
    dimensions=config.embedding_dimensions
)
print(f"✅ Vector Engine ready ({config.embedding_model}, {config.embedding_dimensions}D)")

onto_rag = OntoRAGSelector(vector_engine)
print("✅ OntoRAG Selector ready (will initialize after seeding theories)")

🧠 Initializing Vector Engine...
✅ Vector Engine ready (text-embedding-005, 768D)
✅ OntoRAG Selector ready (will initialize after seeding theories)


---
# 🧪 BLOCK 4: Connection Tests
**Verify everything is working before proceeding.**

In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 4: CONNECTION & FUNCTIONALITY TESTS
# ══════════════════════════════════════════════════════════════════════════════
# 🎯 PURPOSE: Verify all components are working correctly
# ⏱️ TIME: ~10 seconds
# ══════════════════════════════════════════════════════════════════════════════

def run_diagnostics():
    """Run comprehensive diagnostics on all system components."""

    results = []

    # ─────────────────────────────────────────────────────────────────────────
    # TEST 1: Neo4j Connection
    # ─────────────────────────────────────────────────────────────────────────
    print("\n🔍 Test 1: Neo4j Connection")
    try:
        with driver.session() as s:
            result = s.run("RETURN 1 AS test").single()
            assert result['test'] == 1

            # Get DB stats
            stats = s.run("""
                MATCH (n)
                RETURN count(n) AS nodes,
                       count(DISTINCT labels(n)) AS label_types
            """).single()

        print(f"   ✅ Connected - {stats['nodes']} nodes, {stats['label_types']} label types")
        results.append(("Neo4j", "PASS", f"{stats['nodes']} nodes"))
    except Exception as e:
        print(f"   ❌ Failed: {e}")
        results.append(("Neo4j", "FAIL", str(e)))

    # ─────────────────────────────────────────────────────────────────────────
    # TEST 2: Vertex AI LLM
    # ─────────────────────────────────────────────────────────────────────────
    print("\n🔍 Test 2: Vertex AI LLM")
    try:
        model = GenerativeModel(config.extraction_model)
        response = model.generate_content(
            "Say 'TACITUS READY' and nothing else.",
            generation_config=GenerationConfig(temperature=0.0, max_output_tokens=20)
        )
        output = response.text.strip()
        print(f"   ✅ LLM responded: '{output}'")
        results.append(("LLM", "PASS", config.extraction_model))
    except Exception as e:
        print(f"   ❌ Failed: {e}")
        results.append(("LLM", "FAIL", str(e)))

    # ─────────────────────────────────────────────────────────────────────────
    # TEST 3: Vector Embeddings
    # ─────────────────────────────────────────────────────────────────────────
    print("\n🔍 Test 3: Vector Embeddings")
    try:
        test_text = "This is a test of the TACITUS conflict analysis system."
        embedding = vector_engine.embed_text(test_text)

        assert len(embedding) == config.embedding_dimensions
        assert any(v != 0 for v in embedding)  # Not all zeros

        print(f"   ✅ Generated {len(embedding)}D embedding")
        print(f"   Sample values: [{embedding[0]:.4f}, {embedding[1]:.4f}, ...]")
        results.append(("Embeddings", "PASS", f"{len(embedding)}D"))
    except Exception as e:
        print(f"   ❌ Failed: {e}")
        results.append(("Embeddings", "FAIL", str(e)))

    # ─────────────────────────────────────────────────────────────────────────
    # TEST 4: Neo4j Vector Functions
    # ─────────────────────────────────────────────────────────────────────────
    print("\n🔍 Test 4: Neo4j Vector Functions")
    try:
        with driver.session() as s:
            # Test vector similarity function
            result = s.run("""
                WITH [1.0, 0.0, 0.0] AS v1, [1.0, 0.0, 0.0] AS v2
                RETURN vector.similarity.cosine(v1, v2) AS similarity
            """).single()

            sim = result['similarity']
            assert abs(sim - 1.0) < 1e-6  # Identical vectors should score ~1.0 (float tolerance)

        print(f"   ✅ Vector similarity working (test: {sim})")
        results.append(("Neo4j Vectors", "PASS", "cosine similarity"))
    except Exception as e:
        print(f"   ⚠️ Vector functions may not be available: {e}")
        print(f"   This is OK - semantic search will fall back to basic queries")
        results.append(("Neo4j Vectors", "WARN", "Using fallback"))

    # ─────────────────────────────────────────────────────────────────────────
    # TEST 5: FAISS
    # ─────────────────────────────────────────────────────────────────────────
    print("\n🔍 Test 5: FAISS Vector Index")
    try:
        # Create a small test index
        test_vectors = np.random.rand(10, 64).astype('float32')
        faiss.normalize_L2(test_vectors)

        index = faiss.IndexFlatIP(64)
        index.add(test_vectors)

        query = np.random.rand(1, 64).astype('float32')
        faiss.normalize_L2(query)
        scores, indices = index.search(query, 3)

        print(f"   ✅ FAISS working - searched 10 vectors, found top 3")
        results.append(("FAISS", "PASS", "Index working"))
    except Exception as e:
        print(f"   ❌ Failed: {e}")
        results.append(("FAISS", "FAIL", str(e)))

    # ─────────────────────────────────────────────────────────────────────────
    # SUMMARY
    # ─────────────────────────────────────────────────────────────────────────
    print("\n" + "="*60)
    print("📊 DIAGNOSTIC SUMMARY")
    print("="*60)

    df = pd.DataFrame(results, columns=['Component', 'Status', 'Details'])

    # Color coding
    def color_status(val):
        if val == 'PASS':
            return 'background-color: #90EE90'
        elif val == 'FAIL':
            return 'background-color: #FFB6C1'
        else:
            return 'background-color: #FFFFE0'

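    # Styler.applymap is deprecated in pandas >= 2.1; prefer Styler.map when it exists
    styler = df.style
    style_cells = getattr(styler, 'map', None) or styler.applymap
    display(style_cells(color_status, subset=['Status']).hide(axis='index'))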

    # Final verdict
    failures = sum(1 for r in results if r[1] == 'FAIL')
    if failures == 0:
        print("\n🎉 All systems operational! Ready to proceed.")
    else:
        print(f"\n⚠️ {failures} component(s) failed. Check errors above.")

    return results

# Run diagnostics
diagnostic_results = run_diagnostics()


🔍 Test 1: Neo4j Connection
   ✅ Connected - 609 nodes, 15 label types

🔍 Test 2: Vertex AI LLM
   ✅ LLM responded: 'TACITUS READY'

🔍 Test 3: Vector Embeddings
   ✅ Generated 768D embedding
   Sample values: [0.0006, -0.0159, ...]

🔍 Test 4: Neo4j Vector Functions
   ✅ Vector similarity working (test: 1.0)

🔍 Test 5: FAISS Vector Index
   ✅ FAISS working - searched 10 vectors, found top 3

📊 DIAGNOSTIC SUMMARY




| Component | Status | Details |
|-----------|--------|---------|
| Neo4j | PASS | 609 nodes |
| LLM | PASS | gemini-2.5-flash-lite |
| Embeddings | PASS | 768D |
| Neo4j Vectors | PASS | cosine similarity |
| FAISS | PASS | Index working |



🎉 All systems operational! Ready to proceed.


---
# 🧱 BLOCK 4A: Apply Constraints & Indexes
Run this once after your Neo4j connection works. It’s optional but strongly recommended for real datasets.
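
As a rough illustration of what this step amounts to, the sketch below builds idempotent uniqueness-constraint statements for the node keys that later blocks `MERGE` on. It is an assumption-laden example, not the notebook's `ensure_neo4j_constraints` helper: the `UNIQUE_KEYS` mapping, the `constraint_statements` function, and the constraint names are hypothetical, and the `FOR ... REQUIRE` syntax targets recent Neo4j releases (older 4.x servers use a different form).

```python
from typing import Dict, List

# Hypothetical mapping of node label -> property that should be unique.
# The pairs mirror the MERGE keys used elsewhere in this notebook.
UNIQUE_KEYS: Dict[str, str] = {
    "Theory": "entity_id",
    "Concept": "entity_id",
    "Actor": "entity_id",
    "Case": "case_id",
}


def constraint_statements(unique_keys: Dict[str, str]) -> List[str]:
    """Build one idempotent CREATE CONSTRAINT statement per label."""
    statements = []
    for label, prop in sorted(unique_keys.items()):
        name = f"{label.lower()}_{prop}_unique"  # illustrative naming scheme
        statements.append(
            f"CREATE CONSTRAINT {name} IF NOT EXISTS "
            f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
        )
    return statements


statements = constraint_statements(UNIQUE_KEYS)
for stmt in statements:
    print(stmt)  # each string could be passed to session.run(stmt)
```

Because every statement carries `IF NOT EXISTS`, re-running the cell should be harmless, which matches the "run this once, but re-running is safe" intent of this block.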


In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 4A: APPLY CONSTRAINTS / INDEXES
# ══════════════════════════════════════════════════════════════════════════════

try:
    ensure_neo4j_constraints(driver)
except NameError as e:
    print(f"❌ {e}. Run the earlier setup blocks first (they define `driver` and `ensure_neo4j_constraints`).")
except Exception as e:
    print(f"⚠️ DDL step had issues (often Neo4j version-related). Continuing. Details: {e}")


---
# 🌱 BLOCK 5: Seed Theories & Build OntoRAG Index
**Populates the Grounding Graph (GND) with conflict theories.**

This enables:
- Smarter extraction (LLM knows what theories to look for)
- Theory matching (link cases to relevant frameworks)
- Cross-case analysis (find similar conflicts)
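
The theory matching behind these points rests on cosine similarity: the OntoRAG selector L2-normalizes the theory embeddings and the query embedding, then ranks theories by inner product. The sketch below reproduces that ranking in plain Python under stated assumptions: `normalize` and `top_k_cosine` are hypothetical helper names, and the 3-D vectors are made-up toy values rather than real model output.

```python
import math
from typing import List, Tuple


def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def top_k_cosine(query: List[float], rows: List[List[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Rank rows by cosine similarity to the query (inner product of unit vectors)."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(normalize(row), q)) for row in rows]
    order = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)[:k]
    return [(i, scores[i]) for i in order]


# Toy "embeddings" for three theories and one text chunk (illustrative only).
theory_names = ["Security Dilemma", "Resource Curse", "Ethnic Outbidding"]
theory_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
chunk_vec = [0.9, 0.2, 0.1]

ranked = top_k_cosine(chunk_vec, theory_vecs, k=2)
for idx, score in ranked:
    print(f"{theory_names[idx]}: {score:.3f}")
```

One consequence worth keeping in mind: a theory whose embedding failed and fell back to an all-zero vector scores 0 against every query, so it is unlikely to be selected.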

In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 5: THEORY SEEDING & ONTORAG INDEX
# ══════════════════════════════════════════════════════════════════════════════
# 🎯 PURPOSE: Seed the GND graph with conflict theories and build search index
# ⏱️ TIME: ~60–120 seconds (API calls to generate theories)
# ══════════════════════════════════════════════════════════════════════════════

# ─────────────────────────────────────────────────────────────────────────────
# THEORY GENERATOR
# ─────────────────────────────────────────────────────────────────────────────

SEED_PROMPT = """You are an Expert Professor of Conflict Studies and Political Science.
Generate a structured knowledge graph for this theoretical domain: '{domain}'.

Include:
1. The main theory with clear description
2. 3-5 core concepts the theory uses
3. Key causal relationships between concepts

Return ONLY valid JSON matching this schema:
{{
  "theories": [{{
    "name": "Theory Name",
    "description": "Clear 2-3 sentence description",
    "proponent": "Key theorist(s)",
    "core_concepts": ["concept1", "concept2"]
  }}],
  "concepts": [{{
    "name": "Concept Name",
    "definition": "Clear definition",
    "concept_type": "phenomenon|mechanism|condition",
    "indicators": ["observable indicator 1", "observable indicator 2"]
  }}],
  "causal_links": [{{
    "source_concept": "Concept A",
    "target_concept": "Concept B",
    "relation_type": "causes|enables|prevents|correlates",
    "justification": "Why this relationship exists",
    "confidence": 0.85
  }}]
}}"""

class GNDWriter:
    """Writes grounding data to Neo4j."""

    def __init__(self, drv, vec_engine: VectorEngine = None):
        self.driver = drv
        self.vector_engine = vec_engine

    def _get_embedding(self, text: str) -> Optional[List[float]]:
        if self.vector_engine and text:
            return self.vector_engine.embed_text(text[:500])
        return None

    def write(self, payload: GNDPayload, source_ref: str) -> Dict[str, int]:
        """Write GND data to Neo4j."""
        stats = defaultdict(int)

        with self.driver.session() as s:
            # 1. Concepts
            for c in payload.concepts:
                eid = f"GND:CPT_{hashlib.md5(c.name.lower().strip().encode()).hexdigest()[:12]}"
                embedding = self._get_embedding(f"{c.name}: {c.definition}")

                s.run('''
                    MERGE (n:Concept {entity_id: $eid})
                    SET n.name=$n, n.definition=$d, n.type=$ct,
                        n.graph_category='GND', n.embedding=$emb,
                        n.indicators=$ind
                ''', eid=eid, n=c.name, d=c.definition, ct=c.concept_type,
                     emb=embedding, ind=c.indicators)
                stats['concepts'] += 1

            # 2. Theories
            for t in payload.theories:
                eid = f"GND:THY_{hashlib.md5(t.name.lower().strip().encode()).hexdigest()[:12]}"
                embedding = self._get_embedding(f"{t.name}: {t.description}")

                s.run('''
                    MERGE (th:Theory {entity_id: $eid})
                    SET th.name=$n, th.desc=$d, th.proponent=$p,
                        th.source=$src, th.graph_category='GND',
                        th.embedding=$emb
                ''', eid=eid, n=t.name, d=t.description, p=t.proponent,
                     src=source_ref, emb=embedding)

                # Link to concepts
                for cn in t.core_concepts:
                    s.run('''
                        MATCH (th:Theory {entity_id: $tid})
                        MATCH (c:Concept)
                        WHERE toLower(c.name) = toLower($cn)
                        MERGE (th)-[:EXPLAINS]->(c)
                    ''', tid=eid, cn=cn)
                stats['theories'] += 1

            # 3. Causal Links
            for link in payload.causal_links:
                s.run('''
                    MATCH (a:Concept), (b:Concept)
                    WHERE toLower(a.name) = toLower($src) AND toLower(b.name) = toLower($tgt)
                    MERGE (a)-[r:CAUSAL_LINK {type: $rel}]->(b)
                    SET r.justification = $just, r.confidence = $conf
                ''', src=link.source_concept, tgt=link.target_concept,
                     rel=link.relation_type, just=link.justification,
                     conf=link.confidence)
                stats['causal_links'] += 1

        return dict(stats)

class TheorySeeder:
    """Seeds the GND graph with standard conflict theories."""

    # Standard conflict theories to seed
    STANDARD_THEORIES = [
        "Relative Deprivation Theory (Ted Gurr) - why groups rebel when expectations exceed outcomes",
        "Security Dilemma (Robert Jervis) - how defensive measures cause offensive responses",
        "Resource Curse and Rentier State Theory - how natural resources fuel conflict",
        "Social Movement Theory and Political Process - how movements mobilize and succeed",
        "Elite Bargaining Theory - how power-sharing agreements prevent or end conflict",
        "Ethnic Outbidding Theory - how politicians radicalize ethnic conflicts for votes",
        "Commitment Problem in Civil Wars - why parties cannot credibly commit to peace",
        "Greed vs Grievance Theory (Collier/Hoeffler) - economic motivations for rebellion"
    ]

    def __init__(self, driver, vector_engine: VectorEngine):
        self.driver = driver
        self.vector_engine = vector_engine
        self.model = GenerativeModel(config.extraction_model)
        self.writer = GNDWriter(driver, vector_engine)
        self.cfg = GenerationConfig(temperature=0.1, response_mime_type='application/json')

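    @retry(stop=stop_after_attempt(3), wait=wait_exponential(min=2, max=10))
    def _generate_theory_once(self, domain: str) -> GNDPayload:
        """Single LLM attempt. Raises on failure so that @retry can re-run it."""
        prompt = SEED_PROMPT.format(domain=domain)
        response = self.model.generate_content(prompt, generation_config=self.cfg)
        data = json.loads(response.text)
        return GNDPayload(**data)

    def _generate_theory(self, domain: str) -> Optional[GNDPayload]:
        """Generate theory structure from LLM (up to 3 attempts with backoff)."""
        # The exception handler lives outside the retried call: catching inside
        # the decorated function would hide failures from @retry entirely.
        try:
            return self._generate_theory_once(domain)
        except Exception as e:
            # Reached only after all retry attempts are exhausted.
            console.print(f"[yellow]   ⚠️ Generation failed: {e}[/yellow]")
            return None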

    def seed_all(self, theories: List[str] = None):
        """Seed all standard theories."""
        theories = theories or self.STANDARD_THEORIES

        console.print(Panel(f"[bold cyan]🌱 Seeding {len(theories)} Conflict Theories[/bold cyan]", border_style="cyan"))

        total_stats = defaultdict(int)

        for i, domain in enumerate(theories, 1):
            console.print(f"\n[{i}/{len(theories)}] {domain[:50]}...")

            payload = self._generate_theory(domain)
            if payload:
                stats = self.writer.write(payload, "LLM_Knowledge_Seed")
                for k, v in stats.items():
                    total_stats[k] += v
                console.print(f"   ✅ Added: {stats}")
            else:
                console.print(f"   ❌ Failed")

            time.sleep(0.5)  # Rate limiting

        console.print(Panel(f"[bold green]✅ Seeding Complete: {dict(total_stats)}[/bold green]", border_style="green"))
        return dict(total_stats)

    def check_existing(self) -> int:
        """Check how many theories already exist."""
        with self.driver.session() as s:
            result = s.run("MATCH (t:Theory {graph_category: 'GND'}) RETURN count(t) AS count").single()
            return result['count']

# ─────────────────────────────────────────────────────────────────────────────
# INTERACTIVE SEEDING
# ─────────────────────────────────────────────────────────────────────────────

seeder = TheorySeeder(driver, vector_engine)
gnd_writer = GNDWriter(driver, vector_engine)

existing_count = seeder.check_existing()
print(f"\n📊 Current GND Graph: {existing_count} theories")

if existing_count == 0:
    print("\n⚠️ No theories found. Click the button below to seed.")
else:
    print(f"\n✅ {existing_count} theories already seeded.")

# Create UI
btn_seed = widgets.Button(description="🌱 Seed Standard Theories", button_style='success', layout={'width': '250px'})
btn_rebuild = widgets.Button(description="🔄 Rebuild OntoRAG Index", button_style='info', layout={'width': '250px'})
out = widgets.Output()

def on_seed(_):
    with out:
        clear_output()
        seeder.seed_all()
        print("\n🔄 Now rebuilding OntoRAG index...")
        onto_rag.build_index_from_neo4j(driver)

def on_rebuild(_):
    with out:
        clear_output()
        onto_rag.build_index_from_neo4j(driver)

btn_seed.on_click(on_seed)
btn_rebuild.on_click(on_rebuild)

display(widgets.HBox([btn_seed, btn_rebuild]))
display(out)

# Auto-build OntoRAG if theories exist
if existing_count > 0:
    print("\n🔄 Building OntoRAG index from existing theories...")
    onto_rag.build_index_from_neo4j(driver)


📊 Current GND Graph: 1 theories

✅ 1 theories already seeded.




🔄 Building OntoRAG index from existing theories...


---
# 📄 BLOCK 6: Document Processing & Extraction Pipeline
**Upload case files, extract entities, and populate the CTX graph.**

This is the main extraction pipeline that:
1. Uploads and chunks your documents
2. Uses OntoRAG to select relevant theories
3. Extracts actors, events, claims, etc.
4. Resolves duplicate entities
5. Writes everything to Neo4j with embeddings
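
Steps 1 and 4 above can be sketched in isolation. The example below is a simplified stand-in under stated assumptions: `chunk_spans` and `group_similar` are hypothetical names, the actor names are invented, and `difflib.SequenceMatcher` from the standard library replaces the fuzzy-matching library used by the notebook, so its scores (0 to 1) will not match the notebook's 0 to 100 scale exactly.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def chunk_spans(n_chars: int, size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of overlapping chunks."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    spans, start = [], 0
    while start < n_chars:
        end = min(start + size, n_chars)
        spans.append((start, end))
        if end == n_chars:
            break
        start = end - overlap  # the next chunk re-reads the last `overlap` characters
    return spans


def group_similar(names: List[str], threshold: float = 0.85) -> Dict[str, List[str]]:
    """Greedy grouping: map the longest name of each group to its near-duplicates."""
    groups: Dict[str, List[str]] = {}
    used = set()
    for i, a in enumerate(names):
        if a in used:
            continue
        group = [a]
        for b in names[i + 1:]:
            if b not in used and SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
                group.append(b)
                used.add(b)
        used.add(a)
        if len(group) > 1:
            canonical = max(group, key=len)  # longest spelling becomes canonical
            groups[canonical] = [n for n in group if n != canonical]
    return groups


spans = chunk_spans(n_chars=45_000, size=20_000, overlap=1_000)
print(spans)

dupes = group_similar(["Acme Mining Corp", "Acme Mining Corp.", "River Council"])
print(dupes)
```

With a 20,000-character window and 1,000 characters of overlap, each chunk after the first starts 19,000 characters after the previous one, so an entity mentioned near a boundary appears in both neighbouring chunks and relies on the later deduplication step to be merged.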

In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 6: DOCUMENT PROCESSING & EXTRACTION PIPELINE
# ══════════════════════════════════════════════════════════════════════════════
# 🎯 PURPOSE: Upload, process, and extract knowledge from case documents
# ⏱️ TIME: Varies by document size (~30s per 50KB)
# ══════════════════════════════════════════════════════════════════════════════

# ─────────────────────────────────────────────────────────────────────────────
# TEXT CHUNKER
# ─────────────────────────────────────────────────────────────────────────────

def chunk_text(text: str, doc_id: str, size: int = 20000, overlap: int = 1000) -> List[Chunk]:
    """Split text into overlapping chunks for processing."""
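    if overlap >= size:
        # Guard: with overlap >= size the window never advances and the loop below never ends
        raise ValueError("overlap must be smaller than size")
    text = re.sub(r'\s+', ' ', text).strip()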

    if len(text) <= size:
        return [Chunk(chunk_id=f"CHK_{doc_id[:8]}_0", content=text, index=0,
                      char_start=0, char_end=len(text))]

    chunks = []
    start = 0
    idx = 0

    while start < len(text):
        end = min(start + size, len(text))
        chunk_content = text[start:end]

        chunks.append(Chunk(
            chunk_id=f"CHK_{doc_id[:8]}_{idx}",
            content=chunk_content,
            index=idx,
            char_start=start,
            char_end=end
        ))

        if end == len(text):
            break

        start = end - overlap
        idx += 1

    return chunks

def process_upload(filename: str, content: bytes, case_name: str) -> CaseDocument:
    """Convert uploaded file to CaseDocument."""
    text = content.decode('utf-8', errors='ignore')
    sha256 = hashlib.sha256(text.encode()).hexdigest()

    clean_name = re.sub(r'[^a-zA-Z0-9]', '_', case_name).upper()[:40]
    case_id = f"CTX_{clean_name}_{datetime.now().strftime('%Y%m%d_%H%M')}"

    chunks = chunk_text(text, sha256, config.chunk_size, config.chunk_overlap)

    return CaseDocument(case_id, case_name, filename, text, sha256, chunks)

# ─────────────────────────────────────────────────────────────────────────────
# LLM EXTRACTION PROMPTS
# ─────────────────────────────────────────────────────────────────────────────

SYSTEM_PROMPT = """You are TACITUS, an expert conflict analyst. Extract structured data accurately.
CRITICAL: Always extract dates/timelines. If no explicit date, infer from context.
Return ONLY valid JSON matching the schema exactly."""

def build_extraction_prompt(text: str, stage: str, relevant_theories: List[Dict] = None) -> str:
    """Build extraction prompt with optional OntoRAG context."""

    theory_context = ""
    if relevant_theories:
        theory_context = "\n\nRELEVANT THEORETICAL FRAMEWORKS (use to guide extraction):\n"
        for t in relevant_theories[:5]:
            theory_context += f"• {t['name']}: {t['description'][:150]}...\n"

    if stage == "actors":
        return f"""{SYSTEM_PROMPT}{theory_context}

Extract ACTORS and RELATIONSHIPS from this conflict text.

JSON Schema:
{{
  "actors": [{{
    "name": "string (full name)",
    "actor_type": "person|org|state|group",
    "role": "string (their role in conflict)",
    "description": "string (brief description)",
    "is_primary": boolean,
    "confidence": float 0-1
  }}],
  "relationships": [{{
    "source": "actor name",
    "target": "actor name",
    "rel_type": "OPPOSES|ALLIES_WITH|INFLUENCES|DEPENDS_ON",
    "valid_from": "YYYY-MM-DD or null",
    "status": "active|historical",
    "confidence": float 0-1
  }}]
}}

TEXT:
{text[:25000]}"""

    else:  # details stage
        return f"""{SYSTEM_PROMPT}{theory_context}

Extract EVENTS, CLAIMS, INTERESTS, LEVERAGE, CONSTRAINTS, COMMITMENTS, and NARRATIVES.

JSON Schema:
{{
  "events": [{{
    "description": "string",
    "event_type": "action|escalation|negotiation|agreement",
    "date": "YYYY-MM-DD or null",
    "actor_names": ["string"],
    "impact_level": "critical|high|moderate|low",
    "confidence": float
  }}],
  "claims": [{{
    "statement": "string (the claim)",
    "actor_name": "who made it",
    "claim_type": "demand|threat|accusation|promise",
    "target_actor": "string or null",
    "valid_from": "YYYY-MM-DD or null",
    "confidence": float
  }}],
  "interests": [{{
    "description": "string",
    "actor_name": "string",
    "interest_type": "security|political|economic|identity",
    "confidence": float
  }}],
  "leverage": [{{
    "description": "string",
    "actor_name": "who holds it",
    "leverage_type": "coercive|economic|informational|structural",
    "target_actors": ["string"],
    "confidence": float
  }}],
  "constraints": [{{
    "description": "string",
    "actor_name": "string or null",
    "constraint_type": "political|legal|economic|resource",
    "confidence": float
  }}],
  "commitments": [{{
    "description": "string",
    "parties": ["string"],
    "status": "active|broken|fulfilled",
    "confidence": float
  }}],
  "narratives": [{{
    "description": "string",
    "actor_name": "who promotes it",
    "frame_type": "victim|aggressor|mediator|defender",
    "confidence": float
  }}]
}}

TEXT:
{text[:25000]}"""

# ─────────────────────────────────────────────────────────────────────────────
# ASYNC EXTRACTOR
# ─────────────────────────────────────────────────────────────────────────────

class AsyncExtractor:
    """Async extraction with OntoRAG support."""

    def __init__(self, model_name: str, onto_rag_selector: OntoRAGSelector = None):
        self.model = GenerativeModel(model_name)
        self.cfg = GenerationConfig(temperature=0.0, response_mime_type='application/json')
        self.onto_rag = onto_rag_selector

    async def _call_llm(self, prompt: str) -> Dict:
        try:
            response = await asyncio.to_thread(
                self.model.generate_content, prompt, generation_config=self.cfg
            )
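            data = json.loads(response.text)
            # Callers use .get(), so fall back to an empty dict if the model returns a list or scalar
            return data if isinstance(data, dict) else {}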
        except Exception as e:
            console.print(f"[yellow]⚠️ LLM error: {e}[/yellow]")
            return {}

    def _sanitize(self, data_list: list, required_field: str = None) -> list:
        """Clean LLM output to prevent validation errors."""
        if not isinstance(data_list, list):
            return []

        clean = []
        for item in data_list:
            if not isinstance(item, dict):
                continue

            # Fix missing required fields
            if required_field and not item.get(required_field):
                item[required_field] = "Unknown"

            # Fix null dates
            for date_field in ['date', 'valid_from', 'valid_until']:
                if date_field in item and item[date_field] is None:
                    item[date_field] = config.default_date

            # Ensure target_actor for claims
            if 'claim_type' in item and 'target_actor' not in item:
                item['target_actor'] = None

            clean.append(item)

        return clean

    async def extract_chunk(self, content: str) -> FullExtraction:
        """Extract all entities from a chunk."""

        # Get relevant theories via OntoRAG
        relevant_theories = []
        if self.onto_rag and self.onto_rag.is_initialized:
            relevant_theories = self.onto_rag.select_relevant_theories(content[:3000], top_k=3)

        # Build prompts
        prompt_actors = build_extraction_prompt(content, 'actors', relevant_theories)
        prompt_details = build_extraction_prompt(content, 'details', relevant_theories)

        # Parallel extraction
        d1, d2 = await asyncio.gather(
            self._call_llm(prompt_actors),
            self._call_llm(prompt_details)
        )

        ext = FullExtraction()

        # Parse actors
        ext.actors = [ExtractedActor(**a) for a in self._sanitize(d1.get('actors', []), 'name')]
        ext.relationships = [ActorRelationship(**r) for r in self._sanitize(d1.get('relationships', []), 'source')]

        # Parse details
        ext.events = [ExtractedEvent(**e) for e in self._sanitize(d2.get('events', []), 'description')]
        ext.claims = [ExtractedClaim(**c) for c in self._sanitize(d2.get('claims', []), 'actor_name')]
        ext.interests = [ExtractedInterest(**i) for i in self._sanitize(d2.get('interests', []), 'actor_name')]
        ext.leverage = [ExtractedLeverage(**l) for l in self._sanitize(d2.get('leverage', []), 'actor_name')]
        ext.constraints = [ExtractedConstraint(**c) for c in self._sanitize(d2.get('constraints', []), 'description')]
        ext.commitments = [ExtractedCommitment(**cm) for cm in self._sanitize(d2.get('commitments', []), 'description')]
        ext.narratives = [ExtractedNarrative(**n) for n in self._sanitize(d2.get('narratives', []), 'actor_name')]

        return ext

# ─────────────────────────────────────────────────────────────────────────────
# ENTITY RESOLVER
# ─────────────────────────────────────────────────────────────────────────────

class EntityResolver:
    """Resolves duplicate entities using fuzzy matching."""

    def __init__(self, threshold: float = 85.0):
        self.threshold = threshold

    def normalize(self, name: str) -> str:
        name = name.lower().strip()
        for prefix in ['mr. ', 'mrs. ', 'dr. ', 'the ', 'president ']:
            if name.startswith(prefix):
                name = name[len(prefix):]
        return re.sub(r'\s+', ' ', name)

    def find_duplicates(self, actors: List[ExtractedActor]) -> Dict[str, List[str]]:
        """Find groups of duplicate actors."""
        dupes = defaultdict(list)
        processed = set()

        for i, a1 in enumerate(actors):
            if a1.name in processed:
                continue

            group = [a1.name]
            for a2 in actors[i+1:]:
                if a2.name in processed:
                    continue

                score = max(
                    fuzz.ratio(self.normalize(a1.name), self.normalize(a2.name)),
                    fuzz.token_sort_ratio(self.normalize(a1.name), self.normalize(a2.name))
                )

                if score >= self.threshold:
                    group.append(a2.name)
                    processed.add(a2.name)

            if len(group) > 1:
                canonical = max(group, key=len)  # Use longest name as canonical
                dupes[canonical] = [n for n in group if n != canonical]

            processed.add(a1.name)

        return dict(dupes)

    def merge(self, actors: List[ExtractedActor]) -> Tuple[List[ExtractedActor], Dict[str, str]]:
        """Merge duplicate actors and return mapping."""
        dupes = self.find_duplicates(actors)
        mapping = {d: c for c, ds in dupes.items() for d in ds}

        merged = {}
        for a in actors:
            canonical = mapping.get(a.name, a.name)
            if canonical not in merged:
                merged[canonical] = ExtractedActor(
                    name=canonical, actor_type=a.actor_type, role=a.role,
                    description=a.description, is_primary=a.is_primary,
                    aliases=list(set(a.aliases + ([a.name] if a.name != canonical else []))),
                    confidence=a.confidence
                )
            else:
                m = merged[canonical]
                m.is_primary = m.is_primary or a.is_primary
                m.aliases = list(set(m.aliases + [a.name]))
                m.confidence = max(m.confidence, a.confidence)

        return list(merged.values()), mapping

    def apply_mapping(self, ext: FullExtraction, mapping: Dict[str, str]) -> FullExtraction:
        """Apply name mapping to all entities."""
        def m(n): return mapping.get(n, n) if n else n

        for c in ext.claims:
            c.actor_name = m(c.actor_name)
            c.target_actor = m(c.target_actor)
        for i in ext.interests:
            i.actor_name = m(i.actor_name)
        for l in ext.leverage:
            l.actor_name = m(l.actor_name)
            l.target_actors = [m(t) for t in l.target_actors]
        for e in ext.events:
            e.actor_names = [m(a) for a in e.actor_names]
        for c in ext.constraints:
            c.actor_name = m(c.actor_name)
        for cm in ext.commitments:
            cm.parties = [m(p) for p in cm.parties]
        for n in ext.narratives:
            n.actor_name = m(n.actor_name)
        for r in ext.relationships:
            r.source = m(r.source)
            r.target = m(r.target)

        return ext

# ─────────────────────────────────────────────────────────────────────────────
# GRAPH WRITER (with embeddings)
# ─────────────────────────────────────────────────────────────────────────────

# Allowed relationship types (for Cypher safety)
ALLOWED_REL_TYPES = {'OPPOSES', 'ALLIES_WITH', 'INFLUENCES', 'DEPENDS_ON', 'RELATIONSHIP'}

class GraphWriter:
    """Writes extracted data to Neo4j with embeddings."""

    def __init__(self, drv, vec_engine: VectorEngine = None):
        self.driver = drv
        self.vector_engine = vec_engine

    def _embed(self, text: str) -> Optional[List[float]]:
        if self.vector_engine and text:
            return self.vector_engine.embed_text(text[:500])
        return None

    def write_case(self, case_id: str, case_name: str):
        """Create case node."""
        embedding = self._embed(case_name)
        with self.driver.session() as s:
            s.run('''
                MERGE (c:Case {case_id: $cid})
                ON CREATE SET c.created_at = datetime()
                SET c.case_name=$cn, c.graph_category='CTX',
                    c.updated_at=datetime(), c.embedding=$emb
            ''', cid=case_id, cn=case_name, emb=embedding)

    def write_extraction(self, ext: FullExtraction, case_id: str) -> Dict[str, int]:
        """Write all extracted entities to Neo4j."""
        stats = defaultdict(int)

        with self.driver.session() as s:
            # ACTORS
            for a in ext.actors:
                eid = f"CTX:ACT_{hashlib.md5(f'{case_id}:{a.name.lower()}'.encode()).hexdigest()[:12]}"
                emb = self._embed(f"{a.name}: {a.role or ''} {a.description or ''}")
                s.run('''
                    MERGE (a:Actor {entity_id: $eid})
                    SET a.name=$n, a.role=$r, a.type=$t, a.description=$d,
                        a.is_primary=$p, a.case_id=$cid, a.graph_category='CTX',
                        a.embedding=$emb, a.aliases=$aliases
                    WITH a MATCH (c:Case {case_id: $cid}) MERGE (c)-[:CONTAINS]->(a)
                ''', eid=eid, n=a.name, r=a.role, t=a.actor_type, d=a.description,
                     p=a.is_primary, cid=case_id, emb=emb, aliases=a.aliases)
                stats['actors'] += 1

            # EVENTS
            for e in ext.events:
                eid = f"CTX:EVT_{hashlib.md5(f'{case_id}:{e.description[:30]}'.encode()).hexdigest()[:12]}"
                emb = self._embed(e.description)
                s.run('''
                    MERGE (ev:Event {entity_id: $eid})
                    SET ev.description=$d, ev.date=$dt, ev.type=$et, ev.impact=$il,
                        ev.case_id=$cid, ev.graph_category='CTX', ev.embedding=$emb
                ''', eid=eid, d=e.description, dt=e.date, et=e.event_type,
                     il=e.impact_level, cid=case_id, emb=emb)
                for an in e.actor_names:
                    s.run('MATCH (ev:Event {entity_id: $eid}), (a:Actor {name: $an, case_id: $cid}) MERGE (a)-[:INITIATED_EVENT]->(ev)',
                          eid=eid, an=an, cid=case_id)
                stats['events'] += 1

            # RELATIONSHIPS
            for r in ext.relationships:
                rel_type = r.rel_type if r.rel_type in ALLOWED_REL_TYPES else 'RELATIONSHIP'
                s.run(f'''
                    MATCH (a:Actor {{name: $s, case_id: $cid}})
                    MATCH (b:Actor {{name: $t, case_id: $cid}})
                    MERGE (a)-[rel:{rel_type}]->(b)
                    SET rel.valid_from=$vf, rel.status=$st, rel.confidence=$conf
                ''', s=r.source, t=r.target, vf=r.valid_from, st=r.status,
                     conf=r.confidence, cid=case_id)
                stats['relationships'] += 1

            # CLAIMS
            for c in ext.claims:
                eid = f"CTX:CLM_{hashlib.md5(f'{case_id}:{c.statement[:20]}'.encode()).hexdigest()[:12]}"
                emb = self._embed(c.statement)
                s.run('''
                    MERGE (n:Claim {entity_id: $eid})
                    SET n.statement=$st, n.type=$ct, n.target_actor=$ta, n.valid_from=$vf,
                        n.case_id=$cid, n.graph_category='CTX', n.embedding=$emb
                ''', eid=eid, st=c.statement, ct=c.claim_type, ta=c.target_actor,
                     vf=c.valid_from, cid=case_id, emb=emb)
                s.run('MATCH (n:Claim {entity_id: $eid}), (a:Actor {name: $an, case_id: $cid}) MERGE (a)-[:MAKES_CLAIM]->(n)',
                      eid=eid, an=c.actor_name, cid=case_id)
                stats['claims'] += 1

            # INTERESTS
            for i in ext.interests:
                eid = f"CTX:INT_{hashlib.md5(f'{case_id}:{i.description[:20]}'.encode()).hexdigest()[:12]}"
                emb = self._embed(i.description)
                s.run('''
                    MERGE (n:Interest {entity_id: $eid})
                    SET n.description=$d, n.type=$t, n.case_id=$cid, n.graph_category='CTX', n.embedding=$emb
                ''', eid=eid, d=i.description, t=i.interest_type, cid=case_id, emb=emb)
                s.run('MATCH (n:Interest {entity_id: $eid}), (a:Actor {name: $an, case_id: $cid}) MERGE (a)-[:HAS_INTEREST]->(n)',
                      eid=eid, an=i.actor_name, cid=case_id)
                stats['interests'] += 1

            # LEVERAGE
            for l in ext.leverage:
                eid = f"CTX:LEV_{hashlib.md5(f'{case_id}:{l.description[:20]}'.encode()).hexdigest()[:12]}"
                emb = self._embed(l.description)
                s.run('''
                    MERGE (n:Leverage {entity_id: $eid})
                    SET n.description=$d, n.type=$t, n.case_id=$cid, n.graph_category='CTX', n.embedding=$emb
                ''', eid=eid, d=l.description, t=l.leverage_type, cid=case_id, emb=emb)
                s.run('MATCH (n:Leverage {entity_id: $eid}), (a:Actor {name: $an, case_id: $cid}) MERGE (a)-[:HOLDS_LEVERAGE]->(n)',
                      eid=eid, an=l.actor_name, cid=case_id)
                stats['leverage'] += 1

            # CONSTRAINTS
            for c in ext.constraints:
                eid = f"CTX:CON_{hashlib.md5(f'{case_id}:{c.description[:20]}'.encode()).hexdigest()[:12]}"
                emb = self._embed(c.description)
                s.run('''
                    MERGE (n:Constraint {entity_id: $eid})
                    SET n.description=$d, n.type=$t, n.case_id=$cid, n.graph_category='CTX', n.embedding=$emb
                ''', eid=eid, d=c.description, t=c.constraint_type, cid=case_id, emb=emb)
                if c.actor_name:
                    s.run('MATCH (n:Constraint {entity_id: $eid}), (a:Actor {name: $an, case_id: $cid}) MERGE (a)-[:FACES_CONSTRAINT]->(n)',
                          eid=eid, an=c.actor_name, cid=case_id)
                stats['constraints'] += 1

            # COMMITMENTS
            for cm in ext.commitments:
                eid = f"CTX:CMT_{hashlib.md5(f'{case_id}:{cm.description[:20]}'.encode()).hexdigest()[:12]}"
                emb = self._embed(cm.description)
                s.run('''
                    MERGE (n:Commitment {entity_id: $eid})
                    SET n.description=$d, n.status=$st, n.case_id=$cid, n.graph_category='CTX', n.embedding=$emb
                ''', eid=eid, d=cm.description, st=cm.status, cid=case_id, emb=emb)
                for p in cm.parties:
                    s.run('MATCH (n:Commitment {entity_id: $eid}), (a:Actor {name: $p, case_id: $cid}) MERGE (a)-[:MADE_COMMITMENT]->(n)',
                          eid=eid, p=p, cid=case_id)
                stats['commitments'] += 1

            # NARRATIVES
            for n in ext.narratives:
                eid = f"CTX:NAR_{hashlib.md5(f'{case_id}:{n.description[:20]}'.encode()).hexdigest()[:12]}"
                emb = self._embed(n.description)
                s.run('''
                    MERGE (node:Narrative {entity_id: $eid})
                    SET node.description=$d, node.frame=$f, node.case_id=$cid, node.graph_category='CTX', node.embedding=$emb
                ''', eid=eid, d=n.description, f=n.frame_type, cid=case_id, emb=emb)
                s.run('MATCH (node:Narrative {entity_id: $eid}), (a:Actor {name: $an, case_id: $cid}) MERGE (a)-[:PROMOTES_NARRATIVE]->(node)',
                      eid=eid, an=n.actor_name, cid=case_id)
                stats['narratives'] += 1

        return dict(stats)

# ─────────────────────────────────────────────────────────────────────────────
# INITIALIZE COMPONENTS
# ─────────────────────────────────────────────────────────────────────────────

extractor = AsyncExtractor(config.extraction_model, onto_rag)
resolver = EntityResolver(config.er_threshold)
writer = GraphWriter(driver, vector_engine)

print("✅ Extraction pipeline ready:")
print(f"   • Extractor: {config.extraction_model}")
print(f"   • Entity Resolver: {config.er_threshold}% threshold")
print(f"   • Graph Writer: Embeddings {'enabled' if vector_engine else 'disabled'}")
print(f"   • OntoRAG: {'active' if onto_rag.is_initialized else 'inactive (seed theories first)'}")

✅ Extraction pipeline ready:
   • Extractor: gemini-2.5-flash-lite
   • Entity Resolver: 85.0% threshold
   • Graph Writer: Embeddings enabled
   • OntoRAG: active


---
# 🧽 BLOCK 6A: Dataset Hygiene & Text De-noising
**Goal:** Strip boilerplate/noise (e.g., *Project Gutenberg* headers/footers) before chunking + extraction, and keep an audit trail.
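
One way to strip a Project Gutenberg wrapper while keeping the body is to anchor on the START/END marker lines and delete only what lies outside them. The sketch below is illustrative: `SAMPLE`, `DEMO_RULES`, and `demo_clean` are hypothetical names, not part of the pipeline, and the marker format is assumed to follow the usual `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` layout.

```python
import re

# Tiny stand-in for a Project Gutenberg file (hypothetical sample text).
SAMPLE = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter 1. The envoy arrived at dawn.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Licence text follows.\n"
)

# Hypothetical rules: remove everything up to and including the START marker,
# and everything from the END marker onwards. The body in between is untouched.
DEMO_RULES = [
    ("pg_header", r"(?is)\A.*?\*\*\*\s*start of (?:this|the) project gutenberg[^\n]*\*\*\*"),
    ("pg_footer", r"(?is)\*\*\*\s*end of (?:this|the) project gutenberg.*\Z"),
]

def demo_clean(text):
    """Apply each rule in order and record (rule_name, hit_count) as an audit trail."""
    audit = []
    for name, pattern in DEMO_RULES:
        text, n = re.subn(pattern, "", text)
        if n:
            audit.append((name, n))
    return text.strip(), audit

cleaned, audit = demo_clean(SAMPLE)
print(cleaned)
print(audit)
```

The audit list makes it easy to check after the fact which rule fired on which file, and how often.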


In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 6A: DATASET HYGIENE & TEXT DE-NOISING
# ══════════════════════════════════════════════════════════════════════════════

import re
from dataclasses import dataclass
from typing import Tuple, List, Dict

@dataclass
class CleaningHit:
    rule: str
    count: int
    examples: List[str]

# Common boilerplate patterns (add your own!)
CLEANING_RULES = [
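    # Strip the wrapper *around* the book, never the text between the START/END markers.
    {
        "name": "project_gutenberg_header",
        "pattern": r"(?is)\A.*?\*\*\*\s*start of (?:this|the) project gutenberg[^\n]*\*\*\*",
        "replace_with": ""
    },
    {
        "name": "project_gutenberg_footer",
        "pattern": r"(?is)\*\*\*\s*end of (?:this|the) project gutenberg.*\Z",
        "replace_with": ""
    },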
    {
        "name": "project_gutenberg_lines",
        "pattern": r"(?im)^\s*(?:this\s+ebook\s+is\s+for\s+the\s+use\s+of\s+anyone\s+anywhere.*|project\s+gutenberg.*|end\s+of\s+(?:this|the)\s+project\s+gutenberg.*)\s*$",
        "replace_with": ""
    },
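    {
        "name": "excess_whitespace",
        "pattern": r"[ \t]+",  # collapse runs of spaces/tabs but keep line breaks
        "replace_with": " "
    },
    {
        "name": "excess_blank_lines",
        "pattern": r"(?:[ \t]*\n){3,}",  # keep at most one blank line between paragraphs
        "replace_with": "\n\n"
    },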
]

def clean_text(raw: str, max_examples: int = 3) -> Tuple[str, List[CleaningHit]]:
    """
    Apply deterministic cleaning rules and return an audit trail.
    Keeps this intentionally conservative: it's better to leave some noise than to delete evidence.
    """
    text = raw or ""
    hits: List[CleaningHit] = []

    for rule in CLEANING_RULES:
        before = text
        text, n = re.subn(rule["pattern"], rule["replace_with"], text)
        if n > 0:
            # Capture a few tiny examples from the *before* text to make it debuggable
            ex = []
            try:
                for m in re.finditer(rule["pattern"], before):
                    snippet = before[m.start():min(m.start()+120, len(before))]
                    ex.append(snippet.replace("\n", " ")[:120])
                    if len(ex) >= max_examples:
                        break
            except re.error:
                pass
            hits.append(CleaningHit(rule=rule["name"], count=n, examples=ex))

    return text.strip(), hits

def lint_text(raw: str) -> Dict[str, bool]:
    """Lightweight flags that help you spot obvious dataset artifacts."""
    flags = {
        "contains_project_gutenberg": bool(re.search(r"(?i)project\s+gutenberg", raw or "")),
        "contains_long_url": bool(re.search(r"https?://\S{40,}", raw or "")),
        "contains_repeated_dashes": bool(re.search(r"-{40,}", raw or "")),
    }
    return flags

# Wrap process_upload so you don't need to edit BLOCK 6 code above
_original_process_upload = process_upload

def process_upload(filename: str, content: bytes, case_name: str) -> CaseDocument:
    """Convert uploaded file to CaseDocument with cleaning + linting."""
    text = content.decode('utf-8', errors='ignore')

    flags = lint_text(text)
    cleaned, audit = clean_text(text)

    if audit:
        console.print(f"[cyan]🧽 Cleaned {filename}[/cyan]")
        for h in audit:
            console.print(f"   • {h.rule}: {h.count} hit(s)")
    if any(flags.values()):
        console.print(f"[yellow]🔎 Lint flags for {filename}: {', '.join([k for k,v in flags.items() if v])}[/yellow]")

    # Use cleaned text downstream
    sha256 = hashlib.sha256(cleaned.encode()).hexdigest()

    clean_name = re.sub(r'[^a-zA-Z0-9]', '_', case_name).upper()[:40]
    case_id = f"CTX_{clean_name}_{datetime.now().strftime('%Y%m%d_%H%M')}"

    chunks = chunk_text(cleaned, sha256, config.chunk_size, config.chunk_overlap)

    return CaseDocument(case_id, case_name, filename, cleaned, sha256, chunks)

console.print("[green]✅ Dataset hygiene enabled (process_upload wrapped).[/green]")


In [None]:
# ─────────────────────────────────────────────────────────────────────────────
# BATCH PROCESSING PIPELINE
# ─────────────────────────────────────────────────────────────────────────────

PROCESS_QUEUE = []  # Global queue for documents

async def process_document(doc: CaseDocument) -> Dict[str, int]:
    """Process a single document through the full pipeline."""
    console.print(f"\n[bold cyan]▶️ Processing: {doc.case_name}[/bold cyan] ({len(doc.chunks)} chunks)")

    # Extract from all chunks in parallel
    tasks = [extractor.extract_chunk(chunk.content) for chunk in doc.chunks]
    results = await asyncio.gather(*tasks)

    # Merge all extractions
    merged = FullExtraction()
    for ext in results:
        merged.actors.extend(ext.actors)
        merged.relationships.extend(ext.relationships)
        merged.events.extend(ext.events)
        merged.claims.extend(ext.claims)
        merged.interests.extend(ext.interests)
        merged.leverage.extend(ext.leverage)
        merged.constraints.extend(ext.constraints)
        merged.commitments.extend(ext.commitments)
        merged.narratives.extend(ext.narratives)

    console.print(f"   📊 Raw extraction: {merged.total_entities()} entities")

    # Entity resolution
    if merged.actors:
        merged.actors, mapping = resolver.merge(merged.actors)
        if mapping:
            merged = resolver.apply_mapping(merged, mapping)
            console.print(f"   🔗 Resolved {len(mapping)} duplicate actors")

    # Write to graph
    if not config.dry_run:
        writer.write_case(doc.case_id, doc.case_name)
        stats = writer.write_extraction(merged, doc.case_id)
        return stats

    return merged.stats()

async def run_batch_pipeline():
    """Run the full pipeline on all queued documents."""
    if not PROCESS_QUEUE:
        console.print("[yellow]⚠️ No documents in queue. Upload files first.[/yellow]")
        return

    console.print(Panel(f"[bold cyan]🚀 Processing {len(PROCESS_QUEUE)} Document(s)[/bold cyan]", border_style="cyan"))

    total_start = time.time()
    all_stats = defaultdict(int)

    for doc in PROCESS_QUEUE:
        try:
            start = time.time()
            stats = await process_document(doc)
            duration = time.time() - start

            for k, v in stats.items():
                all_stats[k] += v

            console.print(f"   ✅ Done in {duration:.1f}s: {stats}")

        except Exception as e:
            console.print(f"   ❌ Error: {e}")
            import traceback
            traceback.print_exc()

    total_time = time.time() - total_start

    console.print(Panel(
        f"[bold green]✨ Complete![/bold green]\n" +
        f"Time: {total_time:.1f}s | Totals: {dict(all_stats)}",
        border_style="green"
    ))

    # Show embedding stats
    if vector_engine:
        console.print(f"\n📊 Embedding Stats: {vector_engine.stats()}")

print("\n" + "="*60)
print("📄 UPLOAD YOUR CASE FILES")
print("="*60)
print("Upload .txt files containing conflict case descriptions.")
print("")

# File upload widget
uploader = widgets.FileUpload(accept='.txt', multiple=True, description='Upload .txt')
btn_process = widgets.Button(description='🚀 Process All Files', button_style='success', layout={'width': '200px'})
out_upload = widgets.Output()

def on_upload_change(change):
    global PROCESS_QUEUE
    with out_upload:
        clear_output()
        PROCESS_QUEUE = []

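        # ipywidgets 7 exposes a dict {filename: info}; ipywidgets 8 exposes a tuple of dicts with a 'name' key
        value = uploader.value
        uploads = value.items() if isinstance(value, dict) else [(f['name'], f) for f in value]
        for filename, file_info in uploads:
            content = bytes(file_info['content'])  # may be a memoryview in ipywidgets 8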
            size_kb = len(content) / 1024

            # Auto-generate case name from filename
            case_name = os.path.splitext(filename)[0].replace('_', ' ').title()

            doc = process_upload(filename, content, case_name)
            PROCESS_QUEUE.append(doc)

            print(f"📄 {filename} ({size_kb:.1f} KB) → {len(doc.chunks)} chunks")

        print(f"\n✅ {len(PROCESS_QUEUE)} file(s) ready. Click 'Process All Files' to extract.")

def on_process_click(_):
    with out_upload:
        asyncio.get_event_loop().run_until_complete(run_batch_pipeline())

uploader.observe(on_upload_change, names='value')
btn_process.on_click(on_process_click)

display(widgets.VBox([uploader, btn_process, out_upload]))


📄 UPLOAD YOUR CASE FILES
Upload .txt files containing conflict case descriptions.




---
# 🧩 BLOCK 6B: Entity Resolution Workbench (Human-in-the-loop Dedup)
**Goal:** Suggest likely duplicate Actors (within a case) using blocking + fuzzy/semantic similarity, so you can review merges before you hard-merge anything.
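
The blocking-then-scoring idea can be tried without a database. The sketch below is a simplified stand-in: it uses the standard library's `difflib.SequenceMatcher` in place of RapidFuzz, skips embeddings entirely, and every name in it (`norm`, `block_keys`, `candidate_pairs`, `demo_names`) is hypothetical.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

def norm(name):
    """Lowercase, drop punctuation, and squeeze whitespace."""
    s = re.sub(r"[^\w\s]", " ", name.lower())
    return re.sub(r"\s+", " ", s).strip()

def block_keys(name):
    """Cheap keys; only names sharing at least one key are ever compared."""
    toks = norm(name).split()
    if not toks:
        return {"__empty__"}
    return {f"f:{toks[0][:4]}", f"l:{toks[-1][:4]}"}

def candidate_pairs(names, threshold=0.85):
    """Return (name_a, name_b, score) for pairs in a shared block scoring >= threshold."""
    blocks = {}
    for n in names:
        for k in block_keys(n):
            blocks.setdefault(k, set()).add(n)

    seen, out = set(), []
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            if (a, b) in seen:  # the same pair can appear in several blocks
                continue
            seen.add((a, b))
            score = SequenceMatcher(None, norm(a), norm(b)).ratio()
            if score >= threshold:
                out.append((a, b, round(score, 2)))
    return sorted(out, key=lambda t: -t[2])

demo_names = ["United Nations", "United Nation", "U.N.", "Red Cross"]
demo_pairs = candidate_pairs(demo_names)
print(demo_pairs)
```

Note that "U.N." never lands in a block with "United Nations", so string similarity alone will not propose it. Abbreviations like this are the cases where manual review or the optional embedding score earns its keep.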


In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 6B: ENTITY RESOLUTION WORKBENCH (HITL DEDUP)
# ══════════════════════════════════════════════════════════════════════════════

import pandas as pd
import numpy as np
import re
from rapidfuzz import fuzz
from typing import Optional, Dict, List, Tuple

def _norm_name(s: str) -> str:
    s = (s or "").lower().strip()
    s = re.sub(r"[^\w\s]", " ", s)          # drop punctuation
    s = re.sub(r"\b(inc|llc|ltd|corp|co)\b", "", s)  # org suffixes
    s = re.sub(r"\s+", " ", s).strip()
    return s

def _blocking_keys(name: str) -> List[str]:
    n = _norm_name(name)
    toks = n.split()
    if not toks:
        return ["__empty__"]
    first = toks[0][:4]
    last = toks[-1][:4]
    initials = "".join(t[0] for t in toks[:3])
    return list(set([
        f"f:{first}",
        f"l:{last}",
        f"fl:{first}:{last}",
        f"i:{initials}",
    ]))

def _cosine(a: List[float], b: List[float]) -> float:
    a = np.array(a, dtype=float); b = np.array(b, dtype=float)
    na = np.linalg.norm(a); nb = np.linalg.norm(b)
    if na == 0 or nb == 0:
        return 0.0
    return float(np.dot(a, b) / (na * nb))

def fetch_case_actors(case_id: str) -> pd.DataFrame:
    q = """
    MATCH (a:Actor {case_id:$cid})
    RETURN a.name AS name, a.type AS actor_type, a.role AS role,
           a.description AS description, a.aliases AS aliases
    """
    with driver.session() as s:
        rows = [r.data() for r in s.run(q, cid=case_id)]
    df = pd.DataFrame(rows)
    if df.empty:
        return df
    df["name_norm"] = df["name"].apply(_norm_name)
    df["block_keys"] = df["name"].apply(_blocking_keys)
    return df

def suggest_actor_duplicates(
    case_id: str,
    threshold: float = 88.0,
    use_embeddings: bool = True,
    max_pairs_per_block: int = 5000
) -> pd.DataFrame:
    """
    Returns candidate duplicate pairs with a composite score.

    - Blocking: compare only within shared blocking keys (high recall).
    - Scoring: RapidFuzz + optional cosine(name/role embeddings).
    """
    df = fetch_case_actors(case_id)
    if df.empty or len(df) < 2:
        console.print("[yellow]⚠️ Not enough actors to deduplicate.[/yellow]")
        return pd.DataFrame()

    # Build inverted index for blocks → row indices
    block_map: Dict[str, List[int]] = {}
    for idx, keys in enumerate(df["block_keys"]):
        for k in keys:
            block_map.setdefault(k, []).append(idx)

    # Optionally pre-embed (name + role) once
    vecs = None
    if use_embeddings and globals().get("vector_engine") is not None:
        texts = (df["name"].fillna("") + ": " + df["role"].fillna("")).tolist()
        vecs = vector_engine.embed_batch(texts, task_type="RETRIEVAL_DOCUMENT")

    seen = set()
    out = []

    for bk, idxs in block_map.items():
        if len(idxs) < 2:
            continue
        # Safety cap to avoid accidental blow-ups
        if len(idxs) * (len(idxs)-1) / 2 > max_pairs_per_block:
            continue

        for i_pos, i in enumerate(idxs):
            for j in idxs[i_pos+1:]:
                key = (min(i, j), max(i, j))
                if key in seen:
                    continue
                seen.add(key)

                n1 = df.at[i, "name"]
                n2 = df.at[j, "name"]

                f1 = fuzz.QRatio(_norm_name(n1), _norm_name(n2))
                f2 = fuzz.token_sort_ratio(_norm_name(n1), _norm_name(n2))
                f3 = fuzz.token_set_ratio(_norm_name(n1), _norm_name(n2))
                fuzz_score = max(f1, f2, f3)

                emb_score = None
                if vecs is not None:
                    emb_score = _cosine(vecs[i], vecs[j]) * 100.0

                # Composite: mostly string, with semantic tie-breaker
                if emb_score is None:
                    score = fuzz_score
                else:
                    score = 0.70 * fuzz_score + 0.30 * emb_score

                if score >= threshold:
                    out.append({
                        "case_id": case_id,
                        "block": bk,
                        "name_1": n1,
                        "name_2": n2,
                        "fuzz_score": round(fuzz_score, 2),
                        "emb_score": round(emb_score, 2) if emb_score is not None else None,
                        "score": round(score, 2),
                        "type_1": df.at[i, "actor_type"],
                        "type_2": df.at[j, "actor_type"],
                    })

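    if not out:
        # An empty frame has no 'score' column, so sort_values would raise KeyError
        console.print("[yellow]⚠️ No candidate pairs above threshold.[/yellow]")
        return pd.DataFrame()

    res = pd.DataFrame(out).sort_values(["score", "fuzz_score"], ascending=False).reset_index(drop=True)
    console.print(f"[green]✅ Candidate pairs found: {len(res)}[/green]")
    return res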

def apply_soft_merges(case_id: str, merges: Dict[str, str]) -> None:
    """
    Safe option: do NOT refactor/merge nodes.
    Instead:
      - create (dup)-[:SAME_AS]->(canonical)
      - append dup name into canonical.aliases
      - tag dup with canonical_name for easy filtering
    """
    if not merges:
        console.print("[yellow]⚠️ No merges supplied.[/yellow]")
        return

    q = """
    MATCH (d:Actor {case_id:$cid, name:$dup})
    MATCH (c:Actor {case_id:$cid, name:$canon})
    MERGE (d)-[r:SAME_AS]->(c)
    SET r.score = $score, d.canonical_name = $canon
    SET c.aliases = apoc.coll.toSet(coalesce(c.aliases, []) + [$dup])
    RETURN d.name AS duplicate, c.name AS canonical
    """
    # APOC is optional; the fallback de-duplicates aliases in plain Cypher instead.
    q_no_apoc = """
    MATCH (d:Actor {case_id:$cid, name:$dup})
    MATCH (c:Actor {case_id:$cid, name:$canon})
    MERGE (d)-[r:SAME_AS]->(c)
    SET r.score = $score, d.canonical_name = $canon
    SET c.aliases = CASE WHEN $dup IN coalesce(c.aliases, [])
                         THEN c.aliases
                         ELSE coalesce(c.aliases, []) + [$dup] END
    RETURN d.name AS duplicate, c.name AS canonical
    """

    with driver.session() as s:
        for dup, canon in merges.items():
            score = None
            try:
                s.run(q, cid=case_id, dup=dup, canon=canon, score=score)
            except Exception:
                s.run(q_no_apoc, cid=case_id, dup=dup, canon=canon, score=score)

    console.print("[green]✅ Soft merges applied (SAME_AS edges created).[/green]")

# Example:
# case_id = PROCESS_QUEUE[0].case_id  # after upload
# candidates = suggest_actor_duplicates(case_id, threshold=90.0)
# candidates.head(20)

# After manual review:
# MERGES = {"UN": "United Nations", "U.N.": "United Nations"}
# apply_soft_merges(case_id, MERGES)


---
# 🧾 BLOCK 6C: Evidence-first Ingest (Runs + Provenance + CTX write)
**Goal:** A stronger substrate for conflict analysis:

- Writes `SourceDoc/Chunk/Span` evidence nodes alongside CTX entities
- Records an `ExtractionRun` with config snapshot
- Produces span-level `EVIDENCE_FOR` links for Actors and Claims (easy to extend)

Entry point: `upload_and_ingest_v7()`.
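
To make "span-level evidence" concrete, the sketch below locates a quoted statement inside a chunk and records its character offsets under a stable ID. It is an illustration only: `DemoSpan`, `locate_span`, and the ID format are hypothetical and are not the schema used by the pipeline's own span writer.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class DemoSpan:
    span_id: str   # stable ID derived from chunk + offsets
    chunk_id: str
    start: int     # inclusive character offset within the chunk
    end: int       # exclusive character offset within the chunk
    text: str

def locate_span(chunk_id: str, chunk_text: str, quote: str) -> Optional[DemoSpan]:
    """Return the first exact occurrence of `quote` in the chunk, or None if absent."""
    start = chunk_text.find(quote)
    if start < 0:
        return None  # no evidence found: better to record nothing than to guess
    end = start + len(quote)
    digest = hashlib.sha256(f"{chunk_id}:{start}:{end}".encode()).hexdigest()[:12]
    return DemoSpan(f"SPAN_{digest}", chunk_id, start, end, chunk_text[start:end])

demo_chunk = "On 3 May the ministry denied any troop movement near the border."
demo_span = locate_span("chunk-0001", demo_chunk, "denied any troop movement")
print(demo_span)
```

Because the ID depends only on the chunk and the offsets, re-running the same extraction produces the same span ID, which keeps `MERGE`-style writes idempotent.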


In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 6C: EVIDENCE-FIRST INGEST ORCHESTRATOR
# ══════════════════════════════════════════════════════════════════════════════

import nest_asyncio
nest_asyncio.apply()

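def _read_uploaded_file(name: str, data: bytes) -> str:
    # Only trust UTF-16 when a BOM is present: arbitrary even-length bytes
    # often "decode" as UTF-16 without error and come out as garbage.
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        try:
            return data.decode("utf-16")
        except UnicodeDecodeError:
            pass
    try:
        return data.decode("utf-8-sig")  # also strips a UTF-8 BOM if present
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this never raises
        return data.decode("latin-1")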

async def _extract_all_chunks(doc: "CaseDocument", concurrency: int = 6) -> List["FullExtraction"]:
    sem = asyncio.Semaphore(concurrency)
    outs: List["FullExtraction"] = []

    async def _one(ch: "Chunk"):
        async with sem:
            return await extractor.extract_chunk(ch.content)

    tasks = [_one(ch) for ch in doc.chunks]
    for fut in tqdm(asyncio.as_completed(tasks), total=len(tasks), desc="Extracting chunks"):
        outs.append(await fut)

    return outs

def _merge_extractions(exts: List["FullExtraction"]) -> Tuple["FullExtraction", Dict[str,str]]:
    full = FullExtraction()
    for e in exts:
        full.actors += e.actors
        full.relationships += e.relationships
        full.events += e.events
        full.claims += e.claims
        full.interests += e.interests
        full.leverage += e.leverage
        full.constraints += e.constraints
        full.commitments += e.commitments
        full.narratives += e.narratives

    merged_actors, mapping = resolver.merge(full.actors)
    full.actors = merged_actors
    full = resolver.apply_mapping(full, mapping)
    return full, mapping

def ingest_case_v7(case_name: str, filename: str, raw_text: str, concurrency: int = 6) -> Dict[str, Any]:
    # process_upload expects (filename, content: bytes, case_name)
    doc = process_upload(filename, raw_text.encode("utf-8"), case_name)

    rid = run_id()
    cfg_snapshot = {
        "extraction_model": config.extraction_model,
        "embedding_model": config.embedding_model,
        "chunk_size": config.chunk_size,
        "chunk_overlap": config.chunk_overlap,
        "er_threshold": config.er_threshold,
        "onto_rag_active": bool(getattr(onto_rag, "is_initialized", False)),
    }

    write_source_doc_and_chunks(driver, doc)
    write_extraction_run(driver, rid, doc.case_id, cfg_snapshot)
    attach_run_source(driver, rid, doc)

    exts = asyncio.run(_extract_all_chunks(doc, concurrency=concurrency))
    full_ext, mapping = _merge_extractions(exts)

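    # Make sure the Case node exists so Actors can be linked via CONTAINS (MERGE is idempotent)
    writer.write_case(doc.case_id, doc.case_name)
    # write_extraction expects (ext, case_id), in that order
    stats = writer.write_extraction(full_ext, doc.case_id)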
    span_count = write_spans_and_evidence(driver, doc, full_ext, rid)

    return {
        "case_id": doc.case_id,
        "run_id": rid,
        "writer_stats": stats,
        "span_count": span_count,
        "name_mapping": mapping,
    }

def upload_and_ingest_v7(case_name: str = "TACITUS_CASE", concurrency: int = 6) -> List[Dict[str, Any]]:
    uploaded = files.upload()
    results = []
    for fname, data in uploaded.items():
        raw = _read_uploaded_file(fname, data)
        console.print(f"[cyan]📄 Ingesting: {fname}[/cyan]")
        res = ingest_case_v7(case_name=case_name, filename=fname, raw_text=raw, concurrency=concurrency)
        console.print(f"[green]✅ Done: case_id={res['case_id']} run_id={res['run_id']} spans={res['span_count']}[/green]")
        results.append(res)
    return results

print("✅ Evidence-first ingest ready. Next: run `upload_and_ingest_v7('My Case Name')`.")


---
# ✅ BLOCK 6D: QA Gates + Run Metrics (Operational Trace)
**Goal:** Treat extraction like a production pipeline:

- Check basic schema health (suspicious emptiness, missing actors)
- Check graph integrity (orphan nodes)
- Persist QA + counts back onto the `ExtractionRun` node

Run after `upload_and_ingest_v7()` returns `case_id` + `run_id`.
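
The gate logic itself is just a function from label counts to a list of issues, so it can be tested without a database. In the sketch below, `demo_qa_gates` is a hypothetical name, and the orphan-ratio gate with its `max_orphan_ratio` default is an added assumption for illustration, not something the pipeline currently checks.

```python
from typing import Dict, List

def demo_qa_gates(by_label: Dict[str, int], orphans: int,
                  max_orphan_ratio: float = 0.2) -> List[str]:
    """Turn per-label node counts and an orphan count into human-readable issues."""
    issues: List[str] = []
    total = sum(by_label.values())

    if by_label.get("Actor", 0) == 0:
        issues.append("No Actor nodes in graph.")
    if by_label.get("Claim", 0) == 0 and by_label.get("Event", 0) == 0:
        issues.append("No Claim/Event nodes in graph.")
    # Hypothetical extra gate: flag runs where many nodes have no relationships.
    if total and orphans / total > max_orphan_ratio:
        issues.append(f"High orphan ratio: {orphans}/{total} nodes have no relationships.")
    return issues

healthy = demo_qa_gates({"Actor": 8, "Claim": 12, "Event": 5}, orphans=1)
suspicious = demo_qa_gates({"Chunk": 10}, orphans=6)
print(healthy)
print(suspicious)
```

Keeping the gates pure like this means the thresholds can be tuned against stored run metrics before they are allowed to block a run.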


In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 6D: QUALITY GATES + RUN METRICS
# ══════════════════════════════════════════════════════════════════════════════

def qa_graph(case_id: str) -> Dict[str, Any]:
    with driver.session() as s:
        by_label = {}
        for r in s.run("MATCH (n {case_id:$cid}) RETURN labels(n)[0] AS label, count(*) AS n", cid=case_id):
            by_label[r["label"]] = r["n"]

        orphan = s.run("""
            MATCH (n {case_id:$cid})
            WHERE NOT (n)--()
            RETURN count(n) AS c
        """, cid=case_id).single()["c"]

    return {"by_label": by_label, "orphans": int(orphan)}

def persist_run_metrics(run_id: str, payload: Dict[str, Any]) -> None:
    with driver.session() as s:
        s.run("""
            MATCH (r:ExtractionRun {run_id:$rid})
            SET r.finished_at=$ts,
                r.metrics=$m,
                r.qa_issues=$issues
        """, rid=run_id,
             ts=datetime.now(timezone.utc).isoformat(),
             m=json.dumps(payload, default=str),
             issues=json.dumps(payload.get("issues", []), default=str))

def run_qa_and_persist(case_id: str, run_id: str) -> Dict[str, Any]:
    graph_stats = qa_graph(case_id)

    issues = []
    if graph_stats["by_label"].get("Actor", 0) == 0:
        issues.append("No Actor nodes in graph.")
    if graph_stats["by_label"].get("Claim", 0) == 0 and graph_stats["by_label"].get("Event", 0) == 0:
        issues.append("No Claim/Event nodes in graph.")

    payload = {
        "case_id": case_id,
        "run_id": run_id,
        "graph_stats": graph_stats,
        "issues": issues,
    }
    persist_run_metrics(run_id, payload)
    print(f"✅ QA persisted. Orphans={graph_stats['orphans']} Issues={len(issues)}")
    return payload

print("✅ QA helpers ready. Typical use:")
print("   payload = run_qa_and_persist(case_id, run_id)")


---
# 🧠 BLOCK 6E: Build a Reasoning Graph (Issues, Contradictions, Influence)
**Goal:** Move from “extracted facts” → “structured analysis substrate”.

Adds derived nodes/edges:
- `Issue` clusters (from Claim embeddings)
- `POTENTIAL_CONTRADICTION` edges (simple heuristic)
- `influence_score` on Actors (PageRank over Actor↔Actor edges)

These builders are deterministic heuristics today; each can be swapped for a stronger method later.
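
For intuition about the influence score, here is a dependency-free PageRank by power iteration over a toy directed actor graph. It is a sketch under assumptions: the notebook imports `networkx` for its own computation, edges are treated here as directed (source → target), and `demo_pagerank` and `demo_edges` are hypothetical names.

```python
from typing import Dict, List, Tuple

def demo_pagerank(edges: List[Tuple[str, str]], damping: float = 0.85,
                  iters: int = 50) -> Dict[str, float]:
    """Plain power-iteration PageRank on directed edges (source, target)."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n_nodes = len(nodes)
    rank = {n: 1.0 / n_nodes for n in nodes}
    for _ in range(iters):
        # Rank held by nodes with no outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new = {n: (1 - damping) / n_nodes + damping * dangling / n_nodes for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new[dst] += damping * rank[src] / len(out_links[src])
        rank = new
    return rank

# Toy graph: A and B both point at C; C points back at A.
demo_edges = [("A", "C"), ("B", "C"), ("C", "A")]
demo_scores = demo_pagerank(demo_edges)
print(sorted(demo_scores.items(), key=lambda kv: -kv[1]))
```

In this toy graph C ranks highest because two actors point at it, A comes second because it inherits C's weight, and B, which nobody points at, ranks last.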


In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 6E: REASONING GRAPH BUILDERS
# ══════════════════════════════════════════════════════════════════════════════

import networkx as nx
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

def _cosine_np(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    A = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    B = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-12)
    return A @ B.T

def _top_terms(texts: List[str], k: int = 6) -> List[str]:
    if not texts:
        return []
    vec = TfidfVectorizer(stop_words="english", max_features=3000)
    X = vec.fit_transform(texts)
    scores = np.asarray(X.sum(axis=0)).ravel()
    terms = np.array(vec.get_feature_names_out())
    idx = scores.argsort()[::-1][:k]
    return terms[idx].tolist()

def build_issue_graph(case_id: str, distance_threshold: float = 0.35, min_cluster_size: int = 2) -> Dict[str, Any]:
    with driver.session() as s:
        rows = [dict(r) for r in s.run("MATCH (c:Claim {case_id:$cid}) RETURN c.statement AS statement, c.embedding AS emb", cid=case_id)]

    rows = [r for r in rows if r.get("emb") and isinstance(r["emb"], list)]
    if len(rows) < 2:
        print("⚠️ Not enough embedded claims to build issues.")
        return {"issues_created": 0}

    texts = [r["statement"] for r in rows]
    E = np.array([r["emb"] for r in rows], dtype=float)

    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average"
    )
    labels = clustering.fit_predict(E)

    clusters: Dict[int, List[int]] = defaultdict(list)
    for i, lab in enumerate(labels):
        clusters[int(lab)].append(i)

    created = 0
    with driver.session() as s:
        for lab, idxs in clusters.items():
            if len(idxs) < min_cluster_size:
                continue
            stmts = [texts[i] for i in idxs]
            issue_terms = _top_terms(stmts, k=6)
            issue_label = "; ".join(issue_terms[:4]) if issue_terms else f"issue_{lab}"
            iid = mk_id("RZN:ISS_", case_id, str(lab), issue_label, n=14)

            s.run("""
                MERGE (i:Issue {issue_id:$iid})
                SET i.case_id=$cid,
                    i.issue_label=$lbl,
                    i.top_terms=$terms,
                    i.size=$sz,
                    i.graph_category='RZN',
                    i.created_at=$ts
            """, iid=iid, cid=case_id, lbl=issue_label, terms=issue_terms, sz=len(idxs),
                 ts=datetime.now(timezone.utc).isoformat())

            for stmt in stmts:
                s.run("""
                    MATCH (c:Claim {case_id:$cid, statement:$stmt})
                    MATCH (i:Issue {issue_id:$iid})
                    MERGE (c)-[:ABOUT_ISSUE]->(i)
                """, cid=case_id, stmt=stmt, iid=iid)

            created += 1

    print(f"✅ Issues created: {created}")
    return {"issues_created": created}

_NEG_WORDS = {"not","never","no","none","cannot","can't","won't","without","deny","denies","denied"}

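def detect_contradictions(case_id: str, min_similarity: float = 0.72) -> int:
    """Flag claim pairs within an Issue that are similar but differ in negation.

    A pair gets a POTENTIAL_CONTRADICTION edge when its cosine similarity is at
    least min_similarity and exactly one statement contains a negation word.
    """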
    created = 0
    with driver.session() as s:
        issues = [r["issue_id"] for r in s.run("MATCH (i:Issue {case_id:$cid}) RETURN i.issue_id AS issue_id", cid=case_id)]
        for iid in issues:
            claims = [dict(r) for r in s.run("""
                MATCH (c:Claim {case_id:$cid})-[:ABOUT_ISSUE]->(i:Issue {issue_id:$iid})
                RETURN c.statement AS statement, c.embedding AS emb
            """, cid=case_id, iid=iid)]
            claims = [c for c in claims if c.get("emb") and isinstance(c["emb"], list)]
            if len(claims) < 2:
                continue

            texts = [c["statement"] for c in claims]
            E = np.array([c["emb"] for c in claims], dtype=float)
            S = _cosine_np(E, E)

            for i in range(len(texts)):
                for j in range(i+1, len(texts)):
                    sim = float(S[i,j])
                    if sim < min_similarity:
                        continue
                    ti = set(re.findall(r"[a-z']+", texts[i].lower()))
                    tj = set(re.findall(r"[a-z']+", texts[j].lower()))
                    ni = len(ti & _NEG_WORDS) > 0
                    nj = len(tj & _NEG_WORDS) > 0
                    if ni ^ nj:
                        s.run("""
                            MATCH (a:Claim {case_id:$cid, statement:$s1})
                            MATCH (b:Claim {case_id:$cid, statement:$s2})
                            MERGE (a)-[r:POTENTIAL_CONTRADICTION]->(b)
                            SET r.issue_id=$iid,
                                r.similarity=$sim,
                                r.rationale='negation_mismatch',
                                r.graph_category='RZN'
                        """, cid=case_id, s1=texts[i], s2=texts[j], iid=iid, sim=sim)
                        created += 1
    print(f"✅ Contradiction edges created: {created}")
    return created

def compute_actor_influence(case_id: str) -> Dict[str, float]:
    """Run PageRank over directed Actor-to-Actor edges and store influence_score on each Actor."""
    G = nx.DiGraph()
    with driver.session() as s:
        rows = s.run("""
            MATCH (a:Actor {case_id:$cid})-[r]->(b:Actor {case_id:$cid})
            RETURN a.name AS src, b.name AS dst
        """, cid=case_id).data()

    for r in rows:
        G.add_edge(r["src"], r["dst"])

    if G.number_of_nodes() == 0:
        print("⚠️ No Actor-Actor edges found; influence scoring skipped.")
        return {}

    pr = nx.pagerank(G, alpha=0.85)
    with driver.session() as s:
        for name, score in pr.items():
            s.run("MATCH (a:Actor {case_id:$cid, name:$name}) SET a.influence_score=$sc", cid=case_id, name=name, sc=float(score))

    print(f"✅ Wrote influence_score for {len(pr)} actors.")
    return pr

print("✅ Reasoning builders ready. Typical flow:")
print("   build_issue_graph(case_id)")
print("   detect_contradictions(case_id)")
print("   compute_actor_influence(case_id)")
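
For readers who want to see what the influence score measures, the following is a rough pure-Python power-iteration sketch of PageRank on a toy edge list. It is not the `nx.pagerank` implementation used above (which has its own convergence and dangling-node handling); the function name `pagerank_sketch`, the fixed iteration count, and the actors "A", "B", "C" are assumptions for illustration.

```python
from typing import Dict, List, Tuple

def pagerank_sketch(edges: List[Tuple[str, str]], alpha: float = 0.85,
                    iters: int = 100) -> Dict[str, float]:
    """Plain power-iteration PageRank over directed (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links: Dict[str, List[str]] = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        # Every node receives the same baseline "teleport" share.
        new = {node: (1.0 - alpha) / n for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = alpha * rank[node] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new[t] += alpha * rank[node] / n
        rank = new
    return rank

# Toy graph: A and B both point at C, and C points back at A.
edges = [("A", "C"), ("B", "C"), ("C", "A")]
scores = pagerank_sketch(edges)
print(max(scores, key=scores.get))
```

In this toy graph the actor that receives links from both others ends up with the highest score, and the actor nobody points to keeps only the baseline share.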


---
# 🔍 BLOCK 7: Semantic Search
**Search your knowledge graph using natural language.**

Examples:
- "actors with high leverage"
- "escalation events"
- "demands and threats"
- "security concerns"
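
As a rough illustration of what happens behind a query, the sketch below ranks a few hand-written vectors by cosine similarity, drops anything at or under a 0.4 cutoff, and keeps the top results. The three-dimensional vectors, node names, and helper names (`cosine`, `rank_nodes`) are made up for the example; real embeddings come from the Vector Engine, and the database's similarity function may scale its scores differently from the raw cosine used here.

```python
import math
from typing import Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Raw cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_nodes(query_vec: List[float], nodes: Dict[str, List[float]],
               min_score: float = 0.4, top_k: int = 15) -> List[Tuple[str, float]]:
    """Score every node, keep those above min_score, best first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in nodes.items()]
    hits = [(name, score) for name, score in scored if score > min_score]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical nodes with toy 3-d "embeddings".
nodes = {
    "Actor: Border Militia": [0.9, 0.1, 0.0],
    "Event: Ceasefire signed": [0.1, 0.9, 0.1],
    "Claim: Aid convoys blocked": [0.7, 0.2, 0.6],
}
query = [1.0, 0.0, 0.0]

results = rank_nodes(query, nodes)
for name, score in results:
    print(f"{score:.3f}  {name}")
```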

In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 7: SEMANTIC SEARCH
# ══════════════════════════════════════════════════════════════════════════════
# 🎯 PURPOSE: Search the graph using natural language queries
# ⏱️ TIME: Instant per query
# ══════════════════════════════════════════════════════════════════════════════

class SemanticSearch:
    """Natural language search across the knowledge graph."""

    def __init__(self, drv, vec_engine: VectorEngine):
        self.driver = drv
        self.vector_engine = vec_engine

    def search(self, query: str, node_labels: List[str] = None,
               case_id: str = None, top_k: int = 15) -> pd.DataFrame:
        """
        Semantic search across graph nodes.

        Args:
            query: Natural language query
            node_labels: Optional filter (e.g., ['Actor', 'Event'])
            case_id: Optional filter to specific case
            top_k: Number of results
        """
        # Generate query embedding
        query_vec = self.vector_engine.embed_text(query, "RETRIEVAL_QUERY")

        # Build Cypher. Filter values travel as query parameters instead of
        # being interpolated into the query text, so quotes in a label or
        # case ID cannot break or alter the statement.
        label_filter = " AND any(l IN labels(n) WHERE l IN $labels)" if node_labels else ""
        case_filter = " AND n.case_id = $case_id" if case_id else ""

        cypher = f"""
            MATCH (n)
            WHERE n.embedding IS NOT NULL {label_filter} {case_filter}
            WITH n, vector.similarity.cosine(n.embedding, $query_vec) AS score
            WHERE score > 0.4
            ORDER BY score DESC
            LIMIT $top_k
            RETURN
                labels(n)[0] AS Type,
                COALESCE(n.name, LEFT(n.description, 80), LEFT(n.statement, 80)) AS Content,
                n.case_id AS Case,
                round(score * 1000) / 1000 AS Score
        """

        try:
            with self.driver.session() as s:
                result = s.run(cypher, query_vec=query_vec, top_k=top_k,
                               labels=list(node_labels or []), case_id=case_id)
                return pd.DataFrame([dict(r) for r in result])
        except Exception as e:
            console.print(f"[yellow]⚠️ Search error: {e}[/yellow]")
            console.print("[yellow]Tip: Make sure you have nodes with embeddings.[/yellow]")
            return pd.DataFrame()

    def find_similar(self, entity_name: str, entity_type: str = "Actor") -> pd.DataFrame:
        """Find entities similar to a given one."""
        with self.driver.session() as s:
            # Get source embedding
            result = s.run(f"""
                MATCH (n:{entity_type} {{name: $name}})
                WHERE n.embedding IS NOT NULL
                RETURN n.embedding AS emb, n.case_id AS source_case
                LIMIT 1
            """, name=entity_name).single()

            if not result:
                return pd.DataFrame()

            source_emb = result['emb']
            source_case = result['source_case']

            # Find similar
            similar = s.run(f"""
                MATCH (n:{entity_type})
                WHERE n.embedding IS NOT NULL AND n.name <> $name
                WITH n, vector.similarity.cosine(n.embedding, $emb) AS score
                WHERE score > 0.5
                ORDER BY score DESC
                LIMIT 10
                RETURN n.name AS Name, n.case_id AS Case, round(score * 1000) / 1000 AS Similarity
            """, name=entity_name, emb=source_emb)

            return pd.DataFrame([dict(r) for r in similar])

# Initialize
semantic_search = SemanticSearch(driver, vector_engine)

# ─────────────────────────────────────────────────────────────────────────────
# INTERACTIVE UI
# ─────────────────────────────────────────────────────────────────────────────

def get_cases_for_dropdown():
    """Get list of cases for dropdown."""
    with driver.session() as s:
        result = s.run("MATCH (c:Case) RETURN c.case_name AS name, c.case_id AS id ORDER BY c.created_at DESC")
        return [(r['name'], r['id']) for r in result]

# Widgets
w_query = widgets.Text(
    placeholder='e.g., "high leverage actors" or "escalation events"',
    description='Query:',
    layout={'width': '500px'}
)

w_labels = widgets.SelectMultiple(
    options=['Actor', 'Event', 'Claim', 'Interest', 'Leverage', 'Constraint', 'Narrative', 'Commitment', 'Theory', 'Concept'],
    value=[],
    description='Filter:',
    layout={'width': '150px', 'height': '120px'}
)

cases = get_cases_for_dropdown()
w_case = widgets.Dropdown(
    options=[('All Cases', None)] + cases,
    description='Case:',
    layout={'width': '300px'}
)

btn_search = widgets.Button(description='🔍 Search', button_style='primary', layout={'width': '100px'})
out_search = widgets.Output()

def on_search(_):
    with out_search:
        clear_output()
        if not w_query.value:
            print("Enter a search query.")
            return

        print(f"🔍 Searching: '{w_query.value}'...")

        labels = list(w_labels.value) if w_labels.value else None
        df = semantic_search.search(
            query=w_query.value,
            node_labels=labels,
            case_id=w_case.value,
            top_k=20
        )

        if df.empty:
            print("\n⚠️ No results. Try:")
            print("   • Different keywords")
            print("   • Remove filters")
            print("   • Make sure documents are processed with embeddings")
        else:
            print(f"\n✅ Found {len(df)} results:\n")
            display(df.style.hide(axis='index').background_gradient(subset=['Score'], cmap='Greens'))

btn_search.on_click(on_search)

# Also search on Enter key
w_query.on_submit(on_search)

display(widgets.VBox([
    widgets.HTML("<h3>🧠 Semantic Search</h3>"),
    widgets.HBox([w_query, btn_search]),
    widgets.HBox([w_case, w_labels]),
    widgets.HTML("<hr>"),
    out_search
]))

print("\n💡 Tips:")
print("   • Search finds nodes by meaning, not just keywords")
print("   • Select multiple types in Filter to narrow results")
print("   • Press Enter or click Search to query")



---
# 📊 BLOCK 8: Cypher Query Lab
**Run predefined analytics queries on your conflict graph.**
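
Every predefined query is scoped to one case through the `$cid` parameter. If you extend the library, a small check like the one sketched here can catch a query that forgets that filter and would therefore scan every case. The helper name `missing_case_filter` and the two sample queries are hypothetical and not part of the notebook.

```python
import re
from typing import Dict, List

def missing_case_filter(library: Dict[str, str]) -> List[str]:
    """Return the names of queries that never reference the $cid parameter."""
    return [name for name, query in library.items()
            if not re.search(r"\$cid\b", query)]

# Hypothetical library entries, shaped like the notebook's query strings.
library = {
    "Actors": "MATCH (a:Actor {case_id: $cid}) RETURN a.name AS Actor",
    "Everything": "MATCH (n) RETURN count(n) AS Count",
}

print(missing_case_filter(library))
```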

In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 8: CYPHER QUERY LAB
# ══════════════════════════════════════════════════════════════════════════════

QUERY_LIBRARY = {
    "👥 All Actors": "MATCH (a:Actor {case_id: $cid}) RETURN a.name AS Actor, a.role AS Role, a.type AS Type, a.is_primary AS Primary ORDER BY a.is_primary DESC",
    "⚔️ Conflicts (OPPOSES)": "MATCH (a:Actor {case_id: $cid})-[r:OPPOSES]->(b:Actor) RETURN a.name AS Actor1, 'OPPOSES' AS Rel, b.name AS Actor2",
    "🤝 Alliances": "MATCH (a:Actor {case_id: $cid})-[r:ALLIES_WITH]->(b:Actor) RETURN a.name AS Actor1, 'ALLIES_WITH' AS Rel, b.name AS Actor2",
    "💪 Leverage": "MATCH (a:Actor {case_id: $cid})-[:HOLDS_LEVERAGE]->(l:Leverage) RETURN a.name AS Actor, l.type AS Type, l.description AS Leverage",
    "🎯 Interests": "MATCH (a:Actor {case_id: $cid})-[:HAS_INTEREST]->(i:Interest) RETURN a.name AS Actor, i.type AS Type, i.description AS Interest",
    "📅 Events": "MATCH (e:Event {case_id: $cid}) RETURN e.date AS Date, e.description AS Event, e.impact AS Impact ORDER BY e.date",
    "🗣️ Claims": "MATCH (a:Actor {case_id: $cid})-[:MAKES_CLAIM]->(c:Claim) RETURN a.name AS Actor, c.type AS Type, c.statement AS Claim",
    "📖 Narratives": "MATCH (a:Actor {case_id: $cid})-[:PROMOTES_NARRATIVE]->(n:Narrative) RETURN a.name AS Actor, n.frame AS Frame, n.description AS Narrative",
    "🕸️ Most Connected": "MATCH (a:Actor {case_id: $cid}) OPTIONAL MATCH (a)-[r]-() WITH a, count(r) AS c RETURN a.name AS Actor, c AS Connections ORDER BY c DESC LIMIT 10",
    "📈 Statistics": "MATCH (n {case_id: $cid}) WITH labels(n)[0] AS type, count(*) AS c RETURN type AS Type, c AS Count ORDER BY c DESC"
}

def run_query(query, case_id):
    with driver.session() as s:
        return pd.DataFrame([dict(r) for r in s.run(query, cid=case_id)])

# UI
cases = get_cases_for_dropdown()
if cases:
    w_case_q = widgets.Dropdown(options=cases, description='Case:', layout={'width': '350px'})
    w_query_type = widgets.Dropdown(options=list(QUERY_LIBRARY.keys()), description='Query:')
    out_query = widgets.Output()

    def update_query(_):
        with out_query:
            clear_output()
            df = run_query(QUERY_LIBRARY[w_query_type.value], w_case_q.value)
            if df.empty:
                print("⚠️ No results")
            else:
                print(f"✅ {len(df)} results:")
                display(df.style.hide(axis='index'))

    w_case_q.observe(update_query, names='value')
    w_query_type.observe(update_query, names='value')
    display(widgets.VBox([widgets.HTML('<h3>📊 Query Lab</h3>'), widgets.HBox([w_case_q, w_query_type]), out_query]))
    update_query(None)
else:
    print('⚠️ No cases found. Process documents first.')


---
# 🕸️ BLOCK 9: Visualization Dashboard
**Interactive network and timeline visualizations.**
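
The network view is assembled in two passes: nodes are collected first, and an edge is drawn only when both of its endpoints were actually added. A minimal sketch of that rule, independent of pyvis and Neo4j, might look like this; the function name `assemble_graph`, the dictionary shapes, and the sample IDs are assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple

def assemble_graph(actors: List[str],
                   others: List[Tuple[str, str]],
                   rels: List[Tuple[str, str, str]]) -> Dict[str, list]:
    """Collect nodes first, then keep only edges whose endpoints were added."""
    added: Set[str] = set()
    nodes: List[Dict[str, str]] = []

    for name in actors:
        nodes.append({"id": name, "kind": "Actor"})
        added.add(name)

    for node_id, kind in others:
        if node_id not in added:  # skip nodes we have already placed
            nodes.append({"id": node_id, "kind": kind})
            added.add(node_id)

    edges = [
        {"source": src, "target": dst, "rel": rel}
        for src, dst, rel in rels
        if src in added and dst in added
    ]
    return {"nodes": nodes, "edges": edges}

graph = assemble_graph(
    actors=["Actor A", "Actor B"],
    others=[("claim-1", "Claim"), ("claim-1", "Claim")],  # duplicate on purpose
    rels=[
        ("Actor A", "Actor B", "OPPOSES"),
        ("Actor A", "claim-1", "MAKES_CLAIM"),
        ("Actor B", "theory-9", "INFORMED_BY"),  # dropped: target never added
    ],
)
print(len(graph["nodes"]), len(graph["edges"]))
```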

In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 9: VISUALIZATION
# ══════════════════════════════════════════════════════════════════════════════

from pyvis.network import Network
import plotly.graph_objects as go

def visualize_graph(case_id, height='550px'):
    print(f'🕸️ Building graph...')
    net = Network(height=height, width='100%', bgcolor='#1a1a2e', font_color='white', notebook=True, cdn_resources='remote')
    net.barnes_hut(gravity=-2500, spring_length=100)

    added = set()
    with driver.session() as s:
        # Actors
        for a in s.run('MATCH (a:Actor {case_id: $cid}) RETURN a.name AS n, a.is_primary AS p, a.role AS r', cid=case_id):
            net.add_node(a['n'], label=a['n'], color='#e74c3c' if a['p'] else '#c0392b', size=30 if a['p'] else 20, title=a['r'])
            added.add(a['n'])

        # Other nodes
        colors = {'Event': '#9b59b6', 'Claim': '#3498db', 'Interest': '#2ecc71', 'Leverage': '#f39c12'}
        for o in s.run('MATCH (a:Actor {case_id: $cid})-[]->(n) WHERE n.case_id = $cid AND NOT n:Actor RETURN DISTINCT n.entity_id AS id, labels(n)[0] AS t, LEFT(COALESCE(n.description, n.statement), 25) AS l', cid=case_id):
            if o['id'] not in added:
                net.add_node(o['id'], label=o['l'] or o['t'], color=colors.get(o['t'], '#95a5a6'), size=12)
                added.add(o['id'])

        # Edges
        edge_colors = {'OPPOSES': '#e74c3c', 'ALLIES_WITH': '#2ecc71'}
        for r in s.run('MATCH (a {case_id: $cid})-[r]->(b) WHERE b.case_id = $cid OR b.case_id IS NULL RETURN COALESCE(a.name, a.entity_id) AS s, COALESCE(b.name, b.entity_id) AS t, type(r) AS rel', cid=case_id):
            if r['s'] in added and r['t'] in added:
                net.add_edge(r['s'], r['t'], color=edge_colors.get(r['rel'], '#555'), title=r['rel'])

    print(f'✅ {len(added)} nodes')
    display(HTML(net.generate_html()))

def visualize_timeline(case_id):
    df = run_query('MATCH (e:Event {case_id: $cid}) RETURN e.date AS d, e.description AS desc, e.impact AS imp ORDER BY e.date', case_id)
    if df.empty: return print('⚠️ No events')
    colors = {'critical': '#e74c3c', 'high': '#f39c12', 'moderate': '#3498db'}
    fig = go.Figure(go.Scatter(x=df['d'], y=[1]*len(df), mode='markers+text', marker=dict(size=12, color=[colors.get(str(x),'#95a5a6') for x in df['imp']]), text=[str(d)[:18]+'...' for d in df['desc']], textposition='top center', hovertext=df['desc']))
    fig.update_layout(title='Timeline', height=220, template='plotly_dark', showlegend=False, yaxis={'visible': False})
    fig.show()

# UI
cases = get_cases_for_dropdown()
if cases:
    w_case_v = widgets.Dropdown(options=cases, description='Case:', layout={'width': '350px'})
    btn_g = widgets.Button(description='🕸️ Graph', button_style='primary')
    btn_t = widgets.Button(description='📅 Timeline', button_style='info')
    out_v = widgets.Output()
    # Click handlers render inside out_v and clear it first, so each click replaces the previous view.

    def on_graph(_):
        with out_v:
            clear_output()
            visualize_graph(w_case_v.value)

    def on_timeline(_):
        with out_v:
            clear_output()
            visualize_timeline(w_case_v.value)

    btn_g.on_click(on_graph)
    btn_t.on_click(on_timeline)
    display(widgets.VBox([widgets.HTML('<h3>🕸️ Visualization</h3>'), widgets.HBox([w_case_v, btn_g, btn_t]), out_v]))
else:
    print('⚠️ No cases found.')


🕸️ Building graph...
✅ 106 nodes


---
# 📤 BLOCK 10: Export & Database Manager
**Export data and manage your graph database.**
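
The export writes subject-predicate-object triples in two formats. As a small sketch of that step without a database, the snippet below serializes a hand-written list of triples to CSV and JSON in memory; the sample triples and the helper name `triples_to_csv` are illustrative assumptions.

```python
import csv
import io
import json
from typing import Dict, List

def triples_to_csv(triples: List[Dict[str, str]]) -> str:
    """Serialize triples to CSV text; an empty list yields an empty string."""
    out = io.StringIO()
    if triples:
        writer = csv.DictWriter(out, fieldnames=list(triples[0].keys()))
        writer.writeheader()
        writer.writerows(triples)
    return out.getvalue()

# Hypothetical triples shaped like the exporter's rows.
triples = [
    {"subject": "Actor A", "predicate": "OPPOSES", "object": "Actor B"},
    {"subject": "Actor A", "predicate": "MAKES_CLAIM", "object": "Border is closed"},
]

csv_text = triples_to_csv(triples)
json_text = json.dumps(triples, indent=2)
print(csv_text)
```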

In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 10: EXPORT & DATABASE MANAGER
# ══════════════════════════════════════════════════════════════════════════════

class SPOExporter:
    def __init__(self, drv): self.driver = drv

    def export_triples(self, case_id, fmt='json'):
        with self.driver.session() as s:
            triples = [dict(r) for r in s.run('MATCH (a {case_id: $cid})-[r]->(b) RETURN COALESCE(a.name, a.entity_id) AS subject, type(r) AS predicate, COALESCE(b.name, b.description, b.entity_id) AS object, labels(a)[0] AS subj_type, labels(b)[0] AS obj_type', cid=case_id)]
        if fmt == 'csv':
            import io, csv
            out = io.StringIO()
            if triples:
                w = csv.DictWriter(out, fieldnames=triples[0].keys())
                w.writeheader()
                w.writerows(triples)
            return out.getvalue()
        return json.dumps(triples, indent=2, default=str)

    def download(self, case_id):
        ts = datetime.now().strftime('%Y%m%d_%H%M')
        for fmt, ext in [('json', 'json'), ('csv', 'csv')]:
            fn = f'tacitus_{case_id[:20]}_{ts}.{ext}'
            with open(fn, 'w') as f: f.write(self.export_triples(case_id, fmt))
            files.download(fn)
            print(f'✅ Downloaded: {fn}')

class DBManager:
    def __init__(self, drv): self.driver = drv

    def get_stats(self):
        with self.driver.session() as s:
            cases = pd.DataFrame([dict(r) for r in s.run('MATCH (c:Case) OPTIONAL MATCH (c)-[:CONTAINS]->(n) WITH c, count(n) AS nodes RETURN c.case_name AS Name, c.case_id AS ID, nodes AS Nodes ORDER BY c.created_at DESC')])
            types = pd.DataFrame([dict(r) for r in s.run('MATCH (n) RETURN labels(n)[0] AS Type, count(*) AS Count ORDER BY Count DESC')])
        return cases, types

    def delete_case(self, case_id):
        with self.driver.session() as s:
            s.run('MATCH (n {case_id: $cid}) DETACH DELETE n', cid=case_id)
            s.run('MATCH (c:Case {case_id: $cid}) DELETE c', cid=case_id)

    def wipe_all(self):
        with self.driver.session() as s: s.run('MATCH (n) DETACH DELETE n')

exporter = SPOExporter(driver)
db_mgr = DBManager(driver)

# UI
out_db = widgets.Output()

def refresh():
    with out_db:
        clear_output()
        cases_df, types_df = db_mgr.get_stats()
        print('📁 CASES:')
        if cases_df.empty:
            print('   (none)')
        else:
            display(cases_df.style.hide(axis='index'))
        print('\n📊 TYPES:')
        if types_df.empty:
            print('   (none)')
        else:
            display(types_df.style.hide(axis='index'))
    w_del.options = get_cases_for_dropdown() or [('None', '')]

btn_ref = widgets.Button(description='🔄 Refresh', button_style='info')
w_del = widgets.Dropdown(options=get_cases_for_dropdown() or [('None', '')], description='Case:')
w_confirm = widgets.Checkbox(description='Confirm', value=False)
btn_del = widgets.Button(description='🗑️ Delete', button_style='danger')
btn_exp = widgets.Button(description='📥 Export', button_style='success')
btn_wipe = widgets.Button(description='💣 WIPE ALL', button_style='danger')

btn_ref.on_click(lambda _: refresh())

def on_del(_):
    with out_db:
        if not w_confirm.value: return print('⚠️ Check Confirm first')
        if not w_del.value: return print('⚠️ Select case')
        deleted = w_del.value
        db_mgr.delete_case(deleted)
        w_confirm.value = False
        refresh()  # clears out_db, so print the confirmation afterwards
        print(f'✅ Deleted {deleted}')

def on_exp(_):
    with out_db:
        if w_del.value: exporter.download(w_del.value)

def on_wipe(_):
    with out_db:
        if not w_confirm.value: return print('⚠️ Check Confirm first')
        db_mgr.wipe_all()
        w_confirm.value = False
        refresh()  # clears out_db, so print the confirmation afterwards
        print('✅ Database wiped')

btn_del.on_click(on_del)
btn_exp.on_click(on_exp)
btn_wipe.on_click(on_wipe)

display(widgets.VBox([
    widgets.HTML('<h3>🗄️ Database Manager</h3>'),
    btn_ref, out_db,
    widgets.HTML('<hr>'),
    widgets.HBox([w_del, btn_exp, btn_del]),
    widgets.HBox([w_confirm, btn_wipe])
]))
refresh()

VBox(children=(HTML(value='<h3>🗄️ Database Manager</h3>'), Button(button_style='info', description='🔄 Refresh'…

---
# 📦 BLOCK 10A: Case Bundle Export (Graph + Evidence + Run Traces)
**Goal:** Export a complete case bundle for review, auditing, sharing, or offline analysis.

Outputs:
- triples.json / triples.csv
- case_bundle.json (nodes + edges + evidence spans + runs)
- zipped bundle you can download from Colab
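
The zipping step can be sketched in isolation: walk the bundle directory and store each file under a path relative to that directory, so the archive unpacks cleanly. The helper name `zip_dir`, the temporary directory, and the placeholder bundle contents are assumptions for the example; the archive is written outside the directory being zipped so it does not include itself.

```python
import json
import os
import tempfile
import zipfile
from typing import List

def zip_dir(src_dir: str, zip_path: str) -> List[str]:
    """Zip every file under src_dir using paths relative to src_dir."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as z:
        for root, _, names in os.walk(src_dir):
            for fn in names:
                p = os.path.join(root, fn)
                z.write(p, arcname=os.path.relpath(p, src_dir))
    # Re-open the archive to report what was stored.
    with zipfile.ZipFile(zip_path) as z:
        return sorted(z.namelist())

with tempfile.TemporaryDirectory() as tmp:
    bundle_dir = os.path.join(tmp, "bundle")
    os.makedirs(bundle_dir)
    # Placeholder bundle content, not real case data.
    with open(os.path.join(bundle_dir, "case_bundle.json"), "w", encoding="utf-8") as f:
        json.dump({"case_id": "demo", "nodes": [], "relationships": []}, f)
    names = zip_dir(bundle_dir, os.path.join(tmp, "bundle_demo.zip"))

print(names)
```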


In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 10A: CASE BUNDLE EXPORT
# ══════════════════════════════════════════════════════════════════════════════

import zipfile

def export_case_bundle(case_id: str, out_dir: str = "tacitus_bundle") -> str:
    os.makedirs(out_dir, exist_ok=True)

    # 1) triples via existing exporter
    exporter = SPOExporter(driver)
    triples_json = exporter.export_triples(case_id, fmt="json")
    triples_csv  = exporter.export_triples(case_id, fmt="csv")
    with open(os.path.join(out_dir, "triples.json"), "w", encoding="utf-8") as f:
        f.write(triples_json)
    with open(os.path.join(out_dir, "triples.csv"), "w", encoding="utf-8") as f:
        f.write(triples_csv)

    # 2) richer bundle (nodes, rels, evidence, runs)
    # Return property maps rather than raw Node/Relationship objects so the
    # bundle serializes as structured JSON instead of object reprs.
    with driver.session() as s:
        nodes = [dict(r) for r in s.run("MATCH (n {case_id:$cid}) RETURN labels(n) AS labels, properties(n) AS props", cid=case_id)]
        rels = [dict(r) for r in s.run("""
            MATCH (a {case_id:$cid})-[r]->(b {case_id:$cid})
            RETURN labels(a) AS a_labels, properties(a) AS a, type(r) AS type, properties(r) AS r, labels(b) AS b_labels, properties(b) AS b
        """, cid=case_id)]
        evidence = [dict(r) for r in s.run("""
            MATCH (d:SourceDoc {case_id:$cid})-[:HAS_CHUNK]->(k:SourceChunk)-[:HAS_SPAN]->(t:TextSpan)
            OPTIONAL MATCH (t)-[:EVIDENCE_FOR]->(x)
            RETURN d.doc_id AS doc_id, k.chunk_id AS chunk_id, properties(t) AS span, labels(x) AS ev_labels, properties(x) AS ev_target
        """, cid=case_id)]
        runs = [dict(r) for r in s.run("MATCH (r:ExtractionRun {case_id:$cid}) RETURN properties(r) AS run", cid=case_id)]

    bundle = {
        "case_id": case_id,
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "nodes": nodes,
        "relationships": rels,
        "evidence": evidence,
        "runs": runs,
    }

    with open(os.path.join(out_dir, "case_bundle.json"), "w", encoding="utf-8") as f:
        json.dump(bundle, f, indent=2, default=str)

    # 3) zip it
    zip_path = f"{out_dir}_{case_id[:18]}.zip"
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as z:
        for root, _, files_ in os.walk(out_dir):
            for fn in files_:
                p = os.path.join(root, fn)
                z.write(p, arcname=os.path.relpath(p, out_dir))

    print(f"✅ Bundle created: {zip_path}")
    return zip_path

def download_case_bundle(zip_path: str) -> None:
    files.download(zip_path)

print("✅ Bundle export ready:")
print("   zip_path = export_case_bundle(case_id)")
print("   download_case_bundle(zip_path)")


---
# 🔁 BLOCK 11: Transfer Neo4j → FalkorDB (per-case)
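**Goal:** Move a case graph into FalkorDB (RedisGraph-compatible), one case at a time and in batches. The transfer writes with `CREATE`, so re-running it for the same case into the same graph duplicates nodes and relationships.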
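
The transfer groups nodes by their first label and sends them in fixed-size batches. The sketch below shows just that grouping and chunking on plain dictionaries, with no database involved; the helper names `group_by_primary_label` and `batches` and the sample rows are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterator, List

def group_by_primary_label(nodes: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Bucket node property maps under their first label ('Node' if unlabeled)."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for n in nodes:
        labels = n.get("labels") or ["Node"]
        grouped[labels[0]].append(n.get("props") or {})
    return dict(grouped)

def batches(rows: List[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

# Hypothetical fetched rows, shaped like the Neo4j fetch result.
nodes = [
    {"labels": ["Actor"], "props": {"name": "A"}},
    {"labels": ["Actor"], "props": {"name": "B"}},
    {"labels": ["Actor"], "props": {"name": "C"}},
    {"labels": ["Claim"], "props": {"statement": "Border is closed"}},
    {"labels": [], "props": {"note": "unlabeled"}},
]

grouped = group_by_primary_label(nodes)
actor_batches = [len(b) for b in batches(grouped["Actor"], 2)]
print(sorted(grouped), actor_batches)
```

Smaller batches keep each write request modest at the cost of more round trips; the notebook's default is 500 rows per batch.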


In [None]:
# ══════════════════════════════════════════════════════════════════════════════
# BLOCK 11: NEO4J → FALKORDB TRANSFER (CASE-SCOPED)
# ══════════════════════════════════════════════════════════════════════════════

from falkordb import FalkorDB
from redis import ConnectionPool

def connect_falkordb(host: str, port: int = 6379, username: str = None, password: str = None, ssl: bool = False):
    """
    Best-effort connector:
    - Local/self-hosted: host/port
    - Cloud/auth: supply username/password and set ssl=True if required
    """
    if username or password or ssl:
        # Hand credentials and the TLS flag straight to the client. Errors are
        # left to propagate: retrying without auth would hide the real problem.
        return FalkorDB(host=host, port=port, username=username, password=password, ssl=ssl)
    return FalkorDB(host=host, port=port)

def _neo4j_fetch_case(driver, case_id: str):
    with driver.session() as s:
        nodes = [dict(r) for r in s.run("""
            MATCH (n {case_id:$cid})
            RETURN id(n) AS neo_id, labels(n) AS labels, properties(n) AS props
        """, cid=case_id)]
        rels = [dict(r) for r in s.run("""
            MATCH (a {case_id:$cid})-[r]->(b {case_id:$cid})
            RETURN id(a) AS src, id(b) AS dst, type(r) AS type, properties(r) AS props
        """, cid=case_id)]
    return nodes, rels

def transfer_case_to_falkordb(case_id: str, falkor_host: str, falkor_port: int, graph_name: str,
                              username: str = None, password: str = None, ssl: bool = False,
                              batch_size: int = 500) -> Dict[str, Any]:
    """
    Transfers nodes + relationships for a single case_id.
    Nodes grouped by primary label; each stored with _neo4j_id for re-linking edges.
    """
    db = connect_falkordb(falkor_host, falkor_port, username=username, password=password, ssl=ssl)
    g = db.select_graph(graph_name)

    nodes, rels = _neo4j_fetch_case(driver, case_id)
    if not nodes:
        print("⚠️ No nodes found for this case_id.")
        return {"nodes": 0, "rels": 0}

    # 1) Create nodes grouped by label
    by_label = defaultdict(list)
    for n in nodes:
        labels = n.get("labels") or ["Node"]
        lbl = labels[0] if labels else "Node"
        props = n.get("props") or {}
        props["_neo4j_id"] = int(n["neo_id"])
        props["_neo_labels"] = labels
        by_label[lbl].append({"props": props})

    node_created = 0
    for lbl, rows in by_label.items():
        for i in range(0, len(rows), batch_size):
            batch = rows[i:i+batch_size]
            q = f"UNWIND $rows AS row CREATE (n:`{lbl}`) SET n = row.props"
            g.query(q, {"rows": batch})
            node_created += len(batch)

    # 2) Create relationships grouped by type
    by_type = defaultdict(list)
    for r in rels:
        props = r.get("props") or {}
        props["_neo_rel_type"] = r["type"]
        by_type[r["type"]].append({"src": int(r["src"]), "dst": int(r["dst"]), "props": props})

    rel_created = 0
    for rt, rows in by_type.items():
        for i in range(0, len(rows), batch_size):
            batch = rows[i:i+batch_size]
            q = f"""
            UNWIND $rows AS row
            MATCH (a) WHERE a._neo4j_id = row.src
            MATCH (b) WHERE b._neo4j_id = row.dst
            CREATE (a)-[r:`{rt}`]->(b)
            SET r = row.props
            """
            g.query(q, {"rows": batch})
            rel_created += len(batch)

    print(f"✅ Falkor transfer complete: nodes={node_created} rels={rel_created} graph={graph_name}")
    return {"nodes": node_created, "rels": rel_created, "graph": graph_name}

print("✅ FalkorDB transfer ready. Example:")
print("   transfer_case_to_falkordb(case_id, falkor_host='...', falkor_port=6379, graph_name='tacitus')")


# ✅ TACITUS v7.0 Ready!

## Quick Reference

| Task | Block |
|------|-------|
| Setup & Test | Blocks 1–4A |
| Seed Theories (GND) | Block 5 |
| Ingest Case Files (CTX) | Block 6 (+6A) |
| Dedup Suggestions (HITL) | Block 6B |
| Evidence & QA | Blocks 6C–6D |
| Reasoning Graph (RZN) | Block 6E |
| Search & Analytics | Blocks 7–9 |
| Export & Bundles | Blocks 10–10A |
| Neo4j → FalkorDB Transfer | Block 11 |

**Design intent:** v7 adds provenance (spans), run-level metrics, optional reasoning-layer construction, and an integration path into FalkorDB.

---
