[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/swarm-ai-safety/swarm/blob/main/examples/illusion_delta_minimal.ipynb)

# Illusion Delta: Replay-Based Incoherence Detection

This notebook demonstrates **illusion delta**, a metric for detecting hidden incoherence in multi-agent systems. The core idea: run the same scenario with different random seeds and observe where agents make *different decisions* at the same decision points. This decision-level disagreement reveals incoherence that single-run metrics miss.

**No API keys required. Runs entirely locally (or in Colab).**

In [None]:
# --- Setup ---
# This cell handles installation automatically.
# In Colab: clones the repo and installs SWARM.
# Locally: assumes you've already run `pip install -e ".[runtime]"`.
import os

if os.getenv("COLAB_RELEASE_TAG"):
    !git clone --depth 1 https://github.com/swarm-ai-safety/swarm.git /content/swarm
    %pip install -q -e "/content/swarm[runtime]"
    os.chdir("/content/swarm")
    print("Installed SWARM from GitHub. Ready to go!")
else:
    print("Local environment detected \u2014 using existing install.")

## The Illusion Delta Concept

A system can *appear* coherent on a single run while harboring latent instability. Illusion delta captures this gap:

- **Perceived coherence**: How consistent the accepted interactions look in a single run (measured via average `p` of accepted interactions).
- **Distributed coherence**: How consistently the system makes the *same decisions* across multiple replays with different seeds.
- **Illusion delta** = perceived coherence - distributed coherence. A positive value means the system looks more coherent than it actually is.

A high illusion delta is a red flag: it suggests that decision-level noise, for example from deceptive or unstable agents, is going undetected by single-run metrics.
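
To make the definition concrete, here is a minimal sketch of the arithmetic. It assumes perceived coherence is the mean `p` of accepted interactions and distributed coherence is one minus the mean per-decision disagreement rate. The second assumption and the name `sketch_illusion_delta` are illustrative only; the actual `swarm.metrics.incoherence.illusion_delta` may compute these quantities differently.

```python
from statistics import fmean


def sketch_illusion_delta(
    p_values: list[float], disagreement_rates: list[float]
) -> float:
    """Illustrative only: perceived coherence minus distributed coherence."""
    # Perceived coherence: average p of accepted interactions in one run.
    perceived = fmean(p_values) if p_values else 0.0
    # ASSUMPTION: distributed coherence = 1 - mean disagreement across replays.
    distributed = (1.0 - fmean(disagreement_rates)) if disagreement_rates else 1.0
    return perceived - distributed


# Accepted interactions look good (mean p = 0.9),
# but replays disagree 30% of the time on average.
delta = sketch_illusion_delta([0.9, 0.9, 0.9], [0.0, 0.5, 0.4])
print(round(delta, 3))  # 0.2
```

Under these assumptions, a run that looks 90% coherent while replays agree only 70% of the time yields an illusion delta of 0.2.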

In [None]:
import tempfile
from collections import defaultdict
from pathlib import Path

import matplotlib.pyplot as plt
from swarm.agents.deceptive import DeceptiveAgent
from swarm.agents.honest import HonestAgent
from swarm.core.orchestrator import Orchestrator, OrchestratorConfig
from swarm.logging.event_log import EventLog
from swarm.metrics.incoherence import disagreement_rate, illusion_delta
from swarm.models.events import EventType

In [None]:
def _run_with_log(seed: int) -> tuple[list[dict[str, object]], list[float]]:
    """Run simulation with given seed, extract decisions and accepted p values from event log."""
    with tempfile.TemporaryDirectory() as tmpdir:
        log_path = Path(tmpdir) / f"events_{seed}.jsonl"
        cfg = OrchestratorConfig(
            n_epochs=12,
            steps_per_epoch=8,
            seed=seed,
            log_path=log_path,
            log_events=True,
            observation_noise_probability=0.25,
            observation_noise_std=0.15,
        )
        orchestrator = Orchestrator(config=cfg)
        orchestrator.register_agent(HonestAgent(agent_id="honest_a", name="Alice"))
        orchestrator.register_agent(HonestAgent(agent_id="honest_b", name="Bob"))
        orchestrator.register_agent(
            DeceptiveAgent(agent_id="deceptive_x", name="Mallory")
        )
        orchestrator.run()

        events = list(EventLog(log_path).replay())

    # Annotation matches the function's declared return type.
    decisions: list[dict[str, object]] = []
    accepted_p_values: list[float] = []

    completed_acceptance: dict[str, bool] = {}
    for event in events:
        if (
            event.event_type == EventType.INTERACTION_COMPLETED
            and event.interaction_id is not None
        ):
            completed_acceptance[event.interaction_id] = bool(
                event.payload.get("accepted", False)
            )

    proposal_index = 0
    for event in events:
        if (
            event.event_type != EventType.INTERACTION_PROPOSED
            or event.interaction_id is None
            or event.initiator_id is None
            or event.counterparty_id is None
        ):
            continue

        decision_id = f"proposal_slot_{proposal_index}"
        proposal_index += 1

        accepted = completed_acceptance.get(event.interaction_id, False)
        interaction_type = str(event.payload.get("interaction_type", "unknown"))
        action_token = (
            f"{event.initiator_id}->{event.counterparty_id}:{interaction_type}:"
            f"{'accept' if accepted else 'reject'}"
        )
        decisions.append({"decision_id": decision_id, "action": action_token})

        if accepted:
            accepted_p_values.append(float(event.payload.get("p", 0.0)))

    return decisions, accepted_p_values

## Multi-Seed Replay

We run the same 3-agent scenario (2 honest + 1 deceptive) with 4 different seeds. Each seed produces a different trajectory through the same decision space. Decision points are aligned across runs by proposal order (the n-th proposal in each run gets the ID `proposal_slot_n`), which lets us measure where the system disagrees with itself.
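
The alignment step can be illustrated on toy data. The sketch below groups actions by `decision_id` across replays and scores each slot as one minus the share of its most common action. That scoring rule, the helper name `toy_disagreement_rate`, and the sample action tokens are assumptions for illustration; the library's `disagreement_rate` may be defined differently.

```python
from collections import Counter, defaultdict


def toy_disagreement_rate(votes: list[str]) -> float:
    """ASSUMPTION: disagreement = 1 - share of the most common action."""
    if not votes:
        return 0.0
    most_common_count = Counter(votes).most_common(1)[0][1]
    return 1.0 - most_common_count / len(votes)


# Three hypothetical replays, each a list of (decision_id, action) pairs.
replays = [
    [("proposal_slot_0", "a->b:trade:accept"), ("proposal_slot_1", "x->a:trade:accept")],
    [("proposal_slot_0", "a->b:trade:accept"), ("proposal_slot_1", "x->a:trade:reject")],
    [("proposal_slot_0", "a->b:trade:accept"), ("proposal_slot_1", "x->b:trade:accept")],
]

# Align replays by decision_id and collect one vote per replay per slot.
votes_by_slot: defaultdict[str, list[str]] = defaultdict(list)
for replay in replays:
    for decision_id, action in replay:
        votes_by_slot[decision_id].append(action)

rates = {slot: toy_disagreement_rate(votes) for slot, votes in votes_by_slot.items()}
print(rates)
```

Because the action token used in this notebook encodes the initiator, counterparty, and interaction type as well as the accept/reject outcome, a slot registers disagreement whenever any of those differ between replays, not only when an acceptance flips.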

In [None]:
seeds = [7, 8, 9, 10]

all_decisions: list[list[dict[str, object]]] = []
all_accepted_ps: list[list[float]] = []

for seed in seeds:
    decisions, accepted_ps = _run_with_log(seed)
    all_decisions.append(decisions)
    all_accepted_ps.append(accepted_ps)
    print(f"Seed {seed}: {len(decisions)} decisions, {len(accepted_ps)} accepted")

In [None]:
# Compute per-decision disagreement rates across replays
decision_votes: defaultdict[str, list[str]] = defaultdict(list)
replay_counts: defaultdict[str, int] = defaultdict(int)

for decisions in all_decisions:
    seen_ids: set[str] = set()
    for row in decisions:
        decision_id = str(row["decision_id"])
        decision_votes[decision_id].append(str(row["action"]))
        if decision_id not in seen_ids:
            replay_counts[decision_id] += 1
            seen_ids.add(decision_id)

disagreement_rates = [
    disagreement_rate(votes)
    for decision_id, votes in decision_votes.items()
    if replay_counts[decision_id] >= 2 and len(votes) >= 2
]

# Perceived coherence is a single-run quantity, so the first seed serves as the reference run.
p_values = all_accepted_ps[0]
gap = illusion_delta(p_values=p_values, disagreement_rates=disagreement_rates)

In [None]:
# Summary statistics
print("seed  accepted_interactions")
for seed, accepted_ps in zip(seeds, all_accepted_ps):
    # Left-align the seed in a 6-character column so rows line up under the header.
    print(f"{seed:<6}{len(accepted_ps)}")

print(f"\nAccepted samples (reference seed): {len(p_values)}")
print(f"Decision points compared:          {len(disagreement_rates)}")
print(f"Perceived coherence:               {gap.perceived_coherence:.4f}")
print(f"Distributed coherence:             {gap.distributed_coherence:.4f}")
print(f"Illusion delta:                    {gap.illusion_delta:.4f}")

In [None]:
fig = plt.figure(figsize=(7, 4))
ax = fig.add_subplot(1, 1, 1)
ax.hist(disagreement_rates, bins=10, color="#c0392b", alpha=0.85)
ax.set_title(f"Decision disagreement (illusion \u0394={gap.illusion_delta:.3f})")
ax.set_xlabel("decision-level disagreement rate")
ax.set_ylabel("count")
ax.grid(alpha=0.3)
fig.tight_layout()
plt.show()

## Interpretation

**Reading the results:**

- **Illusion delta > 0**: The system appears more coherent than it actually is. Decision-level noise, whether from the deceptive agent (Mallory) or from the configured observation noise, is invisible to a single run.
- **Disagreement histogram**: Peaks near 0 mean most decisions are stable across seeds. A long right tail indicates fragile decision points where small perturbations flip outcomes.
- **Perceived vs distributed coherence**: A large gap suggests the system's apparent safety is brittle.
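
One way to act on the histogram is to quantify its right tail. The sketch below uses made-up rates and a hypothetical threshold of 0.5 to report what share of decision points are fragile and which slots are the worst offenders. Neither the threshold nor the helper name `summarize_tail` comes from SWARM.

```python
def summarize_tail(
    rates_by_slot: dict[str, float], threshold: float = 0.5, top_k: int = 3
) -> tuple[float, list[tuple[str, float]]]:
    """Return the share of fragile slots and the top_k highest-disagreement slots."""
    if not rates_by_slot:
        return 0.0, []
    # A slot counts as fragile when its disagreement rate reaches the threshold.
    fragile = [slot for slot, rate in rates_by_slot.items() if rate >= threshold]
    ranked = sorted(rates_by_slot.items(), key=lambda item: item[1], reverse=True)
    return len(fragile) / len(rates_by_slot), ranked[:top_k]


# Made-up disagreement rates for four decision slots.
example_rates = {
    "proposal_slot_0": 0.0,
    "proposal_slot_1": 0.75,
    "proposal_slot_2": 0.25,
    "proposal_slot_3": 0.5,
}
fragile_share, worst = summarize_tail(example_rates, threshold=0.5, top_k=2)
print(f"fragile share: {fragile_share:.2f}")
print("worst slots:", worst)
```

With real data, `rates_by_slot` would need to be keyed by `decision_id`, which means keeping the IDs alongside the rates rather than flattening them into a list.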

**Next steps:**

- Increase the number of seeds for tighter estimates of distributed coherence.
- Try different agent compositions (more deceptive agents, adaptive adversaries) to see how illusion delta scales.
- Run the full CLI version with more epochs: `python examples/illusion_delta_minimal.py`
- Explore governance mechanisms that reduce illusion delta: `python examples/parameter_sweep.py`

**Repository:** [github.com/swarm-ai-safety/swarm](https://github.com/swarm-ai-safety/swarm)