[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/swarm-ai-safety/swarm/blob/main/examples/run_redteam.ipynb)

# Red-Team Evaluation: 8 Attack Vectors

This notebook tests **governance resilience** against adversarial attacks using SWARM's red-team evaluation framework. We run 8 attack scenarios (reputation farming, collusion rings, threshold dancing, and more) against 4 governance configurations of increasing strictness and compare how each configuration holds up.

**No API keys required. Runs entirely locally (or in Colab).**

**Difficulty:** Intermediate

## What you'll learn

| Step | What happens | Key concept |
|------|-------------|-------------|
| 1 | Define 4 governance configs | Defense layering: circuit breakers, audits, staking, collusion detection |
| 2 | Run red-team evaluation against each | Automated adversarial testing |
| 3 | Compare evasion rates and damage | Quantifying governance effectiveness |
| 4 | Visualize which attacks succeed where | Identifying coverage gaps |

In [None]:
# --- Setup ---
# This cell handles installation automatically.
# In Colab: clones the repo and installs SWARM.
# Locally: assumes you've already run `pip install -e ".[runtime]"`.
import os

if os.getenv("COLAB_RELEASE_TAG"):
    !git clone --depth 1 https://github.com/swarm-ai-safety/swarm.git /content/swarm
    %pip install -q -e "/content/swarm[runtime]"
    os.chdir("/content/swarm")
    print("Installed SWARM from GitHub. Ready to go!")
else:
    print("Local environment detected -- using existing install.")

## Governance Configurations

We test 4 governance configurations, each adding more defense layers:

| Config | Circuit Breaker | Collusion Detection | Audit | Staking | Tax Rate |
|--------|:-:|:-:|:-:|:-:|:-:|
| **no_defenses** | - | - | - | - | 0% |
| **baseline** | Yes | - | - | - | 0% |
| **moderate** | Yes | - | Yes | Yes | 5% |
| **strict** | Yes | Yes | Yes | Yes | 10% |

**Red-team evaluation** systematically probes each config with a library of adversarial attack scenarios. Each attack exploits a different governance weakness (e.g., reputation farming builds trust before exploiting, collusion rings coordinate to evade pairwise detection, threshold dancing stays just below circuit-breaker thresholds).
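To see why threshold dancing defeats a simple circuit breaker, consider the toy simulation below. It is a deliberately simplified sketch, not SWARM's circuit-breaker implementation: the function `run_attack`, the per-action `threshold`, and the harm values are all hypothetical and chosen only for illustration.

```python
def run_attack(harm_per_step, threshold=0.5, steps=10):
    """Simulate a toy per-action circuit breaker (illustrative assumption).

    The agent is frozen the first time a single action's harm exceeds
    `threshold`; the blocked action does no damage.
    Returns (total_damage, step_at_which_frozen or None).
    """
    total = 0.0
    for step in range(steps):
        if harm_per_step > threshold:
            return total, step  # breaker trips, agent is frozen
        total += harm_per_step
    return total, None  # never detected


naive = run_attack(harm_per_step=0.9)    # trips the breaker on the first action
dancer = run_attack(harm_per_step=0.49)  # stays just under the threshold
print("naive attacker: ", naive)
print("threshold dancer:", dancer)
```

In this toy model the naive attacker is stopped before doing any harm, while the threshold dancer is never frozen and accumulates damage on every step. This is the kind of gap that retrospective mechanisms such as audits are meant to close.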

The key metrics are:
- **Attacks succeeded**: How many of the attack vectors broke through
- **Evasion rate**: Fraction of adversarial actions that went undetected
- **Total damage**: Cumulative harm caused across all attacks
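To make these definitions concrete, the sketch below aggregates a list of per-attack records into the three summary metrics. The helper `summarize` and the record fields `succeeded` and `evasion_rate` are hypothetical (only `damage_caused` appears later in this notebook), and the real `RedTeamEvaluator` may compute its summary differently.

```python
from statistics import mean


def summarize(records):
    """Aggregate per-attack records into summary metrics (illustrative only)."""
    return {
        "attacks_tested": len(records),
        "attacks_successful": sum(1 for r in records if r["succeeded"]),
        # Unweighted mean of the per-attack evasion rates.
        "avg_evasion_rate": mean(r["evasion_rate"] for r in records) if records else 0.0,
        "total_damage": sum(r["damage_caused"] for r in records),
    }


# Hypothetical records, not real SWARM output.
example_records = [
    {"succeeded": True, "evasion_rate": 0.5, "damage_caused": 10.0},
    {"succeeded": False, "evasion_rate": 0.25, "damage_caused": 2.0},
]
summary = summarize(example_records)
print(summary)
```

Note that an attack can fail overall and still contribute to total damage, which is why the two numbers are reported separately.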

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from swarm.redteam.evaluator import RedTeamEvaluator

GOVERNANCE_CONFIGS = {
    "no_defenses": {
        "circuit_breaker_enabled": False,
        "collusion_detection_enabled": False,
        "audit_enabled": False,
        "staking_enabled": False,
        "transaction_tax_rate": 0.0,
    },
    "baseline": {
        "circuit_breaker_enabled": True,
        "collusion_detection_enabled": False,
        "audit_enabled": False,
        "staking_enabled": False,
        "transaction_tax_rate": 0.0,
    },
    "moderate": {
        "circuit_breaker_enabled": True,
        "collusion_detection_enabled": False,
        "audit_enabled": True,
        "staking_enabled": True,
        "transaction_tax_rate": 0.05,
    },
    "strict": {
        "circuit_breaker_enabled": True,
        "collusion_detection_enabled": True,
        "audit_enabled": True,
        "staking_enabled": True,
        "transaction_tax_rate": 0.10,
    },
}

In [None]:
# Run quick evaluation across all governance configs
all_results = {}
for config_name, gov_config in GOVERNANCE_CONFIGS.items():
    print(f"\nTesting: {config_name}")
    evaluator = RedTeamEvaluator(governance_config=gov_config)
    result = evaluator.quick_evaluate()
    all_results[config_name] = result
    print(f"  Attacks: {result['attacks_tested']}, Successful: {result['attacks_successful']}")
    print(f"  Evasion rate: {result['avg_evasion_rate']:.2%}, Damage: {result['total_damage']:.1f}")

In [None]:
# Build comparison DataFrame
rows = []
for config_name, result in all_results.items():
    tested = result["attacks_tested"]
    succeeded = result["attacks_successful"]
    rows.append({
        "config": config_name,
        "attacks_tested": tested,
        "attacks_succeeded": succeeded,
        "attacks_prevented": tested - succeeded,
        "evasion_rate": result["avg_evasion_rate"],
        "total_damage": result["total_damage"],
    })

df = pd.DataFrame(rows)
df = df.set_index("config")
print(df.to_string())

## Understanding the Metrics

- **Evasion rate**: The fraction of adversarial actions that went undetected by governance mechanisms. Lower is better -- a strict config should detect most adversarial behavior.
- **Total damage**: Cumulative harm across all attack scenarios. This captures how much damage slips through even when attacks are partially detected.
- **Attacks prevented**: Number of attack vectors that were fully neutralized (attack did not succeed). A "succeeded" attack means the adversary achieved its objective despite governance.

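The gap between `no_defenses` and `strict` shows the value of layered governance. The gap between `baseline` and `moderate` shows the combined effect of audits, staking, and the 5% tax; because all three are switched on together, this comparison alone cannot tell you which of them matters most.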
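When reading these gaps, it helps to list exactly which settings change between consecutive configs. The sketch below does that with a small hypothetical helper, `changed_settings`; the config dictionary repeats the values defined earlier in this notebook so that the snippet runs on its own.

```python
# Same values as the GOVERNANCE_CONFIGS cell above, repeated for self-containment.
GOVERNANCE_CONFIGS = {
    "no_defenses": {"circuit_breaker_enabled": False, "collusion_detection_enabled": False,
                    "audit_enabled": False, "staking_enabled": False,
                    "transaction_tax_rate": 0.0},
    "baseline": {"circuit_breaker_enabled": True, "collusion_detection_enabled": False,
                 "audit_enabled": False, "staking_enabled": False,
                 "transaction_tax_rate": 0.0},
    "moderate": {"circuit_breaker_enabled": True, "collusion_detection_enabled": False,
                 "audit_enabled": True, "staking_enabled": True,
                 "transaction_tax_rate": 0.05},
    "strict": {"circuit_breaker_enabled": True, "collusion_detection_enabled": True,
               "audit_enabled": True, "staking_enabled": True,
               "transaction_tax_rate": 0.10},
}


def changed_settings(before, after):
    """Return {setting: (old, new)} for every key whose value differs."""
    return {k: (before.get(k), after[k]) for k in after if before.get(k) != after[k]}


names = list(GOVERNANCE_CONFIGS)
for prev, curr in zip(names, names[1:]):
    diff = changed_settings(GOVERNANCE_CONFIGS[prev], GOVERNANCE_CONFIGS[curr])
    print(f"{prev} -> {curr}: {diff}")
```

Only the first step changes a single setting. To attribute an improvement to one mechanism in the later steps, you would need additional configs that toggle one setting at a time.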

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle("Red-Team Results: Governance Config Comparison", fontsize=14)

configs = df.index.tolist()
x = range(len(configs))
bar_colors = ["#d62728", "#ff7f0e", "#1f77b4", "#2ca02c"]

# --- Attacks succeeded vs prevented ---
ax = axes[0]
ax.bar(x, df["attacks_succeeded"], label="Succeeded", color="#d62728", alpha=0.85)
ax.bar(x, df["attacks_prevented"], bottom=df["attacks_succeeded"],
       label="Prevented", color="#2ca02c", alpha=0.85)
ax.set_xticks(list(x))
ax.set_xticklabels(configs, rotation=30, ha="right")
ax.set_ylabel("Number of attacks")
ax.set_title("Attack Outcomes")
ax.legend(loc="upper right")

# --- Total damage ---
ax = axes[1]
ax.bar(x, df["total_damage"], color=bar_colors, alpha=0.85)
ax.set_xticks(list(x))
ax.set_xticklabels(configs, rotation=30, ha="right")
ax.set_ylabel("Total damage")
ax.set_title("Cumulative Damage")

# --- Evasion rate ---
ax = axes[2]
ax.bar(x, df["evasion_rate"], color=bar_colors, alpha=0.85)
ax.set_xticks(list(x))
ax.set_xticklabels(configs, rotation=30, ha="right")
ax.set_ylabel("Evasion rate")
ax.set_title("Avg Evasion Rate")
ax.set_ylim(0, 1.0)

plt.tight_layout()
plt.show()

# --- Per-attack heatmap ---
# Build a matrix: rows = attack names, columns = configs.
# Assumes every config returns the same attacks in the same order.
attack_names = [r["scenario"]["name"] for r in all_results[configs[0]]["results"]]
heatmap_data = []
for config_name in configs:
    col = []
    for r in all_results[config_name]["results"]:
        col.append(r["damage_caused"])
    heatmap_data.append(col)

heatmap_df = pd.DataFrame(heatmap_data, index=configs, columns=attack_names).T

fig, ax = plt.subplots(figsize=(10, 4))
im = ax.imshow(heatmap_df.values, aspect="auto", cmap="YlOrRd")
ax.set_xticks(range(len(configs)))
ax.set_xticklabels(configs, rotation=30, ha="right")
ax.set_yticks(range(len(attack_names)))
ax.set_yticklabels(attack_names)
ax.set_title("Per-Attack Damage by Governance Config")
fig.colorbar(im, ax=ax, label="Damage")

# Annotate cells; use white text on the darkest cells for legibility
text_threshold = heatmap_df.values.max() * 0.6
for i in range(len(attack_names)):
    for j in range(len(configs)):
        val = heatmap_df.values[i, j]
        ax.text(j, i, f"{val:.0f}", ha="center", va="center",
                color="white" if val > text_threshold else "black",
                fontsize=9)

plt.tight_layout()
plt.show()

## Key Findings

**Expected patterns:**

1. **No defenses** should let every attack through with maximum damage -- this is the worst-case reference point (not to be confused with the `baseline` config).
2. **Circuit breakers alone** (baseline) provide a first line of defense but miss coordinated and slow-burn attacks.
3. **Moderate governance** (audits + staking + 5% tax) should catch more attacks because staking creates economic penalties and audits provide retrospective detection.
4. **Strict governance** (+ collusion detection + higher tax) should prevent the most attacks, but at higher operational cost.

**What to investigate next:**

- Run the script with `--mode full` for a deeper evaluation (20 epochs per attack): `python examples/run_redteam.py --mode full`
- Check which specific attacks still succeed under strict governance -- these represent the hardest unsolved problems.
- Use `examples/parameter_sweep.py` to find the minimum governance needed to prevent each attack type.
- Examine the per-attack heatmap above to identify which attack vectors are most robust to governance changes.
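As a starting point for the first few items, the sketch below lists the attacks that still cause damage under `strict` and, for each attack, the first config in the ordering that drives damage to zero. It assumes results shaped like `all_results` in this notebook (a `results` list whose entries carry `scenario["name"]` and `damage_caused`). Treating nonzero damage as "still succeeds" is an assumption, and the scenario names and numbers in `mock_results` are hypothetical, not real SWARM output.

```python
CONFIG_ORDER = ["no_defenses", "baseline", "moderate", "strict"]


def damage_table(all_results):
    """Map attack name -> {config name: damage}."""
    table = {}
    for config_name, result in all_results.items():
        for r in result["results"]:
            table.setdefault(r["scenario"]["name"], {})[config_name] = r["damage_caused"]
    return table


def weakest_sufficient_config(damages, order=CONFIG_ORDER, tol=0.0):
    """First config in `order` with damage <= tol, or None if none qualifies."""
    for config_name in order:
        if config_name in damages and damages[config_name] <= tol:
            return config_name
    return None


def _mock(farming, ring):
    # Helper that builds one hypothetical per-config result.
    return {"results": [
        {"scenario": {"name": "reputation_farming"}, "damage_caused": farming},
        {"scenario": {"name": "collusion_ring"}, "damage_caused": ring},
    ]}


mock_results = {
    "no_defenses": _mock(40.0, 55.0),
    "baseline": _mock(30.0, 50.0),
    "moderate": _mock(0.0, 35.0),
    "strict": _mock(0.0, 5.0),
}

table = damage_table(mock_results)
survivors = [name for name, d in table.items() if d.get("strict", 0.0) > 0.0]
minimum = {name: weakest_sufficient_config(d) for name, d in table.items()}
print("Still damaging under strict:", survivors)
print("Weakest sufficient config:", minimum)
```

To apply this to a real run, pass `all_results` in place of `mock_results`. This coarse lookup only considers the four preset configs; it is not a substitute for the finer-grained search in `examples/parameter_sweep.py`.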

**Repository:** [github.com/swarm-ai-safety/swarm](https://github.com/swarm-ai-safety/swarm)