Skip to content

stylusnexus/agent-armor

Repository files navigation

Agent Armor

npm version License: MIT

agentarmor.dev | npm | GitHub

Open-source security framework for AI agents. Detects and defends against AI Agent Traps — adversarial content designed to manipulate, deceive, or exploit autonomous AI agents.

Built on the taxonomy from AI Agent Traps (Franklin et al., Google DeepMind, 2025).

Why This Matters

AI agents ingest content they didn't generate: web pages, RAG chunks, tool outputs, database results. Any of that content can contain instructions designed to hijack the agent's behavior, and the agent can't tell the difference between data and directives.

This isn't theoretical. It's happening now:

  • Slack AI data exfiltration (2024): Poisoned messages in Slack channels caused the AI assistant to extract and leak data from private channels through tool calls.
  • EchoLeak / Microsoft 365 Copilot: A zero-click attack where a single email with hidden instructions made Copilot exfiltrate data from OneDrive, SharePoint, and Teams, routed through trusted Microsoft URLs so it looked like internal links.
  • Devin AI pentest (2025): A $500 security test found the coding agent completely defenseless against prompt injection. It exposed ports, leaked access tokens, and installed command-and-control malware.
  • SSH key exfiltration via GPT-4o (Jan 2026): A single poisoned email coerced GPT-4o into executing Python that exfiltrated SSH keys in 80% of trials.

Google DeepMind's AI Agent Traps taxonomy (Franklin et al., 2025) catalogs 14 attack types across 6 categories. OpenAI has stated that AI browsers "may always be vulnerable" to prompt injection.

Agent Armor scans content at every stage of the agent lifecycle: before ingestion, after retrieval, and before the agent's output reaches the user. Input validation and output validation in one pipeline.

Quick Start

Regex-only (synchronous)

Zero dependencies, sub-millisecond scans:

import { AgentArmor } from '@stylusnexus/agentarmor';

const armor = AgentArmor.regexOnly();

const result = armor.scanSync(htmlString);

if (!result.clean) {
  console.warn('Threats detected:', result.threats);
  // Use result.sanitized for cleaned content
}

With ML classifier (asynchronous)

For deeper detection using an ONNX-based classifier:

import { AgentArmor } from '@stylusnexus/agentarmor';

const armor = await AgentArmor.create({
  ml: { enabled: true },
});

const result = await armor.scan(htmlString);

if (!result.clean) {
  console.warn('Threats detected:', result.threats);
}

Install

Core package (regex-based detection, zero dependencies):

npm install @stylusnexus/agentarmor

Optional ML classifier for deeper detection:

npm install @stylusnexus/agentarmor-ml

Testing & Validation

Eval Suite

71 curated samples (49 adversarial, 22 benign) covering all 10 shipped detector types across 4 attack categories:

Strictness Detection Rate False Positive Rate
Permissive 83.7% 0.0%
Balanced 98.0% 0.0%
Strict 98.0% 0.0%

Adversarial samples are sourced from or inspired by: WASP benchmark (Evtimov et al.), HackAPrompt (Schulhoff et al., 2023), Greshake et al. (2023), real-world incident reports, and the DeepMind paper examples. Benign samples include security blog posts, legitimate HTML with display:none elements, CI/CD documentation, AI safety textbook excerpts, and normal emails.

Run it yourself: npx tsx scripts/eval/run-eval.ts

Real-World Attack Validation

A separate validation suite (examples/real-world-validation.ts) tests against 24 inlined samples drawn directly from published security research:

Source Attack Type Samples
Unit 42 (Palo Alto Networks), 2025 Hidden CSS/HTML injection 6
Greshake et al., 2023 Indirect prompt injection 4
JailbreakBench / HackAPrompt Jailbreak patterns 5
Embrace The Red (J. Rehberger) Data exfiltration via agents 4
Benign false-positive controls Security docs, normal HTML/email 5

Result: 100% detection, 0% false positives at balanced strictness. All samples are inlined for offline reproducibility, no network required.

Run it: npx tsx examples/real-world-validation.ts

Attack Categories Covered

Category Target Status What It Detects
Content Injection Perception Shipped Hidden HTML/CSS instructions, metadata injection, dynamic cloaking artifacts, syntactic masking
Behavioural Control Action Shipped Embedded jailbreak sequences, data exfiltration patterns, sub-agent spawning traps
Cognitive State Memory Shipped RAG knowledge poisoning, latent memory poisoning, contextual learning manipulation
Semantic Manipulation Reasoning Shipped Biased framing/priming, oversight evasion, persona hyperstition
Systemic Multi-Agent Planned Congestion traps, interdependence cascades, tacit collusion, compositional fragments, sybil attacks
Human-in-the-Loop Overseer Planned Approval fatigue induction, social engineering via compromised agent

Configuration

const armor = await AgentArmor.create({
  // Enable/disable specific detectors
  contentInjection: {
    hiddenHTML: true,        // CSS display:none, off-screen positioning
    metadataInjection: true, // aria-label, HTML comments with instructions
    dynamicCloaking: true,   // Bot detection scripts
    syntacticMasking: true,  // Markdown/LaTeX payload hiding
  },
  behaviouralControl: {
    jailbreakPatterns: true, // Known jailbreak sequence detection
    exfiltrationURLs: true,  // Data exfiltration patterns
    privilegeEscalation: true, // Sub-agent spawning triggers
  },
  // Strictness affects confidence thresholds
  // 'permissive' = fewer alerts, higher confidence required
  // 'balanced'   = recommended default
  // 'strict'     = more alerts, catches subtle attacks
  strictness: 'balanced',

  // ML classifier (requires @stylusnexus/agentarmor-ml)
  ml: {
    enabled: true,
    // Behavior when ML model is unavailable:
    // 'throw' (default) | 'warn-and-skip' | 'silent-skip'
    onUnavailable: 'warn-and-skip',
  },
});

For sync-only usage without ML, use AgentArmor.regexOnly() which accepts the same options minus ml:

const armor = AgentArmor.regexOnly({
  strictness: 'strict',
  contentInjection: { hiddenHTML: true, metadataInjection: true },
});

Scan Results

Every scan returns a ScanResult with full threat details:

interface ScanResult {
  clean: boolean;          // true if no threats found
  threats: Threat[];       // sorted by severity, then confidence
  sanitized: string;       // content with threats neutralized
  durationMs: number;      // scan time in milliseconds
  stats: {
    detectorsRun: number;
    threatsFound: number;
    highestSeverity: 'low' | 'medium' | 'high' | 'critical' | null;
  };
}

interface Threat {
  category: TrapCategory;  // e.g. 'content-injection'
  type: TrapType;          // e.g. 'hidden-html'
  severity: Severity;      // 'low' | 'medium' | 'high' | 'critical'
  confidence: number;      // 0-1
  description: string;     // human-readable explanation
  evidence: string;        // the offending content (truncated)
  location?: { offset: number; length: number };
  detectorId: string;
  source: 'pattern' | 'ml' | 'custom';  // how the threat was detected
}

The source field indicates which detection method found the threat:

  • 'pattern' — matched by a regex pattern from the built-in pattern database
  • 'ml' — flagged by the ML classifier
  • 'custom' — found by a user-provided custom detector

ML Classifier

The optional @stylusnexus/agentarmor-ml package adds an ONNX-based classifier that catches threats regex patterns might miss. It downloads the model on first use and caches it locally.

const armor = await AgentArmor.create({
  ml: {
    enabled: true,
    // Optional: point to a local model directory
    modelDir: './models/agentarmor',
    // Optional: configure download behavior
    download: {
      timeoutMs: 120_000,
      retries: 2,
      onProgress: (received, total) => {
        console.log(`Downloading model: ${Math.round(received / total * 100)}%`);
      },
    },
    // Optional: gracefully degrade if model is unavailable
    onUnavailable: 'warn-and-skip',
  },
});

When ML is enabled, calling await armor.scan(content) runs both regex and ML detectors. The ML classifier's threats have source: 'ml' in the result, making it easy to distinguish them from pattern-based detections.

If the ML package is not installed or the model is unavailable, behavior depends on the onUnavailable setting: 'throw' (default), 'warn-and-skip', or 'silent-skip'.

Architecture

Agent Armor operates as a middleware pipeline with three interception points:

External Content --> [Pre-Ingestion Scanner] --> Agent Context
                                                      |
                    [Post-Retrieval Scanner] <-- RAG/Memory Store
                                                      |
                    [Pre-Execution Scanner]  --> Agent Output --> User

Each interception point has both sync and async methods:

Stage Sync Async
Pre-ingestion scanSync(content) await scan(content)
Post-retrieval scanRAGChunksSync(chunks) await scanRAGChunks(chunks)
Pre-execution scanOutputSync(output) await scanOutput(output)

Detectors

Content Injection (Shipped)

  • HiddenHTMLDetector — Finds instructions hidden via CSS (display:none, visibility:hidden, off-screen positioning)
  • MetadataInjectionDetector — Scans HTML comments, aria-label, alt attributes, meta tags for injected instructions
  • DynamicCloakingDetector — Detects JavaScript patterns that serve different content to agents vs humans
  • SyntacticMaskingDetector — Identifies payloads hidden in Markdown link text, LaTeX commands, zero-width characters, or bidi overrides

Behavioural Control (Shipped)

  • JailbreakPatternDetector — Pattern-matches against known jailbreak templates (DAN, role-play bypasses, educational framing exploits, developer mode claims)
  • ExfiltrationDetector — Flags instructions that attempt to locate, encode, and transmit context data to external endpoints
  • SubAgentSpawningDetector — Detects instructions that try to instantiate new agents, escalate tool permissions, or inject pipeline steps

Performance

The core regex detectors have zero dependencies and run with sub-millisecond latency. When the ML classifier is enabled, scan times increase but remain practical for real-time use:

  • Regex only: <1ms for small content, ~2-5ms for 10KB, ~10-20ms for 100KB
  • With ML: ~50-200ms depending on content length and hardware

Use scanSync() for latency-critical paths and await scan() when ML detection is needed.

Custom Detectors

Extend Agent Armor with your own detectors:

import { AgentArmor, type Detector } from '@stylusnexus/agentarmor';

const myDetector: Detector = {
  id: 'my-custom-detector',
  name: 'My Custom Detector',
  category: 'content-injection',
  scan: (content, options) => {
    // Your sync detection logic
    return {
      threats: [
        // Each threat must include the `source` field
        {
          category: 'content-injection',
          type: 'hidden-html',
          severity: 'high',
          confidence: 0.95,
          description: 'Found suspicious pattern',
          evidence: content.slice(0, 100),
          detectorId: 'my-custom-detector',
          source: 'custom',
        },
      ],
    };
  },
  // Optional: async detection (used by `scan()`, `scanRAGChunks()`, `scanOutput()`)
  scanAsync: async (content, options) => {
    // Your async detection logic (e.g. call an external API)
    return { threats: [] };
  },
  sanitize: (content, threats) => content,
};

const armor = AgentArmor.regexOnly({
  customDetectors: [myDetector],
});

Updatable Pattern Database

Patterns are data-driven, not hardcoded. Update without upgrading the package:

// Fetch latest patterns from your pattern server
const latestPatterns = await AgentArmor.fetchLatestPatterns('https://your-server.com/patterns.json');
armor.loadPatterns(latestPatterns);

// Or load custom patterns directly
armor.loadPatterns(myCustomPatterns);

// Check current pattern version
console.log(armor.patternVersion); // '0.2.0'

Framework Agnostic

Agent Armor works with any LLM agent framework:

  • LangChain / LangGraph — Use as a preprocessing step in your chain
  • Claude Code / Anthropic SDK — Drop into tool result processing
  • OpenAI Agents SDK — Wrap tool outputs before context assembly
  • AutoGen / CrewAI — Add as an inter-agent message filter
  • Custom agents — Call directly in your pipeline

Examples

The examples/ directory has ready-to-run integration examples:

Example Audience What it shows
customer-facing-agent.ts SMB / Startup Protect a support chatbot: scan knowledge base, customer messages, and agent output
audit-logging.ts Enterprise Policy enforcement + structured audit log for compliance (SOC2, ISO 27001)
tool-output-guard.ts Developer Guard every tool call in a custom agent loop (web, DB, file, API)
rag-pipeline.ts Developer Filter poisoned RAG chunks before LLM context assembly
express-middleware.ts Developer Express middleware that scans and sanitizes requests
web-content-scanner.ts Developer Scan raw HTML from web fetches in strict mode
ml-classifier.ts Developer Async pipeline with ML classifier enabled
custom-detector.ts Developer Implement and register a custom Detector
real-world-validation.ts Security Validate against real-world attack samples from published research

Run any example:

npx tsx examples/rag-pipeline.ts

Roadmap & Research Opportunities

Agent Armor covers 4 of the 6 attack categories in the DeepMind taxonomy. Here's what's shipped, what's next, and where the open questions are.

Shipped

  • Content Injection (4 detectors) and Behavioural Control (3 detectors) since v0.1.0
  • Cognitive State (3 detectors) and Semantic Manipulation (3 detectors) since v0.2.0
  • ML classifier (DeBERTa-v3-small, ONNX) as optional companion package
  • Pattern database v0.4.0 with 71 pattern entries

In Progress

  • Expanded eval dataset. 71 samples is a start, not a finish. Integrating larger public datasets (deepset/prompt-injections at 662 samples, Giskard-AI) to stress-test detection and false positive rates at scale.
  • Honeypot/canary system. Behavioral baseline approach for detecting novel attacks that bypass pattern matching. Measures response distribution drift rather than relying on known signatures.
  • Pattern update API. Continuous pattern improvements delivered without requiring an npm upgrade.

Not Yet Covered (P2)

These are the remaining 2 categories from the taxonomy. They're harder problems with less established detection approaches:

  • Systemic attacks (multi-agent): congestion traps, interdependence cascades, tacit collusion, compositional fragment attacks, sybil attacks. These target multi-agent architectures and require detection approaches that reason about agent-to-agent interactions, not just content.
  • Human-in-the-loop attacks: approval fatigue induction, social engineering via compromised agent. These exploit the human overseer rather than the agent itself. Detection likely requires behavioral analysis over time rather than content scanning.

Open Research Questions

If you're a researcher or practitioner thinking about these problems, we'd value your perspective:

  • Hierarchical vs. flat classification. The ML classifier uses multi-label flat classification (14 outputs). Would a hierarchical approach (category first, then type) better reflect the taxonomy structure and improve accuracy on underrepresented categories?
  • Semantic manipulation detection. Biased framing and persona hyperstition are inherently subtle. Regex catches the obvious cases, but sophisticated semantic attacks may require embedding-level analysis or chain-of-thought reasoning about intent. What's the right detection architecture here?
  • Cross-session attack detection. Latent memory poisoning and contextual learning traps accumulate over multiple interactions. The current scan-per-input approach can't detect gradual drift. What does a stateful detection layer look like?
  • Adversarial robustness of the detector itself. If an attacker knows the pattern database, they can craft bypasses. How do we make the detection layer robust to adversarial evasion without creating an arms race?

Staying Updated

  • Pattern database is versioned and updatable independently of the npm package via AgentArmor.fetchLatestPatterns()
  • ML model is retrained periodically with new attack samples and pushed to HuggingFace
  • GitHub releases track all changes with a CHANGELOG
  • Security issues can be reported via SECURITY.md

FAQ

Who is this for?

Agent Armor protects agents you build and control. If you're writing agent code using Claude API, Azure OpenAI, LangChain, CrewAI, AutoGen, or any framework where you own the data pipeline, this is for you.

You are... Agent Armor helps you...
A developer building an AI agent that calls tools, browses the web, or uses RAG Scan every piece of external content before it enters your agent's context
A startup/SMB with a customer-facing AI chatbot or support agent Protect your knowledge base from poisoning and your agent's output from manipulation
An enterprise team building custom AI tooling on top of LLM APIs Add audit logging, policy enforcement, and compliance evidence to your agent pipeline

Can this protect our Microsoft 365 Copilot / Claude.ai / ChatGPT deployment?

Not directly. Those are closed pipelines where the vendor controls the scanning. Agent Armor can't insert itself between Copilot and the content it reads from SharePoint or Teams. If you're using a hosted AI product as-is, the vendor is responsible for security on their end.

Where it does fit: if your team is building custom agents using the Claude API, Azure OpenAI, or other LLM APIs, you control the pipeline, and Agent Armor is the scanning layer for it.

Isn't this just prompt injection detection?

Prompt injection is one attack type out of the 10 we cover (detected by 13 detectors). Prompt injection targets chatbots within a single conversation. Agent traps target autonomous agents with tool access, persistent memory, and sub-agent spawning. Different attack surface, different blast radius.

The full taxonomy includes content injection, behavioral control, cognitive state manipulation (RAG/memory poisoning), and semantic manipulation (biased framing, persona shifts). These are distinct attack categories with different detection approaches.

Can a determined attacker bypass this?

Yes. A sophisticated adversary with knowledge of the pattern database can craft content that evades regex detection. The ML classifier raises the bar significantly, but no detection system is foolproof.

Agent Armor is defense-in-depth. It raises the cost of attack and catches the broad majority of real-world attacks. Think of it as input validation for your agent pipeline, grounded in a real taxonomy rather than guesswork.

What about false positives?

False positives are the hardest problem in this space. Naive regex on security-adjacent vocabulary (phrases like "ignore previous instructions," "system prompt," "act as") generates enormous noise on legitimate developer content, documentation, and security research.

The solution is a two-pass detection pipeline: structural pattern match first, then an instruction signal context check. Patterns that would cause noise have a requireInstructions flag that prevents them from firing without that second signal. On our eval suite of 71 samples (including security blog posts, AI safety textbooks, and CI/CD documentation as benign controls), the false positive rate is 0%.

How much latency does this add?

The regex-based core runs in sub-millisecond time for typical content. Under 5ms for 10KB, under 20ms for 100KB. Zero runtime dependencies.

With the ML classifier enabled, expect 50-200ms depending on content length and hardware. Use scanSync() for latency-critical paths and await scan() when you want ML detection.

Is my data sent anywhere?

No. Everything runs locally. The regex detectors are pure computation with no network calls. The ML classifier runs an ONNX model on your machine. No content leaves your infrastructure. The only network call is the one-time model download (~140MB) on first use, which can be skipped by bundling the model in your deployment.

Contributing

We welcome contributions! See CONTRIBUTING.md for setup and workflow details.

Areas where contributions are especially valuable:

  • New adversarial samples for the evaluation suite (scripts/eval/samples.ts)
  • New detection patterns for the pattern database (src/patterns/default-patterns.ts)
  • Custom detectors for novel attack vectors
  • Research on the open questions above

Research Foundation

This project implements defenses based on the systematic framework proposed in:

Franklin, M., Tomasev, N., Jacobs, J., Leibo, J.Z., & Osindero, S. (2025). AI Agent Traps. Google DeepMind. papers.ssrn.com/sol3/papers.cfm?abstract_id=6372438

License

MIT

About

Open-source agent security framework. Detects and defends against AI Agent Traps — content injection, embedded jailbreaks, RAG poisoning, data exfiltration, and more. Based on the DeepMind Agent Traps taxonomy.

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors