agentarmor.dev | npm | GitHub
Open-source security framework for AI agents. Detects and defends against AI Agent Traps — adversarial content designed to manipulate, deceive, or exploit autonomous AI agents.
Built on the taxonomy from AI Agent Traps (Franklin et al., Google DeepMind, 2025).
AI agents ingest content they didn't generate: web pages, RAG chunks, tool outputs, database results. Any of that content can contain instructions designed to hijack the agent's behavior, and the agent can't tell the difference between data and directives.
This isn't theoretical. It's happening now:
- Slack AI data exfiltration (2024): Poisoned messages in Slack channels caused the AI assistant to extract and leak data from private channels through tool calls.
- EchoLeak / Microsoft 365 Copilot: A zero-click attack where a single email with hidden instructions made Copilot exfiltrate data from OneDrive, SharePoint, and Teams, routed through trusted Microsoft URLs so it looked like internal links.
- Devin AI pentest (2025): A $500 security test found the coding agent completely defenseless against prompt injection. It exposed ports, leaked access tokens, and installed command-and-control malware.
- SSH key exfiltration via GPT-4o (Jan 2026): A single poisoned email coerced GPT-4o into executing Python that exfiltrated SSH keys in 80% of trials.
Google DeepMind's AI Agent Traps taxonomy (Franklin et al., 2025) catalogs 14 attack types across 6 categories. OpenAI has stated that AI browsers "may always be vulnerable" to prompt injection.
Agent Armor scans content at every stage of the agent lifecycle: before ingestion, after retrieval, and before the agent's output reaches the user. Input validation and output validation in one pipeline.
Zero dependencies, sub-millisecond scans:
import { AgentArmor } from '@stylusnexus/agentarmor';
const armor = AgentArmor.regexOnly();
const result = armor.scanSync(htmlString);
if (!result.clean) {
console.warn('Threats detected:', result.threats);
// Use result.sanitized for cleaned content
}For deeper detection using an ONNX-based classifier:
import { AgentArmor } from '@stylusnexus/agentarmor';
const armor = await AgentArmor.create({
ml: { enabled: true },
});
const result = await armor.scan(htmlString);
if (!result.clean) {
console.warn('Threats detected:', result.threats);
}Core package (regex-based detection, zero dependencies):
npm install @stylusnexus/agentarmorOptional ML classifier for deeper detection:
npm install @stylusnexus/agentarmor-ml71 curated samples (49 adversarial, 22 benign) covering all 10 shipped detector types across 4 attack categories:
| Strictness | Detection Rate | False Positive Rate |
|---|---|---|
| Permissive | 83.7% | 0.0% |
| Balanced | 98.0% | 0.0% |
| Strict | 98.0% | 0.0% |
Adversarial samples are sourced from or inspired by: WASP benchmark (Evtimov et al.), HackAPrompt (Schulhoff et al., 2023), Greshake et al. (2023), real-world incident reports, and the DeepMind paper examples. Benign samples include security blog posts, legitimate HTML with display:none elements, CI/CD documentation, AI safety textbook excerpts, and normal emails.
Run it yourself: npx tsx scripts/eval/run-eval.ts
A separate validation suite (examples/real-world-validation.ts) tests against 24 inlined samples drawn directly from published security research:
| Source | Attack Type | Samples |
|---|---|---|
| Unit 42 (Palo Alto Networks), 2025 | Hidden CSS/HTML injection | 6 |
| Greshake et al., 2023 | Indirect prompt injection | 4 |
| JailbreakBench / HackAPrompt | Jailbreak patterns | 5 |
| Embrace The Red (J. Rehberger) | Data exfiltration via agents | 4 |
| Benign false-positive controls | Security docs, normal HTML/email | 5 |
Result: 100% detection, 0% false positives at balanced strictness. All samples are inlined for offline reproducibility, no network required.
Run it: npx tsx examples/real-world-validation.ts
| Category | Target | Status | What It Detects |
|---|---|---|---|
| Content Injection | Perception | Shipped | Hidden HTML/CSS instructions, metadata injection, dynamic cloaking artifacts, syntactic masking |
| Behavioural Control | Action | Shipped | Embedded jailbreak sequences, data exfiltration patterns, sub-agent spawning traps |
| Cognitive State | Memory | Shipped | RAG knowledge poisoning, latent memory poisoning, contextual learning manipulation |
| Semantic Manipulation | Reasoning | Shipped | Biased framing/priming, oversight evasion, persona hyperstition |
| Systemic | Multi-Agent | Planned | Congestion traps, interdependence cascades, tacit collusion, compositional fragments, sybil attacks |
| Human-in-the-Loop | Overseer | Planned | Approval fatigue induction, social engineering via compromised agent |
const armor = await AgentArmor.create({
// Enable/disable specific detectors
contentInjection: {
hiddenHTML: true, // CSS display:none, off-screen positioning
metadataInjection: true, // aria-label, HTML comments with instructions
dynamicCloaking: true, // Bot detection scripts
syntacticMasking: true, // Markdown/LaTeX payload hiding
},
behaviouralControl: {
jailbreakPatterns: true, // Known jailbreak sequence detection
exfiltrationURLs: true, // Data exfiltration patterns
privilegeEscalation: true, // Sub-agent spawning triggers
},
// Strictness affects confidence thresholds
// 'permissive' = fewer alerts, higher confidence required
// 'balanced' = recommended default
// 'strict' = more alerts, catches subtle attacks
strictness: 'balanced',
// ML classifier (requires @stylusnexus/agentarmor-ml)
ml: {
enabled: true,
// Behavior when ML model is unavailable:
// 'throw' (default) | 'warn-and-skip' | 'silent-skip'
onUnavailable: 'warn-and-skip',
},
});For sync-only usage without ML, use AgentArmor.regexOnly() which accepts the same options minus ml:
const armor = AgentArmor.regexOnly({
strictness: 'strict',
contentInjection: { hiddenHTML: true, metadataInjection: true },
});Every scan returns a ScanResult with full threat details:
interface ScanResult {
clean: boolean; // true if no threats found
threats: Threat[]; // sorted by severity, then confidence
sanitized: string; // content with threats neutralized
durationMs: number; // scan time in milliseconds
stats: {
detectorsRun: number;
threatsFound: number;
highestSeverity: 'low' | 'medium' | 'high' | 'critical' | null;
};
}
interface Threat {
category: TrapCategory; // e.g. 'content-injection'
type: TrapType; // e.g. 'hidden-html'
severity: Severity; // 'low' | 'medium' | 'high' | 'critical'
confidence: number; // 0-1
description: string; // human-readable explanation
evidence: string; // the offending content (truncated)
location?: { offset: number; length: number };
detectorId: string;
source: 'pattern' | 'ml' | 'custom'; // how the threat was detected
}The source field indicates which detection method found the threat:
'pattern'— matched by a regex pattern from the built-in pattern database'ml'— flagged by the ML classifier'custom'— found by a user-provided custom detector
The optional @stylusnexus/agentarmor-ml package adds an ONNX-based classifier that catches threats regex patterns might miss. It downloads the model on first use and caches it locally.
const armor = await AgentArmor.create({
ml: {
enabled: true,
// Optional: point to a local model directory
modelDir: './models/agentarmor',
// Optional: configure download behavior
download: {
timeoutMs: 120_000,
retries: 2,
onProgress: (received, total) => {
console.log(`Downloading model: ${Math.round(received / total * 100)}%`);
},
},
// Optional: gracefully degrade if model is unavailable
onUnavailable: 'warn-and-skip',
},
});When ML is enabled, calling await armor.scan(content) runs both regex and ML detectors. The ML classifier's threats have source: 'ml' in the result, making it easy to distinguish them from pattern-based detections.
If the ML package is not installed or the model is unavailable, behavior depends on the onUnavailable setting: 'throw' (default), 'warn-and-skip', or 'silent-skip'.
Agent Armor operates as a middleware pipeline with three interception points:
External Content --> [Pre-Ingestion Scanner] --> Agent Context
|
[Post-Retrieval Scanner] <-- RAG/Memory Store
|
[Pre-Execution Scanner] --> Agent Output --> User
Each interception point has both sync and async methods:
| Stage | Sync | Async |
|---|---|---|
| Pre-ingestion | scanSync(content) |
await scan(content) |
| Post-retrieval | scanRAGChunksSync(chunks) |
await scanRAGChunks(chunks) |
| Pre-execution | scanOutputSync(output) |
await scanOutput(output) |
- HiddenHTMLDetector — Finds instructions hidden via CSS (
display:none,visibility:hidden, off-screen positioning) - MetadataInjectionDetector — Scans HTML comments,
aria-label,altattributes,metatags for injected instructions - DynamicCloakingDetector — Detects JavaScript patterns that serve different content to agents vs humans
- SyntacticMaskingDetector — Identifies payloads hidden in Markdown link text, LaTeX commands, zero-width characters, or bidi overrides
- JailbreakPatternDetector — Pattern-matches against known jailbreak templates (DAN, role-play bypasses, educational framing exploits, developer mode claims)
- ExfiltrationDetector — Flags instructions that attempt to locate, encode, and transmit context data to external endpoints
- SubAgentSpawningDetector — Detects instructions that try to instantiate new agents, escalate tool permissions, or inject pipeline steps
The core regex detectors have zero dependencies and run with sub-millisecond latency. When the ML classifier is enabled, scan times increase but remain practical for real-time use:
- Regex only: <1ms for small content, ~2-5ms for 10KB, ~10-20ms for 100KB
- With ML: ~50-200ms depending on content length and hardware
Use scanSync() for latency-critical paths and await scan() when ML detection is needed.
Extend Agent Armor with your own detectors:
import { AgentArmor, type Detector } from '@stylusnexus/agentarmor';
const myDetector: Detector = {
id: 'my-custom-detector',
name: 'My Custom Detector',
category: 'content-injection',
scan: (content, options) => {
// Your sync detection logic
return {
threats: [
// Each threat must include the `source` field
{
category: 'content-injection',
type: 'hidden-html',
severity: 'high',
confidence: 0.95,
description: 'Found suspicious pattern',
evidence: content.slice(0, 100),
detectorId: 'my-custom-detector',
source: 'custom',
},
],
};
},
// Optional: async detection (used by `scan()`, `scanRAGChunks()`, `scanOutput()`)
scanAsync: async (content, options) => {
// Your async detection logic (e.g. call an external API)
return { threats: [] };
},
sanitize: (content, threats) => content,
};
const armor = AgentArmor.regexOnly({
customDetectors: [myDetector],
});Patterns are data-driven, not hardcoded. Update without upgrading the package:
// Fetch latest patterns from your pattern server
const latestPatterns = await AgentArmor.fetchLatestPatterns('https://your-server.com/patterns.json');
armor.loadPatterns(latestPatterns);
// Or load custom patterns directly
armor.loadPatterns(myCustomPatterns);
// Check current pattern version
console.log(armor.patternVersion); // '0.2.0'Agent Armor works with any LLM agent framework:
- LangChain / LangGraph — Use as a preprocessing step in your chain
- Claude Code / Anthropic SDK — Drop into tool result processing
- OpenAI Agents SDK — Wrap tool outputs before context assembly
- AutoGen / CrewAI — Add as an inter-agent message filter
- Custom agents — Call directly in your pipeline
The examples/ directory has ready-to-run integration examples:
| Example | Audience | What it shows |
|---|---|---|
customer-facing-agent.ts |
SMB / Startup | Protect a support chatbot: scan knowledge base, customer messages, and agent output |
audit-logging.ts |
Enterprise | Policy enforcement + structured audit log for compliance (SOC2, ISO 27001) |
tool-output-guard.ts |
Developer | Guard every tool call in a custom agent loop (web, DB, file, API) |
rag-pipeline.ts |
Developer | Filter poisoned RAG chunks before LLM context assembly |
express-middleware.ts |
Developer | Express middleware that scans and sanitizes requests |
web-content-scanner.ts |
Developer | Scan raw HTML from web fetches in strict mode |
ml-classifier.ts |
Developer | Async pipeline with ML classifier enabled |
custom-detector.ts |
Developer | Implement and register a custom Detector |
real-world-validation.ts |
Security | Validate against real-world attack samples from published research |
Run any example:
npx tsx examples/rag-pipeline.tsAgent Armor covers 4 of the 6 attack categories in the DeepMind taxonomy. Here's what's shipped, what's next, and where the open questions are.
- Content Injection (4 detectors) and Behavioural Control (3 detectors) since v0.1.0
- Cognitive State (3 detectors) and Semantic Manipulation (3 detectors) since v0.2.0
- ML classifier (DeBERTa-v3-small, ONNX) as optional companion package
- Pattern database v0.4.0 with 71 pattern entries
- Expanded eval dataset. 71 samples is a start, not a finish. Integrating larger public datasets (deepset/prompt-injections at 662 samples, Giskard-AI) to stress-test detection and false positive rates at scale.
- Honeypot/canary system. Behavioral baseline approach for detecting novel attacks that bypass pattern matching. Measures response distribution drift rather than relying on known signatures.
- Pattern update API. Continuous pattern improvements delivered without requiring an npm upgrade.
These are the remaining 2 categories from the taxonomy. They're harder problems with less established detection approaches:
- Systemic attacks (multi-agent): congestion traps, interdependence cascades, tacit collusion, compositional fragment attacks, sybil attacks. These target multi-agent architectures and require detection approaches that reason about agent-to-agent interactions, not just content.
- Human-in-the-loop attacks: approval fatigue induction, social engineering via compromised agent. These exploit the human overseer rather than the agent itself. Detection likely requires behavioral analysis over time rather than content scanning.
If you're a researcher or practitioner thinking about these problems, we'd value your perspective:
- Hierarchical vs. flat classification. The ML classifier uses multi-label flat classification (14 outputs). Would a hierarchical approach (category first, then type) better reflect the taxonomy structure and improve accuracy on underrepresented categories?
- Semantic manipulation detection. Biased framing and persona hyperstition are inherently subtle. Regex catches the obvious cases, but sophisticated semantic attacks may require embedding-level analysis or chain-of-thought reasoning about intent. What's the right detection architecture here?
- Cross-session attack detection. Latent memory poisoning and contextual learning traps accumulate over multiple interactions. The current scan-per-input approach can't detect gradual drift. What does a stateful detection layer look like?
- Adversarial robustness of the detector itself. If an attacker knows the pattern database, they can craft bypasses. How do we make the detection layer robust to adversarial evasion without creating an arms race?
- Pattern database is versioned and updatable independently of the npm package via
AgentArmor.fetchLatestPatterns() - ML model is retrained periodically with new attack samples and pushed to HuggingFace
- GitHub releases track all changes with a CHANGELOG
- Security issues can be reported via SECURITY.md
Agent Armor protects agents you build and control. If you're writing agent code using Claude API, Azure OpenAI, LangChain, CrewAI, AutoGen, or any framework where you own the data pipeline, this is for you.
| You are... | Agent Armor helps you... |
|---|---|
| A developer building an AI agent that calls tools, browses the web, or uses RAG | Scan every piece of external content before it enters your agent's context |
| A startup/SMB with a customer-facing AI chatbot or support agent | Protect your knowledge base from poisoning and your agent's output from manipulation |
| An enterprise team building custom AI tooling on top of LLM APIs | Add audit logging, policy enforcement, and compliance evidence to your agent pipeline |
Not directly. Those are closed pipelines where the vendor controls the scanning. Agent Armor can't insert itself between Copilot and the content it reads from SharePoint or Teams. If you're using a hosted AI product as-is, the vendor is responsible for security on their end.
Where it does fit: if your team is building custom agents using the Claude API, Azure OpenAI, or other LLM APIs, you control the pipeline, and Agent Armor is the scanning layer for it.
Prompt injection is one attack type out of the 10 we cover (detected by 13 detectors). Prompt injection targets chatbots within a single conversation. Agent traps target autonomous agents with tool access, persistent memory, and sub-agent spawning. Different attack surface, different blast radius.
The full taxonomy includes content injection, behavioral control, cognitive state manipulation (RAG/memory poisoning), and semantic manipulation (biased framing, persona shifts). These are distinct attack categories with different detection approaches.
Yes. A sophisticated adversary with knowledge of the pattern database can craft content that evades regex detection. The ML classifier raises the bar significantly, but no detection system is foolproof.
Agent Armor is defense-in-depth. It raises the cost of attack and catches the broad majority of real-world attacks. Think of it as input validation for your agent pipeline, grounded in a real taxonomy rather than guesswork.
False positives are the hardest problem in this space. Naive regex on security-adjacent vocabulary (phrases like "ignore previous instructions," "system prompt," "act as") generates enormous noise on legitimate developer content, documentation, and security research.
The solution is a two-pass detection pipeline: structural pattern match first, then an instruction signal context check. Patterns that would cause noise have a requireInstructions flag that prevents them from firing without that second signal. On our eval suite of 71 samples (including security blog posts, AI safety textbooks, and CI/CD documentation as benign controls), the false positive rate is 0%.
The regex-based core runs in sub-millisecond time for typical content. Under 5ms for 10KB, under 20ms for 100KB. Zero runtime dependencies.
With the ML classifier enabled, expect 50-200ms depending on content length and hardware. Use scanSync() for latency-critical paths and await scan() when you want ML detection.
No. Everything runs locally. The regex detectors are pure computation with no network calls. The ML classifier runs an ONNX model on your machine. No content leaves your infrastructure. The only network call is the one-time model download (~140MB) on first use, which can be skipped by bundling the model in your deployment.
We welcome contributions! See CONTRIBUTING.md for setup and workflow details.
Areas where contributions are especially valuable:
- New adversarial samples for the evaluation suite (
scripts/eval/samples.ts) - New detection patterns for the pattern database (
src/patterns/default-patterns.ts) - Custom detectors for novel attack vectors
- Research on the open questions above
This project implements defenses based on the systematic framework proposed in:
Franklin, M., Tomasev, N., Jacobs, J., Leibo, J.Z., & Osindero, S. (2025). AI Agent Traps. Google DeepMind. papers.ssrn.com/sol3/papers.cfm?abstract_id=6372438
MIT