AgentWatcher

AgentWatcher is a detection-based defense against indirect prompt injection in LLM agents. It first runs causal context attribution over untrusted context to find the most influential contexts, then applies a monitor LLM that classifies those contexts under explicit, customizable rules. Compared with fully black-box detectors, this pipeline is easier to interpret: attribution shows where the model focused, and the monitor LLM’s judgment is based on rule-grounded reasoning. An example for monitor LLM’s output is shown below:

This rule-based detection can leverage the reasoning ability of the monitor LLM to achieve a better trade-off between utility and robustness. AgentWatcher achieves state-of-art performance on LLM agent benchmarks such as AgentDyn:

This repository is intended for reproducing the main experiments from the paper.

🔨 Requirements

Python 3.10+ (tested with conda environments).
CUDA GPU(s) for the backend LLM (vLLM can be used optionally).
Hugging Face access for datasets and models.

pip install -r requirements.txt
huggingface-cli login   # if needed for gated models

🤗 Monitor LLM on Hugging Face

The trained monitor is a LoRA adapter (PEFT). By default, main.py automatically loads Hub checkpoint SecureLLMSys/AgentWatcher-Qwen3-4B-Instruct-2507 on SecureLLMSys (override with --monitor_llm).

🔬 Running experiments

All launchers live under scripts/. They assume you run from the AgentWatcher repo root (or cd there first). Please first set the OpenAI key with export OPENAI_API_KEY=<YOUR_KEY>. Edit the Python launchers (all_datasets, all_defenses, gpus, name, models, etc.) to match your sweep before running.

Long-context benchmark (`main.py`)

Batch jobs (local GPUs or Slurm):

python scripts/run_long_context.py

This drives main.py with vLLM (--use_vllm) and writes logs under logs/main_logs/<name>/.... With no local GPU it submits Slurm via scripts/main.sh.

One-off run without the launcher:

python main.py \
  --dataset lcc_long \
  --attack combined \
  --defense agentwatcher \
  --backend_llm Qwen/Qwen3-4B-Instruct-2507 \
  --monitor_llm SecureLLMSys/AgentWatcher-Qwen3-4B-Instruct-2507 \
  --use_vllm \
  --name my_run

Optional attribution overrides on main.py: --w_s, --w_l, --w_r, --K, --attribution_model.

AgentDojo

To run AgentDojo/AgentDyn, please first set up the enviornment following PIArena (https://github.com/sleeepeer/PIArena):

cd agents/agentdojo && pip install -e . && cd ../..

Then run:

python scripts/run_agentdojo.py

This calls main_agentdojo.py and writes logs under logs/agentdojo_logs/.... With no local GPU it uses scripts/main_agentdojo.sh for Slurm. For a single manual run you can call python main_agentdojo.py --model ... --defense agentwatcher --monitor_llm ... from the repo root; set PIARENA_DEFENSE / PIARENA_MONITOR_LLM if your wrapper expects them. Defense agentwatcher uses PIMonitorLLMDefenseAdapter.

AgentDyn

./scripts/run_agentdyn.sh

Optional: ./scripts/run_agentdyn.sh --model gpt-4o-2024-08-06 for running agentwatcher with gpt-4o as the backbone LLM. The script activates the agentdojo conda env, and runs from agents/agentdyn/src. We recommend using tmux to prevent the run from being interrupted.

InjecAgent

python scripts/run_injecagent.py

This launches main_injecagent.py and logs under logs/main_logs/.... Slurm entrypoint: scripts/main_injecagent.sh.

Direct invocation without the launcher:

python main_injecagent.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --defense agentwatcher \
  --monitor_llm SecureLLMSys/AgentWatcher-Qwen3-4B-Instruct-2507 \
  --name my_injecagent_run

Acknowledgement

This project incorporates code from PIArena, AgentDojo, AgentDyn, InjecAgent, and AT2.

Citation

If you use this code, please cite the AgentWatcher paper (when available).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AgentWatcher

🔨 Requirements

🤗 Monitor LLM on Hugging Face

🔬 Running experiments

Long-context benchmark (`main.py`)

AgentDojo

AgentDyn

InjecAgent

Acknowledgement

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
agents		agents
assets		assets
datasets		datasets
scripts		scripts
src		src
.gitignore		.gitignore
README.md		README.md
main.py		main.py
main_agentdojo.py		main_agentdojo.py
main_injecagent.py		main_injecagent.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

AgentWatcher

🔨 Requirements

🤗 Monitor LLM on Hugging Face

🔬 Running experiments

Long-context benchmark (main.py)

AgentDojo

AgentDyn

InjecAgent

Acknowledgement

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 0

Languages

Long-context benchmark (`main.py`)

Packages

Contributors