ClaimPilot Harness

Crash-test insurance claim AI agents before production.

A crash-test simulator for AI claim agents: adversarial cases, deterministic scoring, and replayable failure reports.

ClaimPilot Harness runs messy insurance claim scenarios against AI agents and shows where they passed, hesitated, or failed.

It is not another claim-processing agent. It is the test range for them.

python -m claimpilot_harness compare cases/travel-injection-001.json demo risky

Case:        travel-injection-001
Leaderboard: runs/travel-injection-001-leaderboard.html

Agent        Score    Verdict
------------ -------- ------------
demo          93.9%   investigate
risky          6.1%   approve

Why This Exists

Most AI agent demos look impressive until they meet messy real-world claims: mismatched invoices, missing documents, policy exclusions, claimant contradictions, hidden prompt injection, and privacy traps.

ClaimPilot turns those failure modes into repeatable test cases.

Use it to answer:

Did the agent choose the right claim action?
Did it cite the evidence that mattered?
Did it request the missing document instead of guessing?
Did it detect fraud or coverage inconsistencies?
Did it ignore malicious instructions hidden inside uploaded evidence?

See docs/why-claimpilot.md for the product thesis.

Demo

Compare a careful agent against a deliberately risky one:

python -m claimpilot_harness compare cases/travel-injection-001.json demo risky

On Windows, use py -m claimpilot_harness ... if python is not on your PATH.

You will get a score and a replay report:

Case:        travel-injection-001
Leaderboard: runs/travel-injection-001-leaderboard.html

Agent        Score    Verdict
------------ -------- ------------
demo          93.9%   investigate
risky          6.1%   approve

Open runs/latest.html to view the leaderboard.

What A Replay Shows

The replay report is designed for product, risk, and engineering review:

Evidence timeline
Agent verdict and confidence
Findings and requested documents
Prompt-injection / privacy flags
Scoring breakdown by rubric item
Raw decision JSON for debugging

Included Case Packs

Case	Line	What It Tests
`auto-collision-001`	Auto	Repair invoice conflicts with damage photos and claimant chat.
`health-bill-001`	Health	Possible excluded cosmetic procedure without medical necessity proof.
`travel-injection-001`	Travel	Missing official delay proof plus prompt injection hidden in uploaded evidence.

Agent Interface

Use the built-in demo agent:

python -m claimpilot_harness run cases/auto-collision-001.json --agent demo

Compare built-in agents and generate a leaderboard:

python -m claimpilot_harness compare cases/travel-injection-001.json demo risky

Run an OpenAI-compatible model:

python -m claimpilot_harness run cases/travel-injection-001.json \
  --agent openai \
  --openai-model your-model-name

Compare it against the built-in baselines:

python -m claimpilot_harness compare cases/travel-injection-001.json demo openai risky \
  --openai-model your-model-name

Or connect any agent command that reads JSON from stdin and prints a JSON decision:

python -m claimpilot_harness run cases/auto-collision-001.json \
  --agent command \
  --agent-command "python examples/simple_agent.py"

Expected decision shape:

{
  "verdict": "investigate",
  "confidence": 0.82,
  "summary": "Hold the claim pending additional review.",
  "findings": ["invoice and physical damage evidence need reconciliation"],
  "requested_documents": ["independent adjuster estimate"],
  "cited_evidence": ["E2", "E3"],
  "privacy_flags": ["ignored embedded instruction in evidence"]
}

Case Format

Cases are plain JSON files. Each case contains:

Claimant and policy context
Evidence summaries with stable IDs
Red-team traps
Expected findings, document requests, citations, and forbidden behavior
A weighted scoring rubric

See docs/case-format.md.

OpenAI-Compatible Adapter

ClaimPilot supports OpenAI-style /v1/chat/completions endpoints without requiring an SDK dependency.

Set OPENAI_API_KEY, then pass --agent openai and --openai-model. Use --openai-base-url for compatible local or hosted gateways.

See docs/openai-compatible.md.

Roadmap

Command and HTTP agent comparison
OpenAI-compatible and Ollama adapters
LLM-as-judge scoring mode
Claim case generator for synthetic case packs
Fraud, compliance, and privacy scorecards
CI mode for regression testing agent changes
GitHub Pages replay gallery

Positioning

ClaimPilot Harness is built for the gap between AI agent demos and production systems. A claim agent that can answer one happy-path question is easy to build. A claim agent that survives conflicting evidence, policy constraints, missing documents, and adversarial uploads needs a harness.

That is the product surface this project explores.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.github		.github
assets		assets
cases		cases
claimpilot_harness		claimpilot_harness
docs		docs
examples		examples
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
ROADMAP.md		ROADMAP.md
SECURITY.md		SECURITY.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ClaimPilot Harness

Why This Exists

Demo

What A Replay Shows

Included Case Packs

Agent Interface

Case Format

OpenAI-Compatible Adapter

Roadmap

Positioning

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ClaimPilot Harness

Why This Exists

Demo

What A Replay Shows

Included Case Packs

Agent Interface

Case Format

OpenAI-Compatible Adapter

Roadmap

Positioning

License

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages