Shadow Evaluation + Outcome Audit for AI Decisions

Every decision is recorded as four things:

what your system chose (π_E)
what a risk-aware policy would have chosen (π_S)
where they diverge
what actually happened later (outcome)

This is a decision accountability layer that runs alongside any model, system, or evaluation pipeline.

It does not replace your model. It makes its decisions measurable.

What this is

A lightweight system that turns AI or human decisions into a closed feedback loop:

input decision
   ↓
π_E (existing system output)
   ↓
π_S (shadow risk policy: CVaR + irreversibility model)
   ↓
divergence detection
   ↓
real-world outcome logging

Over time, you get:

where your system is too risky
where it is too conservative
where evaluation scores fail to predict real cost
where "correct-looking" decisions fail in reality
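The four signals above can be read off a per-case record once the outcome is known. The following is a minimal sketch of such a record, not the project's actual schema: the class `DecisionRecord`, its field names, and the labels returned by `classify` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionRecord:
    """One audited decision. Field names are illustrative assumptions."""
    case_id: str
    pi_E: str                           # option the existing system chose
    pi_S: str                           # option the shadow risk policy would have chosen
    realized: Optional[str] = None      # "success" or "failure", filled in later
    cost_actual: Optional[float] = None

    @property
    def diverged(self) -> bool:
        """True when the existing system and the shadow policy disagree."""
        return self.pi_E != self.pi_S


def classify(record: DecisionRecord) -> str:
    """Label a case by agreement and outcome (hypothetical label names)."""
    if record.realized is None:
        return "pending"
    agreement = "diverged" if record.diverged else "agreed"
    return f"{agreement}_{record.realized}"
```

Under this sketch, an `agreed_failure` case is a "correct-looking" decision that failed in reality, and a `diverged_failure` case is one where the shadow policy flagged a risk the existing system accepted.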

Why it exists

Most systems only do one thing:

optimize a score

This system asks a different question:

what happens if that decision is wrong?

It exposes:

hidden tail risk
irreversibility blind spots
cost underestimation
evaluation-function overconfidence

Core components

1. π_S Risk Engine (frozen)

CVaR-based downside estimation
irreversibility weighting
cost-aware thresholding
deterministic decision boundary
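To make the four ingredients concrete, here is a small sketch of how CVaR, an irreversibility weight, and a cost threshold could combine into a deterministic choice. It is not the frozen π_S engine itself, whose internals are not shown here. The function names, the `(1 + weight * irreversibility)` scaling, and the input format (sampled losses per option) are assumptions.

```python
import math


def cvar(losses, alpha=0.95):
    """Conditional value at risk: mean of the worst (1 - alpha) share of losses."""
    if not losses:
        raise ValueError("losses must be non-empty")
    ordered = sorted(losses, reverse=True)  # worst losses first
    # round() guards against float noise such as 20 * (1 - 0.95) = 1.0000000000000009
    k = max(1, math.ceil(round(len(ordered) * (1 - alpha), 9)))
    return sum(ordered[:k]) / k


def risk_score(losses, irreversibility, alpha=0.95, weight=1.0):
    """Tail loss scaled up by how hard the decision is to undo (0.0 to 1.0).

    The multiplicative form is an assumption chosen for illustration.
    """
    return cvar(losses, alpha) * (1.0 + weight * irreversibility)


def shadow_choice(options, threshold):
    """Return the lowest-risk option name, or None if every option exceeds threshold.

    options maps a name to {"losses": [...], "irreversibility": float}.
    Sorting (score, name) pairs breaks ties by name, so the result is deterministic.
    """
    scored = sorted(
        (risk_score(opt["losses"], opt["irreversibility"]), name)
        for name, opt in options.items()
    )
    best_score, best_name = scored[0]
    return best_name if best_score <= threshold else None
```

Returning `None` when nothing clears the threshold models a "hold" decision; whether the real engine has such a state is not stated in this document.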

2. Shadow Mode

Runs alongside any system:

compares π_E vs π_S
logs divergence
does not interfere with execution
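The non-interference property can be sketched as a wrapper that always returns the existing system's decision and treats the shadow policy as a best-effort side computation. This is an illustration only: `run_with_shadow`, its parameters, and the log entry fields are hypothetical; only the file name `shadow_log.jsonl` comes from the configuration section of this README.

```python
import json
import time


def run_with_shadow(case_id, inputs, primary_policy, shadow_policy,
                    log_path="shadow_log.jsonl"):
    """Run the existing policy, evaluate the shadow policy, log, return pi_E."""
    decision = primary_policy(inputs)      # this is what actually executes
    try:
        shadow = shadow_policy(inputs)
        error = None
    except Exception as exc:               # a shadow failure must never block execution
        shadow, error = None, repr(exc)

    entry = {
        "case_id": case_id,
        "pi_E": decision,                  # assumed to be JSON-serializable
        "pi_S": shadow,
        "diverged": None if error else decision != shadow,
        "shadow_error": error,
        "ts": time.time(),
    }
    # Append-only JSONL keeps earlier entries untouched.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return decision
```

The return value never depends on the shadow policy, so a bug or exception in π_S shows up in the log rather than in production behavior.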

3. Outcome Capture

When the real-world outcome is known:

logs actual cost
compares predicted vs realized impact
preserves full decision context (immutable)

4. Audit Trail

Every case becomes:

decision → shadow evaluation → outcome → calibration error

Each case is useful on its own; no aggregation is required.
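As a rough sketch of the last step in that chain, the function below joins shadow evaluations to logged outcomes by `case_id` and computes a signed calibration error per case. The field `predicted_cost` and the row layout are assumptions; `case_id`, `realized`, and `cost_actual` mirror the names used in the SDK example in this README.

```python
def audit_rows(shadow_entries, outcomes):
    """Build one audit row per case that has both a shadow entry and an outcome.

    calibration_error = cost_actual - predicted_cost, so a positive value
    means the cost was underestimated. It is None when either side is missing.
    """
    outcome_by_case = {o["case_id"]: o for o in outcomes}
    rows = []
    for entry in shadow_entries:
        outcome = outcome_by_case.get(entry["case_id"])
        if outcome is None:
            continue  # outcome not logged yet; skip until it arrives
        predicted = entry.get("predicted_cost")  # hypothetical field name
        actual = outcome.get("cost_actual")
        error = None
        if predicted is not None and actual is not None:
            error = actual - predicted
        rows.append({
            "case_id": entry["case_id"],
            "pi_E": entry.get("pi_E"),
            "pi_S": entry.get("pi_S"),
            "realized": outcome.get("realized"),
            "calibration_error": error,
        })
    return rows
```

Each row can be inspected individually, which matches the point that a single case is already informative.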

Minimal example

from sdk import RiskAuditClient

client = RiskAuditClient(
    base_url="http://localhost:8000",
    api_key="your-key"
)

# your system's decision
result = client.evaluate(
    case_id="deploy-model-v3",
    context="production model upgrade",
    eval_scores={"v2": 0.81, "v3": 0.84}
)

print(result)

Later, once the outcome is known:

client.log_outcome(
    case_id="deploy-model-v3",
    realized="failure",
    cost_actual=2500000
)

Now you can measure:

what was chosen vs what should have been chosen under risk

What you get

Decision traceability
Risk boundary visibility
Real-world calibration error
Divergence signals (π_E vs π_S)
Fault probes for failure classes

Deployment

Docker

docker compose up --build

API

POST /evaluate
POST /outcome
GET /audit

SDK

Single-file client (sdk.py) — no dependencies.

What this is NOT

not a scoring model
not an eval benchmark
not a fine-tuning tool
not a replacement for your system

What this actually is

A measurement layer for decision systems that finally connects prediction to consequence.

If you only understand one thing:

This system does not try to be correct.

It tries to make incorrectness visible before it becomes expensive.

Quick reference

Authentication

# Enable API keys (comma-separated)
EVAL_CONTROL_API_KEYS="key1,key2" docker compose up --build

# curl with auth
curl -X POST http://localhost:8000/evaluate \
  -H "Content-Type: application/json" \
  -H "X-API-Key: key1" \
  -d '{"case_id":"X","eval_scores":{"a":0.8,"b":0.9},"pi_E":"b"}'

If EVAL_CONTROL_API_KEYS is empty or unset, authentication is disabled (intended for local development).
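For callers who prefer Python over curl, the same request can be built with the standard library. This sketch mirrors the curl call above (URL, headers, and JSON fields are taken from it); the helper name `build_evaluate_request` is hypothetical, and the shape of the server's response is not assumed.

```python
import json
import urllib.request


def build_evaluate_request(base_url, api_key, case_id, eval_scores, pi_E):
    """Build (but do not send) a POST /evaluate request."""
    payload = {"case_id": case_id, "eval_scores": eval_scores, "pi_E": pi_E}
    headers = {"Content-Type": "application/json"}
    if api_key:                       # omit the header when auth is disabled
        headers["X-API-Key"] = api_key
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

With the server running, the request can be sent with `urllib.request.urlopen(req)` and the body read from the returned response object.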

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| `EVAL_CONTROL_API_KEYS` | `""` | Comma-separated API keys. Empty = no auth. |
| `EVAL_CONTROL_LOG_DIR` | `.` | Directory for shadow_log.jsonl and outcomes.jsonl. |
| `EVAL_CONTROL_PORT` | `8000` | Server port. |
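The semantics in the table can be sketched as follows. This shows how the three variables and their defaults might be read, and how "empty means no auth" could be enforced; it is not the server's actual code, and `load_config` and `is_authorized` are hypothetical names.

```python
import os


def load_config(env=None):
    """Read the three documented variables, applying the documented defaults."""
    env = os.environ if env is None else env
    raw_keys = env.get("EVAL_CONTROL_API_KEYS", "")
    # Split on commas and drop blanks so "key1, key2," still yields two keys.
    api_keys = {k.strip() for k in raw_keys.split(",") if k.strip()}
    return {
        "api_keys": api_keys,
        "auth_enabled": bool(api_keys),
        "log_dir": env.get("EVAL_CONTROL_LOG_DIR", "."),
        "port": int(env.get("EVAL_CONTROL_PORT", "8000")),
    }


def is_authorized(config, presented_key):
    """Allow every request when auth is disabled, else require a known key."""
    if not config["auth_enabled"]:
        return True
    return presented_key in config["api_keys"]
```

Passing a plain dict as `env` makes the behavior easy to check without touching the real environment.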

Full pipeline demo

python demo.py

CLI

python shadow_mode.py                    # Interactive REPL
python shadow_mode.py --dry-run           # Replay 20 regression cases
python shadow_mode.py --file cases.jsonl  # Batch mode
python outcome_capture.py log <id> --realized <success|failure> --notes "..."
python outcome_capture.py show
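Batch mode reads a JSONL file, one case per line. The exact schema expected by `shadow_mode.py` is not documented here, so the sketch below only shows the JSONL mechanics and assumes each case uses the same fields as the `/evaluate` payload (`case_id`, `eval_scores`, `pi_E`). The helper names are hypothetical.

```python
import json


def write_cases(path, cases):
    """Write one JSON object per line; returns the number of cases written."""
    with open(path, "w", encoding="utf-8") as fh:
        for case in cases:
            fh.write(json.dumps(case, sort_keys=True) + "\n")
    return len(cases)


def read_cases(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

A file produced this way could then be passed via `--file cases.jsonl`, provided the assumed fields match what the tool expects.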

Local install (no Docker)

git clone https://github.com/wpydesign/eval-control.git
cd eval-control
pip install fastapi uvicorn
uvicorn api:app --reload --port 8000

License

MIT

Commercial Use

If you are using eval-control in a production or commercial environment, I'd appreciate you reaching out.

Contact: wpydesign@gmail.com
