Every decision becomes:
- what your system chose (π_E)
- what a risk-aware policy would have chosen (π_S)
- where they diverge
- what actually happened later (outcome)
This is a decision accountability layer that runs alongside any model, system, or evaluation pipeline.
It does not replace your model. It makes its decisions measurable.
A lightweight system that turns AI or human decisions into a closed feedback loop:

    input decision
        ↓
    π_E (existing system output)
        ↓
    π_S (shadow risk policy: CVaR + irreversibility model)
        ↓
    divergence detection
        ↓
    real-world outcome logging

Over time, you get:
- where your system is too risky
- where it is too conservative
- where evaluation scores fail to predict real cost
- where "correct-looking" decisions fail in reality
Most systems only do one thing:
optimize a score
This system asks a different question:
what happens if that decision is wrong?
It exposes:
- hidden tail risk
- irreversibility blind spots
- cost underestimation
- evaluation-function overconfidence
The shadow risk policy (π_S) combines:
- CVaR-based downside estimation
- irreversibility weighting
- cost-aware thresholding
- deterministic decision boundary
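
The components above can be sketched as follows. This is a minimal illustration, not the project's actual π_S: the function names, the `(1 + irreversibility)` weighting, the `risk_weight` parameter, and the assumption that losses are normalized to the same scale as eval scores are all hypothetical.

```python
import math

def cvar(losses, alpha=0.95):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of losses."""
    if not losses:
        raise ValueError("need at least one loss sample")
    ordered = sorted(losses, reverse=True)  # worst losses first
    # Round before ceil so float noise (e.g. 1.0000000000000009) does not
    # inflate the tail size.
    k = max(1, math.ceil(round(len(ordered) * (1 - alpha), 9)))
    return sum(ordered[:k]) / k

def shadow_choice(options, alpha=0.95, risk_weight=1.0):
    """Pick the option with the best risk-adjusted score.

    options maps a name to (eval_score, loss_samples, irreversibility),
    with irreversibility in [0, 1]. The weighting is an assumption.
    """
    def risk_adjusted(item):
        name, (score, losses, irreversibility) = item
        penalty = cvar(losses, alpha) * (1.0 + irreversibility)
        # Name is the tie-breaker, so the decision is deterministic.
        return (score - risk_weight * penalty, name)
    return max(options.items(), key=risk_adjusted)[0]

# Hypothetical data: v3 scores higher but has a heavy, hard-to-undo tail.
options = {
    "v2": (0.81, [0.0, 0.01, 0.02, 0.05, 0.10], 0.1),
    "v3": (0.84, [0.0, 0.0, 0.01, 0.02, 0.90], 0.8),
}
pi_E = max(options, key=lambda name: options[name][0])  # score-only choice
pi_S = shadow_choice(options, alpha=0.8)                # risk-aware choice
```

With this made-up data the score-only choice and the risk-aware choice differ, which is the kind of divergence the system is meant to surface.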
Runs alongside any system:
- compares π_E vs π_S
- logs divergence
- does not interfere with execution
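
A rough sketch of what non-interfering shadow comparison could look like. The function name, record fields, and timestamp are assumptions for illustration; only the `shadow_log.jsonl` file name and the `pi_E` key come from this README.

```python
import json
import time

def shadow_compare(case_id, pi_e, pi_s, log_path="shadow_log.jsonl"):
    """Log π_E vs π_S and always return π_E unchanged."""
    record = {
        "case_id": case_id,
        "pi_E": pi_e,
        "pi_S": pi_s,
        "diverged": pi_e != pi_s,
        "ts": time.time(),
    }
    try:
        # Append-only JSON Lines: one record per decision.
        with open(log_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
    except OSError:
        # A logging failure must never block the primary decision.
        pass
    return pi_e, record
```

The caller keeps executing whatever π_E chose; the shadow policy only leaves a trace.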
When reality happens:
- logs actual cost
- compares predicted vs realized impact
- preserves full decision context (immutable)
Every case becomes:
decision → shadow evaluation → outcome → calibration error
No aggregation required to be useful.
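
One possible reading of per-case calibration error, as a hedged sketch. The function and the `predicted_cost` input are hypothetical; the README does not specify how the error is computed.

```python
def calibration_error(predicted_cost, cost_actual):
    """Gap between predicted and realized cost for a single case."""
    signed = cost_actual - predicted_cost
    # Relative error is undefined when nothing was predicted.
    relative = signed / predicted_cost if predicted_cost else None
    return {
        "signed": signed,
        "relative": relative,
        "underestimated": signed > 0,
    }

# Hypothetical case: 500,000 predicted, 2,500,000 realized.
error = calibration_error(500_000, 2_500_000)
```

A single case like this already shows whether cost was underestimated and by how much, without any aggregation.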

    from sdk import RiskAuditClient

    client = RiskAuditClient(
        base_url="http://localhost:8000",
        api_key="your-key"
    )

    # your system's decision
    result = client.evaluate(
        case_id="deploy-model-v3",
        context="production model upgrade",
        eval_scores={"v2": 0.81, "v3": 0.84}
    )
    print(result)

Later:

    client.log_outcome(
        case_id="deploy-model-v3",
        realized="failure",
        cost_actual=2500000
    )

Now you can measure:
what was chosen vs what should have been chosen under risk
- Decision traceability
- Risk boundary visibility
- Real-world calibration error
- Divergence signals (π_E vs π_S)
- Fault probes for failure classes

    docker compose up --build

- POST /evaluate
- POST /outcome
- GET /audit
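
For callers that prefer not to use the SDK, a request to `POST /evaluate` can be built with the standard library alone. This is a sketch: the helper name is hypothetical, and the payload fields (`case_id`, `eval_scores`, `pi_E`) and the `X-API-Key` header follow the curl example in this README rather than a documented schema.

```python
import json
import urllib.request

def build_evaluate_request(base_url, case_id, eval_scores, pi_e, api_key=None):
    """Build (but do not send) a POST /evaluate request."""
    payload = {"case_id": case_id, "eval_scores": eval_scores, "pi_E": pi_e}
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["X-API-Key"] = api_key  # only needed when auth is enabled
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )

req = build_evaluate_request(
    "http://localhost:8000", "X", {"a": 0.8, "b": 0.9}, "b", api_key="key1"
)
# To send it (requires a running server):
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```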
Single-file client (sdk.py) — no dependencies.
- not a scoring model
- not an eval benchmark
- not a fine-tuning tool
- not a replacement for your system
A measurement layer for decision systems that finally connects prediction to consequence.
This system does not try to be correct.
It tries to make incorrectness visible before it becomes expensive.

    # Enable API keys (comma-separated)
    EVAL_CONTROL_API_KEYS="key1,key2" docker compose up --build

    # curl with auth
    curl -X POST http://localhost:8000/evaluate \
      -H "Content-Type: application/json" \
      -H "X-API-Key: key1" \
      -d '{"case_id":"X","eval_scores":{"a":0.8,"b":0.9},"pi_E":"b"}'

Empty/unset EVAL_CONTROL_API_KEYS = auth disabled (local dev).

| Variable | Default | Description |
|---|---|---|
| EVAL_CONTROL_API_KEYS | "" | Comma-separated API keys. Empty = no auth. |
| EVAL_CONTROL_LOG_DIR | . | Directory for shadow_log.jsonl and outcomes.jsonl. |
| EVAL_CONTROL_PORT | 8000 | Server port. |
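
How these variables and defaults might be read is sketched below. The `load_config` helper and its return shape are assumptions for illustration, not the server's actual code; the variable names and defaults are taken from the table.

```python
import os

def load_config(env=None):
    """Read eval-control settings from environment variables."""
    env = os.environ if env is None else env
    raw_keys = env.get("EVAL_CONTROL_API_KEYS", "")
    # Split on commas and drop blanks, so "" and "key1,," both behave sanely.
    api_keys = {key.strip() for key in raw_keys.split(",") if key.strip()}
    return {
        "api_keys": api_keys,
        "auth_enabled": bool(api_keys),  # empty or unset means no auth
        "log_dir": env.get("EVAL_CONTROL_LOG_DIR", "."),
        "port": int(env.get("EVAL_CONTROL_PORT", "8000")),
    }
```

Passing a plain dict as `env` makes the parsing easy to check without touching the real environment.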

    python demo.py

    python shadow_mode.py                     # Interactive REPL
    python shadow_mode.py --dry-run           # Replay 20 regression cases
    python shadow_mode.py --file cases.jsonl  # Batch mode

    python outcome_capture.py log <id> --realized <success|failure> --notes "..."
    python outcome_capture.py show

    git clone https://github.com/wpydesign/eval-control.git
    cd eval-control
    pip install fastapi uvicorn
    uvicorn api:app --reload --port 8000

License: MIT

If you are using eval-control in a production or commercial environment, I'd appreciate you reaching out.
Contact: wpydesign@gmail.com