Add per-action health metrics to executor by cmttt · Pull Request #191 · roostorg/osprey

cmttt · 2026-03-24T20:54:46Z

Summary

Adds osprey.action_health metric emitted once per action after execution completes
Adds osprey.action_error_count histogram for actions that had errors (10% sampled in the first commit; the sample rate was removed in b7c9136)
Tags are all bounded: action (~100 values), had_errors (bool), had_unexpected_errors (bool), had_effects (bool)
No behavioral changes — purely additive observability

Why

We need to quantify what % of Osprey actions execute to completion without failures. Currently there is no per-action health signal — failures are only visible at the UDF level, and "spammy" exceptions are suppressed from existing metrics entirely.

What it measures

had_errors: True/False — whether ANY node errors occurred (including ExpectedUdfException)
had_unexpected_errors: True/False — whether non-expected errors occurred (excludes ExpectedUdfException)
had_effects: True/False — whether the action produced any effects (verdicts, labels, etc.)
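The three flags can be sketched as a pure function over the errors and effects collected for one action. This is a minimal illustration, not the executor's code: `ActionHealth`, `summarize_action`, and the local `ExpectedUdfException` stand-in are hypothetical names, and the real exception class lives in the Osprey codebase.

```python
from dataclasses import dataclass
from typing import Iterable


class ExpectedUdfException(Exception):
    """Local stand-in for Osprey's expected-error type (assumption for this sketch)."""


@dataclass(frozen=True)
class ActionHealth:
    had_errors: bool
    had_unexpected_errors: bool
    had_effects: bool


def summarize_action(
    errors: Iterable[BaseException], effects: Iterable[object]
) -> ActionHealth:
    """Derive the three health flags for a single executed action."""
    error_list = list(errors)
    effect_list = list(effects)
    return ActionHealth(
        # Any node error counts, including expected ones.
        had_errors=len(error_list) > 0,
        # Only errors that are not ExpectedUdfException count here.
        had_unexpected_errors=any(
            not isinstance(e, ExpectedUdfException) for e in error_list
        ),
        # Verdicts, labels, etc. are all treated as effects.
        had_effects=len(effect_list) > 0,
    )
```

Under these definitions `had_unexpected_errors` can only be true when `had_errors` is also true.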

Key queries

// % of actions with any failure
sum:osprey.action_health{had_errors:true} by {action}.as_count()
/ sum:osprey.action_health{*} by {action}.as_count() * 100

// % of actions with zero effects (potential enforcement failure)
sum:osprey.action_health{had_effects:false} by {action}.as_count()
/ sum:osprey.action_health{*} by {action}.as_count() * 100
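For reference, the arithmetic both queries perform is a ratio of two counts scaled to a percentage. The sketch below reproduces it offline; the action name and the counts are made up for illustration, and the zero-denominator guard is an addition of this sketch.

```python
def percent(numerator: float, denominator: float) -> float:
    """Return numerator / denominator * 100, or 0.0 when nothing was counted."""
    if denominator == 0:
        return 0.0
    return numerator / denominator * 100


# Hypothetical per-action counts over some time window.
counts = {
    "example_action": {"total": 2000, "had_errors": 50, "no_effects": 400},
}

for action, c in counts.items():
    error_pct = percent(c["had_errors"], c["total"])
    no_effect_pct = percent(c["no_effects"], c["total"])
    print(f"{action}: {error_pct:.1f}% with errors, {no_effect_pct:.1f}% with no effects")
```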

Test plan

Verify syntax validity (done — ast.parse passes)
Run full test suite via Docker (./run-tests.sh)
Deploy to staging and verify metrics appear in Datadog
Monitor cardinality for 24h — estimated ~800 series
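The ~800 figure matches the product of the tag value counts for `osprey.action_health`: roughly 100 actions times 2 x 2 x 2 boolean combinations. The sketch below works that out. It also counts the combinations that seem reachable if, as the definitions above suggest, `had_unexpected_errors` implies `had_errors`; that implication is an inference, not something the PR states as a guarantee.

```python
from itertools import product

ESTIMATED_ACTIONS = 100  # "~100 values" from the PR summary

# Tag order: (had_errors, had_unexpected_errors, had_effects)
combos = list(product([False, True], repeat=3))
upper_bound = ESTIMATED_ACTIONS * len(combos)

# Assumption: an unexpected error is also an error, so
# had_unexpected_errors=True with had_errors=False should never be emitted.
reachable = [
    (had_errors, had_unexpected, had_effects)
    for had_errors, had_unexpected, had_effects in combos
    if had_errors or not had_unexpected
]
reachable_bound = ESTIMATED_ACTIONS * len(reachable)

print(f"upper bound: {upper_bound} series")
print(f"reachable under the assumption: {reachable_bound} series")
```

The upper bound is the safer number to alert on, since it holds even if the implication is ever violated.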

Emit `osprey.action_health` (increment) after each action execution completes, tracking whether the action had errors, unexpected errors, and whether it produced any effects. Also emit `osprey.action_error_count` (histogram, 10% sampled) for actions with errors to capture error count distribution. Tags are all bounded: action (~100), had_errors (bool), had_unexpected_errors (bool), had_effects (bool).
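A rough sketch of what the emit described above might look like, for readers without the diff open. `RecordingMetrics` and `emit_action_health` are hypothetical names, the client is an in-memory stand-in rather than the metrics client the executor uses, and tagging the histogram with only `action` is an assumption. The lowercase `true`/`false` tag values follow the queries in the PR description, and the histogram is emitted unsampled, per the follow-up commit.

```python
from typing import List, Tuple


class RecordingMetrics:
    """Hypothetical in-memory stand-in for a StatsD-style metrics client."""

    def __init__(self) -> None:
        self.increments: List[Tuple[str, Tuple[str, ...]]] = []
        self.histograms: List[Tuple[str, float, Tuple[str, ...]]] = []

    def increment(self, name: str, tags: List[str]) -> None:
        self.increments.append((name, tuple(tags)))

    def histogram(self, name: str, value: float, tags: List[str]) -> None:
        self.histograms.append((name, value, tuple(tags)))


def emit_action_health(
    metrics: RecordingMetrics,
    action: str,
    error_count: int,
    unexpected_error_count: int,
    had_effects: bool,
) -> None:
    """Emit one health increment per action, plus an error-count histogram on errors."""
    tags = [
        f"action:{action}",
        f"had_errors:{str(error_count > 0).lower()}",
        f"had_unexpected_errors:{str(unexpected_error_count > 0).lower()}",
        f"had_effects:{str(had_effects).lower()}",
    ]
    metrics.increment("osprey.action_health", tags)
    if error_count > 0:
        # Assumption: the histogram carries only the action tag.
        metrics.histogram("osprey.action_error_count", error_count, [f"action:{action}"])
```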

cmttt requested review from a team, EXBreder, ayubun, haileyok and vinaysrao1 as code owners March 24, 2026 20:54

Remove unnecessary sample_rate on action_error_count histogram

b7c9136

One emit per action — no need to sample.

EXBreder approved these changes Mar 24, 2026

cmttt merged commit bd2fe92 into main Mar 24, 2026
11 checks passed

cmttt mentioned this pull request Mar 25, 2026

Use _is_spammy_exception for action_health metric filtering #193

Merged

3 tasks

cmttt merged 2 commits into main from ls/action-health-metrics
