
Add per-action health metrics to executor #191

Merged
cmttt merged 2 commits into main from ls/action-health-metrics
Mar 24, 2026

cmttt commented Mar 24, 2026

Summary

  • Adds osprey.action_health metric emitted once per action after execution completes
  • Adds osprey.action_error_count histogram for actions that had errors (10% sampled)
  • Tags are all bounded: action (~100 values), had_errors (bool), had_unexpected_errors (bool), had_effects (bool)
  • No behavioral changes — purely additive observability

Why

We need to quantify what percentage of Osprey actions execute to completion without failures. Currently there is no per-action health signal: failures are only visible at the UDF level, and "spammy" exceptions are suppressed from existing metrics entirely.

What it measures

  • had_errors: True/False — whether ANY node errors occurred (including ExpectedUdfException)
  • had_unexpected_errors: True/False — whether non-expected errors occurred (excludes ExpectedUdfException)
  • had_effects: True/False — whether the action produced any effects (verdicts, labels, etc.)
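
To make the three tags concrete, here is a minimal sketch of how they could be derived from an action's results and emitted once per action. It is not the code from this PR: `RecordingMetrics`, `health_tags`, `emit_action_health`, and the local `ExpectedUdfException` class are hypothetical stand-ins, and the lowercase `true`/`false` tag values are an assumption chosen to match the queries in this description.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


class ExpectedUdfException(Exception):
    """Local stand-in for the expected-error type named in this PR."""


@dataclass
class RecordingMetrics:
    """Hypothetical in-memory sink; a real executor would use its metrics client."""

    increments: List[Tuple[str, List[str]]] = field(default_factory=list)

    def increment(self, name: str, tags: Sequence[str]) -> None:
        self.increments.append((name, list(tags)))


def health_tags(action: str, errors: Sequence[Exception], effect_count: int) -> List[str]:
    # had_errors counts every node error, including expected ones.
    had_errors = len(errors) > 0
    # had_unexpected_errors ignores ExpectedUdfException instances.
    had_unexpected = any(not isinstance(e, ExpectedUdfException) for e in errors)
    # had_effects is true when the action produced at least one effect.
    had_effects = effect_count > 0
    return [
        f"action:{action}",
        f"had_errors:{str(had_errors).lower()}",
        f"had_unexpected_errors:{str(had_unexpected).lower()}",
        f"had_effects:{str(had_effects).lower()}",
    ]


def emit_action_health(
    metrics: RecordingMetrics,
    action: str,
    errors: Sequence[Exception],
    effect_count: int,
) -> None:
    # One increment per action, after execution completes.
    metrics.increment("osprey.action_health", health_tags(action, errors, effect_count))
```

Under these assumptions, an action that raised only an `ExpectedUdfException` and produced no effects would be tagged `had_errors:true`, `had_unexpected_errors:false`, `had_effects:false`.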

Key queries

// % of actions with any failure
sum:osprey.action_health{had_errors:true} by {action}.as_count()
/ sum:osprey.action_health{*} by {action}.as_count() * 100

// % of actions with zero effects (potential enforcement failure)
sum:osprey.action_health{had_effects:false} by {action}.as_count()
/ sum:osprey.action_health{*} by {action}.as_count() * 100

Test plan

  • Verify syntax validity (done — ast.parse passes)
  • Run full test suite via Docker (./run-tests.sh)
  • Deploy to staging and verify metrics appear in Datadog
  • Monitor cardinality for 24h — estimated ~800 series
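
The ~800 figure is consistent with roughly 100 actions times three boolean tags. The sketch below reproduces that arithmetic; the action count is the approximate value quoted in this PR. The "reachable" count is an inference from the tag definitions (unexpected errors are a subset of all errors), not a number stated in the PR.

```python
from itertools import product

APPROX_ACTIONS = 100  # approximate number of distinct action values, per the PR

# Every combination of (had_errors, had_unexpected_errors, had_effects).
combos = list(product([False, True], repeat=3))
upper_bound = APPROX_ACTIONS * len(combos)

# If unexpected errors always imply had_errors, combinations with
# had_unexpected_errors=True and had_errors=False should never be emitted.
reachable = [c for c in combos if not (c[1] and not c[0])]
reachable_bound = APPROX_ACTIONS * len(reachable)

print(upper_bound, reachable_bound)
```

The first number is the worst-case series count for `osprey.action_health`. If the subset assumption holds, the observed count during the 24h monitoring window should sit at or below the second.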

Emit `osprey.action_health` (increment) after each action execution
completes, tracking whether the action had errors, unexpected errors,
and whether it produced any effects. Also emit
`osprey.action_error_count` (histogram, 10% sampled) for actions
with errors to capture error count distribution.

Tags are all bounded: action (~100), had_errors (bool),
had_unexpected_errors (bool), had_effects (bool).
cmttt requested review from a team, EXBreder, ayubun, haileyok and vinaysrao1 as code owners March 24, 2026 20:54
One emit per action — no need to sample.
cmttt merged commit bd2fe92 into main Mar 24, 2026
11 checks passed