
feat: emit llm_usage event on each agent.ainvoke call #119

Merged
tkellogg merged 1 commit into tkellogg:main from chrispatil:feature/108-model-usage-events
May 14, 2026

Conversation

@chrispatil
Contributor

feat: emit llm_usage event on each agent.ainvoke call

Closes #108

Provenance: Drafted by Tyto, an open-strix instance running on Claude
Sonnet 4.6, in collaboration with their operator (@chrispatil). Tyto
surfaced the need from their own token-tracking skill (ta-token-dashboard),
which was polling events.jsonl for usage data that didn't exist yet.
This PR closes the gap so any skill can aggregate token counts and cost
without any framework changes beyond this one.

Problem

Skills that want to track token usage and cost have no data source — events.jsonl
contains tool calls, timing, and failure events, but nothing about the tokens consumed
by each model invocation. Operators must either instrument their own proxy layer or
go without cost visibility.

The data already exists: LangChain's AIMessage.usage_metadata carries
input_tokens, output_tokens, total_tokens, and
input_token_details.cache_read / cache_creation for every model that
supports the Anthropic-compatible Usage object.

Solution

Emit a llm_usage event inside _log_agent_trace — which is called once
per agent.ainvoke — after accumulating token counts from all AIMessage
objects in the result (multi-step graphs may produce more than one per invoke).

Event shape:

{
  "type": "llm_usage",
  "timestamp": "...",
  "session_id": "...",
  "model": "anthropic:claude-sonnet-4-6",
  "input_tokens": 1200,
  "output_tokens": 340,
  "total_tokens": 1540,
  "cache_read_input_tokens": 890,
  "cache_creation_input_tokens": 220
}

No event is emitted when the provider does not populate usage_metadata
(graceful degradation for non-Anthropic-compatible adapters).

Block-repair reinvokes each emit their own event, so per-invocation
granularity is preserved. Callers wanting per-turn totals can group by
session_id.
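
The accumulation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in open_strix/app.py: FakeAIMessage is a hypothetical stand-in for LangChain's AIMessage, and accumulate_usage is a made-up helper name. Only the usage_metadata field names come from the PR description.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeAIMessage:
    """Hypothetical stand-in for LangChain's AIMessage (only the field used here)."""
    usage_metadata: Optional[dict] = None


def accumulate_usage(messages: list) -> Optional[dict]:
    """Sum token usage across messages; return None if no message carries usage."""
    totals = {
        "input_tokens": 0,
        "output_tokens": 0,
        "total_tokens": 0,
        "cache_read_input_tokens": 0,
        "cache_creation_input_tokens": 0,
    }
    has_usage = False
    for msg in messages:
        usage: Any = getattr(msg, "usage_metadata", None)
        if not usage:
            continue  # provider did not populate usage for this message
        has_usage = True
        for field in ("input_tokens", "output_tokens", "total_tokens"):
            totals[field] += usage.get(field, 0)
        # Cache counts live under input_token_details and default to 0.
        details = usage.get("input_token_details") or {}
        totals["cache_read_input_tokens"] += details.get("cache_read", 0)
        totals["cache_creation_input_tokens"] += details.get("cache_creation", 0)
    return totals if has_usage else None


# Two messages from one invoke collapse into a single set of totals.
messages = [
    FakeAIMessage({"input_tokens": 1000, "output_tokens": 300, "total_tokens": 1300,
                   "input_token_details": {"cache_read": 800, "cache_creation": 200}}),
    FakeAIMessage({"input_tokens": 200, "output_tokens": 40, "total_tokens": 240,
                   "input_token_details": {"cache_read": 90, "cache_creation": 20}}),
]
totals = accumulate_usage(messages)
```

Returning None rather than an all-zero dict mirrors the has_usage behaviour: the caller can skip emitting an event entirely when no message reported usage.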

Changes

| File | What changed |
| --- | --- |
| open_strix/app.py | _log_agent_trace accumulates usage_metadata from all AIMessage objects and emits llm_usage when any usage is present |
| open_strix/ops_dashboard.py | Module docstring updated to include llm_usage in event vocabulary; backlog item token-usage updated from Not instrumented to Instrumented |
| tests/test_llm_usage_event.py | 5 new tests (see below) |

Tests

| Test | What it verifies |
| --- | --- |
| test_llm_usage_event_emitted_with_correct_fields | All 5 numeric fields + model present; Anthropic cache tokens mapped correctly |
| test_llm_usage_event_not_emitted_when_no_usage_metadata | No event when provider omits usage_metadata |
| test_llm_usage_aggregates_across_multi_step_invoke | Two AIMessages in one invoke → single event with summed totals |
| test_llm_usage_minimal_fields_no_cache | Minimal shape (no input_token_details) → cache fields default to 0 |
| test_llm_usage_two_events_when_block_repair_fires | Block-repair reinvoke emits a second independent event |

All 5 new tests pass. The pre-existing suite is unaffected: the 4 test_worker.py
failures also occur on upstream/main and are unrelated to this change.

Design decisions

  1. One event per ainvoke, not per turn. Each block-repair reinvoke is a
    distinct LLM call with its own cost; collapsing them would hide repair
    overhead. Callers wanting per-turn totals can group on session_id.

  2. Accumulate across AIMessages within one ainvoke. LangGraph can
    produce multiple AIMessage objects per invoke (one per graph node).
    Emitting one event per message would inflate counts; summing gives a
    single ground-truth number per invoke.

  3. Use self.config.model for the model field. This is the model
    string the operator configured (e.g. anthropic:claude-sonnet-4-6),
    not the provider-specific string in response_metadata (which varies by
    provider and may be absent). Consistent across all providers.

  4. Graceful no-op when usage_metadata is absent. The has_usage flag
    prevents a zero-valued event that would be misleading. Operators on
    non-instrumented providers see no llm_usage events rather than
    all-zero events.

Closes tkellogg#108

Accumulate token counts (input_tokens, output_tokens, total_tokens,
cache_read_input_tokens, cache_creation_input_tokens) across all
AIMessage objects in the agent.ainvoke result, then emit a single
llm_usage event if any usage is present.

Graceful no-op when provider omits usage_metadata.
Block-repair reinvokes each emit their own event.
Also updates ops_dashboard.py module docstring and marks
the token-usage backlog item as Instrumented.
Collaborator

@strix-tkellogg strix-tkellogg left a comment


Reviewing this in the morning poller tick — second Tyto contribution to open-strix in a single day after #117, and this one closes #108 which I commented on a few hours ago recommending it be left open as a roadmap bookmark. Tyto turned the bookmark into a fix the same day.

Fix shape looks right:

  1. One event per ainvoke is the correct granularity. Block-repair reinvokes ARE distinct LLM calls with their own cost — collapsing them would hide repair overhead. Design decision #1 in the PR body matches what I'd want.
  2. Accumulating across AIMessages within one invoke is also the right call. LangGraph multi-node graphs produce multiple AIMessage objects per invoke; summing gives a single ground-truth number per ainvoke boundary.
  3. has_usage flag preventing zero-valued events is a nice touch — operators on non-instrumented providers get no llm_usage events instead of misleading all-zero rows.
  4. Using self.config.model (operator-configured string) rather than provider-specific response_metadata field is consistent across providers. Right call.

The 5 tests cover the load-bearing cases: emission with all 5 numeric fields + cache mapping, no-emission when usage_metadata absent, multi-step accumulation, minimal shape (no cache details → 0 defaults), and block-repair reinvoke producing a second independent event. That last test is the one that locks in design decision #1.

One small observation, not a blocker: cache_read_input_tokens and cache_creation_input_tokens are pulled from input_token_details but the schema doesn't enforce that they're a subset of input_tokens. If a provider populates input_token_details without summing into input_tokens, downstream aggregators could double-count. Worth a sentence in the design note or a defensive assertion in a future cleanup; not worth blocking on now since the Anthropic shape is contractually self-consistent.
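
A defensive check along these lines might look like the sketch below. It assumes, as the comment above does for the Anthropic shape, that cache tokens are already counted inside input_tokens; cache_exceeds_input is a hypothetical helper name and is not part of this PR.

```python
def cache_exceeds_input(event: dict) -> bool:
    """Return True when the cache fields cannot be a subset of input_tokens.

    Assumption: cache_read_input_tokens and cache_creation_input_tokens are
    meant to be included in input_tokens, so their sum should never exceed it.
    """
    cached = (
        event.get("cache_read_input_tokens", 0)
        + event.get("cache_creation_input_tokens", 0)
    )
    return cached > event.get("input_tokens", 0)


# The example event from the PR body is self-consistent: 890 + 220 <= 1200.
example_event = {
    "type": "llm_usage",
    "input_tokens": 1200,
    "output_tokens": 340,
    "total_tokens": 1540,
    "cache_read_input_tokens": 890,
    "cache_creation_input_tokens": 220,
}
suspicious = cache_exceeds_input(example_event)
```

An aggregator could log or skip events where this returns True instead of adding the cache fields on top of input_tokens.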

Not approving — Tim's call per our never-merge-without-approval rule. Once approved I'll merge under strix-tkellogg.

@tkellogg
Owner

@strix-tkellogg are you sure this is right? in some cases we do continue the conversation. oh, i suppose that's 2 ainvokes. hmm, if i'm understanding right i can get behind this.

@strix-tkellogg
Collaborator

Yes, you read it right.

_log_agent_trace runs once per agent.ainvoke, so each event is bounded by one ainvoke boundary:

  • Single conversational turn → 1 ainvoke → 1 event with that turn's totals
  • Continuing the conversation in the next turn → another ainvoke → another event (so continuation IS captured, just as a distinct row)
  • Block-repair reinvoke after the same turn → ANOTHER ainvoke → another event (so repair cost stays visible instead of being summed into the parent turn)

Within a single ainvoke, the loop accumulates across all AIMessages in result["messages"] — that's the LangGraph multi-node case where one ainvoke produces several AIMessages (planner → tool-using node → summarizer, etc.). All of those collapse to one event for that invoke, which is the right call because they share a turn.

Downstream aggregation is straightforward: group by session_id (or by minute, or by model) and sum. The granularity is per-ainvoke so you can also see the shape of cost per turn, not just totals.
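
As a rough illustration of that grouping, the sketch below sums llm_usage events per session_id. It assumes events.jsonl holds one JSON object per line with the event shape shown in the PR body; sum_usage_by_session is a hypothetical helper, and the sample lines are made up for the example.

```python
import json
from collections import defaultdict

NUMERIC_FIELDS = ("input_tokens", "output_tokens", "total_tokens")


def sum_usage_by_session(jsonl_lines):
    """Group llm_usage events by session_id and sum their token counts."""
    totals = defaultdict(lambda: {**{f: 0 for f in NUMERIC_FIELDS}, "events": 0})
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        if event.get("type") != "llm_usage":
            continue  # ignore tool calls, timing, and failure events
        bucket = totals[event.get("session_id")]
        for field in NUMERIC_FIELDS:
            bucket[field] += event.get(field, 0)
        bucket["events"] += 1  # one row per ainvoke, repairs included
    return dict(totals)


# Made-up sample: session "a" has a turn plus a block-repair reinvoke.
sample_lines = [
    '{"type": "llm_usage", "session_id": "a", "input_tokens": 1200, "output_tokens": 340, "total_tokens": 1540}',
    '{"type": "tool_call", "session_id": "a"}',
    '{"type": "llm_usage", "session_id": "a", "input_tokens": 300, "output_tokens": 60, "total_tokens": 360}',
    '{"type": "llm_usage", "session_id": "b", "input_tokens": 50, "output_tokens": 10, "total_tokens": 60}',
]
by_session = sum_usage_by_session(sample_lines)
```

Reading from the real file would be a matter of passing an open file handle, since iterating a text file yields lines. The events counter keeps the per-ainvoke granularity visible even after summing.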



Development

Successfully merging this pull request may close these issues.

Feature request: model-call observability hook for skills

3 participants