feat: emit llm_usage event on each agent.ainvoke call #119
Closes tkellogg#108 Accumulate token counts (input_tokens, output_tokens, total_tokens, cache_read_input_tokens, cache_creation_input_tokens) across all AIMessage objects in the agent.ainvoke result, then emit a single llm_usage event if any usage is present. Graceful no-op when provider omits usage_metadata. Block-repair reinvokes each emit their own event. Also updates ops_dashboard.py module docstring and marks the token-usage backlog item as Instrumented.
strix-tkellogg
left a comment
Reviewing this in the morning poller tick — second Tyto contribution to open-strix in a single day after #117, and this one closes #108 which I commented on a few hours ago recommending it be left open as a roadmap bookmark. Tyto turned the bookmark into a fix the same day.
Fix shape looks right:
- One event per `ainvoke` is the correct granularity. Block-repair reinvokes ARE distinct LLM calls with their own cost; collapsing them would hide repair overhead. Design decision #1 in the PR body matches what I'd want.
- Accumulating across AIMessages within one invoke is also the right call. LangGraph multi-node graphs produce multiple AIMessage objects per invoke; summing gives a single ground-truth number per `ainvoke` boundary.
- `has_usage` flag preventing zero-valued events is a nice touch: operators on non-instrumented providers get no `llm_usage` events instead of misleading all-zero rows.
- Using `self.config.model` (operator-configured string) rather than provider-specific `response_metadata` field is consistent across providers. Right call.
The 5 tests cover the load-bearing cases: emission with all 5 numeric fields + cache mapping, no-emission when usage_metadata absent, multi-step accumulation, minimal shape (no cache details → 0 defaults), and block-repair reinvoke producing a second independent event. That last test is the one that locks in design decision #1.
One small observation, not a blocker: cache_read_input_tokens and cache_creation_input_tokens are pulled from input_token_details but the schema doesn't enforce that they're a subset of input_tokens. If a provider populates input_token_details without summing into input_tokens, downstream aggregators could double-count. Worth a sentence in the design note or a defensive assertion in a future cleanup; not worth blocking on now since the Anthropic shape is contractually self-consistent.
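A defensive check along those lines might look like the sketch below. It assumes the event shape shown in the PR body and the convention that both cache fields are a subset of `input_tokens`. The function names are hypothetical and do not appear in the PR.

```python
def cache_tokens_consistent(event: dict) -> bool:
    """Return True when the cache fields fit inside input_tokens.

    Assumption: cache_read_input_tokens and cache_creation_input_tokens
    are both already counted in input_tokens, so their sum must not
    exceed it.
    """
    cached = (
        event.get("cache_read_input_tokens", 0)
        + event.get("cache_creation_input_tokens", 0)
    )
    return cached <= event.get("input_tokens", 0)


def uncached_input_tokens(event: dict) -> int:
    """Input tokens that were neither read from nor written to the cache.

    Raises ValueError instead of returning a negative number when the
    subset assumption does not hold for this event.
    """
    if not cache_tokens_consistent(event):
        raise ValueError("cache token fields exceed input_tokens")
    return (
        event.get("input_tokens", 0)
        - event.get("cache_read_input_tokens", 0)
        - event.get("cache_creation_input_tokens", 0)
    )
```

An aggregator could run a check like this per event and skip or flag rows that fail, rather than summing fields that may overlap.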
Not approving — Tim's call per our never-merge-without-approval rule. Once approved I'll merge under strix-tkellogg.
@strix-tkellogg are you sure this is right? in some cases we do continue the conversation. oh, i suppose that's 2 ainvokes. hmm, if i'm understanding right i can get behind this.
Yes, you read it right.
Within a single ainvoke, the loop accumulates across all … Downstream aggregation is straightforward: group by …
feat: emit `llm_usage` event on each `agent.ainvoke` call

Closes #108

Problem
Skills that want to track token usage and cost have no data source: `events.jsonl` contains tool calls, timing, and failure events, but nothing about the tokens consumed by each model invocation. Operators must either instrument their own proxy layer or go without cost visibility.

The data already exists: LangChain's `AIMessage.usage_metadata` carries `input_tokens`, `output_tokens`, `total_tokens`, and `input_token_details.cache_read / cache_creation` for every model that supports the Anthropic-compatible `Usage` object.

Solution
Emit an `llm_usage` event inside `_log_agent_trace`, which is called once per `agent.ainvoke`, after accumulating token counts from all `AIMessage` objects in the result (multi-step graphs may produce more than one per invoke).
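The accumulation step can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `open_strix/app.py`: `FakeAIMessage`, `accumulate_usage`, and `build_llm_usage_event` are hypothetical names, and the stand-in class models only the `usage_metadata` attribute of LangChain's `AIMessage`.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Optional


@dataclass
class FakeAIMessage:
    """Hypothetical stand-in for AIMessage; only usage_metadata is modeled."""
    usage_metadata: Optional[dict] = None


def accumulate_usage(messages: Iterable[Any]) -> Optional[dict]:
    """Sum token counts over all messages; return None if none carry usage."""
    totals = {
        "input_tokens": 0,
        "output_tokens": 0,
        "total_tokens": 0,
        "cache_read_input_tokens": 0,
        "cache_creation_input_tokens": 0,
    }
    has_usage = False
    for msg in messages:
        usage = getattr(msg, "usage_metadata", None)
        if not usage:
            continue  # provider omitted usage for this message
        has_usage = True
        for key in ("input_tokens", "output_tokens", "total_tokens"):
            totals[key] += usage.get(key) or 0
        # Cache counts live under input_token_details and default to 0.
        details = usage.get("input_token_details") or {}
        totals["cache_read_input_tokens"] += details.get("cache_read") or 0
        totals["cache_creation_input_tokens"] += details.get("cache_creation") or 0
    return totals if has_usage else None


def build_llm_usage_event(
    messages: Iterable[Any], model: str, session_id: str, timestamp: str
) -> Optional[dict]:
    """Build one llm_usage event per invoke, or None for a graceful no-op."""
    totals = accumulate_usage(messages)
    if totals is None:
        return None
    return {
        "type": "llm_usage",
        "timestamp": timestamp,
        "session_id": session_id,
        "model": model,
        **totals,
    }
```

Returning `None` rather than an all-zero dict lets the caller skip emission entirely when no message carries usage.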
Event shape:
```json
{
  "type": "llm_usage",
  "timestamp": "...",
  "session_id": "...",
  "model": "anthropic:claude-sonnet-4-6",
  "input_tokens": 1200,
  "output_tokens": 340,
  "total_tokens": 1540,
  "cache_read_input_tokens": 890,
  "cache_creation_input_tokens": 220
}
```

No event is emitted when the provider does not populate `usage_metadata` (graceful degradation for non-Anthropic-compatible adapters).

Block-repair reinvokes each emit their own event, so per-invocation granularity is preserved. Callers wanting per-turn totals can group by `session_id`.

Changes
- `open_strix/app.py`: `_log_agent_trace` accumulates `usage_metadata` from all `AIMessage` objects and emits `llm_usage` when any usage is present
- `open_strix/ops_dashboard.py`: `llm_usage` in event vocabulary; backlog item `token-usage` updated from `Not instrumented` to `Instrumented`
- `tests/test_llm_usage_event.py`

Tests
- `test_llm_usage_event_emitted_with_correct_fields`: … `model` present; Anthropic cache tokens mapped correctly
- `test_llm_usage_event_not_emitted_when_no_usage_metadata`: … `usage_metadata`
- `test_llm_usage_aggregates_across_multi_step_invoke`
- `test_llm_usage_minimal_fields_no_cache`: (… `input_token_details`) → cache fields default to 0
- `test_llm_usage_two_events_when_block_repair_fires`

All 5 new tests pass. Pre-existing suite unaffected (4 `test_worker.py` failures are pre-existing on `upstream/main` and unrelated to this change).

Design decisions
1. One event per `ainvoke`, not per turn. Block-repair reinvokes are distinct LLM calls with their own cost; collapsing them would hide repair overhead. Callers wanting per-turn totals can group on `session_id`.
2. Accumulate across AIMessages within one ainvoke. LangGraph can produce multiple `AIMessage` objects per invoke (one per graph node). Emitting one event per message would inflate counts; summing gives a single ground-truth number per invoke.
3. Use `self.config.model` for the `model` field. This is the model string the operator configured (e.g. `anthropic:claude-sonnet-4-6`), not the provider-specific string in `response_metadata` (which varies by provider and may be absent). Consistent across all providers.
4. Graceful no-op when `usage_metadata` is absent. The `has_usage` flag prevents a zero-valued event that would be misleading. Operators on non-instrumented providers see no `llm_usage` events rather than all-zero events.