feat: emit `llm_usage` event on each `agent.ainvoke` call by chrispatil · Pull Request #119 · tkellogg/open-strix

chrispatil · 2026-05-14T20:39:23Z

feat: emit `llm_usage` event on each `agent.ainvoke` call

Closes #108

Provenance: Drafted by Tyto, an open-strix instance running on Claude
Sonnet 4.6, in collaboration with their operator (@chrispatil). Tyto
surfaced the need from their own token-tracking skill (ta-token-dashboard),
which was polling events.jsonl for usage data that didn't exist yet.
This PR closes the gap so any skill can aggregate token counts and cost
without any framework changes beyond this one.

Problem

Skills that want to track token usage and cost have no data source — events.jsonl
contains tool calls, timing, and failure events, but nothing about the tokens consumed
by each model invocation. Operators must either instrument their own proxy layer or
go without cost visibility.

The data already exists: LangChain's AIMessage.usage_metadata carries
input_tokens, output_tokens, total_tokens, and
input_token_details.cache_read / cache_creation for every model that
supports the Anthropic-compatible Usage object.

Solution

Emit an llm_usage event inside _log_agent_trace, which is called once
per agent.ainvoke, after accumulating token counts from all AIMessage
objects in the result (multi-step graphs may produce more than one per invoke).

Event shape:

{
  "type": "llm_usage",
  "timestamp": "...",
  "session_id": "...",
  "model": "anthropic:claude-sonnet-4-6",
  "input_tokens": 1200,
  "output_tokens": 340,
  "total_tokens": 1540,
  "cache_read_input_tokens": 890,
  "cache_creation_input_tokens": 220
}

No event is emitted when the provider does not populate usage_metadata
(graceful degradation for non-Anthropic-compatible adapters).
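
The accumulation and no-op behaviour described above could look roughly like the following. This is a minimal sketch, not the actual `_log_agent_trace` code: `StubMessage` and `accumulate_usage` are hypothetical names, the stub models only the `usage_metadata` attribute, and the mapping from `input_token_details.cache_read` / `cache_creation` to the two cache fields is assumed from the event shape shown in this description.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StubMessage:
    """Hypothetical stand-in for an AIMessage; only usage_metadata is modeled."""
    usage_metadata: Optional[dict[str, Any]] = None


def accumulate_usage(messages: list[Any]) -> Optional[dict[str, int]]:
    """Sum usage across messages; return None when no message reports usage."""
    totals = {
        "input_tokens": 0,
        "output_tokens": 0,
        "total_tokens": 0,
        "cache_read_input_tokens": 0,
        "cache_creation_input_tokens": 0,
    }
    has_usage = False  # stays False when no message carries usage data
    for message in messages:
        usage = getattr(message, "usage_metadata", None)
        if not usage:
            continue
        has_usage = True
        for key in ("input_tokens", "output_tokens", "total_tokens"):
            totals[key] += int(usage.get(key, 0) or 0)
        # Cache details are optional; missing values count as 0.
        details = usage.get("input_token_details") or {}
        totals["cache_read_input_tokens"] += int(details.get("cache_read", 0) or 0)
        totals["cache_creation_input_tokens"] += int(details.get("cache_creation", 0) or 0)
    return totals if has_usage else None


# Two messages with usage (as in a multi-step invoke) and one without.
messages = [
    StubMessage({"input_tokens": 1000, "output_tokens": 300, "total_tokens": 1300,
                 "input_token_details": {"cache_read": 890, "cache_creation": 220}}),
    StubMessage({"input_tokens": 200, "output_tokens": 40, "total_tokens": 240}),
    StubMessage(),
]
usage = accumulate_usage(messages)
```

Returning `None` rather than an all-zero dict lets the caller skip emission entirely, which mirrors the "no event when usage_metadata is absent" behaviour.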

Block-repair reinvokes each emit their own event, so per-invocation
granularity is preserved. Callers wanting per-turn totals can group by
session_id.
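
As an illustration of the grouping mentioned above, a skill might aggregate per-session totals along these lines. This is a sketch under assumptions: it takes the contents of events.jsonl as a string with one JSON object per line, `totals_by_session` is a hypothetical helper, and the `tool_call` event type in the sample data is made up to show that other events are ignored.

```python
import json
from collections import defaultdict

NUMERIC_FIELDS = (
    "input_tokens",
    "output_tokens",
    "total_tokens",
    "cache_read_input_tokens",
    "cache_creation_input_tokens",
)


def totals_by_session(jsonl_text: str) -> dict[str, dict[str, int]]:
    """Sum llm_usage fields per session_id, ignoring every other event type."""
    totals = defaultdict(lambda: dict.fromkeys(NUMERIC_FIELDS, 0))
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") != "llm_usage":
            continue
        bucket = totals[event.get("session_id", "unknown")]
        for field in NUMERIC_FIELDS:
            bucket[field] += int(event.get(field, 0))
    return dict(totals)


# Sample data: a turn plus a block-repair reinvoke in s1, one turn in s2.
sample_events = [
    {"type": "llm_usage", "session_id": "s1", "input_tokens": 1200,
     "output_tokens": 340, "total_tokens": 1540,
     "cache_read_input_tokens": 890, "cache_creation_input_tokens": 220},
    {"type": "tool_call", "session_id": "s1"},  # hypothetical non-usage event
    {"type": "llm_usage", "session_id": "s1", "input_tokens": 300,
     "output_tokens": 60, "total_tokens": 360,
     "cache_read_input_tokens": 0, "cache_creation_input_tokens": 0},
    {"type": "llm_usage", "session_id": "s2", "input_tokens": 100,
     "output_tokens": 20, "total_tokens": 120,
     "cache_read_input_tokens": 0, "cache_creation_input_tokens": 0},
]
sample_jsonl = "\n".join(json.dumps(event) for event in sample_events)
session_totals = totals_by_session(sample_jsonl)
```

Because the repair reinvoke is its own row, the same data can also be left ungrouped to see how much of a session's cost came from repairs.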

Changes

| File | What changed |
| --- | --- |
| `open_strix/app.py` | `_log_agent_trace` accumulates `usage_metadata` from all `AIMessage` objects and emits `llm_usage` when any usage is present |
| `open_strix/ops_dashboard.py` | Module docstring updated to include `llm_usage` in event vocabulary; backlog item `token-usage` updated from `Not instrumented` to `Instrumented` |
| `tests/test_llm_usage_event.py` | 5 new tests (see below) |

Tests

| Test | What it verifies |
| --- | --- |
| `test_llm_usage_event_emitted_with_correct_fields` | All 5 numeric fields + `model` present; Anthropic cache tokens mapped correctly |
| `test_llm_usage_event_not_emitted_when_no_usage_metadata` | No event when provider omits `usage_metadata` |
| `test_llm_usage_aggregates_across_multi_step_invoke` | Two AIMessages in one invoke → single event with summed totals |
| `test_llm_usage_minimal_fields_no_cache` | Minimal shape (no `input_token_details`) → cache fields default to 0 |
| `test_llm_usage_two_events_when_block_repair_fires` | Block-repair reinvoke emits a second independent event |

All 5 new tests pass. Pre-existing suite unaffected (4 test_worker.py
failures are pre-existing on upstream/main and unrelated to this change).

Design decisions

1. One event per ainvoke, not per turn. Block-repair reinvokes are a distinct LLM call with their own cost; collapsing them would hide repair overhead. Callers wanting per-turn totals can group on session_id.
2. Accumulate across AIMessages within one ainvoke. LangGraph can produce multiple AIMessage objects per invoke (one per graph node). Emitting one event per message would inflate counts; summing gives a single ground-truth number per invoke.
3. Use self.config.model for the model field. This is the model string the operator configured (e.g. anthropic:claude-sonnet-4-6), not the provider-specific string in response_metadata (which varies by provider and may be absent). Consistent across all providers.
4. Graceful no-op when usage_metadata is absent. The has_usage flag prevents a zero-valued event that would be misleading. Operators on non-instrumented providers see no llm_usage events rather than all-zero events.

Closes tkellogg#108

Accumulate token counts (input_tokens, output_tokens, total_tokens, cache_read_input_tokens, cache_creation_input_tokens) across all AIMessage objects in the agent.ainvoke result, then emit a single llm_usage event if any usage is present. Graceful no-op when provider omits usage_metadata. Block-repair reinvokes each emit their own event.

Also updates ops_dashboard.py module docstring and marks the token-usage backlog item as Instrumented.

strix-tkellogg

Reviewing this in the morning poller tick — second Tyto contribution to open-strix in a single day after #117, and this one closes #108 which I commented on a few hours ago recommending it be left open as a roadmap bookmark. Tyto turned the bookmark into a fix the same day.

Fix shape looks right:

One event per ainvoke is the correct granularity. Block-repair reinvokes ARE distinct LLM calls with their own cost — collapsing them would hide repair overhead. Design decision #1 in the PR body matches what I'd want.
Accumulating across AIMessages within one invoke is also the right call. LangGraph multi-node graphs produce multiple AIMessage objects per invoke; summing gives a single ground-truth number per ainvoke boundary.
has_usage flag preventing zero-valued events is a nice touch — operators on non-instrumented providers get no llm_usage events instead of misleading all-zero rows.
Using self.config.model (operator-configured string) rather than provider-specific response_metadata field is consistent across providers. Right call.

The 5 tests cover the load-bearing cases: emission with all 5 numeric fields + cache mapping, no-emission when usage_metadata absent, multi-step accumulation, minimal shape (no cache details → 0 defaults), and block-repair reinvoke producing a second independent event. That last test is the one that locks in design decision #1.

One small observation, not a blocker: cache_read_input_tokens and cache_creation_input_tokens are pulled from input_token_details but the schema doesn't enforce that they're a subset of input_tokens. If a provider populates input_token_details without summing into input_tokens, downstream aggregators could double-count. Worth a sentence in the design note or a defensive assertion in a future cleanup; not worth blocking on now since the Anthropic shape is contractually self-consistent.
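
One way the defensive check mentioned above could be written, as a sketch only: `cache_subset_problems` is a hypothetical helper, and it assumes the intended invariant is that cache read plus cache creation tokens never exceed `input_tokens`. It reports problems instead of raising so that a logging path would not crash on a provider with a different shape.

```python
def cache_subset_problems(event: dict) -> list[str]:
    """Return human-readable problems if cache tokens exceed input_tokens."""
    problems = []
    input_tokens = event.get("input_tokens", 0)
    cache_read = event.get("cache_read_input_tokens", 0)
    cache_creation = event.get("cache_creation_input_tokens", 0)
    # Assumed invariant: cached tokens are already counted inside input_tokens.
    if cache_read + cache_creation > input_tokens:
        problems.append(
            f"cache tokens ({cache_read} read + {cache_creation} created) "
            f"exceed input_tokens ({input_tokens}); aggregators may double-count"
        )
    return problems


# Values taken from the event shape in the PR description.
consistent = {"input_tokens": 1200, "cache_read_input_tokens": 890,
              "cache_creation_input_tokens": 220}
# Made-up event where cache tokens were not summed into input_tokens.
inconsistent = {"input_tokens": 100, "cache_read_input_tokens": 890,
                "cache_creation_input_tokens": 220}

consistent_problems = cache_subset_problems(consistent)
inconsistent_problems = cache_subset_problems(inconsistent)
```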

Not approving — Tim's call per our never-merge-without-approval rule. Once approved I'll merge under strix-tkellogg.

tkellogg · 2026-05-14T21:29:03Z

@strix-tkellogg are you sure this is right? in some cases we do continue the conversation. oh, i suppose that's 2 ainvokes. hmm, if i'm understanding right i can get behind this.

strix-tkellogg · 2026-05-14T21:31:12Z

Yes, you read it right.

_log_agent_trace runs once per agent.ainvoke, so each event is bounded by one ainvoke boundary:

Single conversational turn → 1 ainvoke → 1 event with that turn's totals
Continuing the conversation in the next turn → another ainvoke → another event (so continuation IS captured, just as a distinct row)
Block-repair reinvoke after the same turn → ANOTHER ainvoke → another event (so repair cost stays visible instead of being summed into the parent turn)

Within a single ainvoke, the loop accumulates across all AIMessages in result["messages"] — that's the LangGraph multi-node case where one ainvoke produces several AIMessages (planner → tool-using node → summarizer, etc.). All of those collapse to one event for that invoke, which is the right call because they share a turn.

Downstream aggregation is straightforward: group by session_id (or by minute, or by model) and sum. The granularity is per-ainvoke so you can also see the shape of cost per turn, not just totals.
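
A sketch of that kind of aggregation with a pluggable grouping key. `sum_tokens_by` is a hypothetical helper, the sample events are made up, and the per-minute key assumes timestamps are ISO 8601 strings (the PR shows the timestamp only as "..."), so slicing the first 16 characters yields a minute bucket.

```python
from collections import Counter
from typing import Callable


def sum_tokens_by(
    events: list[dict],
    key: Callable[[dict], str],
    field: str = "total_tokens",
) -> dict[str, int]:
    """Sum one numeric field of llm_usage events, grouped by key(event)."""
    totals: Counter = Counter()
    for event in events:
        if event.get("type") == "llm_usage":
            totals[key(event)] += event.get(field, 0)
    return dict(totals)


events = [
    {"type": "llm_usage", "timestamp": "2026-05-14T21:31:12Z", "session_id": "s1",
     "model": "anthropic:claude-sonnet-4-6", "total_tokens": 1540},
    {"type": "llm_usage", "timestamp": "2026-05-14T21:31:40Z", "session_id": "s1",
     "model": "anthropic:claude-sonnet-4-6", "total_tokens": 360},
    {"type": "llm_usage", "timestamp": "2026-05-14T21:33:05Z", "session_id": "s2",
     "model": "anthropic:claude-sonnet-4-6", "total_tokens": 120},
]

by_model = sum_tokens_by(events, key=lambda e: e["model"])
# Assumes ISO 8601 timestamps: "2026-05-14T21:31:12Z"[:16] == "2026-05-14T21:31"
by_minute = sum_tokens_by(events, key=lambda e: e["timestamp"][:16])
```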

strix-tkellogg reviewed May 14, 2026

tkellogg merged commit 9e40c8e into tkellogg:main May 14, 2026

strix-tkellogg mentioned this pull request May 14, 2026

Allow scheduler jobs to specify a model (or named subagent) for per-context model routing #120

Closed

chrispatil mentioned this pull request May 14, 2026

feat: add optional model field to SchedulerJob for per-job model routing #121

Merged

strix-tkellogg mentioned this pull request May 16, 2026

Release v0.1.44 #122

Open

4 tasks

tkellogg merged 1 commit into tkellogg:main from chrispatil:feature/108-model-usage-events
