
fix(anthropic): always set streaming token usage from API data (#3949)#3976

Open
carsonjc04 wants to merge 5 commits into traceloop:main from carsonjc04:fix/streaming-token-usage-3949

Conversation


@carsonjc04 carsonjc04 commented Apr 11, 2026

The streaming `_handle_completion()` and `_complete_instrumentation()` methods gated all token-usage recording behind `if Config.enrich_token_usage`. Since that flag defaults to `False`, streaming spans never received `gen_ai.usage.input_tokens` — even when the API provided real usage data in the SSE stream events.

Downstream tools (e.g. Langfuse) would then fall back to tokenizing the raw input content, which includes base64 image data, producing massively inflated token counts (e.g. 1,633 instead of 343).

Restructure the logic so that:

  • API-provided usage is always read and recorded on the span
  • The `enrich_token_usage` flag only gates the local estimation fallback (`count_prompt_tokens_from_request`) for when usage is absent from the response

This aligns the streaming path with the non-streaming create() path, which already sets token attributes unconditionally.
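The restructured precedence can be sketched in a few lines. This is a minimal illustration of the control flow, not the actual streaming.py code; the function name and the `estimate_prompt_tokens` callback here are simplified stand-ins:

```python
# Sketch of the fix: trust API-provided usage unconditionally, and only
# fall back to local estimation when the flag opts in. Names and shapes
# are illustrative, not the real instrumentation code.

def resolve_token_usage(complete_response, estimate_prompt_tokens,
                        enrich_token_usage=False):
    """Return (prompt_tokens, completion_tokens), or (None, None) if unknown."""
    usage = complete_response.get("usage") or {}
    if usage.get("input_tokens") is not None:
        # API usage is always read and recorded, regardless of the flag.
        return usage.get("input_tokens") or 0, usage.get("output_tokens") or 0
    if not enrich_token_usage:
        # Leaving attributes unset is better than letting downstream tools
        # tokenize raw input (including base64 images) into inflated counts.
        return None, None
    # Opt-in fallback: estimate locally only when the API sent no usage.
    return estimate_prompt_tokens(), 0
```

With API usage present, the resolver returns those numbers even when the flag is off; with usage absent and the flag off, it returns `(None, None)` so no token attributes are written at all.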

Closes #3949

  • I have added tests that cover my changes.
  • If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
  • PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
  • (If applicable) I have updated the documentation accordingly.

Summary by CodeRabbit

  • Bug Fixes

    • Streaming token usage now prefers API-provided values; local estimation is used only as a fallback when enrichment is enabled. Completion counts are normalized, instrumentation completion is reliably marked once for both sync and async streams, and token-usage warning logs now include exception text.
  • Tests

    • Added tests validating token-usage attributes and span completion behavior for streaming (sync and async) when API usage is present or absent.


CLAassistant commented Apr 11, 2026

CLA assistant check
All committers have signed the CLA.


coderabbitai bot commented Apr 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Walkthrough

Centralized token-usage resolution for Anthropic streaming: added _resolve_stream_token_usage to unify prompt/completion token extraction for sync and async streams, preferring API usage when present and optionally estimating tokens when enrichment is enabled; call sites now set span attributes only when prompt tokens are resolved and use simplified warning logging.

Changes

Cohort / File(s) Summary
Streaming token-resolution logic
packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py
Added a shared `_resolve_stream_token_usage` helper used by `AnthropicStream._handle_completion` and `AnthropicAsyncStream._complete_instrumentation`. The helper returns API usage (`input_tokens`/`output_tokens`) when present; otherwise it returns `(None, None)` unless `Config.enrich_token_usage` is enabled, in which case it estimates `prompt_tokens` and `completion_tokens` (using `instance.count_tokens` when available). Call sites now invoke `_set_token_usage` only when `prompt_tokens` is not `None`; `completion_tokens` is normalized with `or 0`. Warning logging on failures was simplified.
Tests for streaming span token attributes
packages/opentelemetry-instrumentation-anthropic/tests/test_semconv_span_attrs.py
Added regression tests and a helper _make_anthropic_async_stream() to validate that streaming spans record token attributes from API-provided usage even when Config.enrich_token_usage is False, and that when API usage is empty token attributes are not set while instrumentation still completes and span ends once; tests cover both sync and async streaming paths.
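The assertion pattern in those regression tests can be illustrated with a small self-contained sketch. The span is a `MagicMock`, and the `set_usage_attrs` helper and attribute keys below are simplified stand-ins for the real instrumentation code, not the actual test file:

```python
# Hypothetical shape of the regression tests: attributes are written only
# when the prompt count was actually resolved from API data.
from unittest.mock import MagicMock

GEN_AI_USAGE_INPUT_TOKENS = "gen_ai.usage.input_tokens"
GEN_AI_USAGE_OUTPUT_TOKENS = "gen_ai.usage.output_tokens"

def set_usage_attrs(span, prompt_tokens, completion_tokens):
    # Mirrors the call-site rule: skip entirely when prompt_tokens is None.
    if prompt_tokens is not None:
        span.set_attribute(GEN_AI_USAGE_INPUT_TOKENS, prompt_tokens)
        span.set_attribute(GEN_AI_USAGE_OUTPUT_TOKENS, completion_tokens or 0)

def test_api_usage_recorded_without_enrich_flag():
    span = MagicMock()
    set_usage_attrs(span, 343, 12)
    span.set_attribute.assert_any_call(GEN_AI_USAGE_INPUT_TOKENS, 343)

def test_no_usage_and_no_enrich_leaves_attrs_unset():
    span = MagicMock()
    set_usage_attrs(span, None, None)
    span.set_attribute.assert_not_called()
```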

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hop through streams and count each peep,
I trust the API's numbers when they're deep,
If counts are shy and enrichment gives room,
I stitch prompt and reply from events that loom,
Span ends, I twitch — another tidy sweep!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title accurately describes the main change: always setting streaming token usage from API data instead of only when enrich_token_usage is enabled.
  • Linked Issues Check ✅ Passed: the PR addresses issue #3949 by refactoring token usage logic to always record API-provided usage for streaming, aligning behavior with non-streaming create() calls.
  • Out of Scope Changes Check ✅ Passed: all changes are directly related to fixing streaming token usage from API data and include appropriate test coverage for the fix.
  • Docstring Coverage ✅ Passed: docstring coverage is 80.00%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py (1)

246-279: Extract the stream token-resolution branch into one helper.

This logic now exists twice, and it has already started to drift (logger.warning(..., e) vs logger.warning(..., str(e))). Pulling it into a shared helper will make the sync and async paths stay aligned the next time this behavior changes.

Proposed refactor sketch
+def _resolve_stream_token_usage(complete_response, instance, kwargs):
+    usage = complete_response.get("usage")
+    if usage:
+        return (
+            usage.get("input_tokens", 0) or 0,
+            usage.get("output_tokens", 0) or 0,
+        )
+
+    if not Config.enrich_token_usage:
+        return None, None
+
+    prompt_tokens = count_prompt_tokens_from_request(instance, kwargs)
+    completion_content = "".join(
+        event.get("text", "")
+        for event in complete_response.get("events", [])
+        if event.get("text")
+    )
+    completion_tokens = None
+    if complete_response.get("model") and hasattr(instance, "count_tokens"):
+        completion_tokens = instance.count_tokens(completion_content)
+
+    return prompt_tokens, completion_tokens
+
-        try:
-            usage = self._complete_response.get("usage")
-            prompt_tokens = None
-            completion_tokens = None
-            ...
+        try:
+            prompt_tokens, completion_tokens = _resolve_stream_token_usage(
+                self._complete_response, self._instance, self._kwargs
+            )
             if prompt_tokens is not None:
                 _set_token_usage(
                     self._span,
                     self._complete_response,
                     prompt_tokens,
                     completion_tokens or 0,

Also applies to: 408-441

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py`
around lines 246 - 279, The token-resolution logic duplicated in the streaming
path should be extracted into a single helper (e.g., a method like
_resolve_stream_token_usage or a private module function) that accepts self (or
the minimal pieces: self._complete_response, self._instance, self._kwargs) and
returns prompt_tokens and completion_tokens (or None/0) so both sync and async
locations can call it; move the current branch that checks
self._complete_response.get("usage"), falls back to Config.enrich_token_usage
with count_prompt_tokens_from_request and self._instance.count_tokens, and the
final _set_token_usage call into the helper, and replace the duplicated blocks
(the shown block and the one at lines 408–441) with calls to this helper,
ensuring exception handling is unified (use logger.warning(..., str(e)) or
include the error consistently) and references to symbols like
Config.enrich_token_usage, count_prompt_tokens_from_request, _set_token_usage,
self._complete_response, and self._instance remain the same.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@packages/opentelemetry-instrumentation-anthropic/tests/test_semconv_span_attrs.py`:
- Around line 1546-1607: Add async counterparts of the two sync tests to cover
AnthropicAsyncStream: duplicate
test_streaming_sets_token_usage_from_api_without_enrich_flag and
test_streaming_skips_token_usage_without_api_data_and_enrich_disabled but use
the async helper (e.g., _make_anthropic_async_stream or the async stream class
AnthropicAsyncStream), patch Config.enrich_token_usage the same way, await the
async completion handler (await stream._handle_completion() or call the
coroutine appropriately), and assert the same span attribute expectations
(GenAIAttributes.GEN_AI_USAGE_INPUT_TOKENS / GEN_AI_USAGE_OUTPUT_TOKENS presence
or absence), stream._instrumentation_completed True, and that span.end was
called. Ensure test names reference Async (or AsyncStream) and mirror the
setup/usage blocks from the sync tests (span, span.is_recording, span.end =
MagicMock) so both sync and async paths are covered.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3e1e1713-4299-411f-a360-0b9833856b0a

📥 Commits

Reviewing files that changed from the base of the PR and between 786d49f and cf19fd4.

📒 Files selected for processing (2)
  • packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py
  • packages/opentelemetry-instrumentation-anthropic/tests/test_semconv_span_attrs.py

@carsonjc04 force-pushed the fix/streaming-token-usage-3949 branch from cf19fd4 to c999e5f (April 11, 2026 23:13)
…loop#3949)

The streaming _handle_completion() and _complete_instrumentation()
methods gated all token-usage recording behind
`if Config.enrich_token_usage`. Since that flag defaults to False,
streaming spans never received gen_ai.usage.input_tokens — even when
the API provided real usage data in the SSE stream events.

Downstream tools (e.g. Langfuse) would then fall back to tokenizing
the raw input content, which includes base64 image data, producing
massively inflated token counts (e.g. 1,633 instead of 343).

Restructure the logic so that:
- API-provided usage is always read and recorded on the span
- The enrich_token_usage flag only gates the local estimation
  fallback (count_prompt_tokens_from_request) for when usage is
  absent from the response

This aligns the streaming path with the non-streaming create() path,
which already sets token attributes unconditionally.

Closes traceloop#3949

Made-with: Cursor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
packages/opentelemetry-instrumentation-anthropic/tests/test_semconv_span_attrs.py (1)

1546-1607: ⚠️ Potential issue | 🟡 Minor

Add the same regression coverage for AnthropicAsyncStream.

These tests only exercise AnthropicStream, but this PR changed the mirrored async branch in AnthropicAsyncStream._complete_instrumentation too. Please add async equivalents for both cases so the two paths cannot drift again.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/opentelemetry-instrumentation-anthropic/tests/test_semconv_span_attrs.py`
around lines 1546 - 1607, Add two async test functions mirroring
test_streaming_sets_token_usage_from_api_without_enrich_flag and
test_streaming_skips_token_usage_without_api_data_and_enrich_disabled but
targeting AnthropicAsyncStream and its async completion method: create an
AnthropicAsyncStream via _make_anthropic_stream(span) (or the async equivalent),
set stream._complete_response the same way, patch
opentelemetry.instrumentation.anthropic.streaming.Config to set
enrich_token_usage=False, then await stream._complete_instrumentation() (or the
async handler used by AnthropicAsyncStream) and assert the same
attributes/behavior (GEN_AI_USAGE_INPUT_TOKENS and GEN_AI_USAGE_OUTPUT_TOKENS
present for the API usage case, absent for the no-usage case,
stream._instrumentation_completed True, and span.end called once). Mark the
tests with pytest.mark.asyncio so they run as async tests and use the same
unique symbols: AnthropicAsyncStream and _complete_instrumentation.
🧹 Nitpick comments (1)
packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py (1)

246-279: Extract the token-usage resolution into one helper.

These sync and async branches are now effectively copy-pasted, and they already diverge slightly in the warning call. Pulling the API-usage/fallback logic into a shared helper would make future token-accounting fixes much less likely to land in only one path.

Also applies to: 408-441

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py`
around lines 246 - 279, The token-usage calculation and fallback logic (reading
self._complete_response["usage"], falling back to Config.enrich_token_usage with
count_prompt_tokens_from_request and self._instance.count_tokens, then calling
_set_token_usage) is duplicated between the sync and async branches; extract
this into a single helper method (e.g., _resolve_and_set_token_usage or similar)
that accepts self, metric_attributes and uses self._complete_response,
self._instance, _set_token_usage, self._span, self._token_histogram and
self._choice_counter to perform the same logic and logging, then replace both
inline blocks (the current block and the matching block around lines 408-441)
with calls to that helper so both paths share identical behavior and warning
handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 09e7c3e6-31dc-47aa-9a8e-74a2d35107d1

📥 Commits

Reviewing files that changed from the base of the PR and between cf19fd4 and c999e5f.

📒 Files selected for processing (2)
  • packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py
  • packages/opentelemetry-instrumentation-anthropic/tests/test_semconv_span_attrs.py

@carsonjc04 force-pushed the fix/streaming-token-usage-3949 branch from c999e5f to 037af04 (April 11, 2026 23:18)

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py (1)

275-291: Consider extracting shared token-finalization block to avoid sync/async drift.

Line 275-Line 291 and Line 419-Line 435 duplicate the same resolve/set/log flow. A tiny shared helper would reduce maintenance risk.

♻️ Suggested refactor
+def _apply_stream_token_usage(
+    span,
+    complete_response,
+    instance,
+    kwargs,
+    metric_attributes,
+    token_histogram,
+    choice_counter,
+):
+    try:
+        prompt_tokens, completion_tokens = _resolve_stream_token_usage(
+            complete_response, instance, kwargs
+        )
+        if prompt_tokens is not None:
+            _set_token_usage(
+                span,
+                complete_response,
+                prompt_tokens,
+                completion_tokens or 0,
+                metric_attributes,
+                token_histogram,
+                choice_counter,
+            )
+    except Exception as e:
+        logger.warning("Failed to set token usage, error: %s", str(e))

Then call _apply_stream_token_usage(...) from both completion methods.

Also applies to: 419-435

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py`
around lines 275 - 291, Extract the duplicated resolve/set/log logic into a
single helper (e.g. _apply_stream_token_usage) that takes the common operands
(self._complete_response, self._instance, self._kwargs, self._span,
metric_attributes, self._token_histogram, self._choice_counter), performs the
_resolve_stream_token_usage call, calls _set_token_usage when prompt_tokens is
not None, and wraps everything in the existing try/except that logs failures;
then replace the duplicated blocks in both completion paths with a call to this
new helper to avoid sync/async drift and reduce maintenance risk.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c02d7315-7c52-4ce0-8876-f6ebb9491cce

📥 Commits

Reviewing files that changed from the base of the PR and between c999e5f and 037af04.

📒 Files selected for processing (2)
  • packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py
  • packages/opentelemetry-instrumentation-anthropic/tests/test_semconv_span_attrs.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/opentelemetry-instrumentation-anthropic/tests/test_semconv_span_attrs.py

@max-deygin-traceloop self-requested a review (April 12, 2026 12:22)
Contributor

@max-deygin-traceloop max-deygin-traceloop left a comment


Minor docstring nitpick, otherwise LGTM

carsonjc04 and others added 2 commits April 13, 2026 14:36
…/instrumentation/anthropic/streaming.py


Thank you max-deygin-traceloop for the inconsistency callout.

Co-authored-by: max-deygin-traceloop <max@traceloop.com>
Contributor

LGTM @carsonjc04 if you don't mind, please sign the CLA and we can merge it

@max-deygin-traceloop self-requested a review (April 14, 2026 09:32)
Contributor

@max-deygin-traceloop max-deygin-traceloop left a comment


LGTM!



Development

Successfully merging this pull request may close these issues.

Streaming with base64 images inflates input token count

3 participants