Skip to content

fix(anthropic): streaming base64 images inflate input token count#3975

Open
karthikbolla wants to merge 3 commits intotraceloop:mainfrom
karthikbolla:fix/streaming-base64-image-token-inflation
Open

fix(anthropic): streaming base64 images inflate input token count#3975
karthikbolla wants to merge 3 commits intotraceloop:mainfrom
karthikbolla:fix/streaming-base64-image-token-inflation

Conversation

@karthikbolla
Copy link
Copy Markdown

@karthikbolla karthikbolla commented Apr 11, 2026

Summary

Fixes a bug where using client.messages.stream() with base64 images causes
the OTel span's gen_ai.usage.input_tokens attribute to be significantly
inflated compared to what the Anthropic API actually reports.

Root Cause

In streaming.py, _set_token_usage() computed input tokens as:

input_tokens = prompt_tokens + cache_read_tokens + cache_creation_tokens

where prompt_tokens was already usage["input_tokens"] from the API — the
total billed input count that already includes image tokens and all cached
token sub-components. Adding the cache sub-fields again caused double-counting.

The same arithmetic flaw also exists in the sync and async _set_token_usage()
functions in __init__.py, which this PR also fixes for consistency.

Fix Approach

Use usage["input_tokens"] directly as input_tokens. The cache_read_input_tokens
and cache_creation_input_tokens fields are still recorded as their own dedicated
span attributes (for cache observability), but are no longer added into the
GEN_AI_USAGE_INPUT_TOKENS total.

Before / After Behaviour

Method API usage.input_tokens OTel span before fix OTel span after fix
create() 343 343 ✅ 343 ✅
stream() 343 1,633 ❌ 343 ✅

Testing Done

  • Added a regression test test_anthropic_streaming_base64_image_token_count_legacy
    that sends a 500×500 base64 PNG image via stream=True and asserts that the OTel
    span's input_tokens matches the API-reported value (343), not the inflated value (1,633).
  • Verified all existing tests pass with no regressions.

Files Changed

  • packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py
  • packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/__init__.py
  • packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py (new test)
  • packages/opentelemetry-instrumentation-anthropic/tests/cassettes/test_messages/test_anthropic_streaming_base64_image_token_count_legacy.yaml (new cassette)

Fixes #3949

Summary by CodeRabbit

  • Bug Fixes

    • Input token metrics no longer include cached input tokens; cache read/creation tokens are recorded separately and missing values default to zero. Span attributes for input/total tokens are now derived from prompt and completion tokens; token histograms remain based on input tokens.
  • Tests

    • Added a legacy streaming test and accompanying cassette that send a base64 image plus prompt to validate token counting and instrumentation.

The streaming _set_token_usage() function in streaming.py was computing
input_tokens as:

    input_tokens = prompt_tokens + cache_read_tokens + cache_creation_tokens

where prompt_tokens = usage["input_tokens"] from the Anthropic API's
message_start SSE event. However, the API's input_tokens field already
represents the total billed input token count, which includes image
tokens and cached tokens. Adding the cache sub-fields on top caused
double-counting that inflated the span's GEN_AI_USAGE_INPUT_TOKENS
attribute.

For a 500x500 image request, this produced 1,633 in the OTel span
(343 text tokens + ~1,290 image tokens counted twice) instead of the
correct 343 that the API reports.

Fixed by using usage["input_tokens"] directly as input_tokens without
adding cache sub-fields. The cache_read_input_tokens and
cache_creation_input_tokens are still recorded as dedicated span
attributes for cache-hit observability.

Applied the same fix consistently to _set_token_usage() and
_aset_token_usage() in __init__.py.

Fixes traceloop#3949
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


karthik bolla seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7f021fe5-e33c-4f5c-8e50-68711d895af2

📥 Commits

Reviewing files that changed from the base of the PR and between e37bbb4 and 0cb7a8d.

📒 Files selected for processing (1)
  • packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py

📝 Walkthrough

Walkthrough

Input token computation for Anthropic streaming was changed: spans now use prompt_tokens as input_tokens (excluding cached input tokens); cache read/creation counts are extracted separately (default 0 when missing). A VCR cassette and a legacy streaming test were added.

Changes

Cohort / File(s) Summary
Token calculation fixes
packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/__init__.py, packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py
input_tokens now set from prompt_tokens only (sync + async). cache_read_tokens / cache_creation_tokens are read conditionally (default 0) and recorded as separate span attributes; histograms use updated input_tokens.
Streaming test cassette & test
packages/opentelemetry-instrumentation-anthropic/tests/cassettes/test_messages/test_anthropic_streaming_base64_image_token_count_legacy.yaml, packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py
Added VCR cassette for a streaming interaction with a base64 PNG and a legacy streaming test test_anthropic_streaming_base64_image_token_count_legacy asserting GEN_AI_USAGE_INPUT_TOKENS equals API usage (343).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I nibbled bytes of base64 cheer,
Prompt tokens now the ones we hear,
Cache counts sit politely near,
Spans report true numbers clear,
Hooray — telemetry hops sincere!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main fix: correcting inflated input token counts when streaming base64 images, which is the primary issue addressed across multiple files.
Linked Issues check ✅ Passed The PR correctly addresses issue #3949 by removing double-counting of cache and image tokens from input_tokens calculations in both streaming and sync/async paths, ensuring OTel spans match API-reported usage.
Out of Scope Changes check ✅ Passed All changes directly target the token inflation bug: modifications to init.py and streaming.py fix the core issue, the test and cassette validate the fix, and formatting is incidental.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py (1)

2834-2836: Add a span-shape assertion before indexing spans[0].

Line 2835 assumes at least one finished span. A direct shape assertion makes this test less brittle and failures clearer.

Proposed test hardening
     spans = span_exporter.get_finished_spans()
+    assert [span.name for span in spans] == ["anthropic.chat"]
     anthropic_span = spans[0]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py`
around lines 2834 - 2836, Add an explicit shape assertion before indexing
spans[0] to avoid an IndexError and make failures clearer: after calling
span_exporter.get_finished_spans() assert that spans is non-empty (e.g. assert
spans or assert len(spans) >= 1) before assigning anthropic_span = spans[0];
reference the variables span_exporter, get_finished_spans, spans, and
anthropic_span when locating the insertion point.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py`:
- Line 2842: The file ends without a trailing newline causing Ruff W292; fix by
adding a single newline character at the end of
packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py (i.e.,
after the final closing ")" shown in the diff) so the file terminates with a
newline.

---

Nitpick comments:
In `@packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py`:
- Around line 2834-2836: Add an explicit shape assertion before indexing
spans[0] to avoid an IndexError and make failures clearer: after calling
span_exporter.get_finished_spans() assert that spans is non-empty (e.g. assert
spans or assert len(spans) >= 1) before assigning anthropic_span = spans[0];
reference the variables span_exporter, get_finished_spans, spans, and
anthropic_span when locating the insertion point.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8a192af8-2666-42d1-87b2-bca451dc9b25

📥 Commits

Reviewing files that changed from the base of the PR and between 786d49f and 1eb6ece.

📒 Files selected for processing (4)
  • packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/__init__.py
  • packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py
  • packages/opentelemetry-instrumentation-anthropic/tests/cassettes/test_messages/test_anthropic_streaming_base64_image_token_count_legacy.yaml
  • packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py

Comment thread packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py (1)

2843-2843: ⚠️ Potential issue | 🟡 Minor

Add trailing newline at EOF (Ruff W292).

Line 2843 still leaves the file without a final newline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py` at
line 2843, The file
packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py is
missing a trailing newline at EOF; fix this by editing test_messages.py and
ensure the file ends with a single newline character (add a final blank line at
the end of the file so the last line terminates properly), then re-run linting
to confirm Ruff W292 is resolved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py`:
- Around line 2825-2832: The test currently invokes
anthropic_client.messages.create(..., stream=True) but the regression affects
the messages.stream() path; update or add a test that calls
anthropic_client.messages.stream(model="claude-haiku-4-5-20251001",
max_tokens=100, messages=messages) (or equivalent named parameters) and iterate
over the returned iterator (for _ in response: pass) to exercise the direct
messages.stream() API; ensure the new test uses the same inputs as the existing
base64-image test so the exact API path is covered.

---

Duplicate comments:
In `@packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py`:
- Line 2843: The file
packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py is
missing a trailing newline at EOF; fix this by editing test_messages.py and
ensure the file ends with a single newline character (add a final blank line at
the end of the file so the last line terminates properly), then re-run linting
to confirm Ruff W292 is resolved.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e9c85d10-6308-47da-b4f1-0f26cc1d2d5c

📥 Commits

Reviewing files that changed from the base of the PR and between 1eb6ece and e37bbb4.

📒 Files selected for processing (1)
  • packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py

Comment thread packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py Outdated
@max-deygin-traceloop
Copy link
Copy Markdown
Contributor

Can't reproduce the describe bug using Anthropic API.

I tested this end-to-end on main, using your added test. Sent a real
500×500 PNG image through client.messages.stream() with
instrumentation enabled.

Results:

Attribute Value
gen_ai.usage.input_tokens 343
gen_ai.usage.output_tokens 7
gen_ai.usage.total_tokens 350

The main branch computes 343 + 0 (cache_create) + 0 (cache_read) = 343 — no inflation.

  1. The test uses a 1×1 pixel PNG
    (iVBORw0KGgo...kJggg==) however the cassette hardcodes input_tokens: 343, which is the token count for a 500×500 image. When I
    re-record the cassette with --record-mode=all, the real API
    returns input_tokens: 20 for that 1×1 image, and the test fails
    with assert 20 == 343.

  2. The input_tokens field excludes cache tokens. Per the
    [Anthropic docs](https://docs.anthropic.com/en/docs/build-with-clau
    de/prompt-caching), input_tokens represents only the non-cached
    portion. We verified this with prompt caching enabled — a cached
    11k-token system prompt returns input_tokens: 7, cache_read_input_tokens: 11041. The current addition
    (input_tokens + cache_read + cache_creation) is necessary to
    compute the correct total.

  3. The fix would break OTel semconv compliance. The [GenAI
    semantic conventions](https://opentelemetry.io/docs/specs/semconv/g
    en-ai/gen-ai-spans/) state that gen_ai.usage.input_tokens "SHOULD
    include all types of input tokens, including cached tokens."
    Removing the cache token addition would undercount input tokens
    when prompt caching is active (reporting 7 instead of 11,048 in our
    test).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Streaming with base64 images inflates input token count

3 participants