fix(anthropic): streaming base64 images inflate input token count by karthikbolla · Pull Request #3975 · traceloop/openllmetry

karthikbolla · 2026-04-11T22:50:52Z

Summary

Fixes a bug where using client.messages.stream() with base64 images causes
the OTel span's gen_ai.usage.input_tokens attribute to be significantly
inflated compared to what the Anthropic API actually reports.

Root Cause

In streaming.py, _set_token_usage() computed input tokens as:

input_tokens = prompt_tokens + cache_read_tokens + cache_creation_tokens

where prompt_tokens was already usage["input_tokens"] from the API — the
total billed input count that already includes image tokens and all cached
token sub-components. Adding the cache sub-fields again caused double-counting.

The same arithmetic flaw also exists in the sync and async _set_token_usage()
functions in __init__.py, which this PR also fixes for consistency.

Fix Approach

Use usage["input_tokens"] directly as input_tokens. The cache_read_input_tokens
and cache_creation_input_tokens fields are still recorded as their own dedicated
span attributes (for cache observability), but are no longer added into the
GEN_AI_USAGE_INPUT_TOKENS total.

Before / After Behaviour

Method	API `usage.input_tokens`	OTel span before fix	OTel span after fix
`create()`	343	343 ✅	343 ✅
`stream()`	343	1,633 ❌	343 ✅

Testing Done

Added a regression test test_anthropic_streaming_base64_image_token_count_legacy
that sends a 500×500 base64 PNG image via stream=True and asserts that the OTel
span's input_tokens matches the API-reported value (343), not the inflated value (1,633).
Verified all existing tests pass with no regressions.

Files Changed

packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py
packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/__init__.py
packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py (new test)
packages/opentelemetry-instrumentation-anthropic/tests/cassettes/test_messages/test_anthropic_streaming_base64_image_token_count_legacy.yaml (new cassette)

Fixes #3949

Summary by CodeRabbit

Bug Fixes
- Input token metrics no longer include cached input tokens; cache read/creation tokens are recorded separately and missing values default to zero. Span attributes for input/total tokens are now derived from prompt and completion tokens; token histograms remain based on input tokens.
Tests
- Added a legacy streaming test and accompanying cassette that send a base64 image plus prompt to validate token counting and instrumentation.

The streaming _set_token_usage() function in streaming.py was computing input_tokens as: input_tokens = prompt_tokens + cache_read_tokens + cache_creation_tokens where prompt_tokens = usage["input_tokens"] from the Anthropic API's message_start SSE event. However, the API's input_tokens field already represents the total billed input token count, which includes image tokens and cached tokens. Adding the cache sub-fields on top caused double-counting that inflated the span's GEN_AI_USAGE_INPUT_TOKENS attribute. For a 500x500 image request, this produced 1,633 in the OTel span (343 text tokens + ~1,290 image tokens counted twice) instead of the correct 343 that the API reports. Fixed by using usage["input_tokens"] directly as input_tokens without adding cache sub-fields. The cache_read_input_tokens and cache_creation_input_tokens are still recorded as dedicated span attributes for cache-hit observability. Applied the same fix consistently to _set_token_usage() and _aset_token_usage() in __init__.py. Fixes traceloop#3949

CLAassistant · 2026-04-11T22:50:58Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

karthik bolla seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

coderabbitai · 2026-04-11T22:51:07Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7f021fe5-e33c-4f5c-8e50-68711d895af2

📥 Commits

Reviewing files that changed from the base of the PR and between e37bbb4 and 0cb7a8d.

📒 Files selected for processing (1)

packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py

🚧 Files skipped from review as they are similar to previous changes (1)

packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py

📝 Walkthrough

Walkthrough

Input token computation for Anthropic streaming was changed: spans now use prompt_tokens as input_tokens (excluding cached input tokens); cache read/creation counts are extracted separately (default 0 when missing). A VCR cassette and a legacy streaming test were added.

Changes

Cohort / File(s)	Summary
Token calculation fixes `packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/__init__.py`, `packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py`	`input_tokens` now set from `prompt_tokens` only (sync + async). `cache_read_tokens` / `cache_creation_tokens` are read conditionally (default 0) and recorded as separate span attributes; histograms use updated `input_tokens`.
Streaming test cassette & test `packages/opentelemetry-instrumentation-anthropic/tests/cassettes/test_messages/test_anthropic_streaming_base64_image_token_count_legacy.yaml`, `packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py`	Added VCR cassette for a streaming interaction with a base64 PNG and a legacy streaming test `test_anthropic_streaming_base64_image_token_count_legacy` asserting `GEN_AI_USAGE_INPUT_TOKENS` equals API usage (343).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I nibbled bytes of base64 cheer,
Prompt tokens now the ones we hear,
Cache counts sit politely near,
Spans report true numbers clear,
Hooray — telemetry hops sincere!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main fix: correcting inflated input token counts when streaming base64 images, which is the primary issue addressed across multiple files.
Linked Issues check	✅ Passed	The PR correctly addresses issue `#3949` by removing double-counting of cache and image tokens from input_tokens calculations in both streaming and sync/async paths, ensuring OTel spans match API-reported usage.
Out of Scope Changes check	✅ Passed	All changes directly target the token inflation bug: modifications to init.py and streaming.py fix the core issue, the test and cassette validate the fix, and formatting is incidental.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py (1)

2834-2836: Add a span-shape assertion before indexing spans[0].

Line 2835 assumes at least one finished span. A direct shape assertion makes this test less brittle and failures clearer.

Proposed test hardening

     spans = span_exporter.get_finished_spans()
+    assert [span.name for span in spans] == ["anthropic.chat"]
     anthropic_span = spans[0]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py`
around lines 2834 - 2836, Add an explicit shape assertion before indexing
spans[0] to avoid an IndexError and make failures clearer: after calling
span_exporter.get_finished_spans() assert that spans is non-empty (e.g. assert
spans or assert len(spans) >= 1) before assigning anthropic_span = spans[0];
reference the variables span_exporter, get_finished_spans, spans, and
anthropic_span when locating the insertion point.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py`:
- Line 2842: The file ends without a trailing newline causing Ruff W292; fix by
adding a single newline character at the end of
packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py (i.e.,
after the final closing ")" shown in the diff) so the file terminates with a
newline.

---

Nitpick comments:
In `@packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py`:
- Around line 2834-2836: Add an explicit shape assertion before indexing
spans[0] to avoid an IndexError and make failures clearer: after calling
span_exporter.get_finished_spans() assert that spans is non-empty (e.g. assert
spans or assert len(spans) >= 1) before assigning anthropic_span = spans[0];
reference the variables span_exporter, get_finished_spans, spans, and
anthropic_span when locating the insertion point.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8a192af8-2666-42d1-87b2-bca451dc9b25

📥 Commits

Reviewing files that changed from the base of the PR and between 786d49f and 1eb6ece.

📒 Files selected for processing (4)

packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/__init__.py
packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/streaming.py
packages/opentelemetry-instrumentation-anthropic/tests/cassettes/test_messages/test_anthropic_streaming_base64_image_token_count_legacy.yaml
packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py (1)
2843-2843: ⚠️ Potential issue | 🟡 Minor

Add trailing newline at EOF (Ruff W292).

Line 2843 still leaves the file without a final newline.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py` at
line 2843, The file
packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py is
missing a trailing newline at EOF; fix this by editing test_messages.py and
ensure the file ends with a single newline character (add a final blank line at
the end of the file so the last line terminates properly), then re-run linting
to confirm Ruff W292 is resolved.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py`:
- Around line 2825-2832: The test currently invokes
anthropic_client.messages.create(..., stream=True) but the regression affects
the messages.stream() path; update or add a test that calls
anthropic_client.messages.stream(model="claude-haiku-4-5-20251001",
max_tokens=100, messages=messages) (or equivalent named parameters) and iterate
over the returned iterator (for _ in response: pass) to exercise the direct
messages.stream() API; ensure the new test uses the same inputs as the existing
base64-image test so the exact API path is covered.

---

Duplicate comments:
In `@packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py`:
- Line 2843: The file
packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py is
missing a trailing newline at EOF; fix this by editing test_messages.py and
ensure the file ends with a single newline character (add a final blank line at
the end of the file so the last line terminates properly), then re-run linting
to confirm Ruff W292 is resolved.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e9c85d10-6308-47da-b4f1-0f26cc1d2d5c

📥 Commits

Reviewing files that changed from the base of the PR and between 1eb6ece and e37bbb4.

📒 Files selected for processing (1)

packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py

….stream

max-deygin-traceloop · 2026-04-12T11:09:20Z

Can't reproduce the describe bug using Anthropic API.

I tested this end-to-end on main, using your added test. Sent a real
500×500 PNG image through client.messages.stream() with
instrumentation enabled.

Results:

Attribute	Value
`gen_ai.usage.input_tokens`	343
`gen_ai.usage.output_tokens`	7
`gen_ai.usage.total_tokens`	350

The main branch computes 343 + 0 (cache_create) + 0 (cache_read) = 343 — no inflation.

The test uses a 1×1 pixel PNG
(iVBORw0KGgo...kJggg==) however the cassette hardcodes input_tokens: 343, which is the token count for a 500×500 image. When I
re-record the cassette with --record-mode=all, the real API
returns input_tokens: 20 for that 1×1 image, and the test fails
with assert 20 == 343.
The input_tokens field excludes cache tokens. Per the
[Anthropic docs](https://docs.anthropic.com/en/docs/build-with-clau
de/prompt-caching), input_tokens represents only the non-cached
portion. We verified this with prompt caching enabled — a cached
11k-token system prompt returns input_tokens: 7, cache_read_input_tokens: 11041. The current addition
(input_tokens + cache_read + cache_creation) is necessary to
compute the correct total.
The fix would break OTel semconv compliance. The [GenAI
semantic conventions](https://opentelemetry.io/docs/specs/semconv/g
en-ai/gen-ai-spans/) state that gen_ai.usage.input_tokens "SHOULD
include all types of input tokens, including cached tokens."
Removing the cache token addition would undercount input tokens
when prompt caching is active (reporting 7 instead of 11,048 in our
test).

coderabbitai bot reviewed Apr 11, 2026

View reviewed changes

Comment thread packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py Outdated

Added newline

e37bbb4

coderabbitai bot reviewed Apr 11, 2026

View reviewed changes

Comment thread packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py Outdated

changed anthropic_client.messages.create to anthropic_client.messages…

0cb7a8d

….stream

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(anthropic): streaming base64 images inflate input token count#3975

fix(anthropic): streaming base64 images inflate input token count#3975
karthikbolla wants to merge 3 commits intotraceloop:mainfrom
karthikbolla:fix/streaming-base64-image-token-inflation

karthikbolla commented Apr 11, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

CLAassistant commented Apr 11, 2026

Uh oh!

coderabbitai bot commented Apr 11, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

max-deygin-traceloop commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

karthikbolla commented Apr 11, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause

Fix Approach

Before / After Behaviour

Testing Done

Files Changed

Summary by CodeRabbit

Uh oh!

CLAassistant commented Apr 11, 2026

Uh oh!

coderabbitai bot commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

max-deygin-traceloop commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

karthikbolla commented Apr 11, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 11, 2026 •

edited

Loading