GH-4331 improve token estimation for binary data #4332
Closed
Summary
This PR improves the JTokkitTokenCountEstimator by converting binary data to a Base64 string for more accurate token estimation. This change aligns the logic with how most multimodal LLMs actually process this data type, leading to better cost predictions.
Changes
Enhancement: The estimate() method now handles byte[] data by converting it to a Base64 string before counting tokens. This replaces the previous, inaccurate behavior of adding the raw byte-array length, which the original code itself flagged with the comment // This is likely incorrect.
Validation: Existing tests in TokenCountBatchingStrategyTests pass, confirming no regressions.
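The change described above can be sketched as follows. This is a minimal illustration, not the actual JTokkitTokenCountEstimator: the countTokens helper is a hypothetical stand-in for a real JTokkit Encoding, using a rough chars/4 heuristic, and the class and method names are invented for this example.

```java
import java.util.Base64;

public class BinaryTokenEstimateSketch {

    // Hypothetical stand-in for JTokkit's Encoding.countTokens(String);
    // a crude chars/4 heuristic is used here so the sketch is self-contained.
    static int countTokens(String text) {
        return Math.max(1, text.length() / 4);
    }

    // Previous behavior: add the raw byte-array length ("likely incorrect").
    static int estimateOld(byte[] data) {
        return data.length;
    }

    // New behavior: encode the bytes to Base64 first, then count tokens on
    // the resulting string, mirroring how most multimodal LLMs receive
    // binary payloads.
    static int estimateNew(byte[] data) {
        String base64 = Base64.getEncoder().encodeToString(data);
        return countTokens(base64);
    }

    public static void main(String[] args) {
        byte[] payload = new byte[300]; // e.g. a small image fragment
        System.out.println("old estimate = " + estimateOld(payload));
        System.out.println("new estimate = " + estimateNew(payload));
    }
}
```

Note that Base64 expands the data by roughly 4/3 (4 output characters per 3 input bytes), so counting tokens over the encoded string tracks what the model actually receives, rather than the raw byte length.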
Key Features
Accurate Estimation: Provides a more realistic token count for binary data.
LLM Alignment: The new approach mirrors the Base64 encoding used by most multimodal models.
Backward Compatibility: This is an internal change and does not affect the public API.
Closes: #4331