@Hyeri1ee
Contributor

@Hyeri1ee Hyeri1ee commented Sep 7, 2025

Summary

This PR improves JTokkitTokenCountEstimator by Base64-encoding binary data before counting its tokens, rather than adding the raw byte length to the total. Binary content is commonly sent to multimodal LLMs as Base64 text, so the estimate now tracks the actual payload more closely and gives better cost predictions.

Changes

  • Enhancement: The estimate() method now handles byte[] data by converting it to a Base64 string and counting the tokens of that string. This replaces the previous behavior of adding the raw byte array length to the count, which the original code itself flagged with the comment // This is likely incorrect.

  • Validation: Existing tests in TokenCountBatchingStrategyTests have been verified to ensure no regressions.
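
The change described above can be illustrated with a minimal, hedged sketch. This is not the actual JTokkitTokenCountEstimator source: the class name `BinaryTokenEstimateSketch`, both method names, and the `ROUGH_COUNTER` stand-in tokenizer (which assumes roughly four characters per token) are hypothetical and exist only so the example runs on the JDK alone. A real estimator would delegate the string counting to a JTokkit encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.function.ToIntFunction;

// Hypothetical sketch; not the actual JTokkitTokenCountEstimator source.
final class BinaryTokenEstimateSketch {

    // Stand-in tokenizer (assumption): roughly 4 characters per token.
    // A real estimator would count tokens with a JTokkit encoding instead.
    static final ToIntFunction<String> ROUGH_COUNTER = text -> (text.length() + 3) / 4;

    // Previous behavior as described in the PR: use the raw byte array length.
    static int estimateByRawLength(byte[] data) {
        return data == null ? 0 : data.length;
    }

    // New behavior as described in the PR: Base64-encode the bytes,
    // then count the tokens of the resulting string.
    static int estimateByBase64(byte[] data, ToIntFunction<String> counter) {
        if (data == null || data.length == 0) {
            return 0;
        }
        String encoded = Base64.getEncoder().encodeToString(data);
        return counter.applyAsInt(encoded);
    }

    public static void main(String[] args) {
        byte[] payload = "hello, binary world".getBytes(StandardCharsets.UTF_8);
        System.out.println("raw length estimate: " + estimateByRawLength(payload));
        System.out.println("base64 estimate:     " + estimateByBase64(payload, ROUGH_COUNTER));
    }
}
```

Passing the counter in as a function keeps the sketch independent of any particular tokenizer; the two estimates differ because one counts bytes and the other counts tokens of the encoded text.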

Key Features

  • Accurate Estimation: Provides a more realistic token count for binary data.

  • LLM Alignment: The new approach mirrors the Base64 encoding used by most multimodal models.

  • Backward Compatibility: This is an internal change and does not affect the public API.
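
As a rough illustration of why the raw byte length and the Base64 form diverge: padded Base64 emits 4 characters for every 3 input bytes, so n bytes become 4 * ceil(n / 3) characters. The sketch below checks that formula against the JDK encoder. The class name `Base64LengthSketch` and the helper `encodedLength` are hypothetical, and how many tokens those characters map to still depends on the tokenizer in use.

```java
import java.util.Base64;

// Hypothetical helper for illustration only.
final class Base64LengthSketch {

    // Length of padded Base64 output for n input bytes: 4 * ceil(n / 3).
    static int encodedLength(int n) {
        return 4 * ((n + 2) / 3);
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 1, 2, 3, 100, 1024}) {
            // Compare the formula with what the JDK encoder actually produces.
            int actual = Base64.getEncoder().encodeToString(new byte[n]).length();
            System.out.println(n + " bytes -> " + actual
                    + " chars (formula: " + encodedLength(n) + ")");
        }
    }
}
```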

Closes: #4331

Signed-off-by: Hyeri1ee <haerizian10@gmail.com>
@Hyeri1ee Hyeri1ee changed the title from "feat(tokenizer): improve token estimation for binary data" to "GH-4331 improve token estimation for binary data" Sep 7, 2025
@ilayaperumalg
Member

@Hyeri1ee Thanks for the PR improving the way token usage is handled for the binary data.

@Hyeri1ee
Contributor Author

Hyeri1ee commented Sep 8, 2025

> @Hyeri1ee Thanks for the PR improving the way token usage is handled for the binary data.

glad it helped! : )

@markpollack markpollack added this to the 1.1.0.M1 milestone Sep 8, 2025
@ilayaperumalg
Member

Rebased and merged as 677730f

@ilayaperumalg
Member

Cherry-picked and pushed to 1.0.x as c60d942

Development

Successfully merging this pull request may close these issues.

JTokkitTokenCountEstimator : Add Base64 support for more accurate binary data token estimation