GH-4331 improve token estimation for binary data #4332
Closed
Summary
This PR improves the JTokkitTokenCountEstimator by converting binary data to a Base64 string for more accurate token estimation. This change aligns the logic with how most multimodal LLMs actually process this data type, leading to better cost predictions.
Changes
Enhancement: The estimate() method now handles byte[] data by converting it to a Base64 string before counting tokens. This replaces the previous, inaccurate behavior of adding the raw byte-array length, which the original code itself flagged with the comment // This is likely incorrect.
Validation: Existing tests in TokenCountBatchingStrategyTests pass, confirming no regressions.
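The change described above can be sketched as follows. This is a minimal illustration, not the actual JTokkitTokenCountEstimator: the countTokens helper is a hypothetical stand-in for a real JTokkit Encoding, using a rough chars/4 heuristic, and the class and method names are invented for this example.

```java
import java.util.Base64;

public class BinaryTokenEstimateSketch {

    // Hypothetical stand-in for JTokkit's Encoding.countTokens(String);
    // a crude chars/4 heuristic is used here so the sketch is self-contained.
    static int countTokens(String text) {
        return Math.max(1, text.length() / 4);
    }

    // Previous behavior: add the raw byte-array length ("likely incorrect").
    static int estimateOld(byte[] data) {
        return data.length;
    }

    // New behavior: encode the bytes to Base64 first, then count tokens on
    // the resulting string, mirroring how most multimodal LLMs receive
    // binary payloads.
    static int estimateNew(byte[] data) {
        String base64 = Base64.getEncoder().encodeToString(data);
        return countTokens(base64);
    }

    public static void main(String[] args) {
        byte[] payload = new byte[300]; // e.g. a small image fragment
        System.out.println("old estimate = " + estimateOld(payload));
        System.out.println("new estimate = " + estimateNew(payload));
    }
}
```

Note that Base64 expands the data by roughly 4/3 (4 output characters per 3 input bytes), so counting tokens over the encoded string tracks what the model actually receives, rather than the raw byte length.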
Key Features
Accurate Estimation: Provides a more realistic token count for binary data.
LLM Alignment: The new approach mirrors the Base64 encoding used by most multimodal models.
Backward Compatibility: This is an internal change and does not affect the public API.
Closes: #4331