fix(ccusage): improve deduplication to keep records with most tokens #825
jaried wants to merge 1 commit into ryoppippi:main
Conversation
When multiple records share the same messageId (common with API proxies that don't provide requestId), the previous logic would keep the first record encountered, which often had 0 or minimal tokens. This change keeps the record with the highest total token count instead.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
📝 Walkthrough
Refactored deduplication logic in the data loader to replace per-entry hash tracking with a token-based best-record mechanism. Entries are now deduplicated by storing the highest-token variant for each unique hash, with special handling for entries that cannot be hashed.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@apps/ccusage/src/data-loader.ts`:
- Around line 515-520: getTotalTokens currently sums only
message.usage.input_tokens and output_tokens; update it to also add any cache
token counts (e.g., message.usage.cache_tokens or the exact cache fields used in
the UsageData shape) so dedup keeps records with cache usage, and adjust the
unit test around Line 4320 to include entries with differing cache token values
and assert the function returns the combined total including cache tokens.
Locate and modify the getTotalTokens function and the corresponding test to
reference the exact cache token property names present in UsageData.
```ts
/**
 * Calculate total tokens for a usage data entry
 */
export function getTotalTokens(data: UsageData): number {
	return data.message.usage.input_tokens + data.message.usage.output_tokens;
}
```
Include cache tokens in the dedup comparison.
Right now getTotalTokens only sums input/output. If duplicates differ only in cache tokens, the “best” record may still drop cache usage (and cost). This undermines the goal of keeping the most complete record. Please also update the getTotalTokens test around Line 4320 to cover cache fields.
🔧 Proposed fix

```diff
 export function getTotalTokens(data: UsageData): number {
-	return data.message.usage.input_tokens + data.message.usage.output_tokens;
+	const usage = data.message.usage;
+	return (
+		usage.input_tokens +
+		usage.output_tokens +
+		(usage.cache_creation_input_tokens ?? 0) +
+		(usage.cache_read_input_tokens ?? 0)
+	);
 }
```
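As a self-contained sketch, the cache-inclusive variant could look like the following. The cache field names (`cache_creation_input_tokens`, `cache_read_input_tokens`) are taken from the reviewer's proposed fix and assumed to match the `UsageData` shape; verify them against the actual type in `data-loader.ts`.

```typescript
// Minimal stand-in types, assumed from the PR discussion; the real
// UsageData type in data-loader.ts may carry more fields.
type Usage = {
	input_tokens: number;
	output_tokens: number;
	cache_creation_input_tokens?: number;
	cache_read_input_tokens?: number;
};

type UsageData = { message: { usage: Usage } };

// Sum all token kinds so dedup prefers the most complete record,
// even when duplicates differ only in cache usage.
function getTotalTokens(data: UsageData): number {
	const usage = data.message.usage;
	return (
		usage.input_tokens +
		usage.output_tokens +
		(usage.cache_creation_input_tokens ?? 0) +
		(usage.cache_read_input_tokens ?? 0)
	);
}
```

With this version, a duplicate that adds only cache reads scores higher than its cache-free twin, so the dedup pass keeps it.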
Background
This PR is an improved version of #775, which was closed due to an incomplete fix.
PR #775 Problem: Some API proxies don't include
requestIdin usage data. Previously, entries withoutrequestIdreturnednullfromcreateUniqueHash, leading to duplicate counting.PR #775 Solution: Fall back to using only
messageIdwhenrequestIdis missing.PR #775 Issue: While #775 fixed the deduplication key problem, it didn't address which record to keep. The old logic kept the first record encountered, which often had 0 or minimal tokens (streaming intermediate states), causing token data loss.
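The #775 fallback can be sketched as below. This is an illustration of the behavior the PR describes, not the actual implementation; the `messageId:requestId` key format follows the "Before vs After" table, and the real `createUniqueHash` in `data-loader.ts` may differ.

```typescript
// Build the dedup key from messageId and requestId, falling back to
// messageId alone when requestId is absent (the #775 fix). Returns null
// only when the entry has no messageId at all and cannot be deduped.
function createUniqueHash(
	messageId: string | undefined,
	requestId: string | undefined,
): string | null {
	if (messageId == null) {
		return null; // no usable identity: entry cannot be deduplicated
	}
	if (requestId == null) {
		return messageId; // fallback from #775: key on messageId alone
	}
	return `${messageId}:${requestId}`;
}
```

Under the old behavior, the missing-`requestId` case returned `null`, so proxy entries were never deduplicated.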
This PR's Solution
This PR includes the fix from #775 and adds a new strategy: keep the record with the most tokens.
Changes
- `createUniqueHash`: Returns `messageId` when `requestId` is missing (from fix: deduplication fallback to message ID when request ID is missing #775)
- `getTotalTokens`: Calculates `input_tokens + output_tokens` for comparison
- `shouldReplaceExisting`: Compares token counts to decide which record to keep
- Changed `Set<string>` to `Map<string, Record>` to track best records
- Removed `isDuplicateEntry`, `markAsProcessed`

Before vs After
| | Before | After |
|---|---|---|
| Dedup key | `messageId:requestId` (null if missing `requestId`) | `messageId:requestId` or `messageId` |

Files Changed
apps/ccusage/src/data-loader.ts

Testing
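The token-based best-record mechanism described under Changes can be sketched as follows. The names `shouldReplaceExisting` and the `Map`-based tracking come from the PR description; the `Entry` shape and `dedupe` wrapper are simplified stand-ins for illustration.

```typescript
// Simplified stand-in for a loaded usage record: its dedup hash
// (null when it cannot be hashed) and its precomputed token total.
type Entry = { hash: string | null; totalTokens: number };

// Keep the candidate only if it carries more tokens than the record
// we already hold for the same hash.
function shouldReplaceExisting(existing: Entry, candidate: Entry): boolean {
	return candidate.totalTokens > existing.totalTokens;
}

// Map<hash, best record so far> replaces the old Set<string> of seen
// hashes: instead of dropping later duplicates, each duplicate competes
// and the highest-token variant survives. Unhashable entries are kept
// as-is, mirroring the special handling noted in the walkthrough.
function dedupe(entries: Entry[]): Entry[] {
	const best = new Map<string, Entry>();
	const unhashable: Entry[] = [];
	for (const entry of entries) {
		if (entry.hash == null) {
			unhashable.push(entry);
			continue;
		}
		const existing = best.get(entry.hash);
		if (existing == null || shouldReplaceExisting(existing, entry)) {
			best.set(entry.hash, entry);
		}
	}
	return [...best.values(), ...unhashable];
}
```

This is why a streaming intermediate record with 0 tokens no longer shadows the final record: the final, fuller record replaces it in the map rather than being skipped.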