fix(ccusage): improve deduplication to keep records with most tokens#825

Closed
jaried wants to merge 1 commit into ryoppippi:main from jaried:fix/deduplication-keep-max-tokens

Conversation


@jaried jaried commented Jan 27, 2026

Background

This PR is an improved version of #775, which was closed due to an incomplete fix.

PR #775 Problem: Some API proxies don't include requestId in usage data. Previously, entries without requestId returned null from createUniqueHash, leading to duplicate counting.

PR #775 Solution: Fall back to using only messageId when requestId is missing.

PR #775 Issue: While #775 fixed the deduplication key problem, it didn't address which record to keep. The old logic kept the first record encountered, which often had 0 or minimal tokens (streaming intermediate states), causing token data loss.

This PR's Solution

This PR includes the fix from #775 and adds a new strategy: keep the record with the most tokens.

Changes

  1. Modified createUniqueHash: returns messageId when requestId is missing (the fix from #775, "fix: deduplication fallback to message ID when request ID is missing")
  2. Added getTotalTokens: Calculates input_tokens + output_tokens for comparison
  3. Added shouldReplaceExisting: Compares token counts to decide which record to keep
  4. Changed deduplication structure: From Set<string> to Map<string, Record> to track best records
  5. Removed unused functions: isDuplicateEntry, markAsProcessed
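As a rough sketch, the helpers from points 1–3 could look like the following. The reduced `UsageData` shape here is an assumption for illustration (only `getTotalTokens`'s body is quoted from the actual diff; the real type lives in `data-loader.ts`):

```ts
// Shape assumed for illustration; the real UsageData type lives in data-loader.ts.
type UsageData = {
	message: { id?: string; usage: { input_tokens: number; output_tokens: number } };
	requestId?: string;
};

// Fall back to messageId alone when requestId is missing (the #775 fix).
export function createUniqueHash(data: UsageData): string | null {
	const messageId = data.message.id;
	if (messageId == null) {
		return null;
	}
	return data.requestId != null ? `${messageId}:${data.requestId}` : messageId;
}

// Total tokens used to rank duplicate records.
export function getTotalTokens(data: UsageData): number {
	return data.message.usage.input_tokens + data.message.usage.output_tokens;
}

// Replace the stored record only when the candidate carries strictly more tokens.
export function shouldReplaceExisting(existing: UsageData, candidate: UsageData): boolean {
	return getTotalTokens(candidate) > getTotalTokens(existing);
}
```

With these pieces, deduplication becomes a `Map` lookup followed by a `shouldReplaceExisting` check, rather than a `Set` membership test.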

Before vs After

| Metric | Before | After |
| --- | --- | --- |
| Deduplication key | `messageId:requestId` (null if requestId missing) | `messageId:requestId`, or `messageId` as fallback |
| Strategy | Keep first record | Keep record with most tokens |
| Records without requestId | Not deduplicated (all kept) | Deduplicated, keeping the highest-token record |

Files Changed

  • apps/ccusage/src/data-loader.ts

Testing

  • Updated existing test case expectations
  • All tests pass

When multiple records share the same messageId (common with API proxies
that don't provide requestId), the previous logic would keep the first
record encountered, which often had 0 or minimal tokens.

This change keeps the record with the highest total token count instead.
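A minimal sketch of that strategy, with illustrative names (`deduplicate` and `hashOf` are not from the PR; `bestRecords` and `noHashRecords` follow the walkthrough below, and the `UsageData` shape is reduced for the example):

```ts
// Illustrative shapes; field names follow the PR description, not necessarily the real code.
type UsageData = {
	message: { id?: string; usage: { input_tokens: number; output_tokens: number } };
	requestId?: string;
};

const totalTokens = (d: UsageData): number =>
	d.message.usage.input_tokens + d.message.usage.output_tokens;

const hashOf = (d: UsageData): string | null => {
	const id = d.message.id;
	if (id == null) {
		return null;
	}
	return d.requestId != null ? `${id}:${d.requestId}` : id;
};

// Keep the highest-token variant per hash; entries that cannot be hashed are kept as-is.
export function deduplicate(entries: UsageData[]): UsageData[] {
	const bestRecords = new Map<string, UsageData>();
	const noHashRecords: UsageData[] = [];
	for (const entry of entries) {
		const hash = hashOf(entry);
		if (hash == null) {
			noHashRecords.push(entry);
			continue;
		}
		const existing = bestRecords.get(hash);
		if (existing === undefined || totalTokens(entry) > totalTokens(existing)) {
			bestRecords.set(hash, entry);
		}
	}
	return [...bestRecords.values(), ...noHashRecords];
}
```

For example, a 0-token streaming intermediate and a 50-token final record sharing `messageId` "m1" collapse to the 50-token record, instead of whichever came first in the file.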

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai Bot commented Jan 27, 2026

📝 Walkthrough

Refactored deduplication logic in the data loader to replace per-entry hash tracking with a token-based best-record mechanism. Entries are now deduplicated by storing the highest-token variant for each unique hash, with special handling for entries that cannot be hashed.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Token-based deduplication refactor**<br>`apps/ccusage/src/data-loader.ts` | Replaced chronological dedup with a best-record mechanism: introduced a `bestRecords` map to track highest-token entries per hash, and `noHashRecords` for unhashable entries. Added helper functions `getTotalTokens()` and `shouldReplaceExisting()`. Updated `createUniqueHash()` to fall back to messageId-only when requestId is missing. Updated dedup paths in daily, session, and session-block loading to use the new approach. Updated test expectations for token-based replacement semantics. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • ryoppippi

Poem

🐰 Tokens tale so grand, a best-record plan!
No more simple dates, now we count the weight,
Each entry's worth is judged by tokens fair,
The strongest one survives with utmost care! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately and concisely describes the main change: improving deduplication to keep records with the most tokens, which directly matches the primary objective and file-level changes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@apps/ccusage/src/data-loader.ts`:
- Around line 515-520: getTotalTokens currently sums only
message.usage.input_tokens and output_tokens; update it to also add any cache
token counts (e.g., message.usage.cache_tokens or the exact cache fields used in
the UsageData shape) so dedup keeps records with cache usage, and adjust the
unit test around Line 4320 to include entries with differing cache token values
and assert the function returns the combined total including cache tokens.
Locate and modify the getTotalTokens function and the corresponding test to
reference the exact cache token property names present in UsageData.

Comment on lines +515 to +520
```ts
/**
 * Calculate total tokens for a usage data entry
 */
export function getTotalTokens(data: UsageData): number {
	return data.message.usage.input_tokens + data.message.usage.output_tokens;
}
```


⚠️ Potential issue | 🟠 Major

Include cache tokens in the dedup comparison.

Right now getTotalTokens only sums input/output. If duplicates differ only in cache tokens, the “best” record may still drop cache usage (and cost). This undermines the goal of keeping the most complete record. Please also update the getTotalTokens test around Line 4320 to cover cache fields.

🔧 Proposed fix
```diff
 export function getTotalTokens(data: UsageData): number {
-	return data.message.usage.input_tokens + data.message.usage.output_tokens;
+	const usage = data.message.usage;
+	return (
+		usage.input_tokens +
+		usage.output_tokens +
+		(usage.cache_creation_input_tokens ?? 0) +
+		(usage.cache_read_input_tokens ?? 0)
+	);
 }
```

@jaried jaried closed this Jan 27, 2026
@jaried jaried deleted the fix/deduplication-keep-max-tokens branch February 1, 2026 14:17
