fix(ccusage): improve deduplication to keep records with most tokens#825

Closed
jaried wants to merge 1 commit into ryoppippi:main from jaried:fix/deduplication-keep-max-tokens

Conversation


@jaried jaried commented Jan 27, 2026

Background

This PR is an improved version of #775, which was closed due to an incomplete fix.

PR #775 Problem: Some API proxies don't include requestId in usage data. Previously, entries without requestId returned null from createUniqueHash, leading to duplicate counting.

PR #775 Solution: Fall back to using only messageId when requestId is missing.

PR #775 Issue: While #775 fixed the deduplication key problem, it didn't address which record to keep. The old logic kept the first record encountered, which often had 0 or minimal tokens (streaming intermediate states), causing token data loss.

This PR's Solution

This PR includes the fix from #775 and adds a new strategy: keep the record with the most tokens.

Changes

  1. Modified createUniqueHash: returns messageId when requestId is missing (the fix from #775, "fix: deduplication fallback to message ID when request ID is missing")
  2. Added getTotalTokens: Calculates input_tokens + output_tokens for comparison
  3. Added shouldReplaceExisting: Compares token counts to decide which record to keep
  4. Changed deduplication structure: From Set<string> to Map<string, Record> to track best records
  5. Removed unused functions: isDuplicateEntry, markAsProcessed
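As a rough sketch, the helpers from points 1–3 could look like the following. The reduced `UsageData` shape here is an assumption for illustration (only `getTotalTokens`'s body is quoted from the actual diff; the real type lives in `data-loader.ts`):

```ts
// Shape assumed for illustration; the real UsageData type lives in data-loader.ts.
type UsageData = {
	message: { id?: string; usage: { input_tokens: number; output_tokens: number } };
	requestId?: string;
};

// Fall back to messageId alone when requestId is missing (the #775 fix).
export function createUniqueHash(data: UsageData): string | null {
	const messageId = data.message.id;
	if (messageId == null) {
		return null;
	}
	return data.requestId != null ? `${messageId}:${data.requestId}` : messageId;
}

// Total tokens used to rank duplicate records.
export function getTotalTokens(data: UsageData): number {
	return data.message.usage.input_tokens + data.message.usage.output_tokens;
}

// Replace the stored record only when the candidate carries strictly more tokens.
export function shouldReplaceExisting(existing: UsageData, candidate: UsageData): boolean {
	return getTotalTokens(candidate) > getTotalTokens(existing);
}
```

With these pieces, deduplication becomes a `Map` lookup followed by a `shouldReplaceExisting` check, rather than a `Set` membership test.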

Before vs After

| Metric | Before | After |
| --- | --- | --- |
| Deduplication key | `messageId:requestId` (null if requestId missing) | `messageId:requestId`, or `messageId` as fallback |
| Strategy | Keep first record | Keep record with most tokens |
| Records without requestId | Not deduplicated (all kept) | Deduplicated, keeping the highest-token record |

Files Changed

  • apps/ccusage/src/data-loader.ts

Testing

  • Updated existing test case expectations
  • All tests pass

When multiple records share the same messageId (common with API proxies
that don't provide requestId), the previous logic would keep the first
record encountered, which often had 0 or minimal tokens.

This change keeps the record with the highest total token count instead.
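A minimal sketch of that strategy, with illustrative names (`deduplicate` and `hashOf` are not from the PR; `bestRecords` and `noHashRecords` follow the walkthrough below, and the `UsageData` shape is reduced for the example):

```ts
// Illustrative shapes; field names follow the PR description, not necessarily the real code.
type UsageData = {
	message: { id?: string; usage: { input_tokens: number; output_tokens: number } };
	requestId?: string;
};

const totalTokens = (d: UsageData): number =>
	d.message.usage.input_tokens + d.message.usage.output_tokens;

const hashOf = (d: UsageData): string | null => {
	const id = d.message.id;
	if (id == null) {
		return null;
	}
	return d.requestId != null ? `${id}:${d.requestId}` : id;
};

// Keep the highest-token variant per hash; entries that cannot be hashed are kept as-is.
export function deduplicate(entries: UsageData[]): UsageData[] {
	const bestRecords = new Map<string, UsageData>();
	const noHashRecords: UsageData[] = [];
	for (const entry of entries) {
		const hash = hashOf(entry);
		if (hash == null) {
			noHashRecords.push(entry);
			continue;
		}
		const existing = bestRecords.get(hash);
		if (existing === undefined || totalTokens(entry) > totalTokens(existing)) {
			bestRecords.set(hash, entry);
		}
	}
	return [...bestRecords.values(), ...noHashRecords];
}
```

For example, a 0-token streaming intermediate and a 50-token final record sharing `messageId` "m1" collapse to the 50-token record, instead of whichever came first in the file.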

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai Bot commented Jan 27, 2026

📝 Walkthrough

Refactored deduplication logic in the data loader to replace per-entry hash tracking with a token-based best-record mechanism. Entries are now deduplicated by storing the highest-token variant for each unique hash, with special handling for entries that cannot be hashed.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Token-based deduplication refactor**<br>`apps/ccusage/src/data-loader.ts` | Replaced chronological dedup with a best-record mechanism: introduced a `bestRecords` map to track highest-token entries per hash, and `noHashRecords` for unhashable entries. Added helper functions `getTotalTokens()` and `shouldReplaceExisting()`. Updated `createUniqueHash()` to fall back to messageId-only when requestId is missing. Updated dedup paths in daily, session, and session-block loading to use the new approach. Updated test expectations for token-based replacement semantics. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • ryoppippi

Poem

🐰 Tokens tale so grand, a best-record plan!
No more simple dates, now we count the weight,
Each entry's worth is judged by tokens fair,
The strongest one survives with utmost care! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately and concisely describes the main change: improving deduplication to keep records with the most tokens, which directly matches the primary objective and file-level changes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@apps/ccusage/src/data-loader.ts`:
- Around line 515-520: getTotalTokens currently sums only
message.usage.input_tokens and output_tokens; update it to also add any cache
token counts (e.g., message.usage.cache_tokens or the exact cache fields used in
the UsageData shape) so dedup keeps records with cache usage, and adjust the
unit test around Line 4320 to include entries with differing cache token values
and assert the function returns the combined total including cache tokens.
Locate and modify the getTotalTokens function and the corresponding test to
reference the exact cache token property names present in UsageData.

Comment on lines +515 to +520
```ts
/**
 * Calculate total tokens for a usage data entry
 */
export function getTotalTokens(data: UsageData): number {
	return data.message.usage.input_tokens + data.message.usage.output_tokens;
}
```


⚠️ Potential issue | 🟠 Major

Include cache tokens in the dedup comparison.

Right now getTotalTokens only sums input/output. If duplicates differ only in cache tokens, the “best” record may still drop cache usage (and cost). This undermines the goal of keeping the most complete record. Please also update the getTotalTokens test around Line 4320 to cover cache fields.

🔧 Proposed fix
```diff
 export function getTotalTokens(data: UsageData): number {
-	return data.message.usage.input_tokens + data.message.usage.output_tokens;
+	const usage = data.message.usage;
+	return (
+		usage.input_tokens +
+		usage.output_tokens +
+		(usage.cache_creation_input_tokens ?? 0) +
+		(usage.cache_read_input_tokens ?? 0)
+	);
 }
```

@jaried jaried closed this Jan 27, 2026
@jaried jaried deleted the fix/deduplication-keep-max-tokens branch February 1, 2026 14:17
