Conversation
Hello @ryoppippi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. This PR focuses on refactoring the data loading and aggregation logic within src/data-loader.ts. The primary goal, as indicated by the title, is to leverage the Object.groupBy method where applicable to simplify and potentially improve the readability of the code that groups data entries before aggregating them.
Highlights
- **Refactor Data Aggregation**: The core change replaces the manual iteration and aggregation using `Map` objects with a pattern that first collects all relevant data entries into an array, then uses `Object.groupBy` to group them by a specific key (date, session key, or month), and finally uses `reduce` on each group to calculate the aggregated totals.
- **Apply Object.groupBy to Daily Data**: The `loadDailyUsageData` function has been updated to group individual usage entries by date using `Object.groupBy` before summing up tokens and cost for each day.
- **Apply Object.groupBy to Session Data**: The `loadSessionData` function now uses `Object.groupBy` to group usage entries by a session key (`projectPath/sessionId`). The aggregation logic within the `reduce` step also handles finding the latest activity timestamp and collecting unique versions for each session.
- **Apply Object.groupBy to Monthly Data**: The `loadMonthlyUsageData` function, which previously aggregated daily data from `loadDailyUsageData` into months using a `Map`, now uses `Object.groupBy` to group the daily usage objects by month.
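The collect → group → reduce pattern described above can be sketched minimally as follows. The entry shape is a simplified assumption (not the real `UsageData`), and a hand-rolled `groupBy` with the same `Partial<Record<…>>` signature as `Object.groupBy` stands in so the sketch also runs on runtimes older than Node 21:

```typescript
// Simplified stand-in for the PR's entry shape (assumption, not the real UsageData).
type Entry = { date: string; cost: number };

// Hand-rolled grouping helper with the same Partial<Record<...>> signature as
// Object.groupBy, so this runs on runtimes that lack it (Node < 21).
function groupBy<T>(
  items: T[],
  key: (item: T) => string,
): Partial<Record<string, T[]>> {
  const out: Partial<Record<string, T[]>> = {};
  for (const item of items) {
    (out[key(item)] ??= []).push(item);
  }
  return out;
}

// 1) collect entries into an array
const allEntries: Entry[] = [
  { date: "2024-01-01", cost: 1 },
  { date: "2024-01-01", cost: 2 },
  { date: "2024-01-02", cost: 3 },
];

// 2) group by key, 3) reduce each group to an aggregate
const totals = Object.entries(groupBy(allEntries, (e) => e.date)).map(
  ([date, entries]) =>
    entries!.reduce(
      (acc, e) => ({ date, totalCost: acc.totalCost + e.cost }),
      { date, totalCost: 0 },
    ),
);
// totals → [{ date: "2024-01-01", totalCost: 3 }, { date: "2024-01-02", totalCost: 3 }]
```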
Changelog
- src/data-loader.ts
  - Refactored `loadDailyUsageData` to use `Object.groupBy` for date-based aggregation.
  - Refactored `loadSessionData` to use `Object.groupBy` for session-based aggregation, including updates to how `lastActivity` and `versions` are determined.
  - Refactored `loadMonthlyUsageData` to use `Object.groupBy` for month-based aggregation of daily data.
  - Adjusted filtering logic in `loadDailyUsageData` and `loadSessionData` to occur after the main aggregation step.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes

1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Pull Request Overview
This PR refactors the daily, session, and monthly data loaders to use Object.groupBy for grouping, replacing manual Map usage and reducing boilerplate aggregation logic.
- Replaced manual `Map` builds with `Object.groupBy` in `loadDailyUsageData`, `loadSessionData`, and `loadMonthlyUsageData`
- Aggregations now use `Array.prototype.reduce` on grouped entries
- Date-range filtering and sorting retain existing semantics but operate on grouped results
Comments suppressed due to low confidence (1)
src/data-loader.ts:150
- [nitpick] The variable name `allEntries` is generic and reused across multiple functions. Consider renaming it to `dailyEntries` to improve clarity and maintainability.

```ts
const allEntries: { data: UsageData; date: string; cost: number }[] = [];
```
```ts
.map(([date, entries]) => {
  if (entries == null) {
    return undefined;
  }
  // ...
  return entries.reduce(
```
The null check for `entries` never fires at runtime: `Object.entries` only enumerates keys that `Object.groupBy` actually assigned, and each of those keys maps to a non-empty array. The branch exists only to satisfy the `Partial<Record<K, T[]>>` return type, so you can remove it (e.g. with a non-null assertion) to simplify the code.
Suggested change:

```diff
-.map(([date, entries]) => {
-  if (entries == null) {
-    return undefined;
-  }
-  return entries.reduce(
+.map(([date, entries]) =>
+  entries.reduce(
```
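Whether the branch is removable is more a typing question than a runtime one. A small sketch illustrates it, using a hand-rolled helper with the same `Partial<Record<K, T[]>>` signature as `Object.groupBy` (an assumption standing in for it, so the code runs on runtimes without `Object.groupBy`):

```typescript
// Hand-rolled stand-in matching Object.groupBy's Partial<Record<K, T[]>>
// signature (assumption; lets this run on runtimes without Object.groupBy).
function groupBy<T>(
  items: T[],
  key: (item: T) => string,
): Partial<Record<string, T[]>> {
  const out: Partial<Record<string, T[]>> = {};
  for (const item of items) {
    (out[key(item)] ??= []).push(item);
  }
  return out;
}

const grouped = groupBy([{ d: "a" }, { d: "a" }, { d: "b" }], (x) => x.d);

// Object.entries only enumerates keys that were actually assigned, so at
// runtime every enumerated value is a non-empty array...
for (const [, entries] of Object.entries(grouped)) {
  console.assert(Array.isArray(entries) && entries.length > 0);
}

// ...undefined only appears on direct access to an absent key, which is why
// TypeScript's Partial<Record<...>> typing still demands the null check.
console.assert(grouped["missing"] === undefined);
```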
```ts
.filter((item) => item != null)
.filter((item) => {
  // Filter by date range if specified
```
[nitpick] Chaining multiple .filter calls leads to extra array traversals. Consider merging the non-null check and date-range filter into a single pass to reduce iterations.
Suggested change:

```diff
-.filter((item) => item != null)
-.filter((item) => {
-  // Filter by date range if specified
+.filter((item) => {
+  // Ensure item is not null and filter by date range if specified
+  if (item == null) {
+    return false;
+  }
```
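The merged single-pass filter can be written with a type predicate so the result narrows away `undefined`, just as the separate `item != null` filter does. A sketch with simplified, assumed types (`Daily` and `inRange` are illustrative names, not from the PR):

```typescript
// Simplified daily record (assumption, not the project's real type).
type Daily = { date: string; totalCost: number };

// One pass: drop undefined entries and apply the date-range check together.
// The `item is Daily` predicate makes the narrowing explicit for TypeScript.
function inRange(
  results: (Daily | undefined)[],
  since?: string,
  until?: string,
): Daily[] {
  return results.filter((item): item is Daily => {
    if (item == null) return false;
    const dateStr = item.date.replace(/-/g, ""); // Convert to YYYYMMDD
    if (since != null && dateStr < since) return false;
    if (until != null && dateStr > until) return false;
    return true;
  });
}

const filtered = inRange(
  [
    undefined,
    { date: "2024-01-01", totalCost: 1 },
    { date: "2024-02-01", totalCost: 2 },
  ],
  "20240115",
);
// filtered → [{ date: "2024-02-01", totalCost: 2 }]
```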
Code Review
This pull request successfully refactors loadDailyUsageData, loadSessionData, and loadMonthlyUsageData to use Object.groupBy for aggregating data. This change makes the grouping intent very clear and leverages a modern JavaScript feature.
The implementation correctly handles the Partial<Record<K, T[]>> return type of Object.groupBy by checking for null or undefined entry arrays and filtering them out, which is good practice.
One notable improvement in loadSessionData is the addition of default values for sessionId and projectPath when parsing file paths, enhancing robustness.
The primary area for consideration is the potential impact on performance, as the new approach generally involves more passes over the data arrays compared to the previous Map-based in-place aggregation. This is discussed in more detail in the specific comments.
Summary of Findings
- **Potential Performance Impact in `loadDailyUsageData`**: The refactor to use `Object.groupBy` in `loadDailyUsageData` results in multiple iterations over the collected entries (grouping, then mapping/reducing, then filtering). This could be less performant than the previous single-pass Map-based aggregation for large datasets.
- **Potential Performance Impact in `loadSessionData`**: In `loadSessionData`, after grouping by session, each session's data is iterated multiple times (to find the latest entry, to collect versions, and to sum totals). This contrasts with the previous incremental update approach and might affect performance for sessions with many entries.
- **Potential Performance Impact in `loadMonthlyUsageData`**: The `loadMonthlyUsageData` function now performs `Object.groupBy` and subsequent aggregation on the results of `loadDailyUsageData`. This means the `dailyData` array is iterated multiple times after it has already been computed, potentially adding overhead.
Merge Readiness
The pull request successfully implements the goal of using Object.groupBy. However, there are medium-severity concerns regarding potential performance degradation in all three refactored functions due to increased data iterations. It would be beneficial to discuss these performance aspects, and potentially benchmark the changes against the previous implementation if large datasets are common, before merging.
As an AI assistant, I am not authorized to approve pull requests. Please ensure this code is reviewed and approved by the appropriate team members.
```ts
// Group by date using Object.groupBy
const groupedByDate = Object.groupBy(allEntries, (entry) => entry.date);

// Aggregate each group
const results = Object.entries(groupedByDate)
  .map(([date, entries]) => {
    if (entries == null) {
      return undefined;
    }

    return entries.reduce(
      (acc, entry) => ({
        date,
        inputTokens:
          acc.inputTokens + (entry.data.message.usage.input_tokens ?? 0),
        outputTokens:
          acc.outputTokens + (entry.data.message.usage.output_tokens ?? 0),
        cacheCreationTokens:
          acc.cacheCreationTokens +
          (entry.data.message.usage.cache_creation_input_tokens ?? 0),
        cacheReadTokens:
          acc.cacheReadTokens +
          (entry.data.message.usage.cache_read_input_tokens ?? 0),
        totalCost: acc.totalCost + entry.cost,
      }),
      {
        date,
        inputTokens: 0,
        outputTokens: 0,
        cacheCreationTokens: 0,
        cacheReadTokens: 0,
        totalCost: 0,
      },
    );
  })
  .filter((item) => item != null)
  .filter((item) => {
    // Filter by date range if specified
    if (options?.since || options?.until) {
      const dateStr = item.date.replace(/-/g, ""); // Convert to YYYYMMDD
      if (options.since && dateStr < options.since) return false;
      if (options.until && dateStr > options.until) return false;
    }
    return true;
  });
```
The new approach using `Object.groupBy` followed by `Object.entries().map().reduce()` is clear in its intent. However, it introduces multiple iterations over the `allEntries` data:

- `Object.groupBy` iterates once.
- `Object.entries().map()` iterates over the groups, and `reduce()` iterates over entries within each group.
- The subsequent `.filter()` for date range iterates again.

The previous Map-based approach aggregated data in a more incremental fashion, typically involving fewer full passes over the dataset.

Have you considered the potential performance implications for very large log files or a high number of entries? It might be worth benchmarking this against the older implementation if performance is critical for large datasets.

The check `if (entries == null)` (line 184) and the `filter((item) => item != null)` (line 213) are correctly implemented to handle the `Partial<Record<K, T[]>>` return type of `Object.groupBy`, where group arrays can be `undefined`.
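If benchmarking is attempted, a rough micro-benchmark along these lines could compare the two shapes. Everything here is an assumption for illustration: the entry type, the synthetic sizes, and a hand-rolled grouping loop standing in for `Object.groupBy`; timings are machine-dependent.

```typescript
type Entry = { date: string; cost: number };

// Synthetic workload: 100k entries spread over 28 dates (an assumption).
const entries: Entry[] = Array.from({ length: 100_000 }, (_, i) => ({
  date: `2024-01-${String((i % 28) + 1).padStart(2, "0")}`,
  cost: 1,
}));

// Previous shape: one pass, aggregating into a Map in place.
function viaMap(items: Entry[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const e of items) {
    totals.set(e.date, (totals.get(e.date) ?? 0) + e.cost);
  }
  return totals;
}

// New shape: group first (stand-in for Object.groupBy), then reduce per group.
function viaGroupReduce(items: Entry[]): Record<string, number> {
  const grouped: Partial<Record<string, Entry[]>> = {};
  for (const e of items) {
    (grouped[e.date] ??= []).push(e);
  }
  return Object.fromEntries(
    Object.entries(grouped).map(([date, group]) => [
      date,
      group!.reduce((sum, e) => sum + e.cost, 0),
    ]),
  );
}

const t0 = performance.now();
const mapTotals = viaMap(entries);
const t1 = performance.now();
const groupTotals = viaGroupReduce(entries);
const t2 = performance.now();
console.log(
  `map: ${(t1 - t0).toFixed(1)}ms, group+reduce: ${(t2 - t1).toFixed(1)}ms`,
);
```

Both shapes must produce identical totals; only the constant factors (extra allocations and passes in the group-then-reduce shape) differ.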
```ts
// Group by session using Object.groupBy
const groupedBySessions = Object.groupBy(
  allEntries,
  (entry) => entry.sessionKey,
);

// Aggregate each session group
const results = Object.entries(groupedBySessions)
  .map(([_, entries]) => {
    if (entries == null) {
      return undefined;
    }

    // Find the latest timestamp for lastActivity
    const latestEntry = entries.reduce((latest, current) =>
      current.timestamp > latest.timestamp ? current : latest,
    );

    // Collect all unique versions
    const versionSet = new Set<string>();
    for (const entry of entries) {
      if (entry.data.version) {
        versionSet.add(entry.data.version);
      }
    }

    // Aggregate totals
    const aggregated = entries.reduce(
      (acc, entry) => ({
        sessionId: latestEntry.sessionId,
        projectPath: latestEntry.projectPath,
        inputTokens:
          acc.inputTokens + (entry.data.message.usage.input_tokens ?? 0),
        outputTokens:
          acc.outputTokens + (entry.data.message.usage.output_tokens ?? 0),
        cacheCreationTokens:
          acc.cacheCreationTokens +
          (entry.data.message.usage.cache_creation_input_tokens ?? 0),
        cacheReadTokens:
          acc.cacheReadTokens +
          (entry.data.message.usage.cache_read_input_tokens ?? 0),
        totalCost: acc.totalCost + entry.cost,
        lastActivity: formatDate(latestEntry.timestamp),
        versions: Array.from(versionSet).sort(),
      }),
      {
        sessionId: latestEntry.sessionId,
        projectPath: latestEntry.projectPath,
        inputTokens: 0,
        outputTokens: 0,
        cacheCreationTokens: 0,
        cacheReadTokens: 0,
        totalCost: 0,
        lastActivity: formatDate(latestEntry.timestamp),
        versions: Array.from(versionSet).sort(),
      },
    );

    return aggregated;
  })
  .filter((item) => item != null)
  .filter((item) => {
    // Filter by date range if specified
    if (options?.since || options?.until) {
      const dateStr = item.lastActivity.replace(/-/g, ""); // Convert to YYYYMMDD
      if (options.since && dateStr < options.since) return false;
      if (options.until && dateStr > options.until) return false;
    }
    return true;
  });
```
Similar to `loadDailyUsageData`, this refactoring for `loadSessionData` uses `Object.groupBy` effectively. The addition of default values for `sessionId` (line 266) and `projectPath` (line 268) is a nice improvement for robustness.

Regarding performance and complexity within the aggregation step (lines 310-361):

- Finding `latestEntry` involves a pass over `entries` for the current session group.
- Collecting `versionSet` involves another pass over `entries`.
- The final `reduce` to aggregate totals is a third pass over `entries`.

This multi-pass approach within each group's processing, after the initial `Object.groupBy`, could be less efficient than the previous Map-based method that updated session data incrementally in one pass over the raw log lines.

Could this lead to noticeable slowdowns for sessions with a very large number of individual log entries? The increased number of iterations per session group might be a concern for performance-sensitive scenarios.
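If the per-group passes ever become a bottleneck, the three traversals can be collapsed into a single plain loop per group. A hedged sketch with simplified, assumed types (the real code also tracks per-token-type counts and formats the timestamp via `formatDate`, both omitted here):

```typescript
// Simplified session entry (assumption, not the project's real type).
type SessionEntry = { timestamp: string; version?: string; cost: number };

// One pass per group: latest timestamp, unique versions, and totals together.
function aggregateSession(entries: SessionEntry[]) {
  let latest = entries[0]; // groups produced by grouping are non-empty
  const versions = new Set<string>();
  let totalCost = 0;
  for (const entry of entries) {
    if (entry.timestamp > latest.timestamp) latest = entry;
    if (entry.version != null) versions.add(entry.version);
    totalCost += entry.cost;
  }
  return {
    lastActivity: latest.timestamp, // the real code would apply formatDate here
    versions: Array.from(versions).sort(),
    totalCost,
  };
}

const summary = aggregateSession([
  { timestamp: "2024-01-01T10:00:00Z", version: "1.0.0", cost: 1 },
  { timestamp: "2024-01-02T09:00:00Z", version: "1.1.0", cost: 2 },
]);
// summary → { lastActivity: "2024-01-02T09:00:00Z", versions: ["1.0.0", "1.1.0"], totalCost: 3 }
```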
Removed (previous Map-based aggregation):

```ts
for (const data of dailyData) {
  // Extract YYYY-MM from YYYY-MM-DD
  const month = data.date.substring(0, 7);

  const existing = monthlyMap.get(month) || {
    month,
    inputTokens: 0,
    outputTokens: 0,
    cacheCreationTokens: 0,
    cacheReadTokens: 0,
    totalCost: 0,
  };

  existing.inputTokens += data.inputTokens;
  existing.outputTokens += data.outputTokens;
  existing.cacheCreationTokens += data.cacheCreationTokens;
  existing.cacheReadTokens += data.cacheReadTokens;
  existing.totalCost += data.totalCost;

  monthlyMap.set(month, existing);
}
```

Added (`Object.groupBy`-based aggregation):

```ts
// Group daily data by month using Object.groupBy
const groupedByMonth = Object.groupBy(dailyData, (data) =>
  data.date.substring(0, 7),
);

// Aggregate each month group
const monthlyArray = Object.entries(groupedByMonth)
  .map(([month, dailyEntries]) => {
    if (dailyEntries == null) {
      return undefined;
    }

    return dailyEntries.reduce(
      (acc, data) => ({
        month,
        inputTokens: acc.inputTokens + data.inputTokens,
        outputTokens: acc.outputTokens + data.outputTokens,
        cacheCreationTokens:
          acc.cacheCreationTokens + data.cacheCreationTokens,
        cacheReadTokens: acc.cacheReadTokens + data.cacheReadTokens,
        totalCost: acc.totalCost + data.totalCost,
      }),
      {
        month,
        inputTokens: 0,
        outputTokens: 0,
        cacheCreationTokens: 0,
        cacheReadTokens: 0,
        totalCost: 0,
      },
    );
  })
  .filter((item) => item != null);
```
In `loadMonthlyUsageData`, the `dailyData` (which is already processed and potentially filtered by `loadDailyUsageData`) is now processed again:

- `Object.groupBy(dailyData, ...)` iterates over `dailyData`.
- `Object.entries().map().reduce()` iterates over the monthly groups and their constituent daily entries.

The previous implementation iterated `dailyData` once to populate `monthlyMap`.

This change might introduce additional overhead, especially since `dailyData` itself is the result of prior processing. Was this potential performance trade-off considered? For instance, if `loadDailyUsageData` returns a large number of daily records, these subsequent iterations could add up.
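For reference, the single-pass Map shape this comment describes looks roughly like the sketch below, with a simplified daily record (an assumption; the real code also carries the token-count fields):

```typescript
// Simplified daily record (assumption, not the project's real type).
type Daily = { date: string; totalCost: number };
type Monthly = { month: string; totalCost: number };

// One pass over dailyData: fold each record into its month's running total.
function monthlyTotals(dailyData: Daily[]): Monthly[] {
  const monthlyMap = new Map<string, Monthly>();
  for (const data of dailyData) {
    const month = data.date.substring(0, 7); // YYYY-MM from YYYY-MM-DD
    const existing = monthlyMap.get(month) ?? { month, totalCost: 0 };
    existing.totalCost += data.totalCost;
    monthlyMap.set(month, existing);
  }
  return [...monthlyMap.values()];
}

const monthly = monthlyTotals([
  { date: "2024-01-01", totalCost: 1 },
  { date: "2024-01-15", totalCost: 2 },
  { date: "2024-02-01", totalCost: 4 },
]);
// monthly → [{ month: "2024-01", totalCost: 3 }, { month: "2024-02", totalCost: 4 }]
```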
feat: use object.groupBy as much as possible
No description provided.