🚑(importer) fix memory leak with large mbox file import#516
Conversation
📝 WalkthroughWalkthroughA single-pass MBOX importer was replaced with a two-pass approach: Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Client Code
participant Scanner as scan_mbox_messages()
participant File as MBOX File
participant Streamer as stream_mbox_messages()
Client->>Scanner: scan_mbox_messages(file)
Scanner->>File: readline() loop (single pass)
File-->>Scanner: lines -> detect "From " starts
Scanner-->>Client: (message_positions, file_end)
Client->>Streamer: stream_mbox_messages(file, positions, file_end)
Streamer->>File: seek(start_position)
File-->>Streamer: positioned at start
Streamer->>File: read(bytes_until_end)
File-->>Streamer: message bytes
Streamer-->>Client: yield message bytes (oldest → newest)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/backend/core/services/importer/tasks.py`:
- Around line 221-240: The docstring and comments claim messages are yielded
"oldest first" but the code iterates from the end to the start (for i in
range(len(message_positions) - 1, -1, -1)) and tests expect newest-first
ordering; update the docstring and the inline comment to state that messages are
yielded newest-first (most recent first) to match the actual behavior of the
generator that uses message_positions and file_end, or alternatively reverse the
iteration to range(0, len(message_positions)) if you intend oldest-first—ensure
the text mentions the exact loop/variables (message_positions, file_end, and the
for i ... loop) so the change is consistent.
- Around line 232-236: The current check "if not message_positions or not
file_end:" treats 0 as missing and logs a warning for valid empty MBOX files;
change it to explicitly test for None: first, if message_positions is None or
file_end is None then call logger.warning(...) and return; otherwise if
message_positions is empty (e.g. == [] or simply "if not message_positions:")
return quietly without logging. Update the condition(s) around
message_positions, file_end and the existing logger.warning call to use "is
None" checks and a separate silent return for empty message_positions.
Refactored the MBOX file processing to first scan for message positions without loading the entire file into memory, improving efficiency. The second pass now streams messages using pre-computed positions, ensuring memory usage is minimized while maintaining correct message order for threading.
869f5f3 to
702a657
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/backend/core/tests/tasks/test_task_importer.py`:
- Around line 694-697: The inline comment above the assertions incorrectly
states "oldest first" — update it to reflect the actual test assertions which
expect newest-first ordering: change the comment that references threading
ordering to say "newest first (latest message first for threading)" so it
matches the assertions that check messages[0] == "Test Message 3", messages[1]
== "Test Message 2", messages[2] == "Test Message 1".
Purpose
We were parsing the mbox and store each message in a list... so for large mbox it leads to memory overflow.
Refactored the MBOX file processing to first scan for message positions without loading the entire file into memory, improving efficiency. The second pass now streams messages using pre-computed positions, ensuring memory usage is minimized while maintaining correct message order for threading.
Summary by CodeRabbit
Refactor
Tests
✏️ Tip: You can customize this high-level summary in your review settings.