🚑(importer) fix memory leak with large mbox file import#516

Merged
jbpenrath merged 1 commit into main from fix/mbox-memory-leak
Jan 29, 2026
Conversation

Contributor

@jbpenrath jbpenrath commented Jan 29, 2026

Purpose

We were parsing the mbox and storing every message in a list, so importing a large mbox file exhausted memory.

Refactored the MBOX file processing to first scan for message positions without loading the entire file into memory, improving efficiency. The second pass now streams messages using pre-computed positions, ensuring memory usage is minimized while maintaining correct message order for threading.
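
A minimal sketch of the two-pass idea described above, not the code merged in this PR. The names `scan_positions` and `stream_messages` are hypothetical stand-ins, and the sketch assumes a binary file object positioned at offset 0 and that every line starting with `From ` begins a new message. It yields messages in file order and leaves the ordering question to the review comments.

```python
import io


def scan_positions(file):
    """First pass: record the byte offset of each line starting with b"From ".

    Only one line is held in memory at a time. Assumes the file is opened in
    binary mode and is currently positioned at offset 0 (no seek is issued).
    """
    positions = []
    offset = 0
    while True:
        line = file.readline()
        if not line:  # end of file
            break
        if line.startswith(b"From "):
            positions.append(offset)
        offset += len(line)
    # offset now equals the total number of bytes read, i.e. the file end
    return positions, offset


def stream_messages(file, positions, file_end):
    """Second pass: seek to each recorded start and read one message at a time."""
    ends = positions[1:] + [file_end]
    for start, end in zip(positions, ends):
        file.seek(start)
        yield file.read(end - start)


# Usage with an in-memory file standing in for a large mbox on disk
sample = io.BytesIO(b"From a@x\nSubject: one\n\nhello\nFrom b@x\nSubject: two\n\nbye\n")
positions, file_end = scan_positions(sample)
for raw_message in stream_messages(sample, positions, file_end):
    print(len(raw_message), "bytes")
```

Peak memory in this sketch is bounded by the largest single message plus the list of integer offsets, rather than by the size of the whole file.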

Summary by CodeRabbit

  • Refactor

    • Optimized message import processing with improved memory efficiency and performance through a two-pass scanning and streaming approach
    • Enhanced input validation for message processing
  • Tests

    • Added and updated test coverage for the revised message streaming methodology, including validation of correct message ordering and per-message processing

@jbpenrath jbpenrath self-assigned this Jan 29, 2026
@jbpenrath jbpenrath requested a review from sylvinus January 29, 2026 14:24
coderabbitai Bot commented Jan 29, 2026

📝 Walkthrough

A single-pass MBOX importer was replaced with a two-pass approach: scan_mbox_messages() precomputes message start offsets and file end, and stream_mbox_messages() seeks and yields message bytes from those offsets (oldest→newest). count_mbox_messages() was removed; tests updated for the new flow.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core MBOX Processing `src/backend/core/services/importer/tasks.py` | Adds `scan_mbox_messages(file) -> (message_positions, file_end)` and changes `stream_mbox_messages(file, message_positions, file_end)` to read via seeks and yield message bytes in chronological order. Removes `count_mbox_messages()`. Adds input validation, updates docstrings and progress reporting. |
| Test Suite `src/backend/core/tests/tasks/test_task_importer.py` | Updates tests to call `scan_mbox_messages()` then `stream_mbox_messages(...)`. Adds `test_task_stream_mbox_messages_not_fully_loaded_in_memory` with a `SpyFile` wrapper to assert scanning occurs without seeking and streaming does per-message seeks/reads and correct ordering. |
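
The `SpyFile` wrapper mentioned in the test summary is not shown in this conversation. The following is a hedged sketch of what such a wrapper could look like; the class body, the `calls` attribute, and the sample data are all assumptions made for illustration.

```python
import io


class SpyFile:
    """Hypothetical wrapper that records which file methods are called.

    It delegates to the wrapped binary file object, so the code under test
    behaves as usual while the test inspects the recorded call names.
    """

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.calls = []  # method names, in call order

    def readline(self, *args):
        self.calls.append("readline")
        return self._wrapped.readline(*args)

    def read(self, *args):
        self.calls.append("read")
        return self._wrapped.read(*args)

    def seek(self, *args):
        self.calls.append("seek")
        return self._wrapped.seek(*args)


# Usage: a scan-like loop should only ever call readline()
spy = SpyFile(io.BytesIO(b"From a@x\nbody\n"))
while spy.readline():
    pass
assert "seek" not in spy.calls
assert "read" not in spy.calls
```

A test built this way can assert on the pattern of calls (no `seek` during scanning, one `seek` and one `read` per message during streaming) without measuring memory directly.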

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client Code
    participant Scanner as scan_mbox_messages()
    participant File as MBOX File
    participant Streamer as stream_mbox_messages()

    Client->>Scanner: scan_mbox_messages(file)
    Scanner->>File: readline() loop (single pass)
    File-->>Scanner: lines -> detect "From " starts
    Scanner-->>Client: (message_positions, file_end)

    Client->>Streamer: stream_mbox_messages(file, positions, file_end)
    Streamer->>File: seek(start_position)
    File-->>Streamer: positioned at start
    Streamer->>File: read(bytes_until_end)
    File-->>Streamer: message bytes
    Streamer-->>Client: yield message bytes (oldest → newest)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I scanned the mbox with careful paws,
Marked each start without a pause,
Then hopped back, seeking every line,
Yielding bytes in order, neat and fine,
Two passes danced — a tidy cause. 🥕📬

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: fixing a memory leak in mbox file import by refactoring to avoid loading entire files into memory. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/backend/core/services/importer/tasks.py`:
- Around line 221-240: The docstring and comments claim messages are yielded
"oldest first" but the code iterates from the end to the start (for i in
range(len(message_positions) - 1, -1, -1)) and tests expect newest-first
ordering; update the docstring and the inline comment to state that messages are
yielded newest-first (most recent first) to match the actual behavior of the
generator that uses message_positions and file_end, or alternatively reverse the
iteration to range(0, len(message_positions)) if you intend oldest-first—ensure
the text mentions the exact loop/variables (message_positions, file_end, and the
for i ... loop) so the change is consistent.
- Around line 232-236: The current check "if not message_positions or not
file_end:" treats 0 as missing and logs a warning for valid empty MBOX files;
change it to explicitly test for None: first, if message_positions is None or
file_end is None then call logger.warning(...) and return; otherwise if
message_positions is empty (e.g. == [] or simply "if not message_positions:")
return quietly without logging. Update the condition(s) around
message_positions, file_end and the existing logger.warning call to use "is
None" checks and a separate silent return for empty message_positions.

@jbpenrath jbpenrath force-pushed the fix/mbox-memory-leak branch from 869f5f3 to 702a657 Compare January 29, 2026 14:40

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/backend/core/tests/tasks/test_task_importer.py`:
- Around line 694-697: The inline comment above the assertions incorrectly
states "oldest first" — update it to reflect the actual test assertions which
expect newest-first ordering: change the comment that references threading
ordering to say "newest first (latest message first for threading)" so it
matches the assertions that check messages[0] == "Test Message 3", messages[1]
== "Test Message 2", messages[2] == "Test Message 1".
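
To make the ordering in question concrete, here is a small sketch of the reverse iteration quoted in the earlier comment (`range(len(message_positions) - 1, -1, -1)`). The helper name `iter_reverse_file_order` and the sample offsets are hypothetical. It yields byte ranges from the last message in the file back to the first, which is newest-first only under the assumption that the mbox stores messages oldest-first.

```python
def iter_reverse_file_order(message_positions, file_end):
    """Yield (start, end) byte ranges from the last message to the first."""
    for i in range(len(message_positions) - 1, -1, -1):
        start = message_positions[i]
        # A message ends where the next one starts, or at the end of the file
        if i + 1 < len(message_positions):
            end = message_positions[i + 1]
        else:
            end = file_end
        yield start, end


# Usage with made-up offsets for three messages in a 40-byte file
ranges = list(iter_reverse_file_order([0, 10, 25], 40))
assert ranges == [(25, 40), (10, 25), (0, 10)]
```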

@jbpenrath jbpenrath merged commit b830755 into main Jan 29, 2026
13 checks passed
@jbpenrath jbpenrath deleted the fix/mbox-memory-leak branch January 29, 2026 15:03