🚑(importer) fix memory leak with large mbox file import by jbpenrath · Pull Request #516 · suitenumerique/messages

jbpenrath · 2026-01-29T14:24:20Z

Purpose

We were parsing the mbox and storing each message in a list, so large mbox files led to memory exhaustion.

Refactored the MBOX file processing to first scan for message positions without loading the entire file into memory, improving efficiency. The second pass now streams messages using pre-computed positions, ensuring memory usage is minimized while maintaining correct message order for threading.
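The two-pass idea described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function names `scan_mbox_messages` and `stream_mbox_messages` match the ones named in the review walkthrough, but the bodies, the simplified `b"From "` separator detection, and the file-order iteration are assumptions made for the example.

```python
import io
from typing import BinaryIO, Iterator, List, Tuple


def scan_mbox_messages(file: BinaryIO) -> Tuple[List[int], int]:
    """First pass: record the byte offset of each "From " separator line.

    Reads line by line from the current position (assumed to be the start
    of the file) and never seeks, so only one line is held in memory.
    Simplification: any line starting with b"From " counts as a separator.
    """
    positions: List[int] = []
    offset = 0
    while True:
        line = file.readline()
        if not line:
            break
        if line.startswith(b"From "):
            positions.append(offset)
        offset += len(line)
    return positions, offset  # offset now equals the number of bytes read


def stream_mbox_messages(
    file: BinaryIO, message_positions: List[int], file_end: int
) -> Iterator[bytes]:
    """Second pass: seek to each recorded offset and yield one message."""
    for i, start in enumerate(message_positions):
        # A message ends where the next one starts, or at the end of the file.
        if i + 1 < len(message_positions):
            end = message_positions[i + 1]
        else:
            end = file_end
        file.seek(start)
        yield file.read(end - start)


# Hypothetical in-memory mbox used only to exercise the sketch.
sample = io.BytesIO(
    b"From a@example.org Mon\nSubject: 1\n\nbody 1\n\n"
    b"From b@example.org Tue\nSubject: 2\n\nbody 2\n"
)
sample_positions, sample_end = scan_mbox_messages(sample)
sample_messages = list(stream_mbox_messages(sample, sample_positions, sample_end))
```

Only the list of integer offsets is kept between the two passes, so memory grows with the number of messages rather than with their total size. The order in which the offsets are walked is a separate choice; this sketch simply uses file order.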

Summary by CodeRabbit

Refactor
- Optimized message import processing with improved memory efficiency and performance through a two-pass scanning and streaming approach
- Enhanced input validation for message processing
Tests
- Added and updated test coverage for the revised message streaming methodology, including validation of correct message ordering and per-message processing

coderabbitai · 2026-01-29T14:24:42Z

Walkthrough

A single-pass MBOX importer was replaced with a two-pass approach: scan_mbox_messages() precomputes message start offsets and file end, and stream_mbox_messages() seeks and yields message bytes from those offsets (oldest→newest). count_mbox_messages() was removed; tests updated for the new flow.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core MBOX Processing `src/backend/core/services/importer/tasks.py` | Adds `scan_mbox_messages(file) -> (message_positions, file_end)` and changes `stream_mbox_messages(file, message_positions, file_end)` to read via seeks and yield message bytes in chronological order. Removes `count_mbox_messages()`. Adds input validation, updates docstrings and progress reporting. |
| Test Suite `src/backend/core/tests/tasks/test_task_importer.py` | Updates tests to call `scan_mbox_messages()` then `stream_mbox_messages(...)`. Adds `test_task_stream_mbox_messages_not_fully_loaded_in_memory` with a `SpyFile` wrapper to assert scanning occurs without seeking and streaming does per-message seeks/reads and correct ordering. |

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client Code
    participant Scanner as scan_mbox_messages()
    participant File as MBOX File
    participant Streamer as stream_mbox_messages()

    Client->>Scanner: scan_mbox_messages(file)
    Scanner->>File: readline() loop (single pass)
    File-->>Scanner: lines -> detect "From " starts
    Scanner-->>Client: (message_positions, file_end)

    Client->>Streamer: stream_mbox_messages(file, positions, file_end)
    Streamer->>File: seek(start_position)
    File-->>Streamer: positioned at start
    Streamer->>File: read(bytes_until_end)
    File-->>Streamer: message bytes
    Streamer-->>Client: yield message bytes (oldest → newest)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I scanned the mbox with careful paws,
Marked each start without a pause,
Then hopped back, seeking every line,
Yielding bytes in order, neat and fine,
Two passes danced — a tidy cause. 🥕📬

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: fixing a memory leak in mbox file import by refactoring to avoid loading entire files into memory. |

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@src/backend/core/services/importer/tasks.py`:
- Around line 221-240: The docstring and comments claim messages are yielded
"oldest first" but the code iterates from the end to the start (for i in
range(len(message_positions) - 1, -1, -1)) and tests expect newest-first
ordering; update the docstring and the inline comment to state that messages are
yielded newest-first (most recent first) to match the actual behavior of the
generator that uses message_positions and file_end, or alternatively reverse the
iteration to range(0, len(message_positions)) if you intend oldest-first—ensure
the text mentions the exact loop/variables (message_positions, file_end, and the
for i ... loop) so the change is consistent.
- Around line 232-236: The current check "if not message_positions or not
file_end:" treats 0 as missing and logs a warning for valid empty MBOX files;
change it to explicitly test for None: first, if message_positions is None or
file_end is None then call logger.warning(...) and return; otherwise if
message_positions is empty (e.g. == [] or simply "if not message_positions:")
return quietly without logging. Update the condition(s) around
message_positions, file_end and the existing logger.warning call to use "is
None" checks and a separate silent return for empty message_positions.
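The validation issue raised in the second comment comes from Python truthiness: `not file_end` is true when `file_end` is `0`, so an empty but valid mbox is reported as missing input. A minimal sketch of the suggested split is shown below; the helper name `has_messages_to_stream` and the log message are hypothetical and not taken from the PR.

```python
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)


def has_messages_to_stream(
    message_positions: Optional[List[int]], file_end: Optional[int]
) -> bool:
    """Return True only when there is at least one message to stream.

    Hypothetical helper: missing scan results (None) are logged,
    while an empty mbox (no positions, file_end == 0) is skipped silently.
    """
    if message_positions is None or file_end is None:
        logger.warning("Missing scan results, nothing to stream")
        return False
    if not message_positions:
        # Valid empty file: nothing to do and nothing worth a warning.
        return False
    return True
```

With this shape, `has_messages_to_stream([], 0)` returns `False` without logging, whereas the original `if not message_positions or not file_end:` check would have logged a warning for the same input.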

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@src/backend/core/tests/tasks/test_task_importer.py`:
- Around line 694-697: The inline comment above the assertions incorrectly
states "oldest first" — update it to reflect the actual test assertions which
expect newest-first ordering: change the comment that references threading
ordering to say "newest first (latest message first for threading)" so it
matches the assertions that check messages[0] == "Test Message 3", messages[1]
== "Test Message 2", messages[2] == "Test Message 1".
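To make the ordering under discussion concrete, here is a small sketch of a generator that walks pre-computed offsets from last to first, using the same `range(len(message_positions) - 1, -1, -1)` loop quoted in the review. The function name `stream_newest_first` and the sample data are hypothetical, and the sketch assumes messages are appended to the mbox in chronological order, so the last offset is the most recent message.

```python
import io
from typing import BinaryIO, Iterator, List


def stream_newest_first(
    file: BinaryIO, message_positions: List[int], file_end: int
) -> Iterator[bytes]:
    """Yield messages starting from the last offset in the file."""
    for i in range(len(message_positions) - 1, -1, -1):
        start = message_positions[i]
        # The last message runs to the end of the file; others stop
        # where the following message begins.
        if i + 1 < len(message_positions):
            end = message_positions[i + 1]
        else:
            end = file_end
        file.seek(start)
        yield file.read(end - start)


# Hypothetical three-message mbox with hand-computed offsets.
raw = (
    b"From a\nSubject: Test Message 1\n\n"
    b"From b\nSubject: Test Message 2\n\n"
    b"From c\nSubject: Test Message 3\n"
)
offsets = [0, raw.index(b"From b"), raw.index(b"From c")]
ordered = list(stream_newest_first(io.BytesIO(raw), offsets, len(raw)))
```

Here `ordered[0]` holds "Test Message 3" and `ordered[2]` holds "Test Message 1", which is the newest-first order the test assertions describe.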

jbpenrath self-assigned this Jan 29, 2026

jbpenrath requested a review from sylvinus January 29, 2026 14:24

coderabbitai Bot reviewed Jan 29, 2026

Comment thread src/backend/core/services/importer/tasks.py

Comment thread src/backend/core/services/importer/tasks.py Outdated

jbpenrath force-pushed the fix/mbox-memory-leak branch from 869f5f3 to 702a657 Compare January 29, 2026 14:40

coderabbitai Bot reviewed Jan 29, 2026

Comment thread src/backend/core/tests/tasks/test_task_importer.py

jbpenrath merged commit b830755 into main Jan 29, 2026
13 checks passed

jbpenrath deleted the fix/mbox-memory-leak branch January 29, 2026 15:03

coderabbitai Bot mentioned this pull request Feb 4, 2026

✨(importer) enhance MBOX file processing with S3 streaming #517

Closed

coderabbitai Bot mentioned this pull request Feb 17, 2026

✨(import) add support for PST imports & stream data for mbox #544

Merged

jbpenrath merged 1 commit into main from fix/mbox-memory-leak
