Skip to content

Add Facebook Messenger DYI import#281

Open
jesserobbins wants to merge 15 commits intowesm:mainfrom
jesserobbins:jesse/fbmessenger
Open

Add Facebook Messenger DYI import#281
jesserobbins wants to merge 15 commits intowesm:mainfrom
jesserobbins:jesse/fbmessenger

Conversation

@jesserobbins
Copy link
Copy Markdown
Contributor

Summary

  • Adds msgvault import-messenger command for Facebook "Download Your Information" exports
  • Supports JSON, HTML, and E2EE flat-file formats with auto-detection per thread
  • Handles both old messages/ and newer your_activity_across_facebook/messages/ DYI layouts
  • Decodes Facebook's Latin-1-over-UTF-8 mojibake in JSON exports transparently
  • Ingests attachments (photos, videos, stickers, audio) into content-addressed storage
  • Preserves reactions as relational rows and in message body for FTS
  • Resumable checkpoints every 50 threads
  • Removes hard-coded message_type = 'email' filter from query engines so all archived message types participate in search and aggregation

~4,700 lines across 44 files. Core importer is ~2,100 lines with ~1,600 lines of tests.

Closes #280. Related: #136, #192, #278.

🤖 Generated with Claude Code

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 21, 2026

roborev: Combined Review (791264b)

Summary verdict: one Medium issue remains; no High or Critical findings were reported.

Medium

  • internal/fbmessenger/slug.go:64 - DecodeMojibake can corrupt already-valid Latin-1-range Unicode strings. For example, valid UTF-8 text like café may be converted into invalid UTF-8 bytes instead of being preserved, affecting message bodies, participant names, titles, and E2EE exports.
    • Fix: only apply the Latin-1-to-UTF-8 repair when the converted result is valid UTF-8 and appears to actually fix mojibake; otherwise return the original string.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

I don't love the roborev loops but here we are.

Comment thread CLAUDE.md
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 21, 2026

roborev: Combined Review (630a87f)

Medium issue found: Facebook Messenger JSON attachment resolution misses the your_facebook_activity/messages export layout.

Medium

  • internal/fbmessenger/json_parser.go around resolveAttachmentURI
    • JSON attachment resolution does not handle the your_facebook_activity/messages layout that Discover explicitly supports.
    • For exports in that layout, URIs like messages/inbox/.../photos/x.jpg are only tried under absRoot/your_activity_across_facebook/... and absRoot/..., so existing attachments under absRoot/your_facebook_activity/messages/... are recorded as missing and not copied.
    • Fix: add filepath.Join(absRoot, "your_facebook_activity", cleaned) to the candidate paths and cover it with a fixture/test for the alternate layout plus media.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins jesserobbins marked this pull request as ready for review April 21, 2026 20:04
jesserobbins and others added 4 commits April 21, 2026 13:31
Import Facebook Messenger data from DYI exports (JSON and HTML formats),
E2EE flat-file exports, and multiple directory layouts including
your_activity_across_facebook, your_facebook_activity, and legacy
messages/ roots. Includes CLI command, MIME parsing, attachment
resolution, Discover/Parse pipeline, and query engine support for
messenger message types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jesserobbins
Copy link
Copy Markdown
Contributor Author

jesserobbins commented Apr 21, 2026

Addressed @trel feedback & collapsed reviews into features and fixes.

Also tested against my facebook archives.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 21, 2026

roborev: Combined Review (14a00f6)

I’ll consolidate the reviews into a PR-ready markdown comment, keeping only medium-or-higher findings and merging duplicates.
Medium-risk issues found in the Messenger import path; security review found no additional concerns.

Medium

  • internal/fbmessenger/importer.go:139
    Gracefully interrupted imports are marked failed, but resume only reads GetActiveSync, so the checkpoint saved before FailSync is ignored on the next run despite the CLI saying to rerun to continue.
    Fix: Resume from the latest checkpointed Messenger sync run, including failed interrupted runs, or avoid marking user-canceled imports failed in a way that hides their checkpoint.

  • internal/fbmessenger/html_parser.go:331
    Remaining HTML images are assigned to the first message without an attachment, regardless of where the <img> appeared, so photos in later messages can be stored on the wrong message.
    Fix: Track image positions while parsing message blocks, or parse per-message DOM containers so attachments stay associated with their containing message.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

…rrectly

Interrupted imports were marked failed but GetActiveSync only looked for
running syncs, so the saved checkpoint was invisible on resume. Added
GetLatestCheckpointedSync to find the most recent sync with a checkpoint
regardless of status, used as fallback when no active sync exists.

HTML images were assigned to the first empty-body or attachment-less
message regardless of DOM position. Refactored collectHTMLLines to track
each image's line index, so parseHTMLLines assigns images only to the
message block where they actually appear in the document.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 21, 2026

roborev: Combined Review (f446440)

Findings: one medium issue needs fixing before merge.

Medium

  • internal/fbmessenger/importer.go:118 - ImportDYI lowercases opts.Format but does not validate it. An invalid CLI value such as --format jsno falls through the format switch, imports no normal JSON/HTML threads, and still returns success with zero messages.
    • Fix: Validate format against auto, json, html, and both before discovery/import, and return an error for unknown values.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

jesserobbins and others added 2 commits April 21, 2026 15:16
Previously an invalid format like --format jsno silently imported zero
messages and returned success. Now validated against the known set
(auto, json, html, both) before discovery begins.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflict in internal/store/sync.go — keep both
GetLatestCheckpointedSync (ours) and HasAnyActiveSync (upstream).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 21, 2026

roborev: Combined Review (b39f436)

Medium risk: one issue found that can cause Facebook Messenger thread/message ID collisions.

Medium

  • internal/fbmessenger/importer.go:498
    source_message_id is based only on the thread directory name and per-thread message index. If a Facebook export contains the same thread directory basename in multiple sections, or across multiple discovered layout roots, those messages can collide under the same source and conversation ID, causing one thread’s messages to overwrite or dedupe the other.

    Fix: Include a stable section/layout discriminator in the conversation ID and source_message_id, or otherwise ensure discovered thread identities are globally unique within the source.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

jesserobbins and others added 2 commits April 21, 2026 16:36
source_conversation_id and source_message_id used only the thread
directory basename, so threads with the same name in different sections
(e.g. inbox/ vs archived_threads/) could collide within the same source.
Now prefixed with the section (e.g. "inbox/alice_ABC123__0").

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Build and all tests pass.
Changes:
- Apply `dialect.Rebind()` to parameterized queries in `RemoveSourceSerialized` (`SELECT COUNT(*)` and `DELETE FROM sources`) to prevent placeholder breakage when PostgreSQL dialect lands
- Document load-bearing defer ordering in `ExecuteContext` (panic handler must run before log-file close because `os.Exit` skips remaining defers)
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 22, 2026

roborev: Combined Review (de0afdc)

PR adds Facebook Messenger import support, but one resume-checkpoint bug can cause missed messages on repeat imports.

Medium

  • Location: internal/store/sync.go:240
    • Problem: GetLatestCheckpointedSync resumes from any sync run with cursor_before, including completed runs. Because successful imports save per-thread checkpoints, a later import of the same export root can treat the previous completed run as resumable and skip already-known threads, missing new messages added to existing threads.
    • Fix: Only resume from interrupted/running or failed sync runs, or clear cursor_before on successful completion. Add a regression test that imports a root, adds a message to an existing thread, then re-imports and verifies the new message is imported.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

GetLatestCheckpointedSync was matching any sync run with cursor_before,
including completed runs. This caused re-imports to skip threads that
were already checkpointed, missing new messages added to existing threads.

Filter to only running/failed status so completed imports always re-scan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jesserobbins
Copy link
Copy Markdown
Contributor Author

jesserobbins commented Apr 23, 2026

Interesting scenario. Hard to design a test for. I don't think people are going to merge in these very often. Will see what a fix looks like.

@jesserobbins
Copy link
Copy Markdown
Contributor Author

I will add this is already useful for me.

jesserobbins and others added 3 commits April 23, 2026 10:19
…L scan

Six roborev fixes bundled by code area:

- writeThreadToStore: the synthesized-sender branch (message whose
  sender is not in the thread's participants header) now also calls
  EnsureConversationParticipant, mirroring the loop that handles
  participants enumerated from the header. Previously sender_id was
  populated on the message but the participant was never linked to the
  conversation, skewing participant analytics. Regression test
  TestImportDYI_SynthesizedSenderLinkedToConversation.

- Introduce errLimitReached and return it when opts.Limit trips
  mid-thread. The outer loop in ImportDYI detects the sentinel, breaks
  without advancing the per-thread checkpoint, and doesn't log as
  "thread failed". A later non-limited run re-scans the partial thread
  and picks up the remaining messages via source_message_id dedup.

- ReplaceMessageRecipients("from", …) is now skipped when senderID is
  invalid, so a re-import where the current fixture can't resolve a
  sender doesn't clobber a previously-recorded "from" row. The error
  return is now logged via logger.Warn instead of discarded.

- parseHTMLLines: when a sender-candidate line has no timestamp within
  the look-ahead window and another sender is hit first, resume scanning
  at that candidate (i = nextSender) rather than advancing one line at
  a time through the failed window.

- Wire up ImportSummary.MessagesSkipped in the UpsertMessage error
  branch so per-message failures are counted for the CLI's "%d skipped"
  line instead of trivially staying at zero.

- Wire up ImportSummary.HardErrors on the outer-loop catastrophic
  per-thread failure branch (the "thread failed" log). Ctx cancel,
  errLimitReached, and per-message upsert errors remain Errors-only.

Addresses roborev jobs 206 and 211.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
discoverE2EEFlat previously treated any unknown top-level JSON file as
an E2EE thread, gated only by a hardcoded metadata-filename allowlist.
Any new top-level JSON Facebook adds in a future DYI revision would be
handed to ParseE2EEJSONFile and either log as ErrCorruptJSON or
silently register as a zero-message thread.

Filter by shape via a streaming json.Decoder probe (probeE2EEShape)
that reads only enough tokens to see "participants" and "messages" at
the top level, classified via a three-state e2eeShape enum:

- Thread  (object with both keys)      → included in discovery
- NotThread (non-object, or object with
  neither key)                         → silently skipped
- Unknown (I/O or mid-parse error, or
  object with exactly one of the two
  keys)                                → passed through to the parser

Keeping non-thread shapes out of the indexed list matters because the
per-thread checkpoint resumes by index: if a metadata file joined the
list one run and dropped the next (e.g. after a Facebook DYI schema
change or an allowlist update) the saved index would point past the
next real thread.

ParseE2EEJSONFile gets a matching three-way classification: a new
ErrNotE2EEThread sentinel (silent skip) for non-objects and
object-with-neither-key, the existing ErrCorruptJSON (log + count) for
object-with-exactly-one-key, and full decode for well-formed thread
shapes. A partial/malformed thread export now surfaces via the
ThreadsSkipped path instead of vanishing. runE2EE treats
ErrNotE2EEThread as a silent skip. Files are read and decoded once.

Tests: TestParseE2EEJSONFile_NotAThread,
TestParseE2EEJSONFile_PartialObjectCorrupt, and
TestDiscover_E2EEFlatRejectsNonThreadJSON.

Addresses roborev jobs 211, 215, 222, 224, 225.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
writeThreadToStore's ON CONFLICT UPDATE path rewrote sender_id,
is_from_me, and message_recipients("from") on every re-import using the
current-run senderName. When a re-import of an already-imported message
couldn't resolve a sender (senderName stripped, participant renamed, or
the synthesized-sender path failed) it clobbered previously-good sender
metadata with NULL / empty values.

Before UpsertMessage, when the current run's senderID is invalid we now
look up the prior row and rehydrate enough context that every
downstream write preserves it:

- sender_id is read directly.
- is_from_me is read directly so a self-authored message doesn't
  flip to inbound and add the account owner to "to" recipients.
- FromMeCount is bumped so the CLI's --me-mismatch warning doesn't
  fire on runs where rehydration was the source of is_from_me.
- Sender display_name and email are rehydrated via LEFT JOIN on
  participants. The display_name SELECT COALESCEs onto
  message_recipients.display_name because the seeded --me participant
  is created with an empty display_name, and the prior "from" row is
  the authoritative label for self-authored messages.
- Local senderName / nameToID / nameToEmail are repopulated so the
  subsequent ReplaceMessageRecipients("from", …) and UpsertFTS write
  preserved values instead of empty strings.
- EnsureConversationParticipant is re-called to repair DBs affected
  by the pre-fix synthesized-sender bug (sender_id populated but
  conversation_participants row missing).

Regression tests:
- TestImportDYI_SenderIDPreservedOnReimport covers sender_id,
  is_from_me, from display_name, and the account-owner-not-in-to
  assertion for a self-authored message, plus FromMeCount.
- TestImportDYI_ReimportRepairsConversationParticipant simulates a
  pre-fix DB (sender_id present, conversation_participants row missing)
  and asserts the row is restored on re-import.

Addresses roborev jobs 215, 217, 219, 221, 224.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 23, 2026

roborev: Combined Review (b8f4ded)

Verdict: Changes introduce two compile-breaking regressions and one medium-severity parsing bug; not ready to merge.

High

  • cmd/msgvault/cmd/import_messenger.go (around lines 50, 106-108, 141): The new command calls MustBeLocal(...) and rebuildCacheAfterWrite(...), but those helpers are not defined in the patch. This is a compile-time failure. Add the helpers in this branch or replace them with existing local-mode/cache rebuild logic already used elsewhere.

  • internal/fbmessenger/importer.go (around lines 455-720), internal/store/sources.go (around lines 137, 152): The importer references store APIs/fields that do not exist in the current store layer, including EnsureConversationWithType, EnsureConversationParticipant, UpsertMessageRawWithFormat, LinkMessageLabel, UpsertReaction, RecomputeConversationStats, and s.dialect. This is a second compile-time break. Either implement the missing store APIs/field in the same PR or switch the importer to existing store methods.

Medium

  • internal/fbmessenger/html_parser.go (around lines 314-338): parseHTMLLines treats any body line that exactly matches a participant name as the start of a new sender block before confirming a timestamp. A legitimate message whose content is exactly a participant name can therefore be misparsed, causing the real message to be skipped and an empty message to be attributed instead. The parser should avoid using participant-name matches as an early delimiter and continue until a timestamp is found, or parse message containers directly from the DOM.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

The two High findings in the combined review are false positives — every cited symbol exists in the tree. go build ./... and go test ./... both pass on b8f4ded.

MustBeLocal / rebuildCacheAfterWrite:

  • cmd/msgvault/cmd/store_resolver.go:81func MustBeLocal(cmdName string) error
  • cmd/msgvault/cmd/build_cache.go:796func rebuildCacheAfterWrite(dbPath string)

Store APIs the importer supposedly references that don't exist:

  • internal/store/messages.go:1065EnsureConversationWithType
  • internal/store/messages.go:1190EnsureConversationParticipant
  • internal/store/messages.go:1205UpsertMessageRawWithFormat
  • internal/store/messages.go:756LinkMessageLabel
  • internal/store/messages.go:1197UpsertReaction
  • internal/store/messages.go:1032RecomputeConversationStats

s.dialect:

  • internal/store/store.go:32Store has a dialect Dialect field set in both New and Open constructors

All of these also exist on upstream/main (confirmed via git show upstream/main:<path>), so this isn't a case of a diff being reviewed against a stale base. The reviewer is hallucinating API absence — possibly by scoping to recent commits and seeing call sites without their definitions.

The Medium parseHTMLLines finding about participant-name-as-body-line is a known tradeoff that also came up on iter 1 of the local refine cycle (roborev job 206). The fix I landed there advances i to the next sender candidate when no timestamp is found, rather than the proposed tighter-window approach — the suggested window-tighten doesn't actually solve the "body line equals participant name" case (a real timestamp often still lies within 64 lines), and moving to full per-message DOM container parsing is a bigger rewrite than this PR scopes. Worth tracking but leaving in this PR.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 23, 2026

roborev: Combined Review (6a28ac0)

Overall verdict: Changes are not ready to merge due to one build-breaking issue and several medium-severity correctness issues.

High

  • Build break from missing store APIs
    Location: internal/fbmessenger/importer.go (writeThreadToStore / end-of-import stats update)
    Problem: The importer calls store methods that are not implemented in the current repo/change set: EnsureConversationWithType, EnsureConversationParticipant, UpsertMessageRawWithFormat, LinkMessageLabel, UpsertReaction, RecomputeConversationStats. As written, this change does not build.
    Fix: Add the missing store APIs in the same change, or rewrite the importer to use existing store methods and supported data paths.

Medium

  • First-thread checkpoints are not resumable
    Location: internal/fbmessenger/importer.go (resume restore block around if prior.ThreadIndex > 0)
    Problem: Resume is only enabled when ThreadIndex > 0. A checkpoint saved while still processing the first thread has ThreadIndex == 0, so it is ignored and the rerun restarts that thread from scratch.
    Fix: Treat any valid checkpoint as resumable, including ThreadIndex == 0, and add a test covering interruption during the first thread.

  • Failed threads can be skipped on resume
    Location: internal/fbmessenger/importer.go (main thread loop after importThread(...))
    Problem: The checkpoint advances to threadIdx+1 even after a hard per-thread error. If a thread partially writes and then fails, a resumed run will skip that thread instead of retrying it, leaving the import incomplete.
    Fix: Only advance the checkpoint after a thread finishes successfully or is safely skipped; on hard failures, keep the checkpoint on the current thread or abort.

  • Cross-dialect SQL binding bug
    Location: internal/store/sync.go in GetLatestCheckpointedSync
    Problem: The query uses a positional ? placeholder but is not wrapped in s.dialect.Rebind(...), unlike other store queries. This can break on dialects such as PostgreSQL that require $1-style placeholders.
    Fix: Wrap the query string in s.dialect.Rebind(...).


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

…ilure

Two resume-path corrections:

- A fbmessengerCheckpoint saved mid-way through the first thread has
  ThreadIndex == 0, which the prior resume guard (`if prior.ThreadIndex > 0`)
  treated as "no progress yet" and ignored. Functionally this didn't lose
  data (source_message_id dedup made retry safe and the outer loop already
  starts at threadIdx=0) but summary.WasResumed stayed false, the "resuming"
  log didn't fire, and cumulative counters from the prior run didn't carry
  forward — so the summary undercounted and the CLI didn't tell the user a
  resume was happening. Gate removed; any well-formed checkpoint with a
  matching RootDir is now resumable. Added
  TestImportDYI_ResumeFromFirstThreadCheckpoint.

- The outer thread loop previously advanced the per-thread checkpoint to
  threadIdx+1 even after a hard error on that thread, so a resumed run
  would skip it. A transient failure (DB lock, I/O blip) should retry on
  the next run; source_message_id dedup keeps retry safe for any messages
  already written. On hard error we now log, flag HardErrors=true, and
  `continue` without advancing the checkpoint. A persistent error will
  surface via HardErrors=true across runs and the user can rerun with
  --no-resume if they want to force past it.

Addresses roborev combined review (6a28ac0) Medium findings wesm#2 and wesm#3.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jesserobbins
Copy link
Copy Markdown
Contributor Author

Triaged the combined review; fixes for the two real Mediums pushed as 1816893.

High (missing store APIs) — false positive, same as the previous pass. Every symbol cited is defined and builds clean locally and in CI (test + nix-build both SUCCESS). The roborev status check on this PR is also SUCCESS, contradicting the comment body. Call-site/definition references:

  • MustBeLocalcmd/msgvault/cmd/store_resolver.go:81
  • rebuildCacheAfterWritecmd/msgvault/cmd/build_cache.go:796
  • EnsureConversationWithTypeinternal/store/messages.go:1065
  • EnsureConversationParticipantinternal/store/messages.go:1190
  • UpsertMessageRawWithFormatinternal/store/messages.go:1205
  • LinkMessageLabelinternal/store/messages.go:756
  • UpsertReactioninternal/store/messages.go:1197
  • RecomputeConversationStatsinternal/store/messages.go:1032
  • s.dialectinternal/store/store.go:32 (populated in New and Open constructors)

Medium: first-thread checkpoints not resumable — fixed. The prior guard if prior.ThreadIndex > 0 ignored mid-first-thread checkpoints. Functionally no data loss (source_message_id dedup covers retry, outer loop already starts at threadIdx=0), but WasResumed stayed false, the "resuming" log didn't fire, and cumulative counters didn't carry forward. Guard removed; any well-formed checkpoint with matching RootDir is resumable. Added TestImportDYI_ResumeFromFirstThreadCheckpoint.

Medium: failed threads skipped on resume — fixed. Outer thread loop no longer advances the per-thread checkpoint on hard error. It logs, sets HardErrors=true, and continues without advancing — so a transient failure retries on the next run (source_message_id dedup keeps it idempotent), and a persistent failure surfaces cross-run via HardErrors=true. The user can rerun with --no-resume to force past a stuck thread if needed.

Medium: GetLatestCheckpointedSync cross-dialect binding — false positive. s.db is a *loggedDB wrapper (internal/store/db_logger.go:78) whose QueryRow/Query/Exec methods all call d.rebind(query) before delegating. The store's New constructor installs dialect.Rebind as that rebinder (store.go:104). So every ? placeholder in sync.go, messages.go, etc. is rewritten to $1-style at runtime on Postgres. The explicit s.dialect.Rebind(...) calls in sources.go:137,152 are redundant-but-harmless; the rest of the store relies on the wrapper, and that's intentional. GetLatestCheckpointedSync works fine on Postgres.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 23, 2026

roborev: Combined Review (1816893)

Verdict: Changes are not ready to merge due to one compile-blocking issue and several correctness/regression risks.

High

  • internal/fbmessenger/importer.go (writeThreadToStore)
    The importer calls store methods that are not present in this diff: EnsureConversationWithType, EnsureConversationParticipant, UpsertMessageRawWithFormat, and UpsertReaction. As written, this will not compile.
    Fix: Add these methods in internal/store or switch the importer to existing store APIs.

Medium

  • internal/fbmessenger/importer.go (checkpoint save after successful thread)
    After a hard thread error, the loop continues, and a later successful thread can advance the checkpoint past the failed thread. If the run stops after that, resume will start after the checkpoint and permanently skip the failed thread.
    Fix: Only checkpoint the highest contiguous successfully processed thread, or abort on hard thread errors.

  • cmd/msgvault/cmd/build_cache_messenger_test.go:74
    The test queries message_type from message Parquet files, but buildCache does not export m.message_type in the messages Parquet schema. This will fail with a missing-column error, and cached analytics cannot distinguish Messenger rows by type.
    Fix: Export m.message_type in the messages Parquet output and add the corresponding DuckDB Parquet CTE cast if it will be queried.

  • internal/fbmessenger/discover.go:222
    dec.Decode(&skip) with skip as json.RawMessage buffers the entire skipped subtree in memory. If a very large "messages" array appears before "participants", this lightweight streaming probe can still cause a major memory spike.
    Fix: Skip the subtree with a token-based depth counter using dec.Token() instead of decoding into json.RawMessage.

  • internal/query/duckdb.go (lines 647, 850) and internal/query/sqlite.go (lines 188, 258)
    The hardcoded message_type = 'email' filter was removed. If no replacement filtering was introduced, existing email-scoped query contexts may now return all message types, changing behavior and potentially breaking UI expectations.
    Fix: Confirm whether unified inbox behavior is intended; otherwise add an explicit message-type filter in AggregateOptions / MessageFilter.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

Took a look at this against the PR head (1816893) and the commit roborev actually reviewed (6a28ac0). Two findings are stale, two are incorrect analysis.

High — missing store APIs: Incorrect at 6a28ac0. All six methods existed in internal/store/messages.go at that commit (LinkMessageLabel, RecomputeConversationStats, EnsureConversationWithType, EnsureConversationParticipant, UpsertReaction, UpsertMessageRawWithFormat). go build ./... passes on both commits.

Medium — Rebind in GetLatestCheckpointedSync: Incorrect at 6a28ac0. loggedDB auto-rebinds every query — see internal/store/db_logger.go:100,119 (query = d.rebind(query)), wired up at internal/store/store.go:104,146 via newLoggedDB(db, dialect.Rebind). Bare ? is the convention throughout the store; the suggested fix would double-rebind.

Medium — first-thread checkpoint not resumable: Was real at 6a28ac0, fixed in 1816893 (startThreadIdx = prior.ThreadIndex unconditionally).

Medium — failed thread skipped on resume: Was real at 6a28ac0, fixed in 1816893 (hard error path continues without advancing the checkpoint; source_message_id dedup keeps retry safe).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Facebook Messenger DYI import

2 participants