roborev: Combined Review (
|
|
I don't love the roborev loops but here we are. |
roborev: Combined Review (
|
Import Facebook Messenger data from DYI exports (JSON and HTML formats), E2EE flat-file exports, and multiple directory layouts including your_activity_across_facebook, your_facebook_activity, and legacy messages/ roots. Includes CLI command, MIME parsing, attachment resolution, Discover/Parse pipeline, and query engine support for messenger message types. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
630a87f to 14a00f6
|
Addressed @trel feedback & collapsed reviews into features and fixes. Also tested against my Facebook archives. |
roborev: Combined Review (
|
…rrectly

Interrupted imports were marked failed but GetActiveSync only looked for running syncs, so the saved checkpoint was invisible on resume. Added GetLatestCheckpointedSync to find the most recent sync with a checkpoint regardless of status, used as fallback when no active sync exists.

HTML images were assigned to the first empty-body or attachment-less message regardless of DOM position. Refactored collectHTMLLines to track each image's line index, so parseHTMLLines assigns images only to the message block where they actually appear in the document.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
roborev: Combined Review (
|
Previously an invalid format like --format jsno silently imported zero messages and returned success. Now validated against the known set (auto, json, html, both) before discovery begins. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
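For readers following along, here is a minimal sketch of the up-front check described above. The helper name `validateFormat` and the error wording are assumptions for illustration; only the accepted set (auto, json, html, both) comes from the commit message.

```go
package main

import "fmt"

// knownFormats mirrors the set named in the commit message.
var knownFormats = map[string]bool{
	"auto": true,
	"json": true,
	"html": true,
	"both": true,
}

// validateFormat is a hypothetical helper: it rejects unknown values
// before any discovery work starts, instead of importing zero messages.
func validateFormat(format string) error {
	if knownFormats[format] {
		return nil
	}
	return fmt.Errorf("invalid --format %q (want one of: auto, json, html, both)", format)
}

func main() {
	for _, f := range []string{"json", "jsno"} {
		if err := validateFormat(f); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", f)
	}
}
```

Failing early keeps a typo from looking like a successful import of an empty archive.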
Resolve conflict in internal/store/sync.go — keep both GetLatestCheckpointedSync (ours) and HasAnyActiveSync (upstream). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
roborev: Combined Review (
|
source_conversation_id and source_message_id used only the thread directory basename, so threads with the same name in different sections (e.g. inbox/ vs archived_threads/) could collide within the same source. Now prefixed with the section (e.g. "inbox/alice_ABC123__0"). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
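A rough sketch of the section-prefixed IDs, assuming the `__<n>` suffix seen in the example string is a per-message index. The function names `conversationID` and `messageID` are hypothetical and not taken from the importer.

```go
package main

import (
	"fmt"
	"path"
)

// conversationID prefixes the thread directory basename with its section
// (for example "inbox" or "archived_threads") so equal basenames in
// different sections no longer collide. Hypothetical helper.
func conversationID(section, threadDir string) string {
	return path.Join(section, path.Base(threadDir))
}

// messageID appends an index to the conversation ID. The "__<n>" shape is
// inferred from the example in the commit message; treat it as an assumption.
func messageID(section, threadDir string, index int) string {
	return fmt.Sprintf("%s__%d", conversationID(section, threadDir), index)
}

func main() {
	fmt.Println(messageID("inbox", "messages/inbox/alice_ABC123", 0))
	fmt.Println(messageID("archived_threads", "messages/archived_threads/alice_ABC123", 0))
}
```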
Build and all tests pass. Changes:
- Apply `dialect.Rebind()` to parameterized queries in `RemoveSourceSerialized` (`SELECT COUNT(*)` and `DELETE FROM sources`) to prevent placeholder breakage when PostgreSQL dialect lands
- Document load-bearing defer ordering in `ExecuteContext` (panic handler must run before log-file close because `os.Exit` skips remaining defers)
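The defer-ordering note relies on Go running deferred calls in last-in, first-out order. The sketch below is not `ExecuteContext`; it is a small stand-in that recovers from the panic instead of calling `os.Exit`, so the ordering can be observed and tested.

```go
package main

import "fmt"

// run records the order in which its deferred functions execute.
// Deferred calls run last-in, first-out, so the panic handler must be
// registered AFTER the log-file close in order to run BEFORE it.
func run(shouldPanic bool) (order []string) {
	// Registered first, runs last: stands in for closing the log file.
	defer func() { order = append(order, "close log file") }()

	// Registered last, runs first: stands in for the panic handler.
	// The real handler is described as calling os.Exit, which would skip
	// every defer still pending; this sketch only recovers.
	defer func() {
		if r := recover(); r != nil {
			order = append(order, "handle panic")
		}
	}()

	if shouldPanic {
		panic("boom")
	}
	return order
}

func main() {
	fmt.Println(run(true))
	fmt.Println(run(false))
}
```

If the two registrations were swapped, the log file would already be closed by the time the handler tried to write to it.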
roborev: Combined Review (
|
GetLatestCheckpointedSync was matching any sync run with cursor_before, including completed runs. This caused re-imports to skip threads that were already checkpointed, missing new messages added to existing threads. Filter to only running/failed status so completed imports always re-scan. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
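To illustrate the status filter, here is an in-memory sketch. The real GetLatestCheckpointedSync presumably queries the store; the `syncRun` type, its field names, and the status strings below are assumptions made for this example.

```go
package main

import "fmt"

// syncRun is a hypothetical, simplified view of a sync-run row.
type syncRun struct {
	ID         int
	Status     string // assumed values: "running", "failed", "completed"
	Checkpoint string // stands in for cursor_before; empty means none
}

// latestResumable returns the most recent run that has a checkpoint and
// is still running or failed. Completed runs are ignored so a re-import
// always re-scans. runs is assumed to be ordered oldest to newest.
func latestResumable(runs []syncRun) (syncRun, bool) {
	for i := len(runs) - 1; i >= 0; i-- {
		r := runs[i]
		if r.Checkpoint == "" {
			continue
		}
		if r.Status == "running" || r.Status == "failed" {
			return r, true
		}
	}
	return syncRun{}, false
}

func main() {
	runs := []syncRun{
		{ID: 1, Status: "completed", Checkpoint: "thread-9"},
	}
	_, ok := latestResumable(runs)
	fmt.Println("resume after completed run:", ok)
}
```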
|
Interesting scenario. Hard to design a test for. I don't think people are going to merge in these very often. Will see what a fix looks like. |
|
I will add this is already useful for me. |
…L scan
Six roborev fixes bundled by code area:
- writeThreadToStore: the synthesized-sender branch (message whose
sender is not in the thread's participants header) now also calls
EnsureConversationParticipant, mirroring the loop that handles
participants enumerated from the header. Previously sender_id was
populated on the message but the participant was never linked to the
conversation, skewing participant analytics. Regression test
TestImportDYI_SynthesizedSenderLinkedToConversation.
- Introduce errLimitReached and return it when opts.Limit trips
mid-thread. The outer loop in ImportDYI detects the sentinel, breaks
without advancing the per-thread checkpoint, and doesn't log as
"thread failed". A later non-limited run re-scans the partial thread
and picks up the remaining messages via source_message_id dedup.
- ReplaceMessageRecipients("from", …) is now skipped when senderID is
invalid, so a re-import where the current fixture can't resolve a
sender doesn't clobber a previously-recorded "from" row. The error
return is now logged via logger.Warn instead of discarded.
- parseHTMLLines: when a sender-candidate line has no timestamp within
the look-ahead window and another sender is hit first, resume scanning
at that candidate (i = nextSender) rather than advancing one line at
a time through the failed window.
- Wire up ImportSummary.MessagesSkipped in the UpsertMessage error
branch so per-message failures are counted for the CLI's "%d skipped"
line instead of trivially staying at zero.
- Wire up ImportSummary.HardErrors on the outer-loop catastrophic
per-thread failure branch (the "thread failed" log). Ctx cancel,
errLimitReached, and per-message upsert errors remain Errors-only.
Addresses roborev jobs 206 and 211.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
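The errLimitReached change follows the usual Go sentinel-error pattern. Below is a simplified sketch: only the sentinel's name comes from the commit message, while `importThread`, `runImport`, and the representation of threads as message counts are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errLimitReached signals that the message limit tripped mid-thread.
var errLimitReached = errors.New("message limit reached")

// importThread is a stand-in for the per-thread import: it "imports"
// msgs messages while decrementing the shared remaining budget.
func importThread(msgs int, remaining *int) error {
	for i := 0; i < msgs; i++ {
		if *remaining == 0 {
			return errLimitReached
		}
		*remaining--
	}
	return nil
}

// runImport returns the checkpoint (index of the next thread to process)
// and the number of messages imported. When the sentinel is seen, the
// loop stops without advancing the checkpoint past the partial thread.
func runImport(threads []int, limit int) (checkpoint, imported int) {
	remaining := limit
	for idx, n := range threads {
		err := importThread(n, &remaining)
		if errors.Is(err, errLimitReached) {
			break // not a thread failure; leave the checkpoint at idx
		}
		checkpoint = idx + 1
	}
	return checkpoint, limit - remaining
}

func main() {
	checkpoint, imported := runImport([]int{2, 3, 4}, 4)
	fmt.Println("checkpoint:", checkpoint, "imported:", imported)
}
```

Because the checkpoint still points at the partial thread, a later run without a limit revisits it, and deduplication on source_message_id is what keeps the already-written messages from being duplicated.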
discoverE2EEFlat previously treated any unknown top-level JSON file as an E2EE thread, gated only by a hardcoded metadata-filename allowlist. Any new top-level JSON Facebook adds in a future DYI revision would be handed to ParseE2EEJSONFile and either log as ErrCorruptJSON or silently register as a zero-message thread.

Filter by shape via a streaming json.Decoder probe (probeE2EEShape) that reads only enough tokens to see "participants" and "messages" at the top level, classified via a three-state e2eeShape enum:
- Thread (object with both keys) → included in discovery
- NotThread (non-object, or object with neither key) → silently skipped
- Unknown (I/O or mid-parse error, or object with exactly one of the two keys) → passed through to the parser

Keeping non-thread shapes out of the indexed list matters because the per-thread checkpoint resumes by index: if a metadata file joined the list one run and dropped the next (e.g. after a Facebook DYI schema change or an allowlist update) the saved index would point past the next real thread.

ParseE2EEJSONFile gets a matching three-way classification: a new ErrNotE2EEThread sentinel (silent skip) for non-objects and object-with-neither-key, the existing ErrCorruptJSON (log + count) for object-with-exactly-one-key, and full decode for well-formed thread shapes. A partial/malformed thread export now surfaces via the ThreadsSkipped path instead of vanishing. runE2EE treats ErrNotE2EEThread as a silent skip. Files are read and decoded once.

Tests: TestParseE2EEJSONFile_NotAThread, TestParseE2EEJSONFile_PartialObjectCorrupt, and TestDiscover_E2EEFlatRejectsNonThreadJSON.

Addresses roborev jobs 211, 215, 222, 224, 225.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
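One way such a shape probe could look, using `encoding/json`'s token API. This is a sketch and not the PR's probeE2EEShape: the names `probeShape`, `probeString`, and the `shape*` constants are hypothetical, and this version decodes each skipped value in full, which the real probe may avoid.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

type e2eeShape int

const (
	shapeUnknown   e2eeShape = iota // I/O or parse error, or exactly one key
	shapeThread                     // object with both keys
	shapeNotThread                  // non-object, or object with neither key
)

// probeShape walks the top-level keys of a JSON document and stops as
// soon as both "participants" and "messages" have been seen.
func probeShape(r io.Reader) e2eeShape {
	dec := json.NewDecoder(r)
	tok, err := dec.Token()
	if err != nil {
		return shapeUnknown
	}
	if d, ok := tok.(json.Delim); !ok || d != '{' {
		return shapeNotThread
	}
	var hasParticipants, hasMessages bool
	for dec.More() {
		keyTok, err := dec.Token()
		if err != nil {
			return shapeUnknown
		}
		switch keyTok {
		case "participants":
			hasParticipants = true
		case "messages":
			hasMessages = true
		}
		if hasParticipants && hasMessages {
			return shapeThread
		}
		// Skip the value that belongs to this key.
		var skip json.RawMessage
		if err := dec.Decode(&skip); err != nil {
			return shapeUnknown
		}
	}
	// Consume the closing brace; a truncated file fails here.
	if _, err := dec.Token(); err != nil {
		return shapeUnknown
	}
	if hasParticipants || hasMessages {
		return shapeUnknown
	}
	return shapeNotThread
}

// probeString is a convenience wrapper for in-memory input.
func probeString(s string) e2eeShape {
	return probeShape(strings.NewReader(s))
}

func main() {
	fmt.Println(probeString(`{"participants":[],"messages":[]}`) == shapeThread)
	fmt.Println(probeString(`{"settings":{}}`) == shapeNotThread)
}
```

Mixing `Token` and `Decode` on the same decoder is supported by the standard library, which is what lets the probe skip values without building Go structures for them.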
writeThreadToStore's ON CONFLICT UPDATE path rewrote sender_id,
is_from_me, and message_recipients("from") on every re-import using the
current-run senderName. When a re-import of an already-imported message
couldn't resolve a sender (senderName stripped, participant renamed, or
the synthesized-sender path failed) it clobbered previously-good sender
metadata with NULL / empty values.
Before UpsertMessage, when the current run's senderID is invalid we now
look up the prior row and rehydrate enough context that every
downstream write preserves it:
- sender_id is read directly.
- is_from_me is read directly so a self-authored message doesn't
flip to inbound and add the account owner to "to" recipients.
- FromMeCount is bumped so the CLI's --me-mismatch warning doesn't
fire on runs where rehydration was the source of is_from_me.
- Sender display_name and email are rehydrated via LEFT JOIN on
participants. The display_name SELECT COALESCEs onto
message_recipients.display_name because the seeded --me participant
is created with an empty display_name, and the prior "from" row is
the authoritative label for self-authored messages.
- Local senderName / nameToID / nameToEmail are repopulated so the
subsequent ReplaceMessageRecipients("from", …) and UpsertFTS write
preserved values instead of empty strings.
- EnsureConversationParticipant is re-called to repair DBs affected
by the pre-fix synthesized-sender bug (sender_id populated but
conversation_participants row missing).
Regression tests:
- TestImportDYI_SenderIDPreservedOnReimport covers sender_id,
is_from_me, from display_name, and the account-owner-not-in-to
assertion for a self-authored message, plus FromMeCount.
- TestImportDYI_ReimportRepairsConversationParticipant simulates a
pre-fix DB (sender_id present, conversation_participants row missing)
and asserts the row is restored on re-import.
Addresses roborev jobs 215, 217, 219, 221, 224.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
roborev: Combined Review (
|
|
The two High findings in the combined review are false positives — every cited symbol exists in the tree.
Store APIs the importer supposedly references that don't exist:
All of these also exist on The Medium |
roborev: Combined Review (
|
…ilure

Two resume-path corrections:
- A fbmessengerCheckpoint saved mid-way through the first thread has ThreadIndex == 0, which the prior resume guard (`if prior.ThreadIndex > 0`) treated as "no progress yet" and ignored. Functionally this didn't lose data (source_message_id dedup made retry safe and the outer loop already starts at threadIdx=0) but summary.WasResumed stayed false, the "resuming" log didn't fire, and cumulative counters from the prior run didn't carry forward, so the summary undercounted and the CLI didn't tell the user a resume was happening. Gate removed; any well-formed checkpoint with a matching RootDir is now resumable. Added TestImportDYI_ResumeFromFirstThreadCheckpoint.
- The outer thread loop previously advanced the per-thread checkpoint to threadIdx+1 even after a hard error on that thread, so a resumed run would skip it. A transient failure (DB lock, I/O blip) should retry on the next run; source_message_id dedup keeps retry safe for any messages already written. On hard error we now log, flag HardErrors=true, and `continue` without advancing the checkpoint. A persistent error will surface via HardErrors=true across runs and the user can rerun with --no-resume if they want to force past it.

Addresses roborev combined review (6a28ac0) Medium findings wesm#2 and wesm#3.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
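A sketch of the two corrections in isolation. The `checkpoint` struct, `oldGuard`, `canResume`, and `checkpointAfter` are hypothetical names; only the `ThreadIndex > 0` guard, the RootDir match, and the "do not advance on hard error" rule come from the commit message. How later successful threads interact with a held-back checkpoint is not covered here.

```go
package main

import (
	"errors"
	"fmt"
)

// checkpoint is a simplified stand-in for fbmessengerCheckpoint.
type checkpoint struct {
	RootDir     string
	ThreadIndex int
}

// errTransient is a sample hard error (for example a locked database).
var errTransient = errors.New("database is locked")

// oldGuard reproduces the previous behavior: a checkpoint saved while
// still inside the first thread (ThreadIndex == 0) was ignored.
func oldGuard(prior *checkpoint) bool {
	return prior != nil && prior.ThreadIndex > 0
}

// canResume is the relaxed rule: any checkpoint with a matching,
// non-empty RootDir is resumable, including ThreadIndex == 0.
func canResume(prior *checkpoint, rootDir string) bool {
	return prior != nil && prior.RootDir != "" && prior.RootDir == rootDir
}

// checkpointAfter decides the checkpoint after one thread: advance to
// threadIdx+1 on success, keep the current value and flag a hard error
// on failure so the thread is retried on the next run.
func checkpointAfter(current, threadIdx int, err error) (next int, hardError bool) {
	if err != nil {
		return current, true
	}
	return threadIdx + 1, false
}

func main() {
	prior := &checkpoint{RootDir: "/exports/fb", ThreadIndex: 0}
	fmt.Println("old guard:", oldGuard(prior), "new guard:", canResume(prior, "/exports/fb"))

	next, hard := checkpointAfter(3, 3, errTransient)
	fmt.Println("next checkpoint:", next, "hard error:", hard)
}
```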
|
Triaged the combined review; fixes for the two real Mediums pushed as High (missing store APIs) — false positive, same as the previous pass. Every symbol cited is defined and builds clean locally and in CI (
Medium: first-thread checkpoints not resumable — fixed. The prior guard Medium: failed threads skipped on resume — fixed. Outer thread loop no longer advances the per-thread checkpoint on hard error. It logs, sets Medium: |
roborev: Combined Review (
|
|
Took a look at this against the PR head ( High — missing store APIs: Incorrect at Medium — Rebind in Medium — first-thread checkpoint not resumable: Was real at Medium — failed thread skipped on resume: Was real at |
Summary
- `msgvault import-messenger` command for Facebook "Download Your Information" exports
- Legacy `messages/` and newer `your_activity_across_facebook/messages/` DYI layouts
- Removes the `message_type = 'email'` filter from query engines so all archived message types participate in search and aggregation

~4,700 lines across 44 files. Core importer is ~2,100 lines with ~1,600 lines of tests.
Closes #280. Related: #136, #192, #278.
🤖 Generated with Claude Code