Feature: Facebook Messenger DYI import #280

@jesserobbins

msgvault already ingests WhatsApp, iMessage, Google Voice, and SMS alongside email. The unified schema, the text-message import pipeline, the FTS indexing, the content-addressed attachment store, the Parquet analytics cache — all of it works for non-email message types today. Adding another chat source is now a well-worn path.

Facebook Messenger is the obvious next one, and it is the most requested source (#192) after WhatsApp.

Facebook gives you a "Download Your Information" export, which is a zip file full of JSON or HTML files organized by thread. The data is all there: timestamps, participants, reactions, photos, videos, call logs. It is yours.

I have been testing this. The current branch is jesserobbins:jesse/fbmessenger.

What this adds

msgvault import-messenger ingests a Facebook DYI export directory and stores every conversation in the same schema as email, WhatsApp, and iMessage.

msgvault import-messenger ~/Downloads/facebook-your_activity --me jesse
msgvault import-messenger --format json ~/facebook-dyi --me jesse
msgvault import-messenger --limit 100 ~/facebook-dyi --me jesse

After import, Messenger conversations show up everywhere: TUI, MCP server, HTTP API, search, analytics. Stage old threads for deletion. Build a collection that spans email and Messenger and deduplicate across both.

DYI export formats

Facebook now ships three archive formats, and they have changed over the years.

JSON is preferable. It has millisecond timestamps, structured participants, typed message categories (Generic, Share, Call, Unsubscribe), and reaction metadata. The catch: Facebook writes the UTF-8 bytes of every string as if each byte were a Latin-1 character, so after a normal JSON decode every non-ASCII character arrives as mojibake. The parser must undo this transparently.
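A minimal sketch of the repair, assuming the usual form of this problem, where each UTF-8 byte appears in the JSON as its own `\u00XX` escape. The function name `fix_mojibake` is hypothetical and is not taken from the branch.

```python
import json

def fix_mojibake(text: str) -> str:
    """Treat each code point as one byte, then decode those bytes as UTF-8."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text that is already correct, or is not UTF-8 underneath, is left unchanged.
        return text

raw = '{"content": "caf\\u00c3\\u00a9"}'  # how "café" appears in a DYI JSON file
message = json.loads(raw)
print(fix_mojibake(message["content"]))  # café
```

Returning the input unchanged on failure keeps the repair safe to apply to every string field, including ones that are plain ASCII.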

HTML is what most people have. Less metadata, no reaction structure, timestamps that vary by locale and DYI version. The parser handles four known timestamp layouts and falls back to best-effort extraction. When a thread has both JSON and HTML, JSON wins.
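The try-each-layout approach can be sketched as follows. The layouts listed here are illustrative assumptions, not the four layouts the branch actually handles, and `parse_timestamp` is a hypothetical name. Month and weekday names are matched using the process locale, so this sketch assumes an English or C locale.

```python
from datetime import datetime
from typing import Optional

# Hypothetical layouts for illustration only; the issue does not list the real ones.
CANDIDATE_LAYOUTS = [
    "%b %d, %Y, %I:%M %p",
    "%b %d, %Y %I:%M:%S %p",
    "%A, %B %d, %Y at %I:%M %p",
    "%d %B %Y %H:%M",
]

def parse_timestamp(text: str) -> Optional[datetime]:
    """Try each candidate layout in order; return None if none of them match."""
    # Collapse runs of whitespace, including non-breaking spaces, into single spaces.
    cleaned = " ".join(text.split())
    for layout in CANDIDATE_LAYOUTS:
        try:
            return datetime.strptime(cleaned, layout)
        except ValueError:
            continue
    return None

print(parse_timestamp("Jan 15, 2020, 3:45 PM"))  # 2020-01-15 15:45:00
```

Returning `None` instead of raising leaves room for a best-effort fallback at the call site.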

End-to-end encrypted (E2EE) threads (new) use a flat-file format that differs from both. One message per line, colon-delimited, no JSON structure. A separate parser handles these.

All three formats are auto-detected per thread. A single export can contain a mix.
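A sketch of per-thread detection with JSON taking precedence over HTML. It assumes thread directories contain files named like `message_1.json` or `message_1.html`, as Facebook exports commonly do. E2EE detection is left out because this issue does not describe that file layout. `detect_format` is a hypothetical name.

```python
import tempfile
from pathlib import Path
from typing import Optional

def detect_format(thread_dir: Path) -> Optional[str]:
    """Return "json" or "html" for a thread directory, preferring JSON when both exist."""
    if any(thread_dir.glob("message_*.json")):
        return "json"
    if any(thread_dir.glob("message_*.html")):
        return "html"
    return None  # unknown layout; a real importer would try other parsers here

with tempfile.TemporaryDirectory() as tmp:
    thread = Path(tmp)
    (thread / "message_1.html").touch()
    print(detect_format(thread))  # html
    (thread / "message_1.json").touch()
    print(detect_format(thread))  # json
```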

Key behaviors

  • Discovery: walks the export directory for both the old messages/ layout and the newer your_activity_across_facebook/messages/ structure. Handles inbox, archived, filtered, and message-request sections.
  • Identity: --me sets is_from_me on outbound messages. Participant addresses are synthesized as <slug>@facebook.messenger.
  • Attachments: photos, videos, stickers, audio, and files are ingested into content-addressed storage.
  • Reactions: stored as relational rows and appended to the message body for FTS searchability.
  • Resumable: checkpoints every 50 threads. If interrupted, picks up where it left off.
  • Mojibake decoding: Latin-1-over-UTF-8 encoding in Facebook's JSON exports is decoded transparently.
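The discovery step in the list above could look roughly like this. The two root paths come from this issue. The section directory names are assumptions based on common exports, and `discover_threads` is a hypothetical name, not the branch's code.

```python
import tempfile
from pathlib import Path
from typing import Iterator

# Old and new export layouts, as described in the Discovery bullet.
MESSAGE_ROOTS = ("messages", "your_activity_across_facebook/messages")
# Assumed on-disk names for the inbox, archived, filtered, and message-request sections.
SECTIONS = ("inbox", "archived_threads", "filtered_threads", "message_requests")

def discover_threads(export_dir: Path) -> Iterator[Path]:
    """Yield every thread directory found under either layout."""
    for root in MESSAGE_ROOTS:
        for section in SECTIONS:
            section_dir = export_dir / root / section
            if not section_dir.is_dir():
                continue
            # Sort for a stable order, which makes resuming from a checkpoint simpler.
            for thread_dir in sorted(section_dir.iterdir()):
                if thread_dir.is_dir():
                    yield thread_dir

with tempfile.TemporaryDirectory() as tmp:
    export = Path(tmp)
    (export / "messages" / "inbox" / "alice_1").mkdir(parents=True)
    (export / "your_activity_across_facebook" / "messages" / "archived_threads" / "bob_2").mkdir(parents=True)
    print([p.name for p in discover_threads(export)])  # ['alice_1', 'bob_2']
```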

Companion change

The query layer currently hard-codes message_type = 'email' filters in aggregate views, which hides all non-email messages from the TUI. This branch removes those filters so Messenger (and WhatsApp, iMessage, etc.) results participate in search and aggregation. If you archived it, you should be able to find it.
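To illustrate the effect of that filter, here is a small sketch using an in-memory SQLite table. The table layout, the `'messenger'` type value, and `count_by_sender` are hypothetical stand-ins, not msgvault's actual schema or query code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, heavily simplified stand-in for the unified message table.
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, message_type TEXT, sender TEXT)")
conn.executemany(
    "INSERT INTO messages (message_type, sender) VALUES (?, ?)",
    [
        ("email", "a@example.com"),
        ("email", "a@example.com"),
        ("messenger", "jesse@facebook.messenger"),
    ],
)

def count_by_sender(email_only: bool) -> dict:
    """Aggregate message counts per sender, with or without the hard-coded type filter."""
    where = "WHERE message_type = 'email'" if email_only else ""
    rows = conn.execute(
        f"SELECT sender, COUNT(*) FROM messages {where} GROUP BY sender ORDER BY sender"
    )
    return dict(rows.fetchall())

print(count_by_sender(email_only=True))   # {'a@example.com': 2}
print(count_by_sender(email_only=False))  # {'a@example.com': 2, 'jesse@facebook.messenger': 1}
```

With the filter in place the Messenger sender never reaches the aggregate, which matches the symptom described above.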

Scope

~4,400 lines across 40 files. Core importer is ~2,100 lines of implementation and ~1,600 lines of tests covering JSON parsing, HTML parsing, E2EE parsing, mojibake decoding, multi-file thread assembly, attachment resolution, FTS indexing, and Parquet cache integration.

Follows the same discover → parse → ingest pipeline as WhatsApp, iMessage, and Google Voice importers.

Related

Branch ready for PR: jesse/fbmessenger (6 commits on current main, tests pass).
