Write MBOX conversion code for testing #11

Closed
bufordrat opened this issue Jan 26, 2022 · 6 comments · Fixed by #14

@bufordrat
Contributor

bufordrat commented Jan 26, 2022

Write MBOX conversion code for testing

This code can be approached as something we'll only use for casual testing while in development, rather than as part of our production code. (Though it may turn out to be easy to turn into something robust enough for production; we'll see.)

What we want for this issue is a function that will take a list of individual emails (each in the form of a string) and make it into an MBOX. This will provide a convenient way to take an email we generated, and see whether it will pass whatever we've decided is our standard of validation. (At present, our standard of validation is whether the file will open in Apple Mail, since Apple Mail is the only mail user agent (MUA) normal computer users would have heard of that can read MBOX.)

Something in the ballpark of this type signature ought to work:

val to_mbox : ?escape:bool -> string list -> string

to_mbox can go in a new module called lib/mbox.ml. As good a place as any for the time being. For more info on the optional escape parameter, please see below.

Background on MBOX

Much like the specification of email itself, the MBOX format is pretty nuts. The thing to note is that an MBOX is just a flat list. The emails appear in that list, and the delimiter the format uses is a From string that looks like this:

From foo@bar Fri Jan 21 11:48:27 2022

Most MUAs will accept any line starting with From (capital F, one space after the word) as an MBOX delimiter, but for maximum compatibility, an email address and date afterward are recommended. It doesn't matter what they are because they get ignored when the MBOX is parsed into a list of emails. Why? The From line is just a delimiter that's considered to be a part of the mailbox; it isn't part of any email.

This is in contrast to lines beginning with From: (that's 'from' with a colon): those are actual email headers, which means any line starting with From: you encounter while flipping through a file will be part of an email.
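To make the distinction concrete, here is a minimal sketch of a delimiter check. The name is_from_line is hypothetical (it is not part of any existing module), and the test mirrors the prefix check used in the Kw parser quoted further down: capital F, then a single space after the word.

```ocaml
(* Hypothetical helper: true when [line] would be read as an MBOX
   delimiter, i.e. it begins with the five characters "From ". *)
let is_from_line line =
  String.length line >= 5 && String.sub line 0 5 = "From "

(* A header such as "From: foo@bar" has a colon in the fifth position,
   so it is not mistaken for a delimiter. *)
let () =
  assert (is_from_line "From foo@bar Fri Jan 21 11:48:27 2022");
  assert (not (is_from_line "From: foo@bar"))
```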

For more background on this absurd weirdness, along with different conventions for escaping the MBOX delimiter, please see:

https://en.wikipedia.org/wiki/Mbox

Desired Behavior

At the MBOX level

to_mbox should take a list of emails (i.e. email strings) and do the following:

  • intersperse them with a From line resembling the example above---in fact, it can literally just be the exact example above every time, unless you want to get fancy and insert the current date/time
  • if there is no \r\n (CRLF) at the end of a given email, add two
  • if there is a CRLF at the end of a given email, add one
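
The three bullets above could be sketched roughly as follows. This is only a sketch under a few assumptions: the names from_line, terminate and join_messages are made up for illustration, the From line is terminated with CRLF, and a From line is placed in front of every email (including the first) rather than strictly between emails. Escaping is left out entirely here.

```ocaml
(* The fixed delimiter from the example above; the CRLF terminator is
   an assumption. *)
let from_line = "From foo@bar Fri Jan 21 11:48:27 2022\r\n"

let ends_with_crlf s =
  let n = String.length s in
  n >= 2 && String.sub s (n - 2) 2 = "\r\n"

(* Add one CRLF if the email already ends in one, two otherwise, so
   that every email is followed by a blank line. *)
let terminate email =
  if ends_with_crlf email then email ^ "\r\n" else email ^ "\r\n\r\n"

(* Put a From line in front of each email and concatenate. *)
let join_messages emails =
  String.concat "" (List.map (fun e -> from_line ^ terminate e) emails)
```

Whether the delimiter should also precede the first email is a judgment call; if Apple Mail turns out to be picky about it, that is the first thing to adjust.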

You can test that the result works by trying to import it into Apple Mail.

Character escaping

Because the delimiter for the MBOX format is the From line, this leads to all the usual annoyances re: quoting and character escaping. For example, imagine that the following Classic Britney Lyrics were part of the body of an email:

And you didn't hear
All my joy through my tears
All my hopes through my fears
Did you know, still, I miss you somehow?
From the bottom of my broken heart
There's just a thing or two I'd like you to know
You were my first love, you were my true love
From the first kisses to the very last rose
From the bottom of my broken heart
Even though time may find me somebody new
You were my real love, I never knew love
'Til there was you
From the bottom of my broken heart
Baby, I said, please stay (stay)
Give our love a chance for one more day, oh
We could've worked things out (taking time is what it's all about)
Taking time is what love's all about (oh)

An MBOX parser would obviously not want to parse the above string into four separate emails. The traditional workaround, discussed in the Wikipedia article linked to above, is to replace every "From " in the input string with ">From ". That leads to further problems, because ">From " could also theoretically be intended to be part of an email body.
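
To illustrate the problem, here is a small sketch of the traditional escaping applied to a single line. The helper names are hypothetical, and the sketch assumes the replacement only applies at the start of a line. Two different body lines end up identical after escaping, so the original cannot be recovered afterwards.

```ocaml
let starts_with prefix s =
  let n = String.length prefix in
  String.length s >= n && String.sub s 0 n = prefix

(* Traditional escaping: only a bare "From " at the start of a line
   gets a ">" in front of it. *)
let mboxo_escape_line line =
  if starts_with "From " line then ">" ^ line else line

(* A line that was already ">From ..." in the original body and a line
   that was "From ..." both come out the same. *)
let () =
  assert (mboxo_escape_line "From the bottom of my broken heart"
          = mboxo_escape_line ">From the bottom of my broken heart")
```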

A more robust way to handle this situation is to MIME-encode every email body using quoted-printable (rather than base64) encoding when parsing an MBOX:

https://en.wikipedia.org/wiki/Quoted-printable

This has the advantage of allowing you to sleep better at night re: parse errors, but the disadvantage of turning every single email in the input MBOX into a MIME-encoded email. For the purpose of being able to view things in an MUA that makes no difference, but for archival purposes, we generally want to err on the side of keeping as much of the original information in the input MBOX as we can intact. (Like, maybe Indiana Jones of the future is looking at Professor Smartypants' email backup and is interested in how many of their emails were MIME-encoded.) The jargon for this among archivists is 'original order':

https://en.wikipedia.org/wiki/Original_order

Keith and I discussed these trade-offs at some length and settled on the following solution for now. All the input MBOX-es we are planning to handle were either:

  • created by libpst
  • created by GMail
  • the actual format the person's MUA was using

That means that we can safely assume 'the input has correctly-escaped From lines' as an invariant, which in turn means that our handling of unescaped From lines can be more minimal than it would be for the kind of recalcitrant input we are fully expecting to have to deal with. So for the purposes of to_mbox, I think we can get away with it having an optional parameter of type bool, call it escape. The behavior would be as follows:

  • if escape is true, have to_mbox replace all occurrences of "From " with ">From " and all occurrences of ">From " with ">>From "
  • if escape is false (which it will be by default), have to_mbox throw an exception when it encounters From in any of the strings in the input list

You can define the relevant exception along these lines:

# exception MBOXParseError of string;;
exception MBOXParseError of string
# raise @@ MBOXParseError "whatever info you want in here";;
Exception: MBOXParseError "whatever info you want in here".
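
Putting the escape parameter and the exception together, the per-email handling might look something like the sketch below. The names starts_with, strip_quotes, escape_line, check_line and prepare_email are all hypothetical. The sketch also assumes that only line-initial occurrences matter (which is how a parser treats the delimiter), so it works line by line rather than replacing every occurrence in the string.

```ocaml
exception MBOXParseError of string

let starts_with prefix s =
  let n = String.length prefix in
  String.length s >= n && String.sub s 0 n = prefix

(* Drop any leading '>' characters from a line. *)
let strip_quotes line =
  let n = String.length line in
  let rec skip i = if i < n && line.[i] = '>' then skip (i + 1) else i in
  let i = skip 0 in
  String.sub line i (n - i)

(* "From " becomes ">From ", ">From " becomes ">>From ", and so on. *)
let escape_line line =
  if starts_with "From " (strip_quotes line) then ">" ^ line else line

(* With escaping off, an unescaped From line is an error. *)
let check_line line =
  if starts_with "From " line
  then raise (MBOXParseError ("unescaped From line: " ^ line))
  else line

(* Splitting on '\n' keeps any '\r' attached to its line, so joining
   with "\n" restores the original line endings. *)
let prepare_email ?(escape = false) email =
  String.split_on_char '\n' email
  |> List.map (if escape then escape_line else check_line)
  |> String.concat "\n"
```

A to_mbox along the lines of the signature above could then map prepare_email over the input list before adding the From lines.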

Getting Started

Here is some example MBOX-parsing code from the precursor to Prelude, a standard library called Kw. You can use it as a basis for our MBOX parsing code---in fact, probably with few to no changes.

(** {1 Mbox parser ({i Xavier Leroy})}

  Snarfed from: <{{:http://cristal.inria.fr/~xleroy/software.html#spamoracle}http://cristal.inria.fr/~xleroy/software.html#spamoracle}>

  Hacked by KW 2010-05-13 <{{:http://www.lib.uchicago.edu/keith/}http://www.lib.uchicago.edu/keith/}>
    - added map and fold functionals

  @author Xavier Leroy, projet Cristal, INRIA Rocquencourt
 *)
(***********************************************************************)
(*                                                                     *)
(*                 SpamOracle -- a Bayesian spam filter                *)
(*                                                                     *)
(*            Xavier Leroy, projet Cristal, INRIA Rocquencourt         *)
(*                                                                     *)
(*  Copyright 2002 Institut National de Recherche en Informatique et   *)
(*  en Automatique.  This file is distributed under the terms of the   *)
(*  GNU Public License version 2, http://www.gnu.org/licenses/gpl.txt  *)
(*                                                                     *)
(***********************************************************************)

(* $Id: mbox.ml,v 1.4 2002/08/26 09:35:25 xleroy Exp $ *)

(* Reading of a mailbox file and splitting into individual messages *)

type t =
  { ic: in_channel;
    zipped: bool;
    mutable start: string;
    buf: Buffer.t }

let open_mbox_file filename =
  if Filename.check_suffix filename ".gz" then
    { ic = Unix.open_process_in ("gunzip -c " ^ Filename.quote filename);
      zipped = true;
      start = "";
      buf = Buffer.create 50000 }
  else
    { ic = open_in filename;
      zipped = false;
      start = "";
      buf = Buffer.create 50000 }

let open_mbox_channel ic =
    { ic = ic;
      zipped = false;
      start = "";
      buf = Buffer.create 50000 }

let read_msg t =
  Buffer.clear t.buf;
  Buffer.add_string t.buf t.start;
  let rec read () =
    let line = input_line t.ic in
    if String.length line >= 5
    && String.sub line 0 5 = "From "
    && Buffer.length t.buf > 0 then begin
      t.start <- (line ^ "\n");
      Buffer.contents t.buf
    end else begin
      Buffer.add_string t.buf line;
      Buffer.add_char t.buf '\n';
      read ()
    end in
  try
    read()
  with End_of_file ->
    if Buffer.length t.buf > 0 then begin
      t.start <- "";
      Buffer.contents t.buf
    end else
      raise End_of_file

let close_mbox t =
  if t.zipped
  then ignore(Unix.close_process_in t.ic)
  else close_in t.ic

let mbox_file_iter filename fn =
  let ic = open_mbox_file filename in
  try
    while true do fn(read_msg ic) done
  with End_of_file ->
    close_mbox ic

(** [mbox_file_fold fn inchan acc]: fold the function [fn] over the messages in the mbox file open on [inchan] with [acc] as initial accumulator. *)
let mbox_file_fold fn inchan acc =		(* KW *)
  let ic = open_mbox_channel inchan in
  let rec loop acc =
    match try Some (read_msg ic) with End_of_file -> None with
      | Some msg -> loop (fn acc msg)
      | None     -> acc
  in
    loop acc

(** [mbox_file_map fn filename]: map the function [fn] over the messages in the mbox file [filename]. *)
let mbox_file_map fn filename =		(* KW *)
  let ic = open_in filename in
    try
      let result = List.rev (mbox_file_fold (fun acc msg -> fn msg::acc) ic []) in
      close_in ic;
      result
    with exn -> close_in ic; raise exn

let mbox_channel_iter inchan fn =
  let ic = open_mbox_channel inchan in
  try
    while true do fn(read_msg ic) done
  with End_of_file ->
    close_mbox ic

let read_single_msg inchan =
  let res = Buffer.create 10000 in
  let buf = Bytes.create 1024 in
  let rec read () =
    let n = input inchan buf 0 (Bytes.length buf) in
    if n > 0 then begin
      Buffer.add_subbytes res buf 0 n;
      read ()
    end in
  read ();
  Buffer.contents res
@waclena

waclena commented Jan 26, 2022 via email

@bufordrat
Contributor Author

See revised issue above, containing a new section on what to do about character escaping.

@nmmull
Collaborator

nmmull commented Jan 30, 2022

Out of curiosity, if we're doing From-munging already, why not use mboxrd (From -> >From and >From -> >>From) instead since it's reversible?

@bufordrat
Contributor Author

Good question! What do you think, @waclena?

@waclena

waclena commented Jan 31, 2022 via email

@bufordrat
Contributor Author

Excellent. Will update the issue.
