Write MBOX conversion code for testing #11
Comments
Matt Teichman ***@***.***> writes:
For more background on this absurd weirdness, along with different
conventions for escaping the MBOX delimiter, please see:
https://en.wikipedia.org/wiki/Mbox
Desired Behavior
to_mbox should take a list of emails (i.e. email strings) and do the
following:
* intersperse them with a From line resembling the example above---in
fact, it can literally just be the exact example above every time,
unless you want to get fancy and insert the current date/time
* if there is no CRLF at the end of a given email, add two
* if there is a CRLF at the end of a given email, add one
And don't forget to quote any From_ lines in each message in the list,
as discussed in the Wikipedia article!
See revised issue above, containing a new section on what to do about character escaping.
Out of curiosity, if we're doing From-munging already, why not use mboxrd?
Good question! What do you think, @waclena?
Matt Teichman ***@***.***> writes:
Good question! What do you think, @waclena?
Yes, surely. It's clearly the better mung!
Excellent. Will update the issue.
Merged
Write MBOX conversion code for testing
This code can be approached as something we'll only use for casual testing while in development, rather than as part of our production code. (Though it may turn out to be easy to turn into something robust enough for production; we'll see.)
What we want for this issue is a function that will take a list of individual emails (each in the form of a string) and make it into an MBOX. This will provide a convenient way to take an email we generated, and see whether it will pass whatever we've decided is our standard of validation. (At present, our standard of validation is whether the file will open in Apple Mail, since Apple Mail is the only mail user agent (MUA) normal computer users would have heard of that can read MBOX.)
Something in the ballpark of this type signature ought to work:
to_mbox can go in a new module called lib/mbox.ml. As good a place as any for the time being. For more info on the optional escape parameter, please see below.
Background on MBOX
Much like the specification of email itself, the MBOX format is pretty nuts. The thing to note is that an MBOX is just a flat list. The emails appear in that list, and the delimiter the format uses is a From string that looks like this:
Most MUAs will accept any line starting with "From " (capital F, one space after the word) as an MBOX delimiter, but for maximum compatibility, an email address and date afterward are recommended. It doesn't matter what they are because they get ignored when the MBOX is parsed into a list of emails. Why? The From line is just a delimiter that's considered to be a part of the mailbox; it isn't part of any email.
This is in contrast to lines beginning with From: (that's 'from' with a colon): those are actual email headers, which means any line starting with From: you encounter while flipping through a file will be part of an email.
For more background on this absurd weirdness, along with different conventions for escaping the MBOX delimiter, please see:
https://en.wikipedia.org/wiki/Mbox
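The distinction between delimiter lines and From: headers can be sketched in OCaml roughly as follows. The helper names has_prefix and is_from_line are hypothetical, and the sample lines are made up for illustration rather than taken from any real mailbox.

```ocaml
(* [has_prefix ~prefix s] is true when [s] begins with [prefix]. *)
let has_prefix ~prefix s =
  let n = String.length prefix in
  String.length s >= n && String.sub s 0 n = prefix

(* A delimiter line starts with "From " (capital F, one space).
   "From:" header lines do not match, because a colon follows the word. *)
let is_from_line line = has_prefix ~prefix:"From " line

(* Made-up sample lines, for illustration only. *)
let () =
  assert (is_from_line "From someone@example.com Thu Jan  1 00:00:00 1970");
  assert (not (is_from_line "From: Someone <someone@example.com>"));
  assert (not (is_from_line "from lowercase"))
```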
Desired Behavior
At the MBOX level
to_mbox should take a list of emails (i.e. email strings) and do the following:
* intersperse them with a From line resembling the example above---in fact, it can literally just be the exact example above every time, unless you want to get fancy and insert the current date/time
* if there is no \r\n (CRLF) at the end of a given email, add two
* if there is a \r\n (CRLF) at the end of a given email, add one
You can test that the result works by trying to import it into Apple Mail.
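As a rough sketch of the MBOX-level behavior only (no escaping), something like the following might do. The name join_as_mbox and the contents of from_line are placeholders, not the example line from this issue. Putting a From line before every message (including the first) and ending that line with CRLF are assumptions about what "intersperse" should mean here. Whether Apple Mail accepts the output has not been checked.

```ocaml
(* Placeholder delimiter; the address and date are made up. *)
let from_line = "From MAILER-DAEMON Thu Jan  1 00:00:00 1970\r\n"

(* True when [s] ends with a CRLF pair. *)
let ends_with_crlf s =
  let n = String.length s in
  n >= 2 && String.sub s (n - 2) 2 = "\r\n"

(* Add one CRLF if the email already ends with one, two otherwise,
   so that every message ends up followed by a blank line. *)
let terminate email =
  if ends_with_crlf email then email ^ "\r\n" else email ^ "\r\n\r\n"

(* Prefix each terminated email with the delimiter and concatenate. *)
let join_as_mbox emails =
  String.concat "" (List.map (fun e -> from_line ^ terminate e) emails)
```

With this sketch, join_as_mbox ["a"; "b\r\n"] yields two delimiter lines, each followed by a message body that ends in CRLF CRLF.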
Character escaping
Because the delimiter for the MBOX format is the From line, this leads to all the usual annoyances re: quoting and character escaping. For example, imagine that the following Classic Britney Lyrics were part of the body of an email:
An MBOX parser would obviously not want to parse the above string into four separate emails. The traditional workaround, discussed in the Wikipedia article linked to above, is to replace all "From "-s with ">From "-s in the input string. That leads to further problems, because ">From " could also theoretically be intended to be part of an email body.
A more robust way to handle this situation is to MIME-encode every email body using quoted-printable (rather than base64) encoding when parsing an MBOX:
https://en.wikipedia.org/wiki/Quoted-printable
This has the advantage of allowing you to sleep better at night re: parse errors, but the disadvantage of turning every single email in the input MBOX into a MIME-encoded email. For the purpose of being able to view things in an MUA that makes no difference, but for archival purposes, we generally want to err on the side of keeping as much of the original information in the input MBOX as we can intact. (Like, maybe Indiana Jones of the future is looking at Professor Smartypants' email backup and is interested in how many of their emails were MIME-encoded.) The jargon for this among archivists is 'original order':
https://en.wikipedia.org/wiki/Original_order
Keith and I discussed these trade-offs at some length and settled on the following solution for now. All the input MBOX-es we are planning to handle were either:
libpst
That means that we can safely assume 'the input has correctly-escaped From lines' as an invariant, which in turn means that our handling of unescaped From lines can be more minimal than it would be for the kind of recalcitrant input we are fully expecting to have to deal with. So for the purposes of to_mbox, I think we can get away with it having an optional parameter of type bool, call it escape. The behavior would be as follows:
* if escape is true, have to_mbox replace all occurrences of "From " with ">From " and all occurrences of ">From " with ">>From "
* if escape is false (which it will be by default), have to_mbox throw an exception when it encounters "From " in any of the strings in the input list
You can define the relevant exception along these lines:
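A minimal sketch of the escape behavior described above, under some loudly stated assumptions: the exception name Unescaped_from_line and the helpers process_line and process_email are hypothetical, matching is done only at the start of a line (where a delimiter could actually be mistaken for one), and the quoting is generalised in the mboxrd manner to any number of leading '>' characters. This is not necessarily the definition originally posted in this issue.

```ocaml
(* Hypothetical exception; it carries the offending line. *)
exception Unescaped_from_line of string

(* True when [line] is zero or more '>' followed by "From ". *)
let is_quotable line =
  let n = String.length line in
  let rec skip i = if i < n && line.[i] = '>' then skip (i + 1) else i in
  let i = skip 0 in
  n - i >= 5 && String.sub line i 5 = "From "

(* True when [line] starts with a bare, unquoted "From ". *)
let is_bare_from line =
  String.length line >= 5 && String.sub line 0 5 = "From "

(* escape = true : prepend one '>' to every quotable line.
   escape = false: raise on a bare "From " line, pass the rest through. *)
let process_line ~escape line =
  if escape then (if is_quotable line then ">" ^ line else line)
  else if is_bare_from line then raise (Unescaped_from_line line)
  else line

(* Splitting on '\n' leaves any '\r' attached to its line, so joining
   with "\n" again reproduces the original line endings. *)
let process_email ?(escape = false) email =
  String.split_on_char '\n' email
  |> List.map (process_line ~escape)
  |> String.concat "\n"
```

With escape left at its default of false, already-quoted lines such as ">From x" pass through untouched, which fits the invariant that the input arrives correctly escaped.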
Getting Started
Here is some example MBOX-parsing code from the precursor to Prelude, a standard library called Kw. You can use it as a basis for our MBOX parsing code---in fact, probably with few to no changes.