Spec out configuration file for Attachment Converter
This is kind of a big task and may end up having to be split up into multiple issues. Nonetheless, we'll begin with a number of tasks in a single issue and split it up where applicable.
determine the format of the config file
determine how best to parse the config file and what datatype the config file will be parsed into
determine how the information from that parsed config file will be incorporated into the logic of the app
Let's take these in turn.
Config File Language
Some preliminary design brainstorming has led me to conclude, at least for now, that GNU Refer is a good choice for a config file format for Attachment Converter. (We can always change it later if need be.) Here are some reasons why.
the syntax is dead simple, and therefore it's easy to parse
the syntax is easy enough that a non-technical person could edit it without too much difficulty
the syntax is simple in a way that reduces the risk of the user making the config file unusable with an accidental typo
our use case doesn't require anything with a tree structure, which is doable in GNU Refer but requires a little bit of subtle love/care
Prelude, the OCaml standard library we are using for this project, has a Refer parser; so we can parse config files in this format without incurring a dependency on Yet Another third-party library
(rule of thumb: we like to avoid third-party dependencies except where absolutely necessary)
Getting your hands on an actual formal spec for refer is kind of annoying; you have to look at the GNU refer manpage. Nonetheless, a) we already have a parser for it and b) a quick example should illustrate how the syntax works. Suppose we would like to go through an email collection and perform two conversions: we want to convert all Word doc files to plaintext, and we want to convert all Word docx files to PDF/A-1b. The config file for performing those two conversions would contain two refer records, one per conversion, each giving the source MIME type, the target MIME type, and the shell command that performs the conversion.
(The conversion utilities named in any such example are fictional, for illustrative purposes. The real conversion utilities we'll be using will be complex invocations of a command line app with lots of options.)
So: a refer record is a key-value type of dealio: the percent sign followed by any string followed by a space gives you the field name. In between the space and the line break-followed-by-a-percent is the value. A refer database is simply a list of refer records, each one separated by two line breaks. That means we can have one record for each conversion we would like the app to perform.
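To make the record syntax concrete, here is a minimal sketch of a two-record config and a hand-rolled reader for it. It is only a stand-in for illustration and is not Prelude's Refer parser. The field names source_type and target_type, the utility paths, and the function names are assumptions; only shell_command is named in this issue.

```ocaml
(* Hypothetical config text: one refer record per conversion, records
   separated by a blank line. Field names other than shell_command and the
   utility paths are assumptions made for this sketch. *)
let sample_config = String.concat "\n" [
  "%source_type application/msword";
  "%target_type text/plain";
  "%shell_command /usr/local/bin/doc-to-text";
  "";
  "%source_type application/vnd.openxmlformats-officedocument.wordprocessingml.document";
  "%target_type application/pdf";
  "%shell_command /usr/local/bin/docx-to-pdfa";
]

(* Split "%name value" at the first space into (name, value). *)
let parse_field line =
  let body = String.sub line 1 (String.length line - 1) in
  match String.index_opt body ' ' with
  | None -> (body, "")
  | Some i ->
    (String.sub body 0 i,
     String.trim (String.sub body (i + 1) (String.length body - i - 1)))

(* Fold over the lines: a blank line closes the current record, a line
   starting with '%' opens a new field, and any other line continues the
   value of the previous field. *)
let parse_refer text =
  let close record records =
    if record = [] then records else (List.rev record :: records) in
  let step (record, records) line =
    if String.trim line = "" then ([], close record records)
    else if line.[0] = '%' then (parse_field line :: record, records)
    else
      match record with
      | (name, value) :: rest -> ((name, value ^ "\n" ^ line) :: rest, records)
      | [] -> (record, records) (* stray text before any field is ignored *)
  in
  let record, records =
    List.fold_left step ([], []) (String.split_on_char '\n' text) in
  List.rev (close record records)
```

Each record comes back as an association list, so `List.assoc "shell_command" record` retrieves the command for that conversion.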
We will eventually have to finesse the syntax for the shell_command value so that it can handle the distinction between:
a utility that takes a filepath as an input and prints to stdout
a utility that takes stdin as an input and prints to stdout
a utility that takes both an input and an output filepath
etc.
However, we will wait until a later issue to add that bit of fanciness. (The rough plan will be to handle it using printf-like escape syntax.) For the purposes of getting up and running with something simple, assume for now that all command line utilities for performing conversions take a filepath as input and output to stdout.
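Under that simplifying assumption, building the command line for one attachment could be as small as the sketch below. The function name and labels are hypothetical, and the printf-like escape syntax planned for a later issue is deliberately not attempted here.

```ocaml
(* Build the command line for a utility that takes a filepath as its only
   argument and writes the converted data to stdout. Filename.quote protects
   paths that contain spaces or shell metacharacters. *)
let command_line ~shell_command ~input_path =
  shell_command ^ " " ^ Filename.quote input_path
```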
Parsing the Config File
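What datatype should we parse the config file into? First things first: let's create a new file for the config information at lib/config.ml. This requires a change to lib/dune: what we had before, but with a modules S-expression whose tail includes lib and config. A good first stab at laying out the datatype within config.ml is a Formats module containing the variety and t datatypes described in the rest of this section.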
The Formats.variety datatype is a sum type whose purpose is to enumerate the different ways an email part could change/not change:
by leaving the MIME type header alone and just changing the data in the attachment, for when we don't need to update the MIME type in light of the change to the data
by changing the data in the attachment and updating the MIME type to reflect how the data were changed
leaving it alone
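One possible shape for that sum type is sketched below. The constructor names DataOnly, DataAndHeader and NoChange and the names dtransform and htransform come from this issue; representing both transforms as string -> string functions, and the helper apply, are assumptions made for illustration.

```ocaml
module Formats = struct
  (* Assumed representations: a data transform maps attachment data to
     converted data; a header transform maps one MIME type string to another. *)
  type dtransform = string -> string
  type htransform = string -> string

  type variety =
    | DataOnly of dtransform                    (* change data, keep MIME type *)
    | DataAndHeader of dtransform * htransform  (* change data and MIME type *)
    | NoChange                                  (* leave the part alone *)
end

(* Hypothetical helper: apply a variety to a (MIME type, data) pair. *)
let apply variety (mime, data) =
  match variety with
  | Formats.DataOnly dt -> (mime, dt data)
  | Formats.DataAndHeader (dt, ht) -> (ht mime, dt data)
  | Formats.NoChange -> (mime, data)
```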
We should be able to parse a refer record of the type given above into a Formats.variety in the following way:
if the source MIME type and the target MIME type are the same (for example, converting PDF to PDF/A), it's a DataOnly whose dtransform is supplied by the path to the command line utility, passed into convert from issue #1 (Re-implement owen-practicum using ocamlnet)
if they are different (for example, converting DOCX to plaintext), it's a DataAndHeader whose dtransform is supplied by the filepath and whose htransform is a function mapping the source MIME type string to the target MIME type string indicated in the refer record
we can either make everything else a NoChange based on an exhaustive list of all MIME types, or remove NoChange from the datatype (for now I think I like the idea of leaving it in)
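The first two rules above could be sketched as follows. The record is assumed to be an association list with the hypothetical field names source_type and target_type alongside shell_command, and convert is passed in as a parameter with an assumed type (command, then data, giving converted data), since its real signature lives in issue #1.

```ocaml
type variety =
  | DataOnly of (string -> string)
  | DataAndHeader of (string -> string) * (string -> string)
  | NoChange

(* Turn one parsed record into a (source MIME type, variety) pair.
   The field names and the shape of [convert] are assumptions. *)
let variety_of_record ~convert record =
  match
    List.assoc_opt "source_type" record,
    List.assoc_opt "target_type" record,
    List.assoc_opt "shell_command" record
  with
  | Some source, Some target, Some command ->
    let dtransform = convert command in
    if source = target then Ok (source, DataOnly dtransform)
    else Ok (source, DataAndHeader (dtransform, fun _ -> target))
  | _ -> Error "record needs source_type, target_type and shell_command"
```

Returning a result leaves room for the domain-specific error messages for badly formatted config files.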
We can work out the details of the parsing error messages (for badly formatted config files) etc. while implementing the config file parser. I expect we can use Prelude's parser to do the real parsing, then write a little wrapper code to do cleanup on the result of that plus whatever domain-specific error messaging we might want. To get started with the parser, check out Prelude's documentation: https://www2.lib.uchicago.edu/keith/software/prelude/Prelude.Refer.html
One final note on the idea behind the Formats.t dictionary datatype. The thought here is that Formats.t will be a lookup table with strings representing MIME types (such as application/pdf, text/plain, and so forth) as keys and Formats.variety-s as values. Then, when we are recursing through an email parsetree and come across a part of the email that is an attachment, we'll examine the MIME header in that part and look it up in the Formats.t dictionary. The Formats.variety value the dictionary gives us back will tell us what to do with it: do nothing, convert just the data in the attachment, or convert both the data and the header in the attachment. See the next section for more info on how amap and acopy will have to be revised to make use of a Formats.t input in this way.
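As a rough sketch of that lookup table, the standard library's Map over strings would do. Using Map.Make (String) is an assumption (the issue only says "dictionary"), and of_list and decide are hypothetical helper names.

```ocaml
module StringMap = Map.Make (String)

type variety =
  | DataOnly of (string -> string)
  | DataAndHeader of (string -> string) * (string -> string)
  | NoChange

(* The dictionary: MIME type string -> what to do with parts of that type. *)
type t = variety StringMap.t

(* Build a dictionary from (MIME type, variety) pairs. *)
let of_list entries : t =
  List.fold_left (fun m (mime, v) -> StringMap.add mime v m)
    StringMap.empty entries

(* Look up a MIME type; anything not listed is left alone. *)
let decide (formats : t) mime =
  match StringMap.find_opt mime formats with
  | Some v -> v
  | None -> NoChange
```

Treating a missing key as NoChange is one way to avoid an exhaustive list of all MIME types while still keeping NoChange in the datatype.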
Design implications
The above design requires some revisions to our earlier spec from issues #1 and #6.
We'll keep the name amap for now, though since amap is at this point nowhere near functorial, we will eventually want to ditch the name. But the type should be updated to something like this:
val amap : Formats.t -> parsetree -> (parsetree, Formats.error) result
Where error is a datatype, defined inside the Formats module, that we will probably have to revise heavily.
Similar changes to acopy are in order: it should take a Formats.t as its first input in the same way.
When amap (or acopy) encounters a new part of a mail, it will look up the Content-Type header of that part in the input Formats.t dictionary. If there is no entry for it, amap won't do anything to that part. Otherwise, the entry determines what kind of conversion to perform on that part of the mail, and amap converts accordingly.
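The traversal just described might look roughly like this. The parsetree here is a toy stand-in (the real one comes from the mail parsing library), the error constructor ConversionFailed is a hypothetical starting point, and treating a Failure raised by a transform as a conversion error is an assumption.

```ocaml
module StringMap = Map.Make (String)

module Formats = struct
  type variety =
    | DataOnly of (string -> string)
    | DataAndHeader of (string -> string) * (string -> string)
    | NoChange
  type t = variety StringMap.t
  (* Hypothetical error type: carries the MIME type of the failing part. *)
  type error = ConversionFailed of string
end

(* Toy stand-in for the real email parsetree. *)
type parsetree =
  | Part of string * string        (* Content-Type, body *)
  | Multipart of parsetree list

let rec amap (formats : Formats.t) (tree : parsetree)
  : (parsetree, Formats.error) result =
  match tree with
  | Multipart parts ->
    (* Convert children left to right, stopping at the first error. *)
    let rec go acc = function
      | [] -> Ok (Multipart (List.rev acc))
      | part :: rest ->
        (match amap formats part with
         | Ok part' -> go (part' :: acc) rest
         | Error e -> Error e)
    in
    go [] parts
  | Part (mime, body) ->
    let attempt f =
      try Ok (f ()) with Failure _ -> Error (Formats.ConversionFailed mime) in
    (match StringMap.find_opt mime formats with
     | None | Some Formats.NoChange -> Ok tree
     | Some (Formats.DataOnly dt) ->
       attempt (fun () -> Part (mime, dt body))
     | Some (Formats.DataAndHeader (dt, ht)) ->
       attempt (fun () -> Part (ht mime, dt body)))
```

Parts whose Content-Type has no entry in the dictionary pass through unchanged, which matches the behavior described above.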
One last thing. We will probably save this for a later issue, but it might be nice to make the first input to amap and acopy an optional parameter that defaults to some config we're planning to test with a lot.