Spec out configuration file for Attachment Converter #9

Closed
bufordrat opened this issue Nov 18, 2021 · 0 comments · Fixed by #19
bufordrat commented Nov 18, 2021

Spec out configuration file for Attachment Converter

This is kind of a big task and may end up having to be split up into multiple issues. Nonetheless, we'll begin with a number of tasks in a single issue and split it up where applicable.

  • determine the format of the config file
  • determine how best to parse the config file and what datatype the config file will be parsed into
  • determine how the information from that parsed config file will be incorporated into the logic of the app

Let's take these in turn.

Config File Language

Some preliminary design brainstorming has led me to conclude, at least for now, that GNU Refer is a good choice for a config file format for Attachment Converter. (We can always change it later if need be.) Here are some reasons why.

  • the syntax is dead simple, and therefore it's easy to parse
  • the syntax is easy enough that a non-technical person could edit it without too much difficulty
  • the syntax is simple in a way that reduces the risk of the user making the config file unusable with a stray typo
  • our use case doesn't require anything with a tree structure (tree structures are doable in GNU Refer, but they require a little bit of subtle love/care)
  • Prelude, the OCaml standard library we are using for this project, has a Refer parser; so we can parse config files in this format without incurring a dependency on Yet Another third-party library
  • (rule of thumb: we like to avoid third-party dependencies except where absolutely necessary)

Getting your hands on an actual formal spec for refer is kind of annoying; you have to look at the GNU refer manpage. Nonetheless, a) we already have a parser for it and b) a quick example should illustrate how the syntax works. Suppose we would like to go through an email collection and perform two conversions: we want to convert all Word doc files to plaintext, and we want to convert all Word docx files to PDF/A-1b. The config file for performing those two conversions could look like this:

%source_type application/msword
%target_type text/plain
%shell_command /bin/doc2txt

%source_type application/vnd.openxmlformats-officedocument.wordprocessingml.document
%target_type application/pdf
%shell_command /bin/docx2pdf

(Those conversion utilities are fictional, for illustrative purposes. The real conversion utilities we'll be using will be complex invocations of a command line app with lots of options.)

So: a refer record is a key-value type of dealio. The percent sign followed by any string followed by a space gives you the field name. Everything between that space and the next line break-followed-by-a-percent is the value. A refer database is simply a list of refer records, each one separated from the next by a blank line (i.e. two consecutive line breaks). That means we can have one record for each conversion we would like the app to perform.
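To make the record syntax concrete, here is a minimal hand-rolled sketch of reading such a file into a list of association lists. It is not Prelude's Refer parser (which is what we actually plan to use), it assumes every value fits on one line, and the names parse_field and parse_database are made up for illustration.

```ocaml
(* A hand-rolled sketch, NOT Prelude.Refer: records are separated by blank
   lines, and each field is assumed to be "%name value" on a single line. *)
type record = (string * string) list

(* Split one "%name value" line into (name, value); None if malformed. *)
let parse_field (line : string) : (string * string) option =
  let n = String.length line in
  if n < 2 || line.[0] <> '%' then None
  else
    match String.index_opt line ' ' with
    | None -> None
    | Some i ->
      let name = String.sub line 1 (i - 1) in
      let value = String.trim (String.sub line (i + 1) (n - i - 1)) in
      if name = "" then None else Some (name, value)

(* Group lines into records, closing the current record at each blank line. *)
let parse_database (text : string) : record list =
  let flush current acc =
    if current = [] then acc else List.rev current :: acc
  in
  let step (current, acc) line =
    if String.trim line = "" then ([], flush current acc)
    else
      match parse_field line with
      | Some field -> (field :: current, acc)
      | None -> (current, acc) (* this sketch silently skips malformed lines *)
  in
  let current, acc =
    List.fold_left step ([], []) (String.split_on_char '\n' text)
  in
  List.rev (flush current acc)
```

Running parse_database on the two-record example above would give a list of two records, each with source_type, target_type and shell_command fields. A real implementation would report malformed lines rather than skip them.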

We will eventually have to finesse the syntax for the shell_command value so that it can handle the distinction between:

  • a utility that takes a filepath as an input and prints to stdout
  • a utility that takes stdin as an input and prints to stdout
  • a utility that takes both an input and an output filepath
  • etc.

However, we will wait until a later issue to add that bit of fanciness. (The rough plan will be to handle it using printf-like escape syntax.) For the purposes of getting up and running with something simple, assume for now that all command line utilities for performing conversions take a filepath as input and output to stdout.
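Just to give a flavor of what the printf-like escape idea might look like, here is a rough sketch. The escape letters (%i for the input path, %o for the output path, %% for a literal percent sign) and the function name expand_command are assumptions for illustration only; nothing here is settled. Quoting the paths with Filename.quote is likewise just one possible choice.

```ocaml
(* Hypothetical escapes, not part of any agreed spec:
   %i = input path, %o = output path, %% = a literal percent sign.
   Any other "%c" pair is copied through unchanged. *)
let expand_command ~(input : string) ~(output : string) (template : string)
  : string =
  let n = String.length template in
  let buf = Buffer.create n in
  let rec go i =
    if i >= n then ()
    else if template.[i] = '%' && i + 1 < n then begin
      (match template.[i + 1] with
       | 'i' -> Buffer.add_string buf (Filename.quote input)
       | 'o' -> Buffer.add_string buf (Filename.quote output)
       | '%' -> Buffer.add_char buf '%'
       | c ->
         Buffer.add_char buf '%';
         Buffer.add_char buf c);
      go (i + 2)
    end else begin
      Buffer.add_char buf template.[i];
      go (i + 1)
    end
  in
  go 0;
  Buffer.contents buf
```

With something like this, a shell_command value such as /bin/doc2txt %i would cover the "filepath in, stdout out" case, and a value using both %i and %o would cover the "input and output filepath" case.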

Parsing the Config File

What datatype should we parse the config file into? First things first: let's create a new file for the config information at lib/config.ml, which is where the code below will go. This requires the following change to lib/dune:

(library
 (name lib)
 (libraries prelude versioj mrmime threads netstring unix)
 (modules lib config)
 (inline_tests (backend qtest.lib)))

I.e. what we had before, but with a modules S-expression whose tail includes lib and config.

A good first stab at laying out the datatype within config.ml would be something close to this:

module Formats = struct

  type htransform = string -> string
  type dtransform = string -> string
  
  type variety =
    | DataOnly of dtransform
    | DataAndHeader of (htransform * dtransform)
    | NoChange

  module Dict = Map.Make (String)
    
  type t = variety Dict.t
  
end

The Formats.variety datatype is a sum type whose purpose is to enumerate the different ways an email part could change/not change:

  • by leaving the MIME type header alone and just changing the data in the attachment, for when we don't need to update the MIME type in light of the change to the data
  • by changing the data in the attachment and updating the MIME type to reflect how the data were changed
  • by leaving it alone

We should be able to parse a refer record of the type given above into a Formats.variety in the following way:

  • if the source MIME type and the target MIME type are the same (for example, converting PDF to PDF/A), it's a DataOnly whose dtransform is supplied by the path to the command line utility, passed into the convert function from issue #1 (Re-implement owen-practicum using ocamlnet)
  • if they are different (for example, converting DOCX to plaintext), it's a DataAndHeader whose dtransform is supplied by the filepath and whose htransform is a function mapping the source MIME type string to the target MIME type string indicated in the refer record
  • we can either make everything else a NoChange based on an exhaustive list of all MIME types, or remove NoChange from the datatype (for now I think I like the idea of leaving it in)
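The first two rules above can be sketched roughly as follows. This assumes a record has already been parsed into an association list, and it takes convert as a parameter standing in for the function from issue #1, whose real signature may well differ. The name variety_of_record and the string error messages are placeholders.

```ocaml
type htransform = string -> string
type dtransform = string -> string

type variety =
  | DataOnly of dtransform
  | DataAndHeader of (htransform * dtransform)
  | NoChange

(* [convert] is a stand-in for the function from issue #1; here it is assumed
   to turn a utility path into a dtransform. Returns the source MIME type
   paired with the variety to store under it, or an error message. *)
let variety_of_record ~(convert : string -> dtransform)
    (record : (string * string) list) : (string * variety, string) result =
  let field name =
    match List.assoc_opt name record with
    | Some v -> Ok v
    | None -> Error ("missing field: " ^ name)
  in
  match field "source_type", field "target_type", field "shell_command" with
  | Ok src, Ok tgt, Ok cmd ->
    let dt = convert cmd in
    if src = tgt then Ok (src, DataOnly dt)
    else Ok (src, DataAndHeader ((fun _ -> tgt), dt))
  | Error msg, _, _ | _, Error msg, _ | _, _, Error msg -> Error msg
```

NoChange never comes out of this function; under this sketch it would be what a lookup falls back to when a MIME type has no record at all.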

We can work out the details of the parsing error messages (for badly formatted config files) etc. while implementing the config file parser. I expect we can use Prelude's parser to do the real parsing, then write a little wrapper code to do cleanup on the result of that plus whatever domain-specific error messaging we might want.

To get started with the parser, check out Prelude's documentation:
https://www2.lib.uchicago.edu/keith/software/prelude/Prelude.Refer.html

One final note on the idea behind the Formats.t dictionary datatype. The thought here is that Formats.t will be a lookup table with strings representing MIME types (such as application/pdf, text/plain, and so forth) as keys and Formats.variety-s as values. Then, when we are recursing through an email parsetree and come across a part of the email that is an attachment, we'll examine the MIME header in that part and look it up in the Formats.t dictionary. The Formats.variety value the dictionary gives us back will tell us what to do with it: do nothing, convert just the data in the attachment, or convert both the data and the header in the attachment. See the next section for more info on how amap and acopy will have to be revised to make use of a Formats.t input in this way.
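As a rough illustration of that lookup, the sketch below reduces an attachment to a (MIME type, data) pair, which is far simpler than a real email part. The helper name convert_part is hypothetical, and treating a missing key the same as NoChange is an assumption.

```ocaml
module Formats = struct
  type htransform = string -> string
  type dtransform = string -> string

  type variety =
    | DataOnly of dtransform
    | DataAndHeader of (htransform * dtransform)
    | NoChange

  module Dict = Map.Make (String)

  type t = variety Dict.t
end

(* Hypothetical helper: look the MIME type up and act on the variety found.
   A MIME type with no entry is treated like NoChange. *)
let convert_part (formats : Formats.t) ((mime, data) : string * string)
  : string * string =
  match Formats.Dict.find_opt mime formats with
  | None | Some Formats.NoChange -> (mime, data)
  | Some (Formats.DataOnly d) -> (mime, d data)
  | Some (Formats.DataAndHeader (h, d)) -> (h mime, d data)
```

The real version would work on the header and body of a parsetree node and would need to report errors, but the three-way dispatch should look about the same.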

Design implications

The above design requires some revisions to our earlier spec from issues #1 and #6.

We'll keep the name amap for now, though since amap is at this point nowhere near functorial, we will eventually want to ditch the name. But the type should be updated to something like this:

val amap : Formats.t -> parsetree -> (parsetree, Formats.error) result

Where error is a datatype we will probably have to revise heavily, but which can start off along these lines, inside the Formats module:

module Formats = struct

  type htransform = string -> string
  type dtransform = string -> string
  
  type variety =
    | DataOnly of dtransform
    | DataAndHeader of (htransform * dtransform)
    | NoChange

  module Dict = Map.Make (String)
    
  type t = variety Dict.t

  module Error = struct
    type t =
      | ReferParse of string
      | Unix of (string * Unix.error)
      | MimeParse of string
      | CharacterEncoding of string
  end
  type error = Error.t

end

Similar changes to acopy are in order:

val acopy : Formats.t -> parsetree -> (parsetree, Formats.error) result

When amap [acopy] encounters a new part of a mail, it looks up the Content-Type header of that part in the input Formats.t dictionary. If there is no entry for it, it won't do anything to that part. Otherwise, the entry determines what kind of conversion to perform on that part of the mail, and amap [acopy] converts accordingly.

One last thing. We will probably save this for a later issue, but it might be nice to make the first input to amap and acopy an optional parameter that defaults to some config we're planning to test with a lot.
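For reference, an optional first parameter with a default could look something like the sketch below. The config and parsetree types here are deliberately tiny stand-ins (a map from source MIME type to target MIME type, and a flat list of MIME types), and default_config is a hypothetical value; only the ?(config = ...) mechanism is the point.

```ocaml
module Dict = Map.Make (String)

(* Stand-in types, much simpler than the real Formats.t and parsetree. *)
type config = string Dict.t   (* source MIME type -> target MIME type *)
type parsetree = string list  (* just the MIME types of the parts *)

(* Hypothetical default, used when the caller passes no config. *)
let default_config : config =
  Dict.add "application/msword" "text/plain" Dict.empty

(* [?config] is optional; omitting it falls back to [default_config]. *)
let amap ?(config = default_config) (tree : parsetree)
  : (parsetree, string) result =
  let retarget mime =
    match Dict.find_opt mime config with
    | Some target -> target
    | None -> mime
  in
  Ok (List.map retarget tree)
```

Callers could then write amap tree to get the default, or amap ~config:my_config tree to supply their own.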
