Building a signature file with ROY

Richard Lehane edited this page Feb 27, 2018 · 65 revisions

This guide describes how you can use the roy tool to build custom signature files.

Note: When sf runs, it defaults to a standard signature file (usually default.sig). You can choose a custom signature file with the -sig flag, e.g. sf -sig custom.sig FILE.

Getting started

Install

roy is bundled with the Windows releases. Simply copy the executable into a location in your PATH.

If you are on OS X or Ubuntu, roy is installed with the Homebrew and Ubuntu packages.

If you are on a different OS, you can compile roy yourself, provided you have Go installed. Just use go install github.com/richardlehane/siegfried/cmd/roy.

Setup

Home directory

In order to build a signature file, roy needs to know the location of source signature files (e.g. DROID, DROID container, PRONOM reports, and Tika and freedesktop.org MIME-info files). It also needs to know where the signature file should be loaded from / saved to. The sf tool and roy both share a home directory where all this information is normally located. You can find your home directory by invoking either of those tools with the -help flag. You can set a custom home directory for both tools with the -home flag.

Quick setup: Copy and extract the contents of the latest data.zip file on the releases page into your home directory. You can then skip the rest of this section and start modifying signature files.

DROID and container signature files

If you want to build your own PRONOM signature file, you'll need copies of recent DROID signature and container files in your home directory. You can find these signature files on the TNA's website: http://www.nationalarchives.gov.uk/aboutapps/pronom/droid-signature-files.htm. By default, the latest files in your home directory will be used. You can apply the -droid or -container flags to choose specific versions.

PRONOM reports

In normal usage, the DROID file is only used as a list of all current PUIDs. The XML report files from the TNA's PRONOM website are preferred as the primary source for building the signature. To build a signature in this way, you need copies of all these reports. The roy harvest command will download them for you into a reports directory within your home directory. This runs fairly quickly, but on a slow connection it may time out before completion; in that case, raise the limit with the -timeout flag (roy harvest -timeout 5m). You can select a different reports directory with the -reports flag.

You can, if you prefer, build a PRONOM signature file without reports and using only the DROID file. To do this, use the flag -noreports.
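If the harvest step keeps timing out, one option is to script the retry. The following is a minimal sketch rather than anything roy itself provides: the function name harvest_with_retries is hypothetical, the "1m" and "15m" values simply follow the format of the 5m example above, and the sketch assumes roy exits with a non-zero status when a harvest fails. A stand-in runner replaces the real tool so the example can be run without roy installed.

```python
import subprocess


def harvest_with_retries(timeouts=("1m", "5m", "15m"), run=subprocess.run):
    """Try `roy harvest` with each timeout in turn; return the one that worked.

    Assumption: roy exits with a non-zero status when the harvest fails.
    Returns None if every attempt fails.
    """
    for timeout in timeouts:
        completed = run(["roy", "harvest", "-timeout", timeout])
        if completed.returncode == 0:
            return timeout
    return None


def fake_run(args):
    """Stand-in for roy (illustration only): succeeds only with -timeout 5m."""
    ok = args[-1] == "5m"
    return subprocess.CompletedProcess(args, 0 if ok else 1)


print(harvest_with_retries(run=fake_run))
```

To call the real tool, omit the run argument so that subprocess.run is used.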

Apache Tika and freedesktop.org MIME-info signature files

If you want to build a MIME-info signature file, you'll need a copy of the latest signature files from Apache Tika or freedesktop.org.

Apache Tika MIME-info files (tika-mimetypes.xml) are available for download with the Tika source code: http://tika.apache.org/download.html.

The freedesktop.org MIME-info files are available at: https://freedesktop.org/wiki/Software/shared-mime-info/.

Some other software projects maintain their own, custom MIME-info signature files, for example UFRaw (UFRaw's MIME-info files can be found within the project's source code e.g. v0.22). You can use any valid MIME-info file as a source when building signatures.

Library of Congress FDDs

To build a Library of Congress FDD signature file, download the latest fddXML.zip from the Library of Congress website at: http://www.loc.gov/preservation/digital/formats/fdd/fdd_xml_info.shtml

Format sets (optional step)

Format sets are a convenience mechanism to support signature customisation with roy. The -limit and -exclude flags take comma-separated lists of format IDs to limit a signature to a selection of formats or to exclude a selection. The sets feature makes these flags a bit more functional by allowing commands like:

roy build -limit @pdf

or

roy build -exclude @pdfa,@pdfx,@pdfe

Sets can also be used with the -extend and -extendc flags for adding lists of format extensions to your signature. E.g. roy build -extend @exponential-decay,@archivematica.

The sets feature works like a macro: it looks at any JSON files in a sets directory (within your siegfried "home" directory) for definitions of format sets. Any formats with the '@' prefix are expanded to the contents of those sets. Here is an example of a sets file for pdf. This expansion is recursive: you can include sets within larger sets. You can also refer to sets across set files (so you could create a separate 'office.json' file that has references to the pdf sets).

To use this feature, you'll need to create that sets directory and add your own format sets there. You can also copy sets files from the siegfried repository (contributions welcome).
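To make the macro behaviour concrete, here is a rough Python model of the expansion. It is not roy's implementation. It assumes each sets file is a JSON object mapping a set name to a list of format IDs or '@' references, and the function names, set names and format IDs are chosen purely for illustration. Dropping duplicates and rejecting self-referencing sets are choices made for this sketch, not documented roy behaviour.

```python
import json
import tempfile
from pathlib import Path


def load_sets(sets_dir):
    """Merge every *.json file in sets_dir into one {name: [items]} mapping."""
    merged = {}
    for path in sorted(Path(sets_dir).glob("*.json")):
        merged.update(json.loads(path.read_text(encoding="utf-8")))
    return merged


def expand(items, sets, _active=()):
    """Recursively replace '@name' entries with the contents of that set."""
    result = []
    for item in items:
        if not item.startswith("@"):
            if item not in result:
                result.append(item)
            continue
        name = item[1:]
        if name in _active:
            raise ValueError(f"set {name!r} refers to itself")
        if name not in sets:
            raise KeyError(f"unknown set {name!r}")
        for fmt in expand(sets[name], sets, _active + (name,)):
            if fmt not in result:
                result.append(fmt)
    return result


# Illustrative data only: two set files, one referring to the other.
with tempfile.TemporaryDirectory() as home:
    sets_dir = Path(home) / "sets"
    sets_dir.mkdir()
    (sets_dir / "pdf.json").write_text(json.dumps({
        "pdfa": ["fmt/95", "fmt/354"],
        "pdf": ["fmt/14", "@pdfa"],
    }), encoding="utf-8")
    (sets_dir / "office.json").write_text(json.dumps({
        "office": ["fmt/40", "@pdf"],
    }), encoding="utf-8")
    sets = load_sets(sets_dir)
    # Model the argument of a flag such as: -limit @office,fmt/1
    expanded = expand("@office,fmt/1".split(","), sets)
    print(",".join(expanded))
```

Here '@office' pulls in '@pdf' from a different file, which in turn pulls in '@pdfa', so a single reference resolves to a flat list of format IDs.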

Setup... in six steps

So, to recap, if you want to build your own signature file you need to:

  1. identify where your home directory is located (or select a custom one with the -home flag)
  2. copy DROID and container files from the TNA's website into that home directory
  3. invoke the roy harvest command to download PRONOM reports
  4. download the Apache Tika and freedesktop.org MIME-info files
  5. download the Library of Congress FDD signatures
  6. (optional) create a sets directory and create/copy format sets there.

Build

Once you've done all that, simply invoking roy build is enough to create a new signature file. This will build a default.sig file identical to the signature file distributed by the siegfried update service (sf -update). The default signature file contains a single identifier based on the latest release of the PRONOM database.

A MIME-info signature file

The roy build command assumes that you are creating a PRONOM signature file by default. To build a MIME-info signature file instead, use the -mi flag with the name of the MIME-info signature file:

e.g. roy build -mi tika-mimetypes.xml

As a convenience, you can just use "tika" instead of "tika-mimetypes.xml" and "freedesktop" instead of "freedesktop.org.xml". The -mi flag also works with the roy add command (which is described further below):

e.g. roy add -mi freedesktop

A Library of Congress FDD signature file

To build an FDD signature file, use:

roy build -loc or roy add -loc

Where FDD signatures reference PRONOM IDs, PRONOM signatures are imported into the LOC identifier. You can override this behaviour, so that only LOC magic is used, with the -nopronom flag, i.e. roy build -loc -nopronom.

Customisable

roy has a number of options for further customising your signature files.

Here are the flags you can apply:

roy build -bof 16000 (set a maximum beginning of file offset for byte sequence matching)

roy build -eof 8000 (set a maximum end of file offset for byte sequence matching)

roy build -noeof (trim end of file segments from byte signatures)

roy build -nobyte (build an identifier without byte signatures)

roy build -nocontainer (build an identifier without container signatures)

roy build -notext (build an identifier without a text matcher)

roy build -noname (build an identifier without a filename matcher)

roy build -nomime (build an identifier without a MIME matcher)

roy build -noxml (build an identifier without an XML matcher)

roy build -noreports (build an identifier using the DROID file alone and not PRONOM XML reports)

roy build -limit fmt/1,fmt/2,fmt/3 (limit the identifier to certain formats)

roy build -exclude fmt/1,fmt/2,fmt/3 (exclude formats from the identifier)

roy build -extend custom-fmt1.xml,custom-fmt2.xml (add custom signatures in DROID format e.g. using this utility. Custom signatures should be placed in a custom directory within your home directory)

roy build -multi single (build an identifier that is guaranteed to return a single result. In the event of a tie, UNKNOWN is returned with a descriptive warning)

roy build -multi conclusive (the default mode, applies weights and returns only the strongest result(s))

roy build -multi positive (in this mode, all strong results are returned. A strong result is one based on an internal signature, such as a byte, container, RIFF or XML match. Weights and priorities are still applied in order to return early from matching wherever possible, i.e. this mode does not affect speed.)

roy build -multi comprehensive (identical to positive except that weights and priorities are ignored during matching - like exhaustive, this mode will slow things down)

roy build -multi exhaustive (build an identifier that ignores format weights and returns all results - this will slow things down but can be useful for debugging e.g. alongside sf -debug FILE)

roy build -extend custom-fmt1.xml -extendc custom-container-fmt1.xml (add custom signatures in DROID container format. The DROID container format doesn't include format details such as name and MIME type, so these need to be provided in a matching normal DROID extension file. Read this post for more information. Custom signatures should be placed in a custom directory within your home directory)

roy build -droid DroidSignatureFile_V10.xml -noreports (specify a particular DROID file. The -noreports flag is useful if you don't have matching PRONOM reports for older versions)

roy build -container container-signature-2010.xml (specify a container signature file)

roy build -mi tika-mimetypes.xml (build a MIME-info identifier with the supplied MIME-info signature file. You can use "tika" or "freedesktop" as aliases for "tika-mimetypes.xml" and "freedesktop.org.xml" respectively.)

Naming your signature file and your identifier

All of the commands above will work, but they will overwrite your default signature file (default.sig). Since many of these options alter the way that files are identified, it is best practice to use a different signature file name and a different identifier name.

For example:

roy build -name speedy -bof 131072 speedy.sig

The final argument, speedy.sig, is the signature file name, and the -name flag names the identifier.

Describing your modifications

When roy builds your signature file it will automatically populate a "details" field with information about all the modifications you have made. This information goes into the provenance block at the beginning of sf results. You can override this "details" field, to provide your own description, with the -details flag.

E.g. roy build -exclude @pdf -details "Sorry posterity... I don't care about provenance!" evil.sig

One signature file, multiple identifiers

A single signature file can contain one or more identifiers.

Identifiers are sets of format signatures with a common identity. When you run the sf tool, all the identifiers are listed in the "provenance" block at the head of the results and each identifier will report its results for every file matched. Siegfried's design means you can add additional identifiers without incurring significant additional runtime cost (i.e. a second identifier won't double the matching time). The main purpose of this feature is to enable support for additional signature formats. But you may want to build signature files with multiple identifiers for other reasons: for example, to view changes in signature files over time or to test the effects of various signature customisations on sample files.

To create a signature file with multiple identifiers you first use roy build to create a signature file with one identifier and then roy add to add additional identifiers. roy add accepts the same arguments as roy build: the only difference is that roy build creates a new signature file while roy add adds a new identifier to an existing signature file.

For example:

roy build -name latest -nocontainer history.sig (build a signature file with the latest version of DROID but without containers)

roy add -name "version 10" -droid DroidSignatureFile_Version10.xml -noreports -nocontainer history.sig (add an additional identifier with an older signature file. Use -noreports if you don't have old PRONOM XML reports lying around.)
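If you maintain several identifiers, the build-then-add sequence can be generated from a list. The sketch below is only an illustration: the helper names identifier_commands and run_all are hypothetical, and only flags shown in this guide are used. It prints the commands as a dry run by default and calls roy only when dry_run is set to False.

```python
import shlex
import subprocess


def identifier_commands(sig_file, identifiers):
    """Return one roy command per identifier: 'build' first, then 'add'."""
    commands = []
    for index, (name, flags) in enumerate(identifiers):
        verb = "build" if index == 0 else "add"
        commands.append(["roy", verb, "-name", name, *flags, sig_file])
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


commands = identifier_commands("history.sig", [
    ("latest", ["-nocontainer"]),
    ("version 10", ["-droid", "DroidSignatureFile_Version10.xml",
                    "-noreports", "-nocontainer"]),
])
run_all(commands)  # dry run: prints the commands without calling roy
```

Because the first entry always becomes roy build and the rest become roy add, reordering the list changes which identifier creates the signature file.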

Inspecting your handiwork

roy has an inspect command for viewing the contents of signature files; see Inspect and Debug.
