Building a signature file with ROY
- Getting started
- Home directory
- DROID and container signature files
- PRONOM reports
- Apache Tika and freedesktop.org MIME-info signature files
- Library of Congress FDDs
- Format sets (optional step)
- Setup... in six steps
- A MIME-info signature file
- A Library of Congress FDD signature file
- A Wikidata signature file
- Naming your signature file and your identifier
- Describing your modifications
- One signature file, multiple identifiers
- Inspecting your handiwork
This guide describes how you can use the roy tool to build custom signature files.
When sf runs, it defaults to a standard signature file (usually default.sig). You can choose a custom signature file with the -sig flag, e.g.
sf -sig custom.sig FILE
roy is bundled with the Windows releases. Simply copy the executable into a location in your PATH.
If you are on Ubuntu or OS X, roy is installed with the Ubuntu and Homebrew packages.
If you are on a different OS and have Go installed, you can compile roy yourself. Just use
go install github.com/richardlehane/siegfried/cmd/roy.
In order to build a signature file,
roy needs to know the location of the source signature files (e.g. DROID, DROID container, PRONOM reports, and Tika and freedesktop.org MIME-info files). It also needs to know where the signature file should be loaded from and saved to. The
sf tool and
roy both share a home directory where all this information is normally located. You can find your home directory by invoking either of those tools with the
-help flag. You can set a custom home directory for both tools with the
Quick setup: copy and extract the contents of the latest data.zip file from the releases page into your home directory. You can then skip the rest of this section and start modifying signature files.
If you want to build your own PRONOM signature file, you'll need copies of recent DROID signature and container files in your home directory. You can find these signature files on the TNA's website: http://www.nationalarchives.gov.uk/aboutapps/pronom/droid-signature-files.htm. By default, the latest files in your home directory will be used. You can apply the -droid and
-container flags to choose specific versions.
In normal usage, the DROID file is only used as a list of all current PUIDs. The XML report files from the TNA's PRONOM website are preferred as the primary sources for building the signature. To build a signature in this way, you need copies of all these reports. The
roy harvest command will download these for you into a reports directory within your home directory. This usually runs fairly quickly, but on a slow connection it may time out before completion. If that happens, you can extend the time limit with the -timeout flag (
roy harvest -timeout 5m). You can select a different reports directory with the
You can, if you prefer, build a PRONOM signature file without reports, using only the DROID file. To do this, use the -noreports flag.
If you want to build a MIME-info signature file, you'll need a copy of the latest signature files from Apache Tika or freedesktop.org.
Apache Tika MIME-info files (tika-mimetypes.xml) are available for download with the Tika source code: http://tika.apache.org/download.html.
The freedesktop.org MIME-info files are available at: https://freedesktop.org/wiki/Software/shared-mime-info/.
Some other software projects maintain their own, custom MIME-info signature files, for example UFRaw (UFRaw's MIME-info files can be found within the project's source code e.g. v0.22). You can use any valid MIME-info file as a source when building signatures.
To build a Library of Congress FDD signature file, download the latest fddXML.zip from the Library of Congress website at: http://www.loc.gov/preservation/digital/formats/fdd/fdd_xml_info.shtml
Format sets are a convenience mechanism to support signature customisation with roy. The -limit and
-exclude flags take comma-separated lists of format IDs to limit a signature to a selection of formats or to exclude a selection. The sets feature makes these flags a bit more functional by allowing commands like:
roy build -limit @pdf
roy build -exclude @pdfa,@pdfx,@pdfe
Sets can also be used with the -extend and
-extendc flags for adding lists of format extensions to your signature, e.g.
roy build -extend @exponential-decay,@archivematica
The sets feature works like a macro: it looks at any JSON files in a sets directory (within your siegfried "home" directory) for definitions of format sets. Any formats with the '@' prefix are expanded to the contents of those sets. Here is an example of a sets file for pdf. This expansion is recursive: you can include sets within larger sets. You can also refer to sets across set files (so you could create a separate 'office.json' file that has references to the pdf sets).
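The macro-style expansion described above can be illustrated with a minimal Python sketch. It is not siegfried's implementation: the JSON layout (a set name mapped to a list of format IDs and '@' references), the helper name expand, and the format IDs are all assumptions chosen for illustration.

```python
import json

def expand(ids, sets, _seen=None):
    """Expand '@name' references recursively; plain format IDs pass through."""
    seen = _seen if _seen is not None else set()  # sets currently being expanded
    out = []
    for item in ids:
        if not item.startswith("@"):
            if item not in out:
                out.append(item)
            continue
        name = item[1:]
        if name in seen:
            raise ValueError(f"circular set reference: @{name}")
        if name not in sets:
            raise KeyError(f"unknown set: @{name}")
        for fmt in expand(sets[name], sets, seen | {name}):
            if fmt not in out:
                out.append(fmt)
    return out

# Hypothetical contents of two files in the sets directory (illustrative IDs only).
pdf_json = '{"pdfa": ["fmt/95", "fmt/354"], "pdf": ["fmt/14", "@pdfa"]}'
office_json = '{"office": ["fmt/40", "@pdf"]}'

sets = {}
for text in (pdf_json, office_json):
    sets.update(json.loads(text))  # merge so sets can refer across files

print(",".join(expand(["@office"], sets)))  # fmt/40,fmt/14,fmt/95,fmt/354
```

The comma-joined output has the same shape as the list of format IDs that the -limit and -exclude flags accept.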
So, to recap, if you want to build your own signature file you need to:
- identify where your home directory is located (or select a custom one with the
- copy DROID and container files from the TNA's website into that home directory
- invoke the roy harvest command to download PRONOM reports
- download the Apache Tika and freedesktop.org MIME-info files
- download the Library of Congress FDD signatures
- (optional) create a sets directory and create/copy format sets there.
Once you've done all that, simply invoking
roy build is enough to create a new signature file. This will build a default.sig file identical to the signature file distributed by the siegfried update service (
sf -update). The default signature file contains a single identifier based on the latest release of the PRONOM database.
The roy build command assumes that you are creating a PRONOM signature file by default. To build a MIME-info signature file instead, use the
-mi flag with the name of the MIME-info signature file:
roy build -mi tika-mimetypes.xml
As a convenience, you can just use "tika" instead of "tika-mimetypes.xml" and "freedesktop" instead of "freedesktop.org.xml". The
-mi flag also works with the
roy add command (which is described further below):
roy add -mi freedesktop
To build an FDD signature file, use:
roy build -loc or
roy add -loc
Where FDD signatures reference PRONOM IDs, PRONOM signatures are imported into the LOC identifier. You can override this behaviour so that only LOC magic is used with the
-nopronom flag i.e.
roy build -loc -nopronom
The Wikidata identifier implements harvest and build routines. Using the defaults to build a Wikidata signature file you would do the following:
roy harvest -wikidata
roy build -wikidata
There are a few different ways to work with either of these capabilities which are documented more thoroughly in the documentation for the Wikidata identifier.
roy has a number of options for further customising your signature files.
Here are the flags you can apply:
roy build -bof 16000 (set a maximum beginning of file offset for byte sequence matching)
roy build -eof 8000 (set a maximum end of file offset for byte sequence matching)
roy build -noeof (trim end of file segments from byte signatures)
roy build -nobyte (build an identifier without byte signatures)
roy build -nocontainer (build an identifier without container signatures)
roy build -notext (build an identifier without a text matcher)
roy build -noname (build an identifier without a filename matcher)
roy build -nomime (build an identifier without a MIME matcher)
roy build -noxml (build an identifier without an XML matcher)
roy build -noreports (build an identifier using the DROID file alone and not PRONOM XML reports)
roy build -limit fmt/1,fmt/2,fmt/3 (limit the identifier to certain formats)
roy build -exclude fmt/1,fmt/2,fmt/3 (exclude formats from the identifier)
roy build -extend custom-fmt1.xml,custom-fmt2.xml (add custom signatures in DROID format e.g. using this utility. Custom signatures should be placed in a custom directory within your home directory)
roy build -multi single (build an identifier that is guaranteed to return a single result. In the event of a tie, UNKNOWN is returned with a descriptive warning)
roy build -multi conclusive (the default mode, applies weights and returns only the strongest result(s))
roy build -multi positive (in this mode, all strong results are returned. A strong result is one based on an internal signature such as a byte, container, RIFF or XML match. Weights and priorities are still applied in order to return early from matching wherever possible, i.e. this mode does not affect speed.)
roy build -multi comprehensive (identical to positive except that weights and priorities are ignored during matching - like exhaustive, this mode will slow things down)
roy build -multi exhaustive (build an identifier that ignores format weights and returns all results - this will slow things down but can be useful for debugging e.g. alongside
sf -debug FILE)
roy build -extend custom-fmt1.xml -extendc custom-container-fmt1.xml (add custom signatures in DROID container format. The DROID container format doesn't include format details such as name and MIME type, so these need to be provided in a matching normal DROID extension file. Read this post for more information. Custom signatures should be placed in a custom directory within your home directory)
roy build -droid DroidSignatureFile_V10.xml -noreports (specify a particular DROID file, the noreports flag is useful if you don't have matching PRONOM reports for older versions)
roy build -container container-signature-2010.xml (specify a container signature file)
roy build -mi tika-mimetypes.xml (build a MIMEInfo identifier with the supplied MIMEInfo signature file. You can use "tika" or "freedesktop" as aliases for "tika-mimetypes.xml" and "freedesktop.org.xml" respectively.)
All of the commands above will work, but they will overwrite your default.sig file. Since many of these constraints alter the way that files are identified, it is best practice to use a different signature name and a different identifier name.
roy build -name speedy -bof 131072 speedy.sig
The last argument, speedy.sig, is the signature name and the
-name flag names the identifier.
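If you script your builds, the naming practice above can be enforced before roy is ever called. The following Python sketch assembles the argument list for the example command. The helper build_args is hypothetical, and refusing default.sig is a convention of this sketch, not something roy does itself.

```python
import shlex

def build_args(sig_file, identifier, flags=()):
    """Return argv for `roy build`, with the signature file name last."""
    if sig_file == "default.sig":
        raise ValueError("choose a signature name other than default.sig")
    if not identifier:
        raise ValueError("give the identifier a name")
    return ["roy", "build", "-name", identifier, *flags, sig_file]

args = build_args("speedy.sig", "speedy", ["-bof", "131072"])
print(shlex.join(args))  # roy build -name speedy -bof 131072 speedy.sig
```

With roy on your PATH, the list could be passed to subprocess.run(args, check=True) to run the build.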
When roy builds your signature file, it automatically populates a "details" field with information about all the modifications you have made. This information goes into the provenance block at the beginning of
sf results. You can override this "details" field, to provide your own description, with the -details flag:
roy build -exclude @pdf -details "Sorry posterity... I don't care about provenance!" evil.sig
A single signature file can contain one or more identifiers.
Identifiers are sets of format signatures with a common identity. When you run the
sf tool, all the identifiers are listed in the "provenance" block at the head of the results and each identifier will report its results for every file matched. Siegfried's design means you can add additional identifiers without incurring significant additional runtime cost (i.e. a second identifier won't double the matching time). The main purpose of this feature is to enable support for additional signature formats. But you may want to build signature files with multiple identifiers for other reasons: for example, to view changes in signature files over time or to test the effects of various signature customisations on sample files.
To create a signature file with multiple identifiers you first use
roy build to create a signature file with one identifier and then
roy add to add additional identifiers.
roy add accepts the same arguments as
roy build: the only difference is that
roy build creates a new signature file while
roy add adds a new identifier to an existing signature file.
roy build -name latest -nocontainer history.sig (build a signature file with the latest version of DROID but without containers)
roy add -name "version 10" -droid DroidSignatureFile_Version10.xml -noreports -nocontainer history.sig (add an additional identifier with an older signature file. Use
-noreports if you don't have old PRONOM XML reports lying around.)
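The build-then-add sequence can be sketched in Python as follows. The helper plan is hypothetical and only prints the commands; the flags and file names are the ones used in the example above.

```python
import shlex

def plan(sig_file, identifiers):
    """First identifier uses `roy build`; every later one uses `roy add`."""
    commands = []
    for i, (name, flags) in enumerate(identifiers):
        action = "build" if i == 0 else "add"
        commands.append(["roy", action, "-name", name, *flags, sig_file])
    return commands

identifiers = [
    ("latest", ["-nocontainer"]),
    ("version 10", ["-droid", "DroidSignatureFile_Version10.xml",
                    "-noreports", "-nocontainer"]),
]

for cmd in plan("history.sig", identifiers):
    print(shlex.join(cmd))  # one shell-quoted command per identifier
```

Order matters here: roy build creates the signature file, so it has to come first, and each roy add then appends one more identifier to that file.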
roy has an
inspect command for viewing the contents of signature files, see Inspect and Debug