Identifying file formats
Scanning a file with siegfried just involves navigating to the right directory (using the cd command) and running:
sf file.ext
You can scan the whole contents of a directory by providing a directory rather than a file as the first argument:
sf DIR
By default, siegfried will descend into all the subdirectories of that directory. You may not want this (especially if it is a large directory) and can prevent it with the -nr flag (for "no recurse") like so:
sf -nr DIR
Because sf is run from the command line, you can use the standard Ctrl-C key combination to kill the command if you accidentally start descending into a really big directory tree.
Scanning a list of files and/or directories
You can also provide a list of files or directories to scan, e.g.:
sf myfile1.doc mydir myfile3.txt
Saving output to files
Siegfried's output prints nicely in terminals but oftentimes you'll want to keep the scan results. To do this, simply redirect (>) the output to a results file:
sf file.ext (or DIR) > my_results.yaml
JSON, CSV and DROID CSV
I gave my_results a .yaml extension because the default output format is YAML. You can switch the output to JSON, CSV or DROID CSV with the -json, -csv and -droid flags:
sf -json DIR > my_results.json
sf -csv DIR > my_results.csv
sf -droid DIR > my_results.csv
Scanning archive formats (zip, tar, gzip, warc, arc)
By default, siegfried does not scan within archive formats. To scan within the contents of zip, tar, gzip, warc or arc files use the -z flag:
sf -z file.ext
sf -z DIR
Calculating checksums (md5, sha1, sha256, sha512, crc)
To include file hashes with your identification results, use the -hash flag:
sf -hash md5 file.ext
sf -hash sha1 DIR
Scanning content piped to stdin
If you use
- instead of a file or directory argument, then
sf will scan from stdin. This allows you to do things like:
cat myfile.doc | sf -
You can optionally pass a filename with the
-filename flag, in order to enable filename/extension matching:
cat myfile.doc | sf -filename myfile.doc -
Scanning a list of files
The -f flag causes sf to scan a newline separated list of files. E.g. sf -f myfiles.txt will scan each of the files listed in myfiles.txt.
You can combine this with the
- argument to read from a list of files piped to stdin. This allows you to do things like:
find */*.doc | sf -f - [on Mac or Linux]
dir /b /s *.doc | sf -f - [on Win]
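As a rough sketch of the list-file workflow, the script below writes a newline separated list to a file first (so you can inspect it) and then passes it to sf with the -f flag. The temporary directory and the sample file names are made up for illustration, and it uses find with -name (which searches subdirectories too) rather than the glob shown above. The sf call only runs if sf is actually on your PATH.

```shell
#!/bin/sh
# Build a newline separated list of files and hand it to `sf -f`.
# The sample directory and file names below are hypothetical.
workdir=$(mktemp -d)
mkdir -p "$workdir/docs/sub"
printf 'dummy' > "$workdir/docs/a.doc"
printf 'dummy' > "$workdir/docs/sub/b.doc"
printf 'dummy' > "$workdir/docs/notes.txt"

# One path per line, .doc files only, searched recursively.
find "$workdir/docs" -type f -name '*.doc' | sort > "$workdir/myfiles.txt"
list_count=$(wc -l < "$workdir/myfiles.txt" | tr -d ' ')
echo "files listed: $list_count"

# Only call sf when it is installed.
if command -v sf >/dev/null 2>&1; then
    sf -f "$workdir/myfiles.txt"
fi

# Remove the throwaway sample data.
rm -rf "$workdir"
```

Keeping the list in a file (rather than piping it straight in) makes it easy to check or edit what will be scanned before you run sf.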
Throttling directory scans
If you use the
-throttle [Duration] flag,
sf will pause between files when scanning directories. For example:
sf -throttle 50ms DIR
will pause for 50 milliseconds between each file scan.
This flag can be useful if you encounter bandwidth issues running sf.
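To get a feel for what a throttle costs, here is a back-of-the-envelope estimate. It assumes one pause between each pair of consecutive files, and the file count is an invented example rather than a measurement.

```shell
#!/bin/sh
# Rough estimate of the extra wall-clock time added by -throttle.
# Assumes one pause between each pair of consecutive files; the file
# count below is a made-up example, not a measurement.
files=10000
pause_ms=50
extra_ms=$(( (files - 1) * pause_ms ))
extra_s=$(( extra_ms / 1000 ))
echo "added pause: about ${extra_s}s ($(( extra_s / 60 ))m$(( extra_s % 60 ))s)"
```

So a 50ms throttle over ten thousand files would add a little over eight minutes of pausing, on top of the scan time itself.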
Continue on error
You can use the continue on error flag (-coe) to prevent sf from halting when it encounters a fatal file error. This can be useful e.g. for scans over unreliable networks.
sf -coe DIR
Scanning many files at once
If you use the
-multi [Number] flag,
sf will scan up to that number of files at once. For example:
sf -multi 256 DIR
will scan up to 256 files at once.
The default for -multi is 1 as the flag can slow down matching (especially when IO is the bottleneck, e.g. spinning disks or network scans). Depending on your platform, you may find you get dramatic improvements in speed by using the -multi flag and it is worth experimenting with. For example, on the lab PC at my workplace (which has SSDs, a fast processor, and a lot of RAM) using -multi 256 speeds up a scan of the Govdocs Selected Corpus (31 GB) from 6m41s to 36s.
The -multi flag can't be stacked with the -z flag (if you attempt to, sf will provide a warning and automatically drop down to single scan mode).
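If you want to experiment with -multi on your own machine, a small sketch like the following may help. It first works out the speed-up implied by the Govdocs timings quoted above, then (only if sf is installed and you set the SCAN_DIR environment variable) times a scan at a few -multi values. The time_scan helper, the SCAN_DIR variable and the values 1, 16 and 256 are my own choices for illustration.

```shell
#!/bin/sh
# Convert the quoted timings to seconds and compute the speed-up.
before=$((6 * 60 + 41))   # 6m41s, the slower timing quoted above
after=36                  # 36s with -multi 256
speedup=$(LC_ALL=C awk -v b="$before" -v a="$after" 'BEGIN { printf "%.1f", b / a }')
echo "speed-up: ${speedup}x"

# Hypothetical helper: time one scan of a directory for a given -multi value.
time_scan() {
    start=$(date +%s)
    sf -multi "$1" "$2" > /dev/null
    end=$(date +%s)
    echo "-multi $1: $((end - start))s"
}

# Only run the comparison when sf is installed and SCAN_DIR is set.
if command -v sf >/dev/null 2>&1 && [ -n "${SCAN_DIR:-}" ]; then
    for n in 1 16 256; do
        time_scan "$n" "$SCAN_DIR"
    done
fi
```

Timings from date +%s are only accurate to the second, which is fine for large scans. Run each setting more than once, as disk caching can flatter whichever scan runs second.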
Saving scan settings
You can save and load frequently used scan settings with the -setconf and -conf flags.
To save a default configuration, execute the
sf command with the
-setconf flag and any other desired combination of flags (e.g.
-csv -multi 32):
sf -setconf -multi 32 -hash sha1 -csv
This will save your default configuration in a sf.conf file within your siegfried home directory.
You can save multiple named configurations by using the
-conf flag. E.g. the following command saves a configuration to a server.conf file:
sf -setconf -serve :5138 -z -hash md5 -conf server.conf
You can then invoke that named configuration with:
sf -conf server.conf
If you forget a command or option, use
sf -help for a list of options.
Working with siegfried output
The -log flag reports progress, time, errors, warnings, knowns, unknowns, and slow and debug information to either stderr or stdout.
For example, if you're scanning a large directory, you might like to see the progress of your scan. You can do this with:
sf -log progress -csv DIR > my_results.csv
This command reports progress to stderr (the default output).
The -log flag takes the following options:
progress OR p
time OR t
error OR err OR e
warning OR warn OR w
known OR k
unknown OR u
chart OR c
debug OR d
slow OR s
stdout OR out OR o
You can combine any of these options in comma-separated strings e.g.
-log e,w reports all errors and warnings to stderr.
-log p,t,e reports progress, errors and time elapsed.
-log u,o reports unknowns to stdout (when you direct
-log to stdout it replaces the normal result output).
In addition to those specific options, you can also include format IDs (e.g. fmt/1) and sets (e.g. @pdf) in your log string. These can be combined with the normal logging options. E.g.
-log u,fmt/1,fmt/2,@pdf,o reports to stdout unknowns, any files that identify as fmt/1 or fmt/2, and any files that identify as one of the formats in the @pdf set.
Knowns, Unknowns, Formats and Sets
The -log known and -log unknown commands output lists of files that are either recognised or not recognised.
The -log fmt/1,fmt/2 and -log @pdf,@tiff commands output lists of files that are recognised as having those specific format IDs or belonging to the provided sets.
One use for these commands is in combination with a modified signature file (see Building a signature file with roy). For example, you could create a signature file that only recognises pdf formats with:
roy build -limit @pdf -name pdf_only pdf.sig
Using sf -log known you could then filter all the pdf files in a given directory for further processing by some other command, such as tika:
sf -sig pdf.sig -log known,stdout . > temp.out && java -jar tika-app.jar -t -i . -o ~/local/out -fileList temp.out
Another, slightly slower, way to accomplish the same task would be to use your regular signature file and log the PDF formats directly:
sf -log @pdf,stdout . > temp.out && java -jar tika-app.jar -t -i . -o ~/local/out -fileList temp.out
You might also want to send a list of unknowns to the file command:
sf -log unknown ~/local/files 2> temp.out && file -f temp.out
You can even pipe results from these commands back to
sf itself. For example, you might run a full identification over all the non-pdf files in a directory:
sf -sig pdf.sig -log unknown,stdout . | sf -f -
Replaying a scan from results file(s)
You can use the -replay command to simulate a scan from one or more results files (CSV, JSON, YAML or even DROID and FIDO results). E.g.
sf -replay myresults.yaml
sf -replay myresults.yaml moreresults.csv
The value of this command is apparent when used with additional flags. E.g. you can use
-replay to convert a YAML results file to CSV:
sf -replay -csv myresults.yaml > myresults.csv
Or you can use
-replay to concatenate results files together:
sf -replay myresults.yaml moreresults.csv > allresults.yaml
Another use case for
-replay is to use logging functions to interactively explore results files. For example:
sf -replay -log chart,o myresults.yaml // view a chart of formats
sf -replay -log error,warn,o myresults.yaml // view warnings and errors
sf -replay -log unknown,o myresults.yaml // view unknowns
sf -replay -log @pdf,o myresults.yaml moreresults.csv // view all PDFs across two results files
The roy compare sub-command allows you to view the differences between multiple results files (in any of the sf formats as well as DROID and FIDO results). This sub-command outputs the differences into a CSV file for analysis e.g. in MS Excel.
roy compare myresults1.yaml myresults2.json droid-results.csv fido-results.csv > comparison.csv
This example does a four-way comparison between two
sf results files, a DROID results file and a FIDO results file.
By default this sub-command joins the different results files based on file paths within those results. Sometimes results can't be joined in this way (e.g. because DROID gives absolute paths where the other tools give relative paths). You can use the
-join flag with the
roy compare sub-command to change the way results are joined.
roy compare -join 0 myresults1.yaml myresults2.json // join on full path (default)
roy compare -join 1 myresults1.yaml myresults2.json // join on (local) filename only
roy compare -join 2 myresults1.yaml myresults2.json // join on filename and size
roy compare -join 3 myresults1.yaml myresults2.json // join on filename and modified
roy compare -join 4 myresults1.yaml myresults2.json // join on filename and hash
roy compare -join 5 myresults1.yaml myresults2.json // join on hash only
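To see why the default join can fail, consider the same file recorded with an absolute path by one tool and a relative path by another. The snippet below is only an illustration of the idea, not how roy implements its joins, and both paths are invented.

```shell
#!/bin/sh
# Illustration only: why a full-path join can miss a file that a
# filename join would pair up. Both paths are hypothetical.
droid_path='/home/user/collection/reports/annual.pdf'   # absolute path
sf_path='reports/annual.pdf'                            # relative path

if [ "$droid_path" = "$sf_path" ]; then
    full_match=yes
else
    full_match=no
fi

if [ "$(basename "$droid_path")" = "$(basename "$sf_path")" ]; then
    name_match=yes
else
    name_match=no
fi

echo "full path match: $full_match"
echo "filename match:  $name_match"
```

Joining on filename alone can pair up unrelated files that happen to share a name, which is presumably why the size, modified and hash variants exist.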
Interpreting the output
Technical provenance fields
Note: the JSON and CSV outputs have identical fields to the YAML output, except that the CSV output omits the technical provenance block. The DROID output mimics the TNA's DROID tool's CSV export.
The first block of information in siegfried output gives a technical provenance for the scan.
This includes information about siegfried (version number), about the date and time of the scan, about the signature file (name and date created), and about the identifiers within that signature file. The default signature file (default.sig) includes a single identifier named "pronom". In the "details" field for an identifier you'll see the versions of DROID signature files used to create the identifier as well as any modifications made to it (e.g. limited BOF, extensions etc.). No modifications are made to the default signature file's PRONOM identifier.
The second block of information in siegfried output describes the file being scanned.
This includes the file's name, size (in bytes), last modified date, and any errors siegfried encountered in attempting to read the file. Treat any errors reported here as red flags warranting further investigation. File errors may prevent matching occurring altogether or they may only affect certain matching processes. For example, a badly structured zip or Microsoft Compound file will prevent container matching and generate a file error but the byte matcher will still report its results.
The third block of information in siegfried output is a list of matches reported by the identifiers within the signature file. All identifiers will return at least one match (which may have the special value "UNKNOWN") and may report multiple matches (if there are multiple matches returned that have equal weighting).
For each match you will see:
- the name of the identifier returning the match (just "pronom" if you are using the default signature file)
- the format ID (a unique identifier or the special value "UNKNOWN")
- the format's name, version and MIME type
- the basis for the match
- and any warnings.
The basis field gives a technical justification for why the format has matched. This includes the names of the matchers (extension, container, byte and text matchers) that have triggered the result. If it is a byte matcher result, you will also see comma separated pairs that describe the offsets and lengths of matching segments (signatures may have one or more segments that must be satisfied). If it is a container matcher, you will see the names of matching sub-files as well as the output of any byte matchers that are applied to those sub-files. If it is a text matcher, you will see the character encoding detected.
'extension match; container name CompObj with byte match at 77, 20; name WordDocument with name only'
This basis value tells us that the file in question matched on extension and triggered a container match, due to the sub-files "CompObj" and "WordDocument", with a byte match for the "CompObj" stream.
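If you need to pick a basis value apart in a script, a minimal sketch along these lines may be a starting point. It assumes that clauses are separated by semicolons and that a byte match is written as "byte match at OFFSET, LENGTH", as in the example above; other basis strings may not follow this shape.

```shell
#!/bin/sh
# Split the example basis string into its clauses and pull out the
# offset/length pair of the byte match. The layout assumptions
# (semicolon separators, "byte match at N, N") come from the one
# example above and may not hold for every basis value.
basis='extension match; container name CompObj with byte match at 77, 20; name WordDocument with name only'

# One clause per line, with leading spaces trimmed.
clauses=$(printf '%s\n' "$basis" | tr ';' '\n' | sed 's/^ *//')
printf '%s\n' "$clauses"
clause_count=$(printf '%s\n' "$clauses" | wc -l | tr -d ' ')

# Extract the first number as the offset and the second as the length.
segment=$(printf '%s\n' "$clauses" |
    sed -n 's/.*byte match at \([0-9][0-9]*\), \([0-9][0-9]*\).*/offset=\1 length=\2/p')
echo "$segment"
```

Here the last line printed is the offset and length of the matching segment within the "CompObj" stream.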
The warning field reports any warnings reported by the identifiers during matching. These aren't strictly errors but may still warrant further investigation. A common warning is for "UNKNOWN" files. The warning text for "UNKNOWN" files will list any potential matches based on extension that the byte matcher has excluded.