LoC+PRONOM identifier ID indexes aren't being built as one might expect, affecting the basis field results #142

@ross-spencer

When an LoC identifier is built with PRONOM the sources are added as follows:

  1. All LoC sigs and ids are generated.
  2. The PRONOM identifiers we want for LoC records that have a PUID are then generated and attached.

Signatures and IDs are then returned to the identifier when it's built.
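The two-step ordering above can be sketched as follows. This is a minimal illustration, not Siegfried's code: `appendSources` and the sample IDs are hypothetical, and it assumes the PRONOM-derived entries are simply appended after the complete LoC batch.

```go
package main

import "fmt"

// appendSources mimics the two-step build described above: every LoC ID is
// added first, then the IDs for the attached PRONOM signatures follow.
// Hypothetical helper for illustration only; not part of Siegfried.
func appendSources(locIDs, pronomIDs []string) []string {
	ids := make([]string, 0, len(locIDs)+len(pronomIDs))
	ids = append(ids, locIDs...)    // step 1: LoC sigs and ids
	ids = append(ids, pronomIDs...) // step 2: PRONOM sigs attached to LoC records
	return ids
}

func main() {
	loc := []string{"fdd000026", "fdd000027"}
	pronom := []string{"fdd000027", "fdd000026"}
	// Prints: [fdd000026 fdd000027 fdd000027 fdd000026]
	fmt.Println(appendSources(loc, pronom))
}
```

Under that assumption, the two entries for a given FDD ID are only adjacent by coincidence, because nothing groups them after the second batch is appended.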

For the byte matcher this is done here.

OGG is a good example format here: the LoC identifier is expected to hold two signatures for it, one byte pattern from the LoC record and one PRONOM sequence, and both should match against the file.

If we test against the OGG skeleton the ideal result is as follows:

$ ./sf ogg/fmt-203-signature-id-504.ogg
---
siegfried   : 1.8.0
scandate    : 2020-06-07T21:40:23-04:00
signature   : default.sig
created     : 2020-06-07T21:39:38-04:00
identifiers : 
  - name    : 'loc'
    details : 'fddXML.zip (2016-12-13, DROID_SignatureFile_V96.xml, container-signature-20200121.xml)'
---
filename : 'ogg/fmt-203-signature-id-504.ogg'
filesize : 62
modified : 2020-06-07T15:57:34-04:00
errors   : 
matches  :
  - ns      : 'loc'
    id      : 'fdd000026'
    format  : 'Ogg File Format'
    full    : 'Ogg File Format'
    mime    : 'application/ogg'
    basis   : 'extension match ogg; byte match at 0, 6 (signature 2/2)'
    warning : 

The Place() function inside the identifier looks up the IDs index and returns the position of the matching pattern among the signatures for its ID, along with the total number of signatures for that ID.

Again, the ideal for these indexes (I believe) should be as follows:

Indexes IDs (base.Place() [fdd000019 fdd000019 fdd000022 fdd000022 fdd000022 fdd000022 fdd000022 fdd000026 fdd000026 fdd000027 fdd000027 fdd000031 fdd000031]

Note that entries for the same identifier sit contiguously within the slice, i.e. the 0026 entries are next to each other, the 0027 entries follow each other, etc.

Actual:

What we're seeing currently is:

Indexes IDs (base.Place() [fdd000019 fdd000022 fdd000022 fdd000022 fdd000026 fdd000027 fdd000031 fdd000019 fdd000027 fdd000022 fdd000022 fdd000031 fdd000026]

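Note that the 0026 entries are at index 4 (counting from zero) and at the very end of the slice. Siegfried stops looping through the index as soon as it reaches a different identifier, so it calculates the position and total before it has found the second and any later entries for the identifier it is looking for.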
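To make the effect concrete, here is a minimal sketch of a lookup that only counts identical neighbours. It is an assumption about how such a lookup might behave, not a copy of base.Place(); `placeContiguous` is a hypothetical name, and the two slices are the ideal and actual indexes quoted above. It also assumes the second OGG signature is the last 0026 entry in each slice.

```go
package main

import "fmt"

// placeContiguous returns the 1-based position of ids[idx] within its run of
// identical neighbours, plus the length of that run. It stops scanning as
// soon as a different ID is seen. Assumed behaviour, for illustration only.
func placeContiguous(ids []string, idx int) (pos, total int) {
	id := ids[idx]
	prev, post := 0, 0
	for i := idx - 1; i >= 0 && ids[i] == id; i-- {
		prev++
	}
	for i := idx + 1; i < len(ids) && ids[i] == id; i++ {
		post++
	}
	return prev + 1, prev + post + 1
}

func main() {
	ideal := []string{
		"fdd000019", "fdd000019",
		"fdd000022", "fdd000022", "fdd000022", "fdd000022", "fdd000022",
		"fdd000026", "fdd000026",
		"fdd000027", "fdd000027",
		"fdd000031", "fdd000031",
	}
	actual := []string{
		"fdd000019", "fdd000022", "fdd000022", "fdd000022", "fdd000026",
		"fdd000027", "fdd000031", "fdd000019", "fdd000027", "fdd000022",
		"fdd000022", "fdd000031", "fdd000026",
	}

	pos, total := placeContiguous(ideal, 8) // second 0026 entry, grouped
	fmt.Printf("ideal:  signature %d/%d\n", pos, total) // ideal:  signature 2/2

	pos, total = placeContiguous(actual, 12) // second 0026 entry, isolated
	fmt.Printf("actual: signature %d/%d\n", pos, total) // actual: signature 1/1
}
```

With the grouped slice the scan sees both 0026 entries; with the slice as currently built, the entry at index 12 has no matching neighbour, so the scan reports a single signature.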

The primary impact here seems to be visual. The binary pattern being used and the result returned are still accurate, but instead of:

  • basis : 'extension match ogg; byte match at 0, 6 (signature 2/2)'

we get:

  • basis : 'extension match ogg; byte match at 0, 6'

Without the (signature 2/2) value we can't tell which pattern matched specifically, and we'd like to know that so we can audit the results in more detail.

---
siegfried   : 1.8.0
scandate    : 2020-06-07T21:38:21-04:00
signature   : default.sig
created     : 2020-06-07T21:35:35-04:00
identifiers : 
  - name    : 'loc'
    details : 'fddXML.zip (2016-12-13, DROID_SignatureFile_V96.xml, container-signature-20200121.xml)'
---
filename : 'ogg/fmt-203-signature-id-504.ogg'
filesize : 62
modified : 2020-06-07T15:57:34-04:00
errors   : 
matches  :
  - ns      : 'loc'
    id      : 'fdd000026'
    format  : 'Ogg File Format'
    full    : 'Ogg File Format'
    mime    : 'application/ogg'
    basis   : 'extension match ogg; byte match at 0, 6'
    warning : 
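The missing suffix in the basis field above is consistent with a formatter that only appends the signature position when an ID has more than one signature. The sketch below illustrates that idea under that assumption; `byteBasis` and its parameters are hypothetical and are not taken from Siegfried's source.

```go
package main

import "fmt"

// byteBasis builds the byte-match part of a basis string. The
// "(signature x/y)" suffix is only added when total > 1, which is an
// assumption made for this illustration.
func byteBasis(offset, length, pos, total int) string {
	s := fmt.Sprintf("byte match at %d, %d", offset, length)
	if total > 1 {
		s += fmt.Sprintf(" (signature %d/%d)", pos, total)
	}
	return s
}

func main() {
	fmt.Println(byteBasis(0, 6, 2, 2)) // byte match at 0, 6 (signature 2/2)
	fmt.Println(byteBasis(0, 6, 1, 1)) // byte match at 0, 6
}
```

If the index lookup reports a total of 1 because the entries for an ID are not adjacent, a formatter like this would drop the suffix even though two signatures exist for the format.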

I've attached some sample files to make recreating this a little easier.

OGG skeleton: fmt-203-signature-id-504.zip

Restricted FDD set (enough to recreate the issue without being the whole set): restricted-set-fddXML.zip

OGG-only FDD record: ogg_fddXML.zip
