Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LoC+PROMOM identifier ids indexes aren't being built as one might expect, affecting the basis field results #142

Closed
ross-spencer opened this issue Jun 8, 2020 · 3 comments

Comments

@ross-spencer
Copy link
Collaborator

When an LoC identifier is built with PRONOM the sources are added as follows:

  1. All LoC sigs and ids are generated.
  2. The PRONOM identifiers we want for LoC records that have a PUID are then generated and attached.

Signatures and IDs are then returned to the identifier when it's built.

For the byte matcher this is done here.

OGG is a good example format here as the LoC identifier expects to have one LoC record with a byte-pattern and one PRONOM sequence that should match against it as well.

If we test against the OGG skeleton the ideal result is as follows:

$ ./sf ogg/fmt-203-signature-id-504.ogg
---
siegfried   : 1.8.0
scandate    : 2020-06-07T21:40:23-04:00
signature   : default.sig
created     : 2020-06-07T21:39:38-04:00
identifiers : 
  - name    : 'loc'
    details : 'fddXML.zip (2016-12-13, DROID_SignatureFile_V96.xml, container-signature-20200121.xml)'
---
filename : 'ogg/fmt-203-signature-id-504.ogg'
filesize : 62
modified : 2020-06-07T15:57:34-04:00
errors   : 
matches  :
  - ns      : 'loc'
    id      : 'fdd000026'
    format  : 'Ogg File Format'
    full    : 'Ogg File Format'
    mime    : 'application/ogg'
    basis   : 'extension match ogg; byte match at 0, 6 (signature 2/2)'
    warning : 

The Place() function inside the identifier looks up the IDs index, and returns position and total number of signatures for the matching pattern.

Again, the ideal for these indexes (I believe) should be as follows:

Indexes IDs (base.Place() [fdd000019 fdd000019 fdd000022 fdd000022 fdd000022 fdd000022 fdd000022 fdd000026 fdd000026 fdd000027 fdd000027 fdd000031 fdd000031]

Note that all of the identifiers run contiguously for each other within the slice, i.e. 0026 entries are next to each other, 0027 entries follow each other, etc.

Actual:

What we're seeing currently is:

Indexes IDs (base.Place() [fdd000019 fdd000022 fdd000022 fdd000022 fdd000026 fdd000027 fdd000031 fdd000019 fdd000027 fdd000022 fdd000022 fdd000031 fdd000026]

Note that 0026 entries are 4th and last in the slice. Siegfried will stop looping through the indexes to calculate the position and total number before it finds the 2nd-n identifier it is supposed to find.

The primary impact here seems to be visual. The binary pattern being used and the result returned is still accurate, but instead of:

  • basis : 'extension match ogg; byte match at 0, 6 (signature 2/2)'

Will be:

  • basis : 'extension match ogg; byte match at 0, 6'

Where we don't see the (signature 2/2) value we'd like to know which pattern matched specifically so that we can audit the results in more detail.

---
siegfried   : 1.8.0
scandate    : 2020-06-07T21:38:21-04:00
signature   : default.sig
created     : 2020-06-07T21:35:35-04:00
identifiers : 
  - name    : 'loc'
    details : 'fddXML.zip (2016-12-13, DROID_SignatureFile_V96.xml, container-signature-20200121.xml)'
---
filename : 'ogg/fmt-203-signature-id-504.ogg'
filesize : 62
modified : 2020-06-07T15:57:34-04:00
errors   : 
matches  :
  - ns      : 'loc'
    id      : 'fdd000026'
    format  : 'Ogg File Format'
    full    : 'Ogg File Format'
    mime    : 'application/ogg'
    basis   : 'extension match ogg; byte match at 0, 6'
    warning : 

I've some sample files here to help making recreate this a little easier.

OGG skeleton fmt-203-signature-id-504.zip

Restricted FDD set which is enough to recreate the issue without being the whole set restricted-set-fddXML.zip

OGG only FDD record ogg_fddXML.zip

@richardlehane
Copy link
Owner

nice pickup Ross!

It might be nice to fix in a way that all implementations of Identifers could benefit? E.g. by adding a sort feature to parseable.go in the internal/identifier package.

E.g. you could make a sorted implementation of the Parseable interface:
https://gist.github.com/richardlehane/8a8a1fec99f4d2879ed12373901b507f

If you then add a line to the ApplyConfig(p Parseable) func:
p = sorted{p}
you'd fix it?

@ross-spencer
Copy link
Collaborator Author

Nice. I'm pretty confident that works as well. I've submitted your fix on the vanilla develop branch along with a proposed unit test for it. Hopefully that's not looking too bad.

@richardlehane
Copy link
Owner

Fixed with 1.9.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants