Multiple results in DROID not visible in Siegfried #112

Closed
JSLair opened this Issue Feb 27, 2018 · 9 comments

JSLair commented Feb 27, 2018

Hello,

First of all, your tool is very efficient and useful. We are using it in our archiving solution for the French government.

But we've recently discovered a point that could be a big problem for us. When we use DROID to identify compliant PUIDs in the PRONOM list, we discover that some files can have multiple formats. Siegfried gives only one, and in some cases it can be the wrong one.

For example, this ODF presentation is identified by DROID as both a spreadsheet and a presentation (which can be explained by the fact that there's a spreadsheet object in it), but only as a spreadsheet by Siegfried:
FalseODS.zip

Would it be possible to have an option to get all the possible formats as in DROID?

Regards

Owner

richardlehane commented Feb 27, 2018

Thanks for reporting this issue Jean-Séverin. There are a couple of things going on here, so I'll break them apart and discuss them in separate comments...

Owner

richardlehane commented Feb 27, 2018

Firstly "Would it be possible to have an option to get all the possible formats as in DROID".

The difference between sf and DROID in this case boils down to how both tools apply the relationships between formats (the format priorities).

DROID, as I understand things, applies format priorities after it does its matching i.e. it builds a list of candidate matches & then applies the format priorities as a filter before presenting to users.

sf applies format priorities during matching. I.e. if a match comes in for "PDF" then sf will wait to see if it is a "PDF/A" (or other more specific type of PDF) but it won't wait to see if it is a MPG or anything else unrelated to that initial match. This is really sf's special sauce and is what allows it to have good performance without setting a predefined limit on how much of a file should be scanned (the "Maximum bytes" setting you set in DROID's preferences).
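To make the contrast concrete, here is a toy sketch of the two strategies. It is not taken from either tool's source: the priority table, the format names and the order of the hits are all made up for illustration, and real matching works on byte offsets and signatures rather than a ready-made list of hits.

```python
# Toy model only. PRIORITIES[x] is the (hypothetical) set of formats that
# take priority over x, i.e. more specific formats that should replace it.
PRIORITIES = {
    "PDF": {"PDF/A"},
    "PDF/A": set(),
    "MPEG": set(),
}


def filter_after_matching(hits):
    """DROID-like: collect every hit first, then drop superseded ones."""
    found = set(hits)
    # dict.fromkeys removes duplicate hits while keeping their order
    return [h for h in dict.fromkeys(hits) if not (PRIORITIES[h] & found)]


def filter_during_matching(hits):
    """sf-like: once there is a candidate, only related formats can replace it."""
    best = None
    for h in hits:
        if best is None:
            best = h                    # first hit becomes the candidate
        elif h in PRIORITIES[best]:
            best = h                    # a more specific format wins
        # hits unrelated to the current candidate are not waited for
    return [] if best is None else [best]
```

With the made-up hit order `["MPEG", "PDF", "PDF/A"]`, the first function keeps both `MPEG` and `PDF/A`, while the second settles on `MPEG` alone. With `["PDF", "PDF/A"]` both return `PDF/A`.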

Issues arise with sf's approach when:

1. format relationships aren't comprehensively defined in PRONOM;
2. format signatures aren't sufficiently accurate in PRONOM;
3. files are constructed that legitimately present as multiple formats e.g. polyglot files such as those demonstrated by Ange Albertini.

You can control the way sf applies format priorities by building custom signatures with the roy tool. The -multi flag allows you to override the default setting. If you do roy build -multi comprehensive then format priorities will be ignored during matching and you'll get a result set that includes all matching signatures. More detail about all the -multi options is available here: https://github.com/richardlehane/siegfried/wiki/Building-a-signature-file-with-ROY#customisable

Unfortunately this probably doesn't solve your issue with this file, as the results with roy build -multi comprehensive are more exhaustive than DROID's (priorities are basically just ignored) [see below]. There isn't currently a -multi setting that will match the DROID results. It might be possible to add a new -multi setting that would cause sf to mimic the DROID approach of applying priorities after matching... I'll investigate adding this as a feature. (Though I wouldn't recommend using it as it will slow things down!)

image

Owner

richardlehane commented Feb 28, 2018

Secondly "When we use DROID to identify compliant PUIDs in the PRONOM list, we discover that some files can have multiple formats. Siegfried gives only one, and in some cases it can be the wrong one."

Please don't be offended but I'd like to gently challenge this statement :). I'd argue that for the case of your "FalseODS.odp" file both sf and DROID are wrong & the real cause is an underlying issue with the PRONOM signatures.

So sf is clearly wrong here: this file is a presentation, not a spreadsheet. Thankfully you do get a warning which should cause you to investigate ("extension mismatch"):

image
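For batch runs, warnings like this one can be collected from sf's JSON output rather than read off the screen. The sketch below assumes the report has a top-level `files` list whose entries carry `filename` and a `matches` list with `id` and `warning` fields. Treat those field names as an assumption and check them against the output of your own sf version.

```python
import json


def flag_for_review(sf_json_text):
    """Return {filename: [reasons]} for files worth a manual look.

    A file is flagged when it has more than one match or when any
    match carries a non-empty warning (e.g. "extension mismatch").
    Field names are assumed, not taken from sf's documentation.
    """
    report = json.loads(sf_json_text)
    flagged = {}
    for entry in report.get("files", []):
        matches = entry.get("matches", [])
        reasons = []
        if len(matches) > 1:
            ids = ", ".join(m.get("id", "?") for m in matches)
            reasons.append("multiple matches: " + ids)
        for m in matches:
            if m.get("warning"):
                reasons.append(m.get("id", "?") + ": " + m["warning"])
        if reasons:
            flagged[entry.get("filename", "?")] = reasons
    return flagged
```

A typical use would be to save the JSON report to a file, pass its contents to `flag_for_review`, and review only the files it returns.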

I think DROID is also wrong here: the file isn't a presentation and a spreadsheet, it is a presentation. At least DROID does give you the right option in the list but a user still needs to investigate to determine which of the options is right.

The underlying issue, I believe, is the way the PRONOM signatures match ODF files. These files are matched by container signatures (signatures that unpack the ODF files' zip containers) that look in the META-INF/manifest.xml file for the mime-type. The problem is that ODF files can sometimes contain multiple objects (e.g. a word document with an embedded spreadsheet), in which case those manifest.xml files will contain multiple mime-types. The PRONOM signatures have no way to determine what the primary mime-type is.
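To inspect such a file by hand, a short script can list the mime-types declared in the manifest. This is only a sketch for investigation, not how PRONOM container signatures work. It assumes the usual ODF packaging convention that the manifest entry with full-path "/" describes the package itself, while embedded objects appear under their own sub-paths (e.g. "Object 1/"). The function name is made up.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by ODF package manifests
MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
ODF_PREFIX = "application/vnd.oasis.opendocument."


def manifest_media_types(odf_file):
    """Return (primary, embedded) ODF media types from the manifest.

    primary  : media type of the root entry ("/"), or None if absent
    embedded : {path: media type} for other ODF-typed entries
    """
    with zipfile.ZipFile(odf_file) as zf:
        root = ET.fromstring(zf.read("META-INF/manifest.xml"))
    primary = None
    embedded = {}
    for entry in root.iter("{%s}file-entry" % MANIFEST_NS):
        path = entry.get("{%s}full-path" % MANIFEST_NS)
        media = entry.get("{%s}media-type" % MANIFEST_NS)
        if not media or not media.startswith(ODF_PREFIX):
            continue  # skip content.xml, images, etc.
        if path == "/":
            primary = media       # assumption: root entry = the document itself
        else:
            embedded[path] = media
    return primary, embedded
```

For a presentation with an embedded spreadsheet, this would be expected to report the presentation type as primary and the spreadsheet type under the embedded object's path.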

I'd recommend that the best fix for the issue you are facing with this file is to approach the PRONOM team and request that the ODF signatures be made more specific so that they can deal with ODF files that contain embedded objects. I'll do a bit of research to see if I can come up with a concrete suggestion for how the signatures can be improved.

As I said in the comment above - I suspect most issues of this kind will boil down to either a PRONOM problem (missing relationships or signatures that lack sufficient precision) that should be fixed in PRONOM or (more rarely) be genuinely polyglot files. I don't have a solution for polyglots but also I don't really see it as sf's role to uncover these as polyglots have typically been artificially crafted and aren't "natural" files that we should expect to encounter in digital preservation work.

Author

JSLair commented Mar 18, 2018

Thanks for your answers,

On the first point, I understand that Siegfried's main goal is to find a format as quickly as possible, so your default approach is surely the best. But we have to know whether there are different answers, and we know that it will be more time consuming.
We'll try the -multi comprehensive option and see if it can be the solution.

On the second point, we will contact TNA about PRONOM once we have made a little more progress in our preservation workshops.

Owner

richardlehane commented May 29, 2018

Just to add to this ticket: I've had reports of PDF files that have been misidentified as MPEG (because they contain an MPEG byte pattern). DROID identifies these with a double PDF + MPEG identification.

Owner

richardlehane commented Aug 30, 2018

Hi Jean-Séverin
I've made changes to the matching algorithm for v1.7.9 (see this post for details: https://www.itforarchivists.com/post/sf179/). I hope that these changes address this issue - could you please review against the files you were having trouble with and let me know if resolved?
thanks
Richard

Author

JSLair commented Aug 31, 2018

Author

JSLair commented Sep 7, 2018

Owner

richardlehane commented Sep 10, 2018

thanks for the update Jean-Séverin, that's good to hear (and great that you are doing such large-scale comparison of results). I'll close this ticket now - but please re-open if you notice any further issues
