Multiple results in DROID not visible in Siegfried #112

JSLair · 2018-02-27T12:50:38Z

Hello,

First of all, your tool is very efficient and useful. We are using it in our archiving solution for the French government.

But there's a point we've recently discovered which can be a big problem for us. When we use DROID to identify compliant PUID in Pronom list, we discover that some files can have multiple formats. Siegfried give only one and in different cases it can be a wrong one.

For example, this ODF presentation is identified by DROID as both a spreadsheet and a presentation (which can be explained by the fact that there's a spreadsheet object in it), but only as a spreadsheet by Siegfried:
FalseODS.zip

Would it be possible to have an option to get all the possible formats as in DROID...

Regards

richardlehane · 2018-02-27T23:04:25Z

Thanks for reporting this issue Jean-Séverin. There's a couple of things going on here so I'll break apart and discuss in separate comments...

richardlehane · 2018-02-27T23:45:47Z

Firstly "Would it be possible to have an option to get all the possible formats as in DROID".

The difference between sf and DROID in this case boils down to how both tools apply the relationships between formats (the format priorities).

DROID, as I understand things, applies format priorities after it does its matching i.e. it builds a list of candidate matches & then applies the format priorities as a filter before presenting to users.

sf applies format priorities during matching. I.e. if a match comes in for "PDF" then sf will wait to see if it is a "PDF/A" (or other more specific type of PDF) but it won't wait to see if it is a MPG or anything else unrelated to that initial match. This is really sf's special sauce and is what allows it to have good performance without setting a predefined limit on how much of a file should be scanned (the "Maximum bytes" setting you set in DROID's preferences).
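To make the contrast concrete, here is a toy sketch in Python. It is not sf's or DROID's actual code: the priority table, the format names, and both function names are hypothetical, and the hits are simply fed in as a list in scan order. It only illustrates the difference between filtering by priority after matching and applying priorities while matching.

```python
# Toy model only. PRIORITIES, the format names and both functions are
# hypothetical; they do not mirror sf's or DROID's real implementations.

# PRIORITIES[x] is the set of formats that x takes priority over
# (i.e. x is the more specific format).
PRIORITIES = {"PDF/A": {"PDF"}}


def filter_after_matching(candidates):
    """Collect every match first, then drop any candidate that another
    candidate takes priority over (the DROID-like approach described above)."""
    superseded = set()
    for candidate in candidates:
        superseded |= PRIORITIES.get(candidate, set())
    return [c for c in candidates if c not in superseded]


def match_with_priorities(hits):
    """Apply priorities while matching (the sf-like approach described above):
    after the first hit, only a format with priority over the current result
    can replace it; unrelated later hits are ignored."""
    result = None
    for hit in hits:  # hits arrive in scan order
        if result is None:
            result = hit
        elif result in PRIORITIES.get(hit, set()):
            result = hit  # a more specific format supersedes the earlier hit
    return [result] if result is not None else []


hits = ["PDF", "MPEG", "PDF/A"]
print(filter_after_matching(hits))  # ['MPEG', 'PDF/A']
print(match_with_priorities(hits))  # ['PDF/A']
```

In this toy model the second function merely ignores unrelated hits; the point made above is that a real matcher working this way can stop scanning once no related, more specific format remains possible.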

Issues arise with sf's approach when: 1. format relationships aren't comprehensively defined in PRONOM; 2. format signatures aren't sufficiently accurate in PRONOM; 3. files are constructed that legitimately present as multiple formats e.g. polyglot files such as those demonstrated by Ange Albertini.

You can control the way sf applies format priorities by building custom signatures with the roy tool. The -multi flag allows you to override the default setting. If you do roy build -multi comprehensive then format priorities will be ignored during matching and you'll get a result set that includes all matching signatures. More detail about all the -multi options is available here: https://github.com/richardlehane/siegfried/wiki/Building-a-signature-file-with-ROY#customisable

Unfortunately this probably doesn't solve your issue with this file as the results with roy build -multi comprehensive are more exhaustive than DROID's (priorities are basically just ignored) [see below]. There isn't currently a -multi setting that will match the DROID results. It might be possible to add a new -multi setting that would cause sf to mimic the DROID approach of applying priorities after matching... I'll investigate adding this as a feature. (Though wouldn't recommend using it as it will slow things down!)

richardlehane · 2018-02-28T01:46:32Z

Secondly "When we use DROID to identify compliant PUID in Pronom list, we discover that some files can have multiple formats. Siegfried give only one and in different cases it can be a wrong one."

Please don't be offended but I'd like to gently challenge this statement :). I'd argue that for the case of your "FalseODS.odp" file both sf and DROID are wrong & the real cause is an underlying issue with the PRONOM signatures.

So sf is clearly wrong here: this file is a presentation, not a spreadsheet. Thankfully you do get a warning which should cause you to investigate ("extension mismatch").

I think DROID is also wrong here: the file isn't a presentation and a spreadsheet, it is a presentation. At least DROID does give you the right option in the list but a user still needs to investigate to determine which of the options is right.

The underlying issue, I believe, is the way the PRONOM signatures match ODF files. These files are matched by container signatures (signatures that unpack the ODF files' zip containers) that look in the META-INF/manifest.xml file for the mime-type. The problem is that ODF files can sometimes contain multiple objects (e.g. a word document with an embedded spreadsheet), in which case those manifest.xml files will contain multiple mime-types. The PRONOM signatures have no way to determine what the primary mime-type is.
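For anyone who wants to inspect such a file by hand, the following Python sketch lists the media types declared in an ODF package's manifest. It is only an inspection aid, not a proposal for how PRONOM signatures should work. The function names are hypothetical; the sketch assumes the standard ODF manifest namespace and the ODF packaging convention that the entry with full-path "/" describes the package itself, while embedded objects get their own entries (for example "Object 1/").

```python
import xml.etree.ElementTree as ET
import zipfile

# Standard ODF manifest namespace (from the OpenDocument packaging spec).
MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"


def manifest_media_types(path):
    """Return (full-path, media-type) pairs from META-INF/manifest.xml.

    `path` may be a filename or a file-like object holding the ODF zip.
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("META-INF/manifest.xml"))
    entries = []
    for entry in root.iter(f"{{{MANIFEST_NS}}}file-entry"):
        entries.append(
            (
                entry.get(f"{{{MANIFEST_NS}}}full-path"),
                entry.get(f"{{{MANIFEST_NS}}}media-type"),
            )
        )
    return entries


def package_media_type(entries):
    """Return the media type of the root entry (full-path "/"), if present."""
    for full_path, media_type in entries:
        if full_path == "/":
            return media_type
    return None
```

Calling `manifest_media_types("FalseODS.odp")` on a presentation with an embedded spreadsheet would be expected to show both a presentation and a spreadsheet media type, which is the ambiguity described above; `package_media_type` then picks out the one attached to the root entry.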

I'd recommend that the best fix for the issue you are facing with this file is to approach the PRONOM team and request that the ODF signatures be made more specific so that they can deal with ODF files that contain embedded objects. I'll do a bit of research to see if I can come up with a concrete suggestion for how the signatures can be improved.

As I said in the comment above - I suspect most issues of this kind will boil down to either a PRONOM problem (missing relationships or signatures that lack sufficient precision) that should be fixed in PRONOM or (more rarely) be genuinely polyglot files. I don't have a solution for polyglots but also I don't really see it as sf's role to uncover these as polyglots have typically been artificially crafted and aren't "natural" files that we should expect to encounter in digital preservation work.

JSLair · 2018-03-18T16:30:55Z

Thanks for your answers,

On the first point, I understand that Siegfried's main goal is to find a format as quickly as possible, so your default approach is surely the best for that. But we need to know whether there are several possible answers, and we accept that this will be more time-consuming.
We'll try the -multi comprehensive option and see whether it solves our problem.

On the second point, we will contact TNA about PRONOM once we have made a bit more progress in our preservation workshops.

richardlehane · 2018-05-29T07:07:28Z

Just to add to this ticket: I've had reports of PDF files that have been misidentified as MPEG (because they contain an MPEG byte pattern). DROID identifies these with a double PDF + MPEG identification.

richardlehane · 2018-08-30T02:43:12Z

Hi Jean-Séverin
I've made changes to the matching algorithm for v1.7.9 (see this post for details: https://www.itforarchivists.com/post/sf179/). I hope that these changes address this issue - could you please review against the files you were having trouble with and let me know if resolved?
thanks
Richard

JSLair · 2018-08-31T13:25:23Z

Hi Richard,

In a first quick test on a few files it seems to be fine, thanks a lot! We were considering switching to DROID, but I hope we can now keep using Siegfried. I'll ask my team to test with more files and I'll let you know.

Jean-Séverin LAIR
Program head manager, Programme Vitam


JSLair · 2018-09-07T15:19:02Z

Hi,

I can confirm that on a batch of around ten problematic files where Siegfried's results diverged from the optimal DROID results (out of thousands and thousands of files...), the result from Siegfried is now the expected one. Thanks a lot for this improvement!

JS Lair
www.programmevitam.fr


richardlehane · 2018-09-10T03:23:44Z

Thanks for the update Jean-Séverin, that's good to hear (and great that you are doing such a large-scale comparison of results). I'll close this ticket now, but please re-open it if you notice any further issues.

richardlehane self-assigned this Feb 27, 2018

richardlehane added enhancement PRONOM labels Feb 27, 2018

richardlehane closed this as completed Sep 10, 2018
