Multiple results in DROID not visible in Siegfried #112
First of all, your tool is very efficient and useful. We are using it in our archiving solution for the French government.
But there's a point we've recently discovered which could be a big problem for us. When we use DROID to identify files against the PUIDs in the PRONOM list, we find that some files can match multiple formats. Siegfried gives only one, and in some cases it can be the wrong one.
For example, this ODF presentation is identified by DROID as both a spreadsheet and a presentation (which can be explained by the fact that there's a spreadsheet object in it), but only as a spreadsheet by Siegfried.
Would it be possible to have an option to get all the possible formats as in DROID...
Firstly "Would it be possible to have an option to get all the possible formats as in DROID".
The difference between sf and DROID in this case boils down to how both tools apply the relationships between formats (the format priorities).
DROID, as I understand things, applies format priorities after it does its matching i.e. it builds a list of candidate matches & then applies the format priorities as a filter before presenting to users.
sf applies format priorities during matching. I.e. if a match comes in for "PDF" then sf will wait to see if it is a "PDF/A" (or other more specific type of PDF) but it won't wait to see if it is a MPG or anything else unrelated to that initial match. This is really sf's special sauce and is what allows it to have good performance without setting a predefined limit on how much of a file should be scanned (the "Maximum bytes" setting you set in DROID's preferences).
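To make the contrast concrete, here is a minimal sketch of the two strategies as described above. It is not the actual code of either tool: the `PRIORITIES` map and both function names are hypothetical, and real format priorities come from PRONOM rather than a hand-written dictionary.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical priority map: each key supersedes every format in its value set.
PRIORITIES: Dict[str, Set[str]] = {
    "PDF/A": {"PDF"},
}


def filter_after_matching(candidates: Iterable[str]) -> List[str]:
    """Post-filter style: gather every candidate, then drop superseded ones."""
    found = list(candidates)
    superseded: Set[str] = set()
    for fmt in found:
        superseded |= PRIORITIES.get(fmt, set())
    return [fmt for fmt in found if fmt not in superseded]


def formats_worth_waiting_for(first_match: str) -> Set[str]:
    """During-matching style: after a first hit, only formats that
    supersede that hit are still of interest."""
    return {fmt for fmt, lower in PRIORITIES.items() if first_match in lower}
```

In this sketch, a post-filter over `["PDF", "PDF/A", "MPG"]` keeps both `PDF/A` and the unrelated `MPG`, whereas after a first `PDF` hit the during-matching approach would only keep looking for `PDF/A`.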
Issues arise with sf's approach when: 1. format relationships aren't comprehensively defined in PRONOM; 2. format signatures aren't sufficiently accurate in PRONOM; 3. files are constructed that legitimately present as multiple formats e.g. polyglot files such as those demonstrated by Ange Albertini.
You can control the way sf applies format priorities by building custom signatures with the
Unfortunately this probably doesn't solve your issue with this file as the results with
Secondly "When we use DROID to identify files against the PUIDs in the PRONOM list, we find that some files can match multiple formats. Siegfried gives only one, and in some cases it can be the wrong one."
Please don't be offended but I'd like to gently challenge this statement :). I'd argue that for the case of your "FalseODS.odp" file both sf and DROID are wrong & the real cause is an underlying issue with the PRONOM signatures.
So sf is clearly wrong here, this file is a presentation not a spreadsheet. Thankfully you do get a warning which should cause you to investigate ("extension mismatch"):
I think DROID is also wrong here: the file isn't a presentation and a spreadsheet, it is a presentation. At least DROID does give you the right option in the list but a user still needs to investigate to determine which of the options is right.
The underlying issue, I believe, is the way the PRONOM signatures match ODF files. These files are matched by container signatures (signatures that unpack the ODF files' zip containers) that look in the META-INF/manifest.xml file for the mime-type. The problem is that ODF files can sometimes contain multiple objects (e.g. a word-processing document with an embedded spreadsheet), in which case those manifest.xml files will contain multiple mime-types. The PRONOM signatures have no way to determine which one is the primary mime-type.
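As an illustration of the multiple mime-types problem, the sketch below lists the media types declared in an ODF package's manifest. It assumes a well-formed package in which the manifest entry with full-path "/" describes the document as a whole, while embedded objects get their own entries (for example "Object 1/"). The function names are hypothetical, and this is not how PRONOM, DROID or sf inspect these files.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Namespace used by manifest.xml in ODF packages.
MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"


def manifest_media_types(source) -> Dict[str, str]:
    """Map each manifest full-path to its declared media-type.

    `source` is a file path or a binary file-like object holding the package.
    Entries with an empty or missing media-type are skipped.
    """
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("META-INF/manifest.xml"))
    types: Dict[str, str] = {}
    for entry in root.iter(f"{{{MANIFEST_NS}}}file-entry"):
        path = entry.get(f"{{{MANIFEST_NS}}}full-path")
        media = entry.get(f"{{{MANIFEST_NS}}}media-type")
        if path is not None and media:
            types[path] = media
    return types


def primary_media_type(source) -> Optional[str]:
    """Assumption: the "/" entry carries the package's own media type."""
    return manifest_media_types(source).get("/")
```

For a presentation with an embedded spreadsheet, `manifest_media_types` would be expected to return both the presentation and the spreadsheet types, which is the ambiguity described above. Only the "/" entry, if present, hints at which one applies to the file as a whole.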
I'd recommend that the best fix for the issue you are facing with this file is to approach the PRONOM team and request that the ODF signatures be made more specific so that they can deal with ODF files that contain embedded objects. I'll do a bit of research to see if I can come up with a concrete suggestion for how the signatures can be improved.
As I said in the comment above - I suspect most issues of this kind will boil down to either a PRONOM problem (missing relationships or signatures that lack sufficient precision) that should be fixed in PRONOM or (more rarely) be genuinely polyglot files. I don't have a solution for polyglots but also I don't really see it as sf's role to uncover these as polyglots have typically been artificially crafted and aren't "natural" files that we should expect to encounter in digital preservation work.
Thanks for your answers,
On the first point, I understand that Siegfried's main goal is to find a format as quickly as possible, so your default approach is surely the best one. But we need to know whether there are different answers, and we accept that this will be more time-consuming.
On the second point, we will contact TNA about PRONOM once we have made a little more progress in our preservation workshops.
Hi Richard, In a first quick test on a few files it seems to be fine, thanks a lot!!! We were considering changing to DROID but I hope we can now keep using Siegfried! I'll ask my team to test with more files and I'll let you know. Jean-Séverin LAIR, Program head manager, Programme Vitam, Services du Premier Ministre, DINSIC. Richard Lehane wrote:…
Hi Jean-Séverin I've made changes to the matching algorithm for v1.7.9 (see this post for details: https://www.itforarchivists.com/post/sf179/). I hope that these changes address this issue - could you please review against the files you were having trouble with and let me know if resolved? thanks Richard
---------------------------------------------------------------------- Please help us protect the environment by printing this email and its attachments only if necessary.
Hi, I can confirm that on a set of around ten problematic files where the optimal DROID results and the Siegfried results diverged (out of thousands and thousands of files...), the result from Siegfried is now the expected one. Thanks a lot for this improvement! JS Lair www.programmevitam.fr…