Add PRONOM types to PRONOM identifier #209

ross-spencer · 2022-12-28T11:56:16Z

Exploration of adding PRONOM type classifications to the Siegfried PRONOM identifier.

Connected to: #207

This results in a new output from Siegfried which looks something like the following:

filename : 'testdata/skeleton-suite/x-fmt/x-fmt-95-signature-id-858.pwi'
filesize : 5
modified : 2020-07-05T19:53:49+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/95'
    format  : 'Inkwriter/Notetaker Document'
    version : 
    mime    : 
    type    : 'Word Processor'
    basis   : 'extension match pwi; byte match at 0, 5'
    warning :

Note the addition of type : 'Word Processor'

NB. This will only show a value if the PRONOM identifier is configured with PRONOM reports, i.e. the PRONOM XML export from PRONOM itself. The DROID signature file still needs this information to be added; we believe this is on the way. I can attend the next PRONOM meeting at the beginning of the year to ask more.

Tests have been included as part of this feature. Additionally, the source files touched have been changed so that they pass linting. These changes are in the third commit associated with the PR and may warrant special attention for accuracy, especially the correctness of the documentation.

Tests are added for the PRONOM types work along with new helper functions for making Siegfried tests more discrete and maintainable.

PRONOM identifier related linting fixes for the different source files touched by the PRONOM types additions.

ross-spencer · 2022-12-28T12:37:40Z

cmd/sf/sf.go

@@ -427,7 +427,7 @@ func main() {
 	case *jsono:
 		w = writer.JSON(os.Stdout)
 	case *droido:
-		if len(s.Fields()) != 1 || len(s.Fields()[0]) != 7 {
+		if len(s.Fields()) != 1 || len(s.Fields()[0]) != 8 {


I was thinking about something like this for here:

p, _ := pronom.New()
if len(s.Fields()) != 1 || len(s.Fields()[0]) != len(p.Fields()) {
	// ...
}

But would that just create a big overhead in terms of speed (I think all reports are read?), and there's always the potential for error. Is there another way to export the fields, given that they're constant?

richardlehane · 2023-01-13

yes, loading pronom here would be expensive!
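One way to answer the question above without loading PRONOM is sketched below. It is only an illustration: `pronomFields` and `isDroidCompatible` are hypothetical names, not part of Siegfried's API, and the field names are copied from the sample output in the PR description.

```go
package main

import "fmt"

// pronomFields is a hypothetical package-level list of the PRONOM
// identifier's output fields. The names follow the sample output in this
// PR; the variable itself is an assumption, not Siegfried's API.
var pronomFields = []string{
	"ns", "id", "format", "version", "mime", "type", "basis", "warning",
}

// isDroidCompatible reports whether fields describes exactly one
// identifier whose field count matches the PRONOM identifier's, without
// constructing an identifier or reading any reports.
func isDroidCompatible(fields [][]string) bool {
	return len(fields) == 1 && len(fields[0]) == len(pronomFields)
}

func main() {
	// A single identifier carrying the full PRONOM field set.
	fmt.Println(isDroidCompatible([][]string{pronomFields}))
	// A single identifier with too few fields.
	fmt.Println(isDroidCompatible([][]string{{"ns", "id"}}))
}
```

The check stays cheap because the field list is plain data, and the magic number in cmd/sf/sf.go would follow the list automatically if a field were added or removed.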

We can avoid writing to disk and make the tests here more portable by reading from an in-memory filesystem. The skeletons themselves are small and so can be easily stored in-line as strings and then turned into byte objects. Given the refactor to in-memory objects, we also take the opportunity to add a file that won't identify with the minimal PRONOM signature file and PRONOM reports. Type should be a nil-string as with many of the other fields.

ross-spencer · 2023-01-05T17:29:20Z

NB. I was accidentally testing in "production" against this branch today and it looks like type may have crept into the DROID report. I need to go back to the DROID CSV creation and make sure that doesn't happen, and then likely include a DROID-specific test to make sure headers are output correctly.

NB. Also, chatted to the PRONOM crew today in the PRONOM weekly. David C checked DROID signature files for compatibility, and they look good. One question is whether the SOAP service that delivers the XML to DROID has a different take on this, but overall it sounds positive this can potentially be added. It is a good time to ask with other PRONOM changes in the works over the course of the year.

Ensures that the DROID header doesn't change in code unless it is explicitly made to do so.

ross-spencer · 2023-02-05T20:44:13Z

A small DROID header test has been added here: 957c2e7
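The idea behind a header test like this can be sketched as pinning the expected columns and failing on any difference. The column list below is abbreviated and illustrative, not copied from Siegfried's DROID writer, and `pinnedHeader` and `checkHeader` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// pinnedHeader is an abbreviated, illustrative column list. The real DROID
// CSV header has more columns; check the writer before relying on these.
var pinnedHeader = []string{
	"ID", "PARENT_ID", "URI", "FILE_PATH", "NAME", "SIZE", "TYPE", "PUID",
}

// checkHeader returns an error if the emitted header line differs from the
// pinned one, so a header change has to be made explicitly in both places.
func checkHeader(line string) error {
	want := strings.Join(pinnedHeader, ",")
	if line != want {
		return fmt.Errorf("DROID header changed:\n got: %s\nwant: %s", line, want)
	}
	return nil
}

func main() {
	unchanged := "ID,PARENT_ID,URI,FILE_PATH,NAME,SIZE,TYPE,PUID"
	fmt.Println(checkHeader(unchanged) == nil)
	// An extra column, such as a format class, is reported as a change.
	fmt.Println(checkHeader(unchanged+",CLASS") == nil)
}
```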

To clarify, the "TYPE" field is specific to the DROID CSV, and so using type in the standard YAML/JSON/CSV outputs of SF may not be a good idea; it could be somewhat confusing.

@robin-francois @richardlehane is Classification the preferred term for the format type/classification field? Does it make sense to others reading this?

LoC uses "content categories" to describe this: https://www.loc.gov/preservation/digital/formats/content/content_categories.shtml

PRONOM as we know uses "Classification".

Wikidata doesn't seem to have an equivalent predicate; it tends towards "instance of" and "use"/"used for" to describe something similar. There may be an equivalent predicate that I haven't found.

robin-francois · 2023-02-06T16:57:06Z

Thanks @ross-spencer, that's a splendid job.

Wikipedia seems to use "type of format" in some pages. I have no favoured term. Either category or class seems appropriate instead of the misleading type.

ross-spencer · 2023-02-07T07:32:07Z

Thanks @robin-francois - both good options, perhaps preceded by "format" e.g. "format class" in the one example? Thinking about it I'll add these notes to the discussion page and then post it on digipres.club/Twitter and see if folks have an opinion too. I'm not sure why the discussion works differently from the PR comments - but I think maybe it does?

NB. it's also really easy to change these things, so once there's a name we've settled on, I can make those changes and the PR should be good for review.

prwheatley · 2023-02-07T15:33:30Z

Would be great to have this - and useful to have a decent standard set of content type categories (however imperfect they always are) for use in other areas. COPTR has its own that aligns/overlaps pretty well with LOC, albeit with some different titles and a few that aren't on the LOC list. https://coptr.digipres.org/index.php/Content_Types

richardlehane · 2023-02-08T13:26:35Z

thanks for all the work on this @ross-spencer ... it's looking good!

There's a chance that some of the current integrations with sf are relying on the number/order of fields (e.g. if they are using csv output they may expect certain elements in certain columns). I think adding this new "classification" field should become the new default but I wonder whether adding a flag to roy to build a signature file without the field (e.g. "-noclass" flag) might make sense as a fallback in case any integrations get broken? This could also be how signatures built with droid xml files only get handled.

To do this you'd probably need to add a new bool field in the PRONOM identifier to indicate whether or not classifications are used & then make the output functions check that field to give the two possible forms of output. What do you think?
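A minimal sketch of that suggestion follows, assuming a hypothetical `identifier` type with a `noClass` flag. The field names come from the sample output earlier in the thread, with "class" standing in for whichever term is finally chosen. None of this is Siegfried's actual code.

```go
package main

import "fmt"

// identifier is a hypothetical stand-in for the PRONOM identifier. noClass
// records whether the signature file was built without classifications,
// e.g. via a "-noclass" flag or from DROID XML only.
type identifier struct {
	noClass bool
}

// Fields returns the output columns. The "class" column is present by
// default and omitted when noClass is set, giving the two forms of output.
func (i identifier) Fields() []string {
	fields := []string{"ns", "id", "format", "version", "mime"}
	if !i.noClass {
		fields = append(fields, "class")
	}
	return append(fields, "basis", "warning")
}

func main() {
	fmt.Println(len(identifier{}.Fields()), identifier{}.Fields())
	legacy := identifier{noClass: true}
	fmt.Println(len(legacy.Fields()), legacy.Fields())
}
```

Keeping the decision in one method means the writers only need to ask the identifier for its fields, and integrations that depend on the old column count can opt out at build time.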

richardlehane · 2023-02-08T13:32:26Z

pkg/pronom/pronom_test.go

+// the test data directory.
+func setMinimalParams() {
+	config.SetDroid("DROID_minimal.xml")
+	config.SetPRONOMReportsDir("pronom_minimal")


Suggest you could omit the "DROID_minimal.xml" and pronom_minimal/* dependencies if you add a config.SetLimit("fmt/1", "fmt/11", "fmt/14", "fmt/3", "fmt/5")

Call it like config.SetLimit("fmt/1", "fmt/11", "fmt/14", "fmt/3", "fmt/5")() to set the limit immediately (needed if you are using NewPronom() rather than pronom.New(opts...)).

ross-spencer · 2023-02-22

If I understand correctly, rather than looking at the new files as dependencies per se, the intention here was to make the tests less sensitive to changes in the signature file, and only sensitive to changes in the code. So there's a bit of overhead with this PR to achieve that, but in future, DROID_SignatureFile_v109.xml can be upgraded for the tests that use it, without having to fix up any breakages in these newly introduced tests, e.g. if more format classifications are added. I believe it can lead to more flexibility in how test cases are added in future.

What do you think? I can revert back to the current approach if it doesn't work? Another way to do this, to avoid creating more fixtures here, is to see the concept used for creating the skeleton files in this PR through to the end and maybe output the signature files to a temporary location on the filesystem too?

richardlehane · 2023-02-22

That's a good point but there probably won't be too much churn in the tests: you chose pretty stable formats, none of them has been updated since 2013! If there are occasional failures due to PRONOM updates it might also be interesting to know?

richardlehane · 2023-02-08T13:46:45Z

pkg/config/pronom.go

@@ -202,8 +203,17 @@ func TextPuid() string {
 // SetDroid sets the name and/or location of the DROID signature file.
 // I.e. can provide a full path or a filename relative to the HOME directory.
 func SetDroid(d string) func() private {
+	pronom.droid = d


pronom.droid should be set within the closure, not outside it. Setting it inside the closure means that the option only takes effect when the identifier is made. The change as written here means that the option gets set immediately (changing the behaviour of code in cmd/roy). Generally, if you need config options like this to take immediate effect you can do so by invoking them like this: SetDroid("x")() [that second pair of parens calls the returned option func immediately]

ross-spencer · 2023-02-22

Ah! That explains why it wasn't working - will give this a try.
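The difference between the two placements can be sketched in isolation. Here `private` and `droid` are stand-ins for the unexported type and the `pronom.droid` setting in pkg/config, and the two function names are hypothetical, chosen only to contrast the behaviours.

```go
package main

import "fmt"

// private is a stand-in for the unexported return type used by the config
// option funcs in the diff above.
type private struct{}

// droid is a stand-in for the pronom.droid setting.
var droid string

// setDroidDeferred assigns inside the closure, so the setting only changes
// when the returned option is invoked (e.g. when the identifier is made).
func setDroidDeferred(d string) func() private {
	return func() private {
		droid = d
		return private{}
	}
}

// setDroidEager assigns outside the closure, so the setting changes as
// soon as the function is called, whether or not the option is ever used.
func setDroidEager(d string) func() private {
	droid = d
	return func() private { return private{} }
}

func main() {
	opt := setDroidDeferred("a.xml")
	fmt.Printf("after creating the option: %q\n", droid)
	opt()
	fmt.Printf("after invoking the option: %q\n", droid)

	// The trailing () invokes the deferred option immediately.
	setDroidDeferred("b.xml")()
	fmt.Printf("after SetDroid-style call with (): %q\n", droid)

	_ = setDroidEager("c.xml")
	fmt.Printf("after eager variant, option unused: %q\n", droid)
}
```

The deferred form keeps options inert until the caller applies them, and the trailing `()` gives tests a way to force the setting without constructing an identifier.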

richardlehane · 2023-02-08T13:47:51Z

pkg/config/pronom.go

+
+// SetPRONOMReportsDir sets the PRONOM reports directory, used to
+// generate a PRONOM identifier from the XML data retrieved from PRONOM.
+func SetPRONOMReportsDir(r string) func() private {


suggest this function not needed if config.SetLimit() used (as described in another comment)... could probably drop this addition

richardlehane · 2023-03-20T15:30:20Z

merged into develop branch with this commit: 98516b1

ross-spencer · 2023-03-20T17:48:19Z

@richardlehane I think that squish probably lost the attribution to me unfortunately. I'm not 100% sure. The GitHub UI is a bit flaky on this. You probably wanted something like Co-authored-by: https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors#required-co-author-information

I noticed the fixup work. It's not that I didn't have time to do this, I just didn't have all the information about what needed changing, and was holding off until I heard back a little bit more from PRONOM, i.e. seeking closer alignment with PRONOM/DROID changes. Still. Good to have this in.

richardlehane · 2023-03-20T20:44:03Z

hey @ross-spencer, sorry I screwed up when doing that merge, should just not have squashed it. I've fixed the authorship now on the develop branch, unfortunately it looks like merging back into main will be painful. I've been merging in as much as possible as I'd like to cut the new release soon (maybe this weekend?) but if you need a bit more time for this one let me know

ross-spencer · 2023-03-21T08:11:53Z

Thanks @richardlehane. Squishing was a good instinct. With a bit of a heads up, I can help with anything like that in future. I was anticipating rebase/merge myself, but it's a bit of a (necessary) dance. Some of my commits were out of sequence (a price to pay for not having multiple PRs which I have found more often than not a greater headache than git-fu). I don't sense any particular concerns from the PRONOM team about releasing these changes into the world with Siegfried, that's more on me, so I don't think there's a need to hold back. The DROID changes are being investigated with no concrete promises, but it will be exciting to have that change to the DROID sig file in time.

richardlehane · 2023-03-21T10:25:02Z

ok so, in rectifying the authorship history, I completely messed up the develop branch... it needs to be sacrificed now to the gods of git. In my defence the stack overflow answer I was following did have 700+ points.

So... I've made a new "release" branch (https://github.com/richardlehane/siegfried/tree/release) and cherry picked the intervening commits from develop into it. This does change history a bit (dates are gone and it makes me committer for everything) but authors are fixed, history is linear, and it can be merged back into main without git screaming at me. Pls work off release for now if you've got any other things in train. I'll hose develop and after the 1.10 release will start a new develop branch.

Next time I'll take you up on that git consultancy offer, or can we just move to fossil

ross-spencer added 3 commits December 27, 2022 17:45

Add format type to Siegfried PRONOM output

0b02110

Add tests for PRONOM types work

2bdc899

Tests are added for the PRONOM types work along with new helper functions for making Siegfried tests more discrete and maintainable.

Linting fixes

e27bb70

PRONOM identifier related linting fixes for the different source files touched by the PRONOM types additions.

ross-spencer force-pushed the dev/add-pronom-type branch from 6051d23 to e27bb70 Compare December 28, 2022 12:24

ross-spencer changed the base branch from main to develop December 28, 2022 12:26

ross-spencer commented Dec 28, 2022

Add test for DROID CSV header output

957c2e7

Ensures that the DROID header doesn't change in code unless it is explicitly made to do so.

ross-spencer force-pushed the dev/add-pronom-type branch from 435c8a8 to 957c2e7 Compare February 5, 2023 20:18

richardlehane reviewed Feb 8, 2023

richardlehane and others added 5 commits March 19, 2023 13:22

use Limit

b958528

add "noclass" flag to allow omitting format class

c95e02d

fix indexes used by droid writer

dcb15c2

Merge branch 'develop' into dev/add-pronom-type

706209d

miscellaneous edit to prompt a merge check

64bf4da

richardlehane closed this Mar 20, 2023

richardlehane deleted the dev/add-pronom-type branch March 20, 2023 20:30

ross-spencer mentioned this pull request Mar 31, 2023

Preparing changes to add format types (classification) from DROID sig file #226

Open
