Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PRONOM types to PRONOM identifier #209

Closed
wants to merge 10 commits into from

Conversation

ross-spencer
Copy link
Collaborator

Exploration adding PRONOM types classification to the Siegfried PRONOM identifier.

Connected to: #207

This results in a new output from Siegfried which looks something as follows:

filename : 'testdata/skeleton-suite/x-fmt/x-fmt-95-signature-id-858.pwi'
filesize : 5
modified : 2020-07-05T19:53:49+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/95'
    format  : 'Inkwriter/Notetaker Document'
    version : 
    mime    : 
    type    : 'Word Processor'
    basis   : 'extension match pwi; byte match at 0, 5'
    warning : 

Note the addition of type : 'Word Processor'

NB. This will only show a value if the PRONOM identifier is configured with PRONOM reports, i.e. the PRONOM XML export from PRONOM itself. The DROID signature file still needs this information to be added, we believe this is on the way. I can attend the next PRONOM meeting at the beginning of the year to ask more.

Tests have been included as part of this feature. Additionally, source files have had linting changes made to them to pass linting. These are in the third commit associated with the PR and may warrant special attention for accuracy, especially around the correctness of the documentation.

Tests are added for the PRONOM types work along with new helper
functions for making Siegfried tests more discrete and maintainable.
PRONOM identifier related linting fixes for the different source
files touched by the PRONOM types additions.
@ross-spencer ross-spencer changed the base branch from main to develop December 28, 2022 12:26
cmd/sf/sf.go Outdated
@@ -427,7 +427,7 @@ func main() {
case *jsono:
w = writer.JSON(os.Stdout)
case *droido:
if len(s.Fields()) != 1 || len(s.Fields()[0]) != 7 {
if len(s.Fields()) != 1 || len(s.Fields()[0]) != 8 {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about something like this for here:

	p, _ := pronom.New()
	if len(s.Fields()) != 1 || len(s.Fields()[0]) != len(p.Fields()) {
	    // ...
	}

But is that just creating a big overhead in terms of speed (I think all reports are read?) and there's always the potential for error. Is there another way to export the fields as they're constant?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, loading pronom here would be expensive!

We can avoid writing to disk and make the tests here more portable by
reading from an in-memory filesystem. The skeletons themselves are
small and so can be easily stored in-line as strings and then turned
into byte objects.

Given the refactor to in-memory objects, we also take the opportunity
to add a file that won't identify with the minimal PRONOM signature
file and PRONOM reports. Type should be a nil-string as with many
of the other fields.
@ross-spencer
Copy link
Collaborator Author

NB. I was accidentally testing in "production" against this branch today and it looks like type may have crept into the DROID report and so I need to go back to the DROID CSV creation and make sure that doesn't happen, and then likely include a DROID specific test to make sure headers are output correctly.

NB. Also, chatted to the PRONOM crew today in the PRONOM weekly. David C checked DROID signature files for compatibility, and they look good. One question is whether the SOAP service that delivers the XML to DROID has a different take on this, but overall it sounds positive this can potentially be added. It is a good time to ask with other PRONOM changes in the works over the course of the year.

Ensures that the DROID header doesn't change in code unless it
is explicitly made to do so.
@ross-spencer
Copy link
Collaborator Author

A small DROID header test has been added here: 957c2e7

To clarify, the "TYPE" field is specific to the DROID CSV, and so, type in the standard YAML/JSON/CSV outputs of SF may not be a good idea, be somewhat confusing.

@robin-francois @richardlehane is Classification the preferred term for the format type/classification field? Does it make sense to others reading this?

LoC uses "content categories" to describe this: https://www.loc.gov/preservation/digital/formats/content/content_categories.shtml

PRONOM as we know uses "Classification".

Wikidata doesn't seem to have the equivalent predicate, it tends towards instance of, and use/used for to describe similar. There may be an equivalent predicate.

@robin-francois
Copy link

Thanks @ross-spencer, that's a splendid job.

Wikipedia seems to use type of format in some pages. I have no favoured term. Both category or class seem to be appropriate instead of the misleading type.

@ross-spencer
Copy link
Collaborator Author

ross-spencer commented Feb 7, 2023

Thanks @robin-francois - both good options, perhaps preceded by "format" e.g. "format class" in the one example? Thinking about it I'll add these notes to the discussion page and then post it on digipres.club/Twitter and see if folks have an opinion too. I'm not sure why the discussion works differently from the PR comments - but I think maybe it does?

NB. it's also really easy to change these things, so once there's a name we've settled on, I can make those changes and the PR should be good for review.

@prwheatley
Copy link

Would be great to have this - and useful to have a decent standard set of content type categories (however imperfect they always are) for use in other areas. COPTR has its own that aligns/overlaps pretty well with LOC, albeit with some different titles and a few that aren't on the LOC list. https://coptr.digipres.org/index.php/Content_Types

@richardlehane
Copy link
Owner

thanks for all the work on this @ross-spencer ... it's looking good!

There's a chance that some of the current integrations with sf are relying on the number/order of fields (e.g. if they are using csv output they may expect certain elements in certain columns). I think adding this new "classification" field should become the new default but I wonder whether adding a flag to roy to build a signature file without the field (e.g. "-noclass" flag) might make sense as a fallback in case any integrations get broken? This could also be how signatures built with droid xml files only get handled.

To do this you'd probably need to add a new bool field in the PRONOM identifier to indicate whether or not classifications are used & then make the output functions check that field to give the two possible forms of output. What do you think?

// the test data directory.
func setMinimalParams() {
config.SetDroid("DROID_minimal.xml")
config.SetPRONOMReportsDir("pronom_minimal")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest could omit the "DROID_minimal.xml" and pronom_minimal/* dependencies if you add a config.SetLimit("fmt/1", "fmt/11", "fmt/14", "fmt/3", fmt/5" )

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call like config.SetLimit("fmt/1", "fmt/11", "fmt/14", "fmt/3", fmt/5")() to set the limit immediately (needed if you are using NewPronom() rather than pronom.New(opts...)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, rather than looking at the new files as dependencies per se, the intention here was to make the tests less sensitive to changes in the signature file, and only sensitive to changes in the code. So there's a bit of overhead with this PR to achieve that, but in future, DROID_SignatureFile_v109.xml can be upgraded for the tests that use it, without having to fixup and breakages in these newly introduced tests, e.g. if more format classifications are added. I believe it can lead to more flexibility in how test cases are added in future.

What do you think? I can revert back to the current approach if it doesn't work? Another way to do this, to avoid creating more fixtures here, is to see the concept used for creating the skeleton files in this PR through to the end and maybe output the signature files to a temporary location on the filesystem too?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point but there probably won't be too much churn in the tests: you chose pretty stable formats, none of them has been updated since 2013! If there are occasional failures due to PRONOM updates it might also be interesting to know?

@@ -202,8 +203,17 @@ func TextPuid() string {
// SetDroid sets the name and/or location of the DROID signature file.
// I.e. can provide a full path or a filename relative to the HOME directory.
func SetDroid(d string) func() private {
pronom.droid = d
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pronom.droid should be set within the closure not outside it. This means that the option will only take effect when the identifer is made. Making this change will mean that the option gets immediately set (changing behaviour of code in cmd/roy). Generally if you need config options like this to take immediate effect you can do so by invoking like this: SetDroid("x")() [that second parens calls the returned option func immediately]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! That explains why it wasn't working - will give this a try.


// SetPRONOMReportsDir sets the PRONOM reports directory, used to
// generate a PRONOM identifier from the XML data retrieved from PRONOM.
func SetPRONOMReportsDir(r string) func() private {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest this function not needed if config.SetLimit() used (as described in another comment)... could probably drop this addition

@richardlehane
Copy link
Owner

merged into develop branch with this commit: 98516b1

@ross-spencer
Copy link
Collaborator Author

ross-spencer commented Mar 20, 2023

@richardlehane I think that squish probably lost the attribution to me unfortunately. I'm not 100% sure. The GitHub UI is a bit flaky on this. You probably wanted something like this Co-authored-by: https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors#required-co-author-information I noticed the fixup work. It's not that I didn't have time to do this, I just didn't have all the information about what needed changing, and was holding off until I heard back a little bit more from PRONOM, i.e. seeking closer alignment with PRONOM/DROID changes. Still. Good to have this in.

@richardlehane richardlehane deleted the dev/add-pronom-type branch March 20, 2023 20:30
@richardlehane
Copy link
Owner

hey @ross-spencer, sorry I screwed up when doing that merge, should just not have squashed it. I've fixed the authorship now on the develop branch, unfortunately it looks like merging back into main will be painful. I've been merging in as much as possible as I'd like to cut the new release soon (maybe this weekend?) but if you need a bit more time for this one let me know

@ross-spencer
Copy link
Collaborator Author

Thanks @richardlehane. Squishing was a good instinct. With a bit of a heads up, I can help with anything like that in future. I was anticipating rebase/merge myself, but it's a bit of a (necessary) dance. Some of my commits were out of sequence (a price to pay for not having multiple PRs which I have found more often than not a greater headache than git-fu). I don't sense any particular concerns from the PRONOM team about releasing these changes into the world with Siegfried, that's more on me, so I don't think there's a need to hold back. The DROID changes are being investigated with no concrete promises, but it will be exciting to have that change to the DROID sig file in time.

@richardlehane
Copy link
Owner

ok so, in rectifying the authorship history, I completely messed up the develop branch... it needs to be sacrificed now to the gods of git. In my defence the stack overflow answer I was following did have 700+ points.

So... I've made a new "release" branch (https://github.com/richardlehane/siegfried/tree/release) and cherry picked the intervening commits from develop into it. This does change history a bit (dates are gone and it makes me committer for everything) but authors are fixed, history is linear, and it can be merged back into main without git screaming at me. Pls work off release for now if you've got any other things in train. I'll hose develop and after the 1.10 release will start a new develop branch.

Next time I'll take you up on that git consultancy offer, or can we just move to fossil

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants