New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PRONOM types to PRONOM identifier #209
Conversation
Tests are added for the PRONOM types work along with new helper functions for making Siegfried tests more discrete and maintainable.
PRONOM identifier related linting fixes for the different source files touched by the PRONOM types additions.
6051d23
to
e27bb70
Compare
cmd/sf/sf.go
Outdated
@@ -427,7 +427,7 @@ func main() { | |||
case *jsono: | |||
w = writer.JSON(os.Stdout) | |||
case *droido: | |||
if len(s.Fields()) != 1 || len(s.Fields()[0]) != 7 { | |||
if len(s.Fields()) != 1 || len(s.Fields()[0]) != 8 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about something like this for here:
p, _ := pronom.New()
if len(s.Fields()) != 1 || len(s.Fields()[0]) != len(p.Fields()) {
// ...
}
But is that just creating a big overhead in terms of speed (I think all reports are read?) and there's always the potential for error. Is there another way to export the fields as they're constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, loading pronom here would be expensive!
We can avoid writing to disk and make the tests here more portable by reading from an in-memory filesystem. The skeletons themselves are small and so can be easily stored in-line as strings and then turned into byte objects. Given the refactor to in-memory objects, we also take the opportunity to add a file that won't identify with the minimal PRONOM signature file and PRONOM reports. Type should be a nil-string as with many of the other fields.
NB. I was accidentally testing in "production" against this branch today and it looks like type may have crept into the DROID report and so I need to go back to the DROID CSV creation and make sure that doesn't happen, and then likely include a DROID specific test to make sure headers are output correctly. NB. Also, chatted to the PRONOM crew today in the PRONOM weekly. David C checked DROID signature files for compatibility, and they look good. One question is whether the SOAP service that delivers the XML to DROID has a different take on this, but overall it sounds positive this can potentially be added. It is a good time to ask with other PRONOM changes in the works over the course of the year. |
Ensures that the DROID header doesn't change in code unless it is explicitly made to do so.
435c8a8
to
957c2e7
Compare
A small DROID header test has been added here: 957c2e7 To clarify, the "TYPE" field is specific to the DROID CSV, and so, @robin-francois @richardlehane is LoC uses "content categories" to describe this: https://www.loc.gov/preservation/digital/formats/content/content_categories.shtml PRONOM as we know uses "Classification". Wikidata doesn't seem to have the equivalent predicate, it tends towards instance of, and use/used for to describe similar. There may be an equivalent predicate. |
Thanks @ross-spencer, that's a splendid job. Wikipedia seems to use |
Thanks @robin-francois - both good options, perhaps preceded by "format" e.g. "format class" in the one example? Thinking about it I'll add these notes to the discussion page and then post it on digipres.club/Twitter and see if folks have an opinion too. I'm not sure why the discussion works differently from the PR comments - but I think maybe it does? NB. it's also really easy to change these things, so once there's a name we've settled on, I can make those changes and the PR should be good for review. |
Would be great to have this - and useful to have a decent standard set of content type categories (however imperfect they always are) for use in other areas. COPTR has its own that aligns/overlaps pretty well with LOC, albeit with some different titles and a few that aren't on the LOC list. https://coptr.digipres.org/index.php/Content_Types |
thanks for all the work on this @ross-spencer ... it's looking good! There's a chance that some of the current integrations with sf are relying on the number/order of fields (e.g. if they are using csv output they may expect certain elements in certain columns). I think adding this new "classification" field should become the new default but I wonder whether adding a flag to roy to build a signature file without the field (e.g. "-noclass" flag) might make sense as a fallback in case any integrations get broken? This could also be how signatures built with droid xml files only get handled. To do this you'd probably need to add a new bool field in the PRONOM identifier to indicate whether or not classifications are used & then make the output functions check that field to give the two possible forms of output. What do you think? |
pkg/pronom/pronom_test.go
Outdated
// the test data directory. | ||
func setMinimalParams() { | ||
config.SetDroid("DROID_minimal.xml") | ||
config.SetPRONOMReportsDir("pronom_minimal") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest could omit the "DROID_minimal.xml" and pronom_minimal/* dependencies if you add a config.SetLimit("fmt/1", "fmt/11", "fmt/14", "fmt/3", fmt/5" )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call like config.SetLimit("fmt/1", "fmt/11", "fmt/14", "fmt/3", fmt/5")() to set the limit immediately (needed if you are using NewPronom() rather than pronom.New(opts...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, rather than looking at the new files as dependencies per se, the intention here was to make the tests less sensitive to changes in the signature file, and only sensitive to changes in the code. So there's a bit of overhead with this PR to achieve that, but in future, DROID_SignatureFile_v109.xml can be upgraded for the tests that use it, without having to fixup and breakages in these newly introduced tests, e.g. if more format classifications are added. I believe it can lead to more flexibility in how test cases are added in future.
What do you think? I can revert back to the current approach if it doesn't work? Another way to do this, to avoid creating more fixtures here, is to see the concept used for creating the skeleton files in this PR through to the end and maybe output the signature files to a temporary location on the filesystem too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point but there probably won't be too much churn in the tests: you chose pretty stable formats, none of them has been updated since 2013! If there are occasional failures due to PRONOM updates it might also be interesting to know?
pkg/config/pronom.go
Outdated
@@ -202,8 +203,17 @@ func TextPuid() string { | |||
// SetDroid sets the name and/or location of the DROID signature file. | |||
// I.e. can provide a full path or a filename relative to the HOME directory. | |||
func SetDroid(d string) func() private { | |||
pronom.droid = d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pronom.droid should be set within the closure not outside it. This means that the option will only take effect when the identifer is made. Making this change will mean that the option gets immediately set (changing behaviour of code in cmd/roy). Generally if you need config options like this to take immediate effect you can do so by invoking like this: SetDroid("x")() [that second parens calls the returned option func immediately]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! That explains why it wasn't working - will give this a try.
pkg/config/pronom.go
Outdated
|
||
// SetPRONOMReportsDir sets the PRONOM reports directory, used to | ||
// generate a PRONOM identifier from the XML data retrieved from PRONOM. | ||
func SetPRONOMReportsDir(r string) func() private { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest this function not needed if config.SetLimit() used (as described in another comment)... could probably drop this addition
merged into develop branch with this commit: 98516b1 |
@richardlehane I think that squish probably lost the attribution to me unfortunately. I'm not 100% sure. The GitHub UI is a bit flaky on this. You probably wanted something like this |
hey @ross-spencer, sorry I screwed up when doing that merge, should just not have squashed it. I've fixed the authorship now on the develop branch, unfortunately it looks like merging back into main will be painful. I've been merging in as much as possible as I'd like to cut the new release soon (maybe this weekend?) but if you need a bit more time for this one let me know |
Thanks @richardlehane. Squishing was a good instinct. With a bit of a heads up, I can help with anything like that in future. I was anticipating rebase/merge myself, but it's a bit of a (necessary) dance. Some of my commits were out of sequence (a price to pay for not having multiple PRs which I have found more often than not a greater headache than git-fu). I don't sense any particular concerns from the PRONOM team about releasing these changes into the world with Siegfried, that's more on me, so I don't think there's a need to hold back. The DROID changes are being investigated with no concrete promises, but it will be exciting to have that change to the DROID sig file in time. |
ok so, in rectifying the authorship history, I completely messed up the develop branch... it needs to be sacrificed now to the gods of git. In my defence the stack overflow answer I was following did have 700+ points. So... I've made a new "release" branch (https://github.com/richardlehane/siegfried/tree/release) and cherry picked the intervening commits from develop into it. This does change history a bit (dates are gone and it makes me committer for everything) but authors are fixed, history is linear, and it can be merged back into main without git screaming at me. Pls work off release for now if you've got any other things in train. I'll hose develop and after the 1.10 release will start a new develop branch. Next time I'll take you up on that git consultancy offer, or can we just move to fossil |
Exploration adding PRONOM types classification to the Siegfried PRONOM identifier.
Connected to: #207
This results in a new output from Siegfried which looks something as follows:
Note the addition of
type : 'Word Processor'
NB. This will only show a value if the PRONOM identifier is configured with PRONOM reports, i.e. the PRONOM XML export from PRONOM itself. The DROID signature file still needs this information to be added, we believe this is on the way. I can attend the next PRONOM meeting at the beginning of the year to ask more.
Tests have been included as part of this feature. Additionally, source files have had linting changes made to them to pass linting. These are in the third commit associated with the PR and may warrant special attention for accuracy, especially around the correctness of the documentation.