This repository has been archived by the owner on Mar 18, 2020. It is now read-only.

Enforce created and modified dates for records #92

Closed
mcritchlow opened this issue Jul 8, 2019 · 15 comments · Fixed by #93

Comments

@mcritchlow
Member

We have a few direct use cases for the created and modified events that damsrepo should be persisting.

However, in discussion w/ @lsitu it seems we are not enforcing that these dates be persisted. This may be for a variety of reasons, including:

  • historical ingestion practices that pre-date dams4
  • historical ingestion/editing options in dams4 that can accidentally wipe out previous event history

While we cannot "fix" the past, hopefully we can at least start tracking these events properly going forward.

Note: a separate ticket probably needs to be created to properly integrate these dates into damspas

@lsitu
Member

lsitu commented Jul 8, 2019

@mcritchlow I think we already persist events in damsrepo for record creation and record edits, though we may lose them over time if we aren't careful. However, we haven't indexed or used those event dates in Solr for display in the damspas UI yet. So this seems more like a damspas ticket than a damsrepo ticket. What do you think?

@mcritchlow
Member Author

mcritchlow commented Jul 8, 2019

@lsitu - I'd love for that to be the case :) My concern though, after some cursory testing of a few random records, is that I'm very rarely seeing either a created or modified date. It might just be that the random set of records I chose fell in the "lost" category.

I think the answer/outcome I'd like to see signed off on this ticket is: "any new or modified records from this point forward will ALWAYS have a created and modified event date associated with them, and those cannot be lost due to (for example) a manual RDF edit"

If we can already say that, then yes we can close this ticket out and instead focus on getting those dates in the Solr records for damspas. Which would be great!

We also need to think of what our fallback strategy is for all the records that have nothing (this isn't necessarily aimed at you, we may need input from @gamontoya / @arwenhutt )

  • Should we go through an exercise of trying to "guess" at the created dates and assign them?
  • If there is no modified date, should we apply the created date?
  • other?

@lsitu
Member

lsitu commented Jul 8, 2019

@mcritchlow I think if you search for the predicate dams:event, you should find at least one such predicate. But we need the /export option to see the event detail. For example, http://lib-hydratail-prod.ucsd.edu:8080/dams/api/objects/bb0446443f/export.
If you see any records that don't have any events in the object metadata, just send me the example so that I can take a look. Thank you.

@mcritchlow
Member Author

@lsitu - Oh, I was looking directly at the /api/events route. The /export output actually does show events for records I wasn't seeing in the /api/events route we were chatting about the other day. This seems like great news, and perhaps we can indeed move towards figuring out how to get these dates indexed 👍

@mcritchlow
Member Author

mcritchlow commented Jul 9, 2019

@lsitu - here are some thoughts/findings after poking around at this for a few minutes today. I'm hoping you might have some historical perspective with the damspas code to help connect a few dots for me.

A few things I've noticed:

  • We have a DamsEvent model and datastream set up in damspas, along with a to_solr method, etc.
  • I cannot find good examples of these events actually in the Solr index, though perhaps I'm querying for them incorrectly
  • I don't see the DamsEvent instances integrated in any way with the DamsObject to_solr indexing process

So it seems like we might have most of the pieces in place for this, but that they're not actually talking to each other properly such that we could easily use them (with a bit of sorting logic) for things like created and last-modified for objects/collections. I pause to ask you about the above because I see the model/datastream/tests around DamsEvent, but I don't really see them being used. So I'm wondering if I'm missing something. Curious to hear your thoughts.
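To make the "bit of sorting logic" concrete, here is a rough sketch of what folding event dates into an object's Solr document might look like. Everything in it is an assumption for illustration: SketchEvent is a hypothetical stand-in for DamsEvent, and the event_first_dtsi / event_last_dtsi field names are made up, not fields that exist in damspas today.

```ruby
require 'time'

# Hypothetical stand-in for a DamsEvent: a type label plus a timestamp.
SketchEvent = Struct.new(:type, :date)

# Merge the earliest and latest event dates into an existing Solr
# document hash. The field names are illustrative placeholders only.
def add_event_dates(solr_doc, events)
  dates = events.map(&:date).sort
  return solr_doc if dates.empty? # no events: leave the document untouched

  solr_doc.merge(
    'event_first_dtsi' => dates.first.getutc.xmlschema,
    'event_last_dtsi'  => dates.last.getutc.xmlschema
  )
end
```

The point is only that the object's to_solr step would need to look at its events at all; how the events are actually loaded in damspas is left open here.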

@lsitu
Member

lsitu commented Jul 9, 2019

@mcritchlow As far as I know, you are right that we haven't used the dams:Event model in damspas at all. So indexing it in Solr with the to_solr method would be something new.

@mcritchlow
Member Author

@lsitu - ok thanks, that seems to be the case as far as i could tell. I was just a bit confused that we'd written all that code, but as far as i could tell weren't using it.

@hweng and @VivianChu - do you have any thoughts on this? I noticed (not that you'll remember :) ) @VivianChu that it looks like you added the first damsevent commit back in the day ucsdlib/damspas@8536a2e

@hweng
Contributor

hweng commented Jul 10, 2019

@mcritchlow From the Solr index, for example bb90797993, I found the following dates, which are related to created and modified:

system_create_dtsi: "2013-07-01T04:49:07Z",
system_modified_dtsi: "2013-07-01T04:49:07Z",
object_create_dtsi: "1985-01-01T00:00:00Z",
timestamp: "2019-07-09T23:22:40.545Z"

I think the timestamp should be the latest modified date, as I edited this record yesterday.

But as you mentioned, there might be a missing connection in the existing DamsEvent code that needs to be fixed.

@mcritchlow
Member Author

mcritchlow commented Jul 10, 2019

@hweng - Yeah, i admit that after looking at a variety of examples, I'm suspicious of those dates to be honest, at least in the context of what RDCP is hoping to use created/modified dates for. What they want is only the actual created and modified dates for the objects/collections themselves, whereas these dates are more directly related to the solr document, which could be regenerated for any number of reasons. I think some of those dates are supposed to more directly correspond to the repository dates, but they don't in our case, I'm guessing because we're not actually using Fedora 3 so we don't have a full API implementation.

That's why the focus on DamsEvent, and wanting to ideally get the actual repository dates included in the Solr documents, so we can rely on and target those.

It's kind of comical in a way, this is literally the most trivial set of dates to get on a database record in a "normal" application. Not so with our system :)

@mcritchlow
Member Author

mcritchlow commented Jul 10, 2019

@lsitu - hmm, so here's something interesting related to the comment from @hweng above and my note that those dates should be useful:

Here is where it seems to me that an aggregation of date event processing happens, which seems to prefer just grabbing the most recent Event date available (with a static fallback date that seems like it should instead be the current date?):

https://github.com/ucsdlib/damsrepo/blob/master/src/xsl/fedora-object-profile.xsl#L13-L29

And then that timestamp date is applied to BOTH created and modified properties:

https://github.com/ucsdlib/damsrepo/blob/master/src/xsl/fedora-object-profile.xsl#L51-L52

So, it seems like if perhaps we reworked this logic to:

  1. only use the created event date for the objCreateDate
  2. have logic for what is currently timestamp remain as-is for objLastModDate

Then we might be heading in a better direction? Thoughts?

Edit:
And because I was curious, here's where those are mapped in the Hydra code:

[conan@lib-hydrahead-qa 2.3.0]$ grep -r "objCreateDate" .
./gems/active-fedora-6.7.8/lib/active_fedora/base.rb:        @inner_object.profile['objCreateDate']
./gems/active-fedora-6.7.8/lib/active_fedora/solr_digital_object.rb:      @profile['objCreateDate'] ||= Time.now.xmlschema
./gems/active-fedora-6.7.8/lib/active_fedora/solr_digital_object.rb:      @profile['objLastModDate'] ||= @profile['objCreateDate']
./gems/rubydora-1.7.5/lib/rubydora/digital_object.rb:    OBJ_ATTRIBUTES = {:state => :objState, :ownerId => :objOwnerId, :label => :objLabel, :logMessage => nil, :lastModifiedDate => :objLastModDate, :createdDate => :objCreateDate }
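
As a rough illustration of what the two `||=` lines in solr_digital_object.rb imply, here is a minimal sketch that treats the profile as a plain Hash. The fill_profile_dates helper and its `now:` parameter are hypothetical, added only so the behavior can be shown in isolation; this is not the active-fedora code itself.

```ruby
require 'time'

# Mimic the fallbacks seen in the grep output on a plain Hash:
# a missing create date becomes "now", and a missing modified date
# becomes whatever the create date ended up being.
def fill_profile_dates(profile, now: Time.now)
  filled = profile.dup
  filled['objCreateDate']  ||= now.xmlschema
  filled['objLastModDate'] ||= filled['objCreateDate']
  filled
end
```

If the profile arrives without either value, both dates collapse to the time of the call, so they would say nothing about the object's real history.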

@lsitu
Member

lsitu commented Jul 10, 2019

@mcritchlow That sounds good. Maybe we can adjust the logic a little from what you suggested above, keeping the order of the event history in mind:

  1. only use the created event date for the objCreateDate if present; otherwise use the earliest modified date
  2. for objLastModDate, keep the logic for what is currently timestamp and use the latest edited event if present; otherwise use the created event

This could be more accurate. What do you think?
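
The two rules above could be sketched roughly like this. It is a Ruby illustration only: the real logic lives in the XSL in damsrepo, and the DateEvent struct, the :created / :modified labels, and the choice of the earliest created event when there are several are all assumptions made for the example.

```ruby
require 'time'

# Hypothetical event record: type is :created or :modified.
DateEvent = Struct.new(:type, :date)

# Pick objCreateDate and objLastModDate from an object's event history.
def object_dates(events)
  created  = events.select { |e| e.type == :created }.map(&:date)
  modified = events.select { |e| e.type == :modified }.map(&:date)

  # Rule 1: created event if present, otherwise the earliest modified date.
  create_date = created.min || modified.min
  # Rule 2: latest modified event if present, otherwise the created event.
  last_mod_date = modified.max || created.max

  {
    'objCreateDate'  => create_date&.xmlschema,
    'objLastModDate' => last_mod_date&.xmlschema
  }
end
```

With no events at all, both values come back nil, which leaves the question of a static or current-date fallback open.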

@mcritchlow
Member Author

@lsitu - I think that sounds great. Your fallback idea for objCreateDate using the oldest(earliest) modified date seems reasonable in the situation that we've somehow lost the create event date.

If you're up for making those adjustments at your convenience, I can start getting a new damspas branch ready so we could test it out w/ Ho Jung and others on QA.

I suspect one implication here is to really be confident of the results, we may want to:

  1. sync data from prod to QA
  2. reindex everything in Solr
  3. Then run our tests

Thoughts?

@lsitu
Member

lsitu commented Jul 10, 2019

@mcritchlow It sounds great! I'll work on the adjustments in damsrepo shortly. Thanks.

@mcritchlow
Member Author

@lsitu - Ok, I have a branch ready for us to test against whenever we're ready ucsdlib/damspas#681

This will leverage both create and modified, so it should give us good coverage on whether these changes are successful.

@lsitu
Member

lsitu commented Jul 10, 2019

@mcritchlow I've created PR #93 for it. It's ready for review now.


3 participants