last_harvest.txt datestamp incorrectly updated when OAI-PMH response is not chronological #7
Comments
The logic at play here seems to be inside Harvester::getRecords(). A most-recent datestamp is recorded for each group of records fetched, and when the last group is fetched, whatever datestamp came back from that group is injected into last_harvest.txt via the state manager. I assume there is a good reason to get the datestamp values directly from the API responses, but I also wonder why we would not simply set an explicit harvest end date as part of the harvest launch, and then write this value to last_harvest.txt as soon as we can detect that the harvest completed. That would short-circuit this issue entirely and potentially simplify the date and state tracking business logic a bit. I'm working on a quick PR to see if this is a workable option.
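To illustrate the failure mode, here is a minimal sketch (in Python, purely for illustration; the actual Harvester is PHP, and all names and values here are hypothetical) of why persisting the newest datestamp from only the *final* batch can move the cursor backwards when a server returns records out of chronological order:

```python
# Batches as returned by a server that does NOT guarantee chronological
# order (which the OAI-PMH spec permits). The newest record happens to
# arrive in the first batch, not the last one.
batches = [
    ["2022-06-10T00:00:00Z", "2022-08-13T15:00:00Z"],  # newest record is here
    ["2022-07-01T00:00:00Z", "2022-06-15T18:34:45Z"],  # final batch
]

# Current (problematic) logic: remember the newest datestamp per batch,
# then persist whatever the final batch produced.
last_seen = None
for batch in batches:
    last_seen = max(batch)  # newest datestamp in this batch only
persisted = last_seen       # value written to last_harvest.txt

# What the harvest actually saw: the maximum across every batch.
overall_max = max(d for batch in batches for d in batch)

print(persisted)    # 2022-07-01T00:00:00Z (precedes the true maximum)
print(overall_max)  # 2022-08-13T15:00:00Z
```

Because ISO 8601 UTC timestamps sort lexicographically, `max()` on the strings is equivalent to comparing the dates; the sketch shows the persisted cursor landing earlier than the newest harvested record, which is exactly what triggers re-harvesting.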
PR Opened: #8
@ryan-jacobs, thanks for catching this issue and for sharing a possible solution. I believe that the reason for the current implementation was to avoid any possibility of missing records due to inconsistencies between time on the local machine and time on the OAI server -- e.g. if time zones are interpreted differently in different places, etc. The idea was that if we used a date that came from the OAI server as the end, it was more reliable than using a date that came from the client, and if some records got harvested more than once, that was mostly harmless, as opposed to missing records entirely, which would be a significant problem. Obviously, the possibility of an infinite harvest loop had not occurred to anybody! I spoke with @EreMaijala, who develops the RecordManager tool, to see how he approaches this, and I think his solution seems like it accounts for both VuFind's original concerns and the problem you discovered. Here's what he said:
How would you feel about revising #8 so that it pulls the end date from the Identify response instead of the local clock? I think that would be pretty similar to what you have already done, but it would ensure that we use times provided by the server, avoiding the potential inconsistency/gap problem. I'm happy to help (time permitting) if you need me to, and once again, thanks for taking the time to investigate and report this problem!
If I may add one thing (and here I'm not quite certain how things currently work or what #8 suggests): it's important to use the timestamp of the beginning of the last harvesting run as the start date of the next one. Some repositories implicitly convert a request like: Sorry if I made a mess out of trying to convey my thoughts above! To sum it up, I'd take the server date from Identify before starting harvesting, store it only if harvesting completes successfully, and use it next time for the start date.
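The flow described above can be sketched as follows (an illustrative Python sketch, not VuFind's actual PHP API; all function and key names are hypothetical): capture the server's clock before the run, and only advance the persisted cursor if the run completes.

```python
# Sketch: capture the server time (e.g. the responseDate from an
# Identify request) BEFORE harvesting starts, persist it only on
# success, and use it as the start date of the next run.
def run_harvest(fetch_server_time, harvest, state):
    start_stamp = fetch_server_time()    # server clock, not local clock
    harvest(state.get("last_harvest"))   # from = previous successful start
    # Only reached if harvest() did not raise: safe to advance the cursor.
    state["last_harvest"] = start_stamp

state = {"last_harvest": "2022-06-01T01:01:10Z"}
run_harvest(lambda: "2022-08-15T12:00:00Z",   # pretend Identify responseDate
            lambda from_: None,               # pretend harvest succeeds
            state)
print(state["last_harvest"])  # 2022-08-15T12:00:00Z
```

Because the timestamp is taken before the first record is fetched, any record changed while the harvest is running falls at or after the next run's start date, so nothing is missed; a failed run leaves the cursor untouched and the next run simply re-covers the same range.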
Thanks, @EreMaijala, that makes sense to me. Bottom line: I think the work #8 is currently doing to persist the harvest end date is a useful addition, but the logic of how that end date is initially calculated, and the way it is used as an until parameter, should be changed based on the conversation above.
Great, thanks both for your quick feedback!
That makes sense. I've dealt with this problem before with older Salesforce APIs, where we needed to err on the side of adding artificial overlap between request ranges based on this same thinking. I think we've just uncovered some cases here in vufindharvest where these protective overlaps can grow so large that they become a problem for other reasons.
I assume this essentially means we need to account for the time it takes for the harvest to actually run, so that we never miss any changes that happen during the harvest. A great point. As long as the "until" datestamp exactly matches, or ever-so-slightly precedes, the first getRecords() call I think we would be covered there -- if also controlling for timezones and clock skew of course.
That sounds like a great idea. I see that the Identify request is currently run only as part of loadGranularity(). Unless there's a compelling reason to keep that as a separate method, it seems like the loadGranularity() logic could be merged into storeDateSettings(). That way all the date initialization details would be in one place, along with a single Identify request. This would not impact any public methods. I'll have another pass at this shortly.
@ryan-jacobs, I think @EreMaijala's suggestion is that if you capture the current timestamp of the OAI server through its Identify response as the first step of your harvest process, then save that as the last harvest date, you'll be absolutely sure not to miss any records on your next harvest, regardless of issues of clock skew, timezones, variant behavior in how resumption tokens are handled, etc. While I suppose you could use that value as the "until," it might be safer not to, just in case it imposes limits that hit edge cases. It might allow some redundant harvesting to occur (if records are added to the system while the harvest is taking place, and the server includes them in its response), but I think that's likely to be a rare case, and the important thing is that multiple iterative harvests would not end up getting the same records over and over using this logic. Regarding the Identify request in loadGranularity, maybe the best thing to do is to call Identify at the start of the process and store the response in a property, and then both loadGranularity and other methods that depend on Identify details can fetch them from there. That way you only call the API once, but you don't have to do all of the unrelated processing in the same long piece of code. Does that make any sense?
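The caching idea at the end of that comment might look something like this (an illustrative Python sketch; the real Harvester is PHP, and the class, method, and key names here are hypothetical):

```python
# Sketch: request Identify once, cache the response in a property, and
# let both granularity and date logic read from the cache.
class Harvester:
    def __init__(self, fetch_identify):
        self._fetch_identify = fetch_identify
        self._identify = None   # cached Identify response
        self.calls = 0          # counts actual API requests (for the demo)

    def _get_identify(self):
        # Lazily fetch and memoize the Identify response.
        if self._identify is None:
            self._identify = self._fetch_identify()
            self.calls += 1
        return self._identify

    def load_granularity(self):
        return self._get_identify()["granularity"]

    def server_date(self):
        return self._get_identify()["responseDate"]

h = Harvester(lambda: {"granularity": "YYYY-MM-DDThh:mm:ssZ",
                       "responseDate": "2022-08-15T12:00:00Z"})
h.load_granularity()
h.server_date()
print(h.calls)  # 1 -- Identify was requested only once
```

Both callers get what they need, but the network request happens exactly once, and neither method has to know about the other's concerns.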
…Deprecate Harvester::loadGranularity() to merge all Identify requests into Harvester::storeDateSettings(). Supports vufind-org#7.
Hi @demiankatz, I think we are very aligned in concept, but there may be a couple of implementation details still to confirm together. I've pushed another commit which may help communicate some of those details.
Indeed. I think we're on the same page here, and the main difference with the newest commit is using the OAI host time instead of the local time when getting that "current timestamp" value.
This is the part I've been ruminating on a bit. This is really a question of a default explicit harvest range vs an implicit range. I'm of course happy to go either direction in this PR, but the most recent commit still embraces the default explicit range approach for the following reasons:
All this said, I'm a little new to this OAI standard and will defer to you and @EreMaijala on the final direction if you see flaws in any of this reasoning... I always enjoy these discussions 😄 I know the tests are not yet passing, but the failures I saw seemed to be explained by some signature updates and other small details. I'll take a crack at those tests once we land on the default explicit/implicit "until" date detail.
@ryan-jacobs Your reasoning makes sense, but unfortunately the reality is... less optimal. There are, for instance, servers that only support date granularity but don't handle There is also the argument to be made that with VuFind there's no reason not to let more recent records in if they happen to be available. If, for instance, you harvest new records once an hour, it's only a bonus if you happen to get records that were added or updated while the harvesting is in progress. This means that the search index is regularly more up to date, and at least with us the library staff want as frequent updates as possible [1]. This is more pronounced if the harvesting interval is longer or the server is slow.
[1] We typically harvest library catalogs every half an hour, but we offer the possibility of harvesting as often as every minute, provided the previous run has completed. This results in us harvesting and indexing north of 10 million records daily, but also allows near-real-time indexing of availability information that we get for some libraries, so that the user can select an "Available in library" checkbox facet. And the sooner we can get the updates in, the better.
Thanks @EreMaijala. It's very interesting to hear about the variety of host-side implementations and their various problems. I'm also convinced now that the beneficial (or at least welcome) behavior of some OAI hosts, which include updates that happen during the harvest stream itself, is worth embracing. This seems especially useful for very large/long harvest runs (we have data management cases that require this) or high-frequency harvesting (as in your example). While there are some small tradeoffs, it does sound like this is the way to go. I'm glad that we could walk through, and loosely document, these considerations here. This is valuable. I'll take this approach in the PR.
@ryan-jacobs Thanks! Always appreciate a good discussion and thoughtful comments!
It appears that the datestamp set in last_harvest.txt is based on the most recent record of the last group of records returned. This assumes that records are returned by the API in some form of chronological order based on last update/create time. However, we have learned that there does not appear to be any guarantee that records come back through the API in any specific order, so the most recent record of the last group may not be the most recent record from the harvest. When this happens, last_harvest.txt can be incorrectly set to a datestamp that precedes the correct value, leading to records being repeatedly re-harvested. In extreme cases this can lead to the same groups of records being repeatedly re-harvested without ever moving the last_harvest.txt date forward.
Example
Start a large FOLIO harvest on 2022-08-15T12:00:00Z with last_harvest.txt set to 2022-06-01T01:01:10Z. The last known record update in FOLIO is on 2022-08-13T15:00:00Z.
Expected Outcome
Harvest completes in-full and last_harvest.txt gets set to either 2022-08-13T15:00:00Z or 2022-08-15T12:00:00Z
Observed Outcome
Harvest completes in full, but last_harvest.txt gets set to a seemingly random datestamp in the middle of the harvest range (e.g. 2022-06-15T18:34:45Z). Subsequent incremental harvests repeatedly re-harvest a subset of the records already fetched.
More Info
We have observed this issue (specifically, records coming back out of chronological order) while running VuFind harvests connected to an EBSCO-hosted FOLIO endpoint, but the API spec itself seems to indicate that chronological ordering cannot be assumed (in the Flow Control section):
So this could explain why this issue exists. The logic used to set last_harvest.txt may be based on an incorrect assumption about the order OAI-PMH is returning records.