
last_harvest.txt datestamp incorrectly updated when OAI-PMH response is not chronological #7

Closed
ryan-jacobs opened this issue Aug 22, 2022 · 12 comments

@ryan-jacobs (Contributor)

It appears that the datestamp set in last_harvest.txt is based on the most recent record of the last group of records returned. This assumes that the API returns records in some form of chronological order based on last update/create time. However, we have learned that there is no guarantee that records come back through the API in any specific order, so the most recent record of the last group may not be the most recent record of the whole harvest. When this happens, last_harvest.txt can be incorrectly set to a datestamp that precedes the correct value, leading to records being repeatedly re-harvested. In extreme cases the same groups of records can be re-harvested over and over without the last_harvest.txt date ever moving forward.
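A rough sketch of the failure mode (illustrative Python, not the actual VuFind code): if pages of an OAI-PMH harvest are not chronologically ordered, deriving the cursor from only the last page can move it backwards.

```python
# Hypothetical sketch of the bug described above: the newest datestamp is
# taken from only the *last* page of results, which misbehaves when the
# server does not return records in chronological order.
pages = [
    ["2022-06-10T08:00:00Z", "2022-08-13T15:00:00Z"],  # newest record here
    ["2022-07-01T09:30:00Z"],
    ["2022-06-15T18:34:45Z"],  # final page happens to hold older records
]

# Buggy cursor, as described in this issue: derived from the last page only.
# (ISO 8601 UTC strings compare correctly as plain strings.)
buggy_cursor = max(pages[-1])

# A correct cursor would consider every record in the harvest.
true_cursor = max(d for page in pages for d in page)

print(buggy_cursor)  # 2022-06-15T18:34:45Z -- earlier than true_cursor
print(true_cursor)   # 2022-08-13T15:00:00Z
```

Because the buggy cursor precedes the true maximum, the next incremental harvest re-fetches everything updated between the two values.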

Example

Start a large FOLIO harvest on 2022-08-15T12:00:00Z with last_harvest.txt set to 2022-06-01T01:01:10Z. The last known record update in FOLIO is on 2022-08-13T15:00:00Z.

Expected Outcome

Harvest completes in-full and last_harvest.txt gets set to either 2022-08-13T15:00:00Z or 2022-08-15T12:00:00Z

Observed Outcome

Harvest completes in full, but last_harvest.txt gets set to a seemingly random datestamp in the middle of the harvest range (e.g. 2022-06-15T18:34:45Z). Subsequent incremental harvests repeatedly re-harvest a subset of the records already fetched.

More Info

We have observed this issue (specifically, records coming back out of chronological order) while running VuFind harvests against an EBSCO-hosted FOLIO endpoint, but the OAI-PMH spec itself indicates that chronological ordering cannot be assumed (in the Flow Control section):

The protocol does not define the semantics of incompleteness. Therefore, a harvester should not assume that the members in an incomplete list conform to some selection criteria (e.g., date ordering).

So this could explain why the issue exists: the logic used to set last_harvest.txt may rest on an incorrect assumption about the order in which OAI-PMH returns records.

@ryan-jacobs (Contributor, Author)

ryan-jacobs commented Aug 22, 2022

The logic at play here seems to be inside Harvester::getRecords(). A most-recent datestamp is recorded for each group of records fetched, and when the last group arrives, whatever datestamp came back from that group is written to last_harvest.txt via the state manager.

I assume there is a good reason to take the datestamp values directly from the API responses, but I also wonder why we would not simply set an explicit harvest end date as part of the harvest launch, and then write that value to last_harvest.txt as soon as we detect that the harvest completed. That would entirely short-circuit this issue and potentially simplify the date and state tracking logic a bit.
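A minimal sketch of this idea, using hypothetical names rather than the real Harvester API: fix an explicit end date at harvest launch and persist it only once the harvest has completed.

```python
# Illustrative only -- FakeOaiClient, run_harvest, and the state dict are
# hypothetical stand-ins, not the vufindharvest implementation.

class FakeOaiClient:
    """Stand-in for the OAI-PMH client, recording the request it receives."""
    def __init__(self):
        self.params = None

    def harvest(self, params):  # would raise on a failed/aborted harvest
        self.params = params

def run_harvest(client, state, now):
    end_date = now                            # fixed at harvest launch
    client.harvest({"verb": "ListRecords",
                    "from": state["last_harvest"],
                    "until": end_date})
    state["last_harvest"] = end_date          # written only after completion
    return state

client = FakeOaiClient()
state = run_harvest(client, {"last_harvest": "2022-06-01T01:01:10Z"},
                    now="2022-08-15T12:00:00Z")
print(state["last_harvest"])  # 2022-08-15T12:00:00Z
```

Because the stored cursor is the requested range's end rather than a record datestamp, it can never be dragged backwards by out-of-order responses.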

I'm working on a quick PR to see if this is a workable option.

@ryan-jacobs (Contributor, Author)

PR Opened: #8

@demiankatz (Member)

@ryan-jacobs, thanks for catching this issue and for sharing a possible solution.

I believe the reason for the current implementation was to avoid any possibility of missing records due to inconsistencies between time on the local machine and time on the OAI server -- e.g. if time zones are interpreted differently in different places. The idea was that a date taken from the OAI server was a more reliable end point than a date taken from the client, and that if some records got harvested more than once, that was mostly harmless, whereas missing records entirely would be a significant problem. Obviously, the possibility of an infinite harvest loop had not occurred to anybody!

I spoke with @EreMaijala, who develops the RecordManager tool, to see how he approaches this, and I think his solution seems like it accounts for both VuFind's original concerns and the problem you discovered. Here's what he said:

RecordManager doesn’t check record timestamps for that. It retrieves the server’s current time with an Identify request when starting harvesting and saves it as the next start date after successful completion. This avoids a number of issues, including:

  • Clock skew between provider and harvester
  • Records not having correct timestamps in the OAI-PMH header
  • Time zone handling issues

How would you feel about revising #8 so that it pulls the end date from the Identify response instead of the local clock? I think that would be pretty similar to what you have already done, but it would ensure that we use times provided by the server, avoiding the potential inconsistency/gap problem.

I'm happy to help (time permitting) if you need me to, and once again, thanks for taking the time to investigate and report this problem!

@EreMaijala (Contributor)

If I may add one thing (and here I'm not quite certain how things currently work or what #8 suggests): it's important to use the timestamp of the beginning of the last harvesting run as the start date of the next one.

Some repositories implicitly convert a request like:

    verb=ListRecords&from=2022-08-22T10:00:00Z&...

to

    verb=ListRecords&from=2022-08-22T10:00:00Z&until=2022-08-23T17:30:00Z&...

(with the current timestamp used as the end date), or they use a record set that can't change dynamically. Either way, you won't get anything newer than what was available when you started harvesting, nor any changes made to the records during harvesting. So you need to start the next run from the previous run's start timestamp to ensure you get everything. This also means that the end date is pretty much meaningless, and unless you can somehow ensure it is in sync with the OAI-PMH provider's (server's) time, using the until parameter risks losing records that fall between the previously supplied until date and the real from date.

Sorry if I made a mess of trying to convey my thoughts above! To sum up: I'd take the server date from Identify before starting harvesting, store it only if harvesting completes successfully, and use it next time for the from parameter. I'd never use the until parameter with a normal harvest.
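The flow described above could be sketched like this (hypothetical names, not the actual RecordManager or VuFind code):

```python
# Sketch of the recommended flow: capture server time via Identify
# *before* harvesting, commit it only if the run succeeds, use it as
# the next `from`, and never send `until` on a normal incremental harvest.

class FakeOaiServer:
    """Stand-in OAI-PMH endpoint used only to illustrate the flow."""
    def __init__(self, server_now):
        self.server_now = server_now
        self.last_params = None

    def identify_response_date(self):  # server's clock, not the client's
        return self.server_now

    def list_records(self, params):   # would raise on an aborted run
        self.last_params = params

def incremental_harvest(server, state):
    next_start = server.identify_response_date()  # taken before harvesting
    params = {"verb": "ListRecords"}
    if state.get("last_harvest"):
        params["from"] = state["last_harvest"]    # previous run's start time
    server.list_records(params)                   # note: no `until` at all
    state["last_harvest"] = next_start            # committed only on success
    return state

server = FakeOaiServer("2022-08-25T10:00:00Z")
state = incremental_harvest(server, {"last_harvest": "2022-08-22T10:00:00Z"})
print(state["last_harvest"])          # 2022-08-25T10:00:00Z
print("until" in server.last_params)  # False
```

If the run fails, list_records raises before the state is updated, so the next run simply re-covers the same window from the old start date, which trades some redundant harvesting for a guarantee of no gaps.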

@demiankatz (Member)

Thanks, @EreMaijala, that makes sense to me. Bottom line: I think the work #8 is currently doing to persist the harvest end date is a useful addition, but the logic of how that end date is initially calculated, and the way it is used as an until parameter, should be changed based on the conversation above.

@ryan-jacobs (Contributor, Author)

ryan-jacobs commented Aug 23, 2022

Great, thanks both for your quick feedback!

(demiankatz) The idea was that if we used a date that came from the OAI server as the end, it was more reliable than using a date that came from the client, and if some records got harvested more than once, that was mostly harmless, as opposed to missing records entirely, which would be a significant problem

That makes sense. I've dealt with this problem before with older Salesforce APIs, where we needed to err on the side of adding artificial overlap between request ranges based on this same thinking. I think we've just uncovered some cases here in vufindharvest where those protective overlaps can grow so large that they become a problem in their own right.

(EreMaijala) it's important to use the timestamp of the beginning of the last harvesting run for the start date of the next one.

I assume this essentially means we need to account for the time it takes the harvest to actually run, so that we never miss changes that happen during the harvest. A great point. As long as the "until" datestamp exactly matches, or ever-so-slightly precedes, the first getRecords() call, I think we're covered there -- if we also control for time zones and clock skew, of course.

(demiankatz) How would you feel about revising #8 so that it pulls the end date from the Identify response instead of the local clock?

That sounds like a great idea. I see that the Identify request is currently made only as part of loadGranularity(). Unless there's a compelling reason to keep that as a separate method, it seems the loadGranularity() logic could be merged into storeDateSettings(). That way all the date initialization details would live in one place, along with a single Identify request. This would not impact any public methods.

I'll have another pass at this shortly.

@demiankatz (Member)

@ryan-jacobs, I think @EreMaijala's suggestion is that if you capture the current timestamp of the OAI server through its Identify response as the first step of the harvest process, then save that as the last harvest date, you'll be absolutely sure not to miss any records on your next harvest, regardless of clock skew, time zones, variant behavior in how resumption tokens are handled, etc. While I suppose you could use that value as the "until," it might be safer not to, just in case it imposes limits that hit edge cases. It might allow some redundant harvesting (if records are added to the system while the harvest is taking place and the server includes them in its response), but that's likely to be a rare case, and the important thing is that iterative harvests would not end up fetching the same records over and over with this logic.

Regarding the Identify request in loadGranularity, maybe the best thing to do is to call Identify at the start of the process and store the response in a property; then loadGranularity and any other methods that depend on Identify details can fetch them from there. That way the API is called only once, but all of the unrelated processing doesn't have to live in one long piece of code.

Does that make any sense?
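The caching refactor described above could look roughly like this (hypothetical names, not the actual vufindharvest code):

```python
# Illustrative sketch: issue the Identify request once, cache the parsed
# response, and let both the granularity lookup and the date
# initialisation read from the cache.

class FakeClient:
    """Stand-in OAI-PMH client that counts how many Identify calls it gets."""
    calls = 0

    def identify(self):
        FakeClient.calls += 1
        return {"granularity": "YYYY-MM-DDThh:mm:ssZ",
                "responseDate": "2022-08-24T12:00:00Z"}

class Harvester:
    def __init__(self, client):
        self.client = client
        self._identify = None                # cached Identify response

    def get_identify_response(self):
        if self._identify is None:
            self._identify = self.client.identify()  # single API call
        return self._identify

    def load_granularity(self):
        return self.get_identify_response()["granularity"]

    def store_date_settings(self):
        # Server time comes from the same cached response; no second request.
        return self.get_identify_response()["responseDate"]

h = Harvester(FakeClient())
h.load_granularity()
h.store_date_settings()
print(FakeClient.calls)  # 1
```

Both consumers share one request, while each keeps its own small, focused method rather than one long initialization routine.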

ryan-jacobs added a commit to ryan-jacobs/vufindharvest that referenced this issue Aug 24, 2022
…Deprecate Harvester::loadGranularity() to merge all Identify requests into Harvester::storeDateSettings(). Supports vufind-org#7.
@ryan-jacobs (Contributor, Author)

Hi @demiankatz,

I think we are very aligned in concept, but there may be a couple implementation details still to confirm together. I've pushed another commit which may help communicate some details.

if you capture the current timestamp of the OAI server through its Identify response as the first step of your harvest process, then save that as the last harvest date, you'll be absolutely sure not to miss any records on your next harvest, regardless of issues of clock skew, timezones, variant behavior about how resumption tokens are handled, etc., etc.

Indeed. I think we're on the same page here, and the main difference with the newest commit is using the OAI host time instead of the local time when getting that "current timestamp" value.

While I suppose you could use that value as the "until," it might be safer not to, just in case it imposes limits that hit edge cases. It might allow some redundant harvesting to occur (if records are added to the system while the harvest is taking place, and the server includes them in its response)

This is the part I've been ruminating on a bit. This is really a question of a default explicit harvest range vs an implicit range. I'm of course happy to go either direction in this PR, but the most recent commit still embraces the default explicit range approach for the following reasons:

  1. More predictable control. Maybe this is a bit too much "belt and suspenders" thinking, but always explicitly passing an "until" value guarantees that the last harvest date we store is always aligned with a specific range requested from the server. No ambiguity. While the OAI-PMH spec clearly labels the "until" param optional, it does not specify exact behavior when it's omitted. As you noted, this could mean some hosts return records that are updated while the harvest stream is active, while others may not. By always setting "until" we get the same predictable result in all cases. Most importantly, as long as we do all calculations based on server time, we still avoid any harvest gaps.
  2. Implementation simplicity. I assume we also want to keep the option of passing in a custom end date via $settings and using it to control the last harvest value when declared. If so, and if we always compute an end date when needed, the same end date property (and its related state tracking controls) can be used universally. Otherwise I think we need to introduce at least one new property (harvest start) and some additional conditionals. Not a big deal, of course, but it could bloat the logic a little.

All this said, I'm a little new to this OAI standard and will defer to you and @EreMaijala on the final direction if you see flaws in any of this reasoning... I always enjoy these discussions 😄

I know the tests are not yet passing, but the failures I saw seemed to be explained by some signature updates and other small details. I'll take a crack at those tests once we land on the default explicit/implicit "until" date detail.

@EreMaijala (Contributor)

EreMaijala commented Aug 25, 2022

@ryan-jacobs Your reasoning makes sense, but unfortunately the reality is... less optimal. There are, for instance, servers that only support date granularity but don't handle until properly, so a request like from=2022-08-25&until=2022-08-25 returns nothing even though there are records updated during the day. Admittedly this is a server-side issue, but still.

There is also an argument that with VuFind there's no reason not to let more recent records in if they happen to be available. If, for instance, you harvest new records once an hour, it's only a bonus to pick up records that were added or updated while the harvest was in progress. The search index stays more up to date, and at least with us the library staff want updates as frequent as possible [1]. This is even more pronounced when the harvesting interval is longer or the server is slow.

[1] We typically harvest library catalogs every half an hour, but we offer the possibility of harvesting up to every minute provided the previous run has completed. This results in us harvesting and indexing north of 10 million records daily, but also allows near-real-time indexing of availability information that we get for some libraries so that the user can select an "Available in library" checkbox facet. And the sooner we can get the updates in, the better.

@ryan-jacobs (Contributor, Author)

Thanks @EreMaijala. It's very interesting to hear about the variety of host-side implementations and their assorted problems. I'm also convinced now that the beneficial (or at least welcome) behavior of some OAI hosts of including updates made while the harvest stream is active is worth embracing. This seems especially useful for very large/long harvest runs (we have data management cases that require this) or high-frequency harvesting (as in your example). While there are some small tradeoffs, this does sound like the way to go.

I'm glad that we could walk through, and loosely document, these considerations here. This is valuable. I'll take this approach in the PR.

@EreMaijala (Contributor)

EreMaijala commented Aug 30, 2022

@ryan-jacobs Thanks! Always appreciate a good discussion and thoughtful comments!

@demiankatz (Member)

With #8 and #10 merged, and v5.0.0 about to be released, I believe it is now safe to close this issue. Thanks again, everyone!
