-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixity checks happening more often than anticipated #498
Comments
I'm switching tracks, but I want to make a couple quick notes for myself or the next person to check this for what I've been able to rule out so far.
|
Oh dear. All the assets are getting checked, every night. And it's taking nine hours. I haven't found the bug yet, but it will save us a ton of computing time once we do.
|
Interesting, it says
So that's the weird thing right there! Gives us something to dig into at least, how can that happen? And we now have evidence it is a single run of the task that's doing it, it's reported right there. |
Based on what @eddierubeiz found from his investigations, it seems like somehow the limit on the fetch of “assets to check” isn’t being applied, it’s fetching all of them. A thought I had on my morning constitutional — using ActiveRecord find_each or find_in_batches can ignore your limit, as it applies it’s own limit to avoid fetching everything into memory at once. At least it used to work that way, although following to current AR docs, it may no longer have that limitation and be able to respect limit? "Limits are honored, and if present there is no requirement for the batch size: it can be less than, equal to, or greater than the limit." The logic for making a scope and fetching records has gotten sadly a bit abstract and hard to follow (totally my fault). There IS a
But actually, the way we do that with a sub-query SHOULD avoid any possible issue of find_each eliminating your limit -- the subquery should still have the limit, and we're SQL fetching only things in the subquery. So I'm not sure what's going on. But I still wonder if somehow the ActiveRecord code is resulting in SQL that is ignoring our limit. |
Accomplish the same thing more manually -- fetching in batches of 1000, but still yielding in a single stream. We think find_each race condition may have been responsible for #498, and this may solve it. @eddierubeiz 's debugging in #583 was crucial for figuring out what was going wrong and coming up with possible solution. Possibly we're still being too clever -- but I just didn't want to give up on avoiding fetching al objects or IDs into memory, to have an implementation that should be able to scale to orders of magnitude more assets no problem.
Accomplish the same thing more manually -- fetching in batches of 1000, but still yielding in a single stream. We think find_each race condition may have been responsible for #498, and this may solve it. @eddierubeiz 's debugging in #583 was crucial for figuring out what was going wrong and coming up with possible solution. Possibly we're still being too clever -- but I just didn't want to give up on avoiding fetching al objects or IDs into memory, to have an implementation that should be able to scale to orders of magnitude more assets no problem.
We'll leave this in "in progress" until we've verified it seems to be working properly -- check back one week from today. |
It's February 10th and this is definitely working. Moving this to "Closed." |
Our intent with the automatic nightly fixity checking code, is to set it up so it fixity checks 1/7th of all assets every night, in such a way that after 7 days they've all been checked -- every asset gets checked once every 7 days.
However, the fixity checks seem to be happening more frequently.
Our fixity health report page gives us the "Fully-ingested Asset with oldest most-recent fixity check" -- we expect that to be 7 days old, but looking at it now on 18 Nov 2019, it says 17 Nov 2019, only two days ago. (It's at least possible the query to look up this statistic is wrong).
As another example, for asset ft848s025, it had a fixity check failure on 16 Nov 2019. Eddie re-ran a fixity check that same day, which then passed. (Mystery why it temporary failed, that's #464). Then the automatic fixity check on 18 Nov -- choose to fixity check it again, even though it had a fresh check from only 2 days ago. Not sure why.
Logic for choosing what assets to fixity check nightly might be wrong -- although it's got a SQL
limit
in there, so not sure how it could be fixity checking more things per run.Alternately, perhaps the task for "fixity check 1/7th of things" is being run more than once per night, perhaps on more than one machine?
We might need more logging to figure out what's going on, logging as to exactly what's being checked when by whom we don't have now. But keep in mind that if the fixity check is being run on a host we don't expect, if we were just logging to local file system, we'd never see those logs on a host we're not thinking of looking at. So might need some kind of loggable visibility in a manner other than just writing to local file system on host where activity occurs. (Emails or slack messages?)
The text was updated successfully, but these errors were encountered: