Fixity checks happening more often than anticipated #498
Our intent with the automatic nightly fixity checking code is to set it up so that it fixity checks 1/7th of all assets every night, in such a way that after 7 days they've all been checked -- every asset gets checked once every 7 days.
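To make the intended behavior concrete, here is a minimal in-memory sketch of the "1/7th per night, oldest check first" rule. It is not the application's actual code: the `Asset` struct, `CYCLE_DAYS`, and `nightly_batch` are hypothetical names for illustration, and a real implementation would do this selection in SQL rather than in Ruby.

```ruby
# Hypothetical in-memory stand-in for an asset row; not the app's real model.
Asset = Struct.new(:id, :last_checked_at)

CYCLE_DAYS = 7

# Pick ceil(total / cycle_days) assets whose most recent check is oldest.
# Never-checked assets (nil timestamp) are treated as oldest of all.
def nightly_batch(assets, cycle_days: CYCLE_DAYS)
  batch_size = (assets.size.to_f / cycle_days).ceil
  assets
    .sort_by { |a| [a.last_checked_at || Time.at(0), a.id] }
    .first(batch_size)
end

# Simulate one week of nightly runs over 21 never-checked assets.
assets = (1..21).map { |i| Asset.new(i, nil) }
first_night = Time.utc(2019, 11, 11)
check_counts = Hash.new(0)

CYCLE_DAYS.times do |night|
  now = first_night + night * 86_400
  nightly_batch(assets).each do |asset|
    asset.last_checked_at = now
    check_counts[asset.id] += 1
  end
end

puts "assets checked: #{check_counts.size}, max checks per asset: #{check_counts.values.max}"
```

If the selection works as intended, each nightly batch is only about 1/7th of the total, and no asset is checked again until every other asset has been checked.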
However, the fixity checks seem to be happening more frequently.
Our fixity health report page gives us the "Fully-ingested Asset with oldest most-recent fixity check" -- we expect that to be 7 days old, but looking at it now on 18 Nov 2019, it says 17 Nov 2019, only two days ago. (It's at least possible the query to look up this statistic is wrong).
As another example, asset ft848s025 had a fixity check failure on 16 Nov 2019. Eddie re-ran a fixity check that same day, which then passed. (It's a mystery why it temporarily failed; that's #464.) Then the automatic fixity check on 18 Nov chose to fixity check it again, even though it had a fresh check from only 2 days ago. Not sure why.
Logic for choosing what assets to fixity check nightly might be wrong -- although it's got a SQL
Alternatively, perhaps the task for "fixity check 1/7th of things" is being run more than once per night, perhaps on more than one machine?
We might need more logging to figure out what's going on: we don't currently log exactly what's being checked, when, and by which host. But keep in mind that if the fixity check is being run on a host we don't expect, and we were only logging to the local file system, we'd never see those logs, because they'd be on a host we're not thinking of looking at. So we might need some kind of visibility other than just writing to the local file system on the host where the activity occurs. (Emails or Slack messages?)
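As a rough sketch of what such a record could contain, the snippet below builds one JSON line per run that includes the host name and process id, so that a run on an unexpected machine, or two runs in the same night, would be distinguishable once the line reaches a shared destination. The function name and field names are made up for illustration, and delivery (email, Slack, a central log service) is deliberately left out.

```ruby
require "json"
require "socket"
require "time"

# Build one JSON log line describing a single fixity-check run.
# Field names are illustrative only, not an existing schema.
def fixity_run_log_line(asset_ids:, started_at: Time.now.utc,
                        host: Socket.gethostname, pid: Process.pid)
  JSON.generate(
    event: "fixity_check_run",
    host: host,                       # which machine ran the task
    pid: pid,                         # distinguishes concurrent runs on one host
    started_at: started_at.iso8601,
    asset_count: asset_ids.size,      # should be roughly 1/7th of all assets
    asset_ids: asset_ids
  )
end

# The resulting line would then be sent somewhere shared, not only to local disk.
puts fixity_run_log_line(asset_ids: ["ft848s025"])
```

Logging the asset count for each run would also show directly whether a single run is checking far more than 1/7th of the assets.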
I'm switching tracks, but I want to make a couple of quick notes, for myself or the next person to check this, about what I've been able to rule out so far.
Oh dear. All the assets are getting checked, every night. And it's taking nine hours. I haven't found the bug yet, but it will save us a ton of computing time once we do.
Interesting, it says
So that's the weird thing right there! Gives us something to dig into at least, how can that happen?
And we now have evidence it is a single run of the task that's doing it, it's reported right there.
Based on what @eddierubeiz found from his investigations, it seems like somehow the limit on the fetch of “assets to check” isn’t being applied, it’s fetching all of them.
A thought I had on my morning constitutional -- using ActiveRecord find_each or find_in_batches can ignore your limit, as it applies its own limit to avoid fetching everything into memory at once. At least it used to work that way, although according to the current AR docs, it may no longer have that limitation and may be able to respect limit: "Limits are honored, and if present there is no requirement for the batch size: it can be less than, equal to, or greater than the limit."
The logic for making a scope and fetching records has sadly gotten a bit abstract and hard to follow (totally my fault).
There IS a
But actually, the way we do that with a sub-query SHOULD avoid any possible issue of find_each eliminating your limit -- the subquery should still have the limit, and we're SQL fetching only things in the subquery.
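The reasoning here can be modeled in plain Ruby, without ActiveRecord. This is only an in-memory sketch of the argument, with hypothetical names (`AssetRow`, `ids_due_for_check`, `each_due_row`); it says nothing about what SQL ActiveRecord actually generates, which is the open question.

```ruby
# Hypothetical stand-in for an asset row; not the app's real model.
AssetRow = Struct.new(:id, :last_checked_at)

# Step 1 (plays the role of the subquery): the ids of the `limit` rows
# whose most recent check is oldest.
def ids_due_for_check(rows, limit)
  rows.sort_by { |r| [r.last_checked_at, r.id] }.first(limit).map(&:id)
end

# Step 2 (plays the role of the batched outer fetch): iterate in batches,
# but only over rows whose id is in the limited set from step 1.
def each_due_row(rows, limit, batch_size: 2)
  due_ids = ids_due_for_check(rows, limit)
  rows.select { |r| due_ids.include?(r.id) }.each_slice(batch_size) do |batch|
    batch.each { |row| yield row }
  end
end

rows = (1..10).map { |i| AssetRow.new(i, Time.utc(2019, 11, i)) }
checked = []
each_due_row(rows, 3) { |row| checked << row.id }
puts "checked #{checked.size} of #{rows.size} rows: #{checked.inspect}"
```

In this model the batch size only controls how the limited set is walked; it cannot enlarge the set. If the real task still checks every asset, the limit is presumably being lost before or inside the subquery itself, not in the batching step.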
So I'm not sure what's going on. But I still wonder if somehow the ActiveRecord code is resulting in SQL that is ignoring our limit.