Fixity checks happening more often than anticipated #498

Closed
jrochkind opened this issue Nov 18, 2019 · 6 comments
@jrochkind
Contributor

Our intent with the automatic nightly fixity checking code is for it to check 1/7th of all assets every night, in such a way that after 7 days they've all been checked -- every asset gets checked once every 7 days.

However, the fixity checks seem to be happening more frequently.

Our fixity health report page gives us the "Fully-ingested Asset with oldest most-recent fixity check" -- we expect that to be 7 days old, but looking at it now on 18 Nov 2019, it says 17 Nov 2019, only two days ago. (It's at least possible the query to look up this statistic is wrong).

As another example, asset ft848s025 had a fixity check failure on 16 Nov 2019. Eddie re-ran a fixity check that same day, which then passed. (Why it temporarily failed is a mystery; that's #464.) Then the automatic fixity check on 18 Nov chose to fixity check it again, even though it had a fresh check from only 2 days ago. Not sure why.

The logic for choosing which assets to fixity check nightly might be wrong -- although it's got a SQL limit in there, so I'm not sure how it could be fixity checking more things per run.
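
For reference, the intended selection is roughly this shape -- an illustrative sketch, with assumed model/association/table names rather than our actual schema:

```ruby
# Illustrative sketch only; the model, association, table, and column names
# here are assumptions, not necessarily our actual schema.
cycle_length  = Integer(ENV.fetch("CYCLE_LENGTH", "7"))
nightly_limit = (Asset.count / cycle_length.to_f).ceil

# Intended behavior: each night, pick the `nightly_limit` assets whose most
# recent fixity check is oldest (or that have never been checked at all).
selected_assets_scope =
  Asset
    .left_outer_joins(:fixity_checks)
    .group("assets.id")
    .order(Arel.sql("MAX(fixity_checks.created_at) ASC NULLS FIRST"))
    .limit(nightly_limit)
    .select("assets.id")
```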

Alternatively, perhaps the task to "fixity check 1/7th of things" is being run more than once per night, perhaps on more than one machine?

We might need more logging to figure out what's going on -- we don't currently have logging of exactly what's being checked, when, and by whom. But keep in mind that if the fixity check is being run on a host we don't expect, and we were just logging to the local file system, we'd never see those logs on a host we're not thinking of looking at. So we might need some kind of loggable visibility other than just writing to the local file system on the host where the activity occurs. (Emails or Slack messages?)

@sanfordd sanfordd self-assigned this Jan 6, 2020
@sanfordd

sanfordd commented Jan 9, 2020

I'm switching tracks, but I want to make a couple of quick notes, for myself or the next person to look at this, on what I've been able to rule out so far.

  • There's definitely only one cron role server in production or staging, so it's not another cron server somewhere.
    • The other cron jobs don't seem to be running multiple times. I took a look at the sitemap on S3 to confirm we're not seeing two made in the same day. So that rules out a non-cron-tagged server running tasks.
  • Manually running the fixity check correctly handles item counts
  • Our current log level isn't dumping any details about the work being done by the fixity task; we might want to bump staging up to debug?
  • My stopping point was checking whether the CYCLE_LENGTH env variable was somehow getting lost or set to 0 when the task is run as a cron-like job (see the sketch below).
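
If it does turn out to be the env var, the failure mode might look something like this (hypothetical sketch, not our actual task code):

```ruby
# Hypothetical sketch of the concern -- not our actual task code.
# Under cron, an unset or blank CYCLE_LENGTH silently becomes 0:
cycle_length = ENV["CYCLE_LENGTH"].to_i   # nil.to_i and "".to_i are both 0
cycle_length = 7 if cycle_length <= 0     # guard we'd want if the env var is lost

nightly_limit = (Asset.count / cycle_length.to_f).ceil
```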

@eddierubeiz
Contributor

Oh dear. All the assets are getting checked, every night, and it's taking nine hours. I haven't found the bug yet, but fixing it will save us a ton of computing time once we do.

{
  "cycle_length" => 7,
  "expected_number_of_items_to_check" => 3742,
  "actual_number_of_items_checked" => 25568,
  "fixity_check_task_id" => "907",
  "start_time" => "2020-01-17 02:30:03 -0500",
  "end_time" => "2020-01-17 11:13:44 -0500",
  "time_elapsed" => "31420.634213594 seconds"
}

@jrochkind
Contributor Author

jrochkind commented Jan 21, 2020

Interesting, it says cycle_length 7, but it's not acting as if the cycle_length is really 7, eh?

  "expected_number_of_items_to_check" => 3742,
  "actual_number_of_items_checked" => 25568,

So that's the weird thing right there! It gives us something to dig into, at least: how can that happen?

And we now have evidence that it's a single run of the task that's doing it -- it's reported right there.

@jrochkind
Contributor Author

Based on what @eddierubeiz found from his investigations, it seems like somehow the limit on the fetch of “assets to check” isn't being applied; it's fetching all of them.

A thought I had on my morning constitutional -- using ActiveRecord find_each or find_in_batches can ignore your limit, as it applies its own limit to avoid fetching everything into memory at once. At least it used to work that way, although according to the current AR docs, it may no longer have that limitation and may respect limit: "Limits are honored, and if present there is no requirement for the batch size: it can be less than, equal to, or greater than the limit."
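
To make that concern concrete, a toy illustration (names and numbers are placeholders, not our actual job code):

```ruby
# Toy illustration only -- not our actual selection code.
# We ask for a limited slice, but find_each pages through the relation in its
# own batches; in older ActiveRecord versions it ignored `limit` entirely and
# iterated the whole table.
nightly_slice = Asset.limit(3742)

nightly_slice.find_each(batch_size: 1000) do |asset|
  # With a limit-ignoring find_each, this block runs for every asset.
  check_fixity(asset) # hypothetical helper standing in for the real check
end
```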

The logic for making a scope and fetching records has, sadly, gotten a bit abstract and hard to follow (totally my fault).

There IS a find_each involved:

Asset.where(id: selected_assets_scope).find_each

But actually, the way we do that with a sub-query SHOULD avoid any possible issue of find_each eliminating your limit -- the subquery should still have the limit, and in SQL we're only fetching things matched by the subquery.

So I'm not sure what's going on. But I still wonder if somehow the ActiveRecord code is resulting in SQL that is ignoring our limit.
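
For reference, the shape I mean, as a hedged sketch with assumed table/column names rather than the exact code:

```ruby
# Hedged sketch; table and column names are assumptions.
# The inner scope carries the LIMIT, and the outer relation only fetches ids
# matched by that subquery -- so find_each's own batching on the outer query
# shouldn't be able to widen the selection.
selected_assets_scope = Asset.order(:created_at).limit(3742).select(:id)

Asset.where(id: selected_assets_scope).find_each(batch_size: 1000) do |asset|
  check_fixity(asset) # hypothetical helper
end

# Roughly the SQL find_each should issue for each batch after the first:
#   SELECT * FROM assets
#   WHERE id IN (SELECT id FROM assets ORDER BY created_at LIMIT 3742)
#     AND id > ?
#   ORDER BY id LIMIT 1000
```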

jrochkind added a commit that referenced this issue Jan 28, 2020
Accomplish the same thing more manually -- fetching in batches of 1000, but still yielding in a single stream.

We think a find_each race condition may have been responsible for #498, and this may solve it.

@eddierubeiz 's debugging in #583 was crucial for figuring out what was going wrong and coming up with a possible solution. Possibly we're still being too clever -- but I just didn't want to give up on avoiding fetching all objects or IDs into memory, so that we have an implementation that should scale to orders of magnitude more assets, no problem.
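
Roughly the shape of the change, as a hedged sketch -- method and model names are assumed, not necessarily the committed code:

```ruby
# Hedged sketch with assumed names -- not necessarily the committed implementation.
# Page through an already-limited selection in fixed-size slices, yielding one
# continuous stream, without loading every object or id into memory at once.
def stream_selected_assets(selected_ids_scope, batch_size: 1000)
  return enum_for(:stream_selected_assets, selected_ids_scope, batch_size: batch_size) unless block_given?

  offset = 0
  loop do
    # The inner scope keeps its own LIMIT; this outer limit/offset only pages
    # the query, in a deterministic (id) order.
    batch = Asset.where(id: selected_ids_scope)
                 .order(:id)
                 .offset(offset)
                 .limit(batch_size)
                 .to_a
    break if batch.empty?

    batch.each { |asset| yield asset }
    offset += batch_size
  end
end
```
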
@jrochkind
Contributor Author

We'll leave this in "in progress" until we've verified it seems to be working properly -- check back one week from today.

@eddierubeiz
Contributor

eddierubeiz commented Feb 10, 2020

It's February 10th and this is definitely working. Moving this to "Closed."
