
Fixity checks happening more often than anticipated #498

Open
jrochkind opened this issue Nov 18, 2019 · 4 comments

@jrochkind jrochkind commented Nov 18, 2019

Our intent with the automatic nightly fixity checking code is for it to check 1/7th of all assets each night, so that after 7 days they've all been covered -- every asset gets checked once every 7 days.
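
Roughly, the intended selection each night is something like the following sketch (not the actual implementation; the model, association, and table names are assumptions):

# Sketch of the *intended* nightly behavior, not the app's actual code.
cycle_length  = 7
nightly_limit = Asset.count / cycle_length   # roughly 1/7th of all assets

# Check the assets whose most recent fixity check is oldest (never-checked
# assets first), limited to tonight's slice; over 7 nights that covers
# every asset once.
tonights_assets = Asset
  .left_outer_joins(:fixity_checks)
  .group("assets.id")
  .order(Arel.sql("MAX(fixity_checks.created_at) ASC NULLS FIRST"))
  .limit(nightly_limit)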

However, the fixity checks seem to be happening more frequently.

Our fixity health report page gives us the "Fully-ingested Asset with oldest most-recent fixity check" -- we expect that to be 7 days old, but looking at it now, on 18 Nov 2019, it says 17 Nov 2019, only a day or two ago. (It's at least possible the query used to look up this statistic is wrong.)
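
For reference, that statistic amounts to: for each fully-ingested asset, take its most recent fixity check time, then report the oldest of those. A rough (and inefficient) sketch, with model and column names assumed rather than taken from the report page's actual query:

# Sketch of the statistic, not the report page's actual query; model and
# column names are assumptions, and the real report also restricts to
# fully-ingested assets.
oldest_most_recent_check = Asset
  .joins(:fixity_checks)
  .group("assets.id")
  .maximum("fixity_checks.created_at")   # { asset_id => most recent check time }
  .values
  .min                                   # the oldest of those, expected ~7 days ago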

As another example, asset ft848s025 had a fixity check failure on 16 Nov 2019. Eddie re-ran a fixity check that same day, which then passed. (Why it temporarily failed is a mystery; that's #464.) Then the automatic fixity check on 18 Nov chose to check it again, even though it had a fresh check from only 2 days ago. Not sure why.

The logic for choosing which assets to fixity check nightly might be wrong -- although it's got a SQL limit in there, so I'm not sure how it could be checking more things per run.
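
One quick way to check that, once we can reproduce it in a console, would be to dump the SQL the selection scope actually generates and confirm the limit is really there (the scope name here is hypothetical):

# Sketch: in a rails console, dump the SQL the nightly selection scope
# generates and confirm a LIMIT clause is present.
puts selected_assets_scope.to_sql
# expect the statement to end with a LIMIT clause sized to ~1/7th of all assets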

Alternatively, perhaps the task to "fixity check 1/7th of things" is being run more than once per night, possibly on more than one machine?

We might need more logging to figure out what's going on; we currently don't have logging of exactly what's being checked, when, and by whom. But keep in mind that if the fixity check is being run on a host we don't expect, and we were only logging to the local file system, we'd never see those logs on a host we're not thinking to look at. So we might need some kind of visibility other than writing to the local file system on the host where the activity occurs. (Emails or Slack messages?)
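
A hedged sketch of what that off-host visibility could look like, e.g. posting a one-line summary of each run to a Slack incoming webhook (the SLACK_WEBHOOK_URL env var and the helper method are hypothetical, not existing code):

require "json"
require "net/http"
require "socket"

# Sketch: report each fixity run somewhere off the host, so we can see which
# machine ran it and how much it checked, whichever host it happened on.
def report_fixity_run(checked_count:, started_at:, finished_at:)
  summary = "[fixity] host=#{Socket.gethostname} checked=#{checked_count} " \
            "started=#{started_at} finished=#{finished_at}"

  Net::HTTP.post(
    URI(ENV.fetch("SLACK_WEBHOOK_URL")),
    { text: summary }.to_json,
    "Content-Type" => "application/json"
  )
end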

@sanfordd sanfordd self-assigned this Jan 6, 2020

@sanfordd sanfordd commented Jan 9, 2020

I'm switching tracks, but I want to make a couple of quick notes, for myself or the next person to look at this, on what I've been able to rule out so far.

  • There's definitely only one cron-role server in production and in staging, so it's not a duplicate cron server somewhere.
    • The other cron jobs don't seem to be running multiple times. I took a look at the sitemap on S3 to confirm we're not seeing two generated on the same day. So that rules out a non-cron-tagged server running tasks.
  • Manually running the fixity check handles item counts correctly.
  • Our current log level isn't dumping any details about the work being done by the fixity task; maybe we should bump staging up to debug?
  • My stopping point was checking whether the CYCLE_LENGTH env variable was getting lost or set to 0 when the task was run as a cron-like job (see the sketch after this list).
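
A minimal sketch of the kind of check I had in mind, assuming the task derives its nightly limit from CYCLE_LENGTH (the env-var handling and logging here are illustrative, not the actual task code):

# Illustrative only, not the actual fixity task code. The point is to log what
# CYCLE_LENGTH resolves to in the cron environment: ENV values are strings or
# nil, and a lost variable silently becomes 0 via to_i.
require "socket"

raw_cycle_length = ENV["CYCLE_LENGTH"]
cycle_length     = raw_cycle_length.to_i
cycle_length     = 7 if cycle_length <= 0   # guard instead of dividing by zero

asset_count    = Asset.count
expected_limit = asset_count / cycle_length

Rails.logger.info(
  "[fixity] host=#{Socket.gethostname} CYCLE_LENGTH raw=#{raw_cycle_length.inspect} " \
  "resolved=#{cycle_length} asset_count=#{asset_count} expected_limit=#{expected_limit}"
)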

@eddierubeiz eddierubeiz commented Jan 17, 2020

Oh dear. All the assets are getting checked, every night. And it's taking nine hours. I haven't found the bug yet, but it will save us a ton of computing time once we do.

{
  "cycle_length" => 7,
  "expected_number_of_items_to_check" => 3742,
  "actual_number_of_items_checked" => 25568,
  "fixity_check_task_id" => "907",
  "start_time" => "2020-01-17 02:30:03 -0500",
  "end_time" => "2020-01-17 11:13:44 -0500",
  "time_elapsed" => "31420.634213594 seconds"
}

@jrochkind jrochkind commented Jan 21, 2020

Interesting, it says cycle_length 7, but it's not acting as if the cycle_length is really 7, eh?

  "expected_number_of_items_to_check" => 3742,
  "actual_number_of_items_checked" => 25568,

So that's the weird thing right there! It gives us something to dig into, at least: how can that happen?

And we now have evidence that it's a single run of the task doing it; it's reported right there.


@jrochkind jrochkind commented Jan 23, 2020

Based on what @eddierubeiz found from his investigations, it seems like somehow the limit on the fetch of "assets to check" isn't being applied; it's fetching all of them.

A thought I had on my morning constitutional: ActiveRecord's find_each or find_in_batches can ignore your limit, since it applies its own limit to avoid fetching everything into memory at once. At least it used to work that way, although according to the current AR docs it may no longer have that limitation and may respect limit: "Limits are honored, and if present there is no requirement for the batch size: it can be less than, equal to, or greater than the limit."
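
For concreteness, a hedged illustration of that behavior (not our code; the numbers are arbitrary):

# Illustration only. In older ActiveRecord, find_each replaced a relation's
# limit with its own batching LIMIT, so this could iterate every Asset; the
# docs quoted above say a limit of 100 is now honored.
Asset.limit(100).find_each(batch_size: 50) do |asset|
  # older AR: may yield far more than 100 assets
  # newer AR: stops after 100
end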

The logic for making a scope and fetching records has, sadly, gotten a bit abstract and hard to follow (totally my fault).

There IS a find_each involved:

Asset.where(id: selected_assets_scope).find_each

But actually, the way we do that with a sub-query SHOULD avoid any possible issue of find_each eliminating your limit -- the subquery should still have the limit, and the SQL only fetches things matched by the subquery.
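
To spell that out, the subquery pattern keeps the limit inside the IN (...) even if find_each rewrites the outer query for batching. A sketch with a hypothetical stand-in for selected_assets_scope (table and column names in the SQL are illustrative):

# Hypothetical stand-in for the real selected_assets_scope: some ordered
# selection of asset ids with the nightly SQL limit applied.
selected_assets_scope = Asset.order(created_at: :asc).limit(100).select(:id)

# find_each may add its own LIMIT/ORDER to the outer query for batching, but
# the inner subquery still carries LIMIT 100, so at most 100 ids can match.
# Generated SQL is roughly:
#
#   SELECT assets.* FROM assets
#   WHERE assets.id IN (
#     SELECT assets.id FROM assets ORDER BY assets.created_at ASC LIMIT 100
#   )
#   ORDER BY assets.id LIMIT 1000
#
Asset.where(id: selected_assets_scope).find_each do |asset|
  # fixity-check the asset here
end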

So I'm not sure what's going on. But I still wonder if somehow the ActiveRecord code is resulting in SQL that is ignoring our limit.
