Fixity checks happening more often than anticipated #498

Closed
jrochkind opened this issue Nov 18, 2019 · 6 comments
@jrochkind
Contributor

Our intent with the automatic nightly fixity checking code is for it to check 1/7th of all assets every night, in such a way that after 7 days they've all been checked -- every asset gets checked once every 7 days.

However, the fixity checks seem to be happening more frequently.

Our fixity health report page gives us the "Fully-ingested Asset with oldest most-recent fixity check" -- we expect that to be 7 days old, but looking at it now on 18 Nov 2019, it says 17 Nov 2019, only two days ago. (It's at least possible the query to look up this statistic is wrong).

As another example, asset ft848s025 had a fixity check failure on 16 Nov 2019. Eddie re-ran a fixity check that same day, which then passed. (Why it temporarily failed is a mystery; that's #464.) Then the automatic fixity check on 18 Nov chose to fixity check it again, even though it had a fresh check from only 2 days ago. Not sure why.

The logic for choosing which assets to fixity check nightly might be wrong -- although it's got a SQL limit in there, so I'm not sure how it could be fixity checking more things per run.
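
For reference, the intended selection is roughly this shape -- an illustrative sketch, with assumed model/association/table names rather than our actual schema:

```ruby
# Illustrative sketch only; the model, association, table, and column names
# here are assumptions, not necessarily our actual schema.
cycle_length  = Integer(ENV.fetch("CYCLE_LENGTH", "7"))
nightly_limit = (Asset.count / cycle_length.to_f).ceil

# Intended behavior: each night, pick the `nightly_limit` assets whose most
# recent fixity check is oldest (or that have never been checked at all).
selected_assets_scope =
  Asset
    .left_outer_joins(:fixity_checks)
    .group("assets.id")
    .order(Arel.sql("MAX(fixity_checks.created_at) ASC NULLS FIRST"))
    .limit(nightly_limit)
    .select("assets.id")
```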

Alternatively, perhaps the task to "fixity check 1/7th of things" is being run more than once per night, perhaps on more than one machine?

We might need more logging to figure out what's going on -- we don't currently have logging of exactly what's being checked, when, and by whom. But keep in mind that if the fixity check is being run on a host we don't expect, and we were just logging to the local file system, we'd never see those logs on a host we're not thinking of looking at. So we might need some kind of loggable visibility other than just writing to the local file system on the host where the activity occurs. (Emails or Slack messages?)

@sanfordd sanfordd self-assigned this Jan 6, 2020
@sanfordd

sanfordd commented Jan 9, 2020

I'm switching tracks, but I want to make a couple of quick notes, for myself or the next person to look at this, on what I've been able to rule out so far.

  • There's definitely only one cron role server in production or staging, so it's not another cron server somewhere.
    • The other cron jobs don't seem to be running multiple times. I took a look at the sitemap on S3 to confirm we're not seeing two made in the same day. So that rules out a non-cron-tagged server running tasks.
  • Manually running the fixity check correctly handles item counts
  • Our current log level isn't dumping any details about the work being done by the fixity task; we might want to bump staging up to debug?
  • My stopping point was checking whether the CYCLE_LENGTH env variable was somehow getting lost or set to 0 when the task is run as a cron-like job (see the sketch below).
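
If it does turn out to be the env var, the failure mode might look something like this (hypothetical sketch, not our actual task code):

```ruby
# Hypothetical sketch of the concern -- not our actual task code.
# Under cron, an unset or blank CYCLE_LENGTH silently becomes 0:
cycle_length = ENV["CYCLE_LENGTH"].to_i   # nil.to_i and "".to_i are both 0
cycle_length = 7 if cycle_length <= 0     # guard we'd want if the env var is lost

nightly_limit = (Asset.count / cycle_length.to_f).ceil
```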

@eddierubeiz
Contributor

Oh dear. All the assets are getting checked, every night, and it's taking nine hours. I haven't found the bug yet, but fixing it will save us a ton of computing time once we do.

{
  "cycle_length" => 7,
  "expected_number_of_items_to_check" => 3742,
  "actual_number_of_items_checked" => 25568,
  "fixity_check_task_id" => "907",
  "start_time" => "2020-01-17 02:30:03 -0500",
  "end_time" => "2020-01-17 11:13:44 -0500",
  "time_elapsed" => "31420.634213594 seconds"
}

@jrochkind
Contributor Author

jrochkind commented Jan 21, 2020

Interesting, it says cycle_length 7, but it's not acting as if the cycle_length is really 7, eh?

  "expected_number_of_items_to_check" => 3742,
  "actual_number_of_items_checked" => 25568,

So that's the weird thing right there! It gives us something to dig into, at least: how can that happen?

And we now have evidence that it's a single run of the task that's doing it -- it's reported right there.

@jrochkind
Contributor Author

Based on what @eddierubeiz found from his investigations, it seems like somehow the limit on the fetch of “assets to check” isn't being applied; it's fetching all of them.

A thought I had on my morning constitutional -- using ActiveRecord find_each or find_in_batches can ignore your limit, as it applies its own limit to avoid fetching everything into memory at once. At least it used to work that way, although according to the current AR docs, it may no longer have that limitation and may respect limit: "Limits are honored, and if present there is no requirement for the batch size: it can be less than, equal to, or greater than the limit."
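
To make that concern concrete, a toy illustration (names and numbers are placeholders, not our actual job code):

```ruby
# Toy illustration only -- not our actual selection code.
# We ask for a limited slice, but find_each pages through the relation in its
# own batches; in older ActiveRecord versions it ignored `limit` entirely and
# iterated the whole table.
nightly_slice = Asset.limit(3742)

nightly_slice.find_each(batch_size: 1000) do |asset|
  # With a limit-ignoring find_each, this block runs for every asset.
  check_fixity(asset) # hypothetical helper standing in for the real check
end
```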

The logic for making a scope and fetching records has, sadly, gotten a bit abstract and hard to follow (totally my fault).

There IS a find_each involved:

Asset.where(id: selected_assets_scope).find_each

But actually, the way we do that with a sub-query SHOULD avoid any possible issue of find_each eliminating your limit -- the subquery should still have the limit, and in SQL we're only fetching things matched by the subquery.

So I'm not sure what's going on. But I still wonder if somehow the ActiveRecord code is resulting in SQL that is ignoring our limit.
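
For reference, the shape I mean, as a hedged sketch with assumed table/column names rather than the exact code:

```ruby
# Hedged sketch; table and column names are assumptions.
# The inner scope carries the LIMIT, and the outer relation only fetches ids
# matched by that subquery -- so find_each's own batching on the outer query
# shouldn't be able to widen the selection.
selected_assets_scope = Asset.order(:created_at).limit(3742).select(:id)

Asset.where(id: selected_assets_scope).find_each(batch_size: 1000) do |asset|
  check_fixity(asset) # hypothetical helper
end

# Roughly the SQL find_each should issue for each batch after the first:
#   SELECT * FROM assets
#   WHERE id IN (SELECT id FROM assets ORDER BY created_at LIMIT 3742)
#     AND id > ?
#   ORDER BY id LIMIT 1000
```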

jrochkind added a commit that referenced this issue Jan 28, 2020
Accomplish the same thing more manually -- fetching in batches of 1000, but still yielding in a single stream.

We think a find_each race condition may have been responsible for #498, and this may solve it.

@eddierubeiz 's debugging in #583 was crucial for figuring out what was going wrong and coming up with a possible solution. Possibly we're still being too clever -- but I just didn't want to give up on avoiding fetching all objects or IDs into memory, so that we have an implementation that should scale to orders of magnitude more assets, no problem.
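
Roughly the shape of the change, as a hedged sketch -- method and model names are assumed, not necessarily the committed code:

```ruby
# Hedged sketch with assumed names -- not necessarily the committed implementation.
# Page through an already-limited selection in fixed-size slices, yielding one
# continuous stream, without loading every object or id into memory at once.
def stream_selected_assets(selected_ids_scope, batch_size: 1000)
  return enum_for(:stream_selected_assets, selected_ids_scope, batch_size: batch_size) unless block_given?

  offset = 0
  loop do
    # The inner scope keeps its own LIMIT; this outer limit/offset only pages
    # the query, in a deterministic (id) order.
    batch = Asset.where(id: selected_ids_scope)
                 .order(:id)
                 .offset(offset)
                 .limit(batch_size)
                 .to_a
    break if batch.empty?

    batch.each { |asset| yield asset }
    offset += batch_size
  end
end
```
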
@jrochkind
Contributor Author

We'll leave this in "in progress" until we've verified it seems to be working properly -- check back one week from today.

@eddierubeiz
Contributor

eddierubeiz commented Feb 10, 2020

It's February 10th and this is definitely working. Moving this to "Closed."
