Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeout while rescanning a previously scanned manifest leaves IndexReport empty forever #1124

Open
iainduncani opened this issue Oct 26, 2023 · 1 comment
Labels
triaged The maintainers have seen this issue

Comments

@iainduncani
Copy link
Contributor

The Problem

We are finding that sometimes when we scan a manifest that has already been scanned we will end up with an empty IndexReport stored for that manifest:

{
  "manifest_hash": "sha256:a2f196cc1f224ba665cde842a5be1d3c7ded333c279525242eba123677152172",
  "state": "",
  "packages": {},
  "distributions": {},
  "repository": {},
  "environments": {},
  "success": false,
  "err": ""
}

Once in this state any further request to rescan it will not do anything to change the IndexReport. The logs from the call to scan the previously scanned manifest show:

info index request start
debug locking attempt
debug locking OK
info starting scan
info index request done

The logs on any future attempts show an almost identical output except they include the manifest already scanned line:

info index request start
debug locking attempt
debug locking OK
info starting scan
info manifest already scanned
info index request done 

Suspected Code Path

Looking at the code I think this happens when an attempt to get the IndexReport in checkManifest timeouts. The flow I believe is:

  1. The Controller enters the run method and calls CheckManifest as the first state. At this point the IndexReport on the controller is empty
  2. checkManifest discovers that the manifest is already scanned
  3. checkManifest attempts to load the IndexReport but the call to s.Store.IndexReport times out (5s timeout) and an err is returned. checkManifest does not log anything in this case and IndexReport on the controller is not set because it was not loaded.
  4. Back in the calling controller it goes through the checks as to whether an error occurred. As this is a timeout errors.Is(err, context.DeadlineExceeded) will be true and it will go into that case
  5. As the timeout case does not exit the for loop early it just sets the retry variable to true the Controller then proceeds to store the IndexReport into the database. This stores an empty IndexReport into the database.
  6. The Controller then continues by going into the retry logic which pauses for a bit and then sets err to nil.
  7. When checkManifest returned it returned the next state as Terminal so Controller now sees this and exits

I think this is consistent with what we see in the logs where we get no logging at all indicating an error has occurred. Nowhere in the code path above does a log get emitted with the error from checkManifest. Future calls to scan on this manifest follow a very similar path except the call to IndexReport does not timeout but all it loads is an empty IndexReport, it then emits the manifest already scanned log line and does not return an error but we are left with the empty IndexReport which check manifests thinks is OK.

The solution

I think there are a couple of things to be addressed here. I'm happy to work on a PR but would like advice first as I think some change to behaviour is required.

Retry logic in controller

I do not think that the retry is doing anything. Every state that I can see that returns an error also sets the next state to Terminal. This means if any of these errors are timeouts then they will go through the retry logic path of having a small pause before seeing that the next state is Terminal and exiting.

Personally I would delete the errors.Is(err, context.DeadlineExceeded) case and retry variables rather than attempting to fix them (the fix would be to change that case to have what is currently inside the if retryblock and put acontinueon the end so it goes back around the loop without changing thenextstate and without persisting a potentially invalidIndexReport`).

Checking the validity of manifests in checkManifest

Fixing the retry logic maybe enough to fix this but care would need to be taken to not persist a bad IndexReport when the call to s.Store.IndexReport fails. Even if the Controller's standard error handling code was used for one of these errors then it would result in an IndexReport being persisted with state IndexError which again isn't what is wanted. I wonder if a check should be added to checkManifest to actually make sure the loaded IndexReport is valid, something like:

if sr.State != IndexFinished {
    zlog.Info(ctx).Msg("IndexReport not valid so manifest to be scanned")
    // Potentially do the filtering that happens in the case where a Manifest has not been scanned, but in this case there will be no scanners left as we know if we get this far `ManifestScanned` returned `ok`
    s.Vscnrs = indexer.VersionedScanners{}
    return FetchLayers, nil
}
@iainduncani
Copy link
Contributor Author

I put a proposed solution into PR #1131 with questions about the right approach to take.

@hdonnay hdonnay added the triaged The maintainers have seen this issue label Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triaged The maintainers have seen this issue
Development

No branches or pull requests

2 participants