-
Notifications
You must be signed in to change notification settings - Fork 144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
If we can't parse a resource in the database, stop checking #957
If we can't parse a resource in the database, stop checking #957
Conversation
- addColumn: | ||
tableName: resource_last_checked | ||
columns: | ||
- column: | ||
name: unreadable | ||
type: boolean | ||
defaultValue: false | ||
constraints: | ||
nullable: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a new DB column so we can just say "don't keep trying to check this"
This comment has been minimized.
This comment has been minimized.
keel-sql/src/main/kotlin/com/netflix/spinnaker/keel/sql/SqlResourceRepository.kt
Outdated
Show resolved
Hide resolved
bc8bb1a
to
a256edf
Compare
a256edf
to
d7efcdc
Compare
} | ||
} catch (e: TimeoutCancellationException) { | ||
log.error("Timed out checking resource ${it.id}", e) | ||
publisher.publishEvent(ResourceCheckTimedOut(it.kind, it.id, it.application)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the timeout metric currently appear on our keel dashboard? I don't think I've seen it there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's pretty new and I don't believe it got added to the dashboard, no
keel-core/src/test/kotlin/com/netflix/spinnaker/keel/actuation/CheckSchedulerTests.kt
Outdated
Show resolved
Hide resolved
d7efcdc
to
6a8400b
Compare
6a8400b
to
583628a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
constructResource(kind, metadata, spec) | ||
} catch (e: Exception) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic will ignore updates regardless of the exception that occurs. I worry about situations where constructResource throws an exception for a transient issue (e.g., networking issue connecting with the database). Couple of thoughts around that:
- We need to feed back info to a human being when a resource gets ignored. It should hopefully be rare enough that we can handle these as they come in. Since we're logging this as an error when it happens, I think we're good here, because we'll get alerted by RADAR.
- Human operator needs to know how to fix things when a resource gets into a bad state due to a transient issue. Looks like the operation is simple enough (just set ignore to
false
in the database record), but we need to make sure everyone knows how to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Note that I saw we're logging as an error when the exception gets caught upstream, so no need to add any additional log statements).
After I promoted the change where we no longer check image resources, but instead check artifacts, we discovered that resource checking halted because Keel was no longer able to parse resources with
ImageSpec
payloads.This change makes it so that if we hit that condition, we'll flag the resource as unreadable and stop checking it. That way resources we can handle will continue being processed.