fix(peering): Log error metrics more better #3647
Conversation
Not all exceptions were caught, so they would leak out of the `PeeringAgent` without being logged. Additionally, threw a retry around the most egregious SQL query in case it fails due to a timeout.
@@ -3,6 +3,10 @@ package com.netflix.spinnaker.orca.peering
import com.netflix.spinnaker.kork.exceptions.SystemException
import com.netflix.spinnaker.kork.sql.routing.withPool
import com.netflix.spinnaker.orca.api.pipeline.models.ExecutionType
import io.github.resilience4j.retry.Retry
import io.github.resilience4j.retry.RetryConfig
import io.vavr.control.Try
ha, TIL
var hadFailures = false
var orchestrationsDeleted = 0
var pipelinesDeleted = 0
try {
I just noticed the code duplication here. I would suggest the following:
```kotlin
data class DeletionResult(val numDeleted: Int, val hadFailures: Boolean)

private fun delete(executionType: ExecutionType, idsToDelete: List<String>): DeletionResult {
  var numDeleted = 0
  var hadFailures = false

  try {
    numDeleted = destDB.deleteExecutions(executionType, idsToDelete)
    peeringMetrics.incrementNumDeleted(executionType, numDeleted)
  } catch (e: Exception) {
    log.error("Failed to delete some $executionType", e)
    peeringMetrics.incrementNumErrors(executionType)
    hadFailures = true
  }

  return DeletionResult(numDeleted, hadFailures)
}
```
And then in `peerDeletedExecutions`, it becomes simply:
```kotlin
val (orchestrationsDeleted, orchestrationsHadFailures) = delete(ExecutionType.ORCHESTRATION, orchestrationIdsToDelete)
val (pipelinesDeleted, pipelinesHadFailures) = delete(ExecutionType.PIPELINE, pipelineIdsToDelete)
```
thanks, i like it!
@@ -76,7 +75,7 @@ class PeeringAgent(
  override fun tick() {
    if (dynamicConfigService.isEnabled("pollers.peering", true) &&
      dynamicConfigService.isEnabled("pollers.peering.$peeredId", true)) {
      peeringMetrics.recordOverallLag() {
      peeringMetrics.recordOverallLag {
is it the same with and without the parens? I'm confused
yes, i guess spotless did it.. but yeah, i guess that's kotlin convention: https://kotlinlang.org/docs/reference/lambdas.html#passing-a-lambda-to-the-last-parameter
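As a quick illustration of the trailing-lambda rule from the linked docs (the function below is a stand-in, not the real `recordOverallLag` signature): when a lambda is the last argument it can be moved outside the parentheses, and if it is the only argument the parentheses can be dropped entirely.

```kotlin
// Hypothetical function, just to illustrate the syntax in question:
// it takes a single lambda parameter, like recordOverallLag does.
fun recordOverallLag(block: () -> Long) {
    println("lag=${block()}")
}

fun main() {
    recordOverallLag({ 42L })   // lambda passed inside the parentheses
    recordOverallLag() { 42L }  // trailing lambda, empty parens left behind
    recordOverallLag { 42L }    // trailing lambda, parens dropped (what spotless produced)
}
```

All three calls are equivalent; spotless simply removed the redundant empty parentheses.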
} catch (e: Exception) {
  log.error("Failed to delete some pipelines", e)
  log.error("Failed to delete some executions", e)
  peeringMetrics.incrementNumErrors(ExecutionType.ORCHESTRATION)
  peeringMetrics.incrementNumErrors(ExecutionType.PIPELINE)
ah hmmm... it wouldn't be terribad to pass an execution type to `peerDeletedExecutions` and call it twice, right? Then the metrics wouldn't have to lie 😅
it's a bit sucky since there are both pipelines and orchestrations in that `deleted_executions` table...
@@ -204,4 +210,16 @@ open class MySqlRawAccess(
    return persisted
  }

  private fun <T> withRetry(action: () -> T): T {
you can also do this with a retry annotation:
https://github.com/spinnaker/fiat/blob/64021e98f8d55c11a83149dc8aacdb854342c777/fiat-roles/src/main/java/com/netflix/spinnaker/fiat/providers/internal/Front50DataLoader.java#L39
And then make the options configurable/overridable via spring:
https://github.com/spinnaker/fiat/blob/64021e98f8d55c11a83149dc8aacdb854342c777/fiat-roles/src/main/resources/resilience4j-defaults.properties
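For readers unfamiliar with that pattern, here is a rough, hypothetical sketch of the annotation style being suggested. The class, method, and retry-instance names below are invented for illustration and this is not code from this PR or from fiat; it assumes resilience4j's Spring support is on the classpath.

```kotlin
import io.github.resilience4j.retry.annotation.Retry
import org.springframework.stereotype.Component

// Hypothetical Spring component: resilience4j's Spring AOP support wraps the
// annotated method in a retry whose settings come from configuration
// (e.g. resilience4j.retry.instances.peering.maxAttempts=3) rather than code.
// The class and method are open so the proxy can subclass them.
@Component
open class PeeringQueries {

  // "peering" is a made-up retry instance name for this sketch.
  @Retry(name = "peering")
  open fun deletedExecutionIds(): List<String> =
    runSlowSqlQuery() // stand-in for the SQL call that may time out

  private fun runSlowSqlQuery(): List<String> = TODO("placeholder for the real query")
}
```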
now that I say that, the annotation approach would change this from returning a function to executing the retry so never mind
there was also some weird thing that @jonsie and I tried to figure out (the annotation wasn't working somewhere... river maybe?) anyway, it was driving me mad so I didn't want to chance it here :)
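For reference, a minimal sketch of what the programmatic approach can look like with the imports added in this PR (resilience4j `Retry`/`RetryConfig` plus vavr's `Try`). The attempt count, wait duration, and retry name below are illustrative placeholders, not the values used in `MySqlRawAccess`.

```kotlin
import io.github.resilience4j.retry.Retry
import io.github.resilience4j.retry.RetryConfig
import io.vavr.control.Try
import java.time.Duration

// Illustrative sketch only: runs the given action, retrying a few times before
// letting the final failure surface. Attempt count and backoff are placeholders.
private fun <T> withRetry(action: () -> T): T {
  val retryConfig = RetryConfig.custom<T>()
    .maxAttempts(3)
    .waitDuration(Duration.ofMillis(500))
    .build()
  val retry = Retry.of("peering-sql", retryConfig)

  // Decorate the action, run it, and unwrap the result; a failed Try surfaces
  // the underlying exception to the caller.
  return Try.ofSupplier(Retry.decorateSupplier(retry) { action() }).get()
}

// Hypothetical usage: wrap a slow SQL call that may time out.
// val rows = withRetry { runSlowSqlQuery() }
```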