Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(logging): add fabulous logging for when pipelines fail to trigger #555

Merged

Conversation

marchello2000
Copy link
Contributor

The impetus for this change is to be able to better alert on and understand pipeline trigger failures.
We've been bit by trigger failures that went undiscovered (sometimes for days) until our users complained.
With this change we can reliable alert on orca.trigger.errors metric which is incremented when kicking
of a pipeline fails (unlike the previous metric orca.error which was sometimes triggered erroneously).

When triggering does fail, this change will log, in much more detail, why the trigger failed as well as
log the full payload submitted to orca for quick and easy debugging/reproducability of the issue.

Bonus 1: This change (mostly) removes the reliance on Rx and replaces it with an executor to
trigger pipelines.

Bonus 2: Fixup calls to front50 to allow them to be anonymous

The impetus for this change is to be able to better alert on and understand pipeline trigger failures.
We've been bit by trigger failures that went undiscovered (sometimes for days) until our users complained.
With this change we can reliable alert on `orca.trigger.errors` metric which is incremented when kicking
of a pipeline fails (unlike the previous metric `orca.error` which was sometimes triggered erroneously).

When triggering does fail, this change will log, in much more detail, why the trigger failed as well as
log the full payload submitted to `orca` for quick and easy debugging/reproducability of the issue.

Bonus 1: This change (mostly) removes the reliance on Rx and replaces it with an executor to
trigger pipelines.

Bonus 2: Fixup calls to `front50` to allow them to be anonymous
@@ -27,7 +27,6 @@ dependencies {
implementation "com.netflix.spinnaker.kork:kork-core"
implementation "com.netflix.spinnaker.kork:kork-artifacts"
implementation "com.netflix.spinnaker.kork:kork-web"
implementation "io.reactivex:rxjava"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

@marchello2000
Copy link
Contributor Author

Honestly, I am still not convinced that the executorService is needed:

  1. cron triggers already happen on quartz thread
  2. events are already Rx'd on IOScheduler here

So I don't see the point... but I am going to keep the executor for now (to maintain behavior as before)


log.warn("Retrying pipeline trigger for {}", pipeline);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  log.warn("Retrying pipeline trigger (attempt {}/{}) for {}", attempts, retryCount, pipeline);

try {
attempts++;
return orca.trigger(pipeline);
} catch (RetrofitError e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to log e

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, we log the last one (i.e. if all retries fail). otherwise any transitory errors make it noisy... but i could use splainer style and only log it all when the last one fails?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be a matter of preference but I'd rather see all exceptions as they happen rather than just the last one. They each are their own event, me thinks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will put them in as warnings, then it's clear it's not fatal but we still see the issues. will also add a orca.trigger.retry metric

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dreynaud likes

public enum TriggerSource {
SCHEDULER,
MISSEDSCHEDULER,
EVENT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be interesting to have a different value for the different types of events

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as in jenkins vs. docker, etc?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added to logOrcaErrorMetric

@dreynaud dreynaud self-requested a review May 21, 2019 17:09
@dreynaud
Copy link
Contributor

I APPROVE THIS

String orcaResponse = "N/A";

if (e.getResponse() != null && e.getResponse().getBody() != null) {
new String(((TypedByteArray) e.getResponse().getBody()).getBytes());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be assigned to something? like orcaResponse?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

um... yes. thank you! i refactored this and messed up

@emjburns
Copy link
Contributor

@ezimanyi you may be interested in this!

@marchello2000 marchello2000 merged commit ae23f03 into spinnaker:master May 21, 2019
marchello2000 added a commit to marchello2000/echo that referenced this pull request May 23, 2019
…ution

previous change (spinnaker#555) moved orca invocation
to an executor thread without propagating MDC context.
marchello2000 added a commit to marchello2000/echo that referenced this pull request May 23, 2019
…ution

previous change (spinnaker#555) moved orca invocation
to an executor thread without propagating MDC context.
marchello2000 added a commit that referenced this pull request May 24, 2019
…ution (#560)

previous change (#555) moved orca invocation
to an executor thread without propagating MDC context.
@marchello2000 marchello2000 deleted the mark/logging_extravaganza branch March 11, 2020 22:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants