fix(aws): Make deploy atomic operation more resilient to AWS failure #2463
Conversation
robzienert commented on Mar 28, 2018
- Adds some more logging
- Will attempt to continue deploy operation if the server group already exists and matches desired state
Force-pushed from 2f14066 to d818d91
// TODO rz - Make bester
Remove or make bester.
new DescribeAutoScalingGroupsRequest().withAutoScalingGroupNames(asgName)
)
if (result.autoScalingGroups.isEmpty()) {
  // Curious...
Can you add more explanation here?
seems reasonable to me -- I'd still probably err on the side of sending more requests to AWS than doing exponential backoff (more requests == higher chance of success?)
targetAutoScaling.putScheduledUpdateGroupAction(request)
} catch (AlreadyExistsException e) {
  // This should never happen as the name is generated with a UUID.
  log.info("Scheduled action already exists on ASG, continuing: $request")
Maybe a warn?
For better or worse I try and follow a logging convention like "Scheduled action already exists on ASG (request: {})", request)
🤷♂️
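A minimal sketch of that convention, assuming SLF4J via Groovy's @Slf4j (the surrounding class name is hypothetical, not the PR's actual code). Keeping the static message separate from the payload means the string is only built if the level is enabled:

```groovy
import com.amazonaws.services.autoscaling.AmazonAutoScaling
import com.amazonaws.services.autoscaling.model.AlreadyExistsException
import com.amazonaws.services.autoscaling.model.PutScheduledUpdateGroupActionRequest
import groovy.util.logging.Slf4j

@Slf4j
class ScheduledActionSketch {
  void upsert(AmazonAutoScaling targetAutoScaling, PutScheduledUpdateGroupActionRequest request) {
    try {
      targetAutoScaling.putScheduledUpdateGroupAction(request)
    } catch (AlreadyExistsException e) {
      // Warn rather than info, and pass the request as a parameter instead of interpolating it
      // into the message, per the convention mentioned above.
      log.warn("Scheduled action already exists on ASG (request: {})", request)
    }
  }
}
```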
}
log.debug("Determined pre-existing ASG is desired state, continuing...", e)
}
}, 5, 1000, true)
I normally don't do exponential backoff
why not?
beyond a particular threshold there is just too much delay between attempts -- at this point we'd be better off just retrying more frequently with a reasonable fixed backoff
this case would result in sleeps of 1s, 2s, 4s, 8s, 16s. a 16s sleep seems excessive but conceptually, exponential backoff seems a good approach to aws throttling or failures that are recoverable and load dependent.
If we change the exponential base from 2 to 1.5 in kork, this case would result in retry sleeps of 1s, 1.5s, 2.25s, 3.375s, 5.062s which seems more reasonable.
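Not kork's actual API, just a minimal sketch of what a retry helper with a configurable exponential base could look like (names and signature are made up for illustration):

```groovy
// Illustrative only: a retry helper whose sleep grows by a configurable exponential base.
// With base = 2 the sleeps are 1s, 2s, 4s, 8s, 16s; with base = 1.5 they are roughly
// 1s, 1.5s, 2.25s, 3.38s, 5.06s, matching the progressions discussed above.
class SimpleRetrySupport {
  static <T> T retry(Closure<T> fn, int maxAttempts, long intervalMillis, double base) {
    for (int attempt = 0; ; attempt++) {
      try {
        return fn.call()
      } catch (Exception e) {
        if (attempt >= maxAttempts - 1) {
          throw e
        }
        // Sleep intervalMillis * base^attempt before the next try.
        Thread.sleep((long) (intervalMillis * Math.pow(base, attempt)))
      }
    }
  }
}

// Usage sketch: five attempts, 1s initial interval, base of 1.5.
// SimpleRetrySupport.retry({ autoScaling.describeAutoScalingGroups(request) }, 5, 1000, 1.5d)
```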
I think I'll keep the exp backoff, but having a knob exposed inside of kork to adjust the exponential base like Asher suggests is a great idea.
Seen here: spinnaker/kork#141
The AWS SDK is already retrying exponentially, I'm still not convinced that we need to layer our own exponential retries on top of that even with the ability to manipulate the base.
🤷♂️
if (result.autoScalingGroups.isEmpty()) {
  // Curious...
  log.error("Attempted to find pre-existing ASG but none was found: $asgName")
  return false
To get here, we 1) attempted to create $asgName and got an AlreadyExistsException, then 2) checked whether $asgName exists and found that it doesn't. That is indeed curious and worth logging, but why not still retry createAutoScalingGroup under this condition?
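For context, a rough sketch of the create-then-verify flow being discussed, assuming made-up class and helper names (matchesDesiredState is hypothetical; this is not the PR's actual code):

```groovy
import com.amazonaws.services.autoscaling.AmazonAutoScaling
import com.amazonaws.services.autoscaling.model.AlreadyExistsException
import com.amazonaws.services.autoscaling.model.AutoScalingGroup
import com.amazonaws.services.autoscaling.model.CreateAutoScalingGroupRequest
import com.amazonaws.services.autoscaling.model.DescribeAutoScalingGroupsRequest
import groovy.util.logging.Slf4j

@Slf4j
class AsgCreateSketch {
  /** Returns true if the ASG exists (created just now, or pre-existing in the desired state). */
  boolean createOrVerify(AmazonAutoScaling autoScaling, CreateAutoScalingGroupRequest request) {
    String asgName = request.autoScalingGroupName
    try {
      autoScaling.createAutoScalingGroup(request)
      return true
    } catch (AlreadyExistsException e) {
      def result = autoScaling.describeAutoScalingGroups(
        new DescribeAutoScalingGroupsRequest().withAutoScalingGroupNames(asgName))
      if (result.autoScalingGroups.isEmpty()) {
        // AlreadyExists was thrown but the ASG cannot be found. Per the comment above,
        // returning false here lets an outer retry attempt createAutoScalingGroup again.
        log.error("Attempted to find pre-existing ASG but none was found: $asgName")
        return false
      }
      // Otherwise compare the existing group against the desired state and continue if it matches.
      return matchesDesiredState(result.autoScalingGroups[0], request)
    }
  }

  /** Hypothetical comparison; the real check would cover whatever fields matter for the deploy. */
  private boolean matchesDesiredState(AutoScalingGroup existing, CreateAutoScalingGroupRequest desired) {
    existing.minSize == desired.minSize &&
      existing.maxSize == desired.maxSize &&
      existing.desiredCapacity == desired.desiredCapacity
  }
}
```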
Force-pushed from c4a97d8 to 844e46a
Addressed feedback, thanks for the eyeballs.
Force-pushed from 844e46a to a122efc
Updated the PR to not use exponential backoff. I also bumped retries from 5x -> 10x.
I don't feel very good about these long linear retries. I am generally of the opinion that if you didn't get a result after 3 attempts, you should probably bail and have the next layer deal with it. E.g. we could have clouddriver fail fast but signal that the error is retryable and orca could re-attempt later. Am I making sense?
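A hypothetical sketch of the "fail fast but signal retryable" idea; the exception type is invented here and does not exist in Spinnaker under this name:

```groovy
import com.amazonaws.AmazonServiceException

// Invented for illustration; not an existing Spinnaker class.
class RetryableCloudOperationException extends RuntimeException {
  RetryableCloudOperationException(String message, Throwable cause) {
    super(message, cause)
  }
}

// Instead of retrying inside the atomic operation, clouddriver could fail fast and tag
// the failure as retryable, leaving when (and for how long) to re-attempt up to orca.
def failFast(Closure awsCall) {
  try {
    return awsCall.call()
  } catch (AmazonServiceException e) {
    boolean retryable = e.errorType == AmazonServiceException.ErrorType.Service || e.statusCode == 429
    if (retryable) {
      throw new RetryableCloudOperationException("AWS call failed but looks retryable", e)
    }
    throw e
  }
}
```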
I'm inclined to see this merge as it's an improvement over what's already in place. I do agree with the general sentiment of pulling retries up to orca (retry for 2hrs > manual retry for a few minutes) but not sure we can do this until we have idempotent requests to clouddriver. Make sense?
I'm leaning towards agreement with Adam on this one. A personal objective for me is to get clouddriver operations to be idempotent, but I don't want that end goal to be a blocker on incremental improvement. Very happy to talk offline about how we can move the ball towards that goal.
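For illustration only, one shape idempotent requests could take: the caller supplies a stable key per logical operation, so re-submitting after a failure re-attaches to the original work rather than repeating the side effect. Nothing here is an existing clouddriver API:

```groovy
import java.util.concurrent.ConcurrentHashMap

// Illustration only: not an existing clouddriver API.
class IdempotentOperationRegistry {
  private final ConcurrentHashMap<String, Object> resultsByKey = new ConcurrentHashMap<String, Object>()

  // Re-submitting the same key returns the original result instead of running the operation again.
  Object submit(String idempotencyKey, Closure operation) {
    resultsByKey.computeIfAbsent(idempotencyKey) { k -> operation.call() }
  }
}

// Usage sketch: orca retries with the same key, the create is performed at most once.
// registry.submit("deploy-myapp-v003", { createAutoScalingGroup(request) })
```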