
fix(aws): Make deploy atomic operation more resilient to AWS failure #2463

Merged
4 commits merged into spinnaker:master on Apr 5, 2018

Conversation

robzienert
Member

  • Adds some more logging
  • Will attempt to continue the deploy operation if the server group already exists and matches the desired state (see the sketch below)
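
A minimal sketch of that recovery path, assuming an autoScaling client, createRequest, asgName, and log from the surrounding handler, plus a hypothetical matchesDesiredState helper; the actual clouddriver code differs in its details:

import com.amazonaws.services.autoscaling.model.AlreadyExistsException
import com.amazonaws.services.autoscaling.model.DescribeAutoScalingGroupsRequest

try {
  autoScaling.createAutoScalingGroup(createRequest)
} catch (AlreadyExistsException e) {
  // Recovery path: only continue if an ASG with this name exists and looks like what we meant to create.
  def result = autoScaling.describeAutoScalingGroups(
    new DescribeAutoScalingGroupsRequest().withAutoScalingGroupNames(asgName)
  )
  def existing = result.autoScalingGroups.find { it.autoScalingGroupName == asgName }
  if (existing == null || !matchesDesiredState(existing, createRequest)) {  // matchesDesiredState is hypothetical
    throw e
  }
  log.debug("Determined pre-existing ASG is desired state, continuing...", e)
}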


// TODO rz - Make bester
Contributor


Remove or make bester.

  new DescribeAutoScalingGroupsRequest().withAutoScalingGroupNames(asgName)
)
if (result.autoScalingGroups.isEmpty()) {
  // Curious...
Contributor


Can you add more explanation here?

Contributor

@ajordens left a comment


seems reasonable to me -- I'd still probably err on the side of sending more requests to AWS than doing exponential backoff (more requests == higher chance of success?)

  targetAutoScaling.putScheduledUpdateGroupAction(request)
} catch (AlreadyExistsException e) {
  // This should never happen as the name is generated with a UUID.
  log.info("Scheduled action already exists on ASG, continuing: $request")
Contributor


Maybe a warn?

For better or worse, I try to follow a logging convention like log.warn("Scheduled action already exists on ASG (request: {})", request)

🤷‍♂️
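
For reference, that convention is standard SLF4J parameterized logging, as opposed to the Groovy string interpolation in the hunk above; a minimal sketch of the two forms:

// Groovy GString interpolation builds the message eagerly, whether or not the level is enabled:
log.info("Scheduled action already exists on ASG, continuing: $request")

// The SLF4J parameterized form defers formatting until the log level is actually enabled:
log.warn("Scheduled action already exists on ASG (request: {})", request)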

}
log.debug("Determined pre-existing ASG is desired state, continuing...", e)
}
}, 5, 1000, true)
Contributor


I normally don't do exponential backoff

Contributor


why not?

Contributor


beyond a particular threshold there is just too much delay between attempts -- at this point we'd be better off just retrying more frequently with a reasonable fixed backoff

Contributor


This case would result in sleeps of 1s, 2s, 4s, 8s, 16s. A 16s sleep seems excessive, but conceptually exponential backoff seems a good approach to AWS throttling or failures that are recoverable and load-dependent.

If we change the exponential base from 2 to 1.5 in kork, this case would result in retry sleeps of 1s, 1.5s, 2.25s, 3.375s, 5.062s, which seems more reasonable.
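
To make the comparison concrete, each delay above is initialMs * base^n for attempt n; a quick sketch (the parameter names are illustrative, not kork's):

def backoffDelays(long initialMs, double base, int attempts) {
  // delay before attempt n (0-indexed) = initialMs * base^n
  (0..<attempts).collect { n -> (long) (initialMs * Math.pow(base, n)) }
}

println backoffDelays(1000, 2.0, 5)   // [1000, 2000, 4000, 8000, 16000]
println backoffDelays(1000, 1.5, 5)   // [1000, 1500, 2250, 3375, 5062]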

Member Author


I think I'll keep the exponential backoff, but having a knob exposed inside of kork to adjust the exponential base, like Asher suggests, is a great idea.

Member Author


Seen here: spinnaker/kork#141

Contributor


The AWS SDK is already retrying exponentially; I'm still not convinced that we need to layer our own exponential retries on top of that, even with the ability to manipulate the base.

🤷‍♂️
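
For context, that SDK-level retry behavior is configurable on the v1 AWS Java SDK client; a minimal sketch (the retry count here is illustrative, not what clouddriver configures):

import com.amazonaws.ClientConfiguration
import com.amazonaws.retry.PredefinedRetryPolicies
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder

// The SDK's default retry policy already applies exponential backoff to throttling and 5xx errors;
// raising the max retry count only changes how many of those built-in attempts are made.
def clientConfiguration = new ClientConfiguration()
  .withRetryPolicy(PredefinedRetryPolicies.getDefaultRetryPolicyWithCustomMaxRetries(10))

def autoScaling = AmazonAutoScalingClientBuilder.standard()
  .withClientConfiguration(clientConfiguration)
  .build()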

if (result.autoScalingGroups.isEmpty()) {
  // Curious...
  log.error("Attempted to find pre-existing ASG but none was found: $asgName")
  return false
Contributor


To get here, we 1) attempted to create $asgName and got an AlreadyExistsException, then 2) checked whether $asgName exists and found that it doesn't. That is indeed curious and worth logging, but why not still retry createAutoScalingGroup under this condition?
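
One way to act on that suggestion (a hypothetical sketch; retryCreate and createRequest are assumed names, not code from this PR):

if (result.autoScalingGroups.isEmpty()) {
  // AWS said the ASG already exists, but a follow-up describe can't see it (deleted in between,
  // or an eventually-consistent / transient response). Instead of failing, fall back to retrying the create.
  log.error("Attempted to find pre-existing ASG but none was found, retrying create: $asgName")
  return retryCreate(createRequest)  // hypothetical helper that re-invokes createAutoScalingGroup
}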

@robzienert
Member Author

Addressed feedback, thanks for the eyeballs.

@robzienert
Member Author

Updated the PR to not use exponential backoff. I also bumped retries from 5x -> 10x.
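
For illustration only, the fixed-backoff behavior that change implies looks roughly like this (a sketch, not kork's retry implementation):

def retry(Closure operation, int maxRetries, long intervalMs) {
  for (int attempt = 1; ; attempt++) {
    try {
      return operation.call()
    } catch (Exception e) {
      if (attempt >= maxRetries) {
        throw e
      }
      Thread.sleep(intervalMs)  // fixed delay between attempts; no exponential growth
    }
  }
}

// e.g. retry({ createServerGroup(description) }, 10, 1000) -- up to 10 attempts, 1s apart (createServerGroup is hypothetical)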

@dreynaud
Contributor

dreynaud commented Apr 4, 2018 via email

@ajordens
Contributor

ajordens commented Apr 5, 2018

I'm inclined to see this merge as it's an improvement over what's already in place.

I do agree with the general sentiment of pulling retries up to orca (retry for 2hrs > manual retry for a few minutes) but not sure we can do this until we have idempotent requests to clouddriver.

Make sense?

@robzienert
Member Author

I'm leaning towards agreement with Adam on this one. A personal objective for me is to get clouddriver operations to be idempotent, but I don't want that end goal to be a blocker on incremental improvement.

Very happy to talk offline about how we can move the ball towards that goal.

@robzienert merged commit 8386dc9 into spinnaker:master on Apr 5, 2018