
fix(aws): Make deploy atomic operation more resilient to AWS failure #2463

Merged
4 commits merged into spinnaker:master on Apr 5, 2018

Conversation

robzienert
Member

  • Adds some more logging
  • Will attempt to continue the deploy operation if the server group already exists and matches the desired state (see the sketch below)
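
A minimal sketch of that recovery path, assuming an autoScaling client, createRequest, asgName, and log from the surrounding handler, plus a hypothetical matchesDesiredState helper; the actual clouddriver code differs in its details:

import com.amazonaws.services.autoscaling.model.AlreadyExistsException
import com.amazonaws.services.autoscaling.model.DescribeAutoScalingGroupsRequest

try {
  autoScaling.createAutoScalingGroup(createRequest)
} catch (AlreadyExistsException e) {
  // Recovery path: only continue if an ASG with this name exists and looks like what we meant to create.
  def result = autoScaling.describeAutoScalingGroups(
    new DescribeAutoScalingGroupsRequest().withAutoScalingGroupNames(asgName)
  )
  def existing = result.autoScalingGroups.find { it.autoScalingGroupName == asgName }
  if (existing == null || !matchesDesiredState(existing, createRequest)) {  // matchesDesiredState is hypothetical
    throw e
  }
  log.debug("Determined pre-existing ASG is desired state, continuing...", e)
}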


// TODO rz - Make bester
Contributor


Remove or make bester.

  new DescribeAutoScalingGroupsRequest().withAutoScalingGroupNames(asgName)
)
if (result.autoScalingGroups.isEmpty()) {
  // Curious...
Contributor


Can you add more explanation here?

Contributor

@ajordens left a comment


seems reasonable to me -- I'd still probably err on the side of sending more requests to AWS than doing exponential backoff (more requests == higher chance of success?)

  targetAutoScaling.putScheduledUpdateGroupAction(request)
} catch (AlreadyExistsException e) {
  // This should never happen as the name is generated with a UUID.
  log.info("Scheduled action already exists on ASG, continuing: $request")
Contributor


Maybe a warn?

For better or worse, I try to follow a logging convention like log.warn("Scheduled action already exists on ASG (request: {})", request)

🤷‍♂️
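
For reference, that convention is standard SLF4J parameterized logging, as opposed to the Groovy string interpolation in the hunk above; a minimal sketch of the two forms:

// Groovy GString interpolation builds the message eagerly, whether or not the level is enabled:
log.info("Scheduled action already exists on ASG, continuing: $request")

// The SLF4J parameterized form defers formatting until the log level is actually enabled:
log.warn("Scheduled action already exists on ASG (request: {})", request)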

}
log.debug("Determined pre-existing ASG is desired state, continuing...", e)
}
}, 5, 1000, true)
Contributor


I normally don't do exponential backoff

Contributor


why not?

Contributor


beyond a particular threshold there is just too much delay between attempts -- at this point we'd be better off just retrying more frequently with a reasonable fixed backoff

Contributor


This case would result in sleeps of 1s, 2s, 4s, 8s, 16s. A 16s sleep seems excessive, but conceptually exponential backoff seems a good approach to AWS throttling or failures that are recoverable and load-dependent.

If we change the exponential base from 2 to 1.5 in kork, this case would result in retry sleeps of 1s, 1.5s, 2.25s, 3.375s, 5.062s, which seems more reasonable.
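
To make the comparison concrete, each delay above is initialMs * base^n for attempt n; a quick sketch (the parameter names are illustrative, not kork's):

def backoffDelays(long initialMs, double base, int attempts) {
  // delay before attempt n (0-indexed) = initialMs * base^n
  (0..<attempts).collect { n -> (long) (initialMs * Math.pow(base, n)) }
}

println backoffDelays(1000, 2.0, 5)   // [1000, 2000, 4000, 8000, 16000]
println backoffDelays(1000, 1.5, 5)   // [1000, 1500, 2250, 3375, 5062]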

Member Author


I think I'll keep the exponential backoff, but having a knob exposed inside of kork to adjust the exponential base, like Asher suggests, is a great idea.

Member Author


Seen here: spinnaker/kork#141

Contributor


The AWS SDK is already retrying exponentially; I'm still not convinced that we need to layer our own exponential retries on top of that, even with the ability to manipulate the base.

🤷‍♂️
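
For context, that SDK-level retry behavior is configurable on the v1 AWS Java SDK client; a minimal sketch (the retry count here is illustrative, not what clouddriver configures):

import com.amazonaws.ClientConfiguration
import com.amazonaws.retry.PredefinedRetryPolicies
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder

// The SDK's default retry policy already applies exponential backoff to throttling and 5xx errors;
// raising the max retry count only changes how many of those built-in attempts are made.
def clientConfiguration = new ClientConfiguration()
  .withRetryPolicy(PredefinedRetryPolicies.getDefaultRetryPolicyWithCustomMaxRetries(10))

def autoScaling = AmazonAutoScalingClientBuilder.standard()
  .withClientConfiguration(clientConfiguration)
  .build()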

if (result.autoScalingGroups.isEmpty()) {
  // Curious...
  log.error("Attempted to find pre-existing ASG but none was found: $asgName")
  return false
Contributor


To get here, we 1) attempted to create $asgName and got an AlreadyExistsException, then 2) checked whether $asgName exists and found that it doesn't. That is indeed curious and worth logging, but why not still retry createAutoScalingGroup under this condition?
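
One way to act on that suggestion (a hypothetical sketch; retryCreate and createRequest are assumed names, not code from this PR):

if (result.autoScalingGroups.isEmpty()) {
  // AWS said the ASG already exists, but a follow-up describe can't see it (deleted in between,
  // or an eventually-consistent / transient response). Instead of failing, fall back to retrying the create.
  log.error("Attempted to find pre-existing ASG but none was found, retrying create: $asgName")
  return retryCreate(createRequest)  // hypothetical helper that re-invokes createAutoScalingGroup
}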

@robzienert
Member Author

Addressed feedback, thanks for the eyeballs.

@robzienert
Member Author

Updated the PR to not use exponential backoff. I also bumped retries from 5x -> 10x.
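
For illustration only, the fixed-backoff behavior that change implies looks roughly like this (a sketch, not kork's retry implementation):

def retry(Closure operation, int maxRetries, long intervalMs) {
  for (int attempt = 1; ; attempt++) {
    try {
      return operation.call()
    } catch (Exception e) {
      if (attempt >= maxRetries) {
        throw e
      }
      Thread.sleep(intervalMs)  // fixed delay between attempts; no exponential growth
    }
  }
}

// e.g. retry({ createServerGroup(description) }, 10, 1000) -- up to 10 attempts, 1s apart (createServerGroup is hypothetical)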

@dreynaud
Contributor

dreynaud commented Apr 4, 2018 via email

@ajordens
Contributor

ajordens commented Apr 5, 2018

I'm inclined to see this merge as it's an improvement over what's already in place.

I do agree with the general sentiment of pulling retries up to orca (retry for 2hrs > manual retry for a few minutes) but not sure we can do this until we have idempotent requests to clouddriver.

Make sense?

@robzienert
Member Author

I'm leaning towards agreement with Adam on this one. A personal objective for me is to get clouddriver operations to be idempotent, but I don't want that end goal to be a blocker on incremental improvement.

Very happy to talk offline about how we can move the ball towards that goal.

@robzienert merged commit 8386dc9 into spinnaker:master on Apr 5, 2018