fix(RRB): disable traffic on the source server group only #3037

marchello2000 · 2019-07-15T21:24:20Z

Make sure RRB disable step disabled the same number of instances as it brought online,
and only in the source server group.

Imagine there are multiple active server groups for RRB:

v001 (capacity: 6, active)
v002 (capacity: 10, active)

we are deploying v003, with the following steps: 50%, 100%

Currently, RRB will do the following:

Spin up 5 instances in v003 (50% of the source which is v002)
Disable 3 instances in v001 and 5 in v002 for a total of 8!
Spin up another 5 in v003 (100%)
Disable 3 and 5 respectively

Effectively we went from capacity 16 to 10 and are now under provisioned.
This is an edge case, but we want it to have predictable behavior that doesn't leave the cluster under provisioned

marchello2000 · 2019-07-15T21:25:28Z

Note: this is based on a branch for another PR (#3031) so you only need to review the last commit 9c8ce49.

I will rebase when the source branch gets merged

dreynaud · 2019-07-17T18:31:41Z

...ix/spinnaker/orca/clouddriver/pipeline/servergroup/strategies/RollingRedBlackStrategy.groovy

 @Slf4j
 @Component
 class RollingRedBlackStrategy implements Strategy, ApplicationContextAware {
  final String name = "rollingredblack"

  @Autowired
-  DisableClusterStage disableClusterStage
+  DisableServerGroupStage disableServerGroupStage


very elegant change, my mouth is agape in amazement

i thought it was clever :)

Make sure RRB disable step disabled the same number of instances as it brought online, and only in the source server group. Imagine there are multiple active server groups for RRB: - v001 (capacity: 6, active) - v002 (capacity: 10, active) we are deploying v003, with the following steps: 50%, 100% Currently, RRB will do the following: 1. Spin up 5 instances in v003 (50% of the source which is v002) 2. Disable 3 instances in v001 and 5 in v002 for a total of 8! 3. Spin up another 5 in v003 (100%) 4. Disable 3 and 5 respectively Effectively we went from capacity 16 to 10 and are now under provisioned. This is an edge case, but we want it to have predictable behavior that doesn't leave the cluster under provisioned

As a result of spinnaker#3037, the RRB flow now no longer bypasses traffic guards (it used to do that before, here: https://github.com/spinnaker/orca/blob/master/orca-clouddriver/src/main/groovy/com/netflix/spinnaker/orca/clouddriver/tasks/cluster/AbstractClusterWideClouddriverTask.groovy#L145 `shouldSkipTrafficGuards`.) Instead, now when we run the first disable step in RRB (e.g. disable 5% of source), the traffic guards think the whole ASG is going away as it is unaware of the percentage disable flag and thus it terminates the operation since "disabling entire ASG while new is still not fully up" is most certainly unsafe. For now, add back the logic we had in the `AbstractClusterWideTask` to `DisableServerGroupTask` to skip check on RRB deploys. In the future, we should not have artisanal checks like this and instead have traffic guards understand that a percentage of an ASG is going away and validate against that percentage.

As a result of #3037, the RRB flow now no longer bypasses traffic guards (it used to do that before, here: https://github.com/spinnaker/orca/blob/master/orca-clouddriver/src/main/groovy/com/netflix/spinnaker/orca/clouddriver/tasks/cluster/AbstractClusterWideClouddriverTask.groovy#L145 `shouldSkipTrafficGuards`.) Instead, now when we run the first disable step in RRB (e.g. disable 5% of source), the traffic guards think the whole ASG is going away as it is unaware of the percentage disable flag and thus it terminates the operation since "disabling entire ASG while new is still not fully up" is most certainly unsafe. For now, add back the logic we had in the `AbstractClusterWideTask` to `DisableServerGroupTask` to skip check on RRB deploys. In the future, we should not have artisanal checks like this and instead have traffic guards understand that a percentage of an ASG is going away and validate against that percentage.

marchello2000 requested a review from dreynaud July 15, 2019 21:24

dreynaud approved these changes Jul 17, 2019

View reviewed changes

marchello2000 force-pushed the mark/spin-4593 branch from 9c8ce49 to c9449c9 Compare July 17, 2019 20:38

marchello2000 merged commit 75bcafe into spinnaker:master Jul 17, 2019

marchello2000 deleted the mark/spin-4593 branch July 17, 2019 21:30

spinnakerbot added the target-release/1.16 label Jul 17, 2019

marchello2000 mentioned this pull request Jul 22, 2019

fix(RRB): Do not consider traffic guards during RRB deploy #3054

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(RRB): disable traffic on the source server group only #3037

fix(RRB): disable traffic on the source server group only #3037

marchello2000 commented Jul 15, 2019

marchello2000 commented Jul 15, 2019

dreynaud Jul 17, 2019

marchello2000 Jul 17, 2019

fix(RRB): disable traffic on the source server group only #3037

fix(RRB): disable traffic on the source server group only #3037

Conversation

marchello2000 commented Jul 15, 2019

marchello2000 commented Jul 15, 2019

dreynaud Jul 17, 2019

Choose a reason for hiding this comment

marchello2000 Jul 17, 2019

Choose a reason for hiding this comment