Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(RRB): disable traffic on the source server group only #3037

Merged
merged 1 commit into from
Jul 17, 2019

Conversation

marchello2000
Copy link
Contributor

Make sure RRB disable step disabled the same number of instances as it brought online,
and only in the source server group.

Imagine there are multiple active server groups for RRB:

  • v001 (capacity: 6, active)
  • v002 (capacity: 10, active)

we are deploying v003, with the following steps: 50%, 100%

Currently, RRB will do the following:

  1. Spin up 5 instances in v003 (50% of the source which is v002)
  2. Disable 3 instances in v001 and 5 in v002 for a total of 8!
  3. Spin up another 5 in v003 (100%)
  4. Disable 3 and 5 respectively

Effectively we went from capacity 16 to 10 and are now under provisioned.
This is an edge case, but we want it to have predictable behavior that doesn't leave the cluster under provisioned

@marchello2000
Copy link
Contributor Author

Note: this is based on a branch for another PR (#3031) so you only need to review the last commit 9c8ce49.

I will rebase when the source branch gets merged

@Slf4j
@Component
class RollingRedBlackStrategy implements Strategy, ApplicationContextAware {
final String name = "rollingredblack"

@Autowired
DisableClusterStage disableClusterStage
DisableServerGroupStage disableServerGroupStage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very elegant change, my mouth is agape in amazement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought it was clever :)

Make sure RRB disable step disabled the same number of instances as it brought online,
and only in the source server group.

Imagine there are multiple active server groups for RRB:

- v001 (capacity: 6, active)
- v002 (capacity: 10, active)

we are deploying v003, with the following steps: 50%, 100%

Currently, RRB will do the following:
1. Spin up 5 instances in v003 (50% of the source which is v002)
2. Disable 3 instances in v001 and 5 in v002 for a total of 8!
3. Spin up another 5 in v003 (100%)
4. Disable 3 and 5 respectively

Effectively we went from capacity 16 to 10 and are now under provisioned.
This is an edge case, but we want it to have predictable behavior that doesn't leave the cluster under provisioned
@marchello2000 marchello2000 merged commit 75bcafe into spinnaker:master Jul 17, 2019
@marchello2000 marchello2000 deleted the mark/spin-4593 branch July 17, 2019 21:30
marchello2000 added a commit to marchello2000/orca that referenced this pull request Jul 22, 2019
As a result of spinnaker#3037, the RRB flow now no longer bypasses traffic guards
(it used to do that before, here: https://github.com/spinnaker/orca/blob/master/orca-clouddriver/src/main/groovy/com/netflix/spinnaker/orca/clouddriver/tasks/cluster/AbstractClusterWideClouddriverTask.groovy#L145 `shouldSkipTrafficGuards`.)

Instead, now when we run the first disable step in RRB (e.g. disable 5% of source), the traffic guards think
the whole ASG is going away as it is unaware of the percentage disable flag and thus it terminates the operation
since "disabling entire ASG while new is still not fully up" is most certainly unsafe.

For now, add back the logic we had in the `AbstractClusterWideTask` to `DisableServerGroupTask` to skip check on RRB deploys.
In the future, we should not have artisanal checks like this and instead have traffic guards understand that
a percentage of an ASG is going away and validate against that percentage.
marchello2000 added a commit that referenced this pull request Jul 22, 2019
As a result of #3037, the RRB flow now no longer bypasses traffic guards
(it used to do that before, here: https://github.com/spinnaker/orca/blob/master/orca-clouddriver/src/main/groovy/com/netflix/spinnaker/orca/clouddriver/tasks/cluster/AbstractClusterWideClouddriverTask.groovy#L145 `shouldSkipTrafficGuards`.)

Instead, now when we run the first disable step in RRB (e.g. disable 5% of source), the traffic guards think
the whole ASG is going away as it is unaware of the percentage disable flag and thus it terminates the operation
since "disabling entire ASG while new is still not fully up" is most certainly unsafe.

For now, add back the logic we had in the `AbstractClusterWideTask` to `DisableServerGroupTask` to skip check on RRB deploys.
In the future, we should not have artisanal checks like this and instead have traffic guards understand that
a percentage of an ASG is going away and validate against that percentage.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants