
Make Zenith SSHD more robust #364

Merged

merged 5 commits into main on May 17, 2024
Conversation

assumptionsandg
Contributor

@assumptionsandg assumptionsandg commented Apr 24, 2024

  • Set SSHD replicas to 3 by default
  • Use topology spread constraints to spread those 3 pods over the workers as evenly as possible (i.e. maxSkew: 1)
  • Use a pod disruption budget with maxUnavailable: 1 to ensure that only one SSHD pod is evicted at a time
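The three bullet points above could be sketched in Kubernetes manifests roughly as follows. This is illustrative only: the resource names, labels, and image are assumptions, not taken from the actual chart.

```yaml
# Sketch only: names, labels and image are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zenith-sshd
spec:
  replicas: 3                       # 3 replicas by default
  selector:
    matchLabels:
      app.kubernetes.io/name: zenith-sshd
  template:
    metadata:
      labels:
        app.kubernetes.io/name: zenith-sshd
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                # spread pods over nodes as evenly as possible
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: zenith-sshd
      containers:
        - name: sshd
          image: zenith-sshd:latest  # hypothetical image reference
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zenith-sshd
spec:
  maxUnavailable: 1                 # evict at most one SSHD pod at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: zenith-sshd
```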

@assumptionsandg assumptionsandg force-pushed the feature/pdb-zenith branch 2 times, most recently from 1a71ad2 to 2edb4bc on April 24, 2024 14:28
@mkjpryor mkjpryor changed the title Add PodDisruptionBudget to Zenith SSHD Make Zenith SSHD more robust Apr 25, 2024
@mkjpryor
Member

@assumptionsandg There are more things than this that we need to do to make Zenith SSHD properly robust. I have changed the title and added some items in the description to reflect this.

@assumptionsandg assumptionsandg marked this pull request as ready for review May 1, 2024 11:28
@assumptionsandg assumptionsandg requested a review from a team as a code owner May 1, 2024 11:28
@mkjpryor
Member

mkjpryor commented May 17, 2024

Topology spread should work in an AIO (all-in-one) deployment, and will make sure that the pods are scheduled on different nodes by preference on an HA cluster, so I would prefer to have it back...

The reason for using topology spread rather than anti-affinity is specifically that you specify a maxSkew between nodes. So if there is one node and 3 replicas, all 3 should go on the same node, because there is no other node to compare skew against.

If it wasn't working, we should work out why.
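One possible culprit worth checking (my assumption, not confirmed anywhere in this thread): the `whenUnsatisfiable` mode of the constraint. With `DoNotSchedule` the constraint is hard, and pods can stay Pending if no placement satisfies the skew or the node lacks the `topologyKey` label; `ScheduleAnyway` treats the skew only as a scheduling preference.

```yaml
# Sketch: the two whenUnsatisfiable modes; labels are illustrative assumptions.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    # DoNotSchedule makes the constraint hard: pods stay Pending if the
    # skew cannot be satisfied or the node is missing the topologyKey label.
    # ScheduleAnyway makes it a soft preference instead.
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: zenith-sshd  # assumed label
```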

@mkjpryor
Member

@JohnGarbutt Why 2 and not 3 for the replicas? I have a slight preference for 3 as it will result in one per node in our default HA setup.

@JohnGarbutt
Member

It was failing on the single-node deployment, so I was mostly suggesting dialling a few things back to see if that got it working. Now that we have something working, we could always look at adding maxSkew back in... It looked like it was failing to create the first pod, but was OK if you already had one pod running, if I understand what we found out correctly?

@mkjpryor mkjpryor changed the title Make Zenith SSHD more robust Use pod disruption budget for Zenith SSHD May 17, 2024
@mkjpryor mkjpryor changed the title Use pod disruption budget for Zenith SSHD Make Zenith SSHD more robust May 17, 2024
Member

@mkjpryor mkjpryor left a comment


LGTM

@mkjpryor mkjpryor merged commit 0a79612 into main May 17, 2024
10 checks passed