
Ceph: mon: default pod anti-affinity applied multiple times #4998

Closed
BlaineEXE opened this issue Mar 10, 2020 · 4 comments · Fixed by #5072

BlaineEXE commented Mar 10, 2020

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

The Rook operator applies the default rule for spreading monitors across different hosts multiple times:
mon-a: one instance of the rule
mon-b: two instances
mon-c: three instances

This is likely due to an unnecessary append in a for loop (see the sketch below).

This applies only to the mon canary deployments, not to the regular mon deployments.

Expected behavior:

Each mon deployment/podSpec should have the default rule applied only once.
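
For illustration, here is a minimal Go sketch of the suspected pattern (hypothetical code, not the actual Rook source): appending the default rule inside the per-mon loop, rather than once per pod spec, reproduces exactly the one/two/three counts reported above.

    package main

    import "fmt"

    func main() {
        // Hypothetical illustration of the suspected bug: the default host
        // anti-affinity rule is appended to a list that outlives the loop,
        // so each successive mon canary picks up one more copy of it.
        const defaultRule = "spread mons across kubernetes.io/hostname"
        var rules []string
        for _, mon := range []string{"a", "b", "c"} {
            rules = append(rules, defaultRule) // appended once per iteration
            fmt.Printf("mon-%s: %d instance(s) of the default rule\n", mon, len(rules))
        }
        // Output: mon-a: 1, mon-b: 2, mon-c: 3, matching the report.
    }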

BlaineEXE added this to the 1.3 milestone on Mar 10, 2020
travisn commented Mar 20, 2020

A report elsewhere of the same issue: the mons were restarted during an orchestration and had this anti-affinity before the operator restart:

    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - rook-ceph-mon
          topologyKey: failure-domain.beta.kubernetes.io/zone
        weight: 100
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: rook-ceph-mon
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchLabels:
            app: rook-ceph-mon
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchLabels:
            app: rook-ceph-mon
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchLabels:
            app: rook-ceph-mon
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchLabels:
            app: rook-ceph-mon
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchLabels:
            app: rook-ceph-mon
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchLabels:
            app: rook-ceph-mon
        topologyKey: kubernetes.io/hostname

After the orchestration update, all of those were removed except one.

BlaineEXE changed the title from "Ceph: mon: host anti-affinity applied multiple times" to "Ceph: mon: default pod anti-affinity applied multiple times" on Mar 20, 2020
travisn commented Mar 20, 2020

After investigating, I'm still not able to reproduce it or to see by code inspection how this could happen.

The mon anti-affinity is updated here. Since it is an "append" operation on the existing list of anti-affinity, I suspect that Rook is somehow continuously appending the same anti-affinity. However, I still don't see how that is possible: the pod that is passed to the setPodPlacement() method is created from scratch with every reconcile loop, so there is no re-use of the deployment spec from a running mon. The makeDeployment() method creates the spec, then passes it on to setPodPlacement().
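
To illustrate why the append looks harmless on inspection, here is a minimal, self-contained Go sketch (simplified stand-in data, not the Rook code): when each reconcile appends the default rule to a freshly built pod-spec list and the source list is never written back to, nothing accumulates, so the duplication has to come from some shared state being modified.

    package main

    import "fmt"

    // applyDefaultRule receives the slice header by value and returns a new
    // slice; the caller's own list is left untouched, which is why the append,
    // viewed in isolation, does not look like it could accumulate rules.
    func applyDefaultRule(rules []string) []string {
        return append(rules, "default host anti-affinity")
    }

    func main() {
        // One user-specified rule, standing in for the cluster CR placement.
        crRules := []string{"user-specified zone anti-affinity"}

        for _, mon := range []string{"a", "b", "c"} {
            podSpecRules := applyDefaultRule(crRules) // fresh pod spec each reconcile
            fmt.Printf("mon-%s pod spec: %d rule(s); CR placement still has %d\n",
                mon, len(podSpecRules), len(crRules))
        }
        // Every mon gets 2 rules and the CR placement keeps 1: no accumulation,
        // matching the expectation that a from-scratch spec per reconcile is safe.
    }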

Now to find more details on the repro...

BlaineEXE commented Mar 23, 2020

By code inspection, I'm at a loss as well, though I can reproduce it pretty easily. Perhaps it's necessary to apply some anti-affinity in the CephCluster CR for the bug to show? I'm wondering if scheduling is somehow using a reference to the mons' internal representation of the cluster CR and updating that when it shouldn't.

travisn commented Mar 23, 2020

Ok, found the cause...

The default pod anti-affinity that Rook adds automatically for the mons is intended to be appended to any anti-affinity specified in the cluster CR.

There is a long-standing bug in the ApplyToPodSpec() method: when the anti-affinity is appended, it is added not only to the pod spec; the append also modifies the original placement spec. Thus, each mon that is started has one more anti-affinity clause than the previous mon.

This condition is rarely noticed because it is usually only hit in the canary pods, and since those pods are deleted immediately, the duplicate affinity clauses have no side effects. It becomes an issue when the mons are backed by a PVC: in that case the mons do not have node affinity, so they hit the code path that appends to the anti-affinity and thus modifies the original anti-affinity. Note that the guilty code path is skipped earlier if a node selector is present.
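
As a minimal, self-contained sketch of the aliasing described above (simplified stand-in types, not the actual Rook code, and not necessarily how #5072 fixes it): when the default rule is appended through an anti-affinity object shared by the placement, the placement itself grows, so each successive mon inherits one more copy of the rule; deep-copying the anti-affinity before appending is one way to avoid that.

    package main

    import "fmt"

    // Simplified stand-ins for the Kubernetes/Rook types involved.
    type PodAffinityTerm struct{ TopologyKey string }

    type PodAntiAffinity struct{ Required []PodAffinityTerm }

    // Placement is shared: every mon's pod spec is built from this one value.
    type Placement struct{ PodAntiAffinity *PodAntiAffinity }

    type PodSpec struct{ AntiAffinity *PodAntiAffinity }

    // Buggy: appending through the shared pointer grows the placement's own
    // list, so every later mon inherits one more copy of the default rule.
    func applyToPodSpecBuggy(p Placement, spec *PodSpec) {
        p.PodAntiAffinity.Required = append(p.PodAntiAffinity.Required,
            PodAffinityTerm{TopologyKey: "kubernetes.io/hostname"})
        spec.AntiAffinity = p.PodAntiAffinity
    }

    // One possible fix: deep-copy the anti-affinity before appending so the
    // placement from the cluster CR is never modified.
    func applyToPodSpecFixed(p Placement, spec *PodSpec) {
        copied := &PodAntiAffinity{
            Required: append([]PodAffinityTerm(nil), p.PodAntiAffinity.Required...),
        }
        copied.Required = append(copied.Required,
            PodAffinityTerm{TopologyKey: "kubernetes.io/hostname"})
        spec.AntiAffinity = copied
    }

    func main() {
        placement := Placement{PodAntiAffinity: &PodAntiAffinity{}}
        for _, mon := range []string{"a", "b", "c"} {
            spec := &PodSpec{}
            applyToPodSpecBuggy(placement, spec)
            fmt.Printf("mon-%s: %d anti-affinity rule(s)\n", mon, len(spec.AntiAffinity.Required))
        }
        // Prints 1, 2, 3, the accumulation reported above; switching to
        // applyToPodSpecFixed keeps every mon at exactly one rule.
    }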
