
Ceph: mon: default pod anti-affinity applied multiple times #4998

Closed
BlaineEXE opened this issue Mar 10, 2020 · 4 comments · Fixed by #5072

BlaineEXE commented Mar 10, 2020

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

The Rook operator applies the default rule for spreading monitors across different hosts multiple times:
mon-a: one instance of the rule
mon-b: two instances
mon-c: three instances

This is likely due to an unnecessary append in a for loop (see the sketch below).

This applies only to the mon canary deployments, not to the regular mon deployments.

Expected behavior:

Each mon deployment/podSpec should have the default rule applied only once.
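
For illustration, here is a minimal Go sketch of the suspected pattern (hypothetical code, not the actual Rook source): appending the default rule inside the per-mon loop, rather than once per pod spec, reproduces exactly the one/two/three counts reported above.

    package main

    import "fmt"

    func main() {
        // Hypothetical illustration of the suspected bug: the default host
        // anti-affinity rule is appended to a list that outlives the loop,
        // so each successive mon canary picks up one more copy of it.
        const defaultRule = "spread mons across kubernetes.io/hostname"
        var rules []string
        for _, mon := range []string{"a", "b", "c"} {
            rules = append(rules, defaultRule) // appended once per iteration
            fmt.Printf("mon-%s: %d instance(s) of the default rule\n", mon, len(rules))
        }
        // Output: mon-a: 1, mon-b: 2, mon-c: 3, matching the report.
    }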

BlaineEXE added this to the 1.3 milestone on Mar 10, 2020
travisn commented Mar 20, 2020

A report elsewhere of the same issue: the mons were restarted during an orchestration and had this anti-affinity before the operator restart:

    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - rook-ceph-mon
          topologyKey: failure-domain.beta.kubernetes.io/zone
        weight: 100
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: rook-ceph-mon
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchLabels:
            app: rook-ceph-mon
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchLabels:
            app: rook-ceph-mon
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchLabels:
            app: rook-ceph-mon
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchLabels:
            app: rook-ceph-mon
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchLabels:
            app: rook-ceph-mon
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchLabels:
            app: rook-ceph-mon
        topologyKey: kubernetes.io/hostname

After the orchestration update, all of those were removed except one.

BlaineEXE changed the title from "Ceph: mon: host anti-affinity applied multiple times" to "Ceph: mon: default pod anti-affinity applied multiple times" on Mar 20, 2020
travisn commented Mar 20, 2020

After investigating, I'm still not able to reproduce it or to see by code inspection how this could happen.

The mon anti-affinity is updated here. Since it is an "append" operation on the existing list of anti-affinity, I suspect that Rook is somehow continuously appending the same anti-affinity. However, I still don't see how that is possible: the pod that is passed to the setPodPlacement() method is created from scratch with every reconcile loop, so there is no re-use of the deployment spec from a running mon. The makeDeployment() method creates the spec, then passes it on to setPodPlacement().
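
To illustrate why the append looks harmless on inspection, here is a minimal, self-contained Go sketch (simplified stand-in data, not the Rook code): when each reconcile appends the default rule to a freshly built pod-spec list and the source list is never written back to, nothing accumulates, so the duplication has to come from some shared state being modified.

    package main

    import "fmt"

    // applyDefaultRule receives the slice header by value and returns a new
    // slice; the caller's own list is left untouched, which is why the append,
    // viewed in isolation, does not look like it could accumulate rules.
    func applyDefaultRule(rules []string) []string {
        return append(rules, "default host anti-affinity")
    }

    func main() {
        // One user-specified rule, standing in for the cluster CR placement.
        crRules := []string{"user-specified zone anti-affinity"}

        for _, mon := range []string{"a", "b", "c"} {
            podSpecRules := applyDefaultRule(crRules) // fresh pod spec each reconcile
            fmt.Printf("mon-%s pod spec: %d rule(s); CR placement still has %d\n",
                mon, len(podSpecRules), len(crRules))
        }
        // Every mon gets 2 rules and the CR placement keeps 1: no accumulation,
        // matching the expectation that a from-scratch spec per reconcile is safe.
    }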

Now to find more details on the repro...

BlaineEXE commented Mar 23, 2020

By code inspection, I'm at a loss as well, though I can reproduce it pretty easily. Perhaps it's necessary to apply some anti-affinity in the CephCluster CR for the bug to show? I'm wondering if scheduling is somehow using a reference to the mons' internal representation of the cluster CR and updating that when it shouldn't.

travisn commented Mar 23, 2020

Ok, found the cause...

The default pod anti-affinity that Rook adds automatically for the mons is intended to be appended to any anti-affinity specified in the cluster CR.

There is a long-standing bug in the ApplyToPodSpec() method: when the anti-affinity is appended, it is added not only to the pod spec; the append also modifies the original placement spec. Thus, each mon that is started has one more anti-affinity clause than the previous mon.

This condition is rarely noticed because it is usually only hit in the canary pods, and since those pods are deleted immediately, the duplicate affinity clauses have no side effects. It becomes an issue when the mons are backed by a PVC: in that case the mons do not have node affinity, so they hit the code path that appends to the anti-affinity and thus modifies the original anti-affinity. Note that the guilty code path is skipped earlier if a node selector is present.
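
As a minimal, self-contained sketch of the aliasing described above (simplified stand-in types, not the actual Rook code, and not necessarily how #5072 fixes it): when the default rule is appended through an anti-affinity object shared by the placement, the placement itself grows, so each successive mon inherits one more copy of the rule; deep-copying the anti-affinity before appending is one way to avoid that.

    package main

    import "fmt"

    // Simplified stand-ins for the Kubernetes/Rook types involved.
    type PodAffinityTerm struct{ TopologyKey string }

    type PodAntiAffinity struct{ Required []PodAffinityTerm }

    // Placement is shared: every mon's pod spec is built from this one value.
    type Placement struct{ PodAntiAffinity *PodAntiAffinity }

    type PodSpec struct{ AntiAffinity *PodAntiAffinity }

    // Buggy: appending through the shared pointer grows the placement's own
    // list, so every later mon inherits one more copy of the default rule.
    func applyToPodSpecBuggy(p Placement, spec *PodSpec) {
        p.PodAntiAffinity.Required = append(p.PodAntiAffinity.Required,
            PodAffinityTerm{TopologyKey: "kubernetes.io/hostname"})
        spec.AntiAffinity = p.PodAntiAffinity
    }

    // One possible fix: deep-copy the anti-affinity before appending so the
    // placement from the cluster CR is never modified.
    func applyToPodSpecFixed(p Placement, spec *PodSpec) {
        copied := &PodAntiAffinity{
            Required: append([]PodAffinityTerm(nil), p.PodAntiAffinity.Required...),
        }
        copied.Required = append(copied.Required,
            PodAffinityTerm{TopologyKey: "kubernetes.io/hostname"})
        spec.AntiAffinity = copied
    }

    func main() {
        placement := Placement{PodAntiAffinity: &PodAntiAffinity{}}
        for _, mon := range []string{"a", "b", "c"} {
            spec := &PodSpec{}
            applyToPodSpecBuggy(placement, spec)
            fmt.Printf("mon-%s: %d anti-affinity rule(s)\n", mon, len(spec.AntiAffinity.Required))
        }
        // Prints 1, 2, 3, the accumulation reported above; switching to
        // applyToPodSpecFixed keeps every mon at exactly one rule.
    }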
