Ceph: mon: default pod anti-affinity applied multiple times #4998
Comments
A report elsewhere of the same issue: the mons were restarted during an orchestration. The mons had anti-affinity rules applied before the operator restart, and all of those rules were removed except one after the orchestration update.
After investigating, I'm still not able to repro it or determine by code inspection how this could happen. The mon anti-affinity is updated here. Since it is an "append" operation onto the existing list of anti-affinity rules, I suspect that somehow Rook is continuously appending the same anti-affinity. However, I still don't see how that's possible: the pod that is passed to the setPodPlacement() method is created from scratch in every reconcile loop, and there is no reuse of the deployment spec from a running mon. The makeDeployment() method creates the spec, then passes it on to setPodPlacement(). Now to find more details on the repro...
By code inspection, I'm at a loss as well, though I can repro pretty easily. Perhaps it's necessary to apply some anti-affinity in the CephCluster CR for the bug to show? I'm wondering if scheduling is somehow using a reference to the internal representation of the cluster CR and updating it when it shouldn't.
Ok, found the cause... The default pod anti-affinity that Rook adds automatically for the mons is intended to be appended to any anti-affinity specified in the cluster CR. There is a long-standing bug in the ApplyToPodSpec() method: when the anti-affinity is appended, it is added not only to the pod spec but also modifies the original placement spec. Thus, each mon that is started gets one more anti-affinity clause than the previous mon. The condition is rarely noticed because it is commonly hit only in the canary pods; since those pods are immediately deleted, the duplicate affinity clauses have no side effects. It becomes a real issue when the mons are backed by a PVC: in that case the mons have no node affinity and hit the code path that appends to the anti-affinity, thereby mutating the original. Note that the guilty code path is skipped earlier if a node selector is present.
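The mutation described above is easy to reproduce with a minimal sketch. The types and names below are illustrative stand-ins, not the real Rook or Kubernetes API: the pod spec ends up holding the same pointer as the placement, so appending the default rule grows the shared struct on every call.

```go
package main

import "fmt"

// AntiAffinity and Placement are simplified stand-ins for the
// Kubernetes/Rook types; the names are illustrative only.
type AntiAffinity struct {
	RequiredTerms []string
}

type Placement struct {
	PodAntiAffinity *AntiAffinity
}

// applyToPodSpec mimics the buggy pattern: the pod "copies" the
// placement's anti-affinity by pointer, so the append below mutates
// the original placement, not just this pod's view of it.
func applyToPodSpec(p *Placement) *AntiAffinity {
	podAffinity := p.PodAntiAffinity // shared pointer, not a deep copy
	podAffinity.RequiredTerms = append(podAffinity.RequiredTerms,
		"spread-mons-across-hosts")
	return podAffinity
}

func main() {
	placement := &Placement{PodAntiAffinity: &AntiAffinity{}}

	// Each mon started from the same placement accumulates one more term.
	for _, mon := range []string{"mon-a", "mon-b", "mon-c"} {
		affinity := applyToPodSpec(placement)
		fmt.Printf("%s: %d term(s)\n", mon, len(affinity.RequiredTerms))
	}
	// Output:
	// mon-a: 1 term(s)
	// mon-b: 2 term(s)
	// mon-c: 3 term(s)
}
```

This matches the observed symptom: the counts grow by one per mon because every call appends to the same underlying struct.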
Is this a bug report or feature request?
Deviation from expected behavior:
The Rook operator applies the default rule to spread monitors over different hosts multiple times.
mon-a: has one instance
mon-b: has two instances
mon-c: has three instances
This is likely due to an unnecessary append in a for loop. This applies to the mon canary deployments only, not to the regular mon deployments.
Expected behavior:
Each mon deployment/podSpec should only have the default rule applied once.
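A hedged sketch of the usual remedy for this pattern (using the same illustrative stand-in types as above, not the real Rook fix): copy the slice before appending, so each pod spec gets its own default rule and the caller's placement is never mutated.

```go
package main

import "fmt"

// AntiAffinity is a simplified stand-in for the Kubernetes type.
type AntiAffinity struct {
	RequiredTerms []string
}

// applyDefault returns a new AntiAffinity with the default rule
// appended. It copies the source slice first, so the placement passed
// in by the caller is left untouched no matter how often this runs.
func applyDefault(src *AntiAffinity) *AntiAffinity {
	terms := make([]string, len(src.RequiredTerms))
	copy(terms, src.RequiredTerms)
	return &AntiAffinity{
		RequiredTerms: append(terms, "spread-mons-across-hosts"),
	}
}

func main() {
	placement := &AntiAffinity{}

	// Every mon now gets exactly one default rule, regardless of order.
	for _, mon := range []string{"mon-a", "mon-b", "mon-c"} {
		affinity := applyDefault(placement)
		fmt.Printf("%s: %d term(s)\n", mon, len(affinity.RequiredTerms))
	}
	// Output:
	// mon-a: 1 term(s)
	// mon-b: 1 term(s)
	// mon-c: 1 term(s)
}
```

The design point is that append on a slice reachable from a shared pointer can silently mutate shared state; copying into a fresh slice before appending makes the function side-effect free.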