Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

controllers: limit reconciliations to relevant events #2621

Merged
merged 1 commit into from
May 23, 2024

Conversation

iamniting
Copy link
Member

@iamniting iamniting commented May 21, 2024

Noobaa and CephCluster CRs' status.conditions[].lastHeartbeatTime and status.conditions[].lastTransitionTime etc change frequently, causing unnecessary StorageCluster reconciliations. Implement logic to filter out these frequent, non-essential changes and trigger reconciliations only on relevant events.

Also add the owns for other objects as noobaa and cephcluster will trigger the recociles only for the relevant events. Previously, even though these objects were not owned by the controller, reconciliations were triggered due to status updates of the NooBaa and CephCluster CRs. Now the other objects should send thier own recocile requests because noobaa and ceph clkuster wont trigger the unnecessary reconciles and these objects can not rely on noobaa and ceph cluster.

Signed-off-by: Nitin Goyal nigoyal@redhat.com
Co-authored-by: Malay Kumar Parida mparida@redhat.com

Bug: https://bugzilla.redhat.com/show_bug.cgi?id=2280595

We are required to backport this till 4.14

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 21, 2024
@iamniting iamniting force-pushed the watch branch 5 times, most recently from 625805f to 31fb9a2 Compare May 21, 2024 06:38
@iamniting iamniting force-pushed the watch branch 2 times, most recently from 626a6a5 to 3bf381c Compare May 21, 2024 07:07
Owns(&cephv1.CephObjectStore{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
Owns(&cephv1.CephObjectStoreUser{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
Owns(&cephv1.CephRBDMirror{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
Owns(&corev1.PersistentVolumeClaim{}, builder.WithPredicates(predicate.GenerationChangedPredicate{}, pvcPredicate)).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Madhu-1 @umangachapagain Do we really need the owns for PersistentVolumeClaim?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this can be removed as well as we don't have any PVC Controller as of now. we removed it long back but we need to see if any other functionality depends on this one before removing it.

@iamniting
Copy link
Member Author

/test images

@iamniting
Copy link
Member Author

/test ocs-operator-bundle-e2e-aws

Copy link
Contributor

openshift-ci bot commented May 21, 2024

@parth-gr: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@nb-ohad
Copy link
Contributor

nb-ohad commented May 21, 2024

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2024
@leelavg
Copy link
Contributor

leelavg commented May 21, 2024

Small suggestion, can we only fix the known entities that are queuing reconciles rather than including that with a blanket change? Rationale is, in current context, it helps a lot fixing the main problem and only debug side effects directly related to that (if we hit any)?

@iamniting
Copy link
Member Author

/test ocs-operator-bundle-e2e-aws

@leelavg
Copy link
Contributor

leelavg commented May 22, 2024

not able to find the log in successful run to verify it's working or not, based on a failed run I still see reconciles getting logged every minute in the last ~15/119 events (basically once stabilized I guess, from 3:52 in below log)

log is from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/red-hat-storage_ocs-operator/2621/pull-ci-red-hat-storage-ocs-operator-main-ocs-operator-bundle-e2e-aws/1793114562673774592/artifacts/ocs-operator-bundle-e2e-aws/gather-extra/artifacts/pods/openshift-storage_ocs-operator-f6fd7455d-jxpkb_ocs-operator.log

[root@forge tmp]# grep 'Reconciling StorageCluster' ocs.log | jq -r .ts | tail -n20
2024-05-22T03:51:39Z
2024-05-22T03:51:41Z
2024-05-22T03:51:42Z
2024-05-22T03:52:30Z
2024-05-22T03:53:30Z
2024-05-22T03:54:31Z
2024-05-22T03:55:32Z
2024-05-22T03:56:33Z
2024-05-22T03:57:34Z
2024-05-22T03:58:35Z
2024-05-22T03:59:36Z
2024-05-22T04:00:37Z
2024-05-22T04:01:38Z
2024-05-22T04:02:39Z
2024-05-22T04:03:40Z
2024-05-22T04:04:41Z
2024-05-22T04:05:42Z
2024-05-22T04:06:43Z
2024-05-22T04:07:44Z
2024-05-22T04:08:45Z
[root@forge tmp]# grep 'Reconciling StorageCluster' ocs.log | jq -r .ts | wc -l
119

@iamniting
Copy link
Member Author

not able to find the log in successful run to verify it's working or not, based on a failed run I still see reconciles getting logged every minute in the last ~15/119 events (basically once stabilized I guess, from 3:52 in below log)

log is from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/red-hat-storage_ocs-operator/2621/pull-ci-red-hat-storage-ocs-operator-main-ocs-operator-bundle-e2e-aws/1793114562673774592/artifacts/ocs-operator-bundle-e2e-aws/gather-extra/artifacts/pods/openshift-storage_ocs-operator-f6fd7455d-jxpkb_ocs-operator.log

[root@forge tmp]# grep 'Reconciling StorageCluster' ocs.log | jq -r .ts | tail -n20
2024-05-22T03:51:39Z
2024-05-22T03:51:41Z
2024-05-22T03:51:42Z
2024-05-22T03:52:30Z
2024-05-22T03:53:30Z
2024-05-22T03:54:31Z
2024-05-22T03:55:32Z
2024-05-22T03:56:33Z
2024-05-22T03:57:34Z
2024-05-22T03:58:35Z
2024-05-22T03:59:36Z
2024-05-22T04:00:37Z
2024-05-22T04:01:38Z
2024-05-22T04:02:39Z
2024-05-22T04:03:40Z
2024-05-22T04:04:41Z
2024-05-22T04:05:42Z
2024-05-22T04:06:43Z
2024-05-22T04:07:44Z
2024-05-22T04:08:45Z
[root@forge tmp]# grep 'Reconciling StorageCluster' ocs.log | jq -r .ts | wc -l
119

They are every minute due to the cephcluster being updated its capacity field every minute. Earlier they were very frequent.

@leelavg
Copy link
Contributor

leelavg commented May 22, 2024

They are every minute due to the cephcluster being updated its capacity field every minute. Earlier they were very frequent.

  • Ack, unfortunately the worst case that I was aware of is reconciling every minute 😓.
  • As the fix can't be realised easily, is there anyway to quantify very frequent and provide some kind of metric/testing to take in this PR? Maybe, perform above filtering on other failed PRs but run similarly in CI to compare reconciles. (iirc, controller-runtime itself exports a metric on how many times it requeued events)

Nevertheless, if what I'm asking makes less sense, I could standby for others to approve PR.

Maybe the fix for bug is reduction of reconciles + an agreed/consistent logging guidelines.

@nb-ohad
Copy link
Contributor

nb-ohad commented May 22, 2024

@iamniting Are we sure the path for an array item looks like this? "Status.Conditions.LastTransitionTime" shouldn't it need a moniker for the item?

Noobaa and CephCluster CRs' status.conditions[].lastHeartbeatTime and
status.conditions[].lastTransitionTime etc change frequently, causing
unnecessary StorageCluster reconciliations. Implement logic to filter
out these frequent, non-essential changes and trigger reconciliations
only on relevant events.

Also add the owns for other objects as noobaa and cephcluster will
trigger the recociles only for the relevant events. Previously, even
though these objects were not owned by the controller, reconciliations
were triggered due to status updates of the NooBaa and CephCluster CRs.
Now the other objects should send thier own recocile requests because
noobaa and ceph clkuster wont trigger the unnecessary reconciles and
these objects can not rely on noobaa and ceph cluster.

Signed-off-by: Nitin Goyal <nigoyal@redhat.com>
Co-authored-by: Malay Kumar Parida <mparida@redhat.com>
@iamniting
Copy link
Member Author

@iamniting Are we sure the path for an array item looks like this? "Status.Conditions.LastTransitionTime" shouldn't it need a moniker for the item?

I have tested it, It works like this only.

@iamniting iamniting requested a review from nb-ohad May 23, 2024 01:59
@nb-ohad
Copy link
Contributor

nb-ohad commented May 23, 2024

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 23, 2024
@nb-ohad
Copy link
Contributor

nb-ohad commented May 23, 2024

LGTM - This seems good from my side, not merging to allow others to comment.

@nb-ohad
Copy link
Contributor

nb-ohad commented May 23, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 23, 2024
Copy link
Contributor

openshift-ci bot commented May 23, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iamniting, nb-ohad

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@iamniting
Copy link
Member Author

/cherry-pick release-4.16

@openshift-cherrypick-robot

@iamniting: once the present PR merges, I will cherry-pick it on top of release-4.16 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@iamniting
Copy link
Member Author

/cherry-pick release-4.15

@openshift-cherrypick-robot

@iamniting: once the present PR merges, I will cherry-pick it on top of release-4.15 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@iamniting
Copy link
Member Author

/cherry-pick release-4.14

@openshift-cherrypick-robot

@iamniting: once the present PR merges, I will cherry-pick it on top of release-4.14 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit 839beb1 into red-hat-storage:main May 23, 2024
11 checks passed
@openshift-cherrypick-robot

@iamniting: #2621 failed to apply on top of branch "release-4.16":

Applying: controllers: limit reconciliations to relevant events
Using index info to reconstruct a base tree...
M	controllers/storagecluster/storagecluster_controller.go
M	go.mod
Falling back to patching base and 3-way merge...
Auto-merging go.mod
Auto-merging controllers/storagecluster/storagecluster_controller.go
CONFLICT (content): Merge conflict in controllers/storagecluster/storagecluster_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 controllers: limit reconciliations to relevant events
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-cherrypick-robot

@iamniting: #2621 failed to apply on top of branch "release-4.15":

Applying: controllers: limit reconciliations to relevant events
Using index info to reconstruct a base tree...
M	controllers/ocsinitialization/ocsinitialization_controller.go
M	controllers/storagecluster/storagecluster_controller.go
M	go.mod
Falling back to patching base and 3-way merge...
Auto-merging go.mod
CONFLICT (content): Merge conflict in go.mod
Auto-merging controllers/storagecluster/storagecluster_controller.go
CONFLICT (content): Merge conflict in controllers/storagecluster/storagecluster_controller.go
Auto-merging controllers/ocsinitialization/ocsinitialization_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 controllers: limit reconciliations to relevant events
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-cherrypick-robot

@iamniting: #2621 failed to apply on top of branch "release-4.14":

Applying: controllers: limit reconciliations to relevant events
Using index info to reconstruct a base tree...
M	controllers/ocsinitialization/ocsinitialization_controller.go
M	controllers/storagecluster/storagecluster_controller.go
M	go.mod
Falling back to patching base and 3-way merge...
Auto-merging go.mod
CONFLICT (content): Merge conflict in go.mod
Auto-merging controllers/storagecluster/storagecluster_controller.go
CONFLICT (content): Merge conflict in controllers/storagecluster/storagecluster_controller.go
Auto-merging controllers/ocsinitialization/ocsinitialization_controller.go
CONFLICT (content): Merge conflict in controllers/ocsinitialization/ocsinitialization_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 controllers: limit reconciliations to relevant events
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants