Skip to content

Fix CalicoNodeStatus update stuck in conflict #10555

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

theboringstuff
Copy link
Contributor

@theboringstuff theboringstuff commented Jun 13, 2025

Description

CalicoNodeStatus reporter may get stuck in a conflict trying to update object in k8s, thus leaving CalicoNodeStatus in outdated state. This conflict may happen due to syncer not sending object updates to reporter. Syncer may miss object updates if these updates happened during apiserver/etcd performance drops (e.g. due to slow disk), as described in the issue #8715. It is hard to trigger manually in real-world scenarios, but we frequently see this issue post-factum on our envs. This issue could be easily reproduced synthetically by making k8s objects updates directly through etcd, avoiding apiserver

We would like this also to be backported to calico 3.30/3.29

Related issues/PRs

fixes #8715

Todos

  • Tests
  • Documentation
  • Release note

Release Note

Fix that CalicoNodeStatus updates could get blocked by datastore errors

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

@theboringstuff theboringstuff requested a review from a team as a code owner June 13, 2025 04:51
@marvin-tigera marvin-tigera added this to the Calico v3.31.0 milestone Jun 13, 2025
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jun 13, 2025
@CLAassistant
Copy link

CLAassistant commented Jun 13, 2025

CLA assistant check
All committers have signed the CLA.

@theboringstuff theboringstuff changed the title Fix CalicoNodeStatus update stuck in conflict https://github.com/projectcalico/calico/issues/8715 Fix CalicoNodeStatus update stuck in conflict 8715 Jun 13, 2025
@theboringstuff theboringstuff changed the title Fix CalicoNodeStatus update stuck in conflict 8715 Fix CalicoNodeStatus update stuck in conflict Jun 13, 2025
Comment on lines +54 to +56
KubeConfig: apiconfig.KubeConfig{
Kubeconfig: os.Getenv("KUBECONFIG"),
},
Copy link
Contributor Author

@theboringstuff theboringstuff Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this change is really needed, but on my local environment, without this change make fv would silently skip whole nodestatus test suite, giving a false-positive success, while actually 0 tests were executed. This happens due to panic in scope of Describe, which ginkgo treats as success for some reason

@caseydavenport
Copy link
Member

/sem-approve

@caseydavenport caseydavenport added the docs-not-required Docs not required for this change label Jun 13, 2025
@marvin-tigera marvin-tigera removed the docs-pr-required Change is not yet documented label Jun 13, 2025
@marvin-tigera
Copy link
Contributor

Removing "merge-when-ready" label due to new commits

@theboringstuff
Copy link
Contributor Author

@caseydavenport , Hi, some tests have failed, one of them (related to linter) I have fixed, but for others I can not see any error information or artifacts to look at. Can you please help to understand why those tests failed

@caseydavenport
Copy link
Member

@theboringstuff I think the test failure was due to a bug in master that has since been fixed - you should be able to pull in the latest master to your PR and I believe those tests will pass now.

@theboringstuff
Copy link
Contributor Author

@caseydavenport , done, but semaphore is pending, can you please approve it

@caseydavenport
Copy link
Member

/sem-approve

@theboringstuff
Copy link
Contributor Author

@caseydavenport , check is still failing, I tried to reproduce locally one of them, but I am not sure I get the same error. Do I understand correctly that PR can not be merged until tests are passing?

@theboringstuff
Copy link
Contributor Author

theboringstuff commented Jun 24, 2025

@caseydavenport , Hi, do you have any suggestions what we should do to proceed? I would like to fix CI somehow, but without access to any informative output I can only guess what is the problem

@caseydavenport
Copy link
Member

@theboringstuff sorry - I have been out on PTO. I will take a look at this today!

@caseydavenport
Copy link
Member

I think the test failure is not related to your changes, so we can merge this.

@caseydavenport caseydavenport merged commit 7a43626 into projectcalico:master Jun 24, 2025
2 of 3 checks passed
@theboringstuff
Copy link
Contributor Author

@caseydavenport thank you! Can we also back-port it to v3.30 and v3.29? Should I create additional PRs to those release branches?

@caseydavenport
Copy link
Member

@theboringstuff yep! Feel free to create cherry-pick PRs to release-v3.30 and release-v3.29 and just link them to this one in the description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs-not-required Docs not required for this change merge-when-ready release-note-required Change has user-facing impact (no matter how small) squash-commits
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CalicoNodeStatus can't be updated when etcd performance is (temporarily) degraded
4 participants