
Felix EndpointStatusReporter: do NOT coalesce status flaps #10608


Merged
nelljerram merged 4 commits into projectcalico:master on Jul 2, 2025

Conversation

nelljerram
Member

test_rebuild_vm, in our OpenStack ST, flakes about 50% of the time, and I've traced this to the Felix status_reporter.go code, which has, for a long time, been written to coalesce flaps in endpoint status that happen faster than about 2s (that is, 2 times EndpointReportingDelaySeconds, which defaults to 1s).
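For anyone unfamiliar with the coalescing behaviour, here is a minimal toy sketch of the idea; this is an illustration under simplifying assumptions, not the actual status_reporter.go code (the real reporter uses separate queued/active dirty sets). Updates only mark an endpoint as dirty, and a periodic timer writes out the *latest* status per dirty endpoint, so a fast down-then-up flap collapses into a single write:

```go
// Toy sketch (NOT the actual Felix code) of coalescing status
// reporting: a down-then-up flap that completes within the reporting
// window collapses into one write of the final status, so the
// intermediate "down" never reaches the datastore.
package main

import (
	"fmt"
	"time"
)

func main() {
	const reportingDelay = time.Second // default EndpointReportingDelaySeconds

	latest := map[string]string{}       // most recent status per endpoint
	dirty := map[string]bool{}          // endpoints with unwritten changes
	updates := make(chan [2]string, 10) // (endpointID, status) pairs

	// Simulate a flap faster than the coalescing window.
	updates <- [2]string{"ep1", "down"}
	updates <- [2]string{"ep1", "up"}

	ticker := time.NewTicker(reportingDelay)
	defer ticker.Stop()
	for i := 0; i < 3; i++ {
		select {
		case u := <-updates:
			latest[u[0]] = u[1] // overwrite: the flap is merged away here
			dirty[u[0]] = true
		case <-ticker.C:
			for id := range dirty {
				fmt.Printf("write %s=%s to datastore\n", id, latest[id])
				delete(dirty, id)
			}
		}
	}
	// Prints only "write ep1=up to datastore"; the "down" was coalesced.
}
```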

The chain of events is:

  1. Rebuilding a VM means that the VM is destroyed and then recreated at the libvirt level.

  2. Felix observes this as the TAP interface going down, and then up again a few seconds later.

  3. Because of the EndpointStatusReporter coalescing, Felix fails to write the status-down update to etcd for the VM that is being rebuilt.

  4. The missing status-down update confuses the Neutron server: Neutron then fails to send a 'VIF plugged' notification to Nova after the VM has been recreated.

  5. Nova, not having received that 'VIF plugged' notification from Neutron, concludes that the recreation of the VM has failed, and so marks the VM as being in ERROR state.

To fix the OpenStack test_rebuild_vm scenario, I think EndpointStatusReporter needs to be changed so as NOT to coalesce status flaps. I'm not aware of anything else that will break if we propagate flaps through to etcd, nor of any other strong reason for the current coding; so this PR changes EndpointStatusReporter so as not to coalesce flaps, and enhances the UT to verify that flaps are seen in etcd.
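To make the intended change concrete, here is a hedged sketch of the non-coalescing direction. The pendingUpdate/statID names follow the review summary further down this page; the rest (the reporter struct, onUpdate, flush) is illustrative, not the exact Felix code:

```go
// Sketch of the non-coalescing approach: every status change is
// appended to a pendingUpdates FIFO and all entries are written out
// in order, so a down-then-up flap produces two datastore writes
// instead of one coalesced write. Names beyond pendingUpdate/statID
// are illustrative stand-ins.
package main

import "fmt"

type pendingUpdate struct {
	statID string // endpoint identifier
	status string // "up" or "down"
}

type reporter struct {
	pendingUpdates []*pendingUpdate
}

func (r *reporter) onUpdate(id, status string) {
	// No dirty-set lookup: flaps are queued, never merged.
	r.pendingUpdates = append(r.pendingUpdates, &pendingUpdate{statID: id, status: status})
}

func (r *reporter) flush() {
	for _, u := range r.pendingUpdates {
		fmt.Printf("write %s=%s to datastore\n", u.statID, u.status)
	}
	r.pendingUpdates = nil
}

func main() {
	r := &reporter{}
	r.onUpdate("ep1", "down") // VM destroyed
	r.onUpdate("ep1", "up")   // VM recreated
	r.flush()
	// Prints both writes, in order: ep1=down, then ep1=up.
}
```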

Note, we already had UTs with titles like "It SHOULD coalesce [flapping updates]", but in fact those UTs did not distinguish between the coalescing and non-coalescing behaviours! I've now re-titled them to "It should not coalesce [flapping updates]" and added further test steps that properly verify the non-coalescing behaviour.
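As a rough picture of what such a test checks, here is a self-contained Ginkgo sketch. It uses a toy recorder rather than the real fixtures (epUpdates, wlEPUpdateUp, etc.) in status_reporter_test.go, just to show how a test can actually distinguish coalescing from non-coalescing behaviour:

```go
// Hedged sketch: the real test drives the actual reporter; here a toy
// recorder stands in for it. The key point is asserting on *all*
// writes, in order, rather than only on the final status.
package statusrep_test

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestStatusRepSketch(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "statusrep sketch")
}

var _ = Describe("EndpointStatusReporter (toy stand-in)", func() {
	It("should not coalesce flapping updates", func() {
		var writes []string
		report := func(status string) { writes = append(writes, status) }

		report("down") // TAP interface goes down...
		report("up")   // ...and comes back within the delay window

		// A coalescing reporter would record only the final "up";
		// a non-coalescing one must record both, in order.
		Expect(writes).To(Equal([]string{"down", "up"}))
	})
})
```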

The previous, coalescing coding possibly dates back to the time when Felix was written in Python! But, as above, I'm not aware of any strong reason for it. Probably it was just a matter of "only the latest status matters, and we think no one downstream needs to see flaps, so let's save processing, network traffic and datastore writes by coalescing as much as possible." But now we have a scenario where the downstream consumer (specifically the Neutron server) does need to see the flaps.

Release Note

Bug fix: In Calico for OpenStack the operation to rebuild a VM could sometimes fail to complete successfully, with the VM getting stuck in ERROR state.

@Copilot Copilot AI review requested due to automatic review settings June 26, 2025 23:18
@nelljerram nelljerram requested a review from a team as a code owner June 26, 2025 23:18
@marvin-tigera marvin-tigera added this to the Calico v3.31.0 milestone Jun 26, 2025
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jun 26, 2025
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR modifies the Felix EndpointStatusReporter to stop coalescing flapping endpoint status updates so that all status changes are propagated. The changes include updating test cases to verify the non-coalescing behavior and refactoring the reporter to use a new pendingUpdates slice instead of the previously coalesced dirty sets.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Changed files:

1. felix/statusrep/status_reporter_test.go: Updated test cases to validate that endpoint flapping events are not coalesced.

2. felix/statusrep/status_reporter.go: Refactored the update queuing mechanism to use a pendingUpdates slice and removed the dirty-set logic.
Comments suppressed due to low confidence (2)

felix/statusrep/status_reporter.go:187

  • Consider adding a detailed comment here to explain the rationale behind switching from the coalescing logic (using queuedDirtyIDs and activeDirtyIDs) to the new pendingUpdates approach, to aid future maintainability and clarity.
				esr.pendingUpdates = append(esr.pendingUpdates, &pendingUpdate{statID: statID, status: status})

felix/statusrep/status_reporter_test.go:195

  • [nitpick] Consider including additional asserts or comments in this test case to further document and verify the expected behavior in flapping scenarios, ensuring that multiple rapid updates are properly handled.
				epUpdates <- &wlEPUpdateUp   // tick #1

@nelljerram nelljerram added docs-not-required Docs not required for this change and removed docs-pr-required Change is not yet documented labels Jun 26, 2025
Contributor

@aaaaaaaalex aaaaaaaalex left a comment


LGTM - thanks for digging into that one!

@nelljerram
Member Author

I've had a fully green calico-test run with Felix code including this change (via this PR for testing pre-merge code), so I feel fairly confident that this is in fact a fix for the test_rebuild_vm flake, and also has not broken anything else.

@nelljerram
Member Author

There was just this one CI test failure:

[batch=1 pid=12078] • Failure [14.305 seconds]
[batch=1 pid=12078] _BPF_ _BPF-SAFE_ BPF tests (ipv4 udp, ct=false, log=debug, tunnel=ipip, dsr=false) (kubernetes backend)
[batch=1 pid=12078] /go/src/github.com/projectcalico/calico/felix/fv/infrastructure/datastore_describe.go:40
[batch=1 pid=12078]   with a 3 node cluster
[batch=1 pid=12078]   /go/src/github.com/projectcalico/calico/felix/fv/bpf_test.go:1390
[batch=1 pid=12078]     with a policy allowing ingress to w[0][0] from all regular workloads
[batch=1 pid=12078]     /go/src/github.com/projectcalico/calico/felix/fv/bpf_test.go:1600
[batch=1 pid=12078]       Test load balancer service with no backend
[batch=1 pid=12078]       /go/src/github.com/projectcalico/calico/felix/fv/bpf_test.go:2055
[batch=1 pid=12078]         should not have connectivity from external client, and return connection refused [It]
[batch=1 pid=12078]         /go/src/github.com/projectcalico/calico/felix/fv/bpf_test.go:2090
[batch=1 pid=12078]
[batch=1 pid=12078]         Timed out after 1.001s.
[batch=1 pid=12078]         Expected
[batch=1 pid=12078]             <int>: 0
[batch=1 pid=12078]         to be >

I feel confident that's not related to the status reporting change, so will merge this PR.

@nelljerram nelljerram merged commit 906de42 into projectcalico:master Jul 2, 2025
4 of 5 checks passed