
Felix EndpointStatusReporter: do NOT coalesce status flaps #10608


Merged
nelljerram merged 4 commits into projectcalico:master on Jul 2, 2025

Conversation

nelljerram
Member

test_rebuild_vm, in our OpenStack ST, flakes about 50% of the time, and I've traced this to the Felix status_reporter.go code, which has, for a long time, been written to coalesce flaps in endpoint status that happen faster than about 2s (that is, 2 times EndpointReportingDelaySeconds, which defaults to 1s).
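For anyone unfamiliar with the coalescing behaviour, here is a minimal toy sketch of the idea; this is an illustration under simplifying assumptions, not the actual status_reporter.go code (the real reporter uses separate queued/active dirty sets). Updates only mark an endpoint as dirty, and a periodic timer writes out the *latest* status per dirty endpoint, so a fast down-then-up flap collapses into a single write:

```go
// Toy sketch (NOT the actual Felix code) of coalescing status
// reporting: a down-then-up flap that completes within the reporting
// window collapses into one write of the final status, so the
// intermediate "down" never reaches the datastore.
package main

import (
	"fmt"
	"time"
)

func main() {
	const reportingDelay = time.Second // default EndpointReportingDelaySeconds

	latest := map[string]string{}       // most recent status per endpoint
	dirty := map[string]bool{}          // endpoints with unwritten changes
	updates := make(chan [2]string, 10) // (endpointID, status) pairs

	// Simulate a flap faster than the coalescing window.
	updates <- [2]string{"ep1", "down"}
	updates <- [2]string{"ep1", "up"}

	ticker := time.NewTicker(reportingDelay)
	defer ticker.Stop()
	for i := 0; i < 3; i++ {
		select {
		case u := <-updates:
			latest[u[0]] = u[1] // overwrite: the flap is merged away here
			dirty[u[0]] = true
		case <-ticker.C:
			for id := range dirty {
				fmt.Printf("write %s=%s to datastore\n", id, latest[id])
				delete(dirty, id)
			}
		}
	}
	// Prints only "write ep1=up to datastore"; the "down" was coalesced.
}
```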

The chain of events is:

  1. Rebuilding a VM means that the VM is destroyed and then recreated at the libvirt level.

  2. Felix observes this as the TAP interface going down, and then up again a few seconds later.

  3. Because of the EndpointStatusReporter coalescing, Felix fails to write the status-down update to etcd for the VM that is being rebuilt.

  4. The missing status-down update confuses the Neutron server: Neutron then fails to send a 'VIF plugged' notification to Nova after the VM has been recreated.

  5. Nova, not having received that 'VIF plugged' notification from Neutron, concludes that the recreation of the VM has failed, and so marks the VM as being in ERROR state.

To fix the OpenStack test_rebuild_vm scenario, I think EndpointStatusReporter needs to be changed so as NOT to coalesce status flaps. I'm not aware of anything else that will break if we propagate flaps through to etcd, nor of any other strong reason for the current coding; so this PR changes EndpointStatusReporter so as not to coalesce flaps, and enhances the UT to verify that flaps are seen in etcd.
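To make the intended change concrete, here is a hedged sketch of the non-coalescing direction. The pendingUpdate/statID names follow the review summary further down this page; the rest (the reporter struct, onUpdate, flush) is illustrative, not the exact Felix code:

```go
// Sketch of the non-coalescing approach: every status change is
// appended to a pendingUpdates FIFO and all entries are written out
// in order, so a down-then-up flap produces two datastore writes
// instead of one coalesced write. Names beyond pendingUpdate/statID
// are illustrative stand-ins.
package main

import "fmt"

type pendingUpdate struct {
	statID string // endpoint identifier
	status string // "up" or "down"
}

type reporter struct {
	pendingUpdates []*pendingUpdate
}

func (r *reporter) onUpdate(id, status string) {
	// No dirty-set lookup: flaps are queued, never merged.
	r.pendingUpdates = append(r.pendingUpdates, &pendingUpdate{statID: id, status: status})
}

func (r *reporter) flush() {
	for _, u := range r.pendingUpdates {
		fmt.Printf("write %s=%s to datastore\n", u.statID, u.status)
	}
	r.pendingUpdates = nil
}

func main() {
	r := &reporter{}
	r.onUpdate("ep1", "down") // VM destroyed
	r.onUpdate("ep1", "up")   // VM recreated
	r.flush()
	// Prints both writes, in order: ep1=down, then ep1=up.
}
```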

Note, we already had UTs with titles like "It SHOULD coalesce [flapping updates]", but in fact those UTs did not distinguish between the coalescing and non-coalescing behaviours! I've now re-titled them to "It should not coalesce [flapping updates]" and added further test steps that properly verify the non-coalescing behaviour.
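As a rough picture of what such a test checks, here is a self-contained Ginkgo sketch. It uses a toy recorder rather than the real fixtures (epUpdates, wlEPUpdateUp, etc.) in status_reporter_test.go, just to show how a test can actually distinguish coalescing from non-coalescing behaviour:

```go
// Hedged sketch: the real test drives the actual reporter; here a toy
// recorder stands in for it. The key point is asserting on *all*
// writes, in order, rather than only on the final status.
package statusrep_test

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestStatusRepSketch(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "statusrep sketch")
}

var _ = Describe("EndpointStatusReporter (toy stand-in)", func() {
	It("should not coalesce flapping updates", func() {
		var writes []string
		report := func(status string) { writes = append(writes, status) }

		report("down") // TAP interface goes down...
		report("up")   // ...and comes back within the delay window

		// A coalescing reporter would record only the final "up";
		// a non-coalescing one must record both, in order.
		Expect(writes).To(Equal([]string{"down", "up"}))
	})
})
```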

The previous, coalescing coding possibly dates back to the time when Felix was written in Python! But, as above, I'm not aware of any strong reason for it. Probably it was just a matter of "only the latest status matters, and we think no one downstream needs to see flaps, so let's save processing, network traffic and datastore writes by coalescing as much as possible." But now we have a scenario where the downstream consumer (specifically the Neutron server) does need to see the flaps.

Release Note

Bug fix: In Calico for OpenStack the operation to rebuild a VM could sometimes fail to complete successfully, with the VM getting stuck in ERROR state.

@Copilot Copilot AI review requested due to automatic review settings June 26, 2025 23:18
@nelljerram nelljerram requested a review from a team as a code owner June 26, 2025 23:18
@marvin-tigera marvin-tigera added this to the Calico v3.31.0 milestone Jun 26, 2025
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jun 26, 2025
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR modifies the Felix EndpointStatusReporter to stop coalescing flapping endpoint status updates so that all status changes are propagated. The changes include updating test cases to verify the non-coalescing behavior and refactoring the reporter to use a new pendingUpdates slice instead of the previously coalesced dirty sets.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Changed files:

1. felix/statusrep/status_reporter_test.go: Updated test cases to validate that endpoint flapping events are not coalesced.

2. felix/statusrep/status_reporter.go: Refactored the update queuing mechanism to use a pendingUpdates slice and removed the dirty-set logic.
Comments suppressed due to low confidence (2)

felix/statusrep/status_reporter.go:187

  • Consider adding a detailed comment here to explain the rationale behind switching from the coalescing logic (using queuedDirtyIDs and activeDirtyIDs) to the new pendingUpdates approach, to aid future maintainability and clarity.
				esr.pendingUpdates = append(esr.pendingUpdates, &pendingUpdate{statID: statID, status: status})

felix/statusrep/status_reporter_test.go:195

  • [nitpick] Consider including additional asserts or comments in this test case to further document and verify the expected behavior in flapping scenarios, ensuring that multiple rapid updates are properly handled.
				epUpdates <- &wlEPUpdateUp   // tick #1

@nelljerram nelljerram added docs-not-required Docs not required for this change and removed docs-pr-required Change is not yet documented labels Jun 26, 2025
Contributor

@aaaaaaaalex aaaaaaaalex left a comment


LGTM - thanks for digging into that one!

@nelljerram
Member Author

I've had a fully green calico-test run with Felix code including this change (via this PR for testing pre-merge code), so I feel fairly confident that this is in fact a fix for the test_rebuild_vm flake, and also has not broken anything else.

@nelljerram
Member Author

There was just this one CI test failure:

[batch=1 pid=12078] • Failure [14.305 seconds]
[batch=1 pid=12078] _BPF_ _BPF-SAFE_ BPF tests (ipv4 udp, ct=false, log=debug, tunnel=ipip, dsr=false) (kubernetes backend)
[batch=1 pid=12078] /go/src/github.com/projectcalico/calico/felix/fv/infrastructure/datastore_describe.go:40
[batch=1 pid=12078]   with a 3 node cluster
[batch=1 pid=12078]   /go/src/github.com/projectcalico/calico/felix/fv/bpf_test.go:1390
[batch=1 pid=12078]     with a policy allowing ingress to w[0][0] from all regular workloads
[batch=1 pid=12078]     /go/src/github.com/projectcalico/calico/felix/fv/bpf_test.go:1600
[batch=1 pid=12078]       Test load balancer service with no backend
[batch=1 pid=12078]       /go/src/github.com/projectcalico/calico/felix/fv/bpf_test.go:2055
[batch=1 pid=12078]         should not have connectivity from external client, and return connection refused [It]
[batch=1 pid=12078]         /go/src/github.com/projectcalico/calico/felix/fv/bpf_test.go:2090
[batch=1 pid=12078]
[batch=1 pid=12078]         Timed out after 1.001s.
[batch=1 pid=12078]         Expected
[batch=1 pid=12078]             <int>: 0
[batch=1 pid=12078]         to be >

I feel confident that's not related to the status reporting change, so will merge this PR.

@nelljerram nelljerram merged commit 906de42 into projectcalico:master Jul 2, 2025
4 of 5 checks passed