Fix missing routes caused by Node events coming out of order #1526

tnqn · 2020-11-10T13:36:52Z

PodCIDRs can be released from deleted Nodes and allocated to new Nodes.
On the server side, a PodCIDR is never allocated to more than one Node
at any point. On the client side, however, if a resync happens to occur
while there are Node creation and deletion events, the informer
generates the events so that all creation events come before the
deletion ones, even though they actually happened in the opposite order
on the server side. Therefore, a PodCIDR may appear in a new Node before
the Node that previously owned it is removed.
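To make the ordering problem concrete, here is a minimal sketch of the behaviour described above. It is a simplified model, not client-go code: `Event`, `relistEvents`, and the Node names and CIDR are hypothetical, and the only assumption is the one stated in this description, namely that after a relist the creations are delivered before the deletions.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a simplified stand-in for an informer notification (hypothetical type).
type Event struct {
	Type string // "ADD" or "DELETE"
	Node string
	CIDR string
}

// relistEvents models a relist: every listed Node that the local cache has
// not seen yields an ADD first, and only afterwards does each cached Node
// that is missing from the list yield a DELETE.
func relistEvents(cached, listed map[string]string) []Event {
	var events []Event
	for _, name := range sortedKeys(listed) {
		if _, ok := cached[name]; !ok {
			events = append(events, Event{"ADD", name, listed[name]})
		}
	}
	for _, name := range sortedKeys(cached) {
		if _, ok := listed[name]; !ok {
			events = append(events, Event{"DELETE", name, cached[name]})
		}
	}
	return events
}

// sortedKeys returns the map keys in a deterministic order.
func sortedKeys(m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// On the server, node-a was deleted and then node-b was created with the
	// released PodCIDR. The client only sees the before and after states.
	cached := map[string]string{"node-a": "10.10.1.0/24"}
	listed := map[string]string{"node-b": "10.10.1.0/24"}
	for _, e := range relistEvents(cached, listed) {
		fmt.Printf("%s %s %s\n", e.Type, e.Node, e.CIDR)
	}
}
```

In this model the ADD for node-b is emitted before the DELETE for node-a, so for a moment both Nodes appear to own 10.10.1.0/24 on the client side.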

To ensure the stale routes, flows, and relevant cache of this podCIDR
are removed appropriately, we wait for the Node deletion event to be
processed before proceeding; otherwise the route installation and
uninstallation operations may override or conflict with each other.

Fixes #1527

antrea-bot · 2020-11-10T13:37:07Z

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

/test-e2e: to trigger e2e tests.
/skip-e2e: to skip e2e tests.
/test-conformance: to trigger conformance tests.
/skip-conformance: to skip conformance tests.
/test-all-features-conformance: to trigger conformance tests with all alpha features enabled.
/skip-all-features-conformance: to skip conformance tests with all alpha features enabled.
/test-whole-conformance: to trigger all conformance tests on linux.
/skip-whole-conformance: to skip all conformance tests on linux.
/test-networkpolicy: to trigger networkpolicy tests.
/skip-networkpolicy: to skip networkpolicy tests.
/test-windows-conformance: to trigger windows conformance tests.
/skip-windows-conformance: to skip windows conformance tests.
/test-windows-networkpolicy: to trigger windows networkpolicy tests.
/skip-windows-networkpolicy: to skip windows networkpolicy tests.
/test-hw-offload: to trigger ovs hardware offload test.
/skip-hw-offload: to skip ovs hardware offload test.
/test-all: to trigger all tests (except whole conformance).
/skip-all: to skip all tests (except whole conformance).

codecov-io · 2020-11-10T13:40:37Z

Codecov Report

Merging #1526 (f93cf30) into master (60f9182) will increase coverage by 0.05%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1526      +/-   ##
==========================================
+ Coverage   67.54%   67.59%   +0.05%     
==========================================
  Files         169      169              
  Lines       13424    13471      +47     
==========================================
+ Hits         9067     9106      +39     
+ Misses       3416     3415       -1     
- Partials      941      950       +9

Flag	Coverage Δ
integration-tests	`45.62% <ø> (-0.12%)`	⬇️
kind-e2e-tests	`55.38% <60.00%> (-0.01%)`	⬇️
unit-tests	`41.48% <90.00%> (-0.21%)`	⬇️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
...gent/controller/noderoute/node_route_controller.go	`44.80% <100.00%> (+6.68%)`	⬆️
pkg/agent/controller/networkpolicy/allocator.go	`77.77% <0.00%> (-15.98%)`	⬇️
pkg/ovs/openflow/ofctrl_bridge.go	`68.37% <0.00%> (-3.56%)`	⬇️
pkg/querier/querier.go	`57.14% <0.00%> (ø)`
cmd/antrea-agent/agent.go	`0.00% <0.00%> (ø)`
pkg/agent/openflow/client.go	`68.10% <0.00%> (ø)`
pkg/agent/stats/collector.go	`97.72% <0.00%> (+0.02%)`	⬆️
...ntroller/networkpolicy/networkpolicy_controller.go	`69.12% <0.00%> (+0.03%)`	⬆️
pkg/agent/controller/networkpolicy/reconciler.go	`70.76% <0.00%> (+0.07%)`	⬆️
...agent/controller/traceflow/traceflow_controller.go	`82.69% <0.00%> (+0.08%)`	⬆️
... and 3 more

tnqn · 2020-11-10T13:50:37Z

/test-all

Dyanngg · 2020-11-10T18:39:33Z

pkg/agent/controller/noderoute/node_route_controller.go

+	// event to be processed before proceeding, or the route installation and uninstallation operations may override or
+	// conflict with each other.
+	if len(nodesHaveSamePodCIDR) > 0 {
+		return fmt.Errorf("podCIDR %s for Node %s is duplicate with Node %s", node.Spec.PodCIDR, nodeName, nodesHaveSamePodCIDR[0].(*nodeRouteInfo).nodeName)


Should we mention that this error could be temporary? Or say something like "addNodeRoute is deferred"?

Not a bad idea.

Thanks for the suggestion, done.

jianjuns · 2020-11-10T21:55:55Z

Not a bad idea.

antoninbas

LGTM, also agree with Yang's suggestion

abhiraut · 2020-11-10T23:25:02Z

LGTM

tnqn · 2020-11-11T03:03:03Z

/test-all

lzhecheng · 2020-11-11T03:10:37Z

/test-windows-conformance

tnqn · 2020-11-11T04:16:38Z

"TestHostPortPodConnectivity" in "Kind / E2e tests on a Kind cluster on Linux with Antrea NetworkPolicies enabled" failed because of "ErrImagePull". Since the test has passed in the other verifications, I will skip rerunning all Kind tests.

Nov 11 03:41:53 kind-worker kubelet[405]: E1111 03:41:53.764091     405 pod_workers.go:191] Error syncing pod ec08f77f-e118-46a1-8525-d001b3168c33 ("test-host-port-pod-6sp37a7v_antrea-test(ec08f77f-e118-46a1-8525-d001b3168c33)"), skipping: failed to "StartContainer" for "agnhost" with ErrImagePull: "rpc error: code = Unknown desc = failed to pull and unpack image \"gcr.io/kubernetes-e2e-test-images/agnhost:2.8\": failed to copy: httpReaderSeeker: failed open: failed to do request: Get https://storage.googleapis.com/artifacts.kubernetes-e2e-test-images.appspot.com/containers/images/sha256:5a3ea8efae5d0abb93d2a04be0a4870087042b8ecab8001f613cdc2a9440616a: net/http: TLS handshake timeout"
Nov 11 03:41:54 kind-worker kubelet[405]: E1111 03:41:54.473966     405 pod_workers.go:191] Error syncing pod ec08f77f-e118-46a1-8525-d001b3168c33 ("test-host-port-pod-6sp37a7v_antrea-test(ec08f77f-e118-46a1-8525-d001b3168c33)"), skipping: failed to "StartContainer" for "agnhost" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubernetes-e2e-test-images/agnhost:2.8\""

vmwclabot added the cla-not-required label Nov 10, 2020

Dyanngg reviewed Nov 10, 2020

jianjuns reviewed Nov 10, 2020

antoninbas reviewed Nov 10, 2020

antoninbas mentioned this pull request Nov 10, 2020

Release 0.10.2 #1520

Merged

tnqn force-pushed the node-route-override branch from 915c4c8 to f93cf30 Compare November 11, 2020 02:33

antoninbas approved these changes Nov 11, 2020

antoninbas added this to the Antrea v0.11.0 release milestone Nov 11, 2020

tnqn merged commit 17333f8 into antrea-io:master Nov 11, 2020

tnqn deleted the node-route-override branch November 11, 2020 04:17
