Fix missing routes caused by Node events coming out of order #1526

Merged
merged 1 commit into from
Nov 11, 2020

tnqn
Member

@tnqn tnqn commented Nov 10, 2020

PodCIDRs can be released from deleted Nodes and allocated to new Nodes.
On the server side, a PodCIDR is never allocated to more than one Node
at any point. On the client side, however, if a resync happens while
Node creation and deletion events are pending, the informer generates
the events so that all creation events come before deletion events, even
if they actually happened in the opposite order on the server side.
Therefore, a PodCIDR may appear in a new Node before the Node that
previously owned it is removed.
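
To make the failure mode concrete, here is a minimal sketch of what can go wrong when events are applied in the reordered sequence. It is not Antrea code: the `event` type, the `naiveApply` function, the Node names, and the CIDR are all hypothetical, and routes are modelled as a plain map keyed by podCIDR.

```go
package main

import "fmt"

// event is a simplified, hypothetical stand-in for an informer notification.
type event struct {
	kind    string // "add" or "delete"
	node    string
	podCIDR string
}

// naiveApply applies events one by one to a route table keyed by podCIDR,
// without checking whether another Node still owns the CIDR.
func naiveApply(routes map[string]string, events []event) map[string]string {
	for _, e := range events {
		switch e.kind {
		case "add":
			routes[e.podCIDR] = e.node
		case "delete":
			delete(routes, e.podCIDR)
		}
	}
	return routes
}

func main() {
	const cidr = "10.10.1.0/24"
	// Server-side order: node-a is deleted, then node-b reuses its podCIDR.
	serverOrder := []event{{"delete", "node-a", cidr}, {"add", "node-b", cidr}}
	// Order seen by the client after a resync: creation first, deletion last.
	clientOrder := []event{{"add", "node-b", cidr}, {"delete", "node-a", cidr}}

	fmt.Println("server order:", naiveApply(map[string]string{cidr: "node-a"}, serverOrder))
	fmt.Println("client order:", naiveApply(map[string]string{cidr: "node-a"}, clientOrder))
}
```

With the server-side order the route for node-b survives; with the client-side order the late deletion of node-a removes the route that node-b needs, which matches the "missing routes" symptom in the title.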

To ensure that the stale routes, flows, and relevant cache entries of
this PodCIDR are removed properly, we wait for the Node deletion event
to be processed before proceeding; otherwise the route installation and
uninstallation operations may override or conflict with each other.

Fixes #1527

@antrea-bot
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-all-features-conformance: to trigger conformance tests with all alpha features enabled.
  • /skip-all-features-conformance: to skip conformance tests with all alpha features enabled.
  • /test-whole-conformance: to trigger all conformance tests on linux.
  • /skip-whole-conformance: to skip all conformance tests on linux.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-windows-conformance: to trigger windows conformance tests.
  • /skip-windows-conformance: to skip windows conformance tests.
  • /test-windows-networkpolicy: to trigger windows networkpolicy tests.
  • /skip-windows-networkpolicy: to skip windows networkpolicy tests.
  • /test-hw-offload: to trigger ovs hardware offload test.
  • /skip-hw-offload: to skip ovs hardware offload test.
  • /test-all: to trigger all tests (except whole conformance).
  • /skip-all: to skip all tests (except whole conformance).

@codecov-io

codecov-io commented Nov 10, 2020

Codecov Report

Merging #1526 (f93cf30) into master (60f9182) will increase coverage by 0.05%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1526      +/-   ##
==========================================
+ Coverage   67.54%   67.59%   +0.05%     
==========================================
  Files         169      169              
  Lines       13424    13471      +47     
==========================================
+ Hits         9067     9106      +39     
+ Misses       3416     3415       -1     
- Partials      941      950       +9     
| Flag | Coverage Δ | |
|---|---|---|
| integration-tests | 45.62% <ø> (-0.12%) | ⬇️ |
| kind-e2e-tests | 55.38% <60.00%> (-0.01%) | ⬇️ |
| unit-tests | 41.48% <90.00%> (-0.21%) | ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...gent/controller/noderoute/node_route_controller.go | 44.80% <100.00%> (+6.68%) | ⬆️ |
| pkg/agent/controller/networkpolicy/allocator.go | 77.77% <0.00%> (-15.98%) | ⬇️ |
| pkg/ovs/openflow/ofctrl_bridge.go | 68.37% <0.00%> (-3.56%) | ⬇️ |
| pkg/querier/querier.go | 57.14% <0.00%> (ø) | |
| cmd/antrea-agent/agent.go | 0.00% <0.00%> (ø) | |
| pkg/agent/openflow/client.go | 68.10% <0.00%> (ø) | |
| pkg/agent/stats/collector.go | 97.72% <0.00%> (+0.02%) | ⬆️ |
| ...ntroller/networkpolicy/networkpolicy_controller.go | 69.12% <0.00%> (+0.03%) | ⬆️ |
| pkg/agent/controller/networkpolicy/reconciler.go | 70.76% <0.00%> (+0.07%) | ⬆️ |
| ...agent/controller/traceflow/traceflow_controller.go | 82.69% <0.00%> (+0.08%) | ⬆️ |

... and 3 more

@tnqn
Member Author

tnqn commented Nov 10, 2020

/test-all

// event to be processed before proceeding, or the route installation and uninstallation operations may override or
// conflict with each other.
if len(nodesHaveSamePodCIDR) > 0 {
	return fmt.Errorf("podCIDR %s for Node %s is duplicate with Node %s", node.Spec.PodCIDR, nodeName, nodesHaveSamePodCIDR[0].(*nodeRouteInfo).nodeName)
Contributor

Should we mention that this error could be temporary, or say something like "addNodeRoute is deferred"?

Contributor

Not a bad idea.

Member Author

Thanks for the suggestion, done.

Contributor

@antoninbas antoninbas left a comment

LGTM, also agree with Yang's suggestion

@antoninbas antoninbas mentioned this pull request Nov 10, 2020
@abhiraut
Contributor

LGTM

@tnqn
Member Author

tnqn commented Nov 11, 2020

/test-all

@lzhecheng
Contributor

/test-windows-conformance

@antoninbas antoninbas added this to the Antrea v0.11.0 release milestone Nov 11, 2020
@tnqn
Member Author

tnqn commented Nov 11, 2020

"TestHostPortPodConnectivity" in "Kind / E2e tests on a Kind cluster on Linux with Antrea NetworkPolicies enabled" failed because of "ErrImagePull". Since the test has passed in the other verifications, I will skip rerunning all Kind tests.

Nov 11 03:41:53 kind-worker kubelet[405]: E1111 03:41:53.764091     405 pod_workers.go:191] Error syncing pod ec08f77f-e118-46a1-8525-d001b3168c33 ("test-host-port-pod-6sp37a7v_antrea-test(ec08f77f-e118-46a1-8525-d001b3168c33)"), skipping: failed to "StartContainer" for "agnhost" with ErrImagePull: "rpc error: code = Unknown desc = failed to pull and unpack image \"gcr.io/kubernetes-e2e-test-images/agnhost:2.8\": failed to copy: httpReaderSeeker: failed open: failed to do request: Get https://storage.googleapis.com/artifacts.kubernetes-e2e-test-images.appspot.com/containers/images/sha256:5a3ea8efae5d0abb93d2a04be0a4870087042b8ecab8001f613cdc2a9440616a: net/http: TLS handshake timeout"
Nov 11 03:41:54 kind-worker kubelet[405]: E1111 03:41:54.473966     405 pod_workers.go:191] Error syncing pod ec08f77f-e118-46a1-8525-d001b3168c33 ("test-host-port-pod-6sp37a7v_antrea-test(ec08f77f-e118-46a1-8525-d001b3168c33)"), skipping: failed to "StartContainer" for "agnhost" with ImagePullBackOff: "Back-off pulling image \"gcr.io/kubernetes-e2e-test-images/agnhost:2.8\""

@tnqn tnqn merged commit 17333f8 into antrea-io:master Nov 11, 2020
@tnqn tnqn deleted the node-route-override branch November 11, 2020 04:17
antoninbas pushed a commit to antoninbas/antrea that referenced this pull request Nov 11, 2020
…io#1526)

antoninbas pushed a commit that referenced this pull request Nov 11, 2020
Development

Successfully merging this pull request may close these issues.

Traffic between some random pods is not working
9 participants