Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[arp_update]: Fix IPv6 neighbor race condition #15583

Merged
merged 3 commits into from
Jun 30, 2023

Conversation

theasianpianist
Copy link
Contributor

Why I did it

A race condition exists which makes it possible for the kernel to resolve a trapped packet's destination IP at the same time that arp_update is running the ip neigh replace command for that neighbor IP. When this occurs, the kernel's neighbor entry for this IP is in the INCOMPLETE state, and the ip neigh replace command sets it to permanently incomplete. This means no netlink message will be generated for this neighbor, since the kernel doesn't generate netlink messages for INCOMPLETE neighbors (it would only generate a message once the neighbor transitions to FAILED, which doesn't happen due to the ip neigh replace command). As a result, no APPL_DB neighbor table entry is ever created and no tunnel route for the IP is ever installed, leading to dropped traffic.

Work item tracking
  • Microsoft ADO (number only): 24152065

How I did it

  • Check each FAILED or INCOMPLETE neighbor to see if a corresponding APPL_DB neighbor table entry exists. If no entry exists, flush the neighbor from the kernel
  • After pinging the neighbor IPs, wait for any neighbor entries which might be transiently INCOMPLETE to transition to FAILED (so that the subsequent ip neigh replace command can set them to permanently incomplete)
  • Exclude any INCOMPLETE neighbors from the ip neigh replace command in case they are new neighbors for which no netlink message has been generated yet

How to verify it

  • Run arp_update with no FAILED or INCOMPLETE neighbor entries, verify no changes are made to the kernel neighbor table

  • Run arp_update with a FAILED neighbor entry with corresponding APPL_DB entry, verify the neighbor IP is pinged and set to INCOMPLETE permanently

  • Run arp_update with an INCOMPLETE neighbor entry with corresponding APPL_DB entry, verify the neighbor IP is pinged and set to INCOMPLETE permanently

  • Run arp_update with a FAILED neighbor entry without corresponding APPL_DB entry, verify the neighbor is flushed, pinged, and set to INCOMPLETE permanently

  • Run arp_update with an INCOMPLETE neighbor entry without corresponding APPL_DB entry, verify the neighbor is flushed, pinged, and set to INCOMPLETE permanently

  • Run arp_update with various combinations of above scenarios - verify that only neighbors missing APPL_DB entries are flushed from the kernel; verify that all FAILED/INCOMPLETE neighbors are pinged and set to permanently INCOMPLETE

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
@prsunny
Copy link
Contributor

prsunny commented Jun 23, 2023

@theasianpianist , can you please add Requestfor labels and remove the Approvedfor label. Approvedfor shall be done by branch owners

files/scripts/arp_update Show resolved Hide resolved
files/scripts/arp_update Show resolved Hide resolved
files/scripts/arp_update Outdated Show resolved Hide resolved
if [[ ! -z "$failed_kernel_neighbors" ]]; then
neigh_replace_template="sed -e 's/^/ip neigh replace /' -e 's/,/ dev /' -e 's/$/ nud incomplete;/'"
ip_neigh_replace_cmd="echo \"$failed_kernel_neighbors\" | cut -d ' ' -f 1,3 --output-delimiter=',' | $neigh_replace_template"
eval `eval "$ip_neigh_replace_cmd"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please test this with traffic which can resolve a neighbor during a ping and another where neighbor cannot be resolved?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with one resolvable IP, one unresolvable IP, and one unresolvable IP where I manually deleted the APPL_DB entry. The script behaved as expected - the deleted IP was flushed, all three IPs were pinged, and the final state of the IPs was REACHABLE (for the resolvable IP) and INCOMPLETE for the other two.

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
@prsunny prsunny merged commit b4a3711 into sonic-net:master Jun 30, 2023
18 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Jun 30, 2023
* [arp_update]: Fix IPv6 neighbor race condition on dualtors
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202211: #15693

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Jun 30, 2023
* [arp_update]: Fix IPv6 neighbor race condition on dualtors
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202205: #15694

@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202305: #15877

sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
* [arp_update]: Fix IPv6 neighbor race condition on dualtors
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants