
Pod communication not working on RHEL 8.4 on cloud providers #1053

Closed
Oats87 opened this issue May 25, 2021 · 9 comments

Oats87 commented May 25, 2021

Environmental Info:
RKE2 Version: v1.20.7+rke2r1

Node(s) CPU architecture, OS, and Version:
Linux hostname 4.18.0-305.el8.x86_64 #1 SMP Thu Apr 29 08:54:30 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.4 (Ootpa)

Cluster Configuration: Single node.

Describe the bug:
Pod communication is not possible when running on a RHEL 8.4 host on AWS, regardless of SELinux status.

Steps To Reproduce:

mkdir -p /etc/rancher/rke2/
systemctl stop firewalld
cat > /etc/rancher/rke2/config.yaml <<EOF
selinux: true
write-kubeconfig-mode: "0644"
EOF
curl -sfL https://get.rke2.io --output install.sh
chmod +x install.sh
INSTALL_RKE2_CHANNEL=stable ./install.sh
systemctl enable rke2-server.service --now

Wait for the RKE2 server and its pods to start, then try to ping any pod IP in the 10.42.x.x range.
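The check above can be sketched as follows (the kubeconfig/kubectl paths and the jsonpath selection are assumptions for illustration, not taken from the report):

```shell
# Sketch of the failing connectivity check on the node itself.
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH="$PATH:/var/lib/rancher/rke2/bin"

# Grab the IP of the first pod (expected to be in the 10.42.x.x range).
POD_IP=$(kubectl get pods -A -o jsonpath='{.items[0].status.podIP}')

# On an affected RHEL 8.4 cloud image this times out; on 8.3 it succeeds.
ping -c 3 "$POD_IP"
```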

Expected behavior:
Pods are reachable from the host and from each other.

Actual behavior:
Pods are unreachable; pings to pod IPs time out.

Additional context / logs:
Looking at a tcpdump from within the container, you can see the corresponding ICMP request/reply, so the pod is definitely receiving the traffic. The reply seems to be getting lost on its way back to the host network namespace.

This does not occur on VMs running RHEL 8.4 in our test environment.

SELinux status does not seem to affect the issue.

@davidnuzik davidnuzik added this to the v1.21.2+rke2r1 milestone May 25, 2021

Oats87 commented May 25, 2021

Investigating this with @manuelbuil, we found that we did not have issues with the RHEL 8.3 AMI.

We also did not have any issues in our internal lab environment running 8.3 and 8.4.

We discovered a difference between the two AMIs: the 8.4 image enables two new services, nm-cloud-setup.service and nm-cloud-setup.timer. Disabling them (systemctl disable nm-cloud-setup.service nm-cloud-setup.timer) and rebooting the node restored connectivity.

More work is needed to understand why nm-cloud-setup breaks connectivity while it is running.

@brandond

It's Always NetworkManager.

@davidnuzik davidnuzik changed the title Pod communication not working on RHEL 8.4 on AWS Pod communication not working on RHEL 8.4 on cloud providers Jun 3, 2021

manuelbuil commented Jun 4, 2021

nm-cloud-setup creates a policy-based routing table with higher priority than the main routing table. That table applies only to packets whose source IP is the main interface's address, and it essentially sets the default gateway to whatever the cloud provider announced as the gateway. For example, table 30400 in this case:

[ec2-user@ip-10-0-10-5 ~]$ ip rule
0:	from all lookup local
30400:	from 10.0.10.5 lookup 30400
32766:	from all lookup main
32767:	from all lookup default

That table breaks node<-->local_pod connectivity; as soon as it is removed, connectivity works again. Because this behaviour applies to the main interface, the currently recommended NetworkManager configuration for rke2 (https://docs.rke2.io/known_issues.html#networkmanager) does not fix it. According to the docs, that policy-based route table is always created: https://www.mankier.com/8/nm-cloud-setup#Supported_Cloud_Providers.

I filed an issue in the NetworkManager GitLab asking for a configuration option to avoid this problem ==> https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/740. My suggestion is to wait for an answer. In the meantime, we need to either disable the services as Chris said (systemctl disable nm-cloud-setup.service nm-cloud-setup.timer) or remove that table (but if NetworkManager restarts, it will be recreated).
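The two workarounds can be sketched as follows (the priority 30400 comes from the ip rule output above and may differ per node; run as root):

```shell
# Option 1 (persistent): disable nm-cloud-setup entirely, then reboot.
systemctl disable nm-cloud-setup.service nm-cloud-setup.timer
reboot

# Option 2 (temporary): drop the policy rule and flush its table.
# NetworkManager will recreate both on its next restart.
ip rule del priority 30400
ip route flush table 30400
```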


Oats87 commented Jun 4, 2021

We should probably cross-file an issue over at https://github.com/flannel-io/flannel


manuelbuil commented Jun 8, 2021

Reported it in Calico, as it seems more people are affected: projectcalico/calico#4662. It also affects Calico, not only Canal. I suspect it might affect Cilium as well.

@manuelbuil

I think we can close this one. The Calico maintainers are saying nm-cloud-setup should be disabled.

  • Docs updated to say that nm-cloud-setup should be disabled
  • Both the rke2 and k3s systemd unit files updated; the services will refuse to start if nm-cloud-setup is enabled
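The guard added to the unit files looks roughly like the following (a sketch, not the verbatim line shipped with rke2/k3s; check the installed rke2-server.service for the exact wording):

```ini
[Service]
# Fail the unit before starting if nm-cloud-setup.service is enabled,
# so the node never comes up with broken pod networking.
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
```

With a guard like this, rke2-server fails fast with a clear ExecStartPre error instead of starting a cluster whose pod traffic silently disappears.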


brandond commented Aug 6, 2021

QA should test the change that was made to the systemd unit.

@brandond brandond reopened this Aug 6, 2021
@manuelbuil

@brandond right! I don't know why I closed it directly :/

@bmdepesa bmdepesa added kind/dev-validation Dev will be validating this issue and removed kind/dev-validation Dev will be validating this issue labels Aug 12, 2021
@rancher-max

Validated on master branch commit f23f5915ee3d3270bad235e26c8c54e8c3c0cf76 using the same steps as in #1670 (comment)
