Node behind NAT always NotReady #314
Comments
Hi @idcmp, can you share the Kilo logs from the laptop VM? That's the most important piece of information for figuring out why the node is unready.
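For anyone following along, something like the following pulls those logs; the namespace and label selector are assumptions based on the stock kilo-kubeadm.yaml manifest:

```sh
# Find the Kilo pod running on the laptop VM node
kubectl -n kube-system get pods -l app.kubernetes.io/name=kilo -o wide

# Dump its logs; --previous shows output from the last crashed container, if any
kubectl -n kube-system logs <kilo-pod-name>
kubectl -n kube-system logs <kilo-pod-name> --previous
```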
Here are some logs from the kilo pod running on the laptop VM:
Looking at journalctl, there's a bunch of:
and of course
My
Hope this helps, just let me know if there are other logs you'd like!
Hi @idcmp, you mentioned that the Kilo pod on the laptop VM was crashlooping, but the logs you shared don't mention anything about a crash. Are there more logs that do include this?
I'll see if I can follow the logs and catch it in the act. Here's a describe of the pod:
kube-proxy configmap:
kilo configmap:
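For reference, these were presumably gathered with commands along these lines (namespace and resource names assumed from a stock kubeadm install plus kilo-kubeadm.yaml):

```sh
kubectl -n kube-system describe pod <kilo-pod-name>
kubectl -n kube-system get configmap kube-proxy -o yaml
kubectl -n kube-system get configmap kilo -o yaml
```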
Caught one:
Here are the events from the pod:
Okay, I'm still experiencing WireGuard dropouts (as described in Situation below), but both nodes think they're Ready.
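Node readiness here being what the API server reports, i.e.:

```sh
# Both nodes show Ready in the STATUS column despite the dropouts
kubectl get nodes -o wide
```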
Thanks @idcmp those logs are really helpful :)
Ah, it sounds like maybe it's related to this:
I'll try testing on k8s v1.24 (and updating our e2e tests). So far our tests have been on v1.23; I wonder if this has caused an issue.
I wanted to add that this issue sounds to me like it is a bit more fundamental than WireGuard/Kilo/networking. It seems like something is breaking with the container runtime and interrupting Kilo. Once we figure out why Kilo is being killed prematurely, we can investigate any networking issues. In other words, there is no reason why Kubernetes should kill the Kilo Pod just because the network (and thus the node) isn't yet ready. Could you share any details about the container runtime? Are there any differences between the VM's runtime and the EC2 node's?
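A sketch of how those runtime details can be gathered, assuming crictl is available on the nodes:

```sh
# Runtime name and version as the kubelet sees it (CONTAINER-RUNTIME column)
kubectl get nodes -o wide

# Directly on each node
containerd --version
crictl info                                     # the runtime's view of CNI readiness
journalctl -u containerd --since "1 hour ago"   # runtime-side restarts or kill events
```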
Nope, no changes from what was fetched. The EC2 instance is running Linux 5.10.109-104.500.amzn2.x86_64 with containerd 1.4.13; the laptop VM (which, as you've probably guessed, runs Fedora) runs Linux 5.17.5-200.fc35.x86_64 with containerd 1.6.2 in VirtualBox (as spun up by Vagrant). I really think the problem relates to the EC2 instance having an Endpoint= for the laptop VM in its Peer stanza (see the Description below).
This could be part of the explanation for why networking isn't working; however, it wouldn't explain why the Kilo Pod is getting killed and the node is unready.
Added --log-level=debug to the kilo DaemonSet; will post an update when it dies.
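One way to do that, assuming the DaemonSet name from kilo-kubeadm.yaml and that kilo is the first container in the pod spec:

```sh
# Append --log-level=debug to the kilo container's args
kubectl -n kube-system patch daemonset kilo --type json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--log-level=debug"}]'
```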
Ran with --log-level=debug enabled. Crashing kilo:
Non-crashing kilo:
@squat, is there anything I can do to help?
This is a duplicate of #189. I think we should close this issue. |
Setup

- kubeadm init ...
- kubeadm join ... (not as control plane)
- manifests/kilo-kubeadm.yaml from this repo
- echo "module wireguard +p" > /sys/kernel/debug/dynamic_debug/control (see the note below on reading this output)

The laptop VM node stays NotReady with:

KubeletNotReady container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
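With dynamic debug enabled as in the setup above, WireGuard's handshake and rekey messages land in the kernel log and can be followed with standard tools:

```sh
echo "module wireguard +p" > /sys/kernel/debug/dynamic_debug/control   # as in the setup
dmesg -wT | grep -i wireguard      # or: journalctl -kf | grep -i wireguard
```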
Situation

Relevant log on the laptop VM:

The Kilo pod on the laptop VM ends up in CrashLoopBackOff.

Description
The wg configuration for the EC2 instance has

Endpoint=my.external.laptop.ip:52412

listed in the Peer configuration stanza. I think when key rotation happens, the EC2 instance tries to connect back to the laptop VM (which it can't), and the handshake errors are from the mismatch of the laptop VM trying to connect to the EC2 instance at the same time. I think if the Endpoint= weren't listed in the Peer stanza, the laptop VM would just reconnect after rekeying and life would continue on. I'm hoping that would also fix why the laptop VM is still NotReady.
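To illustrate, the Peer stanza on the EC2 instance (e.g. from wg showconf) looks roughly like this; it's a sketch in which the key and AllowedIPs are placeholders, and only the Endpoint line comes from the report above:

```ini
[Peer]
PublicKey = <laptop-vm-public-key>
AllowedIPs = <laptop-vm-pod-cidr>, <laptop-vm-internal-ip>/32
# The problematic line: the laptop VM is behind NAT, so the EC2 instance
# cannot initiate a handshake to this address after rekeying.
Endpoint = my.external.laptop.ip:52412
```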