RKE INTERNAL-IP and EXTERNAL-IP addresses are not correctly set #22584

Closed
rgl opened this issue Aug 31, 2019 · 48 comments
@rgl

rgl commented Aug 31, 2019

What kind of request is this (question/bug/enhancement/feature request):
bug.

Steps to reproduce (least amount of steps as possible):
Add a node to a RKE cluster as:

https://github.com/rgl/rancher-single-node-ubuntu-vagrant/blob/048567e05b87247ce14b1b3d2680314cbd7f3115/provision-rancher.sh#L182-L199

rancher_ip_address="${1:-10.1.0.3}"; shift || true
node_ip_address="$rancher_ip_address"

# register this node as a rancher-agent.
echo "getting the rancher-agent registration command..."
cluster_id="$(echo "$cluster_response" | jq -r .id)"
cluster_registration_response="$(
    wget -qO- \
        --header 'Content-Type: application/json' \
        --header "Authorization: Bearer $admin_api_token" \
        --post-data '{"type":"clusterRegistrationToken","clusterId":"'$cluster_id'"}' \
        "$rancher_server_url/v3/clusterregistrationtoken")"
echo "registering this node as a rancher-agent..."
rancher_agent_registration_command="
    $(echo "$cluster_registration_response" | jq -r .nodeCommand)
        --address $node_ip_address
        --internal-address $node_ip_address
        --etcd
        --controlplane
        --worker"
$rancher_agent_registration_command

Result:

The INTERNAL-IP and EXTERNAL-IP are not correctly set, as can be seen in the following output:

# kubectl get nodes -o wide
NAME     STATUS   ROLES                      AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
server   Ready    controlplane,etcd,worker   36m   v1.15.3   192.168.121.150   <none>        Ubuntu 18.04.3 LTS   4.15.0-58-generic   docker://19.3.1

# kubectl describe nodes
Name:               server
Roles:              controlplane,etcd,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=server
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/controlplane=true
                    node-role.kubernetes.io/etcd=true
                    node-role.kubernetes.io/worker=true
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"06:0f:17:92:00:ef"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 10.1.0.3
                    node.alpha.kubernetes.io/ttl: 0
                    rke.cattle.io/external-ip: 10.1.0.3
                    rke.cattle.io/internal-ip: 10.1.0.3
                    volumes.kubernetes.io/controller-managed-attach-detach: true
Addresses:
  InternalIP:  192.168.121.150
  Hostname:    server

Other details that may be helpful:
This is using a Vagrant VM which has two interfaces, eth0 (192.168.121.150) and eth1 (10.1.0.3). It should use the eth1 (10.1.0.3) IP address as the INTERNAL-IP and EXTERNAL-IP addresses.

The vagrant environment is at https://github.com/rgl/rancher-single-node-ubuntu-vagrant.

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): 2.2.8
  • Installation option (single install/HA): single

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom/RKE (as launched by rancher UI)
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): VM/4core/4GBram
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version (use docker version):
Client: Docker Engine - Community
 Version:           19.03.1
 API version:       1.40
 Go version:        go1.12.5
 Git commit:        74b1e89
 Built:             Thu Jul 25 21:21:05 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.1
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.5
  Git commit:       74b1e89
  Built:            Thu Jul 25 21:19:41 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.6
  GitCommit:        894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc:
  Version:          1.0.0-rc8
  GitCommit:        425e105d5a03fabd737a126ad93d62a9eeede87f
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
@rgl

rgl commented Sep 1, 2019

As a reference, at https://github.com/rgl/kubernetes-ubuntu-vagrant/blob/master/provision-kubernetes-master.sh, I launch k8s with kubeadm and all the IP addresses are OK:

kubeadm init \
  --kubernetes-version=1.15.3 \
  --apiserver-advertise-address=10.11.0.101 \
  --pod-network-cidr=10.12.0.0/16 \
  --service-cidr=10.13.0.0/16 \
  --service-dns-domain=vagrant.local

ip address show dev eth0 | grep 'inet '
# => inet 192.168.121.77/24 brd 192.168.121.255 scope global dynamic eth0

ip address show dev eth1 | grep 'inet '
# => inet 10.11.0.101/24 brd 10.11.0.255 scope global eth1

kubectl get nodes -o wide
# => NAME   STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
# => km1    Ready    master   23m   v1.15.3   10.11.0.101   <none>        Ubuntu 18.04.3 LTS   4.15.0-58-generic   docker://18.9.8

kubectl describe nodes
# => Name:               km1
# => Roles:              master
# => Labels:             beta.kubernetes.io/arch=amd64
# =>                     beta.kubernetes.io/os=linux
# =>                     kubernetes.io/arch=amd64
# =>                     kubernetes.io/hostname=km1
# =>                     kubernetes.io/os=linux
# =>                     node-role.kubernetes.io/master=
# => Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
# =>                     node.alpha.kubernetes.io/ttl: 0
# =>                     volumes.kubernetes.io/controller-managed-attach-detach: true
# => Addresses:
# =>   InternalIP:  10.11.0.101
# =>   Hostname:    km1

ps auxw|grep 10.11.0.101
# => root      8378  4.0  4.8 1859952 49228 ?       Ssl  10:23   0:46 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.1 --resolv-conf=/run/systemd/resolve/resolv.conf --node-ip=10.11.0.101
# => root      8791  4.6 21.3 403328 214936 ?       Ssl  10:23   0:53 kube-apiserver --advertise-address=10.11.0.101 --allow-privileged=true --authorization-mode=Node,RBAC --client-ca-file=/etc/kubernetes/pki/ca.crt --enable-admission-plugins=NodeRestriction --enable-bootstrap-token-auth=true --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key --etcd-servers=https://127.0.0.1:2379 --insecure-port=0 --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key --requestheader-allowed-names=front-proxy-client --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-key-file=/etc/kubernetes/pki/sa.pub --service-cluster-ip-range=10.13.0.0/16 --tls-cert-file=/etc/kubernetes/pki/apiserver.crt --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
# => root      8873  2.5  3.2 10538668 32996 ?      Ssl  10:23   0:28 etcd --advertise-client-urls=https://10.11.0.101:2379 --cert-file=/etc/kubernetes/pki/etcd/server.crt --client-cert-auth=true --data-dir=/var/lib/etcd --initial-advertise-peer-urls=https://10.11.0.101:2380 --initial-cluster=km1=https://10.11.0.101:2380 --key-file=/etc/kubernetes/pki/etcd/server.key --listen-client-urls=https://127.0.0.1:2379,https://10.11.0.101:2379 --listen-peer-urls=https://10.11.0.101:2380 --name=km1 --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt --peer-client-cert-auth=true --peer-key-file=/etc/kubernetes/pki/etcd/peer.key --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt --snapshot-count=10000 --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

@smnbbrv

smnbbrv commented Apr 26, 2020

Is there a way to work around this? I just faced this issue and it is quite a blocker for me.

@stroebs

stroebs commented May 7, 2020

Facing this issue as well. Internal-IP is set to a public IP address and I cannot get the nodes to communicate over the given private IP address. I've tried kubectl edit node <node-id>, which doesn't seem to have any effect.
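
For what it's worth, the node addresses live under the node's .status, which the kubelet re-reports on every sync, so edits made with kubectl edit get overwritten. A quick way to inspect what the kubelet is actually reporting (node name is a placeholder):

kubectl get node <node-id> -o jsonpath='{.status.addresses}'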

@iosifnicolae2

Any news on this?

@stroebs

stroebs commented Sep 24, 2020

There seems to be no way to convince an RKE cluster to only use the specified IPs for a given node if the node's primary ethernet interface (eth0 or similar) has an IP, whether that IP is public or private.

In my use-case working with Hetzner Cloud (or baremetal), the nodes have public IPs and traffic is explicitly blocked on the public interface. All private traffic traverses a secondary VLAN interface with a different IP address.

The workaround:

  • Create a RKE cluster as usual in the Rancher control plane (Custom)
  • Edit cluster -> Edit as yaml
  • Modify network key as below:
  network:
    mtu: 0
    options:
      flannel_backend_type: host-gw
      flannel_iface: eth0.vlan100
    plugin: flannel
  • Add nodes using their public IP (eth0 primary IP) as --external-address and secondary IP as --internal-address

A huge caveat with the workaround is that the node still uses eth0 as the node's IP and there's no way to tell Rancher to only use the IPs given! Canal/flannel will always look up the primary interface's IP and use it if you try to spoon-feed anything on the CLI. The other approach I've seen is to create two additional private interfaces and use their IPs as external/internal to avoid using the primary ethernet interface's IP.
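
For reference, the per-node registration command the last bullet refers to would look roughly like the sketch below. The image tag, server URL, token and addresses are placeholders, and the flag names are the --address/--internal-address pair used elsewhere in this thread:

docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.5.7 \
  --server https://rancher.example.com \
  --token <registration-token> \
  --address <public-ip> \
  --internal-address <private-ip> \
  --etcd --controlplane --worker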

@theAkito

This issue is currently hitting us in a (at the time of writing) benign way:
We use the Java Kubernetes API for requests and need all nodes' external IPs. Currently, in our specific case, each node has its public IP address categorized as the internal IP address, and the external IP address field has no value at all!

This is an issue that needs to be fixed ASAP, considering its age.

@iosifnicolae2

iosifnicolae2 commented Sep 25, 2020

There seems to be no way to convince an RKE cluster to only use the specified IPs for a given node if the node's primary ethernet interface (eth0 or similar) has an IP, whether that IP is public or private.

In my use-case working with Hetzner Cloud (or baremetal), the nodes have public IPs and traffic is explicitly blocked on the public interface. All private traffic traverses a secondary VLAN interface with a different IP address.

The workaround:

  • Create a RKE cluster as usual in the Rancher control plane (Custom)
  • Edit cluster -> Edit as yaml
  • Modify network key as below:
  network:
    mtu: 0
    options:
      flannel_backend_type: host-gw
      flannel_iface: eth0.vlan100
    plugin: flannel
  • Add nodes using their public IP (eth0 primary IP) as --external-address and secondary IP as --internal-address

A huge caveat with the workaround is that the node still uses eth0 as the node's IP and there's no way to tell Rancher to only use the IPs given! Canal/flannel will always look up the primary interface's IP and use it if you try to spoon-feed anything on the CLI. The other approach I've seen is to create two additional private interfaces and use their IPs as external/internal to avoid using the primary ethernet interface's IP.

I've managed to "fix" the way Flannel extracts the host IP by passing the below options through the cluster.yml file:

network:
  plugin: canal
  options:
    canal_iface: ens10 # <- this interface is attached to a private network
    flannel_iface: ens10
  mtu: 0
  canal_network_provider:
    iface: ens10
  flannel_network_provider:
    iface: ens10
  node_selector: {}
  update_strategy: null

There's still a problem with the method used to extract the Internal IP label as seen below:
[screenshot: node list showing the host's public IP in the Internal IP column]

  • as you can see in the above image, the Internal IP is actually the public IP address of the host even though I've specified the node addresses in cluster.yml (address: 10.0.0.5 and internal_address: 10.0.0.5)

  • I'm trying to configure Kubernetes to communicate between hosts using Hetzner private networks and to block all incoming traffic on the public IP address.
  • If it helps, here's the entire cluster.yml file: https://gist.github.com/iosifnicolae2/87805e421a9faf83ca632825d1d6946b

Update

I've managed to solve the Internal IP problem by removing the hostname_override variable from cluster.yml.
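
For anyone else hitting this, a minimal sketch of the nodes entry without the override (addresses, user and role are placeholders; the commented-out line is the one whose removal made the difference for me):

nodes:
  - address: 10.0.0.5
    internal_address: 10.0.0.5
    # hostname_override: node-1   # removing this let the internal IP be picked up correctly
    user: ubuntu
    role: [controlplane, etcd, worker]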

@pasikarkkainen

Probably related rke issue "add ability to force set node-ip argument on kubelet": rancher/rke#900

@almoghamdani

Any news on when this will be fixed?

@Dexolite

Dexolite commented Dec 22, 2020

Hitting the same issue on RKE2.

@JellyZhang

Hitting the same issue.

@Dexolite

Dexolite commented Jan 27, 2021

On RKE2, the issue with nodes being unable to communicate with each other was due to the kubelet defaulting to DNS or ExternalIP.
What fixed it in our case was giving the kube-apiserver an argument to prefer InternalIP.

Our /etc/rancher/rke2/config.yaml looks like:

resolv-conf: "/etc/resolv.conf"
kube-apiserver-arg:
  - kubelet-preferred-address-types=InternalIP

It's a workaround, but at least we can use the cluster.
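
If the root cause in your setup is the kubelet picking the wrong address in the first place, RKE2's config.yaml also accepts node-ip directly. A minimal sketch, assuming 10.0.0.5 is the node's private address:

# /etc/rancher/rke2/config.yaml (sketch; 10.0.0.5 is an assumed private address)
node-ip: "10.0.0.5"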

@haswalt

haswalt commented Jan 29, 2021

Any way to apply these changes when launching an RKE cluster from Rancher?

Like @iosifnicolae2, I am using Hetzner private networks for communication. No traffic is allowed over the public network and we're deploying clusters using the node driver from Rancher. I've tried adding the network plugin settings, however when deploying things like monitoring I'm seeing that the node exporter endpoints are using the public IP address and so cannot be reached.

@haswalt

haswalt commented Jan 29, 2021

Additionally, this causes issues with certificates: they are set up for the private LAN, and since requests go over the public network the certs aren't valid.

@riker09

riker09 commented Apr 8, 2021

Any way to apply these changes when launching an RKE cluster from Rancher?

Like @iosifnicolae2, I am using Hetzner private networks for communication. No traffic is allowed over the public network and we're deploying clusters using the node driver from Rancher. I've tried adding the network plugin settings, however when deploying things like monitoring I'm seeing that the node exporter endpoints are using the public IP address and so cannot be reached.

Were you able to solve your issue? I'm facing the same challenges and I haven't had success so far. Whatever I do, it seems it always boils down to the nodes not using the internal IPs that I assign to them.

@iosifnicolae2

Additionally, this causes issues with certificates: they are set up for the private LAN, and since requests go over the public network the certs aren't valid.

A quick fix for this problem is to set up cert-manager to perform the verification using a DNS challenge.
https://cert-manager.io/docs/configuration/acme/dns01/cloudflare/
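
For anyone who wants a starting point, a minimal ClusterIssuer sketch for the DNS-01 Cloudflare solver along the lines of that page (the email and the secret name/key are placeholders you would adapt):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-dns01-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token-secret
              key: api-token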

@riker09

riker09 commented Apr 21, 2021

@iosifnicolae2 I think what @haswalt actually tried to achieve is securing the internal network with SSL encryption. But the certs are generated for the public IP, which the nodes don't use in this case (they use the internal IP). Using a DNS challenge is good for issuing certs for a publicly reachable domain.

Has anyone figured a working configuration for using a Rancher generated RKE cluster with a Hetzner private network?

This is what I'm after:

Load Balancer 
(with public IP)
     |
+-----------+           +-----------+
| Network A |           | Network B |
+-----------+           +-----------+
     |                        |
  Node 1  --------------------+
  Node 2  --------------------+
  Node 3  --------------------+
    ...                    Rancher

The cluster nodes are attached to networks A and B. In network A we have the Hetzner Load Balancer with the public IP; in network B there is Rancher as a single Docker instance (for now). Adding a node to the cluster works, but the public IP is used. I want the node to use the internal IP in network B.

@haswalt

haswalt commented Apr 21, 2021

@riker09 you are correct, that was my aim. However it's not just certs that are the problem.

The incorrect IP setup means things like the node exporters for Prometheus and other endpoints are set up using the public IP and can neither be accessed nor (and I think this is more important) secured with a firewall.

@riker09

riker09 commented Apr 21, 2021

I think we're on the same page here. 🙂

What is puzzling me is the fact that this is still unsolved. When I explicitly tell the cluster node upon registration to use the private network interface (here: ens11), I would expect Rancher to respect that.

docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run  rancher/rancher-agent:v2.5.7 --server https://rancher.[REDACTED].tld --token [REDACTED] --address ens11 --internal-address ens11 --worker

I don't have/see any problem with the nodes connecting to the Rancher cluster via its public IP. Does anybody else object to that?

@haswalt

haswalt commented Apr 21, 2021

Except I wouldn't want nodes to connect over the public network. That means that rancher communication is going over the WAN. I would expect to be able to make nodes use the private network exclusively.

@riker09

riker09 commented Apr 21, 2021

Except I wouldn't want nodes to connect over the public network. That means that rancher communication is going over the WAN. I would expect to be able to make nodes use the private network exclusively.

Good point. But since the traffic should be SSL encrypted I didn't give it much thought. I will, however, set up a single Rancher node that will reside behind a load balancer.

@iosifnicolae2

@iosifnicolae2 I think what @haswalt actually tried to achieve is securing the internal network with SSL encryption. But the certs are generated for the public IP, which the nodes don't use in this case (they use the internal IP). Using a DNS challenge is good for issuing certs for a publicly reachable domain.

Has anyone figured a working configuration for using a Rancher generated RKE cluster with a Hetzner private network?

This is what I'm after:

Load Balancer 
(with public IP)
     |
+-----------+           +-----------+
| Network A |           | Network B |
+-----------+           +-----------+
     |                        |
  Node 1  --------------------+
  Node 2  --------------------+
  Node 3  --------------------+
    ...                    Rancher

The cluster nodes are attached to networks A and B. In network A we have the Hetzner Load Balancer with the public IP; in network B there is Rancher as a single Docker instance (for now). Adding a node to the cluster works, but the public IP is used. I want the node to use the internal IP in network B.

No, you can issue HTTPS certificates for a domain name that points to an internal IP if you do a DNS challenge verification (I have HTTPS certificates for domains that point to an internal IP).

@haswalt

haswalt commented Apr 21, 2021

I think the point is being missed here and the focus is being pushed onto SSL. Let's ignore SSL for now, as SSL can be achieved in multiple ways.

The issue is that the external and internal IPs are set incorrectly so that communication between nodes does not work over a public network.

In conversations with others in the Rancher community we managed to get a working setup (outside of Hetzner), however this involved using a stack where we could remove the public network interface and have only one, private, interface as eth0.

The setup we're aiming for here is that each node has 2 network interfaces (as provided by Hetzner). This can't really be changed. eth0 is connected to the public WAN; ens11 is the private network between nodes.

We want to be able to secure communication between Rancher and the nodes using a firewall. With autoscaling in place, we don't know the IP address of each new node, but we do know the subnet of the private LAN, so we can allow access via that. We also want to avoid traffic going over the WAN entirely.

Any public network access happens via the load balancers which send traffic over the private network (ens11). So we essentially want to disable any communication over eth0.

Now when we launch new nodes with RKE they don't set the interface / internal network correctly, so connections to the Rancher server and various endpoints attempt to use eth0, which fails because the firewall blocks all traffic over the public WAN.

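To make the firewall side of this concrete, a rough per-node sketch using ufw; the interface name and private subnet are assumptions for illustration, and you would add your own SSH/management rules on top:

ufw default deny incoming
ufw default allow outgoing
ufw allow in on ens11 from 10.0.0.0/24   # private inter-node LAN only
ufw enable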

@riker09

riker09 commented Apr 21, 2021

Yes, this! I couldn't say it any better (really, I tried), so thank you for your summary.

Just a few notes, however:

We want to be able to secure communication between rancher using a firewall. With autoscaling in place, we don't know the IP address of each new node, but we don't know the subnet of the private lan so we can allow access via that. We also want to avoid traffic going over the WAN entirely.

You meant "...but we DO know the subnet", right?

And the private networks between LB A and the Rancher server and between LB B and the nodes could differ. Also between the Rancher server and the nodes. Hetzner names the virtual network interfaces ens1*; the first created interface gets a zero and the number counts up from there.


[EDIT]
I have just created a new virtual server in the Hetzner Cloud and the interface name is different: enp7s0 for the first connected VNet, enp8s0 for the second net and so on. However, this should not have any impact on the issue at hand.

@haswalt

haswalt commented Apr 21, 2021

You're right I did mean "DO" thanks.

@haswalt

haswalt commented Apr 21, 2021

@riker09 The interface name is dictated by the VM type. From Hetzner's docs:

The interface for the first attached network will be named ens10 (for CX, CCX) or enp7s0 (for CPX). Additional interfaces will be named ens11 (CX, CCX) or enp8s0 (CPX) for the second, and ens12 (CX, CCX) or enp9s0 (CPX) for the third.

@riker09

riker09 commented Apr 26, 2021

I believe I found a working solution for my problem. Let me explain my setup:

There are three CX31 nodes running at Hetzner Cloud (the type shouldn't matter, I'm just being thorough). All are provisioned with a combination of Cloud Init and Ansible.

#cloud-config

groups:
- mygroup

users:
- name: myuser
  groups: users, admin, mygroup
  sudo: ALL=(ALL) NOPASSWD:ALL
  shell: /bin/bash
  ssh_authorized_keys:
  - [REDACTED]

packages:
- fail2ban
package_update: true
package_upgrade: true

runcmd:
## Enable Fail2Ban & SSH jail
- mkdir -p /etc/fail2ban/jail.d
- touch /etc/fail2ban/jail.d/sshd.local
- printf "[sshd]\nenabled = true\nbanaction = iptables-multiport" > /etc/fail2ban/jail.d/sshd.local
- systemctl enable fail2ban
## Harden SSH
- sed -i -e '/^\(#\|\)PermitRootLogin/s/^.*$/PermitRootLogin no/' /etc/ssh/sshd_config
- sed -i -e '/^\(#\|\)PasswordAuthentication/s/^.*$/PasswordAuthentication no/' /etc/ssh/sshd_config
- sed -i -e '/^\(#\|\)X11Forwarding/s/^.*$/X11Forwarding no/' /etc/ssh/sshd_config
- sed -i -e '/^\(#\|\)MaxAuthTries/s/^.*$/MaxAuthTries 2/' /etc/ssh/sshd_config
- sed -i -e '/^\(#\|\)AllowTcpForwarding/s/^.*$/AllowTcpForwarding yes/' /etc/ssh/sshd_config
- sed -i -e '/^\(#\|\)AllowAgentForwarding/s/^.*$/AllowAgentForwarding no/' /etc/ssh/sshd_config
- sed -i -e '/^\(#\|\)AuthorizedKeysFile/s/^.*$/AuthorizedKeysFile .ssh\/authorized_keys/' /etc/ssh/sshd_config
- sed -i '$a AllowUsers myuser' /etc/ssh/sshd_config
## Reboot
- reboot

After the initial cloud config has successfully run I further set up each node with Ansible. Nothing fancy, though. I install Docker with geerlingguy.docker and set up a few more users. I especially don't do any firewall tweaking. The only package that I apt install is open-iscsi (required for Longhorn).

One of the three nodes is hosting the Rancher installation, started by docker run -d --restart=unless-stopped -p 80:80 -p 443:443 --privileged --name rancher rancher/rancher:latest according to the official documentation.

I've created two networks in Hetzner Cloud named rancher and k8s with 10.2.0.0/24 and 10.3.0.0/24 respectively. All three nodes are attached to both networks. There's a Load Balancer attached to the k8s network. At this point I created a new cluster and naturally tried the Canal CNI provider first. I ran into weird issues where requests to a NodePort service failed about 50% of the time. After destroying the cluster and cleaning the nodes (!!), I tried Weave as the CNI provider and it looks like it is running stably and as intended.

This is the command that I've used to provision the cluster on the two remaining nodes:

## Main node
docker run -d \
--privileged \
--restart=unless-stopped \
--net=host \
-v /etc/kubernetes:/etc/kubernetes \
-v /var/run:/var/run  \
rancher/rancher-agent:v2.5.7 \
--server https://10.2.0.4 \
--token [REDACTED] \
--ca-checksum [REDACTED] \
--address ens11 \
--internal-address ens10 \
--all-roles

## Worker node
docker run -d \
--privileged \
--restart=unless-stopped \
--net=host \
-v /etc/kubernetes:/etc/kubernetes \
-v /var/run:/var/run  \
rancher/rancher-agent:v2.5.7 \
--server https://10.2.0.4 \
--token [REDACTED] \
--ca-checksum [REDACTED] \
--address ens11 \
--internal-address ens10 \
--worker

For the sake of completeness, this is the Rancher cluster.yaml config:

answers: {}
docker_root_dir: /var/lib/docker
enable_cluster_alerting: false
enable_cluster_monitoring: false
enable_network_policy: false
fleet_workspace_name: fleet-default
local_cluster_auth_endpoint:
  enabled: false
name: mycluster
rancher_kubernetes_engine_config:
  addon_job_timeout: 45
  authentication:
    strategy: x509
  authorization: {}
  bastion_host:
    ssh_agent_auth: false
  cloud_provider: {}
  dns:
    linear_autoscaler_params: {}
    node_selector: null
    nodelocal:
      node_selector: null
      update_strategy: {}
    options: null
    reversecidrs: null
    stubdomains: null
    tolerations: null
    update_strategy: {}
    upstreamnameservers: null
  ignore_docker_version: true
  ingress:
    default_backend: false
    http_port: 0
    https_port: 0
    provider: none
  kubernetes_version: v1.20.5-rancher1-1
  monitoring:
    provider: metrics-server
    replicas: 1
  network:
    mtu: 0
    options:
      flannel_backend_type: vxlan
    plugin: weave
    weave_network_provider: {}
  restore:
    restore: false
  rotate_encryption_key: false
  services:
    etcd:
      backup_config:
        enabled: true
        interval_hours: 12
        retention: 6
        safe_timestamp: false
        timeout: 300
      creation: 12h
      extra_args:
        election-timeout: '5000'
        heartbeat-interval: '500'
      gid: 0
      retention: 72h
      snapshot: false
      uid: 0
    kube-api:
      always_pull_images: false
      pod_security_policy: false
      secrets_encryption_config:
        enabled: false
      service_node_port_range: 30000-32767
    kube-controller: {}
    kubelet:
      fail_swap_on: false
      generate_serving_certificate: false
    kubeproxy: {}
    scheduler: {}
  ssh_agent_auth: false
  upgrade_strategy:
    drain: false
    max_unavailable_controlplane: '1'
    max_unavailable_worker: 10%
    node_drain_input:
      delete_local_data: false
      force: false
      grace_period: -1
      ignore_daemon_sets: true
      timeout: 120
scheduled_cluster_scan:
  enabled: false
  scan_config:
    cis_scan_config:
      override_benchmark_version: rke-cis-1.5
      profile: permissive
  schedule_config:
    cron_schedule: 0 0 * * *
    retention: 24

With this configuration I was able to create a DaemonSet with an Nginx image and a Service with NodePort 30080 that the Load Balancer routes to. Also, the deployment of Longhorn went through without any issues (it has failed in the past). The thing is, when I change the CNI from Weave to Canal everything falls apart. So either the default setup for Canal is buggy or it is missing some essential configuration. 🤷🏻‍♂️
I'll keep playing around with my setup and report any oddities here.

@riker09

riker09 commented May 4, 2021

@haswalt I figured out a way that works. The solution was already in this very issue, see #22584 (comment)

The trick is to use two private networks and use Weave as the CNI provider. In my experiments, when I used Canal it wouldn't work: although the nodes overview showed two private IPs, Longhorn would still try to access a public IP.

Long story short, here's my setup:

Two private networks that are attached to:

  • the Virtual Server where the Rancher installation resides
  • all cluster nodes

I have created a third private network that is attached to a Load Balancer and the cluster nodes, but that does not affect the Rancher cluster setup. I'm just mentioning this for completeness.

When creating the cluster in Rancher I select Weave as CNI, answer "no" to Authorized Endpoint and "no" to Nginx Ingress and that's it.

Since my Rancher installation uses a Let's Encrypt certificate I made one more change to my cluster nodes and that is a modification to the /etc/hosts file:

# Append
10.2.0.2 rancher.my.tld

Let me know if this works for you.

@sebstyle

sebstyle commented May 18, 2021

@riker09 I'm not sure, but I think I ran into this problem a long time ago when experimenting with OpenStack.
I believe the conclusion as to why Weave/Calico works and Canal does not was that:

  • Weave/Calico is not an overlay network
  • Canal is an overlay network
  • The private network Hetzner uses is also an overlay network (look at the reduced MTU size)
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
ens10: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450

The MTU is reduced so that whatever space is left can be used to add overlay routing information to the packet.

I believe the solution was to either

  • Get it to use the Hetzner private network interface as the bridge interface so the existing 1450 MTU can be used,
    or
  • Have an overlay network on top of an overlay network, but reduce the MTU even further on Canal (see the cluster.yml sketch after the interface listing below):
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
ens10: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1400
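
For the second option, a minimal sketch of how the lower MTU could be set in the RKE cluster.yml; the interface name and the 1400 value are assumptions based on the output above, not a tested configuration:

network:
  plugin: canal
  mtu: 1400            # stay below the Hetzner vNet's 1450 MTU to leave room for the overlay header
  options:
    canal_iface: ens10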

/edit
Also... you mentioned 2 networks:

named rancher and k8s with 10.2.0.0/24 and 10.3.0.0/24

Where does that 10.1.0.2 address fit in?

Since my Rancher installation uses a Let's Encrypt certificate I made one more change to my cluster nodes and that is a modification to the /etc/hosts file:
Append
10.1.0.2 rancher.my.tld

Should that not have been an address from your "rancher" range ?
Or is that the "other" private network you were referring to ?

@riker09

riker09 commented May 19, 2021

Where does that 10.1.0.2 address fit in?

Since my Rancher installation uses a Let's Encrypt certificate I made one more change to my cluster nodes and that is a modification to the /etc/hosts file:
Append
10.1.0.2 rancher.my.tld

Should that not have been an address from your "rancher" range ?
Or is that the "other" private network you were referring to ?

You are absolutely right, this should have been an IP from one of the two networks. Achievement unlocked: Typo spotter! 😄

Thanks for your insights into the MTUs; I hope this is helpful for somebody else. I guess I will stick to my two-network solution for now.

@stale stale bot added the status/stale label Jul 18, 2021
@cortopy

cortopy commented Jul 20, 2021

This issue is still relevant. At least until a solution like #17180 is implemented

@stale stale bot removed the status/stale label Jul 20, 2021
@stale

stale bot commented Sep 18, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Sep 18, 2021
@riker09

riker09 commented Sep 21, 2021

What @cortopy said. 😄

@stale stale bot removed the status/stale label Sep 21, 2021
@bpesics

bpesics commented Nov 20, 2021

If you look at Hetzner (private) networks and subnets more closely, the cloud servers are not in one L2 network and can only reach each other via the gateway of the given subnet (look at ip neigh, arping and traceroute).

So one either:

  • needs to add routes to the Hetzner network for the subnets created by a CNI (cumbersome, or it would need to be supported by the CNI), because the Hetzner network does not know which subnet is on which server (see the sketch below),
  • or uses something like Calico which takes care of this in a different way for you... (correct encapsulation with IPIP + BGP)
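
To illustrate the first option, something along these lines would be needed once per node; the hcloud subcommand and flags are written from memory, and the network name, pod CIDR and node IP are placeholders, so treat this as an assumption to verify rather than a recipe:

# route the pod CIDR assumed to be assigned to node-1 (10.42.1.0/24) to that node's
# assumed private IP (10.0.0.2) inside the Hetzner network object
hcloud network add-route my-private-net --destination 10.42.1.0/24 --gateway 10.0.0.2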

@stale

stale bot commented Jan 19, 2022

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Jan 19, 2022
@pasikarkkainen

not stale

@stale stale bot removed the status/stale label Jan 19, 2022
@jszanto

jszanto commented Feb 18, 2022

I just ran into this issue this week and found a workaround. It seems like the main issue is that Rancher does not pass the --node-ip flag to the kubelet. If the node IP is not set, then the kubelet determines it automatically. All other components (including Rancher itself) grab the IP which is set by the kubelet.

The behavior the kubelet uses to determine the IP can be found here: https://github.com/kubernetes/kubernetes/blob/0e0abd602fac12c4422f8fe89c1f04c34067a76f/pkg/kubelet/nodestatus/setters.go#L214; it boils down to:

  1. If the hostname of the node is actually an IP address -> use that
  2. Perform a DNS lookup for the hostname of the node -> if it returns anything then use that
  3. Get the IP of the interface used in the default route.

So simply adding your desired node IP along with the node's hostname to /etc/hosts solves the problem.
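
In other words, something like this on each node (as root) before the kubelet starts; the address is a placeholder for your private IP:

# make the node's hostname resolve to the private address so the kubelet picks it as the node IP
echo "10.0.0.5 $(hostname)" >> /etc/hosts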

@stale

stale bot commented Apr 21, 2022

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Apr 21, 2022
@pasikarkkainen

not stale

@stale stale bot removed the status/stale label Apr 21, 2022
@sebastienroul

sebastienroul commented Apr 28, 2022

Hi!
I confirm it is still there, but only in some network setups (2 network interfaces, and the private network you want to use is not the one with the default gateway).

@jszanto your analysis is right: nothing is sent to the kubelet, so it determines the IP itself. BUT the kubelet container itself is responsible for that wrong IP.
In short, when no CATTLE_ADDRESS (or --address) is sent, it runs ip -o route get 8.8.8.8, so the IP of the public default route is returned.
The guilty code is here:

if [ -z "$CATTLE_ADDRESS" ]; then

And the solution is easy and already there.

Just move the block from line 215

# Extract hostname from URL
CATTLE_SERVER_HOSTNAME=$(echo $CATTLE_SERVER | sed -e 's/[^/]*\/\/\([^@]*@\)\?\([^:/]*\).*/\2/')
CATTLE_SERVER_HOSTNAME_WITH_PORT=$(echo $CATTLE_SERVER | sed -e 's/[^/]*\/\/\([^@]*@\)\?\(.*\).*/\2/')
# Resolve IPv4 address(es) from hostname
RESOLVED_ADDR=$(getent ahostsv4 $CATTLE_SERVER_HOSTNAME | sed -n 's/ *STREAM.*//p')

# If CATTLE_SERVER_HOSTNAME equals RESOLVED_ADDR, its an IP address and there is nothing to resolve
if [ "${CATTLE_SERVER_HOSTNAME}" != "${RESOLVED_ADDR}" ]; then
  info "${CATTLE_SERVER_HOSTNAME} resolves to $(echo $RESOLVED_ADDR)"
fi

to line 162,

and then use the $RESOLVED_ADDR in line 164:

CATTLE_ADDRESS=$(ip -o route get $RESOLVED_ADDR | sed -n 's/.*src \([0-9.]\+\).*/\1/p')

There is a good chance that the route to the resolver is the route over the private net.
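
Put together, the proposed ordering in the agent's run.sh would look roughly like this sketch (simplified from the snippets above; the real script has more around it):

# resolve the Rancher server hostname first...
CATTLE_SERVER_HOSTNAME=$(echo $CATTLE_SERVER | sed -e 's/[^/]*\/\/\([^@]*@\)\?\([^:/]*\).*/\2/')
RESOLVED_ADDR=$(getent ahostsv4 $CATTLE_SERVER_HOSTNAME | sed -n 's/ *STREAM.*//p')
# ...then derive CATTLE_ADDRESS from the route towards that address instead of 8.8.8.8
if [ -z "$CATTLE_ADDRESS" ]; then
    CATTLE_ADDRESS=$(ip -o route get $RESOLVED_ADDR | sed -n 's/.*src \([0-9.]\+\).*/\1/p')
fi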

Hey Rancher team, please add this small hack in the short term; it would really help bare-metal deployments with a public/private network split.

Thanks !

--------------- UPDATED ---------------

sebastienroul added a commit to sebastienroul/rancher that referenced this issue Apr 28, 2022
…etwork interfces are on the node, and you want to use the private one : see rancher#22584 (comment)
@CyberPoison

I'm also facing the same issue, but the solutions provided here didn't work for me.

There seems to be no way to convince an RKE cluster to only use the specified IPs for a given node if the node's primary ethernet interface (eth0 or similar) has an IP, whether that IP is public or private.

In my use-case working with Hetzner Cloud (or baremetal), the nodes have public IPs and traffic is explicitly blocked on the public interface. All private traffic traverses a secondary VLAN interface with a different IP address.

The workaround:

  • Create a RKE cluster as usual in the Rancher control plane (Custom)
  • Edit cluster -> Edit as yaml
  • Modify network key as below:
  network:
    mtu: 0
    options:
      flannel_backend_type: host-gw
      flannel_iface: eth0.vlan100
    plugin: flannel
  • Add nodes using their public IP (eth0 primary IP) as --external-address and secondary IP as --internal-address

A huge caveat with the workaround is that the node still uses eth0 as the node's IP and there's no way to tell Rancher to only use the IPs given! Canal/flannel will always look up the primary interface's IP and use it if you try to spoon-feed anything on the CLI. The other approach I've seen is to create two additional private interfaces and use their IPs as external/internal to avoid using the primary ethernet interface's IP.

Also tried

I just ran into this issue this week and found a workaround. It seems like the main issue is that Rancher does not pass the --node-ip flag to the kubelet. If the node IP is not set, then the kubelet determines it automatically. All other components (including Rancher itself) grab the IP which is set by the kubelet.

The behavior the kubelet uses to determine the IP can be found here: https://github.com/kubernetes/kubernetes/blob/0e0abd602fac12c4422f8fe89c1f04c34067a76f/pkg/kubelet/nodestatus/setters.go#L214; it boils down to:

  1. If the hostname of the node is actually an IP address -> use that
  2. Perform a DNS lookup for the hostname of the node -> if it returns anything then use that
  3. Get the IP of the interface used in the default route.

So simply adding your desired node IP along with the node's hostname to /etc/hosts solves the problem.

but no luck :(
Any other solutions?

@kerren

kerren commented Jun 11, 2022

Hi everyone, I just wanted to add a solution that combines a few things from here and is working for me! By the way, I know some people have hard requirements for specific CNIs; if that's the case then this may not help, because I had to use Calico for this to work (thanks bpesics for your comment above, it's what made me give it a try 😄).

Just for context, this is a cluster on Hetzner Cloud that requires outgoing traffic but I don't need all of that traffic to come from a specific/static IP. I used RKE to set this up and I wanted to load balance requests across all of my worker nodes. The steps I took were as follows:

  1. Create 2 networks. The first is for general servers and "public IPs" for the cluster (you can decide this, but as an example 192.168.0.0/24 would be fine). The second is for your cluster servers; this would only be nodes and nothing else and would serve as the private network (as an example 10.0.0.0/24 would be fine to use).
  2. Update the /etc/hosts file: set the fully qualified domain name (FQDN) to the general server IP, e.g. 192.168.0.5 or whatever it is there. This may be unnecessary, but I have been doing this for years because I think it used to help with interface selection on nodes joining the cluster (it's probably not the case anymore).
  3. Create the RKE cluster.yml file using the general server IPs for SSH and then set the internal IP to the 10.0.0.0/8 network. So for instance, you'd set SSH to 192.168.0.5 but then the internal IP would be 10.0.0.5 (see the cluster.yml sketch after this list).
  4. Choose Calico as the network when it comes to the CNI selection in the RKE setup.
  5. Once the cluster.yml file is generated, edit the ingress section; these are the settings I added:
provider: "nginx"
http_port: 80
https_port: 443
network_mode: "hostNetwork"
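
A minimal sketch of the nodes/network part of a cluster.yml for steps 3 and 4; the addresses, user and roles are placeholders:

nodes:
  - address: 192.168.0.5        # general network IP, used for SSH
    internal_address: 10.0.0.5  # private cluster network
    user: myuser
    role: [controlplane, etcd, worker]
network:
  plugin: calico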

You can now load balance across the worker nodes on port 80 and 443.

I know many may say that this is stupid because I am using "hostNetwork", but accessing the ingress was nearly impossible for me unless I had a direct port bind, especially across multiple interfaces. I was thinking of switching the ingress service from ClusterIP to NodePort, but "hostNetwork" served my needs, so I thought I'd share for anyone else who is looking to get their cluster working.

One final note: the reason I chose a general network (192.168.0.0/24) is that the IP does not change if you restart the node, whereas the public IP is ephemeral. So for the setup and maintenance of a cluster, I figured this was a better approach.

Hope this helps, and thanks to everyone in this issue (and others) that gave feedback that led to this solution. I appreciate the community and the knowledge shared here!

@github-actions

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

@qeternity

Not stale. Is there still no "proper" way to provision an RKE cluster with dual NICs?

@DeyvsonL

DeyvsonL commented May 4, 2023

Not stale

1 similar comment
@medicol69

Not stale

@Soterix

Soterix commented Aug 12, 2024

The ability to launch RKE in configurations with two network interfaces on the node machines is highly needed.
