This repository has been archived by the owner on Oct 11, 2023. It is now read-only.

dual NIC netconfig breaks 0.9.1 -> 0.9.2 and 1.0.0 #1812

Closed
joshuacox opened this issue Apr 24, 2017 · 52 comments

@joshuacox

**RancherOS Version: (ros os version)** 0.9.2-0.1.0

Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, etc.)
KVM VM with two virtual NICs (one public, one private)

I've been following #1477 and #1705 and have run into more woes with my dual NIC setup.

When I upgraded this dual NIC VM to RancherOS 1.0.0, I again found that only one interface was working (the one with the gateway). I rolled back to 0.9.2 and found the same issue there, so I kept rolling back to 0.9.1.

That fixed it for me and I now have two working NICs in separate networks again.

Relevant network section:

rancher:
  network:
    interfaces:
      eth0:
        address: 192.168.0.38/24
        mtu: 1500
        dhcp: false
      eth1:
        address: 10.1.111.93/24
        gateway: 10.1.111.65
@SvenDowideit
Contributor

I thought I wrote a test for this :/

@SvenDowideit SvenDowideit modified the milestones: v1.0.1, vNext Apr 24, 2017
@SvenDowideit
Contributor

I did - https://github.com/rancher/os/pull/1666/files#diff-2f43c1a40bf9c492da522998bd0fb833

but ... @joshuacox I don't suppose you have a spare box you can use to show me the output of ip a, and the dmesg when booted with rancher.debug=true in the kernel cmdline?

@joshuacox
Author

OK, I have figured out why mine failed: having the gateway on the second interface kills the first (I have an ip a of it in the broken state).

For example, this is the broken config:

  network:
    interfaces:
      eth0:
        address: 192.168.0.93/24
        mtu: 1500
        dhcp: false
      eth1:
        address: 10.3.33.247/24
        gateway: 10.3.33.1
        mtu: 1500
        dhcp: false

However, for this next one, I swapped which networks the virtual NICs were attached to (the opposite of the first) and changed the order here:

  network:
    interfaces:
      eth0:
        address: 10.3.33.247/24
        gateway: 10.3.33.1
        mtu: 1500
        dhcp: false
      eth1:
        address: 192.168.0.93/24
        mtu: 1500
        dhcp: false

That works fine. @SvenDowideit feel free to close this, as I can fix mine by ensuring the public interface (the one with the gateway on it) comes first.

@SvenDowideit
Contributor

Ah, yes, and my test uses a gateway on both :( Nope, not closing; I'll add another test and fix it too - hopefully that'll resolve the problem for more future configs.

@joshuacox
Author

I've got another server on 1.0.0 where the above is not necessarily true; possibly after rebooting a few times or some other timeout, the issue resolves itself. Still testing here.

@SvenDowideit
Contributor

I might just make a 1.0.1-rc3 today (depending on how much the flu is affecting me) - at which point, I'd be very interested in seeing the rancher.debug=true dmesg output on both a working and a failing setup (assuming it's not just the ordering issue you noted above).

@joshuacox
Author

joshuacox commented May 2, 2017

@SvenDowideit I'm not certain if you ever made an rc3, but I did upgrade to 1.0.1 and a few of my dual NIC setups immediately died upon reboot (no addresses show up on the console login screen, and they are all unresponsive on every net interface). I needed to get them back up pretty quickly, so I cut a few new VMs using the latest (1.0.1) ISO. The install worked great, and they even took the first reboot fine (where I add in the public interface). But then I logged in, set a new hostname, ran ros console switch ubuntu, and rebooted a second time - dead again. This does not happen on machines that do not receive a second public interface and merely connect to the private service network. Continuing to test here. I'll sudo ros config set rancher.debug=true as it says here and dig deeper.

EDIT: adding the relevant network configs

rancher:
  network:
    interfaces:
      eth0:
        address: 8.8.8.93/24
        gateway: 8.8.8.65
        mtu: 1500
        dhcp: false
      eth1:
        address: 192.168.88.42/24
        mtu: 1500
        dhcp: false

Further EDITs:
I hate this kind of "fix", but after recreating one of the public-facing rancher hosts and rebooting it after it sat with the dead interface for a bit, it 'fixed' itself (no clue if something was going on in the background - I was not able to log in at that point). I'm not certain if this would have fixed the original upgraded machines, as they are all gone. But I can test some more.

@SvenDowideit
Contributor

argh. those are the worst, and probably explain why I'm having so much pain producing a similar result.

@stffabi
Contributor

stffabi commented May 23, 2017

We are seeing the same issue on our servers with multiple NICs, running 1.0.1. I'll try to get a debug output...

@stffabi
Contributor

stffabi commented May 23, 2017

I suspect it has something to do with DHCP being activated by default on all interfaces at startup. I've seen that the interface which ends up with no IP assigned is always the one which had been assigned an IP by DHCP.

For example, in the following case eth2 had been assigned an IP from DHCP (10.141.50.81/16); eth2 then had no IP assigned after rancher started.

[ 21.154500] time="2017-05-23T07:14:24Z" level=info msg="Apply Network Config"
[ 21.155473] time="2017-05-23T07:14:24Z" level=info msg="Applying 127.0.0.1/8 to lo"
[ 21.155518] time="2017-05-23T07:14:24Z" level=info msg="Applying ::1/128 to lo"
[ 21.155671] time="2017-05-23T07:14:24Z" level=info msg="Applying 10.139.241.111/24 to eth0"
[ 21.155757] time="2017-05-23T07:14:24Z" level=info msg="Set 10.139.241.111/24 on eth0"
[ 21.155776] time="2017-05-23T07:14:24Z" level=info msg="removing fe80::20c:29ff:fea5:d29e/64 from eth0"
[ 21.156091] time="2017-05-23T07:14:24Z" level=info msg="Removed fe80::20c:29ff:fea5:d29e/64 from eth0"
[ 21.156364] time="2017-05-23T07:14:24Z" level=info msg="Applying 10.139.18.68/20 to eth1"
[ 21.156435] time="2017-05-23T07:14:24Z" level=info msg="Set 10.139.18.68/20 on eth1"
[ 21.156453] time="2017-05-23T07:14:24Z" level=info msg="removing fe80::20c:29ff:fea5:d2a8/64 from eth1"
[ 21.156673] time="2017-05-23T07:14:24Z" level=info msg="Removed fe80::20c:29ff:fea5:d2a8/64 from eth1"
[ 21.158250] time="2017-05-23T07:14:24Z" level=info msg="Added default gateway 10.139.16.1"
[ 21.158352] time="2017-05-23T07:14:24Z" level=info msg="Applying 10.141.0.33/16 to eth2"
[ 21.158415] time="2017-05-23T07:14:24Z" level=info msg="Set 10.141.0.33/16 on eth2"
[ 21.158433] time="2017-05-23T07:14:24Z" level=info msg="removing 10.141.50.81/16 eth2 from eth2"
[ 21.158501] time="2017-05-23T07:14:24Z" level=info msg="Removed 10.141.50.81/16 eth2 from eth2"
[ 21.158519] time="2017-05-23T07:14:24Z" level=info msg="removing fe80::20c:29ff:fea5:d2b2/64 from eth2"
[ 21.158727] time="2017-05-23T07:14:24Z" level=info msg="Removed fe80::20c:29ff:fea5:d2b2/64 from eth2"
[ 21.158914] time="2017-05-23T07:14:24Z" level=info msg="Apply Network Config RunDhcp"
[ 21.158927] time="2017-05-23T07:14:24Z" level=debug msg=RunDhcp
[ 21.188548] time="2017-05-23T07:14:24Z" level=error msg="exit status 1"
[ 21.188748] time="2017-05-23T07:14:24Z" level=debug msg="dhcpcd -u eth2: "
[ 21.191272] time="2017-05-23T07:14:24Z" level=error msg="exit status 1"
[ 21.191664] time="2017-05-23T07:14:24Z" level=error msg="exit status 1"
[ 21.192036] time="2017-05-23T07:14:24Z" level=debug msg="dhcpcd -u eth1: "
[ 21.192039] time="2017-05-23T07:14:24Z" level=debug msg="dhcpcd -u eth0: "
[ 21.192107] time="2017-05-23T07:14:24Z" level=info msg="Apply Network Config SyncHostname"

Here you see a dmesg output after a successful boot; in that case no IP had been assigned from DHCP.

[ 24.038581] time="2017-05-23T08:53:51Z" level=info msg="Apply Network Config"
[ 24.039612] time="2017-05-23T08:53:51Z" level=info msg="Applying 127.0.0.1/8 to lo"
[ 24.039665] time="2017-05-23T08:53:51Z" level=info msg="Applying ::1/128 to lo"
[ 24.039853] time="2017-05-23T08:53:51Z" level=info msg="Applying 10.139.18.68/20 to eth0"
[ 24.039911] time="2017-05-23T08:53:51Z" level=info msg="removing fe80::20c:29ff:fea5:d29e/64 from eth0"
[ 24.040198] time="2017-05-23T08:53:51Z" level=info msg="Removed fe80::20c:29ff:fea5:d29e/64 from eth0"
[ 24.041831] time="2017-05-23T08:53:51Z" level=info msg="Added default gateway 10.139.16.1"
[ 24.041935] time="2017-05-23T08:53:51Z" level=info msg="Applying 10.139.241.111/24 to eth1"
[ 24.042030] time="2017-05-23T08:53:51Z" level=info msg="Set 10.139.241.111/24 on eth1"
[ 24.042050] time="2017-05-23T08:53:51Z" level=info msg="removing fe80::20c:29ff:fea5:d2a8/64 from eth1"
[ 24.042301] time="2017-05-23T08:53:51Z" level=info msg="Removed fe80::20c:29ff:fea5:d2a8/64 from eth1"
[ 24.042562] time="2017-05-23T08:53:51Z" level=info msg="Applying 10.141.0.33/16 to eth2"
[ 24.042639] time="2017-05-23T08:53:51Z" level=info msg="Set 10.141.0.33/16 on eth2"
[ 24.042658] time="2017-05-23T08:53:51Z" level=info msg="removing fe80::20c:29ff:fea5:d2b2/64 from eth2"
[ 24.042865] time="2017-05-23T08:53:51Z" level=info msg="Removed fe80::20c:29ff:fea5:d2b2/64 from eth2"
[ 24.043052] time="2017-05-23T08:53:51Z" level=info msg="Apply Network Config RunDhcp"
[ 24.043064] time="2017-05-23T08:53:51Z" level=debug msg=RunDhcp
[ 24.112092] time="2017-05-23T08:53:51Z" level=error msg="exit status 1"
[ 24.112444] time="2017-05-23T08:53:51Z" level=debug msg="dhcpcd -u eth0: "
[ 24.112621] time="2017-05-23T08:53:51Z" level=error msg="exit status 1"
[ 24.112805] time="2017-05-23T08:53:51Z" level=debug msg="dhcpcd -u eth2: "
[ 24.112837] time="2017-05-23T08:53:51Z" level=error msg="exit status 1"
[ 24.112865] time="2017-05-23T08:53:51Z" level=debug msg="dhcpcd -u eth1: "
[ 24.112932] time="2017-05-23T08:53:51Z" level=info msg="Apply Network Config SyncHostname"

This would also explain why sometimes the IP addresses get correctly assigned and on another reboot it fails: it depends on whether the DHCP server manages to lease out an IP before the config is changed to the static IP.

@joshuacox do you have DHCP servers running on the networks whose interfaces have this problem?
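
If that race is indeed the culprit, one hypothetical mitigation (not something RancherOS is confirmed to do) would be to release any DHCP lease on interfaces configured with dhcp: false before the static address is applied, e.g. via dhcpcd's -k/--release flag. A minimal Go sketch, purely illustrative:

package main

import (
    "log"
    "os/exec"
)

// releaseDHCP asks dhcpcd to de-configure the interface and drop its lease,
// so a racing lease cannot survive alongside the static address that is
// about to be applied.
func releaseDHCP(ifaceName string) error {
    out, err := exec.Command("dhcpcd", "-k", ifaceName).CombinedOutput()
    if err != nil {
        log.Printf("dhcpcd -k %s: %v (%s)", ifaceName, err, out)
    }
    return err
}

func main() {
    // Interfaces that the cloud-config marks as dhcp: false (example values).
    for _, iface := range []string{"eth0", "eth2"} {
        if err := releaseDHCP(iface); err != nil {
            log.Printf("could not release DHCP on %s", iface)
        }
    }
}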

@stffabi
Contributor

stffabi commented May 23, 2017

@SvenDowideit do you know what the meaning of the err == syscall.EEXIST error code is in

if err := netlink.AddrAdd(link, addr); err == syscall.EEXIST {

Could this be the problem - that this error happens when DHCP has already assigned an IP in the same network as the one defined by the static IP?
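
For reference, a minimal sketch of that pattern with the vishvananda/netlink package (an illustration, not the RancherOS source); as far as I know, EEXIST here means the exact address/prefix is already present on the link, not merely another address in the same subnet:

package main

import (
    "log"
    "syscall"

    "github.com/vishvananda/netlink"
)

func applyAddress(ifaceName, cidr string) error {
    link, err := netlink.LinkByName(ifaceName)
    if err != nil {
        return err
    }
    addr, err := netlink.ParseAddr(cidr) // e.g. "10.141.0.33/16"
    if err != nil {
        return err
    }
    // EEXIST means this exact address is already configured on the link,
    // so it can be treated as "nothing to do" rather than as a failure.
    if err := netlink.AddrAdd(link, addr); err == syscall.EEXIST {
        log.Printf("%s already set on %s", cidr, ifaceName)
        return nil
    } else if err != nil {
        return err
    }
    return nil
}

func main() {
    if err := applyAddress("eth2", "10.141.0.33/16"); err != nil {
        log.Fatal(err)
    }
}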

@joshuacox
Author

joshuacox commented May 23, 2017

@stffabi I do have a DHCP server running on the private network.

It happens to be a virtual net in KVM, which I believe is supplied by dnsmasq, and which is not inside a rancher host.

Even though I have configured both interfaces as static, I have seen one interface get two addresses (i.e. one from the static config, and another from the dhcp server).

But right now in v1.0.1, and given I have the public address first in my configs, everything has been working for the past three weeks.

@prologic

prologic commented Jun 5, 2017

I think I've been bitten by this and finally worked out why my upgraded node is an utter failure :/

> I suspect it has something to do with DHCP being activated per default on startup on all interfaces. I've seen that the interface which has no ip assigned is the one which has always been assigned an IP from the DHCP.

My dmesg | grep eth0 output:

[rancher@dm4 ~]$ dmesg | grep eth0
[    9.675681] e1000 0000:00:12.0 eth0: (PCI:33MHz:32-bit) 36:39:61:37:66:32
[    9.676203] e1000 0000:00:12.0 eth0: Intel(R) PRO/1000 Network Connection
[   10.081475] time="2017-06-05T04:09:55Z" level=info msg="Running DHCP on eth0: dhcpcd -MA4 -e force_hostname=true eth0"
[   10.104736] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   12.136912] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[   12.137970] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   13.166429] time="2017-06-05T04:09:58Z" level=info msg="Running DHCP on eth0: dhcpcd -MA4 -e force_hostname=true eth0"
[   17.945853] time="2017-06-05T04:10:02Z" level=info msg="Applying 10.0.0.13/24 to eth0"
[   17.948054] time="2017-06-05T04:10:02Z" level=info msg="Set 10.0.0.13/24 on eth0"
[   17.948944] time="2017-06-05T04:10:02Z" level=info msg="removing  10.0.0.102/24 eth0 from eth0"
[   17.951031] time="2017-06-05T04:10:02Z" level=info msg="Removed 10.0.0.102/24 eth0 from eth0"
[   17.952491] time="2017-06-05T04:10:02Z" level=info msg="removing  2603:3024:181a:c600:3439:61ff:fe37:6632/64 from eth0"
[   17.954406] time="2017-06-05T04:10:02Z" level=info msg="Removed 2603:3024:181a:c600:3439:61ff:fe37:6632/64 from eth0"
[   17.956503] time="2017-06-05T04:10:02Z" level=info msg="removing  fe80::3439:61ff:fe37:6632/64 from eth0"
[   17.959120] time="2017-06-05T04:10:02Z" level=info msg="Removed fe80::3439:61ff:fe37:6632/64 from eth0"
[   18.015139] time="2017-06-05T04:10:02Z" level=debug msg="dhcpcd -u eth0: "
[   18.933181] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   18.940783] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

My user_config.yml:

#cloud-config
hostname: dm4
ssh_authorized_keys:
  - ssh-rsa ...
rancher:
  network:
    dns:
      nameservers:
        - 8.8.8.8
        - 8.8.4.4
    interfaces:
      eth0:
        dhcp: false
        address: 10.0.0.13/24
        gateway: 10.0.0.1
        mtu: 1500
      eth1:
        dhcp: true
  upgrade:
      url: https://releases.rancher.com/os/releases.yml
      image: rancher/os

When this boots:

[rancher@dm4 ~]$ /sbin/ip addr show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 36:39:61:37:66:32 brd ff:ff:ff:ff:ff:ff
    inet6 2603:3024:181a:c600:3439:61ff:fe37:6632/64 scope global mngtmpaddr dynamic
       valid_lft 345582sec preferred_lft 345582sec

@prologic

prologic commented Jun 5, 2017

@SvenDowideit If it helps I have a pretty reliable repro here

@prologic

prologic commented Jun 5, 2017

My work-around for the moment is to assign static addresses from my network's DHCP server

@stffabi
Contributor

stffabi commented Jun 6, 2017

@prologic yes, we currently use the same workaround.

@SvenDowideit
Contributor

@prologic nice! now I wonder if we can get our heads together to make an automated integration test that fails every time too

@prologic

prologic commented Jun 8, 2017 via email

@joshuacox
Author

joshuacox commented Jun 8, 2017

@prologic you are describing the exact workflow I use for my private nodes (those without public interfaces). However, my private nodes have never had that issue. Just to be exact, here is the section from a working node with an assigned IP in a private network that is controlled by KVM, with dnsmasq as the DHCP server:

rancher:
  network:
    interfaces:
      eth0:
        address: 192.168.1.44/24
        gateway: 192.168.1.1
        mtu: 1500
        dhcp: false

Note: I never told dnsmasq about the assigned IP, and that IP is outside of the pool of IPs dnsmasq draws upon for assignments [in this case 192.168.1.3-192.168.1.20 is the pool for dhcp leases]

@prologic

prologic commented Jun 8, 2017 via email

@joshuacox
Author

@prologic what ros os version are you on? I see I am a bit behind with:

[rancher@rancher ~]$ sudo ros os version
v1.0.1

I guess I'll attempt an upgrade tomorrow, or sometime this weekend.

@prologic

prologic commented Jun 9, 2017 via email

@SvenDowideit
Contributor

Yeah, it's likely to be 1.0 or from the last 0.9 - I had to rejig the pre-cloud-init networking code pretty majorly to fix it for several platforms, and in the process it's obviously not cleaning up after itself.

The thing I wanted (er, needed) was to have the interfaces come up with DHCP, then query/detect all the possible cloud-init options, and then use those to set things to whatever is actually desired.

The old behavior made an incorrect guess about when no network was needed to get cloud-init, and then set things up - which is obviously easier to get right for simpler cloud-init cases, but utterly breaks for the more complicated ones.

@prologic what weirds me out is that your dmesg looks like what I expect - it gets a random DHCP address, then does a cloud-init-save, and then sets the IP to 10.0.0.13/24 and removes the addresses that came from the DHCP server - and when it works for me, that's a-ok.
I do wonder if the order is a problem - maybe the remove needs to happen before the add, but that feels off to me.

And presumably it's just something simple and dumb that I picked up somewhere :(
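
For readers following along, a very rough Go sketch of the boot-time ordering described in the comment above (DHCP first, then cloud-init detection, then the desired config); the helper functions are hypothetical placeholders, not the actual RancherOS code paths:

package main

import "log"

// Placeholders standing in for the real RancherOS machinery.
func runDHCPOnAllInterfaces() error       { return nil }
func loadCloudConfig() (string, error)    { return "", nil }
func applyNetworkConfig(cfg string) error { return nil }

func bootNetwork() error {
    // 1. Bring every interface up with DHCP so cloud-init sources
    //    (metadata services, config drives, ...) are reachable at all.
    if err := runDHCPOnAllInterfaces(); err != nil {
        return err
    }
    // 2. Query/detect the possible cloud-init options and merge the config.
    cfg, err := loadCloudConfig()
    if err != nil {
        return err
    }
    // 3. Apply whatever is actually desired - which also has to clean up
    //    the temporary DHCP addresses acquired in step 1 (the part that
    //    appears to go wrong in this issue).
    return applyNetworkConfig(cfg)
}

func main() {
    if err := bootNetwork(); err != nil {
        log.Fatal(err)
    }
}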

@SvenDowideit
Contributor

And just as a :( - I do have a DHCP server with a random address, and I do install testing here setting a static address, and it works. I'm willing to bet it's enough of a timing issue that which DHCP server you have makes a difference (I'm mostly using my Billion router, but I'm pretty close to needing to build an automated multi-server test for other issues too).

@prologic

prologic commented Jun 9, 2017 via email

@SvenDowideit
Contributor

@prologic I've just followed your reproduction steps on qemu, and nope - works ok

off to set up a non-vm system and network :/

#cloud-config
rancher:
  network:
    interfaces:
      eth0:
        address: 192.168.1.44/24
        gateway: 192.168.1.1
        mtu: 1500
        dhcp: false
  bootstrap_docker:
    registry_mirror: "http://10.10.10.23:5555"
    insecure_registry:
    - 10.10.10.23:5000
    - arian:5000
  docker:
    registry_mirror: "http://10.10.10.23:5555"
    insecure_registry:
    - 10.10.10.23:5000
    - arian:5000
  system_docker:
    registry_mirror: "http://10.10.10.23:5555"
    insecure_registry:
    - 10.10.10.23:5000
    - arian:5000
ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC85w9stZyiLQp/DkVO6fqwiShYcj1ClKdtCqgHtf+PLpJkFReSFu8y21y+ev09gsSMRRrjF7yt0pUHV6zncQhVeqsZtgc5WbELY2DOYUGmRn/CCvPbXovoBrQjSorqlBmpuPwsStYLr92Xn+VVsMNSUIegHY22DphGbDKG85vrKB8HxUxGIDxFBds/uE8FhSy+xsoyT/jUZDK6pgq2HnGl6D81ViIlKecpOpWlW3B+fea99ADNyZNVvDzbHE5pcI3VRw8u59WmpWOUgT6qacNVACl8GqpBvQk8sw7O/X9DSZHCKafeD9G5k+GYbAUz92fKWrx/lOXfUXPS3+c8dRIF

[root@rancher rancher]# system-docker logs network
DEBU[0451] Calling GET /v1.23/containers/network/json   
DEBU[0451] Calling GET /v1.23/containers/network/logs?stderr=1&stdout=1&tail=all 
DEBU[0451] logs: begin stream                           
DEBU[0451] logs: end stream                             
> time="2017-06-12T04:58:25Z" level=info msg="Apply Network Config" 
> time="2017-06-12T04:58:25Z" level=debug msg="Config: &netconf.NetworkConfig{PreCmds:[]string(nil), DNS:netconf.DNSConfig{Nameservers:[]string(nil), Search:[]string(nil)}, Interfaces:map[string]netconf.InterfaceConfig{\"eth0\":netconf.InterfaceConfig{Match:\"\", DHCP:false, DHCPArgs:\"\", Address:\"192.168.1.44/24\", Addresses:[]string(nil), IPV4LL:false, Gateway:\"192.168.1.1\", GatewayIpv6:\"\", MTU:1500, Bridge:\"\", Bond:\"\", BondOpts:map[string]string(nil), PostUp:[]string(nil), PreUp:[]string(nil), Vlans:\"\"}, \"lo\":netconf.InterfaceConfig{Match:\"\", DHCP:false, DHCPArgs:\"\", Address:\"\", Addresses:[]string{\"127.0.0.1/8\", \"::1/128\"}, IPV4LL:false, Gateway:\"\", GatewayIpv6:\"\", MTU:0, Bridge:\"\", Bond:\"\", BondOpts:map[string]string(nil), PostUp:[]string(nil), PreUp:[]string(nil), Vlans:\"\"}}, PostCmds:[]string(nil), HTTPProxy:\"\", HTTPSProxy:\"\", NoProxy:\"\"}" 
> time="2017-06-12T04:58:25Z" level=info msg="Applying 127.0.0.1/8 to lo" 
> time="2017-06-12T04:58:25Z" level=info msg="Applying ::1/128 to lo" 
> time="2017-06-12T04:58:25Z" level=info msg="Applying 192.168.1.44/24 to eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="Set 192.168.1.44/24 on eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="removing  10.0.2.15/24 eth0 from eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="Removed 10.0.2.15/24 eth0 from eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="removing  fec0::5054:ff:fe12:3456/64 from eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="Removed fec0::5054:ff:fe12:3456/64 from eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="removing  fe80::5054:ff:fe12:3456/64 from eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="Removed fe80::5054:ff:fe12:3456/64 from eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="Added default gateway 192.168.1.1" 
> time="2017-06-12T04:58:25Z" level=info msg="Apply Network Config RunDhcp" 
> time="2017-06-12T04:58:25Z" level=debug msg=RunDhcp 
> time="2017-06-12T04:58:25Z" level=error msg="exit status 1" 
> time="2017-06-12T04:58:25Z" level=debug msg="dhcpcd -u eth0: " 
> time="2017-06-12T04:58:25Z" level=info msg="Apply Network Config SyncHostname" 
> time="2017-06-12T04:58:25Z" level=info msg="Restart syslog" 

@SvenDowideit
Contributor

and when I do the same on my real network, with real hardware, the same thing - my eth0 has the static IP I requested in the config.

@stffabi
Contributor

stffabi commented Jun 12, 2017

At least for @prologic and me, it seems like we have the same subnet for the static IP address and the DHCP address.

  • Me: DHCP 10.141.50.81/16 and static 10.141.0.33/16
  • @prologic: DHCP 10.0.0.102/24 and static 10.0.0.13/24

This does not seem to be the case in your test, @SvenDowideit. Don't know if that makes any difference...
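
A small, hypothetical helper (not from the RancherOS tree) to spot the situation described here - a DHCP lease sitting in the same subnet as the static address that is about to be applied:

package main

import (
    "fmt"
    "net"
)

// sameSubnet reports whether the leased address falls inside the subnet of
// the configured static address.
func sameSubnet(staticCIDR, leasedCIDR string) (bool, error) {
    _, staticNet, err := net.ParseCIDR(staticCIDR)
    if err != nil {
        return false, err
    }
    leasedIP, _, err := net.ParseCIDR(leasedCIDR)
    if err != nil {
        return false, err
    }
    return staticNet.Contains(leasedIP), nil
}

func main() {
    // The two overlapping pairs reported in this thread.
    pairs := [][2]string{
        {"10.141.0.33/16", "10.141.50.81/16"}, // @stffabi
        {"10.0.0.13/24", "10.0.0.102/24"},     // @prologic
    }
    for _, p := range pairs {
        overlap, _ := sameSubnet(p[0], p[1])
        fmt.Printf("static %s vs DHCP %s: same subnet = %v\n", p[0], p[1], overlap)
    }
}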

@SvenDowideit
Contributor

SvenDowideit commented Jun 12, 2017

And after more messing about on VMware, I think I have it.

I've set up 3 eths:

  network:
    interfaces:
      eth0:
        address: 10.10.10.253/24
        dhcp: false
        gateway: 10.10.10.1
        mtu: 1500
      eth2:
        dhcp: true
  • eth0 set statically using cloud-config (and this time it's got no IP)
  • eth1 with no dhcp setting in the cfg - which has undefined behaviour: if you have no rancher.network settings it's effectively dhcp: true, otherwise it's left over from the pre-cloud-config dhcp values
  • eth2 set as rancher.network.interfaces.eth2.dhcp: true - and I think it's setting this that then causes eth0 to be broken.

whereas:

  network:
    interfaces:
      eth0:
        address: 10.10.10.253/24
        dhcp: false
        gateway: 10.10.10.1
        mtu: 1500
      eth2:
        dhcp: false

results in eth0 being 10.10.10.253 and eth1 still having its old dhcp address.

Time to walk the dog - but this is good progress.

@SvenDowideit
Contributor

OH BOOM! poked it s'more, and there was just one little thing missing from the --boothd option cmdline used by the integration tests - finally, something is working today :)

@prologic

prologic commented Jun 12, 2017 via email

@SvenDowideit
Contributor

No, today's been a disaster. To automate the reproducible test case I found yesterday, I coded up a new comms path for our integration tests, only to have that lock up my local computer (and then I wasted a few hours trying to track that down). Giving up on that, I moved to a server box - where the VM didn't lock up the host but instead locks itself up in pretty much unrelated code - so now I can't manually get my reproduction to happen without locking up one or the other.

I'm probably going to give up on it for a few days and circle back to it later.

@SvenDowideit
Contributor

#1915 - I've made a dumbed down test - and have commented out the one that kills my hosts.

@prologic

prologic commented Jun 13, 2017 via email

@SvenDowideit
Contributor

@joshuacox @prologic @stffabi, I've built a test release with the add & remove IP switched around - would it be possible for you to try it out?

https://github.com/rancher/os/releases/tag/v1.1.0-test1

I'm not holding my breath, but we might get lucky :)
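
For context, a rough sketch of what "remove before add" could look like with the vishvananda/netlink package; this is a hypothetical illustration of the ordering change, not the actual code in the test release:

package main

import (
    "log"
    "syscall"

    "github.com/vishvananda/netlink"
)

// setStaticAddress strips whatever addresses dhcpcd (or anything else) left
// on the interface before adding the static one, i.e. the reversed ordering
// being tried here.
func setStaticAddress(ifaceName, cidr string) error {
    link, err := netlink.LinkByName(ifaceName)
    if err != nil {
        return err
    }
    want, err := netlink.ParseAddr(cidr)
    if err != nil {
        return err
    }
    // 1. Remove every existing IPv4 address that is not the desired one
    //    (e.g. a leftover DHCP lease in the same subnet).
    existing, err := netlink.AddrList(link, netlink.FAMILY_V4)
    if err != nil {
        return err
    }
    for _, a := range existing {
        if a.IPNet.String() == want.IPNet.String() {
            continue
        }
        if err := netlink.AddrDel(link, &a); err != nil {
            log.Printf("could not remove %s from %s: %v", a.IPNet, ifaceName, err)
        }
    }
    // 2. Then add the static address; EEXIST just means it was already there.
    if err := netlink.AddrAdd(link, want); err != nil && err != syscall.EEXIST {
        return err
    }
    return nil
}

func main() {
    if err := setStaticAddress("eth0", "10.30.111.33/24"); err != nil {
        log.Fatal(err)
    }
}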

@prologic

prologic commented Jun 14, 2017 via email

@stffabi
Contributor

stffabi commented Jun 14, 2017

@SvenDowideit 🎉 🎉. I've upgraded one of our servers and rebooted it several times. It always came up with all interfaces and correct IP settings. Awesome...

@prologic

prologic commented Jun 14, 2017 via email

@joshuacox
Author

@SvenDowideit initial testing says all good here. Of note, I have not put that on my production front-end (the rancher with the public IP), but I have done quite a bit of testing with two interfaces on a newly cut VM built using v1.1.0-test1. From what I can tell it should all work fine.

@joshuacox
Author

joshuacox commented Jun 22, 2017

Interestingly enough, I tried out 1.0.2 and eth1 was dead in the water. Scrolling up with shift-pageup in the KVM console, I could see dhcpcd firing off despite me having an entirely static configuration; it did look like it removed the DHCP addresses and left the static ones (including on eth1), but when I boot I get nothing but this:

3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether ff:54:00:ee:ee:ee brd ff:ff:ff:ff:ff:ff

my net configs:

rancher:
  network:
    interfaces:
      eth0:
        address: 10.30.111.33/24
        gateway: 10.30.111.65
        mtu: 1500
        dhcp: false
      eth1:
        address: 192.168.1.42/24
        mtu: 1500
        dhcp: false

I ended up rolling back to 1.0.0, which was the first to boot with both interfaces working. 1.0.1 has the same problem as 1.0.2 here.

@SvenDowideit SvenDowideit modified the milestones: v1.0.3, vNext Jun 23, 2017
@SvenDowideit
Contributor

Considering the feedback for the v1.1.0-test1 release, and now the need to deal with the Stack Clash CVE, I've merged the PR into v1.0.x and am building and uploading a v1.0.3-rc1 - @stffabi @joshuacox @prologic if it doesn't work for you, please yell :D (sorry it's taking a long time, TurnBull net is excruciatingly slow tonight)

@SvenDowideit
Contributor

@prologic

prologic commented Jun 23, 2017 via email

@SvenDowideit
Contributor

The ISO is :magic: - it'll install that version (er, I hope, so long as the magic didn't leak out)

@stffabi
Contributor

stffabi commented Jun 23, 2017

@SvenDowideit I've upgraded one server, looks good. 🎉

@SvenDowideit
Contributor

I'm going to close this one for https://github.com/rancher/os/releases/tag/v1.0.3

@shaoyangyu

I have been suffering from a similar problem for around a week: with a dual NIC config, the IP addresses can't be allocated correctly every time the host reboots.

Going to upgrade my system now; not sure if my problem will be fixed too.

@shaoyangyu

@SvenDowideit Thanks for your work. I upgraded to 1.0.3 and tested with 10 reboots; each time the dual IP addresses were allocated correctly. No pain now. :) 👍
