This repository has been archived by the owner on Oct 11, 2023. It is now read-only.

dual NIC netconfig breaks 0.9.1 -> 0.9.2 and 1.0.0 #1812

Closed
joshuacox opened this issue Apr 24, 2017 · 52 comments

@joshuacox

**RancherOS Version: (ros os version)** 0.9.2-0.1.0

Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, etc.)
KVM VM with two virtual NICs (one public, one private)

I've been following #1477 and #1705 and have run into more woes with my dual NIC setup.

When I upgraded this dual NIC VM to RancherOS 1.0.0, I again found that only one interface was working (the one with the gateway). I rolled back to 0.9.2 and found the same issue there, so I kept rolling back to 0.9.1.

That fixed it for me and I now have two working NICs in separate networks again.

Relevant network section:

rancher:
  network:
    interfaces:
      eth0:
        address: 192.168.0.38/24
        mtu: 1500
        dhcp: false
      eth1:
        address: 10.1.111.93/24
        gateway: 10.1.111.65
@SvenDowideit
Contributor

I thought I wrote a test for this :/

@SvenDowideit SvenDowideit modified the milestones: v1.0.1, vNext Apr 24, 2017
@SvenDowideit
Contributor

I did - https://github.com/rancher/os/pull/1666/files#diff-2f43c1a40bf9c492da522998bd0fb833

but ... @joshuacox I don't suppose you have a spare box you can use to show me the output of ip a, and the dmesg when booted with rancher.debug=true in the kernel cmdline?

@joshuacox
Author

OK, I have figured out why mine failed: having the gateway on the second interface kills the first (I have an ip a of it in the broken state).

For example, this is the broken config:

  network:
    interfaces:
      eth0:
        address: 192.168.0.93/24
        mtu: 1500
        dhcp: false
      eth1:
        address: 10.3.33.247/24
        gateway: 10.3.33.1
        mtu: 1500
        dhcp: false

However, for this next one, I swapped which networks the virtual NICs were attached to (the opposite of the first) and changed the order here:

  network:
    interfaces:
      eth0:
        address: 10.3.33.247/24
        gateway: 10.3.33.1
        mtu: 1500
        dhcp: false
      eth1:
        address: 192.168.0.93/24
        mtu: 1500
        dhcp: false

That works fine. @SvenDowideit feel free to close this, as I can fix mine by ensuring the public interface (the one with the gateway on it) comes first.

@SvenDowideit
Contributor

Ah, yes, and my test uses a gateway on both :( Nope, not closing; I'll add another test and fix it too - hopefully that'll resolve the problem for more future configs.

@joshuacox
Author

I've got another server on 1.0.0 where the above is not necessarily true; possibly after rebooting a few times or some other timeout, the issue resolves itself. Still testing here.

@SvenDowideit
Contributor

I might just make a 1.0.1-rc3 today (depending on how much the flu is affecting me) - at which point, I'd be very interested in seeing the rancher.debug=true dmesg output on both a working and a failing setup (assuming it's not just the ordering issue you noted above).

@joshuacox
Author

joshuacox commented May 2, 2017

@SvenDowideit I'm not certain if you ever made an rc3, but I did upgrade to 1.0.1 and a few of my dual NIC setups immediately died upon reboot (no addresses show up on the console login screen, and they are all unresponsive on every net interface). I needed to get them back up pretty quickly, so I cut a few new VMs using the latest (1.0.1) ISO. The install worked great, and they even took the first reboot fine (where I add in the public interface). But then I logged in, set a new hostname, ran ros console switch ubuntu, and rebooted a second time - dead again. This does not happen on machines that do not receive a second public interface and merely connect to the private service network. Continuing to test here. I'll sudo ros config set rancher.debug=true as it says here and dig deeper.

EDIT: adding the relevant network configs

rancher:
  network:
    interfaces:
      eth0:
        address: 8.8.8.93/24
        gateway: 8.8.8.65
        mtu: 1500
        dhcp: false
      eth1:
        address: 192.168.88.42/24
        mtu: 1500
        dhcp: false

Further EDITs:
I hate this kind of "fix", but after recreating one of the public-facing rancher hosts and rebooting it after it sat with the dead interface for a bit, it 'fixed' itself (no clue if something was going on in the background - I was not able to log in at that point). I'm not certain if this would have fixed the original upgraded machines, as they are all gone. But I can test some more.

@SvenDowideit
Contributor

argh. those are the worst, and probably explain why I'm having so much pain producing a similar result.

@stffabi
Contributor

stffabi commented May 23, 2017

We are seeing the same issue on our servers with multiple NICs, running 1.0.1. I'll try to get a debug output...

@stffabi
Contributor

stffabi commented May 23, 2017

I suspect it has something to do with DHCP being activated by default on all interfaces at startup. I've seen that the interface which ends up with no IP assigned is always the one which had been assigned an IP by DHCP.

For example, in the following case eth2 had been assigned an IP from DHCP (10.141.50.81/16); eth2 then had no IP assigned after rancher started.

[ 21.154500] time="2017-05-23T07:14:24Z" level=info msg="Apply Network Config"
[ 21.155473] time="2017-05-23T07:14:24Z" level=info msg="Applying 127.0.0.1/8 to lo"
[ 21.155518] time="2017-05-23T07:14:24Z" level=info msg="Applying ::1/128 to lo"
[ 21.155671] time="2017-05-23T07:14:24Z" level=info msg="Applying 10.139.241.111/24 to eth0"
[ 21.155757] time="2017-05-23T07:14:24Z" level=info msg="Set 10.139.241.111/24 on eth0"
[ 21.155776] time="2017-05-23T07:14:24Z" level=info msg="removing fe80::20c:29ff:fea5:d29e/64 from eth0"
[ 21.156091] time="2017-05-23T07:14:24Z" level=info msg="Removed fe80::20c:29ff:fea5:d29e/64 from eth0"
[ 21.156364] time="2017-05-23T07:14:24Z" level=info msg="Applying 10.139.18.68/20 to eth1"
[ 21.156435] time="2017-05-23T07:14:24Z" level=info msg="Set 10.139.18.68/20 on eth1"
[ 21.156453] time="2017-05-23T07:14:24Z" level=info msg="removing fe80::20c:29ff:fea5:d2a8/64 from eth1"
[ 21.156673] time="2017-05-23T07:14:24Z" level=info msg="Removed fe80::20c:29ff:fea5:d2a8/64 from eth1"
[ 21.158250] time="2017-05-23T07:14:24Z" level=info msg="Added default gateway 10.139.16.1"
[ 21.158352] time="2017-05-23T07:14:24Z" level=info msg="Applying 10.141.0.33/16 to eth2"
[ 21.158415] time="2017-05-23T07:14:24Z" level=info msg="Set 10.141.0.33/16 on eth2"
[ 21.158433] time="2017-05-23T07:14:24Z" level=info msg="removing 10.141.50.81/16 eth2 from eth2"
[ 21.158501] time="2017-05-23T07:14:24Z" level=info msg="Removed 10.141.50.81/16 eth2 from eth2"
[ 21.158519] time="2017-05-23T07:14:24Z" level=info msg="removing fe80::20c:29ff:fea5:d2b2/64 from eth2"
[ 21.158727] time="2017-05-23T07:14:24Z" level=info msg="Removed fe80::20c:29ff:fea5:d2b2/64 from eth2"
[ 21.158914] time="2017-05-23T07:14:24Z" level=info msg="Apply Network Config RunDhcp"
[ 21.158927] time="2017-05-23T07:14:24Z" level=debug msg=RunDhcp
[ 21.188548] time="2017-05-23T07:14:24Z" level=error msg="exit status 1"
[ 21.188748] time="2017-05-23T07:14:24Z" level=debug msg="dhcpcd -u eth2: "
[ 21.191272] time="2017-05-23T07:14:24Z" level=error msg="exit status 1"
[ 21.191664] time="2017-05-23T07:14:24Z" level=error msg="exit status 1"
[ 21.192036] time="2017-05-23T07:14:24Z" level=debug msg="dhcpcd -u eth1: "
[ 21.192039] time="2017-05-23T07:14:24Z" level=debug msg="dhcpcd -u eth0: "
[ 21.192107] time="2017-05-23T07:14:24Z" level=info msg="Apply Network Config SyncHostname"

Here you see a dmesg output after a successful boot; in that case no IP had been assigned from DHCP.

[ 24.038581] time="2017-05-23T08:53:51Z" level=info msg="Apply Network Config"
[ 24.039612] time="2017-05-23T08:53:51Z" level=info msg="Applying 127.0.0.1/8 to lo"
[ 24.039665] time="2017-05-23T08:53:51Z" level=info msg="Applying ::1/128 to lo"
[ 24.039853] time="2017-05-23T08:53:51Z" level=info msg="Applying 10.139.18.68/20 to eth0"
[ 24.039911] time="2017-05-23T08:53:51Z" level=info msg="removing fe80::20c:29ff:fea5:d29e/64 from eth0"
[ 24.040198] time="2017-05-23T08:53:51Z" level=info msg="Removed fe80::20c:29ff:fea5:d29e/64 from eth0"
[ 24.041831] time="2017-05-23T08:53:51Z" level=info msg="Added default gateway 10.139.16.1"
[ 24.041935] time="2017-05-23T08:53:51Z" level=info msg="Applying 10.139.241.111/24 to eth1"
[ 24.042030] time="2017-05-23T08:53:51Z" level=info msg="Set 10.139.241.111/24 on eth1"
[ 24.042050] time="2017-05-23T08:53:51Z" level=info msg="removing fe80::20c:29ff:fea5:d2a8/64 from eth1"
[ 24.042301] time="2017-05-23T08:53:51Z" level=info msg="Removed fe80::20c:29ff:fea5:d2a8/64 from eth1"
[ 24.042562] time="2017-05-23T08:53:51Z" level=info msg="Applying 10.141.0.33/16 to eth2"
[ 24.042639] time="2017-05-23T08:53:51Z" level=info msg="Set 10.141.0.33/16 on eth2"
[ 24.042658] time="2017-05-23T08:53:51Z" level=info msg="removing fe80::20c:29ff:fea5:d2b2/64 from eth2"
[ 24.042865] time="2017-05-23T08:53:51Z" level=info msg="Removed fe80::20c:29ff:fea5:d2b2/64 from eth2"
[ 24.043052] time="2017-05-23T08:53:51Z" level=info msg="Apply Network Config RunDhcp"
[ 24.043064] time="2017-05-23T08:53:51Z" level=debug msg=RunDhcp
[ 24.112092] time="2017-05-23T08:53:51Z" level=error msg="exit status 1"
[ 24.112444] time="2017-05-23T08:53:51Z" level=debug msg="dhcpcd -u eth0: "
[ 24.112621] time="2017-05-23T08:53:51Z" level=error msg="exit status 1"
[ 24.112805] time="2017-05-23T08:53:51Z" level=debug msg="dhcpcd -u eth2: "
[ 24.112837] time="2017-05-23T08:53:51Z" level=error msg="exit status 1"
[ 24.112865] time="2017-05-23T08:53:51Z" level=debug msg="dhcpcd -u eth1: "
[ 24.112932] time="2017-05-23T08:53:51Z" level=info msg="Apply Network Config SyncHostname"

This would also explain why sometimes the IP addresses get correctly assigned and on another reboot it fails: it depends on whether the DHCP server manages to lease out an IP before the config is changed to the static IP.

@joshuacox do you have DHCP servers running on the networks whose interfaces have this problem?
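
If that race is indeed the culprit, one hypothetical mitigation (not something RancherOS is confirmed to do) would be to release any DHCP lease on interfaces configured with dhcp: false before the static address is applied, e.g. via dhcpcd's -k/--release flag. A minimal Go sketch, purely illustrative:

package main

import (
    "log"
    "os/exec"
)

// releaseDHCP asks dhcpcd to de-configure the interface and drop its lease,
// so a racing lease cannot survive alongside the static address that is
// about to be applied.
func releaseDHCP(ifaceName string) error {
    out, err := exec.Command("dhcpcd", "-k", ifaceName).CombinedOutput()
    if err != nil {
        log.Printf("dhcpcd -k %s: %v (%s)", ifaceName, err, out)
    }
    return err
}

func main() {
    // Interfaces that the cloud-config marks as dhcp: false (example values).
    for _, iface := range []string{"eth0", "eth2"} {
        if err := releaseDHCP(iface); err != nil {
            log.Printf("could not release DHCP on %s", iface)
        }
    }
}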

@stffabi
Contributor

stffabi commented May 23, 2017

@SvenDowideit do you know what the meaning of the err == syscall.EEXIST error code is in

if err := netlink.AddrAdd(link, addr); err == syscall.EEXIST {

Could this be the problem - that this error happens when DHCP has already assigned an IP in the same network as the one defined by the static IP?
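
For reference, a minimal sketch of that pattern with the vishvananda/netlink package (an illustration, not the RancherOS source); as far as I know, EEXIST here means the exact address/prefix is already present on the link, not merely another address in the same subnet:

package main

import (
    "log"
    "syscall"

    "github.com/vishvananda/netlink"
)

func applyAddress(ifaceName, cidr string) error {
    link, err := netlink.LinkByName(ifaceName)
    if err != nil {
        return err
    }
    addr, err := netlink.ParseAddr(cidr) // e.g. "10.141.0.33/16"
    if err != nil {
        return err
    }
    // EEXIST means this exact address is already configured on the link,
    // so it can be treated as "nothing to do" rather than as a failure.
    if err := netlink.AddrAdd(link, addr); err == syscall.EEXIST {
        log.Printf("%s already set on %s", cidr, ifaceName)
        return nil
    } else if err != nil {
        return err
    }
    return nil
}

func main() {
    if err := applyAddress("eth2", "10.141.0.33/16"); err != nil {
        log.Fatal(err)
    }
}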

@joshuacox
Author

joshuacox commented May 23, 2017

@stffabi I do have a DHCP server running on the private network.

It happens to be a virtual net in KVM, which I believe is supplied by dnsmasq, and which is not inside a rancher host.

Even though I have configured both interfaces as static, I have seen one interface get two addresses (i.e. one from the static config, and another from the dhcp server).

But right now in v1.0.1, and given I have the public address first in my configs, everything has been working for the past three weeks.

@prologic

prologic commented Jun 5, 2017

I think I've been bitten by this and finally worked out why my upgraded node is an utter failure :/

> I suspect it has something to do with DHCP being activated per default on startup on all interfaces. I've seen that the interface which has no ip assigned is the one which has always been assigned an IP from the DHCP.

My dmesg | grep eth0 output:

[rancher@dm4 ~]$ dmesg | grep eth0
[    9.675681] e1000 0000:00:12.0 eth0: (PCI:33MHz:32-bit) 36:39:61:37:66:32
[    9.676203] e1000 0000:00:12.0 eth0: Intel(R) PRO/1000 Network Connection
[   10.081475] time="2017-06-05T04:09:55Z" level=info msg="Running DHCP on eth0: dhcpcd -MA4 -e force_hostname=true eth0"
[   10.104736] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   12.136912] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[   12.137970] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   13.166429] time="2017-06-05T04:09:58Z" level=info msg="Running DHCP on eth0: dhcpcd -MA4 -e force_hostname=true eth0"
[   17.945853] time="2017-06-05T04:10:02Z" level=info msg="Applying 10.0.0.13/24 to eth0"
[   17.948054] time="2017-06-05T04:10:02Z" level=info msg="Set 10.0.0.13/24 on eth0"
[   17.948944] time="2017-06-05T04:10:02Z" level=info msg="removing  10.0.0.102/24 eth0 from eth0"
[   17.951031] time="2017-06-05T04:10:02Z" level=info msg="Removed 10.0.0.102/24 eth0 from eth0"
[   17.952491] time="2017-06-05T04:10:02Z" level=info msg="removing  2603:3024:181a:c600:3439:61ff:fe37:6632/64 from eth0"
[   17.954406] time="2017-06-05T04:10:02Z" level=info msg="Removed 2603:3024:181a:c600:3439:61ff:fe37:6632/64 from eth0"
[   17.956503] time="2017-06-05T04:10:02Z" level=info msg="removing  fe80::3439:61ff:fe37:6632/64 from eth0"
[   17.959120] time="2017-06-05T04:10:02Z" level=info msg="Removed fe80::3439:61ff:fe37:6632/64 from eth0"
[   18.015139] time="2017-06-05T04:10:02Z" level=debug msg="dhcpcd -u eth0: "
[   18.933181] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   18.940783] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

My user_config.yml:

#cloud-config
hostname: dm4
ssh_authorized_keys:
  - ssh-rsa ...
rancher:
  network:
    dns:
      nameservers:
        - 8.8.8.8
        - 8.8.4.4
    interfaces:
      eth0:
        dhcp: false
        address: 10.0.0.13/24
        gateway: 10.0.0.1
        mtu: 1500
      eth1:
        dhcp: true
  upgrade:
      url: https://releases.rancher.com/os/releases.yml
      image: rancher/os

When this boots:

[rancher@dm4 ~]$ /sbin/ip addr show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 36:39:61:37:66:32 brd ff:ff:ff:ff:ff:ff
    inet6 2603:3024:181a:c600:3439:61ff:fe37:6632/64 scope global mngtmpaddr dynamic
       valid_lft 345582sec preferred_lft 345582sec

@prologic

prologic commented Jun 5, 2017

@SvenDowideit If it helps I have a pretty reliable repro here

@prologic

prologic commented Jun 5, 2017

My work-around for the moment is to assign static addresses from my network's DHCP server

@stffabi
Contributor

stffabi commented Jun 6, 2017

@prologic yes, we currently use the same workaround.

@SvenDowideit
Contributor

@prologic nice! now I wonder if we can get our heads together to make an automated integration test that fails every time too

@prologic

prologic commented Jun 8, 2017 via email

@joshuacox
Author

joshuacox commented Jun 8, 2017

@prologic you are describing the exact workflow I use for my private nodes (those without public interfaces). However, my private nodes have never had that issue. Just to be exact, here is the section from a working node with an assigned IP in a private network that is controlled by KVM, with dnsmasq as the DHCP server:

rancher:
  network:
    interfaces:
      eth0:
        address: 192.168.1.44/24
        gateway: 192.168.1.1
        mtu: 1500
        dhcp: false

Note: I never told dnsmasq about the assigned IP, and that IP is outside of the pool of IPs dnsmasq draws upon for assignments [in this case 192.168.1.3-192.168.1.20 is the pool for dhcp leases]

@prologic

prologic commented Jun 8, 2017 via email

@joshuacox
Author

@prologic what ros os version are you on? I see I am a bit behind with:

[rancher@rancher ~]$ sudo ros os version
v1.0.1

I guess I'll attempt an upgrade tomorrow, or sometime this weekend.

@prologic

prologic commented Jun 9, 2017 via email

@SvenDowideit
Contributor

Yeah, it's likely to be 1.0 or from the last 0.9 - I had to rejig the pre-cloud-init networking code pretty majorly to fix it for several platforms, and in the process it's obviously not cleaning up after itself.

The thing I wanted (er, needed) was to have the interfaces come up with DHCP, then query/detect all the possible cloud-init options, and then use those to set things to whatever is actually desired.

The old behavior made an incorrect guess about when no network was needed to get cloud-init, and then set things up - which is obviously easier to get right for simpler cloud-init cases, but utterly breaks for the more complicated ones.

@prologic what weirds me out is that your dmesg looks like what I expect - it gets a random DHCP address, then does a cloud-init-save, and then sets the IP to 10.0.0.13/24 and removes the addresses that came from the DHCP server - and when it works for me, that's a-ok.
I do wonder if the order is a problem - maybe the remove needs to happen before the add, but that feels off to me.

And presumably it's just something simple and dumb that I picked up somewhere :(
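
For readers following along, a very rough Go sketch of the boot-time ordering described in the comment above (DHCP first, then cloud-init detection, then the desired config); the helper functions are hypothetical placeholders, not the actual RancherOS code paths:

package main

import "log"

// Placeholders standing in for the real RancherOS machinery.
func runDHCPOnAllInterfaces() error       { return nil }
func loadCloudConfig() (string, error)    { return "", nil }
func applyNetworkConfig(cfg string) error { return nil }

func bootNetwork() error {
    // 1. Bring every interface up with DHCP so cloud-init sources
    //    (metadata services, config drives, ...) are reachable at all.
    if err := runDHCPOnAllInterfaces(); err != nil {
        return err
    }
    // 2. Query/detect the possible cloud-init options and merge the config.
    cfg, err := loadCloudConfig()
    if err != nil {
        return err
    }
    // 3. Apply whatever is actually desired - which also has to clean up
    //    the temporary DHCP addresses acquired in step 1 (the part that
    //    appears to go wrong in this issue).
    return applyNetworkConfig(cfg)
}

func main() {
    if err := bootNetwork(); err != nil {
        log.Fatal(err)
    }
}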

@SvenDowideit
Contributor

And just as a :( - I do have a DHCP server with a random address, and I do install testing here setting a static address, and it works. I'm willing to bet it's enough of a timing issue that which DHCP server you have makes a difference (I'm mostly using my Billion router, but I'm pretty close to needing to build an automated multi-server test for other issues too).

@prologic

prologic commented Jun 9, 2017 via email

@SvenDowideit
Contributor

@prologic I've just followed your reproduction steps on qemu, and nope - works ok

off to set up a non-vm system and network :/

#cloud-config
rancher:
  network:
    interfaces:
      eth0:
        address: 192.168.1.44/24
        gateway: 192.168.1.1
        mtu: 1500
        dhcp: false
  bootstrap_docker:
    registry_mirror: "http://10.10.10.23:5555"
    insecure_registry:
    - 10.10.10.23:5000
    - arian:5000
  docker:
    registry_mirror: "http://10.10.10.23:5555"
    insecure_registry:
    - 10.10.10.23:5000
    - arian:5000
  system_docker:
    registry_mirror: "http://10.10.10.23:5555"
    insecure_registry:
    - 10.10.10.23:5000
    - arian:5000
ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC85w9stZyiLQp/DkVO6fqwiShYcj1ClKdtCqgHtf+PLpJkFReSFu8y21y+ev09gsSMRRrjF7yt0pUHV6zncQhVeqsZtgc5WbELY2DOYUGmRn/CCvPbXovoBrQjSorqlBmpuPwsStYLr92Xn+VVsMNSUIegHY22DphGbDKG85vrKB8HxUxGIDxFBds/uE8FhSy+xsoyT/jUZDK6pgq2HnGl6D81ViIlKecpOpWlW3B+fea99ADNyZNVvDzbHE5pcI3VRw8u59WmpWOUgT6qacNVACl8GqpBvQk8sw7O/X9DSZHCKafeD9G5k+GYbAUz92fKWrx/lOXfUXPS3+c8dRIF

[root@rancher rancher]# system-docker logs network
DEBU[0451] Calling GET /v1.23/containers/network/json   
DEBU[0451] Calling GET /v1.23/containers/network/logs?stderr=1&stdout=1&tail=all 
DEBU[0451] logs: begin stream                           
DEBU[0451] logs: end stream                             
> time="2017-06-12T04:58:25Z" level=info msg="Apply Network Config" 
> time="2017-06-12T04:58:25Z" level=debug msg="Config: &netconf.NetworkConfig{PreCmds:[]string(nil), DNS:netconf.DNSConfig{Nameservers:[]string(nil), Search:[]string(nil)}, Interfaces:map[string]netconf.InterfaceConfig{\"eth0\":netconf.InterfaceConfig{Match:\"\", DHCP:false, DHCPArgs:\"\", Address:\"192.168.1.44/24\", Addresses:[]string(nil), IPV4LL:false, Gateway:\"192.168.1.1\", GatewayIpv6:\"\", MTU:1500, Bridge:\"\", Bond:\"\", BondOpts:map[string]string(nil), PostUp:[]string(nil), PreUp:[]string(nil), Vlans:\"\"}, \"lo\":netconf.InterfaceConfig{Match:\"\", DHCP:false, DHCPArgs:\"\", Address:\"\", Addresses:[]string{\"127.0.0.1/8\", \"::1/128\"}, IPV4LL:false, Gateway:\"\", GatewayIpv6:\"\", MTU:0, Bridge:\"\", Bond:\"\", BondOpts:map[string]string(nil), PostUp:[]string(nil), PreUp:[]string(nil), Vlans:\"\"}}, PostCmds:[]string(nil), HTTPProxy:\"\", HTTPSProxy:\"\", NoProxy:\"\"}" 
> time="2017-06-12T04:58:25Z" level=info msg="Applying 127.0.0.1/8 to lo" 
> time="2017-06-12T04:58:25Z" level=info msg="Applying ::1/128 to lo" 
> time="2017-06-12T04:58:25Z" level=info msg="Applying 192.168.1.44/24 to eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="Set 192.168.1.44/24 on eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="removing  10.0.2.15/24 eth0 from eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="Removed 10.0.2.15/24 eth0 from eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="removing  fec0::5054:ff:fe12:3456/64 from eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="Removed fec0::5054:ff:fe12:3456/64 from eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="removing  fe80::5054:ff:fe12:3456/64 from eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="Removed fe80::5054:ff:fe12:3456/64 from eth0" 
> time="2017-06-12T04:58:25Z" level=info msg="Added default gateway 192.168.1.1" 
> time="2017-06-12T04:58:25Z" level=info msg="Apply Network Config RunDhcp" 
> time="2017-06-12T04:58:25Z" level=debug msg=RunDhcp 
> time="2017-06-12T04:58:25Z" level=error msg="exit status 1" 
> time="2017-06-12T04:58:25Z" level=debug msg="dhcpcd -u eth0: " 
> time="2017-06-12T04:58:25Z" level=info msg="Apply Network Config SyncHostname" 
> time="2017-06-12T04:58:25Z" level=info msg="Restart syslog" 

@SvenDowideit
Contributor

and when I do the same on my real network, with real hardware, the same thing - my eth0 has the static IP I requested in the config.

@stffabi
Contributor

stffabi commented Jun 12, 2017

At least for @prologic and me, it seems like we have the same subnet for the static IP address and the DHCP address.

  • Me: DHCP 10.141.50.81/16 and static 10.141.0.33/16
  • @prologic: DHCP 10.0.0.102/24 and static 10.0.0.13/24

This does not seem to be the case in your test, @SvenDowideit. Don't know if that makes any difference...
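
A small, hypothetical helper (not from the RancherOS tree) to spot the situation described here - a DHCP lease sitting in the same subnet as the static address that is about to be applied:

package main

import (
    "fmt"
    "net"
)

// sameSubnet reports whether the leased address falls inside the subnet of
// the configured static address.
func sameSubnet(staticCIDR, leasedCIDR string) (bool, error) {
    _, staticNet, err := net.ParseCIDR(staticCIDR)
    if err != nil {
        return false, err
    }
    leasedIP, _, err := net.ParseCIDR(leasedCIDR)
    if err != nil {
        return false, err
    }
    return staticNet.Contains(leasedIP), nil
}

func main() {
    // The two overlapping pairs reported in this thread.
    pairs := [][2]string{
        {"10.141.0.33/16", "10.141.50.81/16"}, // @stffabi
        {"10.0.0.13/24", "10.0.0.102/24"},     // @prologic
    }
    for _, p := range pairs {
        overlap, _ := sameSubnet(p[0], p[1])
        fmt.Printf("static %s vs DHCP %s: same subnet = %v\n", p[0], p[1], overlap)
    }
}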

@SvenDowideit
Contributor

SvenDowideit commented Jun 12, 2017

And after more messing about on VMware, I think I have it.

I've set up 3 eths:

  network:
    interfaces:
      eth0:
        address: 10.10.10.253/24
        dhcp: false
        gateway: 10.10.10.1
        mtu: 1500
      eth2:
        dhcp: true
  • eth0 set statically using cloud-config (and this time it's got no IP)
  • eth1 with no dhcp setting in the cfg - which has undefined behaviour: if you have no rancher.network settings it's effectively dhcp: true, otherwise it's left over from the pre-cloud-config dhcp values
  • eth2 set as rancher.network.interfaces.eth2.dhcp: true - and I think it's setting this that then causes eth0 to be broken.

whereas:

  network:
    interfaces:
      eth0:
        address: 10.10.10.253/24
        dhcp: false
        gateway: 10.10.10.1
        mtu: 1500
      eth2:
        dhcp: false

results in eth0 being 10.10.10.253 and eth1 still having its old dhcp address.

Time to walk the dog - but this is good progress.

@SvenDowideit
Contributor

OH BOOM! poked it s'more, and there was just one little thing missing from the --boothd option cmdline used by the integration tests - finally, something is working today :)

@prologic

prologic commented Jun 12, 2017 via email

@SvenDowideit
Contributor

No, today's been a disaster. To automate the reproducible test case I found yesterday, I coded up a new comms path for our integration tests, only to have that lock up my local computer (and then I wasted a few hours trying to track that down). Giving up on that, I moved to a server box - where the VM didn't lock up the host but instead locks itself up in pretty much unrelated code - so now I can't manually get my reproduction to happen without locking up one or the other.

I'm probably going to give up on it for a few days and circle back to it later.

@SvenDowideit
Contributor

#1915 - I've made a dumbed down test - and have commented out the one that kills my hosts.

@prologic

prologic commented Jun 13, 2017 via email

@SvenDowideit
Contributor

@joshuacox @prologic @stffabi, I've built a test release with the add & remove IP switched around - would it be possible for you to try it out?

https://github.com/rancher/os/releases/tag/v1.1.0-test1

I'm not holding my breath, but we might get lucky :)
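
For context, a rough sketch of what "remove before add" could look like with the vishvananda/netlink package; this is a hypothetical illustration of the ordering change, not the actual code in the test release:

package main

import (
    "log"
    "syscall"

    "github.com/vishvananda/netlink"
)

// setStaticAddress strips whatever addresses dhcpcd (or anything else) left
// on the interface before adding the static one, i.e. the reversed ordering
// being tried here.
func setStaticAddress(ifaceName, cidr string) error {
    link, err := netlink.LinkByName(ifaceName)
    if err != nil {
        return err
    }
    want, err := netlink.ParseAddr(cidr)
    if err != nil {
        return err
    }
    // 1. Remove every existing IPv4 address that is not the desired one
    //    (e.g. a leftover DHCP lease in the same subnet).
    existing, err := netlink.AddrList(link, netlink.FAMILY_V4)
    if err != nil {
        return err
    }
    for _, a := range existing {
        if a.IPNet.String() == want.IPNet.String() {
            continue
        }
        if err := netlink.AddrDel(link, &a); err != nil {
            log.Printf("could not remove %s from %s: %v", a.IPNet, ifaceName, err)
        }
    }
    // 2. Then add the static address; EEXIST just means it was already there.
    if err := netlink.AddrAdd(link, want); err != nil && err != syscall.EEXIST {
        return err
    }
    return nil
}

func main() {
    if err := setStaticAddress("eth0", "10.30.111.33/24"); err != nil {
        log.Fatal(err)
    }
}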

@prologic

prologic commented Jun 14, 2017 via email

@stffabi
Contributor

stffabi commented Jun 14, 2017

@SvenDowideit 🎉 🎉. I've upgraded one of our servers and rebooted it several times. It always came up with all interfaces and correct IP settings. Awesome...

@prologic

prologic commented Jun 14, 2017 via email

@joshuacox
Author

@SvenDowideit initial testing says all good here. Of note, I have not put that on my production front-end (the rancher with the public IP), but I have done quite a bit of testing with two interfaces on a newly cut VM built using v1.1.0-test1. From what I can tell it should all work fine.

@joshuacox
Author

joshuacox commented Jun 22, 2017

Interestingly enough, I tried out 1.0.2 and eth1 was dead in the water. Scrolling up with shift-pageup in the KVM console, I could see dhcpcd firing off despite me having an entirely static configuration; it did look like it removed the DHCP addresses and left the static ones (including on eth1), but when I boot I get nothing but this:

3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether ff:54:00:ee:ee:ee brd ff:ff:ff:ff:ff:ff

my net configs:

rancher:
  network:
    interfaces:
      eth0:
        address: 10.30.111.33/24
        gateway: 10.30.111.65
        mtu: 1500
        dhcp: false
      eth1:
        address: 192.168.1.42/24
        mtu: 1500
        dhcp: false

I ended up rolling back to 1.0.0, which was the first to boot with both interfaces working. 1.0.1 has the same problem as 1.0.2 here.

@SvenDowideit SvenDowideit modified the milestones: v1.0.3, vNext Jun 23, 2017
@SvenDowideit
Contributor

Considering the feedback for the v1.1.0-test1 release, and now the need to deal with the Stack Clash CVE, I've merged the PR into v1.0.x and am building and uploading a v1.0.3-rc1 - @stffabi @joshuacox @prologic if it doesn't work for you, please yell :D (sorry it's taking a long time, TurnBull net is excruciatingly slow tonight)

@SvenDowideit
Contributor

@prologic

prologic commented Jun 23, 2017 via email

@SvenDowideit
Contributor

The ISO is :magic: - it'll install that version (er, I hope, so long as the magic didn't leak out)

@stffabi
Contributor

stffabi commented Jun 23, 2017

@SvenDowideit I've upgraded one server, looks good. 🎉

@SvenDowideit
Contributor

I'm going to close this one for https://github.com/rancher/os/releases/tag/v1.0.3

@shaoyangyu

I have been suffering from a similar problem for around a week: with a dual NIC config, the IP addresses can't be allocated correctly every time the host reboots.

Going to upgrade my system now; not sure if my problem will be fixed too.

@shaoyangyu

@SvenDowideit Thanks for your work. I upgraded to 1.0.3 and tested with 10 reboots; each time the dual IP addresses were allocated correctly. No pain now. :) 👍
