dual NIC netconfig breaks 0.9.1 -> 0.9.2 and 1.0.0 #1812
Comments
I thought I wrote a test for this :/ |
I did - https://github.com/rancher/os/pull/1666/files#diff-2f43c1a40bf9c492da522998bd0fb833 but ... @joshuacox I don't suppose you have a spare box you can use to show me the output of |
Ok, I have figured out why mine failed: having the gateway on the second address kills the first. For example, this is a broken config:
However, for this next one, I changed out which networks the virtual nics were attached (the opposite of the first), and changed the order here:
that works fine, @SvenDowideit feel free to close this as I can |
ah, yes, and my test uses a gateway on both :( nope, not closing, I'll add another test and fix it too - hopefully that'll resolve the problem for more future configs. |
I've got another server on 1.0.0 where the above is not necessarily true; possibly after rebooting a few times, or after some other timeout, the issue resolves itself. Still testing here. |
I might just make a 1.0.1-rc3 today (depending on how much the flu is affecting me) - at which point, i'd be very interested in seeing the |
@SvenDowideit I'm not certain if you ever made an rc3, but I did upgrade to 1.0.1 and a few of my dual NIC setups immediately died upon reboot (no addresses show up on the console login screen, and they are all unresponsive on any net interface). I needed to get them back up pretty quickly, so I cut a few new VMs using the latest (1.0.1) ISO. Install worked great, and they even took the first reboot fine (where I add in the public interface). But then I logged in, set a new hostname, EDIT: adding the relevant network configs
further EDITs: |
argh. those are the worst, and probably explain why I'm having so much pain producing a similar result. |
We are seeing the same issue on our servers with multiple NICs, running 1.0.1. I'll try to get debug output... |
I suspect it has something to do with DHCP being activated by default on startup on all interfaces. I've seen that the interface which ends up with no IP assigned is always the one which had been assigned an IP from DHCP. For example, in the following case, eth2 had been assigned an IP from DHCP (10.141.50.81/16). In that case eth2 had no IP assigned after RancherOS had started.
Here you see the dmesg output after a successful boot; in that case no IP had been assigned from DHCP.
This would also explain why the IP addresses sometimes get assigned correctly and then fail after another reboot. It depends on whether the DHCP server manages to lease out an IP before the config has been changed to the static IP. @joshuacox do you have DHCP servers running on the networks whose interfaces have this problem? |
@SvenDowideit do you know the meaning of Line 285 in 6d5fc4c
|
@stffabi I do have a DHCP server running on the private network. It happens to be a virtual net in KVM, which I believe is supplied by dnsmasq, and which is not inside a rancher host. Even though I have configured both interfaces as static, I have seen one interface get two addresses (i.e. one from the static config, and another from the DHCP server). But right now in |
I think I've been bitten by this and finally worked out why my upgraded node is an utter failure :/
My
My
When this boots:
|
@SvenDowideit If it helps I have a pretty reliable repro here |
My work-around for the moment is to assign static addresses from my network's DHCP server |
@prologic yes, we currently use the same workaround. |
@prologic nice! now I wonder if we can get our heads together to make an automated integration test that fails every time too |
What do you need?
1) Setup a DHCP server on a network/vlan
2) Spin up a RancherOS KVM instance on the same network
3) Install to Disk and set eth0 as a static address
4) Reboot
Watch eth0 come back with no address. Poof :D
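
For anyone wanting to automate the check after the reboot step, here is a rough sketch (not part of RancherOS or its test suite). It assumes you can capture the output of `ip -o -4 addr show` from the VM somehow; the function names `parse_ip_addr` and `check_static` are made up for illustration. It flags both the "no address" case and the "static plus a leftover DHCP address" case mentioned earlier in the thread.

```python
import re

# Matches one line of `ip -o -4 addr show` output, for example:
#   "2: eth0    inet 192.168.1.44/24 brd 192.168.1.255 scope global eth0"
_LINE = re.compile(r"^\d+:\s+(\S+)\s+inet\s+(\S+)")


def parse_ip_addr(output):
    """Return {interface: [cidr, ...]} parsed from `ip -o -4 addr show` text."""
    addrs = {}
    for line in output.splitlines():
        match = _LINE.match(line.strip())
        if match:
            addrs.setdefault(match.group(1), []).append(match.group(2))
    return addrs


def check_static(expected, output):
    """Return {interface: problem} for interfaces that deviate from `expected`.

    `expected` maps an interface name to the static CIDR it should carry.
    """
    found = parse_ip_addr(output)
    problems = {}
    for iface, cidr in expected.items():
        have = found.get(iface, [])
        if cidr not in have:
            problems[iface] = "missing static address"
        elif len(have) > 1:
            problems[iface] = "extra address present"
    return problems


if __name__ == "__main__":
    # Abbreviated, hand-written sample output: eth0 has no IPv4 address at all.
    sample = "1: lo    inet 127.0.0.1/8 scope host lo\n"
    print(check_static({"eth0": "192.168.1.44/24"}, sample))
```

Running the check over several reboots would also give a rough idea of how often the problem shows up, since it appears to be timing dependent.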
|
@prologic you are describing the exact workflow I use for my private nodes (those without public interfaces). However, my private nodes have never had that issue. Just to be exact, here is the section from a working node with an assigned IP in a private network that is controlled by KVM with dnsmasq as DHCP server:

    rancher:
      network:
        interfaces:
          eth0:
            address: 192.168.1.44/24
            gateway: 192.168.1.1
            mtu: 1500
            dhcp: false

Note: I never told dnsmasq about the assigned IP, and that IP is outside of the pool of IPs dnsmasq draws upon for assignments [in this case 192.168.1.3-192.168.1.20 is the pool for DHCP leases] |
Yeah, but currently this doesn't work because RancherOS does DHCP at boot, then tries to remove things later if the interface was configured to be static, and something goes wrong. I don't understand it fully yet (I was only perusing some commits that affected this).
|
@prologic what ros os version are you on? I see I am behind a bit with a:

    $ sudo ros os version
    v1.0.1

I guess I'll attempt an upgrade tomorrow, or sometime this weekend. |
Whatever the latest is; this has been broken from the latest version (maybe a couple of minor versions back too) onwards.
|
yeah, it's likely to be 1.0 or from the last 0.9 - I had to rejig the pre-cloud-init networking code pretty majorly to fix it for several platforms, and in the process it's obviously not cleaning up after itself. The thing I wanted (er, needed) was to have the interfaces come up with DHCP, then query/detect for all the possible cloud-init options, and then use those to set things to whatever is actually desired. The old behavior made an incorrect guess of when no network was needed to get cloud-init, and then set things up - which obviously is easier to get right for simpler cloud-init cases, but utterly breaks for the more complicated ones. @prologic what weirds me out is that your dmesg looks like what I expect - it gets a random DHCP address, then does a cloud-init-save, and then sets the IP to 10.10.10.13/24 and removes the addresses that came from the DHCP server - and when it works for me, that's a-ok. And presumably, it's just something simple and dumb that I picked up somewhere :( |
and just as a :( - I do have a DHCP server with a random address, and do install testing here setting a static address, and it works - I'm willing to bet that it's enough of a timing issue that knowing what DHCP server is in use makes a difference (I'm mostly using my Billion router, but I'm pretty close to needing to build an automated multi-server test for other issues too) |
RouterOS 6.18 here
|
@prologic I've just followed your reproduction steps on qemu, and nope - works OK. Off to set up a non-VM system and network :/
|
and when I do the same on my real network, with real hardware, the same thing - my eth0 has the static IP I requested in the config. |
At least for @prologic and me, it seems that the static IP address and the DHCP address are in the same subnet.
This does not seem to be the case in your test, @SvenDowideit. I don't know if that makes any difference... |
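
To check whether a given setup falls into this "same subnet" case, something like the following sketch could help. It only uses Python's standard `ipaddress` module; `same_subnet` is a hypothetical helper name, `10.10.10.13/24` is the static address mentioned in this thread, and the lease addresses are made-up examples.

```python
import ipaddress


def same_subnet(static_cidr, dhcp_addr):
    """Return True if dhcp_addr lies inside the network of static_cidr."""
    static_if = ipaddress.ip_interface(static_cidr)
    return ipaddress.ip_address(dhcp_addr) in static_if.network


if __name__ == "__main__":
    # Static address from this thread, with invented DHCP lease addresses.
    print(same_subnet("10.10.10.13/24", "10.10.10.57"))  # lease in the same /24
    print(same_subnet("10.10.10.13/24", "192.168.1.5"))  # lease in another network
```

Whether sharing a subnet is actually what triggers the failure is still only a guess at this point in the thread; the helper just makes it easy to compare setups.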
and after more messing about on vmware, I think I have it. I've set up 3 eth's
whereas:
results in Time to walk the dog - but this is good progress. |
OH BOOM! poked it s'more, and there was just one little thing missing from the --boothd option cmdline used by the integration tests - finally, something is working today :) |
Just woke up and caught up! But yeah, DHCP subnet == static subnet in both our cases (but I'm sure you've gone past this now).
Glad you got to the bottom of it? :D
|
no, today's been a disaster. To automate the reproducible test case I found yesterday, I coded up a new comms path for our integration tests, only to have that lock up my local computer (and then wasted a few hours trying to track that down). Giving up on that, I moved to a server box - where the VM didn't lock up the host, but instead locks itself up in pretty much unrelated code - and now I can't manually get my reproduction to happen without locking up one or the other. I'm probably going to give up on it for a few days and circle back to it later. |
#1915 - I've made a dumbed down test - and have commented out the one that kills my hosts. |
(y)
|
@joshuacox @prologic @stffabi , I've built a test release with the add&remove IP switched around - would it be possible for you to try it out? https://github.com/rancher/os/releases/tag/v1.1.0-test1 I'm not holding my breath, but we might get lucky :) |
I can try it tonight (PST)
|
@SvenDowideit 🎉 🎉. I've upgraded one of our servers and rebooted it several times. It always came up with all interfaces and correct IP settings. Awesome... |
Whoohoo!
|
@SvenDowideit initial testing says all good here. Of note, I have not put that on my production front-end (the rancher with the public IP), but I have done quite a bit of testing with two-interfaces on a newly cut VM built using |
interestingly enough, I tried out 1.0.2, and eth1 was dead in the water. Scrolling up with shift-pageup in the KVM console, I could see where dhcpcd was firing off despite me having an entirely static configuration, but it did look like it removed the DHCP addresses and had left the static ones (including on eth1). But when I boot I get nothing but this:
my net configs:
I ended up rolling back to 1.0.0, which was the first to boot with both interfaces working. 1.0.1 has the same problem as 1.0.2 here. |
considering the feedback for the v1.1.0-test1 release, and now the need to deal with the Stack Clash CVE, I've merged the PR into v1.0.x, and am building and uploading an |
(haven't tried this yet) but silly question if I use the new ISO will it
also install that version or download whatever the latest (broken) version
is?
On Fri, Jun 23, 2017 at 8:42 AM, Sven Dowideit wrote:
> please try https://github.com/rancher/os/releases/tag/v1.0.3-rc1
|
the iso is :magic: - it'll install that version (er, i hope, so long as the magic didn't leak out) |
@SvenDowideit I've upgraded one server, looks good. 🎉 |
I'm going to close this one for https://github.com/rancher/os/releases/tag/v1.0.3 |
I have suffered from a similar problem for about a week: with a dual NIC config, the IP addresses are not allocated correctly on every host reboot. I'm going to upgrade my system now; not sure whether that will fix my problem too. |
@SvenDowideit Thanks for your work. I upgraded to 1.0.3 and tested with 10 reboots; each time both IP addresses were allocated correctly. No pain now. :) 👍 |
**RancherOS Version: (ros os version)** 0.9.2-0.1.0
**Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, etc.)**
KVM VM with two virtual NICs (one public, one private)
I've been following #1477 and #1705 and have run into more woes with my dual NIC setup.
When I upgraded to RancherOS 1.0.0 on my dual NIC VM, I found that again only one interface was working (the one with the gateway). I rolled back to 0.9.2 and found the same issue there, so I rolled back further to 0.9.1.
That fixed it for me and I now have two working NICs in separate networks again.
relevant network section