This repository has been archived by the owner on Oct 11, 2023. It is now read-only.

iPXE boot doesn't load cloud-config (DHCP based DNS is not being used) #1790

Closed
gizmotronic opened this issue Apr 17, 2017 · 27 comments


@gizmotronic
Contributor

RancherOS Version: (ros os version)
0.9.2-rc2 and later

Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, etc.)
vSphere Hypervisor 6.5.0 (ESXi) virtual machine, iPXE, diskless

RancherOS loads on a diskless VM using iPXE, but starting with 0.9.2-rc2, it's unable to load the cloud-config specified using the rancher.cloud_init.datasources kernel command line parameter.

I've isolated the problem to changeset 79a7e59. Reverting this change allows 1.0.0 to work normally.

@SvenDowideit
Contributor

oh wonderful - that change was made to fix AWS cloud-init.

@SvenDowideit
Contributor

@gizmotronic - can you post the log file of boot with rancher.debug=true? I'm curious to see what's going on.

@gizmotronic
Contributor Author

@SvenDowideit It took me a bit to figure out how to get past the rate limited printk, but here it is: rancheros-dmesg.txt

This was really helpful. My cloud-config can't load because the host it's coming from is on my internal network, which can't be resolved by Google DNS. I don't expect it to be using Google because my DHCP server is providing the necessary (private) DNS configuration.

In case some future reader is wondering how I turned off the rate limiting, I added printk.devkmsg=on to my kernel command line parameters.
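For anyone collecting the same debug log, here is a minimal sketch of a check that both parameters mentioned in this thread actually reached the kernel. `has_param` is a hypothetical helper name, not part of RancherOS.

```shell
#!/bin/sh
# Return 0 if the exact parameter $2 appears as a word on the command line $1.
# has_param is a hypothetical helper, not a RancherOS command.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
    esac
    return 1
}

# /proc/cmdline holds the command line of the running kernel.
cmdline=$(cat /proc/cmdline 2>/dev/null || true)

for p in rancher.debug=true printk.devkmsg=on; do
    if has_param "$cmdline" "$p"; then
        echo "$p: present"
    else
        echo "$p: missing"
    fi
done
```

Run it on the booted console before capturing dmesg. If `printk.devkmsg=on` is reported missing, the rate limiting is probably still in effect.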

@SvenDowideit
Contributor

yes, that's not good. Can you add your local DNS server setting to the boot cmdline for now?

@SvenDowideit SvenDowideit changed the title from "iPXE boot doesn't load cloud-config" to "iPXE boot doesn't load cloud-config (DHCP based DNS is not being used)" Apr 18, 2017
@SvenDowideit
Contributor

this might be the root cause of something I'm tracking down at the moment - thank you for the analysis

@gizmotronic
Contributor Author

I've confirmed that setting the local DNS server is an effective workaround.
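The thread does not record which parameter was used for this workaround. As a minimal sketch, assuming the kernel argument mirrors the cloud-config key `rancher.network.dns.nameservers` (an assumption, not confirmed anywhere in this issue), the argument could be built like this. `dns_kernel_arg` is a hypothetical helper and `192.168.1.1` is a placeholder address.

```shell
#!/bin/sh
# Print a kernel argument that pins the DNS servers passed as arguments.
# ASSUMPTION: the key name mirrors the cloud-config key
# rancher.network.dns.nameservers; this thread does not confirm it.
dns_kernel_arg() {
    list=""
    for ns in "$@"; do
        # Join the addresses with commas, without a leading comma.
        list="${list:+$list,}$ns"
    done
    printf 'rancher.network.dns.nameservers=[%s]\n' "$list"
}

# Placeholder address; substitute the DNS server your DHCP server hands out.
dns_kernel_arg 192.168.1.1
```

The printed string would be appended to the `kernel` line of the iPXE script, next to `rancher.cloud_init.datasources`.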

@SvenDowideit SvenDowideit added this to the vNext milestone Apr 26, 2017
@SvenDowideit SvenDowideit modified the milestones: v1.0.2, vNext May 17, 2017
@SvenDowideit
Contributor

argh! it works for my server (on an installed disk), guess I need to kick it with pxe-dust

[root@rancher rancher]# cat /etc/resolv.conf 
# Generated by dhcpcd from eth0.dhcp, eth1.dhcp
# /etc/resolv.conf.head can replace this line
domain home.org.au
nameserver 10.10.10.1
nameserver 8.8.8.8
# /etc/resolv.conf.tail can replace this line
[root@rancher rancher]# ros -v
ros version v1.0.1
[root@rancher rancher]# system-docker exec -it network dhcpcd -U eth0
broadcast_address=10.10.10.255
dhcp_lease_time=864000
dhcp_message_type=5
dhcp_server_identifier=10.10.10.1
domain_name=home.org.au
domain_name_servers='10.10.10.1 8.8.8.8'
ip_address=10.10.10.2
network_number=10.10.10.0
routers=10.10.10.1
subnet_cidr=24
subnet_mask=255.255.255.0
eth0: dhcp6_dump: No such file or directory
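The comparison done by eye above (DNS servers in `/etc/resolv.conf` versus `domain_name_servers` in the `dhcpcd -U` dump) can be scripted. This is a rough sketch only; the function names are hypothetical and the parsing assumes the two output formats shown in this comment.

```shell
#!/bin/sh
# Print the nameserver addresses from resolv.conf-style text on stdin.
resolv_nameservers() {
    awk '$1 == "nameserver" { print $2 }'
}

# Print the DNS servers from `dhcpcd -U <iface>` text on stdin, one per line.
lease_nameservers() {
    sed -n "s/^domain_name_servers=//p" | tr -d "'" | tr ' ' '\n'
}

# Print every lease-provided DNS server that is absent from resolv.conf.
# $1 = resolv.conf text, $2 = dhcpcd -U text
missing_nameservers() {
    have=$(printf '%s\n' "$1" | resolv_nameservers)
    printf '%s\n' "$2" | lease_nameservers | while read -r ns; do
        [ -n "$ns" ] || continue
        printf '%s\n' "$have" | grep -qxF "$ns" || echo "$ns"
    done
}

# Possible use on a RancherOS console (not run here):
#   missing_nameservers "$(cat /etc/resolv.conf)" \
#       "$(system-docker exec network dhcpcd -U eth0)"
```

An empty result means every server offered by DHCP made it into `resolv.conf`, as in the output above. Any address printed is one the lease offered but the resolver never received.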

@SvenDowideit
Contributor

SvenDowideit commented May 17, 2017

yup, and the DNS servers are also set and working at cloud-init-save when I PXE boot it with a datasource URL that only the local DNS can resolve :(

DEBU[0035] framesize: 110                               
cloud-init_1 | > time="2017-05-17T05:04:13Z" level=info msg="Running DHCP on eth1: dhcpcd -MA4 -e force_hostname=true eth1" 
DEBU[0035] framesize: 110                               
cloud-init_1 | > time="2017-05-17T05:04:13Z" level=info msg="Running DHCP on eth0: dhcpcd -MA4 -e force_hostname=true eth0" 
DEBU[0035] framesize: 42                                
cloud-init_1 | sending commands to master dhcpcd process
[   66.748924] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[   67.130548] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   69.623691] tg3 0000:03:00.1 eth1: Link is up at 1000 Mbps, full duplex
[   69.946168] tg3 0000:03:00.1 eth1: Flow control is on for TX and on for RX
[   70.281179] tg3 0000:03:00.1 eth1: EEE is disabled
[   70.521612] tg3 0000:03:00.0 eth0: Link is up at 1000 Mbps, full duplex
[   70.843881] tg3 0000:03:00.0 eth0: Flow control is on for TX and on for RX
[   71.178719] tg3 0000:03:00.0 eth0: EEE is disabled
[   71.419367] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[   71.724858] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   72.134437] time="2017-05-17T05:04:20Z" level=info msg="Apply Network Config SyncHostname" 
DEBU[00[   72.541819] time="2017-05-17T05:04:20Z" level=debug msg="datasources that will be consided: []string{\"url:http://mini/test.yml\"}" 
41] framesize: 82                                [   73.123418] time="2017-05-17T05:04:21Z" level=info msg="cloud-init: Checking availability of \"url\"\n" 

cloud-init_1 | > time="2017-05-17T05:04:20Z" level=info msg="Apply Network Config SyncHostname" 
DEBU[0042] framesize: 123                               
cloud-init_1 | > time="2017-05-17T05:04:20Z" level=debug msg="datasources that will be consided: []string{\"url:http://mini/test.yml\"}" 
DEBU[0042] framesize: 95                                
cloud-init_1 | > time="2017-05-17T05:04:21Z" level=info msg="cloud-init: Checking availability of \"url\"\n" 
[   74.122492] time="2017-05-17T05:04:22Z" level=info msg="cloud-init: Datasource unavailable, skipping: url: http://mini/test.yml (lastError: Unable to fetch data: Get http://mini/test.yml: dial tcp 10.10.10.14:80: getsockopt: connection refused)" 
DEBU[0043] framesize: 237                               
cloud-init_1 | > time="2017-05-17T05:04:22Z" level=info msg="cloud-init: Datasource unavailable, skipping: url: http://mini/test.yml (lastError: Unable to fetch data: Get http://mini/test.yml: dial tcp 10.10.10.14:80: getsockopt: connection refused)" 

@SvenDowideit
Contributor

@gizmotronic I guess the next question is - what do you get when you run system-docker exec -it network dhcpcd -U eth0

@SvenDowideit SvenDowideit modified the milestones: vNext, v1.0.2 May 17, 2017
@gizmotronic
Contributor Author

The VM I'm using (typical of my environment) is diskless. The problem doesn't happen when a disk is attached, whether with a local install or used as a state partition.

I'm readily able to reproduce the problem with v1.0.0, but it appears to be resolved in v1.0.1. I've just booted v1.0.1 with 6 different configurations and had no trouble with any of them. I was also able to build and boot ToT without any trouble.

Any ideas on what might have changed to fix this?

@SvenDowideit
Contributor

we think there's a race condition when there's a delay in the DHCP - but so far, it's all theory - I can't manage to make it happen.
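To illustrate the theory only (this is not the RancherOS fix, and nothing in this thread confirms the race): a minimal sketch of waiting until `resolv.conf` lists a nameserver other than the fallback addresses before trying the datasource URL. `wait_for_dhcp_dns` is a hypothetical helper, and using the Google public DNS addresses as the fallback list is an assumption based on the earlier report.

```shell
#!/bin/sh
# Print the first non-fallback nameserver found in the file $1 and return 0,
# polling once per second; return 1 after $2 seconds without one.
# $3 is a space-separated list of fallback addresses to ignore.
wait_for_dhcp_dns() {
    file=$1 timeout=$2 fallback=$3
    elapsed=0
    while :; do
        for ns in $(awk '$1 == "nameserver" { print $2 }' "$file" 2>/dev/null); do
            case " $fallback " in
                *" $ns "*) ;;                 # fallback entry, keep looking
                *) echo "$ns"; return 0 ;;    # DHCP (or user) supplied entry
            esac
        done
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Possible use (not run here): wait up to 30 seconds for a lease-provided DNS.
#   wait_for_dhcp_dns /etc/resolv.conf 30 "8.8.8.8 8.8.4.4" || echo "no DHCP DNS yet"
```

If the race theory holds, a fetch attempted before this returns 0 would still be using the fallback resolvers and would fail for internal hostnames.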

@SvenDowideit
Contributor

SvenDowideit commented Jun 19, 2017

@gizmotronic I made a test build with #1921 in it - see https://github.com/rancher/os/releases/tag/v1.1.0-test1

is there any chance you could see if this also solves your problem?

hopefully, it's related to #1812

@gizmotronic
Contributor Author

I had no trouble booting v1.1.0-test1 in my test VM.

@SvenDowideit SvenDowideit modified the milestones: v1.1.0, vNext Jun 21, 2017
@liyimeng

Thanks @SvenDowideit, I have hit the same problem, and your test build solves it. Could we release this ASAP?

@SvenDowideit
Contributor

Have you tried v1.0.3? It contains one part of the 1.1.0 test changes.
The other part needs a bit more testing, as it's more intrusive.

@SvenDowideit
Contributor

SvenDowideit commented Jun 29, 2017

I'm hoping this will be resolved by #1921

@liyimeng

I think I did, and only 1.1.0 worked. But I can double-check when I'm in the office tomorrow.

@liyimeng

I ran my test again, using https://releases.rancher.com/os/latest/vmlinuz
It seems to work; the cloud-config is applied.
However, after boot, it always shows the version as RancherOS v1.0.2, not v1.0.3!

@liyimeng

Sorry, my fault! I have a disk attached to my machine, and it is formatted with the RANCHER_STATE label. It stores the config I loaded earlier!

A new test shows that only v1.1.0-test works!
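For others testing diskless boots: a small sketch for spotting a leftover state partition before trusting a result. It filters `blkid`-style output for the `RANCHER_STATE` label. `state_devices` is a hypothetical helper, and the parsing assumes the usual `device: KEY="value" ...` line format.

```shell
#!/bin/sh
# Read blkid-style lines on stdin and print each device whose filesystem
# label is RANCHER_STATE. The leading space in the pattern keeps PARTLABEL=
# from matching.
state_devices() {
    sed -n 's/^\([^:]*\):.* LABEL="RANCHER_STATE".*/\1/p'
}

# Possible use on the booted machine (not run here):
#   blkid | state_devices
```

Any device printed holds persisted state, so a config loaded in an earlier boot may mask what the current kernel and initrd actually do.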

@liyimeng

Even v1.1.0-test no longer works. :(

@SvenDowideit
Contributor

#1921 has been backported to v1.0.4 too

@liyimeng

liyimeng commented Aug 23, 2017

Hi @SvenDowideit, I ran v1.0.4 and this is still not working. Here is my boot script, followed by the resulting config and console output:

#!ipxe

set base-url http://10.10.10.1:8000
kernel ${base-url}/vmlinuz rancher.autologin=tty1 rancher.state.dev=LABEL=RANCHER_STATE rancher.state.autoformat=[/dev/sda,/dev/vda] rancher.cloud_init.datasources=[url:${base-url}/cloud-config]
initrd ${.base-url}/initrd

[rancher@rancher ~]$ sudo ros config export
EXTRA_CMDLINE: /init
rancher:
  autologin: tty1
  cloud_init:
    datasources:
    - url:http://10.10.10.1:8000/cloud-config
  environment:
    EXTRA_CMDLINE: /init
  state:
    autoformat:
    - /dev/sda
    - /dev/vda
    dev: LABEL=RANCHER_STATE
ssh_authorized_keys: []


[rancher@rancher ~]$ wget  http://10.10.10.1:8000/cloud-config
Connecting to 10.10.10.1:8000 (10.10.10.1:8000)
cloud-config         100% |*************************************************************************************************************|   531   0:00:00 ETA

[rancher@rancher ~]$ cat cloud-config 

#cloud-config

hostname: host-119

ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDbw3HEUCApEnStLH5NibRhP6KipG3l8ENCdXTBnDzQ51dUsD/sVgEIA1OwJUcEcNWCgSbnP7GE7hdsRfySNUjNcGDEIv70uR59b0r/nJ6ySgAcRL9RlvuiW/Vas7ZUS6JW/8uOVrb1D32Z0pV804nAU4Afym3NiIpH9GqSZMg9Etge764pT8aiWMx1RKl8UiYznIuBnT/gzWGOnm+s/udRAx9g8xAYd67Gzw4H05RlnR/3yHOdMTXJhlcovsDOpoKBEsG+MmM2W/S9G/ia84zbfWkUSI7bLy+UxMs6nJEAYCr66JWTr+EB4IpOYmiu5H6cyuZhSTy9QnHRGigv6RUP liyi.meng@ericsson.com
rancher:
  network:
    interfaces:
      eth0:
        dhcp: true
[rancher@rancher ~]$  cat /etc/issue 

               ,        , ______                 _                 _____ _____TM
  ,------------|'------'| | ___ \\               | |               /  _  /  ___|
 / .           '-'    |-  | |_/ /__ _ _ __   ___| |__   ___ _ __  | | | \\ '--.
 \\/|             |    |   |    // _' | '_ \\ / __| '_ \\ / _ \\ '__' | | | |'--. \\
   |   .________.'----'   | |\\ \\ (_| | | | | (__| | | |  __/ |    | \\_/ /\\__/ /
   |   |        |   |     \\_| \\_\\__,_|_| |_|\\___|_| |_|\\___|_|     \\___/\\____/
   \\___/        \\___/     \s \r

         RancherOS v1.0.4 \n \l
         eth0: 10.10.10.119 eth1: 10.168.122.222 lo: 127.0.0.1
[rancher@rancher ~]$ 

@liyimeng

@SvenDowideit BTW, a personal question: are you the only person working on RancherOS now? :)

@SvenDowideit
Contributor

I just re-did the handling of resolv.conf during boot; it should make DHCP-based DNS much more functional

it'll be released in v1.1.0 (GA) this week.

@gizmotronic
Contributor Author

This has resolved the issue for me. I was having trouble with v1.0.4 late last week but v1.1.0 boots without issue. Thank you!

@liyimeng

liyimeng commented Sep 6, 2017

This is still not working if you have more than one interface on your machine. To reproduce, create a KVM VM with two interfaces and PXE boot from one of them. If another DHCP server is running on the other interface, it fails almost 100% of the time. I propose that RancherOS add another kernel parameter to indicate which interface the iPXE boot is supposed to happen from.
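As a diagnostic for this two-interface case, here is a rough sketch that checks whether a given interface's lease subnet covers the datasource host, using the `network_number` and `subnet_cidr` values that `dhcpcd -U <iface>` prints. Both function names are hypothetical, and the 10.10.10.0/24 lease in the example is an assumption about my network rather than captured output.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Return 0 if host $1 lies in the network $2 with prefix length $3.
in_subnet() {
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$2") & mask )) ]
}

# Datasource host from the boot script earlier in this issue; the lease
# values 10.10.10.0 and 24 are ASSUMED, not taken from real dhcpcd output.
if in_subnet 10.10.10.1 10.10.10.0 24; then
    echo "this lease covers the datasource host"
else
    echo "this lease does not cover the datasource host"
fi
```

Running the check once per interface shows which lease should be reaching the cloud-config server, which is the interface the proposed kernel parameter would name.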

@SvenDowideit
Contributor

@liyimeng can you please raise a new issue for that - I keep forgetting that there's an extra problem


3 participants