apu2c4 fails to TFTP boot after it did not become healthy after an update #29

Closed
stapelberg opened this issue Apr 25, 2019 · 6 comments

Comments

@stapelberg
Contributor

I have seen this a few times now, but finally have a serial log from when it happened.

The symptom is:

  1. The automated update writes a new version and triggers a kexec reboot
  2. The apu2c4 comes up, but fails to obtain a DHCP4 or DHCP6 lease, or even do DNS resolution using the IP address it had before the reboot. It seems like all network packets are dropped.
  3. The updater recognizes the apu2c4 is unhealthy and triggers a reboot after starting a TFTP server (to revert to the old revision).
  4. The apu2c4 prints the following, then hangs indefinitely:
Next server: 10.0.0.76
Filename: lpxelinux.0
tftp://10.0.0.76/lpxelinux.0... ok
lpxelinux.0 : 74379 bytes [PXE-NBP]

The updater’s log contains:

2019/04/25 07:09:47 client will use IP address 10.0.0.1 during recovery
2019/04/25 07:09:47 building github.com/rtr7/tools/cmd/rtr7-recovery-init
2019/04/25 07:09:48 serving TFTP, HTTP, DHCP (for PXE clients) on 10.0.0.76 (enp0s31f6)
2019/04/25 07:10:06 [dhcp] 00:0d:b9:xx:yy:zz DHCPDISCOVER → DHCPOFFER
2019/04/25 07:10:06 [dhcp] 00:0d:b9:xx:yy:zz DHCPREQUEST → DHCPACK
2019/04/25 07:10:22 [tftp] lpxelinux.0: success
2019/04/25 07:52:06 [dhcp] 00:0d:b9:xx:yy:zz DHCPDISCOVER → DHCPOFFER
2019/04/25 07:52:06 [dhcp] 00:0d:b9:xx:yy:zz DHCPREQUEST → DHCPACK

Given pcengines/coreboot#181, I wonder if the network interfaces sometimes don’t properly come up, perhaps because we are doing a kexec reboot.

I’ll check the network LED the next time it happens, and will try to disable kexec to see if that helps.

@stapelberg
Contributor Author

stapelberg commented May 30, 2019

I was recently in a position to reproduce this issue. It seems like any network traffic not related to the network boot confuses the apu to the point where it won’t continue network booting.

Establishing a point-to-point link (instead of connecting the apu to the rest of my network) helps.

This should probably be reported as a coreboot bug? We should probably reproduce this issue with a more recent coreboot build first, and check their changelogs in case this is a known issue.

Currently, my apu reports:
coreboot build 08/30/2017
BIOS version v4.6.1

https://pcengines.github.io/#mr-22 offers a much more recent build from just a few weeks ago.

@stapelberg
Contributor Author

I could successfully reproduce the issue with v4.6.1, statically configuring the MAC address and running the dnsflood program:

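// dnsflood generates steady background traffic by sending DNS queries
// at a fixed rate, without waiting for earlier queries to finish.
package main

import (
	"log"
	"time"

	"github.com/miekg/dns"
)

func main() {
	// Start one new query every 100ms (about 10 queries per second).
	for range time.Tick(100 * time.Millisecond) {
		// Run each query in its own goroutine so that slow or lost
		// replies do not reduce the send rate.
		go func() {
			m := new(dns.Msg)
			m.SetQuestion("miek.nl.", dns.TypeMX)
			c := new(dns.Client)
			in, rtt, err := c.Exchange(m, "192.168.1.1:53")
			log.Printf("in = %v, rtt = %v, err = %v", in, rtt, err)
		}()
	}
}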

With v4.9.0.5, I get a log message from PXELINUX:

lpxelinux.0 : 74379 bytes [PXE-NBP]

PXELINUX 6.03 lwIP 20171017 Copyright (C) 1994-2014 H. Peter Anvin et al

I also tried upgrading the version of SYSLINUX, but the most recent version gets just as far:

PXELINUX 6.04 lwIP 6.04-pre3 Copyright (C) 1994-2015 H. Peter Anvin et al

@stapelberg
Contributor Author

Reached out to the SYSLINUX mailing list: https://www.syslinux.org/archives/2019-June/026447.html

@lukastribus

Could this be related:
https://marcoguerri.github.io/linux/pxe/datacenter/2016/03/20/pxeboot-failures-chelsio.html

Maybe the apu2c4 nic does not correctly set PXENV_UNDI_ISR_OUT_NOT_OURS in all situations? Using pxelinux instead of lpxelinux could indeed be worth a try.

@stapelberg
Contributor Author

Thanks for the pointer! That’s a very interesting post. Based on the replies I got on the mailing list, I had planned to do some more debugging today or tomorrow, as time permits. Will check if the same fix works here, too.

@stapelberg
Contributor Author

> Maybe the apu2c4 nic does not correctly set PXENV_UNDI_ISR_OUT_NOT_OURS in all situations?

I haven’t verified, but it sounds unlikely that the flag would only be set in response to network traffic.

The blog post was very interesting regardless, so thanks again for the pointer :)

> Using pxelinux instead of lpxelinux could indeed be worth a try.

Indeed, that did the trick! Now using pxelinux.
