Node (slot) 3 won't boot Talos OS on cm4 #13

Open
bhuism opened this issue Apr 17, 2023 · 35 comments

Comments

@bhuism

bhuism commented Apr 17, 2023

serial log:

RPI Compute Module 4 (0xd03141)
Core:  209 devices, 16 uclasses, devicetree: board
MMC:   mmcnr@7e300000: 1, mmc@7e340000: 0
Loading Environment from FAT... Unable to read "uboot.env" from mmc0:1... 
In:    serial
Out:   vidconsole
Err:   vidconsole
Net:   eth0: ethernet@7d580000
PCIe BRCM: link up, 5.0 Gbps x1 (SSC)
"Error" handler, esr 0xbf000002
elr: 00000000000af544 lr : 00000000000af500 (reloc)
elr: 000000003df81544 lr : 000000003df81500
x0 : 000000000000dead x1 : 0000000000100000
x2 : 0000000000008000 x3 : 00000000fd508000
x4 : 0000000000000000 x5 : 0000000000000001
x6 : 000000003df82aac x7 : 000000003db40890
x8 : 0000000000008a6c x9 : 0000000000000008
x10: 000000003db4023c x11: 0000000000000002
x12: 0000000000000140 x13: 000000003db40228
x14: 000000003db40890 x15: 0000000000000000
x16: 000000003df82b84 x17: d4244e8100000000
x18: 000000003db4dd70 x19: 0000000000000001
x20: 000000003db40300 x21: 000000003db5b480
x22: 0000000000000000 x23: 0000000000010000
x24: 000000003dfc60a1 x25: 000000003db5b0b0
x26: 000000000000ffff x27: 0000000000000000
x28: 0000000000000000 x29: 000000003db40260

Code: 350001f3 f94017e0 39400000 92401c00 (d5033fbf) 
Resetting CPU ...

@wenyi0421
Owner

Does the problematic pi work on other slots 1 2 4 or on the official carrier board?

@bhuism
Author

bhuism commented Apr 17, 2023

@wenyi0421 yes, perfectly, and the other 2 daughterboards with cm4 show the same symptoms: all work in all slots except slot 3, and none of them work in slot 3. But since this looks like a u-boot issue, maybe I should open an issue with them. I've written a howto here: https://github.com/bhuism/talos-tpi2 if you want to reproduce.

@bhuism bhuism changed the title Node (slot) 3 won't boot when using u-boot Node (slot) 3 won't boot when using u-boot (used by Talos) Apr 17, 2023
@wenyi0421
Owner

Please test with the official Raspberry Pi image, and consider trying a different CM4 Adapter V1.0; this may be a hardware problem.

@bhuism
Author

bhuism commented Apr 17, 2023

I've replaced the adapter+cm4 already, I have 4 adapters and 3 cm4 modules, none of them work (with Talos using u-boot) in slot 3, all of them work in all other slots.

I'll try raspberry pi os

@krarey

krarey commented Apr 17, 2023

I can confirm this same issue, with the same error state (syndrome register value 0xbf000002), on any of my CM4 modules (8GB, Wifi, eMMC) connected to the node3 slot that boots using U-Boot; in my case, the version packaged in the Fedora IoT 37 image.
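
For reference, the syndrome value reported above can be split into its architectural fields. The following is a minimal Python sketch, assuming the standard Armv8-A ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0). The function name `decode_esr` and the small EC lookup table are illustrative only and are not part of U-Boot or the kernel.

```python
# Minimal sketch: split an AArch64 ESR_ELx value into its top-level fields.
# Assumes the Armv8-A layout: EC = bits [31:26], IL = bit 25, ISS = bits [24:0].
# The table below is deliberately partial and only names a few exception classes.
EC_NAMES = {
    0x21: "Instruction Abort (same EL)",
    0x25: "Data Abort (same EL)",
    0x2F: "SError interrupt",
}


def decode_esr(esr: int) -> dict:
    """Return the EC, IL and ISS fields of a 32-bit syndrome value."""
    ec = (esr >> 26) & 0x3F
    il = (esr >> 25) & 0x1
    iss = esr & 0x1FFFFFF
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "unknown (not in this partial table)"),
        "il": il,
        "iss": iss,
    }


fields = decode_esr(0xBF000002)
print(f"EC=0x{fields['ec']:02x} ({fields['ec_name']}), IL={fields['il']}, ISS=0x{fields['iss']:07x}")
```

For 0xbf000002 the exception class comes out as 0x2f, an SError interrupt, which is consistent with the "Error" handler line in the U-Boot dump. The ISS has bit 24 set, so the rest of the syndrome is implementation defined and does not by itself say which access failed.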

@bhuism
Author

bhuism commented Apr 17, 2023

Raspberry Pi OS lite boots just fine in the node3 slot. I've swapped cm4s and daughterboards around now to get u-boot to work.

@wenyi0421
Owner

wenyi0421 commented Apr 19, 2023

Maybe it is because node3 is connected to SATA: uboot wants to start from SATA, but it fails. This may require modifying the uboot boot entries in the flashed OS.

@bhuism
Author

bhuism commented Apr 26, 2023

I'm afraid so

@Daedaluz

I gave this an attempt by building a custom u-boot image with a modified boot order and patching the pre-built talos image with it.
While it managed to get into grub, it failed silently and reset anyway.
I didn't find any obvious issues with the grub configuration, but I might look closer into it another day.

Maybe grub also probes SATA and fails?

@bhuism
Author

bhuism commented May 21, 2023

Interesting at least, @Daedaluz, thanks for looking into it

@jlec

jlec commented May 22, 2023

Let me know if there is something to test. Happy to tinker around as well

@Daedaluz

I poked around a little bit again, and it looks like grub actually loads the kernel / initramfs properly, but the kernel itself makes it reset. I can manually load the kernel and the initramfs without issues, the reset happens after the boot command.

I kind of expected something from the kernel, like an oops or panic whatnot, but the only thing i get after boot command is:

EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...

Ideas?

@bhuism
Author

bhuism commented May 25, 2023

Thanks @Daedaluz, this is too hard core kernel for me, btw did you know that slot/node 2 boots fine with a sata mini pcie card inserted, with the same chipset as on the tpiv2 board? namely asmedia asm 1061 chipset. Maybe this helps?

@jlec

jlec commented May 25, 2023

> Thanks @Daedaluz, this is too hard core kernel for me, btw did you know that slot/node 2 boots fine with a sata mini pcie card inserted, with the same chipset as on the tpiv2 board? namely asmedia asm 1061 chipset. Maybe this helps?

Different for me, doesn't boot here.

@jlec

jlec commented May 25, 2023

> I poked around a little bit again, and it looks like grub actually loads the kernel / initramfs properly, but the kernel itself makes it reset. I can manually load the kernel and the initramfs without issues, the reset happens after the boot command.
>
> I kind of expected something from the kernel, like an oops or panic whatnot, but the only thing i get after boot command is:
>
> EFI stub: Booting Linux Kernel...
> EFI stub: Using DTB from configuration table
> EFI stub: Exiting boot services...
>
> Ideas?

Not much. Is there something similar on a non-uboot system we can compare to? Maybe also report this upstream to talos, uboot, or both.

@bhuism
Author

bhuism commented May 26, 2023

Obviously, when rk1 comes out, we definitely need talos to boot on it

@Daedaluz

obviously!

@bhuism
Author

bhuism commented May 26, 2023

Working on this a little bit, following https://github.com/u-boot/u-boot/blob/master/doc/develop/crash_dumps.rst

I get:

$ echo 'Code: 350001f3 f94017e0 39400000 92401c00 (d5033fbf)' |   CROSS_COMPILE=aarch64-linux-gnu- ARCH=arm64 scripts/decodecode
Code: 350001f3 f94017e0 39400000 92401c00 (d5033fbf)
All code
========
   0:	350001f3 	cbnz	w19, 0x3c
   4:	f94017e0 	ldr	x0, [sp, #40]
   8:	39400000 	ldrb	w0, [x0]
   c:	92401c00 	and	x0, x0, #0xff
  10:*	d5033fbf 	dmb	sy		<-- trapping instruction

Code starting with the faulting instruction
===========================================
   0:	d5033fbf 	dmb	sy
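
The `Code:` line from the crash dump marks the faulting instruction word with parentheses. As a small illustration of that convention, here is a Python sketch that pulls the marked word and its byte offset out of such a line. `find_trapping_word` is a hypothetical helper, not part of `scripts/decodecode`, and it assumes fixed 4-byte AArch64 instruction words.

```python
import re
from typing import Tuple


def find_trapping_word(code_line: str) -> Tuple[int, int]:
    """Return (byte_offset, instruction_word) for the parenthesised word.

    Assumes fixed-width 4-byte instruction words, as on AArch64.
    """
    # Drop the "Code:" prefix and collect every hex word, with or without parentheses.
    words = re.findall(r"\(?[0-9a-fA-F]{8}\)?", code_line.split(":", 1)[1])
    for index, word in enumerate(words):
        if word.startswith("("):
            return index * 4, int(word.strip("()"), 16)
    raise ValueError("no parenthesised (trapping) word found")


offset, word = find_trapping_word("Code: 350001f3 f94017e0 39400000 92401c00 (d5033fbf)")
print(f"trapping word 0x{word:08x} at offset 0x{offset:x}")
```

The offset computed this way (0x10) lines up with the `10:*` row that decodecode flags as the trapping instruction.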

@SheGe

SheGe commented Jun 19, 2023

I hit the same issue while working on an alpine based image booted by uboot. In my opinion the problem is related to uboot and the SATA controller. Node3 is not working because it has the native SATA controller connected. Node1 and Node2 both work, but when an mpcie SATA controller is connected, those nodes behave the same as Node3.

@bhuism
Author

bhuism commented Jun 19, 2023

@SheGe My experience with talos was not the same: I booted talos fine with a sata controller in the mpcie slot of node2 (the same chip as on the tpiv2 board on node 3), go figure

@bhuism
Author

bhuism commented Aug 6, 2023

This issue is also reported at sidero here: siderolabs/talos#7358

@bhuism
Author

bhuism commented Aug 6, 2023

I made a custom rpi image with a logging/trace enabled u-boot, see the attached log

ubootlog.txt

@bhuism bhuism changed the title Node (slot) 3 won't boot when using u-boot (used by Talos) Node (slot) 3 won't boot when using the default Talos OS rpi4 image Aug 6, 2023
@bhuism bhuism changed the title Node (slot) 3 won't boot when using the default Talos OS rpi4 image Node (slot) 3 won't boot Talos OS on cm4 Aug 6, 2023
@bhuism
Author

bhuism commented Aug 6, 2023

new log and map:

u-boot.txt
ubootlogv2.log

@bhuism
Author

bhuism commented Aug 7, 2023

I've got talos booted on node3 with a workaround, a custom u-boot.bin (and thus talos image) was needed, it's a hack though.

@maxromanovsky

@bhuism I'm facing this same issue with eMMC CM4s, but also in slots 1 & 2 (as I have mini PCIe SATA cards installed there).

How did you get UART working there? Did it work out of the box? Or did you tinker with BOOT_UART in EEPROM, config.txt or something similar in the image flashed to eMMC?

In my case (RPI debug probe @ 115200) output is empty.

@bhuism
Author

bhuism commented Aug 10, 2023

@maxromanovsky uarts work out of the box, come to the discord chat, and search for serial debug, you can easily get serial to any node from the bmc command line

@maxromanovsky

@bhuism thanks!
What command do you use?
I tried the following one on BMC, and the output is always empty:

# tpi --uart=get -n 1
{
	"response":	[{
			"uart":	""
		}]
}#

@bhuism
Author

bhuism commented Aug 11, 2023

@maxromanovsky the serial ports of the cm4's are all connected to serial devices on the bmc, all 4. I've written something up here: https://github.com/bhuism/talos-tpi2#hardwired-bmc-serial-port-connections-to-nodes. You can use microcom or picocom on the bmc.

@CFSworks

> I've got talos booted on node3 with a workaround, a custom u-boot.bin (and thus talos image) was needed, it's a hack though.

Since you're already set up to build custom u-boot.bins, could you revert your workaround, confirm the problem still occurs, and then try this patch?
0001-pci-pcie-brcmstb-do-not-rely-on-CLKREQ-signal.patch

If it works for you, you may provide (at your option) an Acked-by:/Reported-by:/Tested-by: that I will use to credit you on the patch when I submit it upstream, if you'd like.

@bhuism
Author

bhuism commented Aug 11, 2023

@CFSworks will do asap

@bhuism
Author

bhuism commented Aug 12, 2023

@CFSworks it does get past u-boot now and into grub, but gets stuck in booting the kernel:

Booting `A - Talos v1.4.7-dirty'

EFI stub: Booting Linux Kernel...
RPI Compute Module 4 (0xc03141)
PCIe BRCM: link up, 5.0 Gbps x1 (SSC)
PCI: Failed autoconfig bar 10
PCI: Failed autoconfig bar 14
PCI: Failed autoconfig bar 18
PCI: Failed autoconfig bar 1c
PCI: Failed autoconfig bar 20

after this it boot loops

this log is not clean btw, I use picocom from the bmc (I ssh into the bmc) and the lines that come back often look garbled

I tried the u-boot development branch and the exact u-boot version (2023.1) talos 1.4.7 is using, incl their patches, both with the same result. I was using development in my patch.

(this talos image with ur patch boots fine on a normal rpi4b btw)

@CFSworks

The pasted output is all (presumably) normal output from U-Boot.

Could you log into the BMC and have this running:
microcom -s 115200 /dev/ttyS4 | tee node3.log
...and try a boot? This should capture all of the characters into node3.log, so that later terminal shenanigans don't overwrite earlier output.
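
Once the raw capture is on disk, it can be cleaned up offline. The snippet below is a rough Python sketch that strips ANSI CSI escape sequences and stray carriage returns from captured text. It assumes the garbling comes from such control sequences, which this thread does not confirm, and `clean_serial_log` is a made-up helper name.

```python
import re

# CSI sequences: ESC [ , parameter bytes, intermediate bytes, one final byte.
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def clean_serial_log(raw: str) -> str:
    """Remove ANSI CSI escape sequences and carriage returns from a capture."""
    without_escapes = ANSI_CSI.sub("", raw)
    return without_escapes.replace("\r", "")


# Synthetic sample (not taken from node3.log): clear-screen, cursor-home, CRLF line ends.
sample = "\x1b[2J\x1b[1;1HEFI stub: Booting Linux Kernel...\r\nPCIe BRCM: link up\r\n"
print(clean_serial_log(sample), end="")
```

On a real capture this would be applied to the file contents, for example `clean_serial_log(open("node3.log", errors="replace").read())`.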

@bhuism
Author

bhuism commented Aug 14, 2023

@CFSworks here u go

node3.log

@CFSworks

I just reproduced this boot loop on my own hardware. I'll spend some time today seeing if this new problem is a shortcoming in my U-Boot patch or a problem in a different component of Talos.

@CFSworks

Editing the GRUB boot entry with e and adding the following to the kernel cmdline: earlycon=pl011,0xfe201000,115200
...allows kernel boot log output. The kernel is failing to boot, with:

[    2.226887] pci_bus 0000:00: root bus resource [bus 00-ff]
[    2.232470] pci_bus 0000:00: root bus resource [mem 0x600000000-0x63fffffff] (bus address [0xc0000000-0xffffffff])
[    2.243009] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    2.249199] pci 0000:00:00.0: PME# supported from D0 D3hot
[    2.258616] pci_bus 0000:01: supply vpcie3v3 not found, using dummy regulator
[    2.266062] pci_bus 0000:01: supply vpcie3v3aux not found, using dummy regulator
[    2.273635] pci_bus 0000:01: supply vpcie12v not found, using dummy regulator
[    2.334384] brcm-pcie fd500000.pcie: link up, 5.0 GT/s PCIe x1 (SSC)
[    2.389016] SError Interrupt on CPU1, code 0x00000000bf000002 -- SError
[    2.389033] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.1.44-talos #1
[    2.389046] Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2023.07-00970-gd74fa80c0a 07/01/2023
[    2.389053] pstate: 204000c5 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    2.389064] pc : pci_generic_config_read+0x64/0xf0
[    2.389093] lr : pci_generic_config_read+0x4c/0xf0
[    2.389107] sp : ffff80000802b7d0
[    2.389111] x29: ffff80000802b7d0 x28: ffff3828409cf800 x27: 0000000000000001
[    2.389128] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[    2.389141] x23: 0000000000000000 x22: 0000000000000000 x21: ffff80000802b854
[    2.389153] x20: 0000000000000004 x19: ffff3828409cf800 x18: 0000000000000000
[    2.389165] x17: 6f74616c75676572 x16: 0000000000000107 x15: 072f075407470720
[    2.389178] x14: 0730072e07350720 x13: 072f075407470720 x12: 0730072e07350720
[    2.389191] x11: 0720072007200720 x10: 0720072007200720 x9 : ffffc99190cc7ebc
[    2.389203] x8 : ffff80000802b578 x7 : 000000000002ffe8 x6 : 00000000000affa8
[    2.389215] x5 : ffffc99190cc7e70 x4 : ffff80000802b854 x3 : 000000000000000b
[    2.389227] x2 : ffff800008bc9000 x1 : 00000000deaddead x0 : ffff800008bc8000
[    2.389242] Kernel panic - not syncing: Asynchronous SError Interrupt
[    2.389247] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.1.44-talos #1
[    2.389256] Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2023.07-00970-gd74fa80c0a 07/01/2023
[    2.389262] Call trace:
[    2.389267]  dump_backtrace.part.0+0xec/0x100
[    2.389282]  show_stack+0x30/0x40
[    2.389291]  dump_stack_lvl+0x64/0x80
[    2.389310]  dump_stack+0x18/0x34
[    2.389321]  panic+0x180/0x35c
[    2.389334]  nmi_panic+0xbc/0xc0
[    2.389345]  arm64_serror_panic+0x78/0x84
[    2.389355]  do_serror+0x30/0x7c
[    2.389365]  el1h_64_error_handler+0x3c/0x70
[    2.389379]  el1h_64_error+0x78/0x7c
[    2.389387]  pci_generic_config_read+0x64/0xf0
[    2.389400]  pci_bus_read_config_dword+0xa0/0x160
[    2.389414]  pci_bus_generic_read_dev_vendor_id+0x40/0x180
[    2.389431]  pci_scan_single_device+0xb4/0x120
[    2.389447]  pci_scan_slot+0x6c/0x200
[    2.389461]  pci_scan_child_bus_extend+0x48/0x240
[    2.389478]  pci_scan_bridge_extend+0x158/0x580
[    2.389494]  pci_scan_child_bus_extend+0xd0/0x240
[    2.389509]  pci_scan_root_bus_bridge+0x6c/0xe0
[    2.389525]  pci_host_probe+0x24/0xd0
[    2.389533]  brcm_pcie_probe+0x258/0x630
[    2.389545]  platform_probe+0x70/0xcc
[    2.389563]  really_probe+0xc8/0x2e4
[    2.389576]  __driver_probe_device+0x80/0x11c
[    2.389589]  driver_probe_device+0x4c/0x120
[    2.389601]  __driver_attach+0xa4/0x170
[    2.389614]  bus_for_each_dev+0x84/0xdc
[    2.389624]  driver_attach+0x34/0x44
[    2.389636]  bus_add_driver+0x15c/0x210
[    2.389648]  driver_register+0x7c/0x13c
[    2.389661]  __platform_driver_register+0x38/0x4c
[    2.389677]  brcm_pcie_driver_init+0x30/0x64
[    2.389689]  do_one_initcall+0x60/0x270
[    2.389700]  kernel_init_freeable+0x478/0x554
[    2.389709]  kernel_init+0x30/0x140
[    2.389723]  ret_from_fork+0x10/0x20
[    2.389738] SMP: stopping secondary CPUs
[    2.389748] Kernel Offset: 0x499188380000 from 0xffff800008000000
[    2.389754] PHYS_OFFSET: 0xffffc7d8c0000000
[    2.389758] CPU features: 0x40000,2013c080,0000421b
[    2.389765] Memory Limit: none

...which is this panic tracked upstream in the Linux kernel bug database.

This might be exacerbated by the timing of U-Boot using the PCIe RC for a while and then shutting it down later in the boot when EFI boot services are exited, but is not itself the fault of U-Boot, so I'm going to send that patch upstream now.
