Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Boot fails on PC Engines apu platforms #4

Closed
krystian-hebel opened this issue Dec 30, 2019 · 21 comments
Closed

Boot fails on PC Engines apu platforms #4

krystian-hebel opened this issue Dec 30, 2019 · 21 comments

Comments

@krystian-hebel
Copy link

Platform either hangs or reboots (depending on coreboot revision) after Shutting down firmware services... or Installing a safe environment... with verbose output. See
esxi_debug_2.log

Code for relocation is PIC, but it then returns to mboot.c which isn't. It uses pointers to global data as well as to constants (e.g. parameters to Log()). Boot fails if any of these sections is relocated because old pointers are still used.

Modified memory tables (so that modules are loaded at different addresses and don't need to be relocated) allow ESXi to boot, at the cost of wasting RAM and possibly breaking boot for different systems.

@TimothyPMann
Copy link
Contributor

Code for relocation is PIC, but it then returns to mboot.c which isn't. It uses pointers to global data as well as to constants (e.g. parameters to Log()). Boot fails if any of these sections is relocated because old pointers are still used.

If by "code for relocation" you mean trampoline.s, it does not return to mboot.c, although it does call into mboot.c (etc.). However, it does not move (relocate) the code of the bootloader itself, and does not overwrite that code. It only moves the modules around.

Modified memory tables (so that modules are loaded at different addresses and don't need to be relocated) allow ESXi to boot, at the cost of wasting RAM and possibly breaking boot for different systems.

I don't understand what you mean by that.

@pietrushnic
Copy link

@TimothyPMann can you elaborate on why you closed that issue? Are you considering it solved?

@TimothyPMann
Copy link
Contributor

Sorry, I should have left it open, as I don't dispute the filer's observation that boot fails on the PC Engines APU platform that he has, only his attempt to explain what the problem is.

Most likely the problem is that coreboot has a bug in its memory map, such that when mboot moves modules into memory that (according the memory map) is free for use by the OS, it breaks something in coreboot that is still needed and causes an unhandled exception. That's the kind of issue we sometimes see on prototype or non-mainstream-server machines, and the fix is typically a firmware update, not something in mboot. However, this is just a guess, so I can't say for sure the bug can't be in mboot somewhere.

Unfortunately I can't debug it from the log. I've rarely been able to debug mboot issues if I don't have the hardware in hand to experiment with. :-(

@TimothyPMann TimothyPMann reopened this Jan 7, 2020
@TimothyPMann
Copy link
Contributor

p.s. It may be worth a try to boot UEFI mode instead if the platform supports that. If this was not coreboot I'd say to definitely use UEFI, but coreboot is different and I'm not familiar enough with it to say whether coreboot+SeaBIOS or coreboot+Tianocore is more stable.

@pietrushnic
Copy link

@TimothyPMann @krystian-hebel works with me and we maintain coreboot for PC Engines for almost 4 years. Unfortunately, there is no UEFI firmware for it, we can try TianoCore payload which is closer to what we can get to UEFI.

What we can deliver to you to help you in confirming that bug? Maybe we can expose the remote system, so you can just confirm we really see a bug in mboot?

@krystian-hebel any ideas on how to provide a memory map that we expose to mboot? Also if you think there are any other means of proving our claims please let us know.

@krystian-hebel
Copy link
Author

@TimothyPMann @pietrushnic e820 memory map is included in the log I attached.

If by "code for relocation" you mean trampoline.s, it does not return to mboot.c, although it does call into mboot.c (etc.). However, it does not move (relocate) the code of the bootloader itself, and does not overwrite that code. It only moves the modules around.

By "code for relocation" I meant do_reloc(), called by install_trampoline(), which does return to main() in mboot.c.

Line from here is not present in the log file, neither is the one with error, some lines above that. However, by binary patching mboot.c32 (adding mov al, 'x'; out 0x3f8, al and a dead loop to stop further execution) I discovered that Log() gets called and returns, but doesn't print anything, as if it were given empty string. Pointer to that string (and every other static or global data) is loaded using address relative to EBX register, which is implicitly loaded with address of some known (from compiler's point of view) point in the code/data in memory on (almost) every function entry in PIC. As install_trampoline() returns, EBX is not updated. It would be updated if that function used a call (or even a jump) to a new function instead of returning.

As EBX does not change, I assume that underlying memory was overwritten by do_reloc() at this point. Interesting thing is that apparently data was relocated and overwritten, but the code was not - it might have been relocated but the old code section could be still in place. I haven't dug too deep, but is it possible that mboot.c32 code and data sections treated as separate objects, which causes its memory to get fragmented?

In any case, the way i understand the code, the memory used by mboot.c32 should be blacklisted and because of that safe from being moved around, which it apparently isn't...

Modified memory tables (so that modules are loaded at different addresses and don't need to be relocated) allow ESXi to boot, at the cost of wasting RAM and possibly breaking boot for different systems.

I don't understand what you mean by that.

I've marked memory in range 0x100000-0x400000 as reserved in e820 memory map (in coreboot tables actually, but they are parsed later by SeaBIOS and converted to e820, effect is the same). This is the memory to which SYSLINUX, mboot.c32 etc. was initially loaded. After this change, the relocation map is different, which allows installer to boot.

This issue affects many people, see here, here and here.

@TimothyPMann
Copy link
Contributor

Thanks for the additional details. I'll take a further look at this and see what I can find out.

@TimothyPMann
Copy link
Contributor

do_reloc is called twice. The first call, from install_trampoline, only copies the trampoline section to a new location. do_reloc itself is in the trampoline section (along with trampoline.s), but this first call runs entirely from the section's original location, not where it's copied to. Also, the place the trampoline section is copied to is "safe memory", which is memory that the e820 table says is available and that doesn't overlap either mboot itself or the old or new locations of the boot modules. So there shouldn't be any problem with this first invocation returning to its caller.

The second call to do_reloc is after jump_to_trampoline jumps to the new copy of the trampoline section. This does not call back into mboot, or use any static mboot data outside of safe memory, or return to mboot.

So this all should work fine. But I will look some more to see if any bugs crept in that I can see, with help from your observations.

@TimothyPMann
Copy link
Contributor

I'm not seeing any explanation for why Log(LOG_DEBUG, "Installing a safe environment...\n"); was printed successfully but something went wrong with the call Log(LOG_INFO, "Relocating modules and starting up the kernel...\n"). As I mentioned, the call from install_trampoline to do_reloc only copies the trampoline code plus the data that it needs. The destinations of those copies were logged just above:

[t] 17735c - 17b6f7 -> 2cb97bd0 - 2cb9bf6b (17308 bytes)
[t] 170e86 - 170f72 -> 2cb9bf70 - 2cb9c05c (237 bytes)

The 17308 bytes are data: handoff structure, reloc table, 237 bytes is the code from trampoline.s and do_reloc. The destinations are way outside where mboot itself was loaded. From earlier in the log, it was loaded at 0x160000:

mboot __executable_start is at 0x160000

The size isn't logged, but looking at the mboot binary shows it's 0x2c480 (through the end of the bss).


It's worth thinking about whether the move could have smashed the active stack. I don't believe so, because COM32R modules are started by syslinux with their stack at "the end of available memory" (https://wiki.syslinux.org/wiki/index.php?title=Doc/comboot). The top in 32-bit mode unpaged mode, that is, so memory above 0x100000000 is not available. From the logged e820 memory map, the top of available memory seems to be 0xdfe88fff. I don't see anything getting moved nearly that high.


You said that the hang or reboot sometimes occurs earlier, after "Shutting down firmware services...". So perhaps whatever is happening is actually happening sooner, but not always evident until later. But if "Shutting down firmware services..." is really the last message, it's tough to guess what, since that is quite a lot earlier than "Installing a safe environment...". If a crash occurred during the "moving xxx (size yyy) temporarily to zzz" phase, it would look like one of those moves was stepping on some memory that's in use. But then "Shutting down firmware services..." should not be the last log message; there should be more before the crash.

Do you have some logs from other crashes? Perhaps they could help.

I would be willing to add more instrumentation to an mboot build for you, if I had a clue of what to add.

@krystian-hebel
Copy link
Author

"Installing a safe environment..." is printed only with debug flags: "-D -S 1". Without those, the last printed line is "Shutting down firmware services..." and log gathered this way is pretty useless. All debug logs are identical (for the same FW/ESXi versions), except for some intertwined lines around lines printed with no debug - they are printed twice.

This time I've gathered new logs from official release (previously it was edited with ESXi-Customizer-PS-v2.5.ps1 - vmkusb removed), version 6.7U2 with SHA256 = 73950afff2637470b9d347847b2f21cad5a6992becd8c14324668ac6a6ee0f3f. Image was written to USB stick using Rufus (with Syslinux and menu.c32 from the ISO image), and I added "text nofb com1_baud=115200 com1_Port=0x3f8 tty2Port=com1 gdbPort=none logPort=none" to boot.cfg. No other changes were made. It was tested on firmware with and without additional memory reservation for range 0x100000-0x400000, here are the logs:
esxi_without_res_mem.log
esxi_with_res_mem.log

There is one range that did not change its addresses (both before and after relocation) for both versions: [s] 18c1b0 - 18c1b5 -> 602a8e - 602a93 (6 bytes). Interestingly, it is inside that reserved range, as well as is the original trampoline code - here address after relocation changed, before relocation is the same:

 Preparing a safe environment...
-Allocation table count=8, max=4096
-[t] 1771dc - 17b5af -> 29afc6b0 - 29b00a83 (17364 bytes)
-[t] 170ce6 - 170dd2 -> 29b00a90 - 29b00b7c (237 bytes)
+Allocation table count=3, max=4096
+[t] 1771dc - 17b5af -> 29b6ec50 - 29b73023 (17364 bytes)
+[t] 170ce6 - 170dd2 -> 29b73030 - 29b7311c (237 bytes)
 Installing a safe environment...

In fact, mboot __executable_start is at 0x160000 in both cases - I haven't noticed this before. It seems like the real problem starts earlier than I suspected, even before mboot.

@TimothyPMann
Copy link
Contributor

Tangential comment: "text" and "nofb" don't do anything; those are Linux options.

It's syslinux that loaded mboot itself at 0x160000. Maybe syslinux can't handle the e820 table trying to reserve the range 0xf0000 - 0x3fffff. I suppose it's conceivable that syslinux is also somehow responsible for the crash by leaving something vital in memory without modifying the e820 table to protect it. But still, we haven't seen crashes like this on any other machine.

I'll look some more at the detailed logs to see if they suggest anything further.

@krystian-hebel
Copy link
Author

krystian-hebel commented Feb 19, 2020

Update: looking at Syslinux's code it seems that it doesn't look at e820 at all when initially loading itself and always uses 0x100000. It then takes first RAM section above 1 MB and uses only its end address, possibly reduced when there are non-RAM sections defined for overlapping range (apparently, some BIOSes add reserved section without reducing available RAM). Multiple RAM regions are merged only if they are contiguous. If there are holes, only the first contiguous region is taken into the account.

The address obtained this way is used as a top of available memory. The first byte is calculated from the end of module memory, which due to the fact that module lands in reserved memory range, is also in that range. For the memory range I reserved, this is visible in lines:

Malloc arena:
  (free) address 0x18c300, size 3746446468
  (used) address 0xdfe80000, size 64
  (used) address 0xdfe80040, size 48
  (used) address 0xdfe80070, size 32
  (free) address 0xdfe80090, size 36720

The used entries here used to be the first ones and only one free entry was reported without memory reservation. Now the used ones lay close to the end of memory, so most likely the stage that loaded them was taking e820 into account. On the other hand, I would expect them to be somewhere close to the beginning of available memory, given that they were at the beginning with no memory reservation, but perhaps they are relocated because mboot.c32 is bigger than menu.c32.

Apart from semi-working support for e820 from Syslinux, I'm still not sure why it boots with memory reservation in place. I suspect that either Syslinux or SeaBIOS (given that similar problems were not reported for other platforms) uses some memory for runtime code without reserving it. Perhaps by doing it manually we force SeaBIOS to use different parts of memory and some non-reserved runtime code lands somewhere where Syslinux (or its modules) won't overwrite it. Syslinux uses BIOS interrupts somewhat more often compared to other bootloaders, so maybe this problem existed for some time, but was dormant until now.

Blog entry about the issue: https://blog.3mdeb.com/2020/2020-03-04-esxi-67-boot-issue-part1/

@ADIsdouthit
Copy link

Just ran into this issue. I submitted #7 which fixed the issue for me if you'd like to test.

@TimothyPMann
Copy link
Contributor

That change and the reasoning behind it looks reasonable -- thanks. Clobbering syslinux usually wouldn't matter, because most of the relocations are done in the trampoline after Log (etc.) is no longer called. But there can be exceptional cases where the memory is written into earlier, such as extra relocations to break cycles. So maybe one of the early memory uses is what causes the trouble on the PC Engines machine.

There is a small danger the change will break booting in legacy BIOS mode on occasional machines that have very little memory below 4GB and thus actually need to use the newly reserved area. However I'm inclined not to worry about that. Such machines typically would have UEFI and should preferably be booted in UEFI mode anyway, which allows memory above 4GB to be used.

The github version of esx-boot is mostly just a coarse sequence of snapshots copied from the version we maintain internally, one per ESXi release where esx-boot changes. So accepting the merge request into github wouldn't accomplish much. But I'll try pulling it into our internal version. Then if all goes well it will show up in a later push up to github -- probably the one for ESXi 7.0.3.

@ADIsdouthit
Copy link

Thanks Timothy, I'll keep my eyes open for a 7.0.3 drop.

We could tighten up the bounds of the blacklist to just cover SysLINUX by moving start to 0x100000 if you think the couple hundred KB in the legacy BIOS region are needed. My thinking was that the larger blacklist would be cheap insurance against any buggy BIOS that's not marking it's lowmem reserved regions correctly, but you're right that it's not needed for a non-buggy BIOS with valid e820 tables.

I can confirm that I was hitting one of those exceptional cycle dependency break relocations on my board. This extra info probably isn't needed at this point, but just posting here for completeness.

I ported the exception handler from coreboot (using only serial port logging given the nature of this failure) to catch and dump any general protection faults and saw the following in the log:

<snip>
...moving 0x7230020 (size 0x31d08) temporarily to 0x100000
Unexpected Exception: 13 @ 20:0010305c - Halting
Code: 0 eflags: 00010002
eax: 00000000 ebx: 7f6afe1c ecx: 0000002c edx: 00000000
edi: 0016a883 esi: 00000021 ebp: 7f6afe74 esp: 7f6afdf4
Code:
0x103018: 6c6c6163 65672079 6172656e 2e646574
0x103028: 6e412020 68632079 65676e61 6f792073
0x103038: 616d2075 230a656b 6e616d20 6c6c6175
0x103048: 69772079 62206c6c 6f6c2065 61207473
0x103058: 68742074 656e2065 62207478 646c6975
0x103068: 0a230a2e 6c502023 65736165 69646520
0x103078: 643c2074 65766972 645f3e72 63697665
0x103088: 702e7365 69662079 6620656c 7020726f
Stack:
0x7f6afdb0: 0016a883 7f6afdf4 001600bb 7f6afdc0
0x7f6afdc0: 00000000 0000002c 00000000 7f6afe1c
0x7f6afdd0: 7f6afdf4 7f6afe74 00000021 0016a883
0x7f6afde0: 0000000d 00000000 0010305c 00000020
0x7f6afdf0: 00010002 0016da46 00176fe8 0018b3a5
0x7f6afe00: 00000023 00176fe8 00176fe8 000003fd
0x7f6afe10: 0018b383 7f6afea0 0016cbd1 00000084
0x7f6afe20: 00000000 00000000 00000000 00000000

ndisasm gives:
0010305C 65206E65 and [gs:bp+0x65],ch

We're basically trying to run a text file at this point, and happened to fault here since GS points to the null segment in the GDT.

@TimothyPMann
Copy link
Contributor

Nice work. I always wanted to give mboot an exception handler of its own since it seems like only a few systems have one in BIOS/UEFI, but never found time.

@TimothyPMann
Copy link
Contributor

Hmm, actually I still don't understand why overwriting syslinux as late as relocate_runtime_services() is causing a problem, as I don't see why we would be calling into syslinux for logging anymore then.

The rule is really supposed to be that no COM32 (syslinux) calls are allowed after firmware_shutdown(), but I don't see that we carefully safeguard that. We should -- probably intcall should panic if called after firmware_shutdown.

However, we do call log_unsubscribe(firmware_print) quite early -- both from fbcon_init() and video_set_mode() , which is called when mboot puts its GUI on the screen, quite early on. I don't think the folks seeing this bug were booting headless (no VGA/SVGA), which would suppress the GUI. So after the GUI goes up, Log() should only be drawing to the screen itself and possibly writing to serial itself, not calling into COM32.

It's going to bother me to put in this patch without understanding why it's needed, so I am going to try to debug it some more. The fact that the patch fixes the issue is a good clue and so I should be able to debug just by instrumenting to see what is calling into COM32 after firmware_shutdown().

@TimothyPMann
Copy link
Contributor

Aha! Looking back at Krystian's original logs, he is using -H (headless). There is definitely a bug in that case when booting with legacy BIOS. The bug is really that mboot will continue to call com32_putc even after firmware_shutdown was called. That happens because on the headless path, nothing calls log_unsubscribe(firmware_print), and neither firmware_print nor any of the com32_* functions checks whether com32 has been shut down before calling into it.

@jailbird777
Copy link

jailbird777 commented May 4, 2021 via email

@TimothyPMann
Copy link
Contributor

So, one correct fix would be to call log_unsubscribe(firmware_print) from the com32 version of exit_boot_services (the one in bios/com32/init.c, right after the call to com32_cleanup.

This has the disadvantage that in headless mode, logging via COM32 will stop at that point. mboot's native logging to serial will still work if enabled.

@TimothyPMann
Copy link
Contributor

I forgot to close this issue. It's fixed in the current release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants