Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] static structs not zero after leaving bootloader mode #1408

Closed
laurensvalk opened this issue Jan 31, 2024 · 15 comments · Fixed by pybricks/pybricks-micropython#231
Closed
Labels
bug Something isn't working hub: technichub Issues related to the LEGO Technic hub (Hub No. 2)

Comments

@laurensvalk
Copy link
Member

Describe the bug
This is a bit of a weird one. In the past, I think we've occasionally seen that a hub wouldn't immediately start after installing the firmware, and this may be related.

Sometimes, it's a bit worse. It appears that for some builds it does start, and then keeps restarting. In my current build it crashes before the lights come on, so it appears not to start at all. It can only start if you pull the batteries first.

So we can't release such a build. Even if it did boot, all bets are off with other undefined behavior.

Digging deeper, it appears to crash here which calls this.

Apparently, self->parent_object != NULL returns true, which causes the protection against calling the stop function to fail.

But dcmotor[i].parent.parent_object is statically allocated and should be NULL.

So just a hunch, but maybe the LEGO bootloader happens to use the memory region that happens to overlap with this struct in this particular build?

Reproduce
This build is not released yet, but will be shortly.

@laurensvalk laurensvalk added bug Something isn't working hub: technichub Issues related to the LEGO Technic hub (Hub No. 2) labels Jan 31, 2024
@laurensvalk
Copy link
Member Author

So just a hunch, but maybe the LEGO bootloader happens to use the memory region that happens to overlap with this struct in this particular build?

There are two ways to trigger it:

  • Leaving bootloader after update
  • Leaving bootloader after entering it manually (press and hold until it blinks purple, then let go).

If there is a way to detect whether we just came out of bootloader update mode, maybe we could do a self reset to fix this.

@laurensvalk
Copy link
Member Author

laurensvalk commented Jan 31, 2024

If there is a way to detect whether we just came out of bootloader update mode, maybe we could do a self reset to fix this.

So I think we can do:

  • Update our normal soft reset function such that we can detect that we used it. (e.g. set magic flag to 12345678 instead of AAAAAAAA.) **
  • That way, any other soft reset reason means we came out of BLE update mode.
  • So on detecting coming out of BLE update mode, call our own reset.

This should leave all existing experiences unchanged, and ensure a clean boot, I think?

** That doesn't work, presumably because the bootloader resets it, so maybe this means we should use a separate magic address that the LEGO bootloader does not look at, or find another way to tell it was a Pybricks reboot or a BLE update mode exit.

@laurensvalk
Copy link
Member Author

laurensvalk commented Jan 31, 2024

Or is the zero-ing something we can just fix in the linker script and startup.s instead?

Or is it more likely to be something else entirely? 🤔

@dlech
Copy link
Member

dlech commented Jan 31, 2024

startup.s is zeroing the data: https://github.com/pybricks/pybricks-micropython/blob/e5bd9a4de44675a3870111f9cd701d94aa20ce9a/lib/pbio/platform/prime_hub/startup.s#L101

So likely this is something else, like writing past the end of some static array overwriting this.

@dlech
Copy link
Member

dlech commented Jan 31, 2024

Could also be a GCC LTO bug, so trying a build with that disabled could be worthwhile.

@laurensvalk
Copy link
Member Author

Thanks. Not-booting tends to vary build by build, perhaps depending on how static memory happens to be aligned, so maybe I'll have to keep an eye on both this and the non-LTO version for a while.

It does appear at least deterministic once built, and the current version crashes well after uart is initialized, so fortunately there are some rudimentary debug tools.

I was hoping to launch a quick beta to get some things tested before announcing a bigger new feature but maybe it isn't very practical that way.

@laurensvalk
Copy link
Member Author

If we know one of the affected addresses, can we get any meaningful information by analyzing the binary? Even if just to confirm this is the kind of issue to look for, rather than something else entirely.

@laurensvalk
Copy link
Member Author

I also wondered what the difference could be in that this only occurs when coming out of bootloader update mode.

@dlech
Copy link
Member

dlech commented Jan 31, 2024

If we know one of the affected addresses, can we get any meaningful information by analyzing the binary? Even if just to confirm this is the kind of issue to look for, rather than something else entirely.

It could be useful to look at the .map file in the build artifacts to see what the neighboring variables are the could be potentially overwriting. It would be especially helpful to get a .map file from a "good" build and and "bad" build to be able to compare and hopefully find the smoking gun.

@dlech
Copy link
Member

dlech commented Jan 31, 2024

I also wondered what the difference could be in that this only occurs when coming out of bootloader update mode.

If the bootloader doesn't disable interrupts, then I suppose it could be possible that an interrupt occurs that writes over that memory before we update the VTOR table in SystemInit. IIRC, the newer hubs starting with SPIKE Prime do disable interrupts, but the older hubs don't. So maybe we need to disable interrupts early in startup.s on those (before anything that touches RAM) and then set PBDRV_CONFIG_INIT_ENABLE_INTERRUPTS_ARM on those hubs to enable interrupts later at the appropriate time.

@laurensvalk
Copy link
Member Author

laurensvalk commented Feb 1, 2024

So maybe we need to disable interrupts early in startup.s on those (before anything that touches RAM) and then set PBDRV_CONFIG_INIT_ENABLE_INTERRUPTS_ARM on those hubs to enable interrupts later at the appropriate time.

Like this?

diff --git a/lib/pbio/platform/technic_hub/pbdrvconfig.h b/lib/pbio/platform/technic_hub/pbdrvconfig.h
index e3fbfb194..6e5321eb6 100644
--- a/lib/pbio/platform/technic_hub/pbdrvconfig.h
+++ b/lib/pbio/platform/technic_hub/pbdrvconfig.h
@@ -103,3 +103,4 @@
 #define PBDRV_CONFIG_LAST_MOTOR_PORT        PBIO_PORT_ID_D
 
 #define PBDRV_CONFIG_SYS_CLOCK_RATE 80000000
+#define PBDRV_CONFIG_INIT_ENABLE_INTERRUPTS_ARM     (1)
diff --git a/lib/pbio/platform/technic_hub/startup.s b/lib/pbio/platform/technic_hub/startup.s
index 61ac94791..faa57f716 100644
--- a/lib/pbio/platform/technic_hub/startup.s
+++ b/lib/pbio/platform/technic_hub/startup.s
@@ -71,20 +71,22 @@ defined in linker script */
  * @param  None
  * @retval : None
 */
 
     .section   .text.Reset_Handler
        .weak   Reset_Handler
        .type   Reset_Handler, %function
 Reset_Handler:
   ldr   sp, =_estack    /* Atollic update: set stack pointer */
 
+  cpsid i   /* Disable interrupts */
+
 /* Copy the data segment initializers from flash to SRAM */
   movs r1, #0
   b    LoopCopyDataInit
 
 CopyDataInit:
        ldr     r3, =_sidata
        ldr     r3, [r3, r1]
        str     r3, [r0, r1]

If the bootloader doesn't disable interrupts (...)

Perhaps a silly question but just trying to wrap my head around this... So once turned back on, it will use our own ISRs as set up in our application's vector table, so nothing previously set up by the bootloader will run anymore?

@laurensvalk
Copy link
Member Author

This does get the (at least one) affected build to boot after update (thanks!). Since this change probably doesn't affect static memory allocation, that seems more likely a good sign than accidental like we've seen across other build changes.

@dlech
Copy link
Member

dlech commented Feb 1, 2024

Like this?

Pretty much. I would even make it the first instruction in the file, so even before updating the stack pointer. (and mention in the comment that the bootloader doesn't do this which is why we have to do it here)

So once turned back on, it will use our own ISRs as set up in our application's vector table, so nothing previously set up by the bootloader will run anymore?

Correct.

This does get the (at least one) affected build to boot after update (thanks!).

Awesome!

laurensvalk added a commit to pybricks/pybricks-micropython that referenced this issue Feb 1, 2024
We appear to be getting corrupted data or nonzero static global structs when running the firmware after leaving BLE firmware update mode.

Depending on the build, this could make the hub not boot after installation.

Fixes pybricks/support#1408
@laurensvalk laurensvalk reopened this Feb 22, 2024
@laurensvalk
Copy link
Member Author

laurensvalk commented Feb 22, 2024

I may be seeing some strange side effects of this.

Just before releasing v3.4.0b2 last week, I was seeing some strange things on the Technic Hub. Programs would not be saved on the hub, and I was able to trace it back to HAL_FLASHEx_Erase failing with HAL_ERROR.

It was consistent, but it would only show under a very specific set of circumstances, which may just be a coincidental combination of memory usage:

  • Use a Technic Hub with the Xbox Controller, as in the script below.
  • Stop the program and turn the hub off.
  • When turned back on, the program was not there.
from pybricks.hubs import TechnicHub
from pybricks.iodevices import XboxController
from pybricks.parameters import Color
from pybricks.tools import wait

# Set up all devices.
technic_hub = TechnicHub()
controller = XboxController()

# The main program starts here.
print('Hello, Pybricks!')
technic_hub.light.on(Color.ORANGE)

# Cancel the program and turn the hub off.
# If you turn it back on, the program is not there.
wait(2000000)

As I was working on the auto-disconnect step then, I thought that must have somehow caused it. I moved a few things around in pybricks/pybricks-micropython@ea0379d and it worked around it.

That change wasn't bad generally, but it doesn't quite explain why it fixed saving programs.

Today then, I went back to pybricks/pybricks-micropython@9610d03 which was the last commit that reproduced the above, and instead reverted the commit that fixed the boot issue (pybricks/pybricks-micropython@68828ac).

This also fixed saving, and it seems a more likely candidate.


tl;dr: Could the boot fix have side effects that we missed? Could it have caused the programs-not-saving issue? (And maybe other odd effects, in different builds. Maybe Gatt issues, etc.)

Update: Ok, maybe reverting the interrupt boot fix didn't resolve that either.

@laurensvalk
Copy link
Member Author

The not-saving issue seems unrelated, so closing this again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working hub: technichub Issues related to the LEGO Technic hub (Hub No. 2)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants