Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arc64 local variable corruption in z_init_static_threads #60071

Closed
keith-packard opened this issue Jul 6, 2023 · 11 comments · Fixed by #60376, zephyrproject-rtos/sdk-ng#692 or #63467
Closed
Assignees
Labels
area: ARC ARC Architecture area: picolibc Picolibc C Standard Library bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug

Comments

@keith-packard
Copy link
Collaborator

Describe the bug
When running the sample.portability.cmsis_rtos_v2.timer_synchronization test on arc64 under qemu using picolibc, z_init_static_threads goes into an infinite loop in the second instance of _FOREACH_STATIC_THREAD where it's calling schedule_new_thread.

Debugging this, I discover that the register holding the termination value has been trashed and holds a completely bogus value (0x4275de2 instead of 0x80407594). Perturbing the code above the loop in essentially any way at all (I added __asm__("") just after k_sched_lock()) makes this issue go away. The difference in the assembly code is trivial -- it moves the load of __static_thread_data_list_start after the call to k_sched_lock

Please also mention any information which could help others to understand
the problem you're facing:

  • What target platform are you using?

    • qemu_arc_hs5x
    • SDK version zephyr-sdk-0.16.2-rc1-5-g8d9b4fe
    • picolibc instead of minimal C library.
  • What have you tried to diagnose or workaround this issue?

Modifying the code in even trivial ways appears to work around this issue. I applied this diff:

diff --git a/kernel/thread.c b/kernel/thread.c
index fb578eb48d..5721023929 100644
--- a/kernel/thread.c
+++ b/kernel/thread.c
@@ -783,6 +783,7 @@ void z_init_static_threads(void)
         * delay of K_FOREVER.
         */
        k_sched_lock();
+       __asm__("");
        _FOREACH_STATIC_THREAD(thread_data) {
                if (thread_data->init_delay != K_TICKS_FOREVER) {
                        schedule_new_thread(thread_data->init_thread,

This generated the following change in the generated code:

--- broken.S	2023-07-05 23:27:30.536430480 -0700
+++ working.S	2023-07-05 23:27:18.483894200 -0700
@@ -1073,8 +1073,8 @@
 	add2	r14,r14,11
 .L106:
 	blo_s	.L107
-	mov_s	r14,@__static_thread_data_list_start
 	bl	@k_sched_lock
+	mov_s	r14,@__static_thread_data_list_start
 .L108:
 	mov_s	r0,@__static_thread_data_list_end
 	cmp_s	r14,r0

setting a break point on the cmp_s instruction here also makes things work. That makes me suspect that qemu might be at fault somehow?

  • Is this a regression? If yes, have you been able to "git bisect" it to a
    specific commit?
    • I haven't tried bisecting.

To Reproduce
Steps to reproduce the behavior:

  1. ./scripts/twister -O arc64-bug -xCONFIG_PICOLIBC=y -p qemu_arc_hs5x --test samples/subsys/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization

Expected behavior
Test completes successfully

Impact
This will delay the switch to using picolibc as the default C library.

Environment (please complete the following information):

  • OS: Debian unstable
  • Toolchain: SDK version 0.16.2-rc1-5-g8d9b4fe
  • Commit SHA or Version used: 8560fb3 (main as of this writing)
@keith-packard keith-packard added bug The issue is a bug, or the PR is fixing a bug area: ARC ARC Architecture area: picolibc Picolibc C Standard Library labels Jul 6, 2023
@jhedberg jhedberg added the priority: low Low impact/importance bug label Jul 11, 2023
@ruuddw
Copy link
Member

ruuddw commented Jul 11, 2023

Thanks Keith for the initial investigations, I'll assign to @evgeniy-paltsev.

@evgeniy-paltsev
Copy link
Collaborator

Hi @keith-packard

as I can see you are not using release SDK version. I'm wondering if I can download this 0.16.2-rc1-5-g8d9b4fe SDK from somewhere? Or I need to build it myself?

Thanks!

@keith-packard
Copy link
Collaborator Author

sorry about the random SDK build. Turns out the regular 0.16.2-rc1 also shows the problem, get that from here:

https://github.com/zephyrproject-rtos/sdk-ng/releases/download/v0.16.2-rc1/zephyr-sdk-0.16.2-rc1_linux-x86_64.tar.xz

Note that the error I get is this:

ASSERTION FAIL [!sys_dnode_is_linked(&to->node)] @ ZEPHYR_BASE/kernel/timeout.c:110

@evgeniy-paltsev
Copy link
Collaborator

@keith-packard I'm wondering it this issue reproduces every time or not?

I've run twister 50 times in a loop - and I'm not able to reproduce it.
I was using v0.16.2-rc1 SDK (exact version as you've proposed last time), exact same twister command and exact same zephyr commit (8560fb3).

Note that the error I get is this:
ASSERTION FAIL [!sys_dnode_is_linked(&to->node)] @ ZEPHYR_BASE/kernel/timeout.c:110

May I also ask you to send full error log?

@evgeniy-paltsev
Copy link
Collaborator

evgeniy-paltsev commented Jul 13, 2023

And may I ask you to send your value of git describe --abbrev=12 --always HEAD in Zephyr repo? (at the repo state when you can reproduce issue)
This string is embedded in Zephyr binary - so it may cause code/data layout change.

@keith-packard
Copy link
Collaborator Author

oof, I was hoping this would be easy to reproduce...

I think the only bits that should affect this are the Zephyr version and the SDK version, right?

$ arc64-zephyr-elf-gcc --version
arc64-zephyr-elf-gcc (Zephyr SDK 0.16.2-rc1) 12.2.0
$ git describe --abbrev=12 --always HEAD
zephyr-v3.4.0-1074-g18636af6d665

I made sure to run 'west update' in the zephyr repo before building. Here's the commands and their output; you can see that it's just hanging today, not generating an error message.

$ west update
=== updating acpica (modules/lib/acpica):
HEAD is now at f16a0b4d0 Update version to 20230628
=== updating canopennode (modules/lib/canopennode):
HEAD is now at dec12fa stack: map CANopenNode crc16_ccitt() to Zephyr crc16_itu_t()
=== updating chre (modules/lib/chre):
HEAD is now at b7955c27 Merge pull request #7 from stephanosio/require_full_libcpp
=== updating cmsis (modules/hal/cmsis):
HEAD is now at 74981bf Update DSP library for additional platforms
=== updating edtt (tools/edtt):
HEAD is now at 64e5105 edttt_bsim: Avoid infinite exception loop
=== updating fatfs (modules/fs/fatfs):
HEAD is now at 427159b zephyr: fix ffconf.h override use of STRINGIZE
=== updating hal_altera (modules/hal/altera):
HEAD is now at 0d225dd zephyr: update include paths to use <zephyr/...>
=== updating hal_atmel (modules/hal/atmel):
HEAD is now at 5ab4300 Remove LITTLE_ENDIAN and __LITTLE_ENDIAN defines
=== updating hal_espressif (modules/hal/espressif):
HEAD is now at 9531c1c178 hal: flash: avoid lock in ISR
=== updating hal_ethos_u (modules/hal/ethos_u):
HEAD is now at 90ada2e Clarify README regarding CPU choice
=== updating hal_gigadevice (modules/hal/gigadevice):
HEAD is now at 2994b7d README.md: add pllmf and pack exception for gd32a50x
=== updating hal_infineon (modules/hal/infineon):
HEAD is now at 0bebc14 hal: infineon: Add blob support for BLESS
=== updating hal_microchip (modules/hal/microchip):
HEAD is now at 5d079f1 mpfs:CMakeLists.txt: Adding mss_gpio directory to build
=== updating hal_nordic (modules/hal/nordic):
HEAD is now at 20ae0a7 nrfx: Update to version 3.1.0
=== updating hal_nuvoton (modules/hal/nuvoton):
HEAD is now at 8a2b5de hal: NuMaker M46x: fix warnings of M467 StdDriver
=== updating hal_nxp (modules/hal/nxp):
HEAD is now at e0116b8e s32k3: remove CONFIG_NXP_S32_PM
=== updating hal_openisa (modules/hal/openisa):
HEAD is now at d1e61c0 vega_sdk_riscv: Ignore GCC warning about potential uninitialized variable
=== updating hal_quicklogic (modules/hal/quicklogic):
HEAD is now at b3a66fe Remove references to FPGA IP core drivers
=== updating hal_renesas (modules/hal/renesas):
HEAD is now at f2d791d Add da1469x_hal
=== updating hal_rpi_pico (modules/hal/rpi_pico):
HEAD is now at b7801e4 Merge pull request #4 from soburi/update-1.5.0
=== updating hal_silabs (modules/hal/silabs):
HEAD is now at a143f03 gecko: add pa-conversions for efr32xg27
=== updating hal_st (modules/hal/st):
HEAD is now at 5948f7b3 sensor/stmemsc: Align stmemsc i/f to v2.02
=== updating hal_stm32 (modules/hal/stm32):
HEAD is now at 65b64134 dts: st: wba: generate stm32wba*-pinctrl.dtsi wba addition
=== updating hal_telink (modules/hal/telink):
HEAD is now at 38573af B91 ADC driver support
=== updating hal_ti (modules/hal/ti):
HEAD is now at ae1db23 zephyr: update SYS_INIT calls
=== updating hal_wurthelektronik (modules/hal/wurthelektronik):
HEAD is now at 24ca987 Initialized WE HAL module repository.
=== updating hal_xtensa (modules/hal/xtensa):
HEAD is now at 41a631d cmake: enable using SOC_TOOLCHAIN_NAME for SoC overlay
=== updating libmetal (modules/hal/libmetal):
HEAD is now at b91611a lib: update libmetal to release v2023.04.0
=== updating liblc3 (modules/lib/liblc3):
HEAD is now at 448f3de Merge pull request #8 from carlescufi/rename
=== updating littlefs (modules/fs/littlefs):
HEAD is now at ca583fd Merge tag 'v2.5.0' of https://github.com/littlefs-project/littlefs into zephyr
=== updating loramac-node (modules/lib/loramac-node):
HEAD is now at ce57712f Don't memset the TXRX buffer when radio is set to receive mode (#1345)
=== updating lvgl (modules/lib/gui/lvgl):
HEAD is now at f7cf01d4 Merge pull request #41 from faxe1008/update_8.3.7
=== updating lz4 (modules/lib/lz4):
HEAD is now at 8e303c2 zephyr: make this repo a zephyr module
=== updating mbedtls (modules/crypto/mbedtls):
HEAD is now at 6e7841e5a Move the implication of MBEDTLS_PSA_CRYPTO_CLIENT where it belongs
=== updating mcuboot (bootloader/mcuboot):
HEAD is now at 6a8746d7 boot_serial: fix image number handle in image upload request
=== updating mipi-sys-t (modules/debug/mipi-sys-t):
HEAD is now at a819419 library: allow building without wchar support
=== updating nanopb (modules/lib/nanopb):
HEAD is now at 42fa8b2 Merge pull request #2 from pdgendt/zephyr
=== updating net-tools (tools/net-tools):
HEAD is now at e0828aa net-setup: Add instructions for NAT and routing
=== updating nrf_hw_models (modules/bsim_hw_models/nrf_hw_models):
HEAD is now at 33b999d GPIO: Fix bug during initialization
=== updating open-amp (modules/lib/open-amp):
HEAD is now at c904a01 lib: update open-amp lib to release v2023.04.0
=== updating openthread (modules/lib/openthread):
HEAD is now at 37fb77098 [cli] fix `childsupervision` commands documentation (#9201)
=== updating picolibc (modules/lib/picolibc):
HEAD is now at d07c38ff0 Version 1.8.2
=== updating segger (modules/debug/segger):
HEAD is now at 5792675 segger: update to 3.40
=== updating sof (modules/audio/sof):
HEAD is now at ffbf9c2a6 Merge pull request #27 from nashif/cache_api
=== updating tflite-micro (modules/lib/tflite-micro):
HEAD is now at 9156d05 tflite-micro: fix module name
=== updating tinycrypt (modules/crypto/tinycrypt):
HEAD is now at 3e9a49d cmake: Fix conditional in root CMakeLists.txt
=== updating TraceRecorderSource (modules/debug/TraceRecorder):
HEAD is now at bc839bf zephyr: update SYS_INIT calls
=== updating trusted-firmware-m (modules/tee/tf-m/trusted-firmware-m):
HEAD is now at 79a6115d3 Build: Install crypto_platform.h for Crypto service
=== updating trusted-firmware-a (modules/tee/tf-a/trusted-firmware-a):
HEAD is now at 421dc0502 Merge pull request #3 from SgrrZhf/topics/huizha01/2.9.0-updates
=== updating tf-m-tests (modules/tee/tf-m/tf-m-tests):
HEAD is now at 0f80a65 Merge pull request #6 from microbuilder/tfm17_limit_attestation
=== updating psa-arch-tests (modules/tee/tf-m/psa-arch-tests):
HEAD is now at 6a17330 dev-apis: platform: Increase timeout for nRF devs
=== updating uoscore-uedhoc (modules/lib/uoscore-uedhoc):
HEAD is now at e892019 Merge pull request #11 from rlubos/zephyr-integration
=== updating zcbor (modules/lib/zcbor):
HEAD is now at 67fd8bb Update version from 0.6.99 to 0.7.0
=== updating zscilib (modules/lib/zscilib):
HEAD is now at 34c3432 Merge pull request #50 from microbuilder/ztest_fixes
=== updating thrift (modules/lib/thrift):
HEAD is now at 10023645a Merge pull request #1 from zephyrproject-rtos/add-zephyr-module-dir
$ ./scripts/twister -O arc64-bug -xCONFIG_PICOLIBC=y -p qemu_arc_hs5x --test samples/subsys/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization
Install tabulate python module with pip to use --device-testing option.
Renaming output directory to arc64-bug.6
INFO    - Using Ninja..
INFO    - Zephyr version: zephyr-v3.4.0-1074-g18636af6d665
INFO    - Using 'zephyr' toolchain.
INFO    - Building initial testsuite list...
INFO    - Writing JSON report arc64-bug/testplan.json
INFO    - JOBS: 64
INFO    - Adding tasks to the queue...
INFO    - Added initial list of jobs to queue
ERROR   - qemu_arc_hs5x             samples/subsys/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization  FAILED : Timeout
ERROR   - see: /build/keithp/zephyrproject/zephyr/arc64-bug/qemu_arc_hs5x/samples/subsys/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization/handler.log
INFO    - Total complete:    1/   1  100%  skipped:    0, failed:    1, error:    0
INFO    - 1 test scenarios (1 test instances) selected, 0 configurations skipped (0 by static filter, 0 at runtime).
INFO    - 0 of 1 test configurations passed (0.00%), 1 failed, 0 errored, 0 skipped with 0 warnings in 66.81 seconds
INFO    - In total 1 test cases were executed, 0 skipped on 1 out of total 584 platforms (0.17%)
INFO    - 1 test configurations executed on platforms, 0 test configurations were only built.
INFO    - Saving reports...
INFO    - Writing JSON report /build/keithp/zephyrproject/zephyr/arc64-bug/twister.json
INFO    - Writing xunit report /build/keithp/zephyrproject/zephyr/arc64-bug/twister.xml...
INFO    - Writing xunit report /build/keithp/zephyrproject/zephyr/arc64-bug/twister_report.xml...
INFO    - -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
INFO    - The following issues were found (showing the top 10 items):
INFO    - 1) samples/subsys/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization on qemu_arc_hs5x failed (Timeout)
INFO    - 
INFO    - To rerun the tests, call twister using the following commandline:
INFO    - west twister -p <PLATFORM> -s <TEST ID>, for example:
INFO    - 
INFO    - west twister -p qemu_arc_hs5x -s samples/subsys/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization
INFO    - or with west:
INFO    - west build -p -b qemu_arc_hs5x -T samples/subsys/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization
INFO    - -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
INFO    - Run completed
$ cat /build/keithp/zephyrproject/zephyr/arc64-bug/qemu_arc_hs5x/samples/subsys/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization/handler.log
*** Booting Zephyr OS build zephyr-v3.4.0-1074-g18636af6d665 ***

@keith-packard
Copy link
Collaborator Author

Did a bit more debugging today. With the code as above, the fault happens when vprintk is called from boot_banner.

(gdb) x/20i vprintk
   0x80400214 <vprintk>:	st.aw	r15,[sp,-4]
   0x80400218 <vprintk+4>:	st.aw	r14,[sp,-4]
   0x8040021c <vprintk+8>:	st.aw	blink,[sp,-4]
   0x80400220 <vprintk+12>:	sub3	sp,sp,0x2
   0x80400224 <vprintk+16>:	mov_s	r14,r0
   0x80400226 <vprintk+18>:	mov_s	r15,r1
   0x80400228 <vprintk+20>:	mov_s	r2,0x10
   0x8040022a <vprintk+22>:	mov_s	r1,0
   0x8040022c <vprintk+24>:	bl.d	27444	;0x80406d60 <memset>
   0x80400230 <vprintk+28>:	mov_s	r0,sp
   0x80400232 <vprintk+30>:	stb	2,[sp,2]
   0x80400236 <vprintk+34>:	mov_s	r0,0x804001fc
   0x8040023c <vprintk+40>:	mov_s	r2,r15
   0x8040023e <vprintk+42>:	st.as	r0,[sp,1]
   0x80400242 <vprintk+46>:	mov_s	r1,r14
   0x80400244 <vprintk+48>:	bl.d	27708	;0x80406e80 <vfprintf>
   0x80400248 <vprintk+52>:	mov_s	r0,sp
   0x8040024a <vprintk+54>:	add3	sp,sp,0x2
   0x8040024e <vprintk+58>:	ld.ab	blink,[sp,4]
   0x80400252 <vprintk+62>:	ld.ab	r14,[sp,4]

When I place a break point at 0x8040024a (right after vfprintf returns), I notice a difference in register contents depending upon whether I also place a break point at 0x80400242 (right before vprintf is called). When the system stops at 242 and is then continued, the application works correctly. When there is no break point at 242, the application hangs, having branched to address 0x0.

The register contents at 0x8040024a when the application works looks like this:

r0             0x41                65
r1             0x804087c9          -2143254583
r2             0x80408980          -2143254144
r3             0x80409c4c          -2143249332
r4             0xffffffff          -1
r5             0x0                 0
r6             0x80408a80          -2143253888
r7             0xffffffff          -1
r8             0x0                 0
r9             0x0                 0
r10            0x0                 0
r11            0x0                 0
r12            0x0                 0
r13            0xaaaaaaaa          -1431655766
r14            0x80408788          -2143254648
r15            0x80409c5c          -2143249316
r16            0x0                 0
r17            0x0                 0
r18            0x80409c9c          -2143249252
r19            0xaaaaaaaa          -1431655766
r20            0xaaaaaaaa          -1431655766
r21            0xaaaaaaaa          -1431655766
r22            0xaaaaaaaa          -1431655766
r23            0xaaaaaaaa          -1431655766
r24            0xaaaaaaaa          -1431655766
r25            0xaaaaaaaa          -1431655766
r26            0xaaaaaaaa          -1431655766
fp             0xaaaaaaaa          -1431655766
sp             0x80409c3c          -2143249348
ilink          0x80400244          -2143288764
gp             0x80409c94          -2143249260
blink          0x8040024a          -2143288758
acc0           0xaaaaaaaa          -1431655766
pcl            0x80400248          -2143288760
pc             0x8040024a          0x8040024a <vprintk+54>
status32       0x7ffe8008081c      [ E=14 Z RB=0 AD IE ]
bta            0x8040024a          0x8040024a <vprintk+54>
eret           0x0                 0x0
timer_build    0x7ffe00000304      [ version=4 t0 t1 p0=0 p1=0 ]
irq_build      0xf141901           [ version=1 IRQs=25 exts=20 p=15 ]
vecbase_build  0x10                [ version=4 addr=0 ]
isa_config     0x12e47402          [ version=2 va_size=4 pa_size=4 a=3 n m=2 c=0 d=1 ]
timer_count0   0x7fffcaaf          2147469999
timer_ctrl0    0x3                 [ ie nh ]
timer_limit0   0x7fffcaaf          2147469999
timer_count1   0x80000927          -2147481305
timer_ctrl1    0x0                 [ ]
timer_limit1   0xffffff            16777215
erstatus       0x0                 [ e=0 rb=0 ]
erbta          0x0                 0x0
ecr            0x0                 [ parameter=0 causecode=0 vectornumber=0 ]
efa            0x7ffe00000000      140728898420736
icause         0x0                 0
aux_irq_ctrl   0x207               [ nr=7 b ]
aux_irq_act    0x0                 [ active=0 ]
irq_priority_pending 0x0           0
aux_irq_hint   0x0                 0
irq_select     0x10                16
irq_enable     0x1                 true
irq_trigger    0x0                 false
irq_status     0x7ffe00000010      [ p=0 e ]
irq_pulse_cancel 0x0               false
irq_priority   0x0                 0
mmu_ctrl       {en = 0x0, ku = 0x0, wx = 0x0, tce = 0x0} {en = false, ku = false, wx = false, tce = false}
mmu_rtp0       {addr = 0x0, ASID = 0x0} {addr = 0x0, ASID = 0x0}
mmu_rtp1       {addr = 0x0, ASID = 0x0} {addr = 0x0, ASID = 0x0}

the register contents at 0x8040024a when the application fails looks like this:

r0             0x41                65
r1             0x804087c9          -2143254583
r2             0x80408980          -2143254144
r3             0x80409c4c          -2143249332
r4             0xffffffff          -1
r5             0x0                 0
r6             0x80408a80          -2143253888
r7             0xffffffff          -1
r8             0x0                 0
r9             0x0                 0
r10            0x0                 0
r11            0x0                 0
r12            0x0                 0
r13            0xaaaaaaaa          -1431655766
r14            0x80408788          -2143254648
r15            0x80409c5c          -2143249316
r16            0x0                 0
r17            0x0                 0
r18            0x80409c9c          -2143249252
r19            0xaaaaaaaa          -1431655766
r20            0xaaaaaaaa          -1431655766
r21            0xaaaaaaaa          -1431655766
r22            0xaaaaaaaa          -1431655766
r23            0xaaaaaaaa          -1431655766
r24            0xaaaaaaaa          -1431655766
r25            0xaaaaaaaa          -1431655766
r26            0xaaaaaaaa          -1431655766
fp             0xaaaaaaaa          -1431655766
sp             0x80409c38          -2143249352
ilink          0x8040024a          -2143288758
gp             0x80409c94          -2143249260
blink          0x8040024a          -2143288758
acc0           0xaaaaaaaa          -1431655766
pcl            0x8040024a          -2143288758
pc             0x8040024a          0x8040024a <vprintk+54>
status32       0x7ffc8008081c      [ E=14 Z RB=0 AD IE ]
bta            0x80401570          0x80401570 <_rirq_exit+72>
eret           0x0                 0x0
timer_build    0x7ffc00000304      [ version=4 t0 t1 p0=0 p1=0 ]
irq_build      0xf141901           [ version=1 IRQs=25 exts=20 p=15 ]
vecbase_build  0x10                [ version=4 addr=0 ]
isa_config     0x12e47402          [ version=2 va_size=4 pa_size=4 a=3 n m=2 c=0 d=1 ]
timer_count0   0x7fffcaaf          2147469999
timer_ctrl0    0x3                 [ ie nh ]
timer_limit0   0x7fffcaaf          2147469999
timer_count1   0x80000927          -2147481305
timer_ctrl1    0x0                 [ ]
timer_limit1   0xffffff            16777215
erstatus       0x0                 [ e=0 rb=0 ]
erbta          0x0                 0x0
ecr            0x0                 [ parameter=0 causecode=0 vectornumber=0 ]
efa            0x7ffc00000000      140720308486144
icause         0x0                 0
aux_irq_ctrl   0x207               [ nr=7 b ]
aux_irq_act    0x0                 [ active=0 ]
irq_priority_pending 0x0           0
aux_irq_hint   0x0                 0
irq_select     0x10                16
irq_enable     0x1                 true
irq_trigger    0x0                 false
irq_status     0x7ffc00000010      [ p=0 e ]
irq_pulse_cancel 0x0               false
irq_priority   0x0                 0
mmu_ctrl       {en = 0x0, ku = 0x0, wx = 0x0, tce = 0x0} {en = false, ku = false, wx = false, tce = false}
mmu_rtp0       {addr = 0x0, ASID = 0x0} {addr = 0x0, ASID = 0x0}
mmu_rtp1       {addr = 0x0, ASID = 0x0} {addr = 0x0, ASID = 0x0}

Checking the differences, we see:

--- reg-working	2023-07-13 11:54:43.495966032 -0700
+++ reg-broken	2023-07-13 11:26:44.252615552 -0700
@@ -27,17 +27,17 @@
 r25            0xaaaaaaaa          -1431655766
 r26            0xaaaaaaaa          -1431655766
 fp             0xaaaaaaaa          -1431655766
-sp             0x80409c3c          -2143249348
-ilink          0x80400244          -2143288764
+sp             0x80409c38          -2143249352
+ilink          0x8040024a          -2143288758
 gp             0x80409c94          -2143249260
 blink          0x8040024a          -2143288758
 acc0           0xaaaaaaaa          -1431655766
-pcl            0x80400248          -2143288760
+pcl            0x8040024a          -2143288758
 pc             0x8040024a          0x8040024a <vprintk+54>
-status32       0x7ffe8008081c      [ E=14 Z RB=0 AD IE ]
-bta            0x8040024a          0x8040024a <vprintk+54>
+status32       0x7ffc8008081c      [ E=14 Z RB=0 AD IE ]
+bta            0x80401570          0x80401570 <_rirq_exit+72>
 eret           0x0                 0x0
-timer_build    0x7ffe00000304      [ version=4 t0 t1 p0=0 p1=0 ]
+timer_build    0x7ffc00000304      [ version=4 t0 t1 p0=0 p1=0 ]
 irq_build      0xf141901           [ version=1 IRQs=25 exts=20 p=15 ]
 vecbase_build  0x10                [ version=4 addr=0 ]
 isa_config     0x12e47402          [ version=2 va_size=4 pa_size=4 a=3 n m=2 c=0 d=1 ]
@@ -50,7 +50,7 @@
 erstatus       0x0                 [ e=0 rb=0 ]
 erbta          0x0                 0x0
 ecr            0x0                 [ parameter=0 causecode=0 vectornumber=0 ]
-efa            0x7ffe00000000      140728898420736
+efa            0x7ffc00000000      140720308486144
 icause         0x0                 0
 aux_irq_ctrl   0x207               [ nr=7 b ]
 aux_irq_act    0x0                 [ active=0 ]
@@ -59,7 +59,7 @@
 irq_select     0x10                16
 irq_enable     0x1                 true
 irq_trigger    0x0                 false
-irq_status     0x7ffe00000010      [ p=0 e ]
+irq_status     0x7ffc00000010      [ p=0 e ]
 irq_pulse_cancel 0x0               false
 irq_priority   0x0                 0
 mmu_ctrl       {en = 0x0, ku = 0x0, wx = 0x0, tce = 0x0} {en = false, ku = false, wx = false, tce = false}

Critically important here is that sp is different. In the failing case, when blink is loaded from the stack, the value is 0, causing the system to branch to 0x0.

If I reset sp and ilink to the correct values at this point, then the application works correctly.

Ok, after even more debugging, I think I've discovered a bug in qemu?

Here's the last two instructions in vfprintf:

   0x80406f10 <vfprintf+136>:	j_s.d	[blink]
   0x80406f12 <vfprintf+138>:	ld.ab	r25,[sp,4]

When stopped at the first one, blink points at the return address in vprintk:

   0x8040023e <vprintk+42>:	st.as	r0,[sp,1]
   0x80400242 <vprintk+46>:	mov_s	r1,r14
   0x80400244 <vprintk+48>:	bl.d	27716	;0x80406e88 <vfprintf>
   0x80400248 <vprintk+52>:	mov_s	r0,sp
=> 0x8040024a <vprintk+54>:	add3	sp,sp,0x2
   0x8040024e <vprintk+58>:	ld.ab	blink,[sp,4]
   0x80400252 <vprintk+62>:	ld.ab	r14,[sp,4]
   0x80400256 <vprintk+66>:	j_s.d	[blink]
   0x80400258 <vprintk+68>:	ld.ab	r15,[sp,4]

I stick a break point on both addresses along with _isr_wrapper, stop at the first to wait for a timer interrupt to pend. At this point, sp is 0x80409c40. Continue, and the timer interrupt fires causing the processor to handle the interrupt. At this point, sp is 0x80409c00. That means it moved the stack by 16 values, but there are 17 values to store here -- 14 registers, blink, pc and status32. Looking at memory at the old stack location we find that qemu has actually smashed that value, storing status32 there.

When the rtie instruction executes, it pops all 17 values from the stack, leaving sp at 0x80409c44, which doesn't match the value from before the interrupt was handled.

@keith-packard
Copy link
Collaborator Author

ok, I'm pretty sure this is a bug in qemu caused by combining an interrupt and a jump with a delay slot. In this bug, the timer fires exactly as vfprintf is in the middle of returning, in this sequence:

   0x80406f08 <vfprintf+128>:	ld.ab	r23,[sp,4]
   0x80406f0c <vfprintf+132>:	ld.ab	r24,[sp,4]
   0x80406f10 <vfprintf+136>:	j_s.d	[blink]
   0x80406f12 <vfprintf+138>:	ld.ab	r25,[sp,4]

The instruction in the delay slot increments the stack pointer. If the timer fires right on top of the j_s.d [blink] instruction, then the stack pointer gets mangled for the next few instructions. If you place a break point before that happens, even a single instruction, then the timer will expire while the breakpoint is pending and fire immediately after you continue execution, so the interrupt occurs before the branch and things are just fine.

I'm thinking we should just disable branch delay slots in the qemu targets; they're certainly not an optimization for this platform and it seems that qemu doesn't have them quite right. When I do that, everything is just fine.

keith-packard added a commit to keith-packard/zephyr that referenced this issue Jul 14, 2023
I spent several hours debugging a weird stack pointer corruption bug and
discovered that QEMU appears to mess up register contents when an interrupt
fires during the execution of a branch with a delay slot.

Instead of trying to fix qemu, let's just tell the compiler to not generate
code that uses the branch instructions with delay slots.

Closes: zephyrproject-rtos#60071

Signed-off-by: Keith Packard <keithp@keithp.com>
@keith-packard
Copy link
Collaborator Author

Ok, looks like this qemu bug is fixed upstream and we just need a newer version in the SDK to get this to work.

@evgeniy-paltsev
Copy link
Collaborator

Reopen as we haven't switched to new SDK yet (but the qequired fixes are merged to SDK repo)

carlescufi pushed a commit that referenced this issue Sep 13, 2023
I spent several hours debugging a weird stack pointer corruption bug and
discovered that QEMU appears to mess up register contents when an interrupt
fires during the execution of a branch with a delay slot.

Instead of trying to fix qemu, let's just tell the compiler to not generate
code that uses the branch instructions with delay slots.

Closes: #60071

Signed-off-by: Keith Packard <keithp@keithp.com>
@ruuddw
Copy link
Member

ruuddw commented Sep 14, 2023

Re-opening as a reminder to revert the workaround when the new SDK with qemu fix is used in CI.

@ruuddw ruuddw reopened this Sep 14, 2023
coran21 pushed a commit to coran21/zephyr that referenced this issue Sep 21, 2023
I spent several hours debugging a weird stack pointer corruption bug and
discovered that QEMU appears to mess up register contents when an interrupt
fires during the execution of a branch with a delay slot.

Instead of trying to fix qemu, let's just tell the compiler to not generate
code that uses the branch instructions with delay slots.

Closes: zephyrproject-rtos#60071

Signed-off-by: Keith Packard <keithp@keithp.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: ARC ARC Architecture area: picolibc Picolibc C Standard Library bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug
Projects
None yet
4 participants