kernel: ASSERTION FAIL [z_spin_lock_valid(l)] #13411

carlescufi · 2019-02-14T19:27:54Z

This is similar to #13289 but 100% reproducible with a simple main.c

Run the following on qemu_x86 or any nrf board:

https://github.com/lemrey/zephyr/tree/bug-workq

prj.conf:
CONFIG_ASSERT=y
---
main.c:
#include <zephyr.h>
#include <misc/printk.h>

struct k_delayed_work dw;

void work_handler(struct k_work *w)
{
	k_delayed_work_submit(&dw, 10);
}

void main(void)
{
	printk("Hello World! %s\n", CONFIG_BOARD);

	k_delayed_work_init(&dw, work_handler);
	k_delayed_work_submit(&dw, K_NO_WAIT);

	while (1) {
		k_sleep(K_SECONDS(1));
	}
}

results in:
ASSERTION FAIL [z_spin_lock_valid(l)] @ zephyr.git/include/spinlock.h:66

The text was updated successfully, but these errors were encountered:

carlescufi · 2019-02-14T19:28:27Z

From @lemrey:

I suspect this is a kernel bug, or some bug that was uncovered by recent kernel work at least. I am able to reproduce this assert consistently by running a modified hello_world sample, in this tree.
However, I have not been able to reproduce the test failure.

This is far from my expertise and I might grossly off, but what I observe in that sample is that the first k_delayed_work_submit call executes its work handler before it has returned, thus spinlocking twice.

edit: this is not an arch bug, it is reproducible on x86 and ARM.

A rudimentary augmentation of k_spin_lock and k_delayed_work_submit_to_queue shows:

Hello World! qemu_x86
k_delayed_work_submit_to_queue enters
>> spinlocking: 0x00400178
queue_insert
>> spinlocking: 0x00400128
_find_first_thread_to_unpend
>> spinlocking: 0x004000c4
<< spinrelease: 0x004000c4
_unpend_thread_no_timeout
>> spinlocking: 0x004000c4
<< spinrelease: 0x004000c4
_abort_timeout
>> spinlocking: 0x00400188
<< spinrelease: 0x00400188
_add_thread_to_ready_q
>> spinlocking: 0x004000c4
z_clock_elapsed
>> spinlocking: 0x00400030
<< spinrelease: 0x00400030
z_set_timeout_expiry
>> spinlocking: 0x00400188
z_clock_set_timeout
>> spinlocking: 0x00400030
<< spinrelease: 0x00400030
<< spinrelease: 0x00400188
<< spinrelease: 0x004000c4
k_delayed_work_submit_to_queue enters
>> spinlocking: 0x00400178
ASSERTION FAIL [z_spin_lock_valid(l)] @ ../../../../include/spinlock.h:72
Recursive spinlock

k_delayed_work_submit_to_queue was supposed to print something before returning; that doesn't happen.

carlescufi · 2019-02-14T19:29:08Z

From @ioannisg:

@andyross @lemrey and all: I can also reproduce this.
Also it is not a stack-overflow (Enabled HW_STACK_PROTECTION and go no MPU errors).

Edit: reproduced it in ARMv8-M (nrf9160_pca10090) as well, with exact same result.

carlescufi · 2019-02-14T19:29:39Z

From @lemrey:

a37a981 is the first bad commit

kernel/work_q: Spinlockify

Each work_q object gets a separate spinlock to synchronize access
instead of the global lock.  Note that there was a recursive lock
condition in k_delayed_work_cancel(), so that's been split out into an
internal unlocked version and the API entry point that wraps it with a
lock.

andyross · 2019-02-14T20:10:52Z

Looking. Note that that patch series is a bunch of independent changes. You can revert just the work_q one (e.g. "git revert a37a981" apples builds and runs on master for me), which goes back to the previous irq_lock() implementation (and, for production builds without CONFIG_ASSERT, generates the same code). Can you confirm that fixes the issue?

carlescufi · 2019-02-14T20:30:20Z

@andyross thanks for looking into this. I tried reverting a37a981, but:

[8/98] Building C object zephyr/CMakeFiles/offsets.dir/arch/x86/core/offsets/offsets.c.obj
FAILED: zephyr/CMakeFiles/offsets.dir/arch/x86/core/offsets/offsets.c.obj
/home/carles/bin/zephyr-sdk/sysroots/x86_64-pokysdk-linux/usr/bin/i586-zephyr-elf/i586-zephyr-elf-gcc -DBUILD_VERc
In file included from ../../../kernel/include/kernel_structs.h:10:0,
                 from /home/carles/src/zephyr/zephyr/arch/x86/core/offsets/offsets.c:29:
../../../include/kernel.h:1735:20: error: field ‘lock’ has incomplete type
  struct k_spinlock lock;
                    ^~~~
../../../include/kernel.h:2377:20: error: field ‘lock’ has incomplete type
  struct k_spinlock lock;
                    ^~~~
../../../include/kernel.h:3154:20: error: field ‘lock’ has incomplete type
  struct k_spinlock lock;
                    ^~~~
../../../include/kernel.h:3457:20: error: field ‘lock’ has incomplete type
  struct k_spinlock lock;
                    ^~~~
../../../include/kernel.h:3637:20: error: field ‘lock’ has incomplete type
  struct k_spinlock lock;  /**< Synchronization lock */
                    ^~~~
ninja: build stopped: subcommand failed.

andyross · 2019-02-14T20:37:48Z

Oh sorry. That's just a missing include. This was the first patch in the series to put spinlock.h into kernel_includes.h and later patches need it too. You can apply this to re-add the tiny bit needed:

--- a/include/kernel_includes.h
+++ b/include/kernel_includes.h
@@ -34,5 +34,6 @@
 #include <arch/cpu.h>
 #include <misc/rb.h>
 #include <sys_clock.h>
+#include <spinlock.h>

carlescufi · 2019-02-14T20:40:28Z

@andyross Thanks. I can confirm that reverting a37a981 (+ adding the missing include) fixes the assert on master, both on qemu_x86 and on a random nrf board.

andyross · 2019-02-14T20:41:15Z

And I'm pretty sure I see where the issue is, though I'm still not seeing it in my own qemu runs. The work_q abstraction is wrapped around a k_queue, and k_queue_insert() is a potentially blocking/rescheduling operation. You can't call that with a spinlock held, and of course we do.

(And FWIW, it was always sort of a mess that we were stacking calls like this up with an irq_lock held too. I know this seems like it has no benefit to you, but better detection of poor locking hygine like this is totally a feature of the spinlock work, I promise!)

carlescufi · 2019-02-14T20:43:28Z

@andyross funny that you can't reproduce.

I just applied the patch below on top of 65451db:

diff --git a/samples/hello_world/prj.conf b/samples/hello_world/prj.conf
index b2a4ba5910..e167126b36 100644
--- a/samples/hello_world/prj.conf
+++ b/samples/hello_world/prj.conf
@@ -1 +1,2 @@
 # nothing here
+CONFIG_ASSERT=y
diff --git a/samples/hello_world/src/main.c b/samples/hello_world/src/main.c
index 2d4a7f0d5f..dc442e5155 100644
--- a/samples/hello_world/src/main.c
+++ b/samples/hello_world/src/main.c
@@ -7,7 +7,21 @@
 #include <zephyr.h>
 #include <misc/printk.h>

+struct k_delayed_work dw;
+
+void work_handler(struct k_work *w)
+{
+       k_delayed_work_submit(&dw, 10);
+}
+
 void main(void)
 {
        printk("Hello World! %s\n", CONFIG_BOARD);
+
+       k_delayed_work_init(&dw, work_handler);
+       k_delayed_work_submit(&dw, K_NO_WAIT);
+
+       while (1) {
+               k_sleep(K_SECONDS(1));
+       }
 }

nashif · 2019-02-14T20:47:19Z

if it was happening with qemu_x86 we would have seen in this CI already, no?

andyross · 2019-02-14T20:49:42Z

PEBKAC. I forgot that while the tests/ directory has CONFIG_ASSERT=y automatically, the samples/ directory does not. Blows up nicely for me.

And no, we don't actually have a test for this particular code path (adding a work_q item with zero delay) as it turns out. Which is sort of a problem.

There's actually a parallel coverage gotcha in there too -- looking at it right now I'm reminded that the queue.c code actually has two completely different implementations depending on whether CONFIG_POLL is enabled, and I'm like 83% confident that we don't repeat all the work_q tests in both configurations.

Work queues are implemented in terms of k_queue objects which provide their own synchronization. In particular insertion is potentially blocking and always acts as a reschedule point, which means that it must not be called with spinlocks held. Release the lock first, and do a little cleanup of the resulting k_delayed_work_submit_to_queue() logic. Fixes zephyrproject-rtos#13411 Signed-off-by: Andy Ross <andrew.j.ross@intel.com>

andyross · 2019-02-14T21:04:02Z

OK, try that patch. I'm thinking maybe I should add a "spinlock depth count" or something to the validation layer to catch this a little earlier and more clearly...

Work queues are implemented in terms of k_queue objects which provide their own synchronization. In particular insertion is potentially blocking and always acts as a reschedule point, which means that it must not be called with spinlocks held. Release the lock first, and do a little cleanup of the resulting k_delayed_work_submit_to_queue() logic. Fixes #13411 Signed-off-by: Andy Ross <andrew.j.ross@intel.com>

Merge upstream up to commit ec424b7. This includes the fix for the Kernel spinlock assert: zephyrproject-rtos/zephyr#13411. Signed-off-by: Emanuele Di Santo <emdi@nordicsemi.no>

- Update fw-nrfconnect-zephyr revision, fixing a Kernel panic issue: zephyrproject-rtos/zephyr#13411 Signed-off-by: Emanuele Di Santo <emdi@nordicsemi.no>

carlescufi added bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug labels Feb 14, 2019

carlescufi assigned andyross and andrewboie Feb 14, 2019

andyross mentioned this issue Feb 14, 2019

kernel/work_q: Fix block-in-spinlock bug #13412

Merged

nashif closed this as completed in #13412 Feb 15, 2019

lemrey mentioned this issue Feb 19, 2019

[nrf mergeup] Merge upstream up to commit ec424b7a45 nrfconnect/sdk-zephyr#99

Merged

lemrey mentioned this issue Feb 19, 2019

manifest: update fw-nrfconnect-zephyr nrfconnect/sdk-nrf#479

Merged

lemrey mentioned this issue Feb 19, 2019

tests/kernel/fifo/fifo_timeout fails on nrf52840_pca10056 #13289

Closed

carlescufi pushed a commit to lemrey/sdk-nrf that referenced this issue Feb 19, 2019

manifest: update fw-nrfconnect-zephyr

53ea2ec

- Update fw-nrfconnect-zephyr revision, fixing a Kernel panic issue: zephyrproject-rtos/zephyr#13411 Signed-off-by: Emanuele Di Santo <emdi@nordicsemi.no>

carlescufi pushed a commit to nrfconnect/sdk-nrf that referenced this issue Feb 19, 2019

manifest: update fw-nrfconnect-zephyr

d1f2c45

- Update fw-nrfconnect-zephyr revision, fixing a Kernel panic issue: zephyrproject-rtos/zephyr#13411 Signed-off-by: Emanuele Di Santo <emdi@nordicsemi.no>

carlescufi pushed a commit to nrfconnect/sdk-nrf that referenced this issue Feb 19, 2019

manifest: update fw-nrfconnect-zephyr

5a91fa2

- Update fw-nrfconnect-zephyr revision, fixing a Kernel panic issue: zephyrproject-rtos/zephyr#13411 Signed-off-by: Emanuele Di Santo <emdi@nordicsemi.no>

carlescufi mentioned this issue Feb 20, 2019

tests/blutooth/tester: ASSERTION FAIL due to Recursive spinlock when running bt tester on qemu-cortex-m3 #13591

Closed

simensrostad mentioned this issue May 19, 2020

Firmware does not respond to higher publication intervals bifravst/firmware#46

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kernel: ASSERTION FAIL [z_spin_lock_valid(l)] #13411

kernel: ASSERTION FAIL [z_spin_lock_valid(l)] #13411

carlescufi commented Feb 14, 2019 •

edited

carlescufi commented Feb 14, 2019 •

edited

carlescufi commented Feb 14, 2019 •

edited

carlescufi commented Feb 14, 2019 •

edited

andyross commented Feb 14, 2019

carlescufi commented Feb 14, 2019

andyross commented Feb 14, 2019

carlescufi commented Feb 14, 2019 •

edited

andyross commented Feb 14, 2019

carlescufi commented Feb 14, 2019

nashif commented Feb 14, 2019

andyross commented Feb 14, 2019

andyross commented Feb 14, 2019

kernel: ASSERTION FAIL [z_spin_lock_valid(l)] #13411

kernel: ASSERTION FAIL [z_spin_lock_valid(l)] #13411

Comments

carlescufi commented Feb 14, 2019 • edited

carlescufi commented Feb 14, 2019 • edited

carlescufi commented Feb 14, 2019 • edited

carlescufi commented Feb 14, 2019 • edited

andyross commented Feb 14, 2019

carlescufi commented Feb 14, 2019

andyross commented Feb 14, 2019

carlescufi commented Feb 14, 2019 • edited

andyross commented Feb 14, 2019

carlescufi commented Feb 14, 2019

nashif commented Feb 14, 2019

andyross commented Feb 14, 2019

andyross commented Feb 14, 2019

carlescufi commented Feb 14, 2019 •

edited

carlescufi commented Feb 14, 2019 •

edited

carlescufi commented Feb 14, 2019 •

edited

carlescufi commented Feb 14, 2019 •

edited

carlescufi commented Feb 14, 2019 •

edited