Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kernel: ASSERTION FAIL [z_spin_lock_valid(l)] #13411

Closed
carlescufi opened this issue Feb 14, 2019 · 12 comments
Closed

kernel: ASSERTION FAIL [z_spin_lock_valid(l)] #13411

carlescufi opened this issue Feb 14, 2019 · 12 comments
Assignees
Labels
bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug

Comments

@carlescufi
Copy link
Member

carlescufi commented Feb 14, 2019

This is similar to #13289 but 100% reproducible with a simple main.c

Run the following on qemu_x86 or any nrf board:

https://github.com/lemrey/zephyr/tree/bug-workq

prj.conf:
CONFIG_ASSERT=y
---
main.c:
#include <zephyr.h>
#include <misc/printk.h>

struct k_delayed_work dw;

void work_handler(struct k_work *w)
{
	k_delayed_work_submit(&dw, 10);
}

void main(void)
{
	printk("Hello World! %s\n", CONFIG_BOARD);

	k_delayed_work_init(&dw, work_handler);
	k_delayed_work_submit(&dw, K_NO_WAIT);

	while (1) {
		k_sleep(K_SECONDS(1));
	}
}

results in:
ASSERTION FAIL [z_spin_lock_valid(l)] @ zephyr.git/include/spinlock.h:66

@carlescufi carlescufi added bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug labels Feb 14, 2019
@carlescufi
Copy link
Member Author

carlescufi commented Feb 14, 2019

From @lemrey:

I suspect this is a kernel bug, or some bug that was uncovered by recent kernel work at least. I am able to reproduce this assert consistently by running a modified hello_world sample, in this tree.
However, I have not been able to reproduce the test failure.

This is far from my expertise and I might grossly off, but what I observe in that sample is that the first k_delayed_work_submit call executes its work handler before it has returned, thus spinlocking twice.

edit: this is not an arch bug, it is reproducible on x86 and ARM.

A rudimentary augmentation of k_spin_lock and k_delayed_work_submit_to_queue shows:

Hello World! qemu_x86
k_delayed_work_submit_to_queue enters
>> spinlocking: 0x00400178
queue_insert
>> spinlocking: 0x00400128
_find_first_thread_to_unpend
>> spinlocking: 0x004000c4
<< spinrelease: 0x004000c4
_unpend_thread_no_timeout
>> spinlocking: 0x004000c4
<< spinrelease: 0x004000c4
_abort_timeout
>> spinlocking: 0x00400188
<< spinrelease: 0x00400188
_add_thread_to_ready_q
>> spinlocking: 0x004000c4
z_clock_elapsed
>> spinlocking: 0x00400030
<< spinrelease: 0x00400030
z_set_timeout_expiry
>> spinlocking: 0x00400188
z_clock_set_timeout
>> spinlocking: 0x00400030
<< spinrelease: 0x00400030
<< spinrelease: 0x00400188
<< spinrelease: 0x004000c4
k_delayed_work_submit_to_queue enters
>> spinlocking: 0x00400178
ASSERTION FAIL [z_spin_lock_valid(l)] @ ../../../../include/spinlock.h:72
Recursive spinlock

k_delayed_work_submit_to_queue was supposed to print something before returning; that doesn't happen.

@carlescufi
Copy link
Member Author

carlescufi commented Feb 14, 2019

From @ioannisg:

@andyross @lemrey and all: I can also reproduce this.
Also it is not a stack-overflow (Enabled HW_STACK_PROTECTION and go no MPU errors).

Edit: reproduced it in ARMv8-M (nrf9160_pca10090) as well, with exact same result.

@carlescufi
Copy link
Member Author

carlescufi commented Feb 14, 2019

From @lemrey:

a37a981 is the first bad commit

kernel/work_q: Spinlockify

Each work_q object gets a separate spinlock to synchronize access
instead of the global lock.  Note that there was a recursive lock
condition in k_delayed_work_cancel(), so that's been split out into an
internal unlocked version and the API entry point that wraps it with a
lock.

@andyross
Copy link
Contributor

Looking. Note that that patch series is a bunch of independent changes. You can revert just the work_q one (e.g. "git revert a37a981" apples builds and runs on master for me), which goes back to the previous irq_lock() implementation (and, for production builds without CONFIG_ASSERT, generates the same code). Can you confirm that fixes the issue?

@carlescufi
Copy link
Member Author

@andyross thanks for looking into this. I tried reverting a37a981, but:

[8/98] Building C object zephyr/CMakeFiles/offsets.dir/arch/x86/core/offsets/offsets.c.obj
FAILED: zephyr/CMakeFiles/offsets.dir/arch/x86/core/offsets/offsets.c.obj
/home/carles/bin/zephyr-sdk/sysroots/x86_64-pokysdk-linux/usr/bin/i586-zephyr-elf/i586-zephyr-elf-gcc -DBUILD_VERc
In file included from ../../../kernel/include/kernel_structs.h:10:0,
                 from /home/carles/src/zephyr/zephyr/arch/x86/core/offsets/offsets.c:29:
../../../include/kernel.h:1735:20: error: field ‘lock’ has incomplete type
  struct k_spinlock lock;
                    ^~~~
../../../include/kernel.h:2377:20: error: field ‘lock’ has incomplete type
  struct k_spinlock lock;
                    ^~~~
../../../include/kernel.h:3154:20: error: field ‘lock’ has incomplete type
  struct k_spinlock lock;
                    ^~~~
../../../include/kernel.h:3457:20: error: field ‘lock’ has incomplete type
  struct k_spinlock lock;
                    ^~~~
../../../include/kernel.h:3637:20: error: field ‘lock’ has incomplete type
  struct k_spinlock lock;  /**< Synchronization lock */
                    ^~~~
ninja: build stopped: subcommand failed.

@andyross
Copy link
Contributor

Oh sorry. That's just a missing include. This was the first patch in the series to put spinlock.h into kernel_includes.h and later patches need it too. You can apply this to re-add the tiny bit needed:

--- a/include/kernel_includes.h
+++ b/include/kernel_includes.h
@@ -34,5 +34,6 @@
 #include <arch/cpu.h>
 #include <misc/rb.h>
 #include <sys_clock.h>
+#include <spinlock.h>

@carlescufi
Copy link
Member Author

carlescufi commented Feb 14, 2019

@andyross Thanks. I can confirm that reverting a37a981 (+ adding the missing include) fixes the assert on master, both on qemu_x86 and on a random nrf board.

@andyross
Copy link
Contributor

And I'm pretty sure I see where the issue is, though I'm still not seeing it in my own qemu runs. The work_q abstraction is wrapped around a k_queue, and k_queue_insert() is a potentially blocking/rescheduling operation. You can't call that with a spinlock held, and of course we do.

(And FWIW, it was always sort of a mess that we were stacking calls like this up with an irq_lock held too. I know this seems like it has no benefit to you, but better detection of poor locking hygine like this is totally a feature of the spinlock work, I promise!)

@carlescufi
Copy link
Member Author

@andyross funny that you can't reproduce.

I just applied the patch below on top of 65451db:

diff --git a/samples/hello_world/prj.conf b/samples/hello_world/prj.conf
index b2a4ba5910..e167126b36 100644
--- a/samples/hello_world/prj.conf
+++ b/samples/hello_world/prj.conf
@@ -1 +1,2 @@
 # nothing here
+CONFIG_ASSERT=y
diff --git a/samples/hello_world/src/main.c b/samples/hello_world/src/main.c
index 2d4a7f0d5f..dc442e5155 100644
--- a/samples/hello_world/src/main.c
+++ b/samples/hello_world/src/main.c
@@ -7,7 +7,21 @@
 #include <zephyr.h>
 #include <misc/printk.h>

+struct k_delayed_work dw;
+
+void work_handler(struct k_work *w)
+{
+       k_delayed_work_submit(&dw, 10);
+}
+
 void main(void)
 {
        printk("Hello World! %s\n", CONFIG_BOARD);
+
+       k_delayed_work_init(&dw, work_handler);
+       k_delayed_work_submit(&dw, K_NO_WAIT);
+
+       while (1) {
+               k_sleep(K_SECONDS(1));
+       }
 }

@nashif
Copy link
Member

nashif commented Feb 14, 2019

if it was happening with qemu_x86 we would have seen in this CI already, no?

@andyross
Copy link
Contributor

PEBKAC. I forgot that while the tests/ directory has CONFIG_ASSERT=y automatically, the samples/ directory does not. Blows up nicely for me.

And no, we don't actually have a test for this particular code path (adding a work_q item with zero delay) as it turns out. Which is sort of a problem.

There's actually a parallel coverage gotcha in there too -- looking at it right now I'm reminded that the queue.c code actually has two completely different implementations depending on whether CONFIG_POLL is enabled, and I'm like 83% confident that we don't repeat all the work_q tests in both configurations.

andyross pushed a commit to andyross/zephyr that referenced this issue Feb 14, 2019
Work queues are implemented in terms of k_queue objects which provide
their own synchronization.  In particular insertion is potentially
blocking and always acts as a reschedule point, which means that it
must not be called with spinlocks held.

Release the lock first, and do a little cleanup of the resulting
k_delayed_work_submit_to_queue() logic.

Fixes zephyrproject-rtos#13411

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
@andyross
Copy link
Contributor

OK, try that patch. I'm thinking maybe I should add a "spinlock depth count" or something to the validation layer to catch this a little earlier and more clearly...

nashif pushed a commit that referenced this issue Feb 15, 2019
Work queues are implemented in terms of k_queue objects which provide
their own synchronization.  In particular insertion is potentially
blocking and always acts as a reschedule point, which means that it
must not be called with spinlocks held.

Release the lock first, and do a little cleanup of the resulting
k_delayed_work_submit_to_queue() logic.

Fixes #13411

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
lemrey added a commit to lemrey/sdk-zephyr that referenced this issue Feb 19, 2019
Merge upstream up to commit ec424b7.

This includes the fix for the Kernel spinlock assert:
zephyrproject-rtos/zephyr#13411.

Signed-off-by: Emanuele Di Santo <emdi@nordicsemi.no>
carlescufi pushed a commit to lemrey/sdk-nrf that referenced this issue Feb 19, 2019
- Update fw-nrfconnect-zephyr revision, fixing a Kernel panic issue:
  zephyrproject-rtos/zephyr#13411

Signed-off-by: Emanuele Di Santo <emdi@nordicsemi.no>
carlescufi pushed a commit to nrfconnect/sdk-nrf that referenced this issue Feb 19, 2019
- Update fw-nrfconnect-zephyr revision, fixing a Kernel panic issue:
  zephyrproject-rtos/zephyr#13411

Signed-off-by: Emanuele Di Santo <emdi@nordicsemi.no>
carlescufi pushed a commit to nrfconnect/sdk-nrf that referenced this issue Feb 19, 2019
- Update fw-nrfconnect-zephyr revision, fixing a Kernel panic issue:
  zephyrproject-rtos/zephyr#13411

Signed-off-by: Emanuele Di Santo <emdi@nordicsemi.no>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug
Projects
None yet
Development

No branches or pull requests

4 participants