kernel: ASSERTION FAIL [z_spin_lock_valid(l)] #13411
From @lemrey: I suspect this is a kernel bug, or at least a bug that was uncovered by recent kernel work. I am able to reproduce this assert consistently by running a modified This is far from my expertise and I might be grossly off, but what I observe in that sample is that the first Edit: this is not an arch bug; it is reproducible on x86 and ARM. A rudimentary augmentation of
From @lemrey: a37a981 is the first bad commit
Looking. Note that that patch series is a bunch of independent changes. You can revert just the work_q one (e.g. "git revert a37a981" applies, builds and runs on master for me), which goes back to the previous irq_lock() implementation (and, for production builds without CONFIG_ASSERT, generates the same code). Can you confirm that fixes the issue?
@andyross thanks for looking into this. I tried reverting a37a981, but:
Oh sorry. That's just a missing include. This was the first patch in the series to put spinlock.h into kernel_includes.h and later patches need it too. You can apply this to re-add the tiny bit needed:
And I'm pretty sure I see where the issue is, though I'm still not seeing it in my own qemu runs. The work_q abstraction is wrapped around a k_queue, and k_queue_insert() is a potentially blocking/rescheduling operation. You can't call that with a spinlock held, and of course we do. (And FWIW, it was always sort of a mess that we were stacking calls like this up with an irq_lock held too. I know this seems like it has no benefit to you, but better detection of poor locking hygiene like this is totally a feature of the spinlock work, I promise!)
@andyross funny that you can't reproduce. I just applied the patch below on top of 65451db:
If it were happening with qemu_x86, we would have seen it in CI already, no?
PEBKAC. I forgot that while the tests/ directory has CONFIG_ASSERT=y automatically, the samples/ directory does not. Blows up nicely for me. And no, we don't actually have a test for this particular code path (adding a work_q item with zero delay) as it turns out. Which is sort of a problem. There's actually a parallel coverage gotcha in there too -- looking at it right now I'm reminded that the queue.c code actually has two completely different implementations depending on whether CONFIG_POLL is enabled, and I'm like 83% confident that we don't repeat all the work_q tests in both configurations.
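For anyone else trying to reproduce this from a sample: since samples/ does not turn assertions on by default, the assert presumably only fires once the option is enabled by hand. A minimal sketch, assuming the sample uses the usual prj.conf file (the file name and placement are an assumption here, not taken from the reproducer branch):

```conf
# Assumed addition to the sample's prj.conf: enable kernel assertions
# so the spinlock validation check can fire.
CONFIG_ASSERT=y
```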
Work queues are implemented in terms of k_queue objects which provide their own synchronization. In particular insertion is potentially blocking and always acts as a reschedule point, which means that it must not be called with spinlocks held. Release the lock first, and do a little cleanup of the resulting k_delayed_work_submit_to_queue() logic. Fixes zephyrproject-rtos#13411 Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
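The locking rule in that commit message can be illustrated with a small standalone model. This is only a sketch and not the actual kernel patch: every name below (toy_spinlock, toy_queue_insert, submit_buggy, submit_fixed, and so on) is hypothetical, and the "blocking insert" is modeled as a function that merely records a violation when it runs while a toy lock is held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model, NOT Zephyr code. */
struct toy_spinlock {
	bool held;
};

static int locks_held;  /* number of toy spinlocks currently held */
static int violations;  /* times the "blocking" insert ran under a lock */

static void toy_lock(struct toy_spinlock *l)
{
	assert(!l->held);   /* no recursive locking in this model */
	l->held = true;
	locks_held++;
}

static void toy_unlock(struct toy_spinlock *l)
{
	assert(l->held);
	l->held = false;
	locks_held--;
}

struct toy_queue {
	int items[8];
	size_t count;
};

/* Stand-in for a potentially blocking insert that is also a
 * reschedule point: it must not run while a spinlock is held.
 */
static void toy_queue_insert(struct toy_queue *q, int item)
{
	if (locks_held > 0) {
		violations++;
	}
	if (q->count < 8) {
		q->items[q->count++] = item;
	}
}

struct toy_work {
	int id;
	bool pending;
};

/* Buggy shape: the insert happens inside the critical section. */
static void submit_buggy(struct toy_spinlock *l, struct toy_queue *q,
			 struct toy_work *w)
{
	toy_lock(l);
	if (!w->pending) {
		w->pending = true;
		toy_queue_insert(q, w->id);
	}
	toy_unlock(l);
}

/* Fixed shape: decide under the lock, release it, then insert. */
static void submit_fixed(struct toy_spinlock *l, struct toy_queue *q,
			 struct toy_work *w)
{
	bool need_insert = false;

	toy_lock(l);
	if (!w->pending) {
		w->pending = true;
		need_insert = true;
	}
	toy_unlock(l);

	if (need_insert) {
		toy_queue_insert(q, w->id);
	}
}
```

The point of the second shape is that only the state update needs the lock; the queue provides its own synchronization, so the insert can happen after the lock is dropped.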
OK, try that patch. I'm thinking maybe I should add a "spinlock depth count" or something to the validation layer to catch this a little earlier and more clearly...
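One way to read the "spinlock depth count" idea, as a rough sketch only: keep a counter of currently held spinlocks and check it at every reschedule point. The names below (spin_depth, depth_lock, depth_unlock, depth_may_reschedule, depth_assert_unlocked) are all hypothetical, and a single global counter is assumed here where a real validation layer would presumably track this per CPU.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch, NOT the Zephyr validation layer.
 * Assumes one CPU, so a single global counter is enough.
 */
static unsigned int spin_depth;

/* Record that one more spinlock is now held. */
static void depth_lock(void)
{
	spin_depth++;
}

/* Record a release; returns false on an unbalanced unlock. */
static bool depth_unlock(void)
{
	if (spin_depth == 0) {
		return false;
	}
	spin_depth--;
	return true;
}

/* A reschedule point is only legal when no spinlock is held. */
static bool depth_may_reschedule(void)
{
	return spin_depth == 0;
}

/* What a blocking call could invoke on entry to fail early. */
static void depth_assert_unlocked(void)
{
	assert(depth_may_reschedule());
}
```

A blocking call would invoke something like depth_assert_unlocked() on entry, so the failure points at the call made under the lock instead of surfacing later in the unlock path.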
Merge upstream up to commit ec424b7. This includes the fix for the Kernel spinlock assert: zephyrproject-rtos/zephyr#13411. Signed-off-by: Emanuele Di Santo <emdi@nordicsemi.no>
- Update fw-nrfconnect-zephyr revision, fixing a Kernel panic issue: zephyrproject-rtos/zephyr#13411 Signed-off-by: Emanuele Di Santo <emdi@nordicsemi.no>
This is similar to #13289 but 100% reproducible with a simple main.c.
Run the following on qemu_x86 or any nrf board: https://github.com/lemrey/zephyr/tree/bug-workq
It results in:
ASSERTION FAIL [z_spin_lock_valid(l)] @ zephyr.git/include/spinlock.h:66