reproducible qemu_x86_64 SMP failures #30360
Comments
https://github.com/pabigot/zephyr/commits/pr/29668 is on the rebased #29668 and adds a commit that changes the [...] Neither does using zephyr-sdk-0.12.0-beta-2 instead of zephyr-sdk-0.11.4.
Are you sure this isn't just a race in the test? I see two threads modifying the same flags bitfield using standard C syntax without locking. Those operations aren't atomic. Can you repeat using either a spinlock around the bit operations or the atomic bit operations API and still see the problem?
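To make the suggestion concrete, here is a minimal sketch of lock-free flag updates. It uses C11 `<stdatomic.h>` as a host-side stand-in for the kernel's atomic bit operations; the flag names and helper functions are hypothetical and are not taken from the test under discussion.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit positions, for illustration only. */
enum { FLAG_PENDING = 0, FLAG_DONE = 1 };

/* Atomically set one bit. Returns true if the bit was previously clear. */
static bool flag_set(atomic_uint *flags, unsigned bit)
{
    unsigned mask = 1u << bit;

    /* fetch_or is a single read-modify-write, so a concurrent update
     * to another bit in the same word cannot be lost. */
    return (atomic_fetch_or(flags, mask) & mask) == 0;
}

/* Atomically clear one bit. Returns true if the bit was previously set. */
static bool flag_clear(atomic_uint *flags, unsigned bit)
{
    unsigned mask = 1u << bit;

    return (atomic_fetch_and(flags, ~mask) & mask) != 0;
}

/* Atomically read one bit. */
static bool flag_test(atomic_uint *flags, unsigned bit)
{
    return (atomic_load(flags) & (1u << bit)) != 0;
}
```

By contrast, a plain `flags |= mask` compiles to a separate load, OR, and store, so two CPUs updating different bits of the same word can overwrite each other's change.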
A version with atomics is at https://github.com/pabigot/zephyr/commits/pr/29668 and has no effect. In any case, a race condition doesn't explain why the application aborts.
FYI: This appears to be a qemu issue: when the pr/29668 version is run on up-squared it works just fine.
#29618 has a new kernel work queue API that works very nicely, but intermittently fails its tests with SMP enabled. To narrow down the problem https://github.com/pabigot/zephyr/commits/nordic/20201201a has a replacement hello_world sample that simulates the basic functionality: work items are placed on a queue which is monitored by a thread that removes the items, does something with them, and marks them as complete. The main thread repeatedly submits items and spins until the completion signal is detected.
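For readers without the branch checked out, the pattern described above can be sketched roughly as follows. This is a host-side analogue using POSIX threads and C11 atomics, not the sample from the linked branch and not the Zephyr kernel API; every name in it (`work_item`, `work_queue`, `submit`, `run_iterations`) is hypothetical, and the queue is one slot deep for brevity.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical work item: a payload plus a completion flag. */
struct work_item {
    int payload;
    atomic_bool done;
};

/* Hypothetical one-slot queue guarded by a mutex and condition variable. */
struct work_queue {
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
    struct work_item *slot;
    bool stop;
    long processed; /* written only by the worker, read after join */
};

/* Queue thread: remove an item, process it, mark it complete. */
static void *worker(void *arg)
{
    struct work_queue *q = arg;

    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->slot == NULL && !q->stop) {
            pthread_cond_wait(&q->nonempty, &q->lock);
        }
        struct work_item *item = q->slot;
        q->slot = NULL;
        pthread_mutex_unlock(&q->lock);

        if (item == NULL) {
            return NULL; /* stop requested and nothing left to do */
        }
        q->processed += item->payload;   /* "does something" with the item */
        atomic_store(&item->done, true); /* completion signal */
    }
}

/* Place an item on the queue and wake the worker. */
static void submit(struct work_queue *q, struct work_item *item)
{
    atomic_store(&item->done, false);
    pthread_mutex_lock(&q->lock);
    q->slot = item;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Main thread: submit n items, spinning on each completion flag.
 * Returns the number of items processed, or -1 if the thread
 * could not be created. */
long run_iterations(int n)
{
    struct work_queue q = { .slot = NULL, .stop = false, .processed = 0 };
    struct work_item item;
    pthread_t tid;

    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.nonempty, NULL);
    atomic_init(&item.done, false);
    if (pthread_create(&tid, NULL, worker, &q) != 0) {
        return -1;
    }

    for (int i = 0; i < n; i++) {
        item.payload = 1;
        submit(&q, &item);
        while (!atomic_load(&item.done)) {
            /* spin until the worker marks the item complete */
        }
    }

    pthread_mutex_lock(&q.lock);
    q.stop = true;
    pthread_cond_signal(&q.nonempty);
    pthread_mutex_unlock(&q.lock);
    pthread_join(tid, NULL);

    pthread_cond_destroy(&q.nonempty);
    pthread_mutex_destroy(&q.lock);
    return q.processed;
}
```

In this sketch every hand-off goes through either the mutex or an atomic store and load, so each submitted item should be processed exactly once and a call such as `run_iterations(1000)` is expected to return 1000.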
When run under qemu_x86_64 the application fails with no diagnostics after a varying number of iterations. Rarely, the failure manifests as a deadlock (the program stops producing output but does not terminate).
This behavior can be observed on master, and also when run with the rebased #29668, which fixes potential problems with thread resumption.
The code is simple enough that it appears to be correct, so the observed behavior is unexpected. This may be related to #28105, as failures in the real test case sometimes produced the same "Attempt to resume un-suspended thread object" diagnostic.