reproducible qemu_x86_64 SMP failures #30360

Closed
pabigot opened this issue Dec 1, 2020 · 5 comments
Labels: area: SMP (Symmetric multiprocessing); bug (The issue is a bug, or the PR is fixing a bug); Duplicate (This issue is a duplicate of another issue); priority: medium (Medium impact/importance bug)

Comments

@pabigot
Collaborator

pabigot commented Dec 1, 2020

#29618 has a new kernel work queue API that works very nicely, but intermittently fails its tests with SMP enabled. To narrow down the problem https://github.com/pabigot/zephyr/commits/nordic/20201201a has a replacement hello_world sample that simulates the basic functionality: work items are placed on a queue which is monitored by a thread that removes the items, does something with them, and marks them as complete.

The main thread repeatedly submits items and spins until the completion signal is detected.
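For readers without the branch checked out, the pattern described above can be sketched roughly as follows. This is not the code from the linked commits: it is a portable approximation that uses POSIX threads and C11 atomics instead of Zephyr kernel primitives, and every name in it (`work_item`, `submit`, `run_iterations`, the single-slot queue) is hypothetical. The `sched_yield()` in the spin loop is also an addition, so the sketch stays responsive on a single-core host.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical work states; the names are illustrative only. */
enum { WORK_IDLE, WORK_PENDING, WORK_DONE };

struct work_item {
    atomic_int state; /* completion signal shared between threads */
    int payload;
    int result;
};

/* Single-slot "queue" guarded by a mutex and condition variable. */
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;
static struct work_item *q_slot; /* NULL when the queue is empty */
static int q_stop;

/* Queue thread: remove an item, do something with it, mark it complete. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (q_slot == NULL && !q_stop) {
            pthread_cond_wait(&q_cond, &q_lock);
        }
        struct work_item *w = q_slot;
        q_slot = NULL;
        pthread_mutex_unlock(&q_lock);

        if (w == NULL) {
            return NULL; /* stop requested and nothing left to process */
        }
        w->result = w->payload * 2;
        atomic_store(&w->state, WORK_DONE);
    }
}

static void submit(struct work_item *w)
{
    atomic_store(&w->state, WORK_PENDING);
    pthread_mutex_lock(&q_lock);
    q_slot = w;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_lock);
}

/* Submit n items one at a time; return how many completed correctly. */
int run_iterations(int n)
{
    pthread_t tid;
    struct work_item w = { .payload = 0 };
    int completed = 0;

    atomic_init(&w.state, WORK_IDLE);
    q_slot = NULL;
    q_stop = 0;

    int rc = pthread_create(&tid, NULL, worker, NULL);
    assert(rc == 0);
    (void)rc;

    for (int i = 0; i < n; i++) {
        w.payload = i;
        submit(&w);
        /* Spin until the worker signals completion. */
        while (atomic_load(&w.state) != WORK_DONE) {
            sched_yield();
        }
        if (w.result == 2 * i) {
            completed++;
        }
    }

    pthread_mutex_lock(&q_lock);
    q_stop = 1;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_lock);
    pthread_join(tid, NULL);
    return completed;
}
```

On a correctly working SMP host, `run_iterations(n)` should return `n` for any `n`; the failure reported here would correspond to the loop aborting or hanging partway through.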

When run under qemu_x86_64 the application fails with no diagnostics after a varying number of iterations. Rarely, the failure manifests as a deadlock (the program stops producing output but does not terminate).

This behavior can be observed on master, and when run with (rebased) #29668 that fixes potential problems with thread resumption.

The code is simple enough that it seems correct, and the behavior is not within the realm of expectation. This may be related to #28105, as failures in the real test case sometimes exhibited the same "Attempt to resume un-suspended thread object" diagnostic.

@pabigot pabigot added bug The issue is a bug, or the PR is fixing a bug area: SMP Symmetric multiprocessing labels Dec 1, 2020
@pabigot
Collaborator Author

pabigot commented Dec 1, 2020

https://github.com/pabigot/zephyr/commits/pr/29668 is on the rebased #29668 and adds a commit that changes the volatile int used to communicate work state to an atomic_t. This has no visible effect on the failures. (A hypothesis was that the atomic instructions do not work correctly between processors in qemu_x86_64, regardless of GCC's sequential consistency promise. That has not been ruled out.)

Using zephyr-sdk-0.12.0-beta-2 instead of zephyr-sdk-0.11.4 has no visible effect either.

@nashif nashif added the priority: medium Medium impact/importance bug label Dec 1, 2020
@andyross
Contributor

andyross commented Dec 1, 2020

Are you sure this isn't just a race in the test? I see two threads modifying the same flags bitfield using standard C syntax without locking. Those operations aren't atomic. Can you repeat using either a spinlock around the bit operations or the atomic bit operations API and still see the problem?
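To illustrate the concern, here is a minimal sketch of two threads updating different bits of one shared flags word. It uses C11 `<stdatomic.h>` and POSIX threads so it can run on a host; in Zephyr the corresponding calls would be along the lines of `atomic_set_bit()` and `atomic_clear_bit()`. The flag names, iteration count, and `run_flag_test()` are hypothetical and not taken from the test under discussion.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical flag bits sharing one word. */
#define FLAG_A (1u << 0)
#define FLAG_B (1u << 1)
#define TOGGLE_ITERS 100000

static atomic_uint shared_flags;

/*
 * Each thread repeatedly sets and clears its own bit, then leaves it set.
 * With plain "flags |= bit; flags &= ~bit;" the read-modify-write of one
 * thread can overwrite the other thread's bit. The atomic fetch-or and
 * fetch-and operations make each update indivisible.
 */
static void *toggle_bit(void *arg)
{
    unsigned int bit = *(unsigned int *)arg;

    for (int i = 0; i < TOGGLE_ITERS; i++) {
        atomic_fetch_or(&shared_flags, bit);
        atomic_fetch_and(&shared_flags, ~bit);
    }
    atomic_fetch_or(&shared_flags, bit);
    return NULL;
}

/* Run both togglers to completion and return the final flags word. */
unsigned int run_flag_test(void)
{
    pthread_t ta, tb;
    unsigned int bit_a = FLAG_A;
    unsigned int bit_b = FLAG_B;

    atomic_store(&shared_flags, 0u);

    int rc = pthread_create(&ta, NULL, toggle_bit, &bit_a);
    assert(rc == 0);
    rc = pthread_create(&tb, NULL, toggle_bit, &bit_b);
    assert(rc == 0);
    (void)rc;

    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return atomic_load(&shared_flags);
}
```

With the atomic operations, both bits are always set when `run_flag_test()` returns. Replacing them with plain `|=` and `&=` on a non-atomic variable introduces a data race in which one thread's final update can be lost.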

@pabigot
Collaborator Author

pabigot commented Dec 1, 2020

A version with atomics is at https://github.com/pabigot/zephyr/commits/pr/29668 and has no effect. In any case, a race condition doesn't explain why the application aborts.

@pabigot
Collaborator Author

pabigot commented Dec 1, 2020

FYI: This appears to be a qemu issue: when the pr/29668 version is run on up-squared it works just fine.

@dcpleung
Member

This is actually the same as #28105.

If you enable CONFIG_TEST=y, you can see the exception printed out. So I am closing this one as a duplicate of #28105.

@dcpleung dcpleung added the Duplicate This issue is a duplicate of another issue (please specify) label Jan 20, 2021
@rljordan rljordan assigned dcpleung and unassigned andyross Jan 21, 2021