thread_num / thread_num_dynamic are never decremented in CMSIS v2 thread.c #23062

raidoz · 2020-02-25T06:45:31Z

Terminating threads through CMSIS v2 APIs does not free up resources allocated to the threads by the CMSISv2 API. Upon inspection, it is easy to spot that the thread_num and thread_num_dynamic variables are only ever incremented, though they should also be decremented when osThreadExit is called by the thread or osThreadTerminate is called elsewhere. Simply decrementing thread_num in the correct place should be enough for statically allocated threads, however thread_num_dynamic needs some additional handling as the value is used for choosing the stack for the next dynamic thread.

andrewboie · 2020-05-18T19:25:08Z

Simply decrementing thread_num in the correct place should be enough for statically allocated threads

That won't work, thread_num is used to select the thread object in cv2_thread_pool. Decrementing it when a thread exits will not do what you expect, at worst it will cause thread objects to be used twice and crash the kernel. We need something like k_mem_slab for managing this.

The code is seriously incorrect anyway, it tries to manage concurrency with respect to thread/stack IDs with atomic variables but is full of race windows where it falls over.

andrewboie · 2020-05-18T21:08:09Z

Converting this to use k_mem_slab() wasn't bad and I have a patch ready for that.

The problem lies with race conditions with re-using these thread objects. The CMSISv2 wrapper code doesn't seem to handle thread exits very robustly:

If a thread self-exits, it goes through the latter half of zephyr_thread_wrapper(), which then returns to the main thread wrapper which calls k_thread_abort(k_current_get())
osThreadExit() calls k_thread_abort() on the current thread
osThreadTerminate() calls k_thread_abort() on the target thread

What I need is some kind of centralized hook that gets invoked when a CMSIS thread exits (regardless of method), where it is completely safe to release the k_thread and associated stack back to the pool. And I can't seem to safely to this in the context of the exiting thread itself.

This seems a more fundamental race, as the core kernel has a similar problem with k_thread_join() where a joining thread could be woken up due to the logic in k_thread_single_abort(), but the target thread is still using its associated k_thread in the subsequent call to z_swap_unlocked() -- it's only after the context switch completes that the target k_thread is not used again. If the joining thread attempts to re-use the thread object, data corruption might occur.

andrewboie · 2020-05-18T21:35:42Z

In order to get this and #23063 fixed robustly I think I need the following:

An unconditional guarantee that if k_thread_join() is successfully called on some thread X, when that call returns, neither the k_thread or the k_thread_stack_t associated with that thread are ever read/written by the kernel, they have to be truly unused.
The same guarantee for the userspace initialization state of the thread and its stack object; once marked as uninitialized (as is currently done in z_thread_single_abort) it is immediately permissible to re-use the stack or thread object by calling k_thread_create() on it.
A common hook function that runs once a thread abort, which is permitted to dispose of memory for the thread or its stack. We have thread.fn_abort.
_THREAD_DEAD to mean that the thread is really dead, and is not still in use as it swaps itself out for the last time.

This is currently falling over for threads that self-abort because the relevant logic in z_thread_single_abort() takes place on the thread's own stack, and then z_swap_unlocked() is called which further uses members of the k_thread and is also on the thread's own stack.

The same guarantees must also hold if the target thread, instead of being the current thread, is instead running on some other CPU.

I think we need to defer this stuff until we have swapped to a different thread's context. For CONFIG_SWITCH architectures I am eyeing doing this work in the latter half of do_swap(). If this approach pans out I'll have to hook swap-based arches as well.

This probably should also result in some kind of torture test to hammer all CPUs with multiple threads being created, aborted, joining with each other, etc.

andrewboie · 2020-06-26T18:46:04Z

#26486 is most of what's needed to fix this -- if all the z_thread_single_abort() actions are guaranteed to take place after the thread is truly dead (even if it self-aborts) then the rest I think is straightforward.

github-actions · 2020-08-26T00:52:39Z

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

Fixes races where threads on another CPU are joining the exiting thread, since it could still be running when the joiners wake up on a different CPU. Fixes problems where the thread object is still being used by the kernel when the fn_abort() function is called, preventing the thread object from being recycled or freed back to a slab pool. Precedent for doing this comes from FreeRTOS, which also performs final thread cleanup in the idle thread. Add some extra comments and log messages to the meta-irq test case. Some logic in z_thread_single_abort() rearranged such that when we release sched_spinlock, the thread object pointer is never dereferenced by the kernel again; join waiters or fn_abort() logic may free it immediately. An assertion added to z_thread_single_abort() to ensure it never gets called with thread == _current outside of an ISR. This requires an increase of the idle thread stack size. This was already necessary for a lot of applications that have any kind of power management since PM infrastructure gets invoked from the idle thread. Some logic has been added to ensure z_thread_single_abort() tasks don't run more than once. Fixes: zephyrproject-rtos#26486 Related to: zephyrproject-rtos#23063 zephyrproject-rtos#23062 Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

Fixes races where threads on another CPU are joining the exiting thread, since it could still be running when the joiners wake up on a different CPU. Fixes problems where the thread object is still being used by the kernel when the fn_abort() function is called, preventing the thread object from being recycled or freed back to a slab pool. Fixes a race where a thread is aborted from one CPU while it self-aborts on another CPU, that was currently worked around with a busy-wait. Precedent for doing this comes from FreeRTOS, which also performs final thread cleanup in the idle thread. Some logic in z_thread_single_abort() rearranged such that when we release sched_spinlock, the thread object pointer is never dereferenced by the kernel again; join waiters or fn_abort() logic may free it immediately. An assertion added to z_thread_single_abort() to ensure it never gets called with thread == _current outside of an ISR. Some logic has been added to ensure z_thread_single_abort() tasks don't run more than once. Fixes: zephyrproject-rtos#26486 Related to: zephyrproject-rtos#23063 zephyrproject-rtos#23062 Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

andrewboie · 2020-09-24T20:13:46Z

WIP branch still debugging stuff, https://github.com/andrewboie/zephyr/tree/cmsis-thread-slabs

Fixes races where threads on another CPU are joining the exiting thread, since it could still be running when the joiners wake up on a different CPU. Fixes problems where the thread object is still being used by the kernel when the fn_abort() function is called, preventing the thread object from being recycled or freed back to a slab pool. Fixes a race where a thread is aborted from one CPU while it self-aborts on another CPU, that was currently worked around with a busy-wait. Precedent for doing this comes from FreeRTOS, which also performs final thread cleanup in the idle thread. Some logic in z_thread_single_abort() rearranged such that when we release sched_spinlock, the thread object pointer is never dereferenced by the kernel again; join waiters or fn_abort() logic may free it immediately. An assertion added to z_thread_single_abort() to ensure it never gets called with thread == _current outside of an ISR. Some logic has been added to ensure z_thread_single_abort() tasks don't run more than once. Fixes: zephyrproject-rtos#26486 Related to: zephyrproject-rtos#23063 zephyrproject-rtos#23062 Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

Fixes races where threads on another CPU are joining the exiting thread, since it could still be running when the joiners wake up on a different CPU. Fixes problems where the thread object is still being used by the kernel when the fn_abort() function is called, preventing the thread object from being recycled or freed back to a slab pool. Fixes a race where a thread is aborted from one CPU while it self-aborts on another CPU, that was currently worked around with a busy-wait. Precedent for doing this comes from FreeRTOS, which also performs final thread cleanup in the idle thread. Some logic in z_thread_single_abort() rearranged such that when we release sched_spinlock, the thread object pointer is never dereferenced by the kernel again; join waiters or fn_abort() logic may free it immediately. An assertion added to z_thread_single_abort() to ensure it never gets called with thread == _current outside of an ISR. Some logic has been added to ensure z_thread_single_abort() tasks don't run more than once. Fixes: #26486 Related to: #23063 #23062 Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>

github-actions · 2020-12-29T01:36:24Z

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

raidoz added the bug The issue is a bug, or the PR is fixing a bug label Feb 25, 2020

carlescufi assigned nashif and andrewboie Feb 25, 2020

jhedberg added the priority: low Low impact/importance bug label Feb 25, 2020

nashif added priority: medium Medium impact/importance bug and removed priority: low Low impact/importance bug labels Mar 12, 2020

carlescufi added the area: Kernel label May 21, 2020

andrewboie mentioned this issue Jun 26, 2020

possible SMP race with k_thread_join() #26486

Closed

andrewboie mentioned this issue Aug 13, 2020

Option for safe k_thread_abort #21484

Closed

github-actions bot added the Stale label Aug 26, 2020

andrewboie removed the Stale label Aug 26, 2020

andrewboie mentioned this issue Sep 2, 2020

kernel: handle thread self-aborts on idle thread #27995

Merged

nashif removed their assignment Sep 9, 2020

nashif added priority: low Low impact/importance bug and removed priority: medium Medium impact/importance bug labels Oct 29, 2020

github-actions bot added the Stale label Dec 29, 2020

github-actions bot closed this as completed Jan 12, 2021

chen-png mentioned this issue Aug 12, 2021

cmsis_rtos_v1: decrease instances number in osThreadTerminate. #37493

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

thread_num / thread_num_dynamic are never decremented in CMSIS v2 thread.c #23062

thread_num / thread_num_dynamic are never decremented in CMSIS v2 thread.c #23062

raidoz commented Feb 25, 2020

andrewboie commented May 18, 2020

andrewboie commented May 18, 2020 •

edited

Loading

andrewboie commented May 18, 2020 •

edited

Loading

andrewboie commented Jun 26, 2020 •

edited

Loading

github-actions bot commented Aug 26, 2020

andrewboie commented Sep 24, 2020

github-actions bot commented Dec 29, 2020

thread_num / thread_num_dynamic are never decremented in CMSIS v2 thread.c #23062

thread_num / thread_num_dynamic are never decremented in CMSIS v2 thread.c #23062

Comments

raidoz commented Feb 25, 2020

andrewboie commented May 18, 2020

andrewboie commented May 18, 2020 • edited Loading

andrewboie commented May 18, 2020 • edited Loading

andrewboie commented Jun 26, 2020 • edited Loading

github-actions bot commented Aug 26, 2020

andrewboie commented Sep 24, 2020

github-actions bot commented Dec 29, 2020

andrewboie commented May 18, 2020 •

edited

Loading

andrewboie commented May 18, 2020 •

edited

Loading

andrewboie commented Jun 26, 2020 •

edited

Loading