kernel: workq: introduce delayable work #28891
Conversation
These are all drive-by thoughts; I haven't looked at this in detail yet. I'm sharing them because an overall refactor of Zephyr IPC is something that's been on my mind lately.
I'm wondering if the whole workqueue API set could be moved out of the core kernel into library code (lib/os).
For the case of non-delayed work, all we're doing is interacting with a thread and a k_queue, so there's no need for it to be in the core kernel at all.
For the case of delayed work, we additionally have timeouts to deal with, so it has to be in the core kernel as it's calling scheduler and timeout APIs like z_abort_timeout... but maybe there's some lower-level timeout functionality we could expose as public kernel APIs to close this.
This isn't unique to workqueues. I've thought for a while that we could move quite a bit of our IPC into library code, expressed in terms of some low-level wait/wake APIs, much like how Linux userland implements a variety of IPC based on futexes. This is not to say we should make all our IPC futex-based (you can't synchronize with ISRs without spinlocks), but I still feel there is an opportunity for a layer of abstraction such that we could move a lot of IPC stuff out of the kernel itself (with the caveat that such abstractions mustn't mess with the kernel's overall footprint).
We also have a gap in that none of the delayed work APIs work with user mode, but I don't have any suggestions about that at this time; I'd have to think about it some more (it's been a few years).
BTW, what I've written above need not be addressed in this PR. One of the really positive things about all the recent discussion is that we're homing in on a very specific, unambiguous definition of how we want these IPC APIs to behave, so if much later the implementation gets refactored again that's just behind-the-scenes stuff.
Force-pushed 3910e96 to bb6650b
Force-pushed 4613116 to 6b3ba77
I've brought this out of draft a bit early so we can get started on the discussion. There are roughly 80 source files in drivers, samples, or subsys that use the existing delayed work API. I fixed one example in Bluetooth so far, and found another problem as well. Many of the others are not obviously correct, so all maintainers of affected code should review them. At this point I intend to do additional fixes only for things I have some responsibility for, or if necessary to confirm a solution idiom that will be recommended in the (to be pushed) Best Practices documentation.
Force-pushed 6b3ba77 to d7567f6
Force-pushed d7567f6 to b2ee0a9
drivers/serial/uart_nrfx_uart.c (outdated)

@@ -76,7 +80,8 @@ static struct {
 	volatile size_t rx_counter;
 	volatile size_t rx_offset;
 	int32_t rx_timeout;
-	struct k_delayed_work rx_timeout_work;
+	atomic_t pending_flags[1];
+	struct k_work_delayable rx_timeout_work;
I think there is no need to involve any k_work here, and k_delayed_work was simply misused in this case. Just k_timer should be sufficient, as it is in uart_nrfx_uarte.c. The same applies to the TX side. @Mierunski Did I miss some reason why k_timer would not work here?
If that's the way the Nordic serial folks want to go I'll remove this commit, but the test fails with the current code (#29173) so it needs to be fixed. (It'd be a lot simpler if the async API allowed the timeout to be from the first received character, rather than the last received character, as the constant rescheduling likely has a significant impact on ISR time.)
@eanderlind @pabigot As I pointed out above, adding a bit to track the timeout submission within app code can lead to a race if the timeout is re-enabled: you then don't know whether the timeout that triggered is the new one or the one you could not cancel. For timeouts this means the app has to check the elapsed time within the work callback, or count the works it started that were not successfully cancelled. I think the best way would be to add an official cancel API that can be used from a defined context.
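The counting idiom the comment describes can be sketched as a generation counter in plain C (hypothetical names, app-side code only, not Zephyr API): every re-arm bumps the generation, and a handler that fires for an older generation knows it belongs to a submission the app failed to cancel.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical app-side state: a generation counter bumped every time the
 * timeout is (re)started. */
static atomic_uint timeout_gen;

/* Called when the app (re)arms the timeout; returns the generation that the
 * scheduled handler should capture. */
static unsigned int timeout_rearm(void)
{
	return atomic_fetch_add(&timeout_gen, 1) + 1;
}

/* Called from the work handler with the generation captured at schedule
 * time; false means the handler belongs to a superseded submission and its
 * effects should be suppressed. */
static bool timeout_is_current(unsigned int gen)
{
	return atomic_load(&timeout_gen) == gen;
}
```

A handler that observes `timeout_is_current() == false` simply returns, which is exactly the "check within the work callback" fallback the comment mentions in the absence of a reliable cancel.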
@eanderlind There is no problem using a spinlock from within an ISR. @pdunaj As long as the application follows the rules (including not invoking the delayed API from an ISR), this doesn't add new problems that applications don't already have. It just provides an API that doesn't mislead applications into thinking their cancellations actually work. That mistaken expectation is why Nordic async UART is broken. If an application really can't tolerate having a work handler invoked multiple times, it has to make sure the item isn't submitted multiple times, or take care of checking in the handler function itself. Or just don't use a work item.
@pabigot all this is true, especially the last part.
I'm keeping this in DNM until @andrewboie confirms it's acceptable. Then my plan is to remove the example commits that change existing uses to new solutions from this PR, merge the new API, then resubmit the change-existing-use commits as independent commits that can be considered by the respective maintainers. I'll bring the solution to #29173 back in its own PR at that time.
@@ -147,6 +151,51 @@ may or may not be invoked.
 must check return values and be prepared to react when either submission or
 cancellation fails.

+.. _k_delayable_work:
+
+Delayable Work
Sorry for being so late in the game, and perhaps this has been discussed already in the PR. The k_delayable_work is almost identical to k_delayed_work; was there any consideration to name the new one k_scheduled_work, as that would look similar to the new schedule API naming?
I think it started out that way, but as I came to understand the core concepts I changed it. Thinking of the life cycle as "scheduled" then "submitted" is fundamental to understanding what can and cannot be done: i.e. it can be cancelled only if it's scheduled, not if it's submitted.
The structure may or may not be scheduled (it can go directly to pending), so I think it's accurate to name it as a work item that supports being delayed.
Also k_work_schedule pairs with k_work_cancel nicely. Making it k_scheduled_work_submit() and k_scheduled_work_cancel() to follow the consistent-prefix rule is (a) longer, which annoys some people, and (b) doesn't as clearly distinguish the two phases of scheduled and submitted (IMO) (and k_scheduled_work_schedule() seems redundant).
@pabigot Sounds like there is no consensus to make any large change, even in new API. Looking at Linux and mbed, they both provide similar APIs that don't introduce (from what I can see) these complex race conditions. Linux e.g. chose to

I should have said "undesirable to wait long on a spinlock in an ISR".
If the spinlock in work_q.c is protecting all these timeout objects, we need to make sure we are holding that lock whenever they are accessed. That does not seem to be the case for k_work_delayable_remaining_ticks, k_work_delayable_expires_ticks, or k_work_delayable_remaining_get.
Or is the set of active system timeouts protected by the timeout_lock instead? @andyross can you comment on the scope of these locks?
We are not at all in a hurry to merge this. @eanderlind from your perspective, do you have an alternate proposal for what this mechanism could look like?
AFAICT those are all protected by the global lock. I added those only to match the existing ones.
@andrewboie As alluded to in previous comments, something closer to the Linux API that permits the caller to control what happens without building a new framework in the caller. We know that using separate functions for "peeking" and "doing" won't work, since state can change in between. It should be straightforward (and efficient) to implement the following cases:
@eanderlind thanks. @pabigot would you be willing to flesh this approach further? The PR in its current state is an improvement over what was present before, but I concur with others that we really haven't solved its fundamentally race-prone nature.
You can't have one lock protecting a data structure in some contexts, and another lock protecting it in others. Uniprocessor systems get away with this because spinlocks just disable interrupts, but in SMP this can and will fall over in hard-to-debug ways. For any given shared data structure it really needs to be identified what lock protects it, and all access must be serialized with that lock.
No, that's not what happens. All
I'll see what can be done.
OK, there are some things we could do. First, let's clearly identify the states that a (delayable) work item passes through today:
All existing Zephyr work API is safe to invoke from ISRs. It never sleeps, and the time required to submit a work item is bounded. Zephyr's current work queue architecture uses only one bit of the atomic structure and calls it PENDING. This allows distinguishing three groups of states that a work item may be in:
Here's the problematic race condition in Zephyr today:
For example, Zephyr code like the Nordic UART driver uses this from an ISR and requires that a specific observable event (notifying the application of a timeout) from W1 will not occur after W2. It does this by either implicitly or explicitly trying to unschedule/unsubmit W1. W2 can occur when the W1 submission of the item is in state S3 though S8. If W1 is in S7 then there is a race condition, because the work handler may or may not have reached the point where the undesired observable event has been generated.
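The limitation behind this race can be illustrated with a plain-C model of a single-PENDING-bit scheme (hypothetical names, not the actual Zephyr implementation): because the bit is cleared when the queue thread takes the item, a cancel attempt cannot distinguish "never submitted" from "handler about to run or running".

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model: the one state bit the current architecture tracks. */
static atomic_uint pending;

/* Submit: set PENDING; returns false if it was already set. */
static bool one_bit_submit(void)
{
	return atomic_exchange(&pending, 1u) == 0u;
}

/* The queue thread clears PENDING when it takes the item, *before* the
 * handler has produced any observable effect. */
static void one_bit_take(void)
{
	atomic_store(&pending, 0u);
}

/* Cancel only knows the bit: once take() has happened, it reports
 * "nothing to cancel" whether the handler is about to run, is running,
 * or finished long ago. */
static bool one_bit_cancel(void)
{
	return atomic_exchange(&pending, 0u) != 0u;
}
```

In the W1/W2 scenario, the W2 cancel can land just after `one_bit_take()`, get a "nothing pending" answer, and still be followed by W1's handler emitting the event the caller tried to suppress.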
The race when W1 is in S7 cannot be solved by the workqueue API without introducing a function that waits until W1 has left S7. I can rework the implementation so that we can distinguish all necessary states, with these caveats:
This would enable a more flexible API:
I don't think this is too terribly difficult to do, but it's a significant change from the existing implementation. And none of this will prevent the W1-W2 race condition from ISR context, where the wait-for-S7 capability can't be used. Users of the work API will still have to figure out how to solve that. @andrewboie how do you want me to proceed?
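A minimal sketch of the proposed richer state tracking, in plain C with hypothetical names (a model, not the Zephyr implementation): with a separate RUNNING bit, cancellation can report precisely whether it took effect, and a caller that must not observe the handler's effects knows it has to wait out RUNNING (the S7 case), which is exactly what an ISR cannot do.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { WORK_PENDING = 1u << 0, WORK_RUNNING = 1u << 1 };

struct model_work {
	atomic_uint flags;
};

/* Submit: set PENDING; false means it was already pending. */
static bool model_submit(struct model_work *w)
{
	return (atomic_fetch_or(&w->flags, WORK_PENDING) & WORK_PENDING) == 0;
}

/* Queue thread picks the item up: PENDING -> RUNNING. */
static void model_run_begin(struct model_work *w)
{
	atomic_fetch_and(&w->flags, ~(unsigned int)WORK_PENDING);
	atomic_fetch_or(&w->flags, WORK_RUNNING);
}

static void model_run_end(struct model_work *w)
{
	atomic_fetch_and(&w->flags, ~(unsigned int)WORK_RUNNING);
}

/* Cancel: succeeds only if the item was still pending and the handler had
 * not started; a false return while RUNNING is set tells a thread-context
 * caller it must wait for the handler to finish. */
static bool model_cancel(struct model_work *w)
{
	unsigned int prev =
		atomic_fetch_and(&w->flags, ~(unsigned int)WORK_PENDING);

	return (prev & (WORK_PENDING | WORK_RUNNING)) == WORK_PENDING;
}
```

The point of the model is the return value of `model_cancel()`: unlike the single-bit scheme, "cancel failed because the handler is running" and "cancel failed because nothing was pending" are now distinguishable.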
@pabigot Peter - Just want to thank you for writing this up. Excellent detail for moving forward.
Not in the current implementation that avoids locks and relies only on atomic flags. All those states can be observed now depending on thread priorities and ISR activity and API use. If we switch to another implementation the set of observable states will be different.
The intent of that description wasn't clear. The same code invokes the same operation on the same work item for both W1 and W2. W1 refers to the state transitions from the first operation; W2 from the second. Keep in mind that the same work item can be:
all simultaneously. All the work queues mentioned could be different, too (in principle). So yes, the basic point is that the race conditions implicit in multiple existing uses of
That is not true, as the current implementation uses k_queue, which uses a spinlock. You need a spinlock at some point, as a list cannot be manipulated with atomics alone. If the lock were moved to the workq and a linked list used directly in the workq, it would not be a problem to track the state or cancel the work (unless/until it is taken over by the thread). Now, if a spinlock is used in the workq directly, there is one more thing that could be implemented: you could add a tag to the work item. This way both the callback (work) pointer and a tag could be copied to the workq thread atomically.
True; but what I meant was that there is no lock currently associated with maintaining the state of the work item; the lock only protects the queue itself. A lock in each workqueue wouldn't cover cases where the work item moves between queues (and I do have a use case for that). The tag/token idea has some potential, but could be tricky, since the work item can be active in as many as three states simultaneously: you could need to store three distinct tokens for the W1, W2, and W3 submissions. That might tempt somebody into thinking they could control which instance was to be cancelled, which starts to get ugly. But it's an idea worth considering.
include/kernel.h (outdated)

 *
 * @param work Address passed to the work handler
 *
 * @return Address of the containing @c k_delayable_work structure.
The use of k_delayable_work and k_work_delayable in different places is confusing.
Fixed this misspelling. I'm leaving the name used as an RST anchor in the documentation discussion of delayable work.
This commit introduces a new API to support delayed submission to a queue. It conforms more closely to the original documented behavior of k_delayed_work. The semantics and documentation depend on a distinction between scheduling work item submission, and performing work item submission. The naming follows: the API allows a delayable work item to be scheduled and, while scheduled, to be cancelled, until it is submitted, at which point it matches the base undelayable work queue behavior and the schedule API knows nothing about it. It also preserves the base k_work_submit semantics when an item that is already pending is rescheduled without a wait (i.e. the item does not get moved to the end of the queue). Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
There are points in the code where the db_hash is needed immediately, without waiting for a pending recalculation to complete. The indicator for recalculation depended on unstable behavior of k_delayed_work_cancel(), assuming that if it returned zero then there was a pending recalculation that had successfully been averted. The replacement delayable work API only allows avoiding submitting a work item if it appears no longer needed: once submitted it cannot be stopped, nor can its completion be detected. Add API to schedule a delayed update and to process an update immediately in a way that reliably avoids recalculation regardless of the status of a cancelled delayed calculation. This avoids issues with work state change from ISRs and on multi-core architectures, but the work and non-work hash access still require cooperative threads. Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
Use an explicit flag to indicate when beacon reconfiguration should be inhibited rather than assuming that canceling the work will be successful (this may not be true in more complex applications). Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
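The explicit-flag idiom in that commit can be sketched in plain C (hypothetical names; the real change lives in the mesh beacon code): the path that must suppress the next reconfiguration sets a flag, and the work handler itself consumes it, so correctness no longer depends on whether cancellation happened to win the race.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical inhibit flag: set by the code path that must suppress the
 * next reconfiguration, checked (and cleared) by the work handler itself.
 * Unlike testing the return value of a cancel call, this works even when
 * the handler is already queued or running and cannot be cancelled. */
static atomic_bool beacon_inhibit;

static void beacon_inhibit_set(void)
{
	atomic_store(&beacon_inhibit, true);
}

/* Called at the top of the work handler; true means proceed with the
 * reconfiguration, false means this invocation was suppressed. The
 * inhibit is consumed so only one invocation is suppressed. */
static bool beacon_handler_should_run(void)
{
	return !atomic_exchange(&beacon_inhibit, false);
}
```

The design choice here is that the decision is made at the last possible moment, inside the handler, which is the only place that always observes a consistent state.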
Force-pushed ba79062 to 5265c82
API meeting 20th October:
Superseded by whatever happens in #29618.
Current plan is to replace this PR with one based on the discussion at #28891 (comment)
DNM blocker: The commits that change existing use are here as examples and need to be removed before merge, so that the affected maintainers can consider how they want to solve the problems.
As a follow-on to #28579 this PR introduces a new API to support delayed submission of a work item to a work queue. It conforms more closely to the original documented behavior of k_delayed_work: it does not allow cancel or submit to affect the state of an item after the timeout expires and the item is made pending.
The semantics and documentation depend on a distinction between scheduling work item submission, and performing work item submission. The naming follows: k_work_schedule allows a delayable work item to be scheduled after a (possibly no-wait) delay, and k_work_cancel stops the timeout countdown; this works until the timeout completes and the item is submitted, after which the schedule API knows nothing about its state.
Except for the periods when a scheduled work item has an incomplete delay, the behavior of delayable work matches the behavior of undelayable work. This extends to the semantics when a delayable item that is already pending is scheduled with K_NO_WAIT (i.e. the item does not get moved to the end of the queue as it does with k_delayed_work()).
This replaces #28579 in resolving #27356 without breaking anything that depends on the existing behavior of the delayed work API.