
kernel: workq: introduce delayable work #28891

Closed
wants to merge 4 commits into from

Conversation

pabigot
Collaborator

@pabigot pabigot commented Oct 3, 2020

Current plan is to replace this PR with one based on the discussion at #28891 (comment)

DNM blocker: The commits that change existing use are here as examples and need to be removed before merge, so that the affected maintainers can consider how they want to solve the problems.

As a follow-on to #28579 this PR introduces a new API to support delayed submission of a work item to a work queue. It conforms more closely to the original documented behavior of k_delayed_work: it does not allow cancel or submit to affect the state of an item after the timeout expires and the item is made pending.

The semantics and documentation depend on a distinction between scheduling work item submission and performing work item submission. The naming follows: k_work_schedule allows a delayable work item to be scheduled for submission after a (possibly no-wait) delay, and k_work_cancel stops the timeout countdown. Both work only until the timeout completes and the item is submitted, after which the schedule API knows nothing about its state.

Except for the periods when a scheduled work item has an incomplete delay, the behavior of delayable work matches the behavior of undelayable work. This extends to the semantics when a delayable item that is already pending is scheduled with K_NO_WAIT (i.e. the item does not get moved to the end of the queue as it does with k_delayed_work()).

This replaces #28579 in resolving #27356 without breaking anything that depends on the existing behavior of the delayed work API.
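The scheduled-then-submitted life cycle can be illustrated with a minimal standalone model (plain C, purely illustrative: the schedule/cancel semantics follow the PR description above, while the state names and function bodies are invented here, not taken from the Zephyr implementation):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the PR's delayable-work life cycle: a work
 * item is first SCHEDULED (waiting out its delay), then SUBMITTED
 * (pending on the queue).  Cancel only affects the SCHEDULED phase;
 * once submitted, the schedule API knows nothing about the item. */
enum wstate { IDLE, SCHEDULED, SUBMITTED };

struct model_work {
	enum wstate state;
};

/* k_work_schedule() analog: start the delay countdown. */
static bool model_schedule(struct model_work *w)
{
	if (w->state != IDLE) {
		return false;	/* already scheduled or submitted */
	}
	w->state = SCHEDULED;
	return true;
}

/* Timeout expiry: the item is submitted to the queue. */
static void model_timeout_fires(struct model_work *w)
{
	if (w->state == SCHEDULED) {
		w->state = SUBMITTED;
	}
}

/* k_work_cancel() analog: succeeds only before the timeout fires. */
static bool model_cancel(struct model_work *w)
{
	if (w->state != SCHEDULED) {
		return false;	/* too late: already submitted (or idle) */
	}
	w->state = IDLE;
	return true;
}
```

The key property the model captures: cancellation is only meaningful while the item is in the scheduled phase; after submission it behaves exactly like a plain (undelayable) work item.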

@github-actions github-actions bot added area: API Changes to public APIs area: Documentation area: Kernel area: Tests Issues related to a particular existing or missing test labels Oct 3, 2020
Contributor

@andrewboie andrewboie left a comment


These are all drive-by thoughts; I haven't looked at this in detail yet. I'm sharing them because an overall refactor of Zephyr IPC is something that's been on my mind lately.

I'm wondering if the whole workqueue API set could be moved out of the core kernel into library code (lib/os).

For the case of non-delayed work, all we're doing is interacting with a thread and a k_queue, so there's no need for it to be in the core kernel at all.

For the case of delayed work, we additionally have timeouts to deal with, so it has to be in the core kernel as it's calling scheduler and timeout apis like z_abort_timeout....but maybe there's some lower-level timeout functionality we could expose as public kernel APIs to close this.

This isn't unique to workqueues. I've thought for a while that we could move quite a bit of our IPC into library code, expressed in terms of some low-level wait/wake APIs, much like how Linux userland implements a variety of IPC based on futexes. This is not to say we should make all our IPC futex-based (you can't synchronize with ISRs without spinlocks), but I still feel there is an opportunity for a layer of abstraction such that we could move a lot of IPC stuff out of the kernel itself (with the caveat that such abstractions mustn't mess with the kernel's overall footprint).

We also have a gap in that none of the delayed work APIs work with user mode, but I don't have any suggestions about that at this time; I'd have to think about it some more (it's been a few years).

@andrewboie
Contributor

BTW what I've written above need not be addressed in this PR...one of the really positive things about all the recent discussion is that we're homing in on a very specific unambiguous definition on how we want these IPC APIs to behave, so if much later the implementation gets refactored again that's just behind-the-scenes stuff.

@pabigot pabigot force-pushed the nordic/20201002a branch 2 times, most recently from 3910e96 to bb6650b Compare October 4, 2020 17:49
@pabigot pabigot force-pushed the nordic/20201002a branch 4 times, most recently from 4613116 to 6b3ba77 Compare October 9, 2020 20:16
@pabigot pabigot added the DNM This PR should not be merged (Do Not Merge) label Oct 9, 2020
@pabigot pabigot marked this pull request as ready for review October 9, 2020 21:16
@pabigot
Collaborator Author

pabigot commented Oct 9, 2020

I've brought this out of draft a bit early so we can get started on the discussion.

There are roughly 80 source files in drivers, samples, or subsys that use the k_delayed_work_ API and should be reviewed to see whether they violate the conditions required to make delayed work function properly. If they do, they should be fixed.

I fixed one example in Bluetooth so far. I found another problem in uart_nrfx_uart0, which was confirmed incidentally in #29092, and it's fixed too.

Many of the others are not obviously correct, so all maintainers of affected code should review them. At this point I intend to do additional fixes only for things I have some responsibility for, or if necessary to confirm a solution idiom that will be recommended in the (to be pushed) Best Practices documentation.

doc/reference/kernel/threads/workqueue.rst
@@ -76,7 +80,8 @@ static struct {
 volatile size_t rx_counter;
 volatile size_t rx_offset;
 int32_t rx_timeout;
-struct k_delayed_work rx_timeout_work;
+atomic_t pending_flags[1];
+struct k_work_delayable rx_timeout_work;
Member

I think there is no need to involve any k_work here, and k_delayed_work was simply misused in this case. Just k_timer should be sufficient, as it is in uart_nrfx_uarte.c. Same applies to TX side. @Mierunski Did I miss some reason why k_timer would not work here?

Collaborator Author

@pabigot pabigot Oct 13, 2020

If that's the way the Nordic serial folks want to go I'll remove this commit, but the test fails with the current code (#29173) so it needs to be fixed. (It'd be a lot simpler if the async API allowed the timeout to be from the first received character, rather than the last received character, as the constant rescheduling likely has a significant impact on ISR time.)

@pdunaj
Collaborator

pdunaj commented Oct 16, 2020

@eanderlind
I mentioned applications will have problems after this change #28579 (comment).
Still, the change is in line with the original design, so it seems that without rethinking it, cancel cannot stay.

@pabigot @eanderlind
I thought about adding an API call that would tell if work is pending, but this does not help much. The main reason to use delayed work is to handle timeouts. If you stop an activity that required a timeout, knowing that work is pending does not help you much. You have no chance of canceling this work anyway, so the app will have to handle the work callback regardless.

As I pointed out above, adding a bit to track the timeout submission within app code can lead to a race if the timeout is re-enabled. Then you don't know whether the timeout that triggered is the new one or the one that you could not cancel. For timeouts this means the app will have to check the elapsed time within the work callback, or implement counting of the works it started that were not successfully canceled.

I think the best way would be to add an official cancel API that could be used from a defined context.

@pabigot
Collaborator Author

pabigot commented Oct 16, 2020

@eanderlind There is no problem using a spinlock from within an ISR.

@pdunaj Unless the application follows the rules (including not invoking the delayed API from an ISR) this doesn't add new problems that applications don't already have. It just provides API that doesn't mislead applications into thinking their cancellations actually work. That mistaken expectation is why Nordic async UART is broken.

If an application really can't tolerate having a work handler invoked multiple times, it has to make sure it isn't submitted multiple times, or take care of checking in the handler function itself. Or just don't use a work item.
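One way to do the "checking in the handler function itself" mentioned above is an app-owned atomic flag that the submitter sets and the handler consumes; a hypothetical standalone sketch (the names and structure are invented here, not Zephyr API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical app-side guard: a "needed" flag owned by the app,
 * separate from the work item's own pending state. */
static atomic_bool work_needed;
static int handler_runs;	/* counts runs that did real work */

static void app_request_work(void)
{
	atomic_store(&work_needed, true);
	/* ... the actual work submission would go here ... */
}

static void app_cancel_work(void)
{
	/* Even if the underlying submission can't be unwound,
	 * clearing the flag neutralizes the next handler run. */
	atomic_store(&work_needed, false);
}

static void work_handler(void)
{
	/* Consume the flag; bail out if the request was withdrawn
	 * or already handled by an earlier run. */
	if (!atomic_exchange(&work_needed, false)) {
		return;
	}
	handler_runs++;
}
```

With this idiom a duplicate or too-late handler invocation degenerates to a cheap no-op, which is often sufficient when cancellation itself cannot be guaranteed.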

@pdunaj
Collaborator

pdunaj commented Oct 16, 2020

@pabigot all this is true, especially the last part.
It leads to the important conclusion that all places where work items are used must be carefully reviewed, taking into consideration using other mechanisms instead.

@pabigot
Collaborator Author

pabigot commented Oct 16, 2020

I'm keeping this in DNM until @andrewboie confirms it's acceptable. Then my plan is to remove the example commits that change existing uses to new solutions from this PR, merge the new API, then resubmit the change-existing-use commits as independent commits that can be considered by the respective maintainers. I'll bring the solution to #29173 back in its own PR at that time.

@@ -147,6 +151,51 @@ may or may not be invoked.
 must check return values and be prepared to react when either submission or
 cancellation fails.
 
+.. _k_delayable_work:
+
+Delayable Work
Member

Sorry for being so late to the game; perhaps this has already been discussed in the PR. k_delayable_work is almost identical to k_delayed_work; was there any consideration of naming the new one k_scheduled_work, as that would look similar to the new schedule API naming?

Collaborator Author

I think it started out that way, but as I came to understand the core concepts I changed it. Thinking of the life cycle as "scheduled" then "submitted" is fundamental to understanding what can and cannot be done: i.e. it can be cancelled only if it's scheduled, not if it's submitted.

The structure may or may not be scheduled (it can go directly to pending) so I think it's accurate to name it as a work item that supports being delayed.

Also k_work_schedule pairs with k_work_cancel nicely. Making it k_scheduled_work_submit() and k_scheduled_work_cancel() to follow the consistent-prefix rule is (a) longer which annoys some people, and (b) doesn't as clearly distinguish the two phases of scheduled and submitted (IMO) (and k_scheduled_work_schedule() seems redundant).

@eanderlind

@pabigot Sounds like there is no consensus to make any large change, even in a new API. Looking at Linux and mbed, both provide similar APIs that (from what I can see) don't introduce these complex race conditions. Linux, for example, chose to provide two cancel APIs:
bool cancel_delayed_work(struct delayed_work *dwork) /* async: caller checks bool to see whether anything was pended */
bool cancel_delayed_work_sync(struct delayed_work *dwork) /* synchronous blocking cancel */
There is also a dedicated API for modifying delayed work.
mbed's simpler EventQueue uses mutexes to provide the guarantees.
Before refactoring the implementation, it would be good to define an API that is actually easy to use from a driver or application developer's perspective. Since the implementation is non-trivial, it is even more important that the API definition be solid. I don't think the current two-stage API definition with race conditions is the right starting point. An interim step may be to provide an example of how a driver should use the current APIs to implement the case where race conditions and launching a second work item are avoided.

I should have said "undesirable to wait long on a spinlock in an ISR".

Contributor

@andrewboie andrewboie left a comment

If the spinlock in work_q.c is protecting all these timeout objects, we need to make sure we are holding that lock whenever they are accessed. That does not seem to be the case for k_work_delayable_remaining_ticks, k_work_delayable_expires_ticks, k_work_delayable_remaining_get

Or is the set of active system timeouts protected by the timeout_lock instead? @andyross can you comment on the scope of these locks?

@andrewboie
Contributor

Before refactoring the implementation, it would be good to define an API that is actually easy to use from a driver or application developer's perspective. Since the implementation is non-trivial, it is even more important that the API definition be solid. I don't think the current two-stage API definition with race conditions is the right starting point. An interim step may be to provide an example of how a driver should use the current APIs to implement the case where race conditions and launching a second work item are avoided.

We are not at all in a hurry to merge this. @eanderlind from your perspective do you have an alternate proposal on what this mechanism could look like?

@pabigot
Collaborator Author

pabigot commented Oct 16, 2020

If the spinlock in work_q.c is protecting all these timeout objects, we need to make sure we are holding that lock whenever they are accessed. That does not seem to be the case for k_work_delayable_remaining_ticks, k_work_delayable_expires_ticks, k_work_delayable_remaining_get

AFAICT those are all protected by the global timeout_lock for timeout-specific API calls. The workq lock protects uses of the timeout in the context of work queue.

I added those only to match the existing ones for k_delayed_work. Most of the existing in-tree uses for the old API are basically checking whether the timeout is zero, which is just as unreliable as k_work_pending, and are used only to avoid trying to cancel the timeout. There are a very few cases where the value is used to reconstruct when the item was scheduled.

@eanderlind

@andrewboie As alluded to in previous comments, something closer to the Linux API, which permits the caller to control what happens without building a new framework in the caller. We know that using separate functions for "peeking" and "doing" won't work, since state can change in between. It should be straightforward (and efficient) to implement the following cases:
a) an ISR either cancels or reschedules a pending timeout action (or creates a new one) for a workqueue handler
b) a workqueue handler reschedules itself in the future
c) handle situation where both variants used simultaneously and threads have different or same priorities
In my opinion, having the kernel manage this complexity, especially when the scheduler gets more sophisticated and may use more than one CPU, is better than pushing the problem onto app/driver developers. Taking a lock while running on the system event queue is risky, while the same functionality in a kernel API call is more straightforward to deal with.

@andrewboie
Contributor

@eanderlind thanks. @pabigot would you be willing to flesh this approach further? The PR in its current state is an improvement over what was present before, but I concur with others that we really haven't solved its fundamentally race-prone nature.

AFAICT those are all protected by the global timeout_lock for timeout-specific API calls. The workq lock protects uses of the timeout in the context of work queue.

You can't have one lock protecting a data structure in some contexts, and another lock protecting it in others. Uniprocessor systems get away with this because spinlocks just disable interrupts, but in SMP this can and will fall over in hard-to-debug ways. For any given shared data structure it really needs to be identified what lock protects it, and all access serialized with that lock.

@pabigot
Collaborator Author

pabigot commented Oct 17, 2020

AFAICT those are all protected by the global timeout_lock for timeout-specific API calls. The workq lock protects uses of the timeout in the context of work queue; the lock there is held

You can't have one lock protecting a data structure in some contexts, and another lock protecting it in others.

No, that's not what happens. All z_timeout API uses the lock in timeout.c. That would have been a better answer to your original question. The lock in workq specifically covers the list structure for work. As I look at it, it's probably held unnecessarily in several places in the original implementation of k_delayed_work, including when the z_timeout() API is used.

would you be willing to flesh this approach further?

I'll see what can be done.

@pabigot
Collaborator Author

pabigot commented Oct 17, 2020

OK, there are some things we could do.

First, let's clearly identify the states that a (delayable) work item passes through today:

  • S1 initialized and never submitted; goto S2 (schedule) or S3 (submit)
  • S2 delayed on timeout; goto S3 (timeout callback invoked) or S8 (cancel timeout before callback)
  • S3 marked pending but not on queue (by k_submit_to_queue)
  • S4 on queue (by k_submit_to_queue)
  • S5 removed from queue (by z_work_q_main)
  • S6 cleared pending (by z_work_q_main)
  • S7 invoking handler (by z_work_q_main)
  • S8 complete (by z_work_q_main); goto S2 (schedule) or S3 (submit)

All existing Zephyr work API is safe to invoke from ISRs. It never sleeps, and the time required to submit a work item is bounded.

Zephyr's current work queue architecture uses only one bit of the atomic structure and calls it PENDING. This allows distinguishing three groups of states that a work item may be in:

  • S1, S6, S7, and S8 where there is no delay and PENDING is cleared
  • S2 where the work item is delayed
  • S3, S4, S5 where PENDING is set.
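The coarseness of that single PENDING bit can be made concrete with a tiny standalone model (illustrative only; the state numbering follows the list above, and the function is invented here, not Zephyr code):

```c
#include <assert.h>

/* States S1..S8 from the discussion above. */
enum state { S1 = 1, S2, S3, S4, S5, S6, S7, S8 };

/* What an observer can learn in the current implementation:
 * one PENDING bit, plus whether a timeout is outstanding. */
enum observed { OBS_NOT_PENDING, OBS_DELAYED, OBS_PENDING };

static enum observed observe(enum state s)
{
	if (s == S2) {
		return OBS_DELAYED;	/* timeout outstanding */
	}
	if (s == S3 || s == S4 || s == S5) {
		return OBS_PENDING;	/* PENDING bit set */
	}
	return OBS_NOT_PENDING;		/* S1, S6, S7, S8 */
}
```

In particular, S7 (handler running) is indistinguishable from the idle states, which is exactly where the W1/W2 race described in this comment bites.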

Here's the problematic race condition in Zephyr today:

  • W1 schedules or submits the work item
  • W2 reschedules or cancels the work item and continues with an expectation that W1 has been blocked from having any further effect

For example, Zephyr code like the Nordic UART driver uses this from an ISR and requires that a specific observable event (notifying the application of a timeout) from W1 will not occur after W2. It does this by either implicitly or explicitly trying to unschedule/unsubmit W1.

W2 can occur when the W1 submission of the item is in state S3 through S8.

If W1 is in S7 then there is a race condition, because the work handler may or may not have reached the point where the undesired observable event has been generated.

  • k_work_schedule() and k_work_cancel() in this PR work in state S2, but won't stop W1 in states S3 through S7.
  • k_delayed_work_submit() and k_delayed_work_cancel() in Zephyr today work in states S2 and S4, but won't stop W1 in state S3 or S5 through S7.
  • Linux mod_delayed_work and cancel_delayed_work are ISR safe and handle the analogs of all states except S7 which is specifically excluded.

The case where W1 is in S7 cannot be handled by the workqueue API without introducing a function that waits until W1 has left S7.

I can rework the implementation so that we can distinguish all necessary states with these caveats:

  • Loop-free atomic twiddling won't work: we'll need a spinlock or IRQ lock+atomic (Linux uses both);
  • Most of it will be real function calls, not inline as it is today;
  • External synchronization will still be required for anything outside the work queue API to guarantee that the state it's told isn't stale.

This would enable a more flexible API:

  • k_work_schedule() supporting:
  • k_work_cancel() would succeed in states S2 through S6, but not S7.
  • New blocking API to ensure work item that was scheduled or submitted completes even if in S7 with two variants:
    • Like Linux cancel_work_sync which blocks further submissions while cancelling;
    • Like Linux flush_work that does not prevent further submissions while cancelling.

I don't think this is too terribly difficult to do, but it's a significant change from the existing implementation.

And none of this will prevent the W1-W2 race condition from ISR context where the wait-for-S7 capability can't be used. Users of the work API will still have to figure out how to solve that.
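The proposed blocking wait-until-completed operation (analogous to Linux cancel_work_sync) can be sketched in portable C with pthreads; this is a standalone model under assumed semantics, with invented names, not the eventual Zephyr implementation:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Minimal model: one work item that may currently be "running"
 * (state S7 above).  cancel_sync() clears any not-yet-run
 * submission, then blocks until a running handler finishes. */
struct sync_work {
	pthread_mutex_t lock;
	pthread_cond_t  done;
	bool pending;	/* submitted but handler not started */
	bool running;	/* handler currently executing (S7) */
};

static void sync_work_submit(struct sync_work *w)
{
	pthread_mutex_lock(&w->lock);
	w->pending = true;
	pthread_mutex_unlock(&w->lock);
}

/* Called by the work queue thread before invoking the handler. */
static bool sync_work_begin(struct sync_work *w)
{
	pthread_mutex_lock(&w->lock);
	bool run = w->pending;
	w->pending = false;
	w->running = run;
	pthread_mutex_unlock(&w->lock);
	return run;
}

/* Called by the work queue thread after the handler returns. */
static void sync_work_end(struct sync_work *w)
{
	pthread_mutex_lock(&w->lock);
	w->running = false;
	pthread_cond_broadcast(&w->done);
	pthread_mutex_unlock(&w->lock);
}

/* Thread context only: unsubmit, then wait out any running handler. */
static void sync_work_cancel_sync(struct sync_work *w)
{
	pthread_mutex_lock(&w->lock);
	w->pending = false;
	while (w->running) {
		pthread_cond_wait(&w->done, &w->lock);
	}
	pthread_mutex_unlock(&w->lock);
}
```

Because cancel_sync can sleep on the condition variable, it is inherently a thread-context operation, which is why the ISR-side W1/W2 race cannot be closed this way.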

@andrewboie how do you want me to proceed?

@eanderlind

@pabigot Peter - Just want to thank you for writing this up. Excellent detail for moving forward.
Two observations:
- Is there a possibility to merge states S3/S4 or S6 with either S5 or S7? (OK to defer the answer until it's decided how to move forward.)
- For the case where only W1 or W2 invokes a handler this would provide an improvement. If both W1 and W2 can invoke handlers, it becomes a general thread-synchronization problem, which the application has to deal with.

@pabigot
Collaborator Author

pabigot commented Oct 17, 2020

@pabigot Peter - Just want to thank you for writing this up. Excellent detail for moving forward.
Two observations:
- Is there a possibility to merge states S3/S4 or S6 with either S5 or S7? (OK to defer the answer until it's decided how to move forward.)

Not in the current implementation that avoids locks and relies only on atomic flags. All those states can be observed now depending on thread priorities and ISR activity and API use. If we switch to another implementation the set of observable states will be different.

- For the case where only W1 or W2 invokes a handler this would provide an improvement. If both W1 and W2 can invoke handlers, it becomes a general thread-synchronization problem, which the application has to deal with.

The intent of that description wasn't clear. The same code invokes the same operation on the same work item for both W1 and W2. W1 refers to the state transitions from the first operation; W2 from the second. Keep in mind that the same work item can be:

  1. Scheduled for submission to a work queue at a future time (W3);
  2. Sitting on a work queue waiting to be executed (W2); and
  3. Being executed by a work queue thread (W1)

all simultaneously. All the work queues mentioned could be different, too (in principle).

So yes, the basic point is that the race conditions implicit in multiple existing uses of k_delayed_work in Zephyr--present because the code doesn't follow the (newly documented) rules, which is how this whole thing got started--can't be eliminated in the general case. They can only be addressed by adding the blocking wait-until-completed API, which would have to be invoked from thread context, along with a careful review of each use to determine how to handle potential multiple simultaneous states for the specific work item.

@pdunaj
Collaborator

pdunaj commented Oct 17, 2020

@pabigot

in the current implementation that avoids locks and relies only on atomic flag

That is not true, as the current implementation uses k_queue, which uses a spinlock. You need a spinlock at some point, as a list cannot be manipulated only with atomics. If the lock were moved to the workq and a linked list used directly in the workq, it would not be a problem to track the state or cancel the work (unless/until it is taken over by the thread).

Now, if a spinlock is used in the workq directly, there is one more thing that could be implemented. You could add a tag to the work. This way both the callback (work) pointer and a tag could be copied to the workq thread atomically.
The application would set the tag when submitting the work. Before canceling, the app could change its copy of the tag. If the cancel is not successful (the work has been passed to the thread), the workq thread would use the old tag along with the work. That would require a change in the API, and the tag could not be part of the work structure, but this would help to catch the case when the user is too late to cancel.
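The tag idea might look roughly like this; a hypothetical standalone sketch with invented names (a real version would need the tag snapshot to happen atomically under the workq spinlock, and the cancel outcome would come from the queue, not a parameter):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tagged submission: the queue snapshots the tag at
 * submit time, so a too-late cancel can be detected by comparing
 * the app's current tag with the snapshot the handler received. */
struct tagged_work {
	uint32_t app_tag;	/* app's current generation counter */
	uint32_t queued_tag;	/* snapshot taken at submission */
	bool     queued;
};

static void tagged_submit(struct tagged_work *w)
{
	w->app_tag++;			/* new generation for this submission */
	w->queued_tag = w->app_tag;	/* copied together with the work */
	w->queued = true;
}

/* Cancel attempt; may fail if the queue thread already took the work
 * (modeled here by the cancel_succeeds parameter). */
static void tagged_cancel(struct tagged_work *w, bool cancel_succeeds)
{
	w->app_tag++;			/* invalidate the outstanding submission */
	if (cancel_succeeds) {
		w->queued = false;
	}
}

/* Handler side: a stale tag means the app tried to cancel too late. */
static bool tag_is_current(const struct tagged_work *w)
{
	return w->queued && w->queued_tag == w->app_tag;
}
```

The generation counter gives the handler a way to notice a failed cancel without any extra synchronization beyond the atomic copy at submission.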

@pabigot
Collaborator Author

pabigot commented Oct 17, 2020

in the current implementation that avoids locks and relies only on atomic flag

That is not true as current implementation uses k_queue which uses spinlock

True; but what I meant was there is no lock currently associated with maintaining the state of the work item. The lock in k_queue doesn't prevent observation of any transient states for the work item itself.

A lock in each workqueue wouldn't cover cases where the work item moves between queues (and I do have a use case for that).

The tag/token idea has some potential, but could be tricky since the work item can be active in as many as three states simultaneously. You could need to store three distinct tokens for the W1, W2, and W3 submissions. Which might tempt somebody into thinking they could control which instance was to be cancelled, which starts to get ugly. But it's an idea worth considering.

include/kernel.h
*
* @param work Address passed to the work handler
*
* @return Address of the containing @c k_delayable_work structure.
Member

the use of k_delayable_work and k_work_delayable in different places is confusing.

Collaborator Author

Fixed this misspelling. I'm leaving the name used as an RST anchor in the documentation discussion of delayable work.

This commit introduces a new API to support delayed submission to a
queue.  It conforms more closely to the original documented behavior
of k_delayed_work.  The semantics and documentation depend on a
distinction between scheduling work item submission, and performing
work item submission.  The naming follows: the API allows a delayable
work item to be scheduled, and if scheduled to be cancelled, until it
is submitted, at which point it matches the base undelayable work
queue behavior and the schedule API knows nothing about it.

It also preserves the base k_work_submit semantics when an item that
is already pending is rescheduled without a wait (i.e. the item does
not get moved to the end of the queue).

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
There are points in the code where the db_hash is needed immediately,
without waiting for a pending recalculation to complete.  The
indicator for recalculation depended on unstable behavior of
k_delayed_work_cancel(), assuming that if it returned zero then there
was a pending recalculation that had successfully been averted.

The replacement delayable work API only allows avoiding submitting a
work item if it appears no longer needed: once submitted it cannot be
stopped, nor can its completion be detected.

Add API to schedule a delayed update and to process an update
immediately in a way that reliably avoids recalculation regardless of
the status of a cancelled delayed calculation.  This avoids issues
with work state change from ISRs and on multi-core architectures, but
the work and non-work hash access still require cooperative threads.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
Use an explicit flag to indicate when beacon reconfiguration should be
inhibited rather than assuming that canceling the work will be
successful (this may not be true in more complex applications).

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
@carlescufi
Member

API meeting 20th October:

  • General impression of the rework is positive
  • Needs the improvements that are described in this comment to consider replacing the current users of the existing API to the new one.
  • Work to add the 2 features listed in the comment above needs to complete; then users can be ported one by one
  • Finally, deprecation of the existing API can happen.

@pabigot
Collaborator Author

pabigot commented Nov 23, 2020

Superseded by whatever happens in #29618.

@pabigot pabigot closed this Nov 23, 2020