
[discussion] retrying a spawn / schedule is almost always wrong #209

Closed
japaric opened this issue Jun 24, 2019 · 7 comments

@japaric
Collaborator

japaric commented Jun 24, 2019

Blocking and non-blocking

Abstractions that provide both non-blocking and blocking versions of an
operation usually name their APIs try_{action} (non-blocking) and {action}
(blocking). One example of this convention is the crossbeam-channel crate: its
Sender API provides a non-blocking try_send method and a blocking send
method. The blocking method works by (smartly) retrying the non-blocking one.
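
For reference, a minimal sketch of that convention using crossbeam-channel (the
capacity and payload type here are arbitrary):

use crossbeam_channel::bounded;

fn main() {
    // fixed-capacity channel, similar in spirit to a task's message queue
    let (tx, rx) = bounded::<i32>(1);

    tx.try_send(0).unwrap();          // non-blocking: Ok, there was room
    assert!(tx.try_send(1).is_err()); // non-blocking: Err, the queue is full

    rx.recv().unwrap();               // make room again
    tx.send(1).unwrap();              // blocking: waits until there is room
}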

spawn and schedule

Due to the fixed-capacity nature of our message queues, the spawn and
schedule APIs are non-blocking and fallible -- the operation fails when the
queue is full. Furthermore, retrying either non-blocking operation to make it
blocking is wrong in most cases. Consider this scenario:

#[task(priority = 2, spawn = [b])]
fn a(c: a::Context) {
    // send message; retry if the queue is full
    while c.spawn.b(0).is_err() {}
}

#[task(priority = 1)]
fn b(c: b::Context, x: i32) {
    // ..
}

Here a higher priority task sends a message to a lower priority one. Because the
operation may not succeed, the user retries it until it does. However, due to
how task scheduling works, this retry loop may block forever: b cannot run
while a is executing, so if b's queue is full when a starts it will remain
full for the whole execution of task a. The same situation arises if a and
b have the same priority.

If in the previous example we reverse the priorities then the retry operation is
not needed at all, because a can never fill b's queue -- b preempts a as
soon as a spawns it.

#[task(priority = 1, spawn = [b])]
fn a(c: a::Context) {
    // b is dispatched immediately so these never fill b's queue
    c.spawn.b(0).unwrap();
    c.spawn.b(1).unwrap();
    c.spawn.b(2).unwrap();
}

#[task(priority = 2)]
fn b(c: b::Context, x: i32) {
    // ..
}

However, if locks are involved then even with these priorities a can
saturate b's queue.

#[task(priority = 1, resources = [X], spawn = [b])]
fn a(c: a::Context) {
    let x = c.resources.X;

    x.lock(|_| {
        // priority is now 2 (or higher)

        c.spawn.b(0).unwrap();
        c.spawn.b(1).unwrap(); // this will always panic
    })
}

#[task(priority = 2)]
fn b(c: b::Context, x: i32) {
    // ..
}

The only scenario where it is sensible to retry a spawn call is when the
target task runs on another core. The reason for this is that the other core can
drain the message queue regardless of the task priorities involved -- the task
priorities used on one core have no effect on the task scheduling performed on
other cores.

// highest priority possible on this core
#[task(core = 0, priority = 7, spawn = [b])]
fn a(c: a::Context) {
    // ..

    // spawn; retry if the queue is full
    while c.spawn.b(0).is_err() {} // this will *not* block forever

    // ..
}

// lowest priority possible on this core
#[task(core = 1, priority = 1)]
fn b(_: b::Context, x: i32) {
    // ..
}

Timer queues are core-local -- the same core that fills a timer queue also
drains it -- so schedule calls, whether core-local or cross-core, can never be
successfully retried.
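
As a rough sketch of what handling a full timer queue without retrying could
look like (the task name, the PERIOD constant and the scheduled / cycles
helpers are illustrative and assume the timer-queue flavour of the schedule
API):

#[task(schedule = [tick])]
fn tick(c: tick::Context) {
    // ..

    // the same core drains this timer queue, so spinning on the call below
    // could never make progress; either unwrap (treat a full queue as a bug)
    // or accept that this period is skipped
    if c.schedule.tick(c.scheduled + PERIOD.cycles()).is_err() {
        // timer queue full: this tick is dropped; count the overrun here
    }
}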

Improving the situation

We want to prevent users from running into these retry deadlocks, but the
current API does nothing to keep them away from this pitfall.

Some ideas to counter the problem:

Keep the API as it is but document the pitfall

What the subtitle says.

Change the API to make retrying impossible

AFAIK there are two reasonable operations one can perform on the Result
returned by spawn / schedule:

  • you drop the value / ignore the result; this indicates that you either don't
    care whether the operation fails or expect it to never fail, so you never
    bother checking the output.

  • you unwrap the result to catch (during development) programmer errors where
    too small a capacity was selected.

If these are indeed the only two reasonable operations, then we can change and
extend the API to provide exactly these two options:

  • spawn.$task (schedule.$task) would have the unwrap semantics: send the
    message or panic if the queue is full

  • try_spawn.$task (try_schedule.$task) would have the drop semantics: send
    the message or drop the payload if the queue is full

Both spawn.$task (schedule.$task) and try_spawn.$task (try_schedule.$task)
return the unit type: ().

spawn.$task is shorter to type and that's the API one should start out with.
try_spawn.$task should be used only where one is sure the operation cannot
fail, or where dropping / missing a message does not matter.
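
A minimal stand-in for the two proposed semantics, using a bounded
crossbeam-channel queue in place of the generated message queue (these free
functions are illustrative only, not actual macro output):

use crossbeam_channel::Sender;

// `spawn.b`-style: unwrap semantics -- panic if b's queue is full
fn spawn_b(queue: &Sender<i32>, x: i32) {
    queue.try_send(x).expect("b's queue is full");
}

// `try_spawn.b`-style: drop semantics -- discard the payload if b's queue is full
fn try_spawn_b(queue: &Sender<i32>, x: i32) {
    let _ = queue.try_send(x); // the error (and the payload it carries) is ignored
}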

With these new APIs it would not be possible to do a blocking cross-core
spawn call, but that may be a good thing because blocking behavior
doesn't feel like it really belongs within the RTFM API.

The names spawn and try_spawn appear to match the naming convention used for
blocking and non-blocking variants, but neither variant actually blocks, so we
may want to pick different names if we go down this route.

Some other option?

Is there some other approach we could take here?

@japaric japaric added the RFC This issue needs you input! label Jun 24, 2019
@japaric japaric added this to the v0.5.0 milestone Jun 24, 2019
@korken89
Collaborator

This is a nasty one.
I'd start by documenting the issue, but I have no better idea for clearing it up
than what has already been discussed here.

The thing is that in event-driven systems you should (probably) never retry like
this, but instead schedule an event to try again later -- not sure how to
communicate that, though.
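
For illustration, one way that could look (the task names, RETRY_DELAY and the
scheduled / cycles helpers are made up and depend on the timer-queue API):

#[task(priority = 2, spawn = [b], schedule = [a])]
fn a(c: a::Context) {
    if c.spawn.b(0).is_err() {
        // b's queue is full: instead of spinning here (which would block b
        // forever), come back later and give lower priority tasks a chance to run
        let _ = c.schedule.a(c.scheduled + RETRY_DELAY.cycles());
    }
}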

@perlindgren
Collaborator

Hi Folks.

In the original RTFM-C implementation, sending a message was done using the keyword "async", with an optional "after" setting the baseline offset and "before" setting a new deadline. The return value was a message pointer, allowing one to abort outstanding (but not yet scheduled) messages. One could think of doing the same in the Rust re-implementation, possibly with an Option, with None indicating that the buffer was full.

In the original RTFM-C implementation, each task definition came with a worst-case (shortest) inter-arrival time. Using this information, the maximum number of outstanding messages can be bounded (and hence a sufficient capacity computed). That's why we did not have an Option return type (since we computed the sufficient capacity).

We could go both ways in the Rust re-implementation: either the "hard" way, towards giving guarantees (with more info/analysis), or the soft/sloppy way. The latter approach is what you typically see in systems like FreeRTOS, which use best-effort scheduling.

We could also choose to fork, having a sloppy version and a hard one.

What do you think out there?

By the way: sloppy does not mean it is bad, it is just best effort.

Best regards
Per

@japaric
Collaborator Author

japaric commented Jul 4, 2019

@korken89 yeah, we should definitely document this and backport the docs into the v0.4.x branch.

I don't think we need to decide on this for the v0.5.0 release. We can add a new API and deprecate the existing one in a new patch release (v0.5.x). I just tested that #[deprecated] works when included in code generated from a macro. The user would get warnings like this:

warning: use of deprecated item 'APP::<impl init::Spawn>::a': use `try_spawn.a` instead

@perlindgren

The return value was a message pointer, allowing one to abort outstanding (but not yet scheduled) messages

This sounds like a new feature.

Using this information, the maximum number of outstanding messages can be bounded (and hence a sufficient capacity computed). That's why we did not have an Option return type (since we computed the sufficient capacity).

Isn't this equivalent to the spawn-and-ignore-the-result API? If you rely on user input then the user could enter an arbitrarily large inter-arrival time that produces a capacity of 1 and then proceed to spawn N tasks, causing (N-1) messages to be silently lost. Or they could enter an incorrect value (e.g. a forgotten or extra zero) that causes the capacity to be miscomputed, also leading to missed messages.

@korken89
Collaborator

korken89 commented Jul 5, 2019

SGTM!

@japaric
Collaborator Author

japaric commented Jul 8, 2019

OK, then I'm going to assign the discussion about a potential API change to the v0.5.x milestone (things to discuss after the v0.5.0 release).

@japaric japaric modified the milestones: v0.5.0, v0.5.x Jul 8, 2019
bors bot added a commit that referenced this issue Sep 15, 2019
205: rtfm-syntax refactor + heterogeneous multi-core support r=japaric a=japaric

this PR implements RFCs #178, #198, #199, #200, #201, #203 (only the refactor
part), #204, #207, #211 and #212.

most cfail tests have been removed because the test suite of `rtfm-syntax`
already tests what was being tested here. The `rtfm-syntax` crate also has tests
for the analysis pass which we didn't have here -- that test suite contains a
regression test for #183.

the remaining cfail tests have been upgraded into UI tests so we can more
thoroughly check / test the error messages presented to the end user.

the cpass tests have been converted into plain examples.

EDIT: I forgot, there are some examples of the multi-core API for the LPC541xx in [this repository](https://github.com/japaric/lpcxpresso54114)

people who would like to try out this API but have no hardware can try out the
x86_64 [Linux port], which also has multi-core support.

[Linux port]: https://github.com/japaric/linux-rtfm

closes #178 #198 #199 #200 #201 #203 #204 #207 #211 #212 
closes #163 
cc #209 (documents how to deal with errors)

Co-authored-by: Jorge Aparicio <jorge@japaric.io>
@korken89
Collaborator

I think it is time to ping this issue again when we restart our meetings after New Year.

andrewgazelka pushed a commit to andrewgazelka/cortex-m-rtic that referenced this issue Nov 3, 2021
209: Bump cortex-m-rt-macro version to 0.1.6 for release r=korken89 a=therealprof

Closes #206

Signed-off-by: Daniel Egger <daniel@eggers-club.de>

Co-authored-by: Daniel Egger <daniel@eggers-club.de>
@perlindgren
Collaborator

We are closing this issue for now; it is not clear whether it is still relevant given the single-core focus of RTIC 0.6. Feel free to re-open the issue if you are interested in elaborating further on this topic.
