
[discussion] retrying a spawn / schedule is almost always wrong #209

Closed
japaric opened this issue Jun 24, 2019 · 7 comments

@japaric
Collaborator

japaric commented Jun 24, 2019

Blocking and non-blocking

Abstractions that provide both non-blocking and blocking versions of an
operation usually name their APIs try_{action} (non-blocking) and {action}
(blocking). One example of this convention is the crossbeam-channel crate: its
Sender API provides a non-blocking try_send method and a blocking send
method. The blocking method works by (smartly) retrying the non-blocking one.
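
For reference, a minimal sketch of that convention using crossbeam-channel (the
capacity and payload type here are arbitrary):

use crossbeam_channel::bounded;

fn main() {
    // fixed-capacity channel, similar in spirit to a task's message queue
    let (tx, rx) = bounded::<i32>(1);

    tx.try_send(0).unwrap();          // non-blocking: Ok, there was room
    assert!(tx.try_send(1).is_err()); // non-blocking: Err, the queue is full

    rx.recv().unwrap();               // make room again
    tx.send(1).unwrap();              // blocking: waits until there is room
}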

spawn and schedule

Due to the fixed-capacity nature of our message queues, the spawn and
schedule APIs are non-blocking and fallible -- the operation fails when the
queue is full. Furthermore, retrying either non-blocking operation to make it
blocking is wrong in most cases. Consider this scenario:

#[task(priority = 2, spawn = [b])]
fn a(c: a::Context) {
    // send message; retry if the queue is full
    while c.spawn.b(0).is_err() {}
}

#[task(priority = 1)]
fn b(c: b::Context, x: i32) {
    // ..
}

Here a higher priority task sends a message to a lower priority one. Because the
operation may not succeed, the user retries it until it does. However, due to
how task scheduling works, this retry loop may block forever: b cannot run
while a is executing, so if b's queue is full when a starts it will remain
full for the whole execution of task a. The same situation arises if a and
b have the same priority.

If in the previous example we reverse the priorities then the retry operation is
not needed at all, because a can never fill b's queue -- b preempts a as
soon as a spawns it.

#[task(priority = 1, spawn = [b])]
fn a(c: a::Context) {
    // b is dispatched immediately so these never fill b's queue
    c.spawn.b(0).unwrap();
    c.spawn.b(1).unwrap();
    c.spawn.b(2).unwrap();
}

#[task(priority = 2)]
fn b(c: b::Context, x: i32) {
    // ..
}

However, if locks are involved then even with these priorities a can
saturate b's queue.

#[task(priority = 1, resources = [X], spawn = [b])]
fn a(c: a::Context) {
    let x = c.resources.X;

    x.lock(|_| {
        // priority is now 2 (or higher)

        c.spawn.b(0).unwrap();
        c.spawn.b(1).unwrap(); // this will always panic
    })
}

#[task(priority = 2)]
fn b(c: b::Context, x: i32) {
    // ..
}

The only scenario where it is sensible to retry a spawn call is when the
target task runs on another core. The reason for this is that the other core can
drain the message queue regardless of the task priorities involved -- the task
priorities used on one core have no effect on the task scheduling performed on
other cores.

// highest priority possible on this core
#[task(core = 0, priority = 7, spawn = [b])]
fn a(c: a::Context) {
    // ..

    // spawn; retry if the queue is full
    while c.spawn.b(0).is_err() {} // this will *not* block forever

    // ..
}

// lowest priority possible on this core
#[task(core = 1, priority = 1)]
fn b(_: b::Context, x: i32) {
    // ..
}

Timer queues are core-local -- the same core that fills a timer queue also
drains it -- so schedule calls, whether core-local or cross-core, can never be
successfully retried.
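
As a rough sketch of what handling a full timer queue without retrying could
look like (the task name, the PERIOD constant and the scheduled / cycles
helpers are illustrative and assume the timer-queue flavour of the schedule
API):

#[task(schedule = [tick])]
fn tick(c: tick::Context) {
    // ..

    // the same core drains this timer queue, so spinning on the call below
    // could never make progress; either unwrap (treat a full queue as a bug)
    // or accept that this period is skipped
    if c.schedule.tick(c.scheduled + PERIOD.cycles()).is_err() {
        // timer queue full: this tick is dropped; count the overrun here
    }
}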

Improving the situation

We want to prevent users from running into these retry deadlocks, but the
current API does nothing to keep them away from this pitfall.

Some ideas to counter the problem:

Keep the API as it is but document the pitfall

What the subtitle says.

Change the API to make retrying impossible

AFAIK there are two reasonable operations one can perform on the Result
returned by spawn / schedule:

  • you drop the value / ignore the result; this indicates that you either don't
    care whether the operation fails or expect it to never fail, so you never
    bother checking the output.

  • you unwrap the result to catch (during development) programmer errors where
    too small a capacity was selected.

If these are indeed the only two reasonable operations, then we can change and
extend the API to provide exactly these two options:

  • spawn.$task (schedule.$task) would have the unwrap semantics: send the
    message or panic if the queue is full

  • try_spawn.$task (try_schedule.$task) would have the drop semantics: send
    the message or drop the payload if the queue is full

Both spawn.$task (schedule.$task) and try_spawn.$task (try_schedule.$task)
return the unit type: ().

spawn.$task is shorter to type and that's the API one should start out with.
try_spawn.$task should be used only where one is sure the operation cannot
fail, or where dropping / missing a message does not matter.
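
A minimal stand-in for the two proposed semantics, using a bounded
crossbeam-channel queue in place of the generated message queue (these free
functions are illustrative only, not actual macro output):

use crossbeam_channel::Sender;

// `spawn.b`-style: unwrap semantics -- panic if b's queue is full
fn spawn_b(queue: &Sender<i32>, x: i32) {
    queue.try_send(x).expect("b's queue is full");
}

// `try_spawn.b`-style: drop semantics -- discard the payload if b's queue is full
fn try_spawn_b(queue: &Sender<i32>, x: i32) {
    let _ = queue.try_send(x); // the error (and the payload it carries) is ignored
}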

With these new APIs it would not be possible to do a blocking cross-core
spawn call, but that may be a good thing because blocking behavior
doesn't feel like it really belongs within the RTFM API.

The names spawn and try_spawn appear to match the naming convention used for
blocking and non-blocking variants, but neither variant actually blocks, so we
may want to pick different names if we go down this route.

Some other option?

Is there some other approach we could take here?

@japaric japaric added the RFC This issue needs you input! label Jun 24, 2019
@japaric japaric added this to the v0.5.0 milestone Jun 24, 2019
@korken89
Collaborator

This is a nasty one.
I'd start by documenting the issue, but I have no better idea for clearing it up
than what has already been discussed here.

The thing is that in event-driven systems you should (probably) never retry like
this, but instead schedule an event to try again later -- not sure how to
communicate that, though.
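
For illustration, one way that could look (the task names, RETRY_DELAY and the
scheduled / cycles helpers are made up and depend on the timer-queue API):

#[task(priority = 2, spawn = [b], schedule = [a])]
fn a(c: a::Context) {
    if c.spawn.b(0).is_err() {
        // b's queue is full: instead of spinning here (which would block b
        // forever), come back later and give lower priority tasks a chance to run
        let _ = c.schedule.a(c.scheduled + RETRY_DELAY.cycles());
    }
}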

@perlindgren
Collaborator

Hi Folks.

In the original RTFM-C implementation, sending a message was done using the keyword "async", with an optional "after" setting the baseline offset and "before" setting a new deadline. The return value was a message pointer, allowing one to abort outstanding (but not yet scheduled) messages. One could think of doing the same in the Rust re-implementation, possibly with an Option, with None indicating that the buffer was full.

In the original RTFM-C implementation, each task definition came with a worst-case (shortest) inter-arrival time. Using this information, the maximum number of outstanding messages can be bounded (and hence a sufficient capacity computed). That's why we did not have an Option return type (since we computed the sufficient capacity).

We could go both ways in the Rust re-implementation: either the "hard" way, towards giving guarantees (with more info/analysis), or the soft/sloppy way. The latter approach is what you typically see in systems like FreeRTOS, which use best-effort scheduling.

We could also choose to fork, having a sloppy version and a hard one.

What do you think out there?

By the way: sloppy does not mean it is bad, it is just best effort.

Best regards
Per

@japaric
Collaborator Author

japaric commented Jul 4, 2019

@korken89 yeah, we should definitely document this and backport the docs into the v0.4.x branch.

I don't think we need to decide on this for the v0.5.0 release. We can add a new API and deprecate the existing one in a new patch release (v0.5.x). I just tested that #[deprecated] works when included in code generated from a macro. The user would get warnings like this:

warning: use of deprecated item 'APP::<impl init::Spawn>::a': use `try_spawn.a` instead

@perlindgren

The return value was a message pointer, allowing one to abort outstanding (but not yet scheduled) messages

This sounds like a new feature.

Using this information, the maximum number of outstanding messages can be bounded (and hence a sufficient capacity computed). That's why we did not have an Option return type (since we computed the sufficient capacity).

Isn't this equivalent to the spawn-and-ignore-the-result API? If you rely on user input then the user could enter an arbitrarily large inter-arrival time that produces a capacity of 1 and then proceed to spawn N tasks, causing (N-1) messages to be silently lost. Or they could enter an incorrect value (e.g. a forgotten or extra zero) that causes the capacity to be miscomputed, also leading to missed messages.

@korken89
Collaborator

korken89 commented Jul 5, 2019

SGTM!

@japaric
Collaborator Author

japaric commented Jul 8, 2019

OK, then I'm going to assign the discussion about a potential API change to the v0.5.x milestone (things to discuss after the v0.5.0 release).

@japaric japaric modified the milestones: v0.5.0, v0.5.x Jul 8, 2019
bors bot added a commit that referenced this issue Sep 15, 2019
205: rtfm-syntax refactor + heterogeneous multi-core support r=japaric a=japaric

this PR implements RFCs #178, #198, #199, #200, #201, #203 (only the refactor
part), #204, #207, #211 and #212.

most cfail tests have been removed because the test suite of `rtfm-syntax`
already tests what was being tested here. The `rtfm-syntax` crate also has tests
for the analysis pass which we didn't have here -- that test suite contains a
regression test for #183.

the remaining cfail tests have been upgraded into UI tests so we can more
thoroughly check / test the error messages presented to the end user.

the cpass tests have been converted into plain examples.

EDIT: I forgot, there are some examples of the multi-core API for the LPC541xx in [this repository](https://github.com/japaric/lpcxpresso54114)

people who would like to try out this API but have no hardware can try out the
x86_64 [Linux port], which also has multi-core support.

[Linux port]: https://github.com/japaric/linux-rtfm

closes #178 #198 #199 #200 #201 #203 #204 #207 #211 #212 
closes #163 
cc #209 (documents how to deal with errors)

Co-authored-by: Jorge Aparicio <jorge@japaric.io>
@korken89
Collaborator

I think it is time to ping this issue again when we restart our meetings after New Year.

andrewgazelka pushed a commit to andrewgazelka/cortex-m-rtic that referenced this issue Nov 3, 2021
209: Bump cortex-m-rt-macro version to 0.1.6 for release r=korken89 a=therealprof

Closes #206

Signed-off-by: Daniel Egger <daniel@eggers-club.de>

Co-authored-by: Daniel Egger <daniel@eggers-club.de>
@perlindgren
Collaborator

We are closing this issue for now; it is not clear whether it is still relevant given the single-core focus of RTIC 0.6. Feel free to re-open the issue if you are interested in elaborating further on this topic.
