Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API k_delayed_work_submit_to_queue() make a delayed_work unusable #5952

Closed
noodlefighter opened this issue Feb 2, 2018 · 2 comments · Fixed by #6189
Closed

API k_delayed_work_submit_to_queue() make a delayed_work unusable #5952

noodlefighter opened this issue Feb 2, 2018 · 2 comments · Fixed by #6189
Assignees
Labels
area: Kernel bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug
Milestone

Comments

@noodlefighter
Copy link

Calling k_delayed_work_submit_to_queue() repeatly would broke the delayed_work object.

test case:

K_THREAD_STACK_DEFINE(stack_area, 1024);
struct k_work_q workq;
volatile struct k_delayed_work dwork;

void dwork_handler (struct k_work *work) {
    printk("work done!!!!!!!!!!!!!!!!!!!!!\n");
}

void submit (void) {
    int ret = k_delayed_work_submit_to_queue(&workq, &dwork, 1);
    if (ret < 0) {
        printk("submit fail, ret=%d, flag=0x%02x\n", ret, dwork.work.flags[0]);
    } else {
        printk("submit ok, flag=0x%02x\n", dwork.work.flags[0]);
    }
}

int main (void) {
    k_work_q_start(&workq, stack_area, K_THREAD_STACK_SIZEOF(stack_area), 0);

    k_delayed_work_init(&dwork, dwork_handler);

    printk("\ndestroy it.............\n");
    for (int i = 0; i < 100; i++) {
        submit();
        while (!k_work_pending(&dwork.work));
        submit();
        k_sleep(10);
    }

    k_sleep(1000);
    printk("\nbroken..............\n");
    submit();
    k_sleep(1000);
    submit();
    k_sleep(1000);
    submit();
    k_sleep(100);
}

run:

[QEMU] CPU: cortex-m3
***** BOOTING ZEPHYR OS v1.10.0- - BUILD: Feb  2 2018 05:50:13 *****

destroy it.............
submit ok, flag=0x00
submit ok, flag=0x01
submit fail, ret=-22, flag=0x01
submit fail, ret=-22, flag=0x01
...
submit fail, ret=-22, flag=0x01
submit fail, ret=-22, flag=0x01

broken..............
submit fail, ret=-22, flag=0x01
submit fail, ret=-22, flag=0x01
submit fail, ret=-22, flag=0x01

-22 is "-EINVAL", let's check the docs:

-EINVAL Work item is being processed or has completed its work.

Is it a bug or just some wrong with my operation?

@nashif nashif added the bug The issue is a bug, or the PR is fixing a bug label Feb 6, 2018
@nashif nashif added priority: medium Medium impact/importance bug area: Kernel labels Feb 6, 2018
@nashif nashif added this to the v1.11.0 milestone Feb 6, 2018
@nashif nashif added priority: high High impact/importance bug and removed priority: medium Medium impact/importance bug labels Feb 8, 2018
@andyross
Copy link
Contributor

The test case seems a little weird. It looks like this line has a boolean inversion:

    while (!k_work_pending(&dwork.work));

My guess is that this was intended to busy-wait until the handler executes (itself a fragile idea -- that will only work if the work_q has a higher priority than the main thread). But instead it's waiting until the work item is pending. And since it was just submitted with a non-zero timeout, it's guaranteed to be pending here and this while() will only run the test once.

So what happens is that your code falls straight through to a second submission of the same item, which succeeds as a noop per docs. But then, indeed, there's a bug. The handler never actually executes after that second submission, and in fact the third submission then fails.

andyross pushed a commit to andyross/zephyr that referenced this issue Feb 13, 2018
As discovered in zephyrproject-rtos#5952

...a duplicate call to k_delayed_work_submit_to_queue() on a work item
whose timeout had expired but which had not yet executed (i.e. it was
pending in the queue for the active work queue thread) would fail,
because the cancellation step wouldn't clear the PENDING bit, causing
the resubmission to see the object in an invalid state.  Trivially
fixed by adding a bit clear.

It also turns out that the behavior of the code doesn't match the
docs, which state that a PENDING work item is not supposed to be
cancelled at all.  Fix the docs to remove that.

And on yet further review, it turns out that there's no way to make a
test like the one in the linked bug threadsafe.  The work queue does
no synchronization by design, so if the user code does no external
synchronization it might very well clobber the running handler.  Added
a sentence to the docs to reflect this gotcha.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
@andyross
Copy link
Contributor

Never mind, the inversion was mine, not yours: "pending" in this context means "the timeout is finished and we're in the main work queue" not "pending on timeout". I see what you're doing now, and indeed it failed to try to resubmit a pending object, despite the fact that the code was trying to do exactly that.

There's a fix for this specific case up at #6189

Note that careful review shows that this particular usage can't ever be safe, though. The work queue itself does no synchronization, so it's not legal to call this function to "restart the timeout" without external synchronization in the general case. Cleaned up the docs to reflect that. But it seems to make your particular case work. Be very careful.

nashif pushed a commit that referenced this issue Feb 13, 2018
As discovered in #5952

...a duplicate call to k_delayed_work_submit_to_queue() on a work item
whose timeout had expired but which had not yet executed (i.e. it was
pending in the queue for the active work queue thread) would fail,
because the cancellation step wouldn't clear the PENDING bit, causing
the resubmission to see the object in an invalid state.  Trivially
fixed by adding a bit clear.

It also turns out that the behavior of the code doesn't match the
docs, which state that a PENDING work item is not supposed to be
cancelled at all.  Fix the docs to remove that.

And on yet further review, it turns out that there's no way to make a
test like the one in the linked bug threadsafe.  The work queue does
no synchronization by design, so if the user code does no external
synchronization it might very well clobber the running handler.  Added
a sentence to the docs to reflect this gotcha.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Kernel bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants