API k_delayed_work_submit_to_queue() make a delayed_work unusable #5952

noodlefighter · 2018-02-02T06:28:54Z

Calling k_delayed_work_submit_to_queue() repeatly would broke the delayed_work object.

test case:

K_THREAD_STACK_DEFINE(stack_area, 1024);
struct k_work_q workq;
volatile struct k_delayed_work dwork;

void dwork_handler (struct k_work *work) {
    printk("work done!!!!!!!!!!!!!!!!!!!!!\n");
}

void submit (void) {
    int ret = k_delayed_work_submit_to_queue(&workq, &dwork, 1);
    if (ret < 0) {
        printk("submit fail, ret=%d, flag=0x%02x\n", ret, dwork.work.flags[0]);
    } else {
        printk("submit ok, flag=0x%02x\n", dwork.work.flags[0]);
    }
}

int main (void) {
    k_work_q_start(&workq, stack_area, K_THREAD_STACK_SIZEOF(stack_area), 0);

    k_delayed_work_init(&dwork, dwork_handler);

    printk("\ndestroy it.............\n");
    for (int i = 0; i < 100; i++) {
        submit();
        while (!k_work_pending(&dwork.work));
        submit();
        k_sleep(10);
    }

    k_sleep(1000);
    printk("\nbroken..............\n");
    submit();
    k_sleep(1000);
    submit();
    k_sleep(1000);
    submit();
    k_sleep(100);
}

run:

[QEMU] CPU: cortex-m3
***** BOOTING ZEPHYR OS v1.10.0- - BUILD: Feb  2 2018 05:50:13 *****

destroy it.............
submit ok, flag=0x00
submit ok, flag=0x01
submit fail, ret=-22, flag=0x01
submit fail, ret=-22, flag=0x01
...
submit fail, ret=-22, flag=0x01
submit fail, ret=-22, flag=0x01

broken..............
submit fail, ret=-22, flag=0x01
submit fail, ret=-22, flag=0x01
submit fail, ret=-22, flag=0x01

-22 is "-EINVAL", let's check the docs:

-EINVAL Work item is being processed or has completed its work.

Is it a bug or just some wrong with my operation?

The text was updated successfully, but these errors were encountered:

andyross · 2018-02-13T18:58:41Z

The test case seems a little weird. It looks like this line has a boolean inversion:

    while (!k_work_pending(&dwork.work));

My guess is that this was intended to busy-wait until the handler executes (itself a fragile idea -- that will only work if the work_q has a higher priority than the main thread). But instead it's waiting until the work item is pending. And since it was just submitted with a non-zero timeout, it's guaranteed to be pending here and this while() will only run the test once.

So what happens is that your code falls straight through to a second submission of the same item, which succeeds as a noop per docs. But then, indeed, there's a bug. The handler never actually executes after that second submission, and in fact the third submission then fails.

As discovered in zephyrproject-rtos#5952 ...a duplicate call to k_delayed_work_submit_to_queue() on a work item whose timeout had expired but which had not yet executed (i.e. it was pending in the queue for the active work queue thread) would fail, because the cancellation step wouldn't clear the PENDING bit, causing the resubmission to see the object in an invalid state. Trivially fixed by adding a bit clear. It also turns out that the behavior of the code doesn't match the docs, which state that a PENDING work item is not supposed to be cancelled at all. Fix the docs to remove that. And on yet further review, it turns out that there's no way to make a test like the one in the linked bug threadsafe. The work queue does no synchronization by design, so if the user code does no external synchronization it might very well clobber the running handler. Added a sentence to the docs to reflect this gotcha. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>

andyross · 2018-02-13T20:25:37Z

Never mind, the inversion was mine, not yours: "pending" in this context means "the timeout is finished and we're in the main work queue" not "pending on timeout". I see what you're doing now, and indeed it failed to try to resubmit a pending object, despite the fact that the code was trying to do exactly that.

There's a fix for this specific case up at #6189

Note that careful review shows that this particular usage can't ever be safe, though. The work queue itself does no synchronization, so it's not legal to call this function to "restart the timeout" without external synchronization in the general case. Cleaned up the docs to reflect that. But it seems to make your particular case work. Be very careful.

As discovered in #5952 ...a duplicate call to k_delayed_work_submit_to_queue() on a work item whose timeout had expired but which had not yet executed (i.e. it was pending in the queue for the active work queue thread) would fail, because the cancellation step wouldn't clear the PENDING bit, causing the resubmission to see the object in an invalid state. Trivially fixed by adding a bit clear. It also turns out that the behavior of the code doesn't match the docs, which state that a PENDING work item is not supposed to be cancelled at all. Fix the docs to remove that. And on yet further review, it turns out that there's no way to make a test like the one in the linked bug threadsafe. The work queue does no synchronization by design, so if the user code does no external synchronization it might very well clobber the running handler. Added a sentence to the docs to reflect this gotcha. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>

nashif added the bug The issue is a bug, or the PR is fixing a bug label Feb 6, 2018

nashif assigned andyross Feb 6, 2018

nashif added priority: medium Medium impact/importance bug area: Kernel labels Feb 6, 2018

nashif added this to the v1.11.0 milestone Feb 6, 2018

pfalcon mentioned this issue Feb 6, 2018

net: buf: Redesign with split data and meta-data #5721

Merged

nashif added priority: high High impact/importance bug and removed priority: medium Medium impact/importance bug labels Feb 8, 2018

andyross mentioned this issue Feb 13, 2018

work_q: Correctly clear pending flag in delayed work queue, update docs #6189

Merged

nashif closed this as completed in #6189 Feb 13, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API k_delayed_work_submit_to_queue() make a delayed_work unusable #5952

API k_delayed_work_submit_to_queue() make a delayed_work unusable #5952

noodlefighter commented Feb 2, 2018

andyross commented Feb 13, 2018

andyross commented Feb 13, 2018

API k_delayed_work_submit_to_queue() make a delayed_work unusable #5952

API k_delayed_work_submit_to_queue() make a delayed_work unusable #5952

Comments

noodlefighter commented Feb 2, 2018

andyross commented Feb 13, 2018

andyross commented Feb 13, 2018