
Possible race condition when canceling a workflow instance #4352

Closed
Zelldon opened this issue Apr 22, 2020 · 1 comment · Fixed by #4590
Assignees
saig0

Labels
kind/bug (Categorizes an issue or PR as a bug) · scope/broker (Marks an issue or PR to appear in the broker section of the changelog) · severity/mid (Marks a bug as having a noticeable impact but with a known workaround) · support (Marks an issue as related to a customer support request)

Comments

@Zelldon
Member

Zelldon commented Apr 22, 2020

Describe the bug

A cloud user reported that Operate seemed to be out of sync. The real problem, however, was that the workflow instance got stuck while it was being canceled.

Imagine the following workflow:

[image: multiBug workflow diagram]

Task B is completed, and while the completion is processed and the next sequence flow is taken, the user cancels the workflow instance. What can happen now is that the cancellation does not clean up all scopes correctly and the instance gets stuck. In our case, only Task A and the sub-process were terminated correctly; the multi-instance body and the workflow instance were still alive.

To Reproduce

This is also reproducible via an engine unit test and the following process: multiBug.bpmn.txt.
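
For reference, the attached model could be built roughly like this with the Zeebe model builder. This is a minimal sketch, assuming the 0.23 builder API (zeebeTaskType, zeebeInputCollection) and hypothetical element ids ("fork", "join", "task-a", "task-b"); the exact structure is inferred from the test below and the fix in #4590.

// Hypothetical sketch of the multiBug model: a parallel multi-instance
// embedded sub-process that forks into task A and task B and joins again
// at a parallel gateway.
final var model =
    Bpmn.createExecutableProcess("process")
        .startEvent()
        .subProcess(
            "sub-process",
            s ->
                s.multiInstance(
                    m -> m.parallel().zeebeInputCollection("items").zeebeInputElement("item")))
        .embeddedSubProcess()
        .startEvent()
        .parallelGateway("fork")
        .serviceTask("task-a", t -> t.zeebeTaskType("a"))
        .parallelGateway("join")
        .moveToNode("fork")
        .serviceTask("task-b", t -> t.zeebeTaskType("b"))
        .connectTo("join")
        .endEvent()
        .subProcessDone()
        .endEvent()
        .done();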

Test
/*
 * Copyright © 2020  camunda services GmbH (info@camunda.com)
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *        http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 *
 */

package io.zeebe.engine.processor.workflow.multiinstance;

import io.zeebe.engine.util.EngineRule;
import io.zeebe.engine.util.RecordToWrite;
import io.zeebe.model.bpmn.Bpmn;
import io.zeebe.protocol.record.intent.JobIntent;
import io.zeebe.protocol.record.intent.WorkflowInstanceIntent;
import io.zeebe.protocol.record.value.BpmnElementType;
import io.zeebe.test.util.record.RecordingExporter;
import io.zeebe.test.util.record.RecordingExporterTestWatcher;
import java.util.Arrays;
import java.util.stream.Collectors;
import org.junit.ClassRule;
import org.junit.Rule;
import org.junit.Test;

public class MultiInstanceBugTest {

  @ClassRule public static final EngineRule ENGINE = EngineRule.singlePartition();

  public static final String TASK_ELEMENT_ID = "task";
  private static final String PROCESS_ID = "process";
  private static final String SUB_PROCESS_ELEMENT_ID = "sub-process";
  private static final String JOB_TYPE = "test";
  private static final String INPUT_COLLECTION = "items";
  private static final String INPUT_ELEMENT = "item";

  @Rule
  public final RecordingExporterTestWatcher recordingExporterTestWatcher =
      new RecordingExporterTestWatcher();

  @Test
  public void shouldTerminateWorkflowInstanceOnCancel() {
    // given: deploy the workflow from the attached BPMN file
    final var resourceAsStream =
        MultiInstanceBugTest.class.getResourceAsStream("/workflows/multiBug.bpmn");
    final var bpmnModelInstance = Bpmn.readModelFromStream(resourceAsStream);
    ENGINE.deployment().withXmlResource(bpmnModelInstance).deploy();

    final long workflowInstanceKey =
        ENGINE
            .workflowInstance()
            .ofBpmnProcessId(PROCESS_ID)
            .withVariable(INPUT_COLLECTION, Arrays.asList(10, 20, 30))
            .create();

    final var instanceRecordValueRecord =
        RecordingExporter.workflowInstanceRecords()
            .withIntent(WorkflowInstanceIntent.ELEMENT_ACTIVATED)
            .withElementType(BpmnElementType.PROCESS)
            .getFirst();

    // wait until both jobs are created, then complete task B
    final var taskA =
        RecordingExporter.jobRecords().withIntent(JobIntent.CREATED).withType("a").getFirst();
    final var taskB =
        RecordingExporter.jobRecords().withIntent(JobIntent.CREATED).withType("b").getFirst();
    ENGINE.writeRecords(RecordToWrite.command().job(JobIntent.COMPLETE).key(taskB.getKey()));

    RecordingExporter.jobRecords().withIntent(JobIntent.COMPLETED).withType("b").getFirst();
    //    ENGINE.stop();

    // when: cancel the workflow instance while the completion of task B is still
    // being processed
    ENGINE.writeRecords(
        RecordToWrite.command()
            .key(workflowInstanceKey)
            .workflowInstance(
                WorkflowInstanceIntent.CANCEL, instanceRecordValueRecord.getValue()));

    //    ENGINE.start();

    // then: the whole instance should be terminated
    final var instanceCanceled =
        RecordingExporter.workflowInstanceRecords()
            .withIntent(WorkflowInstanceIntent.ELEMENT_TERMINATED)
            .withElementType(BpmnElementType.PROCESS)
            .getFirst();

    // collect all terminated elements for inspection
    final var terminatedElements =
        RecordingExporter.workflowInstanceRecords()
            .withIntent(WorkflowInstanceIntent.ELEMENT_TERMINATED)
            .collect(Collectors.toList());
  }
}

Be aware that this is a race condition, which means that the test might not fail on the first try.
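
The commented-out ENGINE.stop() / ENGINE.start() lines hint at a way to provoke the interleaving more reliably: stop the stream processor, write both the job-complete command and the cancel command to the log, and only then resume processing. A hedged sketch, assuming writeRecords accepts multiple records in one call:

// Variation of the test above: put both commands on the log before the
// processor resumes, so the cancel is processed right after the completion
// of task B instead of racing with it by chance.
ENGINE.stop();
ENGINE.writeRecords(
    RecordToWrite.command().job(JobIntent.COMPLETE).key(taskB.getKey()),
    RecordToWrite.command()
        .key(workflowInstanceKey)
        .workflowInstance(
            WorkflowInstanceIntent.CANCEL, instanceRecordValueRecord.getValue()));
ENGINE.start();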

Expected behavior

The workflow instance can be terminated without any problems.

Log/Stacktrace
We have extracted the records from the failed scenario; you can find them in records.txt.

We see that only the task and the sub-process are terminated and that the sequence flow after task B is taken. In fact, the same sequence flow seems to be taken twice, but with different scope ids, which may be related to the problem. Be aware that we cannot share the actual BPMN process here to protect our user, so the output above does not match the model shown in the example.

However, if we run the test, we can see similar output:
output-test.txt
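
One way to make the stuck scopes visible is to diff the activated element instances against those that reached a terminal state. A minimal sketch using the same RecordingExporter API as the test above; the record limit of 500 is an assumed bound, not part of the original test:

// Hedged sketch: compute which activated element instances never completed
// or terminated. In the failing scenario, this set still contains the
// multi-instance body and the workflow instance itself.
final var records =
    RecordingExporter.workflowInstanceRecords().limit(500).collect(Collectors.toList());

final var stuck = new java.util.HashSet<Long>();
for (final var record : records) {
  if (record.getIntent() == WorkflowInstanceIntent.ELEMENT_ACTIVATED) {
    stuck.add(record.getKey());
  } else if (record.getIntent() == WorkflowInstanceIntent.ELEMENT_COMPLETED
      || record.getIntent() == WorkflowInstanceIntent.ELEMENT_TERMINATED) {
    stuck.remove(record.getKey());
  }
}
System.out.println("stuck element instances: " + stuck);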

Environment:

  • Zeebe Version: 0.23.0
@Zelldon Zelldon added kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog labels Apr 22, 2020
@npepinpe npepinpe added severity/low Marks a bug as having little to no noticeable impact for the user Impact: Usability and removed severity/low Marks a bug as having little to no noticeable impact for the user labels Apr 27, 2020
@menski menski added the support Marks an issue as related to a customer support request label Apr 30, 2020
@menski
Contributor

menski commented Apr 30, 2020

Support Case: https://jira.camunda.com/browse/SUPPORT-7623

Waiting for customer to prioritize

@saig0 saig0 added severity/mid Marks a bug as having a noticeable impact but with a known workaround and removed Severity: Major labels May 15, 2020
@saig0 saig0 self-assigned this May 25, 2020
ghost pushed a commit that referenced this issue May 26, 2020
4590: chore(engine): migrate sub-process processor r=saig0 a=saig0

# Description

* migrate sub-process processor
* fix termination of an embedded sub-process with a waiting token on a joining parallel gateway
* clean up tests for embedded sub-process

## Related issues

closes #4474 
closes #4400 
closes #4352


Co-authored-by: Philipp Ossler <philipp.ossler@gmail.com>
@ghost ghost closed this as completed in d2e92a6 May 26, 2020
This issue was closed.