Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock during activity state machine cancelation #871

Closed
Spikhalskiy opened this issue Nov 15, 2021 · 0 comments · Fixed by #1690
Closed

Deadlock during activity state machine cancelation #871

Spikhalskiy opened this issue Nov 15, 2021 · 0 comments · Fixed by #1690
Assignees
Labels
epic bug High-impact bug (e.g. mis-behaving causing issues to prod traffic)
Milestone

Comments

@Spikhalskiy
Copy link
Contributor

Spikhalskiy commented Nov 15, 2021

Expected Behavior

Workflow method (Spikhalskiy@1d6d858#diff-8e1bd2b4d9c5c5fc5f71b84c7cccbfb73a8396d355e81421b4c2a66d491aa2c8R42):

    @Override
    public String execute() {
      TestActivities.VariousTestActivities activities = ...
      CancellationScope cancellationScope =
          Workflow.newCancellationScope(() -> Async.procedure(activities::activity1, 1));
      cancellationScope.run();
      try {
        // Forcing an end of WFT
        Workflow.sleep(1000);
        throw ApplicationFailure.newNonRetryableFailure("messsage", "type");
      } finally {
        cancellationScope.cancel();
      }
    }

Should be able to fail the workflow execution.

Actual Behavior

Deadlock with an activity state machine triggering an event loop on cancel:

Nov 14, 2021 8:31:17 PM com.google.common.cache.LocalCache processPendingNotifications
WARNING: Exception thrown by removal listener
io.temporal.internal.sync.PotentialDeadlockException: Potential deadlock detected: workflow thread "workflow-method-cdfe6e1a-c034-4f15-b422-f22073c9d5b-e05bc9c6-b9cb-42b1-9661-0a861ee383b1" didn't yield control for over a second. Other workflow threads:

null

	at java.base@11.0.13/jdk.internal.misc.Unsafe.park(Native Method)
	at java.base@11.0.13/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
	at java.base@11.0.13/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2211)
	at app//io.temporal.internal.sync.WorkflowThreadContext.runUntilBlocked(WorkflowThreadContext.java:243)
	at app//io.temporal.internal.sync.WorkflowThreadImpl.runUntilBlocked(WorkflowThreadImpl.java:317)
	at app//io.temporal.internal.sync.DeterministicRunnerImpl.runUntilAllBlocked(DeterministicRunnerImpl.java:195)
	at app//io.temporal.internal.sync.SyncWorkflow.eventLoop(SyncWorkflow.java:156)
	at app//io.temporal.internal.replay.ReplayWorkflowExecutor.eventLoop(ReplayWorkflowExecutor.java:73)
	at app//io.temporal.internal.replay.ReplayWorkflowRunTaskHandler$StatesMachinesCallbackImpl.eventLoop(ReplayWorkflowRunTaskHandler.java:328)
	at app//io.temporal.internal.statemachines.WorkflowStateMachines.eventLoop(WorkflowStateMachines.java:455)
	at app//io.temporal.internal.statemachines.WorkflowStateMachines.lambda$scheduleActivityTask$ff6e4b43$1(WorkflowStateMachines.java:515)
	at app//io.temporal.internal.statemachines.WorkflowStateMachines$$Lambda$181/0x0000000800374840.apply(Unknown Source)
	at app//io.temporal.internal.statemachines.ActivityStateMachine.notifyCanceled(ActivityStateMachine.java:318)
	at app//io.temporal.internal.statemachines.ActivityStateMachine.cancelCommandNotifyCanceled(ActivityStateMachine.java:288)
	at app//io.temporal.internal.statemachines.ActivityStateMachine$$Lambda$185/0x0000000800373840.apply(Unknown Source)
	at app//io.temporal.internal.statemachines.FixedTransitionAction.apply(FixedTransitionAction.java:45)
	at app//io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:147)
	at app//io.temporal.internal.statemachines.StateMachine.handleExplicitEvent(StateMachine.java:91)
	at app//io.temporal.internal.statemachines.EntityStateMachineBase.explicitEvent(EntityStateMachineBase.java:78)
	at app//io.temporal.internal.statemachines.ActivityStateMachine.cancel(ActivityStateMachine.java:277)
	at app//io.temporal.internal.statemachines.WorkflowStateMachines$$Lambda$201/0x0000000800389040.apply(Unknown Source)
	at app//io.temporal.internal.replay.ReplayWorkflowContextImpl.lambda$scheduleActivityTask$ef6c118$1(ReplayWorkflowContextImpl.java:213)
	at app//io.temporal.internal.replay.ReplayWorkflowContextImpl$$Lambda$202/0x0000000800389440.apply(Unknown Source)
	at app//io.temporal.internal.sync.SyncWorkflowContext.lambda$executeActivityOnce$d79fb25e$1(SyncWorkflowContext.java:225)
	at app//io.temporal.internal.sync.SyncWorkflowContext$$Lambda$203/0x0000000800389840.apply(Unknown Source)
	at app//io.temporal.internal.sync.CompletablePromiseImpl.lambda$thenApply$2df5ef44$1(CompletablePromiseImpl.java:210)
	at app//io.temporal.internal.sync.CompletablePromiseImpl$$Lambda$204/0x0000000800389c40.apply(Unknown Source)
	at app//io.temporal.internal.sync.CompletablePromiseImpl.lambda$handle$6a2a7e3d$1(CompletablePromiseImpl.java:219)
	at app//io.temporal.internal.sync.CompletablePromiseImpl$$Lambda$205/0x000000080038a040.apply(Unknown Source)
	at app//io.temporal.internal.sync.CompletablePromiseImpl.lambda$then$16b0e4cc$1(CompletablePromiseImpl.java:268)
	at app//io.temporal.internal.sync.CompletablePromiseImpl$$Lambda$206/0x000000080038a440.apply(Unknown Source)
	at app//io.temporal.internal.sync.CompletablePromiseImpl.invokeHandlers(CompletablePromiseImpl.java:276)
	at app//io.temporal.internal.sync.CompletablePromiseImpl.complete(CompletablePromiseImpl.java:167)
	at app//io.temporal.internal.sync.CancellationScopeImpl.cancel(CancellationScopeImpl.java:120)
	at app//io.temporal.workflow.determinism.CancelActivityDeadlockTest$ScheduleCancelActivityWorkflow.execute(CancelActivityDeadlockTest.java:63)
	at java.base@11.0.13/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base@11.0.13/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base@11.0.13/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base@11.0.13/java.lang.reflect.Method.invoke(Method.java:566)
	at app//io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation$RootWorkflowInboundCallsInterceptor.execute(POJOWorkflowImplementationFactory.java:317)
	at app//io.temporal.common.interceptors.WorkflowInboundCallsInterceptorBase.execute(WorkflowInboundCallsInterceptorBase.java:37)
	at app//io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation.execute(POJOWorkflowImplementationFactory.java:292)
	at app//io.temporal.internal.sync.WorkflowExecuteRunnable.run(WorkflowExecuteRunnable.java:72)
	at app//io.temporal.internal.sync.SyncWorkflow.lambda$start$0(SyncWorkflow.java:137)
	at app//io.temporal.internal.sync.SyncWorkflow$$Lambda$172/0x0000000800377840.run(Unknown Source)
	at app//io.temporal.internal.sync.CancellationScopeImpl.run(CancellationScopeImpl.java:101)
	at app//io.temporal.internal.sync.WorkflowThreadImpl$RunnableWrapper.run(WorkflowThreadImpl.java:111)
	at java.base@11.0.13/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base@11.0.13/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base@11.0.13/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base@11.0.13/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base@11.0.13/java.lang.Thread.run(Thread.java:829)

Root cause analysis

It probably happens when an activity state machine takes a SCHEDULE_COMMAND_CREATED --> CANCELED: CANCEL route with triggering back an event loop on a workflow thread that it's already done. It's just a guess for now.

Spikhalskiy added a commit to Spikhalskiy/java-sdk that referenced this issue Nov 15, 2021
@Spikhalskiy Spikhalskiy added the epic bug High-impact bug (e.g. mis-behaving causing issues to prod traffic) label Nov 16, 2021
Spikhalskiy added a commit to Spikhalskiy/java-sdk that referenced this issue Mar 7, 2023
@Spikhalskiy Spikhalskiy self-assigned this Mar 7, 2023
@Spikhalskiy Spikhalskiy added this to the 1.19.0 milestone Mar 7, 2023
Spikhalskiy added a commit that referenced this issue Mar 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic bug High-impact bug (e.g. mis-behaving causing issues to prod traffic)
Projects
None yet
1 participant