Improve Exception Handling #1099

mroeppis · 2023-06-09T23:41:38Z

Motivation

Some issues related to exception handling have been reported over time:

What they have in common is that users want to know how they can handle exceptions from self-written components at the caller level of the statemachine. They often ask why exceptions get caught inside the machine but are not passed to the outside.

It seems there are 2 principles that try to explain why Spring Statemachine catches exceptions extensively:

1. Run To Completion
2. Do Not Break The Machine (see also Guard breaks with Throwable #206)

While one could argue that number 2 is merely a requirement derived from number 1, "Run To Completion" actually refers only to how events should be processed. The "one by one" approach as explained on the linked sites is important because the machine needs to be in a well-defined and stable state before it can go on.

However, this does not take exceptions or errors into account. Personally, I doubt these are even considered events (at least not the ones that occur unexpectedly). That is why I believe that, when it comes to exception handling, we should move away from the very fundamental Run To Completion (RTC) paradigm and look at how users can operate a statemachine meaningfully.

Room For Improvement

In the following I will refer to exceptions in actions or guards but most of the points can be applied to other user-written components as well (e.g. listeners, interceptors).

The current (3.2.0) implementation has at least 2 shortcomings that force the implementer to take extra measures against exceptions:

1. Exceptions are not always propagated to the caller starting the statemachine or sending an event.
   - As a consequence, callers must choose a container (e.g. extended state) to store an exception that happens during statemachine execution. This container must be accessible from outside the machine so that one can read and evaluate the exception after execution.
2. Exceptions can make a statemachine impossible to reuse. E.g. when they occur in an action bound to a triggerless transition, the machine literally hangs in the transit state where the transition originates.
   - Users that wish to reuse the same statemachine instance between events must extend their exception handling by one of the following:
     - Add a looping event transition to the transit state, i.e. one that leads to the same state again. Once an exception occurs, the event has to be sent to continue on the former path, since triggerless transitions get executed only when a state is entered.
     - Create a machine backup, e.g. with the help of Spring StateMachinePersister. This can be used to restore the statemachine from a stable state. In technical terms this means the machine experiences a reset.

Exception Handling Test

To demonstrate these points I created a test based on the following statemachine:

S₁ .. S₅ := states
- S₁ := initial state
- S₂ := choice
- S₃ := event accepting state
- S₄ := transit state (with state entry + behavior + exit action)
- S₅ := end state
E := event
a := action
g := guard

The action is always registered with an errorAction that writes the exception into a container, specifically the ExtendedState. The guard is registered using a wrapper around it that provides the same exception handling.
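The wrapper idea can be sketched without the Spring Statemachine dependency. In this plain-Java sketch, `Context`, `SimpleGuard`, `CapturingGuard` and the key `"exception"` are hypothetical stand-ins rather than library types: the map plays the role of the ExtendedState variables. Catching `Throwable` and evaluating to `false` on failure are assumptions made for illustration, not necessarily what the attached test project does.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the machine's extended state: a plain variable map. */
final class Context {
    final Map<Object, Object> variables = new HashMap<>();
}

/** Hypothetical stand-in for a guard: decides whether a transition may be taken. */
interface SimpleGuard {
    boolean evaluate(Context context);
}

/** Wraps a guard so that anything it throws ends up in the container. */
final class CapturingGuard implements SimpleGuard {
    static final String EXCEPTION_KEY = "exception"; // assumed key name

    private final SimpleGuard delegate;

    CapturingGuard(SimpleGuard delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean evaluate(Context context) {
        try {
            return delegate.evaluate(context);
        } catch (Throwable t) {
            // Assumption: a failing guard is stored and treated as "not satisfied".
            context.variables.put(EXCEPTION_KEY, t);
            return false;
        }
    }
}
```

After execution, the caller would read `context.variables.get(CapturingGuard.EXCEPTION_KEY)` to find out whether anything went wrong, which is exactly the extra step that propagation to the caller would make unnecessary.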

The last column of the test report states whether the statemachine can be reused for transition execution. If the result state has only a triggerless transition or is the end state, the machine is stopped, reset and started again to see if its state changes. In the case of S₃, which accepts an event, another event is sent without an exception to see if the transition is taken.

Test Project

Attached. Built with Java 17.0.6 and Gradle 8.1.1.

statemachine-exception-handling.zip

Note that the test cases were written so that they do not fail on the issues listed below. Instead, a comment was added to each assertion that proves an error.

Test Report

| Item # | Path To Exception | Exception Origin | Exception Type | Exception Propagated? | Exception In Container? | Result State | SSM Reusable? |
|---|---|---|---|---|---|---|---|
| 1 | start → S₁ | initial action | `RuntimeException` | ✓ | ✓ | ✗ | ✗ |
| 2 | start → S₁ | initial action | `Error` | ✓ | ✗ | ✗ | ✗ |
| 3 | start → S₁ → S₂ | S₁ → S₂ action | `RuntimeException` | ✓ | ✓ | S₁ | ✗ |
| 4 | start → S₁ → S₂ | S₁ → S₂ action | `Error` | ✓ | ✗ | S₁ | ✗ |
| 5 | start → S₁ → S₂ | S₂ option 1 guard | `RuntimeException` | ✗ | ✓ | S₅ | ✗ |
| 6 | start → S₁ → S₂ | S₂ option 1 guard | `Error` | ✗ | ✓ | S₅ | ✗ |
| 7 | start → S₁ → S₂ | S₂ option 1 action | `RuntimeException` | ✓ | ✓ | S₁ | ✗ |
| 8 | start → S₁ → S₂ | S₂ option 1 action | `Error` | ✓ | ✗ | S₁ | ✗ |
| 9 | start → S₁ → S₂ | S₂ option guards + default option action | `RuntimeException` | only action exception | ✓ | S₁ | ✗ |
| 10 | start → S₁ → S₂ | S₂ option guards + default option action | `Error` | only action error | only guard errors | S₁ | ✗ |
| 11 | start → S₁ → S₂ → S₃ → S₄ | S₃ → S₄ action | `RuntimeException` | ✗ | ✓ | S₃ | ✓ |
| 12 | start → S₁ → S₂ → S₃ → S₄ | S₃ → S₄ action | `Error` | ✓ | ✗ | S₃ | ✗ |
| 13 | start → S₁ → S₂ → S₃ → S₄ | S₃ → S₄ guard | `RuntimeException` | ✗ | ✓ | S₃ | ✓ |
| 14 | start → S₁ → S₂ → S₃ → S₄ | S₃ → S₄ guard | `Error` | ✓ | ✓ | S₃ | ✗ |
| 15 | start → S₁ → S₂ → S₄ | S₄ state entry + behavior + exit action | `RuntimeException` | ✗ | missing or extra exit action exception | S₄ or S₅ | ✗ |
| 16 | start → S₁ → S₂ → S₄ | S₄ state entry action | `Error` | ✓ | ✗ | S₄ | ✗ |
| 17 | start → S₁ → S₂ → S₄ | S₄ state behavior action | `Error` | ✗ | ✗ | S₅ | ✗ |
| 18 | start → S₁ → S₂ → S₄ | S₄ state exit action | `Error` | only sometimes | ✗ | S₄ or S₅ | ✗ |

Test Result Groups

Error Gets Propagated

Test cases in which an error gets propagated to the caller can be considered OK in my opinion. The machine cannot be reused, but since we experienced an error this is probably not what we want anyway. This applies to test items 2, 4, 8, 12, 14 and 16.

Error Not Propagated

This was demonstrated for type `Error` in the choice option guards in test items 6 and 10. Note that in item 10 this allows an error from an action to slip through. The machine also continues transition execution and enters the end state. The same applies to item 17, where the error occurs in a state behavior action. The expected behavior here would be to terminate execution right away.

Statemachine Gets Caught In State After Exception

In case of type `Exception`, we may want to reuse the machine to start it again or re-send an event because the cause of the exception might be temporary. This is not possible if the result state of the statemachine does not allow for it. This was described earlier as point 2 under "Room For Improvement" and applies to test items 1, 3, 5, 7, 9 and 15.

Transition Execution Not Interrupted After Exception

Some test items demonstrate that the statemachine continues its transition logic even though an exception occurred. This applies to the choice option guards, as seen in test items 5 and 6, as well as to the state actions in item 15. A more severe case is test item 17, where the end state is entered despite an error in the S₄ behavior action.

Flaky Runs

Random erroneous behavior was experienced in test items 15 and 18, where the exit action from S₄ fires. Sometimes the action is late, meaning that at the time of verification it has not been executed yet. You may modify the tests to make the thread wait for another second before mock verification to see that it does finally execute. At other times the same action executes twice. Possibly related to:

The same happened in test items 11 and 13 when the event was sent a 2nd time (without exception).

What's more, the exception propagated to the caller is not always what we would expect:

java.util.ConcurrentModificationException
	at java.base/java.util.ArrayList$Itr.checkForComodification(ArrayList.java:1013)
	at java.base/java.util.ArrayList$Itr.next(ArrayList.java:967)
	at reactor.core.publisher.FluxIterable$IterableSubscription.slowPath(FluxIterable.java:259)
	
	[...100 more...]

	at reactor.core.publisher.FluxGenerate$GenerateSubscription.next(FluxGenerate.java:178)
	at org.springframework.statemachine.support.ReactiveStateMachineExecutor.lambda$handleTriggerlessTransitions$18(ReactiveStateMachineExecutor.java:349)
	at reactor.core.publisher.FluxGenerate.lambda$new$1(FluxGenerate.java:58)
	
	[...100 miles down the reactor...]

	at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
	at reactor.core.publisher.Mono.subscribe(Mono.java:4400)
	at reactor.core.publisher.Mono.block(Mono.java:1706)
	at org.springframework.statemachine.support.LifecycleObjectSupport.start(LifecycleObjectSupport.java:111)

This could be related to an issue that was meant to be fixed:

ConcurrentModificationException in DefaultStateMachineExecutor #736

Improvement Proposals

P-1: Propagate Exceptions

Catch an exception, interrupt transition execution, and rethrow the exception. Let it exit the machine so that operators can try-catch.
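
A minimal plain-Java sketch of this proposal, with no Spring Statemachine types involved. `TransitionRunner`, its `run` method and the `Runnable` steps are hypothetical names standing in for the actions of one transition; the sketch only illustrates the catch, interrupt, rethrow sequence and is not how the library is implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical runner that executes the steps of one transition in order. */
final class TransitionRunner {
    private final List<String> executed = new ArrayList<>();

    /** Runs the steps; the first exception interrupts the run and reaches the caller. */
    void run(List<Runnable> steps) {
        int index = 0;
        for (Runnable step : steps) {
            try {
                step.run();
                executed.add("step-" + index);
            } catch (RuntimeException e) {
                // Catch, interrupt (the remaining steps are skipped), rethrow.
                executed.add("failed-" + index);
                throw e;
            }
            index++;
        }
    }

    /** Returns a log of what ran, for inspection after the call. */
    List<String> executed() {
        return executed;
    }
}
```

A caller could then wrap `runner.run(steps)` in an ordinary try-catch block instead of reading the exception from a container afterwards.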

P-2: Recover The Statemachine

This could be achieved by a Back To Origin approach where the machine is reset to:

- the pre-start state if the machine was started
- the pre-event state if an event was sent

State in this context refers to any part of the statemachine including extended state, current error, etc.

This could make sense as a general feature but may be optional as well, e.g. `configurer.withRecovery()`. For example, when a statemachine is persisted in a database (via parts of its `StateMachineContext`) and calling threads only query the machine to restore it, send a single event, evaluate success and then persist it again, one will not need statemachine recovery in case of an exception. In essence, there might be users who want to reuse the machine for several events and others who do not.
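
The Back To Origin idea for the "event was sent" case can be sketched in plain Java. `MiniMachine`, its fields and `sendEvent` are hypothetical and deliberately tiny; a real implementation would have to snapshot everything the proposal counts as state (extended state, current error, etc.), and the shallow copy used here is an assumption that only holds for immutable variable values.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical miniature machine: a current state plus extended-state variables. */
final class MiniMachine {
    String state = "S1";
    final Map<String, Object> variables = new HashMap<>();

    /**
     * Applies an event handler; on failure restores the pre-event snapshot
     * and rethrows so that the caller still sees the exception.
     */
    void sendEvent(Consumer<MiniMachine> handler) {
        String stateBefore = state;
        Map<String, Object> variablesBefore = new HashMap<>(variables); // shallow copy
        try {
            handler.accept(this);
        } catch (RuntimeException e) {
            // Back To Origin: undo everything the failed event changed.
            state = stateBefore;
            variables.clear();
            variables.putAll(variablesBefore);
            throw e;
        }
    }
}
```

Because the machine is back in its pre-event state after a failure, the same instance can accept the next event without an external backup or a looping transition.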

P-3: Keep Up Development

Based on the number of issues that have piled up and reasonable doubt that has been expressed:

Active development? #1081

I guess a lot of users would be happy to see progress on, but not limited to, this topic. One way to start would be to keep up communication with those involved in issues.

github-actions bot added the status/need-triage Team needs to triage and take a first look label Jun 9, 2023
