Improve Exception Handling #1099

Open
mroeppis opened this issue Jun 9, 2023 · 0 comments
Labels
status/need-triage Team needs to triage and take a first look

Motivation

Some issues related to exception handling have been reported over time:

  1. Spring State Machine Error Handling not working #548
  2. Exposing exception in StateMachine #601
  3. reactive error handling #997
  4. Not able to catch and rethrow spring state machine error #553
  5. Is there way to throw an Exception that occurred inside the Statemachine to the outside? #1055
  6. Make exceptions that prevent transitions available #970
  7. Error in Action doesn't trigger rollback #1076
  8. Exceptionhandling in guards #608
  9. Spring Security Exception and how to handle it. #340
  10. how to error handling in StateMachineInterceptorAdapter override method preStateChange #1090

What they have in common is that users want to know how they can handle exceptions from self-written components at the caller level of the statemachine. They often ask why exceptions get caught inside the machine but are not passed to the outside.

It seems there are 2 principles that try to explain why Spring Statemachine catches exceptions extensively:

  1. Run To Completion
  2. Do Not Break The Machine (see also Guard breaks with Throwable #206 )

One could argue that number 2 is merely a requirement derived from number 1, but "Run To Completion" actually refers only to how events should be processed. The "one by one" approach as explained on the linked sites is important because the machine needs to be in a well-defined and stable state before it can go on.

However, this does not take exceptions or errors into account. Personally, I doubt these should even be considered events (at least not the ones that occur unexpectedly). That is why I believe that, when it comes to exception handling, we should move away from the very fundamental RTC paradigm and look at how users can operate a statemachine meaningfully.

Room For Improvement

In the following I will refer to exceptions in actions or guards but most of the points can be applied to other user-written components as well (e.g. listeners, interceptors).

The current (3.2.0) implementation has at least 2 shortcomings that force the implementer to take extra measures against exceptions:

  1. Exceptions are not always propagated to the caller starting the statemachine or sending an event.
    • As a consequence, callers must choose a container (e.g. extended state) to store an exception that happens during statemachine execution. This container must be accessible from outside the machine so that one can read and evaluate the exception after execution.
  2. Exceptions can make a statemachine impossible to reuse. E.g. when they occur in an action bound to a triggerless transition the machine literally hangs in the transit state where the transition originates.
    • Users that wish to reuse the same statemachine instance between events must extend their exception handling by one of the following:
      • Add a looping event transition to the transit state, i.e. one that leads to the same state again. Once an exception occurs the event has to be sent to continue on the former path since triggerless transitions get executed only once a state is entered.
      • Create a machine backup, e.g. with the help of Spring StateMachinePersister. This can be used to restore the statemachine from a stable state. In technical terms this means the machine experiences a reset.

Exception Handling Test

To demonstrate these points I created a test based on the following statemachine:

  • S1 .. S5 := states
    • S1 := initial state
    • S2 := choice
    • S3 := event accepting state
    • S4 := transit state (with state entry + behavior + exit action)
    • S5 := end state
  • E := event
  • a := action
  • g := guard

The action is always registered with an errorAction that writes the exception into a container, specifically the ExtendedState. The guard is registered using a wrapper around it that provides the same exception handling.
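The container approach described above can be sketched in plain Java. This is only an illustration under stated assumptions: the `Map` stands in for the ExtendedState variables, `Predicate` and `Consumer` stand in for guards and actions, and `ExceptionCapture`, `safeGuard`, `safeAction` and `ERROR_KEY` are hypothetical names, not Spring Statemachine API. It also assumes that only `RuntimeException` is captured; the wrapper in the attached test project may handle other types differently.

```java
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Plain-Java stand-ins; none of these names belong to Spring Statemachine.
final class ExceptionCapture {
    // Hypothetical key under which the last exception is stored.
    static final String ERROR_KEY = "lastException";

    // Wraps a guard: a thrown exception is stored in the container
    // and the guard evaluates to false.
    static Predicate<Map<String, Object>> safeGuard(Predicate<Map<String, Object>> guard) {
        return vars -> {
            try {
                return guard.test(vars);
            } catch (RuntimeException e) {
                vars.put(ERROR_KEY, e);
                return false;
            }
        };
    }

    // Wraps an action: a thrown exception is stored in the container
    // instead of being lost.
    static Consumer<Map<String, Object>> safeAction(Consumer<Map<String, Object>> action) {
        return vars -> {
            try {
                action.accept(vars);
            } catch (RuntimeException e) {
                vars.put(ERROR_KEY, e);
            }
        };
    }
}
```

After execution the caller has to look into the container (here `vars.get(ExceptionCapture.ERROR_KEY)`) to find out whether something went wrong, which is exactly the extra measure criticized above.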

The last column states whether the statemachine can be reused for transition execution. If the result state has only a triggerless transition or is the end state, the machine is stopped, reset and started again to see if its state changes. If the result state is S3, which accepts an event, another event is sent without an exception to see if the transition is taken.

Test Project

Attached. Built with Java 17.0.6 and Gradle 8.1.1.

statemachine-exception-handling.zip

Note that the test cases were written so that they do not fail on the issues listed below. Instead, a comment marks each assertion that demonstrates an error.

Test Report

| Item # | Path To Exception | Exception Origin | Exception Type | Exception Propagated? / Exception In Container? | Result State | SSM Reusable? |
|---|---|---|---|---|---|---|
| 1 | start → S1 | initial action | RuntimeException | | | |
| 2 | start → S1 | initial action | Error | | | |
| 3 | start → S1S2 | S1S2 action | RuntimeException | | S1 | |
| 4 | start → S1S2 | S1S2 action | Error | | S1 | |
| 5 | start → S1S2 | S2 option 1 guard | RuntimeException | | S5 | |
| 6 | start → S1S2 | S2 option 1 guard | Error | | S5 | |
| 7 | start → S1S2 | S2 option 1 action | RuntimeException | | S1 | |
| 8 | start → S1S2 | S2 option 1 action | Error | | S1 | |
| 9 | start → S1S2 | S2 option guards + default option action | RuntimeException | only action exception | S1 | |
| 10 | start → S1S2 | S2 option guards + default option action | Error | only action error / only guard errors | S1 | |
| 11 | start → S1S2S3S4 | S3S4 action | RuntimeException | | S3 | |
| 12 | start → S1S2S3S4 | S3S4 action | Error | | S3 | |
| 13 | start → S1S2S3S4 | S3S4 guard | RuntimeException | | S3 | |
| 14 | start → S1S2S3S4 | S3S4 guard | Error | | S3 | |
| 15 | start → S1S2S4 | S4 state entry + behavior + exit action | RuntimeException | missing or extra exit action exception | S4 or S5 | |
| 16 | start → S1S2S4 | S4 state entry action | Error | | S4 | |
| 17 | start → S1S2S4 | S4 state behavior action | Error | | S5 | |
| 18 | start → S1S2S4 | S4 state exit action | Error | only sometimes | S4 or S5 | |

Test Result Groups

Error Gets Propagated

Test cases in which an error gets propagated to the caller can be considered OK in my opinion. The machine cannot be reused, but since we experienced an error, reuse is probably not what we want anyway. This applies to test items 2, 4, 8, 12, 14 and 16.

Error Not Propagated

This was demonstrated for type Error in choice option guards in items 6 and 10. Note that in item 10 this allows an error from an action to slip through. The machine also continues transition execution and enters the end state. The same applies to item 17, where the error occurs in a state behavior action. The expected behavior here would be to terminate execution right away.

Statemachine Gets Caught In State After Exception

In case of type Exception we may want to reuse the machine to start it again or re-send an event, because the cause of the exception might be temporary. This is not possible if the result state of the statemachine does not allow for it. The problem was described above in "Room For Improvement", point 2, and applies to test items 1, 3, 5, 7, 9 and 15.

Transition Execution Not Interrupted After Exception

Some test items demonstrate that the statemachine continues its transition logic even though an exception occurred. This applies to the choice option guards as seen in test items 5 and 6 as well as to the state actions in item 15. A more severe case is test item 17, where the end state is entered despite an error in the S4 behavior action.

Flaky Runs

Random erroneous behavior was observed in test items 15 and 18, where the exit action from S4 fires. Sometimes the action is late, meaning that it has not been executed yet at the time of verification. You may modify the tests to make the thread wait for another second before mock verification to see that it does finally execute. At other times the same action executes twice. Possibly related to:

The same happened in test items 11 and 13 when the event was sent a 2nd time (without exception).

What's more, the exception propagated to the caller is not always what we would expect:

java.util.ConcurrentModificationException
	at java.base/java.util.ArrayList$Itr.checkForComodification(ArrayList.java:1013)
	at java.base/java.util.ArrayList$Itr.next(ArrayList.java:967)
	at reactor.core.publisher.FluxIterable$IterableSubscription.slowPath(FluxIterable.java:259)
	
	[...100 more...]

	at reactor.core.publisher.FluxGenerate$GenerateSubscription.next(FluxGenerate.java:178)
	at org.springframework.statemachine.support.ReactiveStateMachineExecutor.lambda$handleTriggerlessTransitions$18(ReactiveStateMachineExecutor.java:349)
	at reactor.core.publisher.FluxGenerate.lambda$new$1(FluxGenerate.java:58)
	
	[...100 miles down the reactor...]

	at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
	at reactor.core.publisher.Mono.subscribe(Mono.java:4400)
	at reactor.core.publisher.Mono.block(Mono.java:1706)
	at org.springframework.statemachine.support.LifecycleObjectSupport.start(LifecycleObjectSupport.java:111)

This could be related to an issue that was meant to be fixed:

Improvement Proposals

P-1: Propagate Exceptions

Catch an exception, interrupt transition execution, and rethrow the exception. Let it exit the machine so that operators can try-catch.
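A minimal sketch of this idea in plain Java, not taken from the Spring Statemachine code base: `PropagatingExecutor` and `TransitionFailedException` are hypothetical names, and a transition is modeled simply as an ordered list of `Runnable` steps (guards, actions). The first failing step interrupts execution and the exception leaves the executor.

```java
import java.util.List;

// Hypothetical exception type that tells the caller which step failed.
final class TransitionFailedException extends RuntimeException {
    private final int stepIndex;

    TransitionFailedException(int stepIndex, Throwable cause) {
        super("Transition step " + stepIndex + " failed", cause);
        this.stepIndex = stepIndex;
    }

    int stepIndex() {
        return stepIndex;
    }
}

// Hypothetical stand-in for the part of a machine that executes a transition.
final class PropagatingExecutor {
    // Runs the steps in order. The first exception interrupts execution:
    // the remaining steps are skipped and the exception is rethrown,
    // wrapped so that the original cause is preserved.
    void execute(List<Runnable> steps) {
        for (int i = 0; i < steps.size(); i++) {
            try {
                steps.get(i).run();
            } catch (RuntimeException e) {
                throw new TransitionFailedException(i, e);
            }
        }
    }
}
```

With something like this in place, a caller could wrap `execute(...)` in an ordinary try-catch and inspect `getCause()` instead of reading an exception from a container after the fact.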

P-2: Recover The Statemachine

This could be achieved by a Back To Origin approach where the machine is reset to:

  • the pre-start state if the machine was started
  • the pre-event state if an event was sent

State in this context refers to any part of the statemachine including extended state, current error, etc.
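The Back To Origin idea could look roughly like the following plain-Java sketch. It is heavily simplified and rests on assumptions: `RecoverableMachine` is a hypothetical class, "state" is reduced to a state name plus a variable map, and the snapshot is a shallow copy. A real implementation would have to cover everything mentioned above (extended state, current error, etc.).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical, heavily simplified machine: a state name plus a variable map.
final class RecoverableMachine {
    private String state;
    private final Map<String, Object> variables = new HashMap<>();

    RecoverableMachine(String initialState) {
        this.state = initialState;
    }

    String state() {
        return state;
    }

    Map<String, Object> variables() {
        return variables;
    }

    void moveTo(String newState) {
        state = newState;
    }

    // Applies an event handler. If the handler throws, the machine is
    // restored to its pre-event snapshot and the exception is rethrown.
    void sendEvent(Consumer<RecoverableMachine> handler) {
        String stateBefore = state;
        Map<String, Object> variablesBefore = new HashMap<>(variables); // shallow copy
        try {
            handler.accept(this);
        } catch (RuntimeException e) {
            state = stateBefore;
            variables.clear();
            variables.putAll(variablesBefore);
            throw e;
        }
    }
}
```

After a failed event the machine is back in the state it had before the event, so the same event can be sent again without a stop, reset and restart.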

This could make sense as a general feature but may be optional as well, e.g. enabled via something like configurer.withRecovery(). For example, when a statemachine is persisted in a database (via parts of its StateMachineContext) and calling threads only query the machine to restore it, send a single event, evaluate success and then persist it again, there is no need for statemachine recovery in case of an exception. In essence, there might be users who want to reuse the machine for several events and others who do not.

P-3: Keep Up Development

Based on the number of issues that have piled up and reasonable doubt that has been expressed:

I guess a lot of users would be happy to see progress on this topic, among others. One way to start would be to keep up communication with those involved in the issues.
