Transition Race Condition #6

theckman · 2016-06-08T03:03:57Z

Hey there,

I was taking a look at how your library was written and noticed that your transitions were not serialized using a mutex. I figured this would mean that your library would fail in cases where it shouldn't.

When I managed to reproduce it in testing, the failure was a bit more explosive than I thought it would be. When I wrote the test, I expected one of the goroutines would either: A) Fail with an InvalidTransition, B) Fail with the state being unexpected. However, both goroutines reported an InvalidTransition error, while only one reported that the state was incorrect. At first-pass, I can't explain why both goroutines ended up getting an InvalidTransition error. It would seem to confirm the race condition.

Here is the diff and then the test output. The output shows two errors getting returned from Transition, when really one should do it.

diff --git a/fsm_test.go b/fsm_test.go
index 8a9fec0..f08dfb8 100644
--- a/fsm_test.go
+++ b/fsm_test.go
@@ -1,6 +1,9 @@
 package fsm_test

 import (
+   "fmt"
+   "runtime"
+   "sync"
    "testing"
    "time"

@@ -91,6 +94,59 @@ func TestMachineTransition(t *testing.T) {
    st.Expect(t, some_thing.State, fsm.State("started"))
 }

+func TestMachineTransitionConcurrent(t *testing.T) {
+   if runtime.GOMAXPROCS(0) < 2 {
+       t.Skip("unlikely that concurrency bug would manifest on system with one CPU core; skipping")
+   }
+
+   rules := fsm.Ruleset{}
+   rules.AddTransition(fsm.T{"pending", "started"})
+   rules.AddTransition(fsm.T{"pending", "aborted"})
+   rules.AddTransition(fsm.T{"started", "finished"})
+   rules.AddTransition(fsm.T{"aborted", "pending"})
+
+   thing := Thing{State: "pending"}
+   machine := fsm.New(fsm.WithRules(rules), fsm.WithSubject(&thing))
+
+   var err error
+   wg := &sync.WaitGroup{}
+
+   wg.Add(2)
+
+   go func() {
+       err = machine.Transition("started")
+       fmt.Println(thing.State)
+       st.Expect(t, err, nil)
+       st.Expect(t, thing.State, fsm.State("started"))
+       wg.Done()
+   }()
+
+   go func() {
+       time.Sleep(time.Nanosecond * 10000)
+       err = machine.Transition("aborted")
+       st.Expect(t, err, nil)
+       st.Expect(t, thing.State, fsm.State("aborted"))
+       wg.Done()
+   }()
+
+   wg.Wait()
+}
+
 func BenchmarkRulesetParallelGuarding(b *testing.B) {
    rules := fsm.Ruleset{}
    rules.AddTransition(fsm.T{"pending", "started"})

Output:

=== RUN   TestRulesetTransitions
--- PASS: TestRulesetTransitions (0.00s)
=== RUN   TestRulesetParallelGuarding
--- PASS: TestRulesetParallelGuarding (0.00s)
=== RUN   TestMachineTransition
--- PASS: TestMachineTransition (0.00s)
=== RUN   TestMachineTransitionConcurrent
started
--- FAIL: TestMachineTransitionConcurrent (0.00s)
    st.go:41:
        fsm_test.go:127: should be ==
            have: (*errors.errorString) invalid transition
            want: (<nil>) <nil>
    st.go:41:
        fsm_test.go:128: should be ==
            have: (fsm.State) started
            want: (fsm.State) aborted
    st.go:41:
        fsm_test.go:119: should be ==
            have: (*errors.errorString) invalid transition
            want: (<nil>) <nil>
FAIL
exit status 1
FAIL    github.com/ryanfaerman/fsm  0.015s

The text was updated successfully, but these errors were encountered:

ryanfaerman · 2016-06-08T04:07:35Z

If we take a step back and thing what is happening here, the story as I understand it is as follows:

machine is initialized as pending with valid transitions being to started or to aborted.
machine is transitioned to started, valid transitions now are: finished.
machine is transitioned to aborted, this is an invalid transition since the state of the machine is started.

The fact that these are running in goroutines does not make this a concurrent transition. There is only a single machine and it cannot be transitioned into an invalid state at any time. The second goroutine should have its expectations as:

st.Expect(t, err, fsm.InvalidTransition)
st.Expect(t, thing.State, fsm.State("started"))

Additionally, I was able to eliminate the strange behavior of having a valid transition seemingly return an error.

The issue is with declaring the var err error outside of the goroutines. You've introduced this variable into the scope above both goroutines, meaning that they will be sharing them. This is where the race condition actually lives. If the second goroutine manages to set the error before the first can run the expectations, you'll see this behavior.

But, by removing the err variable, I found a different issue. Now, randomly, either both sets of expectations would fail or both would pass. I don't think this is a bad thing. It means that the state machine is ensuring that the states are occurring according to their transition rule set.

Finally, I'm not really sure what you're actually trying to test here. From what I can tell, it seems that the race condition is actually in the test itself and not the state machine. Maybe if I understood more of what scenario you're trying to address, we can come up with a test case that accurately represents it.

EDIT: I've thought about it some more. I think a better test would involve two wait groups. We have an issue here in that we can't really predict when both goroutines will start. If we have each one wait until both goroutines signal they are running, and only start after that signal, we'll be able to see how the FSM really behaves.

theckman · 2016-06-08T04:20:46Z

Ugh. I'm sorry about the err variable. That's an embarrassing thing to miss in the test.

So I was looking at this block of code, and I was wondering if it's possible for two goroutines have the if statement evaluate to true, thus updating state twice:

https://github.com/ryanfaerman/fsm/blob/master/fsm.go#L106-L109

If the consumer doesn't confirm state at the end, they could be running with the wrong assumption.

ryanfaerman · 2016-06-08T04:27:45Z

I was just reviewing that exact same function. I think there is a possible issue there. There is a chance that two or more goroutines could run the Permitted function before the another has a chance to update the state.

It is undefined which one would be permitted in that scenario.

I'm on the fence if it's better to simply document the possible issue here and leave it as an exercise to the consumer or to add a mutex on the Machine.

ryanfaerman closed this as completed Feb 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transition Race Condition #6

Transition Race Condition #6

theckman commented Jun 8, 2016 •

edited

Loading

ryanfaerman commented Jun 8, 2016 •

edited

Loading

theckman commented Jun 8, 2016

ryanfaerman commented Jun 8, 2016

Transition Race Condition #6

Transition Race Condition #6

Comments

theckman commented Jun 8, 2016 • edited Loading

ryanfaerman commented Jun 8, 2016 • edited Loading

theckman commented Jun 8, 2016

ryanfaerman commented Jun 8, 2016

theckman commented Jun 8, 2016 •

edited

Loading

ryanfaerman commented Jun 8, 2016 •

edited

Loading