Race condition during bilateral initial resend request #439

Closed
sami-sweng opened this issue Jun 11, 2021 · 0 comments · Fixed by #440

sami-sweng (Contributor) commented Jun 11, 2021

Race condition during bilateral initial resend request

I believe I have found a race condition when the initiator successfully connects to an acceptor.
The specific scenario that causes the bug is when both sides are a few messages ahead at the time of the first successful logon and the initiator's unsent messages are all administrative messages (Logon).

Expected sequence of events

The expected flow is as follows:

Direction Message type
init -> acc Logon
acc -> init Logon
acc -> init ResendRequest
init -> acc ResendRequest
acc -> init News (the 1st message to resend)
acc -> init News (the 2nd message to resend)
acc -> init SequenceReset
init -> acc SequenceReset

Issue

The issue is that sometimes the initiator sends the SequenceReset immediately after the acceptor's ResendRequest
and only then sends its own ResendRequest.

The SequenceReset is visibly out of order: it carries a later timestamp than the ResendRequest sent after it, and that ResendRequest continues with the sequence number from before the sequence reset.

The acceptor I am connecting to (which I believe uses QuickFIX/J) does not tolerate the wrong sequence number and rejects it with a Logout.
The resend mechanism therefore fails and the acceptor's messages are lost.

Actual sequence of events

Here are the FIX messages output during the scenario:

8=FIX.4.4|9=76|35=A|34=4|49=INIT|52=20210525-22:05:30.686|56=ACC|369=0|98=0|108=90|10=220|
8=FIX.4.4|9=71|35=A|34=14|49=ACC|52=20210525-22:05:30.803|56=INIT|98=0|108=90|10=239|
8=FIX.4.4|9=68|35=2|34=15|49=ACC|52=20210525-22:05:30.804|56=INIT|7=1|16=0|10=068|
8=FIX.4.4|9=107|35=4|34=1|43=Y|49=INIT|52=20210525-22:05:30.833|56=ACC|122=20210525-22:05:30.833|369=15|36=6|123=Y|10=013|
8=FIX.4.4|9=73|35=2|34=5|49=INIT|52=20210525-22:05:30.821|56=ACC|369=0|7=1|16=0|10=030|
8=FIX.4.4|9=108|35=5|34=16|49=ACC|52=20210525-22:05:30.848|56=INIT|58=MsgSeqNum too low, expecting 6 but received 5|10=159|
8=FIX.4.4|9=76|35=A|34=7|49=INIT|52=20210525-22:05:40.882|56=ACC|369=1|98=0|108=90|10=223|
8=FIX.4.4|9=71|35=A|34=17|49=ACC|52=20210525-22:05:40.996|56=INIT|98=0|108=90|10=000|
8=FIX.4.4|9=68|35=2|34=18|49=ACC|52=20210525-22:05:40.997|56=INIT|7=6|16=0|10=090|
8=FIX.4.4|9=107|35=4|34=6|43=Y|49=INIT|52=20210525-22:05:41.031|56=ACC|122=20210525-22:05:41.031|369=18|36=9|123=Y|10=008|
8=FIX.4.4|9=73|35=2|34=8|49=INIT|52=20210525-22:05:41.015|56=ACC|369=1|7=2|16=0|10=032|
8=FIX.4.4|9=108|35=5|34=19|49=ACC|52=20210525-22:05:41.046|56=INIT|58=MsgSeqNum too low, expecting 9 but received 8|10=160|

Root cause

I believe I have identified the root cause of the issue.

There are essentially three goroutines involved in processing messages in quickfixgo (a simplified sketch follows the list):

  • the read loop (go readLoop()), which parses incoming messages and sends them to the messageIn channel.
  • the processing loop (session.run()), which selects on either messageIn (a message to be processed) or messageEvent (a signal that a message is in the sending queue)
  • the write loop (go writeLoop())

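For context, here is a minimal, self-contained sketch of that structure. It is not the actual quickfixgo code: only the names messageIn, messageEvent, readLoop and session.run come from this issue, and plain strings stand in for parsed FIX messages.

```go
package main

import "fmt"

func main() {
	// Channel names come from the issue; everything else is illustrative.
	messageIn := make(chan string)         // parsed inbound messages, fed by the read loop
	messageEvent := make(chan struct{}, 1) // signal: "there is a message in the send queue"
	var toSend []string                    // outbound queue, drained when messageEvent fires
	done := make(chan struct{})

	// readLoop(): parses bytes off the wire and hands complete messages to the session.
	go func() {
		for _, m := range []string{"Logon", "ResendRequest"} {
			messageIn <- m
		}
		close(done)
	}()

	// session.run(): the single goroutine that owns session state. On every iteration
	// it races between an inbound message and the queued-outbound signal.
	for {
		select {
		case m := <-messageIn:
			fmt.Println("FixMsgIn:", m)
			if m == "Logon" {
				// Step 1 below: queue our own ResendRequest and signal messageEvent.
				toSend = append(toSend, "ResendRequest")
				messageEvent <- struct{}{}
			}
			// Other handlers may instead write a reply to the wire immediately (step 2.b).
		case <-messageEvent:
			// Step 2.a below: drain the queue; the write loop does the actual I/O.
			fmt.Println("send queued:", toSend)
			toSend = nil
		case <-done:
			return
		}
	}
}
```

Because select picks among ready cases in no guaranteed order, the order in which the two cases fire is not deterministic; this is exactly the race described in the steps below.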
Step 1

When the acceptor's Logon message is received, we are in the LogonState.
The Logon message is written to the messageIn channel; the session reads it and forwards it to the FixMsgIn() of the LogonState.
The FixMsgIn() of the logon state essentially creates a ResendRequest, which is simply put in the queue of messages to send (since we are not yet fully logged on), and triggers the messageEvent signal.
This puts us in the resendState, which in our case will simply proxy the next FixMsgIn to the loggedOn state, as there is nothing else to do.

Step 2.a

The session gets the messageEvent signal from step 1, and since the resendState is considered a loggedOn state, it dequeues the single queued ResendRequest and immediately sends it.

Step 2.b

The readLoop gets the acceptor's ResendRequest and writes it to the messageIn channel.
The session gets the message and submits it to the FixMsgIn of the loggedOn state. This creates a SequenceReset (gap fill) and IMMEDIATELY sends it as bytes.

Issue

The issue is that steps 2.a and 2.b are concurrent: during the time it takes to perform step 1, the readLoop may already have written the acceptor's ResendRequest to messageIn, which causes step 2.b to be executed before step 2.a.
The problem is that, in that case, step 2.b ignores the fact that there is already a message to send in the queue (the ResendRequest) and bypasses it.
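Concretely, with the sequence numbers from the first half of the log above, the bad interleaving can be written out as a plain sequential sketch (illustrative only, not library code):

```go
package main

import "fmt"

func main() {
	// Step 1 queued the initiator's own ResendRequest; it carries MsgSeqNum 5 (34=5).
	queued := []string{"ResendRequest (34=5)"}

	// Step 2.b wins the race: the acceptor's ResendRequest is answered at once with a
	// gap fill promising that the next message will carry MsgSeqNum 6 (35=4, 34=1, 36=6),
	// bypassing the queue.
	fmt.Println("sent immediately:", "SequenceReset/GapFill (34=1, 36=6)")

	// Step 2.a then drains the queue and sends MsgSeqNum 5, which is now below the
	// promised 6, so the acceptor rejects with "MsgSeqNum too low, expecting 6 but received 5".
	for _, m := range queued {
		fmt.Println("sent from queue:", m)
	}
}
```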

Possible remediation

If step 2.b happens first and, instead of immediately sending the SequenceReset, the session first sends the queued ResendRequest, no out-of-order messages are sent.
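A minimal sketch of that idea, in terms of the illustration above rather than the actual quickfixgo code (the session type and both method names here are hypothetical): before writing any immediate reply, first flush whatever is already sitting in the outbound queue, so the wire order matches the queue order.

```go
package main

import "fmt"

// session is a toy stand-in for the real session object; only the idea of
// "flush the queue before any immediate send" is meant to carry over.
type session struct {
	toSend []string // messages queued but not yet written to the wire
}

// sendQueued drains the outbound queue in order.
func (s *session) sendQueued() {
	for _, m := range s.toSend {
		fmt.Println("send:", m)
	}
	s.toSend = nil
}

// sendImmediate is what step 2.b would do with the remediation applied: the
// reply is no longer written straight to the wire; anything already queued
// (the ResendRequest from step 1) goes out first.
func (s *session) sendImmediate(m string) {
	s.sendQueued()
	fmt.Println("send:", m)
}

func main() {
	s := &session{toSend: []string{"ResendRequest"}}
	// The acceptor's ResendRequest arrived first; its reply no longer jumps the queue.
	s.sendImmediate("SequenceReset/GapFill")
}
```

The sketch only illustrates the ordering; the actual change in the linked fix (#440) may differ.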
