Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disconnection for unexpected timeout on heartbeat response to a TestRequest #569

Closed
annacochetti opened this issue Nov 16, 2022 · 14 comments
Assignees
Labels

Comments

@annacochetti
Copy link

annacochetti commented Nov 16, 2022

In our production system we have an acceptor and an initiator both QFJ based.
At a certain time the initiator logged
errorEvent -> logError :149 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: Disconnecting: Timed out waiting for heartbeat
and abruptly disconnected the session,
while the hb time interval was set to 30 seconds and the last hb has been received 27 seconds before as a response to a test request (and was not the only message sent on the session)
After this issue the initiator was not able to understand that the acceptor was responding to the Logon messages and kept logging
errorEvent -> logError :149 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: Disconnecting: Timed out waiting for logon response

The complete log portion is:
2022-11-11 09:08:52,055 INFO [QFJ Timer] outgoing -> send :2572 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: 8=FIXT.1.1|9=81|35=0|34=348|49=TRADESTAC_TRANSACTION_TW|52=20221111-08:08:52.055|56=TRADESTAC_TW|10=123|
2022-11-11 09:10:52,952 INFO [NioProcessor-44] incoming -> messageReceived :129 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: 8=FIXT.1.1|9=81|35=0|34=409|49=TRADESTAC_TW|52=20221111-08:10:52.952|56=TRADESTAC_TRANSACTION_TW|10=120|
[...] // Test request
2022-11-11 09:10:55,052 INFO [QFJ Timer] outgoing -> send :2572 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: 8=FIXT.1.1|9=90|35=1|34=368|49=TRADESTAC_TRANSACTION_TW|52=20221111-08:10:55.052|56=TRADESTAC_TW|112=TEST|10=137|
[...] // Many execution reports
2022-11-11 09:10:55,052 INFO [QFJ Timer] event -> ? :? - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: Sent test request TEST
2022-11-11 09:10:55,053 INFO [NioProcessor-44] incoming -> messageReceived :129 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: 8=FIXT.1.1|9=90|35=0|34=410|49=TRADESTAC_TW|52=20221111-08:10:55.053|56=TRADESTAC_TRANSACTION_TW|112=TEST|10=125|
2022-11-11 09:10:56,084 INFO [NioProcessor-44] incoming -> messageReceived :129 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: 8=FIXT.1.1|9=574|35=8|34=411|49=TRADESTAC_TW|52=20221111-08:10:56.084|56=TRADESTAC_TRANSACTION_TW|6=0|[...]
2022-11-11 09:11:15,446 INFO [NioProcessor-44] incoming -> messageReceived :129 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: 8=FIXT.1.1|9=852|35=8|34=412|49=TRADESTAC_TW|52=20221111-08:11:15.446|56=TRADESTAC_TRANSACTION_TW|6=0[...]
2022-11-11 09:11:17,466 INFO [NioProcessor-44] incoming -> messageReceived :129 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: 8=FIXT.1.1|9=703|35=8|34=413|49=TRADESTAC_TW|52=20221111-08:11:17.466|56=TRADESTAC_TRANSACTION_TW|6=0[...]
2022-11-11 09:11:19,452 INFO [NioProcessor-44] incoming -> messageReceived :129 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: 8=FIXT.1.1|9=812|35=8|34=414|49=TRADESTAC_TW|52=20221111-08:11:19.452|56=TRADESTAC_TRANSACTION_TW|6=0|[...]
2022-11-11 09:11:20,453 INFO [NioProcessor-44] incoming -> messageReceived :129 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: 8=FIXT.1.1|9=1007|35=8|34=415|49=TRADESTAC_TW|52=20221111-08:11:20.452|56=TRADESTAC_TRANSACTION_TW|6=0|[...]
2022-11-11 09:11:22,053 ERROR [QFJ Timer] errorEvent -> logError :149 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: Disconnecting: Timed out waiting for heartbeat
2022-11-11 09:11:22,751 INFO [NioProcessor-86] event -> ? :? - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: MINA session created: local=/127.0.0.1:65078, class org.apache.mina.transport.socket.nio.NioSocketSession, remote=localhost/127.0.0.1:8089
2022-11-11 09:11:23,052 INFO [QFJ Timer] outgoing -> send :2572 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: 8=FIXT.1.1|9=104|35=A|34=1|49=TRADESTAC_TRANSACTION_TW|52=20221111-08:11:23.052|56=TRADESTAC_TW|98=0|108=30|141=Y|1137=7|10=198|
2022-11-11 09:11:23,054 INFO [QFJ Timer] event -> ? :? - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: Initiated logon request
2022-11-11 09:11:23,057 INFO [NioProcessor-86] incoming -> messageReceived :129 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: 8=FIXT.1.1|9=104|35=A|34=1|49=TRADESTAC_TW|52=20221111-08:11:23.056|56=TRADESTAC_TRANSACTION_TW|98=0|108=30|141=Y|1137=7|10=202|
2022-11-11 09:11:23,058 INFO [NioProcessor-86] event -> ? :? - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: Setting DefaultApplVerID (1137=7) from Logon
2022-11-11 09:11:34,051 ERROR [QFJ Timer] errorEvent -> logError :149 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: Disconnecting: Timed out waiting for logon response
2022-11-11 09:11:34,798 INFO [NioProcessor-100] event -> ? :? - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: MINA session created: local=/127.0.0.1:65112, class org.apache.mina.transport.socket.nio.NioSocketSession, remote=localhost/127.0.0.1:8089
2022-11-11 09:11:35,051 INFO [QFJ Timer] outgoing -> send :2572 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: 8=FIXT.1.1|9=104|35=A|34=1|49=TRADESTAC_TRANSACTION_TW|52=20221111-08:11:35.051|56=TRADESTAC_TW|98=0|108=30|141=Y|1137=7|10=200|
2022-11-11 09:11:35,053 INFO [QFJ Timer] event -> ? :? - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: Initiated logon request
2022-11-11 09:11:35,056 INFO [NioProcessor-100] incoming -> messageReceived :129 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: 8=FIXT.1.1|9=104|35=A|34=1|49=TRADESTAC_TW|52=20221111-08:11:35.054|56=TRADESTAC_TRANSACTION_TW|98=0|108=30|141=Y|1137=7|10=203|
2022-11-11 09:11:35,056 INFO [NioProcessor-100] event -> ? :? - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: Setting DefaultApplVerID (1137=7) from Logon
2022-11-11 09:11:45,052 ERROR [QFJ Timer] errorEvent -> logError :149 - FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW: Disconnecting: Timed out waiting for logon response
[...] // From this point on, last lines are repeated

We have no steps to reproduce this behaviour but we strongly suggest to look at the timers: sometimes they seem not to be reset correctly
We expect the reply logon message to be acknowledged and the session to restart correctly

system information:

  • OS: Windows Server
  • Java version JDK 8
  • QFJ Version 1.6.4

Additional context
Since this has been a production issue we cannot provide a unit test result.
We think that the issue is pretty frequent on the heartbeat, but this has been the first time we had the issue on the logon too.
Since we did not find any acknowledged bug on this behaviour we suspect that the QFJ version is not relevant.

@chrjohn
Copy link
Member

chrjohn commented Nov 16, 2022

Hi @annacochetti ,

thanks for the report.
Two things: could you add more info (the header with seqnum and sending time at least) of the ExecutionReports that you receive?
And could you please replace the SOH separator with the pipe symbol |. That will make reading the logs much easier.

Thanks,
Chris.

@chrjohn chrjohn self-assigned this Nov 16, 2022
@annacochetti
Copy link
Author

Done, thanks for picking it immediately

@chrjohn
Copy link
Member

chrjohn commented Nov 16, 2022

Some more questions:
Is this a new connection or are the problems only now showing?

How many FIX sessions do you run on your acceptor/initiator? Do they both only have one session?

What are you doing in your logic when you receive an ExecutionReport? Do you have some processing in your fromApp() callback that might block? This is the main reason for such problems because QFJ cannot process incoming messages (such as heartbeats) when something is blocking the message processing thread.

Is this problem always showing after reception of an application message, e.g. ExecReport?

@annacochetti
Copy link
Author

annacochetti commented Nov 16, 2022

The connections have been up and runnning for years now.
This is the first time that the issue has been reported because it has been followed by the second issue, more critical: in fact the initiator had to be restarted to solve the problem.
The acceptor accepts 2 sessions, both from the same initiator.
The initiator creates 4 more initiators to different acceptors.
All the other sessions run smoothly.
On that session the Acceptor sends only ExecutionReports as business messages, the initiator sends NewOrderSingle and OrderCancelRequest messages only
We suspect the test was the problem.
The fromApp() callback just queues the received ExceutionReport for a different thread that manages it.
I have looked backward almost 2 years in production logs and I did not found a single instance of a Test request like the one we had in this case.

@chrjohn
Copy link
Member

chrjohn commented Nov 16, 2022

Are you using one initiator instance that handles the various initiator sessions in one java process? What were the other initiator sessions doing around the time of the first disconnection? Were they connected?

Are you using a ThreadedSocketInitiator or SocketInitiator?

I think this is hard to analyse post mortem. IMHO it is obvious that something blocked the message processing thread, otherwise the heartbeats and the logon (after the disconnection) would have been processed.
But to prove this assumption one would need a thread dump the next time this happens.

@annacochetti
Copy link
Author

All the other sessions are working well. The only session affected is the FIXT.1.1:TRADESTAC_TRANSACTION_TW->TRADESTAC_TW
We are using the ThreadedSocketInitiator.

@chrjohn
Copy link
Member

chrjohn commented Nov 17, 2022

  1. And that ThreadedSocketInitiator handles all of your initiator sessions?
  2. And were all other session connected at that point when the problem happened?
  3. What happens between "2022-11-11 09:08:52,055" and "2022-11-11 09:10:55,052" in your log? There are only two messages sent out but the sequence number gap is 20 sequence numbers. Or did you remove messages there (you wrote "complete log portion")?

I checked the code and the message "Setting DefaultApplVerID (1137=7) from Logon" is logged before the message is queued for processing.

@Override
protected void processMessage(IoSession protocolSession, Message message) throws Exception {
final Optional<String> msgTypeField = message.getHeader().getOptionalString(MsgType.FIELD);
if (msgTypeField.isPresent() && msgTypeField.get().equals(MsgType.LOGON)) {
final SessionID sessionID = MessageUtils.getReverseSessionID(message);
if (sessionID.isFIXT()) {
if (message.isSetField(DefaultApplVerID.FIELD)) {
final ApplVerID applVerID = new ApplVerID(message.getString(DefaultApplVerID.FIELD));
quickfixSession.setTargetDefaultApplicationVersionID(applVerID);
quickfixSession.getLog().onEvent("Setting DefaultApplVerID (" + DefaultApplVerID.FIELD + "="
+ applVerID.getValue() + ") from Logon");
}
}
}
eventHandlingStrategy.onMessage(quickfixSession, message);
}

That supports my assumption of a blocked thread. Since you use the ThreadedSocketInitiator only the thread for that specific session will be blocked. But again, this can only be analysed with a thread dump. You said this happened for the first time, maybe you can create a thread dump before restarting the process if the problem occurred again?

Edit: by the way this could also be a bug in QFJ itself (your used version is some years old) but I cannot recall a specific bug at the moment.

Thanks,
Chris.

@chrjohn chrjohn changed the title Quickfixj abrupt disconnection for unexpected timeout on heartbeat response to a test message Disconnection for unexpected timeout on heartbeat response to a TestRequest Nov 17, 2022
@annacochetti
Copy link
Author

Each Session uses its own ThreadedSocketInitiator and tht's why we have only one session that has issues.
Due to your concerns I had a look to the messages received in the preceding minutes and I can see that the timestamp logged in the Quickfix log (FIX message) and in the log of the thread managing the message are increasingly different.
This could explain why the initiator sent the test request and why it did not manage the replies in time.
Nothing in the logs explains this slow response though. Nothing special happened, no messages received out of normal.

@chrjohn
Copy link
Member

chrjohn commented Nov 17, 2022

OK, so it is no longer an assumption that a thread is blocked. :)
I am afraid without a thread dump neither me nor you can tell where the problem really is.

@annacochetti
Copy link
Author

Agreed.

@annacochetti
Copy link
Author

annacochetti commented Nov 23, 2022

We found at last that we had incorrect and undetected repeated accesses to the database.
One more question is:
Should Quickfixj send a 35=5 before disconnecting when is not able to process the incoming messages?
Thanks

@chrjohn
Copy link
Member

chrjohn commented Nov 24, 2022

We found at last that we had incorrect and undetected repeated accesses to the database.

So can this thread be marked as "answered" then?

I actually thought the thread in your app couldn't be blocked because you said

The fromApp() callback just queues the received ExceutionReport for a different thread that manages it.

So I understood that you put the messages to a queue, instantly returning the processing back to fromApp(). But maybe I misunderstood.

Should Quickfixj send a 35=5 before disconnecting when is not able to process the incoming messages?

I am afraid that needs to be implemented by your custom application logic. You could maybe monitor the return value of initiator.getQueueSize() for this and call session.generateLogout() as appropriate. Maybe when queue size reaches a certain amount and doesn't shrink for a to be defined amount of time.
Hope that helps.

@annacochetti annacochetti closed this as not planned Won't fix, can't repro, duplicate, stale Nov 25, 2022
@annacochetti
Copy link
Author

I am closing this issue, since there is no bug here. Thanks for your help, Chris

@chrjohn
Copy link
Member

chrjohn commented Nov 25, 2022

You're welcome, Anna. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants