Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stale file handle and Stream Closed issues after session end time #540

Open
ajinkyatankhiwale1510 opened this issue Aug 29, 2022 · 12 comments
Labels

Comments

@ajinkyatankhiwale1510
Copy link

Describe the bug
We have two nodes(different physical machines) working in active-active mode, where only one node is supposed to handle all incoming traffic. In case of failover, F5 would transfer all the traffic to node 2 and then node 2 will refresh its state and continue processing. Both nodes share the same store settings and files. Its observed that during daily end time at 00:00:00 UTC. Node 1 first throws stale file handle error and after some time throws stream closed error indefinitely. End user then cannot establish FIX session, until both node 1 and node 2 are restarted. It's our requirement to have file based store.

20220824-00:00:00.582: Error reading/writing in MessageStore
java.io.IOException: Stale file handle
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at quickfix.FileStore.set(FileStore.java:373)
at quickfix.SessionState.set(SessionState.java:301)
at quickfix.Session.persist(Session.java:2563)
at quickfix.Session.sendRaw(Session.java:2622)
at quickfix.Session.generateLogout(Session.java:1479)
at quickfix.Session.generateLogout(Session.java:1448)
at quickfix.Session.reset(Session.java:884)
at quickfix.Session.resetIfSessionNotCurrent(Session.java:3022)
at quickfix.Session.next(Session.java:1921)
at quickfix.mina.SessionConnector$SessionTimerTask.run(SessionConnector.java:347)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

and then

20220824-00:06:50.046: Error during message processing
quickfix.RuntimeError: quickfix.RuntimeError: java.io.IOException: Stream Closed
at quickfix.Session.next(Session.java:1181)
at quickfix.Session.next(Session.java:1228)
at quickfix.mina.ThreadPerSessionEventHandlingStrategy$MessageDispatchingThread.doRun(ThreadPerSessionEventHandlingStrategy.java:222)
at quickfix.mina.ThreadPerSessionEventHandlingStrategy$ThreadAdapter.run(ThreadPerSessionEventHandlingStrategy.java:146)
at java.lang.Thread.run(Thread.java:750)
Caused by: quickfix.RuntimeError: java.io.IOException: Stream Closed
at quickfix.SessionState.reset(SessionState.java:387)
at quickfix.Session.resetState(Session.java:2665)
at quickfix.Session.reset(Session.java:887)
at quickfix.Session.resetIfSessionNotCurrent(Session.java:3022)
at quickfix.Session.nextLogon(Session.java:2119)
at quickfix.Session.next(Session.java:1050)
... 4 more
Caused by: java.io.IOException: Stream Closed
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
at quickfix.FileStore.close(FileStore.java:218)
at quickfix.FileStore.close(FileStore.java:209)
at quickfix.FileStore.closeAndDeleteFiles(FileStore.java:223)
at quickfix.FileStore.initialize(FileStore.java:101)
at quickfix.FileStore.reset(FileStore.java:405)
at quickfix.SessionState.reset(SessionState.java:385)
... 9 more
Suppressed: java.io.IOException: Stream Closed
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
... 15 more
Cause: java.io.IOException: Stream Closed
quickfix.RuntimeError: java.io.IOException: Stream Closed
at quickfix.SessionState.reset(SessionState.java:387)
at quickfix.Session.resetState(Session.java:2665)
at quickfix.Session.reset(Session.java:887)
at quickfix.Session.resetIfSessionNotCurrent(Session.java:3022)
at quickfix.Session.nextLogon(Session.java:2119)
at quickfix.Session.next(Session.java:1050)
at quickfix.Session.next(Session.java:1228)
at quickfix.mina.ThreadPerSessionEventHandlingStrategy$MessageDispatchingThread.doRun(ThreadPerSessionEventHandlingStrategy.java:222)
at quickfix.mina.ThreadPerSessionEventHandlingStrategy$ThreadAdapter.run(ThreadPerSessionEventHandlingStrategy.java:146)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Stream Closed
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
at quickfix.FileStore.close(FileStore.java:218)
at quickfix.FileStore.close(FileStore.java:209)
at quickfix.FileStore.closeAndDeleteFiles(FileStore.java:223)
at quickfix.FileStore.initialize(FileStore.java:101)
at quickfix.FileStore.reset(FileStore.java:405)
at quickfix.SessionState.reset(SessionState.java:385)
... 9 more
Suppressed: java.io.IOException: Stream Closed
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
... 15 more

To Reproduce
Node 1 and Node 2 has Start Time and End Time as 00:00:00-UTC - 00:00:00-UTC sharing same FileStorePath (configured for failover scenario) and an active FIX connection on node 1, during session end time, it's sometimes seen that file resetting does not work as expected.

Expected behavior
In case of failover setup, if members are connected to FIX environment. Session reset should happen seamlessly without any error.

system information:

  • OS: Red Hat Enterprise Linux 8.6 (Ootpa)
  • Java version : openjdk 8
  • QFJ Version 2.3.0

Additional context
It works well with single node, if both nodes are up and running then the issue is seen.

@chrjohn
Copy link
Member

chrjohn commented Aug 29, 2022

Seems to be related to #399

Any idea how this could be reproduced without two nodes since I don't have such a setup?
Could you please share your session settings?

@ajinkyatankhiwale1510
Copy link
Author

Node 1(machine 1) fix-acceptor-settings.cfg
[DEFAULT]
ConnectionType=acceptor
DataDictionary=./FIX44.xml
SenderCompID=XXXXX
ThreadModel=ThreadPerSession
BeginString=FIX.4.4
FileIncludeMilliseconds=Y
FileIncludeTimeStampForMessages=Y
FileLogPath=/local/fix-engine-acceptor_node/FIX_LOG <- Logs generated on node 1 /local/ directory
FileStorePath=/home/FIX_Seq_Store <- Store generated in /home/ directory, is a shared directory between node 1 and node 2
StartTime=00:00:00
EndTime=00:00:00
SocketUseSSL=Y
SocketKeyStore=/home/FIXClient.ks
SocketKeyStorePassword=clientpass
RefreshOnLogon=Y

[SESSION]
TargetCompID=ABCDE
SocketAcceptPort=12002

Node 2(machine 2 failover) fix-acceptor-settings.cfg
[DEFAULT]
ConnectionType=acceptor
DataDictionary=./FIX44.xml
SenderCompID=XXXXX
ThreadModel=ThreadPerSession
BeginString=FIX.4.4
FileIncludeMilliseconds=Y
FileIncludeTimeStampForMessages=Y
FileLogPath=/local/fix-engine-acceptor_node/FIX_LOG <- Logs generated on node 2 /local/ directory
FileStorePath=/home/FIX_Seq_Store <- Store generated in /home/ directory, is a shared directory between node 1 and node 2
StartTime=00:00:00
EndTime=00:00:00
SocketUseSSL=Y
SocketKeyStore=/home/FIXClient.ks
SocketKeyStorePassword=clientpass
RefreshOnLogon=Y

[SESSION]
TargetCompID=ABCDE
SocketAcceptPort=12002

Without two nodes its difficult to reproduce, on single node, starting two instances of acceptor and giving different SocketAcceptPort would be the identical scenario in this case. e.g. node 2's SocketAcceptPort can be 22002

@chrjohn
Copy link
Member

chrjohn commented Aug 29, 2022

Are you able to reproduce this in a test environment? If yes, could you test a pull request branch in your test environment?
The only solution I could come up with is to ignore the IOExceptions on closing the FileStore's files. This will help with the Stream Closed exceptions on closing the store but not with the Stale file handle exceptions (but actually I don't think we can do anything about them anyway).

@ajinkyatankhiwale1510
Copy link
Author

Yes, we were able to replicate the scenario in test environment, we used two separate machines hosted our acceptor on both machines and had FileStore shared between machines. However, I'm not sure if I could test any pull request in our environment. could you please elaborate your solution of ignoring the IOException and how it will help us to establish FIX session again?

@chrjohn
Copy link
Member

chrjohn commented Aug 30, 2022

Currently the IOException causes the message store to never close the old store and open new files for storing. This error then happens on every access to the store, as you reported.
On ignoring the Exception on closing, QFJ should be able to open new files for the message store.

@ajinkyatankhiwale1510
Copy link
Author

ajinkyatankhiwale1510 commented Aug 30, 2022

Thanks for your explanation, could you please let me know how can we fix this issue? Generally, in case of IOException there should be a finally block which must close this files or this is something which is currently missing? For time being, do you suggest any settings which would not reset the session, so such IOException do not occur? let's say using NonStopSession

@chrjohn
Copy link
Member

chrjohn commented Aug 31, 2022

Generally, in case of IOException there should be a finally block which must close this files or this is something which is currently missing

Currently the FileStore tries to close the files but there is an IOException thrown on closing. IMHO this exception can be ignored when closing the store since we can assume that the file has been moved/deleted or the like. When we catch that exception the store will subsequently create new files for the store and normal operation should be possible again.

Using NonStopSession is a good idea BTW, that should prevent the problem with the automated reset. But you need to reset the sequence numbers at some point so eventually might run into the same problem, don't you? Unless you can make sure that there is no failover taking place when trying to reset.

@ajinkyatankhiwale1510
Copy link
Author

Sequence reset can be done by initiator at SessionLogon, and that is perfectly fine for us. I could try NonStopSession and see if that helps our case.

Otherwise, we will have to somehow handle the exception to ignore IOException. Do you have any idea in that case what can be done?

@chrjohn
Copy link
Member

chrjohn commented Sep 5, 2022

Otherwise, we will have to somehow handle the exception to ignore IOException. Do you have any idea in that case what can be done?

I am not sure, but you could try to extend the FileStore and catch the thrown IOException in the reset() method and then re-initialize the store via the code that is in FileStore.initialize().

@ajinkyatankhiwale1510
Copy link
Author

do you see any issues with NonStopSession if the sequence number are not reset?

@chrjohn
Copy link
Member

chrjohn commented Oct 7, 2022

At some point (depending on how much messages you will send/receive) the sequence number will probably overflow (AFAIR we have a bug there). Moreover, the larger your store gets, the slower it will become. Potentially the counterparty could ask for a resend of all of your messages which could be billions of messages.

@ajinkyatankhiwale1510
Copy link
Author

I do agree with your approach of having our own implementation of FileStore, on that lines I have managed to implement MessageStore and in reset method of my implementation I have handled exceptions instead of throwing, do you think this would be a fine enough?
public class FixFileStore implements MessageStore, Closeable
...........
public void reset() {
try {
LOGGER.info("Resetting store files");
initialize(true);
} catch (IOException e) {
LOGGER.error("Error closing the store files, re-initialising store files again");
try {
close();
String mode = "rw" + (this.syncWrites ? "d" : "");
this.messageFileWriter = new RandomAccessFile(this.msgFileName, mode);
this.messageFileReader = new RandomAccessFile(this.msgFileName, "r");
this.senderSequenceNumberFile = new RandomAccessFile(this.senderSeqNumFileName, mode);
this.targetSequenceNumberFile = new RandomAccessFile(this.targetSeqNumFileName, mode);
initializeCache();
} catch (IOException e1) {
LOGGER.error("Error closing the store files in exception");
}
}
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants