
Aeron Publisher offer succeeds but subscribers do not receive any data #611

Closed
harnitbakshi opened this issue Jan 12, 2019 · 37 comments
@harnitbakshi

harnitbakshi commented Jan 12, 2019

Hi,
I ran this test on the latest version, 1.14.0, with Java 1.8.0_161-b12 on macOS. Please note the sequence of steps below:

  • I took the BasicPublisher class and changed the sleep to 300 ms instead of 1 second

  • Secondly, I set aeron.client.liveness.timeout=300000000 (300 ms instead of the 10-second default)

  • Then I started the MediaDriver (I used the low-latency media driver), the BasicPublisher, and two BasicSubscriber instances

  • What I noticed is that the BasicPublisher publishes and both subscribers receive data, but after some time both stop receiving. What is surprising is that the publisher does not notice this and keeps offering successfully. So the subscribers receive no data while the publisher carries on publishing

This only happens when I reduce the client liveness timeout. This is exactly what we found in our test environment as well.
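For reference, a minimal sketch of the driver side of this repro, assuming the liveness property is set programmatically before launching an embedded driver (the threading mode and idle strategy here stand in for the low-latency driver script and are illustrative):

```java
import io.aeron.driver.MediaDriver;
import io.aeron.driver.ThreadingMode;
import org.agrona.concurrent.BusySpinIdleStrategy;
import org.agrona.concurrent.ShutdownSignalBarrier;

public class ReproDriver
{
    public static void main(final String[] args)
    {
        // Value is in nanoseconds: 300,000,000 ns = 300 ms (default is 10 s).
        System.setProperty("aeron.client.liveness.timeout", "300000000");

        final MediaDriver.Context ctx = new MediaDriver.Context()
            .threadingMode(ThreadingMode.DEDICATED)             // low-latency style setup
            .conductorIdleStrategy(new BusySpinIdleStrategy());

        try (MediaDriver ignored = MediaDriver.launch(ctx))
        {
            new ShutdownSignalBarrier().await();                // run until SIGINT/SIGTERM
        }
    }
}
```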

Also the following error was printed in the error log file:

```
***
7 observations from 2019-01-12 20:48:36.354+0530 to 2019-01-12 20:49:14.939+0530 for:
 io.aeron.driver.exceptions.ControlProtocolException: Unknown Subscription: 3
	at io.aeron.driver.DriverConductor.onRemoveSubscription(DriverConductor.java:741)
	at io.aeron.driver.ClientCommandAdapter.onMessage(ClientCommandAdapter.java:136)
	at org.agrona.concurrent.ringbuffer.ManyToOneRingBuffer.read(ManyToOneRingBuffer.java:157)
	at io.aeron.driver.ClientCommandAdapter.receive(ClientCommandAdapter.java:64)
	at io.aeron.driver.DriverConductor.doWork(DriverConductor.java:154)
	at org.agrona.concurrent.AgentRunner.doDutyCycle(AgentRunner.java:268)
	at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:161)
	at java.lang.Thread.run(Thread.java:748)
```

Please let me know if this is a bug, and in the meantime, what is the best way to work around this problem? Let me know if you need any more details.

Thanks!

@mjpt777
Contributor

mjpt777 commented Jan 12, 2019

Thanks. Can you post the output from AeronStat when this happens?

@tmontgomery
Contributor

You did not change the client keepalive interval, so the client is only sending keepalives every 500 ms. Thus, after the initial subscriptions/publications, the clients will time out very quickly.

This will cause the publications and subscriptions to be removed in the driver.

Please do not adjust the client liveness without adjusting the rest of the timeouts and intervals in an appropriate way.
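As an illustration of "adjusting in an appropriate way", a hedged sketch that keeps the client keepalive interval well below the driver's liveness timeout (setter names are from the Java API; treat the exact values as assumptions):

```java
import io.aeron.Aeron;
import io.aeron.driver.MediaDriver;
import java.util.concurrent.TimeUnit;

public class ConsistentTimeouts
{
    public static void main(final String[] args)
    {
        // Driver side: how long the driver tolerates silence from a client.
        final MediaDriver.Context driverCtx = new MediaDriver.Context()
            .clientLivenessTimeoutNs(TimeUnit.SECONDS.toNanos(10));

        // Client side: how often the client sends keepalives.
        // Keep this many multiples below the liveness timeout.
        final Aeron.Context clientCtx = new Aeron.Context()
            .keepAliveIntervalNs(TimeUnit.MILLISECONDS.toNanos(500));

        try (MediaDriver driver = MediaDriver.launch(driverCtx);
             Aeron aeron = Aeron.connect(clientCtx))
        {
            // create publications/subscriptions from 'aeron' as usual
        }
    }
}
```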

@harnitb

harnitb commented Jan 13, 2019

Hi Todd, I will retest with the keepalive interval adjusted. But my point was that no exception was seen in either the publisher or the subscriber. In our test environment we have the client liveness timeout at around 30 seconds with the keepalive unchanged, and it still happened.

Will post another run shortly

@tmontgomery
Contributor

The inter-service timeout is there to tell the client that it has probably been timed out by the driver, i.e. to catch this situation. However, it is not a guarantee, and it operates on the assumption of a proper configuration (keepalive < timeout, and in relation to the slow tick cycle, etc.).

So, if you want to adjust the timeout of the driver (which sets the client inter-service timeout), then you will need to adjust other timeouts and intervals as well. Depending on what you are trying to do, it's not simply the liveness of the client to consider....

If you need a 30 second timeout, is this due to thread starvation or GC? If so, then you need to consider much more than simply client liveness for things to operate reliably.

@harnitbakshi
Author

Hi Todd,
I replicated this again with the following settings:

  • keepAlive was set to 250 ms
  • clientLiveness was set to 300 ms

What I see again is that the publisher happily offers successfully while the image becomes unavailable on all the subscribers.

So here are the questions I have:

  • If the liveness timeout has occurred and the media driver removes the publication and subscriptions, how is the publisher client able to offer successfully?

  • Yes, I had set the timeout to 30 seconds due to thread starvation or GC. Mind you, this situation has happened only twice in 6 months, and only under really heavy load when our engine is processing a lot of market data.

  • Lastly, my point is that I understand we need to tune our system to handle these edge cases, but there was no indication from the publisher, even though the subscribers were not receiving messages.

So what I am pointing out is that we are blind in production in this scenario :(.
Any thoughts or recommendations on this?

Martin, here is the aeron-stat dump you requested:
```
07:41:30 - Aeron Stat (CnC v14), pid 2751

 0: 14,624 - Bytes sent
 1: 12,096 - Bytes received
 2: 0 - Failed offers to ReceiverProxy
 3: 0 - Failed offers to SenderProxy
 4: 0 - Failed offers to DriverConductorProxy
 5: 0 - NAKs sent
 6: 0 - NAKs received
 7: 119 - Status Messages sent
 8: 123 - Status Messages received
 9: 173 - Heartbeats sent
10: 144 - Heartbeats received
11: 0 - Retransmits sent
12: 0 - Flow control under runs
13: 0 - Flow control over runs
14: 0 - Invalid packets
15: 0 - Errors
16: 0 - Short sends
17: 0 - Failed attempts to free log buffers
18: 1 - Sender flow control limits applied
19: 0 - Unblocked Publications
20: 0 - Unblocked Control Commands
21: 0 - Possible TTL Asymmetry
22: 0 - ControllableIdleStrategy status
23: 0 - Loss gap fills
24: 3 - Client liveness timeouts
31: 1 - rcv-channel: aeron:udp?endpoint=localhost:40123
32: 1,547,345,490,262 - client-heartbeat: 2
36: 1,547,345,490,437 - client-heartbeat: 5
```

@tmontgomery
Contributor

Liveness is checked on the slow tick.... roughly once a second. Timing with settings like this can easily cause the same situation to occur. The liveness timeout needs to be 5x to 10x (or more) the keepalive interval. And since it is checked each second (assuming no starvation), it should be a minimum of several seconds.

> If the liveness timeout has occurred and the media driver removes the publication and subscriptions, how is the publisher client able to offer successfully?

It can because the client is keeping the logbuffer around. The driver is done with it since it has timed out. As far as the driver is concerned, the client is done and gone. That is why the timeout needs to be long and the keepalive interval short.

> Yes, I had set the timeout to 30 seconds due to thread starvation or GC. Mind you, this situation has happened only twice in 6 months, and only under really heavy load when our engine is processing a lot of market data.

30 seconds is a long time, as I am sure you know. Heck, 10 seconds is a very, very long time. The best solution is to keep the system responsive. Compensating for long pauses like this has to, as I have said, account for quite a few timeouts, etc. to keep things alive.... like Images, etc.

> Lastly, my point is that I understand we need to tune our system to handle these edge cases, but there was no indication from the publisher, even though the subscribers were not receiving messages.

So, are you positive that the client didn't get an inter-service timeout on the publisher? It is possible, due to timing, for the client to miss the service timeout while the driver times out the client, but it is unlikely....

Also, you said the publisher Images went unavailable on the subscription side.... why is that not an error for the system? Is it expected?

@harnitbakshi
Author

harnitbakshi commented Jan 13, 2019

Hi Todd,
Thanks so much for your answers

> So, are you positive that the client didn't get an inter-service timeout on the publisher? It is possible, due to timing, for the client to miss the service timeout while the driver times out the client, but it is unlikely....

I did not see any inter-service timeout, which I was expecting to see as well.

> Also, you said the publisher Images went unavailable on the subscription side.... why is that not an error for the system? Is it expected?

Actually, when this first happened, it was during the shutdown/startup phase of all our applications on our test machine. Also, when this happens, I have noticed that bouncing the subscribers does not help until the publisher is bounced; the subscribers simply wait for a subscription.

Also, in the test code for BasicPublisher there is code that checks publication.isConnected, so would this also never detect the disconnection from the subscribers?

Usually an unavailable image is an alert we do receive. I am just wondering what our options are here:

  • What is the best way to detect live publications and live subscribers in a pub-sub environment?

  • Is there any way to recover from such a situation, i.e. detect a dead publisher on the publisher side, then close and restart the publisher programmatically? At this point the offer returns normally.....

  • Or is the only other option to monitor alerts and have someone bounce processes manually? This would not be a great option.

Please let me know, thanks!

@mjpt777
Contributor

mjpt777 commented Jan 13, 2019

> 24: 3 - Client liveness timeouts

This indicates the clients have been detected as not alive and thus will be cleaned up. You can see no counters for publications or subscriptions, as they have been cleaned up. When subscriptions go away, the publication does not instantly know, because there is a flow control buffer in use. They will get back pressured once this buffer is exhausted, or when the status message timeout occurs from the receiver, which is 5 s by default.

I do not believe you are seeing a bug; it is a case of misconfiguration in an environment without sufficient resources to run the tests you require.
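To make the offer semantics concrete, a minimal sketch of checking the result codes returned by Publication.offer (the constants are from io.aeron.Publication; the handling in each branch is illustrative):

```java
import io.aeron.Publication;
import org.agrona.DirectBuffer;

public final class OfferResultCheck
{
    /** Returns true if the message was accepted by the log buffer. */
    public static boolean offerWithChecks(
        final Publication publication, final DirectBuffer buffer, final int length)
    {
        final long result = publication.offer(buffer, 0, length);
        if (result > 0)
        {
            return true;                                 // accepted: result is the new position
        }
        else if (result == Publication.BACK_PRESSURED)
        {
            // flow control window exhausted: back off and retry
        }
        else if (result == Publication.NOT_CONNECTED)
        {
            // no connected subscribers, e.g. they were timed out and cleaned up
        }
        else if (result == Publication.ADMIN_ACTION)
        {
            // driver is performing an admin action (e.g. log rotation): retry
        }
        else if (result == Publication.CLOSED)
        {
            // publication has been closed: recreate it or shut down
        }
        return false;
    }
}
```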

@harnitbakshi
Author

> They will get back pressured once this buffer is exhausted, or when the status message timeout occurs from the receiver, which is 5 s by default.

Hi Martin, I do not see this happening.

@mjpt777
Contributor

mjpt777 commented Jan 13, 2019

Are you saying you can offer beyond 5s of the subscribers being disconnected?

@harnitbakshi
Author

I could offer for well over 5 seconds; I waited over a minute and the publisher still offered successfully.

@mjpt777
Contributor

mjpt777 commented Jan 13, 2019

@harnitbakshi Can you try with the build from master?

@harnitbakshi
Author

Hi Martin, I will run the tests again in a while and let you know, thanks.

@harnitbakshi
Author

harnitbakshi commented Jan 13, 2019

Hi Martin,
I ran the tests again. This time it's a bit better, but still not completely correct. I see the following:

  • Both runs employ one Publisher and two Subscribers; the transport is IPC.
  • In the first run, the Publisher detects that no subscriber is connected, i.e. the return code for offer is NOT_CONNECTED. One subscriber prints that the image is unavailable; the other prints nothing and simply stops receiving messages. After I kill both subscribers and restart them, they can never get a subscription, while the Publisher keeps printing NOT_CONNECTED. Only after I bounce the Publisher can I get a subscription going again.
  • In the second run, after some time the Publisher returns BACK_PRESSURED for the rest of the test. One subscriber prints that the image is unavailable; the other prints nothing and simply stops receiving messages. Even after I kill both subscribers, the Publisher keeps printing BACK_PRESSURED. After I restart the subscribers, neither can get a subscription. Only after I bounce the Publisher can I get a subscription going again.

Is this what you expect to happen here? If so, how could someone recover from such a scenario?

@tmontgomery
Contributor

tmontgomery commented Jan 13, 2019

> What is the best way to detect live publications and live subscribers in a pub-sub environment?

Via timeout of activity. The application needs to have a means of detecting communication failure. Relying on the transport to ALWAYS provide an error is not going to happen. This is not just Aeron; this is TCP as well. If you don't build this into the application protocol, the application system will not be robust to failure (see the sketch after this comment).

> Is there any way to recover from such a situation, i.e. detect a dead publisher on the publisher side, then close and restart the publisher programmatically? At this point the offer returns normally.....

Lack of activity should pretty much always be the signal. And usually the result is an application restart.... or a graceful (or as graceful as possible) shutdown. Depends on the application.

> Or is the only other option to monitor alerts and have someone bounce processes manually? This would not be a great option.

I agree. I would build activity detection into the application protocol and then use that.... Honestly, there are enough edge cases (I know this is IPC, but you should treat it as UDP anyway) that Aeron will not be able to detect and signal, so the application needs its own way to know when communication is not possible and to attempt a shutdown or restart.

The symptom you are seeing, where Publication.offer is a silent failure, is not uncommon in a net-split scenario, or in a situation where the Subscription logic has hung. The logbuffers are big and they take a while to drain (if they drain at all). Relying on Aeron (or TCP) to always inform you of an error is not a viable solution for the system. You can use errors as indications of problems before a timeout happens, but relying solely on them is doomed to miss a LOT of problems.
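A minimal sketch of such application-level activity detection, assuming the subscriber's fragment handler records receive times and a duty cycle polls a deadline; all names here are hypothetical, not Aeron API:

```java
final class ActivityMonitor
{
    private final long timeoutNs;
    private volatile long lastActivityNs;

    ActivityMonitor(final long timeoutNs)
    {
        this.timeoutNs = timeoutNs;
        this.lastActivityNs = System.nanoTime();
    }

    /** Call from the fragment handler on receipt (or after a sent heartbeat is acknowledged). */
    void onActivity()
    {
        lastActivityNs = System.nanoTime();
    }

    /** Poll from a duty cycle; when false, escalate: restart or shut down gracefully. */
    boolean isAlive()
    {
        return System.nanoTime() - lastActivityNs < timeoutNs;
    }
}

// Usage: new ActivityMonitor(java.util.concurrent.TimeUnit.SECONDS.toNanos(5))
```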

@tmontgomery
Contributor

@mjpt777 the change you made to clamp pub-lmt only applies until the Counter is reclaimed, which is 1 second. So, while it can help make the window smaller, it can't eliminate the possibility (especially with a 30-second client timeout as originally mentioned).

At the end of the day, the InterServiceTimeout can say the driver has possibly timed out the client. But it is unreliable. The driver can time out the client without it firing, just due to timing.

I think what we should do is have a client keepalive for an unknown client generate an onError from the driver. A client that sees that error then knows it has timed out. Use clientId for the offending id. We can also add a system counter for tracking those if we wish.

If we simply add a client timeout event to the toClients buffer, that can help, but it could also be lost by a client (unlikely, but possible). A response on each keepalive means one will eventually get through.

@mjpt777
Contributor

mjpt777 commented Jan 13, 2019

@tmontgomery I was thinking along similar lines, but more reactive would be to send a new client close message, for a particular client id, when that client is timed out.

The clamping of the window is, I think, very useful in the network case, as the publication lingers. Less useful, as you point out, in the IPC case, but still of some, if limited, value.

@tmontgomery
Contributor

A client close is a one-shot and can be lost, as I mentioned. We can still do it, though. But an error on client keepalive could act as a backup.

@mjpt777
Contributor

mjpt777 commented Jan 13, 2019

How can it be lost via the broadcast buffer? The error would surely come later?

@tmontgomery
Contributor

Broadcast buffer is unreliable. Messages may be lost.

@tmontgomery
Contributor

It is unlikely, yes. But in the event of a lot of activity, say a lot of timeouts happening at the same time along with a lot of other events, it is quite possible it would be lost due to a client not keeping up.

@mjpt777
Contributor

mjpt777 commented Jan 13, 2019

Messages are only lost with the broadcast buffer on wrap, which is highly unlikely given the traffic, and if it does happen there is an exception and the client should end, which is what we want.

@tmontgomery
Contributor

That exception comes in, I think, via the error handler. I'm not sure that all applications will see that as a terminating error.

@mjpt777
Contributor

mjpt777 commented Jan 13, 2019

I'm testing a change that will make the wrap situation cleaner.

@tmontgomery
Contributor

Well, if we force a close of the client on a wrap, then we can just send a message on client timeout to notify the client. If it's wrapped, then the close will catch that anyway.

@mjpt777
Contributor

mjpt777 commented Jan 13, 2019

Close on wrap feels like the right thing anyway. Makes no sense to continue when you don't know what you missed.

@tmontgomery
Contributor

Agreed.

@tmontgomery
Contributor

I will handle the client timeout notification. But it'll have to wait until tomorrow.

@tmontgomery tmontgomery self-assigned this Jan 13, 2019
@harnitbakshi
Author

Hello, could you please tell me what is meant by close on wrap?

@mjpt777
Contributor

mjpt777 commented Jan 15, 2019

> Hello, could you please tell me what is meant by close on wrap?

The driver will broadcast events to connected clients. In the unlikely event that the client has been inactive for longer than a full revolution of the buffer used for broadcasting events, the client detects the wrap of this buffer and will automatically close itself.

@harnitbakshi
Author

Got it thanks

@tmontgomery
Contributor

I've added a client timeout notification to the Java driver. Will add it to the C driver today.

@tmontgomery
Contributor

The C driver now generates the event as well, as of a3be729.

@tmontgomery
Contributor

Going to close this now. The generation of the client timeout event should help this situation. But note that the client must read the broadcast, so it will not be immediate, and it means that the client conductor thread must eventually be given some time to run.
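A hedged sketch of consuming that notification on the client side, assuming it surfaces through the Aeron.Context errorHandler as an io.aeron.exceptions.ClientTimeoutException (the class name is per the confirmation in the next comment):

```java
import io.aeron.Aeron;
import io.aeron.exceptions.ClientTimeoutException;

public class TimeoutAwareClient
{
    public static void main(final String[] args)
    {
        final Aeron.Context ctx = new Aeron.Context()
            .errorHandler(throwable ->
            {
                if (throwable instanceof ClientTimeoutException)
                {
                    // The driver has timed out this client: its publications and
                    // subscriptions are gone, so close and re-create the client.
                }
            });

        try (Aeron aeron = Aeron.connect(ctx))
        {
            // normal publication/subscription work
        }
    }
}
```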

@harnitbakshi
Author

Hi Todd/Martin,
Thank you for turning this around so quickly, much appreciated. I have tested this and I can see the ClientTimeoutException being received via the errorHandler. Also, the Publication is closed on wrap.

@harnitbakshi
Author

Hi Todd/Martin,
When do you think the next release will be? And is there any chance of backporting this fix to 1.10.x? That's the version we are currently on.

Please let me know, thank you.

@mjpt777
Contributor

mjpt777 commented Jan 16, 2019

We will likely do a new release in a few weeks' time. Sorry, but we do not offer backports unless it is for a customer with a support contract.
