
Stalled Publication/Subscription #281

Closed

patriknw opened this issue Sep 29, 2016 · 22 comments

Comments

@patriknw

Version 1.0.1

I'm debugging a case where sub.poll suddenly doesn't receive any more messages, and it stays that way. On the sending side pub.offer is successful (positive return value and connected=true closed=false).

Only two nodes.

  • node1: 172-31-10-77
  • node2: 172-31-8-204

node1 tries to send to node2, but node2 is not started yet. ~30 seconds later node2 is started and they successfully exchange a few messages (I have application-level logs for that), and then node2 stops receiving messages in sub.poll. onFragment in the FragmentAssembler is not invoked even though I repeatedly call poll. The messages are below MTU size.

The systems are rather loaded in this scenario but not overloaded, and the load is stopped after a while.
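
For reference, the receive side described here is essentially a FragmentAssembler-driven poll loop. A minimal sketch is below; the handler, idle strategy, and running flag are illustrative placeholders, not the actual application code:

```java
import io.aeron.FragmentAssembler;
import io.aeron.Subscription;
import io.aeron.logbuffer.FragmentHandler;
import org.agrona.concurrent.BackoffIdleStrategy;
import org.agrona.concurrent.IdleStrategy;

import java.util.concurrent.atomic.AtomicBoolean;

final class PollLoop
{
    // 'sub' is the Subscription for the stream in question; how it is created is omitted here.
    static void run(final Subscription sub, final AtomicBoolean running)
    {
        final FragmentHandler onMessage = (buffer, offset, length, header) ->
            System.out.println("received " + length + " bytes at position " + header.position());

        // FragmentAssembler reassembles fragmented messages before invoking onMessage.
        final FragmentAssembler assembler = new FragmentAssembler(onMessage);
        final IdleStrategy idle = new BackoffIdleStrategy(100, 10, 1000, 1000);

        while (running.get())
        {
            // After the stall, poll keeps returning 0 and onFragment is never invoked.
            final int fragments = sub.poll(assembler, 10);
            idle.idle(fragments);
        }
    }
}
```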

AeronStat from node1 172-31-10-77:

 23:                    1 - recv-channel: aeron:udp?endpoint=ip-172-31-10-77:25520
 24:        1,627,656,896 - snd-pos: 8 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 25:                    1 - send-channel: aeron:udp?endpoint=ip-172-31-8-204:25520
 26:        1,636,045,504 - pub-lmt: 8 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 27:              170,432 - sub-pos: 1 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520 @0
 28:              170,432 - rcv-hwm: 7 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 29:           19,750,176 - sub-pos: 2 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520 @0
 30:           19,750,176 - rcv-hwm: 9 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 31:           19,750,176 - rcv-pos: 9 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 32:           28,953,216 - pub-lmt: 10 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
 33:           20,564,608 - snd-pos: 10 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520

AeronStat from node2 172-31-8-204:

 23:                    1 - recv-channel: aeron:udp?endpoint=ip-172-31-8-204:25520
 24:                    1 - send-channel: aeron:udp?endpoint=ip-172-31-10-77:25520
 25:            8,559,040 - pub-lmt: 3 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 26:              170,432 - snd-pos: 3 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 27:           28,138,784 - pub-lmt: 4 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 28:           19,750,176 - snd-pos: 4 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 29:                2,176 - sub-pos: 1 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
 30:        1,627,656,896 - rcv-hwm: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 31:                2,176 - rcv-pos: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 32:           20,500,992 - sub-pos: 2 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
 33:           20,500,992 - rcv-hwm: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
 34:           20,500,992 - rcv-pos: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520

The problematic session is 1287205435. The other streams seem to progress. I kept it running for minutes after the stall.

Stream 1 is our control stream and it's low traffic, a few messages per second on this stream.

I have all files, if you need more information.

@tmontgomery
Contributor

The sender and rcv-hwm are sitting at position 1,627,656,896. However, the subscription has only consumed up to 2,176 (which is also the rcv-pos).

How big is the term length?

Any other settings in use from the defaults?

Also, you mention this is low traffic. But did you really send something like 1.5GB of data on that channel?

@patriknw
Author

On stream 1 it's just a few heartbeat messages of a few hundred bytes each per second. It gets stuck almost immediately (only a few messages get through). Reproducing it is not deterministic (it happens in about 1 of 10 runs).

On stream 2 it's a lot more messages.

1.5 GB on the sender side. How can that be possible? I use offer with an UnsafeBuffer wrapping a ByteBuffer.allocateDirect of capacity 256 KiB (from our pool). Byte order is LITTLE_ENDIAN.

Term sizes not changed, using the default. All settings are default except a few increased timeouts (a configuration sketch follows the list):

  • client-liveness-timeout = 20 seconds
  • image-liveness-timeout = 10 seconds
  • driver-timeout = 20 seconds
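
A minimal sketch of how timeouts like these can be applied to an embedded driver and its client; the setter names come from the public MediaDriver.Context / Aeron.Context APIs and are an assumption for this exact Aeron version:

```java
import io.aeron.Aeron;
import io.aeron.driver.MediaDriver;

import java.util.concurrent.TimeUnit;

final class TimeoutConfig
{
    public static void main(final String[] args)
    {
        // Raise the driver-side liveness timeouts to match the values listed above.
        final MediaDriver.Context driverCtx = new MediaDriver.Context()
            .clientLivenessTimeoutNs(TimeUnit.SECONDS.toNanos(20))
            .imageLivenessTimeoutNs(TimeUnit.SECONDS.toNanos(10));

        try (MediaDriver driver = MediaDriver.launchEmbedded(driverCtx))
        {
            final Aeron.Context clientCtx = new Aeron.Context();
            clientCtx.aeronDirectoryName(driver.aeronDirectoryName()); // point the client at the embedded driver
            clientCtx.driverTimeoutMs(TimeUnit.SECONDS.toMillis(20));

            try (Aeron aeron = Aeron.connect(clientCtx))
            {
                // Publications/subscriptions would be created from 'aeron' here.
            }
        }
    }
}
```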

@tmontgomery
Contributor

The positions don't make a lot of sense, honestly.

Are you sure that all drivers and clients are using the same 1.0.1 version? It might also be worth a shot to try the latest snapshot/HEAD, just to see.

@tmontgomery
Contributor

@mjpt777 anything come to mind for you?

@mjpt777
Contributor

mjpt777 commented Sep 29, 2016

I think it would be worth looking at the publication and image buffers if you still have them. Also, what are the first 30 counters, so we can see the aggregate picture?

@tmontgomery I think it might be worth putting the Aeron version in the AeronStat output.

@patriknw
Author

I'm sure about the version number. We run this on fresh EC2 instances from our sbt build, and we have been using 1.0.1 for a long time.

Here are all files: https://dl.dropboxusercontent.com/u/14579199/test35.zip

@tmontgomery
Contributor

If the positions are correct, then looking at the images is going to be mostly fruitless. The hwm and sub-pos, for example, differ by many term buffers, surely violating the cleaning assumptions.

@tmontgomery
Contributor

tmontgomery commented Sep 29, 2016

The stats show only 20 MB sent/received, so the high position value is definitely off for some reason. The only cause I can think of would be corruption of the header in some way that implies a different position. This seems to be happening as the message is put into the term, since the snd-pos and pub-lmt are also corrupted. A possibility might be setting the reservedValue at an incorrect offset, or a tryClaim that has messed with the offset incorrectly.
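
For reference, the tryClaim/reservedValue pattern being speculated about looks roughly like the sketch below; writing outside the claimed offset/length, or committing a stale claim, is the kind of misuse that could scribble over a frame header. This is illustrative only, not the reporter's code:

```java
import io.aeron.Publication;
import io.aeron.logbuffer.BufferClaim;
import org.agrona.MutableDirectBuffer;

final class TryClaimExample
{
    // Illustrative only: writing outside [offset, offset + length) of the claim,
    // or committing a stale claim, could corrupt a frame header and imply a bogus position.
    static void send(final Publication publication)
    {
        final BufferClaim bufferClaim = new BufferClaim();
        final int length = 128;

        if (publication.tryClaim(length, bufferClaim) > 0)
        {
            try
            {
                final MutableDirectBuffer buffer = bufferClaim.buffer();
                final int offset = bufferClaim.offset();
                buffer.putLong(offset, 42L);      // must stay within the claimed range
                bufferClaim.reservedValue(123L);  // optional application value in the frame header
            }
            finally
            {
                bufferClaim.commit();             // or bufferClaim.abort() on failure
            }
        }
    }
}
```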

node-1

  0:           20,745,216 - Bytes sent
  1:           20,370,624 - Bytes received
  2:                    0 - Failed offers to ReceiverProxy
  3:                    0 - Failed offers to SenderProxy
  4:                    0 - Failed offers to DriverConductorProxy
  5:                   16 - NAKs sent
  6:                4,917 - NAKs received
  7:                3,340 - Status Messages sent
  8:                3,282 - Status Messages received
  9:                5,697 - Heartbeats sent
 10:                5,311 - Heartbeats received
 11:                    3 - Retransmits sent
 12:                    0 - Flow control under runs
 13:                    0 - Flow control over runs
 14:                    0 - Invalid packets
 15:                    0 - Errors
 16:                    0 - Short sends
 17:                  696 - Client keep-alives
 18:                    7 - Sender flow control limits applied
 19:                    0 - Unblocked Publications
 20:                    0 - Unblocked Control Commands
 21:                    0 - Possible TTL Asymmetry
 22:                    0 - ControllableIdleStrategy status
 23:                    1 - recv-channel: aeron:udp?endpoint=ip-172-31-10-77:25520
 24:        1,627,656,896 - snd-pos: 8 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 25:                    1 - send-channel: aeron:udp?endpoint=ip-172-31-8-204:25520
 26:        1,636,045,504 - pub-lmt: 8 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 27:              170,432 - sub-pos: 1 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520 @0
 28:              170,432 - rcv-hwm: 7 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 29:           19,750,176 - sub-pos: 2 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520 @0
 30:           19,750,176 - rcv-hwm: 9 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 31:           19,750,176 - rcv-pos: 9 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 32:           28,953,216 - pub-lmt: 10 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
 33:           20,564,608 - snd-pos: 10 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520

node-2

  0:           20,086,208 - Bytes sent
  1:           20,681,152 - Bytes received
  2:                    0 - Failed offers to ReceiverProxy
  3:                    0 - Failed offers to SenderProxy
  4:                    0 - Failed offers to DriverConductorProxy
  5:                4,917 - NAKs sent
  6:                   16 - NAKs received
  7:                3,282 - Status Messages sent
  8:                3,320 - Status Messages received
  9:                5,312 - Heartbeats sent
 10:                5,657 - Heartbeats received
 11:                   16 - Retransmits sent
 12:                    0 - Flow control under runs
 13:                    0 - Flow control over runs
 14:                    0 - Invalid packets
 15:                    0 - Errors
 16:                    0 - Short sends
 17:                  599 - Client keep-alives
 18:                    2 - Sender flow control limits applied
 19:                    0 - Unblocked Publications
 20:                    0 - Unblocked Control Commands
 21:                    0 - Possible TTL Asymmetry
 22:                    0 - ControllableIdleStrategy status
 23:                    1 - recv-channel: aeron:udp?endpoint=ip-172-31-8-204:25520
 24:                    1 - send-channel: aeron:udp?endpoint=ip-172-31-10-77:25520
 25:            8,559,040 - pub-lmt: 3 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 26:              170,432 - snd-pos: 3 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 27:           28,138,784 - pub-lmt: 4 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 28:           19,750,176 - snd-pos: 4 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 29:                2,176 - sub-pos: 1 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
 30:        1,627,656,896 - rcv-hwm: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 31:                2,176 - rcv-pos: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 32:           20,500,992 - sub-pos: 2 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
 33:           20,500,992 - rcv-hwm: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
 34:           20,500,992 - rcv-pos: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520

@mjpt777
Contributor

mjpt777 commented Sep 29, 2016

I don't see how the rcv-pos or sub-pos can get so far behind the rcv-hwm. There are no flow control over or under runs, so this would have been rejected. It is as if something corrupted the counters. The image should have the latest contents.

@tmontgomery
Contributor

@mjpt777 agreed. The only way snd-pos increments is via sending data. And since the byte counts show there is no way that much data was sent, the snd-pos, hwm, and pub-lmt must be invalid somehow.

@tmontgomery
Contributor

For the Publication logbuffer, the data is 0 until offset 2176 (position 2176), which happens to be the sub-pos and rcv-pos. For a publication, this is very unusual.

For the Image logbuffer, there is data up to position 2176, then it is 0.

One possibility is that snd-pos was corrupted. If it got corrupted, it would update pub-lmt with a bad value. Also, sending would use that position for scanning and might send junk (depending on where it was). That wouldn't quite explain why rcv-hwm is at that value, though. Whatever hit snd-pos could have hit rcv-hwm and corrupted it separately... but that is a separate machine and file.

The zeroed data in the publication might be explained by the snd-pos: cleaning might do that if snd-pos was corrupted.

@tmontgomery
Contributor

@patriknw are the stat values being read (or written to) by the app? Anything that could touch the cnc file?

@tmontgomery
Contributor

tmontgomery commented Sep 29, 2016

@mjpt777 BTW, the rcv-hwm is incremented on a heartbeat regardless of overrun. So it would be possible for an incoming heartbeat message to set the rcv-hwm up to some value if (and especially if) the snd-pos was corrupted, since the snd-pos is used to fill the heartbeat.

@mjpt777
Contributor

mjpt777 commented Sep 29, 2016

We should probably filter that out as an overrun.

@tmontgomery
Contributor

Probably. But it's an orthogonal issue. The corruption of snd-pos is problematic.

@mjpt777
Contributor

mjpt777 commented Sep 29, 2016

Looking at the code, I think we should validate that the positions in status messages used for flow control are in a valid range.

@tmontgomery
Contributor

Sure. We can validate that the position is one that was actually sent. I assume you mean on the Publication side when processing an SM. Although, I'm not seeing how a bad SM could affect the snd-pos in the way observed. The Publication logbuffer has data from 2,176 to 317,632 and is 0 afterward. A bad SM could allow more to be sent, but I don't see how it could change snd-pos directly.
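
A rough, hypothetical illustration of the kind of range check being discussed; the bounds chosen here are an assumption for illustration, not the fix that was eventually applied:

```java
final class StatusMessageCheck
{
    // Hypothetical sketch of a sender-side sanity check on the consumption position
    // carried by a status message. The exact bounds are an assumption; the real check
    // would live in the driver's sender/flow-control path.
    static boolean isPlausibleSmPosition(final long smPosition, final long senderPosition, final int termLength)
    {
        // A receiver cannot have consumed beyond what the sender has sent, and a
        // position many terms behind the sender is equally suspect.
        return smPosition <= senderPosition && smPosition >= senderPosition - 2L * termLength;
    }
}
```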

@tmontgomery
Contributor

There is no counter for NAKs that are filtered for being outside of a termLength. Most of the NAKs received in the stats are filtered because the snd-pos is off by so much.

@tmontgomery
Contributor

tmontgomery commented Sep 29, 2016

Something else that is "odd" is that there is no rcv-pos on node-1 for the image with sessionId 434696915. The image logbuffer is there, though, and it has data up to the hwm, which is 170,432.

Now, it could be that the snapshot was taken just at the moment that the image was being closed and the rcv-pos was removed. But it is odd that it is not there.

Edit: a supposition is that perhaps the allocation of counters used slot 24 for both snd-pos and rcv-pos. However, this wouldn't account for the actual values or how they are handled. Also, all allocation and freeing of counters is done on the driver conductor thread and normally appears to work just fine.

@patriknw
Author

I don't use reservedValue or tryClaim. Only offer and poll.

I read the stats from a separate process with the aeron-samples AeronStat. We are not touching the cnc file, apart from polling it periodically for log errors using (a copy of) AeronErrorLog.
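
For context, reading the driver error log out of the CnC file from a separate process looks roughly like the sketch below (read-only mapping, roughly what the aeron-samples AeronErrorLog tool does; class and method names are from the public Aeron/Agrona APIs and may differ slightly in 1.0.1):

```java
import io.aeron.CncFileDescriptor;
import io.aeron.CommonContext;
import org.agrona.DirectBuffer;
import org.agrona.IoUtil;
import org.agrona.concurrent.AtomicBuffer;
import org.agrona.concurrent.errors.ErrorLogReader;

import java.io.File;
import java.nio.MappedByteBuffer;

final class ReadDriverErrors
{
    public static void main(final String[] args)
    {
        // The CnC file is only mapped and read here, never written.
        final File cncFile = CommonContext.newDefaultCncFile();
        final MappedByteBuffer cncByteBuffer = IoUtil.mapExistingFile(cncFile, "cnc");
        final DirectBuffer cncMetaData = CncFileDescriptor.createMetaDataBuffer(cncByteBuffer);
        final AtomicBuffer errorLog = CncFileDescriptor.createErrorLogBuffer(cncByteBuffer, cncMetaData);

        // Print each distinct error observation recorded by the driver.
        ErrorLogReader.read(
            errorLog,
            (count, firstTimestamp, lastTimestamp, encodedException) ->
                System.out.println(count + " observation(s): " + encodedException));

        IoUtil.unmap(cncByteBuffer);
    }
}
```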

The media driver is started embedded, if that matters.

Could it be related to the fact that I continue to offer (in a busy-wait loop) even though I get NOT_CONNECTED?
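
The offer loop in question is essentially of this shape; a sketch with placeholder names, shown only to make the NOT_CONNECTED retry behaviour concrete:

```java
import io.aeron.Publication;
import org.agrona.concurrent.UnsafeBuffer;

import java.nio.ByteBuffer;

final class OfferLoop
{
    // Sketch of the busy-wait offer described above: keep retrying even while the
    // publication reports NOT_CONNECTED. Buffer size matches the 256 KiB pooled
    // buffers mentioned earlier; spin behaviour is illustrative.
    static long offerWithRetry(final Publication pub, final int length)
    {
        final UnsafeBuffer buffer = new UnsafeBuffer(ByteBuffer.allocateDirect(256 * 1024));

        long result;
        do
        {
            result = pub.offer(buffer, 0, length);
            // A negative result is NOT_CONNECTED, BACK_PRESSURED or ADMIN_ACTION.
        }
        while (result < 0 && !pub.isClosed());

        return result;
    }
}
```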

The returned positions from offer on node1 look good (to me):

[09/29/2016 03:38:37.837]  AeronSink offer result [-1] connected [false] closed [false]
[09/29/2016 03:39:25.376]  AeronSink offer result [-1] connected [false] closed [false]
[09/29/2016 03:39:25.444]  AeronSink offer result [320] connected [true] closed [false]
[09/29/2016 03:39:25.452]  AeronSink offer result [448] connected [true] closed [false]
[09/29/2016 03:39:26.584]  AeronSink offer result [576] connected [true] closed [false]
[09/29/2016 03:39:26.643]  AeronSink offer result [768] connected [true] closed [false]
[09/29/2016 03:39:26.643]  AeronSink offer result [1152] connected [true] closed [false]
[09/29/2016 03:39:26.647]  AeronSink offer result [1280] connected [true] closed [false]
[09/29/2016 03:39:27.053]  AeronSink offer result [1664] connected [true] closed [false]
[09/29/2016 03:39:27.643]  AeronSink offer result [2048] connected [true] closed [false]
[09/29/2016 03:39:27.738]  AeronSink offer result [2176] connected [true] closed [false]

...

[09/29/2016 03:44:26.642]  AeronSink offer result [315776] connected [true] closed [false]
[09/29/2016 03:44:27.054]  AeronSink offer result [316160] connected [true] closed [false]
[09/29/2016 03:44:27.067]  AeronSink offer result [316288] connected [true] closed [false]
[09/29/2016 03:44:27.642]  AeronSink offer result [316672] connected [true] closed [false]
[09/29/2016 03:44:28.642]  AeronSink offer result [316864] connected [true] closed [false]
[09/29/2016 03:44:28.642]  AeronSink offer result [317248] connected [true] closed [false]
[09/29/2016 03:44:29.642]  AeronSink offer result [317632] connected [true] closed [false]

@tmontgomery
Contributor

@patriknw in the test, does the application create and release/close any other Publications or Subscriptions not shown in the stats?

@patriknw
Author

I didn't use any filters apart from type=[1-9] with AeronStat.

However, node1 tries to send to node2 before node2 has been started, and after a while we give up and close the Publication. I see in the logs that this happened in this test run, at 03:38:57; node2 was started at 03:39:21 (clocks are in sync within 1 second).

So there was one Publication before the things you see in the stats here.

@mjpt777 mjpt777 closed this as completed Oct 14, 2016
patriknw added a commit to akka/akka that referenced this issue Oct 17, 2016
* important bug fix in Aeron: real-logic/aeron#281
ktoso pushed a commit to ktoso/akka that referenced this issue Oct 19, 2016