
Stalled Publication/Subscription #281

Closed

patriknw opened this issue Sep 29, 2016 · 22 comments

Comments

@patriknw

Version 1.0.1

I'm debugging a case where sub.poll suddenly doesn't receive any more messages, and it stays that way. On the sending side pub.offer is successful (positive return value and connected=true closed=false).

Only two nodes.

  • node1: 172-31-10-77
  • node2: 172-31-8-204

node1 tries to send to node2, but node2 is not started yet. ~30 seconds later node2 is started and they successfully exchange a few messages (I have application-level logs for that), and then node2 stops receiving messages in sub.poll. onFragment in the FragmentAssembler is not invoked even though I repeatedly call poll. The messages are below MTU size.

The systems are rather loaded in this scenario but not overloaded, and the load is stopped after a while.
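
For reference, the receive side described here is essentially a FragmentAssembler-driven poll loop. A minimal sketch is below; the handler, idle strategy, and running flag are illustrative placeholders, not the actual application code:

```java
import io.aeron.FragmentAssembler;
import io.aeron.Subscription;
import io.aeron.logbuffer.FragmentHandler;
import org.agrona.concurrent.BackoffIdleStrategy;
import org.agrona.concurrent.IdleStrategy;

import java.util.concurrent.atomic.AtomicBoolean;

final class PollLoop
{
    // 'sub' is the Subscription for the stream in question; how it is created is omitted here.
    static void run(final Subscription sub, final AtomicBoolean running)
    {
        final FragmentHandler onMessage = (buffer, offset, length, header) ->
            System.out.println("received " + length + " bytes at position " + header.position());

        // FragmentAssembler reassembles fragmented messages before invoking onMessage.
        final FragmentAssembler assembler = new FragmentAssembler(onMessage);
        final IdleStrategy idle = new BackoffIdleStrategy(100, 10, 1000, 1000);

        while (running.get())
        {
            // After the stall, poll keeps returning 0 and onFragment is never invoked.
            final int fragments = sub.poll(assembler, 10);
            idle.idle(fragments);
        }
    }
}
```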

AeronStat from node1 172-31-10-77:

 23:                    1 - recv-channel: aeron:udp?endpoint=ip-172-31-10-77:25520
 24:        1,627,656,896 - snd-pos: 8 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 25:                    1 - send-channel: aeron:udp?endpoint=ip-172-31-8-204:25520
 26:        1,636,045,504 - pub-lmt: 8 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 27:              170,432 - sub-pos: 1 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520 @0
 28:              170,432 - rcv-hwm: 7 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 29:           19,750,176 - sub-pos: 2 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520 @0
 30:           19,750,176 - rcv-hwm: 9 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 31:           19,750,176 - rcv-pos: 9 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 32:           28,953,216 - pub-lmt: 10 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
 33:           20,564,608 - snd-pos: 10 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520

AeronStat from node2 172-31-8-204:

 23:                    1 - recv-channel: aeron:udp?endpoint=ip-172-31-8-204:25520
 24:                    1 - send-channel: aeron:udp?endpoint=ip-172-31-10-77:25520
 25:            8,559,040 - pub-lmt: 3 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 26:              170,432 - snd-pos: 3 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 27:           28,138,784 - pub-lmt: 4 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 28:           19,750,176 - snd-pos: 4 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 29:                2,176 - sub-pos: 1 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
 30:        1,627,656,896 - rcv-hwm: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 31:                2,176 - rcv-pos: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 32:           20,500,992 - sub-pos: 2 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
 33:           20,500,992 - rcv-hwm: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
 34:           20,500,992 - rcv-pos: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520

The problematic session is 1287205435. The other streams seem to progress. I kept it running for minutes after the stall.

Stream 1 is our control stream and it's low traffic, a few messages per second on this stream.

I have all files, if you need more information.

@tmontgomery
Contributor

The sender and rcv-hwm are sitting at position 1,627,656,896. However, the subscription has only consumed up to 2,176 (which is also the rcv-pos).

How big is the term length?

Any other settings in use from the defaults?

Also, you mention this is low traffic. But did you really send something like 1.5GB of data on that channel?

@patriknw
Author

On stream 1 it's just a few heartbeat messages of a few hundred bytes each per second. It gets stuck almost immediately (only a few messages get through). Reproducing it is not deterministic (it happens in about 1 of 10 runs).

On stream 2 it's a lot more messages.

1.5 GB on the sender side. How can that be possible? I use offer with an UnsafeBuffer wrapping a ByteBuffer.allocateDirect of capacity 256 KiB (from our pool). Byte order is LITTLE_ENDIAN.

Term sizes not changed, using the default. All settings are default except a few increased timeouts (a configuration sketch follows the list):

  • client-liveness-timeout = 20 seconds
  • image-liveness-timeout = 10 seconds
  • driver-timeout = 20 seconds
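
A minimal sketch of how timeouts like these can be applied to an embedded driver and its client; the setter names come from the public MediaDriver.Context / Aeron.Context APIs and are an assumption for this exact Aeron version:

```java
import io.aeron.Aeron;
import io.aeron.driver.MediaDriver;

import java.util.concurrent.TimeUnit;

final class TimeoutConfig
{
    public static void main(final String[] args)
    {
        // Raise the driver-side liveness timeouts to match the values listed above.
        final MediaDriver.Context driverCtx = new MediaDriver.Context()
            .clientLivenessTimeoutNs(TimeUnit.SECONDS.toNanos(20))
            .imageLivenessTimeoutNs(TimeUnit.SECONDS.toNanos(10));

        try (MediaDriver driver = MediaDriver.launchEmbedded(driverCtx))
        {
            final Aeron.Context clientCtx = new Aeron.Context();
            clientCtx.aeronDirectoryName(driver.aeronDirectoryName()); // point the client at the embedded driver
            clientCtx.driverTimeoutMs(TimeUnit.SECONDS.toMillis(20));

            try (Aeron aeron = Aeron.connect(clientCtx))
            {
                // Publications/subscriptions would be created from 'aeron' here.
            }
        }
    }
}
```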

@tmontgomery
Contributor

The positions don't make a lot of sense, honestly.

Are you sure that all drivers and clients are using the same 1.0.1 version? It might also be worth a shot to try the latest snapshot/HEAD, just to see.

@tmontgomery
Contributor

@mjpt777 anything come to mind for you?

@mjpt777
Contributor

mjpt777 commented Sep 29, 2016

I think it would be worth looking at the publication and image buffers if you still have them. Also, what are the first 30 counters, so we can see the aggregate picture?

@tmontgomery I think it might be worth putting the Aeron version in the AeronStat output.

@patriknw
Author

I'm sure about the version number. We run this on fresh EC2 instances from our sbt build, and we have been using 1.0.1 for a long time.

Here are all files: https://dl.dropboxusercontent.com/u/14579199/test35.zip

@tmontgomery
Contributor

If the positions are correct, then looking at the images is going to be mostly fruitless. The hwm and sub-pos, for example, differ by many term buffers, surely violating the cleaning assumptions.

@tmontgomery
Contributor

tmontgomery commented Sep 29, 2016

The stats show only 20 MB sent/received, so the high position value is definitely off for some reason. The only cause I can think of would be corruption of the header in some way that implies a different position. This seems to be happening as the message is put into the term, since the snd-pos and pub-lmt are also corrupted. A possibility might be setting the reservedValue at an incorrect offset, or a tryClaim that has messed with the offset incorrectly.
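
For reference, the tryClaim/reservedValue pattern being speculated about looks roughly like the sketch below; writing outside the claimed offset/length, or committing a stale claim, is the kind of misuse that could scribble over a frame header. This is illustrative only, not the reporter's code:

```java
import io.aeron.Publication;
import io.aeron.logbuffer.BufferClaim;
import org.agrona.MutableDirectBuffer;

final class TryClaimExample
{
    // Illustrative only: writing outside [offset, offset + length) of the claim,
    // or committing a stale claim, could corrupt a frame header and imply a bogus position.
    static void send(final Publication publication)
    {
        final BufferClaim bufferClaim = new BufferClaim();
        final int length = 128;

        if (publication.tryClaim(length, bufferClaim) > 0)
        {
            try
            {
                final MutableDirectBuffer buffer = bufferClaim.buffer();
                final int offset = bufferClaim.offset();
                buffer.putLong(offset, 42L);      // must stay within the claimed range
                bufferClaim.reservedValue(123L);  // optional application value in the frame header
            }
            finally
            {
                bufferClaim.commit();             // or bufferClaim.abort() on failure
            }
        }
    }
}
```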

node-1

  0:           20,745,216 - Bytes sent
  1:           20,370,624 - Bytes received
  2:                    0 - Failed offers to ReceiverProxy
  3:                    0 - Failed offers to SenderProxy
  4:                    0 - Failed offers to DriverConductorProxy
  5:                   16 - NAKs sent
  6:                4,917 - NAKs received
  7:                3,340 - Status Messages sent
  8:                3,282 - Status Messages received
  9:                5,697 - Heartbeats sent
 10:                5,311 - Heartbeats received
 11:                    3 - Retransmits sent
 12:                    0 - Flow control under runs
 13:                    0 - Flow control over runs
 14:                    0 - Invalid packets
 15:                    0 - Errors
 16:                    0 - Short sends
 17:                  696 - Client keep-alives
 18:                    7 - Sender flow control limits applied
 19:                    0 - Unblocked Publications
 20:                    0 - Unblocked Control Commands
 21:                    0 - Possible TTL Asymmetry
 22:                    0 - ControllableIdleStrategy status
 23:                    1 - recv-channel: aeron:udp?endpoint=ip-172-31-10-77:25520
 24:        1,627,656,896 - snd-pos: 8 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 25:                    1 - send-channel: aeron:udp?endpoint=ip-172-31-8-204:25520
 26:        1,636,045,504 - pub-lmt: 8 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 27:              170,432 - sub-pos: 1 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520 @0
 28:              170,432 - rcv-hwm: 7 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 29:           19,750,176 - sub-pos: 2 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520 @0
 30:           19,750,176 - rcv-hwm: 9 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 31:           19,750,176 - rcv-pos: 9 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 32:           28,953,216 - pub-lmt: 10 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
 33:           20,564,608 - snd-pos: 10 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520

node-2

  0:           20,086,208 - Bytes sent
  1:           20,681,152 - Bytes received
  2:                    0 - Failed offers to ReceiverProxy
  3:                    0 - Failed offers to SenderProxy
  4:                    0 - Failed offers to DriverConductorProxy
  5:                4,917 - NAKs sent
  6:                   16 - NAKs received
  7:                3,282 - Status Messages sent
  8:                3,320 - Status Messages received
  9:                5,312 - Heartbeats sent
 10:                5,657 - Heartbeats received
 11:                   16 - Retransmits sent
 12:                    0 - Flow control under runs
 13:                    0 - Flow control over runs
 14:                    0 - Invalid packets
 15:                    0 - Errors
 16:                    0 - Short sends
 17:                  599 - Client keep-alives
 18:                    2 - Sender flow control limits applied
 19:                    0 - Unblocked Publications
 20:                    0 - Unblocked Control Commands
 21:                    0 - Possible TTL Asymmetry
 22:                    0 - ControllableIdleStrategy status
 23:                    1 - recv-channel: aeron:udp?endpoint=ip-172-31-8-204:25520
 24:                    1 - send-channel: aeron:udp?endpoint=ip-172-31-10-77:25520
 25:            8,559,040 - pub-lmt: 3 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 26:              170,432 - snd-pos: 3 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 27:           28,138,784 - pub-lmt: 4 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 28:           19,750,176 - snd-pos: 4 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 29:                2,176 - sub-pos: 1 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
 30:        1,627,656,896 - rcv-hwm: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 31:                2,176 - rcv-pos: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 32:           20,500,992 - sub-pos: 2 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
 33:           20,500,992 - rcv-hwm: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
 34:           20,500,992 - rcv-pos: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520

@mjpt777
Contributor

mjpt777 commented Sep 29, 2016

I don't see how the rcv-pos or sub-pos can get so far behind the rcv-hwm. There are no flow control over or under runs, so this would have been rejected. It is as if something corrupted the counters. The image should have the latest contents.

@tmontgomery
Contributor

@mjpt777 agreed. The only way snd-pos increments is via sending data. And since the byte counts show there is no way that much data was sent, the snd-pos, hwm, and pub-lmt must be invalid somehow.

@tmontgomery
Contributor

For the Publication logbuffer, the data is 0 until offset 2176 (position 2176), which happens to be the sub-pos and rcv-pos. For a publication, this is very unusual.

For the Image logbuffer, there is data up to position 2176, then it is 0.

One possibility is that snd-pos was corrupted. If it got corrupted, it would update pub-lmt with a bad value. Also, sending would use that position for scanning and might send junk (depending on where it was). That wouldn't quite explain why rcv-hwm is at that value, though. Whatever hit snd-pos could have hit rcv-hwm and corrupted it separately... but that is a separate machine and file.

The zeroed data in the publication might be explained by the snd-pos: cleaning might do that if snd-pos was corrupted.

@tmontgomery
Contributor

@patriknw are the stat values being read (or written to) by the app? Anything that could touch the cnc file?

@tmontgomery
Contributor

tmontgomery commented Sep 29, 2016

@mjpt777 BTW, the rcv-hwm is incremented on a heartbeat regardless of overrun. So it would be possible for an incoming heartbeat message to set the rcv-hwm up to some value if (and especially if) the snd-pos was corrupted, since the snd-pos is used to fill the heartbeat.

@mjpt777
Contributor

mjpt777 commented Sep 29, 2016

We should probably filter that out as an overrun.

@tmontgomery
Contributor

Probably. But it's an orthogonal issue. The corruption of snd-pos is problematic.

@mjpt777
Contributor

mjpt777 commented Sep 29, 2016

Looking at the code, I think we should validate that the positions in status messages used for flow control are in a valid range.

@tmontgomery
Contributor

Sure. We can validate that the position is one that was actually sent. I assume you mean on the Publication side when processing an SM. Although, I'm not seeing how a bad SM could affect the snd-pos in the way observed. The Publication logbuffer has data from 2,176 to 317,632 and is 0 afterward. A bad SM could allow more to be sent, but I don't see how it could change snd-pos directly.
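
A rough, hypothetical illustration of the kind of range check being discussed; the bounds chosen here are an assumption for illustration, not the fix that was eventually applied:

```java
final class StatusMessageCheck
{
    // Hypothetical sketch of a sender-side sanity check on the consumption position
    // carried by a status message. The exact bounds are an assumption; the real check
    // would live in the driver's sender/flow-control path.
    static boolean isPlausibleSmPosition(final long smPosition, final long senderPosition, final int termLength)
    {
        // A receiver cannot have consumed beyond what the sender has sent, and a
        // position many terms behind the sender is equally suspect.
        return smPosition <= senderPosition && smPosition >= senderPosition - 2L * termLength;
    }
}
```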

@tmontgomery
Contributor

There is no counter for NAKs that are filtered for being outside of a termLength. Most of the NAKs received in the stats are filtered because the snd-pos is off by so much.

@tmontgomery
Contributor

tmontgomery commented Sep 29, 2016

Something else that is "odd" is that there is no rcv-pos on node-1 for the image with sessionId 434696915. The image logbuffer is there, though, and it has data up to the hwm, which is 170,432.

Now, it could be that the snapshot was taken just at the moment that the image was being closed and the rcv-pos was removed. But it is odd that it is not there.

Edit: a supposition is that perhaps the allocation of counters used slot 24 for both snd-pos and rcv-pos. However, this wouldn't account for the actual values or how they are handled. Also, all allocation and freeing of counters is done on the driver conductor thread and normally appears to work just fine.

@patriknw
Author

I don't use reservedValue or tryClaim. Only offer and poll.

I read the stats from a separate process with the aeron-samples AeronStat. We are not touching the cnc file, apart from polling it periodically for log errors using (a copy of) AeronErrorLog.
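
For context, reading the driver error log out of the CnC file from a separate process looks roughly like the sketch below (read-only mapping, roughly what the aeron-samples AeronErrorLog tool does; class and method names are from the public Aeron/Agrona APIs and may differ slightly in 1.0.1):

```java
import io.aeron.CncFileDescriptor;
import io.aeron.CommonContext;
import org.agrona.DirectBuffer;
import org.agrona.IoUtil;
import org.agrona.concurrent.AtomicBuffer;
import org.agrona.concurrent.errors.ErrorLogReader;

import java.io.File;
import java.nio.MappedByteBuffer;

final class ReadDriverErrors
{
    public static void main(final String[] args)
    {
        // The CnC file is only mapped and read here, never written.
        final File cncFile = CommonContext.newDefaultCncFile();
        final MappedByteBuffer cncByteBuffer = IoUtil.mapExistingFile(cncFile, "cnc");
        final DirectBuffer cncMetaData = CncFileDescriptor.createMetaDataBuffer(cncByteBuffer);
        final AtomicBuffer errorLog = CncFileDescriptor.createErrorLogBuffer(cncByteBuffer, cncMetaData);

        // Print each distinct error observation recorded by the driver.
        ErrorLogReader.read(
            errorLog,
            (count, firstTimestamp, lastTimestamp, encodedException) ->
                System.out.println(count + " observation(s): " + encodedException));

        IoUtil.unmap(cncByteBuffer);
    }
}
```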

The media driver is started embedded, if that matters.

Could it be related to the fact that I continue to offer (in a busy-wait loop) even though I get NOT_CONNECTED?
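
The offer loop in question is essentially of this shape; a sketch with placeholder names, shown only to make the NOT_CONNECTED retry behaviour concrete:

```java
import io.aeron.Publication;
import org.agrona.concurrent.UnsafeBuffer;

import java.nio.ByteBuffer;

final class OfferLoop
{
    // Sketch of the busy-wait offer described above: keep retrying even while the
    // publication reports NOT_CONNECTED. Buffer size matches the 256 KiB pooled
    // buffers mentioned earlier; spin behaviour is illustrative.
    static long offerWithRetry(final Publication pub, final int length)
    {
        final UnsafeBuffer buffer = new UnsafeBuffer(ByteBuffer.allocateDirect(256 * 1024));

        long result;
        do
        {
            result = pub.offer(buffer, 0, length);
            // A negative result is NOT_CONNECTED, BACK_PRESSURED or ADMIN_ACTION.
        }
        while (result < 0 && !pub.isClosed());

        return result;
    }
}
```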

The returned positions from offer on node1 look good (to me):

[09/29/2016 03:38:37.837]  AeronSink offer result [-1] connected [false] closed [false]
[09/29/2016 03:39:25.376]  AeronSink offer result [-1] connected [false] closed [false]
[09/29/2016 03:39:25.444]  AeronSink offer result [320] connected [true] closed [false]
[09/29/2016 03:39:25.452]  AeronSink offer result [448] connected [true] closed [false]
[09/29/2016 03:39:26.584]  AeronSink offer result [576] connected [true] closed [false]
[09/29/2016 03:39:26.643]  AeronSink offer result [768] connected [true] closed [false]
[09/29/2016 03:39:26.643]  AeronSink offer result [1152] connected [true] closed [false]
[09/29/2016 03:39:26.647]  AeronSink offer result [1280] connected [true] closed [false]
[09/29/2016 03:39:27.053]  AeronSink offer result [1664] connected [true] closed [false]
[09/29/2016 03:39:27.643]  AeronSink offer result [2048] connected [true] closed [false]
[09/29/2016 03:39:27.738]  AeronSink offer result [2176] connected [true] closed [false]

...

[09/29/2016 03:44:26.642]  AeronSink offer result [315776] connected [true] closed [false]
[09/29/2016 03:44:27.054]  AeronSink offer result [316160] connected [true] closed [false]
[09/29/2016 03:44:27.067]  AeronSink offer result [316288] connected [true] closed [false]
[09/29/2016 03:44:27.642]  AeronSink offer result [316672] connected [true] closed [false]
[09/29/2016 03:44:28.642]  AeronSink offer result [316864] connected [true] closed [false]
[09/29/2016 03:44:28.642]  AeronSink offer result [317248] connected [true] closed [false]
[09/29/2016 03:44:29.642]  AeronSink offer result [317632] connected [true] closed [false]

@tmontgomery
Contributor

@patriknw in the test, does the application create and release/close any other Publications or Subscriptions not shown in the stats?

@patriknw
Author

I didn't use any filters apart from type=[1-9] with AeronStat.

However, node1 tries to send to node2 before node2 has been started, and after a while we give up and close the Publication. I see in the logs that this happened in this test run, at 03:38:57; node2 was started at 03:39:21 (clocks are in sync within 1 second).

So there was one Publication before the things you see in the stats here.

@mjpt777 mjpt777 closed this as completed Oct 14, 2016
patriknw added a commit to akka/akka that referenced this issue Oct 17, 2016
* important bug fix in Aeron: real-logic/aeron#281
ktoso pushed a commit to ktoso/akka that referenced this issue Oct 19, 2016