Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Driver gets wedged with two subscribers, one publisher #87

Closed
strangelydim opened this issue Apr 10, 2015 · 28 comments
Closed

Driver gets wedged with two subscribers, one publisher #87

strangelydim opened this issue Apr 10, 2015 · 28 comments
Assignees
Labels

Comments

@strangelydim
Copy link
Contributor

Using the latest versions of PublisherTool and SubscriberTool, if I start one subscriber with an embedded driver, start another one on the same channel:stream ID without, and start a publisher...

$ java -cp aeron-tools/build/libs/tools.jar uk.co.real_logic.aeron.tools.SubscriberTool --driver=embedded
$ java -cp aeron-tools/build/libs/tools.jar uk.co.real_logic.aeron.tools.SubscriberTool
$ java -cp aeron-tools/build/libs/tools.jar uk.co.real_logic.aeron.tools.PublisherTool

the two subscribers get messages for a few seconds and then wedge, getting nothing, and the publisher is blocked from sending. If I kill the publisher and start it again, I get an exception on the PublisherTool app:

java.lang.IndexOutOfBoundsException: index=16777208, length=24, capacity=16777216
    at uk.co.real_logic.agrona.concurrent.UnsafeBuffer.boundsCheck(UnsafeBuffer.java:795)
    at uk.co.real_logic.agrona.concurrent.UnsafeBuffer.putBytes(UnsafeBuffer.java:724)
    at uk.co.real_logic.aeron.common.concurrent.logbuffer.LogAppender.appendPaddingFrame(LogAppender.java:293)
    at uk.co.real_logic.aeron.common.concurrent.logbuffer.LogAppender.appendUnfragmentedMessage(LogAppender.java:206)
    at uk.co.real_logic.aeron.common.concurrent.logbuffer.LogAppender.append(LogAppender.java:137)
    at uk.co.real_logic.aeron.Publication.offer(Publication.java:165)
    at uk.co.real_logic.aeron.tools.PublisherTool$PublisherThread.onNext(PublisherTool.java:416)
    at uk.co.real_logic.aeron.tools.RateController$SecondsAtBitsPerSecondInternal.sendNext(RateController.java:479)
    at uk.co.real_logic.aeron.tools.RateController.next(RateController.java:53)
    at uk.co.real_logic.aeron.tools.PublisherTool$PublisherThread.run(PublisherTool.java:357)
    at java.lang.Thread.run(Thread.java:745)

If I kill and restart the publisher again, I don't get another exception, but it thinks it's sending for a few more messages and then wedges again. Meanwhile, the subscribers haven't gotten any new messages since the very first run of the Publisher.

This could be some sort of bug in the tools themselves, but the fact that I get an exception from the LogAppender on the Publisher upon restarting the tool makes me think that perhaps it's not.

@tmontgomery
Copy link
Contributor

When driver is embedded, does it use the normal driver directory? if so, then 2 drivers using the same directory might be clobbering one another in bad ways.

@mjpt777
Copy link
Contributor

mjpt777 commented Apr 10, 2015

I have an open issue that the driver needs to use a lock file incase two
drivers use the same directory.

@strangelydim
Copy link
Contributor Author

I believe I only had one driver running. That's why there's only one command-line with --driver=embedded in the report; that causes that particular instance of the tool to start up a driver as well. The default is to not start a driver.

@strangelydim
Copy link
Contributor Author

Even more worrisome than the initial report: running a single subscriber also runs into problems of apparent redelivery of old messages every now and again. Running just a single subscriber:

java -cp aeron-tools/build/libs/tools.jar  uk.co.real_logic.aeron.tools.SubscriberTool --driver=embedded

and a single publisher:

java -cp aeron-tools/build/libs/tools.jar  uk.co.real_logic.aeron.tools.PublisherTool

Results in some interesting output on the receiver side. You'll notice it's getting a couple million messages per second, and only spitting out a handful of complaints of getting messages out-of-order, but the complaints are there nonetheless:

WARNING: conductor directory already exists: /var/folders/qy/rphc92v92yd_42kxt9zfhmkw0000gn/T/aeron/conductor
WARNING: data directory already exists: /var/folders/qy/rphc92v92yd_42kxt9zfhmkw0000gn/T/aeron/data
subscriber-thread 0 subscribing to: udp://localhost:31111#1
1.000031: 0.000 msgs/sec 0 bps
0.999988: 0.000 msgs/sec 0 bps
NEW CONNECTION: channel "udp://localhost:31111", stream 1, session 1
1.000167: 2.063 Mmsgs/sec 528.2 Mbps
0.999897: 2.288 Mmsgs/sec 585.8 Mbps
1.000038: 2.378 Mmsgs/sec 608.9 Mbps
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 6292143, but was expecting 7190919. Possibly missed -898776 messages.
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 7489800, but was expecting 6591024. Possibly missed 898776 messages.
0.999948: 2.228 Mmsgs/sec 570.3 Mbps
0.999950: 2.330 Mmsgs/sec 596.4 Mbps
1.001989: 2.192 Mmsgs/sec 561.2 Mbps
0.998010: 2.168 Mmsgs/sec 554.9 Mbps
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 16177968, but was expecting 17076744. Possibly missed -898776 messages.
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 17076745, but was expecting 16177969. Possibly missed 898776 messages.
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 16179872, but was expecting 17078648. Possibly missed -898776 messages.
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 17376336, but was expecting 16477560. Possibly missed 898776 messages.
0.999998: 2.082 Mmsgs/sec 532.9 Mbps
1.000007: 2.271 Mmsgs/sec 581.5 Mbps
0.999992: 2.172 Mmsgs/sec 556.0 Mbps
1.000002: 2.278 Mmsgs/sec 583.2 Mbps
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 24867578, but was expecting 25766354. Possibly missed -898776 messages.
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 26064504, but was expecting 25165728. Possibly missed 898776 messages.
0.999999: 2.074 Mmsgs/sec 530.8 Mbps
1.000002: 2.282 Mmsgs/sec 584.1 Mbps

etc. So either it's getting gaps in the message delivery which are then retransmitted, or messages are being delivered out of order, or... something. All I know is that it certainly didn't used to do this a few weeks ago - the tools could run for quite some time without complaint.

@mjpt777
Copy link
Contributor

mjpt777 commented Apr 13, 2015

Can you bisect the history and find the point it was introduced?

@strangelydim
Copy link
Contributor Author

It appears that things became borked up on commit: 01464d7 , "Moved completion tracking from the Receiver to Conductor thread in the driver", on April 3rd.

If I check out at the commit immediately before that, b912b4b , everything works fine.

@tmontgomery
Copy link
Contributor

Fascinating that it is so regular in the number of possible missed offsets. Bet that is a clue.

@strangelydim
Copy link
Contributor Author

Probably is. The tools themselves send their own sequence number in the payload of the messages they're sending, and keep track of the next expected SQN. PublisherTool always starts at SQN 0 and increments the SQN by one for each message it successfully sends. SubscriberTool expects to get SQN 0 as the first SQN on a given stream, and then expects SQN N + 1 next. If it gets an unexpected SQN, it reports that and then adjusts its expectations for the next message. So if it gets 0, 1, 2, 3, 5, then it will issue a warning upon receipt of 5 that it possibly missed 1 message, but then set its expected SQN to 6.

So the output above is talking about the tool-specific sequence numbers, not Aeron positions or offsets.

@tmontgomery
Copy link
Contributor

OK. And what is the message size then? That should give us an idea of gap and equate to Terms and offsets.

@strangelydim
Copy link
Contributor Author

As another possible hint, if I'm running with two subscribers as per the original bug report, then in addition to SQNs being delivered out of order or with gaps, I also get message checksum errors on both subscribers like these:

[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated -515913915
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated -1437035475
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated 1661765329
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated 1762207865
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated -613421714
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated -2101812506
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated -27597524
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated 1441910465

When verifiable messages are used, PublisherTool writes in an application header at the beginning of each message that includes a CRC32 checksum of all the bytes in the message (including the header itself, with the checksum field temporarily set to 0). It's rather unlikely (1 out of 4 billion-ish) that the checksum is going to happen to be 0, and yet that's what SubscriberTool is receiving, and then calculating that it should've been something else. So for those messages, I'm not even sure if non-mangled data is being delivered, let alone being delivered in the correct order without gaps.

@strangelydim
Copy link
Contributor Author

Message size is 32 bytes.

Aeron driver options are all defaults.

@tmontgomery
Copy link
Contributor

What about the length of the message being 0? Wouldn't the checksum then be 0?

@strangelydim
Copy link
Contributor Author

The checksum would be whatever Java's default CRC32 class's initial checksum is set to. But in any case... When verifiable messages are used, there's a minimum message length of 16 bytes enforced on PublisherTool, to account for the application-level verifiable message header. And remember the data that is used to produce the per-message checksum includes that header, with the checksum field temporarily blanked out to 0. So in a default run, if SubscriberTool is receiving a 0-length message, or if PublisherTool is sending one, something has already gone pretty wrong.

@tmontgomery
Copy link
Contributor

When running checksum over the header with the checksum field set with the checksum value, it should always come out to 0. IP, UDP, and TCP all do this on the receiving side to check. i.e. check against 0 instead of blanking the field.

Just throwing ideas out, actually, to why it could be 0.

@strangelydim
Copy link
Contributor Author

I didn't do IP-style checksums, I just did a simple CRC32 checksum with the correct value (not necessarily zero) being sent as part of the message. See MessageStream:getNext for the publisher side, and MessageStream:putNext for the subscriber side of things.

@mjpt777
Copy link
Contributor

mjpt777 commented Apr 13, 2015

Let me know if I'm the one that needs to mop the rain!

@tmontgomery
Copy link
Contributor

@mjpt777 Not just quite yet. Once I put some changes I am doing to bed in the next couple days, I will look deeper into this and see what I can find.

@tmontgomery
Copy link
Contributor

something is really quite odd.

The receiver isn't just sometimes seeing old data, it seems to also (from the output you showed) be seeing into the future also. Alternating between them.

mjpt777 added a commit that referenced this issue Apr 14, 2015
@mjpt777
Copy link
Contributor

mjpt777 commented Apr 14, 2015

I found one potential issue that could explain a difference and committed a change.

@strangelydim
Copy link
Contributor Author

Same problems after the commit... but at least another potential bug got fixed.

@tmontgomery : Not really seeing into the future - remember, the subscriber is adjusting his expected next SQN to what it received + 1 each time it gets a message, even if what it received wasn't what it was expecting. So if it receives SQNs 0, 1, 2, 4, 3, 5, in that order, for example, then when it gets SQN 5 it's going to say it was "expecting" 4 (even though it already got it) - so 5 will appear to be from "the future". It's not using a bitmap for tracking a window of the last N SQNs received or anything fancy, just a single, simple expected next SQN.

@tmontgomery tmontgomery self-assigned this Apr 16, 2015
@strangelydim
Copy link
Contributor Author

I added a few lines of code to grab the term ID and offset, etc. of the messages that SubscriberTool was getting, and save the values from the previously-delivered message so I could compare them to the values on the first message the triggers the out-of-order exception.

Usually, even when the exception is hit, the values seem to make sense. The previous message and the current exception-triggering message are both in the same term ID. The current message's offset is exactly a frame length ahead of the last message's.

However, sometimes - and it may possibly be easier to hit with larger-than-the-default message sizes, as I can hit this at message size of 160 bytes - the values do not, in fact, make any sense at all. The previous message might have a term ID of 1297779264, and the current message has a term ID of 1297779261, as one real example. I would expect the term ID of the current message to be either the same as that of the previous message or exactly one higher. I wouldn't expect it to be lower than the previous message.

But like I said - sometimes, I still hit the exception, but all the header values seem to be what I'd expect - so the header looks reasonable, but the message payload itself is not right.

@strangelydim
Copy link
Contributor Author

I just did another run with the latest changes to see if anything had changed magically; looks like this is still a bug. But I do notice something interesting... The sequence numbers reported in the exceptions in this run are always ahead/behind by the same number of messages, exactly 898776. The frame lengths in this run are going to always be 56 (or should be, anyway - 24 bytes of header plus 32 byte application payload). Term sizes are all defaults.

$ java -cp aeron-tools/build/libs/tools.jar uk.co.real_logic.aeron.tools.SubscriberTool --driver=embedded
WARNING: conductor directory already exists: /var/folders/5q/shz1fbsx1vs9mnkszj2m5c7m0000gn/T/aeron/conductor
WARNING: data directory already exists: /var/folders/5q/shz1fbsx1vs9mnkszj2m5c7m0000gn/T/aeron/data
subscriber-thread 0 subscribing to: udp://localhost:31111#1
1.000037: 0.000 msgs/sec 0 bps
0.999984: 0.000 msgs/sec 0 bps
NEW CONNECTION: channel "udp://localhost:31111", stream 1, session 1
1.000056: 1.489 Mmsgs/sec 381.1 Mbps
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 1200453, but was expecting 2099229. Possibly missed -898776 messages.
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 2133217, but was expecting 1234441. Possibly missed 898776 messages.
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 1234519, but was expecting 2133295. Possibly missed -898776 messages.
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 2133587, but was expecting 1234811. Possibly missed 898776 messages.
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 1234884, but was expecting 2133660. Possibly missed -898776 messages.
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 2133806, but was expecting 1235030. Possibly missed 898776 messages.
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 1235062, but was expecting 2133838. Possibly missed -898776 messages.
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 2134349, but was expecting 1235573. Possibly missed 898776 messages.
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 1348859, but was expecting 2247635. Possibly missed -898776 messages.
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 2396736, but was expecting 1497960. Possibly missed 898776 messages.
[subscriber-thread 0] WARN uk.co.real_logic.aeron.tools.SubscriberTool - Channel udp://localhost:31111:1[1]: Verifiable message stream received sequence number 2698498, but was expecting 3597274. Possibly missed -898776 messages.
0.999987: 2.207 Mmsgs/sec 565.1 Mbps

Also, note that when an older sequence number message is suddenly delivered out of order, it appears that the next few old messages after that one are also being delivered before suddenly switching back to the current stream. It's a whole run of older messages; the length of the run isn't consistent, though. The first run of older messages appears to be 33988 messages long (1234441 - 1200453). But the second run of old messages is only 292 (1234811 - 1234519).

tmontgomery pushed a commit that referenced this issue Apr 19, 2015
…ss duplicate data seen in #87. Added system property to disable printing of frame headers for LogInspector.
@tmontgomery
Copy link
Contributor

Worth doing a run with the latest update to see if the duplicate data goes away.

The original issue reported, though, probably isn't addressed.

@strangelydim
Copy link
Contributor Author

A quick test with the updated code looks promising for addressing the single-subscriber case. Running for a couple minutes doesn't give me any out-of-order old messages. I'll do a longer-running test tomorrow.

Sadly, two subscribers is still an issue; if I start two instances of SubscriberTool and then start PublisherTool, I get tons of these on both SubscriberTool instances:

WARN Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated 245542246
WARN Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated 2090063423
WARN Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated -1453958975
WARN Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated 245504540
WARN Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated -1909512619
WARN Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated -275125590
WARN Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated -247174118
WARN Channel udp://localhost:31111:1[1]: Verifiable message per-message checksum invalid; received 0 but calculated -1379873868

So I'm not sure what sort of messages are being delivered, but in those instances, they're not the messages I sent... The output above was over about a second of wall-clock time, during which that particular instance of SubscriberTool received a little under 2 million messages. So 8 messed-up messages out of 2 million is not a big percentage, but it's more than 0%, so something's still up.

@brianhorst
Copy link
Contributor

The 0 checksum may be an artifact of the SubscriberTool. If I understand what's happening correctly, the two subscribers are both reading from the same log buffer. In the MessageStream checksum verification each subscriber independently sets the checksum to 0 in the log buffer before doing updating the CRC32. It does set the checksum value back when done, but that leaves an amount of time where the checksum is read as 0 by the other subscriber. The buffer being used in the MessageStream class is misnamed "copybuf" because DirectBuffer.wrap just sets the address of the buffer. A quick check to verify this could be to copy out the data or remove the need to change the data in the log buffer.

@strangelydim
Copy link
Contributor Author

Ahhhh, that would make a lot of sense. You're probably right... Let me try that out first and see if that's all it is. At least there was one real bug here.

@strangelydim
Copy link
Contributor Author

OK - turns out Brian was right, and the 0-checksum problem was the test tool itself. It didn't occur to me that the two different subscribers using the same media driver on the same machine would, of course, be reading out of the same buffer. I changed the code a bit to treat the buffer as immutable - but still calculate the checksum - and all is now well.

So - feel free to close this bug out if you like.

@mjpt777
Copy link
Contributor

mjpt777 commented Apr 20, 2015

Nice one!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants