Driver gets wedged with two subscribers, one publisher #87
When the driver is embedded, does it use the normal driver directory? If so, then two drivers using the same directory might be clobbering one another in bad ways.
I have an open issue that the driver needs to use a lock file in case two…
I believe I only had one driver running. That's why there's only one command line with --driver=embedded in the report; that causes that particular instance of the tool to start up a driver as well. The default is to not start a driver.
Even more worrisome than the initial report: running a single subscriber also runs into problems of apparent redelivery of old messages every now and again. Running just a single subscriber:
and a single publisher:
Results in some interesting output on the receiver side. You'll notice it's getting a couple million messages per second, and only spitting out a handful of complaints of getting messages out-of-order, but the complaints are there nonetheless:
etc. So either it's getting gaps in the message delivery which are then retransmitted, or messages are being delivered out of order, or... something. All I know is that it certainly didn't use to do this a few weeks ago - the tools could run for quite some time without complaint.
Can you bisect the history and find the point it was introduced?
Fascinating that it is so regular in the number of possible missed offsets. Bet that is a clue.
Probably is. The tools themselves send their own sequence number in the payload of the messages they're sending, and keep track of the next expected SQN. PublisherTool always starts at SQN 0 and increments the SQN by one for each message it successfully sends. SubscriberTool expects to get SQN 0 as the first SQN on a given stream, and then expects SQN N + 1 next. If it gets an unexpected SQN, it reports that and then adjusts its expectations for the next message. So if it gets 0, 1, 2, 3, 5, then it will issue a warning upon receipt of 5 that it possibly missed 1 message, but then set its expected SQN to 6. So the output above is talking about the tool-specific sequence numbers, not Aeron positions or offsets.
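The expectation-adjusting scheme described above can be sketched like this (illustrative names, not the actual SubscriberTool code):

```java
// Minimal sketch of the tool-level SQN tracking described above.
// A single "next expected" counter, adjusted after every message -
// no bitmap or window of recently-seen SQNs.
class SequenceChecker {
    private long expectedSqn = 0; // PublisherTool always starts at SQN 0

    /** Returns a warning string if the SQN was not the one expected, else null. */
    String onMessage(long receivedSqn) {
        String warning = null;
        if (receivedSqn != expectedSqn) {
            warning = "Expected SQN " + expectedSqn + " but got " + receivedSqn
                    + " (possibly missed " + (receivedSqn - expectedSqn) + " message(s))";
        }
        // Adjust expectations regardless: the next message should be received + 1.
        expectedSqn = receivedSqn + 1;
        return warning;
    }
}
```

With this scheme, receiving 0, 1, 2, 3, 5 warns only on 5 and then expects 6 next.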
OK. And what is the message size, then? That should give us an idea of the gap and equate it to terms and offsets.
As another possible hint, if I'm running with two subscribers as per the original bug report, then in addition to SQNs being delivered out of order or with gaps, I also get message checksum errors on both subscribers like these:
When verifiable messages are used, PublisherTool writes in an application header at the beginning of each message that includes a CRC32 checksum of all the bytes in the message (including the header itself, with the checksum field temporarily set to 0). It's rather unlikely (1 out of 4 billion-ish) that the checksum is going to happen to be 0, and yet that's what SubscriberTool is receiving, and then calculating that it should've been something else. So for those messages, I'm not even sure if non-mangled data is being delivered, let alone being delivered in the correct order without gaps.
Message size is 32 bytes. Aeron driver options are all defaults. |
What about the length of the message being 0? Wouldn't the checksum then be 0? |
The checksum would be whatever Java's default CRC32 class's initial checksum is set to. But in any case... When verifiable messages are used, there's a minimum message length of 16 bytes enforced on PublisherTool, to account for the application-level verifiable message header. And remember the data that is used to produce the per-message checksum includes that header, with the checksum field temporarily blanked out to 0. So in a default run, if SubscriberTool is receiving a 0-length message, or if PublisherTool is sending one, something has already gone pretty wrong.
When running the checksum over the header with the checksum field populated with the checksum value, it should always come out to 0. IP, UDP, and TCP all do this on the receiving side to verify, i.e. check against 0 instead of blanking the field. Just throwing ideas out, actually, as to why it could be 0.
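For reference, the IP-style property being described works like this (illustrative code, not from the tools, which use CRC32 instead - CRC32 does not have this verify-against-zero property):

```java
// The Internet (ones' complement) checksum: after storing the computed
// checksum in its field, re-summing the data *including* that field comes
// out to 0, so a receiver checks against 0 rather than blanking the field.
final class InternetChecksum {
    static int compute(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i += 2) {
            int word = (data[i] & 0xFF) << 8;       // high byte of 16-bit word
            if (i + 1 < data.length) {
                word |= data[i + 1] & 0xFF;          // low byte (0 if odd length)
            }
            sum += word;
            sum = (sum & 0xFFFF) + (sum >>> 16);     // fold the carry back in
        }
        return (int) (~sum & 0xFFFF);                // ones' complement of the sum
    }
}
```

Compute over the header with the checksum field zeroed, store the result in the field, and computing again over the whole header then yields 0.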
I didn't do IP-style checksums, I just did a simple CRC32 checksum with the correct value (not necessarily zero) being sent as part of the message. See MessageStream:getNext for the publisher side, and MessageStream:putNext for the subscriber side of things.
Let me know if I'm the one that needs to mop the rain!
@mjpt777 Not just yet. Once I put to bed some changes I'm working on over the next couple of days, I will look deeper into this and see what I can find.
Something is really quite odd. The receiver isn't just sometimes seeing old data; from the output you showed, it seems to also be seeing into the future, alternating between them.
I found one potential issue that could explain a difference and committed a change.
Same problems after the commit... but at least another potential bug got fixed. @tmontgomery: Not really seeing into the future - remember, the subscriber adjusts its expected next SQN to what it received + 1 each time it gets a message, even if what it received wasn't what it was expecting. So if it receives SQNs 0, 1, 2, 4, 3, 5, in that order, for example, then when it gets SQN 5 it's going to say it was "expecting" 4 (even though it already got it) - so 5 will appear to be from "the future". It's not using a bitmap for tracking a window of the last N SQNs received or anything fancy, just a single, simple expected next SQN.
I added a few lines of code to grab the term ID, offset, etc. of the messages that SubscriberTool was getting, and to save the values from the previously-delivered message so I could compare them to the values on the first message that triggers the out-of-order exception. Usually, even when the exception is hit, the values seem to make sense: the previous message and the current exception-triggering message are both in the same term ID, and the current message's offset is exactly a frame length ahead of the last message's. However, sometimes - and it may be easier to hit with larger-than-default message sizes, as I can hit this at a message size of 160 bytes - the values do not, in fact, make any sense at all. The previous message might have a term ID of 1297779264, and the current message a term ID of 1297779261, as one real example. I would expect the term ID of the current message to be either the same as that of the previous message or exactly one higher; I wouldn't expect it to be lower. But like I said - sometimes I still hit the exception while all the header values seem to be what I'd expect, so the header looks reasonable, but the message payload itself is not right.
I just did another run with the latest changes to see if anything had changed magically; looks like this is still a bug. But I do notice something interesting... The sequence numbers reported in the exceptions in this run are always ahead/behind by the same number of messages, exactly 898776. The frame lengths in this run will always be 56 (or should be, anyway - 24 bytes of header plus a 32-byte application payload). Term sizes are all defaults.
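Out of curiosity, that constant gap lines up suspiciously well with whole term buffers, assuming a default 16 MiB term length (an assumption on my part, not taken from the run's configuration):

```java
// Sanity arithmetic on the constant SQN gap reported above.
// The 16 MiB term length is an assumed default, not a verified setting.
final class GapArithmetic {
    static final long FRAMES_PER_GAP = 898_776L;           // gap seen in every warning
    static final long FRAME_LENGTH = 56L;                  // 24-byte header + 32-byte payload
    static final long ASSUMED_TERM_LENGTH = 16L * 1024 * 1024;

    static final long GAP_BYTES = FRAMES_PER_GAP * FRAME_LENGTH;       // 50,331,456 bytes
    static final long TERMS_SPANNED = GAP_BYTES / ASSUMED_TERM_LENGTH; // exactly 3
    static final long REMAINDER = GAP_BYTES % ASSUMED_TERM_LENGTH;     // 0 - an exact multiple
}
```

If that assumption holds, the gap is exactly three terms' worth of frames, which might be the clue about term rotation mentioned earlier.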
Also, note that when an older sequence number message is suddenly delivered out of order, it appears that the next few old messages after that one are also being delivered before suddenly switching back to the current stream. It's a whole run of older messages; the length of the run isn't consistent, though. The first run of older messages appears to be 33988 messages long (1234441 - 1200453). But the second run of old messages is only 292 (1234811 - 1234519). |
…ss duplicate data seen in #87. Added system property to disable printing of frame headers for LogInspector.
Worth doing a run with the latest update to see if the duplicate data goes away. The original issue reported, though, probably isn't addressed. |
A quick test with the updated code looks promising for addressing the single-subscriber case. Running for a couple of minutes doesn't give me any out-of-order old messages. I'll do a longer-running test tomorrow. Sadly, the two-subscriber case is still an issue; if I start two instances of SubscriberTool and then start PublisherTool, I get tons of these on both SubscriberTool instances:
So I'm not sure what sort of messages are being delivered, but in those instances, they're not the messages I sent... The output above was over about a second of wall-clock time, during which that particular instance of SubscriberTool received a little under 2 million messages. So 8 messed-up messages out of 2 million is not a big percentage, but it's more than 0%, so something's still up. |
The 0 checksum may be an artifact of the SubscriberTool. If I understand what's happening correctly, the two subscribers are both reading from the same log buffer. In the MessageStream checksum verification, each subscriber independently sets the checksum to 0 in the log buffer before updating the CRC32. It does set the checksum value back when done, but that leaves a window of time during which the checksum is read as 0 by the other subscriber. The buffer being used in the MessageStream class is misnamed "copybuf", because DirectBuffer.wrap just sets the address of the buffer. A quick check to verify this could be to copy out the data, or to remove the need to change the data in the log buffer.
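A minimal sketch of that suggestion, assuming a 4-byte checksum field at the start of the application header (illustrative names and layout, not the actual MessageStream code):

```java
import java.util.zip.CRC32;

// Verify a message's CRC32 without ever writing to the shared log buffer:
// blank the checksum field in a private copy instead of in place, so a
// second subscriber reading the same buffer never observes a zeroed field.
// Field offset/length are illustrative assumptions.
final class ImmutableVerifier {
    static final int CHECKSUM_OFFSET = 0;
    static final int CHECKSUM_LENGTH = 4;

    static boolean verify(byte[] sharedBuffer, int offset, int length, long expectedChecksum) {
        byte[] copy = new byte[length];                   // copy out; never mutate sharedBuffer
        System.arraycopy(sharedBuffer, offset, copy, 0, length);

        for (int i = 0; i < CHECKSUM_LENGTH; i++) {       // blank the field in the copy only
            copy[CHECKSUM_OFFSET + i] = 0;
        }

        CRC32 crc = new CRC32();
        crc.update(copy, 0, length);
        return crc.getValue() == expectedChecksum;
    }
}
```

The checksum math is unchanged; only the zeroing happens on private memory, so the race between the two subscribers disappears.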
Ahhhh, that would make a lot of sense. You're probably right... Let me try that out first and see if that's all it is. At least there was one real bug here. |
OK - turns out Brian was right, and the 0-checksum problem was the test tool itself. It didn't occur to me that the two different subscribers using the same media driver on the same machine would, of course, be reading out of the same buffer. I changed the code a bit to treat the buffer as immutable - but still calculate the checksum - and all is now well. So - feel free to close this bug out if you like. |
Nice one! |
Using the latest versions of PublisherTool and SubscriberTool, if I start one subscriber with an embedded driver, start another one on the same channel:stream ID without one, and start a publisher...
the two subscribers get messages for a few seconds and then wedge, getting nothing, and the publisher is blocked from sending. If I kill the publisher and start it again, I get an exception on the PublisherTool app:
If I kill and restart the publisher again, I don't get another exception, but it thinks it's sending for a few more messages and then wedges again. Meanwhile, the subscribers haven't gotten any new messages since the very first run of the Publisher.
This could be some sort of bug in the tools themselves, but the fact that I get an exception from the LogAppender on the Publisher upon restarting the tool makes me think that perhaps it's not.