I have a pretty simple kestrel setup:
1 kestrel server, backed by an amazon EBS file system, a number of queues
4 different java VMs on 3 machines, all running grabbyhands
2 of these are publishers only.
2 of these consume from certain queues and then also add to secondary queues, which they eventually consume from.
The publishers are webservers that only ever write to the queues, using transactional puts (ie, they wait for the put to complete).
The consumer vms use transactional reads, opening, then closing or aborting, using grabbyhands and try/finally blocks.
The strange behavior that occurs, almost without fail, is that restarting one of the webservers causes the consumer VMs to dequeue a very old message, one that seems to have originated when the VMs first started up last. Eg 24 hours old or so.
This is confusing to me for two reasons:
a. I can't figure out why these items are stuck in the queue in the first place, my code seems pretty tight with the try/finally
b. I can't understand why restarting the publisher would "abort" this operation and make it eligible for dequeueing. It would make more sense if I was restarting the consumer.
what version of kestrel is this?
one thing you can do is look at "stats" on your running servers. if any of them have a transaction open even when there are no jobs, then something is holding open that transaction.
This is Kestrel 2.1.5 . Unfortunately, our queues never have a dull moment where we could reliably glimpse that stat.
one other thing you can try is: the next time this happens on a queue, make a copy of that queue's journal file.
then, dump it with "qdump.sh" (in the scripts/ folder), which will show you the operations happening on that queue recently. with "-d", it'll also dump out the job contents. if you see the old job in there around the time a consumer got it, that would at least tell us that kestrel perceived the job as being added around the time it handed it to a consumer. if you don't, then something much weirder is going on. :)
I'm just getting back to this issue - it happens everytime my publishers are restarted ( I get an old message drained at the consumer). I'm not sure how to do what you are describing in the last comment. It seems the issue happens when the producer first starts - some of the first queue puts get stuck. Then when I restart that process ( a java vm) they come through in the queue to the consumer. I'm not clear on when/how I could do this dump. Maybe I need to stop using transactional puts w/ grabbyhands? I do like the certainty that the item is in the queue.
There isn't a notion of transactional put in kestrel. I gather you've configured kestrel to fsync immediately, so when you get a result of "STORED" from the server it means the item has successfully been written to disk. NOT_STORED would indicate it wasn't stored (though this only occurs if you configure kestrel to limit queue sizes or if you catch it at shutdown). A disconnect or error response gives you no information: The item may have been enqueued or not, depending on when the error in the put operation. (A client error, however, would mean the item was not enqueued.)
Furthermore, invoking the put method on a GrabbyHands sendQueue (as in the example in the GrabbyHands README file) is just handing an item to a java.util.concurrent.LinkedBlockingQueue. Depending on the configured value of sendQueueDepth, your producing thread may block waiting for a slot to open in the LinkedBlockingQueue, but once the item is in the LinkedBlockingQueue, there's no way to know if the item was ever sent to kestrel, let alone whether the item made it or not.
Having written all that, I can't think of any reason that a publisher being restarted would suddenly cause a message to reappear on a queue, unless the publisher itself is re-publishing it. If it is a bug, it would likely be in the networking code, which was heavily modified in kestrel 2.2.0 or in grabbyhands.
Robey's suggestion was to copy the queue's journal to a directory outside the one you use for your queues (e.g., using cp on linux). So if the queue's name is FOO and your kestrel config's queuePath is "/var/spool/kestrel", then "cp /var/spool/kestrel/FOO /tmp/". At that point you can run qdump.sh from the scripts directory of the kestrel distribution and it should dump out the contents of the journal. If there's a recent ADD operation for the old item, then the producer wrote it recently. You may need to copy more than one file if the journal is being rotated frequently.
Ok I will try this (copying the journal file). It is easy to reproduce, but a pain to remedy ( I have hundreds of queues, and it is never clear to me whether I am past this mode, or if real users are going to lose data). Just restarted servers because of a new code deployment and had a few hundred items come through that were a week old. I think they must be getting stuck in the grabby hands layer, because kestrel queue sizes are 0 based on stats. It's very strange because I had the queue depth set to 1 (the LinkedBlockingQueue) and I also can get through other messages on all the queues, it's not as if they are stuck, they just tend to lose things at the beginning (that come back as zombies on restart).
Thanks for helping. I did a qdump and I see a lot of interesting data, but I'm not sure how to interpret it. How can I see "if there's a recent ADD for an old item". The dump looks like this:
002406a6 ADD 425
00240864 RSV 336354
00240869 ACK 336354
0024086e ADD 425
00240a2c RSV 336355
00240a31 ACK 336355
I'm going to upgrade to kestrel 2.4.1 to see if this helps. I'm starting to suspect grabby-hands as being the culprit, and the dev on that hasn't done a commit in two years, so I don't have much hope there. Does anyone know if the current github version of grabby-hands what twitter still uses?
So are there can canceled (CAN) items in there? Those are being returned to kestrel and then you would subsequently see a new RSV and then ACK for the item. If the payloads are text (or close enough) you can add the -d flag to see them dumped as well, which might help you find the item you're trying to track down.
There is no newer version of grabby-hands available. Most JVM-based applications using kestrel within Twitter have migrated to the finagle-kestrel client. The public repo for finagle is https://github.com/twitter/finagle, and I believe the most recently published version is 5.3.22.
So is CAN from a transactional read being cancelled? The order I see is, about 10 times in my snapshot of the current logfile. I'm not sure if items are "stuck" in here, but I do have cancels:
00116cdb RSV 333735
0011da63 CAN 333735
Never anything after the can. Is the number the ID ? or is the hex the ID?
I will certainly look at finagle-kestrel , didn't even know about that. I would much rather be on something more modern that is maintained.
The format of those lines is:
So yes, something is starting a transaction with id 333735 (decimal) and subsequently the transaction is being canceled. Cancellation occurs either explicitly if your app calls GrabbyHands' Read.cancel() or automatically if there is an open transaction on a connection and that connection is clsoed. This could happen if Read.close() is never called by your app.
Looking at the GrabbyHands code, it offers a Read object on the BlockingQueue[Read] obtained via GrabbyHands.getRecvTransQueue (and will wait for effectively forever for the application code to call poll on the BlockingQueue). Once the item has been handed off, it waits on a CountDownLatch for the application code to call cancel() or close() on the Read object. It seems pretty bullet proof. I'd make sure your logic for transactional reads always calls cancel() or close() in a timely fashion.
Grabbyhands uses java.util.logging with a Logger named "grabbyhands". If you crank up the logging to "finest" you will see log messages for each reader thread, including when it receives an item and when the item is handed off to your application. Depending on the rate of messages in your system it may not be feasible to turn this on, but should help you track down when this happens.