
RedisClusterConnection stops operation after RedisCommandTimeoutException #42

Closed
taer opened this issue Feb 26, 2015 · 11 comments
Labels: type: bug (A general bug)
Milestone: Lettuce 3.1

Comments

taer (Contributor) commented Feb 26, 2015

Setup: a current master checkout of 3.1-SNAPSHOT, with one thread holding a single RedisClusterConnection<String, String>.

The thread calls hgetall in a loop. To force a timeout, place a VM breakpoint (pause all threads) in Utf8StringCodec.decode() and let it hang for a while (60 seconds is the default timeout, I believe).

Now, once you're sure the timeout has expired, remove the breakpoint and resume the JVM. The command you paused will throw from FutureSyncInvocationHandler:59:

    return LettuceFutures.await(command, timeout, unit);

It doesn't seem to occur on a normal RedisConnection.

This is as far as I have gotten. The original caller gets an exception, which is perfectly fine. However, every future call to hgetall on the same connection will now simply time out:

! com.lambdaworks.redis.RedisCommandTimeoutException: Command timed out
! at com.lambdaworks.redis.LettuceFutures.await(LettuceFutures.java:68) ~[tanknew1.jar:na]
! at com.lambdaworks.redis.FutureSyncInvocationHandler.handleInvocation(FutureSyncInvocationHandler.java:59) ~[tanknew1.jar:na]
! at com.lambdaworks.com.google.common.reflect.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:87) ~[tanknew1.jar:na]
! at com.sun.proxy.$Proxy95.hgetall(Unknown Source) ~[na:na]

Additional info: the "pause all threads" setting didn't seem to matter. I updated the breakpoint to pause just the single thread in decode(), and it broke the same way.
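
For reference, the reproduction loop looks roughly like this (a sketch against the 3.x API; exact imports and constructors may differ slightly, and the host, port, and key are placeholders):

    import com.lambdaworks.redis.RedisClusterConnection;
    import com.lambdaworks.redis.RedisURI;
    import com.lambdaworks.redis.cluster.RedisClusterClient;

    public class TimeoutRepro {
        public static void main(String[] args) {
            RedisClusterClient client = new RedisClusterClient(
                    RedisURI.Builder.redis("localhost", 7379).build());
            RedisClusterConnection<String, String> connection = client.connectCluster();

            // While this loop runs, hold a breakpoint in Utf8StringCodec.decode()
            // past the 60-second default timeout, then resume. The paused call
            // throws RedisCommandTimeoutException, and every subsequent hgetall
            // on this connection times out as well.
            while (true) {
                connection.hgetall("somekey");
            }
        }
    }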

That was how to reproduce the error. Here is all I've managed to dig up so far.

I did a thread dump. Numerous requests are blocked waiting for the lock on

    com.lambdaworks.redis.RedisAsyncConnectionImpl.dispatch()

The thread holding that monitor is blocked at

    at com.lambdaworks.redis.protocol.Command.get(Command.java:109)

which is a CountDownLatch.await().
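
In other words, the dispatching thread holds the connection's monitor while it parks on the command's latch, so every other caller piles up behind it. A simplified illustration of that pattern (not the actual lettuce source):

    import java.util.concurrent.CountDownLatch;

    // dispatch() is synchronized, so while one caller is parked inside it on
    // the command's latch, every other caller blocks on the object monitor.
    class StuckConnection {
        synchronized Object dispatch(StuckCommand cmd) throws InterruptedException {
            // (channel write elided)
            return cmd.get();
        }
    }

    class StuckCommand {
        private final CountDownLatch latch = new CountDownLatch(1);
        private volatile Object output;

        Object get() throws InterruptedException {
            latch.await(); // never counted down: the decoder abandoned this command
            return output;
        }
    }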

taer (Contributor, Author) commented Feb 26, 2015

OK, I see the problem now, but I'm not sure how this happens. In CommandHandler:118:

        while (!queue.isEmpty() && rsm.decode(buffer, queue.peek(), queue.peek().getOutput())) {

The queue is NOT empty. The next time through, after I break the connection via the Utf8StringCodec pause, it has two things in it. The head of the queue is my original ClusterCommand:

    Command [type=HGETALL, output=null]

Next in the queue is the one I just retried:

    Command [type=HGETALL, output=MapOutput [output={}, error='null']]

rsm.decode() is getting a null output but returning false, which exits the while loop and abandons the command I just sent.

Inverting the return at RedisStateMachine:81 to:

        if (output == null) {
            return !stack.isEmpty();
        }

sort of solves the problem (future commands against that key work again). The old command is taken from the queue and completed, and the loop runs again, sending the result to my real command. But I haven't dug through the RSM too much. I think the queue in CommandHandler should be removing timed-out Commands when they time out. Is that the intention of the stack.isEmpty() above?

Also, with that small hack above, isMoved() in ClusterCommand NPEs due to output being null. That could probably be checked for.
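
Something like this would avoid the NPE (a hypothetical sketch; the real field and accessor names in ClusterCommand may differ):

    // Hypothetical null guard for ClusterCommand.isMoved(); names may not
    // match the actual lettuce source.
    public boolean isMoved() {
        if (getOutput() == null || getOutput().getError() == null) {
            return false;
        }
        return getOutput().getError().startsWith("MOVED");
    }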

taer (Contributor, Author) commented Feb 26, 2015

e5fbd1e has a patch that makes it not freeze. However, as I said above, I have no idea whether it breaks other things.

taer added a commit to taer/lettuce that referenced this issue Feb 27, 2015
taer (Contributor, Author) commented Feb 27, 2015

I have a simple test case that reproduces this on both standalone and cluster: a80dfdd

mp911de (Collaborator) commented Feb 27, 2015

Hi @taer, I'll take a look at this.

taer added a commit to taer/lettuce that referenced this issue Feb 27, 2015
On interruption, the Command class nulls the output, but this leaves the state machine not knowing what to do with the actual output it eventually gets. If we leave the output in place, the RSM can still update it, even though the client has moved on.
taer (Contributor, Author) commented Feb 27, 2015

@mp911de I think I found a solution (and updated the test as well). The inversion of that line in the RSM was definitely incorrect.

I pushed a new version of BreakServer (it should really be called BreakClient): bd5227f. The fix is 3e20467.
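
The gist of the fix, per the commit message above: leave the output in place when the waiting caller gives up. A hypothetical sketch of the shape (see 3e20467 for the real change):

    // Sketch only: when the caller is interrupted (or times out), leave
    // `output` intact so the state machine can still decode the late reply
    // and pop this command off the queue.
    public T get() {
        try {
            latch.await();
        } catch (InterruptedException e) {
            // Before the fix, the output was nulled here, which stalled
            // RedisStateMachine.decode() on the late response and hung
            // every subsequent command on the connection.
            throw new RedisCommandInterruptedException(e);
        }
        return output.get();
    }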

mp911de added the type: bug label Feb 27, 2015
mp911de (Collaborator) commented Feb 27, 2015

You're faster than me. Would you like to create a PR that contains the fix and the cluster setup (your test, and perhaps the makefile changes as well)?

taer (Contributor, Author) commented Feb 27, 2015

Sure. I'm updating the test now to fit in with the rest of the integration tests, and I'm putting it someplace sane.

Since the Command fix covers both standalone and cluster, and a cluster setup in a test is much more error-prone, I removed the cluster part.

mp911de (Collaborator) commented Feb 27, 2015

There is a cluster setup rule (a JUnit TestRule; see the existing cluster tests) that might help you.
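
Usage would look roughly like this (a sketch of the JUnit 4 TestRule pattern; ClusterRule here is a stand-in name, so check the existing cluster tests for the real rule class and its constructor):

    import org.junit.Rule;
    import org.junit.Test;

    public class ClusterTimeoutTest {
        // Hypothetical rule name; the lettuce test sources define the
        // actual rule that provisions the cluster nodes for a test.
        @Rule
        public ClusterRule clusterRule = new ClusterRule();

        @Test
        public void timedOutCommandDoesNotStallConnection() {
            // exercise the timeout scenario against the rule-managed cluster
        }
    }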

mp911de added this to the Lettuce 3.1 milestone Feb 27, 2015
mp911de (Collaborator) commented Feb 27, 2015

Can I close this issue after #43 is merged?

taer (Contributor, Author) commented Feb 27, 2015

It satisfies this case, and I don't see it breaking anything else. I don't see it causing leaks either: if there is an output, the command will be taken off the queue in CommandHandler.java, letting both be GCed.

mp911de (Collaborator) commented Feb 27, 2015

Closing, everything is fine. Thanks @taer

mp911de closed this as completed Feb 27, 2015