RedisClusterClient unable to read partition information and results in ConcurrentModificationException #333

Closed
OutOfBrain opened this issue Aug 11, 2016 · 8 comments
Labels: type: bug (A general bug)
Milestone: Lettuce 4.2.2

@OutOfBrain (Author) commented Aug 11, 2016

Hey,

we are using the RedisClusterClient in our setup and have encountered several variations of the following two errors:

Error: com.lambdaworks.redis.RedisException: Cannot determine a partition to read for slot 7558

see [stacktrace1]
see [stacktrace2]

and

Error: com.lambdaworks.redis.RedisException: java.util.ConcurrentModificationException

see [stacktrace3]

This is our configuration / connection code, which seems to be the cause:

// imports assumed for the lettuce 4.2.x package layout
import java.util.concurrent.TimeUnit;

import com.lambdaworks.redis.RedisURI;
import com.lambdaworks.redis.cluster.ClusterClientOptions;
import com.lambdaworks.redis.cluster.ClusterTopologyRefreshOptions;
import com.lambdaworks.redis.cluster.RedisClusterClient;

final ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
    .enablePeriodicRefresh(60, TimeUnit.SECONDS) // this one breaks it
    .enableAllAdaptiveRefreshTriggers()
    .build();
final ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder()
    .autoReconnect(true)
    .maxRedirects(4)
    .topologyRefreshOptions(refreshOptions)
    .build();

String redisHost = context.getConfiguration().get(REDIS_HOST_KEY);
RedisClusterClient redisClusterClient = RedisClusterClient.create(RedisURI.Builder.redis(redisHost).build());

redisClusterClient.setOptions(clusterClientOptions);

connection = redisClusterClient.connect().sync();

We pinned it down to the marked line:

.enablePeriodicRefresh(60, TimeUnit.SECONDS) 

Without that line we don't experience any of the above exceptions.
The actual Redis command that triggers these exceptions is an hkeys call:

redis.hmset(playerKey, playerProfile);
// [...snip...]
redis.hkeys(playerKey);  // this triggers the error
// [...snip...]
redis.hdel(...)

Our use case is quite simple: we only use hmset, hkeys, and hdel, in that order.

tl;dr

For RedisClusterClient, configuring ClusterTopologyRefreshOptions with enablePeriodicRefresh(60, TimeUnit.SECONDS) results in ConcurrentModificationException and "Cannot determine a partition to read for slot" exceptions.

Stacktraces

[stacktrace1]

Error: com.lambdaworks.redis.RedisException: Cannot determine a partition to read for slot 1630 (Partitions: Partitions [RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.28', port=7001], nodeId='9fc072f1a3228f3fea613cbe7b50ab791310d625', connected=true, slaveOf='2d9f6f64b566c45f8caf0b8b0c3662185f23041a', pingSentTimestamp=0, pongReceivedTimestamp=1470761453982, configEpoch=33, flags=[SLAVE], slot count=0], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.24', port=7001], nodeId='8b2442dd5659702c59d1a6f3197b048f490d9727', connected=true, slaveOf='d30ffe3ed863cf028f6cad8f9a6babb97142a630', pingSentTimestamp=0, pongReceivedTimestamp=1470761453982, configEpoch=36, flags=[SLAVE], slot count=0], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.25', port=6379], nodeId='2494cbc3a5d0d8539d255ee0711f725655a2d06f', connected=true, slaveOf='cd836fbc2f6b9e19f9768702febb4a85759930fd', pingSentTimestamp=0, pongReceivedTimestamp=1470761451926, configEpoch=27, flags=[SLAVE], slot count=0], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.27', port=7001], nodeId='8d8f63487f1b1d8f562fee0efcd71d33c38ed612', connected=true, slaveOf='e59e15e90f131fead7cd13b43891144b14773519', pingSentTimestamp=0, pongReceivedTimestamp=1470761453982, configEpoch=32, flags=[SLAVE], slot count=0], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.25', port=7001], nodeId='2d9f6f64b566c45f8caf0b8b0c3662185f23041a', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1470761452440, configEpoch=33, flags=[MASTER], slot count=3278], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.27', port=6379], nodeId='d61059d427a992748bd63d44f26fa963a823503b', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1470761452956, configEpoch=34, flags=[MASTER], slot count=3276], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.24', port=6379], nodeId='4c41014792825ff94a596e86da9fac4da9254585', connected=true, slaveOf='d61059d427a992748bd63d44f26fa963a823503b', pingSentTimestamp=0, pongReceivedTimestamp=1470761453466, configEpoch=34, flags=[SLAVE], slot count=0], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.23', port=6379], nodeId='d30ffe3ed863cf028f6cad8f9a6babb97142a630', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=0, configEpoch=36, flags=[MASTER, MYSELF], slot count=3276], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.23', port=7001], nodeId='e59e15e90f131fead7cd13b43891144b14773519', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1470761453982, configEpoch=32, flags=[MASTER], slot count=3278], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.28', port=6379], nodeId='cd836fbc2f6b9e19f9768702febb4a85759930fd', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1470761452956, configEpoch=27, flags=[MASTER], slot count=3276], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.26', port=6379], nodeId='8c6623b00a586bf9f16f1b88bac1baad218a6cd7', connected=false, slaveOf='null', pingSentTimestamp=1470761451407, pongReceivedTimestamp=0, configEpoch=0, flags=[HANDSHAKE], slot count=0]]) with setting com.lambdaworks.redis.ReadFromImpl$ReadFromMaster@3003697
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:130)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:69)
at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:98)
at com.lambdaworks.redis.RedisChannelHandler.dispatch(RedisChannelHandler.java:118)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:189)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:1798)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.hkeys(AbstractRedisAsyncCommands.java:437)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl$ClusterFutureSyncInvocationHandler.handleInvocation(StatefulRedisClusterConnectionImpl.java:334)
at com.lambdaworks.redis.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:54)
at com.sun.proxy.$Proxy17.hkeys(Unknown Source)
[...snip...]
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

[stacktrace2]

Error: com.lambdaworks.redis.RedisException: Cannot determine a partition to read for slot 7558 (Partitions: Partitions []) with setting com.lambdaworks.redis.ReadFromImpl$ReadFromMaster@408613cc
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:130)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:69)
at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:98)
at com.lambdaworks.redis.RedisChannelHandler.dispatch(RedisChannelHandler.java:118)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:189)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:1798)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.hkeys(AbstractRedisAsyncCommands.java:437)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl$ClusterFutureSyncInvocationHandler.handleInvocation(StatefulRedisClusterConnectionImpl.java:334)
at com.lambdaworks.redis.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:54)
at com.sun.proxy.$Proxy17.hkeys(Unknown Source)
[...snip...]
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

[stacktrace3]

Error: com.lambdaworks.redis.RedisException: java.util.ConcurrentModificationException
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:77)
at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:98)
at com.lambdaworks.redis.RedisChannelHandler.dispatch(RedisChannelHandler.java:118)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:189)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:1798)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.hkeys(AbstractRedisAsyncCommands.java:437)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl$ClusterFutureSyncInvocationHandler.handleInvocation(StatefulRedisClusterConnectionImpl.java:334)
at com.lambdaworks.redis.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:54)
at com.sun.proxy.$Proxy17.hkeys(Unknown Source)
[...snip...]
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) Caused by: java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
at java.util.ArrayList$Itr.next(ArrayList.java:851)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadCandidates(PooledClusterConnectionProvider.java:174)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:116)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:69) ... 22 more
mp911de added the type: bug (A general bug) label on Aug 12, 2016
mp911de added a commit that referenced this issue on Aug 12, 2016:
All non-threadsafe access to Partitions is now synchronized. Functional blocks that change the state of Partitions are grouped into atomic blocks that prevent reads of stale data and reads of partially updated data.
mp911de added this to the Lettuce 4.2.2 milestone on Aug 12, 2016
mp911de added a commit that referenced this issue on Aug 12, 2016:
All non-threadsafe access to Partitions is now synchronized. Functional blocks that change the state of Partitions are grouped into atomic blocks that prevent reads of stale data and reads of partially updated data.
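
As a rough illustration only (hypothetical names, not the actual Partitions code), the approach described in the commit message amounts to funnelling reads and topology updates through a common monitor, so an update is applied as one atomic block:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: reads and writes share one monitor, so a topology
// reload is applied as a single atomic block and readers never observe a
// half-updated node list.
class SynchronizedTopology<T> {

    private final List<T> nodes = new ArrayList<>();

    public synchronized List<T> snapshot() {
        // Copy under the lock so callers can iterate a stable list afterwards.
        return new ArrayList<>(nodes);
    }

    public synchronized void reload(List<T> newNodes) {
        // Clear and repopulate within one synchronized block.
        nodes.clear();
        nodes.addAll(newNodes);
    }
}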
@mp911de (Collaborator) commented Aug 12, 2016

Thanks for your bug report. In your case the periodic topology refresh performs an update every 60 seconds. You also have the adaptive topology update configured, which determines from Redis responses whether to update the topology. The adaptive update is in fact the better one, as it updates the topology only on demand. I suggest raising the periodic refresh interval to 1-2 hours or getting rid of it altogether.
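
A minimal sketch of such a configuration, reusing only the options already shown in the report (adaptive refresh triggers only, no periodic refresh):

final ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
    .enableAllAdaptiveRefreshTriggers() // refresh on demand instead of on a fixed schedule
    .build();
final ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder()
    .autoReconnect(true)
    .maxRedirects(4)
    .topologyRefreshOptions(refreshOptions)
    .build();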

I deployed a fixed snapshot (4.2.2-SNAPSHOT) to Sonatype's OSS Snapshot Repository. Would you mind giving that snapshot a try with your current settings?

@OutOfBrain (Author)

Hey, we tested with 4.2.2-SNAPSHOT and got a NullPointerException:

Error: com.lambdaworks.redis.RedisException: java.lang.NullPointerException
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:77)
at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:100)
at com.lambdaworks.redis.RedisChannelHandler.dispatch(RedisChannelHandler.java:118)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:178)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:1797)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.hkeys(AbstractRedisAsyncCommands.java:436)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl$ClusterFutureSyncInvocationHandler.handleInvocation(StatefulRedisClusterConnectionImpl.java:321)
at com.lambdaworks.redis.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:54)
at com.sun.proxy.$Proxy18.hkeys(Unknown Source)
[...snip...]
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) Caused by: java.lang.NullPointerException
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadCandidates(PooledClusterConnectionProvider.java:175)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:116)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:69) ... 22 more

@mp911de (Collaborator) commented Aug 12, 2016

Do you have a bit more of the stack trace? It shows that the NullPointerException is wrapped inside a RedisException, but not where the NPE occurred.

@OutOfBrain (Author)

That is the full stack trace, except for 3 lines of our internal calls.
I think the last part of the stack trace is the interesting part:

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) Caused by: java.lang.NullPointerException
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadCandidates(PooledClusterConnectionProvider.java:175)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:116)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:69) ... 22 more

Looks like the NullPointerException originates here:
https://github.com/mp911de/lettuce/blob/4.2.x/src/main/java/com/lambdaworks/redis/cluster/PooledClusterConnectionProvider.java#L175

@mp911de (Collaborator) commented Aug 12, 2016

Thanks for the update. I haven't been able to reproduce the issue so far (single-/multi-threaded connection usage), so it's somewhat cumbersome to nail down the cause. I'm not sure whether synchronizing access to Partitions will fix the issue.

mp911de added a commit that referenced this issue on Aug 12, 2016:
…tes #333

Partitions now uses a cached read-view for query/read operations. This decouples reads from updates to Partitions. The most common case, iteration over Partitions, is now consistent with the state at the time the iterator was obtained. The iterator and its contents do not change during partition updates.
@mp911de (Collaborator) commented Aug 12, 2016

I changed Partitions slightly to use a consistent read-view for query/read operations. I'm not fully sure that this will fix the issue, but it at least adds more consistent behavior. I also deployed a new snapshot of 4.2.2-SNAPSHOT (lettuce-4.2.2-20160812.122340-2.jar) to Sonatype's OSS Snapshot Repository.
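
For illustration, a minimal sketch of the cached read-view idea (hypothetical names, not the actual Partitions implementation): readers iterate an immutable snapshot, and updates publish a fresh snapshot in a single write, so iteration never sees partially updated state:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of a cached read-view.
class NodeRegistry<T> implements Iterable<T> {

    private final List<T> nodes = new ArrayList<>();              // mutable master copy
    private volatile List<T> readView = Collections.emptyList();  // published snapshot

    @Override
    public Iterator<T> iterator() {
        // Iterates the snapshot that was current when iterator() was called.
        return readView.iterator();
    }

    public synchronized void update(List<T> newNodes) {
        nodes.clear();
        nodes.addAll(newNodes);
        // Publish an unmodifiable copy in one volatile write.
        readView = Collections.unmodifiableList(new ArrayList<>(nodes));
    }
}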

@OutOfBrain (Author)

Hey,

we just ran the initial config again with the new 4.2.2-SNAPSHOT and can confirm that no more exceptions are being thrown. Seems fixed.
Thanks for the very fast reaction time 👍

@mp911de (Collaborator) commented Aug 12, 2016

Great to hear that. Closing this ticket. The bugfix will be shipped with 4.2.2, scheduled for Aug 19.

mp911de closed this as completed on Aug 12, 2016
mp911de added a commit that referenced this issue on Aug 16, 2016:
…tes #333

Partitions now uses a cached read-view for query/read operations. This decouples reads from updates to Partitions. The most common case, iteration over Partitions, is now consistent with the state at the time the iterator was obtained. The iterator and its contents do not change during partition updates.