
Possible memory leak in RMapCache with evictionScheduler workflow, Redisson 3.15.x #5158

Closed
gitrahul9 opened this issue Jul 6, 2023 · 15 comments

@gitrahul9

We are using Redisson client 3.15.x with AWS ElastiCache, and we are seeing a couple of issues with the Redisson client while using RMapCache.

  1. The number of CurrConnections keeps increasing every couple of days (with no corresponding NewConnections), possibly a connection leak? This is potentially increasing Redis CPU.
  2. Memory leak behaviour: we see that our application heap memory is constantly increasing. A heap dump analysis snapshot is attached.
  3. We also observed that when doing a get from RMapCache, all our requests were by default going to the primary nodes instead of the replica nodes of our ElastiCache cluster (cluster mode enabled). We ran MONITOR on the cluster slave nodes and noticed the reads were being redirected to the primary. We changed from .get to .getWithTTLOnly, which started sending read requests to the replica nodes, but the read request count is still higher on the primary node (90:10).
  4. We also see that the number of eval commands is very high on our cluster.
  5. SetCommands are higher than GetCommands, while we expect our application to be read-heavy, with more get calls than puts.

Any suggestions on these issues, and are some of them related?
Is this a known issue that has been fixed in later versions?

@mrniko
Member

mrniko commented Jul 6, 2023

The number of CurrConnections keeps increasing every couple of days (with no corresponding NewConnections), possibly a connection leak? This is potentially increasing Redis CPU.

Can you share Redisson logs?

Memory leak behaviour: we see that our application heap memory is constantly increasing. A heap dump analysis snapshot is attached.

Do you call RMapCache.destroy() if the map isn't needed anymore?

We changed from .get to .getWithTTLOnly, which started sending read requests to the replica nodes, but the read request count is still higher on the primary node (90:10).

Which read operations do you use?

We also see that number of eval commands is very high on our cluster

RMapCache uses a lot of them.

SetCommands are higher than GetCommands, while we expect our application to be read-heavy, with more get calls than puts.

RMapCache uses eval scripts which in turn modify data.

@gitrahul9
Author

We don't have any error logs (as of now), or any Redisson logs for that matter, because we are over-provisioned, but earlier we saw Redisson timeout errors when the application was not able to connect to resources.
Can you be specific about which Redisson logs I should fetch for you? (We do not have debug enabled in the Redisson config anywhere.)

We do not call RMapCache.destroy anywhere; I'll look into this, but the heap analysis shows the evictionScheduler as the possible cause of the memory leak?

As for reads, we have plain key/value storage only, but internally it is probably getting converted to eval-based commands.
How can we avoid eval-based commands when our only use case is storing key/value pairs?

Or how can we optimize the Redisson client for this use case, and is an upgrade recommended as a solution for any of the above problems?

@uilton-oliveira

uilton-oliveira commented Jul 7, 2023

I was about to open an issue about this soon; I'm just collecting more evidence.
We are using the latest version (3.22.1) and also have all the problems reported, except the memory leak.

We recently migrated from a few LocalCachedMap keys (instances) with many entries to many keys with fewer entries, and also enabled writeBehindDelay / writeBehindBatchSize. Since then, we have seen CurrConnections increase from a stable 500 connections to 1-1.2k, but only for the master nodes (and it keeps increasing over time; we currently restart the application daily).

We also noticed that most of the Get calls are focused on nodes 001 (masters), even with readMode configured as SLAVE.

We also noticed some timeouts from time to time. We already increased subscriptionsPerConnection, subscriptionConnectionPoolSize, the timeouts, and nettyThreads, but without any metrics or logs indicating the current usage of connections/pools, it's really hard to know the best values for our case.

We don't have log files; we send them directly to Elasticsearch and consume them with Graylog. If you tell me what we are looking for, I can try changing the log level and filtering those messages, but the result will be in CSV format.

This is how the map is created/configured (screenshot attached).

Redisson config:

```yaml
clusterServersConfig:
  idleConnectionTimeout: 10000
  connectTimeout: 10000
  timeout: 10000
  retryAttempts: 3
  retryInterval: 1500
  credentialsResolver: !<org.redisson.client.DefaultCredentialsResolver> {}
  subscriptionsPerConnection: 8
  clientName: "prd_platform_cluster_3"
  sslEnableEndpointIdentification: true
  sslProvider: "JDK"
  pingConnectionInterval: 30000
  keepAlive: true
  tcpNoDelay: true
  nameMapper: !<com.cortex.system.web.cache.redis.RedisNameMapping> {}
  commandMapper: !<org.redisson.config.DefaultCommandMapper> {}
  loadBalancer: !<org.redisson.connection.balancer.RoundRobinLoadBalancer> {}
  slaveConnectionMinimumIdleSize: 24
  slaveConnectionPoolSize: 64
  failedSlaveReconnectionInterval: 3000
  failedSlaveCheckInterval: 180000
  masterConnectionMinimumIdleSize: 24
  masterConnectionPoolSize: 64
  readMode: "SLAVE"
  subscriptionMode: "MASTER"
  subscriptionConnectionMinimumIdleSize: 1
  subscriptionConnectionPoolSize: 80
  dnsMonitoringInterval: 5000
  natMapper: !<org.redisson.api.DefaultNatMapper> {}
  nodeAddresses:
  - "rediss://clustercfg.****.use1.cache.amazonaws.com:6379"
  scanInterval: 5000
  checkSlotsCoverage: true
  slaveNotUsed: false
threads: 16
nettyThreads: 64
codec: !<org.redisson.codec.Kryo5Codec> {}
referenceEnabled: true
transportMode: "NIO"
lockWatchdogTimeout: 30000
checkLockSyncedSlaves: false
slavesSyncTimeout: 1000
reliableTopicWatchdogTimeout: 600000
keepPubSubOrder: true
useScriptCache: false
minCleanUpDelay: 5
maxCleanUpDelay: 1800
cleanUpKeysAmount: 100
nettyHook: !<org.redisson.client.DefaultNettyHook> {}
useThreadClassLoader: true
addressResolverGroupFactory: !<org.redisson.connection.RoundRobinDnsAddressResolverGroupFactory> {}
```

@mrniko
Member

mrniko commented Jul 7, 2023

@gitrahul9

We do not call RMapCache.destroy anywhere; I'll look into this, but the heap analysis shows the evictionScheduler as the possible cause of the memory leak?

You need to call the RMapCache.destroy() method if the RMapCache instance won't be used anymore.
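For reference, a minimal sketch of this pattern: keep one long-lived RMapCache per logical cache and call destroy() only when that cache is retired, not per request. The map name "userSessions" and the class are just illustrations, not taken from the reporter's code.

```java
import java.util.concurrent.TimeUnit;
import org.redisson.api.RMapCache;
import org.redisson.api.RedissonClient;

public class SessionCache {
    // One long-lived RMapCache per logical cache; "userSessions" is an example name.
    private final RMapCache<String, String> cache;

    public SessionCache(RedissonClient redisson) {
        this.cache = redisson.getMapCache("userSessions");
    }

    public void put(String key, String value, long ttl, TimeUnit unit) {
        cache.put(key, value, ttl, unit);
    }

    public String get(String key) {
        return cache.getWithTTLOnly(key);
    }

    // Call only when this cache is no longer needed (e.g. on application shutdown);
    // destroy() releases the eviction task registered for this map name.
    public void close() {
        cache.destroy();
    }
}
```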

As for reads, we have plain key/value storage only, but internally it is probably getting converted to eval-based commands.
How can we avoid eval-based commands when our only use case is storing key/value pairs?

Eval scripts can't be avoided.

@mrniko
Member

mrniko commented Jul 7, 2023

@uilton-oliveira

Do you use the RLocalCachedMap object and not RMapCache?
How many Redisson instances are connected to the Redis cluster?
Does a single Redisson instance's connection count fit within the pool size?

@uilton-oliveira

uilton-oliveira commented Jul 7, 2023

@uilton-oliveira

Do you use the RLocalCachedMap object and not RMapCache?
How many Redisson instances are connected to the Redis cluster?

Yes, RLocalCachedMap.

There are 21 application instances connected to the Redis cluster.

Does a single Redisson instance's connection count fit within the pool size?

How can I check this?

@mrniko
Member

mrniko commented Jul 7, 2023

@uilton-oliveira

21*64 + 21*80 = 3024 connections may be reached at peak. Try decreasing the masterConnectionPoolSize, slaveConnectionPoolSize, and subscriptionConnectionPoolSize values.

@uilton-oliveira

uilton-oliveira commented Jul 7, 2023

  1. Just to understand better: is there a relation between splitting the data into more (smaller) RLocalCachedMap instances instead of one big RLocalCachedMap and the increase in connections? Is that expected?

  2. About reducing the masterConnectionPoolSize, slaveConnectionPoolSize, and subscriptionConnectionPoolSize values: is there a way to see the current connection usage programmatically? With those values known, it would be easier to determine the configuration that best fits our needs.

  3. Is it expected that connections only increase over time when needed and never go down? Could it be related to Connection selection algorithm to minimize number of connections needed #5151?

  4. Lastly, any idea why the get commands are focused only on nodes 001 (master) instead of 002 (slave), as configured in readMode?

Edit:
We created another Redis cluster with 5 shards instead of 3, but with less powerful instances. The timeouts seem gone for now and the connections per node decreased, but the connections still seem to be increasing over time and never going down.

@gitrahul9
Author

I have 4 questions:

This is somewhat like my current implementation for get and put.
get:

    final RMapCache<String, String> cacheMap = redissonClient.getMapCache(key);
    return cacheMap.getWithTTLOnly(key);

put:

    final RMapCache<String, String> cacheMap = redissonClient.getMapCache(key);
    cacheMap.put(key, value, ttl, timeUnit);

  1. Do you think destroying the map after get/put will fix the memory leak problem?

With something like this?
put:

    final RMapCache<String, String> cacheMap = redissonClient.getMapCache(key);
    cacheMap.put(key, value, ttl, timeUnit);
    cacheMap.destroy();

get:

    final RMapCache<String, String> cacheMap = redissonClient.getMapCache(key);
    final String val = cacheMap.getWithTTLOnly(hashKey.getHashKey());
    cacheMap.destroy();
    return val;

  2. Also, is it good practice to create the map when initializing the client, or with each get/put call?
  3. How can I distribute my read/get requests over the replica nodes? Currently my Redisson client read mode is set to SLAVE, but it is not being honored.
  4. How can I stop the increasing connections? I see that each of my application containers is making ~25 connections to the cluster (the default pool size?). I have 50 containers, so ~1250, but every few days the Redis connection count increases and stays that way, while my application's needs were definitely being served by the pool. Any suggestions on this?

@mrniko
Member

mrniko commented Jul 10, 2023

Just to understand better: is there a relation between splitting the data into more (smaller) RLocalCachedMap instances instead of one big RLocalCachedMap and the increase in connections? Is that expected?

No. You can decrease the subscriptionConnectionPoolSize setting and increase subscriptionsPerConnection instead.
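As a rough sketch of that trade-off using the programmatic cluster config (the same keys appear in the YAML above); the node address and the concrete numbers here are only illustrative, not recommendations:

```java
import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class SubscriptionTuning {
    public static RedissonClient create() {
        Config config = new Config();
        config.useClusterServers()
              .addNodeAddress("rediss://clustercfg.example.use1.cache.amazonaws.com:6379") // placeholder endpoint
              // Fewer dedicated pub/sub connections per node...
              .setSubscriptionConnectionPoolSize(10)   // illustrative value
              // ...while letting more channel subscriptions share each connection.
              .setSubscriptionsPerConnection(16);      // illustrative value
        return Redisson.create(config);
    }
}
```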

About reducing the masterConnectionPoolSize, slaveConnectionPoolSize, and subscriptionConnectionPoolSize values: is there a way to see the current connection usage programmatically? With those values known, it would be easier to determine the configuration that best fits our needs.

No. Metrics are available only in the PRO version.

Is it expected that connections only increase over time when needed and never go down? Could it be related to #5151?

Yes. This part should be improved.

Lastly, any idea why the get commands are focused only on nodes 001 (master) instead of 002 (slave), as configured in readMode?

MapCache get() commands modify the idleTimeout setting, which is why the master node is used. Try using the getWithTTLOnly() or getAllWithTTLOnly() methods.
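For reference, a minimal sketch of that replica-friendly read path; the map and key names are illustrative, and getAllWithTTLOnly() is assumed to take a set of keys and return a map, mirroring getAll(), in versions where it is available:

```java
import java.util.Map;
import java.util.Set;
import org.redisson.api.RMapCache;
import org.redisson.api.RedissonClient;

public class ReplicaReads {
    public static void read(RedissonClient redisson) {
        RMapCache<String, String> cache = redisson.getMapCache("productCache"); // illustrative name

        // get() touches idleTimeout bookkeeping and therefore runs on the master;
        // getWithTTLOnly() only reads, so with readMode = SLAVE it can be served by replicas.
        String one = cache.getWithTTLOnly("sku-1");

        // Batched variant for several keys in one call.
        Map<String, String> many = cache.getAllWithTTLOnly(Set.of("sku-1", "sku-2"));

        System.out.println(one + " / " + many);
    }
}
```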

@mrniko
Member

mrniko commented Jul 10, 2023

cacheMap.put(key, value, ttl, timeUnit);

Use the fastPut() method if you don't need the previous value.
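A small sketch of the difference, with illustrative names: put() returns (and therefore has to fetch) the previous value, while fastPut() only reports whether the key was new, so the old value is never transferred back.

```java
import java.util.concurrent.TimeUnit;
import org.redisson.api.RMapCache;
import org.redisson.api.RedissonClient;

public class PutVsFastPut {
    public static void write(RedissonClient redisson, String key, String value) {
        RMapCache<String, String> cache = redisson.getMapCache("productCache"); // illustrative name

        // put() returns the previous value (an extra payload to read back).
        String previous = cache.put(key, value, 10, TimeUnit.MINUTES);

        // fastPut() skips that: it just returns true if the key was absent before.
        boolean wasNew = cache.fastPut(key, value, 10, TimeUnit.MINUTES);

        System.out.println("previous=" + previous + ", wasNew=" + wasNew);
    }
}
```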

Do you think destroying the map after get/put will fix the memory leak problem?

Call the destroy() method when the map isn't used anymore, not after each put() call. Is that possible? Or are there no unused MapCache instances in the application?

How can I distribute my read/get requests over the replica nodes? Currently my Redisson client read mode is set to SLAVE, but it is not being honored.

Can you share more details? getWithTTLOnly() always uses slaves if readMode = SLAVE

How can I stop the increasing connections?

Set MinimumIdleSize and PoolSize to the same value.
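In config terms, that suggestion looks roughly like the sketch below, using the programmatic API (the same keys exist in the YAML config shown earlier); the endpoint is a placeholder and 24 is just an illustrative value echoing the defaults mentioned in this thread:

```java
import org.redisson.config.ClusterServersConfig;
import org.redisson.config.Config;

public class FixedPoolSizes {
    public static Config build() {
        Config config = new Config();
        ClusterServersConfig cluster = config.useClusterServers()
                .addNodeAddress("rediss://clustercfg.example.use1.cache.amazonaws.com:6379"); // placeholder endpoint

        // Keeping the minimum idle size equal to the pool size means the pool is
        // pre-filled and never grows, so the connection count per node stays flat.
        cluster.setMasterConnectionMinimumIdleSize(24)
               .setMasterConnectionPoolSize(24)
               .setSlaveConnectionMinimumIdleSize(24)
               .setSlaveConnectionPoolSize(24);
        return config;
    }
}
```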

@mrniko
Member

mrniko commented Jul 10, 2023

As an option, you can use the MapCacheOptions.removeEmptyEvictionTask() method introduced in #5161 to minimize memory allocation by the evictionScheduler.
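A hedged sketch of that option, assuming the MapCacheOptions API and the matching getMapCache overload added in 3.23.0 alongside #5161; the map name is illustrative:

```java
import org.redisson.api.MapCacheOptions;
import org.redisson.api.RMapCache;
import org.redisson.api.RedissonClient;

public class EvictionTaskCleanup {
    public static RMapCache<String, String> open(RedissonClient redisson) {
        // removeEmptyEvictionTask() allows the eviction scheduler to drop its task
        // once the map is empty, so per-map scheduler entries don't accumulate.
        MapCacheOptions<String, String> options =
                MapCacheOptions.<String, String>defaults().removeEmptyEvictionTask();
        return redisson.getMapCache("productCache", options); // illustrative name
    }
}
```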

@gitrahul9
Author

gitrahul9 commented Jul 10, 2023

Call the destroy() method when the map isn't used anymore, not after each put() call. Is that possible? Or are there no unused MapCache instances in the application?

I'm sorry, I didn't fully understand the question. We are creating the map with every get/put call and using it there only (not reusing it anywhere else, I believe). I understand now that we can't destroy it after every put, since that deletes the data in the cache.
I was wondering if there is anything associated with each map that is causing the memory leak. In my heap analysis, Redisson's EvictionScheduler shows up as the cause of the memory leak.

Can you share more details? getWithTTLOnly() always uses slaves if readMode = SLAVE

So, earlier we used cacheMap.get(), which made all calls go to the master nodes only. Now that we have moved to getWithTTLOnly(), we see that the read replicas are getting load, but most of it is still going to the primary nodes. One behaviour I observed during a load test: if we kept the load running and all calls were being served from the cache, eventually the load moved to the read replicas. But in production we are seeing that the replicas serve less than 10% of the traffic.

Set MinimumIdleSize and PoolSize to the same value.

We have not explicitly set any values for these configurations, and the default value for MasterMinIdleSize and poolSize is only 24. Should I set SlavePoolSize to 24 as well? I wonder how that would make a difference?

@mrniko
Member

mrniko commented Jul 10, 2023

I'm sorry, I didn't fully understand the question. We are creating the map with every get/put call and using it there only (not reusing it anywhere else, I believe). I understand now that we can't destroy it after every put, since that deletes the data in the cache.
I was wondering if there is anything associated with each map that is causing the memory leak. In my heap analysis, Redisson's EvictionScheduler shows up as the cause of the memory leak.

You can upgrade to version 3.23.0 and try my suggestion #5158 (comment).

We have not explicitly set any values for these configurations, and the default value for MasterMinIdleSize and poolSize is only 24. Should I set SlavePoolSize to 24 as well? I wonder how that would make a difference?

Yes, try it.

But in production we are seeing that the replicas serve less than 10% of the traffic.

The write/read operations ratio might be different in production.

@uilton-oliveira

uilton-oliveira commented Jul 10, 2023

About reducing the masterConnectionPoolSize, slaveConnectionPoolSize, and subscriptionConnectionPoolSize values: is there a way to see the current connection usage programmatically? With those values known, it would be easier to determine the configuration that best fits our needs.

No. Metrics are available only in the PRO version.

I'm not asking for metrics specifically, but for a way to see it programmatically, maybe, or via logs.

The problem I see is that the wiki does not explain exactly what these settings do, in which cases we should increase them, and based on what. We're completely blind here; the only way I see is to increase a value a little each time and hope that the error goes away.

For subscriptionsPerConnection, for example, all that the wiki gives us is: "Subscriptions per Redis connection limit".

Just to understand better: is there a relation between splitting the data into more (smaller) RLocalCachedMap instances instead of one big RLocalCachedMap and the increase in connections? Is that expected?

No. You can decrease the subscriptionConnectionPoolSize setting and increase subscriptionsPerConnection instead.

What would be a reasonable value for subscriptionsPerConnection? 10, 15?
And subscriptionConnectionPoolSize? Back to the default of 50? Or maybe keep it a little higher, like 60-70?

Sorry for asking, but we're completely blind here without knowing exactly what these settings do or what is currently being used.

Lastly, any idea why the get commands are focused only on nodes 001 (master) instead of 002 (slave), as configured in readMode?

MapCache get() commands modify the idleTimeout setting, which is why the master node is used. Try using the getWithTTLOnly() or getAllWithTTLOnly() methods.

In my case I use RLocalCachedMap, and I don't see any getAllWithTTLOnly method; this map type doesn't seem to have a timeout...

@mrniko mrniko added this to the 3.23.2 milestone Jul 24, 2023
@mrniko mrniko closed this as completed Jul 24, 2023
@mrniko mrniko reopened this Jul 24, 2023
@mrniko mrniko closed this as completed Jul 25, 2023
@mrniko mrniko removed this from the 3.23.2 milestone Jul 25, 2023