redis-rdb-bgsave process makes the cluster fail #7345

Open
fidelgonzalez opened this issue May 29, 2020 · 4 comments

@fidelgonzalez

Hi community!

I have a Redis cluster running on the stable version of redis-6.0.1. Each node is a virtual machine running on Azure.

  • Every VM has 32 GB of RAM and 4 CPUs.
  • The Redis working directory is on a P6 SSD disk, with the goal of improving snapshot write performance.
  • Our save condition is configured as "save 60 10000":
ubuntu@redis-1:~$ redis-cli -p 7379 config get save
1) "save"
2) "60 10000"
8919519ac6f63ea0a5419d53de81cb2b6baaef70 10.0.0.23:7380@17380 master - 0 1590773460863 2 connected 4108-8191
e9b83dca26a7def0c64ace41837b34a6dd6558a2 10.0.0.47:7382@17382 master - 0 1590773461000 4 connected 12300-16383
1d1fc571cf65efde1b28f45c92c1a829050b1dc3 10.0.0.24:7381@17381 master - 0 1590773463111 3 connected 8204-12287
de20bc91789fb7f4ef44750781fad551a6d72b66 10.0.0.22:7379@17379 myself,master - 0 1590773460000 1 connected 13-4095
736abe36922205a5e3de24bf219f703327909056 10.0.0.48:7383@17383 master - 0 1590773463000 5 connected 0-12 4096-4107 8192-8203 12288-12299

The issue is related to cluster performance while the rdb-bgsave process is running. When memory usage approaches the Redis "maxmemory" limit, the bgsave process starts to consume swap. This makes the cluster fail randomly while the snapshot is running.
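A quick way to confirm that, assuming redis-cli is available on the node and the same port as above, is to watch the fork child's copy-on-write overhead and the system swap activity while a snapshot is running:

$ redis-cli -p 7379 info persistence | grep -E 'rdb_bgsave_in_progress|rdb_last_cow_size'   # COW bytes reported for the last completed save
$ free -m                                                                                   # overall RAM and swap usage
$ vmstat 1 5                                                                                # si/so columns show pages swapped in/out per second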

This is the redis-server.log at that moment:

92656:M 29 May 2020 16:01:00.392 * 10000 changes in 60 seconds. Saving...
92656:M 29 May 2020 16:01:00.869 * Background saving started by pid 57958
92656:M 29 May 2020 16:01:00.869 # Cluster state changed: fail
92656:M 29 May 2020 16:01:00.870 * Marking node 8919519ac6f63ea0a5419d53de81cb2b6baaef70 as failing (quorum reached).
92656:M 29 May 2020 16:01:14.686 * FAIL message received from 8919519ac6f63ea0a5419d53de81cb2b6baaef70 about 1d1fc571cf65efde1b28f45c92c1a829050b1dc3
92656:M 29 May 2020 16:03:27.083 * Clear FAIL state for node 1d1fc571cf65efde1b28f45c92c1a829050b1dc3: is reachable again and nobody is serving its slots after some time.
92656:M 29 May 2020 16:03:44.354 * Clear FAIL state for node 8919519ac6f63ea0a5419d53de81cb2b6baaef70: is reachable again and nobody is serving its slots after some time.
92656:M 29 May 2020 16:03:44.354 # Cluster state changed: ok
92656:M 29 May 2020 16:04:45.042 # Cluster state changed: fail
92656:M 29 May 2020 16:04:45.371 * Marking node 8919519ac6f63ea0a5419d53de81cb2b6baaef70 as failing (quorum reached).
92656:M 29 May 2020 16:04:57.240 * FAIL message received from e9b83dca26a7def0c64ace41837b34a6dd6558a2 about 1d1fc571cf65efde1b28f45c92c1a829050b1dc3
92656:M 29 May 2020 16:05:43.406 * Clear FAIL state for node 8919519ac6f63ea0a5419d53de81cb2b6baaef70: is reachable again and nobody is serving its slots after some time.
92656:M 29 May 2020 16:05:43.715 * Clear FAIL state for node 1d1fc571cf65efde1b28f45c92c1a829050b1dc3: is reachable again and nobody is serving its slots after some time.
92656:M 29 May 2020 16:05:48.458 # Cluster state changed: ok
92656:M 29 May 2020 16:06:01.902 * Marking node 8919519ac6f63ea0a5419d53de81cb2b6baaef70 as failing (quorum reached).
92656:M 29 May 2020 16:06:01.902 # Cluster state changed: fail
92656:M 29 May 2020 16:06:28.835 * FAIL message received from 8919519ac6f63ea0a5419d53de81cb2b6baaef70 about 1d1fc571cf65efde1b28f45c92c1a829050b1dc3
92656:M 29 May 2020 16:06:32.819 * Clear FAIL state for node 8919519ac6f63ea0a5419d53de81cb2b6baaef70: is reachable again and nobody is serving its slots after some time.
92656:M 29 May 2020 16:09:13.298 * Marking node 8919519ac6f63ea0a5419d53de81cb2b6baaef70 as failing (quorum reached).
92656:M 29 May 2020 16:09:13.410 * Clear FAIL state for node 1d1fc571cf65efde1b28f45c92c1a829050b1dc3: is reachable again and nobody is serving its slots after some time.
92656:M 29 May 2020 16:09:44.038 * Clear FAIL state for node 8919519ac6f63ea0a5419d53de81cb2b6baaef70: is reachable again and nobody is serving its slots after some time.
92656:M 29 May 2020 16:09:44.038 # Cluster state changed: ok
57958:C 29 May 2020 16:11:41.501 * DB saved on disk
57958:C 29 May 2020 16:11:42.112 * RDB: 872 MB of memory used by copy-on-write
92656:M 29 May 2020 16:11:43.217 * Background saving terminated with success

As you can see, the errors in the cluster state start and finish while the redis-rdb-bgsave process is running.

How can we solve this? Is there any configuration recommendation for this issue? It is important for us to keep persistence on disk without degrading the service.

Regards!

@javierpajaro

Hi @antirez,

As my colleague said, from time to time we experience long pauses in the application.

We thought it was attributable to the bgsave process, as we have a high volume of key changes in a short period of time.

In order to solve this:

  • We dedicated 3 CPUs to data processing and left 1 for administrative tasks.
  • We disabled persistence, as we can live without it.
  • We set maxmemory (a rough redis-cli sketch of these settings follows this list).
  • We set expiration on keys according to business needs.
  • We put in place an autoscaling process to add nodes and migrate keys automatically.
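For reference, a minimal sketch of the memory and persistence side of those steps, applied at runtime with redis-cli; the 24gb limit and the volatile-ttl eviction policy are assumptions, not necessarily the exact values used here:

$ redis-cli -p 7379 config set save ""                        # stop RDB snapshots
$ redis-cli -p 7379 config set appendonly no                  # no AOF persistence either
$ redis-cli -p 7379 config set maxmemory 24gb                 # leave headroom under the 32 GB of the VM
$ redis-cli -p 7379 config set maxmemory-policy volatile-ttl  # evict keys that carry a TTL first
$ redis-cli -p 7379 config rewrite                            # write the changes back to redis.conf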

Adding a node works fine, but migrating keys, even though we do it in small groups of 20, pauses the cluster for about one minute from time to time.
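For context, the manual version of that slot-by-slot migration looks roughly like the following; the target 10.0.0.99:7384, the node IDs, and the key names are placeholders, and the batch size passed to CLUSTER GETKEYSINSLOT (and then to MIGRATE ... KEYS) is what bounds how long each blocking MIGRATE call holds up both nodes:

$ # mark the slot as moving on both ends
$ redis-cli -h 10.0.0.99 -p 7384 cluster setslot 4108 importing <source-node-id>
$ redis-cli -h 10.0.0.23 -p 7380 cluster setslot 4108 migrating <target-node-id>
$ # fetch a small batch of keys from the slot and move them in one MIGRATE call
$ redis-cli -h 10.0.0.23 -p 7380 cluster getkeysinslot 4108 20
$ redis-cli -h 10.0.0.23 -p 7380 migrate 10.0.0.99 7384 "" 0 5000 keys key1 key2 key3
$ # once the slot is empty, assign it to the target node
$ redis-cli -h 10.0.0.23 -p 7380 cluster setslot 4108 node <target-node-id>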

Please, can you give us some advice to reduce downtime and maximize response times?

@ShooterIT
Collaborator

Maybe fork? You could use the LATENCY command to get the fork time.
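Assuming latency monitoring is enabled (latency-monitor-threshold greater than 0), something along these lines reports how long recent forks took:

$ redis-cli -p 7379 config set latency-monitor-threshold 100   # record events slower than 100 ms
$ redis-cli -p 7379 latency latest                              # latest/max latency per recorded event
$ redis-cli -p 7379 latency history fork                        # timestamped samples for the fork event
$ redis-cli -p 7379 info stats | grep latest_fork_usec          # duration of the last fork, in microseconds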

@javierpajaro

We will try and let you know. Thanks.

@javierpajaro

We set the lazy-free options to yes and tuned io-threads and bgsave. And it worked very well!
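Presumably the settings referred to are along these lines; the thread count and the config file path are assumptions tied to the 4-CPU VMs described above:

$ # lazy freeing can be flipped at runtime
$ redis-cli -p 7379 config set lazyfree-lazy-eviction yes
$ redis-cli -p 7379 config set lazyfree-lazy-expire yes
$ redis-cli -p 7379 config set lazyfree-lazy-server-del yes
$ redis-cli -p 7379 config set lazyfree-lazy-user-del yes
$ # io-threads cannot be changed at runtime in 6.0; it has to be in redis.conf before startup, e.g.:
$ grep '^io-threads ' /etc/redis/redis.conf
io-threads 3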

But now we are experiencing trouble with cluster resharding. It is too slow for production: adding a new cluster member and resharding slots live takes more than a day. Worse than that, the process degrades the performance of the whole cluster while it is executing.

Is there any way to do this faster?

We tried to disable bgsave before resharding but it did not help.
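One knob worth checking, as a sketch rather than a recommendation: redis-cli's own reshard batches keys per MIGRATE call through --cluster-pipeline (default 10), so raising it reduces round trips per slot at the cost of longer individual blocking MIGRATE calls. The node IDs and the numbers below are placeholders:

$ redis-cli --cluster reshard 10.0.0.22:7379 \
      --cluster-from <source-node-id> \
      --cluster-to <target-node-id> \
      --cluster-slots 1024 \
      --cluster-pipeline 100 \
      --cluster-yes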
