Redis Master keeps changing during heavy load #12787
Comments
@rishinair19 Do you see any logs like
If there is a lot of write traffic during BGSAVE, the client output buffer could grow beyond the threshold (soft/hard) and lead to disconnection of the replica. You could try increasing the |
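To make the soft/hard threshold behaviour concrete, here is a minimal Python sketch of how such a limit could be evaluated. It is a simplified model, not Redis source code: the `BufferLimit` and `first_disconnect` names are made up for illustration, the 30 MB/s growth rate is invented, and the 60-second soft window on the raised limit is an assumption because the thread does not state it. The `256mb 64mb 60` values are the stock replica defaults.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

MB = 1024 * 1024


@dataclass
class BufferLimit:
    """Hypothetical container mirroring: <hard limit> <soft limit> <soft seconds>."""
    hard_bytes: int
    soft_bytes: int
    soft_seconds: int


def first_disconnect(
    samples: Iterable[Tuple[int, int]], limit: BufferLimit
) -> Optional[Tuple[int, str]]:
    """Return (timestamp, reason) of the first disconnect, or None.

    samples: (timestamp_seconds, buffer_bytes) pairs sorted by time.
    """
    soft_since = None  # when the buffer first crossed the soft limit
    for ts, used in samples:
        # Hard limit: disconnect as soon as it is reached.
        if limit.hard_bytes and used >= limit.hard_bytes:
            return ts, "hard"
        # Soft limit: disconnect only if exceeded continuously for too long.
        if limit.soft_bytes and used >= limit.soft_bytes:
            if soft_since is None:
                soft_since = ts
            elif ts - soft_since > limit.soft_seconds:
                return ts, "soft"
        else:
            soft_since = None  # dropped below the soft limit, reset the timer
    return None


# Invented workload: the replica buffer grows by 30 MB every second.
samples = [(t, t * 30 * MB) for t in range(120)]

default_limit = BufferLimit(256 * MB, 64 * MB, 60)
raised_limit = BufferLimit(4096 * MB, 1024 * MB, 60)  # soft window assumed

print(first_disconnect(samples, default_limit))  # (9, 'hard')
print(first_disconnect(samples, raised_limit))   # (96, 'soft')
```

In this toy workload, raising the limits only postpones the disconnect; if the buffer keeps growing for as long as the write burst lasts, the soft limit is eventually hit instead of the hard one.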
I had increased client-output-buffer-limit from 256 MB to 1 GB but yeah still getting these in logs:
Setting it to 4 GB and will check if we get same messages. |
I have increased hard limit to 4GB and soft limit to 1 GB, not seeing these messages in logs now:
But still seeing the original issue, master node keeps switching. Logs from master:
Logs from slave:
Any other tuning that can be done to fix this? |
i'm not familiar with sentinel and its triggers for failover, maybe @moticless can help here. i do think the THP warning could be relevant. i'm not sure why |
what is your down-after-milliseconds and failover-timeout config options values in sentinel.conf ? |
@rishinair19 Would you be able to share the logs for both primary and replica covering the time period in which the failover happens? |
@oranagra SLOWLOG does report some queries but all of them are under 5ms. Not sure what to check here. I had enabled the latency monitor config and ran the LATENCY LATEST and LATENCY DOCTOR commands, but they came up with empty results. I saw the THP warning once in the logs but not again, so maybe that is resolved now. @benimohit down-after-milliseconds is 5 seconds, we have not set failover-timeout in sentinel.conf |
@hpatro Here are the logs: Sentinel:
Master:
Slave1:
|
Slave2 with debug logs:
|
can you print |
|
are you sure you enabled the latency monitor correctly? |
I reduced latency-monitor-threshold to 100 an hour ago.
Output of LATENCY LATEST command:
|
SLOW LOG:
|
I have increased client-output-buffer-limit to 10GB now, repl-backlog-size is 512mb. This is happening almost every 30 minutes. |
Another important observation is that the issue occurs when load is high. We were trying to load keys and found the master changing every 2-3 minutes:
|
We disabled the KEYS command, which did reduce the load on Redis, but after a while Redis went down again. We did some load testing and found no issues when running heavy read queries: CPU spiked to 100% for a few seconds but came down immediately. In write tests, CPU was continuously at 100% and the master was marked as down after a few seconds. The child process forked during BGSAVE was also using 100% CPU. Disabling BGSAVE did not make much difference. We tried increasing io-threads, but that causes Redis to crash. We can't optimize these queries, so we will try using Redis Cluster with multiple masters. |
Regarding the sentinel part of this issue, I doubt it is the root cause of the problem. From the logs it is rather obvious that the master is truly unreachable by all entities. You can increase the value of |
@moticless so you're saying that sentinel triggered that failover because redis didn't respond to PING for more than 5 seconds. (that part is clear from the logs). no other issues, and the connections weren't dropped. and by looking at the latency metrics we don't see such long delays so we don't have an explanation why PING got delayed so much.. |
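For readers following along, the "no reply to PING for more than 5 seconds" rule can be sketched as below. This is only an illustrative model of the down-after-milliseconds check, not Sentinel's actual code: `sdown_deadline` is a hypothetical helper, the reply timestamps are invented, and the 10000 ms value is just a comparison point, not a recommendation taken from this thread.

```python
from typing import Iterable, Optional


def sdown_deadline(
    reply_times_ms: Iterable[int], down_after_ms: int, observe_until_ms: int
) -> Optional[int]:
    """Return the time (ms) at which the instance would be considered
    subjectively down, or None if every gap between valid PING replies
    stays within down_after_ms.

    reply_times_ms: timestamps at which a valid PING reply was received.
    """
    last_ok = 0  # assume the instance was healthy when observation started
    for t in sorted(reply_times_ms):
        if t - last_ok > down_after_ms:
            return last_ok + down_after_ms
        last_ok = t
    # No more replies: check the tail of the observation window too.
    if observe_until_ms - last_ok > down_after_ms:
        return last_ok + down_after_ms
    return None


# Invented trace: replies once a second, then a 6.2 s stall, then recovery.
replies = list(range(1000, 10001, 1000)) + [16200, 17200]

print(sdown_deadline(replies, 5000, 18000))   # 15000
print(sdown_deadline(replies, 10000, 18000))  # None
```

With the 5-second setting reported above, a single 6.2-second stall is enough to flag the master in this model, even though it answers again shortly afterwards.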
Ping should be sent once a second. It might be that:
@rishinair19, to verify that Sentinel is working as expected in your environment, maybe you can simulate a temporary unavailability of the master in a similar non-production environment, and verify that the 5-second timeout and the switchover behave as expected. If possible, also verify that ping packets are being sent during that time. Maybe by |
Sentinel is working as expected. We have manually stopped services and Sentinel fails Redis over to a different node every time. The issue is that Redis does not respond to PING because the CPU core is exhausted on the master. We do have multiple cores, but since Redis is single-threaded, it does not benefit from them. |
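A rough way to picture this: if commands are executed one at a time, a PING that arrives behind a large backlog of writes has to wait for all of them, even when no single command is slow. The Python sketch below models that with a strict first-come-first-served queue. It is a deliberate simplification (real Redis interleaves clients through its event loop), `response_latencies` is a hypothetical helper, and the workload numbers are invented.

```python
from typing import List, Tuple


def response_latencies(
    commands: List[Tuple[int, int, str]]
) -> List[Tuple[str, int]]:
    """Simulate a single-threaded server that runs commands in arrival order.

    commands: (arrival_ms, service_ms, name) tuples sorted by arrival time.
    Returns (name, latency_ms) per command, where latency is the time from
    arrival until the command finishes executing.
    """
    free_at = 0  # time at which the single worker becomes idle
    latencies = []
    for arrival, service, name in commands:
        start = max(arrival, free_at)  # wait for earlier commands to finish
        free_at = start + service
        latencies.append((name, free_at - arrival))
    return latencies


# Invented burst: 1400 writes of 5 ms each arrive at t=0, a PING at t=1000.
workload = [(0, 5, "WRITE")] * 1400 + [(1000, 1, "PING")]

print(response_latencies(workload)[-1])  # ('PING', 6001)
```

In this model every write stays under the 5 ms mark that SLOWLOG reported, yet the PING is answered after roughly 6 seconds, which would exceed a 5-second down-after-milliseconds.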
Have you observed that the timeout to switchover is indeed 5 seconds and not less? I want to eliminate the option that Sentinel might be the root cause. |
Redis Version: 7.2.3
OS: Rocky Linux release 8.8
Redis Mode: Standalone
Configuration: 5 nodes with sentinel running on each
The last change we made was upgrading from 6.2.6 to 7.2.3; we also increased the number of keys from 2 million to 4.5 million. We're continuously seeing the master node go down and become a slave, with the following in the Sentinel logs:
Not getting much information in redis logs at time when master was marked down:
The issue is observed only during the daytime when there is higher load; it is mostly stable at night. The servers on which Redis is hosted have 90+ CPU cores and 600 GB RAM. We have checked resource utilisation at the time of the issue as well and found that enough resources were available.
The average number of clients connecting during the day is 1200-1300, and the slow logs do not show queries longer than 5 seconds.
We have checked OS logs and did not find any process killing Redis process. We also tried optimising through redis.conf but see no difference.
redis.conf:
Also got this warning in logs but not sure if relevant:
Redis fails over to a slave node and recovers, but it takes a while to load the DB into memory. This is creating application issues. Can someone please help with this?