
[BUG] Continuous "I/O error reading bulk count from MASTER: No error information" Failing the Readiness Probe #11414

Open
seanocca opened this issue Oct 21, 2022 · 2 comments


seanocca commented Oct 21, 2022

Describe the bug

We are using this cache in conjunction with our Grafana Loki deployment, which handles roughly 100-200GB of uncompressed logs every day. Under this load the cache runs into problems reading from the master. The Redis cluster handles the caching of the compressed logs and should cope with up to 100GB of throughput, well over what the daily uncompressed volume compresses down to.

We then hit a failed readiness probe without any helpful error information.

The error is as follows:

I/O error reading bulk count from MASTER: No error information
RDB: 50 MB of memory used by copy-on-write
Reconnecting to MASTER xxx.xxx.xxx.xxx:6379 after failure
MASTER <-> REPLICA sync started
Non blocking connect for SYNC fired the event.
Master replied to PING, replication can continue...
Partial resynchronization not possible (no cached master)

The main issue for us is the "No error information" part.

There is no way to debug this issue from a response message like that.

To reproduce

We use Kubernetes pods with the spotahome/redis-operator

The failover has some CustomConfig that overrides the default values set by the operator (see the CustomConfig under Additional information below).
We run 4 instances with 9 pods across them, requesting 3 cores and 35GB of memory per pod; a rough sketch of one of the manifests is below.
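
Roughly, each failover manifest looks like the sketch below. The name and replica counts are placeholders rather than our exact values, and the apiVersion/field names follow the spotahome/redis-operator RedisFailover CRD as we understand it:

apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: loki-cache              # hypothetical name
spec:
  sentinel:
    replicas: 3                 # illustrative replica count
  redis:
    replicas: 3                 # illustrative; see the pod counts above
    resources:
      requests:
        cpu: "3"                # 3 cores per pod
        memory: 35Gi            # 35GB of memory per pod
    # customConfig overrides are listed under Additional information below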

Expected behavior

We expect one of two scenarios to occur.

  1. The pod fails with an error message that can help us change the config to improve performance
  2. The pod either does not fail the readiness probe, or is restarted when this error message occurs (you might not be able to help with this one, as we use the redis-operator; see the sketch after this list)
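
To illustrate scenario 2, a liveness check along these lines is roughly what we have in mind. This is purely a sketch (auth and ports omitted), and we realise the operator owns the probes today, so it is not something we can simply drop in:

livenessProbe:
  exec:
    command:
      - sh
      - -c
      # passes on the master (role:master) and on replicas whose link to the master is up
      - redis-cli info replication | grep -qE 'role:master|master_link_status:up'
  initialDelaySeconds: 30
  periodSeconds: 15
  failureThreshold: 5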

Additional information

We have the Persistent Volume Claim set to 256GB, which should be more than enough space to hold the searched data for any timeframe.
CustomConfig set in the Redis Failover:

"repl-timeout 610"
"save 60 5000"
"tcp-keepalive 610"
"maxclients 500000"
"oom-score-adj yes"
"oom-score-adj-values 0 200 800"
"dynamic-hz yes"
@vineelyalamarthy

Is this Redis Cluster or Sentinel?

@seanocca (Author)

Is this Redis Cluster or Sentinel?

This error comes up on the Redis cluster.
