
Can't create a basic redis failover, fail to sync with master #93

Closed
francoispqt opened this issue Sep 17, 2018 · 11 comments
francoispqt commented Sep 17, 2018

Expected behaviour

What do you want to achieve?
Create a basic RedisFailover, using the helm chart for the operator and the minimal example

Actual behaviour

What is happening? Are all the pieces created? Can you access the service?
The operator is created. When the RedisFailover is created, the rfr-redisfailover-0 pod fails, which causes all rfr and rfs pods to fail.
Pod rfr-redisfailover-0 repeatedly fails to sync with the master (see the full logs in the Logs section below).

Steps to reproduce the behaviour

  • Created the operator with the following helm values:
replicaCount: 1
apiVersion: apps/v1
image: quay.io/spotahome/redis-operator
tag: 0.5.2
pullPolicy: Always
team: devops
containerName: redisoperator
podAnnotations: {}
resources:
  requests:
    cpu: 10m
    memory: 50Mi
  limits:
    cpu: 100m
    memory: 50Mi
rbac:
  install: true
  apiVersion: v1
  imagePullSecrets: []
monitoring:
  enabled: false
  prometheus:
    name: unknown

Then created the RedisFailover with the following config:

apiVersion: storage.spotahome.com/v1alpha2
kind: RedisFailover
metadata:
  name: redisfailover
  namespace: dev

Environment

How are the pieces configured?

  • Redis Operator version
    0.5.2

  • Kubernetes version
    v1.10.5

  • Kubernetes configuration used (e.g. is RBAC active?)
    RBAC is active

Logs

From rfr redis container

1:S 17 Sep 15:34:56.342 * Connecting to MASTER 127.0.0.1:6379
1:S 17 Sep 15:34:56.342 * MASTER <-> SLAVE sync started
1:S 17 Sep 15:34:56.342 * Non blocking connect for SYNC fired the event.
1:S 17 Sep 15:34:56.342 * Master replied to PING, replication can continue...
1:S 17 Sep 15:34:56.342 * Partial resynchronization not possible (no cached master)
1:S 17 Sep 15:34:56.342 * Master does not support PSYNC or is in error state (reply: -ERR Can't SYNC while not connected with my master)
1:S 17 Sep 15:34:56.342 * Retrying with SYNC...
1:S 17 Sep 15:34:56.343 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master

From operator:

time="2018-09-17T15:42:10Z" level=warning msg="error processing dev/redisfailover job (requeued): read tcp 100.96.49.200:57220->100.96.7.70:6379: read: connection reset by peer" controller=redisfailover operator=redisfailover src="generic.go:223"
time="2018-09-17T15:42:11Z" level=info msg="configMap updated" configMap=rfs-redisfailover namespace=dev service=k8s.configMap src="configmap.go:76"
time="2018-09-17T15:42:12Z" level=info msg="configMap updated" configMap=rfr-shutdown-redisfailover namespace=dev service=k8s.configMap src="configmap.go:76"
time="2018-09-17T15:42:12Z" level=info msg="configMap updated" configMap=rfr-redisfailover namespace=dev service=k8s.configMap src="configmap.go:76"
time="2018-09-17T15:42:12Z" level=info msg="podDisruptionBudget updated" namespace=dev podDisruptionBudget=rfr-redisfailover service=k8s.podDisruptionBudget src="poddisruptionbudget.go:77"
time="2018-09-17T15:42:12Z" level=info msg="statefulSet updated" namespace=dev service=k8s.statefulSet src="statefulset.go:96" statefulSet=rfr-redisfailover
time="2018-09-17T15:42:12Z" level=info msg="podDisruptionBudget updated" namespace=dev podDisruptionBudget=rfs-redisfailover service=k8s.podDisruptionBudget src="poddisruptionbudget.go:77"
time="2018-09-17T15:42:12Z" level=info msg="deployment updated" deployment=rfs-redisfailover namespace=dev service=k8s.deployment src="deployment.go:96"
time="2018-09-17T15:42:12Z" level=error msg="Error processing dev/redisfailover: read tcp 100.96.49.200:57230->100.96.7.70:6379: read: connection reset by peer" controller=redisfailover operator=redisfailover src="generic.go:223"
jchanam (Collaborator) commented Sep 19, 2018

Hi,

You just have to wait for the operator to bring the failover to the desired state. When a new redis-failover is created, it intentionally waits before acting, for safety reasons.

After a few minutes, is the failover running?

francoispqt (Author) commented:

Ok, it's working fine now. We had an issue with an istio proxy that was injected automatically and was breaking the liveness probe. My bad.

Thank you

iroller commented Sep 19, 2018

Ok, it's working fine now. We had an issue with an istio proxy that was injected automatically and was breaking the liveness probe. My bad.

Thank you

Could you elaborate on why istio-proxy would break the probe and how you fixed it?

francoispqt (Author) commented:

To put it back in context: we inject an istio-proxy by default into all our pods unless a specific annotation opts them out, because most of our services are HTTP and we want the proxy for those. For redis, we don't actually need it. I can't say exactly why it breaks the probe (the probe is just a ping to hostname:port), but removing the istio proxy did fix it; we may investigate further.
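For reference, the per-pod opt-out described above is usually done with Istio's standard sidecar-injection annotation on the pod template; a minimal sketch (the pod name and image here are illustrative, only the annotation is Istio's documented mechanism):

```yaml
# Disable automatic istio-proxy injection for a Redis pod.
apiVersion: v1
kind: Pod
metadata:
  name: redis-example                  # hypothetical pod name
  annotations:
    sidecar.istio.io/inject: "false"   # Istio's standard per-pod opt-out
spec:
  containers:
    - name: redis
      image: redis:4.0
```

In a StatefulSet or Deployment, the same annotation goes under spec.template.metadata.annotations so it applies to every pod the controller creates.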

iroller commented Sep 20, 2018

@francoispqt create a redis service (the operator only creates a service for sentinel) and it'll work. See https://github.com/istio/istio/blob/master/pilot/cmd/pilot-agent/status/ready/probe.go#L41-L65 for details. You will probably not use this service, but it will get istio working.

Paic (Contributor) commented Oct 30, 2018

I'm having the exact same problem as the author.
Even with a dummy redis service created, the same output appears:

time="2018-10-30T15:38:24Z" level=warning msg="error processing default/redisfailover job (requeued): read tcp 10.2.8.41:53960->10.2.7.26:6379: read: connection reset by peer" controller=redisfailover operator=redisfailover src="generic.go:223"

The dummy service I created looks as follows:

apiVersion: v1
kind: Service
metadata:
  name: rfr-redis-failover-dummy
spec:
  ports:
    - name: redis-dummy
      port: 6379
      protocol: TCP
      targetPort: redis
  selector:
    app: redis-failover
    component: redis
    creator: redisfailover
    redisfailover: redisfailover

EDIT: After adding clusterIP: None to the dummy service, the redis master-slave replication succeeded.
But now the operator is unable to contact the sentinels:

time="2018-10-30T16:37:44Z" level=warning msg="error processing default/redisfailover job (requeued): read tcp 10.2.7.31:45280->10.2.7.33:26379: read: connection reset by peer" controller=redisfailover operator=redisfailover src="generic.go:223"

Which is strange, because a service already exists for them.
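For reference, the EDIT above amounts to making the dummy service headless; a sketch combining the service shown earlier in this comment with the clusterIP: None change:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rfr-redis-failover-dummy
spec:
  clusterIP: None          # headless: DNS resolves directly to the pod IPs
  ports:
    - name: redis-dummy
      port: 6379
      protocol: TCP
      targetPort: redis
  selector:
    app: redis-failover
    component: redis
    creator: redisfailover
    redisfailover: redisfailover
```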

jchanam (Collaborator) commented Oct 30, 2018

@Paic what version of the operator are you using?

That error appears when the operator tries to connect directly to the redis instances to check their status. It happens when the redis container has not started yet, so it should disappear after a few moments.

In the latest version, that behaviour has changed and only running pods are checked, so it should not appear again.

Paic (Contributor) commented Oct 30, 2018

I'm using the latest operator version (docker image :latest).

As stated in my edit, the replication is now OK, but the same error occurs with the sentinels.

One thing I noticed is that the port in the sentinel service is named "sentinel". I tried renaming it to redis-sentinel (for Istio), but I don't know if the envoy proxies picked up the change (@iroller).

EDIT: Looking at a sentinel's stats (INFO command), the total_connections_received counter kept growing (https://snag.gy/NeAgDb.jpg), which means the connection to redis can at least be made.
I was also able to connect from one sentinel to another (DNS or PodIP) using redis-cli and execute commands.

After installing redis-cli into the operator, I was also able to connect to a sentinel (DNS or ServiceIP), but every command reset the connection: https://snag.gy/Io270F.jpg

jchanam (Collaborator) commented Oct 31, 2018

The operator does not use the service to connect to the redis/sentinel pods, so it makes no difference whether a service exists for the redis instances or not.

The operator ensures that what it creates remains unchanged, so if you edit it, it will come back to its previous state.

Paic (Contributor) commented Oct 31, 2018

I see. After further testing, that explains why, inside the operator, redis-cli works with DNS (Istio created a listener for the sentinel service IP) but not with a direct PodIP (there is no listener for each sentinel pod IP).

If you create a dummy headless sentinel service, the operator is able to connect to the pods directly via PodIP and everything looks fine.
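Such a dummy headless sentinel service could look like this (a sketch only; the name is hypothetical and the selector labels are assumed to mirror the redis dummy service shown earlier, with component: sentinel):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rfs-redis-failover-dummy   # hypothetical name
spec:
  clusterIP: None                  # headless, so envoy does not intercept direct pod connections
  ports:
    - name: sentinel-dummy
      port: 26379                  # default sentinel port
      protocol: TCP
  selector:
    app: redis-failover
    component: sentinel
    creator: redisfailover
    redisfailover: redisfailover
```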

Thanks for the help :)

EDIT: slaves are registered by sentinel as 127.0.0.1, which makes it detect them as down (obviously).

kmcrawford (Contributor) commented Sep 30, 2019

I was able to solve it with:

apiVersion: v1
kind: Service
metadata:
  name: rfr-redis-failover-dummy
  namespace: redis
spec:
  clusterIP: None
  ports:
    - name: redis-dummy
      port: 6379
      protocol: TCP
      targetPort: redis
  selector:
    app.kubernetes.io/component: redis
    app.kubernetes.io/managed-by: redis-operator
    app.kubernetes.io/name: redis-dev
    app.kubernetes.io/part-of: redis-failover
