
Can't create a basic redis failover, fail to sync with master #93

Closed
francoispqt opened this issue Sep 17, 2018 · 11 comments
francoispqt commented Sep 17, 2018

Expected behaviour

What do you want to achieve?
Create a basic RedisFailover, using the helm chart for the operator and the minimal example

Actual behaviour

What is happening? Are all the pieces created? Can you access the service?
The operator is created. When the RedisFailover is created, the rfr-redisfailover-0 pod fails, which causes all rfr and rfs pods to fail.
Pod rfr-redisfailover-0 repeatedly fails to sync with the master (see the full logs in the Logs section below).

Steps to reproduce the behaviour

  • Created the operator with the following helm values:
replicaCount: 1
apiVersion: apps/v1
image: quay.io/spotahome/redis-operator
tag: 0.5.2
pullPolicy: Always
team: devops
containerName: redisoperator
podAnnotations: {}
resources:
  requests:
    cpu: 10m
    memory: 50Mi
  limits:
    cpu: 100m
    memory: 50Mi
rbac:
  install: true
  apiVersion: v1
  imagePullSecrets: []
monitoring:
  enabled: false
  prometheus:
    name: unknown

Then created the RedisFailover with the following config:

apiVersion: storage.spotahome.com/v1alpha2
kind: RedisFailover
metadata:
  name: redisfailover
  namespace: dev

Environment

How are the pieces configured?

  • Redis Operator version
    0.5.2

  • Kubernetes version
    v1.10.5

  • Kubernetes configuration used (e.g. is RBAC active?)
    RBAC is active

Logs

From rfr redis container

1:S 17 Sep 15:34:56.342 * Connecting to MASTER 127.0.0.1:6379
1:S 17 Sep 15:34:56.342 * MASTER <-> SLAVE sync started
1:S 17 Sep 15:34:56.342 * Non blocking connect for SYNC fired the event.
1:S 17 Sep 15:34:56.342 * Master replied to PING, replication can continue...
1:S 17 Sep 15:34:56.342 * Partial resynchronization not possible (no cached master)
1:S 17 Sep 15:34:56.342 * Master does not support PSYNC or is in error state (reply: -ERR Can't SYNC while not connected with my master)
1:S 17 Sep 15:34:56.342 * Retrying with SYNC...
1:S 17 Sep 15:34:56.343 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master

From operator:

time="2018-09-17T15:42:10Z" level=warning msg="error processing dev/redisfailover job (requeued): read tcp 100.96.49.200:57220->100.96.7.70:6379: read: connection reset by peer" controller=redisfailover operator=redisfailover src="generic.go:223"
time="2018-09-17T15:42:11Z" level=info msg="configMap updated" configMap=rfs-redisfailover namespace=dev service=k8s.configMap src="configmap.go:76"
time="2018-09-17T15:42:12Z" level=info msg="configMap updated" configMap=rfr-shutdown-redisfailover namespace=dev service=k8s.configMap src="configmap.go:76"
time="2018-09-17T15:42:12Z" level=info msg="configMap updated" configMap=rfr-redisfailover namespace=dev service=k8s.configMap src="configmap.go:76"
time="2018-09-17T15:42:12Z" level=info msg="podDisruptionBudget updated" namespace=dev podDisruptionBudget=rfr-redisfailover service=k8s.podDisruptionBudget src="poddisruptionbudget.go:77"
time="2018-09-17T15:42:12Z" level=info msg="statefulSet updated" namespace=dev service=k8s.statefulSet src="statefulset.go:96" statefulSet=rfr-redisfailover
time="2018-09-17T15:42:12Z" level=info msg="podDisruptionBudget updated" namespace=dev podDisruptionBudget=rfs-redisfailover service=k8s.podDisruptionBudget src="poddisruptionbudget.go:77"
time="2018-09-17T15:42:12Z" level=info msg="deployment updated" deployment=rfs-redisfailover namespace=dev service=k8s.deployment src="deployment.go:96"
time="2018-09-17T15:42:12Z" level=error msg="Error processing dev/redisfailover: read tcp 100.96.49.200:57230->100.96.7.70:6379: read: connection reset by peer" controller=redisfailover operator=redisfailover src="generic.go:223"
jchanam (Collaborator) commented Sep 19, 2018

Hi,

You just have to wait for the operator to bring the failover to the desired state. When a new redis-failover is created, it intentionally waits before acting, for safety reasons.

After a few minutes, is the failover running?

francoispqt (Author) commented:

Ok, it's working fine now. We had an issue with an istio proxy that was injected automatically and was breaking the liveness probe. My bad.

Thank you

iroller commented Sep 19, 2018

Ok, it's working fine now. We had an issue with an istio proxy that was injected automatically and was breaking the liveness probe. My bad.

Thank you

Could you elaborate on why istio-proxy would break the probe and how you fixed it?

francoispqt (Author) commented:

To put it back in context: we inject an istio-proxy by default into all our pods unless a specific annotation opts them out, because most of our services are HTTP and we want the proxy for those. For redis, we don't actually need it. I can't say exactly why it breaks the probe (the probe is just a ping to hostname:port), but removing the istio proxy did fix it; we may investigate further.
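For reference, the per-pod opt-out described above is usually done with Istio's standard sidecar-injection annotation on the pod template; a minimal sketch (the pod name and image here are illustrative, only the annotation is Istio's documented mechanism):

```yaml
# Disable automatic istio-proxy injection for a Redis pod.
apiVersion: v1
kind: Pod
metadata:
  name: redis-example                  # hypothetical pod name
  annotations:
    sidecar.istio.io/inject: "false"   # Istio's standard per-pod opt-out
spec:
  containers:
    - name: redis
      image: redis:4.0
```

In a StatefulSet or Deployment, the same annotation goes under spec.template.metadata.annotations so it applies to every pod the controller creates.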

iroller commented Sep 20, 2018

@francoispqt create a redis service (the operator only creates a service for sentinel) and it'll work. See https://github.com/istio/istio/blob/master/pilot/cmd/pilot-agent/status/ready/probe.go#L41-L65 for details. You will probably not use this service, but it will get istio working.

Paic (Contributor) commented Oct 30, 2018

I'm having the exact same problem as the author.
Even with a dummy redis service created, the same output appears:

time="2018-10-30T15:38:24Z" level=warning msg="error processing default/redisfailover job (requeued): read tcp 10.2.8.41:53960->10.2.7.26:6379: read: connection reset by peer" controller=redisfailover operator=redisfailover src="generic.go:223"

The dummy service I created looks as follows:

apiVersion: v1
kind: Service
metadata:
  name: rfr-redis-failover-dummy
spec:
  ports:
    - name: redis-dummy
      port: 6379
      protocol: TCP
      targetPort: redis
  selector:
    app: redis-failover
    component: redis
    creator: redisfailover
    redisfailover: redisfailover

EDIT: After adding clusterIP: None to the dummy service, the redis master-slave replication succeeded.
But now the operator is unable to contact the sentinels:

time="2018-10-30T16:37:44Z" level=warning msg="error processing default/redisfailover job (requeued): read tcp 10.2.7.31:45280->10.2.7.33:26379: read: connection reset by peer" controller=redisfailover operator=redisfailover src="generic.go:223"

Which is strange, because a service already exists for them.
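For reference, the EDIT above amounts to making the dummy service headless; a sketch combining the service shown earlier in this comment with the clusterIP: None change:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rfr-redis-failover-dummy
spec:
  clusterIP: None          # headless: DNS resolves directly to the pod IPs
  ports:
    - name: redis-dummy
      port: 6379
      protocol: TCP
      targetPort: redis
  selector:
    app: redis-failover
    component: redis
    creator: redisfailover
    redisfailover: redisfailover
```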

jchanam (Collaborator) commented Oct 30, 2018

@Paic what version of the operator are you using?

That error appears when the operator tries to connect directly to the redis instances to check their status. It happens when the redis container has not started yet, so it should disappear after a few moments.

In the latest version, that behaviour has changed and only running pods are checked, so it should not appear again.

Paic (Contributor) commented Oct 30, 2018

I'm using the latest operator version (docker image :latest).

As stated in my edit, the replication is now OK, but the same error occurs with the sentinels.

One thing I noticed is that the port in the sentinel service is named "sentinel". I tried renaming it to redis-sentinel (for Istio), but I don't know if the envoy proxies picked up the change (@iroller).

EDIT: Looking at a sentinel's stats (INFO command), the total_connections_received counter kept growing (https://snag.gy/NeAgDb.jpg), which means the connection to redis can at least be made.
I was also able to connect from one sentinel to another (DNS or PodIP) using redis-cli and execute commands.

After installing redis-cli into the operator, I was also able to connect to a sentinel (DNS or ServiceIP), but every command reset the connection: https://snag.gy/Io270F.jpg

jchanam (Collaborator) commented Oct 31, 2018

The operator does not use the service to connect to the redis/sentinel pods, so it makes no difference whether a service exists for the redis instances or not.

The operator ensures that what it creates remains unchanged, so if you edit it, it will come back to its previous state.

Paic (Contributor) commented Oct 31, 2018

I see. After further testing, that explains why, inside the operator, redis-cli works with DNS (Istio created a listener for the sentinel service IP) but not with a direct PodIP (there is no listener for each sentinel pod IP).

If you create a dummy headless sentinel service, the operator is able to connect to the pods directly via PodIP and everything looks fine.
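Such a dummy headless sentinel service could look like this (a sketch only; the name is hypothetical and the selector labels are assumed to mirror the redis dummy service shown earlier, with component: sentinel):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rfs-redis-failover-dummy   # hypothetical name
spec:
  clusterIP: None                  # headless, so envoy does not intercept direct pod connections
  ports:
    - name: sentinel-dummy
      port: 26379                  # default sentinel port
      protocol: TCP
  selector:
    app: redis-failover
    component: sentinel
    creator: redisfailover
    redisfailover: redisfailover
```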

Thanks for the help :)

EDIT: slaves are registered by sentinel as 127.0.0.1, which makes it detect them as down (obviously).

kmcrawford (Contributor) commented Sep 30, 2019

I was able to solve it with:

apiVersion: v1
kind: Service
metadata:
  name: rfr-redis-failover-dummy
  namespace: redis
spec:
  clusterIP: None
  ports:
    - name: redis-dummy
      port: 6379
      protocol: TCP
      targetPort: redis
  selector:
    app.kubernetes.io/component: redis
    app.kubernetes.io/managed-by: redis-operator
    app.kubernetes.io/name: redis-dev
    app.kubernetes.io/part-of: redis-failover
