Redis in Docker Swarm: master fails to reconnect to replica after a while #6949

Open
Monokai opened this issue Mar 3, 2020 · 2 comments


Monokai commented Mar 3, 2020

I'm having difficulties getting Redis to work in a Docker Swarm setup. At first it works, but after a while (probably after a service restart) I start getting these errors:

03 Mar 2020 13:41:09.748 * Connecting to MASTER redis-master:6379
03 Mar 2020 13:41:09.749 * MASTER <-> REPLICA sync started
03 Mar 2020 13:41:09.749 # Error condition on socket for SYNC: Connection refused
03 Mar 2020 13:41:10.751 * Connecting to MASTER redis-master:6379
…

These go on forever.

I'm using a docker-compose stack with 1 master and a replica on each of the 3 servers. The setup doesn't use sentinels. My thinking is that if the master fails, Docker restarts the service and it reads the config back in via the shared volume that is used by the master and the replicas. Relevant parts:

  redis-master:
    image: "${CI_REGISTRY_IMAGE}:redis_master-${CI_COMMIT_REF_SLUG}"
    networks:
      - mynetwork
    volumes:
      - redis:/opt/scripts
    ports:
      - 6379:6379
    command: sh -c 'redis-server /usr/local/etc/redis/redis.conf --bind $$(hostname -i)'
    deploy:
      replicas: 1
      update_config:
        parallelism: 1
        delay: 10s
        order: stop-first
        failure_action: rollback
      rollback_config:
        parallelism: 1
        delay: 10s
        order: stop-first
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 5
        window: 180s
    healthcheck:
      test: /usr/local/bin/healthcheck.sh
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 1m

  redis-replica:
    image: "${CI_REGISTRY_IMAGE}:redis_replica-${CI_COMMIT_REF_SLUG}"
    networks:
      - mynetwork
    volumes:
      - redis:/opt/scripts
    ports:
      - 6380:6380
    command: sh -c 'redis-server /usr/local/etc/redis/redis.conf --bind $$(hostname -i) --replica-announce-ip $$(hostname -i) --port 6380 --replicaof redis-master 6379'
    depends_on:
      - redis-master
    deploy:
      mode: global
      update_config:
        parallelism: 1
        delay: 10s
        order: stop-first
        failure_action: rollback
      rollback_config:
        parallelism: 1
        delay: 10s
        order: stop-first
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 5
        window: 180s
    healthcheck:
      test: /usr/local/bin/healthcheck.sh
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 1m
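
The top-level definition of the redis volume is not shown above. A named volume that uses the default local driver is created separately on each Swarm node, so it is only shared between containers on the same node. If the volume is meant to be shared across all 3 servers, one possible approach is to back it with a network filesystem. The sketch below assumes an NFS export; the server address and export path are hypothetical placeholders, not values from this stack.

```yaml
# Sketch only: an NFS-backed named volume, so that every node mounts the same data.
# 10.0.0.10 and /exports/redis are hypothetical and must be replaced.
volumes:
  redis:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=10.0.0.10,rw,nfsvers=4"   # hypothetical NFS server address and mount options
      device: ":/exports/redis"           # hypothetical export path on that server
```

With a definition like this, each node mounts the export when a task that uses the volume starts there, instead of creating an empty node-local volume.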

I have tested this setup and it seems to work. I've also tested rebooting each of the 3 servers, and after a while Redis connects to all of the instances fine, so they seem to find each other.

But after a while this breaks. I don't know exactly when; by the time I notice it, my log is full of reconnection messages.

If I bind to 0.0.0.0, everything seems to go well (at least for longer periods of time), but this leaves my database wide open, so that's not feasible. I suspect the problem has something to do with the binding, or that the service gets a new IP address after a restart, but I'm not sure.
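
On the exposure concern: in a Swarm stack, a port listed under ports is published on every node through the ingress routing mesh, while services attached to the same overlay network can reach each other without any published port. A minimal sketch of binding to 0.0.0.0 without publishing anything, with a password as an extra safeguard, could look like the fragment below. It assumes the service and network names from the stack above; REDIS_PASSWORD is a hypothetical variable that would have to be set at deploy time, and the remaining options (volumes, deploy, healthcheck, replica-announce-ip) are left out for brevity.

```yaml
# Sketch only, not a tested configuration.
# REDIS_PASSWORD is a hypothetical variable supplied when the stack is deployed.
services:
  redis-master:
    networks:
      - mynetwork
    # No "ports:" section: nothing is published on the hosts,
    # the service is only reachable from services on mynetwork.
    environment:
      - REDIS_PASSWORD=${REDIS_PASSWORD}
    command: sh -c 'redis-server /usr/local/etc/redis/redis.conf --bind 0.0.0.0 --requirepass "$$REDIS_PASSWORD"'

  redis-replica:
    networks:
      - mynetwork
    environment:
      - REDIS_PASSWORD=${REDIS_PASSWORD}
    # --masterauth lets the replica authenticate against the master;
    # --requirepass protects the replica itself.
    command: sh -c 'redis-server /usr/local/etc/redis/redis.conf --bind 0.0.0.0 --port 6380 --replicaof redis-master 6379 --masterauth "$$REDIS_PASSWORD" --requirepass "$$REDIS_PASSWORD"'
    deploy:
      mode: global

networks:
  mynetwork:
    driver: overlay
```

If a password is added this way, the health check script (not shown in this thread) would presumably need to authenticate as well.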

Any help is much appreciated!

Author

Monokai commented Mar 4, 2020

Update: I have now changed the bind address to 0.0.0.0 and used expose: 6379 instead of ports: "6379:6379", so the port is available to other services in the overlay network without being mapped to the host.

Again, everything looks OK at first, but after a while I'm now getting:

Mar 2020 10:52:21.924 # Unable to connect to MASTER: Resource temporarily unavailable
Mar 2020 10:52:22.927 * Connecting to MASTER redis-master:6379
Mar 2020 10:52:22.928 # Unable to connect to MASTER: Resource temporarily unavailable
Mar 2020 10:52:23.931 * Connecting to MASTER redis-master:6379
…

It repeats every second.

Author

Monokai commented Apr 1, 2020

Update: This might have been an out-of-memory error. I've upgraded the server to include more memory and I haven't had any issues for two weeks.

I would like to hear whether this Docker Swarm setup is the right way to go or whether I need to put Redis in cluster mode or something similar. It's still a bit unclear to me how a Redis master/replica setup should ideally be run in Docker Swarm. I only use Redis for caching, and it doesn't matter much if some data is lost when a Redis instance restarts.
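
If memory pressure was indeed the cause, and given that the data is only a cache, one option would be to cap how much memory Redis may use and let it evict old keys, combined with a container memory limit in the stack file. The fragment below is a sketch under those assumptions; the 256mb and 512M values are arbitrary placeholders that would need tuning, and the other service options are omitted.

```yaml
# Sketch only: the limits below are placeholder values, not recommendations.
services:
  redis-master:
    # --maxmemory caps the dataset size; allkeys-lru evicts the least
    # recently used keys once the cap is reached, which suits a pure cache.
    command: sh -c 'redis-server /usr/local/etc/redis/redis.conf --bind 0.0.0.0 --maxmemory 256mb --maxmemory-policy allkeys-lru'
    deploy:
      resources:
        limits:
          memory: 512M   # container limit, kept above maxmemory to leave headroom
```

Keeping maxmemory well below the container limit leaves room for replication buffers and other overhead that Redis does not count against the dataset.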
