Resiliency to Redis network failures #998

Closed
sborpo opened this issue Oct 14, 2018 · 9 comments

@sborpo

sborpo commented Oct 14, 2018

I'm designing a system based on a message queue, so I decided to give RQ a try.
One of my main requirements is that the message queue be resilient to crashes/network problems in the different parts of the system (the other parts should keep trying to do their job, assuming that watchdogs will reset whatever is needed).
For example, if the Redis server goes down, the workers/clients should keep trying to fetch/push with some backoff strategy.

I saw that there is no code handling those cases (maybe I'm missing something...), so when I ran a worker and shut down the Redis server, I saw that the worker exits with a connection exception without trying to renew the connection:

Traceback (most recent call last):
  File "/home/my_usr/PycharmProjects/redis_poc/redis_queue/worker_runner.py", line 204, in <module>
    w1.work()
  File "/home/my_usr/.local/lib/python3.6/site-packages/rq/worker.py", line 488, in work
    self.register_birth()
  File "/home/my_usr/.local/lib/python3.6/site-packages/rq/worker.py", line 273, in register_birth
    if self.connection.exists(self.key) and \
  File "/home/my_usr/.local/lib/python3.6/site-packages/redis/client.py", line 951, in exists
    return self.execute_command('EXISTS', name)
  File "/home/my_usr/.local/lib/python3.6/site-packages/redis/client.py", line 673, in execute_command
    connection.send_command(*args)
  File "/home/my_usr/.local/lib/python3.6/site-packages/redis/connection.py", line 610, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/home/my_usr/.local/lib/python3.6/site-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/home/my_usr/.local/lib/python3.6/site-packages/redis/connection.py", line 489, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to localhost:6379. Connection refused.

Is this intended by design? What would be the best practice for handling these cases with the current codebase?
I don't think that wrapping the calls like this is a good solution:

    import time

    import redis
    from rq import Connection, Worker

    while True:
        try:
            with Connection():  # default connection to the local Redis
                w1 = Worker(['queue1'])
                w1.work()
        except redis.exceptions.ConnectionError:
            print('lost connection')
            # Sleep 10 seconds and retry
            time.sleep(10)
        except Exception as e:
            print(e)
            break
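
If one does go the wrapping route, a variant with exponential backoff instead of a fixed sleep might look like this (just a sketch; the 60-second cap and queue name are placeholders):

    import time

    import redis
    from rq import Connection, Worker

    delay = 1
    while True:
        try:
            with Connection():  # default connection to the local Redis
                Worker(['queue1']).work()
            delay = 1  # reset the backoff after the worker returns cleanly
        except redis.exceptions.ConnectionError:
            print('lost connection, retrying in %d seconds' % delay)
            time.sleep(delay)
            delay = min(delay * 2, 60)  # double the wait each time, capped at 60 seconds
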
@selwin
Collaborator

selwin commented Oct 27, 2018

I usually rely on systemd to bring workers back from Redis connection errors in production. But yes, I think this is something we should build into RQ. Please open a PR for this.
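
For reference, a minimal sketch of the kind of systemd unit this relies on; the paths, queue name, and restart policy shown here are assumptions for illustration, not something RQ ships:

    [Unit]
    Description=RQ worker (example)
    After=network.target

    [Service]
    # Adjust the rq path and queue name for your environment.
    ExecStart=/usr/local/bin/rq worker queue1
    Restart=on-failure
    RestartSec=10

    [Install]
    WantedBy=multi-user.target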

@bhargavrpatel

@selwin Follow-up question on this: have you come across a silent "loss" where, for some reason, there is no connection error but the workers simply wait for jobs while the queue keeps building up? We've run into this multiple times. I did a quick test by setting the TTL to a low value and overriding connection, turning it into a property for easier tracing, as shown below:

    from rq import Worker

    BEAT_COUNT = 10  # heartbeats between connectivity checks (value assumed for illustration)

    class Foo(Worker):
        _beat_count = 0

        @property
        def connection(self):
            # Every BEAT_COUNT heartbeats, issue a cheap O(1) command to verify connectivity.
            if self._beat_count == 0 or (self._beat_count % BEAT_COUNT == 0):
                import pdb; pdb.set_trace()
                self._connection.time()
            return self._connection

        @connection.setter
        def connection(self, x):
            self._connection = x

        def heartbeat(self, timeout=None, pipeline=None):
            super(Foo, self).heartbeat(timeout=timeout, pipeline=pipeline)
            self._beat_count += 1

I was using connection.time() as a way to check connectivity since it's an O(1) operation. What I noticed, however, is that when the Redis service is stopped, manually calling that function raises ConnectionError, but everything recovers on its own when the Redis service is brought back up.
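
For completeness, a minimal sketch of how the subclass above might be driven, assuming a local Redis and a queue named 'queue1' (both assumptions on my part):

    from redis import Redis
    from rq import Queue

    redis_conn = Redis()  # assumed: local Redis on the default port
    worker = Foo([Queue('queue1', connection=redis_conn)], connection=redis_conn)
    worker.work()  # heartbeat() increments _beat_count, so connectivity is checked periodically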

@foozmeat

@bhargavrpatel We hit this issue every time we fail over or switch over our Redis cluster.

@corynezin
Contributor

Still running into this issue. Has any kind of retry support been added?

@parikls

parikls commented May 23, 2020

Any updates on this?

@levrik
Contributor

levrik commented Jun 23, 2020

We're also experiencing the issue reported by @bhargavrpatel here.
Our Redis instance was moved to another node on Kubernetes, so we lost the connection for a short period.
The worker didn't complain, but it also didn't reconnect after Redis came back up.
At least that's my guess, because it didn't start processing the tasks in the queue.
It just sat there, and we had to restart it manually once we noticed. After the restart, the worker started processing the queue again.

Asrst added a commit to Asrst/rq that referenced this issue Dec 3, 2020
solves issue rq#1153, rq#998
RQ workers do not automatically reconnect to the Redis server if it goes down or is restarted.
@vincentwoo

Also definitely just lost a job to a redis server disconnect - would love some guidance on how to guard against this.

@waldner
Contributor

waldner commented Mar 7, 2021

Is this going into a release in the near future? Currently I have to restart 900+ workers whenever a brief interruption to redis occurs.

@selwin
Collaborator

selwin commented Mar 7, 2021

Fixed in #1387.

I’ll make a release sometime in the next few weeks.

selwin closed this as completed on Mar 7, 2021