Resiliency to Redis network failures #998

Closed
sborpo opened this issue Oct 14, 2018 · 9 comments

@sborpo

sborpo commented Oct 14, 2018

I'm designing a system based on a message queue, so I decided to give RQ a try.
One of my main requirements is that the message queue be resilient to crashes/network problems in the different parts of the system (the other parts should keep trying to do their job, assuming that watchdogs will reset whatever is needed).
For example, if the Redis server goes down, the workers/clients should keep trying to fetch/push with some backoff strategy.

I saw that there is no code handling those cases (maybe I'm missing something...), so when I ran a worker and shut down the Redis server, I saw that the worker exits with a connection exception without trying to renew the connection:

Traceback (most recent call last):
  File "/home/my_usr/PycharmProjects/redis_poc/redis_queue/worker_runner.py", line 204, in <module>
    w1.work()
  File "/home/my_usr/.local/lib/python3.6/site-packages/rq/worker.py", line 488, in work
    self.register_birth()
  File "/home/my_usr/.local/lib/python3.6/site-packages/rq/worker.py", line 273, in register_birth
    if self.connection.exists(self.key) and \
  File "/home/my_usr/.local/lib/python3.6/site-packages/redis/client.py", line 951, in exists
    return self.execute_command('EXISTS', name)
  File "/home/my_usr/.local/lib/python3.6/site-packages/redis/client.py", line 673, in execute_command
    connection.send_command(*args)
  File "/home/my_usr/.local/lib/python3.6/site-packages/redis/connection.py", line 610, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/home/my_usr/.local/lib/python3.6/site-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/home/my_usr/.local/lib/python3.6/site-packages/redis/connection.py", line 489, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to localhost:6379. Connection refused.

Is this intended by design? What would be the best practice for handling these cases with the current codebase?
I don't think that wrapping the calls like this is a good solution:

    import time

    import redis
    from rq import Connection, Worker

    while True:
        try:
            with Connection():  # default connection to the local Redis
                w1 = Worker(['queue1'])
                w1.work()
        except redis.exceptions.ConnectionError:
            print('lost connection')
            # Sleep 10 seconds and retry
            time.sleep(10)
        except Exception as e:
            print(e)
            break
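
If one does go the wrapping route, a variant with exponential backoff instead of a fixed sleep might look like this (just a sketch; the 60-second cap and queue name are placeholders):

    import time

    import redis
    from rq import Connection, Worker

    delay = 1
    while True:
        try:
            with Connection():  # default connection to the local Redis
                Worker(['queue1']).work()
            delay = 1  # reset the backoff after the worker returns cleanly
        except redis.exceptions.ConnectionError:
            print('lost connection, retrying in %d seconds' % delay)
            time.sleep(delay)
            delay = min(delay * 2, 60)  # double the wait each time, capped at 60 seconds
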
@selwin
Collaborator

selwin commented Oct 27, 2018

I usually rely on systemd to bring workers back from Redis connection errors in production. But yes, I think this is something we should build into RQ. Please open a PR for this.
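
For reference, a minimal sketch of the kind of systemd unit this relies on; the paths, queue name, and restart policy shown here are assumptions for illustration, not something RQ ships:

    [Unit]
    Description=RQ worker (example)
    After=network.target

    [Service]
    # Adjust the rq path and queue name for your environment.
    ExecStart=/usr/local/bin/rq worker queue1
    Restart=on-failure
    RestartSec=10

    [Install]
    WantedBy=multi-user.target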

@bhargavrpatel

@selwin Follow-up question on this: have you come across a silent "loss" where, for some reason, there is no connection error but the workers simply wait for jobs while the queue keeps building up? We've run into this multiple times. I did a quick test by setting the TTL to a low value and overriding connection, turning it into a property for easier tracing, as shown below:

    from rq import Worker

    BEAT_COUNT = 10  # heartbeats between connectivity checks (value assumed for illustration)

    class Foo(Worker):
        _beat_count = 0

        @property
        def connection(self):
            # Every BEAT_COUNT heartbeats, issue a cheap O(1) command to verify connectivity.
            if self._beat_count == 0 or (self._beat_count % BEAT_COUNT == 0):
                import pdb; pdb.set_trace()
                self._connection.time()
            return self._connection

        @connection.setter
        def connection(self, x):
            self._connection = x

        def heartbeat(self, timeout=None, pipeline=None):
            super(Foo, self).heartbeat(timeout=timeout, pipeline=pipeline)
            self._beat_count += 1

I was using connection.time() as a way to check connectivity since it's an O(1) operation. What I noticed, however, is that when the Redis service is stopped, manually calling that function raises ConnectionError, but everything recovers on its own when the Redis service is brought back up.
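
For completeness, a minimal sketch of how the subclass above might be driven, assuming a local Redis and a queue named 'queue1' (both assumptions on my part):

    from redis import Redis
    from rq import Queue

    redis_conn = Redis()  # assumed: local Redis on the default port
    worker = Foo([Queue('queue1', connection=redis_conn)], connection=redis_conn)
    worker.work()  # heartbeat() increments _beat_count, so connectivity is checked periodically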

@foozmeat

@bhargavrpatel We hit this issue every time we fail over or switch over our Redis cluster.

@corynezin
Contributor

Still running into this issue. Has any kind of retry support been added?

@parikls

parikls commented May 23, 2020

Any updates on this?

@levrik
Contributor

levrik commented Jun 23, 2020

We're also experiencing the issue reported by @bhargavrpatel here.
Our Redis instance was moved to another node on Kubernetes, so we lost the connection for a short period.
The worker didn't complain, but it also didn't reconnect after Redis came back up.
At least that's my guess, because it didn't start processing the tasks in the queue.
It just sat there, and we had to restart it manually once we noticed. After the restart, the worker started processing the queue again.

Asrst added a commit to Asrst/rq that referenced this issue Dec 3, 2020
solves issue rq#1153, rq#998
RQ workers do not automatically reconnect to the Redis server if it goes down or is restarted.
@vincentwoo

Also definitely just lost a job to a redis server disconnect - would love some guidance on how to guard against this.

@waldner
Contributor

waldner commented Mar 7, 2021

Is this going into a release in the near future? Currently I have to restart 900+ workers whenever a brief interruption to redis occurs.

@selwin
Collaborator

selwin commented Mar 7, 2021

Fixed in #1387.

I’ll make a release sometime in the next few weeks.

selwin closed this as completed on Mar 7, 2021