Resiliency to Redis network failures #998
Comments
I usually rely on systemd to bring workers back from Redis connection errors during production. But yes, I think this is something we should build into RQ. Please open a PR for this.
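For context, a minimal sketch of that systemd approach; the unit name, paths, and queue name below are assumptions, not details from this thread:

```ini
# /etc/systemd/system/rq-worker.service (hypothetical example)
[Unit]
Description=RQ worker
After=network.target

[Service]
# Restart the worker whenever it exits non-zero, e.g. after a Redis connection error
ExecStart=/usr/local/bin/rq worker default
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

With `Restart=on-failure`, a worker that dies with a connection traceback is simply restarted; this papers over the problem from outside rather than handling reconnection inside RQ, which is what the thread is asking for.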
@selwin Follow-up question on this: have you come across a silent "loss" where, for some reason, there is no connection error but the workers simply wait for jobs and the queue keeps building up? We've run into this issue multiple times. I tried a quick test by setting the TTL to a low value and overriding the `connection` property:

```python
from rq import Worker

BEAT_COUNT = 10  # probe the connection every N heartbeats (value assumed)

class Foo(Worker):
    def __init__(self, *args, **kwargs):
        self._beat_count = 0
        super(Foo, self).__init__(*args, **kwargs)

    @property
    def connection(self):
        # Periodically ping Redis so a dead connection raises instead of silently idling
        if self._beat_count == 0 or (self._beat_count % BEAT_COUNT == 0):
            import pdb; pdb.set_trace()  # breakpoint used while testing
            self._connection.time()
        return self._connection

    @connection.setter
    def connection(self, x):
        self._connection = x

    def heartbeat(self, timeout=None, pipeline=None):
        super(Foo, self).heartbeat(timeout=timeout, pipeline=pipeline)
        self._beat_count += 1
```

I was using …
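For reference, a custom worker class along these lines can be passed to the RQ CLI with something like `rq worker --worker-class 'mymodule.Foo'`; the module path here is a placeholder.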
@bhargavrpatel We hit this issue every time we fail over or switch over our Redis cluster.
Still running into this issue; has any kind of retry support been added?
Any updates on this?
We're also experiencing the issue reported by @bhargavrpatel here.
Also, I definitely just lost a job to a Redis server disconnect; would love some guidance on how to guard against this.
Is this going into a release in the near future? Currently I have to restart 900+ workers whenever a brief interruption to Redis occurs.
Fixed in #1387. I’ll make a release sometime in the next few weeks.
I'm designing a system based on a message queue, so I took RQ for a try.
One of my main concerns is that the message queue should be resilient to crashes and network problems in the different parts of the system (the other parts will still try to do their job, assuming the watchdogs reset whatever is needed).
For example, if the Redis server goes down, the workers/clients should keep trying to fetch/push with some backoff strategy.
I saw that there is no code handling those cases (maybe I'm missing something), so when I ran a worker and shut down the Redis server, I saw that the worker exits with the connection exception without trying to renew the connection.
Was this intended by design? What would be the best practice for handling these situations with the current codebase?
I don't think that wrapping the calls would be a good solution.
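For illustration only, here is a rough sketch of what such call-level wrapping with exponential backoff could look like; the helper name, retry counts, and delays are assumptions and not part of RQ:

```python
import time

from redis import Redis
from redis.exceptions import ConnectionError
from rq import Queue

def enqueue_with_backoff(queue, func, *args, max_attempts=5, base_delay=0.5, **kwargs):
    """Hypothetical helper: retry queue.enqueue() with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return queue.enqueue(func, *args, **kwargs)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))

# Usage (assuming a task function `my_task` importable by the worker):
# q = Queue(connection=Redis())
# enqueue_with_backoff(q, my_task, 42)
```

A wrapper like this only guards individual client calls; it does nothing for a worker process that has already exited on a connection error, which is why the thread asks for reconnection support inside RQ itself.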