Too many workers? Waiting jobs building up. #251
Comments
I do not think the number of workers is the issue here. But I wonder about the schedulers: maybe having too many causes issues. Did you try reducing their number? You do not need so many of them; one is enough, though a couple more are good to have for redundancy.
Don't I need a scheduler per queue?
Yes, one per queue at a minimum. But the number of schedulers does not need to match the number of workers.
I think I was a bit unclear. I want roughly 16 unique queues, 16 schedulers (1 per queue), and 16 workers (1 per queue). Is there anything there I can do without? In the current setup we have all the queues and schedulers on one container and all the workers on another.
Yes. The reason the scheduler is a separate instance is that it requires a dedicated connection to Redis. This is hidden in Bull 3, but the rationale is that you can better choose your redundancy strategy, and also that if you do not need delayed jobs you do not need any schedulers at all, so for very simple queues you save one connection.
OK, but back to the issue. So the jobs are not being processed at all? In which status are the jobs then? Are they in the waiting list or delayed?
They were all waiting. Then, the moment I removed the additional queues/workers/schedulers, jobs started flowing again on the original queues.
That's really weird, actually. I would love it if you could manage to write a simple test that reproduces it.
I have the same setup deployed in an easier environment to explore now, so if it occurs again I should be able to provide more info. Could this happen if one of the newer workers was failing to return a promise and was timing out?
If a job never ends, it could keep the worker busy forever, but then you should see the jobs in the active state.
OK, the issue seems to have reoccurred. All the queues have stalled with big waiting lists and nothing in active. Not sure how to recreate this locally, as it does not seem to occur immediately; perhaps only after some mild volume goes through the queues. Correction from above: I had 7 queues originally and added 3 more, and this issue has appeared.
I am seeing a couple of errors that may help
cheers
The CPU on that worker is too high, and job locks are not being re-acquired in time. Can you check the CPU monitoring of your worker instances?
This one is interesting, but I do not think it is the cause of jobs not being processed. Can you check whether the workers are properly connected to Redis? Maybe they lost the connection.
If I have to bet on something: your workers are consuming all the CPU, which leads to losing the connection to Redis, and then they stop processing jobs. It also seems to happen when you add more queues...
@Adam-Burke do you see any "BUSY" errors like I do in #265? I see a similar issue. I only have two queues, but multiple workers and schedulers for each queue on different machines. Everything works fine if there is not a lot of volume, but as soon as I get some serious volume the workers seem to die because of Redis BUSY errors. I don't actually get failed jobs, just a very long waiting list.
Thanks heaps, really appreciate the help. It was basically what you guessed, @manast. It seems an infinite loop was occurring in our logger somewhere when it tried to log an object with circular references.
@manast can you shed some more light on this? I do see a lot of CPU consumption as well, but I couldn't find any infinite loop like @Adam-Burke did. At a certain volume my jobs just stop processing, I suspect because the worker is dead, but I am not sure how to actually check that. Before, I only passed in the following:

```js
const Redis = require('ioredis');

const connection = new Redis();

new Queue('queue', {
  connection,
});

new Worker('queue', processQueue, {
  lockDuration: 300000,
  concurrency: 5,
  connection,
});
```

So the workers seem dead, and jobs are building up in the waiting list.
I would spend time trying to understand why the CPU is so high. Also, how is the CPU usage on the Redis instance itself?
CPU usage on the Redis instance is fine. I am processing ~100,000 events that each trigger a job within a couple of minutes, so I would naturally expect CPU to peak whenever this happens. Not sure how to keep CPU usage low during these spikes. After a bit more investigation, I am actually only checking if the
Maybe you need to add more workers so that the CPU per worker instance is not saturated. Saturation will create all kinds of problems, not just with the connection to Redis but with any other asynchronous tasks you may run in the job processors.
@manast Thanks for your input. I did try a few more things, and it seems that the CPU is actually not the cause of this issue. I tried it on a dev and a staging instance, and even with prolonged 100% CPU usage the workers kept processing jobs without any issues. If I killed the Redis instance, the Node process died and was automatically restarted; as soon as the Redis instance was back up, the workers resumed working as well.

The difference from the production environment is that there I was using a master/slave/sentinel deployment instead of a single-node Redis. So from the above it seems that something is wrong with my sentinel deployment and that bullmq works as intended. As a workaround I used a Redis instance from redislabs.com, and now the production system works without any issues as well.

The only thing I find quite odd is that I don't get any connection errors that would actually indicate the master/slave/sentinel deployment is having issues, and that the issue only shows up as soon as throughput gets serious.
I need to process approx. 50k messages per second. Would there be any problem with Bull?
Depends a lot on your setup and which features of Bull you want to use. With a plain non-prioritized queue it should be possible, but you should do a proof of concept first; 50k jobs per second is not trivial in any technology you choose, so it is going to require careful fine-tuning no matter what.
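For such a proof of concept, one early bottleneck is doing one Redis round-trip per job; BullMQ's `addBulk` pipelines a whole batch into a single round-trip. A sketch (expects a BullMQ `Queue`; the job name and batch size are assumptions to tune):

```javascript
// Add `events` to `queue` in batches instead of one add() per event.
async function addInBatches(queue, events, batchSize = 1000) {
  for (let i = 0; i < events.length; i += batchSize) {
    const jobs = events
      .slice(i, i + batchSize)
      .map(data => ({ name: 'event', data }));
    await queue.addBulk(jobs); // one round-trip per batch
  }
}

module.exports = { addInBatches };
```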
@AmreeshTyagi Unfortunately I have found too many issues with
BullMQ: 1.8.7
Redis: ElastiCache
Node.js: 12.18.3
If I have more workers than processes on an instance, will this cause any issues? What happens in this situation?
I just added about 5 additional queues (starting from 14) and workers, and I believe this has caused an issue where waiting events are building up and all the queues seem blocked. I manually retriggered one job from the Arena GUI, and it appeared to go active, then bounced back to waiting immediately.
I removed the additional queues and related workers/schedulers, and it immediately completed the jobs very quickly, so it appears we've hit some limit, but I'm unsure what is most likely to be blocking.
Could this be a bug in our new workers? We don't have a lot of throughput, so I wouldn't have thought it would completely block every queue, but there is probably something I'm not understanding.