
Too many workers? Waiting jobs building up. #251

Closed
Adam-Burke opened this issue Aug 7, 2020 · 24 comments

@Adam-Burke

Adam-Burke commented Aug 7, 2020

Bullmq 1.8.7
Redis: elasticache
Nodejs 12.18.3

If I have more workers than processes on an instance, will this cause any issues? What happens in this situation?

I just added about 5 additional queues (starting from 14) and workers, and I believe it has caused an issue where waiting jobs are building up and all the queues seem blocked. I manually retriggered one job from the Arena GUI and it appeared to go active, then bounce back to waiting immediately.

I removed the additional queues and related workers/schedulers, and it immediately completed the jobs very quickly, so it appears as if we've hit some limit, but I'm unsure what is most likely to be blocking.

Could this be a bug in our new workers? We don't have a lot of throughput, so I wouldn't have thought it would completely block every queue, but there is probably something I'm not understanding.

@manast
Contributor

manast commented Aug 10, 2020

I do not think the number of workers is the issue here, but I wonder about the schedulers; maybe having too many causes issues. Did you try to reduce the number? You do not need so many of them: one is enough, but a couple more are good to have for redundancy.

@Adam-Burke
Author

Don't I need a scheduler per queue?

@manast
Contributor

manast commented Aug 16, 2020

Yes, one per queue at a minimum. But the number of schedulers does not need to match the number of workers.

@Adam-Burke
Author

Adam-Burke commented Aug 17, 2020

I think I was a bit unclear.

I want roughly 16 unique queues, 16 schedulers (1 per queue), and 16 workers (1 per queue). Is there anything there that I can do without?

In the current setup we just run all the queues and schedulers on one container and all the workers on another.

@manast
Contributor

manast commented Aug 17, 2020

Yes. The reason the scheduler is a separate instance is that it requires a dedicated connection to Redis. This is hidden in Bull 3, but the rationale is that you can better choose your redundancy strategy here, and also that if you do not need delayed jobs you do not need any schedulers at all, so for very simple queues you save one connection.
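
A minimal sketch of that setup with BullMQ 1.x (one Queue, one QueueScheduler and one Worker per queue); the queue names, connection details and processor below are placeholders, not taken from this thread:

const { Queue, Worker, QueueScheduler } = require( 'bullmq' );

const connection = { host: '127.0.0.1', port: 6379 }; // adjust to your Redis

// e.g. the 16 queue names from the setup above
const queueNames = [ 'queue-a', 'queue-b', 'queue-c' ];

// One scheduler per queue (needed for delayed/stalled job handling),
// plus optionally a duplicate or two elsewhere for redundancy.
const schedulers = queueNames.map( ( name ) => new QueueScheduler( name, { connection } ) );

const queues = queueNames.map( ( name ) => new Queue( name, { connection } ) );

const workers = queueNames.map( ( name ) => new Worker( name, async ( job ) => {
    // placeholder processor
    return job.data;
}, { connection } ) );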

@manast
Contributor

manast commented Aug 17, 2020

OK, but back to the issue. So the jobs are not being processed at all? In which status are the jobs then? Are they in the waiting list or delayed?

@Adam-Burke
Author

Adam-Burke commented Aug 18, 2020

They were all waiting. Then, the moment I removed the additional queues/workers/schedulers, it started flowing again on the original queues.

@manast
Contributor

manast commented Aug 18, 2020

That's really weird, actually. I would love it if you could manage to write a simple test that reproduces it.

@Adam-Burke
Author

I have the same setup deployed in an easier environment to explore now, so if it occurs again I should be able to provide more info.

Could this happen if one of the newer workers was failing to return a promise and was timing out?

@manast
Contributor

manast commented Aug 18, 2020

If a job never ends, it could keep the worker busy forever, but then you should see the jobs in the active state.
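
A hypothetical illustration of that failure mode (not code from this thread): a processor whose promise never settles keeps its concurrency slot occupied and leaves the job sitting in the active state.

const { Worker } = require( 'bullmq' );

const connection = { host: '127.0.0.1', port: 6379 };

new Worker( 'stuck-queue', async ( job ) => {
    // BUG: this promise never resolves or rejects, so the job never leaves
    // the active state and this concurrency slot is never freed.
    await new Promise( () => {} );
}, { connection } );

new Worker( 'healthy-queue', async ( job ) => {
    // Correct: always return (or throw) so the job moves to completed/failed.
    return job.data;
}, { connection } );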

@Adam-Burke
Author

OK, the issue seems to have recurred. All the queues have stalled with big waiting lists and nothing in active. Not sure how to recreate this locally, as it does not seem to occur immediately; perhaps only after some mild volume goes through the queues.

[Screenshots attached: Screen Shot 2020-08-27 at 2 31 47 pm, Screen Shot 2020-08-27 at 2 30 14 pm]

Correction to the above: I had 7 queues originally and added 3 more, and this issue has appeared.

@Adam-Burke
Author

I am seeing a couple of errors that may help

  • Missing lock for job 3365 finished
  • job stalled more than allowable limit

cheers

@manast
Contributor

manast commented Aug 27, 2020

job stalled more than allowable limit

CPU on that worker is too high, and job locks are not being re-acquired in time. Can you check the CPU monitoring of your worker instances?

Missing lock for job 3365 finished

This one is interesting but I do not think it is the cause of jobs not being processed.

Can you also check whether the workers are properly connected to Redis? Maybe they lost the connection.
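
A hedged sketch of the knobs involved (option names assumed from BullMQ 1.x; the values are examples, not recommendations from this thread): the worker's lockDuration controls how long a job lock lasts, the QueueScheduler runs the stalled-job checks, and the worker's error events make a lost Redis connection visible.

const { Worker, QueueScheduler } = require( 'bullmq' );

const connection = { host: '127.0.0.1', port: 6379 };

// Stalled-job checking is driven by the QueueScheduler in BullMQ 1.x;
// stalledInterval / maxStalledCount are assumed option names.
new QueueScheduler( 'my-queue', { connection, stalledInterval: 30000, maxStalledCount: 1 } );

const worker = new Worker( 'my-queue', async ( job ) => job.data, {
    connection,
    lockDuration: 60000, // example: more headroom before the job counts as stalled
} );

// Surface connection and processing problems instead of failing silently.
worker.on( 'error', ( err ) => console.error( 'worker error:', err ) );
worker.on( 'failed', ( job, err ) => console.error( 'job', job.id, 'failed:', err ) );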

@manast
Contributor

manast commented Aug 27, 2020

If I had to bet on something: your workers are consuming all the CPU, which leads to losing the connection to Redis and then to jobs no longer being processed. It also seems to happen when you add more queues...

@jaschaio

jaschaio commented Aug 28, 2020

@Adam-Burke do you see any "BUSY" errors like I do in #265?

I see a similar issue. I only have two queues, but multiple workers and schedulers for each queue on different machines. Everything works fine if there is not a lot of volume, but as soon as I get some serious volume the workers seem to die because of Redis BUSY errors. I don't actually get failed jobs, just a very long waiting list.

@Adam-Burke
Author

Thanks heaps, really appreciate the help. It was basically what you guessed, @manast.

It seems an infinite loop was occurring somewhere in our logger when it tried to log an object with circular references.
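
For anyone hitting the same thing, a small sketch of the kind of fix involved (not the actual logger from this project): Node's util.inspect handles circular references, whereas JSON.stringify throws on them and a hand-rolled recursive serializer can loop forever.

const util = require( 'util' );

// An object with a circular reference, like the one that tripped up the logger.
const job = { id: 1, data: {} };
job.data.parent = job;

// JSON.stringify( job ) throws "Converting circular structure to JSON",
// and a naive recursive serializer never terminates.
// util.inspect is circular-safe and prints "[Circular]" instead.
console.log( util.inspect( job, { depth: 3 } ) );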

@jaschaio

jaschaio commented Aug 29, 2020

@manast can you shine some more light on this?

I do see a lot of CPU consumption as well, but I couldn't find an infinite loop like @Adam-Burke did.

At a certain volume my jobs just stop processing, I suspect because the worker is dead, but I am not sure how to actually check that.

Before, I only passed in the connection config instead of reusing the same connection. So, in order to check whether the connection is lost, I am now using a single shared connection:

const { Queue, Worker } = require( 'bullmq' );
const Redis = require( 'ioredis' );

// Single shared ioredis connection, reused by the queue and the worker.
const connection = new Redis();

new Queue( 'queue', {
    connection,
} );

// processQueue is the job processor function (defined elsewhere).
new Worker( 'queue', processQueue, {
    lockDuration: 300000,
    concurrency: 5,
    connection,
} );

So while the workers seem dead, with jobs building up in the waiting state, the connection is actually still alive and usable. Restarting the Node process is the only thing that seems to get the jobs processed again.

@manast
Contributor

manast commented Aug 29, 2020

I would spend time trying to understand why the CPU is so high. Also, how is the CPU usage on the Redis instance itself?

@jaschaio

CPU usage on the Redis instance is fine. I am processing ~100,000 events, each of which triggers a job, within a couple of minutes, so I would naturally expect CPU to peak whenever this happens. Not sure how to keep CPU usage low during these spikes.

After a bit more investigation, I realized I am only checking whether the Redis connection is still alive in a separate Node process from the ones the workers run in. So you are probably right that "workers are consuming all CPU, this leads to losing the connection to Redis and then they stop processing jobs". Is there any way I can circumvent this – listening for lost Redis connections and then restarting the worker, for example?
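
One way to sketch that (the ioredis and BullMQ events used here are real, but the restart strategy itself is an assumption, not something prescribed in this thread): watch the shared connection from the same process the worker runs in, and recreate the worker once the connection reports ready again after a disconnect.

const Redis = require( 'ioredis' );
const { Worker } = require( 'bullmq' );

const connection = new Redis();

// ioredis emits these connection lifecycle events.
connection.on( 'error', ( err ) => console.error( 'redis error:', err ) );
connection.on( 'reconnecting', () => console.warn( 'redis reconnecting...' ) );

let sawDisconnect = false;
connection.on( 'close', () => {
    sawDisconnect = true;
    console.warn( 'redis connection closed' );
} );

function createWorker() {
    // processQueue is the existing job processor from the snippet above.
    const w = new Worker( 'queue', processQueue, { connection, concurrency: 5 } );
    w.on( 'error', ( err ) => console.error( 'worker error:', err ) );
    return w;
}

let worker = createWorker();

// Crude restart: once the connection is ready again after a disconnect,
// close the old worker and start a fresh one.
connection.on( 'ready', async () => {
    if ( sawDisconnect ) {
        sawDisconnect = false;
        await worker.close();
        worker = createWorker();
    }
} );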

@manast
Contributor

manast commented Aug 29, 2020

Maybe you need to add more worker instances so that the CPU per worker instance is not saturated. Saturation will create all kinds of problems, not just with the connection to Redis but with any other asynchronous tasks you may run in the job processors.
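
One hedged way to do that on a single machine (a sketch, not something prescribed in this thread): fork one Node process per CPU core with the cluster module, each running its own BullMQ Worker, so a CPU-heavy processor in one process cannot starve the event loop for every queue.

const cluster = require( 'cluster' );
const os = require( 'os' );
const { Worker } = require( 'bullmq' );

if ( cluster.isMaster ) {
    // One worker process per CPU core; adjust to taste.
    for ( let i = 0; i < os.cpus().length; i++ ) {
        cluster.fork();
    }
} else {
    // Each forked process gets its own BullMQ Worker and Redis connection.
    new Worker( 'queue', async ( job ) => {
        // placeholder processor
        return job.data;
    }, { connection: { host: '127.0.0.1', port: 6379 }, concurrency: 5 } );
}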

@jaschaio

jaschaio commented Aug 31, 2020

@manast Thanks for your input.

I did try a few more things, and it seems that CPU is actually not the cause of this issue. I tried it on dev and staging instances, and even with prolonged 100% CPU usage the workers kept processing jobs without any issues. If I killed the Redis instance, the Node process died and was automatically restarted; as soon as the Redis instance was back up, the workers worked once again. The difference from the production environment is that I was using a single-node Redis instead of a master-slave-sentinel deployment.

So from the above it seems that something is wrong with my sentinel deployment and that BullMQ works as intended. As a workaround I used a Redis instance from redislabs.com, and now it works without any issues on the production system as well.

The only thing I find quite odd is that I don't get any connection errors that would actually indicate the master-slave-sentinel deployment is having issues, and that the issue actually only shows up as soon as throughput gets serious.

@AmreeshTyagi

And that the issue actually only shows up as soon as throughput gets serious.
@jaschaio Can you please let me know what throughput you are getting in production with Bull?

I need to process approximately 50k messages per second. Would there be any problem with Bull?

@manast
Contributor

manast commented Feb 25, 2021

It depends a lot on your setup and which features you want to use from Bull. With a plain, non-prioritized queue it should be possible, but you should do a proof of concept first; 50k jobs per second is not trivial in any technology you choose, so it is going to require careful fine-tuning no matter what.
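
A rough starting point for such a proof of concept (a sketch with assumed names; the batch size, concurrency and no-op processor are arbitrary): enqueue a large batch with addBulk, drain it with a trivial worker, and measure jobs per second.

const { Queue, Worker } = require( 'bullmq' );

const connection = { host: '127.0.0.1', port: 6379 };
const queue = new Queue( 'poc', { connection } );
const TOTAL = 50000;

async function run() {
    const start = Date.now();

    // addBulk is much cheaper than one add() call per job.
    for ( let i = 0; i < TOTAL; i += 1000 ) {
        await queue.addBulk(
            Array.from( { length: 1000 }, ( _, j ) => ( { name: 'job', data: { n: i + j } } ) )
        );
    }

    let done = 0;
    const worker = new Worker( 'poc', async () => null, { connection, concurrency: 50 } );
    worker.on( 'completed', async () => {
        if ( ++done === TOTAL ) {
            console.log( 'jobs/sec:', Math.round( TOTAL / ( ( Date.now() - start ) / 1000 ) ) );
            await worker.close();
            await queue.close();
        }
    } );
}

run();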

@jaschaio

@AmreeshTyagi Unfortunately, I ran into too many issues with BullMQ + Redis that I never quite got the hang of. It was not reliable enough for me, so I have since switched to RabbitMQ. As @manast has said, 50k jobs per second is a crazy amount and I am nowhere near that – but I would probably look into a message queue that was specifically built for that kind of workload, like Kafka or RabbitMQ.
