
Too many workers? Waiting jobs building up. #251

Closed
Adam-Burke opened this issue Aug 7, 2020 · 24 comments

@Adam-Burke

Adam-Burke commented Aug 7, 2020

Bullmq 1.8.7
Redis: elasticache
Nodejs 12.18.3

If I have more workers than processes on an instance, will this cause any issues? What happens in this situation?

I just added about 5 additional queues (starting from 14) and workers, and I believe it has caused an issue where waiting jobs are building up and all the queues seem blocked. I manually retriggered one job from the Arena GUI and it appeared to go active, then bounce back to waiting immediately.

I removed the additional queues and related workers/schedulers, and it immediately completed the jobs very quickly, so it appears as if we've hit some limit, but I'm unsure what is most likely to be blocking.

Could this be a bug in our new workers? We don't have a lot of throughput, so I wouldn't have thought it would completely block every queue, but there is probably something I'm not understanding.

@manast
Contributor

manast commented Aug 10, 2020

I do not think the number of workers is the issue here, but I wonder about the schedulers; maybe having too many causes issues. Did you try to reduce the number? You do not need so many of them: one is enough, but a couple more are good to have for redundancy.

@Adam-Burke
Author

Don't I need a scheduler per queue?

@manast
Contributor

manast commented Aug 16, 2020

Yes, one per queue at a minimum. But the number of schedulers does not need to match the number of workers.

@Adam-Burke
Author

Adam-Burke commented Aug 17, 2020

I think I was a bit unclear.

I want roughly 16 unique queues, 16 schedulers (1 per queue), and 16 workers (1 per queue). Is there anything there that I can do without?

In the current setup we just run all the queues and schedulers on one container and all the workers on another.

@manast
Contributor

manast commented Aug 17, 2020

Yes. The reason the scheduler is a separate instance is that it requires a dedicated connection to Redis. This is hidden in Bull 3, but the rationale is that you can better choose your redundancy strategy here, and also that if you do not need delayed jobs you do not need any schedulers at all, so for very simple queues you save one connection.
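
A minimal sketch of that setup with BullMQ 1.x (one Queue, one QueueScheduler and one Worker per queue); the queue names, connection details and processor below are placeholders, not taken from this thread:

const { Queue, Worker, QueueScheduler } = require( 'bullmq' );

const connection = { host: '127.0.0.1', port: 6379 }; // adjust to your Redis

// e.g. the 16 queue names from the setup above
const queueNames = [ 'queue-a', 'queue-b', 'queue-c' ];

// One scheduler per queue (needed for delayed/stalled job handling),
// plus optionally a duplicate or two elsewhere for redundancy.
const schedulers = queueNames.map( ( name ) => new QueueScheduler( name, { connection } ) );

const queues = queueNames.map( ( name ) => new Queue( name, { connection } ) );

const workers = queueNames.map( ( name ) => new Worker( name, async ( job ) => {
    // placeholder processor
    return job.data;
}, { connection } ) );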

@manast
Contributor

manast commented Aug 17, 2020

OK, but back to the issue. So the jobs are not being processed at all? In which status are the jobs then? Are they in the waiting list or delayed?

@Adam-Burke
Author

Adam-Burke commented Aug 18, 2020

They were all waiting. Then, the moment I removed the additional queues/workers/schedulers, it started flowing again on the original queues.

@manast
Contributor

manast commented Aug 18, 2020

That's really weird, actually. I would love it if you could manage to write a simple test that reproduces it.

@Adam-Burke
Author

I have the same setup deployed in an easier environment to explore now, so if it occurs again I should be able to provide more info.

Could this happen if one of the newer workers was failing to return a promise and was timing out?

@manast
Contributor

manast commented Aug 18, 2020

If a job never ends, it could keep the worker busy forever, but then you should see the jobs in the active state.
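
A hypothetical illustration of that failure mode (not code from this thread): a processor whose promise never settles keeps its concurrency slot occupied and leaves the job sitting in the active state.

const { Worker } = require( 'bullmq' );

const connection = { host: '127.0.0.1', port: 6379 };

new Worker( 'stuck-queue', async ( job ) => {
    // BUG: this promise never resolves or rejects, so the job never leaves
    // the active state and this concurrency slot is never freed.
    await new Promise( () => {} );
}, { connection } );

new Worker( 'healthy-queue', async ( job ) => {
    // Correct: always return (or throw) so the job moves to completed/failed.
    return job.data;
}, { connection } );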

@Adam-Burke
Author

OK, the issue seems to have recurred. All the queues have stalled with big waiting lists and nothing in active. Not sure how to recreate this locally, as it does not seem to occur immediately; perhaps only after some mild volume goes through the queues.

[Screenshots attached: Screen Shot 2020-08-27 at 2 31 47 pm, Screen Shot 2020-08-27 at 2 30 14 pm]

Correction to the above: I had 7 queues originally and added 3 more, and this issue has appeared.

@Adam-Burke
Author

I am seeing a couple of errors that may help

  • Missing lock for job 3365 finished
  • job stalled more than allowable limit

cheers

@manast
Contributor

manast commented Aug 27, 2020

job stalled more than allowable limit

CPU on that worker is too high, and job locks are not being re-acquired in time. Can you check the CPU monitoring of your worker instances?

Missing lock for job 3365 finished

This one is interesting but I do not think it is the cause of jobs not being processed.

Can you also check whether the workers are properly connected to Redis? Maybe they lost the connection.
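
A hedged sketch of the knobs involved (option names assumed from BullMQ 1.x; the values are examples, not recommendations from this thread): the worker's lockDuration controls how long a job lock lasts, the QueueScheduler runs the stalled-job checks, and the worker's error events make a lost Redis connection visible.

const { Worker, QueueScheduler } = require( 'bullmq' );

const connection = { host: '127.0.0.1', port: 6379 };

// Stalled-job checking is driven by the QueueScheduler in BullMQ 1.x;
// stalledInterval / maxStalledCount are assumed option names.
new QueueScheduler( 'my-queue', { connection, stalledInterval: 30000, maxStalledCount: 1 } );

const worker = new Worker( 'my-queue', async ( job ) => job.data, {
    connection,
    lockDuration: 60000, // example: more headroom before the job counts as stalled
} );

// Surface connection and processing problems instead of failing silently.
worker.on( 'error', ( err ) => console.error( 'worker error:', err ) );
worker.on( 'failed', ( job, err ) => console.error( 'job', job.id, 'failed:', err ) );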

@manast
Contributor

manast commented Aug 27, 2020

If I had to bet on something: your workers are consuming all the CPU, which leads to losing the connection to Redis and then to jobs no longer being processed. It also seems to happen when you add more queues...

@jaschaio

jaschaio commented Aug 28, 2020

@Adam-Burke do you see any "BUSY" errors like I do in #265?

I see a similar issue. I only have two queues, but multiple workers and schedulers for each queue on different machines. Everything works fine if there is not a lot of volume, but as soon as I get some serious volume the workers seem to die because of Redis BUSY errors. I don't actually get failed jobs, just a very long waiting list.

@Adam-Burke
Author

Thanks heaps, really appreciate the help. It was basically what you guessed, @manast.

It seems an infinite loop was occurring somewhere in our logger when it tried to log an object with circular references.
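
For anyone hitting the same thing, a small sketch of the kind of fix involved (not the actual logger from this project): Node's util.inspect handles circular references, whereas JSON.stringify throws on them and a hand-rolled recursive serializer can loop forever.

const util = require( 'util' );

// An object with a circular reference, like the one that tripped up the logger.
const job = { id: 1, data: {} };
job.data.parent = job;

// JSON.stringify( job ) throws "Converting circular structure to JSON",
// and a naive recursive serializer never terminates.
// util.inspect is circular-safe and prints "[Circular]" instead.
console.log( util.inspect( job, { depth: 3 } ) );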

@jaschaio

jaschaio commented Aug 29, 2020

@manast can you shine some more light on this?

I do see a lot of CPU consumption as well, but I couldn't find an infinite loop like @Adam-Burke did.

At a certain volume my jobs just stop processing, I suspect because the worker is dead, but I am not sure how to actually check that.

Before, I only passed in the connection config instead of reusing the same connection. So, in order to check whether the connection is lost, I am now using a single shared connection:

const { Queue, Worker } = require( 'bullmq' );
const Redis = require( 'ioredis' );

// Single shared ioredis connection, reused by the queue and the worker.
const connection = new Redis();

new Queue( 'queue', {
    connection,
} );

// processQueue is the job processor function (defined elsewhere).
new Worker( 'queue', processQueue, {
    lockDuration: 300000,
    concurrency: 5,
    connection,
} );

So while the workers seem dead, with jobs building up in the waiting state, the connection is actually still alive and usable. Restarting the Node process is the only thing that seems to get the jobs processed again.

@manast
Contributor

manast commented Aug 29, 2020

I would spend time trying to understand why the CPU is so high. Also, how is the CPU usage on the Redis instance itself?

@jaschaio

CPU usage on the Redis instance is fine. I am processing ~100,000 events, each of which triggers a job, within a couple of minutes, so I would naturally expect CPU to peak whenever this happens. Not sure how to keep CPU usage low during these spikes.

After a bit more investigation, I realized I am only checking whether the Redis connection is still alive in a separate Node process from the ones the workers run in. So you are probably right that "workers are consuming all CPU, this leads to losing the connection to Redis and then they stop processing jobs". Is there any way I can circumvent this – listening for lost Redis connections and then restarting the worker, for example?
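
One way to sketch that (the ioredis and BullMQ events used here are real, but the restart strategy itself is an assumption, not something prescribed in this thread): watch the shared connection from the same process the worker runs in, and recreate the worker once the connection reports ready again after a disconnect.

const Redis = require( 'ioredis' );
const { Worker } = require( 'bullmq' );

const connection = new Redis();

// ioredis emits these connection lifecycle events.
connection.on( 'error', ( err ) => console.error( 'redis error:', err ) );
connection.on( 'reconnecting', () => console.warn( 'redis reconnecting...' ) );

let sawDisconnect = false;
connection.on( 'close', () => {
    sawDisconnect = true;
    console.warn( 'redis connection closed' );
} );

function createWorker() {
    // processQueue is the existing job processor from the snippet above.
    const w = new Worker( 'queue', processQueue, { connection, concurrency: 5 } );
    w.on( 'error', ( err ) => console.error( 'worker error:', err ) );
    return w;
}

let worker = createWorker();

// Crude restart: once the connection is ready again after a disconnect,
// close the old worker and start a fresh one.
connection.on( 'ready', async () => {
    if ( sawDisconnect ) {
        sawDisconnect = false;
        await worker.close();
        worker = createWorker();
    }
} );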

@manast
Contributor

manast commented Aug 29, 2020

Maybe you need to add more worker instances so that the CPU per worker instance is not saturated. Saturation will create all kinds of problems, not just with the connection to Redis but with any other asynchronous tasks you may run in the job processors.
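
One hedged way to do that on a single machine (a sketch, not something prescribed in this thread): fork one Node process per CPU core with the cluster module, each running its own BullMQ Worker, so a CPU-heavy processor in one process cannot starve the event loop for every queue.

const cluster = require( 'cluster' );
const os = require( 'os' );
const { Worker } = require( 'bullmq' );

if ( cluster.isMaster ) {
    // One worker process per CPU core; adjust to taste.
    for ( let i = 0; i < os.cpus().length; i++ ) {
        cluster.fork();
    }
} else {
    // Each forked process gets its own BullMQ Worker and Redis connection.
    new Worker( 'queue', async ( job ) => {
        // placeholder processor
        return job.data;
    }, { connection: { host: '127.0.0.1', port: 6379 }, concurrency: 5 } );
}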

@jaschaio

jaschaio commented Aug 31, 2020

@manast Thanks for your input.

I did try a few more things, and it seems that CPU is actually not the cause of this issue. I tried it on dev and staging instances, and even with prolonged 100% CPU usage the workers kept processing jobs without any issues. If I killed the Redis instance, the Node process died and was automatically restarted; as soon as the Redis instance was back up, the workers worked once again. The difference from the production environment is that I was using a single-node Redis instead of a master-slave-sentinel deployment.

So from the above it seems that something is wrong with my sentinel deployment and that BullMQ works as intended. As a workaround I used a Redis instance from redislabs.com, and now it works without any issues on the production system as well.

The only thing I find quite odd is that I don't get any connection errors that would actually indicate the master-slave-sentinel deployment is having issues, and that the issue actually only shows up as soon as throughput gets serious.

@AmreeshTyagi

And that the issue actually only shows up as soon as throughput gets serious.
@jaschaio Can you please let me know what throughput you are getting in production with Bull?

I need to process approximately 50k messages per second. Would there be any problem with Bull?

@manast
Contributor

manast commented Feb 25, 2021

It depends a lot on your setup and which features you want to use from Bull. With a plain, non-prioritized queue it should be possible, but you should do a proof of concept first; 50k jobs per second is not trivial in any technology you choose, so it is going to require careful fine-tuning no matter what.
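
A rough starting point for such a proof of concept (a sketch with assumed names; the batch size, concurrency and no-op processor are arbitrary): enqueue a large batch with addBulk, drain it with a trivial worker, and measure jobs per second.

const { Queue, Worker } = require( 'bullmq' );

const connection = { host: '127.0.0.1', port: 6379 };
const queue = new Queue( 'poc', { connection } );
const TOTAL = 50000;

async function run() {
    const start = Date.now();

    // addBulk is much cheaper than one add() call per job.
    for ( let i = 0; i < TOTAL; i += 1000 ) {
        await queue.addBulk(
            Array.from( { length: 1000 }, ( _, j ) => ( { name: 'job', data: { n: i + j } } ) )
        );
    }

    let done = 0;
    const worker = new Worker( 'poc', async () => null, { connection, concurrency: 50 } );
    worker.on( 'completed', async () => {
        if ( ++done === TOTAL ) {
            console.log( 'jobs/sec:', Math.round( TOTAL / ( ( Date.now() - start ) / 1000 ) ) );
            await worker.close();
            await queue.close();
        }
    } );
}

run();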

@jaschaio

@AmreeshTyagi Unfortunately, I ran into too many issues with BullMQ + Redis that I never quite got the hang of. It was not reliable enough for me, so I have since switched to RabbitMQ. As @manast has said, 50k jobs per second is a crazy amount and I am nowhere near that – but I would probably look into a message queue that was specifically built for that kind of workload, like Kafka or RabbitMQ.
