
addBulk performance issues #1670

Open · DanielNetzer opened this issue Feb 9, 2023 · 41 comments

@DanielNetzer

Hello everyone, we have been experiencing performance issues when utilizing the addBulk method (https://docs.bullmq.io/guide/queues/adding-bulks) when adding over 1K jobs (we currently split our bulks into smaller chunks of 1K).

[Screenshot 2023-02-09: flame graph of the addBulk call]

As shown in the flame graph above, for each job added it computes multiple fields, creating significant latency (which can reach up to a few seconds).

EVALSHA 4e1516c18cc8ae3018108f2008b5c7c3d7934e88 8 bull:CHROME:wait bull:CHROME:paused bull:CHROME:meta bull:CHROME:id bull:CHROME:delayed bull:CHROME:priority bull:CHROME:completed bull:CHROME:events ? {"id":"e007f205-4cc2-4b8d-93a3-a2fc8ac724ca","ttl":300,"projectId":"1a2a3fae-8cf6-4b02-b231-93ef1... ?

This is an example of one of the EVALSHA commands being submitted individually.
The infrastructure is AWS ElastiCache with a single node of type cache.m6g.xlarge.

BullMQ should create a single Redis transaction and push it to Redis instead of doing a round trip for every job added.

@manast
Contributor

manast commented Feb 9, 2023

For addBulk we are using an ioredis multi transaction, which is supposed to use pipelining to greatly alleviate roundtrip latencies:
https://github.com/luin/ioredis#transaction
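
For reference, this is roughly what that pattern looks like in plain ioredis (a minimal sketch, not BullMQ's actual internal code): all queued commands are flushed together when exec() is called.

  // Minimal sketch of an ioredis multi/pipeline (not BullMQ's internal code):
  // commands are buffered client-side and flushed together on exec(), so each
  // command does not wait for the previous reply before being sent.
  import Redis from 'ioredis';

  async function main() {
    const redis = new Redis(); // assumes a local Redis on the default port

    const multi = redis.multi();
    for (let i = 0; i < 1000; i++) {
      // Queue up 1000 commands; nothing is executed until exec() is called.
      multi.set(`job:${i}`, JSON.stringify({ id: i }));
    }

    // exec() sends everything in one pipeline and returns all replies at once.
    const results = await multi.exec();
    console.log(`executed ${results?.length ?? 0} commands`);

    await redis.quit();
  }

  main().catch(console.error);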

@DanielNetzer
Author

I know; I went over the code and it seems like it should work this way, but when looking at traces it submits a great number of EVALSHA commands that check all the keys relevant to a single job, causing immense backpressure on Redis.
I'm thinking about moving to a Redis cluster, but that won't help, since the key prefix is the queue name and everything would still land on a single node.

@manast
Contributor

manast commented Feb 9, 2023

Are you sure you are really chunking in 1k chunks? Can you send me a piece of code that shows this behaviour?

@manast
Contributor

manast commented Feb 9, 2023

Btw, the pipeline does not imply that all the commands are sent in one roundtrip, but that they do not need to wait for the response before sending the next one. Also, how long is that 1k batch taking to be sent? Depending on the network conditions, something like 100-200 ms would be normal.
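
If you want to check, timing each chunk on your side is enough to see whether the latency is in addBulk itself or elsewhere. A rough sketch (queue name taken from the EVALSHA keys above; the connection options are placeholders for your setup):

  // Rough sketch for timing a single 1k addBulk batch.
  import { Queue } from 'bullmq';

  async function measureBatch() {
    const queue = new Queue('CHROME', {
      connection: { host: '127.0.0.1', port: 6379 },
    });

    const jobs = Array.from({ length: 1000 }, (_, i) => ({
      name: `job-${i}`,
      data: { i },
    }));

    const start = Date.now();
    await queue.addBulk(jobs);
    console.log(`addBulk of ${jobs.length} jobs took ${Date.now() - start} ms`);

    await queue.close();
  }

  measureBatch().catch(console.error);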

@DanielNetzer
Author

Hi @manast, thanks for the fast response.
Here is our code snippet that slices the jobs into chunks and publishes them.

  /**
   * Publishes jobs to queues.
   * sliceIntoChunks (used below) is a local helper that splits the bulk into
   * chunks of 1000 jobs before calling addBulk.
   */
  async publish(
    jobs: JobRequest[],
    distributionQueue: string,
    logger: pino.Logger,
  ) {
    const connection = this.connection;

    const queue = new Queue(distributionQueue, {
      connection,
      sharedConnection: true,
    });

    const jobsBulk: JobQueue[] = jobs.map((x) => {
      return {
        name: x.id,
        data: x,
        opts: {
          jobId: x.id,
          removeOnComplete: true,
          removeOnFail: true,
        },
      };
    });

    const jobChunks = sliceIntoChunks(jobsBulk, 1000) as JobQueue[][];

    for (const jobChunk of jobChunks) {
      try {
        await queue.addBulk(jobChunk);
      } catch (error: any) {
        logger.error(
          {
            distribution_queue: queue.name,
            jobsCount: jobChunk.length,
            error_message: error?.message,
          },
          `error when adding jobs chunk`,
        );
      }
    }

    try {
      await queue.close();
    } catch (error: any) {
      logger.error(
        {
          distribution_queue: queue.name,
          error_message: error?.message,
        },
        `failed to close queue`,
      );
    }
  }

As you can see in the flame graph, it can take up to a second or more.

@manast
Contributor

manast commented Feb 9, 2023

Can you provide a minimal example that I can run that reproduces the issue? Do you get the same behaviour running this locally?

@DanielNetzer
Author

I would have to create a docker-compose with an entire infrastructure setup.
Take the code snippet above, push jobs nonstop, consume them on the other side, and you will see Redis impacted almost immediately.
I will try to put something together. If you look at the current BullMQ implementation, I think building addBulk on top of the base addJob may not be the most suitable way to handle massive inserts.

@manast
Contributor

manast commented Feb 9, 2023

The snippet above is not enough for me to test. I have my own performance tests and addBulk is performing well, so if you want me to examine your use case you need to provide a simple test case that I can run easily (docker-compose etc. should not be needed, nor external dependencies).

@DanielNetzer
Author

@manast I will put together a small test case to check this and post a link to the repo here later this week.
Thanks again for the quick response and for this great library.

@DanielNetzer
Author

DanielNetzer commented Feb 15, 2023

Hello @manast, we have it 👯
https://github.com/DanielNetzer/bullmq-load
LMK if there's anything else I can help with.
Once running with multiple consumers, Redis spikes up to 80-90% CPU. I'm running the Docker daemon with 8 cores.

@manast
Contributor

manast commented Feb 16, 2023

Ok, so the problem is not addBulk but the workers, or did I miss something?

@DanielNetzer
Author

Still trying to figure out what is choking Redis; not a big expert in Lua scripts, sadly :(

@manast
Contributor

manast commented Feb 16, 2023

Ok, so at least addBulk seems to work pretty well with that test. I am getting around 25% CPU usage while running both the test code and the Redis instance on my local machine. However, I can see that some backpressure builds up, as the test adds 5k new jobs every 100 ms and at least my machine cannot cope with that: it takes 300-400 ms per 5k jobs.


@DanielNetzer
Author

We have over 4K k8s pods connected as consumers, and I'm thinking that this might be what is pressuring the Redis instance. We are transitioning to a Redis Cluster, but the distribution per shard is by queue at most, so the problem might persist since almost 2K pods listen to the same queue.
WDYT?

@manast
Contributor

manast commented Feb 16, 2023

Redis cluster will not bring any benefit to BullMQ performance-wise. I am still investigating your case to see if I can identify some bottlenecks.

@manast
Contributor

manast commented Feb 16, 2023

Btw, I noticed an issue with your test: it is only adding 5k jobs once, as it reuses the jobIds for every batch. So the first batch adds 5k jobs, but the rest are not added again until those jobs are consumed.
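
Since a custom jobId acts as a deduplication key, every batch needs fresh ids for the jobs to actually be added. Something along these lines would fix the test (a sketch; the queue name is just a placeholder):

  // Sketch: give every batch fresh jobIds so each addBulk call actually
  // enqueues new jobs instead of being deduplicated against the first batch.
  import { randomUUID } from 'node:crypto';
  import { Queue } from 'bullmq';

  const queue = new Queue('load-test'); // placeholder name, default local connection

  async function addBatch(batchSize = 5000) {
    const jobs = Array.from({ length: batchSize }, () => {
      const id = randomUUID(); // unique per job and per batch
      return { name: id, data: { id }, opts: { jobId: id } };
    });
    await queue.addBulk(jobs);
  }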

@manast
Contributor

manast commented Feb 16, 2023

Secondly, in your consumer code you are outputting the whole job. This consumes a lot of resources: just doing a large console.log that fast and with that much data can put the CPU at 100%.
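
Logging just the job id is enough for a load test; something like this keeps the consumer cheap (a sketch with a placeholder queue name and connection):

  // Sketch: a lean processor that logs only the job id instead of serializing
  // the whole payload on every job.
  import { Worker } from 'bullmq';

  const worker = new Worker(
    'load-test',
    async (job) => {
      // ...do the actual work here...
      console.log(`processed job ${job.id}`); // cheap, unlike console.log(job)
    },
    { connection: { host: '127.0.0.1', port: 6379 } },
  );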

@manast
Contributor

manast commented Feb 16, 2023

Without the console.log and with a concurrency of 1000, it is processing jobs at around 20-30% CPU usage... so I cannot identify any performance issues so far.

@manast
Contributor

manast commented Feb 16, 2023

But ok, so the scenario is 4000 workers, right? What about the concurrency factor per worker? I would like to better understand the complete scenario to see if I can figure out a bottleneck. It should be possible to generate a test case using AWS ElastiCache that more easily shows how the Redis CPU spikes to 100% in some circumstances; you are welcome to contact me privately if you are not allowed to share some details.

@DanielNetzer
Author

Each worker runs a single job at a time; it's a browser process that runs automated testing, Playwright for example.
We have Chrome, Firefox, Edge and Opera. We can't run multiple jobs because only one tab is active.
Does that answer the question?

@manast
Contributor

manast commented Feb 16, 2023

Ok, so the worst case is 4k jobs simultaneously?

@manast
Contributor

manast commented Feb 16, 2023

Btw, regarding the AWS instance, I think you would be better off with cache.r6g.large: you get a bit more RAM, but the important thing is that it is much cheaper. The one you are using has 4 cores, but Redis is single-core, so the remaining 3 are not used at all.

@manast
Contributor

manast commented Feb 16, 2023

Hmm, it seems that ElastiCache can indeed take some advantage of multiple cores now: https://aws.amazon.com/blogs/database/boosting-application-performance-and-reducing-costs-with-amazon-elasticache-for-redis/

@DanielNetzer
Author

The worst case is way above 4K, reaching up to a few tens of thousands of pods.

@manast
Contributor

manast commented Feb 16, 2023

Ok. I am not sure how well Redis copes with so many connections; this is something we should investigate. But there are some things BullMQ does per worker, for example checking for stalled jobs. This check is super fast, but if you have tens of thousands of workers it starts adding up. We can make optimizations to minimize the impact in these cases, though.

@DanielNetzer
Author

I truly believe it's the number of workers connected. Weren't the QueueSchedulers the ones checking for stalled jobs?

@manast
Contributor

manast commented Feb 16, 2023

@DanielNetzer queue schedulers are not needed anymore. One thing you can try is to increase the interval at which stalled jobs are checked; since you will have thousands of workers, something like every 5 minutes should be more than enough: https://api.docs.bullmq.io/interfaces/WorkerOptions.html#stalledInterval
I do not think this will prove to be the culprit though... btw, how many jobs do you process per second?
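
For example (a sketch; stalledInterval is in milliseconds, and the processor and connection below are placeholders):

  // Sketch: raise the stalled-jobs check to ~5 minutes so thousands of workers
  // are not all running it at the default frequency.
  import { Worker, Job } from 'bullmq';

  async function processBrowserJob(job: Job) {
    // run the browser automation for this job
    return job.data;
  }

  const worker = new Worker('CHROME', processBrowserJob, {
    connection: { host: '127.0.0.1', port: 6379 },
    stalledInterval: 5 * 60 * 1000, // check for stalled jobs every 5 minutes
  });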

@DanielNetzer
Author

Hello @manast. We currently process approx. 10K jobs per second, give or take.
Looking at what you sent, I see many other options that cause BullMQ to check things against Redis, and I can't find the default settings for them:
lockRenewTime
drainDelay
skipDelayCheck
stalledInterval
Any way to get the default values for these documented, perhaps in the README?

@manast
Contributor

manast commented Feb 17, 2023

I will fix the API docs. How those settings could affect performance is a bit involved to explain. For instance, drainDelay is irrelevant if the queue is always busy, and if it is not busy then there is even less of an issue, since you have a lot of spare resources.

@DanielNetzer
Author

@manast I think I managed to isolate the issue: it's the blocking BRPOPLPUSH.
Related to redis/ioredis#1718
[Screenshot 2023-02-17: trace showing a long blocking BRPOPLPUSH call]

@manast
Contributor

manast commented Feb 18, 2023

Which version of BullMQ are you using in that test? The current maximum timeout for brpoplpush in BullMQ is 10 seconds, so I do not know how it can be almost 40 seconds in your screenshot unless it is a different version.

@DanielNetzer
Author

I rolled out a version with the latest ioredis and BullMQ on Friday to our DEV and Staging environments; if all is well we will roll it out to PROD as well.
But again, the 10-second timeout should only happen if there aren't any jobs in the queue, correct?

@manast
Contributor

manast commented Feb 19, 2023

Or if there are only delayed/rate-limited jobs.

@DanielNetzer
Author

So this issue still remains a mystery. We are deploying new Redis clusters instead of the single node, will check the performance, and are also updating the Redis version to 7.* (ElastiCache).
https://redis.com/blog/redis-7-generally-available/

@DanielNetzer
Author

I checked, and the performance seems to be the same, so no improvement from the AWS side of things. We are switching to a Redis Cluster; hopefully it will distribute the queues across the different nodes.
We don't use delayed/rate-limited jobs.

@manast
Contributor

manast commented Feb 21, 2023

I really want to help you understand what is going on. If you use a cluster you have to consider this: https://docs.bullmq.io/bull/patterns/redis-cluster
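
The short version of that doc: wrap the queue prefix in curly braces so all of a queue's keys hash to the same cluster slot (a sketch; the cluster endpoint is a placeholder, and any workers must use the same prefix as the queue):

  // Sketch of the hash-tag pattern from that doc: a {braced} prefix forces all
  // of a queue's keys into one hash slot so the multi-key Lua scripts keep
  // working, while different queues can still land on different shards.
  import { Cluster } from 'ioredis';
  import { Queue } from 'bullmq';

  const connection = new Cluster([{ host: 'my-cluster-endpoint', port: 6379 }]);

  const chromeQueue = new Queue('CHROME', {
    prefix: '{CHROME}', // keys become {CHROME}:CHROME:wait, {CHROME}:CHROME:id, ...
    connection,
  });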

@DanielNetzer
Author

yep, already using that.

@billoneil

billoneil commented Apr 14, 2023

@DanielNetzer it looks like you are using Datadog. Could you also take a look at the Node runtime metrics dashboard? If the Node CPU is very high or you are hitting event-loop delays, then you can't entirely trust the timing of the spans, because they can include time where the system was hanging. That is a problem with the service, not bull/redis, but the spans will look slower.

  1. What do the redis CPU metrics look like?
  2. What do the node CPU metrics, and event loop delays look like?

I noticed that when we enabled bull on one of our services it makes ~6k EVALSHA calls a second in total across 3 nodes (~2k EVALSHAs a second per node), which seems very high for our limited use case, but I'll have to dig deeper before I can share anything there. Although this shouldn't be much load on the Redis instance, I think creating ~6k spans a second is adding a performance hit from the Datadog dd-trace agent. In this service we see much more time spent in the dd-trace agent than in our other services when inspecting profiles. After enabling bull with these ~2k Redis requests a second, we saw a ~20% increase in CPU usage across each Node app; I think this could be from the tracing library itself.

I might try configuring the tracer to not create spans for EVALSHA and see if that reduces the CPU usage.
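
If the dd-trace version we run supports a command blocklist on the ioredis plugin, it would look something like this (an untested sketch; worth double-checking against the dd-trace plugin docs):

  // Untested sketch: exclude EVALSHA from ioredis spans, assuming the installed
  // dd-trace version supports a command blocklist on the ioredis plugin.
  import tracer from 'dd-trace';

  tracer.init();
  tracer.use('ioredis', {
    blocklist: ['evalsha'], // commands listed here would not generate spans
  });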

@billoneil

billoneil commented May 10, 2023

Following up on my comment: our issue was with calling removeJob in bull (not bullmq), but it manifested as thousands of trace spans a second from issuing the SCAN command. The dd-trace library accounted for a significant amount of CPU whenever the SCAN triggered. Once we removed the code calling removeJob and no longer created so many Datadog spans, CPU usage in the Node apps dropped drastically. Redis handled all of these requests with no problem.

@manast
Contributor

manast commented May 12, 2023

@billoneil could you share more information regarding your call to removeJob? Even though it is a slow call, it does not require a call to SCAN AFAIK.

@billoneil

billoneil commented Jun 28, 2023

@manast it was the removeJobs command calling SCAN; sorry for the typo. Again, this was on bull, not bullmq.

The relevant point for this thread is that making thousands of calls rapidly may perform well in Redis, but we saw a performance hit on our application side from the APM. That CPU hit made the spans (generated client side) appear slower than the actual requests took (server side on Redis). The issue was the CPU maxing out on the application side, not Redis/bullmq. This is not an issue with bullmq.
