
addBulk performance issues #1670

Open · DanielNetzer opened this issue Feb 9, 2023 · 41 comments

@DanielNetzer

Hello everyone, we have been experiencing performance issues when utilizing the addBulk method (https://docs.bullmq.io/guide/queues/adding-bulks) when adding over 1K jobs (we currently split our bulks into smaller chunks of 1K).

[Screenshot 2023-02-09: flame graph of the addBulk call]

As shown in the flame graph above, for each job added it computes multiple fields, creating significant latency (which can reach up to a few seconds).

EVALSHA 4e1516c18cc8ae3018108f2008b5c7c3d7934e88 8 bull:CHROME:wait bull:CHROME:paused bull:CHROME:meta bull:CHROME:id bull:CHROME:delayed bull:CHROME:priority bull:CHROME:completed bull:CHROME:events ? {"id":"e007f205-4cc2-4b8d-93a3-a2fc8ac724ca","ttl":300,"projectId":"1a2a3fae-8cf6-4b02-b231-93ef1... ?

This is an example of one of the EVALSHA commands being submitted individually.
The infrastructure is AWS ElastiCache with a single node of type cache.m6g.xlarge.

BullMQ should create a single Redis transaction and push it to Redis instead of doing a round trip for every job added.

@manast
Contributor

manast commented Feb 9, 2023

For addBulk we are using an ioredis multi transaction, which is supposed to use pipelining to greatly alleviate roundtrip latencies:
https://github.com/luin/ioredis#transaction
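
For reference, this is roughly what that pattern looks like in plain ioredis (a minimal sketch, not BullMQ's actual internal code): all queued commands are flushed together when exec() is called.

  // Minimal sketch of an ioredis multi/pipeline (not BullMQ's internal code):
  // commands are buffered client-side and flushed together on exec(), so each
  // command does not wait for the previous reply before being sent.
  import Redis from 'ioredis';

  async function main() {
    const redis = new Redis(); // assumes a local Redis on the default port

    const multi = redis.multi();
    for (let i = 0; i < 1000; i++) {
      // Queue up 1000 commands; nothing is executed until exec() is called.
      multi.set(`job:${i}`, JSON.stringify({ id: i }));
    }

    // exec() sends everything in one pipeline and returns all replies at once.
    const results = await multi.exec();
    console.log(`executed ${results?.length ?? 0} commands`);

    await redis.quit();
  }

  main().catch(console.error);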

@DanielNetzer
Author

I know; I went over the code and it seems like it should work this way, but when looking at traces it submits a great number of EVALSHA commands that check all the keys relevant to a single job, causing immense backpressure on Redis.
I'm thinking about moving to a Redis cluster, but that won't help, since the key prefix is the queue name and everything would still land on a single node.

@manast
Contributor

manast commented Feb 9, 2023

Are you sure you are really chunking in 1k chunks? Can you send me a piece of code that shows this behaviour?

@manast
Contributor

manast commented Feb 9, 2023

Btw, the pipeline does not imply that all the commands are sent in one roundtrip, but that they do not need to wait for the response before sending the next one. Also, how long is that 1k batch taking to be sent? Depending on the network conditions, something like 100-200 ms would be normal.
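
If you want to check, timing each chunk on your side is enough to see whether the latency is in addBulk itself or elsewhere. A rough sketch (queue name taken from the EVALSHA keys above; the connection options are placeholders for your setup):

  // Rough sketch for timing a single 1k addBulk batch.
  import { Queue } from 'bullmq';

  async function measureBatch() {
    const queue = new Queue('CHROME', {
      connection: { host: '127.0.0.1', port: 6379 },
    });

    const jobs = Array.from({ length: 1000 }, (_, i) => ({
      name: `job-${i}`,
      data: { i },
    }));

    const start = Date.now();
    await queue.addBulk(jobs);
    console.log(`addBulk of ${jobs.length} jobs took ${Date.now() - start} ms`);

    await queue.close();
  }

  measureBatch().catch(console.error);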

@DanielNetzer
Author

Hi @manast, thanks for the fast response.
Here is our code snippet that slices the jobs into chunks and publishes them.

  /**
   * Publishes jobs to queues.
   * sliceIntoChunks (used below) is a local helper that splits the bulk into
   * chunks of 1000 jobs before calling addBulk.
   */
  async publish(
    jobs: JobRequest[],
    distributionQueue: string,
    logger: pino.Logger,
  ) {
    const connection = this.connection;

    const queue = new Queue(distributionQueue, {
      connection,
      sharedConnection: true,
    });

    const jobsBulk: JobQueue[] = jobs.map((x) => {
      return {
        name: x.id,
        data: x,
        opts: {
          jobId: x.id,
          removeOnComplete: true,
          removeOnFail: true,
        },
      };
    });

    const jobChunks = sliceIntoChunks(jobsBulk, 1000) as JobQueue[][];

    for (const jobChunk of jobChunks) {
      try {
        await queue.addBulk(jobChunk);
      } catch (error: any) {
        logger.error(
          {
            distribution_queue: queue.name,
            jobsCount: jobChunk.length,
            error_message: error?.message,
          },
          `error when adding jobs chunk`,
        );
      }
    }

    try {
      await queue.close();
    } catch (error: any) {
      logger.error(
        {
          distribution_queue: queue.name,
          error_message: error?.message,
        },
        `failed to close queue`,
      );
    }
  }

As you can see in the flame graph, it can take up to a second or more.

@manast
Contributor

manast commented Feb 9, 2023

Can you provide a minimal example that I can run that reproduces the issue? Do you get the same behaviour running this locally?

@DanielNetzer
Author

I would have to create a docker-compose with an entire infrastructure setup.
Take the code snippet above, push jobs nonstop, consume them on the other side, and you will see Redis impacted almost immediately.
I will try to put something together. If you look at the current BullMQ implementation, I think building addBulk on top of the base addJob may not be the most suitable way to handle massive inserts.

@manast
Contributor

manast commented Feb 9, 2023

The snippet above is not enough for me to test. I have my own performance tests and addBulk is performing well, so if you want me to examine your use case you need to provide a simple test case that I can run easily (docker-compose etc. should not be needed, nor external dependencies).

@DanielNetzer
Author

@manast I will put together a small test case to check this and post a link to the repo here later this week.
Thanks again for the quick response and for this great library.

@DanielNetzer
Author

DanielNetzer commented Feb 15, 2023

Hello @manast, we have it 👯
https://github.com/DanielNetzer/bullmq-load
LMK if there's anything else I can help with.
Once running with multiple consumers, Redis spikes up to 80-90% CPU. I'm running the Docker daemon with 8 cores.

@manast
Contributor

manast commented Feb 16, 2023

Ok, so the problem is not addBulk but the workers, or did I miss something?

@DanielNetzer
Author

Still trying to figure out what is choking Redis; not a big expert in Lua scripts, sadly :(

@manast
Contributor

manast commented Feb 16, 2023

Ok, so at least addBulk seems to work pretty well with that test. I am getting around 25% CPU usage while running both the test code and the Redis instance on my local machine. However, I can see that some backpressure builds up, as the test adds 5k new jobs every 100 ms and at least my machine cannot cope with that: it takes 300-400 ms per 5k jobs.


@DanielNetzer
Author

We have over 4K k8s pods connected as consumers, and I'm thinking that this might be what is pressuring the Redis instance. We are transitioning to a Redis Cluster, but the distribution per shard is by queue at most, so the problem might persist since almost 2K pods listen to the same queue.
WDYT?

@manast
Contributor

manast commented Feb 16, 2023

Redis cluster will not bring any benefit to BullMQ performance-wise. I am still investigating your case to see if I can identify some bottlenecks.

@manast
Contributor

manast commented Feb 16, 2023

Btw, I noticed an issue with your test: it is only adding 5k jobs once, as it reuses the jobIds for every batch. So the first batch adds 5k jobs, but the rest are not added again until those jobs are consumed.
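
Since a custom jobId acts as a deduplication key, every batch needs fresh ids for the jobs to actually be added. Something along these lines would fix the test (a sketch; the queue name is just a placeholder):

  // Sketch: give every batch fresh jobIds so each addBulk call actually
  // enqueues new jobs instead of being deduplicated against the first batch.
  import { randomUUID } from 'node:crypto';
  import { Queue } from 'bullmq';

  const queue = new Queue('load-test'); // placeholder name, default local connection

  async function addBatch(batchSize = 5000) {
    const jobs = Array.from({ length: batchSize }, () => {
      const id = randomUUID(); // unique per job and per batch
      return { name: id, data: { id }, opts: { jobId: id } };
    });
    await queue.addBulk(jobs);
  }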

@manast
Contributor

manast commented Feb 16, 2023

Secondly, in your consumer code you are outputting the whole job. This consumes a lot of resources: just doing a large console.log that fast and with that much data can put the CPU at 100%.
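
Logging just the job id is enough for a load test; something like this keeps the consumer cheap (a sketch with a placeholder queue name and connection):

  // Sketch: a lean processor that logs only the job id instead of serializing
  // the whole payload on every job.
  import { Worker } from 'bullmq';

  const worker = new Worker(
    'load-test',
    async (job) => {
      // ...do the actual work here...
      console.log(`processed job ${job.id}`); // cheap, unlike console.log(job)
    },
    { connection: { host: '127.0.0.1', port: 6379 } },
  );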

@manast
Contributor

manast commented Feb 16, 2023

Without the console.log and with a concurrency of 1000, it is processing jobs at around 20-30% CPU usage... so I cannot identify any performance issues so far.

@manast
Contributor

manast commented Feb 16, 2023

But ok, so the scenario is 4000 workers, right? What about the concurrency factor per worker? I would like to better understand the complete scenario to see if I can figure out a bottleneck. It should be possible to generate a test case using AWS ElastiCache that more easily shows how the Redis CPU spikes to 100% in some circumstances; you are welcome to contact me privately if you are not allowed to share some details.

@DanielNetzer
Author

Each worker runs a single job at a time; it's a browser process that runs automated testing, Playwright for example.
We have Chrome, Firefox, Edge and Opera. We can't run multiple jobs because only one tab is active.
Does that answer the question?

@manast
Contributor

manast commented Feb 16, 2023

Ok, so the worst case is 4k jobs simultaneously?

@manast
Contributor

manast commented Feb 16, 2023

Btw, regarding the AWS instance, I think you would be better off with cache.r6g.large: you get a bit more RAM, but the important thing is that it is much cheaper. The one you are using has 4 cores, but Redis is single-core, so the remaining 3 are not used at all.

@manast
Contributor

manast commented Feb 16, 2023

Hmm, it seems that ElastiCache can indeed take some advantage of multiple cores now: https://aws.amazon.com/blogs/database/boosting-application-performance-and-reducing-costs-with-amazon-elasticache-for-redis/

@DanielNetzer
Author

The worst case is way above 4K, reaching up to a few tens of thousands of pods.

@manast
Contributor

manast commented Feb 16, 2023

Ok. I am not sure how well Redis copes with so many connections; this is something we should investigate. But there are some things BullMQ does per worker, for example checking for stalled jobs. This check is super fast, but if you have tens of thousands of workers it starts adding up. We can make optimizations to minimize the impact in these cases, though.

@DanielNetzer
Author

I truly believe it's the number of workers connected. Weren't the QueueSchedulers the ones checking for stalled jobs?

@manast
Contributor

manast commented Feb 16, 2023

@DanielNetzer queue schedulers are not needed anymore. One thing you can try is to increase the interval at which stalled jobs are checked; since you will have thousands of workers, something like every 5 minutes should be more than enough: https://api.docs.bullmq.io/interfaces/WorkerOptions.html#stalledInterval
I do not think this will prove to be the culprit though... btw, how many jobs do you process per second?
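
For example (a sketch; stalledInterval is in milliseconds, and the processor and connection below are placeholders):

  // Sketch: raise the stalled-jobs check to ~5 minutes so thousands of workers
  // are not all running it at the default frequency.
  import { Worker, Job } from 'bullmq';

  async function processBrowserJob(job: Job) {
    // run the browser automation for this job
    return job.data;
  }

  const worker = new Worker('CHROME', processBrowserJob, {
    connection: { host: '127.0.0.1', port: 6379 },
    stalledInterval: 5 * 60 * 1000, // check for stalled jobs every 5 minutes
  });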

@DanielNetzer
Author

Hello @manast. We currently process approx. 10K jobs per second, give or take.
Looking at what you sent, I see many other options that cause BullMQ to check things against Redis, and I can't find the default settings for them:
lockRenewTime
drainDelay
skipDelayCheck
stalledInterval
Any way to get the default values for these documented, perhaps in the README?

@manast
Contributor

manast commented Feb 17, 2023

I will fix the API docs. How those settings could affect performance is a bit involved to explain. For instance, drainDelay is irrelevant if the queue is always busy, and if it is not busy then there is even less of an issue, since you have a lot of spare resources.

@DanielNetzer
Author

@manast I think I managed to isolate the issue: it's the blocking BRPOPLPUSH.
Related to redis/ioredis#1718
[Screenshot 2023-02-17: trace showing a long blocking BRPOPLPUSH call]

@manast
Contributor

manast commented Feb 18, 2023

Which version of BullMQ are you using in that test? The current maximum timeout for brpoplpush in BullMQ is 10 seconds, so I do not know how it can be almost 40 seconds in your screenshot unless it is a different version.

@DanielNetzer
Author

I rolled out a version with the latest ioredis and BullMQ on Friday to our DEV and Staging environments; if all is well we will roll it out to PROD as well.
But again, the 10-second timeout should only happen if there aren't any jobs in the queue, correct?

@manast
Contributor

manast commented Feb 19, 2023

Or if there are only delayed/rate-limited jobs.

@DanielNetzer
Author

So this issue still remains a mystery. We are deploying new Redis clusters instead of the single node, will check the performance, and are also updating the Redis version to 7.* (ElastiCache).
https://redis.com/blog/redis-7-generally-available/

@DanielNetzer
Author

I checked, and the performance seems to be the same, so no improvement from the AWS side of things. We are switching to a Redis Cluster; hopefully it will distribute the queues across the different nodes.
We don't use delayed/rate-limited jobs.

@manast
Contributor

manast commented Feb 21, 2023

I really want to help you understand what is going on. If you use a cluster you have to consider this: https://docs.bullmq.io/bull/patterns/redis-cluster
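
The short version of that doc: wrap the queue prefix in curly braces so all of a queue's keys hash to the same cluster slot (a sketch; the cluster endpoint is a placeholder, and any workers must use the same prefix as the queue):

  // Sketch of the hash-tag pattern from that doc: a {braced} prefix forces all
  // of a queue's keys into one hash slot so the multi-key Lua scripts keep
  // working, while different queues can still land on different shards.
  import { Cluster } from 'ioredis';
  import { Queue } from 'bullmq';

  const connection = new Cluster([{ host: 'my-cluster-endpoint', port: 6379 }]);

  const chromeQueue = new Queue('CHROME', {
    prefix: '{CHROME}', // keys become {CHROME}:CHROME:wait, {CHROME}:CHROME:id, ...
    connection,
  });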

@DanielNetzer
Author

yep, already using that.

@billoneil

billoneil commented Apr 14, 2023

@DanielNetzer it looks like you are using Datadog. Could you also take a look at the Node runtime metrics dashboard? If the Node CPU is very high or you are hitting event-loop delays, then you can't entirely trust the timing of the spans, because they can include time where the system was hanging. That is a problem with the service, not bull/redis, but the spans will look slower.

  1. What do the redis CPU metrics look like?
  2. What do the node CPU metrics, and event loop delays look like?

I noticed that when we enabled bull on one of our services it makes ~6k EVALSHA calls a second in total across 3 nodes (~2k EVALSHAs a second per node), which seems very high for our limited use case, but I'll have to dig deeper before I can share anything there. Although this shouldn't be much load on the Redis instance, I think creating ~6k spans a second is adding a performance hit from the Datadog dd-trace agent. In this service we see much more time spent in the dd-trace agent than in our other services when inspecting profiles. After enabling bull with these ~2k Redis requests a second, we saw a ~20% increase in CPU usage across each Node app; I think this could be from the tracing library itself.

I might try configuring the tracer to not create spans for EVALSHA and see if that reduces the CPU usage.
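
If the dd-trace version we run supports a command blocklist on the ioredis plugin, it would look something like this (an untested sketch; worth double-checking against the dd-trace plugin docs):

  // Untested sketch: exclude EVALSHA from ioredis spans, assuming the installed
  // dd-trace version supports a command blocklist on the ioredis plugin.
  import tracer from 'dd-trace';

  tracer.init();
  tracer.use('ioredis', {
    blocklist: ['evalsha'], // commands listed here would not generate spans
  });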

@billoneil

billoneil commented May 10, 2023

Following up on my comment: our issue was with calling removeJob in bull (not bullmq), but it manifested as thousands of trace spans a second from issuing the SCAN command. The dd-trace library accounted for a significant amount of CPU whenever the SCAN triggered. Once we removed the code calling removeJob and no longer created so many Datadog spans, CPU usage in the Node apps dropped drastically. Redis handled all of these requests with no problem.

@manast
Contributor

manast commented May 12, 2023

@billoneil could you share more information regarding your call to removeJob? Even though it is a slow call, it does not require a call to SCAN AFAIK.

@billoneil

billoneil commented Jun 28, 2023

@manast it was the removeJobs command calling SCAN; sorry for the typo. Again, this was on bull, not bullmq.

The relevant point for this thread is that making thousands of calls rapidly may perform well in Redis, but we saw a performance hit on our application side from the APM. That CPU hit made the spans (generated client side) appear slower than the actual requests took (server side on Redis). The issue was the CPU maxing out on the application side, not Redis/bullmq. This is not an issue with bullmq.
