---
layout: posts
title: ""
date: 2024-01-19
categories:
- performance
- bpmn
tags:
- availability
authors: npepinpe
---

# Reducing the job activation delay

With the addition of end-to-end job streaming capabilities in Zeebe, we wanted to measure the improvements in job activation latency:

- How much is a single job activation latency reduced?
- How much is the activation latency reduced between each task of the same process instance?
- How much is the activation latency reduced on large clusters with a high broker and partition count?

Additionally, we wanted to guarantee that every component involved in streaming, including clients, would remain resilient in the face of load surges.

**TL;DR:** Job activation latency is greatly reduced, with task-based workloads seeing a 30-50% reduction in overall execution latency. Completing a task now immediately triggers pushing out the next one, meaning the latency to activate the next task in a sequence is bounded by how much time it takes to process its completion in Zeebe. Activation latency is unaffected by how many partitions or brokers there are in a cluster, as opposed to job polling, thus ensuring the scalability of the system. Finally, reusing gRPC's flow control mechanism ensures clients cannot be overloaded even in the face of load surges, without impacting other workloads in the cluster.

<!--truncate-->

## Why job activation latency matters

Jobs are one of the fundamental building blocks of Zeebe, primarily representing all task types (e.g. service, send, user tasks), as well as some less obvious BPMN symbols (e.g. intermediate message throw events). In essence, they represent the actual unit of work in a process, the part users will implement, i.e. the actual application code. To reduce the likelihood of a job being worked on by multiple clients at the same time, it first goes through an activation process, where it is soft-locked for a specific amount of time. Soft-locked here means anyone can still interact with it - they can complete the job, fail it, etc. Only the activation is locked out, meaning no one else can activate the job until the lock times out.

This means that most workloads consist primarily of job interactions: creation, activation, completion, etc. As such, it's critical to ensure clients receive jobs as fast as possible in order to make progress.
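
To make this concrete, here is a minimal sketch of that interaction cycle using the Zeebe Java client; the gateway address, job type, and timeout below are placeholder values:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public final class JobLifecycleExample {

  public static void main(final String[] args) throws InterruptedException {
    // Assumption: a local gateway at localhost:26500 without TLS, and a process
    // with service tasks of type "payment" -- all placeholder values.
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500")
            .usePlaintext()
            .build()) {

      // The worker activates jobs of the given type; each activation soft-locks
      // the job for the configured timeout, during which no other worker can
      // activate it -- but it can still be completed or failed.
      client
          .newWorker()
          .jobType("payment")
          .handler(
              (jobClient, job) -> {
                try {
                  // the actual application code would run here
                  jobClient.newCompleteCommand(job.getKey()).send().join();
                } catch (final Exception e) {
                  // failing the job makes it activatable again while retries remain
                  jobClient
                      .newFailCommand(job.getKey())
                      .retries(job.getRetries() - 1)
                      .send()
                      .join();
                }
              })
          .timeout(Duration.ofSeconds(30))
          .open();

      // keep the worker running; a real application would tie this to its lifecycle
      Thread.sleep(60_000);
    }
  }
}
```

The `timeout` here is exactly the soft lock described above: until it expires, or the job is completed or failed, no other worker can activate the same job.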


## Polling: a first implementation

Back in 2018, Zeebe introduced the `ActivateJobs` RPC for its gRPC clients, analogous to fetching and locking [external tasks in Camunda 7.x](https://docs.camunda.org/manual/7.20/user-guide/process-engine/external-tasks/). This endpoint allowed clients to fetch and activate a specific number of available jobs. In other words, it allowed them to _poll_ for jobs.
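
With the Java client, a single `ActivateJobs` request looks roughly like the following sketch; the gateway address, job type, batch size, and timeout are placeholder values:

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ActivateJobsResponse;
import io.camunda.zeebe.client.api.response.ActivatedJob;
import java.time.Duration;

public final class PollingExample {

  public static void main(final String[] args) {
    // Assumption: a local gateway at localhost:26500 without TLS.
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500")
            .usePlaintext()
            .build()) {

      // A single ActivateJobs request: fetch and soft-lock up to 32 jobs of
      // type "payment" for 30 seconds each.
      final ActivateJobsResponse response =
          client
              .newActivateJobsCommand()
              .jobType("payment")
              .maxJobsToActivate(32)
              .timeout(Duration.ofSeconds(30))
              .send()
              .join();

      for (final ActivatedJob job : response.getJobs()) {
        // ... run the application code for each job, then report the result
        client.newCompleteCommand(job.getKey()).send().join();
      }
    }
  }
}
```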

This was the first implementation for activating and working on jobs in Zeebe, for multiple reasons:

- It follows a simple request/response pattern
- Flow control is delegated to the client/user
- Most other approaches build on the same building blocks used by polling
- You will likely implement polling anyway as a fallback for other approaches (e.g. pushing)

Grossly simplified, the implementation worked like this:

![Job polling](./job-poll.png)

- A client initiates an `ActivateJobs` call by sending an initial request
- The gateway receives the request and validates it
- The gateway starts polling each partition synchronously one by one
- Whenever jobs are received from a partition, it forwards them to the client
- When all partitions are exhausted, or the maximum number of jobs has been activated, the request is closed
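
To illustrate the flow above, here is a deliberately simplified, hypothetical sketch of that gateway loop in Java; `Topology`, `BrokerClient`, and `Job` are stand-in types for illustration, not the actual gateway classes:

```java
import java.util.ArrayList;
import java.util.List;

// A deliberately simplified, hypothetical sketch of the gateway-side polling
// loop -- not the actual gateway code.
final class PollingGatewaySketch {

  interface Job {}

  interface Topology {
    List<Integer> partitionIds();
  }

  interface BrokerClient {
    // a full gateway-to-broker round trip against a single partition
    List<Job> activateJobs(int partitionId, String jobType, int maxJobs);
  }

  private final Topology topology;
  private final BrokerClient brokerClient;

  PollingGatewaySketch(final Topology topology, final BrokerClient brokerClient) {
    this.topology = topology;
    this.brokerClient = brokerClient;
  }

  List<Job> activateJobs(final String jobType, final int maxJobsToActivate) {
    final List<Job> activated = new ArrayList<>();
    int remaining = maxJobsToActivate;

    // poll each partition synchronously, one after the other, until enough jobs
    // were activated or every partition was asked once
    for (final int partitionId : topology.partitionIds()) {
      if (remaining <= 0) {
        break;
      }

      // each iteration is a full gateway-to-broker round trip; in the real
      // gateway, jobs are forwarded to the client as soon as they come in
      final List<Job> jobs = brokerClient.activateJobs(partitionId, jobType, remaining);
      activated.addAll(jobs);
      remaining -= jobs.size();
    }

    return activated;
  }
}
```

The important part is the synchronous, partition-by-partition loop: every iteration is a full gateway-to-broker round trip, which is exactly where the latency discussed below comes from.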


Already we can infer certain performance bottlenecks from this design:

- Every request - whether client to gateway, or gateway to broker - adds delay to the activation latency
- In the worst case scenario, we have to poll _every_ partition.
- The gateway does not know in advance which partitions have jobs available.
- Scaling out your clients may have adverse effects, as they send out more requests which all have to be processed independently

So if we have, say, 30 partitions, and each gateway-to-broker request takes 100ms, fetching the jobs on the last partition will take up to 3 seconds, even though the actual activation time on that partition was only 100ms.

Furthermore, if we have a sequence of tasks, fetching the next task in the sequence requires, in the worst case scenario, another complete round of polling through all the partitions, even though the task may already be available.

One would think a workaround to this issue would simply be to poll more often, but this can have an adverse impact: each polling request has to be processed by the brokers, and sending too many will flood them and slow down all processing, further compounding the problem.

### Long polling: a second implementation

To simplify things, the Zeebe team introduced [long polling in 2019](https://github.com/camunda/zeebe/issues/2825). [Long polling](https://en.wikipedia.org/wiki/Push_technology#Long_polling) is a fairly common technique to emulate a push or streaming approach while maintaining the request-response pattern of polling. Essentially, if the server has nothing to send to the client, instead of completing the request it holds it until content is available or a timeout is reached.

In Zeebe, this means that if we did not reach the maximum number of jobs to activate after polling all partitions, the request is parked but not closed. Eventually, when jobs become available, the brokers make this known to the gateways, which then unpark the oldest request and start a new polling round.

![Job long polling](./job-long-poll.png)
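
A hypothetical sketch of that gateway-side bookkeeping, with `PendingRequest` standing in for a parked `ActivateJobs` request (again, not the actual implementation):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// A hypothetical sketch of the long-polling bookkeeping on the gateway -- not
// the actual implementation.
final class LongPollingSketch {

  interface PendingRequest {
    boolean isTimedOut();

    // starts another full polling round over all partitions for this request
    void pollAllPartitions();
  }

  // parked requests, oldest first, grouped by job type
  private final Map<String, Queue<PendingRequest>> parked = new HashMap<>();

  // A polling round finished without activating the requested number of jobs:
  // instead of completing the request empty-handed, park it.
  void park(final String jobType, final PendingRequest request) {
    parked.computeIfAbsent(jobType, ignored -> new ArrayDeque<>()).add(request);
  }

  // A broker notified the gateway that jobs of this type are now available on
  // one of its partitions.
  void onJobsAvailable(final String jobType) {
    final Queue<PendingRequest> queue = parked.get(jobType);
    if (queue == null) {
      return;
    }

    // the notification does not say which partition has jobs, so the gateway
    // must start a whole new polling round for the unparked request
    final PendingRequest oldest = queue.poll();
    if (oldest != null && !oldest.isTimedOut()) {
      oldest.pollAllPartitions();
    }
  }
}
```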

This solved certain problems:

- We reduced the number of requests sent by clients, thus reducing load on the cluster.
- In some cases, we reduced the latency when activating the next task in sequence.

However, there are still some issues:

- When receiving the notification we _still_ have to poll all partitions.
- If you have multiple gateways, all of them will start polling if they have parked requests. Some of them may not get any jobs, but they will still have sent requests to the brokers, all of which have to be processed.
- In high load cases, you still need another client request/poll cycle to fetch the next task in a sequence.
- Scaling out your clients still adds more load on the system, even if they poll less often

## Job push: third time's the charm



### Implementation

### Usage

### Tests, results, and comparisons

