Job push activation latency blog post #481

Merged: 3 commits into main from np-jp-blog on Jan 26, 2024

Conversation

@npepinpe (Member) commented Jan 19, 2024

A blog post highlighting the improvements we've seen with job push. It:

  • Explains what job polling is
  • Outlines the issues with polling as well as long polling
  • Explains the concept behind job push and how it would solve the issues seen with polling
  • Provides three different tests and benchmark results highlighting the improvements

Finally, there's a section at the bottom linking to the previous blog posts and the documentation.

@npepinpe force-pushed the np-jp-blog branch 2 times, most recently from b0223c6 to f85a7fe on January 21, 2024 at 20:08
@npepinpe changed the title from "draft" to "Job push activation latency blog post" on Jan 21, 2024
@npepinpe marked this pull request as ready for review on January 21, 2024 at 20:11
@npepinpe (Member, Author) commented:

This was not really a "chaos" thing though, so this might not be the right blog for this 😄

Adds a blog post to summarize the improvements on activation latency
we've seen with job push.
@Zelldon (Member) commented Jan 22, 2024

Haven't reviewed it yet. But was also wondering whether you maybe want to post it to the camunda blog instead or on medium. :)

@Zelldon (Member) commented Jan 23, 2024

@npepinpe wdyt about submitting this to the camunda blog?

@npepinpe (Member, Author) commented:

Maybe. The idea was to do something like https://zeebe-io.github.io/zeebe-chaos/2023/12/20/Broker-scaling-performance which was also about performance and not really chaos.

@npepinpe (Member, Author) commented:

Regardless, I could use a review from another engineer :D

@Zelldon (Member) commented Jan 24, 2024

@npepinpe will do when I find the time

@npepinpe (Member, Author) commented:

Ole and Deepthi can also help, considering your week so far 😄

@Zelldon (Member) left a comment:

Awesome @npepinpe, thanks for this! 🤩❤️ Great post and a really valuable addition to the blog. I hope you will also post this to the camunda blog. I definitely think it is worth it.


Additionally, we wanted to guarantee that every component involved in streaming, including clients, would remain resilient in the face of load surges.

**TL;DR:** Job activation latency is greatly reduced, with task-based workloads seeing up to 50% lower overall execution latency. Completing a task now immediately triggers pushing out the next one, meaning the latency to activate the next task in a sequence is bounded by how long it takes to process its completion in Zeebe. Activation latency is unaffected by how many partitions or brokers there are in a cluster, as opposed to job polling, thus ensuring scalability of the system. Finally, reuse of gRPC's flow control mechanism ensures clients cannot be overloaded even in the face of load surges, without impacting other workloads in the cluster.

Review comment:

🤩


## Why job activation latency matters

Jobs are one of the fundamental building blocks of Zeebe, primarily representing all tasks (e.g. service, send, user tasks), as well as some less obvious symbols (e.g. the intermediate message throw event). In essence, they represent the actual unit of work in a process, the part users implement, i.e. the actual application code. To reduce the likelihood of a job being worked on by multiple clients at the same time, it first goes through an activation process, where it is soft-locked for a specific amount of time. Soft-locked here means anyone can still interact with it - they can complete the job, fail it, etc. Only the activation is locked out, meaning no one else can activate the job until the activation times out.

Review comment:

👍 I like that you give a short intro first
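
To make the activation semantics above concrete, here is a minimal sketch using the Zeebe Java client; the gateway address, job type, and timeout values are illustrative and not taken from the post. The activation timeout is the soft lock: until it expires, no other client can activate the same job, but the job can still be completed or failed.

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ActivatedJob;
import java.time.Duration;

public final class ActivateAndCompleteSketch {
  public static void main(final String[] args) {
    // Illustrative connection settings; adjust to your cluster.
    try (final ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500")
        .usePlaintext()
        .build()) {

      // Activating soft-locks each returned job for the given timeout:
      // no other worker can activate it until the timeout expires, but the
      // job can still be completed or failed during that window.
      final var response = client.newActivateJobsCommand()
          .jobType("payment-service")      // hypothetical job type
          .maxJobsToActivate(10)
          .timeout(Duration.ofMinutes(5))  // the activation (soft-lock) timeout
          .send()
          .join();

      for (final ActivatedJob job : response.getJobs()) {
        // ... do the actual work here, then either complete or fail the job ...
        client.newCompleteCommand(job.getKey()).send().join();
      }
    }
  }
}
```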


## Polling: a first implementation

Back in 2018, Zeebe introduced the `ActivateJobs` RPC for its gRPC clients, analogous to fetching and locking [external tasks in Camunda 7.x](https://docs.camunda.org/manual/7.20/user-guide/process-engine/external-tasks/). This endpoint allowed clients to fetch and activate a specific number of available jobs. In other words, it allowed them to _poll_ for jobs.

Review comment:

Thanks, I feel old now. I even worked on external tasks in Camunda 7


Back in 2018, Zeebe introduced the `ActivateJobs` RPC for its gRPC clients, analogous to fetching and locking [external tasks in Camunda 7.x](https://docs.camunda.org/manual/7.20/user-guide/process-engine/external-tasks/). This endpoint allowed clients to fetch and activate a specific number of available jobs. In other words, it allowed them to _poll_ for jobs.

This was the first implementation for activating and working on jobs in Zeebe, for multiple reasons:

Review comment:

I'm wondering whether this is true: we had topic subscriptions, but I'm not sure whether we had jobs back then. But I guess this doesn't matter 😄

- Every request - whether client to gateway, or gateway to broker - adds delay to the activation latency.
- In the worst-case scenario, we have to poll _every_ partition.
- The gateway does not know in advance which partitions have jobs available.
- Scaling out your clients may have adverse effects by sending out too many requests, which all have to be processed independently.

Review comment:

👍 that is definitely something many people run into
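
To illustrate where that polling latency comes from, here is a rough sketch (not from the post) of a poll-based worker loop with the Java client; the job type and timings are made up. Every iteration is a full client-to-gateway request, and the gateway in turn has to query broker partitions for available jobs; `requestTimeout` enables long polling, which parks the request until jobs appear or the timeout elapses.

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ActivatedJob;
import java.time.Duration;

public final class PollingWorkerSketch {
  public static void main(final String[] args) {
    try (final ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // illustrative
        .usePlaintext()
        .build()) {

      while (true) {
        // Each poll is a gateway request, and the gateway has to ask the
        // broker partitions for available jobs on our behalf.
        final var jobs = client.newActivateJobsCommand()
            .jobType("payment-service")             // hypothetical job type
            .maxJobsToActivate(32)
            .timeout(Duration.ofMinutes(5))         // activation timeout
            .requestTimeout(Duration.ofSeconds(10)) // long-polling window
            .send()
            .join()
            .getJobs();

        // If nothing became available within the long-polling window, the
        // loop simply issues the next request - this round-tripping is the
        // activation latency the post is about.
        for (final ActivatedJob job : jobs) {
          // ... work ...
          client.newCompleteCommand(job.getKey()).send().join();
        }
      }
    }
  }
}
```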

- Brokers push jobs out immediately as they become available, removing the need for a gateway-to-broker request.
- Since the stream is long-lived, there are almost no client requests required after the initial one.
- No need to poll every partition anymore.
- No thundering-herd issues from many gateways all polling at the same time after being notified that jobs are available.

Review comment:

Not sure whether this is clear. What do you mean by "all polling at the same time"?
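
As a side note for readers who want to try the push-based path described above: a job worker in the Java client can opt into streaming, as in the sketch below. This is my own illustration, not part of the post, and it assumes a client and broker version that already support job streaming (around Zeebe 8.4); the job type and tuning values are hypothetical.

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.worker.JobWorker;
import java.time.Duration;

public final class StreamingWorkerSketch {
  public static void main(final String[] args) {
    final ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // illustrative
        .usePlaintext()
        .build();

    // With streaming enabled, the worker registers a long-lived stream via the
    // gateway and brokers push jobs to it as soon as they become available;
    // occasional polling remains only as a fallback for jobs that were created
    // while no stream was open.
    final JobWorker worker = client.newWorker()
        .jobType("payment-service")      // hypothetical job type
        .handler((jobClient, job) -> {
          // ... work ...
          jobClient.newCompleteCommand(job.getKey()).send();
        })
        .streamEnabled(true)             // opt into job push
        .timeout(Duration.ofMinutes(5))  // activation timeout per pushed job
        .maxJobsActive(32)               // bounds in-flight jobs on the client
        .open();

    // Close the worker and client on shutdown.
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      worker.close();
      client.close();
    }));
  }
}
```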

Two further review threads on chaos-days/blog/2024-01-19-Job-Activation-Latency/index.md were marked as outdated and resolved.
@Zelldon (Member) commented Jan 26, 2024

I will go ahead and merge this; I think it is great.

@Zelldon merged commit 7ab9378 into main on Jan 26, 2024
2 checks passed
@Zelldon deleted the np-jp-blog branch on January 26, 2024 at 20:06