Possible regression in job activation #4524
Do we have any idea of recent changes that could affect this?
Unfortunately not. I expected it might be due to the changes to the Helm charts and the configuration; before, the thread count was not taken into account. I tried resetting it to 2 (instead of 4), but nothing changed. After that I found out that it seems to be due to job activation. Then I thought it might be due to the single standalone gateway instead of three embedded ones; maybe this gateway is overloaded? I changed the replication to three, but that didn't change anything either. Btw: I had to restart the workers so that they use the new gateways.
Ok, this problem is related to the standalone gateway: if I use the embedded GW, then it works as expected. I think we should investigate that further. It would be awesome if we had tests for the Helm charts. /cc @npepinpe
It seems this issue only exists for job activation. In the last weeks I normally just started simple processes (without a task) or with just a timer, which is the reason why I haven't seen this earlier. If we use the simpleStarter and timers with a standalone gateway, the throughput looks like this. With the embedded one it looks like this:
In what version of Zeebe is this observed? We are seeing very long execution durations (7-10 s) for a single workflow on v0.22.
I did a bit of digging, and it turns out the default number of management threads for the standalone gateway is 1, which makes the gateway nearly single-threaded: it accepts connections on many threads but has a single thread to communicate with all brokers, which becomes a bottleneck. Setting a higher value should help. Please test and let me know how it works out.
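For anyone configuring the standalone gateway directly rather than through the Helm chart, a sketch of where that setting lives, assuming the `zeebe.gateway.threads.managementThreads` key used by recent versions (it also maps to the `ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS` environment variable):

```yaml
# Standalone gateway configuration sketch (application.yaml-style layout).
zeebe:
  gateway:
    threads:
      # Default is 1, which serializes all broker communication
      # through a single thread; raise it to remove the bottleneck.
      managementThreads: 4
```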
How did you do that? On the Helm charts I can't do that for the GW, right?
@DougAtPeddle this was encountered on the latest snapshot with the latest Helm charts (0.100).
I was getting ~140 WFI/s as usual; let's dig into it together.
This was mostly due to a bug in the Helm charts: the gateway's cluster host was set to an address the brokers could not reach, so their responses never made it back and requests timed out. This is fixed in the latest Helm chart version, but for those who run into this and are not using the Helm chart, make sure your standalone gateway's cluster host is set to an external address that the brokers can reach, e.g. the pod's IP address.
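For non-Helm Kubernetes deployments, one way to do that is to inject the pod IP via the downward API; a sketch of a container spec snippet, assuming the standard `ZEEBE_GATEWAY_CLUSTER_HOST` environment variable mapping:

```yaml
# Container spec sketch, not the fixed Helm chart itself.
env:
  - name: K8S_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  # Maps to zeebe.gateway.cluster.host; this address must be
  # reachable by the brokers, hence the pod IP rather than localhost.
  - name: ZEEBE_GATEWAY_CLUSTER_HOST
    value: "$(K8S_POD_IP)"
```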
* create new index `post-importer-queue`
* store the last processed post-import position in the `import-position` index (new version)
* we don't use the `pendingIncident` flag in the `list-view` index anymore; instead we iterate over `post-importer-queue` records
* we avoid updating incident entities from different threads:
  * the importer only inserts data
  * the post importer updates
* implement a migration to fill in `post-importer-queue`
* unignore flaky tests related to incidents

closes #4524

* fix(post-importer): fix migration to not be applied on every restart
* fix(post-importer): fix migration step equals method
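Schematically, this splits the writers so that each concern has a single mutating component: the importer only appends to the queue, and the post importer is the only thread that updates incident entities. A toy Java sketch of that hand-off, with all names hypothetical and in-memory collections standing in for the Elasticsearch indices:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Toy model of the importer / post-importer split; all names are
// hypothetical, and in-memory collections stand in for the indices.
class PostImporterQueueSketch {
  record QueueRecord(long incidentKey, String newState) {}

  // Stand-in for the post-importer-queue index (insert-only).
  private final ConcurrentLinkedQueue<QueueRecord> queue = new ConcurrentLinkedQueue<>();
  // Stand-in for incident entities in the list-view index.
  private final Map<Long, String> incidents = new ConcurrentHashMap<>();

  // Importer thread: only inserts data, never updates existing entities.
  void importRecord(long incidentKey, String newState) {
    queue.add(new QueueRecord(incidentKey, newState));
  }

  // Post-importer thread: the single writer that applies the updates,
  // iterating over queue records instead of a pendingIncident flag.
  void drainQueue() {
    QueueRecord r;
    while ((r = queue.poll()) != null) {
      incidents.put(r.incidentKey(), r.newState());
    }
  }
}
```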
Describe the bug
I observed a big drop in throughput when running our normal benchmark.
I would normally expect ~200 workflows and tasks to be completed per second.
It seems that job activation is the problem since the activation latency is quite high.
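For context, the benchmark workers exercise job activation roughly like the following sketch (package and builder names assume the io.zeebe 0.2x Java client; the gateway address and job type are made up):

```java
import io.zeebe.client.ZeebeClient;

// Benchmark-style worker sketch; under this issue, activation latency
// is what dominates and caps the completed-workflow throughput.
public class BenchmarkWorker {
  public static void main(String[] args) throws InterruptedException {
    try (ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .brokerContactPoint("zeebe-gateway:26500") // standalone gateway
            .usePlaintext()
            .build()) {
      // Each activated job is completed immediately.
      client
          .newWorker()
          .jobType("benchmark-task")
          .handler((jobClient, job) ->
              jobClient.newCompleteCommand(job.getKey()).send().join())
          .open();
      Thread.currentThread().join(); // keep the worker alive
    }
  }
}
```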
The standalone gateway also endlessly logs timeout errors.
To Reproduce
Run the Helm chart (0.100) with our benchmark.
Expected behavior
Around 200 workflows are completed per second.