
Possible regression in job activation #4524

Closed
Zelldon opened this issue May 14, 2020 · 10 comments
Labels
area/performance · kind/bug · scope/broker · severity/high · version:8.5.0

Comments


Zelldon commented May 14, 2020

Describe the bug

I observed a big drop in throughput when running our normal benchmark.

(screenshot: general benchmark throughput)

I would normally expect ~200 workflows and tasks to be completed.

It seems that job activation is the problem since the activation latency is quite high.

(screenshot: job activation latency)

The standalone gateway also endlessly throws the following timeout errors:

2020-05-14 12:31:23.930 [io.zeebe.gateway.impl.broker.BrokerRequestManager] [gateway-scheduler-zb-actors-0] ERROR io.zeebe.gateway - Error handling gRPC request
io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: Time out between gateway and broker: Request type command-api-1 timed out in 15000 milliseconds
	at io.grpc.Status.asRuntimeException(Status.java:524) ~[grpc-api-1.29.0.jar:1.29.0]
	at io.zeebe.gateway.EndpointManager.convertThrowable(EndpointManager.java:397) ~[zeebe-gateway-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.gateway.EndpointManager.lambda$sendRequest$3(EndpointManager.java:311) ~[zeebe-gateway-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.gateway.impl.broker.BrokerRequestManager.lambda$sendRequest$3(BrokerRequestManager.java:148) ~[zeebe-gateway-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.gateway.impl.broker.BrokerRequestManager.lambda$sendRequestInternal$5(BrokerRequestManager.java:191) ~[zeebe-gateway-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.future.FutureContinuationRunnable.run(FutureContinuationRunnable.java:33) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:76) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:118) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:204) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
Caused by: java.util.concurrent.TimeoutException: Request type command-api-1 timed out in 15000 milliseconds
	at io.atomix.cluster.messaging.impl.AbstractClientConnection$Callback.timeout(AbstractClientConnection.java:163) ~[atomix-cluster-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
	at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
	at java.lang.Thread.run(Unknown Source) ~[?:?]

To Reproduce
Run the Helm chart (v100) with our benchmark.

Expected behavior
Around 200 workflows are completed per second.

Zelldon added the kind/bug, scope/broker, area/performance, severity/high, and Status: Needs Priority labels on May 14, 2020
@npepinpe

Do we have any idea which recent changes could have affected this?


Zelldon commented May 14, 2020

Unfortunately not. I suspected it might be due to the changes in the Helm charts and the configuration; previously the thread count was not taken into account. I tried resetting it to 2 (instead of 4), but nothing changed.

After that I found out that it seems to be due to job activation. Then I thought it might be because we now use a single standalone gateway instead of three embedded ones; maybe that gateway is overloaded? I changed the replica count to three, but that didn't change anything either. By the way, I had to restart the workers so that they use the new gateways.


Zelldon commented May 15, 2020

OK, this problem is related to the StandaloneGateway, which we introduced in the Helm charts.

If I use the embedded gateway, it works as expected.

(screenshot: throughput with embedded gateway)

I think we should investigate this further. It would be awesome if we had tests for the Helm charts.

\cc @npepinpe


Zelldon commented May 15, 2020

It seems this issue only exists for job activation. In the last weeks I normally just started simple workflows (without a task) or with just a timer, which is the reason why I haven't seen this earlier.

If we use the simpleStarter and timers with a standalone gateway, the throughput looks like this.

(screenshot: throughput with standalone gateway and timers)

With embedded it looks like this:

(screenshot: throughput with embedded gateway and timers)

@DougAtPeddle

In what version of Zeebe is this observed? We are seeing very long execution durations (7 - 10 sec) on a single workflow on v0.22.

@npepinpe

I did a bit of digging, and it turns out the default number of management threads for the standalone gateway is 1, which makes the gateway nearly single-threaded: it accepts connections on many threads, but has a single thread to communicate with all brokers, which becomes a bottleneck. Setting ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS to 4, for example, gives much better performance.
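For illustration, here is a minimal sketch of how that variable could be set on a standalone gateway container in a Kubernetes manifest; the container name and image tag are assumptions, only ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS itself comes from the comment above.

# Sketch of a standalone gateway container spec; name and image tag are illustrative.
containers:
  - name: zeebe-gateway           # assumed container name
    image: camunda/zeebe:0.24.0   # assumed image tag
    env:
      # Give the gateway more management threads so that a single thread does
      # not become the bottleneck for communication with all brokers.
      - name: ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS
        value: "4"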

Please test and let me know how it works out.


Zelldon commented May 18, 2020

Setting a value of ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS to 4, for example, gives much better performance.

How did you do that? With the Helm charts I can't set that for the gateway, right?

In what version of Zeebe is this observed? We are seeing very long execution durations (7 - 10 sec) on a single workflow on v0.22.

@DougAtPeddle this was encountered on the latest snapshot with the latest Helm charts (0.100).


Zelldon commented May 20, 2020

Even with a higher thread count (I now use 4 management threads), I still see this bad performance 👀

(screenshot: processing throughput)

@npepinpe

I was getting ~140 WFI/s as usual - let's dig into it together.


npepinpe commented Jun 3, 2020

This was mostly due to a bug in the Helm charts: if you set the gateway's cluster host (e.g. ZEEBE_GATEWAY_CLUSTER_HOST) to 0.0.0.0, it advertises its address to the other nodes as 0.0.0.0. This broke long polling: the gateways would block, waiting for the brokers to notify them of new jobs, but the brokers would attempt to do so via 0.0.0.0, which resolved to themselves...

This is fixed in the latest Helm chart version, but for those who run into this and are not using the Helm chart: make sure your standalone gateway's cluster host is set to an external address that the brokers can reach, e.g. the pod's IP address.
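As a minimal sketch for Kubernetes deployments (the container name is illustrative; ZEEBE_GATEWAY_CLUSTER_HOST is the variable named above), the cluster host could be pointed at the pod's own IP via the downward API:

# Sketch: advertise the pod's routable IP instead of 0.0.0.0 so that brokers
# can reach the gateway with long-polling notifications.
containers:
  - name: zeebe-gateway            # assumed container name
    env:
      - name: ZEEBE_GATEWAY_CLUSTER_HOST
        valueFrom:
          fieldRef:
            fieldPath: status.podIP   # the pod's own, externally reachable IP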

npepinpe closed this as completed on Jun 3, 2020
github-merge-queue bot pushed a commit that referenced this issue Mar 14, 2024
* create new index `post-importer-queue`
* store last processed post import position in `import-position` index
  (new version)
* we don't use the `pendingIncident` flag in the `list-view` index anymore; instead
  we iterate over `post-importer-queue` records
* we avoid updating incident entities from different threads:
  * the importer only inserts data
  * the post importer updates them
* implement migration to fill in `post-importer-queue`
* unignore flaky tests related to incidents

closes #4524

* fix(post-importer): fix migration...

...to not be applied on every restart

* fix(post-importer): fix migration step equals method
oleschoenburg added the version:8.5.0 label on Apr 4, 2024