
Possible regression in job activation #4524

Closed
Zelldon opened this issue May 14, 2020 · 10 comments
Labels
area/performance · kind/bug · scope/broker · severity/high · version:8.5.0

Comments


Zelldon commented May 14, 2020

Describe the bug

I observed a big drop in throughput when running our normal benchmark.

(screenshot: general benchmark throughput)

I would normally expect ~200 workflows and tasks to be completed.

It seems that job activation is the problem since the activation latency is quite high.

(screenshot: job activation latency)

The standalone gateway also endlessly throws the following timeout errors:

2020-05-14 12:31:23.930 [io.zeebe.gateway.impl.broker.BrokerRequestManager] [gateway-scheduler-zb-actors-0] ERROR io.zeebe.gateway - Error handling gRPC request
io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: Time out between gateway and broker: Request type command-api-1 timed out in 15000 milliseconds
	at io.grpc.Status.asRuntimeException(Status.java:524) ~[grpc-api-1.29.0.jar:1.29.0]
	at io.zeebe.gateway.EndpointManager.convertThrowable(EndpointManager.java:397) ~[zeebe-gateway-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.gateway.EndpointManager.lambda$sendRequest$3(EndpointManager.java:311) ~[zeebe-gateway-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.gateway.impl.broker.BrokerRequestManager.lambda$sendRequest$3(BrokerRequestManager.java:148) ~[zeebe-gateway-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.gateway.impl.broker.BrokerRequestManager.lambda$sendRequestInternal$5(BrokerRequestManager.java:191) ~[zeebe-gateway-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.future.FutureContinuationRunnable.run(FutureContinuationRunnable.java:33) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:76) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:118) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:204) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
Caused by: java.util.concurrent.TimeoutException: Request type command-api-1 timed out in 15000 milliseconds
	at io.atomix.cluster.messaging.impl.AbstractClientConnection$Callback.timeout(AbstractClientConnection.java:163) ~[atomix-cluster-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
	at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
	at java.lang.Thread.run(Unknown Source) ~[?:?]

To Reproduce
Run the Helm chart (v100) with our benchmark.

Expected behavior
Around 200 workflows are completed per second.

Zelldon added the kind/bug, scope/broker, area/performance, severity/high, and Status: Needs Priority labels on May 14, 2020
@npepinpe

Do we have any idea which recent changes could have affected this?


Zelldon commented May 14, 2020

Unfortunately not. I suspected it might be due to the changes in the Helm charts and the configuration; previously the thread count was not taken into account. I tried resetting it to 2 (instead of 4), but nothing changed.

After that I found out that it seems to be due to job activation. Then I thought it might be because we now use a single standalone gateway instead of three embedded ones; maybe that gateway is overloaded? I changed the replica count to three, but that didn't change anything either. By the way, I had to restart the workers so that they use the new gateways.


Zelldon commented May 15, 2020

OK, this problem is related to the StandaloneGateway, which we introduced in the Helm charts.

If I use the embedded gateway, it works as expected.

(screenshot: throughput with embedded gateway)

I think we should investigate this further. It would be awesome if we had tests for the Helm charts.

\cc @npepinpe


Zelldon commented May 15, 2020

It seems this issue only exists for job activation. In the last weeks I normally just started simple workflows (without a task) or with just a timer, which is the reason why I haven't seen this earlier.

If we use the simpleStarter and timers with a standalone gateway, the throughput looks like this.

(screenshot: throughput with standalone gateway and timers)

With embedded it looks like this:

(screenshot: throughput with embedded gateway and timers)

@DougAtPeddle

In what version of Zeebe is this observed? We are seeing very long execution durations (7 - 10 sec) on a single workflow on v0.22.

@npepinpe

I did a bit of digging, and it turns out the default number of management threads for the standalone gateway is 1, which makes the gateway nearly single-threaded: it accepts connections on many threads, but has a single thread to communicate with all brokers, which becomes a bottleneck. Setting ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS to 4, for example, gives much better performance.
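For illustration, here is a minimal sketch of how that variable could be set on a standalone gateway container in a Kubernetes manifest; the container name and image tag are assumptions, only ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS itself comes from the comment above.

# Sketch of a standalone gateway container spec; name and image tag are illustrative.
containers:
  - name: zeebe-gateway           # assumed container name
    image: camunda/zeebe:0.24.0   # assumed image tag
    env:
      # Give the gateway more management threads so that a single thread does
      # not become the bottleneck for communication with all brokers.
      - name: ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS
        value: "4"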

Please test and let me know how it works out.


Zelldon commented May 18, 2020

Setting a value of ZEEBE_GATEWAY_THREADS_MANAGEMENTTHREADS to 4, for example, gives much better performance.

How did you do that? With the Helm charts I can't set that for the gateway, right?

In what version of Zeebe is this observed? We are seeing very long execution durations (7 - 10 sec) on a single workflow on v0.22.

@DougAtPeddle this was encountered on the latest snapshot with the latest Helm charts (0.100).


Zelldon commented May 20, 2020

Even with a higher thread count (I now use 4 management threads), I still see this bad performance 👀

(screenshot: processing throughput)

@npepinpe

I was getting ~140 WFI/s as usual - let's dig into it together.


npepinpe commented Jun 3, 2020

This was mostly due to a bug in the Helm charts: if you set the gateway's cluster host (e.g. ZEEBE_GATEWAY_CLUSTER_HOST) to 0.0.0.0, it advertises its address to the other nodes as 0.0.0.0. This broke long polling: the gateways would block, waiting for the brokers to notify them of new jobs, but the brokers would attempt to do so via 0.0.0.0, which resolved to themselves...

This is fixed in the latest Helm chart version, but for those who run into this and are not using the Helm chart: make sure your standalone gateway's cluster host is set to an external address that the brokers can reach, e.g. the pod's IP address.
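As a minimal sketch for Kubernetes deployments (the container name is illustrative; ZEEBE_GATEWAY_CLUSTER_HOST is the variable named above), the cluster host could be pointed at the pod's own IP via the downward API:

# Sketch: advertise the pod's routable IP instead of 0.0.0.0 so that brokers
# can reach the gateway with long-polling notifications.
containers:
  - name: zeebe-gateway            # assumed container name
    env:
      - name: ZEEBE_GATEWAY_CLUSTER_HOST
        valueFrom:
          fieldRef:
            fieldPath: status.podIP   # the pod's own, externally reachable IP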

npepinpe closed this as completed on Jun 3, 2020
github-merge-queue bot pushed a commit that referenced this issue Mar 14, 2024
* create new index `post-importer-queue`
* store last processed post import position in `import-position` index
  (new version)
* we don't use the `pendingIncident` flag in the `list-view` index anymore; instead
  we iterate over `post-importer-queue` records
* we avoid updating incident entities from different threads:
  * the importer only inserts data
  * the post importer updates them
* implement migration to fill in `post-importer-queue`
* unignore flaky tests related to incidents

closes #4524

* fix(post-importer): fix migration...

...to not be applied on every restart

* fix(post-importer): fix migration step equals method
oleschoenburg added the version:8.5.0 label on Apr 4, 2024