
Probe components health when partition is probed #4764

Merged 3 commits into develop from 4723-failed-stream-processing on Jun 22, 2020

Conversation

korthout (Member)

Description

When the BrokerHealthCheckService probes the ZeebePartitions for their
health, the ZeebePartition only reported its own health status. However,
we cannot be sure that this status is updated when one of its components
fails. To make sure the health check is correct, it is now set up to be
bi-directional: when probed, the partition also probes its own
critical components.

Lastly, when this results in an UNHEALTHY state, the metrics should be
updated. So we call updateHealthStatus when the partition's own state is
HEALTHY but its critical components are not.
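
A minimal sketch of the bi-directional probe described above, assuming simplified type and method names (HealthMonitorable, getHealthStatus, updateHealthStatus); this is illustrative only, not the actual ZeebePartition code:

```java
// Illustrative only: simplified stand-ins for ZeebePartition and its
// CriticalComponentsHealthMonitor, not the real Zeebe classes.
enum HealthStatus { HEALTHY, UNHEALTHY }

interface HealthMonitorable {
  HealthStatus getHealthStatus();
}

final class PartitionProbeSketch implements HealthMonitorable {

  private final HealthMonitorable criticalComponentsMonitor;
  private HealthStatus ownStatus = HealthStatus.HEALTHY;

  PartitionProbeSketch(final HealthMonitorable criticalComponentsMonitor) {
    this.criticalComponentsMonitor = criticalComponentsMonitor;
  }

  @Override
  public HealthStatus getHealthStatus() {
    // Before: only ownStatus was returned. Now the probe also asks the
    // partition's critical components for their health.
    final HealthStatus componentsStatus = criticalComponentsMonitor.getHealthStatus();
    if (ownStatus == HealthStatus.HEALTHY && componentsStatus == HealthStatus.UNHEALTHY) {
      // Keep the metrics in sync when a component has degraded before the
      // partition itself noticed.
      updateHealthStatus(HealthStatus.UNHEALTHY);
    }
    return ownStatus;
  }

  private void updateHealthStatus(final HealthStatus status) {
    ownStatus = status;
    // The real implementation would also update the health metrics here.
  }
}
```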

Related issues

closes #4723

korthout marked this pull request as ready for review on June 18, 2020 14:44
korthout force-pushed the 4723-failed-stream-processing branch from e79a87c to a22838e on June 18, 2020 15:45
korthout (Member, Author) commented Jun 18, 2020

I've also looked into why the ZeebePartition was unaware of the StreamProcessor failing, even though the StreamProcessor is supposed to inform the ZeebePartition's CriticalComponentsHealthMonitor when it fails. The reason is that the CriticalComponentsHealthMonitor is not yet registered as the failure listener of the StreamProcessor when the failure occurs.

It's visible in the following logging (which I've produced with some additional log statements):

2020-06-18 18:16:28.598 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] INFO  io.zeebe.logstreams - [StreamProcessor]: Scheduling task to add failure listener (ComponentFailureListener) to StreamProcessor
2020-06-18 18:16:28.599 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-1] INFO  io.zeebe.logstreams - [StreamProcessor]: Is supposed to inform the CriticalComponentsHealthMonitor of ZeebePartition here, by calling failureListener.onFailure
2020-06-18 18:16:28.599 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-1] WARN  io.zeebe.logstreams - [StreamProcessor]: Unable to inform, failure listener is null...
2020-06-18 18:16:28.599 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-1] ERROR io.zeebe.logstreams - [StreamProcessor]: Actor Broker-0-StreamProcessor-1 failed in phase FAILED.
java.lang.IllegalStateException: This is the injected failure for testing purposes
	at io.zeebe.engine.processor.ReProcessingStateMachine.startRecover(ReProcessingStateMachine.java:135) ~[zeebe-workflow-engine-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.engine.processor.StreamProcessor.onActorStarted(StreamProcessor.java:116) ~[zeebe-workflow-engine-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) ~[zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:118) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:204) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
2020-06-18 18:16:28.605 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-1] WARN  io.zeebe.logstreams - [StreamProcessor]: Trying to handle actor task failure: Unable to call failureListener.onFailure() because failureListener is null
2020-06-18 18:16:28.606 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 partitions succeeded. Started 1 steps in 901 ms.
2020-06-18 18:16:28.606 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 succeeded. Started 11 steps in 2159 ms.

Here we see there are 2 attempts to call failureListener.onFailure(). At both moments the failureListener is null, even though the task to add the failure listener has already been scheduled. That task is never executed because the actor is closed as soon as a failure occurs. If the actor had executed the task to add the failure listener, we would have seen a log entry stating: [StreamProcessor]: Failure listener (ComponentFailureListener) added to streamprocessor

In normal situations, where the StreamProcessor is able to start up successfully, it adds the failure listener and is then able to inform the monitor. In situations where it is unable to start up, it can't, and we need to rely on the monitor to check in on the StreamProcessor's health using its periodic check.
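
A hedged sketch of that race, with a hypothetical single-threaded job queue standing in for the actor scheduler (not Zeebe's actual actor implementation): the registration job is only queued, so when the failure happens first, the listener is still null and onFailure() is never delivered.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative only: a hypothetical stand-in for the StreamProcessor actor.
final class ListenerRegistrationRaceSketch {

  interface FailureListener {
    void onFailure();
  }

  private final Queue<Runnable> actorJobs = new ArrayDeque<>();
  private FailureListener failureListener; // still null while startup is in progress
  private boolean closed;

  void addFailureListenerAsync(final FailureListener listener) {
    // Mirrors "Scheduling task to add failure listener": the job is queued,
    // not executed yet.
    actorJobs.add(() -> failureListener = listener);
  }

  void onActorStarted() {
    // Startup fails before the queued registration job has had a chance to run.
    handleFailure(new IllegalStateException("injected failure"));
  }

  private void handleFailure(final Throwable error) {
    if (failureListener != null) {
      failureListener.onFailure();
    }
    // else: "Unable to inform, failure listener is null..."
    closed = true; // the actor closes; pending jobs, including the registration, never run
  }
}
```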

korthout (Member, Author)

@deepthidevaki This change seems to make the integration test BrokerMonitoringEndpointTest.shouldGetHealthStatus unstable, but I'm unable to reproduce it locally. I need some help with this part.

korthout (Member, Author)

@deepthidevaki After our work this morning on the flaky test, we've combined the onFailure and handleFailure methods of StreamProcessor and moved the initial healthCheckTick in the StreamProcessor to before completing the openFuture.
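
Roughly, the reordering looks like the sketch below; the surrounding types are invented for illustration, and only the ordering of healthCheckTick() relative to completing the open future, plus the single failure-handling path, follow the change described above.

```java
import java.util.concurrent.CompletableFuture;

// Illustrative only: not the real StreamProcessor, just the ordering change.
final class StartupOrderingSketch {

  private final CompletableFuture<Void> openFuture = new CompletableFuture<>();

  void onActorStarted() {
    try {
      recover(); // may throw, e.g. the injected failure from the original issue
      healthCheckTick(); // runs before openFuture completes, so health monitoring
                         // starts with an initialized check state
      openFuture.complete(null);
    } catch (final Exception e) {
      // onFailure and handleFailure are combined into a single path, so the
      // failure is handled once, in the phase where it actually occurred.
      handleFailure(e);
    }
  }

  private void recover() {}

  private void healthCheckTick() {}

  private void handleFailure(final Exception error) {
    openFuture.completeExceptionally(error);
  }
}
```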

Injecting the failure (like in the original issue) now results in the following logging:

2020-06-19 12:01:16.666 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-1] INFO  io.zeebe.logstreams - Checking health of StreamProcessor
2020-06-19 12:01:16.666 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-1] INFO  io.zeebe.logstreams - everything healthy for the stream processor
2020-06-19 12:01:16.666 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-1] INFO  io.zeebe.broker.system - [CriticalComponentsHealthMonitor]: Calculated health: {Broker-0-StreamProcessor-1=HEALTHY, logStream=HEALTHY}
2020-06-19 12:01:16.663 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-0] ERROR io.zeebe.logstreams - Actor Broker-0-StreamProcessor-1 failed in phase STARTED.
java.lang.IllegalStateException: Muahahaha
	at io.zeebe.engine.processor.StreamProcessor.onActorStarted(StreamProcessor.java:108) ~[zeebe-workflow-engine-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:118) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:204) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
2020-06-19 12:01:16.673 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 partitions succeeded. Started 1 steps in 892 ms.
2020-06-19 12:01:16.673 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 succeeded. Started 11 steps in 2072 ms.
2020-06-19 12:01:16.673 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] INFO  io.zeebe.broker.system - [CriticalComponentsHealthMonitor]: Calculated health: {Raft-1=HEALTHY, Broker-0-StreamProcessor-1=HEALTHY, logStream=HEALTHY}
2020-06-19 12:01:16.673 [Broker-0-HealthCheckService] [Broker-0-zb-actors-1] INFO  io.zeebe.broker.system - [CriticalComponentsHealthMonitor]: Calculated health: {Partition-1=HEALTHY}
2020-06-19 12:02:15.781 [Broker-0-HealthCheckService] [Broker-0-zb-actors-0] INFO  io.zeebe.broker.system - [CriticalComponentsHealthMonitor]: Calculated health: {Partition-1=HEALTHY}
2020-06-19 12:02:16.692 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] INFO  io.zeebe.logstreams - Checking health of StreamProcessor
2020-06-19 12:02:16.692 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] ERROR io.zeebe.logstreams - Actor closed!
2020-06-19 12:02:16.692 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] INFO  io.zeebe.broker.system - [CriticalComponentsHealthMonitor]: Calculated health: {Raft-1=HEALTHY, Broker-0-StreamProcessor-1=UNHEALTHY, logStream=HEALTHY}
2020-06-19 12:02:16.694 [Broker-0-HealthCheckService] [Broker-0-zb-actors-0] ERROR io.zeebe.broker.system - Partition-1 failed, marking it as unhealthy
2020-06-19 12:02:16.694 [Broker-0-HealthCheckService] [Broker-0-zb-actors-0] INFO  io.zeebe.broker.system - [CriticalComponentsHealthMonitor]: Calculated health: {Partition-1=UNHEALTHY}

Here we can see that, after failing, the actor is closed, but because the failure listener is still null, the health monitor watching the stream processor is not informed that the stream processor has become unhealthy. After 1 minute the CriticalComponentsHealthMonitor of the ZeebePartition probes its components, among them the StreamProcessor, and discovers that the actor is closed and has thus become unhealthy.

I would like to consider whether we can improve this. Should we perhaps inform the failure listener of the current status, as soon as it registers?
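
One possible shape for that idea, sketched with hypothetical names (not an existing Zeebe API): the component remembers its current status and replays it to a failure listener that registers late.

```java
// Illustrative only: hypothetical names, not an existing Zeebe API.
final class EagerNotificationSketch {

  interface FailureListener {
    void onFailure();
  }

  private volatile boolean failed;
  private volatile FailureListener failureListener;

  void addFailureListener(final FailureListener listener) {
    failureListener = listener;
    if (failed) {
      // Inform the listener of the current status as soon as it registers,
      // instead of waiting for the next periodic health check.
      listener.onFailure();
    }
  }

  void onFailure() {
    failed = true;
    final FailureListener listener = failureListener;
    if (listener != null) {
      listener.onFailure();
    }
  }
}
```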

The health-check tick should be performed before the open future is
completed, to make sure it is initialized when the health monitoring
starts.

This commit also changes how failures are handled. Before this commit
they were handled internally in the class and rethrown. The unhandled
exception was then handled by the Actor, which in turn called the
handleFailure method. This led to an error message saying the actor
failed in phase FAILED, while it actually failed in phase STARTED; the
actor was then failed, after which the failure was handled again.
korthout (Member, Author) commented Jun 19, 2020

@deepthidevaki I've rerun the pipeline about 10 times. Only 1 build was flaky, and that was due to another unstable test (already described in #4430). I think this integration test is no longer flaky. Please have another look at the changes :)

deepthidevaki (Contributor) left a comment


👍 Thanks for the documentation.

korthout (Member, Author)

bors r+

zeebe-bors bot added a commit that referenced this pull request Jun 22, 2020
4764: Probe components health when partition is probed r=korthout a=korthout

zeebe-bors bot (Contributor) commented Jun 22, 2020

Build failed

korthout (Member, Author)

bors retry

zeebe-bors bot added a commit that referenced this pull request Jun 22, 2020
4764: Probe components health when partition is probed r=korthout a=korthout

zeebe-bors bot (Contributor) commented Jun 22, 2020

Build failed

korthout (Member, Author)

bors retry

zeebe-bors bot (Contributor) commented Jun 22, 2020

Build succeeded

zeebe-bors bot merged commit 7bd8de6 into develop on Jun 22, 2020
zeebe-bors bot deleted the 4723-failed-stream-processing branch on June 22, 2020 11:40
zeebe-bors bot added commits that referenced this pull request on Jun 24 and Jun 25, 2020
4796: [BACKPORT] Probe components health when partition is probed r=korthout a=korthout

backport of #4764

Co-authored-by: Nico korthout <nico.korthout@camunda.com>
korthout added a commit that referenced this pull request Jun 25, 2020
…backport-0.23

[BACKPORT] Probe components health when partition is probed

backport of #4764
Development

Successfully merging this pull request may close these issues.

No fail over when stream process fails
3 participants