
Probe components health when partition is probed #4764

Merged 3 commits into develop from 4723-failed-stream-processing on Jun 22, 2020

Conversation

korthout (Member)

Description

When the BrokerHealthCheckService probes the ZeebePartitions for their
health, the ZeebePartition only reported its own health status. However,
we cannot be sure that this status is updated when one of its components
fails. To make sure the health check is correct, it is now set up to be
bi-directional: when probed, the partition also probes its own
critical components.

Lastly, when this results in an UNHEALTHY state, the metrics should be
updated. So we call updateHealthStatus when the partition's own state is
HEALTHY but its critical components are not.
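
A minimal sketch of the bi-directional probe described above, assuming simplified type and method names (HealthMonitorable, getHealthStatus, updateHealthStatus); this is illustrative only, not the actual ZeebePartition code:

```java
// Illustrative only: simplified stand-ins for ZeebePartition and its
// CriticalComponentsHealthMonitor, not the real Zeebe classes.
enum HealthStatus { HEALTHY, UNHEALTHY }

interface HealthMonitorable {
  HealthStatus getHealthStatus();
}

final class PartitionProbeSketch implements HealthMonitorable {

  private final HealthMonitorable criticalComponentsMonitor;
  private HealthStatus ownStatus = HealthStatus.HEALTHY;

  PartitionProbeSketch(final HealthMonitorable criticalComponentsMonitor) {
    this.criticalComponentsMonitor = criticalComponentsMonitor;
  }

  @Override
  public HealthStatus getHealthStatus() {
    // Before: only ownStatus was returned. Now the probe also asks the
    // partition's critical components for their health.
    final HealthStatus componentsStatus = criticalComponentsMonitor.getHealthStatus();
    if (ownStatus == HealthStatus.HEALTHY && componentsStatus == HealthStatus.UNHEALTHY) {
      // Keep the metrics in sync when a component has degraded before the
      // partition itself noticed.
      updateHealthStatus(HealthStatus.UNHEALTHY);
    }
    return ownStatus;
  }

  private void updateHealthStatus(final HealthStatus status) {
    ownStatus = status;
    // The real implementation would also update the health metrics here.
  }
}
```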

Related issues

closes #4723

korthout marked this pull request as ready for review on June 18, 2020 14:44
korthout force-pushed the 4723-failed-stream-processing branch from e79a87c to a22838e on June 18, 2020 15:45
korthout (Member, Author) commented Jun 18, 2020

I've also looked into why the ZeebePartition was unaware of the StreamProcessor failing, even though the StreamProcessor is supposed to inform the ZeebePartition's CriticalComponentsHealthMonitor when it fails. The reason is that the CriticalComponentsHealthMonitor is not yet registered as the failure listener of the StreamProcessor when the failure occurs.

It's visible in the following logging (which I've produced with some additional log statements):

2020-06-18 18:16:28.598 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] INFO  io.zeebe.logstreams - [StreamProcessor]: Scheduling task to add failure listener (ComponentFailureListener) to StreamProcessor
2020-06-18 18:16:28.599 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-1] INFO  io.zeebe.logstreams - [StreamProcessor]: Is supposed to inform the CriticalComponentsHealthMonitor of ZeebePartition here, by calling failureListener.onFailure
2020-06-18 18:16:28.599 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-1] WARN  io.zeebe.logstreams - [StreamProcessor]: Unable to inform, failure listener is null...
2020-06-18 18:16:28.599 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-1] ERROR io.zeebe.logstreams - [StreamProcessor]: Actor Broker-0-StreamProcessor-1 failed in phase FAILED.
java.lang.IllegalStateException: This is the injected failure for testing purposes
	at io.zeebe.engine.processor.ReProcessingStateMachine.startRecover(ReProcessingStateMachine.java:135) ~[zeebe-workflow-engine-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.engine.processor.StreamProcessor.onActorStarted(StreamProcessor.java:116) ~[zeebe-workflow-engine-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) ~[zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:118) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:204) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
2020-06-18 18:16:28.605 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-1] WARN  io.zeebe.logstreams - [StreamProcessor]: Trying to handle actor task failure: Unable to call failureListener.onFailure() because failureListener is null
2020-06-18 18:16:28.606 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 partitions succeeded. Started 1 steps in 901 ms.
2020-06-18 18:16:28.606 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 succeeded. Started 11 steps in 2159 ms.

Here we see there are 2 attempts to call failureListener.onFailure(). At both moments the failureListener is null, even though the task to add the failure listener has already been scheduled. That task is never executed because the actor is closed as soon as a failure occurs. If the actor had executed the task to add the failure listener, we would have seen a log entry stating: [StreamProcessor]: Failure listener (ComponentFailureListener) added to streamprocessor

In normal situations, where the StreamProcessor is able to start up successfully, it adds the failure listener and is then able to inform the monitor. In situations where it is unable to start up, it can't, and we need to rely on the monitor to check in on the StreamProcessor's health using its periodic check.
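
A hedged sketch of that race, with a hypothetical single-threaded job queue standing in for the actor scheduler (not Zeebe's actual actor implementation): the registration job is only queued, so when the failure happens first, the listener is still null and onFailure() is never delivered.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative only: a hypothetical stand-in for the StreamProcessor actor.
final class ListenerRegistrationRaceSketch {

  interface FailureListener {
    void onFailure();
  }

  private final Queue<Runnable> actorJobs = new ArrayDeque<>();
  private FailureListener failureListener; // still null while startup is in progress
  private boolean closed;

  void addFailureListenerAsync(final FailureListener listener) {
    // Mirrors "Scheduling task to add failure listener": the job is queued,
    // not executed yet.
    actorJobs.add(() -> failureListener = listener);
  }

  void onActorStarted() {
    // Startup fails before the queued registration job has had a chance to run.
    handleFailure(new IllegalStateException("injected failure"));
  }

  private void handleFailure(final Throwable error) {
    if (failureListener != null) {
      failureListener.onFailure();
    }
    // else: "Unable to inform, failure listener is null..."
    closed = true; // the actor closes; pending jobs, including the registration, never run
  }
}
```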

korthout (Member, Author)

@deepthidevaki This change seems to make the integration test BrokerMonitoringEndpointTest.shouldGetHealthStatus unstable, but I'm unable to reproduce it locally. I need some help with this part.

korthout (Member, Author)

@deepthidevaki After our work this morning on the flaky test, we've combined the onFailure and handleFailure methods of StreamProcessor and moved the initial healthCheckTick in the StreamProcessor to before completing the openFuture.
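
Roughly, the reordering looks like the sketch below; the surrounding types are invented for illustration, and only the ordering of healthCheckTick() relative to completing the open future, plus the single failure-handling path, follow the change described above.

```java
import java.util.concurrent.CompletableFuture;

// Illustrative only: not the real StreamProcessor, just the ordering change.
final class StartupOrderingSketch {

  private final CompletableFuture<Void> openFuture = new CompletableFuture<>();

  void onActorStarted() {
    try {
      recover(); // may throw, e.g. the injected failure from the original issue
      healthCheckTick(); // runs before openFuture completes, so health monitoring
                         // starts with an initialized check state
      openFuture.complete(null);
    } catch (final Exception e) {
      // onFailure and handleFailure are combined into a single path, so the
      // failure is handled once, in the phase where it actually occurred.
      handleFailure(e);
    }
  }

  private void recover() {}

  private void healthCheckTick() {}

  private void handleFailure(final Exception error) {
    openFuture.completeExceptionally(error);
  }
}
```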

Injecting the failure (like in the original issue) now results in the following logging:

2020-06-19 12:01:16.666 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-1] INFO  io.zeebe.logstreams - Checking health of StreamProcessor
2020-06-19 12:01:16.666 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-1] INFO  io.zeebe.logstreams - everything healthy for the stream processor
2020-06-19 12:01:16.666 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-1] INFO  io.zeebe.broker.system - [CriticalComponentsHealthMonitor]: Calculated health: {Broker-0-StreamProcessor-1=HEALTHY, logStream=HEALTHY}
2020-06-19 12:01:16.663 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-0] ERROR io.zeebe.logstreams - Actor Broker-0-StreamProcessor-1 failed in phase STARTED.
java.lang.IllegalStateException: Muahahaha
	at io.zeebe.engine.processor.StreamProcessor.onActorStarted(StreamProcessor.java:108) ~[zeebe-workflow-engine-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:118) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:204) [zeebe-util-0.24.0-SNAPSHOT.jar:0.24.0-SNAPSHOT]
2020-06-19 12:01:16.673 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 partitions succeeded. Started 1 steps in 892 ms.
2020-06-19 12:01:16.673 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 succeeded. Started 11 steps in 2072 ms.
2020-06-19 12:01:16.673 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] INFO  io.zeebe.broker.system - [CriticalComponentsHealthMonitor]: Calculated health: {Raft-1=HEALTHY, Broker-0-StreamProcessor-1=HEALTHY, logStream=HEALTHY}
2020-06-19 12:01:16.673 [Broker-0-HealthCheckService] [Broker-0-zb-actors-1] INFO  io.zeebe.broker.system - [CriticalComponentsHealthMonitor]: Calculated health: {Partition-1=HEALTHY}
2020-06-19 12:02:15.781 [Broker-0-HealthCheckService] [Broker-0-zb-actors-0] INFO  io.zeebe.broker.system - [CriticalComponentsHealthMonitor]: Calculated health: {Partition-1=HEALTHY}
2020-06-19 12:02:16.692 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] INFO  io.zeebe.logstreams - Checking health of StreamProcessor
2020-06-19 12:02:16.692 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] ERROR io.zeebe.logstreams - Actor closed!
2020-06-19 12:02:16.692 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] INFO  io.zeebe.broker.system - [CriticalComponentsHealthMonitor]: Calculated health: {Raft-1=HEALTHY, Broker-0-StreamProcessor-1=UNHEALTHY, logStream=HEALTHY}
2020-06-19 12:02:16.694 [Broker-0-HealthCheckService] [Broker-0-zb-actors-0] ERROR io.zeebe.broker.system - Partition-1 failed, marking it as unhealthy
2020-06-19 12:02:16.694 [Broker-0-HealthCheckService] [Broker-0-zb-actors-0] INFO  io.zeebe.broker.system - [CriticalComponentsHealthMonitor]: Calculated health: {Partition-1=UNHEALTHY}

Here we can see that, after failing, the actor is closed, but because the failure listener is still null, the health monitor watching the stream processor is not informed that the stream processor has become unhealthy. After 1 minute the CriticalComponentsHealthMonitor of the ZeebePartition probes its components, among them the StreamProcessor, and discovers that the actor is closed and has thus become unhealthy.

I would like to consider whether we can improve this. Should we perhaps inform the failure listener of the current status, as soon as it registers?
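
One possible shape for that idea, sketched with hypothetical names (not an existing Zeebe API): the component remembers its current status and replays it to a failure listener that registers late.

```java
// Illustrative only: hypothetical names, not an existing Zeebe API.
final class EagerNotificationSketch {

  interface FailureListener {
    void onFailure();
  }

  private volatile boolean failed;
  private volatile FailureListener failureListener;

  void addFailureListener(final FailureListener listener) {
    failureListener = listener;
    if (failed) {
      // Inform the listener of the current status as soon as it registers,
      // instead of waiting for the next periodic health check.
      listener.onFailure();
    }
  }

  void onFailure() {
    failed = true;
    final FailureListener listener = failureListener;
    if (listener != null) {
      listener.onFailure();
    }
  }
}
```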

The health-check tick should be performed before the open future is
completed, to make sure it is initialized when the health monitoring
starts.

This commit also changes how failures are handled. Before this commit
they were handled internally in the class and rethrown. The unhandled
exception was then handled by the Actor, which in turn called the
handleFailure method. This led to an error message saying the actor
failed in phase FAILED, while it actually failed in phase STARTED; the
actor was then failed, after which the failure was handled again.
korthout (Member, Author) commented Jun 19, 2020

@deepthidevaki I've rerun the pipeline about 10 times. Only 1 build was flaky, and that was due to another unstable test (already described in #4430). I think this integration test is no longer flaky. Please have another look at the changes :)

deepthidevaki (Contributor) left a comment


👍 Thanks for the documentation.

korthout (Member, Author)

bors r+

zeebe-bors bot added a commit that referenced this pull request Jun 22, 2020
4764: Probe components health when partition is probed r=korthout a=korthout

zeebe-bors bot (Contributor) commented Jun 22, 2020

Build failed

korthout (Member, Author)

bors retry

zeebe-bors bot added a commit that referenced this pull request Jun 22, 2020
4764: Probe components health when partition is probed r=korthout a=korthout

zeebe-bors bot (Contributor) commented Jun 22, 2020

Build failed

korthout (Member, Author)

bors retry

zeebe-bors bot (Contributor) commented Jun 22, 2020

Build succeeded

zeebe-bors bot merged commit 7bd8de6 into develop on Jun 22, 2020
zeebe-bors bot deleted the 4723-failed-stream-processing branch on June 22, 2020 11:40
zeebe-bors bot added commits that referenced this pull request on Jun 24 and Jun 25, 2020
4796: [BACKPORT] Probe components health when partition is probed r=korthout a=korthout

backport of #4764

Co-authored-by: Nico korthout <nico.korthout@camunda.com>
korthout added a commit that referenced this pull request Jun 25, 2020
…backport-0.23

[BACKPORT] Probe components health when partition is probed

backport of #4764
Development

Successfully merging this pull request may close these issues.

No fail over when stream process fails
3 participants