Probe components health when partition is probed #4764
When the BrokerHealthCheckService probes the ZeebePartitions for their health, each ZeebePartition reported only its own health status. However, we cannot be sure that this status is updated when one of the partition's components fails. To make sure that the health check is correct, it is set up to be bi-directional: when probed, the partition also probes its own critical components. Lastly, when this results in an UNHEALTHY state, the metrics should be updated. So we call updateHealthStatus when the partition's own state is HEALTHY, but its critical components are not.
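A minimal sketch of the probing behaviour described above, under stated assumptions: `PartitionHealthSketch`, its nested `Status` enum, `registerComponent`, and the `metricUpdates` counter are hypothetical stand-ins for illustration and are not the actual Zeebe classes. Only the name `updateHealthStatus` and the HEALTHY/UNHEALTHY states come from the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical, simplified model of a partition that probes its critical
// components whenever it is probed itself. Not the real ZeebePartition.
final class PartitionHealthSketch {
  enum Status { HEALTHY, UNHEALTHY }

  // Each critical component is modelled as a probe returning its current status.
  private final Map<String, Supplier<Status>> criticalComponents = new LinkedHashMap<>();
  private Status ownStatus = Status.HEALTHY;
  private int metricUpdates = 0; // stand-in for the real health metrics

  void registerComponent(final String name, final Supplier<Status> probe) {
    criticalComponents.put(name, probe);
  }

  // Called by the (hypothetical) broker health check.
  Status getHealthStatus() {
    final Status componentsStatus = probeComponents();
    // Only update when the partition still believes it is healthy,
    // but at least one critical component is not.
    if (ownStatus == Status.HEALTHY && componentsStatus == Status.UNHEALTHY) {
      updateHealthStatus(Status.UNHEALTHY);
    }
    return ownStatus;
  }

  private Status probeComponents() {
    for (final Supplier<Status> probe : criticalComponents.values()) {
      if (probe.get() != Status.HEALTHY) {
        return Status.UNHEALTHY;
      }
    }
    return Status.HEALTHY;
  }

  private void updateHealthStatus(final Status newStatus) {
    ownStatus = newStatus;
    metricUpdates++; // the real code would update the exported metric here
  }

  int metricUpdates() {
    return metricUpdates;
  }
}
```

In this sketch a component that fails silently is still detected at the next probe, and the metric is touched only once, on the transition from HEALTHY to UNHEALTHY.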
I've also looked into why the ZeebePartition was unaware of the StreamProcessor failing, even though the StreamProcessor is supposed to inform the ZeebePartition's CriticalComponentsHealthMonitor when it fails. The reason is that the ZeebePartition's CriticalComponentsHealthMonitor is not yet registered as the failure listener of the StreamProcessor when the failure occurs. This is visible in the following logging (which I've produced with some additional log statements):
Here we see 2 attempts to call failureListener.onFailure(). At both moments the failureListener is null, even though the task to add the failure listener is already scheduled. That task is never executed because the actor is closed as soon as a failure occurs. If the actor had executed the task to add the failure listener, we would have seen a corresponding log entry. In normal situations, where the StreamProcessor was able to start up successfully, it adds the failure listener and is then able to inform the monitor. In situations where it is unable to start up, it can't, and we need to rely on the monitor to check in on the StreamProcessor's health using its periodic check.
@deepthidevaki This change seems to make the integration test |
@deepthidevaki After our work this morning on the flaky test, we've combined the onFailure and handleFailure methods of StreamProcessor and moved the initial healthCheckTick in the StreamProcessor to before completing the openFuture. Injecting the failure (like in the original issue) now results in the following logging:
Here we can see that after failing, the actor is closed, but because the failure listener is still null, the health monitor checking the StreamProcessor is not informed that it has become unhealthy. After 1 minute the CriticalComponentsHealthMonitor of the ZeebePartition probes its components, among them the StreamProcessor, and discovers that the actor is closed and has thus become unhealthy. I would like to consider whether we can improve this. Should we perhaps inform the failure listener of the current status as soon as it registers?
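The idea raised in the question above could look roughly like the following single-threaded sketch. It is an illustration only: `ComponentSketch`, `FailureListener`, `fail`, and `addFailureListener` are hypothetical names, and the real StreamProcessor registers its listener through an actor task, which this sketch does not model. In particular, replaying the status on registration only helps if the registration itself still runs after the failure.

```java
// Hypothetical component that remembers whether it has failed and replays
// that state to a listener that registers late. Not the real StreamProcessor.
final class ComponentSketch {
  interface FailureListener {
    void onFailure();
  }

  private FailureListener failureListener; // null until a listener registers
  private boolean failed;

  // Called when the component fails; the listener may not be registered yet.
  void fail() {
    failed = true;
    if (failureListener != null) {
      failureListener.onFailure();
    }
  }

  // On registration, immediately inform the listener of a failure that
  // happened before it was registered.
  void addFailureListener(final FailureListener listener) {
    failureListener = listener;
    if (failed) {
      listener.onFailure();
    }
  }
}
```

With this shape the monitor would not have to wait for its next periodic probe to learn about a failure that preceded registration.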
The health check tick should be performed before the open future is completed, to make sure it is initialized when the health monitoring starts. This commit also changes how failures are handled. Before this commit they were handled internally in the class and rethrown. The unhandled exception was then handled by the Actor, which in turn calls the handleFailure method. This led to an error message stating that the actor failed in phase FAILED, while it actually failed in phase STARTED; the actor was then marked as failed, after which the failure was handled again.
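The ordering described in this commit message can be sketched as follows. This is an assumption-laden illustration: it uses the JDK's `CompletableFuture` in place of Zeebe's own future type, and `OpenOrderSketch`, `onStarted`, and `isTickInitialized` are hypothetical names. Only `healthCheckTick` and the notion of an open future come from the text.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical component that initializes its health check tick before
// signalling that it has opened. Not the real StreamProcessor.
final class OpenOrderSketch {
  private final CompletableFuture<Void> openFuture = new CompletableFuture<>();
  private volatile long lastTickMillis = -1; // -1 means "never ticked"

  void onStarted() {
    // Tick first, so that anything reacting to the open future
    // already sees an initialized health check.
    healthCheckTick();
    openFuture.complete(null);
  }

  void healthCheckTick() {
    lastTickMillis = System.currentTimeMillis();
  }

  boolean isTickInitialized() {
    return lastTickMillis >= 0;
  }

  CompletableFuture<Void> openFuture() {
    return openFuture;
  }
}
```

A callback attached to `openFuture()` before `onStarted()` runs will observe `isTickInitialized()` as true, which is the guarantee the commit aims for.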
@deepthidevaki I've rerun the pipeline about 10 times. Only 1 build was flaky, and that was because of another unstable test (already described in #4430). I think this integration test is no longer flaky. Please have another look at the changes :)
👍 Thanks for the documentation.
bors r+ |
4764: Probe components health when partition is probed r=korthout a=korthout

## Description

When the BrokerHealthCheckService probes the ZeebePartitions for their health, each ZeebePartition reported only its own health status. However, we cannot be sure that this status is updated when one of the partition's components fails. To make sure that the health check is correct, it is set up to be bi-directional: when probed, the partition also probes its own critical components. Lastly, when this results in an UNHEALTHY state, the metrics should be updated. So we call updateHealthStatus when the partition's own state is HEALTHY, but its critical components are not.

## Related issues

closes #4723

Co-authored-by: Nico korthout <nico.korthout@camunda.com>
Build failed |
bors retry |
Build failed |
bors retry |
Build succeeded |
4796: [BACKPORT] Probe components health when partition is probed r=korthout a=korthout

## Description

When the BrokerHealthCheckService probes the ZeebePartitions for their health, each ZeebePartition reported only its own health status. However, we cannot be sure that this status is updated when one of the partition's components fails. To make sure that the health check is correct, it is set up to be bi-directional: when probed, the partition also probes its own critical components. Lastly, when this results in an UNHEALTHY state, the metrics should be updated. So we call updateHealthStatus when the partition's own state is HEALTHY, but its critical components are not.

## Related issues

closes #4723
backport of #4764

Co-authored-by: Nico korthout <nico.korthout@camunda.com>
…backport-0.23 [BACKPORT] Probe components health when partition is probed. Description: When the BrokerHealthCheckService probes the ZeebePartitions for their health, each ZeebePartition reported only its own health status. However, we cannot be sure that this status is updated when one of the partition's components fails. To make sure that the health check is correct, it is set up to be bi-directional: when probed, the partition also probes its own critical components. Lastly, when this results in an UNHEALTHY state, the metrics should be updated. So we call updateHealthStatus when the partition's own state is HEALTHY, but its critical components are not. Related issues: closes #4723, backport of #4764