streaming: report error to local barrier manager explicitly when error occurs #6319

Closed
BugenZhao opened this issue Nov 11, 2022 · 1 comment · Fixed by #6327
Labels
component/streaming Stream processing related issue. type/bug Something isn't working

Comments

@BugenZhao
Member

BugenZhao commented Nov 11, 2022

If an error (not a panic) occurs during the execution of an actor, the actor exits with `Err` without reporting it to the local barrier manager. The actor's exit causes its upstream and downstream actors to exit as well, which eventually makes the injection of the next barrier fail. The meta service notices that failure and triggers the fail-over procedure.

However, if the number of in-flight concurrent checkpoints has reached the maximum (or the maximum is simply set to 1), no further barrier gets injected, so the compute nodes hang forever while the heartbeat still looks normal. We'll fail to recover the cluster.

We may need to explicitly report the error to the local barrier manager, so that the meta service notices the failure ASAP.
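
A minimal sketch of the idea, not the actual RisingWave code: every name here (`Event`, `run_actor`, `spawn_actor`, `collect_barrier`) is hypothetical, and plain threads with `std::sync::mpsc` stand in for the real async runtime. Each actor sends either a "collected" or a "failed" event to the barrier manager, so waiting for a barrier can end with an error instead of blocking on an actor that has already exited.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

type ActorId = u32;

/// Hypothetical messages an actor sends to the local barrier manager.
enum Event {
    Collected { actor_id: ActorId, epoch: u64 },
    Failed { actor_id: ActorId, error: String },
}

/// Hypothetical actor body: returns `Err` (it does not panic) when `fail` is set.
fn run_actor(fail: bool) -> Result<(), String> {
    if fail {
        Err("simulated executor error".to_string())
    } else {
        Ok(())
    }
}

/// Runs one actor and reports its outcome explicitly, including the error case.
fn spawn_actor(actor_id: ActorId, epoch: u64, fail: bool, tx: Sender<Event>) -> JoinHandle<()> {
    thread::spawn(move || {
        let event = match run_actor(fail) {
            Ok(()) => Event::Collected { actor_id, epoch },
            Err(error) => Event::Failed { actor_id, error },
        };
        // The manager may already have stopped listening; ignore send errors.
        let _ = tx.send(event);
    })
}

/// Waits until every actor has collected `epoch`, or returns the first reported error.
fn collect_barrier(epoch: u64, failing: &[bool]) -> Result<u64, String> {
    let (tx, rx) = channel();
    let mut pending: HashSet<ActorId> = HashSet::new();
    let mut handles = Vec::new();
    for (i, &fail) in failing.iter().enumerate() {
        let actor_id = i as ActorId;
        pending.insert(actor_id);
        handles.push(spawn_actor(actor_id, epoch, fail, tx.clone()));
    }
    drop(tx); // only the actors hold senders now

    let mut failure = None;
    for event in rx {
        match event {
            Event::Collected { actor_id, epoch: e } if e == epoch => {
                pending.remove(&actor_id);
            }
            Event::Collected { .. } => {} // stale epoch, ignored in this sketch
            Event::Failed { actor_id, error } => {
                // Surface the failure immediately instead of waiting forever.
                failure = Some(format!("actor {} failed: {}", actor_id, error));
                break;
            }
        }
    }
    for handle in handles {
        let _ = handle.join();
    }
    match failure {
        Some(message) => Err(message),
        None if pending.is_empty() => Ok(epoch),
        None => Err("an actor exited without reporting".to_string()),
    }
}
```

In this sketch, `collect_barrier(2, &[false, true, false])` returns an `Err` naming actor 1 rather than blocking, which is the signal the meta service would need in order to start fail-over.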

cc @StrikeW @yezizp2012

@BugenZhao BugenZhao added type/bug Something isn't working component/streaming Stream processing related issue. labels Nov 11, 2022
@github-actions github-actions bot added this to the release-0.1.14 milestone Nov 11, 2022
@yezizp2012
Contributor

And the hang will only occur if the actor exits right after the barrier has been sent successfully and before it is collected.
