Do not fail RisingWave checkpoint if upstream source is unavailable #11705
Comments
cc @fuyufjh @tabVersion @BugenZhao Any opinion?
Seems we had a similar discussion before -> #7192 (comment)
Had a discussion with @fuyufjh yesterday. There will be a doc about the expected behavior of the source part when the upstream is unavailable. Stay tuned!
Thanks for the heads-up. Look forward to it! ❤️
Completely agree. Just discussed exactly the same issue with @tabVersion yesterday. We should start working on this soon.
Yeah. I proposed letting it retry forever when the upstream source is unavailable. Otherwise, as you said, let the barrier injection run normally. Besides, the Sink side needs a similar mechanism, but I think the implementation would be more difficult because it depends on Sink-MV decoupling, a.k.a. Log Store.
As for testing, @cyliu0 may want to take a look as well.
Is your feature request related to a problem? Please describe.
Suppose a user created 3 tables:
When the upstream PG database1 is unavailable, the RisingWave cluster enters a recovery loop until the upstream DB is back online. This causes 1) data ingestion to stop in T1, T3 and T4; and 2) batch queries on T1, T3 and T4 to be constantly interrupted by recovery.
This issue was found while debugging a cluster failure reported by one of our users. It took us some time to realize that RisingWave was working fine and that the upstream source was down.
Describe the solution you'd like
We can pause the source so that it stops polling data from an unavailable upstream, but still allow barrier injection from the source. In this case, the RisingWave checkpoint is unaffected when there is an issue in an external system.
Data for the unavailable source will still lag, but this is unavoidable.
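To make the proposal concrete, here is a minimal sketch of the idea in Python. It is not RisingWave's actual code or API: `SourceExecutor`, `Barrier`, `Chunk`, `UpstreamUnavailable`, and `make_flaky_upstream` are hypothetical names used only for illustration. The point it tries to show is that a failed poll marks the source as paused instead of raising, while barriers keep flowing through.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union


class UpstreamUnavailable(Exception):
    """Hypothetical error: the external source cannot be reached."""


@dataclass
class Barrier:
    epoch: int


@dataclass
class Chunk:
    rows: List[str]


@dataclass
class SourceExecutor:
    """Hypothetical source operator that tolerates upstream outages."""

    poll_upstream: Callable[[], List[str]]
    paused: bool = False
    output: List[Union[Barrier, Chunk]] = field(default_factory=list)

    def on_tick(self) -> None:
        # Try to read data; on failure, pause instead of propagating the
        # error (which would otherwise trigger cluster recovery).
        try:
            rows = self.poll_upstream()
        except UpstreamUnavailable:
            self.paused = True
            return
        self.paused = False
        if rows:
            self.output.append(Chunk(rows))

    def on_barrier(self, barrier: Barrier) -> None:
        # Barriers are always forwarded, so checkpoints can still complete
        # even while the upstream is down.
        self.output.append(barrier)


def make_flaky_upstream(down_ticks):
    """Build a fake upstream that fails on the given 1-based poll numbers."""
    state = {"n": 0}

    def poll() -> List[str]:
        state["n"] += 1
        if state["n"] in down_ticks:
            raise UpstreamUnavailable("connection refused")
        return [f"row-{state['n']}"]

    return poll


def run_demo():
    # Upstream is down for polls 2 and 3; one barrier follows each poll.
    source = SourceExecutor(poll_upstream=make_flaky_upstream({2, 3}))
    for epoch in range(1, 5):
        source.on_tick()
        source.on_barrier(Barrier(epoch))
    return source.output


if __name__ == "__main__":
    for message in run_demo():
        print(message)
```

In this sketch, the barriers for all four epochs come out even though two polls fail, and data resumes once the fake upstream recovers. A real implementation would also need backoff between retries and some way to surface the paused state to the user, both of which are left out here.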
Describe alternatives you've considered
No response
Additional context
No response