Do not fail RisingWave checkpoint if upstream source is unavailable #11705

Closed
hzxa21 opened this issue Aug 16, 2023 · 7 comments

Comments

@hzxa21
Collaborator

hzxa21 commented Aug 16, 2023

Is your feature request related to a problem? Please describe.

Suppose a user has created four tables:

  • Streaming table T1 from a Kafka source
  • Streaming table T2 from direct CDC of PG database1
  • Streaming table T3 from direct CDC of PG database2
  • Batch table T4

When the upstream PG database1 is unavailable, the RisingWave cluster enters a recovery loop until the upstream DB is back online. This causes two problems: 1) data ingestion stops for T1, T3 and T4; 2) batch queries on T1, T3 and T4 are constantly interrupted by recovery.

This issue was found while debugging a cluster failure reported by one of our users. It took us some time to realize that RisingWave was working fine and that the upstream source was down.

Describe the solution you'd like

We can pause polling from the unavailable source but still allow barrier injection from the source. That way, the RisingWave checkpoint is unaffected when an external system has an issue.

Data from the unavailable source will still lag, but this is unavoidable.
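
To make the idea concrete, here is a minimal sketch in Python (for illustration only; it is not RisingWave code, and `run_source`, `poll`, `Barrier`, `Chunk` and `UpstreamUnavailable` are hypothetical names). It assumes the source emits at most one data chunk per epoch, followed by that epoch's barrier. When polling fails, the chunk is skipped but the barrier is still emitted, so downstream checkpointing keeps making progress.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Union


@dataclass
class Barrier:
    epoch: int


@dataclass
class Chunk:
    rows: List[str]


class UpstreamUnavailable(Exception):
    """Raised by the hypothetical reader when the external system is down."""


def run_source(
    barrier_epochs: Iterator[int],
    poll: Callable[[], List[str]],
) -> Iterator[Union[Barrier, Chunk]]:
    """Per epoch, emit an optional data chunk and then always a barrier."""
    for epoch in barrier_epochs:
        try:
            rows = poll()
        except UpstreamUnavailable:
            rows = []  # upstream is down: skip data, keep the barrier flowing
        if rows:
            yield Chunk(rows)
        yield Barrier(epoch)


def make_flaky_poll() -> Callable[[], List[str]]:
    """Hypothetical reader that is unavailable on its 2nd and 3rd call."""
    calls = {"n": 0}

    def poll() -> List[str]:
        calls["n"] += 1
        if calls["n"] in (2, 3):
            raise UpstreamUnavailable("PG database1 is down")
        return [f"row-{calls['n']}"]

    return poll
```

Running `list(run_source(iter(range(1, 5)), make_flaky_poll()))` yields a barrier for every epoch from 1 to 4, while data chunks appear only for epochs 1 and 4. The lag on the failing source is visible as missing chunks, not as a failed checkpoint.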

Describe alternatives you've considered

No response

Additional context

No response

@github-actions github-actions bot added this to the release-1.2 milestone Aug 16, 2023
@hzxa21
Collaborator Author

hzxa21 commented Aug 16, 2023

cc @fuyufjh @tabVersion @BugenZhao Any opinion?

@BugenZhao
Member

Seems we had a similar discussion before -> #7192 (comment)

@tabVersion
Contributor

Had a discussion with @fuyufjh yesterday. There will be a doc about the expected behavior of the source when the upstream is unavailable. Stay tuned!

@hzxa21
Collaborator Author

hzxa21 commented Aug 16, 2023

> Had a discussion with @fuyufjh yesterday. There will be a doc about the expected behavior of the source when the upstream is unavailable. Stay tuned!

Thanks for the heads-up. Look forward to it! ❤️

@fuyufjh
Contributor

fuyufjh commented Aug 16, 2023

Completely agree. Just discussed exactly the same issue with @tabVersion yesterday. We should start working on this soon.

@fuyufjh
Contributor

fuyufjh commented Aug 16, 2023

> We can pause polling from the unavailable source but still allow barrier injection from the source. That way, the RisingWave checkpoint is unaffected when an external system has an issue.

Yeah. I proposed letting the source retry forever when the upstream is unavailable. Apart from that, as you said, barrier injection should run normally.
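
A rough sketch of what "retry forever" could look like, in Python for illustration only. The capped exponential backoff is an assumption and not something decided in this thread, and `retry_forever`, `backoff_delays` and `connect` are hypothetical names. In a real executor this loop would have to run without blocking barrier handling.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 0.1, cap: float = 10.0) -> Iterator[float]:
    """Yield delays that double each time and never exceed `cap`."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)


def retry_forever(
    connect: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
    base: float = 0.1,
    cap: float = 10.0,
) -> T:
    """Call `connect` until it succeeds, sleeping between failed attempts."""
    for delay in backoff_delays(base, cap):
        try:
            return connect()
        except ConnectionError:
            sleep(delay)  # upstream still down: wait, then try again
    raise AssertionError("unreachable: backoff_delays never ends")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. The cap bounds how long the source waits before noticing that the upstream is back.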

Besides, the Sink side needs a similar mechanism, but the implementation would be more difficult. I think it depends on the Sink-MV decoupling, a.k.a. Log Store.

@fuyufjh
Contributor

fuyufjh commented Sep 11, 2023

As for testing, @cyliu0 may take a look as well.
