
RFC: Suspend MV on Non-Recoverable Errors #54

Conversation

@hzxa21 hzxa21 commented Feb 24, 2023

@StrikeW (Contributor) commented Feb 27, 2023

I think we may extend the scenarios to unrecoverable Sources (risingwavelabs/risingwave#7192). The suspension workflow should be similar from the perspective of Meta. cc @BugenZhao

1. Even if these messages can be skipped or fixed by filling in default values, correctness can still be affected, especially when sinking to external systems, since the damage can hardly be reverted.
2. Recovery won’t help here. We may end up re-consuming the malformed records and triggering full recovery over and over again. All MVs in the cluster will be affected and no progress will be made.

This RFC proposes that we should suspend the affected MV when encountering non-recoverable errors. Non-recoverable errors by definition cannot be fixed automatically, so manual intervention and re-creating the MV are unavoidable. The goals of this RFC are:

I think this may be too strict.

The user may still want the MV to run.

I believe that the user should be in charge of ensuring that their input source is persisted outside of RW (e.g. in a message queue like Kafka). They should create a new MV that avoids the errors if they discover that they cannot tolerate the given errors.


Perhaps we should expose the error tolerance to the user. However, we need to decide which mode is the default:

connector_error_suspends_mv: true,
connector_error_kills_batch_query: true,
compute_error_suspends_mv: false,
compute_error_kills_batch_query: true,
storage_error_suspends_mv: true,

There is some overlap with the batch query behaviour (whether to warn or throw an error) @fuyufjh. We have not decided which is the best default behaviour for the user (or whether we should expose the error tolerance option for batch at all). Actually, I'm leaning towards not tolerating errors for batch.
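The per-error-class options above could be captured in a small config struct. The sketch below is purely illustrative (the struct name and fields mirror the proposal in this comment, not any actual RisingWave configuration), with the proposed values as defaults:

```rust
// Hypothetical sketch of the error-tolerance options proposed above.
// Field names mirror the comment's list; defaults are the proposed values.
#[derive(Debug, Clone)]
pub struct ErrorToleranceConfig {
    pub connector_error_suspends_mv: bool,
    pub connector_error_kills_batch_query: bool,
    pub compute_error_suspends_mv: bool,
    pub compute_error_kills_batch_query: bool,
    pub storage_error_suspends_mv: bool,
}

impl Default for ErrorToleranceConfig {
    fn default() -> Self {
        Self {
            connector_error_suspends_mv: true,
            connector_error_kills_batch_query: true,
            compute_error_suspends_mv: false,
            compute_error_kills_batch_query: true,
            storage_error_suspends_mv: true,
        }
    }
}

fn main() {
    let cfg = ErrorToleranceConfig::default();
    // Under this proposal, connector errors suspend the MV by default...
    assert!(cfg.connector_error_suspends_mv);
    // ...while compute errors do not.
    assert!(!cfg.compute_error_suspends_mv);
    println!("{cfg:?}");
}
```

Grouping the flags per error class makes the open question concrete: each class gets an independent default, and only `compute_error_suspends_mv` is proposed as tolerant.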


However, when there are non-recoverable errors caused by malformed records ingested by the user, the current error handling mechanisms are insufficient:

1. Even if these messages can be skipped or fixed by filling in default values, correctness can still be affected, especially when sinking to external systems, since the damage can hardly be reverted.

the damage can hardly be reverted.

I'm not completely sure this is correct.

Could you provide some examples where the damage is permanent? One way is that components downstream of RW are affected, and a lot of manual work is required to clean up the wrong data.

For RW itself, I think we do not need to eagerly suspend the MV; the user can create a new MV on the source if they discover errors.

Ultimately, I think the behaviour should be determined by the user, whether they are willing to tolerate certain types of errors with NULL records in place, or if they want to suspend the job. The same goes for batch queries.

Author

One way is that components downstream of RW are affected, and a lot of manual work is required to clean up the wrong data.

Exactly.

Ultimately, I think the behaviour should be determined by the user, whether they are willing to tolerate certain types of errors or if they want to suspend the job. The same goes for batch queries.

Totally agree. That was my intention too.


## Non-Recoverable Errors

- Errors caused by malformed records ingested from the user source. Examples: record too large, record parsing error, record format/type mismatch.


record too large

I know this is just an example, but for completeness, in what scenario can record too large occur?

I know we have encountered request too large for an etcd request (risingwavelabs/risingwave#7728), but I'm not sure about record-too-large issues in connectors. @tabVersion

Author

We recently encountered an issue where a user ingested a malformed row with a >64KB pk. We didn't notice it until the compactor stopped working and we manually inspected the encoded SSTs in S3. The user also didn't realize that they had ingested such a malformed row until we explicitly told them there was something wrong with a specific table.

Ideally, we should warn the user or even stop ingesting the malformed row once it appears. Although we have a hotfix merged, we still need to figure out a way to do that instead of just working around the issue by allowing >64KB keys to be stored.

Also, we don't seem to enforce any length limitation on our PK or the VARCHAR type, but that is a separate discussion.

@fuyufjh (Contributor) left a comment

Generally agree.

Some complements to the background: previously, the major reason blocking us from doing this was that we didn't assume "non-recoverable errors" existed, because recovering a suspended streaming job seemed difficult. But since your solution limits the scope to "non-recoverable errors", it becomes doable.

## Non-Recoverable Errors

- Errors caused by malformed records ingested from the user source. Examples: record too large, record parsing error, record format/type mismatch.
- Errors caused by failures in sanity checks. Examples: state table insert/update/delete sanity check fails.
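The two bullets above could be modeled as variants of one error type with a recoverability predicate. This is an illustrative sketch only: the enum, its variants, and the `NetworkTimeout` counter-example are invented for contrast and are not RisingWave's real error types.

```rust
// Hypothetical classification of the error kinds listed in the RFC text.
#[derive(Debug)]
enum StreamingError {
    // Malformed user input (first bullet).
    RecordTooLarge,
    RecordParsingError,
    RecordFormatMismatch,
    // State table sanity-check failure (second bullet).
    SanityCheckFailure,
    // A transient error, assumed recoverable via checkpoint-based recovery.
    NetworkTimeout,
}

impl StreamingError {
    // Everything except transient errors is treated as non-recoverable here.
    fn is_non_recoverable(&self) -> bool {
        !matches!(self, StreamingError::NetworkTimeout)
    }
}

fn main() {
    assert!(StreamingError::RecordTooLarge.is_non_recoverable());
    assert!(StreamingError::SanityCheckFailure.is_non_recoverable());
    assert!(!StreamingError::NetworkTimeout.is_non_recoverable());
}
```

Making the predicate explicit is what later comments in this thread debate: whether panics should also map into the non-recoverable bucket.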
Contributor

Shall we consider panic as a kind of Non-Recoverable Errors?

Author

Shall we consider panic as a kind of Non-Recoverable Errors?

+1


Disagree. Suspending on panics would mean the system is basically non-HA. We should at least retry a few times.

@fuyufjh (Contributor) Mar 6, 2023

After some extra thought, I think this is bad, because sometimes a panic only means the program doesn't know how to continue, but it doesn't necessarily mean it can't be recovered from the last checkpoint.

1. Ban user query and throw an error.
2. Allow users to query but remind them there is an error and the MV will no longer be updated.

Personally, I prefer 1 over 2.
@fuyufjh (Contributor) Feb 28, 2023

2 looks a little bit better to me, but 1 seems to be the only feasible approach for us because we don't support a dangling MV, i.e. an MV without a streaming job attached.

Author

2 is doable since the data remains on storage and is immutable. It may need some refactoring on the meta side though (maybe we can just treat the dangling MV and its internal state tables as regular batch tables).


Detailed steps of MV suspension:

1. Actor reports non-recoverable errors to meta node.
@BugenZhao (Member) Mar 6, 2023

This may be tricky, as we need to avoid panicking (and then aborting) for each type of non-recoverable error to make it work, or we will have no chance to do the following steps gracefully. One may argue that catch_unwind can be a fallback, but I worry that it will corrupt some internal states and lead to chained errors.

For example, when using bytes::Buf, we will panic on the deserialization of a corrupted encoding.
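The catch_unwind fallback discussed here can be sketched as follows. This is a toy illustration only: `run_actor` and `report_to_meta` are hypothetical names, not RisingWave functions, and the real concern (corrupted internal state after an unwind) is exactly what this simple wrapper does not address.

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical stand-in for the "actor reports error to meta node" step.
fn report_to_meta(actor_id: u32, message: &str) {
    eprintln!("actor {actor_id} failed: {message}");
}

// Wrap an actor's work in catch_unwind so that, at minimum, the panic
// message and the originating actor_id can be reported to meta.
fn run_actor(actor_id: u32, work: impl FnOnce()) -> Result<(), String> {
    match panic::catch_unwind(AssertUnwindSafe(work)) {
        Ok(()) => Ok(()),
        Err(payload) => {
            // Panic payloads are usually `&str` or `String`.
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            report_to_meta(actor_id, &msg);
            Err(msg)
        }
    }
}

fn main() {
    // Silence the default panic hook so only our report is printed.
    panic::set_hook(Box::new(|_| {}));
    // Simulate a corrupted-encoding panic like the bytes::Buf example above.
    let res = run_actor(1, || panic!("corrupted encoding"));
    assert!(res.is_err());
    assert!(run_actor(2, || ()).is_ok());
}
```

Note that catching the unwind only buys the chance to report; any state the actor shares with others may still be inconsistent afterwards, which is the "chained errors" worry raised above.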

Contributor

I think catch_unwind is just to get information about where the error occurred, such as the actor_id, and send it to meta. A checkpoint-based failover is still needed?

@BugenZhao (Member) Mar 6, 2023

Yes, if we have sufficient time to at least report the information. In this case, we must rely on approach #1 of suspending the MV. 🤔


### DISCUSSION: How to suspend the MV?

1. Trigger a recovery: all ongoing concurrent checkpoints fail, and actors of the suspended MV won’t get rebuilt during recovery.
Member

IIUC, we expect the users to drop the problematic materialized views to recover from this. However, does this mean that the materialized view is "dangling" at this time?

@fuyufjh (Contributor) commented Mar 6, 2023

After the discussion, I'm wavering a little now.

  1. It introduces some extra complexity, not only in programming but also in analyzing a bug afterwards.
  2. Sometimes it's hard to tell exactly whether a situation is non-recoverable or recoverable.

@xxchan (Member) commented Mar 6, 2023

  1. It introduces some extra complexity, not only in programming but also in analyzing a bug afterwards.
  2. Sometimes it's hard to tell exactly whether a situation is non-recoverable or recoverable.

These sound just like the original reasons why we didn't do it...

@hzxa21 (Author) commented Mar 6, 2023

We didn't reach consensus during the meeting, but many thoughts popped up.

Goal

  1. Make users aware of the errors that happen
  2. Maintain correct results of the MV/Sink when non-recoverable errors happen
  3. Reduce the impact of non-recoverable errors on streaming and serving

Meeting summary and some thoughts afterwards

  • Goal #1 can be achieved via reporting errors in grafana/cloud portal. We can also display the errors in a system table or surface them in batch queries if we want to draw the users' attention.
  • Goal #2 (strict correctness, considering both MV and Sink) can be achieved via suspending the MV as well as all its downstream MVs, but it hurts availability of a subset of the streaming jobs.
  • Goal #3 can be achieved via
    • Not suspending any MVs when errors happen but just skipping the errors or filling in default values. Availability is high but this hurts goal #2.
    • Suspending only the affected MVs. Availability is downgraded but goal #2 can still be achieved.
  • Other thoughts brought up in the meeting:
    • Errors/panics caused by kernel bugs may be recoverable by upgrading the code with the fixes and resuming MVs from the latest checkpoint.
    • We can provide an option for the user to choose whether they want to suspend the MV on non-recoverable errors, but what the default behavior should be is still debatable.
    • We also talked about whether it is possible to keep the temporal join MV running if only its inner-side MV is suspended. IMO, it is doable but complicated. We need atomic rename for MVs so that the inner-side MV can be recovered.
    • Not all panics are non-recoverable. Sometimes developers may just panic on recoverable errors without thinking it all through. In such cases, recovery may help.

In fact, goal #2 contradicts goal #3, and users may prioritize differently depending on their use cases or development stages:

  • In a test environment, or in use cases that cannot tolerate incorrect results (e.g. sinking a wrong signal to an external system may not be acceptable), users may want to catch non-recoverable errors and fix them immediately before bad things happen. In this case, correctness is preferred over availability.
  • In production, or in use cases that can tolerate incorrect results, users may want their service to be always up. In this case, MV suspension may be a concern since availability is preferred over strict correctness.

@hzxa21 (Author) commented Mar 6, 2023

In production, or in use cases that can tolerate incorrect results, users may want their service to be always up. In this case, MV suspension may be a concern since availability is preferred over strict correctness.

This is the key concern of this RFC. However, in the current implementation, even if incorrect results are acceptable to the user, not all errors can be skipped. Let's assume all the errors that can be skipped are recoverable errors. I suggest we constrain our discussion and focus on the following two issues:

How to identify the errors that cannot be skipped?

If we cannot identify these errors, we cannot do anything about them. The main debate is about how to handle panics.

The only feasible option seems to be letting developers carefully identify non-recoverable panics and report them as non-recoverable errors because, as mentioned in the comment, we can hardly know the originating actor of a panic.

Example: we already did the conversion from panic to error for MemTableError::InconsistentOperation, and this error is non-recoverable: https://github.com/risingwavelabs/risingwave/blob/f5e41a190b284b26a83b7740af77c842d5e35837/src/storage/src/mem_table.rs#L99
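The shape of that panic-to-error conversion can be sketched as below. This is a simplified toy, not the linked RisingWave code: the struct, the variant payload, and the "duplicate insert" sanity check are all illustrative stand-ins for the real mem-table semantics.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified mirror of MemTableError::InconsistentOperation:
// a failed sanity check surfaces as a typed non-recoverable error
// instead of a panic, so the caller can report it to meta gracefully.
#[derive(Debug, PartialEq)]
enum MemTableError {
    InconsistentOperation { key: Vec<u8> },
}

struct MemTable {
    map: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl MemTable {
    fn insert(&mut self, key: Vec<u8>, value: Vec<u8>) -> Result<(), MemTableError> {
        // Inserting over an existing key fails the sanity check:
        // return a non-recoverable error rather than panicking.
        if self.map.contains_key(&key) {
            return Err(MemTableError::InconsistentOperation { key });
        }
        self.map.insert(key, value);
        Ok(())
    }
}

fn main() {
    let mut table = MemTable { map: BTreeMap::new() };
    assert!(table.insert(b"k".to_vec(), b"v1".to_vec()).is_ok());
    // A second insert of the same key trips the sanity check.
    assert!(table.insert(b"k".to_vec(), b"v2".to_vec()).is_err());
}
```

Returning a `Result` here is what gives the actor a chance to propagate the error to meta before shutting down, which is exactly what a panic would preclude.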

What should we do when encountering errors that cannot be skipped?

Possible options:

  1. Trigger recovery over and over again, hoping that recovery can fix the errors.
  2. Pause all MVs and tell users to intervene. From the user's point of view, this is equivalent to option 1 if recovery cannot fix the errors (endless recovery == pausing all MVs).
  3. Pause some MVs and tell users to intervene. This is the proposal of this RFC.


8 participants