
RFC: Suspend MV on Non-Recoverable Errors #54

Conversation

@hzxa21 hzxa21 commented Feb 24, 2023

@StrikeW (Contributor) commented Feb 27, 2023

I think we may extend the scenarios to unrecoverable Sources (risingwavelabs/risingwave#7192). The suspension workflow should be similar from the perspective of Meta. cc @BugenZhao

1. Even if these messages can be skipped or fixed by filling in default values, correctness can still be affected, especially when sinking to external systems, since the damage can hardly be reverted.
2. Recovery won’t help here. We may end up re-consuming the malformed records and triggering full recovery over and over again. All MVs in the cluster will be affected and no progress will be made.

This RFC proposes that we should suspend the affected MV when encountering non-recoverable errors. Non-recoverable errors by definition cannot be fixed automatically, so manual intervention and re-creating the MV are unavoidable. The goals of this RFC are:

I think this may be too strict.

The user may still want the MV to run.

I believe that the user should be in charge of ensuring that their input source is persisted outside of RW (e.g. in a message queue like Kafka). They should create a new MV that avoids the errors if they discover that they cannot tolerate the given errors.


Perhaps we should expose the error tolerance to the user. However, we need to decide which mode is the default:

connector_error_suspends_mv: true,
connector_error_kills_batch_query: true,
compute_error_suspends_mv: false,
compute_error_kills_batch_query: true,
storage_error_suspends_mv: true,

There is some overlap with the batch query behaviour (whether to warn or throw an error) @fuyufjh. We have not decided which is the best default behaviour for the user (or whether we should expose the error tolerance option for batch at all). Actually, I'm leaning towards not tolerating errors for batch.
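The per-error-class options above could be captured in a small config struct. The sketch below is purely illustrative (the struct name and fields mirror the proposal in this comment, not any actual RisingWave configuration), with the proposed values as defaults:

```rust
// Hypothetical sketch of the error-tolerance options proposed above.
// Field names mirror the comment's list; defaults are the proposed values.
#[derive(Debug, Clone)]
pub struct ErrorToleranceConfig {
    pub connector_error_suspends_mv: bool,
    pub connector_error_kills_batch_query: bool,
    pub compute_error_suspends_mv: bool,
    pub compute_error_kills_batch_query: bool,
    pub storage_error_suspends_mv: bool,
}

impl Default for ErrorToleranceConfig {
    fn default() -> Self {
        Self {
            connector_error_suspends_mv: true,
            connector_error_kills_batch_query: true,
            compute_error_suspends_mv: false,
            compute_error_kills_batch_query: true,
            storage_error_suspends_mv: true,
        }
    }
}

fn main() {
    let cfg = ErrorToleranceConfig::default();
    // Under this proposal, connector errors suspend the MV by default...
    assert!(cfg.connector_error_suspends_mv);
    // ...while compute errors do not.
    assert!(!cfg.compute_error_suspends_mv);
    println!("{cfg:?}");
}
```

Grouping the flags per error class makes the open question concrete: each class gets an independent default, and only `compute_error_suspends_mv` is proposed as tolerant.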


However, when there are non-recoverable errors caused by malformed records ingested by the user, the current error handling mechanisms are insufficient:

1. Even if these messages can be skipped or fixed by filling in default values, correctness can still be affected, especially when sinking to external systems, since the damage can hardly be reverted.

the damage can hardly be reverted.

I'm not completely sure this is correct.

Could you provide some examples where the damage is permanent? One way is that components downstream of RW are affected, and a lot of manual work is required to clean up the wrong data.

For RW itself, I think we do not need to eagerly suspend the MV; the user can create a new MV on the source if they discover errors.

Ultimately, I think the behaviour should be determined by the user, whether they are willing to tolerate certain types of errors with NULL records in place, or if they want to suspend the job. The same goes for batch queries.

Author

One way is that components downstream of RW are affected, and a lot of manual work is required to clean up the wrong data.

Exactly.

Ultimately, I think the behaviour should be determined by the user, whether they are willing to tolerate certain types of errors or if they want to suspend the job. The same goes for batch queries.

Totally agree. That was my intention too.


## Non-Recoverable Errors

- Errors caused by malformed records ingested from the user source. Examples: record too large, record parsing error, record format/type mismatch.


record too large

I know this is just an example, but for completeness, in what scenario can record too large occur?

I know we have encountered request too large for an etcd request (risingwavelabs/risingwave#7728), but I'm not sure about record-too-large issues in connectors. @tabVersion

Author

We recently encountered an issue where a user ingested a malformed row with a >64KB pk. We didn't notice it until the compactor stopped working and we manually inspected the encoded SSTs in S3. The user also didn't realize that they had ingested such a malformed row until we explicitly told them there was something wrong with a specific table.

Ideally, we should warn the user or even stop ingesting the malformed row once it appears. Although we have a hotfix merged, we still need to figure out a way to do that instead of just working around the issue by allowing >64KB keys to be stored.

Also, we don't seem to enforce any length limitation on our PK or the VARCHAR type, but that is a separate discussion.

@fuyufjh (Contributor) left a comment

Generally agree.

Some complements to the background: previously, the major reason blocking us from doing this was that we didn't assume "non-recoverable errors" existed, because recovering a suspended streaming job seemed difficult. But since your solution limits the scope to "non-recoverable errors", it becomes doable.

## Non-Recoverable Errors

- Errors caused by malformed records ingested from the user source. Examples: record too large, record parsing error, record format/type mismatch.
- Errors caused by failures in sanity checks. Examples: state table insert/update/delete sanity check fails.
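The two bullets above could be modeled as variants of one error type with a recoverability predicate. This is an illustrative sketch only: the enum, its variants, and the `NetworkTimeout` counter-example are invented for contrast and are not RisingWave's real error types.

```rust
// Hypothetical classification of the error kinds listed in the RFC text.
#[derive(Debug)]
enum StreamingError {
    // Malformed user input (first bullet).
    RecordTooLarge,
    RecordParsingError,
    RecordFormatMismatch,
    // State table sanity-check failure (second bullet).
    SanityCheckFailure,
    // A transient error, assumed recoverable via checkpoint-based recovery.
    NetworkTimeout,
}

impl StreamingError {
    // Everything except transient errors is treated as non-recoverable here.
    fn is_non_recoverable(&self) -> bool {
        !matches!(self, StreamingError::NetworkTimeout)
    }
}

fn main() {
    assert!(StreamingError::RecordTooLarge.is_non_recoverable());
    assert!(StreamingError::SanityCheckFailure.is_non_recoverable());
    assert!(!StreamingError::NetworkTimeout.is_non_recoverable());
}
```

Making the predicate explicit is what later comments in this thread debate: whether panics should also map into the non-recoverable bucket.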
Contributor

Shall we consider panic as a kind of Non-Recoverable Errors?

Author

Shall we consider panic as a kind of Non-Recoverable Errors?

+1


Disagree. Suspending on panics would mean the system is basically non-HA. We should at least retry a few times.

@fuyufjh (Contributor) Mar 6, 2023

After some extra thought, I think this is bad, because sometimes a panic only means the program doesn't know how to continue, but it doesn't necessarily mean it can't be recovered from the last checkpoint.

1. Ban user query and throw an error.
2. Allow users to query but remind them there is an error and the MV will no longer be updated.

Personally, I prefer 1 over 2.
@fuyufjh (Contributor) Feb 28, 2023

2 looks a little bit better to me, but 1 seems to be the only feasible approach for us because we don't support a dangling MV, i.e. an MV without a streaming job attached.

Author

2 is doable since the data remains on storage and is immutable. It may need some refactoring on the meta side though (maybe we can just treat the dangling MV and its internal state tables as regular batch tables).


Detailed steps of MV suspension:

1. Actor reports non-recoverable errors to meta node.
@BugenZhao (Member) Mar 6, 2023

This may be tricky, as we need to avoid panicking (and then aborting) for each type of non-recoverable error to make it work, or we will have no chance to do the following steps gracefully. One may argue that catch_unwind can be a fallback, but I worry that it will corrupt some internal states and lead to chained errors.

For example, when using bytes::Buf, we will panic on the deserialization of a corrupted encoding.
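The catch_unwind fallback discussed here can be sketched as follows. This is a toy illustration only: `run_actor` and `report_to_meta` are hypothetical names, not RisingWave functions, and the real concern (corrupted internal state after an unwind) is exactly what this simple wrapper does not address.

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical stand-in for the "actor reports error to meta node" step.
fn report_to_meta(actor_id: u32, message: &str) {
    eprintln!("actor {actor_id} failed: {message}");
}

// Wrap an actor's work in catch_unwind so that, at minimum, the panic
// message and the originating actor_id can be reported to meta.
fn run_actor(actor_id: u32, work: impl FnOnce()) -> Result<(), String> {
    match panic::catch_unwind(AssertUnwindSafe(work)) {
        Ok(()) => Ok(()),
        Err(payload) => {
            // Panic payloads are usually `&str` or `String`.
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            report_to_meta(actor_id, &msg);
            Err(msg)
        }
    }
}

fn main() {
    // Silence the default panic hook so only our report is printed.
    panic::set_hook(Box::new(|_| {}));
    // Simulate a corrupted-encoding panic like the bytes::Buf example above.
    let res = run_actor(1, || panic!("corrupted encoding"));
    assert!(res.is_err());
    assert!(run_actor(2, || ()).is_ok());
}
```

Note that catching the unwind only buys the chance to report; any state the actor shares with others may still be inconsistent afterwards, which is the "chained errors" worry raised above.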

Contributor

I think catch_unwind is just to get information about where the error occurred, such as the actor_id, and send it to meta. A checkpoint-based failover is still needed?

@BugenZhao (Member) Mar 6, 2023

Yes, if we have sufficient time to at least report the information. In this case, we must rely on approach #1 of suspending the MV. 🤔


### DISCUSSION: How to suspend the MV?

1. Trigger a recovery: all ongoing concurrent checkpoints fail, and actors of the suspended MV won’t get rebuilt during recovery.
Member

IIUC, we expect the users to drop the problematic materialized views to recover from this. However, does this mean that the materialized view is "dangling" at this time?

@fuyufjh (Contributor) commented Mar 6, 2023

After the discussion, I'm wavering a little now.

  1. It introduces some extra complexity, not only in programming but also in analyzing a bug afterwards.
  2. Sometimes it's hard to tell exactly whether a situation is non-recoverable or recoverable.

@xxchan (Member) commented Mar 6, 2023

  1. It introduces some extra complexity, not only in programming but also in analyzing a bug afterwards.
  2. Sometimes it's hard to tell exactly whether a situation is non-recoverable or recoverable.

These sound just like the original reasons why we didn't do it...

@hzxa21 (Author) commented Mar 6, 2023

We didn't reach consensus during the meeting, but many thoughts popped up.

Goal

  1. Make users aware of the errors that happen
  2. Maintain correct results of the MV/Sink when non-recoverable errors happen
  3. Reduce the impact of non-recoverable errors on streaming and serving

Meeting summary and some thoughts afterwards

  • Goal #1 can be achieved via reporting errors in grafana/cloud portal. We can also display the errors in a system table or surface them in batch queries if we want to draw the users' attention.
  • Goal #2 (strict correctness, considering both MV and Sink) can be achieved via suspending the MV as well as all its downstream MVs, but it hurts availability of a subset of the streaming jobs.
  • Goal #3 can be achieved via
    • Not suspending any MVs when errors happen but just skipping the errors or filling in default values. Availability is high but this hurts goal #2.
    • Suspending only the affected MVs. Availability is downgraded but goal #2 can still be achieved.
  • Other thoughts brought up in the meeting:
    • Errors/panics caused by kernel bugs may be recoverable by upgrading the code with the fixes and resuming MVs from the latest checkpoint.
    • We can provide an option for the user to choose whether they want to suspend the MV on non-recoverable errors, but what the default behavior should be is still debatable.
    • We also talked about whether it is possible to keep the temporal join MV running if only its inner-side MV is suspended. IMO, it is doable but complicated. We need atomic rename for MVs so that the inner-side MV can be recovered.
    • Not all panics are non-recoverable. Sometimes developers may just panic on recoverable errors without thinking it all through. In such cases, recovery may help.

In fact, goal #2 contradicts goal #3, and users may prioritize differently depending on their use cases or development stages:

  • In a test environment, or in use cases that cannot tolerate incorrect results (e.g. sinking a wrong signal to an external system may not be acceptable), users may want to catch non-recoverable errors and fix them immediately before bad things happen. In this case, correctness is preferred over availability.
  • In production, or in use cases that can tolerate incorrect results, users may want their service to be always up. In this case, MV suspension may be a concern since availability is preferred over strict correctness.

@hzxa21 (Author) commented Mar 6, 2023

In production, or in use cases that can tolerate incorrect results, users may want their service to be always up. In this case, MV suspension may be a concern since availability is preferred over strict correctness.

This is the key concern of this RFC. However, in the current implementation, even if incorrect results are acceptable to the user, not all errors can be skipped. Let's assume all the errors that can be skipped are recoverable errors. I suggest we constrain our discussion and focus on the following two issues:

How to identify the errors that cannot be skipped?

If we cannot identify these errors, we cannot do anything about them. The main debate is about how to handle panics.

The only feasible option seems to be letting developers carefully identify non-recoverable panics and report them as non-recoverable errors because, as mentioned in the comment, we can hardly know the originating actor of a panic.

Example: we already did the conversion from panic to error for MemTableError::InconsistentOperation, and this error is non-recoverable: https://github.com/risingwavelabs/risingwave/blob/f5e41a190b284b26a83b7740af77c842d5e35837/src/storage/src/mem_table.rs#L99
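The shape of that panic-to-error conversion can be sketched as below. This is a simplified toy, not the linked RisingWave code: the struct, the variant payload, and the "duplicate insert" sanity check are all illustrative stand-ins for the real mem-table semantics.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified mirror of MemTableError::InconsistentOperation:
// a failed sanity check surfaces as a typed non-recoverable error
// instead of a panic, so the caller can report it to meta gracefully.
#[derive(Debug, PartialEq)]
enum MemTableError {
    InconsistentOperation { key: Vec<u8> },
}

struct MemTable {
    map: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl MemTable {
    fn insert(&mut self, key: Vec<u8>, value: Vec<u8>) -> Result<(), MemTableError> {
        // Inserting over an existing key fails the sanity check:
        // return a non-recoverable error rather than panicking.
        if self.map.contains_key(&key) {
            return Err(MemTableError::InconsistentOperation { key });
        }
        self.map.insert(key, value);
        Ok(())
    }
}

fn main() {
    let mut table = MemTable { map: BTreeMap::new() };
    assert!(table.insert(b"k".to_vec(), b"v1".to_vec()).is_ok());
    // A second insert of the same key trips the sanity check.
    assert!(table.insert(b"k".to_vec(), b"v2".to_vec()).is_err());
}
```

Returning a `Result` here is what gives the actor a chance to propagate the error to meta before shutting down, which is exactly what a panic would preclude.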

What should we do when encountering errors that cannot be skipped?

Possible options:

  1. Trigger recovery over and over again, hoping that recovery can fix the errors.
  2. Pause all MVs and tell users to intervene. From the user's point of view, this is equivalent to option 1 if recovery cannot fix the errors (endless recovery == pausing all MVs).
  3. Pause some MVs and tell users to intervene. This is the proposal of this RFC.


8 participants