
Use raft disk i/o timeout in recovery stm #8741

Merged

Conversation


@mmaslankaprv mmaslankaprv commented Feb 9, 2023

We saw an intermittent hang when reading batches in the recovery STM. Added
a timeout when reading the batches that are sent to the follower. When the
timeout occurs, an error is logged, which makes the issue easier to track
in tests and in real deployments.

Fixes: #8724
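
For context, a minimal sketch of the approach (not the actual patch): the batch read is bounded by the raft disk I/O timeout and a timeout is reported at error severity. Names such as `read_batches_for_follower()`, `send_to_follower()`, `_disk_timeout`, and `_node_id` are hypothetical placeholders; only `ss::with_timeout`, `ss::timed_out_error`, and `vlog` come from Seastar and the excerpt discussed below.

```cpp
// Minimal sketch, assuming a Seastar coroutine context; placeholder names
// (read_batches_for_follower, send_to_follower, _disk_timeout, _node_id)
// are illustrative only.
#include <seastar/core/with_timeout.hh>

ss::future<> recovery_stm::read_and_send_batches() {
    // Bound the read by the configured raft disk I/O timeout.
    auto deadline = ss::lowres_clock::now() + _disk_timeout;
    try {
        auto batches = co_await ss::with_timeout(
          deadline, read_batches_for_follower());
        co_await send_to_follower(std::move(batches));
    } catch (const ss::timed_out_error&) {
        // Log at error severity so a hung read is easy to spot both in CI
        // and in real deployments; recovery itself will be retried.
        vlog(
          _ctxlog.error,
          "timed out reading batches to recover follower {}",
          _node_id);
    }
}
```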

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

Improvements

  • easier debugging of long reads during recovery

Signed-off-by: Michal Maslanka <michal@redpanda.com>
Some tests use SIGSTOP to suspend nodes. In this case the raft I/O
timeout must be longer than the maximum suspend time, as otherwise it
would generate spurious failures.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
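
For illustration only, the test-configuration constraint described above, with hypothetical names (`raft_io_timeout`, `max_suspend_duration` are placeholders, not Redpanda configuration properties):

```cpp
// Sketch of the invariant tests must respect; names are placeholders.
#include <cassert>
#include <chrono>

void check_test_timeouts(
  std::chrono::milliseconds raft_io_timeout,
  std::chrono::milliseconds max_suspend_duration) {
    // A node stopped with SIGSTOP performs no I/O at all while suspended,
    // so the raft I/O timeout must exceed the longest suspension a test may
    // apply; otherwise recovery reads would time out and log spurious errors.
    assert(raft_io_timeout > max_suspend_duration);
}
```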
mmaslankaprv commented Feb 9, 2023

new unrelated ci failure: #8745

      std::move(gap_filled_batches));
} catch (const ss::timed_out_error& e) {
    vlog(
      _ctxlog.error,
Contributor

Recovery will be retried after this error, right? If so, what is the rationale for logging it at error severity? So that we can catch it in tests?

Member Author


exactly, i would love this to be as visible as possible.

@mmaslankaprv mmaslankaprv merged commit ae944d4 into redpanda-data:dev Feb 9, 2023
Development

Successfully merging this pull request may close these issues.

Inconsistent partition state between brokers during transfer of leadership
3 participants