
Use raft disk i/o timeout in recovery stm #8741

Merged

Conversation


@mmaslankaprv mmaslankaprv commented Feb 9, 2023

We saw an intermittent hang when reading batches in the recovery STM. Added
a timeout when reading the batches that are sent to the follower. When the
timeout occurs, an error is logged, which makes the issue easier to track
in tests and in real deployments.

Fixes: #8724
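
For context, a minimal sketch of the approach (not the actual patch): the batch read is bounded by the raft disk I/O timeout and a timeout is reported at error severity. Names such as `read_batches_for_follower()`, `send_to_follower()`, `_disk_timeout`, and `_node_id` are hypothetical placeholders; only `ss::with_timeout`, `ss::timed_out_error`, and `vlog` come from Seastar and the excerpt discussed below.

```cpp
// Minimal sketch, assuming a Seastar coroutine context; placeholder names
// (read_batches_for_follower, send_to_follower, _disk_timeout, _node_id)
// are illustrative only.
#include <seastar/core/with_timeout.hh>

ss::future<> recovery_stm::read_and_send_batches() {
    // Bound the read by the configured raft disk I/O timeout.
    auto deadline = ss::lowres_clock::now() + _disk_timeout;
    try {
        auto batches = co_await ss::with_timeout(
          deadline, read_batches_for_follower());
        co_await send_to_follower(std::move(batches));
    } catch (const ss::timed_out_error&) {
        // Log at error severity so a hung read is easy to spot both in CI
        // and in real deployments; recovery itself will be retried.
        vlog(
          _ctxlog.error,
          "timed out reading batches to recover follower {}",
          _node_id);
    }
}
```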

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

Improvements

  • easier debugging of long reads during recovery

Signed-off-by: Michal Maslanka <michal@redpanda.com>
Some tests use SIGSTOP to suspend nodes. In this case the raft I/O
timeout must be longer than the maximum suspend time, as otherwise it
would generate spurious failures.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
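
For illustration only, the test-configuration constraint described above, with hypothetical names (`raft_io_timeout`, `max_suspend_duration` are placeholders, not Redpanda configuration properties):

```cpp
// Sketch of the invariant tests must respect; names are placeholders.
#include <cassert>
#include <chrono>

void check_test_timeouts(
  std::chrono::milliseconds raft_io_timeout,
  std::chrono::milliseconds max_suspend_duration) {
    // A node stopped with SIGSTOP performs no I/O at all while suspended,
    // so the raft I/O timeout must exceed the longest suspension a test may
    // apply; otherwise recovery reads would time out and log spurious errors.
    assert(raft_io_timeout > max_suspend_duration);
}
```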
mmaslankaprv commented Feb 9, 2023

new unrelated ci failure: #8745

      std::move(gap_filled_batches));
} catch (const ss::timed_out_error& e) {
    vlog(
      _ctxlog.error,
Contributor

Recovery will be retried after this error, right? If so, what is the rationale for logging it at error severity? So that we can catch it in tests?

Member Author


exactly, i would love this to be as visible as possible.

@mmaslankaprv mmaslankaprv merged commit ae944d4 into redpanda-data:dev Feb 9, 2023
Development

Successfully merging this pull request may close these issues.

Inconsistent partition state between brokers during transfer of leadership
3 participants