Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repair: assert failure when local node fails to produce checksum #5238

Closed
denesb opened this issue Oct 29, 2019 · 0 comments
Closed

repair: assert failure when local node fails to produce checksum #5238

denesb opened this issue Oct 29, 2019 · 0 comments

Comments

@denesb
Copy link
Contributor

@denesb denesb commented Oct 29, 2019

Affected version: all active versions (2.1+).

scylla/repair/repair.cc

Lines 809 to 839 in 85168c5

when_all(checksums.begin(), checksums.end()).then(
[&ri, &cf, range, &neighbors, &success]
(std::vector<future<partition_checksum>> checksums) {
// If only some of the replicas of this range are alive,
// we set success=false so repair will fail, but we can
// still do our best to repair available replicas.
std::vector<gms::inet_address> live_neighbors;
std::vector<partition_checksum> live_neighbors_checksum;
for (unsigned i = 0; i < checksums.size(); i++) {
if (checksums[i].failed()) {
rlogger.warn(
"Checksum of range {} on {} failed: {}",
range,
(i ? neighbors[i-1] :
utils::fb_utilities::get_broadcast_address()),
checksums[i].get_exception());
success = false;
ri.nr_failed_ranges++;
// Do not break out of the loop here, so we can log
// (and discard) all the exceptions.
} else if (i > 0) {
live_neighbors.push_back(neighbors[i - 1]);
live_neighbors_checksum.push_back(checksums[i].get0());
}
}
if (checksums[0].failed() || live_neighbors.empty()) {
return make_ready_future<>();
}
// If one of the available checksums is different, repair
// all the neighbors which returned a checksum.
auto checksum0 = checksums[0].get0();

If checksums[0] is a failed future, the exception is extracted and logged. Calling checksum[0].get_exception() resets the future to state::invalid. If at least some of the other nodes produced a checksum, the if at line 834 will fall through as live_neighbours is not empty and checksums[0].failed() will return false, due to the future being in state::invalid state. Thus, execution happily proceeds to line 839, where it calls checksums[0].get0(), which leads to an assert failure, because the future::get() will attempt to get the thread context of the (non-available) future and will fail.

avikivity added a commit that referenced this issue Oct 29, 2019
…on only once

The loop that collects the result of the checksum calculations and logs
any errors. The error logging includes `checksums[0]` which corresponds
to the checksum calculation on the local node. This violates the
assumption of the code following the loop, which assumes that the future
of `checksums[0]` is intact after the loop terminates. However this is
only true when the checksum calculation is successful and is false when
it fails, as in this case the loop extracts the error and logs it. When
the code after the loop checks again whether said calculation failed, it
will get a false negative and will go ahead and attempt to extract the
value, triggering an assert failure.
Fix by making sure that even in the case of failed checksum calculation,
the result of `checksum[0]` is extracted only once.

Fixes: #5238
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191029151709.90986-1-bdenes@scylladb.com>
(cherry picked from commit e48f301)
avikivity added a commit that referenced this issue Oct 29, 2019
…on only once

The loop that collects the result of the checksum calculations and logs
any errors. The error logging includes `checksums[0]` which corresponds
to the checksum calculation on the local node. This violates the
assumption of the code following the loop, which assumes that the future
of `checksums[0]` is intact after the loop terminates. However this is
only true when the checksum calculation is successful and is false when
it fails, as in this case the loop extracts the error and logs it. When
the code after the loop checks again whether said calculation failed, it
will get a false negative and will go ahead and attempt to extract the
value, triggering an assert failure.
Fix by making sure that even in the case of failed checksum calculation,
the result of `checksum[0]` is extracted only once.

Fixes: #5238
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191029151709.90986-1-bdenes@scylladb.com>
(cherry picked from commit e48f301)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.