Crash: 2517747396662708227 #1511
Ah, interesting, this looks like a race between reply read and reply write! We set faulty when a read completes with an error:

```zig
fn on_request_reply_read_callback(
    client_replies: *ClientReplies,
    reply_header: *const Header.Reply,
    reply_: ?*Message.Reply,
    destination_replica: ?u8,
) void {
    const self = @fieldParentPtr(Self, "client_replies", client_replies);
    const reply = reply_ orelse {
        log.debug("{}: on_request_reply: reply not found for replica={} (checksum={})", .{
            self.replica,
            destination_replica.?,
            reply_header.checksum,
        });
        if (self.client_sessions.get_slot_for_header(reply_header)) |slot| {
            self.client_replies.faulty.set(slot.index);
        }
        return;
    };
```

and we unset faulty immediately when we enqueue a write. So, if the sequence of events is:
the slot ends up being marked as faulty, although it isn't. Additionally, I think there's an unrelated (?) bug I've introduced.
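A toy Python model of the race described above may help. This is a sketch, not the real Zig implementation: the class and method names are hypothetical, and it only models the order in which the faulty bit is set and unset.

```python
# Toy model (hypothetical names) of the faulty-bit race: a failed read
# sets `faulty`, an enqueued write unsets it. If a stale read's error
# callback runs after the write was enqueued, the bit is set again even
# though a write for the slot is on its way.
class ClientReplies:
    def __init__(self):
        self.faulty = set()       # slot indices marked faulty
        self.write_queue = []     # slots with a write enqueued

    def on_read_error(self, slot):
        # A read completed with an error: mark the slot faulty.
        self.faulty.add(slot)

    def enqueue_write(self, slot):
        # A write will repair the reply, so unset faulty immediately.
        self.write_queue.append(slot)
        self.faulty.discard(slot)

replies = ClientReplies()
replies.on_read_error(0)      # first read fails: faulty set
replies.enqueue_write(0)      # write enqueued: faulty unset
replies.on_read_error(0)      # stale read completes late: faulty set again,
                              # although a write for the slot is queued
assert 0 in replies.faulty and 0 in replies.write_queue
```

The final assertion holds, which is exactly the inconsistent state: the slot is marked faulty while a repairing write is already queued.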
- it might be the case that the `writing` bitset is not set, but there's a write for the slot in the queue: replies can wait on a read to complete.
- when a faulty read completes, it might clobber the faulty bit, unset by a write which was scheduled after the read.

The specific series of events here:

1. Replica receives a RequestReply and starts a reply read.
2. The read completes with a failure; the replica sets the faulty bit.
3. Replica receives a RequestReply and starts a reply read.
4. Replica receives a Reply and starts a reply write:
   - the write unsets the faulty bit,
   - the write doesn't start, because there's a read executing.
5. The read completes, setting the faulty bit _again_.
6. Replica receives a RequestReply:
   - It _doesn't_ start a reply read, because there's an in-progress write that can resolve the read.
   - But the faulty bit is set, tripping up an assertion.

To prevent the clobbering:

- Introduce `writing_or_queued(slot)` to check if we _will_ write to this slot soon (i.e., we are writing to it right now, _or_ there's a write queued, but not executing because there's also a pending read).
- When a faulty read completes, check if there are any pending writes, and avoid setting the faulty bit.

In other words, the state of the `faulty` bit is determined by the order in which IO operations start.

Some alternative approaches here:

- Embrace the races: allow clobbering of the faulty bit by an old read, and relax the assertion. If the clobbering happens, a replica will run one more repair, which isn't a big deal.
- Properly resolve concurrent reads and writes: run the read's callback _before_ we enqueue the clobbering write.

SEED: 2517747396662708227
Closes: #1511
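A self-contained Python sketch of this first approach (hypothetical names; the real implementation is Zig): the read's error path consults a `writing_or_queued` check before setting the faulty bit.

```python
# Sketch of the `writing_or_queued(slot)` guard: a stale read error must
# not mark a slot faulty if a write that will repair it is executing or
# queued. All names here are illustrative, not the actual API.
class ClientReplies:
    def __init__(self):
        self.faulty = set()
        self.writing = set()      # slots with a write currently executing
        self.write_queue = []     # writes enqueued but not yet executing

    def writing_or_queued(self, slot):
        # True if we will write to this slot soon: writing right now,
        # or a write is queued behind a pending read.
        return slot in self.writing or slot in self.write_queue

    def on_read_error(self, slot):
        # Post-fix: skip setting faulty if a write will repair the slot.
        if not self.writing_or_queued(slot):
            self.faulty.add(slot)

    def enqueue_write(self, slot):
        self.write_queue.append(slot)
        self.faulty.discard(slot)

replies = ClientReplies()
replies.on_read_error(0)          # read fails, no write pending: faulty set
assert 0 in replies.faulty
replies.enqueue_write(0)          # write enqueued: faulty unset
replies.on_read_error(0)          # stale read error: write is queued, so
assert 0 not in replies.faulty    # the faulty bit is NOT clobbered
```

Compared with the toy model of the bug, the only change is the guard in `on_read_error`, which is the essence of the fix described above.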
- it might be the case that the `writing` bitset is not set, but there's a write for the slot in the queue: replies can wait on a read to complete.
- when a faulty read completes, it might clobber the faulty bit, unset by a write which was scheduled after the read.

The specific series of events here:

1. Replica receives a RequestReply and starts a reply read.
2. The read completes with a failure; the replica sets the faulty bit.
3. Replica receives a RequestReply and starts a reply read.
4. Replica receives a Reply and starts a reply write:
   - the write unsets the faulty bit,
   - the write doesn't start, because there's a read executing.
5. The read completes, setting the faulty bit _again_.
6. Replica receives a RequestReply:
   - It _doesn't_ start a reply read, because there's an in-progress write that can resolve the read.
   - But the faulty bit is set, tripping up an assertion.

The root issue here is the race between a read and a write for the same reply. Remove the race by explicitly handling the interleaving:

* When submitting a read, resolve it immediately if there's a pending write (this was already handled by `read_reply_sync`).
* When submitting a write, resolve any pending reads for the same reply.
* Remove the code to block the write while the read is in-progress, as this is no longer possible.

Note that it is still possible that a read and a write for the same slot race, if they target different replies. In this case, there won't be clobbering, as, when the read completes, we double-check freshness by consulting `client_sessions`.

SEED: 2517747396662708227
Closes: #1511
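A Python sketch of this second approach, in the same toy style (hypothetical names; the real implementation is Zig): submitting a write resolves any pending read for the same reply up front, so a stale read completion can never fire afterwards and clobber the faulty bit.

```python
# Sketch of "writes resolve pending reads": once a write for a reply is
# submitted, the in-flight read for that reply is answered from the
# write's data, and its disk completion becomes a no-op.
class ClientReplies:
    def __init__(self):
        self.faulty = set()
        self.pending_reads = {}   # slot -> callback of an in-flight read

    def read_reply(self, slot, callback):
        self.pending_reads[slot] = callback

    def on_read_error(self, slot):
        # Only has an effect if the read is still pending.
        if self.pending_reads.pop(slot, None) is not None:
            self.faulty.add(slot)

    def write_reply(self, slot, reply):
        # Resolve any pending read for the same reply with the fresh
        # copy, before the write starts; its error path can't run later.
        callback = self.pending_reads.pop(slot, None)
        if callback is not None:
            callback(reply)
        self.faulty.discard(slot)

resolved = []
replies = ClientReplies()
replies.read_reply(0, resolved.append)
replies.write_reply(0, "reply-A")  # read resolved by the write's data
assert resolved == ["reply-A"]
replies.on_read_error(0)           # stale disk completion: read already
assert 0 not in replies.faulty     # resolved, so faulty is never set
```

Unlike the guard-based approach, here the interleaving is eliminated rather than detected: the state of `faulty` no longer depends on the order in which IO completions arrive.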
Commit: 7f92126c74c715c295503e7b41f56cef552a550c
Branches: main
Duration to run seed in ReleaseSafe mode: 7s
Stack Trace:
Debug Logs: