VSR: Explicit deprecated message types #2763
## Bug

The `Command` enum lists all message types, with explicit values. But if a replica receives a deprecated message type (e.g. via network replay), we would `switch` on the value, and panic.

Stack (from a replica on 0.16.28):

```
thread 24341 panic: switch on corrupt value
.../tigerbeetle/src/vsr/message_header.zig:144:17: 0x1304df3 in into_any (tigerbeetle)
.../tigerbeetle/src/vsr/message_header.zig:198:30: 0x18d9e98 in peer_type (tigerbeetle)
.../tigerbeetle/src/message_bus.zig:878:78: 0x18330e3 in set_and_verify_peer (tigerbeetle)
.../tigerbeetle/src/message_bus.zig:786:72: 0x17aeffc in parse_message (tigerbeetle)
.../tigerbeetle/src/message_bus.zig:732:48: 0x174e8e3 in parse_messages (tigerbeetle)
.../tigerbeetle/src/message_bus.zig:1007:42: 0x16edbab in on_recv (tigerbeetle)
.../tigerbeetle/src/io/linux.zig:1160:29: 0x16a6046 in wrapper (tigerbeetle)
.../tigerbeetle/src/io/linux.zig:686:40: 0x11f8ff5 in complete (tigerbeetle)
.../tigerbeetle/src/io/linux.zig:192:49: 0x11f775f in flush (tigerbeetle)
.../tigerbeetle/src/io/linux.zig:147:27: 0x1234796 in run_for_ns (tigerbeetle)
.../tigerbeetle/src/tigerbeetle/main.zig:510:38: 0x12362b4 in start (tigerbeetle)
.../tigerbeetle/src/tigerbeetle/main.zig:84:44: 0x12cfb3f in main (tigerbeetle)
.../tigerbeetle/zig/lib/std/start.zig:524:37: 0x11e6335 in posixCallMainAndExit (tigerbeetle)
.../tigerbeetle/zig/lib/std/start.zig:266:5: 0x11e5e51 in _start (tigerbeetle)
???:?:?: 0x4 in ??? (???)
Unwind information for `???:0x4` was not available, trace may be incomplete
```

## Fix

List deprecated message types in the `Command` enum, so that they are handled by switch statements.

Note that when we roll out a new message type, we need to make sure there is a transition release where the message type is received+ignored but not sent, otherwise we would be vulnerable to this bug in the other direction.
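To make the fix concrete, here is a minimal sketch of the idea, written in Rust rather than the project's Zig. Everything in it is an assumption for illustration: the variant names, the numeric values, and the `decode` and `dispatch` helpers are hypothetical and are not taken from the TigerBeetle source. The point is only that a retired wire value keeps a named variant, so decoding succeeds and every exhaustive match is forced to say what happens to it.

```rust
/// Hypothetical, trimmed-down command set. Names and values are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum Command {
    Ping = 1,
    /// Retired wire value, kept so that a delayed or replayed message still decodes.
    Deprecated2 = 2,
    StartView = 3,
}

#[derive(Debug, PartialEq, Eq)]
enum Action {
    Process,
    Ignore,
}

/// Decode the command byte of a header. Values with no variant at all yield `None`.
fn decode(byte: u8) -> Option<Command> {
    match byte {
        1 => Some(Command::Ping),
        2 => Some(Command::Deprecated2),
        3 => Some(Command::StartView),
        _ => None,
    }
}

/// Exhaustive match: the compiler rejects this function if a variant,
/// deprecated or not, is left without an arm.
fn dispatch(command: Command) -> Action {
    match command {
        Command::Ping | Command::StartView => Action::Process,
        Command::Deprecated2 => Action::Ignore,
    }
}
```

With this shape, `dispatch(decode(2).unwrap())` yields `Action::Ignore` instead of reaching a code path that has no arm for the value.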
Reminds me of #1850!
Hm, I am finding myself in a logical paradox here!
So I can't say I see what we should actually do here! A related question: when do we actually hit this? My mental model is that any two releases should be bidirectionally compatible if we do the rollout right. And for three releases, you'd need at least a checkpoint's worth of a gap, so actual network replay seems unlikely? That being said, the diff looks OK to me, and I feel bad about having already delayed #1850 once for wanting a better solution that didn't materialize.
It was a view change concurrent with an upgrade. The new primary is still on 0.16.26 (which sends both versions of the SV message) and is repairing (but still able to send an SV). The crashed replica had just upgraded to 0.16.27 (which only uses the new SV message).
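The mixed-version scenario above can be modelled as a set comparison: a pair of releases is only safe to run side by side if everything one may send is something the other can decode. The sketch below is in Rust and is purely illustrative. The `Release` struct, the `release` and `undecodable` helpers, and the message names `start_view_old` and `start_view_new` are hypothetical; only the "0.16.26 sends both, 0.16.27 uses only the new one" shape comes from the comment.

```rust
use std::collections::BTreeSet;

/// Hypothetical model of a release: what it may send and what it can decode.
struct Release {
    sends: BTreeSet<&'static str>,
    receives: BTreeSet<&'static str>,
}

fn release(sends: &[&'static str], receives: &[&'static str]) -> Release {
    Release {
        sends: sends.iter().copied().collect(),
        receives: receives.iter().copied().collect(),
    }
}

/// Message kinds that `sender` may emit but `receiver` cannot decode.
/// A non-empty result means the pair cannot safely coexist in one cluster.
fn undecodable(sender: &Release, receiver: &Release) -> Vec<&'static str> {
    sender.sends.difference(&receiver.receives).copied().collect()
}

/// The situation described in the comment, under the assumed message names.
fn scenario() -> (Release, Release, Release) {
    // Older release: sends and accepts both versions of the message.
    let older = release(
        &["start_view_old", "start_view_new"],
        &["start_view_old", "start_view_new"],
    );
    // Newer release as described: only knows the new version.
    let newer = release(&["start_view_new"], &["start_view_new"]);
    // Newer release with the deprecated type still listed: decodes the old
    // version (in order to ignore it) but never sends it.
    let newer_fixed = release(
        &["start_view_new"],
        &["start_view_old", "start_view_new"],
    );
    (older, newer, newer_fixed)
}
```

Under these assumptions, `undecodable(&older, &newer)` reports `start_view_old`, while `undecodable(&older, &newer_fixed)` is empty. This is the same "received+ignored but not sent" rule from the PR description, applied to a type that is being retired rather than introduced.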
Aha! So that is the bug in the upgrade! If one version sends old & new, the next one should receive both, and explicitly ignore the old one, as was done here: #2211. Or rather, it is an open question whether that situation is a bug. It can be considered a bug if we want to super-explicitly handle all version transitions, so that there are no unknown messages at all. It can also be considered not-a-bug if we want to bless "ignoring unknown messages". Even if it is a bug, it is perhaps best to ignore the message and just log it with err?
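One way to picture the options being weighed here is as a three-way classification of the raw command byte. The following Rust sketch is an assumption-laden illustration, not what the PR implements: the `Disposition` type, the `classify` function, and the `LIVE` and `DEPRECATED` tables with their values are all hypothetical, and whether unknown values should be dropped at all is exactly the open question in this thread.

```rust
/// Hypothetical outcome for an incoming message, decided from its command byte.
#[derive(Debug, PartialEq, Eq)]
enum Disposition {
    /// Known, live message type: hand it to the replica.
    Handle,
    /// Known but deprecated: drop it and log at warning level.
    DropAndWarn,
    /// Not a value this release knows about: drop it and log at error level.
    DropAndError,
}

// Illustrative wire values only; not taken from the real `Command` enum.
const LIVE: [u8; 2] = [1, 3];
const DEPRECATED: [u8; 1] = [2];

fn classify(command_byte: u8) -> Disposition {
    if LIVE.contains(&command_byte) {
        Disposition::Handle
    } else if DEPRECATED.contains(&command_byte) {
        Disposition::DropAndWarn
    } else {
        Disposition::DropAndError
    }
}
```

Keeping deprecated and unknown values as separate outcomes lets a release stay strict about values it has never seen while tolerating ones it has explicitly retired.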
Our network fault model allows arbitrarily delayed/replayed packets, so I think an extra release stage would not be sufficient.
Do you mean ignoring unknown future messages, not just deprecated ones? e.g. not just gaps in
matklad left a comment
Approving!
I don't think there's anything specifically wrong here, and it definitely would prevent a bug we've stepped into, but I must say I am still not happy about the MessageBus interface. It feels like we haven't quite figured this out!
But I don't want to block progress (again) over my confusion :)
Ah, I didn't quite make that distinction! Ok, yeah, ignoring known deprecated messages seems good! (Though perhaps we still want to log.warn this, if we think we should have one release that explicitly drops them without quite type-erasing them into deprecated completely.)