Skip to content

VSR: Explicit deprecated message types#2763

Merged
sentientwaffle merged 1 commit intomainfrom
dj-message-type-deprecated-switch
Feb 22, 2025
Merged

VSR: Explicit deprecated message types#2763
sentientwaffle merged 1 commit intomainfrom
dj-message-type-deprecated-switch

Conversation

@sentientwaffle
Copy link
Copy Markdown
Member

Bug

The Command enum lists all message types, with explicit values. But if a replica receives a deprecated message type (e.g. via network replay), we would switch on the value, and panic.

Stack (from a replica on 0.16.28):

thread 24341 panic: switch on corrupt value
.../tigerbeetle/src/vsr/message_header.zig:144:17: 0x1304df3 in into_any (tigerbeetle)
.../tigerbeetle/src/vsr/message_header.zig:198:30: 0x18d9e98 in peer_type (tigerbeetle)
.../tigerbeetle/src/message_bus.zig:878:78: 0x18330e3 in set_and_verify_peer (tigerbeetle)
.../tigerbeetle/src/message_bus.zig:786:72: 0x17aeffc in parse_message (tigerbeetle)
.../tigerbeetle/src/message_bus.zig:732:48: 0x174e8e3 in parse_messages (tigerbeetle)
.../tigerbeetle/src/message_bus.zig:1007:42: 0x16edbab in on_recv (tigerbeetle)
.../tigerbeetle/src/io/linux.zig:1160:29: 0x16a6046 in wrapper (tigerbeetle)
.../tigerbeetle/src/io/linux.zig:686:40: 0x11f8ff5 in complete (tigerbeetle)
.../tigerbeetle/src/io/linux.zig:192:49: 0x11f775f in flush (tigerbeetle)
.../tigerbeetle/src/io/linux.zig:147:27: 0x1234796 in run_for_ns (tigerbeetle)
.../tigerbeetle/src/tigerbeetle/main.zig:510:38: 0x12362b4 in start (tigerbeetle)
.../tigerbeetle/src/tigerbeetle/main.zig:84:44: 0x12cfb3f in main (tigerbeetle)
.../tigerbeetle/zig/lib/std/start.zig:524:37: 0x11e6335 in posixCallMainAndExit (tigerbeetle)
.../tigerbeetle/zig/lib/std/start.zig:266:5: 0x11e5e51 in _start (tigerbeetle)
???:?:?: 0x4 in ??? (???)
Unwind information for `???:0x4` was not available, trace may be incomplete

Fix

List deprecated message types in the Command enum, so that they are handled by switch statements.

Note that when we roll out a new message type, we need to make sure there is a transition release where the message type is received+ignored but not sent, otherwise we would be vulnerable to this bug in the other direction.

## Bug

The `Command` enum lists all message types, with explicit values.
But if a replica receives a deprecated message type (e.g. via network replay), we would `switch` on the value, and panic.

Stack (from a replica on 0.16.28):

```
thread 24341 panic: switch on corrupt value
.../tigerbeetle/src/vsr/message_header.zig:144:17: 0x1304df3 in into_any (tigerbeetle)
.../tigerbeetle/src/vsr/message_header.zig:198:30: 0x18d9e98 in peer_type (tigerbeetle)
.../tigerbeetle/src/message_bus.zig:878:78: 0x18330e3 in set_and_verify_peer (tigerbeetle)
.../tigerbeetle/src/message_bus.zig:786:72: 0x17aeffc in parse_message (tigerbeetle)
.../tigerbeetle/src/message_bus.zig:732:48: 0x174e8e3 in parse_messages (tigerbeetle)
.../tigerbeetle/src/message_bus.zig:1007:42: 0x16edbab in on_recv (tigerbeetle)
.../tigerbeetle/src/io/linux.zig:1160:29: 0x16a6046 in wrapper (tigerbeetle)
.../tigerbeetle/src/io/linux.zig:686:40: 0x11f8ff5 in complete (tigerbeetle)
.../tigerbeetle/src/io/linux.zig:192:49: 0x11f775f in flush (tigerbeetle)
.../tigerbeetle/src/io/linux.zig:147:27: 0x1234796 in run_for_ns (tigerbeetle)
.../tigerbeetle/src/tigerbeetle/main.zig:510:38: 0x12362b4 in start (tigerbeetle)
.../tigerbeetle/src/tigerbeetle/main.zig:84:44: 0x12cfb3f in main (tigerbeetle)
.../tigerbeetle/zig/lib/std/start.zig:524:37: 0x11e6335 in posixCallMainAndExit (tigerbeetle)
.../tigerbeetle/zig/lib/std/start.zig:266:5: 0x11e5e51 in _start (tigerbeetle)
???:?:?: 0x4 in ??? (???)
Unwind information for `???:0x4` was not available, trace may be incomplete
```

## Fix

List deprecated message types in the `Command` enum, so that they are handled by switch statements.

Note that when we roll out a new message type, we need to make sure there is a transition release where the message type is received+ignored but not sent, otherwise we would be vulnerable to this bug in the other direction.
@sentientwaffle sentientwaffle marked this pull request as ready for review February 21, 2025 22:20
@matklad
Copy link
Copy Markdown
Member

matklad commented Feb 21, 2025

Reminds me of #1850!

@matklad
Copy link
Copy Markdown
Member

matklad commented Feb 21, 2025

Hm, I am finding myself in a logical paradox here!

  • I think we shouldn't introduce explicit deprecated states, and instead drop these messages at the message bust
  • But message bus is not great place for tricky logic, as it isn't fuzzed, so, ideally, we should move the logic for dealing with raw, pre-parsed messages to the replica, as that is fuzzed
  • Except that won't actually help us, as the replica is fuzzed on the message level! The fake message bus doesn't do raw byte copying, it sends messaged directly from replica A to replica B.

So I can't say I see what we actually should do here!


A related question: when do we actually hit this? My mental models is that any two releases should be bidirectionally compatible, if we do rollout right. And for three releases, you'll need to have at least a checkpoint-worth of a gap, so actual network replay seems unlikely?


That being said, the diff looks OK to me, and I am feeling bad delaying #1850 once already for wanting a better solution that didn't materialize.

@sentientwaffle
Copy link
Copy Markdown
Member Author

A related question: when do we actually hit this? My mental models is that any two releases should be bidirectionally compatible, if we do rollout right. And for three releases, you'll need to have at least a checkpoint-worth of a gap, so actual network replay seems unlikely?

It was a view change concurrent with an upgrade. The new primary is still on 0.16.26 (which sends both versions of the SV message) and is repairing (but still able to send a SV). The crashed replica had just upgraded to 0.16.27 (which only uses the new SV message).

@matklad
Copy link
Copy Markdown
Member

matklad commented Feb 21, 2025

Aha! So that is the bug in the upgrade! If one versions sends old&new, the next one should receive both, and explicitly ignore the old one, like was done here: #2211.

Or rather, it is an open question whether that situation is a bug. It can be considered a bug, if we want to super-explicitly handle all version transitions, to not have unknown messages at all. It can also be considered non-a-bug, if we want to bless "ignoring unknown messages".

Even if it is a bug, it is perhaps best to ignore the message and just log it with err?

@sentientwaffle
Copy link
Copy Markdown
Member Author

Our network fault model allows arbitrarily delayed/replayed packets, so I think an extra release stage would not be sufficient.

if we want to bless "ignoring unknown messages".

Do you mean ignoring unknown future messages, not just deprecated ones? e.g. not just gaps in Command, but also values not yet used? I think I'd be in favor of that approach -- we'd still be logging them, but it seems less fragile, even if it shouldn't strictly be necessary with how we do upgrades.

Copy link
Copy Markdown
Member

@matklad matklad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving!

I don't think there's anything specifically wrong here, and it definitely would prevent a bug we've stepped into, but I must say I am still not happy about MessageBus interface. It feels like we haven't quite figured this out!

But I dont' want to block the progress (again) over my confusion :)

@sentientwaffle sentientwaffle added this pull request to the merge queue Feb 21, 2025
@matklad
Copy link
Copy Markdown
Member

matklad commented Feb 21, 2025

Do you mean ignoring unknown future messages, not just deprecated ones? e.g. not just gaps in Command, but also values not yet used?

Ah, I didn't quite make that distinction! Ok, yeah, ignoring known deprecated messages seems good! (though, perhaps we still want to log.warn this, if we think we should have one release that explicitly dropes them without quite type-erasing them into deprecated completely)

Merged via the queue into main with commit 9721f28 Feb 22, 2025
@sentientwaffle sentientwaffle deleted the dj-message-type-deprecated-switch branch February 22, 2025 00:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants