Full log replay: unavoidable mug mismatches (v1.5 -> v1.8) #5836
Comments
To be clear, even replaying with the exact same version of 1.5 that you had used previously ended up hitting mug mismatches? Or: do you know if the mismatch still occurs when using the unpatched 1.5? And you hadn't used an even older runtime prior, right?
Hi @Fang-. I only tried once with the original v1.5 and that got OOM killed after replaying almost 1.9M events. I guess I could try again with more memory available, because I don't know if the mug mismatch will occur with the original binary. This ship was first run with the v1.0 binary, to the best of my knowledge. I never tried replaying the event log with any binary older than v1.5.
If this was a
Not sure if we expect mismatches between it and 1.5. I don't recall hearing about any, but @joemfb might be able to tell with more certainty.
The mismatch was introduced in v1.6, so you should be able to do a full replay with the patched v1.5 binaries. Use the partial replay argument ( |
I was able to complete the replay with your patched v1.5 binaries, thanks for providing those. No memory problems whatsoever. The OOM I mentioned (not a I haven't yet dared to start the ship with (non-local) networking because I don't understand what the consequences are of all the mug mismatches that occurred during the replay. Should I attempt the replay from scratch with the unpatched v1.5 runtime to see if that avoids the mug mismatches, as Fang suggests? In case mug mismatches don't always cause serious problems, I'd like to compare the peer state of the crashed ship and my planet. I've been looking at the |
This is still an issue. A workaround not yet documented here is to restore the backup snapshot (at |
Describe the bug
I want to do a full event log replay because I let the snapshot of a star get corrupted during a (dirty) reboot. The ship was last run successfully on v1.5 binaries, not with any more recent binaries. During log replay several binary versions (patched v1.5 & v1.8) start printing `mug mismatch` from a specific event onward. It hasn't been possible to do the full replay with a recent binary due to a jet mismatch. I've tried switching from v1.5 to v1.8 just before the mug mismatch starts occurring, but to no avail. Log replay with the v1.8 binary has eventually crashed on each attempt with the following error:
My attempts to complete the log replay have been informed by these helpful discussions: #5304, #5488, #5528, and #5551.
To Reproduce
Steps to reproduce the behaviour:
.urb/chk
Expected behaviour
I expected to be able to replay the event log and have my ship end up in the state it was in when it last ran. Fingers crossed that I don't have to breach.
System:
Additional context
I'll try to take you through some of my thinking and the steps I took so far.
Container logs show that the ship wasn't able to start immediately after a forced reboot of the host.
BTRFS logs showed that `north.bin` had corrupted blocks that were not recoverable by the filesystem.
I removed the snapshot from my pier. It took two attempts to get log replay to start.
After removing `north.bin` & `south.bin` from `.urb/chk`:
I restarted urbit without making any changes:
This was still running with the unpatched v1.5 binary. It got OOM killed with cgroup max 4 GB.
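The snapshot removal described above can be done non-destructively. The following is a minimal sketch, assuming the ship is fully stopped and that the snapshot consists of the `north.bin` and `south.bin` files under `.urb/chk`, as mentioned in this report. The function name `set_aside_snapshot` and the backup location are hypothetical, not part of any Urbit tooling.

```python
from pathlib import Path
from typing import List
import shutil

# Snapshot file names as mentioned in this issue.
SNAPSHOT_FILES = ("north.bin", "south.bin")


def set_aside_snapshot(pier: Path, backup_dir: Path) -> List[Path]:
    """Move snapshot files out of <pier>/.urb/chk into backup_dir.

    Hypothetical helper: only run this while the ship is stopped.
    Returns the destination paths of the files that were moved.
    """
    chk = pier / ".urb" / "chk"
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in SNAPSHOT_FILES:
        src = chk / name
        if src.exists():
            dst = backup_dir / name
            shutil.move(str(src), str(dst))
            moved.append(dst)
    return moved
```

Moving the files instead of deleting them keeps the damaged snapshot available for later inspection.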
After reading through similar issues #5304 and #5551, I started doing incremental replay with the patched binaries from #5304 (comment).
I found the first occurrence of `mug mismatch` and took a snapshot a couple hundred events earlier.
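Finding the first mismatch and picking a snapshot point a couple hundred events before it can be scripted. This is a rough sketch, assuming the replay output was captured to a text file and that mismatch lines look like the one quoted in this report (`pier: (3712152): play: mug mismatch 30127ac9 6303093a`). The function names and the default margin of 200 events are illustrative assumptions.

```python
import re
from typing import Iterable, Optional

# Matches lines such as:
#   pier: (3712152): play: mug mismatch 30127ac9 6303093a
MISMATCH_RE = re.compile(
    r"pier: \((\d+)\): play: mug mismatch ([0-9a-f]+) ([0-9a-f]+)"
)


def first_mismatch_event(lines: Iterable[str]) -> Optional[int]:
    """Return the event number of the first mismatch line in log order."""
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            return int(match.group(1))
    return None


def snapshot_target(first_mismatch: int, margin: int = 200) -> int:
    """Pick an event number `margin` events before the first mismatch."""
    return max(first_mismatch - margin, 1)
```

Given a captured log, `first_mismatch_event(open("replay.log"))` yields the event number to stay below, where `replay.log` is a hypothetical file name.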
Decided to take a snapshot at 3710148:
I proceeded with the v1.8 binary and attempted to replay the rest of the log, starting from the snapshot.
It started to print `mug mismatch` at the same point v1.5 did, and eventually failed with `play: %hear event on /a/ames failed`.
Perhaps I hadn't let v1.5 make the snapshot close enough to the first mug mismatch. According to #5488 (comment) and #5528 (comment) the correct/ideal point to switch binaries is right before the mug mismatches start. I took another snapshot with v1.5 at event `3711651`.
Continuing the replay with v1.8 led to an identical outcome as my previous attempt.
pier: (3712152): play: mug mismatch 30127ac9 6303093a
play: %hear event on /a/ames failed
afterpier: (3764270): play: bail
You'd almost think this was deterministic :P
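Whether two replay attempts really behave identically can be checked by comparing the mismatch lines they printed. The sketch below assumes each attempt's output was saved to its own file and that lines follow the format quoted above. What the two hex values on each line mean is not established here, so they are only compared as opaque strings. All function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Matches lines such as:
#   pier: (3712152): play: mug mismatch 30127ac9 6303093a
MISMATCH_RE = re.compile(
    r"pier: \((\d+)\): play: mug mismatch ([0-9a-f]+) ([0-9a-f]+)"
)


def collect_mismatches(lines: Iterable[str]) -> Dict[int, Tuple[str, str]]:
    """Map event number to the pair of hex values printed for it."""
    found = {}
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            found[int(match.group(1))] = (match.group(2), match.group(3))
    return found


def differing_events(
    run_a: Dict[int, Tuple[str, str]],
    run_b: Dict[int, Tuple[str, str]],
) -> List[int]:
    """Events reported by only one run, or with different hex values."""
    events = run_a.keys() | run_b.keys()
    return sorted(e for e in events if run_a.get(e) != run_b.get(e))
```

An empty result from `differing_events` would support the impression that the mismatches are deterministic across attempts.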
I've made another snapshot with v1.5 at event `4231000`, and continued with v1.8 from there.
This, of course, shows mug mismatches from the start. It eventually bailed in a similar way as before.
Ever since, I've been doing incremental replay with the patched v1.5 binary without trying the v1.8 binary in between. At the time of writing there are about a million more events to go before I reach the end. I'm mostly wondering if there is any point in me continuing this process. Even if I manage to finish the replay using v1.5 or by switching to v1.8 at the point where it fails, won't the networking state of this ship be messed up in a way that will require me to breach after booting it?
All things considered I would prefer to be able to see the data I had before abandoning it, so I will continue the replay with mug mismatches. If I successfully boot the ship without networking (`-L`), is there a way for me to compare the Ames state this ship knows about with what e.g. my planet expects it to be?
Notify maintainers
Notifying @Fang- and @joemfb because of their help in solving similar issues, and for providing explicit instructions to avoid mug mismatches.