Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full log replay: unavoidable mug mismatches (v1.5 -> v1.8) #5836

Open
aolieman opened this issue Jun 8, 2022 · 6 comments
Open

Full log replay: unavoidable mug mismatches (v1.5 -> v1.8) #5836

aolieman opened this issue Jun 8, 2022 · 6 comments
Labels

Comments

@aolieman
Copy link

aolieman commented Jun 8, 2022

Describe the bug
I want to do a full event log replay because I let the snapshot of a star get corrupted during a (dirty) reboot. The ship was last run successfully on v1.5 binaries, not with any more recent binaries. During log replay several binary versions (patched v1.5 & v1.8) start printing mug mismatch from a specific event onward. It hasn't been possible to do the full replay with a recent binary due to a jet mismatch. I've tried switching from v1.5 to v1.8 just before the mug mismatch starts occurring, but to no avail. Log replay with the v1.8 binary has eventually crashed on each attempt with the following error:

play: bail: %exit
play: %hear event on /a/ames failed

My attempts to complete the log replay have been informed by these helpful discussions: #5304, #5488, #5528, and #5551.

To Reproduce
Steps to reproduce the behaviour:

  1. run a ship with the v1.5 binaries until April 2022;
  2. corrupt the snapshot and remove north.bin & south.bin from .urb/chk;
  3. attempt log replay with (patched) v1.5 binaries, and with v1.8 binaries.

Expected behaviour
I expected to be able to replay the event log and have my ship end up in the state it was in when it last ran. Fingers crossed that I don't have to breach.

System:

Additional context
I'll try to take you through some of my thinking and the steps I took so far.

Container logs show that the ship wasn't able to start immediately after a forced reboot of the host.
2022-04-16T07:56:51.424645531Z ~winlur:dojo> 2022-04-16T07:59:22.016002068Z ~
2022-04-16T07:59:22.805101152Z boot: core limit: Invalid argument
2022-04-16T07:59:22.805140711Z loom: mapped 2048MB
2022-04-16T07:59:28.063002953Z boot: protected loom
2022-04-16T07:59:28.063033012Z live: loaded: MB/913.571.840
2022-04-16T07:59:28.077097060Z home: bailing out
2022-04-16T07:59:28.095745219Z urbit 1.5
2022-04-16T07:59:28.095792367Z boot: home is /urbit/piers/winlur
2022-04-16T07:59:28.095798767Z boot: core limit: Invalid argument
2022-04-16T07:59:28.095803547Z loom: mapped 2048MB
2022-04-16T07:59:28.095807937Z lite: arvo formula 79925cca
2022-04-16T07:59:28.095812056Z lite: core 59f6958
2022-04-16T07:59:28.095816006Z lite: final state 59f6958
2022-04-16T07:59:28.095819926Z pier: serf unexpectedly shut down

BTRFS logs showed that north.bin had corrupted blocks that were not recoverable by the filesystem.

I removed the snapshot from my pier. It took two attempts to get log replay to start.

After removing north.bin & south.bin from .urb/chk:

2022-05-17T14:42:21.344325215Z ~
2022-05-17T14:42:22.522991571Z urbit 1.5
2022-05-17T14:42:22.523017351Z boot: home is /urbit/piers/winlur
2022-05-17T14:42:22.523023861Z boot: core limit: Invalid argument
2022-05-17T14:42:22.523029661Z loom: mapped 2048MB
2022-05-17T14:42:22.523035011Z lite: arvo formula 79925cca
2022-05-17T14:52:53.689077636Z boot: installed 289 jets
2022-05-17T14:42:22.523045091Z lite: final state 59f6958
2022-05-17T14:42:22.523050291Z pier: serf unexpectedly shut down

I restarted urbit without making any changes:

2022-05-17T14:52:53.045490234Z ~
2022-05-17T14:52:53.689238575Z urbit 1.5
2022-05-17T14:52:53.689262915Z boot: home is /urbit/piers/winlur
2022-05-17T14:52:53.689273535Z boot: core limit: Invalid argument
2022-05-17T14:52:53.689283205Z loom: mapped 2048MB
2022-05-17T14:52:53.689292445Z lite: arvo formula 79925cca
2022-05-17T14:52:53.689301425Z lite: core 59f6958
2022-05-17T14:52:53.689310295Z lite: final state 59f6958
2022-05-17T14:52:53.689323205Z ---------------- playback starting ----------------
2022-05-17T14:52:53.689335785Z pier: replaying events 1-8636084
2022-05-17T14:52:56.560255784Z 1-b
2022-05-17T14:52:56.560311884Z 1-c (compiling compiler, wait a few minutes)
...
2022-05-17T20:06:32.122864900Z pier: (1897287): play: done
2022-05-17T20:06:35.050413386Z pier: (1897788): play: done
2022-05-17T20:06:38.141196246Z pier: (1898289): play: done
2022-05-17T20:06:41.205957000Z pier: (1898790): play: done
2022-05-17T20:06:44.241576714Z pier: (1899291): play: done

This was still running with the unpatched v1.5 binary. It got OOM killed with cgroup max 4 GB.

After reading through similar issues #5304 and #5551, I started doing incremental replay with the patched binaries from #5304 (comment).

I found the first occurrence of `mug mismatch` and took a snapshot a couple hundred events earlier.
2022-05-19T02:14:25.802317837Z ^M^Mkiln: OTA to %kids succeeded from %kids on ~lur
2022-05-19T02:14:31.943074061Z ^M^Mpier: (3712152): play: done
2022-05-19T02:14:31.943194870Z ^M^Mpier: (3712152): play: mug mismatch 30127ac9 1d47ad60
2022-05-19T02:14:36.432553119Z ^M^Mpier: (3712653): play: done
2022-05-19T02:14:36.432703648Z ^M^Mpier: (3712653): play: mug mismatch 59713af5 586eee45
2022-05-19T02:14:39.336100033Z ^M^Mpier: (3713154): play: done
2022-05-19T02:14:39.336151552Z ^M^Mpier: (3713154): play: mug mismatch 26f03065 1db15c21
2022-05-19T02:14:43.332572735Z ^M^Mpier: (3713655): play: done

Decided to take a snapshot at 3710148:

pier: (3709146): play: done
pier: (3709647): play: done
pier: (3710148): play: done
---------------- playback complete ----------------
pier: replay barrier reached, cramming
serf (3710148): saving rock
hash-cons arena:
  root: B/144
  atoms (130018):
    refs: MB/3.928.360
    data: MB/43.700.168
    dict: MB/6.602.200
  total: MB/54.230.728
  cells (5171821):
    refs: MB/114.057.740
    dict: MB/501.854.056
  total: MB/615.911.796
total: MB/670.142.668

serf: measuring memory:
  kernel: MB/137.422.536
total arvo stuff: MB/137.422.536
  warm jet state: KB/77.384
  cold jet state: MB/33.203.820
  hank cache: B/288
  battery hash cache: B/288
  hot jet state: KB/142.816
total jet stuff: MB/33.424.596
  bytecode cache: B/288
total nock stuff: B/288
  memoization cache: B/288
total road stuff: B/288
total marked: MB/170.847.708
free lists: KB/139.892
sweep: MB/170.847.708

pier: cram complete, shutting down

I proceeded with the v1.8 binary and attempted to replay the rest of the log, starting from the snapshot.

It started to print `mug mismatch` at the same point v1.5 did, and eventually failed with `play: %hear event on /a/ames failed`.
root@8144befbc253:/urbit# ./urbit-v1.8-x86_64-linux/urbit -r 3710148 piers/winlur/
~
urbit 1.8
boot: home is /urbit/piers/winlur
boot: core limit: Invalid argument
loom: mapped 2048MB
lite: arvo formula 27e494c5
lite: core 7b144622
lite: final state 7b144622
boot: core limit: Invalid argument
loom: mapped 2048MB
boot: protected loom
live: loaded: MB/171.016.192
boot: installed 348 jets
---------------- playback starting ----------------
pier: replaying events 3710149-8636084
kiln: downloading OTA update from %kids on ~lur
pier: (3710649): play: done
pier: (3711150): play: done
kiln: finished downloading OTA from %kids on ~lur
kiln: applying OTA to %home from %kids on ~lur
kiln: OTA to %home succeeded from %kids on ~lur
kiln: applying OTA to %kids from %kids on ~lur
>   arvo: beginning upgrade
arvo: compiling hoon
...
kiln: downloading OTA update from %kids on ~lur
pier: (3711651): play: done
kiln: finished downloading OTA from %kids on ~lur
kiln: applying OTA to %home from %kids on ~lur
kiln: OTA to %home succeeded from %kids on ~lur
kiln: applying OTA to %kids from %kids on ~lur
vane %clay: ~marsyn-sovmel
gall: read-at-aeon fail [care=%c case=[%ud p=73] path=/graph-update/resource]
[%no-cast-from %graph-update %resource]
[%error-building-cast %graph-update %resource]
[%error-building-tube %graph-update %resource]
gall: read-at-aeon fail [care=%c case=[%ud p=73] path=/group-update/resource]
[%no-cast-from %group-update %resource]
[%error-building-cast %group-update %resource]
[%error-building-tube %group-update %resource]
kiln: OTA to %kids succeeded from %kids on ~lur
pier: (3712152): play: done
pier: (3712152): play: mug mismatch 30127ac9 6303093a
pier: (3712653): play: done
pier: (3712653): play: mug mismatch 59713af5 56975f6c
pier: (3713154): play: done
pier: (3713154): play: mug mismatch 26f03065 2c23b9b1
pier: (3713655): play: done
pier: (3713655): play: mug mismatch 3bb81440 70236241
pier: (3714156): play: done
...
spider got sign for non-existent ~.eth-watcher--0v3l7.9qm8g.urq3c.nsnur.onk0e.rgn0p.thj8g.c879b.r5pk7.vc9nm.egbg9.paium.kok8l.027rc.pl1v1.ndbcg.7lpp5.ec8sn.1c5fc.r0j6r.hd797
pier: (3764269): play bail: mug mismatch 54501b73 7bf2ae52
pier: (3764270): play: bail

[%poke %hear]
bar-stack=~[~[//ames]]
call: failed
/sys/vane/ames/hoon:<[880 3].[909 20]>
/sys/vane/ames/hoon:<[882 3].[909 20]>
/sys/vane/ames/hoon:<[883 3].[909 20]>
/sys/vane/ames/hoon:<[885 3].[909 20]>
/sys/vane/ames/hoon:<[886 5].[907 7]>
/sys/vane/ames/hoon:<[889 5].[907 7]>
/sys/vane/ames/hoon:<[895 5].[907 7]>
/sys/vane/ames/hoon:<[897 14].[897 53]>
/sys/vane/ames/hoon:<[1.243 5].[1.243 43]>
/sys/vane/ames/hoon:<[1.249 5].[1.263 17]>
/sys/vane/ames/hoon:<[1.251 5].[1.263 17]>
/sys/vane/ames/hoon:<[1.254 5].[1.263 17]>
/sys/vane/ames/hoon:<[1.330 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.331 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.334 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.342 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.343 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.344 5].[1.379 62]>
[%ames-crash-on-packet-from ~lur]
/sys/vane/ames/hoon:<[1.345 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.349 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.378 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.379 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.379 10].[1.379 62]>
/sys/vane/ames/hoon:<[1.939 7].[1.964 69]>
/sys/vane/ames/hoon:<[1.942 7].[1.964 69]>
/sys/vane/ames/hoon:<[1.944 7].[1.964 69]>
/sys/vane/ames/hoon:<[1.946 7].[1.964 69]>
/sys/vane/ames/hoon:<[1.958 7].[1.964 69]>
/sys/vane/ames/hoon:<[1.964 7].[1.964 69]>
/sys/vane/ames/hoon:<[2.111 7].[2.175 9]>
/sys/vane/ames/hoon:<[2.114 7].[2.175 9]>
/sys/vane/ames/hoon:<[2.117 7].[2.175 9]>
/sys/vane/ames/hoon:<[2.118 7].[2.175 9]>
/sys/vane/ames/hoon:<[2.118 47].[2.118 71]>
/sys/vane/ames/hoon:<[2.336 5].[2.343 7]>
/sys/vane/ames/hoon:<[2.338 5].[2.343 7]>
/sys/vane/ames/hoon:<[2.338 9].[2.338 29]>
/sys/vane/ames/hoon:<[2.348 5].[2.359 7]>
/sys/vane/ames/hoon:<[2.350 5].[2.359 7]>
/sys/vane/ames/hoon:<[2.354 9].[2.357 11]>
/sys/vane/ames/hoon:<[2.356 15].[2.356 79]>
/sys/vane/ames/hoon:<[2.391 5].[2.460 7]>
/sys/vane/ames/hoon:<[2.394 5].[2.460 7]>
play: bail: %exit
play: %hear event on /a/ames failed

Perhaps I hadn't let v1.5 make the snapshot close enough to the first mug mismatch. According to #5488 (comment) and #5528 (comment) the correct/ideal point to switch binaries is right before the mug mismatches start. I took another snapshot with v1.5 at event 3711651.

Continuing the replay with v1.8 led to an identical outcome as my previous attempt.
root@8144befbc253:/urbit# ./urbit-v1.8-x86_64-linux/urbit -r 3711651 piers/winlur/
~
urbit 1.8
boot: home is /urbit/piers/winlur
boot: core limit: Invalid argument
loom: mapped 2048MB
lite: arvo formula 27e494c5
lite: core 7b144622
lite: final state 7b144622
boot: core limit: Invalid argument
loom: mapped 2048MB
boot: protected loom
live: loaded: MB/172.228.608
boot: installed 348 jets
---------------- playback starting ----------------
pier: replaying events 3711652-8636084
kiln: finished downloading OTA from %kids on ~lur
kiln: applying OTA to %home from %kids on ~lur
kiln: OTA to %home succeeded from %kids on ~lur
kiln: applying OTA to %kids from %kids on ~lur
vane %clay: ~marsyn-sovmel
gall: read-at-aeon fail [care=%c case=[%ud p=73] path=/graph-update/resource]
[%no-cast-from %graph-update %resource]
[%error-building-cast %graph-update %resource]
[%error-building-tube %graph-update %resource]
gall: read-at-aeon fail [care=%c case=[%ud p=73] path=/group-update/resource]
[%no-cast-from %group-update %resource]
[%error-building-cast %group-update %resource]
[%error-building-tube %group-update %resource]
kiln: OTA to %kids succeeded from %kids on ~lur
pier: (3712152): play: done
pier: (3712152): play: mug mismatch 30127ac9 6303093a
[same result as the previous v1.8 attempt]
...
pier: (3764269): play bail: mug mismatch 54501b73 7bf2ae52
pier: (3764270): play: bail

[%poke %hear]
bar-stack=~[~[//ames]]
call: failed
/sys/vane/ames/hoon:<[880 3].[909 20]>
/sys/vane/ames/hoon:<[882 3].[909 20]>
/sys/vane/ames/hoon:<[883 3].[909 20]>
/sys/vane/ames/hoon:<[885 3].[909 20]>
/sys/vane/ames/hoon:<[886 5].[907 7]>
/sys/vane/ames/hoon:<[889 5].[907 7]>
/sys/vane/ames/hoon:<[895 5].[907 7]>
/sys/vane/ames/hoon:<[897 14].[897 53]>
/sys/vane/ames/hoon:<[1.243 5].[1.243 43]>
/sys/vane/ames/hoon:<[1.249 5].[1.263 17]>
/sys/vane/ames/hoon:<[1.251 5].[1.263 17]>
/sys/vane/ames/hoon:<[1.254 5].[1.263 17]>
/sys/vane/ames/hoon:<[1.330 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.331 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.334 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.342 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.343 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.344 5].[1.379 62]>
[%ames-crash-on-packet-from ~lur]
/sys/vane/ames/hoon:<[1.345 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.349 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.378 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.379 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.379 10].[1.379 62]>
/sys/vane/ames/hoon:<[1.939 7].[1.964 69]>
/sys/vane/ames/hoon:<[1.942 7].[1.964 69]>
/sys/vane/ames/hoon:<[1.944 7].[1.964 69]>
/sys/vane/ames/hoon:<[1.946 7].[1.964 69]>
/sys/vane/ames/hoon:<[1.958 7].[1.964 69]>
/sys/vane/ames/hoon:<[1.964 7].[1.964 69]>
/sys/vane/ames/hoon:<[2.111 7].[2.175 9]>
/sys/vane/ames/hoon:<[2.114 7].[2.175 9]>
/sys/vane/ames/hoon:<[2.117 7].[2.175 9]>
/sys/vane/ames/hoon:<[2.118 7].[2.175 9]>
/sys/vane/ames/hoon:<[2.118 47].[2.118 71]>
/sys/vane/ames/hoon:<[2.336 5].[2.343 7]>
/sys/vane/ames/hoon:<[2.338 5].[2.343 7]>
/sys/vane/ames/hoon:<[2.338 9].[2.338 29]>
/sys/vane/ames/hoon:<[2.348 5].[2.359 7]>
/sys/vane/ames/hoon:<[2.350 5].[2.359 7]>
/sys/vane/ames/hoon:<[2.354 9].[2.357 11]>
/sys/vane/ames/hoon:<[2.356 15].[2.356 79]>
/sys/vane/ames/hoon:<[2.391 5].[2.460 7]>
/sys/vane/ames/hoon:<[2.394 5].[2.460 7]>
play: bail: %exit
play: %hear event on /a/ames failed
  • First mug mismatch: pier: (3712152): play: mug mismatch 30127ac9 6303093a
  • play: %hear event on /a/ames failed after pier: (3764270): play: bail

You'd almost think this was deterministic :P

I've made another snapshot with v1.5 at event 4231000, and continued with v1.8 from there.

This, of course, shows mug mismatches from the start. It eventually bailed in a similar way as before.
root@8144befbc253:/urbit# ./urbit-v1.8-x86_64-linux/urbit -r 4231000 -n 5432000 piers/winlur/
~
urbit 1.8
boot: home is /urbit/piers/winlur
boot: core limit: Invalid argument
loom: mapped 2048MB
lite: arvo formula 27e494c5
lite: core 7b144622
lite: final state 7b144622
boot: core limit: Invalid argument
loom: mapped 2048MB
boot: protected loom
live: loaded: MB/181.682.176
boot: installed 348 jets
pier: replay till 5432000
---------------- playback starting ----------------
pier: replaying events 4231001-5432000
pier: (4231501): play: done
pier: (4231501): play: mug mismatch 33b1adbf 3152aaad
pier: (4232002): play: done
pier: (4232002): play: mug mismatch 4a85ba8e 7483a2cd
pier: (4232503): play: done
pier: (4232503): play: mug mismatch 1960d458 7fb4187f
pier: (4233004): play: done
pier: (4233004): play: mug mismatch 7b8e1df3 340afd06
pier: (4233505): play: done
...
kiln: downloading OTA update from %kids on ~lur
pier: (4339717): play: done
pier: (4339717): play: mug mismatch 128727f6 67189feb
kiln: finished downloading OTA from %kids on ~lur
kiln: applying OTA to %home from %kids on ~lur
kiln: OTA to %home succeeded from %kids on ~lur
kiln: applying OTA to %kids from %kids on ~lur
gall: read-at-aeon fail [care=%c case=[%ud p=85] path=/group-update/resource]
[%no-cast-from %group-update %resource]
[%error-building-cast %group-update %resource]
[%error-building-tube %group-update %resource]
gall: read-at-aeon fail [care=%c case=[%ud p=85] path=/graph-update/resource]
[%no-cast-from %graph-update %resource]
[%error-building-cast %graph-update %resource]
[%error-building-tube %graph-update %resource]
gall: reloading agents
gall: reloading %group-view
gall: reloading %graph-pull-hook
gall: reloading %chat-cli
gall: reloading %hark-store
gall: reloading %hark-graph-hook
gall: reloading %dm-hook
gall: reloading %graph-push-hook
gall: reloading %metadata-hook
gall: reloading %metadata-store
gall: reloading %chat-hook
gall: reloading %metadata-pull-hook
gall: reloading %graph-store
gall: reloading %metadata-push-hook
gall: reloading %hark-group-hook
gall: reloading %glob
gall: reloaded agents
kiln: OTA to %kids succeeded from %kids on ~lur
pier: (4340218): play: done
pier: (4340218): play: mug mismatch 1147eca2 67048387
pier: (4340719): play: done
pier: (4340719): play: mug mismatch 140f0e48 85f21c5
[%eyre-unknown-receive 15]
[%eyre-unknown-receive 16]
[%eyre-unknown-receive 17]
[%eyre-unknown-receive 18]
[%eyre-unknown-receive 19]
[%eyre-unknown-receive 20]
[%eyre-unknown-receive 21]
[%eyre-unknown-receive 22]
[%eyre-unknown-receive 23]
[%eyre-unknown-receive 24]
[%eyre-unknown-receive 25]
[%eyre-unknown-receive 26]
[%eyre-unknown-receive 27]
[%eyre-unknown-receive 28]
[%eyre-unknown-receive 29]
[%eyre-unknown-receive 30]
[%eyre-unknown-receive 31]
pier: (4341220): play: done
pier: (4341220): play: mug mismatch 130e756b 4430c8f0
...
pier: (4387813): play: done
pier: (4387813): play: mug mismatch 79961d6a 1a97148f
[%eyre-unknown-receive 910]
[%eyre-unknown-receive 911]
[%eyre-unknown-receive 912]
[%eyre-unknown-receive 913]
spider got sign for non-existent ~.eth-watcher--0v3jg.r4ks1.mjd96.65aud.1kld4.ckdca.jgf9a.t8pkb.8q0nq.3qk84.rnqhg.3a6m6.k0l4e.s4ui2.nd6uq.nr0hh.c9t6r.ujb7b.vkme7.gqcrv.fnfgj
spider got sign for non-existent ~.eth-watcher--0v1f0.cbhma.75fie.0vv01.fc2tf.oap0p.4cj10.g93bs.tv9m0.v43fl.dso01.ptom3.ogfaq.cai91.r3mfb.9g6m9.5ta8o.posfn.l2va0.t1q2u.3f873
spider got sign for non-existent ~.eth-watcher--0vjc.3kr2a.28vbt.tgf97.s6rtk.i9kun.kn6u8.7isuc.2rbff.r68ge.m35ia.eq75n.449jr.t86p4.n54cn.o8rrs.mq98e.3vqtv.5srlp.cn2fg.p8210
spider got sign for non-existent ~.eth-watcher--0v1h3.ssob0.7rdev.8l2rt.n3p4v.6ff02.u70hs.a0mev.grfdj.u2vag.dkdgs.8k5b6.9lrll.6vao3.d8p9m.7gi1m.k1o36.675fs.s9ldj.qve8h.0itqg
spider got sign for non-existent ~.eth-watcher--0v12g.16ngv.d1cn3.a99vg.oh6hh.40ivt.628ul.shpcj.h67ne.c5tnl.ot5jb.0h3kk.h5v78.d5k9j.qqku5.5iesp.b29kb.m53pf.a31sj.9mafn.jjp0n
spider got sign for non-existent ~.eth-watcher--0v33m.t3ie4.otpks.u2bev.qri1m.scup4.ifgn6.joraa.kbd4q.76l29.ddc1c.ifmf5.ngujg.89p84.f3jc8.bn6gn.ou9jo.7ae08.m650c.5d9rb.0mim5
spider got sign for non-existent ~.eth-watcher--0v34.h7ug1.3e857.irftf.gkpcm.oj62f.1lnsb.bmu1c.pntug.l9u5m.rn9i6.mva6v.kk6qk.psnhs.8ee5i.8ave2.mln4v.ldtia.o1k03.b4tgp.2afac
spider got sign for non-existent ~.eth-watcher--0v2le.07oh5.7m0pm.vq4bm.r57u3.65hk1.npssg.viai8.tnd8b.4njkr.lhrhq.i7h6v.lkk40.gclkq.tsjrg.f6c1i.hneiq.f11po.nqa0c.gr8f5.uf9jp
spider got sign for non-existent ~.eth-watcher--0v163.bfksi.aatec.q41ho.klv6o.1lb7p.5uk4d.nv4a1.pvv0k.h59f2.d2nf9.rfqn7.qp44h.r99em.vj4no.bo2ve.9n3k7.4jcrl.pnr6g.9jpg7.tps9p
spider got sign for non-existent ~.eth-watcher--0vda.akm0t.j9m44.81bev.mab25.0fec0.2q39u.3gfmm.ap9f7.vjeom.edojc.inpic.vlbgn.9veqr.8m99l.7hjl3.hhn8c.oe5fk.7q48g.sqkmh.dg8tv
spider got sign for non-existent ~.eth-watcher--0v13r.h5u1r.cm0ng.4furi.rvr8v.o27ra.1ii8m.suvap.judjt.a7fni.g6llk.7cvvi.ti1jb.7n1h4.t11lj.ci967.or0se.t8fc0.8j5s8.7hfd9.9nadp
spider got sign for non-existent ~.eth-watcher--0v1ae.h032c.rir5q.pe6ha.ko2bo.cncp9.si7re.gfg2a.hitmo.a16b6.5ihb1.gqs54.657cd.eggpe.kndcu.95n05.hbk74.kvjbl.mamva.usn8n.ph7jn
spider got sign for non-existent ~.eth-watcher--0v3n1.gbqln.at3dq.7m775.4q0l3.hd2u1.km3b8.hjal8.g4nsr.939i6.223o4.rsbii.dlmf8.6ka97.nmtv8.mk880.vc5su.penne.6d92g.b62kb.phlpb
spider got sign for non-existent ~.eth-watcher--0v1f0.60uof.7g6g1.tevcp.u04is.ffvck.7ck22.p02qh.0lacc.s0me6.6mape.4afq2.s7krt.muno5.a84c2.dlfs7.ie42b.v4rei.s7ku2.jpqkg.d108u
spider got sign for non-existent ~.eth-watcher--0v2pl.pnmic.58r80.lig44.9j70g.5r0ss.3cnhb.hq94s.jg7ce.3d3f1.vvoce.u176f.jtd1p.ajauu.5g9nb.us660.qqjgk.69v58.60o47.d8aek.51rj8
spider got sign for non-existent ~.eth-watcher--0v2al.k6pdu.5pp0m.p2due.nsrdl.u3eao.00jnm.7bho7.78r03.kb7q9.rn4dq.jetfg.tnbmr.bobum.g2kfd.mjotr.bur0g.p6g37.5eu2g.h8kt5.0koga
spider got sign for non-existent ~.eth-watcher--0v3dn.3heg6.mn0up.lfomv.69fhf.n5l9s.upkdj.mcfub.g3shl.oi90c.9b260.1416u.nho4j.2hbtf.eiqrl.m7p1q.6r1l0.6a3tk.fdf0j.tch86.jgahi
spider got sign for non-existent ~.eth-watcher--0v1go.8adq5.ljupi.chp66.3f6d5.bdq53.gj7qc.2e251.v5a10.23ljm.4dnj6.ki7bc.gjs25.0jnbt.l6n61.s4csr.b603b.jibn6.ct2rl.2o2sj.dqtum
spider got sign for non-existent ~.eth-watcher--0vkt.l6oh1.bjab4.8r1j9.haeqp.8qde8.5v28j.50kdb.8s9i9.hv2qr.hmf5v.i96hq.blein.o8qmn.mvs2k.fvd5r.9rhbj.5gs3l.rk7u0.c1daj.rjv5a
spider got sign for non-existent ~.eth-watcher--0v2ke.43m5a.2n6mp.q2dd9.pt499.kqm3c.2dvf4.8j806.kj98i.s5ueq.c49ca.0hs3e.rqp6b.cq7pl.7ndb0.jvqh0.kgtcc.8pe33.t5cds.4k9lq.dv43c
spider got sign for non-existent ~.eth-watcher--0v13c.115q8.km0bf.lbt1q.5radt.d59ld.v0cq2.ih11p.7mu36.ttdd4.l2b97.00tlq.ea61j.oh4at.ragig.rbukc.e4nfa.vgifi.jaol1.748e4.v9ru0
spider got sign for non-existent ~.eth-watcher--0v39c.ov0lq.s5dh0.2smoa.6n4as.ri8rt.65oi3.r71t5.dbsok.na0rp.24aib.2mt2t.7637c.33aqq.52e72.akupg.tmscg.quch9.t6hk7.kokjl.tcgcg
spider got sign for non-existent ~.eth-watcher--0v3j0.rktnu.qpnlq.b4bsk.baddu.s97oh.gqhan.05m9o.t1tke.e0ang.33ksc.og5ts.7bdgq.tvjku.dkepn.shq45.hkirp.74sdo.ntuaq.4ea21.mt0ed
spider got sign for non-existent ~.eth-watcher--0v8m.2cfro.04ei6.ll0ik.t5d3i.rkqhc.0n5nr.4dbpd.m6k9h.tro9s.gr5u3.f2sep.d308f.mq68u.5640j.0icia.vvkh8.fhngc.ai6dv.0drup.ba0lb
spider got sign for non-existent ~.eth-watcher--0v3rc.l8gfs.ibeve.o6khi.oacvj.j0i7k.dofai.ai2ls.tnmrc.024c0.91fbl.jsm80.fp1md.61c9e.kgm8j.np3co.rdkgr.ukqrp.hsc46.hcr22.hamtl
spider got sign for non-existent ~.eth-watcher--0v1oe.8oo0a.rvtnt.p487p.0vepu.4vrmt.qddhi.g45as.ra02o.4nc6j.4ukhn.ogguq.lsrj5.tmbic.3q95j.24f23.k5hm3.dfhnq.lckjh.s6767.p4oao
spider got sign for non-existent ~.eth-watcher--0v33l.vtuqg.fbld6.c755p.a0k3k.1mhlg.9sad4.5vkf0.gtgdg.0u8qt.un0ij.b8hjb.b4c2d.9p372.q86lj.mr02g.8un71.g14p8.9ghkg.5c80b.kuc6r
spider got sign for non-existent ~.eth-watcher--0vs2.oup9h.ujja8.pkkj1.28ls8.cu7ju.hkqak.29bup.0i95a.rdncs.0us6t.ohgn8.386nd.01plm.kof7e.e4818.7ameq.tccg5.p8bm7.tqnja.88505
spider got sign for non-existent ~.eth-watcher--0v3a0.cqud2.loq82.fo6t9.g1h46.c96q6.ani19.1eiq4.344vh.f5fp1.f0e6d.70t2a.s7p2q.0p0oe.k5fe3.k7589.a796c.ij8vu.uba7l.ml7bs.aib7c
spider got sign for non-existent ~.eth-watcher--0v33n.pfqhq.8ndm9.el3sa.bjpmv.sj3jh.cjmoj.7mtq6.g8f01.o81kg.2gcbu.91b5o.oantb.heec1.5a4c1.0hmmb.boceu.61t3k.ptu5a.elq8n.tgrpv
spider got sign for non-existent ~.eth-watcher--0vm3.gughl.dc895.13o9h.efclq.im268.hbcpn.nnudo.2i6ge.7htbk.be4dv.4b5pk.9mh4m.om4st.g4p66.fk6mo.fkv19.52gb4.kvdbn.o7721.nfg52
spider got sign for non-existent ~.eth-watcher--0v3b5.4nues.jbcvv.3v4a3.840gg.gp8p6.c9thc.9t1om.hkj0m.ev01b.3tssu.kavju.10877.7b296.inudm.m79pe.sgd74.qrm6i.c21c0.517r6.atomm
spider got sign for non-existent ~.eth-watcher--0v4f.f9h3i.iqa9u.riudc.9g2ku.t84o1.0i7e3.l7g23.keigs.dpmhu.96vhj.ego2k.qp72j.bl2fo.nilii.pmlng.mh7uk.dm4vl.oalv2.af8ae.cseta
spider got sign for non-existent ~.eth-watcher--0v1ik.s5utb.5t1rb.95ckm.mtcf1.ure2t.3bq44.nlkro.1d887.q0c5b.cvsal.smtjc.htsep.kr7fk.lkh7l.egggt.mmc77.apen0.r5i5q.imqpp.hv9uh
spider got sign for non-existent ~.eth-watcher--0v1bi.eumov.6ceoc.j2di7.2iimh.vhjh6.jsj8p.4rgep.mk7vt.534v8.sice8.ijt3s.5vbe2.gdsge.3ian4.b3jqr.dhtsa.cj08v.qthli.42ivr.7fmcs
spider got sign for non-existent ~.eth-watcher--0v356.p5gj2.bj3o9.uve4g.ui5io.rraoq.5bld9.sv8fc.h97ue.a64k6.t61g2.30ckj.u6vgp.cv9sh.p18ki.s7aen.dso0a.q2m58.tlm7v.up4kt.pf3kj
spider got sign for non-existent ~.eth-watcher--0v3ms.af469.ev9dt.tkpp3.a6dtc.0q46c.geqfa.1fnge.aoe4t.i93ct.vbimg.v6tpj.d36k9.3vhpi.qdqno.vvqba.b8mj3.vikeo.mmsdb.2use5.d7hn3
spider got sign for non-existent ~.eth-watcher--0v35d.qro0o.tajpc.sa7mf.fqda1.m2u4e.pn5cm.meofb.bvd6c.pp49h.lnojk.f2cvk.1ib3c.7erhj.87hv2.jjpmv.qdju5.hcae1.sgl8s.852p6.3ln5q
spider got sign for non-existent ~.eth-watcher--0v2hp.bs5oc.dof3i.gqara.79em4.lpb5a.evp8l.fh9gc.7qlc7.sn55b.8gecc.um44v.ip1h2.dvd0o.bnkv1.4ecnh.ero3h.alebv.uoi5m.l6lsk.0a18l
spider got sign for non-existent ~.eth-watcher--0v1m3.on0rc.g4f4k.5rh7s.bvjcv.tiov7.q4nhu.vamjj.68e66.0iioi.vk0qi.m2e17.52ju1.6f23v.n6uj4.tpr86.j3sar.8h23d.fn182.s63d9.nrb7l
spider got sign for non-existent ~.eth-watcher--0v102.hvf6v.oae6c.k81qd.flggj.rfr6k.mqcur.lkbhk.5p4bj.ea0tk.luvdn.qrobg.42rsm.r0m10.g4git.89vav.3661n.j7rkh.nmude.m0udf.8p3ip
spider got sign for non-existent ~.eth-watcher--0v4t.emfpl.onvb3.5hu2f.bihgk.jg36f.uh4j7.cn5mn.3fau3.5nf04.plm58.9k390.h5b0m.2bpc3.1k5m5.nrlek.aaagd.pv45q.eiujh.l38j1.aj5n6
spider got sign for non-existent ~.eth-watcher--0v3c4.62bin.3g71n.dj6n2.of82g.ea64e.36pfk.98t3f.gjulh.q4kgi.85n67.og6mm.rvl71.q3slh.93vuo.e9hup.79phh.7sdic.0rtd4.tctjh.k6cmv
eth-watcher /azimuth-tracker timed out; will restart
spider got sign for non-existent ~.eth-watcher--0v2sg.8birb.ub742.g2rft.uh17n.f2cm8.db8km.6ppss.a103m.74jt9.r7mju.1ietc.mgumv.f4igh.d6igd.4os8f.k4ovc.aqg28.tiqia.82pi0.li2t4
[%eyre-unknown-receive 2]
pier: (4387924): play bail: mug mismatch 3658b5df dca2b3a
pier: (4387925): play: bail

[%poke %hear]
bar-stack=~[~[//ames]]
call: failed
/sys/vane/ames/hoon:<[880 3].[909 20]>
/sys/vane/ames/hoon:<[882 3].[909 20]>
/sys/vane/ames/hoon:<[883 3].[909 20]>
/sys/vane/ames/hoon:<[885 3].[909 20]>
/sys/vane/ames/hoon:<[886 5].[907 7]>
/sys/vane/ames/hoon:<[889 5].[907 7]>
/sys/vane/ames/hoon:<[895 5].[907 7]>
/sys/vane/ames/hoon:<[897 14].[897 53]>
/sys/vane/ames/hoon:<[1.243 5].[1.243 43]>
/sys/vane/ames/hoon:<[1.249 5].[1.263 17]>
/sys/vane/ames/hoon:<[1.251 5].[1.263 17]>
/sys/vane/ames/hoon:<[1.254 5].[1.263 17]>
/sys/vane/ames/hoon:<[1.330 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.331 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.334 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.342 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.343 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.344 5].[1.379 62]>
[%ames-crash-on-packet-from ~lur]
/sys/vane/ames/hoon:<[1.345 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.349 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.378 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.379 5].[1.379 62]>
/sys/vane/ames/hoon:<[1.379 10].[1.379 62]>
/sys/vane/ames/hoon:<[1.939 7].[1.964 69]>
/sys/vane/ames/hoon:<[1.942 7].[1.964 69]>
/sys/vane/ames/hoon:<[1.944 7].[1.964 69]>
/sys/vane/ames/hoon:<[1.946 7].[1.964 69]>
/sys/vane/ames/hoon:<[1.958 7].[1.964 69]>
/sys/vane/ames/hoon:<[1.964 7].[1.964 69]>
/sys/vane/ames/hoon:<[2.111 7].[2.175 9]>
/sys/vane/ames/hoon:<[2.114 7].[2.175 9]>
/sys/vane/ames/hoon:<[2.117 7].[2.175 9]>
/sys/vane/ames/hoon:<[2.118 7].[2.175 9]>
/sys/vane/ames/hoon:<[2.118 47].[2.118 71]>
/sys/vane/ames/hoon:<[2.336 5].[2.343 7]>
/sys/vane/ames/hoon:<[2.338 5].[2.343 7]>
/sys/vane/ames/hoon:<[2.338 9].[2.338 29]>
/sys/vane/ames/hoon:<[2.348 5].[2.359 7]>
/sys/vane/ames/hoon:<[2.350 5].[2.359 7]>
/sys/vane/ames/hoon:<[2.354 9].[2.357 11]>
/sys/vane/ames/hoon:<[2.356 15].[2.356 79]>
/sys/vane/ames/hoon:<[2.391 5].[2.460 7]>
/sys/vane/ames/hoon:<[2.394 5].[2.460 7]>
play: bail: %exit
play: %hear event on /a/ames failed

Ever since, I've been doing incremental replay with the patched v1.5 binary without trying the v1.8 binary in between. At the time of writing there are about a million more events to go before I reach the end. I'm mostly wondering if there is any point in me continuing this process. Even if I manage to finish the replay using v1.5 or by switching to v1.8 at the point where it fails, won't the networking state of this ship be messed up in a way that will require me to breach after booting it?

All things considered I would prefer to be able to see the data I had before abandoning it, so I will continue the replay with mug mismatches. If I successfully boot the ship without networking (-L), is there a way for me to compare the Ames state this ship knows about with what e.g. my planet expects it to be?

Notify maintainers
Notifying @Fang- and @joemfb because of their help in solving similar issues, and for providing explicit instructions to avoid mug mismatches.

@aolieman aolieman added the bug label Jun 8, 2022
@Fang-
Copy link
Member

Fang- commented Jun 13, 2022

To be clear, even replaying with the exact same version of 1.5 that you had used previously ended up hitting mug mismatches? Or: do you know if the mismatch still occurs when using the unpatched 1.5?

And you hadn't used an even older runtime prior, right?

@aolieman
Copy link
Author

Hi @Fang- . I only tried once with the original v1.5 and that got OOM killed after replaying almost 1.9M events. I guess I could try again with more memory available, because I don't know if the mug mismatch will occur with the original binary.

This ship was first run with the v1.0 binary, to the best of my knowledge. I never tried replaying the event log with any binary older than v1.5.

@Fang-
Copy link
Member

Fang- commented Jun 14, 2022

got OOM killed

If this was a bail: meme then the strategy is to replay to someplace before that point and compact the snapshot, proceeding stepwise. (Which the patched 1.5 makes easy, using the rocks, but otherwise you need to manually backup the checkpoint iirc.)
iow, urbit-1.5/urbit -n 500000 your-pier, backup the your-pier/.urb/chk, (maybe need to manually meld?) then continue with -n 1000000, or doing smaller steps as needed.

This ship was first run with the v1.0 binary

Not sure if we expect mismatches between it and 1.5. I don't recall hearing about any, but @joemfb might be able to tell with more certainty.

@joemfb
Copy link
Member

joemfb commented Jun 14, 2022

The mismatch was introduced in v1.6, so you should be able to do a full replay with the patched v1.5 binaries. Use the partial replay argument (-n) to step through the log, taking smaller steps if it runs out of memory. After each step, -n deduplicates in the same way as meld, so no additional commands are needed.

@aolieman
Copy link
Author

I was able to complete the replay with your patched v1.5 binaries, thanks for providing those. No memory problems whatsoever. The OOM I mentioned (not a bail: meme) happened one time with the unpatched v1.5 when I still had artificially limited memory with cgroups.

I haven't yet dared to start the ship with (non-local) networking because I don't understand what the consequences are of all the mug mismatches that occurred during the replay. Should I attempt the replay from scratch with the unpatched v1.5 runtime to see if that avoids the mug mismatches, as Fang suggests?

In case mug mismatches don't always cause serious problems, I'd like to compare the peer state of the crashed ship and my planet. I've been looking at the $message-pump-state and $message-sink-state. This is me trying to be cute... I don't really know what I should expect to match up here, or not in the case of a problem.

@joemfb
Copy link
Member

joemfb commented Oct 11, 2022

This is still an issue. A workaround not yet documented here is to restore the backup snapshot (at $pier/.urb/bhk) before attempting replay -- it should always avoid the need to switch binary versions. A full fix is blocked on event-log truncation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants