
epoch system #346

Closed
wants to merge 35 commits

Conversation

@matthew-levan (Contributor) commented Apr 12, 2023

This PR implements a new format for how piers store their event logs on disk.

Resolves #313.

Design

Existing format:

./zod/.urb/log
├── data.mdb
└── lock.mdb

New format:

./zod/.urb/log
├── 0i0             # epoch dirnames specify the last event of the previous epoch
│   ├── data.mdb    # lmdb file containing events 1-132
│   ├── epoc.txt    # disk format version (this PR starts versioning at 1)
│   ├── lock.mdb    # lmdb lock file
│   └── vere.txt    # binary version this set of events was originally run with
└── 0i132
    ├── data.mdb
    ├── epoc.txt
    ├── lock.mdb
    ├── north.bin   #
    ├── south.bin   # snapshot files (state as of event 132), strictly read-only
    └── vere.txt

The new format introduces epochs, which are simply "slices" or "chunks" of a ship's complete event log. Above, you can see the ship's event log chunked into two epochs: 0i0 and 0i132.
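The `0i<event>` naming scheme can be sketched as a tiny formatter. This is a hypothetical helper for illustration only, not the actual vere code:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*  Hypothetical sketch (not the actual vere implementation): format an
**  epoch directory name from the last event of the previous epoch,
**  producing the "0i<event>" names shown above (0i0, 0i132, ...).
*/
static void
epoc_name(uint64_t lat_d, char* nam_c, size_t len_i)
{
  snprintf(nam_c, len_i, "0i%" PRIu64, lat_d);
}
```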

New ships booted with the code in this PR instantiate their log directories with the new format. Existing piers are automatically migrated on boot.

Epoch "rollovers" (when the current epoch is ended and a new, empty epoch is created) occur under three conditions:

  1. The pilot uses the new roll subcommand to roll over manually.
  2. The pilot runs the chop subcommand.
  3. We detect a different running binary version than the one pinned in the current epoch.

Both migrations and epoch rollovers ensure there's a current snapshot before running.
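Condition 3 above amounts to a version-string comparison. A minimal sketch, assuming the pinned version has already been read out of the current epoch's vere.txt (the helper name is illustrative, not the vere API):

```c
#include <stdbool.h>
#include <string.h>

/*  Illustrative sketch of rollover condition (3): a rollover is needed
**  when the running binary's version string differs from the version
**  pinned in the current epoch's vere.txt. Not the actual vere API.
*/
static bool
need_rollover(const char* pin_c,  /*  version pinned in vere.txt   */
              const char* run_c)  /*  version of running binary    */
{
  return ( 0 != strcmp(pin_c, run_c) );
}
```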

A few TODOs left:

  • Iron out small kink in migration behavior for previously chopped piers
  • Make sure correct binary version gets pinned to first epoch of migrated piers
  • Rollover to new epoch when a new binary version is detected
  • Make sure manual migration logic is idempotent
  • Update prep command
  • Fix chop so it works when there are 3 epochs starting with 0i0
  • Reproduce and fix partially-deleted epoch 0i0 after chop
  • Pair with someone to run manual GDB testing for migration idempotency and rollover logic
  • Take a look at @joemfb's replay code and compare/find overlaps
  • Document final system design in this PR
  • Correct epoch naming scheme
  • Make chop leave the latest two epochs
  • Better error handling
  • Better cleanup
  • Test migration with real ships running on local-networking mode
  • Test epoch rollover idempotency
  • Test fresh boot
  • Handle case where snapshot has been deleted from chk/
  • Ensure u3_disk_epoc_good() is implemented and used how we want
  • Ensure u3_disk_epoc_init() is implemented and used how we want

@matthew-levan added the feature (New feature or feature request) label Apr 12, 2023
@matthew-levan requested a review from a team as a code owner Apr 12, 2023 18:44
@matthew-levan self-assigned this Apr 12, 2023
@matthew-levan marked this pull request as draft Apr 12, 2023 18:44
@matthew-levan force-pushed the i/313/epoch branch 2 times, most recently from ac9e02c to abb4ea7, Apr 12, 2023 19:09
@matthew-levan (Contributor, Author):

Midnight update: I can now boot a fresh fakezod with the epoch format, roll it, and start it up again. Events appear to be processing normally in the dojo from that state, but only the snapshot (not the database) seems to be getting updated. Thus, when shutting down again and trying to reboot, I get a:

...
boot: installed 652 jets
pier: eve_d 146, dun_d 133
Assertion 'god_u->eve_d <= log_u->dun_d' failed in pkg/vere/pier.c:1401
Assertion failed: god_u->eve_d <= log_u->dun_d (pkg/vere/pier.c: _pier_on_lord_live: 1401)

Thoughts welcome, just taking notes here for my future self.
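For context, the failing assertion encodes a replay invariant. A minimal sketch using the same field names as the log above (eve_d, dun_d):

```c
#include <stdint.h>

/*  Sketch of the invariant behind the assertion above: eve_d is the
**  worker's last computed event (here 146, restored from the snapshot),
**  dun_d the log's last committed event (here 133). On boot, replay
**  requires eve_d <= dun_d; if the snapshot advances while the lmdb
**  database does not, this check fails.
*/
static int
replay_ok(uint64_t eve_d, uint64_t dun_d)
{
  return ( eve_d <= dun_d );
}
```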

@matthew-levan (Contributor, Author) commented Apr 15, 2023 via email

@matthew-levan (Contributor, Author):

Another update: got roll working nicely. chop is next on the... chopping block. prep too. Finally will come the migration code. Stay tuned!

@jalehman (Member) commented Apr 17, 2023 via email

@matthew-levan (Contributor, Author):

chop was easy. It's probably not perfect, but it works... Onto prep, then onto the migration code.

@matthew-levan (Contributor, Author):

Implemented a very minimal migration routine for existing piers; I still need to validate that there's a current snapshot, roll over to a new epoch, and delete the backup snapshot (in that order).

_cw_roll is currently just a private function in main.c; maybe it becomes u3_disk_epoc_roll in disk.c so I can use it there too. Hm...

@barter-simsum (Member):

u3_disk_epoc_init and several other functions in disk.c should have the opening brace on a new line rather than on the same line.
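A minimal illustration of the requested style (function body and names are made up, only the brace placement matters):

```c
/*  Illustrative only: the style requested above, with the opening
**  brace of a function body on its own line.
*/
static int
sum_two(int a_i, int b_i)
{
  return ( a_i + b_i );
}
```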

@matthew-levan (Contributor, Author) commented Jun 1, 2023 via email

@barter-simsum (Member):

Pushed a minor change in f772262. The dut_o flag is pointless; just exit the migration early if data.mdb can't be accessed.
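The suggested early-exit pattern, sketched with hypothetical names (not the actual migration routine):

```c
#include <stdbool.h>
#include <stdio.h>

/*  Illustrative sketch of the early-exit suggested above: rather than
**  threading a dut_o-style success flag through the whole migration,
**  bail out as soon as data.mdb can't be opened. Names are made up.
*/
static bool
try_migrate(const char* pax_c)
{
  FILE* fil_u = fopen(pax_c, "rb");

  if ( !fil_u ) {
    return false;  /*  data.mdb inaccessible: abort migration early  */
  }

  /*  ... the actual migration steps would go here ...  */

  fclose(fil_u);
  return true;
}
```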

@matthew-levan (Contributor, Author) commented Jun 6, 2023

Successfully ran a migration on ~mastyr-bottec using local-networking mode:

matthew@domus:~/src/urbit/vere$ ./urbit --local ~/ships/planets/mastyr-bottec/mastyr-bottec/
~
urbit 2.9-c045119fd3
boot: home is /home/matthew/ships/planets/mastyr-bottec/mastyr-bottec
loom: mapped 2048MB
boot: protected loom
live: mapped: GB/1.345.880.064
live: loaded: KB/16.384
boot: installed 657 jets
loom: mapped 2048MB
lite: arvo formula 2a2274c9
lite: core 4bb376f0
lite: final state 4bb376f0
loom: image backup complete
disk: migrated disk to v1 format
disk: loaded epoch 0i219237052
loom: mapped 2048MB
boot: protected loom
live: mapped: GB/1.345.880.064
live: loaded: KB/16.384
boot: installed 657 jets
vere: checking version compatibility
%unlagging
ames: live on 51723 (localhost only)
conn: listening on /home/matthew/ships/planets/mastyr-bottec/mastyr-bottec/.urb/conn.sock
http: web interface live on http://localhost:8080
http: loopback live on http://localhost:12321
pier (219237063): live
~mastyr-bottec:dojo> 

Note that I also had a chop directory in my .urb/log folder; it didn't bother the migration. I'd tested that before with a fakezod too, but it's nice to see the same behavior on my own planet's pier as well.

@jalehman requested a review from joemfb June 9, 2023 15:30
@matthew-levan (Contributor, Author):

Closed in favor of #459, which implements the same design and also integrates with urbit play a la #420.

pkova added a commit that referenced this pull request Sep 27, 2023
This PR ports urbit/urbit#5676, correcting an integer overflow in the
+jam jet (leading to a mismatch on large-atom input). Fixes
urbit/urbit#5674.

Releasing this fix is blocked on #346. DO NOT MERGE.
Labels: feature (New feature or feature request)
Successfully merging this pull request may close: pier: epoch system
5 participants