Implement `chop` (event log truncation tool) #122

matthew-levan · 2023-01-18T08:55:08Z

Write a minimally viable event log truncation program. Intended for power users with particularly large piers, it should:

- Safely reduce a ship's pier size
- Operate only on a ship which has been gracefully shut down (i.e., has no unprocessed events in its queue)
  - Perform a graceful shutdown if the ship is currently running (maybe?)
- Export the event log to a backup file for optional archiving/compressing/deleting

Upon completion, a more permanent, tunable, and feature-rich event log truncation system could be built.

References:

Tagging: @mcevoypeter @joemfb @zalberico


matthew-levan · 2023-01-18T22:44:24Z

Notes from @eamsden and @philipcmonk:

- Vere snapshot durability is fine as designed (modulo bugs), but you need to copy out patch files if they exist, not just north.bin and south.bin.
- However, he (Philip) also said that it was his understanding that there are incidental places in vere that depend on the actual number of events stored in the DB, so it's not presently possible to do event log truncation without changes to vere.
  - Edward thinks these were already changed, but only as part of 1.14 (the mars/urth split on next/vere), which was reverted because of snapshot corruption issues.
- When do we patch?
  - https://github.com/urbit/vere/blob/develop/pkg/noun/events.c#L1043-L1062
  - Dirty RAM pages get written to patch files, then separately applied to the snapshot image file.
  - The process of applying the patches is idempotent.
  - This is the fully durable construction that vere uses.
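
To illustrate why idempotent patch application matters for durability, here is a minimal sketch. It is a toy model, not vere's patch format: `PAGE_SIZE`, `apply_patch`, and the dictionary-based patch are all hypothetical names chosen for illustration. The point is only that overwriting whole pages can be repeated safely, so a crash partway through applying a patch can be recovered by applying the same patch again.

```python
from typing import Dict

PAGE_SIZE = 16  # toy page size, purely illustrative


def apply_patch(image: bytearray, patch: Dict[int, bytes]) -> None:
    """Overwrite whole pages of `image` with the pages recorded in `patch`.

    Each page is written at a fixed offset, so applying the same patch
    twice leaves the image in the same state as applying it once.
    """
    for page_no, page in patch.items():
        if len(page) != PAGE_SIZE:
            raise ValueError("patch pages must be exactly PAGE_SIZE bytes")
        start = page_no * PAGE_SIZE
        end = start + PAGE_SIZE
        if end > len(image):
            # Grow the image with zeroed bytes if the page lies past the end.
            image.extend(bytes(end - len(image)))
        image[start:end] = page


# A three-page image and a patch holding two dirty pages.
image = bytearray(PAGE_SIZE * 3)
patch = {1: b"A" * PAGE_SIZE, 2: b"B" * PAGE_SIZE}

apply_patch(image, patch)
once = bytes(image)

# Re-applying simulates recovery after a crash in the middle of patching.
apply_patch(image, patch)
twice = bytes(image)
```

In this model `once` and `twice` are identical, which is the property the notes above rely on. It also shows why a backup taken while patch files exist must include them: until they are applied, the image file alone does not hold the latest state.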

zalberico · 2023-01-19T18:11:36Z

First - thanks for creating a detailed issue and writing this up.

Tagging @lukebuehler for hosting context.

I don't think this approach makes sense, and it isn't what we need for hosting. We have a path forward here for live snapshots (and truncation) that doesn't require an offline tool. This is necessary for hosting. @mcevoypeter is going to build this, replicating a similar implementation that existed in the old repo (though it might be a good candidate for pairing). @philipcmonk, tagging you for context.
edit: This tool will be useful in the short term; we'll get the automatic fix working in the newer runtime.

This work is on deck after the doccords release and getting the fixed replay command out of 1.14 and released in its own release.

Also tagging @lukestiles for context.

zalberico · 2023-01-24T18:00:45Z

@matthew-levan gave us an overview of the plan and it looks good. Some context from our side:

- One thing that was missing: it's really important to make a chk backup post chop (renamed from trim based on feedback from Philip about trim's use elsewhere in the runtime). This could replace the bhk directory, which becomes useless post chop (though we should probably back up bhk alongside the db backup when the job runs, in case we have to revert).
- The reason we need this backup is that future corruption of chk will require users to replay from post chop, and if we have no chk from that point they will otherwise have to breach (we routinely see this type of corruption that requires a replay).
- A suggestion from @Fang- that would be even better: put the chk state itself in as the first event (effectively making it the 0 start for replay). @philipcmonk said this would require the runtime to understand a special event, because you wouldn't be able to load a huge event like this into the loom. The advantage is that the user wouldn't need to worry about the chk backup and could always replay post chop. This is likely more complicated, though, and doesn't need to block v0.

On our side, we're figuring out live snapshotting and truncation, but it's possible we'll want to make changes to the runtime prior to that, so this work shouldn't block on it.

Thanks for the overview Matt! @mcevoypeter for context. Also @belisarius222

belisarius222 · 2023-01-24T20:01:19Z

Is there documentation anywhere about the chk/bhk system? I know it was added fairly recently, and it has something to do with backup snapshots, but I'm not sure about the details, and I'd like to understand it better.

matthew-levan · 2023-01-24T20:07:44Z

I've updated this issue with a working design gist, please see it for add'l context.

> Is there documentation anywhere about the chk/bhk system? I know it was added fairly recently, and it has something to do with backup snapshots, but I'm not sure about the details, and I'd like to understand it better.

@zalberico said that the bhk directory is used to store a one-off snapshot after some jet mismatch bug hit the live network. From what I gathered, only some ships even have this folder in their piers (due to not all of them being booted at time of bug).

If possible, I'd like to use `_ce_backup()` (which is the code that copies images into bhk) to perform a backup of the snapshot we save pre-chop.

zalberico · 2023-01-24T21:02:35Z

There was a jet mismatch issue, this is the history: urbit/urbit#5836

Basically, you can't replay successfully from 0 because you'll hit mug mismatch issues due to this, and the replay will fail. When this was fixed, a one-time bhk was created post jet mismatch so that you could replay from there on future runtime versions without having to do complicated switching (I haven't gotten the switching to work myself on my ID that exhibits this, but replaying from bhk does work).

With chop the bhk will no longer be useful after truncation, but we should back it up along with the db just in case something goes wrong with chop and we need it to revert.

After chop, we'll want to make a separate backup of the first new checkpoint so we can always replay from that copy in the future if the normal checkpoint gets corrupted.
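
A separate post-chop checkpoint backup could look roughly like the sketch below. This is an illustration only, not what `chop` does: `backup_checkpoint` and the `chk-post-chop` directory name are hypothetical, and the `<pier>/.urb/chk` location is an assumption based on the `.urb/bhk` layout. It copies the whole directory rather than naming individual files, so any patch files present are carried along with north.bin and south.bin.

```python
import shutil
import tempfile
from pathlib import Path


def backup_checkpoint(pier: Path, dest_name: str = "chk-post-chop") -> Path:
    """Copy the pier's checkpoint directory to a sibling backup directory.

    Both the source location (<pier>/.urb/chk) and `dest_name` are
    assumptions made for this sketch.
    """
    chk = pier / ".urb" / "chk"
    dest = pier / ".urb" / dest_name
    if dest.exists():
        # Refuse to overwrite an earlier backup; it may be the only good copy.
        raise FileExistsError(f"backup already exists: {dest}")
    shutil.copytree(chk, dest)
    return dest


# Demonstration against a throwaway directory standing in for a pier.
with tempfile.TemporaryDirectory() as tmp:
    pier = Path(tmp) / "example-pier"
    chk_dir = pier / ".urb" / "chk"
    chk_dir.mkdir(parents=True)
    # "example.patch" is a made-up name standing in for a patch file.
    for name in ("north.bin", "south.bin", "example.patch"):
        (chk_dir / name).write_bytes(name.encode())

    copied = sorted(p.name for p in backup_checkpoint(pier).iterdir())
```

Refusing to overwrite an existing backup is a deliberate choice here: the copy exists precisely because the live checkpoint may later become corrupt.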

belisarius222 · 2023-01-25T14:39:26Z

I see. That makes sense, thanks.

## `chop`

`urbit chop <pier>` implements a simple, offline **event log truncation**[^1] tool.

`chop` gracefully stops the given pier (if running), backs up the current snapshot to `<pier>/.urb/bhk`, makes sure a current snapshot exists (i.e., is fully written to disk in `chk/*.bin` with no existing patch files), reads the metadata and the last event from the pier's event log, initializes a fresh event log in the `<pier>/.urb/log/chop` directory, writes the metadata and last event from the original log into the fresh one, renames the original event log to `<pier>/.urb/log/chop/data_<first>_<last>.mdb.bak` where `first` and `last` are the first and last event numbers from the event log, and exits.

Pilots are then free to move, archive, or delete their `.bak` event log file, resume normal operation of their ship, and enjoy the many benefits of lowered disk pressure and any reductions in associated hosting costs.

I've tested `chop` successfully on my own planet `~mastyr-bottec` (multiple times), three different comets (all fresh), and multitudes of fake galaxies.

Resolves #122.

Note: `knit`, which is the "undo" button for `chop`, is being implemented in its own PR #184.

[^1]: https://roadmap.urbit.org/project/event-log-truncation
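
The core of the sequence described above (keep the metadata and the last event, set the rest aside under a name derived from the first and last event numbers) can be sketched with a toy in-memory model. This is not the runtime's code and does not touch LMDB: the `EventLog` class and this `chop` function are hypothetical, and only the backup file name pattern is taken from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class EventLog:
    """Toy stand-in for an event log: metadata plus numbered events."""
    meta: Dict[str, bytes] = field(default_factory=dict)
    events: Dict[int, bytes] = field(default_factory=dict)


def chop(log: EventLog) -> Tuple[EventLog, str]:
    """Return a fresh log holding only the metadata and the last event,
    plus the file name the original log would be renamed to."""
    if not log.events:
        raise ValueError("cannot chop an empty event log")
    first, last = min(log.events), max(log.events)
    # The fresh log carries over the metadata and only the final event.
    fresh = EventLog(meta=dict(log.meta), events={last: log.events[last]})
    backup_name = f"data_{first}_{last}.mdb.bak"
    return fresh, backup_name


# A log with events 1..100 and one illustrative metadata entry.
old = EventLog(
    meta={"who": b"~zod"},
    events={n: b"event-%d" % n for n in range(1, 101)},
)
fresh, backup_name = chop(old)
```

Because the fresh log starts at the last event, replay from event 1 is no longer possible afterwards. That is why the snapshot backup and the `.bak` file matter if something goes wrong.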

zalberico · 2023-04-22T20:59:53Z

@matthew-levan I ran chop for the first time myself and it doesn't look like it makes a separate post chop backup of chk that's distinct from the one chk? Am I not looking in the right place? If that's the case, then whenever the user's chk gets corrupted it will be unrecoverable (since they can no longer replay the event log) and will require a breach.

> One thing that was missing is it's really important to make a chk backup post chop (renamed from trim based on feedback from philip about trim's use elsewhere in the runtime) - this could replace the bhk directory that post trim becomes useless (though we may should backup bhk alongside the db backup when the job runs in case we have to revert).

It doesn't look like bhk updated on mine when I checked either.

zalberico · 2023-04-22T21:03:10Z

Nevermind - it did back it up inside bhk, I was looking at the created date for the directory, not the content - thanks!

#165

zalberico assigned matthew-levan Jan 24, 2023

This was referenced Jan 25, 2023

chop: offline event log truncation #165

Merged

Implement knit as an "undo" for chop #184

Closed

matthew-levan changed the title ~~Implement an event log truncation tool~~ Implement chop (event log truncation tool) Jan 31, 2023

jalehman closed this as completed in #165 Feb 9, 2023
