Implement chop (event log truncation tool) #122

Closed
4 tasks done
matthew-levan opened this issue Jan 18, 2023 · 9 comments · Fixed by #165

@matthew-levan
Contributor

matthew-levan commented Jan 18, 2023

Write a minimally viable event log truncation program. Intended for power users with particularly large piers, it should:

  • Safely reduce a ship's pier size
  • Operate only on a ship that has been gracefully shut down (i.e., has no unprocessed events in its queue)
    • Perform a graceful shutdown if the ship is currently running (maybe?)
  • Export the event log to a backup file for optional archiving/compressing/deleting

Upon completion, a more permanent, tunable, and feature-rich event log truncation system could be built.

References:

Tagging: @mcevoypeter @joemfb @zalberico

@matthew-levan
Contributor Author

Notes from @eamsden and @philipcmonk:

  • Vere snapshot durability is fine as designed (modulo bugs) but you need to copy out patch files if they exist, not just north.bin and south.bin.
  • However, Philip also said that, as he understands it, there are incidental places in vere that depend on the actual number of events stored in the DB, so event log truncation is not presently possible without changes to vere.
    • Edward thinks these were already changed but only as a part of 1.14 (mars/urth split on next/vere) which was reverted because of snapshot corruption issues.
  • When do we patch?
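
The first note above (copy out patch files too, not just north.bin and south.bin) suggests copying the snapshot directory by content rather than by a fixed file list. A minimal Python sketch, assuming all snapshot files sit in a single directory; the helper name is hypothetical and is not part of vere:

```python
import shutil
from pathlib import Path


def copy_snapshot_files(src_dir, dest_dir):
    """Copy every regular file in src_dir into dest_dir.

    Hypothetical helper: it copies whatever is present instead of a
    hard-coded list, so patch files are not silently left behind.
    """
    src = Path(src_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(src.iterdir()):
        if entry.is_file():
            # copy2 also preserves timestamps, which helps when auditing backups
            shutil.copy2(entry, dest / entry.name)
            copied.append(entry.name)
    return copied
```

This only makes sense on a ship that is not running, since a live process could be rewriting these files during the copy.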

@zalberico
Collaborator

zalberico commented Jan 19, 2023

First - thanks for creating a detailed issue and writing this up.

Tagging @lukebuehler for hosting context.

I don't think this approach makes sense, and it isn't what we need for hosting. We have a path forward here for live snapshots (and truncation) that doesn't require an offline tool. This is necessary for hosting. @mcevoypeter is going to build this, replicating a similar implementation that existed in the old repo (though it might be a good candidate for pairing). @philipcmonk, tagging you for context.
Edit: This tool will be useful in the short term; we'll get the automatic fix working in the newer runtime.

This work is on deck after the doccords release and getting the fixed replay command out of 1.14 and released in its own release.

Also tagging @lukestiles for context.

@zalberico
Collaborator

zalberico commented Jan 24, 2023

@matthew-levan gave us an overview of the plan and it looks good. Some context from our side:

  • One thing that was missing: it's really important to make a chk backup post chop (renamed from trim based on feedback from Philip about trim's use elsewhere in the runtime). This could replace the bhk directory, which becomes useless post chop (though we should probably back up bhk alongside the db backup when the job runs, in case we have to revert).
  • The reason we need this backup is that future corruption of chk will require users to replay from the post-chop state; without a chk from that point, they will have to breach (we routinely see this type of corruption that requires a replay).
  • An even better suggestion from @Fang- is to put the chk state itself as the first event (effectively making it the 0 start for replay). @philipcmonk said this would require the runtime to understand a special event, because you wouldn't be able to load a huge event like this into the loom. The advantage is that the user wouldn't need to worry about the chk backup and could always replay post chop. This is likely more complicated, though, and doesn't need to block v0.

On our side, we're figuring out live snapshotting and truncation, but it's possible we'll want to make changes to the runtime prior to that, so this shouldn't block on it.

Thanks for the overview Matt! @mcevoypeter for context. Also @belisarius222

@belisarius222
Contributor

Is there documentation anywhere about the chk/bhk system? I know it was added fairly recently, and it has something to do with backup snapshots, but I'm not sure about the details, and I'd like to understand it better.

@matthew-levan
Contributor Author

matthew-levan commented Jan 24, 2023

I've updated this issue with a working design gist, please see it for add'l context.

> Is there documentation anywhere about the chk/bhk system? I know it was added fairly recently, and it has something to do with backup snapshots, but I'm not sure about the details, and I'd like to understand it better.

@zalberico said that the bhk directory is used to store a one-off snapshot after some jet mismatch bug hit the live network. From what I gathered, only some ships even have this folder in their piers (due to not all of them being booted at time of bug).

If possible, I'd like to use _ce_backup() (which is the code that copies images into bhk) to perform a backup of the snapshot we save pre-chop.

@zalberico
Collaborator

zalberico commented Jan 24, 2023

There was a jet mismatch issue, this is the history: urbit/urbit#5836

Basically, you can't replay successfully from 0 because you'll hit mug mismatch issues due to this, and the replay will fail. When this was fixed, a one-time bhk was created post jet mismatch so you could replay from there on future runtime versions without having to do complicated switching (I haven't gotten the switching to work myself on my ID that exhibits this, but replaying from bhk does work).

With chop the bhk will no longer be useful after truncation, but we should back it up along with the db just in case something goes wrong with chop and we need it to revert.

After chop, we'll want to make a separate backup of the first new checkpoint so we can always replay from that copy in the future if the normal checkpoint gets corrupted.
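
The two backups described above (bhk and the db before chop, then a separate copy of the first new checkpoint after chop) could be scripted roughly as follows. This is a sketch under stated assumptions: the `.urb/bhk` and `.urb/chk` locations follow the paths mentioned elsewhere in this thread, while treating `.urb/log` as the db directory, the `pre-chop`/`post-chop` folder names, and both function names are hypothetical.

```python
import shutil
from pathlib import Path


def backup_before_chop(pier, backup_root):
    """Copy bhk (if present) and the event log directory aside before truncation."""
    urb = Path(pier) / ".urb"
    dest = Path(backup_root) / "pre-chop"
    saved = []
    for name in ("bhk", "log"):  # "log" as the db location is an assumption
        src = urb / name
        if src.is_dir():  # not every pier has a bhk directory
            # copytree refuses to overwrite an existing destination,
            # so an earlier backup is never clobbered
            shutil.copytree(src, dest / name)
            saved.append(name)
    return saved


def backup_after_chop(pier, backup_root):
    """Keep a separate copy of the first post-chop checkpoint."""
    src = Path(pier) / ".urb" / "chk"
    dest = Path(backup_root) / "post-chop" / "chk"
    shutil.copytree(src, dest)
    return dest
```

Both functions assume the ship is stopped while they run.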

@belisarius222
Contributor

I see. That makes sense, thanks.

matthew-levan changed the title from "Implement an event log truncation tool" to "Implement chop (event log truncation tool)" Jan 31, 2023
jalehman added a commit that referenced this issue Feb 9, 2023
## `chop`

`urbit chop <pier>` implements a simple, offline **event log
truncation**[^1] tool.

`chop` performs the following steps and then exits:

1. Gracefully stops the given pier (if running).
2. Backs up the current snapshot to `<pier>/.urb/bhk`.
3. Makes sure a current snapshot exists (i.e., is fully written to disk
   in `chk/*.bin` with no existing patch files).
4. Reads the metadata and the last event from the pier's event log.
5. Initializes a fresh event log in the `<pier>/.urb/log/chop`
   directory.
6. Writes the metadata and last event from the original log into the
   fresh one.
7. Renames the original event log to
   `<pier>/.urb/log/chop/data_<first>_<last>.mdb.bak`, where `first` and
   `last` are the first and last event numbers from the event log.
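
To illustrate only the final rename described above, here is a minimal Python sketch of the file naming and move. It is not the actual `chop` implementation and does not touch the database contents (reading the metadata and last event, or writing the fresh log). The function names and the `data.mdb` source file name are assumptions; only the `data_<first>_<last>.mdb.bak` pattern and the `log/chop` directory come from the description.

```python
from pathlib import Path


def archive_name(first, last):
    """Build the backup file name from the first and last event numbers."""
    return f"data_{first}_{last}.mdb.bak"


def archive_event_log(log_dir, first, last, db_name="data.mdb"):
    """Move the original log file into <log_dir>/chop under its archive name.

    db_name is an assumption; only the .bak name pattern comes from the text.
    """
    log_dir = Path(log_dir)
    chop_dir = log_dir / "chop"
    chop_dir.mkdir(exist_ok=True)
    target = chop_dir / archive_name(first, last)
    if target.exists():
        # never overwrite an archive left by an earlier run
        raise FileExistsError(target)
    (log_dir / db_name).rename(target)
    return target
```

For example, `archive_name(1, 42)` yields `data_1_42.mdb.bak`.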

Pilots are then free to move, archive, or delete their `.bak` event log
file, resume normal operation of their ship, and enjoy the many benefits
of lowered disk pressure and any reductions in associated hosting costs.

I've tested `chop` successfully on my own planet `~mastyr-bottec`
(multiple times), three different comets (all fresh), and multitudes of
fake galaxies.

Resolves #122.

Note: `knit`, which is the "undo" button for `chop`, is being
implemented in its own PR #184.

[^1]: https://roadmap.urbit.org/project/event-log-truncation
@zalberico
Collaborator

zalberico commented Apr 22, 2023

@matthew-levan I ran chop for the first time myself and it doesn't look like it makes a separate post chop backup of chk that's distinct from the one chk? Am I not looking in the right place? If that's the case, then whenever the user's chk gets corrupted it will be unrecoverable (since they can no longer replay the event log) and will require a breach.

> One thing that was missing: it's really important to make a chk backup post chop (renamed from trim based on feedback from Philip about trim's use elsewhere in the runtime). This could replace the bhk directory, which becomes useless post chop (though we should probably back up bhk alongside the db backup when the job runs, in case we have to revert).

It doesn't look like bhk updated on mine when I checked either.

@zalberico
Collaborator

zalberico commented Apr 22, 2023

Never mind - it did back it up inside bhk, I was looking at the created date for the directory, not the content - thanks!
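
The mix-up above (the directory's date versus its contents) can be avoided by checking the timestamps of the files inside the backup directory, since a directory's creation date does not change when files inside it are rewritten. A small Python sketch with a hypothetical helper name:

```python
from pathlib import Path


def newest_file_mtime(directory):
    """Return the latest modification time among files under directory.

    Returns None if the directory contains no files.
    """
    times = [p.stat().st_mtime for p in Path(directory).rglob("*") if p.is_file()]
    return max(times) if times else None
```

Comparing this value before and after running `chop` on the `bhk` directory shows whether its contents were actually refreshed.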
