[ares] Abstract snapshotting system #33

Merged
philipcmonk merged 9 commits into status from philip/pma on Mar 15, 2023
Conversation

philipcmonk (Contributor)

This vendors pma_malloc from @ashelkovnykov and hooks it up to our event lifecycle. This works for the small pills, but it doesn't work for the full azimuth pill, because that needs to write 3MB of data in one event and the hard-coded dirty page limit of 163 only allows ~650KB. It's unclear what the right solution is -- I suppose we should convert the page directory to a B+ tree; otherwise you would need a lot more double-buffered metadata pages, or another solution entirely.

This includes several fixes in pma_malloc, which should be reviewed carefully (and probably synchronously). The full PR diff is unhelpful for these, but this sub-diff should work.

This puts pma_malloc alongside the existing "just jam and double-buffer" persistence mechanism and puts them both behind a Snapshot trait. It's unclear if this will continue to be useful, but it's convenient for now. You can switch between them in serf.rs by changing one line.
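
For illustration, here's a rough sketch of what that trait boundary could look like -- only the `Snapshot` name and the idea of two interchangeable back ends come from this PR; the method names and signatures below are made up:

```rust
use std::io;
use std::path::Path;

/// Hypothetical shape of the snapshot abstraction; the real trait in this PR
/// may differ in methods and signatures.
pub trait Snapshot {
    /// Copy the live roots (arvo state, caches) into the snapshot after an event.
    fn save(&mut self, epoch: u64, event: u64, root: *mut u64);
    /// Flush to disk; called only when the king requests a sync.
    fn sync(&mut self) -> io::Result<()>;
    /// Recover (epoch, event, root) on startup.
    fn load(&mut self, path: &Path) -> io::Result<(u64, u64, *mut u64)>;
}
```

Switching back ends in serf.rs is then just a matter of constructing a different implementation of the trait (the existing jam-and-double-buffer snapshotter or the PMA-backed one).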

This extracts arvo to the PMA after every event (though only syncs when the king requests it). This isn't strictly required, but in the future it will make it easier to recover from errors and add parallelism, and even now it reduces snapshot latency.

Since we can't write to pages after they've been synced except by mallocing again, we mug every noun immediately before it goes into the PMA. This shouldn't be expensive; we already do this in vere.

However, it's unclear how unifying equality can work on two noun references in the PMA. If they turn out to be equal, we should rewrite one to the other, but this requires changing a page which may not be dirty and is therefore likely mmap'd PROT_READ. Two solutions come to mind: always mmap PROT_READ | PROT_WRITE, or supply an interface function for marking a page as dirty (and therefore copying the page behind the scenes).
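
As a sketch of the second option, something like the following could work from the Rust side -- note that `pma_dirty` does not exist in pma_malloc today; the name and usage here are assumptions:

```rust
use std::os::raw::c_void;

extern "C" {
    /// Hypothetical call that marks the page containing `ptr` as dirty,
    /// copying it behind the scenes so it becomes writable again.
    fn pma_dirty(ptr: *mut c_void);
}

/// When unifying equality finds two equal nouns, rewrite the reference slot
/// to point at the canonical copy; the slot may live in a clean (PROT_READ)
/// page, so it has to be dirtied first.
unsafe fn unify_slot(slot: *mut u64, canonical: u64) {
    pma_dirty(slot as *mut c_void);
    *slot = canonical;
}
```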

Finally, there is not yet a garbage collector for the PMA. I'm not sure what algorithm we should prefer -- copying is likely the easiest and implicitly compacts, though mark-and-sweep may work out better for larger arenas.

eamsden (Collaborator) left a comment

@philipcmonk the PMA was signed off as an initial version to free up @ashelkovnykov to work on other things, but definitely requires additional work to be suitable for integration. I expect this work would change the interface significantly, so I'm a bit wary of merging this right now. In particular, we do need to use B+ trees instead of an array in the root page as the index, and we do need an interface to explicitly dirty a page for unifying equality.

ashelkovnykov (Contributor) left a comment

  • Putting the pma_malloc code under rust/ is weird. Is there no way to put it into its own c/ folder but still reference it from the Rust code?
  • I'm not sure what it means to "vendor" something in this context, but copying the entire contents of the pma_malloc repo is not the direction we should go.
    • The code was neither fully documented nor fully tested before I was pulled off of it to do other work.
    • If we're going to make the PMA a first-class citizen of New Mars, then we should move the code into this repo properly. Tons of the files that are in the repo are just there for my own reference and offline work. None of the following files/folders from pma_malloc should make it into this PR:
      • .gitignore
      • reference/
      • snippets/
    • Beyond that, the code in pma_malloc/src and the makefile may want to be restructured if the code is now a component of a larger product.
    • Alternatively, if we want to keep pma_malloc at arm's length, then we should:
      • Transfer ownership of the repo to the Urbit org
      • Set up a stable release branch
      • Import the code as a git submodule
  • The "dirty page" limit of 164 is actually somewhat of a misnomer, since it is actually limited to 164 entries. All multi-page allocations require only one entry. The assumption was that any single event that generates a large volume of data will do so by means of a few large allocations, not very many small allocations. Is this not the case for the bootstrap pill? Is the pill generating a massive number of very small allocations as a single event? Is this an issue that's occurring during testing now or is this something that you were foreseeing in the future due to confusion over the "dirty page limit"?
    • If the pill can be split into more events or perform larger allocations, there should be no issue here.
    • Of course, this is only a stop-gap solution. The intent was to move to a B+ tree page directory after documenting and writing unit tests for the existing code.
  • Unifying equality and other optimizations (e.g. in-place free-and-malloc) were going to come later.
  • I'll let @eamsden speak to the above, and to garbage collection.

Will take a look at the specific changes to the pma_malloc code separately.

@eamsden eamsden mentioned this pull request Feb 28, 2023
eamsden (Collaborator) commented Feb 28, 2023

@philipcmonk I'm out today but it might be worth syncing this evening anyway.

philipcmonk (Contributor, Author)

To be clear, the first goal of this work was to determine what kind of impact integrating pma_malloc would have on the serf architecture. I want to drive toward handling errors in a Nock computation by running it in a thread and killing it, and to do that properly you need to be copying out of the stack between events, so I wanted at least a mock of a PMA. I figured I'd play around with pma_malloc and stop if I ran into a problem I couldn't solve in an hour or two, and this is how far I got. The total time investment in this code is only 2-3 half-days, which is completely worth it just for how edifying it has been for me.

With the understanding I now have of pma_malloc, it would be easy to create a mock PMA with system malloc, and that would let me continue on to look at error handling. So this PR isn't blocking for me.

The bugs I fixed in pma_malloc are mostly small and simple -- forgot to initialize a variable, etc. I knew development had been interrupted, so that wasn't a surprise. I did want to know if there were deeper issues that hadn't been considered, and the easiest way for me to learn that kind of thing is to dig into the code. The answer seems to be no: once those small bugs were fixed, it mostly just works, except for the issues you've already identified and planned solutions for.

The changes I made to pma_malloc beyond bug fixes are:

  • Added pma_in_arena(void*) so the copying code knows whether it needs to copy a noun out.
  • Added root to the metadata to record where the most recent root is, which will generally be the arvo state, bytecode cache, and anything else that needs to be persisted. You need this so that when you load, you know where to get started. There are a few possible variations, for example we could put the event number and epoch number inside the root as well.
  • Made pma_load return the epoch number, event number, and root. Right now it returns these as a struct, but out parameters plus an error code return could be preferable.
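
For reference, a rough guess at what Rust bindings for those additions could look like -- the struct layout and exact C signatures are inferred from the description above, not taken from the actual pma_malloc headers:

```rust
use std::os::raw::{c_char, c_void};

/// Guess at the struct pma_load is described as returning.
#[repr(C)]
pub struct PmaRoot {
    pub epoch: u64,
    pub event: u64,
    pub root: *mut c_void,
}

extern "C" {
    /// Nonzero if `ptr` lies inside the PMA, so the copier knows whether a
    /// noun still needs to be copied out of the stack.
    pub fn pma_in_arena(ptr: *const c_void) -> i32;
    /// Open the snapshot at `path` and report where to resume from.
    pub fn pma_load(path: *const c_char) -> PmaRoot;
}
```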

I don't have an opinion on the location of the pma_malloc code, and I'm happy to port these changes wherever. It can certainly go into a c/ directory, though I'm not sure programming language is the best top-level directory scheme. It's also easy to keep it in a separate repo, though I believe @eamsden would object to a submodule.

We use a lot of cells, and 650KB is about 20,000 cells. Both my test pill and many real-life events will generate more than 20,000 cells in a single event. For example, if you save an 800 line hoon file as a tape, even just in a dojo variable, you will generate more than 20,000 cells. However, if the goal is to move to B+ trees, then we don't need to consider this anymore, and I can work around the limit if I need to for any testing.
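
As a rough sanity check on those numbers (the 32-byte per-cell cost is an assumption, roughly a three-word cell plus allocator overhead):

```rust
fn main() {
    let budget_bytes = 650 * 1024; // ~650KB writable under the current limit
    let cell_bytes = 32;           // assumed: ~3 words per cell + overhead
    println!("~{} cells", budget_bytes / cell_bytes); // ≈ 20,800 cells
}
```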

eamsden (Collaborator) commented Mar 1, 2023

> We use a lot of cells, and 650KB is about 20,000 cells. Both my test pill and many real-life events will generate more than 20,000 cells in a single event. For example, if you save an 800 line hoon file as a tape, even just in a dojo variable, you will generate more than 20,000 cells. However, if the goal is to move to B+ trees, then we don't need to consider this anymore, and I can work around the limit if I need to for any testing.

To Alex's point, the limit is 164 possibly multi-page entries, not 164 pages. The idea is that the copier should count up how much space it needs (rather like the jam implementation in vere) and then malloc a single chunk of space.
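
A toy illustration of that two-pass shape (the real Noun representation and copier differ; this only shows "measure first, then one allocation"):

```rust
/// Toy noun, just to show the two-pass idea; the real representation differs.
enum Noun {
    Atom(u64),
    Cell(Box<Noun>, Box<Noun>),
}

/// Pass 1: count the words the noun will occupy in the destination arena.
fn measure(n: &Noun) -> usize {
    match n {
        Noun::Atom(_) => 1,
        Noun::Cell(h, t) => 2 + measure(h) + measure(t), // head + tail slots
    }
}

fn main() {
    let n = Noun::Cell(Box::new(Noun::Atom(1)), Box::new(Noun::Atom(2)));
    // Pass 2 would be a single PMA allocation of `measure(&n)` words -- one
    // dirty-directory entry no matter how many pages it spans -- followed by
    // the actual copy into that chunk.
    println!("{} words needed", measure(&n));
}
```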

But yes, this limit will be entirely dropped with a B+ tree page index. This will also allow us to run the NockStack inside the PMA and store the root by resetting the NockStack to be a single frame with the roots in its locals.

eamsden (Collaborator) commented Mar 1, 2023

Proposal for top-level directory schema:

silt - for glue code to the big ball of mud (c/rust/other terran code)
rock - for the crust around the diamond (runtime-level hoon)

</bikeshed>

philipcmonk (Contributor, Author)

> To Alex's point, the limit is 164 possibly multi-page entries, not 164 pages. The idea is that the copier should count up how much space it needs (rather like the jam implementation in vere) and then malloc a single chunk of space.

I'm not sure I understand this idea. If you do one big malloc, you aren't able to free nouns individually, right? And you need to do that for mark-and-sweep gc. Do you construct entire pages of each size of allocation, or something like that?

> This will also allow us to run the NockStack inside the PMA and store the root by resetting the NockStack to be a single frame with the roots in its locals.

I still don't really understand this, or why it's helpful to run the NockStack inside the PMA.

eamsden (Collaborator) commented Mar 2, 2023

> To Alex's point, the limit is 164 possibly multi-page entries, not 164 pages. The idea is that the copier should count up how much space it needs (rather like the jam implementation in vere) and then malloc a single chunk of space.

> I'm not sure I understand this idea. If you do one big malloc, you aren't able to free nouns individually, right? And you need to do that for mark-and-sweep gc. Do you construct entire pages of each size of allocation, or something like that?

Presumably you're not mallocing and freeing in the same event. The dirty page cache isn't a limit on the arena itself; it's just the limit on pages that need to be written in the next snapshot.

@ashelkovnykov am I remembering this right?

> This will also allow us to run the NockStack inside the PMA and store the root by resetting the NockStack to be a single frame with the roots in its locals.

> I still don't really understand this, or why it's helpful to run the NockStack inside the PMA.

Running the NockStack inside the PMA decouples computation memory limits from available system RAM and swap by treating the PMA backing store as swap. It also lets us store a custom struct (with the Preserve trait) as a snapshot root instead of needing to use space in the snapshot metadata for this.

The idea is that we won't do a snapshot in the middle of an event, so the event computation can just run in the arena with stack pages mapped to disk, but never written to disk unless the kernel decides to evict them. When we're done with an event computation, we reset the stack and then write a struct with references to a noun (Arvo) and a bytecode cache into a local in the top frame of the newly-reset stack. We know where this top frame is, so we can treat that local as the root when we garbage-collect the PMA. When we start a new event computation we just grab arvo and the bytecode cache from the local slot in that stack frame and commence the event computation.
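
A toy model of that scheme, with entirely made-up names (the real NockStack API is different):

```rust
/// Stand-ins for the references that need to survive between events.
struct Roots {
    arvo: u64,
    bytecode_cache: u64,
}

/// Toy stack: a vector of frames, each with its own locals.
struct ToyStack {
    frames: Vec<Vec<Roots>>,
}

impl ToyStack {
    /// After an event: reset to a single frame whose one local holds the roots.
    /// That local is what the PMA snapshot/GC treats as the root.
    fn reset_with_roots(&mut self, roots: Roots) {
        self.frames.clear();
        self.frames.push(vec![roots]);
    }

    /// Before the next event: read Arvo and the caches back out of that slot.
    fn roots(&self) -> &Roots {
        &self.frames[0][0]
    }
}

fn main() {
    let mut stack = ToyStack { frames: Vec::new() };
    stack.reset_with_roots(Roots { arvo: 0x1234, bytecode_cache: 0x5678 });
    println!("arvo root at {:#x}", stack.roots().arvo);
}
```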

ashelkovnykov (Contributor)

The dirty page limit is 164 [multi-page allocations/frees] + [pages touched by shared-page allocations/frees].

You're both correct, in different ways. @philipcmonk is correct that anything we pma_malloc as a single chunk needs to be freed as a single chunk as well, regardless of when that happens in the future. @eamsden is correct that there are optimizations on the table for how to work around this (e.g. even if we read a file as a tape, we can pre-compute how much space it needs).

However, I recall that in a conversation we had with Ted, Ed brought up an idea for how it's possible to work around this based on some guarantees about what we're storing/how we're using it. @eamsden Does this sound familiar?

Regardless, a B+ tree page directory is the future.

re: NockStack -
Please correct me, @eamsden, if this is a bad way to visualize it, but the way I picture it is that the PMA is like one giant Vere road, except that at the end of an event it's always left with only one thing on the stack: a pointer to the Arvo root.

philipcmonk (Contributor, Author)

I think the right thing to do here is:

  • I'll remove pma_malloc from this PR and transfer the diffs against it to a PR to ashelkovnykov/pma_malloc
  • I'll delete snapshot/pma.rs. It will be available in the git history if we want to refer to it when we actually integrate pma_malloc
  • I'll preserve the changes to abstract over the snapshot mechanism, to make it easier to switch between different options during development.
  • I may go ahead and implement a trivial outside-the-stack arena so that we can look at running events inside killable threads. That arena will either be system malloc or just a separate NockStack (either of which will leak, but that's fine for scaffolding)

@philipcmonk philipcmonk changed the base branch from status to peter/lint March 15, 2023 03:57
Base automatically changed from peter/lint to status March 15, 2023 04:01
philipcmonk added a commit to philipcmonk/pma_malloc that referenced this pull request Mar 15, 2023
These are mostly small bugs I ran into when trying to run this in
zorp-corp/sword#33.  They should mostly be self-explanatory, but I'm
happy to explain any of them or go over this synchronously.

The changes I made beyond bug fixes are:

- Added `pma_in_arena(void*)` so the copying code knows whether it needs
  to copy a noun out.
- Added `root` to the metadata to record where the most recent root is,
  which will generally be the arvo state, bytecode cache, and anything
  else that needs to be persisted. You need this so that when you load,
  you know where to get started. There are a few possible variations,
  for example we could put the event number and epoch number inside the
  root as well.
- Made pma_load return the epoch number, event number, and root. Right
  now it returns these as a struct, but out parameters plus an error
  code return could be preferable.
philipcmonk (Contributor, Author)

I did the first three of those, and will reserve the fourth for a later PR. I also merged status, fixing clippy errors, etc. This is ready for review.

eamsden (Collaborator) left a comment

This PR probably needs to be renamed, as integrating pma_malloc is precisely what it does not do anymore. But substantively it is good to go.

@philipcmonk philipcmonk changed the title [ares] Initial integration of pma_malloc [ares] Abstract snapshotting system Mar 15, 2023
@philipcmonk philipcmonk merged commit 853b96c into status Mar 15, 2023
@philipcmonk philipcmonk deleted the philip/pma branch March 15, 2023 04:30
litlep-nibbyt added a commit that referenced this pull request Nov 4, 2024
* crown: add time and sync features

* spacing