[ares] Abstract snapshotting system #33
Conversation
All tests pass except the azimuth pill, which requires too many dirty pages in one event.
@philipcmonk the PMA was signed off as an initial version to free up @ashelkovnykov to work on other things, but definitely requires additional work to be suitable for integration. I expect this work would change the interface significantly, so I'm a bit wary of merging this right now. In particular, we do need to use B+ trees instead of an array in the root page as the index, and we do need an interface to explicitly dirty a page for unifying equality.
- Putting `pma_malloc` code under `rust` is weird. Is there no way to put it into its own `c` folder but still reference it from the Rust code?
- I'm not sure what it means to "vendor" something in this context, but copying the entire contents of the `pma_malloc` repo is not the direction we should go.
- The code was neither fully documented nor fully tested before I was pulled off of it to do other work.
- If we're going to make the PMA a first-class citizen of New Mars, then we should move the code into this repo properly. Tons of the files that are in the repo are just there for my own reference and offline work. None of the following files/folders from `pma_malloc` should make it into this PR: `.gitignore`, `reference/`, `snippets/`
- Beyond that, the code in `pma_malloc/src` and the `makefile` may want to be restructured if the code is now a component of a larger product.
- Alternatively, if we want to keep `pma_malloc` at arm's length, then we should:
  - Transfer ownership of the repo to the Urbit org
  - Set up a stable release branch
  - Import the code as a git submodule
- The "dirty page" limit of 164 is actually somewhat of a misnomer, since it is actually a limit of 164 entries (see the sketch after this comment). All multi-page allocations require only one entry. The assumption was that any single event that generates a large volume of data will do so by means of a few large allocations, not very many small allocations. Is this not the case for the bootstrap pill? Is the pill generating a massive number of very small allocations as a single event? Is this an issue that's occurring during testing now, or is this something that you were foreseeing in the future due to confusion over the "dirty page limit"?
- If the pill can be split into more events or perform larger allocations, there should be no issue here.
- Of course, this is only a stop-gap solution. The intent was to move to a B+ tree page directory after documenting and writing unit tests for the existing code.
- Unifying equality and other optimizations (e.g. in-place free-and-malloc) were going to come later.
- I'll let @eamsden speak to the above and to garbage collection.

Will take a look at the specific changes to the `pma_malloc` code separately.
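To make the entries-vs-pages distinction concrete, here is a small toy model of the dirty-page bookkeeping as described in this thread -- the names and layout are invented for illustration, not taken from pma_malloc:

```rust
// Toy model: the limit is 164 directory *entries*, where one multi-page
// allocation consumes a single entry, while many small allocations end up
// dirtying many separate pages, each needing its own entry.
const MAX_DIRTY_ENTRIES: usize = 164;

/// One entry covers a contiguous run of dirtied pages.
struct DirtyEntry {
    first_page: usize,
    num_pages: usize,
}

#[derive(Default)]
struct DirtyDirectory {
    entries: Vec<DirtyEntry>,
}

impl DirtyDirectory {
    /// Record a contiguous run of newly dirtied pages (e.g. one multi-page
    /// allocation, or one small-object page touched for the first time).
    fn record_dirty_run(&mut self, first_page: usize, num_pages: usize) -> Result<(), &'static str> {
        if self.entries.len() == MAX_DIRTY_ENTRIES {
            // Hit by an event that dirties more than 164 separate runs,
            // even though one huge allocation (a single run) would have fit.
            return Err("dirty page directory full");
        }
        self.entries.push(DirtyEntry { first_page, num_pages });
        Ok(())
    }
}
```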
@philipcmonk I'm out today, but it might be worth syncing this evening anyway.
To be clear, the first goal of this work was to determine what kind of an impact integrating pma_malloc would have on the serf architecture. I want to drive toward error handling a Nock computation by running it in a thread and killing it, and you need to be copying out of the stack between each event to do that properly, so I wanted at least a mock of a PMA. I figured I'd play around with pma_malloc and stop if I ran into an individual problem I couldn't solve in an hour or two, and this is how far I got. Total time investment in this code is only 2-3 half-days, which is completely worth it just for how edifying it has been for me.

With the understanding I now have of pma_malloc, it would be easy to create a mock PMA with system malloc, and that would let me continue on to look at error handling. So this PR isn't blocking for me.

The bugs I fixed in pma_malloc are mostly small and simple -- forgot to initialize a variable, etc. I knew development had been interrupted, so that wasn't a surprise. I did want to know if there were deeper issues that hadn't been considered, and the easiest way for me to learn that kind of thing is to dig into the code. The answer seems to be no: once those small bugs were fixed, it mostly just works, except for those issues you guys have already identified and planned solutions for.

The changes I made to pma_malloc beyond bug fixes are:
I don't have an opinion on the location of the pma_malloc code.

We use a lot of cells, and 650KB is about 20,000 cells. Both my test pill and many real-life events will generate more than 20,000 cells in a single event. For example, if you save an 800-line hoon file as a tape, even just in a dojo variable, you will generate more than 20,000 cells.

However, if the goal is to move to B+ trees, then we don't need to consider this anymore, and I can work around the limit if I need to for any testing.
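For reference, the arithmetic behind the ~20,000-cell figure, with an assumed (not measured) per-cell size:

```rust
// Back-of-the-envelope check of the numbers above. CELL_BYTES is an
// assumption for illustration; the real in-arena cell size may differ.
const DIRTY_PAGE_LIMIT: usize = 163;
const PAGE_BYTES: usize = 4 * 1024;                        // 4 KB pages
const BUDGET_BYTES: usize = DIRTY_PAGE_LIMIT * PAGE_BYTES; // ~650 KB per event
const CELL_BYTES: usize = 32;                              // assumed size of one cell
const CELLS_PER_EVENT: usize = BUDGET_BYTES / CELL_BYTES;  // ~20,000 cells
```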
To Alex's point, the limit is 164 possibly multi-page entries, not 164 pages. The idea is that the copier should count up how much space it needs (rather like the …).

But yes, this limit will be entirely dropped with a B+ tree page index. This will also allow us to run the NockStack inside the PMA and store the root by resetting the NockStack to be a single frame with the roots in its locals.
Proposal for top-level directory schema:
I'm not sure I understand this idea. If you do one big malloc, you aren't able to free nouns individually, right? And you need to do that for mark-and-sweep gc. Do you construct entire pages of each size of allocation, or something like that?
I still don't really understand this, or why it's helpful to run the NockStack inside the PMA.
Presumably you're not mallocing and freeing in the same event. The dirty page cache isn't the limit for the arena, it's just the limit for pages that need to be written in the next snapshot. @ashelkovnykov am I remembering this right?
Running the NockStack inside the PMA decouples computation memory limits from available system RAM and swap by treating the PMA backing store as swap. It also lets us store a custom struct (with the arvo noun, the bytecode cache, and anything else we need to persist) as the root.

The idea is that we won't do a snapshot in the middle of an event, so the event computation can just run in the arena with stack pages mapped to disk, but never written to disk unless the kernel decides to evict them. When we're done with an event computation, we reset the stack and then write a struct with references to a noun (Arvo) and a bytecode cache into a local in the top frame of the newly-reset stack. We know where this top frame is, so we can treat that local as the root when we garbage-collect the PMA. When we start a new event computation we just grab arvo and the bytecode cache from the local slot in that stack frame and commence the event computation.
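A minimal, self-contained sketch of that lifecycle, with invented names and types (the real Ares NockStack and noun representation are different):

```rust
// Invented types for illustration only; not the actual Ares NockStack.
type Noun = u64; // stand-in for a noun reference into the PMA

/// The persistent roots kept between events.
struct PmaRoot {
    arvo: Noun,
    bytecode_cache: Noun,
}

/// Toy single-frame stack whose locals live in a Vec.
struct NockStack {
    locals: Vec<Option<PmaRoot>>,
}

impl NockStack {
    /// After an event: collapse to a single frame and store the roots in a
    /// known local slot. A PMA garbage collector can treat that slot as the
    /// root set, since the location of the top frame is known.
    fn end_event(&mut self, arvo: Noun, bytecode_cache: Noun) {
        self.locals.clear();
        self.locals.push(Some(PmaRoot { arvo, bytecode_cache }));
    }

    /// Before the next event: recover arvo and the bytecode cache from the
    /// same slot and commence the computation.
    fn begin_event(&self) -> (Noun, Noun) {
        let root = self.locals[0].as_ref().expect("no root stored");
        (root.arvo, root.bytecode_cache)
    }
}
```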
> The dirty page limit is 164

You're both correct, in different ways. @philipcmonk is correct that anything we …

However, I recall that in a conversation we had with Ted, Ed brought up an idea for how it's possible to work around this based on some guarantees about what we're storing/how we're using it. @eamsden Does this sound familiar?

Regardless, a B+ tree page directory is the future.

re: NockStack -
I think the right thing to do here is:
These are mostly small bugs I ran into when trying to run this in zorp-corp/sword#33. They should mostly be self-explanatory, but I'm happy to explain any of them or go over this synchronously. The changes I made beyond bug fixes are:

- Added `pma_in_arena(void*)` so the copying code knows whether it needs to copy a noun out.
- Added `root` to the metadata to record where the most recent root is, which will generally be the arvo state, bytecode cache, and anything else that needs to be persisted. You need this so that when you load, you know where to get started. There are a few possible variations; for example, we could put the event number and epoch number inside the root as well.
- Made `pma_load` return the epoch number, event number, and root. Right now it returns these as a struct, but out parameters plus an error code return could be preferable.
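Roughly how the Rust side might declare these entry points; the parameter list of `pma_load` and the struct layout below are assumptions for illustration, not the actual pma_malloc header:

```rust
// Hypothetical Rust FFI declarations for the changed C entry points.
// Signatures are guesses based on the description above, not the real header.
use std::os::raw::{c_char, c_void};

#[repr(C)]
pub struct PmaLoadResult {
    pub epoch: u64, // epoch number recorded in the snapshot metadata
    pub event: u64, // event number recorded in the snapshot metadata
    pub root: u64,  // reference to the persisted root noun
}

extern "C" {
    /// True if `ptr` lies inside the PMA, so the copier can skip copying it out.
    pub fn pma_in_arena(ptr: *const c_void) -> bool;

    /// Load an existing snapshot, returning epoch/event/root as one struct.
    /// (Out parameters plus an error-code return would be the alternative.)
    pub fn pma_load(path: *const c_char) -> PmaLoadResult;
}
```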
I did the first three of those, and will reserve the fourth for a later PR. I also merged
This PR probably needs to be renamed, as integrating pma_malloc is precisely what it does not do anymore. But substantively it is good to go.
* crown: add time and sync features
* spacing
This vendors pma_malloc from @ashelkovnykov and hooks it up to our event lifecycle. This works for the small pills, but it doesn't work for the full azimuth pill because that needs to write 3MB of data in one event, and the hard-coded dirty page limit of 163 only allows ~650KB. It's unclear what the right solution to this is -- I suppose we should convert the page directory to a B+ tree (otherwise you would need a lot more double-buffered metadata pages), or find another solution entirely.
This includes several fixes in pma_malloc, which should be reviewed carefully (and probably synchronously). The full PR diff is unhelpful for these, but this sub-diff should work.
This puts pma_malloc alongside the existing "just jam and double-buffer" persistence mechanism and puts them both behind a `Snapshot` trait. It's unclear if this will continue to be useful, but it's convenient for now. You can switch between them in serf.rs by changing one line.

This extracts arvo to the PMA after every event (though it only syncs when the king requests it). This isn't strictly required, but in the future it will make it easier to recover from errors and add parallelism, and even now it reduces snapshot latency.
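A guess at the rough shape of such a trait; the actual method names and types in serf.rs may differ:

```rust
// Hypothetical shape of the Snapshot trait; illustrative only.
type Noun = u64; // stand-in for the runtime's noun type

trait Snapshot {
    /// Record the post-event arvo state (and anything else to persist).
    fn save(&mut self, event_num: u64, arvo: Noun);
    /// Durably write the most recently saved state when the king requests it.
    fn sync(&mut self) -> std::io::Result<()>;
    /// Restore the latest synced state: (event number, arvo).
    fn load(&mut self) -> std::io::Result<(u64, Noun)>;
}

// The two implementations this PR puts behind the trait; serf.rs picks one
// with a one-line change.
struct DoubleBufferJam { /* jam the state and alternate between two files */ }
struct Pma { /* copy nouns into pma_malloc and sync its pages */ }
```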
Since we can't write to pages after they've been synced except by mallocing again, we mug every noun immediately before it goes into the PMA. This shouldn't be expensive; we already do this in vere.
However, it's unclear how unifying equality can work on two noun references in the PMA. If they turn out to be equal, we should rewrite one to the other, but this requires changing a page which may not be dirty and therefore is likely mmap'd `PROT_READ`. Two solutions that come to mind are: always mmap `PROT_READ | PROT_WRITE`, or supply an interface function for marking a page as dirty (and therefore copying the page behind the scenes).

Finally, there is not yet a garbage collector for the PMA. I'm not sure what algorithm we should prefer -- copying is likely the easiest and implicitly compacts, though mark-and-sweep may work out better for larger arenas.
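Regarding the second option above (an explicit dirty-page call), a sketch of what it might look like -- the function name and bookkeeping are invented, and a real version would also copy the page into its double-buffered slot behind the scenes:

```rust
// Hypothetical pma_dirty() interface: make the page containing `ptr` writable
// so unifying equality can rewrite a noun reference in place, and remember the
// page so it gets written out at the next snapshot. Invented for illustration.
use libc::{c_void, mprotect, PROT_READ, PROT_WRITE};

const PAGE_SIZE: usize = 4096;

unsafe fn pma_dirty(ptr: *mut u8, dirty_pages: &mut Vec<usize>) -> std::io::Result<()> {
    // Round down to the start of the page containing `ptr`.
    let page_addr = (ptr as usize) & !(PAGE_SIZE - 1);
    if mprotect(page_addr as *mut c_void, PAGE_SIZE, PROT_READ | PROT_WRITE) != 0 {
        return Err(std::io::Error::last_os_error());
    }
    // A real implementation would also copy the page to its shadow slot here,
    // so the on-disk snapshot stays consistent until the next sync.
    dirty_pages.push(page_addr);
    Ok(())
}
```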