hannahhoward
left a comment
There was a problem hiding this comment.
Overall, this is good, but there are some critical missing bits that I think will emerge from working closely with @Peeja, @alanshaw and the existing Go devs.
Specifically,
- "Service layer" -- is this a server? A secondary process on the dev machine? I don't think either of these is a good idea, and the server approach would break a number of design principles of our system (namely that all CIDs should be generated on the client). Personally I think a language port is gonna be WAY faster (especially with AI-aided dev) and produce a way less complex system, so I'd argue strongly for a full port. These aren't complex libraries, and I think we could have them ported in a week or two with the AI helping. Then we have a single process on a single machine, which is way simpler to maintain and reason about.
- There's a bit of unspecified confusion about how Pail works in the Forge context, which I think you might need to embed with @Peeja on guppy to really grok. Guppy has a notion of "sources" -- i.e. data sources (usually large, deep directories) that get uploaded within a space. Each space has 1..n sources, and when you upload within a space, after the first upload of a source, only the "delta" gets uploaded -- Guppy knows how to upload just the blocks needed to make a new updated UnixFS root. So with mutability:
- You have the list of sources which get updated, and you DEFINITELY want that to be represented by Pail + UCN.
- You have the directory tree structure within the sources itself. This is currently UnixFS and is updated properly each incremental upload.
- So the real question is whether to use Pail for the whole directory tree, and I think that's a complicated question that merits further examination.
Reasons not to use Pail:
- These are extremely big, complicated directories, and Pail hasn't been tested at a scale even remotely close to what these directories require.
- The retrieval patterns and general usage for Pail are totally different than for UnixFS -- so the downstream implications of using Pail for the whole directory tree structure are unknown.
Reasons to use Pail:
- Much more fine-grained "multi-writer" capabilities are unlocked if you use Pail for everything. If you used Pail for just the sources list, then you'd essentially have last-writer-wins on a per-source level -- if source X is in state A, and two different guppies make several changes to the directory tree structure, written as UnixFS, then the directory structure would by default ONLY get the changes of the last client to write. Note: we could apply a smarter merge outside of Pail, similar to the way I merge Markdown files in Clawracha. I actually believe this wouldn't be TOO hard.
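To make the "smarter merge outside of Pail" idea concrete, here is a minimal sketch of a per-path three-way merge. It assumes the directory tree can be flattened to a `path → CID` map; `Tree` and `Merge3` are hypothetical names for illustration, not part of Guppy, Pail, or Clawracha, and the conflict rule is just one deterministic choice.

```go
package main

import "sort"

// Tree is a hypothetical flattened directory tree: path -> CID (as a string).
// A missing path reads as "", which this sketch treats as "absent/deleted".
type Tree map[string]string

// Merge3 merges two trees that both diverged from base, path by path.
// A path changed on only one side takes that side's value. A path changed
// differently on both sides is reported as a conflict and resolved
// deterministically: prefer a surviving entry over a deletion, otherwise
// take the lexicographically smaller CID.
func Merge3(base, a, b Tree) (Tree, []string) {
	paths := map[string]struct{}{}
	for _, t := range []Tree{base, a, b} {
		for p := range t {
			paths[p] = struct{}{}
		}
	}

	merged := Tree{}
	var conflicts []string
	for p := range paths {
		bv, av, cv := base[p], a[p], b[p]
		var v string
		switch {
		case av == cv: // both sides agree (or neither changed)
			v = av
		case av == bv: // only b changed
			v = cv
		case cv == bv: // only a changed
			v = av
		default: // both changed, differently
			conflicts = append(conflicts, p)
			switch {
			case av == "":
				v = cv
			case cv == "" || av < cv:
				v = av
			default:
				v = cv
			}
		}
		if v != "" {
			merged[p] = v
		}
	}
	sort.Strings(conflicts)
	return merged, conflicts
}
```

With this shape, two guppies editing different paths of the same source both keep their changes, and only same-path edits fall back to a tie-break.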
Final sidebar: Current Guppy is also smart enough to only upload diff blocks for Files when they change. Encryption will kill that ability I believe, unless there's some useful way to encode only changes that works for encrypted data. Worth a google.
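As a rough illustration of the diff-block idea, the sketch below hashes fixed-size chunks and uploads only the ones not seen before. Everything here is an assumption for illustration: `ChunkIDs` and `DeltaBlocks` are hypothetical names, SHA-256 hex digests stand in for real block CIDs, and fixed-size chunking stands in for whatever chunker Guppy actually uses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// ChunkIDs splits data into fixed-size chunks and returns one content hash
// per chunk. The hex SHA-256 digest is a stand-in for a block CID.
func ChunkIDs(data []byte, size int) []string {
	if size <= 0 {
		return nil
	}
	var ids []string
	for start := 0; start < len(data); start += size {
		end := start + size
		if end > len(data) {
			end = len(data)
		}
		sum := sha256.Sum256(data[start:end])
		ids = append(ids, hex.EncodeToString(sum[:]))
	}
	return ids
}

// DeltaBlocks returns the block IDs in next that are not present in prev,
// i.e. the only blocks a client would need to upload for the new version.
func DeltaBlocks(prev, next []string) []string {
	seen := make(map[string]struct{}, len(prev))
	for _, id := range prev {
		seen[id] = struct{}{}
	}
	var delta []string
	for _, id := range next {
		if _, ok := seen[id]; !ok {
			delta = append(delta, id)
			seen[id] = struct{}{} // avoid listing a repeated new block twice
		}
	}
	return delta
}
```

If each version is encrypted with a fresh random nonce or IV before chunking, unchanged plaintext produces different ciphertext, so every block ID changes and the delta becomes the whole file. That is the effect the sidebar is worried about.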
| │ │ │ | ||
| │ ▼ │ | ||
| │ 9. Publish to UCN: Name.publish(pailRootCID) │ | ||
| │ → mutable name now points to updated index │ |
There was a problem hiding this comment.
This is "pail without CRDT" - in the case of multiple concurrent updates to the same name, the UCN resolution is to just use the first of the alphabetically sorted CIDs (IIRC). It means if 2 users start with the same pail, and both make an update, only 1 wins.
The Pail CRDT library allows the two updates to be applied, only resorting to alphabetically sorted CIDs when the two updates have the same causal order and touch the same key.
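A minimal sketch of the "pail without CRDT" behaviour described above, assuming (per the IIRC) that resolution simply sorts the competing head CIDs as strings and takes the first. `ResolveHead` is a hypothetical name for illustration, not the UCN API.

```go
package main

import "sort"

// ResolveHead picks one winner among concurrent head CIDs by sorting their
// string forms and returning the first. All other heads, and the updates
// they carry, are dropped. It returns false if there are no heads.
func ResolveHead(heads []string) (string, bool) {
	if len(heads) == 0 {
		return "", false
	}
	sorted := append([]string(nil), heads...) // copy so the caller's slice is untouched
	sort.Strings(sorted)
	return sorted[0], true
}
```

The rule is deterministic, so every client converges on the same root, but the losing writer's changes are gone. The Pail CRDT path only needs a tie-break like this when two updates are causally concurrent and touch the same key.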
| │ - KMS info │ | ||
| │ │ │ | ||
| │ ▼ │ | ||
| │ 5. Extract encrypted content from CAR using encryptedDataCID│ |
There was a problem hiding this comment.
Why don't we encrypt each block?
After the feedback received and the new POC completed by @alanshaw, I decided to break it into 2 RFCs: |
There was a problem hiding this comment.
I'm pretty set on using Pail, especially since the go library exists, but if @alanshaw disagrees I'd be open to using the catalog approach.
| ``` | ||
| **POC Status:** | ||
| - Step 1 (gateway fetch): ✅ Existing infrastructure |
There was a problem hiding this comment.
no gateway in the forge context -- direct retrieval from SPs
There was a problem hiding this comment.
@hannahhoward, when we run `guppy gateway serve`, is that how the SPs will allow retrievals, i.e. through this "gateway"?
| Guppy SHOULD support two types of key rotation via CLI commands (KMS mode only): | ||
| ### KEK Rotation (Space Key) |
There was a problem hiding this comment.
It's pretty cool that we can rotate the core RSA key without re-encrypting all the files.
| This mode does NOT provide access control — anyone with the key can decrypt. | ||
| ### KMS Mode (Production/Enterprise) |
There was a problem hiding this comment.
I believe all external forge users will be in this tier
| ## Approaches Under Consideration | ||
| ### Option A: Simple Catalog (Alex's Proposal) |
There was a problem hiding this comment.
So:
I do not believe a single catalog file is a good idea for the size of Forge directories (potentially several thousands or up to a million files/folders) -- an update would mean a large upload just to change a single value.
Yes, you could start chunking past a certain size, but then you're chunking up a key value list and essentially reimplementing Pail from scratch. Pail already considers chunking and all that comes with it -- like inserts and deletes, as well as sorting and range querying.
And https://github.com/storacha/go-pail already exists
| **How it works** | ||
| - **Namespace:** UCN Names - ed25519 keypairs that can be delegated and shared |
There was a problem hiding this comment.
UCN is a bit of a misnomer as a library one needs to port -- it's all of a few hundred lines of code and is really just scaffolding to provide a simple UI around clock/head + clock/advance.
Ultimately I think we can just use UCN + Pail
| **How it works** | ||
| - **Namespace:** UCN Names - ed25519 keypairs that can be delegated and shared | ||
| - **State Index:** Pail - sharded Merkle trie for `path → CID` mappings |
There was a problem hiding this comment.
This is an interesting question as it relates to uploads -- we upload large folders as UnixFS -- this is important for retrieval as all software tends to assume UnixFS.
We've discussed various proposals for putting the whole index in Pail vs putting only the uploads in UnixFS.
I think we could get by with just Pail for the upload list. Or, if we want to use Pail for everything, I think it would make sense to keep uploading in UnixFS (for traditional retrieval patterns) but also store all the file CIDs in Pail.
After the feedback received and the new POC completed by @alanshaw, I decided to break it into 2 RFCs:
Closes storacha/project-tracking#663