Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON CRDT 2.0 #228

Open
streamich opened this issue Dec 7, 2022 · 0 comments
Open

JSON CRDT 2.0 #228

streamich opened this issue Dec 7, 2022 · 0 comments

Comments

@streamich
Copy link
Owner

streamich commented Dec 7, 2022

Things to consider:

  • Faster RGA implementation, one optimisation is to reference RGA block ID in insert operation. This will speed up insertion from ~O(log N) to ~O(1), but at a cost of extra ID passed with each insert operation and the internal data structure of the RGA will need to index each block by its initial ID.
  • Move operations across different document nodes. Consider moves within an object vs moves across the whole document.
  • Low-level multi-value register support. (MV register can be done in user-space using an RGA array.)
  • Currently, JSON CRDT is operation-based CRDT. Consider if it also should work as state-based CRDT and delta CRDT.
    • Store metadata necessary to re-create patches in the document, so that patches don't need to be stored separately.
  • Recursion and duplicate nodes.
    • In principle a CRDT can function where some of the nodes are used/referenced more than once, even if the references create a recursion in node structure. In practice, does that have a use case?
  • Serialisation for CAS
    • Stable serialisation, e.g. same document or same user changes result in the same serialised representation, which allows for de-duplication.
    • Serialisation that works well with CDC.
    • Consider if there is a good scenario where JSON CRDT 2.0 could support infinite size JSON-like CRDT based on CAS, where nodes can be referenced by their CIDs. (CAS ensures there are no recursive loops.)
  • Low barrier to entry
    • Consider read-only clients, that only want to read the document. There could be a serialisation method which is easy for them to parse.
    • Consider JSON CRDT clients that should function but cannot support Block-wise RGA algorithm. I.e. all RGA data types would be read-only in those clients, but LWW data types would still function as normal.
  • Rich-text support considerations.
    • Storing node IDs in user-space, for example, Peritext block and inline annotations storage in user-space RGA array. The main semantic problem is that now user-space stores IDs, which are text RGA internals, and those IDs cannot be used to render rich text, without parsing the RGA internals.
    • Consider if it is possible to have nested inline text annotations. For example, <sub> inside of <sup>. That way arbitrary complexity math formulas should be possible to edit.
  • Algorithm support: JSON CRDT supports just two algorithms RGA and Last-write wins registers. Out of those more complex data types can be built, for example, out of RGA a sets and multi-value registers can be constructed. Consider if it is necessary to introduce more/other algorithms into the implementation.
  • Reconsider clocks
    • JSON CRDT supports two clock types: (1) Logical vector clock; (2) Total Order Broadcast server clock.
    • For JSON CRDT 2.0 consider if any of the following clocks could be used: (1) Merkle clock; (2) bounded version vectors; (3) Hybrid logical clock; (3) Dotted version vectors; (4) Tree clocks; (5) Interval tree clocks; (6) GUN-like wall clock with constraints.
  • Merkle DAG CRDT native support. Maybe embrace Merle DAG:
    • Clocks would be Merkle clocks.
    • All objects could reference each other by CIDs.
    • There could be "infinite" length CRDTs, such as HAMT CRDT and Feed CRDT in P4. Infinite-length data type support:
      • HAMT CRDT — an infinite size key-value LWW CRDT.
      • Feed CRDT — an infinite size ordered list.
      • Maybe group CRDTs into: (1) finite size; (2) unbound size ones.
    • For performance, could be still useful to group some objects into blocks: say, RGA objects are their separate blocks but LWW objects could be grouped in one DAG block.
    • Consider how to work around IPFS block size constraints for large CRDTs.
    • Adding additional pointers to the DAG to work around thin DAH problem.
  • Natural evolution to v2. Maybe, instead of the stop-the-world-full-rewrite the JSON CRDT could seamlessly migrate to JSON CRDT 2.0, e.g. JSON CRDT 2.0 would be a superset of JSON CRDT.
  • Consider embedding "stored procedures" inside the CRDT or having an ability for operations to be custom defined by some scripting language. For example, use the JSON Expression language.
    • The stored procedure would need to be able to integrate into the CRDT algorithm, like, into them merge algorithm. Otherwise, if it is just there for user-space purposes, like, for computing an alternative view of the data, then the "procedure" might as well live in application code.
  • Consider adding Reference Value type, which is a value which refers to another CRDT.
  • GraphQL-like API, IPLD-like lenses, "projectsion API", ability to subscribe to a subset of data, especially if CRDT is a graph of nodes in CAS.
  • Research why YATA algorithm generates less chunks than Blockwise-RGA. (At least that is the case using Martin's trace.)
  • Rich-text/Peritext integration into the main spec?
  • Code editor support—they require quick navigation by line number. Maybe Peritext could insert virtual block at every newline \n position and provide fast line indexing.
  • Dealing with loss of context (large deletions)—imagine in a collaborative rich-text editor user selects few paragraphs (maybe whole document) to copy the text Ctrl + C, but accidentally presses Ctrl + X, instead, then—to fix the document—user re-inserts the text with Ctrl + V. This results in complete loss of content and all causal context for other users (other user cursors get lost, other user inserts get inserted into a void).
  • Optimization: local RGA inserts do not need to follow the standard RGA algorithm. Local inserts do not encounter preemptive siblings, i.e. when doing a local insert, there should be no IDs between the ref ID and the new op ID.
  • RGA is composed of two binary trees. It is serialized by text-order. Consider ability to serialize by ID-order. This way on the server the RGA could be unserialized with balanced ID tree and insertion could be implemented without hydrating the text-order tree. Even if text-order tree is hydrated, the main advantaged is to have the ID-order tree balanced, as that tree will be used to perform an insertion lookup.
  • When unserializing balance the by-ID tree, instead the by-text tree.
  • Are there better algorithms then RGA? What about YATA?
  • Consider explicitly support deep integration with UI to be able to perform local operations at O(1) speed. For example, a cursor in a text editor can maintain a direct pointer to the RGA chunk to which the cursor is pointing.
  • Use real world traces to compute relative frequencies of different insert/delete code paths. (Inset at root, insert at root with concurrency, insert after a block, insert after a block with merge, block split, deletes ...).
  • When doing splay rotations, consider moving tobstones as low as possible in zig-zig rotation.
  • Consider implementing triple-zip rotations, where tombsones are extra rotated down.
  • Consider storing the whole causal graph, i.e. each operation would also contain the "parent"—the last know operation in the document or the last known operation in a specific object.
  • Maybe Peritext can be integrated somehow into JSON CRDT such that internal timestamps are not exposed. Maybe using some format like Quill Delta.
  • Switch to Fugue algorithm? Or, execute RGA by default and add ability to optionally switch into Fugue-mode, when the right origin is presented in the operation.
  • Sync9 and Fugue result into the same sorting order, because, apparently, to solve for interleaving there is not much possibilities for alternative sorting variants. Maybe, it is possible to solve for all interleaving problems without much extra metadata (with less metadata than proposed in Fugue)?
  • Permanent Range Deletes --- RGA delete operation where: (1) a range is deleted; (2) tombstones are deleted forever, with exception for the last character in the deletion range, which is left as a tombstone.
  • Consider ability of embedding document schema at the creation time of the document. All patches would need to result in a valid model according to the schema, a patch which create an invalid schema would be illegal.
  • Add ability to support "cherry-picking", where only the picked patch and its immediate dependencies are merged, without needing to merge the whole prior history.
  • User-centric undo/redo: the same user might be on different tabs and different devices, all those different processes might have different CRDT session IDs. Specify undo/redo such that it combines all user changes across the processes.
  • Consider how Fully Replayable Histories (FRH) could be supported.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Needs clarification 🧪
Development

No branches or pull requests

1 participant