
[WIP] Introduce an atomic bulk append RPC function #264

Open · wants to merge 13 commits into master
Conversation

@Happy0 commented Aug 7, 2019

This pull request introduces a publishAll RPC function which accepts an array of messages to be appended to the log atomically. Either all messages are written successfully, or none are written.

Under the hood, flume's append function already uses LevelDB's batch operation, which writes the messages atomically. This pull request just exposes that functionality further up the stack.

In level-db: https://github.com/Level/levelup#batch

In flume: https://github.com/flumedb/flumelog-level/blob/37edf4a94ceaad88b3bb3bf436517c185449b059/index.js#L25
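The all-or-nothing semantics that `publishAll` relies on can be sketched in plain JavaScript. This is an in-memory stand-in, not the real flume/LevelDB API — `appendAll` and the validation check here are hypothetical, but LevelDB's `batch()` gives the same guarantee on disk:

```javascript
// Sketch: validate every message *before* touching the log, then commit in
// one step, so a failure during validation leaves the log untouched.
function appendAll (log, msgs) {
  for (const msg of msgs) {
    if (msg == null || typeof msg.content === 'undefined') {
      throw new Error('invalid message: nothing was written')
    }
  }
  log.push(...msgs) // stand-in for one atomic LevelDB batch write
  return log.length
}

const log = []
appendAll(log, [
  { content: { type: 'post', text: 'one' } },
  { content: { type: 'post', text: 'two' } }
])
console.log(log.length) // 2

try {
  appendAll(log, [{ content: { type: 'post' } }, null]) // second entry invalid
} catch (e) {
  console.log(log.length) // still 2: nothing from the failed batch was written
}
```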

I still have a lot of tests to write, but I thought I'd open this pull request early for discussion in case anyone has any comments / recommendations about my overall approach (or any reason we shouldn't have this feature.)

Motivation

Sometimes it is desirable to append many messages at once atomically. For example, if an application occasionally persists data to the log that exceeds the individual message size limit, it might split the message into parts (as an alternative to using the blob system.) In such cases, it can be tricky to recover from failure if only some of the writes succeed (due to a power failure or the application being closed non-gracefully.) This is something I'm doing in the scuttlebutt-akka-persistence plugin that I've been building as part of my work for Openlaw, and I'd like to be able to publish the parts to the log atomically, since recovering from having published only a subset of the parts would be tricky.

An additional use case is being able to publish multiple related messages at the same time atomically. For example (off the top of my head) - 'create gathering' and 'add invitees' messages.
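The part-splitting scheme described above could look something like the following sketch. The `id` and `part` fields mirror the fields mentioned later in this thread, but the message type, field names, and size limit here are purely illustrative:

```javascript
// Hypothetical sketch: split oversized content into ordered parts that can
// be reassembled later. Appending all parts atomically means a reader never
// sees a partial group.
const MAX_PART_SIZE = 16 // tiny limit for the example; real messages allow ~8kB

function splitIntoParts (id, text) {
  const parts = []
  for (let i = 0; i * MAX_PART_SIZE < text.length; i++) {
    parts.push({
      type: 'large-payload-part', // hypothetical message type
      id: id,                     // groups the parts back together
      part: i,                    // order within the group
      body: text.slice(i * MAX_PART_SIZE, (i + 1) * MAX_PART_SIZE)
    })
  }
  return parts
}

const parts = splitIntoParts('payload-1', 'a message body that is too large for one log entry')
console.log(parts.length) // 4
console.log(parts.map(p => p.body).join('')) // reassembles the original text
```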

Questions

  • This change calls timestamp() to generate a monotonic timestamp for each message in the batch. This is because the current indexes (such as indexes/feed) assume an increasing timestamp. Philosophically, these messages are really written at the same time. Is this approach okay? Or do we want to change the indexing?

  • What is the difference between sbot.add and sbot.publish? Do we need an equivalent for publishAll?

  • What is db.post used for?
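On the first question: the strictly increasing behaviour the indexes assume can be sketched as below. This is a stand-in for the `timestamp()` helper mentioned above, not its actual implementation:

```javascript
// Sketch of a monotonic timestamp generator: repeated calls within the same
// millisecond still return strictly increasing values, which is what the
// feed indexes assume.
let last = 0
function timestamp () {
  let t = Date.now()
  if (t <= last) t = last + 0.001 // bump by a sub-millisecond increment
  last = t
  return t
}

const a = timestamp()
const b = timestamp()
console.log(b > a) // true, even if both calls land in the same millisecond
```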

ssb-validate

The required changes in ssb-validate can be found here: ssbc/ssb-validate#17

TODO

  • Update api.md with the new function and documentation about it.
  • Add a lot more unit tests

@dominictarr (Contributor)

Okay, I agree atomic writes are something we should support.

Answers:

  • it's probably better to use an increasing timestamp, because some things use that to disambiguate the order of things.
  • sbot.add appends an already valid (i.e. signed) message, and sbot.publish takes a message content and creates a message (i.e. signs it for you).
  • db.post calls back whenever a message is appended. I think it's mainly used in tests.

A problem: if you call publishAll with multiple message contents, and the second message needs to refer to the first, you'll need to know the hash of the first message (it goes in the content of the second), but you don't know that until the first is signed. So you can't do gatherings/create and invite attendees in one go. (A better solution would be if gatherings supported creating the invite and editing within the same initial message.)

@dominictarr (Contributor)

An option, though, would be to create the messages one at a time, without publishing them, and then pass them all to add in one go.

@Happy0 (Author) commented Aug 8, 2019

Thanks @dominictarr :)

I see your point regarding one of the use cases I gave, where you have to link to a message within the bulk append. The use case I'm working towards doesn't require this (since the messages I'm posting in bulk have an 'id' field and a 'part' field that describe how they relate to each other.)

Do you think I should try and support this use case (where you create the messages one at a time without publishing, then passing them in one go to allow you to link to a message that hasn't been published yet)? I think it'd be quite tricky / dangerous ...

Or would you be happy to merge this pull request in its current form (but with a lot more tests)?

@christianbundy (Contributor)

I think an atomic publish would be great. Referencing them would be tricky, but I think that's a trade-off: you can publish lots of atomic messages, but they probably shouldn't refer to each other by their sigil hash (because you can't know what it will be up-front). I'm always down for more tests, but I don't dislike the direction of this change.

@dominictarr (Contributor)

I'm in favor of bulk append, but I do think it's a database feature, so support should be implemented in the database layer, not the server/application layer. It will be more potentially useful and also easier to test that way.

stale bot commented Feb 20, 2020

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

@stale stale bot added the stale label Feb 20, 2020
@stale stale bot removed the stale label Feb 20, 2020
@christianbundy (Contributor)

Hey @Happy0 is this something you're still working on? I'm trying to reduce the number of open PRs and would love to merge if we can.
