
Recording comprehensive balance change history in evaluation nodes #2191

Open
theoreticalbts opened this issue Mar 2, 2018 · 3 comments

We want to record every change to account balances (#2173).

The code that emits a note every time a balance changes clearly belongs in adjust_balance(). While we obviously weren't careful to record every balance change in account history, I am fairly sure that throughout all phases of Steem's development we were careful to always use adjust_balance() to change balances. So if we put the hook in adjust_balance(), it will always Just Work, and we won't have to hunt down all the corner cases we missed, now or in the future.

This will produce balance change notes for everything, including things that already have virtual ops, such as transfer_operation -- which is why I'm calling the things emitted by this code "balance change notes," not "virtual ops."

For each balance change, we also want to know what caused it (an op, a virtual op, or per-block processing -- it would also be useful to know which phase of per-block processing). This means we have a tree structure.

At any point in time, we therefore need to track, in some data structure, which cause to record in the resulting balance change note whenever adjust_balance() happens to be called.

This means the database effectively keeps a stack of nodes. Each phase of processing that can serve as a cause will need to push a node onto the stack. (With a sufficiently clever class destructor, popping the node can be made to occur automatically at all exit points.)

Now, account_history is already one of the most bloated parts of the database. So instead of writing the nodes to account_history, let's write them somewhere else -- for example, stdout, or a plain file whose path is specified in the config file. (We can have the file be a UNIX fifo which a script listens to and records in a database for later querying.)

If we regard operation nodes as being children of transaction nodes, and transaction nodes as being children of block nodes, we can see that each tree is attached to a block, or an unconfirmed transaction. This simplifies the fork handling logic that will be needed in the external script.

Let's call the tree nodes "evaluation nodes," or "enodes" for short.

I had working code for all of this here, but it has some very rough edges:

  • It is quite old and bit-rotted; it will need to be adapted, since steemd has undergone multiple breaking refactors since then.
  • It only pushes nodes in a few places; many more pushes need to be added.
  • The external indexing script hasn't been written yet.
  • I was advised that the patch was not mergeable until we could build with the enode functionality compiled out.

This system is potentially very useful because it's quite general. There are many occasions when people want to get statistics or other historical information out of steemd (for example, right in #2173 a desire is expressed to record the VESTS / STEEM exchange rate every block). The data path created by the enode system would work quite well for this.

vogel76 commented Mar 7, 2018

This leads to producing another level of objects (a dedicated virtual operation) which must somehow be associated with the original operation performing the change. As I understand it, virtual ops should be used for cases that have no equivalent in real ops but require some action to be taken (which is not true in this case). It also needs another storage layer, an external service, and a lot of work to integrate them, and it will be troublesome on the deployment side, because another entity (the eNode service) must be spawned and maintained. The same can be done by extending the account_history_api, which doesn't even need additional data storage -- just proper op filtering when retrieving ops for a given account.
Work done in issue #2066 provides an implementation of AH that does not have huge resource needs. Once it is (finally) accepted, it will just need its API extended with the given method.

sneak commented Mar 20, 2018

> Now already account_history is one of the most bloated parts of the database. So instead of writing the nodes to account_history, let's write them somewhere else -- for example, stdout, or a plain file whose path is specified in the config file. (We can have the file be a UNIX fifo which a script listens to and records in a database for later querying.)

The correct way to do this in 2018 is not to "run more stuff inside the same container that listens to a fifo socket" (which also has the side effect of pushing the burden of dealing with the data off to another process that hasn't been written yet, and which would then need to reimplement RPC), but to write it to some sane, modern, disk-based database (e.g. RocksDB) that can then be easily queried via the standard RPC protocol already supported by steemd: a "balance_history" plugin or similar, avoiding account_history, that uses an entirely separate db and can be queried via entirely separate json-rpc methods. The way we get data in and out of steemd (when we're not speaking p2p) is via json-rpc, and it is always via json-rpc. There will not be a plugin that writes to a fifo, there will not be a plugin that sends SQL to an external db, and there will not be a plugin that does anything except read and write to local disk (in and to files that will only ever be read by steemd) and add json-rpc methods by which to access that file-based datastore.

The options are not exclusively "use our existing, bloated datastore" or "do it out of process". The interfaces are defined, and local disk is fast and cheap, and now under appbase, RPC reads should be too.

sneak commented Mar 20, 2018

The goal here should be to index relatively little in steemd, but make all of the relevant data available to external services (which will be json-rpc clients of steemd) in perpetuity so that those external services can do their indexing. That means probably a lot of memoizing to disk, indexed by block height, so that those external services can fetch the full history by iterating over the block numbers and building their own indices.

I would love to see what we call "consensus-only nodes" use much more disk, and be useful as historical databases (without many indexes other than block height) out of the box, in a default config.

Then, with the maturation of better indexing services (e.g. hivemind, sbds, etc.), the idea of what we call a "full node" (which I think we should call an "indexer node," to drop the unintended connotation of a consensus-only node being "less than full") would basically go away, as account_history would be removed or replaced.
