
Database hardening proposal #2918

@dshulyak

Description

This proposal focuses on eliminating hard-to-debug bugs that will surface once spacemesh software runs in non-managed environments, and on making the database code more robust in general. The main issues with the current database code:

  • no writes are synchronous

If the OS or hardware crashes (due to a power loss, for example), the node may enter a state that is not visibly corrupted but is invalid. One example is #2871: if an ATX was broadcast to other nodes but not persisted locally, any future ATX produced by the crashed node is discarded by the network, yet considered valid by the crashed node itself.

  • in many places writes that must be atomic are not atomic

Examples of this are #2516 and #2547. Even though the software can be "designed" to handle non-atomic writes, this is usually a bad idea and leads to many unexpected bugs. In the case of ATXs and blocks, neither will be processed again if the main body of the object was written but the write for the secondary index failed. For example, the node will think a block is fully synced, yet it will never be added to its layer.

  • non-atomic state transitions between go-sm and svm

Because of this design decision, there are certain problems we will need to handle. One example: marking a block contextually valid during a rerun before rewinding state in svm. If we crash before the state is rewound, tortoise will never discover that block again and the state will never be reapplied.

We won't be able to solve such issues just by writing correct database code; they will have to be handled in a special way.

Schema and requirements

  • schema is simple, no complex queries
  • handle large binary blobs efficiently (an ATX is around ~3 KB; block size depends on the tortoise state and can vary between ~1-10 KB)
  • the workload is mostly random reads with few writes, so it is better to optimize for reads

we also use databases in some other modules (tortoise, tortoise beacon), but they are not critical for correctness since we can always rebuild them from the main data.

mesh module

transactions and rewards are omitted, as I wasn't following what is being moved to svm.

blocks
  • main blob storage, indexed by block id
  • layer index, indexed by layer_id || block_id
  • contextual validity index, indexed by block_id, separate database
  • input vector, indexed by layer id, separate database
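The composite keys above (e.g. `layer_id || block_id` for the layer index) only work if the encoded keys sort in layer order. A minimal sketch of one way to build such keys — the function name and fixed-width big-endian encoding are illustrative assumptions, not the actual go-spacemesh encoding:

```python
import struct

def layer_index_key(layer_id: int, block_id: bytes) -> bytes:
    # Big-endian layer id, so byte-wise key ordering matches numeric
    # layer order; the block id is appended as-is (layer_id || block_id).
    return struct.pack(">I", layer_id) + block_id

# Keys for a later layer sort after keys for an earlier one, so a
# range scan over one layer's 4-byte prefix yields exactly its blocks.
k1 = layer_index_key(9, b"\xaa" * 32)
k2 = layer_index_key(10, b"\x00" * 32)
assert k1 < k2
assert k2.startswith(struct.pack(">I", 10))
```

With a key-value store this ordering is what makes "all blocks in layer N" a cheap prefix scan rather than a full iteration.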
layers

separate database

  • indexes for the latest, processed, and in-state layers, using a constant as the index key
  • indexes for hash and aggregated hash, keyed by layer_id in each bucket

activation module

activations

all in a single database, but writes are neither atomic nor synced

  • blob storage for headers and bodies
  • index from epoch_id || node_id to ATX
  • ATX timestamp, indexed by ATX id
  • constant index for first received ATX with largest publication layer
poet proofs
  • blob storage for poet proofs
identities
  • index from node key to node vrf key

Changes and implementation

For safety and correctness we need to:

  1. make all related writes atomic

All writes that are executed as a part of a gossip handler should be atomic.
Writes that are made during and after tortoise execution should be atomic.
Not sure if the syncer needs special handling, as it relies on the code used in gossip handlers.

  2. make writes durable, preferably always

Some writes, such as those made during ATX publishing, must always be durable, since an error in that domain leads to an invalid state. Other writes may not require durability, but to simplify our lives we can always go for durable unless it proves detrimental to performance.

Sometimes there are implicit dependencies: when we receive a block, we validate that its ATX is already stored on disk. If ATXs are stored in a separate database, we can't know that this data will be recovered unless we always fsync it first.
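The ordering requirement can be sketched with plain files: the dependency (the ATX blob) must be flushed and fsynced before the dependent record (the block) is written, so a crash in between cannot leave a block that references a missing ATX. The helper name and file layout here are hypothetical:

```python
import os
import tempfile

def durable_write(path: str, blob: bytes) -> None:
    # Write, flush, and fsync so the blob survives a power loss
    # before any dependent record is persisted.
    with open(path, "wb") as f:
        f.write(blob)
        f.flush()
        os.fsync(f.fileno())

tmp = tempfile.mkdtemp()
atx_path = os.path.join(tmp, "atx.bin")
durable_write(atx_path, b"atx-blob")      # dependency first, fsynced

block_path = os.path.join(tmp, "block.bin")
with open(block_path, "wb") as f:         # dependent write second
    f.write(b"block-referencing-atx")

assert open(atx_path, "rb").read() == b"atx-blob"
```

A single database sidesteps this ordering problem entirely, which is one argument for the shared-database option below.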

  3. get atomicity between go-sm and svm

Unlike the two previous cases, we can't rely on DB atomicity for correctness. Therefore go-sm should persist data only after svm has finished the write on its end, and svm must be ready to receive the same data multiple times in the event of crashes.
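The svm-side requirement is idempotency: re-applying the same data after a crash must be a no-op. A minimal sketch using an in-memory SQLite table (the table and function names are illustrative, not the actual svm interface):

```python
import sqlite3

# Hypothetical "svm side" table; go-sm would re-send the same layer
# after a crash, so applying it twice must be harmless.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE applied_layers (layer_id INTEGER PRIMARY KEY)")

def apply_layer(conn, layer_id: int) -> None:
    # INSERT OR IGNORE makes re-application after a crash a no-op:
    # the primary key deduplicates repeated deliveries.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO applied_layers VALUES (?)", (layer_id,)
        )

apply_layer(db, 7)
apply_layer(db, 7)  # duplicate delivery after a simulated crash
count, = db.execute("SELECT COUNT(*) FROM applied_layers").fetchone()
assert count == 1
```

With idempotent application on the svm side, go-sm can safely replay any data whose acknowledgment was lost in a crash.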

implementation

Preferably we will use the same database for the whole application to guarantee consistency between cross-module writes, such as blocks that rely on ATXs being persisted.

The alternative is to use separate databases but make sure the written data is always fsynced.

leveldb (or any key-value blob storage)
  • atomic writes can be persisted using batches. leveldb provides an option to follow a write with fsync.
  • bucket per submodule (just 1-2 bytes of unique prefix)
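The batch-plus-bucket-prefix scheme can be sketched with a toy in-memory store (this is not a real LevelDB binding; in goleveldb the batch would be handed to the engine with the sync write option, and the prefixes here are made up):

```python
# Hypothetical one-byte bucket prefixes, one per submodule.
BLOCKS, ATXS = b"\x01", b"\x02"

store = {}  # stand-in for the key-value engine

def write_batch(batch: dict) -> None:
    # All keys in the batch become visible at once; a real engine
    # would also fsync before acknowledging the write.
    store.update(batch)

# Related writes from one gossip handler go into a single batch,
# so a crash cannot persist the block without its ATX.
write_batch({
    BLOCKS + b"block-1": b"block blob",
    ATXS + b"atx-1": b"atx blob",
})
assert store[BLOCKS + b"block-1"] == b"block blob"
```

The prefix keeps each submodule's keyspace disjoint inside the shared database, which is what makes a single cross-module batch possible.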

pros:

  • less refactoring
  • faster writes for small values, due to the append-only nature of LSM trees

cons:

  • harder to use correctly, as we saw in the mesh module, where most of the indexes were buggy
  • more complex data model
  • slower reads
  • slower writes for large values
switch to sqlite
  • transactions for atomic writes. durability is controlled with PRAGMA synchronous.
  • table per sub-module (blocks, layers, activations, poet, identity)
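A sketch of what the sqlite approach buys us, using Python's stdlib `sqlite3` (the schema and function names are illustrative, not a proposed final layout): the blob and its layer index live in one table, the index is maintained by the engine, and a transaction makes the whole write atomic.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# On disk, this pragma makes every commit fsync before returning.
db.execute("PRAGMA synchronous = FULL")
db.execute(
    "CREATE TABLE blocks (id BLOB PRIMARY KEY, layer INTEGER, blob BLOB)"
)
# The secondary index is declarative; no hand-written index code.
db.execute("CREATE INDEX blocks_by_layer ON blocks (layer, id)")

def store_block(conn, block_id: bytes, layer: int, blob: bytes) -> None:
    # One transaction: the blob and its index entry commit together
    # or not at all, so a crash cannot leave a half-written block.
    with conn:
        conn.execute(
            "INSERT INTO blocks VALUES (?, ?, ?)", (block_id, layer, blob)
        )

store_block(db, b"\x01" * 32, 3, b"payload")
rows = db.execute("SELECT id FROM blocks WHERE layer = 3").fetchall()
assert len(rows) == 1
```

Compare this with the key-value version, where the layer index is a second write that we must remember to batch with the first.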

pros:

  • easier to use correctly, the programmer makes fewer choices
  • faster random reads, no merges such as in LSM trees
  • simpler data model: instead of writing code for custom indexes, we get SQL tables and indexes on specific fields out of the box
  • faster writes for large values

cons:

  • more refactoring
  • slower writes

This would be my choice: spacemesh doesn't do many writes (thousands per minute is nothing for any database), but some use cases are very read-heavy. I would recommend doing a POC with SQLite and comparing, for example, the performance of the tortoise rerun.
