sit edited this page Mar 11, 2013 · 1 revision

Merkle Tree on disk design

[Last updated October 2006]

The primary goals of moving DHash's Merkle tree synchronization system to a disk-based approach are to reduce the memory pressure caused by holding multiple Merkle trees in memory in both lsd and syncd, and to reduce the amount of disk I/O syncd must do to synchronize with each of its neighbors. In the current system, syncd chooses one successor per synchronization period, reads all the keys that it knows about for that successor from disk, and constructs an in-memory Merkle tree. In the new system, syncd need only read the root block of the successor-specific tree from disk; if that root differs from the remote one, it reads just the relevant blocks from disk to continue the synchronization, rather than constructing the complete tree in memory.

The changes will affect all the major components of Chord/DHash: syncd, lsd, and adbd. First, we'll look at the new data structure for an on-disk Merkle tree, and then we'll talk about the changes needed for each component.

On-disk Merkle trees


For each logically separate group of keys that DHash must keep track of (the keys for the blocks stored on the local node, and the set of keys for blocks stored on each successor node), DHash will keep a separate set of three files:

  • Leaf nodes (<db_dir>/<node_id>.<ctype>/leaf.mrk): This file consists of fixed-size blocks, each holding up to 64 20-byte keys plus an integer count of the number of keys in the block. The same amount of space is allocated for every block (64*20 + 4 = 1284 bytes), so a block offset is enough to seek directly to a particular record in the file.
  • Internal nodes (<db_dir>/<node_id>.<ctype>/internal.mrk): This file consists of blocks of exactly 64 20-byte hashes, a pointer to each of the node's 64 children, and a combined key count for all 64 children. Each pointer is a block index into either the internal file or the leaf file, and consists of a 31-bit block offset (enough for roughly 137 billion keys per tree) and a 1-bit flag saying whether the target block is internal or leaf. Each internal block is then 64*(20+4) + 4 = 1540 bytes.
  • Index file (<db_dir>/<node_id>.<ctype>/index.mrk): This file contains a pointer to the root block of the Merkle tree and a freelist of pointers to free blocks within both the leaf and internal files. These pointers have the same format as the child pointers above.
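Concretely, the three record layouts above can be sketched as C++ structs. This is an illustrative sketch only: the names (disk_ptr, leaf_block, internal_block) are made up here, and a real implementation would also have to pin down on-disk byte order. The sizes, however, match the ones computed above.

```cpp
#include <cassert>
#include <cstdint>

// A block pointer: a 31-bit offset into leaf.mrk or internal.mrk,
// plus one bit saying which of the two files it points into.
struct disk_ptr {
  uint32_t raw;
  static disk_ptr make (uint32_t offset, bool is_leaf) {
    return disk_ptr { (offset << 1) | (is_leaf ? 1u : 0u) };
  }
  uint32_t offset () const { return raw >> 1; }
  bool is_leaf () const { return raw & 1u; }
};

// Leaf block: up to 64 keys of 20 bytes each, plus a key count.
// 64*20 + 4 = 1284 bytes.
struct leaf_block {
  uint8_t  keys[64][20];
  uint32_t count;
};

// Internal block: 64 child hashes, 64 child pointers, and the
// combined key count of all children.  64*(20+4) + 4 = 1540 bytes.
struct internal_block {
  uint8_t  hashes[64][20];
  disk_ptr children[64];
  uint32_t key_count;
};
```

Packing the internal/leaf bit into the low bit of a 32-bit word keeps each child pointer at 4 bytes, which is where the 64*(20+4)+4 figure comes from.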


The on-disk Merkle tree should be able to share a lot of code with the existing Merkle tree implementation. In fact, we should be able to create a subclass of the merkle_tree class, call it merkle_tree_disk, that overrides only a few functions:

  • insert (u_int depth, merkle_hash& key, merkle_node *n)
  • remove (u_int depth, merkle_hash& key, merkle_node *n)
  • lookup (u_int *depth, u_int max_depth, const merkle_hash &key, merkle_node *n)
  • The constructor (takes in the file names, constructs the root block, etc.).

This will likely also require subclassing the merkle_node class so that the child() method constructs the child by looking it up on disk, and stores the child pointers properly. Note that this means the root member of a merkle_tree must be a pointer, so that it can be instantiated as a subclassed type.
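The class shapes this implies might look like the following minimal sketch, with the disk reads stubbed out. The real merkle_tree and merkle_node interfaces in the DHash sources differ in detail; this only illustrates the subclassing and the root-as-pointer point.

```cpp
#include <cassert>
#include <cstdint>

struct merkle_hash { uint8_t bytes[20]; };

class merkle_node {
public:
  virtual ~merkle_node () {}
  // The disk-backed node subclass overrides child() to read child i's
  // block from internal.mrk or leaf.mrk instead of chasing a pointer.
  virtual merkle_node *child (unsigned i) = 0;
};

class merkle_tree {
public:
  merkle_node *root;   // a pointer, so it can hold a subclassed node type
  merkle_tree () : root (nullptr) {}
  virtual ~merkle_tree () {}
  virtual void insert (unsigned depth, merkle_hash &key, merkle_node *n) = 0;
  virtual void remove (unsigned depth, merkle_hash &key, merkle_node *n) = 0;
  virtual merkle_node *lookup (unsigned *depth, unsigned max_depth,
                               const merkle_hash &key, merkle_node *n) = 0;
};

class disk_node : public merkle_node {
public:
  // Stub: the real version would seek to the child's block and read it.
  merkle_node *child (unsigned) { return nullptr; }
};

class merkle_tree_disk : public merkle_tree {
public:
  // Stubs: the real versions do synchronous disk I/O on the .mrk files.
  void insert (unsigned, merkle_hash &, merkle_node *) {}
  void remove (unsigned, merkle_hash &, merkle_node *) {}
  merkle_node *lookup (unsigned *, unsigned, const merkle_hash &,
                       merkle_node *) { return nullptr; }
};
```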

The insert, remove, and lookup functions will now require synchronous disk I/O in merkle_tree_disk, so they should never be used directly by the lsd process, which must remain unblocked to handle network requests quickly. Thus, this implementation requires moving the handling of the local-keys Merkle tree into the syncd process, and having the syncd processes communicate among themselves. More on this in the next section.

Note that the atomicity of inserts will be important if more than one process has to read the Merkle tree. This could be done by copying all modified blocks into free slots, switching the root pointer in the index file only once all the new blocks have been written, and finally adding all the old blocks onto the free list. In the current design, adbd must update the Merkle tree of local keys on a block insert while syncd reads it, so atomic inserts are pretty important.
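The copy-on-write scheme can be illustrated with an in-memory stand-in for the block files. This is a hypothetical sketch (single-block tree, vector in place of leaf.mrk/internal.mrk); a real implementation would write the copied blocks through the filesystem and sync them before switching the root pointer in index.mrk.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct block { std::vector<uint32_t> keys; };

struct tree_file {
  std::vector<block> blocks;       // stand-in for leaf.mrk / internal.mrk
  std::vector<uint32_t> freelist;  // stand-in for the freelist in index.mrk
  uint32_t root = 0;               // stand-in for the root pointer in index.mrk

  // Allocate a slot, preferring the freelist.
  uint32_t alloc () {
    if (!freelist.empty ()) {
      uint32_t b = freelist.back ();
      freelist.pop_back ();
      return b;
    }
    blocks.push_back (block ());
    return blocks.size () - 1;
  }

  // Insert a key by writing a modified copy of the root block into a
  // free slot, then switching the root pointer (the single "atomic"
  // step) and only afterwards freeing the old block.
  void insert (uint32_t key) {
    uint32_t nb = alloc ();
    blocks[nb] = blocks[root];        // copy the block being modified
    blocks[nb].keys.push_back (key);  // apply the change to the copy
    uint32_t old = root;
    root = nb;                        // readers now see the new version
    freelist.push_back (old);         // old version becomes reusable
  }
};
```

A reader that loaded the old root pointer keeps traversing a consistent (if stale) tree, because none of its blocks were modified in place.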


syncd

syncd will require the most work of any of the Chord/DHash components. Basically, it will have to serve as both the server and the client in the synchronization protocol, similar to the way sampled works. (In the current system, lsd serves as the server during synchronization.) syncd will consist of the following components:

  • For each vnode:
    • For each content type (noauth, chash, and possibly keyauth):
      • A merkle_server, which listens for RPCs on a well-known port and answers queries using a merkle_tree_disk over the node's local set of keys.
      • A syncer that, once every synchronization period, picks a successor and makes a merkle_tree_disk for the keys known to be on that successor. It passes this tree to the merkle_syncer to make the RPCs to the remote syncd for synchronization.
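One period of the syncer's work might be organized along the following lines. This is an illustrative sketch only: the struct, the round-robin successor choice, and all names here are assumptions, not the actual syncd code (the real syncd may well choose successors differently).

```cpp
#include <cassert>
#include <string>
#include <vector>

// One (vnode, content type) pair's synchronization state.
struct sync_unit {
  std::string vnode;
  std::string ctype;                    // "noauth", "chash", ...
  std::vector<std::string> successors;
  size_t next;                          // round-robin position (assumed)

  sync_unit (std::string v, std::string c, std::vector<std::string> s)
    : vnode (v), ctype (c), successors (s), next (0) {}

  // Called once per synchronization period: pick the successor whose
  // successor-specific tree files (under <db_dir>/<node_id>.<ctype>/)
  // we would open and hand to the merkle_syncer.
  std::string sync_round () {
    std::string s = successors[next];
    next = (next + 1) % successors.size ();
    return s;
  }
};
```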

The majority of this code already exists. Once the merkle_tree_disk is implemented, it should just be a matter of moving things around from lsd to syncd.

When syncd learns about new keys on a successor, it will insert them into the appropriate merkle_tree_disk, which will write them to disk synchronously. It will also need to update the Berkeley DB files as usual, via adbd, to make sure it maintains the correct replica counts so that lsd can decide which replicas to repair.


lsd

All merkle_tree and merkle_server references should be removed entirely from lsd. In the future, it's possible we'll need to change the repair algorithms, but that will still likely involve similar communication with adbd, so it may not require much change.


adbd

adbd must now update the Merkle tree over the node's local keys on every block insert and removal received from lsd. This will require it to know the locations of the Merkle files on disk, but we can just put them in the same directory as the Berkeley DB files, so that we won't have to pass any more paths around.

Also, the use of the on-disk Merkle trees should obviate the need for the keyauxdb file, though it still seems pretty useful for iterating through all keys for the local node, so maybe we'll keep it around (and updated by adbd) for now.

Eventually it may make sense to use an on-disk min heap structure to inform lsd of the objects in most need of repair, but we don't think this is too necessary for the first phase of the implementation.


Creating the initial trees

We'll need a way to initially create the on-disk Merkle trees from the information currently stored in adbd. The easiest way to do this will probably be to write a standalone program in devel/ or utils/ that reads in everything from adbd and inserts the data into merkle_tree_disk objects, making a new one every time it comes across a new successor.
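The conversion tool could be structured roughly as follows. This is a hypothetical sketch: tree stands in for merkle_tree_disk, the (successor, key) records stand in for whatever iteration interface adbd exposes, and the function name is invented.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Stand-in for merkle_tree_disk: just remembers the inserted keys.
struct tree {
  std::set<uint64_t> keys;
  void insert (uint64_t k) { keys.insert (k); }
};

// Walk all (successor, key) records read out of adbd, creating a new
// per-successor tree the first time each successor is seen.
std::map<std::string, tree>
build_trees (const std::vector<std::pair<std::string, uint64_t> > &records)
{
  std::map<std::string, tree> trees;
  for (size_t i = 0; i < records.size (); i++) {
    // map::operator[] default-constructs a tree on first sight,
    // mirroring "making a new one every time it comes across a new
    // successor" above.
    trees[records[i].first].insert (records[i].second);
  }
  return trees;
}
```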