Merkle Tree on disk design
[Last updated October 2006]
The primary goals of moving DHash's Merkle tree synchronization system to a disk-based approach are to reduce the memory pressure caused by holding multiple Merkle trees in memory in both lsd and syncd, and to reduce the amount of disk I/O syncd must do to synchronize with each of its neighbors. In the current system, syncd chooses one successor per synchronization period, reads all the keys that it knows about for that successor off of disk, and constructs an in-memory Merkle tree. In the new system, syncd would need to read only the root block of a successor-specific tree from disk; if that root differs from the remote one, it need only read certain blocks from disk to continue with the synchronization, rather than constructing the complete tree in memory.
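The savings come from descending only into subtrees whose hashes differ. Here is a toy sketch of that descent (not the actual DHash code: `ToyTree` and `sync_diff` are invented names, and an in-memory map stands in for the on-disk block files):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for an on-disk Merkle tree: maps a node's path to its hash
// and to its children's paths.  disk_reads counts simulated block fetches.
struct ToyTree {
    std::map<std::string, uint64_t> hash;                   // path -> node hash
    std::map<std::string, std::vector<std::string>> kids;   // path -> children
    mutable int disk_reads = 0;
    uint64_t read(const std::string &path) const {
        ++disk_reads;                  // every hash lookup costs one block read
        return hash.at(path);
    }
};

// Recursively descend only where the hashes disagree; identical subtrees
// are skipped entirely, so their blocks are never read from disk.
void sync_diff(const ToyTree &local, const ToyTree &remote,
               const std::string &path, std::vector<std::string> &differing) {
    if (local.read(path) == remote.read(path))
        return;                        // whole subtree identical: no more I/O
    auto it = local.kids.find(path);
    if (it == local.kids.end()) {      // differing leaf: needs synchronization
        differing.push_back(path);
        return;
    }
    for (const auto &child : it->second)
        sync_diff(local, remote, child, differing);
}
```

With a root and two leaves where only one leaf differs, the walk touches three blocks per tree instead of all of them, which is the whole point of keeping the tree readable block-by-block on disk.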
The changes will affect all the major components of Chord/DHash: lsd, syncd, and adbd. First, we'll look at the new data structure for an on-disk Merkle tree, and then we'll talk about the changes needed for each component.
On-disk Merkle trees
For each logically separate group of keys that DHash must keep track of (the keys for the blocks stored on the local node, and the set of keys for blocks stored on each successor node), DHash will keep a separate set of three files:
- Leaf nodes (<db_dir>/<node_id>.<ctype>/leaf.mrk): This file consists of blocks of up to 64 20-byte keys, plus an integer count of the number of keys in the block. The same amount of space will be allocated for each block (1284 bytes), so a block offset is enough to skip directly to a particular record in the file.
- Internal nodes (<db_dir>/<node_id>.<ctype>/internal.mrk): This file consists of blocks of exactly 64 20-byte hashes, a pointer to each of the node's 64 children, and a combined key count for all 64 children. Each pointer is a block index into either the internal file or the leaf file, and consists of a 31-bit block offset (enough for up to 137 billion keys per tree) and a 1-bit flag saying whether the target block is internal or leaf. Each internal block is then (64*(20+4)+4) = 1540 bytes.
- Index file (<db_dir>/<node_id>.<ctype>/index.mrk): This file contains a pointer to the root block of the Merkle tree, plus a freelist containing pointers to free blocks within both the leaf and internal files. These pointers have the same format as the child pointers above.
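As a sanity check on the arithmetic above, here is one possible C++ rendering of the record formats. The struct and field names are invented; only the sizes (1284 and 1540 bytes) and the 31-bit/1-bit pointer split come from the design above:

```cpp
#include <cassert>
#include <cstdint>

// A child pointer: 31-bit block offset plus 1 bit marking leaf vs. internal.
// 2^31 leaf blocks * 64 keys each = 2^37 ~ 137 billion keys per tree.
struct block_ptr {
    uint32_t raw;
    uint32_t offset()  const { return raw & 0x7fffffffu; }
    bool     is_leaf() const { return (raw >> 31) != 0; }
    static block_ptr make(uint32_t off, bool leaf) {
        return block_ptr{ off | (uint32_t(leaf) << 31) };
    }
};

// One leaf.mrk record: up to 64 20-byte keys, plus a count of keys present.
struct leaf_block {
    uint32_t nkeys;            // number of keys actually stored in this block
    uint8_t  keys[64][20];     // the 20-byte keys, packed in order
};

// One internal.mrk record: 64 child hashes, 64 child pointers, total count.
struct internal_block {
    uint8_t   child_hash[64][20];  // hash of each child's subtree
    block_ptr child[64];           // indexes into leaf.mrk or internal.mrk
    uint32_t  key_count;           // combined key count of all 64 children
};

static_assert(sizeof(block_ptr)      == 4,    "packed 31+1 bit pointer");
static_assert(sizeof(leaf_block)     == 1284, "matches the stated record size");
static_assert(sizeof(internal_block) == 1540, "matches the stated record size");
```

The fixed record sizes are what make the files directly addressable: a block pointer's offset can be multiplied by the record size to seek straight to the record.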
The on-disk Merkle tree should be able to share a lot of code with the existing Merkle tree implementation. In fact, we should be able to create a subclass of the `merkle_tree` class, let's call it `merkle_tree_disk`, that overrides only a few functions:
- insert (u_int depth, merkle_hash& key, merkle_node *n)
- remove (u_int depth, merkle_hash& key, merkle_node *n)
- lookup (u_int *depth, u_int max_depth, const merkle_hash &key, merkle_node *n)
- The constructor (take in file names, construct the root block, etc.)
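The subclass might look roughly like this. The `merkle_tree`, `merkle_hash`, and `merkle_node` definitions below are minimal stand-ins for the real DHash classes (whose exact signatures may differ), and the method bodies are elided:

```cpp
#include <cassert>
#include <string>
typedef unsigned int u_int;

struct merkle_hash { unsigned char bytes[20]; };  // stand-in: a 20-byte key
struct merkle_node { };                           // stand-in for the real class

// Stand-in base class; note that root is a pointer, so a subclass can
// instantiate it as its own merkle_node subtype.
struct merkle_tree {
    merkle_node *root = nullptr;
    virtual ~merkle_tree() {}
    virtual void insert(u_int depth, merkle_hash &key, merkle_node *n) = 0;
    virtual void remove(u_int depth, merkle_hash &key, merkle_node *n) = 0;
    virtual merkle_node *lookup(u_int *depth, u_int max_depth,
                                const merkle_hash &key, merkle_node *n) = 0;
};

class merkle_tree_disk : public merkle_tree {
    std::string leaf_file, internal_file, index_file;
public:
    merkle_tree_disk(const std::string &leaf, const std::string &internal,
                     const std::string &index)
        : leaf_file(leaf), internal_file(internal), index_file(index)
    { /* open the three files; read or create the root block */ }

    // Each override does synchronous disk I/O against the three files.
    void insert(u_int depth, merkle_hash &key, merkle_node *n) override
    { /* copy-on-write the blocks along the path, flip root pointer */ }
    void remove(u_int depth, merkle_hash &key, merkle_node *n) override
    { /* same, removing the key */ }
    merkle_node *lookup(u_int *depth, u_int max_depth,
                        const merkle_hash &key, merkle_node *n) override
    { /* read blocks from disk along the path */ return nullptr; }
};
```

A subclassed `merkle_node` whose `child()` method reads the child block from disk would slot in alongside this, as described below.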
This will likely also require subclassing the merkle_node class so that the child() method constructs the child by looking it up on disk, and stores the child pointers properly. Note that this means the root member of a merkle_tree must be a pointer, so that it can be instantiated as a subclassed type.
The lookup functions will now require synchronous disk I/O in merkle_tree_disk, so they should never be used directly by the lsd process, which needs to remain unblocked to handle network requests quickly. Thus, this implementation requires moving the handling of the local-keys Merkle tree into the syncd process, and having the syncd processes communicate among themselves. More on this in the next section.
Note that the atomicity of inserts will be important if more than one process has to read the Merkle tree. Atomicity could be achieved by copying all modified blocks into free slots, switching the root pointer in the index file only once all the new blocks have been written, and finally adding all the old blocks to the free list. In the current design, adbd must update the Merkle tree of local keys on a block insert while syncd reads it, so atomic inserts are pretty important.
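The copy-on-write scheme can be sketched as follows. `Store` and `atomic_update` are invented names, and a single vector of blocks stands in for the leaf/internal files; the one property being illustrated is that readers only ever see the change through the single root-pointer swap:

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <vector>

struct Store {
    std::vector<std::string> blocks;   // stand-in for leaf.mrk/internal.mrk slots
    std::vector<int> freelist;         // free slot indices (kept in index.mrk)
    std::atomic<int> root;             // root pointer (kept in index.mrk)
};

// Insert-by-copy: never overwrite the live root block in place.
void atomic_update(Store &s, const std::string &new_contents) {
    int slot = s.freelist.back();          // 1. grab a free slot
    s.freelist.pop_back();
    s.blocks[slot] = new_contents;         // 2. write the modified copy there
    int old_root = s.root.exchange(slot);  // 3. flip the root pointer last
    s.freelist.push_back(old_root);        // 4. old block goes on the freelist
}
```

In a real implementation the new blocks would also need to be flushed to disk before the root pointer in the index file is rewritten, and an old block could only be reused once no reader still holds the old root.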
syncd will require the most work of any of the Chord/DHash components. Basically, it will have to serve as both the server and the client in the synchronization protocol, similar to the way sampled works. (In the current system, lsd serves as the server during synchronization.) syncd will consist of the following components:
- For each vnode:
  - For each content type (noauth, chash, and possibly keyauth):
    - A merkle_server, which listens for RPCs on a well-known port and answers queries using a merkle_tree_disk over the node's local set of keys.
    - A syncer that, once every synchronization period, picks a successor and makes a merkle_tree_disk for the keys known to be on that successor. It passes this tree to the merkle_syncer, which makes the RPCs to the remote merkle_server.
The majority of this code already exists. Once merkle_tree_disk is implemented, it should just be a matter of moving things around from lsd into syncd. When syncd learns about new keys on a successor, it will insert them into the appropriate merkle_tree_disk, which will save them to disk synchronously. It will also need to update the Berkeley DB files as usual, via adbd, to make sure it maintains the correct replica counts so that lsd can decide which replicas to repair.
The merkle_server references should be removed entirely from lsd. In the future, it's possible we'll need to change the repair algorithms, but that will still likely involve similar communication with adbd, so maybe it won't require much change.
adbd must now update the Merkle tree over the node's local keys on every block insert and removal received from lsd. This will require it to know the locations of the Merkle files on disk, but we can just put them in the same directory as the Berkeley DB files, so that we won't have to pass any more paths around.
Also, the use of the on-disk Merkle trees should obviate the need for the keyauxdb file, though it still seems pretty useful for iterating through all keys for the local node, so maybe we'll keep it around (and updated by adbd) for now.
Eventually it may make sense to use an on-disk min-heap structure to inform lsd of the objects in most need of repair, but we don't think this is necessary for the first phase of the implementation.
We'll need a way to initially create the on-disk Merkle trees from the information currently stored in adbd. The easiest way to do this will probably be to write a standalone program in utils/ that reads everything out of adbd and inserts the data into merkle_tree_disk objects, making a new one every time it comes across a new successor.