Haskell Ruby
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
app
bin
src
test
.gitignore
LICENSE
README.org
Setup.hs
hademlia.cabal
stack.yaml

README.org

Hademlia

A Haskell Kademlia implementation.

What is Kademlia?

Lovely you should ask.

Kademlia is a distributed hash table – why is this cool? It allows a bunch of computers to get together and store information across the entire system. Then, each can find/retrieve information that exists only on a given computer, without ever having talked to it before!

So let’s say Computer 1 wants File A. File A exists on Computer 1337. Computer 1 can make a few jumps across the network (log_2(n) to be exact) to find exactly where File A is - then C1 can talk to C1337 - and splendid piracy continues.

How does it actually work?

A node U keeps track of the following information:

  • Its ID, which let’s say is in binary format, like 101101.
  • A Binary Tree containing information about a subset of other nodes in the network.
    • The tree has lists as its leaves
    • These lists are filled with <Node ID, IP Address, UDP Port> tuples.

A node can perform the following operations:

  • PING
  • STORE
  • FIND_NODE
  • FIND_VALUE

Most of these operations involve a lookup, which is described further down.

Objects are treated the same as nodes, identity-wise

  • A given object is hashed into the same format as the node IDs. (Let’s assume binary format)
  • When performing a FIND_VALUE, the goal is to look for nodes close (see below) to that value (in terms of their IDs)

Nodes have a notion of distance defined by XOR

  • If we consider Node / Object IDs as binary digits, we can XOR them to find the “distance”
  • This has a number of cool properies
    • unidirectionality : all paths lead to Rome (the node). For a distance d and point x, there is one y such that dist(x,y) = d
    • symmetry : dist(x,y) = dist(y,x)

The Lookup

The following is pseudo-code for a node U looking up the ID W.

user sets A, a concurrency parameter
user sets K, a replication parameter
maintain k-heap, a min-heap of nodes ordered by distance from W
                 each entry also maintains a queried? flag (which means queried & received response)
define get-k-closest, which takes a node and target and returns node's K closest nodes to target

define query(to-query):
  results := to-query.each { |x| get-k-closest(x, W) }
  terminate if any of the results contain our value or have our desired node ID
  add results to k-heap
  terminate if the first K of k-heap has been queried
  distances := map (distance to W) on to results
  if min(distances) < peek(k-heap):
    query-next := grab A unqueried from k-heap
  else:
    query-next := grab any node in the first K of k-heap that's unqueried
  query(query-next)

start by calling query on A closest nodes to W.

In order to get the initial A closest nodes:

  • Assume IDs are 160-bits.
  • Assume the lookup-tree is a binary tree where the branches at each point are labeled 0 and 1.
  • Walk the digits of W while walking the tree until you hit a leaf.
  • Grab the <= k nodes in that bucket.
  • If < k, walk back a digit and find the next leaf bucket (repeat if necessary).

If the request is for a value, terminate once a node returns the value associated with the key.

Other Implementation Notes

  • When a node receives any message from another node W, it updates the k-bucket for W.
    • If W is already there, move it to the end of the list.
    • If it’s not there, add it to the end of the list.
    • If it’s not there and the bucket is full, ping the last recently seen node (first) in the k-bucket.
      • If it responds, move it to the end of the list and discard W.
      • If it does not, evict it and add W to the end of the list.
    • This way, new nodes can’t flood routing tables.
  • Once a lookup succeeds, the requesting node stores the <key,value> it was looking for at the closest node that did not return the value.
  • <key,value>s expire in a time exponentially proportional to the number of nodes between the storing node and the next closest node to the key.
  • If a node hasn’t performed a lookup in a given bucket in an hour, it will pick one at random and perform a lookup.
  • To join the network, node U adds W into a k-bucket and then performs a node-lookup for itself. Then it refreshes all k-buckets further than closest neighbor.
  • <key,value>s are republished once every hour
    • TODO: there’s more to talk about here