# CH/OTP Test Task
This project is my solution to the "CH/OTP Test Task" problem defined in http://f.nn.lv/od/5c/8y/CH_OTP_Test_Task(1).pdf.
To build the executables, just run `stack build`. This will create binaries for your operating system.
## How to execute the program
### 1. Initialize worker nodes
You can initialize a worker node using the script `run-worker.sh` with the following arguments:

```
./run-worker.sh [--host HOST] [--port PORT]
```
Run this for every node in the cluster.
### 2. Run the master node to trigger calculations
You can run the master node using the script `run-master.sh` with the following arguments:

```
./run-master.sh [--host HOST] [--port PORT] [--sendFor SEND_FOR] [--waitFor WAIT_FOR]
```
Run this on a single node. Please note that `waitFor` should be reasonably large, so that during the grace period a worker can properly handle all messages still waiting in its queue.
Once the master node terminates, the worker nodes also terminate, printing the final result to the console.
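A complete run on a single machine might look like this (the host, ports, and timeout values below are illustrative, not defaults from the task):

```shell
# Start two workers in the background, then trigger the calculation
# from the master with a 10s send period and a 5s grace period.
./run-worker.sh --host 127.0.0.1 --port 10001 &
./run-worker.sh --host 127.0.0.1 --port 10002 &
./run-master.sh --host 127.0.0.1 --port 10000 --sendFor 10 --waitFor 5
```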
I approached the problem by implementing a totally ordered queue of events, as described in *Time, Clocks, and the Ordering of Events in a Distributed System* by Leslie Lamport.
The general algorithm is as follows:

- the master initializes work on the workers and pauses for the timeout provided by `sendFor`
- workers generate internal events, send them to the other workers, and store events coming in as messages from other workers
- a logical clock is implemented in order to keep a partial ordering of events
- upon an incoming `Stop` message from the master, workers stop generating internal events while still accepting external messages
- the master pauses for the timeout provided by `waitFor`
- upon an incoming `Results` message from the master, workers sort events according to the `=>` relation (which gives a total ordering by combining the partial ordering from the logical clocks with a total ordering of processes)
- having sorted the events, each worker prints out the result
- workers send a `Done` message to the master
- once all `Done` messages are received, the master terminates all worker nodes and then stops its own execution
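The clock handling and the `=>` relation above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names (`Event`, `tick`, `merge`, `totalOrder`) and the payload type are assumptions made for the example.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

type Clock = Int
type Pid   = Int  -- process identifier, used to break clock ties

data Event = Event { eClock :: Clock, ePid :: Pid, ePayload :: Double }
  deriving (Show, Eq)

-- On a local (internal) event: advance the logical clock.
tick :: Clock -> Clock
tick c = c + 1

-- On receiving a message stamped with the sender's clock: take the
-- maximum and tick, which preserves the happened-before partial order.
merge :: Clock -> Clock -> Clock
merge local incoming = max local incoming + 1

-- The "=>" total order: compare clocks first, break ties by process id.
totalOrder :: Event -> Event -> Ordering
totalOrder = comparing (\e -> (eClock e, ePid e))

-- Each worker sorts its accumulated events before printing the result.
sortEvents :: [Event] -> [Event]
sortEvents = sortBy totalOrder
```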
## 1. Handling of process failure
As described by Lamport:

> However, the resulting algorithm requires the active participation of all the processes. A process must know all the commands issued by other processes, so that the failure of a single process will make it impossible for any other process to execute State Machine commands, thereby halting the system.
Thus the algorithm should work in an imperfect network, but it will halt if one of the processes dies permanently.
Lamport points to his other paper, "The Implementation of Reliable Distributed Multiprocess Systems", which is an introduction to consensus protocols (these seemed out of scope for this test task).
## 2. Processes as (sort of) partial functions
The implementation uses the more "low-level" API of `distributed-process` instead of typed channels. During implementation this even led to one error that could have been caught at compile time (rather than at runtime) had typed channels been used.
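For comparison, a sketch of what the typed-channel alternative looks like against the `distributed-process` API. The `Result` type and the surrounding process code are illustrative assumptions; only `newChan`, `sendChan`, and `receiveChan` are the library's actual functions.

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Control.Distributed.Process
import Data.Binary (Binary)
import GHC.Generics (Generic)

newtype Result = Result Double deriving (Generic)
instance Binary Result

-- With the low-level API the mailbox is untyped: pairing
-- `send pid (3.14 :: Double)` with `expect :: Process Result`
-- compiles fine and simply blocks at runtime. A typed channel
-- fixes the message type in the ports themselves:
replyOverChannel :: Process ()
replyOverChannel = do
  (sp, rp) <- newChan :: Process (SendPort Result, ReceivePort Result)
  sendChan sp (Result 3.14)  -- `sendChan sp "oops"` would be a type error
  Result x <- receiveChan rp
  say (show x)
```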
## 3. Huge event lists failing
A list was chosen as an obvious data structure with O(1) prepend. However, when the event list grows large (roughly more than 1,000,000 elements on a 2015 MacBook Pro), the workers crash with system errors that appear to be memory-related. A solution might involve using a different data structure or further investigation of the error messages.
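One possible direction, assuming the blow-up comes from lazily accumulated list thunks: `Data.Sequence` (from the `containers` package shipped with GHC) offers O(1) amortized append at either end, and a strict left fold forces the structure as it grows. The names here are illustrative, not the project's code.

```haskell
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (|>))
import Data.Foldable (toList)
import Data.List (foldl')

-- Append one event to the log; (|>) is O(1) amortized.
appendEvent :: Seq Double -> Double -> Seq Double
appendEvent evs x = evs |> x

-- Build the event log with a strict fold so thunks do not pile up.
collect :: [Double] -> Seq Double
collect = foldl' appendEvent Seq.empty
```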