This is a simple client/master/worker system for distributing Python jobs in a cluster. It is my third attempt (after clusterfutures and execnet-futures) to build a comfortable system for running huge, parallel jobs in Python.
As with something like Hadoop, there are three distinct machine roles in cluster-workers. Workers run jobs; you typically run one of these per core or per machine. The client produces the jobs; it's your main program that you're aiming to offload work from. The master maintains connections to the workers and the client; it is responsible for routing jobs and responses between the other nodes.
It's certainly possible for multiple clients to share the same master, but fairness enforcement is currently pretty simplistic (FIFO). The distinction between the master and the client is primarily to avoid needing to spin up new workers for every new task you want to run. This way, you can allocate nodes to do your work and leave them running while you run various client programs. You can add and remove workers at your leisure whether a client is running or not.
The advantage over clusterfutures is that the cluster management infrastructure is not involved with the inner loop. On a SLURM cluster, for example, each new call would require a new SLURM job. If the cluster failed to start one job, clusterfutures had no way of knowing this and would wait forever for the job to complete. With cluster-workers, SLURM is only involved "offline" and cannot hold up the actual task execution.
Documentation is currently sparse. But take a look at the square.py example in the examples directory for a quick introduction. Here's the gist of how things work:
- Start a master process with python -m cw.master.
- Start lots of workers with python -m cw.worker [HOST]. Provide the hostname of the master (or omit it if the master is on the same host).
- In your client program, start a ClientThread. The constructor takes a callback function and the master hostname (the default is again localhost). Call the submit method to send jobs and wait for callbacks.
There's also a ClusterExecutor class that lets you use Python's concurrent.futures module as a more convenient way to start jobs. This may be untenable, however, if your task has a lot of jobs and a lot of data because of the way futures must persist in memory.
SLURM is a cluster management system for Linux. This package contains a few niceties for running jobs on a SLURM cluster.
The slurm.py script lets you run SLURM jobs for your master and workers. Just run ./slurm.py -n NWORKERS start to kick off a master job and a bunch of worker jobs.
Then, use cw.slurm_master_host() in your client programs to automatically find the host of the master to connect to. Pass this to the ClientThread constructor.
For testing and small jobs, you may want to run a cluster-workers program on a single multiprocessor machine. The included mp.py script works like slurm.py but starts the master and workers on the local machine.