Map reduce in Go with etcd and S3
To make a master-less Map/Reduce framework in Go. etcd is used for co-ordination AWS S3 is used for storing intermittent files, and results. Nobody tells the workers what to do, each worker picks up a pending task and performs it.
I currently use Disco. And was thinking if I can make something simpler that does not require a master, or complex setup.
This repository is work in progress.
Distributed computing is hard. We take the lazy way out and outsource the hard parts to other tools/services.
- etcd server/cluster - I have only tested with single etcd server so far.
- AWS S3 - Currently I only support AWS S3, but its relatively easy to support S3 clones.
- workers - These wait for some task to be done, and fulfill them accordingly.
- webapp - To view status(and logs) of tasks.
- The user
go get github.com/turbobytes/gomr
The following environment variables need to be set. Adjust accordingly.
export S3_BUCKET=gomr #Existing S3 bucket we will use for this. Its best to use a dedicated bucket. export ETCD_SERVERS="http://127.0.0.1:2379" #Comma separated list of etcd servers export AWS_REGION=ap-southeast-1 #AWS Region export AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxx" export AWS_ACCESS_KEY_ID="xxxxxxxxxxxx" export LOGGLY_TOKEN="xxxxxxxxx" #Optional - For centralized logging export LOGGLY_ACCOUNT="xxxxx" #Optional - Only the webapp needs it to show logs in UI export LOGGLY_USERNAME="xxxxxx" #Optional - Only the webapp needs it to show logs in UI export LOGGLY_PASSWORD="xxxxx" #Optional - Only the webapp needs it to show logs in UI
In one(or multiple) terminal(s) launch the worker process, can be launched anywhere that has access to the etcd cluster and S3.
go run $GOPATH/src/github.com/turbobytes/gomr/cli/worker.go
Then submit the job.
cd /tmp go build $GOPATH/src/github.com/turbobytes/gomr/examples/word_count.go go run $GOPATH/src/github.com/turbobytes/gomr/examples/word_count_execute.go
This should print a job id. Running word_count_execute.go might take some time depending on your upload bandwidth because it's uploading ~14MB binary to S3. Subsequent runs will be faster because the binary is only uploaded/downloaded if things change.
Check status/fetch result using
go run $GOPATH/src/github.com/turbobytes/gomr/cli/fetchresult.go -jobname=ID_FROM_PREVIOUS_STEP -o=/path/to/resultfile
Make sure you have set the environment variables. If using loggly then remember to set
cd $GOPATH/src/github.com/turbobytes/gomr/webapp go run server.go
Then visit http://localhost:8181/ in browser. Replace localhost with ip or hostname of the machine running it...
This project is in Proof-of-Concept stage. Many failure/retry cases are being ignored currently. Also, currently the whole cluster of workers must target same GOOS/GOARCH .