An R package to do parallel processing on Amazon, (more) easily. Born in 2016 at the Brisbane rOpenSci Unconference. This is a work in progress and is still under development.
- Dan Pagendam
- Jonathan Carroll
- Daniel Thomas
- Zoé van Havre
- Cameron Roach
- Felix Leung
- Suren Rathnayake
Automatically sets up and starts a cluster of AWS workers, does parallel processing, and saves the output to an S3 bucket.
# Install
devtools::install_github("ropenscilabs/snowball")
WARNING: Check yourself, before you wreck yourself! You are the ruler of your own Amazon costs. (No responsibility taken for your AWS bill...)
snowball takes the location of data, a user-defined function, and some basic instructions, then sets up and runs virtual machines in parallel on Amazon and saves the results in an S3 bucket.
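As a rough local analogy for that idea, the sketch below applies a user-defined function to a list of items in parallel and saves the combined output. It uses base R's `parallel` package with local processes standing in for AWS workers, and a temporary file standing in for the S3 bucket; `user_fn`, `items`, and `out_file` are illustrative names, not part of snowball.

```r
library(parallel)

# Stand-in for the user-defined function.
user_fn <- function(x) x^2

# Stand-in for the data: one list element per unit of work.
items <- as.list(1:4)

# Two local R processes play the role of the AWS workers.
cl <- makeCluster(2)
results <- parLapply(cl, items, user_fn)
stopCluster(cl)

# A temporary file plays the role of the S3 bucket.
out_file <- file.path(tempdir(), "results.rds")
saveRDS(results, out_file)
```

snowball aims to do the equivalent across virtual machines, so the cluster setup and storage steps here are only placeholders for what the package automates.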
- An AWS account, with:
- IAM user with permissions to manage EC2 and S3.
- API keys for the IAM account.
- An S3 bucket
- With a policy allowing an IAM user full access
- Containing the data, and the user function, as
Overview / workflow:
- Put the job list and data in the S3 bucket (the job list is like a job roster: a data table with the names of workers and functions)
- Spin up: all workers start monitoring S3
snowball(function, bucketName, ...)
- snowball calls snowpack
- this writes the snowpack function that will be run on each worker.
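To make the "job roster" idea concrete, here is a minimal sketch of what such a table could look like and how a worker might pick out its own jobs. The column names (`worker`, `fn_name`, `input`) and the helper `run_jobs_for` are hypothetical and chosen for illustration; the actual layout snowball uses is not specified here.

```r
# Hypothetical job roster: one row per job, naming the worker and the
# function it should run. Column names are illustrative only.
job_list <- data.frame(
  worker  = c("worker-1", "worker-2", "worker-3"),
  fn_name = c("sqrt", "sqrt", "sqrt"),
  input   = c(4, 9, 16),
  stringsAsFactors = FALSE
)

# Each worker selects its own rows and calls the named function on the input.
run_jobs_for <- function(worker_name, roster) {
  mine <- roster[roster$worker == worker_name, , drop = FALSE]
  vapply(seq_len(nrow(mine)), function(i) {
    do.call(mine$fn_name[i], list(mine$input[i]))
  }, numeric(1))
}

# Run every worker's share locally, one after another.
results <- lapply(unique(job_list$worker), run_jobs_for, roster = job_list)
```

In the real workflow the roster would sit in the S3 bucket and each worker would read it from there rather than from a local data frame.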
1. Setup snowball
Save a .snowball file into your current working directory with the following configuration,
Use snowball_setup to set global variables.
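The package's actual `.snowball` format is not shown here, so the following is only a sketch of the general pattern: keep settings in a small key/value file and load them as global settings. The file layout (DCF-style `key: value` lines), the placeholder values, and the choice to export them as environment variables are all assumptions made for illustration. The variable names are the standard AWS ones.

```r
# Write an example config file. The layout is hypothetical and the values
# are placeholders, not real credentials.
cfg_path <- file.path(tempdir(), ".snowball")
writeLines(c(
  "AWS_ACCESS_KEY_ID: EXAMPLEKEYID",
  "AWS_SECRET_ACCESS_KEY: EXAMPLESECRET",
  "AWS_DEFAULT_REGION: ap-southeast-2"
), cfg_path)

# read.dcf returns a one-row character matrix whose column names are the keys.
cfg <- read.dcf(cfg_path)

# Export each key/value pair as an environment variable for this R session.
do.call(Sys.setenv, as.list(cfg[1, ]))
```

Keeping credentials in a file outside your scripts also makes it easier to avoid committing them to version control.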
2. Pack the snowball.
Start an AWS instance with buckets, while setting up the data/feature split
snowpack(fn, listItem, bucketNameString, rdsInputObjectString, rdsOutputString)
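One way to read that argument list: load an RDS input object from the bucket, apply `fn` to it for a single list item (for example, one feature), and write the result back as an RDS object. The sketch below imitates this locally, with a temporary directory in place of the bucket. `local_snowpack` and `col_mean` are hypothetical helpers, and the behaviour shown is an assumption about the intent, not snowball's implementation.

```r
# Local stand-in for an S3 bucket: a temporary directory.
bucket_dir <- file.path(tempdir(), "demo-bucket")
dir.create(bucket_dir, showWarnings = FALSE)

# Hypothetical helper mirroring the argument order above.
local_snowpack <- function(fn, listItem, bucketDir,
                           rdsInputObjectString, rdsOutputString) {
  input  <- readRDS(file.path(bucketDir, rdsInputObjectString))
  result <- fn(input, listItem)
  saveRDS(result, file.path(bucketDir, rdsOutputString))
  invisible(result)
}

# Put some input data in the "bucket".
saveRDS(data.frame(x = 1:10, y = (1:10)^2),
        file.path(bucket_dir, "input.rds"))

# User function: mean of one column. Each list item is a column name.
col_mean <- function(data, column) mean(data[[column]])

# One call per list item; on AWS each call would run on its own worker.
for (column in c("x", "y")) {
  local_snowpack(col_mean, column, bucket_dir,
                 "input.rds", paste0("output_", column, ".rds"))
}
```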
3. Throw the snowball.
Give the data location and the user function.
4. Avalanche the outputs.
Combine all results into one file.
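A minimal sketch of the combining step, assuming each worker has written its result as a separate RDS file holding a data frame with the same columns. The directory, the file naming pattern, and the use of `rbind` are illustrative assumptions; a local directory stands in for the S3 bucket.

```r
# Local stand-in for the bucket holding per-worker outputs.
out_dir <- file.path(tempdir(), "avalanche-demo")
dir.create(out_dir, showWarnings = FALSE)

# Simulate three workers each saving one result file.
for (i in 1:3) {
  saveRDS(data.frame(worker = i, value = i * 10),
          file.path(out_dir, sprintf("output_%d.rds", i)))
}

# Find the per-worker files, read them, and stack them into one data frame.
files <- sort(list.files(out_dir, pattern = "^output_.*\\.rds$",
                         full.names = TRUE))
combined <- do.call(rbind, lapply(files, readRDS))

# Save the combined result as a single file.
saveRDS(combined, file.path(out_dir, "combined.rds"))
```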
What is an S3 Bucket..??
We assume you have a (very) basic understanding of what an S3 Bucket is (it's like Dropbox, for data). Click here for info from Amazon. It is very easy to create a bucket. You just click
Setting up the 'bucket policy allowing an IAM user full access' is harder:
- In the top left of an AWS window click on IAM, then click on the user you want to give access to (you, most likely).
- copy the User ARN into your clipboard.
- go to the newly created bucket, click on
- click on add policy, which opens a window called "AWS Policy Generator"
- Select policy type: S3 Bucket Policy
- AWS Services should be Amazon S3,
- Actions: tick
- Paste your ARN into principal (I know... logical.)
- Paste this (with YOUR bucket name) into the ARN box:
Add Statement, copy the contents to the clipboard. Go back to the bucket page, click "Edit bucket policy" and paste the clipboard contents into it.
- click on
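For orientation, the sketch below builds the kind of JSON bucket policy that the steps above are meant to produce, so you can compare it with what the generator gives you. It assumes "full access" means the `s3:*` action, and that the policy should cover both the bucket and the objects in it. `make_bucket_policy`, the account ID, the user name, and the bucket name are all placeholders.

```r
# Hypothetical helper: fill a bucket policy template with your user ARN and
# bucket name. "s3:*" is an assumption about what "full access" means here.
make_bucket_policy <- function(user_arn, bucket_name) {
  sprintf('{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "%s" },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::%s",
        "arn:aws:s3:::%s/*"
      ]
    }
  ]
}', user_arn, bucket_name, bucket_name)
}

# Placeholder ARN and bucket name; substitute your own.
policy <- make_bucket_policy(
  "arn:aws:iam::123456789012:user/example-user",
  "my-example-bucket"
)
cat(policy, "\n")
```

Granting `s3:*` is broad; if you only need to read and write objects, a narrower set of actions is a safer choice.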