Automatically start a cluster of worker nodes to do parallel processing

snowball

An R package to do parallel processing on Amazon, (more) easily. Born in 2016 at the Brisbane rOpenSci Unconference. This is a work in progress and still under development.

Authors:

Automatically sets up and starts a cluster of AWS workers, does parallel processing, and saves the output to an S3 bucket.

# Install
devtools::install_github("ropenscilabs/snowball")

WARNING: Check yourself, before you wreck yourself! You are the ruler of your own Amazon costs. (No responsibility taken for your AWS bill...)

snowball takes the location of the data, a user-defined function, and some basic instructions, then sets up and runs virtual machines in parallel on Amazon and saves the results in an S3 bucket.

Requirements

  • An AWS account, with:
    • IAM user with permissions to manage EC2 and S3.
    • API keys for the IAM user.
    • An S3 bucket:
      • with a policy allowing the IAM user full access
      • containing the data and the user function, saved as .rds
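
The last requirement can be met from R with base `saveRDS()`. A minimal sketch, assuming a toy function and data set: `my_fn`, `my_data`, and the file names are made up for illustration, and a temporary directory stands in for the bucket (the upload itself is left out).

```r
# Hypothetical user function and data; the names are illustrative only.
my_fn <- function(x) mean(x) * 2
my_data <- list(a = 1:10, b = 11:20)

# Save both as .rds files in a temporary directory (stand-in for the bucket).
out_dir <- tempfile("snowball-demo-")
dir.create(out_dir)
fn_path <- file.path(out_dir, "my_fn.rds")
data_path <- file.path(out_dir, "my_data.rds")
saveRDS(my_fn, fn_path)
saveRDS(my_data, data_path)

# Reading the files back returns the same objects.
restored_fn <- readRDS(fn_path)
restored_data <- readRDS(data_path)
restored_fn(restored_data$a)
```

Functions survive the round trip through `.rds`, which is why a worker can load and call one without the original script.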

Overview / workflow:

  1. Put the job list and data in the S3 bucket (the job list is like a job roster: a data table with the names of workers and functions).
  2. Spin up all workers; they start monitoring S3.
  3. snowball(function, bucketName, ...)
    • snowball calls snowpack
    • this writes the snowpack function that will be run on each worker.
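
To make the "job roster" idea concrete, here is a rough sketch of what such a table might look like. The column names (`worker`, `fn_file`, `input_file`) are assumptions for illustration only; the layout snowball actually expects may differ.

```r
# Hypothetical job roster: one row per worker. Column names are assumptions.
job_list <- data.frame(
  worker = c("worker-1", "worker-2", "worker-3"),
  fn_file = "my_fn.rds",
  input_file = c("part-1.rds", "part-2.rds", "part-3.rds"),
  stringsAsFactors = FALSE
)

# Save the roster as .rds (a temporary file stands in for the bucket).
roster_path <- tempfile(fileext = ".rds")
saveRDS(job_list, roster_path)

# A worker could then look up its own row by name.
my_job <- subset(readRDS(roster_path), worker == "worker-2")
my_job$input_file
```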

How to

1. Setup snowball

Save a .snowball file in your current working directory with the following configuration:

AWS_ACCESS_KEY_ID: <YOURACCESSKEYID>

AWS_SECRET_ACCESS_KEY: <YOURSECRETACCESSKEY>

AWS_DEFAULT_REGION: <YOURDEFAULTREGION>

Next, run snowball_setup to set global variables.

snowball_setup(config_file, echo)
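
How snowball_setup works internally is not covered here. One plausible reading of "set global variables" is that the keys end up as environment variables. A minimal sketch of that idea in base R, assuming the `KEY: value` layout above; `read_snowball_config` is a made-up helper, not a function from the package, and the credentials are dummy values.

```r
# Hypothetical helper, not part of snowball: parse "KEY: value" lines.
read_snowball_config <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  lines <- lines[nzchar(lines)]                  # drop blank lines
  keys <- trimws(sub(":.*$", "", lines))         # text before the first colon
  values <- trimws(sub("^[^:]*:", "", lines))    # text after the first colon
  as.list(stats::setNames(values, keys))
}

# Write a throwaway config file with dummy values.
config_path <- tempfile()
writeLines(c("AWS_ACCESS_KEY_ID: EXAMPLEKEYID",
             "",
             "AWS_SECRET_ACCESS_KEY: exampleSecret",
             "",
             "AWS_DEFAULT_REGION: ap-southeast-2"), config_path)

# Export every entry as an environment variable for this R session.
config <- read_snowball_config(config_path)
do.call(Sys.setenv, config)
Sys.getenv("AWS_DEFAULT_REGION")
```

Keep the real .snowball file out of version control, since it holds your secret key.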

2. Pack the snowball.

Start an AWS instance with buckets, while setting up the data/feature split

snowpack(fn, listItem, bucketNameString, rdsInputObjectString, rdsOutputString)
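
Going by the argument names alone, each worker appears to read an input .rds object, apply `fn` to one list item, and write an output .rds object. The sketch below imitates that on the local disk. It is a guess at the shape of the work, not the real snowpack: `run_one_item` and `bucket_dir` are hypothetical, and a temporary directory replaces the bucket.

```r
# Hypothetical local stand-in for one worker's job; not the real snowpack().
run_one_item <- function(fn, list_item, bucket_dir, rds_input, rds_output) {
  input <- readRDS(file.path(bucket_dir, rds_input))   # fetch the shared data
  result <- fn(input[[list_item]])                     # process one list item
  saveRDS(result, file.path(bucket_dir, rds_output))   # store the result
  invisible(result)
}

# A temporary directory plays the role of the bucket.
bucket_dir <- tempfile("bucket-")
dir.create(bucket_dir)
saveRDS(list(a = 1:10, b = 11:20), file.path(bucket_dir, "input.rds"))

# Process item "b" with sum() and read the stored result back.
run_one_item(sum, "b", bucket_dir, "input.rds", "output-b.rds")
readRDS(file.path(bucket_dir, "output-b.rds"))
```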

3. Throw the snowball.

Give the data location and the user function.

throwSnowball(...)

4. Avalanche the outputs.

Combine all results into one file.

avalanche(...)
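
The combine step can be pictured as reading every per-worker result and saving them together. A minimal local sketch under that assumption: `combine_results` and the `output-*.rds` naming scheme are hypothetical, and the real avalanche reads from S3 rather than from a directory.

```r
# Hypothetical local stand-in for the combine step; not the real avalanche().
combine_results <- function(result_dir, pattern = "^output-.*\\.rds$") {
  files <- sort(list.files(result_dir, pattern = pattern, full.names = TRUE))
  results <- lapply(files, readRDS)     # one element per worker output
  names(results) <- basename(files)     # remember where each result came from
  results
}

# Fake two worker outputs in a temporary directory.
result_dir <- tempfile("results-")
dir.create(result_dir)
saveRDS(55, file.path(result_dir, "output-a.rds"))
saveRDS(155, file.path(result_dir, "output-b.rds"))

# Gather them into one list and save that list as a single file.
combined <- combine_results(result_dir)
saveRDS(combined, file.path(result_dir, "combined.rds"))
names(combined)
```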

More help?

Snow what?

Check out the documentation for the snow and snowfall packages.

What is an S3 Bucket..??

We assume you have a (very) basic understanding of what an S3 bucket is (it's like Dropbox, for data). Amazon's documentation has more info. It is very easy to create a bucket: you just click Create Bucket.

Setting up the 'bucket policy allowing an IAM user full access' is harder:

  • In the top left of an AWS window, click on Services, then IAM, then click on the user you want to give access to (you, most likely).
  • Copy the User ARN to your clipboard.
  • Go to the newly created bucket and click on Properties.
    • Click on add policy, which opens a window called "AWS Policy Generator".
      • Select policy type: S3 Bucket Policy.
      • AWS Service should be Amazon S3.
      • Actions: tick All Actions.
      • Paste your ARN into Principal (I know... logical.)
      • Paste this (with YOUR bucket name) into the ARN box: arn:aws:s3:::bucketName
    • Click Add Statement and copy the contents to the clipboard. Go back to the bucket page, click "Edit bucket policy", and paste the clipboard contents there.
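
For orientation, the policy the generator produces should look roughly like the JSON built below. This is a sketch under assumptions: the user ARN and bucket name are placeholders, and `make_bucket_policy` is a made-up helper. The second resource, ending in `/*`, is an addition that is not in the steps above; as far as we know, S3 applies object-level actions (such as reading or writing a file) to object ARNs, so a policy listing only the bucket ARN may not cover them.

```r
# Build a bucket policy document as a JSON string.
# The ARN and bucket name in the example call are placeholders.
make_bucket_policy <- function(user_arn, bucket_name) {
  sprintf(
'{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "%s" },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::%s",
        "arn:aws:s3:::%s/*"
      ]
    }
  ]
}', user_arn, bucket_name, bucket_name)
}

policy <- make_bucket_policy("arn:aws:iam::123456789012:user/example",
                             "bucketName")
cat(policy, "\n")
```

"Action": "s3:*" matches ticking All Actions in the generator. Narrow it if the workers only need to read and write objects.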