eightysteele edited this page May 7, 2012 · 32 revisions

How to deploy to EMR

Sanity checks

Before you do anything else, make sure you have the right version of lein (https://github.com/technomancy/leiningen). If you don't have the right version, run lein upgrade and then run lein clean, deps.

If inexplicable things happen, make sure your local repository (~/.m2) has the latest versions of all dependencies. We've gotten f'd by outdated versions more than once. We're using development versions of a few things after all.

One issue we've run into is having an outdated version of Cascalog. The solution seems to be the following:

rm -r ~/.m2/repository/cascalog
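If other cached dependencies go stale, the same cleanup can be wrapped in a small helper. This is only a sketch: the function name purge_m2_group and the M2_REPO variable are made up for illustration and are not part of forma or lein.

```shell
# Hypothetical helper (not part of forma or lein): delete one group's
# cached artifacts from the local Maven repository so that the next
# `lein deps` fetches fresh copies.
M2_REPO="${M2_REPO:-$HOME/.m2/repository}"

purge_m2_group() {
  group="$1"
  # Refuse an empty name so we never delete the whole repository.
  if [ -z "$group" ]; then
    echo "usage: purge_m2_group <group>" >&2
    return 2
  fi
  target="$M2_REPO/$group"
  if [ -d "$target" ]; then
    rm -r "$target"
    echo "removed $target"
  else
    echo "nothing cached at $target"
  fi
}
```

For example, purge_m2_group cascalog followed by lein deps in the project directory should pull a fresh copy of Cascalog.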

Now that you have the right code, you need to download and install the elastic-mapreduce ruby script. Installation is simple: unzip the file you downloaded, then add that location to your .bash_profile, akin to this:

PATH="/path/to/elastic-mapreduce-ruby:${PATH}"
export PATH

Then either run the command source ~/.bash_profile or open a new terminal window and work in that. You'll set up the credentials for actually using elastic-mapreduce below.
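To confirm the PATH change took effect, a quick preflight check along these lines may help. It is a sketch; the function name have_cmd is illustrative.

```shell
# Preflight sketch: confirm that the tools this guide relies on resolve
# on PATH. The function name have_cmd is illustrative.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

for tool in lein elastic-mapreduce; do
  if have_cmd "$tool"; then
    echo "ok: $tool -> $(command -v "$tool")"
  else
    echo "missing: $tool (check PATH in ~/.bash_profile)"
  fi
done
```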

Finally, and most importantly, pronounce lein as LINE and not LEAN, else the cluster will shut down in a non-deterministic way.

Credentials

Save the JSON object below in a file called credentials.json in the forma-deploy directory.

{
  "access-id":"AKIAJ56QWQ45GBJELGQA",
  "private-key":"6L7JV5+qJ9yXz1E30e3qmm4Yf7E1Xs4pVhuEL8LV",
  "key-pair":"forma-keypair",
  "key-pair-file":"~/.ssh/id_rsa-forma-keypair",
  "log-uri":"s3n://reddemrlogs"
}
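A rough way to check that a credentials file contains every key shown above is sketched below. The function name check_credentials is hypothetical, and the check only greps for key names; it does not validate JSON syntax or the values.

```shell
# Sketch: check that a credentials file mentions every key used in the
# example above. This only greps for key names; it does not validate
# JSON syntax or the values themselves.
check_credentials() {
  file="$1"
  missing=0
  for key in access-id private-key key-pair key-pair-file log-uri; do
    if ! grep -q "\"$key\"[[:space:]]*:" "$file"; then
      echo "missing key: $key" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

From within forma-deploy, check_credentials credentials.json returns 0 when all five keys are present and prints the missing ones otherwise.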

The id_rsa-forma-keypair file referenced in credentials.json is required. Ask Dan, Sam, or Robin for it. Once you have it, make sure its permissions are set correctly: if they are too loose, ssh will refuse to use the key. The command to set the permissions is:

$ chmod 600 ~/.ssh/id_rsa-forma-keypair
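If you want to verify the permissions before connecting, something like the following could work. It is a sketch: the function names are illustrative, and it assumes either GNU stat or the BSD/macOS stat is available.

```shell
# Sketch: report whether a private key file has owner-only permissions.
# Tries the GNU stat form first, then the BSD/macOS form.
key_mode() {
  stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1" 2>/dev/null
}

key_is_private() {
  mode="$(key_mode "$1")" || return 2
  case "$mode" in
    600|400) return 0 ;;
    *) echo "$1 has mode $mode; run: chmod 600 $1" >&2; return 1 ;;
  esac
}
```

Calling key_is_private ~/.ssh/id_rsa-forma-keypair returns 0 when the mode is 600 (or 400) and prints a hint otherwise.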

Spin up the cluster

First, upload the latest forma-clj/dev/forma_bootstrap.sh script to S3 here:

https://s3.amazonaws.com/reddconfig/bootstrap-actions/forma_bootstrap.sh

Note that the above URL is hard coded here:

https://github.com/reddmetrics/forma-deploy/blob/master/src/forma/hadoop/cluster.clj#L241

The CLI for deploying to a cluster is in forma-deploy/src/forma/hadoop/cli.clj. From within the forma-deploy directory, use this:

lein run --type large --emr --size 25

There are three supported cluster types, defined in forma-deploy/src/forma/hadoop/cluster.clj: large which launches instances of type m1.large (ami-4db76624), high-memory which launches instances of type m2.4xlarge, and cluster-compute which launches instances of type cc1.4xlarge.
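The type-to-instance mapping described above can be written down as a small lookup, which is handy for catching a mistyped --type before launching anything. This is a sketch with illustrative function names; the real mapping lives in cluster.clj.

```shell
# Mapping taken from the paragraph above; the function names are
# illustrative and not part of forma-deploy.
instance_type_for() {
  case "$1" in
    large)           echo "m1.large" ;;
    high-memory)     echo "m2.4xlarge" ;;
    cluster-compute) echo "cc1.4xlarge" ;;
    *) echo "unknown cluster type: $1" >&2; return 1 ;;
  esac
}

# Print (rather than run) the launch command after validating the type.
launch_command() {
  ctype="$1"; size="$2"
  instance_type_for "$ctype" >/dev/null || return 1
  echo "lein run --type $ctype --emr --size $size"
}
```

For instance, launch_command large 25 prints the command shown earlier, while an unsupported type fails before anything is launched.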

It takes a while to boot up the cluster. Here are a few helpful commands for monitoring and terminating a cluster:

$ elastic-mapreduce --list --active

$ # terminate all active clusters/job flows
$ elastic-mapreduce --list --active --terminate

$ # get a cluster/job flow id, and terminate it
$ elastic-mapreduce --terminate --jobflow j-ABABABASABA

$ # log into master node
$ elastic-mapreduce --ssh --jobflow j-ABABABABA

More helpful hints from AWS: http://aws.amazon.com/developertools/2264

You can use the option --all to list all past jobs on EMR, not just the ones from the previous week or two.
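Rather than re-running the list command by hand, you could poll it until the cluster is ready. The sketch below makes an assumption about the output format: that each job flow line starts with the job flow id followed by its state, separated by whitespace. Check this against your version of the CLI before relying on it. The function names are illustrative.

```shell
# Sketch: extract a job flow's state from `elastic-mapreduce --list` output.
# ASSUMPTION: each job flow line starts with the id followed by the state,
# separated by whitespace, e.g. "j-ABABABABA  WAITING  ec2-...  name".
jobflow_state() {
  awk -v id="$1" '$1 == id { print $2; exit }'
}

# Poll every 30 seconds until the job flow reaches WAITING, or until it
# disappears from the list of active job flows.
wait_for_cluster() {
  id="$1"
  while :; do
    state="$(elastic-mapreduce --list --active | jobflow_state "$id")"
    case "$state" in
      WAITING) echo "cluster $id is ready"; return 0 ;;
      "") echo "cluster $id is no longer active" >&2; return 1 ;;
      *) echo "state: $state"; sleep 30 ;;
    esac
  done
}
```

Usage would be wait_for_cluster j-ABABABABA, with the id taken from the list output.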

You can also monitor the cluster via the AWS EMR console:

https://console.aws.amazon.com/elasticmapreduce/home?region=us-east-1

And to monitor nodes coming online, see the AWS EC2 console:

https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances

If your cluster gets killed due to spot price spikes, try moving it to a different availability zone (AZ). For now you need to edit the file src/forma/hadoop/cluster.clj and add a line like the following (choose your own AZ) in the boot-emr! function:

--availability-zone us-east-1e

Working with the cluster

After the cluster on EMR is in the Waiting state, we're ready. Run lein uberjar in the forma-clj project, then upload the resulting jar to the job tracker:

$ lein uberjar
# Creates forma-0.2.0-SNAPSHOT-standalone.jar

Then from within forma-deploy:

$ elastic-mapreduce --list # Get the job tracker URL (e.g., ec2-67-202-41-140.compute-1.amazonaws.com)
$ sftp -i ~/.ssh/id_rsa-forma-keypair hadoop@ec2-67-202-41-140.compute-1.amazonaws.com
sftp> put forma-0.2.0-SNAPSHOT-standalone.jar

# alternatively, for the scp-inclined:
$ scp -i ~/.ssh/id_rsa-forma-keypair forma-0.2.0-SNAPSHOT-standalone.jar hadoop@ec2-67-202-41-140.compute-1.amazonaws.com:
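Since the master's address changes with every cluster, it can be convenient to wrap the upload in a function that takes the address as an argument. This is a sketch: KEY, JAR, DRY_RUN and the function name upload_jar are illustrative, and the scp invocation simply mirrors the one above.

```shell
# Sketch of an upload helper. KEY, JAR, DRY_RUN and the function name are
# illustrative; the scp invocation mirrors the one above.
KEY="${KEY:-$HOME/.ssh/id_rsa-forma-keypair}"
JAR="${JAR:-forma-0.2.0-SNAPSHOT-standalone.jar}"

upload_jar() {
  master="$1"
  if [ -z "$master" ]; then
    echo "usage: upload_jar <master-public-dns>" >&2
    return 2
  fi
  # With DRY_RUN=1, print the command instead of running it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo scp -i "$KEY" "$JAR" "hadoop@$master:"
  else
    scp -i "$KEY" "$JAR" "hadoop@$master:"
  fi
}
```

Running it once with DRY_RUN=1 is a cheap way to check the key path and jar name before copying anything.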

Next, SSH in as the hadoop user and create a new screen:

$ ssh -i ~/.ssh/id_rsa-forma-keypair hadoop@ec2-67-202-41-140.compute-1.amazonaws.com
$ # this gets you a fresh, ready-to-use "window" in screen,
$ # and turns on logging for anything printed to stdout.
$ screen -Lm

Then you're ready to run a job! Try this for 500m data in Indonesia:

$ hadoop jar forma-0.2.0-SNAPSHOT-standalone.jar forma.hadoop.jobs.modis s3n://modisfiles/MOD13A1 s3n://pailbucket/master '*' :IDN
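A note on the * argument: if it is left unquoted, the shell expands it into the names of the files in the current directory before hadoop ever sees it. Assuming the job expects the literal pattern, quoting it is the safer form. The small demonstration below (with a made-up count_args helper and placeholder paths) shows the difference in a scratch directory.

```shell
# Demonstration: how many arguments does a command actually receive?
count_args() { echo "$#"; }

demo_dir="$(mktemp -d)"
touch "$demo_dir/a.jar" "$demo_dir/b.log"

# Command substitution runs in a subshell, so the caller's working
# directory is left untouched.
unquoted="$(cd "$demo_dir" && count_args s3n://in s3n://out * :IDN)"
quoted="$(cd "$demo_dir" && count_args s3n://in s3n://out '*' :IDN)"

echo "unquoted: $unquoted arguments"   # * was replaced by a.jar and b.log
echo "quoted:   $quoted arguments"     # * was passed through literally
rm -r "$demo_dir"
```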

Or launch a REPL and run your job:

$ hadoop jar forma-0.2.0-SNAPSHOT-standalone.jar clojure.main

Once in the REPL:

(use 'forma.hadoop.jobs.scatter)
(in-ns 'forma.hadoop.jobs.scatter)
(ultrarunner "/user/hadoop/checkpoint"
             "s3n://formaresults/staticbuckettemp"
             "s3n://formaresults/finalbuckettemp"
             "s3n://formaresults/finaloutput")

Logging into slave nodes

It can be useful to log into slave nodes on occasion, whether to check memory usage or browse local log files. You can do this by grabbing the private IP address of an instance using the Hadoop Jobtracker. For example, this page gives you the list of nodes you're running:

http://ec2-23-20-44-190.compute-1.amazonaws.com:9100/machines.jsp?type=active

The Host column gives you the private IP address for each node, for example 10.28.100.164. Log into the AWS console, go to the EC2 page, and enter the private IP address in the search/filter box. The matching instance will become the only one visible. Click as usual to get its public DNS address. If you have the following function in your ~/.bash_profile:

function emr() { ssh -i ~/.ssh/id_rsa-forma-keypair hadoop@$@; }

you can log in using the public DNS like this:

$ emr ec2-23-20-44-190.compute-1.amazonaws.com

Notes on using screen

Nice tutorial for getting started. And here are the basic commands you'll need:

Launching:
launch: screen
launch with no welcome screen: screen -m
launch with logging: screen -Lm
launch and immediately disconnect: screen -dLm

Once in screen:
disconnect and kill: ^-d
disconnect without killing: ^-a d
view stdout history: ^-a [
see key bindings: ^-a ?

Reconnect:
get list of existing screens (if more than one): screen -r
reconnect to existing screen (if only one): screen -r
reconnect to most recent screen (if one or more): screen -rr
reconnect to specific screen: screen -r internal-address.ttys.other-stuff

If you somehow get disconnected and screen says something is still attached to a "window", use this to force a disconnect and reconnect immediately:

screen -dr internal-address.ttys.other-stuff

Monitor the job

Open the jobtracker in a browser at:

http://<Master DNS>:9100/jobtracker.jsp

For example,

http://ec2-107-21-135-90.compute-1.amazonaws.com:9100/jobtracker.jsp

You can find the master DNS in the output of elastic-mapreduce --list or in the AWS EMR console.
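If you find yourself pasting the master DNS into these URLs often, a pair of tiny helpers can build them. This is a sketch with illustrative function names; port 9100 and the page names are taken from the examples in this guide.

```shell
# Sketch: build the jobtracker URLs used in this guide from the master's
# public DNS name. Port 9100 and the page names come from the examples
# in this guide; the function names are illustrative.
jobtracker_url() {
  echo "http://$1:9100/jobtracker.jsp"
}

active_nodes_url() {
  echo "http://$1:9100/machines.jsp?type=active"
}
```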

When something goes wrong

Sometimes forma_bootstrap.sh is broken, or it's not public on S3, causing the EMR cluster to fail automatically and shut down. If this happens, stay calm! Just comment out the bootstrap step here, spin up a small cluster, upload forma_bootstrap.sh to the job tracker, run it, and investigate the errors.

Make sure you have the latest version of the EMR CLI tools. Defaults change across versions, and the bootstrap script as of 2/13/2012 expects the version released in December 2011.

Investigating errors

To examine the EMR logs, navigate to the reddemrlogs bucket on the S3 console or some other S3 access service. The folders are tagged with the Job Flow ID, which can be found by navigating to the Elastic MapReduce tab on the AWS console. Click on the relevant EMR job, and the Job Flow ID will be at the top of the descriptive panel.

If there are memory errors, try out a few things listed on this StackOverflow page.

If there's a problem with bootstrapping the cluster, it can be useful to run the bootstrapping script on the AMI we use: ami-4db76624.

When you need a REPL

hadoop jar forma-0.2.0-SNAPSHOT-standalone.jar clojure.main