In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

# mrjob in the Cloud

[mrjob](https://github.com/Yelp/mrjob) is a Python library originally developed by Yelp to streamline writing and running Hadoop Streaming jobs, both locally and in the cloud. Instead of calling the Hadoop Streaming API directly, we write Python code and mrjob handles much of the cluster configuration automatically.

## Amazon's Cloud Computing Services

Amazon Web Services has extremely thorough documentation around everything from the commands available to the command line interface (CLI) `aws {commands}`, to the Python wrapper for said interface `boto`, to full tutorials and examples on how to fire up an EMR cluster or a bunch of EC2 instances with almost any desired data processing framework.

EC2 is cheaper than EMR, but EMR is recommended for immediate use of Hadoop and any other project in the ecosystem (configurable for your cluster via [Amazon Machine Images](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html), or AMIs). In a production setting you will probably want to pin specific versions for consistency; in our case it's safe to use the most recent version (`3.6.0` at the time of this writing).

### Setting up a personal AWS account

To use AWS you'll need to [create an account](http://aws.amazon.com/) if you haven't already. For the first year after new account creation, you'll be eligible for discounts on some services as part of the Free Tier program.

Access the AWS [web console](https://console.aws.amazon.com/s3/) to handle most of your configuration. You'll need at least one S3 bucket to serve as storage for your logs and output.

From there you can create EMR clusters as you wish and run jobs. Be careful about the nodes you use, as only certain sizes are eligible for the free tier discounts. Still, you only pay for what you use, and the costs for small, educational jobs are relatively manageable.

There's an in-depth [tutorial](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-get-started.html) available; more detailed cluster configuration information can be found in this notebook and in the Spark module.

Also note that in addition to the normal credentials you might need to take care of:
* Generating an EC2 keypair (this is separate from the AWS general keypair and goes in .mrjob.conf)
* `aws emr create-default-roles` if you plan on just using the defaults for EMR
* Making sure your user is part of a group with sufficient permissions (admin is probably fine)

### Managing AWS credentials on Digital Ocean

"If you want to use your personal AWS keys, you can overwrite `~/.mrjob.conf` with yours. Keys for our S3 bucket are saved under `~/.aws/config`. If you ever want to run mrjob using our keys for S3 access, you can simply remove the keys from `~/.mrjob.conf` and (according to the mrjob docs), the keys in `~/.aws/config` will take precedence."

In general, make sure that the configuration in `~/.mrjob.conf` makes sense.

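If your bootstrap step in `~/.mrjob.conf` installs pip, `- sudo apt-get install -y python-pip || sudo yum install -y python-pip` is a more robust bootstrapping statement, because it falls back to `yum` on images where `apt-get` is not available.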
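
As a rough illustration of where that bootstrap line lives, the sketch below builds a minimal configuration and prints it as JSON (YAML is a superset of JSON, so a JSON document is also valid YAML). The key pair name and `.pem` path are hypothetical placeholders, and the option names `ec2_key_pair`, `ec2_key_pair_file` and `bootstrap` should be checked against the mrjob documentation for the version you have installed.

```python
import json


def build_mrjob_conf(key_pair_name, key_pair_file):
    """Return a dict mirroring the layout of ~/.mrjob.conf for the EMR runner."""
    return {
        "runners": {
            "emr": {
                # EC2 key pair (separate from your general AWS credentials)
                "ec2_key_pair": key_pair_name,
                "ec2_key_pair_file": key_pair_file,
                # Shell commands run on each node before the job starts
                "bootstrap": [
                    "sudo apt-get install -y python-pip || sudo yum install -y python-pip",
                ],
            }
        }
    }


# "my-keypair" and the .pem path are placeholders, not real values.
conf = build_mrjob_conf("my-keypair", "~/.ssh/my-keypair.pem")
conf_text = json.dumps(conf, indent=2)
print(conf_text)
```

The script only prints the configuration; review it before copying anything into your real `~/.mrjob.conf`.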

You can set up multiple profiles in the `~/.aws/credentials` file in order to facilitate copying data from our S3 bucket while still being able to access your own.
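
The credentials file uses an INI layout with one section per profile. The sketch below renders such a file with Python's standard `configparser`; the profile name `course` and all key values are made-up placeholders.

```python
import configparser
import io


def render_credentials(profiles):
    """Render an INI document with one section per AWS profile.

    `profiles` maps a profile name to an (access key id, secret key) pair.
    """
    parser = configparser.ConfigParser()
    for name, (key_id, secret) in profiles.items():
        parser[name] = {
            "aws_access_key_id": key_id,
            "aws_secret_access_key": secret,
        }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Placeholder values only; never commit real keys to a notebook.
credentials_text = render_credentials({
    "default": ("MY_PERSONAL_KEY_ID", "MY_PERSONAL_SECRET"),
    "course": ("COURSE_KEY_ID", "COURSE_SECRET"),
})
print(credentials_text)
```

With a file like this in place, a specific profile can be selected per command, for example `aws s3 ls --profile course`.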

### AWS credentials and command line tools

1. To verify that the AWS CLI is configured correctly, try
``` bash
aws s3 ls
```
You should get back a list of the S3 buckets your user can access.  If not, double-check that your permissions are set up correctly.

1. `boto` ([docs](https://boto.readthedocs.org/en/latest/)) is a Python library for working with AWS services, covering much of the functionality of `awscli`.  You can install it using
``` bash
pip install boto
```
and follow the instructions in the docs to get started.

1. Another option for interacting with S3 from the command line is `s3cmd`. You can download it via
``` bash
git clone https://github.com/s3tools/s3cmd.git
```
and follow the documentation [here](https://github.com/s3tools/s3cmd).

### Python `mrjob`

To test it, clone this [GitHub repo](https://github.com/Yelp/mrjob) and run the word count example on `README.rst`:
```bash
git clone https://github.com/Yelp/mrjob.git
cd mrjob

# run command locally
# this is good for testing and debugging on files locally
python examples/mr_word_freq_count.py README.rst \
   --no-output --output-dir=/tmp/wc/
   
# check the output file contents:
cat /tmp/wc/* | more

# run command on ec2 and write output to our s3 bucket
# this costs money so only do it when you have working code

python examples/mr_word_freq_count.py -r emr README.rst \
   --no-output --output-dir=s3://dataincubator-fellow/<user>-wc/
```
Note: if you're unable to start a new jobflow, use the `check_emr_jobflows.py` script in `datacourse/scripts/` and explicitly join the shortest queue by adding the flag `--emr-job-flow-id=j-JOBFLOWID`.

### Checking the output file contents
```bash
aws s3 ls s3://dataincubator-fellow/<user>-wc/
aws s3 cp --recursive s3://dataincubator-fellow/<user>-wc/ /tmp/<user>-wc/
```
Note: be sure to fill in `<user>` with a key that is unique to you.

A few notes:
1. You can also upload files to S3 using the AWS CLI so that your entire workflow can be on S3.
1. The server will take a while to boot up the first time, but it will stay alive.  Any subsequent jobs that are submitted will not have to wait for it to boot again.  If there are already jobs running, it will wait up to 5 minutes before spawning another server (please be patient).  It will stay idle for 2 hours and then shut itself down.
1. Take a look at `examples/mr_word_freq_count.py`.  This is the simple "word count" mapreduce.
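
To make the map and reduce steps of a word count concrete, here is a plain-Python simulation. It does not use mrjob and is not the code in `examples/mr_word_freq_count.py`; the function names and the word-splitting regular expression are illustrative assumptions.

```python
import re
from collections import defaultdict

# Assumed tokenizer: runs of word characters and apostrophes
WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def run_local(lines):
    """Simulate map -> shuffle (group by key) -> reduce in one process."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)
    return dict(reducer(word, counts) for word, counts in sorted(grouped.items()))


word_counts = run_local(["the quick brown fox", "The lazy dog"])
print(word_counts)
```

On a real cluster the grouping step is done by Hadoop between the mapper and the reducer; here a dictionary stands in for it.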

## Useful AWS CLI commands

- Access Hadoop web UI, e.g. ResourceManager
    - Option 1: via a local port
```bash 
ssh -L 8158:ec2-52-0-25-37.compute-1.amazonaws.com:9026 hadoop@ec2-52-0-25-37.compute-1.amazonaws.com -i /path/to/fellos201501.pem
```
Now in your browser, go to the address `localhost:8158`  

    - Option 2: via dynamic port forwarding
        1. Type: 
```bash
aws emr socks --cluster-id j-XXXX --key-pair-file ~/path/to/keypair.pem
```
        1. Then follow [these steps](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-proxy.html) to add FoxyProxy to your Chrome/Firefox browser to natively use the web UI.


- `ssh` into master node to access HDFS, Pig interactive, etc.

```bash
ssh hadoop@{master-public-dns}.compute-1.amazonaws.com -i /path/to/pemfile.pem
```

- One-liner to preview a .gz file:
```bash
s3cmd get s3://dataincubator-course/wikidata/wikistats/pagecounts/pagecounts-20081208-030000.gz - | gunzip -c | less
```
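
The same preview idea can be sketched in Python for a `.gz` file that has already been downloaded locally. The helper name `preview_gz` is hypothetical, and the demo writes its own small sample file rather than touching S3.

```python
import gzip
import os
import tempfile
from itertools import islice


def preview_gz(path, n_lines=10):
    """Return the first n_lines of a gzip-compressed text file.

    The file is decompressed as a stream, so large files are not
    read into memory in full.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as out:
        out.write("".join(f"line {i}\n" for i in range(100)))
    first_lines = preview_gz(sample_path, n_lines=3)

print(first_lines)
```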

### Third party software

There are some third-party tools that can help navigate AWS S3. It can be time-consuming to go through the command line looking for logs when there's no autocomplete or easily viewable directory structure - in which case something like Bucket Explorer might save you some time.

## Google Cloud Platform

Cloud Dataproc is GCP's analog to EMR: a managed Hadoop cluster environment that uses Google Compute Engine instances under the hood. There's a comprehensive 60-day free trial with $300 of credit to use (which should be more than enough for our purposes).  You can interact with Compute Engine either through the [Cloud SDK website](https://cloud.google.com/sdk/#Quick_Start) or through the `gcloud` command line tool.  Here are some step-by-step instructions to get started, largely using the website.

1. Go to the [Cloud SDK website](https://cloud.google.com/sdk/#Quick_Start) and sign up for a trial account. It will request your credit card information, but you will not be charged.

1. In the top bar of the website, there should be a "Go to projects" drop down.  Open it and select "Create a project...".  You will be prompted for a name, and then allowed to create the project.  The project acts as an umbrella for various related resources.

1. From the menu in the upper left, select "Storage", and then create a bucket.  This is essentially a cloud filesystem onto which you can load files and directories.  Give it a unique name.

1. From the same menu in the upper left, select "API Manager".  Ensure that the "Compute Engine API", the "Cloud Dataproc API", and the  "Cloud Storage JSON API" are all enabled.

1. Now, switch to the command line on your DO box.  Run `gcloud init`.  It will ask you to authenticate with your Google account, as well as set the default project and compute zone.

  1. This should set things up to authenticate all gcloud commands.  But if you're having trouble, go to the "Credentials" tab in the "API Manager" section of the website.  Choose "Create Credentials" and then "Service account key". Then, select New service account, enter a Name and select Key type JSON. Put this JSON file somewhere and point to it with the environment variable `$GOOGLE_APPLICATION_CREDENTIALS` (e.g. `export GOOGLE_APPLICATION_CREDENTIALS=/home/vagrant/auth/gcp.json`). Make sure the service account has editor-level permissions.

1. Upload data to the bucket you created above, using the `gsutil cp` command.  It can take as arguments local paths, Google Storage paths (`gs://bucket-name/path`), and Amazon S3 paths (`s3://bucket-name/path`).  Useful flags include `-r` (recursive) and `-m` (parallelize tasks).  For example, `gsutil -m cp -r s3://dataincubator-course/mrdata/english/ gs://mybucket/data/wikipedia/`.

1. Run a job with MRJob.  You can either let MRJob create a cluster for you, or you can provision one by hand.  The latter is recommended, as it is a bit easier to configure.

  1. If no cluster is specified, MRJob will create a cluster for you.  For example,
  ```bash
      python script.py -r dataproc \
          gs://bucket/directory > out.txt
  ```
  To use third-party libraries you'll need to configure the bootstrapping option in `~/.mrjob.conf`. You can also specify cluster creation options here. For the free trial, remember that not all options are available.
  
  1. You can create a cluster using the Cloud SDK Website.  From the main menu, select "Dataproc", and then choose "Create cluster".  You can configure the cluster, including using SSDs, which often helps performance.  Then when launching the job, you specify the cluster ID:
  ```bash
      python script.py -r dataproc \
          --cluster-id cluster-1 \
          gs://bucket/directory > out.txt
  ```
  Before you do so, you need to be aware that the cluster will not be provisioned with many libraries.  Critically, mrjob itself isn't included.  The easiest solution is to create a provisioning script that looks something like this:
  ```bash
      #!/bin/bash
      curl https://bootstrap.pypa.io/get-pip.py | python
      pip install mrjob
      pip install simplejson
      pip install mwparserfromhell
      pip install lxml
  ```
  Upload this script to a Google Storage bucket (`gsutil cp`).  When you are creating a cluster, select the advanced options ("Preemptible workers, bucket, network, version, initialization, & access options") and enter the `gs://` path to the init script in the "Initialization actions" entry.  This will cause each node in the cluster to run this script, installing pip, mrjob, etc.
  
    **Remember:** The cluster will not be shut down after your job completes.  Destroy it, from the web console or the command line, to avoid unnecessary charges.

There is a [quickstart guide](https://pythonhosted.org/mrjob/guides/dataproc-quickstart.html) to using mrjob with Dataproc.  The full [mrjob documentation](https://media.readthedocs.org/pdf/mrjob/latest/mrjob.pdf) contains plenty more details.

### <a name="upgrading"></a>Upgrading from the free trial

The free trial limits you to 8 YARN cores, including the master node, which realistically means the biggest cluster you can use on Dataproc has 3 worker instances with 2 cores each (the remaining 2 cores go to the master).

Upgrading to a paid account is a significant time saver. You will keep your free trial credit (it expires when your trial would have expired); the only difference is that **you will have to manually cancel your account** to avoid being billed after the trial period expires. Upgrading is painless and will increase your quota to 24 YARN cores.
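
The quota arithmetic can be sketched as follows. The default of 2 cores for the master and 2 cores per worker is an assumption about the instance types chosen; adjust both to match your cluster.

```python
def max_workers(core_quota, master_cores=2, cores_per_worker=2):
    """Largest number of worker instances that fits in a core quota.

    The master's cores count against the quota, so they are
    subtracted before dividing the remainder among workers.
    """
    remaining = core_quota - master_cores
    if remaining < 0:
        raise ValueError("quota is too small for the master node")
    return remaining // cores_per_worker


free_trial_workers = max_workers(8)   # free trial quota
paid_workers = max_workers(24)        # quota after upgrading
print(free_trial_workers, paid_workers)
```

Under these assumptions the free trial allows 3 workers and the upgraded quota allows 11.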

### Viewing logs

mrjob tries to pull back the actual Python tracebacks when a job fails, but as of this writing, this is not implemented for Dataproc.  You can find the logs by SSHing into the worker nodes and examining them directly, but it may be slightly easier to use the log viewer in the web console.  Go to the [Logging page](https://console.cloud.google.com/logs) of the console and select "Dataproc" from the drop-down menu.  You can then select a specific cluster to search, or you can search all clusters.  Searching for "Traceback" will yield all lines that open a Python traceback, but you can't see the whole error message from this view.  Instead, open that line and find the filename in the `structPayload` structure.  Then search for that filename.  You'll get all the lines of that file, in order, so you can figure out what the error actually was.
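
If you export log entries and want to repeat that two-step search offline, the logic might look like the sketch below. The record layout is an assumption: each entry is taken to be a dict with a `structPayload` holding a `filename` and a `message` field (the `message` name in particular is a guess), and the sample entries are invented.

```python
def find_traceback_files(entries):
    """Step 1: collect filenames whose entries open a Python traceback."""
    files = []
    for entry in entries:
        payload = entry.get("structPayload", {})
        filename = payload.get("filename")
        if "Traceback" in payload.get("message", "") and filename and filename not in files:
            files.append(filename)
    return files


def lines_for_file(entries, filename):
    """Step 2: return every message logged for one file, in original order."""
    return [
        entry["structPayload"].get("message", "")
        for entry in entries
        if entry.get("structPayload", {}).get("filename") == filename
    ]


# Invented sample entries, for illustration only.
entries = [
    {"structPayload": {"filename": "container_01/stderr", "message": "Starting task"}},
    {"structPayload": {"filename": "container_02/stderr", "message": "all good"}},
    {"structPayload": {"filename": "container_01/stderr", "message": "Traceback (most recent call last):"}},
    {"structPayload": {"filename": "container_01/stderr", "message": "ValueError: bad record"}},
]

failing_files = find_traceback_files(entries)
error_lines = lines_for_file(entries, failing_files[0])
print(failing_files)
print("\n".join(error_lines))
```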

Another option: You can open the YARN web interface by following the steps documented by GCP here:
https://cloud.google.com/dataproc/docs/concepts/cluster-web-interfaces
Some logging aggregation is disabled by default on GCP, but you can determine where tasks failed, SSH into worker nodes a little more carefully, and look at logs from there. It's also useful for keeping track of your jobs.

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*