### MACS 30123 Lab Session: Working with EMR Clusters/Spark
*Week 7*

**Edited by Ethan Kozlowski, Nalin Bhatt, Max Zhu (developed from Adam Wu, Wonje Yun)**


### Create S3 Bucket

First, create an S3 bucket to store our files.

In [1]:
import boto3

In [3]:
# Initialize boto3 handler
s3 = boto3.resource('s3')

# Create a new bucket to store your files
# (bucket names are globally unique, so replace this with your own name)
BUCKETNAME = 'ethankoz-bucket'
s3.create_bucket(Bucket=BUCKETNAME)

# This is what we will use to interface with the specific bucket
bucket = s3.Bucket(BUCKETNAME)
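
The `create_bucket` call above takes no location argument, which works when your default region is `us-east-1` (the region used later in this lab). In any other region, S3 expects a `CreateBucketConfiguration` with a `LocationConstraint`. A minimal sketch of a region-aware wrapper follows; the helper name `create_bucket_kwargs` and the `us-west-2` example are assumptions for illustration, not part of the lab.

```python
def create_bucket_kwargs(bucket_name, region="us-east-1"):
    """Build keyword arguments for s3.create_bucket (hypothetical helper)."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the S3 default and must not be passed as a LocationConstraint;
    # every other region has to be named explicitly.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


# Usage with the resource from the cell above:
#   s3.create_bucket(**create_bucket_kwargs(BUCKETNAME, region="us-west-2"))
print(create_bucket_kwargs("ethankoz-bucket"))
print(create_bucket_kwargs("ethankoz-bucket", region="us-west-2"))
```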

In [5]:
# Upload your .py file to S3, reusing the local path as the object key

FILENAME = 'mystuff/myfile.py'
with open(FILENAME, 'rb') as myfile:
    bucket.put_object(Key=FILENAME, Body=myfile)
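
The cell above uploads a single file. If you want everything under `mystuff` on S3, one option is to walk the folder and upload each file under a key that keeps the folder structure. This is a sketch, not part of the lab: `folder_to_keys` and `upload_folder` are hypothetical helpers, and `upload_folder` assumes it is given the `bucket` resource created earlier.

```python
from pathlib import Path


def folder_to_keys(local_dir):
    """Map every file under local_dir to an S3 key that keeps the folder name."""
    root = Path(local_dir)
    pairs = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # S3 keys use forward slashes regardless of the local OS
            key = (Path(root.name) / path.relative_to(root)).as_posix()
            pairs.append((str(path), key))
    return pairs


def upload_folder(bucket, local_dir):
    """Upload each file using a boto3 Bucket resource (assumed to exist)."""
    for local_path, key in folder_to_keys(local_dir):
        bucket.upload_file(local_path, key)


# Usage with the bucket from the cells above:
#   upload_folder(bucket, 'mystuff')
```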

### Launching EMR Cluster

Next, launch an EMR cluster from the terminal (bash).

In [None]:
%%bash 

aws emr create-cluster \
    --name "Spark Cluster" \
    --release-label "emr-6.2.0" \
    --applications Name=Hadoop Name=Hive Name=JupyterEnterpriseGateway Name=JupyterHub Name=Livy Name=Pig Name=Spark Name=Tez \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --region us-east-1 \
    --ec2-attributes '{"KeyName": "vockey"}' \
    --configurations '[{"Classification": "jupyter-s3-conf", "Properties": {"s3.persistence.enabled": "true", "s3.persistence.bucket": "ethankoz-bucket"}}]'
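
If you would rather stay in Python, roughly the same request can be made with boto3's `run_job_flow`. The following is a sketch under stated assumptions rather than a drop-in replacement: the helper names `emr_cluster_request` and `launch_cluster` are hypothetical, and the role names are the ones `--use-default-roles` normally resolves to, which assumes those default roles already exist in your account.

```python
def emr_cluster_request(bucket_name, key_name="vockey"):
    """Build run_job_flow arguments mirroring the CLI command above."""
    apps = ["Hadoop", "Hive", "JupyterEnterpriseGateway", "JupyterHub",
            "Livy", "Pig", "Spark", "Tez"]
    return {
        "Name": "Spark Cluster",
        "ReleaseLabel": "emr-6.2.0",
        "Applications": [{"Name": app} for app in apps],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "Ec2KeyName": key_name,
            # Without this flag the cluster terminates once it has no steps
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumed default role names (see lead-in)
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Configurations": [{
            "Classification": "jupyter-s3-conf",
            "Properties": {
                "s3.persistence.enabled": "true",
                "s3.persistence.bucket": bucket_name,
            },
        }],
    }


def launch_cluster(bucket_name):
    """Submit the request and return the new cluster ID."""
    import boto3  # imported here so the request builder works without AWS access
    emr = boto3.client("emr", region_name="us-east-1")
    response = emr.run_job_flow(**emr_cluster_request(bucket_name))
    return response["JobFlowId"]


# Usage: cluster_id = launch_cluster(BUCKETNAME)
```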

When creating a new cluster, make sure to adjust the security settings to allow for `ssh` access. See `emr_cheatsheet.md` in Week 7 course materials.

#### Method 1: `ssh` Directly

The first way to work with EMR is to directly `ssh` into it, then work with it just like we did for `EC2` (see previous lab on EC2).

Connecting to it:
```
$ ssh -i <FILE PATH TO vockey.pem> hadoop@<EMR-PUBLIC-ADDRESS>
```
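
The `<EMR-PUBLIC-ADDRESS>` placeholder is the public DNS name of the cluster's primary node. You can copy it from the EMR console, or look it up with boto3 as in the sketch below. The helper names are hypothetical, and the cluster ID (of the form `j-...`) is whatever `create-cluster` printed when you launched the cluster.

```python
def primary_public_dns(emr_client, cluster_id):
    """Return the primary node's public DNS name, or None if not assigned yet."""
    cluster = emr_client.describe_cluster(ClusterId=cluster_id)["Cluster"]
    return cluster.get("MasterPublicDnsName")


def wait_and_get_dns(cluster_id, region="us-east-1"):
    """Block until the cluster is up, then return its public DNS name."""
    import boto3  # imported here so the lookup helper can be used with any client
    emr = boto3.client("emr", region_name=region)
    # Built-in EMR waiter that polls until the cluster has finished starting
    emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
    return primary_public_dns(emr, cluster_id)


# Usage: print(wait_and_get_dns("<YOUR-CLUSTER-ID>"))
```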

Uploading a local folder called `mystuff` to EMR:
```
$ scp -i <FILE PATH TO vockey.pem> -r <FILE PATH TO mystuff folder> hadoop@<EMR-PUBLIC-ADDRESS>:/home/hadoop
```

Downloading a folder called `mystuff` from EMR to your current local directory:
```
$ scp -i <FILE PATH TO vockey.pem> -r hadoop@<EMR-PUBLIC-ADDRESS>:/home/hadoop/mystuff .
```
---

After uploading your files to the cluster, you can run Spark jobs from the EMR shell with:
``` 
[EMR] spark-submit mystuff/myfile.py
```
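Alternatively, if your files are saved on S3, pass the S3 path to `spark-submit` instead. Anything after the script path (here, the bucket name) is passed to the script as a command-line argument: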
```
[EMR] spark-submit s3://ethankoz-bucket/mystuff/myfile.py ethankoz-bucket
```
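
The contents of `myfile.py` are not shown in this lab, so the following is a purely hypothetical sketch of how a script submitted this way might read the bucket name from its arguments and write results back to S3. The job itself (summing a range of integers), the `output_path` helper, and the `mystuff/output` prefix are all assumptions for illustration.

```python
import sys


def output_path(bucket, prefix="mystuff/output"):
    """Build the S3 location for results (hypothetical prefix)."""
    return f"s3://{bucket}/{prefix}"


def main(argv):
    # Imported here because pyspark is only available where Spark is installed
    from pyspark.sql import SparkSession

    bucket = argv[1]  # first argument after the script path in spark-submit
    spark = SparkSession.builder.appName("myfile").getOrCreate()

    # Toy job: one column `id` holding 0..999, reduced to a single sum
    df = spark.range(1000)
    total = df.groupBy().sum("id")

    # Write the one-row result to the bucket as Parquet
    total.write.mode("overwrite").parquet(output_path(bucket))
    spark.stop()


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: spark-submit myfile.py <bucket-name>")
    else:
        main(sys.argv)
```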

#### Method 2: Interactive Sessions

Because the cluster was created with JupyterHub, you can also work interactively through a Jupyter server running on EMR. First, open an SSH tunnel to it:
```
$ ssh -i <FILE PATH TO vockey.pem> -NL 9443:localhost:9443 hadoop@<EMR-PUBLIC-ADDRESS>
```
This forwards port 9443 on the cluster to port 9443 on your machine. Open `https://localhost:9443` in your browser and log in with username `jovyan` and password `jupyter`.
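
If the page does not load, it can help to confirm that the tunnel is actually listening on your machine before troubleshooting JupyterHub itself. A minimal sketch using only the standard library; `tunnel_is_open` is a hypothetical helper, and it only checks that something accepts TCP connections on the port, not that JupyterHub is healthy.

```python
import socket


def tunnel_is_open(host="localhost", port=9443, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


print("tunnel open:", tunnel_is_open())
```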