# Starting a Spark cluster on AWS

## 0. Setup (review from [here](https://github.com/gSchool/DSI_Lectures/blob/master/high-performance-python/moses_marsh/aws_lecture.ipynb))

### 0.1 AWS Credentials
- sign in to your AWS console: https://console.aws.amazon.com

- click on your name in the upper right, then click `My Security Credentials`

- A dialog will pop up, asking you to either `Continue to Security Credentials` or `Get started with IAM users`. Click **`Continue to Security Credentials`**. IAM is a way of managing multiple user roles, which we won't worry about.

- click `Access Keys`, then `Create New Access Key`

- `Download Key File` will save the keys in a text file, and `show access key` will display the keys for quick copy/pasting. **This is your only chance to save these keys**. If you lose them (for example, by closing your browser and deleting the downloaded file), you cannot recover them and will have to generate a new pair.

- in your terminal, install the AWS command line tools with `pip install awscli` (yes this is weird that we're installing a command line utility with `pip`, the python package installer. thanks amazon.)
- then type `aws configure`
    - paste your AWS Access Key ID and AWS Secret Access Key when prompted
    - for Default Region Name, enter `us-east-1`
    - for Default Output Format, enter `json` (or leave it as `None`, this doesn't matter to us for now)
- this created a folder, `~/.aws`, containing two files: `config` and `credentials`. You can use these to manage multiple profiles. For now, we're cool.    
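
To see what `aws configure` actually wrote, here is a minimal sketch that lists the profile names in a credentials file with Python's standard `configparser` (the file is in INI format). The function names `profile_names` and `profiles_in_file` are made up for this illustration, and the code only returns section names, never the key values.

```python
import configparser
from pathlib import Path


def profile_names(credentials_text):
    """Return the profile names defined in the text of an AWS credentials file."""
    # interpolation=None so that a '%' in a value is never treated specially
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(credentials_text)
    # Each [section] header is one profile; key values are not returned
    return parser.sections()


def profiles_in_file(path=None):
    """List the profiles in a credentials file (default: ~/.aws/credentials)."""
    path = Path(path) if path else Path.home() / ".aws" / "credentials"
    return profile_names(path.read_text())
```

After a first `aws configure`, calling `profiles_in_file()` should return a list containing `default`.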


### 0.2 Managing buckets with the AWS command line interface

[AWS CLI S3 documentation](https://docs.aws.amazon.com/cli/latest/userguide/using-s3-commands.html)

- `aws s3 ls` to list your buckets
- `aws s3 ls s3://BUCKETNAME/` to list contents of a bucket
- `aws s3 ls s3://BUCKETNAME/FOLDERNAME/` to list contents of a directory in the bucket (the trailing `/` is necessary)
- `aws s3 mb s3://BUCKETNAME` to create a new bucket
- `aws s3 rb s3://BUCKETNAME` to delete a bucket (the bucket must be empty)
    - `aws s3 rb s3://BUCKETNAME --force` to delete a non-empty bucket
- `aws s3 cp LOCALFILE s3://BUCKETNAME/` to upload a local file to a bucket
- `aws s3 cp s3://BUCKETNAME/FILENAME .` to download `FILENAME` to the current directory
    - just like the UNIX `cp`, use the `--recursive` flag for copying directories
    - you can also use `rm` and `mv` the same way
- see the `aws s3 sync` command in the docs above for more examples of how to keep a local directory & a remote bucket directory synchronized
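
As a rough illustration of the `aws s3 sync` command mentioned above, the sketch below builds the command from Python and adds the CLI's `--dryrun` flag by default, so you can see what would be transferred before anything is copied. The helper names `s3_sync_command` and `run` are hypothetical, and actually running the command assumes the AWS CLI is installed and configured.

```python
import shlex
import subprocess


def s3_sync_command(local_dir, bucket, prefix="", dryrun=True):
    """Build the argv for `aws s3 sync` from a local directory to a bucket prefix."""
    # Normalize the target so it always ends with exactly one trailing slash
    target = f"s3://{bucket}/{prefix}".rstrip("/") + "/"
    command = ["aws", "s3", "sync", local_dir, target]
    if dryrun:
        command.append("--dryrun")  # only report what would be transferred
    return command


def run(command):
    """Echo the command, then run it (requires a configured AWS CLI)."""
    print(shlex.join(command))
    return subprocess.run(command, check=True)
```

For example, `run(s3_sync_command("data", "BUCKETNAME", "FOLDERNAME"))` previews a sync of the local `data` directory; pass `dryrun=False` once the preview looks right.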


### 0.3 Starting a single EC2 instance using the AWS Console GUI
- Log in to the AWS console, click `Services`, then, under "Compute", click `EC2`
- In the upper right corner of the page, make sure your region is `N. Virginia`
- Click the blue `Launch Instance` button
- scroll down to the first entry that says `ubuntu` and click `select`
- by default, the `t2.micro` instance (free tier) is selected. leave it, then click `Next: Configure instance details`
- leave `IAM role` as `None`, then click `Next: Add Storage`
- here we can add EBS (think of it as more hard drive space). The default disk size is 8 GB. Change it to 20 (which still qualifies for the free tier).
- click `Next` until you are at the `Configure Security Group` screen. Make sure there is an entry with `Type: SSH`, `Protocol: TCP`, `Port Range: 22`, and `Source: Anywhere`. (If not, create one with `Add Rule`)
- Now click `Review and Launch`, then `Launch`
- This brings up a window asking you to select a secure key pair to access this instance. Select `create a new key pair`, give it a name (for example, `my_first_key`), then click `Download Key Pair`
    - **this is your only chance to download it**. if you lose this file, you'll have to generate a new one, and you'll lose access to any EC2 instances that need the old key pair. 
- Click `launch instance`

##### is there an AWS CLI way to start EC2 instances?
yes, good luck: https://docs.aws.amazon.com/cli/latest/userguide/cli-using-ec2.html
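
As a starting point, here is a sketch that builds an `aws ec2 run-instances` command roughly matching the console steps above. It only prints the command. The AMI ID `ami-PLACEHOLDER` is a stand-in (look up a real Ubuntu image ID for your region), the function name is made up, and the sketch assumes your default security group already allows SSH.

```python
import shlex


def run_instances_command(image_id, key_name, instance_type="t2.micro", count=1):
    """Build the argv for launching EC2 instances with `aws ec2 run-instances`."""
    return [
        "aws", "ec2", "run-instances",
        "--image-id", image_id,            # AMI ID for your region
        "--instance-type", instance_type,  # t2.micro is the free-tier type used above
        "--key-name", key_name,            # key pair name, without the .pem extension
        "--count", str(count),
    ]


# "ami-PLACEHOLDER" is a stand-in, not a real image ID
print(shlex.join(run_instances_command("ami-PLACEHOLDER", "my_first_key")))
```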

### 0.4 Accessing your EC2 instance using SSH
We are going to set up convenient SSH access to your cloud computer
- move the `pem` file from wherever you downloaded it to your `~/.ssh` folder (if this folder doesn't exist, create it)
    - example: `mv ~/Downloads/my_first_key.pem ~/.ssh`
- SSH requires that your key file be accessible only to you, so change the permissions with:
    - `chmod 400 ~/.ssh/my_first_key.pem`
- back on the EC2 dashboard of the AWS web console, click on your instance (you should see it running [here](https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:sort=instanceId))
- copy its public DNS (which looks something like `ec2-52-90-35-125.compute-1.amazonaws.com`)
- in the `~/.ssh` folder, create a file named `config` (you can create a blank file by typing `touch config`) and enter the following text:

```
Host my_first_ec2
    Hostname PUBLICDNS
    User ubuntu
    IdentityFile ~/.ssh/my_first_key.pem

```
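
The two manual steps above (restricting the key file's permissions and writing a `Host` entry) can also be sketched in Python. This is an illustration only: the function names are hypothetical, and the permission check simply tests that no group or other permission bits are set, which is what `chmod 400` guarantees.

```python
import os
import stat


def mode_is_private(mode):
    """True if no group/other permission bits are set (e.g. 0o400 or 0o600)."""
    return mode & 0o077 == 0


def key_is_private(path):
    """Check the permission bits of a key file such as ~/.ssh/my_first_key.pem."""
    mode = stat.S_IMODE(os.stat(os.path.expanduser(path)).st_mode)
    return mode_is_private(mode)


def ssh_host_block(alias, hostname, user, identity_file):
    """Render one Host entry in the format used by ~/.ssh/config."""
    return (
        f"Host {alias}\n"
        f"    HostName {hostname}\n"
        f"    User {user}\n"
        f"    IdentityFile {identity_file}\n"
    )


# PUBLICDNS is the same placeholder used in the config example above
print(ssh_host_block("my_first_ec2", "PUBLICDNS", "ubuntu", "~/.ssh/my_first_key.pem"))
```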

- now we can connect your local terminal to a terminal running on the remote computer with
    - `ssh my_first_ec2`  
- you now have a terminal open to enter shell commands on a remote computer! wow! the following commands are to be run on the remote computer, NOT on a local terminal
- let's install the AWS CLI on this system:
    - `sudo apt update`
    - `sudo apt upgrade`
    - `sudo apt install awscli`
- Try `aws s3 ls`. If this gives you an authentication error, you'll have to copy your AWS keys over with `aws configure`.

### 0.5 Copying a file to a remote computer (EC2 instance)
- To copy a file `myfile.txt` to EC2, use a command like this.
    - `scp myfile.txt my_first_ec2:`
- To copy a file from the EC2 instance to the current directory on the local machine, try
    - `scp my_first_ec2:path/to/remote_file .`
- To copy a directory `mydir` to EC2, use a command like this. 
    - `scp -r mydir my_first_ec2:`

## 1. Starting a cluster using the AWS Console GUI

### 1.1 starting the cluster
- Log in to the AWS console, click Services, then, under "Analytics", click EMR
- In the upper right corner of the page, make sure your region is N. Virginia
- Click the blue "Create Cluster" button
- Use the following settings. Items marked with an asterisk (`*`) indicate changes to the default values that you need to make before launching the cluster.

| *Setting* | *Value* |
| ------- | ----- |
| Cluster name (`*`) | First Spark Cluster |
| Launch Mode | Cluster |
| Release | emr-5.22.0 |
| Applications (`*`) | Spark: Spark 2.4.0 on Hadoop 2.8.5 YARN with Ganglia 3.7.2 and Zeppelin 0.8.1 |
| Instance type | m3.xlarge |
| Number of instances | 3 |
| EC2 key pair (`*`) | spark **`<-- choose a pem key that you have locally in your .ssh folder`** |
| Permissions | Default |
| EMR Role | EMR_DefaultRole |
| EC2 instance profile | EMR_EC2_DefaultRole |

- Click Create Cluster at the bottom of the page. It will take a few minutes for your cluster to launch, so now is a good time to take a 5-minute break.

  - NOTE: Occasionally your cluster will fail to launch with the error message `Terminated with errors: Failed to provision ec2 instances because 'The requested instance profile EMR_AutoScaling_DefaultRole is invalid'`. If this happens, launch the cluster again and it should work.
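
For reference, the console settings in the table map roughly onto an `aws emr create-cluster` call. The sketch below only builds and prints that command. It is an approximation under assumptions, not an export of what the console does: the function name is made up, and `--use-default-roles` is assumed to stand in for the two default roles listed in the table.

```python
import shlex


def create_cluster_command(name, key_name, release="emr-5.22.0",
                           instance_type="m3.xlarge", instance_count=3):
    """Build the argv for `aws emr create-cluster` from the table's settings."""
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release,
        "--applications", "Name=Spark", "Name=Ganglia", "Name=Zeppelin",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--ec2-attributes", f"KeyName={key_name}",  # key pair name, no .pem
        "--use-default-roles",  # EMR_DefaultRole and EMR_EC2_DefaultRole
    ]


print(shlex.join(create_cluster_command("First Spark Cluster", "spark")))
```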

### 1.2 Connecting to the master node of the cluster using SSH
- When the cluster status changes to Waiting, you are ready to begin working with the cluster:
- Find the hostname of your cluster's master node in the EMR console. It should be listed under **Master public DNS**, and look something like `ec2-54-80-24-220.compute-1.amazonaws.com`
- Open your SSH configuration file (e.g. `code ~/.ssh/config`, or use your favorite text editor) and add the following, but substitute your cluster master's host name.

```
# First Spark Cluster
Host sparkparty
  HostName ec2-XX-XXX-X-XX.compute-1.amazonaws.com
  User hadoop
  IdentityFile ~/.ssh/mykey.pem
```
- Note that it is essential to use the username `hadoop` (not `ec2-user` or `ubuntu`) in order to have access to the Spark environment.
- In the [AWS EMR Console](http://console.aws.amazon.com/elasticmapreduce/home), click on the link by **Security groups for Master**, click the row for **ElasticMapReduce-master**, click the **Inbound** tab, and click the **Edit** button. Ensure that there is a rule allowing SSH connections on TCP Port 22, from anywhere. If this rule does not exist, create it and click Save.
- Connect to the cluster using SSH and open a PySpark shell on the cluster:
```
(local terminal): ssh sparkparty
(remote terminal): pyspark
(remote pyspark shell): yelp_business_url = 's3a://learn-assets.galvanize.com/gSchool/ds-curriculum/course-data/spark/yelp_academic_dataset_business.json.gz'
(remote pyspark shell): yelp_business_df = spark.read.json(yelp_business_url)
(remote pyspark shell): yelp_business_df.printSchema()
```
- If the above ran without errors, congratulations, you have now run your first lines of Spark code on a remote computer.
- Now let's shut down this cluster.
    - Return to the [AWS EMR console](http://console.aws.amazon.com/elasticmapreduce/home), select the checkbox to the left of "First Spark Cluster", and click the Terminate button above the list. Goodnight cluster.

## 2. Starting a cluster using the AWS CLI with the provided script

Take a look at the `launch_cluster.sh` script included in the scripts folder (make sure to `cd` into the scripts folder to run this script). The documentation at the top of the script details how to use it. 

```
# Takes three arguments:
#   bucket name - one that has already been created
#   name of key file - without .pem extension
#   number of worker instances
#      ex. bash launch_cluster.sh mybucket mykey 2

# This script assumes that the file bootstrap-emr.sh is 
#   in your current directory.
```


Note that you will need to have the `awscli` set up, including a default region. Do not include the `.pem` extension when passing the name of your key file. This cluster will be launched with a bootstrap action that installs Anaconda on each cluster node, enabling you to use data science libraries on the cluster.

You will use this cluster in today's exercise, for which at least 6 workers are advised. Keep this in mind when choosing the number of workers to pass to `launch_cluster.sh`.
- run the command `bash launch_cluster.sh bucketname mykey 4` from the `scripts` folder, replacing `4` with the number of workers you want
- go to the [AWS EMR console](http://console.aws.amazon.com/elasticmapreduce/home) and click on your cluster
- copy the **Master Public DNS** and paste it into your `~/.ssh/config` file as follows:
```
Host cluck
   HostName ec2-XX-XXX-X-XX.compute-1.amazonaws.com 
   User hadoop
   IdentityFile ~/.ssh/mykey.pem
```
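
Every new cluster gets a new master DNS name, so the `HostName` line has to be updated on each launch. Here is a minimal sketch of doing that edit programmatically. The function `set_hostname` is hypothetical, and it assumes the simple `Host alias` / indented `HostName value` layout shown above (no `Match` blocks or `key=value` syntax).

```python
def set_hostname(config_text, alias, new_hostname):
    """Return config_text with the HostName of the `Host alias` entry replaced."""
    lines = config_text.splitlines()
    in_block = False
    for i, line in enumerate(lines):
        words = line.split()
        if not words:
            continue  # blank line
        keyword = words[0].lower()
        if keyword == "host":
            # A Host line starts a new entry; note whether it is the one we want
            in_block = alias in words[1:]
        elif in_block and keyword == "hostname":
            # Keep the original indentation of the line being replaced
            indent = line[: len(line) - len(line.lstrip())]
            lines[i] = f"{indent}HostName {new_hostname}"
    return "\n".join(lines) + "\n"
```

To apply it, read `~/.ssh/config` into a string, pass it through `set_hostname` with the alias `cluck` and the new DNS name, and write the result back (keeping a backup copy is a good idea).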

## 3. Running Jupyter Lab on the cluster

- use `scp` to copy the script `jupyspark-emr.sh` to your master node
    - `scp jupyspark-emr.sh cluck:`
- open a terminal on the remote machine
    - `ssh cluck`
- in that terminal, start a Jupyter Lab server with
    - `bash jupyspark-emr.sh`
- **On your local machine**, run the following command to open an SSH tunnel from port 48888 on your localhost to port 48888 on the master node, where your notebook server is now running.
    - `ssh -NfL 48888:localhost:48888 cluck`
- Now you can access the remote Jupyter server **from your browser** by going to the URL `localhost:48888`
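
If the browser cannot reach the notebook, it helps to check whether the tunnel is actually up. The sketch below builds the same `ssh` tunnel command and tests whether anything is accepting connections on the local port. Both function names are made up for this example, and the port check only shows that something is listening, not that it is Jupyter.

```python
import socket


def tunnel_command(host_alias, port=48888):
    """Build the argv that forwards local `port` to the same port on the remote host."""
    # -N: run no remote command, -f: go to the background, -L: local port forward
    return ["ssh", "-NfL", f"{port}:localhost:{port}", host_alias]


def port_is_open(port, host="localhost", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or name resolution failure
        return False
```

With the tunnel running, `port_is_open(48888)` should return `True`; if it returns `False`, rerun the `ssh -NfL` command.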

### 3.1: tmux
Many processes are tied to your terminal. If you `ssh` into your remote machine and run a process in that terminal, that process will stop if you suddenly lose your connection.

`tmux` is a tool for "terminal multiplexing". It's great for managing many processes in many terminals. Here's how to use `tmux` to start a process that keeps running after you disconnect from your `ssh`ed terminal:

- `ssh` into your remote machine
- start a tmux session with `tmux new -s some_name`
- start your process (notebook or script or whatever)
- type `<ctrl>-b d` to detach (now you are back in your `ssh`ed terminal)
- exit or shut down or go to sleep or whatever
- `ssh` back in to your remote machine
- type `tmux a -t some_name` to check on that process
- [handy tmux reference](https://gist.github.com/MohamedAlaa/2961058)
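
The steps above can be condensed by starting the session already detached. The sketch below builds the corresponding `tmux` commands; the helper names are hypothetical, and the commands are meant to be run on the remote machine, where `tmux` is assumed to be installed.

```python
import shlex


def detached_session_command(name, shell_command):
    """Build the argv that starts `shell_command` in a new, detached tmux session."""
    # -d: do not attach to the new session, -s: session name
    return ["tmux", "new-session", "-d", "-s", name, shell_command]


def attach_command(name):
    """Build the argv that re-attaches to an existing session by name."""
    return ["tmux", "attach", "-t", name]


print(shlex.join(detached_session_command("some_name", "bash jupyspark-emr.sh")))
print(shlex.join(attach_command("some_name")))
```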