# Table of contents

**1. AWS EMR: Introduction**

**2. Apache Spark on EMR**

**3. AWS S3: Data Storage**

**4. Tutorial**

# AWS EMR: Introduction

## What is AWS EMR?

- Amazon EMR stands for Amazon Elastic MapReduce.

- Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.

- Amazon EMR securely and reliably handles a broad set of big data use cases, including log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.

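- With AWS EMR you can run a Spark job on a managed cluster.

- AWS EMR features: auto scaling, auto termination, logs pushed to S3, EMR events, and handling of hardware failures.

- Boto3: clusters can be launched programmatically with the `run_job_flow` call.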
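
The Run Job Flow call from the last bullet can be sketched with Boto3 roughly as follows. This is a minimal sketch rather than a recommended setup: the cluster name, the log path `s3://grey-atom/logs/`, the release label, the instance type and count, and the default role names (`EMR_DefaultRole`, `EMR_EC2_DefaultRole`) are all assumptions and need to be replaced with values that exist in your account.

```python
def build_job_flow_request(name, log_uri, release_label="emr-5.30.0",
                           instance_type="m5.xlarge", instance_count=3):
    """Build keyword arguments for Boto3's EMR run_job_flow call.

    Every default here is an illustrative assumption, not a recommendation.
    """
    return {
        "Name": name,
        "LogUri": log_uri,  # EMR pushes cluster logs to this S3 path
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # False lets the cluster terminate once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
    }


def launch_cluster(request, region_name="us-east-1"):
    """Send the request to EMR; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    emr = boto3.client("emr", region_name=region_name)
    response = emr.run_job_flow(**request)
    return response["JobFlowId"]


# Hypothetical cluster name and log location, for illustration only.
request = build_job_flow_request("pagerank-cluster", "s3://grey-atom/logs/")
```

Calling `launch_cluster(request)` would start the cluster and return its ID. Building the request separately makes it possible to inspect the configuration before anything is launched.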

## Apache Spark on AWS EMR

<img src = "images/Spark-Diagram.jpg">

## Data Storage Layer: Amazon S3

### AWS Simple Storage Service

Where is data stored?

- Scales: just keep adding files, and it will never fill up.

- Upload and download your data through SSL-encrypted endpoints.

- Provides multiple options for encrypting data at rest.

- Low cost: about $0.023 per GB per month for standard storage (pricing varies by region and storage tier).

- Reading data in a Spark application is as simple as calling:

`sc.textFile("s3n://<bucketname>")`


## How does AWS EMR access data from the S3 bucket?

<img src = "images/spark2.png">

## Walkthrough: Running a Spark job on EMR

We'll solve the PageRank problem using the Spark script prepared in the **Data Processing Layer** concept.

## Data storage


**Spark Codebase:** 

AWS EMR lets you submit Spark applications as scripts stored on AWS S3. You can keep the whole codebase on S3 and give the cluster access to it.

Command to launch a Spark application:

`spark-submit <s3_path>`

I have created an S3 bucket named **grey-atom**, and the Spark scripts are stored in this bucket. Here the code resides in the script `job.py`.

**job.py**

```python
from operator import add

import pyspark
from pyspark import SparkConf

if __name__ == '__main__':

    # Configure the application; Spark configuration values are strings.
    conf = SparkConf().setAppName("Pagerank")
    conf.set('spark.executor.instances', '2')
    conf.set('spark.executor.cores', '1')

    conf.set('spark.dynamicAllocation.enabled', 'true')
    # conf.set('spark.executor.memory', '6')

    # conf.set('spark.yarn.executor.memoryOverhead', '4096')
    sc = pyspark.SparkContext(conf=conf)

    # Ship the helper module from S3 to the executors, then import it.
    sc.addPyFile('s3://grey-atom/support.py')
    from support import parseNeighbors, computeContribs

    # Read the input and split each line on tabs.
    rdd = sc.textFile("s3n://grey-atom/Input/")
    rdd = rdd.map(lambda line: line.split('\t'))

    # Group neighbors per URL and cache the result, since every iteration reuses it.
    url_links_rdd = rdd.map(parseNeighbors).distinct().groupByKey().cache()

    # Start every URL with a rank of 1.0.
    ranks = url_links_rdd.map(lambda url_neighbors: (url_neighbors[0], 1.0))
    no_of_iteration = 2
    for iteration in range(no_of_iteration):
        # Calculates URL contributions to the rank of other URLs.
        contribs = url_links_rdd.join(ranks).flatMap(
            lambda url_urls_rank: computeContribs(url_urls_rank[1][0], url_urls_rank[1][1]))
        # Recalculates URL ranks from the summed contributions.
        ranks = contribs.reduceByKey(add).mapValues(lambda rank: rank * 0.85 + 0.15)

    # Cap the output at 99 partitions (one file each), then write the ranks to S3.
    ranks = ranks.coalesce(99)
    ranks.saveAsTextFile('s3n://grey-atom/output')

```
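
`job.py` imports `parseNeighbors` and `computeContribs` from `support.py`, which is not shown here. The sketch below is one plausible version of those helpers, modeled on the PageRank example that ships with Apache Spark and adapted to the already split lines that `job.py` passes in. The function bodies, the `pagerank_local` helper, and the tiny in-memory graph are assumptions for illustration. The loop imitates the RDD operations in plain Python so the logic can be checked without a cluster.

```python
from collections import defaultdict


def parseNeighbors(urls):
    """Turn an already split line [url, neighbor] into a (url, neighbor) pair."""
    return urls[0], urls[1]


def computeContribs(urls, rank):
    """Yield (neighbor, share) pairs; the rank is split evenly among neighbors."""
    urls = list(urls)
    for url in urls:
        yield url, rank / len(urls)


def pagerank_local(lines, iterations=2):
    """Plain-Python stand-in for the RDD pipeline in job.py (illustrative only)."""
    # Equivalent of map(split) -> map(parseNeighbors) -> distinct() -> groupByKey()
    pairs = {parseNeighbors(line.split("\t")) for line in lines}
    links = defaultdict(list)
    for url, neighbor in sorted(pairs):
        links[url].append(neighbor)

    ranks = {url: 1.0 for url in links}
    for _ in range(iterations):
        contribs = defaultdict(float)
        # Equivalent of join(ranks).flatMap(computeContribs): URLs without a
        # current rank drop out of the join, as they would in Spark.
        for url, neighbors in links.items():
            if url in ranks:
                for neighbor, share in computeContribs(neighbors, ranks[url]):
                    contribs[neighbor] += share  # reduceByKey(add)
        ranks = {url: rank * 0.85 + 0.15 for url, rank in contribs.items()}
    return ranks


# Made-up input: two pages linking to each other, with one duplicate line.
ranks = pagerank_local(["a\tb", "b\ta", "a\tb"])
```

In this symmetric two-page graph each page passes its whole rank to the other, so both ranks stay at about 1.0 after every iteration.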

**Input**

The data to process is also stored in the same S3 bucket, in the `Input` folder.

This is shown below:

<img src = "images/s3.png">

## AWS EMR

**1. Create a cluster**

Go to the AWS EMR console and click on `Create Cluster`. Then follow the steps shown below to launch your cluster:

**Step 1:**

<img src = "images/one.png">
<img src = "images/two.png">

**2. Submit a step**

Once the cluster is in the ready state, we can go further and launch a Spark application. Click on `Add step`, where we only have to provide the `spark-submit` command. For example:

<img src = "images/step-submit.png">
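
The same step can also be added programmatically. The sketch below builds a step definition that runs `spark-submit` through EMR's `command-runner.jar` and passes it to Boto3's `add_job_flow_steps`. The step name, the `--deploy-mode cluster` flag, the `ActionOnFailure` value, and the exact script key `s3://grey-atom/job.py` are assumptions. The cluster ID has to come from your own cluster.

```python
def build_spark_step(script_s3_path, name="PageRank"):
    """Build an EMR step that runs spark-submit on a script stored in S3."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster running if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's generic command launcher
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_path],
        },
    }


def submit_step(cluster_id, step, region_name="us-east-1"):
    """Add the step to a running cluster; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    emr = boto3.client("emr", region_name=region_name)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


# Assumed script location inside the bucket used in this walkthrough.
step = build_spark_step("s3://grey-atom/job.py")
```

`submit_step(cluster_id, step)` would return the ID of the new step, which can then be looked up in the console.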

While the Spark application is running, you can monitor the stages, jobs, and executors that the cluster launches to execute it.

**Click on Application history:**

You'll see the jobs, the number of stages and tasks, and various other metrics associated with the Spark application.

<img src = "images/app1.png">

**Next, click on the job to get more details about its stages:**

<img src = "images/app2.png">

**Output**

In the Spark script, I saved the output to the S3 bucket `grey-atom` using `ranks.saveAsTextFile(s3_path)`. Once the step is completed, the output appears in the bucket under the `output` folder.

<img src = "images/out.png">

**Output pushed from Spark Application**

<img src = "images/output.png">

As you can see, the filenames look like `part-00000`, `part-00001`, and so on. Spark assigns these names itself: each partition's output is written to S3 as a separate file, and the number in the name is the partition index.
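
To make the naming concrete, the sketch below reproduces the `part-NNNNN` pattern (the partition index, zero-padded to five digits) and picks the part files out of a list of object keys. The key list is made-up sample data, including the `_SUCCESS` marker that Hadoop output committers typically write. It is not taken from the actual bucket.

```python
import re

# A part file ends in "part-" followed by a five-digit partition index.
PART_PATTERN = re.compile(r"part-(\d{5})$")


def part_file_name(partition_index):
    """Return the file name used for the given partition index."""
    return f"part-{partition_index:05d}"


def part_files(keys):
    """Return (partition_index, key) pairs for part files, sorted by index."""
    found = []
    for key in keys:
        match = PART_PATTERN.search(key)
        if match:
            found.append((int(match.group(1)), key))
    return sorted(found)


# Hypothetical object keys, listed out of order on purpose.
sample_keys = ["output/_SUCCESS", "output/part-00001", "output/part-00000"]

print(part_file_name(7))        # part-00007
print(part_files(sample_keys))  # [(0, 'output/part-00000'), (1, 'output/part-00001')]
```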

**So this was all about launching EMR clusters to run Spark Applications using AWS S3 as a data storage option.**