# EMR Serverless Workshop

## Environment Setup

!aws cloudformation create-stack --stack-name EMRLabv1 \
            --template-url https://wysde-assets.s3.us-east-1.amazonaws.com/cfn/emr-serverless-cfn-v2.json \
            --capabilities CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND

The CloudFormation stack takes roughly 5 minutes to complete.
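
If you would rather script the stack creation and wait for it to finish, the following is a minimal sketch using boto3. It assumes boto3 is installed and AWS credentials are configured; the function name `create_stack_and_wait` is hypothetical, and the client is passed in so the logic can be exercised without touching AWS.

```python
def create_stack_and_wait(cfn, stack_name, template_url, capabilities):
    """Create a CloudFormation stack and block until creation finishes."""
    response = cfn.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Capabilities=list(capabilities),
    )
    # The built-in waiter polls the stack until it reaches CREATE_COMPLETE
    # and raises an error if creation fails or times out.
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return response["StackId"]


# Example usage (needs real AWS credentials, so it is left commented out):
# import boto3
# create_stack_and_wait(
#     boto3.client("cloudformation"),
#     "EMRLabv1",
#     "https://wysde-assets.s3.us-east-1.amazonaws.com/cfn/emr-serverless-cfn-v2.json",
#     ["CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
# )
```

Passing the client as an argument keeps the sketch easy to test with a stub and makes the region and credentials an explicit choice of the caller.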

## Submit Spark jobs to EMR Serverless

### Option 1 - Submit Spark jobs from EMR Studio

Before we submit a job to EMR Serverless, we need to create an application. You can create one or more applications that use open-source analytics frameworks. To create an application, you specify the open-source framework that you want to use (for example, Apache Spark or Apache Hive), the Amazon EMR release for the open-source framework version (for example, Amazon EMR release 6.4, which corresponds to Apache Spark 3.1.2), and a name for your application.

1. In the AWS Console, search for EMR under Services. In the EMR console, select EMR Serverless, or go directly to the EMR Serverless Console.
2. Click Create and launch EMR Studio.
3. To create an EMR Serverless application, choose Create application.
4. Give your application a name. In the rest of this lab we use **my-serverless-application** as the application name. Choose Spark as the Type and emr-6.7.0 as the release version. Choose default settings, then click Create application.
5. You should see the status Created.
6. Now you are ready to submit a job. To do this, choose Submit job.
7. In the Submit job screen, enter the details below:
    | Field                  | Value |
    | ---------------------- | ----- |
    | Name                   | word_count |
    | Runtime role           | EMRServerlessS3RuntimeRole |
    | Script location S3 URI | s3://wysde-assets/src/py/emr/wordcount.py (this is not a public path, so you might get an access denied error in EMR) |
    | Script arguments       | `["s3://YOUR_BUCKET/wordcount_output/"]` Replace YOUR_BUCKET with the S3 bucket name that you noted from the CloudFormation stack Outputs section. |
8. The Application status shows as Starting and the Job run status shows as Pending. First-time job submissions take a little longer, since the application needs to be prepared and started and capacity needs to be provisioned before the application is ready to accept jobs. If the application is already in Started status, jobs start executing as soon as they are submitted.
9. Once the job completes, the Run status shows Success.
10. You can now verify that the job has written its output to the S3 path that we provided as an argument when submitting the job. Go to the S3 path and confirm that the EMR Serverless application created CSV files there.
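
The application from steps 3 to 5 can also be created programmatically. The sketch below assumes boto3 with the `emr-serverless` client and configured AWS credentials; the helper names are hypothetical, and the default name and release label simply mirror the values used in this lab.

```python
def create_spark_application(emr, name="my-serverless-application",
                             release_label="emr-6.7.0"):
    """Create an EMR Serverless Spark application and return its ID."""
    response = emr.create_application(
        name=name,
        releaseLabel=release_label,
        type="SPARK",  # use "HIVE" for a Hive application
    )
    return response["applicationId"]


def application_state(emr, application_id):
    """Return the current state string, for example CREATED or STARTED."""
    return emr.get_application(applicationId=application_id)["application"]["state"]


# Example usage (needs real AWS credentials, so it is left commented out):
# import boto3
# emr = boto3.client("emr-serverless")
# app_id = create_spark_application(emr)
# print(app_id, application_state(emr, app_id))
```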

### Option 2 - Submit Spark jobs from CLI

1. In the AWS Console, search for Cloud9 under Services and choose Cloud9, or go directly to the Cloud9 Console.
2. The workshop CloudFormation template should have already created a Cloud9 environment named Cloud9 for EMRServerless Workshop.
3. Choose Open IDE to launch the Cloud9 IDE.
4. Open a new terminal tab in the main work area.
5. Go to the CloudFormation console and click the stack emrserverless-workshop, or look for the description EMR Serverless workshop. (If you created the stack with the command in Environment Setup, it is named EMRLabv1.)
6. Go to the Outputs tab.
7. Export the output values in the Cloud9 terminal:
   1. `export JOB_ROLE_ARN=<<EMRServerlessS3RuntimeRoleARN>>`
   2. `export S3_BUCKET=s3://<<YOUR_BUCKET>>`
8. Next, export the application ID of the EMR Serverless application in the Cloud9 terminal.
9. Go to the EMR Serverless console, click the application, copy its application ID, and use it in the terminal as below:
   1. `export APPLICATION_ID=<<application_id>>`
10. Submit the job using the following command:
    ```
    aws emr-serverless start-job-run --application-id ${APPLICATION_ID} --execution-role-arn ${JOB_ROLE_ARN} --name "Spark-WordCount-CLI" --job-driver '{
            "sparkSubmit": {
                "entryPoint": "s3://wysde-assets/src/py/emr/wordcount.py",
                "entryPointArguments": [
            "'"$S3_BUCKET"'/wordcount_output_cli/"
            ],
                "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1 --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            }
        }'
    ```
11. If the AWS CLI returns a "command not found" error, make sure you are running the latest version of the AWS CLI.
12. Check the job status on the EMR console.
13. Navigate to your S3 bucket and check the output under the wordcount_output_cli folder.
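
The same submission can be sketched in Python with boto3 instead of the AWS CLI. This is an illustration under assumptions, not part of the workshop material: the helper names and the polling interval are hypothetical, and `s3_bucket_uri` is expected in the same form as `$S3_BUCKET` above (including the `s3://` prefix).

```python
import time

TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def build_spark_job_driver(s3_bucket_uri):
    """Mirror the --job-driver JSON from the CLI command as a Python dict."""
    return {
        "sparkSubmit": {
            "entryPoint": "s3://wysde-assets/src/py/emr/wordcount.py",
            "entryPointArguments": [f"{s3_bucket_uri}/wordcount_output_cli/"],
            "sparkSubmitParameters": (
                "--conf spark.executor.cores=1 --conf spark.executor.memory=4g "
                "--conf spark.driver.cores=1 --conf spark.driver.memory=4g "
                "--conf spark.executor.instances=1"
            ),
        }
    }


def start_wordcount_job(emr, application_id, job_role_arn, s3_bucket_uri):
    """Submit the word count job and return the job run ID."""
    response = emr.start_job_run(
        applicationId=application_id,
        executionRoleArn=job_role_arn,
        name="Spark-WordCount-CLI",
        jobDriver=build_spark_job_driver(s3_bucket_uri),
    )
    return response["jobRunId"]


def wait_for_job(emr, application_id, job_run_id, poll_seconds=15, sleep=time.sleep):
    """Poll the job run until it reaches a terminal state and return that state."""
    while True:
        job_run = emr.get_job_run(applicationId=application_id, jobRunId=job_run_id)
        state = job_run["jobRun"]["state"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)


# Example usage (needs real AWS credentials, so it is left commented out):
# import boto3, os
# emr = boto3.client("emr-serverless")
# run_id = start_wordcount_job(emr, os.environ["APPLICATION_ID"],
#                              os.environ["JOB_ROLE_ARN"], os.environ["S3_BUCKET"])
# print(wait_for_job(emr, os.environ["APPLICATION_ID"], run_id))
```

The Glue catalog setting from the CLI command is omitted here for brevity; it can be appended to `sparkSubmitParameters` in the same way.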



### Verify Spark job output data using Athena

Now let us verify that the word_count output written by our job contains the correct data. To do this, we need to crawl the job output data and store its schema in the Glue Data Catalog.

1. In the AWS Console, search for Glue under Services and select AWS Glue, or go directly to the Glue Console.
2. In the Glue console, click Crawlers.
3. Click Add crawler.
4. Give the Glue crawler a name. In this lab we use word-count-output-crawler as the crawler name. Click Next.
5. In the Add a data store screen, provide s3://$S3_BUCKET/wordcount_output as the path. Make sure you replace $S3_BUCKET with your S3 bucket name. This is the data written as output by EMR Serverless, which we are going to crawl. Click Next.
6. To crawl the data, the Glue crawler needs permission to read the S3 bucket and to create a Glue database and table. For this, we need an IAM service role for Glue with the right permissions. Choose Create an IAM role and provide emrserverless as the suffix for the role name. Click Next.
7. In the crawler's output screen, choose Add database. Give emrserverlessdb as the database name.
8. Choose Run on demand as the Frequency for the crawler. Click Create.
9. The new Glue database shows up. Click Next. Review the choices you have made and click Next again.
10. We have created the crawler, but it is not running yet. Click Run it now to start it.
11. After less than a minute, you should get a message on the console that 1 table was created.
12. In the left navigation menu, select Tables to see that a new table called wordcount_output was created.
13. Open the Athena query editor and run the following query: `SELECT * FROM wordcount_output LIMIT 10;`
14. You can now see the word count data generated by the EMR Serverless job.
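
The crawler and the verification query can also be driven from Python. The following is a minimal sketch that assumes boto3 with the Glue and Athena clients; the helper names, the `role` argument (the IAM role created in step 6), and `results_uri` (an S3 location where Athena may write query results) are assumptions to be replaced with your own values.

```python
def create_and_start_crawler(glue, bucket_name, role,
                             crawler_name="word-count-output-crawler",
                             database="emrserverlessdb"):
    """Create the Glue database and crawler for the job output, then run it."""
    # Raises an AlreadyExistsException if the database was created earlier.
    glue.create_database(DatabaseInput={"Name": database})
    glue.create_crawler(
        Name=crawler_name,
        Role=role,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": f"s3://{bucket_name}/wordcount_output"}]},
    )
    glue.start_crawler(Name=crawler_name)


def start_preview_query(athena, results_uri, database="emrserverlessdb"):
    """Start the preview query in Athena and return the query execution ID."""
    response = athena.start_query_execution(
        QueryString="SELECT * FROM wordcount_output LIMIT 10",
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": results_uri},
    )
    return response["QueryExecutionId"]


# Example usage (needs real AWS credentials, so it is left commented out):
# import boto3
# create_and_start_crawler(boto3.client("glue"), "YOUR_BUCKET", "YOUR_GLUE_ROLE")
# start_preview_query(boto3.client("athena"), "s3://YOUR_BUCKET/athena-results/")
```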



## Submit Hive jobs from EMR Studio

1. On the EMR Serverless Console, click Create application. Enter hive-serverless as the name, select Hive as the Type, and choose the latest release version.
2. Under Application setup options, choose default settings and click Create application. You will notice that the application is created and shows Starting.
3. Once the application is started, we can submit Hive jobs to it. Click Submit job.
4. Enter the details below:
  ```
  Name	Hive-Serverless-Console
  Runtime role	EMRServerlessS3RuntimeRole
  Initialization Script Location S3 URI	s3://wysde-assets/src/sql/create_taxi_trip.sql
  Script location S3 URI	s3://wysde-assets/src/sql/count.sql
  ```
5. As part of the initialization script, we create a Hive table for the New York taxi trip details:
   ```sql
  CREATE EXTERNAL TABLE if not exists `nytaxitrip`(
    `vendorid` bigint, 
    `tpep_pickup_datetime` string, 
    `tpep_dropoff_datetime` string, 
    `passenger_count` bigint, 
    `trip_distance` double, 
    `ratecodeid` bigint, 
    `store_and_fwd_flag` string, 
    `pulocationid` bigint, 
    `dolocationid` bigint, 
    `payment_type` bigint, 
    `fare_amount` double, 
    `extra` double, 
    `mta_tax` double, 
    `tip_amount` double, 
    `tolls_amount` double, 
    `improvement_surcharge` double, 
    `total_amount` double, 
    `congestion_surcharge` string)
  ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ',' 
  STORED AS INPUTFORMAT 
    'org.apache.hadoop.mapred.TextInputFormat' 
  OUTPUTFORMAT 
    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION
  's3://wysde-datasets/tripdata/';
   ```
6. Once the table is created by Hive, we want to run a count query to see the number of records in this table:
  ```sql
  SET hive.cli.print.header=true;
  SET hive.query.name=TaxiTrips;

  Select count(*) as count from nytaxitrip;
  ```
7. Under Job configuration, replace DOC-EXAMPLE_BUCKET with your S3 bucket name.
8. Under Additional settings, select Upload logs to your Amazon S3 bucket and add your S3 bucket path: <<S3_Bucket/logs>>.
9. Click Submit job.
10. Once the job has completed successfully, go to the Athena console to verify that the table was created and query it to check the data.
11. Click the three dots next to the table name and click Preview table.
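
For reference, the Hive job submitted through the console above roughly corresponds to the following `start_job_run` request. This is a sketch under assumptions: the helper name is hypothetical, and the `hive/scratch`, `hive/warehouse`, and `logs` prefixes are illustrative locations inside your own bucket rather than values confirmed by the workshop.

```python
def build_hive_job_request(application_id, job_role_arn, bucket_name):
    """Build keyword arguments for the emr-serverless start_job_run call."""
    return {
        "applicationId": application_id,
        "executionRoleArn": job_role_arn,
        "name": "Hive-Serverless-Console",
        "jobDriver": {
            "hive": {
                # Runs first and creates the nytaxitrip table.
                "initQueryFile": "s3://wysde-assets/src/sql/create_taxi_trip.sql",
                # The main query, here the row count.
                "query": "s3://wysde-assets/src/sql/count.sql",
            }
        },
        "configurationOverrides": {
            "applicationConfiguration": [
                {
                    "classification": "hive-site",
                    "properties": {
                        # Assumed prefixes inside your bucket.
                        "hive.exec.scratchdir": f"s3://{bucket_name}/hive/scratch",
                        "hive.metastore.warehouse.dir": f"s3://{bucket_name}/hive/warehouse",
                    },
                }
            ],
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {"logUri": f"s3://{bucket_name}/logs"}
            },
        },
    }


# Example usage (needs real AWS credentials, so it is left commented out):
# import boto3
# emr = boto3.client("emr-serverless")
# emr.start_job_run(**build_hive_job_request("APPLICATION_ID", "JOB_ROLE_ARN", "YOUR_BUCKET"))
```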

## Orchestration using MWAA

Amazon EMR Serverless is supported by the following Airflow operators.

**EmrServerlessCreateApplicationOperator** - Operator to create an EMR Serverless application, with the following parameters:

- release_label: The EMR release version associated with the application.
- job_type: The type of application you want to start, such as Spark or Hive.
- wait_for_completion: If true, wait for the application to start before returning. Defaults to True.
- client_request_token: The client idempotency token of the application to create. Its value must be unique for each request.
- aws_conn_id: The AWS connection to use.

**EmrServerlessStartJobOperator** - Operator to start an EMR Serverless job, with the following parameters:

- application_id: ID of the EMR Serverless application to run the job on.
- execution_role_arn: ARN of the role used to perform the action.
- job_driver: Driver that the job runs on.
- configuration_overrides: Configuration specifications to override existing configurations.
- client_request_token: The client idempotency token of the request. Its value must be unique for each request.
- wait_for_completion: If true, waits for the job to start before returning. Defaults to True.
- aws_conn_id: The AWS connection to use.
- config: Extra configuration for the job.

**EmrServerlessDeleteApplicationOperator** - Operator to delete an EMR Serverless application, with the following parameters:

- application_id: ID of the EMR Serverless application to delete.
- wait_for_completion: If true, wait for the application to be deleted before returning. Defaults to True.
- aws_conn_id: The AWS connection to use.
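
To show how the three operators fit together, here is a minimal DAG sketch. It is not the workshop's example_emr_serverless.py: the DAG ID, task IDs, application name, and output folder are hypothetical, and the import path assumes a version of the Amazon provider package that ships these operators in `airflow.providers.amazon.aws.operators.emr`, which may differ from what the workshop's requirements.txt installs.

```python
def spark_job_driver(s3_bucket):
    """Build the job_driver dict passed to EmrServerlessStartJobOperator."""
    return {
        "sparkSubmit": {
            "entryPoint": "s3://wysde-assets/src/py/emr/wordcount.py",
            # Hypothetical output folder, chosen only for this sketch.
            "entryPointArguments": [f"s3://{s3_bucket}/wordcount_output_airflow/"],
        }
    }


def build_dag(job_role_arn, s3_bucket):
    """Create, use, and delete an EMR Serverless application in one DAG."""
    # Imported lazily so spark_job_driver works without Airflow installed.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import (
        EmrServerlessCreateApplicationOperator,
        EmrServerlessDeleteApplicationOperator,
        EmrServerlessStartJobOperator,
    )
    from airflow.utils.trigger_rule import TriggerRule

    with DAG(
        dag_id="emr_serverless_sketch",
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,  # run only when triggered manually
        catchup=False,
    ) as dag:
        create_app = EmrServerlessCreateApplicationOperator(
            task_id="create_app",
            release_label="emr-6.6.0",
            job_type="SPARK",
            config={"name": "airflow-sketch"},
        )
        start_job = EmrServerlessStartJobOperator(
            task_id="start_job",
            application_id=create_app.output,  # application ID passed via XCom
            execution_role_arn=job_role_arn,
            job_driver=spark_job_driver(s3_bucket),
        )
        delete_app = EmrServerlessDeleteApplicationOperator(
            task_id="delete_app",
            application_id=create_app.output,
            trigger_rule=TriggerRule.ALL_DONE,  # clean up even if the job fails
        )
        create_app >> start_job >> delete_app
    return dag
```

Deleting the application with `ALL_DONE` is a design choice for this sketch: it avoids leaving an application behind when the job fails. If you reuse a long-lived application, drop the create and delete tasks and pass its ID directly.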

### MWAA Setup

Before scheduling EMR Serverless jobs using Amazon MWAA, you need to create the required AWS resources. To do this, we provide an AWS CloudFormation template that creates a stack containing the resources. When you create the stack, AWS creates the following resources in your account:

- A VPC network for the Amazon MWAA environment, with the following resources:
  - A VPC with a pair of public and private subnets spread across two Availability Zones.
  - An internet gateway, with a default route on the public subnets.
  - A pair of NAT gateways (one in each Availability Zone), and default routes for them in the private subnets.
  - Amazon S3 gateway VPC endpoints and EMR interface VPC endpoints in the private subnets in two Availability Zones.
  - A security group to be used by the Amazon MWAA environment that only allows local inbound traffic and all outbound traffic.
- An Amazon MWAA execution IAM role with permissions to:
  - Launch an EMR cluster
  - Access EMR Studio
  - Access the required S3 buckets
  - Access the data catalog and monitoring tools
- An Amazon Simple Storage Service (Amazon S3) bucket that meets the following Amazon MWAA requirements:
  - The bucket is in the same AWS Region where you create the MWAA environment.
  - The bucket name is globally unique.
  - Bucket versioning is enabled.
- A folder named dags in the same bucket to store DAGs.
- An EMR Serverless application with the following attributes:
  - Amazon EMR release version 6.6.0
  - Apache Spark as the runtime used within the application
- An Amazon MWAA environment running Airflow version 2.2.2, with the mw1.small environment class and a maximum worker count of 1.
  - For monitoring, we choose to publish environment performance to CloudWatch Metrics.
  - For the Airflow logging configuration, we choose to send only the task logs and use log level INFO.

!aws cloudformation create-stack --stack-name EMR-Serverless-Orchestration \
            --template-url https://wysde-assets.s3.us-east-1.amazonaws.com/cfn/mwaa_emr.yml \
            --capabilities CAPABILITY_NAMED_IAM

The CloudFormation stack takes roughly 30-40 minutes to complete.

### Submitting Spark job

In this section we set up the Managed Apache Airflow environment to handle EMR Serverless jobs. We then upload an Airflow DAG for a Spark job and run it in the Amazon MWAA environment.

1. Go to the stack Outputs tab and copy the Value for the S3Bucket Key.
2. Upload the requirements.txt file to the S3 bucket.
3. Navigate to the Managed Apache Airflow Console.
4. You should be able to see your MWAA environment.
5. Click the mwaa-emr-serverless environment, and in the window that opens click the Edit button.
6. In the Edit window, go to the section DAG code in Amazon S3, look for Requirements file - optional, and click the Browse S3 button next to it.
7. Select the requirements.txt file and click Choose.
8. Keep the rest as default and click Next.
9. In the next window, Configure advanced settings, leave everything as default and click Next at the bottom of the screen.
10. In the Review and Save window, click Save.
11. Go back to Airflow environments and wait for the status to change from Updating to Available.
12. The update takes roughly 8-10 minutes to complete.
13. Go to the stack Outputs tab and copy the Values for the ApplicationID, JobRoleArn, and S3Bucket Keys.
14. Open the DAG file example_emr_serverless.py in a text editor and validate its contents. Replace the values for ApplicationID, JobRoleArn, and S3Bucket in the appropriate section of the code with the values you copied from the CloudFormation stack.
15. Upload the Python script to the dags folder in the S3 bucket, as described in the next three steps.
16. Navigate to the S3 console.
17. Search for the S3 bucket using the Value you copied from the Outputs section of the CloudFormation stack for the S3Bucket Key.
18. Open the dags folder inside the S3 bucket and upload the example_emr_serverless.py script.
19. Trigger the job in the Apache Airflow UI, as described in the following steps.
20. Navigate to the Managed Apache Airflow Console.
21. Click the Open Airflow UI link for the Managed Apache Airflow environment mwaa-emr-serverless.
22. You should see the DAG example_emr_serverless_job in the Apache Airflow UI. It may take 2-3 minutes for the DAG to show up.
23. Click the play button for the DAG example_emr_serverless_job in the Actions column, then click Trigger DAG to schedule the Spark job on EMR Serverless.
24. Once the status changes from light green to dark green, the job is completed.
25. Click the DAG example_emr_serverless_job, and in the next window click the dark green square icon next to start_job.
26. Click the Logs button in the popup window.
27. You will be able to see the logs for the Spark job. Notice that the final state for the job is success.
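
The two upload steps above (requirements.txt, then the DAG script) can also be done with boto3 instead of the S3 console. The sketch below is an illustration: the function name is hypothetical, and it assumes requirements.txt belongs at the bucket root while DAG files belong under the dags/ prefix.

```python
import os


def upload_mwaa_files(s3, bucket_name, requirements_path, dag_path):
    """Upload requirements.txt and a DAG script to the MWAA bucket.

    Returns the object keys that were written.
    """
    dag_key = "dags/" + os.path.basename(dag_path)
    # upload_file takes (local filename, bucket, object key).
    s3.upload_file(requirements_path, bucket_name, "requirements.txt")
    s3.upload_file(dag_path, bucket_name, dag_key)
    return ["requirements.txt", dag_key]


# Example usage (needs real AWS credentials, so it is left commented out):
# import boto3
# upload_mwaa_files(boto3.client("s3"), "YOUR_BUCKET",
#                   "requirements.txt", "example_emr_serverless.py")
```

Note that uploading a new requirements.txt does not by itself update the environment; the environment still has to be edited to point at the file, as in the console steps.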

### Orchestrate from locally running Airflow

Instead of MWAA, it is also possible to orchestrate the EMR Serverless job from a locally running Airflow instance.

## Transactional Data Lake with Apache Hudi

1. Upload all the files matching src/hudi-*.py to the Amazon S3 bucket created by the CloudFormation template.
2. Writing to a Hudi table: Apache Hudi provides two table types, Copy on Write and Merge on Read. Refer to [the Hudi FAQ](https://hudi.apache.org/docs/next/faq#what-is-the-difference-between-copy-on-write-cow-vs-merge-on-read-mor-storage-types) to understand which table type is suited to your use case.

### Hudi Copy On Write

1. Click Submit job under your Spark EMR Serverless application.
2. Provide a job name such as hudi_copy_on_write and select EMRServerlessS3RuntimeRole under Runtime role. Browse to the S3 folder where you saved the scripts from the prerequisite step and choose hudi-cow.py. The script takes the S3 bucket name as its argument; Hudi writes your data to this bucket. You can provide the S3 bucket created by the CloudFormation template: ["emrserverless-workshop-<account_id>"]
3. Under Spark properties, select “Edit in text” and paste the Spark configuration below.
   ```
   --conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
   ```
4. Check the Run status to see whether the job has completed.
5. Now that the Hudi Copy on Write table is created, let's verify the record count from Amazon Athena. Run the query below in the Athena query editor. Note: EMR Serverless currently does not support interactive analysis, so you have to use another platform such as EMR on EC2 or Athena to query Hudi tables.
    ```sql
    select count(*) from default.hudi_trips_cow
    ```
6. Now that we have verified the Hudi table, let's perform an upsert operation. First, check the records we want to modify by running the query below in Athena:
    ```sql
    select trip_id, route_id, tstamp, destination from default.hudi_trips_cow where trip_id between 999996 and 1000013
    ```
7. Submit another job that changes the destination and timestamp for trip IDs 1000000 to 1000010. Browse to the S3 folder where you uploaded the files from the prerequisite step and choose hudi-upsert-cow.py. As in the previous job, provide the S3 bucket name where the Hudi table was created as the script argument: ["emrserverless-workshop-<account_id>"]. Make sure the Spark properties are provided.
8. Once the job is completed, verify the updated rows by running the query below again in Athena. You will notice that the destination column for trip_id 1000000 to 1000010 is updated to Boston.
    ```sql
    select trip_id, route_id, tstamp, destination from default.hudi_trips_cow where trip_id between 999996 and 1000013
    ```
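
The contents of hudi-cow.py and hudi-upsert-cow.py are not shown here, but a Hudi upsert from PySpark is typically driven by a set of write options. The sketch below builds such an options dict. The option keys are standard Hudi configuration names; choosing `trip_id` as the record key and `tstamp` as the precombine field is an assumption based on the columns queried above, and the function name is hypothetical.

```python
def hudi_write_options(table_name="hudi_trips_cow",
                       table_type="COPY_ON_WRITE",
                       operation="upsert"):
    """Build Hudi datasource options for a Spark DataFrame write."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": operation,
        # Assumed key columns, inferred from the Athena queries in this section.
        "hoodie.datasource.write.recordkey.field": "trip_id",
        "hoodie.datasource.write.precombine.field": "tstamp",
        # Sync the table definition to the catalog so Athena can query it.
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": "default",
        "hoodie.datasource.hive_sync.table": table_name,
    }


# Example usage inside a Spark job (requires a SparkSession and the Hudi bundle):
# (updates_df.write.format("hudi")
#     .options(**hudi_write_options())
#     .mode("append")
#     .save("s3://YOUR_BUCKET/hudi/hudi_trips_cow"))
```

With the `upsert` operation, rows whose record key already exists are updated and new keys are inserted; when several incoming rows share a key, the one with the largest precombine value wins.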

### Hudi Merge on Read

1. Provide a job name such as hudi_merge_on_read and select EMRServerlessS3RuntimeRole under Runtime role. Browse to the S3 folder where you saved the scripts from the prerequisite step and choose hudi-mor.py. The script takes the S3 bucket name as its argument; Hudi writes your data to this bucket. You can provide the S3 bucket created by the CloudFormation template: ["emrserverless-workshop-<account_id>"]. Make sure the Spark properties are correctly provided.
2. Once the job is complete, it creates two tables, one per query type, with the suffixes _ro (Read Optimized) and _rt (Real Time). Run the queries below in Athena to validate the data in these tables.
    ```sql
    select count(*) from default.hudi_trips_mor_ro;
    select count(*) from default.hudi_trips_mor_rt;
    ```
3. Now let's update some records and see how the Read Optimized and Real Time query types behave. As in the Hudi Copy on Write section, we will update the records with trip IDs 1000000 to 1000010. First, check the records in both query types. Both will have identical data.
    ```sql
    select trip_id, route_id, tstamp, destination from default.hudi_trips_mor_ro where trip_id between 999996 and 1000013;
    select trip_id, route_id, tstamp, destination from default.hudi_trips_mor_rt where trip_id between 999996 and 1000013;
    ```
4. Submit a job that updates the destination for trip IDs 1000000 to 1000010. Browse to the S3 folder where you saved the scripts from the prerequisite step and choose hudi-upsert-mor.py. Provide the Hudi table location as the script argument and make sure the Spark properties are correctly provided.
5. Once the job is complete, run the query below in Athena. You will notice that hudi_trips_mor_rt has the updated Boston destination for trip_id 1000000 to 1000010.
    ```sql
    select trip_id, route_id, tstamp, destination from default.hudi_trips_mor_rt where trip_id between 999996 and 1000013
    ```
6. Run the same query against hudi_trips_mor_ro and notice that it still has the older records. This is because the read optimized view only reads data up to the last compaction. Since our data is not yet compacted, the latest updates are not yet reflected. This is a tradeoff between performance and data freshness: queries against the read optimized view are faster than queries against the real time view, which always provides a fresh snapshot of the data. Notice the time difference of 1.17s against 6.127s as well as the difference in data scanned of 2.25MB against 21.07MB.
    ```sql
    select trip_id, route_id, tstamp, destination from default.hudi_trips_mor_ro where trip_id between 999996 and 1000013
    ```
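
The difference between the two query types can be illustrated with a toy in-memory model. This is only a conceptual sketch, not how Hudi is implemented: the class and method names are hypothetical, the "base" dict stands in for compacted base files, and the "log" list stands in for delta log files that have not been compacted yet.

```python
class ToyMergeOnReadTable:
    """Toy model of a Merge on Read table with two read paths."""

    def __init__(self, base_rows):
        self.base = dict(base_rows)  # compacted data, cheap to scan
        self.log = []                # pending updates, not yet compacted

    def upsert(self, key, row):
        # Writes are cheap: they are appended to the log only.
        self.log.append((key, row))

    def read_optimized(self):
        # Reads the base only, so it is fast but may be stale.
        return dict(self.base)

    def real_time(self):
        # Merges the log over the base at query time: fresh but more work.
        merged = dict(self.base)
        for key, row in self.log:
            merged[key] = row
        return merged

    def compact(self):
        # Folds pending updates into the base and clears the log.
        self.base = self.real_time()
        self.log = []


table = ToyMergeOnReadTable({1000000: "Seattle", 1000011: "Chicago"})
table.upsert(1000000, "Boston")
print(table.read_optimized()[1000000])  # still the old destination
print(table.real_time()[1000000])       # the updated destination
table.compact()
print(table.read_optimized()[1000000])  # updated after compaction
```

The city names other than Boston are made up for the illustration. The point is that the read optimized path skips the merge work, which is why it scans less data and returns sooner, at the cost of not seeing updates until compaction runs.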

## Clean Up

Follow the steps below for clean up:

1. Delete all the Spark and Hive applications that you created for this lab from the EMR Studio or EMR Serverless console.
2. Delete the S3 bucket emrserverless-workshop-<<account-id>> created as part of the workshop. See the Amazon S3 documentation for details on deleting a bucket.
3. Open the CloudFormation console, select the EMRServerless stack, and click Delete to terminate the stack. If you created the EMR-Serverless-Orchestration stack for the MWAA section, delete it as well.
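
The clean-up steps can also be scripted. The following is a sketch under assumptions, not a tested teardown tool: the helper names, polling interval, and retry limit are hypothetical, and it assumes that an application has to be stopped before it can be deleted. Clients are passed in so that you control which account and region are affected.

```python
import time


def delete_application(emr, application_id, poll_seconds=10,
                       sleep=time.sleep, max_polls=60):
    """Stop the application if it is running, then delete it.

    Returns True if the delete call was issued, False if polling timed out.
    """
    state = emr.get_application(applicationId=application_id)["application"]["state"]
    if state == "STARTED":
        emr.stop_application(applicationId=application_id)
    for _ in range(max_polls):
        state = emr.get_application(applicationId=application_id)["application"]["state"]
        if state in ("CREATED", "STOPPED"):
            emr.delete_application(applicationId=application_id)
            return True
        sleep(poll_seconds)
    return False


def empty_and_delete_bucket(s3_resource, bucket_name):
    """Remove every object version, then the bucket itself. This is irreversible."""
    bucket = s3_resource.Bucket(bucket_name)
    bucket.object_versions.delete()
    bucket.delete()


def delete_stacks(cfn, stack_names):
    """Request deletion of each CloudFormation stack."""
    for name in stack_names:
        cfn.delete_stack(StackName=name)


# Example usage (needs real AWS credentials, so it is left commented out):
# import boto3
# delete_application(boto3.client("emr-serverless"), "APPLICATION_ID")
# empty_and_delete_bucket(boto3.resource("s3"), "emrserverless-workshop-ACCOUNT_ID")
# delete_stacks(boto3.client("cloudformation"), ["EMRLabv1", "EMR-Serverless-Orchestration"])
```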