# EMR Notebook using Spark SQL <a name="top"></a>



## Exercise: 
[(Back to the top)](#top)

In this notebook, we will do the following activities:
    
- Load a CSV data set from S3 into Spark, create a Spark DataFrame, register it as a temporary table, and query it using Spark SQL
- Use a dataset in S3 that was pre-cataloged by an AWS Glue crawler, and transform its tables using Spark SQL
- Write the transformed output to S3 in Parquet format, crawl it to create a catalog entry, and query it using Spark SQL

Let's start by connecting to our AWS Glue Dev Endpoint, a persistent AWS Glue Spark development environment.

In [None]:
spark.version

In [None]:
spark.sql("show databases").show()

In [None]:
spark.sql("use salesdb")

In [None]:
spark.sql("show tables").show()

Note that regular Spark SQL commands work here because the 'Use Glue Data Catalog as the Hive metastore' feature is enabled by default for our AWS Glue Dev Endpoint. As an optional exercise, you can run any Spark SQL commands against these tables.

You can read more at [AWS Glue Data Catalog Support for Spark SQL Jobs](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-data-catalog-hive.html).
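
As a starting point for the optional exercise, the sketch below builds a few read-only statements for one catalog table and runs them only when a `spark` session is already defined, as it is in this notebook. The helper name `preview_queries` is hypothetical, and `product` is assumed to be one of the tables in `salesdb` (it is queried later in this notebook).

```python
def preview_queries(table, limit=5):
    """Build a few read-only Spark SQL statements for one catalog table."""
    return [
        "DESCRIBE {0}".format(table),
        "SELECT COUNT(*) AS row_count FROM {0}".format(table),
        "SELECT * FROM {0} LIMIT {1}".format(table, int(limit)),
    ]


# `product` is an assumption: swap in any table listed by "show tables".
queries = preview_queries("product")
for query in queries:
    print(query)

# In the notebook, `spark` is pre-defined; elsewhere this part is skipped.
if "spark" in globals():
    spark.sql("use salesdb")
    for query in queries:
        spark.sql(query).show(truncate=False)
```

Keeping the statements in a list makes it easy to repeat the same checks for `product_category` or any other table in the database.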

#### The tables above were pre-created for you by the AWS Glue crawler, so any new EMR cluster can access them seamlessly. What about files in S3 (say, CSV) that you want to load with Spark, turn into a table through a DataFrame, and query with SQL? Use the section below.

## Load CSV files from S3 into Spark and create tables to query with SQL

In [None]:
# Boilerplate code
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql import Column
from pyspark.sql import Row # A row of data in a DataFrame
from pyspark.sql import GroupedData # Aggregation methods, returned by DataFrame.groupBy().
from pyspark.sql import DataFrameNaFunctions # Methods for handling missing data (null values).
from pyspark.sql import DataFrameStatFunctions # Methods for statistics functionality.
from pyspark.sql import functions # List of built-in functions available for DataFrame.
from pyspark.sql import types # List of data types available.
from pyspark.sql import Window # For working with window functions.
#End Boilerplate code

nytrip_df = spark.read.csv("s3://glue-labs-001-180486424913/data/nyc_trips_csv", header=True)

In [None]:
#Print schema for the csv files read in previous step
nytrip_df.printSchema()
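
Note that with only `header` set, `spark.read.csv` reads every column as a string, which the schema printed above should confirm. A minimal sketch of one alternative, assuming the same S3 path as the cell above, is to ask Spark to infer column types with `inferSchema`; this costs an extra pass over the data. The names `CSV_PATH`, `csv_options`, and `typed_df` are illustrative.

```python
# Path copied from the read cell above; replace it with your own bucket if needed.
CSV_PATH = "s3://glue-labs-001-180486424913/data/nyc_trips_csv"

# header=True: the first line of each file supplies the column names.
# inferSchema=True: Spark scans the data to guess column types
# instead of defaulting every column to string.
csv_options = {"header": True, "inferSchema": True}

# In the notebook, `spark` is pre-defined; elsewhere this part is skipped.
if "spark" in globals():
    typed_df = spark.read.csv(CSV_PATH, **csv_options)
    typed_df.printSchema()
```

For large inputs, supplying an explicit schema avoids the extra scan, but that requires knowing the column names and types in advance.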

In [None]:
# To run SQL on the DataFrame above, register it as a temporary view. The view lasts for the life of this Spark session.
nytrip_df.createOrReplaceTempView("temp_nytripdata")

In [None]:
# List the tables again: temp_nytripdata should appear with isTemporary = true (the third column).
spark.sql("use salesdb")
spark.sql("show tables").show()

In [None]:
# Now query the temporary view, or any table in the catalog, with standard SQL
spark.sql("select * from temp_nytripdata").show(5)
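
The `show tables` check above can also be done programmatically. The sketch below filters the records returned by `spark.catalog.listTables` on their `isTemporary` attribute. The helper `temporary_view_names` and the `FakeTable` stand-in are hypothetical and exist only so the logic can be tried without a cluster.

```python
from collections import namedtuple


def temporary_view_names(tables):
    """Return the sorted names of entries flagged as temporary."""
    return sorted(t.name for t in tables if t.isTemporary)


# Stand-in records with the two attributes the helper reads.
FakeTable = namedtuple("FakeTable", ["name", "isTemporary"])
sample = [
    FakeTable("product", False),
    FakeTable("temp_nytripdata", True),
]
print(temporary_view_names(sample))

# In the notebook, `spark` is pre-defined; listTables also returns temp views.
if "spark" in globals():
    print(temporary_view_names(spark.catalog.listTables("salesdb")))
```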

## Transform your data by denormalizing the tables and writing them in parquet format
[(Back to the top)](#top)

In this activity, we will denormalize two tables and create a Parquet format output.


### Transform the dataset

Let's now denormalize the source tables in the Glue catalog and write the transformed output in Parquet format to the destination location. Remember to change the S3 output path in the cells below to a bucket in your account.
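
One way to keep the bucket name in a single place is to build the output path from a variable. This is only a sketch: `s3_output_path`, `OUTPUT_BUCKET`, and the bucket name `my-example-bucket` are hypothetical, while the `data/sales_analytics/product_dim` prefix matches the path used in the write cell below.

```python
def s3_output_path(bucket, prefix):
    """Join a bucket name and key prefix into an s3:// folder path."""
    return "s3://{0}/{1}/".format(bucket.strip("/"), prefix.strip("/"))


# Hypothetical bucket name: replace it with a bucket in your account.
OUTPUT_BUCKET = "my-example-bucket"
output_path = s3_output_path(OUTPUT_BUCKET, "data/sales_analytics/product_dim")
print(output_path)

# The write cell could then use the variable instead of a literal path:
# product_df.coalesce(1).write.mode("overwrite").parquet(output_path)
```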

In [None]:
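# Verify sample data and schema
adf = spark.sql("select * from product_category limit 5")
adf.printSchema()
adf.show()
bdf = spark.sql("select * from product limit 5")
bdf.printSchema()
bdf.show()
# Verify row counts on the full source tables (not the 5-row samples above)
print(spark.table("product_category").count())
print(spark.table("product").count())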

In [None]:
# Create the denormalized product table by joining product_category and product
product_df = spark.sql("""
    SELECT a.category_id, a.category_name, b.product_id, b.name, b.supplier_id
    FROM product_category a
    JOIN product b ON a.category_id = b.category_id
""")

# Let's verify the schema and record count
product_df.printSchema()
print(product_df.count())
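
The same join can be expressed with the DataFrame API instead of SQL. The sketch below is one possible equivalent, not the notebook's own method: the function name `denormalize_products` is hypothetical, and the tables are read with their `salesdb.` prefix on the assumption that this is where the crawler registered them.

```python
def denormalize_products(category_df, product_df):
    """Inner-join categories to products and keep the five output columns."""
    # Imported lazily so the definition loads even where PySpark is absent.
    from pyspark.sql.functions import col

    a = category_df.alias("a")
    b = product_df.alias("b")
    return a.join(
        b, col("a.category_id") == col("b.category_id"), "inner"
    ).select(
        "a.category_id", "a.category_name",
        "b.product_id", "b.name", "b.supplier_id",
    )


# In the notebook, `spark` is pre-defined; elsewhere this part is skipped.
if "spark" in globals():
    api_df = denormalize_products(
        spark.table("salesdb.product_category"),
        spark.table("salesdb.product"),
    )
    api_df.printSchema()
    print(api_df.count())
```

The record count printed here should match the count from the SQL version above.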

In [None]:
# Write the entire denormalized product table into one file in Parquet format. Make sure to update the S3 path below and replace it with your own bucket.

product_df.coalesce(1).write.mode("OVERWRITE").parquet("s3://glue-labs-001-180486424913/data/sales_analytics/product_dim/")

### Now that your data is written to S3, switch to the AWS console and validate that one Parquet file was created at the above S3 location.
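
In addition to the console check, the output can be read back from the notebook and its row count compared with the DataFrame that was written. This is a minimal sketch: `describe_check` and `PARQUET_PATH` are hypothetical names, and the path is the one from the write cell above, which you may have replaced with your own.

```python
# Path from the write cell above; use your own bucket if you changed it.
PARQUET_PATH = "s3://glue-labs-001-180486424913/data/sales_analytics/product_dim/"


def describe_check(written_count, read_count):
    """Summarize whether the read-back row count matches what was written."""
    if written_count == read_count:
        return "OK: {0} rows".format(read_count)
    return "MISMATCH: wrote {0}, read back {1}".format(written_count, read_count)


# Runs only in the notebook, where `spark` and `product_df` already exist.
if "spark" in globals() and "product_df" in globals():
    readback_df = spark.read.parquet(PARQUET_PATH)
    print(describe_check(product_df.count(), readback_df.count()))
```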

## Final step: crawl the transformed data and create a table in your catalog for querying

- Navigate to the Glue console at Services -> Glue
- From the left-hand panel menu, navigate to Data Catalog -> Crawlers.
- Click on the button ‘Add Crawler’ to create a new Glue Crawler.
- Fields to fill in:
    - Page: Add information about your crawler
        - Crawler name: **sales_analytics_crawler**
    - Page: Add a data store
        - Choose a data store: S3
        - Include path: **s3://###s3_bucket###/data/sales_analytics/**
    - Page: Choose an IAM role
        - IAM Role: Choose an existing IAM role **glue-labs-GlueServiceRole**
    - Page: Configure the crawler's output
        - Database: Click on ‘Add database’ and enter **sales_analytics** as the database name.
- Click on the button ‘Finish’ to create the crawler.
- Select the new crawler and click on ‘Run crawler’ to run it.
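
The console steps above can also be scripted. The sketch below uses the boto3 Glue client calls `create_crawler` and `start_crawler`, and assumes that boto3 is installed, that your credentials are allowed to call Glue, and that the `sales_analytics` database is available to the crawler. The helper names and the bucket `my-example-bucket` are hypothetical; the crawler name, role, and database are the values from the list above.

```python
def crawler_request(bucket, role="glue-labs-GlueServiceRole"):
    """Build the create_crawler arguments that mirror the console fields."""
    return {
        "Name": "sales_analytics_crawler",
        "Role": role,
        "DatabaseName": "sales_analytics",
        "Targets": {
            "S3Targets": [
                {"Path": "s3://{0}/data/sales_analytics/".format(bucket)}
            ]
        },
    }


def create_and_start(request):
    """Create the crawler and start one run (needs boto3 and AWS access)."""
    import boto3  # imported lazily so the rest of the block runs without it

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical bucket name: replace it with your own bucket.
request = crawler_request("my-example-bucket")
print(request["Targets"]["S3Targets"][0]["Path"])

# Uncomment to actually create and run the crawler in your account:
# create_and_start(request)
```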


Now, let's query your transformed table.

In [None]:
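# Validate that your table exists and start querying
spark.sql("use sales_analytics")
spark.sql("show tables").show()
# Replace the placeholder with the table name listed by "show tables" above
table_name = "REPLACE_WITH_YOUR_TABLE_NAME"
spark.sql("select * from sales_analytics.{}".format(table_name)).show(5)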

## Congratulations! You have successfully completed this exercise and learned how to use Spark in your day-to-day work
[(Back to the top)](#top)
