# MSIN0166 Data Engineering workshop practice

In [1]:
import pyspark
from pyspark.sql import SparkSession

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
5,,pyspark,idle,,,✔


SparkSession available as 'spark'.


In this workshop, we will explore how to extract and process data for analytical insights via PySpark and SparkSQL.
You will now use S3 as a data source to extract data instead of connecting to a Postgres database. 
<br/>
<br/>
Data from the Postgres database has been copied into S3 in the following S3 bucket: <b>s3://msin0166-spark-workshop</b>
<br/>
It can be found within the <b>data/ecommerce_data</b> folder.
<br/>
<br/>
Hence, all table data can be found at: <b>s3://msin0166-spark-workshop/data/ecommerce_data/</b>

Note: All tables have been stored in Parquet format. 
To check the list of tables stored in S3 (Parquet files), navigate to the msin0166-spark-workshop S3 bucket via the AWS Management Console.


### Step 1: Set the accessKeyId and secretAccessKey variables in the code below

In [2]:
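# Reuse the running session, or create one if none exists
spark = SparkSession.builder.appName("S3CSVRead").getOrCreate()
accessKeyId = " "      # paste your AWS access key ID here
secretAccessKey = " "  # paste your AWS secret access key here
# Configure the S3A connector; its credential properties are
# fs.s3a.access.key and fs.s3a.secret.key
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", accessKeyId)
hadoop_conf.set("fs.s3a.secret.key", secretAccessKey)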

### Step 2: Follow the sample code below and load the Order table data in a Spark DataFrame

<code>station_df = spark.read.parquet("s3://msin0166-spark-workshop/data/ecommerce_data/user.parquet")</code>

In [5]:
order_df = spark.read.parquet(".....")

### Step 3: To display a Spark DataFrame, call the show() function on the DataFrame
Example: <code> station_df.show() </code>

In [7]:
# Please display the DataFrame in this cell

Spark provides the SparkSQL library, allowing us to write SQL against DataFrames. In order to do that, we need to store our DataFrame as a view, as follows:
<br/>
<br/>
<code> station_df.createOrReplaceTempView("station_df")</code>
<br/>
<br/>
We can then write SQL queries against the view to extract data from the DataFrame:

<code>query_result=spark.sql("SELECT * FROM station_df")</code>
<br/>
<br/>
<code>query_result.show()</code>

### Step 4: Select the average of the order total in the Order table by using SparkSQL
Note: Don't forget to create a view, so that the order table can be found by your query

In [None]:
# Create the view

In [None]:
# Run the query

In [None]:
# Display the query result

### Step 5: 
a) Extract the user data from S3. <br/>
b) Register it as a view <br/>
c) Count how many users are less than 60 years old <br/>


In [None]:
# a) Extract the user data from S3.
user_data=spark.read.parquet(".....") 

In [None]:
# b) Register it as a view
user_data....("user_data")

In [None]:
# c) Count how many users are less than 60 years old
user_count=spark.sql("....")

In [None]:
user_count.show()

### Step 6: Retrieve user data together with all personally identifiable information (PII). (Hint: join the PII and user tables)
a) Extract PII data from S3 (Hint: It's the pii.parquet file) <br/>
b) Register the PII data as a view <br/>
c) Write a SQL query to JOIN the PII table and the User table  <br/>

In [None]:
# a) Extract PII data from S3 (Hint: It's the pii.parquet file)

In [None]:
# b) Register the PII data as a view

In [None]:
# c) Write a SQL query to JOIN the PII table and the User table

### Step 7: Find all products ordered by the user whose first name is ‘FN1’

You can create new Spark DataFrame columns by using the .withColumn() function. <br/>
<br/>
<br/>
The withColumn function takes two parameters:
- First parameter: the name of the new column.
- Second parameter: the expression used to populate that column.
<br/>   
<br/>
<code>station_df.withColumn("bikeParts", concat(col("bike_part_one"), col("bike_part_two"))).show()</code>

### Step 8: Create a new column named fullName on your PII table. It should contain a concatenation of the first_name and last_name columns


In [21]:
from pyspark.sql.functions import col, concat
pii=pii.withColumn("fullName",......)


In [None]:
pii.createOrReplaceTempView("pii")
pii.show()

### Step 9: Create a new column called doubleProductPrice in the product table. 
It should contain values equal to double the price from the product_price column <br/>
<br/>
a) Extract Product data from S3 (Hint: It's the product.parquet file) <br/>
b) Register the Product data as a view <br/>
c) Follow the example in step 7 and create a new column called doubleProductPrice by doubling the value of the product_price column <br/>

In [None]:
# a) Extract Product data from S3 (Hint: It's the product.parquet file)

In [None]:
# b) Register the Product data as a view

In [None]:
# c) Follow the example in step 7 and create a new column called doubleProductPrice by doubling the value of the product_price column

Spark generalises the MapReduce programming model popularised by Hadoop (it runs on its own execution engine rather than on Hadoop MapReduce), so it is worth practising the map and reduce pattern as part of our exercises.
<br/>
<br/>
<b>Below, you can find code used to count the number of orders by their status in the Order table.</b>
<br/>

In [26]:
order=spark.read.parquet("s3://msin0166-spark-workshop/data/ecommerce_data/order.parquet")
order.createOrReplaceTempView("order")
order.show()

+--------+-----------+------------+-------------------+-------------------+
|order_id|order_total|order_status|         created_at|         updated_at|
+--------+-----------+------------+-------------------+-------------------+
|       1|       60.0|    finished|2021-02-04 00:00:00|2021-02-04 00:00:00|
|       2|      120.0|    finished|2021-02-04 00:00:00|2021-02-04 00:00:00|
|       3|      190.0|    finished|2021-02-04 00:00:00|2021-02-04 00:00:00|
|       4|       30.0|    finished|2021-02-04 00:00:00|2021-02-04 00:00:00|
|       5|       30.0|    finished|2021-02-04 00:00:00|2021-02-04 00:00:00|
|       6|      160.0|    finished|2021-02-04 00:00:00|2021-02-04 00:00:00|
|       7|      100.0|    finished|2021-02-04 00:00:00|2021-02-04 00:00:00|
|       8|       30.0|    finished|2021-02-04 00:00:00|2021-02-04 00:00:00|
|       9|       20.0|    finished|2021-02-04 00:00:00|2021-02-04 00:00:00|
|      10|      180.0|    finished|2021-02-04 00:00:00|2021-02-04 00:00:00|
|      11|  

In [34]:
order_rdd=order.rdd


In [35]:
order_rdd.take(3)

[Row(order_id=1, order_total=60.0, order_status='finished', created_at=datetime.datetime(2021, 2, 4, 0, 0), updated_at=datetime.datetime(2021, 2, 4, 0, 0)), Row(order_id=2, order_total=120.0, order_status='finished', created_at=datetime.datetime(2021, 2, 4, 0, 0), updated_at=datetime.datetime(2021, 2, 4, 0, 0)), Row(order_id=3, order_total=190.0, order_status='finished', created_at=datetime.datetime(2021, 2, 4, 0, 0), updated_at=datetime.datetime(2021, 2, 4, 0, 0))]

In [36]:
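# Map step: turn each Row into an (order_status, 1) pair; order_status is the third field (index 2)
order_map=order_rdd.map(lambda x: (x[2],1))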

In [37]:
order_map.take(3)

[('finished', 1), ('finished', 1), ('finished', 1)]

In [38]:
# Reduce step: add up the 1s for every distinct order_status key
order_status_count=order_map.reduceByKey(lambda a,b:a+b)


In [39]:
order_status_count.collect()

[('finished', 10), ('in_progress', 10)]
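
To make the two steps concrete without a cluster, here is a plain-Python sketch of what `map` followed by `reduceByKey` computes. It is only an analogy: Spark combines values within each partition and then merges the partial results, whereas this runs in a single process. The `orders` list and the `reduce_by_key` helper are hypothetical and written purely for illustration.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

# Hypothetical rows shaped like (order_id, order_total, order_status).
orders = [
    (1, 60.0, "finished"),
    (2, 120.0, "in_progress"),
    (3, 190.0, "finished"),
    (4, 30.0, "in_progress"),
    (5, 30.0, "finished"),
]

# Map step: emit a (key, 1) pair per row, keyed on the status field.
pairs = [(row[2], 1) for row in orders]


def reduce_by_key(func, key_value_pairs):
    """Combine all values that share a key using a binary function."""
    # groupby only groups adjacent items, so sort by key first.
    ordered = sorted(key_value_pairs, key=itemgetter(0))
    return [
        (key, reduce(func, (value for _, value in group)))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


# Reduce step: sum the 1s for each status.
status_counts = reduce_by_key(lambda a, b: a + b, pairs)
print(status_counts)  # [('finished', 3), ('in_progress', 2)]
```

In Spark the function passed to `reduceByKey` should be associative and commutative, because values for a key may be combined in any order across partitions; addition satisfies both.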

### Step 10: Use the MapReduce algorithm to count the number of products per category in the Product table 

Hint: Use the product_category column, which is the second column.
<br/>
<br/>
a) Convert the Spark DataFrame to an RDD <br/>
b) Print the RDD to check it is not empty <br/>
c) Apply the map function to the Product RDD <br/>
d) Apply the reduceByKey function to the RDD created in step c) <br/>

In [None]:
#a) Convert the Spark DataFrame to an RDD

In [None]:
#b) Print the RDD to check it is not empty

In [None]:
#c) Apply the map function to the Product RDD

In [None]:
#d) Apply the reduceByKey function to the RDD created in step c)