# 01-Load a Dimension table
This notebook extracts product data from Parquet files and loads it into a Delta Lake table and a Synapse table.
It applies some common transformations that you will encounter in real-life scenarios.

## Contents
1. Extract
1. Transform
1. Load

In [8]:
# Set Parameters

# Set path to source files
basePath = "abfss://data@REPLACE_DATALAKE_NAME.dfs.core.windows.net/sample/AdventureWorksDW2019/dbo/"
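
The path above still contains the `REPLACE_DATALAKE_NAME` placeholder, so the notebook fails at the first read if it is not replaced. The following is a minimal sketch of a guard that builds the path and fails early when the placeholder is left in place. The helper `build_base_path` and the account name `mydatalake` are hypothetical and not part of the original notebook.

```python
PLACEHOLDER = "REPLACE_DATALAKE_NAME"

def build_base_path(account_name: str,
                    container: str = "data",
                    folder: str = "sample/AdventureWorksDW2019/dbo") -> str:
    """Return the abfss:// folder path for the given storage account."""
    if not account_name or account_name == PLACEHOLDER:
        raise ValueError("Set the data lake account name before running the notebook")
    # Normalise the folder so the result always ends with exactly one slash
    return f"abfss://{container}@{account_name}.dfs.core.windows.net/{folder.strip('/')}/"

# Hypothetical account name, used only for illustration
example_path = build_base_path("mydatalake")
print(example_path)
```

Assigning the result to `basePath` would keep the remaining cells unchanged.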


### 1. Extract

In [17]:
# Create a spark dataframe with product data
productDF = spark.read.parquet(basePath + "DimProduct")
productDF.createOrReplaceTempView("product_tmp")
display(productDF.limit(10))

### 2. Transform

In [18]:
# Import these two modules at a minimum for Spark transformations.
# Note: the wildcard import of pyspark.sql.functions shadows Python builtins such as sum, min and max.
from pyspark.sql.functions import *
from pyspark.sql.types import *

In [23]:
# Join two dataframes to build a Product Category Table
productSubCategoryDF = spark.read.parquet(basePath + "DimProductSubcategory")
productCategoryDF = spark.read.parquet(basePath + "DimProductCategory")

prodcatDF = (
    productSubCategoryDF
    .join(productCategoryDF,
          productSubCategoryDF.ProductCategoryKey == productCategoryDF.ProductCategoryKey,
          "inner")
    .select(col("ProductSubcategoryKey"),
            col("EnglishProductSubcategoryName").alias("ProductSubCategory"),
            col("EnglishProductCategoryName").alias("ProductCategory"))
)

prodcatDF.createOrReplaceTempView("prodCategory_tmp")
display(prodcatDF)
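
To make the effect of the inner join concrete, here is a small plain-Python sketch of the same logic that runs without a Spark session. The sample rows are made up for illustration, and `inner_join_categories` is a hypothetical helper rather than part of the notebook.

```python
# Made-up sample rows for illustration; the real rows come from the parquet files
subcategories = [
    {"ProductSubcategoryKey": 1, "EnglishProductSubcategoryName": "Mountain Bikes", "ProductCategoryKey": 1},
    {"ProductSubcategoryKey": 2, "EnglishProductSubcategoryName": "Helmets", "ProductCategoryKey": 4},
    {"ProductSubcategoryKey": 3, "EnglishProductSubcategoryName": "Unmapped", "ProductCategoryKey": 99},
]
categories = [
    {"ProductCategoryKey": 1, "EnglishProductCategoryName": "Bikes"},
    {"ProductCategoryKey": 4, "EnglishProductCategoryName": "Accessories"},
]

def inner_join_categories(subcats, cats):
    """Mimic the inner join and the column renames from the cell above."""
    category_by_key = {c["ProductCategoryKey"]: c for c in cats}
    joined = []
    for sub in subcats:
        cat = category_by_key.get(sub["ProductCategoryKey"])
        if cat is None:
            continue  # no matching category: an inner join drops the row
        joined.append({
            "ProductSubcategoryKey": sub["ProductSubcategoryKey"],
            "ProductSubCategory": sub["EnglishProductSubcategoryName"],
            "ProductCategory": cat["EnglishProductCategoryName"],
        })
    return joined

prodcat_rows = inner_join_categories(subcategories, categories)
for row in prodcat_rows:
    print(row)
```

The subcategory with key 3 has no matching category, so it does not appear in the result. If such rows must be kept, a left join would be the usual alternative.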

### 3. Load

Create a temporary view to facilitate data transfer between Python and Scala

In [26]:
%%sql
CREATE OR REPLACE TEMPORARY VIEW prod_tmp AS
SELECT
    pt.ProductKey AS product_sk,
    pt.ProductAlternateKey AS product_id,
    pt.EnglishProductName AS ProductName,
    pct.ProductSubCategory,
    pct.ProductCategory
FROM product_tmp pt
INNER JOIN prodCategory_tmp pct
    ON pt.ProductSubcategoryKey = pct.ProductSubcategoryKey

Create a Delta Lake table for the product data if it doesn't already exist

In [28]:
%%sql
-- Creating Spark tables in Delta format allows ACID transactions on data lake tables
-- This is a one-time task
CREATE TABLE IF NOT EXISTS sparklakehouse.stg_product USING DELTA
AS
SELECT * FROM prod_tmp

Merge new and changed data

In [29]:
%%sql

MERGE INTO sparklakehouse.stg_product t
USING prod_tmp s 
ON t.product_sk = s.product_sk 
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
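
The semantics of this `MERGE` (an upsert keyed on `product_sk`) can be illustrated with a plain-Python sketch. It is a simplified model for reasoning about the outcome, not how Delta Lake implements the operation. The function `merge_upsert` and the sample rows are hypothetical.

```python
def merge_upsert(target_rows, source_rows, key="product_sk"):
    """Return the target after update-when-matched / insert-when-not-matched."""
    merged = {row[key]: dict(row) for row in target_rows}
    for row in source_rows:
        # Matched key: overwrite every column (UPDATE SET *)
        # Unmatched key: add the row (INSERT *)
        merged[row[key]] = dict(row)
    return sorted(merged.values(), key=lambda r: r[key])

# Made-up rows for illustration
target = [
    {"product_sk": 1, "ProductName": "Old name"},
    {"product_sk": 2, "ProductName": "Unchanged"},
]
source = [
    {"product_sk": 1, "ProductName": "New name"},
    {"product_sk": 3, "ProductName": "Brand new"},
]

result = merge_upsert(target, source)
for row in result:
    print(row)
```

Row 1 is updated, row 3 is inserted, and row 2 stays as it was because the statement has no delete clause. One difference from this sketch: Delta Lake can fail the merge when several source rows match the same target row, so the source should hold one row per `product_sk`.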

Clean up old versions of the data periodically

In [30]:
%%sql
VACUUM sparklakehouse.stg_product RETAIN 168 HOURS;
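
The retention of 168 hours equals 7 days. As a sketch, the statement could be built from a retention expressed in days, refusing anything shorter than 7 days, since Delta Lake's default safety check rejects shorter windows. The helper `vacuum_statement` is hypothetical and only builds the SQL text.

```python
MIN_RETENTION_HOURS = 7 * 24  # 168 hours, the value used in the cell above

def vacuum_statement(table: str, retain_days: int = 7) -> str:
    """Build a VACUUM statement for the given table and retention in days."""
    hours = retain_days * 24
    if hours < MIN_RETENTION_HOURS:
        raise ValueError("Retention below 7 days risks removing files that readers still need")
    return f"VACUUM {table} RETAIN {hours} HOURS"

stmt = vacuum_statement("sparklakehouse.stg_product")
print(stmt)
# In the notebook, the statement could then be run with spark.sql(stmt)
```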

Finally, upload the data into Synapse Analytics

In [27]:
%%spark
// Create a Scala DataFrame from the temporary view
val scala_df = spark.sql("SELECT * FROM prod_tmp")

// Write to a staging table, after which a stored procedure can create the final table
scala_df.write.synapsesql("SQLTestPool.dbo.StgProduct", Constants.INTERNAL)