In [10]:
BUCKET_PATH = "s3://wysde-assets/labs/glue-cdc-upsert"

We will use the following folders:
- $BUCKET_PATH/fullload – This folder is used for a one-time full load from the upstream data source
- $BUCKET_PATH/cdcload – This folder is used for copying the upstream data changes
- $BUCKET_PATH/delta – This folder holds the Delta Lake data files
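
The three folders above can be derived once from `BUCKET_PATH` so later cells do not repeat string concatenation. This is a minimal sketch; the `lab_paths` helper is a hypothetical name and is not part of the original notebook.

```python
# Hypothetical helper (not from the original notebook): derive the lab folders
# from the single BUCKET_PATH value defined above.
BUCKET_PATH = "s3://wysde-assets/labs/glue-cdc-upsert"


def lab_paths(bucket_path: str) -> dict:
    """Return the S3 prefixes used in this lab, keyed by purpose."""
    base = bucket_path.rstrip("/")  # tolerate a trailing slash
    return {
        "fullload": f"{base}/fullload/",  # one-time full load
        "cdcload": f"{base}/cdcload/",    # upstream change feed
        "delta": f"{base}/delta/",        # Delta Lake data files
    }


paths = lab_paths(BUCKET_PATH)
print(paths["fullload"])
```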

## Upload the `full-load.csv` data to the $BUCKET_PATH/fullload folder

In [7]:
!aws s3 cp data/full-load.csv $BUCKET_PATH/fullload/

upload: data/full-load.csv to s3://wysde-assets/labs/glue-cdc-upsert/fullload/full-load.csv


## Set up an IAM policy and role

### Create Policy

In [2]:
%%writefile policy-document.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowListingOfFolders",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::wysde-assets"
            ]
        },
        {
            "Sid": "ObjectAccessInBucket",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::wysde-assets/*"
        }
    ]
}

Overwriting policy-document.json
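
The policy document pairs bucket-level actions (`s3:ListBucket`, `s3:GetBucketLocation`) with the bucket ARN and object-level actions with the `/*` object ARN. As a sketch, the same document could be generated for any bucket name; `build_s3_policy` is a hypothetical helper, not part of the original notebook.

```python
import json


def build_s3_policy(bucket: str) -> dict:
    """Build the same two-statement policy as policy-document.json for a given bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Sid": "AllowListingOfFolders",
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Object-level actions apply to keys inside the bucket.
                "Sid": "ObjectAccessInBucket",
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


policy_json = json.dumps(build_s3_policy("wysde-assets"), indent=4)
print(policy_json)
```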


In [3]:
policy_name = "glue-delta-lake-cdc-policy"

!aws iam create-policy --policy-name {policy_name} --policy-document file://policy-document.json

----------------------------------------------------------------------------------------------------
|                                           CreatePolicy                                           |
+--------------------------------------------------------------------------------------------------+
||                                             Policy                                             ||
|+--------------------------------+---------------------------------------------------------------+|
||  Arn                           |  arn:aws:iam::684199068947:policy/glue-delta-lake-cdc-policy  ||
||  AttachmentCount               |  0                                                            ||
||  CreateDate                    |  2023-02-05T06:50:39+00:00

### Create Role

In [4]:
%%writefile role-trust.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "glue.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Writing role-trust.json


In [5]:
role_name = "glue-delta-lake-cdc-role"

!aws iam create-role --role-name {role_name} --assume-role-policy-document file://role-trust.json

-----------------------------------------------------------------------------
|                                CreateRole                                 |
+---------------------------------------------------------------------------+
||                                  Role                                   ||
|+------------+------------------------------------------------------------+|
||  Arn       |  arn:aws:iam::684199068947:role/glue-delta-lake-cdc-role   ||
||  CreateDate|  2023-02-05T06:53:17+00:00                                 ||
||  Path      |  /                                                         ||
||  RoleId    |  AROAZ6TLRIUJW7AQ3WWME                                     ||
||  RoleName

### Attach Policies to the Role

In [7]:
!aws iam attach-role-policy --policy-arn arn:aws:iam::684199068947:policy/glue-delta-lake-cdc-policy --role-name {role_name}
!aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole --role-name {role_name}
!aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/CloudWatchFullAccess --role-name {role_name}
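
The commands above hardcode the account ID in the custom policy ARN. A possible alternative, sketched here with boto3-style clients passed in as arguments, looks up the account ID first. The `attach_lab_policies` function is a hypothetical helper; `get_caller_identity` and `attach_role_policy` are the STS and IAM client calls it relies on.

```python
def attach_lab_policies(iam, sts, role_name: str, custom_policy_name: str) -> list:
    """Attach the custom policy and the two AWS managed policies to the role.

    `iam` and `sts` are expected to behave like boto3 IAM and STS clients.
    """
    # Resolve the account ID instead of hardcoding it in the ARN.
    account_id = sts.get_caller_identity()["Account"]
    policy_arns = [
        f"arn:aws:iam::{account_id}:policy/{custom_policy_name}",
        "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
        "arn:aws:iam::aws:policy/CloudWatchFullAccess",
    ]
    for arn in policy_arns:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    return policy_arns


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   attach_lab_policies(boto3.client("iam"), boto3.client("sts"),
#                       "glue-delta-lake-cdc-role", "glue-delta-lake-cdc-policy")
```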

[0m[0m

## Set up the Glue jobs

In this section, we set up two AWS Glue jobs: one for full load and one for the CDC load. Let’s start with the full load job.

### Full load Job

1. On the AWS Glue console, under Data Integration and ETL in the navigation pane, choose Jobs. AWS Glue Studio opens in a new tab.
1. Select Spark script editor and choose Create.
1. In the script editor, replace the code with the full load script that follows this list.
1. Navigate to the Job details tab.
1. Provide a name for the job (for example, Full-Load-Job).
1. For IAM Role, choose the role that you created earlier.
1. For Worker type, choose G 2X (the `G.2X` worker type).
1. For Job bookmark, choose Disable.
1. Set Number of retries to 0.
1. Under Advanced properties, keep the default values, but provide the delta core JAR file path for Python library path and Dependent JARs path.
1. Under Job parameters:
    - Add the key --s3_bucket with the name of the bucket you created earlier as the value. The scripts build paths as `s3://` + this value + `/fullload/`, so include any key prefix in the value.
    - Add the key --datalake-formats with the value delta.
1. Keep the remaining default values and choose Save.
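
The console settings above could also be captured as a job definition and passed to the boto3 Glue client's `create_job` call. The sketch below only builds the keyword arguments. The script location, `GlueVersion`, and `NumberOfWorkers` values are assumptions for illustration, and the JAR path from the Advanced properties step is not modeled.

```python
def build_glue_job_config(job_name: str, role_name: str,
                          script_location: str, s3_bucket: str) -> dict:
    """Return keyword arguments mirroring the Job details steps above."""
    return {
        "Name": job_name,
        "Role": role_name,
        "Command": {
            "Name": "glueetl",                  # Spark ETL job
            "ScriptLocation": script_location,  # assumption: script uploaded to S3
            "PythonVersion": "3",
        },
        "DefaultArguments": {
            "--s3_bucket": s3_bucket,
            "--datalake-formats": "delta",
            "--job-bookmark-option": "job-bookmark-disable",
        },
        "GlueVersion": "4.0",   # assumption: not stated in the steps above
        "WorkerType": "G.2X",
        "NumberOfWorkers": 2,   # assumption: not stated in the steps above
        "MaxRetries": 0,
    }


config = build_glue_job_config(
    job_name="Full-Load-Job",
    role_name="glue-delta-lake-cdc-role",
    # Hypothetical script path, used only for illustration.
    script_location="s3://wysde-assets/labs/glue-cdc-upsert/scripts/full_load.py",
    s3_bucket="wysde-assets/labs/glue-cdc-upsert",
)

# Usage (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_job(**config)
```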

In [None]:
import sys
from awsglue.utils import getResolvedOptions
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *

## @params: [JOB_NAME, s3_bucket]
args = getResolvedOptions(sys.argv, ['JOB_NAME','s3_bucket'])

# Initialize Spark Session with Delta Lake
spark = SparkSession \
.builder \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()

#Define the table schema
schema = StructType() \
      .add("policy_id",IntegerType(),True) \
      .add("expiry_date",DateType(),True) \
      .add("location_name",StringType(),True) \
      .add("state_code",StringType(),True) \
      .add("region_name",StringType(),True) \
      .add("insured_value",IntegerType(),True) \
      .add("business_type",StringType(),True) \
      .add("earthquake_coverage",StringType(),True) \
      .add("flood_coverage",StringType(),True) 

# Read the full load
sdf = spark.read.format("csv").option("header",True).schema(schema).load("s3://"+ args['s3_bucket']+"/fullload/")
sdf.printSchema()

# Write data as DELTA TABLE
sdf.write.format("delta").mode("overwrite").save("s3://"+ args['s3_bucket']+"/delta/insurance/")


### CDC Job

1. Create a second job called CDC-Load-Job.
1. Follow the same steps on the Job details tab as for the previous job.
1. Alternatively, choose the Clone job option on Full-Load-Job, which carries over all the job details from the full load job.
1. In the script editor, enter the following code snippet for the CDC logic.

In [None]:
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.functions import expr

## For Delta lake
from delta.tables import DeltaTable


## @params: [JOB_NAME, s3_bucket]
args = getResolvedOptions(sys.argv, ['JOB_NAME','s3_bucket'])

# Initialize Spark Session with Delta Lake
spark = SparkSession \
.builder \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()

# Read the CDC load
cdc_df = spark.read.csv("s3://"+ args['s3_bucket']+"/cdcload")
cdc_df.show(5,True)

# now read the full load (latest data) as delta table
delta_df = DeltaTable.forPath(spark, "s3://"+ args['s3_bucket']+"/delta/insurance/")
delta_df.toDF().show(5,True)

# UPSERT: update the row when the merge condition matches, otherwise insert it.
# If the feed carries no operation keyword, build a data set with Insert, Update, and Delete flags and handle each separately.
# Deletes would need their own pass with a delete condition; this script does not handle deletes.

final_df = delta_df.alias("prev_df").merge( \
source = cdc_df.alias("append_df"), \
#matching on primarykey
condition = expr("prev_df.policy_id = append_df._c1"))\
.whenMatchedUpdate(set= {
    "prev_df.expiry_date"           : col("append_df._c2"), 
    "prev_df.location_name"         : col("append_df._c3"),
    "prev_df.state_code"            : col("append_df._c4"),
    "prev_df.region_name"           : col("append_df._c5"), 
    "prev_df.insured_value"         : col("append_df._c6"),
    "prev_df.business_type"         : col("append_df._c7"),
    "prev_df.earthquake_coverage"   : col("append_df._c8"), 
    "prev_df.flood_coverage"        : col("append_df._c9")} )\
.whenNotMatchedInsert(values =
#inserting a new row to Delta table
{   "prev_df.policy_id"             : col("append_df._c1"),
    "prev_df.expiry_date"           : col("append_df._c2"), 
    "prev_df.location_name"         : col("append_df._c3"),
    "prev_df.state_code"            : col("append_df._c4"),
    "prev_df.region_name"           : col("append_df._c5"), 
    "prev_df.insured_value"         : col("append_df._c6"),
    "prev_df.business_type"         : col("append_df._c7"),
    "prev_df.earthquake_coverage"   : col("append_df._c8"), 
    "prev_df.flood_coverage"        : col("append_df._c9")
})\
.execute()
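
To make the merge semantics concrete, here is a pure-Python model of what the job does: rows are keyed by `policy_id`, a matching key is overwritten, and a new key is inserted. This is an illustration only, not the Delta Lake API. The `apply_upsert` name is hypothetical, and like the merge above it ignores the operation flag in the first column.

```python
def apply_upsert(target: dict, cdc_rows: list) -> dict:
    """Model of the merge: update when policy_id matches, insert otherwise.

    target   : {policy_id: tuple of the remaining eight column values}
    cdc_rows : rows shaped like the headerless CDC CSV (_c0 .. _c9)
    """
    merged = dict(target)  # leave the input untouched
    for row in cdc_rows:
        _op, policy_id, *values = row  # _c0 is the flag, _c1 the key
        merged[int(policy_id)] = tuple(values)
    return merged


target = {
    100462: ("2023-03-25", "Urban", "NY", "East", "3400000", "Construction", "Y", "Y"),
}
cdc_rows = [
    ["U", "100462", "2024-12-31", "Urban", "NY", "East", "3400000", "Construction", "Y", "Y"],
    ["I", "110001", "2024-03-31", "Urban", "CA", "WEST", "210000", "Office Bldg", "N", "N"],
]

result = apply_upsert(target, cdc_rows)
print(sorted(result))          # [100462, 110001]
print(result[100462][0])       # 2024-12-31
```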

## Run the Full load Job

On the AWS Glue console, open `Full-Load-Job` and choose Run. The job takes about 2 minutes to complete, after which the job run status changes to Succeeded. Go to `$BUCKET_PATH` and open the delta folder, which contains the insurance folder holding the Delta Lake files.
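
Starting the job and waiting for a final status could also be scripted. The sketch below assumes a boto3-style Glue client is passed in and polls `get_job_run` until the run leaves its in-progress states. `run_job_and_wait` and its polling defaults are hypothetical choices, not part of the original lab.

```python
import time

# Job run states after which polling can stop.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(glue, job_name: str, poll_seconds: float = 15, max_polls: int = 40) -> str:
    """Start a Glue job run and return its final JobRunState."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    for _ in range(max_polls):
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        if run["JobRunState"] in TERMINAL_STATES:
            return run["JobRunState"]
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} run {run_id} did not finish in time")


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   print(run_job_and_wait(boto3.client("glue"), "Full-Load-Job"))
```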

In [13]:
!aws s3 ls {BUCKET_PATH}/delta/insurance/

                           PRE _delta_log/
2023-02-05 12:46:08          0 _delta_log_$folder$
2023-02-05 12:46:42       2802 part-00000-6562d148-4197-40f8-b17a-54107cda54c6-c000.snappy.parquet


## Create and run the AWS Glue crawler

In this step, we create an AWS Glue crawler with Delta Lake as the data source type. After successfully running the crawler, we inspect the data using Athena.

1.  On the AWS Glue console, choose Crawlers in the navigation pane.
2.  Choose Create crawler.
3.  Provide a name (for example, `delta-lake-crawler`) and choose Next.
4.  Choose Add a data source and choose Delta Lake as your data source.
5.  Copy your delta folder URI (for example, `$BUCKET_PATH/delta/insurance`) and enter it as the Delta Lake table path location.
6.  Keep the default selection Create Native tables, and choose Add a Delta Lake data source.
7.  Choose Next.
8.  Choose the IAM role you created earlier, then choose Next.
9.  Select the `default` target database, and provide `delta_` for the table name prefix. If no `default` database exists, you may create one.
10. Choose Next.
11. Choose Create crawler.
12. Run the newly created crawler. After the crawler is complete, the `delta_insurance` table is available under `Databases/Tables`.
13. Open the table to check the table overview.

### Create the Crawler

In [22]:
%%writefile glue-targets.json
{
  "DeltaTargets": [
    {
      "DeltaTables": ["s3://wysde-assets/labs/glue-cdc-upsert/delta/insurance"],
      "CreateNativeDeltaTable": true
    }
  ]
}

Overwriting glue-targets.json


In [21]:
!aws glue create-crawler \
    --name delta-lake-crawler \
    --database-name datalake \
    --targets file://glue-targets.json \
    --table-prefix delta_ \
    --role AWSGlueServiceRole-FullS3Access

### Run the Crawler

Open the crawler in the AWS Glue console and run it.
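
As an alternative to the console, the crawler run could be scripted. This sketch assumes a boto3-style Glue client is passed in, starts the crawler, and polls until its state returns to `READY`. `run_crawler_and_wait` and its polling defaults are hypothetical.

```python
import time


def run_crawler_and_wait(glue, crawler_name: str, poll_seconds: float = 10, max_polls: int = 60):
    """Start a crawler, wait until it is READY again, and return the last crawl status."""
    glue.start_crawler(Name=crawler_name)
    for _ in range(max_polls):
        crawler = glue.get_crawler(Name=crawler_name)["Crawler"]
        if crawler["State"] == "READY":  # crawler is idle again
            return crawler.get("LastCrawl", {}).get("Status")
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawler {crawler_name} did not finish in time")


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   print(run_crawler_and_wait(boto3.client("glue"), "delta-lake-crawler"))
```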

## Verify the Table and its schema

Note: Make sure to upgrade the Athena query engine to version 3.
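
One way to check which engine version a workgroup actually uses is to read it from the workgroup configuration. The sketch below assumes a boto3-style Athena client and the default `primary` workgroup; `effective_engine_version` is a hypothetical helper name.

```python
def effective_engine_version(athena, workgroup: str = "primary") -> str:
    """Return the engine version label that the workgroup currently runs."""
    config = athena.get_work_group(WorkGroup=workgroup)["WorkGroup"]["Configuration"]
    return config["EngineVersion"]["EffectiveEngineVersion"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   print(effective_engine_version(boto3.client("athena")))
```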

In [32]:
# !aws athena start-query-execution \
#     --query-string "SELECT * FROM delta_insurance" \
#     --query-execution-context Database=datalake,Catalog=AwsDataCatalog \
#     --result-configuration OutputLocation="s3://athena-workshop-684199068947/"

# !aws athena get-query-results \
#     --query-execution-id 6a2386c3-2404-4832-88f0-f20a7982bd6d \
#     --output text

In [None]:
!pip install awswrangler

In [33]:
import awswrangler as wr  
df = wr.athena.read_sql_query(sql="SELECT * FROM delta_insurance", database="datalake")
df

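|   | policy_id | expiry_date | location_name | state_code | region_name | insured_value | business_type | earthquake_coverage | flood_coverage |
|---|-----------|-------------|---------------|------------|-------------|---------------|---------------|---------------------|----------------|
| 0 | 100595 | 2023-03-27 | Rural | NY | East | 2446600 | Farming | N | N |
| 1 | 100617 | 2023-03-27 | Urban | VT | Northeast | 8861500 | Office Bldg | N | N |
| 2 | 100580 | 2023-03-30 | Urban | NH | Northeast | 97920 | Office Bldg | Y | Y |
| 3 | 100581 | 2023-03-30 | Urban | NY | East | 5150000 | Apartment | Y | Y |
| 4 | 100475 | 2023-03-31 | Rural | WI | Midwest | 1451662 | Farming | N | N |
| 5 | 100503 | 2023-03-31 | Urban | NJ | East | 1761960 | Office Bldg | N | N |
| 6 | 100504 | 2023-03-31 | Rural | NY | East | 1649105 | Farming | N | N |
| 7 | 100616 | 2023-03-31 | Urban | NY | East | 2329500 | Apartment | N | N |
| 8 | 100611 | 2023-04-25 | Urban | NJ | East | 1595500 | Office Bldg | Y | Y |
| 9 | 100621 | 2023-04-25 | Urban | MI | Central | 394220 | Retail | N | N |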


In [38]:
query = """
SELECT * FROM delta_insurance
WHERE policy_id IN (100462, 100463,100475,110001,110002)
order by policy_id;
"""

df = wr.athena.read_sql_query(sql=query, database="datalake")
df

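|   | policy_id | expiry_date | location_name | state_code | region_name | insured_value | business_type | earthquake_coverage | flood_coverage |
|---|-----------|-------------|---------------|------------|-------------|---------------|---------------|---------------------|----------------|
| 0 | 100462 | 2023-03-25 | Urban | NY | East | 3400000 | Construction | Y | Y |
| 1 | 100463 | 2023-03-27 | Urban | NY | East | 15480000 | Office Bldg | Y | Y |
| 2 | 100475 | 2023-03-31 | Rural | WI | Midwest | 1451662 | Farming | N | N |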


## Upload the CDC data feed and run the CDC job

In [35]:
!head ./data/cdc-load.csv

U,100462,2024-12-31,Urban,NY,East,3400000,Construction,Y,Y
U,100463,2023-03-27,Urban,NY,East,1000000,Office Bldg,Y,Y
U,100475,2023-03-31,Rural,WI,Midwest,1451662,Farming,N,Y
I,110001,2024-03-31,Urban,CA,WEST,210000,Office Bldg,N,N
I,110002,2024-03-31,Rural,FL,East,975000,Retail,N,Y

The first column in the CDC feed describes the UPSERT operations. U is for updating an existing record, and I is for inserting a new record.
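
A quick way to see which operations a feed contains is to count the values in its first column. This sketch uses only the standard library and the five sample rows shown above; `count_operations` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# The five rows printed by `head` above.
CDC_SAMPLE = """\
U,100462,2024-12-31,Urban,NY,East,3400000,Construction,Y,Y
U,100463,2023-03-27,Urban,NY,East,1000000,Office Bldg,Y,Y
U,100475,2023-03-31,Rural,WI,Midwest,1451662,Farming,N,Y
I,110001,2024-03-31,Urban,CA,WEST,210000,Office Bldg,N,N
I,110002,2024-03-31,Rural,FL,East,975000,Retail,N,Y
"""


def count_operations(csv_text: str) -> dict:
    """Count rows per operation flag (first column) in a headerless CDC feed."""
    reader = csv.reader(io.StringIO(csv_text))
    return dict(Counter(row[0] for row in reader if row))


op_counts = count_operations(CDC_SAMPLE)
print(op_counts)  # {'U': 3, 'I': 2}
```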

In [34]:
!aws s3 cp data/cdc-load.csv $BUCKET_PATH/cdcload/

upload: data/cdc-load.csv to s3://wysde-assets/labs/glue-cdc-upsert/cdcload/cdc-load.csv


The change details are as follows:

-   100462 -- Expiry date changes to 12/31/2024
-   100463 -- Insured value changes to 1 million
-   100475 -- This policy is now under a new flood zone
-   110001 and 110002 -- New policies added to the table
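
The change details above can be derived by comparing the CDC rows against the rows returned by the earlier Athena query. The sketch below does this in plain Python with the values shown in this notebook; `describe_changes` is a hypothetical helper, and all values are compared as strings.

```python
COLUMNS = ["expiry_date", "location_name", "state_code", "region_name",
           "insured_value", "business_type", "earthquake_coverage", "flood_coverage"]

# Rows from the Athena query that ran before the CDC job, keyed by policy_id.
before = {
    100462: ["2023-03-25", "Urban", "NY", "East", "3400000", "Construction", "Y", "Y"],
    100463: ["2023-03-27", "Urban", "NY", "East", "15480000", "Office Bldg", "Y", "Y"],
    100475: ["2023-03-31", "Rural", "WI", "Midwest", "1451662", "Farming", "N", "N"],
}

# Rows from cdc-load.csv: operation flag, policy_id, then the eight data columns.
cdc_rows = [
    ["U", "100462", "2024-12-31", "Urban", "NY", "East", "3400000", "Construction", "Y", "Y"],
    ["U", "100463", "2023-03-27", "Urban", "NY", "East", "1000000", "Office Bldg", "Y", "Y"],
    ["U", "100475", "2023-03-31", "Rural", "WI", "Midwest", "1451662", "Farming", "N", "Y"],
    ["I", "110001", "2024-03-31", "Urban", "CA", "WEST", "210000", "Office Bldg", "N", "N"],
    ["I", "110002", "2024-03-31", "Rural", "FL", "East", "975000", "Retail", "N", "Y"],
]


def describe_changes(before: dict, cdc_rows: list) -> dict:
    """Map each policy_id to its changed columns, or to 'new policy' if unseen."""
    changes = {}
    for _op, policy_id, *values in cdc_rows:
        key = int(policy_id)
        if key not in before:
            changes[key] = "new policy"
            continue
        changes[key] = {
            col: (old, new)
            for col, old, new in zip(COLUMNS, before[key], values)
            if old != new
        }
    return changes


changes = describe_changes(before, cdc_rows)
for policy_id, change in changes.items():
    print(policy_id, change)
```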

Run CDC-Load-Job. This job applies the changes to the Delta Lake table.

In [37]:
!aws glue start-job-run --job-name CDC-Load-Job

-------------------------------------------------------------------------------------
|                                    StartJobRun                                    |
+----------+------------------------------------------------------------------------+
|  JobRunId|  jr_0f3e63260a9103ff26714526a8431c222581abc4e40816bbe7b15b39f0623e4c   |
+----------+------------------------------------------------------------------------+

Run the query again:

In [39]:
query = """
SELECT * FROM delta_insurance
WHERE policy_id IN (100462, 100463,100475,110001,110002)
order by policy_id;
"""

df = wr.athena.read_sql_query(sql=query, database="datalake")
df

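|   | policy_id | expiry_date | location_name | state_code | region_name | insured_value | business_type | earthquake_coverage | flood_coverage |
|---|-----------|-------------|---------------|------------|-------------|---------------|---------------|---------------------|----------------|
| 0 | 100462 | 2024-12-31 | Urban | NY | East | 3400000 | Construction | Y | Y |
| 1 | 100463 | 2023-03-27 | Urban | NY | East | 1000000 | Office Bldg | Y | Y |
| 2 | 100475 | 2023-03-31 | Rural | WI | Midwest | 1451662 | Farming | N | Y |
| 3 | 110001 | 2024-03-31 | Urban | CA | WEST | 210000 | Office Bldg | N | N |
| 4 | 110002 | 2024-03-31 | Rural | FL | East | 975000 | Retail | N | Y |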


As shown in the above output, the changes in the CDC data feed are reflected in the Athena query results.

Organizations are continuously looking for high-performance, cost-effective, and scalable analytical solutions to extract value from their operational data sources in near-real time. The analytical platform should be ready to receive changes in the operational data as soon as they occur. Typical data lake solutions struggle to handle changes in source data; the Delta Lake framework can close this gap. This lab demonstrated how to build data lakes for UPSERT operations using AWS Glue and native Delta Lake tables, and how to query AWS Glue tables from Athena. You can implement large-scale UPSERT data operations using AWS Glue and Delta Lake, and perform analytics using Amazon Athena.