## Step 1: Deploy solution resources with AWS CloudFormation

Deploy the stack by running the following AWS CLI command. It creates:

- Two S3 buckets: one for scripts and query results, and one for the data lake storage
- An Athena workgroup
- An EMR Serverless application
- An AWS Glue database and tables on external public S3 buckets of TPC-DS data
- An AWS Glue database for the data lake
- An [AWS Identity and Access Management](http://aws.amazon.com/iam) (IAM) role and policies

In [3]:
!aws cloudformation create-stack \
--stack-name DataLakehouse \
--template-body file://cfn.yml \
--capabilities CAPABILITY_NAMED_IAM

{
    "StackId": "arn:aws:cloudformation:us-east-1:684199068947:stack/DataLakehouse/a6dd6700-c014-11ed-b4c4-0ae42d5d0cc1"
}


When the stack creation is complete, check the **Outputs** tab of the stack to verify the resources created.
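
To check the result from the notebook instead of the console, you can block until creation finishes with `aws cloudformation wait stack-create-complete --stack-name DataLakehouse` and then read the outputs from `aws cloudformation describe-stacks`. The following is a minimal sketch that turns that JSON response into a lookup table. The sample payload and its output keys (`ExampleBucketName`, `ExampleWorkgroup`) are hypothetical placeholders; the real keys depend on `cfn.yml`.

```python
import json


def stack_outputs(describe_stacks_json: str) -> dict:
    """Map OutputKey -> OutputValue for the first stack in a describe-stacks response."""
    stack = json.loads(describe_stacks_json)["Stacks"][0]
    if stack["StackStatus"] != "CREATE_COMPLETE":
        raise RuntimeError(f"Stack is not ready yet: {stack['StackStatus']}")
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


# Hypothetical response; the real output keys depend on cfn.yml.
sample_response = json.dumps({
    "Stacks": [{
        "StackName": "DataLakehouse",
        "StackStatus": "CREATE_COMPLETE",
        "Outputs": [
            {"OutputKey": "ExampleBucketName", "OutputValue": "datalake-123456789012-us-east-1"},
            {"OutputKey": "ExampleWorkgroup", "OutputValue": "DatalakeWorkgroup"},
        ],
    }]
})

outputs = stack_outputs(sample_response)
for key, value in outputs.items():
    print(f"{key}: {value}")
```

In a notebook cell, the real response could be captured with `raw = !aws cloudformation describe-stacks --stack-name DataLakehouse` and passed in as `stack_outputs("\n".join(raw))`.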

![cfn](https://user-images.githubusercontent.com/62965911/224493013-44176a92-7986-498f-9b59-3186022493f8.png)

## Step 2: Upload Spark scripts to Amazon S3

In [10]:
AWS_ACCOUNT_ID = !aws sts get-caller-identity --query "Account" --output text
AWS_ACCOUNT_ID = AWS_ACCOUNT_ID[0]
AWS_ACCOUNT_ID

'684199068947'

In [11]:
%%writefile ingest-iceberg.py
from pyspark.sql import SparkSession


spark = SparkSession \
        .builder \
        .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
        .enableHiveSupport() \
        .getOrCreate()    


#Variables 
DB_NAME = "datalake"
TABLE_NAME_CUSTOMER = "customer_iceberg"
TABLE_NAME_WEBSALES = "web_sales_iceberg"
TABLE_NAME_CADRESS = "customer_address_iceberg"
TABLE_NAME_DATEDIM = "date_dim_iceberg"
TABLE_NAME_HOUSEHOLD = "household_demographics_iceberg"
TABLE_NAME_INCOMEBAND = "income_band_iceberg"
TABLE_NAME_ITEM = "item_iceberg"
TABLE_NAME_PROMOTION = "promotion_iceberg"
TABLE_NAME_TIMEDIM = "time_dim_iceberg"
TABLE_NAME_WEBPAGE = "web_page_iceberg"
TABLE_NAME_WEBRETURNS =  "web_returns_iceberg"
TPC_DS_DATABASE = "tpc-source"


#Create the customer table in Iceberg 
spark.sql(f"""
    CREATE OR REPLACE TABLE  dev.`{DB_NAME}`.`{TABLE_NAME_CUSTOMER}`(
        c_customer_sk             int,
        c_customer_id             string,
        c_current_cdemo_sk        int,
        c_current_hdemo_sk        int,
        c_current_addr_sk         int,
        c_first_shipto_date_sk    int,
        c_first_sales_date_sk     int,
        c_salutation              string,
        c_first_name              string,
        c_last_name               string,
        c_preferred_cust_flag     string,
        c_birth_day               int,
        c_birth_month             int,
        c_birth_year              int,
        c_birth_country           string,
        c_login                   string,
        c_email_address           string,
        c_last_review_date        string
    )
    USING iceberg
    PARTITIONED BY (c_birth_country)
    OPTIONS ('format-version'='2')
    """)

#Insert data into customer table
spark.sql(f"""
    INSERT INTO dev.`{DB_NAME}`.`{TABLE_NAME_CUSTOMER}` 
    SELECT *
    FROM `{TPC_DS_DATABASE}`.customer
    """)


#Create the websales table in Iceberg  
spark.sql(f"""
    CREATE OR REPLACE TABLE  dev.`{DB_NAME}`.`{TABLE_NAME_WEBSALES}`(
        ws_sold_date_sk           int,
        ws_sold_time_sk           int,
        ws_ship_date_sk           int,
        ws_item_sk                int,
        ws_bill_customer_sk       int,
        ws_bill_cdemo_sk          int,
        ws_bill_hdemo_sk          int,
        ws_bill_addr_sk           int,
        ws_ship_customer_sk       int,
        ws_ship_cdemo_sk          int,
        ws_ship_hdemo_sk          int,
        ws_ship_addr_sk           int,
        ws_web_page_sk            int,
        ws_web_site_sk            int,
        ws_ship_mode_sk           int,
        ws_warehouse_sk           int,
        ws_promo_sk               int,
        ws_order_number           int,
        ws_quantity               int,
        ws_wholesale_cost         double,
        ws_list_price             double,
        ws_sales_price            double,
        ws_ext_discount_amt       double,
        ws_ext_sales_price        double,
        ws_ext_wholesale_cost     double,
        ws_ext_list_price         double,
        ws_ext_tax                double,
        ws_coupon_amt             double,
        ws_ext_ship_cost          double,
        ws_net_paid               double,
        ws_net_paid_inc_tax       double,
        ws_net_paid_inc_ship      double,
        ws_net_paid_inc_ship_tax  double,
        ws_net_profit             double
    )
    USING iceberg
    PARTITIONED BY (ws_warehouse_sk)
    OPTIONS ('format-version'='2')
    """)

#Insert data into websales table
spark.sql(f"""
    INSERT INTO dev.`{DB_NAME}`.`{TABLE_NAME_WEBSALES}`
    SELECT *
    FROM `{TPC_DS_DATABASE}`.web_sales
    """)


#Create the customer_address table in Iceberg
spark.sql(f"""
    CREATE OR REPLACE TABLE  dev.`{DB_NAME}`.`{TABLE_NAME_CADRESS}`(
        ca_address_sk          int,
        ca_address_id          string,
        ca_street_number       string,
        ca_street_name         string,
        ca_street_type         string,
        ca_suite_number        string,
        ca_city                string,
        ca_county              string,
        ca_state               string,
        ca_zip                 string,
        ca_country             string,
        ca_gmt_offset          float,
        ca_location_type       string
    )
    USING iceberg
    PARTITIONED BY (ca_country, ca_city)
    OPTIONS ('format-version'='2')
    """)

#Insert data into customer address table
spark.sql(f"""
    INSERT INTO dev.`{DB_NAME}`.`{TABLE_NAME_CADRESS}`
    SELECT *
    FROM `{TPC_DS_DATABASE}`.customer_address
    """)


#Create the date_dim table in Iceberg
spark.sql(f"""
    CREATE OR REPLACE TABLE  dev.`{DB_NAME}`.`{TABLE_NAME_DATEDIM}`(
        d_date_sk              int,
        d_date_id              string,
        d_date                 date,
        d_month_seq            int,
        d_week_seq             int,
        d_quarter_seq          int,
        d_year                 int,
        d_dow                  int,
        d_moy                  int,
        d_dom                  int,
        d_qoy                  int,
        d_fy_year              int,
        d_fy_quarter_seq       int,
        d_fy_week_seq          int,
        d_day_name             string,
        d_quarter_name         string,
        d_holiday              string,
        d_weekend              string,
        d_following_holiday    string,
        d_first_dom            int,
        d_last_dom             int,
        d_same_day_ly          int,
        d_same_day_lq          int,
        d_current_day          string,
        d_current_week         string,
        d_current_month        string,
        d_current_quarter      string,
        d_current_year         string
    )
    USING iceberg
    PARTITIONED BY (d_year)
    OPTIONS ('format-version'='2')
    """)

#insert date_dim data
spark.sql(f"""
    INSERT INTO dev.`{DB_NAME}`.`{TABLE_NAME_DATEDIM}`
    SELECT *
    FROM `{TPC_DS_DATABASE}`.date_dim
    """)

#Create the table household_demographics
spark.sql(f"""
    CREATE OR REPLACE TABLE  dev.`{DB_NAME}`.`{TABLE_NAME_HOUSEHOLD}`(
        hd_demo_sk             int,
        hd_income_band_sk      int,
        hd_buy_potential       string,
        hd_dep_count           int,
        hd_vehicle_count       int
    )
    USING iceberg
    PARTITIONED BY (hd_buy_potential)
    OPTIONS ('format-version'='2')
    """)

#insert household_demographic data
spark.sql(f"""
    INSERT INTO dev.`{DB_NAME}`.`{TABLE_NAME_HOUSEHOLD}`
    SELECT *
    FROM `{TPC_DS_DATABASE}`.household_demographics
    """)    


#Create the table income_band
spark.sql(f"""
    CREATE OR REPLACE TABLE  dev.`{DB_NAME}`.`{TABLE_NAME_INCOMEBAND}`(   
        ib_income_band_sk      int,
        ib_lower_bound         int,
        ib_upper_bound         int
    )
    USING iceberg
    OPTIONS ('format-version'='2')
    """)

#insert income_band data
spark.sql(f"""
    INSERT INTO dev.`{DB_NAME}`.`{TABLE_NAME_INCOMEBAND}`
    SELECT *
    FROM `{TPC_DS_DATABASE}`.income_band
    """)    



#Create table item
spark.sql(f"""
    CREATE OR REPLACE TABLE  dev.`{DB_NAME}`.`{TABLE_NAME_ITEM}`(   
        i_item_sk              int,
        i_item_id              string,
        i_rec_start_date       string,
        i_rec_end_date         string,
        i_item_desc            string,
        i_current_price        double,
        i_wholesale_cost       double,
        i_brand_id             int,
        i_brand                string,
        i_class_id             int,
        i_class                string,
        i_category_id          int,
        i_category             string,
        i_manufact_id          int,
        i_manufact             string,
        i_size                 string,
        i_formulation          string,
        i_color                string,
        i_units                string,
        i_container            string,
        i_manager_id           string,
        i_product_name         string
    )
    USING iceberg
    PARTITIONED BY (i_category)
    OPTIONS ('format-version'='2')
    """)

#insert item data
spark.sql(f"""
    INSERT INTO dev.`{DB_NAME}`.`{TABLE_NAME_ITEM}`
    SELECT *
    FROM `{TPC_DS_DATABASE}`.item
    """)   

#Create the promotion table promotion
spark.sql(f"""
    CREATE OR REPLACE TABLE  dev.`{DB_NAME}`.`{TABLE_NAME_PROMOTION}`(   
        p_promo_sk             int,
        p_promo_id             string,
        p_start_date_sk        int,
        p_end_date_sk          int,
        p_item_sk              int,
        p_cost                 double,
        p_response_target      int,
        p_promo_name           string,
        p_channel_dmail        string,
        p_channel_email        string,
        p_channel_catalog      string,
        p_channel_tv           string,
        p_channel_radio        string,
        p_channel_press        string,
        p_channel_event        string,
        p_channel_demo         string,
        p_channel_details      string,
        p_purpose              string,
        p_discount_active      string
    )
    USING iceberg
    PARTITIONED BY (p_purpose)
    OPTIONS ('format-version'='2')
    """)

#insert promotion data
spark.sql(f"""
    INSERT INTO dev.`{DB_NAME}`.`{TABLE_NAME_PROMOTION}`
    SELECT *
    FROM `{TPC_DS_DATABASE}`.promotion
    """)   


#Create the time_dim table in Iceberg
spark.sql(f"""
    CREATE OR REPLACE TABLE  dev.`{DB_NAME}`.`{TABLE_NAME_TIMEDIM}`(   
        t_time_sk              int,
        t_time_id              string,
        t_time                 int,
        t_hour                 int,
        t_minute               int,
        t_second               int,
        t_am_pm                string,
        t_shift                string,
        t_sub_shift            string,
        t_meal_time            string
    )
    USING iceberg
    PARTITIONED BY (t_hour)
    OPTIONS ('format-version'='2')
    """)


#insert time_dim data
spark.sql(f"""
    INSERT INTO dev.`{DB_NAME}`.`{TABLE_NAME_TIMEDIM}`
    SELECT *
    FROM `{TPC_DS_DATABASE}`.time_dim
    """)   


#Create the web_page table in Iceberg
spark.sql(f"""
    CREATE OR REPLACE TABLE  dev.`{DB_NAME}`.`{TABLE_NAME_WEBPAGE}`(   
        wp_web_page_sk         int,
        wp_web_page_id         string,
        wp_rec_start_date      string,
        wp_rec_end_date        string,
        wp_creation_date_sk    int,
        wp_access_date_sk      int,
        wp_autogen_flag        string,
        wp_customer_sk         int,
        wp_url                 string,
        wp_type                string,
        wp_char_count          int,
        wp_link_count          int,
        wp_image_count         int,
        wp_max_ad_count        int
    )
    USING iceberg
    PARTITIONED BY (wp_rec_start_date )
    OPTIONS ('format-version'='2')
    """)

#insert web_page data
spark.sql(f"""
    INSERT INTO dev.`{DB_NAME}`.`{TABLE_NAME_WEBPAGE}`
    SELECT *
    FROM `{TPC_DS_DATABASE}`.web_page
    """)   


#Create the web_returns table in Iceberg
spark.sql(f"""
    CREATE OR REPLACE TABLE  dev.`{DB_NAME}`.`{TABLE_NAME_WEBRETURNS}`( 
        ws_sold_date_sk             int,
        ws_sold_time_sk             int,
        wr_returned_date_sk         int,   
        wr_returned_time_sk         int, 
        wr_item_sk                  int, 
        wr_refunded_customer_sk     int,
        wr_refunded_cdemo_sk        int,   
        wr_refunded_hdemo_sk        int,   
        wr_refunded_addr_sk         int,    
        wr_returning_customer_sk    int,
        wr_returning_cdemo_sk       int,   
        wr_returning_hdemo_sk       int,  
        wr_returning_addr_sk        int,   
        wr_web_page_sk              int,         
        wr_reason_sk                int,           
        wr_order_number             int,
        wr_return_quantity          int,     
        wr_return_amt               double,  
        wr_return_tax               double,     
        wr_return_amt_inc_tax       double,
        wr_fee                      double,         
        wr_return_ship_cost         double,
        wr_refunded_cash            double,   
        wr_reversed_charge          double, 
        wr_account_credit           double,  
        wr_net_loss                 double 
    )
    USING iceberg
    PARTITIONED BY (wr_reason_sk)
    OPTIONS ('format-version'='2')
    """)
       
#insert web_returns data
spark.sql(f"""
    INSERT INTO dev.`{DB_NAME}`.`{TABLE_NAME_WEBRETURNS}`
    SELECT *
    FROM `{TPC_DS_DATABASE}`.web_returns
    """) 

Overwriting ingest-iceberg.py
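
The script above repeats the same `CREATE OR REPLACE TABLE` / `INSERT INTO` pair for every table. One way to cut that repetition is to generate both statements from a table definition. The sketch below only builds the SQL strings (it does not need Spark to run); the helper names `create_table_sql` and `insert_sql` are hypothetical and not part of the original script.

```python
def create_table_sql(catalog, database, table, columns, partition_by=()):
    """Build a CREATE OR REPLACE TABLE statement for an Iceberg v2 table."""
    column_list = ",\n        ".join(f"{name} {dtype}" for name, dtype in columns)
    partition = f"\n    PARTITIONED BY ({', '.join(partition_by)})" if partition_by else ""
    return (
        f"CREATE OR REPLACE TABLE {catalog}.`{database}`.`{table}` (\n"
        f"        {column_list}\n"
        f"    )\n"
        f"    USING iceberg{partition}\n"
        f"    OPTIONS ('format-version'='2')"
    )


def insert_sql(catalog, database, table, source_db, source_table):
    """Build the INSERT ... SELECT * statement that copies the TPC-DS source table."""
    return (
        f"INSERT INTO {catalog}.`{database}`.`{table}` "
        f"SELECT * FROM `{source_db}`.{source_table}"
    )


# Two of the smaller tables from the script, expressed as data.
income_band_ddl = create_table_sql(
    "dev", "datalake", "income_band_iceberg",
    [("ib_income_band_sk", "int"), ("ib_lower_bound", "int"), ("ib_upper_bound", "int")],
)
household_ddl = create_table_sql(
    "dev", "datalake", "household_demographics_iceberg",
    [("hd_demo_sk", "int"), ("hd_income_band_sk", "int"), ("hd_buy_potential", "string"),
     ("hd_dep_count", "int"), ("hd_vehicle_count", "int")],
    partition_by=("hd_buy_potential",),
)
income_band_insert = insert_sql("dev", "datalake", "income_band_iceberg", "tpc-source", "income_band")

print(household_ddl)
print(income_band_insert)
```

Inside the Spark job, each generated string would be passed to `spark.sql(...)` in the same order as in the script: create first, then insert.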


In [12]:
%%writefile update-item.py
from pyspark.sql import SparkSession

from pyspark.sql.functions import col, lit, when, concat



spark = SparkSession \
        .builder \
        .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
        .enableHiveSupport() \
        .getOrCreate()    


#Variables 
DB_NAME = "datalake"
TPC_DS_DATABASE = "tpc-source"
TABLE_NAME_ITEM = "item_iceberg"

#Select the source records with a price higher than 90
input_data = spark.sql(f"""
    SELECT *
    FROM `{TPC_DS_DATABASE}`.item
    WHERE i_current_price > 90.0
    """)

#Create a batch of update and insert records
temp_data = input_data.withColumn("i_brand",when(col("i_brand").like("corpnameless%"), "Unknown") \
                                            .otherwise(col("i_brand")))

update_data = temp_data.withColumn("i_item_id", when(temp_data.i_brand != "Unknown",(concat(col("i_item_id"), lit("N"))))\
                                                .otherwise(col("i_item_id")))


# update table with the batch of new inserts and updated records
update_data.createOrReplaceTempView("item_records")

spark.sql(f"""
    MERGE INTO dev.`{DB_NAME}`.`{TABLE_NAME_ITEM}` item
    USING item_records changed
        ON item.i_item_id = changed.i_item_id
    WHEN MATCHED THEN UPDATE SET item.i_brand = changed.i_brand
    WHEN NOT MATCHED THEN INSERT *
""")

Overwriting update-item.py


In [13]:
!aws s3 cp ingest-iceberg.py s3://datalake-resources-{AWS_ACCOUNT_ID}-us-east-1/scripts/
!aws s3 cp update-item.py s3://datalake-resources-{AWS_ACCOUNT_ID}-us-east-1/scripts/

upload: ./ingest-iceberg.py to s3://datalake-resources-684199068947-us-east-1/scripts/ingest-iceberg.py
upload: ./update-item.py to s3://datalake-resources-684199068947-us-east-1/scripts/update-item.py


## Step 3: Create Iceberg tables and ingest TPC-DS data

To create your Iceberg tables and ingest the data, complete the following steps:

1. On the Amazon EMR console, choose **EMR Serverless** in the navigation pane.
2. Choose **Manage applications**.
3. Choose the application `datalake-app`.
4. Choose **Start application**.

Once started, the application provisions the pre-initialized capacity configured at creation (one Spark driver and two Spark executors). Pre-initialized capacity consists of resources that are provisioned when you start your application and can be used immediately when you submit jobs. However, they incur charges while the application is in a started state, even if they are not used. By default, the application is set to stop when idle for 15 minutes.

Now that the EMR application has started, we can submit the Spark ingest job `ingest-iceberg.py`. The job creates the Iceberg tables and then loads data from the previously created AWS Glue Data Catalog tables on TPC-DS data in an external bucket.

5. Navigate to the `datalake-app`.
6. On the **Job runs** tab, choose **Submit job**.
7. For **Name**, enter `ingest-data`.
8. For **Runtime role**, choose the IAM role created by the CloudFormation stack.
9. For **Script location**, enter the S3 path of the script in your resources bucket (`s3://datalake-resources-`_<####>_`-us-east-1/scripts/ingest-iceberg.py`).
10. Under **Spark properties**, choose **Edit in text**.
11. Enter the following properties, replacing `<BUCKET_NAME>` with your data lake bucket name `datalake-`_<####>_`-us-east-1` (not `datalake-resources`):
    ```text
    --conf spark.executor.cores=2 --conf spark.executor.memory=4g --conf spark.driver.cores=2 --conf spark.driver.memory=8g --conf spark.executor.instances=2 --conf spark.jars=/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.dev.warehouse=s3://<BUCKET_NAME>/warehouse --conf spark.sql.catalog.dev=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.dev.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=myIcebergLockTab --conf spark.dynamicAllocation.maxExecutors=8 --conf spark.driver.maxResultSize=1G --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
    ```
12. Submit the job.

![ingest-emr](https://user-images.githubusercontent.com/62965911/224493016-4889eb63-e5bb-4b85-881b-99f2a352ea7c.png)
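
Because the Spark properties are entered as one long line, it is easy to mistype the bucket name or drop a `--conf`. The following sketch builds the string from a dictionary and wraps it in the `jobDriver` structure used by `aws emr-serverless start-job-run`. It covers only the Iceberg catalog subset of the properties listed above; the resource settings would be added the same way. The helper names and the account ID `123456789012` are hypothetical.

```python
def iceberg_conf(bucket_name: str, catalog: str = "dev") -> dict:
    """Iceberg/Glue catalog properties taken from the job configuration above."""
    return {
        "spark.jars": "/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar",
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"spark.sql.catalog.{catalog}.warehouse": f"s3://{bucket_name}/warehouse",
    }


def spark_submit_parameters(conf: dict) -> str:
    """Render a dict as the '--conf key=value' string expected by the console."""
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


# Hypothetical account ID; use the one returned by `aws sts get-caller-identity`.
account_id = "123456789012"
params = spark_submit_parameters(iceberg_conf(f"datalake-{account_id}-us-east-1"))

# Job driver in the shape accepted by `aws emr-serverless start-job-run --job-driver`.
job_driver = {
    "sparkSubmit": {
        "entryPoint": f"s3://datalake-resources-{account_id}-us-east-1/scripts/ingest-iceberg.py",
        "sparkSubmitParameters": params,
    }
}
print(params)
```

Note that the data lake bucket (`datalake-...`) holds the warehouse, while the script itself lives in the resources bucket (`datalake-resources-...`).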

## Step 4: Query Iceberg tables

In this section, we provide examples of data warehouse queries from TPC-DS on the Iceberg tables.

1. On the Athena console, open the query editor.
2. For **Workgroup**, switch to `DatalakeWorkgroup`.
3. Choose **Acknowledge**. The queries in `DatalakeWorkgroup` will run on Athena engine version 3.
4. On the **Saved queries** tab, choose a query to run on your Iceberg tables.

The following queries are listed:

- **Query3 –** Report the total extended sales price per item brand of a specific manufacturer for all sales in a specific month of the year.
- **Query45 –** Report the total web sales for customers in specific zip codes, cities, counties, or states, or specific items for a given year and quarter.
- **Query52 –** Report the total of extended sales price for all items of a specific brand in a specific year and month.
- **Query6 –** List all the states with at least 10 customers who during a given month bought items with the price tag at least 20% higher than the average price of items in the same category.
- **Query75 –** For 2 consecutive years, track the sales of items by brand, class, and category.
- **Query86a –** Roll up the web sales for a given year by category and class, and rank the sales among peers within the parent. For each group, compute the sum of sales and location with the hierarchy and rank within the group.

These queries are examples of queries used in decision-making and reporting in an organization. You can run them in the order you want. For this lab, we start with `Query3`.

5. Before you run the query, confirm that **Database** is set to `datalake`.
6. Now you can run the query.
7. Repeat these steps to run the other queries.

![athena-ingest](https://user-images.githubusercontent.com/62965911/224493002-a7805257-ea4b-4b4c-9d86-5c2ed8405d8b.png)
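
The console steps above amount to three settings: the SQL text, the `datalake` database, and the `DatalakeWorkgroup` workgroup. If you want to run a query from the notebook instead, those settings map onto the request parameters of the Athena `StartQueryExecution` API. This is a minimal sketch that only assembles the request; the helper name `athena_query_request` is hypothetical, and the actual call (shown in a comment) assumes `boto3` is installed and credentials are configured.

```python
def athena_query_request(sql: str,
                         database: str = "datalake",
                         workgroup: str = "DatalakeWorkgroup") -> dict:
    """Assemble keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "WorkGroup": workgroup,
    }


request = athena_query_request('SELECT count(*) FROM "datalake"."item_iceberg"')
print(request)

# With boto3 available, the request could be submitted like this:
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
```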

## Step 5: Update the item table

After running the queries, we prepare a batch of updates and inserts of records into the `item` table.

1. First, run the following query to count the number of records in the `item` Iceberg table:

```sql
SELECT count(*) FROM "datalake"."item_iceberg";
```

This should return 102,000 records.

2. Select item records with a price higher than $90:

```sql
SELECT count(*) FROM "datalake"."item_iceberg" WHERE i_current_price > 90.0;
```

This will return 1,112 records.

The `update-item.py` job takes these 1,112 records, modifies 11 records to change the name of the brand to `Unknown`, and changes the remaining 1,101 records’ `i_item_id` key to flag them as new records. As a result, a batch of 11 updates and 1,101 inserts is merged into the `item_iceberg` table.

The 11 records to be updated are those with a price higher than $90 and a brand name that starts with `corpnameless`.

3. Run the following query:

```sql
SELECT count(*) FROM "datalake"."item_iceberg" WHERE i_current_price > 90.0 AND i_brand LIKE 'corpnameless%';
```

The result is 11 records. The `update-item.py` job replaces the brand name with `Unknown` and merges the batch into the Iceberg table.

Now you can return to the EMR Serverless console and run the job on the EMR Serverless application.

4. On the application details page, choose **Submit job**.
5. For **Name**, enter `update-item-job`.
6. For **Runtime role**, use the same role that you used previously.
7. For **S3 URI**, enter the `update-item.py` script location (`s3://datalake-resources-`_<####>_`-us-east-1/scripts/update-item.py`).
8. Under **Spark properties**, choose **Edit in text**.
9. Enter the following properties, replacing `<BUCKET-NAME>` with your data lake bucket name `datalake-`_<####>_`-us-east-1`:

```text
--conf spark.executor.cores=2 --conf spark.executor.memory=8g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=2 --conf spark.jars=/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.dev=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.dev.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=myIcebergLockTab --conf spark.dynamicAllocation.maxExecutors=4 --conf spark.driver.maxResultSize=1G --conf spark.sql.catalog.dev.warehouse=s3://<BUCKET-NAME>/warehouse --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
```

10. Then submit the job.
11. After the job finishes successfully, return to the Athena console and run the following query:

```sql
SELECT count(*) FROM "datalake"."item_iceberg";
```

The returned result is 103,101 = 102,000 + (1,112 – 11). The batch was merged successfully.

![athena-merge](https://user-images.githubusercontent.com/62965911/224493005-b03af543-06a4-4570-afe4-138be9d68edc.png)
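
To see why the count grows by the number of inserts only, the merge logic of `update-item.py` can be replayed on a toy dataset in plain Python. This is a simplified sketch, not the Spark job itself: it assumes `i_item_id` is unique and that no values are null, and the four sample rows are made up for illustration.

```python
def prepare_batch(source_items, price_threshold=90.0):
    """Mimic update-item.py: rename 'corpnameless*' brands, re-key every other row."""
    batch = []
    for item in source_items:
        if item["i_current_price"] <= price_threshold:
            continue
        row = dict(item)
        if row["i_brand"].startswith("corpnameless"):
            row["i_brand"] = "Unknown"                 # key unchanged -> will match (update)
        else:
            row["i_item_id"] = row["i_item_id"] + "N"  # new key -> will not match (insert)
        batch.append(row)
    return batch


def merge(target, batch):
    """Mimic MERGE INTO ... ON i_item_id: update i_brand when matched, insert otherwise."""
    by_id = {row["i_item_id"]: dict(row) for row in target}
    updates = inserts = 0
    for changed in batch:
        key = changed["i_item_id"]
        if key in by_id:
            by_id[key]["i_brand"] = changed["i_brand"]
            updates += 1
        else:
            by_id[key] = dict(changed)
            inserts += 1
    return list(by_id.values()), updates, inserts


# Made-up rows standing in for the item table.
items = [
    {"i_item_id": "A", "i_brand": "corpnameless #1", "i_current_price": 95.0},
    {"i_item_id": "B", "i_brand": "amalgexporti #1", "i_current_price": 99.0},
    {"i_item_id": "C", "i_brand": "corpnameless #2", "i_current_price": 10.0},
    {"i_item_id": "D", "i_brand": "edu packamalg #1", "i_current_price": 91.0},
]

batch = prepare_batch(items)
merged, updates, inserts = merge(items, batch)
print(len(items), len(batch), updates, inserts, len(merged))
```

Here the batch has 3 rows, 1 of which is an update, so the table grows from 4 to 4 + (3 - 1) = 6 rows, the same arithmetic as 102,000 + (1,112 - 11) in the lab.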

## Step 6: Time travel

To run a time travel query, complete the following steps:

1. Get the timestamp of the job run via the application details page on the EMR Serverless console, or the Spark UI on the History Server, as shown in the following screenshot.

![emr-1](https://user-images.githubusercontent.com/62965911/224493015-3f475945-a8a8-46d3-a0c3-7e14be6160da.png)

Pick a time shortly before you ran the update Spark job; a few minutes earlier is enough.

2. Convert the timestamp from the format `YYYY/MM/DD hh:mm:ss` to `YYYY-MM-DD hh:mm:ss.sss TZ`, which includes milliseconds and a time zone. For example, convert `2023/02/20 14:40:41` to `2023-02-20 14:40:41.000 UTC`.
3. On the Athena console, run the following query to count the `item` table records at a time before the update job, replacing `<TRAVEL_TIME>` with your timestamp in single quotes (for example, `'2023-02-20 14:40:41.000 UTC'`):

```sql
SELECT count(*) FROM "datalake"."item_iceberg" FOR TIMESTAMP AS OF TIMESTAMP <TRAVEL_TIME>;
```

![athena-1](https://user-images.githubusercontent.com/62965911/224492992-1261591b-a779-4536-aceb-c3389bff9fe8.png)

The query will give 102,000 as a result, the expected table size before running the update job.

4. Now you can run a query with a timestamp after the successful run of the update job (for example, `2023-02-20 15:06:00.000 UTC`):

```sql
SELECT count(*) FROM "datalake"."item_iceberg" FOR TIMESTAMP AS OF TIMESTAMP <TRAVEL_TIME>;
```

![athena-2](https://user-images.githubusercontent.com/62965911/224493007-e8b855ab-2bb5-4b02-b3c6-fc5fcae698fd.png)

The query will now give 103,101 as the size of the table at that time, after the update job successfully finished.
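
The timestamp conversion and query substitution in the steps above can be scripted. The sketch below assumes the time copied from the console is already in UTC (convert it first if your console shows local time); the helper names are hypothetical.

```python
from datetime import datetime


def to_athena_timestamp(console_time: str, tz: str = "UTC") -> str:
    """Convert 'YYYY/MM/DD hh:mm:ss' to 'YYYY-MM-DD hh:mm:ss.000 TZ'."""
    parsed = datetime.strptime(console_time, "%Y/%m/%d %H:%M:%S")
    return parsed.strftime("%Y-%m-%d %H:%M:%S") + f".000 {tz}"


def time_travel_count_query(travel_time: str,
                            database: str = "datalake",
                            table: str = "item_iceberg") -> str:
    """Build the time travel count query with the timestamp as a quoted literal."""
    return (
        f'SELECT count(*) FROM "{database}"."{table}" '
        f"FOR TIMESTAMP AS OF TIMESTAMP '{travel_time}';"
    )


travel_time = to_athena_timestamp("2023/02/20 14:40:41")
query = time_travel_count_query(travel_time)
print(travel_time)
print(query)
```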

Additionally, you can query in Athena based on the version ID of a snapshot in Iceberg. However, for more advanced use cases, such as rolling back to a given version or finding version IDs, you can use Iceberg’s SDK or Spark on Amazon EMR.
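
For the version-based variant, Athena accepts `FOR VERSION AS OF` followed by a snapshot ID in place of the timestamp clause. A minimal sketch of building such a query follows; the helper name is hypothetical and the snapshot ID is a made-up placeholder, so substitute one from your own table.

```python
def version_travel_count_query(snapshot_id: int,
                               database: str = "datalake",
                               table: str = "item_iceberg") -> str:
    """Build a count query pinned to a specific Iceberg snapshot ID."""
    if not isinstance(snapshot_id, int) or isinstance(snapshot_id, bool):
        raise TypeError("snapshot_id must be an integer")
    return (
        f'SELECT count(*) FROM "{database}"."{table}" '
        f"FOR VERSION AS OF {snapshot_id};"
    )


# Placeholder snapshot ID for illustration only.
version_query = version_travel_count_query(949530903748831860)
print(version_query)
```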

## Step 7: Clean up

Complete the following steps to clean up your resources:

In [None]:
#on the Amazon S3 console, empty your buckets
!aws s3 rm s3://datalake-684199068947-us-east-1 --recursive --quiet
!aws s3 rm s3://datalake-resources-684199068947-us-east-1 --recursive --quiet

In [23]:
#on the Athena console, delete the workgroup `DatalakeWorkgroup`
!aws athena delete-work-group --work-group DatalakeWorkgroup --recursive-delete-option

In [21]:
#on the EMR Studio console, stop the application `datalake-app`
!aws emr-serverless stop-application --application-id 00f8gkpvcl7vtf09

In [24]:
#on the AWS CloudFormation console, delete the CloudFormation stack
!aws cloudformation delete-stack --stack-name DataLakehouse
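
The cleanup cells above hardcode the account ID and application ID from this run. As a sketch of how to adapt them to your own environment, the following helper (a hypothetical name) generates the same commands, in the same order, from parameters. It only prints the commands and does not execute anything.

```python
def cleanup_commands(account_id: str, application_id: str, region: str = "us-east-1") -> list:
    """Return the cleanup CLI commands in the order used above."""
    return [
        f"aws s3 rm s3://datalake-{account_id}-{region} --recursive --quiet",
        f"aws s3 rm s3://datalake-resources-{account_id}-{region} --recursive --quiet",
        "aws athena delete-work-group --work-group DatalakeWorkgroup --recursive-delete-option",
        f"aws emr-serverless stop-application --application-id {application_id}",
        "aws cloudformation delete-stack --stack-name DataLakehouse",
    ]


# Hypothetical account ID; the application ID is the one shown in the cell above.
commands = cleanup_commands("123456789012", "00f8gkpvcl7vtf09")
for command in commands:
    print(command)
```

The buckets are emptied before the stack is deleted because CloudFormation cannot delete an S3 bucket that still contains objects.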