### Load Full E-commerce Data


In [0]:
# Load October CSV data
oct_events = spark.read.csv(
    "/Volumes/workspace/eccomerce/ecommerce_data/2019-Oct.csv",
    header=True,
    inferSchema=True
)

# Load November CSV data
nov_events = spark.read.csv(
    "/Volumes/workspace/eccomerce/ecommerce_data/2019-Nov.csv",
    header=True,
    inferSchema=True
)


### Rename Datasets

In [0]:
# Bind the October DataFrame to a shorter name (an alias; no data is copied)
df_oct = oct_events

# Bind the November DataFrame to a shorter name
df_nov = nov_events


### 1. Create Incremental Data
### Step 1: Create Incremental Dataset (New & Updated Records)

This step simulates incoming incremental data.
The dataset contains:
- Records that already exist (to test updates)
- Completely new records (to test inserts)

This temporary view will be used as the source for the MERGE operation.


In [0]:
%sql
-- Create incremental data (new + updated records)
CREATE OR REPLACE TEMP VIEW sales_updates AS
SELECT 101 AS order_id, 'Laptop' AS product, 3 AS qty, 1250 AS price
UNION ALL
SELECT 105 AS order_id, 'Keyboard' AS product, 2 AS qty, 80 AS price;


### View Incremental Data
### Step 2: Validate Incremental Data

Query the temporary view to verify the incremental records
before merging them into the Delta table.

In [0]:
%sql
SELECT * FROM sales_updates;


order_id,product,qty,price
101,Laptop,3,1250
105,Keyboard,2,80


### MERGE Incremental Data


In [0]:
%sql
-- Apply incremental updates and inserts into Delta table
MERGE INTO sales_delta t
USING sales_updates s
ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;


num_affected_rows,num_updated_rows,num_deleted_rows,num_inserted_rows
2,2,0,0
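
The counts above report two updated rows and no inserted rows, which is what MERGE produces when every `order_id` in the source already exists in the target. The DESCRIBE HISTORY output later in this notebook also records an earlier MERGE that inserted both rows. The following is a minimal in-memory sketch of that behavior, not how Delta Lake implements MERGE. The helper `merge_upsert` is hypothetical, and the sketch assumes the target started empty and the same source was merged twice.

```python
# Simplified model of:
#   WHEN MATCHED THEN UPDATE SET * / WHEN NOT MATCHED THEN INSERT *
# `merge_upsert` is a hypothetical helper for illustration, not a Delta Lake API.

def merge_upsert(target, source, key="order_id"):
    """Upsert `source` rows into `target` (a dict keyed by `key`); return row counts."""
    updated = inserted = 0
    for row in source:
        if row[key] in target:
            updated += 1    # key already present: WHEN MATCHED branch
        else:
            inserted += 1   # key not present: WHEN NOT MATCHED branch
        target[row[key]] = dict(row)
    return {"num_updated_rows": updated, "num_inserted_rows": inserted}


sales_updates = [
    {"order_id": 101, "product": "Laptop", "qty": 3, "price": 1250},
    {"order_id": 105, "product": "Keyboard", "qty": 2, "price": 80},
]

sales_delta = {}  # assumption: the target table starts empty
first_run = merge_upsert(sales_delta, sales_updates)
second_run = merge_upsert(sales_delta, sales_updates)

print(first_run)   # both keys are new, so both rows are inserted
print(second_run)  # both keys now exist, so both rows are updated
```

To exercise the insert and update branches in a single MERGE, the source would need at least one `order_id` that is absent from the target and one that is already present.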


### Verify MERGE Result

In [0]:
%sql
-- Validate final table after merge
SELECT * FROM sales_delta;


order_id,product,qty,price
105,Keyboard,2,80
101,Laptop,3,1250


### 2. Query Historical Versions

### i. Check Table History (Time Travel Setup)

In [0]:
%sql
-- View Delta table version history
DESCRIBE HISTORY sales_delta;


version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
2,2026-01-13T07:00:14.000Z,73029697663450,ruchiwange07@gmail.com,MERGE,"Map(predicate -> [""(order_id#13667 = order_id#13659)""], clusterBy -> [], matchedPredicates -> [{""actionType"":""update""}], statsOnLoad -> false, notMatchedBySourcePredicates -> [], notMatchedPredicates -> [{""actionType"":""insert""}])",,List(533720317512728),0113-064726-rmq3pplr-v2n,1.0,WriteSerializable,False,"Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 2, numTargetBytesAdded -> 3853, numTargetBytesRemoved -> 2405, numTargetDeletionVectorsAdded -> 0, numTargetRowsMatchedUpdated -> 2, executionTimeMs -> 3991, materializeSourceTimeMs -> 137, numTargetRowsInserted -> 0, numTargetRowsMatchedDeleted -> 0, numTargetDeletionVectorsUpdated -> 0, scanTimeMs -> 1797, numTargetRowsUpdated -> 2, numOutputRows -> 2, numTargetDeletionVectorsRemoved -> 0, numTargetRowsNotMatchedBySourceUpdated -> 0, numTargetChangeFilesAdded -> 0, numSourceRows -> 2, numTargetFilesRemoved -> 2, numTargetRowsNotMatchedBySourceDeleted -> 0, rewriteTimeMs -> 1953)",,Databricks-Runtime/17.3.x-aarch64-photon-scala2.13
1,2026-01-13T06:58:12.000Z,73029697663450,ruchiwange07@gmail.com,MERGE,"Map(predicate -> [""(order_id#13350 = order_id#13342)""], clusterBy -> [], matchedPredicates -> [{""actionType"":""update""}], statsOnLoad -> true, notMatchedBySourcePredicates -> [], notMatchedPredicates -> [{""actionType"":""insert""}])",,List(533720317512728),0113-064726-rmq3pplr-v2n,0.0,WriteSerializable,False,"Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 2, numTargetBytesAdded -> 2405, numTargetBytesRemoved -> 0, numTargetDeletionVectorsAdded -> 0, numTargetRowsMatchedUpdated -> 0, executionTimeMs -> 3455, materializeSourceTimeMs -> 252, numTargetRowsInserted -> 2, numTargetRowsMatchedDeleted -> 0, numTargetDeletionVectorsUpdated -> 0, scanTimeMs -> 1285, numTargetRowsUpdated -> 0, numOutputRows -> 2, numTargetDeletionVectorsRemoved -> 0, numTargetRowsNotMatchedBySourceUpdated -> 0, numTargetChangeFilesAdded -> 0, numSourceRows -> 2, numTargetFilesRemoved -> 0, numTargetRowsNotMatchedBySourceDeleted -> 0, rewriteTimeMs -> 1787)",,Databricks-Runtime/17.3.x-aarch64-photon-scala2.13
0,2026-01-13T06:58:01.000Z,73029697663450,ruchiwange07@gmail.com,CREATE TABLE,"Map(partitionBy -> [], clusterBy -> [], description -> null, isManaged -> true, properties -> {""delta.parquet.compression.codec"":""zstd"",""delta.enableDeletionVectors"":""true"",""delta.writePartitionColumnsToParquet"":""true"",""delta.enableRowTracking"":""true"",""delta.rowTracking.materializedRowCommitVersionColumnName"":""_row-commit-version-col-cd85bdf0-18b8-4e60-a08d-a7295a472e2c"",""delta.rowTracking.materializedRowIdColumnName"":""_row-id-col-380183c4-e824-4fb4-aa93-ded91bd4575b""}, statsOnLoad -> false)",,List(533720317512728),0113-064726-rmq3pplr-v2n,,WriteSerializable,True,Map(),,Databricks-Runtime/17.3.x-aarch64-photon-scala2.13
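
The version numbers and timestamps listed in the history can be used to read earlier states of the table with Delta time travel (`VERSION AS OF` and `TIMESTAMP AS OF`). The sketch below only builds the query strings, so it also runs outside Databricks. The helpers `version_query` and `timestamp_query` are hypothetical names, and the example timestamp is an assumed value chosen to fall between versions 1 and 2, with the session time zone assumed to be UTC.

```python
# Build Delta time travel queries for versions listed by DESCRIBE HISTORY.
# `version_query` and `timestamp_query` are hypothetical helpers, not library functions.

def version_query(table: str, version: int) -> str:
    """Return a query that reads `table` as of a given commit version."""
    return f"SELECT * FROM {table} VERSION AS OF {int(version)}"


def timestamp_query(table: str, timestamp: str) -> str:
    """Return a query that reads `table` as of a given timestamp string."""
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


# Version 1 is the first MERGE in the history above.
q_v1 = version_query("sales_delta", 1)

# Assumed timestamp between version 1 (06:58:12) and version 2 (07:00:14).
q_ts = timestamp_query("sales_delta", "2026-01-13 06:59:00")

print(q_v1)
print(q_ts)

# In a Databricks notebook `spark` is predefined, so the query can be run directly.
if "spark" in globals():
    spark.sql(q_v1).show()
```

Both queries should resolve to the same snapshot here, since the timestamp form selects the latest version committed at or before the given time.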


### 3. OPTIMIZE Delta Table

In [0]:
%sql
-- Optimize entire Delta table (no partition column)
OPTIMIZE sales_delta;

path,metrics
,"List(0, 0, List(null, null, 0.0, 0, 0), List(null, null, 0.0, 0, 0), 0, null, null, 0, 0, 1, 1, true, 0, 0, 1768289372822, 1768289373303, 8, 0, null, List(0, 0), null, 4, 4, 0, 0, null)"


Observation:
The OPTIMIZE command completed successfully.
The first two values in the metrics (files added and files removed) are both 0, so no files were compacted.
This is expected for a low-volume table: with so little data there were no groups of small files to combine.


### 4. Clean Old Files


In [0]:
%sql
-- Remove unreferenced files older than 168 hours (7 days, the default retention threshold)
VACUUM sales_delta RETAIN 168 HOURS;


path


Observation:
VACUUM executed successfully and returned no paths, so no files were deleted.
This is expected for a newly created table: the files removed from the table by the earlier MERGE
are still within the 168-hour retention period and are kept for time travel.
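
VACUUM deletes only files that are no longer referenced by the table and are older than the retention threshold. The sketch below works through that arithmetic for this notebook, using the version 2 MERGE timestamp from DESCRIBE HISTORY. It is a simplified model rather than Delta Lake's actual eligibility check. The helper `eligible_for_vacuum` is hypothetical, and the VACUUM run time is an assumed value.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(hours=168)  # matches RETAIN 168 HOURS (7 days)


def eligible_for_vacuum(removed_at: datetime, now: datetime) -> bool:
    """Simplified rule: eligible once the file has been unreferenced longer than RETENTION."""
    return now - removed_at > RETENTION


# Timestamp of the version 2 MERGE, which removed 2 files from the table.
removed_at = datetime(2026, 1, 13, 7, 0, 14, tzinfo=timezone.utc)

# Assumption: VACUUM ran roughly 30 minutes later on the same day.
vacuum_time = datetime(2026, 1, 13, 7, 30, 0, tzinfo=timezone.utc)

print(eligible_for_vacuum(removed_at, vacuum_time))                      # False
print(eligible_for_vacuum(removed_at, vacuum_time + timedelta(days=8)))  # True
print(removed_at + RETENTION)  # earliest time these files become eligible
```

Under this model, the same VACUUM command would start returning deleted paths only after the retention window has elapsed.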


### Analytics & AI Readiness Preview




In [0]:
%sql
SELECT product, COUNT(*) AS orders
FROM sales_delta
GROUP BY product;

product,orders
Laptop,1
Keyboard,1


### AI Readiness Note

The Delta table used in this notebook is versioned and has been maintained
with incremental MERGE, OPTIMIZE, and VACUUM operations.
Its transaction history supports reproducible reads, which makes the dataset
a suitable starting point for downstream AI, ML, and analytics workloads.


## Conclusion:
This notebook demonstrates end-to-end Delta Lake lifecycle management
aligned with AI-ready data engineering best practices.
