In [0]:
%pip install pytest==8.4.2

# 🧾 Module 1 - Batch Ingestion with COPY INTO (Lookup Tables Example)

In this exercise, you'll simulate the daily ingestion of product catalog and user data into Bronze Delta tables using Databricks' `COPY INTO` command.

### Learning Objectives
By the end of this lab, you will be able to:
- Load batch data files into a Delta table using `COPY INTO`.
- Understand **idempotent ingestion** (avoiding duplicate file loads).
- Explore **Delta table history**.
- Discuss when to use `COPY INTO`, `MERGE`, or `OVERWRITE`.
- Remove duplicates to build a **clean Silver table** with only the latest records.  
- Create four tables in total: two in the Bronze schema and two in the Silver schema.

### Scenario
Your e-commerce platform receives daily CSV files with updated product and user information.  
Each file represents a full snapshot, with small updates day by day.  
Your task: safely ingest these files, preserving history in Bronze and cleaning them in Silver.

## Step 0 - Setup and Context

**TO DO:**

Define common variables such as:
- Volume locations: data, checkpoints, and schemas.
- Full table names (three-level namespace).
- Python imports, if needed (e.g. the `helpers.utils` package).

> **Optional:**  
> Validate the data in the volume you just created: `"/Volumes/capstone_dev/{{your_bronze_schema}}/raw_files/"`  
> You can use `dbutils.fs.ls` for that purpose.

In [0]:
from helpers import utils
from instructors.src.solutions.batch_ingestion import BatchIngestion

mod1 = BatchIngestion()
catalog_name = utils.get_param("catalog", "capstone_dev")

base_user = utils.get_base_user_schema()
schema_bronze = f"{base_user}_bronze"
schema_silver = f"{base_user}_silver"

# UC Volume path for governed data
volume_path = f"/Volumes/{catalog_name}/{schema_bronze}/raw_files/"

product_table_bronze = f"{catalog_name}.{schema_bronze}.products"
product_table_silver = f"{catalog_name}.{schema_silver}.products"

user_table_bronze = f"{catalog_name}.{schema_bronze}.users"
user_table_silver = f"{catalog_name}.{schema_silver}.users"

print(f"Source Volume Path: {volume_path}")
print(f"Product Bronze Table: {product_table_bronze}")
print(f"Product Silver Table: {product_table_silver}")
print(f"User Bronze Table: {user_table_bronze}")
print(f"User Silver Table: {user_table_silver}")


In [0]:
# Optional: Validate that 7 product files exist in the raw folder
product_file_path = volume_path + "products/"
files = dbutils.fs.ls(product_file_path)
print(f"Found {len(files)} files in {product_file_path}:")
for f in files:
    print("-", f.name)

In [0]:
# Optional: Validate that 7 user files exist in the raw folder
user_file_path = volume_path + "users/"
files = dbutils.fs.ls(user_file_path)
print(f"Found {len(files)} files in {user_file_path}:")
for f in files:
    print("-", f.name)

## Step 1 - Create a Product Bronze Delta Table

**TO DO:**
- We'll create a **managed Delta table** to store raw product data.  
- Partitioning by `last_modified` lets queries that filter on that date skip irrelevant partitions.

| COLUMN | DATA TYPE |
| :------- | :------: |
| product_id | INT |
| name | STRING |
| category | STRING |
| price | DOUBLE |
| last_modified | DATE |

> **Reflection:**  
> Why is `USING DELTA` important compared to plain Parquet files?
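
A minimal sketch of the kind of DDL this step calls for. The helper `build_bronze_ddl` and the schema name `demo_bronze` are hypothetical (the actual `create_bronze_products` implementation is not shown in this notebook); the function only assembles a `CREATE TABLE` statement from the column table above.

```python
def build_bronze_ddl(table_name: str, columns: dict, partition_col: str) -> str:
    """Assemble a CREATE TABLE statement for a managed, partitioned Delta table."""
    if partition_col not in columns:
        raise ValueError(f"Partition column {partition_col!r} is not in the schema")
    column_sql = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} (\n  {column_sql}\n)\n"
        "USING DELTA\n"
        f"PARTITIONED BY ({partition_col})"
    )


# Columns taken from the table above; the table name is a placeholder.
product_columns = {
    "product_id": "INT",
    "name": "STRING",
    "category": "STRING",
    "price": "DOUBLE",
    "last_modified": "DATE",
}
ddl = build_bronze_ddl("capstone_dev.demo_bronze.products", product_columns, "last_modified")
print(ddl)
# In the notebook you would run it with: spark.sql(ddl)
```

Omitting a `LOCATION` clause is what keeps the table managed; the same helper would work for the users table with a different column dictionary.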

In [0]:
display(mod1.create_bronze_products(product_table_bronze))
print("OK - Table created successfully.")

## Step 2 - Create a User Bronze Delta Table

**TO DO:**
- We'll create a **managed Delta table** to store raw user data.  
- Partitioning by `last_modified` lets queries that filter on that date skip irrelevant partitions.

| COLUMN | DATA TYPE |
| :------- | :------: |
| user_id | INT |
| name | STRING |
| email | STRING |
| phone | STRING |
| is_active | BOOLEAN |
| last_modified | DATE |



In [0]:
display(mod1.create_bronze_users(user_table_bronze))
print("OK - Table created successfully.")

## Step 3 - Load Daily Product Files (COPY INTO)

**TO DO:**  
We now load all product CSV files from the UC Volume into the Delta table using `COPY INTO`.

>**Key Note:**
>- **Idempotent ingestion:**  Databricks tracks loaded files; re-running `COPY INTO` skips them.
>- **Append-only:**  `COPY INTO` never updates or deletes data; it only appends. Ideal for **Bronze ingestion** (raw, immutable history).  
>- For changing lookup tables (like Products), you'd use **MERGE** or **OVERWRITE** later in Silver/Gold stages.
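
As an illustration of what the load might look like, the sketch below assembles a `COPY INTO` statement. The helper `build_copy_into` is hypothetical, and the format options assume the CSV files have a header row and that types can be inferred; the real `mod1.copy_into` may differ.

```python
def build_copy_into(table_name: str, source_path: str) -> str:
    """Assemble a COPY INTO statement that loads CSV files into a Delta table."""
    return (
        f"COPY INTO {table_name}\n"
        f"FROM '{source_path}'\n"
        "FILEFORMAT = CSV\n"
        # Assumption: files have a header row and column types can be inferred.
        "FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')\n"
        "COPY_OPTIONS ('mergeSchema' = 'false')"
    )


# Placeholder names for illustration only.
copy_sql = build_copy_into(
    "capstone_dev.demo_bronze.products",
    "/Volumes/capstone_dev/demo_bronze/raw_files/products/",
)
print(copy_sql)
# In the notebook you would run it with: display(spark.sql(copy_sql))
```

Running the same statement twice against an unchanged folder should load no new rows the second time, which is the idempotent behavior described in the note above.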


In [0]:
print("Running COPY INTO...\n")
display(mod1.copy_into(product_table_bronze, product_file_path))

## Step 4 - Load Daily User Files (COPY INTO)

**TO DO:**  
We now load all user CSV files from the UC Volume into the Delta table using `COPY INTO`.

>**Key Note:**
>- **Idempotent ingestion:**  Databricks tracks loaded files; re-running `COPY INTO` skips them.
>- **Append-only:**  `COPY INTO` never updates or deletes data; it only appends. Ideal for **Bronze ingestion** (raw, immutable history).  
>- For changing lookup tables (like Users), you'd use **MERGE** or **OVERWRITE** later in Silver/Gold stages.
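
To make the idempotency idea concrete, here is a toy in-memory model. It is not how Databricks implements file tracking; the class `ToyCopyInto` and the file names are invented purely to show why a re-run appends nothing.

```python
class ToyCopyInto:
    """Toy model of append-only, idempotent file ingestion (illustrative only)."""

    def __init__(self):
        self.loaded_files = set()  # names of files already ingested
        self.rows = []             # the append-only "table"

    def run(self, files: dict) -> int:
        """Append rows from files not seen before; return how many rows were added."""
        appended = 0
        for name in sorted(files):
            if name in self.loaded_files:
                continue  # already loaded, so skip it
            self.rows.extend(files[name])
            self.loaded_files.add(name)
            appended += len(files[name])
        return appended


loader = ToyCopyInto()
day_1 = {"users_day_1.csv": [{"user_id": 1}, {"user_id": 2}]}
first_run = loader.run(day_1)    # loads both rows
second_run = loader.run(day_1)   # same file again: nothing to load
day_2 = dict(day_1, **{"users_day_2.csv": [{"user_id": 1}]})
third_run = loader.run(day_2)    # only the new file is loaded
print(first_run, second_run, third_run)  # 2 0 1
```

Note that the third run appends a second row for `user_id` 1 instead of updating the first one, which is why Bronze accumulates duplicates that Silver later has to resolve.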

In [0]:
print("Running COPY INTO...\n")
display(mod1.copy_into(user_table_bronze, user_file_path))

## Step 5 - Inspect the Table and History

Let's explore the data and check the Delta table's history.


In [0]:
# Product table
query = f"SELECT * FROM {product_table_bronze}"
display(spark.sql(query))

In [0]:
# Product table history
query = f"DESCRIBE HISTORY {product_table_bronze}"
display(spark.sql(query))

In [0]:
# User table
query = f"SELECT * FROM {user_table_bronze}"
display(spark.sql(query))

In [0]:
# User table history
query = f"DESCRIBE HISTORY {user_table_bronze}"
display(spark.sql(query))

## Step 6 - Create a Deduplicated Product Silver Table

**SCENARIO:**  
- The Bronze table may include duplicates since each daily snapshot repeats the same products.  
- Some records were re-uploaded within the same files.

**TO DO:**
- Create the Silver Delta table `products`.
- Remove duplicates and keep the latest record per `product_id`.

**TIPS**  
- Use **window functions** and `last_modified` date.

> **Key Note:**  
> Bronze = Immutable raw history.  
> Silver = Clean, current state.  
> Gold = Aggregated or business-level views.
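
One possible shape for the window-function approach from the tip above, shown as a sketch. The helper `build_silver_dedup_sql` is hypothetical and the real `create_silver_products` may be written differently; it ranks rows per key by `last_modified` and keeps only the first.

```python
def build_silver_dedup_sql(bronze_table: str, silver_table: str, key_col: str,
                           order_col: str = "last_modified") -> str:
    """Assemble a CTAS statement that keeps the latest row per key."""
    return (
        f"CREATE OR REPLACE TABLE {silver_table} AS\n"
        "SELECT * EXCEPT (rn)\n"
        "FROM (\n"
        "  SELECT *,\n"
        f"         ROW_NUMBER() OVER (PARTITION BY {key_col} ORDER BY {order_col} DESC) AS rn\n"
        f"  FROM {bronze_table}\n"
        ")\n"
        "WHERE rn = 1"
    )


# Placeholder table names for illustration only.
dedup_sql = build_silver_dedup_sql(
    "capstone_dev.demo_bronze.products",
    "capstone_dev.demo_silver.products",
    key_col="product_id",
)
print(dedup_sql)
# In the notebook you would run it with: spark.sql(dedup_sql)
```

`ROW_NUMBER()` assigns exactly one row the rank 1 per key, so rows that tie on `last_modified` are resolved arbitrarily unless a further tie-breaker is added to the `ORDER BY`.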


In [0]:
display(mod1.create_silver_products(product_table_bronze, product_table_silver))
print(f"OK - Silver table created: {product_table_silver}")

## Step 7 - Create a Deduplicated User Silver Table

**SCENARIO:**  
- The Bronze table may include duplicates since each daily snapshot repeats the same users.  
- Some records were re-uploaded within the same files.

**TO DO:**
- Create the Silver Delta table `users`.
- Remove duplicates and keep the latest record per `user_id`.

**TIPS**  
- Use **window functions** and `last_modified` date.

> **Key Note:**  
> Bronze = Immutable raw history.  
> Silver = Clean, current state.  
> Gold = Aggregated or business-level views.
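
The expected outcome of "latest record per `user_id`" can be checked on a handful of rows in plain Python. The sample rows and the helper `latest_per_key` below are invented for illustration and are not part of the lab data.

```python
from datetime import date


def latest_per_key(rows, key, order_by):
    """Keep, for each key value, the row with the greatest order_by value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # Strict comparison: on a tie, the first row seen is kept.
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


bronze_rows = [
    {"user_id": 1, "email": "old@example.com", "last_modified": date(2024, 1, 1)},
    {"user_id": 2, "email": "b@example.com", "last_modified": date(2024, 1, 1)},
    {"user_id": 1, "email": "new@example.com", "last_modified": date(2024, 1, 2)},
    {"user_id": 2, "email": "b@example.com", "last_modified": date(2024, 1, 1)},  # exact duplicate
]
silver_rows = latest_per_key(bronze_rows, key="user_id", order_by="last_modified")
for row in silver_rows:
    print(row["user_id"], row["email"], row["last_modified"])
```

Four Bronze rows collapse to two Silver rows: user 1 keeps the newer email, and the exact duplicate for user 2 disappears.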

In [0]:
display(mod1.create_silver_users(user_table_bronze, user_table_silver))
print(f"OK - Silver table created: {user_table_silver}")

## Step 8 - Validate Deduplication Results

**TO DO:**  
- Check that the Silver tables contain no duplicate records (only the latest record per key).  
- Compare counts to confirm duplicates were removed.
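
Besides comparing counts, duplicates can be looked for directly. The sketch below builds a query that lists any key appearing more than once; the helper name `build_duplicate_check` and the table name are hypothetical.

```python
def build_duplicate_check(table_name: str, key_col: str) -> str:
    """Assemble a query that returns every key value occurring more than once."""
    return (
        f"SELECT {key_col}, COUNT(*) AS n_rows\n"
        f"FROM {table_name}\n"
        f"GROUP BY {key_col}\n"
        "HAVING COUNT(*) > 1"
    )


# Placeholder table name for illustration only.
check_sql = build_duplicate_check("capstone_dev.demo_silver.products", "product_id")
print(check_sql)
# In the notebook you would run it with: display(spark.sql(check_sql))
```

An empty result means every key is unique in the Silver table.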


In [0]:
# Count total records in Silver (should equal the distinct product count in Bronze)
query = f"SELECT COUNT(*) AS total_records FROM {product_table_silver}"
display(spark.sql(query))

In [0]:
# Compare to raw product bronze
query = f"SELECT COUNT(DISTINCT product_id) AS distinct_products_in_bronze FROM {product_table_bronze}"
display(spark.sql(query))

In [0]:
# Inspect latest clean product records
query = f"SELECT * FROM {product_table_silver} ORDER BY product_id"
display(spark.sql(query))

In [0]:
# Count total records in Silver (should equal the distinct user count in Bronze)
query = f"SELECT COUNT(*) AS total_records FROM {user_table_silver}"
display(spark.sql(query))

In [0]:
# Compare to raw user bronze
query = f"SELECT COUNT(DISTINCT user_id) AS distinct_users_in_bronze FROM {user_table_bronze}"
display(spark.sql(query))

In [0]:
# Inspect latest clean user records
query = f"SELECT * FROM {user_table_silver} ORDER BY user_id"
display(spark.sql(query))

## Step 9 - Tests
**TO DO:**  
- Check whether any tests failed and investigate the root cause.

In [0]:
from helpers import test_runner
import os

notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
os.environ["NOTEBOOK_NAME"] = notebook_path.split("/")[-1]

test_runner.run()



## Reflection & Discussion

1. **What happens if you run `COPY INTO` again?**  
   Why does the row count remain unchanged?

2. **How would you handle price updates if you wanted to keep only the latest state?**  
   Would you use MERGE or OVERWRITE?

3. **If a source file is replaced, will existing data in the Delta table change?**  
   How can you detect this using `DESCRIBE HISTORY`?

4. **Why is COPY INTO ideal for Bronze ingestion but not for volatile lookup tables?**

5. **Advanced:**  
   How could this pipeline evolve to use **Auto Loader** for continuous ingestion from S3 or ADLS?



## Summary

In this lab, you:
- Ingested product and user CSV files using `COPY INTO`.
- Verified **idempotency** and **Delta history**.
- Promoted raw Bronze data into a clean Silver version.
- Discussed `COPY INTO` vs. `MERGE` vs. `OVERWRITE`.

> **Key Note:**  
> - `COPY INTO` → append-only raw loads (Bronze)  
> - `MERGE` → upsert / update existing rows (Silver)  
> - `OVERWRITE` → replace entire dataset (batch refresh)  
>
> Together, these operations form the backbone of **modern medallion pipelines** in Databricks.
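
For comparison with the append-only load, this is roughly what an upsert could look like. The helper `build_merge_sql` and the source view name `products_latest` are hypothetical; the sketch assumes the source holds at most one row per key, since a `MERGE` fails when several source rows match the same target row.

```python
def build_merge_sql(target_table: str, source_table: str, key_col: str) -> str:
    """Assemble a MERGE statement that updates matching rows and inserts new ones."""
    return (
        f"MERGE INTO {target_table} AS t\n"
        f"USING {source_table} AS s\n"
        f"ON t.{key_col} = s.{key_col}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Placeholder names; `products_latest` stands for an already deduplicated source.
merge_sql = build_merge_sql(
    "capstone_dev.demo_silver.products",
    "products_latest",
    key_col="product_id",
)
print(merge_sql)
# In the notebook you would run it with: spark.sql(merge_sql)
```

Unlike `COPY INTO`, this changes existing rows in place, so the target reflects the current state while the history of changes stays in Bronze.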
