# Week 4 Lab: Bronze Layer

In this lab you will build the Bronze layer of our bookstore's medallion architecture.

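You'll ingest four CSV sources into Delta tables, adding audit columns along the way. The static reference tables are reloaded with `INSERT OVERWRITE` and the order tables are loaded with `MERGE INTO`, so the pipeline is idempotent: safe to run multiple times without creating duplicates.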

**A note on our source data:**
- **Books and Stores** are static reference data — each is a single CSV file that won't change over time.
- **Online Orders and In-Store Orders** are transactional data. New order files can arrive in their directories at any time, so `read_files` reads the entire directory to pick up any new files on each run.

## Prerequisites

Before running this notebook, make sure the target Bronze tables exist. Run the DDL notebook at `ddl/bronze` to create them.
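
As a quick sanity check before continuing, something like the following lists the tables in the `bronze` schema. This is only a sketch: it assumes the DDL notebook creates the four tables referenced later in this lab (`stores`, `books`, `online_orders`, `instore_orders`) in a schema named `bronze`.

```sql
-- List the tables in the bronze schema; the four tables used in this lab should appear
SHOW TABLES IN bronze
```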

## Step 1: Load CSV Files into Temporary Views

We read each source CSV into a temporary view. 
* For books and stores, we point at a single static file. 
* For both online and in-store orders, we point at a directory — `read_files` will read all CSVs in the directory, picking up any new files that have landed since the last run. 

We add two audit columns in each view:
* `source_filename`: the full path of the source file, taken from `_metadata.file_path`
* `ingestion_timestamp`: set to `current_timestamp()` so we know when the data was loaded
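
To see what the file metadata looks like before building the views, a preview along these lines may help. It is a sketch that reuses the stores path from this lab; `_metadata.file_name` and `_metadata.file_modification_time` are extra fields of the hidden `_metadata` column that the lab itself does not use, and the aliases `file_name_only` and `file_modified_at` are made up for illustration.

```sql
-- Preview the hidden _metadata column next to the audit timestamp
SELECT
  _metadata.file_path              AS source_filename,   -- full path, as used in this lab
  _metadata.file_name              AS file_name_only,    -- file name only (illustrative, not used in the lab)
  _metadata.file_modification_time AS file_modified_at,  -- illustrative, not used in the lab
  current_timestamp()              AS ingestion_timestamp
FROM read_files(
  '/FileStore/hwe-data/stores/stores.csv',
  format => 'csv',
  header => true
)
LIMIT 1
```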

### Load Stores data into a Temporary View
CREATE a temporary view named `stores_raw` which:
* selects each column from the stores CSV file by name
* adds `source_filename` from the file metadata
* adds `ingestion_timestamp` set to the current time.

"Stores" is a static data set for the duration of the labs - our bookstore has 8 physical locations, and we will not be opening or closing any locations as part of the coursework.

In [None]:
CREATE OR REPLACE TEMPORARY VIEW stores_raw AS
SELECT
  store_nbr,
  store_name,
  store_address,
  store_city,
  store_state,
  store_zip,
  current_timestamp() AS ingestion_timestamp,
  _metadata.file_path AS source_filename
FROM read_files(
  '/FileStore/hwe-data/stores/stores.csv',
  format => 'csv',
  header => true
)

### Load Books data into a Temporary View
CREATE a temporary view named `books_raw` which:
* selects each column from the books CSV file by name
* adds `source_filename` from the file metadata
* adds `ingestion_timestamp` set to the current time.

"Books" is a static data set for the duration of the labs - our bookstore has a catalog of 110 books, and we will not be adding or removing books from this catalog as part of the coursework.

In [None]:
CREATE OR REPLACE TEMPORARY VIEW books_raw AS
SELECT
  isbn,
  title,
  author,
  genre,
  current_timestamp() AS ingestion_timestamp,
  _metadata.file_path AS source_filename
FROM read_files(
  '/FileStore/hwe-data/books/books.csv',
  format => 'csv',
  header => true
)

### Load Online Orders data into a Temporary View
CREATE a temporary view named `online_orders_raw` which:
* selects each column from the online orders CSV directory by name
* adds `source_filename` from the file metadata
* adds `ingestion_timestamp` set to the current time.

"Online Orders" is transactional data - unlike books and stores, new order files can arrive in this directory at any time. `read_files` reads the entire directory so we always pick up every order we've ever received.

In [None]:
CREATE OR REPLACE TEMPORARY VIEW online_orders_raw AS
SELECT
  order_id,
  order_timestamp,
  customer_email,
  customer_name,
  customer_address,
  customer_city,
  customer_state,
  customer_zip,
  items,
  payment_method,
  total_amount,
  current_timestamp() AS ingestion_timestamp,
  _metadata.file_path AS source_filename
FROM read_files(
  '/FileStore/hwe-data/online_orders/',
  format => 'csv',
  header => true
)
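
Because this view reads a whole directory, it can be useful to see which files contributed rows. A minimal sketch, using only the `online_orders_raw` view defined above:

```sql
-- Count the rows contributed by each source file in the directory
SELECT
  source_filename,
  COUNT(*) AS row_count
FROM online_orders_raw
GROUP BY source_filename
ORDER BY source_filename
```

When a new file lands in the directory, it should appear as an additional row the next time this query runs.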

### Load In-Store Orders data into a Temporary View
CREATE a temporary view named `instore_orders_raw` which:
* selects each column from the in-store orders CSV directory by name
* adds `source_filename` from the file metadata
* adds `ingestion_timestamp` set to the current time.

"In-Store Orders" is transactional data. As with online orders, new order files can arrive in this directory at any time, and `read_files` reads the entire directory so we always pick up every order we've ever received.

In [None]:
CREATE OR REPLACE TEMPORARY VIEW instore_orders_raw AS
SELECT
  order_id,
  transaction_timestamp,
  store_nbr,
  customer_email,
  items,
  payment_method,
  total_amount,
  cashier_name,
  current_timestamp() AS ingestion_timestamp,
  _metadata.file_path AS source_filename
FROM read_files(
  '/FileStore/hwe-data/instore_orders/',
  format => 'csv',
  header => true
)

---
## Step 2: Load Data into Bronze Tables

Now we load data from the temporary views into the bronze tables.

For **static reference data** (stores and books), we use `INSERT OVERWRITE` — this replaces the entire table contents each time. Since the source is a single, complete file, a full reload is the simplest approach and is naturally idempotent.

For **transactional data** (online orders and in-store orders), we use `MERGE INTO`, which matches rows on a natural key:
- If a match is found → update the existing row
- If no match → insert a new row

This makes our pipeline **idempotent** — running it again with the same data won't create duplicates.

Overwrite `bronze.stores` with the full contents of `stores_raw`.

In [None]:
INSERT OVERWRITE bronze.stores (store_nbr, store_name, store_address, store_city, store_state, store_zip, ingestion_timestamp, source_filename)
SELECT
  store_nbr,
  store_name,
  store_address,
  store_city,
  store_state,
  store_zip,
  ingestion_timestamp,
  source_filename
FROM stores_raw

Overwrite `bronze.books` with the full contents of `books_raw`.

In [None]:
INSERT OVERWRITE bronze.books (isbn, title, author, genre, ingestion_timestamp, source_filename)
SELECT
  isbn,
  title,
  author,
  genre,
  ingestion_timestamp,
  source_filename
FROM books_raw

Merge `online_orders_raw` into `bronze.online_orders`, matching on `order_id`.

In [None]:
MERGE INTO bronze.online_orders AS target
USING online_orders_raw AS source
ON target.order_id <=> source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

Merge `instore_orders_raw` into `bronze.instore_orders`, matching on `order_id`.

In [None]:
MERGE INTO bronze.instore_orders AS target
USING instore_orders_raw AS source
ON target.order_id <=> source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
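
The `UPDATE SET *` and `INSERT *` shorthand relies on the source view having the same column names as the target table. As a sketch of what the shorthand expands to for in-store orders, assuming `bronze.instore_orders` has exactly the ten columns selected in `instore_orders_raw` (the DDL is not shown here, so treat the column list as an assumption):

```sql
MERGE INTO bronze.instore_orders AS target
USING instore_orders_raw AS source
ON target.order_id <=> source.order_id
-- Matched key: overwrite every column with the incoming values
WHEN MATCHED THEN UPDATE SET
  target.order_id              = source.order_id,
  target.transaction_timestamp = source.transaction_timestamp,
  target.store_nbr             = source.store_nbr,
  target.customer_email        = source.customer_email,
  target.items                 = source.items,
  target.payment_method        = source.payment_method,
  target.total_amount          = source.total_amount,
  target.cashier_name          = source.cashier_name,
  target.ingestion_timestamp   = source.ingestion_timestamp,
  target.source_filename       = source.source_filename
-- Unmatched key: insert the incoming row as-is
WHEN NOT MATCHED THEN INSERT (
  order_id, transaction_timestamp, store_nbr, customer_email, items,
  payment_method, total_amount, cashier_name, ingestion_timestamp, source_filename
) VALUES (
  source.order_id, source.transaction_timestamp, source.store_nbr, source.customer_email, source.items,
  source.payment_method, source.total_amount, source.cashier_name, source.ingestion_timestamp, source.source_filename
)
```

Spelling the columns out is more verbose, but it makes explicit that a matched row also gets a fresh `ingestion_timestamp` on every run.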

---
## Step 3: Verify Idempotency

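Now go back and **run all the cells above a second time**. The `INSERT OVERWRITE` statements replace the stores and books tables with the same contents, and each MERGE should match every existing order row and update it, so no new rows should be inserted.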

Let's verify.

Check the row counts across all four bronze tables. These should be the same before and after the second run.

In [None]:
SELECT 'bronze.stores' AS table_name, COUNT(*) AS row_count FROM bronze.stores
UNION ALL
SELECT 'bronze.books', COUNT(*) FROM bronze.books
UNION ALL
SELECT 'bronze.online_orders', COUNT(*) FROM bronze.online_orders
UNION ALL
SELECT 'bronze.instore_orders', COUNT(*) FROM bronze.instore_orders
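
Row counts alone do not show whether the second MERGE updated or inserted rows. One way to look closer, sketched here, is the Delta table history: for a MERGE, the `operationMetrics` column typically includes counters such as `numTargetRowsInserted` and `numTargetRowsUpdated`. The exact set of metrics can vary by runtime version, so treat those names as indicative.

```sql
-- Show the two most recent operations on the table, newest first
DESCRIBE HISTORY bronze.online_orders LIMIT 2
```

After the second run, the newest MERGE entry should report zero inserted rows, provided no new order files arrived in between.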

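Check for duplicate keys. This query should return **no rows**. If it returns rows for either orders table, your MERGE key isn't working correctly. Stores and books are loaded with `INSERT OVERWRITE`, so duplicates there would mean the source file itself contains duplicate keys.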

In [None]:
-- Keys are cast to STRING so the UNION ALL branches share one type, whatever the column types are
SELECT 'bronze.stores' AS table_name, CAST(store_nbr AS STRING) AS key, COUNT(*) AS cnt
FROM bronze.stores GROUP BY store_nbr HAVING COUNT(*) > 1
UNION ALL
SELECT 'bronze.books', CAST(isbn AS STRING), COUNT(*)
FROM bronze.books GROUP BY isbn HAVING COUNT(*) > 1
UNION ALL
SELECT 'bronze.online_orders', CAST(order_id AS STRING), COUNT(*)
FROM bronze.online_orders GROUP BY order_id HAVING COUNT(*) > 1
UNION ALL
SELECT 'bronze.instore_orders', CAST(order_id AS STRING), COUNT(*)
FROM bronze.instore_orders GROUP BY order_id HAVING COUNT(*) > 1

Finally, spot-check a few rows to confirm the audit columns look correct. The `ingestion_timestamp` should reflect when you last ran the pipeline.

In [None]:
SELECT isbn, title, ingestion_timestamp, source_filename
FROM bronze.books
LIMIT 5