# Week 5 Lab: Silver Layer

In this lab you will build the Silver layer of our bookstore's medallion architecture.

You'll clean, normalize, and conform the raw Bronze data into Third Normal Form (3NF) tables. This means:
- Eliminating redundancy (customer data gets its own table)
- Flattening nested data (JSON arrays become rows)
- Unifying multiple sources (online and in-store orders become one table)
- Applying data quality checks (rejecting bad data, trimming whitespace)

## Silver Data Model — 3NF

```
┌──────────────────┐     ┌─────────────────────────┐     ┌──────────────────┐
│ silver.customers │◄────│      silver.orders      │────►│  silver.stores   │
├──────────────────┤     ├─────────────────────────┤     ├──────────────────┤
│ PK email         │     │ PK order_id             │     │ PK store_nbr     │
│    name          │     │ PK order_channel        │     │    name          │
│    address       │     │    order_datetime       │     │    address       │
│    city          │     │ FK customer_email       │     │    city          │
│    state         │     │ FK store_nbr            │     │    state         │
│    zip           │     │    payment_method       │     │    zip           │
└──────────────────┘     │    total_amount         │     └──────────────────┘
                         │    cashier_name         │
                         └────────────┬────────────┘
                                      │
                                      │ 1:M
                                      ▼
                         ┌─────────────────────────┐     ┌──────────────────┐
                         │   silver.order_items    │────►│   silver.books   │
                         ├─────────────────────────┤     ├──────────────────┤
                         │ PK,FK order_id          │     │ PK isbn          │
                         │ PK,FK order_channel     │     │    title         │
                         │ PK,FK isbn              │     │    author        │
                         │     quantity            │     │    genre         │
                         │     unit_price          │     └──────────────────┘
                         └─────────────────────────┘
```

---
## Prerequisites

Before running this notebook:
1. Run the DDL notebook at `ddl/silver` to create the target Silver tables.
2. Make sure the Bronze layer is populated (Week 4 lab).

---
## Step 1: Build silver.stores

Source: `bronze.stores`

A straightforward load: the column names already match between Bronze and Silver, so no renaming is needed. We just drop the audit columns (`ingestion_timestamp`, `source_filename`) since they belong to the Bronze layer.

Merge stores from `bronze.stores` into `silver.stores`, matching on `store_nbr`.

In [None]:
MERGE INTO silver.stores AS target
USING (
  SELECT store_nbr, name, address, city, state, zip
  FROM bronze.stores
) AS source
ON target.store_nbr = source.store_nbr
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
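
To make the upsert behaviour of the `MERGE` above concrete, here is a minimal plain-Python sketch of the same idea: rows whose key already exists in the target are overwritten, and rows with a new key are inserted. The helper `merge_upsert` and the sample rows are hypothetical, written only for illustration; they are not part of the lab notebooks.

```python
def merge_upsert(target, source, key):
    """Mimic WHEN MATCHED THEN UPDATE / WHEN NOT MATCHED THEN INSERT.

    `target` and `source` are lists of dict rows; `key` is the match column.
    Returns a new list and leaves both inputs untouched.
    """
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        # Matched key: overwrite the existing row. New key: insert it.
        merged[row[key]] = dict(row)
    return list(merged.values())


# Hypothetical sample data, not taken from the lab dataset.
silver_stores = [{"store_nbr": "S1", "name": "Old Name"}]
bronze_stores = [
    {"store_nbr": "S1", "name": "New Name"},
    {"store_nbr": "S2", "name": "Second Store"},
]

result = merge_upsert(silver_stores, bronze_stores, "store_nbr")
```

Running the same merge a second time leaves the result unchanged, which is why the notebook can be re-run safely.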

---
## Step 2: Build silver.books

Source: `bronze.books`

This is a straightforward load with data quality checks applied. We:
- **Trim** whitespace from all string fields
- **Reject** rows where `isbn` or `title` is null or empty
- **Validate** that `isbn` matches the hyphenated ISBN-13 layout used in this dataset (`978-X-XX-XXXXXX-X`, where each `X` is a digit)
- **Drop** the audit columns (`ingestion_timestamp`, `source_filename`) since they belong to the Bronze layer

Merge cleaned books data into `silver.books`, matching on `isbn`. The USING subquery applies all quality checks inline — only valid rows make it through.

In [None]:
MERGE INTO silver.books AS target
USING (
  SELECT
    TRIM(isbn) AS isbn,
    TRIM(title) AS title,
    TRIM(author) AS author,
    TRIM(genre) AS genre
  FROM bronze.books
  WHERE isbn IS NOT NULL
    AND TRIM(isbn) != ''
    AND title IS NOT NULL
    AND TRIM(title) != ''
    AND TRIM(isbn) RLIKE '^978-[0-9]-[0-9]{2}-[0-9]{6}-[0-9]$'
) AS source
ON target.isbn = source.isbn
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
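
The quality checks can be tried out on individual rows outside the warehouse. The sketch below restates them in plain Python, applying the pattern to the trimmed ISBN. The function `clean_book` and the sample values are hypothetical and exist only for illustration.

```python
import re

# Same layout as the RLIKE pattern: 978-X-XX-XXXXXX-X, digits only.
ISBN_PATTERN = re.compile(r"978-[0-9]-[0-9]{2}-[0-9]{6}-[0-9]")


def trim(value):
    """Strip whitespace but keep None as None, like TRIM(NULL) in SQL."""
    return value.strip() if value is not None else None


def clean_book(row):
    """Return a trimmed copy of the row, or None if it fails a check."""
    isbn = trim(row.get("isbn"))
    title = trim(row.get("title"))
    if not isbn or not title:
        return None  # null or empty isbn/title
    if not ISBN_PATTERN.fullmatch(isbn):
        return None  # wrong ISBN layout
    return {
        "isbn": isbn,
        "title": title,
        "author": trim(row.get("author")),
        "genre": trim(row.get("genre")),
    }


# Hypothetical sample row with stray whitespace.
accepted = clean_book(
    {"isbn": " 978-1-23-456789-0 ", "title": " Dune ", "author": "Frank Herbert", "genre": None}
)
```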

---
## Step 3: Build silver.customers

Source: `bronze.online_orders`

There is no customers CSV — customer data is embedded on each online order. A single customer may have placed multiple orders, and their name or address may have changed between orders.

To build a clean customers table we:
1. Group orders by `customer_email` (the natural key)
2. For each email, take the customer fields from the **most recent** order
3. Rename columns by dropping the `customer_` prefix

We use `ROW_NUMBER()` partitioned by email, ordered by `order_timestamp DESC`, to pick the latest order per customer.

Merge derived customer data into `silver.customers`, matching on `email`.

In [None]:
MERGE INTO silver.customers AS target
USING (
  SELECT
    customer_email AS email,
    customer_name AS name,
    customer_address AS address,
    customer_city AS city,
    customer_state AS state,
    customer_zip AS zip
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (
        PARTITION BY customer_email
        ORDER BY order_timestamp DESC, order_id DESC  -- order_id breaks timestamp ties deterministically
      ) AS rn
    FROM bronze.online_orders
  )
  WHERE rn = 1
) AS source
ON target.email = source.email
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
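
The "latest order per customer" logic can be sketched in plain Python as follows. This is an illustration only: the helper name and sample rows are hypothetical, and it assumes timestamps are ISO-8601 strings, which sort correctly as text.

```python
PREFIX = "customer_"


def latest_customer_per_email(orders):
    """Keep the customer fields from each email's most recent order."""
    latest = {}
    for order in orders:
        email = order["customer_email"]
        current = latest.get(email)
        # Equivalent of ROW_NUMBER() ... ORDER BY order_timestamp DESC, rn = 1.
        if current is None or order["order_timestamp"] > current["order_timestamp"]:
            latest[email] = order
    # Keep only the customer_* columns and drop the prefix.
    return [
        {k[len(PREFIX):]: v for k, v in order.items() if k.startswith(PREFIX)}
        for order in latest.values()
    ]


# Hypothetical sample orders: the first customer changed name and city.
orders = [
    {"order_id": 1, "order_timestamp": "2024-01-05T10:00:00",
     "customer_email": "a@example.com", "customer_name": "Ann Lee", "customer_city": "Austin"},
    {"order_id": 2, "order_timestamp": "2024-03-01T09:30:00",
     "customer_email": "a@example.com", "customer_name": "Ann Lee-Smith", "customer_city": "Dallas"},
    {"order_id": 3, "order_timestamp": "2024-02-10T12:00:00",
     "customer_email": "b@example.com", "customer_name": "Bo Chen", "customer_city": "Reno"},
]

customers = latest_customer_per_email(orders)
```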

---
## Step 4: Build silver.orders

Sources: `bronze.online_orders` + `bronze.instore_orders`

The Bronze layer has two separate order tables from two source systems. In Silver we unify them into a single `silver.orders` table with:
- An `order_channel` column (`'online'` or `'in-store'`) to distinguish the source
- A composite primary key of `(order_id, order_channel)` — since order IDs could overlap between systems
- Sentinel values in place of NULL foreign keys:
  - Online orders have no store, so `store_nbr` is set to the literal `'online'`
  - In-store orders may have no customer email, so `COALESCE` replaces a NULL `customer_email` with `'in-store'`
- Renamed timestamp columns to a common `order_datetime`
- Customer detail fields dropped (they now live in `silver.customers`)

Create a temporary view that unions online and in-store orders into a common schema.

In [None]:
CREATE OR REPLACE TEMPORARY VIEW orders_unified AS

SELECT
  order_id,
  'online' AS order_channel,
  order_timestamp AS order_datetime,
  customer_email,
  'online' AS store_nbr,
  payment_method,
  total_amount,
  CAST(NULL AS STRING) AS cashier_name
FROM bronze.online_orders

UNION ALL

SELECT
  order_id,
  'in-store' AS order_channel,
  transaction_timestamp AS order_datetime,
  COALESCE(customer_email, 'in-store') AS customer_email,
  store_nbr,
  payment_method,
  total_amount,
  cashier_name
FROM bronze.instore_orders
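
The same unification can be sketched in plain Python to show why the channel column is needed. The helper `unify_orders` and the sample rows are hypothetical; note that both sources deliberately reuse order ID 100.

```python
def unify_orders(online_orders, instore_orders):
    """Map both order sources onto one schema, mirroring the UNION ALL view."""
    unified = []
    for o in online_orders:
        unified.append({
            "order_id": o["order_id"],
            "order_channel": "online",
            "order_datetime": o["order_timestamp"],
            "customer_email": o["customer_email"],
            "store_nbr": "online",   # sentinel: online orders have no store
            "payment_method": o["payment_method"],
            "total_amount": o["total_amount"],
            "cashier_name": None,    # online orders have no cashier
        })
    for o in instore_orders:
        email = o.get("customer_email")
        unified.append({
            "order_id": o["order_id"],
            "order_channel": "in-store",
            "order_datetime": o["transaction_timestamp"],
            # Like COALESCE: only a missing (None) email gets the sentinel.
            "customer_email": email if email is not None else "in-store",
            "store_nbr": o["store_nbr"],
            "payment_method": o["payment_method"],
            "total_amount": o["total_amount"],
            "cashier_name": o["cashier_name"],
        })
    return unified


# Hypothetical sample rows with the same order_id in both systems.
online = [{"order_id": 100, "order_timestamp": "2024-03-01T09:30:00",
           "customer_email": "a@example.com", "payment_method": "card",
           "total_amount": "49.99"}]
instore = [{"order_id": 100, "transaction_timestamp": "2024-03-02T14:00:00",
            "customer_email": None, "store_nbr": "S1", "payment_method": "cash",
            "total_amount": "12.50", "cashier_name": "Sam"}]

unified = unify_orders(online, instore)
```

Although both rows share `order_id` 100, the pair `(order_id, order_channel)` stays unique, which is what the composite key relies on.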

Merge the unified orders into `silver.orders`, matching on the composite key `(order_id, order_channel)`.

In [None]:
MERGE INTO silver.orders AS target
USING orders_unified AS source
ON target.order_id = source.order_id
  AND target.order_channel = source.order_channel
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

---
## Step 5: Build silver.order_items

Sources: `bronze.online_orders` + `bronze.instore_orders` (the `items` JSON column)

In Bronze, each order has an `items` column containing a JSON array of line items:
```json
[{"isbn": "978-...", "title": "...", "quantity": 1, "unit_price": 49.99}, ...]
```

In Silver we **explode** this array so each line item becomes its own row. This is a normalization step that removes the repeating group: instead of a nested structure, we get a flat table that can be joined and aggregated with standard SQL.

We use:
- `FROM_JSON` to parse the JSON string into a typed array of structs
- `EXPLODE` (via `LATERAL VIEW`) to turn each array element into a row

The parsed struct still includes `title`, but we leave it out of the `SELECT` because it is redundant with `silver.books`.

Create a temporary view that explodes the items JSON from both order sources.

In [None]:
CREATE OR REPLACE TEMPORARY VIEW order_items_exploded AS

SELECT
  order_id,
  'online' AS order_channel,
  item.isbn,
  item.quantity,
  CAST(item.unit_price AS DECIMAL(10, 2)) AS unit_price
FROM bronze.online_orders
  LATERAL VIEW EXPLODE(
    FROM_JSON(items, 'ARRAY<STRUCT<isbn: STRING, title: STRING, quantity: INT, unit_price: DOUBLE>>')
  ) t AS item

UNION ALL

SELECT
  order_id,
  'in-store' AS order_channel,
  item.isbn,
  item.quantity,
  CAST(item.unit_price AS DECIMAL(10, 2)) AS unit_price
FROM bronze.instore_orders
  LATERAL VIEW EXPLODE(
    FROM_JSON(items, 'ARRAY<STRUCT<isbn: STRING, title: STRING, quantity: INT, unit_price: DOUBLE>>')
  ) t AS item
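
To see what the explode step produces, the sketch below does the same thing in plain Python with the standard `json` module. The helper `explode_items` and the sample order are hypothetical and only illustrate the shape of the output.

```python
import json
from decimal import Decimal


def explode_items(orders, channel):
    """Turn each element of an order's JSON `items` array into its own row."""
    rows = []
    for order in orders:
        for item in json.loads(order["items"]):
            rows.append({
                "order_id": order["order_id"],
                "order_channel": channel,
                "isbn": item["isbn"],
                "quantity": int(item["quantity"]),
                # Two decimal places, similar in spirit to DECIMAL(10, 2).
                "unit_price": Decimal(str(item["unit_price"])).quantize(Decimal("0.01")),
                # `title` is intentionally not copied; it lives in silver.books.
            })
    return rows


# Hypothetical sample order with two line items.
sample_orders = [{
    "order_id": 7,
    "items": '[{"isbn": "978-1-23-456789-0", "title": "A", "quantity": 2, "unit_price": 49.99},'
             ' {"isbn": "978-1-23-456789-1", "title": "B", "quantity": 1, "unit_price": 5}]',
}]

item_rows = explode_items(sample_orders, "online")
```

One order with two items becomes two rows, each carrying the order's key columns.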

Merge the exploded order items into `silver.order_items`, matching on the composite key `(order_id, order_channel, isbn)`.

In [None]:
MERGE INTO silver.order_items AS target
USING order_items_exploded AS source
ON target.order_id = source.order_id
  AND target.order_channel = source.order_channel
  AND target.isbn = source.isbn
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
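
The composite key assumes an ISBN appears at most once per order. If an order's `items` array listed the same ISBN twice, the exploded rows would contain two entries with the same key. As a hedged illustration, the plain-Python sketch below shows one way to spot such keys; the helper `duplicate_keys` and the sample rows are hypothetical.

```python
from collections import Counter


def duplicate_keys(rows, key_cols):
    """Return the key tuples that occur more than once in `rows`."""
    counts = Counter(tuple(row[c] for c in key_cols) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical exploded rows: order 1 lists the same ISBN twice.
exploded = [
    {"order_id": 1, "order_channel": "online", "isbn": "978-1-23-456789-0", "quantity": 1},
    {"order_id": 1, "order_channel": "online", "isbn": "978-1-23-456789-0", "quantity": 2},
    {"order_id": 2, "order_channel": "online", "isbn": "978-1-23-456789-0", "quantity": 1},
]

dupes = duplicate_keys(exploded, ["order_id", "order_channel", "isbn"])
```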

---
## Step 6: Verify

Now let's run some checks to make sure the Silver layer is correct.

**Row counts** — Compare Silver table counts to their Bronze sources. `silver.books` may have slightly fewer rows than `bronze.books` if the quality checks rejected any rows.

In [None]:
SELECT 'bronze.stores' AS table_name, COUNT(*) AS row_count FROM bronze.stores
UNION ALL
SELECT 'silver.stores', COUNT(*) FROM silver.stores
UNION ALL
SELECT 'bronze.books', COUNT(*) FROM bronze.books
UNION ALL
SELECT 'silver.books', COUNT(*) FROM silver.books
UNION ALL
SELECT 'silver.customers', COUNT(*) FROM silver.customers
UNION ALL
SELECT 'silver.orders', COUNT(*) FROM silver.orders
UNION ALL
SELECT 'silver.order_items', COUNT(*) FROM silver.order_items

**Customer count** — The number of customers should equal the number of distinct `customer_email` values in `bronze.online_orders`.

In [None]:
SELECT
  (SELECT COUNT(*) FROM silver.customers) AS silver_customer_count,
  (SELECT COUNT(DISTINCT customer_email) FROM bronze.online_orders) AS bronze_distinct_emails

**Order count** — `silver.orders` should equal the combined count of both Bronze order tables.

In [None]:
SELECT
  (SELECT COUNT(*) FROM silver.orders) AS silver_order_count,
  (SELECT COUNT(*) FROM bronze.online_orders) + (SELECT COUNT(*) FROM bronze.instore_orders) AS bronze_combined_count

**Referential integrity** — Every `customer_email` in `silver.orders` should exist in `silver.customers` (or be the sentinel value `'in-store'`). This query should return **no rows**.

In [None]:
SELECT DISTINCT o.customer_email
FROM silver.orders o
LEFT JOIN silver.customers c ON o.customer_email = c.email
WHERE c.email IS NULL
  AND o.customer_email != 'in-store'

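**Referential integrity** — Every `isbn` in `silver.order_items` should exist in `silver.books`. This query should return **no rows**. Any ISBN it does return belongs to a book that is missing from Bronze or was rejected by the Step 2 quality checks.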

In [None]:
SELECT DISTINCT oi.isbn
FROM silver.order_items oi
LEFT JOIN silver.books b ON oi.isbn = b.isbn
WHERE b.isbn IS NULL
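
Both referential integrity queries are anti-joins: they keep child keys that have no matching parent. A minimal plain-Python sketch of that idea uses set difference. The helper `orphan_keys` and the sample values are hypothetical.

```python
def orphan_keys(child_keys, parent_keys, sentinels=()):
    """Return child keys with no matching parent key.

    None values and sentinel values are skipped, mirroring how the SQL
    checks ignore NULLs and the 'in-store' placeholder.
    """
    children = {k for k in child_keys if k is not None}
    return sorted(children - set(parent_keys) - set(sentinels))


# Hypothetical sample keys.
order_emails = ["a@example.com", "in-store", "x@example.com", None]
customer_emails = ["a@example.com", "b@example.com"]

orphans = orphan_keys(order_emails, customer_emails, sentinels=["in-store"])
```

An empty result corresponds to the SQL check returning no rows.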

**Total amount cross-check** — For each order, verify that the sum of `quantity * unit_price` across its line items matches the `total_amount` stored on the order. Any mismatches indicate a data quality issue.

In [None]:
SELECT
  o.order_id,
  o.order_channel,
  o.total_amount AS order_total,
  SUM(oi.quantity * oi.unit_price) AS computed_total,
  o.total_amount - SUM(oi.quantity * oi.unit_price) AS difference
FROM silver.orders o
JOIN silver.order_items oi
  ON o.order_id = oi.order_id
  AND o.order_channel = oi.order_channel
GROUP BY o.order_id, o.order_channel, o.total_amount
HAVING o.total_amount != SUM(oi.quantity * oi.unit_price)
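
The cross-check compares exact amounts, so it matters that prices are decimals rather than floating-point numbers. The plain-Python sketch below reproduces the comparison with `decimal.Decimal`. The helper `total_mismatches` and the sample rows are hypothetical.

```python
from decimal import Decimal


def total_mismatches(orders, order_items):
    """Return (key, difference) for orders whose total differs from its items."""
    computed = {}
    for item in order_items:
        key = (item["order_id"], item["order_channel"])
        line_total = item["quantity"] * item["unit_price"]
        computed[key] = computed.get(key, Decimal("0")) + line_total
    mismatches = []
    for order in orders:
        key = (order["order_id"], order["order_channel"])
        # Like the inner JOIN, orders without any items are skipped.
        if key in computed and order["total_amount"] != computed[key]:
            mismatches.append((key, order["total_amount"] - computed[key]))
    return mismatches


# Hypothetical sample data: order 2 is off by one cent.
sample_orders = [
    {"order_id": 1, "order_channel": "online", "total_amount": Decimal("104.98")},
    {"order_id": 2, "order_channel": "in-store", "total_amount": Decimal("20.00")},
]
sample_items = [
    {"order_id": 1, "order_channel": "online", "quantity": 2, "unit_price": Decimal("49.99")},
    {"order_id": 1, "order_channel": "online", "quantity": 1, "unit_price": Decimal("5.00")},
    {"order_id": 2, "order_channel": "in-store", "quantity": 1, "unit_price": Decimal("19.99")},
]

mismatches = total_mismatches(sample_orders, sample_items)
```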