## Building `instacart.silver.dim_user` (User Dimension)

This step creates a **compressed, analytics-ready user dimension** that summarizes each customer’s order history and behavioral patterns.  
It converts millions of transactional rows into compact user profiles suitable for dashboards, reports, and machine learning pipelines.

---

### What This Code Does

**Input Source:** `instacart.bronze.orders` (loaded from `orders.csv`)

The process performs the following steps:

1. **Read raw order data** from the bronze layer.  
2. **Aggregate by `user_id`** to compute key metrics:
   - **Activity metrics:** first and last order number, total orders, mean days between orders.  
   - **Behavioral signals:**
     - `dow_pref_mask` → a 7-bit integer in which bit *d* is set if the user has placed at least one order on day of week *d* (0 to 6).  
     - `hour_pref_hist` → a 24-element array of order counts by hour of day (index 0 is hour 0, index 23 is hour 23).  
3. **Persist results** as a Delta table in the silver layer for structured and performant analytical access.
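
To make the two behavioral signals concrete, here is a minimal pure-Python sketch of the same encoding for a single user. The helper names `build_dow_mask` and `build_hour_hist` are hypothetical and are not part of the pipeline. The sketch assumes `order_dow` values fall in 0 to 6 and `order_hour_of_day` values in 0 to 23.

```python
from collections import Counter
from typing import Iterable, List


def build_dow_mask(order_dows: Iterable[int]) -> int:
    """Set bit d for every day of week d (0 to 6) that appears at least once."""
    mask = 0
    for dow in set(order_dows):
        mask |= 1 << dow
    return mask


def build_hour_hist(order_hours: Iterable[int]) -> List[int]:
    """Return a 24-element list of order counts indexed by hour of day."""
    counts = Counter(order_hours)
    return [counts.get(hour, 0) for hour in range(24)]


# Hypothetical user: orders on days 0, 2, 2, 6 at hours 9, 9, 14, 20.
mask = build_dow_mask([0, 2, 2, 6])
hist = build_hour_hist([9, 9, 14, 20])
print(bin(mask))  # 0b1000101 (bits 0, 2, and 6 are set)
print(hist[9], hist[14], hist[20], len(hist))  # 2 1 1 24
```

Note that the mask records every day on which an order occurred, not only the most frequent one, so a user who ordered once on each day ends up with all seven bits set.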

---

### Why We Use a Surrogate Key (`user_sk`)

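- Provides a **compact numeric identifier** for lightweight joins. Because the load below uses `INSERT OVERWRITE`, identity values are regenerated on each refresh; keeping `user_sk` stable across refreshes would require an incremental load (for example, `MERGE`).  
- Gives a single numeric column that can be used for **Z-ORDER clustering**, which helps Delta skip files on key lookups and joins. The table is not partitioned, so partition pruning does not apply here.  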
- **Decouples** analytical tables from changing or inconsistent source identifiers.
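
One way to picture a surrogate key that stays stable across refreshes is a mapping that keeps every existing `user_id` to `user_sk` pair and hands new users the next free key. The sketch below is a conceptual illustration in plain Python, not a description of how Delta identity columns work internally. The function name `assign_surrogate_keys` and the sample IDs are hypothetical.

```python
from typing import Dict, Iterable


def assign_surrogate_keys(
    existing: Dict[int, int], user_ids: Iterable[int]
) -> Dict[int, int]:
    """Keep existing user_id -> user_sk pairs; give unseen users the next free key."""
    result = dict(existing)
    next_sk = max(result.values(), default=0) + 1
    for user_id in sorted(set(user_ids)):
        if user_id not in result:
            result[user_id] = next_sk
            next_sk += 1
    return result


# First load, then a refresh in which user 412 appears for the first time.
first_load = assign_surrogate_keys({}, [101, 205, 309])
refresh = assign_surrogate_keys(first_load, [205, 309, 412])
print(first_load)  # {101: 1, 205: 2, 309: 3}
print(refresh)     # {101: 1, 205: 2, 309: 3, 412: 4}
```

Fact tables that stored `user_sk` after the first load still resolve correctly after the refresh, because no existing key changed.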

---

### Why This Modeling Helps

- **Compression:** Converts millions of order records into one summarized row per user, reducing storage and improving query performance.  
- **Analytical readiness:** Pre-aggregated metrics simplify downstream analysis and feature engineering (e.g., z-scoring `mean_days_between_orders`, normalizing `hour_pref_hist`).  
- **Behavioral insight:** Supports analysis of churn risk, reorder cadence, and user segmentation directly from the dimension, without rescanning the order history.  
- **Scalability:** Serves as a stable foundation for joining with fact tables and powering predictive analytics without recomputation.
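
As a rough illustration of the feature engineering mentioned above, the following sketch normalizes an `hour_pref_hist` array, z-scores a list of `mean_days_between_orders` values, and decodes a `dow_pref_mask`. It uses only the Python standard library. The function names and sample values are hypothetical, and the z-score uses the population standard deviation, which is an assumption rather than a choice the pipeline prescribes.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def normalize_hist(hist: Sequence[int]) -> List[float]:
    """Scale counts so they sum to 1; an all-zero histogram stays all zeros."""
    total = sum(hist)
    if total == 0:
        return [0.0] * len(hist)
    return [count / total for count in hist]


def z_scores(values: Sequence[float]) -> List[float]:
    """Standardize values using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def decode_dow_mask(mask: int) -> List[int]:
    """List the days of week (0 to 6) whose bit is set in the mask."""
    return [d for d in range(7) if mask & (1 << d)]


# Hypothetical values for illustration only.
print(normalize_hist([1, 1, 2]))       # [0.25, 0.25, 0.5]
print(z_scores([5.0, 10.0, 15.0])[1])  # 0.0 (the middle value equals the mean)
print(decode_dow_mask(69))             # [0, 2, 6]
```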

---

**Result:**  
A structured, high-performance `dim_user` table that transforms raw transactional data into user-level behavioral intelligence.


In [0]:
%sql
USE CATALOG instacart;


In [0]:
%sql
-- ensure schema exists (safe to re-run)
CREATE SCHEMA IF NOT EXISTS instacart.silver
LOCATION 'abfss://processed-data@datastorage00578.dfs.core.windows.net/Instacart/silver/';

-- temp views
CREATE OR REPLACE TEMP VIEW v_orders AS
SELECT order_id, user_id, order_number, order_dow, order_hour_of_day, days_since_prior_order
FROM instacart.bronze.orders;

CREATE OR REPLACE TEMP VIEW v_user_hour AS
SELECT user_id, order_hour_of_day AS hr, COUNT(*) AS cnt
FROM v_orders
GROUP BY user_id, order_hour_of_day;

CREATE OR REPLACE TEMP VIEW v_user_hour_pivot AS
SELECT *
FROM v_user_hour
PIVOT (SUM(cnt) FOR hr IN (
  0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
));

-- table shell
CREATE TABLE IF NOT EXISTS instacart.silver.dim_user (
  user_sk BIGINT GENERATED ALWAYS AS IDENTITY,
  user_id INT,
  first_order_number SMALLINT,
  last_order_number SMALLINT,
  total_orders SMALLINT,
  mean_days_between_orders FLOAT,
  dow_pref_mask INT,
  hour_pref_hist ARRAY<SMALLINT>
)
USING DELTA
LOCATION 'abfss://processed-data@datastorage00578.dfs.core.windows.net/Instacart/silver/dim_user';

-- INSERT with explicit column list (omit identity)
INSERT OVERWRITE instacart.silver.dim_user (
  user_id,
  first_order_number,
  last_order_number,
  total_orders,
  mean_days_between_orders,
  dow_pref_mask,
  hour_pref_hist
)
SELECT
  u.user_id,
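  MIN(u.order_number)                                                AS first_order_number,
  MAX(u.order_number)                                                AS last_order_number,
  -- assumes order_number runs 1..N per user with no gaps, so MAX equals the order count
  CAST(MAX(u.order_number) AS SMALLINT)                              AS total_orders,
  -- AVG ignores NULLs, so orders with no recorded prior gap do not affect the mean
  AVG(u.days_since_prior_order)                                      AS mean_days_between_orders,
  -- set bit d for every distinct day of week d on which the user ordered
  REDUCE(COLLECT_SET(u.order_dow), 0,
         (acc, x) -> acc | SHIFTLEFT(1, x),
         acc -> acc)                                                 AS dow_pref_mask,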
  ARRAY(
    COALESCE(p.`0`,0), COALESCE(p.`1`,0), COALESCE(p.`2`,0), COALESCE(p.`3`,0),
    COALESCE(p.`4`,0), COALESCE(p.`5`,0), COALESCE(p.`6`,0), COALESCE(p.`7`,0),
    COALESCE(p.`8`,0), COALESCE(p.`9`,0), COALESCE(p.`10`,0), COALESCE(p.`11`,0),
    COALESCE(p.`12`,0), COALESCE(p.`13`,0), COALESCE(p.`14`,0), COALESCE(p.`15`,0),
    COALESCE(p.`16`,0), COALESCE(p.`17`,0), COALESCE(p.`18`,0), COALESCE(p.`19`,0),
    COALESCE(p.`20`,0), COALESCE(p.`21`,0), COALESCE(p.`22`,0), COALESCE(p.`23`,0)
  )                                                                  AS hour_pref_hist
FROM v_orders u
LEFT JOIN v_user_hour_pivot p
  ON u.user_id = p.user_id
GROUP BY u.user_id,
  COALESCE(p.`0`,0), COALESCE(p.`1`,0), COALESCE(p.`2`,0), COALESCE(p.`3`,0),
  COALESCE(p.`4`,0), COALESCE(p.`5`,0), COALESCE(p.`6`,0), COALESCE(p.`7`,0),
  COALESCE(p.`8`,0), COALESCE(p.`9`,0), COALESCE(p.`10`,0), COALESCE(p.`11`,0),
  COALESCE(p.`12`,0), COALESCE(p.`13`,0), COALESCE(p.`14`,0), COALESCE(p.`15`,0),
  COALESCE(p.`16`,0), COALESCE(p.`17`,0), COALESCE(p.`18`,0), COALESCE(p.`19`,0),
  COALESCE(p.`20`,0), COALESCE(p.`21`,0), COALESCE(p.`22`,0), COALESCE(p.`23`,0);


In [0]:
%sql
SELECT * FROM instacart.silver.dim_user LIMIT 20;

In [0]:
%sql
SELECT COUNT(*) FROM instacart.silver.dim_user;