# Retail Schema: Grain, Dimensions, Facts, and Keys

## Data Grain

**Grain** is the *lowest level of detail* represented by a fact table row.

Declaring the grain is a design commitment: it states exactly what one row means (for example, “one row per transaction line item”).

You should declare the grain before choosing facts and dimensions because every selected column must be compatible with that grain.

In retail schemas, two business events occur frequently:

1. **An order is placed**
2. **A product is purchased as part of an order** (line item)

So you commonly see two grains:

- **Order-level grain:** one row per `order_id` (table: `orders`)  
- **Line-item grain:** one row per order line item (table: `order_items`)

### Why grain matters

- Order totals and order status describe the order as a whole, so they belong to the **order grain**
- Quantity and line revenue describe a single product within an order, so they belong to the **line-item grain**
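The two grains can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The sample rows and the `order_status` and `order_total` column names are assumptions made for illustration, not part of a confirmed schema.

```python
import sqlite3

# In-memory database with illustrative (made-up) rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,   -- order grain: one row per order
    order_status TEXT,                  -- assumed column name
    order_total  REAL                   -- assumed column name
);
CREATE TABLE order_items (
    order_item_id         INTEGER PRIMARY KEY,  -- line-item grain: one row per line
    order_item_order_id   INTEGER REFERENCES orders(order_id),
    order_item_product_id INTEGER,
    order_item_quantity   INTEGER,
    order_item_subtotal   REAL
);
INSERT INTO orders VALUES (1, 'COMPLETE', 70.0), (2, 'PENDING', 20.0);
INSERT INTO order_items VALUES
    (1, 1, 101, 2, 40.0),
    (2, 1, 102, 1, 30.0),
    (3, 2, 101, 1, 20.0);
""")

order_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
line_rows = conn.execute("SELECT COUNT(*) FROM order_items").fetchone()[0]

# Each order can own several line items, so the two tables have different row counts.
lines_per_order = dict(conn.execute("""
    SELECT order_item_order_id, COUNT(*)
    FROM order_items
    GROUP BY order_item_order_id
"""))

print(order_rows, line_rows, lines_per_order)
```

The same two orders produce two rows at the order grain and three rows at the line-item grain, which is why a column has to be assigned to the grain it actually describes.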

## Dimension Tables

Dimensions provide the descriptive context around a business event: **who, what, where, when, why, and how**.

They contain attributes that users use to filter, group, and label results (for example, product category, region, customer segment).

Dimensions often require careful governance because their definitions become the shared vocabulary for business analysis.

### Retail schema examples (dimensions)

- `customers` → **Customer dimension** (who)
- `products` → **Product dimension** (what)
- `categories`, `departments` → **Product hierarchy dimensions** (what, how grouped)
- (often added) `dim_date` → **Date dimension** (when)

### Example questions dimensions help answer

- Revenue by **category_name**
- Orders by **customer city**
- Revenue trend by **order_date**
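As a rough sketch of the first question, revenue by `category_name` can be computed by joining the line-item rows to the product and category dimensions. The join columns `product_category_id` and `category_id`, and all sample rows, are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (
    category_id   INTEGER PRIMARY KEY,   -- assumed column name
    category_name TEXT
);
CREATE TABLE products (
    product_id          INTEGER PRIMARY KEY,
    product_category_id INTEGER REFERENCES categories(category_id)  -- assumed column name
);
CREATE TABLE order_items (
    order_item_id         INTEGER PRIMARY KEY,
    order_item_product_id INTEGER REFERENCES products(product_id),
    order_item_subtotal   REAL
);
INSERT INTO categories VALUES (1, 'Footwear'), (2, 'Outdoors');
INSERT INTO products VALUES (101, 1), (102, 2), (103, 1);
INSERT INTO order_items VALUES (1, 101, 40.0), (2, 102, 30.0), (3, 103, 20.0);
""")

# The dimension attribute (category_name) groups and labels the measure (subtotal).
revenue_by_category = dict(conn.execute("""
    SELECT c.category_name, SUM(oi.order_item_subtotal)
    FROM order_items AS oi
    JOIN products   AS p ON oi.order_item_product_id = p.product_id
    JOIN categories AS c ON p.product_category_id = c.category_id
    GROUP BY c.category_name
"""))

print(revenue_by_category)
```

The measure comes from the line-item rows; the dimensions only supply the label used for grouping.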

## Fact Tables

Facts are the measurable outcomes of business events and are typically numeric (for example, quantity sold, sales amount).

Each fact table row should correspond to exactly one event at the declared grain: no event is split across several rows, and no row combines several events.

Fact tables usually contain two kinds of columns:

- **Foreign keys** to dimensions
- **Measures** used in calculations

### In our retail schema, `order_items` behaves like a fact table

**Grain:** one row per order line item

**Dimension keys:**
- `order_item_order_id` (ties to `orders`)
- `order_item_product_id` (ties to `products`)

**Measures:**
- `order_item_quantity`
- `order_item_subtotal` (line revenue)
- `order_item_product_price` (unit price at purchase time; non-additive, so average or weight it rather than summing it)
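The measures above do not all aggregate the same way. The sketch below, with made-up rows, sums the additive measures and derives an average unit price from them instead of summing the price column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_items (
    order_item_id            INTEGER PRIMARY KEY,
    order_item_order_id      INTEGER,
    order_item_product_id    INTEGER,
    order_item_quantity      INTEGER,
    order_item_subtotal      REAL,
    order_item_product_price REAL
);
-- Illustrative rows: subtotal = quantity * unit price.
INSERT INTO order_items VALUES
    (1, 1, 101, 2, 40.0, 20.0),
    (2, 1, 102, 1, 30.0, 30.0),
    (3, 2, 101, 1, 20.0, 20.0);
""")

total_quantity, total_revenue, summed_prices = conn.execute("""
    SELECT SUM(order_item_quantity),
           SUM(order_item_subtotal),
           SUM(order_item_product_price)
    FROM order_items
""").fetchone()

# Quantity and subtotal are additive. A sum of unit prices has no business
# meaning, so the average price is derived from the two additive measures.
average_unit_price = total_revenue / total_quantity

print(total_quantity, total_revenue, summed_prices, average_unit_price)
```

Here the summed unit prices (70.0) match neither the revenue (90.0) nor any meaningful price, while revenue divided by quantity gives a quantity-weighted average of 22.5.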

## Rule: Do Not Mix Grains

A single fact table should represent a single grain.

Mixing different grains in the same table makes metrics ambiguous and leads to incorrect aggregations and confusing report behavior.

If you truly need multiple grains, model them as separate fact tables designed for their specific grains.

### Retail example of grain-mixing (what not to do)

If you store `order_total` on every line-item row, the value repeats once per line item. Summing `order_total` over that table then overcounts: an order with three line items contributes its total three times.

### Correct patterns

- Keep order-level measures in an **order-grain fact** (one row per order).
- Keep line-item measures in a **line-item fact** (one row per order item).
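The overcount can be reproduced with a small sketch. The table name `order_items_mixed` and the sample rows are hypothetical, chosen only to show the effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Order grain: one row per order.
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    order_total REAL
);
-- Anti-pattern (hypothetical table): order_total copied onto each line item.
CREATE TABLE order_items_mixed (
    order_item_id       INTEGER PRIMARY KEY,
    order_item_order_id INTEGER,
    order_item_subtotal REAL,
    order_total         REAL
);
INSERT INTO orders VALUES (1, 70.0), (2, 20.0);
INSERT INTO order_items_mixed VALUES
    (1, 1, 40.0, 70.0),
    (2, 1, 30.0, 70.0),
    (3, 2, 20.0, 20.0);
""")

def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]

# Order 1 has two line items, so its total is counted twice.
overcounted_total = scalar("SELECT SUM(order_total) FROM order_items_mixed")

# Each measure summed at its own grain gives the same, correct answer.
order_grain_total = scalar("SELECT SUM(order_total) FROM orders")
line_grain_total = scalar("SELECT SUM(order_item_subtotal) FROM order_items_mixed")

print(overcounted_total, order_grain_total, line_grain_total)
```

With these rows the mixed-grain sum is 160.0, while both single-grain sums agree on 90.0.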

## What Is a Key

A **key** is a column (or a set of columns) that uniquely identifies a row in a table. A foreign key in another table references that key to link the two tables.

Keys are essential for maintaining uniqueness and connecting tables reliably.

Without correct keys, joins can duplicate rows, drop records, or create inconsistent results in reporting.

### Retail schema examples

- `customers.customer_id` uniquely identifies a customer
- `orders.order_id` uniquely identifies an order
- `products.product_id` uniquely identifies a product
- `order_items.order_item_id` uniquely identifies an order line item

These keys enable reliable joins:

- `orders.order_customer_id = customers.customer_id`
- `order_items.order_item_order_id = orders.order_id`
- `order_items.order_item_product_id = products.product_id`
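To see why uniqueness matters for these joins, the sketch below compares a join against a properly keyed `customers` table with a join against a hypothetical `customers_dup` table that contains a repeated `customer_id`. The `customer_city` column name and the rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_city TEXT);
-- Hypothetical copy with no key enforced, so a duplicate can slip in.
CREATE TABLE customers_dup (customer_id INTEGER, customer_city TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_customer_id INTEGER);

INSERT INTO customers     VALUES (1, 'Pune'), (2, 'Austin');
INSERT INTO customers_dup VALUES (1, 'Pune'), (1, 'Pune'), (2, 'Austin');
INSERT INTO orders        VALUES (10, 1), (11, 1), (12, 2);
""")

def joined_order_count(customer_table):
    # The table name is interpolated only because it comes from this script,
    # never from user input.
    sql = f"""
        SELECT COUNT(*)
        FROM orders AS o
        JOIN {customer_table} AS c ON o.order_customer_id = c.customer_id
    """
    return conn.execute(sql).fetchone()[0]

rows_with_key = joined_order_count("customers")         # one row per order
rows_without_key = joined_order_count("customers_dup")  # orders 10 and 11 are doubled

# A simple pre-join check: list key values that appear more than once.
duplicate_ids = [row[0] for row in conn.execute("""
    SELECT customer_id FROM customers_dup
    GROUP BY customer_id HAVING COUNT(*) > 1
""")]

print(rows_with_key, rows_without_key, duplicate_ids)
```

Three orders stay three rows when the key is unique, but become five rows when customer 1 appears twice, which would inflate any measure summed after the join.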

## Natural Key

A **natural key** is a real-world identifier that already exists in the data (for example, Aadhaar number or SSN).

It can uniquely identify a business entity without adding a new generated column.

Natural keys are meaningful, but they may change, contain sensitive information, or vary across systems.

### Retail examples of natural keys (possible)

- Customer email (often unique, but can change)
- Product SKU (meaningful, but can be reassigned or redefined)
- External order reference / payment transaction ID

### Example risk

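If you join on customer email and the customer later updates it, rows recorded under the old email no longer match the customer record. The join silently drops that history unless you join on a stable identifier or keep track of past email values.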
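The risk can be demonstrated with a short sketch. The `order_customer_email` column (an email copied onto the order at purchase time) and the `customer_email` column name are hypothetical, as are the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id    INTEGER PRIMARY KEY,
    customer_email TEXT UNIQUE            -- assumed column name
);
CREATE TABLE orders (
    order_id             INTEGER PRIMARY KEY,
    order_customer_id    INTEGER,
    order_customer_email TEXT             -- hypothetical: email captured at purchase time
);
INSERT INTO customers VALUES (1, 'ana@old.example');
INSERT INTO orders VALUES (10, 1, 'ana@old.example'), (11, 1, 'ana@old.example');
""")

EMAIL_JOIN = "o.order_customer_email = c.customer_email"  # natural-key join
ID_JOIN = "o.order_customer_id = c.customer_id"           # stable-identifier join

def matched_orders(join_condition):
    # join_condition is one of the two constants above, not user input.
    sql = f"SELECT COUNT(*) FROM orders AS o JOIN customers AS c ON {join_condition}"
    return conn.execute(sql).fetchone()[0]

before_email = matched_orders(EMAIL_JOIN)

# The customer changes their email in the source system.
conn.execute(
    "UPDATE customers SET customer_email = ? WHERE customer_id = ?",
    ("ana@new.example", 1),
)

after_email = matched_orders(EMAIL_JOIN)  # historical orders no longer match
after_id = matched_orders(ID_JOIN)        # unaffected by the email change

print(before_email, after_email, after_id)
```

The email join matches both orders before the update and none afterwards, while the join on `customer_id` still matches both.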

## Surrogate Key

A **surrogate key** is a generated identifier (often an auto-increment integer or generated unique ID).

It is assigned when data is loaded into the warehouse and carries no business meaning of its own.

Surrogate keys are widely used when you expect multiple versions of the same entity (history tracking) or when natural keys are unstable.

### Retail example: customer history (SCD-style)

A warehouse dimension might look like:

- `customer_sk` (surrogate key)
- `customer_id` (natural/business key from source)
- attributes: email, city, segment
- `effective_start_date`, `effective_end_date`, `is_current`

This allows multiple historical versions for the same `customer_id`, each with a different `customer_sk`.
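One way to maintain such a dimension is sketched below, assuming a single tracked attribute (`city`), ISO date strings, and an end date that is exclusive. The helper `upsert_customer_version` is a hypothetical name, and real loads typically compare more attributes and handle late-arriving changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_sk          INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id          INTEGER NOT NULL,                   -- key from the source
    city                 TEXT,
    effective_start_date TEXT NOT NULL,
    effective_end_date   TEXT,                               -- NULL while current
    is_current           INTEGER NOT NULL
)""")

def upsert_customer_version(customer_id, city, change_date):
    """Hypothetical helper: add a new version only when the city changed."""
    current = conn.execute(
        "SELECT customer_sk, city FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if current is not None and current[1] == city:
        return current[0]  # nothing changed, keep the existing version
    if current is not None:
        # Close the old version instead of overwriting it.
        conn.execute(
            "UPDATE dim_customer SET effective_end_date = ?, is_current = 0 "
            "WHERE customer_sk = ?",
            (change_date, current[0]),
        )
    cursor = conn.execute(
        "INSERT INTO dim_customer "
        "(customer_id, city, effective_start_date, effective_end_date, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, city, change_date),
    )
    return cursor.lastrowid  # the newly generated surrogate key

sk_first = upsert_customer_version(42, "Pune", "2023-01-01")
sk_moved = upsert_customer_version(42, "Mumbai", "2024-06-01")
sk_same = upsert_customer_version(42, "Mumbai", "2024-09-01")  # no change

versions = conn.execute(
    "SELECT customer_sk, city, effective_end_date, is_current "
    "FROM dim_customer ORDER BY customer_sk"
).fetchall()
print(versions)
```

After the move, `customer_id` 42 has two rows: the closed Pune version and the current Mumbai version, each with its own `customer_sk`.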

## Surrogate Key vs Natural Key

Natural keys come from the source domain; surrogate keys are created by the warehouse.

Surrogate keys reduce dependency on operational identifiers and make relationships stable even when source identifiers change.

A common pattern is to keep both:

- use the **surrogate key** for joins
- retain the **natural key** for traceability back to the source

### Retail example pattern

**Dimension: `dim_customer`**
- `customer_sk` (surrogate) ✅ for joins
- `customer_id` (natural) ✅ for source traceability
- descriptive attributes

**Fact: `fact_sales_line_item`**
- `customer_sk`, `product_sk`, `date_key` (dimension foreign keys)
- measures: quantity, subtotal, unit_price
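The pattern can be sketched end to end: when a fact row is loaded, the source `customer_id` is translated into the `customer_sk` that was valid on the order date, and reports then join on the surrogate key. The `YYYYMMDD` integer format for `date_key`, the exclusive end date, the helper name `customer_sk_as_of`, and all rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_sk          INTEGER PRIMARY KEY,
    customer_id          INTEGER,
    city                 TEXT,
    effective_start_date TEXT,
    effective_end_date   TEXT,      -- exclusive; NULL marks the current version
    is_current           INTEGER
);
CREATE TABLE fact_sales_line_item (
    customer_sk INTEGER REFERENCES dim_customer(customer_sk),
    product_sk  INTEGER,
    date_key    INTEGER,            -- assumed YYYYMMDD integer
    quantity    INTEGER,
    subtotal    REAL,
    unit_price  REAL
);
INSERT INTO dim_customer VALUES
    (1, 42, 'Pune',   '2023-01-01', '2024-06-01', 0),
    (2, 42, 'Mumbai', '2024-06-01', NULL,         1);
""")

def customer_sk_as_of(customer_id, order_date):
    """Hypothetical lookup: surrogate key of the version valid on order_date."""
    row = conn.execute("""
        SELECT customer_sk FROM dim_customer
        WHERE customer_id = ?
          AND effective_start_date <= ?
          AND (effective_end_date IS NULL OR ? < effective_end_date)
    """, (customer_id, order_date, order_date)).fetchone()
    return row[0] if row else None

# Source rows: (customer_id, product_sk, order_date, quantity, subtotal, unit_price)
source_lines = [
    (42, 7, "2023-03-10", 2, 40.0, 20.0),
    (42, 8, "2024-07-01", 1, 30.0, 30.0),
]
for customer_id, product_sk, order_date, quantity, subtotal, unit_price in source_lines:
    conn.execute(
        "INSERT INTO fact_sales_line_item VALUES (?, ?, ?, ?, ?, ?)",
        (customer_sk_as_of(customer_id, order_date), product_sk,
         int(order_date.replace("-", "")), quantity, subtotal, unit_price),
    )

# Joining on the surrogate key attributes each sale to the city at purchase time.
revenue_by_city = dict(conn.execute("""
    SELECT d.city, SUM(f.subtotal)
    FROM fact_sales_line_item AS f
    JOIN dim_customer AS d ON f.customer_sk = d.customer_sk
    GROUP BY d.city
"""))
print(revenue_by_city)
```

Because each fact row points at a specific customer version, the 2023 sale stays attributed to Pune even after the customer moves, while `customer_id` remains available in the dimension for tracing rows back to the source.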