# ETL Testing in a Data Warehouse (DWH) 


## What is ETL Testing?

**ETL Testing** is the practice of validating that data is:
- correctly **extracted** from sources,
- correctly **transformed** according to business rules,
- correctly **loaded** into target tables (warehouse layers),
- and remains **trustworthy** for reporting and decision-making.


## Why ETL Testing is Important

ETL Testing matters because:

- **Decisions depend on data**  
  Incorrect revenue, wrong customer mapping, or missing orders can lead to bad business decisions.

- **Errors are expensive when discovered late**  
  If a bug reaches the Gold layer and dashboards are built on it, fixing it later can require reprocessing, backfills, and stakeholder rework.

- **Pipelines change constantly**  
  New sources, new columns, rule changes, and schema updates can silently break logic.

- **Data issues are common in real-world feeds**  
  NULLs, duplicates, inconsistent city names, invalid emails, late-arriving data, and wrong categories are routine, not exceptional.

- **Trust is hard to earn and easy to lose**  
  Once business users lose trust in reports, adoption drops and the warehouse becomes underused.


## Where ETL Testing Fits in a DWH Pipeline

A common DWH pipeline has layers like:

- **Bronze**: consolidated raw landing (minimal changes, add lineage like `source_system`)
- **Silver**: cleansed and standardized (apply business rules, remove bad data, deduplicate)
- **Gold**: business-ready model (star schema, facts & dimensions, calculated measures)

ETL tests should be present at **each layer**:
- Bronze tests confirm **ingestion correctness** (e.g., duplicates, row counts, schema drift).
- Silver tests confirm **transformation correctness** (e.g., standardization rules, filters, dedup rules).
- Gold tests confirm **analytics correctness** (e.g., measures, referential integrity, completeness from Silver → Gold).
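
The layer checks above can be sketched in a few lines. The following is a minimal illustration using Python's built-in `sqlite3` module as a stand-in for the warehouse. The table names (`bronze_orders`, `silver_orders`, `gold_fact_orders`), the expected column set, and the helper functions are all assumptions made for this example, not part of any specific pipeline.

```python
import sqlite3

# Hypothetical expected schema for a Bronze table (an assumption for this sketch).
EXPECTED_BRONZE_COLUMNS = {"order_id", "customer_id", "amount", "source_system"}


def bronze_schema_drift(conn, table, expected):
    """Return (missing, unexpected) column sets for a table."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    # The table name is interpolated directly, so it must come from trusted code.
    actual = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    return expected - actual, actual - expected


def missing_in_target(conn, source, target, key):
    """Count keys present in the source layer but absent from the target layer."""
    sql = f"""
        SELECT COUNT(*) FROM {source} s
        LEFT JOIN {target} t ON s.{key} = t.{key}
        WHERE t.{key} IS NULL
    """
    return conn.execute(sql).fetchone()[0]


# Tiny in-memory demo data: Bronze has an extra column, Gold lost one order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bronze_orders (
        order_id INTEGER, customer_id INTEGER, amount REAL,
        source_system TEXT, coupon TEXT
    );
    CREATE TABLE silver_orders (order_id INTEGER, amount REAL);
    CREATE TABLE gold_fact_orders (order_id INTEGER, amount REAL);
    INSERT INTO silver_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO gold_fact_orders VALUES (1, 10.0), (2, 20.0);
""")

missing_cols, unexpected_cols = bronze_schema_drift(
    conn, "bronze_orders", EXPECTED_BRONZE_COLUMNS
)
lost_rows = missing_in_target(conn, "silver_orders", "gold_fact_orders", "order_id")

print("missing columns:", missing_cols)
print("unexpected columns:", unexpected_cols)
print("Silver rows missing from Gold:", lost_rows)
```

In this demo the Bronze check flags the extra `coupon` column as drift, and the Silver → Gold check reports one order that never reached Gold.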


## Core Dimensions of Data Quality (What We Test)

Most ETL tests fall into these categories:

1. **Completeness**  
   Are all expected records present? Did we lose anything during transformation?

2. **Accuracy**  
   Are values correct based on the rules? Are calculations correct?

3. **Validity**  
   Do values conform to expected formats/ranges? (e.g., email format, dates)

4. **Uniqueness**  
   Are keys unique where they must be? (e.g., `order_id` in Silver/Gold)

5. **Consistency**  
   Are values consistent across datasets and over time? (e.g., the same city always maps to the same standardized name)

6. **Referential Integrity**  
   Do fact rows correctly reference dimension keys?

7. **Timeliness / Freshness**  
   Is the data updated within required windows? Is `load_timestamp` progressing?

8. **Reconciliation**  
   Do totals match between layers and/or sources? (counts, sums, revenue)
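
Several of these dimensions translate naturally into queries that count violations, where zero means the check passes. The sketch below shows one possible form for uniqueness, validity, referential integrity, and reconciliation, again using `sqlite3` as a stand-in. The tables, columns, the loose `LIKE` pattern for emails, and the 0.005 tolerance are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE silver_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE gold_dim_customer (customer_id INTEGER, email TEXT);
    CREATE TABLE gold_fact_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO silver_orders VALUES (1, 100, 10.0), (2, 101, 20.0), (3, 102, 30.0);
    INSERT INTO gold_dim_customer VALUES (100, 'a@example.com'), (101, 'not-an-email');
    INSERT INTO gold_fact_orders VALUES
        (1, 100, 10.0), (2, 101, 20.0), (2, 101, 20.0), (3, 999, 30.0);
""")

# Each query returns a single number; 0 means no violations.
CHECKS = {
    "uniqueness: duplicated order_id in Gold": """
        SELECT COUNT(*) FROM (
            SELECT order_id FROM gold_fact_orders
            GROUP BY order_id HAVING COUNT(*) > 1
        )
    """,
    # Deliberately loose pattern: something@something.something
    "validity: malformed email": """
        SELECT COUNT(*) FROM gold_dim_customer
        WHERE email IS NULL OR email NOT LIKE '%_@_%._%'
    """,
    "referential integrity: orphan fact rows": """
        SELECT COUNT(*) FROM gold_fact_orders f
        LEFT JOIN gold_dim_customer d ON f.customer_id = d.customer_id
        WHERE d.customer_id IS NULL
    """,
    # Returns 1 if the totals differ by more than the tolerance, else 0.
    "reconciliation: Silver vs Gold amount mismatch": """
        SELECT ABS(
            (SELECT COALESCE(SUM(amount), 0) FROM silver_orders)
          - (SELECT COALESCE(SUM(amount), 0) FROM gold_fact_orders)
        ) > 0.005
    """,
}

violations = {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}
for name, count in violations.items():
    print(f"{'PASS' if count == 0 else 'FAIL'}  {name}: {count}")
```

The demo data is seeded with one problem per check, so every line prints `FAIL`. Keeping each rule as its own query makes it easy to see which dimension broke.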


## Designing a Simple ETL Test Framework 

A practical pattern is to:
1. Generate a unique `test_run_id` for each run.
2. Run each test and capture:
   - `test_name`
   - `status` (PASS/FAIL)
   - `actual_value` (usually a count of violations)
   - `expected_desc`
   - `details`
3. Store results in a QA table (optional, but recommended for audit/history).
4. Print a run report.

Even without storing results, you can still return a PASS/FAIL row for each test query.
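
The pattern above could look roughly like the following. This is a minimal sketch, not a full framework: the QA table name `qa_test_results`, the sample tests, and the convention that a test passes when its query returns 0 are assumptions chosen for illustration.

```python
import sqlite3
import uuid


def run_tests(conn, tests):
    """Run each violation-count query and log one result row per test."""
    test_run_id = str(uuid.uuid4())  # one id shared by all tests in this run
    conn.execute("""
        CREATE TABLE IF NOT EXISTS qa_test_results (
            test_run_id TEXT, test_name TEXT, status TEXT,
            actual_value INTEGER, expected_desc TEXT, details TEXT
        )
    """)
    results = []
    for test_name, sql, expected_desc in tests:
        actual_value = conn.execute(sql).fetchone()[0]
        status = "PASS" if actual_value == 0 else "FAIL"
        details = "" if status == "PASS" else f"{actual_value} violation(s) found"
        row = (test_run_id, test_name, status, actual_value, expected_desc, details)
        conn.execute("INSERT INTO qa_test_results VALUES (?, ?, ?, ?, ?, ?)", row)
        results.append(row)
    conn.commit()
    return test_run_id, results


def print_report(test_run_id, results):
    """Print a short run report: one line per test plus a summary."""
    print(f"Test run {test_run_id}")
    for _, test_name, status, actual_value, expected_desc, _ in results:
        print(f"  [{status}] {test_name}: actual={actual_value}, expected: {expected_desc}")
    failed = sum(1 for r in results if r[2] == "FAIL")
    print(f"  {len(results) - failed} passed, {failed} failed")


# Demo data: order_id values are unique, but one amount is NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE silver_orders (order_id INTEGER, amount REAL);
    INSERT INTO silver_orders VALUES (1, 10.0), (2, NULL), (3, 30.0);
""")

tests = [
    ("silver_order_id_unique",
     "SELECT COUNT(*) FROM (SELECT order_id FROM silver_orders "
     "GROUP BY order_id HAVING COUNT(*) > 1)",
     "0 duplicated order_id values"),
    ("silver_amount_not_null",
     "SELECT COUNT(*) FROM silver_orders WHERE amount IS NULL",
     "0 NULL amounts"),
]

test_run_id, results = run_tests(conn, tests)
print_report(test_run_id, results)
```

Because every row carries the same `test_run_id`, the QA table can later be queried per run or across runs to track trends.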


## Best Practices for ETL Testing

- **Test early and at every layer**: Don’t wait until Gold to detect issues.
- **Use counts and sums for reconciliation**: Compare row counts, revenue sums, and distinct key counts between layers.
- **Make tests deterministic**: The same input should always yield the same result.
- **Keep tests small and readable**: One rule per test is easier to debug.
- **Log test results**: Storing in a QA table enables audit and trend tracking.
- **Automate**: Run tests as part of the pipeline (after each load).
- **Treat failing tests as blockers**: If critical rules fail, stop the pipeline and investigate.
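
One way to treat critical failures as blockers is to raise an exception after the tests run, so the orchestrator stops before the next load step. The sketch below assumes each result carries a `severity` label (`"critical"` or `"warning"`); that label, the exception class, and the test names are hypothetical and only illustrate the idea.

```python
class CriticalTestFailure(Exception):
    """Raised to stop the pipeline when a critical test fails."""


def enforce_gate(results):
    """Raise if any critical test failed; only print non-critical failures.

    results: list of (test_name, status, severity) tuples.
    """
    critical = [name for name, status, severity in results
                if status == "FAIL" and severity == "critical"]
    warnings = [name for name, status, severity in results
                if status == "FAIL" and severity != "critical"]
    for name in warnings:
        print(f"WARNING: non-critical test failed: {name}")
    if critical:
        raise CriticalTestFailure(f"Blocking failures: {', '.join(critical)}")


# Demo: one critical failure and one non-critical failure.
results = [
    ("gold_order_id_unique", "FAIL", "critical"),
    ("silver_city_standardized", "FAIL", "warning"),
    ("bronze_row_count_matches_source", "PASS", "critical"),
]

pipeline_stopped = False
try:
    enforce_gate(results)
    print("All critical tests passed; continuing to the next step.")
except CriticalTestFailure as exc:
    pipeline_stopped = True
    print(f"Pipeline stopped: {exc}")
```

Separating critical rules from warnings keeps the gate strict where it matters without halting loads over cosmetic issues.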
