# Data Quality Frameworks & Governance

## What is a Data Quality Framework?

A **Data Quality (DQ) Framework** is a structured approach for defining, running, and monitoring data quality checks across a data pipeline. 
It standardizes *what* to check (rules), *when* to check (ingestion vs transformation vs consumption), and *what to do* when data fails. 
The goal is to make data quality repeatable and measurable, instead of relying on ad-hoc manual checks. 
In modern platforms, a DQ framework is often embedded into pipelines so quality is validated continuously as data moves through the bronze (raw), silver (cleaned), and gold (curated) layers.

## Rules-based DQ Checks

**Rules-based checks** validate data against explicit business or technical rules, such as “`customer_id` cannot be null” or “`order_amount` must be >= 0”.
These rules typically map to quality dimensions like completeness, validity, accuracy, and consistency. 
A good rule is specific, testable, and tied to a dataset and column(s), with a clear pass/fail outcome. 
Rules-based checks are the foundation for automation because they can run at scale on every batch or streaming micro-batch.
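
As a minimal sketch of how such rules might be expressed in code, the Python below models each rule as a named predicate tied to a dataset, a column, and a quality dimension. The `Rule` class, the `run_rule` helper, and the sample `orders` rows are hypothetical names chosen for illustration, not the API of any particular DQ tool.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Rule:
    """One explicit, testable rule bound to a dataset and column."""
    name: str
    dataset: str
    column: str
    dimension: str                      # e.g. completeness, validity
    predicate: Callable[[Any], bool]    # True means the value passes


def run_rule(rule, rows):
    """Apply a rule to every row and return a clear pass/fail result."""
    failed_rows = [
        index for index, row in enumerate(rows)
        if not rule.predicate(row.get(rule.column))
    ]
    return {
        "rule": rule.name,
        "checked": len(rows),
        "failed": len(failed_rows),
        "failed_rows": failed_rows,
        "passed": not failed_rows,
    }


# The two example rules from the text above.
RULES = [
    Rule("customer_id_not_null", "orders", "customer_id", "completeness",
         lambda value: value is not None),
    Rule("order_amount_non_negative", "orders", "order_amount", "validity",
         lambda value: value is not None and value >= 0),
]

# Illustrative sample batch (made-up data).
orders = [
    {"customer_id": "C1", "order_amount": 25.0},
    {"customer_id": None, "order_amount": 10.0},
    {"customer_id": "C3", "order_amount": -5.0},
]

results = [run_rule(rule, orders) for rule in RULES]
```

Because each rule is plain data plus a predicate, the same list can be run against every batch or micro-batch without changing pipeline code.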

## Thresholds and Scorecards

A **threshold** defines how much bad data is acceptable before the pipeline is considered unhealthy (for example, “null rate must be < 0.1%”). 
Thresholds are practical because real-world data is rarely perfect: a small tolerated failure rate lets the pipeline raise an alert instead of blocking every load.
A **scorecard** aggregates multiple check results into a simple view (by dataset, domain, or pipeline run) so teams can track quality over time. 
Scorecards help compare quality across releases and highlight whether data quality is improving, stable, or degrading.
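
The sketch below shows one way a threshold and a scorecard could fit together, assuming check results arrive as simple `(dataset, status)` pairs. The function names, the 0.1% limit, and the sample data are illustrative assumptions only.

```python
def null_rate(rows, column):
    """Fraction of rows where the column is missing (None)."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def evaluate_threshold(observed_rate, max_allowed):
    """Return 'pass' when the observed rate is strictly below the limit."""
    return "pass" if observed_rate < max_allowed else "alert"


def scorecard(check_results):
    """Aggregate (dataset, status) pairs into a pass percentage per dataset."""
    summary = {}
    for dataset, status in check_results:
        entry = summary.setdefault(dataset, {"passed": 0, "total": 0})
        entry["total"] += 1
        if status == "pass":
            entry["passed"] += 1
    return {
        dataset: round(100 * entry["passed"] / entry["total"], 1)
        for dataset, entry in summary.items()
    }


# Made-up batch: 5 nulls in 1000 rows, checked against "null rate < 0.1%".
rows = [{"customer_id": f"C{i}"} for i in range(995)] + [{"customer_id": None}] * 5
rate = null_rate(rows, "customer_id")
status = evaluate_threshold(rate, max_allowed=0.001)

# Made-up results from one pipeline run.
card = scorecard([
    ("orders", status),
    ("orders", "pass"),
    ("customers", "pass"),
])
```

Storing one scorecard per run makes it possible to plot the percentages over time and see whether quality is improving or degrading.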

## Exception Handling

**Exception handling** defines what action the pipeline takes when a check fails. 
Common patterns include: **fail the pipeline** (hard stop), **quarantine** bad records to a separate table, **drop** invalid rows, or **continue with alerts**. 
The right choice depends on business risk: critical compliance datasets often require hard stops, while low-risk datasets may quarantine bad rows or continue with alerts.
Effective exception handling also records *why* the failure happened and *which rows* were affected, enabling faster root-cause analysis.
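
A minimal sketch of two of these patterns (hard stop and quarantine) might look like the following. The `apply_checks` function, the `DataQualityError` exception, and the policy names are hypothetical; the point is that every rejected row carries its index and the reasons it failed.

```python
class DataQualityError(Exception):
    """Raised when the policy is a hard stop and at least one row fails."""


def apply_checks(rows, checks, policy="quarantine"):
    """Split rows into valid and quarantined, recording why each row failed.

    policy="fail" raises instead of returning when any row is rejected.
    """
    valid, quarantined = [], []
    for index, row in enumerate(rows):
        reasons = [name for name, check in checks.items() if not check(row)]
        if reasons:
            quarantined.append(
                {"row_index": index, "row": row, "reasons": reasons}
            )
        else:
            valid.append(row)

    if quarantined and policy == "fail":
        raise DataQualityError(
            f"{len(quarantined)} of {len(rows)} rows failed checks"
        )
    return valid, quarantined


# Illustrative checks and data (assumed for this example).
CHECKS = {
    "customer_id_not_null": lambda row: row.get("customer_id") is not None,
    "order_amount_non_negative": lambda row: (
        row.get("order_amount") is not None and row["order_amount"] >= 0
    ),
}

sample_rows = [
    {"customer_id": "C1", "order_amount": 10},
    {"customer_id": None, "order_amount": -1},
]
```

In a real pipeline the `quarantined` list would typically be written to a separate table so the affected rows can be inspected and replayed.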

## Audit, Balance & Controls (ABC)

**Audit, Balance & Controls (ABC)** refers to operational checks that prove pipeline correctness beyond row-level validation. 
**Audit** typically tracks run metadata: when data arrived, which source file/version was used, counts processed, and success/failure status. 
**Balance** checks ensure data reconciliation across stages (for example, source count vs bronze count vs silver count, or totals matching across transformations). 
**Controls** define governance and enforcement mechanisms: approvals, access controls, lineage, retention rules, and documented sign-offs for critical datasets.
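
To make the audit and balance ideas concrete, here is a small sketch assuming each run knows how many rows it read, wrote, and explicitly rejected. The function and field names are illustrative, and the 0.01 tolerance for totals is an arbitrary assumption.

```python
from datetime import datetime, timezone


def counts_balance(source_count, target_count, rejected_count=0):
    """Every source row must be either written or explicitly rejected."""
    return source_count == target_count + rejected_count


def totals_balance(source_total, target_total, tolerance=0.01):
    """Control totals (e.g. summed amounts) must match within a tolerance."""
    return abs(source_total - target_total) <= tolerance


def build_audit_record(run_id, source_file, rows_read, rows_written, rows_rejected):
    """Capture run metadata plus the outcome of the count reconciliation."""
    balanced = counts_balance(rows_read, rows_written, rows_rejected)
    return {
        "run_id": run_id,
        "source_file": source_file,
        "rows_read": rows_read,
        "rows_written": rows_written,
        "rows_rejected": rows_rejected,
        "balanced": balanced,
        "status": "success" if balanced else "failed",
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Made-up run: 1000 rows read, 990 written, 10 quarantined.
record = build_audit_record("run-001", "orders_2024_01.csv", 1000, 990, 10)
```

A run where 1000 rows were read but only 990 written and none rejected would be flagged as unbalanced, which usually indicates silent row loss in a join or filter.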

## Quality Checks (Null / Duplicate / Type Checks)

Quality checks validate that data is usable and trustworthy before it is consumed by downstream tables, dashboards, or models. 
**Null checks** ensure mandatory fields are populated (for example, `customer_id` or `order_id` cannot be missing). 
**Duplicate checks** detect repeated records where uniqueness is expected (for example, duplicate primary keys or duplicate transactions). 
**Type checks** confirm columns match expected data types and formats (for example, dates are real dates, numeric fields do not contain text).

## Null Checks

A null check evaluates whether required columns contain missing values and whether the null rate is within an acceptable threshold. 
Nulls in key columns can break joins, cause record loss, and create inconsistent aggregates across layers. 
A common practice is to define mandatory columns per dataset and validate them during ingestion (bronze) and after transformations (silver). 
When nulls are found, pipelines may fail, quarantine bad rows, or apply controlled defaulting based on business policy.
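
As an illustration, a null check over mandatory columns could be sketched as follows. The `MANDATORY_COLUMNS` mapping and the decision to treat empty strings as missing are assumptions made for this example; real policies vary by dataset.

```python
# Hypothetical registry of mandatory columns per dataset.
MANDATORY_COLUMNS = {
    "orders": ["order_id", "customer_id"],
}


def null_report(dataset, rows):
    """Report null count, null rate, and affected rows per mandatory column."""
    report = {}
    for column in MANDATORY_COLUMNS[dataset]:
        # Assumption: both None and the empty string count as missing.
        missing = [
            index for index, row in enumerate(rows)
            if row.get(column) in (None, "")
        ]
        report[column] = {
            "null_count": len(missing),
            "null_rate": len(missing) / len(rows) if rows else 0.0,
            "row_indexes": missing,
        }
    return report


# Made-up batch with one missing customer_id and one empty order_id.
batch = [
    {"order_id": "O1", "customer_id": "C1"},
    {"order_id": "O2", "customer_id": None},
    {"order_id": "", "customer_id": "C3"},
    {"order_id": "O4", "customer_id": "C4"},
]

report = null_report("orders", batch)
```

The same function can run after ingestion and again after transformations, so a null introduced by a join shows up as a change between the two reports.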

## Duplicate Checks

Duplicate checks ensure the same logical entity is not stored multiple times when it should be unique. 
Duplicates often appear due to reprocessing, late-arriving files, upstream retries, or missing deduplication logic in ingestion. 
A practical approach is to define the **uniqueness key** (single or composite) and check whether it repeats within a load window. 
If duplicates are allowed by design (for example, event logs), the rule should state that explicitly so the check does not raise false alarms.
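
A duplicate check on a single or composite uniqueness key can be sketched with the standard library's `collections.Counter`. The `find_duplicates` helper and the sample rows are hypothetical; the rows passed in are assumed to be one load window.

```python
from collections import Counter


def find_duplicates(rows, key_columns):
    """Return each uniqueness-key value that occurs more than once, with its count."""
    counts = Counter(
        tuple(row.get(column) for column in key_columns) for row in rows
    )
    return {key: count for key, count in counts.items() if count > 1}


# Made-up load window: the composite key is (order_id, line_number).
load_window = [
    {"order_id": 1, "line_number": "A", "amount": 10},
    {"order_id": 1, "line_number": "A", "amount": 10},   # reprocessed row
    {"order_id": 1, "line_number": "B", "amount": 5},
    {"order_id": 2, "line_number": "A", "amount": 7},
]

duplicates = find_duplicates(load_window, ["order_id", "line_number"])
```

Note that checking `order_id` alone would flag order 1 three times, which is why the uniqueness key has to be defined before the check is written.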

## Type Checks

Type checks verify that each field conforms to the schema expectations (for example, integers remain integers and timestamps remain timestamps). 
Type drift is common when raw sources change formats or when ingestion lands all fields as strings. 
Incorrect types can cause silent calculation errors, incorrect comparisons, and failed downstream writes. 
Type checks typically include schema validation, parse success rates, and explicit casting with error handling rules.
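
The sketch below illustrates explicit casting with error handling and a parse success rate, assuming raw values land as strings. The `cast_column` helper and the sample rows are illustrative; `int` and `date.fromisoformat` stand in for whatever parsers a real schema would require.

```python
from datetime import date


def cast_column(rows, column, parser):
    """Cast one column with the given parser, collecting failures instead of crashing.

    Returns (parsed_values, errors, success_rate); failed casts become None.
    """
    values, errors = [], []
    for index, row in enumerate(rows):
        raw = row.get(column)
        try:
            values.append(parser(raw))
        except (ValueError, TypeError) as exc:
            values.append(None)
            errors.append(
                {"row_index": index, "raw_value": raw, "error": type(exc).__name__}
            )
    success_rate = (len(rows) - len(errors)) / len(rows) if rows else 1.0
    return values, errors, success_rate


# Made-up raw rows where every field was ingested as a string (or is missing).
raw_rows = [
    {"quantity": "3", "order_date": "2024-01-15"},
    {"quantity": "abc", "order_date": "2024-02-30"},   # text in a numeric field, impossible date
    {"quantity": "7", "order_date": None},
]

quantities, quantity_errors, quantity_rate = cast_column(raw_rows, "quantity", int)
dates, date_errors, date_rate = cast_column(raw_rows, "order_date", date.fromisoformat)
```

The success rate can then be compared against a threshold, so a sudden drop after a source format change is caught before downstream writes fail.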

## Lineage Basics

**Data lineage** describes where data came from, how it moved, and what transformations were applied along the way. 
At a minimum, lineage should answer: which source system/file produced this dataset, and which tables or jobs contributed to it. 
Lineage improves trust because analysts can trace unexpected results back to upstream inputs and transformations. 
In governed environments, lineage is also required for audits, impact analysis, and change management.
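
At its simplest, lineage can be represented as a mapping from each dataset to its direct upstream inputs. The sketch below assumes such a mapping is maintained somewhere (by hand or by a catalog tool); the dataset names and helper functions are hypothetical.

```python
# Hypothetical lineage graph: dataset -> datasets it is built from.
LINEAGE = {
    "gold.daily_sales": ["silver.orders", "silver.customers"],
    "silver.orders": ["bronze.orders_raw"],
    "silver.customers": ["bronze.customers_raw"],
    "bronze.orders_raw": [],
    "bronze.customers_raw": [],
}


def upstream_of(dataset, lineage):
    """All direct and indirect inputs of a dataset (root-cause tracing)."""
    found = set()
    stack = list(lineage.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(lineage.get(current, []))
    return sorted(found)


def downstream_of(dataset, lineage):
    """All datasets that depend on the given one (impact analysis)."""
    impacted = set()
    frontier = [dataset]
    while frontier:
        current = frontier.pop()
        for child, parents in lineage.items():
            if current in parents and child not in impacted:
                impacted.add(child)
                frontier.append(child)
    return sorted(impacted)


inputs = upstream_of("gold.daily_sales", LINEAGE)
impact = downstream_of("bronze.orders_raw", LINEAGE)
```

`upstream_of` answers "where did this number come from?", while `downstream_of` answers "what breaks if this source changes?".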

## Business Rules

**Business rules** are domain-specific validations that reflect real operational meaning, not just technical correctness. 
Examples include “`order_status` must be one of the approved statuses” or “`refund_amount` cannot exceed `order_amount`.”
Business rules must be documented with clear ownership because they represent business policy and can change over time. 
Embedding business rules into silver/gold pipelines ensures curated datasets represent consistent and approved definitions.
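
The two example rules above could be sketched like this. The set of approved statuses is invented for illustration; in practice it would come from the business owner of the rule.

```python
# Hypothetical list of approved statuses, owned by the business.
APPROVED_STATUSES = {"placed", "shipped", "delivered", "cancelled"}


def business_rule_violations(order):
    """Return the names of the business rules this order breaks."""
    violations = []
    if order["order_status"] not in APPROVED_STATUSES:
        violations.append("order_status_not_approved")
    if order["refund_amount"] > order["order_amount"]:
        violations.append("refund_exceeds_order_amount")
    return violations


# Made-up orders.
good_order = {"order_status": "shipped", "order_amount": 100.0, "refund_amount": 20.0}
bad_order = {"order_status": "teleported", "order_amount": 50.0, "refund_amount": 75.0}

good_result = business_rule_violations(good_order)
bad_result = business_rule_violations(bad_order)
```

Keeping the approved values in one named constant (or a reference table) means a policy change is a data update rather than a code rewrite.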

## Intro to PII and Masking

**PII (Personally Identifiable Information)** is data that can identify an individual, directly or indirectly (such as name, phone, email, address, government ID). 
PII must be protected because misuse can lead to privacy violations, regulatory penalties, and loss of customer trust. 
**Masking** reduces exposure by hiding sensitive values while keeping the dataset usable for analysis and testing. 
Common masking approaches include partial masking (showing only part of a value), redaction (removing the value entirely), tokenization, hashing, and role-based views that reveal full data only to authorized users.
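
A few of these approaches can be sketched with the Python standard library. The helper names and the sample key are assumptions for illustration. For hashing, the sketch uses a keyed hash (HMAC-SHA256) rather than a bare hash, since low-entropy values such as phone numbers can otherwise be recovered by brute force.

```python
import hashlib
import hmac


def mask_email(email):
    """Partial masking: keep the first character and the domain."""
    local, _, domain = email.partition("@")
    if not local or not domain:
        return "***"
    return local[:1] + "***@" + domain


def mask_phone(phone, visible=4):
    """Partial masking: keep only the last `visible` digits."""
    digits = [char for char in phone if char.isdigit()]
    keep = min(visible, len(digits))
    tail = digits[len(digits) - keep:]
    return "*" * (len(digits) - keep) + "".join(tail)


def pseudonymize(value, secret_key):
    """Keyed hash: the same input and key always give the same token,
    so joins still work, but the original value is not exposed."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Illustrative values only; a real key would come from a secrets manager.
SAMPLE_KEY = b"example-key-not-for-production"

masked_email = mask_email("jane.doe@example.com")
masked_phone = mask_phone("+1 (555) 123-4567")
token = pseudonymize("jane.doe@example.com", SAMPLE_KEY)
```

Partial masking keeps values recognizable for support or testing, while the keyed hash is better suited to analytics that only need a stable identifier.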