
# Data Quality Dimensions

Data quality is measured using **dimensions** — standard ways to describe *what kind of quality issue* exists in a dataset.



## What is a “Data Quality Dimension”?

Think of a “dimension” as a **quality lens**.

When you buy a packaged food item, the factory does not “check quality” in one step. Instead, quality is
broken into **separate checks**, for example:
- Is the packet **sealed**? (completeness / integrity)
- Is it **within expiry date**? (validity / timeliness)
- Is the label **consistent** across batches? (consistency)
- Is the weight **correct**? (accuracy)
- Are there **duplicate serial numbers**? (uniqueness)

Similarly, in data, “quality” becomes manageable only when we split it into measurable dimensions.
That’s why many organizations use a core set of dimensions like **Completeness, Accuracy, Consistency,
Validity, Timeliness, and Uniqueness**.


![Dimensions](../images/coffee.png)


A **data quality dimension** is a category used to evaluate a specific aspect of data quality.

Instead of saying “the data is bad,” dimensions help you say **why** it is bad:
- **Completeness** → required values are missing
- **Uniqueness** → duplicates exist where they should not
- **Consistency** → the same concept is represented differently across systems
- **Validity** → values break defined rules/constraints
- **Accuracy** → values do not match the real-world or authoritative source
- **Timeliness** → data is outdated or arrives too late for its purpose

These dimensions make data quality measurable, reportable, and improvable.



Imagine a customer table in a company. You want to email customers a delivery update.

- If emails are missing → **Completeness** problem  
- If the email exists but is wrong → **Accuracy** problem  
- If one system stores “Bengaluru” and another stores “Bangalore” → **Consistency** problem  
- If email format is invalid like `abc.com` → **Validity** problem  
- If the data arrives 5 days late, the delivery update is no longer useful → **Timeliness** problem  
- If the same customer is present twice → **Uniqueness** problem  

What would be the real-world consequences?
- Completeness → customer not reachable
- Accuracy → wrong person contacted
- Consistency → wrong grouping, wrong counts
- Validity → ETL failures, rejected loads
- Timeliness → stale dashboards
- Uniqueness → double counting, wrong KPIs


![Data Quality Matrix](../images/diagnostic.png)


## The 6 Core Dimensions

### Completeness
**Meaning:** Required data is present (not NULL / not blank) for the use case.

**Typical failures:**
- Missing email or phone for customer communication
- Missing address for shipping
- Missing event date for trend analysis

**Example checks (SQL pattern):**
```sql
SELECT COUNT(*) AS missing_email_rows
FROM customers
WHERE email IS NULL OR TRIM(email) = '';
```
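
A raw count of missing rows is hard to compare across tables of different sizes, so it is often reported as a percentage. The sketch below runs the same NULL-or-blank rule against an in-memory SQLite table from Python. The sample rows are hypothetical, and the code assumes the table is not empty.

```python
import sqlite3

# Hypothetical sample data: 4 customers, 2 of them without a usable email.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "   "), (4, "d@example.com")],
)

# Count all rows and the rows whose email is NULL or blank in one pass.
total_rows, missing_rows = conn.execute(
    """
    SELECT
      COUNT(*),
      SUM(CASE WHEN email IS NULL OR TRIM(email) = '' THEN 1 ELSE 0 END)
    FROM customers
    """
).fetchone()

# Completeness = share of rows that do have a usable email (assumes total_rows > 0).
completeness_pct = 100.0 * (total_rows - missing_rows) / total_rows
print(f"email completeness: {completeness_pct:.1f}%")
```

A percentage like this can be tracked over time or compared against a threshold agreed for the use case.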



### Uniqueness
**Meaning:** Values that should be unique are not duplicated.

**Typical failures:**
- Duplicate customer IDs
- Duplicate order IDs causing double counting
- Duplicate product SKUs

**Example checks (SQL pattern):**
```sql
SELECT customer_id, COUNT(*) AS cnt
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
```
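
Checking the ID column alone does not catch the same customer stored under two different IDs. One possible approach, sketched below, is to group on a normalized business key. It assumes that an email address identifies a customer, which may not hold in every dataset, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample data: customers 1 and 2 are the same person with
# differently formatted emails and different IDs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Asha@Example.com"), (2, "asha@example.com "), (3, "ravi@example.com")],
)

# Normalize case and surrounding whitespace before grouping, so formatting
# differences do not hide duplicates. Assumption: email identifies a customer.
duplicate_emails = conn.execute(
    """
    SELECT LOWER(TRIM(email)) AS email_key, COUNT(*) AS cnt
    FROM customers
    WHERE email IS NOT NULL
    GROUP BY LOWER(TRIM(email))
    HAVING COUNT(*) > 1
    """
).fetchall()
print(duplicate_emails)
```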



### Consistency
**Meaning:** The same data concept is represented the same way across records/tables/systems.

**Typical failures:**
- Currency stored as `GBP` in one table and `Great British Pound` in another
- City stored as `Bengaluru` vs `Bangalore`
- Status stored as `COMPLETE`, `Complete`, `completed`

**Example checks (SQL pattern):**
```sql
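-- List every distinct spelling with its row count;
-- unexpected or rare values point to inconsistent representations.
SELECT currency_code, COUNT(*) AS cnt
FROM orders
GROUP BY currency_code
ORDER BY cnt DESC;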
```
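
Variants such as `COMPLETE` and `Complete` can also be flagged automatically. The sketch below groups on an upper-cased, trimmed key and reports keys that appear with more than one raw spelling. It only catches case and whitespace differences; different words such as `completed` or `Bangalore` would need a mapping table. The `status` column and sample rows are assumptions for illustration.

```python
import sqlite3

# Hypothetical sample data: one status written three ways, one written consistently.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "COMPLETE"), (2, "Complete"), (3, "complete "), (4, "PENDING"), (5, "PENDING")],
)

# For each normalized key, count how many distinct raw spellings exist.
# More than one spelling means the same concept is stored inconsistently.
inconsistent_statuses = conn.execute(
    """
    SELECT UPPER(TRIM(status)) AS status_key,
           COUNT(DISTINCT status) AS spellings
    FROM orders
    GROUP BY UPPER(TRIM(status))
    HAVING COUNT(DISTINCT status) > 1
    """
).fetchall()
print(inconsistent_statuses)
```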



### Validity
**Meaning:** Data follows defined rules and constraints (allowed ranges, formats, and relationships).

**Validity is rule-conformance**, not real-world truth.

**Typical failures:**
- Discount outside 0..1
- Quantity <= 0
- Ship date earlier than order date
- Email not matching expected format

**Example checks (SQL pattern):**
```sql
SELECT COUNT(*) AS invalid_discount_rows
FROM orders
WHERE discount < 0 OR discount > 1;
```

```sql
SELECT COUNT(*) AS invalid_ship_date_rows
FROM orders
WHERE ship_date < order_date;
```
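
The email-format failure listed above can be checked in a similar way. The sketch below uses a deliberately loose `LIKE` pattern (at least one character, `@`, at least one character, a dot, at least one character). It is an assumed rule for illustration, not a full email validator, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample data: one well-formed email, two malformed, one missing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "abc@xyz.com"), (2, "abc.com"), (3, "@xyz.com"), (4, None)],
)

# NULL emails are skipped here: a missing value is a completeness issue,
# not a validity issue.
invalid_email_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT customer_id
        FROM customers
        WHERE email IS NOT NULL
          AND email NOT LIKE '%_@_%._%'
        ORDER BY customer_id
        """
    )
]
print(invalid_email_ids)
```

Stricter rules (allowed domains, no spaces, length limits) can be layered on top, depending on what the organization defines as a valid email.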



### Accuracy
**Meaning:** Data matches reality or an authoritative source system.

This dimension usually requires comparison to a **trusted reference** (a master system, ledger, or verified source).

**Typical failures:**
- Incorrect customer address compared to the KYC system
- Incorrect product price compared to the pricing master
- Incorrect totals compared to the finance ledger

**Example (reconciliation pattern):**
```sql
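-- Compare daily revenue in the warehouse against the finance ledger.
-- The inner join only compares dates present in both tables;
-- dates missing from either side need a separate check.
SELECT
  w.txn_date,
  w.total_revenue AS warehouse_revenue,
  f.total_revenue AS finance_revenue,
  (w.total_revenue - f.total_revenue) AS diff  -- non-zero means the sources disagree
FROM warehouse_daily_revenue w
JOIN finance_daily_revenue f
  ON w.txn_date = f.txn_date;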
```
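
In practice a reconciliation usually reports only the rows that disagree. The sketch below is one way to do that: it keeps every warehouse date with a `LEFT JOIN`, then returns dates that are missing from finance or differ by more than a small tolerance. The tolerance value and the sample figures are hypothetical.

```python
import sqlite3

# Hypothetical sample data: one matching day, one mismatched day,
# and one day that finance has not reported at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_daily_revenue (txn_date TEXT, total_revenue REAL)")
conn.execute("CREATE TABLE finance_daily_revenue (txn_date TEXT, total_revenue REAL)")
conn.executemany(
    "INSERT INTO warehouse_daily_revenue VALUES (?, ?)",
    [("2024-01-01", 1000.0), ("2024-01-02", 1250.0), ("2024-01-03", 900.0)],
)
conn.executemany(
    "INSERT INTO finance_daily_revenue VALUES (?, ?)",
    [("2024-01-01", 1000.0), ("2024-01-02", 1200.0)],
)

TOLERANCE = 0.01  # assumed acceptable rounding difference

# Keep all warehouse dates; flag those missing in finance or off by more
# than the tolerance.
mismatches = conn.execute(
    """
    SELECT w.txn_date, w.total_revenue, f.total_revenue
    FROM warehouse_daily_revenue w
    LEFT JOIN finance_daily_revenue f
      ON w.txn_date = f.txn_date
    WHERE f.txn_date IS NULL
       OR ABS(w.total_revenue - f.total_revenue) > ?
    ORDER BY w.txn_date
    """,
    (TOLERANCE,),
).fetchall()
print(mismatches)
```

This version does not detect dates that exist only in the finance table; swapping the two tables in the join covers that direction.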



### Timeliness
**Meaning:** Data is available and updated within the time window required for the business purpose.

**Typical failures:**
- Data arrives days late, so dashboards become stale
- Real-time use cases served by batch-refreshed data

**Example checks (SQL pattern):**
```sql
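-- TIMESTAMPDIFF(unit, start, end) and NOW() as written here follow MySQL;
-- adjust the date functions for your SQL engine.
SELECT
  MAX(ingested_at) AS last_ingestion_time,
  TIMESTAMPDIFF(MINUTE, MAX(ingested_at), NOW()) AS minutes_since_last_update
FROM orders;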
```
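
Whether a given delay counts as "late" depends on the time window the business needs. The sketch below compares the latest ingestion time against an assumed 60-minute freshness limit. The limit, the fixed clock value, and the sample rows are all hypothetical; a real check would use the current time.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical sample data: the most recent load happened at 10:30.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, ingested_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-01-01 09:00:00"), (2, "2024-01-01 10:30:00")],
)

MAX_STALENESS = timedelta(minutes=60)  # assumed freshness requirement
now = datetime(2024, 1, 1, 12, 0, 0)   # fixed clock so the example is reproducible

# ISO-formatted timestamps sort correctly as text, so MAX() returns the latest one.
(last_ingested_raw,) = conn.execute("SELECT MAX(ingested_at) FROM orders").fetchone()
last_ingested = datetime.fromisoformat(last_ingested_raw)

staleness = now - last_ingested
is_stale = staleness > MAX_STALENESS
print(f"last ingestion {staleness} ago, stale: {is_stale}")
```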


![Dimensions](../images/dimensions.png)


## Validity vs Accuracy

These are often confused:

- **Validity:** “Does it follow the rule?”  
  Example: `abc@xyz.com` has a valid email format.

- **Accuracy:** “Is it the correct real-world value?”  
  Example: `abc@xyz.com` may have a valid format but belong to the wrong customer.

A value can be **valid but inaccurate**.
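
The difference can be made concrete by evaluating both questions on the same rows. In the sketch below, validity is a loose format rule and accuracy is a comparison against a trusted reference. The `customer_master` table, the format pattern, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample data: customers is the table under test,
# customer_master stands in for a trusted reference system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.execute("CREATE TABLE customer_master (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "abc@xyz.com"), (2, "abc.com"), (3, "ravi@example.com")],
)
conn.executemany(
    "INSERT INTO customer_master VALUES (?, ?)",
    [(1, "asha@example.com"), (2, "meera@example.com"), (3, "ravi@example.com")],
)

# is_valid: does the value follow the format rule?
# is_accurate: does the value match the trusted reference?
rows = conn.execute(
    """
    SELECT c.customer_id,
           (c.email LIKE '%_@_%._%') AS is_valid,
           (c.email = m.email)       AS is_accurate
    FROM customers c
    JOIN customer_master m ON c.customer_id = m.customer_id
    ORDER BY c.customer_id
    """
).fetchall()

for customer_id, is_valid, is_accurate in rows:
    print(customer_id, "valid" if is_valid else "invalid",
          "accurate" if is_accurate else "inaccurate")
```

Customer 1 is the valid-but-inaccurate case: the format rule passes, yet the value does not match the reference.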
