# Data Quality Tables - Setup Notebook

**Project:** Maven Fuzzy Factory E-Commerce Analytics  
**Created:** November 20, 2025  
**Purpose:** Create the enhanced data quality log and summary tables with a standardized schema

---

## Table Definitions

### data_quality_log
**Purpose:** Detailed validation results for each check performed  
**Retention:** 90 days (configurable)  
**Usage:** Root cause analysis, audit trail, failure investigation

| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| run_id | STRING | No | Unique identifier for validation run (UUID) |
| run_timestamp | TIMESTAMP | No | When validation was executed |
| table_name | STRING | No | Source table being validated |
| check_name | STRING | No | Name of validation check |
| check_type | STRING | No | Category: completeness, uniqueness, validity |
| column_name | STRING | Yes | Column(s) being validated |
| passed | STRING | No | Check outcome stored as a string: "True" or "False" |
| invalid_count | INT | No | Number of invalid records found |
| threshold | STRING | Yes | Expected threshold (e.g., "0", ">0") |
| message | STRING | Yes | Descriptive message about result |
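
To illustrate how a validation notebook might shape one row for this table, here is a minimal sketch. The helper name `make_log_record` and the sample values (table, check, and message) are hypothetical and not part of this project; only the keys come from the column list above.

```python
import uuid
from datetime import datetime, timezone


def make_log_record(run_id, run_timestamp, table_name, check_name,
                    check_type, passed, invalid_count,
                    column_name=None, threshold=None, message=None):
    """Build one data_quality_log row as a dict (hypothetical helper)."""
    return {
        "run_id": run_id,
        "run_timestamp": run_timestamp,
        "table_name": table_name,
        "check_name": check_name,
        "check_type": check_type,
        "column_name": column_name,
        # The schema stores the outcome as the string "True" or "False".
        "passed": str(bool(passed)),
        "invalid_count": int(invalid_count),
        "threshold": threshold,
        "message": message,
    }


# One run_id is shared by every check in the same validation run.
run_id = str(uuid.uuid4())
run_timestamp = datetime.now(timezone.utc)

record = make_log_record(
    run_id, run_timestamp,
    table_name="orders",              # hypothetical table name
    check_name="order_id_not_null",   # hypothetical check name
    check_type="completeness",
    passed=False, invalid_count=3,
    column_name="order_id", threshold="0",
    message="3 null values found",
)
```

A list of such dicts could then be turned into a DataFrame using the `log_schema` defined later in this notebook and appended to `data_quality_log`.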

---

### data_quality_summary
**Purpose:** High-level quality metrics per validation run  
**Retention:** 180 days (configurable)  
**Usage:** Trending, dashboards, pipeline decision-making

| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| run_id | STRING | No | Unique identifier (matches log) |
| run_timestamp | TIMESTAMP | No | When validation was executed |
| table_name | STRING | No | Source table being validated |
| row_count | INT | No | Total rows in source table |
| pk_duplicate_count | INT | No | Number of duplicate primary keys |
| null_violations | INT | No | Total nulls in critical columns |
| validation_checks_total | INT | No | Total validation checks performed |
| validation_checks_passed | INT | No | Number of checks that passed |
| quality_score | STRING | No | Percentage score (e.g., "100.0") |
| overall_status | STRING | No | "PASSED" or "FAILED" |
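
The sketch below shows one way the check-level fields of a summary row could be derived. It assumes that `quality_score` is the percentage of checks passed, formatted to one decimal place, and that `overall_status` is "PASSED" only when every check passes; the notebook does not state either rule, and the function name `summarize_checks` is hypothetical.

```python
def summarize_checks(check_results):
    """Derive summary fields from a list of booleans, one per check.

    Assumptions (not confirmed by the source): the score is the share of
    checks passed, and a run with zero checks counts as FAILED.
    """
    total = len(check_results)
    passed = sum(1 for ok in check_results if ok)
    score = 100.0 * passed / total if total else 0.0
    return {
        "validation_checks_total": total,
        "validation_checks_passed": passed,
        # Stored as a string with one decimal, e.g. "100.0".
        "quality_score": f"{score:.1f}",
        "overall_status": "PASSED" if total > 0 and passed == total else "FAILED",
    }


all_pass = summarize_checks([True, True, True])
one_fail = summarize_checks([True, True, False])
```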

---
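
The retention periods listed for both tables are not enforced by anything in this notebook. As a rough sketch, a cleanup job could build one `DELETE` statement per table from a cutoff timestamp. The `RETENTION_DAYS` mapping and the `retention_delete_sql` helper are hypothetical names; the day counts come from the table definitions above.

```python
from datetime import datetime, timedelta

# Retention periods from the table definitions (both marked configurable).
RETENTION_DAYS = {
    "data_quality_log": 90,
    "data_quality_summary": 180,
}


def retention_delete_sql(table_name, now):
    """Return a DELETE statement removing rows older than the retention window."""
    cutoff = now - timedelta(days=RETENTION_DAYS[table_name])
    return (
        f"DELETE FROM {table_name} "
        f"WHERE run_timestamp < TIMESTAMP '{cutoff:%Y-%m-%d %H:%M:%S}'"
    )


# Example with a fixed reference date; a real job would pass the current time.
log_cleanup = retention_delete_sql("data_quality_log", datetime(2025, 11, 20))
summary_cleanup = retention_delete_sql("data_quality_summary", datetime(2025, 11, 20))
```

Each resulting string could be passed to `spark.sql(...)` in a scheduled notebook, assuming the tables are stored in a format that supports `DELETE`, such as Delta.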

## Import Required Libraries

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, IntegerType

print("✓ Libraries imported successfully")

## Drop Existing Tables (If Present)

In [None]:
# Drop existing tables to ensure a clean slate
# (this deletes any validation history already stored in them)
spark.sql("DROP TABLE IF EXISTS data_quality_log")
spark.sql("DROP TABLE IF EXISTS data_quality_summary")

print("✓ Existing tables dropped (if they existed)")

## Create data_quality_log Table

In [None]:
# Define log table schema
log_schema = StructType([
    StructField("run_id", StringType(), False),
    StructField("run_timestamp", TimestampType(), False),
    StructField("table_name", StringType(), False),
    StructField("check_name", StringType(), False),
    StructField("check_type", StringType(), False),
    StructField("column_name", StringType(), True),
    StructField("passed", StringType(), False),
    StructField("invalid_count", IntegerType(), False),
    StructField("threshold", StringType(), True),
    StructField("message", StringType(), True)
])

# Create empty DataFrame
empty_log_df = spark.createDataFrame([], log_schema)

# Save as a Delta table (format set explicitly rather than relying on the default)
empty_log_df.write.format("delta").mode("overwrite").saveAsTable("data_quality_log")

print("✓ data_quality_log table created")
print(f"  Columns: {len(log_schema.fields)}")
print("  Schema:")
spark.table("data_quality_log").printSchema()

## Create data_quality_summary Table

In [None]:
# Define summary table schema
summary_schema = StructType([
    StructField("run_id", StringType(), False),
    StructField("run_timestamp", TimestampType(), False),
    StructField("table_name", StringType(), False),
    StructField("row_count", IntegerType(), False),
    StructField("pk_duplicate_count", IntegerType(), False),
    StructField("null_violations", IntegerType(), False),
    StructField("validation_checks_total", IntegerType(), False),
    StructField("validation_checks_passed", IntegerType(), False),
    StructField("quality_score", StringType(), False),
    StructField("overall_status", StringType(), False)
])

# Create empty DataFrame
empty_summary_df = spark.createDataFrame([], summary_schema)

# Save as a Delta table (format set explicitly rather than relying on the default)
empty_summary_df.write.format("delta").mode("overwrite").saveAsTable("data_quality_summary")

print("✓ data_quality_summary table created")
print(f"  Columns: {len(summary_schema.fields)}")
print("  Schema:")
spark.table("data_quality_summary").printSchema()

## Verify Table Creation

In [None]:
# List quality tables
print("Quality Tables in Lakehouse:")
spark.sql("SHOW TABLES LIKE 'data_quality*'").show(truncate=False)

# Verify row counts (should be 0)
log_count = spark.table("data_quality_log").count()
summary_count = spark.table("data_quality_summary").count()

print("\nInitial Row Counts:")
print(f"  data_quality_log: {log_count}")
print(f"  data_quality_summary: {summary_count}")

if log_count == 0 and summary_count == 0:
    print("\n✓✓✓ SUCCESS: Quality tables created and empty.")
    print("\nReady for validation notebooks to write data.")
else:
    print("\n⚠ WARNING: Tables contain data. Expected 0 rows.")

## Sample Query Templates

Use these queries after running validation notebooks:

In [None]:
# Sample queries for future use (stored as a string and printed, not executed)

sample_queries = """
-- Query 1: View all validation summary results
SELECT 
    table_name,
    run_timestamp,
    quality_score,
    overall_status,
    validation_checks_passed,
    validation_checks_total,
    row_count
FROM data_quality_summary
ORDER BY run_timestamp DESC;

-- Query 2: Find failed checks
SELECT 
    table_name,
    check_name,
    column_name,
    invalid_count,
    threshold,
    message
FROM data_quality_log
WHERE passed = 'False'
ORDER BY table_name, check_name;

-- Query 3: Quality score by table (most recent run)
SELECT 
    table_name,
    quality_score,
    overall_status
FROM (
    SELECT 
        table_name,
        quality_score,
        overall_status,
        ROW_NUMBER() OVER (PARTITION BY table_name ORDER BY run_timestamp DESC) as rn
    FROM data_quality_summary
) ranked
WHERE rn = 1
ORDER BY table_name;
"""

print("Sample Query Templates:")
print(sample_queries)
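
The template string holds several statements separated by semicolons, while `spark.sql` parses one statement per call. A minimal sketch of splitting such a script first is shown below. The helper `split_statements` is hypothetical, and the split is naive: it would break on a semicolon inside a string literal.

```python
def split_statements(sql_text):
    """Split a semicolon-separated SQL script into individual statements.

    Comment lines starting with `--` and blank lines are dropped.
    """
    statements = []
    for chunk in sql_text.split(";"):
        lines = [
            line.rstrip()
            for line in chunk.splitlines()
            if line.strip() and not line.strip().startswith("--")
        ]
        if lines:
            statements.append("\n".join(lines))
    return statements


# Small stand-in script with the same layout as the templates above.
demo_sql = """
-- Query A: recent runs
SELECT table_name FROM data_quality_summary;

-- Query B: failed checks
SELECT check_name FROM data_quality_log WHERE passed = 'False';
"""

statements = split_statements(demo_sql)

# In a notebook with an active Spark session, each statement could be run as:
# for stmt in statements:
#     spark.sql(stmt).show(truncate=False)
```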

---

## Setup Complete ✓

**Next Steps:**
1. Run validation notebooks for each staging table
2. Review quality scores in `data_quality_summary`
3. Investigate any failures in `data_quality_log`
4. Proceed to transformation phase once all tables pass
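
Step 4 can be sketched as a simple gate over the latest summary row of each table, mirroring Query 3 above. The function names and sample rows are hypothetical; in practice the rows would be collected from `data_quality_summary`.

```python
from datetime import datetime


def latest_status_by_table(summary_rows):
    """Return {table_name: overall_status} using each table's most recent run."""
    latest = {}
    for row in summary_rows:
        current = latest.get(row["table_name"])
        if current is None or row["run_timestamp"] > current["run_timestamp"]:
            latest[row["table_name"]] = row
    return {name: row["overall_status"] for name, row in latest.items()}


def ready_for_transformation(summary_rows):
    """True only if at least one table was validated and every latest run passed."""
    statuses = latest_status_by_table(summary_rows)
    return bool(statuses) and all(s == "PASSED" for s in statuses.values())


# Hypothetical sample rows: "orders" failed once, then passed on a later run.
rows = [
    {"table_name": "orders", "run_timestamp": datetime(2025, 11, 20, 9),
     "overall_status": "FAILED"},
    {"table_name": "orders", "run_timestamp": datetime(2025, 11, 20, 10),
     "overall_status": "PASSED"},
    {"table_name": "products", "run_timestamp": datetime(2025, 11, 20, 10),
     "overall_status": "PASSED"},
]

can_proceed = ready_for_transformation(rows)
```

Only the most recent run per table counts here, so an earlier failure that has since been fixed does not block the transformation phase.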

---