# DuckGuard 2.3 - Getting Started Guide

**DuckGuard** is a Python-native data quality tool built on DuckDB for speed.

## What's New in v2.2
- **Freshness Monitoring**: Detect stale data via file mtime or column timestamps
- **ML-Based Anomaly Detection**: Auto-learn baselines, KS-test for distribution drift
- **Schema Evolution Tracking**: Track schema changes and detect breaking changes
- **Email Notifications**: SMTP-based alerts with HTML formatting
- **Reference/FK Checks**: Validate foreign key relationships across datasets
- **Cross-Dataset Validation**: Compare columns and row counts between datasets
- **Reconciliation**: Comprehensive dataset comparison for migration validation
- **Distribution Drift Detection**: KS-test based drift detection for ML pipelines
- **Group By Checks**: Segmented validation for partition-level quality checks

## What's in v2.1
- **Slack/Teams Notifications**: Get alerts when data quality checks fail
- **Row-Level Error Capture**: See exactly which rows failed validation
- **dbt Integration**: Export rules as dbt tests, import dbt schema.yml
- **Enhanced Error Messages**: Helpful suggestions and context in errors
- **HTML/PDF Reports**: Generate beautiful, shareable quality reports
- **Historical Tracking**: Store validation results and analyze trends over time
- **Airflow Operator**: Native integration for data pipelines
- **GitHub Action**: CI/CD data quality gates

## What's in v2.0
- **YAML-based Rules**: Define rules in YAML with a simple, clean syntax
- **Semantic Type Detection**: Auto-detect emails, phones, PII, and 30+ types
- **Data Contracts**: Schema + quality SLAs with breaking change detection
- **Anomaly Detection**: Statistical anomaly detection (Z-score, IQR, percent change)
- **Enhanced CLI**: Beautiful Rich output with new commands

This notebook walks you through:
1. Connecting to data sources
2. Exploring your data
3. Calculating quality scores
4. YAML-based rules
5. Semantic type detection
6. Data contracts
7. Anomaly detection
8. **NEW**: Freshness Monitoring
9. **NEW**: ML-Based Anomaly Detection
10. **NEW**: Schema Evolution Tracking
11. **NEW**: Reference/FK Checks & Cross-Dataset Validation
12. **NEW**: Reconciliation
13. **NEW**: Distribution Drift Detection
14. **NEW**: Group By Checks
15. **NEW**: Email Notifications
16. Python assertions
17. Row-level error debugging
18. Slack/Teams notifications
19. dbt integration
20. HTML/PDF Reports
21. Historical Tracking
22. Airflow Integration
23. Using with pytest
24. CLI commands

## Setup

Run the next cell to install DuckGuard with pip and create the sample data. The sample data is embedded directly in the notebook, so no data files need to be downloaded.

In [None]:
# Install DuckGuard and setup sample data
import os
import subprocess
import sys

# Install DuckGuard
print("Installing DuckGuard...")
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "duckguard"], check=True)
print("DuckGuard installed!")

# Create sample data directly (no network needed)
os.makedirs("sample_data", exist_ok=True)

# Sample orders data - embedded inline for instant loading
ORDERS_CSV = """order_id,customer_id,product_name,quantity,unit_price,total_amount,status,email,created_at
ORD-001,CUST-001,Widget A,2,29.99,59.98,delivered,john@example.com,2024-01-15
ORD-002,CUST-002,Widget B,1,49.99,49.99,shipped,jane@example.com,2024-01-16
ORD-003,CUST-001,Gadget X,3,19.99,59.97,delivered,john@example.com,2024-01-17
ORD-004,CUST-003,Widget A,5,29.99,149.95,pending,bob@example.com,2024-01-18
ORD-005,CUST-004,Gadget Y,2,39.99,79.98,shipped,alice@example.com,2024-01-19
ORD-006,CUST-002,Widget C,1,59.99,59.99,delivered,jane@example.com,2024-01-20
ORD-007,CUST-005,Gadget X,4,19.99,79.96,pending,charlie@example.com,2024-01-21
ORD-008,CUST-001,Widget B,2,49.99,99.98,delivered,john@example.com,2024-01-22
ORD-009,CUST-006,Gadget Z,1,99.99,99.99,shipped,diana@example.com,2024-01-23
ORD-010,CUST-003,Widget A,3,29.99,89.97,delivered,bob@example.com,2024-01-24
ORD-011,CUST-007,Widget D,2,34.99,69.98,pending,eve@example.com,2024-01-25
ORD-012,CUST-004,Gadget X,1,19.99,19.99,delivered,alice@example.com,2024-01-26
ORD-013,CUST-008,Widget B,3,49.99,149.97,shipped,frank@example.com,2024-01-27
ORD-014,CUST-002,Gadget Y,2,39.99,79.98,delivered,jane@example.com,2024-01-28
ORD-015,CUST-009,Widget A,4,29.99,119.96,pending,grace@example.com,2024-01-29
ORD-016,CUST-005,Widget C,1,59.99,59.99,cancelled,charlie@example.com,2024-01-30
ORD-017,CUST-010,Gadget Z,2,99.99,199.98,shipped,henry@example.com,2024-01-31
ORD-018,CUST-001,Widget D,3,34.99,104.97,delivered,john@example.com,2024-02-01
ORD-019,CUST-011,Gadget X,5,19.99,99.95,pending,ivy@example.com,2024-02-02
ORD-020,CUST-006,Widget B,1,49.99,49.99,delivered,diana@example.com,2024-02-03
ORD-021,,Widget A,2,29.99,59.98,shipped,,2024-02-04
ORD-022,CUST-012,Gadget Y,50,39.99,1999.50,delivered,kim@example.com,2024-02-05
ORD-023,CUST-013,Widget B,1,499.99,499.99,pending,leo@example.com,2024-02-06
ORD-024,CUST-014,Gadget X,100,19.99,1999.00,cancelled,mike@example.com,2024-02-07
ORD-025,CUST-015,Widget C,1,59.99,59.99,delivered,nancy@example.com,2024-02-08"""

DUCKGUARD_YAML = """dataset: orders
description: Data quality rules for orders dataset

rules:
  # Table-level rules
  - row_count > 0
  - row_count < 1000000

  # Column nulls
  - order_id is not null
  - order_id is unique
  - customer_id null_percent < 10
  - email null_percent < 10

  # Numeric ranges
  - quantity >= 1
  - quantity < 500
  - unit_price >= 0
  - total_amount >= 0

  # Status enum
  - status in ['pending', 'shipped', 'delivered', 'cancelled']
"""

with open("sample_data/orders.csv", "w") as f:
    f.write(ORDERS_CSV)
with open("sample_data/duckguard.yaml", "w") as f:
    f.write(DUCKGUARD_YAML)

print("Setup complete! Sample data ready.")

In [None]:
# Import DuckGuard - all the new features!
from duckguard import (
    # Version
    __version__,
    # Core
    connect,
    # YAML rules
    RuleSet,
    execute_rules,
    generate_rules,
    load_rules,
    load_rules_from_string,
    # Semantic types
    SemanticAnalyzer,
    detect_type,
    detect_types_for_dataset,
    # Data contracts
    diff_contracts,
    generate_contract,
    validate_contract,
    # Anomaly detection
    AnomalyDetector,
    detect_anomalies,
)

# Additional contract utilities
from duckguard.contracts import contract_to_yaml

print(f"DuckGuard v{__version__} imported successfully!")

## 1. Connecting to Data Sources

DuckGuard auto-detects the data source type from the path or connection string.

In [None]:
# Connect to a CSV file
orders = connect("sample_data/orders.csv")

print(f"Dataset: {orders.name}")
print(f"Rows: {orders.row_count}")
print(f"Columns: {orders.column_count}")
print(f"Column names: {orders.columns}")

In [None]:
# Preview the data
orders.head(5)

### Other Connection Examples

```python
# Parquet files
data = connect("data/events.parquet")

# JSON files
data = connect("data/users.json")

# Cloud storage
data = connect("s3://bucket/data.parquet")
data = connect("gs://bucket/data.csv")

# Databases
data = connect("postgres://user:pass@host/db", table="orders")
data = connect("snowflake://account/db", table="orders", schema="public")
data = connect("bigquery://project/dataset", table="orders")
```

## 2. Exploring Columns

Access a column with attribute notation (`orders.customer_id`) or bracket notation (`orders["customer_id"]`) to get its statistics.

In [None]:
# Access a column
customer_col = orders.customer_id

# View column statistics
print(f"Column: {customer_col.name}")
print(f"Total values: {customer_col.total_count}")
print(f"Null count: {customer_col.null_count}")
print(f"Null %: {customer_col.null_percent:.2f}%")
print(f"Unique count: {customer_col.unique_count}")
print(f"Unique %: {customer_col.unique_percent:.2f}%")

In [None]:
# Numeric column statistics
amount_col = orders.total_amount

print(f"Column: {amount_col.name}")
print(f"Min: {amount_col.min}")
print(f"Max: {amount_col.max}")
print(f"Mean: {amount_col.mean:.2f}")
print(f"Median: {amount_col.median}")
print(f"Stddev: {amount_col.stddev:.2f}")

In [None]:
# View value distribution
orders.status.get_value_counts()

## 3. Quality Scores

Calculate data quality scores across standard dimensions:
- **Completeness**: Are all required values present?
- **Uniqueness**: Are values appropriately unique?
- **Validity**: Do values conform to expected formats/ranges?
- **Consistency**: Do values follow the same format and conventions throughout?
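
To make the first two dimensions concrete, here is a minimal sketch of how completeness and uniqueness could be expressed as percentages for a single column. The helper names `completeness` and `uniqueness` and the formulas are illustrative assumptions, not DuckGuard's actual scoring logic.

```python
# Illustrative only: simple completeness and uniqueness percentages for one column.
# These formulas are assumptions, not DuckGuard's actual scoring logic.

def completeness(values):
    """Share of non-null values, as a percentage."""
    if not values:
        return 100.0
    non_null = sum(1 for v in values if v is not None)
    return 100.0 * non_null / len(values)


def uniqueness(values):
    """Share of distinct values among the non-null values, as a percentage."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 100.0
    return 100.0 * len(set(non_null)) / len(non_null)


# Hypothetical column: one null and one repeated customer ID
customer_ids = ["CUST-001", "CUST-002", "CUST-001", None]
completeness_pct = completeness(customer_ids)
uniqueness_pct = uniqueness(customer_ids)
print(f"Completeness: {completeness_pct:.1f}%")
print(f"Uniqueness:   {uniqueness_pct:.1f}%")
```

A real score would also have to weight the dimensions and combine them across columns, which this sketch leaves out.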

In [None]:
# Calculate quality score
result = orders.score()

print("=" * 50)
print("DATA QUALITY REPORT")
print("=" * 50)
print(f"\nOverall Score: {result.overall:.1f} / 100")
print(f"Grade: {result.grade}")
print("\nDimension Scores:")
print(f"  Completeness: {result.completeness:.1f}")
print(f"  Uniqueness:   {result.uniqueness:.1f}")
print(f"  Validity:     {result.validity:.1f}")
print(f"  Consistency:  {result.consistency:.1f}")
print(f"\nChecks: {result.passed_checks}/{result.total_checks} passed ({result.pass_rate:.1f}%)")

## 4. YAML-Based Rules (NEW in v2.0)

Define data quality rules in YAML, one plain-text expression per rule. The syntax aims to be simpler than Soda's SodaCL.
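
To give a feel for what evaluating such an expression involves, the sketch below parses rules of the form `<column> <op> <number>` and returns the rows that violate them. The function `check_rule` and its regular expression are hypothetical and cover only numeric column comparisons; this is not how DuckGuard parses rules, and table-level rules such as `row_count > 0` are out of scope here.

```python
import operator
import re

# Hypothetical mini-evaluator for rules shaped like "<column> <op> <number>".
# A real rule engine supports many more forms ("is not null", "in [...]", etc.).
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}
RULE_RE = re.compile(r"^(\w+)\s*(>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)$")


def check_rule(rule, rows):
    """Return the rows (dicts) that violate a simple comparison rule."""
    match = RULE_RE.match(rule.strip())
    if match is None:
        raise ValueError(f"Unsupported rule: {rule!r}")
    column = match.group(1)
    compare = OPS[match.group(2)]
    limit = float(match.group(3))
    return [row for row in rows if not compare(row[column], limit)]


# Hypothetical rows; the second one breaks "quantity >= 1"
rows = [{"quantity": 2}, {"quantity": 0}, {"quantity": 100}]
violations = check_rule("quantity >= 1", rows)
print(violations)
```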

In [None]:
# Define rules directly in Python using YAML string
# Note: Our sample data has intentional nulls and anomalies, so we use thresholds
yaml_rules = """
dataset: orders
description: Data quality rules for orders

rules:
  # Table-level rules
  - row_count > 0
  - row_count < 1000000
  
  # Column-level rules with simple syntax
  - order_id is not null
  - order_id is unique
  - customer_id null_percent < 10
  - total_amount >= 0
  - total_amount < 10000
  - status in ['pending', 'shipped', 'delivered', 'cancelled']
  - quantity >= 1
"""

# Load and execute rules
rules = load_rules_from_string(yaml_rules)
print(f"Loaded {len(rules.checks)} rules")
print(f"Dataset: {rules.dataset}")
print("\nRules:")
for check in rules.checks:
    print(f"  - {check.expression}")

In [None]:
# Execute rules against the dataset
result = execute_rules(rules, dataset=orders)

print(f"\n{'='*60}")
print("RULE EXECUTION RESULTS")
print(f"{'='*60}")
print(f"Total: {result.total_checks}")
print(f"Passed: {result.passed_count}")
print(f"Failed: {result.failed_count}")
print(f"Success Rate: {result.quality_score:.1f}%")
print("\nDetails:")
for check_result in result.results:
    status = "PASS" if check_result.passed else "FAIL"
    print(f"  [{status}] {check_result.check.expression}")
    if not check_result.passed:
        print(f"         -> {check_result.message}")

In [None]:
# Auto-generate YAML rules from data analysis
generated_yaml = generate_rules(orders, dataset_name="orders")
print("Generated YAML Rules:")
print(generated_yaml)

### Save Rules to a File

```python
# Save generated rules
with open("duckguard.yaml", "w") as f:
    f.write(generated_yaml)

# Later, load and execute
rules = load_rules("duckguard.yaml")
result = execute_rules(rules, dataset=orders)
```

In [None]:
# Load rules from a YAML file (we have a sample file in sample_data/)
file_rules = load_rules("sample_data/duckguard.yaml")
print(f"Loaded {len(file_rules.checks)} rules from file")
print(f"Dataset: {file_rules.dataset}")
print(f"Description: {file_rules.description}")

# Execute the file-based rules (note: dataset must be passed as keyword argument)
file_result = execute_rules(file_rules, dataset=orders)
print(f"\nResults: {file_result.passed_count}/{file_result.total_checks} passed")

In [None]:
# Working with RuleSet programmatically
# RuleSet allows you to build rules in code instead of YAML

# Create an empty RuleSet
custom_rules = RuleSet(name="custom_orders", version="1.0", description="Custom rules")

# Add simple checks using expressions (same syntax as YAML)
custom_rules.add_simple_check("row_count > 0")
custom_rules.add_simple_check("order_id is not null")
custom_rules.add_simple_check("quantity >= 1")
custom_rules.add_simple_check("status in ['pending', 'shipped', 'delivered', 'cancelled']")

print(f"RuleSet: {custom_rules.name}")
print(f"Version: {custom_rules.version}")
print(f"Description: {custom_rules.description}")
print(f"Total checks: {len(custom_rules.checks)}")
print("\nRules added:")
for check in custom_rules.checks:
    print(f"  - {check.expression}")

# Execute our programmatic rules (note: dataset must be passed as keyword argument)
custom_result = execute_rules(custom_rules, dataset=orders)
print(f"\nResults: {custom_result.passed_count}/{custom_result.total_checks} passed")

## 5. Semantic Type Detection (NEW in v2.0)

DuckGuard automatically detects semantic types like emails, phone numbers, UUIDs, credit cards, and PII.
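
One common way to detect a semantic type is to test what share of a column's non-null values matches a pattern. The sketch below does this for emails. The pattern, the function `looks_like_email_column`, and the 90% cutoff are assumptions chosen for illustration, not DuckGuard's detector.

```python
import re

# Deliberately simplified pattern; an assumption for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_email_column(values, min_match_ratio=0.9):
    """Guess whether a column holds emails from the share of matching non-null values."""
    non_null = [v for v in values if v]
    if not non_null:
        return False
    matches = sum(1 for v in non_null if EMAIL_RE.match(v))
    return matches / len(non_null) >= min_match_ratio


# Hypothetical column samples modeled on the orders data
emails = ["john@example.com", "jane@example.com", None, "bob@example.com"]
order_ids = ["ORD-001", "ORD-002", "ORD-003"]
print(looks_like_email_column(emails))
print(looks_like_email_column(order_ids))
```

Ignoring nulls before computing the ratio keeps a sparsely populated email column from being misclassified.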

In [None]:
# Detect semantic types for a single column
email_type = detect_type(orders, "email")
print(f"Column 'email' detected as: {email_type.value if email_type else 'unknown'}")

order_id_type = detect_type(orders, "order_id")
print(f"Column 'order_id' detected as: {order_id_type.value if order_id_type else 'unknown'}")

In [None]:
# Detect types for entire dataset
type_results = detect_types_for_dataset(orders)

print(f"\n{'='*60}")
print("SEMANTIC TYPE DETECTION")
print(f"{'='*60}")
for col_name, sem_type in type_results.items():
    type_name = sem_type.value if sem_type else "generic"
    print(f"  {col_name:20} -> {type_name}")

In [None]:
# Use the SemanticAnalyzer for detailed analysis
analyzer = SemanticAnalyzer()
analysis = analyzer.analyze(orders)

print("\nAnalysis Summary:")
print(f"  Columns analyzed: {len(analysis.columns)}")
print(f"  PII columns detected: {len(analysis.pii_columns)}")
if analysis.pii_columns:
    print(f"  PII warning: Columns {analysis.pii_columns} may contain PII!")

print("\nDetected Types:")
for col_analysis in analysis.columns:
    confidence = f"({col_analysis.confidence:.0%})" if col_analysis.confidence else ""
    detected = col_analysis.semantic_type.value if col_analysis.semantic_type else "unknown"
    print(f"  {col_analysis.name}: {detected} {confidence}")

### Supported Semantic Types

DuckGuard detects 30+ semantic types including:

| Category | Types |
|----------|-------|
| **Identifiers** | UUID, Email, Phone, URL, IP Address |
| **Financial** | Credit Card, Currency, IBAN |
| **Personal (PII)** | SSN, Name, Address, Date of Birth |
| **Geographic** | Country, State, Zip Code, Latitude, Longitude |
| **Technical** | JSON, Timestamp, Version, File Path |

## 6. Data Contracts (NEW in v2.0)

Define schema expectations and quality SLAs with automatic breaking change detection.

In [None]:
# Auto-generate a contract from your data
contract = generate_contract(orders, name="orders_contract", owner="data-team")

print(f"Contract: {contract.name}")
print(f"Version: {contract.version}")
print(f"Owner: {contract.metadata.owner}")
print(f"\nSchema ({len(contract.schema)} columns):")
for field in contract.schema:
    req_status = "required" if field.required else "optional"
    print(f"  {field.name}: {field.type.value if hasattr(field.type, 'value') else field.type} ({req_status})")

In [None]:
# View quality SLAs in the contract
if contract.quality:
    print("Quality SLAs:")
    if contract.quality.completeness is not None:
        print(f"  Completeness: >= {contract.quality.completeness}%")
    if contract.quality.row_count_min is not None:
        print(f"  Min row count: {contract.quality.row_count_min}")
    if contract.quality.row_count_max is not None:
        print(f"  Max row count: {contract.quality.row_count_max}")
    if contract.quality.freshness:
        print(f"  Freshness: {contract.quality.freshness}")

    if contract.quality.uniqueness:
        print("\n  Uniqueness requirements:")
        for col, pct in contract.quality.uniqueness.items():
            print(f"    {col}: {pct}%")

In [None]:
# Validate data against a contract
validation = validate_contract(contract, orders)

print(f"\n{'='*60}")
print("CONTRACT VALIDATION RESULTS")
print(f"{'='*60}")
print(f"Valid: {validation.is_valid}")
print(f"Schema valid: {validation.schema_valid}")
print(f"Quality valid: {validation.quality_valid}")

if validation.errors:
    print("\nErrors:")
    for error in validation.errors:
        print(f"  - {error}")

if validation.warnings:
    print("\nWarnings:")
    for warning in validation.warnings:
        print(f"  - {warning}")

In [None]:
# Export contract to YAML (contract_to_yaml was imported at the top)
contract_yaml = contract_to_yaml(contract)
print("Contract as YAML:")
print(contract_yaml)

### Breaking Change Detection

Compare contracts to detect breaking changes.

In [None]:
# Simulate a contract change: make a required column optional (breaking change!)

# Original contract (order_id is required)
old_contract = generate_contract(orders, name="orders_v1")

# New contract (modify to make order_id optional - a breaking change!)
new_contract = generate_contract(orders, name="orders_v2")
# Find and modify order_id field
for field in new_contract.schema:
    if field.name == "order_id":
        field.required = False  # This is a breaking change!

# Detect breaking changes
diff_result = diff_contracts(old_contract, new_contract)

print("\nContract Diff:")
print(f"  Has breaking changes: {diff_result.has_breaking_changes}")
print(f"  Has changes: {diff_result.has_changes}")

if diff_result.breaking_changes:
    print("\nBreaking Changes:")
    for change in diff_result.breaking_changes:
        print(f"  - {change}")

if diff_result.non_breaking_changes:
    print("\nNon-Breaking Changes:")
    for change in diff_result.non_breaking_changes:
        print(f"  - {change}")

## 7. Anomaly Detection (NEW in v2.0)

Detect statistical anomalies in your data using Z-score, IQR, or percent change methods.

In [None]:
# Quick anomaly detection on numeric columns
report = detect_anomalies(orders, method="zscore", threshold=3.0)

print(f"\n{'='*60}")
print("ANOMALY DETECTION REPORT")
print(f"{'='*60}")
print(f"Source: {report.source}")
print(f"Anomalies found: {report.anomaly_count}")
print(f"\n{report.summary()}")

In [None]:
# Detailed anomaly detection with custom settings
detector = AnomalyDetector(method="iqr", threshold=1.5)
report = detector.detect(
    orders,
    columns=["quantity", "unit_price", "total_amount"],
    include_null_check=True
)

print(f"Checked {report.statistics.get('columns_checked', 0)} columns")
print(f"Method: {report.statistics.get('method')}")
print(f"Threshold: {report.statistics.get('threshold')}")

print("\nResults:")
for anomaly in report.anomalies:
    status = "ANOMALY" if anomaly.is_anomaly else "OK"
    print(f"  [{status}] {anomaly.column}: {anomaly.message}")
    if anomaly.is_anomaly and anomaly.samples:
        print(f"          Samples: {anomaly.samples}")

In [None]:
# Detect anomalies with historical baseline
# Useful for monitoring metrics over time

# Simulate historical baseline values
historical_totals = [50.0, 55.0, 48.0, 52.0, 51.0, 49.0, 53.0, 50.0]

detector = AnomalyDetector(method="percent_change", threshold=0.2)  # 20% change threshold
result = detector.detect_column(
    orders,
    "total_amount",
    baseline_values=historical_totals
)

print(f"Column: {result.column}")
print(f"Is Anomaly: {result.is_anomaly}")
print(f"Score: {result.score:.2f}")
print(f"Threshold: {result.threshold}")
print(f"Message: {result.message}")

### Available Anomaly Detection Methods

| Method | Description | Best For |
|--------|-------------|----------|
| `zscore` | Standard deviations from mean | Normal distributions |
| `iqr` | Interquartile range | Robust to outliers |
| `percent_change` | % change from baseline | Monitoring metrics |
| `modified_zscore` | Uses median & MAD | Non-normal distributions |
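
For intuition about how three of these methods differ, here is a minimal standard-library sketch. The function names and the sample `quantities` list are hypothetical, and the thresholds are common textbook defaults, not necessarily the ones DuckGuard applies.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def modified_zscore_outliers(values, threshold=3.5):
    """Flag values using the median and the median absolute deviation (MAD)."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


# Hypothetical quantities with two extreme values
quantities = [2, 1, 3, 5, 2, 1, 4, 2, 1, 3, 50, 100]
print("zscore:         ", zscore_outliers(quantities))
print("iqr:            ", iqr_outliers(quantities))
print("modified_zscore:", modified_zscore_outliers(quantities))
```

On this sample the plain z-score flags nothing, because the two extreme values inflate the standard deviation, while the IQR and modified z-score variants both flag 50 and 100. That is the practical meaning of "robust to outliers" in the table above.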

## 8. Freshness Monitoring (NEW in v2.2)

Detect stale data before it causes problems. DuckGuard checks freshness via file modification time or timestamp columns.
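
The file-based check boils down to comparing a file's modification time with the current time. The sketch below shows that idea using only the standard library. `file_is_fresh` is a hypothetical helper, not part of DuckGuard, and the temporary file is created just for the demo.

```python
import os
import tempfile
import time
from datetime import timedelta


def file_is_fresh(path, max_age):
    """Return (is_fresh, age), where age is the time since the file was last modified."""
    age = timedelta(seconds=time.time() - os.path.getmtime(path))
    return age <= max_age, age


# Create a throwaway file for the demo
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id\n1\n")
demo_path = tmp.name

fresh_now, _ = file_is_fresh(demo_path, timedelta(hours=24))

# Backdate the modification time by two days to simulate stale data
two_days_ago = time.time() - 2 * 24 * 3600
os.utime(demo_path, (two_days_ago, two_days_ago))
fresh_after_backdate, age = file_is_fresh(demo_path, timedelta(hours=24))
os.remove(demo_path)

print(f"Fresh right after writing: {fresh_now}")
print(f"Fresh after backdating:    {fresh_after_backdate}")
```

Modification time only says when the file was last written, not how recent the records inside it are, which is why a timestamp-column check is the useful complement.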

In [None]:
# Freshness Monitoring - check data staleness
from datetime import timedelta

from duckguard.freshness import FreshnessMonitor

# Quick freshness check via dataset property
print("Freshness Check via Property:")
print("-" * 60)
freshness = orders.freshness
print(f"Source: {freshness.source}")
print(f"Last modified: {freshness.last_modified}")
print(f"Age: {freshness.age_human}")
print(f"Is fresh (24h threshold): {freshness.is_fresh}")
print(f"Method: {freshness.method.value}")

# Custom threshold check
print("\nCustom Threshold Check:")
print("-" * 60)
is_fresh_6h = orders.is_fresh(timedelta(hours=6))
print(f"Fresh within 6 hours: {is_fresh_6h}")

is_fresh_1d = orders.is_fresh(timedelta(days=1))
print(f"Fresh within 1 day: {is_fresh_1d}")

In [None]:
# FreshnessMonitor for advanced freshness checks
monitor = FreshnessMonitor(threshold=timedelta(hours=12))

# Check via file modification time
result = monitor.check_file_mtime("sample_data/orders.csv")
print("File Modification Time Check:")
print("-" * 60)
print(f"  Last modified: {result.last_modified}")
print(f"  Age: {result.age_human}")
print(f"  Is fresh: {result.is_fresh}")
print(f"  Threshold: {result.threshold_seconds / 3600:.1f} hours")

# Check via timestamp column (if you have one)
# result = monitor.check_column_timestamp(orders, "created_at")
# print(f"Column timestamp fresh: {result.is_fresh}")

# Full result as dictionary (useful for logging/storage)
print("\nResult as dict:")
print(result.to_dict())

## 9. ML-Based Anomaly Detection (NEW in v2.2)

DuckGuard now supports machine learning-based anomaly detection methods:
- **Baseline Method**: Learn from historical data and detect deviations
- **KS-Test Method**: Kolmogorov-Smirnov test for distribution drift
- **Seasonal Method**: Account for time-based patterns

These methods learn what normal looks like from the data itself, so you set a sensitivity or p-value instead of hand-picking value thresholds for each column.

In [None]:
# ML-Based Anomaly Detection
from duckguard.anomaly import BaselineMethod, KSTestMethod

# Baseline Method - learn and compare
print("Baseline Method:")
print("-" * 60)
baseline = BaselineMethod(sensitivity=2.0)

# Fit on numeric column data
baseline.fit(orders.total_amount)
print("Learned baseline for 'total_amount'")
print(f"  Mean: {baseline.baseline_mean:.2f}")
print(f"  Stddev: {baseline.baseline_std:.2f}")

# Score values against baseline (0 = normal, higher = more anomalous)
scores = baseline.score(orders.total_amount)
print(f"  Scored {len(scores)} values")
print(f"  Max anomaly score: {max(scores):.2f}")
print(f"  Anomalies found: {sum(1 for s in scores if s > 1.0)}")

In [None]:
# KS-Test Method - detect distribution drift
print("KS-Test Method (Distribution Drift):")
print("-" * 60)
ks_method = KSTestMethod(p_value_threshold=0.05)

# Compare current distribution to a reference
comparison = ks_method.compare_distributions(orders.total_amount)
print("Column: total_amount")
print(f"  P-value: {comparison.p_value:.4f}")
print(f"  Statistic: {comparison.statistic:.4f}")
print(f"  Is drift detected: {comparison.is_drift}")
print(f"  Message: {comparison.message}")

## 10. Schema Evolution Tracking (NEW in v2.2)

Track schema changes over time and detect breaking changes before they cause issues.
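
Conceptually, detecting schema changes means diffing two snapshots of column names and types. The sketch below shows one way to do that. The function `diff_schemas`, the sample schemas, and the rule that removals and type changes are breaking while additions are not are all assumptions for illustration; DuckGuard's own classification may differ.

```python
def diff_schemas(old, new):
    """Compare two {column: dtype} mappings and classify the changes."""
    breaking, non_breaking = [], []
    for column, dtype in old.items():
        if column not in new:
            breaking.append(f"removed column '{column}'")
        elif new[column] != dtype:
            breaking.append(f"type of '{column}' changed: {dtype} -> {new[column]}")
    for column in new:
        if column not in old:
            non_breaking.append(f"added column '{column}'")
    return breaking, non_breaking


# Hypothetical snapshots: one type change, one removal, one addition
old_schema = {"order_id": "VARCHAR", "quantity": "BIGINT", "status": "VARCHAR"}
new_schema = {"order_id": "VARCHAR", "quantity": "DOUBLE", "discount": "DOUBLE"}

breaking, non_breaking = diff_schemas(old_schema, new_schema)
print("Breaking:", breaking)
print("Non-breaking:", non_breaking)
```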

In [None]:
# Schema Evolution Tracking
import os

# Use temp storage for demo
import tempfile

from duckguard.history import HistoryStorage
from duckguard.schema_history import SchemaChangeAnalyzer, SchemaTracker

schema_db = os.path.join(tempfile.gettempdir(), "demo_schema.db")
schema_storage = HistoryStorage(db_path=schema_db)

# Create a schema tracker
tracker = SchemaTracker(storage=schema_storage)

# Capture a snapshot of the current schema
print("Schema Snapshot:")
print("-" * 60)
snapshot = tracker.capture(orders)
print(f"Source: {snapshot.source}")
print(f"Snapshot ID: {snapshot.snapshot_id[:8]}...")
print(f"Columns: {snapshot.column_count}")
print(f"Rows: {snapshot.row_count}")
print("\nColumn Schema:")
for col in snapshot.columns[:5]:  # Show first 5 columns
    print(f"  {col.name}: {col.dtype} (nullable={col.nullable})")

In [None]:
# Detect schema changes
analyzer = SchemaChangeAnalyzer(storage=schema_storage)

# Detect changes against previous snapshot
print("Schema Change Detection:")
print("-" * 60)
report = analyzer.detect_changes(orders)

print(f"Previous snapshot: {report.previous_snapshot.snapshot_id[:8] if report.previous_snapshot else 'None'}...")
print(f"Current snapshot: {report.current_snapshot.snapshot_id[:8]}...")
print(f"Has changes: {report.has_changes}")
print(f"Has breaking changes: {report.has_breaking_changes}")

if report.changes:
    print("\nChanges detected:")
    for change in report.changes:
        print(f"  {change}")
else:
    print("\nNo schema changes detected (same schema as previous snapshot)")

# View schema history
print("\nSchema History:")
history = tracker.get_history(orders.source, limit=5)
for snap in history:
    print(f"  {snap.captured_at}: {snap.column_count} columns, {snap.row_count} rows")

## 11. Reference/FK Checks & Cross-Dataset Validation (NEW in v2.2)

Validate foreign key relationships and compare data across multiple datasets. This is especially useful in data lakes, where no database enforces referential integrity for you.

In [None]:
# Create sample data for cross-dataset validation demo
import os
import tempfile

# Create a customers reference table
customers_content = """id,name,email
CUST-001,Alice,alice@example.com
CUST-002,Bob,bob@example.com
CUST-003,Charlie,charlie@example.com
CUST-004,Diana,diana@example.com
CUST-005,Eve,eve@example.com
"""

# Create orders with some invalid customer references (orphans)
orders_orphans_content = """order_id,customer_id,amount,status
ORD-001,CUST-001,100.00,shipped
ORD-002,CUST-002,200.00,pending
ORD-003,CUST-999,150.00,shipped
ORD-004,CUST-001,50.00,delivered
ORD-005,CUST-888,300.00,pending
ORD-006,CUST-003,75.00,shipped
"""

# Create a status lookup table
status_lookup_content = """code,description
shipped,Order has been shipped
pending,Order is pending
delivered,Order has been delivered
cancelled,Order was cancelled
"""

# Write temp files
temp_dir = tempfile.gettempdir()
customers_file = os.path.join(temp_dir, "demo_customers.csv")
orders_orphans_file = os.path.join(temp_dir, "demo_orders_orphans.csv")
status_lookup_file = os.path.join(temp_dir, "demo_status_lookup.csv")

with open(customers_file, 'w') as f:
    f.write(customers_content)
with open(orders_orphans_file, 'w') as f:
    f.write(orders_orphans_content)
with open(status_lookup_file, 'w') as f:
    f.write(status_lookup_content)

print("Created demo files for cross-dataset validation")

In [None]:
# Reference/FK Checks - Validate foreign key relationships
from duckguard import connect

# Connect to our demo datasets
customers = connect(customers_file)
orders_with_orphans = connect(orders_orphans_file)
status_lookup = connect(status_lookup_file)

print("Datasets loaded:")
print(f"  Customers: {customers.row_count} rows")
print(f"  Orders: {orders_with_orphans.row_count} rows")
print(f"  Status Lookup: {status_lookup.row_count} rows")

# Check if all customer_id values exist in customers table
print("\n" + "=" * 60)
print("REFERENCE/FK VALIDATION")
print("=" * 60)

result = orders_with_orphans["customer_id"].exists_in(customers["id"])

print("\nCheck: orders.customer_id exists_in customers.id")
print(f"Passed: {result.passed}")
print(f"Orphan count: {result.actual_value}")

if not result.passed:
    print("\nOrphan records found:")
    for row in result.failed_rows:
        print(f"  Row {row.row_number}: customer_id = '{row.value}'")

    print(f"\nDetails: {result.details}")

In [None]:
# references() - FK check with null handling options
print("references() with null handling:")
print("-" * 60)

# allow_nulls=True (default) - treats nulls as valid (optional FK)
result_allow_nulls = orders_with_orphans["customer_id"].references(
    customers["id"],
    allow_nulls=True
)
print(f"With allow_nulls=True: {result_allow_nulls.actual_value} failures")

# allow_nulls=False - treats nulls as failures (required FK)
result_no_nulls = orders_with_orphans["customer_id"].references(
    customers["id"],
    allow_nulls=False
)
print(f"With allow_nulls=False: {result_no_nulls.actual_value} failures")

# Get list of orphan values for debugging
print("\nfind_orphans() - Get orphan values:")
print("-" * 60)
orphans = orders_with_orphans["customer_id"].find_orphans(customers["id"])
print(f"Orphan customer IDs: {orphans}")

In [None]:
# Cross-Dataset Validation - Compare columns and row counts
print("Cross-Dataset Validation:")
print("=" * 60)

# matches_values() - Check if column values match a lookup table
print("\nmatches_values() - Validate against lookup table:")
print("-" * 60)
result = orders_with_orphans["status"].matches_values(status_lookup["code"])
print("Check: orders.status matches_values status_lookup.code")
print(f"Passed: {result.passed}")
print("Details:")
print(f"  Missing in other: {result.details.get('missing_in_other', 0)} values")
print(f"  Extra in other: {result.details.get('extra_in_other', 0)} values")

# The orders have: shipped, pending, delivered
# The lookup has: shipped, pending, delivered, cancelled
# So "cancelled" is extra in the lookup (not used in orders)

# row_count_matches() - Compare row counts between datasets
print("\nrow_count_matches() - Compare dataset sizes:")
print("-" * 60)

# Create a smaller backup file (3 rows, versus 6 in orders_with_orphans)
backup_orders_content = """order_id,customer_id,amount,status
ORD-001,CUST-001,100.00,shipped
ORD-002,CUST-002,200.00,pending
ORD-003,CUST-003,150.00,shipped
"""
backup_file = os.path.join(temp_dir, "demo_backup_orders.csv")
with open(backup_file, 'w') as f:
    f.write(backup_orders_content)

backup_orders = connect(backup_file)

# Exact match (will fail - different counts)
result = orders_with_orphans.row_count_matches(backup_orders)
print(f"Exact match: {result.passed}")
print(f"  Source: {result.details['source_count']} rows")
print(f"  Backup: {result.details['other_count']} rows")
print(f"  Difference: {result.actual_value}")

# With tolerance (allows small differences)
result_tolerance = orders_with_orphans.row_count_matches(backup_orders, tolerance=5)
print(f"\nWith tolerance=5: {result_tolerance.passed}")

### Cross-Dataset Validation Summary

| Method | Description | Use Case |
|--------|-------------|----------|
| `col.exists_in(other_col)` | Check all values exist in reference column | FK validation |
| `col.references(other_col, allow_nulls)` | FK check with null handling | Optional/Required FK |
| `col.find_orphans(other_col)` | Get list of orphan values | Debugging |
| `col.matches_values(other_col)` | Check value sets match | Lookup validation |
| `dataset.row_count_matches(other, tolerance)` | Compare row counts | Backup validation |
| `dataset.row_count_equals(other)` | Exact row count match | Exact comparison |

### Features

- **Efficient SQL**: Uses anti-join patterns for performance on large datasets
- **Row-Level Details**: See exactly which rows have orphan values
- **Null Handling**: Control how nulls are treated in FK checks
- **Tolerance**: Allow small differences in row count comparisons
- **Shared Engine**: Multiple datasets share the same DuckDB connection
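
The anti-join pattern mentioned above can be sketched in plain SQL. The example uses Python's built-in `sqlite3` module only so that it runs without extra dependencies; it stands in for DuckDB here. The table contents are hypothetical, modeled on the demo data, and the query is not necessarily the one DuckGuard generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id TEXT);
    CREATE TABLE orders (order_id TEXT, customer_id TEXT);
    INSERT INTO customers VALUES ('CUST-001'), ('CUST-002'), ('CUST-003');
    INSERT INTO orders VALUES
        ('ORD-001', 'CUST-001'),
        ('ORD-002', 'CUST-002'),
        ('ORD-003', 'CUST-999'),
        ('ORD-005', 'CUST-888'),
        ('ORD-007', NULL);
""")

# Anti-join: keep order rows whose customer_id has no match in customers.
# Null foreign keys are skipped, i.e. treated as valid (an optional FK).
orphans = conn.execute("""
    SELECT o.order_id, o.customer_id
    FROM orders AS o
    LEFT JOIN customers AS c ON o.customer_id = c.id
    WHERE c.id IS NULL
      AND o.customer_id IS NOT NULL
    ORDER BY o.order_id
""").fetchall()
conn.close()

print(orphans)
```

Dropping the `o.customer_id IS NOT NULL` condition would also report the row with a null key, which corresponds to treating the foreign key as required.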

## 12. Reconciliation (NEW in v2.2)

Reconciliation is essential for validating data migrations, ETL pipelines, and ensuring data synchronization between systems. It performs comprehensive row-by-row comparison using key columns.
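
The idea behind key-based reconciliation can be sketched with plain dictionaries: index both sides by the key, then compare key sets and column values. The function `reconcile_rows` and the sample rows are hypothetical and serve only to illustrate the comparison, not DuckGuard's implementation.

```python
def reconcile_rows(source_rows, target_rows, key, compare_columns):
    """Compare two lists of row dicts by key; return (missing, extra, mismatches)."""
    source = {row[key]: row for row in source_rows}
    target = {row[key]: row for row in target_rows}

    missing_in_target = sorted(source.keys() - target.keys())
    extra_in_target = sorted(target.keys() - source.keys())

    mismatches = []
    for k in sorted(source.keys() & target.keys()):
        for column in compare_columns:
            if source[k][column] != target[k][column]:
                mismatches.append((k, column, source[k][column], target[k][column]))
    return missing_in_target, extra_in_target, mismatches


# Hypothetical rows: one changed amount, one missing key, one extra key
source_rows = [
    {"order_id": "ORD-001", "amount": "100.00"},
    {"order_id": "ORD-002", "amount": "200.00"},
    {"order_id": "ORD-004", "amount": "50.00"},
]
target_rows = [
    {"order_id": "ORD-001", "amount": "100.00"},
    {"order_id": "ORD-002", "amount": "205.00"},
    {"order_id": "ORD-006", "amount": "75.00"},
]

missing, extra, mismatches = reconcile_rows(source_rows, target_rows, "order_id", ["amount"])
print("Missing in target:", missing)
print("Extra in target:  ", extra)
print("Value mismatches: ", mismatches)
```

This sketch assumes the key is unique on each side. Duplicate keys would silently overwrite one another in the dictionaries, so a production check needs to detect them first.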

In [None]:
# Create source and target datasets for reconciliation demo
source_content = """order_id,customer_id,amount,status
ORD-001,CUST-001,100.00,shipped
ORD-002,CUST-002,200.00,pending
ORD-003,CUST-001,150.00,shipped
ORD-004,CUST-003,50.00,delivered
ORD-005,CUST-002,300.00,pending
"""

# Target has some differences: ORD-002 amount changed, ORD-004/005 missing, ORD-006 added
target_content = """order_id,customer_id,amount,status
ORD-001,CUST-001,100.00,shipped
ORD-002,CUST-002,205.00,pending
ORD-003,CUST-001,150.00,shipped
ORD-006,CUST-003,75.00,delivered
"""

# Write temp files
source_recon_file = os.path.join(temp_dir, "demo_source_orders.csv")
target_recon_file = os.path.join(temp_dir, "demo_target_orders.csv")

with open(source_recon_file, 'w') as f:
    f.write(source_content)
with open(target_recon_file, 'w') as f:
    f.write(target_content)

print("Created reconciliation demo files")

In [None]:
# Reconciliation - Compare two datasets comprehensively
source = connect(source_recon_file)
target = connect(target_recon_file)

print("Source dataset:", source.row_count, "rows")
print("Target dataset:", target.row_count, "rows")

# Reconcile using order_id as key
print("\n" + "=" * 60)
print("RECONCILIATION RESULTS")
print("=" * 60)

result = source.reconcile(
    target,
    key_columns=["order_id"],
    compare_columns=["customer_id", "amount", "status"]
)

print(f"\nPassed: {result.passed}")
print(f"Match percentage: {result.match_percentage:.1f}%")
print(f"Missing in target: {result.missing_in_target} rows")
print(f"Extra in target: {result.extra_in_target} rows")
print(f"Value mismatches: {result.value_mismatches}")

# Full summary
print("\n" + result.summary())
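
To make the reported numbers easier to interpret, here is a minimal plain-Python sketch of a key-based comparison over the same demo data. It is not DuckGuard's implementation; `index_by_key` and `reconcile_rows` are hypothetical helpers, and values are compared as raw strings.

```python
import csv
import io

source_csv = """order_id,customer_id,amount,status
ORD-001,CUST-001,100.00,shipped
ORD-002,CUST-002,200.00,pending
ORD-003,CUST-001,150.00,shipped
ORD-004,CUST-003,50.00,delivered
ORD-005,CUST-002,300.00,pending
"""

target_csv = """order_id,customer_id,amount,status
ORD-001,CUST-001,100.00,shipped
ORD-002,CUST-002,205.00,pending
ORD-003,CUST-001,150.00,shipped
ORD-006,CUST-003,75.00,delivered
"""


def index_by_key(text, key):
    """Parse CSV text into a dict of rows keyed on one column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def reconcile_rows(source_text, target_text, key, compare_columns):
    """Return (missing_in_target, extra_in_target, value_mismatches)."""
    src = index_by_key(source_text, key)
    tgt = index_by_key(target_text, key)
    missing = sorted(src.keys() - tgt.keys())
    extra = sorted(tgt.keys() - src.keys())
    mismatches = [
        (k, col, src[k][col], tgt[k][col])
        for k in sorted(src.keys() & tgt.keys())
        for col in compare_columns
        if src[k][col] != tgt[k][col]
    ]
    return missing, extra, mismatches


missing, extra, mismatches = reconcile_rows(
    source_csv, target_csv, "order_id", ["customer_id", "amount", "status"]
)
print("Missing in target:", missing)
print("Extra in target:", extra)
print("Value mismatches:", mismatches)
```

For this data the sketch finds two keys missing in the target, one extra key, and one mismatched `amount` on `ORD-002`.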

## 13. Distribution Drift Detection (NEW in v2.2)

Detect when a column's distribution has shifted significantly relative to a baseline. This is useful for ML model monitoring, feature drift detection, and pipeline consistency checks. Drift is assessed with the Kolmogorov-Smirnov (KS) test, which compares the baseline and current distributions.

In [None]:
# Create baseline and drifted datasets
baseline_content = """id,amount,score
1,100.0,0.5
2,150.0,0.6
3,120.0,0.55
4,180.0,0.7
5,130.0,0.58
6,140.0,0.62
7,160.0,0.65
8,110.0,0.52
9,170.0,0.68
10,125.0,0.56
"""

# Drifted data has significantly different distribution
drifted_content = """id,amount,score
1,1000.0,0.9
2,1500.0,0.95
3,1200.0,0.88
4,1800.0,0.99
5,1300.0,0.92
6,1400.0,0.94
7,1600.0,0.96
8,1100.0,0.87
9,1700.0,0.98
10,1250.0,0.91
"""

baseline_file = os.path.join(temp_dir, "demo_baseline.csv")
drifted_file = os.path.join(temp_dir, "demo_drifted.csv")

with open(baseline_file, 'w') as f:
    f.write(baseline_content)
with open(drifted_file, 'w') as f:
    f.write(drifted_content)

print("Created drift detection demo files")

In [None]:
# Distribution Drift Detection
baseline = connect(baseline_file)
drifted = connect(drifted_file)

print("=" * 60)
print("DISTRIBUTION DRIFT DETECTION")
print("=" * 60)

# Detect drift in amount column
print("\nChecking 'amount' column for drift:")
print("-" * 60)
result = baseline["amount"].detect_drift(drifted["amount"])

print(f"Drift detected: {result.is_drifted}")
print(f"P-value: {result.p_value:.4f}")
print(f"KS statistic: {result.statistic:.4f}")
print(f"Threshold: {result.threshold}")
print(f"Method: {result.method}")
print(f"\nMessage: {result.message}")

# Check another column
print("\nChecking 'score' column for drift:")
print("-" * 60)
result_score = baseline["score"].detect_drift(drifted["score"])
print(f"Drift detected: {result_score.is_drifted}")
print(f"P-value: {result_score.p_value:.4f}")
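
For intuition about the KS statistic reported above, the following plain-Python sketch computes the two-sample statistic (the largest gap between the two empirical CDFs) for the demo `amount` values. It is an illustration only and does not compute a p-value; `ks_statistic` is a hypothetical helper, not a DuckGuard function.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The maximum gap is always attained at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


baseline_amounts = [100.0, 150.0, 120.0, 180.0, 130.0, 140.0, 160.0, 110.0, 170.0, 125.0]
drifted_amounts = [1000.0, 1500.0, 1200.0, 1800.0, 1300.0, 1400.0, 1600.0, 1100.0, 1700.0, 1250.0]

print(ks_statistic(baseline_amounts, drifted_amounts))
print(ks_statistic(baseline_amounts, baseline_amounts))
```

Because the two demo samples do not overlap at all, the statistic reaches its maximum of 1.0, while comparing a sample with itself gives 0.0.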

## 14. Group By Checks (NEW in v2.2)

Run validation checks per group (segment) of your data. This supports partition-level validation, regional quality checks, and consistent quality across segments such as date, region, or product category.

In [None]:
# Create grouped data for demo
grouped_content = """order_id,customer_id,amount,status,region,date
ORD-001,CUST-001,100.00,shipped,US,2024-01-01
ORD-002,CUST-002,200.00,pending,US,2024-01-01
ORD-003,CUST-001,150.00,shipped,EU,2024-01-02
ORD-004,CUST-003,50.00,delivered,EU,2024-01-02
ORD-005,CUST-002,300.00,pending,US,2024-01-03
ORD-006,CUST-001,75.00,shipped,EU,2024-01-03
ORD-007,CUST-004,125.00,shipped,APAC,2024-01-01
ORD-008,CUST-004,175.00,pending,APAC,2024-01-02
"""

grouped_file = os.path.join(temp_dir, "demo_grouped_orders.csv")
with open(grouped_file, 'w') as f:
    f.write(grouped_content)

grouped_orders = connect(grouped_file)
print(f"Created grouped orders: {grouped_orders.row_count} rows")

In [None]:
# Group By - Get statistics per group
print("=" * 60)
print("GROUP BY CHECKS")
print("=" * 60)

# Group by region
print("\nGroup by region:")
print("-" * 60)
grouped = grouped_orders.group_by("region")
print(f"Groups found: {grouped.groups}")
print(f"Total groups: {grouped.group_count}")

# Get statistics per group
print("\nStatistics per group:")
stats = grouped.stats()
for g in stats:
    print(f"  {g['region']}: {g['row_count']} rows")

# Validate row count per group
print("\nValidation: row_count > 1 per region:")
print("-" * 60)
result = grouped_orders.group_by("region").row_count_greater_than(1)
print(f"Passed: {result.passed}")
print(f"Passed groups: {result.passed_groups}/{result.total_groups}")

# More restrictive validation
print("\nValidation: row_count > 5 per region:")
print("-" * 60)
result = grouped_orders.group_by("region").row_count_greater_than(5)
print(f"Passed: {result.passed}")
print(f"Passed groups: {result.passed_groups}/{result.total_groups}")

if not result.passed:
    print("\nFailed groups:")
    for g in result.get_failed_groups():
        print(f"  {g.group_key}: {g.row_count} rows")
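
The per-group row count check can be pictured with a small plain-Python sketch over the `region` values of the demo data. This only illustrates the logic; `failed_groups` is a hypothetical helper and says nothing about how DuckGuard evaluates groups internally.

```python
from collections import Counter

# The region column of the eight demo orders, in file order.
regions = ["US", "US", "EU", "EU", "US", "EU", "APAC", "APAC"]


def failed_groups(group_values, min_exclusive):
    """Return {group: row_count} for groups whose count is not > min_exclusive."""
    counts = Counter(group_values)
    return {group: n for group, n in counts.items() if not n > min_exclusive}


print(failed_groups(regions, 1))  # every region has more than one row
print(failed_groups(regions, 5))  # no region has more than five rows
```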

### New Features Summary (v2.2)

| Feature | Method | Description |
|---------|--------|-------------|
| **Reconciliation** | `dataset.reconcile(target, key_columns)` | Compare datasets row-by-row using keys |
| **Distribution Drift** | `col.detect_drift(reference_col)` | KS-test based distribution comparison |
| **Group By Checks** | `dataset.group_by("col").row_count_greater_than(n)` | Validate per-group metrics |

### Result Types

| Result Type | Key Attributes |
|-------------|---------------|
| `ReconciliationResult` | `.passed`, `.missing_in_target`, `.extra_in_target`, `.value_mismatches`, `.summary()` |
| `DriftResult` | `.is_drifted`, `.p_value`, `.statistic`, `.threshold`, `.summary()` |
| `GroupByResult` | `.passed`, `.total_groups`, `.passed_groups`, `.get_failed_groups()`, `.summary()` |

## 11. Email Notifications (NEW in v2.2)

Send email alerts when data quality checks fail via SMTP.

In [None]:
# Email Notifications
from duckguard.notifications import EmailNotifier

# Configure email notifier
# In practice, set the DUCKGUARD_EMAIL_CONFIG environment variable to a JSON config
email = EmailNotifier(
    smtp_host="smtp.gmail.com",
    smtp_port=587,
    smtp_user="alerts@company.com",
    smtp_password="your_app_password",  # Use app passwords, not regular passwords!
    from_address="duckguard@company.com",
    to_addresses=["team@company.com", "oncall@company.com"],
    use_tls=True,
)

print("EmailNotifier configured:")
print("-" * 60)
print(f"SMTP Host: {email.config.smtp_host}")
print(f"SMTP Port: {email.config.smtp_port}")
print(f"From: {email.config.from_address}")
print(f"To: {email.config.to_addresses}")
print(f"TLS: {email.config.use_tls}")

# To send an alert (uncomment when you have real SMTP settings):
# result = execute_rules(rules, dataset=orders)
# if not result.passed:
#     email.send_failure_alert(result)
#     print("Failure alert sent!")
#
# # Or send results regardless of pass/fail:
# email.send_results(result)

print("\nNote: Email sending requires valid SMTP credentials.")
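
As a rough picture of what an SMTP alert involves, the sketch below builds a failure message with the standard library's `email.message.EmailMessage`. The subject format, the `build_failure_email` helper, and the check descriptions are assumptions for illustration; DuckGuard's actual message layout may differ, and nothing is sent here.

```python
from email.message import EmailMessage


def build_failure_email(from_address, to_addresses, source, failed_checks):
    """Assemble a plain-text alert listing the failed checks (hypothetical format)."""
    msg = EmailMessage()
    msg["Subject"] = f"[DuckGuard] {len(failed_checks)} check(s) failed for {source}"
    msg["From"] = from_address
    msg["To"] = ", ".join(to_addresses)
    msg.set_content("\n".join(f"- {check}" for check in failed_checks))
    return msg


msg = build_failure_email(
    "duckguard@company.com",
    ["team@company.com", "oncall@company.com"],
    "orders.csv",
    ["quantity between 1 and 5", "email matches pattern"],
)
print(msg["Subject"])
print(msg.get_content())

# Sending would go through smtplib.SMTP(host, 587) with starttls() and login();
# that step is omitted because it needs real credentials.
```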

## 9. Python Assertions (Traditional Approach)

You can still use simple Python assertions - DuckGuard integrates with pytest!

In [None]:
# Basic checks using properties
assert orders.row_count > 0, "Dataset should not be empty"
assert orders.customer_id.null_percent < 5, "Customer ID should have < 5% nulls"
assert orders.total_amount.min >= 0, "Amounts should be non-negative"

print("All basic assertions passed!")

In [None]:
# Validation methods with detailed results
result = orders.order_id.is_not_null(threshold=1.0)
print(f"is_not_null: {result}")
print(f"  Message: {result.message}")

result = orders.total_amount.between(0, 100000)
print(f"\nbetween: {result}")
print(f"  Message: {result.message}")

result = orders.status.isin(['pending', 'shipped', 'delivered', 'cancelled'])
print(f"\nisin: {result}")
print(f"  Message: {result.message}")

In [None]:
# More validation methods
print("Additional validation methods:")
print("-" * 60)

# is_unique - check if column values are unique
result = orders.order_id.is_unique(threshold=100.0)
print(f"is_unique: {result}")
print(f"  Message: {result.message}")

# matches - regex pattern matching
result = orders.email.matches(r'^[\w\.-]+@[\w\.-]+\.\w+$')
print(f"\nmatches (email pattern): {result}")
print(f"  Message: {result.message}")

# has_no_duplicates - check for duplicate values
result = orders.order_id.has_no_duplicates()
print(f"\nhas_no_duplicates: {result}")
print(f"  Message: {result.message}")

# greater_than - value comparison
result = orders.quantity.greater_than(0)
print(f"\ngreater_than(0): {result}")
print(f"  Message: {result.message}")

# less_than - value comparison
result = orders.unit_price.less_than(1000)
print(f"\nless_than(1000): {result}")
print(f"  Message: {result.message}")

# value_lengths_between - string length validation
result = orders.order_id.value_lengths_between(7, 7)  # ORD-XXX format = 7 chars
print(f"\nvalue_lengths_between(7, 7): {result}")
print(f"  Message: {result.message}")

# get_distinct_values - view unique values
distinct_products = orders.product_name.get_distinct_values(limit=5)
print(f"\nget_distinct_values (products): {distinct_products}")

## 10. Row-Level Error Debugging (NEW in v2.1)

When validation fails, DuckGuard now captures exactly which rows failed, making debugging much easier.

In [None]:
# When a validation check fails, you can see exactly which rows failed
# Let's use a restrictive range to force some failures
result = orders.quantity.between(1, 5)

print(f"Check passed: {result.passed}")
print(f"Total failures: {result.total_failures}")

# Get the summary with sample failing rows
if not result.passed:
    print(f"\n{result.summary()}")

    # Get just the failed values
    failed_values = result.get_failed_values()
    print(f"\nFailed values: {failed_values[:5]}...")

    # Get just the row indices
    failed_indices = result.get_failed_row_indices()
    print(f"Failed row indices: {failed_indices[:5]}...")

In [None]:
# Work with individual FailedRow objects for detailed debugging

if result.failed_rows:
    print("Detailed failed row information:")
    print("-" * 60)
    for row in result.failed_rows[:3]:  # Show first 3
        print(f"  Row {row.row_index}:")
        print(f"    Column: {row.column}")
        print(f"    Value: {row.value}")
        print(f"    Expected: {row.expected}")
        if row.reason:
            print(f"    Reason: {row.reason}")
        print()

# You can disable row capture for performance on large datasets
result_no_capture = orders.quantity.between(1, 5, capture_failures=False)
print(f"With capture_failures=False: {len(result_no_capture.failed_rows)} rows captured")
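
The trade-off behind `capture_failures` can be sketched in plain Python: counting failures is cheap, while keeping per-row details costs memory. The class and function below are hypothetical stand-ins, and both the null handling (nulls are skipped) and the `max_rows` cap are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class FailedRowSketch:
    """Hypothetical stand-in for a captured failing row."""
    row_index: int
    column: str
    value: object
    expected: str


def between_with_capture(values, column, low, high, capture_failures=True, max_rows=10):
    """Count out-of-range values and optionally keep details for the first few."""
    total_failures = 0
    failed_rows = []
    for index, value in enumerate(values):
        if value is None or low <= value <= high:
            continue  # assumption: nulls are left to a separate null check
        total_failures += 1
        if capture_failures and len(failed_rows) < max_rows:
            failed_rows.append(
                FailedRowSketch(index, column, value, f"between {low} and {high}")
            )
    return total_failures, failed_rows


quantities = [1, 3, 8, 5, 12]
total, rows = between_with_capture(quantities, "quantity", 1, 5)
total_only, no_rows = between_with_capture(quantities, "quantity", 1, 5, capture_failures=False)
print(total, [(r.row_index, r.value) for r in rows])
print(total_only, no_rows)
```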

## 11. Slack/Teams Notifications (NEW in v2.1)

Get notified when your data quality checks fail. DuckGuard supports Slack and Microsoft Teams webhooks.

In [None]:
# Import notification components
from duckguard.notifications import SlackNotifier, TeamsNotifier

# Configure a Slack notifier (use your actual webhook URL)
# You can also set DUCKGUARD_SLACK_WEBHOOK environment variable
slack = SlackNotifier(
    webhook_url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
    channel="#data-quality",  # Optional override
    username="DuckGuard Bot"
)

# Configure a Teams notifier (use your actual webhook URL)
# You can also set DUCKGUARD_TEAMS_WEBHOOK environment variable
teams = TeamsNotifier(
    webhook_url="https://outlook.office.com/webhook/YOUR/WEBHOOK/URL"
)

print("Notifiers configured!")
print(f"Slack channel: {slack.config.channel}")
print(f"Teams configured: {teams.webhook_url is not None}")

### Sending Notifications on Failures

```python
# Execute rules and send notification on failure
from duckguard import load_rules, execute_rules
from duckguard.notifications import SlackNotifier

rules = load_rules("duckguard.yaml")
result = execute_rules(rules, dataset=orders)

# Only sends if there are failures (configurable)
slack = SlackNotifier(webhook_url="https://hooks.slack.com/...")

if not result.passed:
    # Send formatted failure alert
    slack.send_failure_alert(result)
    
# Or send results regardless of pass/fail
slack.send_results(result, notify_on_success=True)
```
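
For readers curious what a webhook alert looks like on the wire, this sketch builds a Slack incoming-webhook request with the standard library. The payload wording and the `build_slack_request` helper are assumptions for illustration, not the message DuckGuard actually sends; the request is constructed but never posted.

```python
import json
import urllib.request


def build_slack_request(webhook_url, source, failed_checks):
    """Build (but do not send) a POST request carrying a Block Kit payload."""
    payload = {
        # Fallback text shown in notifications.
        "text": f"DuckGuard: {len(failed_checks)} check(s) failed for {source}",
        "blocks": [
            {
                "type": "section",
                "text": {"type": "mrkdwn", "text": f"*Data quality failure:* `{source}`"},
            },
            {
                "type": "section",
                "text": {"type": "mrkdwn", "text": "\n".join(f"• {c}" for c in failed_checks)},
            },
        ],
    }
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_slack_request(
    "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
    "orders.csv",
    ["quantity between 1 and 5"],
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would deliver it; omitted because the URL is a placeholder.
```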

### Notification Features

| Feature | Description |
|---------|-------------|
| **Slack Blocks** | Rich formatted messages with sections and fields |
| **Teams Cards** | Adaptive cards with color-coded status |
| **Failure Details** | Shows which checks failed and why |
| **Pass Rate** | Overall quality score included |
| **Environment Variables** | `DUCKGUARD_SLACK_WEBHOOK` and `DUCKGUARD_TEAMS_WEBHOOK` |
| **Channel Override** | Send to specific channels per notification |

## 12. dbt Integration (NEW in v2.1)

Export DuckGuard validation rules as dbt tests, or import existing dbt tests as DuckGuard rules.

In [None]:
# Import dbt integration functions (PyYAML is used only to pretty-print the result)
import yaml

from duckguard.integrations import dbt

# Load DuckGuard rules
rules = load_rules("sample_data/duckguard.yaml")

# Convert rules to dbt test format (schema.yml structure)
dbt_tests = dbt.rules_to_dbt_tests(rules)

print("Converted to dbt schema.yml format:")
print("-" * 60)
print(yaml.dump(dbt_tests, default_flow_style=False, sort_keys=False))

### dbt Export Options

```python
from duckguard import load_rules
from duckguard.integrations import dbt

rules = load_rules("duckguard.yaml")

# Export to dbt schema.yml file (merges with existing if present)
dbt.export_to_schema(rules, "models/schema.yml")

# Generate dbt singular tests for complex checks
dbt.generate_singular_tests(rules, "tests/")
# Creates files like: tests/test_orders_email_null_percent.sql

# Import dbt tests back as DuckGuard rules
imported_rules = dbt.import_from_dbt("models/schema.yml")
```

### Mapping from DuckGuard to dbt

| DuckGuard Check | dbt Test |
|-----------------|----------|
| `not_null` | `not_null` |
| `unique` | `unique` |
| `isin`, `allowed_values` | `accepted_values` |
| `between`, `range` | `dbt_utils.expression_is_true` |
| `min`, `max` | `dbt_utils.expression_is_true` |
| `positive`, `non_negative` | `dbt_utils.expression_is_true` |
| `pattern`, `matches` | `dbt_utils.expression_is_true` with REGEXP |
| `null_percent` | Singular test (SQL file) |
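
The mapping table can be read as a simple translation step. The sketch below applies part of it to a small in-memory rule set and produces a dict shaped like a dbt `schema.yml`. The input format and the `to_dbt_tests` helper are hypothetical; the real `dbt.rules_to_dbt_tests` may structure its input and output differently.

```python
def to_dbt_tests(model_name, column_checks):
    """Translate {column: [(check_name, params), ...]} into a schema.yml-like dict."""
    columns = []
    for column, checks in column_checks.items():
        tests = []
        for check, params in checks:
            if check in ("not_null", "unique"):
                tests.append(check)
            elif check in ("isin", "allowed_values"):
                tests.append({"accepted_values": {"values": list(params)}})
            elif check in ("between", "range"):
                low, high = params
                # Assumes the column-level form of dbt_utils.expression_is_true,
                # where the expression is appended to the column name.
                tests.append(
                    {"dbt_utils.expression_is_true": {"expression": f"between {low} and {high}"}}
                )
        columns.append({"name": column, "tests": tests})
    return {"version": 2, "models": [{"name": model_name, "columns": columns}]}


schema = to_dbt_tests(
    "orders",
    {
        "order_id": [("not_null", None), ("unique", None)],
        "status": [("isin", ["pending", "shipped", "delivered", "cancelled"])],
        "total_amount": [("between", (0, 100000))],
    },
)
print(schema)
```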

## 13. HTML/PDF Reports (NEW in v2.1)

Generate beautiful, shareable data quality reports. Perfect for stakeholders and compliance documentation.

In [None]:
# Generate HTML/PDF reports from validation results
from duckguard.reports import generate_html_report

# First, run some validation
rules = load_rules_from_string(yaml_rules)
result = execute_rules(rules, dataset=orders)

# Generate an HTML report
generate_html_report(
    result,
    "quality_report.html",
    title="Orders Data Quality Report",
    include_passed=True  # Include passed checks in the report
)

print("HTML report generated: quality_report.html")
print(f"Quality Score: {result.quality_score:.1f}%")
print(f"Checks: {result.passed_count}/{result.total_checks} passed")

# For PDF reports (requires weasyprint: pip install duckguard[reports])
# generate_pdf_report(result, "quality_report.pdf", title="Orders Quality Report")

### Report Features

| Feature | Description |
|---------|-------------|
| **Standalone HTML** | No external dependencies, works offline |
| **Beautiful Styling** | Professional look with color-coded status |
| **Quality Score** | Overall score with A-F grade |
| **Check Details** | All passed and failed checks with messages |
| **PDF Export** | Print-ready PDF format (requires weasyprint) |
| **Customizable** | Custom titles, include/exclude passed checks |

### CLI Report Generation

```bash
# Generate HTML report
duckguard report data.csv --output report.html

# Generate PDF report
duckguard report data.csv --format pdf --output report.pdf

# With custom title and rules
duckguard report data.csv --config rules.yaml --title "Daily Quality Report"

# Store results in history while generating report
duckguard report data.csv --store
```

## 14. Historical Tracking (NEW in v2.1)

Store validation results over time and analyze quality trends, which is useful for monitoring data pipelines.

In [None]:
# Store and query validation history
import os
import tempfile

from duckguard.history import HistoryStorage, TrendAnalyzer

# Create a storage instance (defaults to ~/.duckguard/history.db)
# Use a temp file for this demo
temp_db = os.path.join(tempfile.gettempdir(), "demo_history.db")
storage = HistoryStorage(db_path=temp_db)

# Store a validation result
run_id = storage.store(result)
print(f"Stored validation run: {run_id[:8]}...")

# Store another run to simulate history
run_id_2 = storage.store(result)

# Query historical runs
runs = storage.get_runs(result.source, limit=5)
print(f"\nRecent runs for {result.source}:")
for run in runs:
    status = "PASS" if run.passed else "FAIL"
    print(f"  [{status}] {run.started_at:%Y-%m-%d %H:%M} - Score: {run.quality_score:.1f}%")

# Get list of tracked sources
sources = storage.get_sources()
print(f"\nTracked sources: {sources}")

In [None]:
# Analyze quality trends over time
analyzer = TrendAnalyzer(storage)
trend = analyzer.analyze(result.source, days=30)

print(f"\n{'='*60}")
print("QUALITY TREND ANALYSIS")
print(f"{'='*60}")
print(f"Trend: {trend.score_trend.upper()}")
print(f"Current Score: {trend.current_score:.1f}%")
print(f"Average Score: {trend.average_score:.1f}%")
print(f"Pass Rate: {trend.pass_rate:.1f}%")
print(f"Total Runs: {trend.total_runs}")
print(f"\n{trend.summary()}")

# Check for quality regression
if analyzer.has_regression(result.source):
    print("\nWARNING: Quality regression detected!")
else:
    print("\nNo quality regression detected.")

### CLI History Commands

```bash
# View recent validation history
duckguard history

# View history for a specific source
duckguard history data.csv

# View history for the last 7 days
duckguard history data.csv --last 7d

# Show trend analysis
duckguard history data.csv --trend

# Output as JSON
duckguard history --format json
```

### History Features

| Feature | Description |
|---------|-------------|
| **SQLite Storage** | Lightweight, file-based storage |
| **Trend Analysis** | Improving, declining, or stable trends |
| **Anomaly Detection** | Detect unusual quality score drops |
| **Pass Rate Tracking** | Track validation success over time |
| **Source Filtering** | Query by data source |
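
To illustrate the idea of SQLite-backed history with a simple regression check, here is a sketch using the standard library. The table layout, the sample runs, and the "10 points below average" rule are all assumptions for this example and are not DuckGuard's schema or regression logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (source TEXT, started_at TEXT, quality_score REAL, passed INTEGER)"
)
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        ("orders.csv", "2024-01-01T08:00:00", 92.0, 1),
        ("orders.csv", "2024-01-02T08:00:00", 90.0, 1),
        ("orders.csv", "2024-01-03T08:00:00", 71.0, 0),
    ],
)


def summarize(conn, source, drop_threshold=10.0):
    """Return (current_score, average_score, pass_rate, has_regression)."""
    scores = [
        row[0]
        for row in conn.execute(
            "SELECT quality_score FROM runs WHERE source = ? ORDER BY started_at",
            (source,),
        )
    ]
    pass_rate = conn.execute(
        "SELECT AVG(passed) * 100 FROM runs WHERE source = ?", (source,)
    ).fetchone()[0]
    average = sum(scores) / len(scores)
    current = scores[-1]
    # Hypothetical rule: flag a regression when the latest score falls
    # more than drop_threshold points below the historical average.
    return current, average, pass_rate, current < average - drop_threshold


current, average, pass_rate, regression = summarize(conn, "orders.csv")
print(f"current={current:.1f} average={average:.1f} pass_rate={pass_rate:.1f} regression={regression}")
```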

## 15. Airflow Integration (NEW in v2.1)

Use DuckGuard in Apache Airflow data pipelines with native operators.

### Using DuckGuard in Airflow DAGs

```python
from datetime import datetime

from airflow import DAG
from duckguard.integrations.airflow import DuckGuardOperator, DuckGuardSensor

with DAG(
    "data_quality_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
) as dag:
    
    # Validate data after loading
    validate_orders = DuckGuardOperator(
        task_id="validate_orders",
        source="s3://bucket/orders/{{ ds }}.parquet",
        config="duckguard.yaml",
        fail_on_error=True,      # Fail task if validation fails
        store_history=True,       # Store results in history
        notify_on_failure=True,   # Send Slack/Teams notification
    )
    
    # Wait for quality threshold to be met
    wait_for_quality = DuckGuardSensor(
        task_id="wait_for_quality",
        source="s3://bucket/orders/{{ ds }}.parquet",
        min_quality_score=80.0,   # Wait until score >= 80%
        timeout=3600,             # Timeout after 1 hour
        poke_interval=300,        # Check every 5 minutes
    )
    
    # Chain tasks
    validate_orders >> wait_for_quality
```

### Operator Features

| Feature | Description |
|---------|-------------|
| **Template Fields** | Use Airflow Jinja templating in source paths |
| **XCom Integration** | Returns quality score and results to XCom |
| **History Storage** | Automatically store results for trending |
| **Notifications** | Send Slack/Teams alerts on failure |
| **Fail on Error** | Configurable task failure behavior |
| **Quality Sensor** | Wait for quality thresholds to be met |

### Installation

```bash
pip install duckguard[airflow]
```

## 16. GitHub Action (NEW in v2.1)

Add data quality gates to your CI/CD pipeline with the DuckGuard GitHub Action.

### GitHub Actions Workflow

```yaml
# .github/workflows/data-quality.yml
name: Data Quality Check

on:
  push:
    paths:
      - 'data/**'
  pull_request:
    paths:
      - 'data/**'

jobs:
  quality-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Run DuckGuard Quality Check
        id: duckguard
        uses: XDataHubAI/duckguard/.github/actions/duckguard-check@main
        with:
          source: data/orders.csv
          config: duckguard.yaml
          fail-on-warning: false
          python-version: '3.11'
        
      - name: Check Results
        if: always()
        run: |
          echo "Quality Score: ${{ steps.duckguard.outputs.quality-score }}"
          echo "Grade: ${{ steps.duckguard.outputs.grade }}"
          echo "Passed: ${{ steps.duckguard.outputs.passed }}"
```

### Action Inputs

| Input | Description | Required | Default |
|-------|-------------|----------|---------|
| `source` | Data source path or URL | Yes | - |
| `config` | Path to duckguard.yaml | No | Auto-discover |
| `fail-on-warning` | Fail on warnings | No | `false` |
| `fail-on-error` | Fail on errors | No | `true` |
| `python-version` | Python version | No | `3.11` |

### Action Outputs

| Output | Description |
|--------|-------------|
| `passed` | Whether all checks passed |
| `quality-score` | Overall quality score (0-100) |
| `grade` | Letter grade (A, B, C, D, F) |
| `checks-total` | Total number of checks |
| `checks-passed` | Number of passed checks |
| `checks-failed` | Number of failed checks |

### Features

- Automatic GitHub Step Summary with formatted results
- Exit codes for CI/CD integration
- Caching of Python dependencies
- Works with any data source DuckGuard supports

### 17.6 Security & Performance Notes

**Security:**
- All SQL conditions are validated to prevent SQL injection
- Only SELECT queries allowed (no INSERT, UPDATE, DELETE, DROP, etc.)
- Query complexity scoring with configurable limits
- Automatic query timeout (30 seconds)
- Result set limits (10,000 rows)

**Performance:**
- Conditional checks use optimized SQL filtering
- Multi-column checks run in single pass
- Query-based checks leverage DuckDB's columnar engine
- Distributional tests cache statistics
- All checks benefit from DuckDB's parallel processing

**Best Practices:**
1. Start with basic checks, then add conditional/query-based
2. Use thresholds to allow acceptable error rates
3. Combine checks for comprehensive coverage
4. Monitor query complexity scores
5. Use distributional tests for ML feature validation
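
As a rough idea of what a "SELECT only" guard involves, the sketch below applies a deliberately conservative keyword check. It is not DuckGuard's validator: the keyword list and `is_read_only_select` are assumptions, it rejects some harmless queries (for example a string literal containing `delete`), and a production validator would parse the SQL instead of scanning tokens.

```python
import re

# Assumed deny-list for the example; not taken from DuckGuard.
FORBIDDEN = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "CREATE", "TRUNCATE", "ATTACH", "COPY"}


def is_read_only_select(sql):
    """Accept a single SELECT/WITH statement that contains no write keywords."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        return False  # more than one statement
    tokens = re.findall(r"[A-Za-z_]+", statement.upper())
    if not tokens or tokens[0] not in ("SELECT", "WITH"):
        return False
    return not FORBIDDEN.intersection(tokens)


print(is_read_only_select("SELECT COUNT(*) FROM orders WHERE total < 0"))
print(is_read_only_select("DROP TABLE orders"))
print(is_read_only_select("SELECT 1; DELETE FROM orders"))
```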

In [None]:
# Production-grade validation pipeline combining all 3.0 features

# Step 1: Conditional Validation
# High-value orders (>$1000) must have customer info
# high_value_check = orders.customer_id.not_null_when("total >= 1000")

# USA orders must have state
# usa_state_check = orders.state.not_null_when("country = 'USA'")

# Premium customers get higher discount limits
# premium_discount_check = orders.discount.between_when(
#     min_value=0,
#     max_value=200,
#     condition="customer_tier = 'premium'"
# )

# Step 2: Multi-Column Validation
# Verify order total calculation
# total_check = orders.expect_column_pair_satisfy(
#     column_a="total",
#     column_b="subtotal",
#     expression="ABS(total - (subtotal + tax + shipping - discount)) < 0.01",
#     threshold=1.0
# )

# Ship date on or after order date
# date_check = orders.expect_column_pair_satisfy(
#     column_a="ship_date",
#     column_b="order_date",
#     expression="ship_date >= order_date",
#     threshold=0.98  # Allow 2% data entry errors
# )

# Step 3: Query-Based Validation
# No completed orders with missing payments
# payment_check = orders.expect_query_to_return_no_rows(
#     query="""
#         SELECT * FROM table
#         WHERE status = 'completed'
#         AND (payment_status IS NULL OR payment_status != 'paid')
#     """
# )

# Daily order volume in expected range
# volume_check = orders.expect_query_result_to_be_between(
#     query="""
#         SELECT COUNT(*) FROM table
#         WHERE DATE(order_date) = CURRENT_DATE
#     """,
#     min_value=100,   # At least 100 orders/day
#     max_value=10000  # At most 10k orders/day
# )

# Average order value reasonable
# aov_check = orders.expect_query_result_to_be_between(
#     query="SELECT AVG(total) FROM table WHERE status = 'completed'",
#     min_value=25.0,
#     max_value=500.0
# )

# Step 4: Distributional Validation (if scipy available)
# Check if order amounts follow expected distribution
# try:
#     dist_check = orders.total.expect_ks_test(
#         distribution='expon',  # Order values often follow exponential
#         significance_level=0.05
#     )
# except ImportError:
#     print("scipy not available - skipping distributional checks")

print("Complete 3.0 Validation Pipeline:")
print("=" * 70)
print("\n1. CONDITIONAL CHECKS")
print("   ✓ High-value orders have customer info")
print("   ✓ USA orders have state")
print("   ✓ Premium discounts within limits")
print("\n2. MULTI-COLUMN CHECKS")
print("   ✓ Order totals calculated correctly")
print("   ✓ Ship dates after order dates")
print("\n3. QUERY-BASED CHECKS")
print("   ✓ No incomplete payment data")
print("   ✓ Daily volume in expected range")
print("   ✓ Average order value reasonable")
print("\n4. DISTRIBUTIONAL CHECKS")
print("   ✓ Order amounts follow expected distribution")
print("\n" + "=" * 70)
print("Together, these checks cover row-level, cross-column, aggregate, and distributional rules.")

### 17.5 Real-World Example: E-commerce Order Validation

Here's how all 3.0 features work together in a production scenario:

In [None]:
# Note: Distributional tests require scipy
# Install: pip install 'duckguard[statistics]'

# Example 1: Test if feature follows normal distribution
# result = data.age.expect_distribution_normal(significance_level=0.05)
# print(f"Age follows normal distribution: {result.passed}")
# print(f"P-value: {result.details['pvalue']:.4f}")
# print(f"Mean: {result.details['mean']:.2f}, Std: {result.details['std']:.2f}")

# Example 2: Test if feature follows uniform distribution
# result = data.random_id.expect_distribution_uniform(significance_level=0.05)
# print(f"Random ID uniformly distributed: {result.passed}")

# Example 3: General KS test for any scipy distribution
# result = data.response_time.expect_ks_test(
#     distribution='expon',  # Exponential distribution
#     significance_level=0.05
# )
# print(f"Response time follows exponential: {result.passed}")

# Example 4: Chi-square test for categorical distribution
# result = data.category.expect_chi_square_test(
#     expected_frequencies={
#         'A': 0.25,  # 25% category A
#         'B': 0.25,  # 25% category B
#         'C': 0.30,  # 30% category C
#         'D': 0.20   # 20% category D
#     },
#     significance_level=0.05
# )
# print(f"Category distribution matches expected: {result.passed}")

print("Distributional testing use cases:")
print("✓ Validate ML feature distributions haven't changed")
print("✓ Detect data drift in production pipelines")
print("✓ Ensure uniform distribution of random sampling")
print("✓ Verify categorical distributions match business rules")
print("✓ QA synthetic or generated data")
print("\nAvailable distributions:")
print("  - Normal (Gaussian)")
print("  - Uniform")
print("  - Exponential")
print("  - Any scipy.stats distribution")
print("\nNote: Requires scipy>=1.11.0")

### 17.4 Distributional Testing

Statistical distribution validation using Kolmogorov-Smirnov and chi-square tests. Perfect for ML feature validation and detecting data drift:
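
For the categorical case, the chi-square statistic itself is simple to compute by hand. The sketch below does so in plain Python for expected proportions like those used in this section; `chi_square_statistic` is a hypothetical helper, no p-value is computed (that would need scipy), and the comparison uses the standard 5% critical value for 3 degrees of freedom, 7.815.

```python
from collections import Counter


def chi_square_statistic(values, expected_frequencies):
    """Sum of (observed - expected)^2 / expected over the expected categories."""
    counts = Counter(values)
    n = len(values)
    # Categories absent from expected_frequencies are ignored in this sketch.
    return sum(
        (counts.get(category, 0) - n * p) ** 2 / (n * p)
        for category, p in expected_frequencies.items()
    )


expected = {"A": 0.25, "B": 0.25, "C": 0.30, "D": 0.20}
matching = ["A"] * 25 + ["B"] * 25 + ["C"] * 30 + ["D"] * 20
skewed = ["A"] * 40 + ["B"] * 20 + ["C"] * 25 + ["D"] * 15

CRITICAL_VALUE_DF3_ALPHA_05 = 7.815  # chi-square, 3 degrees of freedom, alpha = 0.05

for name, sample in [("matching", matching), ("skewed", skewed)]:
    stat = chi_square_statistic(sample, expected)
    print(f"{name}: statistic={stat:.3f}, reject={stat > CRITICAL_VALUE_DF3_ALPHA_05}")
```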

In [None]:
# Example 1: Find violations (query should return no rows)
# result = orders.expect_query_to_return_no_rows(
#     query="""
#         SELECT * FROM table
#         WHERE discount > subtotal
#     """
# )
# print(f"No excessive discounts: {result.passed}")

# Example 2: Ensure expected data exists
# result = orders.expect_query_to_return_rows(
#     query="""
#         SELECT * FROM table
#         WHERE status = 'completed' AND DATE(order_date) = CURRENT_DATE
#     """
# )
# print(f"Have today's completed orders: {result.passed}")

# Example 3: Verify metric equals expected value
# result = orders.expect_query_result_to_equal(
#     query="SELECT COUNT(*) FROM table WHERE status = 'pending'",
#     expected=0,
#     tolerance=5  # Allow +/- 5
# )
# print(f"Pending orders in range: {result.passed}")

# Example 4: Verify metric is within range
# result = orders.expect_query_result_to_be_between(
#     query="SELECT AVG(total) FROM table WHERE status = 'completed'",
#     min_value=50.0,
#     max_value=500.0
# )
# print(f"Average order value in expected range: {result.passed}")

# Example 5: Complex business metric
# result = orders.expect_query_result_to_be_between(
#     query="""
#         SELECT
#             (SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) * 100.0) /
#             COUNT(*) as completion_rate
#         FROM table
#     """,
#     min_value=80.0,  # At least 80% completion rate
#     max_value=100.0
# )
# print(f"Order completion rate acceptable: {result.passed}")

print("Query-based validation enables:")
print("✓ Finding violations with custom SQL")
print("✓ Validating aggregations and metrics")
print("✓ Checking data freshness and completeness")
print("✓ Complex multi-table validation")
print("\nNote: Use 'table' keyword to reference your dataset in queries")

### 17.3 Query-Based Checks

Write custom SQL queries to validate complex business logic. Perfect for aggregations, complex joins, and business metrics:
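
The two basic shapes of a query-based check (a single value within a range, and a violation query that must return nothing) can be sketched with `sqlite3`. The helpers and the `orders` table are hypothetical, and the queries name the table directly instead of using DuckGuard's `table` placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id TEXT, status TEXT, total REAL);
    INSERT INTO orders VALUES
        ('ORD-001', 'completed', 120.0),
        ('ORD-002', 'completed', 80.0),
        ('ORD-003', 'pending', 900.0);
""")


def query_result_between(conn, query, min_value, max_value):
    """Run a single-value query and test it against a closed range."""
    value = conn.execute(query).fetchone()[0]
    return value is not None and min_value <= value <= max_value, value


def query_returns_no_rows(conn, query):
    """A violation query passes when it yields no rows at all."""
    return conn.execute(query).fetchone() is None


aov_ok, aov = query_result_between(
    conn, "SELECT AVG(total) FROM orders WHERE status = 'completed'", 50.0, 500.0
)
no_negative_totals = query_returns_no_rows(conn, "SELECT * FROM orders WHERE total < 0")
print(aov_ok, aov, no_negative_totals)
```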

In [None]:
# Example 1: Validate total = subtotal + tax + shipping
# result = orders.expect_column_pair_satisfy(
#     column_a="total",
#     column_b="subtotal",
#     expression="ABS(total - (subtotal + tax + shipping)) < 0.01",
#     threshold=1.0  # 100% must pass
# )

# Example 2: Ship date must be on or after order date
# result = orders.expect_column_pair_satisfy(
#     column_a="ship_date",
#     column_b="order_date",
#     expression="ship_date >= order_date",
#     threshold=0.95  # Allow 5% exceptions
# )

# Example 3: Discount cannot exceed subtotal
# result = orders.expect_column_pair_satisfy(
#     column_a="discount",
#     column_b="subtotal",
#     expression="discount <= subtotal",
#     threshold=1.0
# )

# Example 4: Complex business rule
# result = orders.expect_column_pair_satisfy(
#     column_a="status",
#     column_b="payment_status",
#     expression="""
#         (status = 'shipped' AND payment_status = 'paid') OR
#         (status = 'pending' AND payment_status IN ('pending', 'paid')) OR
#         (status = 'cancelled')
#     """,
#     threshold=1.0
# )

print("Multi-column validation allows you to:")
print("✓ Validate mathematical relationships (total = sum of parts)")
print("✓ Check temporal consistency (end_date >= start_date)")
print("✓ Enforce business rules across multiple fields")
print("✓ Set thresholds for acceptable violation rates")

### 17.2 Multi-Column Validation

Express complex relationships between columns using SQL expressions:
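
The role of `threshold` in a column-pair check is easiest to see with a small plain-Python sketch: the check passes when the fraction of rows satisfying the relationship reaches the threshold. `pair_check` is a hypothetical helper that takes a Python predicate, whereas DuckGuard takes a SQL expression.

```python
def pair_check(rows, predicate, threshold=1.0):
    """Return (passed, pass_rate) for a row-level relationship between columns."""
    if not rows:
        return True, 1.0
    pass_rate = sum(1 for row in rows if predicate(row)) / len(rows)
    return pass_rate >= threshold, pass_rate


orders_sample = [
    {"order_date": "2024-01-01", "ship_date": "2024-01-03"},
    {"order_date": "2024-01-02", "ship_date": "2024-01-02"},
    {"order_date": "2024-01-05", "ship_date": "2024-01-04"},  # shipped before ordered
    {"order_date": "2024-01-06", "ship_date": "2024-01-09"},
]


def shipped_on_or_after_order(row):
    # ISO-8601 date strings compare correctly as plain strings.
    return row["ship_date"] >= row["order_date"]


print(pair_check(orders_sample, shipped_on_or_after_order, threshold=0.95))
print(pair_check(orders_sample, shipped_on_or_after_order, threshold=0.75))
```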

In [None]:
# Example: Email required for subscribed customers
# orders.email.not_null_when("subscription_status = 'active'")

# Example: State required for USA addresses
# orders.state.not_null_when("country = 'USA'")

# Example: Discount validation based on customer tier
# orders.discount.between_when(
#     min_value=0,
#     max_value=50,
#     condition="customer_tier = 'premium'"
# )

# Example: Unique order IDs for completed orders
# orders.order_id.unique_when("status = 'completed'")

# Example: Pattern matching for specific categories
# orders.sku.pattern_when(
#     pattern=r'^ELEC-\d{6}$',
#     condition="category = 'electronics'"
# )

print("Conditional validation examples:")
print("✓ not_null_when(condition) - Column must not be null when condition is true")
print("✓ unique_when(condition) - Column must be unique when condition is true")
print("✓ between_when(min, max, condition) - Column in range when condition is true")
print("✓ isin_when(values, condition) - Column in list when condition is true")
print("✓ pattern_when(regex, condition) - Column matches pattern when condition is true")

### 17.1 Conditional Expectations

Validate columns only on rows where a specific condition holds. This suits business rules that apply to a subset of the data.
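To make the semantics concrete, a conditional not-null check can be thought of as counting rows where the condition holds and the column is NULL. The sketch below uses the standard-library `sqlite3` module. The `not_null_when_violations` helper is hypothetical and is not DuckGuard's API.

```python
import sqlite3

def not_null_when_violations(conn, table, column, condition):
    """Count rows where `condition` is true but `column` is NULL.

    Illustration only: identifiers and the condition are interpolated
    into the query, so pass trusted strings.
    """
    sql = f"SELECT COUNT(*) FROM {table} WHERE ({condition}) AND {column} IS NULL"
    return conn.execute(sql).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("USA", "CA"),
        ("USA", None),  # violation: USA address without a state
        ("DE", None),   # fine: the condition does not apply
        ("USA", "NY"),
    ],
)

violations = not_null_when_violations(conn, "orders", "state", "country = 'USA'")
print(f"violations: {violations}")
```

Rows that do not meet the condition are ignored entirely, which is what separates a conditional check from a plain not-null check.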

## 17. DuckGuard 3.0 Features - Advanced Validation

DuckGuard 3.0 adds four validation capabilities for complex data quality scenarios:

1. **Conditional Expectations** - Validate only when conditions are met
2. **Multi-Column Validation** - Express relationships between columns  
3. **Query-Based Checks** - Custom SQL for complex business logic
4. **Distributional Testing** - Statistical distribution validation

## 13. Enhanced Error Messages (NEW in v2.1)

DuckGuard v2.1 provides helpful error messages with suggestions, documentation links, and context.

In [None]:
# Enhanced error classes with helpful suggestions
from duckguard.errors import (
    ColumnNotFoundError,
    UnsupportedConnectorError,
    ValidationError,
)

# Example: Column not found error with suggestions
try:
    # Simulate accessing a non-existent column
    raise ColumnNotFoundError(
        column="order",
        available_columns=["order_id", "customer_id", "total_amount", "status"]
    )
except ColumnNotFoundError as e:
    print("ColumnNotFoundError:")
    print("-" * 60)
    print(str(e))
    print()
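One common way to produce "did you mean" suggestions for a missing column is fuzzy string matching. The sketch below uses the standard-library `difflib` module. It only illustrates the idea: how DuckGuard actually ranks suggestions is not shown in this guide, and `suggest_columns` is a hypothetical helper.

```python
from difflib import get_close_matches

def suggest_columns(missing, available, limit=3):
    """Return up to `limit` available column names that resemble `missing`."""
    return get_close_matches(missing, available, n=limit, cutoff=0.5)

available = ["order_id", "customer_id", "total_amount", "status"]
suggestions = suggest_columns("order", available)
print(f"Column 'order' not found. Did you mean: {suggestions}?")
```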

In [None]:
# Validation error with failed rows and context
try:
    raise ValidationError(
        check_name="between",
        column="quantity",
        actual_value=5,
        expected_value="[1, 100]",
        failed_rows=[150, 200, 300, 400, 500]
    )
except ValidationError as e:
    print("ValidationError:")
    print("-" * 60)
    print(str(e))
    print()

# Unsupported connector error with format suggestions
try:
    raise UnsupportedConnectorError(source="data.xyz")
except UnsupportedConnectorError as e:
    print("UnsupportedConnectorError:")
    print("-" * 60)
    print(str(e))

## 14. Auto-Profiling

Let DuckGuard analyze your data and suggest validation rules.

In [None]:
from duckguard.profiler import AutoProfiler

# Profile the dataset
profiler = AutoProfiler(dataset_var_name="orders")
profile_result = profiler.profile(orders)

print(f"Profiled: {profile_result.source}")
print(f"Rows: {profile_result.row_count}")
print(f"Columns: {profile_result.column_count}")
print(f"\nSuggested Rules ({len(profile_result.suggested_rules)}):")
print("-" * 60)
for rule in profile_result.suggested_rules[:10]:  # Show first 10
    print(rule)
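As a rough intuition for how rules can be suggested from a profile, the sketch below derives rule strings from a single column's values, using the rule wording from the YAML syntax in this guide. It is a simplified, hypothetical helper (`suggest_rules`), not `AutoProfiler`'s actual logic.

```python
def suggest_rules(name, values):
    """Suggest simple rule strings for one column based on its values."""
    rules = []
    non_null = [v for v in values if v is not None]

    # No nulls observed: suggest a not-null rule.
    if len(non_null) == len(values):
        rules.append(f"{name} is not null")

    # All observed values distinct: suggest a uniqueness rule.
    if non_null and len(set(non_null)) == len(non_null):
        rules.append(f"{name} is unique")

    # Purely numeric column: suggest the observed range.
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    if non_null and is_numeric:
        rules.append(f"{name} between {min(non_null)} and {max(non_null)}")

    return rules

quantity_rules = suggest_rules("quantity", [1, 5, 3, 5])
order_id_rules = suggest_rules("order_id", ["A1", "A2", "A3"])
print(quantity_rules)
print(order_id_rules)
```

Suggested rules only describe the data that was profiled, so review them before treating them as requirements.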

## 15. Using with pytest

DuckGuard checks run as ordinary pytest tests. Create a test file:

```python
# test_data_quality.py
import pytest
from duckguard import connect, load_rules, execute_rules, validate_contract, load_contract

@pytest.fixture
def orders():
    return connect("data/orders.csv")

# Test with YAML rules
def test_yaml_rules(orders):
    rules = load_rules("duckguard.yaml")
    result = execute_rules(rules, orders)
    assert result.failed == 0, f"Failed checks: {result.failed}"

# Test with data contract
def test_contract(orders):
    contract = load_contract("contract.yaml")
    result = validate_contract(contract, orders)
    assert result.is_valid, f"Contract violations: {result.errors}"

# Traditional assertion tests
def test_orders_not_empty(orders):
    assert orders.row_count > 0

def test_order_ids_valid(orders):
    assert orders.order_id.null_percent == 0
    assert orders.order_id.has_no_duplicates()

def test_quality_score(orders):
    score = orders.score()
    assert score.overall >= 80, f"Quality score too low: {score.overall}"

# NEW: Test with row-level error details
def test_quantity_range(orders):
    result = orders.quantity.between(1, 100)
    if not result.passed:
        # Get detailed failure info for debugging
        print(result.summary())
    assert result.passed, f"Found {result.total_failures} values out of range"
```

Run with: `pytest test_data_quality.py -v`

## 19. CLI Commands

DuckGuard's CLI formats its output with Rich. The most common commands:

```bash
# Quick check with auto-generated rules
duckguard check data/orders.csv

# Check with YAML rules file
duckguard check data/orders.csv --config duckguard.yaml

# Discover data and generate rules
duckguard discover data/orders.csv
duckguard discover data/orders.csv --output duckguard.yaml

# Generate data contract
duckguard contract generate data/orders.csv
duckguard contract generate data/orders.csv --output contract.yaml --owner "data-team"

# Validate against contract
duckguard contract validate data/orders.csv --contract contract.yaml

# Compare contracts for breaking changes
duckguard contract diff old_contract.yaml new_contract.yaml

# Detect anomalies
duckguard anomaly data/orders.csv
duckguard anomaly data/orders.csv --method iqr --threshold 1.5

# ML-based anomaly detection (NEW in v2.2)
duckguard anomaly data/orders.csv --learn-baseline    # Learn baseline
duckguard anomaly data/orders.csv --method baseline   # Compare to baseline
duckguard anomaly data/orders.csv --method ks_test    # Distribution drift

# Generate reports (NEW in v2.1)
duckguard report data/orders.csv                           # HTML report
duckguard report data/orders.csv --format pdf              # PDF report
duckguard report data/orders.csv --title "Daily Report"    # Custom title

# View validation history (NEW in v2.1)
duckguard history                                          # All recent runs
duckguard history data/orders.csv --last 7d               # Last 7 days
duckguard history data/orders.csv --trend                  # Trend analysis

# Freshness monitoring (NEW in v2.2)
duckguard freshness data/orders.csv                        # Check via file mtime
duckguard freshness data/orders.csv --column updated_at    # Check via column
duckguard freshness data/orders.csv --max-age 6h           # Custom threshold

# Schema evolution tracking (NEW in v2.2)
duckguard schema data/orders.csv --action show             # Show current schema
duckguard schema data/orders.csv --action capture          # Capture snapshot
duckguard schema data/orders.csv --action history          # View schema history
duckguard schema data/orders.csv --action changes          # Detect changes

# Show version and info
duckguard info
```

## 20. Quick Reference

### YAML Rule Syntax

```yaml
dataset: my_data
rules:
  # Table-level
  - row_count > 0
  - row_count < 1000000
  
  # Column nulls
  - column_name is not null
  - column_name null_percent < 5
  
  # Uniqueness
  - column_name is unique
  - column_name unique_percent > 95
  
  # Ranges
  - column_name >= 0
  - column_name between 0 and 100
  
  # Sets
  - column_name in ['a', 'b', 'c']
  
  # Patterns
  - column_name matches '^[A-Z]{3}$'
```

### Reference/FK Checks & Cross-Dataset Validation (v2.2)

```python
from duckguard import connect

orders = connect("orders.parquet")
customers = connect("customers.parquet")
status_lookup = connect("status_codes.csv")

# Check FK relationship - all values exist in reference
result = orders["customer_id"].exists_in(customers["id"])

# FK check with null handling options
result = orders["customer_id"].references(customers["id"], allow_nulls=True)

# Get list of orphan values
orphans = orders["customer_id"].find_orphans(customers["id"])

# Compare value sets between columns
result = orders["status"].matches_values(status_lookup["code"])

# Compare row counts between datasets
backup_orders = connect("orders_backup.parquet")  # hypothetical backup copy
result = orders.row_count_matches(backup_orders)
result = orders.row_count_matches(backup_orders, tolerance=10)
```
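Conceptually, an orphan check is a set difference between the child column and the reference column. The plain-Python sketch below shows that computation on in-memory lists. The `find_orphans` function here is a stand-alone illustration and is not the DuckGuard method of the same name.

```python
def find_orphans(child_values, parent_values):
    """Return distinct non-null child values with no match in the parent."""
    parents = set(parent_values)
    return sorted({v for v in child_values if v is not None and v not in parents})

order_customer_ids = [1, 2, 2, 7, None, 9]
customer_ids = [1, 2, 3, 4]

orphans = find_orphans(order_customer_ids, customer_ids)
print(f"orphans: {orphans}")
```

This sketch skips null values, which corresponds to allowing nulls in the foreign key column. A stricter check would report them as failures.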

### Reconciliation (v2.2)

```python
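from duckguard import connect

# Hypothetical source and target files for a migration check
source = connect("source_orders.parquet")
target = connect("target_orders.parquet")

# Compare two datasets row-by-row using key columns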
result = source.reconcile(
    target,
    key_columns=["order_id"],
    compare_columns=["amount", "status"],
    tolerance=0.01,  # Numeric tolerance
    sample_mismatches=10  # Number of sample mismatches to capture
)

print(f"Match: {result.match_percentage}%")
print(f"Missing in target: {result.missing_in_target}")
print(f"Extra in target: {result.extra_in_target}")
print(f"Value mismatches: {result.value_mismatches}")
print(result.summary())
```
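The following sketch shows the general shape of a key-based reconciliation with a numeric tolerance, using plain dictionaries in place of datasets. It is a hypothetical stand-alone helper meant to illustrate the three result categories (missing, extra, mismatched). DuckGuard's `reconcile` may compute them differently.

```python
import math

def reconcile(source, target, compare_columns, tolerance=0.0):
    """Compare two {key: row_dict} mappings.

    Returns (missing_in_target, extra_in_target, value_mismatches).
    """
    missing = sorted(source.keys() - target.keys())
    extra = sorted(target.keys() - source.keys())
    mismatches = []
    for key in sorted(source.keys() & target.keys()):
        for col in compare_columns:
            a, b = source[key][col], target[key][col]
            numeric = isinstance(a, (int, float)) and isinstance(b, (int, float))
            if numeric:
                # Numeric values match when they differ by at most `tolerance`.
                equal = math.isclose(a, b, rel_tol=0.0, abs_tol=tolerance)
            else:
                equal = a == b
            if not equal:
                mismatches.append((key, col, a, b))
    return missing, extra, mismatches

source = {
    1: {"amount": 10.00, "status": "paid"},
    2: {"amount": 20.00, "status": "paid"},
    3: {"amount": 5.00, "status": "open"},
}
target = {
    1: {"amount": 10.005, "status": "paid"},  # within tolerance
    2: {"amount": 20.50, "status": "paid"},   # amount mismatch
    4: {"amount": 7.00, "status": "open"},    # only in target
}

missing, extra, mismatches = reconcile(
    source, target, compare_columns=["amount", "status"], tolerance=0.01
)
print(missing, extra, mismatches)
```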

### Distribution Drift Detection (v2.2)

```python
from duckguard import connect

# Detect distribution drift using KS-test
baseline = connect("baseline.csv")
current = connect("current.csv")

result = baseline["amount"].detect_drift(current["amount"])

print(f"Drift detected: {result.is_drifted}")
print(f"P-value: {result.p_value:.4f}")
print(f"KS statistic: {result.statistic:.4f}")
print(f"Threshold: {result.threshold}")
print(result.summary())
```
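For intuition, the two-sample KS statistic is the largest gap between the two samples' empirical CDFs. The pure-Python sketch below computes it and compares it with the standard large-sample critical value instead of an exact p-value. This is an approximation written for illustration and is not DuckGuard's implementation.

```python
import math
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def critical_value(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample KS test."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))

baseline = list(range(100))
shifted = [x + 50 for x in range(100)]  # same shape, shifted by 50

d = ks_statistic(baseline, shifted)
limit = critical_value(len(baseline), len(shifted))
drifted = d > limit
print(f"KS statistic: {d:.3f}, critical value: {limit:.3f}, drift: {drifted}")
```

Comparing the statistic with a critical value at `alpha=0.05` plays the same role as comparing a p-value with a 0.05 threshold.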

### Group By Checks (v2.2)

```python
from duckguard import connect

# Run validation checks on data segments
orders = connect("orders.csv")

# Group by region and validate
grouped = orders.group_by("region")
print(f"Groups: {grouped.groups}")
print(f"Stats: {grouped.stats()}")

# Validate row counts per group
result = orders.group_by("region").row_count_greater_than(10)
print(f"Passed: {result.passed}")
print(f"Passed groups: {result.passed_groups}/{result.total_groups}")

# Get failed groups for debugging
for g in result.get_failed_groups():
    print(f"  {g.group_key}: {g.row_count} rows")
```
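A per-group row count check amounts to counting rows per key and flagging the groups that fall short. The sketch below does this with `collections.Counter` on in-memory rows. The `groups_failing_min_rows` helper is hypothetical and only mirrors the idea behind `row_count_greater_than`.

```python
from collections import Counter

def groups_failing_min_rows(rows, key, min_rows):
    """Return {group: count} for groups whose row count is not above `min_rows`."""
    counts = Counter(row[key] for row in rows)
    return {group: n for group, n in counts.items() if n <= min_rows}

rows = (
    [{"region": "EU"}] * 3
    + [{"region": "US"}] * 1
    + [{"region": "APAC"}] * 2
)

failed = groups_failing_min_rows(rows, key="region", min_rows=1)
print(f"failed groups: {failed}")
```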

### Freshness Monitoring (v2.2)

```python
from duckguard.freshness import FreshnessMonitor
from datetime import timedelta

# Quick check via property
print(orders.freshness.age_human)  # "2 hours ago"
print(orders.freshness.is_fresh)   # True

# Custom threshold
if not orders.is_fresh(timedelta(hours=6)):
    print("Data is stale!")

# Column-based freshness
monitor = FreshnessMonitor(threshold=timedelta(hours=1))
result = monitor.check_column_timestamp(orders, "updated_at")
```
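A file-based freshness check boils down to comparing the file's modification time with a maximum age. The sketch below shows that comparison using only the standard library. The `file_age` and `is_fresh` helpers are hypothetical stand-ins and do not reflect how `FreshnessMonitor` is implemented.

```python
import os
import tempfile
from datetime import datetime, timedelta, timezone

def file_age(path, now=None):
    """Time elapsed since `path` was last modified."""
    now = now or datetime.now(timezone.utc)
    modified = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return now - modified

def is_fresh(path, max_age, now=None):
    """True if the file was modified within `max_age`."""
    return file_age(path, now) <= max_age

# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"order_id\n1\n")
path = tmp.name

fresh_now = is_fresh(path, timedelta(hours=6))
# Simulate checking the same file seven hours from now.
later = datetime.now(timezone.utc) + timedelta(hours=7)
fresh_later = is_fresh(path, timedelta(hours=6), now=later)
os.remove(path)

print(f"fresh now: {fresh_now}, fresh in 7 hours: {fresh_later}")
```

Column-based freshness follows the same comparison, with the newest timestamp in the column taking the place of the file's mtime.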

### ML-Based Anomaly Detection (v2.2)

```python
from duckguard.anomaly import BaselineMethod, KSTestMethod

# Learn baseline and detect anomalies
baseline = BaselineMethod(sensitivity=2.0)
baseline.fit(orders.amount)
scores = baseline.score(orders.amount)

# Distribution drift detection
ks = KSTestMethod(p_value_threshold=0.05)
result = ks.compare_distributions(orders.amount)
print(f"Drift detected: {result.is_drift}")
```

### Schema Evolution (v2.2)

```python
from duckguard.schema_history import SchemaTracker, SchemaChangeAnalyzer

tracker = SchemaTracker()
snapshot = tracker.capture(orders)

analyzer = SchemaChangeAnalyzer()
report = analyzer.detect_changes(orders)
if report.has_breaking_changes:
    print("Breaking changes detected!")
```

### Email Notifications (v2.2)

```python
from duckguard.notifications import EmailNotifier

email = EmailNotifier(
    smtp_host="smtp.gmail.com",
    smtp_user="alerts@company.com",
    smtp_password="app_password",
    to_addresses=["team@company.com"],
)
# Or set DUCKGUARD_EMAIL_CONFIG env var

if not result.passed:
    email.send_failure_alert(result)
```

### Row-Level Error Debugging (v2.1)

```python
result = orders.quantity.between(1, 100)
if not result.passed:
    print(result.summary())           # Human-readable summary
    print(result.get_failed_values()) # [150, 200, ...]
    print(result.get_failed_row_indices())  # [5, 12, ...]
    for row in result.failed_rows:
        print(f"Row {row.row_index}: {row.value}")
```

### Notifications (v2.1)

```python
from duckguard import execute_rules
from duckguard.notifications import SlackNotifier, TeamsNotifier

slack = SlackNotifier(webhook_url="...")  # or DUCKGUARD_SLACK_WEBHOOK
teams = TeamsNotifier(webhook_url="...")  # or DUCKGUARD_TEAMS_WEBHOOK

result = execute_rules(rules, dataset=orders)
if not result.passed:
    slack.send_failure_alert(result)
    teams.send_failure_alert(result)
```

### dbt Integration (v2.1)

```python
from duckguard.integrations import dbt

dbt.export_to_schema(rules, "models/schema.yml")
dbt.generate_singular_tests(rules, "tests/")
rules = dbt.import_from_dbt("models/schema.yml")
```

### HTML/PDF Reports (v2.1)

```python
from duckguard.reports import generate_html_report, generate_pdf_report

generate_html_report(result, "report.html", title="Quality Report")
generate_pdf_report(result, "report.pdf")  # requires weasyprint
```

### Historical Tracking (v2.1)

```python
from duckguard.history import HistoryStorage, TrendAnalyzer

storage = HistoryStorage()
run_id = storage.store(result)

analyzer = TrendAnalyzer(storage)
trend = analyzer.analyze("data.csv", days=30)
print(trend.summary())
```

### Airflow Integration (v2.1)

```python
from duckguard.integrations.airflow import DuckGuardOperator

validate = DuckGuardOperator(
    task_id="validate",
    source="s3://bucket/data.parquet",
    config="duckguard.yaml",
    fail_on_error=True,
)
```

### Semantic Types Detected

- `email`, `phone`, `url`, `ip_address`
- `uuid`, `credit_card`, `iban`
- `ssn`, `date_of_birth` (PII)
- `country`, `state`, `zip_code`
- `latitude`, `longitude`
- `timestamp`, `currency`, `percentage`
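
Semantic types like these are often detected by testing a sample of values against per-type validators and picking the first type that matches most of them. The sketch below illustrates that approach for three of the types listed. The patterns, the 90% match ratio, and the `detect_semantic_type` helper are all assumptions for illustration, not DuckGuard's detector.

```python
import ipaddress
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)

def _is_ip(value):
    """True if `value` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False

# Validators are tried in insertion order.
CHECKS = {
    "email": lambda v: bool(EMAIL_RE.match(v)),
    "uuid": lambda v: bool(UUID_RE.match(v)),
    "ip_address": _is_ip,
}

def detect_semantic_type(values, min_match=0.9):
    """Return the first type matching at least `min_match` of non-empty values."""
    sample = [v for v in values if v]
    if not sample:
        return None
    for name, check in CHECKS.items():
        ratio = sum(check(v) for v in sample) / len(sample)
        if ratio >= min_match:
            return name
    return None

print(detect_semantic_type(["a@example.com", "b@example.org", None]))
print(detect_semantic_type(["10.0.0.1", "192.168.1.20"]))
print(detect_semantic_type(["hello", "world"]))
```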

### Contract Validation

- Schema: column names, types, nullability
- Quality: completeness, null %, custom rules
- Breaking changes: removed columns, type changes, nullability
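
The breaking-change categories above can be sketched as a comparison of two schema descriptions. In the example below each schema is a plain dict mapping a column name to `(type, nullable)`. That representation is an assumption, as is treating "not null became nullable" as the breaking direction, and `breaking_changes` is a hypothetical helper, not DuckGuard's contract diff.

```python
def breaking_changes(old, new):
    """List breaking changes between two {column: (type, nullable)} schemas."""
    changes = []
    for column, (old_type, old_nullable) in old.items():
        if column not in new:
            changes.append(f"removed column: {column}")
            continue
        new_type, new_nullable = new[column]
        if new_type != old_type:
            changes.append(f"type change: {column} {old_type} -> {new_type}")
        # Assumption: consumers relying on NOT NULL break when nulls appear.
        if new_nullable and not old_nullable:
            changes.append(f"nullability: {column} became nullable")
    return changes

old_schema = {
    "order_id": ("INTEGER", False),
    "amount": ("DOUBLE", False),
    "status": ("VARCHAR", False),
}
new_schema = {
    "order_id": ("VARCHAR", False),  # type changed
    "amount": ("DOUBLE", True),      # became nullable
    "note": ("VARCHAR", True),       # added column: not breaking
}

changes = breaking_changes(old_schema, new_schema)
for change in changes:
    print(change)
```

Added columns are ignored here because existing consumers can usually keep working without reading them.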

### Anomaly Detection Methods

| Method | Threshold | Use Case |
|--------|-----------|----------|
| `zscore` | 3.0 (std devs) | Normal data |
| `iqr` | 1.5 (IQR multiplier) | Outlier-robust |
| `percent_change` | 0.2 (20%) | Time-series monitoring |
| `modified_zscore` | 3.5 | Non-normal distributions |
| `baseline` | 2.0 (sensitivity) | Learn from history |
| `ks_test` | 0.05 (p-value) | Distribution drift |
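
To show what the `zscore` and `iqr` thresholds in the table mean, here is a minimal standard-library sketch of both methods. The function names and sample data are illustrative, and details such as population versus sample standard deviation or the quantile method are assumptions that may differ from DuckGuard's detectors.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

def iqr_outliers(values, multiplier=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR], where k is `multiplier`."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < low or v > high]

values = [10, 11, 12, 13] * 5 + [100]  # 20 typical values and one spike

print(zscore_outliers(values))
print(iqr_outliers(values))
```

Both methods flag the spike here. On small samples the z-score method can miss a lone extreme value because that value inflates the standard deviation, which is one reason the IQR method is described as outlier-robust.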

### Cross-Dataset Validation Methods

| Method | Description | Use Case |
|--------|-------------|----------|
| `col.exists_in(other_col)` | Check values exist in reference | FK validation |
| `col.references(other_col)` | FK check with null handling | Optional/Required FK |
| `col.find_orphans(other_col)` | Get orphan values | Debugging |
| `col.matches_values(other_col)` | Compare value sets | Lookup validation |
| `dataset.row_count_matches(other)` | Compare row counts | Backup validation |
| `dataset.reconcile(target, key_columns)` | Row-by-row comparison | Migration validation |
| `col.detect_drift(other_col)` | Distribution drift (KS-test) | ML monitoring |
| `dataset.group_by(col).row_count_greater_than(n)` | Per-group validation | Segmented checks |

## 21. Next Steps

- **Documentation**: https://duckguard.dev
- **GitHub**: https://github.com/XDataHubAI/duckguard
- **Issues**: https://github.com/XDataHubAI/duckguard/issues

### What to explore next:
1. Generate a `duckguard.yaml` file for your data with `duckguard discover`
2. Create a data contract with `duckguard contract generate`
3. Set up anomaly monitoring with `duckguard anomaly`
4. Add rules to your CI/CD pipeline with pytest
5. Detect PII with semantic type detection
6. Set up Slack/Teams/Email alerts for data quality failures
7. Export your rules to dbt with `dbt.export_to_schema()`
8. Use row-level error capture for debugging failed validations
9. Generate HTML/PDF reports with `duckguard report`
10. Track quality trends with `duckguard history --trend`
11. Add DuckGuard to your Airflow DAGs
12. Set up GitHub Actions for CI/CD quality gates
13. **NEW**: Monitor data freshness with `duckguard freshness`
14. **NEW**: Learn baselines for ML-based anomaly detection
15. **NEW**: Track schema changes with `duckguard schema`
16. **NEW**: Set up email notifications for alerts
17. **NEW**: Validate FK relationships with `exists_in()` and `references()`
18. **NEW**: Compare datasets with `row_count_matches()` and `matches_values()`