# Notebook 05: Statistical Validation
**Project:** Synthetic Sleep Environment Dataset Generator  
**Authors:** Rushav Dash & Lisa Li  
**Course:** TECHIN 513 — Signal Processing & Machine Learning  
**University:** University of Washington  
**Date:** 2026-02-19

## Table of Contents
1. [Setup & Load Data](#section-1)
2. [Tier 1: Statistical Tests (KS-tests)](#section-2)
3. [Tier 2: ML Cross-Dataset Validation](#section-3)
4. [Tier 3: Sleep Science Sanity Checks](#section-4)
5. [Overall Validation Summary](#section-5)
6. [Distribution Visualizations](#section-6)

---
## 1. Setup & Load Data <a id='section-1'></a>

Load the generated synthetic dataset and the real reference datasets needed for comparison.

In [None]:
import os
import sys
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..'))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from src.data_loader import DataLoader
from src.validator import Validator

%matplotlib inline
plt.rcParams.update({'figure.dpi': 120, 'font.size': 11})
sns.set_theme(style='whitegrid')
print('Setup complete.')

In [None]:
# Load synthetic dataset
csv_path = os.path.join(PROJECT_ROOT, 'data', 'output', 'synthetic_sleep_dataset_5000.csv')
df_syn = pd.read_csv(csv_path)
print(f'Synthetic dataset: {df_syn.shape[0]:,} rows x {df_syn.shape[1]} columns')
df_syn.head(3)

In [None]:
# Load real datasets for comparison
loader = DataLoader(verbose=True)
try:
    loader.download_all()
    df_sleep = loader.load_sleep_efficiency()
    df_occ   = loader.load_room_occupancy()
    print(f'Real sleep data: {df_sleep.shape}')
    print(f'Real IoT data:   {df_occ.shape}')
except Exception as e:
    print(f'Warning: could not load real data ({e}). Tier 1 and Tier 2 will be skipped.')
    df_sleep = None
    df_occ = None

---
## 2. Tier 1: Statistical Tests (KS-tests) <a id='section-2'></a>

The two-sample Kolmogorov-Smirnov test compares each synthetic feature distribution
against the corresponding real sensor distribution.

**Null hypothesis H₀:** the two samples come from the same underlying distribution.

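**Target:** p-value > 0.05 → fail to reject H₀ → no statistically significant difference between the two distributions at the 5% level. Note that this is an absence of evidence for a difference, not proof that the distributions are identical.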
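
To make the test concrete, here is a minimal from-scratch sketch of what a two-sample KS test computes: the statistic D is the largest gap between the two empirical CDFs, and the p-value comes from the large-sample Kolmogorov approximation. The function name `ks_2samp_sketch` and the toy temperature samples are illustrative assumptions, not part of this project; in practice `scipy.stats.ks_2samp` is the usual tool, and how `Validator` computes its result is not shown in this notebook.

```python
import bisect
import math
import random


def ks_2samp_sketch(a, b):
    """Return (D, approximate two-sided p-value) for two 1-D samples."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # D: largest vertical gap between the two empirical CDFs,
    # checked at every observed value.
    d = max(
        abs(bisect.bisect_right(a, x) / n - bisect.bisect_right(b, x) / m)
        for x in a + b
    )
    # Large-sample Kolmogorov approximation for the p-value.
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        return d, 1.0  # the series below is unstable near zero; p is ~1 here
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(1.0, max(0.0, 2.0 * series))


rng = random.Random(0)
real = [rng.gauss(21.0, 1.5) for _ in range(500)]     # made-up "real" temperatures
shifted = [rng.gauss(23.0, 1.5) for _ in range(500)]  # made-up sample with a 2 °C bias

d_same, p_same = ks_2samp_sketch(real, real)
d_shift, p_shift = ks_2samp_sketch(real, shifted)
print(f"identical samples: D={d_same:.3f}, p={p_same:.3f}")
print(f"shifted sample:    D={d_shift:.3f}, p={p_shift:.3g}")
```

The shifted sample is rejected at the 5% level, while comparing a sample with itself gives D = 0. With large samples even small, practically irrelevant shifts can push the p-value below 0.05, so D itself is worth reading alongside the pass/fail flag.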

In [None]:
validator = Validator(df_syn, real_occupancy_df=df_occ, real_sleep_df=df_sleep)
tier1_results = validator.tier1_statistical()

print('=== TIER 1: KS-test Results ===')
for name, res in tier1_results.items():
    if '_warning' in name:
        print(f'  WARNING: {res.get("message", "")}')
    elif isinstance(res.get('pass'), bool):
        status = 'PASS' if res['pass'] else 'FAIL'
        print(f'  [{status}] {name}')
        print(f'          KS stat={res["statistic"]:.4f}  p={res["p_value"]:.4f}')
        if 'syn_mean' in res:
            print(f'          syn_mean={res["syn_mean"]:.3f}  real_mean={res["real_mean"]:.3f}')

### 2.1 KS-test Interpretation Table

In [None]:
rows = []
for name, res in tier1_results.items():
    if isinstance(res.get('pass'), bool):
        rows.append({
            'Test': name,
            'KS Statistic': round(res.get('statistic', float('nan')), 4),
            'p-value': round(res.get('p_value', float('nan')), 4),
            'Result': 'PASS' if res['pass'] else 'FAIL',
            'Syn Mean': round(res.get('syn_mean', float('nan')), 3),
            'Real Mean': round(res.get('real_mean', float('nan')), 3),
        })

if rows:
    df_tier1 = pd.DataFrame(rows).set_index('Test')
    display(df_tier1)
else:
    print('No KS-test results available (real data may not be loaded).')

---
## 3. Tier 2: ML Cross-Dataset Validation <a id='section-3'></a>

Train a linear regression model on the synthetic dataset, then evaluate how well it transfers to the real data.

**Target:** RMSE of the synthetic-trained model ≤ 120% of the real-data baseline RMSE.
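
The 120% criterion can be sketched on toy data as follows. This is a hedged illustration, not the logic inside `Validator.tier2_ml_validation()`: it assumes the baseline is a model trained on a real training split and scored on the same held-out real rows, and it uses a single made-up feature with a hand-rolled least-squares fit. The helpers `fit_line`, `predict`, `rmse`, and `make_data` are hypothetical.

```python
import math
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict(model, x):
    slope, intercept = model
    return [slope * xi + intercept for xi in x]


def rmse(y_true, y_pred):
    return math.sqrt(statistics.fmean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def make_data(rng, n):
    """Made-up linear relationship plus noise, for illustration only."""
    x = [rng.uniform(16.0, 28.0) for _ in range(n)]
    y = [0.95 - 0.01 * xi + rng.gauss(0.0, 0.03) for xi in x]
    return x, y


rng = random.Random(42)
x_syn, y_syn = make_data(rng, 500)    # stands in for the synthetic dataset
x_real, y_real = make_data(rng, 500)  # stands in for the real dataset
x_tr, y_tr = x_real[:350], y_real[:350]  # real training split (baseline model)
x_te, y_te = x_real[350:], y_real[350:]  # held-out real rows (shared test set)

rmse_syn = rmse(y_te, predict(fit_line(x_syn, y_syn), x_te))
rmse_real = rmse(y_te, predict(fit_line(x_tr, y_tr), x_te))
ratio = rmse_syn / rmse_real
passed = ratio <= 1.20
print(f"synthetic RMSE={rmse_syn:.4f}  baseline RMSE={rmse_real:.4f}  ratio={ratio:.2f}")
print("PASS" if passed else "FAIL")
```

Scoring both models on the same held-out real rows keeps the ratio comparable; a ratio close to 1 means the synthetic data carries roughly the same feature-target relationship as the real data.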

In [None]:
tier2_results = validator.tier2_ml_validation()

print('=== TIER 2: ML Cross-Dataset Validation ===')
for k, v in tier2_results.items():
    print(f'  {k}: {v}')

---
## 4. Tier 3: Sleep Science Sanity Checks <a id='section-4'></a>

Domain-knowledge assertions derived from peer-reviewed sleep medicine literature.
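
For orientation, the sketch below builds one check record with the same keys the cell that follows reads (`name`, `actual_value`, `threshold`, `pass`, `description`). The helper `range_check`, the check name, and the sample values are hypothetical; the actual checks and thresholds are defined inside `Validator.tier3_sanity_checks()` and are not reproduced here.

```python
def range_check(name, values, low, high, min_fraction, description=""):
    """Build one check record: share of values inside [low, high] vs. a threshold."""
    inside = sum(low <= v <= high for v in values) / len(values)
    return {
        "name": name,
        "actual_value": inside,
        "threshold": min_fraction,
        "pass": inside >= min_fraction,
        "description": description,
    }


# Made-up values for illustration only.
good = [0.91, 0.78, 0.85, 0.66, 0.95, 0.88]
bad = [0.91, 0.78, 1.20, 0.66, 0.95, 0.88]  # one value outside [0, 1]

for label, values in (("good", good), ("bad", bad)):
    check = range_check(
        "efficiency_in_unit_interval", values, 0.0, 1.0, 1.0,
        description="All sleep efficiency values should lie in [0, 1].",
    )
    status = "PASS" if check["pass"] else "FAIL"
    print(f"[{status}] {label}: actual={check['actual_value']:.4f}  "
          f"threshold={check['threshold']}")

good_check = range_check("efficiency_in_unit_interval", good, 0.0, 1.0, 1.0)
bad_check = range_check("efficiency_in_unit_interval", bad, 0.0, 1.0, 1.0)
```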

In [None]:
tier3_results = validator.tier3_sanity_checks()

print('=== TIER 3: Sleep Science Sanity Checks ===')
rows_t3 = []
for check in tier3_results:
    status = 'PASS' if check['pass'] else 'FAIL'
    print(f'  [{status}] {check["name"]}')
    print(f'          actual={check["actual_value"]:.4f}  threshold={check["threshold"]}')
    rows_t3.append({
        'Check': check['name'],
        'Actual Value': round(check['actual_value'], 4),
        'Threshold': check['threshold'],
        'Result': status,
        'Description': check.get('description', ''),
    })

print()
df_tier3 = pd.DataFrame(rows_t3).set_index('Check')
display(df_tier3)

---
## 5. Overall Validation Summary <a id='section-5'></a>

In [None]:
# Re-run all tiers to get the full report and quality score
validator2 = Validator(df_syn, real_occupancy_df=df_occ, real_sleep_df=df_sleep)
report = validator2.run_all()
print(f'\nOverall quality score: {report.overall_score:.1f}%  ({report.passed_checks}/{report.total_checks} checks passed)')

---
## 6. Distribution Visualizations <a id='section-6'></a>

Side-by-side KDE plots compare synthetic vs. real feature distributions.
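
As a reminder of what a KDE curve represents, here is a minimal standard-library sketch of a one-dimensional Gaussian kernel density estimate with Scott's rule for the bandwidth, which is also the default that pandas documents for `plot.kde`. The function `gaussian_kde_sketch` and the sample values are illustrative assumptions; the plots in this notebook rely on pandas, not on this code.

```python
import math
import random
import statistics


def gaussian_kde_sketch(sample):
    """Return a density function estimated from `sample` with Gaussian kernels."""
    n = len(sample)
    # Scott's rule in one dimension: bandwidth = sample std * n^(-1/5).
    h = statistics.stdev(sample) * n ** (-1 / 5)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        # Average of Gaussian bumps centred on each observation.
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)

    return density


rng = random.Random(1)
sample = [rng.gauss(0.8, 0.1) for _ in range(300)]  # made-up values
pdf = gaussian_kde_sketch(sample)

# Riemann-sum check over 0.0 .. 1.6: a density should integrate to roughly 1.
step = 0.005
grid = [i * step for i in range(321)]
area = sum(pdf(x) for x in grid) * step
print(f"area under the estimated density: {area:.3f}")
```

Because the curve is smoothed, a KDE can place visible density outside the range of the data (for example, slightly above 1 for a variable bounded by 1), which is worth keeping in mind when comparing the tails of the synthetic and real curves.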

In [None]:
# KDE overlays of synthetic vs. real data: temperature (left), sleep efficiency (right)
fig, axes = plt.subplots(1, 2, figsize=(13, 4))

# Subplot 1: Temperature
if 'temp_mean' in df_syn.columns:
    df_syn['temp_mean'].plot.kde(ax=axes[0], label='Synthetic', color='tomato', linewidth=2)
    if df_occ is not None and 'Temperature' in df_occ.columns:
        t_real = df_occ['Temperature'].dropna()
        # Drop readings outside 10-35 °C before estimating the density
        t_real = t_real[(t_real > 10) & (t_real < 35)]
        t_real.plot.kde(ax=axes[0], label='Real (IoT)', color='steelblue', linewidth=2, linestyle='--')
    axes[0].set_title('Temperature Distribution\nSynthetic vs. Real IoT')
    axes[0].set_xlabel('Temperature (°C)')
    axes[0].legend()

# Subplot 2: Sleep efficiency
if 'sleep_efficiency' in df_syn.columns:
    df_syn['sleep_efficiency'].plot.kde(ax=axes[1], label='Synthetic', color='mediumseagreen', linewidth=2)
    if df_sleep is not None and 'Sleep efficiency' in df_sleep.columns:
        df_sleep['Sleep efficiency'].dropna().plot.kde(
            ax=axes[1], label='Real', color='coral', linewidth=2, linestyle='--')
    axes[1].set_title('Sleep Efficiency Distribution\nSynthetic vs. Real Sleep Data')
    axes[1].set_xlabel('Sleep Efficiency')
    axes[1].legend()

plt.suptitle('Tier 1 Visual Comparison: Synthetic vs. Real Distributions', fontsize=13)
plt.tight_layout()
plt.show()

In [None]:
# Box plot of sleep efficiency per season
fig, ax = plt.subplots(figsize=(9, 4))
season_order = ['winter', 'spring', 'summer', 'fall']
palette = {'winter': '#5B9BD5', 'spring': '#70AD47', 'summer': '#FF7043', 'fall': '#FFC000'}
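# Assign hue explicitly: seaborn >= 0.13 deprecates `palette` without `hue`.
sns.boxplot(
    data=df_syn, x='season', y='sleep_efficiency', hue='season',
    order=season_order, hue_order=season_order, palette=palette,
    dodge=False, ax=ax, width=0.5
)
if ax.get_legend() is not None:
    ax.get_legend().remove()  # the legend would only repeat the x-axis labels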
ax.set_title('Sleep Efficiency Distribution by Season')
ax.set_xlabel('Season')
ax.set_ylabel('Sleep Efficiency')
plt.tight_layout()
plt.show()