# NB0: Data Readiness & Pipeline Stage Detection (v4.3.1)

**Purpose**: Quickly verify if the necessary data for the N-of-1 pipeline are present and infer which ETL stages have been executed for P000001.

**Pipeline**: practicum2-nof1-adhd-bd **v4.3.1** (Dec 10, 2025)  
**Participant**: P000001  
**Snapshot**: 2025-12-09

This notebook is a sanity check / onboarding tool that:
1. Detects presence of key directories and files
2. Maps these to pipeline stages (0-9); the checks below cover Stages 0-7 plus QC, while Stages 8-9 are listed but not checked
3. Provides actionable hints for missing stages
4. Detects the MICE-imputed dataset written by Stage 5 (new in **v4.3.1**)

In [1]:
import pandas as pd
from pathlib import Path
import json
from typing import Dict, List, Tuple

# Configuration
PARTICIPANT = "P000001"
SNAPSHOT = "2025-12-09"

# REPO_ROOT is one level up from notebooks/
REPO_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()

print(f"Repository root: {REPO_ROOT}")
print(f"Checking data for: {PARTICIPANT} / {SNAPSHOT}")
print(f"Data directory: {REPO_ROOT / 'data'}")

Repository root: c:\dev\practicum2-nof1-adhd-bd
Checking data for: P000001 / 2025-12-09
Data directory: c:\dev\practicum2-nof1-adhd-bd\data
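
The `REPO_ROOT` rule above handles only two launch locations: the repository root itself or its `notebooks/` folder. A more tolerant variant could walk upward until it finds a marker file. The following is a minimal sketch, not part of the pipeline: `find_repo_root` is a hypothetical helper, and using `scripts/run_full_pipeline.py` as the marker is an assumption about the repository layout.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, marker: str = "scripts/run_full_pipeline.py") -> Optional[Path]:
    """Walk upward from `start` and return the first directory containing `marker`.

    Hypothetical helper; the default marker is an assumption about the repo layout.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None  # the caller decides on a fallback
```

It could be used as `REPO_ROOT = find_repo_root(Path.cwd()) or Path.cwd()`, which keeps the current behaviour as the fallback when no marker is found.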


## Stage Detection Logic

Pipeline stages (from `scripts/run_full_pipeline.py`):

- **Stage 0**: Ingest (extract from data/raw)
- **Stage 1**: Aggregate (XML + CSVs to daily_*.csv)
- **Stage 2**: Unify (merge daily CSVs)
- **Stage 3**: Label (apply PBSI v4.3.1 labels - intuitive sign convention)
- **Stage 4**: Segment (create segment_autolog.csv)
- **Stage 5**: Prep ML6 (v4.3.1: temporal filter >= 2021-05-11 + MICE imputation)
- **Stage 6**: ML6 static classifier (baseline models with 6-fold CV)
- **Stage 7**: ML7 LSTM sequence model (LSTM + SHAP + drift detection)
- **Stage 8**: TFLite (mobile deployment)
- **Stage 9**: Report (RUN_REPORT.md)

### v4.3.1 Changes (Dec 10, 2025)
- **PBSI**: Higher score = better regulation (intuitive!)
- **Stage 5**: Now outputs `ai/ml6/features_daily_ml6.csv` (MICE-imputed, 2021-2025)
- **Stage 6**: Uses MICE data (F1=0.69±0.16)
- **Stage 7**: SHAP + drift + LSTM on MICE data
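
Because Stage 5 applies a temporal filter (>= 2021-05-11) and MICE imputation, its output can be sanity-checked beyond mere existence. The sketch below is one way to do that under stated assumptions: `validate_ml6_csv` is a hypothetical helper, the file is assumed to have an ISO-formatted `date` column (YYYY-MM-DD), and imputation is assumed to leave no empty cells.

```python
import csv
from datetime import date
from pathlib import Path
from typing import List

CUTOFF = date(2021, 5, 11)  # temporal filter stated for Stage 5


def validate_ml6_csv(path: Path, date_col: str = "date") -> List[str]:
    """Return a list of problems found; an empty list means the file looks consistent.

    Hypothetical helper. Assumes an ISO-formatted `date_col` and that the
    imputed output contains no empty cells.
    """
    problems: List[str] = []
    with path.open(newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        for line_no, row in enumerate(reader, start=2):  # line 1 is the header
            day = date.fromisoformat(row[date_col])
            if day < CUTOFF:
                problems.append(f"line {line_no}: {day} is before {CUTOFF}")
            empty = [
                key for key, value in row.items()
                if value is None or (isinstance(value, str) and not value.strip())
            ]
            if empty:
                problems.append(f"line {line_no}: empty values in {empty}")
    return problems
```

Pointing it at the Stage 5 output `features_daily_ml6.csv` and printing the returned list would flag rows that predate the cutoff or still contain gaps.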

In [2]:
def check_file_exists(path: Path) -> bool:
    """Return True if path is an existing directory or a non-empty file."""
    return path.exists() and (path.is_dir() or path.stat().st_size > 0)

def check_directory_has_files(path: Path, pattern: str = "*") -> bool:
    """Check if directory exists and contains at least one entry matching pattern."""
    # is_dir() is already False for a missing path; any() stops at the first match.
    return path.is_dir() and any(path.glob(pattern))

# Define paths
paths = {
    "raw_base": REPO_ROOT / "data" / "raw",
    "etl_base": REPO_ROOT / "data" / "etl" / PARTICIPANT / SNAPSHOT,
    "ai_base": REPO_ROOT / "data" / "ai" / PARTICIPANT / SNAPSHOT,
}

# Stage-specific checks
stage_checks = [
    {
        "stage": "Stage 0: Ingest",
        "key_files": [
            paths["etl_base"] / "extracted" / "zepp",
            paths["etl_base"] / "extracted" / "apple" / "apple_health_export",
        ],
        "check": lambda: (
            check_directory_has_files(paths["etl_base"] / "extracted" / "zepp", "*.csv") or
            check_file_exists(paths["etl_base"] / "extracted" / "apple" / "apple_health_export" / "export.xml")
        ),
        "hint": f"make ingest PID={PARTICIPANT} SNAPSHOT={SNAPSHOT}"
    },
    {
        "stage": "Stage 1: Aggregate",
        "key_files": [
            paths["etl_base"] / "extracted" / "apple" / "daily_sleep.csv",
            paths["etl_base"] / "extracted" / "apple" / "daily_cardio.csv",
            paths["etl_base"] / "extracted" / "apple" / "daily_activity.csv",
            paths["etl_base"] / "extracted" / "zepp" / "daily_sleep.csv",
            paths["etl_base"] / "extracted" / "zepp" / "daily_cardio.csv",
            paths["etl_base"] / "extracted" / "zepp" / "daily_activity.csv",
        ],
        "check": lambda: (
            check_file_exists(paths["etl_base"] / "extracted" / "apple" / "daily_sleep.csv") or
            check_file_exists(paths["etl_base"] / "extracted" / "apple" / "daily_cardio.csv") or
            check_file_exists(paths["etl_base"] / "extracted" / "apple" / "daily_activity.csv") or
            check_file_exists(paths["etl_base"] / "extracted" / "zepp" / "daily_sleep.csv") or
            check_file_exists(paths["etl_base"] / "extracted" / "zepp" / "daily_cardio.csv") or
            check_file_exists(paths["etl_base"] / "extracted" / "zepp" / "daily_activity.csv")
        ),
        "hint": f"make aggregate PID={PARTICIPANT} SNAPSHOT={SNAPSHOT}"
    },
    {
        "stage": "Stage 2: Unify",
        "key_files": [
            paths["etl_base"] / "joined" / "features_daily_unified.csv",
        ],
        "check": lambda: check_file_exists(paths["etl_base"] / "joined" / "features_daily_unified.csv"),
        "hint": f"make unify PID={PARTICIPANT} SNAPSHOT={SNAPSHOT}"
    },
    {
        "stage": "Stage 3: Label (PBSI v4.3.1)",
        "key_files": [
            paths["etl_base"] / "joined" / "features_daily_labeled.csv",
        ],
        "check": lambda: check_file_exists(paths["etl_base"] / "joined" / "features_daily_labeled.csv"),
        "hint": f"make label PID={PARTICIPANT} SNAPSHOT={SNAPSHOT}"
    },
    {
        "stage": "Stage 4: Segment",
        "key_files": [
            paths["etl_base"] / "segment_autolog.csv",
        ],
        "check": lambda: check_file_exists(paths["etl_base"] / "segment_autolog.csv"),
        "hint": f"make segment PID={PARTICIPANT} SNAPSHOT={SNAPSHOT}"
    },
    {
        "stage": "Stage 5: Prep ML6 (v4.3.1: Temporal filter + MICE)",
        "key_files": [
            paths["ai_base"] / "ml6" / "features_daily_ml6.csv",
        ],
        "check": lambda: check_file_exists(paths["ai_base"] / "ml6" / "features_daily_ml6.csv"),
        "hint": f"python scripts/run_full_pipeline.py --participant {PARTICIPANT} --snapshot {SNAPSHOT} --start-stage 5 --end-stage 5"
    },
    {
        "stage": "Stage 6: ML6 (static classifier)",
        "key_files": [
            paths["ai_base"] / "ml6" / "cv_summary.json",
        ],
        "check": lambda: check_file_exists(paths["ai_base"] / "ml6" / "cv_summary.json"),
        "hint": f"python scripts/run_full_pipeline.py --participant {PARTICIPANT} --snapshot {SNAPSHOT} --start-stage 6 --end-stage 6"
    },
    {
        "stage": "Stage 7: ML7 (SHAP + Drift + LSTM)",
        "key_files": [
            paths["ai_base"] / "ml7" / "shap_summary.md",
            paths["ai_base"] / "ml7" / "drift_report.md",
            paths["ai_base"] / "ml7" / "lstm_report.md",
        ],
        "check": lambda: (
            check_file_exists(paths["ai_base"] / "ml7" / "shap_summary.md") and
            check_file_exists(paths["ai_base"] / "ml7" / "drift_report.md") and
            check_file_exists(paths["ai_base"] / "ml7" / "lstm_report.md")
        ),
        "hint": f"python scripts/run_full_pipeline.py --participant {PARTICIPANT} --snapshot {SNAPSHOT} --start-stage 7 --end-stage 7"
    },
    {
        "stage": "QC: Quality Control",
        "key_files": [
            paths["etl_base"] / "qc",
        ],
        "check": lambda: check_directory_has_files(paths["etl_base"] / "qc", "*.json"),
        "hint": f"make qc-all PID={PARTICIPANT} SNAPSHOT={SNAPSHOT}"
    },
]

print("✓ Stage detection logic configured (v4.3.1)")

✓ Stage detection logic configured (v4.3.1)
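
In `stage_checks`, most `check` lambdas repeat the paths already listed under `key_files`. One possible way to remove that duplication is to derive the check from the list itself. This is a sketch only: `is_present` and `make_check` are hypothetical helpers, and `is_present` mirrors the rule used by `check_file_exists`.

```python
from pathlib import Path
from typing import Callable, Iterable


def is_present(path: Path) -> bool:
    """True for an existing directory or a non-empty regular file."""
    return path.is_dir() or (path.is_file() and path.stat().st_size > 0)


def make_check(key_files: Iterable[Path], mode: str = "all") -> Callable[[], bool]:
    """Build a zero-argument check from a list of paths.

    Hypothetical helper: mode="any" passes if at least one path is present,
    mode="all" requires every path.
    """
    files = list(key_files)
    if mode not in ("any", "all"):
        raise ValueError(f"mode must be 'any' or 'all', got {mode!r}")
    combine = any if mode == "any" else all
    return lambda: combine(is_present(p) for p in files)
```

Stage 1 would map to `mode="any"` and Stage 7 to `mode="all"`. Stage 0 would keep its hand-written lambda, since it globs for `*.csv` and looks for `export.xml` rather than testing the listed directories directly.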


## Stage Status Check

In [3]:
# Run all checks
results = []

for check in stage_checks:
    exists = check["check"]()
    status = "✅ OK" if exists else "❌ Missing"
    
    results.append({
        "Stage": check["stage"],
        "Status": status,
        "Key Files": ", ".join([str(p.relative_to(REPO_ROOT)) for p in check["key_files"]]),
        "Exists": exists,
        "Action": "" if exists else check["hint"]
    })

# Create summary DataFrame
df_status = pd.DataFrame(results)

print("\n" + "="*100)
print("PIPELINE STAGE STATUS")
print("="*100)
print(df_status.to_string(index=False))
print("="*100)


PIPELINE STAGE STATUS
          Stage  Status  Key Files                                                                                                    Exists  Action
Stage 0: Ingest  ✅ OK    data\etl\P000001\2025-12-09\extracted\zepp, data\etl\P000001\2025-12-09\extracted\apple\apple_health_export  True

## Actionable Recommendations

In [4]:
missing_stages = df_status[~df_status["Exists"]]

if len(missing_stages) == 0:
    print("\n🎉 All pipeline stages complete! Ready for NB1-NB3 analysis.")
else:
    print(f"\n⚠️  {len(missing_stages)} stage(s) incomplete:\n")
    for _, row in missing_stages.iterrows():
        print(f"  {row['Status']} {row['Stage']}")
        print(f"     → {row['Action']}\n")

    # First missing stage
    first_missing = missing_stages.iloc[0]
    print("\n💡 Recommendation: Run the first missing stage to proceed:")
    print(f"   {first_missing['Action']}")


⚠️  1 stage(s) incomplete:

  ❌ Missing QC: Quality Control
     → make qc-all PID=P000001 SNAPSHOT=2025-12-09


💡 Recommendation: Run the first missing stage to proceed:
   make qc-all PID=P000001 SNAPSHOT=2025-12-09
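
The recommendation takes the first missing row, which relies on `stage_checks` being listed in pipeline order. To infer how far the pipeline has actually run, it can help to find the unbroken run of completed stages starting at Stage 0. A minimal sketch follows; `last_contiguous_stage` is a hypothetical helper, and its input is assumed to be the detection results for the numbered stages only, in order (QC excluded).

```python
from typing import Optional, Sequence


def last_contiguous_stage(done: Sequence[bool]) -> Optional[int]:
    """Return the index of the last stage in the unbroken run of completed
    stages starting at Stage 0, or None if Stage 0 itself is missing.

    Hypothetical helper; `done[i]` is the detection result for Stage i.
    """
    last = None
    for stage, ok in enumerate(done):
        if not ok:
            break
        last = stage
    return last
```

For example, `last_contiguous_stage([True, True, True, False, True])` returns `2`: Stage 4 output is present although Stage 3 is missing, which may indicate stale artifacts from an earlier run.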


## File Inventory (if stages complete)

In [5]:
# Only run if at least Stage 2 is complete
if check_file_exists(paths["etl_base"] / "joined" / "features_daily_unified.csv"):
    print("\n📊 Quick Data Inventory:\n")
    
    # Load unified dataset
    try:
        df_unified = pd.read_csv(paths["etl_base"] / "joined" / "features_daily_unified.csv")
        print(f"  features_daily_unified.csv:")
        print(f"    - Shape: {df_unified.shape}")
        print(f"    - Date range: {df_unified['date'].min()} to {df_unified['date'].max()}")
        print(f"    - Columns: {', '.join(df_unified.columns[:10])}...")
    except Exception as e:
        print(f"  ⚠️  Could not load features_daily_unified.csv: {e}")
    
    # Load labeled dataset if available
    if check_file_exists(paths["etl_base"] / "joined" / "features_daily_labeled.csv"):
        try:
            df_labeled = pd.read_csv(paths["etl_base"] / "joined" / "features_daily_labeled.csv")
            print(f"\n  features_daily_labeled.csv:")
            print(f"    - Shape: {df_labeled.shape}")
            if 'label_3cls' in df_labeled.columns:
                print(f"    - Label distribution: {df_labeled['label_3cls'].value_counts().to_dict()}")
        except Exception as e:
            print(f"  ⚠️  Could not load features_daily_labeled.csv: {e}")
else:
    print("\n⏩ Skipping inventory (Stage 2 not complete)")


📊 Quick Data Inventory:

  features_daily_unified.csv:
    - Shape: (2868, 30)
    - Date range: 2017-12-04 to 2025-12-07
    - Columns: date, sleep_hours, sleep_quality_score, hr_mean, hr_min, hr_max, hr_std, hr_samples, hrv_sdnn_mean, hrv_sdnn_median...

  features_daily_labeled.csv:
    - Shape: (2868, 53)
    - Label distribution: {0.0: 1434, 1.0: 717, -1.0: 717}


## Summary

This notebook provides a quick sanity check before running EDA or modelling notebooks.

**Next steps**:
- If all stages complete: proceed to **NB1_EDA.ipynb**
- If missing stages: run the recommended commands above
- For full pipeline run: `python -m scripts.run_full_pipeline --participant P000001 --snapshot 2025-12-09`