# Notebook 01: Data Exploration
**Project:** Synthetic Sleep Environment Dataset Generator  
**Authors:** Rushav Dash & Lisa Li  
**Course:** TECHIN 513 — Signal Processing & Machine Learning  
**University:** University of Washington  
**Date:** 2026-02-19

## Table of Contents
1. [Setup & Imports](#section-1)
2. [Load Datasets via DataLoader](#section-2)
3. [Sleep Efficiency EDA](#section-3)
4. [Room Occupancy EDA](#section-4)
5. [Smart Home EDA](#section-5)
6. [Reference Statistics Summary](#section-6)

---
## 1. Setup & Imports <a id='section-1'></a>
Add the project root to `sys.path` so the `src` package is importable, then bring in all third-party libraries used throughout this notebook.

In [None]:
import sys
import os

# Make the src package importable from this notebook
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..'))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import kagglehub
from scipy import signal as sp_signal

from src.data_loader import DataLoader
from src.feature_extractor import FeatureExtractor

%matplotlib inline
plt.rcParams.update({'figure.dpi': 120, 'font.size': 11})
sns.set_theme(style='whitegrid')
print('Setup complete.')

---
## 2. Load Datasets via DataLoader <a id='section-2'></a>
`DataLoader.download_all()` uses **kagglehub** to pull every dataset from Kaggle and caches them locally under `data/raw/`. Subsequent calls use the local cache.

In [None]:
loader = DataLoader(data_dir=os.path.join(PROJECT_ROOT, 'data'))

print('Downloading / locating datasets...')
loader.download_all()
print('Download complete.')

In [None]:
df_sleep = loader.load_sleep_efficiency()
df_occ   = loader.load_room_occupancy()
df_home  = loader.load_smart_home()

print(f'Sleep Efficiency : {df_sleep.shape}')
print(f'Room Occupancy   : {df_occ.shape}')
print(f'Smart Home       : {df_home.shape if df_home is not None else "not available"}')

---
## 3. Sleep Efficiency EDA <a id='section-3'></a>
We examine the **Sleep Efficiency** dataset in detail: schema, missing values, univariate distributions, and inter-feature correlations.

In [None]:
print('=== Shape ===')
print(df_sleep.shape)
print('\n=== dtypes ===')
print(df_sleep.dtypes)
print('\n=== First 5 rows ===')
df_sleep.head()

In [None]:
print('=== Missing value counts ===')
null_counts = df_sleep.isnull().sum()
null_pct    = (null_counts / len(df_sleep) * 100).round(2)
pd.DataFrame({'null_count': null_counts, 'null_pct': null_pct}).query('null_count > 0')

### 3.1 Correlation Heatmap
Pearson correlations between all numeric columns help us see which environmental and behavioural predictors are most linearly related to sleep efficiency.

In [None]:
numeric_sleep = df_sleep.select_dtypes(include='number')
corr_matrix   = numeric_sleep.corr()

fig, ax = plt.subplots(figsize=(12, 9))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(
    corr_matrix, mask=mask, annot=True, fmt='.2f',
    cmap='coolwarm', center=0, linewidths=0.5, ax=ax
)
ax.set_title('Sleep Efficiency Dataset — Pearson Correlation Heatmap', fontsize=13)
plt.tight_layout()
plt.show()
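
A heatmap is easy to scan but hard to rank by eye. As a complement, the sketch below sorts the correlations with one target column by absolute value. The helper `rank_correlations` and the toy columns are illustrative assumptions, not part of the project code. With the real data, one would pass `df_sleep` and the name of the efficiency column instead.

```python
import numpy as np
import pandas as pd


def rank_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlations with `target`, strongest (by |r|) first."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy data standing in for df_sleep; column names are hypothetical.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "target": x,
    "strong_pos": x + rng.normal(scale=0.1, size=200),
    "strong_neg": -x + rng.normal(scale=0.5, size=200),
    "noise": rng.normal(size=200),
})

ranked = rank_correlations(toy, "target")
print(ranked)
```

Sorting by absolute value keeps strong negative relationships near the top, where a plain descending sort would bury them.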

### 3.2 Univariate Distributions
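Histograms with KDE overlays for every numeric column that has more than five distinct values; this filters out near-categorical columns, for which a KDE is not meaningful.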

In [None]:
key_cols = [c for c in numeric_sleep.columns if numeric_sleep[c].nunique() > 5]
n_cols   = 3
n_rows   = int(np.ceil(len(key_cols) / n_cols))

fig, axes = plt.subplots(n_rows, n_cols, figsize=(14, 3.5 * n_rows))
axes = axes.flatten()

for i, col in enumerate(key_cols):
    sns.histplot(df_sleep[col].dropna(), kde=True, ax=axes[i], color='steelblue')
    axes[i].set_title(col)
    axes[i].set_xlabel('')

# Hide unused axes; index from len(key_cols) rather than the leaked loop variable
for j in range(len(key_cols), len(axes)):
    axes[j].set_visible(False)

fig.suptitle('Sleep Efficiency — Numeric Column Distributions', fontsize=13, y=1.01)
plt.tight_layout()
plt.show()

### 3.3 Boxplots by Gender
Comparing sleep efficiency and sleep duration across genders.

In [None]:
box_targets = ['Sleep efficiency', 'Sleep duration']
box_targets = [c for c in box_targets if c in df_sleep.columns]

if 'Gender' in df_sleep.columns and box_targets:
    fig, axes = plt.subplots(1, len(box_targets), figsize=(7 * len(box_targets), 5))
    if len(box_targets) == 1:
        axes = [axes]
    for ax, col in zip(axes, box_targets):
        sns.boxplot(data=df_sleep, x='Gender', y=col, palette='Set2', ax=ax)
        ax.set_title(f'{col} by Gender')
    plt.suptitle('Sleep Efficiency Dataset — Boxplots by Gender', fontsize=13)
    plt.tight_layout()
    plt.show()
else:
    print('Gender column not found or no target columns available.')
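
Boxplots show the spread visually; a small grouped table gives the same comparison as numbers. The following is a minimal sketch on made-up values (the rows are invented for illustration, only the column names mirror the ones used above).

```python
import pandas as pd

# Invented rows for illustration; not taken from the Sleep Efficiency dataset.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "Sleep efficiency": [0.90, 0.80, 0.70, 0.60, 0.80, 0.70],
})

# One row per group, one column per summary statistic.
summary = toy.groupby("Gender")["Sleep efficiency"].agg(
    ["count", "median", "mean", "std"]
)
print(summary)
```

The same call should work on `df_sleep` when the `Gender` and `Sleep efficiency` columns are present.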

---
## 4. Room Occupancy EDA <a id='section-4'></a>
The **Room Occupancy** dataset provides dense multi-sensor time-series (temperature, light, sound, CO2, humidity) labelled by occupancy status. We explore trends and frequency content.

In [None]:
print(df_occ.dtypes)
df_occ.head()

### 4.1 Multi-sensor Time-series Plot
Plot the first 2000 samples (or fewer, if the dataset is shorter) of each sensor channel, one subplot per channel on a shared x-axis.

In [None]:
# Identify a datetime column if present
time_col  = next((c for c in df_occ.columns if 'date' in c.lower() or 'time' in c.lower()), None)
# Keep numeric channels only (string columns cannot be line-plotted) and drop the label
sensor_cols = [c for c in df_occ.select_dtypes(include='number').columns
               if c != time_col and c.lower() != 'occupancy']

plot_n = min(2000, len(df_occ))
x_vals = df_occ[time_col].iloc[:plot_n] if time_col else np.arange(plot_n)

fig, axes = plt.subplots(len(sensor_cols), 1, figsize=(14, 2.8 * len(sensor_cols)), sharex=True)
if len(sensor_cols) == 1:
    axes = [axes]

for ax, col in zip(axes, sensor_cols):
    ax.plot(x_vals, df_occ[col].iloc[:plot_n], linewidth=0.8)
    ax.set_ylabel(col, fontsize=9)
    ax.yaxis.set_major_locator(ticker.MaxNLocator(4))

axes[-1].set_xlabel('Sample index' if not time_col else 'Timestamp')
fig.suptitle('Room Occupancy — Multi-sensor Time Series (first 2000 samples)', fontsize=13)
plt.tight_layout()
plt.show()

### 4.2 FFT of Temperature Channel
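A Fast Fourier Transform reveals dominant periodic components in the room temperature signal. The mean is subtracted first so the zero-frequency (DC) term does not dominate the spectrum. The frequency axis is in cycles per sample, which equals cycles per minute under the assumed rate of one sample per minute.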

In [None]:
temp_col = next((c for c in df_occ.columns if 'temp' in c.lower()), None)

if temp_col:
    temp_vals = df_occ[temp_col].dropna().values
    N   = len(temp_vals)
    fs  = 1.0           # 1 sample per minute assumed
    fft_vals = np.abs(np.fft.rfft(temp_vals - temp_vals.mean()))
    freqs    = np.fft.rfftfreq(N, d=1/fs)

    fig, ax = plt.subplots(figsize=(12, 4))
    ax.semilogy(freqs[1:], fft_vals[1:], linewidth=0.9, color='teal')
    ax.set_xlabel('Frequency (cycles per sample)')
    ax.set_ylabel('Magnitude (log scale)')
    ax.set_title(f'FFT of Room Temperature ({temp_col})')
    plt.tight_layout()
    plt.show()
else:
    print('No temperature column found in Room Occupancy dataset.')
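
To read a spectrum like this, it helps to convert the strongest bin back into a period. The sketch below does that on a synthetic signal with a known 24-hour cycle; the one-sample-per-minute rate, the amplitude, and the noise level are assumptions chosen for illustration, not measured properties of the dataset.

```python
import numpy as np

fs = 1.0                 # samples per minute (assumed)
n = 7 * 1440             # one week of minute-level samples
t = np.arange(n)

# Synthetic temperature: 21 degC baseline, daily sinusoid, small Gaussian noise.
rng = np.random.default_rng(0)
temp = 21.0 + 1.5 * np.sin(2 * np.pi * t / 1440) + rng.normal(scale=0.2, size=n)

# Same recipe as the cell above: remove the mean, then take the one-sided FFT.
spectrum = np.abs(np.fft.rfft(temp - temp.mean()))
freqs = np.fft.rfftfreq(n, d=1 / fs)   # cycles per minute

# Skip bin 0 (DC) and locate the strongest remaining component.
peak = int(np.argmax(spectrum[1:])) + 1
period_minutes = 1.0 / freqs[peak]
print(f"Dominant period: {period_minutes:.0f} minutes")
```

Because the record spans exactly seven daily cycles, the peak lands on a single bin. On real data the record length rarely matches the period so cleanly, and the energy spreads over neighbouring bins.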

### 4.3 Autocorrelation of Temperature
Autocorrelation measures how strongly the temperature at one time step correlates with its value a given number of samples earlier. How quickly it decays with lag indicates how much memory a realistic synthetic signal needs.

In [None]:
if temp_col:
    from pandas.plotting import autocorrelation_plot
    fig, ax = plt.subplots(figsize=(12, 4))
    autocorrelation_plot(df_occ[temp_col].dropna().iloc[:3000], ax=ax)
    ax.set_title(f'Autocorrelation — {temp_col}')
    ax.set_xlim(0, 200)
    plt.tight_layout()
    plt.show()
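
One way the autocorrelation can inform synthetic signal design is by fitting a simple autoregressive model. The sketch below computes the sample autocorrelation with NumPy and checks it against a simulated AR(1) process, whose theoretical autocorrelation at lag k is phi**k. Treating temperature as AR(1) is an assumption made here for illustration; the helper `sample_acf` is hypothetical and not part of the project code.

```python
import numpy as np


def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (normalised by lag-0 energy)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )


# Simulate an AR(1) process: x[i] = phi * x[i-1] + white noise.
rng = np.random.default_rng(0)
phi = 0.9
n = 5000
x = np.zeros(n)
for i in range(1, n):
    x[i] = phi * x[i - 1] + rng.normal()

acf = sample_acf(x, max_lag=20)
phi_hat = acf[1]   # the lag-1 autocorrelation estimates phi for an AR(1) process
print(f"Estimated phi: {phi_hat:.3f}")
```

Applied to the real temperature column, the lag-1 value would give a starting coefficient for an AR(1) generator, provided the decay actually looks roughly geometric.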

---
## 5. Smart Home EDA <a id='section-5'></a>
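If the smart home dataset was loaded successfully in Section 2, inspect its schema, descriptive statistics, and null counts, then plot distributions for its first three numeric columns.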

In [None]:
if df_home is not None:
    print(f'Shape : {df_home.shape}')
    print(f'dtypes:\n{df_home.dtypes}')
    print('\nDescriptive statistics:')
    display(df_home.describe())
    print('\nNull counts:')
    print(df_home.isnull().sum())
else:
    print('Smart Home dataset not available — skipping section.')

In [None]:
if df_home is not None:
    home_numeric = df_home.select_dtypes(include='number')
    if not home_numeric.empty:
        fig, axes = plt.subplots(1, min(3, len(home_numeric.columns)),
                                 figsize=(5 * min(3, len(home_numeric.columns)), 4))
        if not hasattr(axes, '__len__'):
            axes = [axes]
        for ax, col in zip(axes, home_numeric.columns[:3]):
            sns.histplot(home_numeric[col].dropna(), kde=True, ax=ax, color='mediumpurple')
            ax.set_title(col)
        plt.suptitle('Smart Home — Sample Distributions', fontsize=13)
        plt.tight_layout()
        plt.show()

---
## 6. Reference Statistics Summary <a id='section-6'></a>
`FeatureExtractor.extract_reference_stats()` aggregates key distributional statistics from all three datasets into a single dictionary used by the signal generator to produce realistic synthetic data.

In [None]:
extractor = FeatureExtractor(
    sleep_df=df_sleep,
    occupancy_df=df_occ,
    smart_home_df=df_home
)

ref_stats = extractor.extract_reference_stats()
print(f'Number of statistic groups extracted: {len(ref_stats)}')
for group, values in ref_stats.items():
    print(f'\n--- {group} ---')
    if isinstance(values, dict):
        for k, v in values.items():
            print(f'  {k}: {v}')
    else:
        print(f'  {values}')
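
The internals of `FeatureExtractor` are not shown in this notebook. As a rough mental model only, a per-column summary of the kind such a dictionary might hold could be built as follows. The function `summarize_numeric` and the choice of statistics are hypothetical, and the real `extract_reference_stats()` may be organised quite differently.

```python
import pandas as pd


def summarize_numeric(df: pd.DataFrame) -> dict:
    """Hypothetical helper: basic distribution statistics per numeric column."""
    stats = {}
    for col in df.select_dtypes(include="number").columns:
        s = df[col].dropna()
        stats[col] = {
            "mean": float(s.mean()),
            "std": float(s.std()),   # sample standard deviation (ddof=1)
            "min": float(s.min()),
            "max": float(s.max()),
        }
    return stats


# Toy frame for illustration; the string column is ignored.
toy = pd.DataFrame({"temp": [20.0, 22.0, 24.0], "room": ["a", "b", "c"]})
toy_stats = summarize_numeric(toy)
print(toy_stats)
```

A generator could then draw values from a distribution matching each column's mean and standard deviation and clip them to the observed range.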

In [None]:
print('Notebook 01 complete.')