## **HRV + Sleep diary (the diary + the watch)**
**What you did**
**Fixed timestamps:** changed big millisecond numbers like 1617262425031 into real dates (e.g., 2021-04-01 07:33:45 UTC).


**Cleaned:** removed duplicates and ensured key heart numbers (HR, rmssd, sdnn, lf/hf) are real numbers.


**Aligned:** matched each HRV reading to the sleep diary day.


**Labeled:** made a sleep_flag that is 1 when the wrist-watch reading is during sleep, 0 otherwise.


**Resampled / windowed:** either kept HRV at its per-night resolution (if daily) or resampled to regular steps if there were many per-second values, then cut into windows (clips) if needed.


**Why**
The diary talks about nights, the watch talks about seconds — we needed to make them speak the same language so we could know “which heart readings happened while sleeping”.


**What it looks like after**
A cleaned table with columns like HR, rmssd, sdnn, lf/hf, plus date, sleep_duration, sleep_efficiency, and sleep_flag.


Saved file: /content/drive/MyDrive/stress-project/processed_data/hrv_processed_week3.parquet


If the HRV file was daily-only, you have a neat one-row-per-night table ready for tree models. If per-second, you have windows or resampled rows ready for deep models.


In [None]:
!umount /content/drive


In [None]:
from google.colab import drive
import os

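# Remove leftover local files in the mount point, but only when Drive is not
# mounted there (rm -rf on a live mount would delete files from Drive itself)
if os.path.exists('/content/drive') and not os.path.ismount('/content/drive'):
  !rm -rf /content/drive/*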

drive.mount('/content/drive')

Mounted at /content/drive


First, we unmount (!umount) any old Google Drive connection to start fresh.

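Then, we check if /content/drive already exists. If it does, we clear out leftover local files with rm -rf. This is only safe once Drive is really unmounted: on a live mount, the same command would delete files from Drive itself.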

Finally, we mount Google Drive again with drive.mount('/content/drive').

**Why:** This makes sure we’re working with a clean, fresh connection to Drive.

In [None]:
import os

folder = '/content/drive/MyDrive/stress-project/data_raw/hrv_sleep'
print("Exists:", os.path.exists(folder))
print("Files:", os.listdir(folder))


Exists: True
Files: ['sensor_hrv_filtered.csv', 'README.docx', 'survey.csv', 'sensor_hrv.csv', 'sleep_diary.csv']


We check if the folder data_raw/hrv_sleep exists inside Drive.
Then we print the list of files inside it.

**Why:** To confirm that our raw data files are really there before loading them.
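
A slightly more defensive version of this check could look like the sketch below. The helper name `list_data_files` and the temporary folder are hypothetical stand-ins (they are not part of the notebook); the idea is only that a missing folder returns an empty list instead of raising an error.

```python
import os
import tempfile

def list_data_files(folder, suffix=".csv"):
    """Return sorted file names in `folder` ending with `suffix`; [] if the folder is missing."""
    if not os.path.isdir(folder):
        return []
    return sorted(name for name in os.listdir(folder) if name.endswith(suffix))

# Demo on a throwaway folder (a stand-in for the Drive path)
with tempfile.TemporaryDirectory() as tmp:
    for name in ["sleep_diary.csv", "sensor_hrv.csv", "README.docx"]:
        open(os.path.join(tmp, name), "w").close()
    found = list_data_files(tmp)
    missing = list_data_files(os.path.join(tmp, "no_such_folder"))

print(found)
print(missing)
```

In the notebook, the same call would take the real `folder` path defined in the cell above.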

In [None]:
import pandas as pd

base = "/content/drive/MyDrive/stress-project/data_raw/hrv_sleep"

hrv = pd.read_csv(f"{base}/sensor_hrv_filtered.csv")
sleep = pd.read_csv(f"{base}/sleep_diary.csv")

print("HRV head:\n", hrv.head())
print("Sleep diary head:\n", sleep.head())


HRV head:
   deviceId       ts_start         ts_end  missingness_score         HR  \
0     ab60  1617262425031  1617262724833           0.295448  84.592816   
1     ab60  1616736817151  1616737116986           0.239085  78.589565   
2     ab60  1616736517083  1616736816952           0.100773  75.620524   
3     ab60  1616736217077  1616736516883           0.268178  85.813165   
4     ab60  1616734416800  1616734716672           0.043466  76.944500   

          ibi  acc_x_avg  acc_y_avg  acc_z_avg  grv_x_avg  ...  calories  \
0  728.534374   0.284765  -0.593973   9.195984  -0.094203  ...  0.000000   
1  781.896913   3.050179  -1.239353   5.790543  -0.211973  ...  0.085083   
2  812.183910   2.153267  -3.546833   8.499866  -0.628970  ...       NaN   
3  769.754943   2.898409  -3.401356   4.606113  -0.249247  ...  1.375000   
4  775.190053  -0.050221  -6.576164   5.377019   0.715893  ...  0.000000   

    light_avg        sdnn       sdsd       rmssd     pnn20     pnn50  \
0  841.324415  

Load HRV sensor data from sensor_hrv_filtered.csv.

Load Sleep diary from sleep_diary.csv.

Show the first few rows (head()) from both.

**Why**:  To bring the raw data into Python so we can work with it.

In [None]:
import pandas as pd
import numpy as np

base = "/content/drive/MyDrive/stress-project/data_raw/hrv_sleep"

# Load CSVs
hrv = pd.read_csv(f"{base}/sensor_hrv_filtered.csv")
sleep = pd.read_csv(f"{base}/sleep_diary.csv")

print("HRV shape (rows, columns):", hrv.shape)
print("Sleep shape (rows, columns):", sleep.shape)


HRV shape (rows, columns): (38913, 28)
Sleep shape (rows, columns): (1372, 11)


Reload both CSVs just to be safe.

hrv.shape tells us the number of rows and columns in the HRV dataset.

sleep.shape tells us the size of the sleep diary dataset.

**Why**: Before cleaning, it’s good to check the size of our datasets.
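
The overview at the top mentions removing duplicates and making sure the key heart numbers are real numbers. One way that cleaning step might look is sketched here on a tiny made-up table; the values and the `key_cols` list are assumptions for illustration, not taken from the real files.

```python
import pandas as pd

# Tiny stand-in for the HRV table (values are made up)
raw = pd.DataFrame({
    "ts_start": [1616736217077, 1616736217077, 1616736517083, 1616736817151],
    "HR": ["85.8", "85.8", "75.6", "bad"],
    "rmssd": [40.1, 40.1, 52.3, 47.0],
})

key_cols = ["HR", "rmssd"]

# Drop rows that are exact copies of an earlier row
clean = raw.drop_duplicates().copy()

# Turn anything that is not a number into NaN instead of raising an error
for col in key_cols:
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

print(clean.dtypes)
print(clean)
```

After this, `clean["HR"]` is a float column and the unreadable entry shows up as NaN, so it can be counted or dropped deliberately.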

In [None]:
# Convert HRV timestamps
hrv['ts_start'] = pd.to_datetime(hrv['ts_start'], unit='ms')  # from ms epoch
hrv.set_index('ts_start', inplace=True)
hrv = hrv.sort_index()

# Select only numeric columns to reduce memory
numeric_cols = hrv.select_dtypes(include=[np.number]).columns[:1000]  # limit to 1000 cols
hrv_numeric = hrv[numeric_cols]

# Resample to 5-min intervals and interpolate
hrv_resampled = hrv_numeric.resample("5min").mean().interpolate()


ts_start is turned into real datetime (from milliseconds since 1970).

Set that timestamp as the index so our data is time-based.

Keep only numeric sensor columns (like HR, RMSSD, etc.), ignore text.

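Resample the HRV data to 5-minute chunks by averaging. Missing values are filled by linear interpolation. Note that interpolate() with no limit fills every gap, however long, so a stretch with no recordings at all ends up with made-up values.

**Why**: Raw HRV data comes at uneven times. Resampling makes it neat and evenly spaced in time.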
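
To see the conversion and resampling on something small, here is a minimal sketch with three made-up readings. The `limit=2` argument is an assumption added for illustration (the notebook interpolates without a limit); it caps how many empty bins in a row get filled.

```python
import pandas as pd

# Made-up readings: two close together, then a gap of about an hour
df = pd.DataFrame({
    "ts_start": [1617262425031, 1617262725031, 1617266325031],  # ms since 1970 (UTC)
    "HR": [84.0, 80.0, 70.0],
})
df["ts_start"] = pd.to_datetime(df["ts_start"], unit="ms")
df = df.set_index("ts_start").sort_index()

# Average into 5-minute bins, then fill at most 2 consecutive empty bins
# after a valid reading; the rest of a longer gap stays NaN
resampled = df.resample("5min").mean().interpolate(limit=2)

print(df.index[0])
print(resampled)
```

Leaving long gaps as NaN keeps later steps from treating interpolated values as if they were measured.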

In [None]:
window_size = 6  # for 5-min intervals, 6x5=30 min

hrv_windowed = hrv_resampled.rolling(window=window_size, min_periods=1).agg({
    'HR': ['mean','std'],
    'rmssd': 'mean',
    'sdnn': 'mean',
    'lf/hf': 'mean'
})

# Flatten MultiIndex columns
hrv_windowed.columns = ['_'.join(col).strip() for col in hrv_windowed.columns.values]


Take a 30-minute sliding window (since 6 × 5min = 30min).

Inside each window, calculate:

Mean + Std of HR

Mean RMSSD

Mean SDNN

Mean LF/HF

Flatten the weird multi-level column names into simple names like HR_mean, HR_std.

**Why**: These summary features make the data easier to use for stress/sleep analysis.
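
The same rolling-and-flatten pattern on a tiny made-up table, as a minimal sketch (the numbers are invented so the result is easy to check by hand):

```python
import pandas as pd

idx = pd.date_range("2021-03-04 00:00", periods=8, freq="5min")
demo = pd.DataFrame(
    {"HR": [70, 72, 74, 76, 78, 80, 82, 84],
     "rmssd": [50, 48, 46, 44, 42, 40, 38, 36]},
    index=idx,
    dtype=float,
)

# 6 rows x 5 minutes = 30-minute window; min_periods=1 lets early rows use what exists
win = demo.rolling(window=6, min_periods=1).agg({"HR": ["mean", "std"], "rmssd": "mean"})

# ('HR', 'mean') -> 'HR_mean', and so on
win.columns = ["_".join(col) for col in win.columns]

print(win)
```

The first row has a NaN standard deviation because one value has no spread; the last row averages the final six readings.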

In [None]:
# Ensure index is datetime
hrv_windowed.index = pd.to_datetime(hrv_windowed.index)

# Create a date column for merging
hrv_windowed['date'] = hrv_windowed.index.date
sleep['date'] = pd.to_datetime(sleep['date']).dt.date

# Merge sleep info
sleep_metrics = sleep[['date', 'sleep_duration', 'sleep_efficiency']]
hrv_with_sleep = hrv_windowed.merge(sleep_metrics, on='date', how='left')

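# Note: merge() on a column returns a new 0..n-1 integer index, so the ts_start
# datetime index is lost here (reset_index() before merging would keep it as a column)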


Make sure HRV index is proper datetime.

Add a date column to HRV (so we can match to daily sleep diary).

The sleep diary already has a date column; we convert it to date-only.

Merge them together → each HRV row gets the sleep metrics (duration + efficiency) for that day.

**Why**: This links continuous HRV signals with nightly sleep quality.
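
A merge on a column hands back a fresh integer index, so the timestamps need to travel as a column and be restored afterwards. A minimal sketch of that, using made-up frames (`hrv_demo`, `diary_demo` and their values are hypothetical):

```python
import pandas as pd

idx = pd.to_datetime(["2021-03-09 23:55", "2021-03-10 00:05", "2021-03-10 00:10"])
hrv_demo = pd.DataFrame({"HR_mean": [62.0, 60.0, 59.0]}, index=idx)
hrv_demo.index.name = "ts_start"

diary_demo = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-09", "2021-03-10"]).date,
    "sleep_duration": [6.5, 7.2],  # made-up values
})

merged = (
    hrv_demo.reset_index()                         # keep ts_start as a column
    .assign(date=lambda d: d["ts_start"].dt.date)  # day key for the diary
    .merge(diary_demo, on="date", how="left")
    .set_index("ts_start")                         # restore the time index
)

print(merged)
```

With the index restored, later time-based steps (interval checks, time-based rolling windows) still see real timestamps.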

In [None]:
sleep['date'] = pd.to_datetime(sleep['date'], errors='coerce')


In [None]:
# If they are strings, keep them; if already datetime, convert to string for combining
sleep['asleep_time'] = sleep['asleep'].astype(str)
sleep['wakeup_time'] = sleep['wakeup'].astype(str)

# Combine date + time and convert to datetime
sleep['asleep'] = pd.to_datetime(sleep['date'].dt.strftime('%Y-%m-%d') + ' ' + sleep['asleep_time'], errors='coerce')
sleep['wakeup'] = pd.to_datetime(sleep['date'].dt.strftime('%Y-%m-%d') + ' ' + sleep['wakeup_time'], errors='coerce')




In [None]:
sleep.loc[sleep['wakeup'] < sleep['asleep'], 'wakeup'] += pd.Timedelta(days=1)


First, make sure date in sleep diary is a real date.

Convert asleep and wakeup columns into strings so we can combine them with the date.

Build full datetime values: date + time → e.g. 2023-05-01 23:00.

If the wakeup time comes out earlier than the asleep time (e.g. asleep 23:00, wakeup 06:00), the night crossed midnight, so we add one day to wakeup.

**Why**: Sleep starts one day and often ends the next morning. We need proper datetime ranges.
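
Here is a minimal sketch of the date + time combination with the midnight fix. It assumes the diary stores times as "HH:MM" text, which may not match the real file; passing an explicit `format` is one way to avoid format-guessing during parsing.

```python
import pandas as pd

# Made-up diary rows; the "HH:MM" layout is an assumption
diary = pd.DataFrame({
    "date": ["2021-03-09", "2021-03-10"],
    "asleep": ["23:00", "01:30"],
    "wakeup": ["06:00", "08:00"],
})

fmt = "%Y-%m-%d %H:%M"
for col in ["asleep", "wakeup"]:
    diary[col] = pd.to_datetime(diary["date"] + " " + diary[col], format=fmt, errors="coerce")

# A wakeup earlier than asleep means the night crossed midnight
crosses_midnight = diary["wakeup"] < diary["asleep"]
diary.loc[crosses_midnight, "wakeup"] += pd.Timedelta(days=1)

print(diary)
```

Only the first row is shifted; the second night starts after midnight, so asleep and wakeup already fall on the same date.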

In [None]:
hrv_with_sleep.index = pd.to_datetime(hrv_with_sleep.index, errors='coerce')


In [None]:
print(hrv_with_sleep.index[:5])


DatetimeIndex([          '1970-01-01 00:00:00',
               '1970-01-01 00:00:00.000000001',
               '1970-01-01 00:00:00.000000002',
               '1970-01-01 00:00:00.000000003',
               '1970-01-01 00:00:00.000000004'],
              dtype='datetime64[ns]', freq=None)


In [None]:
hrv_with_sleep.index = pd.to_datetime(hrv_with_sleep.index, errors='coerce').tz_localize(None)


In [None]:
sleep['asleep'] = pd.to_datetime(sleep['asleep'], errors='coerce').dt.tz_localize(None)
sleep['wakeup'] = pd.to_datetime(sleep['wakeup'], errors='coerce').dt.tz_localize(None)


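Make both the HRV data and the sleep diary timezone-free (naive).

This prevents mismatches (sometimes data has hidden timezone info).

Note the index printed above: it starts at 1970-01-01 and advances one nanosecond per row. That is what pd.to_datetime produces from a plain 0, 1, 2, ... integer index, so the real timestamps were already lost in the earlier merge, and removing timezones cannot bring them back.

**Why**: We want both datasets to “speak the same clock.”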
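
Dropping a timezone can mean two different things, and the sketch below shows both on made-up timestamps (the +09:00 offset is only an example). `tz_localize(None)` keeps the wall-clock time, while converting to UTC first keeps the actual instant. Epoch values converted with `unit='ms'` are in UTC, so if the diary was written in local time, the second form is the one that would line the two up.

```python
import pandas as pd

# Made-up timezone-aware times (example offset +09:00)
aware = pd.Series(pd.to_datetime(["2021-03-09 02:00:00+09:00", "2021-03-10 00:40:00+09:00"]))

# Option 1: drop the offset, keep the clock reading (02:00 stays 02:00)
wall_clock = aware.dt.tz_localize(None)

# Option 2: convert to UTC first, then drop the offset (02:00+09:00 becomes 17:00 the day before)
as_utc = aware.dt.tz_convert("UTC").dt.tz_localize(None)

print(wall_clock)
print(as_utc)
```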

In [None]:
import numpy as np

# Make timestamps naive (no timezone)
hrv_index = pd.to_datetime(hrv_with_sleep.index).tz_localize(None)
sleep_asleep = pd.to_datetime(sleep['asleep']).dt.tz_localize(None)
sleep_wakeup = pd.to_datetime(sleep['wakeup']).dt.tz_localize(None)

# Create a sleep_flag array
sleep_flag = np.zeros(len(hrv_index), dtype=int)

for start, end in zip(sleep_asleep, sleep_wakeup):
    mask = (hrv_index >= start) & (hrv_index <= end)
    sleep_flag[mask] = 1

# Assign to dataframe
hrv_with_sleep['sleep_flag'] = sleep_flag


Create an empty array of zeros (means “awake”).

For each night’s sleep interval (asleep → wakeup), mark HRV timestamps that fall in that range as 1 (means “asleep”).

Add this sleep_flag column to HRV data.

**Why**: This gives us one label per HRV row (one per 5-minute step after resampling) telling whether the person was asleep.
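
The same labeling can be done without a Python loop. This is a sketch on made-up timestamps that compares every reading against every night at once; it builds a readings-by-nights table of True/False values, so it suits a modest number of nights.

```python
import numpy as np
import pandas as pd

# Made-up reading times and sleep intervals
ts = pd.to_datetime(["2021-03-09 22:00", "2021-03-09 23:30", "2021-03-10 03:00",
                     "2021-03-10 07:00", "2021-03-11 00:15"]).to_numpy()
asleep = pd.to_datetime(["2021-03-09 23:00", "2021-03-11 00:00"]).to_numpy()
wakeup = pd.to_datetime(["2021-03-10 06:00", "2021-03-11 07:30"]).to_numpy()

# inside[i, j] is True when reading i falls within night j (both ends included)
inside = (ts[:, None] >= asleep[None, :]) & (ts[:, None] <= wakeup[None, :])

# A reading is "asleep" if it falls inside any night
sleep_flag = inside.any(axis=1).astype(int)

print(sleep_flag)
```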

In [None]:
# Keep only relevant sleep metrics
sleep_metrics = sleep[['asleep', 'wakeup', 'sleep_duration', 'sleep_efficiency']]

# Initialize columns
hrv_with_sleep['sleep_duration'] = np.nan
hrv_with_sleep['sleep_efficiency'] = np.nan

# Assign nightly metrics to HRV rows
for i, row in sleep_metrics.iterrows():
    mask = (hrv_with_sleep.index >= row['asleep']) & (hrv_with_sleep.index <= row['wakeup'])
    hrv_with_sleep.loc[mask, 'sleep_duration'] = row['sleep_duration']
    hrv_with_sleep.loc[mask, 'sleep_efficiency'] = row['sleep_efficiency']


First, we prepare two new empty columns (sleep_duration, sleep_efficiency) in the HRV table.

Then, for each night’s sleep diary row, we:

Check which HRV timestamps happened while the person was asleep.

Fill those HRV rows with that night’s total sleep duration and sleep efficiency.

**Why**

Because the HRV table has one row per 5-minute step, while the sleep diary has only one row per night, we “spread” that night’s values across all the matching HRV rows.
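
An alternative to looping over diary rows is `pd.merge_asof`, which matches each reading to the most recent night that started at or before it. The sketch below uses made-up frames (`readings`, `nights`) and assumes the nights do not overlap.

```python
import numpy as np
import pandas as pd

# Made-up readings and one made-up night
readings = pd.DataFrame({
    "ts": pd.to_datetime(["2021-03-09 22:00", "2021-03-09 23:30",
                          "2021-03-10 03:00", "2021-03-10 07:00"]).astype("datetime64[ns]"),
    "HR_mean": [72.0, 61.0, 58.0, 70.0],
})
nights = pd.DataFrame({
    "asleep": pd.to_datetime(["2021-03-09 23:00"]).astype("datetime64[ns]"),
    "wakeup": pd.to_datetime(["2021-03-10 06:00"]).astype("datetime64[ns]"),
    "sleep_duration": [6.5],
    "sleep_efficiency": [0.91],
})

# For each reading, find the latest night whose asleep time is <= the reading time
out = pd.merge_asof(
    readings.sort_values("ts"),
    nights.sort_values("asleep"),
    left_on="ts",
    right_on="asleep",
    direction="backward",
)

# Blank the metrics for readings with no night, or taken after that night's wakeup
awake = out["wakeup"].isna() | (out["ts"] > out["wakeup"])
out.loc[awake, ["sleep_duration", "sleep_efficiency"]] = np.nan
out["sleep_flag"] = (~awake).astype(int)

print(out[["ts", "sleep_flag", "sleep_duration", "sleep_efficiency"]])
```

Both inputs must be sorted by their time columns, which is why the sketch sorts them explicitly.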

In [None]:
numeric_cols = ['HR_mean', 'rmssd_mean', 'sdnn_mean', 'lf/hf_mean']  # names must match hrv_with_sleep.columns exactly


We pick only the important HRV numbers from the big table.



In [None]:
print(hrv_with_sleep.columns)


Index(['HR_mean', 'HR_std', 'rmssd_mean', 'sdnn_mean', 'lf/hf_mean', 'date',
       'sleep_duration', 'sleep_efficiency', 'sleep_flag'],
      dtype='object')


In [None]:
# Define numeric columns for rolling aggregation
numeric_cols = ['HR_mean', 'HR_std', 'rmssd_mean', 'sdnn_mean', 'lf/hf_mean']

# Set window size
window = '30min'  # can also use an integer for fixed-row windows

# Perform rolling aggregation
hrv_windowed = hrv_with_sleep[numeric_cols].rolling(window=window).agg({
    'HR_mean': ['mean', 'std'],
    'rmssd_mean': 'mean',
    'sdnn_mean': 'mean',
    'lf/hf_mean': 'mean'
})

# Flatten the MultiIndex columns after aggregation
hrv_windowed.columns = ['_'.join(col).strip() for col in hrv_windowed.columns.values]

# Add sleep info back
hrv_windowed['sleep_flag'] = hrv_with_sleep['sleep_flag']
hrv_windowed['sleep_duration'] = hrv_with_sleep['sleep_duration']
hrv_windowed['sleep_efficiency'] = hrv_with_sleep['sleep_efficiency']

print(hrv_windowed.head())


                               HR_mean_mean  HR_mean_std  rmssd_mean_mean  \
1970-01-01 00:00:00.000000000     69.263992          NaN        78.583185   
1970-01-01 00:00:00.000000001     69.263996     0.000006        78.583191   
1970-01-01 00:00:00.000000002     69.264000     0.000008        78.583197   
1970-01-01 00:00:00.000000003     69.264004     0.000011        78.583203   
1970-01-01 00:00:00.000000004     69.264008     0.000013        78.583209   

                               sdnn_mean_mean  lf/hf_mean_mean  sleep_flag  \
1970-01-01 00:00:00.000000000       90.267004         1.382869           0   
1970-01-01 00:00:00.000000001       90.266999         1.382868           0   
1970-01-01 00:00:00.000000002       90.266993         1.382868           0   
1970-01-01 00:00:00.000000003       90.266987         1.382867           0   
1970-01-01 00:00:00.000000004       90.266982         1.382867           0   

                               sleep_duration  sleep_efficiency  
19

Now we want to look at data in 30-minute chunks.

For every 30-min chunk:

Calculate average HR (mean) and how much it jumps around (std).

Calculate averages of other HRV numbers.

After the rolling stats, column names look weird like ('HR_mean', 'mean').

We rename them to HR_mean_mean, etc., so they are easy to read.

Put back sleep info for each 30-min window:

sleep_flag → 1 if asleep, 0 if awake

sleep_duration → how long they slept that night

sleep_efficiency → how good their sleep was

Look at the first few rows to make sure numbers make sense.

In [None]:
# HRV date range
print(hrv_windowed.index.min(), hrv_windowed.index.max())

# Sleep diary dates
sleep['date'] = pd.to_datetime(sleep['date']).dt.date
print(sleep['date'].min(), sleep['date'].max())


1970-01-01 00:00:00 1970-01-01 00:00:00.000730450
2021-03-09 2021-04-05


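Convert the sleep diary dates to plain dates and print both date ranges.

The output shows the problem: the HRV index sits entirely on 1970-01-01, while the diary covers 2021-03-09 to 2021-04-05. The two ranges never overlap, so no HRV row can fall inside a sleep interval and sleep_flag stays 0 everywhere.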
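
A small check like the one sketched here can catch this kind of mismatch before any matching is attempted. The helper name `ranges_overlap` is hypothetical, and the two example indexes are made up: one built from a millisecond epoch value, and one built from plain integers, which pandas reads as nanoseconds since 1970.

```python
import pandas as pd

def ranges_overlap(index, first_day, last_day):
    """True if any timestamp in `index` falls between first_day and last_day (inclusive)."""
    start = pd.Timestamp(first_day)
    end = pd.Timestamp(last_day) + pd.Timedelta(days=1)  # include the whole last day
    index = pd.DatetimeIndex(index)
    return bool(((index >= start) & (index < end)).any())

good = pd.to_datetime([1617262425031], unit="ms")  # a real 2021 timestamp
bad = pd.to_datetime([0, 1, 2])                    # integers read as nanoseconds -> 1970

good_ok = ranges_overlap(good, "2021-03-09", "2021-04-05")
bad_ok = ranges_overlap(bad, "2021-03-09", "2021-04-05")

print(good_ok)
print(bad_ok)
```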

In [None]:
# Step 1: Check HRV data after rolling
print("✅ HRV windowed shape:", hrv_windowed.shape)
print(hrv_windowed.head(5))

# Step 2: Check 'date' column
print("\n✅ 'date' column in HRV windowed:")
print(hrv_windowed['date'].head(5))

# Step 3: Check sleep diary dates
print("\n✅ Sleep diary 'date' column type and head:")
print(sleep['date'].dtype)
print(sleep[['date','sleep_duration','sleep_efficiency']].head(5))

# Step 4: After merging HRV with sleep
print("\n✅ HRV with sleep shape:", hrv_with_sleep.shape)
print(hrv_with_sleep[['HR_mean','rmssd_mean','sdnn_mean','lf/hf_mean','sleep_flag',
                      'sleep_duration','sleep_efficiency']].head(10))

# Step 5: Check for NaNs
print("\n✅ Count of NaNs in each column:")
print(hrv_with_sleep.isna().sum())


✅ HRV windowed shape: (730451, 9)
                               HR_mean_mean  HR_mean_std  rmssd_mean_mean  \
1970-01-01 00:00:00.000000000     69.263992          NaN        78.583185   
1970-01-01 00:00:00.000000001     69.263996     0.000006        78.583191   
1970-01-01 00:00:00.000000002     69.264000     0.000008        78.583197   
1970-01-01 00:00:00.000000003     69.264004     0.000011        78.583203   
1970-01-01 00:00:00.000000004     69.264008     0.000013        78.583209   

                               sdnn_mean_mean  lf/hf_mean_mean  sleep_flag  \
1970-01-01 00:00:00.000000000       90.267004         1.382869           0   
1970-01-01 00:00:00.000000001       90.266999         1.382868           0   
1970-01-01 00:00:00.000000002       90.266993         1.382868           0   
1970-01-01 00:00:00.000000003       90.266987         1.382867           0   
1970-01-01 00:00:00.000000004       90.266982         1.382867           0   

                               sle

Put ts_start back as a column.

Convert from milliseconds since 1970 to normal datetime.

Set the timeline properly so all HRV rows know when they happened.

In [None]:
# Reset index so ts_start is a column again
hrv.reset_index(inplace=True)  # now ts_start is back as a column
print(hrv[['ts_start']].head())


                 ts_start
0 2018-01-01 20:17:11.436
1 2021-03-04 03:40:01.055
2 2021-03-04 03:40:27.070
3 2021-03-04 03:40:55.745
4 2021-03-04 03:55:55.963


In [None]:
hrv['ts_start'] = pd.to_datetime(hrv['ts_start'], unit='ms')
hrv['ts_end'] = pd.to_datetime(hrv['ts_end'], unit='ms')
hrv.set_index('ts_start', inplace=True)
print("✅ HRV timestamps fixed:")
print(hrv.head())


✅ HRV timestamps fixed:
                        deviceId                  ts_end  missingness_score  \
ts_start                                                                      
2018-01-01 20:17:11.436     sm34 2018-01-01 20:22:11.377           0.133749   
2021-03-04 03:40:01.055     ev76 2021-03-04 03:45:01.053           0.293910   
2021-03-04 03:40:27.070     pw85 2021-03-04 03:45:26.980           0.249228   
2021-03-04 03:40:55.745     nd56 2021-03-04 03:45:55.710           0.073167   
2021-03-04 03:55:55.963     nd56 2021-03-04 04:00:55.787           0.144252   

                                HR         ibi  acc_x_avg  acc_y_avg  \
ts_start                                                               
2018-01-01 20:17:11.436  69.263992  931.515005  -1.544450   6.299788   
2021-03-04 03:40:01.055  73.644957  836.224877   0.067789  -3.367955   
2021-03-04 03:40:27.070  77.520160  763.874719  -1.407254   0.875035   
2021-03-04 03:40:55.745  73.368159  820.535691   0.549644   1.

In [None]:
# 1️⃣ Convert 'date' to datetime
sleep['date'] = pd.to_datetime(sleep['date'], errors='coerce')

# 2️⃣ Ensure asleep/wakeup are strings
sleep['asleep'] = sleep['asleep'].astype(str)
sleep['wakeup'] = sleep['wakeup'].astype(str)

# 3️⃣ Combine date + time as string, then convert to datetime
sleep['asleep'] = pd.to_datetime(sleep['date'].dt.strftime('%Y-%m-%d') + ' ' + sleep['asleep'], errors='coerce')
sleep['wakeup'] = pd.to_datetime(sleep['date'].dt.strftime('%Y-%m-%d') + ' ' + sleep['wakeup'], errors='coerce')

print("✅ Sleep timestamps fixed:")
print(sleep[['date','asleep','wakeup']].head())




✅ Sleep timestamps fixed:
        date                     asleep                     wakeup
0 2021-03-09  2021-03-09 02:00:00-09:00  2021-03-09 08:30:00-09:00
1 2021-03-10  2021-03-10 00:40:00-10:00  2021-03-10 07:50:00-10:00
2 2021-03-11  2021-03-11 01:00:00-11:00  2021-03-11 07:40:00-11:00
3 2021-03-12  2021-03-12 01:00:00-12:00  2021-03-12 07:50:00-12:00
4 2021-03-13  2021-03-13 01:30:00-13:00  2021-03-13 10:30:00-13:00




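Combine date + asleep/wakeup times into full timestamps.

The output above does not look right, though: every value carries a UTC offset equal to the day of the month (-09:00 on the 9th, -10:00 on the 10th, and so on). A likely cause is that asleep and wakeup already held full datetimes from the earlier cells, so the date was prepended a second time and the parser read the repeated date as an offset. If that is the cause, reloading sleep_diary.csv before running this cell would avoid it.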

In [None]:
# Optional: create date column for daily merge
hrv['date'] = hrv.index.date
sleep['date'] = pd.to_datetime(sleep['date']).dt.date

hrv_with_sleep = hrv.merge(sleep[['date','sleep_duration','sleep_efficiency']],
                           on='date', how='left')
print("✅ After merge:")
print(hrv_with_sleep.head())
print(hrv_with_sleep.isna().sum())


✅ After merge:
  deviceId                  ts_end  missingness_score         HR         ibi  \
0     sm34 2018-01-01 20:22:11.377           0.133749  69.263992  931.515005   
1     ev76 2021-03-04 03:45:01.053           0.293910  73.644957  836.224877   
2     pw85 2021-03-04 03:45:26.980           0.249228  77.520160  763.874719   
3     nd56 2021-03-04 03:45:55.710           0.073167  73.368159  820.535691   
4     nd56 2021-03-04 04:00:55.787           0.144252  70.581504  849.956465   

   acc_x_avg  acc_y_avg  acc_z_avg  grv_x_avg  grv_y_avg  ...       sdsd  \
0  -1.544450   6.299788   1.859824   0.140348   0.083587  ...  56.970503   
1   0.067789  -3.367955   7.529114  -0.283839   0.434003  ...  67.244560   
2  -1.407254   0.875035   9.599346   0.759207  -0.459268  ...  81.792090   
3   0.549644   1.745705   9.557753  -0.391295  -0.821447  ...  42.366288   
4   4.016267  -6.006324   6.395449  -0.854227  -0.112696  ...  64.387509   

        rmssd     pnn20     pnn50           lf 

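Combine HRV and sleep diary using date.

Create a sleep_flag → 1 if a diary entry exists for that date, else 0. This is a day-level label: it says a night was recorded for that date, not that the person was asleep at the moment of the reading.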
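
The diary has 1,372 rows for dates between 2021-03-09 and 2021-04-05, so each date appears many times, and a merge on date alone pairs every HRV row with every diary row for that day. The sketch below shows one way to guard against that with the `validate` argument of `merge`. It assumes the diary has a participant column that matches `deviceId`; that column name is hypothetical and may differ in the real file.

```python
import pandas as pd

# Made-up HRV rows for two devices on the same day
hrv_demo = pd.DataFrame({
    "deviceId": ["ab60", "ab60", "nd56"],
    "date": pd.to_datetime(["2021-03-10", "2021-03-10", "2021-03-10"]),
    "HR": [78.6, 75.6, 73.4],
})
# Made-up diary; the deviceId column here is an assumption
diary_demo = pd.DataFrame({
    "deviceId": ["ab60", "nd56"],
    "date": pd.to_datetime(["2021-03-10", "2021-03-10"]),
    "sleep_duration": [6.5, 8.0],
})

# Merge on participant + date; validate checks the diary has at most one row per key
merged = hrv_demo.merge(diary_demo, on=["deviceId", "date"], how="left", validate="many_to_one")

# Merging on date alone fails the same check, because the date repeats in the diary
try:
    hrv_demo.merge(diary_demo[["date", "sleep_duration"]], on="date", how="left",
                   validate="many_to_one")
    date_only_ok = True
except pd.errors.MergeError:
    date_only_ok = False

print(merged)
print("date-only merge passes check:", date_only_ok)
```

With the guard in place, the row count after the merge stays equal to the number of HRV rows.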

In [None]:
# 1️⃣ Check current columns
print("Columns in HRV_with_sleep:")
print(hrv_with_sleep.columns)

# 2️⃣ Identify timestamp column
# Example: if ts_end exists
# Convert to datetime
hrv_with_sleep['ts_end'] = pd.to_datetime(hrv_with_sleep['ts_end'], unit='ms', errors='coerce')

# 3️⃣ Set it as index
hrv_with_sleep.set_index('ts_end', inplace=True)

# 4️⃣ Create 'date' column
hrv_with_sleep['date'] = hrv_with_sleep.index.floor('D')

# 5️⃣ Check
print("✅ HRV date column head:")
print(hrv_with_sleep[['date']].head())


Columns in HRV_with_sleep:
Index(['deviceId', 'ts_end', 'missingness_score', 'HR', 'ibi', 'acc_x_avg',
       'acc_y_avg', 'acc_z_avg', 'grv_x_avg', 'grv_y_avg', 'grv_z_avg',
       'grv_w_avg', 'gyr_x_avg', 'gyr_y_avg', 'gyr_z_avg', 'steps', 'distance',
       'calories', 'light_avg', 'sdnn', 'sdsd', 'rmssd', 'pnn20', 'pnn50',
       'lf', 'hf', 'lf/hf', 'date', 'sleep_duration', 'sleep_efficiency'],
      dtype='object')
✅ HRV date column head:
                              date
ts_end                            
2018-01-01 20:22:11.377 2018-01-01
2021-03-04 03:45:01.053 2021-03-04
2021-03-04 03:45:26.980 2021-03-04
2021-03-04 03:45:55.710 2021-03-04
2021-03-04 04:00:55.787 2021-03-04


In [None]:
import pandas as pd

# 1️⃣ Load the data again
base = "/content/drive/MyDrive/stress-project/data_raw/hrv_sleep"
hrv = pd.read_csv(f"{base}/sensor_hrv_filtered.csv")
sleep = pd.read_csv(f"{base}/sleep_diary.csv")

print("✅ HRV head:\n", hrv.head())
print("✅ Sleep diary head:\n", sleep.head())

# 2️⃣ Ensure sleep diary 'date' is datetime
sleep['date'] = pd.to_datetime(sleep['date'], errors='coerce')

# 3️⃣ Create 'date' column in HRV to match sleep diary
hrv['date'] = pd.to_datetime(hrv['ts_end'], unit='ms').dt.floor('D')

# 4️⃣ Merge HRV with sleep diary on 'date'
hrv_with_sleep = hrv.merge(
    sleep[['date', 'sleep_duration', 'sleep_efficiency']],
    on='date',
    how='left'
)

# 5️⃣ Create sleep_flag: 1 if sleep_duration exists, else 0
hrv_with_sleep['sleep_flag'] = hrv_with_sleep['sleep_duration'].notna().astype(int)

# 6️⃣ Print to check
print("✅ After merge, HRV_with_sleep head:")
print(hrv_with_sleep.head())

print("✅ sleep_flag counts:")
print(hrv_with_sleep['sleep_flag'].value_counts())

print("✅ Count of NaNs in HRV_with_sleep:")
print(hrv_with_sleep.isna().sum())


✅ HRV head:
   deviceId       ts_start         ts_end  missingness_score         HR  \
0     ab60  1617262425031  1617262724833           0.295448  84.592816   
1     ab60  1616736817151  1616737116986           0.239085  78.589565   
2     ab60  1616736517083  1616736816952           0.100773  75.620524   
3     ab60  1616736217077  1616736516883           0.268178  85.813165   
4     ab60  1616734416800  1616734716672           0.043466  76.944500   

          ibi  acc_x_avg  acc_y_avg  acc_z_avg  grv_x_avg  ...  calories  \
0  728.534374   0.284765  -0.593973   9.195984  -0.094203  ...  0.000000   
1  781.896913   3.050179  -1.239353   5.790543  -0.211973  ...  0.085083   
2  812.183910   2.153267  -3.546833   8.499866  -0.628970  ...       NaN   
3  769.754943   2.898409  -3.401356   4.606113  -0.249247  ...  1.375000   
4  775.190053  -0.050221  -6.576164   5.377019   0.715893  ...  0.000000   

    light_avg        sdnn       sdsd       rmssd     pnn20     pnn50  \
0  841.324415

In [None]:
# 1️⃣ Ensure datetime index
hrv_with_sleep.index = pd.to_datetime(hrv_with_sleep['ts_start'], unit='ms')

# 2️⃣ Select numeric columns only
numeric_cols = hrv_with_sleep.select_dtypes(include='number').columns.tolist()
print("Numeric columns:", numeric_cols)

# 3️⃣ Resample HRV to 5-minute intervals using only numeric columns
hrv_resampled = hrv_with_sleep[numeric_cols].resample('5min').mean()  # '5min' replaces the deprecated '5T' alias

# 4️⃣ Add sleep_flag back (use max to propagate sleep periods)
hrv_resampled['sleep_flag'] = hrv_with_sleep['sleep_flag'].resample('5min').max()

# 5️⃣ Compute rolling 30-min window stats
hrv_windowed = hrv_resampled.rolling('30min').agg({
    'HR': ['mean','std'],
    'rmssd': 'mean',
    'sdnn': 'mean',
    'lf/hf': 'mean'
})

# 6️⃣ Flatten MultiIndex columns
hrv_windowed.columns = ['_'.join(col).strip() for col in hrv_windowed.columns.values]

# ✅ Check result
print(hrv_windowed.head())
print("NaNs per column:\n", hrv_windowed.isna().sum())


Numeric columns: ['ts_start', 'ts_end', 'missingness_score', 'HR', 'ibi', 'acc_x_avg', 'acc_y_avg', 'acc_z_avg', 'grv_x_avg', 'grv_y_avg', 'grv_z_avg', 'grv_w_avg', 'gyr_x_avg', 'gyr_y_avg', 'gyr_z_avg', 'steps', 'distance', 'calories', 'light_avg', 'sdnn', 'sdsd', 'rmssd', 'pnn20', 'pnn50', 'lf', 'hf', 'lf/hf', 'sleep_duration', 'sleep_efficiency', 'sleep_flag']
                       HR_mean  HR_std  rmssd_mean  sdnn_mean  lf/hf_mean
ts_start                                                                 
2018-01-01 20:15:00  69.263992     NaN   78.583185  90.267004    1.382869
2018-01-01 20:20:00  69.263992     NaN   78.583185  90.267004    1.382869
2018-01-01 20:25:00  69.263992     NaN   78.583185  90.267004    1.382869
2018-01-01 20:30:00  69.263992     NaN   78.583185  90.267004    1.382869
2018-01-01 20:35:00  69.263992     NaN   78.583185  90.267004    1.382869
NaNs per column:
 HR_mean       334562
HR_std        334849
rmssd_mean    334562
sdnn_mean     334562
lf/hf_mean    

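Group data into 5-minute blocks, then compute 30-minute rolling averages.

sleep_flag is resampled alongside (using max), but the rolling step keeps only the five HRV columns, so hrv_windowed itself carries no sleep columns.

Flatten column names for easy reading.

The large NaN counts are most likely empty 5-minute bins: the grid starts at the single 2018-01-01 reading and runs through to 2021, and most of those bins contain no data.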
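
The head of the raw data shows several deviceId values (ab60, sm34, ev76, and others), which suggests more than one participant in the file. If so, resampling everything together averages different people into the same 5-minute bin. A minimal sketch of doing the binning and rolling per device, while carrying sleep_flag along, is below; the frame and its values are made up, and grouping by deviceId is an assumption about how the data should be split.

```python
import pandas as pd

# Made-up readings from two devices
demo = pd.DataFrame({
    "deviceId": ["a", "a", "a", "b"],
    "ts_start": pd.to_datetime(["2021-03-04 03:40:01", "2021-03-04 03:42:30",
                                "2021-03-04 03:46:00", "2021-03-04 03:41:00"]),
    "HR": [74.0, 76.0, 80.0, 60.0],
    "sleep_flag": [1, 1, 0, 1],
})

# 5-minute bins per device: mean HR, and max of sleep_flag within the bin
binned = (
    demo.groupby(["deviceId", pd.Grouper(key="ts_start", freq="5min")])
    .agg(HR=("HR", "mean"), sleep_flag=("sleep_flag", "max"))
    .reset_index()
    .dropna(subset=["HR"])  # drop any empty bins
    .sort_values(["deviceId", "ts_start"])
    .reset_index(drop=True)
)

# 30-minute rolling mean of HR, computed separately for each device
rolled = (
    binned.set_index("ts_start")
    .groupby("deviceId")["HR"]
    .rolling("30min")
    .mean()
    .rename("HR_mean")
    .reset_index()
)

# Join sleep_flag back so the final table keeps the label
result = rolled.merge(binned[["deviceId", "ts_start", "sleep_flag"]], on=["deviceId", "ts_start"])

print(result)
```

Because the label is merged back after the rolling step, the table that gets saved contains both the HRV features and sleep_flag.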

In [None]:
import os

# Create folder if it doesn't exist
os.makedirs("/content/drive/MyDrive/stress-project/processed_data", exist_ok=True)

# Define full path
save_path = "/content/drive/MyDrive/stress-project/processed_data/hrv_processed_week3.parquet"

# Save the processed HRV data
hrv_windowed.to_parquet(save_path, index=True)

print(f"✅ Processed HRV data saved at: {save_path}")


✅ Processed HRV data saved at: /content/drive/MyDrive/stress-project/processed_data/hrv_processed_week3.parquet


Make a folder for processed data if it doesn’t exist.

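Save the windowed HRV table (hrv_windowed) into a Parquet file. As built in the previous cell, it holds the five HRV features only; sleep_flag, sleep_duration and sleep_efficiency would need to be joined back in before saving if they are needed later.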

Now it’s ready for analysis!