# Preliminary Data Exploration

### Sleep Data

**1. Verify the Structure**

The sleep data is split into two datasets per day: the first records heart rate values during sleep, and the second records the sleep stages. Let's see an example.

Sample day: 2024-12-27

In [23]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import random

In [6]:
sample_day = '2024-12-27'

stages_sleep_directory = "../../data/raw/sleep/stages"
heart_rates_sleep_directory = "../../data/raw/sleep/heart_rates"

stages_sleep = pd.read_csv(os.path.join(stages_sleep_directory, f"{sample_day}.csv"))
heart_rates_sleep = pd.read_csv(os.path.join(heart_rates_sleep_directory, f"{sample_day}.csv"))

In [8]:
stages_sleep.head()

Unnamed: 0,start_time,end_time,sleep_stage
0,00:59,01:03,Light
1,01:03,01:24,Deep
2,01:24,01:29,Light
3,01:29,01:35,Deep
4,01:35,02:41,Light


In [7]:
heart_rates_sleep.head()

Unnamed: 0,timestamp,heart_rate
0,54,00:58
1,53,01:00
2,53,01:02
3,54,01:04
4,54,01:06


This quick look reveals an error in data collection: the `timestamp` and `heart_rate` column headers are swapped, so `timestamp` holds the heart rate values and `heart_rate` holds the times.

The goal is to obtain a single dataset for each day that joins the two datasets. Let's see an example of the target layout.

In [10]:
data = {
    'timestamp': ['00:58', '01:00', '01:02', '01:04', '01:06'],
    'heart_rate': [54, 53, 53, 54, 54],
    'sleep_stage': ['Deep', 'Deep', 'Deep', 'Deep', 'Light']
}
goal_df = pd.DataFrame(data)

goal_df.head()

Unnamed: 0,timestamp,heart_rate,sleep_stage
0,00:58,54,Deep
1,01:00,53,Deep
2,01:02,53,Deep
3,01:04,54,Deep
4,01:06,54,Light
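
One way to build such a table is to look up, for every heart rate reading, the stage interval that contains its timestamp. The sketch below is a minimal illustration on the sample rows shown earlier, not a definitive implementation. `to_minutes` and `attach_sleep_stage` are hypothetical helper names, the heart rate columns are assumed to be in the correct order already, intervals are treated as half-open (`start_time <= t < end_time`), and the night is assumed not to cross midnight. The labels it produces follow the `stages_sleep` rows, so they need not match the hand-written labels in `goal_df`.

```python
import pandas as pd


def to_minutes(hhmm: str) -> int:
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def attach_sleep_stage(heart_rates: pd.DataFrame, stages: pd.DataFrame) -> pd.DataFrame:
    """Add a sleep_stage column by finding the stage interval containing each reading."""
    result = heart_rates.copy()
    starts = stages["start_time"].map(to_minutes)
    ends = stages["end_time"].map(to_minutes)

    labels = []
    for minute in result["timestamp"].map(to_minutes):
        # Assumption: half-open intervals, so a boundary minute belongs to the later stage
        mask = (starts <= minute) & (minute < ends)
        labels.append(stages.loc[mask, "sleep_stage"].iloc[0] if mask.any() else None)

    result["sleep_stage"] = labels
    return result


# Sample rows copied from the two previews above (heart rate columns already corrected)
heart_rates = pd.DataFrame({
    "timestamp": ["00:58", "01:00", "01:02", "01:04", "01:06"],
    "heart_rate": [54, 53, 53, 54, 54],
})
stages = pd.DataFrame({
    "start_time": ["00:59", "01:03", "01:24", "01:29", "01:35"],
    "end_time": ["01:03", "01:24", "01:29", "01:35", "02:41"],
    "sleep_stage": ["Light", "Deep", "Light", "Deep", "Light"],
})

merged = attach_sleep_stage(heart_rates, stages)
print(merged)
```

Readings that fall outside every stage interval, such as the one at `00:58` here, get no label; how to treat them is a separate decision.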


Let's remove all the files that are not relevant to our analysis, i.e., those dated after 2025-01-06 (e.g., '2025-01-07.csv'). Note that this deletes the files from disk.

In [31]:
def remove_not_relevant_files(directory, last_day="2025-01-06"):
    # Delete every CSV whose filename date falls after the last day of the analysis
    cutoff = pd.Timestamp(last_day)
    for f in os.listdir(directory):
        if f.endswith(".csv"):
            # Compare full dates: checking year, month and day separately
            # would keep files such as 2025-02-01.csv
            date = pd.Timestamp(f.split(".")[0])
            if date > cutoff:
                file_path = os.path.join(directory, f)
                os.remove(file_path)
                print(f"File removed: {file_path}")

remove_not_relevant_files(stages_sleep_directory)
remove_not_relevant_files(heart_rates_sleep_directory)

Fix the error in the heart rate datasets by swapping the two column names (see above for more details about the error).

In [22]:
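def reverse_columns(directory):
    for f in os.listdir(directory):
        if f.endswith(".csv"):
            file_path = os.path.join(directory, f)
            df = pd.read_csv(file_path)

            # Swap the headers only if the file still has the original (wrong) order,
            # so that re-running this cell does not undo the fix
            if df.columns.tolist() == ['timestamp', 'heart_rate']:
                df = df.rename(columns={'timestamp': 'heart_rate', 'heart_rate': 'timestamp'})
                df.to_csv(file_path, index=False)
                print("File saved successfully")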

reverse_columns(heart_rates_sleep_directory)

File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully


Verify that the error has been fixed by picking a random dataset from the `sleep/heart_rates` directory.

In [30]:
files = [os.path.join(heart_rates_sleep_directory, f) for f in os.listdir(heart_rates_sleep_directory)]

random_file = random.choice(files)

random_df = pd.read_csv(random_file)
random_df.head()

Unnamed: 0,heart_rate,timestamp
0,60,00:10
1,58,00:12
2,58,00:14
3,59,00:16
4,60,00:18


The error has been successfully fixed.
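
A single random file is only a spot check. As a complement, the sketch below tests whether a heart rate table has its values under the right headers: `heart_rate` should be numeric and `timestamp` should look like `HH:MM`. The function name `columns_look_correct` is hypothetical, and the two small frames are made-up stand-ins for a fixed and an unfixed file. In practice the same check could be applied to every CSV in the directory.

```python
import pandas as pd


def columns_look_correct(df: pd.DataFrame) -> bool:
    """Return True if heart_rate is numeric and timestamp matches the HH:MM pattern."""
    heart_rate_ok = pd.to_numeric(df["heart_rate"], errors="coerce").notna().all()
    timestamp_ok = df["timestamp"].astype(str).str.fullmatch(r"\d{2}:\d{2}").all()
    return bool(heart_rate_ok and timestamp_ok)


# Illustrative data: one table after the fix, one with the headers still swapped
fixed = pd.DataFrame({"heart_rate": [60, 58, 58], "timestamp": ["00:10", "00:12", "00:14"]})
swapped = pd.DataFrame({"timestamp": [60, 58, 58], "heart_rate": ["00:10", "00:12", "00:14"]})

print(columns_look_correct(fixed))
print(columns_look_correct(swapped))
```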

**2. Identify Missing Values**

2.1) Check for NaN values in all datasets

In [32]:
def check_for_nan_values(directory):
    counter = {}
    for f in os.listdir(directory):
        if f.endswith(".csv"):
            file_path = os.path.join(directory, f)
            df = pd.read_csv(file_path)

            missing_values = df.isnull().sum().to_dict()

            missing_values['total'] = sum(missing_values.values())

            counter[f] = missing_values

    return counter

In [38]:
heart_rate_sleep_nan_counter = check_for_nan_values(heart_rates_sleep_directory)
for f, report in heart_rate_sleep_nan_counter.items():
    print(f)
    print("\n")
    for key, value in report.items():
        print(key, ' --> ', value)
    print("\n")
    print("-"*40+"\n")

2024-12-21.csv


heart_rate  -->  0
timestamp  -->  0
total  -->  0


----------------------------------------

2024-12-22.csv


heart_rate  -->  0
timestamp  -->  0
total  -->  0


----------------------------------------

2024-12-23.csv


heart_rate  -->  0
timestamp  -->  0
total  -->  0


----------------------------------------

2024-12-24.csv


heart_rate  -->  0
timestamp  -->  0
total  -->  0


----------------------------------------

2024-12-25.csv


heart_rate  -->  0
timestamp  -->  0
total  -->  0


----------------------------------------

2024-12-26.csv


heart_rate  -->  0
timestamp  -->  0
total  -->  0


----------------------------------------

2024-12-27.csv


heart_rate  -->  0
timestamp  -->  0
total  -->  0


----------------------------------------

2024-12-28.csv


heart_rate  -->  0
timestamp  -->  0
total  -->  0


----------------------------------------

2024-12-29.csv


heart_rate  -->  0
timestamp  -->  0
total  -->  0


----------------------------------------

In [40]:
stages_sleep_nan_counter = check_for_nan_values(stages_sleep_directory)
for f, report in stages_sleep_nan_counter.items():
    print(f)
    print("\n")
    for key, value in report.items():
        print(key, ' --> ', value)
    print("\n")
    print("-"*40+"\n")

2024-12-21.csv


start_time  -->  0
end_time  -->  0
sleep_stage  -->  0
total  -->  0


----------------------------------------

2024-12-22.csv


start_time  -->  0
end_time  -->  0
sleep_stage  -->  0
total  -->  0


----------------------------------------

2024-12-23.csv


start_time  -->  0
end_time  -->  0
sleep_stage  -->  0
total  -->  0


----------------------------------------

2024-12-24.csv


start_time  -->  0
end_time  -->  0
sleep_stage  -->  0
total  -->  0


----------------------------------------

2024-12-25.csv


start_time  -->  0
end_time  -->  0
sleep_stage  -->  0
total  -->  0


----------------------------------------

2024-12-26.csv


start_time  -->  0
end_time  -->  0
sleep_stage  -->  0
total  -->  0


----------------------------------------

2024-12-27.csv


start_time  -->  0
end_time  -->  0
sleep_stage  -->  0
total  -->  0


----------------------------------------

2024-12-28.csv


start_time  -->  0
end_time  -->  0
sleep_stage  -->  0
total  -->  0

No missing values (NaN) were found.
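
Because the per-file printout is long, the dictionary returned by `check_for_nan_values` can also be condensed into a list of the files that need attention. The sketch below assumes the report layout produced above (file name mapped to per-column counts plus a `total` key). The helper name `files_with_missing_values` and the sample report, including its non-zero count, are hypothetical.

```python
def files_with_missing_values(counter: dict) -> list:
    """Return the names of the files whose report contains at least one NaN."""
    return sorted(name for name, report in counter.items() if report["total"] > 0)


# Hypothetical report in the same shape as the output of check_for_nan_values
sample_report = {
    "2024-12-21.csv": {"heart_rate": 0, "timestamp": 0, "total": 0},
    "2024-12-22.csv": {"heart_rate": 2, "timestamp": 0, "total": 2},
}

affected = files_with_missing_values(sample_report)
print(affected if affected else "No missing values (NaN) found")
```

An empty list then corresponds to the result reported here for both directories.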

2.2) For each sleep stages dataset, check whether any time slot is missing.

Example:
01:07,01:36,Deep
01:45,01:59,Light

In the case above, the time slot from `01:36` to `01:45` is missing.

In [41]:
def check_missing_time_slots(directory):
    missing_slots_report = {}

    for file in os.listdir(directory):
        if file.endswith(".csv"):
            file_path = os.path.join(directory, file)
            df = pd.read_csv(file_path)

            df['start_time'] = pd.to_datetime(df['start_time'], format='%H:%M')
            df['end_time'] = pd.to_datetime(df['end_time'], format='%H:%M')

            missing_slots = []

            for i in range(len(df) - 1):
                current_end = df.loc[i, 'end_time']
                next_start = df.loc[i + 1, 'start_time']
                if current_end != next_start:
                    missing_slots.append(f"{current_end.strftime('%H:%M')} to {next_start.strftime('%H:%M')}")

            missing_slots_report[file] = missing_slots if missing_slots else "No missing slots"
    return missing_slots_report

In [43]:
missing_slots_report = check_missing_time_slots(stages_sleep_directory)

for file, missing_slots in missing_slots_report.items():
    print(f"File: {file}")
    if missing_slots == "No missing slots":
        print("No missing slots.")
    else:
        print("Missing slots:")
        for slot in missing_slots:
            print(f"  - {slot}")
    print("-" * 40)

File: 2024-12-21.csv
No missing slots.
----------------------------------------
File: 2024-12-22.csv
No missing slots.
----------------------------------------
File: 2024-12-23.csv
No missing slots.
----------------------------------------
File: 2024-12-24.csv
No missing slots.
----------------------------------------
File: 2024-12-25.csv
No missing slots.
----------------------------------------
File: 2024-12-26.csv
No missing slots.
----------------------------------------
File: 2024-12-27.csv
No missing slots.
----------------------------------------
File: 2024-12-28.csv
No missing slots.
----------------------------------------
File: 2024-12-29.csv
No missing slots.
----------------------------------------
File: 2024-12-30.csv
No missing slots.
----------------------------------------
File: 2024-12-31.csv
No missing slots.
----------------------------------------
File: 2025-01-01.csv
No missing slots.
----------------------------------------
File: 2025-01-02.csv
No missing slots.
----------------------------------------

No missing slots were found.
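
Since none of the files contains a gap, it can be useful to confirm that the check actually fires on the two-row example from the start of this step. The sketch below is a vectorized variant of the same idea, comparing each `start_time` with the previous row's `end_time`. The function name `find_gaps` is hypothetical, and the string comparison assumes zero-padded `HH:MM` values.

```python
import pandas as pd


def find_gaps(stages: pd.DataFrame) -> list:
    """List the slots where a stage does not start when the previous one ends."""
    previous_end = stages["end_time"].shift(1)
    # The first row has no predecessor, so it can never open a gap
    gap_mask = previous_end.notna() & (previous_end != stages["start_time"])
    return [
        f"{end} to {start}"
        for end, start in zip(previous_end[gap_mask], stages.loc[gap_mask, "start_time"])
    ]


# The example with a missing slot from the text above
example = pd.DataFrame({
    "start_time": ["01:07", "01:45"],
    "end_time": ["01:36", "01:59"],
    "sleep_stage": ["Deep", "Light"],
})

print(find_gaps(example))
```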

2.3) Calculate the completeness of each sleep heart rate dataset

This function evaluates the completeness of each night's heart rate dataset by comparing the actual number of rows with the number expected from the sleep duration, assuming one reading every 2 minutes as in the samples above. Completeness is expressed as a percentage, truncated to an integer.

In [107]:
def calculate_dataset_completeness(stages_directory, heart_rates_directory):
    completeness_report = {}
    
    for stages_file in os.listdir(stages_directory):
        if stages_file.endswith('.csv'):
            stages_path = os.path.join(stages_directory, stages_file)
            heart_rates_path = os.path.join(heart_rates_directory, stages_file)
            
            stages_df = pd.read_csv(stages_path)
            heart_rates_df = pd.read_csv(heart_rates_path)

            day = stages_file.split('.')[0]

            # Without any sleep stage rows there is no sleep window to measure against
            if stages_df.empty:
                completeness_report[day] = "No data available"
                continue

            # The sleep window runs from the first stage's start to the last stage's end
            start_time = pd.to_datetime(stages_df['start_time'].iloc[0], format='%H:%M')
            end_time = pd.to_datetime(stages_df['end_time'].iloc[-1], format='%H:%M')

            # `.seconds` wraps around midnight, so a night from 23:30 to 07:00 yields 450 minutes
            total_sleep_minutes = (end_time - start_time).seconds // 60

            # One heart rate reading is expected every 2 minutes
            expected_rows = total_sleep_minutes // 2
            if expected_rows == 0:
                # Sleep window shorter than 2 minutes: avoid dividing by zero
                completeness_report[day] = "No data available"
                continue

            actual_rows = len(heart_rates_df)
            completeness_report[day] = int((actual_rows / expected_rows) * 100)
    
    return completeness_report

# Example
completeness_report = calculate_dataset_completeness(stages_sleep_directory, heart_rates_sleep_directory)

for day, completeness in completeness_report.items():
    print(f"{day}: {completeness}")

2024-12-21: No data available
2024-12-22: 100
2024-12-23: 100
2024-12-24: 100
2024-12-25: 100
2024-12-26: 100
2024-12-27: 100
2024-12-28: 100
2024-12-29: 100
2024-12-30: 100
2024-12-31: 100
2025-01-01: No data available
2025-01-02: 100
2025-01-03: 100
2025-01-04: 100
2025-01-05: 100
2025-01-06: 100
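
To make the arithmetic concrete, the sketch below reproduces the completeness formula on made-up numbers. The helper names `to_minutes` and `completeness` and the input values are hypothetical; the 2-minute interval and the truncation to an integer follow the function above.

```python
def to_minutes(hhmm: str) -> int:
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def completeness(start: str, end: str, actual_rows: int, interval_minutes: int = 2) -> int:
    """Percentage of expected readings that are present, truncated to an integer."""
    # The modulo handles nights that cross midnight
    sleep_minutes = (to_minutes(end) - to_minutes(start)) % (24 * 60)
    expected_rows = sleep_minutes // interval_minutes
    if expected_rows == 0:
        raise ValueError("sleep window is shorter than one sampling interval")
    return int(actual_rows / expected_rows * 100)


# Hypothetical night: 23:30 to 07:00 is 450 minutes, so 225 readings are expected
print(completeness("23:30", "07:00", actual_rows=220))  # 220 / 225 -> 97
```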
