# Preliminary Data Exploration

### Sleep Data

**1. Verify the Structure**

The sleep data is split into two datasets per day: the first records heart rate values during sleep, and the second records the sleep stages. Let's look at an example.

Sample day: 2024-12-27

In [23]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import random

In [6]:
sample_day = '2024-12-27'

stages_sleep_directory = "../../data/raw/sleep/stages"
heart_rates_sleep_directory = "../../data/raw/sleep/heart_rates"

stages_sleep = pd.read_csv(os.path.join(stages_sleep_directory, f"{sample_day}.csv"))
heart_rates_sleep = pd.read_csv(os.path.join(heart_rates_sleep_directory, f"{sample_day}.csv"))

In [8]:
stages_sleep.head()

Unnamed: 0,start_time,end_time,sleep_stage
0,00:59,01:03,Light
1,01:03,01:24,Deep
2,01:24,01:29,Light
3,01:29,01:35,Deep
4,01:35,02:41,Light


In [7]:
heart_rates_sleep.head()

Unnamed: 0,timestamp,heart_rate
0,54,00:58
1,53,01:00
2,53,01:02
3,54,01:04
4,54,01:06


This quick look reveals a data collection error: the `timestamp` and `heart_rate` columns are swapped, so `timestamp` holds the heart rate values and `heart_rate` holds the times.
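The same observation can be checked in code instead of by eye. The sketch below uses a hypothetical helper, `looks_swapped`, which is not part of the original notebook. It assumes that a correct file stores times as `HH:MM` strings and heart rates as numbers.

```python
import pandas as pd

def looks_swapped(df: pd.DataFrame) -> bool:
    """Return True if `timestamp` holds numbers and `heart_rate` holds HH:MM strings."""
    # In a swapped file, pandas parses the `timestamp` column as integers.
    timestamp_is_numeric = pd.api.types.is_numeric_dtype(df["timestamp"])
    # In a swapped file, every `heart_rate` value looks like a clock time.
    heart_rate_is_time = (
        df["heart_rate"].astype(str).str.fullmatch(r"\d{1,2}:\d{2}").all()
    )
    return bool(timestamp_is_numeric and heart_rate_is_time)
```

Calling `looks_swapped(heart_rates_sleep)` on the sample day should flag the file, since its `timestamp` column contains values such as 54 and 53.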

The goal is to obtain a single dataset for each day that joins the two datasets, so that every heart rate reading is labelled with a sleep stage. Let's see an example of the target shape.

In [10]:
data = {
    'timestamp': ['00:58', '01:00', '01:02', '01:04', '01:06'],
    'heart_rate': [54, 53, 53, 54, 54],
    'sleep_stage': ['Deep', 'Deep', 'Deep', 'Deep', 'Light']
}
goal_df = pd.DataFrame(data)

goal_df.head()

Unnamed: 0,timestamp,heart_rate,sleep_stage
0,00:58,54,Deep
1,01:00,53,Deep
2,01:02,53,Deep
3,01:04,54,Deep
4,01:06,54,Light
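One possible way to build such a table is to assign each heart rate reading to the stage interval that contains its timestamp. The following is only a sketch under stated assumptions, not the notebook's actual method: `to_minutes` and `join_stage` are hypothetical helpers, the heart rate columns are assumed to be already corrected, intervals are treated as `start_time <= t < end_time`, and nights that cross midnight are not handled.

```python
import pandas as pd

def to_minutes(hhmm: pd.Series) -> pd.Series:
    """Convert 'HH:MM' strings to minutes after midnight."""
    parts = hhmm.str.split(":", expand=True).astype(int)
    return parts[0] * 60 + parts[1]

def join_stage(heart_rates: pd.DataFrame, stages: pd.DataFrame) -> pd.DataFrame:
    """Label each heart rate reading with the stage whose interval contains it."""
    hr = heart_rates.assign(minute=to_minutes(heart_rates["timestamp"]))
    st = stages.assign(
        start_min=to_minutes(stages["start_time"]),
        end_min=to_minutes(stages["end_time"]),
    )
    # For every reading, pick the latest stage that started at or before it.
    merged = pd.merge_asof(
        hr.sort_values("minute"),
        st.sort_values("start_min")[["start_min", "end_min", "sleep_stage"]],
        left_on="minute",
        right_on="start_min",
        direction="backward",
    )
    # Readings before the first stage or after the matched stage ended get no label.
    outside = merged["start_min"].isna() | (merged["minute"] >= merged["end_min"])
    merged.loc[outside, "sleep_stage"] = None
    return merged[["timestamp", "heart_rate", "sleep_stage"]]
```

Readings that fall outside every interval keep a missing `sleep_stage`, so they remain visible and can be handled explicitly later.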


Let's remove all the files that are not relevant to our analysis (e.g., '2025-01-07.csv').

In [19]:
def remove_not_relevant_files(directory, last_relevant_day="2025-01-06"):
    # Assumed cutoff: files dated after last_relevant_day are not relevant.
    for f in os.listdir(directory):
        if f.endswith(".csv"):
            date = f.split(".")[0]
            # Zero-padded YYYY-MM-DD strings sort chronologically, so a plain
            # string comparison also catches dates such as 2025-02-03.
            if date > last_relevant_day:
                file_path = os.path.join(directory, f)
                os.remove(file_path)
                print(f"File removed: {file_path}")

remove_not_relevant_files(stages_sleep_directory)
remove_not_relevant_files(heart_rates_sleep_directory)
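Because `os.remove` cannot be undone, it may help to list the affected files before deleting anything. The sketch below is a dry run: `files_after` is a hypothetical helper, and the cutoff of 2025-01-06 is an assumption based on the example file name above. It parses each name as a full date, so the comparison is made on the whole date rather than on year, month, and day separately.

```python
from datetime import date

def files_after(filenames, last_relevant_day=date(2025, 1, 6)):
    """Return the CSV filenames dated after last_relevant_day, without deleting anything."""
    selected = []
    for name in filenames:
        if not name.endswith(".csv"):
            continue
        try:
            # Strip the extension and parse the rest as YYYY-MM-DD.
            day = date.fromisoformat(name[: -len(".csv")])
        except ValueError:
            continue  # skip names that are not dates
        if day > last_relevant_day:
            selected.append(name)
    return sorted(selected)
```

For example, `files_after(os.listdir(stages_sleep_directory))` would show which stage files a removal step is about to delete.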

Fix the error in the heart rate datasets by swapping the two column names back (see above for more details about the error).

In [22]:
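def reverse_columns(directory):
    for f in os.listdir(directory):
        if f.endswith(".csv"):
            file_path = os.path.join(directory, f)
            df = pd.read_csv(file_path)

            # Swap only while `timestamp` still holds the numeric heart rates,
            # so re-running this cell does not undo the fix.
            if not pd.api.types.is_numeric_dtype(df['timestamp']):
                continue

            # Swap the two column labels; the values stay where they are.
            df = df.rename(columns={'timestamp': 'heart_rate', 'heart_rate': 'timestamp'})

            df.to_csv(file_path, index=False)
            print("File saved successfully")

reverse_columns(heart_rates_sleep_directory)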

File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully
File saved successfully


Verify that the error has been fixed by loading a randomly chosen dataset from the `sleep/heart_rates` directory.

In [30]:
# Consider only CSV files so that a stray non-CSV file is never picked.
files = [os.path.join(heart_rates_sleep_directory, f) for f in os.listdir(heart_rates_sleep_directory) if f.endswith(".csv")]

random_file = random.choice(files)

random_df = pd.read_csv(random_file)
random_df.head()

Unnamed: 0,heart_rate,timestamp
0,60,00:10
1,58,00:12
2,58,00:14
3,59,00:16
4,60,00:18


The error has been successfully fixed.
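A single random file is only a spot check. As a complement, the sketch below scans every file in a directory and reports those that still look wrong. `unfixed_files` is a hypothetical helper, and it assumes that a correct file has numeric `heart_rate` values and `HH:MM` strings in `timestamp`.

```python
import os
import pandas as pd

def unfixed_files(directory):
    """Return the names of CSV files whose columns do not look corrected."""
    problems = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".csv"):
            continue
        df = pd.read_csv(os.path.join(directory, name))
        # Every heart rate must parse as a number.
        rates_ok = pd.to_numeric(df["heart_rate"], errors="coerce").notna().all()
        # Every timestamp must look like a clock time such as 00:58.
        times_ok = df["timestamp"].astype(str).str.fullmatch(r"\d{1,2}:\d{2}").all()
        if not (rates_ok and times_ok):
            problems.append(name)
    return problems
```

An empty list from `unfixed_files(heart_rates_sleep_directory)` would indicate that every file passed both checks.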