It's a relatively simple dataset, but your goal is this: 

Can we predict Sleep Score (reported by the Fitbit app) using the other metrics in the dataset? In other words, is there a formula that the Fitbit app uses to compute Sleep Score that we can reverse-engineer?

Two constraints for this assignment:

1. Your modeling efforts must involve bagging and stacking in some way. Otherwise, you may try whatever you like.

2. You are allowed, even encouraged, to compute or gather additional features to use as explanatory variables in your model. For example, you might create a variable for the time the person went to sleep (as a measure of how "early" they went to bed). There are multiple datasets, and you should use all of them, which means you may also use each dataset's month (or anything related to it) as a variable.
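
One possible way to build the bedtime feature suggested in constraint 2 is sketched below. It assumes bedtimes are available as 24-hour "HH:MM" strings; the function name `hours_after_noon` and the sample values are hypothetical. Measuring time from noon rather than midnight keeps 23:30 and 00:30 one hour apart instead of 23 hours apart.

```python
def hours_after_noon(hhmm: str) -> float:
    """Return hours elapsed since 12:00 noon for a 24-hour 'HH:MM' string."""
    hour, minute = (int(part) for part in hhmm.split(":"))
    clock_hours = hour + minute / 60
    # Shift the day boundary from midnight to noon so late-evening and
    # early-morning bedtimes fall on one continuous scale.
    return (clock_hours - 12) % 24


# Hypothetical bedtimes for illustration only (not taken from the dataset)
bedtimes = ["22:15", "23:30", "00:30"]
bedtime_feature = [hours_after_noon(t) for t in bedtimes]
print(bedtime_feature)
```

A larger value means a later bedtime, so the feature can be used directly as a numeric explanatory variable.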

Your submission should be an HTML or .ipynb file of all of your work.

In [52]:
import pandas as pd

# load in files
april_df = pd.read_csv(r"C:\Users\achur\OneDrive\Desktop\School\CP Spring 2024\545\GSB545\Labs\April sleep data - Sheet1.csv")
december_df = pd.read_csv(r"C:\Users\achur\OneDrive\Desktop\School\CP Spring 2024\545\GSB545\Labs\December Sleep data - Sheet1.csv")
february_df = pd.read_csv(r"C:\Users\achur\OneDrive\Desktop\School\CP Spring 2024\545\GSB545\Labs\February sleep data - Sheet1 (1).csv")
january_df = pd.read_csv(r"C:\Users\achur\OneDrive\Desktop\School\CP Spring 2024\545\GSB545\Labs\January sleep data - Sheet1.csv")
march_df = pd.read_csv(r"C:\Users\achur\OneDrive\Desktop\School\CP Spring 2024\545\GSB545\Labs\March sleep data - Sheet1.csv")
november_df = pd.read_csv(r"C:\Users\achur\OneDrive\Desktop\School\CP Spring 2024\545\GSB545\Labs\November Sleep Data - Sheet1.csv")

### Clean Data

In [53]:
# clean data

# rename sleep score in feb df
february_df.rename(columns={"SLEEP SQORE": "SLEEP SCORE"}, inplace=True)

# rename the february column
february_df.rename(columns={"FEBEUARY": "FEBRUARY"}, inplace=True)

# rename column for days of the week
february_df.rename(columns={"FEBRUARY": "DAYS OF THE WEEK"}, inplace=True)
april_df.rename(columns={"APRIL": "DAYS OF THE WEEK"}, inplace=True)
december_df.rename(columns={"DECEMBER": "DAYS OF THE WEEK"}, inplace=True)
january_df.rename(columns={"JANUARY": "DAYS OF THE WEEK"}, inplace=True)
march_df.rename(columns={"MARCH": "DAYS OF THE WEEK"}, inplace=True)
november_df.rename(columns={"NOVEMBER": "DAYS OF THE WEEK"}, inplace=True)

# then create month column
april_df["MONTH"] = "APRIL"
february_df["MONTH"] = "FEBRUARY"
december_df["MONTH"] = "DECEMBER"
january_df["MONTH"] = "JANUARY"
march_df["MONTH"] = "MARCH"
november_df["MONTH"] = "NOVEMBER"


# January labels this column "HEART RATE UNDER RESTING"; rename it to match the other months
january_df.rename(columns={"HEART RATE UNDER RESTING": "HEART RATE BELOW RESTING"}, inplace=True)

# March spells it "HEARTRATE" (no space); rename it to match the other months
march_df.rename(columns={"HEARTRATE BELOW RESTING": "HEART RATE BELOW RESTING"}, inplace=True)


In [54]:
# drop na's
april_df.dropna(inplace=True)
february_df.dropna(inplace=True)
december_df.dropna(inplace=True)
january_df.dropna(inplace=True)
march_df.dropna(inplace=True)
november_df.dropna(inplace=True)

In [55]:
# concat the dataframes
df_combined = pd.concat([april_df, february_df, december_df, january_df, march_df, november_df], ignore_index=True)

In [56]:
# fix the times
import re

df_combined[["SLEEP START", "SLEEP END"]] = df_combined["SLEEP TIME"].str.split(" - ", expand=True)

def fix_time_string(s):
    if pd.isna(s):
        return s
    s = s.strip().lower()
    s = re.sub(r"(\d{1,2})-(\d{2})(am|pm)", r"\1:\2\3", s)  
    s = re.sub(r"[^\dxapm:]", "", s)  
    return s

df_combined["SLEEP START"] = df_combined["SLEEP START"].apply(fix_time_string)
df_combined["SLEEP END"] = df_combined["SLEEP END"].apply(fix_time_string)

df_combined["SLEEP START"] = pd.to_datetime(df_combined["SLEEP START"], errors="coerce").dt.strftime("%H:%M")
df_combined["SLEEP END"] = pd.to_datetime(df_combined["SLEEP END"], errors="coerce").dt.strftime("%H:%M")
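
The cleaning steps in `fix_time_string` can be checked on a few strings. The sketch below uses made-up inputs (not rows from the dataset) and a pandas-free copy of the cleaning logic under the hypothetical name `clean_time`. It also parses with an explicit `%I:%M%p` format via `datetime.strptime` rather than letting `pd.to_datetime` infer the format; that swap assumes every cleaned string looks like `10:30pm`.

```python
import re
from datetime import datetime


def clean_time(s: str) -> str:
    """Same cleaning steps as fix_time_string, minus the pandas NaN check."""
    s = s.strip().lower()
    # Turn "10-30pm" into "10:30pm"
    s = re.sub(r"(\d{1,2})-(\d{2})(am|pm)", r"\1:\2\3", s)
    # Drop everything outside the allowed character set (spaces, for example)
    s = re.sub(r"[^\dxapm:]", "", s)
    return s


# Hypothetical raw values for illustration only
samples = ["10-30PM", " 11:05 pm ", "6:45 AM"]
cleaned = [clean_time(s) for s in samples]

# Parse as 12-hour clock times and format as 24-hour "HH:MM"
parsed = [datetime.strptime(s, "%I:%M%p").strftime("%H:%M") for s in cleaned]
print(cleaned)
print(parsed)
```

An explicit format fails loudly on unexpected strings, whereas `errors="coerce"` silently turns them into missing values.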



In [57]:
# fix the times so that all of them have seconds
df_combined['HOURS OF SLEEP'] = df_combined['HOURS OF SLEEP'].apply(
    lambda x: x if len(x.split(':')) == 3 else f"{x}:00"
)
# convert to hours of sleep
df_combined['HOURS OF SLEEP'] = pd.to_timedelta(df_combined['HOURS OF SLEEP']).dt.total_seconds() / 3600
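
As a small check of the conversion above, the sketch below applies the same two steps to a few made-up durations (hypothetical values, not rows from the dataset): pad "H:MM" to "H:MM:SS", then convert to decimal hours.

```python
import pandas as pd

# Hypothetical durations: two "H:MM" values and one "H:MM:SS" value
durations = pd.Series(["7:30", "6:45", "8:05:30"])

# Pad "H:MM" to "H:MM:SS" so every value has the same shape
padded = durations.apply(lambda x: x if len(x.split(":")) == 3 else f"{x}:00")

# Convert to timedeltas, then to decimal hours
hours = pd.to_timedelta(padded).dt.total_seconds() / 3600
print(padded.tolist())
print(hours.round(3).tolist())
```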

In [58]:
# make sure percentages are floats
for col in ['REM SLEEP', 'DEEP SLEEP', 'HEART RATE BELOW RESTING']:
    df_combined[col] = df_combined[col].str.replace('%', '').astype(float)

# parse the "HH:MM" strings for sleep start and end back into datetimes
df_combined['SLEEP START'] = pd.to_datetime(df_combined['SLEEP START'].astype(str), format='%H:%M', errors='coerce')
df_combined['SLEEP END'] = pd.to_datetime(df_combined['SLEEP END'].astype(str), format='%H:%M', errors='coerce')

# convert to decimal hours on a 24-hour clock (e.g. 22:30 -> 22.5)
df_combined['SLEEP START'] = df_combined['SLEEP START'].dt.hour + df_combined['SLEEP START'].dt.minute / 60
df_combined['SLEEP END'] = df_combined['SLEEP END'].dt.hour + df_combined['SLEEP END'].dt.minute / 60

### Bagging and Stacking Models

In [59]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
import numpy as np
from sklearn.metrics import r2_score

# dummify variables
df_combined = pd.get_dummies(df_combined, columns=['MONTH', 'DAYS OF THE WEEK'], drop_first=True)

# features (X) and target (y)
X = df_combined.drop(columns=['SLEEP SCORE', 'DATE', 'SLEEP TIME'], errors='ignore')
y = df_combined['SLEEP SCORE']

# standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# training and testing
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=1)

# bagging model
bagging_model = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=1
)

# check bagging model
bagging_model.fit(X_train, y_train)
y_pred_bag = bagging_model.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)
print(f"Bagging MSE: {mse_bag:.2f}")

# stacking model using bagging
stacking_model = StackingRegressor(
    estimators=[
        ('bagging', bagging_model),
        ('knn', KNeighborsRegressor())
    ],
    final_estimator=LinearRegression()
)

# check stacking model
stacking_model.fit(X_train, y_train)

# predict
y_pred_stack = stacking_model.predict(X_test)
mse_stack = mean_squared_error(y_test, y_pred_stack)

print(f"Stacking MSE: {mse_stack:.2f}")

# average predicted sleep score on the test set (reuses the predictions computed above)
avg_bagging_score = y_pred_bag.mean()
print(f"Average Predicted Sleep Score (Bagging): {avg_bagging_score:.2f}")

avg_stacking_score = y_pred_stack.mean()
print(f"Average Predicted Sleep Score (Stacking): {avg_stacking_score:.2f}")

# cv bag
bagging_cv_scores = cross_val_score(
    bagging_model, X_scaled, y, cv=5, scoring='neg_mean_squared_error'
)
bagging_cv_mse = -np.mean(bagging_cv_scores)
bagging_cv_std = np.std(bagging_cv_scores)
print(f"\nBagging CV MSE: {bagging_cv_mse:.2f}")

# cv stack
stacking_cv_scores = cross_val_score(
    stacking_model, X_scaled, y, cv=5, scoring='neg_mean_squared_error'
)
stacking_cv_mse = -np.mean(stacking_cv_scores)
stacking_cv_std = np.std(stacking_cv_scores)
print(f"\nStacking CV MSE: {stacking_cv_mse:.2f}")

# convert CV MSE to RMSE
rmse_bag = np.sqrt(bagging_cv_mse)
rmse_stack = np.sqrt(stacking_cv_mse)
print(f"Bagging CV RMSE: {rmse_bag:.2f}")
print(f"Stacking CV RMSE: {rmse_stack:.2f}")
# r2 for Bagging
r2_bag = r2_score(y_test, y_pred_bag)
print(f"Bagging R² Score: {r2_bag:.4f}")

# r2 for Stacking
r2_stack = r2_score(y_test, y_pred_stack)
print(f"Stacking R² Score: {r2_stack:.4f}")

Bagging MSE: 5.61
Stacking MSE: 5.75
Average Predicted Sleep Score (Bagging): 84.48
Average Predicted Sleep Score (Stacking): 84.28

Bagging CV MSE: 31.26

Stacking CV MSE: 32.62
Bagging CV RMSE: 5.59
Stacking CV RMSE: 5.71
Bagging R² Score: 0.6808
Stacking R² Score: 0.6727


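The bagging model had the higher test R^2 (0.6808 vs. 0.6727) and the lower MSE on both the held-out test split (5.61 vs. 5.75) and 5-fold cross-validation (31.26 vs. 32.62), so it is the better-fitting of the two models.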
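
Two details of the evaluation above are worth noting. The features were standardized on the full dataset before the train/test split and before cross-validation, and `cross_val_score` with `cv=5` uses unshuffled folds for a regressor, so each fold is a contiguous block of the month-by-month concatenated rows. This may partly explain why the CV MSE is so much larger than the test MSE. The sketch below is one possible alternative, not the method used for the results above: the scaler sits inside a pipeline so it is refit on each training fold, and the folds are shuffled. Synthetic data from `make_regression` stands in for the sleep data, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the sleep data (assumption: illustration only)
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=1)

# Stacked ensemble with bagged trees and KNN as base learners
stack = StackingRegressor(
    estimators=[
        ("bagging", BaggingRegressor(DecisionTreeRegressor(), n_estimators=25, random_state=1)),
        ("knn", KNeighborsRegressor()),
    ],
    final_estimator=LinearRegression(),
)

# The scaler is part of the model, so each CV fold fits it on training rows only
model = make_pipeline(StandardScaler(), stack)

# Shuffled folds avoid evaluating on contiguous blocks of rows
cv = KFold(n_splits=5, shuffle=True, random_state=1)
neg_mse = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error")

cv_mse = -neg_mse.mean()
print(f"CV MSE: {cv_mse:.2f} (std across folds: {neg_mse.std():.2f})")
```

With the real data, the unscaled `X` and `y` would take the place of `X_demo` and `y_demo`.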

## Appendix and References
https://docs.python.org/3/library/re.html

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html

Generative AI Statement: ChatGPT was used only to suggest fixes for errors in code that had already been written by hand.