## About the dataset

This dataset is taken from Kaggle: [Sleep Data](https://www.kaggle.com/datasets/danagerous/sleep-data?select=sleepdata.csv). It tracks the sleep cycles of a particular user from [2014-2018](../data/raw/sleepdata.csv) and [2018-2022](../data/raw/sleepdata_2.csv).

Both datasets have Start, End, and Sleep Quality (target) columns. They also share a few other features, but the 2018-2022 data (sleepdata_2.csv) contains additional columns that the 2014-2018 data (sleepdata.csv) does not have.
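
One way to see which columns overlap is to compare the two column indexes directly. The sketch below uses empty toy frames whose column names copy a handful of the real ones; the variable names `old` and `new` are placeholders, not names used in this notebook. Note that the comparison is case-sensitive, so `Sleep quality` and `Sleep Quality` count as different columns until one of them is renamed.

```python
import pandas as pd

# Toy stand-ins for the two files (assumption: only a subset of the real columns).
old = pd.DataFrame(columns=["Start", "End", "Sleep quality", "Heart rate", "Activity (steps)"])
new = pd.DataFrame(columns=["Start", "End", "Sleep Quality", "Heart rate (bpm)", "Steps", "Regularity"])

shared = old.columns.intersection(new.columns)    # present in both, exact name match
only_old = old.columns.difference(new.columns)    # present only in the older file
only_new = new.columns.difference(old.columns)    # present only in the newer file

print(list(shared))
print(list(only_old))
print(list(only_new))
```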

## Imports

In [30]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## Initial cleaning to stack datasets

In [31]:
df1 = pd.read_csv("../data/raw/sleepdata.csv", sep=";")
print(df1.shape)
df1.head()

(887, 8)


Unnamed: 0,Start,End,Sleep quality,Time in bed,Wake up,Sleep Notes,Heart rate,Activity (steps)
0,2014-12-29 22:57:49,2014-12-30 07:30:13,100%,8:32,:),,59.0,0
1,2014-12-30 21:17:50,2014-12-30 21:33:54,3%,0:16,:|,Stressful day,72.0,0
2,2014-12-30 22:42:49,2014-12-31 07:13:31,98%,8:30,:|,,57.0,0
3,2014-12-31 22:31:01,2015-01-01 06:03:01,65%,7:32,,,,0
4,2015-01-01 22:12:10,2015-01-02 04:56:35,72%,6:44,:),Drank coffee:Drank tea,68.0,0


In [32]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Start             887 non-null    object 
 1   End               887 non-null    object 
 2   Sleep quality     887 non-null    object 
 3   Time in bed       887 non-null    object 
 4   Wake up           246 non-null    object 
 5   Sleep Notes       652 non-null    object 
 6   Heart rate        162 non-null    float64
 7   Activity (steps)  887 non-null    int64  
dtypes: float64(1), int64(1), object(6)
memory usage: 55.6+ KB
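
`Start` and `End` are read in as `object` (string) columns. A minimal sketch of parsing them with `pd.to_datetime` and deriving a duration, using the first row shown above as sample input; `duration_hr` is a hypothetical name and this step is not part of the original notebook.

```python
import pandas as pd

# First row of df1 as shown in the head() output above.
start = pd.to_datetime(pd.Series(["2014-12-29 22:57:49"]))
end = pd.to_datetime(pd.Series(["2014-12-30 07:30:13"]))

# Subtracting datetimes yields a timedelta; convert it to hours.
duration_hr = (end - start).dt.total_seconds() / 3600
print(duration_hr.iloc[0])
```

The result is about 8.54 hours (8 h 32 min 24 s), which agrees with the `8:32` recorded in `Time in bed` for that row.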


In [33]:
df2 = pd.read_csv("../data/raw/sleepdata_2.csv", sep=";")
print(df2.shape)
df2.head()

(921, 21)


Unnamed: 0,Start,End,Sleep Quality,Regularity,Mood,Heart rate (bpm),Steps,Alarm mode,Air Pressure (Pa),City,...,Time in bed (seconds),Time asleep (seconds),Time before sleep (seconds),Window start,Window stop,Did snore,Snore time,Weather temperature (°C),Weather type,Notes
0,2019-05-12 23:26:13,2019-05-13 06:11:03,60%,0%,,0,8350,Normal,,,...,24289.2,22993.8,161.9,2019-05-13 06:00:00,2019-05-13 06:00:00,True,92.0,0.0,No weather,
1,2019-05-13 22:10:31,2019-05-14 06:10:42,73%,0%,,0,4746,Normal,,,...,28810.2,25160.9,192.1,2019-05-14 05:50:00,2019-05-14 05:50:00,True,0.0,0.0,No weather,
2,2019-05-14 21:43:00,2019-05-15 06:10:41,86%,96%,,0,4007,Normal,,,...,30461.5,28430.8,203.1,2019-05-15 05:50:00,2019-05-15 05:50:00,True,74.0,0.0,No weather,
3,2019-05-15 23:11:51,2019-05-16 06:13:59,77%,92%,,0,6578,Normal,,,...,25327.6,23132.5,168.9,2019-05-16 05:50:00,2019-05-16 05:50:00,True,0.0,0.0,No weather,
4,2019-05-16 23:12:13,2019-05-17 06:20:32,78%,94%,,0,4913,Normal,,,...,25698.4,22614.6,171.3,2019-05-17 05:50:00,2019-05-17 05:50:00,True,188.0,0.0,No weather,


In [34]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 921 entries, 0 to 920
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Start                        921 non-null    object 
 1   End                          921 non-null    object 
 2   Sleep Quality                921 non-null    object 
 3   Regularity                   921 non-null    object 
 4   Mood                         0 non-null      float64
 5   Heart rate (bpm)             921 non-null    int64  
 6   Steps                        921 non-null    int64  
 7   Alarm mode                   921 non-null    object 
 8   Air Pressure (Pa)            492 non-null    float64
 9   City                         487 non-null    object 
 10  Movements per hour           921 non-null    float64
 11  Time in bed (seconds)        921 non-null    float64
 12  Time asleep (seconds)        921 non-null    float64
 13  Time before sleep (s

In [35]:
pd.to_timedelta(df1["Time in bed"] + ":00").dt.total_seconds() / 3600

0      8.533333
1      0.266667
2      8.500000
3      7.533333
4      6.733333
         ...   
882    9.133333
883    7.183333
884    8.933333
885    9.216667
886    8.916667
Name: Time in bed, Length: 887, dtype: float64

In [36]:
df1["Time in bed (hr)"] = pd.to_timedelta(df1["Time in bed"] + ":00").dt.total_seconds() / 3600
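
The conversion above relies on the `Time in bed` strings being in `H:MM` form: appending `":00"` gives an `H:MM:SS` string that `pd.to_timedelta` can parse, and dividing the total seconds by 3600 yields hours. A small self-contained sketch with three of the values from `df1.head()`:

```python
import pandas as pd

time_in_bed = pd.Series(["8:32", "0:16", "7:32"])

# "8:32" -> "8:32:00" -> Timedelta of 8 h 32 min -> 8.5333... hours
hours = pd.to_timedelta(time_in_bed + ":00").dt.total_seconds() / 3600
print(hours.round(6).tolist())
```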

In [37]:
df1 = df1.drop(columns=["Wake up", "Sleep Notes", "Time in bed"])
df2 = df2.drop(columns=["Mood", "Notes"])

In [38]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Start             887 non-null    object 
 1   End               887 non-null    object 
 2   Sleep quality     887 non-null    object 
 3   Heart rate        162 non-null    float64
 4   Activity (steps)  887 non-null    int64  
 5   Time in bed (hr)  887 non-null    float64
dtypes: float64(2), int64(1), object(3)
memory usage: 41.7+ KB


In [39]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 921 entries, 0 to 920
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Start                        921 non-null    object 
 1   End                          921 non-null    object 
 2   Sleep Quality                921 non-null    object 
 3   Regularity                   921 non-null    object 
 4   Heart rate (bpm)             921 non-null    int64  
 5   Steps                        921 non-null    int64  
 6   Alarm mode                   921 non-null    object 
 7   Air Pressure (Pa)            492 non-null    float64
 8   City                         487 non-null    object 
 9   Movements per hour           921 non-null    float64
 10  Time in bed (seconds)        921 non-null    float64
 11  Time asleep (seconds)        921 non-null    float64
 12  Time before sleep (seconds)  921 non-null    float64
 13  Window start        

In [40]:
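# Convert seconds to hours so the unit matches df1, then drop the original column.
df2["Time in bed (hr)"] = df2["Time in bed (seconds)"] / 3600
df2 = df2.drop(columns=["Time in bed (seconds)"])
# Put the columns that correspond to df1's columns first; keep the rest in their existing order.
cols_to_move = ['Start', 'End', 'Sleep Quality', 'Heart rate (bpm)', 'Steps', 'Time in bed (hr)']
col_order = cols_to_move + [c for c in df2.columns if c not in cols_to_move]
df2 = df2[col_order]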


In [41]:
df1 = df1.rename(columns={"Heart rate": "Heart rate (bpm)", "Activity (steps)": "Steps"})
df2 = df2.rename(columns={"Sleep Quality": "Sleep quality"})

In [42]:
df1.head()

Unnamed: 0,Start,End,Sleep quality,Heart rate (bpm),Steps,Time in bed (hr)
0,2014-12-29 22:57:49,2014-12-30 07:30:13,100%,59.0,0,8.533333
1,2014-12-30 21:17:50,2014-12-30 21:33:54,3%,72.0,0,0.266667
2,2014-12-30 22:42:49,2014-12-31 07:13:31,98%,57.0,0,8.5
3,2014-12-31 22:31:01,2015-01-01 06:03:01,65%,,0,7.533333
4,2015-01-01 22:12:10,2015-01-02 04:56:35,72%,68.0,0,6.733333
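
The target column still holds strings such as `100%`, so it will likely need to become numeric before any modeling. A minimal sketch of one way to do that, assuming every value ends in `%`; the name `quality_frac` is hypothetical and the notebook itself does not perform this step here.

```python
import pandas as pd

quality = pd.Series(["100%", "3%", "98%"], name="Sleep quality")

# Strip the trailing percent sign, cast to float, and rescale to the 0-1 range.
quality_frac = quality.str.rstrip("%").astype(float) / 100
print(quality_frac.tolist())
```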


In [43]:
df2.head()

Unnamed: 0,Start,End,Sleep quality,Heart rate (bpm),Steps,Time in bed (hr),Regularity,Alarm mode,Air Pressure (Pa),City,Movements per hour,Time asleep (seconds),Time before sleep (seconds),Window start,Window stop,Did snore,Snore time,Weather temperature (°C),Weather type
0,2019-05-12 23:26:13,2019-05-13 06:11:03,60%,0,8350,6.747,0%,Normal,,,35.0,22993.8,161.9,2019-05-13 06:00:00,2019-05-13 06:00:00,True,92.0,0.0,No weather
1,2019-05-13 22:10:31,2019-05-14 06:10:42,73%,0,4746,8.002833,0%,Normal,,,78.6,25160.9,192.1,2019-05-14 05:50:00,2019-05-14 05:50:00,True,0.0,0.0,No weather
2,2019-05-14 21:43:00,2019-05-15 06:10:41,86%,0,4007,8.461528,96%,Normal,,,60.5,28430.8,203.1,2019-05-15 05:50:00,2019-05-15 05:50:00,True,74.0,0.0,No weather
3,2019-05-15 23:11:51,2019-05-16 06:13:59,77%,0,6578,7.035444,92%,Normal,,,45.2,23132.5,168.9,2019-05-16 05:50:00,2019-05-16 05:50:00,True,0.0,0.0,No weather
4,2019-05-16 23:12:13,2019-05-17 06:20:32,78%,0,4913,7.138444,94%,Normal,,,44.6,22614.6,171.3,2019-05-17 05:50:00,2019-05-17 05:50:00,True,188.0,0.0,No weather


In [44]:
# ignore_index=True already yields a fresh 0..n-1 index, so no separate reset_index is needed.
df = pd.concat([df1, df2], ignore_index=True, sort=False)
print(df.shape)
df.iloc[886:890]

(1808, 19)


Unnamed: 0,Start,End,Sleep quality,Heart rate (bpm),Steps,Time in bed (hr),Regularity,Alarm mode,Air Pressure (Pa),City,Movements per hour,Time asleep (seconds),Time before sleep (seconds),Window start,Window stop,Did snore,Snore time,Weather temperature (°C),Weather type
886,2018-02-16 22:52:29,2018-02-17 07:48:04,91%,,2291,8.916667,,,,,,,,,,,,,
887,2019-05-12 23:26:13,2019-05-13 06:11:03,60%,0.0,8350,6.747,0%,Normal,,,35.0,22993.8,161.9,2019-05-13 06:00:00,2019-05-13 06:00:00,True,92.0,0.0,No weather
888,2019-05-13 22:10:31,2019-05-14 06:10:42,73%,0.0,4746,8.002833,0%,Normal,,,78.6,25160.9,192.1,2019-05-14 05:50:00,2019-05-14 05:50:00,True,0.0,0.0,No weather
889,2019-05-14 21:43:00,2019-05-15 06:10:41,86%,0.0,4007,8.461528,96%,Normal,,,60.5,28430.8,203.1,2019-05-15 05:50:00,2019-05-15 05:50:00,True,74.0,0.0,No weather
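
Row 886 above is the last row from the first file, and its extra columns are all empty. That is how `pd.concat` aligns frames with different columns: it takes the union of the column names and fills the gaps with `NaN`. A toy sketch of the same behavior (the frames `a` and `b` are made up for illustration):

```python
import pandas as pd

a = pd.DataFrame({"Start": ["2018-02-16"], "Steps": [2291]})
b = pd.DataFrame({"Start": ["2019-05-12"], "Steps": [8350], "Regularity": ["0%"]})

# Columns are matched by name; "Regularity" does not exist in `a`, so its row gets NaN.
stacked = pd.concat([a, b], ignore_index=True)
print(stacked)
```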


## Imputing missing values using ML

### Method 1: manual

In [45]:
df.isna().sum().sum()

np.int64(13279)

In [46]:
cols_with_nans = [x for x in df if df[x].isna().sum() > 0]
cols_with_nans

['Heart rate (bpm)',
 'Regularity',
 'Alarm mode',
 'Air Pressure (Pa)',
 'City',
 'Movements per hour',
 'Time asleep (seconds)',
 'Time before sleep (seconds)',
 'Window start',
 'Window stop',
 'Did snore',
 'Snore time',
 'Weather temperature (°C)',
 'Weather type']

In [47]:
df[cols_with_nans].dtypes

Heart rate (bpm)               float64
Regularity                      object
Alarm mode                      object
Air Pressure (Pa)              float64
City                            object
Movements per hour             float64
Time asleep (seconds)          float64
Time before sleep (seconds)    float64
Window start                    object
Window stop                     object
Did snore                       object
Snore time                     float64
Weather temperature (°C)       float64
Weather type                    object
dtype: object

In [48]:
df[cols_with_nans].isna().sum()

Heart rate (bpm)                725
Regularity                      887
Alarm mode                      887
Air Pressure (Pa)              1316
City                           1321
Movements per hour              887
Time asleep (seconds)           887
Time before sleep (seconds)     887
Window start                    967
Window stop                     967
Did snore                       887
Snore time                      887
Weather temperature (°C)        887
Weather type                    887
dtype: int64
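
Nine of these columns are missing exactly 887 values, which equals the number of rows in the first file. That is consistent with those columns simply not existing in the 2014-2018 data, rather than being sporadically unrecorded. A hedged sketch of how one might check this on a stacked frame, using a toy frame and a hypothetical `n_first` for the size of the first block:

```python
import numpy as np
import pandas as pd

# Toy stacked frame: the first two rows play the role of the older file.
stacked = pd.DataFrame({
    "Steps": [0, 0, 8350],
    "Regularity": [np.nan, np.nan, "0%"],
    "City": [np.nan, np.nan, np.nan],
})
n_first = 2  # assumption: number of rows that came from the first file

# True for columns that are entirely missing within the first block.
absent_in_first = stacked.iloc[:n_first].isna().all()
print(absent_in_first)
```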

In [49]:
df.shape

(1808, 19)

In [50]:
numeric_cols = [col for col in df[cols_with_nans].select_dtypes("number")]
cat_cols = [col for col in df[cols_with_nans].select_dtypes("object")]
print(numeric_cols)
print(cat_cols)

['Heart rate (bpm)', 'Air Pressure (Pa)', 'Movements per hour', 'Time asleep (seconds)', 'Time before sleep (seconds)', 'Snore time', 'Weather temperature (°C)']
['Regularity', 'Alarm mode', 'City', 'Window start', 'Window stop', 'Did snore', 'Weather type']


In [51]:
# Fraction (0 to 1, not a percentage) of missing values per numeric column.
for col in numeric_cols:
    na_fraction = df[col].isna().sum() / len(df[col])
    print(f"{col}:{na_fraction}")

Heart rate (bpm):0.40099557522123896
Air Pressure (Pa):0.7278761061946902
Movements per hour:0.4905973451327434
Time asleep (seconds):0.4905973451327434
Time before sleep (seconds):0.4905973451327434
Snore time:0.4905973451327434
Weather temperature (°C):0.4905973451327434
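
The same per-column fractions can be computed without a loop, because the mean of a boolean mask is the share of `True` values. A minimal sketch on a toy frame (the data and the name `na_share` are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Heart rate (bpm)": [59.0, np.nan, 57.0, np.nan],
    "Snore time": [np.nan, 92.0, 0.0, 74.0],
})

# isna() gives a boolean frame; mean() per column is the fraction of missing values.
na_share = toy.isna().mean().sort_values(ascending=False)
print(na_share)
```

Sorting the result makes it easier to spot which columns are too sparse to impute reliably.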
