# Sleep iOS App Data - Exploratory Analysis 

## Inspiration for this project
A few years ago I read a wonderful (and somewhat worrisome) book titled 'Why We Sleep' by Matthew Walker, a neuroscientist and sleep researcher who carefully delves into the modern research on the importance of sleep. Walker outlines extensive research on how sleep affects our moods, dietary preferences, our body's ability to fight cancer, and our physical & mental aptitude, among a host of other things, and proposes that our lack of sufficient, quality sleep may explain some of today's issues. This book, along with the section on the history of caffeine in Michael Pollan's recent 'This Is Your Mind on Plants', made me curious to see whether I could do a bit of my own digging in a sleep dataset and uncover something of interest (if anything).

## The Dataset

This dataset was provided via https://www.kaggle.com/danagerous/sleep-data, which notes that it was collected over a four-year period using the Sleep Cycle app from Northcube on iOS.

### Import Necessary Packages

In [None]:
import os 
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

### Import Dataset

In [None]:
# Import Sleep Cycle iOS App .csv Data

df = pd.read_csv("../input/sleep-data/sleepdata.csv", sep = ';')

df.head()

In [None]:
df.dtypes

### Cleaning the Data
Checking the .head() of our dataframe shows that we need to do some cleaning up before we can begin exploring this dataset. Calling on the .dtypes of our dataframe also reveals that our variables are not in the correct format either.

We need to:
* Change 'Start' and 'End' to the datetime format
* Change 'Sleep quality' to a numeric value instead of a string with the % symbol
* Change 'Time in bed' to amount of hours in bed represented as a float
* Change 'Wake up' to 'Happy' for ':)' and 'Moderate' for ':|'
* Fill in the missing values in 'Sleep Notes' with '' (an empty string) so we can use the column in the next step to create binary variables
* Separate the 'Sleep Notes' column into binary values for each variable within this column (i.e. 'Drank tea' : T / F)
* Remove the 'Activity (steps)' column, as there appear to be too many missing values (0's) on days when the individual noted they had worked out, or the sleep notes themselves are plainly missing


In [None]:
############ Cleaning Data ############

# Replacing nan values in 'Sleep Notes' with '', keeping all other columns with missing values 

df['Sleep Notes'] = df['Sleep Notes'].fillna('')

# Resetting the index

df.reset_index(inplace = True, drop = True)

# Converting 'Wake up' column emojis from ':|' & ':)' to 'Moderate' & 'Happy'

df['Wake up'] = df['Wake up'].replace({':|':'Moderate', ':)':'Happy'})

# Subdividing the 'Sleep Notes' column which contains notes separated by a colon, into binary variable columns with True/False based on these 5 factors: 'Drank tea', 'Drank coffee', 'Worked out', 'Ate late', 'Stressful day'

# regex = False matches the note text literally; na = False treats any remaining missing note as "not present"
for note in ['Drank tea', 'Drank coffee', 'Worked out', 'Ate late', 'Stressful day']:
    df[note] = df['Sleep Notes'].str.contains(note, regex = False, na = False)

# Dropping the 'Activity (steps)' column due to the value being 0 for all complete data (no added information), and dropping the 'Sleep Notes' column due to obsolescence 

df.drop('Activity (steps)', inplace = True, axis = 'columns')
df.drop('Sleep Notes', inplace = True, axis = 'columns')

# Converting 'Sleep quality' column to numeric values and removing the % symbol from the string

df['Sleep quality'] = df['Sleep quality'].replace('%', '', regex = True)
df['Sleep quality'] = df['Sleep quality'].astype('int')

# Converting 'Time in bed' column to separate columns for hours and minutes, then combining to calculate total minutes, then dividing by 60 to get total hours slept as a float in 'Time in bed (hours)'. Subsequently dropping the 'Time in bed' column.

df['Time in bed'] = df['Time in bed'].str.split(':')
df['Time in bed (hour)'] = [x[0] for x in df['Time in bed']]
df['Time in bed (minute)'] = [x[1] for x in df['Time in bed']]

df['Time in bed (hour)'] = df['Time in bed (hour)'].astype('int')
df['Time in bed (minute)'] = df['Time in bed (minute)'].astype('int')

df['Time in bed (hours)'] = (df['Time in bed (hour)'] * 60 + df['Time in bed (minute)']) / 60

df.drop('Time in bed', inplace = True, axis = 'columns')

# Converting the 'Start' and 'End' columns into datetime objects for later visualization

df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])

# Creating a Month column for time series plotting

df['Month'] = [x.month for x in df['Start']]

# Reordering columns to desired order

column_names = ['Start', 'End', 'Month', 'Sleep quality', 'Time in bed (hour)', 'Time in bed (minute)', 'Time in bed (hours)', 'Wake up', 'Heart rate', 'Drank tea', 'Drank coffee', 'Worked out', 'Ate late', 'Stressful day']

df = df.reindex(columns = column_names)
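
As an alternative to the manual splitting above, pandas can parse these two columns more directly. The following is a minimal sketch on a made-up three-row frame (`toy` and its values are illustrative, not taken from the dataset), assuming 'Time in bed' is formatted as `H:MM` and the notes are colon-separated as described above.

```python
import pandas as pd

# Hypothetical rows mimicking the raw export; not real data from the dataset.
toy = pd.DataFrame({
    "Time in bed": ["8:32", "0:16", "7:45"],
    "Sleep Notes": ["Drank coffee:Worked out", "", "Drank tea"],
})

# Appending ':00' turns 'H:MM' into 'H:MM:SS', which pd.to_timedelta can parse.
time_in_bed = pd.to_timedelta(toy["Time in bed"] + ":00")
toy["Time in bed (hours)"] = time_in_bed.dt.total_seconds() / 3600

# str.get_dummies splits on the separator and builds one indicator column per note.
notes = toy["Sleep Notes"].str.get_dummies(sep=":").astype(bool)
toy = toy.join(notes)

print(toy)
```

One caveat with this approach: `str.get_dummies` only creates columns for notes that actually occur in the data, so a note that never appears would have no column at all.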

### Exploration
Now that our data is in a tidy format, we can begin to explore and visualize the dataset.

In [None]:
df.head()

### Correlation Matrix
Let's start with a correlation matrix and display it as a heatmap to have a glance at the linear relationship between our variables and to see if there is anything that stands out to us.

It appears we have a decent linear relationship between Time in bed (hours) & Sleep quality (~0.711). This seems consistent with the general notion of sleeping more and feeling more refreshed.

We also notice there is a decent linear relationship between Worked out (whether our user worked out that particular day) & Drank Tea / Drank Coffee (~0.306 & ~0.377 respectively). It may be the case that this individual requires caffeine in order to jump start their morning to work out, or to make it through their work day and to have enough energy to hit the gym in the evening.

In [None]:
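# numeric_only = True skips non-numeric columns such as 'Wake up', which raise an error in pandas 2.0+
df_corr = df.corr(numeric_only = True)
sns.heatmap(df_corr)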
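
Heatmap colors alone make it hard to read off specific coefficients like the ones quoted above. The sketch below, on made-up values (the `toy` frame is hypothetical), shows one way to list the correlations with 'Sleep quality' as numbers, and that True/False columns take part in the calculation while text columns are skipped.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Sleep quality": [60, 75, 80, 90, 55, 85],
    "Time in bed (hours)": [6.0, 7.5, 8.0, 8.5, 5.5, 8.0],
    "Worked out": [False, True, False, True, False, True],
    "Wake up": ["Moderate", "Happy", "Happy", "Happy", "Moderate", "Happy"],
})

# numeric_only=True keeps int, float and bool columns and skips text such as 'Wake up'.
corr = toy.corr(numeric_only=True)

# Correlations of every other column with 'Sleep quality', strongest first.
ranked = corr["Sleep quality"].drop("Sleep quality").sort_values(ascending=False)
print(ranked.round(3))
```

Passing `annot = True` to `sns.heatmap` is another option, as it prints each coefficient inside its cell.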

### Comparing Average Sleep Scores on Caffeine
Let's take a closer look at the mean sleep scores (Sleep quality) to see if we can note anything of interest. We will group the days by the four different combinations of Drank Tea T/F ~ Drank Coffee T/F.

Interestingly, this person reported higher sleep quality and slept slightly longer (20 minutes on average) on the days that they drank both tea & coffee compared to the days with neither. This raises the question of whether the subjective sleep quality reported by the individual accurately represents the true quality of their sleep as indicated by their brain's sleep waves, which will reduce in quality if an individual has caffeine in their system when they sleep, as described in the following study: (https://jcsm.aasm.org/doi/10.5664/jcsm.3170). This individual also rarely worked out when they drank neither tea nor coffee; perhaps caffeine (tea or coffee) provides the motivation to begin a workout.


In [None]:
tc_df = df[['Sleep quality', 'Time in bed (hours)', 'Drank tea', 'Drank coffee', 'Worked out', 'Ate late', 'Stressful day']].groupby(['Drank tea', 'Drank coffee']).mean()

tc_df
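
Group means are easier to judge alongside the number of nights in each group, since a group with only a handful of nights can swing widely. The sketch below uses made-up nights (the `toy` frame and the output column names `nights`, `mean_quality` and `mean_hours` are illustrative choices) to show the same grouping with counts added, plus the arithmetic for turning a difference in hours into minutes.

```python
import pandas as pd

# Made-up nights for illustration; the column names match the cleaned frame above.
toy = pd.DataFrame({
    "Sleep quality": [70, 74, 80, 82, 78, 76],
    "Time in bed (hours)": [7.0, 7.5, 7.5, 8.0, 7.75, 7.25],
    "Drank tea": [False, False, True, True, True, False],
    "Drank coffee": [False, False, True, True, False, True],
})

# Named aggregation: one row per tea/coffee combination, with its size and means.
summary = toy.groupby(["Drank tea", "Drank coffee"]).agg(
    nights=("Sleep quality", "size"),
    mean_quality=("Sleep quality", "mean"),
    mean_hours=("Time in bed (hours)", "mean"),
)

# Difference in time in bed between 'both' and 'neither', converted from hours to minutes.
both = toy[toy["Drank tea"] & toy["Drank coffee"]]
neither = toy[~toy["Drank tea"] & ~toy["Drank coffee"]]
extra_minutes = (both["Time in bed (hours)"].mean() - neither["Time in bed (hours)"].mean()) * 60

print(summary)
print(f"Extra time in bed with tea and coffee: {extra_minutes:.0f} minutes")
```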

### Comparing Worked Out T/F

Next we can compare our variables by whether this individual worked out on a particular day, to see what effect this has on sleep quality and our other variables.

Interestingly, sleep quality and time in bed appear to be just slightly higher on the nights the individual worked out. The rates of drinking tea and coffee were nearly double on the days this person worked out compared to the days they did not. Relating back to the previous dataframe looking into caffeine, perhaps the stimulation provided the necessary "boost" for this person to work out. Let's see which days of the week this person worked out on and whether there is a pattern.

In [None]:
wo_df = df[['Sleep quality', 'Time in bed (hours)', 'Drank tea', 'Drank coffee', 'Worked out', 'Ate late', 'Stressful day']].groupby(['Worked out']).mean()

wo_df
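
The mean of a True/False column is the share of True values, so the 'Drank tea' and 'Drank coffee' columns in the table above are rates. A cross-tabulation gives the same rates in a more explicit form. This is a small sketch on made-up days (the `toy` frame is hypothetical), not a result from the dataset.

```python
import pandas as pd

# Made-up days for illustration only.
toy = pd.DataFrame({
    "Worked out":   [True, True, True, True, False, False, False, False],
    "Drank coffee": [True, True, True, False, True, False, False, False],
})

# normalize='index' turns each row into shares that sum to 1,
# i.e. the coffee rate within workout days and within rest days.
rates = pd.crosstab(toy["Worked out"], toy["Drank coffee"], normalize="index")
print(rates)
```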

### Which days of the week had a workout session?

This person worked out most often on Monday ~ Thursday, while Friday ~ Sunday saw only about a third as many workouts. Let's compare this to when the individual drank caffeine (tea & coffee).


In [None]:
# .dt.day_name() returns the weekday name ('Monday' ... 'Sunday') for each datetime
df['Day of week'] = df['Start'].dt.day_name()

sns.countplot(x = 'Day of week', hue = 'Worked out', data = df)
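
Raw counts depend on how many nights were logged for each weekday, so the share of nights with a workout can be a useful companion to the plot. The sketch below uses two made-up weeks (the `toy` frame and the `day_order` list are illustrative) and also puts the weekdays in calendar order.

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Two made-up weeks starting on Monday 2024-01-01; workouts only on Monday ~ Thursday of week one.
toy = pd.DataFrame({"Start": pd.date_range("2024-01-01", periods=14, freq="D")})
toy["Worked out"] = [True] * 4 + [False] * 10
toy["Day of week"] = toy["Start"].dt.day_name()

# Share of nights with a workout per weekday, in calendar order rather than alphabetical order.
workout_share = toy.groupby("Day of week")["Worked out"].mean().reindex(day_order)
print(workout_share)
```

The same `day_order` list can be passed as `order = day_order` to `sns.countplot` so the bars run from Monday to Sunday.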

### Which days of the week was tea or coffee consumed?
The ratio of drank coffee (T) to not drank coffee (F) seems relatively consistent throughout Monday ~ Thursday, with Saturday having the largest disparity between coffee drank that day compared to not drank that day. Interestingly, Friday appears to have a low overall drank / not drank count. 

Looking at our weekly plot of whether this individual consumed tea, it appears that coffee is strongly preferred over tea on Sundays; perhaps they are enjoying a lovely Sunday morning with a nice cuppa (of coffee).

In [None]:
sns.countplot(x = 'Day of week', hue = 'Drank coffee', data = df)

In [None]:
sns.countplot(x = 'Day of week', hue = 'Drank tea', data = df)

## Conclusion
What did we learn from analyzing this individual's sleep data?
* Our individual slept 20 minutes longer on average on the days they consumed caffeine (tea & coffee)
* Sleep quality also improved by nearly 7 points (out of 100) on the days they consumed caffeine (tea & coffee)
* Sunday coffees are a must
* Working out on Sunday will not happen

There is always more to be gleaned when working with data, but from what we have gathered so far, perhaps there is a case to be made for caffeine, used in the appropriate fashion, improving our sleep time & quality. It may be the case that caffeine helps improve sleep by providing enough motivation to exercise (or something in a similar vein).

Thanks for reading :)
