# **How Can a Wellness Technology Company Play It Smart?**
## Phase 1: Ask

#### Introduction
In this analysis we take a close look at real-world data from users of Fitbit, a popular fitness tracker, to find insights into user behaviour and trends that help Bellabeat's marketing team make the right decisions for their products. That includes identifying when and how participants used their fitness tracker, as well as how different groups of users differ in behaviour from one another.

Apart from supporting the upcoming marketing campaign, other goals of this analysis are the segmentation of fitness tracker users and the optimization of features of Bellabeat's smart devices.

#### Stakeholders
The stakeholders of this analysis are Urška Sršen and Sando Mur, cofounders and executives of Bellabeat. The Bellabeat marketing team is also a key stakeholder, as this report aims to directly influence its upcoming campaigns.


## Phase 2: Prepare
#### Integrity
The data for this process comes from a public dataset on Kaggle. According to its description, the data was collected from 03/12/2016 to 05/12/2016 and includes thirty participants. The dataset falls under CC0 (Public Domain) and is therefore shareable and usable in any setting, commercial or not.

The three files we are going to use contain a maximum of 33 individual participants (slightly more than the thirty named in the description) over the course of 31 days, which is a small sample. Regardless, we can expect to find some interesting insights during this analysis. Furthermore, age, gender and social background are not included in our data, so we cannot judge how diverse the sample is.

#### Security
While working on the analysis in R, the data will be accessed from the online repository on Kaggle. However, for backup reasons, a second copy of the data has been saved locally.

#### Ambiguity
Contrary to many other datasets, our data doesn't offer any details on what each of the columns represents. Apart from the header, nothing explains what a column contains or what unit the data is in. For example, the difference between TrackerDistance and TotalDistance is not explained and there are no units given for either.

### Importing libraries and data

In [None]:
library(tidyverse)
library(gridExtra)

df_activity <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
df_sleep <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
df_intensity <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv')

### Getting a rough overview of the data

In [None]:
glimpse(df_activity)

In [None]:
glimpse(df_sleep)

In [None]:
glimpse(df_intensity)

In [None]:
paste("Total participants (activity data):",length(unique(df_activity$Id)))
paste("Total participants (sleep data):",length(unique(df_sleep$Id)))
paste("Total participants (intensity data (hourly)):",length(unique(df_intensity$Id)))

In [None]:
paste("Total days recorded:",length(unique(df_activity$ActivityDate)))

## Phase 3: Process

#### Exploration
We start the data cleaning process by taking a broad look at the data to find any obvious flaws with the data. These include null-values or wrong data types. 

Afterwards, we will go into more detail about the data to find any nuances, which might, if not handled, create problems during the analysis later on.

#### Cleaning
All issues we find with our data will be taken care of accordingly. Every decision will be documented, along with the reasoning behind it, to maintain the data's integrity. This includes whether to drop columns with null values or to replace those values with the average of the other participants, and so on.

In [None]:
sum(is.na(df_activity))
sum(is.na(df_sleep))
sum(is.na(df_intensity))
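`sum(is.na(...))` only returns one total per data frame. As a minimal sketch, using a small made-up data frame (the values below are hypothetical and not taken from the Fitbit files), missing values could also be counted per column and fully duplicated rows spotted before deciding how to handle them:

```r
# Hypothetical toy data: one missing value and one fully duplicated row
toy <- data.frame(
    Id = c(1, 1, 2, 2),
    TotalSteps = c(1000, 1000, NA, 5000)
)

# Missing values per column instead of one grand total
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows that repeat an earlier row exactly
n_duplicates <- sum(duplicated(toy))
n_duplicates

# Drop exact duplicates, keeping the first occurrence
toy_clean <- unique(toy)
nrow(toy_clean)
```

The same three calls could be pointed at `df_activity`, `df_sleep` and `df_intensity`; whether duplicates exist there is not something this sketch assumes.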

From the overview of our data in phase 2 we could tell that all of our dates are in string format. We will resolve that issue by converting them to a date object.

In [None]:
df_activity$ActivityDate <- as.Date(df_activity$ActivityDate, format="%m/%d/%Y")
df_sleep$SleepDay <- as.Date(df_sleep$SleepDay, format="%m/%d/%Y")
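To illustrate why the same format string can serve both columns, here is a small sketch. The example strings are assumptions about how the raw values look (a plain date in the activity file, a date followed by a midnight timestamp in the sleep file); they are not copied from the data.

```r
# Example strings in the two formats assumed for the CSV files
activity_raw <- c("4/12/2016", "5/1/2016")
sleep_raw <- c("4/12/2016 12:00:00 AM", "5/1/2016 12:00:00 AM")

# as.Date() stops reading once the format is matched, so the time part is dropped
activity_date <- as.Date(activity_raw, format = "%m/%d/%Y")
sleep_date <- as.Date(sleep_raw, format = "%m/%d/%Y")

# A failed parse yields NA, so this count should be zero
sum(is.na(activity_date)) + sum(is.na(sleep_date))

# Both columns now share one type and can be compared or joined
identical(activity_date, sleep_date)
```

Counting `NA` values after the conversion is a cheap way to confirm that no row used a different date layout.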

We create a new column 'ActiveMinutes', adding up all intensities of activity by the participant.

In [None]:
df_activity$ActiveMinutes <- df_activity$VeryActiveMinutes + df_activity$FairlyActiveMinutes + df_activity$LightlyActiveMinutes

As explained earlier, columns 'TrackerDistance' and 'TotalDistance' are ambiguous. From the brief overview of our data above, they seem to contain the identical values. Let's take a look at that.

In [None]:
# Row indices where the two distance columns differ
mm_vector <- which(df_activity$TotalDistance != df_activity$TrackerDistance)
mismatched <- length(mm_vector)

paste("Mismatched rows:", mismatched)
# Use nrow() instead of a hard-coded row count
paste("% of mismatched rows:", round((mismatched/nrow(df_activity))*100,2), "%")
mm_vector

In [None]:
df_activity[690,]

In [None]:
# Difference between the two columns for the mismatched rows only
differences <- df_activity$TotalDistance[mm_vector] - df_activity$TrackerDistance[mm_vector]

paste("Max difference:", round(max(differences),2))
paste("Average difference:", round(mean(differences),2))

From the exploration above, we can see that 15 of our 940 rows have non-identical values for 'TotalDistance' and 'TrackerDistance'. Assuming the distances are recorded in kilometres, the average difference in those 15 cases is about 0.9 km, with a maximum of 1.83 km. Because the columns come with no description and the mismatched rows account for only roughly 1.6% of all rows, I decided not to use 'TrackerDistance' in this analysis and to treat 'TotalDistance' as the total distance a person walked or ran that day.

## Phase 4: Analyze

#### Exploration
Now that the data is properly formatted, we can start analyzing it. However, we treat our findings with care due to the limitations mentioned earlier (sample size, missing units, ...). Additionally, keep in mind that the scatterplots that follow may show a strong correlation, but correlation does not necessarily imply causation between the variables.

### Five-Number Summary of our datasets

In [None]:
df_activity %>%
    select(TotalSteps, TotalDistance, Calories) %>%
        summary()

df_sleep %>%
    select(TotalMinutesAsleep, TotalTimeInBed) %>%
        summary()

Our first observation concerns total steps: the mean is 7,638, and the lowest 25% of records reach 3,790 steps at most. The CDC considers fewer than 5,000 steps per day to be sedentary and recommends 10,000 steps for general fitness*. This figure varies, as age, fitness goal and medical conditions also play into it. However, only about a quarter of all records achieve that mark, with the top 25% starting at 10,727 steps per day.

Average time asleep is 419.5 minutes per night. This is right at the lower bound of recommended sleep, which is between 7 and 9 hours (420 - 540 minutes). Again, the lowest 25% of sleep records only reach 360 minutes (6 hours) or less.
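Quartiles only bracket how many records miss a threshold. As a rough sketch, the exact shares could be computed directly from the 5,000 and 10,000 step marks and the 420 to 540 minute sleep window named above. The vectors here are hypothetical stand-ins for `df_activity$TotalSteps` and `df_sleep$TotalMinutesAsleep`:

```r
# Hypothetical daily records, not taken from the dataset
steps <- c(2500, 4800, 7600, 9900, 12000)
minutes_asleep <- c(330, 400, 425, 480, 560)

# mean() of a logical vector gives the share of TRUE values
share_sedentary <- mean(steps < 5000)
share_reaching_goal <- mean(steps >= 10000)
share_short_sleep <- mean(minutes_asleep < 420)
share_long_sleep <- mean(minutes_asleep > 540)

# Shares as percentages
round(100 * c(sedentary = share_sedentary,
              goal = share_reaching_goal,
              short_sleep = share_short_sleep,
              long_sleep = share_long_sleep), 1)
```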

#### Conclusions:
* At least 25 percent of step records fall under sedentary behaviour as defined by the CDC (fewer than 5,000 steps).
* On average, participants would need roughly 2,400 more steps per day to reach 10,000 (10,000 - 7,638 = 2,362).
* At least 25 percent of sleep records fall an hour or more below the recommended minimum of 7 hours, and the mean sits right at that minimum.

#### Recommendations:
* Offer users an easy way to access their step count and inform them on the recommended amount of steps (adjusted for age and fitness).
* Remind customers to go to bed early enough to get at least 7 hours of sleep before their alarm.
* Log sleeping pattern of users to give helpful tips on healthy sleeping habits.

*https://www.medicalnewstoday.com/articles/how-many-steps-should-you-take-a-day#for-general-health

### Visualizations

In [None]:
ggplot(data=df_activity) +
    geom_point(mapping=aes(x=ActiveMinutes, y=Calories)) +
    geom_smooth(mapping=aes(x=ActiveMinutes, y=Calories)) +
    labs(title="Active minutes vs. calories burned")

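# r for the plotted variables, plus very active minutes alone for comparison
paste("r (ActiveMinutes) =", cor(df_activity$ActiveMinutes, df_activity$Calories))
paste("r (VeryActiveMinutes) =", cor(df_activity$VeryActiveMinutes, df_activity$Calories))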

Unsurprisingly, participants who were active for longer during the day also burned more calories.
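For readers unfamiliar with the coefficient, the following sketch shows what `cor()` computes by default (Pearson's r): the covariance of the two variables divided by the product of their standard deviations. The numbers are invented for illustration only.

```r
# Hypothetical toy values, chosen only to illustrate the calculation
active_minutes <- c(60, 120, 180, 240, 300)
calories <- c(1600, 1900, 2300, 2400, 2900)

# Pearson's r by hand: covariance divided by the product of standard deviations
r_manual <- cov(active_minutes, calories) / (sd(active_minutes) * sd(calories))

# Same value from the built-in function used in the notebook
r_builtin <- cor(active_minutes, calories)

all.equal(r_manual, r_builtin)
```

A value close to 1 describes how tightly the points follow a rising line; it says nothing about which variable drives the other.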

#### Conclusions:
* The more active people are in a day, and the more intense the activity is, the more calories they burn.

#### Recommendations:
* Remind users to consume sufficient nutrients during the day, based on calories burned.
* Inform users periodically about their burned calories so they can adapt their calorie intake (based on fitness goal, e.g. gaining or losing weight).
* Offer customers healthy meal plans and recipes to not neglect the nutrition side of their health journey.

In [None]:
ggplot(data=df_sleep) +
    geom_density(mapping=aes(x=TotalMinutesAsleep)) + 
    geom_vline(xintercept=420, linetype="dashed", color = "red") +
    geom_vline(xintercept=540, linetype="dashed", color = "red") +
    labs(title="Distribution of total minutes asleep", subtitle="Marked area indicates recommended amount of sleep for adults (7-9 hours)")

The dashed red lines mark the lower and upper bounds of recommended sleep for an adult (7-9 hours)*. Interestingly, the distribution is shifted towards the left of this "perfect window", showing that participants were more likely to undersleep than oversleep.

#### Conclusions:
* Many participants already have a healthy sleep pattern.
* People were more likely to undersleep than oversleep.
* The average sleeping duration (419.5 minutes) falls just short of the minimum recommended duration of 7 hours.

#### Recommendations:
* Inform user once an unhealthy pattern emerges (repeatedly sleeping too little/much).
* Educate customers on their recommended hours of sleep (based on age).

*https://www.sleepfoundation.org/how-sleep-works/how-much-sleep-do-we-really-need#:~:text=National%20Sleep%20Foundation%20guidelines1,to%208%20hours%20per%20night.

### Joining the data

In order to be able to compare activity and sleep in our analysis, we need to perform a join on our two datasets, based on the participants' Id and the date of the sleep/activity record.

In [None]:
df_combine <- merge(df_activity, df_sleep, by.x=c("Id","ActivityDate"), by.y=c("Id", "SleepDay"))
glimpse(df_combine)
paste("Number of participants:",length(unique(df_combine$Id)))

**Note:** Because we performed an inner join on the two dataframes and the sleep data only includes 24 participants, as opposed to 33 in the activity data, the number of rows decreased from 940 in df_activity to 413 in the new dataframe.
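The effect of the join type can be seen on a small scale. This sketch uses two hypothetical toy tables (column names mirror the notebook, values are made up) and contrasts the default inner join of `merge()` with a left join via `all.x = TRUE`:

```r
# Hypothetical toy tables: participant 3 has activity but no sleep record
activity <- data.frame(Id = c(1, 2, 3),
                       ActivityDate = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12")),
                       TotalSteps = c(8000, 4000, 12000))
sleep <- data.frame(Id = c(1, 2),
                    SleepDay = as.Date(c("2016-04-12", "2016-04-12")),
                    TotalMinutesAsleep = c(430, 390))

# Inner join (merge default): only Id/date pairs present in both tables survive
inner <- merge(activity, sleep, by.x = c("Id", "ActivityDate"), by.y = c("Id", "SleepDay"))

# Left join: every activity row is kept, missing sleep values become NA
left <- merge(activity, sleep, by.x = c("Id", "ActivityDate"), by.y = c("Id", "SleepDay"), all.x = TRUE)

nrow(inner)
nrow(left)
sum(is.na(left$TotalMinutesAsleep))
```

The inner join suits this analysis because every remaining row has both activity and sleep values; the cost is that participants without sleep records drop out of all combined plots.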

In [None]:
r1 <- paste("r=",round(cor(df_combine$SedentaryMinutes, df_combine$TotalMinutesAsleep),3))

ggplot(data=df_combine) +
    geom_point(mapping=aes(x=SedentaryMinutes, y=TotalMinutesAsleep, color=TotalSleepRecords)) +
    geom_hline(yintercept=420, linetype="dashed", color = "red") +
    geom_hline(yintercept=540, linetype="dashed", color = "red") +
    labs(title="Sedentary minutes vs. total minutes asleep", subtitle=r1)


From the plot above (sedentary minutes vs. time asleep), we can see a clear negative relationship between the two variables: as sedentary minutes during the day go up, time asleep tends to go down. The area between the two dashed lines marks the recommended sleep duration for adults (7-9 hours).

#### Conclusions:
* Minutes spent sleeping and sedentary minutes have a strong negative correlation.
* We cannot tell whether sleeping longer makes people sit less or whether sitting for a long period of time causes people to sleep less.

#### Recommendations:
* Send users a notification once their sedentary minutes reach the level beyond which sleep tends to fall outside the recommended range (around 750 minutes).

In [None]:
ggplot(data=df_combine) +
    geom_point(mapping=aes(x=SedentaryMinutes, y=TotalMinutesAsleep, color=TotalSleepRecords)) +
    facet_wrap(~TotalSleepRecords) +
    geom_hline(yintercept=420, linetype="dashed", color = "red") +
    geom_hline(yintercept=540, linetype="dashed", color = "red") +
    labs(title="Sedentary minutes vs. total minutes asleep", subtitle=r1)

Participants who slept in one interval tended to sleep for a much shorter time than those who slept in two or three intervals. However, all three entries where a person slept in three intervals overshot the upper bound of 9 hours. This has to be taken with a grain of salt, as three individual cases are not sufficient to draw a clear conclusion.

#### Conclusion:
* The more intervals/sessions you sleep in, the less likely it is for you to *undersleep*.
* Likewise, as sleeping sessions go up, so does your risk of *oversleeping*.

#### Recommendations:
* Recommend customers *not* to sleep in more than two individual sessions (for risk of oversleeping).
* Inform users on time spent sleeping on wake-up, in order to let them know whether or not more sleep would be recommended.

In [None]:
ggplot(data=df_activity) +
    geom_point(mapping=aes(x=SedentaryMinutes, y=ActiveMinutes, colour= (SedentaryMinutes > 600) & (ActiveMinutes < 40))) +
    # annotate() draws each border segment exactly once
    annotate("segment", x = 600, y = 40, xend = 1500, yend = 40, linetype="dashed", color = "red") +
    annotate("segment", x = 600, y = 0, xend = 600, yend = 40, linetype="dashed", color = "red") +
    scale_colour_manual(name = 'At risk', values = c("TRUE"='darkred', "FALSE"='black'))

A study published in the Journal of the American College of Cardiology, "Sitting Time, Physical Activity, and Risk of Mortality in Adults", found that people aged 45 and above who sit for 10 hours or more a day should exercise for at least 20-40 minutes per day to mitigate most cardiovascular risks associated with their sedentary behaviour.
That means people in the red area of the plot above, especially those aged 45 and over, should strive to reduce their sedentary time, increase their active minutes, or both.
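The colouring rule in the plot can also be turned into a count. The sketch below applies the same thresholds (more than 600 sedentary minutes, fewer than 40 active minutes) to a hypothetical toy data frame; the `AtRisk` column name is introduced here for illustration only.

```r
# Hypothetical daily records, not taken from the dataset
toy <- data.frame(
    SedentaryMinutes = c(500, 650, 900, 1100),
    ActiveMinutes = c(200, 35, 120, 10)
)

# Same rule as the colouring in the plot: over 10 hours sedentary and under 40 active minutes
toy$AtRisk <- toy$SedentaryMinutes > 600 & toy$ActiveMinutes < 40

# Number and share of flagged records
n_at_risk <- sum(toy$AtRisk)
share_at_risk <- mean(toy$AtRisk)
n_at_risk
share_at_risk
```

Applied to `df_activity`, the same two lines would put a number on how many records fall inside the red box.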

#### Conclusions:
* Very few participants fall in the 'red box'.
* The most active people sit for roughly 500 to 1,000 minutes per day.

#### Recommendations:
* Warn people repeatedly falling into the red area about the health risks associated with their behaviour.
* Inform or encourage those individuals to do light exercise.
* Notify users sitting for a long continuous time to just shortly walk around or do light exercises.

### Creating new column 'WeekDay'

We continue by creating a new column in our combined dataframe that holds the day of the week. This will help us identify sleeping and exercising trends that depend on the day of the week.

In [None]:
df_combine$WeekDay <- weekdays(df_combine$ActivityDate)
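One caveat: `weekdays()` returns day names in the language of the R session's locale, so matching them against English names only works in an English locale. A possible locale-independent variant, shown here as a sketch on three example dates, derives the ISO weekday number with `format(..., "%u")` and maps it onto fixed English labels:

```r
# "%u" gives the ISO weekday number (1 = Monday ... 7 = Sunday) regardless of locale
dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17"))
day_number <- as.integer(format(dates, "%u"))

# Map the numbers onto fixed English labels in calendar order
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
week_day <- factor(week_order[day_number], levels = week_order)

as.character(week_day)
```

Storing the result as a factor with ordered levels also keeps the days in calendar order in tables and plots without extra sorting.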

### Analyze behaviour by day

In [None]:
week_order <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")

combine_by_weekday <- df_combine %>%
    group_by(WeekDay) %>%
    summarise(MeanSteps=mean(TotalSteps),
              MeanDistance=mean(TotalDistance),
              MeanActiveMinutes=mean(ActiveMinutes),
              MeanCalories=mean(Calories),
              MeanMinutesAsleep=mean(TotalMinutesAsleep)) %>%
    arrange(match(WeekDay,week_order))

combine_by_weekday

In [None]:
# This plot shows active minutes, so the variable is named accordingly
active_plot <- ggplot(data=combine_by_weekday) +
    geom_bar(mapping=aes(x=factor(WeekDay,levels=week_order),y=MeanActiveMinutes),stat="identity") +
    labs(x="Day", y="Time being active [in min]", title="Average time active by weekday")

distance_plot <- ggplot(data=combine_by_weekday) +
    geom_bar(mapping=aes(x=factor(WeekDay,levels=week_order),y=MeanDistance),stat="identity") +
    labs(x="Day", y="Distance [in km]", title="Average distance by weekday")

grid.arrange(active_plot,distance_plot,nrow=2)

Participants' physical activity declines gradually over the work week. It spikes on Saturday, followed by what may be a rest day on Sunday.

#### Conclusions:
* Participants were the most active on Saturday.
* Participants were least active on Sunday.
* Activity levels slowly decrease during the work week.

#### Recommendations:
* Send customers reassuring and motivational reminders as the week progresses to keep energy high.
* Encourage customers to do a light workout or go for a walk on Sunday.

In [None]:
ggplot(data=combine_by_weekday) +
    geom_bar(mapping=aes(x=factor(WeekDay,levels=week_order),y=MeanMinutesAsleep),stat="identity") +
    geom_hline(yintercept=mean(df_sleep$TotalMinutesAsleep), linetype="dashed", color = "red") +
    labs(x="Day", y="Time spent sleeping [in min]", title="Average time spent sleeping by weekday", subtitle="Red line indicates average minutes asleep by participants")

Minutes slept spike on Sunday, when participants sleep about half an hour longer than the mean. This might be one of the reasons for the shorter distance travelled and lower activity on that day, as seen in the plots above. However, sleeping longer doesn't necessarily cause the drop in activity, as participants also sleep longer on Saturday than during the week, which is when they exercise the most.

#### Conclusions:
* Participants sleep the longest on Sunday.

#### Recommendations:
* Send customers helpful information on healthy sleeping habits and patterns to not oversleep.
* Suggest alarm times based on when users go to bed, e.g. an earlier alarm when going to bed at 10 than when going to bed at 11 (on weekends).
* Offer customers easy access to their sleeping data for them to analyze their sleeping habits themselves.

In [None]:
# Parse and derive the date in one fixed time zone so late-evening hours keep their calendar day
df_intensity$ActivityHour <- as.POSIXct(df_intensity$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz="UTC")
df_intensity$Time <- format(df_intensity$ActivityHour, format = "%H:%M:%S")
df_intensity$Date <- as.Date(df_intensity$ActivityHour, tz="UTC")
df_intensity$WeekDay <- weekdays(df_intensity$Date)
glimpse(df_intensity)

### Looking at intensity by time of day

In the last step of this analysis, we dive even deeper into the habits of our participants, looking at it on an hourly basis.

In [None]:
combine_by_hour <- df_intensity %>%
  group_by(Time) %>%
  summarise(MeanIntensity = mean(TotalIntensity))

combine_by_hour
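The peak hour can also be read off programmatically rather than from the table. This is a small sketch on hypothetical toy data; it uses base R's `tapply()` as a stand-in for the `group_by()`/`summarise()` pipeline so that it runs without extra packages.

```r
# Hypothetical hourly records for two participants
toy <- data.frame(
    Time = c("06:00:00", "18:00:00", "06:00:00", "18:00:00"),
    TotalIntensity = c(4, 20, 8, 30)
)

# Base R equivalent of group_by(Time) + summarise(mean(TotalIntensity))
mean_by_hour <- tapply(toy$TotalIntensity, toy$Time, mean)
mean_by_hour

# Name of the hour with the highest mean intensity
peak_hour <- names(which.max(mean_by_hour))
peak_hour
```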

In [None]:
ggplot(data=combine_by_hour) +
    geom_bar(mapping=aes(x=Time,y=MeanIntensity),stat="identity") +
    theme(axis.text.x = element_text(angle = 90))

Plotting the average intensity by our participants for every hour of the day, we can see the biggest spike of intense activity between 5 and 7 p.m. After that we see a steep decline until early in the morning again. From 5 a.m. intensity steadily climbs again, only experiencing a small dip at 3 p.m.

#### Conclusions:
* The most intense exercise occurs between 5 and 7 p.m. This is very likely after work for most participants.
* From 4 to 5 a.m. there is a big increase in intensity, likely from participants exercising before work.

#### Recommendations:
* Let people set preferred workout times and subsequently remind them an hour before, so they don't miss a workout.

In [None]:
ggplot(data=df_intensity) +
    geom_bar(mapping=aes(x=Time,y=TotalIntensity),stat="identity") +
    geom_vline(xintercept="12:00:00", color="blue") +
    facet_wrap(~WeekDay) +
    labs(title="Intensity per hour, by day", subtitle="Blue line indicates 12 p.m.") +
    theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

From the plots above, we can tell that the majority of activity intensity on weekdays occurs around 5 p.m. to 7 p.m. This supports our hypothesis that the peak in the previous plot comes from people working out after work. On Saturday, people preferred to work out earlier (around 1 p.m.), while on Sunday the participants don't appear to have a favoured time to work out.

#### Conclusions:
* People specifically like to work out between 5-7 p.m. on work days.
* The majority of people prefer to work out earlier on Saturday (compared to work days). However, there is still a small spike around noon.
* Participants had no clearly preferred time to work out on Sunday.

#### Recommendations:
* Send custom workout reminders by looking at the workout pattern of the user.

## Phase 5: Share
### Key insights

#### Target audience
We already know that Bellabeat's main demographic is women. The timing of workouts in this data from another fitness tracker suggests that most users work roughly a 9-5 schedule. Furthermore, many of the participants spend a significant amount of time sitting, which leads us to believe they might work regular desk jobs.

**Important insight:**
Fitness tracker users aren't necessarily avid athletes. Bellabeat's target audience includes everyone who is adopting, or thinking about adopting, a healthier lifestyle. This directly affects the upcoming marketing campaign: the actors in it, if any are planned, should cover various body types and lifestyle situations of women, not just athletes.

#### Problems customers can solve
Thanks to our analysis we could identify many slight health risks and trends that would go unnoticed without a fitness tracker. These include total steps in a day, time slept, and workout intensity by day and even by hour. Trackers also record further metrics, such as heart rate, that this analysis did not cover.

All of these features are important selling points for all the smart products of Bellabeat and should be advertised in the campaign.

#### App/Tracker Features
Many of the recommended features for the smart devices by Bellabeat have already been named and explained in the analysis. Here are the most significant ones.
* Dashboard for users to recap on activity, steps, sleep, etc.
* Notifications for unhealthy behaviour such as long sedentary time and oversleeping.
* Reminders to stand up after sitting for an extended period of time, to go to bed soon, or not to forget to eat or work out.

# **Thank you**
for taking interest in this case study.
-SZ