This document provides more detail on the steps taken to clean the data in preparation for analysis.

Stakeholders who only wish to see the results and findings can skip to "8. Findings & Insights" in the table of contents.

<a id="1"></a> <br>
# 1. Business Task

**To analyse smart device usage data in order to gain insight into how people are already using their smart devices. With this information, make high-level recommendations for how the trends discovered can inform Bellabeat's marketing strategy**

<a id="1"></a> <br>
# 2. Stakeholders

* **Urška Sršen**: Bellabeat’s cofounder and Chief Creative Officer
* **Sando Mur**: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
* **Bellabeat marketing analytics team**: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

<a id="1"></a> <br>
# 3. Questions to Answer

* What are some trends in smart device usage?
* How could these trends apply to Bellabeat customers?
* How could these trends help influence Bellabeat marketing strategy?

<a id="1"></a> <br>
# 4. About the Data

For this analysis, the Bellabeat marketing analytics team has been provided with smart device usage data from a non-Bellabeat product. This dataset will be analysed for insights that can help guide Bellabeat's marketing strategy.

* **Dataset Name**: FitBit Fitness Tracker Data
* **Data Collection Method**: Distributed survey
* **Data Collection By**: Amazon Mechanical Turk
* **Data Collection Period**: 03.12.2016 - 05.12.2016
* **No. of Participants**: 30 
* **Participant Type**: Eligible Fitbit users
* **Privacy Protection**: Users consented to submission of tracker data - including *minute-level output for physical activity*, *heart rate*, and *sleep monitoring*.
* **Parsing Info**: *Session ID* (col_A) or *Timestamp* (col_B)


<a id="1"></a> <br>
# 5. Data Preparation

## **Initial Observations**

Upon reviewing each of the .csv files in Excel, the following observations were made:

* Data falls into three categories: sleep, activity and weight.
* Data is stored in different data frame formats (wide and narrow), each with its own .csv file.
* Data is logged at different time scales (minutes, hours and days), each with its own .csv file.
* "dailyActivity_merged.csv" has aggregated data from other data frames such as: 
    * steps 
    * calories 
    * intensity minutes 
    * distances
    * parsed by ID and date
* "sleepDay_merged.csv" includes useful user data on tracking sleep.
* "weightLogInfo_merged.csv" includes useful user data on tracking weight.

## **Importing Data**

Proceed by importing the datasets deemed valuable from our initial observations.

Install and load the 'tidyverse' package, which provides functionality to help clean data.

In [None]:
install.packages("tidyverse")
library(tidyverse)

In [None]:
## Import daily sleep data
sleep <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Import daily activity data
activity <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Import weight data
weight <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")

## **Understanding Data Types**

Let's find out more about the datasets we are working with. Get an overview of each of the datasets by using the glimpse function.

In [None]:
glimpse(sleep)
glimpse(activity)
glimpse(weight)

**Observation Notes**

* Date variables in each data frame need formatting
    * Recognise the date variable as a date, not a character
    * Remove the hh:mm:ss timestamp as it is not useful data
    * Rename the date variables so they are all identical
    * Parse the dates in mdy order so that R recognises the correct date order.
    
    
* Each data frame has a different number of entries
    * activity has 940 entries
    * sleep has 413 entries
    * weight has 67 entries

<a id="1"></a> <br>
# 5. Data Cleaning

## **Date Formatting**

Reformatting date variables so that they are formatted identically.

In [None]:
## Remove any characters found after "2016"
sleep$SleepDay <- gsub("(2016).*", "\\1", sleep$SleepDay)
weight$Date <- gsub("(2016).*", "\\1", weight$Date)

## Verify - each value should be either 8 or 9 characters long.
nchar(sleep$SleepDay)
nchar(weight$Date)

Now use the 'lubridate' package to properly format and parse the dates in all datasets.

In [None]:
## Load lubridate
library(lubridate)

## Parse dates
sleep$SleepDay <- mdy(sleep$SleepDay)
activity$ActivityDate <- mdy(activity$ActivityDate)
weight$Date <- mdy(weight$Date)

## Verify
str(sleep)
str(activity)
str(weight)
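
As an alternative to trimming the timestamp with `gsub()` first, base R's `as.Date()` can read the leading date directly: when an explicit `format` is given, any trailing characters in the string are ignored. The following is only a minimal sketch, and the vector `raw_dates` holds made-up sample strings in the same month/day/year layout, not values taken from the dataset.

```r
# Hypothetical sample values shaped like the raw SleepDay column
raw_dates <- c("4/12/2016 12:00:00 AM", "5/1/2016 12:00:00 AM")

# %m/%d/%Y reads month/day/4-digit year; the trailing time is ignored
parsed_dates <- as.Date(raw_dates, format = "%m/%d/%Y")

# Confirm the result is a Date vector and inspect it in ISO format
class(parsed_dates)
format(parsed_dates, "%Y-%m-%d")
```

This approach would replace both the `gsub()` step and the `mdy()` call for a column, at the cost of spelling out the format by hand.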

Change the names of the date variables in the sleep and activity datasets so they match the weight dataset. Use the 'dplyr' package to do so.

In [None]:
## Rename date variables
sleep <- rename(sleep, Date=SleepDay)
activity <- rename(activity, Date=ActivityDate)

## Verify - and create vectors of col names for convenience.
sleepnames <- colnames(sleep)
activitynames <- colnames(activity)
weightnames <- colnames(weight)

sleepnames
activitynames
weightnames

## **Check for Missing Values**

Check for missing or NA values in each of the data sets.

In [None]:
sleep %>% 
  select(all_of(sleepnames)) %>% 
  filter(!complete.cases(.))

weight %>% 
  select(all_of(weightnames)) %>% 
  filter(!complete.cases(.))

activity %>% 
  select(all_of(activitynames)) %>% 
  filter(!complete.cases(.))

**Notes**

* sleep df has no missing values
* activity df has no missing values
* weight df: 65 of 67 entries did not record 'Fat'
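
A quick way to see which column the missing values sit in is to count `NA` values per column. This is a sketch on a small made-up data frame (`weight_toy` and its values are hypothetical); the same call could be pointed at `sleep`, `activity` or `weight`.

```r
# Hypothetical toy data frame shaped like the weight data
weight_toy <- data.frame(
  Id = c(1, 2, 3),
  WeightKg = c(52.6, 61.0, 70.2),
  Fat = c(22, NA, NA)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na_per_column <- colSums(is.na(weight_toy))
na_per_column
```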

## **Creating New Data Frame - weight**

When checking for missing values, it was observed that the weight data frame could use further cleaning.
* Use Kg measurement as that is what we use at Bellabeat.
* Remove 'Fat' variable as data is not useful for analysis.
* Remove 'LogId' variable as data is not useful for analysis.

In [None]:
weight <- data.frame(Id=weight$Id,
                        Date=weight$Date,
                        WeightKg=weight$WeightKg,
                        BMI=weight$BMI,
                        IsManualReport=weight$IsManualReport)

as_tibble(weight)

## **Data Cleaning Results**

Cleaning resulted in 3 data frames to analyse:
1. activity
2. sleep
3. weight


<a id="1"></a> <br>
# 6. Data Analysis

## **Trends in smart device usage**

**How many different dates are there in each of the data frames?**

In [None]:
n_distinct(sleep$Date)
n_distinct(activity$Date)
n_distinct(weight$Date)

Each data frame contains the same number of unique dates: 31.

**How many entries were submitted on each day?**

In [None]:
## tally the number of entries for each df by Date.
sleepcount_day <- tally(group_by(sleep, Date))
activitycount_day  <- tally(group_by(activity, Date))
weightcount_day  <- tally(group_by(weight, Date))

## merge tally data
entrycount_date <- activitycount_day %>% 
  full_join(sleepcount_day,by="Date") %>% 
  full_join(weightcount_day, by="Date")

## rename variables
entrycount_date <- rename(entrycount_date,
                      ActivityCount=n.x,
                      SleepCount=n.y,
                      WeightCount=n)

## Verify
View(entrycount_date)


**How many entries were submitted by each ID?**

In [None]:
## tally the number of entries for each df by ID.
sleepcount <- tally(group_by(sleep, Id))
activitycount <- tally(group_by(activity, Id))
weightcount <- tally(group_by(weight, Id))

## merge tally data
df <- sleepcount %>% full_join(weightcount,by="Id")
countmerged <- activitycount %>%  full_join(df, by="Id")

## rename variables
countmerged_byid <- rename(countmerged,
                      ActivityCount=n,
                      SleepCount=n.x,
                      WeightCount=n.y)

## replacing NA values with '0'.
countmerged_byid[is.na(countmerged_byid)] <- 0

## verify
View(countmerged_byid)
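
The `n`, `n.x` and `n.y` suffixes produced by the joins are easy to mix up. One possible way to avoid them, sketched here on made-up toy data frames (all `*_toy` names and values are hypothetical), is to name each count column when it is created using the `name` argument of dplyr's `count()`.

```r
library(dplyr)

# Hypothetical toy stand-ins for the cleaned data frames
activity_toy <- tibble(Id = c(1, 1, 2, 3))
sleep_toy    <- tibble(Id = c(1, 2, 2))
weight_toy   <- tibble(Id = c(3))

# Name each count column up front so the joins need no suffix renaming
counts_by_id <- count(activity_toy, Id, name = "ActivityCount") %>%
  full_join(count(sleep_toy, Id, name = "SleepCount"), by = "Id") %>%
  full_join(count(weight_toy, Id, name = "WeightCount"), by = "Id") %>%
  as.data.frame()

# IDs with no entries in a category come back as NA; treat those as 0
counts_by_id[is.na(counts_by_id)] <- 0
counts_by_id
```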

**What is the average number of entries per participant for each data type?**

In [None]:
## Use the summarize function from dplyr to calculate the mean entry count per ID for each data category.
countmerged_byid_sum <- countmerged_byid %>% 
  dplyr::summarize(mean_sleepcount = mean(SleepCount),
                   mean_activitycount = mean(ActivityCount),
                   mean_weightcount = mean(WeightCount)) %>% 
  as.data.frame()

## Verify
View(countmerged_byid_sum)

**What is the difference between maximum possible entry count vs actual entry count?**

In [None]:
## no. of unique dates
testdays <- 31
## no. of unique participants
participants <- 33
## no. of data categories - sleep, weight, activity
datacat <- 3

## calculate maximum no. of entries.
potentialentry_type <- testdays * participants
potentialentry_total <- potentialentry_type * datacat

potentialentry_type
potentialentry_total

Maximum number of entries -
* 1023 entries - weight
* 1023 entries - sleep
* 1023 entries - activity

**3069 entries - total**

In [None]:
## check actual number of entries
tally(sleep)
tally(weight)
tally(activity)

Actual number of entries -
* 413 entries - sleep
* 67 entries - weight
* 940 entries - activity

**1420 entries - total**
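
The gap between possible and actual entries can be expressed as a completion percentage for each category. The sketch below simply reuses the counts reported above; the variable names are illustrative.

```r
# Maximum entries per category: 31 unique dates x 33 participants
max_per_category <- 31 * 33

# Actual entry counts reported above
actual <- c(Sleep = 413, Weight = 67, Activity = 940)

# Share of possible entries actually submitted, as a whole percentage
completion_pct <- round(actual / max_per_category * 100)
completion_pct

# Overall completion across the three categories
total_pct <- round(sum(actual) / (max_per_category * 3) * 100)
total_pct
```

These per-category percentages are the figures quoted in the findings section.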

In [None]:
## Create a new dataframe with entry count data

datatype <-c("Sleep", "Weight", "Activity")
maxcount <- c(1023, 1023, 1023)
actualcount <- c(413, 67, 940)

countsummary <- data.frame(Category=datatype,
                           MaxCount=maxcount,
                           ActualCount=actualcount)

## Verify
View(countsummary)

## **Investigating Further Insights**

The team and I have agreed that there is too little weight-tracking data to make actionable recommendations to stakeholders. While weight data could help Bellabeat better understand smart device users, more data needs to be collected before further analysis can be conducted.

For this investigation we will take a closer look at sleep and activity data, as significantly more observations were recorded for both. These can be analysed for trends that may provide useful insights for Bellabeat's marketing strategy.


**Merge sleep and activity data**

In [None]:
## Merge by ID and Date
data <- merge(sleep, activity, by=c("Id", "Date"))
## Check that no. of distinct Id's = no. sleep data unique Id's
n_distinct(data$Id)
## Verify
View(data)

**Check that data is valid for summations**

In order to reliably draw insights from sleep data, we must ensure that each observation holds valid data.

**Which IDs recorded fewer than 3 unique observations for sleep?**

In [None]:
## Calculate how many entries of sleep data for each ID.
sleeptally <- tally(group_by(data, Id))
## Rename variables
sleeptally <- rename(sleeptally, EntryCount=n)


## Filter for IDs with fewer than 3 entries
invalid <- sleeptally %>% 
  select(EntryCount, Id) %>% 
  filter(EntryCount<3)

## Check results
View(invalid)

Here we have 2 unique IDs that have fewer than 3 unique entries.

* **2320127002**
* **7007744171**


In [None]:
## Review filtered data
data %>% 
  filter(Id %in% c(invalid$Id))

**Remove invalid IDs from data**

In [None]:
## filter unwanted data
sleepdata <- data %>% 
                  filter(!Id %in% c(2320127002, 7007744171))

## Verify that IDs got removed
sleepdata %>% 
  filter(Id %in% c(2320127002, 7007744171))
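
Typing the IDs by hand works for two users, but the same removal can be driven by the entry threshold itself. The following is a sketch on made-up data (`data_toy` and `min_entries` are hypothetical names) using a grouped `filter()` with `n()`.

```r
library(dplyr)

# Hypothetical toy data: Id 1 has 3 rows, Id 2 has 1 row, Id 3 has 4 rows
data_toy <- tibble(
  Id = c(1, 1, 1, 2, 3, 3, 3, 3),
  TotalMinutesAsleep = c(400, 420, 390, 300, 450, 460, 440, 455)
)

# Minimum number of entries an ID needs in order to be kept
min_entries <- 3

# Within each Id group, n() is the group's row count; keep only large enough groups
sleepdata_toy <- data_toy %>%
  group_by(Id) %>%
  filter(n() >= min_entries) %>%
  ungroup()

unique(sleepdata_toy$Id)
```

With this pattern the threshold can be changed in one place, and no IDs need to be copied from the `invalid` table.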

**Get a summary of cleaned sleep data**

In [None]:
summary(sleepdata)

**Create a new dataframe taking an average of all data by ID**

In [None]:
## Create a new dataframe with useful data for analysis
sleepdata_avg <- sleepdata %>% 
  group_by(Id) %>%
  dplyr::summarize(mean_timeinbed = mean(TotalTimeInBed),
                   mean_timeasleep = mean(TotalMinutesAsleep),
                   mean_steps = mean(TotalSteps),
                   mean_distance = mean(TotalDistance),
                   mean_calories = mean(Calories),
                   mean_sedentary = mean(SedentaryMinutes),
                   mean_light = mean(LightlyActiveMinutes),
                   mean_fair = mean(FairlyActiveMinutes),
                   mean_very = mean(VeryActiveMinutes)) %>% 
  as.data.frame()

## Round all numeric values to whole numbers
sleepdata_avg <- sleepdata_avg %>% 
  mutate(across(where(is.numeric), ~ round(., 0)))

## Verify
View(sleepdata_avg)

**Calculate time awake (time spent in bed but not asleep)**

In [None]:
## Set variables
a <- sleepdata_avg$mean_timeinbed
b <- sleepdata_avg$mean_timeasleep

## Calculate
awake <- (a-b)

## Create a new df with results
sleep_analysis <- data.frame(Id=sleepdata_avg$Id,
                            mean_timeinbed=sleepdata_avg$mean_timeinbed,
                            mean_timeasleep=sleepdata_avg$mean_timeasleep,
                            mean_timeawake=awake)

## Verify
View(sleep_analysis)

## Review
summary(sleep_analysis)

**Categorising time awake data**

Calculate the number of people in each of the following categories -

* Less than 15min awake
* 15-30min awake
* More than 30min awake

In [None]:
## Outline conditions (case_when assigns the first matching condition)
awakegroup <- sleep_analysis %>%
  mutate(Group = case_when(mean_timeawake <= 15 ~ "<15min",
                           mean_timeawake <= 30 ~ "15-30min",
                           mean_timeawake > 30 ~ "30min+"))

## Verify
View(awakegroup)
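
The same three groups can also be assigned with base R's `cut()`, which bins a numeric vector by break points. This is a sketch on made-up minute values (`awake_minutes` is hypothetical); with the default `right = TRUE`, each bin includes its upper bound, so 15 falls in the first group and 30 in the second.

```r
# Hypothetical mean minutes awake in bed for five users
awake_minutes <- c(10, 15, 16, 30, 31)

# Bins: (-Inf, 15], (15, 30], (30, Inf)
awake_group <- cut(awake_minutes,
                   breaks = c(-Inf, 15, 30, Inf),
                   labels = c("<15min", "15-30min", "30min+"))

# Pair each value with its assigned group
data.frame(awake_minutes, awake_group)
```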

In [None]:
## Count how many occurrences in each awake group
awakegroup %>% count(Group)

## Create new variables for new df
UserCount <- c(22)
Freq <- c(3, 9, 10) ## results from above
Groupname <- c("<15min", "15-30min", "30min+")

## Create new data frame

awakeresults <- data.frame(Groupname,
                           Freq,
                           UserCount)
## Calculate percentages
awakeresults <- awakeresults %>% 
  mutate(percentage = (Freq / UserCount * 100)) %>%
  mutate(across(where(is.numeric), ~ round(., 0)))

## Verify
View(awakeresults)
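
The percentages can also be derived without typing the counts in by hand. As a sketch, the vector below rebuilds the group labels from the counts reported above (3, 9 and 10 users); in practice the `Group` column of `awakegroup` would presumably take its place.

```r
# Group labels repeated by the counts reported above (3, 9 and 10 users)
groups <- rep(c("<15min", "15-30min", "30min+"), times = c(3, 9, 10))

# table() counts each label; prop.table() converts counts to proportions
group_counts <- table(groups)
group_pct <- round(prop.table(group_counts) * 100)
group_pct
```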


<a id="1"></a> <br>
# 7. Visualisations

* Entry count by data category - sleep, weight and activity tracking
* Maximum entry count vs actual entry count
* Relationships between amount of time asleep and activity time
* Awake time groupings

## **Smart Device Usage**

[<div class='tableauPlaceholder' id='viz1654594332054' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Be&#47;Bellabeat_16541009569430&#47;UsagebyUser&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Bellabeat_16541009569430&#47;UsagebyUser' /><param name='tabs' value='yes' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Be&#47;Bellabeat_16541009569430&#47;UsagebyUser&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-GB' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1654594332054');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>](http://)

## **Amount of time awake in bed (% of users)**

In [None]:
TimeAwakePie <- pie(awakeresults$percentage,
                    labels=awakeresults$percentage,
                    main="Amount of time awake in bed (%)",
                    col = rainbow(length(awakeresults$percentage)))
legend("topright", c("<15min", "15-30min", "30min+"), cex = 2,
       fill=rainbow(length(awakeresults$percentage)))

## **Sleep Data**

**Do more calories burned in a day mean more sleep?**

In [None]:
caloriesvssleep <- sleepdata_avg %>% 
  ggplot(aes(mean_timeasleep, mean_calories))+
  geom_point(size=4, color="darkorchid1")+
  geom_smooth(method=lm, color = "red", se=T)+
  xlab("Minutes Asleep")+ ylab("Calories Burned")+
  ggtitle("Sleep Time vs Calories Burned")+
  theme(plot.title = element_text(hjust = 0.5),
        axis.title = element_text(face="bold"))

caloriesvssleep

**Do more steps in a day mean more sleep?**

In [None]:
stepsvssleep <- sleepdata_avg %>% 
  ggplot(aes(mean_timeasleep, mean_steps))+
  geom_point(size=4, color="blue")+
  geom_smooth(method=lm, color = "red", se=T)+
  xlab("Minutes Asleep")+ ylab("Number of Steps")+
  ggtitle("Sleep Time vs Number of Steps")+
  theme(plot.title = element_text(hjust = 0.5),
        axis.title = element_text(face="bold"))

stepsvssleep

**Does more time spent sedentary in a day mean more sleep?**

In [None]:
sedvssleep <- sleepdata_avg %>% 
  ggplot(aes(mean_timeasleep, mean_sedentary))+
  geom_point(size=4, color="darkturquoise")+
  geom_smooth(method=lm, color = "red", se=T)+
  xlab("Minutes Asleep")+ ylab("Minutes Sedentary")+
  ggtitle("Sleep Time vs Time Spent Sedentary")+
  theme(plot.title = element_text(hjust = 0.5),
        axis.title = element_text(face="bold"))

sedvssleep

**Does more time spent lightly active in a day mean more sleep?**

In [None]:
lightvssleep <- sleepdata_avg %>% 
  ggplot(aes(mean_timeasleep, mean_light))+
  geom_point(size=4, color="darkorange")+
  geom_smooth(method=lm, color = "red", se=T)+
  xlab("Minutes Asleep")+ ylab("Minutes Lightly Active")+
  ggtitle("Sleep Time vs Time Spent Lightly Active")+
  theme(plot.title = element_text(hjust = 0.5),
        axis.title = element_text(face="bold"))

lightvssleep

**Does more time spent fairly active in a day mean more sleep?**

In [None]:
fairvssleep <- sleepdata_avg %>% 
  ggplot(aes(mean_timeasleep, mean_fair))+
  geom_point(size=4, color="deeppink2")+
  geom_smooth(method=lm, color = "red", se=T)+
  xlab("Minutes Asleep")+ ylab("Minutes Fairly Active")+
  ggtitle("Sleep Time vs Time Spent Fairly Active")+
  theme(plot.title = element_text(hjust = 0.5),
        axis.title = element_text(face="bold"))

fairvssleep

**Does more time spent very active in a day mean more sleep?**

In [None]:
veryvssleep <- sleepdata_avg %>% 
  ggplot(aes(mean_timeasleep, mean_very))+
  geom_point(size=4, color="darkred")+
  geom_smooth(method=lm, color = "red", se=T)+
  xlab("Minutes Asleep")+ ylab("Minutes Very Active")+
  ggtitle("Sleep Time vs Time Spent Very Active")+
  theme(plot.title = element_text(hjust = 0.5),
        axis.title = element_text(face="bold"))

veryvssleep

<a id="1"></a> <br>
# 8. Findings & Insights

## **Trends in Smart Device Usage**

* Smart device users do not consistently track and record weight, sleep and activity data on a daily basis.
    * The majority of smart device users **did not** track and record data relating to their weight. Only **7%** of the total possible entries were submitted.
    * A large portion of smart device users **did not** track and record data relating to their sleep. Only **40%** of the total possible entries were submitted.
    * The majority of smart device users **did** track and record data relating to their activity. **92%** of the total possible entries were submitted.
* **100%** of smart device users tracked **at least 1** observation of activity data.
* **73%** of smart device users (24 of 33) tracked **at least 1** observation of sleep data.
* **24%** of smart device users tracked **at least 1** observation of weight data.
* **24%** of smart device users tracked **at least 1** observation of each of the three data categories - sleep, weight and activity.




## **Trends in Sleep Data**

* Strong negative correlation between time spent sedentary and time asleep in a day.
  * **The less time users spent sedentary the more sleep they got.**
* No correlation between number of steps and time asleep in a day.
* No correlation between time spent lightly active and time asleep in a day.
* No correlation between time spent fairly active and time asleep in a day.
* No correlation between time spent very active and time asleep in a day.
* **14%** of smart device users spend **less than 15min** awake in bed.
* **41%** of smart device users spend **15-30min** awake in bed.
* **45%** of smart device users spend **more than 30min** awake in bed.
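
The correlation statements above are read from the scatter plots. One way to put a number on them is the Pearson correlation coefficient from base R's `cor()`. The values below are made up purely for illustration and are not taken from the dataset.

```r
# Hypothetical per-user averages, for illustration only (not the real data)
mean_sedentary  <- c(600, 650, 700, 750, 800)
mean_timeasleep <- c(480, 450, 430, 400, 380)

# Pearson correlation coefficient: values near -1 or 1 indicate a strong
# linear relationship, values near 0 indicate little or none
r <- cor(mean_sedentary, mean_timeasleep)
round(r, 2)
```

On the real data the equivalent call would be `cor(sleepdata_avg$mean_sedentary, sleepdata_avg$mean_timeasleep)`. `cor.test()` additionally reports a p-value, which is worth checking given that only 22 users remain after cleaning.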

<a id="1"></a> <br>
# 9. Conclusion & Recommendations

## **There is a high probability that all users track activity data because the tracking process is automated. Although sleep data can also be tracked automatically, we see a significant drop in sleep tracking. Furthermore, very few users track weight data.**

   **Recommendation**

   * Tracking activity data is widely adopted by smart device users because the process is automated. Users feel comfortable wearing a smart device throughout their day. Bellabeat should continue to expand the Bellabeat Time collection to offer styles that cater to a larger audience. A potential customer may want to purchase a smart device but simply cannot find a style that suits them.
   
   * Even the users who track data most actively have days where they failed to track. This could be the result of forgetting to put on their device before leaving home, or something similar. Could the Bellabeat App include a feature that reminds users to wear their smart device?
   
   * Comfort could be a big factor in whether sleep data is tracked. Some users may remove their smart device at night because they find it uncomfortable to sleep in. This is an opportunity for Bellabeat to market the Bellabeat Leaf as a more comfortable solution that won't disturb sleep. Users can track important sleep data while getting the beauty rest they deserve.
   
   * The analytics team could take a look at who Bellabeat's most active users are. Users who show interest in tracking sleep, activity and weight could be good target customers for the Bellabeat Spring water bottle and membership plan.



## **Almost half of smart device users that track sleep data spend more than 30 minutes awake in bed.**

**Recommendation**

* There is a window of time when people are in bed but awake. People may be on their phones during this window, which could make it a good time to run an online marketing campaign on social media. This may allow Bellabeat to target its advertising more precisely and also help with budgeting.

## **People who spend less time sedentary get more sleep.**

**Further Investigation**

* This may be a powerful narrative for a Bellabeat marketing campaign. We need to investigate whether people spend less time sedentary after adopting a smart device. If so, this could support a powerful message: 'wearing a Bellabeat product gives you more sleep'. With sleep being a hot topic in today's conversation, this could be a very compelling marketing idea.