# Google Data Analytics Certificate's Capstone Project
### Bellabeat data analysis case study

**INTRODUCTION**

Bellabeat is a high-tech manufacturer of health-focused products for women. It is a successful small company with the potential to become a larger player in the global smart device market. Bellabeat believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. We have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights we discover will then help guide the company's marketing strategy. Bellabeat's products include:

* **Bellabeat app:** The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits.
* **Leaf:** Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
* **Time:** This wellness watch tracks user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
* **Spring:** This is a water bottle that tracks daily water intake to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
* **Bellabeat membership:** Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

### 1. Ask

#### Key Stakeholders
* Urška Sršen - Bellabeat's co-founder and Chief Creative Officer.
* Sando Mur - Mathematician and Bellabeat's co-founder.

**Business Task**: Analyze smart device usage data to gain insight into how consumers use non-Bellabeat smart devices. Based on these insights, select one Bellabeat product that can make healthy habits easier for users to adopt and maintain. Make high-level recommendations for how these trends and insights can help the Bellabeat marketing team craft new strategies.

Guiding questions:
* What are some trends in smart device usage?
* How could these trends apply to Bellabeat customers?
* How could these trends help influence Bellabeat marketing strategy?

Installing packages required.

In [None]:
# install.packages("tidyverse")
# install.packages("dplyr")
# install.packages("tidyr")
# install.packages("readr")
# install.packages("lubridate")
# install.packages("ggplot2")

Importing libraries.

In [None]:
library(tidyverse)
library(ggplot2)
library(readr)
library(dplyr)
library(lubridate)

### 2. Prepare
For this case study, we use publicly available data: [FitBit Fitness Tracker Data](https://www.kaggle.com/arashnic/fitbit) (CC0: Public Domain, dataset made available through Mobius). This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

After a quick check of all data files in Google Sheets, we selected a few datasets that cover the big picture of a user's health and contain enough data to analyse:
* Daily activity, which shows the trend in a user's activity across days or weeks.
* Sleep, since tracking it can help us encourage users to get sufficient sleep.
* Weight, hourly intensities, and hourly calories, selected to look for patterns that help tell the story.

In this phase, we organise the data and perform sort and filter operations as required.

In [None]:
daily_Activity <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
daily_calories <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
daily_intensity <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
daily_steps <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
sleep_day <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
minute_sleep <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteSleep_merged.csv")
weight_log <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
hourly_intensity <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
hourly_calories <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
hourly_steps <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")

The `Id` column identifies each user and serves as the key for merging, so we select the datasets to merge based on how many users each one covers. The dataset description mentions 30 users, a sample that is small but workable for this analysis, although it may not represent all groups of people (e.g., men).
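Because each user contributes many rows, `Id` alone does not identify a row; a combination such as `Id` plus the date does. The following is a minimal sketch, using made-up toy data rather than the Fitbit files, of how one might check that such a key is unique before merging. The names `toy_sleep`, `key_cols`, and `toy_sleep_unique` are illustrative assumptions.

```r
# Toy data (made up): one row per user per day, with one accidental duplicate.
toy_sleep <- data.frame(
  Id = c(1001, 1001, 1002, 1002, 1002),
  SleepDay = c("4/12/2016", "4/13/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(420, 390, 455, 455, 400)
)

# Id repeats across days, so it cannot identify a row by itself.
n_users <- length(unique(toy_sleep$Id))

# Count rows that repeat an earlier (Id, SleepDay) combination.
key_cols <- c("Id", "SleepDay")
n_duplicate_keys <- sum(duplicated(toy_sleep[, key_cols]))

# Keep only the first row for each key.
toy_sleep_unique <- toy_sleep[!duplicated(toy_sleep[, key_cols]), ]

n_users
n_duplicate_keys
nrow(toy_sleep_unique)
```

The same check could be run on any of the loaded data frames by swapping in its own date column.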

Checking number of unique users in each dataframe.

In [None]:
n_distinct(daily_Activity$Id)
n_distinct(daily_calories$Id)
n_distinct(daily_intensity$Id)
n_distinct(daily_steps$Id)
n_distinct(sleep_day$Id)
n_distinct(minute_sleep$Id)
n_distinct(weight_log$Id)
n_distinct(hourly_intensity$Id)
n_distinct(hourly_calories$Id)
n_distinct(hourly_steps$Id)

Now that all data has been loaded, we can look at min(date), max(date), and n_distinct(df$Id) for each data frame:
* Daily activity data is available from 2016-04-12 to 2016-05-12 with 33 unique Ids.
* Hourly activity data is available from 2016-04-12 to 2016-05-12 with 33 unique Ids.
* Sleep data is available from 2016-04-12 to 2016-05-12 with 24 unique Ids.
* Weight log data is available from 2016-04-12 to 2016-05-12 with 8 unique Ids.
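The date ranges and user counts above can be obtained with a small helper. This is a sketch only: `summarise_coverage` is a hypothetical function name, and the toy data frame stands in for the real files.

```r
# Hypothetical helper: date range and number of distinct users in one data frame.
summarise_coverage <- function(df, date_col, format) {
  dates <- as.Date(df[[date_col]], format = format)
  data.frame(
    first_date = min(dates, na.rm = TRUE),
    last_date  = max(dates, na.rm = TRUE),
    n_users    = length(unique(df$Id))
  )
}

# Toy data (made up) using the same month/day/year layout as the raw files.
toy_daily <- data.frame(
  Id = c(1001, 1001, 1002, 1003),
  ActivityDate = c("4/12/2016", "5/12/2016", "4/20/2016", "5/1/2016")
)

coverage <- summarise_coverage(toy_daily, "ActivityDate", "%m/%d/%Y")
coverage
```

For timestamp columns that also carry a time of day, `as.Date()` with the same format string should still pick up the date part, though that is worth verifying on the actual column.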


Only 24% (8 of 33) and 73% (24 of 33) of users have weight and sleep records, respectively, which suggests that fewer users track these than track activity.
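The percentages follow directly from the unique user counts listed above. A short worked calculation, with the counts typed in by hand as an assumption instead of being read from the data frames:

```r
# Unique user counts reported above.
total_users  <- 33  # users with daily activity data
weight_users <- 8   # users with weight log entries
sleep_users  <- 24  # users with sleep records

# Share of all users, rounded to whole percentages.
weight_share <- round(100 * weight_users / total_users)
sleep_share  <- round(100 * sleep_users / total_users)

weight_share
sleep_share
```

In the notebook itself, the hard-coded counts could be replaced with `n_distinct()` calls on the corresponding data frames.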

In [None]:
# Looking at first few rows
head(daily_Activity) 
# Data types of columns 
str(daily_Activity)

head(daily_calories)
str(daily_calories)

head(daily_intensity)
str(daily_intensity)

head(daily_steps)
str(daily_steps)

head(sleep_day)
str(sleep_day)

head(minute_sleep)
str(minute_sleep)

head(weight_log)
str(weight_log)

head(hourly_intensity)
str(hourly_intensity)

head(hourly_calories)
str(hourly_calories)

head(hourly_steps)
str(hourly_steps)

Since daily calories, intensities, and steps are already merged in the **daily_Activity** dataframe, we use **daily_Activity** for further analysis. Likewise, we merge the hourly activity data into one dataframe for simplicity.

In [None]:
hourly_Activity <- Reduce(merge,list(hourly_calories, hourly_intensity, hourly_steps))

sum(is.na(hourly_Activity))
head(hourly_Activity)
str(hourly_Activity)
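
`merge()` joins on all column names the inputs share unless told otherwise, which is what `Reduce(merge, ...)` relies on above. The sketch below uses made-up toy tables to show the same chained merge with the key columns named explicitly; `toy_calories`, `toy_intensity`, `toy_steps`, and `keys` are illustrative names only.

```r
# Toy hourly tables (made up); the row order differs on purpose.
toy_calories <- data.frame(
  Id = c(1, 1, 2),
  ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/2016 12:00:00 AM"),
  Calories = c(81, 61, 70)
)
toy_intensity <- data.frame(
  Id = c(2, 1, 1),
  ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM"),
  TotalIntensity = c(0, 20, 8)
)
toy_steps <- data.frame(
  Id = c(1, 2, 1),
  ActivityHour = c("4/12/2016 1:00:00 AM", "4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  StepTotal = c(160, 0, 373)
)

# Merge pairwise on explicit keys; rows are matched by key, not by position.
keys <- c("Id", "ActivityHour")
toy_hourly <- Reduce(
  function(x, y) merge(x, y, by = keys),
  list(toy_calories, toy_intensity, toy_steps)
)

toy_hourly
```

Naming the keys guards against an accidental join on some other column that happens to share a name across tables.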

### 3. Process

In this phase, we check the data for errors and format it for effective use. Cleaning and manipulation techniques are applied to prepare the data for further analysis.

Here, the **ActivityHour** timestamps are stored as character strings; converting them into date-time format helps find any discontinuity in the recorded data.
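As a small illustration of why the conversion matters, the sketch below parses a few made-up timestamps and looks for gaps longer than one hour. The values and the fixed `"UTC"` time zone are assumptions chosen so the example behaves the same everywhere. Parsing `%p` (AM/PM) can depend on the session locale.

```r
# Made-up hourly timestamps in the raw file's layout; 3 AM and 4 AM are missing.
raw_hours <- c(
  "4/12/2016 12:00:00 AM",
  "4/12/2016 1:00:00 AM",
  "4/12/2016 2:00:00 AM",
  "4/12/2016 5:00:00 AM"
)

# %I is the 12-hour clock hour and %p the AM/PM marker.
parsed <- sort(as.POSIXct(raw_hours, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC"))

# Time between consecutive records, in hours.
gap_hours <- as.numeric(diff(parsed), units = "hours")

# Timestamps that are followed by a gap of more than one hour.
gap_after <- parsed[which(gap_hours > 1)]

gap_hours
gap_after
```

On the real data this check would need to be run per user, since records from different users interleave.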

In [None]:
# for daily activity
daily_Activity$ActivityDate <- as.POSIXct(daily_Activity$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())
daily_Activity$Date <- format(daily_Activity$ActivityDate,format = "%m/%d/%y")

# Similarly  for hourly activity
hourly_Activity$ActivityHour <- as.POSIXct(hourly_Activity$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
hourly_Activity$Date <- format(hourly_Activity$ActivityHour,format = "%m/%d/%y")
hourly_Activity$Time <- format(hourly_Activity$ActivityHour,format = "%H" )
hourly_Activity$Day <- weekdays(hourly_Activity$ActivityHour)

# For sleep day
sleep_day$SleepDay <- as.POSIXct(sleep_day$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
sleep_day$Date <- format(sleep_day$SleepDay,format = "%m/%d/%y")

In [None]:
# a quick check on changes on dates and time
head(daily_Activity)
head(sleep_day)
head(hourly_Activity)

Now, we have daily and hourly activity for 33 users and sleep records for 24 users. We can start exploring the cleaned and organized data to find patterns through visualizations.

### 4. Analyze
In this phase, we analyze the data by aggregating useful features and draw insights from self-explanatory visualizations.
The tasks include:
* Aggregating data.
* Formatting and creating useful data.
* Performing calculations.
* Identifying trends and relationships.

In [None]:
# Daily activity summary on distance covered based on levels, steps and active minutes
print("Daily activity")
daily_Activity %>%
    select(VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, SedentaryActiveDistance)%>%
    summary()
daily_Activity %>%
    select(TotalSteps, TotalDistance, TrackerDistance)%>%
    summary()
daily_Activity  %>%
    select(VeryActiveMinutes,FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes)%>%
    summary()

# Sleep records summary
print("sleep day")
sleep_day %>%
    select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed)%>%
    summary()

# Hourly activity summary on total steps 
print("Hourly activity")
hourly_Activity %>%
    select(Calories, TotalIntensity, StepTotal)%>%
    summary()


#### Findings from summaries:
* The average number of steps per day is 7,638, whereas 10,000 steps per day is commonly recommended to stay fit, so daily step counts need to improve.
* The 3rd quartile of TotalSteps is close to 10,000, which means about 75% of daily records fall at or below that level; only around a quarter of records reach the 10,000-step recommendation.
* Average sedentary (inactive) time is 991 minutes per day, i.e., about 16.5 hours or roughly two-thirds of the day. This needs to be reduced (note that sleep time may or may not be included in this figure).
* Average total time in bed is 458.6 minutes, i.e., about 7.6 hours including sleep; any time in bed beyond sleep is worth examining, assuming users are not working in bed.
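The unit conversions and the reading of the 3rd quartile can be checked with a few lines. The averages below are the ones quoted in the findings. The `toy_steps` vector is made up purely to show what a 3rd quartile says about the share of records on either side of it.

```r
# Convert the reported averages from minutes to hours.
sedentary_hours <- 991 / 60            # about 16.5 hours
in_bed_hours    <- 458.6 / 60          # about 7.6 hours
sedentary_share <- 991 / (24 * 60)     # fraction of a 24-hour day

# Toy daily step counts (made up) to illustrate the quartile reading.
toy_steps <- c(2000, 4000, 6000, 8000, 10000, 12000, 3000, 9000)
q3 <- unname(quantile(toy_steps, 0.75))

# By definition, about 75% of records sit at or below the 3rd quartile.
share_at_or_below_q3 <- mean(toy_steps <= q3)

# Share of records that actually reach the 10,000-step recommendation.
share_reaching_goal <- mean(toy_steps >= 10000)

round(c(sedentary_hours, in_bed_hours, sedentary_share), 2)
c(q3, share_at_or_below_q3, share_reaching_goal)
```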


Before we start the analysis, we’ll set up a common theme for our plots.

In [None]:
custom_theme <- function() {
  theme(
    panel.border = element_rect(colour = "black", 
                                fill = NA, 
                                linetype = 1),
    panel.background = element_rect(fill = "white", 
                                    color = 'grey50'),
#     panel.grid.minor.y = element_blank(),
    axis.text = element_text(size=14,colour = "black", 
                             face = "italic", 
                             family = "Helvetica"),
    axis.title = element_text(size=16,colour = "black", 
                              family = "Helvetica"),
    axis.ticks = element_line(colour = "black"),
    plot.title = element_text(size=23, 
                              hjust = 0.5, 
                              family = "Helvetica"),
    plot.subtitle=element_text(size=18, 
                               hjust = 0.5),
    plot.caption = element_text(size=16,colour = "black", 
                             face = "italic", 
                             family = "Helvetica"),
    legend.text = element_text(size=14,colour = "black"),
    legend.title = element_text(size= 18,colour = "black")
  )
}

In [None]:
head(daily_Activity)

In [None]:
# visualizing no. of steps in a day 
options(repr.plot.width = 12, repr.plot.height = 8)
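# Derive the weekday from the parsed ActivityDate. The Date column holds a two-digit
# year ("%m/%d/%y"), so re-parsing it with "%Y" would read the year as 0016.
daily_Activity$weekday <- weekdays(daily_Activity$ActivityDate)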
daily_Activity$weekday <- factor(daily_Activity$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"))
daily_Activity_group_steps <- daily_Activity %>%
  group_by(weekday) %>%
  drop_na() %>%
  summarise(mean_TotalSteps = mean(TotalSteps))

ggplot(data = daily_Activity_group_steps)+ geom_col(mapping = aes( x = weekday, y = mean_TotalSteps, fill = mean_TotalSteps))+
            labs(title = "Average Steps each day of the week ", 
                 caption = paste0("Data of 33 users from date 4/12/2016 to 5/12/2016"), 
                 x="Week day", y = "Average Steps", 
                 fill = "TotalSteps")+custom_theme()


Users tend to exercise at a moderate pace on working days and significantly more than usual on Saturdays. On Sundays, users exercise less and relax more.
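The weekday ordering used in the plot depends on turning weekday names into an ordered factor. The sketch below shows the idea on made-up dates and step counts. It derives the weekday from the ISO weekday number (`%u`, where Monday is 1) so the result does not depend on the session language, which is an alternative to the `weekdays()` call in the notebook and not the notebook's own method.

```r
# Made-up dates and step counts.
toy_dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17", "2016-04-18", "2016-04-19"))
toy_steps <- c(8000, 9000, 6000, 7000, 10000)

weekday_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")

# %u gives the ISO weekday number (1 = Monday, ..., 7 = Sunday).
weekday_number <- as.integer(format(toy_dates, "%u"))
toy_weekday <- factor(weekday_levels[weekday_number], levels = weekday_levels)

# Mean steps per weekday, returned in Monday-to-Sunday order
# (weekdays with no data come back as NA).
mean_steps <- tapply(toy_steps, toy_weekday, mean)
mean_steps
```

Without explicit `levels`, the factor would sort alphabetically and the bars would appear as Friday, Monday, Saturday, and so on.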

In [None]:
daily_Activity_minutes_active <- daily_Activity %>%
  group_by(weekday) %>%
  select(VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes, SedentaryMinutes)%>%
  summarise(mean_veryactive = mean(VeryActiveMinutes),
           mean_fairlyactive = mean(FairlyActiveMinutes),
           mean_lightlyactive = mean(LightlyActiveMinutes),
           mean_sedentaryactive = mean(SedentaryMinutes)) 
daily_Activity_minutes_active

From the table above, we can say that users are lightly active for at least 3 hours a day throughout the week. Sedentary minutes are relatively high, possibly due to office work or other seated activities. Bellabeat can remind users to take a 5-10 minute walk every 2-3 hours to avoid the risks of continuous sitting and to improve their productivity.

In [None]:
options(repr.plot.width=8, repr.plot.height=8)
ggplot(daily_Activity_minutes_active, aes(x=weekday,y=mean_veryactive, group=1))+
  geom_line(color = "blue")+ geom_point()+
  labs(title="Average Very Active Minutes by Weekdays", x= "Weekdays", y="Daily Very Active Minutes")+ custom_theme()



Reviewing activity levels, we can see the overall trend in users' activity:
* Activity levels are high on Monday and Tuesday.
* Activity levels are low on Thursday and Sunday.

Users start the week active and slow down mid-week, perhaps because a busy schedule leaves them tired. On Saturday, a high activity level is observed, which could be due to the weekend and users spending time outdoors.
Let's now look more closely at daily activity time.

In [None]:
# library() stops with an error if gridExtra is missing, whereas require() only warns
library(gridExtra)

options(repr.plot.width=15, repr.plot.height=6)
p2<-ggplot(daily_Activity_minutes_active, aes(x=weekday,y=mean_fairlyactive, group=1))+
  geom_line(color = "blue")+ geom_point()+
  labs(title="Average Fairly Active Minutes by Weekdays", x= "Weekdays", y="Daily Fairly Active Minutes")

p3<-ggplot(daily_Activity_minutes_active, aes(x=weekday,y=mean_lightlyactive, group=1))+
  geom_line(color = "blue")+ geom_point()+
  labs(title="Average Lightly Active Minutes by Weekdays", x= "Weekdays", y="Daily Lightly Active Minutes")

p4<-ggplot(daily_Activity_minutes_active, aes(x=weekday,y=mean_sedentaryactive, group=1))+
  geom_line(color = "blue")+ geom_point()+
  labs(title="Average Sedentary Active Minutes by Weekdays", x= "Weekdays", y="Daily Sedentary Active Minutes")

grid.arrange(p2,p3,p4, ncol=3)

options(repr.plot.width=15, repr.plot.height=6)

In [None]:
options(repr.plot.width = 6, repr.plot.height = 6)
ggplot(data = daily_Activity,aes(x = TotalSteps, y = Calories))+geom_point()+geom_smooth()+ labs(title="Calories vs Total Steps")+custom_theme()

ggplot(data = daily_Activity, aes(x = TotalSteps, y = TotalDistance))+geom_point()+geom_smooth()+ labs(title="Total Distance vs Total Steps")+custom_theme()

In [None]:
str(hourly_Activity)

Now, let's analyse at what time of day users are most active.

In [None]:
hourly_Activity_s <- hourly_Activity %>%
  group_by(Day,Time) %>%
  drop_na() %>%
  summarise(mean_TotalIntensity = mean(TotalIntensity))

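# hourly_Activity_s has one row per Day and Time, so plotting it directly would stack
# the seven day-level means in each bar. Average over all days for this overall view.
hourly_Activity_overall <- hourly_Activity %>%
  group_by(Time) %>%
  drop_na() %>%
  summarise(mean_TotalIntensity = mean(TotalIntensity))

ggplot(data=hourly_Activity_overall) +
  geom_col(mapping = aes(x = Time, y=mean_TotalIntensity)) +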
  labs(title="Hourly Intensity of a Day",
        caption = paste0("Data of 33 users from date 4/12/2016 to 5/12/2016"), 
       x="Time(hour)",
       y="Average intensity") +custom_theme()
    
options(repr.plot.width = 12, repr.plot.height = 6)

We can observe that most people are active in the evening (around 5-7pm). People might be exercising at that time.
Let's now analyse how this trend varies through the week.
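Reading the peak off a bar chart can be backed up numerically. The sketch below, on made-up hourly records, averages intensity per hour and picks the hour with the highest mean; `toy_hourly`, `hourly_mean`, and `peak_hour` are illustrative names.

```r
# Made-up hourly records: Time is the hour of day as a two-digit string,
# matching how the notebook formats it with "%H".
toy_hourly <- data.frame(
  Time = c("08", "08", "12", "12", "18", "18"),
  TotalIntensity = c(10, 14, 16, 20, 25, 31)
)

# Mean intensity for each hour.
hourly_mean <- aggregate(TotalIntensity ~ Time, data = toy_hourly, FUN = mean)

# Hour with the highest mean intensity.
peak_hour <- as.character(hourly_mean$Time[which.max(hourly_mean$TotalIntensity)])

hourly_mean
peak_hour
```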

In [None]:
ggplot(data=hourly_Activity_s) +
  geom_col(mapping = aes(x = Time, y=mean_TotalIntensity, fill=mean_TotalIntensity)) +   facet_wrap(~Day) +
  labs(title="Hourly Intensity of a Day",
        caption = paste0("Data of 33 users from date 4/12/2016 to 5/12/2016"), 
       x="Time(hour)",
       y="Average intensity",fill="Intensity") +custom_theme()+
  theme(axis.text.x = element_text(size = 9))
    
options(repr.plot.width = 10, repr.plot.height = 10)

We can see that users usually exercise between 6pm and 7pm on weekdays. On Saturdays, they are very active around 1pm and remain fairly active until 7pm. On Sundays, the activity level drops significantly.
Let's now combine daily activity with the sleep records and analyse how sleep varies across the week.


In [None]:
daily_df <- merge(daily_Activity, sleep_day, by = c('Id', 'Date'), all.x = TRUE)
daily_df <- daily_df %>% drop_na()

head(daily_df)
n_distinct(daily_df$Id)

In [None]:
#Sleep minutes vs Week day
daily_df$weekday <- weekdays(daily_df$ActivityDate)
daily_df$weekday <- factor(daily_df$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"))
options(repr.plot.width = 8, repr.plot.height = 8)
daily_df_group_sleep <- daily_df %>%
  group_by(weekday) %>%
  drop_na() %>%
  summarise(mean_TotalMinutesAsleep = mean(TotalMinutesAsleep))

# daily_Activity_group_steps <- daily_Activity_group_steps[order(daily_Activity_group_steps)]
# daily_Activity_group_steps
ggplot(data = daily_df_group_sleep)+ geom_col(mapping = aes( x = weekday, y = mean_TotalMinutesAsleep, fill = mean_TotalMinutesAsleep))+
            labs(title = "Average Sleep each day of the week ", 
                 caption = paste0("Data of 24 users from date 4/12/2016 to 5/12/2016"), 
                 x="Week day", y = "Average Sleep Minutes", 
                 fill = "Total Sleep Minutes")+custom_theme()

We can observe that people sleep more on weekends. An average of about 400 minutes, i.e., 6-7 hours, is maintained on working days. Common recommendations call for 7-9 hours of sleep.
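To relate the minutes in the plot to the 7-9 hour recommendation, the figures can be converted to hours and compared against a 7-hour threshold. The nightly values below are made up for illustration.

```r
# The weekday average read from the plot, converted to hours.
weekday_average_hours <- 400 / 60   # about 6.7 hours

# Made-up nightly sleep records, in minutes.
toy_minutes_asleep <- c(400, 380, 450, 500, 410, 360)

# Share of nights below the lower end of the 7-9 hour recommendation.
share_below_7h <- mean(toy_minutes_asleep < 7 * 60)

round(weekday_average_hours, 1)
share_below_7h
```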


In [None]:
options(repr.plot.width = 8, repr.plot.height = 8)
#Total Time in bed & Sleep time correlation
ggplot(data=daily_df, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + 
  geom_point()+geom_smooth()+ labs(title="Total Time in Bed Vs Total Minutes Asleep ")+custom_theme()
    

The relationship between Total Minutes Asleep and Total Time in Bed appears to be linear. **So, if Bellabeat app users want to improve their sleep hours, we should consider sending a notification via the app to go to sleep at a specific time set by the user.**

Based on this visual, we can infer that there is a correlation between time asleep and time spent in bed, which is expected because **TotalTimeInBed** includes sleep time.
There are some outliers, showing that some people struggle to fall asleep even after spending a long time in bed; for them, Bellabeat could set a reminder that encourages spending less time awake in bed.

Let's analyse the extra time in bed against sleep time to see how they are related.

In [None]:
daily_df$ExtraTimeInbed <- daily_df$TotalTimeInBed- daily_df$TotalMinutesAsleep

ggplot(data=daily_df, aes(x=TotalMinutesAsleep, y=ExtraTimeInbed)) + 
  geom_point(color='darkblue')+geom_smooth()+ labs(title="Extra Time in Bed Vs Total Minutes Asleep")+custom_theme()

The above plot shows that the majority of users spend at least 30-60 minutes in bed before or after sleeping. However, some users spend extra time in bed amounting to more than half of their sleep time; they might be working in bed, scrolling on their phones, or experiencing insomnia. Bellabeat should send active reminders to help users stay productive and healthy.
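The "more than half of sleep time" observation can be turned into a flag that counts such nights. This is a sketch on made-up records, and the `long_awake` column name is an assumption.

```r
# Made-up sleep records, in minutes.
toy_sleep <- data.frame(
  TotalMinutesAsleep = c(420, 380, 300, 450),
  TotalTimeInBed     = c(450, 430, 480, 470)
)

# Time in bed that was not spent asleep.
toy_sleep$ExtraTimeInbed <- toy_sleep$TotalTimeInBed - toy_sleep$TotalMinutesAsleep

# Flag nights where the extra time exceeds half of the sleep time.
toy_sleep$long_awake <- toy_sleep$ExtraTimeInbed > 0.5 * toy_sleep$TotalMinutesAsleep

# Share of nights that were flagged.
share_long_awake <- mean(toy_sleep$long_awake)

toy_sleep
share_long_awake
```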

Let's look at the relationship between Total Minutes Asleep and Sedentary Minutes.

In [None]:
ggplot(data=daily_df, aes(x=TotalMinutesAsleep, y=SedentaryMinutes)) + 
geom_point(color='darkblue') + geom_smooth() +
  labs(title="Minutes Asleep vs. Sedentary Minutes")+custom_theme()

We can see a negative correlation between Sedentary Minutes and sleep time.
So, **if Bellabeat users want to improve their sleep, the app can recommend reducing sedentary time**.
However, correlation does not imply causation, so we need to support this insight with more data.
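One way to put a number on the relationship seen in the scatter plot is Pearson's correlation coefficient, with `cor.test()` adding a p-value. The values below are made up and deliberately show a strong negative trend, so they say nothing about the strength of the relationship in the Fitbit data.

```r
# Made-up paired observations, in minutes.
toy <- data.frame(
  TotalMinutesAsleep = c(300, 360, 400, 420, 450, 480, 520),
  SedentaryMinutes   = c(900, 850, 760, 780, 700, 690, 620)
)

# Pearson correlation: values near -1 indicate a strong negative linear relationship.
pearson_r <- cor(toy$TotalMinutesAsleep, toy$SedentaryMinutes)

# Significance test for the same coefficient.
test_result <- cor.test(toy$TotalMinutesAsleep, toy$SedentaryMinutes)

pearson_r
test_result$p.value
```

Even a significant coefficient only describes association, which is consistent with the caution about causation above.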

### Recommendations
* Based on the daily data plots, Bellabeat should reward users for hitting their daily step target. At least 5,000 steps a day is suggested; to support this, Bellabeat can notify users at certain intervals or at specific times, such as in the morning and evening.
* Since users are observed to be more sedentary during the day, i.e., 9-5 office hours, they are often sitting for hours at a stretch. This can cause serious health issues in the long run, so Bellabeat should add active reminders that prompt users to take a walk and do some warm-up exercises every 2-4 hours.
* We recommend drinking-water reminders, which help users stay hydrated. Besides tracking activities such as steps, sleep time, weight, and heart rate, Bellabeat can offer sessions on eating habits, yoga, etc. more often.
* Based on the sleep time analysis, Bellabeat should consider teaching a pre-sleep routine through its app to reduce time spent awake in bed and difficulty falling asleep. Both oversleeping and insufficient sleep should be monitored, since each affects the rest of the day, and users can be reminded to go to sleep and wake up on time.
* The Bellabeat Spring and Leaf products are suggested for users who work at home or from home. Unique products could be designed that track activity without users having to be conscious of the device and that are comfortable to wear even during sleep.
* We observed a clear relationship between higher-intensity activity and calories burned, so logging activity with an interactive app or device could be a good motivator for users who plan to lose weight.

**Tableau Visualization Work: https://bit.ly/bellabeat_tableau**
