# **Data Analysis for Bellabeat Case Study**

Author: Tung Anh Pham

Date: 31/07/2021

# **Introduction**

This case study is the capstone project for my Google Data Analytics Professional Certificate. It uses smart device fitness data to identify new marketing strategies for Bellabeat, a high-tech manufacturer of health-focused products for women around the world.

This analysis applies the six phases of the APPASA approach (Ask, Prepare, Process, Analyze, Share, and Act) and uses the Python programming language for data cleaning, transformation and visualisation.



**About Bellabeat:**

Bellabeat is a smart device manufacturer founded in 2013. It is a tech-driven wellness company for women, with offices around the world. The company offers a range of products, including:

* Bellabeat App - connects to the company's lines of wellness products and provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits
* Leaf - a bracelet that tracks the user's sleep, activity and stress
* Time - a smart watch that tracks the user's sleep, activity and stress
* Spring - a smart water bottle that tracks the user's daily water intake
* Membership - gives users 24/7 access to fully personalized health guidance based on their lifestyle and goals

# **PHASE 1: ASK**

**Business task**

To discover the trends in smart device usage and assist Bellabeat's marketing team to come up with an effective marketing strategy for the company based on the findings.

**Stakeholders**

* Urška Sršen, co-founder and Chief Creative Officer
* Sando Mur, co-founder and executive team member
* Bellabeat's marketing analytics team

**Questions:**

1. For each user, on every day, what is the hour that they have the highest number of steps? 
2. Is there a relationship between Steps Taken and Calories Burned? 
3. Do users get enough sleep?

# **PHASE 2: PREPARE**

**Information on Data Source**

* The dataset analyzed in this study is the public dataset titled 'FitBit Fitness Tracker Data' provided by Mobius. 
* These datasets were generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016 (month.day.year, i.e. 12 March to 12 May 2016).
* 30 eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

**Limitations of Dataset**

* The sample size is small (around 30 users).
* The data was collected in 2016, five years before this analysis, so it may no longer be timely or relevant.
* The data covers only around 2 months, which is too short a period to uncover substantial long-term trends.


For our analysis, we only focus on data from hourlySteps_merged.csv, dailyActivity_merged.csv and sleepDay_merged.csv. 

In [None]:
# import packages and alias
import numpy as np # data arrays
import pandas as pd # data structure and data analysis
import matplotlib.pyplot as plt # data visualization
import datetime as dt # date time
import seaborn as sns # seaborn

# **PHASE 3: PROCESS**
    
**3.1 Importing the dataset**

In [None]:
# load 3 datasets
hourly_step = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
daily_activity = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
sleep_day = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")

*Getting to know the dataset*

In [None]:
hourly_step.info()

In [None]:
daily_activity.info() 

In [None]:
sleep_day.info()

"ActivityDate" in daily_activity is read in as object dtype and has to be converted to datetime64 dtype. The same applies to "ActivityHour" in hourly_step and "SleepDay" in sleep_day.

In [None]:
# convert the date columns to datetime64 dtype (displayed as yyyy-mm-dd)
daily_activity["ActivityDate"] = pd.to_datetime(daily_activity["ActivityDate"], format="%m/%d/%Y")
hourly_step["ActivityHour"] = pd.to_datetime(hourly_step["ActivityHour"], format='%m/%d/%Y %I:%M:%S %p')
sleep_day["SleepDay"] = pd.to_datetime(sleep_day["SleepDay"], format='%m/%d/%Y %I:%M:%S %p')

In [None]:
# check again 
daily_activity.info()
hourly_step.info()
sleep_day.info()

**3.2 Data cleaning and manipulation**

In [None]:
# preview first 5 rows with all columns
hourly_step.head()

In [None]:
daily_activity.head()

In [None]:
sleep_day.head()

*Looking for any missing values in the data*

In [None]:
daily_activity.isna().sum()

In [None]:
hourly_step.isna().sum()

In [None]:
sleep_day.isna().sum()

We can now see that there is no missing data in these datasets.

Next, I'm looking for **duplicate values**, if any:

In [None]:
# helper that returns the duplicated rows of any DataFrame
def find_duplicated_data(df):
    return df[df.duplicated()]

len(find_duplicated_data(daily_activity)), len(find_duplicated_data(sleep_day)), len(find_duplicated_data(hourly_step))

In [None]:
# There are some duplicated data in the daily sleep data
sleep_day[sleep_day.duplicated(keep=False)]

In [None]:
# Drop the duplicates
sleep_day.drop_duplicates(inplace=True, ignore_index=False)

I'm also going to count the unique IDs to confirm whether data has 30 IDs or not.

In [None]:
hourly_step.Id.nunique() # 33

In [None]:
daily_activity.Id.nunique()


In [None]:
sleep_day.Id.nunique()

> The number of users in sleep_day differs from the number in daily_activity and hourly_step. Some users may prefer not to track their sleep, or there may be other reasons we can't control (discomfort wearing the device at night, a flat battery, etc.).
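To see exactly which users are missing from the sleep data, a set difference on the Id columns is one option. The sketch below uses made-up IDs, and the names `daily_activity_demo`, `sleep_day_demo` and `missing_sleep_ids` are hypothetical stand-ins rather than part of the notebook.

```python
import pandas as pd

# Made-up stand-ins for the real tables; only the Id column matters here.
daily_activity_demo = pd.DataFrame({"Id": [1, 1, 2, 3, 3, 4]})
sleep_day_demo = pd.DataFrame({"Id": [1, 3, 3]})

# Unique users in each table.
activity_ids = set(daily_activity_demo["Id"])
sleep_ids = set(sleep_day_demo["Id"])

# Users present in the activity table but absent from the sleep table.
missing_sleep_ids = sorted(activity_ids - sleep_ids)

print(len(activity_ids), len(sleep_ids), missing_sleep_ids)
```

Running the same idea on daily_activity and sleep_day would list the users who never logged sleep, which could then be checked for other differences in behaviour.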


# **PHASE 4: ANALYZE**

Create a new column, DayOfTheWeek, containing the day name of each activity date for further analysis.

In [None]:
daily_activity.insert(2, 'DayOfTheWeek', daily_activity['ActivityDate'].dt.day_name())

In [None]:
# rename columns
daily_activity.rename(columns={
    "Id": "id", "ActivityDate": "date", "DayOfTheWeek": "day_of_the_week",
    "TotalSteps": "total_steps", "TotalDistance": "total_dist",
    "TrackerDistance": "track_dist", "LoggedActivitiesDistance": "logged_dist",
    "VeryActiveDistance": "very_active_dist",
    "ModeratelyActiveDistance": "moderate_active_dist",
    "LightActiveDistance": "light_active_dist",
    "SedentaryActiveDistance": "sedentary_active_dist",
    "VeryActiveMinutes": "very_active_mins", "FairlyActiveMinutes": "fairly_active_mins",
    "LightlyActiveMinutes": "lightly_active_mins", "SedentaryMinutes": "sedentary_mins",
    "TotalExerciseMinutes": "total_mins", "TotalExerciseHours": "total_hours",
    "Calories": "calories",
}, inplace=True)

Create a new column, total_mins, as the sum of all logged activity minutes, and a second column, total_hours, with the same total converted to hours.

In [None]:
daily_activity["total_mins"] = daily_activity["very_active_mins"] + daily_activity["fairly_active_mins"] + daily_activity["lightly_active_mins"] + daily_activity["sedentary_mins"]
daily_activity["total_hours"] = round(daily_activity["total_mins"] / 60)
daily_activity["total_hours"].head() # check the result
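As a rough sanity check on total_mins, one could compare it with the 1,440 minutes in a day. The sketch below uses made-up values. `wear_demo` and `MINUTES_PER_DAY` are hypothetical names, and reading a shortfall as "the device was not worn all day" is an assumption of this example, not a finding from the dataset.

```python
import pandas as pd

# Made-up totals of logged minutes per day, for illustration only.
wear_demo = pd.DataFrame({"total_mins": [1440, 1000, 1440, 600]})

MINUTES_PER_DAY = 24 * 60  # 1440

# Fraction of days where the four activity levels add up to a full day.
full_day_share = (wear_demo["total_mins"] == MINUTES_PER_DAY).mean()

# Days with less than a full day logged (assumed to be partial wear).
partial_days = wear_demo[wear_demo["total_mins"] < MINUTES_PER_DAY]

print(full_day_share, len(partial_days))
```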

In [None]:
daily_activity.head()

In [None]:
daily_activity.describe()

1. Based on the table above, users walked 7,638 steps per day on average. According to the source below, the CDC recommends 10,000 steps per day as a suitable goal for adults seeking better health, weight loss and improved fitness. (Source: [How many steps should people take per day?](https://www.medicalnewstoday.com/articles/how-many-steps-should-you-take-a-day#by-sex))

2. Most of the logged time is sedentary, which suggests that users wear the device during everyday activities rather than for fitness only. We will look into this further later.
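The average alone does not show how often the 10,000-step goal is actually reached. A minimal sketch of that comparison follows, using made-up step counts. `steps_demo` and `STEP_GOAL` are hypothetical names, and the numbers are not taken from the Fitbit data.

```python
import pandas as pd

# Made-up daily step counts for illustration only (not the Fitbit data).
steps_demo = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "total_steps": [12000, 8000, 4000, 10000, 0],
})

STEP_GOAL = 10_000  # daily goal quoted in the text above

# Fraction of recorded days on which the goal was reached.
share_days_at_goal = (steps_demo["total_steps"] >= STEP_GOAL).mean()

# Average daily steps per user, then the fraction of users whose average meets the goal.
avg_steps_per_user = steps_demo.groupby("id")["total_steps"].mean()
share_users_at_goal = (avg_steps_per_user >= STEP_GOAL).mean()

print(share_days_at_goal, share_users_at_goal)
```

Applied to daily_activity, the same two lines would give the share of days and the share of users at or above the goal.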

Add a new column, "Day" (day of the week), to the sleep_day data.

In [None]:
sleep_day.insert(1, 'Day', sleep_day.SleepDay.dt.day_name())

Calculate and add a new column, "TotalTimeAwakeOnBed", to the sleep_day data.

In [None]:
sleep_day['TotalTimeAwakeOnBed'] = sleep_day['TotalTimeInBed'] - sleep_day['TotalMinutesAsleep']

In [None]:
# check the table again
sleep_day.head()

Now we go back to the hourly step data, create a new column named "hour", and calculate the average number of steps for each hour of the day.

In [None]:
# get only the hour
hourly_step['hour'] = hourly_step.ActivityHour.dt.hour

# calculate average steps for every hour
avg_hourly_step = hourly_step.groupby('hour')['StepTotal'].mean()

In [None]:
avg_hourly_step.head()

# **PHASE 5: SHARE**

Now we create visualizations of the data and communicate the findings from our analysis to our stakeholders.

**Plotting a chart to show the percentage of activity in Mins**

In [None]:
# calculating the total mins for each column
very_active_mins = daily_activity["very_active_mins"].sum()
fairly_active_mins = daily_activity["fairly_active_mins"].sum()
lightly_active_mins = daily_activity["lightly_active_mins"].sum()
sedentary_mins = daily_activity["sedentary_mins"].sum()

# plotting pie chart
slices = [very_active_mins, fairly_active_mins, lightly_active_mins, sedentary_mins]
labels = ["Very active minutes", "Fairly active minutes", "Lightly active minutes", "Sedentary minutes"]
colours = ["purple", "orange", "lightblue", "yellow"]
explode = [0, 0, 0, 0.1]
plt.style.use("default")
plt.pie(slices, labels = labels, 
        colors = colours, wedgeprops = {"edgecolor": "black"}, 
        explode = explode, autopct = "%1.1f%%")
plt.title("Percentage of Activity in Minutes")
plt.tight_layout()
plt.show()

*From the pie chart:*

- Sedentary minutes take the biggest slice (81.3%). This suggests that users mostly wear the Fitbit during everyday, low-movement periods such as the daily commute.

- Only a small share of the logged time is fitness activity (1.1% fairly active minutes and 1.7% very active minutes).
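The percentages printed on the pie chart can also be computed directly, which makes them easier to quote. The sketch below uses made-up minutes, and `minutes_demo` and `shares_pct` are hypothetical names.

```python
import pandas as pd

# Made-up minutes per activity level for two days, for illustration only.
minutes_demo = pd.DataFrame({
    "very_active_mins": [30, 0],
    "fairly_active_mins": [20, 10],
    "lightly_active_mins": [200, 150],
    "sedentary_mins": [700, 890],
})

# Total minutes per activity level, then each level's share of all logged minutes.
totals = minutes_demo.sum()
shares_pct = (totals / totals.sum() * 100).round(1)

print(shares_pct)
```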

In [None]:
# plotting histogram
plt.style.use("default")
plt.figure(figsize=(7,6)) 

plt.hist(daily_activity.day_of_the_week, bins = 7, 
         width = 0.7, color = "lightblue", edgecolor = "black")
plt.xlabel("Day of the week")
plt.ylabel("Frequency")
plt.title("No. of times users logged in app in one week")
plt.show()


The histogram shows how often users logged activity in the Fitbit app on each day of the week. Users log more on weekdays than on weekends: usage is highest from Tuesday to Friday and decreases slightly after that.
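Because the day names are plain strings, the bars appear in the order the days first occur in the data. If a Monday-to-Sunday order is preferred, the counts can be computed and reordered explicitly. This is a sketch with made-up dates, and `dates_demo` and `weekday_order` are hypothetical names.

```python
import pandas as pd

# Made-up dates for illustration; in the notebook this would be daily_activity["date"].
dates_demo = pd.Series(pd.to_datetime([
    "2016-04-12", "2016-04-13", "2016-04-13", "2016-04-16", "2016-04-18",
]))

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Count records per day name, then force Monday-to-Sunday order.
# Days with no records get a count of 0 instead of being dropped.
records_per_weekday = (
    dates_demo.dt.day_name()
    .value_counts()
    .reindex(weekday_order, fill_value=0)
)

print(records_per_weekday)
```

The resulting Series can be passed to `.plot(kind="bar")` to get bars in calendar order.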

**For the first question: "For each user, on every day, what is the hour that they have the highest number of steps?", we will create a bar chart to look deeper:**


In [None]:
avg_hourly_step.plot(kind='bar')
plt.xlabel("Hour")
plt.ylabel("Step taken")
plt.title('Average Steps Taken for Every Hour');

A large number of steps are taken between 08:00 and 20:00, which matches users' daily routines. Combined with the chart of the number of times users logged in to the app in one week, we can conclude that they prefer using the Fitbit on working days.

The highest step counts are around 18:00, which is probably when most users finish work and start their exercise routines.
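The bar chart averages over all users and days. To answer the question literally, for each user on each day, one possible approach is to group by user and date and pick the row with the most steps. The sketch below uses made-up hourly records with the same column names as hourly_step. `hourly_demo`, `peak_hours` and `peak_hour_counts` are hypothetical names.

```python
import pandas as pd

# Made-up hourly records for illustration (same column names as hourly_step).
hourly_demo = pd.DataFrame({
    "Id": [1, 1, 1, 1, 2, 2],
    "ActivityHour": pd.to_datetime([
        "2016-04-12 08:00", "2016-04-12 18:00",
        "2016-04-13 07:00", "2016-04-13 12:00",
        "2016-04-12 09:00", "2016-04-12 19:00",
    ]),
    "StepTotal": [500, 1200, 900, 300, 100, 2500],
})

# Split the timestamp into a calendar day and an hour of the day.
hourly_demo["date"] = hourly_demo["ActivityHour"].dt.normalize()
hourly_demo["hour"] = hourly_demo["ActivityHour"].dt.hour

# For each (user, day), idxmax returns the row label of the busiest hour.
# Ties resolve to the first such row.
peak_rows = hourly_demo.groupby(["Id", "date"])["StepTotal"].idxmax()
peak_hours = hourly_demo.loc[peak_rows, ["Id", "date", "hour", "StepTotal"]]

# How often each hour is a user's daily peak.
peak_hour_counts = peak_hours["hour"].value_counts()

print(peak_hours)
```

One caveat of this sketch: a day with zero steps in every hour would still report a "peak" (its first hour), so such days may need filtering out first.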

**Now, we are creating a visualization to answer the question: Is there a relationship between Steps Taken and Calories Burned?**

In [None]:
# plotting scatter plot
plt.style.use("default")
plt.figure(figsize=(7,6)) 
plt.scatter(daily_activity.total_steps, daily_activity.calories, 
            alpha = 0.6, c = daily_activity.calories, 
            cmap = "coolwarm")

plt.colorbar(orientation = "vertical")
plt.xlabel("Steps taken")
plt.ylabel("Calories burned")
plt.title("Calories burned for every step taken")
plt.grid(True)
# no plt.legend() here: the scatter has no labelled artists, so it would only emit a warning
plt.show()

We found a positive correlation between Steps Taken and Calories Burned: the more steps users take, the more calories they burn. Most users walked between 0 and 15,000 steps per day.

We also noticed that some records show calories burned with 0 steps taken. This could be a device error.
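The strength of the relationship can be put into a number with a Pearson correlation coefficient, and the zero-step records can be pulled out for inspection. The sketch below uses made-up values. `activity_demo`, `steps_calories_corr` and `zero_step_days` are hypothetical names.

```python
import pandas as pd

# Made-up values for illustration only (not the Fitbit data).
activity_demo = pd.DataFrame({
    "total_steps": [0, 2000, 5000, 8000, 12000],
    "calories": [1500, 1700, 2000, 2300, 2800],
})

# Pearson correlation coefficient between steps and calories (ranges from -1 to 1).
steps_calories_corr = activity_demo["total_steps"].corr(activity_demo["calories"])

# Days with calories recorded but no steps at all.
zero_step_days = activity_demo[
    (activity_demo["total_steps"] == 0) & (activity_demo["calories"] > 0)
]

print(round(steps_calories_corr, 3), len(zero_step_days))
```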

**The last question: Do users get enough sleep?**

In [None]:
fig, axes = plt.subplots(2, 1, sharex=True, figsize=(9,5))
for col, ax in zip(['TotalMinutesAsleep', 'TotalTimeInBed'], axes):
    sns.boxplot(data=sleep_day, x=col, orient='h', ax=ax)
plt.xticks(np.arange(200, 750, 50));

* Users typically stay in bed for around 400 to 525 minutes, which is roughly 7 to 9 hours.
* Most of them sleep for only around 370 to 480 minutes, which is roughly 6 to 8 hours.
* The recommended amount of sleep for adults is around 7-9 hours per night (Source: [How Much Sleep Do We Really Need?](https://www.sleepfoundation.org/how-sleep-works/how-much-sleep-do-we-really-need#:~:text=National%20Sleep%20Foundation%20guidelines1,to%208%20hours%20per%20night.)). The gap between time in bed and time asleep suggests that users do other things in bed before or after sleeping (browsing websites, social media, reading, ...).
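The boxplots show typical ranges, but not how many nights fall short of the recommendation. A minimal sketch of that check follows, using made-up sleep records with the same column names as sleep_day. `sleep_demo` and `MIN_RECOMMENDED_MINUTES` are hypothetical names, and taking 7 hours as the cut-off is an assumption based on the lower bound quoted above.

```python
import pandas as pd

# Made-up sleep records for illustration (same column names as sleep_day).
sleep_demo = pd.DataFrame({
    "Id": [1, 1, 2, 2, 3],
    "TotalMinutesAsleep": [360, 450, 400, 480, 300],
})

MIN_RECOMMENDED_MINUTES = 7 * 60  # assumed cut-off: lower bound of the 7-9 hour range

# Fraction of recorded nights below the recommended minimum.
short_nights = sleep_demo["TotalMinutesAsleep"] < MIN_RECOMMENDED_MINUTES
share_short_nights = short_nights.mean()

# Fraction of users whose average night is below the minimum.
avg_sleep_per_user = sleep_demo.groupby("Id")["TotalMinutesAsleep"].mean()
share_short_sleepers = (avg_sleep_per_user < MIN_RECOMMENDED_MINUTES).mean()

print(share_short_nights, share_short_sleepers)
```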

To look more closely at the time users spend awake in bed, we create one more boxplot.

In [None]:
fig = plt.figure(figsize=(10, 3))
sns.boxplot(data= sleep_day, x='TotalTimeAwakeOnBed')
plt.xticks(np.arange(0, sleep_day['TotalTimeAwakeOnBed'].max(), 20));

As we can see, users mostly spend 20-40 minutes in bed each day doing something other than sleeping. There are also quite a few outliers, meaning some users stay in bed for a long time before or after sleeping. However, we can't be sure that all of these outliers are genuine, because some of them may be caused by device errors.
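The outliers drawn on a boxplot follow the 1.5 × IQR rule by default, so they can be listed explicitly and reviewed one by one. The sketch below uses made-up minutes, and `awake_demo`, `upper_fence` and `lower_fence` are hypothetical names.

```python
import pandas as pd

# Made-up minutes awake in bed, for illustration only.
awake_demo = pd.Series([10, 20, 25, 30, 35, 40, 45, 200])

# First and third quartiles and the interquartile range.
q1 = awake_demo.quantile(0.25)
q3 = awake_demo.quantile(0.75)
iqr = q3 - q1

# The 1.5 * IQR rule that boxplots use for their whiskers by default.
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# Values outside the fences are the points drawn as outliers.
outliers = awake_demo[(awake_demo > upper_fence) | (awake_demo < lower_fence)]

print(lower_fence, upper_fence, outliers.tolist())
```

Filtering sleep_day with the same mask would return the full rows behind each outlier, which helps when judging whether a value looks like a device error.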

# **PHASE 6: ACT**

**CONCLUSION:**

* The majority of logged time (81.3%) is sedentary, which suggests that users mostly wear the Fitbit for everyday activities rather than to track their health habits.
* Users track their activities more on weekdays than on weekends, especially during working hours. They may spend more time outside on weekdays and stay in on weekends, and they seem to prefer exercising after work.
* The more steps users take, the more calories they burn.
* Most users sleep around 6-8 hours, at or below the low end of the recommended 7-9 hours, and they spend additional time in bed doing other things.

**My Recommendation**

- The Bellabeat marketing team can run workshops or publish articles to raise users' awareness of the benefits of fitness.
- Add features offering different types of exercise for each part of the week (a 5-minute exercise on weekdays and a 15-minute exercise on weekends) and show the calories burned in the Bellabeat app.
- Create reminders to take short breaks during working hours so that users can do quick exercises after several hours of sitting and working.
- Send users who do not get enough sleep, or who spend a long time awake in bed, recommendations on how to fall asleep faster (like this [article](https://www.healthline.com/nutrition/ways-to-fall-asleep#1.-Lower-the-temperature)).

*Alright, this is the end of my analysis for this case study. Thanks to the Kaggle members and community, who helped me a lot in finding ideas and working out how to solve it. This is my very first completed analysis, so if you spot any issues or have any recommendations, feel free to connect and share them with me.*