# Introduction

This case study analyzes smart device fitness data to unlock new growth opportunities for Bellabeat, a high-tech manufacturer of health-focused products for women. It is the capstone project for my **Google Data Analytics Professional Certificate**. The main purpose is to identify trends in smart device usage and to show how these trends can guide Bellabeat's marketing strategy.

# Background Information
### Product
Bellabeat offers various products and services to its customers, including:

1. Bellabeat app, which provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits, collected by 3 different kinds of smart wellness trackers:
   - Leaf, the classic one that can be worn as a bracelet, necklace, or clip,
   - Time, a wellness watch that combines the timeless look of a classic timepiece with smart technology, and
   - Spring, the water bottle that tracks daily water intake using smart technology.
2. Bellabeat membership, which gives members 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

<br>

### Current marketing strategy
Bellabeat invests in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses extensively on digital marketing.

The company invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.

<br>

# Method applied
This case study follows the general data analytics process, referred to here as the **APPASA approach**: **Ask, Prepare, Process, Analyze, Share, and Act**.

<br>

# Description of the data
The data comes from the “FitBit Fitness Tracker Data” dataset, made available by Möbius under the CC0: Public Domain license. It was generated by 30 respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016, who consented to the submission of their personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. The CC0 license means we can use the dataset for our analysis without copyright restrictions, and all data in it was collected with the users' consent.

Individual reports can be parsed by export session ID or timestamp. The files can be divided into 3 groups: (1) minute-level data, (2) hour-level data, and (3) merged summary data. This analysis relies mainly on the merged summary data, supported by the minute- and hour-level data.

However, the dataset has a few problems that may make the final insights less accurate. First, the data is not up to date. Second, the tables differ in size (number of observations: dailyactivity > sleepdata > weightinfo), so an analysis of the relationship between dailyactivity and weightinfo may lack supporting data. Third, the data was collected by voluntary submission, so a form of selection bias (similar to survivorship bias) may occur: the better their data looks, the more willing users are to share it, and vice versa.

To do: find out how many user IDs appear in every dataset.
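
One way to answer this is to intersect the sets of IDs from each table. The snippet below is a minimal sketch on made-up toy frames (`daily_activity`, `sleep_data`, and `weight_info` are hypothetical stand-ins, not the real files); it only assumes that every table has an `Id` column.

```python
import pandas as pd

# Hypothetical toy tables standing in for the real files; only "Id" matters here.
daily_activity = pd.DataFrame({"Id": [1, 1, 2, 3, 4]})
sleep_data = pd.DataFrame({"Id": [1, 2, 2, 4]})
weight_info = pd.DataFrame({"Id": [2, 4]})

tables = {
    "dailyActivity": daily_activity,
    "sleepData": sleep_data,
    "weightInfo": weight_info,
}

# Number of distinct users in each table
users_per_table = {name: int(df["Id"].nunique()) for name, df in tables.items()}

# Users that appear in every table: intersect the per-table sets of IDs
common_ids = set.intersection(*(set(df["Id"]) for df in tables.values()))

print(users_per_table)
print(f"Users present in all tables: {len(common_ids)}")
```

On the real data, the same pattern would apply after loading each CSV into its own DataFrame.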

<br>

# Step 1 - Ask 
### Define the business task
This case study analyzes smart device usage data to gain insight into how consumers already use their smart devices. We first identify trends in smart device usage, then apply these trends to Bellabeat customers to help shape Bellabeat's marketing strategy.

Specifically, **we will study the current usage of smart devices and apply the results to determine what our products can improve, which product should be promoted most, and who our target audience should be.**

<br>

### Key stakeholders
- Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer 
- Sando Mur: Mathematician and Bellabeat co-founder; key member of the Bellabeat executive team
- Bellabeat executive team  

<br>

### Questions of interest
- Generally, when do users use these smart devices?
- How long do users sleep?
- Is there a relationship between Total Distance and Calories Burned?
- Is there a relationship between Total Distance and Total Minutes Asleep?
- What is the distribution of activity in minutes? (How does it compare to the government-suggested time?)
- What is the distribution of activity in distance?
<br>

# Step 2 - Prepare
### Environment Set-Up

In [None]:
import pandas as pd
import numpy as np

import re
import math

from pandas.api.types import CategoricalDtype
import warnings
warnings.filterwarnings("ignore")

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Pass the style through sns.set so that 'ticks' is not reset to the seaborn default style
sns.set(style='ticks', rc={'figure.figsize': (15, 10)})
plt.rcParams["figure.figsize"] = (20,3)


%matplotlib inline

import plotly.express as px
import cufflinks as cf
cf.go_offline()

In [None]:
dailyActivity = pd.read_csv('/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv', parse_dates={'Date': [1]}) 
sleepDay = pd.read_csv('/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv', parse_dates={'Date': [1]}) 

In [None]:
#### dailyActivity
dailyActivity.insert(1, 'Day', dailyActivity['Date'].dt.day_name())
dailyActivity

In [None]:
dailyActivity.describe()

In [None]:
dailyActivity.isna().sum()

In [None]:
print(f"For dailyActivity:\nNumber of unique dates: {dailyActivity.Date.nunique()}\n{dailyActivity.Date.unique()}\n")
print(f"For dailyActivity:\nNumber of unique IDs: {dailyActivity.Id.nunique()}\n{dailyActivity.Id.unique()}")

In [None]:
dailyActivity.groupby("Id")["Day"].count()

In [None]:
#### sleepDay
sleepDay

In [None]:
sleepDay.describe()

# Step 3 & 4 - Process and Analyse 


In [None]:
merged_df = pd.merge(dailyActivity,sleepDay, on=['Id','Date'],how ='left')
merged_df
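
Because this is a left merge, activity rows without a matching sleep record end up with missing sleep columns. As a rough illustration on made-up data (the `activity` and `sleep` frames below are hypothetical), the `indicator=True` option of `merge` shows how many rows found a match before any rows are dropped:

```python
import pandas as pd

# Hypothetical toy data: three activity days, only two of them with a sleep record
activity = pd.DataFrame({
    "Id": [1, 1, 2],
    "Date": pd.to_datetime(["2016-04-12", "2016-04-13", "2016-04-12"]),
    "TotalSteps": [8000, 12000, 3000],
})
sleep = pd.DataFrame({
    "Id": [1, 2],
    "Date": pd.to_datetime(["2016-04-12", "2016-04-12"]),
    "TotalMinutesAsleep": [420, 380],
})

# indicator=True adds a "_merge" column: "both" for matched rows, "left_only" otherwise
merged = activity.merge(sleep, on=["Id", "Date"], how="left", indicator=True)
match_counts = merged["_merge"].value_counts()

# Keep only the days that have both activity and sleep data
complete = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(match_counts)
print(complete)
```

Checking the counts first makes it clear how much of the activity data is lost once rows without sleep data are removed.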

In [None]:
missing_values = merged_df.isna()
plt.figure(figsize=(15,10))
sns.heatmap(data = missing_values)

In [None]:
merged_df.dropna(inplace=True)

In [None]:
missing_values = merged_df.isna()
plt.figure(figsize=(15,10))
sns.heatmap(data = missing_values)

# Step 5 - Share

This section presents the insights, each supported by a graph, and ends with a short summary.

In [None]:
merged_df.columns

In [None]:
plt.figure(figsize=(15,10))
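# numeric_only=True (pandas >= 1.5) skips non-numeric columns such as Day and Date
merged_df_corr = merged_df.corr(numeric_only=True)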
sns.heatmap(data = merged_df_corr,annot=True)

In [None]:
sns.pairplot(data=merged_df, kind="reg", plot_kws={'line_kws':{'color':'red'}} , corner=True)

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x="Day", data=merged_df, 
              order=["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"])
plt.xticks(rotation = 90)
plt.title("Number of Records per Day of the Week")

#### Finding
From the plot above, we can observe that **users wore their smart devices most often on Wednesday, Tuesday, and Thursday, which are regular weekdays.**
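
Note that a count plot counts records, not distinct users, so one user can contribute several records to the same weekday. The sketch below uses a small made-up log (the `log` frame is hypothetical) to show both numbers side by side:

```python
import pandas as pd

# Hypothetical toy log: one row per user per tracked day
log = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 3],
    "Date": pd.to_datetime([
        "2016-04-12", "2016-04-13", "2016-04-19",
        "2016-04-12", "2016-04-13", "2016-04-16",
    ]),
})
log["Day"] = log["Date"].dt.day_name()

# Number of logged records per weekday (what a count plot displays)
records_per_day = log["Day"].value_counts()

# Number of distinct users per weekday
users_per_day = log.groupby("Day")["Id"].nunique()

print(records_per_day)
print(users_per_day)
```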

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(x="Day",y="SedentaryMinutes",data=merged_df,
           order=["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"])
plt.xticks(rotation=90)

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(x="Day",y="TotalSteps",data=merged_df,
           order=["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"])
plt.xticks(rotation=90)

#### Finding
From the sedentary minutes by day plot, we can observe that **users are sedentary for longer on weekdays, as the boxes sit higher than on weekends. One possible reason is that they are working and sitting in the office on weekdays.**

From the total steps by day plot, we can observe that, in general, **users walk the most on Saturday and the least on Sunday.**
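
To back up what the box plots show with numbers, the per-day medians can be tabulated. This is a minimal sketch on made-up values (the `df` frame is hypothetical); it uses an ordered `CategoricalDtype` so the days appear in calendar order rather than alphabetically.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

day_order = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]
day_type = CategoricalDtype(categories=day_order, ordered=True)

# Hypothetical toy values, two records for each of three days
df = pd.DataFrame({
    "Day": ["Monday", "Monday", "Saturday", "Saturday", "Sunday", "Sunday"],
    "SedentaryMinutes": [800, 760, 650, 690, 700, 720],
    "TotalSteps": [7000, 8000, 11000, 9000, 5000, 6000],
})
df["Day"] = df["Day"].astype(day_type)

# Median per weekday; observed=True keeps only the days present in the data
daily_median = (
    df.groupby("Day", observed=True)[["SedentaryMinutes", "TotalSteps"]].median()
)

print(daily_median)
print("Day with most steps:", daily_median["TotalSteps"].idxmax())
```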

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(x="Day",y="TotalMinutesAsleep",data=merged_df,
           order=["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"])
plt.xticks(rotation=90)
plt.title("Total Minutes Asleep by Day of the Week")

#### Finding
The box plot shows that **users tend to sleep longer on weekends than on weekdays.**
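
A direct weekday versus weekend comparison can complement the box plot. The snippet below is a sketch on made-up sleep records (the `df` frame and the `DayType` column are hypothetical names); it treats Saturday and Sunday as the weekend.

```python
import pandas as pd

# Hypothetical toy sleep records (Tue, Wed, Fri, Sat, Sun)
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-04-12", "2016-04-13", "2016-04-15", "2016-04-16", "2016-04-17",
    ]),
    "TotalMinutesAsleep": [400, 420, 410, 470, 490],
})

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
is_weekend = df["Date"].dt.dayofweek >= 5
df["DayType"] = is_weekend.map({True: "Weekend", False: "Weekday"})

sleep_summary = df.groupby("DayType")["TotalMinutesAsleep"].agg(
    ["count", "mean", "median"]
)

print(sleep_summary)
```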

In [None]:
plt.figure(figsize=(10,8))
sns.regplot (x="TotalDistance",y="Calories",data=merged_df)
plt.title("Total Distance vs Calories Burned")

In [None]:
plt.figure(figsize=(10,8))
sns.regplot (x="TotalSteps",y="TotalMinutesAsleep",data=merged_df)
plt.title("Total Steps vs Total Minutes Asleep")

#### Finding
In general, **walking more burns more calories, while walking less does not mean sleeping more.**
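
The strength of each relationship can also be quantified with a Pearson correlation coefficient. The sketch below uses made-up numbers (the `df` frame is hypothetical) chosen so that calories rise linearly with distance while sleep shows no clear pattern against steps.

```python
import pandas as pd

# Hypothetical toy data: calories rise linearly with distance,
# sleep has no clear relationship with steps
df = pd.DataFrame({
    "TotalDistance": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Calories": [1500, 1700, 1900, 2100, 2300],
    "TotalSteps": [1500, 3000, 4500, 6000, 7500],
    "TotalMinutesAsleep": [430, 410, 450, 400, 440],
})

# Series.corr computes the Pearson correlation by default
r_calories = df["TotalDistance"].corr(df["Calories"])
r_sleep = df["TotalSteps"].corr(df["TotalMinutesAsleep"])

print(f"Distance vs calories: r = {r_calories:.2f}")
print(f"Steps vs minutes asleep: r = {r_sleep:.2f}")
```

A coefficient close to 1 indicates a strong positive linear relationship, while a value close to 0 indicates little or no linear relationship.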

In [None]:
merged_df['TotalMinutes'] = merged_df['LightlyActiveMinutes'] + merged_df['FairlyActiveMinutes'] + merged_df['VeryActiveMinutes'] + merged_df['SedentaryMinutes']

activitiesm_means = merged_df['TotalMinutes'].mean()
lightlym_pcr = (merged_df['LightlyActiveMinutes'].mean()/activitiesm_means) * 100
fairlym_pcr = (merged_df['FairlyActiveMinutes'].mean()/activitiesm_means) * 100
verym_pcr = (merged_df['VeryActiveMinutes'].mean()/activitiesm_means) * 100
sedentarym_pcr = (merged_df['SedentaryMinutes'].mean()/activitiesm_means) * 100

# plotting
plt.figure(figsize=(10,8))
plt.pie([ fairlym_pcr, verym_pcr,lightlym_pcr,sedentarym_pcr], 
        labels = [ "Fairly Active Minutes", "Very Active Minutes", "Light Active Minutes", "Sedentary Minutes"], 
        colors = ['#d8e2dc', '#ffe5d9', '#ffcad4', '#9d8189'], 
        # wedgeprops = {"edgecolor": "black"}, 
        explode = [0, 0, 0, 0.1], 
        autopct = "%1.1f%%")

plt.title("Percentage of Activity in Minutes")
plt.show()

#### Finding
**Almost 3/4 of the total tracked time is sedentary, and lightly active minutes account for only 22.3% of it. One possible reason is that users wear their smart devices all day, not only when they are exercising.**
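
The shares in the pie chart can also be computed as a small table. This is a minimal sketch on made-up rows (the `df` frame is hypothetical). Because every column has the same number of rows, dividing each column mean by the mean total gives the same result as dividing each column sum by the overall sum, which is what the snippet does.

```python
import pandas as pd

minute_cols = ["SedentaryMinutes", "LightlyActiveMinutes",
               "FairlyActiveMinutes", "VeryActiveMinutes"]

# Hypothetical toy data: two tracked days
df = pd.DataFrame({
    "SedentaryMinutes": [700, 800],
    "LightlyActiveMinutes": [200, 180],
    "FairlyActiveMinutes": [20, 10],
    "VeryActiveMinutes": [30, 60],
})

# Total minutes per activity level, then each level's share of all tracked minutes
totals = df[minute_cols].sum()
shares = (totals / totals.sum() * 100).round(1)

print(shares)
```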

In [None]:
activities_means = merged_df["TotalDistance"].mean()

lightly_pcr = (merged_df['LightActiveDistance'].mean()/activities_means) * 100
moderately_pcr = (merged_df['ModeratelyActiveDistance'].mean()/activities_means) * 100
very_pcr = (merged_df['VeryActiveDistance'].mean()/activities_means) * 100

plt.figure(figsize=(10,8))
plt.pie([lightly_pcr, moderately_pcr, very_pcr], 
        labels = ["Light Active Distance", "Moderately Active Distance", "Very Active Distance"], 
        colors = ['#9d8189', '#ffe5d9', '#d8e2dc'],  
        explode = [0.1, 0, 0], 
        autopct = "%1.1f%%")


plt.title("Percentage of Activity in Distance")
plt.show()


#### Short Summary

<br>

**From the findings above, we can conclude that:**

1. Users tend to use smart devices on weekdays, especially on Wednesdays, but the total steps on Wednesdays are not the highest.

2. There is a positive relationship between Total Distance and Calories Burned.

3. Users tend to walk more on Saturday and less on Sunday. One possible reason is that they hang out with friends on Saturday night and get more rest on Sunday, which is consistent with the sleep data.

4. Almost 3/4 of the total tracked time is sedentary, and lightly active minutes account for only 22.3% of it. One possible reason is that users wear their smart devices all day, not only when they are exercising.

# Step 6 - Act

Based on the data above, I suggest that Bellabeat keep investing in its current digital marketing strategy. I also recommend that the brand add new features, such as encouraging messages sent to users and a schedule planner on the smart devices, both personalized using the user's data.

For example, the brand can send these messages to its users on Wednesday, the day with the most logged records but a lower activity level, as shown in the graphs above. This may strengthen customers' attachment to the brand, encourage them to work out more, and lead them to use more Bellabeat products along their wellness journeys.

With the help of the data, we can also see that users are less likely to exercise on weekends. Therefore, we can train a model to suggest a reasonable training schedule based on the user's data, such as light training on regular weekdays, ending with a more intense session on Friday.

