## Case Study 2: How Can a Wellness Technology Company Play It Smart?

+ Author: Yuan-Siang Pan
+ Date: 21/08/2021
+ Data License CC0: Public Domain

This is the final task of the Google Data Analytics Certificate: presenting a case study. My goal is to help Bellabeat by using data from non-Bellabeat smart devices to understand users' behavior and improve its own products and services.

So the key questions are:

+ <font color=#0000FF>What are some trends in smart device usage?</font>
+ <font color=#0000FF>How could these trends apply to Bellabeat customers?</font>
+ <font color=#0000FF>How could these trends help influence Bellabeat marketing strategy?</font>


### Data Prepare

In this case, I will use Python to prepare, process, analyze, and visualize the data. Let's import the modules and data first.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import warnings 
warnings.filterwarnings('ignore')
%matplotlib inline

**<font color=#0000FF>The purpose of this project is to find trends in smart device usage. To find the pattern of users' habits, I will focus on daily data only.</font>**

In [None]:
daily_activities=pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
daily_calories=pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv')
daily_intensities=pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv')
daily_steps=pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv')
sleep_day=pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
weight_info=pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv')

<font color=#0000FF>Explore the data to get basic information and check whether there are any missing values.

### Data Process

In [None]:
daily_activities.info()
daily_calories.info()
daily_intensities.info()
daily_steps.info()

<font color=#0000FF>*We notice that the data type of ActivityDay is object instead of datetime; it needs to be fixed later on.*
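
`info()` reports non-null counts per column, but a more direct check for missing values and repeated rows can help. Here is a minimal sketch on a made-up toy table (the values are invented for illustration, not taken from the Fitbit files):

```python
import numpy as np
import pandas as pd

# Toy stand-in for one of the daily tables (all values are made up).
toy = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "ActivityDay": ["4/12/2016", "4/13/2016", "4/12/2016", "4/12/2016"],
    "Calories": [1985, np.nan, 2100, 2100],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

print(missing_per_column)
print("duplicated rows:", n_duplicates)
```

The same two calls could be applied to each of the loaded tables in turn.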

In [None]:
[len(daily_activities.Id.unique()), len(daily_calories.Id.unique()), len(daily_intensities.Id.unique()), len(daily_steps.Id.unique())]

<font color=#0000FF>*We can see that the daily data record activity information for <font color=#FF0000>33</font> unique users.*

In [None]:
daily_activities.describe()

In [None]:
daily_calories.describe()

In [None]:
daily_intensities.describe()

In [None]:
daily_steps.describe()

<font color=#0000FF>*After checking the summary statistics, we see that the dataset "daily_activities" combines all the daily data in daily_calories, daily_intensities and daily_steps, plus three additional distance columns.*

<font color=#0000FF>Lastly, let's check sleep_day and weight_info.

In [None]:
sleep_day.info()
weight_info.info()

<font color=#0000FF>*These two datasets also have datetime columns that need to be fixed.*

In [None]:
[len(sleep_day.Id.unique()), len(weight_info.Id.unique())]

**<font color=#0000FF>*Here we find two issues:***

**<font color=#0000FF>*1. There are only <font color=#FF0000>24</font> unique Ids in the sleep_day data and only <font color=#FF0000>8</font> in the weight_info data (compared with <font color=#FF0000>33</font> in the activities data). This means that to analyze the relation between activity, sleep, and weight, we can only use these eight users' data to avoid bias.***

**<font color=#0000FF>*2. From another point of view, it seems that some users are not really interested in collecting data via smart devices, especially personal weight information, which has to be reported manually.***
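
Which users appear in which table can be checked with plain set operations. A minimal sketch, using short hypothetical Id lists in place of the real `Id` columns:

```python
import pandas as pd

# Hypothetical Id columns standing in for the three tables (values are made up).
activity_ids = pd.Series([1, 1, 2, 3, 4, 4])
sleep_ids = pd.Series([1, 2, 2, 4])
weight_ids = pd.Series([2, 4, 5])

ids_activity = set(activity_ids.unique().tolist())
ids_sleep = set(sleep_ids.unique().tolist())
ids_weight = set(weight_ids.unique().tolist())

# Users present in all three tables.
in_all = ids_activity & ids_sleep & ids_weight

# Users with weight logs but no activity records.
weight_only = ids_weight - ids_activity

print("in all tables:", sorted(in_all))
print("weight only:", sorted(weight_only))
```

Counting unique Ids alone does not show whether the smaller groups are subsets of the larger one, so the intersection is worth checking before comparing tables.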

Next, let's parse the date columns and bring them into one common `MM/DD/YYYY` format so that the tables can later be merged on `Id` and `Date`.

In [None]:
# Change column names

sleep_day.columns=['Id', 'Date', 'TotalSleepRecords', 'TotalMinutesAsleep',
       'TotalTimeInBed']
daily_activities.columns=['Id', 'Date', 'TotalSteps', 'TotalDistance', 'TrackerDistance',
       'LoggedActivitiesDistance', 'VeryActiveDistance',
       'ModeratelyActiveDistance', 'LightActiveDistance',
       'SedentaryActiveDistance', 'VeryActiveMinutes', 'FairlyActiveMinutes',
       'LightlyActiveMinutes', 'SedentaryMinutes', 'Calories']

# Change all date columns in three tables into same format

sleep_day.Date=pd.to_datetime(sleep_day['Date'], format="%m/%d/%Y %I:%M:%S %p").dt.strftime("%m/%d/%Y")
daily_activities.Date=pd.to_datetime(daily_activities['Date'], format="%m/%d/%Y").dt.strftime("%m/%d/%Y")
weight_info.Date=pd.to_datetime(weight_info['Date'], format="%m/%d/%Y %I:%M:%S %p").dt.strftime("%m/%d/%Y")
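
Note that `strftime` returns text, so the columns above end up as strings in a common format rather than as datetime values. A minimal sketch on two made-up timestamps shows the difference, with `.dt.normalize()` as one possible way to keep a real datetime type (an alternative, not what this notebook does):

```python
import pandas as pd

# Two made-up timestamps in the same layout as the sleep and weight tables.
raw = pd.Series(["4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"])

parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

# Option used in this notebook: format back to text, e.g. "04/12/2016".
as_text = parsed.dt.strftime("%m/%d/%Y")

# Alternative: keep datetime values, with the time of day set to midnight.
as_date = parsed.dt.normalize()

print(as_text.tolist())
print(as_date.dt.day_name().tolist())
```

Either form works as a merge key, as long as every table uses the same one.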

In [None]:
# Add weekday information to each dataset.
# The Date columns are "%m/%d/%Y" strings at this point, so parse them and take the day name.

for df in (daily_activities, sleep_day, weight_info):
    df["Weekday"] = pd.to_datetime(df["Date"], format="%m/%d/%Y").dt.day_name()

### Data Analyze

Why do people wear smart devices? Maybe they are health conscious, maybe they want to look stylish, or maybe they have other reasons. In this case, we assume that people who wear smart devices are health conscious, so we want to see whether there is any evidence to support this assumption.

First, I will focus on total steps, because this attribute reflects real physical movement or exercise. Here I will use the ECDF (Empirical Cumulative Distribution Function) to check whether users' total steps follow a normal distribution; if they do, we can use this sample to infer statistics of the population.

In [None]:
# Group TotalSteps by Id and take each user's mean

all_mean_steps = daily_activities.groupby(["Id"])['TotalSteps'].mean()

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n
    
    return x, y

# Compute ECDF : x_vers, y_vers

x_vers, y_vers = ecdf(all_mean_steps)
mu = np.mean(all_mean_steps)
sigma = np.std(all_mean_steps)
samples = np.random.normal(mu, sigma, size=10000)
x_theor, y_theor = ecdf(samples)

# Generate plot

_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x_vers, y_vers, marker='.',linestyle = 'none')

# Label the axes

_ = plt.xlabel('Total Steps')
_ = plt.ylabel('ECDF')


# Display the plot
plt.show()

**<font color=#0000FF>The graph shows that the total steps data approximately follow a normal distribution. From the graph we can also see that about 20% of users average fewer than 5000 steps per day, and nearly 50% average fewer than 7500.**
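
The proportions read off the ECDF can also be computed directly, since the ECDF at a threshold is just the share of values at or below it. A minimal sketch with hypothetical per-user mean steps (the numbers are made up):

```python
import numpy as np

# Hypothetical per-user mean daily steps (made-up values).
mean_steps = np.array([2500, 4800, 6100, 7400, 7600, 8900, 9900, 11200, 12500, 16000])

def ecdf_at(values, threshold):
    """Share of values that are less than or equal to the threshold."""
    return float(np.mean(np.asarray(values) <= threshold))

share_5000 = ecdf_at(mean_steps, 5000)
share_7500 = ecdf_at(mean_steps, 7500)

print("share at or below 5000 steps:", share_5000)
print("share at or below 7500 steps:", share_7500)
```

This avoids estimating the values by eye from the plot.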

Next, I want to infer the mean steps of the population to get a reference value for future use; it is a good benchmark for analyzing users of Bellabeat's products.

In [None]:
def bootstrap_replicate_1d(data, func):
    """Generate bootstrap replicate of 1D data."""
    bs_sample = np.random.choice(data, len(data))
    return func(bs_sample)

def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates."""

    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)

    return bs_replicates

# Take 10,000 bootstrap replicates of the mean: bs_replicates
bs_replicates = draw_bs_reps(all_mean_steps,np.mean, 10000)

# Compute and print the SEM, the standard deviation of the bootstrap replicates, and the 95% confidence interval
bs_mean = np.mean(bs_replicates)
sem = np.std(all_mean_steps) / np.sqrt(len(all_mean_steps))
bs_std = np.std(bs_replicates)
conf_int = np.percentile(bs_replicates,[2.5,97.5])
                
print("mean of bootstrap replicates：" , bs_mean)                
print("standard error of the mean：", sem)
print("standard deviation of bootstrap replicates：" , bs_std)
print("95% confidence interval：" , "2.5%:", conf_int[0]," 97.5%:", conf_int[1])

# Make a histogram of the results
_ = plt.hist(bs_replicates, bins=50, density=True)
_ = plt.xlabel('Mean total Steps')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

**<font color=#0000FF>*So now we have inferential statistics for the population, and these can be used as reference information for our product.***
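
As a sanity check on the bootstrap, the standard deviation of the replicates should land close to the analytical standard error of the mean. A minimal, seeded sketch on a made-up sample (33 synthetic values, not the real users):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sample standing in for the per-user mean steps.
sample = rng.normal(7500, 3000, size=33)
n = len(sample)

# Draw 10,000 bootstrap samples at once (rows) and take the mean of each.
reps = rng.choice(sample, size=(10000, n), replace=True).mean(axis=1)

sem = sample.std() / np.sqrt(n)   # analytical standard error of the mean
bs_std = reps.std()               # spread of the bootstrap means
ci_low, ci_high = np.percentile(reps, [2.5, 97.5])

print("SEM:", sem)
print("bootstrap std:", bs_std)
print("95% interval:", ci_low, ci_high)
```

If the two numbers differ noticeably, that usually points to a mistake in the resampling code rather than to the data.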

**Next, I make two assumptions:** 

**First, I take only users' "VeryActiveMinutes" as "exercise" and look at the proportion of users with more than 30 minutes of exercise per day.**
    
 **Second, I add "VeryActiveMinutes" and "FairlyActiveMinutes" together, treat the sum as "exercise", and look at the proportion of users with more than 30 minutes of exercise per day.**   


In [None]:
# Get the total of each type of active minutes.
# The sum is wrapped in parentheses so that the second line is part of the same expression.

daily_activities["TotalActiveMinutes"] = (
    daily_activities["SedentaryMinutes"] + daily_activities["LightlyActiveMinutes"]
    + daily_activities["FairlyActiveMinutes"] + daily_activities["VeryActiveMinutes"]
)

daily_activities["Exercise_1"] = daily_activities["VeryActiveMinutes"]
daily_activities["Exercise_2"] = daily_activities["VeryActiveMinutes"] + daily_activities["FairlyActiveMinutes"]
daily_activities["Exercise_3"] = daily_activities["VeryActiveMinutes"] + daily_activities["FairlyActiveMinutes"] + daily_activities["LightlyActiveMinutes"]



# Check more than 30 mins exercise (numeric_only skips the text columns Date and Weekday)
daily_id_mean = daily_activities.groupby(["Id"]).mean(numeric_only=True)

very_act_1 = sum(daily_id_mean["Exercise_1"] > 30) / len(daily_id_mean.index) * 100
print("There are {0}% of Users having more than 30 mins very active activities(take VeryActiveMinutes data only)".format(int(very_act_1)))

very_act_2 = sum(daily_id_mean["Exercise_2"] > 30) / len(daily_id_mean.index) * 100
print("There are {0}% of Users having more than 30 mins very active activities(VeryActive plus FairlyActive)".format(int(very_act_2)))

# plot 
fig, axes= plt.subplots(1, 4, figsize=(20,8))

daily_id_mean = daily_activities.groupby(["Id"]).mean(numeric_only=True)
workout_1 = sns.boxplot(y = daily_id_mean["Exercise_1"], ax=axes[0], color = "red")
workout_1.set_title("VeryActive")
workout_1.set_ylabel("Workout minutes")

workout_2 = sns.boxplot(y = daily_id_mean["Exercise_2"], ax=axes[1], color = "green")
workout_2.set_title("VeryActive + FairlyActive")
workout_2.set_ylabel("Workout minutes")

chill = sns.boxplot(y = daily_id_mean["LightlyActiveMinutes"]/60, ax=axes[2], color = "yellow")
chill.set_title("Chill")
chill.set_ylabel("Hours")

total_act = sns.boxplot(y = daily_id_mean["TotalActiveMinutes"]/60, ax=axes[3], color = "orange")
total_act.set_title("Total Active")
total_act.set_ylabel("Hours")

**<font color=#0000FF>From these graphs we can see that in the first scenario, our strict assumption, 24% of users exercise for more than 30 mins a day, and in the second scenario, our less strict assumption, 51% do. Next, let's compare their total steps with the inferred mean steps of the population.**

In [None]:
# Compare the mean steps of the (Very + Fairly) active users with the population's.

exercise_mean_steps = daily_id_mean[daily_id_mean["Exercise_2"] > 30]["TotalSteps"]

fig, axes= plt.subplots(figsize=(20,8))

_ = plt.hist(bs_replicates, bins=30, alpha=0.3, density=True)

a = list(exercise_mean_steps.values)
for i in a:
    plt.axvline(x=i, color="black", alpha=0.5, ls='--')

plt.title("Workout users' Steps v.s. Population Steps")

**<font color=#0000FF>Except for 3 users whose mean steps fall below the population average, most of these users are far above the average, which can be seen as evidence that they do wear their smart devices while working out.**

**So now we know that 51% of the users wear their smart devices and work out, but on which weekdays do they work out the least and the most?**

In [None]:
# Extract the (Very + Fairly) group and count the records per weekday.

exercise_week_x = daily_activities[daily_activities["Exercise_2"] > 30]['Weekday'].value_counts().index
exercise_week_y = daily_activities[daily_activities["Exercise_2"] > 30]['Weekday'].value_counts()

fig, axes= plt.subplots(figsize=(15,5))
exercise_week = sns.barplot(x = exercise_week_x, y = exercise_week_y)
exercise_week.set_title("WeekDay Exercise Counts")
exercise_week.set_ylabel("Counts")

**<font color=#0000FF>It is quite intuitive that people work out less on the weekend, and it is also interesting to find that users work out most on Tuesday.**
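
Raw counts depend on how many records exist for each weekday, so a share per weekday can be a fairer comparison. A minimal sketch on a made-up table, assuming the same `Weekday` and `Exercise_2` column names (the `Active` column and the values are hypothetical):

```python
import pandas as pd

# Made-up daily records with the same column names as above.
toy = pd.DataFrame({
    "Weekday": ["Monday", "Monday", "Tuesday", "Tuesday", "Tuesday",
                "Saturday", "Saturday", "Sunday"],
    "Exercise_2": [45, 10, 60, 35, 5, 20, 40, 0],
})

order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Flag days with more than 30 exercise minutes, then average the flag per weekday.
toy["Active"] = toy["Exercise_2"] > 30
share = toy.groupby("Weekday")["Active"].mean().reindex(order)

print(share)
```

Reindexing also puts the weekdays in calendar order; weekdays with no records show up as NaN.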
  

Next, let's take a look at the sleep and weight data. First, I will merge the two datasets with daily_activities. Second, I will check whether there are relations between these datasets (e.g., the more active time, the less time before falling asleep).

In [None]:
# Merge dataset 

act_sleep_merge = sleep_day.merge(daily_activities,how='inner',on=["Id", "Date"])
act_weight_merge = weight_info.merge(daily_activities,how='inner',on=["Id", "Date"])
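
If either table contains repeated rows, an inner merge silently repeats them in the result. A minimal sketch on made-up tables, dropping exact duplicates first and using the `validate` argument of `merge` to assert that the keys are unique:

```python
import pandas as pd

# Made-up tables; the sleep table contains one exact duplicate row.
sleep = pd.DataFrame({
    "Id": [1, 1, 1, 2],
    "Date": ["04/12/2016", "04/13/2016", "04/13/2016", "04/12/2016"],
    "TotalMinutesAsleep": [400, 380, 380, 420],
})
activity = pd.DataFrame({
    "Id": [1, 1, 2, 3],
    "Date": ["04/12/2016", "04/13/2016", "04/12/2016", "04/12/2016"],
    "TotalSteps": [9000, 11000, 4000, 7000],
})

sleep_unique = sleep.drop_duplicates()

# validate="one_to_one" raises an error if (Id, Date) repeats in either table.
merged = sleep_unique.merge(activity, how="inner", on=["Id", "Date"],
                            validate="one_to_one")

print(len(sleep), len(sleep_unique), len(merged))
```

Comparing the row counts before and after the merge is a quick way to see how many records were lost or repeated.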

In [None]:
# Add a new column called BeforeSleep, which indicates the time in bed before falling asleep.

act_sleep_merge["BeforeSleep"] = act_sleep_merge["TotalTimeInBed"] - act_sleep_merge["TotalMinutesAsleep"]

#Check the value first
act_sleep_merge["BeforeSleep"].describe()


In [None]:
#check the top 20 "BeforeSleep" values.
act_sleep_merge.sort_values("BeforeSleep", ascending=False)["BeforeSleep"].head(20)

<font color=#0000FF>It is unreasonable that users took so much time to fall asleep. Maybe the devices had technical issues; in any case, I decided to slice the data and keep only the rows with a "BeforeSleep" time of less than 100 mins and less than 60 mins.

In [None]:
#Slice data into less than 100 and 60 mins.
slice_1 = act_sleep_merge[act_sleep_merge["BeforeSleep"]<100]
slice_2 = act_sleep_merge[act_sleep_merge["BeforeSleep"]<60]
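
The 100 and 60 minute cut-offs are judgment calls. As a point of comparison, a data-driven cut-off can be derived with the common 1.5 × IQR rule. A minimal sketch on made-up "BeforeSleep" values (this is an alternative technique, not the one used in this notebook):

```python
import pandas as pd

# Made-up minutes in bed before falling asleep, with two extreme values.
before_sleep = pd.Series([5, 12, 18, 20, 25, 31, 40, 55, 150, 320])

q1 = before_sleep.quantile(0.25)
q3 = before_sleep.quantile(0.75)

# Upper fence of the 1.5 * IQR rule; values above it are treated as outliers.
upper = q3 + 1.5 * (q3 - q1)
kept = before_sleep[before_sleep <= upper]

print("upper fence:", upper)
print("rows kept:", len(kept), "of", len(before_sleep))
```

Whichever threshold is chosen, it helps to report how many rows it removes.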

In [None]:
# Plot BeforeSleep time (less than 100 mins) vs. active minutes

fig, axes= plt.subplots(1, 3, figsize=(20,8))


before_sleep_1 = sns.regplot(x = "Exercise_1", y = "BeforeSleep", data = slice_1, ax=axes[0], color = "red")
before_sleep_1.set_title("VeryActive")
before_sleep_1.set_xlabel("Active Minutes")
before_sleep_1.set_ylabel("BeforeSleep Minutes")

before_sleep_2 = sns.regplot(x = "Exercise_2", y = "BeforeSleep", data = slice_1, ax=axes[1], color = "green")
before_sleep_2.set_title("VeryActive + FairlyActive")
before_sleep_2.set_xlabel("Active Minutes")
before_sleep_2.set_ylabel("BeforeSleep Minutes")

before_sleep_3 = sns.regplot(x = "Exercise_3", y = "BeforeSleep", data = slice_1, ax=axes[2], color = "orange")
before_sleep_3.set_title("VeryActive + FairlyActive + LightlyActiveMinutes")
before_sleep_3.set_xlabel("Active Minutes")
before_sleep_3.set_ylabel("BeforeSleep Minutes")

**<font color=#0000FF>There seems to be a weak negative relationship between the time to fall asleep (less than 100 mins) and active time.**

In [None]:
# Plot BeforeSleep time (less than 60 mins) vs. active minutes
fig, axes= plt.subplots(1, 3, figsize=(20,8))


before_sleep_1 = sns.regplot(x = "Exercise_1", y = "BeforeSleep", data = slice_2, ax=axes[0], color = "red")
before_sleep_1.set_title("VeryActive")
before_sleep_1.set_xlabel("Active Minutes")
before_sleep_1.set_ylabel("BeforeSleep Minutes")

before_sleep_2 = sns.regplot(x = "Exercise_2", y = "BeforeSleep", data = slice_2, ax=axes[1], color = "green")
before_sleep_2.set_title("VeryActive + FairlyActive")
before_sleep_2.set_xlabel("Active Minutes")
before_sleep_2.set_ylabel("BeforeSleep Minutes")

before_sleep_3 = sns.regplot(x = "Exercise_3", y = "BeforeSleep", data = slice_2, ax=axes[2], color = "orange")
before_sleep_3.set_title("VeryActive + FairlyActive + LightlyActiveMinutes")
before_sleep_3.set_xlabel("Active Minutes")
before_sleep_3.set_ylabel("BeforeSleep Minutes")

**<font color=#0000FF>Again there seems to be a weak negative relationship between the time to fall asleep (less than 60 mins) and active time, which hints that more active time goes with less time before falling asleep. Let's check the correlation coefficients.**

In [None]:
for i in ["Exercise_1", "Exercise_2", "Exercise_3"]:
    corr = np.corrcoef(slice_1[i], slice_1["BeforeSleep"])[0,1]
    print("The Coefficient of Correlation between {0} and BeforeSleep less than 100 mins is {1} ".format(i,corr))

print("-"*80)    

for i in ["Exercise_1", "Exercise_2", "Exercise_3"]:
    corr = np.corrcoef(slice_2[i], slice_2["BeforeSleep"])[0,1]
    print("The Coefficient of Correlation between {0} and BeforeSleep less than 60 mins is {1} ".format(i,corr))
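
A correlation coefficient alone does not say whether a weak relationship could be due to chance. One way to gauge this with NumPy only is a permutation test: shuffle one variable many times and count how often the shuffled correlation is at least as strong as the observed one. A minimal, seeded sketch on synthetic data (the data and the `permutation_pvalue` helper are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with a deliberately weak negative relationship.
active = rng.uniform(0, 120, size=200)
before_sleep = 40 - 0.02 * active + rng.normal(0, 15, size=200)

def permutation_pvalue(x, y, n_perm=2000, seed=2):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    perm_rng = np.random.default_rng(seed)
    observed = np.corrcoef(x, y)[0, 1]
    hits = 0
    for _ in range(n_perm):
        # Shuffling y breaks any real link between x and y.
        r = np.corrcoef(x, perm_rng.permutation(y))[0, 1]
        if abs(r) >= abs(observed):
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

r_obs, p_value = permutation_pvalue(active, before_sleep)
print("observed r:", r_obs)
print("permutation p-value:", p_value)
```

A large p-value would mean that a correlation of this size shows up often even when the two variables are unrelated.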

Next, let's see whether users who record their weight tend to exercise more.

In [None]:
act_weight_merge_mean = act_weight_merge.groupby("Id").mean(numeric_only=True)

In [None]:
# Plot the mean workout minutes of the users who logged their weight

fig, axes= plt.subplots(1, 2, figsize=(10,6))

act_weight_merge_mean = act_weight_merge.groupby("Id").mean(numeric_only=True)

daily_id_mean = daily_activities.groupby(["Id"]).mean(numeric_only=True)
weight_workout_1 = sns.swarmplot(y = act_weight_merge_mean["Exercise_1"], ax=axes[0], color = "red")
weight_workout_1.set_title("VeryActive")
weight_workout_1.set_ylabel("Workout minutes")

weight_workout_2 = sns.swarmplot(y = act_weight_merge_mean["Exercise_2"], ax=axes[1], color = "orange")
weight_workout_2.set_title("VeryActive + FairlyActive")
weight_workout_2.set_ylabel("Workout minutes")

**<font color=#0000FF>There are not enough samples to support our assumption that users who record their weight exercise more, but we can still see that only 8 of 33 users were willing to report their weight data.**

### Conclusion

+ <font color=#0000FF>What are some trends in smart device usage?</font>
    
  First, from these datasets we estimate that the population's average is about 7500 steps per day (with a standard error of about 600 steps). This information 
  could help us check users' activity status. Almost half of the users exercise for more than 30 mins a day; in contrast, the other half don't.
  
  Second, there is a very weak negative relationship between the time to fall asleep and active time: the correlation coefficients are only between -0.029 and -0.10,
  so it is hard to show that more exercise means less time waiting to fall asleep.
  
  Last, only eight users recorded their weight information. Because of the insufficient data, it is hard to find any pattern between using smart devices and 
  losing weight.

+ <font color=#0000FF>How could these trends apply to Bellabeat customers?</font>
   
   Since we know that half of the users work out for more than thirty mins every day, it is very important to develop the app to track their activity and help our customers exercise 
   in a healthier way based on their workout pattern, for example by reminding them to set a goal to complete and by encouraging them to use features such as the sleep monitor and weight
   report, because without data it is hard to extract any insight.
   
   For the other half of the users, who work out for less than thirty mins every day, the purpose of buying or wearing smart devices may not be connected to 
   "Fitness". These customers may be driven by curiosity or other reasons.

    
+ <font color=#0000FF>How could these trends help influence Bellabeat marketing strategy?</font>

    For the marketing strategy, half of the users' purpose in buying or wearing smart devices seems not to be connected to "Fitness". In other words, focusing on 
    different stylish or fancy designs might attract more customers who are not really interested in exercising. 