#How Can a Wellness Technology Company Play It Smart?#


In this case study I am introduced to Bellabeat, a high-tech manufacturer of health-focused products for women. 
This case study is part of the Google Data Analytics Professional Certificate.

Data Source -  [FitBit Fitness Tracker Data](https://www.kaggle.com/arashnic/fitbit)

License - CC0: Public Domain, dataset made available through [Mobius](https://www.kaggle.com/arashnic)

##About the company##

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.
By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses on digital marketing extensively. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.

##The scenario and main questions##

Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.

We need to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. We must select one Bellabeat product to apply these insights to in our presentation. These questions will guide the analysis:

1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?

##Key Stakeholders:##

**Urška Sršen** - Bellabeat cofounder and CCO

**Sando Mur** - Bellabeat cofounder

**Bellabeat marketing analytics team**

##Products##

**1. Bellabeat app:** The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

**2. Leaf:** Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

**3. Time:** This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

**4. Spring:** This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

**5. Bellabeat membership:** Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

In [None]:
# Importing packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as sts

##Preparation##


###Data Source and Evaluation###

The data source that we are using is large and divided into several different dataframes. For this case study, I've identified that I need to use just some of them to achieve the goals of my analysis:

- Daily Activity
- Daily Calories
- Daily Steps
- Heart rate Seconds
- Hourly Calories
- Hourly Steps
- Sleep Day
- Weight Log

If needed, I'll add more dataframes further on.

As a first step, I'll import the dataset and check for inconsistencies, first looking at the "head" of each file to understand its structure, then checking whether there are null/NaN values.

In [None]:
dailyActivity = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
dailyCalories = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv')
dailySteps = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv')
heartrateSeconds = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv')
hourlyCalories = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv')
hourlySteps = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv')
sleepDay = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
weightLog = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv')

In [None]:
dailyActivity.head()

In [None]:
dailyCalories.head()

In [None]:
heartrateSeconds.head()

In [None]:
dailySteps.head()

In [None]:
hourlyCalories.head()

In [None]:
sleepDay.head()

In [None]:
hourlySteps.head()

In [None]:
weightLog.head()

In [None]:
# Count the null/NaN values per column in each dataframe
dataframes = {
    'Daily Activity': dailyActivity,
    'Daily Calories': dailyCalories,
    'Daily Steps': dailySteps,
    'Heart Rate Seconds': heartrateSeconds,
    'Hourly Calories': hourlyCalories,
    'Hourly Steps': hourlySteps,
    'Sleep Day': sleepDay,
    'Weight Log': weightLog,
}

for name, df in dataframes.items():
    print(name)
    print(df.isnull().sum())
    print('----------')
    print('\n')

In [None]:
fatGroup = weightLog.groupby(['Fat']).size()
print(fatGroup)

As we can see, there are 65 null entries in "Weight Log". I grouped the values of 'Fat' to try to identify a median, but the vast majority of the occurrences are null, with only 2 actual 'Fat' values available, which is too few to be useful. I'll carry out the analysis without that information.
The other entries are complete for all the dataframes. We can handle any datatype inconsistencies identified later during the exploratory analysis.
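
The decision to drop a mostly empty column can be made explicit by looking at the share of missing values per column. The sketch below is only an illustration: the rows are invented, `WeightKg` is an assumed column name, and the 50% cut-off is an arbitrary assumption rather than a rule used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the weight log; all values are invented for illustration.
toy_weight = pd.DataFrame({
    'Id': [1, 1, 2, 3],
    'WeightKg': [52.6, 52.4, 133.5, 56.7],   # assumed column name
    'Fat': [22.0, np.nan, np.nan, np.nan],
})

# isna().mean() gives the fraction of missing values in each column.
null_share = toy_weight.isna().mean()

# Keep only the columns where at most half of the values are missing
# (the 0.5 threshold is an assumption, not a fixed rule).
usable_columns = null_share[null_share <= 0.5].index.tolist()

print(null_share)
print(usable_columns)
```

With a check like this, a column such as 'Fat' is flagged as unusable without having to group its values first.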

# Exploratory Analysis #



##1 - How often those users wear their devices##

First, we can identify how often those users wear their devices, so we can classify them as heavy, regular or casual users. To do so, we need to identify which daily dataframe has the highest number of unique Ids, so we can consider how often each user wore the device.

In [None]:
# Count the unique user Ids in each daily dataframe
dailyFrames = {
    'Daily Activity Ids': dailyActivity,
    'Daily Calories Ids': dailyCalories,
    'Daily Steps Ids': dailySteps,
    'Sleep Day Ids': sleepDay,
}

for name, df in dailyFrames.items():
    print(name)
    print(df['Id'].nunique())
    print('----------')
    print('\n')

Daily Activity, Calories and Steps show the highest number of unique users, 33. So that's the number we will use to evaluate usage.
Now, let's find out how often those users wear their smart devices. We can use the "Daily Steps" dataframe to evaluate this, considering that users with 0 steps in a day are not using their devices.
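
The zero-step rule described above could be applied before counting days. This is a minimal sketch on invented rows, assuming a `StepTotal` column and a 4-day observation window; it is a variant for illustration, not the calculation used to produce the figures in this notebook.

```python
import pandas as pd

# Toy stand-in for dailySteps; the rows are invented for illustration.
toy_steps = pd.DataFrame({
    'Id': [1, 1, 1, 1, 2, 2, 2, 2],
    'StepTotal': [5000, 7200, 0, 8100, 0, 0, 3000, 0],
})

total_days = 4  # assumed length of the observation window in this toy example

# Count only the days with at least one recorded step, per user.
days_worn = (
    toy_steps[toy_steps['StepTotal'] > 0]
    .groupby('Id')
    .size()
    # Users with no active day at all would otherwise disappear from the result.
    .reindex(toy_steps['Id'].unique(), fill_value=0)
)

usage_rate = days_worn / total_days
print(usage_rate)
```

Filtering first means a day on which the device logged a row but no steps does not count as a day of use.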

In [None]:
# Count the rows (logged days) per user with ".value_counts" applied to 'Id'.
# Note: zero-step days are not filtered out here.
singleUsers = dailySteps['Id'].value_counts().rename_axis('unique_values').reset_index(name='counts')

# Usage rate: days logged by a single user divided by the total days in the dataframe (31)
def usageRate(x):
    return x / 31

# Series.apply already visits every row, so no explicit loop is needed
singleUsers['usageRate'] = singleUsers['counts'].apply(usageRate)

# Now, we can define and apply bins for Heavy Users, Regular Users and Casual Users.

def binnedUsage(x):
    if x == 1.0:
        return "Heavy User"
    elif x >= 0.7 and x < 1.0 :
        return "Regular User"
    else :
        return "Casual User"

singleUsers['usageProfile'] = singleUsers['usageRate'].apply(binnedUsage)

# Calculating the share of each user profile among the 33 users

singleUsers['usageProfile'].value_counts() / 33


In [None]:
singleUsers

Now we can plot a visualization to better understand the distribution of usage between each user.

In [None]:
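labels = ['Heavy User', 'Regular User', 'Casual User']
# Percentage of users in each profile, computed from singleUsers instead of typed in by hand
sizes = (singleUsers['usageProfile'].value_counts(normalize=True) * 100).reindex(labels, fill_value=0)
explode = (0.1, 0, 0)

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode = explode, labels = labels, autopct = '%1.1f%%',
        shadow= True , startangle = 90)
ax1.axis('equal')

plt.show()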

###Key Takeaways:###

- The majority of users in this dataframe (63.6%) are Heavy Users
- Heavy and Regular users together make up 88% of the dataframe, which suggests that daily adoption of these devices is strong


##2 - Identifying how active those users are##

In [None]:
dailyActivity.head()

In [None]:
print(dailyActivity['ActivityDate'].dtype)

As we can see, the activity date column holds dates but is stored with the "object" datatype. We need to convert it to "datetime" in order to identify weekends, business days, the day of the week being analysed, etc. This will help identify trends in each user's activity profile.
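
As a small illustration of that conversion, the sketch below parses a few invented date strings with an explicit format and derives the day name and a weekend flag. The month/day/year layout is an assumption made for this example; passing `format` simply avoids any ambiguity between day-first and month-first strings.

```python
import numpy as np
import pandas as pd

# Toy date strings; the month/day/year layout is an assumption for illustration.
toy_activity = pd.DataFrame(
    {'ActivityDate': ['4/15/2016', '4/16/2016', '4/17/2016', '4/18/2016']}
)

# Parse the strings into datetime64 values using an explicit format.
toy_activity['ActivityDate'] = pd.to_datetime(
    toy_activity['ActivityDate'], format='%m/%d/%Y'
)

# dayofweek is 0 for Monday through 6 for Sunday, so values above 4 are weekend days.
toy_activity['dayName'] = toy_activity['ActivityDate'].dt.day_name()
toy_activity['bdayWeekend'] = np.where(
    toy_activity['ActivityDate'].dt.dayofweek > 4, 'Weekend', 'Business Day'
)

print(toy_activity)
```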

In [None]:
dailyActivity['ActivityDate'] = pd.to_datetime(dailyActivity['ActivityDate'])

dailyActivity['month'] = dailyActivity['ActivityDate'].dt.month_name()

dailyActivity['dayName'] = dailyActivity['ActivityDate'].dt.day_name()

dailyActivity['bdayWeekend'] = np.where(dailyActivity['ActivityDate'].dt.dayofweek > 4, 'Weekend', 'Business Day')

Now that we have the month, the day of the week, and whether each date is a business day or a weekend day, we can evaluate how active those users are along each of these variables. To do so, we first look at a full statistical summary of the dataframe, then create pivot tables and plot graphs for better visualization.

In [None]:
print("Daily Activity Summary")
print('\n')
summary = dailyActivity.describe()
print(summary)

In [None]:
activityPerDay = dailyActivity.pivot_table(
    values = ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes'],
    index = 'dayName'
)

activityPerMonth = dailyActivity.pivot_table(
    values = ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes'],
    index = 'month'
)

activityBusinessWeekend = dailyActivity.pivot_table(
    values = ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes'],
    index = 'bdayWeekend'
)

In [None]:
activityPerMonth

Note: We only have 2 months to evaluate, and activity levels vary little between them, so a month-by-month comparison is of low relevance for our analysis.

In [None]:
activityBusinessWeekend

In [None]:
activityPerDay

In [None]:
activityPerDay.plot(kind = 'bar', figsize = [16, 6])

In [None]:
activityBusinessWeekend.plot(kind = 'bar', figsize = [16, 6])

To be more specific, we can use dailySteps and hourlyCalories. With these two dataframes we can identify the hour of the day with the most activity and how steps correlate with calories burned.

In [None]:
dailySteps.head()

In [None]:
hourlyCalories.head()

Now we can merge both tables with our previous table "dailyActivity" to evaluate how their information relates.
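
When combining a daily table with an hourly one, it is worth noting that a merge on the index pairs rows by their position, while a merge on key columns pairs rows that belong to the same user and day. The sketch below shows the key-based alternative on invented rows; it assumes both tables can be given a shared `ActivityDate` column and is not the merge used in the cells that follow.

```python
import pandas as pd

# Toy daily table: one row per user and day (values invented).
toy_daily = pd.DataFrame({
    'Id': [1, 1, 2],
    'ActivityDate': pd.to_datetime(['2016-04-12', '2016-04-13', '2016-04-12']),
    'StepTotal': [8000, 3000, 12000],
})

# Toy hourly table: several rows per user and day (values invented).
toy_hourly = pd.DataFrame({
    'Id': [1, 1, 1, 2],
    'ActivityHour': pd.to_datetime([
        '2016-04-12 08:00', '2016-04-12 18:00',
        '2016-04-13 18:00', '2016-04-12 07:00',
    ]),
    'Calories': [90, 150, 80, 200],
})

# Derive a calendar-day key from the hourly timestamp (time set to midnight).
toy_hourly['ActivityDate'] = toy_hourly['ActivityHour'].dt.normalize()

# Each hourly row picks up the daily values of the same user and day.
# validate raises an error if the daily table has duplicate (Id, ActivityDate) keys.
merged = toy_hourly.merge(
    toy_daily, on=['Id', 'ActivityDate'], how='left', validate='many_to_one'
)
print(merged)
```

Joining on `Id` and the date keeps every hourly record attached to the right user, which a positional join cannot guarantee when the tables have different lengths.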

In [None]:
# Note: left_index/right_index joins rows by their index labels (row position here), not by Id or date
mergedStepsCalories = dailySteps.merge(hourlyCalories, left_index=True, right_index=True)
mergedStepsCalories = mergedStepsCalories.merge(dailyActivity, left_index=True, right_index=True)
mergedStepsCalories.head()

In [None]:
mergedStepsCalories['ActivityHour'] = pd.to_datetime(mergedStepsCalories['ActivityHour'])
mergedStepsCalories['activityHour'] = mergedStepsCalories['ActivityHour'].dt.hour

In [None]:
mergedStepsCalories.head()

In [None]:
# Per hour we will use "Calories_x", which shows the calories burned each hour. Steps are not included here because that information is daily...
# ...and so is not comparable to calories per hour.

activityPerHour = mergedStepsCalories.pivot_table(
    values = ['Calories_x'],
    index = 'activityHour'
)

# Per day we can use "Calories_y", which shows the calories burned each day
activityPerDay = mergedStepsCalories.pivot_table(
    values = ['StepTotal', 'Calories_y'],
    index = 'dayName'
)

activityBusinessWeekend = mergedStepsCalories.pivot_table(
    values = ['StepTotal', 'Calories_y'],
    index = 'bdayWeekend'
)

In [None]:
activityBusinessWeekend

In [None]:
activityPerDay

In [None]:
activityPerHour

In [None]:
activityBusinessWeekend.plot(kind = 'bar', figsize = [16, 6])

In [None]:
activityPerDay.plot(kind = 'bar', figsize = [16, 6])

In [None]:
activityPerHour.plot(kind = 'bar', figsize = [16, 6])

###Key Takeaways:###

- Users are slightly more active on Saturdays
- In general, activity levels on weekends and business days are similar, with similar calories burned too
- Users' very active, fairly active, lightly active and sedentary minutes average 21.16, 13.56, 192.81 and 991.21 minutes per day, respectively
- When active, the users in this dataframe are mostly lightly active
- The more steps, the more calories burned
- Activity peaks between 6pm and 9pm
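
Two of the takeaways above, the link between steps and calories and the peak hour, can be checked numerically rather than only read off a chart. This is a hedged sketch on invented numbers: it uses the Pearson correlation from `Series.corr` and `idxmax` on hourly means, and the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# Toy daily values (invented) to illustrate the steps/calories relationship.
toy_daily = pd.DataFrame({
    'StepTotal': [2000, 5000, 8000, 11000, 14000],
    'Calories': [1700, 1900, 2150, 2300, 2600],
})

# Pearson correlation: values close to 1 mean calories rise together with steps.
pearson_r = toy_daily['StepTotal'].corr(toy_daily['Calories'])

# Toy hourly values (invented) to illustrate finding the peak hour.
toy_hourly = pd.DataFrame({
    'activityHour': [7, 7, 12, 12, 18, 18],
    'Calories': [80, 90, 100, 110, 140, 150],
})

# Mean calories per hour of the day, then the hour with the highest mean.
hourly_mean = toy_hourly.groupby('activityHour')['Calories'].mean()
peak_hour = hourly_mean.idxmax()

print(round(pearson_r, 3))
print(peak_hour)
```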



##3 - Evaluating the sleeping habits of those users##

In [None]:
sleepDay.head()

In [None]:
# Converting the date formats and creating the day name and weekend evaluator

sleepDay['SleepDay'] = pd.to_datetime(sleepDay['SleepDay'])
sleepDay['dayName'] = sleepDay['SleepDay'].dt.day_name()
sleepDay['bdayWeekend'] = np.where(sleepDay['SleepDay'].dt.dayofweek > 4, 'Weekend', 'Business Day')

In [None]:
print("Sleep Day Summary")
print('\n')
sleepDaySummary = sleepDay.describe()
print(sleepDaySummary)

In [None]:
sleepDayPivot = sleepDay.pivot_table(
    values = ['TotalMinutesAsleep', 'TotalTimeInBed'],
    index = 'dayName'
)

In [None]:
sleepDayPivot

In [None]:
sleepDayPivot.plot(kind = 'bar', figsize = [16, 6])

In [None]:
sleepDayPivot = sleepDay.pivot_table(
    values = ['TotalSleepRecords', 'TotalMinutesAsleep', 'TotalTimeInBed'],
    index = 'dayName'
)

###Key Takeaways:###

- Users tend to sleep once per day (1.11 sleep records on average)
- The users in this dataframe sleep 419.46 minutes (~7h) daily (mean)
- The users in this dataframe spend 458.63 minutes (~7.6h) in bed daily (mean)
- On Sundays people tend to sleep more (~7h50)
- On Thursdays people tend to sleep less (~6h)
- In general, the users from this dataframe have good sleeping habits, staying between 7 and 9 hours, which is recommended ([source](https://www.sleepfoundation.org/how-sleep-works/how-much-sleep-do-we-really-need#:~:text=National%20Sleep%20Foundation%20guidelines1,to%208%20hours%20per%20night.))
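
The minute values above are easier to read as hours and minutes. The helper below is a hypothetical convenience function written for this illustration (it is not part of the notebook), and the final ratio is just the quotient of the two reported means, not a per-night average.

```python
def minutes_to_hm(minutes):
    """Format a duration in minutes as 'XhYY', rounding to the nearest minute."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours}h{mins:02d}"

mean_minutes_asleep = 419.46   # mean reported in the takeaways above
mean_minutes_in_bed = 458.63   # mean reported in the takeaways above

print(minutes_to_hm(mean_minutes_asleep))
print(minutes_to_hm(mean_minutes_in_bed))

# Rough share of the time in bed that is spent asleep (ratio of the two means).
asleep_share = mean_minutes_asleep / mean_minutes_in_bed
print(round(asleep_share, 2))
```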


##4 - Weight log exploration##


In [None]:
weightLog.head()

As we can see, the Weight Log gives us some information about manual and automatic reporting. This is useful for understanding how those customers use their devices. It's interesting to merge this dataframe with the activity dataframe in order to see how more active users behave in terms of automatic/manual reports.

In [None]:
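# Note: left_index/right_index joins rows by their index labels (row position here), not by Id or date
mergedWeightActivity = weightLog.merge(dailyActivity, left_index=True, right_index=True)
mergedWeightActivity.head()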

In [None]:
# Changing the datatype of 'BMI', which is Object, to float, so we can create our bins later on

mergedWeightActivity['BMI'] = mergedWeightActivity['BMI'].astype(float)

In [None]:
mergedWeightActivity['BMI'].dtype

In [None]:
reportTypeActivity = mergedWeightActivity.pivot_table(
    values = ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes'],
    index = 'IsManualReport'
)

reportTypeActivity

In [None]:
reportTypeActivity.plot(kind = 'bar', figsize = [16, 6])

Another evaluation we can make is how people with a healthy BMI (between 18.5 and 24.9) compare with those who are underweight (less than 18.5) or overweight (more than 24.9) in terms of exercise.
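
As an aside, the same kind of binning can be expressed with `pd.cut`. The sketch below is one possible formulation on invented BMI values; it assumes an upper bound of 25.0 for the healthy range, so values between 24.9 and 25 count as healthy here, which is a slightly different boundary from the one stated above.

```python
import numpy as np
import pandas as pd

# Toy BMI values, invented for illustration.
toy_bmi = pd.Series([17.0, 18.5, 22.0, 24.95, 30.0], name='BMI')

# right=False makes each bin closed on the left: [0, 18.5), [18.5, 25), [25, inf).
bmi_category = pd.cut(
    toy_bmi,
    bins=[0, 18.5, 25.0, np.inf],
    labels=['Underweight', 'Healthy', 'Overweight'],
    right=False,
)

print(bmi_category)
```

`pd.cut` returns an ordered categorical, so the categories keep their logical order in pivot tables and plots.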

In [None]:
# First, we can create bins to separate users who are underweight, healthy and overweight.

def binnedBMI(x):
    if x < 18.5:
        return "Underweight"
    elif x >= 18.5 and x <= 24.9 :
        return "Healthy"
    else :
        return "Overweight"

# Series.apply already visits every row, so no explicit loop is needed
mergedWeightActivity['BMI_Binned'] = mergedWeightActivity['BMI'].apply(binnedBMI)


# Now, we can create the pivot table

activityBMI = mergedWeightActivity.pivot_table(
    values = ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes'],
    index = 'BMI_Binned'
)

activityBMI

In [None]:
mergedWeightActivity['BMI_Binned'].value_counts()

In [None]:
activityBMI.plot(kind = 'bar', figsize = [16, 6])

###Key Takeaways:###


First, we are using the [American Cancer Society](https://www.cancer.org/cancer/cancer-causes/diet-physical-activity/body-weight-and-cancer-risk/adult-bmi.html) weight ranges to determine each category.

- Users who enter manual weight reports are more active than users who use automatic reports
- Users who rely on automatic reports are more sedentary than the ones who don't
- Healthy BMI users are, in general, more active than overweight users when we evaluate all the activity minutes, from sedentary minutes to very active minutes
- Users who are more active and have a healthy BMI tend to use manual reports
- The dataframe has a similar number of Overweight and Healthy users, with no users falling into the "underweight" category

# Key takeaways and suggestions #


### What are some trends in smart device usage? ###

- The users analyzed are mostly Heavy or Regular users, who make up 88% of the population
- 63.6% of those users use their devices every day
- Users who exercise more/are more active tend to use manual weight reporting
- In general, the users from this dataframe have good sleeping habits, staying between 7 and 9 hours, which is recommended ([source](https://www.sleepfoundation.org/how-sleep-works/how-much-sleep-do-we-really-need#:~:text=National%20Sleep%20Foundation%20guidelines1,to%208%20hours%20per%20night.)). If we assume that this sample reflects the population well, smart device users tend to get enough sleep


### How could these trends apply to Bellabeat customers? ###

- Bellabeat should build on the fact that many users tend to use their devices on a daily basis, improving its products to be reliable for any kind of activity
- Improving the reliability of products may help improve device usage. We should explore this further, but some of those users may not wear their devices daily because the devices are not resistant enough for some activities, such as swimming
- Considering that the most active users normally rely on manual reporting, Bellabeat should invest in improving the accuracy of automatic reports to increase their adoption, which would be more practical for customers
- Those users don't have a product like Spring, which can help track water intake. Bellabeat should explore this in order to cross-sell Spring to users who own other devices


### How could these trends help influence Bellabeat marketing strategy? ###

- Knowing that users are mostly active between 6pm and 9pm, Bellabeat could explore the activities most commonly done by women in that timeframe and target their practitioners
- Knowing that users who take more steps tend to burn more calories and maintain a better heart rate, Bellabeat should use this to communicate the benefits of walking enough and tracking activity for overall health. This could lead to more customer acquisition
- Bellabeat should clearly communicate that its products are reliable for daily use and for any kind of activity, considering that the majority of users (88%) tend to use their devices daily
- Bellabeat should focus on notifications during the day to inform customers of their current data (heart rate, calories, etc.) and of health agencies' recommendations for each of those metrics. This could increase adoption and encourage users to buy other products for better tracking of their activities

### Attention points ###

- The dataset doesn't focus on women, so any variation in usage between men and women could not be identified
- The dataset relies on information from 2016, which may be somewhat outdated
- The dataset covers a short timeframe (1 month), so any variation between months, quarters, semesters, etc. could not be identified