# BELLABEAT CASE STUDY USING PYTHON
  
  CAPSTONE PROJECT FOR GOOGLE DATA ANALYTICS CERTIFICATE
  
  AUTHOR: SOUMYA
  
  DATE: 7/14/2021


# INTRODUCTION

This case study is part of my Google Data Analytics Professional Certificate capstone project. The idea is to perform many real-world tasks of a junior data analyst and showcase my data analysis skills.



# ABOUT THE COMPANY

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around
the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly
positioned itself as a tech-driven wellness company for women.

Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their [website](https://bellabeat.com/).

Bellabeat Products:

* Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress,
  menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and
  make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
  
* Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects
  to the Bellabeat app to track activity, sleep, and stress.
  
* Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user
  activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your
  daily wellness.
  
* Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are
  appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your
  hydration levels.



# BUSINESS TASK

Sršen wants us to analyze public data that explores smart device users’ daily habits. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat’s marketing strategy.

* Identify some trends in smart device usage
* Determine how these trends apply to Bellabeat customers
* Determine how these trends could influence Bellabeat’s marketing strategy


# PRIMARY STAKEHOLDERS

* Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
* Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team

# SECONDARY STAKEHOLDER

* Bellabeat marketing analytics team



# DATA SOURCE

* The case study uses a public FitBit tracker data set, which can be found [here](https://www.kaggle.com/arashnic/fitbit)
* The data set was made available through [Mobius](https://www.kaggle.com/arashnic)
* These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk from
  04.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, 
  including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be 
  parsed by export session ID (column A) or timestamp (column B). Variation between output represents use of different 
  types of Fitbit trackers and individual tracking behaviors / preferences.
* There are 18 CSV files in the data set

DATA LIMITATION

* Only 33 participants took part in the survey, so the sample could be biased and does not represent the entire population of FitBit 
  users
* The data was collected in 2016, 5 years before this analysis
* The data was collected by a third party



# DATA PREPARATION


IMPORTING PACKAGES

In [None]:
import numpy as np   # Arrays
import pandas as pd   # Data Structure,Cleansing & Analysis
import seaborn as sns    # Data Visualization
import matplotlib.pyplot as plt   # Data Visualization 
import datetime as dt   # Date and Time

READING FILES

In [None]:
daily_activity = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
sleep_day = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')

Notes: 

* I have imported only 2 files, 'dailyActivity_merged.csv' and 'sleepDay_merged.csv', for my analysis

# DATA PROCESSING


CHECKING DATA FRAMES

In [None]:
# Checking head of the daily_activity data frame

daily_activity.head()

In [None]:
# Checking head of sleep_day data frame

sleep_day.head()

Notes:

* SleepDay column has date and time as cell values

In [None]:
# Checking if daily_activity data frame has any missing values

daily_activity.isnull().sum()

In [None]:
# Checking if sleep_day data frame has any missing values

sleep_day.isnull().sum()

In [None]:
# Checking structure of daily_activity data frame

daily_activity.info()

In [None]:
# Checking structure of sleep_day data frame

sleep_day.info()

Notes:

* daily_activity data frame has 940 rows and 15 columns, whereas sleep_day data frame has only 413 rows and 5 columns
* Both the data frames have an 'Id' column in common
* ActivityDate column in daily_activity data frame and SleepDay column in sleep_day data frame have object data type
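
Missing values are only one data-quality check; exact duplicate rows are another. The sketch below runs on a small made-up frame (`toy_sleep` and its values are hypothetical, not taken from the data set) and shows how `duplicated()` and `drop_duplicates()` could be applied to either data frame.

```python
import pandas as pd

# Hypothetical toy data standing in for sleep_day; the values are made up
toy_sleep = pd.DataFrame({
    "Id": [1, 1, 1, 2],
    "SleepDay": ["4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",
                 "4/13/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"],
    "TotalMinutesAsleep": [327, 384, 384, 412],
})

# Counting rows that exactly repeat an earlier row
n_duplicates = toy_sleep.duplicated().sum()

# Dropping the repeats and keeping the first occurrence of each row
toy_sleep_clean = toy_sleep.drop_duplicates().reset_index(drop=True)

print(n_duplicates, toy_sleep_clean.shape)
```

If the count is greater than zero on the real data, the duplicates could be dropped before any averages are calculated.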

In [None]:
# Converting ActivityDate from 'Object' to 'DateTime' data type
# (the raw values are month/day/year strings such as 4/12/2016)

daily_activity["ActivityDate"] = pd.to_datetime(daily_activity["ActivityDate"], format='%m/%d/%Y')

In [None]:
# Converting SleepDay from 'Object' to 'DateTime' data type and resetting the time part to midnight

sleep_day["SleepDay"] = pd.to_datetime(sleep_day["SleepDay"]).dt.normalize()
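
The date conversion can be made less dependent on format inference by passing an explicit format. The following is a minimal sketch on made-up strings, assuming the raw values are in month/day/year order as the 4/12/2016 to 5/12/2016 collection period suggests; the sample series and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample strings in the assumed month/day/year layout
activity_dates = pd.Series(["4/12/2016", "5/1/2016"])
sleep_days = pd.Series(["4/12/2016 12:00:00 AM", "5/1/2016 12:00:00 AM"])

# An explicit format avoids ambiguity between month-first and day-first dates
parsed_activity = pd.to_datetime(activity_dates, format="%m/%d/%Y")

# normalize() resets the time to midnight while keeping the datetime64 dtype
parsed_sleep = pd.to_datetime(
    sleep_days, format="%m/%d/%Y %I:%M:%S %p"
).dt.normalize()

print(parsed_activity.dt.month.tolist())
print(parsed_sleep.dt.strftime("%Y-%m-%d").tolist())
```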

In [None]:
# Checking data types of daily_activity data frame

daily_activity.dtypes

In [None]:
# Checking data types of sleep_day data frame

sleep_day.dtypes

In [None]:
# Checking count of unique Id's

daily_activity['Id'].nunique() 

In [None]:
# Checking count of unique Id's

sleep_day['Id'].nunique()

Notes: 

* The daily_activity data frame has 33 unique Ids, whereas the sleep_day data frame has only 24 unique Ids

In [None]:
# Checking  value count of each unique ID

daily_activity['Id'].value_counts() 

In [None]:
# Checking  value count of each unique ID

sleep_day['Id'].value_counts()

Notes: 
    
* Some users' data is missing and does not cover the whole period mentioned (4/12/2016 to 5/12/2016)
* This indicates that not every user who participated in the survey used the smart device daily, so the data collected may be 
  biased
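
The coverage check described in these notes can also be expressed as a number of logged days per user. The sketch below uses a made-up frame (`toy_activity` is hypothetical) with the original `Id` and `ActivityDate` column names, and flags users whose records do not span the full collection window.

```python
import pandas as pd

# Hypothetical toy data: three users with different numbers of logged days
toy_activity = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 3],
    "ActivityDate": pd.to_datetime(
        ["2016-04-12", "2016-04-13", "2016-04-14",
         "2016-04-12", "2016-04-13", "2016-04-12"]
    ),
})

# Number of distinct days each user has a record for
days_logged = toy_activity.groupby("Id")["ActivityDate"].nunique()

# Length of the collection window in days, counting both end dates
date_range = toy_activity["ActivityDate"].max() - toy_activity["ActivityDate"].min()
window_days = date_range.days + 1

# Users whose records do not cover the whole window
incomplete_users = days_logged[days_logged < window_days].index.tolist()

print(days_logged.to_dict(), window_days, incomplete_users)
```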

In [None]:
# Renaming the column names of daily_activity data frame

daily_activity.rename(columns = {"Id" : "id", "ActivityDate" : "date", "TotalSteps" : "total_steps","VeryActiveMinutes" :"very_active_mins", "FairlyActiveMinutes" : "fairly_active_mins", "LightlyActiveMinutes" : "lightly_active_mins" ,"SedentaryMinutes" : "sedentary_mins", "Calories" : "total_calories"},inplace = True)

In [None]:
# Renaming column names of sleep_day data frame

sleep_day.rename(columns = {"Id" : "id", "SleepDay" : "date", "TotalSleepRecords" : "total_sleep_records", "TotalMinutesAsleep" : "total_sleep_mins","TotalTimeInBed" : "total_bedtime_mins"},inplace = True)

In [None]:
# Checking if column names have been renamed

daily_activity.columns

In [None]:
sleep_day.columns

In [None]:
# Creating a new total_active_mins column
    
daily_activity["total_active_mins"] = daily_activity[["very_active_mins","fairly_active_mins","lightly_active_mins"]].sum(axis=1)

In [None]:
# Creating a new weekday column

daily_activity['weekday'] = daily_activity['date'].dt.day_name()

In [None]:
# Removing columns from daily_activity data frame which are not going to be used in the analysis

daily_activity.drop(['TotalDistance','VeryActiveDistance','ModeratelyActiveDistance','LightActiveDistance','SedentaryActiveDistance','TrackerDistance','LoggedActivitiesDistance'], axis = 1, inplace = True)

In [None]:
# Removing columns from sleep_day data frame which are not going to be used in the analysis

sleep_day.drop(['total_sleep_records'], axis =1, inplace = True)

In [None]:
# Checking head of daily_activity data frame 

daily_activity.head()

In [None]:
# Checking head of sleep_day data frame 

sleep_day.head()

In [None]:
# Checking number of rows and columns in data frames

daily_activity.shape

In [None]:
sleep_day.shape

# DATA ANALYSIS

In [None]:
# Checking summary statistics of daily_activity data frame

daily_activity[['very_active_mins','fairly_active_mins','lightly_active_mins','sedentary_mins','total_steps','total_calories','total_active_mins']].describe()

In [None]:
# Checking summary statistics of sleep_day data frame

sleep_day[['total_sleep_mins','total_bedtime_mins']].describe()

In [None]:
# Defining a function based on the average values from the daily_activity summary statistics

def new_col(s):
    if s['very_active_mins'] > 21.164894:
        return "very_active"
    elif s['fairly_active_mins'] > 13.564894:
        return "fairly_active"
    elif s['lightly_active_mins'] >  192.812766:
        return "lightly_active"
    else:
        return "sedentary"         

Notes: 

* Created a function that takes a user's average very active, fairly active and lightly active minutes and compares
  each with the corresponding overall average from the summary statistics.
* The function checks the first criterion; if it is not met, it moves on to the second, and so on, 
  and eventually categorizes the user as 'very_active', 'fairly_active', 'lightly_active' or 'sedentary'
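
The thresholds in the function are typed in from the summary statistics, so they would need updating if the data changed. One possible alternative, sketched here on made-up numbers, is to compute the thresholds and pass them into the function. The names `toy_avg`, `classify_user` and `limits` are hypothetical, and the thresholds here are the column means of the toy frame, whereas in this notebook they would come from the daily_activity columns.

```python
import pandas as pd

# Hypothetical per-user averages; the numbers are made up for illustration
toy_avg = pd.DataFrame(
    {
        "very_active_mins": [40.0, 5.0, 2.0, 1.0],
        "fairly_active_mins": [10.0, 30.0, 4.0, 2.0],
        "lightly_active_mins": [150.0, 160.0, 300.0, 100.0],
    },
    index=pd.Index([101, 102, 103, 104], name="id"),
)

# Thresholds are read from the data rather than typed in by hand
thresholds = toy_avg.mean()

def classify_user(row, limits):
    """Return the first activity level whose average exceeds its threshold."""
    if row["very_active_mins"] > limits["very_active_mins"]:
        return "very_active"
    elif row["fairly_active_mins"] > limits["fairly_active_mins"]:
        return "fairly_active"
    elif row["lightly_active_mins"] > limits["lightly_active_mins"]:
        return "lightly_active"
    return "sedentary"

# Extra keyword arguments to apply() are forwarded to the function
toy_avg["user_type"] = toy_avg.apply(classify_user, axis=1, limits=thresholds)

print(toy_avg["user_type"].tolist())
```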
   

In [None]:
# Creating a new data frame with only average values for each users, adding a new column and then renaming column names

new_df = pd.DataFrame(daily_activity.groupby('id')[['very_active_mins','fairly_active_mins','lightly_active_mins','sedentary_mins','total_steps','total_calories','total_active_mins']].mean().round(2))
                      
new_df['user_type'] = new_df.apply(lambda s : new_col(s),axis =1)
                      
new_df.rename(columns = {'very_active_mins' : 'avg_very_active_mins','fairly_active_mins' : 'avg_fairly_active_mins','lightly_active_mins' : 'avg_lightly_active_mins','sedentary_mins' : 'avg_sedentary_mins','total_steps' : 'avg_total_steps','total_calories' : 'avg_total_calories','total_active_mins' : 'avg_total_active_mins'},inplace = True)
                      
new_df.head()

In [None]:
new_df.shape

In [None]:
# Checking count of user types

new_df['user_type'].value_counts()

Notes:

* Created a new data frame called new_df by grouping the daily_activity data frame by id, calculating the average
  of specific columns and then rounding it to 2 decimals
* Then applied the function new_col to create a new column named 'user_type', which assigns each of the 33 users to one of the 4 categories created
* Renamed columns
* The new_df data frame contains only the average values
* It has 33 rows and 8 columns

In [None]:
# Merging sleep_day and daily_activity data frame on two common keys

daily_merged = pd.merge(sleep_day,daily_activity, on = ['id','date'])

In [None]:
daily_merged.head()

In [None]:
daily_merged.shape

Note:

* The daily_merged data frame has 413 rows and 12 columns
* The data has been merged using 'id' and 'date' as common keys
* The data frame has only 24 unique ids
* So any analysis done using the daily_merged data frame considers only 24 users
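
To see how many rows an inner merge leaves out, an outer merge with `indicator=True` can be used as a check. This is a minimal sketch on made-up frames (`toy_sleep`, `toy_activity` and their values are hypothetical), not a step from the analysis itself.

```python
import pandas as pd

# Hypothetical toy frames sharing the 'id' and 'date' keys
toy_sleep = pd.DataFrame({
    "id": [1, 1, 2],
    "date": pd.to_datetime(["2016-04-12", "2016-04-13", "2016-04-12"]),
    "total_sleep_mins": [400, 420, 380],
})
toy_activity = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "date": pd.to_datetime(
        ["2016-04-12", "2016-04-13", "2016-04-12", "2016-04-12"]
    ),
    "total_steps": [9000, 11000, 4000, 7000],
})

# An outer merge with indicator=True labels which frame each row came from
checked = pd.merge(
    toy_sleep, toy_activity, on=["id", "date"], how="outer", indicator=True
)
match_counts = checked["_merge"].value_counts().to_dict()

# The default inner merge keeps only rows present in both frames
inner = pd.merge(toy_sleep, toy_activity, on=["id", "date"])

print(match_counts, inner["id"].nunique())
```

Rows labelled `right_only` here are activity records with no matching sleep record, which is the same reason the merged data covers fewer users than daily_activity.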

# DATA VISUALIZATION

In [None]:
# Setting the theme before creating the figure so that the style applies to this plot as well
sns.set_theme(style = "darkgrid")
fig, ax = plt.subplots(figsize= (10,6))
sns.barplot(data = new_df ,x = 'user_type', y = 'avg_total_steps',order = ['very_active','fairly_active','lightly_active','sedentary'],palette ='coolwarm')
plt.title("User Type Vs Average of Total Steps")

In [None]:
fig, ax = plt.subplots(figsize= (10,6))
sns.set_theme(style = "darkgrid")
sns.barplot(data = new_df ,x = 'user_type', y = 'avg_total_calories',order = ['very_active','fairly_active','lightly_active','sedentary'],palette ='coolwarm')
plt.title("User Type Vs Average Total Calories")

In [None]:
a = sleep_day.groupby('id')['total_sleep_mins'].mean().round(0)
b = sleep_day.groupby('id')['total_bedtime_mins'].mean().round(0)

fig, ax = plt.subplots(figsize= (10,6))
sns.set_theme(style = "darkgrid")
ax = sns.barplot(x = a , y = b,palette ='coolwarm')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40)
plt.title("Average Sleep Time Vs Average Bed Time In Minutes")
ax.set(xlabel='Average Sleep Time (mins)', ylabel='Average Time In Bed (mins)')

In [None]:
fig, ax = plt.subplots(figsize= (10,6))
ax = sns.barplot(data = daily_activity, x = 'weekday', y = 'total_active_mins', palette = 'YlGnBu', order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40)
plt.title("Weekday Vs Total Active Minutes")

In [None]:
fig, ax = plt.subplots(figsize= (10,6))
ax = sns.barplot(data = daily_activity, x = 'weekday', y = 'sedentary_mins', palette = 'YlGnBu', order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40)
plt.title("Weekday Vs Sedentary Minutes")

In [None]:
fig, ax = plt.subplots(figsize= (10,6))
ax = sns.barplot(data = daily_activity, x = 'weekday', y = 'total_steps', palette = 'YlGnBu', order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40)
plt.title("Weekday Vs Total Steps")
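
By default `sns.barplot` draws the mean of the y variable for each category, so the bars in the weekday charts are daily averages rather than sums. The sketch below reproduces that calculation as a table on made-up data (`toy_activity` and its step counts are hypothetical).

```python
import pandas as pd

# Hypothetical toy data; the step counts are made up
toy_activity = pd.DataFrame({
    "date": pd.to_datetime(
        ["2016-04-11", "2016-04-12", "2016-04-18", "2016-04-19", "2016-04-16"]
    ),
    "total_steps": [8000, 10000, 6000, 12000, 9000],
})
toy_activity["weekday"] = toy_activity["date"].dt.day_name()

week_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]

# Mean steps per weekday, listed in calendar order
weekday_means = (
    toy_activity.groupby("weekday")["total_steps"]
    .mean()
    .reindex(week_order)
)

# Weekdays with no records are NaN after reindexing, so they are dropped here
busiest_day = weekday_means.dropna().idxmax()

print(weekday_means.dropna().to_dict(), busiest_day)
```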

In [None]:
fig, ax = plt.subplots(figsize= (10,6))
sns.scatterplot(data = daily_merged, x = 'total_sleep_mins', y = 'total_steps')
plt.title("Total Sleep Minutes Vs Total Steps")

In [None]:
fig, ax = plt.subplots(figsize= (10,6))
sns.scatterplot(data = daily_merged, x = 'sedentary_mins', y = 'total_sleep_mins')
plt.title("Sedentary Minutes Vs Total Sleep Minutes")

In [None]:
fig, ax = plt.subplots(figsize= (10,6))
sns.set_theme(style = "darkgrid")
sns.scatterplot(data = daily_merged, x = 'very_active_mins', y = 'total_sleep_mins')
plt.title("Very Active Minutes Vs Total Sleep Minutes")

# FINDINGS

* Here we assume that the smart devices used to collect user data were either Fitbit smart watches or wristbands
* Out of the 33 users, 4 used the smart device for 20 days or fewer, whereas the rest of the users used it for 
  longer
* Average total steps taken by very active users are higher than for the rest of the user types
* Most steps are taken on Tuesday and Saturday
* The average of sedentary minutes is higher than the average of active minutes
* Sedentary minutes are almost the same throughout the week but a little higher on Monday
* Users are most active on Saturday
* Most users with an average sleep time between 400 and 500 mins spend between 400 and 500 mins in bed as well
* Sleep does not exceed 600 minutes even when 15000 or more steps were taken in a day
* Once sedentary minutes cross 600 mins, sleep mins start dropping significantly (between 100 and 300 mins)
* With very active mins around 50, sleep mins range between 400 and 500
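
The relationships read off the scatter plots could also be quantified with a correlation coefficient. The following is a minimal sketch on made-up numbers (`toy_merged` is a hypothetical stand-in for daily_merged, and its values are not results from the data set).

```python
import pandas as pd

# Hypothetical toy data standing in for daily_merged; the values are made up
toy_merged = pd.DataFrame({
    "sedentary_mins": [500, 600, 700, 800, 900],
    "total_sleep_mins": [480, 450, 400, 330, 300],
    "total_steps": [9000, 4000, 12000, 7000, 10000],
})

# Pearson correlation between every pair of columns (values range from -1 to 1)
corr = toy_merged.corr()

# A value close to -1 means sleep falls as sedentary time rises
sedentary_vs_sleep = corr.loc["sedentary_mins", "total_sleep_mins"]

print(corr.round(2))
```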

# RECOMMENDATIONS

* Very active users take more steps and burn more calories
* Bellabeat can use these trends to show how smart devices help users track their daily activities, which
  can motivate them to stay active and healthy and also improve their sleep
 