# Case Study: Bellabeat
## Marharyta Datsik
10.08.2021

## Google Data Analytics Capstone - Case Study
### How Can a Wellness Company Play it Smart?
#### Introduction

This is a case study for the Google Data Analytics Certification. The task is to improve the marketing strategy for the smart device products of Bellabeat, a health-focused manufacturer of products for women.

The main questions are:
* What are the trends in smart device usage?
* How could these trends apply to Bellabeat customers?
* What features should Bellabeat products consider adding to entice more customers?

For this case study I will use Python, even though it wasn't the language taught in Google's Data Analytics course.
So if you find any pieces of code that you would like to improve, please leave a comment.

### Uploading Data
We will use public data that explores smart device users' daily habits: [www.kaggle.com/arashnic/fitbit](https://www.kaggle.com/arashnic/fitbit)

This data set contains personal fitness tracker data from thirty Fitbit users.

There are a number of different csv files, covering daily activity, calories and steps; hourly calories, intensities and steps; and heart rate, sleep data and weight logs.


### Data preparation
We’ll create our data frames. I’ll be working with the following objects:
* daily_activity
* daily_calories
* sleep_day
* weight_info
* daily_intensities

We’ll follow typical naming conventions based on the csv file names. 
Let's load the libraries and csv files to have a first look at them.

#### Data Cleaning

In [None]:
#load libraries
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import plotnine
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Now we load the files that are relevant for us. Then we will take a first look at the daily_activity dataframe.

In [None]:
daily_activity = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
daily_calories = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv')
weight_info = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv')
sleep_day = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
daily_intensities = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv')

#### Exploring Dataframes
Let’s take a beat to investigate the tables. For each one we’ll look at the first five rows using the head() function.

##### Daily activity

In [None]:
daily_activity.head()

In [None]:
daily_activity.info()

##### Daily calories

In [None]:
daily_calories.head()

In [None]:
daily_calories.info()

##### Weight info

In [None]:
weight_info.head()

##### Sleep day

In [None]:
sleep_day.head()

##### Daily intensities

In [None]:
daily_intensities.head()

In [None]:
daily_intensities.info()

#### At a Glance

All 5 data frames share the same ‘Id’ field, so we can merge the datasets if needed.

It seems the daily_activity table already contains the calories and intensities data, so we should confirm that the values actually match for any given ‘Id’ and day. If they do, we can remove those two datasets from the analysis.
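One way to confirm the overlap is to join the two tables on user and day and compare the calorie columns. The sketch below uses small made-up frames in place of the real files. The date column names (`ActivityDate` in the activity table, `ActivityDay` in the calories table) are assumptions about the csv layout and should be checked against the head() output above.

```python
import pandas as pd

# Toy stand-ins for daily_activity and daily_calories (values are made up).
activity = pd.DataFrame({
    "Id": [1, 1, 2],
    "ActivityDate": ["4/12/2016", "4/13/2016", "4/12/2016"],  # assumed column name
    "Calories": [1985, 1797, 2100],
})
calories = pd.DataFrame({
    "Id": [1, 1, 2],
    "ActivityDay": ["4/12/2016", "4/13/2016", "4/12/2016"],  # assumed column name
    "Calories": [1985, 1797, 2100],
})

# Align both tables on user and day; suffixes keep the two Calories columns apart.
check = activity.merge(
    calories,
    left_on=["Id", "ActivityDate"],
    right_on=["Id", "ActivityDay"],
    how="outer",
    suffixes=("_activity", "_calories"),
)

# True only if every row found a partner and the values agree.
# A row without a partner has NaN on one side, and NaN == x is False.
all_match = bool((check["Calories_activity"] == check["Calories_calories"]).all())
print(all_match)
```

The same pattern should work for daily_intensities by comparing each minutes and distance column in turn.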

This leaves us with:
* daily_activity
* sleep_day
* weight_info

#### The Analysis

In [None]:
#check number of unique users
daily_activity['Id'].nunique()

In [None]:
sleep_day['Id'].nunique()

In [None]:
weight_info['Id'].nunique()

How many observations are there in each dataframe?

In [None]:
# number of rows in the df
daily_activity.shape[0]

In [None]:
sleep_day.shape[0]

In [None]:
weight_info.shape[0]

The number of unique users differs across the daily_activity, sleep_day and weight_info data frames.
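To make the difference easier to read, the user counts can be expressed as a share of all tracked users. This is a minimal sketch on made-up frames; the names `activity`, `sleep` and `weight` are placeholders for daily_activity, sleep_day and weight_info.

```python
import pandas as pd

# Toy stand-ins for the three data frames (Ids are made up).
activity = pd.DataFrame({"Id": [1, 1, 2, 3, 4]})
sleep = pd.DataFrame({"Id": [1, 2, 2, 3]})
weight = pd.DataFrame({"Id": [1]})

# Every user appears in the activity table, so use it as the denominator.
total_users = activity["Id"].nunique()

# Share of all tracked users that appear in each of the other tables.
coverage = {
    name: df["Id"].nunique() / total_users
    for name, df in {"sleep": sleep, "weight": weight}.items()
}
for name, share in coverage.items():
    print(f"{name}: {share:.0%}")
```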

What are some quick summary statistics we’d want to know about each data frame?

For the daily activity dataframe:

In [None]:
daily_activity.describe()

The summary statistics for the daily_activity df tell us:
* The mean of LoggedActivitiesDistance (0.108171) is very small compared with TrackerDistance (5.475351). This suggests that users prefer not to log their activity manually and rely on automatic logging by the tracker.
* TotalSteps, TrackerDistance, SedentaryMinutes and Calories have relatively high standard deviations, which points to varied behavior among users.
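Because these columns are on very different scales, one hedged way to compare their spread is the coefficient of variation (standard deviation divided by mean). The sketch below runs on made-up numbers, not on the real data.

```python
import pandas as pd

# Toy stand-in for daily_activity (values are made up).
activity = pd.DataFrame({
    "TotalSteps": [0, 5000, 10000, 15000],
    "Calories": [1800, 2000, 2200, 2400],
})

summary = activity.describe()

# Coefficient of variation: spread relative to the typical value of each column.
cv = summary.loc["std"] / summary.loc["mean"]
print(cv.sort_values(ascending=False))
```

A larger value means the column varies more relative to its own mean, which makes columns with different units comparable.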

For the sleep dataframe:

In [None]:
sleep_day.describe()

* There were only a few sleep records in total. Users probably don't like to sleep with their trackers on.

For the weight dataframe:

In [None]:
weight_info.describe()

* The Fat column contains only 2 values, so the other 65 values are missing.
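Missing values can also be counted directly instead of being read off describe(). A minimal sketch on a made-up frame (`weight` stands in for weight_info; the `WeightKg` column name is an assumption):

```python
import numpy as np
import pandas as pd

# Toy stand-in for weight_info (values are made up).
weight = pd.DataFrame({
    "Id": [1, 1, 2, 3],
    "WeightKg": [52.6, 52.4, 70.1, 61.0],  # assumed column name
    "Fat": [22.0, np.nan, np.nan, np.nan],
})

# Count missing entries per column and express them as a share of all rows.
missing = weight.isna().sum()
missing_share = weight.isna().mean()
print(pd.DataFrame({"missing": missing, "share": missing_share}))
```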

#### Plotting a few explorations

What’s the relationship between steps taken in a day and sedentary minutes? It seems that there is a negative relationship between total steps taken and the minutes someone has remained sedentary. We also see that calories generally trend positively with total steps taken.

The plot below shows that sedentary time is not necessarily related to calories burned.

In [None]:
from plotnine import ggplot, aes, labs, geom_point

(
    ggplot(daily_activity)
    + aes(x='TotalSteps', y='SedentaryMinutes', color='Calories')
    + geom_point()
    + labs(
        x="Total Steps",
        y="Sedentary Minutes",
        color="Calories",
        title="Relationship between steps taken in a day and sedentary minutes",
    )
)


Let's make some more plots to dig deeper.

In [None]:
sns.lmplot(x='TotalSteps',y='Calories',data=daily_activity,height=4,aspect=3)
plt.title('Relationship between total steps and calories burned');

It's pretty clear that people who took the most steps tended to burn the most calories; however, there is a large spread, with points clustered towards the lower amounts. So we will take a closer look at the residuals and the estimated values.

In [None]:
#import necessary libraries 
import statsmodels.api as sm
from statsmodels.formula.api import ols

#fit simple linear regression model
model = ols('Calories ~ TotalSteps', data=daily_activity).fit()

#define figure size
fig = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model, 'TotalSteps', fig=fig)

This suggests that in order to burn calories you don’t have to do high-intensity workouts; you can just get out there and start walking!

This will be useful for customers, because they only need to start walking to burn calories. They could receive a daily reminder of how many steps they need to take in order to burn their target amount of calories.
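As a rough illustration of such a reminder, a fitted line Calories = intercept + slope × TotalSteps can be inverted to estimate the steps needed for a calorie goal. The sketch below uses made-up data with invented coefficients (1700 and 0.08) and `numpy.polyfit` in place of the statsmodels model; `steps_for_target` is a hypothetical helper. In the notebook, `model.params` would supply the fitted intercept and slope instead.

```python
import numpy as np

# Made-up observations lying exactly on Calories = 1700 + 0.08 * TotalSteps.
steps = np.array([2000.0, 5000.0, 8000.0, 12000.0])
calories = 1700.0 + 0.08 * steps

# Least-squares line; polyfit returns coefficients from highest degree down.
slope, intercept = np.polyfit(steps, calories, deg=1)


def steps_for_target(target_calories: float) -> float:
    """Invert the fitted line to get the step count for a calorie goal."""
    return (target_calories - intercept) / slope


print(round(steps_for_target(2500)))
```

Given the large spread around the regression line, such an estimate is only a ballpark figure for an individual user.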

What's the relationship between minutes asleep and time in bed?

In [None]:
sns.lmplot(x='TotalMinutesAsleep',y='TotalTimeInBed',data=sleep_day,height=4,aspect=3)
plt.title('Relationship between minutes asleep and time in bed');

As we can see, there are some outliers here: some records show a lot of time spent in bed with comparatively little actual sleep.
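Those outliers can be listed explicitly by looking at the gap between time in bed and time asleep. The sketch below uses made-up rows and the common 1.5 × IQR rule as the cutoff, which is an assumption and not something the plot itself defines; `MinutesAwakeInBed` is a helper column introduced for illustration.

```python
import pandas as pd

# Toy stand-in for sleep_day (values are made up).
sleep = pd.DataFrame({
    "Id": [1, 1, 2, 3],
    "TotalMinutesAsleep": [420, 380, 300, 450],
    "TotalTimeInBed": [440, 400, 520, 470],
})

# Minutes spent in bed without sleeping.
sleep["MinutesAwakeInBed"] = sleep["TotalTimeInBed"] - sleep["TotalMinutesAsleep"]

# Flag rows above the upper fence of the 1.5 * IQR rule.
q1, q3 = sleep["MinutesAwakeInBed"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = sleep[sleep["MinutesAwakeInBed"] > upper_fence]
print(outliers)
```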

#### Merging these two datasets together

In [None]:
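# average each user's daily values; numeric_only skips the text date columns
df1 = daily_activity.groupby(['Id']).mean(numeric_only=True).reset_index()
df2 = sleep_day.groupby(['Id']).mean(numeric_only=True).reset_index()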

merged_df = pd.merge(df2, df1, on="Id")
merged_df.head(5)

In [None]:
# number of unique Id's
merged_df['Id'].nunique()

The sleep_day dataframe contains only 24 users. Because the merge keeps only Ids present in both tables, the merged dataframe has the same number of unique Ids.
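To see which users drop out of the merge, `pd.merge` can be run with `how="outer"` and `indicator=True`, which adds a `_merge` column naming the source of each row. A minimal sketch on made-up per-user averages (`activity_mean` and `sleep_mean` stand in for df1 and df2):

```python
import pandas as pd

# Toy stand-ins for the per-user averages df1 and df2 (values are made up).
activity_mean = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "TotalSteps": [7000.0, 9500.0, 4000.0, 12000.0],
})
sleep_mean = pd.DataFrame({
    "Id": [1, 3],
    "TotalMinutesAsleep": [410.0, 365.0],
})

# Default how="inner": only users present in both tables survive.
inner = pd.merge(sleep_mean, activity_mean, on="Id")

# Outer merge with an indicator column to audit who is missing where.
audit = pd.merge(sleep_mean, activity_mean, on="Id", how="outer", indicator=True)
no_sleep_ids = audit.loc[audit["_merge"] == "right_only", "Id"].tolist()
print(inner["Id"].nunique(), sorted(no_sleep_ids))
```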

#### Sedentary Time with Time in Bed

Let’s run a correlation to see what the correlation coefficient would be for a linear regression:

In [None]:
from scipy.stats import pearsonr

#fit simple linear regression model
model2 = ols('SedentaryMinutes ~ TotalTimeInBed', data = merged_df).fit()

#find Pearson's correlation
corr, _ = pearsonr(merged_df['TotalTimeInBed'], merged_df['SedentaryMinutes'])
print(corr)



The Pearson correlation summarizes the strength of the linear relationship between two data samples. The correlation coefficient here is negative, so in this sample more time in bed does not go along with more sedentary minutes. With per-user averages from only 24 users, we cannot conclude that total time in bed explains sedentary time.
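For reference, the coefficient is the covariance of the two samples divided by the product of their standard deviations, and its square is the share of variance a simple linear fit would explain. The sketch below computes it by hand on made-up values and compares it with pandas' `Series.corr`, which uses Pearson by default. The second value returned by `pearsonr`, discarded as `_` above, is the p-value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for merged_df (values are made up).
df = pd.DataFrame({
    "TotalTimeInBed": [400.0, 450.0, 480.0, 520.0, 390.0],
    "SedentaryMinutes": [900.0, 820.0, 800.0, 700.0, 950.0],
})
x, y = df["TotalTimeInBed"], df["SedentaryMinutes"]

# r = sum((x - mean_x) * (y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same Pearson coefficient.
r_pandas = x.corr(y)

# Share of variance a simple linear fit would explain.
r_squared = r_manual ** 2
print(r_manual, r_pandas, r_squared)
```

The sign gives the direction of the relationship and the magnitude its strength, so both should be reported before drawing conclusions.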

### Summary
We examined this dataset of Fitbit users in detail to get an idea of which features are being used and how we can market our products.
* Automated tracking of activities instead of manual input. Most users logged their steps and calories, probably because this data was gathered automatically by Fitbit. Only 72% of them logged their sleep, and only a few times; it is probably uncomfortable to sleep with a tracker on the wrist. And only 24% of all users logged their weight.
* For the marketing strategy, the data shows how effectively calories can be burned just by taking steps. With steps gathered automatically, users will be more motivated to reach their calorie-burning goals.
* Phone reminders and goal-setting could help increase user engagement.
* Motivate people to wear the tracker during sleep. For accurate predictions of calories burned, users also have to share their weight, sex, and height. Only with this data can Bellabeat make more reliable health analyses for women.

Thank you for reading my first-ever case study. P.S. I have been wearing a fitness tracker for 3 years and can really say that it is very helpful. It could be even more helpful if more functions were tracked automatically.