<H1 align="center">  Google Data Analytics Certificate: Capstone Project </H1>

---

## How Can a Wellness Technology Company Play It Smart?

For this case study, I'm playing the part of a junior data analyst for [Bellabeat](https://bellabeat.com/).

Bellabeat is a high-tech manufacturer of health-focused products for women.

## What's expected from me

- Analyze data generated by the smart devices.
- Generate insights on how our customers use our products so our stakeholders can make data-driven decisions.

## About the company
 
- Founded by Urška Sršen and Sando Mur. 
- High-tech company that manufactures health-focused smart products.
- Sršen started this company around the idea of informing and inspiring women worldwide about their health and fitness.
- Rapid growth since opening its doors in 2013. 
- Positioned as a tech-driven wellness company for women.

The case study tells us that Sršen is aware that an analysis of Bellabeat's available consumer data would reveal more growth opportunities. Therefore, she asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data to understand how people are already using their smart devices. 

High-level recommendations for how these trends can inform Bellabeat marketing strategy should be provided.

# Data Analysis Phases

The insights will be presented following the data analysis process steps taught in the Google Data Analytics Professional Certificate:

* **Ask**: What problem are you trying to solve? How can your insights drive business decisions?

* **Prepare**: Where is your data stored? How is the data organized? Is it in long or wide format? Are there issues with bias or credibility in this data? Does your data ROCCC? How are you addressing licensing, privacy, security, and accessibility? How did you verify the data's integrity? How does it help you answer your question? Are there any problems with the data?
 
* **Process**: What tools are you choosing and why? Have you ensured your data's integrity? What steps have you taken to ensure that your data is clean? How can you verify that your data is clean and ready to analyze? Have you documented your cleaning process so you can review and share those results?

* **Analyze**: How should you organize your data to perform analysis on it? Has your data been adequately formatted? What surprises did you discover in the data? What trends or relationships did you find in the data? How will these insights help answer your business questions?

* **Share**: Were you able to answer the business questions? What story does your data tell? How do your findings relate to your original question? Who is your audience? What is the best way to communicate with them? Can data visualization help you share your findings? Is your presentation accessible to your audience?

* **Act**: What is your final conclusion based on your analysis? How could your team and business apply your insights? What next steps would you or your stakeholders take based on your findings? Is there additional data you could use to expand on your findings?


# **Ask** ❓
---

## The business task 📊

Analyze smart device usage data to understand how people are already using their smart devices. Then, provide high-level recommendations for how these trends can inform Bellabeat marketing strategy.

This analysis is meant to answer the following question:

**How can current user trends guide marketing strategy?**

## The stakeholders 🤵🏻‍♀️🤵🏻‍♂️

* **Urška Sršen**: Bellabeat's cofounder and Chief Creative Officer
* **Sando Mur**: Mathematician and Bellabeat's cofounder; a key member of the Bellabeat executive team
* **Bellabeat marketing analytics team**: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat's marketing strategy.

# **Prepare/Process** ⚙️:
---

## Getting the data 👨🏻‍💻

How can we answer our main business question? I'm thinking of using only the data from **[FitBit Fitness Tracker Data](https://www.kaggle.com/arashnic/fitbit)** for now.

This Kaggle data set contains personal fitness tracker data from thirty-three Fitbit users. The users consented to submit their personal tracker data, including information about their physical activity, steps, daily activity, heart rate, and sleep.


Should we find it necessary later, I'll consider using other datasets.

Inside the `fitbit` folder, a folder named `Fitabase Data 4.12.16-5.12.16` stores 18 `csv` files with tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

To make the files easier to read, let's create a `path` variable that stores the folder containing all the `csv` files.

## Importing libraries 📚

The first thing I like to do is to import the libraries that I'll need for the analysis.

In [1]:
import pandas as pd # for data wrangling, basically.
import numpy as np # for aggregate functions like mean, median, etc.
import matplotlib.pyplot as plt # for data visualization
import datetime # to manipulate datetime data.
import seaborn as sns # for data visualization, too.
import sqlite3 as sql # to use SQL syntax in a Python notebook (makes some things easier).
import os

## Checking for integrity ✅

Time for an integrity check. 

There should be 18 file paths in total.

In [1]:
path = '../input/fitbit/Fitabase Data 4.12.16-5.12.16'

# Get the full path of all the csv files.
full_path_list = [os.path.join(path, f) for f in os.listdir(path)
                  if os.path.isfile(os.path.join(path, f))]

In [1]:
len(full_path_list)

Let's check all file paths containing data from Fitbit users:

In [1]:
full_path_list
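As a hedged sketch, the count could also be enforced in code rather than eyeballed. The snippet below filters on the `.csv` extension and asserts on the number of files found. The temporary folder, the file names (`a.csv`, `b.csv`, `notes.txt`) and the helper `list_csv_files` are all made up for illustration; in the notebook the folder would be `path` and the expected count 18.

```python
import tempfile
from pathlib import Path


def list_csv_files(folder):
    """Return the sorted paths of all .csv files directly inside `folder`."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() == '.csv')


# Hypothetical stand-in folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ['a.csv', 'b.csv', 'notes.txt']:
        (Path(demo_dir) / name).write_text('Id\n1\n')
    csv_names = [p.name for p in list_csv_files(demo_dir)]

# Fail loudly if the folder does not hold the expected number of csv files.
assert len(csv_names) == 2, 'unexpected number of csv files'
```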

## Check for Redundancy 🤔

Five (5) of our tables contain `daily` or `Day` in their names. Let's inspect them for redundant information.

Let's start with `dailyActivity_merged`.

In [1]:
dailyActivity_df = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')

dailyActivity_df.head()

We do the same for the `dailyIntensities_merged` table.

In [1]:
dailyIntensities_df = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv')

dailyIntensities_df.head()

Ok. Both tables share several columns. Let's see if they have the same number of rows.

In [1]:
print(f'dailyActivity_df length: {len(dailyActivity_df)}')
print(f'dailyIntensities_df length: {len(dailyIntensities_df)}')

The table `dailyIntensities_merged` seems to contain data already present in `dailyActivity_merged`. 

### No bueno! 🙅‍♂️

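Let's concatenate both tables and drop every row that is duplicated on the columns related to distance and activity (in minutes). With `keep=False`, all copies of a duplicated row are removed, not just the extras. `dailyIntensities_df` is concatenated twice so that each of its rows always has a duplicate and gets dropped; only rows of `dailyActivity_df` with no match in `dailyIntensities_df` can survive.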

**Redundancy (final check)**:

If these columns are redundant between both tables, the resulting data frame should **return empty**.

In [1]:
pd.concat([dailyActivity_df, pd.concat([dailyIntensities_df]*2)]).drop_duplicates(['VeryActiveDistance', 
                  'ModeratelyActiveDistance', 
                  'LightActiveDistance', 
                  'SedentaryActiveDistance', 
                  'VeryActiveMinutes',
                  'FairlyActiveMinutes', 
                  'LightlyActiveMinutes', 
                  'SedentaryMinutes'], keep=False)

### It worked! Awesome 🙌

Let's inspect the `dailySteps_merged` table next.

In [1]:
dailySteps_df = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv')

In [1]:
dailySteps_df.head()

Ok. We're having a similar situation between `dailyActivity_merged` and `dailySteps_merged`. 

Both have columns related to total steps taken.

In `dailyActivity_merged` there's `TotalSteps`.

In `dailySteps_merged` there's `StepTotal`. 

Let's concatenate these tables like we did the first time.

In [1]:
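# align the column name first; otherwise every dailyActivity_df row has NaN in `StepTotal` and is dropped as a duplicate
pd.concat([dailyActivity_df, pd.concat([dailySteps_df.rename(columns={'StepTotal': 'TotalSteps'})]*2)]).drop_duplicates(['TotalSteps'], keep=False)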

### Good. 👌

Let's repeat the same process for the `Calories` column in `dailyCalories_merged`. Let's read the table first:

In [1]:
dailyCalories_df = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv')

In [1]:
pd.concat([dailyActivity_df, pd.concat([dailyCalories_df]*2)]).drop_duplicates(['Calories'], keep=False)

### Wooh! 💪

We now know for sure that we don't need the `dailyIntensities_merged`, `dailySteps_merged` and `dailyCalories_merged` tables, since all of the data contained within them is already in the `dailyActivity_merged` table.
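The concat-and-drop checks above compare column values only, so a stricter cross-check may be worth having. The sketch below is one possible alternative, not what this notebook ran: it left-joins the smaller table onto the larger one on user, date and value, then looks for rows flagged `left_only`. The two tiny frames are made-up stand-ins for `dailyActivity_df` and `dailyCalories_df`, and `ActivityDay` is an assumed name for the date column in the smaller table.

```python
import pandas as pd

# Made-up stand-ins for dailyActivity_df and dailyCalories_df.
activity = pd.DataFrame({'Id': [1, 1, 2],
                         'ActivityDate': ['4/12/2016', '4/13/2016', '4/12/2016'],
                         'Calories': [1985, 1797, 2100]})
calories = pd.DataFrame({'Id': [1, 1, 2],
                         'ActivityDay': ['4/12/2016', '4/13/2016', '4/12/2016'],
                         'Calories': [1985, 1797, 2100]})

# Left-join the smaller table onto the larger one, matching on user, date and value.
check = calories.merge(activity,
                       left_on=['Id', 'ActivityDay', 'Calories'],
                       right_on=['Id', 'ActivityDate', 'Calories'],
                       how='left', indicator=True)

# Rows flagged 'left_only' exist in `calories` but have no match in `activity`.
unmatched = check[check['_merge'] == 'left_only']
print(len(unmatched))
```

An empty `unmatched` frame means every row of the smaller table is already present in the larger one.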

## Data Cleaning 🧹
---

### Renaming columns

Let's rename the columns of our `dailyActivity` table to `snake_case`. We can list all the column names first for easy access.

In [1]:
dailyActivity_df.columns

In [1]:
#renaming the original dataframe
dailyActivity_df.rename(columns={'Id': 'id',
                                 'ActivityDate':'activity_date',
                                 'TotalSteps': 'total_steps', 
                                 'TotalDistance': 'total_dist', 
                                 'LoggedActivitiesDistance': 'logged_dist', 
                                 'VeryActiveDistance': 'very_active_dist', 
                                 'ModeratelyActiveDistance': 'moderately_active_dist', 
                                 'LightActiveDistance': 'lightly_active_dist', 
                                 'SedentaryActiveDistance': 'sedentary_active_dist', 
                                 'VeryActiveMinutes': 'very_active_mins', 
                                 'FairlyActiveMinutes': 'fairly_active_mins', 
                                 'LightlyActiveMinutes': 'lightly_active_mins', 
                                 'SedentaryMinutes': 'sedentary_mins', 
                                 'Calories': 'calories', }, inplace=True)

### Date formatting 🗓

Let's format the date columns in the standard **YYYY-MM-DD**, just because.

(easier on the eyes, imo).

Starting with `dailyActivity_merged`:

In [1]:
dailyActivity_df['activity_date'] = pd.to_datetime(dailyActivity_df['activity_date'])

# preview the dates as YYYY-MM-DD strings; the column itself stays datetime64 for later use
dailyActivity_df['activity_date'].dt.strftime('%Y-%m-%d')

Let's check our final results!

In [1]:
dailyActivity_df.head()

## Sleeping Data 💤

We have an interesting table with sleep data which might come in handy for our analysis. Let's do the same process we did with the `dailyActivity_merged` table.

In [1]:
sleep_df = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')

In [1]:
sleep_df.head()

In [1]:
sleep_df.columns

In [1]:
#renaming the original dataframe
sleep_df.rename(columns={'Id': 'id', 
                         'SleepDay': 'sleep_day', 
                         'TotalSleepRecords': 'total_sleep_records',
                         'TotalMinutesAsleep': 'total_mins_asleep',
                         'TotalTimeInBed': 'total_time_in_bed'}, inplace=True)

In [1]:
sleep_df['sleep_day'] = pd.to_datetime(sleep_df['sleep_day'])

# preview the dates as YYYY-MM-DD strings; the column itself stays datetime64 for later use
sleep_df['sleep_day'].dt.strftime('%Y-%m-%d')

In [1]:
sleep_df['day_of_week'] = sleep_df.sleep_day.apply(lambda x: x.strftime('%w'))
sleep_df.head()

The date now is in the proper `YYYY-MM-DD` format, and we added a day of the week column.

Lastly, let's go the extra mile and extract the day, month, year and day of the week (*day_of_week*) from the formatted dates:

In [1]:
full_dailyActivity_df = dailyActivity_df.copy()

full_dailyActivity_df['day'] = full_dailyActivity_df.activity_date.apply(lambda x: x.strftime('%d'))
full_dailyActivity_df['month'] = full_dailyActivity_df.activity_date.apply(lambda x: x.strftime('%m'))
full_dailyActivity_df['year'] = full_dailyActivity_df.activity_date.apply(lambda x: x.strftime('%Y'))
full_dailyActivity_df['day_of_week'] = full_dailyActivity_df.activity_date.apply(lambda x: x.strftime('%w'))

We saved the result in a larger dataframe named `full_dailyActivity_df`. Here are the first five rows:

In [1]:
full_dailyActivity_df.head()
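The same date parts can also be pulled with the pandas `.dt` accessor, which returns integers instead of strings. This is only a sketch on a made-up two-row frame (`demo` is not part of the notebook). One thing to watch: `dt.dayofweek` counts from Monday = 0, whereas `strftime('%w')` counts from Sunday = 0, so the two are not interchangeable.

```python
import pandas as pd

# Made-up dates: 2016-04-12 is a Tuesday, 2016-04-17 is a Sunday.
demo = pd.DataFrame({'activity_date': pd.to_datetime(['2016-04-12', '2016-04-17'])})

demo['day'] = demo['activity_date'].dt.day
demo['month'] = demo['activity_date'].dt.month
demo['year'] = demo['activity_date'].dt.year
demo['day_of_week_w'] = demo['activity_date'].dt.strftime('%w')   # Sunday = '0'
demo['day_of_week_iso'] = demo['activity_date'].dt.dayofweek      # Monday = 0
```

The rest of this notebook keeps the `'%w'` string convention, which is what `check_weekend` relies on.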

# **Analyze** 📈
---
### Data on Average Calories, Steps and Distance by Id and by day of the week 🏃‍♂️

In [1]:
# make a copy of original df
dailyActivity_df_agg = dailyActivity_df.copy()

# create a new day_of_week column
dailyActivity_df_agg['day_of_week'] = dailyActivity_df_agg.activity_date.apply(lambda x: x.strftime('%w'))

In [1]:
# calculate the mean by id and day of week for the columns we need, rounded to two decimal places
activity_dist = (dailyActivity_df_agg
                 .groupby(['id', 'day_of_week'])[['calories', 'total_steps', 'total_dist']]
                 .mean()
                 .round(decimals=2))

# reset df index so id and day_of_week become regular columns
activity_dist = activity_dist.reset_index()

In [1]:
activity_dist.head()

### Categorize days: does this day belong to the weekend or is it a weekday? 📅

Let's create a column that checks if a certain date fell on the weekend.

We'll first create a custom function that will tell us just that.

If the day of the week is 0 (Sunday, the first day of the week) or 6 (Saturday, the last day of the week), we assign `yes` to that row; otherwise we assign `no`.

In [1]:
def check_weekend(row):
    """custom function that tells us if a certain date happened on the weekend or not"""

    if row['day_of_week'] == '0':
        val = 'yes'
    elif row['day_of_week'] == '6':
        val = 'yes'
    else:
        val = 'no'
    return val

In [1]:
# make a copy of the original dailyActivity df
activity_weekend = dailyActivity_df.copy()

# add new column extracting only day of the week from the ActivityDate column
activity_weekend['day_of_week'] = activity_weekend.activity_date.apply(lambda x: x.strftime('%w'))

# apply the check_weekend function
activity_weekend['weekend'] = activity_weekend.apply(check_weekend, axis=1)

# drop the day of week column as it's no longer needed
activity_weekend = activity_weekend.drop('day_of_week', axis=1)

# show first five rows of df to check results
activity_weekend.head()
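For larger tables, a row-wise `apply` can be swapped for a vectorized check. The sketch below shows one way to do that on a made-up frame (`demo` and its dates are illustrative only); it should give the same `yes`/`no` labels as `check_weekend`.

```python
import numpy as np
import pandas as pd

# Made-up dates: Friday, Saturday, Sunday, Monday.
demo = pd.DataFrame({'activity_date': pd.to_datetime(
    ['2016-04-15', '2016-04-16', '2016-04-17', '2016-04-18'])})

# '%w' yields '0' for Sunday and '6' for Saturday.
is_weekend = demo['activity_date'].dt.strftime('%w').isin(['0', '6'])
demo['weekend'] = np.where(is_weekend, 'yes', 'no')
```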

### Awesome! 👍

## Joining activity data with sleep data

Let's join the activity table with our sleep data:

In [1]:
# left join the dailyActivity and sleep df (keeps activity days without a sleep record)
activity_sleep_df = pd.merge(dailyActivity_df, sleep_df, left_on=  ['id', 'activity_date'],
                   right_on= ['id','sleep_day'], 
                   how = 'left')

# keep only relevant columns
activity_sleep_df = activity_sleep_df[['activity_date', 
                                       'sedentary_mins', 
                                       'lightly_active_mins', 
                                       'total_mins_asleep']
                                     ]

# show first five rows of updated df
activity_sleep_df.head()
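Because the merge uses `how='left'`, activity days without a sleep record are kept and get `NaN` in the sleep columns. One quick way to see how many rows actually matched is the `indicator` flag of `pd.merge`. The frames below are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

activity = pd.DataFrame({'id': [1, 1, 2],
                         'activity_date': pd.to_datetime(['2016-04-12', '2016-04-13', '2016-04-12']),
                         'sedentary_mins': [728, 776, 1218]})
sleep = pd.DataFrame({'id': [1],
                      'sleep_day': pd.to_datetime(['2016-04-12']),
                      'total_mins_asleep': [327]})

joined = pd.merge(activity, sleep,
                  left_on=['id', 'activity_date'],
                  right_on=['id', 'sleep_day'],
                  how='left', indicator=True)

# 'both' = activity day with a sleep record, 'left_only' = no sleep record.
match_counts = joined['_merge'].value_counts()

# Dropping the unmatched rows leaves what an inner join would have kept.
matched_only = joined.dropna(subset=['total_mins_asleep'])
```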

## **Initial exploratory visualizations** 📊

### Formatting the style of our seaborn viz:

Let's get the formatting of our seaborn graphs out of the way. We want to make sure that all plots are easy to read, and that we can include multiple graphs at a time.

In [1]:
sns.set(rc={'figure.figsize': (10, 6)})
sns.set_style('whitegrid')
sns.set_palette('bright')
sns.set_context("paper")

### **How do users spend their activity time?** ⏰

In our `dailyActivity_df` there are four measures of how users spend their time:
1. `very_active_mins` --> `very_active_dist`
2. `fairly_active_mins` --> `moderately_active_dist`
3. `lightly_active_mins` --> `lightly_active_dist`
4. `sedentary_mins` --> `sedentary_active_dist`

Let's start plotting!

## Plotting

Let's plot each minute-distance pair in a scatter plot.

We'll draw a regression line to get an estimate of the speed users were moving at during these activities.

In [1]:
fig, axes = plt.subplots(1, 4, figsize=(15, 5), sharey=True)
fig.suptitle('Distance per Minutes given kind of Activity')

sns.regplot(data = dailyActivity_df, x = 'very_active_mins', y = 'very_active_dist', ax=axes[0])

sns.regplot(data = dailyActivity_df, x = 'fairly_active_mins', y = 'moderately_active_dist', ax=axes[1])

sns.regplot(data = dailyActivity_df, x = 'lightly_active_mins', y = 'lightly_active_dist', ax=axes[2])

sns.regplot(data = dailyActivity_df, x = 'sedentary_mins', y = 'sedentary_active_dist', ax=axes[3])

## For fun, let's try to fit a linear regression line into these graphs, this time representing calories burned instead.

In [1]:
# import LinearRegression from sklearn library.
from sklearn.linear_model import LinearRegression

Our first order of business is to define the inputs and outputs for our model.

Inputs are (`X`). Outputs are (`y`). 

In [1]:
X = full_dailyActivity_df['total_steps'].values.reshape((-1, 1))

y = full_dailyActivity_df['calories'].values

### Let's explain this last cell, briefly:

`.values` returns a NumPy array from our dataframe column. In this case, the `total_steps` column.

`reshape((-1, 1))` turns that 1D array into a 2D array with n rows and a single column, which is the shape scikit-learn expects for `X`.
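A tiny sketch of what the reshape does, using made-up step counts (`steps` and `X_demo` are illustrative names, not notebook variables):

```python
import numpy as np

steps = np.array([1000, 5000, 12000])   # made-up step counts, shape (3,)
X_demo = steps.reshape((-1, 1))         # -1 lets NumPy infer the number of rows

print(steps.shape)
print(X_demo.shape)
```

The first shape is one-dimensional, the second has three rows and one column.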

### Let's fit the model to the data and obtain our intercept and slope values.

In [1]:
model = LinearRegression()

model.fit(X, y)

In [1]:
print('intercept:', model.intercept_)

print('slope:', model.coef_)
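As a cross-check that does not depend on scikit-learn, the same least-squares line can be obtained with `np.polyfit`. The numbers below are synthetic and built so that the answer is known in advance (`calories = 1500 + 0.1 * steps`); they are not values from the dataset.

```python
import numpy as np

steps = np.array([0, 2000, 4000, 6000, 8000], dtype=float)
calories = 1500 + 0.1 * steps          # synthetic, perfectly linear

# polyfit returns coefficients from highest degree to lowest: [slope, intercept]
slope, intercept = np.polyfit(steps, calories, deg=1)

# Predicted calories for a day with zero steps is just the intercept.
calories_at_zero_steps = intercept + slope * 0
```

On the real data, the same call on `total_steps` and `calories` should give values close to `model.coef_` and `model.intercept_`.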

Ok, so our intercept is ~1,665.74.

This is the model's estimate of calories burned at zero steps, which we can read as a rough predicted basal metabolic rate (BMR). Let's hold on to that thought for a sec.

Now, let's draw the regression line we just generated on a scatterplot of steps against calories and see how well it fits.

To draw this regression line, we define an `siline` function to use matplotlib to draw a line in a 2D space from a slope and an intercept.

In [1]:
def siline(slope, intercept):
    """Plot a line from slope and intercept"""
    axes = plt.gca()
    x_vals = np.array(axes.get_xlim())
    y_vals = intercept + slope * x_vals
    plt.plot(x_vals, y_vals, color= 'r', ls = '--')

In [1]:
sns.scatterplot(data = full_dailyActivity_df, x= 'total_steps', y ='calories')

siline(model.coef_, model.intercept_);

### Nice 👀 

So, how can we interpret this?

Easy. 

### Apparently, the more steps you take, the more calories you'll burn.

If you look at the intercept of the regression line (the value of y where x = 0, the base of the line), we can see that people will burn calories EVEN if they decide to take no steps at all.

In other words, the intercept approximates the basal metabolic rate (BMR), or the number of calories you burn on a day at rest.

We can calculate the basal metabolic rate of a person based on their sex, age, weight and height.

For reference, let's use the BMR calculator from [Active](https://www.active.com/fitness/calculators/bmr).

> BMR for a 35 year old man that weighs 175 pounds and stands at 5'11": **1820 cals**

> BMR for a 35 year old woman that weighs 135 pounds and stands at 5'5": **1384 cals**

Alrighty. Let's compare this info with what our linear regression gave us: the intercept of ~1,665.74 falls between the two reference values.


### Let's go deeper.

Let's get the information on BMR, but this time by filtering only those data points with zero steps taken.

We can also get some stats on the calories distribution.

In [1]:
full_dailyActivity_df[full_dailyActivity_df['total_steps']==0]['calories'].describe()

### Here are some observations:

The minimum calories burned seems to be 0. 

There seem to be outliers in our data. Remember that it's impossible for people to burn 0 calories, even on a full day of rest (thanks to our BMR).

The maximum amount of calories burned seems to be 2664. Seems about right.

For now, let's focus on the outlier(s).

In [1]:
full_dailyActivity_df[full_dailyActivity_df['calories']== 0]

These 4 rows represent instances where users spent 1,440 minutes (a whole day) in a sedentary state. This most likely means that these users had their devices turned off during those days.

No need to have these in our data, so let's get rid of 'em!

In [1]:
full_dailyActivity_df = dailyActivity_df.copy()

full_dailyActivity_df['day'] = full_dailyActivity_df.activity_date.apply(lambda x: x.strftime('%d'))
full_dailyActivity_df['month'] = full_dailyActivity_df.activity_date.apply(lambda x: x.strftime('%m'))
full_dailyActivity_df['year'] = full_dailyActivity_df.activity_date.apply(lambda x: x.strftime('%Y'))
full_dailyActivity_df['day_of_week'] = full_dailyActivity_df.activity_date.apply(lambda x: x.strftime('%w'))

full_dailyActivity_df = full_dailyActivity_df[full_dailyActivity_df['calories'] != 0]

In [1]:
len(full_dailyActivity_df)

Cool. 4 outliers dropped, so we're left with a df consisting of 936 rows. Let's see if dropping these outliers makes a difference in our linear regression.

To simplify our flow, let's turn the regression process into a single function:

In [1]:
def get_regression(df, x='total_steps', y='calories'):
    """Fit a simple linear regression of column `y` on column `x`, then print and plot it."""
    X = df[x].values.reshape((-1, 1))
    y_vals = df[y].values  # separate name so the column label in `y` is not overwritten

    model = LinearRegression()
    model.fit(X, y_vals)

    print('intercept:', model.intercept_)
    print('slope:', model.coef_)

    sns.scatterplot(data=df, x=x, y=y)
    siline(model.coef_, model.intercept_)

    return (model.intercept_, model.coef_)

In [1]:
get_regression(full_dailyActivity_df)

Without the outliers, our fit has a slightly higher intercept of ~1,689.15.

### **Distribution according to type of activity**

Apart from time spent in a sedentary state, users can also log:

1. Very active minutes
2. Fairly active minutes
3. Lightly active minutes

Let's use histograms to see how minutes are distributed across these three activities.

In [1]:
fig, axes = plt.subplots(1, 3, figsize=(22, 5))
fig.suptitle('Distribution according to activity type')


sns.histplot(data = full_dailyActivity_df, x = 'lightly_active_mins', ax = axes[0]);
sns.histplot(data = full_dailyActivity_df, x = 'fairly_active_mins', ax = axes[1]);
sns.histplot(data = full_dailyActivity_df, x = 'very_active_mins', ax = axes[2]);

The distribution of `lightly_active_mins` is fairly symmetrical.

Most users don't spend much of their time in a very active or fairly active state. 

Makes sense, given that most humans aren't capable of exercising too hard for too long due to lack of stamina and fatigue.

How do we know for how long our clients use our trackers in a day?

I tried exploring other datasets, but adding all activities' minutes is an easy workaround. Assuming all users logged data for the whole day, our addition should equal 1,440 minutes (the total of minutes in a day).

Let's check that!

In [1]:
daily_logs = dailyActivity_df.copy()

daily_logs['day'] = daily_logs.activity_date.apply(lambda x: x.strftime('%d'))
daily_logs['month'] = daily_logs.activity_date.apply(lambda x: x.strftime('%m'))
daily_logs['year'] = daily_logs.activity_date.apply(lambda x: x.strftime('%Y'))
daily_logs['day_of_week'] = daily_logs.activity_date.apply(lambda x: x.strftime('%w'))

daily_logs['total_mins'] = (daily_logs['very_active_mins'] + 
                            daily_logs['fairly_active_mins'] + 
                            daily_logs['lightly_active_mins'] + 
                            daily_logs['sedentary_mins']
                           )

In [1]:
daily_logs = daily_logs[(daily_logs['total_mins'] == 1440) & (daily_logs['calories'] != 0)]

In [1]:
daily_logs.head()

In [1]:
print(f'There are {len(daily_logs)} instances where users logged for the whole day.')
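The 1,440 figure is simply 24 hours times 60 minutes. Here is a small sketch of the same bookkeeping on made-up rows, which also converts logged minutes into hours (`demo`, `full_day` and `hours_logged` are illustrative names):

```python
import pandas as pd

MINUTES_PER_DAY = 24 * 60  # 1440

# Made-up rows: the first adds up to a full day, the second does not.
demo = pd.DataFrame({'very_active_mins': [25, 0],
                     'fairly_active_mins': [13, 10],
                     'lightly_active_mins': [328, 200],
                     'sedentary_mins': [1074, 750]})

minute_cols = ['very_active_mins', 'fairly_active_mins',
               'lightly_active_mins', 'sedentary_mins']
demo['total_mins'] = demo[minute_cols].sum(axis=1)
demo['full_day'] = demo['total_mins'] == MINUTES_PER_DAY
demo['hours_logged'] = demo['total_mins'] / 60
```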

In [1]:
fig, axes = plt.subplots(1, 3, figsize=(22, 5))
fig.suptitle('Distribution based on activity type - Logged Entire Day')

sns.histplot(data = daily_logs, x = 'lightly_active_mins', ax = axes[0])
sns.histplot(data = daily_logs, x = 'fairly_active_mins', ax = axes[1]);
sns.histplot(data = daily_logs, x = 'very_active_mins', ax = axes[2]);

### Similar pattern.

Let's see what happens when we look at the days where users didn't log the entire day.

In [1]:
not_daily_logs = dailyActivity_df.copy()

not_daily_logs['day'] = not_daily_logs.activity_date.apply(lambda x: x.strftime('%d'))
not_daily_logs['month'] = not_daily_logs.activity_date.apply(lambda x: x.strftime('%m'))
not_daily_logs['year'] = not_daily_logs.activity_date.apply(lambda x: x.strftime('%Y'))
not_daily_logs['day_of_week'] = not_daily_logs.activity_date.apply(lambda x: x.strftime('%w'))

not_daily_logs['total_mins'] = (not_daily_logs['very_active_mins'] + 
                            not_daily_logs['fairly_active_mins'] + 
                            not_daily_logs['lightly_active_mins'] + 
                            not_daily_logs['sedentary_mins']
                           )

In [1]:
not_daily_logs = not_daily_logs[(not_daily_logs['total_mins'] != 1440) & (not_daily_logs['calories'] != 0)]

In [1]:
not_daily_logs.head()

In [1]:
print(f'There are {len(not_daily_logs)} instances where users did not log for the whole day.')

In [1]:
fig, axes = plt.subplots(1, 3, figsize=(22, 5))
fig.suptitle('Distribution according to activity type - Partial day logged')

sns.histplot(data = not_daily_logs, x = 'lightly_active_mins', ax = axes[0])
sns.histplot(data = not_daily_logs, x = 'fairly_active_mins', ax = axes[1]);
sns.histplot(data = not_daily_logs, x = 'very_active_mins', ax = axes[2]);

### Now we're onto something! 🙌

It seems that users who use our products throughout the whole day register a lot of `LightlyActiveMinutes`. And it makes sense! More demanding exercises require higher stamina expenditure, so lighter activities predominate. 

By similar logic, it seems that those who log only a part of their day are only using our product when engaging in more demanding exercises.

Let's see the distribution of total logged time in this second group.

In [1]:
sns.histplot(data = not_daily_logs, x = 'total_mins');

### **Sleeping habits and week day distributions** 🛌

Let's use histograms again to see the distribution of sleeping time for all users.

In [1]:
sns.histplot(data = sleep_df, x = 'total_mins_asleep');

It is generally agreed that an adult should get 7-8 hours of sleep per day.

This corresponds to 420-480 minutes of sleep, so we'll use 420 minutes as the minimum.

Let's see if our users are getting some proper sleep! 😴

In [1]:
sns.histplot(data = sleep_df, x = 'total_mins_asleep')
plt.axvline(420, color='red');  # ymin/ymax are axis fractions (0 to 1), so the defaults already span the plot

The distribution is somewhat symmetric with 231 rows to the right of the line (including the line) and 182 rows to the left.
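Counts like these can be read straight off a boolean mask. Below is a minimal sketch with made-up sleep values (`demo_sleep` is illustrative; in the notebook the same expressions would run on `sleep_df`). For the figures quoted above, 231 of 413 rows is roughly 56% at or above the line.

```python
import pandas as pd

# Made-up sleep durations in minutes.
demo_sleep = pd.DataFrame({'total_mins_asleep': [327, 384, 420, 412, 450, 700]})

# Comparing a column to a number gives True/False per row; summing counts the Trues.
at_or_above = (demo_sleep['total_mins_asleep'] >= 420).sum()
below = (demo_sleep['total_mins_asleep'] < 420).sum()
share_at_or_above = at_or_above / len(demo_sleep)
```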

We can further inspect the distribution of minutes asleep per week day. Let's make sure they're displayed in order (from sun to sat).

In [1]:
sns.boxplot(x="day_of_week", y="total_mins_asleep", data=sleep_df,
            order = ['0','1','2','3','4','5','6']);

Visually, there's no clear way to distinguish one day from another.

Sad.

You would expect to see some difference between the weekday and weekend, but it doesn't seem to be the case.

While we are looking at distributions across days of the week, we can use our `activity_dist` dataframe to inspect the average values of steps, calories and distances:

In [1]:
fig, axes = plt.subplots(1, 3, figsize=(22, 5))
fig.suptitle('Distribution of average values across days of the week')

sns.boxplot(x="day_of_week", y="total_steps", data=activity_dist, ax=axes[0]);

sns.boxplot(x="day_of_week", y="calories", data=activity_dist, ax=axes[1]);

sns.boxplot(x="day_of_week", y="total_dist", data=activity_dist, ax=axes[2]);

### **Distribution of calories and distance**

In [1]:
fig, axes = plt.subplots(1, 2, figsize=(22, 5))
fig.suptitle('Distribution of calories and total distance')

sns.histplot(data=full_dailyActivity_df, x="calories", ax = axes[0]);

sns.histplot(data=full_dailyActivity_df, x="total_dist", ax = axes[1]);

The distributions of `calories` and `total_dist` are somewhat skewed, with most values on the lower side.

Is there a correlation between the two?
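One hedged way to put a number on that question is the Pearson correlation coefficient, available as `Series.corr` in pandas. The sketch below uses made-up, perfectly linear values, so it says nothing about the real data; in the notebook the same call would be made on `full_dailyActivity_df['total_dist']` and `full_dailyActivity_df['calories']`.

```python
import pandas as pd

# Made-up values where calories rise by exactly 200 per unit of distance.
demo = pd.DataFrame({'total_dist': [1.0, 2.0, 3.0, 4.0],
                     'calories': [1600, 1800, 2000, 2200]})

# Pearson correlation coefficient, always between -1 and 1.
r = demo['total_dist'].corr(demo['calories'])
```

A value near 1 or -1 points to a strong linear relationship, a value near 0 to a weak one. Either way, correlation alone does not show that one variable causes the other.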

### **Understanding Sedentary Minutes**

Let's check the distribution of `sedentary_mins`:

In [1]:
sns.histplot(data= activity_weekend, x = 'sedentary_mins');

I want to double-check and see if there's really no correlation at all between the data and the day of the week.

Let's use our `activity_weekend` dataframe and a `FacetGrid` to visualize two groups (weekend = yes, weekend = no).

We need to normalize the distribution given that there are more weekdays than weekends (to our displeasure).

In [1]:
g = sns.FacetGrid(activity_weekend, col="weekend", height=6, aspect=1)
g.map(sns.histplot, "sedentary_mins", kde=True, stat='density');

It seems there are two groups of users based on the distribution of `SedentaryMinutes`. 

Let's obtain some statistics now! Why not start with the good old mean (average)?

In [1]:
# make a copy of the original df
avg_sedentary = dailyActivity_df.copy()

# group by id and average the sedentary minutes
avg_sedentary = avg_sedentary.groupby('id')[['sedentary_mins']].mean()

# order by sedentary minutes, in descending order
avg_sedentary = avg_sedentary.sort_values(by=['sedentary_mins'], ascending = False)

# reset index
avg_sedentary = avg_sedentary.reset_index()

# include only Id and SedentaryMinutes columns in df
avg_sedentary = avg_sedentary[['id', 'sedentary_mins']]

avg_sedentary

### Let's visualize this table in a nice barplot. 📊

In [1]:
sns.barplot(data = avg_sedentary,
            x = 'id', y = 'sedentary_mins',
            order = avg_sedentary.sort_values('sedentary_mins',ascending = True)['id'])
plt.xticks(rotation=70);

Now, let's calculate the average minutes spent in sedentary position, just because.

In [1]:
mean_sedentary_minutes = np.mean(dailyActivity_df['sedentary_mins'])

mean_sedentary_minutes

A good idea I found in other kagglers' notebooks was to group users based on specific criteria.

Let's group users based on above-average sedentary minutes spent, and below-average sedentary minutes spent.

In [1]:
def above_below(user):
    '''Returns 1 if user has above-average SedentaryMinutes and 0 otherwise'''
    return int(avg_sedentary[avg_sedentary['id']==user]['sedentary_mins'].values[0] > mean_sedentary_minutes)

In [1]:
activity_weekend['user_group'] = activity_weekend['id'].apply(above_below)
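Note that the threshold is the mean over all daily rows, while each user is compared through their own per-user mean. The sketch below reproduces that logic with `groupby` and `map` on made-up rows, as a possible alternative to calling `above_below` once per row; `demo`, `user_mean` and `overall_mean` are illustrative names.

```python
import pandas as pd

# Made-up daily rows for three users.
demo = pd.DataFrame({'id': [1, 1, 2, 2, 3],
                     'sedentary_mins': [600, 700, 1200, 1300, 900]})

overall_mean = demo['sedentary_mins'].mean()               # mean over all rows
user_mean = demo.groupby('id')['sedentary_mins'].mean()    # one value per user

# 1 if the user's own average is above the overall average, else 0.
group_by_user = (user_mean > overall_mean).astype(int)
demo['user_group'] = demo['id'].map(group_by_user)
```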

In [1]:
# Rows in each group
print('How many rows in group 0 (Less Sedentary group)?:')
print(len(activity_weekend[activity_weekend['user_group']==0]))
print('How many rows in group 1 (More Sedentary group)?:')
print(len(activity_weekend[activity_weekend['user_group']==1]))

# Distinct users in each group
print('Super-active, unique users (Less Sedentary group):')
print(activity_weekend[activity_weekend['user_group']==0]['id'].nunique())

print('Super-lazy, unique users (More Sedentary group):')
print(activity_weekend[activity_weekend['user_group']==1]['id'].nunique())

Let's visualize the number of instances per group

In [1]:
sns.countplot(data=activity_weekend, x = 'user_group');

A boxplot should help us see a clear difference between the two groups

In [1]:
sns.boxplot(x="user_group", y="sedentary_mins", data=activity_weekend);

Group 1 is more sedentary, given that its `SedentaryMinutes` median is higher.

Now, let's see if there's a difference between weekdays and weekends.

In [1]:
sns.boxplot(x="user_group", y="sedentary_mins", hue = 'weekend', data=activity_weekend);

Let's see if the distribution for the less sedentary group varies between weekdays and weekends

In [1]:
g = sns.FacetGrid(activity_weekend[activity_weekend['user_group']==0], col="weekend", height=6, aspect=.7)
g.map(sns.histplot, "sedentary_mins", kde=True, stat='density');

Ok we found something interesting.

During weekends, people seem to be a little more active. This is probably because people work during weekdays and thus are bound to their desks (assuming most users in this analysis work at desk jobs).

### **Do average values change on weekends?**

___

In [1]:
# make a copy of the original dailyActivity df
weekend_avg = dailyActivity_df.copy()

# add new column extracting only day of the week from the ActivityDate column
weekend_avg['day_of_week'] = weekend_avg.activity_date.apply(lambda x: x.strftime('%w'))

# apply the check_weekend function
weekend_avg['weekend'] = weekend_avg.apply(check_weekend, axis=1)

# drop the day of week column as it's no longer needed
weekend_avg = weekend_avg.drop('day_of_week', axis=1)

# group by weekend and average only the relevant columns
weekend_avg = weekend_avg.groupby('weekend')[['sedentary_mins', 'calories', 'total_steps', 'total_dist']].mean()

In [1]:
weekend_avg

There seems to be a small difference, but nothing too significant...
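
As a cross-check, the weekend flag can also be derived straight from the date. This is a minimal sketch, assuming `activity_date` is a datetime column; the toy numbers are invented, and the `dt.dayofweek` comparison stands in for the notebook's `check_weekend` helper rather than reproducing it:

```python
import pandas as pd

toy = pd.DataFrame({
    # Friday, Saturday, Sunday, Monday
    "activity_date": pd.to_datetime(
        ["2016-04-15", "2016-04-16", "2016-04-17", "2016-04-18"]
    ),
    "sedentary_mins": [1000, 900, 800, 1100],
    "total_steps": [7000, 9000, 8000, 6000],
})

# dt.dayofweek counts Monday as 0, so Saturday and Sunday are 5 and 6
is_weekend = toy["activity_date"].dt.dayofweek.ge(5)
toy["day_type"] = is_weekend.map({True: "weekend", False: "weekday"})

# Average each metric separately for weekdays and weekends
weekend_means = toy.groupby("day_type")[["sedentary_mins", "total_steps"]].mean()
print(weekend_means)
```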

### **Sleeping habits for each user group**
---

In [1]:
sleep_df['user_group'] = sleep_df['id'].apply(above_below)

In [1]:
sns.boxplot(x="user_group", y="total_mins_asleep", data=sleep_df);

In [1]:
sns.countplot(data = sleep_df, x = 'user_group');

Number of rows in each group (in the `sleep_df` dataframe):

In [1]:
sleep_df['user_group'].value_counts()

In [1]:
print('Distinct users in group 0 (Less Sedentary group)')
print(sleep_df[sleep_df['user_group']==0]['id'].nunique())

print('Distinct users in group 1 (More Sedentary group)')
print(sleep_df[sleep_df['user_group']==1]['id'].nunique())

### Something's odd.

- 364 records of daily sleep activity for the less sedentary group.
- 49 records of daily sleep activity for the more sedentary group.
- The total number of unique users in the less sedentary group is 14.
- The total number of unique users in the more sedentary group is 10.
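
Dividing the record counts by the user counts shows how uneven sleep logging is between the two groups. A quick sketch using only the four figures listed above:

```python
# Figures reported above for sleep_df
records = {"less_sedentary": 364, "more_sedentary": 49}
users = {"less_sedentary": 14, "more_sedentary": 10}

# Average number of sleep records logged per user in each group
records_per_user = {group: records[group] / users[group] for group in records}
print(records_per_user)
```

That works out to 26 sleep records per user in the less sedentary group against roughly 5 in the more sedentary group, so any sleep comparison between the groups leans heavily on the first one.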

## Going back to our Activity and Sleep table
---

Let's go back to our `activity_sleep_df` dataframe, which essentially was an inner join of the `dailyActivity` and `SleepDay` tables.

In [1]:
activity_sleep_df.head()

In [1]:
sns.regplot(data = activity_sleep_df,
                x = 'total_mins_asleep',
                y = 'sedentary_mins');

Awesome!

It seems that the more you sleep, the less sedentary you are while awake. Makes sense!


# Share 🤝
---

Ok. We've done some good analysis, and found some interesting patterns. Let's see if we can answer the big question.

Remember, we're trying to understand how current user trends can guide Bellabeat's marketing strategy.

We'll start our sharing with some basic descriptions.

> Our dataset has data on 33 different users who logged their daily activities between 03.12.2016 and 05.12.2016.

## Questions I want to answer

In this share phase, I want to answer some questions that might lead to some marketing strategy efforts:

- What do the distributions by activity type, by calories and distance, and by sleep look like?
- Does the day of the week affect behavior?
- Do steps affect calories burned?
- How long do people stay sedentary in a day?
- Does sleep affect sedentary time?

## Describing the data

The main data on daily activities is in the `full_dailyActivity_df` dataframe.

Let's get some high-level statistics with the `describe()` method:

In [1]:
full_dailyActivity_df.loc[:, full_dailyActivity_df.columns != 'id'].describe().T

For the sleep habits data, we can use the `describe()` method on the `sleep_df` dataframe:

In [1]:
sleep_df.loc[:, sleep_df.columns != 'id'].describe()

## **Distributions**
 ---

### **By activity type**

Let's check all activity types, except `SedentaryMinutes`, which we will analyze separately.

In [1]:
fig, axes = plt.subplots(1, 3, figsize=(22, 5))
fig.suptitle('Distribution according to activity type')

sns.histplot(data = full_dailyActivity_df, x = 'very_active_mins', ax = axes[0]);

sns.histplot(data = full_dailyActivity_df, x = 'fairly_active_mins', ax = axes[1]);

sns.histplot(data = full_dailyActivity_df, x = 'lightly_active_mins', ax = axes[2]);

Of the 940 original rows in our data, 462 correspond to days that were only partially logged. For these records:

In [1]:
fig, axes = plt.subplots(1, 3, figsize=(22, 5))
fig.suptitle('Distribution according to activity type - Partial day logged')

sns.histplot(data = not_daily_logs, x = 'very_active_mins', ax = axes[0]);

sns.histplot(data = not_daily_logs, x = 'fairly_active_mins', ax = axes[1]);

sns.histplot(data = not_daily_logs, x = 'lightly_active_mins', ax = axes[2]);

The `LightlyActiveMinutes` distribution is very symmetric with no peak at very few minutes of activity. Users who log the entire day may end up registering a lot of `LightlyActiveMinutes` while those who log only a part of the day might be registering only activities with higher demand.

Let's see the distribution of total logged time in this second group.

In [1]:
sns.histplot(data = not_daily_logs, x = 'total_mins')
plt.title('Logged minutes for partially logged days');
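
How a day ends up in `not_daily_logs` is decided earlier in the notebook. Here's a minimal sketch of one way to make that split, assuming `total_mins` is the sum of the four activity-minute columns and that a fully logged day adds up to 1,440 minutes; the toy rows are invented:

```python
import pandas as pd

MINUTES_PER_DAY = 24 * 60  # 1440

toy = pd.DataFrame({
    "very_active_mins": [30, 60, 0],
    "fairly_active_mins": [20, 30, 0],
    "lightly_active_mins": [200, 250, 0],
    "sedentary_mins": [1190, 700, 1440],
})

minute_cols = ["very_active_mins", "fairly_active_mins",
               "lightly_active_mins", "sedentary_mins"]

# Total logged time per day across all activity levels
toy["total_mins"] = toy[minute_cols].sum(axis=1)

# Assumption: days that fall short of 24 logged hours count as partially logged
partial_days = toy[toy["total_mins"] < MINUTES_PER_DAY]
print(partial_days["total_mins"].tolist())
```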

### **By calories and distance**

In [1]:
fig, axes = plt.subplots(1, 2, figsize=(22, 5))
fig.suptitle('Distribution of Calories Burned daily (left) and daily Distance (right)')

sns.histplot(data=full_dailyActivity_df, x="calories", ax = axes[0]);

sns.histplot(data=full_dailyActivity_df, x="total_dist", ax = axes[1]);

The distribution of calories burned is slightly skewed toward lower values, while the distance distribution is highly skewed toward shorter distances.
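
To put a number on "skewed", we could compute a skewness coefficient. This is a sketch with a hand-written helper (`skewness` is my own name, not something from the notebook) and invented toy distances; a positive value means most values sit at the low end with a long tail toward high values:

```python
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum((v - mu) ** 3 for v in values) / (len(values) * sigma ** 3)

# Toy daily distances: most days are short, a few are long
toy_distances = [1, 2, 2, 3, 3, 3, 4, 5, 9, 14]
print(skewness(toy_distances))

# A symmetric sample has zero skewness
print(skewness([1, 2, 3, 4, 5]))
```

On the real data, `full_dailyActivity_df['total_dist'].skew()` should give a comparable figure, although pandas applies a bias correction, so the two numbers will not match exactly.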

### **By sleeping patterns**

In [1]:
sns.histplot(data = sleep_df, x = 'total_mins_asleep')
plt.title('Daily minutes asleep')

# mark the 7-hour (420-minute) threshold; ymin/ymax are axes fractions and default to the full height
plt.axvline(420, color='red', ls = '--', lw = 3);

plt.annotate('182 records', (100,50))
plt.annotate('231 records', (650,50))
plt.annotate('7h of sleep', (380,30), color='black')
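
The two record counts in the annotations are typed in by hand. Here's a sketch of deriving such counts from the data instead, using invented toy values; putting a night of exactly 420 minutes in the upper count is a choice made by this sketch, not something the notebook states:

```python
# Toy nightly sleep durations in minutes (invented values)
toy_minutes_asleep = [300, 390, 419, 420, 450, 480, 510]

THRESHOLD = 7 * 60  # 7 hours = 420 minutes

# Count nights strictly below the threshold; the rest are at or above it
below = sum(1 for minutes in toy_minutes_asleep if minutes < THRESHOLD)
at_or_above = len(toy_minutes_asleep) - below

print(f"{below} records below 7h, {at_or_above} records at or above 7h")
```

With the real dataframe, the same idea would be `(sleep_df['total_mins_asleep'] < 420).sum()`, and the result could be passed to `plt.annotate` so the labels stay in sync with the data.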

## **Behavior based on day of the week**
---
Does the day of the week affect user behavior? Is this effect considerable enough to market some functionality built around it?

### Let's find out!

## Steps, Calories, and Distance

In [1]:
fig, axes = plt.subplots(1, 3, figsize=(22, 5))
fig.suptitle('Distribution of average values across days of the week')

day_labels = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']

# one boxplot per metric; the labels assume day_of_week is plotted in Sunday-to-Saturday order
for ax, metric in zip(axes, ['total_steps', 'calories', 'total_dist']):
    sns.boxplot(x="day_of_week", y=metric, data=activity_dist, ax=ax)
    ax.set_xticklabels(day_labels);

With the current data, there is no considerable difference between the average Steps Taken, Calories Burned or Distance across different days of the week.

## **Sleep**

In [1]:
sns.boxplot(x="day_of_week", y="total_mins_asleep", data=sleep_df,
            order = ['0','1','2','3','4','5','6']).set_xticklabels(['Sun','Mon','Tue','Wed','Thu','Fri', 'Sat']);

No significant difference between days.
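
The tick labels rely on `strftime('%w')` codes, where `'0'` is Sunday and `'6'` is Saturday. Here's a small sketch of mapping those codes to names and comparing medians per day, using invented toy records rather than `sleep_df`:

```python
from collections import defaultdict
from datetime import date
from statistics import median

# strftime('%w') returns '0' for Sunday through '6' for Saturday
DAY_NAMES = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']

# Toy (date, minutes asleep) records, invented for illustration
toy_sleep = [
    (date(2016, 4, 12), 400),  # Tuesday
    (date(2016, 4, 16), 460),  # Saturday
    (date(2016, 4, 17), 480),  # Sunday
    (date(2016, 4, 19), 420),  # Tuesday
]

# Collect the sleep minutes recorded on each day of the week
by_day = defaultdict(list)
for day, minutes in toy_sleep:
    by_day[DAY_NAMES[int(day.strftime('%w'))]].append(minutes)

median_by_day = {name: median(values) for name, values in by_day.items()}
print(median_by_day)
```

A table of per-day medians like this is a handy companion to the boxplot when the boxes look too similar to judge by eye.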

## **Steps**: Do steps affect calorie expenditure?

---

In [1]:
get_regression(full_dailyActivity_df)
plt.ylabel('calories')
plt.title('Daily calories burned by number of steps taken');

### Indeed!

As you'd probably expect, the more steps you take, the more calories you'll burn.

Remember that the intercept of the regression line will never be zero (0), because it's impossible to burn zero calories in a day thanks to our BMR (basal metabolic rate).

BMR means that even if you stay completely still for 24 hours, your body will still burn calories.

In [1]:
full_dailyActivity_df[full_dailyActivity_df['total_steps']==0]['calories'].describe()
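
To make the intercept idea concrete, here's a sketch of a plain least-squares fit. The `fit_line` helper is my own illustration (it is not the notebook's `get_regression`), and the toy data assumes a made-up baseline of 1,800 kcal plus 0.05 kcal per step:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy data: invented baseline burn plus a fixed cost per step
steps = [0, 2000, 5000, 8000, 12000]
calories = [1800 + 0.05 * s for s in steps]

intercept, slope = fit_line(steps, calories)
print(round(intercept), round(slope, 2))
```

The fitted intercept recovers the baseline burn at zero steps, which is the role BMR plays in the real plot.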

### **How long do users spend being sedentary in a day?**
---

In [1]:
g = sns.FacetGrid(activity_weekend, col="weekend", height=6, aspect=.7)
g.map(sns.histplot, "sedentary_mins", kde=True, stat='density')
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle('Daily sedentary minutes')
axes = g.axes.flatten()
axes[0].set_title("Weekdays")
axes[1].set_title("Weekends");

It seems there are two groups of users based on the distribution of `SedentaryMinutes`.
Looking at the average `SedentaryMinutes` per user helps tell them apart.
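
Here's a minimal sketch of that per-user average with invented toy values. The split rule (above or below the mean of the user averages) is an assumption for illustration; the notebook's own `above_below` helper may use a different cutoff:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3],
    "sedentary_mins": [700, 740, 1200, 1240, 760, 800],
})

# Average sedentary minutes per user
avg_per_user = toy.groupby("id")["sedentary_mins"].mean()

# Hypothetical rule: group 1 if the user's average is above the mean of user averages, else group 0
cutoff = avg_per_user.mean()
user_group = (avg_per_user > cutoff).astype(int)
print(user_group.to_dict())
```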

### **Do average values change on weekends?**

In [1]:
weekend_avg

There's a slight difference, but nothing to write home about.

## **Do sleep habits influence sedentary time?**

In [1]:
sns.regplot(data = activity_sleep_df,
                x = 'total_mins_asleep',
                y = 'sedentary_mins');

This is an interesting graph: there is a clear tendency of **users with more minutes asleep to be less sedentary**. 

So, one conclusion might be that the more you sleep, the more active you are during the day!
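
A correlation coefficient would put a number on that tendency. Here's a sketch with a hand-written Pearson helper (`pearson_r` is my own name) and invented toy values; a result close to -1 means sedentary minutes fall almost linearly as sleep minutes rise:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy values: more minutes asleep paired with fewer sedentary minutes
toy_asleep = [300, 360, 420, 480, 540]
toy_sedentary = [900, 850, 760, 700, 650]

r = pearson_r(toy_asleep, toy_sedentary)
print(round(r, 3))
```

On the real data, `activity_sleep_df['total_mins_asleep'].corr(activity_sleep_df['sedentary_mins'])` should return the same kind of coefficient.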

# **Act** 🏃‍♂️🏃‍♀️
---

Aaaaaand we're done!

Based on our analysis, we've obtained some interesting insights.


## Steps:
- There's no apparent correlation between the day of the week and the level of activity.
- Users take 7,670 steps on average; according to a [2011 study](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3197470/), the recommended number of steps per day is 10,000 in healthy adults. Another cool note is that people who track their steps take ~2,500 more steps than those who don't. Walking is generally associated with a lower mortality risk.
- There's a positive correlation between steps taken and calories burned. So we could design a model that predicts how many steps a user should take to burn x calories. The model would consider gender, age, weight, and height variables. Then, we could take those same variables to calculate the user's BMR.
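
The step-target idea could be sketched roughly like this. It isn't part of the analysis above: it assumes the Mifflin-St Jeor equation as one commonly used BMR estimate, a hypothetical fixed cost of 0.05 kcal per step, and it ignores calories burned through any other activity:

```python
def bmr_mifflin_st_jeor(weight_kg, height_cm, age_years, is_female):
    """Estimated basal metabolic rate in kcal/day (Mifflin-St Jeor equation)."""
    base = 10 * weight_kg + 6.25 * height_cm - 5 * age_years
    return base - 161 if is_female else base + 5

def steps_for_target(target_kcal, bmr_kcal, kcal_per_step):
    """Steps needed to reach a daily calorie target, given a per-step cost."""
    extra = max(target_kcal - bmr_kcal, 0)  # calories that steps have to cover
    return extra / kcal_per_step

# Hypothetical user: 30-year-old woman, 65 kg, 165 cm
bmr = bmr_mifflin_st_jeor(weight_kg=65, height_cm=165, age_years=30, is_female=True)

# kcal_per_step is an assumed value, not one estimated from this dataset
needed = steps_for_target(target_kcal=1800, bmr_kcal=bmr, kcal_per_step=0.05)
print(bmr, round(needed))
```

In a real product, the per-step cost would come from a regression fitted on each user's own data rather than a fixed constant.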

## Sleep:
- There's a negative correlation between minutes of sleep and minutes of sedentary activity. The more you sleep, the more likely you are to spend LESS time in a sedentary state. We could implement tools that remind users to go to sleep at a particular time. We could offer sleep-tracking services that help users sleep optimally. This should also motivate users to wear our products all day, which means more data to improve our product!

Conclusions: We should develop software that allows users to track their steps in real-time. Some gamification (achievements, motivational messages per steps taken, etc.) could also be implemented to promote more step-taking! Sleep tracking should also help our users take better care of their health by adopting healthier habits while awake!


# **FIN.**

And that's it for now.

I had tons of fun doing this. The Python community is fantastic, and this project wouldn't have been possible without the inspiration of other people's work.

Feedback is more than welcome!