# The Case Study Report
The details of this analysis, including the business task and all six phases of the data analysis life cycle, can be found in this [Google Document](https://docs.google.com/document/d/1PjuteS8C1uapEPS2kFOpl6r966ZaIWHvYC0xzwxLcd0/edit?usp=sharing).

# Setup

In [None]:
# Load necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
import os
from pandas.plotting import register_matplotlib_converters

# %config InlineBackend.figure_format='retina'

register_matplotlib_converters()
# can add font_scale=1.5 if necessary to increase font size
sns.set(style='whitegrid', palette='muted', font_scale=1.25)

plt.style.use('fivethirtyeight')
# NOTE: rcParams need to be after plt.style.use
plt.rcParams["figure.figsize"] = (16, 10)

# Ignore warnings if necessary
# import warnings
# warnings.filterwarnings('ignore')

# Remove scientific notation if necessary for df.describe()
# pd.set_option('display.float_format', lambda x: f'{x:,.5f}')

In [None]:
# Set max columns to be displayed
pd.set_option('display.max_columns', 99)

# Prepare by sorting and filtering, then Process by cleaning

In [None]:
DATA_PATH = '../input/fitbit/Fitabase Data 4.12.16-5.12.16/'

In [None]:
df = pd.read_csv(DATA_PATH + 'dailyActivity_merged.csv', parse_dates={'Date': [1]})
df.head()

In [None]:
# Get the name of day of the week, e.g. Tuesday
#  and insert into the column after the Date column, i.e. index=1
df.insert(1, 'Day', df['Date'].dt.day_name())
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.nunique()

In [None]:
# 31 days in total
print(df.Date.nunique())
print(df.Date.unique())

In [None]:
# 33 instead of 30 respondents?
print(df.Id.nunique())
print(df.Id.unique())

In [None]:
user_id2encoded = {user_id: idx for idx, user_id in enumerate(df.Id.unique())}
user_encoded2id = {idx: user_id for user_id, idx in user_id2encoded.items()}

In [None]:
df['user_id'] = df.Id.map(user_id2encoded)

In [None]:
df.groupby('Id')['Day'].count()

- Over the 31-day period, some users have missing days of data, most notably the user with Id 4057192912.

In [None]:
# checking every csv file at once
for csv_file in os.listdir(DATA_PATH):
    csv_df = pd.read_csv(DATA_PATH + csv_file)
    print(csv_file)
    print(csv_df.shape)
    display(csv_df.head())

- The other daily CSV files are already included in the main `dailyActivity_merged.csv` file, so they can be left out.

# Cleaning duplicated data 

In [None]:
daily_sleep = pd.read_csv(DATA_PATH + 'sleepDay_merged.csv', parse_dates={'Date': [1]})
daily_sleep.head()

In [None]:
daily_sleep.describe()

In [None]:
weight_log = pd.read_csv(DATA_PATH + 'weightLogInfo_merged.csv', parse_dates=['Date'])
weight_log.head()

In [None]:
weight_log.describe()

In [None]:
def find_duplicated_data(df):
    return df[df.duplicated()]

In [None]:
# only daily sleep data has duplicated data
len(find_duplicated_data(df)), len(find_duplicated_data(daily_sleep)), len(find_duplicated_data(weight_log))

In [None]:
daily_sleep.shape

In [None]:
# There are some duplicated data in the daily sleep data
daily_sleep[daily_sleep.duplicated(keep=False)]

In [None]:
# Dropping the duplicates
daily_sleep.drop_duplicates(inplace=True, ignore_index=False)

# Removing Outliers

NOTE: This is a generic way to remove outliers using the interquartile range (IQR). It is usually better to inspect outliers before removing them, because they can carry real information.

To keep things simple, I remove them here because the analysis below relies on **averages**, and a few extreme values can shift an **average** significantly.
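As a small illustration of why outliers matter for averages, here is a sketch on made-up numbers (the `values` array is hypothetical and not taken from the dataset). It applies the same 1.5 × IQR fences and compares the mean before and after removal.

```python
import numpy as np

# Hypothetical toy sample: nine ordinary values and one extreme value
values = np.array([5, 6, 6, 7, 7, 7, 8, 8, 9, 60])

# Quartiles and the 1.5 * IQR fences
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values inside the fences
kept = values[(values >= lower) & (values <= upper)]

print(f"fences: [{lower}, {upper}]")
print(f"mean with outlier:    {values.mean():.2f}")
print(f"mean without outlier: {kept.mean():.2f}")
print(f"median (robust):      {np.median(values):.2f}")
```

In this toy sample the single extreme value pulls the mean well away from the median, which is the effect the outlier removal is meant to limit.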

In [None]:
def remove_outliers(df, cols):
    all_outliers = set()
    for col in cols:
        q1 = np.percentile(df[col], 25)
        q3 = np.percentile(df[col], 75)

        iqr = q3 - q1

        lower_boundary = q1 - (iqr * 1.5)
        upper_boundary = q3 + (iqr * 1.5)

        outliers = df[(df[col] < lower_boundary) | (df[col] > upper_boundary)]
        n_outliers = len(outliers)
        pct_outliers = n_outliers / len(df) * 100
        print(f"[INFO] Found {n_outliers} outliers ({pct_outliers:.2f}%) for {col}")
        all_outliers.update(outliers.index)
    # Pass a list of index labels so the call does not depend on how a set is handled
    df = df.drop(index=list(all_outliers)).reset_index(drop=True)
    print(f"[INFO] Removed {len(all_outliers)} rows.")
    return df

In [None]:
daily_sleep.head()

In [None]:
daily_sleep.shape

In [None]:
daily_sleep_cleaned = remove_outliers(daily_sleep, ['TotalSleepRecords', 'TotalMinutesAsleep', 'TotalTimeInBed'])

In [None]:
daily_sleep_cleaned.shape

In [None]:
df.head()

In [None]:
daily_col_names = df.loc[:, 'TotalSteps':].columns.values

In [None]:
df_cleaned = remove_outliers(df, daily_col_names)

In [None]:
weight_log.head()

In [None]:
weight_log_cols = weight_log.loc[:, 'WeightKg': 'BMI'].columns.values

In [None]:
weight_log_cleaned = remove_outliers(weight_log, weight_log_cols)

In [None]:
# daily_sleep_cleaned.to_csv(DATA_PATH + 'cleaned_sleepDay.csv', index=False)
# df_cleaned.to_csv(DATA_PATH + 'cleaned_dailyActivity.csv', index=False)
# weight_log_cleaned.to_csv(DATA_PATH + 'cleaned_weightLog.csv', index=False)

# Load cleaned data

In [None]:
# daily_sleep = pd.read_csv(DATA_PATH + 'cleaned_sleepDay.csv', parse_dates=['Date'])
# df = pd.read_csv(DATA_PATH + 'cleaned_dailyActivity.csv', parse_dates=['Date'])
# weight_log = pd.read_csv(DATA_PATH + 'cleaned_weightLog.csv', parse_dates=['Date'])

In [None]:
daily_sleep = daily_sleep_cleaned.copy()
df = df_cleaned.copy()
weight_log = weight_log_cleaned.copy()

# Merging sleep data with full daily data

In [None]:
# Merging full daily data with the daily sleep data, on unique combinations of Id and Date
#  using LEFT JOIN to keep all the records in the full daily data
df_merge = pd.merge(df, daily_sleep, on=['Id', 'Date'], how='left')
df_merge.shape

In [None]:
df_merge.head()

In [None]:
# sort the data by Id, then by Date
df_merge.sort_values(by=['Id', 'Date'], inplace=True)

In [None]:
df_merge.Id.nunique()

In [None]:
df_merge.Date.nunique()

In [None]:
# A lot of missing values in the daily sleep data, i.e. 63.3% missing
df_merge.isna().mean()

- More than 60% of the sleep values are missing, so they cannot be resolved easily. They are left as they are and will be analyzed later.
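One way to see where the gaps come from is the `indicator` option of `pd.merge`, which labels each row by its origin. The sketch below uses small made-up frames (`activity` and `sleep` are hypothetical stand-ins, not the real tables).

```python
import pandas as pd

# Hypothetical toy frames standing in for the activity and sleep tables
activity = pd.DataFrame({
    'Id': [1, 1, 2, 2],
    'Date': pd.to_datetime(['2016-04-12', '2016-04-13', '2016-04-12', '2016-04-13']),
    'TotalSteps': [1000, 2000, 3000, 4000],
})
sleep = pd.DataFrame({
    'Id': [1],
    'Date': pd.to_datetime(['2016-04-12']),
    'TotalMinutesAsleep': [420],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
merged = pd.merge(activity, sleep, on=['Id', 'Date'], how='left', indicator=True)

# Share of activity rows that have no matching sleep record
unmatched_share = (merged['_merge'] == 'left_only').mean()

# Number of unmatched days for each user
unmatched_per_user = (
    merged.assign(unmatched=merged['_merge'] == 'left_only')
          .groupby('Id')['unmatched']
          .sum()
)
print(unmatched_share)
print(unmatched_per_user)
```

The same idea could be applied to `df` and `daily_sleep` to check whether the missing values come from a few users or are spread across everyone.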

# Analysis

In [None]:
# Changing variable name to `df` to simplify code
df = df_merge.copy()

In [None]:
df.columns

In [None]:
df.Id.value_counts()

# Analyzing heart rate to determine average wearing hours per day

In [None]:
heart_rate = pd.read_csv(DATA_PATH + 'heartrate_seconds_merged.csv')
heart_rate.head()

In [None]:
heart_rate.Id.nunique()

- Only 14 out of 33 users use this feature.
- Not every user tracks their heart rate, possibly for privacy reasons or because they do not find it useful; it seems unlikely that a user would wear the device and simply forget to turn the feature on.
- A follow-up survey collecting qualitative data would be needed to confirm this.
- A major limitation of this dataset is that it contains only quantitative data, with no qualitative feedback from users, which is equally important.

- Because not every user uses this feature, heart rate data cannot be used to determine the average wearing hours for every user.
- Still, it is interesting to analyze how many hours per day, on average, the users of this feature keep it on.

In [None]:
heart_rate.sample(5)

In [None]:
# Use this function to convert the date into proper format
#  to speed up the pd.to_datetime function to convert it into datetime datatype
def format_time(orig_time):
    split_date = orig_time.split(' ')[0].split("/")
    month = str.zfill(split_date[0], 2)
    day = str.zfill(split_date[1], 2)
    year = split_date[-1]
    
    split_time = orig_time.split(' ')[-2].split(':')
    hour = str.zfill(split_time[0], 2)
    minute = split_time[1]
    second = split_time[2]
    pm_or_am = orig_time[-2:]
    
    return f"{day}/{month}/{year} {hour}:{minute}:{second} {pm_or_am}"

heart_rate.Time = heart_rate.Time.apply(format_time)

In [None]:
heart_rate['Time'][:5]

In [None]:
# Became much faster after converting the time formats using the function above
heart_rate['Time'] = pd.to_datetime(heart_rate['Time'], format='%d/%m/%Y %I:%M:%S %p')
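As an alternative sketch, the raw strings can probably be parsed in one step by giving `pd.to_datetime` a month-first format, without rewriting them first. This assumes the raw `Time` values look like `4/12/2016 7:21:00 AM` (month/day/year, 12-hour clock); the `raw` Series below is made up for illustration.

```python
import pandas as pd

# Hypothetical raw timestamps in month/day/year 12-hour layout
raw = pd.Series([
    '4/12/2016 7:21:00 AM',
    '4/12/2016 12:05:05 PM',
    '5/1/2016 11:59:55 PM',
])

# An explicit format avoids per-element format inference
parsed = pd.to_datetime(raw, format='%m/%d/%Y %I:%M:%S %p')
print(parsed)
```

Whether this is faster than reformatting the strings first has not been measured here, so it is worth timing both on the full file.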

In [None]:
heart_rate['Time'].head()

In [None]:
# add a date column
heart_rate['Date'] = heart_rate.Time.dt.date

In [None]:
# saving the file to save the format of datetime to reduce time of parsing it as DateTime type
# heart_rate.to_csv(DATA_PATH + 'heart_rate-06-Jun-21.csv', index=False)

In [None]:
# heart_rate = pd.read_csv((DATA_PATH + 'heart_rate-06-Jun-21.csv'), parse_dates=['Time'])

In [None]:
heart_rate.shape

In [None]:
total_day_used_per_user = heart_rate.groupby('Id')['Date'].nunique().sort_index()
total_day_used_per_user

In [None]:
# Multiplied by 5 on the assumption that a heart rate reading is taken every 5 secs;
#  dividing by 3600 gives the total recorded HOURS per user (despite the variable name)
avg_user_wearing_minute = heart_rate.Id.value_counts() * 5 / 3600
avg_user_wearing_minute.sort_index(inplace=True)
avg_user_wearing_minute

In [None]:
# preparing the data to calculate avg usage per day
# NOTE: the 'avg_minute' column holds each user's total recorded hours, not minutes
avg_heart_rate_hour = pd.concat([avg_user_wearing_minute, total_day_used_per_user], axis=1)
avg_heart_rate_hour.columns = ['avg_minute', 'total_days']
avg_heart_rate_hour

In [None]:
avg_heart_rate_hour['avg_hour'] = avg_heart_rate_hour['avg_minute'] / avg_heart_rate_hour['total_days']
avg_heart_rate_hour.sort_values('avg_hour', ascending=False, inplace=True)
avg_heart_rate_hour = avg_heart_rate_hour.reset_index().rename(columns={'index': 'Id'})
avg_heart_rate_hour

In [None]:
days_ge_30 = len(avg_heart_rate_hour[avg_heart_rate_hour['total_days'] >= 30])
print("Percentage of users that used the heart tracking feature for 30 days or more:")
days_ge_30 / len(avg_heart_rate_hour) * 100

In [None]:
ax = avg_heart_rate_hour['total_days'].plot(kind='bar', colormap='Paired', label='Total days used', figsize=(10, 8))
avg_heart_rate_hour['avg_hour'].plot(kind='line', label='Average hour per day')
plt.title('Average Daily Usage of Heart Rate Tracking for Every User')
plt.xlabel('User')
plt.legend();

- The numbers are estimates: the calculation assumes one heart rate reading every 5 seconds rather than every second, so the hours are approximate, although the margin of error should be within about 1 hour.
- The average usage of heart rate tracking is around 12 hours per day, with one user recording for 24 hours on every day the feature was turned on.
- Although most users averaged 12 hours or less per day, they used the feature on many days.
- In summary, of the 14 users observed, 13 turned on heart rate tracking on more than 15 days, and 6 (42.9%) did so on 30 days or more. However, most of them only used it for around 12 hours a day, which suggests that they either neglected the feature or turned it off voluntarily. Customer feedback is required to verify this.
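The 5-second assumption behind these estimates can be checked from the timestamps themselves. The sketch below uses made-up readings (the `toy` frame is hypothetical): it measures the median gap between consecutive readings for each user and uses that gap, instead of a fixed 5 seconds, to estimate recorded hours.

```python
import pandas as pd

# Hypothetical toy readings: user 1 sampled every 5 s, user 2 every 10 s
toy = pd.DataFrame({
    'Id': [1, 1, 1, 1, 2, 2, 2],
    'Time': pd.to_datetime([
        '2016-04-12 07:00:00', '2016-04-12 07:00:05',
        '2016-04-12 07:00:10', '2016-04-12 07:00:15',
        '2016-04-12 08:00:00', '2016-04-12 08:00:10',
        '2016-04-12 08:00:20',
    ]),
})

# Gap in seconds between consecutive readings of the same user
toy = toy.sort_values(['Id', 'Time'])
gap_seconds = toy.groupby('Id')['Time'].diff().dt.total_seconds()

# The median gap per user is robust to long pauses when the watch is off
median_gap = gap_seconds.groupby(toy['Id']).median()

# Estimated recorded hours = number of readings * typical gap / 3600
est_hours = toy.groupby('Id').size() * median_gap / 3600
print(median_gap)
print(est_hours)
```

If the median gap in the real data differs noticeably from 5 seconds, the hour estimates above would need to be rescaled accordingly.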

**Recommendation**: The company can promote the importance of heart rate monitoring, and mention that there have been [cases of people](https://www.nytimes.com/2021/05/20/well/live/smartwatch-heart-rate-monitor.html) being helped by a smartwatch alerting them to an unusual heart rate. This can be incorporated into the marketing strategy to showcase the smartwatch's ability to track heart rate accurately and continuously, which may contribute to a better quality of life.

# Do users wear their smartwatch the entire day generally? How many hours every day?

In [None]:
hour_step = pd.read_csv(DATA_PATH + 'hourlySteps_merged.csv', parse_dates=['ActivityHour'])
hour_step.head()

In [None]:
len(hour_step)

In [None]:
len(hour_step[hour_step['StepTotal'] == 0]) / len(hour_step)

In [None]:
# create date without hour
hour_step['Date'] = hour_step['ActivityHour'].dt.date

In [None]:
# get number of hourly records per day
hour_per_day = hour_step.groupby(['Id', 'Date']).count().reset_index()
hour_per_day

In [None]:
# .copy() so that the in-place rename below acts on an independent frame, not a slice
hour_per_day = hour_per_day[['Id', 'Date', 'ActivityHour']].copy()

In [None]:
hour_per_day.rename(columns={'ActivityHour': 'Hour'}, inplace=True)

In [None]:
# hour_per_day.to_csv(DATA_PATH + 'dailyWearingHour.csv', index=False)

In [None]:
hour_per_day.head()

In [None]:
hour_per_day.groupby('Id')['Hour'].mean().plot(kind='bar', figsize=(16, 8))
plt.title('Average Hourly Records for Every User');

- The question "How many hours of usage per day?" probably cannot be answered with the currently available data, because an hour with 0 steps does not necessarily mean that the user was not wearing the watch; the user may have been wearing it while resting. The same applies to attributes such as hourly calories burned and hourly intensities, where zero values could also mean that the user was resting.
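To make the distinction concrete, the sketch below separates the number of hourly records per day from the number of hours with at least one step. The `toy_steps` frame is made up for illustration. Hours with steps give only a lower bound on wearing time, since a worn watch can also record zero steps.

```python
import pandas as pd

# Hypothetical toy data: six hourly records for one user on one day
toy_steps = pd.DataFrame({
    'Id': [1] * 6,
    'ActivityHour': pd.Timestamp('2016-04-12') + pd.to_timedelta(list(range(6)), unit='h'),
    'StepTotal': [0, 0, 120, 0, 560, 30],
})
toy_steps['Date'] = toy_steps['ActivityHour'].dt.date

# recorded_hours counts every hourly row; active_hours counts rows with steps > 0
summary = toy_steps.groupby(['Id', 'Date']).agg(
    recorded_hours=('StepTotal', 'size'),
    active_hours=('StepTotal', lambda s: int((s > 0).sum())),
)
print(summary)
```

Here the day has 6 recorded hours but only 3 hours with steps, so neither number on its own answers how long the watch was worn.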

# For each user, on every day, what is the hour that they have the highest number of steps?

In [None]:
hour_step.head()

In [None]:
# get only the hour
hour_step['hour'] = hour_step.ActivityHour.dt.hour

In [None]:
# calculate average steps for every hour
avg_hour_step = hour_step.groupby('hour')['StepTotal'].mean()

In [None]:
avg_hour_step.head()

In [None]:
avg_hour_step.plot(kind='bar')
plt.title('Average Steps Taken for Every Hour');

- Working hours (0800 to 1700, or 8 AM to 5 PM) tend to have many steps, which is expected.
- The average number of steps peaks at around 1800 (6 PM), which is probably when most users are at their most active, for example during exercise routines.

- Let's compare this with the hourly calories burned to check.

In [None]:
hour_cal = pd.read_csv(DATA_PATH + 'hourlyCalories_merged.csv', parse_dates=['ActivityHour'])
print(hour_cal.shape)
hour_cal.head()

In [None]:
hour_cal['hour'] = hour_cal.ActivityHour.dt.hour

In [None]:
avg_hour_cal = hour_cal.groupby('hour')['Calories'].mean()

In [None]:
avg_hour_cal.plot(kind='bar')
plt.title('Average Calories Burned for Every Hour');

- This is consistent with the observation above: calories burned also peak at around 6 PM.
- It also suggests a positive correlation between steps taken and calories burned, which can be seen more clearly in the chart below.
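The visual impression of a positive correlation can be backed by a number using `Series.corr`, which computes the Pearson coefficient by default. The sketch below uses made-up hourly averages (`steps` and `cals` are hypothetical); on the real data the same call could be made on `avg_hour_step` and `avg_hour_cal`, assuming both are still indexed by hour.

```python
import pandas as pd

# Hypothetical hourly averages, not taken from the dataset
hours = [0, 6, 12, 18, 21]
steps = pd.Series([20, 150, 400, 600, 350], index=hours)
cals = pd.Series([70, 90, 115, 125, 105], index=hours)

# Pearson correlation between the two hourly profiles
r = steps.corr(cals)
print(f"Pearson r = {r:.3f}")
```

Note that this correlates hourly averages across all users, so it describes the shape of the daily profile rather than the relationship within individual records.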

In [None]:
ax = avg_hour_step.plot(label='Average Steps', figsize=(12, 8))
avg_hour_cal.plot(ax=ax, label='Average Calories Burned')
plt.vlines(x=18, ymin=0, ymax=620, colors='tab:green', label='Most Active Hour', linestyle='dashed')
plt.xticks(np.arange(0, 24, 1))
plt.title('Average Steps VS Average Calories Burned Every Hour')
plt.legend();

- **Recommendation**: Run campaigns that target this hour (around 6 PM) to promote the smartwatch's ability to track steps taken and calories burned, so that users who exercise at that time notice the value of the product.

# What are the most active hours for each user?

In [None]:
avg_hour_step = hour_step.groupby(['Id', 'hour'])['StepTotal'].mean().reset_index()
avg_hour_step.head()

In [None]:
avg_hour_step.shape

In [None]:
max_steps_indices = avg_hour_step.groupby('Id')['StepTotal'].idxmax()
max_steps_indices.head()

In [None]:
max_avg_hour = avg_hour_step.loc[max_steps_indices].reset_index(drop=True)
print(len(max_avg_hour))
max_avg_hour.head()

In [None]:
# Creating a function to use later
def get_max_avg(query_df, time_col, attr):
    avg_df = query_df.groupby(['Id', time_col])[attr].mean().reset_index()
    max_indices = avg_df.groupby('Id')[attr].idxmax()
    max_avg_df = avg_df.loc[max_indices].reset_index(drop=True)
    return max_avg_df

In [None]:
# confirm it's working
max_avg_hour = get_max_avg(hour_step, 'hour', 'StepTotal')
max_avg_hour.head()

In [None]:
# Refactored using idxmax() as shown above

# max_avg_step_per_user = avg_hour_step.groupby('Id')['StepTotal'].max()

# max_avg_hour = pd.DataFrame(columns=['Id', 'hour', 'StepTotal'])
# for user_id, max_step in max_avg_step_per_user.items():
#     max_row = avg_hour_step[(avg_hour_step['Id'] == user_id) & (avg_hour_step['StepTotal'] == max_step)]
#     max_avg_hour = max_avg_hour.append(max_row, ignore_index=True)

In [None]:
max_avg_hour.head()

In [None]:
max_avg_hour.sort_values('hour')['hour'].plot(kind='bar', figsize=(15, 8))
plt.xlabel('User')
plt.ylabel('Hour')
plt.title('Hour with the Highest Average Daily Steps for every User');

- **NOTE**: To clarify what these hours mean: each bar represents one user, and the height of the bar is the hour of the day at which that user has the highest average number of steps. The same method is applied to `Calories` and `Intensities` later.

In [None]:
max_avg_hour.hour.value_counts(sort=False).plot(kind='bar')
plt.title('Number of Users Having the Same Hour of Highest Activity')
plt.xlabel('Hour with the Max Average Number of Steps')
plt.ylabel('User count');

- This chart can be misleading depending on how it is read. It is derived from the chart above and shows the number of users that share the **same most active hour**, i.e. the hour at which each user has the highest average number of steps.
- For example, the highest bar shows that 5 users have their maximum average number of steps at 8 AM. This does not mean that most users tend to exercise at this hour; it only means that several users record their **highest average steps** at this hour, which could be ***any number of steps***. Users could still **exercise at any other hour** with fewer steps, because the number of steps is not directly tied to exercise. This will be checked against calories burned later.
- This chart is less insightful than the earlier chart of average hourly steps, which pooled all users, because it keeps only each user's peak hour instead of averaging the steps across all users for each hour.
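The difference between the two views can be shown on a tiny made-up example (the `toy` frame is hypothetical). The per-user peak hour comes from `idxmax` within each user, while the overall profile averages all users for each hour.

```python
import pandas as pd

# Hypothetical average steps per user and hour
toy = pd.DataFrame({
    'Id':        [1, 1, 1, 2, 2, 2],
    'hour':      [8, 12, 18, 8, 12, 18],
    'StepTotal': [900, 300, 800, 100, 200, 1500],
})

# View 1: each user's own peak hour (one row per user)
peak_rows = toy.loc[toy.groupby('Id')['StepTotal'].idxmax()]

# View 2: average steps per hour across all users
overall = toy.groupby('hour')['StepTotal'].mean()

print(peak_rows)
print(overall)
```

User 1 peaks at 8 and user 2 peaks at 18, so the count of peak hours is split evenly, while the pooled average clearly peaks at 18 because it also reflects how large the step counts are.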

In [None]:
# Using the same method as for steps taken as shown above
max_avg_cal_hour = get_max_avg(hour_cal, 'hour', 'Calories')
max_avg_cal_hour.head()

In [None]:
# Getting the same metrics for intensities using the same method as above
# Metrics here: for each user, the hour with the highest average total intensity
hour_intensity = pd.read_csv(DATA_PATH + 'hourlyIntensities_merged.csv', parse_dates=['ActivityHour'])
hour_intensity['hour'] = hour_intensity['ActivityHour'].dt.hour
hour_intensity.head()

In [None]:
# Using the same method as for steps taken as shown above
max_avg_int_hour = get_max_avg(hour_intensity, 'hour', 'TotalIntensity')
max_avg_int_hour.head()

In [None]:
cal_users = max_avg_cal_hour.hour.value_counts(sort=False)
steps_users = max_avg_hour.hour.value_counts(sort=False)
intensity_users = max_avg_int_hour.hour.value_counts(sort=False)

In [None]:
# combining them to plot on the same figure
users_avg_hourly = pd.concat([cal_users, steps_users, intensity_users], axis=1)
users_avg_hourly.columns = ['Calories', 'Steps', 'Intensity']
users_avg_hourly

In [None]:
users_avg_hourly.plot(kind='bar')
plt.title('Number of Users Having the Same Hour of Most Activity')
plt.xlabel('Hour with the Max Average Calories/Steps/Intensity')
plt.ylabel('User count');

- To clarify this chart: the x-axis is the hour of the day, and the y-axis is the number of users whose highest average hourly activity (calories, steps or intensity) falls in that hour.
- This chart supports the earlier observation that many users tend to exercise at around 1800 to 1900 (6 PM to 7 PM), where the counts for peak intensity and peak calories are highest, although the count for peak steps is highest at 0800 (8 AM).
- At 8 AM, the counts for calories, total intensity and steps are still quite high, so some users probably exercise at this hour as well.

# Analyzing all daily activities

In [None]:
df.head()

In [None]:
# Order the day of week
# https://stackoverflow.com/questions/47741400/pandas-dataframe-group-and-sort-by-weekday
from pandas.api.types import CategoricalDtype

cats = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
cat_type = CategoricalDtype(categories=cats, ordered=True)
df['Day'] = df['Day'].astype(cat_type)

In [None]:
df.describe()

### Checking the days where the users have zero steps/calories recorded

- This is most likely because they did not wear the watch for the entire day, or they voluntarily turned off the feature for the entire day.

In [None]:
zero_step = df[(df.TotalSteps == 0) | (df.Calories == 0)].copy()
zero_step.head()

In [None]:
zero_step.shape

In [None]:
zero_step.describe()

- On some days, these users still had a high number of calories burned even though they had zero steps, which looks odd at first. One likely reason is that Fitbit's calorie estimate includes calories burned at rest, so it stays above zero even without steps; another is that step tracking was turned off or the device was not worn.

In [None]:
# Almost every day (30 out of 31 days) contains at least one user that had zero steps
zero_step.Date.nunique()

In [None]:
# About 36% of the users (12 out of 33 users) had zero steps for at least one day
zero_step.Id.nunique()

#### How many days did they have zero steps? And what is the average number of days?

In [None]:
# Number of zero-step days for each user (one bar per user)
zero_days_per_user = zero_step.Id.value_counts().reset_index(drop=True)
zero_days_per_user.plot(kind='bar', label='Number of Days')
plt.hlines(y=zero_days_per_user.mean(), xmin=zero_days_per_user.index[0], xmax=zero_days_per_user.index[-1], color='orange', label='Average')
# Labels are set on each artist, so the legend does not depend on drawing order
plt.legend()
plt.xlabel('User')
plt.ylabel('Number of Days')
plt.title('Number of Days with Zero Step Recorded for Every User');

- These users had zero steps recorded on about 6 days each on average (the average is over the 15 users with at least one zero-step day, not over all users). This suggests that many users did not wear the smartwatch or use the feature on roughly 6 of the recorded days.

Recommendation: Promote the importance of wearing the smartwatch or turning on the step tracking feature to ensure more reliable estimation of calories burned and better suggestions such as workout schedule recommendations.

# On which days are the users most active and least active?

In [None]:
df.head()

In [None]:
# fig = plt.figure(figsize=(18, 12))

def plot_subplot(plot_df, nrows, ncols, index, title=''):
    ax = plt.subplot(nrows, ncols, index)
    # Change bar colors
    # https://stackoverflow.com/questions/3832809/how-to-change-the-color-of-a-single-bar-if-condition-is-true-matplotlib
    values = plot_df.values
    clrs = []
    for x in values:
        if x == np.min(values):
            clrs.append('tab:red')
        elif x == np.max(values):
            clrs.append('tab:orange')
        else:
            clrs.append('tab:blue')
    plot_df.plot(kind='bar', ax=ax, color=clrs, title=title)
    plt.xticks(rotation=45)
    plt.xlabel(None)
    
    # Add annotations on bars
    # https://queirozf.com/entries/add-labels-and-text-to-matplotlib-plots-annotation-examples
    for x, y in zip(np.arange(len(plot_df)), plot_df.values):

        label = "{:,.0f}".format(y)

        plt.annotate(label, # this is the text
                     (x, y), # this is the point to label
                     textcoords="offset points", # how to position the text
                     xytext=(0, -20), # distance from text to points (x,y)
                     ha='center') # horizontal alignment can be left, right or center

In [None]:
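# numeric_only=True averages only the numeric columns; observed=False keeps all weekdays
avg_by_day = df.groupby('Day', observed=False).mean(numeric_only=True)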
avg_by_day

In [None]:
cols_to_plot = ['TotalSteps', 'VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes', 'Calories']

In [None]:
fig = plt.figure(figsize=(16, 10))
nrows, ncols = 3, 2

for i, col in enumerate(cols_to_plot, start=1):
    plot_subplot(avg_by_day[col], nrows, ncols, i, title=('Average ' + col))
plt.tight_layout()
plt.show()

- These charts suggest that Tuesday is the day on which customers are most likely to wear the smartwatch and exercise: it has the highest average total steps and average active minutes, as well as the highest calories burned (although calories are relatively similar across days).
- The least active day is most likely Sunday, which has the lowest average number of steps and the lowest active minutes.

In [None]:
max_daily_cal = get_max_avg(df, 'Day', 'Calories')
max_daily_cal.head()

In [None]:
max_daily_cal.sort_values('Day', inplace=True)

In [None]:
max_daily_cal.Day.value_counts(sort=False).plot(kind='barh', figsize=(8, 8))
plt.title('Number of Users Having the Same Day of Maximum Average Calories Burned');

- This chart is further evidence that Tuesday is the day on which users are most likely to be most active.

Recommendation: Consider running campaigns on Tuesdays and Sundays to make more users aware that the smartwatch can track their most and least active days.

# Missing sleep data

In [None]:
# Inspecting some missing sleeping data
missing_sleep = df[df.isna().any(axis=1)].copy()
missing_sleep.head(10)

In [None]:
print("Number of records with missing sleep data:", len(missing_sleep))
print(f"Percentage of missing sleep data: {(len(missing_sleep) / len(df) * 100):.2f}%")

In [None]:
# number of unique dates where there are missing sleep data
missing_sleep.Date.nunique()

In [None]:
missing_sleep.Id.nunique()

- Sleep data is missing on every day, though for different users.

In [None]:
# Many of the users did not record their sleep data on more than 20 days.
# Two (2) of the users did not record any sleep data at all (31 days)
missing_sleep.Id.value_counts().plot(kind='bar');

In [None]:
missing_sleep_users = missing_sleep.Id.value_counts()
print((missing_sleep_users >= 15).sum())
print((missing_sleep_users >= 15).mean())

- 12 users (37.5%) did not record their sleep data on 15 days or more (about half a month).
- This suggests that many of the users did not like the feature, or did not want to wear a smartwatch while sleeping.

Recommendation: Our company can incorporate useful features related to sleeping patterns into its marketing strategy, to encourage customers to make more use of their sleep data.

# Analyzing sleep data

In [None]:
daily_sleep.head()

In [None]:
daily_sleep.dtypes

In [None]:
daily_sleep.nunique()

In [None]:
daily_sleep.insert(1, 'Day', daily_sleep.Date.dt.day_name())

In [None]:
daily_sleep.Date.nunique()

In [None]:
daily_sleep.TotalSleepRecords.unique()

In [None]:
daily_sleep['TotalTimeAwakeOnBed'] = daily_sleep['TotalTimeInBed'] - daily_sleep['TotalMinutesAsleep']

In [None]:
daily_sleep.describe()

In [None]:
fig, axes = plt.subplots(2, 1, sharex=True, figsize=(10,5))
for col, ax in zip(['TotalMinutesAsleep', 'TotalTimeInBed'], axes):
    sns.boxplot(data=daily_sleep, x=col, orient='h', ax=ax)
plt.xticks(np.arange(200, 750, 50));

In [None]:
400 / 60, 530 / 60

In [None]:
375 / 60, 480 / 60

- Typically, users stay in bed for around 400 to 530 minutes, which is roughly 6.7 to 8.8 hours.
- But most of them only sleep for around 375 to 480 minutes, which is 6 hours 15 minutes to 8 hours.
- Adults are advised to sleep for around 7-9 hours per day ([source](https://www.sleepfoundation.org/how-sleep-works/how-much-sleep-do-we-really-need#:~:text=National%20Sleep%20Foundation%20guidelines1,to%208%20hours%20per%20night.)), so more of the time spent in bed should ideally become sleep time rather than time spent awake or doing other things.
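The minute-to-hour conversions above, and the share of time in bed that is actually spent asleep, can be computed directly as columns. The sketch below uses made-up rows (`toy_sleep` is hypothetical), and the `HoursAsleep` and `SleepEfficiency` column names are illustrative, not part of the dataset.

```python
import pandas as pd

# Hypothetical sleep records in minutes
toy_sleep = pd.DataFrame({
    'TotalMinutesAsleep': [375, 420, 480],
    'TotalTimeInBed': [400, 460, 530],
})

# Convert minutes to hours for easier comparison with the 7-9 hour guideline
toy_sleep['HoursAsleep'] = toy_sleep['TotalMinutesAsleep'] / 60

# Fraction of the time in bed that was spent asleep
toy_sleep['SleepEfficiency'] = toy_sleep['TotalMinutesAsleep'] / toy_sleep['TotalTimeInBed']

# Quartiles, i.e. the edges of the box in a boxplot
quartiles = toy_sleep['TotalMinutesAsleep'].quantile([0.25, 0.75])
print(toy_sleep)
print(quartiles)
```

Reading the quartiles from `quantile` rather than from the plot would make the ranges quoted above reproducible.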

In [None]:
fig = plt.figure(figsize=(10, 3))
sns.boxplot(data=daily_sleep, x='TotalTimeAwakeOnBed')
plt.xticks(np.arange(0, daily_sleep['TotalTimeAwakeOnBed'].max(), 20));

- Most users spent around 20-40 minutes in bed without actually being asleep. The outliers (the dots beyond the right whisker) could be smartwatch errors, or the users may have stayed in bed for a very long time after waking up. Further data is needed to validate this.

Recommendation: The 20-40 minutes could be reduced. Campaigns can promote features such as guided meditation to help users fall asleep faster.

# Analyzing Weight Data

In [None]:
weight_log.head()

In [None]:
weight_log.Id.value_counts().plot(kind='bar', figsize=(8, 5))
plt.title('Number of Records of Weights Logged for each User');

- Most of the users did not log their weight, either manually or automatically via the smartwatch.
- Only 6 users had any weight records, and only 2 of them had more than 20 days of records; the rest had 5 or fewer records.

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(data=weight_log, x='IsManualReport')
plt.title('Number of Records Logged Manually');

- This suggests that most users did not want to use this feature, probably because they find manual logging inconvenient, or perhaps because the automatic weight logging feature was not working well. Let's check.

In [None]:
weight_log.groupby(['Id', 'IsManualReport']).count()

In [None]:
weight_log.head()

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(data=weight_log, x='Id', hue='IsManualReport', hue_order=[True, False])
plt.legend(loc='upper left')
plt.title('Number of Times the Users Manually Logged Their Weight');

- The chart shows that only 1 user relied on automatic weight logging, for 24 days, without any manual entries.
- The other 5 of the 6 users only ever recorded their weight manually.
- This supports the idea that most users find manual recording troublesome, with only one user making use of automatic weight logging. More user feedback or qualitative data is needed to understand this better.

Recommendation: Promote the smartwatch's weight logging feature, and its ability to log weight automatically and accurately, so that it can support future suggestions such as weight management advice.