In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Ask

Our client is Bellabeat, a high-tech company that manufactures a range of health-focused smart products for tracking user metrics such as stress level, activity level, menstrual cycle and sleep.

Our task is to help the client shape a marketing strategy by analysing smart device data for insights into how consumers use their devices.



# Prepare 

The data covers the daily device usage of a small number of Fitbit customers and was obtained from Kaggle.

It consists of many different CSV files containing daily, hourly or per-minute data for various metrics such as Activity, Sleep, Intensity and BMI.

In [None]:
#importing the required datasets

#daily_merged = pd.read_csv('/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
#daily_sleep = pd.read_csv('/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')

hourly_cal = pd.read_csv('/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv')
hourly_intensity = pd.read_csv('/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv')
bmi = pd.read_csv('/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv')

The data covers only ~33 users. This is too small a sample to analyse different types of users or to identify a preferable customer segment for Bellabeat.

Additionally, most of these users use the product to track their daily exercise and calorie expenditure. This restricts our analysis to only some of the features offered by Bellabeat's products.

A document defining the fields in the datasets would have let us include more features in the analysis; several were left out because their meaning was unclear.
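The user counts mentioned above can be checked per table. The following is a minimal sketch on made-up data; the `tables` dictionary and its contents are illustrative stand-ins, not the real Fitbit files.

```python
import pandas as pd

# Toy stand-ins for two of the loaded tables (Ids are invented for illustration).
tables = {
    "hourly_cal": pd.DataFrame({"Id": [1, 1, 2, 3]}),
    "bmi": pd.DataFrame({"Id": [1, 1, 3]}),
}

# Number of distinct users in each table.
users_per_table = {name: df["Id"].nunique() for name, df in tables.items()}

# Users that appear in every table, i.e. those usable in a joined analysis.
common_users = set.intersection(*(set(df["Id"]) for df in tables.values()))

print(users_per_table)
print("users present in all tables:", sorted(common_users))
```

Running the same two steps on the real DataFrames would show how many users survive once tables are combined.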

# Process

The analysis is done using the Pandas, Matplotlib and Seaborn packages in Python.

We first check whether the data is clean:

In [None]:
print(hourly_cal.count())
print('\n')
print(hourly_cal.head())

In [None]:
print(hourly_intensity.count())
print('\n')
print(hourly_intensity.head())

In [None]:
print(bmi.count())
print('\n')
print(bmi.head())

The Calories and Intensity datasets have no missing values, and a sample of the values looks reasonable on visual inspection.

The BMI data is well populated apart from one column (fat), which won't be used in the analysis.
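Beyond `count()` and `head()`, missing values and repeated keys can be counted explicitly. This is a hedged sketch on a toy table; `summarize_quality` is a hypothetical helper introduced here for illustration, and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one of the Fitbit tables (values are made up for illustration).
toy = pd.DataFrame({
    "Id": [1, 1, 2, 2, 2],
    "ActivityHour": ["4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM",
                     "4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",
                     "4/12/2016 1:00:00 AM"],
    "Calories": [81, 61, 70, 70, np.nan],
})

def summarize_quality(df, key_cols):
    """Return missing values per column and the number of repeated key rows."""
    missing = df.isna().sum()
    duplicate_keys = int(df.duplicated(subset=key_cols).sum())
    return missing, duplicate_keys

missing, duplicate_keys = summarize_quality(toy, ["Id", "ActivityHour"])
print(missing)
print("duplicate (Id, ActivityHour) rows:", duplicate_keys)
```

The same helper could be called on `hourly_cal`, `hourly_intensity` and `bmi` with the key columns appropriate to each table.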

# Analyze


In [None]:
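# Combine hourly intensity and calories. The outer join keeps unmatched rows,
# and indicator=True records where each row came from in a `_merge` column.
hourly = hourly_intensity.merge(hourly_cal, on = ['Id','ActivityHour'], how = 'outer', indicator = True)

# Parse the timestamp once, then split it into time of day and calendar date.
activity_ts = pd.to_datetime(hourly['ActivityHour'])
hourly['Time'] = activity_ts.dt.time
hourly['Date'] = activity_ts.dt.date
hourly['Id'] = hourly['Id'].astype(str)

# Average BMI per user across all of that user's weight logs.
bmi_grp =  bmi.groupby('Id').agg({'BMI':'mean'}).reset_index(drop = False)
bmi_grp['Id'] = bmi_grp['Id'].astype(str)

# Left join: users without a weight log get a missing BMI.
hourly = hourly.merge(bmi_grp, on = 'Id',how = 'left')
hourly['BMI'] = hourly['BMI'].round(3)

# Calories per unit of intensity. Hours with zero intensity yield inf or NaN,
# so the plots below filter on TotalIntensity before using this column.
hourly['CalIntRatio'] = hourly['Calories']/hourly['TotalIntensity']

# Mean hourly profile per user and time of day.
hr_grp = hourly.groupby(['Id','Time']).agg({'Calories':'mean','TotalIntensity':'mean','BMI':'mean'}).reset_index(drop = False)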


In [None]:
hr_grp['BMI'] = hr_grp['BMI'].round(3).astype(str).astype(float)
len(hr_grp.BMI.unique())
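Because the intensity and calories tables are merged with `indicator = True`, the resulting `_merge` column can confirm whether every hour was matched on both sides. A minimal sketch on toy frames follows; `left` and `right` are made-up stand-ins for the two hourly tables.

```python
import pandas as pd

# Toy stand-ins for the intensity and calories tables (values are invented).
left = pd.DataFrame({"Id": [1, 1, 2],
                     "ActivityHour": ["h0", "h1", "h0"],
                     "TotalIntensity": [5, 0, 12]})
right = pd.DataFrame({"Id": [1, 2, 2],
                      "ActivityHour": ["h0", "h0", "h1"],
                      "Calories": [80, 95, 60]})

merged = left.merge(right, on=["Id", "ActivityHour"], how="outer", indicator=True)

# Count rows found in both tables versus only one of them.
counts = merged["_merge"].value_counts()

# Rows that failed to match would carry NaN in the columns of the missing side.
unmatched = merged[merged["_merge"] != "both"]

print(counts)
print(unmatched)
```

If the real merge shows only `both`, the outer join behaves like an inner join and no hours are lost or padded with missing values.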

# Share

1. Calories burnt are directly related to Total Intensity, with no visible delay.

The graph below shows the average calories burnt and the average intensity of activity throughout the day across the users in the study, with a separate panel for each BMI value.

In [None]:
fig, axes = plt.subplots(4,2, sharex = True, sharey = True)
i = 0
for index, grp in hr_grp.groupby('BMI'):
    sns.lineplot(x = 'Time', y = 'Calories', data = grp,  ax = axes[int(i/2)][int(i%2)], ci = False)
    axes[int(i/2)][int(i%2)].set_ylabel('')
    axes2 = axes[int(i/2)][int(i%2)].twinx()
    axes2.set_ylim(0,75)
    sns.lineplot(x = 'Time', y = 'TotalIntensity', data = grp, ax = axes2, color = 'red', ci = False)
    axes2.set_ylabel('')
    i = i+1

fig.text(0.96, 0.5, 'Intensity', va='center', rotation='vertical')
fig.text(0.04, 0.5, 'Calories', va='center', rotation='vertical')

As the plot above shows, calories burnt track the total intensity of the activity closely, hour by hour.
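The "no visible delay" reading of the plot can also be checked numerically by correlating calories with intensity from the same hour and from earlier hours. The sketch below uses synthetic data in which calories depend only on same-hour intensity; `lagged_correlation` is a hypothetical helper, and the numbers are generated, not taken from the Fitbit files.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for two users; calories are driven by same-hour intensity.
rng = np.random.default_rng(0)
n_hours = 48
frames = []
for user_id in ["A", "B"]:
    intensity = rng.integers(0, 60, size=n_hours)
    calories = 60 + 1.5 * intensity + rng.normal(0, 3, size=n_hours)
    frames.append(pd.DataFrame({"Id": user_id,
                                "Hour": np.arange(n_hours),
                                "TotalIntensity": intensity,
                                "Calories": calories}))
toy = pd.concat(frames, ignore_index=True).sort_values(["Id", "Hour"])

def lagged_correlation(df, lag):
    """Pearson correlation between Calories and TotalIntensity `lag` hours earlier."""
    # Shift within each user so one user's hours never leak into another's.
    shifted = df.groupby("Id")["TotalIntensity"].shift(lag)
    return df["Calories"].corr(shifted)

lag_corr = {lag: lagged_correlation(toy, lag) for lag in range(3)}
print(lag_corr)
```

On the real `hourly` table, a correlation that is highest at lag 0 and drops at lags 1 and 2 would support the claim that calories respond to intensity within the same hour.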

Given this, we would like to see how this relation varies across BMI, both quantitatively and qualitatively.

In the chart below, the calories burnt at the same intensity vary by BMI and appear to peak at a BMI of approximately 25.

In [None]:
sns.scatterplot(x = 'TotalIntensity', y = 'Calories', data = hourly, hue = 'BMI', palette=sns.color_palette("Set2", hourly.BMI.nunique())) 

The points in the chart above spread out at larger intensities. This suggests that the ratio (Calories burnt / Intensity of activity) is not constant throughout.

To check this, we conduct the two analyses below.

In the plot below, the size of each scatter point (the Calories burnt per Intensity metric) is larger at lower intensities across all BMI values. However, a BMI of about 25 seems to have the most Calories burnt per Intensity at all intensities.

In [None]:
sns.scatterplot(data = hourly[hourly.TotalIntensity > 5], x = 'BMI', y = 'TotalIntensity', size = 'CalIntRatio', sizes = (10,200))

The plot below shows more clearly that the calories burnt per intensity are higher at lower intensities, and that the ratio depends strongly on a person's BMI.

In [None]:
sns.scatterplot(data = hourly[hourly.TotalIntensity > 5], x = 'TotalIntensity', y = 'CalIntRatio', hue = 'BMI', palette = sns.color_palette('Dark2',hourly[hourly.TotalIntensity > 5].BMI.nunique()))
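The trend seen in the scatter plots can also be tabulated by averaging the ratio within intensity bands. This is a rough sketch on invented values; the band edges in `bins` are arbitrary choices made for illustration.

```python
import pandas as pd

# Toy hourly records (values are made up for illustration).
toy = pd.DataFrame({
    "TotalIntensity": [6, 8, 15, 18, 40, 45, 90, 100],
    "Calories": [70, 78, 95, 100, 150, 160, 260, 280],
})
toy["CalIntRatio"] = toy["Calories"] / toy["TotalIntensity"]

# Group intensities into bands; the edges are arbitrary and can be tuned.
bins = [5, 10, 20, 50, 180]
toy["IntensityBand"] = pd.cut(toy["TotalIntensity"], bins=bins)

# Mean calories per unit of intensity within each band.
ratio_by_band = toy.groupby("IntensityBand", observed=True)["CalIntRatio"].mean()
print(ratio_by_band)
```

Adding `BMI` to the `groupby` keys on the real data would give one such table per BMI value, which makes the comparison across BMI easier to read than point sizes.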

# Act

Given the above analysis, I would recommend the following to the client:
- Given the direct correlation between intensity of activity and calories burnt, the client can use this in its marketing to acquire users who want to attain and maintain fitness.
- Given that the Calories burnt per intensity follows a trend for each BMI, the client can provide the estimated intensity required to burn a given number of calories for the user's BMI. This can be a selling point, as their devices help users keep track of the intensity of the activity they perform.
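One way the second recommendation could be prototyped is to fit a straight line of calories against intensity for each BMI value and invert it for a calorie target. This is only a sketch under a strong assumption (a linear relation per BMI); `fit_lines` and `intensity_for_target` are hypothetical helpers, and the data is invented.

```python
import numpy as np
import pandas as pd

# Toy hourly records for two BMI values (numbers are made up for illustration).
toy = pd.DataFrame({
    "BMI": [22.0] * 4 + [25.0] * 4,
    "TotalIntensity": [0, 20, 40, 60] * 2,
    "Calories": [60, 100, 140, 180, 70, 120, 170, 220],
})

def fit_lines(df):
    """Fit Calories = slope * TotalIntensity + intercept for each BMI value."""
    lines = {}
    for bmi_value, grp in df.groupby("BMI"):
        slope, intercept = np.polyfit(grp["TotalIntensity"], grp["Calories"], deg=1)
        lines[bmi_value] = (slope, intercept)
    return lines

def intensity_for_target(lines, bmi_value, target_calories):
    """Invert the fitted line to estimate the intensity needed for a calorie target."""
    slope, intercept = lines[bmi_value]
    return (target_calories - intercept) / slope

lines = fit_lines(toy)
needed = intensity_for_target(lines, 25.0, 150)
print(f"estimated hourly intensity for 150 calories at BMI 25: {needed:.1f}")
```

On this toy data the fitted line for BMI 25 has slope 2.5 and intercept 70, so the estimate works out to 32 (up to floating-point rounding). With only a handful of users per BMI value in the real data, such fits would be indicative at best.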


A better analysis could be performed if the following additions were made to the current data:
- Include more users so as to perform a user segmentation analysis to identify the best suited customer and their habits.
- Include data for more months, which would help us check for seasonality in the use of fitness products. This could be used to find the right time to promote the product, when users are most motivated to pursue goals such as getting fitter.
- Include more users so as to find better relations between users' sleep patterns and activities. The low number of users is a roadblock, since we can't generalize about sleep from so few people. We have sleep data for ~50% of the users in the dataset, hence ~15-17 users, which is a low number because sleep patterns tend to depend on factors not captured in the data, such as stress levels. Additionally, a few fields in the sleep data need explanation so that we know how they can be used in the analysis.
- Include data on other features captured by the fitness devices, such as menstrual cycles, stress levels and water consumption.