In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
#-------------------------------------------------------------------------------------------------------
%config Completer.use_jedi = False

# Business Task
To discover trends in smart device usage and help Bellabeat's marketing team come up with an effective marketing strategy for the company based on the findings.
### Stakeholders
* Co-Founder and Chief Creative Officer, Urška Sršen
* Co-founder and executive member, Sando Mur
* Bellabeat's marketing analytics team


# About Bellabeat
Bellabeat is a smart device manufacturer founded in 2013. It is a tech-driven wellness company for women, with offices around the world. The company offers a range of products, including:
* Bellabeat App - connects to the company's lines of wellness products and provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits.
* Leaf - a bracelet that tracks the user's sleep, activity and stress
* Time - a smart watch that tracks the user's sleep, activity and stress
* Spring - a smart water bottle that tracks the user's daily water intake
* Membership - gives users 24/7 access to fully personalized health guidance based on their lifestyle and goals

# About the Dataset
The dataset analyzed in this study is the public dataset titled '[FitBit Fitness Tracker Data](https://www.kaggle.com/arashnic/fitbit)', made available by Mobius. The data was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016 (March 12 to May 12, 2016). Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
### Limitations of the Dataset:
* Small sample size of 30 users.
* The data was collected over a span of only 2 months, which is too short a period to uncover substantial long-term trends.
* The data is not recent; it was collected in 2016.
* The dataset does not come with metadata or a description, so we may need to invest some time in understanding the variables and their relationships.

### Characteristics of the Dataset
* Long format
* All data tables are related by ID attribute. 
* Calories burnt, steps, intensity, heart rate, sleep and weight logs are the metrics included in the dataset.
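
Because every table carries the same Id column, tables can be joined on it. The following is a minimal sketch on made-up rows: only the shared Id key mirrors the dataset, while the other column names and all values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for two of the tables; only the shared 'Id' key mirrors the dataset.
activity = pd.DataFrame({
    'Id': [1, 1, 2],
    'Date': pd.to_datetime(['2016-04-12', '2016-04-13', '2016-04-12']),
    'Steps': [8000, 10500, 4300],    # hypothetical values
})
sleep = pd.DataFrame({
    'Id': [1, 2],
    'Date': pd.to_datetime(['2016-04-12', '2016-04-12']),
    'MinutesAsleep': [410, 365],     # hypothetical values
})

# A left join keeps every activity row; days without a sleep record get NaN.
merged = activity.merge(sleep, on=['Id', 'Date'], how='left')
print(merged)
```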



# Objective 
This analysis focuses on the daily and hourly trends in the usage data of smart fitness device users. It will help Bellabeat provide its members with updated health guidance at the right times. Specifically, I have attempted to address the following questions:
1. How does the intensity level of users generally vary throughout the 24 hours of a day? Are there any common patterns which apply to the daily user activity in general?
2. Is the daily activity level of users related to the day of the week? In other words, is user activity typically higher on some days of the week than on others?
3. Is user engagement higher on some days of the week as compared to others?

# Preparing the Data
For our analysis, we will specifically need data from dailyActivity_merged.csv, hourlyIntensities_merged.csv and sleepDay_merged.csv. Let's prepare the data from these tables so that we can analyse them later.
### Preparing the Daily Activity Data
The dataset stores the daily activity data of the users in dailyActivity_merged.csv. This file contains information about the daily calories, steps and intensity levels of the users. Here is a preview of the data:

In [None]:
# reading the data from the csv file into pandas dataframe
daily_activity = pd.read_csv('/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv') # reading the csv file
daily_activity.head() # data preview

Next, we will get information about the general structure of our dataset:

In [None]:
daily_activity.info() # getting to know the dataset

The daily_activity dataset has 940 entries and 15 columns. We see that all the variables have non-null entries in all the 940 rows. So, our data has no missing entries. Besides, all the columns other than ActivityDate are in a proper integer or floating-point format. We can convert ActivityDate to a datetime format as follows to ensure consistent formats in our analysis:

In [None]:
daily_activity['ActivityDate']= pd.to_datetime(daily_activity['ActivityDate']) 

Let's check the exact number of user IDs included in the dataset:

In [None]:
daily_activity.Id.nunique()

This dataset has data for 33 Fitbit users, slightly more than the thirty mentioned in the dataset description. Moving further, we must remove any duplicate rows from our dataset so that they do not bias our analysis.

In [None]:
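daily_activity = daily_activity.drop_duplicates() # assigning the result so that duplicates are actually removed
daily_activity.shape # the row count stays at 940 if nothing was dropped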

We see that no rows were dropped, which means that our dataset did not have any duplicate entries. The dataset is now ready to be used.
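
Note that `drop_duplicates()` returns a new dataframe rather than modifying the original, so counting the duplicates explicitly makes the check unambiguous. Here is a minimal sketch on a toy frame (the rows are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy frame with one exact duplicate row (values are made up).
df = pd.DataFrame({
    'Id': [1, 1, 2, 2],
    'ActivityDate': ['4/12/2016', '4/13/2016', '4/12/2016', '4/12/2016'],
    'TotalSteps': [8000, 10500, 4300, 4300],
})

n_duplicates = df.duplicated().sum()   # rows identical to an earlier row
deduped = df.drop_duplicates()         # returns a new frame; df itself is unchanged

print(n_duplicates, len(df), len(deduped))
```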

### Preparing the Hourly Intensity Data
The hourly intensities, with date and time information, are stored for all users in hourlyIntensities_merged.csv. Let us take a look at this dataset:

In [None]:
hourly_intensities = pd.read_csv('/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv')
hourly_intensities.head() 

In [None]:
hourly_intensities.info()

This shows that the hourly_intensities dataset has no null values. Let us drop duplicate rows, if there are any, from the dataset:

In [None]:
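hourly_intensities = hourly_intensities.drop_duplicates() # assigning the result so that duplicates are actually removed
hourly_intensities.shape # an unchanged row count means there were no duplicates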

We see that no rows were dropped. This implies that there were no duplicates in our data. Now, we will convert the data in the ActivityHour column to a datetime format:

In [None]:
hourly_intensities['ActivityHour']= pd.to_datetime(hourly_intensities['ActivityHour'])

The number of users whose data is contained in this dataset is:

In [None]:
hourly_intensities.Id.nunique()

# Processing the Data
Beginning with our first question, we want to see whether there are some hours of the day in which a typical user is generally more active than in others. Presently, our dataset has the time and date merged into one column. For our present use case, we specifically want the hourly information. Therefore, we will extract the hour from the ActivityHour column:

In [None]:
# extracting the hour information from datetime object
hourly_intensities['Hour'] = hourly_intensities['ActivityHour'].dt.hour 
hourly_intensities['Hour'].head()

Next, we find the average of the total intensities of all users for each hour of the day, which lets us identify the hourly trend in intensity levels. The average intensity per hour can be computed as follows:

In [None]:
# aggregating information based on hour of the day for all entries in hourly_intensities dataframe
grouped_by_hour = hourly_intensities.groupby('Hour')['TotalIntensity'].sum().reset_index()

# function for calculating the number of entries per hour in the dataframe
def find_divisor(grouped_series, reqd_colname, orig_df):
    index_list=[] # to store the index information
    number_per_index=[] # to store the number of entries per index
    for entry in grouped_series[reqd_colname]:
        index_list.append(entry)
        number_per_index.append(len(orig_df[orig_df[reqd_colname]==entry]))
    return pd.Series(number_per_index, index=index_list) 

#  Series with number of entries per hour
entries_per_hour = find_divisor(grouped_by_hour, 'Hour', hourly_intensities) 

In [None]:
# function for dividing grouped series with number of entries per index
def find_average(dividend_series, dividend_colname, divisor_series):
    avg_series = dividend_series[dividend_colname].divide(other = divisor_series, axis = 0)
    return avg_series

# Series with average activity level of users per hour
activity_per_hour = find_average(grouped_by_hour, 'TotalIntensity', entries_per_hour)

# dictionary holding each hour and its corresponding average activity level
temp_frame = {'Hour': activity_per_hour.index.tolist(), 'AvgIntensity' : activity_per_hour}

# building the average_hourly_intensity dataframe and saving it as a csv file
average_hourly_intensity = pd.DataFrame(temp_frame)
average_hourly_intensity.to_csv('AverageHourlyIntensity.csv')
average_hourly_intensity.head()

# Analyzing the Data
Now, we can observe the trend in user intensity level throughout the 24 hours of a day. Below is a line chart drawn with matplotlib, with the hour of the day on the x-axis and the average intensity on the y-axis, using the average_hourly_intensity dataframe created above:

In [None]:
import matplotlib.pyplot as plt
# assigning data to be shown along each dimension
x = average_hourly_intensity['Hour']
y = average_hourly_intensity['AvgIntensity']
plt.plot(x,y) # plotting the data
plt.title('Average Activity Level Over 24 Hours') # giving a title to our plot
#naming the axes
plt.xlabel('Hour of the Day')
plt.ylabel('Average Activity Level of a User')
plt.show() # show the plot

To obtain a more interactive visualization, we can download the output file 'AverageHourlyIntensity.csv' and plot a similar line chart in Tableau. The resulting graph is embedded below:

In [None]:
%%HTML 
<div class='tableauPlaceholder' id='viz1623582488949' style='position: relative'><noscript><a href='#'><img alt='Average Activity Level over 24 Hours ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Av&#47;AverageIntensityLeveloveraDay&#47;Sheet1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='AverageIntensityLeveloveraDay&#47;Sheet1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Av&#47;AverageIntensityLeveloveraDay&#47;Sheet1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1623582488949');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

### Sharing the Observations from Above Analysis
The activity level of the users starts rising at around 4 AM and drops again after 8 PM, as expected given people's general day and night activity patterns. The line graph gives us two other important insights:
1. The activity level dips a little in the afternoon, at around 3 PM. This suggests that people tend to be more sedentary during those hours of the day.
2. Activity peaks in the evening, at around 6 PM. This may correspond to the commute of office-goers or to people's workout schedules. At present, we cannot conclude anything about the reason behind this peak, since we do not know the exact composition of the users included in the dataset.
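
The readings taken from the chart can be cross-checked numerically. The sketch below uses synthetic hourly data in place of hourly_intensities (the numbers are invented; only the column names Hour and TotalIntensity follow the notebook). It also shows that `groupby(...).mean()` gives the same per-hour average as dividing the hourly sums by the hourly counts:

```python
import pandas as pd

# Synthetic stand-in for hourly_intensities: two users, four hours each.
toy = pd.DataFrame({
    'Id':             [1, 1, 1, 1, 2, 2, 2, 2],
    'Hour':           [3, 12, 15, 18, 3, 12, 15, 18],
    'TotalIntensity': [0, 20, 14, 30, 2, 24, 10, 26],
})

by_hour = toy.groupby('Hour')['TotalIntensity']

# Sum divided by count, as in the step-by-step approach used earlier ...
avg_manual = by_hour.sum() / by_hour.size()

# ... matches the built-in mean.
avg_intensity = by_hour.mean()

peak_hour = avg_intensity.idxmax()      # hour with the highest average
quietest_hour = avg_intensity.idxmin()  # hour with the lowest average
print(avg_intensity)
print('peak:', peak_hour, 'quietest:', quietest_hour)
```
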
### Next Steps
Moving forward, we can find out which day(s) of the week are the busiest and which see the least user activity. To do so, let's add columns for the date and the corresponding weekday of each entry in the hourly_intensities dataframe. First, we extract the date from the 'ActivityHour' column of the hourly_intensities dataframe and store it in a new column, 'Date'.

In [None]:
hourly_intensities['Date'] = pd.to_datetime(hourly_intensities['ActivityHour'].dt.date)
hourly_intensities['Date'].head()

Next, we find the weekday corresponding to each date in the 'Date' column and compute the average intensity per weekday as follows:

In [None]:
# code to convert each date to its corresponding weekday
hourly_intensities['Day'] = hourly_intensities['Date'].dt.day_name()

# grouping the entries according to weekdays with total intensity values per week day
grouped_by_day = hourly_intensities.groupby('Day')['TotalIntensity'].sum()
weekday_intensity = pd.DataFrame({'Day':list(grouped_by_day.index.values), 'TotalIntensity': grouped_by_day})

# series to store number of entries per week day
entries_per_day = find_divisor(weekday_intensity, 'Day', hourly_intensities)

# average activity level per week day
temp_avg_intensity = find_average (weekday_intensity, 'TotalIntensity', entries_per_day)

# storing the results in a dataframe with desired scheme
temp_frame = pd.DataFrame({'Weekday': temp_avg_intensity.index.tolist(), 'AvgIntensity': temp_avg_intensity}).reset_index(drop=True)
# converting the dataframe to a csv file
temp_frame.to_csv('IntensitiesWeekday.csv')

### Sharing the Observations from Above Analysis
Now, we can visualize the trend for average activity level of a user over the seven days of a week. Below is a matplotlib bar chart with the weekdays on the x-axis and the average intensity level on the y-axis:

In [None]:
day = temp_frame['Weekday']
avg_int = temp_frame['AvgIntensity']
 
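# Figure size
fig = plt.figure(figsize=(10, 7))

# Vertical bar plot of average intensity per week day
plt.bar(day, avg_int)
plt.title('Average Activity Level Over a Week')
plt.xlabel('Week Day')
plt.ylabel('Average Activity Level of a User')

# Show plot
plt.show()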

For a more interactive graph in Tableau, we can download the output file 'IntensitiesWeekday.csv' obtained above and plot the data. Here's the visualization obtained in Tableau:

In [None]:
%%HTML
<div class='tableauPlaceholder' id='viz1623573791361' style='position: relative'><noscript><a href='#'><img alt='Average Activity Level Over a Week ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Bo&#47;Book2_16235736520540&#47;Sheet1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Book2_16235736520540&#47;Sheet1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Bo&#47;Book2_16235736520540&#47;Sheet1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1623573791361');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

We see that users are generally least active on Sunday and most active on Saturday. Two observations follow:
1. A possible explanation is that people spend Saturdays on outings and relax on Sundays (applicable to people who follow a 5-day workweek). However, we do not have any demographic information about our sample population to confirm this hypothesis.
2. Among the 5 working days of the week, Tuesday is the busiest day, followed by Friday. Wednesday is the least active of the 5 workdays.
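
One caveat when reading weekday results: `groupby('Day')` sorts the day names alphabetically, so Friday comes first. Below is a minimal sketch of putting the days in calendar order with an ordered categorical. The values are invented, and the Monday-first order is an assumption about the desired presentation:

```python
import pandas as pd

# Toy per-row data; the column names follow the notebook, the values are invented.
toy = pd.DataFrame({
    'Day': ['Saturday', 'Monday', 'Sunday', 'Monday', 'Saturday', 'Wednesday'],
    'TotalIntensity': [30, 10, 5, 14, 26, 9],
})

week_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
              'Friday', 'Saturday', 'Sunday']

# An ordered categorical makes groupby sort by calendar position, not alphabetically.
toy['Day'] = pd.Categorical(toy['Day'], categories=week_order, ordered=True)

# observed=True leaves out weekdays that never occur in the toy data.
avg_by_day = toy.groupby('Day', observed=True)['TotalIntensity'].mean()
print(avg_by_day)
```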

### Next Steps
Next, we need to find out whether users consistently skip logging their activities on some days. This will help us determine whether the lower activity level on some days is actually due to low activity at the user's end or due to users not logging their activity. For this analysis, we go back to our daily_activity dataframe and add a column for weekdays:

In [None]:
# Finding week days corresponding to dates
daily_activity['Day'] = daily_activity['ActivityDate'].dt.day_name()

# aggregating logging activity of users for each week day
act_temp = daily_activity.groupby('Day')['LoggedActivitiesDistance'].sum()
logact_day = pd.DataFrame({'Day': list(act_temp.index.values), 'LoggedActivitiesDistance': act_temp}) 

# counting entries for each week day
count_day = find_divisor(logact_day, 'Day', daily_activity)

# finding average logged activity level per week day
temp_avg = find_average(logact_day, 'LoggedActivitiesDistance', count_day)

# storing required data in a dataframe with a desired scheme
temp_fr = pd.DataFrame({'Weekday': temp_avg.index.tolist(), 'AvgLoggedDistance': temp_avg}).reset_index(drop=True)
# converting dataframe to a csv file
temp_fr.to_csv('LoggedActivitiesWeekday.csv')

### Sharing the Observations from Above Analysis
The following matplotlib plot of average logged distance versus weekday reveals an important insight:

In [None]:
import matplotlib.pyplot as plt
# assigning data to be shown along each dimension
x = temp_fr['Weekday']
y = temp_fr['AvgLoggedDistance']
plt.plot(x,y) # plotting the data
plt.title('Average Logged Distance Per Week Day') # giving a title to our plot
#naming the axes
plt.xlabel('Week Day')
plt.ylabel('Average Logged Distance')
plt.show() # show the plot

Plotting the same data from the output file 'LoggedActivitiesWeekday.csv' obtained above results in the following graph in Tableau:

In [None]:
%%HTML
<div class='tableauPlaceholder' id='viz1623578323612' style='position: relative'><noscript><a href='#'><img alt='Logged Activity Per Week Day ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Bo&#47;Book1_16235779884360&#47;Sheet1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Book1_16235779884360&#47;Sheet1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Bo&#47;Book1_16235779884360&#47;Sheet1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1623578323612');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

Since we get zero logged distances by users on the weekends, it is a good idea to download the daily_activity dataframe as a csv file and inspect it in an Excel workbook to check whether the graph is consistent with the actual data. 
After checking, I found that the data was consistent with this finding. This gap in weekend logging among smart device users provides Bellabeat with a huge opportunity to channel its marketing efforts in this direction.
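
The same consistency check can also be done in pandas rather than Excel. Here is a minimal sketch on invented rows (the column names follow daily_activity, but the values are not from the dataset), counting per weekday how many rows have a non-zero logged distance:

```python
import pandas as pd

# Invented rows standing in for daily_activity.
toy = pd.DataFrame({
    'Day': ['Monday', 'Monday', 'Saturday', 'Saturday', 'Sunday'],
    'LoggedActivitiesDistance': [2.1, 0.0, 0.0, 0.0, 0.0],
})

# True where some logged distance was recorded for that row.
toy['HasLog'] = toy['LoggedActivitiesDistance'] > 0

# Per weekday: total number of rows and number of rows with a logged distance.
summary = toy.groupby('Day')['HasLog'].agg(rows='size', rows_with_log='sum')
print(summary)
```

A weekday whose rows_with_log is 0 has no logged distance at all in the data, which is what a flat zero in the chart would correspond to.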

# Conclusions
1. Saturdays, followed by Tuesdays, are the two days of the week on which users generally engage in the most intense activity.
2. Users consistently skip logging their activity distance on weekends (i.e., Saturdays and Sundays). They are most likely to log their distances on Mondays.
3. User activity rises overall from 4 AM to 12 PM and dips in the afternoon, reaching a local low at around 3 PM. It then rises again and reaches its maximum at around 6 PM.

# Recommendations and Future Scope
My recommendations to Bellabeat's marketing team are two-fold:
1. The company can start 'Weekend Engagement Programmes' so that users don't get disconnected from their health routines on weekends. 
2. In the sedentary hours of the afternoon, users can be reminded to take a short break from their work and leave their seats to cut down long sitting hours. At the same time, notifications and triggers can be kept to a minimum from 10 AM to 12 PM and from 5 PM to 7 PM to avoid disturbing the user. This intelligent alerting system can be marketed to attract potential customers and to delight existing ones.

This dataset can also be used in future to gain further insights into the following relevant questions:
1. How does a user's sleep on a particular day affect their activity and heart rate on the next day?
2. Do the activity level trends over the 24 hours of the day differ between working days and weekends?
3. Do higher calories burnt necessarily correlate with a healthy heart rate?
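
For the second question, one possible starting point is to split the hourly averages into working days and weekends. This is a sketch on synthetic data, not a result: the DayType column and the toy timestamps are assumptions introduced for illustration.

```python
import pandas as pd

# Synthetic hourly records: one Friday and one Saturday, two hours each.
toy = pd.DataFrame({
    'ActivityHour': pd.to_datetime([
        '2016-04-15 09:00', '2016-04-15 18:00',   # Friday
        '2016-04-16 09:00', '2016-04-16 18:00',   # Saturday
    ]),
    'TotalIntensity': [12, 30, 20, 16],
})

toy['Hour'] = toy['ActivityHour'].dt.hour

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
is_weekend = toy['ActivityHour'].dt.dayofweek >= 5
toy['DayType'] = is_weekend.map({True: 'Weekend', False: 'Weekday'})

# Rows: hour of the day; columns: Weekday vs Weekend average intensity.
profile = toy.pivot_table(index='Hour', columns='DayType',
                          values='TotalIntensity', aggfunc='mean')
print(profile)
```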
